At the moment, Ralf’s German speech model 0.1.3 is installed on my computer (as a user-generated model). I would like to increase the vocabulary size from 18000 words to 300000 words, so I am trying the following: I import Ralf's German dictionary 0.1.9.9. This will of course create a lot of duplicate entries in the active vocabulary.
This is what I am doing after the import of the dictionary:
- add terminals to the grammar (Adjektiv, Substantiv, Verb),
- press the Synchronize button.
So this is the situation at the moment:
Ralf's German speech model 0.1.3 contains 18000 words (it has been trained with 18000 audio files). I want to see how many additional words I have to train to cover all words that are contained in Ralf's German dictionary.
Then the following error message appears:
I could use a hint on how to increase the vocabulary size without having to train all the words. Do you have any suggestions?
Edit: I try the following:
- clear the whole active vocabulary,
- import Ralf's German dictionary 0.1.9.9 again (as active vocabulary),
- disconnect simon,
- restart ksimond,
- press the simon connect button,
- press the Synchronize button.
Now I have to wait a few minutes or so. The computer reacts pretty slowly because obviously simon is using a lot of processing capacity.
It was possible to compile the speech model. Now I press the Activate button. Then the following error message appears:
The recognition reported the following error:
The recognition could not be started because your model contains words that consist of sounds that are not covered by your acoustic model. You need to either remove those words, transcribe them differently or train them.
Warning: The latter will not work if you are using static base models!
This could also be a sign of a base model that uses a different phoneme set than your scenario vocabulary.
The following words are affected (list may not be complete):
abelschen, abelschen, abonnierbaren, abonnierbaren, abonnierende, abonnierende, abonnierendem, abonnierendem, abonnierender, abonnierender, abonnierendes, abonnierendes, adressatengerechten, adressate…
The following phonemes are affected (list may not be complete):
*-C+e or biphone C+e, *-EI+ts or biphone EI+ts, @-l+gls, @-ts+E, C-a:+R, C-e+m, C-n=+s, E-C+n=, E-N+N=, E-S+n=, E-b+m=, E-k+N=, E-p+m=, E-x+t, E:-ah+t, E:-d+n=, E:-f+m=, E:-f+t, E:-g+N=, EI-C+k, EI-ts…
It would be interesting to get a complete list of all words that are missing. Then I could train or remove these words efficiently.
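Until a complete list is available, the truncated error message can at least be condensed. The following is only a minimal sketch, not part of simon: it assumes that the two comma-separated lists are pasted in as plain strings, that entries ending in "…" are cut off and should be skipped, and that the phoneme entries are written as left-center+right triphones. The function name `summarize_report` is made up for this example.

```python
import re

# Assumed triphone notation: left context, "-", center phoneme, "+", right context.
TRIPHONE = re.compile(r"([^-+]+)-([^-+]+)\+([^-+]+)")


def split_items(text):
    """Split a comma-separated list and skip entries cut off by '…'."""
    items = [item.strip() for item in text.split(",")]
    return [item for item in items if item and not item.endswith("…")]


def summarize_report(words_text, phonemes_text):
    """Return (sorted unique words, sorted unique (left, center, right) tuples)."""
    words = sorted(set(split_items(words_text)))

    triphones = set()
    for item in split_items(phonemes_text):
        # Entries like "*-C+e or biphone C+e" list an alternative; keep the first form.
        name = item.split(" or ")[0]
        match = TRIPHONE.fullmatch(name)
        if match:
            triphones.add(match.groups())
    return words, sorted(triphones)


if __name__ == "__main__":
    # Shortened excerpts from the error message above.
    words_text = "abelschen, abelschen, abonnierbaren, abonnierbaren, adressate…"
    phonemes_text = "*-C+e or biphone C+e, E-C+n=, E:-ah+t, EI-ts…"
    words, triphones = summarize_report(words_text, phonemes_text)
    print(len(words), "unique words:", words)
    for left, center, right in triphones:
        print(f"untrained context: {left} [{center}] {right}")
```

This removes the duplicate entries (each word appears twice after the second import) and shows which phoneme contexts have no training data, which may help to decide which words to train first.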
You can use sam to not suffer from the automatic removal of untrained words.
But I think improving the accuracy is far more important before continuing to increase the dictionary size.
I tried your latest model yesterday and could only get it to recognize one or two of all the words I tried. Everything else wasn’t even close…
The vocabulary is just too big for that amount of training data.
Regards,
Peter
“I tried your latest model yesterday and could only get it to recognize one or two of all the words I tried.” – OK. I think that your dialect is too different from my dialect.
I want to build a prototype that works with my own voice. I will upload a video that proves that my speech model works with my own voice.
“The vocabulary is just too big for that amount of training data.” – Can you prove your theory?
OK, I think that you are right. I tried it with a vocabulary of 300000 words (and 18000 training wav samples). The recognition rate is not acceptable. I need more training wav samples, or a reduced vocabulary size.
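A rough back-of-the-envelope check, using only the numbers from this thread (18000 training wav samples, 18000 versus 300000 words), shows how thin the training data becomes. The helper name `samples_per_word` is made up for this sketch, and the ratio is only a crude indicator, not a measure of recognition accuracy.

```python
def samples_per_word(training_samples, vocabulary_size):
    """Average number of training samples available per vocabulary word."""
    return training_samples / vocabulary_size


if __name__ == "__main__":
    training_samples = 18_000
    for label, vocabulary_size in [
        ("speech model 0.1.3", 18_000),
        ("after importing the dictionary", 300_000),
    ]:
        ratio = samples_per_word(training_samples, vocabulary_size)
        print(f"{label}: {ratio:.2f} samples per word")
```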
Well sadly it wasn’t just me… Two other people tried and they couldn’t get it to recognize anything either.
“Can you prove your theory?”
I could probably remove words until the recognition starts to work…
Also, a complete list of affected words can be found in the details section…
Regards,
Peter
“sadly it wasn’t just me… Two other people tried and they couldn’t get it to recognize anything either” – Thanks for the feedback.
“a complete list of affected words can be found in the details section” – yes, this list helped me a lot.