Archive for the ‘General’ Category
Real name
Thursday, February 17th, 2011
In the context of speech recognition, I use the pseudonym “Ralf Herzog”. In the future, I will use my real name “Kai Schott”, too. E.g. instead of Ralf's Latin IPA FLAC files, I use the expression Schott's Latin IPA FLAC files. Maybe I will name the next version of my German PLS dictionary “Schott’s German dictionary” (instead of Ralf's German dictionary).
German: how to pronounce ‘Loveparade’
Tuesday, September 28th, 2010
Take a look into german-dictionary.xml:
<lexeme role="Substantiv">
<grapheme>Loveparade</grapheme>
<phoneme>lavpɛʀɛɪ̯t</phoneme>
</lexeme>
When a German speaker says the word ‘Loveparade’, how would he most likely pronounce the ‘r’? According to Wikipedia, there are nine different ‘r’ phonemes available:
Alveolar trill [r] · Alveolar approximant [ɹ] · Alveolar tap [ɾ] · Alveolar lateral flap [ɺ] · Retroflex approximant [ɻ] · Retroflex flap [ɽ] · Uvular trill [ʀ] · Voiced uvular fricative [ʁ] · Labialized [ʋ]
Should I replace the ʀ by a different ‘r’-phoneme?
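One way to explore the question is to generate alternative transcriptions and compare them in training. The following is only a minimal sketch: it parses the lexeme shown above without the PLS namespace or the surrounding lexicon element, and the list of candidate ‘r’ sounds is an arbitrary, hypothetical selection from the Wikipedia list, not a recommendation.

```python
import xml.etree.ElementTree as ET

# Lexeme copied from the post. A real PLS file would wrap it in a <lexicon>
# element with the PLS namespace; this sketch leaves both out.
LEXEME = """<lexeme role="Substantiv">
<grapheme>Loveparade</grapheme>
<phoneme>lavpɛʀɛɪ̯t</phoneme>
</lexeme>"""

# Hypothetical selection of r-sounds to try in place of the uvular trill.
R_CANDIDATES = ["ʁ", "ɹ", "r"]


def r_variants(lexeme_xml, original="ʀ", candidates=R_CANDIDATES):
    """Return the grapheme and one phoneme string per candidate r-sound."""
    lexeme = ET.fromstring(lexeme_xml)
    grapheme = lexeme.findtext("grapheme")
    phoneme = lexeme.findtext("phoneme")
    return grapheme, [phoneme.replace(original, c) for c in candidates]


grapheme, variants = r_variants(LEXEME)
for variant in variants:
    print(grapheme, variant)
```

Each variant could then be added as an additional pronunciation of the same word, so that recognition results decide which ‘r’ fits best.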
HMM approach at the phoneme level
Tuesday, May 4th, 2010
I totally agree with this statement:
“Where other ASR attempts focused on either understanding words semantically (what does this word mean?) or on word bigram and trigram patterns (which words are most likely to come next?), both techniques you described, the HMM approach at the phoneme level was far more successful.”
This is why I am developing lots of PLS dictionaries. The PLS dictionaries work mainly on the phoneme level. When you train long words (words with lots of phonemes) you get better recognition results than with short words. What does this mean for the English and for the German language?
1. For the English language, you need “word bigram and trigram patterns” to get good recognition results. This means that you need a grammar.
2. For the German language, “the HMM approach at the phoneme level” is sufficient if you train only long words (e.g. “zusammengesetzte Wörter”). If you want to train short words for the German language (e.g. the German words zwei or zwölf or zurück), you should train them in “word bigram and trigram patterns” (like in the English language).
I want to train several thousand long German words (without a specific grammar; only the terminal category “Unknown” will be used) because I believe that “the HMM approach at the phoneme level” is the most promising way to start. Later, short German words can be added. For them, I need “word bigram and trigram patterns”.
English vocabulary doesn’t contain a lot of long words. This means that good recognition results for the English language require “word bigram and trigram patterns”. I find this task more difficult than “the HMM approach at the phoneme level”.
In short:
a. Train several thousand long German words (“the HMM approach at the phoneme level”).
b. Train short German words (“word bigram and trigram patterns”).
c. Train English words (“word bigram and trigram patterns”).
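The split between steps a and b can be sketched in a few lines. This is only an illustration under assumptions: the two entries are taken from the simon word list quoted in the “gls” post, pronunciations are space-separated phonemes, and the threshold of six phonemes for a “long” word is a hypothetical value, not a measured one.

```python
# Word -> space-separated phonemes, as in the simon word list.
LEXICON = {
    "Abbrechen": "gls a p b R E C @ n",
    "Acht": "a x t",
}


def split_by_length(lexicon, min_phonemes=6):
    """Separate long words (step a) from short words (step b).

    min_phonemes is a hypothetical threshold chosen for illustration.
    """
    long_words, short_words = [], []
    for word, pronunciation in lexicon.items():
        count = len(pronunciation.split())
        (long_words if count >= min_phonemes else short_words).append(word)
    return long_words, short_words


long_words, short_words = split_by_length(LEXICON)
print("train first (phoneme level):", long_words)
print("train later (bigram and trigram patterns):", short_words)
```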
Community and participation
Sunday, April 25th, 2010
This is a good idea: the user could
“go through a 10 minute training procedure to adapt the static model to his voice. This adaption data could then be sent to voxforge to improve the general model.”
Yes, the user should have the chance to help with the improvement of speech models by submitting his own recordings via simon to Voxforge.
“It all comes down to the community and very much depends upon their participation.”
Yes. I agree. We need more participants who can build plug-ins on their own. By the way, if you want people to participate, offer them export functionality. Maybe this is possible with the simon scenario concept. The user should be able to export everything he has created with simon (wav files, prompts, hmmdefs, tiedlist, macros, scenario).
“If the speech model is working, I am sure I could implement a basic dictation function in simon in a couple of monts if not weeks.”
What is the problem with the German speech model? Personally, I don’t know how to build a speech model with sam. I have recorded about 10,000 German FLAC files, and I would record more if I knew that this would improve the speech model. Unfortunately, I don’t know how to process them.
By the way, if you want good results, read the SAMPA phonemes when you record your voice with simon. Avoid omitting any phoneme when speaking. Speak intuitively.
simon should offer a dictation function that automatically inserts a space after each word.
HMM-definitions, Tiedlist, Dict, DFA
Saturday, April 24th, 2010
When I open Applications > Universal Access > sam > Input & output files, I see the following four options in the Output files area:
(a) HMM-definitions
(b) Tiedlist
(c) Dict
(d) DFA
When I look into simon, I see the following options:
(a) Hmm Definition
(b) Tiedlist
(c) Macros
(d) Stats
I find this confusing:
- (a) and (b) are obviously the same: sam can output the (a) Hmm Definition; simon can use the (a) Hmm Definition as input. sam can output the (b) Tiedlist; simon can use the (b) Tiedlist that sam produced as input. So far so good.
- But what about (c) and (d)? Is (c) Dict the same as (c) Macros? Is (d) DFA the same as (d) Stats?
Why does sam produce (c) Dict and (d) DFA as output files? Is it possible to use these sam output files as simon input files?
Some clarification would be helpful.
“simonPad”: default area for dictation
Friday, April 23rd, 2010
In my opinion, simon should offer a simonPad. The simonPad is a simple text editor that could be integrated into the simon main window:
After completing the simon starter wizard (Welcome, Scenarios, Base models, Server, Finished), the user could dictate directly into the simonPad. Please help the user by offering a simonPad (with a blinking cursor). It should be possible to switch between “dictation” and “command and control”. So if the user has chosen a specific simon scenario, he should be able to dictate the words that are included in this scenario into the simonPad.
The default area for dictation should be the simonPad.
What is an “Adapted base model”?
Friday, April 23rd, 2010
Take a look at this picture:
What is the difference between
(a) Static model,
(b) Adapted base model,
(c) User generated model?
The answer can be found in the simon wiki:
“If you choose to use adapted base models but don’t record any trainingssamples, simon will use the base models to be adapted just like static models.”
I find the simon options difficult to understand. I suggest the following options:
(a) Import acoustic model from Voxforge
(b) Adjust an imported acoustic model
(c) Create acoustic model from scratch
Why is it possible to adapt a static model? The word “static” implies that it can’t be adjusted. I would avoid the word “static”, and use “acoustic” instead. I think that the files hmmdefs, tiedlist, macros and stats contain the acoustic model (I might be wrong because I still don’t understand the difference between acoustic model and language model). I think that a simon scenario contains the language model because it contains grammar information.
If you name the option (b) Adapted base model, this implies that the base model has already been adapted. This is difficult to understand.
At the moment, the options that simon has to offer are too cryptic.
Further thoughts: My guess is that most users want to start immediately with dictation. So the user should select the option (a) Import acoustic model from Voxforge. It doesn’t matter if the recognition rate is low. At least, the computer should write some words (what about a simonPad? DNS offers DragonPad, Windows offers Notepad) when the user speaks into the microphone.
If simon doesn’t work out of the box (after completing a short wizard), the average user (= 80% of all users) will lose interest and do something else. This means that the option (a) Import acoustic model from Voxforge is very important for 80% of all simon users.
20% of all users want one of the following options:
(b) Adjust an imported acoustic model
(c) Create acoustic model from scratch
So I would suggest concentrating on the following option, which 80% of all users want:
(a) Import acoustic model from Voxforge
And I wouldn’t let the user decide in which window he wants to dictate. You could develop simonPad – a simple text editor that “catches” the speech when the user dictates into his microphone.
“gls” phoneme for glottal stop
Thursday, April 22nd, 2010
Let’s take a look at this file (ISO-8859-1):
<word>
<name>Abbrechen</name>
<pronunciation>gls a p b R E C @ n</pronunciation>
<terminal>SimpelKommando</terminal>
</word>
<word>
<name>Acht</name>
<pronunciation>a x t</pronunciation>
<terminal>Nummer</terminal>
</word>
Obviously, simon now imports the glottal stop. The word Abbrechen is transcribed with a glottal stop. The word Acht is transcribed without one. More info:
Introduced the “gls” phoneme for a glotal stop (“?” in SAMPA).
Great. This should improve the recognition rate when you use Ralf's German dictionary for training. As far as I know, the German BOMP dictionary contains glottal stops, too.
The glottal stop is a difficult phoneme. I think it is a good decision for simon to support it.
The abbreviation gls is better than kn.
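To illustrate the renaming, here is a minimal sketch of how a SAMPA pronunciation could be rewritten into the form shown above. It is not simon's actual import code: the function name is hypothetical, and it assumes that phonemes are separated by spaces and that the glottal stop appears as a standalone “?”.

```python
def sampa_to_simon(pronunciation):
    """Replace the SAMPA glottal stop "?" with the phoneme name "gls".

    Hypothetical helper; assumes space-separated phonemes.
    """
    phonemes = pronunciation.split()
    return " ".join("gls" if p == "?" else p for p in phonemes)


def starts_with_glottal_stop(pronunciation):
    """True if the first phoneme of a simon pronunciation is "gls"."""
    phonemes = pronunciation.split()
    return bool(phonemes) and phonemes[0] == "gls"


abbrechen = sampa_to_simon("? a p b R E C @ n")
acht = sampa_to_simon("a x t")
print(abbrechen, starts_with_glottal_stop(abbrechen))
print(acht, starts_with_glottal_stop(acht))
```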
Additional thought: Maybe it is possible to generate a simon scenario (XML file) with an .xsl style-sheet? You all know that I am primarily interested in dictation, so a simon scenario with, let’s say, 1,000 German words would be an interesting start. This should be possible, but I don’t want to invest too much time. So I am looking for a way to get quick results.
My dream is an XML-based framework:
- SSML prompts that link to FLAC audio files;
- Ralf's German dictionary (PLS);
- an .xsl style-sheet that accesses e.g. 99 SSML prompts for training (source file), and that chooses the corresponding words from the PLS dictionary (source file);
- the object file could be a simon scenario.
The command in the Ubuntu terminal would look as follows:
$ saxonb-xslt -ext:on -s:'http://script.blau.in/german/35/prompts.xml' -xsl:'create-simon-scenario.xsl' -o:simon-scenario.xml
What do you think? It is possible to access files that are located on the Internet directly from the Ubuntu terminal, and to process them.
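The same idea can be sketched without XSLT. The following Python example is only an illustration under several assumptions: the PLS fragment has no namespace, the entry for Acht is derived from the SAMPA transcription in the word list above, the root element name “words” is hypothetical (the real simon scenario format may differ), and the IPA phoneme string is copied unchanged although simon expects SAMPA-style phonemes.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the PLS dictionary (no namespace, two entries only).
PLS = """<lexicon>
<lexeme role="Substantiv">
<grapheme>Loveparade</grapheme>
<phoneme>lavpɛʀɛɪ̯t</phoneme>
</lexeme>
<lexeme>
<grapheme>Acht</grapheme>
<phoneme>axt</phoneme>
</lexeme>
</lexicon>"""

# Words that occur in the training prompts (stand-in for the SSML prompts).
PROMPT_WORDS = ["Loveparade"]


def build_words(pls_xml, prompt_words, terminal="Unknown"):
    """Create one <word> element per prompt word found in the dictionary."""
    wanted = set(prompt_words)
    root = ET.Element("words")  # hypothetical container element
    for lexeme in ET.fromstring(pls_xml).iter("lexeme"):
        grapheme = lexeme.findtext("grapheme")
        if grapheme not in wanted:
            continue
        word = ET.SubElement(root, "word")
        ET.SubElement(word, "name").text = grapheme
        # Copied as-is; a real scenario would need an IPA-to-SAMPA conversion.
        ET.SubElement(word, "pronunciation").text = lexeme.findtext("phoneme")
        ET.SubElement(word, "terminal").text = terminal
    return root


words = build_words(PLS, PROMPT_WORDS)
print(ET.tostring(words, encoding="unicode"))
```

Only the words that occur in the prompts end up in the output, and all of them get the terminal “Unknown”, so no specific grammar is required.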




