Ralf’s Polish speech model

Thursday, May 17th, 2012

Some words about the creation of this speech model:

1. Get Ralf’s Polish dictionary.
2. Create a Polish scenario.
3. Delete the shadow vocabulary from my previous scenario.
4. Import Ralf’s Polish dictionary as shadow dictionary.
5. Add 10 words to training. Simon asks:

Your vocabulary does not define all words used in this text. These words are missing:
Knopik, knorra, pitbul, cedziny, ceglasto, celebracja, celebra, cella, celoteks, frencz

Do you want to add them now?

Press the Yes button.

6. Record the ten words with Simon.
7. Actions > Synchronize, then Actions > Activate. It doesn’t work. Why not? I forgot to add the terminal “Unknown” to the grammar, and I forgot to add the Dictation plugin.
8. Let’s dictate a few Polish words:

celebra cella cella frencz Knopik knorra pitbul cella celebra

9. Download Ralf’s Polish speech model.
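
The question Simon asks in step 5 amounts to a set difference between the words of the training text and the words of the vocabulary. Here is a minimal sketch of that check in Python. It is not Simon's own code; the function name missing_words, the case-insensitive comparison and the two-word vocabulary are assumptions made for illustration.

```python
def missing_words(text, vocabulary):
    """Return the words of `text` that `vocabulary` lacks, in first-seen order.

    Assumption: words are compared case-insensitively. Simon itself may
    compare them differently.
    """
    known = {word.casefold() for word in vocabulary}
    seen = set()
    missing = []
    for word in text.split():
        key = word.casefold()
        if key not in known and key not in seen:
            seen.add(key)
            missing.append(word)
    return missing


# Hypothetical vocabulary that already defines two of the dictated words.
training_text = "celebra cella cella frencz Knopik knorra pitbul cella celebra"
vocabulary = ["celebra", "cella"]
print(missing_words(training_text, vocabulary))
# ['frencz', 'Knopik', 'knorra', 'pitbul']
```

With an empty vocabulary, every distinct word of the text is reported, which matches the situation in step 5.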

Ralf’s Polish dictionary 0.1.1

Tuesday, May 18th, 2010

How I improve Ralf's Polish dictionary:

1. Language code is pl. Edit espeak2ipa.xsl. The section that is relevant to the Polish language begins with matches(/lexicon/@xml:lang, 'pl').

2. Transform the eSpeak phonemes into IPA phonemes in an Ubuntu terminal:

$ cat /media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/polish/polish-dictionary.xml.bz2 | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

3. Download Ralf's Polish dictionary, and import it into simon.

4. Unfortunately, a lot of <grapheme> elements in Ralf's Polish dictionary contain garbage characters.
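
To get an idea of how many entries are affected, the <grapheme> elements can be checked against the Polish alphabet. The following Python sketch does that. The list of allowed characters is my assumption (Polish letters plus q, v, x, hyphen and apostrophe), the function names are made up, and the sample lexicon is invented for illustration, including the placeholder phoneme text.

```python
import xml.etree.ElementTree as ET

# Assumption: only these characters are legitimate in a Polish grapheme.
POLISH_LETTERS = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźżqvx")
ALLOWED = POLISH_LETTERS | set("-'")


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def suspicious_graphemes(xml_text):
    """Return the text of every <grapheme> that is empty or contains
    a character outside ALLOWED."""
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter():
        if local_name(elem.tag) != "grapheme":
            continue
        word = (elem.text or "").strip()
        if not word or any(ch not in ALLOWED for ch in word.lower()):
            bad.append(word)
    return bad


# Invented sample; the phoneme text is a placeholder, not real eSpeak output.
sample = """<lexicon xml:lang="pl" alphabet="espeak">
  <lexeme><grapheme>jeden</grapheme><phoneme>placeholder</phoneme></lexeme>
  <lexeme><grapheme>Å¼aba</grapheme><phoneme>placeholder</phoneme></lexeme>
</lexicon>"""

print(suspicious_graphemes(sample))
# ['Å¼aba']
```

A check like this only finds the damaged entries. It does not repair them.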

Try to train the Polish word “JEDEN”

Thursday, November 19th, 2009

I want to import a small sample dictionary into simon (Sphinx format):


The source can be found here (I don’t know how long this link will be valid). The dictionary contains 19 Polish words (US-ASCII). Here is what you have to do next:


1. Select Applications > Universal Access > simon.


2. Press the Wordlist button.
3. Press Import Dictionary.


4. You can select the target: shadow dictionary or active dictionary. For this Polish example dictionary, choose active dictionary.
5. Press the Next button.

And now it is time to choose the appropriate lexicon format:


Import the dictionary (with the 19 Polish words; see the screen-shot at the beginning of this post) as a SPHINX lexicon.


You have to select the path to the Polish Sphinx dictionary. After pressing the Next button, the following message appears:


The Polish Sphinx dictionary has been imported successfully. Press the Finish button.
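
For readers who have never seen a Sphinx dictionary: it is a plain text file with one entry per line, the word first, followed by its phonemes, all separated by whitespace. Alternative pronunciations are usually written as WORD(2), WORD(3) and so on. The Python sketch below reads such lines. It is not simon's importer; the function name is made up, skipping lines that start with ";;;" is an assumption borrowed from other Sphinx dictionaries, and the phoneme symbols in the sample are invented.

```python
import re


def parse_sphinx_dictionary(lines):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    entries = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and ';;;' comment lines carry no entries.
        if not line or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        # 'JEDEN(2)' is an alternative pronunciation of 'JEDEN'.
        word = re.sub(r"\(\d+\)$", "", word)
        entries.setdefault(word, []).append(phonemes)
    return entries


# Invented sample lines; the phoneme symbols are not taken from the real file.
sample = [
    ";;; hypothetical sample",
    "JEDEN J E D E N",
    "JEDEN(2) J E D E M",
]
print(parse_sphinx_dictionary(sample))
# {'JEDEN': [['J', 'E', 'D', 'E', 'N'], ['J', 'E', 'D', 'E', 'M']]}
```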

Now let’s train a Polish word:


a. Select the Polish word JEDEN.
b. Add to Training.
c. Train selected Words.

You can now record the Polish word with simon:



Ralf’s Polish dictionary

Tuesday, November 10th, 2009

You can import Ralf's Polish dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is currently not possible.

Some details about (the creation of) this dictionary:
- The grapheme elements often contain garbage characters. I tried to fix this issue with iconv, but unfortunately I wasn’t successful. When the grapheme element contains crap, the phoneme element is crappy, too. These lexeme elements are unusable.
- I didn’t convert the eSpeak phonemes into IPA phonemes. They are still in their original format, as indicated by the attribute of the lexicon element: alphabet="espeak"
- As the source, I used a spelling dictionary.

This dictionary is of really bad quality. The encoding issue has to be solved, but I tried three different conversions, and all of them failed.
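
One thing that could still be tried is to undo a possible double encoding: if UTF-8 bytes were read as a single-byte encoding at some point, encoding the text back with that encoding and decoding it as UTF-8 restores the original. Whether this is what happened to the dictionary is only a guess. The Python sketch below lists the candidate repairs for one string; the function name and the three encodings are my assumptions, and they are not the three conversions I tried with iconv.

```python
def try_repair(text, wrong_encodings=("cp1250", "latin-1", "cp1252")):
    """Return {encoding: repaired text} for every encoding that works.

    Assumption: the garbage was produced by decoding UTF-8 bytes with one
    of `wrong_encodings`. Encodings that cannot round-trip are skipped.
    """
    candidates = {}
    for encoding in wrong_encodings:
        try:
            candidates[encoding] = text.encode(encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return candidates


# Invented example: "żaba" whose UTF-8 bytes were read as Latin-1.
for encoding, repaired in try_repair("Å¼aba").items():
    print(encoding, repaired)
```

If no candidate looks like a Polish word, the damage has a different cause, and this approach does not help.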

At least you can get an impression of how easy it is to import 300,000 Polish words.