Posts Tagged ‘Hebrew’

Ralf’s Hebrew speech model

Wednesday, May 16th, 2012

Some words about the creation of Ralf’s Hebrew speech model:

1. Add a new Hebrew Scenario.

2. Import Ralf’s Hebrew dictionary as shadow dictionary.

3. Now I want to train 10 Hebrew words. Press the Train selected words button.

4. Simon is now asking a question:

Your vocabulary does not define all words used in this text. These words are missing:
פורניר, פורע, פורק, שומני, שומע, שומקום, קומקומי, קומקומו, טומנו, טונו

Do you want to add them now?

Of course, I want to add them now. I just trained these ten Hebrew words with Simon. They are now part of the active vocabulary.
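The check behind that question can be pictured as a simple comparison between the words of the training text and the active vocabulary. The following Python sketch only illustrates that idea; it is not Simon's actual code, and the function name missing_words is made up.

```python
def missing_words(text_words, vocabulary):
    """Return the words of the text that the vocabulary lacks.

    The original order is kept and repeated words are reported once.
    """
    known = set(vocabulary)
    seen = set()
    missing = []
    for word in text_words:
        if word not in known and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing

# Three of the words from the dialog above; the vocabulary here is invented.
text = ["פורניר", "פורע", "פורק"]
active_vocabulary = ["פורע"]
print(missing_words(text, active_vocabulary))
```

With an empty active vocabulary, every word of the training text would be reported, which matches the situation described above.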


Recognizing a Hebrew word

Friday, May 4th, 2012

I just created a scenario named “hebrew” in simon and imported Ralf’s Hebrew dictionary into it. Before that, I had to delete the General American dictionary that I had imported recently.

Simon just recognized a Hebrew word.

Special training: Please record the text below. It is possible to record a Hebrew word with simon.

My computer isn’t localized for Hebrew. It would be interesting to know if it works on a localized computer.

Ralf’s Hebrew dictionary

Tuesday, January 10th, 2012

In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I created the dictionary, and what the result looks like when imported into simon.

A. Creation of the dictionary:

1. Get Hebrew spelling dictionary from
2. License is GPL. There is a copyright notice inside the file he_IL.aff.

3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test

4. The source file he_IL.dic contains a lot of numbers. I remove them with the Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ sed 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers

With Geany, I remove the “,” (commas) and the “/” (slashes) that are still included in the file hebrew-without-numbers. Now I have a clean word list with about 43,000 Hebrew words.

5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.
6. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'' -o:hebrew.xml

7. ISO 639-1 language code is he.
8. I need a table for grapheme to phoneme conversion. Maybe I will use this table. There are several tables available at Wikipedia. I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew share the same alphabet. This means I could try to use the Yiddish improve-yiddish.xsl style sheet:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml

The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn’t been converted: [א]. I add a rule for this letter to a style sheet named improve-hebrew.xsl. Now I try it again:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'improve-hebrew.xsl' -o:hebrew-dictionary.xml

The result is not so good. Maybe I should adjust the grapheme-to-phoneme conversion rules to modern standard Israeli Hebrew. Or is this not necessary? For a first draft, I think I can use the Yiddish transformation rules.
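Steps 4 and 8 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the improve-hebrew.xsl style sheet itself: the function names and the sample lines are made up, the input is assumed to consist of a word optionally followed by a slash and numeric flags, and the letter table is a deliberately tiny, simplified subset. Letters without a rule are passed through unchanged, which is how an unconverted [א] becomes visible in the output.

```python
import re

# Deliberately tiny, simplified letter table (an assumption for illustration;
# the real rules live in the XSLT style sheet and cover far more letters).
LETTER_TO_IPA = {
    "\u05dc": "l",  # lamed
    "\u05de": "m",  # mem
    "\u05dd": "m",  # final mem
    "\u05e0": "n",  # nun
    "\u05df": "n",  # final nun
    "\u05e9": "ʃ",  # shin (ignoring the sin reading)
}

def clean_dic_line(line):
    """Drop digits, commas and slashes, as in step 4."""
    return re.sub(r"[0-9,/]", "", line).strip()

def transcribe(word, table=LETTER_TO_IPA):
    """Map each letter through the table; unmapped letters stay as they are."""
    return "".join(table.get(letter, letter) for letter in word)

def unconverted_letters(words, table=LETTER_TO_IPA):
    """Return the sorted list of letters that no rule covers yet."""
    return sorted({letter for word in words for letter in word
                   if letter not in table})

# Hypothetical sample lines in the assumed format (flags are invented).
raw = ["2", "שמן/12,7", "אם/3"]
words = [w for w in map(clean_dic_line, raw) if w]
for word in words:
    print(word, transcribe(word))
print("letters without a rule:", unconverted_letters(words))
```

Listing the uncovered letters first makes it easier to decide which rules still have to be added before the whole word list is converted.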

B. Download the dictionary. Import it into simon as shadow dictionary.

Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or to be more exact: it displays just Unknown) since the source PLS dictionary contains no role attributes.
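The word count and the missing role attributes can also be checked on the PLS file itself before importing it. A minimal Python sketch, assuming the file follows the W3C PLS 1.0 layout (lexeme elements with grapheme and phoneme children in the PLS namespace). The two-entry sample and the function name summarize_pls are made up for illustration, and the phoneme strings in the sample are placeholders, not real transcriptions.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Made-up two-entry sample; the phoneme strings are placeholders.
SAMPLE = f"""<lexicon version="1.0" xmlns="{PLS_NS}" alphabet="ipa" xml:lang="he">
  <lexeme><grapheme>שומע</grapheme><phoneme>x</phoneme></lexeme>
  <lexeme><grapheme>פורק</grapheme><phoneme>y</phoneme></lexeme>
</lexicon>
"""

def summarize_pls(xml_text):
    """Count the lexemes and how many of them carry a role attribute."""
    root = ET.fromstring(xml_text)
    lexemes = root.findall(f"{{{PLS_NS}}}lexeme")
    with_role = [lex for lex in lexemes if lex.get("role") is not None]
    return {"lexemes": len(lexemes), "with_role": len(with_role)}

print(summarize_pls(SAMPLE))
```

If the description above holds, running this on the full dictionary should report as many lexemes as simon shows words, and no lexeme with a role attribute.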

Now you know how I created the dictionary, and you know what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn’t be a problem to adjust the style sheet improve-hebrew.xsl so that the phoneme results are better.

Disabling Hebrew speech model

Friday, September 11th, 2009

Because I don’t want to use the Hebrew speech model any more, I renamed the following folders:

Old name: /home/liberty/.kde/share/apps/simon/model
New name: /home/liberty/.kde/share/apps/simon/model-20090911-hebrew

Old name: /home/liberty/.kde/share/apps/simond/models
New name: /home/liberty/.kde/share/apps/simond/models-20090911-hebrew
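The same kind of rename, with the date and a label appended to the folder name, can be scripted. A minimal Python sketch, run here against a throwaway directory rather than the real ~/.kde tree; the function name archive_dir and its parameters are made up for illustration.

```python
from datetime import date
from pathlib import Path
import tempfile

def archive_dir(path, label, day=None):
    """Rename path to '<name>-YYYYMMDD-<label>' and return the new path."""
    day = day or date.today()
    source = Path(path)
    target = source.with_name(f"{source.name}-{day:%Y%m%d}-{label}")
    if target.exists():
        # Refuse to overwrite an earlier archive with the same name.
        raise FileExistsError(target)
    return source.rename(target)

# Demonstration in a temporary directory instead of the real model folder.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model"
    model.mkdir()
    archived = archive_dir(model, "hebrew", date(2009, 9, 11))
    print(archived.name)
```

Renaming instead of deleting keeps the old model available, so it can be restored later by renaming the folder back.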

Now I can import a different lexicon. Which lexicon should I import next? The German PLS dictionary? Or should I import the Voxforge dictionary (HTK format)? I didn't import a Dutch dictionary yet. I think that there is one available at Voxforge.

Confidence score with Hebrew

Thursday, September 10th, 2009

A few hours ago, I created a sample Hebrew PLS dictionary. It is very short, but it shows the concept.


1. I imported the Hebrew PLS dictionary into simon.
2. I dragged each word to the right side for training.
3. The recorded Hebrew words are stored in the folder /home/liberty/.kde/share/apps/simon/model/
4. After starting ksimond (PDF), I pressed the Synchronize button.
5. Then I activated simon.
6. When dictating several words, simon is not sure which word is the right one. Is it מדפסת, or is it עִבְרִית? Unfortunately, I didn’t get any output in gedit or in Geany. Maybe it has something to do with the right-to-left encoding?

You can see that, thanks to UTF-8, Hebrew shouldn’t be a big problem. I don’t know why the output is missing. But at least the Hebrew words are displayed correctly, so only the last step is missing.
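To see how the two words from step 6 are stored, they can be inspected with Python's standard unicodedata module. The sketch shows that Hebrew letters carry the bidirectional class R, that the text is kept in logical (typing) order, and that the vowel points in עִבְרִית are separate combining characters. Whether any of this is related to the missing output in gedit is not established here; the helper name describe is made up.

```python
import unicodedata

def describe(word):
    """Return (code point, Unicode name, bidi class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch),
             unicodedata.bidirectional(ch)) for ch in word]

# The two words from step 6, written as escapes to keep the order unambiguous.
printer = "\u05de\u05d3\u05e4\u05e1\u05ea"  # מדפסת
hebrew = "\u05e2\u05b4\u05d1\u05b0\u05e8\u05b4\u05d9\u05ea"  # עִבְרִית with vowel points

for word in (printer, hebrew):
    points = sum(1 for ch in word if unicodedata.combining(ch))
    print(word, len(word), "code points,",
          len(word.encode("utf-8")), "bytes in UTF-8,",
          points, "combining marks")
    for entry in describe(word):
        print("  ", entry)
```

The pointed word contains more code points than letters, so a dictionary entry without vowel points and a recognized word with vowel points are different strings even though they look similar.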

If anyone is interested in building a Hebrew PLS dictionary, I suggest taking a look at the Hebrew Voxforge prompts and adding the words contained in the prompts to the dictionary. You can take my sample Hebrew dictionary and expand it. Later, once you have gained some first experience with simon, you can use the Voxforge prompts for training.