In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I created the dictionary and what the result looks like when imported into simon.
A. Creation of the dictionary:
1. Get the Hebrew spelling dictionary from OpenOffice.org.
2. The license is GPL. There is a copyright notice inside the file.
3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:
ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test
4. The source file he_IL.dic contains a lot of numbers. I remove them in the Ubuntu terminal:
ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ sed 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers
With Geany, I remove the “,” (commas) and the “/” (slashes) that are still in the file hebrew-without-numbers. Now I have a clean word list with about 43,000 Hebrew words.
5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.
6. Ubuntu terminal:
ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:hebrew.xml
7. ISO 639-1 language code is
8. I need a table for grapheme-to-phoneme conversion. Maybe I will use this table. There are several tables available at Wikipedia, and I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew share the same alphabet. This means I could try the Yiddish improve-yiddish.xsl style sheet:
ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml
The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn’t been converted: [א]. I will add this phone to the .xsl style sheet with the name improve-hebrew.xsl. Now I try it again:
ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'improve-hebrew.xsl' -o:hebrew-dictionary.xml
The result is not so good. Maybe I should adjust the grapheme-to-phoneme conversion rules for modern standard Israeli Hebrew. Or is this not necessary? I think for a first draft I can use the Yiddish transformation rules.
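Steps 4 and 5 could probably also be done entirely in the terminal, without Geany. The following is only a minimal sketch under a few assumptions: the function names clean_dic and wrap_lexicon are made up for illustration, the deletion of empty lines (such as the word count line that becomes empty once the numbers are gone) is an extra that the steps above do not mention, and I have not verified that bare lexicon tags are exactly what create-xml-file.xsl expects.

```shell
#!/bin/sh
# Sketch only: clean_dic and wrap_lexicon are hypothetical helper names.

# Print the word list of a Hunspell-style .dic file without digits,
# slashes and commas. Lines that end up empty are dropped
# (assumption: empty lines are not wanted in the word list).
clean_dic() {
    sed -e 's/[0-9]//g' -e 's|[/,]||g' -e '/^$/d' "$1"
}

# Print a word list wrapped in opening and closing lexicon tags.
# Assumption: plain <lexicon> tags without attributes are sufficient.
wrap_lexicon() {
    echo '<lexicon>'
    cat "$1"
    echo '</lexicon>'
}

# Possible usage (file names as in the steps above):
#   clean_dic he_IL.dic > hebrew-without-numbers
#   wrap_lexicon hebrew-without-numbers > tagged && mv tagged hebrew-without-numbers
```

Both functions write to standard output, so the intermediate results can be checked (for example with wc -l) before the file is handed to saxonb-xslt.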
B. Download the dictionary. Import it into simon as a shadow dictionary.
Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or, to be more exact, displays just Unknown) since the source PLS dictionary contains no
Now you know how I created the dictionary and what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn’t be a problem to adjust the style sheet improve-hebrew.xsl so that the phoneme results are better.