This article explains how I created an Arabic PLS dictionary and what the result looks like in simon.
A. Creation of the dictionary:
1. Get Arabic spelling dictionary.
2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file:
GPL 2.0/LGPL 2.1/MPL 1.1 tri-license
This means that I can use this tri-licensed spelling dictionary as the source for my future GPLv3 PLS dictionary.
3. Now I have to extract
4. Let’s try the unmunch command inside the Ubuntu terminal:
ubuntu@ubuntu:~/Documents/2011-II/Arabic$ unmunch ar.dic ar.aff > arabic
It failed. I wasn’t able to unmunch the word list.
5. I have to remove all numbers from ar.dic. This can be done with sed:
sed 's/[0-9]*//g' ar.dic > arabic-without-numbers
6. Remove the slash (“/”) from arabic-without-numbers with Geany.
7. Add lexicon tags at the beginning and the end of the file.
8. Ubuntu terminal:
ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:arabic.xml
9. ISO 639-1 language code is ar.
10. Maybe I will use this table for the grapheme to phoneme conversion.
11. Ubuntu terminal:
ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic.xml -xsl:'improve-arabic.xsl' -o:arabic-dictionary.xml
I have to remove the number sign (“#”) with Geany from arabic.xml.
B. Download the dictionary. Import it into simon.
The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column contains only “Unknown” entries. This is because the PLS dictionary contains no
Now you know how I created the dictionary and what the result looks like in simon.