This article explains how I created this PLS dictionary and what the imported result looks like.
A. Creation of the Belarusian PLS dictionary:
1. Get the spelling dictionary. I chose the official orthography.
2. The license is LGPL (see hyph_be_BY.dic). I am allowed to “convert any LGPLed piece of software into a GPLed piece of software.” I have done this before, and I will do it again. This means that I start from a spelling dictionary that is licensed under the LGPL and produce a pronunciation dictionary that is licensed under the GPLv3. By the way, all my dictionaries are licensed under the GPLv3.
3. Extract dict-be-official.oxt.
4. The file be-official.aff is encoded in UTF-8. The file be-official.dic may be encoded in ISO-8859-1; at least this is the encoding that Geany displays. I believe that be-official.dic is actually encoded in microsoft-cp1251. I have encountered this encoding before (Macedonian and Bulgarian).
Now it is time to use the Ubuntu terminal:
cd /home/ubuntu/Documents/2011-II/Belarusian
iconv -f cp1251 -t UTF-8 <be-official.dic >belarusian-utf8.dic
The text file belarusian-utf8.dic looks fine.
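The conversion can be tried on a tiny sample before running it on the full dictionary. This is only a minimal sketch: the temporary directory and the file names sample-cp1251.txt and sample-utf8.txt are made up for illustration, and the four bytes are the cp1251 encoding of the Belarusian word мова. It assumes an iconv build that knows the cp1251 charset.

```shell
# Work in a throwaway directory (hypothetical; not the article's working directory).
workdir=$(mktemp -d)

# Write the word "мова" as raw cp1251 bytes (octal escapes for 0xEC 0xEE 0xE2 0xE0).
printf '\354\356\342\340\n' > "$workdir/sample-cp1251.txt"

# Same kind of iconv call as above, applied to the sample file.
iconv -f cp1251 -t UTF-8 < "$workdir/sample-cp1251.txt" > "$workdir/sample-utf8.txt"

# Validity check: iconv exits with an error if its input is not valid UTF-8.
if iconv -f UTF-8 -t UTF-8 < "$workdir/sample-utf8.txt" > /dev/null; then
    utf8_status="valid"
else
    utf8_status="invalid"
fi

converted=$(cat "$workdir/sample-utf8.txt")
echo "$converted ($utf8_status UTF-8)"
```

If the wrong source encoding were chosen, for example ISO-8859-1, iconv would still succeed but the output would show accented Latin letters instead of Cyrillic, so looking at the result remains necessary.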
5. Now I change the line SET microsoft-cp1251 in the file be-official.aff to SET UTF-8.
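The SET line can be changed in any text editor. As an alternative, here is a minimal sketch with sed, run on a made-up two-line affix file (sample.aff and sample-utf8.aff are hypothetical names, and a real .aff file has many more lines). It writes the result to a new file instead of editing in place.

```shell
workdir=$(mktemp -d)

# Hypothetical miniature affix file standing in for be-official.aff.
printf 'SET microsoft-cp1251\nTRY abc\n' > "$workdir/sample.aff"

# Rewrite only the SET line; every other line passes through unchanged.
sed 's/^SET microsoft-cp1251/SET UTF-8/' "$workdir/sample.aff" > "$workdir/sample-utf8.aff"

set_line=$(grep '^SET ' "$workdir/sample-utf8.aff")
other_line=$(sed -n '2p' "$workdir/sample-utf8.aff")
echo "$set_line"
```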
6. I don’t know whether the next step is necessary: I could convert the file hyph_be_BY.dic from cp1251 to UTF-8. For the moment, I skip this step.
7. Ubuntu terminal:
unmunch belarusian-utf8.dic be-official.aff > belarusian-wordlist
I think that this step wasn’t necessary. It didn’t extract the word list. At the moment, I have a word list of 1.5 million words. This is way too much. The target size is 400,000 words.
8. I have to reduce the dictionary size. I found a tip. Ubuntu terminal:
sed -n 'p;N;N;N' belarusian-wordlist > belarusian-wordlist-reduced
Yes, it worked. The command keeps every fourth line, so the word list now contains about 391,000 words. This is a good basis for a PLS dictionary.
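To see what the sed expression does, it helps to run it on a few numbered lines. In this minimal sketch, twelve numbers stand in for the word list. With -n, sed prints nothing on its own; p prints the current line, and each N pulls the next line into the pattern space without printing it. The awk line is an equivalent alternative that I add only for comparison.

```shell
# Twelve numbered lines stand in for the word list.
kept_by_sed=$(seq 1 12 | sed -n 'p;N;N;N')

# Equivalent selection with awk: keep the lines whose number is 1 modulo 4.
kept_by_awk=$(seq 1 12 | awk 'NR % 4 == 1')

# Both variables now hold the lines 1, 5 and 9.
echo "$kept_by_sed"
```

Changing the number of N commands changes the reduction factor: two of them would keep every third line, for example.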
9. Add lexicon elements at the beginning and the end of belarusian-wordlist-reduced.
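This step can be done in a text editor. Below is a minimal sketch of doing it on the command line instead. It assumes that the wrapper is a bare lexicon start tag on the first line and the matching end tag on the last line; the exact tags and any attributes used for the real file are not shown here, and the three-word list and the file names are made up for illustration.

```shell
workdir=$(mktemp -d)

# Hypothetical three-word list standing in for belarusian-wordlist-reduced.
printf 'мова\nслова\nкніга\n' > "$workdir/wordlist"

# Assumption: the wrapper is a bare <lexicon> element without attributes.
{
    printf '<lexicon>\n'
    cat "$workdir/wordlist"
    printf '</lexicon>\n'
} > "$workdir/wordlist-wrapped"

first_line=$(head -n 1 "$workdir/wordlist-wrapped")
last_line=$(tail -n 1 "$workdir/wordlist-wrapped")
line_count=$(wc -l < "$workdir/wordlist-wrapped" | tr -d ' ')
echo "$first_line ... $last_line ($line_count lines)"
```

Writing to a new file keeps the unwrapped word list intact in case the reduction has to be repeated.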
10. Ubuntu terminal:
ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian-wordlist-reduced -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:belarusian.xml
11. The language code is be.
12. I will use this table for the grapheme-to-phoneme mapping.
13. Creation of the phoneme elements:
ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian.xml -xsl:'improve-belarusian.xsl' -o:belarusian-dictionary.xml
B. Download and import the dictionary.
Let’s take a look at the result. The left column contains 391669 Belarusian words. The pronunciation column contains the corresponding SAMPA transcriptions. All entries in the third column are marked as “Unknown”. This is because the Belarusian PLS dictionary doesn’t contain any role attribute.
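In PLS, the role attribute sits on the lexeme element. A quick way to confirm that a dictionary has none is to count the lexeme start tags with and without it. This is a minimal sketch on a made-up two-entry file: sample.xml is a hypothetical name, the bare lexicon wrapper is a simplification, and the "..." in the phoneme elements are placeholders, not real transcriptions.

```shell
workdir=$(mktemp -d)

# Hypothetical two-entry lexicon; "..." marks transcriptions left out on purpose.
cat > "$workdir/sample.xml" <<'EOF'
<lexicon>
  <lexeme><grapheme>мова</grapheme><phoneme>...</phoneme></lexeme>
  <lexeme><grapheme>слова</grapheme><phoneme>...</phoneme></lexeme>
</lexicon>
EOF

# Count every lexeme start tag, then only those that carry a role attribute.
lexeme_count=$(grep -o '<lexeme[ >]' "$workdir/sample.xml" | wc -l | tr -d ' ')
role_count=$(grep -o '<lexeme[^>]* role=' "$workdir/sample.xml" | wc -l | tr -d ' ')
echo "lexemes: $lexeme_count, with role: $role_count"
```

Using grep -o and wc -l counts tags rather than lines, so the result stays correct even if several entries share one line.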
Now you know how I created the dictionary, and you have an impression of what the result looks like when imported into simon.