Here is how I created Ralf's Bulgarian dictionary:
1. Get the Bulgarian spelling dictionary from OpenOffice.org and extract it.
2. Trying to unmunch the word list via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/bulgarian-dictionary$ unmunch bg_BG.dic bg_BG.aff > bulgarian-wordlist
Encoding problems occurred. The Bulgarian word list is evidently encoded in cp1251, not in UTF-8, so the files bg_BG.dic and bg_BG.aff have to be converted from cp1251 to UTF-8 first.
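One way to confirm the encoding before converting is to read the SET line that a Hunspell affix file normally carries. The helper below is only a sketch: declared_encoding is a hypothetical name, and it assumes the .aff file has a line starting with SET.

```shell
# Print the encoding declared by the SET line of a Hunspell .aff file.
# declared_encoding is a hypothetical helper, not part of Hunspell.
declared_encoding() {
  # Drop carriage returns (in case of DOS line endings), then print
  # whatever follows "SET" on the first matching line.
  tr -d '\r' < "$1" | sed -n 's/^SET[[:space:]]\{1,\}//p' | head -n 1
}
```

Called as `declared_encoding bg_BG.aff`, it should print the value that the SET line contains.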
3. Converting bg_BG.dic via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/bulgarian-dictionary$ iconv -f cp1251 -t UTF-8 bg_BG.dic > bulgarian-utf8.dic
4. Converting bg_BG.aff via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/bulgarian-dictionary$ iconv -f cp1251 -t UTF-8 bg_BG.aff > bulgarian-utf8.aff
5. Changing the first line of the file bulgarian-utf8.aff from SET microsoft-cp1251 to SET UTF-8 with gedit.
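Steps 3 to 5 can also be done in one go without opening gedit. This is a minimal sketch, not what I actually ran: convert_pair is a hypothetical helper name, and it assumes that the encoding declaration sits on the first line of the .aff file, as it does here.

```shell
# Convert a Hunspell .dic/.aff pair from cp1251 to UTF-8 and update the
# SET line. convert_pair is a hypothetical helper name.
# Usage: convert_pair SOURCE_BASENAME TARGET_BASENAME
convert_pair() {
  src=$1
  dst=$2
  for ext in dic aff; do
    iconv -f cp1251 -t UTF-8 "$src.$ext" > "$dst.$ext" || return 1
  done
  # Replace the encoding declaration on line 1. Write to a temporary
  # file first so the .aff file is never read and written at once.
  sed '1s/^SET microsoft-cp1251/SET UTF-8/' "$dst.aff" > "$dst.aff.tmp" &&
    mv "$dst.aff.tmp" "$dst.aff"
}
```

With the file names used above, the call would be `convert_pair bg_BG bulgarian-utf8`.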
6. Generating the Bulgarian word list with the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/bulgarian-dictionary$ unmunch bulgarian-utf8.dic bulgarian-utf8.aff > bulgarian-wordlist-utf8
Result: about 890,000 Bulgarian words are available.
7. Editing bulgarian-wordlist-utf8 with gedit. This word list has to be transformed into an XML file. I am enclosing the word list with <lexicon> tags (<lexicon> at the beginning of the file; </lexicon> at the end of the file).
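With a file of this size, the two tags can also be added from the terminal instead of gedit. The following is only a sketch: wrap_lexicon and the temporary file name are hypothetical, and it assumes that the word list contains no characters that would need XML escaping (such as & or <).

```shell
# Wrap a plain word list in <lexicon> ... </lexicon>.
# wrap_lexicon is a hypothetical helper; INPUT and OUTPUT must differ.
# Usage: wrap_lexicon INPUT OUTPUT
wrap_lexicon() {
  {
    printf '%s\n' '<lexicon>'
    cat "$1"
    printf '%s\n' '</lexicon>'
  } > "$2"
}
```

To keep the file name used in the next step, something like `wrap_lexicon bulgarian-wordlist-utf8 wrapped.tmp && mv wrapped.tmp bulgarian-wordlist-utf8` would do.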
8. Generating <lexeme> and <grapheme> elements via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/bulgarian-dictionary$ saxonb-xslt -ext:on -s:bulgarian-wordlist-utf8 -xsl:'create-xml-file.xsl' -o:bulgarian.xml
9. Editing the .xsl style-sheet improve-bulgarian-dictionary.xsl. After reading the Wikipedia article about Bulgarian vowels, I am implementing the vowel transformation rules in the .xsl style-sheet.
10. Implementing the Bulgarian consonant-to-IPA conversion in the style-sheet. Taking a look at the Bulgarian alphabet table in Wikipedia. I will have to replace each letter of the Bulgarian alphabet by the corresponding IPA phoneme because:
"Most letters in the Bulgarian alphabet stand for just one specific sound."
So I can use the Bulgarian alphabet table in Wikipedia for my .xsl conversion style-sheet.
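To illustrate the letter-by-letter idea outside of XSLT, here is a naive sketch using sed. It is not the content of improve-bulgarian-dictionary.xsl: bg_to_ipa is a hypothetical name, the phoneme values are my reading of the Wikipedia alphabet table and should be checked against it, and the sketch handles lowercase letters only and ignores context-dependent rules such as vowel reduction.

```shell
# Naive one-letter-to-one-phoneme conversion for lowercase Bulgarian.
# bg_to_ipa is a hypothetical helper; it reads words on standard input.
# The mapping is an assumption based on the Wikipedia alphabet table.
bg_to_ipa() {
  sed \
    -e 's/а/a/g' -e 's/б/b/g' -e 's/в/v/g' -e 's/г/ɡ/g' -e 's/д/d/g' \
    -e 's/е/ɛ/g' -e 's/ж/ʒ/g' -e 's/з/z/g' -e 's/и/i/g' -e 's/й/j/g' \
    -e 's/к/k/g' -e 's/л/l/g' -e 's/м/m/g' -e 's/н/n/g' -e 's/о/ɔ/g' \
    -e 's/п/p/g' -e 's/р/r/g' -e 's/с/s/g' -e 's/т/t/g' -e 's/у/u/g' \
    -e 's/ф/f/g' -e 's/х/x/g' -e 's/ц/ts/g' -e 's/ч/tʃ/g' -e 's/ш/ʃ/g' \
    -e 's/щ/ʃt/g' -e 's/ъ/ɤ/g' -e 's/ь/ʲ/g' -e 's/ю/ju/g' -e 's/я/ja/g'
}
```

For example, `printf 'шапка\n' | bg_to_ipa` prints ʃapka. The few letters that stand for more than one sound (ц, ч, щ, ю, я) simply map to two phoneme symbols.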
11. Now it is time to build the PLS dictionary via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/bulgarian-dictionary$ saxonb-xslt -ext:on -s:bulgarian.xml -xsl:'improve-bulgarian-dictionary.xsl' -o:bulgarian-dictionary.xml
12. Here is the result: Download Ralf's Bulgarian dictionary, and import it into simon:
Because the dictionary is pretty big, the import process takes a few moments.
simon displays the Cyrillic alphabet correctly in the left column.
The Pronunciation column contains the corresponding Bulgarian SAMPA phonemes.
13. If you are a native speaker of the Bulgarian language, I am looking for you. You can help to improve Ralf's Bulgarian dictionary. You don’t need to ask for my permission because Ralf's Bulgarian dictionary is licensed under the GPLv3 (as well as the .xsl style-sheet improve-bulgarian-dictionary.xsl).