
Ralf’s Macedonian dictionary

Saturday, April 24th, 2010

Here is how I created Ralf's Macedonian dictionary:

1. Get the spelling dictionary (licensed under GPLv2).

2. Convert mk_MK.dic from cp1251 to UTF-8 in an Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/macedonian-dictionary$ iconv -f cp1251 -t UTF-8 mk_MK.dic > macedonian-utf8.dic

3. Convert mk_MK.aff in the same way:

am3msi@am3msi-desktop:~/Documents/201004/macedonian-dictionary$ iconv -f cp1251 -t UTF-8 mk_MK.aff > macedonian-utf8.aff

4. Change the first line of macedonian-utf8.aff from SET microsoft-cp1251 to SET UTF-8 (I used gedit).
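Steps 2 to 4 can also be scripted. The following is a minimal sketch, not the procedure I actually ran: the function name convert_dictionary and its three arguments are hypothetical, and it assumes that the SET declaration sits on the first line of the .aff file, as it does in mk_MK.aff.

```shell
#!/bin/sh
# Hypothetical helper: re-encode a .dic/.aff pair from cp1251 to UTF-8.
# $1 = input .dic, $2 = input .aff, $3 = basename for the output files.
convert_dictionary() {
    in_dic=$1
    in_aff=$2
    out_base=$3

    # Steps 2 and 3: convert both files from cp1251 to UTF-8.
    iconv -f cp1251 -t UTF-8 "$in_dic" > "$out_base.dic" || return 1
    iconv -f cp1251 -t UTF-8 "$in_aff" > "$out_base.aff" || return 1

    # Step 4: rewrite the SET declaration, but only on the first line.
    # A temporary file is used instead of "sed -i" for portability.
    sed '1s/^SET .*/SET UTF-8/' "$out_base.aff" > "$out_base.aff.tmp" &&
        mv "$out_base.aff.tmp" "$out_base.aff"
}
```

Called as convert_dictionary mk_MK.dic mk_MK.aff macedonian-utf8, it should leave macedonian-utf8.dic and macedonian-utf8.aff in the current directory.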

5. Running the unmunch command would not yield any additional Macedonian words.

6. Add <lexicon> tags to macedonian-utf8.dic: <lexicon> at the beginning of the file and </lexicon> at the end.

7. Generate the .xml file:
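Adding the tags does not need an editor. Here is a small sketch under the assumption that the word list contains no characters that would need XML escaping (such as & or <); the function name wrap_lexicon is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: wrap a word list in <lexicon> ... </lexicon>.
# $1 = input file, $2 = output file. The input is copied unchanged,
# so characters like & or < are NOT escaped (assumed to be absent).
wrap_lexicon() {
    {
        printf '<lexicon>\n'
        cat "$1"
        printf '</lexicon>\n'
    } > "$2"
}
```

The output goes to a separate file so that the original word list stays untouched; rename it afterwards if the next step expects the original filename.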

am3msi@am3msi-desktop:~/Documents/201004/macedonian-dictionary$ saxonb-xslt -ext:on -s:macedonian-utf8.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:macedonian.xml

8. Think about the style sheet improve-macedonian-dictionary.xsl. It should contain the language tag mk, and its grapheme-to-phoneme conversion should follow this table. The good news:

“Macedonian orthography is consistent and phonemic in practice, an approximation of the principle of one grapheme per phoneme. A principle represented by Adelung’s saying, “write as you speak and read as it is written” („пишувај како што зборуваш и читај како што е напишано“).”

This means that the pronunciations generated for Ralf's Macedonian dictionary should be reasonably accurate.

9. Generate the <phoneme> elements in an Ubuntu terminal:
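To illustrate what one grapheme per phoneme means for the conversion, here is a rough sketch using sed instead of XSLT. It is not the style sheet itself: the function name to_sampa is hypothetical, the table covers only the ten letters needed for the example word, and the choice of E and O for е and о is my assumption rather than something taken from the linked table.

```shell
#!/bin/sh
# Hypothetical helper: map a Macedonian word to space-separated SAMPA
# symbols, one symbol per letter. The table is deliberately partial and
# illustrative; letters that are not listed pass through unchanged.
to_sampa() {
    printf '%s\n' "$1" | sed \
        -e 's/а/a /g' -e 's/е/E /g' -e 's/и/i /g' -e 's/о/O /g' \
        -e 's/г/g /g' -e 's/м/m /g' -e 's/т/t /g' -e 's/р/r /g' \
        -e 's/с/s /g' -e 's/к/k /g' \
        -e 's/ $//'
}
```

With this partial table, to_sampa 'геометриска' prints g E O m E t r i s k a. A complete conversion would need one rule for every letter of the Macedonian alphabet.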

am3msi@am3msi-desktop:~/Documents/201004/macedonian-dictionary$ saxonb-xslt -ext:on -s:macedonian.xml -xsl:'improve-macedonian-dictionary.xsl' -o:macedonian-dictionary.xml

10. Download Ralf's Macedonian dictionary and import it into simon. Take a look at the Shadow Vocabulary:

[Screenshot: macedonian]

Word column: Macedonian words
Pronunciation column: Corresponding SAMPA pronunciation

11. It should be possible to train the word геометриска:

[Screenshot: train-macedonian-word]

At the moment, I cannot record this word myself because my sound configuration still needs to be adjusted. The important point is that you should be able to get some initial recognition results with Ralf's Macedonian dictionary. It would be nice if somebody used it to train a few Macedonian words. I would like to know whether the Macedonian Cyrillic alphabet causes issues with HTK. My guess is that it works, but I haven’t tested it.