Posts Tagged ‘Slovenian’

Ralf’s Slovenian speech model

Thursday, May 17th, 2012

A few notes on how this speech model was created.

1. Get Ralf’s Slovenian dictionary.
2. Create a Simon scenario with the name “Slovenian”.
3. Remove the shadow vocabulary.
4. Import Ralf’s Slovenian dictionary as shadow dictionary.
5. Add ten words to training. Simon asks:

Your vocabulary does not define all words used in this text. These words are missing:
encijane, encijanovo, enciklika, imenovale, kuretensko, nepozaben, plavolase, Zule, šiponovi, zlomom

Do you want to add them now?

Press the Yes button.

6. Add the grammar “Unknown”. Add the Dictation plugin. Choose Actions > Synchronize, then Actions > Activate. Dictate a few words:

encijane encijanovo enciklika imenovale šiponovi nepozaben plavolase šiponovi plavolase

7. Export scenario. Export base model.
8. Download Ralf’s Slovenian speech model.

Ralf’s Slovenian dictionary

Tuesday, April 20th, 2010

How I created Ralf's Slovenian dictionary:

1. Get the spelling dictionary and extract it.
2. Check the license. It is GPLv3 compatible.

3. Generate word list via Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ unmunch sl_SI.dic sl_SI.aff > slovenian-wordlist

It looks like some encoding problems occurred. I should probably convert the files to UTF-8 before running the unmunch command.

4. Changing the encoding of the file sl_SI.dic from ISO8859-2 to UTF-8:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ iconv -f ISO8859-2 -t UTF-8 < sl_SI.dic > slovenian-utf8.dic

5. Changing the encoding of the file sl_SI.aff from ISO8859-2 to UTF-8:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ iconv -f ISO8859-2 -t UTF-8 < sl_SI.aff > slovenian-utf8.aff

7. Editing the first line of the file slovenian-utf8.aff with gedit: replacing "SET ISO8859-2" with "SET UTF-8".
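
The two iconv calls and the manual edit of the SET line can also be scripted. Here is a minimal Python sketch, assuming the encoding declaration is on the first line of the .aff file; the helper name convert_to_utf8 is made up for this example and is not part of hunspell:

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, fix_set_line: bool = False) -> None:
    """Re-encode a hunspell file from ISO8859-2 to UTF-8 (hypothetical helper)."""
    text = Path(src).read_bytes().decode("iso8859_2")
    if fix_set_line:
        # Assumption: the encoding declaration is the first line of the file.
        first, sep, rest = text.partition("\n")
        if first.strip() == "SET ISO8859-2":
            first = first.replace("ISO8859-2", "UTF-8")
        text = first + sep + rest
    Path(dst).write_bytes(text.encode("utf-8"))


# Usage with the file names from the steps above:
# convert_to_utf8("sl_SI.dic", "slovenian-utf8.dic")
# convert_to_utf8("sl_SI.aff", "slovenian-utf8.aff", fix_set_line=True)
```

Working on bytes avoids any newline translation, so apart from the encoding and the SET line the files should stay unchanged.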

8. Generating Slovenian word list:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ unmunch slovenian-utf8.dic slovenian-utf8.aff > slovenian-wordlist-utf8

This time I see no encoding errors. Fine.

9. Adding <lexicon> tags to the file slovenian-wordlist-utf8 (<lexicon> at the beginning of the file; </lexicon> at the end of the file).

10. Creating .xml file with Slovenian words:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ saxonb-xslt -ext:on -s:slovenian-wordlist-utf8 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:slovenian.xml

The file slovenian.xml has a size of 72 MB (1.4 million words).
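
The style-sheet create-xml-file.xsl is not reproduced here. As a rough idea of what such a step does, here is a Python sketch that turns a plain word list into a lexicon with one <lexeme> and <grapheme> per word. The element names follow the W3C PLS format; that the style-sheet produces exactly this layout is an assumption, and the function name is made up for this example:

```python
from xml.sax.saxutils import escape


def wordlist_to_lexicon(words):
    """Build a lexicon XML string with one <lexeme> per non-empty word."""
    parts = ["<lexicon>"]
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip blank lines in the word list
        # escape() protects characters such as & and < inside a word
        parts.append(f"  <lexeme><grapheme>{escape(word)}</grapheme></lexeme>")
    parts.append("</lexicon>")
    return "\n".join(parts)
```

For a file as large as slovenian-wordlist-utf8, the lines would better be written out one by one instead of being collected in memory first.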

11. Generating the first draft of the Slovenian PLS dictionary:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ saxonb-xslt -ext:on -s:slovenian.xml -xsl:'http://spirit.blau.in/simon/files/2010/04/improve-estonian-dictionary.xsl' -o:slovenian-dictionary.xml

12. I should implement some grapheme-to-phoneme rules based on the Wikipedia description of Slovenian. This means that I have to adapt the style-sheet improve-estonian-dictionary.xsl. I added a few lines to the resulting style-sheet improve-slovenian-dictionary.xsl:

[Image: the lines added to improve-slovenian-dictionary.xsl, marked (1) and (2)]

(1) I convert the value of each <grapheme> element from upper case to lower case. This value is used later as the content of the <phoneme> element.
(2) Some Slovenian letters are replaced by specific IPA symbols. Most Slovenian letters already match their IPA symbol, so no transformation is necessary for them. This makes the approach very efficient. Of course, the resulting PLS dictionary is only a first draft and needs further refinement by a native speaker.
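
The idea behind (1) and (2) can be sketched in a few lines of Python. The rule table below is only an illustrative subset and an assumption; it is not copied from improve-slovenian-dictionary.xsl, and the names IPA_RULES and grapheme_to_phoneme are made up for this example:

```python
# Assumption: an illustrative subset of letter-to-IPA rules,
# not the actual rules of improve-slovenian-dictionary.xsl.
IPA_RULES = {
    "c": "ts",
    "č": "tʃ",
    "š": "ʃ",
    "ž": "ʒ",
    "h": "x",
}


def grapheme_to_phoneme(grapheme: str) -> str:
    """Lower-case the grapheme, then replace the letters that differ from IPA."""
    # Letters without a rule are copied unchanged.
    return "".join(IPA_RULES.get(ch, ch) for ch in grapheme.lower())
```

With this table, "Zule" becomes "zule" and "šiponovi" becomes "ʃiponovi". A letter-by-letter mapping like this ignores stress, vowel quality and context-dependent pronunciation, which is one reason why the result can only be a first draft.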

13. Because the dictionary is pretty big, I have to cut it into two files (slovenian-part-1.xml and slovenian-part-2.xml) so that it can be processed with saxonb-xslt. Otherwise, I get an error message (Java heap space). Here is the command for the first part:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ saxonb-xslt -ext:on -s:slovenian-part-1.xml -xsl:'improve-slovenian-dictionary.xsl' -o:slovenian-pls-part-1.xml

14. Now I generate the second part of the PLS dictionary:

am3msi@am3msi-desktop:~/Documents/201004/slovenian-dictionary$ saxonb-xslt -ext:on -s:slovenian-part-2.xml -xsl:'improve-slovenian-dictionary.xsl' -o:slovenian-pls-part-2.xml

15. With gedit, I join the two parts of the PLS dictionary by copy and paste. You can now download Ralf's Slovenian dictionary. Importing it into Simon is possible, but it is pretty slow because the dictionary is very big (in my opinion too big). On my computer, the import of the dictionary took about two minutes.
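
Cutting the dictionary into parts and joining the results again could also be scripted instead of done by hand. Here is a minimal Python sketch using the standard xml.etree.ElementTree module; the function names are made up for this example, and it assumes the lexicon fits into memory on the Python side:

```python
import xml.etree.ElementTree as ET


def split_lexicon(root, parts=2):
    """Split the children of a <lexicon> element into `parts` new lexicons."""
    children = list(root)
    size = -(-len(children) // parts)  # ceiling division
    chunks = []
    for i in range(parts):
        chunk = ET.Element(root.tag, root.attrib)
        chunk.extend(children[i * size:(i + 1) * size])
        chunks.append(chunk)
    return chunks


def join_lexicons(chunks):
    """Concatenate the children of several lexicons into one lexicon."""
    joined = ET.Element(chunks[0].tag, chunks[0].attrib)
    for chunk in chunks:
        joined.extend(list(chunk))
    return joined


# Usage with the file names from the steps above:
# root = ET.parse("slovenian.xml").getroot()
# for n, chunk in enumerate(split_lexicon(root), start=1):
#     ET.ElementTree(chunk).write(f"slovenian-part-{n}.xml", encoding="utf-8")
```

Each part keeps the root element and its attributes, so every file is a well-formed lexicon of its own and can be fed to saxonb-xslt separately.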