Posts Tagged ‘nso’

Ralf’s Northern Sotho speech model

Thursday, May 17th, 2012

Some words about the creation of this speech model:

1. Download Ralf’s Northern Sotho dictionary.
2. Create a Simon scenario with the name NorthernSotho.
3. Import the dictionary as a shadow dictionary.
4. Train five words.
5. Add the word “Unknown” as grammar. Add the dictation plugin.
6. Actions > Synchronize. Actions > Activate. Here are my recognition results:

dikati ditofo Guest kobokela Guest Guest

Simon recognized Guest instead of Lenhard. Never mind.

7. Download Ralf’s Northern Sotho speech model.

Ralf’s Northern Sotho dictionary

Saturday, May 22nd, 2010

How I created Ralf’s Northern Sotho dictionary:

1. Get the spelling dictionary. Its license is LGPL. I will convert it from LGPL to GPL:

“One feature of the LGPL is that one can convert any LGPLed piece of software into a GPLed piece of software (section 3 of the license). This feature is useful for direct reuse of LGPLed code in GPLed libraries and applications, or if one wants to create a version of the code that cannot be used in proprietary software products.”

2. There is no ISO 639-1 language code. I will use ISO 639-2 nso instead.

3. Off-topic: I am now editing espeak2ipa.xsl for all eSpeak languages. Take a look at this PDF. I am replacing the following phonemes:

replace($espeak2ipa, 'r', 'ɹ') (except German de)
replace($espeak2ipa, 'A', 'ɑ')
replace($espeak2ipa, 'B', 'β')
replace($espeak2ipa, 'H', 'ħ')
replace($espeak2ipa, 'J', 'ɟ')
replace($espeak2ipa, 'L', 'ɫ')
replace($espeak2ipa, 'Q', 'ɣ')
replace($espeak2ipa, 'R', 'ɚ')
replace($espeak2ipa, 'T', 'θ')
replace($espeak2ipa, 'V', 'ʌ')
replace($espeak2ipa, 'X', 'χ')
replace($espeak2ipa, '\?', 'ʔ')
replace($espeak2ipa, '&', 'æ')
replace($espeak2ipa, '\*', 'ɾ')
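
The replacement list above can also be read as one character-for-character table. Here is a minimal Python sketch of that idea. It is not the actual espeak2ipa.xsl: the names `ESPEAK_TO_IPA` and `espeak_to_ipa` and the `language` parameter are my own assumptions for illustration, and only the fourteen replacements listed above are covered. A single pass gives the same result as the chained replace() calls because none of the IPA symbols is itself one of the characters being replaced.

```python
# Single-character eSpeak mnemonics and their IPA replacements,
# copied from the replace() list above.
ESPEAK_TO_IPA = {
    "r": "ɹ",
    "A": "ɑ",
    "B": "β",
    "H": "ħ",
    "J": "ɟ",
    "L": "ɫ",
    "Q": "ɣ",
    "R": "ɚ",
    "T": "θ",
    "V": "ʌ",
    "X": "χ",
    "?": "ʔ",
    "&": "æ",
    "*": "ɾ",
}


def espeak_to_ipa(phonemes: str, language: str = "en") -> str:
    """Replace the listed eSpeak mnemonics with IPA symbols in one pass."""
    table = dict(ESPEAK_TO_IPA)
    if language == "de":
        # The list above excludes German from the r -> ɹ replacement.
        del table["r"]
    return phonemes.translate(str.maketrans(table))


print(espeak_to_ipa("T&t"))                # θæt
print(espeak_to_ipa("rA", language="de"))  # rɑ
```

Note that no escaping is needed here for ? and *, because str.translate works on plain characters and not on regular expressions.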

4. Convert the encoding of the .dic file in the Ubuntu terminal:

ubuntu@ubuntu-desktop:~$ iconv -f ISO8859-15 -t UTF-8 < /home/ubuntu/Documents/201005/northern-sotho/ns_ZA.dic > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic

Yes, thanks to the conversion there are no garbage characters any more. Should I convert the .aff file, too? Yes, I will try that.

5. Convert .aff file:

ubuntu@ubuntu-desktop:~$ iconv -f ISO8859-15 -t UTF-8 < /home/ubuntu/Documents/201005/northern-sotho/ns_ZA.aff > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.aff

6. Change the line in the file northern-sotho-utf8.aff that contains the information SET ISO8859-15 to SET UTF-8.
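
For anyone without iconv at hand, steps 4 to 6 could be sketched in Python roughly like this. The function name `convert_to_utf8` and its `update_set_line` parameter are hypothetical names of mine, not part of any tool. The sketch assumes the encoding declaration is a line starting with SET, as in step 6.

```python
import re
from pathlib import Path


def convert_to_utf8(source: Path, target: Path,
                    source_encoding: str = "iso8859-15",
                    update_set_line: bool = False) -> None:
    """Re-encode a Hunspell .dic or .aff file as UTF-8."""
    # Decode the raw bytes with the old encoding (what iconv -f does).
    text = source.read_bytes().decode(source_encoding)
    if update_set_line:
        # Only wanted for the .aff file: make the SET declaration
        # match the new encoding (step 6).
        text = re.sub(r"^SET\s+\S+", "SET UTF-8", text, flags=re.MULTILINE)
    # Encode again as UTF-8 (what iconv -t does).
    target.write_bytes(text.encode("utf-8"))
```

A call for the .aff file would then look like `convert_to_utf8(Path("ns_ZA.aff"), Path("northern-sotho-utf8.aff"), update_set_line=True)`, with the paths adjusted to your own directory.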

7. Try to generate a Northern Sotho word list:

ubuntu@ubuntu-desktop:~$ unmunch /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.aff > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho

The command was “successful”, but the resulting document contains as many words as the source documents. So I didn’t gain any words by using the unmunch command. At least you can learn from me how to work with OpenOffice.org spelling dictionaries for PLS dictionary generation.
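
To check whether unmunch really added anything, the two files can be compared. This is only a rough sketch under my own assumptions: the function names are hypothetical, both files are taken to be UTF-8, the first line of the .dic file is taken to be the entry count, and affix flags are taken to follow a slash after the word.

```python
from pathlib import Path


def dic_stems(path: Path) -> set:
    """Return the entries of a Hunspell .dic file without affix flags."""
    lines = path.read_text(encoding="utf-8").splitlines()
    # The first line of a .dic file holds the entry count, not a word.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    return {line.split("/", 1)[0].strip() for line in lines if line.strip()}


def word_list(path: Path) -> set:
    """Return the words of a plain one-word-per-line list."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def gained_words(dic_path: Path, unmunched_path: Path) -> set:
    """Words that appear in the unmunch output but not in the .dic file."""
    return word_list(unmunched_path) - dic_stems(dic_path)
```

If `gained_words` returns an empty set, the affix file did not expand any entry into new word forms.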

8. Add <lexicon> at the beginning of northern-sotho-utf8.dic. Add </lexicon> at the end.
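
Step 8 can be done in any text editor. If you prefer a script, a sketch could look like the following. The function name `wrap_in_lexicon` is hypothetical, and escaping characters such as & is an extra precaution of mine so that the XSLT processor receives well-formed XML. It is not something I did by hand.

```python
from xml.sax.saxutils import escape


def wrap_in_lexicon(dic_text: str) -> str:
    """Wrap the content of a .dic file in a <lexicon> root element."""
    body = escape(dic_text)  # turns &, < and > into XML entities
    if not body.endswith("\n"):
        # Keep the closing tag on a line of its own.
        body += "\n"
    return "<lexicon>\n" + body + "</lexicon>\n"
```

The returned string can be written back to northern-sotho-utf8.dic with UTF-8 encoding before running the next step.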

9. Generate the .xml document:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho.xml

10. Maybe I should use the Sesotho phonology for grapheme to phoneme conversion.

11. Generate <phoneme> elements:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho.xml -xsl:'/home/ubuntu/Documents/201005/northern-sotho/improve-northern-sotho.xsl' -o:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho-dictionary.xml

12. Download Ralf’s Northern Sotho dictionary, and import it into Simon.