How I create Ralf's Northern Sotho dictionary:
1. Get the spelling dictionary. Its license is the LGPL. I will convert it from LGPL to GPL:
“One feature of the LGPL is that one can convert any LGPLed piece of software into a GPLed piece of software (section 3 of the license). This feature is useful for direct reuse of LGPLed code in GPLed libraries and applications, or if one wants to create a version of the code that cannot be used in proprietary software products.”
2. There is no ISO 639-1 language code for Northern Sotho. I will use the ISO 639-2 code nso instead.
3. Off-topic: I am now editing espeak2ipa.xsl for all espeak languages. Take a look at this PDF. I am replacing the following phonemes:
replace($espeak2ipa, 'r', 'ɹ') (except German de)
replace($espeak2ipa, 'A', 'ɑ')
replace($espeak2ipa, 'B', 'β')
replace($espeak2ipa, 'H', 'ħ')
replace($espeak2ipa, 'J', 'ɟ')
replace($espeak2ipa, 'L', 'ɫ')
replace($espeak2ipa, 'Q', 'ɣ')
replace($espeak2ipa, 'R', 'ɚ')
replace($espeak2ipa, 'T', 'θ')
replace($espeak2ipa, 'V', 'ʌ')
replace($espeak2ipa, 'X', 'χ')
replace($espeak2ipa, '\?', 'ʔ')
replace($espeak2ipa, '&', 'æ')
replace($espeak2ipa, '\*', 'ɾ')
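To try the replacement list quickly outside the stylesheet, here is a minimal sketch that applies the same character substitutions with sed. It is not espeak2ipa.xsl itself: the function name espeak2ipa, its language-code argument, and the input strings are my own assumptions for illustration. Note that ? and * are written as [?] and [*] so that sed treats them as literal characters.

```shell
#!/bin/sh
# Sketch only: mirrors the replace() list with sed; this is not espeak2ipa.xsl.
# espeak2ipa is a hypothetical helper name; $1 is the language code.
espeak2ipa() {
    # The r rule is skipped for German (de), as noted in the list.
    if [ "$1" = "de" ]; then
        cat
    else
        sed 's/r/ɹ/g'
    fi | sed \
        -e 's/A/ɑ/g' \
        -e 's/B/β/g' \
        -e 's/H/ħ/g' \
        -e 's/J/ɟ/g' \
        -e 's/L/ɫ/g' \
        -e 's/Q/ɣ/g' \
        -e 's/R/ɚ/g' \
        -e 's/T/θ/g' \
        -e 's/V/ʌ/g' \
        -e 's/X/χ/g' \
        -e 's/[?]/ʔ/g' \
        -e 's/&/æ/g' \
        -e 's/[*]/ɾ/g'
}

# Made-up input strings, chosen only to exercise the rules.
printf '%s\n' 'T&r?' | espeak2ipa en
printf '%s\n' 'r*X' | espeak2ipa de
```

The first call prints θæɹʔ and the second prints rɾχ, because the r rule is left out for de.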
4. Convert encoding of .dic file via Ubuntu terminal:
ubuntu@ubuntu-desktop:~$ iconv -f ISO8859-15 -t UTF-8 < /home/ubuntu/Documents/201005/northern-sotho/ns_ZA.dic > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic
Thanks to the conversion, there are no garbage characters any more. Should I convert the .aff file, too? Yes, I will try that.
5. Convert .aff file:
ubuntu@ubuntu-desktop:~$ iconv -f ISO8859-15 -t UTF-8 < /home/ubuntu/Documents/201005/northern-sotho/ns_ZA.aff > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.aff
6. In the file northern-sotho-utf8.aff, change the line SET ISO8859-15 to SET UTF-8.
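The SET line can also be changed from the terminal instead of a text editor. This is a minimal sketch on a throwaway sample file, not on the real ns_ZA.aff. The file name, the temporary directory, and the TRY line are assumptions for illustration.

```shell
#!/bin/sh
# Sketch on sample data; sample-utf8.aff is a placeholder, not the real affix file.
workdir=$(mktemp -d)
aff="$workdir/sample-utf8.aff"

# Hypothetical two-line affix file that still declares the old encoding.
printf 'SET ISO8859-15\nTRY aeoltsbmg\n' > "$aff"

# Rewrite only the line that starts with the old declaration,
# writing to a temporary file first and then replacing the original.
sed 's/^SET ISO8859-15/SET UTF-8/' "$aff" > "$aff.tmp"
mv "$aff.tmp" "$aff"

# Show the declaration that is now in the file.
set_line=$(grep '^SET ' "$aff")
echo "$set_line"
```

Going through a temporary file avoids relying on the in-place option of sed, which differs between implementations.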
7. Try to generate the Northern Sotho word list:
ubuntu@ubuntu-desktop:~$ unmunch /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.aff > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho
The command was “successful”, but the resulting document contains as many words as the source document. So I didn’t gain any words by using the unmunch command. At least you can learn from me how to work with OpenOffice.org spelling dictionaries for PLS dictionary generation.
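One way to check such a result is to compare line counts and to look for affix flags in the .dic file. In this format the first line holds the entry count, and flags follow a slash after the word. If no entry carries flags, unmunch has nothing to expand. The sketch below uses made-up sample files as stand-ins for the real dictionary and the real unmunch output. I have not verified that this is the explanation for ns_ZA.dic.

```shell
#!/bin/sh
# Sketch on sample data; both files are stand-ins, not the real dictionary.
workdir=$(mktemp -d)
dic="$workdir/sample.dic"
list="$workdir/sample-wordlist"

# The first line of a .dic file is the entry count; the words follow.
printf '3\nbatho\nmotho\ntseba\n' > "$dic"
# Stand-in for what unmunch wrote in this hypothetical case.
printf 'batho\nmotho\ntseba\n' > "$list"

# Count dictionary entries (skipping the count line) and word-list lines.
dic_entries=$(tail -n +2 "$dic" | wc -l | tr -d ' ')
list_words=$(wc -l < "$list" | tr -d ' ')
# Count entries that carry affix flags (word/FLAGS); grep exits 1 on zero matches.
flagged=$(tail -n +2 "$dic" | grep -c '/' || true)

echo "dic entries: $dic_entries, word list: $list_words, flagged entries: $flagged"
```

With this sample data, both counts are 3 and the number of flagged entries is 0.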
8. Add <lexicon> at the beginning of northern-sotho-utf8.dic. Add </lexicon> at the end.
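The wrapping can also be scripted. This is a minimal sketch on sample data with placeholder file names. It writes to a new file rather than editing the .dic in place. The escaping of & and < is my own addition, in case a word contains one of these characters, because the XSLT processor needs well-formed XML as input.

```shell
#!/bin/sh
# Sketch on sample data; the file names are placeholders.
workdir=$(mktemp -d)
dic="$workdir/sample-utf8.dic"
wrapped="$workdir/sample-wrapped.dic"

# Hypothetical dictionary: count line followed by three words.
printf '3\nbatho\nmotho\ntseba\n' > "$dic"

{
    printf '%s\n' '<lexicon>'
    # Escape & and < so the content stays well-formed XML.
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' "$dic"
    printf '%s\n' '</lexicon>'
} > "$wrapped"

cat "$wrapped"
```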
9. Generate .xml document:
ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho.xml
10. Maybe I should use the Sesotho phonology for the grapheme-to-phoneme conversion.
11. Generate <phoneme> elements:
ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho.xml -xsl:'/home/ubuntu/Documents/201005/northern-sotho/improve-northern-sotho.xsl' -o:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho-dictionary.xml
12. Download Ralf's Northern Sotho dictionary, and import it into simon.