Archive for the ‘dictionary’ Category

Ralf’s Vietnamese dictionary 0.1.1

Monday, May 24th, 2010

Let’s improve Ralf's Vietnamese dictionary:

1. Convert eSpeak phonemes into IPA phonemes:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/vietnamese/dictionaries/vietnamese-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

2. Download Ralf's Vietnamese dictionary, and import it into simon.

Ralf’s Russian dictionary 0.1.1

Monday, May 24th, 2010

How I improve Ralf's Russian dictionary:

1. Version 0.1 contains eSpeak characters.

2. Convert <phoneme> elements via Ubuntu terminal:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/russian/russian-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

3. Download Ralf's Russian dictionary with 146263 words, and import it into simon.

Ralf’s Norwegian Bokmål dictionary 0.1.1

Sunday, May 23rd, 2010

How I improve Ralf's Norwegian Bokmål dictionary:

1. Version 0.1 contains eSpeak characters.

2. Language code is no. Or should I use nb (Bokmål) instead? I think that I will use no because it is easier at the moment: espeak2ipa.xsl doesn’t contain a specific entry for nb; only no is supported.

3. Convert <phoneme> elements:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/norwegian/norwegian-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

4. Download Ralf's Norwegian Bokmål dictionary with 322043 words, and import it into simon.

Ralf’s Swedish dictionary 0.1.1

Sunday, May 23rd, 2010

How I improve Ralf's Swedish dictionary:

1. Version 0.1 contains eSpeak phonemes. They should be converted into IPA phonemes.

2. Language code is sv.

3. Convert eSpeak phonemes into IPA phonemes:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/swedish/dictionaries/swedish-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

4. Download Ralf's Swedish dictionary 0.1.1, and import it into simon. The PLS dictionary contains 398964 words.

Ralf’s Northern Sotho dictionary

Saturday, May 22nd, 2010

How I create Ralf's Northern Sotho dictionary:

1. Get spelling dictionary. License is LGPL. I will do a conversion from LGPL to GPL:

“One feature of the LGPL is that one can convert any LGPLed piece of software into a GPLed piece of software (section 3 of the license). This feature is useful for direct reuse of LGPLed code in GPLed libraries and applications, or if one wants to create a version of the code that cannot be used in proprietary software products.”

2. There is no ISO 639-1 language code. I will use ISO 639-2 nso instead.

3. Off-topic: I am editing espeak2ipa.xsl now for all espeak languages. Take a look into this PDF. Replacing the following phonemes:

replace($espeak2ipa, 'r', 'ɹ') (except German de)
replace($espeak2ipa, 'A', 'ɑ')
replace($espeak2ipa, 'B', 'β')
replace($espeak2ipa, 'H', 'ħ')
replace($espeak2ipa, 'J', 'ɟ')
replace($espeak2ipa, 'L', 'ɫ')
replace($espeak2ipa, 'Q', 'ɣ')
replace($espeak2ipa, 'R', 'ɚ')
replace($espeak2ipa, 'T', 'θ')
replace($espeak2ipa, 'V', 'ʌ')
replace($espeak2ipa, 'X', 'χ')
replace($espeak2ipa, '\?', 'ʔ')
replace($espeak2ipa, '&', 'æ')
replace($espeak2ipa, '\*', 'ɾ')
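The replacement rules above can be tried out on single strings before they go into the style-sheet. The following is a minimal sketch that uses sed as a stand-in for the XPath replace() calls; it is not espeak2ipa.xsl itself. The function name espeak2ipa_general is made up for illustration, and the exception for German r is not handled.

```shell
# Hypothetical helper: applies the general eSpeak-to-IPA rules listed above
# to every line on standard input. The rules run one after another, in order.
espeak2ipa_general() {
  sed -e 's/r/ɹ/g' \
      -e 's/A/ɑ/g' \
      -e 's/B/β/g' \
      -e 's/H/ħ/g' \
      -e 's/J/ɟ/g' \
      -e 's/L/ɫ/g' \
      -e 's/Q/ɣ/g' \
      -e 's/R/ɚ/g' \
      -e 's/T/θ/g' \
      -e 's/V/ʌ/g' \
      -e 's/X/χ/g' \
      -e 's/[?]/ʔ/g' \
      -e 's/&/æ/g' \
      -e 's/[*]/ɾ/g'
}

# Usage:
#   printf '%s\n' 'T&r?' | espeak2ipa_general
```

The characters ? and * are placed in brackets so that sed treats them literally, which corresponds to the backslash escapes in the XPath patterns.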

4. Convert encoding of .dic file via Ubuntu terminal:

ubuntu@ubuntu-desktop:~$ iconv -f ISO8859-15 -t UTF-8 < /home/ubuntu/Documents/201005/northern-sotho/ns_ZA.dic > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic

Yes, obviously thanks to the conversion there are no garbage characters any more. Should I convert .aff, too? Yes, I will try that.

5. Convert .aff file:

ubuntu@ubuntu-desktop:~$ iconv -f ISO8859-15 -t UTF-8 < /home/ubuntu/Documents/201005/northern-sotho/ns_ZA.aff > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.aff

6. Change the line in the file northern-sotho-utf8.aff that contains the information SET ISO8859-15 to SET UTF-8.
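Step 6 can also be done without a text editor. A minimal sketch, assuming the .aff file contains a line of the form SET ISO8859-15; the function name aff_set_utf8 is made up for illustration:

```shell
# Hypothetical helper: prints the .aff file given as $1 with its SET line
# switched to UTF-8. All other lines pass through unchanged.
aff_set_utf8() {
  sed 's/^SET[[:space:]][[:space:]]*ISO8859-15[[:space:]]*$/SET UTF-8/' "$1"
}

# Usage (write to a new file, then replace the old one by hand):
#   aff_set_utf8 northern-sotho-utf8.aff > northern-sotho-utf8.aff.new
```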

7. Try to generate Northern Sotho word list:

ubuntu@ubuntu-desktop:~$ unmunch /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic /home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.aff > /home/ubuntu/Documents/201005/northern-sotho/northern-sotho

The command was “successful”. The resulting document contains as many words as the source document, so I didn’t gain any words by using the unmunch command. At least you can learn from me how to work with OpenOffice.org spelling dictionaries for PLS dictionary generation.
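Whether unmunch added any words can be checked by counting entries. A minimal sketch, assuming the usual Hunspell/MySpell layout in which the first line of the .dic file is a word-count header; the function name compare_word_counts is made up for illustration:

```shell
# Hypothetical helper: prints "<entries in .dic> <words from unmunch>".
# $1 is the .dic file, $2 is the unmunch output. Empty lines are ignored,
# and the first line of the .dic file (the count header) is skipped.
compare_word_counts() {
  dic_count=$(tail -n +2 "$1" | awk 'NF { n++ } END { print n + 0 }')
  out_count=$(awk 'NF { n++ } END { print n + 0 }' "$2")
  printf '%s %s\n' "$dic_count" "$out_count"
}

# Usage:
#   compare_word_counts northern-sotho-utf8.dic northern-sotho
```

If both numbers are equal, the affix rules did not expand any entry.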

8. Add <lexicon> at the beginning of northern-sotho-utf8.dic. Add </lexicon> at the end.
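Adding the two tags can be scripted as well. A minimal sketch; the function name wrap_lexicon is made up for illustration, and the result is written to standard output instead of changing the file in place:

```shell
# Hypothetical helper: prints the file given as $1 wrapped in a
# <lexicon> ... </lexicon> element.
wrap_lexicon() {
  printf '<lexicon>\n'
  cat "$1"
  printf '</lexicon>\n'
}

# Usage:
#   wrap_lexicon northern-sotho-utf8.dic > wrapped.dic
```

The output name wrapped.dic is only a placeholder. The same helper would fit the other word lists that get the <lexicon> tags by hand.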

9. Generate .xml document:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho-utf8.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho.xml

10. Maybe I should use the Sesotho phonology for grapheme to phoneme conversion.

11. Generate <phoneme> elements:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho.xml -xsl:'/home/ubuntu/Documents/201005/northern-sotho/improve-northern-sotho.xsl' -o:/home/ubuntu/Documents/201005/northern-sotho/northern-sotho-dictionary.xml

12. Download Ralf's Northern Sotho dictionary, and import it into simon.

Ralf’s Swahili dictionary 0.1.1

Tuesday, May 18th, 2010

How I improve Ralf's Swahili dictionary:

1. Version 0.1 contains espeak phonemes.

2. Language code is sw. Edit espeak2ipa.xsl. The section that is relevant to the Swahili language begins with matches(/lexicon/@xml:lang, 'sw'). Take a look at Swahili sounds.
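Which section of espeak2ipa.xsl applies depends on the xml:lang attribute of the <lexicon> element. A minimal sketch for reading that attribute from a dictionary file, assuming the opening <lexicon> tag sits on one line and uses double quotes; the function name lexicon_lang is made up for illustration:

```shell
# Hypothetical helper: prints the xml:lang value of the first <lexicon>
# start tag found in the file given as $1.
lexicon_lang() {
  sed -n 's/.*<lexicon[^>]*xml:lang="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage:
#   lexicon_lang swahili-dictionary.xml
```

Note that the XPath function matches() treats its second argument as a regular expression and succeeds if the pattern occurs anywhere in the string, so a pattern like 'sw' would presumably also match a longer language tag that merely contains sw.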

3. Convert espeak phonemes into IPA phonemes via Ubuntu terminal:

$ cat /media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/swahili/swahili-dictionary.xml.bz2 | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

4. Download Ralf's Swahili dictionary (version 0.1.1), and import it into simon.

Ralf’s Tamil dictionary 0.1.1

Tuesday, May 18th, 2010

How I improve Ralf's Tamil dictionary:

1. Version 0.1 contains espeak phonemes.

2. Language code is ta. Edit espeak2ipa.xsl. The section that is relevant to the Tamil language begins with matches(/lexicon/@xml:lang, 'ta').

3. Convert espeak phonemes into IPA phonemes:

$ cat /media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/tamil/tamil-dictionary.xml.bz2 | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

4. Download Ralf's Tamil dictionary, and import it into simon.

Ralf’s Polish dictionary 0.1.1

Tuesday, May 18th, 2010

How I improve Ralf's Polish dictionary:

1. Language code is pl. Edit espeak2ipa.xsl. The section that is relevant to the Polish language begins with matches(/lexicon/@xml:lang, 'pl').

2. Transform espeak phonemes into IPA phonemes via Ubuntu terminal:

$ cat /media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/polish/polish-dictionary.xml.bz2 | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

3. Download Ralf's Polish dictionary, and import it into simon.

4. Unfortunately, a lot of <grapheme> elements in Ralf's Polish dictionary contain garbage characters.

Ralf’s Malayalam dictionary

Tuesday, May 18th, 2010

How I create Ralf's Malayalam dictionary:

1. Get spelling dictionary. License is GPLv2.

2. Language code (ISO 639-1) is ml.

3. The file ml_IN.dic is UTF-8 encoded. I don’t need ml_IN.aff because it contains no additional information.

4. Add <lexicon> at the beginning of ml_IN.dic. Add </lexicon> at the end of ml_IN.dic.

5. Generate .xml file:

$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/ml/ml_IN.dic' -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:'/home/ubuntu/Documents/201005/ml/malayalam.xml'

6. I will use these tables for grapheme to phoneme conversion.

7. Generate <phoneme> elements:

$ saxonb-xslt -o:'/home/ubuntu/Documents/201005/ml/malayalam-dictionary.xml' -xsl:'/home/ubuntu/Documents/201005/ml/improve-malayalam-dictionary.xsl' -s:'/home/ubuntu/Documents/201005/ml/malayalam.xml'

8. Download Ralf's Malayalam dictionary, and import it into simon.

Ralf’s Latvian dictionary 0.1.1

Monday, May 17th, 2010

How I improve Ralf's Latvian dictionary:

1. Take a look at the Latvian pronunciation.

2. Language code is lv. Edit espeak2ipa.xsl. The section that is relevant to the Latvian language begins with matches(/lexicon/@xml:lang, 'lv').

3. Convert eSpeak phonemes into IPA phonemes:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/latvian/latvian-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

4. Download Ralf's Latvian dictionary, and import it into simon. Take a look at the imported PLS dictionary:

The word column offers 154740 Latvian words. The pronunciation column contains the corresponding SAMPA transcriptions.

5. Is there a native speaker who wants to improve Ralf's Latvian dictionary? Just do it, the license of the dictionary is GPLv3.

Ralf’s Kurdish dictionary 0.1.1

Monday, May 17th, 2010

How I improve Ralf's Kurdish dictionary:

1. Ubuntu terminal:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/kurdish/kurdish-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

2. Language code is ku.

3. Download Ralf's Kurdish dictionary version 0.1.1, and import it into simon. Take a look at the shadow dictionary:

The word column offers 15108 Kurdish words. The pronunciation column contains the corresponding SAMPA phonemes.

4. Is there a native speaker who wants to improve Ralf's Kurdish dictionary?

Ralf’s Hindi dictionary 0.1.1

Monday, May 17th, 2010

How I improve Ralf's Hindi dictionary:

1. Language code is hi.

2. Hindi uses Devanagari script. Ralf’s Maithili dictionary and Ralf’s Nepali dictionary use Devanagari script, too. But the Devanagari script is at the moment not my concern. I just want to convert the eSpeak phonemes into IPA phonemes.

3. Adjust espeak2ipa.xsl. Take a look at the Hindustani orthography. I want to know which IPA phonemes I should use. Ralf's Hindi dictionary is just a first draft.

4. Convert <phoneme> elements:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/hindi/hindi-dictionary.xml' | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

A native speaker should take a closer look at espeak2ipa.xsl (the section that is relevant to the Hindi language begins with matches(/lexicon/@xml:lang, 'hi')). I am just guessing the Hindi IPA phonemes.

5. Download Ralf's Hindi dictionary, and import it into simon. The SAMPA transcriptions are not usable for training.

6. Is there a native speaker who wants to improve Ralf's Hindi dictionary?

Ralf’s Italian dictionary 0.1.2

Monday, May 17th, 2010

How I create Ralf's Italian dictionary version 0.1.2:

1. Make some adjustments to espeak2ipa.xsl (Ralf's eSpeak2IPA style-sheet).

2. Transform the eSpeak phonemes (Ralf's Italian dictionary version 0.1.1 contains espeak phonemes) into IPA phonemes via the Ubuntu terminal:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/italian/it_IT/italian-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

Some explanations: The cat command outputs the compressed content of Ralf's Italian dictionary 0.1.1. The special character "|" (pipe) feeds the output of the cat command into the bunzip2 command as input. The output of bunzip2 (the decompressed XML) is then used as input for saxonb-xslt, which reads it from standard input because of -s:-.

3. Download Ralf's Italian dictionary 0.1.2, and import it into simon. Take a look at the shadow dictionary:

The word column contains 95192 Italian words. The pronunciation column contains the corresponding SAMPA transcriptions.

4. A native speaker could improve Ralf's Italian dictionary.

Ralf’s Armenian dictionary

Sunday, May 16th, 2010

How I create Ralf's Armenian dictionary:

1. Get spelling dictionary. License is GPL.

2. The encoding of hy_AM.dic and hy_AM.aff is UTF-8.

3. Try to generate Armenian word list:

$ unmunch '/home/ubuntu/Documents/201005/armenian-0.1/hy_AM.dic' '/home/ubuntu/Documents/201005/armenian-0.1/hy_AM.aff' > '/home/ubuntu/Documents/201005/armenian-0.1/armenian'

Unfortunately, no Armenian words were generated. This means that I can only use hy_AM.dic as source.

4. Add <lexicon> at the beginning of hy_AM.dic. Add </lexicon> at the end of hy_AM.dic.

5. Generate .xml file:

$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/armenian-0.1/hy_AM.dic' -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:'/home/ubuntu/Documents/201005/armenian-0.1/armenian.xml'

6. Remove the code after the slash:

$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/armenian-0.1/armenian.xml' -xsl:'http://spirit.blau.in/simon/files/2010/05/substring-before-slash.xsl' -o:'/home/ubuntu/Documents/201005/armenian-0.1/armenian-ssml.xml'

The result is an SSML file that I will use with eSpeak.
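On a plain-text word list, the part after the slash (the affix flags of a spelling dictionary entry) can also be cut off without XSLT. A minimal sketch that is a stand-in for, not a copy of, substring-before-slash.xsl; it assumes one entry per line, and the function name strip_affix_flags is made up for illustration:

```shell
# Hypothetical helper: prints each line of the file given as $1 up to
# (not including) the first "/". Lines without a slash are printed as-is.
strip_affix_flags() {
  cut -d/ -f1 "$1"
}

# Usage:
#   strip_affix_flags hy_AM.dic
```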

7. Generate phonemes with eSpeak:

$ espeak -f '/home/ubuntu/Documents/201005/armenian-0.1/armenian-ssml.xml' -m -v hy -q -x --phonout="/home/ubuntu/Documents/201005/armenian-0.1/armenian-espeak"

8. Open the file armenian-espeak with Geany. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) by the sequence "</phoneme></lexeme>\n<phoneme>" (Use escape sequences). Add <lexicon> at the beginning of the file armenian-espeak. Add </lexicon> at the end of the file.

9. Combine <grapheme> with <phoneme> elements:

$ paste '/home/ubuntu/Documents/201005/armenian-0.1/armenian-ssml.xml' "/home/ubuntu/Documents/201005/armenian-0.1/armenian-espeak" > '/home/ubuntu/Documents/201005/armenian-0.1/armenian-pls.xml'
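The paste command joins line N of the first file with line N of the second file, separated by a tab. The combination is therefore only correct if both files have the same number of lines in the same order. A minimal sketch that checks this before pasting; the function name paste_checked is made up for illustration:

```shell
# Hypothetical helper: pastes $1 and $2 side by side, but refuses to do so
# if the two files do not have the same number of lines.
paste_checked() {
  left=$(($(wc -l < "$1")))
  right=$(($(wc -l < "$2")))
  if [ "$left" -ne "$right" ]; then
    echo "line counts differ: $left vs $right" >&2
    return 1
  fi
  paste "$1" "$2"
}

# Usage:
#   paste_checked armenian-ssml.xml armenian-espeak > armenian-pls.xml
```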

10. Transform eSpeak phonemes into IPA phonemes:

$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/armenian-0.1/armenian-pls.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/armenian-0.1/armenian-dictionary.xml'

11. Make adjustments to ralfs-ipa-stylesheet.xsl. I want to remove the single quote with the style-sheet. This should be doable if I take a look at the table of ASCII printable characters. Now I know the trick: select='replace($espeak2ipa, "&apos;", "")'. Why didn’t I understand this before? At least now I know how I can remove stress marks from the <phoneme> elements.
Some additional remarks about stress information: My theory is that we don’t need primary/secondary stress information for Ralf’s PLS dictionaries at the moment because they should be used for ASR (and not for TTS). This means a PLS dictionary
- doesn’t need stress information if it is developed for ASR (my goal);
- needs stress information if it is developed for TTS (not my goal).
This means that I can remove the stress information that has been generated by eSpeak. Sorry, I only develop for ASR, not for TTS. I remove information that is unnecessary for ASR.
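Removing the stress marks can be tried on single phoneme strings. A minimal sketch that uses sed as a stand-in for the style-sheet rule, assuming that eSpeak marks primary stress with a single quote and secondary stress with a comma; the function name strip_stress is made up for illustration:

```shell
# Hypothetical helper: removes the eSpeak stress marks ' and , from every
# line on standard input. Feed it phoneme strings only, not the whole XML
# file, because graphemes may contain apostrophes or commas of their own.
strip_stress() {
  sed "s/[',]//g"
}

# Usage:
#   printf '%s\n' "n,ikol'ae" | strip_stress
```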

12. Download Ralf's Armenian dictionary, and import it into simon. Take a look at the shadow dictionary:

The word column offers 63791 Armenian words. The pronunciation column displays the corresponding SAMPA transcription. The Armenian word ֆօտոլաբորատորիայ is transcribed as f o t o l a b o r a t o r i a. That doesn’t look too bad.

13. It would be nice if a native speaker would improve Ralf's Armenian dictionary.

Ralf’s Albanian dictionary

Sunday, May 16th, 2010

How I create Ralf's Albanian dictionary:

1. Get spelling dictionary. License is GPL.

2. The files sq_AL.dic and sq_AL.aff are UTF-8 encoded.

3. Generate Albanian word list:

$ unmunch '/home/ubuntu/Documents/201005/albanian-0.1/sq_AL.dic' '/home/ubuntu/Documents/201005/albanian-0.1/sq_AL.aff' > '/home/ubuntu/Documents/201005/albanian-0.1/albanian'

4. Add <lexicon> at the beginning of albanian. Add </lexicon> at the end of albanian.

5. Replace "\n" by "</audio>\n<audio>" with Geany (Use escape sequences).

6. Generate espeak phonemes:

$ espeak -f '/home/ubuntu/Documents/201005/albanian-0.1/albanian' -m -v sq -q -x --phonout="/home/ubuntu/Documents/201005/albanian-0.1/albanian-espeak"

The ISO-639-1 language code is sq.

7. Take a look at the Albanian alphabet.

8. Open the file albanian-espeak with Geany. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) by the sequence "</phoneme></lexeme>\n<phoneme>" (Use escape sequences). Add <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>).

9. Combine <grapheme> with <phoneme> elements:

$ paste '/home/ubuntu/Documents/201005/albanian-0.1/albanian' '/home/ubuntu/Documents/201005/albanian-0.1/albanian-espeak' > '/home/ubuntu/Documents/201005/albanian-0.1/albanian-pls.xml'

10. Convert eSpeak phonemes into IPA phonemes:

$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/albanian-0.1/albanian-pls.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/albanian-0.1/albanian-dictionary.xml'

11. Download Ralf's Albanian dictionary, and import it into simon. Take a look at the shadow dictionary:

The word column contains 576613 Albanian words. The pronunciation column offers the corresponding SAMPA transcription.

12. A native speaker should improve Ralf's Albanian dictionary.

Ralf’s Romanian dictionary 0.1.1

Sunday, May 16th, 2010

How I create Ralf's Romanian dictionary version 0.1.1:

1. Version 0.1 contains espeak phonemes. They should be converted into IPA phonemes.

2. Take a look at Romanian letters and pronunciation.

3. Off topic: The Romanian Revolution of 1989 was violent. I hope that the next revolution will be peaceful. I support freedom and justice, especially the freedom of speech. Because the German legal system doesn’t guarantee the freedom of speech, I am developing PLS dictionaries for a lot of languages. It would be nice if a native speaker from Romania continued the development of the Romanian PLS dictionary. Let’s defend our freedom of speech with open source ASR software. It would be great if my PLS dictionaries became a part of the upcoming revolution.

4. I can’t find the word Ceauşescu in Ralf's Romanian dictionary. I don’t know why this <grapheme> element is missing. Sorry. At least his forename is in my PLS dictionary:

<lexeme>
<grapheme>Nicolae</grapheme>
<phoneme>n,ikol'ae</phoneme>
</lexeme>

5. Edit the section matches(/lexicon/@xml:lang, 'ro'):

replace($espeak2ipa, 'aU', 'aʊ̯')

This diphthong is also available in Ralf's German dictionary.

6. Take a look into Romanian phonology – diphthongs. Adjusting the replacement rules:

replace($espeak2ipa, 'ea', 'e̯a')
replace($espeak2ipa, 'eI', 'ej')
replace($espeak2ipa, 'eo', 'e̯o')
replace($espeak2ipa, 'eU', 'e̯u')
replace($espeak2ipa, 'iI', 'ij')
replace($espeak2ipa, 'iU', 'ju')
replace($espeak2ipa, 'Oa', 'o̯a')
replace($espeak2ipa, 'uI', 'uj')
replace($espeak2ipa, 'yU', 'ɨw')
replace($espeak2ipa, 'yI', 'ɨj')
replace($espeak2ipa, 'w2', 'wə')

Of course, these are just guesses. At least, you should be able to understand the concept. I use XPath to transform the eSpeak phonemes into IPA phonemes.

7. Obviously, the Romanian language has a lot of diphthongs and triphthongs. I had never heard of triphthongs before.

8. Generate PLS dictionary:

$ saxonb-xslt -s:'/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/romanian/romanian-dictionary.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/romanian-0.1.1/romanian-dictionary.xml'

9. Download Ralf's Romanian dictionary 0.1.1, and import it into simon. Take a look at the shadow dictionary:

The left column offers 155288 Romanian words. The right column contains the corresponding SAMPA phonemes.

10. I hope that a native speaker will improve this PLS dictionary.

Ralf’s Croatian dictionary 0.1.1

Sunday, May 16th, 2010

How I create Ralf's Croatian dictionary version 0.1.1:

1. I want to convert the eSpeak phonemes from version 0.1 into IPA phonemes. The language code for Croatian is hr.

2. Take a look into the phonemes which are used in the *_rules and *_list files. I can’t find the language code hr. I assume that I should use hbs instead. These are the phonemes that I have to transform into IPA code:

Dictionary hbs_dict

& @ @2 a A a: aI aU
E e e: i I i: l- O
o o: oU r* r- u U u:

* ; b d dZ dz dZ; f
g h j k l l^ m n
N n^ p r R R2 s S
t tS ts tS; v x z Z

The language code hbs should do the job, I hope. I will rename hbs to hr in ralfs-ipa-stylesheet.xsl.

3. Ubuntu terminal:

$ saxonb-xslt -s:'/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/croatian/croatian-dictionary.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/croatian-0.1.1/croatian-dictionary.xml'

This was just a first test run. I will have to improve some phoneme replacement rules in ralfs-ipa-stylesheet.xsl.

4. Download Ralf's Croatian dictionary 0.1.1, and import it into simon. Take a look at the shadow dictionary:

The left column contains 215956 Croatian words. The right column contains the corresponding SAMPA phonemes.

5. Is there a native speaker who wants to take a closer look at the <phoneme> elements?

Ralf’s Czech dictionary 0.1.1

Saturday, May 15th, 2010

How I create Ralf's Czech dictionary version 0.1.1:

1. Version 0.1 contains eSpeak phonemes. I want to convert the eSpeak phonemes into IPA phonemes.

2. Ubuntu terminal:

$ saxonb-xslt -s:'/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/czech/czech-dictionary.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/czech-0.1.1/czech-dictionary.xml'

3. Adjust ralfs-ipa-stylesheet.xsl. Czech diphthongs:
replace($espeak2ipa, 'eI', 'eɪ')
replace($espeak2ipa, 'eU', 'eʊ̯')
replace($espeak2ipa, 'oU', 'oʊ̯')
I think that the last phoneme could be included in Ralf's General American dictionary. I will have to check that later.

4. Download Ralf's Czech dictionary 0.1.1, and import it into simon. Take a look into the shadow dictionary:

This PLS dictionary contains 166565 Czech words (left column). The right column contains the corresponding SAMPA pronunciations.

5. A native speaker should take a closer look at the <phoneme> elements in Ralf's Czech dictionary.

Ralf’s Afrikaans dictionary 0.1.1

Saturday, May 15th, 2010

How I create Ralf's Afrikaans dictionary version 0.1.1:

1. The previous version 0.1 contains slightly modified eSpeak phonemes (the '&' had been replaced by '5'). I will now replace the 5 with &amp;.
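The replacement from step 1 has one pitfall when it is done with sed: in the replacement part of a sed substitution, an unescaped & stands for the matched text, so it has to be written as \&. A minimal sketch, assuming the input consists of phoneme strings only (a 5 anywhere else would be replaced as well); the function name restore_ampersand is made up for illustration:

```shell
# Hypothetical helper: turns every 5 on standard input back into the
# XML-escaped ampersand &amp;. The backslash keeps sed from expanding &
# to the matched text.
restore_ampersand() {
  sed 's/5/\&amp;/g'
}

# Usage:
#   printf '%s\n' "kl'Ipv5rkt" | restore_ampersand
```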

2. Preparing the style-sheet:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-phonemes.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/create-ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl'

3. Convert eSpeak phonemes into IPA phonemes:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/afrikaans/afrikaans-dictionary.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/afrikaans-0.1.1/afrikaans-dictionary.xml'

The <phoneme> elements now contain some IPA phonemes. I will have to adjust ralfs-ipa-stylesheet.xsl by adjusting create-ralfs-ipa-stylesheet.xsl.

4. Take a look at the table of Afrikaans characters.

5. Add the XPath expression replace($current-ipa, '&', 'ɛ') to create-ralfs-ipa-stylesheet.xsl. It works: the eSpeak <phoneme>kl'Ipv&amp;rkt,Yyx2</phoneme> element has been converted into <phoneme>kl'ɪpvɛrkt,Yyx2</phoneme>.

6. I have to find out about the <espeakphoneme>Yy</espeakphoneme>. The corresponding IPA phoneme is /œj/. This results in the following XPath expression: replace($current-ipa, 'Yy', 'œj').

7. I am now changing my concept. I won’t edit create-ralfs-ipa-stylesheet.xsl any more. I will only edit ralfs-ipa-stylesheet.xsl directly.

8. An example for the expression replace($espeak2ipa, '&:', 'ɛː') is sê ('say' or 'says') (example is from Wikipedia).

9. Download Ralf's Afrikaans dictionary 0.1.1, import it into simon, and take a look at the result:

The left column contains Afrikaans words. The pronunciation column contains the corresponding SAMPA phonemes. You can see the SAMPA phoneme gls – the glottal stop. I couldn’t find the glottal stop in the Afrikaans table of characters. Maybe this phoneme should be removed from the dictionary.

10. A native speaker (or someone whose native language is Dutch) should take a closer look at ralfs-ipa-stylesheet.xsl. In this style-sheet, there is a section that begins with matches(/lexicon/@xml:lang, 'af'). This is the section that does the transformation from eSpeak phonemes into IPA phonemes for the Afrikaans language. If you improve these eSpeak-to-IPA conversion rules, you can invoke the saxonb-xslt command again (similar to step 3).

Ralf’s Sinhala dictionary

Saturday, May 15th, 2010

How I create Ralf's Sinhala dictionary:

1. Get spelling dictionary. License is GPLv3.

2. The encoding of si-LK.dic and si-LK.aff is UTF-8. Trying to generate word list:

ubuntu@ubuntu-desktop:~$ unmunch '/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/si-LK.dic' '/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/si-LK.aff' > '/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/sinhala'

Unfortunately, the unmunch command didn’t generate a word list. This means that I will have to use only si-LK.dic.

3. Language code is si.

4. Add <lexicon> at the beginning of si-LK.dic. Add </lexicon> at the end of si-LK.dic.

5. Generate <lexeme> and <grapheme> elements:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/si-LK.dic' -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:'/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/sinhala.xml'

6. The Sinhala script is

“an alphabet with another alphabet, due to the presence of two different sets of letters. The core set, known as the śuddha siṃhala (Pure Sinhala, ශුද්ධ සිංහල) or eḷu hōḍiya (Eḷu alphabet එළු හෝඩිය), can represent all native phonemes.”

That doesn’t sound too bad: The core set can represent all native phonemes. I hope that I will be able to develop a style-sheet that translates some letters of the Sinhala script into IPA phonemes. I can use tables for consonants, vowels, prenasalized consonants.

7. Should I use the miśra alphabet, too? Yes, I think that would be a good decision. Wikipedia explains the phonetic details. The miśra alphabet

“adds letters for aspirates, retroflexes and sibilants, which are not phonemic in today’s Sinhala, but which are necessary to represent non-native words, like loanwords from Sanskrit, Pali or English. The use of the extra letters is mainly a question of prestige. From a purely phonemic point of view, there is no benefit in using them”

OK, a native speaker should sort out the details. I will develop a style-sheet that shows the concept of translating Sinhala script (śuddha graphemes and the miśra alphabet) into IPA phonemes.

8. Generate Sinhala <phoneme> elements:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/sinhala.xml' -xsl:'/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/improve-sinhala-dictionary.xsl' -o:'/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/sinhala-dictionary.xml'

This was just a test, and it was successful. I haven’t implemented any Sinhala-specific grapheme-to-phoneme conversion rules yet. I have to do that now.

9. I am trying something new: I open the Wikipedia article on the Sinhala script and press Edit. Then I copy the whole Wikipedia article and save it with gedit as a text file with the name sinhala-grep. Then I run the grep command via the Ubuntu terminal:

ubuntu@ubuntu-desktop:~$ grep 'IPA' '/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/sinhala-grep' > '/home/ubuntu/Documents/201005/sinhala-dictionary/dictionaries/sinhala-grep-object'

You all know that I am a fan of XSLT. But the grep command should do it for this raw operation.

10. I am now done with the style-sheet. This means that I can repeat step 8.

11. Ralf's Sinhala dictionary (version 0.1; 2010-05-15) contains 26707 words. Import it into simon, and take a look at the result:

The left column contains Sinhala words. The pronunciation column contains the corresponding SAMPA phonemes.

12. I am looking for a native speaker who is willing to improve this Sinhala PLS dictionary.