Posts Tagged ‘eSpeak’

Ralf’s Swiss German dictionary

Wednesday, May 12th, 2010

How I create Ralf's Swiss German dictionary:

1. Get spelling dictionary. License is GPL.

2. Language code is de-CH (not mentioned in the Wikipedia, but you know the concept: de-DE for Ralf's German dictionary; de-AT for Ralf's Austrian German dictionary).

3. Ralf's Swiss German dictionary should become a sister project of Ralf's German dictionary. I hope that someone from Switzerland is willing to improve Ralf's Swiss German dictionary.

4. The encoding of de_CH_frami.dic and de_CH_frami.aff is ISO-8859-1. I will have to convert both files into UTF-8.

5. Convert de_CH_frami.dic into UTF-8:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ iconv -f ISO8859-1 -t UTF-8 < de_CH_frami.dic > swiss.dic

6. Convert de_CH_frami.aff into UTF-8:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ iconv -f ISO8859-1 -t UTF-8 < de_CH_frami.aff > swiss.aff

In the file swiss.aff, change the line SET ISO8859-1 to SET UTF-8.
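Steps 4 to 6 can also be sketched in Python instead of iconv plus a manual edit. This is only a minimal sketch: the function names dic_to_utf8 and fix_set_line are made up for illustration, and it assumes that the SET directive sits on a line of its own.

```python
import re


def dic_to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode the raw bytes of a .dic or .aff file as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def fix_set_line(aff_text: str) -> str:
    """Rewrite the first SET directive of an .aff file so it names UTF-8."""
    # (?m) lets ^ match at the start of every line, count=1 touches only the first hit.
    return re.sub(r"(?m)^SET[ \t]+\S+", "SET UTF-8", aff_text, count=1)
```

To use it on real files, read de_CH_frami.aff with open(path, "rb"), pass the bytes through dic_to_utf8, decode the result, apply fix_set_line, and write the text back with encoding="utf-8".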

7. Trying to generate a list with Swiss German words:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ unmunch swiss.dic swiss.aff > swiss-wordlist

Unfortunately, the result is not usable, so I will have to find a different way. I think that I will use swiss.dic as the source. Unfortunately, a lot of nouns in swiss.dic are written in lower case (in German, nouns are always capitalized). Never mind, this has to be fixed later.

8. Add <lexicon> at the beginning of the file swiss.dic. Add </lexicon> at the end of swiss.dic.

9. Create XML file:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:swiss.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:swiss.xml

10. Remove substring after slash:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:swiss.xml -xsl:'substring-before-slash.xsl' -o:swiss-ssml.xml
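What steps 8 to 10 do to each dictionary line can be pictured with a few lines of Python. This is a sketch, not the content of substring-before-slash.xsl (which is not shown here). The helper names are hypothetical, and the flag "S" in the usage example is invented. It relies on the fact that the first line of a .dic file holds a word count rather than a word.

```python
from xml.sax.saxutils import escape


def strip_affix_flags(dic_line: str) -> str:
    """Return the part of a .dic line before the first slash (the bare word)."""
    return dic_line.split("/", 1)[0].strip()


def to_grapheme_elements(dic_lines):
    """Turn .dic lines into <grapheme> elements, skipping blanks and the count line."""
    words = (strip_affix_flags(line) for line in dic_lines)
    return [
        f"<grapheme>{escape(word)}</grapheme>"
        for word in words
        if word and not word.isdigit()  # the leading word count is all digits
    ]
```

For example, to_grapheme_elements(["3", "Abklatsch/S", "neu"]) yields one element for Abklatsch and one for neu.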

11. Generate Swiss eSpeak phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ espeak -f swiss-ssml.xml -m -v de -q -x --phonout="swiss-espeak"

12. Open the file swiss-espeak with Geany. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) by the sequence "</phoneme>\n<phoneme>" (Use escape sequences):

[Image: use-escape-sequences]

13. Add <lexicon> at the beginning of the file swiss-espeak. Add </lexicon> at the end.

14. Paste <grapheme> and <phoneme> elements:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ paste swiss-ssml.xml swiss-espeak > swiss-pls

15. Edit swiss-pls with Geany so that it will become a valid PLS dictionary.
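Steps 12 to 15 (split the phoneme output, paste it next to the graphemes, then repair the file by hand) could be replaced by one function that writes the PLS file directly. The following is a rough sketch under assumptions: build_pls is a hypothetical name, the two input lists are assumed to be already aligned, and alphabet="espeak" simply follows the attribute value used in Ralf's Hindi dictionary.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace of the W3C Pronunciation Lexicon Specification 1.0.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_pls(words, phonemes, lang="de-CH", alphabet="espeak"):
    """Pair each word with its phoneme string and return a PLS document as text."""
    if len(words) != len(phonemes):
        # paste would silently misalign the columns; fail loudly instead.
        raise ValueError("word list and phoneme list differ in length")
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f"alphabet={quoteattr(alphabet)} xml:lang={quoteattr(lang)}>",
    ]
    for word, phoneme in zip(words, phonemes):
        lines += [
            "<lexeme>",
            f"<grapheme>{escape(word)}</grapheme>",
            f"<phoneme>{escape(phoneme)}</phoneme>",
            "</lexeme>",
        ]
    lines.append("</lexicon>")
    return "\n".join(lines) + "\n"
```

The length check is the main design point: with paste, a single missing phoneme line shifts every following entry by one.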

16. Convert some eSpeak phonemes into IPA phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:'swiss-pls' -xsl:'http://spirit.blau.in/simon/files/2010/05/ralfs-ipa-stylesheet.xsl' -o:'swiss-dictionary.xml'

17. I need to improve create-ralfs-ipa-stylesheet.xsl. At the moment, there are several German phonemes that aren’t converted, so I am taking a look at this script. Why am I doing this? At the moment, ralfs-ipa-stylesheet.xsl contains almost no eSpeak-to-IPA conversion rules. Here are the XPath expressions that have to be specified for the German language:

matches(/lexicon/@xml:lang, 'de')
replace($espeak2ipa, '3', '3')
replace($espeak2ipa, '@', 'ə')
replace($espeak2ipa, '@-', 'ə-')
replace($espeak2ipa, 'a', 'a')
replace($espeak2ipa, 'A', 'A')
replace($espeak2ipa, 'A:', 'A:')
replace($espeak2ipa, 'aI', 'aI')
replace($espeak2ipa, 'aU', 'aU')
replace($espeak2ipa, 'E', 'ɛ')
replace($espeak2ipa, 'E2', 'ɛ2')
replace($espeak2ipa, 'E:', 'ɛ:')
replace($espeak2ipa, 'e:', 'eː')
replace($espeak2ipa, 'EI', 'ɛɪ̯')
replace($espeak2ipa, 'I', 'I')
replace($espeak2ipa, 'i2', 'i2')
replace($espeak2ipa, 'i:', 'iː')
replace($espeak2ipa, 'O', 'O')
replace($espeak2ipa, 'o:', 'oː')
replace($espeak2ipa, 'OY', 'OY')
replace($espeak2ipa, 'U', 'U')
replace($espeak2ipa, 'u:', 'uː')
replace($espeak2ipa, 'W', 'W')
replace($espeak2ipa, 'y', 'y')
replace($espeak2ipa, 'y:', 'y:')
replace($espeak2ipa, 'Y:', 'Y:')
replace($espeak2ipa, '\*', '*')
replace($espeak2ipa, ':', ':')
replace($espeak2ipa, ';', ';')
replace($espeak2ipa, 'b', 'b')
replace($espeak2ipa, 'C', 'C')
replace($espeak2ipa, 'd', 'd')
replace($espeak2ipa, 'D', 'D')
replace($espeak2ipa, 'dZ', 'dZ')
replace($espeak2ipa, 'f', 'f')
replace($espeak2ipa, 'g', 'g')
replace($espeak2ipa, 'g#', 'g#')
replace($espeak2ipa, 'h', 'h')
replace($espeak2ipa, 'j', 'j')
replace($espeak2ipa, 'k', 'k')
replace($espeak2ipa, 'l', 'l')
replace($espeak2ipa, 'm', 'm')
replace($espeak2ipa, 'n', 'n')
replace($espeak2ipa, 'N', 'N')
replace($espeak2ipa, 'p', 'p')
replace($espeak2ipa, 'pF', 'pF')
replace($espeak2ipa, 'r', 'r')
replace($espeak2ipa, 's', 's')
replace($espeak2ipa, 'S', 'S')
replace($espeak2ipa, 't', 't')
replace($espeak2ipa, 'tS', 'tS')
replace($espeak2ipa, 'ts', 'ts')

replace($espeak2ipa, 'v', 'v')
replace($espeak2ipa, 'w', 'w')
replace($espeak2ipa, 'x', 'x')
replace($espeak2ipa, 'z', 'z')
replace($espeak2ipa, 'Z', 'Z')

I won’t do a direct conversion from eSpeak phonemes to SAMPA phonemes. I want a conversion from eSpeak phonemes to IPA phonemes. During the PLS simon import process, the IPA phonemes will be transformed into SAMPA phonemes.

18. Do you see the XPath expression replace($espeak2ipa, 'tS', 'tS')? I think that maybe this indicates the voiceless alveolar affricate (e.g. "zehn" [t͡seːn]). Am I right? Or am I wrong? There is another XPath expression: replace($espeak2ipa, 'ts', 'ts'). At the moment, I am not sure which one indicates the voiceless alveolar affricate. Probably, eSpeak [ts] stands for IPA [t͡s]. And probably, eSpeak [tS] stands for the voiceless palato-alveolar affricate [t͡ʃ]. This means that I can add the following XPath expressions to create-ralfs-ipa-stylesheet.xsl:

replace($current-ipa, 'tS', 't͡ʃ')
replace($current-ipa, 'ts', 't͡s')

By the way, the current version of Ralf's German dictionary doesn’t contain the phones [t͡s] and [t͡ʃ]. Maybe I will add both phones to the next version of Ralf's German dictionary. At least, I will add the [t͡s] phone.

19. I hope that you understand the strength of create-ralfs-ipa-stylesheet.xsl: I add the XPath expression replace($current-ipa, 'ts', 't͡s') to create-ralfs-ipa-stylesheet.xsl. This will influence PLS dictionaries that have one of the following xml:lang language codes: ca (Catalan), hu (Hungarian), de (Standard German, Swiss German, Austrian German), el (Greek), eo (Esperanto), hbs (I haven’t created a PLS dictionary with this language code), hy (Armenian – no PLS dictionary; I can use this spelling dictionary), it, lv, mk, pl, pt (espeak offers pt and pt-pt; probably both dialects use the same phone set), ru, sk, sq (Albanian – this is a dictionary that I should create using this spelling dictionary), and many more.

So I am adding the XPath expression replace($current-ipa, 'ts', 't͡s') once, and more than 10 PLS dictionaries will be affected.

20. Let’s be more specific, and take a look into my Swiss German PLS dictionary with eSpeak phonemes:

<lexeme>
<grapheme>Abklatsch</grapheme>
<phoneme>_!'apkl,atS</phoneme>
</lexeme>

And now, take a look at my “flagship”, Ralf’s German dictionary:

<lexeme role="Substantiv">
<grapheme>Abklatsch</grapheme>
<phoneme>ʔapklatʃ</phoneme>
</lexeme>

The final result of both dictionaries (Ralf's Swiss German dictionary and Ralf's German dictionary) should be: <phoneme>ʔapklat͡ʃ</phoneme>. I will achieve this goal by adjusting create-ralfs-ipa-stylesheet.xsl for Swiss German (de-CH). In contrast, Ralf's German dictionary (de-DE) already contains IPA phonemes. For this Standard German dictionary, I am using a specific .xsl style-sheet.

21. You can see that I am working as abstractly as possible (one .xsl style-sheet for all eSpeak languages), and as concretely as necessary. The phoneme adjustments are done at the appropriate level:

a. Abstract style-sheet for all eSpeak languages: create-ralfs-ipa-stylesheet.xsl
b. Concrete adjustments for a specific eSpeak language can be done here: ralfs-ipa-stylesheet.xsl
c. Fine-Tuning for Standard German: improve-german-dictionary.xsl

A native speaker from Switzerland could develop a specific .xsl style-sheet for the fine-tuning of the Swiss German PLS dictionary.
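The three levels from step 21 can be pictured as rule tables that are merged from general to specific. This is only a loose Python analogy, not how the .xsl files are organised internally. The table names and rules_for are hypothetical, and the de-CH table is deliberately left empty because its content would have to come from a native speaker.

```python
# Level a: rules shared by all eSpeak languages (taken from step 18).
BASE_RULES = {
    "ts": "t\u0361s",       # t͡s
    "tS": "t\u0361\u0283",  # t͡ʃ
}

# Level b: rules for one language (two entries from the German list in step 17).
LANGUAGE_RULES = {
    "de": {"@": "\u0259", "E": "\u025b"},  # ə, ɛ
}

# Level c: fine-tuning per variety; empty placeholder for Swiss German.
DIALECT_RULES = {
    "de-CH": {},
}


def rules_for(lang: str) -> dict:
    """Merge the tables so that the more specific level overrides the general one."""
    primary = lang.split("-")[0]
    merged = dict(BASE_RULES)
    merged.update(LANGUAGE_RULES.get(primary, {}))
    merged.update(DIALECT_RULES.get(lang, {}))
    return merged
```

Because later update calls win, an entry added to DIALECT_RULES["de-CH"] would override the German and the shared rule for the same eSpeak symbol without touching the other dictionaries.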

22. Adding the XPath expression replace($current-ipa, 'E:', 'ɛː') to create-ralfs-ipa-stylesheet.xsl.

23. What should I do with the expression replace($espeak2ipa, 'OY', 'OY')? I should follow the Wiktionary:

[ɔɪ̯] U+0254, U+026A, U+032F Heu /[hɔɪ̯]/, Läufer /[ˈlɔɪ̯fɐ]/, neu /[nɔɪ̯]/

The simon import process accepts the phone [ɔɪ̯]. According to the Wikipedia, the following transcription would be possible:

Instead of the transcription /ɔ͡ʏ/, the transcription /ɔ͡ɪ/ is used as well.

You can see that there are several possible solutions. We have to decide which solution we want to use for a specific language. For Standard German, we use [ɔɪ̯]. Different languages may use different transcriptions for the diphthong [ɔɪ̯].

24. And now, I learned something:

Diphthongs in German:

* [aɪ̯] as in Reich ‘empire’
* [aʊ̯] as in Maus ‘mouse’
* [ɔʏ̯] as in neu ‘new’
* [eːɐ̯] as in sehr ‘very’
* [iːɐ̯] as in dir ‘you (dative)’
* [oːɐ̯] as in Bor ‘boron (element)’
* [øːɐ̯] as in Öhr ‘eye (hole in a needle)’
* [uːɐ̯] as in nur ‘only’
* [yːɐ̯] as in Tür ‘door’

Some diphthongs in Bernese, a Swiss German dialect:

* [iə̯] as in Bier ‘beer’
* [yə̯] as in Fuß ‘feet’
* [uə̯] as in Schue ‘shoes’
* [ou̯] as in Stou ‘holdup’
* [au̯] as in Stau ‘stable’
* [aːu̯] as in Staau ‘steel’
* [æu̯] as in Wäut ‘world’
* [æːu̯] as in wääut ‘elects’
* [ʊu̯] as in tschúud ‘guilty’

Great. Now we are coming to a more specific level: Bernese German (Bärndütsch). You can see that Bernese German has specific diphthongs, so it is necessary to develop a Bernese German PLS dictionary. The plan is as follows:
- First, I develop Ralf's German dictionary for Standard German.
- Second, I develop Ralf's Swiss German dictionary for Switzerland.
- Third, someone who lives “in the Swiss plateau (Mittelland) part of the canton of Bern” should develop a Bernese German PLS dictionary.

25. Let me make one thing clear: If you speak Bernese German, you should give Ralf's German dictionary a try. Use my flagship dictionary to train a few Bernese German words. It is necessary that you understand the concept. But of course, Standard German is different from Bernese German. For good recognition results, it is necessary that you have your own PLS dictionary with the Bernese-specific diphthong [yə̯]. At the moment, there is no Bernese German PLS dictionary available (as far as I know). So you should use Ralf's Swiss German dictionary or Ralf's German dictionary.

In the long run, we need specific dialect dictionaries for the Swiss German language:
- Basel German (Baslerdüütsch) PLS dictionary
- Walliser German (Wallisertiitsch) PLS dictionary
- Walser German (Walserdeutsch) PLS dictionary
- Zürich German (Züritüütsch) PLS dictionary

For good recognition results, the PLS dictionary has to match your dialect. And this is why I am a fan of the IPA: You can be as specific as necessary. And we have a common standard that is applicable to all languages. So we use standards (UTF-8, PLS, IPA, XML, XSLT, GPLv3) that are recognized worldwide. And for each Swiss German dialect, there is a solution that can be developed.

26. I am adding the new phoneme /p͡f/ with the XPath expression replace($current-ipa, 'pF', 'p͡f'). It is similar to the voiceless labiodental affricate:

German has a similar sound in Pfeffer [ˈp͡fɛfˑɐ] (‘pepper’) and Apfel [ˈapˑ͡fl̩] (‘apple’). This /p͡f/ only occurs word-initially and behind short vowels, though it differs from a true labiodental affricate in that it starts out bilabial but then the lower lip retracts slightly for the frication.

Question: do we need this phoneme /p͡f/?

27. OK, let’s create the style-sheet:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-phonemes.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/create-ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl'

28. And now, let’s transform the eSpeak phonemes into IPA phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:'swiss-pls' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'swiss-dictionary.xml'
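What the generated style-sheet does to a single phoneme string can be sketched in Python. Note the swapped-in technique: instead of a chain of replace() calls, this sketch uses one regular expression that tries longer eSpeak symbols first, so 'E:' is never consumed by the rule for 'E'. The table is a small hypothetical excerpt. The entries for `_!`, `'` and `,` are assumptions read off the two Abklatsch entries in step 20 (the real dictionary keeps the stress marks).

```python
import re

# Hypothetical excerpt of an eSpeak-to-IPA table; not the full rule set.
ESPEAK_TO_IPA = {
    "_!": "\u0294",         # ʔ glottal stop (assumption, from the Abklatsch example)
    "'": "",                # primary stress mark, dropped in this sketch (assumption)
    ",": "",                # secondary stress mark, dropped in this sketch (assumption)
    "tS": "t\u0361\u0283",  # t͡ʃ
    "ts": "t\u0361s",       # t͡s
    "pF": "p\u0361f",       # p͡f
    "E:": "\u025b\u02d0",   # ɛː
    "E": "\u025b",          # ɛ
    "@": "\u0259",          # ə
}

# Longest keys first, so multi-character symbols win over their prefixes.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(ESPEAK_TO_IPA, key=len, reverse=True))
)


def espeak_to_ipa(espeak: str) -> str:
    """Convert one eSpeak phoneme string in a single left-to-right pass."""
    return _PATTERN.sub(lambda match: ESPEAK_TO_IPA[match.group(0)], espeak)
```

Symbols without a table entry pass through unchanged, which makes unconverted phonemes easy to spot in the output.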

29. Download Ralf's Swiss German dictionary, and import it into simon.

[Image: lautsprecherbox]
Left column: Swiss German words. Unfortunately, a lot of nouns are written in lower case.
Right column: Corresponding SAMPA phonemes.

You can imagine the following problem: The phonemes l aU tS p R E C ah b O k s are not perfect. Take a look at the corresponding entry in the PLS dictionary:

<lexeme>
<grapheme>Lautsprecherbox</grapheme>
<phoneme>l'aʊtʃpʀɛçɐb,ɔks</phoneme>
</lexeme>

I am not sure whether the phonemes /t/ and /ʃ/ should be treated as one single phoneme /t͡ʃ/, or not (see step 18 above).

30. I changed the code of Ralf's Swiss German dictionary:

<lexeme>
<grapheme>Lautsprecherbox</grapheme>
<phoneme>l'aʊt͡ʃpʀɛçɐb,ɔks</phoneme>
</lexeme>

I didn’t know what the result would look like when I import this into simon, so I tested it: there is no difference for the end user.

Ralf’s Occitan dictionary

Friday, April 30th, 2010

Here is how I create Ralf's Occitan dictionary:

1. Get spelling dictionary. License is GPL.

2. Converting oc_FR.dic from ISO8859-15 to UTF-8 via Ubuntu terminal:

~/Documents/201004/occitan-dictionary$ iconv -f ISO8859-15 -t UTF-8 < oc_FR.dic > occitan-utf8.dic

3. Convert oc_FR.aff:

am3msi@am3msi-desktop:~/Documents/201004/occitan-dictionary$ iconv -f ISO8859-15 -t UTF-8 < oc_FR.aff > occitan-utf8.aff

4. Change first line of file occitan-utf8.aff from SET ISO8859-15 to SET UTF-8.

5. Generating list with 1.5 million Occitan words:

~/Documents/201004/occitan-dictionary$ unmunch occitan-utf8.dic occitan-utf8.aff > occitan-wordlist-utf8

6. Downloading the style-sheet improve-latin-dictionary.xsl. I am modifying this style-sheet because I want to reduce the size of the Occitan word list.

7. Adding <speak> tags to the file occitan-wordlist-utf8 (<speak> at the beginning of the file; </speak> at the end of the file).

8. Creating SSML file:

~/Documents/201004/occitan-dictionary$ saxonb-xslt -ext:on -s:occitan-wordlist-utf8 -xsl:'create-audio-elements.xsl' -o:occitan-speak-audio

9. eSpeak doesn’t offer support for the Occitan language. According to the Wikipedia,

Modern Occitan is the closest relative of Catalan.

Because of that, I will generate the phonemes of the Occitan word list as if the word list were a Catalan word list. I am doing this with the Ubuntu terminal:

~/Documents/201004/occitan-dictionary$ espeak -f occitan-speak-audio -m -v ca -q -x --phonout="occitan-espeak"

10. Adding <lexicon> tags to the file occitan-espeak (<lexicon> at the beginning of the file; </lexicon> at the end of the file).

11. Creating <phoneme> elements:

~/Documents/201004/occitan-dictionary$ saxonb-xslt -ext:on -s:occitan-espeak -xsl:'replace-newline-newline-space-by-phoneme-element.xsl' -o:occitan-phoneme-elements

12. Editing occitan-speak-audio and occitan-phoneme-elements with gedit. Both files have to have exactly the same number of lines.

13. Combining list with <grapheme> elements and list with <phoneme> elements:

~/Documents/201004/occitan-dictionary$ paste occitan-speak-audio occitan-phoneme-elements > occitan-dictionary.xml

14. Editing occitan-dictionary.xml with gedit so that it will be a valid XML file.

15. The size of Ralf's Occitan dictionary version 0.1 is too big. But unfortunately, I don’t know how I can filter out words with XSLT that begin with t' m' s' n' (because I have problems with character entity references). I hope that I will find a solution for the next version of this dictionary.
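Outside of XSLT, the filter from step 15 is short. The following Python sketch is a swapped-in technique, not the XSLT solution the step is looking for. The names are hypothetical, and the test words in the usage note are only illustrative. It checks both the ASCII apostrophe and the typographic one (U+2019), since it is not clear which of the two the word list uses.

```python
ELIDED_PREFIXES = ("t", "m", "s", "n")
APOSTROPHES = ("'", "\u2019")  # ASCII apostrophe and right single quotation mark


def has_elided_prefix(word: str) -> bool:
    """True for words such as t'... or m'... (one prefix letter plus apostrophe)."""
    return (
        len(word) >= 2
        and word[0].lower() in ELIDED_PREFIXES
        and word[1] in APOSTROPHES
    )


def drop_elided_forms(words):
    """Keep only the words that do not start with an elided prefix."""
    return [word for word in words if not has_elided_prefix(word)]
```

For example, drop_elided_forms(["t'aimar", "aimar", "taula"]) keeps aimar and taula, because taula has no apostrophe in second position.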

16. The <phoneme> elements contain eSpeak characters. It is possible to convert the eSpeak phonemes into IPA phonemes via an XSLT style sheet.

17. Download Ralf's Occitan dictionary, and import it into simon (the import takes about 2 minutes on my computer).

Zuführungsdrähten: two pronunciations

Thursday, December 24th, 2009

I am now adding the word Zuführungsdrähten (which is part of the shadow dictionary):

[Image: zufuehrungsdraehten]

You can see that there are two pronunciation alternatives. And this proves the strength of Ralf’s German dictionary: I am using an XSLT stylesheet to fix recurring pronunciation errors (eSpeak is not perfect), and to add alternate pronunciations (using replace: replace($ieren, 'tən', 'tn̩')).
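The idea of deriving a second pronunciation from the first can be sketched in Python as follows. This is not the code of the stylesheet. The function name with_alternative is hypothetical, and the default arguments simply mirror the replace() call quoted above.

```python
def with_alternative(ipa: str, old: str = "t\u0259n", new: str = "tn\u0329"):
    """Return the pronunciation plus a variant with syllabic n, if one exists.

    old is 'tən' and new is 'tn̩' (n followed by U+0329), as in the replace() call.
    """
    variant = ipa.replace(old, new)
    # Only report a second pronunciation when the rule actually changed something.
    return [ipa] if variant == ipa else [ipa, variant]
```

Each string in the returned list would become its own <phoneme> element inside the same <lexeme>, which PLS allows.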

It would be great if someone would be willing to volunteer with the development of Ralf's German dictionary. My concept is as follows (compare with the XSLT concept):

[Image: xslt-concept]
Image source: Wikipedia

Ralf’s German dictionary (current version) = XML input
espeak2perfectipa.xsl = XSLT code
saxonb-xslt (Ubuntu terminal) = XSLT processor
Ralf’s German dictionary (future version) = Result document

The development of Ralf's German dictionary is done outside of, and independently of, simon. During import, simon does a pretty good conversion from IPA to SAMPA. So there is no need to worry.

Ralf’s German dictionary has a lot of known weaknesses. It would be great if someone who is interested in the improvement of this dictionary would volunteer. It is not that difficult to get involved.

Everyone is permitted to improve Ralf’s German dictionary (and the corresponding XSLT code espeak2perfectipa.xsl) because both are GPLv3 licensed.

Ralf's German dictionary is the flagship. There are other dictionaries which need improvement:

A. Austrian German
Ralf’s Austrian German dictionary – it is a very small dictionary. The target group is very specific. This dictionary should contain only words that are not included in Ralf's German dictionary. So if you live in Austria, it is intended that you import two dictionaries:
1. Ralf's German dictionary (with 300,000 words);
2. Ralf's Austrian German dictionary (with specific words).

I don’t know much about Austrian German. You can get an impression of how Austrian German sounds when watching the simon video tutorial.

B. Swiss German
Maybe I will release a Swiss German dictionary. If someone from Switzerland is interested in the development of such a dictionary, I could create Ralf's Swiss German dictionary (I haven’t done that so far). I think that there is a GPL word list available at OpenOffice.org. So a volunteer would be welcome. I can help you with the first steps (unmunch, eSpeak, paste). The result would be a PLS dictionary with a vocabulary that only contains words that are specific to Swiss German.

The future Ralf's Swiss German dictionary is interesting for people who live in (or emigrated from Germany to) Switzerland. If you emigrated from Germany to Switzerland, you should get familiar with Swiss German. So Ralf's Swiss German dictionary would be interesting for German people who immigrate to Switzerland. If you want to stay in Switzerland, learn their language! simon / Ralf's Swiss German dictionary might help you to reach that goal.

C. Medical German
Ralf’s German medical dictionary is targeted at people who are interested in medical education. It is necessary to develop specialised medical dictionaries. The concept is easy:
1. Import Ralf's German dictionary into simon.
2. Import Ralf's German medical dictionary. Then you can train simon to recognize medical terms (e.g. Linsenchirurgie [lɪnzənçiːʀʊʀgiː], Epilepsiebehandlung [ʔeːpiːlɛpsiːbeːandlʊŋ]).
3. Develop specialised medical dictionaries: Human anatomy, pharmacology, genetics, etc.

simon could be used by medical students. So if you are a medical student (German language), you can improve Ralf's German medical dictionary by adding words to it. Later, in a few years when you become a doctor, you might be able to use your experience with simon / Ralf's German medical dictionary. Develop your own medical pronunciation dictionary, and become a better doctor!

Of course, because we are in a very early stage of development, this is just something for medical students who have enough time.

D. Latin (German pronunciation)
Ralf’s Latin dictionary needs improvement. Latin has pronunciation rules that are different from German. The following way is possible: Improve Ralf's Latin dictionary with an XSLT stylesheet. I explained the concept above. The stylesheet needs information that is specific to the Latin language. Sometimes the Latin e is short (e.g. currere), sometimes it is long (e.g. dēbēre). These things need to be fixed.

And of course, 1.7 million Latin words are too many. The size of this dictionary has to be reduced because of performance issues. A dictionary with about 100,000 Latin words would be optimal at the moment. We don’t yet have a routine (compression algorithm) to handle dictionaries with e.g. 1.7 million words. This has to be developed (maybe someone who is familiar with the unmunch command could do that). But a dictionary with 100,000 Latin words would be a good start.

Latin is a “dead language”. But it should be possible, thanks to simon / Ralf's Latin dictionary, to make the computer write down Latin words when you speak them into your microphone. From my point of view, Ralf’s Latin dictionary is something for Latin teachers (school or university). So the target group is very specific. I think that it can be fun for Latin students to use simon for the recognition of spoken Latin words.

E. Conclusion
The different dictionaries need improvement. Interested persons (people from Austria, Switzerland; medical students; Latin teachers) are encouraged to improve the specific dictionary on their own. My dictionaries are GPLv3 licensed. It is intended that someone improves them. This is my concept:

1. unmunch an OpenOffice.org spelling dictionary
2. generate phonemes with eSpeak
3. paste
4. convert eSpeak phonemes into IPA phonemes with XSLT
5. import the resulting PLS dictionary into simon
6. record a few words with simon
7. use simon for recognition (dictate e.g. into a gedit window)

This concept should work with dictionaries that use the German pronunciation (Austrian, Swiss, German medical, Latin). I didn’t test these dictionaries for training / recognition with simon. But the concept is the same. Since Ralf's German dictionary (flagship) works with simon, the other dictionaries with German pronunciation should work, too.

Ralf’s Vietnamese dictionary

Sunday, November 15th, 2009

You can import Ralf's Vietnamese dictionary (version 0.1; GPLv3) into simon. The dictionary contains about 6,000 words; training is not possible. The phoneme elements contain eSpeak phonemes (not IPA phonemes).

Import 14,000 Kurdish words

Monday, November 9th, 2009

Download Ralf's Kurdish dictionary (version 0.1; GPLv3), and import it into simon. Unfortunately, training with this dictionary is currently not possible.

The phoneme elements contain eSpeak phonemes (not IPA phonemes). Some words about the creation of Ralf's Kurdish dictionary: After getting a spelling dictionary, I unmunched a list of about 14,000 words. Then I generated the phonemes, and created the PLS file.

Now you know that you can import Ralf's Kurdish dictionary into simon.

Get 80,000 Hindi lexeme elements

Monday, November 9th, 2009

You can import Ralf's Hindi dictionary (version 0.1; GPLv3) into simon. The alphabet of the phonemes is not IPA. Instead, it is the eSpeak output (eSpeak ASCII phonemes). The lexicon element contains the following information: alphabet="espeak". This means that the phoneme elements contain eSpeak phonemes.

Some words about how I created the dictionary: I got a spelling dictionary, then I created an SSML file, and generated the phonemes. Finally, I used paste to combine the SSML file with the eSpeak phonemes.

Training with this dictionary is currently almost impossible because there is no eSpeak-specific import function. It might be worth considering whether eSpeak phonemes could be used directly for speech recognition.

You can see that simon displays the Hindi characters correctly:

[Image: hindi]

You now know that you can import 80,000 Hindi words into simon.

Expanding Ralf’s French dictionary

Tuesday, November 3rd, 2009

This article is interesting for people who want to create a pronunciation dictionary for their own language that can be imported into simon.

Currently, I am preparing a new version of Ralf's French dictionary. I am adding a lot of words to the dictionary. Here is what I am doing:

1. I got a French spelling dictionary from OpenOffice.org (Orthographe «Réforme 1990»). There are two important files:
a. .dic file: contains the word list
b. .aff file: contains rules about the possible suffixes. French has a lot of suffixes. Let’s take a look into Conjugaison française:Premier groupe:

# INDICATIF

* Présent : -e, -es, -e, -ons, -ez, -ent
* Imparfait : -ais, -ais, -ait, -ions, -iez, -aient
* Futur simple : -erai, -eras, -era, -erons, -erez, -eront
* Passé simple : -ai, -as, -a, -âmes, -âtes, -èrent

# SUBJONCTIF

* Présent : -e, -es, -e, -ions, -iez, -ent
* Imparfait : -asse, -asses, -ât, -assions, -assiez, -assent

# CONDITIONNEL

* Présent : -erais, -erais, -erait, -erions, -eriez, -eraient

The next version of Ralf’s French dictionary should cover most of the French suffixes. And of course, the French accents should be correct (according to the Réforme 1990).

2. I typed into the Ubuntu terminal the command: unmunch fr-1990.dic fr-1990.aff > french-wordlist-o.txt
This created a list of about 3 million French words. But there are many duplicates, which means that I will use only a subset of these words. I will have to remove the duplicates with the XPath function distinct-values(), or with sort -u:

“sort sorts the list of ‘words’ into alphabetical order, and the -u switch removes duplicates”

I will have to be careful with the special characters. The sort command might be easier, but there may be problems with ISO-8859-1 vs. UTF-8. To avoid this possible problem, it might be better to use an XSLT style-sheet. But there can be a problem with Java heap space (I solved this problem under Ubuntu 9.04 with VisualVM. Unfortunately, I didn’t find out how to deal with this issue under Ubuntu 9.10.). I will have to try it out. The tools that I am using are very good, but they aren’t perfect. Fixing the issues costs me a lot of time.
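A third option besides sort -u and distinct-values() is a few lines of Python, where the encoding is stated explicitly when the file is opened. This is a minimal sketch with a hypothetical function name. Note that sorted() orders by Unicode code point, not by French collation rules, which is enough for removing duplicates.

```python
def unique_sorted_words(lines):
    """Strip line endings, drop empty lines and duplicates, and sort the rest."""
    words = {line.strip() for line in lines}
    words.discard("")  # blank lines collapse to one empty string
    return sorted(words)
```

A possible call would be unique_sorted_words(open("french-wordlist-o.txt", encoding="utf-8")). With 3 million lines, the whole set has to fit into memory.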

There are encoding problems. If you want to create a dictionary with non-US-ASCII characters (as is the case in German – äöüß – or French), you will probably encounter the following problem:

ISO-8859-1 vs. UTF-8

Both standards are very common, and mixing them up can cause a lot of headaches. I am trying to solve the problem via the style-sheet create-graphemes-french.xsl:

[Image: replace-french]

The unmunch command introduced a lot of crap characters. I don’t know how I can prevent this problem. But I know how I can try to fix the crap characters. One solution would be a simple search and replace with gedit. But I prefer to use a style-sheet because it allows several transformations at a time.
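If the garbage characters are UTF-8 bytes that were read as ISO-8859-1 (an assumption; the exact cause is not known here), they can often be repaired by reversing that step. The following Python sketch shows the idea. The function name repair_mojibake is hypothetical, and this is not what create-graphemes-french.xsl does.

```python
def repair_mojibake(text: str) -> str:
    """Undo a wrong ISO-8859-1 decoding of UTF-8 bytes, if that is what happened."""
    try:
        # Turn the characters back into the original bytes, then decode correctly.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not damaged in this particular way; leave it alone.
        return text
```

Plain ASCII words and words that are already correct come back unchanged, so the function can be applied to every line of the word list.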

You can see that dictionary development is possible. I am using very powerful tools (the unmunch command for word generation; the .xsl style-sheet for fixing encoding issues; the espeak command for phoneme generation).

Why am I writing this in this blog? Because if you want to use simon, you need a pronunciation dictionary. I want to help people who don’t have access to a dictionary. Create your own dictionary! OpenOffice.org offers spelling dictionaries in a lot of languages. This is your source to get a word list. With eSpeak you transform the words into phonemes. You can use eSpeak for the following languages:

Afrikaans, Bosnian, Catalan, Czech, German (most phonemes of Ralf's German dictionary are created with eSpeak), Greek, Esperanto, Spanish, Finnish, Croatian, Hungarian, Italian, Kurdish, Latvian, Polish, Romanian, Slovak, Serbian, Swedish, Swahili, Tamil, Turkish.

If you are a native speaker of one of those languages, my concept of dictionary creation is something for you.

With paste you combine the word list and the phoneme list.

[Edited on November 16, 2009]

Ralf’s French dictionary with 40,000 words

Friday, October 30th, 2009

You can import Ralf's French dictionary (license: GPL; version 0.1) into simon. Currently, training with simon is not recommended with this dictionary. The dictionary should be more or less OK, but there are problems with the simon import process which I will describe later in this article.

I used a French OpenOffice.org spelling dictionary (license: GPL) to get the words for this dictionary. The phones were generated with eSpeak. I converted the eSpeak ASCII phones into IPA with this style-sheet. I am not sure whether all French vowels are converted correctly from the eSpeak phone set into IPA. Especially, I am unsure whether the following XPath expressions do the correct replacements:

select="replace($sierra, 'Y:', 'ø')"
select="replace($sierra, 'Y','œ')"

It is difficult to make the right distinction between these similar phones. Maybe there is a native speaker out there who is familiar with the French eSpeak phone set? The question is: how can I improve the conversion from French eSpeak phones into IPA phones? The XSLT style-sheet (see the link above) shows which replacements have been made. I am not sure whether eSpeak generated all French phones correctly. There might be some inconsistencies that have to be fixed.

I am open to suggestions on how to improve the XSLT-style-sheet. The concept is great. Support from a native speaker would be appreciated because it is difficult to catch all the details (especially vowels) of the French language.

When you import the dictionary into simon, not all phones are converted properly by simon. Probably, the following IPA symbols aren’t imported properly by simon:

1. ɲ = voiced palatal nasal = U+0272; a future version of simon could transform this phone into SAMPA J. Example:

<lexeme>
<grapheme>Avignon</grapheme>
<phoneme>aviɲɔ̃</phoneme>
</lexeme>

2. θ = voiceless dental fricative = U+03B8; a future version of simon could transform this phone into SAMPA T. Example:

<lexeme>
<grapheme>Commonwealth</grapheme>
<phoneme>kɔmənwɛlθ</phoneme>
</lexeme>

We need this phone also for (a future version of) Ralf’s German dictionary for words like Thunderbird. Currently, Ralf’s German dictionary doesn’t employ this foreign phone. But this phone is needed.

3. ɔ̃ ; see Wikipedia: “in German, in French loanwords such as Balkon, Chanson”. Example:

<lexeme>
<grapheme>Simon</grapheme>
<phoneme>simɔ̃</phoneme>
</lexeme>

4. ɑ̃; nasalized a. Example:

<lexeme>
<grapheme>diligence</grapheme>
<phoneme>diliʒɑ̃s</phoneme>
</lexeme>

5. ɛ̃ = bright nasal vowel. Example:

<lexeme>
<grapheme>infidélité</grapheme>
<phoneme>ɛ̃fidelite</phoneme>
</lexeme>

6. œ̃ = rounded open-mid nasal vowel. Example:

<lexeme>
<grapheme>jardin</grapheme>
<phoneme>ʒaʀdœ̃</phoneme>
</lexeme>

7. ɥ = ü sound used as a consonant = U+0265. Example:

<lexeme>
<grapheme>minuit</grapheme>
<phoneme>minɥi</phoneme>
</lexeme>

You can see that the French language employs a lot of vowels. Maybe it is possible to adjust the PLS import process?

I have imported the dictionary into simon. Let’s take a look at the results (HTK format):

1. AVIGNON [Avignon] aviJOnas – has to be fixed
2. COMMONWEALTH [Commonwealth] kOm@nwElT – this has to be fixed, too.
3. SIMON [Simon] s i m O n a s – interesting: it seems that the phones are separated correctly. But it is easy to recognize that the simon import process should be adjusted.
4. DILIGENCE [diligence] d i l i Z A n a s s – the error is similar to the previous one.
5. INFIDÉLITÉ [infidélité] E n a s f i d e l i t e – has to be fixed.
6. JARDIN [jardin] Z a R d oe n a s – has to be fixed, too.
7. MINUIT [minuit] minHi – the H is the correct SAMPA transcription. But there should be spaces between the phones.

At the moment, the major issue is the simon PLS import process.
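Two of the problems above (missing spaces, and nasal vowels being torn apart) come down to how an IPA string is split into phones. The following Python sketch is not simon's import code. It only illustrates one way to keep a combining mark such as the tilde (U+0303) attached to its base character. The three SAMPA mappings are the ones named in this article; everything else passes through unchanged.

```python
import unicodedata

# Only the mappings mentioned in the article: ɲ to J, θ to T, ɥ to H.
IPA_TO_SAMPA = {"\u0272": "J", "\u03b8": "T", "\u0265": "H"}


def split_phones(ipa: str):
    """Split an IPA string into phones, gluing combining marks to their base."""
    phones = []
    for char in ipa:
        if phones and unicodedata.combining(char):
            phones[-1] += char  # e.g. the tilde of a nasal vowel
        else:
            phones.append(char)
    return phones


def to_spaced_sampa(ipa: str) -> str:
    """Map the known phones to SAMPA and separate all phones with spaces."""
    return " ".join(IPA_TO_SAMPA.get(phone, phone) for phone in split_phones(ipa))
```

With this splitting, the nasal vowel in Simon stays one phone instead of being broken into several, and minuit comes out as "m i n H i".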

Create your own PLS dictionary

Tuesday, October 27th, 2009

This article is for people who want to use simon, but don’t have access to a pronunciation dictionary.

If you want to use simon, you need a pronunciation dictionary. Such dictionaries are available for some languages.

What if you don’t have a dictionary for your own language? You can create your own dictionary. Here are a few thoughts about the topic “dictionary creation”:

Today, I created about 100,000 words from an OpenOffice.org spelling dictionary with the style-sheet create-graphemes.xsl. The source spelling dictionary is licensed under the GPL. There is no licensing problem. Ralf’s German dictionary is GPL licensed, too. I “steal” words only from GPL sources.

I recommend OpenOffice.org spelling dictionaries if you want to create a PLS dictionary for your own language. Think about languages like Czech, Greek, or Finnish. Native speakers can use such an OpenOffice.org spelling dictionary for word creation. The word list can then be used with eSpeak for phoneme generation. eSpeak works with several languages. The word list and the phoneme list can be combined with paste. After that, the eSpeak phonemes can be converted into IPA with the style-sheet espeak2perfectipa.xsl. You can write your own style-sheet for your own language. I want to show you the concept of dictionary creation.

I created a word list of Ralf’s German dictionary using the style-sheet output-text.xsl. Then I used comm: $ comm -23 output-o 0.1.7-wordlist-o > compared
The result should be a word list with words that are not part of the current version of Ralf’s German dictionary (I am not sure whether this approach was successful). I will integrate these words into the dictionary.
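The comm -23 call prints the lines that occur only in the first file, and it expects both files to be sorted the same way. If the sort order is in doubt, the same comparison can be sketched in Python with a set, which does not depend on the order of the input. The function name only_in_first is hypothetical.

```python
def only_in_first(candidates, known):
    """Return the sorted words that are in candidates but not in known."""
    known_set = set(known)
    return sorted({word for word in candidates if word not in known_set})
```

For example, only_in_first(["Abklatsch", "Bugleine", "Zebra"], ["Zebra", "Abklatsch"]) returns ["Bugleine"], even though the second list is not sorted.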

Ralf’s German dictionary is a solution for the German language. Create a solution for your own language!

Format of imported text files

Thursday, October 8th, 2009

It is possible to import text files for training into simon. Internally, simon stores them in the following format:

<?xml version="1.0" encoding="UTF-8"?>
<text title="xam" >
<page>
<text>Bugleine</text>
</page>
</text>

I would prefer it if simon used the following format:

<?xml version="1.0" encoding="UTF-8"?>
<speak
title="xam">
<audio>Bugleine</audio>
</speak>

When I create the PLS dictionary, I am using eSpeak for the generation of German phonemes. And eSpeak accepts SSML files with the <speak> and <audio> tags.
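Producing such a file from a word list is a small job. The following Python sketch writes one <audio> element per word inside a <speak> element, following the layout shown above. The function name words_to_ssml is hypothetical, and the title attribute is left out.

```python
from xml.sax.saxutils import escape


def words_to_ssml(words):
    """Wrap each word in an <audio> element inside one <speak> element."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<speak>"]
    # escape() protects the characters &, < and > in case a word contains them.
    lines += [f"<audio>{escape(word)}</audio>" for word in words]
    lines.append("</speak>")
    return "\n".join(lines) + "\n"
```

The result can be written to a file with encoding="utf-8" and then passed to espeak with the -f and -m options used in the other articles.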

This is not a major issue, and it is not important. But I want to tell you what I am thinking. Maybe in the long term, a future version of simon would be able to generate pronunciations of unknown words automatically? For this purpose, the use of SSML tags would be helpful. The TTS program eSpeak accepts SSML. And eSpeak is suitable for a lot of languages. So maybe it would be useful to add a function to simon that would be able to make use of eSpeak?

From my point of view, eSpeak produces good phoneme quality for the German language. A graphical user interface would be good.

It is just a thought; there are probably more important things to do.