Posts Tagged ‘node16’

Is the glottal stop a phoneme?

Thursday, May 13th, 2010

Ralf's German dictionary contains entries with a glottal stop. Here is an example:

<lexeme role="Substantiv">
<grapheme>Immobilieneuphorie</grapheme>
<phoneme>ʔɪmoːbiːliːənɔɪ̯foːʀiː</phoneme>
</lexeme>

When you import the dictionary into simon, this word is displayed as follows:

[Screenshot: immobilieneuphorie]

The SAMPA transcription gls I m o: b i: l i: @ n OI f o: R i: contains the glottal stop (gls). Does this improve the recognition result?

Take a look at Wikipedia:

“In the northern varieties, [ʔ] occurs before word stems with initial vowel. It is not considered a phoneme, but an optional boundary mark of word stems.”

So maybe the glottal stop is not a phoneme but a boundary mark? Maybe it would be better to create a new IPA “phoneme” /ʔɪ/, or to be more specific: /ʔ͡ɪ/? This “phoneme” doesn’t exist, but if the glottal stop is just a boundary mark, it might be a workable solution for this issue.

Are there any test results available? I don’t know how HTK handles the glottal stop. Maybe it is a good decision to treat the glottal stop as a phoneme. Or maybe it would be better to introduce “new” phonemes: a “normal” /ɪ/ phoneme and a “new” /ʔ͡ɪ/ phoneme.

I am just thinking out loud. It is great that simon imports the glottal stop. I would like to know: does this improve the recognition results? I haven’t tested it.

Ralf’s German dictionary 0.1.9.3

Thursday, May 13th, 2010

How I create Ralf's German dictionary version 0.1.9.3:

1. The current version number (when writing this article) is 0.1.9.2 (April 27, 2010). It doesn’t contain the phoneme /t͡s/.

2. The next version will be 0.1.9.3. It will contain the phoneme /t͡s/.

3. The improvement is done via the Ubuntu terminal:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/german-0.1.9.3/german-0.1.9.2.xml' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.3/improve-german-dictionary.xsl' -o:'/home/ubuntu/Documents/201005/german-0.1.9.3/german-0.1.9.3.xml'

4. Remove the XPath expression replace($sierra, 'giːʀ', 'giːɐ̯') from improve-german-dictionary.xsl. I don’t need this replacement instruction any more.

5. You all know: Ralf's German dictionary should cover Standard German. You are allowed to edit this dictionary because the license is GPLv3. You could use it as source for your own German PLS dialect dictionary.

6. Adding the following XPath expressions to improve-german-dictionary.xsl:
a. When starts-with(grapheme, 'Z') then replace($sierra, 'ts', 't͡s').
b. When ends-with(lower-case(grapheme), 'z') then replace($sierra, 'ts', 't͡s'). Take a look at the result of this operation:

<lexeme role="Substantiv">
<grapheme>Äquivalents</grapheme>
<phoneme>ʔɛːkviːvalɛnts</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Äquivalenz</grapheme>
<phoneme>ʔɛːkviːvalɛnt͡s</phoneme>
</lexeme>

If the <grapheme> element ends with ts, the corresponding phonemes /t/ and /s/ in the <phoneme> element won’t be converted. My approach is conservative. I don’t want to introduce any errors.
c. Replace the phonemes 'vɐtaɪ̯lʊŋ' by 'fɐtaɪ̯lʊŋ' with the expression replace($sierra, 'vɐtaɪ̯lʊŋ', 'fɐtaɪ̯lʊŋ').
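
Rules a and b above can be sketched outside XSLT as well. The following Python snippet is only an illustration of the same conservative logic, not the actual style-sheet; the function name improve_ts is hypothetical. Like the XPath replace() call, it converts every "ts" in the phoneme string once the grapheme condition is met.

```python
# Minimal sketch of rules a and b (assumption: plain string matching is enough here).
TIE_TS = "t\u0361s"  # IPA affricate: t + combining double inverted breve + s


def improve_ts(grapheme: str, phoneme: str) -> str:
    """Convert 'ts' to the affricate only if the spelling starts with 'Z' or ends with 'z'."""
    if grapheme.startswith("Z") or grapheme.lower().endswith("z"):
        return phoneme.replace("ts", TIE_TS)
    # Conservative default: leave the phonemes untouched.
    return phoneme


print(improve_ts("Äquivalenz", "ʔɛːkviːvalɛnts"))   # converted
print(improve_ts("Äquivalents", "ʔɛːkviːvalɛnts"))  # left as it is
```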

7. Correct <phoneme>kʀɛbsvɔʀzɔʀgɔɪ̯ntɐzuːxʊŋ</phoneme> into
<phoneme>kʀɛbsfɔʀzɔʀgəʔʊntɐzuːxʊŋ</phoneme>. There is no diphthong in the middle of the word. By the way, Wikipedia uses “[ɔʏ̯] as in neu”. Ralf's German dictionary uses [ɔɪ̯] instead.

8. Which phonemes does Ralf's German dictionary use? Mainly the phonemes that are used in the Wiktionary. Additionally, it seems to be a good decision to add the phoneme /t͡s/ because there were problems with the recognition of the German words zwei and zurück.

9. When contains(lower-case(grapheme), 'zwei') then replace($sierra, 'tsvaɪ̯', 't͡svaɪ̯').

10. I am counting the number of the <lexeme> elements with the XPath expression count(lexeme). The dictionary contains 384358 German words.

11. When contains(lower-case(grapheme), 'ungssch') then replace($sierra, 'ʊŋsç', 'ʊŋsʃ').

12. Ralf's German dictionary contains <lexeme> elements with empty role attributes. The test whether the attribute is empty or not is a bit difficult:

"Testing for missing attributes in XML elements is different from testing for attributes with empty values. You can test for missing attributes with <xsl:if test="not(@attribute)">, but this test will never succeed if the attribute is present but empty. In that case, you have to use the <xsl:if test="@attribute = ''"> condition."

Interesting. At the moment, almost all <lexeme> elements contain a role attribute (empty or non-empty). This means that I should use the second expression <xsl:if test="@role = ''">.
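
The same distinction between a missing and an empty attribute exists in other XML tools. Here is a small Python sketch with the standard library module xml.etree.ElementTree; the sample data and the helper name role_status are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini lexicon: one filled, one empty, and one missing role attribute.
SAMPLE = """<lexicon>
<lexeme role="Substantiv"><grapheme>Simon</grapheme></lexeme>
<lexeme role=""><grapheme>tschalpen</grapheme></lexeme>
<lexeme><grapheme>eintippen</grapheme></lexeme>
</lexicon>"""


def role_status(lexeme):
    """Return 'missing', 'empty' or 'set' for the role attribute of a lexeme element."""
    role = lexeme.get("role")  # None if the attribute is absent
    if role is None:
        return "missing"       # corresponds to test="not(@role)"
    if role == "":
        return "empty"         # corresponds to test="@role = ''"
    return "set"


statuses = [role_status(lexeme) for lexeme in ET.fromstring(SAMPLE).iter("lexeme")]
print(statuses)
```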

13. I want to add several role attributes. This is where I start:

<xsl:if test="ends-with(grapheme, 'endem')">
<xsl:text>Adjektiv Dativ</xsl:text>
</xsl:if>

And this is the result:

<lexeme role="Adjektiv Dativ">
<grapheme>heißlaufendem</grapheme>
<phoneme>haɪ̯slaʊ̯fəndəm</phoneme>
</lexeme>

And this is what simon does with the role attribute:

[Screenshot: adjektiv-dativ]

simon displays both alternatives, Adjektiv and Dativ. A word can play several roles. Take a look at the example in the PLS specification. I think that it is useful to tag words with several roles.

14. Add <metadata> element to improve-german-dictionary.xsl. I don’t understand the details at the moment. This has low priority.

15. Here is another example of how I add role attributes:

<xsl:if test="ends-with(grapheme, 'test')">
<xsl:text>Verb Singular KonjunktivII</xsl:text>
</xsl:if>

And this is the result:

<lexeme role="Verb Singular KonjunktivII">
<grapheme>zujubeltest</grapheme>
<phoneme>t͡suːjuːbəltəst</phoneme>
</lexeme>

The word zujubeltest can be described with Imperfekt, too.
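
The two suffix rules from steps 13 and 15 can be written as a small lookup table. This Python sketch is an assumption about how such rules could be organised, not a copy of the style-sheet; the names SUFFIX_ROLES and guess_role are hypothetical, and keeping an existing role is my own choice. Note that a pure suffix test is rough: a noun like Protest also ends with "test".

```python
# Suffix rules taken from steps 13 and 15; the order decides which rule wins.
SUFFIX_ROLES = [
    ("endem", "Adjektiv Dativ"),
    ("test", "Verb Singular KonjunktivII"),
]


def guess_role(grapheme: str, current_role: str = "") -> str:
    """Return a role for words with an empty role attribute, based on the word ending."""
    if current_role:
        # Assumption: an existing role is never overwritten.
        return current_role
    for suffix, role in SUFFIX_ROLES:
        if grapheme.endswith(suffix):
            return role
    return ""


print(guess_role("heißlaufendem"))
print(guess_role("zujubeltest"))
```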

16. Download Ralf's German dictionary 0.1.9.3, and import it into simon. Take a look into the shadow vocabulary:

[Screenshot: grammar]

a. The word hinschauendem has three terminals: Adjektiv, Dativ, Singular.
b. The adjective hinschauenden has two different pronunciations.
c. hinschauendes is a Neutrum.

Should I continue to add terminal information?

Ralf’s Swiss German dictionary

Wednesday, May 12th, 2010

How I create Ralf's Swiss German dictionary:

1. Get spelling dictionary. License is GPL.

2. Language code is de-CH (not mentioned in the Wikipedia, but you know the concept: de-DE for Ralf's German dictionary; de-AT for Ralf's Austrian German dictionary).

3. Ralf's Swiss German dictionary should become a sister project of Ralf's German dictionary. I hope that someone from Switzerland is willing to improve Ralf's Swiss German dictionary.

4. The encoding of de_CH_frami.dic and de_CH_frami.aff is ISO-8859-1. I will have to convert both files into UTF-8.

5. Convert de_CH_frami.dic into UTF-8:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ iconv -f ISO8859-1 -t UTF-8 < de_CH_frami.dic > swiss.dic

6. Convert de_CH_frami.aff into UTF-8:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ iconv -f ISO8859-1 -t UTF-8 < de_CH_frami.aff > swiss.aff

Change in the file swiss.aff the line SET ISO8859-1 into SET UTF-8.
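
For readers without iconv, steps 5 and 6 can be sketched in Python. This is only an illustration under the assumption that the files really are ISO-8859-1; the function names are hypothetical, and reading and writing the files is left out.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (what the iconv calls above do)."""
    return data.decode("iso-8859-1").encode("utf-8")


def fix_set_line(aff_text: str) -> str:
    """Change the 'SET ISO8859-1' declaration of an .aff file into 'SET UTF-8'."""
    lines = aff_text.splitlines(keepends=True)
    return "".join(
        line.replace("ISO8859-1", "UTF-8", 1) if line.startswith("SET ") else line
        for line in lines
    )


print(latin1_to_utf8(b"Stra\xdfe").decode("utf-8"))
print(fix_set_line("SET ISO8859-1\nTRY esi\n"), end="")
```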

7. Trying to generate a list with Swiss German words:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ unmunch swiss.dic swiss.aff > swiss-wordlist

Unfortunately, the result is not usable. I will have to find a different way. I think that I will use swiss.dic as source. Unfortunately, a lot of nouns in swiss.dic are written in lower-case (in the German language, nouns are always capitalized). Never mind, this has to be fixed later.

8. Add <lexicon> at the beginning of the file swiss.dic. Add </lexicon> at the end of swiss.dic.

9. Create XML file:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:swiss.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:swiss.xml

10. Remove substring after slash:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:swiss.xml -xsl:'substring-before-slash.xsl' -o:swiss-ssml.xml
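
What step 10 does to a single entry can be shown in a few lines of Python. This is a sketch and not the content of substring-before-slash.xsl; the flag string "EPS" is a made-up example. One detail is worth knowing: the XPath function substring-before() returns an empty string when there is no slash at all, whereas this sketch keeps such entries unchanged.

```python
def strip_flags(entry: str) -> str:
    """Keep only the word in front of the first slash of a .dic entry."""
    return entry.split("/", 1)[0].strip()


print(strip_flags("Abklatsch/EPS"))  # the flags "EPS" are hypothetical
print(strip_flags("und"))            # entry without flags stays as it is
```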

11. Generate Swiss eSpeak phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ espeak -f swiss-ssml.xml -m -v de -q -x --phonout="swiss-espeak"

12. Open the file swiss-espeak with Geany. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) by the sequence "</phoneme>\n<phoneme>" (Use escape sequences):

[Screenshot: use-escape-sequences]

13. Add <lexicon> at the beginning of the file swiss-espeak. Add </lexicon> at the end.

14. Paste <grapheme> and <phoneme> elements:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ paste swiss-ssml.xml swiss-espeak > swiss-pls

15. Edit swiss-pls with Geany so that it will become a valid PLS dictionary.
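
Steps 14 and 15 (paste plus manual editing) could also be done in one go. The following Python sketch pairs each grapheme with its phoneme string and writes the simplified <lexicon> structure used in the examples of this post. It is an assumption about a possible shortcut, the name build_lexicon is hypothetical, and a complete PLS file needs further attributes on the <lexicon> element that are not shown here.

```python
from xml.sax.saxutils import escape


def build_lexicon(graphemes, phonemes, lang="de-CH"):
    """Pair graphemes and phonemes line by line and return a simplified lexicon as text."""
    if len(graphemes) != len(phonemes):
        raise ValueError("word list and phoneme list must have the same length")
    parts = [f'<lexicon xml:lang="{lang}">']
    for grapheme, phoneme in zip(graphemes, phonemes):
        parts.append("<lexeme>")
        # escape() protects the characters &, < and > in the text content.
        parts.append(f"<grapheme>{escape(grapheme)}</grapheme>")
        parts.append(f"<phoneme>{escape(phoneme)}</phoneme>")
        parts.append("</lexeme>")
    parts.append("</lexicon>")
    return "\n".join(parts)


example = build_lexicon(["Abklatsch"], ["_!'apkl,atS"])
print(example)
```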

16. Convert some eSpeak phonemes into IPA phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:'swiss-pls' -xsl:'http://spirit.blau.in/simon/files/2010/05/ralfs-ipa-stylesheet.xsl' -o:'swiss-dictionary.xml'

17. It is necessary that I improve create-ralfs-ipa-stylesheet.xsl. At the moment, there are several German phonemes that aren’t converted. I am taking a look at this script. Why am I doing this? At the moment, ralfs-ipa-stylesheet.xsl contains almost no eSpeak-to-IPA conversion rules. Here are the XPath expressions that have to be specified for the German language:

matches(/lexicon/@xml:lang, 'de')
replace($espeak2ipa, '3', '3')
replace($espeak2ipa, '@', 'ə')
replace($espeak2ipa, '@-', 'ə-')
replace($espeak2ipa, 'a', 'a')
replace($espeak2ipa, 'A', 'A')
replace($espeak2ipa, 'A:', 'A:')
replace($espeak2ipa, 'aI', 'aI')
replace($espeak2ipa, 'aU', 'aU')
replace($espeak2ipa, 'E', 'ɛ')
replace($espeak2ipa, 'E2', 'ɛ2')
replace($espeak2ipa, 'E:', 'ɛ:')
replace($espeak2ipa, 'e:', 'eː')
replace($espeak2ipa, 'EI', 'ɛɪ̯')
replace($espeak2ipa, 'I', 'I')
replace($espeak2ipa, 'i2', 'i2')
replace($espeak2ipa, 'i:', 'iː')
replace($espeak2ipa, 'O', 'O')
replace($espeak2ipa, 'o:', 'oː')
replace($espeak2ipa, 'OY', 'OY')
replace($espeak2ipa, 'U', 'U')
replace($espeak2ipa, 'u:', 'uː')
replace($espeak2ipa, 'W', 'W')
replace($espeak2ipa, 'y', 'y')
replace($espeak2ipa, 'y:', 'y:')
replace($espeak2ipa, 'Y:', 'Y:')
replace($espeak2ipa, '\*', '*')
replace($espeak2ipa, ':', ':')
replace($espeak2ipa, ';', ';')
replace($espeak2ipa, 'b', 'b')
replace($espeak2ipa, 'C', 'C')
replace($espeak2ipa, 'd', 'd')
replace($espeak2ipa, 'D', 'D')
replace($espeak2ipa, 'dZ', 'dZ')
replace($espeak2ipa, 'f', 'f')
replace($espeak2ipa, 'g', 'g')
replace($espeak2ipa, 'g#', 'g#')
replace($espeak2ipa, 'h', 'h')
replace($espeak2ipa, 'j', 'j')
replace($espeak2ipa, 'k', 'k')
replace($espeak2ipa, 'l', 'l')
replace($espeak2ipa, 'm', 'm')
replace($espeak2ipa, 'n', 'n')
replace($espeak2ipa, 'N', 'N')
replace($espeak2ipa, 'p', 'p')
replace($espeak2ipa, 'pF', 'pF')
replace($espeak2ipa, 'r', 'r')
replace($espeak2ipa, 's', 's')
replace($espeak2ipa, 'S', 'S')
replace($espeak2ipa, 't', 't')
replace($espeak2ipa, 'tS', 'tS')
replace($espeak2ipa, 'ts', 'ts')

replace($espeak2ipa, 'v', 'v')
replace($espeak2ipa, 'w', 'w')
replace($espeak2ipa, 'x', 'x')
replace($espeak2ipa, 'z', 'z')
replace($espeak2ipa, 'Z', 'Z')

I won’t do a direct conversion from eSpeak phonemes to SAMPA phonemes. I want a conversion from eSpeak phonemes to IPA phonemes. During the PLS simon import process, the IPA phonemes will be transformed into SAMPA phonemes.

18. Do you see the XPath expression replace($espeak2ipa, 'tS', 'tS')? I think that maybe this indicates the voiceless alveolar affricate (e.g. "zehn" [t͡seːn]). Am I right? Or am I wrong? There is another XPath expression: replace($espeak2ipa, 'ts', 'ts'). At the moment, I am not sure which one indicates the voiceless alveolar affricate. Probably, eSpeak [ts] stands for IPA [t͡s]. And probably, eSpeak [tS] stands for the voiceless palato-alveolar affricate [t͡ʃ]. This means that I can add the following XPath expressions to create-ralfs-ipa-stylesheet.xsl:

replace($current-ipa, 'tS', 't͡ʃ')
replace($current-ipa, 'ts', 't͡s')

By the way, the current version of Ralf's German dictionary doesn’t contain the phones [t͡s] and [t͡ʃ]. Maybe I will add both phones to the next version of Ralf's German dictionary. At least, I will add the [t͡s] phone.
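
With rules like 'tS', 'ts', 't' and 'E:', 'E' in the same table, the order of the replacements matters: a short rule must not fire before a longer one. The following Python sketch shows one way to handle this with a single pass that tries the longest eSpeak symbol first. It is an illustration only, not the logic of create-ralfs-ipa-stylesheet.xsl (which chains replace() calls); the table contains just the mappings mentioned in this post, and the names ESPEAK_TO_IPA and espeak_to_ipa are hypothetical.

```python
import re

# Only mappings that appear in this post; the real table is much longer.
ESPEAK_TO_IPA = {
    "tS": "t\u0361ʃ",  # voiceless palato-alveolar affricate
    "ts": "t\u0361s",  # voiceless alveolar affricate
    "pF": "p\u0361f",
    "E:": "ɛː",
    "e:": "eː",
    "i:": "iː",
    "o:": "oː",
    "u:": "uː",
    "E": "ɛ",
    "@": "ə",
}

# Longest symbols first, so that "E:" wins over "E" and "tS" is seen as one unit.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(ESPEAK_TO_IPA, key=len, reverse=True))
)


def espeak_to_ipa(phonemes: str) -> str:
    """Replace known eSpeak symbols by IPA symbols; unknown symbols are kept."""
    return _PATTERN.sub(lambda match: ESPEAK_TO_IPA[match.group(0)], phonemes)


print(espeak_to_ipa("tse:n"))
print(espeak_to_ipa("atS"))
```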

19. I hope that you understand the strength of create-ralfs-ipa-stylesheet.xsl: I add the XPath expression replace($current-ipa, 'ts', 't͡s') to create-ralfs-ipa-stylesheet.xsl. This will influence PLS dictionaries that have one of the following xml:lang language codes: ca (Catalan), hu (Hungarian), de (Standard German, Swiss German, Austrian German), el (Greek), eo (Esperanto), hbs (I haven’t created a PLS dictionary with this language code), hy (Armenian – no PLS dictionary; I can use this spelling dictionary), it, lv, mk, pl, pt (espeak offers pt and pt-pt; probably both dialects use the same phone set), ru, sk, sq (Albanian – this is a dictionary that I should create using this spelling dictionary), and many more.

So I am adding the XPath expression replace($current-ipa, 'ts', 't͡s') once, and more than 10 PLS dictionaries will be affected.

20. Let’s be more specific, and take a look into my Swiss German PLS dictionary with eSpeak phonemes:

<lexeme>
<grapheme>Abklatsch</grapheme>
<phoneme>_!'apkl,atS</phoneme>
</lexeme>

And now, take a look at my “flagship”, Ralf’s German dictionary:

<lexeme role="Substantiv">
<grapheme>Abklatsch</grapheme>
<phoneme>ʔapklatʃ</phoneme>
</lexeme>

The final result of both dictionaries (Ralf's Swiss German dictionary and Ralf's German dictionary) should be: <phoneme>ʔapklat͡ʃ</phoneme>. I will achieve this goal by adjusting create-ralfs-ipa-stylesheet.xsl for Swiss German (de-CH). In contrast, Ralf's German dictionary (de-DE) already contains IPA phonemes. For this Standard German dictionary, I am using a specific .xsl style-sheet.

21. You can see that I am working as abstractly as possible (one .xsl style-sheet for all eSpeak languages), and as concretely as necessary. The phoneme adjustments are done at the appropriate level:

a. Abstract style-sheet for all eSpeak languages: create-ralfs-ipa-stylesheet.xsl
b. Concrete adjustments for a specific eSpeak language can be done here: ralfs-ipa-stylesheet.xsl
c. Fine-Tuning for Standard German: improve-german-dictionary.xsl

A native speaker from Switzerland could develop a specific .xsl style-sheet for the fine-tuning of the Swiss German PLS dictionary.

22. Adding the XPath expression replace($current-ipa, 'E:', 'ɛː') to create-ralfs-ipa-stylesheet.xsl.

23. What should I do with the expression replace($espeak2ipa, 'OY', 'OY')? I should follow the Wiktionary:

[ɔɪ̯] U+0254, U+026A, U+032F Heu /[hɔɪ̯]/, Läufer /[ˈlɔɪ̯fɐ]/, neu /[nɔɪ̯]/

The simon import process accepts the phone [ɔɪ̯]. According to the Wikipedia, the following transcription would be possible:

Instead of the transcription /ɔ͡ʏ/, the transcription /ɔ͡ɪ/ is used as well.

You can see that there are several possible solutions. We have to decide which solution we want to use for a specific language. For Standard German, we use [ɔɪ̯]. Other languages may use different transcriptions for the diphthong [ɔɪ̯].

24. And now, I learned something:

Diphthongs in German:

* [aɪ̯] as in Reich ‘empire’
* [aʊ̯] as in Maus ‘mouse’
* [ɔʏ̯] as in neu ‘new’
* [eːɐ̯] as in sehr ‘very’
* [iːɐ̯] as in dir ‘you (dative)’
* [oːɐ̯] as in Bor ‘boron (element)’
* [øːɐ̯] as in Öhr ‘eye (hole in a needle)’
* [uːɐ̯] as in nur ‘only’
* [yːɐ̯] as in Tür ‘door’

Some diphthongs in Bernese, a Swiss German dialect:

* [iə̯] as in Bier ‘beer’
* [yə̯] as in Fuß ‘feet’
* [uə̯] as in Schue ‘shoes’
* [ou̯] as in Stou ‘holdup’
* [au̯] as in Stau ‘stable’
* [aːu̯] as in Staau ‘steel’
* [æu̯] as in Wäut ‘world’
* [æːu̯] as in wääut ‘elects’
* [ʊu̯] as in tschúud ‘guilty’

Great. Now we are coming to a more specific level: Bernese German (Bärndütsch). You can see that Bernese German has specific diphthongs. It is necessary to develop a Bernese German PLS dictionary. We go the following way:
- First, I develop Ralf's German dictionary for Standard German.
- Second, I develop Ralf's Swiss German dictionary for Switzerland.
- Third, someone who lives “in the Swiss plateau (Mittelland) part of the canton of Bern” should develop a Bernese German PLS dictionary.

25. Let me make one thing clear: If you speak Bernese German, you should give Ralf's German dictionary a try. Use my flagship dictionary to train a few Bernese German words. It is necessary that you understand the concept. But of course, Standard German is different from Bernese German. For good recognition results, it is necessary that you have your own PLS dictionary with the Bernese-specific diphthong [yə̯]. At the moment, there is no Bernese German PLS dictionary available (as far as I know). So you should use Ralf's Swiss German dictionary or Ralf's German dictionary.

In the long run, we need specific dialect dictionaries for the Swiss German language:
- Basel German (Baslerdüütsch) PLS dictionary
- Walliser German (Wallisertiitsch) PLS dictionary
- Walser German (Walserdeutsch) PLS dictionary
- Zürich German (Züritüütsch) PLS dictionary

For good recognition results, the PLS dictionary has to match your dialect. And this is why I am a fan of the IPA: You can be as specific as necessary. And we have a common standard that is applicable to all languages. So we use standards (UTF-8, PLS, IPA, XML, XSLT, GPLv3) that are recognized worldwide. And for each Swiss German dialect, there is a solution that can be developed.

26. I am adding the new phoneme /p͡f/ with the XPath expression replace($current-ipa, 'pF', 'p͡f'). It is similar to the voiceless labiodental affricate:

German has a similar sound in Pfeffer [ˈp͡fɛfˑɐ] (‘pepper’) and Apfel [ˈapˑ͡fl̩] (‘apple’). This /p͡f/ only occurs word-initially and behind short vowels, though it differs from a true labiodental affricate in that it starts out bilabial but then the lower lip retracts slightly for the frication.

Question: do we need this phoneme /p͡f/?

27. OK, let’s create the style-sheet:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-phonemes.xml' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/create-ralfs-ipa-stylesheet.xsl' -o:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl'

28. And now, let’s transform the eSpeak phonemes into IPA phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/swiss-german/de_CH_frami$ saxonb-xslt -s:'swiss-pls' -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/ralfs-ipa-stylesheet.xsl' -o:'swiss-dictionary.xml'

29. Download Ralf's Swiss German dictionary, and import it into simon.

[Screenshot: lautsprecherbox]

Left column: Swiss German words. Unfortunately, a lot of nouns are written in lower-case.
Right column: Corresponding SAMPA phonemes.

You can imagine the following problem: The phonemes l aU tS p R E C ah b O k s are not perfect. Take a look at the corresponding entry in the PLS dictionary:

<lexeme>
<grapheme>Lautsprecherbox</grapheme>
<phoneme>l'aʊtʃpʀɛçɐb,ɔks</phoneme>
</lexeme>

I am not sure whether the phonemes /t/ and /ʃ/ should be treated as one single phoneme /t͡ʃ/, or not (see step 18 above).

30. I changed the code of Ralf's Swiss German dictionary:

<lexeme>
<grapheme>Lautsprecherbox</grapheme>
<phoneme>l'aʊt͡ʃpʀɛçɐb,ɔks</phoneme>
</lexeme>

I didn’t know what the result would look like when I import this into simon, so I tested it. There is no difference for the end-user.

Convert German words like “zehn”

Saturday, May 1st, 2010

I want to convert German words like “zehn”. Here is what I do: I am developing the style sheet improve-german-dictionary.xsl that contains the following lines:

<xsl:when test="contains(grapheme, '10')">
<xsl:for-each select="phoneme"><xsl:text>
</xsl:text><phoneme>
<xsl:variable name="sierra"><xsl:value-of select="."/></xsl:variable>
<xsl:variable name="sierra" select="replace($sierra, 'tseːn', 't͡seːn')"/>
<xsl:sequence select="$sierra"/></phoneme>
</xsl:for-each>
</xsl:when>

I am invoking this style sheet via the Ubuntu terminal:

ubuntu@ubuntu-desktop:~/Documents/201005/german-0.1.9.3$ saxonb-xslt -s:german-0.1.9.2.xml -xsl:improve-german-dictionary.xsl -o:german-0.1.9.3.xml

This is an example <lexeme> element from the source file german-0.1.9.2.xml:

<lexeme role="Zahlwort">
<grapheme>10%</grapheme>
<phoneme>tseːnpʀotsɛnt</phoneme>
<phoneme>tseːnʔpʀoːtsɛnt</phoneme>
</lexeme>

This is an example <lexeme> element from the output file german-0.1.9.3.xml:

<lexeme role="Zahlwort">
<grapheme>10%</grapheme>
<phoneme>t͡seːnpʀotsɛnt</phoneme>
<phoneme>t͡seːnʔpʀoːtsɛnt</phoneme>
</lexeme>

You can see that I don’t convert all occurrences of "ts" automatically. Only specific phonemes will be converted. The improvement of Ralf's German dictionary is done carefully. In the example, the phonemes [t] and [s] will only be converted into the phoneme [t͡s] if
- the <grapheme> element contains the sequence "10",
- and the phonemes [t] and [s] are part of the sequence "tseːn".

My approach guarantees that only phonemes matching the specific criteria are changed. This approach is efficient, and very precise. You can see: XSLT is a great language for PLS dictionary development.
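
To make the two conditions concrete, here is the same rule as a Python sketch with the standard library module xml.etree.ElementTree. It is not the style-sheet itself, only an illustration; the function name improve is hypothetical, and the second lexeme in the sample is there to show that other "ts" sequences stay untouched.

```python
import xml.etree.ElementTree as ET

SOURCE = """<lexicon>
<lexeme role="Zahlwort">
<grapheme>10%</grapheme>
<phoneme>tseːnpʀotsɛnt</phoneme>
<phoneme>tseːnʔpʀoːtsɛnt</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Äquivalents</grapheme>
<phoneme>ʔɛːkviːvalɛnts</phoneme>
</lexeme>
</lexicon>"""


def improve(root):
    """Convert 'tseːn' only inside lexemes whose grapheme contains '10'."""
    for lexeme in root.iter("lexeme"):
        grapheme = lexeme.findtext("grapheme", default="")
        if "10" in grapheme:
            for phoneme in lexeme.findall("phoneme"):
                phoneme.text = (phoneme.text or "").replace("tseːn", "t\u0361seːn")
    return root


root = improve(ET.fromstring(SOURCE))
result = [phoneme.text for phoneme in root.iter("phoneme")]
print(result)
```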

Please implement the phoneme [t͡s] into the simon PLS import process. The next version of Ralf's German dictionary will contain this phoneme because obviously it is necessary to get better recognition results.

Remark: I don’t distinguish between phones and phonemes. This means that my terminology might not be linguistically correct. My approach is to employ as many nodes (phones/phonemes) as necessary to get acceptable recognition results.

Voiceless alveolar affricate

Tuesday, April 27th, 2010

I will have to think about the voiceless alveolar affricate. Maybe I should convert German words like "zehn" [t͡seːn].

If problems occur with the recognition, and you think that something is wrong with the phonemes, your feedback is welcome.

341 words from German Voxforge prompts

Tuesday, April 27th, 2010

Here is how I create additional German words for Ralf's German dictionary:

1. Get word list with words that are included in the German Voxforge prompts, but are missing in Ralf's German dictionary.

2. Add <lexicon> tag at the beginning (<lexicon>) and at the end (</lexicon>) of the word list 341-german-words with gedit.

3. I will search for the sequence "\n" (backslash-n). This sequence will be replaced by the expression "</phoneme>\n<phoneme>". Press Replace All.

4. I made a small mistake. Now I am replacing each occurrence of the word "phoneme" by the word "audio".

5. And I made another small mistake. I am replacing the <lexicon> tag at the beginning (<lexicon>) and at the end (</lexicon>) of the word list by <speak> at the beginning and by </speak> at the end.

6. Generating German phonemes with eSpeak:

~/Documents/201004/german-additional-words$ espeak -f 341-german-words -m -v de -q -x --phonout="341-german-words-espeak"

7. Add <lexicon> tag at the beginning (<lexicon>) and at the end (</lexicon>) of 341-german-words-espeak.

8. Create 341 German <phoneme> elements:

~/Documents/201004/german-additional-words$ saxonb-xslt -ext:on -s:341-german-words-espeak -xsl:'create-phoneme-elements.xsl' -o:341-german-phoneme-elements.xml

At the moment, the 341 German <phoneme> elements contain eSpeak phonemes. These eSpeak phonemes will have to be converted into IPA phonemes. This will be done in a later step. Before I do the conversion from eSpeak phonemes to IPA phonemes, I have to merge the file 341-german-words with the file 341-german-phoneme-elements.xml.

9. Ubuntu terminal:

~/Documents/201004/german-additional-words$ paste 341-german-words 341-german-phoneme-elements.xml > 341-german-lexeme-elements

10. I am now editing the file 341-german-lexeme-elements with gedit so that it will be a valid XML file.

11. Converting eSpeak phonemes into IPA phonemes with an old conversion style-sheet that should do the job:

~/Documents/201004/german-additional-words$ saxonb-xslt -ext:on -s:341-german-lexeme-elements -xsl:'/home/am3msi/Documents/200910/espeak2perfectipa.xsl' -o:341-german-pls.xml

12. Deleting the folder /home/am3msi/.kde/share/apps/simon.

13. Applications > Universal Access > simon. Import /home/am3msi/Documents/201004/german-additional-words/341-german-pls.xml as PLS dictionary into simon. The result is acceptable.

14. I have included the 341 German words in the new version of Ralf's German dictionary (version 0.1.9.1; April 27, 2010). These 341 additional words are located at the end of the file.

15. The dictionary should now cover all words that are part of the German Voxforge prompts. Suggestions for improvements of Ralf's German dictionary are always welcome.

Omitting the “Knacklaut”

Tuesday, April 6th, 2010

This example explains two phonetic errors of the <phoneme> element laɪ̯tʔʊngsvɛʀbɪndʊŋən that you can find in Ralf's German dictionary 0.1.9:

<lexeme role="Substantiv">
<grapheme>Leitungsverbindungen</grapheme>
<phoneme>laɪ̯tʔʊngsvɛʀbɪndʊŋən</phoneme>
</lexeme>

What is wrong? There are two errors:

1. The “Knacklaut” ʔ should be omitted in this specific context. Obviously, simon always omits the Knacklaut ʔ when I import Ralf's German dictionary into simon:

[Screenshot: leitungsverbindungen]

So simon omits the “Knacklaut” ʔ, and the result is correct for this specific word. In my opinion, simon shouldn’t omit the “Knacklaut” because I think that the recognition results would be better for other words. Let’s take a look into the Wiktionary:

Glottal plosives ([ʔ], so-called Knacklaute) are only written word-internally, e.g. Theater [...eʔaː...], aufatmen [ˈaʊ̯fˌʔaːtmən].

And now take a look into Ralf's German dictionary:

<lexeme role="Substantiv">
<grapheme>Theaterstücken</grapheme>
<phoneme>teːatɐʃtykən</phoneme>
<phoneme>teːatɐʃtykŋ̩</phoneme>
</lexeme>

There isn’t a “Knacklaut” in teːatɐʃtykən, but this <phoneme> element could have one: teːʔatɐʃtykən. And it would be nice if simon didn’t omit the Knacklaut when the dictionary offers one. Of course, I don’t know whether the recognition results would be better, but I believe that the “Knacklaut” should be part of the imported dictionary.

2. I will have to fix another error: The phoneme v should be replaced by the phoneme f. The result will be: laɪ̯tʊngsfɛʀbɪndʊŋən or laɪ̯tʊngsfɐbɪndʊŋən.

Why am I writing these things? Because I believe that it is important. The dictionary has to catch every phoneme. And the “Knacklaut” is a phoneme, too. There are errors (e.g. laɪ̯tʔʊngsvɛʀbɪndʊŋən is wrong; laɪ̯tʊngsfɛʀbɪndʊŋən is correct). And there are small things that are tolerable (e.g. laɪ̯tʊngsfɛʀbɪndʊŋən is tolerable; laɪ̯tʊngsfɐbɪndʊŋən is better). Another example: teːatɐʃtykən is tolerable; teːʔatɐʃtykən is better.

I would suggest that a future version of simon should catch the “Knacklaut” ʔ during the import of the dictionary.

Fixing phonemic errors: “tschalpen”

Sunday, April 4th, 2010

Here is an example that explains how I am fixing errors that occur in Ralf's German dictionary 0.1.8:

<lexeme>
<grapheme>tschalpen</grapheme>
<phoneme>teːʃalpən</phoneme>
<phoneme>teːʃalpm̩</phoneme>
</lexeme>

Because I don’t know the meaning of the German word “tschalpen”, I take a look into the Wiktionary. OK, now I know that this word exists in the German language (obviously, it is Swiss German). So this word could be part of my future dictionary Ralf's Swiss German dictionary (I want to create such a dictionary, but this task has low priority). So for the moment, the Swiss German word tschalpen is part of Ralf's German dictionary (= Standard German).

The correct pronunciations of the Swiss German word tschalpen are tʃalpən and tʃalpm̩. The two pronunciations in the entry above were generated with a previous version of my style-sheet espeak2perfectipa.xsl. How do I invoke the style-sheet? I am using the following command in the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201002/0.1.9$ saxonb-xslt -ext:on -s:german-dictionary-0.1.8.xml -xsl:espeak2perfectipa.xsl -o:german-0.1.9.xml

With this command, I transform version 0.1.8 into version 0.1.9 which I am planning to release during the next couple of days or weeks.

Well, the pronunciation of tschalpen is obviously wrong in Ralf's German dictionary 0.1.8: teːʃalpən and teːʃalpm̩. There is a long “e” vowel which shouldn’t exist. This error will be fixed with the expression select="replace($sierra, 'teːʃ', 'tʃ')" that is part of the style-sheet espeak2perfectipa.xsl. And this is the result:

<lexeme role="">
<grapheme>tschalpen</grapheme>
<phoneme>tʃalpən</phoneme>
<phoneme>tʃalpm̩</phoneme>
</lexeme>

You can see that the phoneme quality is now OK. The long “e” has been removed.

Error in Ralf’s German dictionary

Sunday, April 4th, 2010

I just found out that some lexeme elements of Ralf’s German dictionary 0.1.8 contain a specific kind of error. Here is an example:

<lexeme role="Substantiv">
<grapheme>Zufahrten</grapheme>
<element name="phoneme">tsuːfaːɐ̯tən</element>
</lexeme>

The name of the element shouldn’t be element. The name should be phoneme. Why did this error happen? Let’s take a look into the style-sheet espeak2perfectipa.xsl (I am using this style-sheet to fix recurring errors in the dictionary). Here is what introduced the wrong element name:

<xsl:when test="contains(grapheme, 'ahrt')"> <!-- modify phoneme if grapheme contains .. -->
<xsl:for-each select="phoneme"><xsl:text>
</xsl:text><element name="phoneme">
<xsl:variable name="sierra"><xsl:value-of select="."/></xsl:variable>
<xsl:variable name="sierra" select="replace($sierra, 'aːʀt', 'aːɐ̯t')"/>
<xsl:sequence select="$sierra"/></element>
</xsl:for-each>
</xsl:when>

I will fix this error. It is no big deal because just a minor part of the dictionary is affected.
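
One possible way to repair the affected entries afterwards is to rename the wrong elements. The following Python sketch uses the standard library module xml.etree.ElementTree and the example from above; it is only an assumption about how the repair could be done, and the function name fix_element_names is hypothetical.

```python
import xml.etree.ElementTree as ET

BROKEN = """<lexicon>
<lexeme role="Substantiv">
<grapheme>Zufahrten</grapheme>
<element name="phoneme">tsuːfaːɐ̯tən</element>
</lexeme>
</lexicon>"""


def fix_element_names(root):
    """Turn <element name="phoneme"> into <phoneme> and return the number of fixes."""
    fixed = 0
    # list() takes a snapshot, so renaming cannot disturb the iteration.
    for element in list(root.iter("element")):
        if element.get("name") == "phoneme":
            element.tag = "phoneme"
            del element.attrib["name"]
            fixed += 1
    return fixed


root = ET.fromstring(BROKEN)
fixed = fix_element_names(root)
print(fixed, ET.tostring(root, encoding="unicode"))
```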

Let me explain what happened: I used the expression select="replace($sierra, 'aːʀt', 'aːɐ̯t')" to improve the quality of the r phoneme. Is the “r” in Zufahrt a consonant or a vowel? It is a consonant, but it is spoken like a vowel because it follows the long vowel “a” in Zufahrt.

These are difficult phonetic improvements. German native speakers normally don’t think about these details because they speak the German word Zufahrt unconsciously with the correct pronunciation.

The “r” consonant is always difficult. People who immigrated into Germany (e.g. from Turkey or from Russia) often speak the “r” differently. Maybe in Austrian German, the “r” needs specific phonetic treatment.

Never mind, I am developing Ralf's German dictionary for Standard German. I improved the phonetic quality of the “r” in Zufahrt, but I introduced the error with the element name.

That is the way it is: Fixing an error (= improving the “r” phoneme in this specific phonetic context) while introducing a new one (= giving the element the wrong name).

Ralf’s German dictionary 0.1.8

Thursday, February 18th, 2010

Ralf’s German dictionary (version 0.1.8; February 18, 2010) is available, and can be imported into simon as shadow dictionary.

It is slightly better than the previous version 0.1.7 (not substantially better).

Here is how I built this dictionary via the Ubuntu terminal:

$ saxonb-xslt -ext:on -s:german-dictionary-0.1.7.xml -xsl:espeak2perfectipa.xsl -o:prepare-0.1.8.xml

I don’t plan to add more words to the next version of this dictionary. The current focus is phoneme improvement. E.g. replace long vowels with short vowels and vice versa when necessary. To achieve this goal, I will have to modify the XSLT style-sheet espeak2perfectipa.xsl.

Ralf’s German dictionary (version 0.1.6)

Saturday, October 24th, 2009

You can download Ralf’s German dictionary (version 0.1.6). It has the following features:

Many lexemes now contain a role attribute, e.g.:

<lexeme role="Substantiv">
<grapheme>Simon</grapheme>
<phoneme>ziːmɔn</phoneme>
</lexeme>

I hope that a future version of simon makes use of the role attribute. Maybe I will transform Ralf's German dictionary into the Hadifix format. This would have the advantage that terminal information (= Category) would be available for many lexemes. Here is an example of a verb:

<lexeme role="Verb">
<grapheme>anreden</grapheme>
<phoneme>ʔanʀeːdən</phoneme>
<phoneme>ʔanʀeːdn̩</phoneme>
</lexeme>

You can see that two pronunciations are available. You can use both pronunciations with simon. Internally, simon uses the following SAMPA notations:

a n R e: d @ n
a n R e: d n=
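The step from the IPA strings to these SAMPA lines can be sketched in Python as follows. This is only an illustration: the mapping table covers just the symbols of the anreden example, and both the table and the dropping of the glottal stop are assumptions derived from the two lines above, not simon's actual conversion code.

```python
# Partial IPA -> SAMPA table covering only the symbols in the anreden example.
# The table and the handling of the glottal stop are assumptions made for
# illustration; simon's real conversion is not shown in this post.
IPA_TO_SAMPA = {
    "\u0280": "R",  # ʀ
    "\u0259": "@",  # ə
}
LENGTH_MARK = "\u02d0"    # ː, written ":" in SAMPA
SYLLABIC_MARK = "\u0329"  # combining mark under n̩, written "=" in SAMPA
GLOTTAL_STOP = "\u0294"   # ʔ, absent from the SAMPA lines above


def ipa_to_sampa(ipa: str) -> str:
    """Convert an IPA string to space-separated SAMPA symbols."""
    symbols = []
    for ch in ipa:
        if ch == GLOTTAL_STOP:
            continue                      # dropped, as in the output above
        if ch == LENGTH_MARK and symbols:
            symbols[-1] += ":"            # attach length to the previous symbol
        elif ch == SYLLABIC_MARK and symbols:
            symbols[-1] += "="            # mark the previous consonant as syllabic
        else:
            symbols.append(IPA_TO_SAMPA.get(ch, ch))
    return " ".join(symbols)


for ipa in ("ʔanʀeːdən", "ʔanʀeːdn̩"):
    print(ipa_to_sampa(ipa))
```

Symbols that are missing from the table are passed through unchanged, so a full converter would need a complete table for the German phoneme set.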

Here is another example:

<lexeme>
<grapheme>eintippen</grapheme>
<phoneme>ʔaɪ̯ntɪpən</phoneme>
<phoneme>ʔaɪ̯ntɪpm̩</phoneme>
</lexeme>

You can see that the second pronunciation ends with an m̩ (and not with an n̩). This means that you can speak intuitively.

I added more than 10,000 alternative pronunciations to the current version of the dictionary using this stylesheet.
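How such alternatives could be generated is sketched below in Python. The rule is inferred only from the two examples in this post (anreden and eintippen) and is an assumption; the stylesheet itself may use different or additional rules, and the function name is hypothetical.

```python
from typing import Optional

SCHWA_N = "\u0259n"  # the ending ən
SYLLABIC = "\u0329"  # combining "syllabic" mark, as in n̩ and m̩


def alternative_pronunciation(ipa: str) -> Optional[str]:
    """Return a reduced variant for a phoneme string ending in ən, else None.

    Rule inferred from the two examples in this post only (an assumption):
    final ən becomes m̩ after p, and n̩ otherwise.
    """
    if not ipa.endswith(SCHWA_N):
        return None
    stem = ipa[: -len(SCHWA_N)]
    nasal = "m" if stem.endswith("p") else "n"
    return stem + nasal + SYLLABIC


for ipa in ("ʔanʀeːdən", "ʔaɪ̯ntɪpən"):
    print(ipa, alternative_pronunciation(ipa))
```

Each generated variant would be written as a second phoneme element in the same lexeme, next to the original pronunciation.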

Ralf’s Austrian German dictionary

Saturday, October 10th, 2009

Take a look at Ralf’s Austrian German dictionary (license: GPL). It contains more than 130 words which are not included in Ralf’s German dictionary (version 0.1.2).

The following usage is intended:
1. Download and import Ralf’s German dictionary. It contains Standard German, and is targeted at all people who speak the German language.
2. Download and import Ralf’s Austrian German dictionary if you want to use this dialect-specific vocabulary.

I would like to create a dictionary for Swiss German. It would be nice if someone would point me to a GPL word list with Swiss German words.

Ralf’s German dictionary (version 0.1.2)

Friday, October 9th, 2009

You can download and import Ralf’s German PLS dictionary (version 0.1.2; October 9, 2009) with more than 100,000 words into simon.

These are the differences to version 0.1.1:
- increased the size of the dictionary (from about 40,000 words to more than 100,000);
- removed duplicate graphemes/phonemes;
- improved phoneme quality, e.g. the word Vogel is no longer transcribed as voːgəl; the phonemes are now corrected to foːgəl. As a result, many Vogel words are now transcribed correctly: Donnervogel, Eisvogel, Feuervogel, Galgenvogel, Hornvogel, Hühnervogel, Jungvogel, and many more:

[Screenshot: feuervogel]

The dictionary is getting better and better. Of course, there is a lot of work to be done.

The following may be of interest to people who work on dictionary development:

I am using this stylesheet to improve the phoneme quality. In the stylesheet, you can find the following line which replaces voːgəl by foːgəl:

<xsl:variable name="phoneme-vogel" select="replace($phoneme-vertrags, 'voːgəl', 'foːgəl')"/>

eSpeak produces phonemes that aren’t always correct. I correct recurring phoneme errors with this stylesheet. This is an efficient approach to dictionary development, because it would be too laborious to correct each word separately. Thanks to XPath, I can adjust the required level of phoneme correction very well. So you can see that it is no big deal to improve Ralf’s German dictionary. XPath is the right language for PLS dictionary development.
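The idea of correcting a recurring error once, for every word that contains it, can be sketched in Python as follows. This is a stand-in for the XSLT variable chain, not the stylesheet itself: only the Vogel rule comes from this post, while the sample entry, its transcription, and the names RULES and correct are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Ordered (wrong, corrected) pairs. Only the Vogel rule comes from the post;
# further rules would be appended to this list.
RULES = [
    ("voːgəl", "foːgəl"),
]


def correct(phoneme: str) -> str:
    """Apply every rule in order, feeding each result into the next rule."""
    for wrong, right in RULES:
        phoneme = phoneme.replace(wrong, right)
    return phoneme


# Tiny stand-in for the dictionary; the transcription is illustrative.
SAMPLE = """<lexicon>
  <lexeme><grapheme>Eisvogel</grapheme><phoneme>aɪ̯svoːgəl</phoneme></lexeme>
</lexicon>"""

root = ET.fromstring(SAMPLE)
for phoneme in root.iter("phoneme"):
    phoneme.text = correct(phoneme.text)

print(ET.tostring(root, encoding="unicode"))
```

Feeding each result into the next rule mirrors the way each xsl:variable in the stylesheet takes the previous variable as its input.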

The size of the dictionary increases. To handle the increased amount of information, it is necessary to navigate in the PLS dictionary. This can easily be done with an XSLT stylesheet. I want to develop an XSLT stylesheet that transforms the PLS dictionary into Sphinx format.

Import 40,000 German words

Thursday, October 1st, 2009

You can import Ralf’s German dictionary (version 0.1.1) with more than 40,000 words into simon. Download the dictionary, then import it into simon:

[Screenshot: dictionary]

I hope that it works with simon. I didn’t use it for training. But it should work. It contains many known errors.

Standards of this dictionary: PLS, GPL, IPA, UTF-8.

If there are compatibility issues with simon, please report back.

It might be possible to convert this lexicon into a BOMP compatible format. I would need a way to get information about the terminals (= Wortarten). I don’t know at the moment how I could achieve this goal.

Creating Ralf’s German dictionary

Tuesday, September 22nd, 2009

To get an impression of how I create the German PLS dictionary, watch the video (19.2 MB, WMV):

[20100101: video removed]

Currently, I am preparing a new version of Ralf’s German dictionary. The dictionary should be 100% simon compatible (version 0.1 contains some minor mistakes).

This is what I did yesterday:
1. I created more than 80,000 pronunciations with eSpeak from a set of 300,000 words. Not all words were transcribed; I don’t know what went wrong.
2. Then I created an XSLT stylesheet to transform the eSpeak phoneset into IPA with saxonb-xslt.
3. The result was that I had a list of the phonemes, but the graphemes were missing. What could I do? I decided to start dictating the missing graphemes with DNS 9.5. You can see the dictation process in the video.

Ralf’s German dictionary

Saturday, September 12th, 2009

In this article, I will explain how to import Ralf’s German dictionary into simon, and you will read about some of the properties of this dictionary.

[Screenshot: universal]

1. Select Applications > Universal Access > simon.

[Screenshot: import-dictionary]

2. Press the Word list button.
3. Press Import Dictionary.

[Screenshot: shadow-dictionary]

4. You can select the target: shadow dictionary or active dictionary. What is the right choice? For dictionary development, I often choose active dictionary (so that I have a dictionary in HTK compatible format which I use in conjunction with sam). But let’s now choose the shadow dictionary as target.

5. Press the Next > button.

[Screenshot: hadifix-htk-pls]

6. You can choose between different lexicon types: Hadifix, HTK, PLS, and Sphinx. Select PLS.
7. Press the Next > button.

[Screenshot: save-page]

8. You are now here: http://script.blau.in/xml/german.xml
9. Save Page As... doesn’t work. I just tried that. If you choose this option, the page will be saved as an HTML file. You have to choose a different way.

[Screenshot: page-source]

10. Select View Page Source.

[Screenshot: lexeme-grapheme]

11. You can now see the source of the page http://script.blau.in/xml/german.xml.

12. The encoding of the page is UTF-8. This encoding ensures that even languages like Hebrew can be processed correctly. You can imagine that UTF-8 is a very good standard for all languages.

13. Let’s take a look at the address of the style sheet http://script.blau.in/xml/ralf-german-dictionary.xsl. This style sheet document changes the appearance of Ralf’s German dictionary when you view it with Firefox.

14. The license is GPL. It would be great if someone would expand the German dictionary.

15. The dictionary has a specific tree structure using the elements lexicon, lexeme, grapheme, phoneme.

16. In the page source window, select Save Page As... to save the XML file.

[Screenshot: import]

17. Choose the location of Ralf’s German dictionary that you downloaded a few moments ago. On my computer, the XML file is located here: /home/liberty/200909/german.xml.

[Screenshot: finish]

18. Ralf’s German dictionary has been imported successfully.
19. Press the Finish button.

[Screenshot: maschine]

20. To take a look at the imported dictionary, select Include unused words from the shadow lexicon.
21. Drag and drop the word Maschine into the white area.

[Screenshot: train-selected]

22. You want to train the word Maschine.
23. Press the button to start with the training.

[Screenshot: add-maschine]

24. Currently, the word Maschine is just part of the shadow lexicon. It is not part of the active lexicon. Press Yes to add it to the active lexicon.

[Screenshot: sampa]

25. You want to define the pronunciation of the word Maschine. The pronunciation is being displayed in SAMPA.
26. I find the concept of terminals difficult; it is explained in the simon handbook. I am using the terminal Unknown.

OK, I am finishing here. If you want to know more about simon, please read the simon handbook (PDF).

You should now be able to import Ralf’s German dictionary into simon.
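For readers who want to inspect the downloaded file outside simon, the tree structure from step 15 (lexicon, lexeme, grapheme, phoneme) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the sample below reuses two entries shown elsewhere on this page instead of the real german.xml, it assumes that the elements carry no XML namespace, and the function name read_lexicon is hypothetical.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for german.xml. The real file may declare an XML namespace
# on <lexicon>; this sample assumes none, so plain tag names are enough.
SAMPLE = """<lexicon>
  <lexeme role="Substantiv">
    <grapheme>Simon</grapheme>
    <phoneme>ziːmɔn</phoneme>
  </lexeme>
  <lexeme role="Verb">
    <grapheme>anreden</grapheme>
    <phoneme>ʔanʀeːdən</phoneme>
    <phoneme>ʔanʀeːdn̩</phoneme>
  </lexeme>
</lexicon>"""


def read_lexicon(xml_bytes: bytes):
    """Return a list of (grapheme, role, [phonemes]) tuples, one per lexeme."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for lexeme in root.iter("lexeme"):
        grapheme = lexeme.findtext("grapheme")
        role = lexeme.get("role")  # None if the lexeme has no role attribute
        phonemes = [p.text for p in lexeme.findall("phoneme")]
        entries.append((grapheme, role, phonemes))
    return entries


# The dictionary is UTF-8 encoded, so the sample is passed as UTF-8 bytes.
for grapheme, role, phonemes in read_lexicon(SAMPLE.encode("utf-8")):
    print(grapheme, role, phonemes)
```

To try it on the downloaded dictionary, the bytes of the saved file (for example /home/liberty/200909/german.xml) could be passed to read_lexicon instead of the sample.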