Archive for the ‘dictionary’ Category

Ralf’s German dictionary 0.2.1

Monday, August 30th, 2010

How I build Ralf's German dictionary version 0.2.1: (more…)

Symbol of the “Ach-Laut”

Monday, August 23rd, 2010

Ralf's German dictionary uses the plain [x] as the symbol for the Ach-Laut. According to Wiktionary, the Ach-Laut is represented by [χ] (U+03C7):

lachen /[ˈlaχn̩]/, Dach /[daχ]/, Bucht /[bʊχt]/, doch /[dɔχ]/, auch /[aʊ̯χ]/

The current development version of simon doesn’t accept [χ]. I would like to change [x] to [χ] in a future version of Ralf's German dictionary.
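
A minimal sketch of the planned change, written in Python rather than in the XSLT used for the dictionary. It assumes that every [x] in a phoneme string stands for the Ach-Laut; the function name and the sample transcriptions are only illustrations.

```python
def use_chi_for_ach_laut(phoneme: str) -> str:
    """Replace the sign [x] with [χ] (U+03C7) in one phoneme string."""
    # Assumption: "x" never has another meaning inside the transcription.
    return phoneme.replace("x", "\u03c7")


# Illustrative transcriptions in the [x] style.
for word, ipa in [("Dach", "dax"), ("doch", "dɔx")]:
    print(word, use_chi_for_ach_laut(ipa))
```

The replacement only makes sense once simon accepts [χ], which the current development version does not.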

Ralf’s Latin speech model 0.1.1

Saturday, August 21st, 2010

Take a look at Ralf's Latin dictionary - German pronunciation (version 0.1.1; 2010-08-21), and download Ralf's Latin speech model 0.1.1. Of course, this is an early approach. The speech model contains about 800 different Latin words.

Ralf’s German speech model 0.1.4

Saturday, August 21st, 2010

Some information about Ralf’s German speech model version 0.1.4 (download):

1. Paths to the files on my computer:

file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/hmm24/hmmdefs
file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/tiedlist
file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/hmm24/macros
file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/stats

2. The hmmdefs file of Ralf’s German speech model 0.1.3 has a size of 28 MB. In Ralf's German speech model 0.1.4, which I am preparing at the moment, hmmdefs is only 3.7 MB. I don’t know the reason for this.

3. I haven’t tested whether Ralf’s German speech model 0.1.4 works as a static model.

4. Ralf’s German speech model 0.1.4 contains a vocabulary with 21000 words.

Feedback is as always welcome.

Ralf’s German dictionary 0.2

Saturday, August 21st, 2010

How I create Ralf's German dictionary version 0.2:

1. Add this rule to improve-german.xsl:

<xsl:when test="contains(lower-case(../grapheme), 'prüf')"><xsl:value-of select="replace($sierra, 'pʀyf','pʀyːf')"/></xsl:when>

2. Add rule: (more…)

German speech model “xab”

Friday, August 6th, 2010

In this article, I will explain how I create the German speech model "xab". This speech model will contain about 758 German words.

1. I import the German PLS dictionary xab.xml.bz2. The result will become visible in the active vocabulary.

2. Grammar > Add sentence > Unknown
3. Commands > Manage plug-ins > Add > Dictation
4. Dictation > Append text after result > “[hit space bar]”
5. Training > Import training data > Import prompts. Paths on my computer:
/home/ubuntu/Documents/201006/audacity/xab-folder/prompts-xab (license: GPLv3)
/home/ubuntu/Documents/201006/audacity/xab-folder/wav (license: GPLv3)

6. Export the current scenario. Then modify the scenario slightly with the text editor Geany. Then import the scenario.

7. Start ksimond. Simon > Press the Connect button. Then simon > Press the Synchronize button. Simon > Press the Activate button.

8. Dictate some words with simon:

abebbend abebbten Abendanfrage Abendanzug Abendblatt Abendessen Abendfriede Abendgebeten Abendkarte Abendmahl Abendmahlsgemeinschaft Abendprogramme Abendschauen Abendunterricht Abenteuerbuch abenteuerreichem Aberchen abundant abänderndem Abänderungsantrags Arbeitsstunde Arbeitsweise Arbeitszeitmodell Arbeitszimmer Aubings Rabenmutter raubendem Reibeisen Reibung Reiblaut Reibungsgewicht Reibungsverlust Schablone schablonierende Scheiben schraubenartige Schraubengangs Schraubklemmenleiste schreiben schreibfaule

Some recognition errors occurred, but most words were recognized correctly.

9. I want to offer the German speech model "xab". It seems that simon uses different paths. Here are the paths on my computer:

file:///home/ubuntu/.kde/share/apps/simon/model/hmmdefs
file:///home/ubuntu/.kde/share/apps/simon/model/tiedlist

A few seconds ago, I found out that the paths aren’t changed. I will use the following paths for the German base model "xab":

file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/hmm24/hmmdefs
file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/tiedlist
file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/hmm24/macros
file:///tmp/kde-ubuntuJPP9Sy/simond/default/compile/stats

Get these four files (hmmdefs tiedlist macros stats), and the corresponding scenario file (xab-scenario.xml): Download the German speech model "xab" (license: GPLv3).

Disable safety check

Wednesday, August 4th, 2010

Feature request: It would be nice if the safety check could be disabled when the option "user generated model" is used.

I tried it several times with my 18000-word base model and my 300000-word dictionary. It failed every time: only the words that had been explicitly trained were recognized.

It is no problem if the recognition rate is low (at the moment). I would like to have a prototype that theoretically recognizes 300000 different German words. It would help me if I could disable the safety check.

Phonemes of “Packpapieren”

Sunday, July 25th, 2010

Take a look into Ralf's German dictionary:

<lexeme role="Substantiv">
<grapheme>Packpapieren</grapheme>
<phoneme>pakpapiːʀən</phoneme>
<phoneme>pakpapiːɐ̯n</phoneme>
</lexeme>

Take a look into the corresponding section of ralfdic-scenario-20100725.xml:

<word>
<name>Packpapieren</name>
<pronunciation>p a kp a p i: R @ n</pronunciation>
<terminal>Substantiv</terminal>
</word>
<word>
<name>Packpapieren</name>
<pronunciation>p a kp a p i: ah n</pronunciation>
<terminal>Substantiv</terminal>
</word>

What is wrong? The phoneme kp should be written as two separate phonemes.
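
One way a converter can avoid fused symbols like "kp" is to split the IPA string into single phonemes before mapping them. The following Python sketch does this with a greedy longest-match lookup. The table is hypothetical and covers only the symbols of this example (the "ah" for [ɐ̯] follows the scenario excerpt above); it is not the conversion that simon itself performs.

```python
# Hypothetical IPA-to-SAMPA table; only the symbols needed for "Packpapieren".
IPA_TO_SAMPA = {
    "i\u02d0": "i:",       # iː (long i)
    "\u0250\u032f": "ah",  # ɐ̯ (non-syllabic, written "ah" in the scenario file)
    "\u0280": "R",         # ʀ
    "\u0259": "@",         # ə
    "p": "p",
    "a": "a",
    "k": "k",
    "n": "n",
}


def ipa_to_sampa(ipa: str) -> str:
    """Split an IPA string into phonemes and join their SAMPA forms with spaces."""
    keys = sorted(IPA_TO_SAMPA, key=len, reverse=True)  # try long symbols first
    result = []
    position = 0
    while position < len(ipa):
        for key in keys:
            if ipa.startswith(key, position):
                result.append(IPA_TO_SAMPA[key])
                position += len(key)
                break
        else:
            raise ValueError(f"no mapping for {ipa[position]!r} at position {position}")
    return " ".join(result)


print(ipa_to_sampa("pakpapi\u02d0\u0280\u0259n"))
```

Because every phoneme is emitted as its own token, "k" and "p" end up separated by a space.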

Only a subset is recognized

Sunday, July 25th, 2010

This is the current situation: I increased the size of the active vocabulary from 18000 to 300000 words. When I export my current scenario, I get a 58 MB XML file with the name ralfdic-scenario-20100725.xml (license: GPLv3). For comparison, my phonetic dictionary, Ralf's German dictionary 0.1.9.9, has a size of 45 MB. This means that almost all words that are included in Ralf's German dictionary are also included in my current simon scenario. Only the words that were reported as affected are excluded from the active vocabulary (about 1000 words).

But what happens when I dictate? simon obviously recognizes only a subset of 18000 words (only words that are included in Ralf's German IPA FLAC files). Why is that?

Take a look into file:///home/ubuntu/.kde/share/apps/simon/model/prompts (license of this file is GPLv3). My current prompts file contains only 18000 words, and not 300000 words. But my German scenario file ralfdic-scenario-20100725.xml contains 300000 words. And when I dictate, simon only recognizes words that are included in the prompts file. simon doesn’t recognize the other words that are part of the active vocabulary.

My current prompts file is a subset of my current active vocabulary.
My current active vocabulary is a superset of my current prompts file.
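
The subset relation can be checked with plain set operations. This is a small Python sketch; the two word sets are hypothetical stand-ins for the words of the prompts file and of the active vocabulary, and reading the real files is left out.

```python
def untrained_words(active_vocabulary, prompt_words):
    """Return the words of the active vocabulary that never occur in the prompts."""
    return set(active_vocabulary) - set(prompt_words)


# Hypothetical sample data; which words are trained is made up for illustration.
vocabulary = {"Abendessen", "Abendmahl", "Reibung", "Scheiben"}
prompts = {"Abendessen", "Reibung"}

print(prompts <= vocabulary)  # is the prompts set a subset of the vocabulary?
print(sorted(untrained_words(vocabulary, prompts)))
```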

How can I solve this issue? I want simon to recognize words that haven’t been trained before. Even words that are marked in red in the active vocabulary should be recognized. I don’t want to record all 300000 words. This issue should be solvable, but how?

Increasing German speech model

Friday, July 23rd, 2010

At the moment, Ralf’s German speech model 0.1.3 is installed on my computer (as a user generated model). I would like to increase the vocabulary size from 18000 words to 300000 words. So I am trying the following: I import Ralf's German dictionary 0.1.9.9. This will, of course, create a lot of duplicate entries in the active vocabulary.

This is what I am doing after the import of the dictionary:
- add terminals to the grammar (Adjektiv, Substantiv, Verb),
- press the Synchronize button.

So this is the situation at the moment:
Ralf's German speech model 0.1.3 contains 18000 words (it has been trained with 18000 audio files). I want to see how many additional words I have to train to cover all words that are contained in Ralf's German dictionary.

Then the following error message appears:

(screenshot: generate-monophones)

I could use a hint on what I should do to increase the vocabulary size without having to train all the words. Do you have any suggestions?

Edit: I try the following: I clear the whole active vocabulary. Then I import Ralf's German dictionary 0.1.9.9 again (as active vocabulary). Then I disconnect simon. Then I restart ksimond. Then I press the simon connect button. Then I press the Synchronize button.

Now I have to wait a few minutes or so. The computer reacts pretty slowly because obviously simon is using a lot of processing capacity.

It was possible to compile the speech model. Now I press the Activate button. Then the following error message appears:

The recognition reported the following error:
The recognition could not be started because your model contains words that consits of sounds that are not covered by your acoustic model.

You need to either remove those words, transcribe them differently or train them.

Warning: The latter will not work if you are using static base models!

This could also be a sign of a base model that uses a different phoneme set than your scenario vocabulary.

The following words are affected (list may not be complete):
abelschen, abelschen, abonnierbaren, abonnierbaren, abonnierende, abonnierende, abonnierendem, abonnierendem, abonnierender, abonnierender, abonnierendes, abonnierendes, adressatengerechten, adressate…

The following phonemes are affected (list may not be complete):
*-C+e or biphone C+e, *-EI+ts or biphone EI+ts, @-l+gls, @-ts+E, C-a:+R, C-e+m, C-n=+s, E-C+n=, E-N+N=, E-S+n=, E-b+m=, E-k+N=, E-p+m=, E-x+t, E:-ah+t, E:-d+n=, E:-f+m=, E:-f+t, E:-g+N=, EI-C+k, EI-ts…

It would be interesting to get a complete list of all words that are missing. Then I could train or remove these words efficiently.
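
Until a complete list is available, the truncated phoneme list from the message can at least be grouped. The Python sketch below assumes that a label such as E-x+t follows the left-centre+right naming for context-dependent phones and that a label such as C+e is a biphone without a left context. The function names are my own, and the sample list contains only labels that are visible in the message above.

```python
import re
from collections import Counter

# Optional left context, centre phone, optional right context.
LABEL = re.compile(
    r"^(?:(?P<left>[^-+]+)-)?(?P<centre>[^-+]+)(?:\+(?P<right>[^-+]+))?$"
)


def parse_context_phone(label):
    """Split a label like 'E-x+t' into (left, centre, right); missing parts are None."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected label: {label!r}")
    return match.group("left"), match.group("centre"), match.group("right")


def centre_phone_counts(labels):
    """Count how often each centre phone occurs among the reported labels."""
    return Counter(parse_context_phone(label)[1] for label in labels)


reported = ["@-l+gls", "@-ts+E", "C-a:+R", "C-e+m", "E-x+t", "E:-f+t"]
print(centre_phone_counts(reported))
```

Such a count only shows which centre phones appear most often in missing contexts; it does not replace the complete word list.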

Ralf’s German dictionary 0.1.9.9

Friday, July 23rd, 2010

How I am developing Ralf's German dictionary version 0.1.9.9:

1. Improve phoneme quality:

<xsl:when test="contains(lower-case(../grapheme), 'möglichkeit')"><xsl:value-of select="replace($sierra, 'mœglɪçkaɪ̯t','møːglɪçkaɪ̯t')"/></xsl:when>

2. Check whether the object dictionary contains møːglɪçkaɪ̯t instead of mœglɪçkaɪ̯t:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -ext:on -s:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.8.xml' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.4/improve-german.xsl' -o:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.9.xml'

3. Improve phoneme quality:

<xsl:when test="contains(lower-case(../grapheme), 'benzin')"><xsl:value-of select="replace($sierra, 'bɛnt͡sɪn','bɛnt͡siːn')"/></xsl:when>

(more…)

Ralf’s German dictionary 0.1.9.8

Saturday, June 26th, 2010

How I am preparing Ralf's German dictionary version 0.1.9.8:

1. Add the rule to improve-german.xsl:

<xsl:when test="ends-with(../grapheme, 'zustand')
or ends-with (../grapheme, 'zustands')"><xsl:value-of select="replace($sierra, 't͡sʊstant', 't͡suːʃtant')"/></xsl:when>

2. Add the rule:

<xsl:when test="ends-with(../grapheme, 'ständen')"><xsl:value-of select="replace($sierra, 'stɛnd', 'ʃtɛnd')"/></xsl:when>

3. Add rule:

<xsl:when test="contains(../grapheme, 'spiel')"><xsl:value-of select="replace($sierra, 'spiːl', 'ʃpiːl')"/></xsl:when>

4. Add rule:

<xsl:when test="contains(../grapheme, 'angebot')"><xsl:value-of select="replace($sierra, 'aŋeːboːt', 'aŋgeːboːt')"/></xsl:when>

5. Add rule:

<xsl:when test="contains(lower-case(../grapheme), 'stöcke')"><xsl:value-of select="replace($sierra, 'stœk', 'ʃtœk')"/></xsl:when>

6. I want only one terminal per role attribute, not several terminals. This rule removes the other terminals except the first terminal:

<xsl:when test="contains($terminal, ' ')"><xsl:sequence select="substring-before($terminal, ' ')"/></xsl:when>

7. Add rule:

<xsl:when test="starts-with(../grapheme, 'vorz')
and starts-with($sierra, 'foːʀt͡s')"><xsl:value-of select="replace($sierra, 'foːʀt͡s', 'foːɐ̯t͡s')"/></xsl:when>

8. Add additional phoneme element:

<xsl:when test="contains(lower-case(../grapheme), 'angabe')"><xsl:value-of select="replace($sierra, 'aŋgaːb', 'angaːb')"/></xsl:when>

9. Improve phoneme quality:

<xsl:when test="contains(lower-case(../grapheme), 'graphik')"><xsl:value-of select="replace($sierra, 'gʀafiːk','gʀafɪk')"/></xsl:when>

10. Generate version 0.1.9.8 via Ubuntu terminal:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -ext:on -s:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.7.xml' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.4/improve-german.xsl' -o:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.8.xml'

Some phoneme elements will be duplicated. This issue will be fixed later.

11. Download Ralf's German dictionary 0.1.9.8, and import it into simon. Suggestions for improvement are welcome.
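
For readers who don't use XSLT, here is a rough Python sketch of the logic behind rules like the ones above: a test on the grapheme selects a replacement in the phoneme string, and the first matching rule wins (as I would expect from xsl:when inside xsl:choose). Only two of the rules are copied; the sample words and their input transcriptions are illustrations, not entries taken from the dictionary.

```python
# (test on the grapheme, text to find in the phoneme, replacement)
RULES = [
    (lambda grapheme: "spiel" in grapheme, "spiːl", "ʃpiːl"),
    (lambda grapheme: grapheme.endswith("ständen"), "stɛnd", "ʃtɛnd"),
]


def improve(grapheme: str, phoneme: str) -> str:
    """Apply the first rule whose grapheme test matches; otherwise keep the phoneme."""
    for matches, old, new in RULES:
        if matches(grapheme):
            return phoneme.replace(old, new)
    return phoneme


def first_terminal(terminal: str) -> str:
    """Keep only the first terminal of a space-separated role attribute (step 6)."""
    return terminal.split(" ", 1)[0]


print(improve("Beispiel", "ba" + "spiːl"))
print(first_terminal("Verb Singular Gegenwart Indikativ"))
```

The XPath replace() function works with regular expressions, while str.replace is literal; for the plain strings used in these rules, both give the same result.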

Tutorial: import German dictionary

Thursday, June 17th, 2010

This article explains how to import Ralf's German dictionary into simon.

1. Ubuntu > Applications > Universal Access > simon.

2. Press the Vocabulary button.

3. Press Import Dictionary.

4. The target is the Shadow Dictionary. Press the Next button.

5. Select PLS Lexicon. Press the Next button.

6. Select the path to Ralf's German dictionary.

7. Download Ralf's German dictionary, and specify the path. Press the Next button.

8. You have to wait a few seconds until simon is finished with the import of the PLS dictionary.

9. Ralf's German dictionary has been imported successfully. Press the Finish button.

10. Where is the list with the imported words? You don’t see any words because the Active Vocabulary tab is opened. Press the Shadow Vocabulary tab.

Note for the simon developers: If someone imports a dictionary as Shadow Dictionary (step 4 in this tutorial), simon should switch automatically to the Shadow Vocabulary tab after the import has been finished.

11. You can now see the Shadow Vocabulary. The first column contains the word. The second column displays the corresponding SAMPA transcription. And the third column contains grammar information (e.g. Zahlwort, Substantiv, Adjektiv, Verb).

Conclusion: Now you know how you can import Ralf's German dictionary into simon.

Ralf’s German dictionary 0.1.9.7

Thursday, June 17th, 2010

This article explains how I am preparing Ralf's German dictionary 0.1.9.7. At the moment, I am focusing on the creation of new <phoneme> elements:

1. Add the following rule to improve-german.xsl:

<xsl:when test="ends-with(lower-case(../grapheme), 'ier') and
ends-with(., 'iːʀ') and
not(ends-with(../grapheme, 'eier'))"><xsl:value-of select="replace($sierra, 'iːʀ', 'iːɐ̯')"/></xsl:when>

2. Add this rule:

<xsl:when test="ends-with(../grapheme, 'gen') and
not(ends-with(../grapheme, 'ngen'))"><xsl:value-of select="replace($sierra, 'gən', 'gŋ̩')"/></xsl:when>

3. Run the following command in the Ubuntu terminal to test whether improve-german.xsl produces the desired results:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -ext:on -s:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.6.xml' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.4/improve-german.xsl' -o:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.7.xml'

4. Add the following rule:

<xsl:when test="contains(../grapheme, 'gens') and
not(ends-with(../grapheme, 'ngens'))"><xsl:value-of select="replace($sierra, 'gəns', 'gŋ̩s')"/></xsl:when>

What does this rule create? Here is an example:
Source dictionary:

<lexeme role="Substantiv">
<grapheme>Volkswagens</grapheme>
<phoneme>fɔlksvagəns</phoneme>
</lexeme>

Object dictionary:

<lexeme role="Substantiv">
<grapheme>Volkswagens</grapheme>
<phoneme>fɔlksvagəns</phoneme>
<phoneme>fɔlksvagŋ̩s</phoneme>
</lexeme>

You can see that I am using XSLT to produce additional <phoneme> elements.

5. Add this rule:

<xsl:when test="ends-with(../grapheme, 'bens')"><xsl:value-of select="replace($sierra, 'bəns', 'bm̩s')"/></xsl:when>

Example: Source dictionary:

<lexeme role="Substantiv">
<grapheme>Schneetreibens</grapheme>
<phoneme>ʃneːtʀaɪ̯bəns</phoneme>
</lexeme>

Object dictionary:

<lexeme role="Substantiv">
<grapheme>Schneetreibens</grapheme>
<phoneme>ʃneːtʀaɪ̯bəns</phoneme>
<phoneme>ʃneːtʀaɪ̯bm̩s</phoneme>
</lexeme>

6. Add the following rule:

<xsl:when test="ends-with(../grapheme, 'ben')"><xsl:value-of select="replace($sierra, 'bən', 'bm̩')"/></xsl:when>

Example from the source dictionary:

<lexeme role="Verb">
<grapheme>ausgeben</grapheme>
<phoneme>ʔaʊ̯sgeːbən</phoneme>
</lexeme>

Target dictionary:

<lexeme role="Verb">
<grapheme>ausgeben</grapheme>
<phoneme>ʔaʊ̯sgeːbən</phoneme>
<phoneme>ʔaʊ̯sgeːbm̩</phoneme>
</lexeme>

With this rule, I added about 1400 <phoneme> elements. It would be too much work to do this manually. Thanks to saxonb-xslt I can work efficiently and precisely.

7. If you have suggestions for improving Ralf's German dictionary, please tell me. At the moment, the dictionary contains 384067 <lexeme> elements. I don’t want to add more words to the dictionary at the moment; I want to improve the phoneme quality. There are a lot of <grapheme> elements that have more than one possible pronunciation, so my current focus is to add a <phoneme> element wherever it seems appropriate. If you are missing something, please tell me.
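
The idea of steps 4 to 6 (keep the existing <phoneme> and add a second one with the syllabic consonant) can also be sketched in Python with xml.etree.ElementTree. This is not the stylesheet itself. It assumes lexemes without an XML namespace, as in the excerpts above, and the function name is my own.

```python
import xml.etree.ElementTree as ET


def add_variant(lexeme, suffix, old, new):
    """Append a <phoneme> variant if the grapheme ends with suffix; return True if added."""
    grapheme = lexeme.findtext("grapheme", default="")
    if not grapheme.endswith(suffix):
        return False
    existing = [p.text or "" for p in lexeme.findall("phoneme")]
    added = False
    for text in list(existing):
        variant = text.replace(old, new)
        # Skip unchanged strings and variants that are already present.
        if variant != text and variant not in existing:
            ET.SubElement(lexeme, "phoneme").text = variant
            existing.append(variant)
            added = True
    return added


lexeme = ET.fromstring(
    "<lexeme role='Verb'><grapheme>ausgeben</grapheme>"
    "<phoneme>ʔaʊ̯sgeːb\u0259n</phoneme></lexeme>"
)
add_variant(lexeme, "ben", "b\u0259n", "bm\u0329")
print(ET.tostring(lexeme, encoding="unicode"))
```

The check against the existing variants is one possible way to avoid duplicated <phoneme> elements when a rule is applied more than once.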

(more…)

Ralf’s German dictionary 0.1.9.6

Friday, June 11th, 2010

Via Ubuntu terminal I created Ralf's German dictionary version 0.1.9.6:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -ext:on -s:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.5.xml' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.4/improve-german.xsl' -o:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.6.xml'

A lot of terminal information (mostly Verb; Adjektiv) has been added. About 90% of all <lexeme> elements now contain terminal information.

Some words are tagged incorrectly. For example, the word Übungsheften is marked as Verb, but it should be marked as a noun (Substantiv). I will fix that later.
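
A figure like the 90% can be checked by counting <lexeme> elements with a non-empty role attribute. The following Python sketch does that for a small, made-up lexicon string; the function name is hypothetical, and the second sample entry is invented to show a lexeme without role.

```python
import xml.etree.ElementTree as ET


def role_coverage(lexicon_xml: str) -> float:
    """Return the share of <lexeme> elements that carry a non-empty role attribute."""
    root = ET.fromstring(lexicon_xml)
    # Compare local names so that a namespaced PLS file would also work.
    lexemes = [e for e in root.iter() if e.tag.rsplit("}", 1)[-1] == "lexeme"]
    if not lexemes:
        return 0.0
    with_role = sum(1 for lexeme in lexemes if lexeme.get("role"))
    return with_role / len(lexemes)


sample = (
    "<lexicon>"
    "<lexeme role='Verb'><grapheme>ausgeben</grapheme></lexeme>"
    "<lexeme><grapheme>xab</grapheme></lexeme>"
    "</lexicon>"
)
print(role_coverage(sample))
```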

Ralf’s German dictionary 0.1.9.5

Friday, June 4th, 2010

Ralf's German dictionary version 0.1.9.5 includes the following replacement rules:

replace($sierra, 't͡suːɔʀdn', 't͡suːʔɔʀdn')
replace($sierra, 'vɐbɛnd', 'fɐbɛnd')
replace($sierra, 'ɔʏ', 'ɔɪ̯')
replace($sierra, 'œstɛʀʀaɪ̯ç', 'øːstɐʀaɪ̯ç')
replace($sierra, 'stʊnd', 'ʃtʊnd')
replace($sierra, 'o:', 'oː')
replace($sierra, 'ts', 't͡s')
replace($sierra, 'gyltɪg', 'gʏltɪg')
replace($sierra, 'shoːv', 'shoʊ̯')
replace($sierra, 'S', 'ʃ')

The phoneme [ɔʏ] has been replaced by the phoneme [ɔɪ̯]. I wanted to have more consistency within the dictionary. I think that the Wikipedia uses [ɔ͡ʏ] while the Wiktionary uses [ɔɪ̯]:

Heu /[hɔɪ̯]/, Läufer /[ˈlɔɪ̯fɐ]/, neu /[nɔɪ̯]/

Both transcriptions would be correct. But I want to use only [ɔɪ̯].
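
The effect of such a list of replacements can be tried out in Python by applying the pairs one after another. The sketch below uses four of the rules from the list above; the order in which the stylesheet applies them is an assumption, and the input strings are only illustrations.

```python
# (old, new) pairs, applied from top to bottom.
REPLACEMENTS = [
    ("\u0254\u028f", "\u0254\u026a\u032f"),  # ɔʏ -> ɔɪ̯
    ("o:", "o\u02d0"),                       # ASCII colon -> IPA length mark ː
    ("ts", "t\u0361s"),                      # ts -> t͡s (with tie bar)
    ("S", "\u0283"),                         # S -> ʃ
]


def normalize(phoneme: str) -> str:
    """Apply all replacements in order and return the normalized phoneme string."""
    for old, new in REPLACEMENTS:
        phoneme = phoneme.replace(old, new)
    return phoneme


for raw in ["h\u0254\u028f", "tso:"]:
    print(raw, "->", normalize(raw))
```

With these four pairs, a second pass changes nothing, because none of the replacement results contains one of the search strings again.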

Ralf’s German dictionary 0.1.9.4

Friday, June 4th, 2010

How I create Ralf's German dictionary version 0.1.9.4:

1. Editing improve-german.xsl:
If test="ends-with(grapheme, 'gehst')", then <xsl:text>Verb Singular Gegenwart Indikativ</xsl:text>.

2. Doing several replacements with Geany:
replaced 48 occurrences of “ʃadən” with “ʃaːdən”.
replaced 110 occurrences of “vɐlʊst” with “fɐlʊst”.
replaced 24 occurrences of “veːʀɛndəʀʊŋ” with “fɛʀɛndəʀʊŋ”.
replaced 290 occurrences of “bətʀiːbs” with “bətʀiːps”.
replaced 62 occurrences of “tsʊstɛl” with “t͡sʊʃtɛl”.

3. Edit improve-german.xsl again:
When contains(lower-case(grapheme), 'quer'), then replace($sierra, 'kvɛʀ', 'kveːʀ').
replace($sierra, 'stʀaɪ̯t', 'ʃtʀaɪ̯t')
replace($sierra, 'ʔœl', 'ʔøːl')

4. Create version 0.1.9.4 via Ubuntu terminal:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -ext:on -s:'/home/ubuntu/Documents/201005/german-0.1.9.4/german-0.1.9.3.xml' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.4/improve-german.xsl' -o:'/home/ubuntu/Documents/201005/german-0.1.9.4/german-0.1.9.4.xml'

5. Version 0.1.9.4 is slightly better than the previous version. I will continue with the improvement of this dictionary.

Ralf’s Vietnamese dictionary 0.1.1

Monday, May 24th, 2010

Let’s improve Ralf's Vietnamese dictionary:

1. Convert eSpeak phonemes into IPA phonemes:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/vietnamese/dictionaries/vietnamese-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

2. Download Ralf's Vietnamese dictionary, and import it into simon.

Ralf’s Russian dictionary 0.1.1

Monday, May 24th, 2010

How I improve Ralf's Russian dictionary:

1. Version 0.1 contains eSpeak characters.

2. Convert <phoneme> elements via Ubuntu terminal:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/russian/russian-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

3. Download Ralf's Russian dictionary with 146263 words, and import it into simon.
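
A word count like the one above can be verified without unpacking the file to disk. This Python sketch reads a bzip2-compressed dictionary as a stream and counts the <lexeme> elements. It assumes that each word is stored as one <lexeme> element; the function name is my own.

```python
import bz2
import xml.etree.ElementTree as ET


def count_lexemes(path: str) -> int:
    """Count the <lexeme> elements in a bzip2-compressed XML dictionary."""
    count = 0
    with bz2.open(path, "rb") as stream:
        for _event, element in ET.iterparse(stream, events=("end",)):
            # Compare local names so that a namespaced PLS file also works.
            if element.tag.rsplit("}", 1)[-1] == "lexeme":
                count += 1
                element.clear()  # drop the finished element's children to save memory
    return count
```

Calling count_lexemes with the path of a downloaded dictionary file returns the number of entries; because the file is parsed incrementally, even large dictionaries don't have to be loaded completely.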

Ralf’s Norwegian Bokmål dictionary 0.1.1

Sunday, May 23rd, 2010

How I improve Ralf's Norwegian Bokmål dictionary:

1. Version 0.1 contains eSpeak characters.

2. The language code is no. Or should I rather use nb (Bokmål)? I think that I will use no because it is easier at the moment: espeak2ipa.xsl doesn’t contain a specific entry for nb; only no is supported.

3. Convert <phoneme> elements:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/norwegian/norwegian-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

4. Download Ralf's Norwegian Bokmål dictionary with 322043 words, and import it into simon.