Posts Tagged ‘German’

Ralf’s German speech model 0.1.9.4

Monday, June 4th, 2012

This article explains how I am creating version 0.1.9.4 of Ralf’s German speech model. This speech model should contain about 300.000 words. Let’s see whether it works out. Here is what I do:

1. I have imported all German IPA FLAC files into simon (more than 50.000 FLAC files have been imported). The speech model is working with about 50.000 words.
2. Import a reduced version of my German PLS dictionary from here: file:///home/linuxmint/Music/preparing-de-0.1.9.4/reduced-german-dictionary-0.2.8.1.xml
Simon > File > Connect. Simon has now automatically been activated. Deactivate Simon. Synchronize.
It will be necessary to remove words from the dictionary that contain triphones that are not part of the acoustic model.
3. Linux Mint terminal:

cd /home/linuxmint/Music/preparing-de-0.1.9.4
saxonb-xslt -ext:on -s:words-not-found-1 -xsl:analyze.xsl -o:words-not-found-2
saxonb-xslt -ext:on -s:reduced-german-dictionary-0.2.8.1.xml -xsl:compare-missing-graphemes.xsl

4. Delete the scenario demega.
5. Import the scenario demega. Import the base model demega as static base model.
6. Synchronize. Activate Simon and dictate a few words. It is working.
7. Import reduced-german-dictionary-0.2.8.1-1.xml.
8. Disconnect. Connect. Synchronize. Wait a few moments. Activate. An error message occurs. There are a lot of words that consist of sounds that are not covered by the base model.
9. Terminal:

saxonb-xslt -ext:on -s:words-not-found-3 -xsl:analyze.xsl -o:words-not-found-4
saxonb-xslt -ext:on -s:reduced-german-dictionary-0.2.8.1-1.xml -xsl:compare-missing-graphemes.xsl

I won’t explain the next steps because they are just a repetition of the previous steps.

Now you have an impression of how the next version of my speech model was created. The speech model contains 290.000 words.

Will this speech model run on your computer? You have to compile Julius from source:

./configure --enable-words-int

Normally, Julius has a limit of 65.535 words (a 16-bit word ID). I don’t know what the limit is after compiling with this option, but 290.000 words is possible.

Missing triphones in German speech model

Sunday, May 6th, 2012

When I import the Reduced German dictionary (I didn’t publish this dictionary), I get the message that lots of triphones are missing. I extracted these triphones and put them into the .xml file just-triphones-unique.xml. About 7.000 triphones are missing! This means that I have to record up to 7.000 German words (each word has to contain at least one of the missing triphones) to make “it” work. What does that mean? If I record 7.000 additional German words, I can generate a speech model that covers 380.000 German words (all words from Schott’s German dictionary).

I will have to transform the phonemes inside the file just-triphones-unique.xml into IPA format. This means that I have to do a conversion from SAMPA to IPA. Then I will have to compare the <phoneme> elements from Schott’s German dictionary with the transformed .xml file. If there is a match, I can output the corresponding <lexeme> element. It is pretty complicated, but it should be the fastest way to get a reasonably good result.

Or I could go another way. I could compare the future IPA version of just-triphones-unique.xml with Schott’s German dictionary. If there is a match, then the word should be excluded. This means that I can produce a reduced version of my German dictionary with as many words as possible.

There is another different way, too. I could extract the missing words, and put them into a list. Then I could compare this list with my German dictionary. If there is a match, then these specific words will have to be excluded from the German dictionary.

Which way is the easiest one?

Update: I created a list of the missing grapheme elements missing-graphemes.xml. I will have to write an .xsl style sheet that compares this list with my German dictionary. If there isn’t a match, then include this word in a reduced version of the dictionary.
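The filtering logic of that style sheet can be sketched in the terminal. This is only a stand-in for illustration, not the actual compare stylesheet; the file names and word lists are made up. grep keeps a word only if it is not on the missing list:

```shell
# Hypothetical sketch: drop every word whose grapheme is on the missing list,
# leaving a reduced word list (file names and contents are made up).
printf 'Bergbach\nHaus\nKatze\n' > words-demo.txt
printf 'Bergbach\n' > missing-graphemes-demo.txt
grep -v -x -F -f missing-graphemes-demo.txt words-demo.txt > reduced-demo.txt
cat reduced-demo.txt
```

The real style sheet works on the <grapheme> elements inside the PLS file, but the idea is the same: exclude on match, include otherwise.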

Import of the German word “Bergbach”

Sunday, May 6th, 2012

Let’s take a look into Schott’s German dictionary:

<lexeme role="Substantiv">
<grapheme>Bergbach</grapheme>
<phoneme>bɛʀgbaχ</phoneme>
</lexeme>

The transcription is as follows after this word has been imported into Simon: [Bergbach] b E R gb a x

You can see that the gb is treated as one single phoneme. But it should be treated as two different phonemes: /g/ and /b/.

Ralf’s German speech model 0.1.9.3

Saturday, May 5th, 2012

I just imported 36.000 German audio files. Then I exported the corresponding scenario and the corresponding base model. In contrast to the previous version 0.1.9.2, it seems to work. The last version was much smaller: about 0.8 MB. The current size is about 7.1 MB (both files are compressed). I don’t know why the difference is that big. Is the .sbm container compressed or uncompressed? My guess is that it is compressed.

Download Ralf’s German speech model 0.1.9.3, and use it with Simon.

Update June 4, 2012: The .sbm container is compressed. Just rename the .sbm container into .tar.gz, then open with Archive Manager. You can find several files inside the container, e.g. julius.jconf which contains a copyright note.
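Based on the update above, the container can be inspected like this. The sketch below builds a stand-in .sbm (the file name and the sample file inside it are made up), since I am not shipping a real container here:

```shell
# Build a stand-in .sbm (really a .tar.gz) and list its contents.
mkdir -p sbm-demo
echo "sample" > sbm-demo/julius.jconf
tar -czf demo.sbm -C sbm-demo .
cp demo.sbm demo.tar.gz       # the rename step from the update
tar -tzf demo.tar.gz          # lists julius.jconf among the entries
```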

Importing 36.000 German audio files

Saturday, May 5th, 2012

I am now importing 36.000 German audio files. They are from sections alpha, bravo, charlie, diego, echo and friedrich. Why am I doing that? I want to produce the next version of Ralf’s German speech model (current version is version 0.1.9.2).

The import of 36.000 German audio files takes a while. I want to generate the speech model again.

Unfortunately, I had to restart the computer because Simon didn’t respond any more. After the restart, the same problem occurred: Simon uses a lot of CPU power. Apparently, 36.000 audio files might be too much for Simon.

I am so sorry that I can’t test the speech model at the moment. But I could try to extract hmmdefs, tiedlist, macros, and stats from the following location (for later import): /tmp/kde-ubuntu/simond/default/compile – if it weren’t for one problem: the folder compile isn’t available at the moment.

Now, I killed the simon process with the System Monitor (sorry for the harsh wording, but the menu entry is called “Kill Process”). When I restart simon, it again uses 100% of the CPU. I would call this a bug.

Update: I had to wait a little while, but then it worked: about 20 % of the following words were recognized correctly:

scheuerte Dingen Dingen scheuerte Schäuble Gehirnwäsche erzkonservativ Erzielens erzieltest Erzielung erzogen erzogene erzwingbarem erzwingbarer erzwinge Erträgen erteilend Eröffnungstermin Eröffnungsspiel Eröffnungssitzung Eröffnungsreferat Eschborn Escudo Eskudos

This means that you get more than just random recognition results. Don’t forget: This speech model consists of 36.000 words! I have the following unproven theory: It is sufficient to record each word one time. I can’t prove it. I am just testing.

I want to point out the following detail: Simon recognized the word Eskudos (written with k), although I wanted to test whether Simon would recognize Escudos (with c).

Import of your Grammar

Monday, January 2nd, 2012

In this post I want to write some words about the Grammar / Import function of simon 0.3. Here is what I do:

1. Import Schott’s German dictionary as active dictionary into simon.

2. Open the Grammar tab. Press the Import button.

3. Simon starts a wizard. Press the Next button.

4. Let’s try and check the option Also import unknown sentences. I don’t know whether this is a good decision. So let’s give it a try.
This is interesting: “words with more than one terminal” – is it now possible to use more than one entry for the role attribute? The current version of Schott’s German dictionary employs just one entry for each role attribute. The PLS standard allows more entries.
Please, download and extract Schott’s German utterances. This compressed folder 15000-german-utterances.zip contains a plain text file with more than 15000 utterances. I am the author of these utterances, and I have licensed them under the GPLv3. The utterances are designed to be used in conjunction with Schott’s German dictionary (former name: Ralf’s German dictionary). Every word that is included within Schott’s German utterances should be included in Schott’s German dictionary, too. I am 99% sure that this is the case, but I can’t guarantee it. If some words are missing in Schott’s German dictionary, please inform me, and I will include them within the next version of Schott’s German dictionary.
You can import my German utterances using simon’s Import Text option (copy & paste).

5. The import has been completed. There are a lot of lines that contain the Unknown terminal. It probably would have been better if I hadn’t checked the option Also import unknown sentences in step 4.

6. Because simon didn’t react any more, I forced it to quit. I tried to start simon several times, but it wouldn’t start. Several simon zombie processes were displayed. I was able to end these zombie / sleeping processes. But at the moment, it seems to be impossible to start simon again (a new zombie / sleeping process is being created whenever I try to start simon).

Conclusion: I don’t recommend checking the option Also import unknown sentences. I tried the Grammar import function before without checking this option; simon reacted normally, and everything seemed to be fine.

Schott’s German dictionary 0.2.8

Tuesday, November 1st, 2011

Here is how I create Schott’s German dictionary 0.2.8 (with the style sheet improve-german.xsl):

1. Replace 152 matches:

<xsl:when test="contains(lower-case(../grapheme), 'planung')"><xsl:value-of select="replace($sierra, 'planʊŋ','plaːnʊŋ')"/></xsl:when>

2. Replace 178 matches:

<xsl:when test="contains(lower-case(../grapheme), 'fußball')"><xsl:value-of select="replace($sierra, 'fʊsbal','fuːsbal')"/></xsl:when>

A lot of other small changes have been made. Please, import Schott’s German dictionary (author: Kai Schott) into simon.

German speech model ‘dedv’

Thursday, July 28th, 2011

Here is how I create the German speech model ‘dedv’:

1. Open the file de-dv. It contains a list of 1000 words (only the phonetic transcriptions).

2. Start Audacity.

3. Now I read every word in the list. Between each word, there is a pause of 1-2 seconds. Later, Audacity will find the pauses automatically.

4. Mark the whole recording with a double-click.
5. Then select Analyze > Sound Finder…

6. Set the Label starting point to 1.0. Set the Label ending point to 1.0. Why? Because simon has to know the amount of background noise.

7. Let’s eliminate the error at position 237. There was some noise (above 26 dB, I think) at position 237, and not a word. Mark the area with the mouse. Then press the Silence button (see top right of the picture).

8. Let’s have a look at the text file de-dv, and at the audio file. The text file ends with line 841. The Audacity audio file ends with number 841. Both files correspond to each other.

9. Select Audacity > File > Export Labels… The position (starting point and ending point) of each label will be exported into a simple text file. Export the labels to a file named labels.txt.

10. Open labels.txt with Geany. The file labels.txt ends at line 841. The first number of each line indicates the label starting point. The second number indicates the label ending point. The third number indicates the label itself.
You can see in the picture that de-dv is open, too. Both files – labels.txt and de-dv – have a length of exactly 841 lines.

11. Geany > Search > Replace.
Search for: \t\w+$ (\t matches a tab; \w matches a word character; + means “one or more of the preceding element”; $ matches the end of the line)
Don’t forget to mark Use regular expressions.
This procedure removes the third number from each line.

12. You can see that the third column has been removed thanks to the regular expression procedure.
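The same column removal can be done in the terminal with GNU sed. This is a sketch with made-up label data; \t is the tab, and the POSIX class [[:alnum:]_] plays the role of \w:

```shell
# Strip the trailing tab + label column, as the Geany regex replace does.
printf '0.952381\t1.234567\t1\n2.100000\t2.900000\t2\n' > labels-demo.txt
sed -i 's/\t[[:alnum:]_]*$//' labels-demo.txt
cat labels-demo.txt
```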

13. Now it is time to merge both files: labels.txt should be merged with de-dv. This is done via the paste command in the Ubuntu terminal:

ubuntu@ubuntu:~$ paste /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/labels.txt /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/de-dv > /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/pasted.txt

The resulting document is named pasted.txt.
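Here is a minimal stand-in for the merge step (with made-up data and file names; the real command above just uses longer paths):

```shell
# paste joins the two files line by line, separated by a tab.
printf '0.95\t1.23\n2.10\t2.90\n' > positions-demo.txt
printf 'haʊs\nkatsə\n' > de-dv-demo
paste positions-demo.txt de-dv-demo > pasted-demo.txt
cat pasted-demo.txt
```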

14. You can see that the document pasted.txt has a third column: The labels are the phonetic transcriptions!

15. Now let’s go back to Audacity > File > Import > Labels… Take a look at the result. Each label is a phonetic transcription of the corresponding recording.

16. Audacity > File > Export Multiple…
Export format: FLAC files
Export location: /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/flac-dedv
Split files based on: Labels
Name files: Using Label/Track Name
Press the Export button.

17. Now you know how I create the FLAC files that are part of Schott’s German IPA FLAC files.

18. Let’s generate a PLS dictionary that contains about 841 entries. This is done in the Ubuntu terminal:

ubuntu@ubuntu:~$ cat /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/german-0.2.7.xml | saxonb-xslt -ext:on -s:- -xsl:/media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/combine-0.2.4/compare.xsl

The result is a PLS dictionary at the following location: file:///media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/lexicon-dedv.xml

19. Now I need a prompts file. This is generated, too, via Ubuntu terminal:

ubuntu@ubuntu:~$ saxonb-xslt -ext:on -s:/media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/german-0.2.7.xml -xsl:'/media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/combine-0.2.4/lexicon2prompts.xsl' -o:'/home/ubuntu/Documents/dummy.xml'

20. Now it is time to upload the package file:///media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/german-ipa-flac-files-dedv-20110727.tar.bz2 to Voxforge.

21. Delete file:///home/ubuntu/.kde/share/apps/simon and file:///home/ubuntu/.kde/share/apps/simond.

22. Start simon.

23. I am skipping the next steps. Please read my article German speech model ‘dedq’ to get more details.

24. And now it is time to watch the video about this speech model.


The following words were recognized in the video:

Orakelspruchs Orangensaft Orangensaftes Orangenschale Orangenschalenstruktur Orangenscheibe Orangensekt Orangerie Orangerien Orangerücken Oranienburger Oranienburgs Orchester Orchesterbegleitung Orchesterkanzel Orchestergraben Orchesterbesoldung Orchestermitglieder Orchestermusik Orchestermusiker Orchesterprobe Orchesterraum Orchester Ordensgelübde Ordensschwester Ordensschwestern Orderpapier Ordination Ordnungsbehörde Ordnungsbehörden Ordnungsmacht Ordnungssystem Orffs Organbank Organe Organell Organelle Organells Organhandel Organigramm Organigramme Organigrammen Organigramms Organik Organisationsabteilung Organisationsaufgabe Organisationsaufgaben Organisationsausschuss Organisationsbegabung Organisationseinheit Organisationserfahrung Organisationsfachmann Organisationsform Organisationsformen Organisationsgabe Organisationskomitee Organisationslösung Organisationslösungen Organisationsmethoden Organisationsplan Organisationsplanung Organisationspsychologie Organisationsreform Organisationsstruktur Organisationsteam Organisator Organisatoren Organisators Organisierung Organismus Organist Organistin Organographie Orgelbauer Orchesters Orgelbauers Orgelklang Orgelklangs Orgelkonzert Orgelkonzerte Orgelmusik Orgel Orgelpfeife Orgelton Orgelwerke Orientbrücken Orienthandel Orientierungskrise Orientierungskrisen Orientierungspunkte Orientierungsstufe Ornithologie Origami Originalantwortschein Originalantwortscheine Organigrammen Organstreit Originalausgabe Originalausgaben Originalbeleg Organstreit Originaldiskette Originalersatzteil Originalfassung Originalgehäuse Orgelkonzerte Originalität Organizismus Originalprüfunterlagen Organells Orderscheck Originalschecks Organhandel Organstreitverfahren Originalversion Originalverpackung Originalversion Organ Organstreit Orkanen Organstreit Orkanschadens Orkantiefs Orkantiefs Orlando Orlandos Orléans Ornamentband Ornamentbands Ornamentbänder Ornamentbändern Ornamente Ornaments Orographie Orographien Orpheus Ortbeton Orte 
Orten Ortens Ortgang Ortgangbrett Orthodoxie Orthodoxien Ostgeschäft Orthographie Orthographiefehler Orthographiefehlern Orthographiefehlers Orthografien Orthographie Ortholexikon Ortholexikons Orthonormalbasis Orthopäde Orthopäden Orthopädie Orthopädien Ortleb Ortolf Ostkirche Ortsbehörden Ortsbesichtigung Ortsbezeichnung Ortsbild Ortschaftsrats Ortschaftsräte Ortschaftsräten Ortsdurchfahrten Ortsfremde Ortsfremden Ortsgebühr Ortsgespräche Ortsgesprächen Ortsgrammatik Ortsgruppe Ortsgruppenleiter Ortskirchen Ostkredite Ortskrankenkassen Orchestern Ortsmitte Ortsname Ortsnamen Ortsnetz Ortsnetze Ortsnetzen Ortsnetzes Ortleb Ortssendern Ortsteil Ortsteilen Ortsteils Orchester Ortsvektoren Ortsverbandes Ortsverzeichnis Ortsveränderung Ortsvorsitzende Ortsvorsteher Ostdeutschland Ortszulage Ortungsgeräte Ortung Ostblock Ostblockes Osteolyse Ostfriesentee Ostallgäu Ostdeutschlands Ost-Berliner Ost-Berlins Ost-SPD Ostwestfale Ost-West-Konflikt Ostafrika Ostafrikas Ostalgie Ostasien Ostbahnhof Ost-Berlins Ostbesuche Ostbesuchen Ostbewohner Ostblock Ostblockländer Ostblockländern Ostblockreisen Ostblockstaaten Ostbündnis Ostbündnisse Ostbündnissen Ostbündnisses Ostdeutschland Ostelbien Ostfront Ostens Osteoporose Osteoporosen Ozeanriesen Osteuropäer Osteuropas Osteuropäern Osteuropäers Ostexport Ostexports Ostfalen Ostfildern Ostfilderns Ostgebiete Ostgeschäft Ostprovinz Ostpreußen Ostpreußens Ostteil Ostsee Ostseebäder Ostseehandel Ostseeheilbad Ostseeinsel Oszillator Oszillatoren Ostwirtschaft Ostsektor Ovation Ovationen

You can see that there are a lot of recognition errors.

25. Now you know how I created the German speech model ‘dedv’, and how good / bad it is when used for recognition.

German speech model ‘decw’

Saturday, March 26th, 2011

You can import the German speech model ‘decw’ into simon. The words are taken from section decw, and they can be found at Voxforge, too.

German speech model ‘decr’

Friday, March 25th, 2011

You can import the German speech model ‘decr’ into simon. The words are from section decr; you can find them at Voxforge, too.

German speech model ‘dech’

Friday, March 25th, 2011

You can import the German speech model ‘dech’ into simon. The words are from section dech, and they can be found at Voxforge, too.

Schott’s German dictionary 0.2.5

Tuesday, February 22nd, 2011

Here is how I create Schott's German dictionary version 0.2.5 (former name: Ralf's German dictionary). Within the XSLT style sheet improve-german.xsl, I implement these transformation rules:

1. About 70 occurrences have to be replaced:

<xsl:when test="contains(lower-case(../grapheme), 'alkohol')"><xsl:value-of select="replace($sierra, 'alkoːɔl','alkohoːl')"/></xsl:when>

2. Replace about 50 occurrences:

<xsl:when test="contains(lower-case(../grapheme), 'vertret')"><xsl:value-of select="replace($sierra, 'vɐtʀeːt','fɐtʀeːt')"/></xsl:when>

3. Replace 32 matches:

<xsl:when test="contains(lower-case(../grapheme), 'aluminium')"><xsl:value-of select="replace($sierra, 'aluːmiːniːʊm','alumiːni̯ʊm')"/></xsl:when>

4. Replace 386 matches: (more…)

German speech model ‘friedrich’

Friday, February 18th, 2011

You can import the German speech model 'friedrich' (0.5 MB, GPLv3) into simon. The package contains all necessary files: hmmdefs-friedrich, tiedlist-friedrich, macros-friedrich, stats-friedrich, and of course the scenario-file scenario-friedrich.xml.

Edit: Download the video German speech model ‘friedrich’ (40 min., 47 MB, WMV, link will become invalid soon)

Edit: These words are recognized in the video:

ARBEITSAUFNAHME Aachener Rachens Einsparens Aalen Racheschach RAF
Abarbeiten Barockzeit Abarbeitungszyklus Abartung Abbauvermögen
Abbauvertrag Abberufens Abberufungen Abbestellungen Abbiegevorgang
Abbiegevorgangs Abbiegevorgänge Abbiegevorgängen Abbindebehandlung
Abbindebereich Abbindebeschleuniger Abbindebeschleunigung Abbindedauer
Abbitte Abbindung Abbindezeit Abbindewasser Abbindeverhalten (more…)

Nordic German PLS dictionary

Thursday, February 17th, 2011

I could develop a Nordic German PLS dictionary. The difference to Ralf’s German dictionary would be as follows:

In large parts of Germany (in the north), the distinction between long e and long ä is missing; both are pronounced as a long e.

This means: I have to convert /ɛː/ to /eː/.
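In the terminal, this conversion is a single substitution. The sketch below uses sed on a made-up <phoneme> element; in practice the rule would go into the XSLT stylesheet:

```shell
# Merge /ɛː/ into /eː/, the Nordic German simplification.
printf '<phoneme>kɛːzə</phoneme>\n' > nordic-demo.xml
sed -i 's/ɛː/eː/g' nordic-demo.xml
cat nordic-demo.xml
```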

OT: German phonemes for “Depot”

Saturday, February 6th, 2010

As I said earlier, I want to provide information about PLS dictionary development. At the moment, Ralf’s German dictionary 0.1.7 is available. Here is how I want to modify the dictionary:

I take a closer look at the XSLT style-sheet espeak2perfectipa.xsl. I didn’t improve this style-sheet during the last 4 months.

A few minutes ago, I made a slight modification to espeak2perfectipa.xsl (= XSLT style-sheet). These are the lines that modify the phonemes for Depot:

<xsl:when test="starts-with(grapheme, 'Depot')">
<xsl:for-each select="phoneme"><xsl:text>
</xsl:text><phoneme>
<xsl:variable name="sierra"><xsl:value-of select="."/></xsl:variable>
<xsl:variable name="sierra" select="replace($sierra, 'deːpɔt', 'deːpoː')"/>
<xsl:sequence select="$sierra"/></phoneme>
</xsl:for-each>
</xsl:when>

These are some lines from Ralf’s German dictionary 0.1.7 (= source XML document):

<lexeme role="Substantiv">
<grapheme>Depotbank</grapheme>
<phoneme>deːpɔtbaŋk</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotfett</grapheme>
<phoneme>deːpɔtfɛt</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgebühr</grapheme>
<phoneme>deːpɔtgeːbyːʀ</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgesetz</grapheme>
<phoneme>deːpɔtgeːzɛts</phoneme>
</lexeme>

These are the corresponding lines of the future version of the dictionary (= result XML document):

<lexeme role="Substantiv">
<grapheme>Depotbank</grapheme>
<phoneme>deːpoːbaŋk</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotfett</grapheme>
<phoneme>deːpoːfɛt</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgebühr</grapheme>
<phoneme>deːpoːgeːbyːʀ</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgesetz</grapheme>
<phoneme>deːpoːgeːzɛts</phoneme>
</lexeme>

You can see the concept: The XSLT style-sheet defines which modifications the result PLS dictionary should contain.
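For illustration only, here is the same Depot rule as a one-liner, with sed standing in for the stylesheet:

```shell
# Replace the phoneme sequence deːpɔt with deːpoː, as the XSLT rule does.
printf '<phoneme>deːpɔtbaŋk</phoneme>\n' | sed 's/deːpɔt/deːpoː/'
```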

The whole process is invoked via the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201001/0.1.8$ saxonb-xslt -ext:on -s:german-dictionary-0.1.7.xml -xsl:espeak2perfectipa.xsl -o:prepare-0.1.8.xml

Let me explain:
- saxonb-xslt is the XSLT processor;
- german-dictionary-0.1.7.xml (= Ralf’s German dictionary 0.1.7) is the XML source document;
- espeak2perfectipa.xsl is the XSLT style-sheet;
- prepare-0.1.8.xml is the XML result document. It should become Ralf’s German dictionary 0.1.8, which I will release as soon as I have made substantial progress.

I will document further steps of dictionary development here in this blog. I hope that I can convince some people out there to apply this concept to other PLS dictionaries. So my goal is to educate people in PLS dictionary development.

Julius dictionary; PLS: role attribute

Wednesday, January 20th, 2010

This blog post is about (A) Julius dictionary and (B) the import of a PLS dictionary.

A. Obviously, it is possible to import a Julius dictionary:

[Screenshot: julius-vocabulary]

I didn’t know that this kind of dictionary existed. What are the properties of this format? And what are the advantages?

B. I want to import Ralf's German dictionary (version 0.1.7; October 29, 2009). Great, simon now recognizes the role attribute:

[Screenshot: adjektiv-substantiv]

1. A few minutes ago, I imported Ralf's German dictionary into simon. I am offering 27 PLS dictionaries for 27 different languages. Choose the dictionary that suits your native language, and import it into simon.

2. Let’s take a look into the shadow dictionary.

3. The word kernchemischen is an Adjektiv. Let’s take a look at the specific entry in Ralf's German dictionary:

<lexeme role="Adjektiv">
<grapheme>kernchemischen</grapheme>
<phoneme>kɛʀnçeːmɪʃən</phoneme>
</lexeme>

You can see that the role attribute which is part of the <lexeme> element was imported by simon. Thanks for implementing that feature.

4. The word Kerndurchmesser is a Substantiv. The corresponding entry in the PLS dictionary:

<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>

You can see the strength of the simon import process: the last two letters of Kerndurchmesser (“er”) correspond to one single phoneme (the final ɐ in kɛʀndʊʀçmɛsɐ). Because such details are implemented, we can get a very good recognition rate, as I showed in the video with 200 German words.

Why is Ralf's German dictionary good? Let me explain about the history of this dictionary:

a. The initial steps were done at Voxforge with the development of a German pronunciation dictionary. See for yourself: the script espeak2Phones.pl is great because it transforms eSpeak’s cryptic ASCII output into SAMPA. This approach works well for the German language.

b. Later, we used the dictionary acquisition project to collect about 8.000 pronunciations. Every single pronunciation was checked by a human. The phoneme concept follows Wiktionary.

c. I used a German spelling dictionary from OpenOffice.org to get more words for the dictionary (Ubuntu terminal command: unmunch). With eSpeak I created the phonemes. With an XSLT style-sheet (Ubuntu terminal command: saxonb-xslt) I transformed the eSpeak phonemes into IPA phonemes. And I used the XSLT style-sheet for inserting the role attribute (Substantiv, Adjektiv, Zahlwort).

d. The result is the current version of Ralf's German dictionary. It would be nice if someone would help with the improvement. The really difficult work has been done, but the dictionary still needs fine-tuning. Let me give you a concrete example:

<lexeme>
<grapheme>stromsparen</grapheme>
<phoneme>ʃtʀɔmʃpaːʀən</phoneme>
</lexeme>
<lexeme>
<grapheme>stromsparend</grapheme>
<phoneme>ʃtʀɔmspaːʀənt</phoneme>
</lexeme>

What is wrong, or what could be improved? First, the role attribute is missing: stromsparen is a Verb, and stromsparend is an Adverb. It would be good if someone added the missing role attributes. Second, small phoneme corrections are necessary: ʃtʀɔmʃpaːʀən is OK because you say “schtromschparen”, but ʃtʀɔmspaːʀənt is wrong because you don’t say “schtromsparent”.

You can see that improvements are necessary. Because Ralf's German dictionary is GPLv3, everyone is permitted to improve it.

For good recognition results, things like “schtromsparent” have to be fixed. It is possible that some dialects (e.g. in Hamburg) say “stromsparen” and not “schtromschparen”. I recommend that specific dialect dictionaries be developed; this would be part of the fine-tuning, too. Ralf's German dictionary covers Standard German. You can use my dictionary for the development of a dialect dictionary that can be used by people who prefer to dictate in their own specific dialect.

By the way, did you notice the following detail? ʃtʀɔmspaːʀənt ends with a “t” and not with a “d” because of the “Auslautverhärtung” (final-obstruent devoicing, which is part of German pronunciation). Such small details are implemented in the dictionary.

C. Conclusion: I know about the strengths of PLS, but I don’t know which advantages a Julius dictionary would have to offer.

Zuführungsdrähten: two pronunciations

Thursday, December 24th, 2009

I am now adding the word Zuführungsdrähten (which is part of the shadow dictionary):

[Screenshot: zufuehrungsdraehten]

You can see that there are two pronunciation alternatives. And this proves the strength of Ralf’s German dictionary: I am using an XSLT stylesheet to fix recurring pronunciation errors (eSpeak is not perfect) and to add alternate pronunciations (using replace: replace($ieren, 'tən', 'tn̩')).

It would be great if someone would be willing to volunteer with the development of Ralf's German dictionary. My concept is as follows (compare with the XSLT concept):

[Diagram: xslt-concept]
Image source: Wikipedia

Ralf’s German dictionary (current version) = XML input
espeak2perfectipa.xsl = XSLT code
saxonb-xslt (Ubuntu terminal) = XSLT processor
Ralf’s German dictionary (future version) = Result document

The development of Ralf's German dictionary is done outside of, and independently from, simon. During import, simon does a pretty good conversion from IPA to SAMPA. So there is no need to worry.

Ralf’s German dictionary has a lot of known weaknesses. It would be great if someone who is interested in improving this dictionary would volunteer. It is not that difficult to get involved.

Everyone is permitted to improve Ralf’s German dictionary (and the corresponding XSLT code espeak2perfectipa.xsl) because both are GPLv3 licensed.

Ralf's German dictionary is the flagship. There are other dictionaries which need improvement:

A. Austrian German
Ralf’s Austrian German dictionary – it is a very small dictionary. The target group is very specific. This dictionary should contain only words that are not included in Ralf's German dictionary. So if you live in Austria, it is intended that you import two dictionaries:
1. Ralf's German dictionary (with 300.000 words);
2. Ralf's Austrian German dictionary (with specific words).

I don’t know much about Austrian German. You can get an impression of how Austrian German sounds when watching the simon video tutorial.

B. Swiss German
Maybe I will release a Swiss German dictionary. If someone from Switzerland is interested in the development of such a dictionary, I could create Ralf's Swiss German dictionary (I haven’t done that so far). I think that there is a GPL word list available at OpenOffice.org. So a volunteer would be welcome. I can help you with the first steps (unmunch, eSpeak, paste). The result would be a PLS dictionary whose vocabulary only contains words that are specific to Swiss German.

The future Ralf's Swiss German dictionary is interesting for people who live in Switzerland or have emigrated there from Germany. If you emigrated from Germany to Switzerland, you should get familiar with Swiss German: if you want to stay, learn the language! simon together with Ralf's Swiss German dictionary might help you reach that goal.

C. Medical German
Ralf’s German medical dictionary is targeted at people who are interested in medical education. It is necessary to develop specialised medical dictionaries. The concept is easy:
1. Import Ralf's German dictionary into simon.
2. Import Ralf's German medical dictionary. Then you can train simon to recognize medical terms (e.g. Linsenchirurgie /lɪnzənçiːʀʊʀgiː/ or Epilepsiebehandlung /ʔeːpiːlɛpsiːbeːandlʊŋ/).
3. Develop specialised medical dictionaries: Human anatomy, pharmacology, genetics, etc..

simon could be used by medical students. So if you are a medical student (German language), you can improve Ralf's German medical dictionary. You can add words to this dictionary. Later, in a few years when you become a doctor, you might be able to use your experience with simon / Ralf's German medical dictionary. Develop your own medical pronunciation dictionary, and become a better doctor!

Of course, because we are at a very early stage of development, this is just something for medical students who have enough time.

D. Latin (German pronunciation)
Ralf’s Latin dictionary needs improvement. Latin has pronunciation rules that differ from German. One possible approach: improve Ralf's Latin dictionary with an XSLT stylesheet, as explained above. The stylesheet needs information that is specific to the Latin language. Sometimes the Latin e is short (e.g. currere), sometimes it is long (e.g. dēbēre). These things need to be fixed.

And of course, 1.7 million Latin words is too much: the size of this dictionary has to be reduced for performance reasons. A dictionary with about 100.000 Latin words would be optimal at the moment. We don’t yet have a routine (compression algorithm) to handle dictionaries with e.g. 1.7 million words; this has to be developed (maybe by someone who is familiar with the unmunch command). But a dictionary with 100.000 Latin words would be a good start.

Latin is a “dead language”, but thanks to simon and Ralf's Latin dictionary it should be possible to make the computer write down Latin words when you speak them into your microphone. From my point of view, Ralf’s Latin dictionary is something for Latin teachers (school or university), so the target group is very specific. I think it can be fun for Latin students to use simon for the recognition of spoken Latin words.

E. Conclusion
The different dictionaries need improvement. Interested persons (people from Austria or Switzerland, medical students, Latin teachers) are encouraged to improve the relevant dictionary on their own. My dictionaries are GPLv3 licensed; it is intended that others improve them. This is my concept:

1. unmunch an OpenOffice.org spelling dictionary
2. generate phonemes with eSpeak
3. paste
4. convert eSpeak phonemes into IPA phonemes with XSLT
5. import the resulting PLS dictionary into simon
6. record a few words with simon
7. use simon for recognition (dictate e.g. into a gedit window)
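As an illustration of steps 3 to 5, here is a minimal Python sketch of the paste-and-convert idea (this is not the tool I actually use; the word and its IPA transcription are made-up placeholders, and real phonemes would come from eSpeak):

```python
# Pair a word list with a phoneme list ("paste") and build a minimal
# PLS lexicon that can be saved as XML and imported into simon.
import xml.etree.ElementTree as ET

def build_pls(words, pronunciations, lang="de"):
    # PLS root element with the usual attributes
    lexicon = ET.Element("lexicon", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2005/01/pronunciation-lexicon",
        "alphabet": "ipa",
        "xml:lang": lang,
    })
    # one lexeme per (grapheme, phoneme) pair
    for word, ipa in zip(words, pronunciations):
        lexeme = ET.SubElement(lexicon, "lexeme")
        ET.SubElement(lexeme, "grapheme").text = word
        ET.SubElement(lexeme, "phoneme").text = ipa
    return ET.tostring(lexicon, encoding="unicode")

print(build_pls(["Maschine"], ["maʃiːnə"]))
```

The resulting string can be written to an .xml file and then imported into simon like any other PLS dictionary.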

This concept should work for all dictionaries that use German pronunciation (Austrian, Swiss, German medical, Latin). I haven’t tested these dictionaries for training and recognition with simon, but the concept is the same: since Ralf's German dictionary (the flagship) works with simon, the other dictionaries with German pronunciation should work, too.

Ralf’s German dictionary

Saturday, September 12th, 2009

In this article, I will explain how to import Ralf’s German dictionary into simon, and you will read about some of the properties of this dictionary.

universal

1. Select Applications > Universal Access > simon.

import-dictionary

2. Press the Word list button.
3. Press Import Dictionary.

shadow-dictionary

4. You can select the target: shadow dictionary or active dictionary. What is the right choice? For dictionary development, I often choose active dictionary (so that I have a dictionary in HTK compatible format which I use in conjunction with sam). But let’s now choose the shadow dictionary as target.

5. Press the Next > button.

hadifix-htk-pls

6. You can choose between different lexicon types: Hadifix, HTK, PLS, and Sphinx. Select PLS.
7. Press the Next > button.

save-page

8. You are now here: http://script.blau.in/xml/german.xml
9. Save Page As... doesn’t work; I just tried it. If you choose this option, the page will be saved as an HTML file. You have to choose a different way.

page-source

10. Select View Page Source.

lexeme-grapheme

11. You can now see the source of the page http://script.blau.in/xml/german.xml.

12. The encoding of the page is UTF-8. This encoding ensures that even languages like Hebrew can be processed correctly; UTF-8 is a very good standard for all languages.

13. Let’s take a look at the address of the style sheet http://script.blau.in/xml/ralf-german-dictionary.xsl. This style sheet changes the appearance of Ralf’s German dictionary when you view it with Firefox.

14. The license is GPL. It would be great if someone would expand the German dictionary.

15. The dictionary has a specific tree structure using the elements lexicon, lexeme, grapheme, phoneme.
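To illustrate, a minimal document with this tree structure could look like the following (the entry is a made-up example, not an excerpt from the actual dictionary):

```xml
<lexicon version="1.0" alphabet="ipa" xml:lang="de"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme>Maschine</grapheme>
    <phoneme>maʃiːnə</phoneme>
  </lexeme>
</lexicon>
```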

16. Select Save Page As....

import

17. Choose the location of Ralf’s German dictionary that you downloaded a few moments ago. On my computer, the XML file is located here: /home/liberty/200909/german.xml.

finish

18. Ralf’s German dictionary has been imported successfully.
19. Press the Finish button.

maschine

20. To take a look at the imported dictionary, select Include unused words from the shadow lexicon.
21. Drag and drop the word Maschine into the white area.

train-selected

22. You want to train the word Maschine.
23. Press the button to start with the training.

add-maschine

24. Currently, the word Maschine is just part of the shadow lexicon. It is not part of the active lexicon. Press Yes to add it to the active lexicon.

sampa

25. You want to define the pronunciation of the word Maschine. The pronunciation is being displayed in SAMPA.
26. I find the concept of terminals difficult; it is explained in the simon handbook. I am using the terminal Unknown.

OK, I am finishing here. If you want to know more about simon, please read the simon handbook (PDF).

You should now be able to import Ralf’s German dictionary into simon.

Import PLS dictionary to active vocabulary

Sunday, August 23rd, 2009

I imported the whole PLS dictionary /home/liberty/200905/voxDE20090209.xml into the active vocabulary. This feature had been added to simon a few weeks ago:

“simon can now import dictionaries to the active lexicon.”

You know that my next goal is to hit the 1000 words mark: 1000 words should be recognized by simon. At the moment, I have major recognition problems; simon isn’t very responsive. It recognizes e.g. the word “abnahmen”, but when I dictate other words (that are of course part of the active vocabulary and that I have successfully trained), simon doesn’t react. Maybe it is something with the confidence score? Or maybe the speech model was changed while I was playing with sam?

Well, the active vocabulary now contains more than 8000 words. When I dictate, simon now recognizes words that I never trained, and of course it recognizes the wrong words. So I will have to figure out how to adjust the speech model.

For example, I could record lots of single words with Audacity (not utterances, because I find it difficult to define an appropriate grammar) and use the Export Multiple... function. I am using Audacity in combination with my external USB sound card. Under Ubuntu, this sound card only works with 22050 hertz, not with 16000 hertz. This is the reason why I use my onboard sound card when dictating into simon directly (= recognition) or when recording words with simon (= training).

It is a bit complicated to explain. I prefer Audacity for recording because it allows me to record lots of training samples in a short amount of time. So if I record with Audacity at 22050 hertz, I have to resample the wav files with sox. I tested the command from the Sphinx guide; the following command transformed a 22050 hertz file successfully into 16000 hertz:

$ sox de27-02.wav -r 16000 -c 1 -s de27-02-test.wav

With Audacity, I could record all 8000 words that are now in my active vocabulary, say in packages of 100 words. Two years ago, Audacity allowed me to export only about 30 wav files at a time; otherwise the application would crash. I will have to test the current version of Audacity; probably this issue has been fixed.

My main concern is that the words of my dictionary are often very similar. Here is an example:

DUTZEND [Dutzend] d U ts @ n t
DUTZEND [Dutzend] d U ts n= t
DUTZENDE [Dutzende] d U ts @ n d @
DUTZENDE [Dutzende] d U ts n= d @
DUTZENDEN [Dutzenden] d U ts @ n d @ n
DUTZENDEN [Dutzenden] d U ts n= d @ n
DUTZENDEN [Dutzenden] d U ts n= d n=
DUTZENDS [Dutzends] d U ts n= ts

Eight entries that are very similar. I think this is very hard to train successfully. I will have to find out how I can achieve my 1000 words goal. Maybe I should reduce the active vocabulary from 8000 words to 1000 words? Then I could use a set of words that aren’t too similar, and thus get better recognition results.

And I could sort out short words. Short words are harder to recognize than long words. One trick is to train only long words and leave out the short ones.
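Sorting out short words could be automated. The following Python sketch (a hypothetical helper, not part of simon) filters HTK-style dictionary lines like the DUTZEND entries above, keeping only pronunciations with a minimum number of phonemes:

```python
# Keep only dictionary entries whose pronunciation has at least
# min_phones phonemes; shorter words are dropped.
def filter_long_words(lines, min_phones=6):
    kept = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        # HTK dictionary format: WORD [Word] phone phone ...
        phones = parts[2:]
        if len(phones) >= min_phones:
            kept.append(line)
    return kept
```

Applied to the eight DUTZEND entries, this would drop e.g. "DUTZENDS [Dutzends] d U ts n= ts" (five phonemes) while keeping the longer variants.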

I am interested in the following function:

“simon can now import prompts files through the import training data wizard.”

This is a very interesting function for me. I have recorded lots of utterances that could be imported into simon. But I have one problem: I didn’t define an appropriate grammar. I could use a grammar with just one category of words (e.g. all words are marked as nouns, no matter what they really are; they can be adverbs, adjectives, verbs, etc.). So this could be the way to go.
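For reference, a prompts file in the usual HTK/Voxforge style is just one recording per line: the sample name followed by the transcription in the dictionary’s spelling. The file names and words below are made-up examples:

```
de27-01 DUTZEND MASCHINE
de27-02 HIMMEL HERBST
```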

I think the 1000 words goal could be hit with the present vocabulary of 8000 words. Julius allows dictation with 20.000 words, so 1000 words is a reasonable goal. When I have reached that goal, I will have to think about the following question: how can I hit the 10.000 words mark? First, I need a bigger lexicon. I don’t want to use BOMP, since it would be necessary to write them an email; I prefer to stick to free dictionaries.

Another solution could be that I switch from the German PLS dictionary to the English Voxforge dictionary. I could do the testing in English.

Video: I just recorded about 77 words

Monday, July 20th, 2009

I just recorded about 77 words. Most of them (about 72) were recognized correctly; some (about 5) were recognized wrongly. The wrongly recognized words are marked with an asterisk (*):

Geheimnissen Gehirns Geistern Geldern gemeinsam Genehmigungen Gerechtigkeit Gerichte geringem Geschichten Geschmacks Geschenken Generals Geschwindigkeiten Gewässern Gewinnern Gesetzgebers Geschäften Globalisierung Gestalten Gesprächen Glücks Großmutter Großvater Grundgesetz Grundschulen Gründern Gräbern Gymnasiums Gänge Gärtner Gästen Haaren Hafens Hamburgern Handlung Handwerker Hannovers Hauptbahnhof Hauptstadt Haushalten Heimat Herbst Hessens Herstellung Herzog Himmel Hochzeiten Hoffnungen Horizonten Hubschraubern Hunger Hälften Höfen Indonesiens Initiativen Instrumenten Jahrhundert Kalifornien Geistern* Erdbeben* niedrigem Fußbällen* optimistisch organisiert positives Professor Schauspieler Distanz* Technik Technologie Therapien Transport Grundschulen* zeichnet Zuschauer Zweifel

The lexicon contains about 159 entries (I didn’t dictate all of them). Maybe I should dictate all of them in a row and publish a video about the result?

About 93 % of the words (72 of 77) were recognized correctly.

And now, watch the video Dictating more than 70 words under Ubuntu (17.6 MB, 7:10 min).