Archive for the ‘dictionary’ Category

Ralf’s German dictionary 0.1.8

Thursday, February 18th, 2010

Ralf’s German dictionary (version 0.1.8; February 18, 2010) is available and can be imported into simon as a shadow dictionary.

It is only slightly better than the previous version 0.1.7, not substantially better.

Here is how I built this dictionary via the Ubuntu terminal:

$ saxonb-xslt -ext:on -s:german-dictionary-0.1.7.xml -xsl:espeak2perfectipa.xsl -o:prepare-0.1.8.xml

I don’t plan to add more words to the next version of this dictionary. The current focus is phoneme improvement, e.g. replacing long vowels with short vowels and vice versa where necessary. To achieve this goal, I will have to modify the XSLT style-sheet espeak2perfectipa.xsl.

OT: German phonemes for “Depot”

Saturday, February 6th, 2010

As I mentioned earlier, I want to give information about PLS dictionary development. At the moment, Ralf’s German dictionary 0.1.7 is available. Here is how I want to modify the dictionary:

I take a closer look at the XSLT style-sheet espeak2perfectipa.xsl. I haven’t improved this style-sheet during the last 4 months.

A few minutes ago, I made a slight modification to espeak2perfectipa.xsl (= XSLT style-sheet). These are the lines that modify the phonemes for Depot:

<xsl:when test="starts-with(grapheme, 'Depot')">
<xsl:for-each select="phoneme"><xsl:text>
</xsl:text><phoneme>
<xsl:variable name="sierra"><xsl:value-of select="."/></xsl:variable>
<xsl:variable name="sierra" select="replace($sierra, 'deːpɔt', 'deːpoː')"/>
<xsl:sequence select="$sierra"/></phoneme>
</xsl:for-each>
</xsl:when>

These are some lines from Ralf’s German dictionary 0.1.7 (= source XML document):

<lexeme role="Substantiv">
<grapheme>Depotbank</grapheme>
<phoneme>deːpɔtbaŋk</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotfett</grapheme>
<phoneme>deːpɔtfɛt</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgebühr</grapheme>
<phoneme>deːpɔtgeːbyːʀ</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgesetz</grapheme>
<phoneme>deːpɔtgeːzɛts</phoneme>
</lexeme>

These are the corresponding lines of the future version of the dictionary (= result XML document):

<lexeme role="Substantiv">
<grapheme>Depotbank</grapheme>
<phoneme>deːpoːbaŋk</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotfett</grapheme>
<phoneme>deːpoːfɛt</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgebühr</grapheme>
<phoneme>deːpoːgeːbyːʀ</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Depotgesetz</grapheme>
<phoneme>deːpoːgeːzɛts</phoneme>
</lexeme>

You can see the concept: The XSLT style-sheet defines which modifications the result PLS dictionary should contain.
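For readers who do not use XSLT, the same rule can be sketched in Python. This is only an illustration, not the style-sheet itself: the wrapper element `<lexicon>` without a namespace and the function name `fix_depot` are assumptions, and the two entries are copied from this blog.

```python
import xml.etree.ElementTree as ET

# Two entries taken from the blog; the <lexicon> wrapper is an assumption.
SOURCE = """<lexicon>
<lexeme role="Substantiv">
<grapheme>Depotbank</grapheme>
<phoneme>deːpɔtbaŋk</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>
</lexicon>"""


def fix_depot(root):
    """Replace 'deːpɔt' with 'deːpoː' in every lexeme whose grapheme starts with 'Depot'."""
    for lexeme in root.iter("lexeme"):
        grapheme = lexeme.findtext("grapheme", default="")
        if grapheme.startswith("Depot"):
            for phoneme in lexeme.findall("phoneme"):
                phoneme.text = (phoneme.text or "").replace("deːpɔt", "deːpoː")
    return root


root = fix_depot(ET.fromstring(SOURCE))
result = {lx.findtext("grapheme"): lx.findtext("phoneme") for lx in root.iter("lexeme")}
for grapheme, phoneme in result.items():
    print(grapheme, phoneme)
```

Only the Depot entry changes; the other entry passes through untouched, which is the behaviour the xsl:when test is meant to give.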

The whole process is invoked via the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201001/0.1.8$ saxonb-xslt -ext:on -s:german-dictionary-0.1.7.xml -xsl:espeak2perfectipa.xsl -o:prepare-0.1.8.xml

Let me explain:
- saxonb-xslt is the XSLT processor;
- german-dictionary-0.1.7.xml (= Ralf’s German dictionary 0.1.7) is the XML source document;
- espeak2perfectipa.xsl is the XSLT style-sheet;
- prepare-0.1.8.xml is the XML result document. It should become Ralf’s German dictionary 0.1.8, which I will release as soon as I have made substantial progress.
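If the transformation has to be repeated for several versions, the command can be assembled by a small script. The following is a minimal sketch; the helper name `saxon_command` is hypothetical, and only the options that appear in the command above are used.

```python
import shlex


def saxon_command(source, stylesheet, output):
    """Build the saxonb-xslt argument list used in this post."""
    return [
        "saxonb-xslt",
        "-ext:on",
        f"-s:{source}",        # XML source document
        f"-xsl:{stylesheet}",  # XSLT style-sheet
        f"-o:{output}",        # XML result document
    ]


cmd = saxon_command(
    "german-dictionary-0.1.7.xml",
    "espeak2perfectipa.xsl",
    "prepare-0.1.8.xml",
)
print(shlex.join(cmd))

# To actually run it (requires saxonb-xslt to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```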

I will document further steps of dictionary development here in this blog. I hope that I can convince some people out there to apply this concept to other PLS dictionaries. So my goal is to educate people in PLS dictionary development.

shadowvocabulary.xml

Wednesday, February 3rd, 2010

The file /home/am3msi/.kde/share/apps/simon/shadowvocabulary.xml seems to be compressed. How is this being done?
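One way to find out is to look at the first bytes of the file, because common compression formats start with a fixed signature. The sketch below does not know which format simon uses; it only checks a few well-known signatures, and the function names are hypothetical.

```python
import bz2
import gzip

# Well-known file signatures ("magic bytes") of common compression formats.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"PK\x03\x04": "zip",
}


def sniff_compression(data):
    """Return the name of the format whose signature starts the data."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown or uncompressed"


def sniff_file(path):
    """Read the first bytes of a file and guess its compression format."""
    with open(path, "rb") as handle:
        return sniff_compression(handle.read(8))


sample = b"<lexicon/>"
print(sniff_compression(gzip.compress(sample)))
print(sniff_compression(bz2.compress(sample)))
print(sniff_compression(sample))
```

Calling `sniff_file` on shadowvocabulary.xml would then give a first hint. A result of "unknown or uncompressed" does not rule out a format that is missing from the table.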

Clear button; improve phoneme

Wednesday, February 3rd, 2010

1. This is what I did a couple of minutes ago:

$ cd Documents/201001/speech2text
$ git pull origin master
$ ./build_ubuntu.sh

After starting simon, I can see that a Clear button is available:

[Image: clear]

It should now be possible to delete the active dictionary.

2. Let me make an additional remark about the phonemes of Ralf's German dictionary. The phonemes

S t E n d e: k E m pf @
S t E n d e: O R g a n i: z a ts I o: n

are not optimal. e: indicates a long vowel. Instead, there should be the short vowel @. When you watch the video 200 German words, you can hear how I pronounce these words. I pronounce them with e: (long e) instead of @ (short e). Such problems occur often in Ralf's German dictionary.

How do I modify Ralf's German dictionary? I use the Ubuntu terminal:

$ saxonb-xslt -ext:on -s:german-dictionary-0.1.7.xml -xsl:modify-german-dictionary.xsl -o:prepare-0.1.8.xml

To avoid a particular Java heap space error, I run VisualVM in the background. As a result, it is possible to modify Ralf's German dictionary with the XSLT style-sheet.

Why am I telling you these details that are not directly related to simon? Because you need a pronunciation dictionary if you want to use simon for dictation. And it is necessary to improve the quality of the phonemes that are contained in Ralf's German dictionary.

3. It is possible to modify phonemes with simon using the Edit Word button. My approach with saxonb-xslt is necessary for dictionary development because this way it is possible to modify lots of <lexeme> elements.

4. I removed the active and the shadow dictionary using the Clear button. What about an Export dictionary button? If someone edits the dictionary with simon (Edit Word button), he may want to export the dictionary.

Importing my German backup lexicon

Tuesday, January 26th, 2010

Now, I am importing my German backup lexicon into simon. I import it as Julius vocabulary (into the active vocabulary) from the location /home/am3msi/Documents/201001/model/lexicon. Obviously something went wrong; this is the result:

[Image: active-vocabulary]

1. The pronunciation is not displayed.
2. The words in the vocabulary are all upper-case.

My guess is that I imported the wrong file as Julius vocabulary. I should have imported a different file. Which one is it? Probably I should have imported model.voca. I will try that now. I am now importing /home/am3msi/Documents/201001/model/model.voca as Julius vocabulary. model.voca is the right choice.

Now, I will have to delete the existing simon active vocabulary because it contains a lot of garbage entries (because I imported lexicon instead of model.voca as Julius vocabulary). I will take a look into the folder with the shared files.

I just opened the file /home/am3msi/.kde/share/apps/simon/scenarios/general. Obviously, simon now stores the lexicon (with terminal information) in an XML file. This approach is obviously new. Is it sufficient to delete this file? I will try it.

After deleting the file /home/am3msi/.kde/share/apps/simon/scenarios/general, I restart simon. The active vocabulary is gone (this is what I intended). And the shadow dictionary is still available. Fine. It worked as intended.

You can see that I had to know where the file with the active vocabulary is located in order to fix the error I made when I imported the wrong file as Julius vocabulary.

What are my targets?

Friday, January 22nd, 2010

I have two targets:

1. Produce more PLS dictionaries that can be imported into simon. I am planning to explain the development steps in this blog. It might be a little off-topic, but I think it is important to inform the people. This means that I will provide the reader with details of dictionary development / improvement.

I want people to understand how to handle the development / improvement of PLS dictionaries.

2. I want to learn about the simon source code. How does simon work internally? I don’t need to understand every detail, but I would like to be able to understand what is going on “behind the scenes” (scenes = simon GUI; behind = simon source code). Where can I start?

The article Becoming a KDE developer contains some useful links (e.g. I could type qtdemo into the Ubuntu terminal). Qt looks like a very interesting development framework. How can I get involved?

Let me give an example: I read the C++ tutorial. The first chapters were easy. Then suddenly, it became extremely difficult. What are pointers? What is an array? At least I know how to compile a simple C++ program under Ubuntu. That is a start.

Or I could read the simon source code that is available via Sourceforge. E.g. I could read clientsocket.cpp. But I understand almost nothing.

3. Conclusion: It is a lot of work to focus on these targets.

Julius dictionary; PLS: role attribute

Wednesday, January 20th, 2010

This blog post is about (A) Julius dictionary and (B) the import of a PLS dictionary.

A. Obviously, it is possible to import a Julius dictionary:

[Image: julius-vocabulary]

I didn’t know that this kind of dictionary existed. What are the properties of this format? And what are the advantages?

B. I want to import Ralf's German dictionary (version 0.1.7; October 29, 2009). Great, simon now recognizes the role attribute:

[Image: adjektiv-substantiv]

1. A few minutes ago, I imported Ralf's German dictionary into simon. I am offering 27 PLS dictionaries for 27 different languages. Choose the dictionary that suits your native language, and import it into simon.

2. Let’s take a look into the shadow dictionary.

3. The word kernchemischen is an Adjektiv. Let’s take a look at the specific entry in Ralf's German dictionary:

<lexeme role="Adjektiv">
<grapheme>kernchemischen</grapheme>
<phoneme>kɛʀnçeːmɪʃən</phoneme>
</lexeme>

You can see that the role attribute which is part of the <lexeme> element was imported by simon. Thanks for implementing that feature.

4. The word Kerndurchmesser is a Substantiv. The corresponding entry in the PLS dictionary:

<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>

You can see the strength of the simon import process: The last two letters of Kerndurchmesser (er) correspond to one single phoneme (ɐ) in kɛʀndʊʀçmɛsɐ. Because such details are implemented, we can get a very good recognition rate, as I showed in the video with 200 German words.

Why is Ralf's German dictionary good? Let me explain about the history of this dictionary:

a. The initial steps were done at Voxforge with the development of a German pronunciation dictionary. You can convince yourself: the script espeak2Phones.pl is great because it transforms eSpeak’s cryptic ASCII output into SAMPA. This approach is good for the German language.

b. Later, we used the dictionary acquisition project for the collection of about 8,000 pronunciations. Each single pronunciation was checked by a human. The phoneme concept follows the Wiktionary.

c. I used a German spelling dictionary from OpenOffice.org to get more words for the dictionary (Ubuntu terminal command: unmunch). With eSpeak I created the phonemes. With an XSLT style-sheet (Ubuntu terminal command: saxonb-xslt) I transformed the eSpeak phonemes into IPA phonemes. And I used the XSLT style-sheet for inserting the role attribute (Substantiv, Adjektiv, Zahlwort).

d. The result is the current version of Ralf's German dictionary. It would be nice if someone would help with the improvement. The really difficult work has been done. But it is necessary to fine-tune the dictionary. Let me give you a concrete example:

<lexeme>
<grapheme>stromsparen</grapheme>
<phoneme>ʃtʀɔmʃpaːʀən</phoneme>
</lexeme>
<lexeme>
<grapheme>stromsparend</grapheme>
<phoneme>ʃtʀɔmspaːʀənt</phoneme>
</lexeme>

What is wrong or could be improved? First, the role attribute is missing. stromsparen is a Verb. stromsparend is an Adverb. It would be good if someone added the missing role attributes. Second, there are small phoneme corrections necessary: ʃtʀɔmʃpaːʀən is OK because you speak “schtromschparen”. But ʃtʀɔmspaːʀənt is wrong because you don’t say “schtromsparent”.
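Both kinds of problems can be listed automatically before someone fixes them by hand. The following is a minimal sketch, assuming a `<lexicon>` wrapper without a namespace. The function names are hypothetical, and the "sp" check is a crude heuristic that only produces candidates for manual review (a word may legitimately contain the sounds s and p in sequence).

```python
import xml.etree.ElementTree as ET

# Entries copied from this post; the <lexicon> wrapper is an assumption.
SAMPLE = """<lexicon>
<lexeme role="Adjektiv">
<grapheme>kernchemischen</grapheme>
<phoneme>kɛʀnçeːmɪʃən</phoneme>
</lexeme>
<lexeme>
<grapheme>stromsparen</grapheme>
<phoneme>ʃtʀɔmʃpaːʀən</phoneme>
</lexeme>
<lexeme>
<grapheme>stromsparend</grapheme>
<phoneme>ʃtʀɔmspaːʀənt</phoneme>
</lexeme>
</lexicon>"""


def missing_roles(root):
    """Graphemes of all lexemes that have no role attribute."""
    return [lx.findtext("grapheme") for lx in root.iter("lexeme") if not lx.get("role")]


def suspicious_sp(root):
    """Graphemes written with 'sp' whose phoneme string contains 'sp' instead of 'ʃp'."""
    hits = []
    for lx in root.iter("lexeme"):
        grapheme = lx.findtext("grapheme", default="")
        for phoneme in lx.findall("phoneme"):
            if "sp" in grapheme.lower() and "sp" in (phoneme.text or ""):
                hits.append(grapheme)
    return hits


root = ET.fromstring(SAMPLE)
print(missing_roles(root))
print(suspicious_sp(root))
```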

You can see that improvements are necessary. Because Ralf's German dictionary is GPLv3, everyone is permitted to improve it.

For good recognition results, things like “schtromsparent” have to be fixed. It is possible that some dialects (e.g. Hamburg) speak “stromsparen” and not “schtromschparen”. I recommend that specific dialect dictionaries should be developed. This would be part of the fine-tuning, too. Ralf's German dictionary covers Standard German. You can use my dictionary for the development of a dialect dictionary that can be used by people who prefer to dictate in their own specific dialect.

By the way, did you notice the following detail? ʃtʀɔmspaːʀənt ends with a “t” and not with a “d” because of the “Auslautverhärtung” (final devoicing, which is part of the German pronunciation). Such small details are implemented in the dictionary.
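The rule can be written down in a few lines. This sketch is a simplification: it only looks at the last character of the whole word, while the real rule also applies at the end of syllables inside a word. The function name and the input form ending in “d” are hypothetical illustrations, not taken from the dictionary.

```python
# Voiced obstruents and their voiceless counterparts.
# Both the ASCII "g" and the IPA "ɡ" (U+0261) are covered.
FINAL_DEVOICING = {"b": "p", "d": "t", "g": "k", "ɡ": "k", "v": "f", "z": "s"}


def devoice_final(ipa):
    """Apply final devoicing to the last sound of an IPA string."""
    if ipa and ipa[-1] in FINAL_DEVOICING:
        return ipa[:-1] + FINAL_DEVOICING[ipa[-1]]
    return ipa


# Hypothetical input: the word as it would sound without final devoicing.
print(devoice_final("ʃtʀɔmʃpaːʀənd"))
# A word that already ends in a voiceless sound stays unchanged.
print(devoice_final("deːpɔtbaŋk"))
```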

C. Conclusion: I know about the strengths of PLS, but I don’t know which advantages a Julius dictionary would have to offer.

Benefitting from eSpeak

Saturday, January 9th, 2010

Is eSpeak good or bad?

“Espeak with it amazingly bad speech synthesis quality and even more amazing popularity. Out-of-date synthesis method doesn’t let it be good with any possible modifications.”

I used eSpeak for the creation of my 27 PLS dictionaries (the phonemes were created with the help of eSpeak). I found out that the phoneme quality for German isn’t that bad. It is usable for speech recognition after I made some adjustments with an XSLT style-sheet.

What about the other languages? To be honest: at the moment, I don’t care. I need the 27 PLS dictionaries mainly for propaganda. It is necessary to involve more people in the development of an open source ASR solution.

A Polish native speaker wants to dictate in the Polish language. Or another user wants to dictate in the Vietnamese language. Or someone wants to dictate in the Greek language. These people could take advantage of PLS dictionaries in their own languages.

This is what I want to do: Build a PLS dictionary in a Chinese language (e.g. Cantonese – eSpeak offers this language as provisional language). I need a GPL word list with Cantonese words. But I didn’t find one on the internet (the description of this word list should be in English because I don’t understand Cantonese).

Is eSpeak’s synthesis method out of date? I don’t know. At least eSpeak creates phonemes that I can implement in my PLS dictionaries. Is there a program available that produces better results than eSpeak? The program has to work out of the box. I can use eSpeak by simply typing “espeak” into the Ubuntu terminal. And eSpeak can interpret SSML mark-up. That worked fine for me.

My PLS dictionaries are in an early state of development. It should be possible to increase the quality substantially with the help of some engaged native speakers.

Things have to work. To be more precise: it should be possible for the user to import a PLS dictionary in his own native language into simon. I made a start by offering 27 PLS dictionaries. At the moment, I am thinking about whether I should offer many more PLS dictionaries. The problem is: I don’t know how I can create the phonemes for the specific language. I will find a work-around for this problem.

Which kind of phonemes should the PLS dictionary contain? There are several possibilities:
- IPA phonemes (like in Ralf's German dictionary), advantage: can easily be edited by linguists;
- eSpeak phonemes (like in Ralf's Polish dictionary), advantage: I didn’t introduce new errors by trying to convert them into IPA;
- SAMPA phonemes (none of my dictionaries uses SAMPA), I don’t see any advantage at the moment.

In my opinion, a good phoneme quality can be achieved by using IPA phonemes, because IPA phonemes are easy for linguists to read. So what can you learn from this post? If you are a native speaker of Vietnamese, Polish or Greek, you may want to take a closer look at Ralf’s Vietnamese / Polish / Greek dictionary, and think about what you can do to improve the quality.

What advantage do Ralf’s PLS dictionaries offer? They show you a way to make speech recognition work for your native language. As soon as you have a PLS dictionary with acceptable quality for your own language, you can think about using it for training with simon.

You can learn from my blog that you can import
- Ralf's Vietnamese dictionary,
- Ralf's Polish dictionary,
- Ralf's Greek dictionary
into simon. So simon is the target application. If you improve the quality of the specific dictionary, there is a chance that it might work.

And there is another thing that I found out when building / importing each of these 27 PLS dictionaries: the dictionary size should be about 100,000 words (not 1 million words, not 10,000 words). Help is needed to implement a good compression algorithm (like the affix compression of OpenOffice.org dictionaries, which unmunch expands into full word lists).

The focus should be to
- improve the quality of each PLS dictionary – native speakers should do that;
- integrate an option into simon to automatically download & import each of these PLS dictionaries into simon;
- think about a good compression algorithm for PLS dictionaries (like the affix compression that unmunch expands) – languages like Spanish, Dutch, German, Latin need such a compression algorithm – not necessary for English.

Tutorial: how to install under Ubuntu

Friday, January 8th, 2010

This tutorial explains how to install simon under Ubuntu, and how to import Ralf’s Portuguese dictionary.

1. Download simon.
2. Double-click on simon-0.2-Linux_i386.deb:

[Image: ubuntu-deb]

3. Press Install Package:

[Image: install-speechrecognition]

4. Enter the password that you chose during your Ubuntu installation:

[Image: administrative-rights]

Press the OK button.

5. The installation has finished. The package simon-0.2-Linux_i386.deb has been installed:

[Image: installation-finished]

Press the Close button.

6. Select Applications > Universal Access > simon:

[Image: universal-access]

7. Take a look at the simon main window:

[Image: press-wordlist]

Press the Wordlist button.

8. The Wordlist tab has opened:

[Image: import-dictionary]

Press the Import Dictionary button.

9. You have to select the type of the dictionary:

[Image: select-dictionary-type]

Choose PLS Lexicon, then press the Next button.

Note for simon development team: it would be nice if simon now offered a list of the 27 PLS dictionaries that are available.

10. You can now import one of my 27 PLS dictionaries. In the sidebar of testing simon, you can find a PLS dictionary that you can import:

[Image: sidebar-pls]

Right-click on Ralf’s Portuguese dictionary, then Save Link As....

11. The dictionary with the name portuguese-dictionary.xml.bz2 will be saved:

[Image: save-portuguese-dictionary]

It will be saved in the Downloads folder. Press the Save button.

12. It is time to import Ralf's Portuguese dictionary that you have just downloaded:

[Image: select-pls-file]

Please press the File button to point simon to the downloaded PLS dictionary.

13. Select the Downloads folder:

[Image: select-downloads-folder]

14. Select portuguese-dictionary.xml.bz2:

[Image: select-portuguese-dictionary]

On my computer, I didn’t have to press the OK button.

15. simon displays the path to the PLS dictionary:

[Image: path-to-dictionary]

Note for simon development team: it is pretty complicated to first download, and then import the dictionary. It would be nice if the wizard offered automatic download directly from the internet.

My guess is: the average user begins to lose interest in simon at this point of the installation because he already has invested about 20 minutes of his precious time. It is getting annoying. Don’t annoy the user! Offer automatic PLS dictionary import directly from the internet.

The automatic BOMP import is a great thing. But not everybody is a native German speaker. At the moment, I am offering 27 different languages. An automatic import would make simon much more interesting for a lot of people. E.g. Portuguese is spoken by 200 million native speakers. Recently, someone showed interest in Portuguese at Voxforge. You can imagine that almost all people don’t have a clue where to start, and what is necessary. Helping people from a lot of different countries would be so easy by adding an automatic import function to the wizard.

Press the Next button.

16. simon is now processing the lexicon:

[Image: processing-lexicon]

What does that mean? It means that simon converts the dictionary from PLS format into HTK format. This process works fine for Ralf's German dictionary. The process is not yet optimized for the other PLS dictionaries. If you are a native speaker of Portuguese (European), you can edit Ralf's Portuguese dictionary with a simple text editor.

17. Ralf's Portuguese dictionary has been imported:

[Image: imported-successfully]

Press the Finish button.

18. The dictionary is now available:

[Image: portuguese-shadow]

a. Select Include unused words from the shadow lexicon.
b. Use the scroll bar to get an impression of the Portuguese dictionary.
c. First column: word. Second column: corresponding pronunciation.

19. Let’s finish here. Now you know how to install simon under Ubuntu, and how to import Ralf's Portuguese dictionary into simon.
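If you want to peek into a downloaded dictionary before importing it, the .xml.bz2 file can be read directly. The following is a minimal sketch: the function names are hypothetical, the tag names are compared without their namespace because I am not assuming which namespace declaration the file uses, and the demo writes a tiny stand-in file instead of using the real download.

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_entries(path, limit=5):
    """Return up to `limit` (word, pronunciation) pairs from a bz2-compressed PLS file."""
    entries = []
    with bz2.open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle):
            if local_name(elem.tag) != "lexeme":
                continue
            word = pronunciation = None
            for child in elem:
                if local_name(child.tag) == "grapheme" and word is None:
                    word = child.text
                elif local_name(child.tag) == "phoneme" and pronunciation is None:
                    pronunciation = child.text
            entries.append((word, pronunciation))
            elem.clear()  # free memory, the real files are large
            if len(entries) >= limit:
                break
    return entries


# Demo with a stand-in file that contains one entry quoted elsewhere in this blog.
SAMPLE = """<lexicon>
<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>
</lexicon>"""

with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo-dictionary.xml.bz2")
    with bz2.open(demo_path, "wb") as handle:
        handle.write(SAMPLE.encode("utf-8"))
    demo_entries = read_entries(demo_path)

print(demo_entries)
```

For the file from this tutorial, the call would be `read_entries("portuguese-dictionary.xml.bz2")` from inside the Downloads folder.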

There are more steps necessary to make it work:
- install HTK;
- record a few training samples;
- define a grammar;
- start ksimond (PDF).

Take a look into the simon handbook to find out more about simon.

Try to install revision 1112 on 32-bit Ubuntu

Wednesday, January 6th, 2010

A few days ago, my Socket 939 computer (64-bit), which has simon revision 1090 installed on it, stopped working. Because of that, I want to try to install the current revision 1112 on a 32-bit Ubuntu 9.10 computer. This is what I do (following these and these instructions):
1. $ sudo apt-get install subversion build-essential cmake bison flex gettext gettext-kde kdeartwork kdelibs5-dev portaudio19-dev libxtst-dev libqt4-sql-sqlite libqt4-phonon-dev kdelibs4c2a
I don’t know how to install libattica. I will try it without libattica.
2. $ cd Documents/201001
3. $ svn co https://speech2text.svn.sourceforge.net/svnroot/speech2text/trunk simonsource
4. Checked out revision 1112.
5. $ cd simonsource
6. $ ./build_ubuntu.sh
It didn’t work out:

[Image: revision-1112]

I hope that the simon developers will fix this issue.

Why am I interested in the development version? Because I want to adjust my workflow for speech model development. simon/sam should increase my productivity. I want to publish a speech model that works (more or less) out of the box. But first, I have to build one. This future speech model can be used by people whose voice is similar to mine (I am not planning to build a speaker-independent speech model; I want to offer only a speaker-dependent speech model with my own voice). To achieve this goal, I need the development version.

A lot of people will be interested in simon if it works for dictation out of the box (without the need to install HTK first; without the need to record a few training samples). So my goal is to develop a speech model with > 200 German words (maybe up to 1000 German words) that the user can import into simon, and use it directly for dictation. Of course, the recognition rate probably would be pretty low. But the important thing would be: it would attract more people who maybe would invest some time to submit German speech to VoxForge.

So it has to work. The installation has to work. And the recognition has to work (even if the recognition rate is very low). The average user won’t invest more than 20 minutes of his precious time when trying simon for speech recognition. He wants the computer to recognize his voice. Either speech recognition works within 20 minutes, or the average user will lose interest in simon.

To make simon successful, it is necessary to offer speech model packs for the following languages: German (use Ralf's German dictionary – PLS format), English (use VoxForgeDict – HTK format), French (I have found one on the internet in Sphinx format), Spanish (I have found one at Voxforge, I think. Maybe someone is willing to edit Ralf's Spanish dictionary – it should be possible because the Spanish pronunciation rules are fairly regular).

I would try to prepare speech models for these four major Western languages if simon/sam works sufficiently well (especially the import/export functionality has to work). So my personal focus is on the following languages:

German – 105 million native speakers
English – 350 million native speakers
French – 110 million native speakers
Spanish – 329 million native speakers

German + English + French + Spanish = 894 million native speakers

These languages are pretty similar. simon should offer automatic dictionary import for these languages (German is already covered by Hadifix import).

It is necessary to make it as easy as possible for the average user. The German Hadifix dictionary (probably the best German dictionary currently available) can be imported automatically into simon. Why not extend this import function to other dictionaries? E.g. a future version of simon could download / import
- Ralf's German dictionary (advantage: GPLv3 – it is no problem to use this dictionary for the development of GPLv3 speech models).
- VoxForgeDict (advantage: probably very good phoneme quality).

This concept could be extended to other dictionaries (Ralf’s Spanish dictionary, Ralf’s French dictionary, …).

An automatic import function for my PLS dictionaries (I am offering most languages that eSpeak can handle) would make simon more attractive. Think about it: a user who lives on the Indian subcontinent and whose native language is Tamil (66 million native speakers) could do the following: Install simon, and choose Ralf's Tamil dictionary for automatic import into simon. Of course, at the moment this dictionary contains eSpeak characters, so the imported phonemes aren’t usable. But that is OK for the beginning, because this problem can be fixed later.

You understand why help from native speakers is needed. I don’t speak a word of Tamil. But what the simon developers could do: offer an automatic dictionary import function for 27 languages. These are the paths to the dictionaries:

  1. http://script.blau.in/afrikaans-dictionary.xml.bz2
  2. http://script.blau.in/catalan-dictionary.xml.bz2
  3. http://script.blau.in/croatian-dictionary.xml.bz2
  4. http://script.blau.in/czech-dictionary.xml.bz2
  5. http://script.blau.in/dutch-dictionary.xml.bz2
  6. http://script.blau.in/english-dictionary.xml.bz2
  7. http://script.blau.in/esperanto-dictionary.xml.bz2
  8. http://script.blau.in/french-dictionary.xml.bz2
  9. http://script.blau.in/german-dictionary.xml.bz2
  10. http://script.blau.in/greek-dictionary.xml.bz2
  11. http://script.blau.in/hindi-dictionary.xml.bz2
  12. http://script.blau.in/icelandic-dictionary.xml.bz2
  13. http://script.blau.in/italian-dictionary.xml.bz2
  14. http://script.blau.in/kurdish-dictionary.xml.bz2
  15. http://script.blau.in/latin-dictionary.xml.bz2
  16. http://script.blau.in/latvian-dictionary.xml.bz2
  17. http://script.blau.in/norwegian-dictionary.xml.bz2
  18. http://script.blau.in/polish-dictionary.xml.bz2
  19. http://script.blau.in/portuguese-dictionary.xml.bz2
  20. http://script.blau.in/romanian-dictionary.xml.bz2
  21. http://script.blau.in/russian-dictionary.xml.bz2
  22. http://script.blau.in/slovak-dictionary.xml.bz2
  23. http://script.blau.in/spanish-dictionary.xml.bz2
  24. http://script.blau.in/swahili-dictionary.xml.bz2
  25. http://script.blau.in/swedish-dictionary.xml.bz2
  26. http://script.blau.in/tamil-dictionary.xml.bz2
  27. http://script.blau.in/vietnamese-dictionary.xml.bz2

You can see: 27 PLS dictionaries are available. All PLS dictionaries are GPLv3 (I got most word lists from OpenOffice.org spelling dictionaries – obviously, most of the word lists are GPL; some are not GPL – I didn’t use them). So there is no licensing problem. You can be sure that there is no copyright infringement because OpenOffice.org is a very good source. I didn’t use word lists without an explicit GPL license.

An example: Welsh is offered by eSpeak, but I didn’t find a GPL word list with Welsh words. So I didn’t build a dictionary for this language. I took a look into the license file (included in the Welsh spelling dictionary):

# This dictionary is covered by the GNU General Public License,
[...]
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
[...]
# 3. All modifications to the source code must be clearly marked as
# such. Binary redistributions based on modified source code
# must be clearly marked as modified versions in the documentation
# and/or other materials provided with the distribution.
#
# 4. The name of [...] may not be
# used to endorse or promote products derived from this software
# without specific prior written permission.

You can see that this license seems to be a mixed license: GPL + modifications. I can’t work with such a license. So I couldn’t use this Welsh word list for the development of a Welsh PLS dictionary. You can see: I check the license very carefully when building my dictionaries. I only use sources that clearly use the GPL without any modification.

I am using only one license for my dictionaries: GPL (almost all dictionaries are GPLv3, maybe a very old version of my German PLS dictionary is GPLv2). The concept is easy to understand: Only GPL. One license. No difficult dual-licensing or triple-licensing scheme. Voxforge collects speech under the GPL. My dictionaries are GPL. Easy, isn’t it?

simon should try to enter the market for these 27 languages. Why not offer automatic import for these 27 dictionaries? OK, eSpeak phonemes aren’t working right now. But that can be fixed later with the help of native speakers (or adjust the simon import process to eSpeak phonemes).

So my wish is an automatic import function for these PLS dictionaries. This makes it easier for interested people to become involved.

simon should try to gain market share. At the moment, the user has to do two steps:
1. Download a dictionary from the internet (he has to know where to find such a dictionary, but most users don’t have a clue).
2. Import this dictionary into simon.

These two steps could be combined into one single step: automatic download & import via a wizard. 27 PLS dictionaries could be offered at the moment. I don’t have a bandwidth limit, so it would be OK to get the dictionary directly from script.blau.in.
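The download half of such a wizard is small, because all 27 paths listed in this post follow the same pattern. The following sketch only illustrates the idea: it is not simon code, the function names are hypothetical, and it does not check whether the server still offers the files.

```python
import bz2
import urllib.request

BASE_URL = "http://script.blau.in"


def dictionary_url(language):
    """Build the download path for a language, following the pattern of the list in this post."""
    return f"{BASE_URL}/{language.lower()}-dictionary.xml.bz2"


def download_dictionary(language, timeout=30.0):
    """Download one dictionary and return the decompressed XML as bytes."""
    with urllib.request.urlopen(dictionary_url(language), timeout=timeout) as response:
        return bz2.decompress(response.read())


print(dictionary_url("Tamil"))
print(dictionary_url("portuguese"))

# The download itself needs a network connection, so it is only shown here:
# xml_bytes = download_dictionary("tamil")
```

The import half would still be done by simon, so the wizard would hand the downloaded file to the existing PLS import.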

Marketing is the big deficit of open source ASR projects. An automatic dictionary import directly from the internet for 27 languages would be good for marketing.

Import English dictionary VoxForgeDict

Friday, January 1st, 2010

Because I don’t know how to restore my German speech model (with more than 200 words), I want to train simon with the English dictionary VoxForgeDict. It is necessary to import VoxForgeDict as HTK lexicon into simon:

[Image: VoxForgeDict]

This dictionary has about 130,000 entries. That should be enough for the English language. The encoding is UTF-8.

Now I can start to train a few words. Let’s start with REGRETTING:

[Image: regretting]

1. Include unused words from the shadow dictionary.
2. Drag & drop the word REGRETTING into the right field.
3. Train selected words.
4. The active vocabulary doesn’t contain the word REGRETTING. I want to add it now. Press the Yes button.

I recorded the word REGRETTING three times. And I defined a grammar (“Unknown”). I disconnected simon, then I restarted ksimond. Then I connected simon again and pressed the Synchronize button.

simon now recognizes the one word that is part of my active vocabulary: REGRETTING.

VoxForgeDict only contains words that are written in capital letters. It would be possible to fix that with a simple text editor (though it would be a lot of work to go through the whole dictionary). Does anyone have an idea how the dictionary could be converted to lower case elegantly?
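One possible answer is a short script. The sketch below assumes that each line has the usual HTK layout: the word, an optional output symbol in square brackets, then the phonemes. It lower-cases only the word and the bracketed symbol and leaves the phonemes alone. The function names are hypothetical, and the sample line uses placeholder phonemes instead of a real VoxForgeDict entry.

```python
import re

# word, whitespace, optional [OUTPUT SYMBOL], rest of the line (the phonemes)
ENTRY = re.compile(r"^(\S+)(\s+)(\[[^\]]*\])?(.*)$")


def lowercase_entry(line):
    """Lower-case the word and the bracketed output symbol of one dictionary line."""
    match = ENTRY.match(line)
    if not match:
        return line  # blank or unexpected line: leave it untouched
    word, gap, symbol, rest = match.groups()
    return word.lower() + gap + (symbol.lower() if symbol else "") + rest


def lowercase_dictionary(source_path, target_path):
    """Write a lower-cased copy of a whole dictionary file."""
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            target.write(lowercase_entry(line.rstrip("\n")) + "\n")


# Placeholder phonemes (ph1 ph2 ph3), not copied from VoxForgeDict.
print(lowercase_entry("REGRETTING      [REGRETTING]    ph1 ph2 ph3"))
```

Note that this also lower-cases proper nouns, so names would have to be capitalized again by hand.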

Zuführungsdrähten: two pronunciations

Thursday, December 24th, 2009

I am now adding the word Zuführungsdrähten (which is part of the shadow dictionary):

[Image: zufuehrungsdraehten]

You can see that there are two pronunciation alternatives. And this proves the strength of Ralf’s German dictionary: I am using an XSLT stylesheet to fix recurring pronunciation errors (eSpeak is not perfect), and to add alternate pronunciations (using replace: replace($ieren, 'tən', 'tn̩')).
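The effect of this replace call can be sketched outside of XSLT as well. In the following sketch the function name is hypothetical, and the input string is my own illustrative transcription of the ending of the word, not a line copied from the dictionary.

```python
def with_syllabic_n(phoneme):
    """Return the pronunciation plus an alternate in which 'tən' becomes 't' + syllabic n."""
    # "n\u0329" is the letter n followed by the combining mark for a syllabic consonant.
    alternate = phoneme.replace("tən", "tn\u0329")
    if alternate == phoneme:
        return [phoneme]
    return [phoneme, alternate]


# Illustrative transcription of "-drähten" (an assumption, not from the dictionary).
print(with_syllabic_n("dʀɛːtən"))
# A pronunciation without 'tən' gets no alternate.
print(with_syllabic_n("baŋk"))
```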

It would be great if someone were willing to help with the development of Ralf's German dictionary. My concept is as follows (compare with the XSLT concept):

xslt-concept
Image source: Wikipedia

Ralf’s German dictionary (current version) = XML input
espeak2perfectipa.xsl = XSLT code
saxonb-xslt (Ubuntu terminal) = XSLT processor
Ralf’s German dictionary (future version) = Result document

The development of Ralf's German dictionary is done outside of and independently of simon. simon does a pretty good conversion from IPA to SAMPA during import. So there is no need to worry.
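To give an impression of what an IPA to SAMPA conversion involves, here is a tiny Python sketch. It is not simon's code. The table is a hand-picked subset of well-known correspondences, the function name is invented, and symbols that are not in the table are simply passed through.

```python
# Hand-picked subset of IPA -> SAMPA correspondences (illustration only).
IPA_TO_SAMPA = {
    "ː": ":",  # length mark
    "ɔ": "O",
    "ɛ": "E",
    "ɪ": "I",
    "ʊ": "U",
    "ə": "@",
    "ŋ": "N",
    "ʔ": "?",
}


def to_sampa(ipa: str) -> str:
    """Map each IPA character through the table; unknown characters are kept."""
    return "".join(IPA_TO_SAMPA.get(char, char) for char in ipa)


print(to_sampa("deːpɔtbaŋk"))
```

A real converter needs a complete table and has to deal with combining marks; this sketch only shows the principle of a symbol-by-symbol mapping.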

Ralf’s German dictionary has a lot of known weaknesses. It would be great if someone who is interested in the improvement of this dictionary would volunteer. It is not that difficult to get involved.

Everyone is permitted to improve Ralf’s German dictionary (and the corresponding XSLT code espeak2perfectipa.xsl) because both are GPLv3 licensed.

Ralf's German dictionary is the flagship. There are other dictionaries which need improvement:

A. Austrian German
Ralf’s Austrian German dictionary – it is a very small dictionary. The target group is very specific. This dictionary should contain only words that are not included in Ralf's German dictionary. So if you live in Austria, it is intended that you import two dictionaries:
1. Ralf's German dictionary (with 300.000 words);
2. Ralf's Austrian German dictionary (with specific words).

I don’t know much about Austrian German. You can get an impression of how Austrian German sounds when watching the simon video tutorial.

B. Swiss German
Maybe I will release a Swiss German dictionary. If someone from Switzerland is interested in the development of such a dictionary, I could create Ralf's Swiss German dictionary (I haven’t done that so far). I think that there is a GPL word list available at OpenOffice.org. So a volunteer would be welcome. I can help you with the first steps (unmunch, eSpeak, paste). The result would be a PLS dictionary with a vocabulary that only contains words that are specific to Swiss German.

The future Ralf's Swiss German dictionary would be interesting for people who live in Switzerland, especially for Germans who have moved there. If you emigrated from Germany to Switzerland, you should get familiar with Swiss German. If you want to stay in Switzerland, learn the local language! simon / Ralf's Swiss German dictionary might help you to reach that goal.

C. Medical German
Ralf’s German medical dictionary is targeted at people who are interested in medical education. It is necessary to develop specialised medical dictionaries. The concept is easy:
1. Import Ralf's German dictionary into simon.
2. Import Ralf's German medical dictionary. Then you can train simon to recognize medical terms (e.g. Linsenchirurgie [lɪnzənçiːʀʊʀgiː], Epilepsiebehandlung [ʔeːpiːlɛpsiːbeːandlʊŋ]).
3. Develop specialised medical dictionaries: Human anatomy, pharmacology, genetics, etc.

simon could be used by medical students. So if you are a medical student (German language), you can improve Ralf's German medical dictionary. You can add words to this dictionary. Later, in a few years when you become a doctor, you might be able to draw on your experience with simon / Ralf's German medical dictionary. Develop your own medical pronunciation dictionary, and become a better doctor!

Of course, because we are in a very early stage of development, this is just something for medical students who have enough time.

D. Latin (German pronunciation)
Ralf’s Latin dictionary needs improvement. Latin has pronunciation rules that are different from German. One possible approach: improve Ralf's Latin dictionary with an XSLT stylesheet. I explained the concept above. The stylesheet needs information that is specific to the Latin language. Sometimes the Latin e is short (e.g. currere), sometimes it is long (e.g. dēbēre). These things need to be fixed.

And of course, 1.7 million Latin words are too many. The size of this dictionary has to be reduced because of performance issues. A dictionary with about 100.000 Latin words would be optimal at the moment. We don’t yet have a routine (compression algorithm) to handle dictionaries with e.g. 1.7 million words. This has to be developed (maybe someone who is familiar with the unmunch command could do that). But a dictionary with 100.000 Latin words would be a good start.

Latin is a “dead language”. But it should be possible – thanks to simon / Ralf's Latin dictionary – to make the computer write down Latin words when you speak them into your microphone. From my point of view, Ralf’s Latin dictionary is something for Latin teachers (school or university). So the target group is very specific. I think that it can be fun for Latin students to use simon for the recognition of spoken Latin words.

E. Conclusion
The different dictionaries need improvement. Interested persons (people from Austria, Switzerland; medical students; Latin teachers) are encouraged to improve the specific dictionary on their own. My dictionaries are GPLv3 licensed. It is intended that someone improves them. This is my concept:

1. unmunch an OpenOffice.org spelling dictionary
2. generate phonemes with eSpeak
3. paste
4. convert eSpeak phonemes into IPA phonemes with XSLT
5. import the resulting PLS dictionary into simon
6. record a few words with simon
7. use simon for recognition (dictate e.g. into a gedit window)
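The paste step (joining the word list with the phoneme list) and the PLS wrapping can be sketched in Python as follows. This is only an illustration under assumptions: the function name build_pls is invented, the default alphabet="espeak" follows the attribute mentioned on this blog, and the lang default is a guess. The namespace is the one defined by the W3C PLS 1.0 specification.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_pls(words, phonemes, alphabet="espeak", lang="de"):
    """Pair words with phonemes line by line and return a PLS document as text."""
    if len(words) != len(phonemes):
        raise ValueError("word list and phoneme list differ in length")
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, XML_LANG: lang},
    )
    for word, phoneme in zip(words, phonemes):
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = word
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phoneme
    return ET.tostring(lexicon, encoding="unicode")


print(build_pls(["Depotbank"], ["deːpɔtbaŋk"], alphabet="ipa"))
```

The length check matters because a single missing line in one of the two lists would shift every following pronunciation to the wrong word.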

This concept should work with dictionaries that use the German pronunciation (Austrian, Swiss, German medical, Latin). I didn’t test these dictionaries for training / recognition with simon. But the concept is the same. Since Ralf's German dictionary (flagship) works with simon, the other dictionaries with German pronunciation should work, too.

Vietnamese Hanoi dictionary (HTK)

Friday, December 11th, 2009

I just imported a Vietnamese Northern (Hanoi) dialect dictionary. You can download the dictionary. I imported it as HTK lexicon. This is the result:

vietnamese-htk

It looks fine. Here is a small excerpt:

mà [mà] m aa2
má [má] m aa3
mả [mả] m aa4
mã [mã] m aa5
mạ [mạ] m aa6
ma [ma] m aa7

If you speak the Vietnamese language, you should get the concept. The different a-vowels are different phonemes (aa2, aa3, aa4, aa5, aa6, aa7). This approach should be OK.
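The structure of such an entry can be made explicit with a few lines of Python. This is a sketch that assumes the layout shown in the excerpt (word, output symbol in square brackets, then the phonemes separated by whitespace); the function name is invented.

```python
def parse_htk_line(line: str):
    """Split an HTK lexicon entry into (word, output symbol, phoneme list)."""
    word, rest = line.split(None, 1)
    output = None
    rest = rest.lstrip()
    if rest.startswith("["):
        closing = rest.index("]")
        output = rest[1:closing]
        rest = rest[closing + 1:]
    return word, output, rest.split()


for entry in ["mà [mà] m aa2", "má [má] m aa3", "ma [ma] m aa7"]:
    print(parse_htk_line(entry))
```

All of these entries share the initial phoneme m and differ only in the vowel phoneme, which is how the tones end up as separate phonemes.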

It would be nice if a native speaker would try to record a few Vietnamese words with simon:

record-vietnamese

It would be interesting to know whether it works since Vietnamese is completely different from English. I recommend that you try to record 10 different Vietnamese words with simon (each word 8 times).

Try to train the Polish word “JEDEN”

Thursday, November 19th, 2009

I want to import a small sample dictionary into simon (Sphinx format):

polish-sphinx

The source can be found here (I don’t know how long this link will be valid). The dictionary contains 19 Polish words (US-ASCII). Here is what you have to do next:

universal

1. Select Applications > Universal Access > simon.

import-dictionary

2. Press the Wordlist button.
3. Press Import Dictionary.

shadow-dictionary

4. You can select the target: shadow dictionary or active dictionary. For this Polish example dictionary, choose active dictionary.
5. Press the Next button.

And now it is time to choose the appropriate lexicon format:

import-sphinx

Import the dictionary (with the 19 Polish words; see the screen-shot at the beginning of this post) as SPHINX lexicon.

sphinx-automatic

You have to select the path to the Polish Sphinx dictionary. After pressing the Next button, the following message appears:

finish

The Polish Sphinx dictionary has been imported successfully. Press the Finish button.

Now let’s train a Polish word:

add-polish

a. Select the Polish word JEDEN.
b. Add to Training.
c. Train selected Words.

You can now record the Polish word with simon:

train-polish


Import 140.000 English words

Tuesday, November 17th, 2009

You can import Ralf's English dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".
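Whether a PLS file contains eSpeak or IPA phonemes can be checked before importing it. The following Python sketch is an assumption-laden illustration: it accepts the alphabet attribute either on the root element or on the individual phoneme elements (the PLS specification allows both), and the function name is invented.

```python
import xml.etree.ElementTree as ET


def phoneme_alphabets(pls_xml: str) -> set:
    """Return the set of alphabet names used by the phoneme elements."""
    root = ET.fromstring(pls_xml)
    default = root.get("alphabet")  # lexicon-wide default, may be None
    found = set()
    for element in root.iter():
        # Strip a namespace such as {http://...} from the tag name, if present.
        if element.tag.rsplit("}", 1)[-1] == "phoneme":
            found.add(element.get("alphabet", default))
    return found


sample = (
    '<lexicon alphabet="espeak"><lexeme><grapheme>x</grapheme>'
    "<phoneme>y</phoneme></lexeme></lexicon>"
)
print(phoneme_alphabets(sample))
```

If the result is anything other than ipa, the dictionary is probably one of the import-only dictionaries described here.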

[2010/01/01: it is recommended that you use VoxForgeDict if you want to use simon with the English language. VoxForgeDict is of high quality.]

Import 180.000 Slovak words

Monday, November 16th, 2009

You can import Ralf's Slovak dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Import 200.000 Icelandic words

Sunday, November 15th, 2009

Now, you can import Ralf's Icelandic dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Ralf’s Vietnamese dictionary

Sunday, November 15th, 2009

You can import Ralf's Vietnamese dictionary (version 0.1; GPLv3) into simon. The dictionary contains about 6.000 words; training is not possible. The phoneme elements contain eSpeak phonemes (not IPA phonemes).

Import 140.000 Russian words

Saturday, November 14th, 2009

Now, you can import Ralf's Russian dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Import 300.000 Norwegian words

Saturday, November 14th, 2009

You can import Ralf's Norwegian Bokmål dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".