Posts Tagged ‘PLS’

What are my targets?

Friday, January 22nd, 2010

I have two targets:

1. Produce more PLS dictionaries that can be imported into simon. I am planning to explain the development steps in this blog. It might be a little off-topic, but I think it is important to keep people informed. This means that I will provide the reader with details of dictionary development / improvement.

I want people to understand how to handle the development / improvement of PLS dictionaries.

2. I want to learn about the simon source code. How does simon work internally? I don’t need to understand every detail, but I would like to be able to understand what is going on “behind the scenes” (scenes = simon GUI; behind = simon source code). Where can I start?

The article Becoming a KDE developer contains some useful links (e.g. I could type qtdemo into the Ubuntu terminal). Qt looks like a very interesting development framework. How can I get involved?

Let me give an example: I read the C++ tutorial. The first chapters were easy. Then suddenly, it became extremely difficult. What are pointers? What is an array? At least I know how to compile a simple C++ program under Ubuntu. That is a start.

Or I could read the simon source code that is available via Sourceforge. E.g. I could read clientsocket.cpp. But I understand almost nothing.

3. Conclusion: It is a lot of work to focus on these targets.

Julius dictionary; PLS: role attribute

Wednesday, January 20th, 2010

This blog post is about (A) Julius dictionary and (B) the import of a PLS dictionary.

A. Obviously, it is possible to import a Julius dictionary:

julius-vocabulary

I didn’t know that this kind of dictionary existed. What are the properties of this format? And what are the advantages?

B. I want to import Ralf's German dictionary (version 0.1.7; October 29, 2009). Great, simon now recognizes the role attribute:

adjektiv-substantiv

1. A few minutes ago, I imported Ralf's German dictionary into simon. I am offering 27 PLS dictionaries for 27 different languages. Choose the dictionary that suits your native language, and import it into simon.

2. Let’s take a look into the shadow dictionary.

3. The word kernchemischen is an Adjektiv. Let’s take a look at the specific entry in Ralf's German dictionary:

<lexeme role="Adjektiv">
<grapheme>kernchemischen</grapheme>
<phoneme>kɛʀnçeːmɪʃən</phoneme>
</lexeme>

You can see that the role attribute which is part of the <lexeme> element was imported by simon. Thanks for implementing that feature.

4. The word Kerndurchmesser is a Substantiv. The corresponding entry in the PLS dictionary:

<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>

You can see the strength of the simon import process: The last two letters of Kerndurchmesser (“er”) correspond to one single phoneme, the final ɐ in kɛʀndʊʀçmɛsɐ. Because such details are implemented, we can get a very good recognition rate, as I showed in the video with 200 German words.
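To make the structure of such entries concrete, here is a minimal Python sketch that reads lexeme elements and their role attribute. It is not how simon does the import. The `<lexicon>` wrapper and the function name `read_lexemes` are assumptions for illustration, and a real PLS file that declares an XML namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# The two entries from this post, wrapped in a minimal <lexicon> root.
# The root element and its attribute are assumed here for illustration.
PLS_SAMPLE = """<lexicon alphabet="ipa">
<lexeme role="Adjektiv">
<grapheme>kernchemischen</grapheme>
<phoneme>kɛʀnçeːmɪʃən</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>
</lexicon>"""


def read_lexemes(pls_text):
    """Return a list of (grapheme, phoneme, role) tuples from PLS text."""
    root = ET.fromstring(pls_text)
    entries = []
    for lexeme in root.iter("lexeme"):
        grapheme = lexeme.findtext("grapheme")
        phoneme = lexeme.findtext("phoneme")
        role = lexeme.get("role")  # None when the attribute is missing
        entries.append((grapheme, phoneme, role))
    return entries


for grapheme, phoneme, role in read_lexemes(PLS_SAMPLE):
    print(grapheme, phoneme, role)
```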

Why is Ralf's German dictionary good? Let me explain about the history of this dictionary:

a. The initial steps were done at Voxforge with the development of a German pronunciation dictionary. You can convince yourself: the script espeak2Phones.pl is great because it transforms eSpeak’s cryptic ASCII output into SAMPA. This approach is good for the German language.

b. Later, we used the dictionary acquisition project to collect about 8.000 pronunciations. Every single pronunciation was checked by a human. The phoneme concept follows the Wiktionary.

c. I used a German spelling dictionary from OpenOffice.org to get more words for the dictionary (Ubuntu terminal command: unmunch). With eSpeak I created the phonemes. With an XSLT style-sheet (Ubuntu terminal command: saxonb-xslt) I transformed the eSpeak phonemes into IPA phonemes. And I used the XSLT style-sheet for inserting the role attribute (Substantiv, Adjektiv, Zahlwort).

d. The result is the current version of Ralf's German dictionary. It would be nice if someone would help with the improvement. The really difficult work has been done. But it is necessary to fine-tune the dictionary. Let me give you a concrete example:

<lexeme>
<grapheme>stromsparen</grapheme>
<phoneme>ʃtʀɔmʃpaːʀən</phoneme>
</lexeme>
<lexeme>
<grapheme>stromsparend</grapheme>
<phoneme>ʃtʀɔmspaːʀənt</phoneme>
</lexeme>

What is wrong or could be improved? First, the role attribute is missing. stromsparen is a Verb. stromsparend is an Adverb. It would be good if someone added the missing role attributes. Second, there are small phoneme corrections necessary: ʃtʀɔmʃpaːʀən is OK because you speak “schtromschparen”. But ʃtʀɔmspaːʀənt is wrong because you don’t say “schtromsparent”.

You can see that improvements are necessary. Because Ralf's German dictionary is GPLv3, everyone is permitted to improve it.
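If you want to help, a first step could be to list the entries that still lack a role attribute. The following Python sketch shows one possible way. The function name `graphemes_without_role` and the `<lexicon>` wrapper are assumptions for illustration, and a file with an XML namespace would need qualified tag names.

```python
import xml.etree.ElementTree as ET

# One entry without a role (from this post) and one with a role.
SAMPLE = """<lexicon>
<lexeme>
<grapheme>stromsparen</grapheme>
<phoneme>ʃtʀɔmʃpaːʀən</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>
</lexicon>"""


def graphemes_without_role(pls_text):
    """Return the graphemes of all lexeme elements that have no role attribute."""
    root = ET.fromstring(pls_text)
    return [
        lexeme.findtext("grapheme")
        for lexeme in root.iter("lexeme")
        if lexeme.get("role") is None
    ]


print(graphemes_without_role(SAMPLE))
```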

For good recognition results, things like “schtromsparent” have to be fixed. It is possible that speakers of some dialects (e.g. Hamburg) say “stromsparen” and not “schtromschparen”. I recommend that specific dialect dictionaries should be developed. This would be part of the fine-tuning, too. Ralf's German dictionary covers Standard German. You can use my dictionary for the development of a dialect dictionary that can be used by people who prefer to dictate in their own specific dialect.

By the way, did you notice the following detail? ʃtʀɔmspaːʀənt ends with a “t” and not with a “d” because of the “Auslautverhärtung” (final-obstruent devoicing, which is part of German pronunciation). Such small details are implemented in the dictionary.
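As an illustration of the rule itself, here is a very simplified Python sketch of word-final devoicing. It is not how the dictionary was produced (the phonemes came from eSpeak). The mapping table and the name `devoice_final` are my own assumptions, and real German devoicing also applies at syllable boundaries inside a word, which this sketch ignores.

```python
# Voiced obstruents and their voiceless counterparts (simplified, assumed table).
# Both the ASCII "g" and the IPA letter "ɡ" (U+0261) are covered.
FINAL_DEVOICING = {"b": "p", "d": "t", "g": "k", "ɡ": "k", "v": "f", "z": "s"}


def devoice_final(ipa):
    """Devoice the last sound of an IPA string if it is a voiced obstruent."""
    if ipa and ipa[-1] in FINAL_DEVOICING:
        return ipa[:-1] + FINAL_DEVOICING[ipa[-1]]
    return ipa


print(devoice_final("ʃtʀɔmʃpaːʀənd"))
```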

C. Conclusion: I know about the strengths of PLS, but I don’t know which advantages a Julius dictionary would have to offer.

Benefitting from eSpeak

Saturday, January 9th, 2010

Is eSpeak good or bad?

“Espeak with it amazingly bad speech synthesis quality and even more amazing popularity. Out-of-date synthesis method doesn’t let it be good with any possible modifications.”

I used eSpeak for the creation of my 27 PLS dictionaries (the phonemes were created with the help of eSpeak). I found out that the phoneme quality for German isn’t that bad. It is usable for speech recognition after I made some adjustments with an XSLT style-sheet.

What about the other languages? To be honest: at the moment, I don’t care. I need the 27 PLS dictionaries mainly for propaganda. It is necessary to involve more people in the development of an open source ASR solution.

A Polish native speaker wants to dictate in the Polish language. Or another user wants to dictate in the Vietnamese language. Or someone wants to dictate in the Greek language. These people could take advantage of PLS dictionaries in their own languages.

This is what I want to do: Build a PLS dictionary in a Chinese language (e.g. Cantonese – eSpeak offers this language as a provisional language). I need a GPL word list with Cantonese words. But I didn’t find one on the internet (the description of this word list should be in English because I don’t understand Cantonese).

Is eSpeak’s synthesis method out of date? I don’t know. At least eSpeak creates phonemes that I can implement in my PLS dictionaries. Is there a program available that produces better results than eSpeak? The program has to work out of the box. I can use eSpeak by simply typing “espeak” into the Ubuntu terminal. And eSpeak can interpret SSML mark-up. That worked fine for me.

My PLS dictionaries are in an early state of development. It should be possible to increase the quality substantially with the help of some engaged native speakers.

Things have to work. To be more precise: it should be possible for the user to import a PLS dictionary in his own native language into simon. I made a start by offering 27 PLS dictionaries. At the moment, I am thinking about whether I should offer many more PLS dictionaries. The problem is: I don’t know how I can create the phonemes for the specific language. I will find a work-around for this problem.

Which kind of phonemes should the PLS dictionary contain? There are several possibilities:
- IPA phonemes (like in Ralf's German dictionary), advantage: can easily be edited by linguists;
- eSpeak phonemes (like in Ralf's Polish dictionary), advantage: I didn’t introduce new errors by trying to convert them into IPA;
- SAMPA phonemes (none of my dictionaries uses SAMPA), I don’t see any advantage at the moment.

In my opinion, good phoneme quality can be achieved by using IPA phonemes because they are easy for linguists to read. So what can you learn from this post? If you are a native speaker of Vietnamese, Polish, Greek, you may want to take a closer look at Ralf’s Vietnamese / Polish / Greek dictionary, and think about what you can do to improve the quality.

Which advantages do Ralf’s PLS dictionaries offer? They show you a way to make speech recognition work for your native language. As soon as you have a PLS dictionary with acceptable quality for your own language, you can think about using it for training with simon.

You can learn from my blog that you can import
- Ralf's Vietnamese dictionary,
- Ralf's Polish dictionary,
- Ralf's Greek dictionary
into simon. So simon is the target application. If you improve the quality of the specific dictionary, there is a chance that it might work.

And there is another thing that I found out when building / importing each of these 27 PLS dictionaries: the dictionary size should be about 100.000 words (not 1 million words, not 10.000 words). Help is needed to implement a good compression algorithm (like the affix compression of OpenOffice.org dictionaries, which unmunch expands).

The focus should be to
- improve the quality of each PLS dictionary – native speakers should do that;
- integrate an option into simon to automatically download & import each of these PLS dictionaries into simon;
- think about a good compression algorithm for PLS dictionaries (like the affix compression that unmunch expands) – languages like Spanish, Dutch, German, Latin need such a compression algorithm – not necessary for English.

Tutorial: how to install under Ubuntu

Friday, January 8th, 2010

This tutorial explains how to install simon under Ubuntu, and how to import Ralf’s Portuguese dictionary.

1. Download simon.
2. Double-click on simon-0.2-Linux_i386.deb:

ubuntu-deb

3. Press Install Package:

install-speechrecognition

4. Enter the password that you had chosen during your Ubuntu installation:

administrative-rights

Press the OK button.

5. The installation has been finished. The package simon-0.2-Linux_i386.deb has been installed:

installation-finished

Press the Close button.

6. Select Applications > Universal Access > simon:

universal-access

7. Take a look at the simon main window:

press-wordlist

Press the Wordlist button.

8. The Wordlist tab has opened:

import-dictionary

Press the Import Dictionary button.

9. You have to select the type of the dictionary:

select-dictionary-type

Choose PLS Lexicon, then press the Next button.

Note for simon development team: it would be nice if simon now offered a list of the 27 PLS dictionaries that are available.

10. You can now import one of my 27 PLS dictionaries. In the sidebar of testing simon, you can find a PLS dictionary that you can import:

sidebar-pls

Right-click on Ralf’s Portuguese dictionary, then Save Link As....

11. The dictionary with the name portuguese-dictionary.xml.bz2 will be saved:

save-portuguese-dictionary

It will be saved in the Downloads folder. Press the Save button.

12. It is time to import Ralf's Portuguese dictionary that you have just downloaded:

select-pls-file

Please press the File button to point simon to the downloaded PLS dictionary.

13. Select the Downloads folder:

select-downloads-folder

14. Select portuguese-dictionary.xml.bz2:

select-portuguese-dictionary

On my computer, I didn’t have to press the OK button.

15. simon displays the path to the PLS dictionary:

path-to-dictionary

Note for simon development team: it is pretty complicated to first download, and then import the dictionary. It would be nice if the wizard offered automatic download directly from the internet.

My guess is: the average user begins to lose interest in simon at this point in the installation because he has already invested about 20 minutes of his precious time. It is getting annoying. Don’t annoy the user! Offer automatic PLS dictionary import directly from the internet.

The automatic BOMP import is a great thing. But not everybody is a native German speaker. At the moment, I am offering 27 different languages. An automatic import would make simon much more interesting for a lot of people. E.g. Portuguese is spoken by 200 million native speakers. Recently, someone showed interest in Portuguese at Voxforge. You can imagine that almost all people don’t have a clue where to start, and what is necessary. Helping people from a lot of different countries would be so easy by adding an automatic import function to the wizard.

Press the Next button.

16. simon is now processing the lexicon:

processing-lexicon

What does that mean? It means that simon converts the dictionary from PLS format into HTK format. This process works fine for Ralf's German dictionary. The process is not yet optimized for the other PLS dictionaries. If you are a native speaker of Portuguese (European), you can edit Ralf's Portuguese dictionary with a simple text editor.

17. Ralf's Portuguese dictionary has been imported:

imported-successfully

Press the Finish button.

18. The dictionary is now available:

portuguese-shadow

a. Select Include unused words from the shadow lexicon.
b. Use the scroll bar to get an impression of the Portuguese dictionary.
c. First column: word. Second column: corresponding pronunciation.

19. Let’s finish here. Now you know how to install simon under Ubuntu, and how to import Ralf's Portuguese dictionary into simon.

There are more steps necessary to make it work:
- install HTK;
- record a few training samples;
- define a grammar;
- start ksimond (PDF).

Take a look into the simon handbook to find out more about simon.
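If you want to look into the downloaded dictionary before editing it, note that the file from step 11 is bzip2-compressed. The following Python sketch prints its first lines without unpacking it by hand. The function name `preview_dictionary` is my own, and UTF-8 encoding of the file is an assumption.

```python
import bz2
import os


def preview_dictionary(path, max_lines=10):
    """Return the first max_lines lines of a bzip2-compressed text file."""
    lines = []
    # Text mode ("rt") decompresses and decodes on the fly; UTF-8 is assumed.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) >= max_lines:
                break
    return lines


if __name__ == "__main__":
    # Location used in the tutorial above; adjust it if you saved the file elsewhere.
    path = os.path.expanduser("~/Downloads/portuguese-dictionary.xml.bz2")
    if os.path.exists(path):
        for line in preview_dictionary(path):
            print(line)
```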

Import 180.000 Slovak words

Monday, November 16th, 2009

You can import Ralf's Slovak dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Import 200.000 Icelandic words

Sunday, November 15th, 2009

Now, you can import Ralf's Icelandic dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Ralf’s Vietnamese dictionary

Sunday, November 15th, 2009

You can import Ralf's Vietnamese dictionary (version 0.1; GPLv3) into simon. The dictionary contains about 6.000 words; training is not possible. The phoneme elements contain eSpeak phonemes (not IPA phonemes).

Import 140.000 Russian words

Saturday, November 14th, 2009

Now, you can import Ralf's Russian dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Import 300.000 Norwegian words

Saturday, November 14th, 2009

You can import Ralf's Norwegian Bokmål dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Import 380.000 Swedish words

Friday, November 13th, 2009

You can import Ralf's Swedish dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible. The phoneme elements contain eSpeak phonemes (not IPA phonemes).

On my computer, the import of this PLS dictionary took three minutes.

Import 40.000 Swahili words

Thursday, November 12th, 2009

You can import Ralf's Swahili dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible. The phoneme elements contain eSpeak phonemes (not IPA phonemes).

Import 60.000 Tamil words

Thursday, November 12th, 2009

You can import Ralf's Tamil dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

You can try to record Tamil words with simon:

tamil-record

It is necessary to do further adjustments. This is just a first step to make simon work with the Tamil language.

If your native language is Tamil, you may want to do the following things:

1. Install simon.
2. Install HTK.
3. Import Ralf's Tamil dictionary into simon.
4. Record some Tamil words with simon (see screen-shot).

After you have done that, you can try to improve Ralf's Tamil dictionary.

Import 150.000 Romanian words

Thursday, November 12th, 2009

You can import Ralf's Romanian dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak".

Some words about the creation of this dictionary: After getting a spelling dictionary, I built an SSML document from it and generated the phonemes from that document with eSpeak.

Import 200.000 Croatian words

Wednesday, November 11th, 2009

You can import Ralf's Croatian dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak". The eSpeak phonemes are slightly modified: The & has been replaced by &amp;.
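The replacement is needed because a bare & is not allowed in XML text. As a sketch of how this could be done automatically in Python: the function name `phoneme_element` and the phoneme string in the example are made up for illustration.

```python
from xml.sax.saxutils import escape


def phoneme_element(espeak_phonemes):
    """Wrap an eSpeak phoneme string in a <phoneme> element, escaping &, < and >."""
    return "<phoneme>" + escape(espeak_phonemes) + "</phoneme>"


# "k&t" is a made-up phoneme string, used only to show the escaping.
print(phoneme_element("k&t"))
```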

Ralf’s Polish dictionary

Tuesday, November 10th, 2009

You can import Ralf's Polish dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is currently not possible.

Some details about (the creation of) this dictionary:
- The grapheme elements often contain garbage characters. I tried to fix this issue (with iconv), but unfortunately I wasn’t successful. When the grapheme element contains crap, the phoneme element is crappy, too. These lexeme elements are unusable.
- I didn’t convert the eSpeak phonemes into IPA phonemes. They are still in their original format. This is indicated by the attribute of the lexicon element: alphabet="espeak"
- As a source, I used a spelling dictionary.

This dictionary is of really bad quality. It is necessary to solve the encoding issue. But I tried three different conversions, and I failed.

At least, you can get an impression of how easy it is to import 300.000 Polish words.

Ralf’s Esperanto dictionary

Saturday, November 7th, 2009

You can import Ralf's Esperanto dictionary (version 0.1; GPLv3) into simon. It should be possible to train simon with this dictionary. The pronunciation of this constructed language should be easy even if you have never spoken this language.

The dictionary contains about 19.000 lexeme elements. Some grapheme elements contain garbage characters. But that doesn’t affect the overall usability of the dictionary (only the specific lexeme elements are affected).

Import 90.000 Italian words

Saturday, November 7th, 2009

You can import Ralf's Italian dictionary (version 0.1; GPLv3). Training with this dictionary is currently not recommended. Some notes about (how I created) this dictionary:

1. I got an Italian spelling dictionary.

2. The unmunch command produced more than 20 million Italian words. Because simon is not intended to handle very large lexicons, I decided to use the style-sheet create-graphemes-italian.xsl instead. This style-sheet removes the prefix/suffix information from the spelling dictionary it_IT.dic. The result was an SSML file with about 90.000 Italian words.

3. I generated the corresponding phonemes from the SSML file: $ espeak -f italian-audio-o -m -v it -q -x --phonout="italian-espeak"

4. Then I combined the grapheme elements with the phoneme elements.

5. The last step was the conversion from eSpeak phonemes to IPA phonemes with the style-sheet espeak2perfectipa-italian.xsl. Here are some of the Italian specific conversions that are contained in the style-sheet:

replace($sierra, 'dZ:', 'ddʒ')
replace($sierra, 'ts:', 'ddz')
replace($sierra, 't:', 'tt')
replace($sierra, 'd:', 'dd')
replace($sierra, 's:', 'ss')
replace($sierra, 'b:', 'bb')
replace($sierra, 'k:', 'kk')

I tried to follow the IPA for Italian. To make the dictionary work with simon (so that training produces reasonable results), the simon import process has to be adjusted. Effective training is currently not possible.
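The conversions from step 5 can also be sketched in Python as an ordered list of replacements. The rules below are copied unchanged from the list above. The function name `espeak_to_ipa` and the input strings are made up for illustration, and the sketch uses plain string replacement, whereas the XSLT replace() function works with regular expressions.

```python
# (eSpeak sequence, IPA replacement) pairs copied from this post, in the same order.
# Order matters: "ts:" has to be handled before "t:" and "s:", otherwise
# the "s:" rule would turn "ts:" into "tss" first.
ITALIAN_RULES = [
    ("dZ:", "ddʒ"),
    ("ts:", "ddz"),
    ("t:", "tt"),
    ("d:", "dd"),
    ("s:", "ss"),
    ("b:", "bb"),
    ("k:", "kk"),
]


def espeak_to_ipa(phonemes, rules=ITALIAN_RULES):
    """Apply each replacement rule in order to an eSpeak phoneme string."""
    for source, target in rules:
        phonemes = phonemes.replace(source, target)
    return phonemes


# Made-up inputs, not verified eSpeak output.
print(espeak_to_ipa("fat:o"))
print(espeak_to_ipa("madZ:o"))
```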

Now you know that an Italian pronunciation dictionary exists that you can import into simon.
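For readers who do not want to use XSLT, the idea behind step 2 (dropping the prefix/suffix information instead of expanding it with unmunch) can be sketched in Python. This is a stand-in, not the style-sheet create-graphemes-italian.xsl. It assumes the usual layout of a .dic file: an entry count on the first line, then one word per line, optionally followed by a slash and affix flags. The flags in the example are made up.

```python
def strip_affix_flags(dic_lines):
    """Return the bare words from the lines of a spelling dictionary (.dic) file."""
    words = []
    for index, line in enumerate(dic_lines):
        line = line.strip()
        if not line:
            continue
        if index == 0 and line.isdigit():
            continue  # the first line of a .dic file holds the entry count
        # Everything after the first "/" is affix information; drop it.
        # Escaped slashes inside a word are not handled in this sketch.
        words.append(line.split("/", 1)[0])
    return words


# Made-up sample lines; the flags "S" and "AB" are only placeholders.
print(strip_affix_flags(["3", "casa/S", "blu", "andare/AB"]))
```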

PLS dictionary: 850.000 Spanish words

Friday, November 6th, 2009

Ralf's Spanish dictionary (version 0.1; GPLv3) contains about 850.000 words. You can import this PLS dictionary into simon. Some remarks about (how I created) this dictionary:

1. I downloaded a spelling dictionary.

2. Then I used this spelling dictionary to produce the content of the grapheme elements:

$ unmunch es_ES.dic es_ES.aff > spanish-wordlist

3. From the word list, I created the content of the phoneme elements:

$ espeak -f spanish-ssml -m -v es -q -x --phonout="spanish-phoneme"

4. I combined the grapheme with the phoneme elements:

$ paste spanish-ssml spanish-phoneme-o > spanish-pls

5. With this style-sheet, I transformed the eSpeak phonemes into IPA phonemes. I am not sure whether I transformed everything correctly. E.g., I am unsure whether the following conversion is correct:

replace($sierra, 'J\^', 'ʎ')
replace($sierra, 'J', 'ʝ')

It might be the other way around.

6. The following phonemes will probably cause problems when you import the dictionary into simon: β, ð, θ, ɣ, ɾ, ʎ, ʝ

7. On my computer, the import into simon took 5 minutes.

You can see that the import of Ralf's Spanish dictionary is possible. Unfortunately, training with this dictionary is currently almost impossible because of the phoneme issues.
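Step 4 (combining graphemes and phonemes) can also be sketched in Python instead of paste. This is only an illustration under assumptions: it takes two parallel lists, one word and one phoneme string per position, and the function name `build_pls` is my own. A complete PLS file would also need the version, namespace and language attributes on the lexicon element, which are left out here.

```python
from xml.sax.saxutils import escape


def build_pls(graphemes, phonemes, alphabet="espeak"):
    """Combine parallel word and phoneme lists into a minimal PLS-like document."""
    if len(graphemes) != len(phonemes):
        raise ValueError("word list and phoneme list differ in length")
    lines = ['<lexicon alphabet="' + alphabet + '">']
    for grapheme, phoneme in zip(graphemes, phonemes):
        lines.append("<lexeme>")
        # escape() protects &, < and > so that the result stays well-formed XML.
        lines.append("<grapheme>" + escape(grapheme.strip()) + "</grapheme>")
        lines.append("<phoneme>" + escape(phoneme.strip()) + "</phoneme>")
        lines.append("</lexeme>")
    lines.append("</lexicon>")
    return "\n".join(lines)


# Made-up phoneme strings, not verified eSpeak output.
print(build_pls(["casa", "gato"], ["kasa", "gato"]))
```

The length check is there because a single missing line in one of the two files would silently shift every following pronunciation to the wrong word.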

Import 1.7 million Latin pronunciations

Thursday, November 5th, 2009

You can import Ralf's Latin dictionary (version 0.1.1) into simon. It contains about 1.7 million Latin words. Some information about how I created the dictionary:

1. The Latin words were extracted from a Latin OpenOffice.org dictionary with the command:

$ unmunch la.dic la.aff > latin-wordlist

2. The phonemes were originally generated with eSpeak (German voice) using the command:

$ espeak -f latin-ssml -m -v de -q -x --phonout="espeak-latin"

This means that no Latin specific pronunciation rules were applied. The pronunciation is as if the Latin words were German.

3. I used this style-sheet to transform the eSpeak phonemes into IPA phonemes.

Ralf's Latin dictionary is licensed under GPLv3. On my computer, the import of the dictionary took about 15 minutes because of its size. So you have to be really patient when you import the dictionary.

It should be possible to train a few Latin words with simon. It is necessary that you pronounce the words as if they were normal German words (German accent; no Latin specific vowel length).
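Because the only language-specific part of the eSpeak call in step 2 is the voice, the command could be wrapped in a small Python helper. This is a sketch with hypothetical function names (`build_espeak_command`, `run_espeak`); the options are the ones from the command above.

```python
import subprocess


def build_espeak_command(ssml_file, voice, phoneme_file):
    """Build the argument list for the eSpeak call used in this post."""
    return [
        "espeak",
        "-f", ssml_file,              # input file
        "-m",                         # interpret SSML mark-up
        "-v", voice,                  # voice, e.g. "de" for the German voice
        "-q",                         # quiet: do not speak
        "-x",                         # write phoneme mnemonics
        "--phonout=" + phoneme_file,  # file that receives the phonemes
    ]


def run_espeak(ssml_file, voice, phoneme_file):
    """Run eSpeak; requires the espeak program to be installed."""
    subprocess.run(build_espeak_command(ssml_file, voice, phoneme_file), check=True)


print(" ".join(build_espeak_command("latin-ssml", "de", "espeak-latin")))
```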

More than 300.000 French words

Tuesday, November 3rd, 2009

Ralf's French dictionary (version 0.1.1) contains more than 300.000 French words.

It has the following known issues:
1. The dictionary probably contains about 60.000 duplicate entries. The duplicates will be removed in a future version of the dictionary.
2. More than 100 phoneme elements contain the invalid character Ã. The corresponding lexeme elements will be removed in a future version.
3. Currently, training with this dictionary is not recommended (because of the French IPA phonemes).
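Issues 1 and 2 could be located with a short Python sketch like the following. The function name `check_lexicon` and the sample entries are made up, and treating two entries as duplicates when grapheme and phoneme are both identical is my assumption. A file with an XML namespace would need qualified tag names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample: one duplicated entry and one phoneme with the invalid character.
SAMPLE = """<lexicon>
<lexeme><grapheme>ami</grapheme><phoneme>ami</phoneme></lexeme>
<lexeme><grapheme>ami</grapheme><phoneme>ami</phoneme></lexeme>
<lexeme><grapheme>test</grapheme><phoneme>tÃst</phoneme></lexeme>
</lexicon>"""


def check_lexicon(pls_text, bad_char="Ã"):
    """Return (duplicate entries, graphemes whose phoneme contains bad_char)."""
    root = ET.fromstring(pls_text)
    counts = Counter()
    bad = []
    for lexeme in root.iter("lexeme"):
        grapheme = lexeme.findtext("grapheme", default="")
        phoneme = lexeme.findtext("phoneme", default="")
        counts[(grapheme, phoneme)] += 1
        if bad_char in phoneme:
            bad.append(grapheme)
    duplicates = [pair for pair, number in counts.items() if number > 1]
    return duplicates, bad


duplicates, bad = check_lexicon(SAMPLE)
print(duplicates)
print(bad)
```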

Of course, you can import this PLS dictionary into simon. You can see that the main concept is great:

1. The grapheme elements contain the French accents according to the «Réforme 1990». Of course, there are errors in the dictionary. But most accents should be correct. Here is an example:

<lexeme>
<grapheme>Île-de-France</grapheme>
<phoneme>ildəfʀɑ̃s</phoneme>
</lexeme>

You can see that the French vowel Î is correctly implemented in the grapheme element.

2. The phoneme elements are represented following the IPA standard.
3. License is GPL. It would be nice if a native speaker would improve this dictionary. Of course, it would be allowed to transform this dictionary into Sphinx format, or into HTK format.
4. It is possible to add a role attribute to the lexeme elements. A future version of simon might be able to use this information.
5. There shouldn’t be problems with crappy characters thanks to UTF-8. I will have to fix that minor mistake mentioned above. But this is just a minor mistake, not a major mistake. So there is no need to worry about encoding issues. This problem should be solved.

You can see the advantages of this dictionary.