Posts Tagged ‘French’

Advantages of Ralf’s French dictionary

Wednesday, April 7th, 2010

At the moment, I am thinking about the advantages that Ralf's French dictionary has to offer. Let me show you the advantages:

[Image: french-advantages]

1. The dictionary is stored as an XML file. This means that you can edit the dictionary with a plain text editor such as gedit, or transform it with XSLT.

2. The encoding is UTF-8. With this encoding there shouldn’t be encoding problems:

2.a. UTF-8 means no encoding problems within the <grapheme> element: even scripts such as Hebrew or Tamil are no problem. It also means that French accents are displayed correctly inside the <grapheme> element. No garbled characters should occur.

2.b. UTF-8 means that the IPA phonemes are displayed correctly inside the <phoneme> elements. The dictionary can easily be edited by human editors. It is difficult to edit a phonetic dictionary that contains X-SAMPA or Kirshenbaum characters, whereas IPA phonemes can easily be read by humans. And it is no problem to process IPA characters with saxonb-xslt (type saxonb-xslt into the Ubuntu terminal). Every detail of the French language can be captured.
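
Points 1 and 2 can be illustrated with a few lines of Python. This is only a minimal sketch, assuming Python 3 and its standard xml.etree.ElementTree module; the helper names local_name and read_lexicon are hypothetical and have nothing to do with simon's own importer. The two entries are taken from the small French sample lexicon.

```python
import xml.etree.ElementTree as ET

# Two entries in the PLS layout used by the sample French lexicon.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" alphabet="ipa" xml:lang="fr">
  <lexeme>
    <grapheme>prophète</grapheme>
    <phoneme>pʀɔfɛt</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>voiture</grapheme>
    <phoneme>vwa.tyʁ</phoneme>
  </lexeme>
</lexicon>
"""

def local_name(tag):
    """Drop an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def read_lexicon(xml_bytes):
    """Return a list of (grapheme, [phoneme, ...]) pairs from PLS bytes."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for lexeme in root:
        if local_name(lexeme.tag) != "lexeme":
            continue
        graphemes = [c.text for c in lexeme if local_name(c.tag) == "grapheme"]
        phonemes = [c.text for c in lexeme if local_name(c.tag) == "phoneme"]
        for grapheme in graphemes:
            entries.append((grapheme, phonemes))
    return entries

# Passing UTF-8 bytes lets the parser honour the encoding declaration.
entries = read_lexicon(SAMPLE.encode("utf-8"))
for grapheme, phonemes in entries:
    print(grapheme, phonemes)
```

Because the file is UTF-8, the accented grapheme and the IPA characters pass through the parser unchanged.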

3. The license of Ralf's French dictionary is GPLv3. Maybe the simon developers are interested in offering an automatic dictionary import? The license would permit this. You can see that the dictionary design is very developer-friendly.

4. The <lexicon> element contains the attribute alphabet="ipa". Not all of my dictionaries contain IPA characters. Some dictionaries contain eSpeak characters; these dictionaries contain the attribute alphabet="espeak". The PLS standard allows us to use different alphabets for the <phoneme> elements. Personally, I prefer the IPA alphabet. But of course, other alphabets could be used. A future version of simon could differentiate between the different alphabets during dictionary import. I am thinking about the following solution:
- alphabet="ipa" is used by Ralf's German dictionary, Ralf's French dictionary, Ralf's Spanish dictionary;
- alphabet="espeak" can be used by other dictionaries (or alternatively, I transform the eSpeak phonemes into IPA phonemes). I am not sure whether it is good to use eSpeak phonemes inside some of my PLS dictionaries. Maybe it would be better to convert them into IPA phonemes?

Currently, the simon dictionary import process doesn’t differentiate between the attribute values alphabet="ipa" and alphabet="espeak". As far as I know, the eSpeak phonemes are almost the same for all languages. So maybe it would be a good idea if simon were able to import PLS dictionaries with eSpeak phonemes. I am saying maybe because I am not sure whether this would be a good decision. In the long run, IPA is the better choice (because it is easily editable by humans).
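
Differentiating between the two alphabets during import could look roughly like the following sketch. It assumes Python 3 with the standard xml.etree.ElementTree module. The HANDLERS table, the function names, and the two one-line test lexicons are made up for illustration; simon's real import code may be organised quite differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical labels for two import paths; not simon's actual code.
HANDLERS = {
    "ipa": "convert IPA phonemes",
    "espeak": "convert eSpeak phonemes",
}

def lexicon_alphabet(xml_bytes):
    """Return the alphabet attribute of the root <lexicon> element, or None."""
    return ET.fromstring(xml_bytes).get("alphabet")

def choose_handler(xml_bytes):
    """Pick an import path based on the alphabet attribute."""
    alphabet = lexicon_alphabet(xml_bytes)
    if alphabet not in HANDLERS:
        raise ValueError("unsupported alphabet: %r" % (alphabet,))
    return HANDLERS[alphabet]

# Made-up, empty lexicons that differ only in their attributes.
ipa_lexicon = b'<lexicon version="1.0" alphabet="ipa" xml:lang="fr"></lexicon>'
espeak_lexicon = b'<lexicon version="1.0" alphabet="espeak" xml:lang="af"></lexicon>'

print(choose_handler(ipa_lexicon))
print(choose_handler(espeak_lexicon))
```

Rejecting an unknown alphabet with an error is a design choice of this sketch: it avoids silently importing phonemes that would be misread.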

You can see that I spent a lot of time thinking about the different phoneme characters. It is great to see that simon handles almost all German phonemes. At the moment, there are import problems with the French IPA phonemes ɔ̃ — ɛ̃ — ɑ̃ — ɥ. The tilde is imported by simon as the SAMPA sequence n a s. This is wrong, and should be corrected.
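
One possible explanation for the nasal vowel problem (an assumption, not something confirmed from simon's source code) is that a nasal vowel such as ɑ̃ is often stored as two Unicode code points: the base vowel followed by U+0303 COMBINING TILDE. An importer that walks through the string one code point at a time would then see the tilde as a separate symbol. The Python sketch below shows the difference; split_phonemes is a hypothetical helper that keeps combining marks attached to their base character and skips the syllable dots.

```python
import unicodedata

def split_phonemes(ipa):
    """Split an IPA string into phonemes, keeping combining marks
    (such as U+0303 COMBINING TILDE) attached to the preceding base."""
    phonemes = []
    for ch in ipa:
        if ch == ".":
            continue  # syllable separator, not a phoneme
        if unicodedata.combining(ch) and phonemes:
            phonemes[-1] += ch  # attach the mark to the previous symbol
        else:
            phonemes.append(ch)
    return phonemes

# "danger" from the sample lexicon, written with explicit escapes:
# d, U+0251 (alpha), U+0303 (combining tilde), ".", U+0292 (ezh), e
word = "d\u0251\u0303.\u0292e"

naive = list(word)              # one item per code point
grouped = split_phonemes(word)  # one item per phoneme

print(len(naive), naive)
print(len(grouped), grouped)
```

With the grouped form, the nasal vowel can be looked up as a single key in a phoneme table instead of being split into a vowel and a stray tilde.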

5. Each of my dictionaries contains a language tag. Ralf's German dictionary contains xml:lang="de-DE" because it is Standard German. Ralf's French dictionary indicates Standard French by using the language tag xml:lang="fr". It would be possible to develop a phonetic dictionary for Canadian French. In this case, the tag would be xml:lang="fr-CA".

It is possible that a future version of simon automatically “understands” the language of the specific PLS dictionary. E.g. Ralf's Austrian German dictionary contains the language tag xml:lang="de-AT". Maybe it would be possible to ask the simon user via a wizard:

Which language do you want to use for dictation? Please select the appropriate language.

◊ Standard German (BOMP)
◊ Standard German (PLS)
◊ Standard German (PLS) + Austrian German (PLS)
◊ Only Austrian German (PLS)
◊ Standard French (PLS)
◊ Standard German (PLS) + Medical German (PLS)
◊ [Afrikaans, Catalan, Croatian, ...]

Thanks to the language tag xml:lang="fr" (each dictionary contains its own language tag), it should not be too difficult to develop an automatic dictionary import function for the 27 PLS dictionaries.
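
Reading the language tag is straightforward. The following is a minimal sketch, assuming Python 3 and the standard xml.etree.ElementTree module, which exposes xml:lang under the predefined XML namespace. The LABELS table and the function names are hypothetical; the labels only imitate the wizard entries listed above.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical wizard labels, keyed by language tag.
LABELS = {
    "de-DE": "Standard German (PLS)",
    "de-AT": "Austrian German (PLS)",
    "fr": "Standard French (PLS)",
    "fr-CA": "Canadian French (PLS)",
}

def lexicon_language(xml_bytes):
    """Return the xml:lang value of the root <lexicon> element, or None."""
    return ET.fromstring(xml_bytes).get(XML_LANG)

def wizard_label(tag):
    """Map a language tag to a label for the selection wizard."""
    return LABELS.get(tag, "Unknown language (%s)" % tag)

french = b'<lexicon version="1.0" alphabet="ipa" xml:lang="fr"></lexicon>'
austrian = b'<lexicon version="1.0" alphabet="ipa" xml:lang="de-AT"></lexicon>'

print(wizard_label(lexicon_language(french)))
print(wizard_label(lexicon_language(austrian)))
```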

simon offers automatic import of the German BOMP dictionary (obviously, this dictionary is of very good quality). But what about other languages? It would be good for marketing if simon offered automatic dictionary import for all 27 PLS dictionaries. Almost all of my dictionaries are at an early stage of development. But this is no problem at the moment. Each dictionary can be improved easily. It is just necessary to run an XSLT stylesheet that transforms the eSpeak phonemes into IPA phonemes. No big deal. I did this for German, for French, and for a couple of other languages.

How many phonemes are needed? I would say that we don’t need all IPA phonemes. We can stick to the existing ones plus 4 French phonemes plus 4 Spanish phonemes. That should do the job for all 27 languages. My goal is to help make simon usable for 27 languages. Usable means that all major characteristics (=phonemes) of each specific language are covered by the specific phonetic dictionary.

My personal focus is on the following dictionaries: German, French, Spanish, English (maybe I will transform the English Voxforge dictionary into PLS format). The other languages (Afrikaans, Catalan, Croatian, …) will have to use the phonemes that are used by these four languages. I don’t plan to add more phonemes to my PLS dictionaries. I want to keep it as simple as possible, and as complex as necessary.

6. I am experimenting with the role attribute in Ralf's French dictionary. I tried the following role attributes: lettre and abrévation. simon displays the terminal abrévation correctly:

[Image: abreviation]

This means that French terminals (= value of the role attribute) can contain accented characters such as é.

Conclusion: Ralf's French dictionary is user-friendly and developer-friendly. I propose that the simon developers do the following two things:
- add the 4 French phonemes ɔ̃ — ɛ̃ — ɑ̃ — ɥ to the simon import process;
- offer automatic PLS dictionary import for 27 languages (simon should download the specific dictionary directly from the internet, and import it automatically after the download has finished).

Import a big French Sphinx lexicon

Saturday, August 29th, 2009

I just downloaded the French dictionary from univ-lemans.fr. It has more than 65K words; I think that it is even bigger than 100,000 words. It is probably stored in Sphinx format. Here are some lines of that dictionary:

juridiction jj uu rr ii dd ii kk ss yy on
juridictionnel jj uu rr ii dd ii kk ss yy oo nn ai ll
juridictionnelle jj uu rr ii dd ii kk ss yy oo nn ai ll
juridictionnelle(2) jj uu rr ii dd ii kk ss yy oo nn ai ll ee
juridictions jj uu rr ii dd ii kk ss yy on
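
Judging only from these sample lines, each entry seems to consist of a word, an optional variant number in parentheses, and a space-separated list of phonemes. Under that assumption, a line could be parsed with the following Python sketch; parse_sphinx_line is a hypothetical helper, not part of simon or Sphinx.

```python
import re

# word, optional "(n)" variant marker, whitespace, then the phonemes
LINE = re.compile(r"^(?P<word>\S+?)(?:\((?P<variant>\d+)\))?\s+(?P<phones>.+)$")

def parse_sphinx_line(line):
    """Return (word, variant, [phonemes]) or None if the line does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    # Entries without a marker are treated as variant 1.
    variant = int(match.group("variant")) if match.group("variant") else 1
    return match.group("word"), variant, match.group("phones").split()

first = parse_sphinx_line("juridiction jj uu rr ii dd ii kk ss yy on")
second = parse_sphinx_line(
    "juridictionnelle(2) jj uu rr ii dd ii kk ss yy oo nn ai ll ee"
)
print(first)
print(second)
```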

I assume that this is Sphinx format, but I don’t know for sure. I will try to import this lexicon into simon as a Sphinx lexicon. I will import from the following location: /home/liberty/200908/words_dict. The import process takes a few moments because it is a big dictionary. The import was successful, but there are some minor encoding problems:

[Image: french-sphinx]

I know this problem. It is probably a result of the wrong encoding: ISO-8859-1 vs. UTF-8. It should be easy to solve that problem. The dictionary should be opened e.g. with Notepad++, and saved as UTF-8. Maybe that would work (I would have to try that).

I have now opened the file /home/liberty/200908/words_dict with Geany. The encoding is ISO-8859-1. This was my first guess. And of course, I was right. Let’s take a closer look at how to fix that encoding issue:

[Image: geany-encoding]

1. The French dictionary that had been downloaded (see the source at the beginning of this article) has the file name words_dict.
2. The encoding is ISO-8859-1.
3. Let’s try to set the encoding to UTF-8.
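
The same conversion can be done without an editor. The following Python sketch re-encodes a file from ISO-8859-1 to UTF-8; latin1_to_utf8 is a hypothetical helper name, and the demonstration uses a temporary file containing a single test word rather than the real words_dict.

```python
import tempfile
from pathlib import Path

def latin1_to_utf8(source, target):
    """Read an ISO-8859-1 file and write the same text as UTF-8."""
    text = Path(source).read_bytes().decode("iso-8859-1")
    Path(target).write_bytes(text.encode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "words_dict"
    dst = Path(tmp) / "words_dict.utf8"
    # A single test word stored in ISO-8859-1, standing in for the dictionary.
    src.write_bytes("prophète\n".encode("iso-8859-1"))
    latin1_to_utf8(src, dst)
    converted = dst.read_text(encoding="utf-8")
    utf8_size = dst.stat().st_size
    latin1_size = src.stat().st_size

print(converted.strip(), latin1_size, utf8_size)
```

Writing to a second file keeps the original untouched, which is useful if the guess about the source encoding turns out to be wrong.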

After saving the file, I will try to import the lexicon again (of course, I will rename a specific simon folder beforehand). simon offers to select a specific encoding:

[Image: encoding]

I will stick to automatic encoding. Let’s see what the result of the import process will be. Have the encoding issues been solved? Yes, everything is OK now. I now have more than 100,000 French words that could be trained with simon.

Import a small French PLS dictionary

Saturday, August 29th, 2009

I want to import a small French PLS dictionary into simon. I have created this lexicon on my own, and it contains some errors. But that doesn’t matter because I just want to demonstrate the import process.

[Image: french]

1. The lexicon has the name french-pronunciation-20090829.xml.
2. Encoding is UTF-8 for optimal compatibility.
3. License is GPL. This small prototype lexicon can be expanded without license problems.
4. Alphabet is IPA. simon is capable of importing lexicons that are stored in that format. Of course, it has to be tested with languages other than German. This is what I want to do right now with the French language.
5. The XML language tag marks the lexicon as French. The lexicon contains one word that isn’t in French but in English. I have marked that word with a special XML tag. Probably, it won’t work. But I want to see what will happen.
6. Click the Import Dictionary button.
7. The target will be the active dictionary. Maybe I will be able to use this dictionary in conjunction with sam? I will see.
8. And now it is time to press the Next > button.

As the type of dictionary, I of course select PLS lexicon. The path to the lexicon is as follows: /home/liberty/200908/french-pronunciation-20090829.xml

The lexicon has been imported, but there aren’t any words. What went wrong? I will make some changes to the XML file (delete the blank lines). I tried again after deleting the blank lines, but again: after the import of the French PLS lexicon, I can see nothing.

I will now import the German PLS lexicon from the following location: /home/liberty/200908/voxDE20090209-modified.xml. It worked. Why was it possible to import the German PLS dictionary, but not the French PLS dictionary?

I just found out what was wrong: The first line of the XML file has to look like this:

<?xml version="1.0" encoding="UTF-8"?>

Look at the double quotes. They have to be the normal double quotes (U+0022). But in my file they had been converted to typographic characters such as ” and ″. Obviously, WordPress renders the double quotes (U+0022) into some other Unicode characters.
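
A file that was damaged in this way can be repaired automatically. The Python sketch below replaces three typographic characters (U+201C, U+201D and U+2033) with the plain double quote and then checks that the result parses. It is a blunt tool, offered only as an illustration: straighten_quotes is a hypothetical helper, and it would also change these characters if they appeared inside a grapheme.

```python
import xml.etree.ElementTree as ET

# Typographic characters that replace U+0022 after rendering.
QUOTE_FIXES = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2033": '"',  # double prime
})

def straighten_quotes(text):
    """Replace typographic double quotes with plain U+0022."""
    return text.translate(QUOTE_FIXES)

# A lexicon header as it looks after the quotes were rendered.
broken = (
    "<?xml version=\u201d1.0\u2033 encoding=\u201dUTF-8\u2033?>\n"
    "<lexicon version=\u201d1.0\u2033 alphabet=\u201dipa\u201d "
    "xml:lang=\u201dfr\u201d></lexicon>"
)
fixed = straighten_quotes(broken)

try:
    ET.fromstring(broken.encode("utf-8"))
    print("broken header parsed (unexpected)")
except ET.ParseError as error:
    print("broken header rejected:", error)

root = ET.fromstring(fixed.encode("utf-8"))
print("fixed header parsed, alphabet =", root.get("alphabet"))
```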

And here is the result:

[Image: french-imported]

You can see that it is possible to import a French PLS dictionary into simon. This is the strength of the system: the capability to import lexicons from different languages. Until now, I have imported the following lexicons:

English Voxforge lexicon
German PLS lexicon
French PLS lexicon

I think that somewhere at Voxforge.org there is a Spanish lexicon available. Spanish is very consistent when it comes to pronunciation. So making simon suitable for the Spanish language shouldn’t be a big problem. I just downloaded the Spanish lexicon (3.1 MB). I think that it is stored in an HTK-compatible format. Here are the first few lines of the Spanish lexicon:

a [a] a
aaronita [aaronita] a a r o n i t a
aarónico [aarónico] a a r o n i k o
aba [aba] a b a
ababa [ababa] a b a b a
ababillarse [ababillarse] a b a b i ll a r s e
ababol [ababol] a b a b o l
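
The sample lines follow a simple pattern: the word, the output symbol in square brackets, and then the phonemes separated by spaces. A line of that shape could be parsed with the Python sketch below. It handles only the layout shown in these lines, and parse_htk_line is a hypothetical helper name.

```python
import re

# word, "[output symbol]", then the space-separated phonemes
LINE = re.compile(r"^(?P<word>\S+)\s+\[(?P<output>[^\]]*)\]\s+(?P<phones>.+)$")

def parse_htk_line(line):
    """Return (word, output_symbol, [phonemes]) or None if the line differs."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    return match.group("word"), match.group("output"), match.group("phones").split()

short = parse_htk_line("a [a] a")
longer = parse_htk_line("ababillarse [ababillarse] a b a b i ll a r s e")
print(short)
print(longer)
```

Splitting on whitespace keeps the two-letter phoneme ll as one unit.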

I don’t know how this lexicon was generated. I would like to import that Spanish dictionary right now. But first, I should delete the German lexicon (shadow lexicon) and the French lexicon (active lexicon) that are currently available in my simon installation. But before I do that, I want to upload my French lexicon to my webspace. Here it is: lexicon (license: GPL).

I didn’t start ksimond, so the only folder I have to rename is the following folder: /home/liberty/.kde/share/apps/simon/model
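
Renaming the folder instead of deleting it keeps a backup of the old model. As a rough sketch, the step could be scripted in Python as follows. The helper set_aside and the backup naming scheme are assumptions made for this example, and the demonstration runs in a temporary directory rather than on the real simon folder.

```python
import tempfile
import time
from pathlib import Path

def set_aside(folder):
    """Rename a folder by appending a timestamp, and return the new path."""
    folder = Path(folder)
    suffix = time.strftime("-backup-%Y%m%d-%H%M%S")
    backup = folder.with_name(folder.name + suffix)
    folder.rename(backup)
    return backup

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the simon model folder.
    model = Path(tmp) / "model"
    model.mkdir()
    backup = set_aside(model)
    model_still_exists = model.exists()
    backup_exists = backup.exists()
    backup_name = backup.name

print(backup_name, model_still_exists, backup_exists)
```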

I think that it worked. There are no longer any words available in the word list. Now I can import the Spanish dictionary. I will import it as an HTK dictionary from the following path: /home/liberty/200908/voxforge_lexicon_spanish. And here is the result:

[Image: spanish]

So simon should work with Spanish, too. I hope that there will be the possibility to switch between the different languages.

Feature request: multiple languages

Sunday, August 9th, 2009

I would like to be able to switch between the following languages: English, German, French, and Spanish. I have prepared a small sample PLS pronunciation dictionary for the French language:

<?xml version="1.0" encoding="UTF-8"?>
<!-- This pronunciation lexicon is licensed under the GPL. -->
<lexicon version="1.0" alphabet="ipa" xml:lang="fr">
<lexeme>
<grapheme>hambourg</grapheme>
<phoneme>ʔɑ̃.buʁ</phoneme>
</lexeme>
<lexeme>
<grapheme>manuscrit</grapheme>
<phoneme>ma.nys.kʁi</phoneme>
</lexeme>
<lexeme>
<grapheme>voiture</grapheme>
<phoneme>vwa.tyʁ</phoneme>
</lexeme>
<lexeme>
<grapheme>prophète</grapheme>
<phoneme>pʀɔfɛt</phoneme>
</lexeme>
<lexeme>
<grapheme>danger</grapheme>
<phoneme>dɑ̃.ʒe</phoneme>
</lexeme>
<lexeme>
<grapheme>heureuse</grapheme>
<phoneme>œ.ʁøz</phoneme>
</lexeme>
<lexeme>
<grapheme>heureusement</grapheme>
<phoneme>œ.ʁøzəmɔ̃</phoneme>
</lexeme>
<lexeme>
<grapheme>heureuses</grapheme>
<phoneme>œ.ʁøz</phoneme>
</lexeme>
<lexeme>
<grapheme>heureux</grapheme>
<phoneme>œ.ʁø</phoneme>
</lexeme>
<lexeme>
<grapheme>dangereuse</grapheme>
<phoneme>dɑ̃ʒʀøz</phoneme>
</lexeme>
<lexeme>
<grapheme>dangereux</grapheme>
<phoneme>dɑ̃ʒʀø</phoneme>
</lexeme>
<lexeme>
<grapheme>manteau</grapheme>
<phoneme>mɑ̃.to</phoneme>
</lexeme>
<lexeme>
<grapheme>manteaux</grapheme>
<phoneme>mɑ̃.to</phoneme>
</lexeme>
</lexicon>

This lexicon could be saved as an .xml file, and then imported into simon. But the situation is as follows: I have already imported the German PLS dictionary into simon (it works well). And of course, I don’t want to mix the German dictionary with the sample French dictionary. So it would be good if it were possible to switch between different languages.

With Dragon NaturallySpeaking 9 Preferred (Win XP), I can switch between German and English. With Vista Speech Recognition (Ultimate Edition), I could switch between French and Spanish (I don’t use Vista anymore; I am trying to migrate directly to Ubuntu). It would be good if a future version of simon offered the possibility to switch between several languages. That means that different dictionaries, different prompts, different wav training samples, and different mfc files would have to be managed by simon.
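
To make the request more concrete, here is a minimal Python sketch of how the per-language data could be grouped. Everything in it is hypothetical: the class name, the field names, and the folder layout (one subfolder per language tag) are assumptions for illustration and do not describe how simon stores its data.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class LanguageProfile:
    """Everything that would have to be switched together for one language."""
    tag: str          # language tag, e.g. "fr" or "de-DE"
    dictionary: Path  # pronunciation lexicon
    prompts: Path     # prompts
    samples: Path     # wav training samples
    features: Path    # mfc files

def profile_for(base, tag):
    """Build a profile under base/<tag>/ (assumed layout)."""
    root = Path(base) / tag
    return LanguageProfile(
        tag=tag,
        dictionary=root / "lexicon.xml",
        prompts=root / "prompts",
        samples=root / "samples",
        features=root / "mfc",
    )

# One profile per language; switching means picking another entry.
profiles = {tag: profile_for("profiles", tag) for tag in ("en", "de-DE", "fr", "es")}
active = profiles["fr"]
print(active.tag, active.dictionary)
```

Keeping the four kinds of data in one record means that a language switch cannot accidentally combine, say, French prompts with a German dictionary.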