Import a small French PLS dictionary

I want to import a small French PLS dictionary into simon. I created this lexicon myself, and it contains some errors, but that doesn’t matter because I just want to demonstrate the import process.

[Screenshot: french]

1. The lexicon has the name french-pronunciation-20090829.xml.
2. Encoding is UTF-8 for optimal compatibility.
3. License is GPL. This small prototype lexicon can be expanded without license problems.
4. Alphabet is IPA. simon is capable of importing lexicons that are stored in that format. Of course, this has to be tested with languages other than German, which is what I want to do right now with French.
5. The XML language tag marks the lexicon as French. The lexicon contains one word that is English rather than French, and I have marked that word with a special XML tag. It probably won’t work, but I want to see what happens.
6. Click the Import Dictionary button.
7. The target will be the active dictionary. Maybe I will be able to use this dictionary in conjunction with sam? I will see.
8. And now it is time to press the Next > button.
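
The lexicon properties listed above (UTF-8 encoding, IPA alphabet, French language tag) map onto a few attributes of a PLS file. A minimal Python sketch of writing such a file could look like this. The function name `build_pls` and the sample entry are made up for illustration and are not taken from my actual lexicon; the namespace and attribute names follow the W3C PLS 1.0 format as I understand it.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_pls(entries, lang="fr", alphabet="ipa"):
    """Return a PLS lexicon as UTF-8 bytes (hypothetical helper)."""
    # Serialize the PLS namespace as the default namespace (no prefix).
    ET.register_namespace("", PLS_NS)
    root = ET.Element(f"{{{PLS_NS}}}lexicon", {
        "version": "1.0",
        "alphabet": alphabet,
        f"{{{XML_NS}}}lang": lang,  # written out as xml:lang
    })
    for grapheme, phoneme in entries:
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phoneme
    # Write the declaration by hand so it uses plain double quotes (U+0022).
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")


# Illustrative entry only; not from the real lexicon file.
sample = build_pls([("bonjour", "bɔ̃ʒuʁ")])
```

Writing `sample` to a file in binary mode would give a small lexicon to experiment with. Whether simon accepts exactly this layout is something I have not verified.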

As the dictionary type I select PLS lexicon, of course. The path to the lexicon is /home/liberty/200908/french-pronunciation-20090829.xml.

The lexicon has been imported, but there aren’t any words. What went wrong? I deleted the blank lines in the XML file and tried again, with the same result: after importing the French PLS lexicon, the word list is still empty.

Next I imported the German PLS lexicon from /home/liberty/200908/voxDE20090209-modified.xml. It worked. Why was it possible to import the German PLS dictionary, but not the French one?
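
One way to narrow down a difference like this, outside of simon, is to check whether each file parses as XML at all and how many lexeme entries a parser finds. The following is only a sketch using Python's standard library; the helper names `count_lexemes` and `diagnose` are my own, and simon's importer may well behave differently from this parser.

```python
import xml.etree.ElementTree as ET


def count_lexemes(xml_bytes):
    """Return the number of <lexeme> elements; raises ET.ParseError on bad XML."""
    root = ET.fromstring(xml_bytes)
    # Compare local names so the check works with or without a namespace.
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "lexeme")


def diagnose(xml_bytes):
    """Summarize in one line whether the data is well-formed XML."""
    try:
        return f"well-formed, {count_lexemes(xml_bytes)} lexeme(s)"
    except ET.ParseError as err:
        return f"not well-formed: {err}"
```

Calling `diagnose(open(path, "rb").read())` on both lexicon files would show whether the working file and the failing file differ at the parsing stage.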

I just found out what was wrong: The first line of the XML file has to look like this:

<?xml version="1.0" encoding="UTF-8"?>

Look at the double quotes: they have to be plain double quotes (U+0022). In my file they had been rendered into something else. Apparently WordPress turns plain double quotes (U+0022) into other Unicode characters.
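
A small Python sketch of repairing such a declaration line follows. The set of look-alike characters (U+201C, U+201D, U+201E, U+2033) is my assumption about what typographic rendering may substitute, and `normalize_declaration` is a hypothetical helper name. Only the first line is touched, so quotes inside lexicon entries stay as they are.

```python
# Assumed set of characters that may replace U+0022 after typographic rendering.
TYPOGRAPHIC_QUOTES = "\u201c\u201d\u201e\u2033"


def normalize_declaration(text):
    """Replace typographic double quotes on the first line with U+0022."""
    first, sep, rest = text.partition("\n")
    table = str.maketrans({ch: '"' for ch in TYPOGRAPHIC_QUOTES})
    return first.translate(table) + sep + rest
```

The function works on text, so the file would have to be read and written with `encoding="utf-8"`.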

And here is the result:

[Screenshot: french-imported]

You can see that it is possible to import a French PLS dictionary into simon. This is the strength of the system: the capability to import lexicons for different languages. So far, I have imported the following lexicons:

English Voxforge lexicon
German PLS lexicon
French PLS lexicon

I think that somewhere at Voxforge.org there is a Spanish lexicon available. Spanish is very consistent when it comes to pronunciation, so making simon suitable for Spanish shouldn’t be a big problem. I just downloaded the Spanish lexicon (3.1 MB). I think that it is stored in an HTK-compatible format. Here are the first few lines of the Spanish lexicon:

a [a] a
aaronita [aaronita] a a r o n i t a
aarónico [aarónico] a a r o n i k o
aba [aba] a b a
ababa [ababa] a b a b a
ababillarse [ababillarse] a b a b i ll a r s e
ababol [ababol] a b a b o l
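
Judging from these lines, each entry appears to consist of the word, an output form in square brackets, and a space-separated list of phonemes. A rough Python sketch of splitting one such line follows. The function name `parse_htk_line` is made up, and the sketch ignores optional features of HTK dictionaries such as pronunciation probabilities.

```python
def parse_htk_line(line):
    """Split 'WORD [OUTPUT] PHONE PHONE ...' into (word, output, phones)."""
    fields = line.split()
    word = fields[0]
    if len(fields) > 1 and fields[1].startswith("[") and fields[1].endswith("]"):
        # Strip the brackets from the output form.
        output, phones = fields[1][1:-1], fields[2:]
    else:
        # Assumption: without a bracketed output form, the word itself is used.
        output, phones = word, fields[1:]
    return word, output, phones
```

Note that `ll` in `ababillarse` comes out as a single phoneme symbol, because the split is on whitespace rather than on characters.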

I don’t know how this lexicon was generated. I would like to import that Spanish dictionary right now. But first, I should delete the German lexicon (shadow lexicon) and the French lexicon (active lexicon) that are currently present in my simon installation. Before I do that, I want to upload my French lexicon to my webspace. Here it is: lexicon (license: GPL).

I didn’t start ksimond, so the only folder I have to rename is /home/liberty/.kde/share/apps/simon/model
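
Renaming instead of deleting keeps the old lexicons around in case I want them back. In Python, this step could be sketched as follows; the helper name `set_aside` and the `.bak-<timestamp>` suffix are my own choices, not anything simon prescribes.

```python
import time
from pathlib import Path


def set_aside(folder):
    """Rename folder to '<name>.bak-<timestamp>'; return the new path or None."""
    folder = Path(folder)
    if not folder.is_dir():
        return None  # nothing to move out of the way
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = folder.with_name(f"{folder.name}.bak-{stamp}")
    folder.rename(target)
    return target
```

For the path above, the call would be `set_aside("/home/liberty/.kde/share/apps/simon/model")`, run while simon is closed.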

I think that it worked: the word list no longer contains any words. Now I can import the Spanish dictionary. I will import it as an HTK dictionary from the following path: /home/liberty/200908/voxforge_lexicon_spanish. And here is the result:

[Screenshot: spanish]

So simon should work with Spanish, too. I hope that it will be possible to switch between the different languages.
