Posts Tagged ‘PLS’

Import 200.000 Croatian words

Wednesday, November 11th, 2009

You can import Ralf's Croatian dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible.

The <phoneme> elements contain eSpeak phonemes (not IPA phonemes) – alphabet="espeak". The eSpeak phonemes are slightly modified: The & has been replaced by &amp;.
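The escaping only matters at the XML level: a conforming parser turns &amp; back into & when the file is read. Here is a minimal Python sketch of that behaviour. The sample lexeme (word and phoneme string) is made up for illustration and is not taken from the dictionary, and the PLS namespace declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-entry lexicon; grapheme and phoneme are illustrative only.
SAMPLE = """<lexicon version="1.0" alphabet="espeak" xml:lang="hr">
  <lexeme>
    <grapheme>primjer</grapheme>
    <phoneme>pr'imj&amp;r</phoneme>
  </lexeme>
</lexicon>"""


def read_lexemes(xml_text):
    """Return (grapheme, phoneme) pairs; the parser unescapes &amp; to &."""
    root = ET.fromstring(xml_text)
    return [(lex.findtext("grapheme"), lex.findtext("phoneme"))
            for lex in root.iter("lexeme")]


pairs = read_lexemes(SAMPLE)
```

After parsing, the phoneme string contains the plain & again, so a tool that reads the file through an XML parser sees the original eSpeak phonemes.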

Ralf’s Polish dictionary

Tuesday, November 10th, 2009

You can import Ralf's Polish dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is currently not possible.

Some details about (the creation of) this dictionary:
- The grapheme elements often contain garbage characters. I tried to fix this issue with iconv, but unfortunately I wasn’t successful. When the grapheme element contains garbage, the phoneme element is garbage, too. Such lexeme elements are unusable.
- I didn’t convert the eSpeak phonemes into IPA phonemes. They are still in their original format. This is indicated by the attribute of the lexicon element: alphabet="espeak"
- I used a spelling dictionary as the source.

The quality of this dictionary is really bad. The encoding issue has to be solved, but I tried three different conversions and failed.

At least you can get an impression of how easy it is to import 300.000 Polish words.

Ralf’s Esperanto dictionary

Saturday, November 7th, 2009

You can import Ralf's Esperanto dictionary (version 0.1; GPLv3) into simon. It should be possible to train simon with this dictionary. The pronunciation of this constructed language should be easy even if you have never spoken this language.

The dictionary contains about 19.000 lexeme elements. Some grapheme elements contain garbage characters. But that doesn’t affect the overall usability of the dictionary (only the specific lexeme elements are affected).

Import 90.000 Italian words

Saturday, November 7th, 2009

You can import Ralf's Italian dictionary (version 0.1; GPLv3). Training with this dictionary is currently not recommended. Some notes about (how I created) this dictionary:

1. I got an Italian spelling dictionary.

2. The unmunch command produced more than 20 million Italian words. Because simon is not intended to handle very large lexicons, I decided to use the style-sheet create-graphemes-italian.xsl instead. This style-sheet removes the prefix/suffix information from the spelling dictionary it_IT.dic. The result was an SSML file with about 90.000 Italian words.

3. I generated from the SSML file the corresponding phonemes: $ espeak -f italian-audio-o -m -v it -q -x --phonout="italian-espeak"

4. Then I combined the grapheme elements with the phoneme elements.

5. The last step was the conversion from eSpeak phonemes to IPA phonemes with the style-sheet espeak2perfectipa-italian.xsl. Here are some of the Italian specific conversions that are contained in the style-sheet:

replace($sierra, 'dZ:', 'ddʒ')
replace($sierra, 'ts:', 'ddz')
replace($sierra, 't:', 'tt')
replace($sierra, 'd:', 'dd')
replace($sierra, 's:', 'ss')
replace($sierra, 'b:', 'bb')
replace($sierra, 'k:', 'kk')
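The same chain of replacements can be sketched outside XSLT. The following Python version is only an illustration, not the style-sheet itself: the pairs are copied from the listing above, and the order is an assumption chosen so that plain substring replacement stays safe ('ts:' has to be handled before 's:', because 's:' is contained in it).

```python
# Replacement pairs copied from the listing above. The order is an assumption:
# longer patterns come before shorter ones that they contain.
GEMINATES = [
    ("dZ:", "ddʒ"),
    ("ts:", "ddz"),  # must run before "s:", which is a substring of "ts:"
    ("t:", "tt"),
    ("d:", "dd"),
    ("s:", "ss"),
    ("b:", "bb"),
    ("k:", "kk"),
]


def convert_geminates(espeak):
    """Apply the replacements one after another, like nested replace() calls."""
    for old, new in GEMINATES:
        espeak = espeak.replace(old, new)
    return espeak
```

For example, convert_geminates("fat:o") gives "fatto". If 's:' were replaced first, a string containing 'ts:' would be turned into 'tss' and the 'ts:' rule would never match.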

I tried to follow the IPA for Italian. To make the dictionary work with simon (so that training produces reasonable results), the simon import process has to be adjusted. Effective training is currently not possible.

Now you know that an Italian pronunciation dictionary exists that you can import into simon.

PLS dictionary: 850.000 Spanish words

Friday, November 6th, 2009

Ralf's Spanish dictionary (version 0.1; GPLv3) contains about 850.000 words. You can import this PLS dictionary into simon. Some remarks about (how I created) this dictionary:

1. I downloaded a spelling dictionary.

2. Then I used this spelling dictionary to produce the content of the grapheme elements:

$ unmunch es_ES.dic es_ES.aff > spanish-wordlist

3. From the word list, I created the content of the phoneme elements:

$ espeak -f spanish-ssml -m -v es -q -x --phonout="spanish-phoneme"

4. I combined the grapheme with the phoneme elements:

$ paste spanish-ssml spanish-phoneme-o > spanish-pls

5. With this style-sheet, I transformed the eSpeak phonemes into IPA phonemes. I am not sure whether I transformed everything correctly. E.g., I am unsure whether the following conversion is correct:

replace($sierra, 'J\^', 'ʎ')
replace($sierra, 'J', 'ʝ')

It might be the other way around.

6. The following phonemes will probably cause problems when you import the dictionary into simon: β, ð, θ, ɣ, ɾ, ʎ, ʝ

7. On my computer, the import into simon took 5 minutes.
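The combining step (step 4) can also be done with an XML library instead of paste, which takes care of escaping. This is a minimal Python sketch under assumptions: the function name build_pls and its parameters are hypothetical, the two lists are expected to be in the same order, and the PLS namespace declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_pls(words, phonemes, lang="es", alphabet="ipa"):
    """Pair each word with its phoneme string and return a lexicon as XML text."""
    if len(words) != len(phonemes):
        raise ValueError("word list and phoneme list differ in length")
    lexicon = ET.Element(
        "lexicon", {"version": "1.0", "alphabet": alphabet, "xml:lang": lang}
    )
    for word, phoneme in zip(words, phonemes):
        lexeme = ET.SubElement(lexicon, "lexeme")
        ET.SubElement(lexeme, "grapheme").text = word
        ET.SubElement(lexeme, "phoneme").text = phoneme
    # Special characters such as & are escaped by the serializer.
    return ET.tostring(lexicon, encoding="unicode")
```

The length check catches the case where eSpeak produced fewer or more lines than the word list, which would otherwise shift every following pronunciation to the wrong word.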

You can see that the import of Ralf's Spanish dictionary is possible. Unfortunately, training with this dictionary is currently almost impossible because of the phoneme issues.

Import 1.7 million Latin pronunciations

Thursday, November 5th, 2009

You can import Ralf's Latin dictionary (version 0.1.1) into simon. It contains about 1.7 million Latin words. Some information about how I created the dictionary:

1. The Latin words were extracted from a Latin OpenOffice.org dictionary with the command:

$ unmunch la.dic la.aff > latin-wordlist

2. The phonemes were originally generated with eSpeak (German voice) using the command:

$ espeak -f latin-ssml -m -v de -q -x --phonout="espeak-latin"

This means that no Latin specific pronunciation rules were applied. The pronunciation is as if the Latin words were German.

3. I used this style-sheet to transform the eSpeak phonemes into IPA phonemes.

Ralf's Latin dictionary is licensed under the GPLv3. On my computer, the import of the dictionary took about 15 minutes because of its size. So you have to be really patient when you import the dictionary.

It should be possible to train a few Latin words with simon. It is necessary that you pronounce the words as if they were normal German words (German accent; no Latin specific vowel length).

More than 300.000 French words

Tuesday, November 3rd, 2009

Ralf's French dictionary (version 0.1.1) contains more than 300.000 French words.

It has the following known issues:
1. The dictionary probably contains about 60.000 duplicate entries. The duplicates will be removed in a future version of the dictionary.
2. More than 100 phoneme elements contain the invalid character Ã. The corresponding lexeme elements will be removed in a future version.
3. Currently, training with this dictionary is not recommended (because of the French IPA phonemes).
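The second issue can be handled mechanically. The following Python sketch shows one possible way to drop the affected lexeme elements. It is not the method that will be used for the next version; the function name is hypothetical and the lexicon is assumed to have no namespace declaration.

```python
import xml.etree.ElementTree as ET


def drop_invalid_lexemes(xml_text, bad_char="Ã"):
    """Remove lexemes whose phoneme text contains bad_char.

    Returns the cleaned XML text and the number of removed lexemes.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    # Iterate over a copy of the list, because the tree is modified in the loop.
    for lexeme in list(root.findall("lexeme")):
        if any(bad_char in (p.text or "") for p in lexeme.findall("phoneme")):
            root.remove(lexeme)
            removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

The returned count makes it easy to check the result against the "more than 100" affected elements mentioned above.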

Of course, you can import this PLS dictionary into simon. You can see that the main concept is great:

1. The grapheme elements contain the French accents according to the «Réforme 1990». Of course, there are errors in the dictionary. But most accents should be correct. Here is an example:

<lexeme>
<grapheme>Île-de-France</grapheme>
<phoneme>ildəfʀɑ̃s</phoneme>
</lexeme>

You can see that the French vowel Î is correctly implemented in the grapheme element.

2. The phoneme elements are represented following the IPA standard.
3. License is GPL. It would be nice if a native speaker would improve this dictionary. Of course, it would be allowed to transform this dictionary into Sphinx format, or into HTK format.
4. It is possible to add a role attribute to the lexeme elements. A future version of simon might be able to use this information.
5. There shouldn’t be problems with crappy characters thanks to UTF-8. I will have to fix that minor mistake mentioned above. But this is just a minor mistake, not a major mistake. So there is no need to worry about encoding issues. This problem should be solved.

You can see the advantages of this dictionary.

Ralf’s French dictionary with 40.000 words

Friday, October 30th, 2009

You can import Ralf's French dictionary (license: GPL; version 0.1) into simon. Currently, training with simon is not recommended with this dictionary. The dictionary should be more or less OK, but there are problems with the simon import process which I will describe later in this article.

I used a French OpenOffice.org spelling dictionary (license: GPL) to get the words for this dictionary. The phones were generated with eSpeak. I converted the eSpeak ASCII phones into IPA with this style-sheet. I am not sure whether all French vowels are converted correctly from the eSpeak phone set into IPA. Especially, I am unsure whether the following XPath expressions do the correct replacements:

select="replace($sierra, 'Y:', 'ø')"
select="replace($sierra, 'Y','œ')"

It is difficult to make the right distinction between these similar phones. Maybe there is a native speaker who is familiar with the French eSpeak phone set out there? The question is: how can I improve the conversion from French eSpeak phones into IPA phones? The XSLT-style-sheet (see the link above) contains information which replacements have been made. I am not sure whether eSpeak generated all French phones correctly. There might be some inconsistencies that have to be fixed.

I am open to suggestions on how to improve the XSLT-style-sheet. The concept is great. Support from a native speaker would be appreciated because it is difficult to catch all the details (especially vowels) of the French language.

When you import the dictionary into simon, not all phones are converted properly by simon. Probably, the following IPA symbols aren’t imported properly by simon:

1. ɲ = voiced palatal nasal = U+0272; a future version of simon could transform this phone into SAMPA J. Example:

<lexeme>
<grapheme>Avignon</grapheme>
<phoneme>aviɲɔ̃</phoneme>
</lexeme>

2. θ = voiceless dental fricative = U+03B8; a future version of simon could transform this phone into SAMPA T. Example:

<lexeme>
<grapheme>Commonwealth</grapheme>
<phoneme>kɔmənwɛlθ</phoneme>
</lexeme>

We need this phone also for (a future version of) Ralf’s German dictionary for words like Thunderbird. Currently, Ralf’s German dictionary doesn’t employ this foreign phone. But this phone is needed.

3. ɔ̃ ; see Wikipedia: “in German, in French loanwords such as Balkon, Chanson”. Example:

<lexeme>
<grapheme>Simon</grapheme>
<phoneme>simɔ̃</phoneme>
</lexeme>

4. ɑ̃ = nasalized a. Example:

<lexeme>
<grapheme>diligence</grapheme>
<phoneme>diliʒɑ̃s</phoneme>
</lexeme>

5. ɛ̃ = bright nasal vowel. Example:

<lexeme>
<grapheme>infidélité</grapheme>
<phoneme>ɛ̃fidelite</phoneme>
</lexeme>

6. œ̃ = rounded open-mid nasal vowel. Example:

<lexeme>
<grapheme>jardin</grapheme>
<phoneme>ʒaʀdœ̃</phoneme>
</lexeme>

7. ɥ = ü sound used as a consonant = U+0265. Example:

<lexeme>
<grapheme>minuit</grapheme>
<phoneme>minɥi</phoneme>
</lexeme>

You can see that the French language employs a lot of vowels. Maybe it is possible to adjust the PLS import process?

I have imported the dictionary into simon. Let’s take a look at the results (HTK format):

1. AVIGNON [Avignon] aviJOnas – has to be fixed
2. COMMONWEALTH [Commonwealth] kOm@nwElT – this has to be fixed, too.
3. SIMON [Simon] s i m O n a s – interesting: it seems that the phones are separated correctly. But it is easy to recognize that the simon import process should be adjusted.
4. DILIGENCE [diligence] d i l i Z A n a s s – the error is similar to the previous one.
5. INFIDÉLITÉ [infidélité] E n a s f i d e l i t e – has to be fixed.
6. JARDIN [jardin] Z a R d oe n a s – has to be fixed, too.
7. MINUIT [minuit] minHi – the H is the correct SAMPA transcription. But there should be spaces between the phones.
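The nasal vowels are written as a base letter plus the combining tilde (U+0303), so a converter has to keep the two code points together before it looks up a SAMPA symbol. The following Python sketch shows the idea. It is not simon's import code; the mapping table is a small assumption based on common SAMPA conventions and covers only the phones discussed here.

```python
import unicodedata

# Assumed, partial IPA-to-SAMPA table (not taken from simon).
IPA_TO_SAMPA = {
    "\u0272": "J",          # ɲ
    "\u03b8": "T",          # θ
    "\u0265": "H",          # ɥ
    "\u0254\u0303": "O~",   # ɔ + combining tilde
    "\u0251\u0303": "A~",   # ɑ + combining tilde
    "\u025b\u0303": "E~",   # ɛ + combining tilde
    "\u0153\u0303": "9~",   # œ + combining tilde
    "\u0254": "O",
    "\u025b": "E",
    "\u0259": "@",
    "\u0292": "Z",
    "\u0280": "R",
}


def split_phones(ipa):
    """Split an IPA string into phones, attaching combining marks to their base."""
    phones = []
    for ch in unicodedata.normalize("NFD", ipa):
        if unicodedata.combining(ch) and phones:
            phones[-1] += ch
        else:
            phones.append(ch)
    return phones


def to_sampa(ipa):
    """Return space-separated SAMPA symbols; unknown phones are passed through."""
    return " ".join(IPA_TO_SAMPA.get(p, p) for p in split_phones(ipa))
```

With this grouping, the phoneme string of Avignon becomes "a v i J O~" and the one of minuit becomes "m i n H i", with a space between all phones.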

At the moment, the major issue is the simon PLS import process.

More than 300.000 German words

Thursday, October 29th, 2009

Ralf's German dictionary (version 0.1.7) contains more than 300.000 German words. It is not necessary to extract the dictionary. You can import the downloaded PLS dictionary directly into simon (thanks for implementing the *.xml.bz2 feature).

The new words have been generated from an OpenOffice.org spelling dictionary (GPL license) using a stylesheet. This means that you can be sure that there is no copyright violation:

1. The source dictionary (spelling dictionary) is GPL licensed.
2. The XSLT stylesheet is GPL licensed.
3. The target dictionary (Ralf’s German dictionary) is GPL licensed.

And there is another advantage: There shouldn’t be too many spelling errors. I found a few minor mistakes; they will be fixed later.

I recommend this approach for other languages. Currently, I am preparing Ralf's French dictionary with more than 40.000 French words.

Create your own PLS dictionary

Tuesday, October 27th, 2009

This article is for people who want to use simon, but don’t have access to a pronunciation dictionary.

If you want to use simon, you need a pronunciation dictionary. Such dictionaries are available for some languages.

What if you don’t have a dictionary for your own language? You can create your own dictionary. Here are a few thoughts about the topic “dictionary creation”:

Today, I created about 100.000 words from an OpenOffice.org spelling dictionary with the style-sheet create-graphemes.xsl. The source spelling dictionary is licensed under the GPL. There is no licensing problem. Ralf’s German dictionary is GPL licensed, too. I “steal” words only from GPL sources.

I recommend OpenOffice.org spelling dictionaries if you want to create a PLS dictionary for your own language. Think about languages like Czech, Greek, or Finnish. Native speakers can use such an OpenOffice.org spelling dictionary for word creation. The word list can then be used with eSpeak for phoneme generation. eSpeak works with several languages. The word list and the phoneme list can be combined with paste. After that, the eSpeak phonemes can be converted into IPA with the style-sheet espeak2perfectipa.xsl. You can write your own style-sheet for your own language. I want to show you the concept of dictionary creation.

I created a word list of Ralf’s German dictionary using the style-sheet output-text.xsl. Then I used comm: $ comm -23 output-o 0.1.7-wordlist-o > compared
The result should be a word list with words that are not part of the current version of Ralf’s German dictionary (I am not sure whether this approach was successful). I will integrate these words into the dictionary.
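One thing to keep in mind: comm expects both input files to be sorted in the same way, otherwise its output is unreliable. A set difference gives the same kind of result without that requirement. Here is a minimal Python sketch; the function name and the sample words are hypothetical.

```python
def new_words(candidates, existing):
    """Return the sorted candidates that do not occur in existing.

    This corresponds to the lines that comm -23 prints for sorted input:
    lines that appear only in the first file.
    """
    known = set(existing)
    return sorted(set(candidates) - known)


# Usage with two small word lists:
missing = new_words(["Haus", "Baum", "Auto"], ["Baum"])
```

The comparison is case-sensitive and exact, so differences in encoding or trailing whitespace between the two lists still have to be cleaned up beforehand.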

Ralf’s German dictionary is a solution for the German language. Create a solution for your own language!

Ralf’s German dictionary (version 0.1.6)

Saturday, October 24th, 2009

You can download Ralf’s German dictionary (version 0.1.6). It has the following features:

Many lexemes now contain a role attribute, e.g.:

<lexeme role="Substantiv">
<grapheme>Simon</grapheme>
<phoneme>ziːmɔn</phoneme>
</lexeme>

I hope that a future version of simon makes use of the role attribute. Maybe I will transform Ralf's German dictionary into the Hadifix format. This would have the advantage that terminal information (= Category) would be available for many lexemes. Here is an example for a Verb:

<lexeme role="Verb">
<grapheme>anreden</grapheme>
<phoneme>ʔanʀeːdən</phoneme>
<phoneme>ʔanʀeːdn̩</phoneme>
</lexeme>

You can see that two pronunciations are available. You can use both pronunciations with simon. Internally, simon uses the following SAMPA notations:

a n R e: d @ n
a n R e: d n=

Here is another example:

<lexeme>
<grapheme>eintippen</grapheme>
<phoneme>ʔaɪ̯ntɪpən</phoneme>
<phoneme>ʔaɪ̯ntɪpm̩</phoneme>
</lexeme>

You can see that the second pronunciation ends with an m̩ (and not with an n̩). This means that you can speak intuitively.

I added more than 10.000 alternative pronunciations to the current version of the dictionary using this stylesheet.
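To illustrate the idea behind such alternative pronunciations, here is a small Python sketch. It does not reproduce the stylesheet. The rule is an assumption derived from the two examples above: a final ən is reduced to a syllabic nasal, which becomes m̩ after p or b and n̩ otherwise.

```python
SCHWA_N = "\u0259n"       # ən
SYLLABIC_N = "n\u0329"    # n with combining vertical line below
SYLLABIC_M = "m\u0329"    # m with combining vertical line below


def with_alternative(ipa):
    """Return [ipa], plus a reduced variant when ipa ends in ən."""
    if not ipa.endswith(SCHWA_N):
        return [ipa]
    stem = ipa[: -len(SCHWA_N)]
    # Assumed rule: after p or b the nasal assimilates to m, otherwise it stays n.
    nasal = SYLLABIC_M if stem.endswith(("p", "b")) else SYLLABIC_N
    return [ipa, stem + nasal]
```

Each returned string would go into its own phoneme element of the same lexeme, as in the anreden and eintippen examples.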

More than 80.000 German adjectives

Sunday, October 18th, 2009

I added more than 80.000 German adjectives to Ralf’s German dictionary. These adjectives are marked using the role attribute:

<lexeme role="Adjektiv">
<grapheme>schweizerische</grapheme>
<phoneme>ʃvaɪ̯tsəʀɪʃə</phoneme>
</lexeme>

This role attribute isn’t used by simon. Take a look at the Category of schweizerische. It is Unknown:

[Screenshot: schweizerische]

A future version of simon could display the value of the role attribute in the Category column. Of course, a lexeme can play several roles.

Ralf’s German dictionary (version 0.1.5) contains more than 220.000 German words. You can import the dictionary into simon. On my computer, I have to wait a few moments until the dictionary has been imported.

Edit: I just saw on the screen-shot that the pronunciation of schweizerischer is wrong. This is not simon’s fault. It is an error of Ralf's German dictionary. Such errors can occur quite often. I will try to fix them.

Extract terminal from role-attribute?

Saturday, October 17th, 2009

I want to add terminal information to the pronunciation dictionary. I just read that the PLS allows a role attribute inside the <lexeme> element. The result could look like this:

<lexeme role="Substantiv">
<grapheme>Behandlungsprogramm</grapheme>
<phoneme>bəhandlʊŋspʀɔgʀam</phoneme>
</lexeme>

A lexeme can play a lot of roles. Or should I stick to the Hadi-Bomp Terminale? Or use my own system?

I could make it complicated, and use a specific namespace:

xmlns:hadibomp="http://www.cyber-byte.at/wiki/index.php/Trainingstexte#Hadi-Bomp_Terminale"

Well, I am thinking about a solution. Maybe it would be useful if during the simon PLS import process the value of the role attribute would be used for the terminal column?
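How an importer could read the role attribute is easy to sketch. The following Python example is only an illustration of the idea and not part of simon; the function name is hypothetical, the fallback value Unknown mirrors the terminal that simon currently shows, and the lexicon is assumed to have no namespace declaration.

```python
import xml.etree.ElementTree as ET


def terminals(xml_text, default="Unknown"):
    """Map each grapheme to the role of its lexeme, or to default if there is none."""
    root = ET.fromstring(xml_text)
    result = {}
    for lexeme in root.iter("lexeme"):
        role = lexeme.get("role", default)
        for grapheme in lexeme.findall("grapheme"):
            result[grapheme.text] = role
    return result
```

A lexeme without a role attribute simply keeps the terminal Unknown, so dictionaries without role information would still import as before.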

Ralf’s Austrian German dictionary

Saturday, October 10th, 2009

Take a look at Ralf’s Austrian German dictionary (license: GPL). It contains more than 130 words which are not included in Ralf’s German dictionary (version 0.1.2).

The following usage is intended:
1. Download and import Ralf’s German dictionary. It contains Standard German, and is targeted at all people who speak the German language.
2. Download and import Ralf’s Austrian German dictionary if you want to use this dialect specific vocabulary.

I would like to create a dictionary for Swiss German. It would be nice if someone would point me to a GPL word list with Swiss German words.

Ralf’s German dictionary (version 0.1.2)

Friday, October 9th, 2009

You can download and import Ralf’s German PLS dictionary (version 0.1.2; October 9, 2009) with more than 100.000 words into simon.

These are the differences to version 0.1.1:
- increased the size of the dictionary (from about 40.000 words to more than 100.000);
- removed duplicate graphemes/phonemes;
- improved phoneme quality, e.g. the word Vogel is not transcribed as voːgəl any more. The phonemes are now corrected to foːgəl. So a lot of Vogel-words are now correctly transcribed: Donnervogel, Eisvogel, Feuervogel, Galgenvogel, Hornvogel, Hühnervogel, Jungvogel, and many more:

[Screenshot: feuervogel]

The dictionary is getting better and better. Of course, there is a lot of work to be done.

The following is interesting for people who are interested in dictionary development:

I am using this stylesheet to improve the phoneme quality. In the stylesheet, you can find the following line which replaces voːgəl by foːgəl:

<xsl:variable name="phoneme-vogel" select="replace($phoneme-vertrags, 'voːgəl', 'foːgəl')"/>

eSpeak produces phonemes that aren’t always correct. I correct recurring phoneme errors with this stylesheet. It is an efficient approach to dictionary development. It would be too complicated to correct each word separately. Thanks to XPath, I can adjust the required level of phoneme correction very well. So you can see that it is no big deal to improve Ralf’s German dictionary. XPath is the right language for PLS dictionary development.

The size of the dictionary increases. To handle the increased amount of information, it is necessary to navigate in the PLS dictionary. This can easily be done with an XSLT stylesheet. I want to develop an XSLT stylesheet that transforms the PLS dictionary into Sphinx format.

Import of ‘Luftnummer’

Wednesday, October 7th, 2009

I just saw that the simon PLS import process has been improved. The ɐ is not transcribed as @ r any more. This was a major issue of the simon PLS import process. It is great that this has been fixed.

Take a look at the screenshot:

[Screenshot: luftnummer]

The phonemes of Luftlöcher, Luftnummer, Luftverkehrs, Luftwiderstand, lustiger, Luther-Straße are now better than before. This should improve the recognition results.

Currently, I am not testing the recognition of simon. Instead, I am expanding my German PLS dictionary. The current version 0.1.1 contains many duplicate entries (duplicate graphemes and duplicate phonemes). These duplicates will be removed in the next version (with the XPath expression distinct-values(current-group()/phoneme)). And of course, I will add a few thousand words to the dictionary.
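The effect of grouping by grapheme and keeping only distinct phonemes can be sketched in a few lines of Python. This is an analogue of the XPath approach, not the stylesheet itself; the function name and the data layout (a list of grapheme/phoneme pairs) are assumptions.

```python
def merge_duplicates(entries):
    """Group (grapheme, phoneme) pairs by grapheme.

    Each phoneme is kept once per grapheme, in the order of first appearance.
    """
    merged = {}
    for grapheme, phoneme in entries:
        phonemes = merged.setdefault(grapheme, [])
        if phoneme not in phonemes:
            phonemes.append(phoneme)
    return merged
```

Every key of the result corresponds to one lexeme element, and every item in its list to one phoneme element, so genuine alternative pronunciations survive while exact duplicates disappear.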

Import 40.000 German words

Thursday, October 1st, 2009

You can import Ralf’s German dictionary (version 0.1.1) with more than 40.000 words into simon. Download the dictionary, then import it into simon:

[Screenshot: dictionary]

I hope that it works with simon. I didn’t use it for training. But it should work. It contains many known errors.

Standards of this dictionary: PLS, GPL, IPA, UTF-8.

If there are compatibility issues with simon, please report back.

It might be possible to convert this lexicon into a BOMP compatible format. I would need a way to get information about the terminals (= Wortarten). I don’t know at the moment how I could achieve this goal.

Ralf’s German dictionary

Saturday, September 12th, 2009

In this article, I will explain how to import Ralf’s German dictionary into simon, and you will read about some of the properties of this dictionary.

[Screenshot: universal]

1. Select Applications > Universal Access > simon.

[Screenshot: import-dictionary]

2. Press the Word list button.
3. Press Import Dictionary.

[Screenshot: shadow-dictionary]

4. You can select the target: shadow dictionary or active dictionary. What is the right choice? For dictionary development, I often choose active dictionary (so that I have a dictionary in HTK compatible format which I use in conjunction with sam). But let’s now choose the shadow dictionary as target.

5. Press the Next > button.

[Screenshot: hadifix-htk-pls]

6. You can choose between different lexicon types: Hadifix, HTK, PLS, and Sphinx. Select PLS.
7. Press the Next > button.

[Screenshot: save-page]

8. You are now here: http://script.blau.in/xml/german.xml
9. Save Page As... doesn’t work. I just tried that. If you choose this option, the page will be saved as an HTML file. You have to choose a different way.

[Screenshot: page-source]

10. Select View Page Source.

[Screenshot: lexeme-grapheme]

11. You can now see the source of the page http://script.blau.in/xml/german.xml.

12. The encoding of the page is UTF-8. This encoding ensures that even languages like Hebrew can be processed correctly. You can imagine that UTF-8 is a very good standard for all languages.

13. Let’s take a look at the address of the style sheet: http://script.blau.in/xml/ralf-german-dictionary.xsl. This style sheet document changes the appearance of Ralf’s German dictionary when you view it with Firefox.

14. The license is GPL. It would be great if someone would expand the German dictionary.

15. The dictionary has a specific tree structure using the elements lexicon, lexeme, grapheme, phoneme.

16. Select Save Page As....

[Screenshot: import]

17. Choose the location of Ralf’s German dictionary that you downloaded a few moments ago. On my computer, the XML file is located here: /home/liberty/200909/german.xml.

[Screenshot: finish]

18. Ralf’s German dictionary has been imported successfully.
19. Press the Finish button.

[Screenshot: maschine]

20. To take a look at the imported dictionary, select Include unused words from the shadow lexicon.
21. Drag and drop the word Maschine into the white area.

[Screenshot: train-selected]

22. You want to train the word Maschine.
23. Press the button to start with the training.

[Screenshot: add-maschine]

24. Currently, the word Maschine is just part of the shadow lexicon. It is not part of the active lexicon. Press Yes to add it to the active lexicon.

[Screenshot: sampa]

25. You want to define the pronunciation of the word Maschine. The pronunciation is being displayed in SAMPA.
26. I find the concept of terminals difficult; it is explained in the simon handbook. I am using the terminal Unknown.

OK, I am finishing here. If you want to know more about simon, please read the simon handbook (PDF).

You should now be able to import Ralf’s German dictionary into simon.

Confidence score with Hebrew

Thursday, September 10th, 2009

A few hours ago, I created a sample Hebrew PLS dictionary. It is very short, but it shows the concept.

[Screenshot: hebrew]

1. I imported the Hebrew PLS dictionary into simon.
2. I dragged each word to the right side for training.
3. The recorded Hebrew words are stored in the folder /home/liberty/.kde/share/apps/simon/model/training.data.
4. After starting ksimond (PDF), I pressed the Synchronize button.
5. Then I activated simon.
6. When dictating several words, simon is not sure which word is the right one. Is it מדפסת, or is it עִבְרִית? Unfortunately, I didn’t get any output in gedit or in Geany. Maybe it has something to do with the right-to-left encoding?

You can see that thanks to UTF-8 Hebrew shouldn’t be a big problem. I don’t know what went wrong with the missing output. But at least the Hebrew words are displayed correctly, so only the last step is missing.

If anyone is interested in building a Hebrew PLS dictionary, I propose taking a look at the Hebrew Voxforge prompts. I suggest that you add the words that are contained in the prompts to the dictionary. You can take my sample Hebrew dictionary and expand it. Later, you can use the Voxforge prompts for training (after you have gained some first experience with simon).

Import a small French PLS dictionary

Saturday, August 29th, 2009

I want to import a small French PLS dictionary into simon. I created this lexicon on my own, and it contains some errors. But that doesn’t matter because I just want to demonstrate the import process.

[Screenshot: french]

1. The lexicon has the name french-pronunciation-20090829.xml.
2. Encoding is UTF-8 for optimal compatibility.
3. License is GPL. This small prototype lexicon can be expanded without license problems.
4. Alphabet is IPA. simon is capable of importing lexicons that are stored in that format. Of course, it has to be tested with languages other than German. This is what I want to do right now with the French language.
5. The XML language tag marks the lexicon as French. The lexicon contains one word that isn’t in French but in English. I have marked that word with a special XML tag. Probably, it won’t work. But I want to see what will happen.
6. Click the Import Dictionary button.
7. The target will be the active dictionary. Maybe I will be able to use this dictionary in conjunction with sam? I will see.
8. And now it is time to press the Next > button.

As the type of dictionary, I of course select PLS lexicon. The path to the lexicon is as follows: /home/liberty/200908/french-pronunciation-20090829.xml. The lexicon has been imported, but there aren’t any words. What went wrong? I deleted the blank lines in the XML file and tried again, with the same result: after the import of the French PLS lexicon, I can see nothing.

I will now import the German PLS lexicon from the following location: /home/liberty/200908/voxDE20090209-modified.xml. It worked. Why was it possible to import the German PLS dictionary, but not the French PLS dictionary?

I just found out what was wrong: The first line of the XML file has to look like this:

<?xml version="1.0" encoding="UTF-8"?>

Look at the double quotes. They have to be the normal double quotes (U+0022). But they were rendered. Obviously, WordPress renders the double quotes (U+0022) into some other Unicode characters.
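Such a declaration can be repaired automatically before the import. The following Python sketch replaces typographic quotes in the first line of a file with the normal double quote (U+0022). The function name and the list of quote characters are assumptions; only the first line is touched, so quotes inside graphemes stay as they are.

```python
# Quote-like characters that blog software tends to substitute (assumed list):
# U+201C, U+201D, U+201E and U+2033.
TYPOGRAPHIC_QUOTES = "\u201c\u201d\u201e\u2033"


def fix_declaration(xml_text):
    """Replace typographic quotes in the first line with U+0022."""
    first, sep, rest = xml_text.partition("\n")
    for ch in TYPOGRAPHIC_QUOTES:
        first = first.replace(ch, '"')
    return first + sep + rest
```

Running this on the broken file content yields a first line that an XML parser accepts, and the rest of the lexicon is returned unchanged.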

And here is the result:

[Screenshot: french-imported]

You can see that it is possible to import a French PLS dictionary into simon. This is the strength of the system: the capability to import lexicons from different languages. Until now, I have imported the following lexicons:

English Voxforge lexicon
German PLS lexicon
French PLS lexicon

I think that somewhere at Voxforge.org there is a Spanish lexicon available. Spanish is very consistent when it comes to pronunciation. So making simon suitable for the Spanish language shouldn’t be a big problem. I just downloaded the Spanish lexicon (3.1 MB). I think that it is stored in HTK compatible format. Here are the first few lines of the Spanish lexicon:

a [a] a
aaronita [aaronita] a a r o n i t a
aarónico [aarónico] a a r o n i k o
aba [aba] a b a
ababa [ababa] a b a b a
ababillarse [ababillarse] a b a b i ll a r s e
ababol [ababol] a b a b o l
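Each line of this listing has three parts: the word, the output symbol in square brackets, and the phones separated by spaces. Here is a minimal Python sketch that splits such a line. It assumes that every line contains the bracketed part, as in the lines shown above; the function name is hypothetical.

```python
def parse_htk_line(line):
    """Split 'word [output] p1 p2 ...' into (word, output, list of phones)."""
    word, rest = line.split(None, 1)
    output, _, phones = rest.partition("]")
    return word, output.lstrip("["), phones.split()


# Usage with one of the lines from the listing:
entry = parse_htk_line("ababillarse [ababillarse] a b a b i ll a r s e")
```

Note that ll is a single phone in this lexicon, which is why the phones have to be split at spaces and not character by character.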

I don’t know how this lexicon was generated. I would like to import that Spanish dictionary right now. But first, I should delete the German lexicon (shadow lexicon) and the French lexicon (active lexicon) that are currently in my simon installation. But before I do that, I want to upload my French lexicon to my webspace. Here it is: lexicon (license: GPL).

I didn’t start ksimond, so the only folder I have to rename is the following folder: /home/liberty/.kde/share/apps/simon/model

I think that it worked. There are no words in the word list available anymore. Now I can import the Spanish dictionary. I will import it as HTK dictionary from the following path: /home/liberty/200908/voxforge_lexicon_spanish. And here is the result:

[Screenshot: spanish]

So simon should work with Spanish, too. I hope that there will be the possibility to switch between the different languages.