Posts Tagged ‘Spanish’

Ralf’s Spanish speech model

Thursday, May 17th, 2012

Some notes on how this speech model was created.

1. Get Ralf’s Spanish dictionary.
2. Create a new Spanish scenario.
3. Import Ralf’s Spanish dictionary as shadow dictionary.
4. Train ten words: ababas, activárselos, actuabais, desconté, domingo, envacas, incapacitando, nabla, superviene, vacilas
5. Add Unknown as grammar. Add Dictation plugin. Actions > Synchronize. Actions > Activate.
6. Dictate a few words:

ababas activárselos actuabais actuabais vacilas actuabais superviene ababas incapacitando

7. Export scenario. Export base model.
8. Download Ralf’s Spanish speech model.

Ralf’s Spanish (Venezuela) dictionary

Thursday, May 6th, 2010

How I create Ralf's Spanish (Venezuela) dictionary:

1. Get spelling dictionary. License is GPLv3.
2. The files dic_es_VE.dic and dic_es_VE.aff are UTF-8 encoded.

3. Generating word list:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-venezuela/dictionaries$ unmunch dic_es_VE.dic dic_es_VE.aff > spanish-venezuela

2.8 million words are too many. I am deleting the last 2 million words and will use only the first 800,000 words of the word list. The word list shouldn’t be too big, because otherwise performance becomes too slow.

4. Replacing "\n" by "</audio>\n<audio>" with Geany (Use escape sequences).
5. Adding <speak> tags to the file spanish-venezuela (<speak> at the beginning of the file; </speak> at the end of the file).

6. Generating Spanish phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-venezuela/dictionaries$ espeak -f spanish-venezuela -m -v es-la -q -x --phonout="spanish-venezuela-espeak"

7. Open the file spanish-venezuela-espeak with Geany. Add <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>) of spanish-venezuela-espeak. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) with the sequence "</phoneme>\n<phoneme>" (Use escape sequences).

8. The files spanish-venezuela-espeak and spanish-venezuela have the same number of lines. I can combine them:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-venezuela/dictionaries$ paste spanish-venezuela spanish-venezuela-espeak > spanish-venezuela-pls

9. Edit spanish-venezuela-pls with Geany. I have to do some search and replace operations to get a valid PLS dictionary.
10. Download Ralf's Spanish (Venezuela) dictionary, and import it into simon.
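
Steps 3 to 5 (trimming the word list and wrapping every word for eSpeak) could also be scripted instead of being done by hand in Geany. What follows is only a minimal Python sketch of that idea. The function name wordlist_to_ssml and the limit parameter are hypothetical and not part of any tool used above. The sketch assumes one word per line, as unmunch produces, and it puts <speak> and </speak> on lines of their own, so its line count differs from the hand-edited file.

```python
from xml.sax.saxutils import escape


def wordlist_to_ssml(words, limit=800_000):
    """Wrap the first `limit` words in <audio> elements inside one <speak> element."""
    lines = ["<speak>"]
    for word in words[:limit]:
        word = word.strip()
        if word:  # skip empty lines
            # escape() protects characters such as & and < inside a word
            lines.append("<audio>" + escape(word) + "</audio>")
    lines.append("</speak>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three of the words trained above, cut down to two for the demo.
    sample = ["ababas", "domingo", "vacilas"]
    print(wordlist_to_ssml(sample, limit=2), end="")
```

With a real word list, the words argument would be the lines read from the file spanish-venezuela.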

Ralf’s Spanish (Argentina) dictionary

Thursday, May 6th, 2010

Here is how I create Ralf's Spanish (Argentina) dictionary:

1. Get spelling dictionary, license is LGPL.
2. Encoding of es_AR.dic and es_AR.aff is ISO-8859-1.

3. Converting es_AR.dic and es_AR.aff to UTF-8 via Ubuntu terminal:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ iconv -f ISO8859-1 -t UTF-8 < es_AR.dic > spanish-argentina-utf8.dic

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ iconv -f ISO8859-1 -t UTF-8 < es_AR.aff > spanish-argentina-utf8.aff

4. Change first line of file spanish-argentina-utf8.aff from SET ISO8859-1 to SET UTF-8.

5. Generating word list:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ unmunch spanish-argentina-utf8.dic spanish-argentina-utf8.aff > spanish-argentina-utf8

The result is 950,000 Spanish (Argentina) words. This is a huge list, but I think that the size of this word list is tolerable.

6. Replacing "\n" with "</audio>\n<audio>" with Geany (selecting the option Use escape sequences). I don’t find escape sequences and regular expressions easy, but in this case Geany is the easiest way to convert the file.

7. Generating Spanish phonemes (eSpeak: “es-la Spanish – Latin America”):

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ espeak -f spanish-argentina-utf8 -m -v es-la -q -x --phonout="spanish-argentina-espeak"

Details of the espeak command:
-f [file] indicates the source file spanish-argentina-utf8
-m [mark-up] indicates that the source file is an SSML file (this means that each word is enclosed in an <audio> element)
-v [voice] indicates which language eSpeak has to use. The ISO 639-1 language tag for Spanish is es. eSpeak uses es-la for Spanish – Latin America, and es for Spanish. This means that I have to use the tag es-la.
-q [quiet] means that eSpeak doesn’t speak the words aloud.
-x means that eSpeak should output phonemes.
--phonout="spanish-argentina-espeak" indicates the output file.

8. Open the file spanish-argentina-espeak with Geany. Add <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>) of spanish-argentina-espeak. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) by the sequence "</phoneme>\n<phoneme>" (Use escape sequences).

9. The files spanish-argentina-utf8 and spanish-argentina-espeak have the same number of lines. I can combine these files:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ paste spanish-argentina-utf8 spanish-argentina-espeak > spanish-argentina-pls

10. Edit spanish-argentina-pls with Geany. I have to do some search and replace operations to get a valid PLS dictionary.

11. Download Ralf's Spanish (Argentina) dictionary, and import it into simon.
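
Steps 8 to 10 pair each word with its transcription and wrap the pair in PLS markup. A minimal Python sketch of that pairing follows, assuming the words and transcriptions are already available as two lists of equal length. build_pls is a hypothetical helper, and the xml:lang value "es" and the alphabet value "ipa" are assumptions. eSpeak's own phoneme mnemonics would first have to be converted to IPA for that label to be accurate. The search and replace operations done in Geany are not reproduced here.

```python
from xml.sax.saxutils import escape

# Namespace defined by the W3C Pronunciation Lexicon Specification 1.0
PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_pls(words, phonemes, lang="es", alphabet="ipa"):
    """Pair each word with its transcription and return a PLS lexicon string."""
    if len(words) != len(phonemes):
        # paste would silently misalign the files; here we stop instead
        raise ValueError("word list and phoneme list differ in length")
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<lexicon version="1.0" xmlns="%s" alphabet="%s" xml:lang="%s">'
        % (PLS_NAMESPACE, alphabet, lang),
    ]
    for word, phoneme in zip(words, phonemes):
        out.append("<lexeme>")
        out.append("<grapheme>" + escape(word.strip()) + "</grapheme>")
        out.append("<phoneme>" + escape(phoneme.strip()) + "</phoneme>")
        out.append("</lexeme>")
    out.append("</lexicon>")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(build_pls(["ababa"], ["aβaβa"]), end="")
```

The length check mirrors the remark in step 9: the pairing only makes sense when both files have the same number of lines.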

IPA to SAMPA: voiced bilabial fricative

Saturday, April 10th, 2010

The simon PLS import process of Ralf's German dictionary is OK (not perfect, but OK).

A few minutes ago, I imported Ralf's Spanish dictionary (version 0.1; November 06, 2009) into simon. The performance of the import process was OK. I had to wait about 30 seconds until the import of this huge dictionary with about 850,000 Spanish words was finished. Maybe it is my new computer. Or maybe simon now has better routines for big dictionaries?

So at the moment, I think that it is not necessary that I reduce the size of Ralf's Spanish dictionary.

What is necessary? The simon developers should improve the IPA to SAMPA conversion. There are IPA phonemes that need to be converted into SAMPA.

Phonemes that are needed by Ralf's Spanish dictionary:

1. Voiced bilabial fricative. IPA-number 127. IPA-text: β. X-SAMPA: B.

I have to stop here. Take a look into the simon shadow dictionary:

[Image: spanish-sampa-phonemes]

simon converted the β phoneme into correct SAMPA. Here is the corresponding entry from Ralf's Spanish dictionary:

<lexeme>
<grapheme>ababa</grapheme>
<phoneme>aβaβa</phoneme>
</lexeme>

But what is wrong? There are no spaces between the phonemes in aBaBa. So the β (IPA phoneme) has been converted correctly into B (SAMPA phoneme). But where are the spaces? The SAMPA transcription should look as follows: a B a B a

I don’t know why this error occurs. This is the reason why I am stopping here.
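
To make the expected result concrete, here is a minimal Python sketch of a conversion that does insert the spaces. It is not simon's code, and the function name ipa_to_spaced_sampa is hypothetical. The table lists only the usual X-SAMPA equivalents of a few IPA symbols, and the sketch assumes that every phoneme is exactly one Unicode character, which would not hold for stress marks or multi-character phonemes.

```python
# X-SAMPA equivalents of some IPA symbols; every other character
# is passed through unchanged (a, e, i, o, u and so on).
IPA_TO_XSAMPA = {
    "β": "B",    # voiced bilabial fricative
    "ð": "D",    # voiced dental fricative
    "θ": "T",    # voiceless dental fricative
    "ɣ": "G",    # voiced velar fricative
    "ɾ": "4",    # alveolar tap
    "ʎ": "L",    # palatal lateral approximant
    "ʝ": "j\\",  # voiced palatal fricative (X-SAMPA writes it as j\)
}


def ipa_to_spaced_sampa(ipa):
    """Convert one IPA character at a time and join the results with spaces."""
    return " ".join(IPA_TO_XSAMPA.get(symbol, symbol) for symbol in ipa)


if __name__ == "__main__":
    print(ipa_to_spaced_sampa("aβaβa"))  # prints: a B a B a
```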

PLS dictionary: 850.000 Spanish words

Friday, November 6th, 2009

Ralf's Spanish dictionary (version 0.1; GPLv3) contains about 850,000 words. You can import this PLS dictionary into simon. Some remarks on how I created this dictionary:

1. I downloaded a spelling dictionary.

2. Then I used this spelling dictionary to produce the content of the grapheme elements:

$ unmunch es_ES.dic es_ES.aff > spanish-wordlist

3. From the word list, I created the content of the phoneme elements:

$ espeak -f spanish-ssml -m -v es -q -x --phonout="spanish-phoneme"

4. I combined the grapheme with the phoneme elements:

$ paste spanish-ssml spanish-phoneme-o > spanish-pls

5. With this style-sheet, I transformed the eSpeak phonemes into IPA phonemes. I am not sure whether I transformed everything correctly. E.g., I am unsure whether the following conversion is correct:

replace($sierra, 'J\^', 'ʎ')
replace($sierra, 'J', 'ʝ')

It might be the other way around.

6. The following phonemes will probably cause problems when you import the dictionary into simon: β, ð, θ, ɣ, ɾ, ʎ, ʝ

7. On my computer, I had to wait 5 minutes for the import into simon to finish.

You can see that the import of Ralf's Spanish dictionary is possible. Unfortunately, training with this dictionary is currently almost impossible because of the phoneme issues.
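
One detail of step 5 is independent of which target symbol is right: the longer mnemonic J^ has to be handled before the shorter J, because otherwise the J inside J^ is replaced first and a stray ^ is left behind. A minimal Python sketch of a single-pass replacement that tries longer mnemonics first is shown here. The function name espeak_to_ipa is hypothetical, and the two table entries are copied from step 5 with the same uncertainty about whether they should be swapped.

```python
# Mapping copied from step 5. Whether the two targets should be
# swapped is an open question, so treat the values as placeholders.
ESPEAK_TO_IPA = {
    "J^": "ʎ",
    "J": "ʝ",
}


def espeak_to_ipa(text, table=ESPEAK_TO_IPA):
    """Replace eSpeak mnemonics in one pass, trying longer mnemonics first."""
    keys = sorted(table, key=len, reverse=True)
    result = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                result.append(table[key])
                i += len(key)
                break
        else:
            # no mnemonic starts at this position; keep the character
            result.append(text[i])
            i += 1
    return "".join(result)


if __name__ == "__main__":
    print(espeak_to_ipa("J^ J"))
```

Because the text is scanned only once, a symbol produced by one replacement can never be picked up again by a later one.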

Import a small French PLS dictionary

Saturday, August 29th, 2009

I want to import a small French PLS dictionary into simon. I created this lexicon on my own, and it contains some errors. But that doesn’t matter because I just want to demonstrate the import process.

[Image: french]

1. The lexicon has the name french-pronunciation-20090829.xml.
2. Encoding is UTF-8 for optimal compatibility.
3. License is GPL. This small prototype lexicon can be expanded without license problems.
4. Alphabet is IPA. simon is capable of importing lexicons that are stored in that format. Of course, it has to be tested with languages other than German. This is what I want to do right now with the French language.
5. The XML language tag marks the lexicon as French. The lexicon contains one word that isn’t in French but in English. I have marked that word with a special XML tag. Probably, it won’t work. But I want to see what will happen.
6. Click the Import Dictionary button.
7. The target will be the active dictionary. Maybe I will be able to use this dictionary in conjunction with sam? I will see.
8. And now it is time to press the Next > button.

As the type of dictionary I select, of course, PLS lexicon. The path to the lexicon is as follows: /home/liberty/200908/french-pronunciation-20090829.xml. The lexicon has been imported, but there aren’t any words. What went wrong? I will make some changes to the XML file (delete the blank lines). I tried again after deleting the blank lines in the XML file, but got the same result: after the import of the French PLS lexicon, I can see nothing.

I will now import the German PLS lexicon from the following location: /home/liberty/200908/voxDE20090209-modified.xml. It worked. Why was it possible to import the German PLS dictionary, but not the French PLS dictionary?

I just found out what was wrong: The first line of the XML file has to look like this:

<?xml version="1.0" encoding="UTF-8"?>

Look at the double quotes. They have to be plain double quotes (U+0022), but in my file they had been rendered as typographic quotes. Evidently, WordPress converts plain double quotes (U+0022) into other Unicode characters.

And here is the result:

[Image: french-imported]
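
The quote problem in the XML declaration can be repaired mechanically. Below is a minimal Python sketch, assuming that the damage is limited to the first line of the file and that the unwanted characters are U+201C, U+201D and U+2033. The function name fix_xml_declaration is hypothetical, and the list of characters is my assumption about what WordPress typically produces.

```python
# Characters that a blog engine may substitute for a plain double quote.
TYPOGRAPHIC_QUOTES = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2033": '"',  # double prime
}


def fix_xml_declaration(text):
    """Restore plain double quotes (U+0022) in the first line only."""
    first, sep, rest = text.partition("\n")
    for bad, good in TYPOGRAPHIC_QUOTES.items():
        first = first.replace(bad, good)
    # the rest of the file is returned untouched
    return first + sep + rest


if __name__ == "__main__":
    broken = "<?xml version=\u201d1.0\u2033 encoding=\u201dUTF-8\u2033?>\n<lexicon/>\n"
    print(fix_xml_declaration(broken), end="")
```

Only the first line is changed, so typographic quotes that legitimately occur inside the lexicon entries are kept.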

You can see that it is possible to import a French PLS dictionary into simon. This is the strength of the system: the capability to import lexicons from different languages. Until now, I have imported the following lexicons:

English Voxforge lexicon
German PLS lexicon
French PLS lexicon

I think that somewhere at Voxforge.org there is a Spanish lexicon available. Spanish is very consistent when it comes to pronunciation. So making simon suitable for the Spanish language shouldn’t be a big problem. I just downloaded the Spanish lexicon (3.1 MB). I think that it is stored in an HTK-compatible format. Here are the first few lines of the Spanish lexicon:

a [a] a
aaronita [aaronita] a a r o n i t a
aarónico [aarónico] a a r o n i k o
aba [aba] a b a
ababa [ababa] a b a b a
ababillarse [ababillarse] a b a b i ll a r s e
ababol [ababol] a b a b o l
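
Each of these lines has three parts: the word, the same word again in square brackets, and the phonemes separated by spaces. A minimal Python sketch that splits such a line is shown here. The function name parse_htk_line is hypothetical, the sketch assumes non-empty lines, and it only covers the layout visible in the excerpt above, not every feature of the HTK dictionary format.

```python
def parse_htk_line(line):
    """Split 'word [output] p1 p2 ...' into (word, output, [phonemes])."""
    fields = line.split()
    word = fields[0]
    if len(fields) > 1 and fields[1].startswith("[") and fields[1].endswith("]"):
        output = fields[1][1:-1]  # strip the square brackets
        phonemes = fields[2:]
    else:
        # assumption: without a bracketed field, the word itself is the output
        output = word
        phonemes = fields[1:]
    return word, output, phonemes


if __name__ == "__main__":
    print(parse_htk_line("ababillarse [ababillarse] a b a b i ll a r s e"))
```

Splitting on whitespace keeps ll together as a single phoneme, exactly as it appears in the lexicon.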

I don’t know how this lexicon was generated. I would like to import that Spanish dictionary right now. But first, I should delete the German lexicon (shadow lexicon) and the French lexicon (active lexicon) that are currently in my simon installation. But before I do that, I want to upload my French lexicon to my webspace. Here it is: lexicon (license: GPL).

I didn’t start ksimond, so the only folder I have to rename is the following folder: /home/liberty/.kde/share/apps/simon/model

I think that it worked. There are no longer any words in the word list. Now I can import the Spanish dictionary. I will import it as an HTK dictionary from the following path: /home/liberty/200908/voxforge_lexicon_spanish. And here is the result:

[Image: spanish]

So simon should work with Spanish, too. I hope that there will be the possibility to switch between the different languages.