Ralf’s Portuguese (European) speech model

Thursday, May 17th, 2012

A few notes on how this speech model was created.

1. Get Ralf's Portuguese (European) dictionary.
2. Create a Simon scenario with the name Portuguese.
3. Clear the shadow vocabulary.
4. Import Ralf's Portuguese (European) dictionary as shadow dictionary.
5. Add ten words to training. Simon asks:

Your vocabulary does not define all words used in this text. These words are missing:
comprar, comprazemos, comprometedor, feito, felino, fenomenal, incitativo, irresoluto, masculino, telescopia

Do you want to add them now?

Press the Yes button.

6. Add the word “Unknown” as grammar. Add the dictation plugin.
7. Actions > Synchronize. Actions > Activate. Dictate a few words:

comprar comprazemos comprometedor feito felino comprar irresoluto masculino telescopia

8. Export the Portuguese scenario. Export the Portuguese base model.
9. Download Ralf's Portuguese (European) speech model.

Tip: If you are from Brazil, check this out.

Ralf’s Portuguese (Brazilian) dictionary

Wednesday, May 5th, 2010

How I created Ralf's Portuguese (Brazilian) dictionary:

1. Get the spelling dictionary; its license is LGPL version 2.1.
2. The encoding of pt_BR.dic and pt_BR.aff is ISO-8859-1.

3. Converting pt_BR.dic to UTF-8 via Ubuntu terminal:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ iconv -f ISO8859-1 -t UTF-8 < pt_BR.dic > portuguese-brazilian-utf8.dic

4. Converting pt_BR.aff to UTF-8 via Ubuntu terminal:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ iconv -f ISO8859-1 -t UTF-8 < pt_BR.aff > portuguese-brazilian-utf8.aff

5. Change the first line of the file portuguese-brazilian-utf8.aff from SET ISO8859-1 to SET UTF-8.
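The header change in step 5 can also be done without opening an editor. The following is a minimal sketch, assuming the first line of the .aff file starts with SET ISO8859-1 as described above. The function name fix_aff_header is made up for this illustration and is not part of hunspell.

```shell
#!/bin/sh
# fix_aff_header: hypothetical helper, not part of hunspell.
# Reads an already converted .aff file on stdin and rewrites the
# encoding declaration on line 1 only; all other lines pass through.
fix_aff_header() {
    sed '1s/^SET ISO8859-1/SET UTF-8/'
}

# Example usage (file names as in the steps above):
#   iconv -f ISO8859-1 -t UTF-8 < pt_BR.aff | fix_aff_header > portuguese-brazilian-utf8.aff
#   head -n 1 portuguese-brazilian-utf8.aff
```

Restricting the substitution to line 1 keeps any later line that happens to contain the same text untouched.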

6. Because the unmunch command doesn’t work at the moment on my new Ubuntu 10.04 system, I have to issue the following command:

$ sudo apt-get hunspell-tools

This command fails, so hunspell-tools doesn’t get installed via the Ubuntu terminal. So I am trying to install hunspell via Synaptic Package Manager. But the unmunch command is still not available. What is wrong? Now I found my error: I left out the word install. I have to type:

$ sudo apt-get install hunspell-tools

As you can see, one small mistake is enough to make it fail.

7. Generating list with Portuguese (Brazilian) words:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ unmunch portuguese-brazilian-utf8.dic portuguese-brazilian-utf8.aff > portuguese-brazilian-wordlist

This word list is probably too big (1.3 GB), so I will only use portuguese-brazilian-utf8.dic as the source for my PLS dictionary.

8. Adding <lexicon> tags to the file portuguese-brazilian-utf8.dic (<lexicon> at the beginning of the file; </lexicon> at the end of the file).
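Step 8 can be sketched on the command line as well. This is only an illustration: the function name wrap_lexicon and the output file name in the usage comment are assumptions, not something from the original workflow, which edits the file in place.

```shell
#!/bin/sh
# wrap_lexicon: hypothetical helper that copies stdin to stdout,
# adding <lexicon> before the first line and </lexicon> after the last.
wrap_lexicon() {
    printf '%s\n' '<lexicon>'
    cat
    printf '%s\n' '</lexicon>'
}

# Example usage; the output file name is only an illustration:
#   wrap_lexicon < portuguese-brazilian-utf8.dic > portuguese-brazilian-wrapped.dic
```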

9. Generating XML file:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ saxonb-xslt -s:portuguese-brazilian-utf8.dic -xsl:'' -o:portuguese-brazilian-speak-audio

10. A lot of <grapheme> elements contain a slash followed by additional information. This information including the slash has to be removed. I am doing this with the following command:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ saxonb-xslt -s:portuguese-brazilian-speak-audio -xsl:'' -o:portuguese-brazilian-improved
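The stylesheet used in step 10 is not shown here. As a rough alternative, the same cleanup can be sketched with sed instead of XSLT. This assumes that each <grapheme> element sits on one line and looks like <grapheme>casa/BD</grapheme>; the function name strip_flags is hypothetical.

```shell
#!/bin/sh
# strip_flags: hypothetical sed-based alternative to the XSLT step.
# Assumes one <grapheme> element per line, e.g. <grapheme>casa/BD</grapheme>.
# Removes the first slash inside the element and everything after it,
# up to the closing tag. Elements without a slash are left unchanged.
strip_flags() {
    sed 's|<grapheme>\([^/<]*\)/[^<]*</grapheme>|<grapheme>\1</grapheme>|'
}

# Example usage (file names as in step 10):
#   strip_flags < portuguese-brazilian-speak-audio > portuguese-brazilian-improved
```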

11. Preparing SSML document for eSpeak:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ saxonb-xslt -s:portuguese-brazilian-improved -xsl:'' -o:portuguese-brazilian-ssml

12. Generating Portuguese (Brazilian) eSpeak phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ espeak -f portuguese-brazilian-ssml -m -v pt -q -x --phonout="portuguese-brazilian-espeak"

Let me explain the details of the espeak command:
-f [file] indicates the source file portuguese-brazilian-ssml
-m [mark-up] indicates that the source file is an SSML file (this means that each word is enclosed in an <audio> element)
-v [voice] indicates which language eSpeak has to use. The ISO 639-1 language code for Portuguese is pt. eSpeak uses pt for Brazilian Portuguese, and pt-pt for European Portuguese. This means that I have to use pt.
-q [quiet] means that eSpeak doesn’t speak the words aloud.
-x means that eSpeak should output phonemes.
--phonout="portuguese-brazilian-espeak" indicates the output file.

Now you know how you can use eSpeak for phoneme generation.

13. Opening the file portuguese-brazilian-espeak with gedit.
a. Adding <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>) of portuguese-brazilian-espeak with gedit.
b. I search for the sequence "\n\n " (backslash-n-backslash-n-spacebar) and replace it with the expression "</phoneme>\n<phoneme>" by pressing Replace All. In gedit this takes several minutes, which is pretty slow for a simple search and replace, so I am doing this operation with Geany instead.

14. I have to compare the number of lines of the files portuguese-brazilian-espeak (contains the <phoneme> elements) and portuguese-brazilian-ssml (contains the <grapheme> elements). Both files have the same number of lines (254643 lines). This means that I can issue the paste command:

ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ paste portuguese-brazilian-ssml portuguese-brazilian-espeak > portuguese-pls
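The check in step 14 matters because paste joins the files line by line without complaining when one is shorter, which would silently pair words with the wrong phonemes. A minimal sketch that combines the check and the paste follows; the function name paste_if_same_length is hypothetical.

```shell
#!/bin/sh
# paste_if_same_length: hypothetical helper. Joins two files line by line
# with paste, but only when both have the same number of lines.
paste_if_same_length() {
    left_lines=$(wc -l < "$1")
    right_lines=$(wc -l < "$2")
    if [ "$left_lines" -ne "$right_lines" ]; then
        echo "line counts differ: $left_lines vs $right_lines" >&2
        return 1
    fi
    paste "$1" "$2"
}

# Example usage (file names as in step 14):
#   paste_if_same_length portuguese-brazilian-ssml portuguese-brazilian-espeak > portuguese-pls
```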

15. I have to edit the file portuguese-pls with gedit. At the moment it isn’t a valid XML file. I am replacing every "&" with "&amp;", because a bare "&" is not allowed in XML text and "&amp;" is the predefined entity for it.
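The ampersand replacement from step 15 can be sketched with sed. The function name escape_ampersands is made up for this illustration. It assumes the file contains no entities yet: running it on a file that already contains "&amp;" would escape the ampersand a second time.

```shell
#!/bin/sh
# escape_ampersands: hypothetical helper. Replaces every & with &amp;.
# In a sed replacement a plain & means "the matched text", so the
# literal ampersand has to be written as \&.
escape_ampersands() {
    sed 's/&/\&amp;/g'
}

# Example usage; the output file name is only an illustration:
#   escape_ampersands < portuguese-pls > portuguese-pls-escaped
```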

16. Download Ralf's Portuguese (Brazilian) dictionary, and import it into simon.

17. If someone shows interest in this dictionary, I can help with the conversion of the eSpeak phonemes into IPA phonemes.

Tutorial: how to install under Ubuntu

Friday, January 8th, 2010

This tutorial explains how to install simon under Ubuntu, and how to import Ralf’s Portuguese dictionary.

1. Download simon.
2. Double-click on simon-0.2-Linux_i386.deb:


3. Press Install Package:


4. Enter the password that you chose during your Ubuntu installation:


Press the OK button.

5. The installation has finished. The package simon-0.2-Linux_i386.deb has been installed:


Press the Close button.

6. Select Applications > Universal Access > simon:


7. Take a look at the simon main window:


Press the Wordlist button.

8. The Wordlist tab has opened:


Press the Import Dictionary button.

9. You have to select the type of the dictionary:


Choose PLS Lexicon, then press the Next button.

Note for simon development team: it would be nice if simon now offered a list of the 27 PLS dictionaries that are available.

10. You can now import one of my 27 PLS dictionaries. In the sidebar of testing simon, you can find a PLS dictionary that you can import:


Right-click on Ralf’s Portuguese dictionary, then Save Link As....

11. The dictionary with the name portuguese-dictionary.xml.bz2 will be saved:


It will be saved in the Downloads folder. Press the Save button.

12. It is time to import Ralf's Portuguese dictionary that you have just downloaded:


Please press the File button to point simon to the downloaded PLS dictionary.

13. Select the Downloads folder:


14. Select portuguese-dictionary.xml.bz2:


On my computer, I didn’t have to press the OK button.

15. simon displays the path to the PLS dictionary:


Note for simon development team: it is pretty complicated to first download, and then import the dictionary. It would be nice if the wizard offered automatic download directly from the internet.

My guess is: the average user begins to lose interest in simon at this point of the installation, having already invested about 20 minutes of precious time. It is getting annoying. Don’t annoy the user! Offer automatic PLS dictionary import directly from the internet.

The automatic BOMP import is a great thing. But not everybody is a native German speaker. At the moment, I am offering 27 different languages. An automatic import would make simon much more interesting for a lot of people. E.g. Portuguese is spoken by 200 million native speakers. Recently, someone showed interest in Portuguese at Voxforge. You can imagine that most people have no idea where to start or what is necessary. Helping people from a lot of different countries would be so easy by adding an automatic import function to the wizard.

Press the Next button.

16. simon is now processing the lexicon:


What does that mean? It means that simon converts the dictionary from PLS format into HTK format. This process works fine for Ralf's German dictionary. The process is not yet optimized for the other PLS dictionaries. If you are a native speaker of Portuguese (European), you can edit Ralf's Portuguese dictionary with a simple text editor.

17. Ralf's Portuguese dictionary has been imported:


Press the Finish button.

18. The dictionary is now available:


a. Select Include unused words from the shadow lexicon.
b. Use the scroll bar to get an impression of the Portuguese dictionary.
c. First column: word. Second column: corresponding pronunciation.

19. Let’s finish here. Now you know how to install simon under Ubuntu, and how to import Ralf's Portuguese dictionary into simon.

There are more steps necessary to make it work:
- install HTK;
- record a few training samples;
- define a grammar;
- start ksimond (PDF).

Take a look into the simon handbook to find out more about simon.

Import 40,000 Portuguese words

Sunday, November 8th, 2009

You can import Ralf's Portuguese (European) dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is currently not recommended.

To create this dictionary, I downloaded a spelling dictionary, then generated the phonemes.

This dictionary contains information about the primary stress. This information will be automatically removed when importing the dictionary into simon. From my point of view, we don’t need stress information for ASR.
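To get an idea of what the pronunciations look like without stress information, the mark can be stripped by hand. This is only a sketch and not how simon does it internally. It assumes that primary stress is written with the IPA mark ˈ (U+02C8) and that the file is UTF-8; the function name strip_stress is hypothetical.

```shell
#!/bin/sh
# strip_stress: hypothetical helper. Deletes the IPA primary stress
# mark (U+02C8) from every line and leaves everything else unchanged.
# Assumption: the dictionary marks primary stress with this character.
strip_stress() {
    sed 's/ˈ//g'
}

# Example usage on the unpacked dictionary:
#   bzcat portuguese-dictionary.xml.bz2 | strip_stress | less
```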

The issues of this dictionary are similar to the Catalan and Spanish dictionaries. Never mind, at least now I am offering you a PLS dictionary for the Portuguese language.

Language tag dependent import

Maybe it would be interesting to think about language-specific dictionary import. E.g., Ralf's Portuguese (European) dictionary contains the following information in the lexicon element: xml:lang="pt-pt"

This means European Portuguese. (I am not totally sure whether this language code is correct or not. But if I am wrong, this can be corrected later.) My other dictionaries have the following language tags:

Ralf's French dictionary: xml:lang="fr" (I think that the current version has the wrong language tag; I will check that later.)
Ralf's Spanish dictionary: xml:lang="es"
Ralf's German dictionary: xml:lang="de"
Ralf's Austrian German dictionary: xml:lang="de-AT"
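A language-dependent import would first have to read the tag from the lexicon element. A minimal sketch with sed follows, assuming the opening <lexicon> tag and its xml:lang attribute are on a single line; the function name lexicon_lang is made up for this illustration.

```shell
#!/bin/sh
# lexicon_lang: hypothetical helper. Prints the value of the xml:lang
# attribute of the first <lexicon> start tag found on stdin.
# Assumes the start tag and the attribute are on the same line.
lexicon_lang() {
    sed -n 's/.*<lexicon[^>]*xml:lang="\([^"]*\)".*/\1/p' | head -n 1
}

# Example usage on a compressed dictionary:
#   bzcat portuguese-dictionary.xml.bz2 | lexicon_lang
```

A real importer would use an XML parser instead of a regular expression; the sketch only shows where the information lives.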

Maybe a future version of simon could transform the IPA phonemes into SAMPA using language-specific conversion rules.