How I created Ralf's Portuguese (Brazilian) dictionary:
1. Get the spelling dictionary; its license is LGPL version 2.1.
2. Encoding of pt_BR.dic and pt_BR.aff is ISO-8859-1.
3. Converting pt_BR.dic to UTF-8 via Ubuntu terminal:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ iconv -f ISO8859-1 -t UTF-8 < pt_BR.dic > portuguese-brazilian-utf8.dic
4. Converting pt_BR.aff to UTF-8 via Ubuntu terminal:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ iconv -f ISO8859-1 -t UTF-8 < pt_BR.aff > portuguese-brazilian-utf8.aff
5. Change first line of file portuguese-brazilian-utf8.aff from SET ISO8859-1 to SET UTF-8.
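Steps 3 to 5 can also be scripted. The following is only a minimal sketch: it builds a tiny stand-in .aff file (the sample content and the temporary directory are assumptions, not the real pt_BR.aff), converts it with the same iconv call, and rewrites the SET line with sed instead of a text editor.

```shell
#!/bin/sh
# Sketch: convert a Hunspell .aff file from ISO-8859-1 to UTF-8 and fix its SET line.
# The sample file is a hypothetical stand-in for pt_BR.aff.
workdir=$(mktemp -d)

# Two lines in ISO-8859-1; \347 and \343 are the octal bytes for "ç" and "ã".
printf 'SET ISO8859-1\nTRY a\347\343o\n' > "$workdir/pt_BR.aff"

# Same conversion as in steps 3 and 4.
iconv -f ISO8859-1 -t UTF-8 < "$workdir/pt_BR.aff" > "$workdir/converted.aff"

# Step 5: rewrite the encoding declaration on the first line only.
sed '1s/^SET ISO8859-1/SET UTF-8/' "$workdir/converted.aff" \
    > "$workdir/portuguese-brazilian-utf8.aff"

head -n 1 "$workdir/portuguese-brazilian-utf8.aff"
```

The sed expression is limited to line 1, so a later line that happens to contain the same text would stay untouched.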
6. The unmunch command isn’t available on my new Ubuntu 10.04 system, so I have to install the hunspell-tools package. My first attempt was:
$ sudo apt-get hunspell-tools
This fails, and installing hunspell via the Synaptic Package Manager doesn’t make the unmunch command available either. Then I found my error: the install keyword was missing. The correct command is:
$ sudo apt-get install hunspell-tools
As you can see, even a small mistake is enough to make a command fail.
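Before running the later steps, it can help to check whether a tool is actually on the PATH. This is a small sketch; need_tool is a hypothetical helper name, and the package hint is simply passed in as the second argument.

```shell
#!/bin/sh
# Sketch: report whether a command is on the PATH before relying on it.
# need_tool is a hypothetical helper; $1 is the command, $2 the package to suggest.
need_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing (try: sudo apt-get install $2)"
        return 1
    fi
}

# Check for unmunch; "|| true" keeps the script going if it is missing.
need_tool unmunch hunspell-tools || true
```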
7. Generating list with Portuguese (Brazilian) words:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ unmunch portuguese-brazilian-utf8.dic portuguese-brazilian-utf8.aff > portuguese-brazilian-wordlist
This word list is probably too big (1.3 GB), so I will only use portuguese-brazilian-utf8.dic as the source for my PLS dictionary.
8. Adding <lexicon> tags to the file portuguese-brazilian-utf8.dic (<lexicon> at the beginning of the file; </lexicon> at the end of the file).
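The tags can also be added from the terminal instead of an editor, which is quicker for a large file. A minimal sketch, assuming a small sample file (the three lines and the /B flags are invented for illustration):

```shell
#!/bin/sh
# Sketch: wrap a word list in <lexicon> tags without opening an editor.
workdir=$(mktemp -d)

# Hypothetical three-line stand-in for portuguese-brazilian-utf8.dic.
printf '2\ncasa/B\nlivro/B\n' > "$workdir/sample.dic"

# Group the three commands so that their combined output goes to one file.
{
    echo '<lexicon>'
    cat "$workdir/sample.dic"
    echo '</lexicon>'
} > "$workdir/sample-wrapped.dic"

cat "$workdir/sample-wrapped.dic"
```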
9. Generating XML file:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ saxonb-xslt -s:portuguese-brazilian-utf8.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:portuguese-brazilian-speak-audio
10. Many <grapheme> elements contain a slash followed by additional information. This information, including the slash, has to be removed. I do this with the following command:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ saxonb-xslt -s:portuguese-brazilian-speak-audio -xsl:'http://spirit.blau.in/simon/files/2010/04/improve-estonian-dictionary.xsl' -o:portuguese-brazilian-improved
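To illustrate what this clean-up does, here is a rough sed equivalent. It is not the XSLT stylesheet used above, and it assumes one <grapheme> element per line; the sample words and flags are invented.

```shell
#!/bin/sh
# Sketch: drop the slash and everything after it inside each <grapheme> element.
# Assumes one element per line; this is an illustration, not the stylesheet above.
workdir=$(mktemp -d)

printf '<grapheme>casa/B</grapheme>\n<grapheme>livro</grapheme>\n<grapheme>abacate/BR</grapheme>\n' \
    > "$workdir/graphemes"

# Match "/" plus any non-"<" characters directly before the closing tag.
sed 's#/[^<]*</grapheme>#</grapheme>#' "$workdir/graphemes" > "$workdir/graphemes-clean"

cat "$workdir/graphemes-clean"
```

Lines without a slash, such as the second sample line, pass through unchanged.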
11. Preparing SSML document for eSpeak:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ saxonb-xslt -s:portuguese-brazilian-improved -xsl:'http://spirit.blau.in/simon/files/2010/04/prepare-hungarian-for-espeak.xsl' -o:portuguese-brazilian-ssml
12. Generating Portuguese (Brazil) eSpeak phonemes:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ espeak -f portuguese-brazilian-ssml -m -v pt -q -x --phonout="portuguese-brazilian-espeak"
Let me explain the details of the espeak command:
-f [file] indicates the source file portuguese-brazilian-ssml
-m [mark-up] indicates that the source file is an SSML file (this means that each word is enclosed in an <audio> element)
-v [voice] indicates which language eSpeak has to use. The ISO 639-1 language tag for Portuguese is pt. eSpeak uses pt for Brazilian Portuguese and pt-pt for European Portuguese, so I have to use the tag pt.
-q [quiet] means that eSpeak doesn’t speak the words aloud.
-x means that eSpeak should output phonemes.
--phonout="portuguese-brazilian-espeak" indicates the output file.
Now you know how you can use eSpeak for phoneme generation.
13. Opening the file portuguese-brazilian-espeak with gedit.
a. Adding <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>) of portuguese-brazilian-espeak with gedit.
b. I search for the sequence "\n\n " (backslash-n, backslash-n, space) and replace it with "</phoneme>\n<phoneme>" by pressing Replace All. With gedit this takes several minutes, which is too slow, so I do this operation with Geany instead.
14. I have to compare the number of lines in the files portuguese-brazilian-espeak (contains the <phoneme> elements) and portuguese-brazilian-ssml (contains the <grapheme> elements). Both files have the same number of lines (254643 lines), so the lines correspond one to one and I can issue the paste command:
ubuntu@ubuntu-desktop:~/Documents/201005/portuguese-brazilian-dictionary$ paste portuguese-brazilian-ssml portuguese-brazilian-espeak > portuguese-pls
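The line-count check and the merge can be combined so that paste only runs when the counts agree. A minimal sketch with two invented two-line sample files; the file names are placeholders:

```shell
#!/bin/sh
# Sketch: merge two files line by line, but only if they have the same number of lines.
workdir=$(mktemp -d)

# Hypothetical sample data standing in for the grapheme and phoneme files.
printf '<grapheme>casa</grapheme>\n<grapheme>livro</grapheme>\n' > "$workdir/graphemes"
printf '<phoneme>kaza</phoneme>\n<phoneme>livrU</phoneme>\n' > "$workdir/phonemes"

g=$(wc -l < "$workdir/graphemes")
p=$(wc -l < "$workdir/phonemes")

if [ "$g" -eq "$p" ]; then
    # paste joins line N of each file, separated by a tab.
    paste "$workdir/graphemes" "$workdir/phonemes" > "$workdir/merged"
    cat "$workdir/merged"
else
    echo "line counts differ: $g vs $p" >&2
fi
```

If the counts differ, pairing the lines would silently attach phonemes to the wrong words, so the sketch refuses to merge in that case.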
15. I have to edit the file portuguese-pls with gedit, because at the moment it isn’t a valid XML file: each "&" has to be replaced by "&amp;", because a bare "&" is reserved in XML and "&amp;" is the predefined entity for it.
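The ampersand replacement can also be done with sed. This is a sketch on invented sample data; it assumes the file contains only bare ampersands, since running it on text that already contains "&amp;" would escape it a second time.

```shell
#!/bin/sh
# Sketch: escape bare ampersands so the file becomes well-formed XML.
# Assumes no entity references exist yet; the sample phoneme string is hypothetical.
workdir=$(mktemp -d)

printf '<phoneme>k&z&</phoneme>\n' > "$workdir/sample-pls"

# In a sed replacement, a plain & means "the matched text", so the literal one is written \&.
sed 's/&/\&amp;/g' "$workdir/sample-pls" > "$workdir/sample-pls-escaped"

cat "$workdir/sample-pls-escaped"
```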
16. Download Ralf's Portuguese (Brazilian) dictionary, and import it into simon.
17. If someone shows interest in this dictionary, I can help with the conversion of the eSpeak phonemes into IPA phonemes.
[Edit May 06, 2010: fixed a minor mistake in this article]