Posts Tagged ‘node18’

Ralf’s Valencian dictionary

Tuesday, May 11th, 2010

How I create Ralf's Valencian dictionary:

1. Get spelling dictionary. License is GPL.

2. The encoding of valencian.dic and valencian.aff is ISO-8859-1.

3. Maybe I will use the Catalan language code (and the Catalan voice of eSpeak for the grapheme-to-phoneme conversion).

4. Convert valencian.dic to UTF-8:

ubuntu@ubuntu-desktop:~/Documents/201005/valencian-dictionary$ iconv -f ISO8859-1 -t UTF-8 < valencian.dic > valencian-utf8.dic

5. Convert valencian.aff to UTF-8:

ubuntu@ubuntu-desktop:~/Documents/201005/valencian-dictionary$ iconv -f ISO8859-1 -t UTF-8 < valencian.aff > valencian-utf8.aff

6. Change the line in valencian-utf8.aff that contains SET ISO8859-1 to SET UTF-8.

7. Generate Valencian word list:

ubuntu@ubuntu-desktop:~/Documents/201005/valencian-dictionary$ unmunch valencian-utf8.dic valencian-utf8.aff > valencian

The word list contains too many words: 3 million is too much. Many of them contain a hyphen, so I could filter those out. Alternatively, I could use valencian-utf8.dic as the source.

8. Add <lexicon> at the beginning of the file valencian-utf8.dic and </lexicon> at the end of the file.

9. Generate XML file with <grapheme> elements:

ubuntu@ubuntu-desktop:~/Documents/201005/valencian-dictionary$ saxonb-xslt -s:valencian-utf8.dic -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:valencian.xml

10. Generate a first draft with <phoneme> elements:

ubuntu@ubuntu-desktop:~/Documents/201005/valencian-dictionary$ saxonb-xslt -s:valencian.xml -xsl:'http://spirit.blau.in/simon/files/2010/04/improve-estonian-dictionary.xsl' -o:valencian-pls.xml

11. Import Ralf's Valencian dictionary into simon.

12. Ralf's Valencian dictionary is just a first draft. If someone shows interest, I could generate the phonemes with eSpeak (voice: ca – Catalan).
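Steps 4 to 6, and the hyphen problem mentioned after step 7, could also be scripted. The following is only a minimal sketch, not the procedure used above: the function names to_utf8 and drop_hyphenated are made up for illustration, and it assumes that the encoding line in the .aff file reads exactly SET ISO8859-1.

```shell
#!/bin/sh
# Sketch only: to_utf8 and drop_hyphenated are hypothetical helper names.

# Read ISO-8859-1 text on stdin, write UTF-8 to stdout.
# If a line starts with "SET ISO8859-1" (assumed spelling of the
# encoding declaration in the .aff file), rewrite it to "SET UTF-8".
to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 | sed 's/^SET ISO8859-1/SET UTF-8/'
}

# Read one word per line on stdin and drop every word containing a hyphen.
drop_hyphenated() {
    grep -v -e '-'
}
```

With these helpers, `to_utf8 < valencian.aff > valencian-utf8.aff` would cover steps 5 and 6 in one go, and `unmunch valencian-utf8.dic valencian-utf8.aff | drop_hyphenated > valencian` would leave out the hyphenated words. Whether the remaining list is small enough is a separate question.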

Ralf’s Spanish (Venezuela) dictionary

Thursday, May 6th, 2010

How I create Ralf's Spanish (Venezuela) dictionary:

1. Get spelling dictionary. License is GPLv3.
2. The files dic_es_VE.dic and dic_es_VE.aff are UTF-8 encoded.

3. Generating word list:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-venezuela/dictionaries$ unmunch dic_es_VE.dic dic_es_VE.aff > spanish-venezuela

2.8 million words is too much. I am deleting the last 2 million words and will only use the first 800,000 words of the word list. The word list shouldn’t be too big, because otherwise the performance gets too slow.

4. Replacing "\n" by "</audio>\n<audio>" with Geany (Use escape sequences).
5. Adding <speak> tags to the file spanish-venezuela (<speak> at the beginning of the file; </speak> at the end of the file).

6. Generating Spanish phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-venezuela/dictionaries$ espeak -f spanish-venezuela -m -v es-la -q -x --phonout="spanish-venezuela-espeak"

7. Open the file spanish-venezuela-espeak with Geany. Add <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>) of spanish-venezuela-espeak. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) by the sequence "</phoneme>\n<phoneme>" (Use escape sequences).

8. The files spanish-venezuela-espeak and spanish-venezuela have the same number of lines. I can combine them:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-venezuela/dictionaries$ paste spanish-venezuela spanish-venezuela-espeak > spanish-venezuela-pls

9. Edit spanish-venezuela-pls with Geany. I have to do some search and replace operations to get a valid PLS dictionary.
10. Download Ralf's Spanish (Venezuela) dictionary, and import it into simon.
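Cutting the word list and wrapping the words in SSML (steps 3 to 5) can also be done without an editor. This is a sketch under assumptions, not a copy of the Geany workflow: wrap_ssml is a hypothetical helper name, each word gets its own complete <audio> element on its own line, and the words are assumed not to contain characters such as & or < that would need XML escaping.

```shell
#!/bin/sh
# Sketch only: wrap_ssml is a hypothetical helper name.
# Usage: wrap_ssml LIMIT < wordlist > ssml-file
# Keeps the first LIMIT lines of stdin and wraps each word in an
# <audio> element; the whole list is enclosed in <speak> tags.
wrap_ssml() {
    limit=$1
    echo '<speak>'
    head -n "$limit" | awk '{ print "<audio>" $0 "</audio>" }'
    echo '</speak>'
}
```

For the list above, the call would look like `unmunch dic_es_VE.dic dic_es_VE.aff | wrap_ssml 800000 > spanish-venezuela`. The <speak> tags sit on their own lines here, so the line count differs by two from the plain word list; that matters if the file is later combined line by line with another file.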

Ralf’s Spanish (Argentina) dictionary

Thursday, May 6th, 2010

Here is how I create Ralf's Spanish (Argentina) dictionary:

1. Get spelling dictionary, license is LGPL.
2. Encoding of es_AR.dic and es_AR.aff is ISO-8859-1.

3a. Converting es_AR.dic to UTF-8 via Ubuntu terminal:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ iconv -f ISO8859-1 -t UTF-8 < es_AR.dic > spanish-argentina-utf8.dic

3b. Converting es_AR.aff to UTF-8 via Ubuntu terminal:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ iconv -f ISO8859-1 -t UTF-8 < es_AR.aff > spanish-argentina-utf8.aff

4. Change first line of file spanish-argentina-utf8.aff from SET ISO8859-1 to SET UTF-8.

5. Generating word list:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ unmunch spanish-argentina-utf8.dic spanish-argentina-utf8.aff > spanish-argentina-utf8

The result is 950,000 Spanish (Argentina) words. This is a huge list, but I think that the size of this word list is tolerable.

6. Replacing "\n" by "</audio>\n<audio>" with Geany (selecting the option Use escape sequences). I don’t find escape sequences and regular expressions easy, but in this case Geany is the easiest way to convert the file.

7. Generating Spanish phonemes (eSpeak: “es-la Spanish – Latin America”):

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ espeak -f spanish-argentina-utf8 -m -v es-la -q -x --phonout="spanish-argentina-espeak"

Details of the espeak command:
-f [file] indicates the source file spanish-argentina-utf8
-m [mark-up] indicates that the source file is an SSML file (this means that each word is enclosed in an <audio> element)
-v [voice] indicates which language eSpeak has to use. The ISO 639-1 language tag for Spanish is es. eSpeak uses es-la for Spanish – Latin America, and es for Spanish. This means that I have to use the tag es-la.
-q [quiet] means that eSpeak doesn’t speak the words aloud.
-x means that eSpeak should output phonemes.
--phonout="spanish-argentina-espeak" indicates the output file.

8. Open the file spanish-argentina-espeak with Geany. Add <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>) of spanish-argentina-espeak. Replace the sequence "\n\n " (backslash-n-backslash-n-spacebar) by the sequence "</phoneme>\n<phoneme>" (Use escape sequences).

9. The files spanish-argentina-utf8 and spanish-argentina-espeak have the same number of lines. I can combine these files:

ubuntu@ubuntu-desktop:~/Documents/201005/spanish-argentina$ paste spanish-argentina-utf8 spanish-argentina-espeak > spanish-argentina-pls

10. Edit spanish-argentina-pls with Geany. I have to do some search and replace operations to get a valid PLS dictionary.

11. Download Ralf's Spanish (Argentina) dictionary, and import it into simon.
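The combination in steps 9 and 10 only works if line N of the phoneme file really belongs to line N of the word file. Here is a minimal sketch of that idea, assuming two plain files with one entry per line and no XML tags yet (which is not exactly the state of the files above). The name make_lexicon is hypothetical, and the bare <lexicon> root follows the markup used in this post; a complete PLS file needs further attributes on the root element, which are left out here.

```shell
#!/bin/sh
# Sketch only: make_lexicon is a hypothetical helper name.
# Usage: make_lexicon WORDS PHONEMES > lexicon-file
# WORDS and PHONEMES are plain text files with one entry per line.
make_lexicon() {
    words=$1
    phonemes=$2
    # paste pairs lines by position, so the line counts must match
    if [ "$(wc -l < "$words")" -ne "$(wc -l < "$phonemes")" ]; then
        echo "line counts differ: $words $phonemes" >&2
        return 1
    fi
    echo '<lexicon>'
    # paste joins the two files with a tab; awk writes one lexeme per pair
    paste "$words" "$phonemes" | awk -F '\t' '{
        print "<lexeme>"
        print "<grapheme>" $1 "</grapheme>"
        print "<phoneme>" $2 "</phoneme>"
        print "</lexeme>"
    }'
    echo '</lexicon>'
}
```

The check in front of paste is the important part: if eSpeak ever skips or splits an entry, every following grapheme would silently get the wrong phonemes.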

IPA to SAMPA: voiced bilabial fricative

Saturday, April 10th, 2010

The simon PLS import process of Ralf's German dictionary is OK (not perfect, but OK).

A few minutes ago, I imported Ralf's Spanish dictionary (version 0.1; November 06, 2009) into simon. The performance of the import process was OK: I had to wait about 30 seconds until the import of this huge dictionary with about 850,000 Spanish words was finished. Maybe it is my new computer. Or maybe simon now has better routines for big dictionaries?

So at the moment, I think that it is not necessary that I reduce the size of Ralf's Spanish dictionary.

What is necessary? The simon developers should improve the IPA to SAMPA conversion. Here are the IPA phonemes that should be converted into SAMPA.

Phonemes that are needed by Ralf's Spanish dictionary:

1. Voiced bilabial fricative. IPA-number 127. IPA-text: β. X-SAMPA: B.

I have to stop here. Take a look into the simon shadow dictionary:

[Screenshot: spanish-sampa-phonemes]

simon converted the β phoneme into correct SAMPA. Here is the corresponding entry from Ralf's Spanish dictionary:

<lexeme>
<grapheme>ababa</grapheme>
<phoneme>aβaβa</phoneme>
</lexeme>

But what is wrong? There are no spaces between the phonemes in aBaBa. The β (IPA phoneme) has been converted correctly into B (SAMPA phoneme), but where are the spaces? The SAMPA transcription should look as follows: a B a B a

I don’t know why this error occurs. This is the reason why I am stopping here.
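Just to illustrate the expected result: for a transcription in which every phoneme is a single character, the missing spaces could be added with a one-line filter. This is only a sketch and says nothing about what simon does internally. The name space_phonemes is made up, and the filter would wrongly split SAMPA symbols that consist of more than one character.

```shell
#!/bin/sh
# Sketch only: space_phonemes is a hypothetical helper name.
# Puts a space after every character, then removes the trailing space.
# Assumes each phoneme is exactly one character.
space_phonemes() {
    sed 's/./& /g; s/ $//'
}
```

Feeding the line aBaBa through space_phonemes gives a B a B a, which is the form shown above.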

90,000 Catalan lexeme elements

Sunday, November 8th, 2009

You can import Ralf's Catalan dictionary (version 0.1; GPLv3) into simon. Training is currently almost impossible because of phoneme issues (compare the issues with Ralf's Spanish dictionary: β, ð, θ, ɣ, ɾ, ʎ, ʝ).

After getting the spelling dictionary, I created the phonemes with eSpeak. The proposal for Catalan SAMPA helped me find out about the “central (schwa) unrounded”. I hope that I implemented it correctly.

PLS dictionary: 850,000 Spanish words

Friday, November 6th, 2009

Ralf's Spanish dictionary (version 0.1; GPLv3) contains about 850,000 words. You can import this PLS dictionary into simon. Some remarks about how I created this dictionary:

1. I downloaded a spelling dictionary.

2. Then I used this spelling dictionary to produce the content of the grapheme elements:

$ unmunch es_ES.dic es_ES.aff > spanish-wordlist

3. From the word list, I created the content of the phoneme elements:

$ espeak -f spanish-ssml -m -v es -q -x --phonout="spanish-phoneme"

4. I combined the grapheme with the phoneme elements:

$ paste spanish-ssml spanish-phoneme-o > spanish-pls

5. With this style-sheet, I transformed the eSpeak phonemes into IPA phonemes. I am not sure whether I transformed everything correctly. E.g., I am unsure whether the following conversion is correct:

replace($sierra, 'J\^', 'ʎ')
replace($sierra, 'J', 'ʝ')

It might be the other way around.

6. The following phonemes will probably cause problems when you import the dictionary into simon: β, ð, θ, ɣ, ɾ, ʎ, ʝ

7. On my computer, the import into simon took 5 minutes.
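One thing about the two replace() calls in step 5 is certain, whichever mapping turns out to be right: the longer pattern J^ has to be replaced before the shorter pattern J, otherwise the J inside J^ is consumed first. The sketch below shows this with sed instead of XSLT. The function names are made up, and the mapping itself is still the unverified guess from step 5.

```shell
#!/bin/sh
# Sketch only: hypothetical helper names; the mapping J^ -> ʎ and
# J -> ʝ is the unverified guess from step 5.

# Longer pattern first: "J^" is replaced before the bare "J".
to_ipa() {
    sed -e 's/J\^/ʎ/g' -e 's/J/ʝ/g'
}

# Shorter pattern first: every "J" is already gone when the second
# rule runs, so "J^" ends up as "ʝ^" and the second rule never matches.
to_ipa_wrong_order() {
    sed -e 's/J/ʝ/g' -e 's/J\^/ʎ/g'
}
```

For the input J^a Ja, to_ipa yields ʎa ʝa, while to_ipa_wrong_order yields ʝ^a ʝa with a stray ^ left over.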

You can see that the import of Ralf's Spanish dictionary is possible. Unfortunately, training with this dictionary is currently almost impossible because of the phoneme issues.