Posts Tagged ‘GPLv3’

Ralf’s German speech model 0.1.8

Sunday, September 26th, 2010

You can download Ralf’s German speech model version 0.1.8. It contains 28000 German words (from sections: alpha bravo charlie diego echo).

Ralf’s Maithili dictionary

Friday, May 14th, 2010

How I create Ralf's Maithili dictionary:

1. Get a spelling dictionary. Its license is GPLv2. I always check the license when getting a spelling dictionary that I want to use as the source for a PLS dictionary. The license has to be GPLv3 compatible because I publish my PLS dictionaries under the GPLv3.

2. The files mai_IN.dic and mai_IN.aff are UTF-8 encoded. I only need mai_IN.dic for the creation of my PLS dictionary.

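3. The ISO 639-1 language code is bh. Strictly speaking, bh is the code for the Bihari language group, which includes Maithili; Maithili's own code, mai, exists only in ISO 639-2 and ISO 639-3. I will use bh in the xml:lang attribute of the <lexicon> element.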

4. Add <lexicon> at the beginning of the file mai_IN.dic. Add </lexicon> at the end of mai_IN.dic.

5. Trying to enclose the words in the Maithili word list within <grapheme> elements:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/maithili-dictionary/mai_IN.dic' -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:'/home/ubuntu/Documents/201005/maithili-dictionary/maithili.xml'

An error message is reported by the Ubuntu terminal:

SXXP0003: Error reported by XML parser: An invalid XML character (Unicode: 0x2) was found in the element content of the document.

I have never seen this error message before. What causes it? Does it have something to do with the BOM? Line 44345 is the one that causes the error. I delete this line and try again. Now it works.
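To find such lines without relying on the parser's error message, the word list can be scanned for characters that XML 1.0 forbids. The following is a minimal sketch in Python, not part of the original workflow; the function name and the sample words are made up for illustration. Note that the BOM (U+FEFF) is itself a legal XML character, so a report of 0x2 more likely points to a stray control character in the data than to the BOM.

```python
import re

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_xml_lines(lines):
    """Return (line_number, code_points) pairs for lines with forbidden characters."""
    hits = []
    for number, line in enumerate(lines, start=1):
        bad = INVALID_XML_CHARS.findall(line)
        if bad:
            hits.append((number, [f"U+{ord(ch):04X}" for ch in bad]))
    return hits

# Made-up sample; the second entry carries a stray U+0002 control character.
sample = ["मैथिली\n", "भाषा\x02\n", "शब्द\n"]
print(find_invalid_xml_lines(sample))
```

For a real file, the list can be replaced by an open file object, for example `open("mai_IN.dic", encoding="utf-8")`, since the function only iterates over lines.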

6. Because “in the modern time Devanagari script is most commonly used” I think that mai_IN.dic contains Devanagari script. I have already published one lexicon that uses Devanagari script: Ralf’s Nepali dictionary. Maybe I can adapt the style-sheet improve-nepali-dictionary.xsl.

7. Generate <phoneme> elements with some IPA characters:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -s:'/home/ubuntu/Documents/201005/maithili-dictionary/maithili.xml' -xsl:'/home/ubuntu/Documents/201005/maithili-dictionary/improve-maithili-dictionary.xsl' -o:'/home/ubuntu/Documents/201005/maithili-dictionary/maithili-dictionary.xml'

The result serves as a first draft.

8. Download Ralf's Maithili dictionary, and import it into simon.

Take a look at the shadow vocabulary:
The word column contains 48918 Maithili words. The pronunciation column contains the corresponding SAMPA pronunciations.

Of course, you can’t use this dictionary for training at the moment because it is just a first draft that should be improved by a native speaker.
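Steps 3 to 5 above (language code, <lexicon> wrapper, <grapheme> elements) can also be sketched in Python. This is only an illustration and not the contents of create-xml-file.xsl, which are not shown here. It assumes the usual Hunspell layout of a .dic file (a word count on the first line, optional affix flags after a slash), and it leaves out the PLS namespace, the version attribute and the <phoneme> elements. The function name and the sample entries are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_lexicon(dic_lines, lang):
    """Wrap each word of a Hunspell-style word list in <lexeme><grapheme>."""
    lexicon = ET.Element("lexicon", {"xml:lang": lang})
    for line in dic_lines:
        word = line.strip().split("/", 1)[0]  # drop affix flags after "/"
        # Skip blank lines and purely numeric lines such as the leading word count.
        if not word or word.isdigit():
            continue
        lexeme = ET.SubElement(lexicon, "lexeme")
        ET.SubElement(lexeme, "grapheme").text = word
    return lexicon

# Made-up sample in the assumed .dic layout.
sample = ["3", "मैथिली/A", "भाषा", ""]
xml_text = ET.tostring(build_lexicon(sample, "bh"), encoding="unicode")
print(xml_text)
```

Because ElementTree escapes the text it writes, characters such as & or < in a word do not break the result. Control characters like U+0002 are still written unchanged, so they have to be removed beforehand.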

Ralf’s Vietnamese dictionary

Sunday, November 15th, 2009

You can import Ralf's Vietnamese dictionary (version 0.1; GPLv3) into simon. The dictionary contains about 6,000 words; training is not possible. The phoneme elements contain eSpeak phonemes (not IPA phonemes).

Ralf’s Polish dictionary

Tuesday, November 10th, 2009

You can import Ralf's Polish dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is currently not possible.

Some details about (the creation of) this dictionary:
- The grapheme elements often contain garbage characters. I tried to fix this issue (with iconv), but unfortunately I wasn’t successful. When the grapheme element contains crap, the phoneme element is crappy, too. These lexeme elements are unusable.
- I didn’t convert the eSpeak phonemes into IPA phonemes. They are still in their original format. This is indicated by the attribute of the lexicon element: alphabet="espeak"
- As source I used a spelling dictionary.

This dictionary is of really bad quality. It is necessary to solve the encoding issue. But I tried three different conversions, and I failed.
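One way to narrow down such an encoding problem is to decode the raw bytes with several candidate encodings and see which one yields plausible Polish text. The following is a minimal sketch under assumptions of my own: the candidate list (UTF-8, ISO-8859-2, Windows-1250) is a guess at what a Polish spelling dictionary might use, and counting Polish diacritics is only a rough heuristic. The function name and the sample words are hypothetical.

```python
# Polish letters with diacritics, used as a plausibility signal.
POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

def score_decodings(raw, encodings=("utf-8", "iso-8859-2", "cp1250")):
    """Decode raw bytes with each candidate and count Polish diacritics."""
    results = {}
    for name in encodings:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # the bytes are not valid in this encoding
            continue
        results[name] = sum(ch in POLISH_LETTERS for ch in text)
    return results

# Made-up sample, encoded as ISO-8859-2 for the sake of the example.
raw = "żółć\nświęto\n".encode("iso-8859-2")
print(score_decodings(raw))
```

The encoding with the highest count is the best candidate for the -f argument of iconv. For a real dictionary, the bytes would come from reading the file in binary mode.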

At least you can get an impression of how easy it is to import 300,000 Polish words.