Posts Tagged ‘unmunch’

Ralf’s Arabic dictionary

Tuesday, January 10th, 2012

This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon.

A. Creation of the dictionary:

1. Get Arabic spelling dictionary.
2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file:

GPL 2.0/LGPL 2.1/MPL 1.1 tri-license

This means that I can use this tri-licensed spelling dictionary as source for my future GPLv3 PLS dictionary.

3. Now I have to extract dict_ar-3.0.oxt.
4. Let’s try the unmunch command inside the Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ unmunch ar.dic ar.aff > arabic

It failed. I wasn’t able to unmunch the word list.
5. I have to remove all numbers from ar.dic. This can be done with the sed command:

sed 's/[0-9]*//g' ar.dic > arabic-without-numbers

6. Remove the slash (“/”) from arabic-without-numbers with Geany.
7. Add lexicon tags at the beginning and the end of the file.
8. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:arabic.xml

9. ISO 639-1 language code is ar.
10. Maybe I will use this table for the grapheme to phoneme conversion.
11. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic.xml -xsl:'improve-arabic.xsl' -o:arabic-dictionary.xml

I have to remove the number sign (“#”) with Geany from arabic.xml.
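Steps 5 and 6 could probably be done in one go on the command line, without Geany. The following is only a sketch: the function name strip_flags and the sample entries are made up for illustration, and it assumes that all flags in the .dic file are numbers written after a slash. It also drops lines that become empty, such as the word-count line at the top of a .dic file.

```shell
#!/bin/sh
# Hypothetical helper: remove digits and slashes from a Hunspell-style .dic
# file, then delete lines that end up empty (e.g. the leading count line).
strip_flags() {
    sed -e 's/[0-9]//g' -e 's#/##g' -e '/^$/d' "$1"
}

# Demo on made-up sample data (not taken from ar.dic).
sample=$(mktemp)
printf '3\nkitab/12\nqalam\nbayt/7\n' > "$sample"
strip_flags "$sample"
rm -f "$sample"
```

With the real file, the call would be something like: strip_flags ar.dic > arabic-without-numbers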

B. Download the dictionary. Import it into simon.

The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column contains just entries with “Unknown”. This is because the PLS dictionary contains no role attributes.

Now you know how I created the dictionary. And you know what the result looks like in simon.

Ralf’s Hebrew dictionary

Tuesday, January 10th, 2012

In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon.

A. Creation of the dictionary:

1. Get Hebrew spelling dictionary from OpenOffice.org.
2. License is GPL. There is a copyright notice inside the file he_IL.aff.

3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test

4. The source file he_IL.dic contains a lot of numbers. I remove them with the Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ sed 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers

With Geany, I remove the “,” (commas) and the “/” (slashes) that are still included in the file hebrew-without-numbers. Now I have a clean word list with 43.000 Hebrew words.

5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.
6. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:hebrew.xml

7. ISO 639-1 language code is he.
8. I need a table for grapheme to phoneme conversion. Maybe I will use this table. There are several tables available at Wikipedia. I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew share the same alphabet. This means I could try to use the Yiddish improve-yiddish.xsl style sheet:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml

The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn’t been converted: [א]. I will add a rule for it to the style sheet and save the result as improve-hebrew.xsl. Now I try it again:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'improve-hebrew.xsl' -o:hebrew-dictionary.xml

The result is not so good: Maybe I should adjust the grapheme to phoneme conversion rules for modern standard Israeli Hebrew. Or is this not necessary? I think for a first draft I can use the Yiddish transformation rules.
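Because the clean-up in step 4 was partly done by hand, it may be worth checking that no digits, commas or slashes are left in the word list. This is just a small sketch; the function name count_leftovers is made up, and the sample lines are placeholders rather than real Hebrew entries.

```shell
#!/bin/sh
# Hypothetical helper: count the lines that still contain a digit,
# a comma or a slash. A clean word list should give 0.
count_leftovers() {
    # grep -c prints the count but exits non-zero when the count is 0,
    # so "|| true" keeps the function from reporting a failure in that case.
    grep -c '[0-9,/]' "$1" || true
}

# Demo on placeholder data.
sample=$(mktemp)
printf 'word1\nword,two\nword/\nword\n' > "$sample"
count_leftovers "$sample"
rm -f "$sample"
```

With the real file, the call would be: count_leftovers hebrew-without-numbers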

B. Download the dictionary. Import it into simon as shadow dictionary.

Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or to be more exact: displays just Unknown) since the source PLS dictionary contains no role attributes.

Now you know how I created the dictionary. And you know what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn’t be a problem to adjust the style sheet improve-hebrew.xsl so that the phoneme results are better.

Ralf’s Belarusian dictionary

Monday, January 9th, 2012

This article explains how I create this PLS dictionary and what the imported result looks like.

A. Creation of the Belarusian PLS dictionary:

1. Get spelling dictionary. I choose the official orthography.
2. License is LGPL (see hyph_be_BY.dic). I am allowed to “convert any LGPLed piece of software into a GPLed piece of software.” I did this before. And I will do it again. This means that I get a spelling dictionary that is licensed under the LGPL. And I will produce a pronunciation dictionary that is licensed under the GPLv3. By the way, all my dictionaries are licensed under the GPLv3.
3. Extract dict-be-official.oxt.

4. The file be-official.aff is encoded in UTF-8. The file be-official.dic may be encoded in ISO-8859-1. At least this encoding is displayed by Geany. I believe that be-official.dic is actually encoded in microsoft-cp1251. I have seen this encoding before (Macedonian and Bulgarian).
Now it is time to use the Ubuntu terminal:
cd /home/ubuntu/Documents/2011-II/Belarusian
iconv -f cp1251 -t UTF-8 <be-official.dic >belarusian-utf8.dic

The text file belarusian-utf8.dic looks fine.

5. Now I change the line SET microsoft-cp1251 in the file be-official.aff into SET UTF-8
6. I don’t know whether the next step is necessary. I could convert the file hyph_be_BY.dic from cp1251 into UTF-8. At the moment, I skip this step.

7. Ubuntu terminal:

unmunch belarusian-utf8.dic be-official.aff > belarusian-wordlist

I think that this step wasn’t necessary. It didn’t extract the word list. At the moment, I have a word list of 1.5 million words. This is way too much. The target size is 400.000 words.

8. I have to reduce the dictionary size. I found a tip. Ubuntu terminal:

sed -n 'p;N;N;N' belarusian-wordlist > belarusian-wordlist-reduced

Yes, it worked. The word list now contains 391.000 words. This is a good basis for a PLS dictionary.

9. Add lexicon elements at the beginning and the end of belarusian-wordlist-reduced.
10. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian-wordlist-reduced -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:belarusian.xml

11. Language code is be.
12. I will use this table for grapheme to phoneme mapping.
13. Creation of the phoneme elements:

ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian.xml -xsl:'improve-belarusian.xsl' -o:belarusian-dictionary.xml
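A note on the sed command from step 8: it prints one line, then pulls the next three lines into the pattern space without printing them. In other words, it keeps lines 1, 5, 9 and so on, which is why the list shrinks to roughly a quarter. The sketch below shows this on ten numbered lines and compares it with an awk one-liner that should select the same lines. The function names thin_sed and thin_awk are made up, and the awk variant is an alternative, not the command I used.

```shell
#!/bin/sh
# Keep lines 1, 5, 9, ... with the sed command from step 8.
thin_sed() {
    sed -n 'p;N;N;N' "$1"
}

# Alternative with awk: print a line when its number leaves remainder 1
# after division by 4.
thin_awk() {
    awk 'NR % 4 == 1' "$1"
}

# Demo on ten numbered lines.
sample=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$sample"
thin_sed "$sample"
thin_awk "$sample"
rm -f "$sample"
```

The disadvantage of both variants is that the selection is blind: it doesn’t know which words are common and which are rare.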

B. Download and import the dictionary.

Let’s take a look at the result. The left column contains 391669 Belarusian words. The pronunciation column contains the corresponding SAMPA transcriptions. All entries in the third column are marked as “Unknown”. This is because the Belarusian PLS dictionary doesn’t contain any role attribute.

Now you know how I created the dictionary. And you got an impression of what the result looks like when imported into simon.

Ralf’s Asturian dictionary

Thursday, January 5th, 2012

This article explains how I create the Asturian PLS dictionary and adds a few words about the import into simon.

A. How I create the dictionary:
1. Get spelling dictionary.
2. Check license. It is GPLv3.
3. Extract asturianu.oxt.
4. Language code is ast.
5. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist

The result is a file of 70MB with more than 5 million words. This word list is too big. I should reduce it. I had the same problem with my Latin dictionary. I had to reduce the size.

6. Add lexicon elements at the beginning/end of asturian-wordlist.

7. Generate .xml document with lexicon, lexeme and grapheme elements:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

I got an error message because there isn’t enough memory (“Java heap space”). I could either reduce the file size with grep or install VisualVM. I think I will work with grep:
a. Remove lines that begin with l': grep -v "^l'" asturian-wordlist > asturian-wordlist-02
b. Remove lines that begin with t': grep -v "^t'" asturian-wordlist-02 > asturian-wordlist-03
c. Remove lines that begin with s': grep -v "^s'" asturian-wordlist-03 > asturian-wordlist-04
d. Remove lines that begin with m': grep -v "^m'" asturian-wordlist-04 > asturian-wordlist-05
e. Remove lines that begin with n': grep -v "^n'" asturian-wordlist-05 > asturian-wordlist-06
f. Remove lines that begin with d': grep -v "^d'" asturian-wordlist-06 > asturian-wordlist-07
g. Remove lines that begin with qu': grep -v "^qu'" asturian-wordlist-07 > asturian-wordlist-08
h. Remove lines that begin with p': grep -v "^p'" asturian-wordlist-08 > asturian-wordlist-09
The dictionary will contain 1.1 million words. I think that this number is acceptable.

8. And now Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-09 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

This command creates a PLS dictionary without phoneme elements. The phoneme elements will be added later.

9. I will use this table for grapheme to phoneme conversion. Here is the command that creates the phoneme elements:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian.xml -xsl:'improve-asturian.xsl' -o:asturian-dictionary.xml

10. I tried to import the resulting dictionary into simon. Unfortunately, simon stopped responding after the import had finished. I assume that the dictionary is way too big. I have to reduce its size again.
a. Remove lines that contain 'l: grep -v "'l" asturian-wordlist-09 > asturian-wordlist-10
b. Continue to reduce the size of the word list: grep -v ylu asturian-wordlist-10 > asturian-wordlist-11
c. This isn’t enough, I have to remove about 80.000 words: grep -v les asturian-wordlist-11 > asturian-wordlist-12
d. Remove 136.000 words: grep -v mos asturian-wordlist-12 > asturian-wordlist-13
e. Remove 67.000 words: grep -v los asturian-wordlist-13 > asturian-wordlist-14
f. Remove 265.000 words: grep -v es asturian-wordlist-14 > asturian-wordlist-15
You see it is a lot of work to get a dictionary size that is suitable for simon. At the moment, the word list contains 539.000 words. Is this number OK, or should I continue to reduce the size? I think that I will try the import again. First, I create an .xml file again:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-15 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

And now I repeat step 9. The file asturian-dictionary.xml has a size of 45 MB. I hope that this size is OK.
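The eight grep calls from step 7 could probably be merged into one call with an extended regular expression. This is a sketch, not what I actually ran: the function name drop_elided is made up, the sample lines are made up as well, and the pattern assumes that the word list uses the plain ASCII apostrophe.

```shell
#!/bin/sh
# Hypothetical helper: drop every line that begins with one of the
# prefixes l' t' s' m' n' d' qu' p' (ASCII apostrophe assumed).
drop_elided() {
    grep -v -E "^(l|t|s|m|n|d|qu|p)'" "$1"
}

# Demo on made-up sample lines.
sample=$(mktemp)
printf "l'agua\ncasa\nqu'el\nperru\nd'ellos\n" > "$sample"
drop_elided "$sample"
rm -f "$sample"
```

With the real file, the call would be: drop_elided asturian-wordlist > asturian-wordlist-09. This avoids the intermediate files -02 to -08.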

B. Download the dictionary. Import it into simon.

Take a look at the result. In the left column, you can see the Asturian words. This dictionary contains 539928 words. The right column contains the corresponding SAMPA transcriptions.

You could see that it was a lot of work to reduce the size of the dictionary. At least, now it has a size that isn’t too big for simon.

Ralf’s Yiddish dictionary

Tuesday, January 3rd, 2012

This article explains some details about the creation of the dictionary, and what the result looks like in simon.

A. How I create Ralf's Yiddish dictionary:

1. Get spelling dictionary.
2. License is GPLv3.
3. Extract jidysz.net.ooo.spellchecker.oxt.
4. Ubuntu terminal:
cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries
sudo apt-get install hunspell-tools
unmunch yi.dic yi.aff > yiddish-wordlist

5. Add <lexicon> at the beginning of yiddish-wordlist. Add </lexicon> at the end of this file.
6. Generate .xml document with lexicon, lexeme and grapheme elements:

ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:yiddish.xml

7. ISO 639-1 language code is yi.
8. I think I will use this table as source for the grapheme to phoneme mapping.
9. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish.xml -xsl:'improve-yiddish.xsl' -o:yiddish-dictionary.xml
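Step 5 doesn’t need a text editor. Here is a minimal sketch of how the lexicon tags could be added on the command line. The function name wrap_lexicon is made up, and the sample words are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the file with <lexicon> as the first line
# and </lexicon> as the last line.
wrap_lexicon() {
    printf '<lexicon>\n'
    cat "$1"
    printf '</lexicon>\n'
}

# Demo on placeholder data.
sample=$(mktemp)
printf 'word1\nword2\n' > "$sample"
wrap_lexicon "$sample"
rm -f "$sample"
```

With the real file, the output has to go into a new file, for example: wrap_lexicon yiddish-wordlist > yiddish-wordlist-tagged. The name of the output file is only an example. Redirecting into the input file itself would empty it before it is read.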

B. Download the dictionary, and import it into simon.

Take a look at the result. The left column contains the Yiddish words. This dictionary contains 99980 words. The right column contains the corresponding SAMPA transcription.
Yiddish is written in the Hebrew alphabet. The Hebrew alphabet is written from right to left. Obviously, the corresponding SAMPA transcriptions are written from left to right. This means that the phoneme order should be fine.

There are a lot of other PLS dictionaries available. Find the PLS dictionary that suits your language.

PLS dictionary: 850.000 Spanish words

Friday, November 6th, 2009

Ralf's Spanish dictionary (version 0.1; GPLv3) contains about 850.000 words. You can import this PLS dictionary into simon. Some remarks about (how I created) this dictionary:

1. I downloaded a spelling dictionary.

2. Then I used this spelling dictionary to produce the content of the grapheme elements:

$ unmunch es_ES.dic es_ES.aff > spanish-wordlist

3. From the word list, I created the content of the phoneme elements:

$ espeak -f spanish-ssml -m -v es -q -x --phonout="spanish-phoneme"

4. I combined the grapheme with the phoneme elements:

$ paste spanish-ssml spanish-phoneme-o > spanish-pls

5. With this style-sheet, I transformed the eSpeak phonemes into IPA phonemes. I am not sure whether I transformed everything correctly. E.g., I am unsure whether the following conversion is correct:

replace($sierra, 'J\^', 'ʎ')
replace($sierra, 'J', 'ʝ')

It might be the other way around.

6. The following phonemes will probably cause problems when you import the dictionary into simon: β, ð, θ, ɣ, ɾ, ʎ, ʝ

7. On my computer, it was necessary to wait 5 minutes until the import into simon had been finished.
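A remark on step 4: paste joins the two files line by line, so the result is only correct if both files have exactly the same number of lines. Otherwise words and phonemes get shifted against each other without any warning. The sketch below checks this first. The function name combine_lists is made up, and the sample content is placeholder text, not real eSpeak output.

```shell
#!/bin/sh
# Hypothetical helper: refuse to combine the lists when their line
# counts differ, otherwise join them line by line (tab-separated).
combine_lists() {
    words_n=$(wc -l < "$1")
    phones_n=$(wc -l < "$2")
    if [ "$words_n" -ne "$phones_n" ]; then
        echo "line counts differ: $words_n vs $phones_n" >&2
        return 1
    fi
    paste "$1" "$2"
}

# Demo on placeholder data.
words=$(mktemp)
phones=$(mktemp)
printf 'casa\nperro\n' > "$words"
printf 'PHONEMES-1\nPHONEMES-2\n' > "$phones"
combine_lists "$words" "$phones"
rm -f "$words" "$phones"
```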

You can see that the import of Ralf's Spanish dictionary is possible. Unfortunately, training with this dictionary is currently almost impossible because of the phoneme issues.

Expanding Ralf’s French dictionary

Tuesday, November 3rd, 2009

This article is interesting for people who want to create a pronunciation dictionary for their own language that can be imported into simon.

Currently, I am preparing a new version of Ralf's French dictionary. I am adding a lot of words to the dictionary. Here is what I am doing:

1. I got a French spelling dictionary from OpenOffice.org (Orthographe «Réforme 1990»). There are two important files:
a. .dic file: contains the word list
b. .aff file: contains rules about the possible suffixes. French has a lot of suffixes. Let’s take a look into Conjugaison française:Premier groupe:

# INDICATIF

* Présent : -e, -es, -e, -ons, -ez, -ent
* Imparfait : -ais, -ais, -ait, -ions, -iez, -aient
* Futur simple : -erai, -eras, -era, -erons, -erez, -eront
* Passé simple : -ai, -as, -a, -âmes, -âtes, -èrent

# SUBJONCTIF

* Présent : -e, -es, -e, -ions, -iez, -ent
* Imparfait : -asse, -asses, -ât, -assions, -assiez, -assent

# CONDITIONNEL

* Présent : -erais, -erais, -erait, -erions, -eriez, -eraient

The next version of Ralf’s French dictionary should cover most of the French suffixes. And of course, the French accents should be correct (according to the Réforme 1990).

2. I typed into the Ubuntu terminal the command: unmunch fr-1990.dic fr-1990.aff > french-wordlist-o.txt
This created a list of about 3 million French words. But there are many duplicates. This means that I will use only a subset of these words. I will have to sort out the duplicates, either with the XPath function distinct-values() or with sort -u:

“sort sorts the list of ‘words’ into alphabetical order, and the -u switch removes duplicates”

I will have to be careful with the special characters. The sort command might be easier, but maybe there will be problems with ISO-8859-1 vs. UTF-8. To avoid this possible problem, it might be better to use an XSLT style-sheet. But there can be a problem with Java heap space (I solved this problem under Ubuntu 9.04 with VisualVM. Unfortunately, I didn’t find out how to deal with this issue under Ubuntu 9.10). I will have to try it out. The tools that I am using are very good, but they aren’t perfect. Fixing the issues costs me a lot of time.
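One way to sidestep the encoding question while removing duplicates might be to let sort compare raw bytes. With LC_ALL=C, two lines count as duplicates only if they are byte-for-byte identical, no matter whether the file is ISO-8859-1 or UTF-8. This is a sketch under that assumption; the function name dedupe is made up. The price is that the output is in byte order, which is not the French alphabetical order for accented letters.

```shell
#!/bin/sh
# Hypothetical helper: remove exact duplicate lines, comparing bytes
# instead of locale-aware characters.
dedupe() {
    LC_ALL=C sort -u "$1"
}

# Demo on a small sample with one duplicate.
sample=$(mktemp)
printf 'parler\naimer\nparler\n' > "$sample"
dedupe "$sample"
rm -f "$sample"
```

For the real list, the call would be something like: dedupe french-wordlist-o.txt > french-wordlist-unique.txt (the output file name is only an example).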

There are encoding problems. If you want to create a dictionary with non-US-ASCII characters (as is the case in German – äöüß – or French), you will probably encounter the following problem:

ISO-8859-1 vs. UTF-8

Both standards are very common, and it can cause a lot of headache. I am trying to solve the problem via the style-sheet create-graphemes-french.xsl:

[Image: replace-french]

The unmunch command introduced a lot of crap characters. I don’t know how I can prevent this problem. But I know how I can try to fix the crap characters. One solution would be a simple search and replace with gedit. But I prefer to use a style-sheet because it allows several transformations at a time.
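I am not sure where the wrong characters come from. If they come from an ISO-8859-1 file that is treated as UTF-8 (this is only an assumption), then converting the file with iconv before running unmunch might avoid them. The sketch below shows the conversion on a one-word sample. The function name latin1_to_utf8 is made up, and I haven’t verified the encoding of fr-1990.dic.

```shell
#!/bin/sh
# Hypothetical helper: convert a file from ISO-8859-1 to UTF-8.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 "$1"
}

# Demo: "café" where é is the single ISO-8859-1 byte 0xE9 (octal 351).
sample=$(mktemp)
printf 'caf\351\n' > "$sample"
latin1_to_utf8 "$sample"
rm -f "$sample"
```

If the .dic file is converted this way, the .aff file and its SET line probably have to match the new encoding as well.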

You can see that dictionary development is possible. I am using very powerful tools (the unmunch command for word generation; the .xsl style-sheet for fixing encoding issues; the espeak command for phoneme generation).

Why am I writing this in this blog? Because if you want to use simon, you need a pronunciation dictionary. I want to help people who don’t have access to a dictionary. Create your own dictionary! OpenOffice.org offers spelling dictionaries in a lot of languages. This is your source to get a word list. With eSpeak you transform the words into phonemes. You can use eSpeak for the following languages:

Afrikaans, Bosnian, Catalan, Czech, German (most phonemes of Ralf's German dictionary are created with eSpeak), Greek, Esperanto, Spanish, Finnish, Croatian, Hungarian, Italian, Kurdish, Latvian, Polish, Romanian, Slovak, Serbian, Swedish, Swahili, Tamil, Turkish.

If you are a native speaker of one of those languages, my concept of dictionary creation is something for you.

With paste you combine the word list and the phoneme list.

[Edited on November 16, 2009]