This article is for anyone who wants to create a pronunciation dictionary for their own language that can be imported into simon.
I am currently preparing a new version of Ralf's French dictionary and adding a lot of words to it. Here is what I am doing:
1. I got a French spelling dictionary from OpenOffice.org (Orthographe «Réforme 1990»). There are two important files:
a. .dic file: contains the word list
b. .aff file: contains the rules for the possible suffixes. French has a lot of suffixes. Let’s take a look at the first-group conjugation (Conjugaison française: Premier groupe):
# INDICATIF
* Présent : -e, -es, -e, -ons, -ez, -ent
* Imparfait : -ais, -ais, -ait, -ions, -iez, -aient
* Futur simple : -erai, -eras, -era, -erons, -erez, -eront
* Passé simple : -ai, -as, -a, -âmes, -âtes, -èrent
# SUBJONCTIF
* Présent : -e, -es, -e, -ions, -iez, -ent
* Imparfait : -asse, -asses, -ât, -assions, -assiez, -assent
# CONDITIONNEL
* Présent : -erais, -erais, -erait, -erions, -eriez, -eraient
The next version of Ralf’s French dictionary should cover most of the French suffixes. And of course, the French accents should be correct (according to the Réforme 1990).
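To make the effect of such suffix rules concrete, here is a minimal Python sketch that attaches the indicative endings from the table above to a verb stem. It is only an illustration under simplifying assumptions: the names FIRST_GROUP_ENDINGS, expand and all_forms are invented for this example, the real .aff format has conditions and stripping rules that are ignored here, and plain concatenation does not handle spelling changes in some verbs.

```python
# Endings of the first-group indicative, copied from the table above.
# The English keys are labels chosen for this sketch only.
FIRST_GROUP_ENDINGS = {
    "present": ["e", "es", "e", "ons", "ez", "ent"],
    "imperfect": ["ais", "ais", "ait", "ions", "iez", "aient"],
    "simple_future": ["erai", "eras", "era", "erons", "erez", "eront"],
    "simple_past": ["ai", "as", "a", "âmes", "âtes", "èrent"],
}


def expand(stem, endings):
    """Attach every ending to the stem, keeping the table order."""
    return [stem + ending for ending in endings]


def all_forms(stem):
    """Collect the forms of all four tenses, duplicates included."""
    forms = []
    for endings in FIRST_GROUP_ENDINGS.values():
        forms.extend(expand(stem, endings))
    return forms
```

For the stem "parl" (from "parler"), all_forms returns 24 forms, of which 22 are distinct, because "parle" and "parlais" each occur twice. Repetitions of this kind are one plausible reason why an expanded word list contains duplicates.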
2. In the Ubuntu terminal, I ran the command: unmunch fr-1990.dic fr-1990.aff > french-wordlist-o.txt
This created a list of about 3 million French words, but it contains many duplicates, so I will use only a subset of these words. I will have to remove the duplicates, either with the XPath function distinct-values() or with sort -u:
“sort sorts the list of ‘words’ into alphabetical order, and the -u switch removes duplicates”
I will have to be careful with the special characters. The sort command might be easier, but there may be problems with ISO-8859-1 vs. UTF-8. To avoid this, it might be better to use an XSLT style-sheet, but then I can run out of Java heap space. (I solved this problem under Ubuntu 9.04 with VisualVM. Unfortunately, I haven’t found out how to deal with it under Ubuntu 9.10.) I will have to try it out. The tools that I am using are very good, but they aren’t perfect, and fixing the issues costs me a lot of time.
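As a third option next to sort -u and distinct-values(), the duplicates could also be removed with a few lines of Python, where the file encoding is stated explicitly instead of depending on the terminal settings. This is a minimal sketch, not the method used for the dictionary: the function names are invented for this example, the encoding is assumed to be UTF-8 unless another one is passed, and the whole word list is assumed to fit into memory.

```python
def unique_words(lines):
    """Return the distinct, non-empty words in sorted order."""
    words = {line.strip() for line in lines}
    words.discard("")  # drop blank lines
    return sorted(words)  # sorted by code point, not by French collation


def dedupe_file(src, dst, encoding="utf-8"):
    """Read src, remove duplicates, write one word per line to dst."""
    with open(src, encoding=encoding) as infile:
        words = unique_words(infile)
    with open(dst, "w", encoding=encoding) as outfile:
        outfile.write("\n".join(words) + "\n")
    return len(words)
```

A call such as dedupe_file("french-wordlist-o.txt", "french-wordlist-unique.txt") would write the cleaned list to a second file. The output file name is hypothetical. Note that sorted() orders accented letters after "z", which differs from French alphabetical order but does not matter for removing duplicates.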
There are encoding problems. If you want to create a dictionary with non-US-ASCII characters (as is the case in German, with äöüß, or in French), you will probably run into the following problem:
ISO-8859-1 vs. UTF-8
Both standards are very common, and mixing them up can cause a lot of headaches. I am trying to solve the problem with the style-sheet create-graphemes-french.xsl.

The unmunch command introduced a lot of garbled characters. I don’t know how to prevent this, but I know how to try to fix them afterwards. One solution would be a simple search and replace with gedit, but I prefer a style-sheet because it allows several transformations in one pass.
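For readers who do not want to use a style-sheet, the following Python sketch shows one common kind of repair. It rests on an assumption that the text above does not confirm: that the garbled characters come from UTF-8 bytes that were read as ISO-8859-1 (so that "é" shows up as "Ã©"). Both function names are invented for this example, and other kinds of damage would need a different fix.

```python
def decode_wordlist(raw):
    """Decode bytes as UTF-8 if they are valid, else as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as ISO-8859-1."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way; leave it alone.
        return text
```

repair_mojibake turns "Ã©tÃ©" back into "été" and leaves a correct "été" unchanged, so it can be applied to every line of a mixed word list. decode_wordlist is meant for the raw bytes of a file whose encoding is not known in advance.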
You can see that dictionary development is possible. I am using very powerful tools (the unmunch command for word generation; the .xsl style-sheet for fixing encoding issues; the espeak command for phoneme generation).
Why am I writing this in this blog? Because if you want to use simon, you need a pronunciation dictionary. I want to help people who don’t have access to a dictionary. Create your own dictionary! OpenOffice.org offers spelling dictionaries in a lot of languages. This is your source to get a word list. With eSpeak you transform the words into phonemes. You can use eSpeak for the following languages:
Afrikaans, Bosnian, Catalan, Czech, German (most phonemes of Ralf's German dictionary are created with eSpeak), Greek, Esperanto, Spanish, Finnish, Croatian, Hungarian, Italian, Kurdish, Latvian, Polish, Romanian, Slovak, Serbian, Swedish, Swahili, Tamil, Turkish.
If you are a native speaker of one of those languages, my concept of dictionary creation is something for you.
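The phoneme step could be scripted along the following lines. This is a hedged sketch, not the exact procedure used for Ralf's dictionaries: the function names are invented, the voice name "fr" is an assumption, and the flags are the ones I know from the espeak help text (-q for no audio, -x for phoneme mnemonics on standard output, -v for the voice), so please check espeak --help on your own system.

```python
import subprocess


def espeak_command(word, voice="fr"):
    """Build the argument list for one espeak call.

    -q suppresses audio, -x prints phoneme mnemonics, -v selects the voice.
    """
    return ["espeak", "-q", "-x", "-v", voice, word]


def phonemes_for(word, voice="fr"):
    """Run espeak for a single word and return its phoneme string.

    Requires espeak to be installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        espeak_command(word, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling phonemes_for once per word starts a new process each time, which is slow for millions of words, so for a full dictionary it is probably better to let espeak read the whole word list in one run.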
Finally, with the paste command you combine the word list and the phoneme list line by line.
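What this combining step does can be sketched in Python as follows. The function name paste_lines is invented for this example, and the tab separator is the default of the paste command, not necessarily the format that simon expects. Unlike paste, the sketch refuses lists of different lengths, because a missing line would silently shift every following phoneme entry to the wrong word.

```python
def paste_lines(words, phonemes, sep="\t"):
    """Join the n-th word with the n-th phoneme string, like paste does."""
    if len(words) != len(phonemes):
        raise ValueError(
            "word list has %d lines, phoneme list has %d"
            % (len(words), len(phonemes))
        )
    return [word + sep + phoneme for word, phoneme in zip(words, phonemes)]
```

Here words and phonemes are lists of lines without their trailing newline, for example the result of splitting each file's contents with splitlines().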
[Edited on November 16, 2009]