More than 300,000 French words

Ralf's French dictionary (version 0.1.1) contains more than 300,000 French words.

It has the following known issues:
1. The dictionary probably contains about 60,000 duplicate entries. The duplicates will be removed in a future version of the dictionary.
2. More than 100 phoneme elements contain the invalid character Ã. The corresponding lexeme elements will be removed in a future version.
3. Currently, training with this dictionary is not recommended (because of the French IPA phonemes).
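The first two issues can be checked mechanically. The following is only a minimal sketch in Python, not the tool used to build the dictionary: the function name `audit` and the sample entries are made up for illustration, and it assumes that a "duplicate" means a lexeme whose grapheme and phoneme texts are both identical to those of an earlier lexeme.

```python
import xml.etree.ElementTree as ET
from collections import Counter

INVALID = "\u00c3"  # the character Ã (capital A with tilde)

# Hypothetical sample data, not taken from the real dictionary.
SAMPLE = (
    "<lexicon>"
    "<lexeme><grapheme>chat</grapheme><phoneme>\u0283a</phoneme></lexeme>"
    "<lexeme><grapheme>chat</grapheme><phoneme>\u0283a</phoneme></lexeme>"
    "<lexeme><grapheme>bon</grapheme><phoneme>b\u00c3</phoneme></lexeme>"
    "</lexicon>"
)


def local(tag):
    """Strip an XML namespace prefix such as '{uri}lexeme' -> 'lexeme'."""
    return tag.rsplit("}", 1)[-1]


def audit(xml_text):
    """Return (number of duplicate lexemes, graphemes of lexemes with Ã)."""
    root = ET.fromstring(xml_text)
    seen = Counter()
    bad = []
    for lexeme in root.iter():
        if local(lexeme.tag) != "lexeme":
            continue
        graphemes = [c.text or "" for c in lexeme if local(c.tag) == "grapheme"]
        phonemes = [c.text or "" for c in lexeme if local(c.tag) == "phoneme"]
        # Assumption: identical grapheme and phoneme texts make a duplicate.
        seen[(tuple(graphemes), tuple(phonemes))] += 1
        if any(INVALID in p for p in phonemes):
            bad.append(graphemes)
    duplicates = sum(n - 1 for n in seen.values() if n > 1)
    return duplicates, bad


duplicates, bad = audit(SAMPLE)
print(duplicates, bad)
```

For the real file, you would read the text from disk first and pass it to `audit`; with more than 300,000 entries, a streaming parser may be the better choice.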

You can still import this PLS dictionary into simon, and doing so shows that the main concept is sound:

1. The grapheme elements contain the French accents according to the «Réforme 1990». The dictionary still contains errors, but most accents should be correct. Here is an example:

<lexeme>
<grapheme>Île-de-France</grapheme>
<phoneme>ildəfʀɑ̃s</phoneme>
</lexeme>

You can see that the French vowel Î is correctly represented in the grapheme element.

2. The phoneme elements are represented following the IPA standard.
3. The license is GPL. It would be nice if a native speaker improved this dictionary. Transforming the dictionary into Sphinx format or HTK format is of course allowed.
4. It is possible to add a role attribute to the lexeme elements. A future version of simon might be able to use this information.
5. Thanks to UTF-8, there shouldn't be problems with garbled characters. The invalid Ã characters mentioned above are a minor mistake that I still have to fix, so there is no need to worry about encoding issues in general.
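As a rough illustration of the transformation mentioned in point 3, here is a minimal Python sketch that reads one lexeme and writes it as a line with the word followed by space-separated phoneme symbols, which is the general shape of a Sphinx-style dictionary entry. The helper names `split_ipa` and `to_line` are hypothetical. A real conversion would also need a mapping from the IPA symbols to the phone set of the target acoustic model, which is not shown here.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The example lexeme from above, written with escapes so that the
# combining tilde (U+0303) after the vowel is unambiguous.
LEXEME = (
    "<lexeme>"
    "<grapheme>\u00cele-de-France</grapheme>"
    "<phoneme>ild\u0259f\u0280\u0251\u0303s</phoneme>"
    "</lexeme>"
)


def split_ipa(phoneme):
    """Split an IPA string into symbols, keeping combining marks
    (such as the nasal tilde) attached to the preceding letter.
    Limitation: length marks and tie bars are not grouped specially."""
    segments = []
    for ch in phoneme:
        if unicodedata.combining(ch) and segments:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


def to_line(lexeme_xml):
    """Format one lexeme as 'word symbol symbol ...'."""
    lexeme = ET.fromstring(lexeme_xml)
    word = lexeme.findtext("grapheme")
    phones = split_ipa(lexeme.findtext("phoneme"))
    return word + " " + " ".join(phones)


print(to_line(LEXEME))
```

Grouping combining marks with their base letter matters for French, because a nasal vowel is stored as a vowel letter plus a combining tilde and should stay one symbol.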

You can see the advantages of this dictionary.
