I have just downloaded the French dictionary from univ-lemans.fr. It has more than 65K words; in fact, I think it contains more than 100,000. It is probably stored in Sphinx format. Here are some lines from that dictionary:
juridiction jj uu rr ii dd ii kk ss yy on
juridictionnel jj uu rr ii dd ii kk ss yy oo nn ai ll
juridictionnelle jj uu rr ii dd ii kk ss yy oo nn ai ll
juridictionnelle(2) jj uu rr ii dd ii kk ss yy oo nn ai ll ee
juridictions jj uu rr ii dd ii kk ss yy on
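To make the structure of these lines easier to see, here is a minimal Python sketch that splits each line into a word, an optional variant number such as the "(2)" above, and the list of phonemes. The function name parse_dict_line and the interpretation of "(2)" as a second pronunciation variant are my own assumptions based on the sample lines, not something taken from simon or Sphinx:

```python
import re

# Assumption: a trailing "(n)" on the word marks an alternative pronunciation.
_VARIANT = re.compile(r"^(?P<word>.+?)\((?P<n>\d+)\)$")


def parse_dict_line(line):
    """Split one dictionary line into (word, variant_number, phonemes).

    Returns None for blank lines. Entries without a "(n)" suffix are
    treated as variant 1.
    """
    fields = line.split()
    if not fields:
        return None
    head, phonemes = fields[0], fields[1:]
    match = _VARIANT.match(head)
    if match:
        return match.group("word"), int(match.group("n")), phonemes
    return head, 1, phonemes


sample = """\
juridiction jj uu rr ii dd ii kk ss yy on
juridictionnelle jj uu rr ii dd ii kk ss yy oo nn ai ll
juridictionnelle(2) jj uu rr ii dd ii kk ss yy oo nn ai ll ee
"""

entries = [parse_dict_line(line) for line in sample.splitlines()]
for word, variant, phonemes in entries:
    print(word, variant, len(phonemes))
```

If every line of the file parses this way (one word followed by whitespace-separated phonemes), that at least supports the guess that it is a Sphinx-style dictionary.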
I assume that this is Sphinx format, but I am not certain. I will try to import this lexicon into simon as a Sphinx lexicon from the following location: /home/liberty/200908/words_dict. The import takes a few moments because it is a big dictionary. The import was successful, but some minor encoding problems showed up.
I know this problem. It is probably the result of a wrong encoding: ISO-8859-1 vs. UTF-8. It should be easy to solve. The dictionary could be opened with an editor such as Notepad++ and saved as UTF-8. Maybe that would work (I would have to try it).
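Why the two encodings clash can be shown with a short Python sketch. The word "préfet" is only an illustrative example that I picked (it is not taken from the dictionary), and is_valid_utf8 is a hypothetical helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative word with an accented character (my own example).
latin1_bytes = "préfet".encode("iso-8859-1")  # "é" becomes the single byte 0xE9
utf8_bytes = "préfet".encode("utf-8")         # "é" becomes the two bytes 0xC3 0xA9

print(is_valid_utf8(latin1_bytes))  # ISO-8859-1 bytes are rejected as UTF-8
print(is_valid_utf8(utf8_bytes))

# Reading ISO-8859-1 bytes as UTF-8 with replacement loses the accent.
broken = latin1_bytes.decode("utf-8", errors="replace")
print(broken)
```

A program that expects UTF-8 cannot interpret the lone byte 0xE9, so the accented letter ends up as a replacement character or some other wrong glyph.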
I have now opened the file /home/liberty/200908/words_dict with Geany. The encoding is ISO-8859-1. This was my first guess, and of course, I was right. Let’s take a closer look at how to fix that encoding issue:
1. The French dictionary that I downloaded (see the source at the beginning of this article) has the file name words_dict.
2. The encoding is ISO-8859-1.
3. Let’s try to set the encoding to UTF-8.
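The same conversion can also be done without an editor. The following is only a sketch, assuming the file really is ISO-8859-1 throughout; the function name convert_to_utf8 and the output file name words_dict.utf8 are hypothetical choices of mine:

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Re-encode a text file from source_encoding to UTF-8.

    Works on raw bytes so that the original line endings are kept.
    Returns the number of lines in the file.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return len(text.splitlines())


if __name__ == "__main__":
    source = Path("/home/liberty/200908/words_dict")
    if source.exists():
        # Write to a new file so the downloaded original stays untouched.
        count = convert_to_utf8(source, source.with_name("words_dict.utf8"))
        print(f"converted {count} lines")
```

Writing to a separate file instead of overwriting the original makes it easy to start over if the source encoding guess turns out to be wrong.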
After saving the file, I will try to import the lexicon again (of course, I will rename a specific simon folder beforehand). simon lets you select a specific encoding for the import.
I will stick to automatic encoding. Let’s see what the result of the import process will be. Have the encoding issues been solved? Yes, everything is OK now. I now have more than 100,000 French words that can be trained with simon.


