I downloaded the Japanese language model (20k words) from julius.sourceforge.jp. After extracting the packaged files, I opened the file /home/liberty/200908/japanese-models/lang_m/20k.htkdic with gedit. Here are a few lines (obviously shown with the wrong encoding, sorry about that):
¥Õ¥ë¡Œ¥Ä+¥Õ¥ë¡Œ¥Ä+2 [¥Õ¥ë¡Œ¥Ä] f u r u: ts u
¥Õ¥ë¡Œ¥È+¥Õ¥ë¡Œ¥È+2 [¥Õ¥ë¡Œ¥È] f u r u: t o
¥Õ¥ë¥»¥Ã¥È+¥Õ¥ë¥»¥Ã¥È+2 [¥Õ¥ë¥»¥Ã¥È] f u r u s e q t o
¥Õ¥ì¡Œ¥º+¥Õ¥ì¡Œ¥º+2 [¥Õ¥ì¡Œ¥º] f u r e: z u
This looks like HTK format, so it should be possible to import the file into simon. I imported it as an HTK dictionary:
1. I opened the file 20k.htkdic. It contains about 20k Japanese words. The characters aren’t displayed correctly, but that is not important to me because I don’t understand Japanese. The file looks like a typical dictionary file in HTK format.
2. Press Import Dictionary.
3. Select HTK Lexicon.
4. Press the Next > button.
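For anyone who wants to check the file before importing it, here is a minimal Python sketch of how one entry could be split into its parts. It assumes every line has the three-part shape seen in the sample lines (a word name, an output string in square brackets, then space-separated phonemes). The names `DictEntry` and `parse_entry` are made up for illustration and are not part of simon, Julius, or HTK. The sample string is the first line of the excerpt decoded as EUC-JP.

```python
import re
from typing import List, NamedTuple


class DictEntry(NamedTuple):
    """One dictionary line (hypothetical helper type, not a simon or HTK API)."""
    word: str            # word name, e.g. "フルーツ+フルーツ+2"
    output: str          # text inside the square brackets
    phonemes: List[str]  # space-separated phoneme symbols


# Assumed line shape, taken from the sample lines: WORD [OUTPUT] PHONEMES...
ENTRY_RE = re.compile(r"^(\S+)\s+\[([^\]]*)\]\s+(.+)$")


def parse_entry(line: str) -> DictEntry:
    """Split one dictionary line into word name, output string and phonemes."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"line does not match the assumed format: {line!r}")
    word, output, phones = match.groups()
    return DictEntry(word, output, phones.split())


if __name__ == "__main__":
    entry = parse_entry("フルーツ+フルーツ+2 [フルーツ] f u r u: ts u")
    print(entry.output, entry.phonemes)
```

Lines that do not fit the assumed shape raise a `ValueError`, which makes it easy to spot entries the sketch does not cover.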
First, I imported the Japanese dictionary into the shadow dictionary with automatic encoding detection. This didn’t work. Why is automatic detection offered when simon doesn’t recognize the correct encoding automatically?
But no problem. I took a look at the encoding of the website julius.sourceforge.jp. It is EUC-JP. So I chose the EUC-JP encoding in the dictionary import wizard.
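The garbled lines from gedit are consistent with EUC-JP bytes being displayed in a Western single-byte encoding. The Python sketch below illustrates this. It assumes the display encoding was ISO-8859-15, which is only a guess based on the "Œ" character in the excerpt, not something gedit reported. The function names are hypothetical helpers and have nothing to do with simon's own import code.

```python
from typing import List


def show_as(text: str, actual: str = "euc_jp", assumed: str = "iso8859_15") -> str:
    """Reproduce the garbling: encode with the real encoding, misread with another."""
    return text.encode(actual).decode(assumed)


def repair(garbled: str, assumed: str = "iso8859_15", actual: str = "euc_jp") -> str:
    """Undo the garbling by reversing the two steps of show_as()."""
    return garbled.encode(assumed).decode(actual)


def read_dictionary(path: str, encoding: str = "euc_jp") -> List[str]:
    """Read the non-empty lines of a dictionary file with an explicit encoding."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


if __name__ == "__main__":
    garbled = show_as("フルーツ")
    print(garbled)          # the word as a Western-encoding editor would show it
    print(repair(garbled))  # the original katakana again
```

Passing the encoding explicitly, as in `read_dictionary("20k.htkdic")`, avoids depending on whatever default the editor or the operating system picks.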
You can now train simon with Japanese words. It should work.
About 20k words are available. It would be nice if someone tried simon together with this Japanese dictionary. There shouldn’t be any major problems, because the Japanese dictionary works with Julius and simon makes use of Julius, too. The only remaining problem might be the encoding.
I hope that simon will be of use for Asian people as well. My personal focus is on the Western languages English, German, French, and Spanish. But I want to show that the concept can be used for Asian languages as well.



The “automatic” detection uses KEncodingDetector. Once that improves, the automatic detection will improve accordingly.
Greetings,
Peter
You know that I am interested in learning about the internal architecture of simon. So thanks for the info.