This blog post is about (A) the Julius dictionary format and (B) importing a PLS dictionary.
A. Evidently, it is possible to import a Julius dictionary:
I didn’t know that this kind of dictionary existed. What are the properties of this format, and what are its advantages?
B. I want to import Ralf's German dictionary (version 0.1.7; October 29, 2009). Great, simon now recognizes the role attribute:
1. A few minutes ago, I imported Ralf's German dictionary into simon. I am offering 27 PLS dictionaries for 27 different languages. Choose the dictionary that suits your native language, and import it into simon.
2. Let’s take a look at the shadow dictionary.
3. The word kernchemischen is an Adjektiv. Let’s take a look at the specific entry in Ralf's German dictionary:
<lexeme role="Adjektiv">
<grapheme>kernchemischen</grapheme>
<phoneme>kɛʀnçeːmɪʃən</phoneme>
</lexeme>
You can see that the role attribute which is part of the <lexeme> element was imported by simon. Thanks for implementing that feature.
4. The word Kerndurchmesser is a Substantiv. The corresponding entry in the PLS dictionary:
<lexeme role="Substantiv">
<grapheme>Kerndurchmesser</grapheme>
<phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
</lexeme>
You can see the strength of the simon import process: the last two letters of Kerndurchmesser (“er”) correspond to one single phoneme, the final ɐ in kɛʀndʊʀçmɛsɐ. Because such details are implemented, we can get a very good recognition rate, as I showed in the video with 200 German words.
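To make the structure of these entries concrete, here is a minimal Python sketch that reads lexemes and their role attribute from a PLS document. It is not simon's import code. The `<lexicon>` wrapper, the namespace declaration and the helper name `read_lexemes` are my assumptions for illustration; the two lexemes are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# Namespace used by PLS 1.0 documents (assumed to be declared on the root element).
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Illustrative sample: the two entries quoted in this post, wrapped in a root element.
SAMPLE = """<lexicon version="1.0" alphabet="ipa" xml:lang="de"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme role="Adjektiv">
    <grapheme>kernchemischen</grapheme>
    <phoneme>kɛʀnçeːmɪʃən</phoneme>
  </lexeme>
  <lexeme role="Substantiv">
    <grapheme>Kerndurchmesser</grapheme>
    <phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
  </lexeme>
</lexicon>"""


def read_lexemes(xml_text):
    """Return a list of (grapheme, phoneme, role) tuples; role is None if absent."""
    ns = {"pls": PLS_NS}
    root = ET.fromstring(xml_text)
    entries = []
    for lexeme in root.findall("pls:lexeme", ns):
        grapheme = lexeme.findtext("pls:grapheme", default="", namespaces=ns)
        phoneme = lexeme.findtext("pls:phoneme", default="", namespaces=ns)
        entries.append((grapheme, phoneme, lexeme.get("role")))
    return entries


if __name__ == "__main__":
    for grapheme, phoneme, role in read_lexemes(SAMPLE):
        print(grapheme, phoneme, role)
```

Each tuple carries the written word, its IPA pronunciation and the role, which is the information an importer needs in order to assign the word to a category.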
Why is Ralf's German dictionary good? Let me explain about the history of this dictionary:
a. The initial steps were done at Voxforge with the development of a German pronunciation dictionary. You can convince yourself: the script espeak2Phones.pl is great because it transforms eSpeak’s cryptic ASCII output into SAMPA. This approach is good for the German language.
b. Later, we used the dictionary acquisition project to collect about 8,000 pronunciations. Every single pronunciation was checked by a human. The phoneme concept follows the Wiktionary.
c. I used a German spelling dictionary from OpenOffice.org to get more words for the dictionary (Ubuntu terminal command: unmunch). With eSpeak I created the phonemes. With an XSLT stylesheet (Ubuntu terminal command: saxonb-xslt) I transformed the eSpeak phonemes into IPA phonemes. And I used the XSLT stylesheet for inserting the role attribute (Substantiv, Adjektiv, Zahlwort).
d. The result is the current version of Ralf's German dictionary. It would be nice if someone would help with the improvement. The really difficult work has been done. But it is necessary to fine-tune the dictionary. Let me give you a concrete example:
<lexeme>
<grapheme>stromsparen</grapheme>
<phoneme>ʃtʀɔmʃpaːʀən</phoneme>
</lexeme>
<lexeme>
<grapheme>stromsparend</grapheme>
<phoneme>ʃtʀɔmspaːʀənt</phoneme>
</lexeme>
What is wrong or could be improved? First, the role attribute is missing. stromsparen is a Verb. stromsparend is an Adverb. It would be good if someone added the missing role attributes. Second, small phoneme corrections are necessary: ʃtʀɔmʃpaːʀən is OK because you say “schtromschparen”. But ʃtʀɔmspaːʀənt is wrong because it contains “sp” instead of “ʃp”, and you don’t say “schtromsparent”.
You can see that improvements are necessary. Because Ralf's German dictionary is GPLv3, everyone is permitted to improve it.
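If you want to help, a first step could be to list every entry that still lacks a role attribute. The following Python sketch shows one way to do that. The `<lexicon>` wrapper and the helper names `local_name` and `graphemes_without_role` are hypothetical and only serve as an illustration; the entries are taken from this post.

```python
import xml.etree.ElementTree as ET

# Illustrative sample built from entries quoted in this post.
SAMPLE = """<lexicon>
  <lexeme role="Substantiv">
    <grapheme>Kerndurchmesser</grapheme>
    <phoneme>kɛʀndʊʀçmɛsɐ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>stromsparen</grapheme>
    <phoneme>ʃtʀɔmʃpaːʀən</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>stromsparend</grapheme>
    <phoneme>ʃtʀɔmspaːʀənt</phoneme>
  </lexeme>
</lexicon>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def graphemes_without_role(xml_text):
    """Return the graphemes of all lexemes that have no (or an empty) role attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for element in root.iter():
        if local_name(element.tag) != "lexeme" or element.get("role"):
            continue
        for child in element:
            if local_name(child.tag) == "grapheme":
                missing.append(child.text or "")
    return missing


if __name__ == "__main__":
    print(graphemes_without_role(SAMPLE))
```

Because the tag names are compared without their namespace prefix, the same function should work on the bare snippets shown here and on a complete PLS file that declares the PLS namespace.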
For good recognition results, things like “schtromsparent” have to be fixed. It is possible that speakers of some dialects (e.g. in Hamburg) say “stromsparen” and not “schtromschparen”. I recommend that specific dialect dictionaries be developed. This would be part of the fine-tuning, too. Ralf's German dictionary covers Standard German. You can use my dictionary as the basis for a dialect dictionary for people who prefer to dictate in their own dialect.
By the way, did you notice the following detail? ʃtʀɔmspaːʀənt ends with a “t” and not with a “d” because of the “Auslautverhärtung” (which is part of the German pronunciation). Such small details are implemented in the dictionary.
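A rule like this can also be used to search for suspicious entries. The sketch below is a deliberately simplified check and an assumption of mine, not part of the dictionary tooling: it only looks at words written with a final “b” or “d” and tests whether the IPA transcription ends with “p” or “t”. The name `violates_final_devoicing` is hypothetical.

```python
# Written word-final voiced stop -> expected devoiced IPA sound.
# Deliberately incomplete: final "g" is left out because the ending "-ig"
# is usually pronounced [ɪç] in Standard German and needs separate handling.
FINAL_DEVOICING = {"b": "p", "d": "t"}


def violates_final_devoicing(grapheme, phoneme):
    """Return True if the word ends in written b/d but the IPA does not end in p/t."""
    expected = FINAL_DEVOICING.get(grapheme[-1:].lower())
    if expected is None:
        return False  # the rule does not apply to this word ending
    return not phoneme.endswith(expected)


if __name__ == "__main__":
    # Entry from this post: the final "d" is transcribed as "t", so no violation.
    print(violates_final_devoicing("stromsparend", "ʃtʀɔmspaːʀənt"))
```

Entries flagged by such a check would still have to be reviewed by a human, because loanwords and compounds can behave differently.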
C. Conclusion: I know about the strengths of PLS, but I don’t know which advantages a Julius dictionary would have to offer.
Tags: Auslautverhärtung, German, kɛʀnçeːmɪʃən, kɛʀndʊʀçmɛsɐ, PLS, ʃtʀɔmʃpaːʀən


I am not too sure if “Julius dictionary” is the correct description, either.
You might know that simon uses Julius for the recognition. The Julius vocabulary used for this system is now generated from the active scenarios but was – up to 0.2 – stored in the original format that Julius uses (have a look at older model.voca files). The format is quite simple:
% Terminal
Word\tPronunciation
Word2\tPronunciation2
% Terminal2
Word\tPronunciation
I implemented the import feature to make it possible to “restore” the word list of an older simon version so that users upgrading from 0.2 don’t have to re-enter the words manually. They can instead import the old file ~/.kde/share/apps/simon/model/model.voca.
Of course, importing the lexicon (~/.kde/share/apps/simon/model/lexicon) as an HTK lexicon would work too, but the terminal information would be lost, as an HTK dictionary does not contain that information.
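To illustrate the layout described above, here is a minimal Python sketch that groups the words of a .voca file by terminal. It is not simon's import code. The function name `parse_voca` is hypothetical, the sample reuses the placeholder entries from above, and I assume that the word and its pronunciation are separated by a tab (other whitespace is tolerated as well).

```python
# Placeholder content following the layout shown above.
SAMPLE = (
    "% Terminal\n"
    "Word\tPronunciation\n"
    "Word2\tPronunciation2\n"
    "% Terminal2\n"
    "Word\tPronunciation\n"
)


def parse_voca(text):
    """Return {terminal: [(word, pronunciation), ...]} for a .voca-style text."""
    vocabulary = {}
    terminal = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith("%"):
            # A line starting with '%' opens a new terminal section.
            terminal = line[1:].strip()
            vocabulary.setdefault(terminal, [])
            continue
        if terminal is None:
            raise ValueError("word entry found before the first terminal line")
        # The first whitespace run separates the word from its pronunciation.
        parts = line.split(None, 1)
        word = parts[0]
        pronunciation = parts[1] if len(parts) > 1 else ""
        vocabulary[terminal].append((word, pronunciation))
    return vocabulary


if __name__ == "__main__":
    for terminal, words in parse_voca(SAMPLE).items():
        print(terminal, words)
```

The result keeps the terminal of every word, which is exactly the information that a plain word-to-pronunciation list cannot carry.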
Greetings,
Peter
OK, I understand. Thanks for the explanation. Maybe I can try the following:
- import model.voca from my backup file as Julius dictionary (as active dictionary);
- import Ralf's German dictionary as PLS dictionary (as shadow dictionary), or alternatively import shadow.voca as Julius dictionary (as shadow dictionary).
This makes sense. Then I would have to import my prompts file and the corresponding training.data folder via the import training samples wizard.
I hope that I will learn how to import the information that is stored in my backup folder. I think that I will try it again.