Improving Ralf’s Hungarian dictionary

Because somebody showed interest in Ralf's Hungarian dictionary, I want to improve it. Here is what I do:

1. I convert Ralf's Hungarian dictionary (version 0.1; April 14, 2010) with the style-sheet prepare-hungarian-for-espeak.xsl into the SSML file hungarian-ssml.xml via the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ saxonb-xslt -ext:on -s:hungarian-dictionary.xml -xsl:'prepare-hungarian-for-espeak.xsl' -o:hungarian-ssml.xml

I will use the output file hungarian-ssml.xml as the source file for eSpeak. The style-sheet prepare-hungarian-for-espeak.xsl filters out words that are problematic for eSpeak. Which Hungarian words are problematic? Those that contain a space (" "), a tab, or a hyphen (-). The filtering was done with the following lines:


(1) This line removes <grapheme> elements that contain a space.
(2) This line removes <grapheme> elements that contain a tab.
(3) This line removes <grapheme> elements that contain a hyphen.

Why am I doing this? I want a “clean” word list with no problematic characters (no spaces, no tabs, no hyphens).
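
The same filter can be sketched outside XSLT. The following Python snippet is only an illustration of the rule described above, not the content of prepare-hungarian-for-espeak.xsl; the function names and the sample words are made up for this example.

```python
# Characters that this post treats as problematic for eSpeak input.
PROBLEMATIC = (" ", "\t", "-")


def is_clean(word):
    """Return True if the word contains no space, tab, or hyphen."""
    return not any(ch in word for ch in PROBLEMATIC)


def filter_words(words):
    """Keep only the clean words, preserving their order."""
    return [w for w in words if is_clean(w)]


if __name__ == "__main__":
    # Hypothetical sample entries, not taken from the dictionary.
    sample = ["alma", "jó napot", "e-mail", "a\tb", "körte"]
    print(filter_words(sample))
```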

2. Now I have the file hungarian-ssml.xml that I will use as source file for eSpeak. I invoke eSpeak via the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ espeak -f hungarian-ssml.xml -m -v hu -q -x --phonout="hungarian-espeak"

Let me explain the options of the espeak command:
-f [file] names the source file, here hungarian-ssml.xml.
-m [mark-up] tells eSpeak to interpret the SSML mark-up in the source file (in my file, each word is enclosed in an <audio> element).
-v [voice] selects the language that eSpeak has to use. The ISO 639-1 language code for Hungarian is hu.
-q [quiet] means that eSpeak does not speak the words aloud.
-x tells eSpeak to output the phoneme mnemonics.
--phonout="hungarian-espeak" names the file that receives the phoneme output.

Now you know how you can use eSpeak for phoneme generation.
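
If you want to call eSpeak from a script instead of typing the command, a minimal Python sketch could look like this. It only reassembles the options explained above; the function names are my own for this example, and run_espeak assumes that the espeak program is installed and on the PATH.

```python
import subprocess


def build_espeak_command(source, voice, phoneme_file):
    """Assemble the argument list for the espeak call shown above."""
    return [
        "espeak",
        "-f", source,                 # read the text from this file
        "-m",                         # interpret SSML mark-up in the input
        "-v", voice,                  # language/voice, e.g. "hu"
        "-q",                         # quiet: do not speak
        "-x",                         # output phoneme mnemonics
        "--phonout=" + phoneme_file,  # send the phoneme output to this file
    ]


def run_espeak(source, voice, phoneme_file):
    """Run espeak; raises CalledProcessError if espeak exits with an error."""
    subprocess.run(build_espeak_command(source, voice, phoneme_file), check=True)


if __name__ == "__main__":
    # Print the command only; call run_espeak(...) to actually execute it.
    print(" ".join(build_espeak_command("hungarian-ssml.xml", "hu", "hungarian-espeak")))
```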

3. I have to edit the file hungarian-espeak. I have two options:
a. Edit hungarian-espeak only with gedit.
b. Add <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>) of hungarian-espeak with gedit. Then run an .xsl style-sheet for the conversion into an .xml file.

I will use option (a).

4. Here is how I edit hungarian-espeak:


(a) I opened hungarian-espeak with gedit.
(b) I added a <lexicon> tag at the beginning of the file.
(c) Here you can see the Hungarian eSpeak phonemes. The Hungarian words are separated by an empty line from each other.
(d) I search for the sequence "\n\n " (backslash-n, backslash-n, space).
(e) This sequence is replaced by the expression "</phoneme>\n<phoneme>".
(f) Press Replace All. I have to wait a few minutes. Obviously, this approach (search and replace with gedit) is pretty slow. If I had done the replacement with saxonb-xslt, it would have been faster.
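
The replacement from steps (d) and (e) can also be sketched in Python. This is an illustration of the same search/replace, not what I actually ran; the function name and the placeholder strings AAA, BBB, CCC are made up.

```python
def wrap_phonemes(text):
    """Replace every 'blank line + space' separator with closing/opening tags."""
    return text.replace("\n\n ", "</phoneme>\n<phoneme>")


if __name__ == "__main__":
    # Placeholder entries standing in for eSpeak phoneme strings.
    sample = "AAA\n\n BBB\n\n CCC"
    print(wrap_phonemes(sample))
```

Note that this replacement only touches the separators between entries. The opening tag before the first entry and the closing tag after the last entry still have to be added by hand.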

5. After making some minor changes with gedit to the files hungarian-ssml.xml (this file contains the future <grapheme> elements) and hungarian-espeak (this file contains the future <phoneme> elements), it is time to combine them with paste. I invoke the paste command via the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ paste hungarian-ssml.xml hungarian-espeak > hungarian-dictionary-with-espeak-phonemes.xml

The problem is that the two source files (hungarian-ssml.xml and hungarian-espeak) do not have the same number of lines. The file hungarian-ssml.xml has 77642 lines. The file hungarian-espeak has 77643 lines. This means that I will have to run the paste command again after fixing the issue. I found the issue:


The <audio> element below the word Zürich contains an opening bracket ([). In my experience, eSpeak handles special characters differently from ordinary letters. This is the reason why I had filtered out words with problematic characters (words with tabs, spaces, or hyphens). I didn’t filter out words that contain an opening bracket because I didn’t expect such a word.

As you can see, finding such an issue is time-consuming. This was the reason why I applied the style-sheet prepare-hungarian-for-espeak.xsl in the first step. The Hungarian word list had to be as clean as possible. Thanks to XSLT the filtering can be done easily. You could probably get the same result with the grep command. But why use grep when you can use an elegant XSLT style-sheet?
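
To find such unexpected entries before running eSpeak, one could scan the word list for anything that is not made up of letters only. Here is a minimal Python sketch under that assumption; the function name and the sample list are hypothetical, and the check is stricter than the three filters in my style-sheet.

```python
def find_suspicious(words):
    """Return (line_number, word) pairs for entries that are not letters only."""
    hits = []
    for number, word in enumerate(words, start=1):
        # str.isalpha() accepts accented letters such as ü or ő,
        # and rejects brackets, hyphens, spaces, and empty strings.
        if not word.isalpha():
            hits.append((number, word))
    return hits


if __name__ == "__main__":
    # Hypothetical sample, modelled on the issue described above.
    sample = ["alma", "Zürich", "[", "e-mail"]
    for number, word in find_suspicious(sample):
        print(number, repr(word))
```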

6. I fixed the issue with the <audio>[</audio> element. Both source files (hungarian-ssml.xml and hungarian-espeak) now have the same number of lines. I try the paste command again:

$ paste hungarian-ssml.xml hungarian-espeak > hungarian-dictionary-with-espeak-phonemes.xml

The output file hungarian-dictionary-with-espeak-phonemes.xml isn’t a valid XML file at the moment. I will do some search/replace operations with gedit so that the file becomes a valid XML file (to be more specific: a PLS dictionary file).
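
The pairing that paste performs can be sketched in Python with an explicit check of the line counts. The check is my addition for this example: the real paste command does not stop when the files differ in length, which is why the mismatch above was easy to miss. The function name and sample lines are made up.

```python
def paste_lines(left, right):
    """Join two lists of lines pairwise with a tab, as paste does for two files."""
    if len(left) != len(right):
        # Fail early instead of silently producing misaligned pairs.
        raise ValueError(f"line counts differ: {len(left)} vs {len(right)}")
    return [a + "\t" + b for a, b in zip(left, right)]


if __name__ == "__main__":
    # Placeholder lines standing in for grapheme and phoneme entries.
    graphemes = ["<grapheme>alma</grapheme>", "<grapheme>körte</grapheme>"]
    phonemes = ["<phoneme>AAA</phoneme>", "<phoneme>BBB</phoneme>"]
    for line in paste_lines(graphemes, phonemes):
        print(line)
```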

7. Now let’s transform the eSpeak phonemes in the PLS dictionary hungarian-dictionary-with-espeak-phonemes.xml into IPA phonemes. This can be done via the Ubuntu terminal. But first, I have to make some major changes to the style-sheet improve-hungarian-dictionary.xsl.

8. I have to find out which Hungarian phonemes eSpeak employs. I downloaded the eSpeak source code and looked at the Hungarian *_rules and *_list files. But this didn’t help me.

9. Next, I take a look at a Hungarian pronunciation guide. I am using the Wikipedia pronunciation table.

10. I am now “finished” with the .xsl style-sheet improve-hungarian-dictionary.xsl. Please take a look at my eSpeak-to-IPA conversion rules:


Are these rules OK? Do you find major or minor mistakes? I marked the phoneme ɟ (IPA-number 108; voiced palatal plosive). From my point of view, this phoneme should be added to the simon PLS import process because it is of relevance for several languages.
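
For readers who want to experiment with such rules outside XSLT, here is a minimal Python sketch of a table-driven conversion. The four table entries are assumptions for illustration only, not the rules from improve-hungarian-dictionary.xsl, and should be checked against the eSpeak phoneme documentation. The point of the sketch is the matching order: longer mnemonics such as tS must be tried before shorter ones such as S.

```python
# Illustrative entries only (assumptions, not the rules from the style-sheet).
ESPEAK_TO_IPA = {
    "tS": "tʃ",
    "S": "ʃ",
    "Z": "ʒ",
    "J": "ɟ",
}


def to_ipa(mnemonics, table=ESPEAK_TO_IPA):
    """Convert a mnemonic string to IPA, trying the longest mnemonic first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(mnemonics):
        for key in keys:
            if mnemonics.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No rule matched: copy the symbol unchanged so it stays visible.
            out.append(mnemonics[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_ipa("tSa"))
```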

11. I generate the PLS dictionary via the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ saxonb-xslt -ext:on -s:hungarian-dictionary-with-espeak-phonemes.xml -xsl:'' -o:hungarian-0.1.1.xml

12. Download Ralf's Hungarian dictionary (version 0.1.1; April 24, 2010), and import it into simon. Please make suggestions for improvements.
