Because somebody showed interest in
Ralf's Hungarian dictionary, I want to improve it. Here is what I do:
1. I convert Ralf's Hungarian dictionary (version 0.1; April 14, 2010) with the style sheet prepare-hungarian-for-espeak.xsl into the file hungarian-ssml.xml via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ saxonb-xslt -ext:on -s:hungarian-dictionary.xml -xsl:'prepare-hungarian-for-espeak.xsl' -o:hungarian-ssml.xml
I will use the output file hungarian-ssml.xml as the source file for eSpeak. With prepare-hungarian-for-espeak.xsl I filtered out problematic words. Which Hungarian words were problematic for eSpeak? Problematic words are words that contain a space (" "), a tab, or a hyphen (-). The filtering was done with the following lines:
(1) This line removes <grapheme> elements that contain a space.
(2) This line removes <grapheme> elements that contain a tab.
(3) This line removes <grapheme> elements that contain a hyphen.
Why am I doing this? I want a “clean” word list with no problematic characters (no spaces, no tabs, no hyphens).
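The same filtering idea can be sketched outside XSLT. The following Python snippet is only an illustration, not the style sheet prepare-hungarian-for-espeak.xsl itself: the function names and the sample lexicon are made up, and it assumes that every word sits in a <grapheme> element.

```python
import xml.etree.ElementTree as ET

# Characters that the text above treats as problematic for eSpeak.
PROBLEMATIC = (" ", "\t", "-")


def is_clean(word):
    """Return True if the word contains no space, tab, or hyphen."""
    return not any(ch in word for ch in PROBLEMATIC)


def clean_graphemes(xml_text):
    """Collect the text of all <grapheme> elements that pass is_clean."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        # Compare the local name so this works with or without a namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "grapheme" and element.text:
            if is_clean(element.text):
                words.append(element.text)
    return words


# Hypothetical sample data, not taken from the real dictionary.
sample = """<lexicon>
  <lexeme><grapheme>alma</grapheme></lexeme>
  <lexeme><grapheme>New York</grapheme></lexeme>
  <lexeme><grapheme>e-mail</grapheme></lexeme>
  <lexeme><grapheme>Zürich</grapheme></lexeme>
</lexicon>"""

print(clean_graphemes(sample))
```

Only the entries without a space, tab, or hyphen survive, which is the “clean” word list described above.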
2. Now I have the file hungarian-ssml.xml, which I will use as the source file for eSpeak. I invoke eSpeak via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ espeak -f hungarian-ssml.xml -m -v hu -q -x --phonout="hungarian-espeak"
Let me explain the details of the espeak command:
-f [file] indicates the source file
-m [mark-up] indicates that the source file is an
SSML file (this means that each word is enclosed in an
-v [voice] indicates which language eSpeak has to use. The ISO 639-1 language tag for Hungarian is hu.
-q [quiet] means that eSpeak does not speak the words aloud.
-x means that eSpeak should output phonemes.
--phonout="hungarian-espeak" indicates the output file.
Now you know how you can use
eSpeak for phoneme generation.
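If you want to call eSpeak from a script instead of typing the command, a small wrapper could look like the sketch below. It uses only the options explained above; the function names are my own invention, and run_espeak assumes that the espeak program is installed and on the PATH.

```python
import shutil
import subprocess


def build_espeak_command(source, voice="hu", phonout="hungarian-espeak"):
    """Build the argument list for the espeak call described above."""
    return [
        "espeak",
        "-f", source,            # source file
        "-m",                    # source file contains SSML mark-up
        "-v", voice,             # language / voice
        "-q",                    # quiet: do not speak aloud
        "-x",                    # output phonemes
        "--phonout=" + phonout,  # file that receives the phonemes
    ]


def run_espeak(source, voice="hu", phonout="hungarian-espeak"):
    """Run espeak; raises if espeak is missing or exits with an error."""
    if shutil.which("espeak") is None:
        raise FileNotFoundError("espeak was not found on the PATH")
    subprocess.run(build_espeak_command(source, voice, phonout), check=True)


print(" ".join(build_espeak_command("hungarian-ssml.xml")))
```

Passing the arguments as a list avoids quoting problems with file names.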
3. I have to edit the file hungarian-espeak. I have two options:
(a) Edit hungarian-espeak only with gedit, adding <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>).
(b) Edit hungarian-espeak with gedit, then run an .xsl style sheet for the conversion.
I will use option (a).
4. Here is how I edit hungarian-espeak:
(a) I opened hungarian-espeak with gedit.
(b) I added a
<lexicon> tag at the beginning of the file.
(c) Here you can see the Hungarian eSpeak phonemes. The Hungarian words are separated by an empty line from each other.
(d) I will search for the sequence "\n\n " (backslash-n, backslash-n, space).
(e) This sequence is replaced by the replacement expression using Replace All. This takes a few minutes: search and replace with gedit is pretty slow. If I had chosen to do the replacement with saxonb-xslt, it would have been faster.
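A scripted search and replace is another way to avoid the waiting time in gedit. The sketch below is an assumption on my part, not the procedure I actually used: the function names are invented, and the replacement string is left as a parameter, so the "|" in the example is only a placeholder.

```python
def replace_separator(text, replacement, separator="\n\n "):
    """Replace every occurrence of the separator sequence in the text."""
    return text.replace(separator, replacement)


def replace_in_file(path_in, path_out, replacement, separator="\n\n "):
    """Read path_in, replace the separator, and write the result to path_out."""
    with open(path_in, encoding="utf-8") as handle:
        text = handle.read()
    with open(path_out, "w", encoding="utf-8") as handle:
        handle.write(replace_separator(text, replacement, separator))


# Placeholder entries separated by an empty line and a leading space.
sample = " one\n\n two\n\n three"
print(replace_separator(sample, "|"))
```

For a file with tens of thousands of lines, this kind of replacement typically finishes almost immediately.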
5. After making some minor changes with gedit to the files hungarian-ssml.xml (this file contains the future <grapheme> elements) and hungarian-espeak (this file contains the future <phoneme> elements), it is time to combine them with paste. I invoke the paste command via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ paste hungarian-ssml.xml hungarian-espeak > hungarian-dictionary-with-espeak-phonemes.xml
The problem is that the two source files (hungarian-ssml.xml and hungarian-espeak) don’t have the same number of lines. The file hungarian-ssml.xml has 77642 lines. The file hungarian-espeak has 77643 lines. This means that I will have to run the paste command again after fixing the issue. I found the issue:
The value of the <audio> element under the word Zürich contains an opening bracket ([). In my experience, eSpeak handles special characters differently from ordinary letters. This is why I had filtered out words with problematic characters (words with tabs, spaces, or hyphens). I didn’t filter out words that contain an opening bracket because I didn’t expect such a word.
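An unexpected character like this one can be found faster with a small check before running eSpeak. The sketch below is only an illustration with an invented function name; it flags every entry that is not made of letters only, which is stricter than my three filter rules.

```python
def find_suspicious(words):
    """Return (line_number, word) pairs for entries with non-letter characters.

    Empty entries are flagged as well, because "".isalpha() is False.
    """
    return [
        (number, word)
        for number, word in enumerate(words, start=1)
        if not word.isalpha()
    ]


# Hypothetical entries; accented letters such as "ü" count as letters.
entries = ["alma", "Zürich", "[", "e-mail"]
for number, word in find_suspicious(entries):
    print(number, repr(word))
```

The reported line numbers point directly at the entries that need a closer look.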
As you can see, it is time consuming to find such an issue. This was the reason why I applied the style sheet prepare-hungarian-for-espeak.xsl in the first step. The Hungarian word list had to be as clean as possible. With XSLT the filtering can be done easily. You could probably get the same result with the grep command. But why use grep when you can use an elegant XSLT style sheet?
6. I fixed the issue with the <audio>[</audio> element. Both source files (hungarian-ssml.xml and hungarian-espeak) now have the same length. I try the paste command again:
$ paste hungarian-ssml.xml hungarian-espeak > hungarian-dictionary-with-espeak-phonemes.xml
The output file hungarian-dictionary-with-espeak-phonemes.xml isn’t a valid XML file at the moment. I will do some search/replace operations with gedit so that the file becomes a valid XML file (to be more specific: a PLS dictionary file).
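The paste step above silently accepts files of different lengths, which is how the off-by-one problem went unnoticed at first. A minimal sketch of a stricter line-by-line merge, with an invented function name, could look like this:

```python
def paste_lines(left_lines, right_lines, delimiter="\t"):
    """Join corresponding lines like paste, but refuse unequal lengths."""
    if len(left_lines) != len(right_lines):
        raise ValueError(
            "line counts differ: %d vs %d" % (len(left_lines), len(right_lines))
        )
    return [
        left + delimiter + right
        for left, right in zip(left_lines, right_lines)
    ]


# Placeholder lines standing in for the grapheme and phoneme files.
print(paste_lines(["g1", "g2"], ["p1", "p2"]))
```

Like paste, the sketch uses a tab as the default delimiter. With 77642 and 77643 lines it would have stopped with an error instead of producing a shifted dictionary.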
7. Now let’s transform the eSpeak phonemes into IPA phonemes. This can be done via the Ubuntu terminal. But first, I have to make some major changes to the style sheet.
10. I am now finished with improve-hungarian-dictionary.xsl. Please take a look at my eSpeak-to-IPA conversion rules:
Are these rules OK? Do you find major or minor mistakes? I marked the phoneme ɟ (IPA number 108; voiced palatal plosive). From my point of view, this phoneme should be added to the simon PLS import process because it is relevant for several languages.
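For readers who prefer code to a rule table, the general technique behind such a conversion can be sketched as a longest-match-first replacement. This is not the content of improve-hungarian-dictionary.xsl: the function name is invented, and the three rules are illustrative placeholders rather than my actual Hungarian rules.

```python
def convert(phonemes, rules):
    """Rewrite a phoneme string, always trying the longest rule first."""
    # Ignore empty keys; they would never advance the position.
    keys = sorted((key for key in rules if key), key=len, reverse=True)
    result = []
    position = 0
    while position < len(phonemes):
        for key in keys:
            if phonemes.startswith(key, position):
                result.append(rules[key])
                position += len(key)
                break
        else:
            # No rule matched: copy the character unchanged.
            result.append(phonemes[position])
            position += 1
    return "".join(result)


# Illustrative placeholder rules only, not the rules from the style sheet.
ILLUSTRATIVE_RULES = {"tS": "tʃ", "S": "ʃ", "Z": "ʒ"}

print(convert("tSa", ILLUSTRATIVE_RULES))
```

Trying the longest rule first matters because a two-character symbol such as "tS" must not be split into "t" followed by the rule for "S".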
11. I generate the PLS dictionary via the Ubuntu terminal:
am3msi@am3msi-desktop:~/Documents/201004/hungarian-dictionary$ saxonb-xslt -ext:on -s:hungarian-dictionary-with-espeak-phonemes.xml -xsl:'http://spirit.blau.in/simon/files/2010/04/improve-hungarian-dictionary.xsl' -o:hungarian-0.1.1.xml
The result is Ralf's Hungarian dictionary (version 0.1.1; April 24, 2010), which can be imported into simon. Please make suggestions for improvements.