Ralf’s Italian speech model

Wednesday, May 16th, 2012

Some words about the creation of Ralf’s Italian speech model.

1. I took a look at the Italian frequency list. It is licensed under the LGPL – very good.


Ralf’s Italian dictionary 0.1.2

Monday, May 17th, 2010

How I created Ralf's Italian dictionary version 0.1.2:

1. Make some adjustments to espeak2ipa.xsl (Ralf's eSpeak2IPA style-sheet).

2. Transform the eSpeak phonemes (Ralf's Italian dictionary version 0.1.1 contains eSpeak phonemes) into IPA phonemes via the Ubuntu terminal:

$ cat '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/italian/it_IT/italian-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'

Some explanations: The cat command outputs the content of Ralf's Italian dictionary 0.1.1, which is still compressed at that point. The pipe character "|" causes the output of the cat command to be used as input for the bunzip2 command. The decompressed output of bunzip2 is then used as input for saxonb-xslt, which reads it from standard input (-s:-).

3. Download Ralf's Italian dictionary 0.1.2, and import it into simon. Take a look at the shadow dictionary:

The word column contains 95192 Italian words. The pronunciation column contains the corresponding SAMPA transcriptions.

4. A native speaker could improve Ralf's Italian dictionary.
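The pipeline from step 2 can be tried on stand-in data. The following is only a minimal sketch: transform_stub is a hypothetical placeholder for the saxonb-xslt stage (it merely upper-cases its input), and the sample file is invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical stand-in for the saxonb-xslt stage: upper-cases its input.
transform_stub() {
    tr 'a-z' 'A-Z'
}

# Decompress a bzip2 file to standard output and pipe it into the next stage.
# bunzip2 -c leaves the compressed file on disk untouched.
decompress_and_transform() {
    bunzip2 -c "$1" | transform_stub
}

# Build a tiny compressed sample file (invented content) and run the pipeline.
demo_dir=$(mktemp -d)
printf '<word>ciao</word>\n' | bzip2 -c > "$demo_dir/sample.xml.bz2"
decompress_and_transform "$demo_dir/sample.xml.bz2"
```

Because bunzip2 -c can read the file itself, the leading cat is not strictly needed; whether that form is preferable is a matter of taste.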

Import 90,000 Italian words

Saturday, November 7th, 2009

You can import Ralf's Italian dictionary (version 0.1; GPLv3). Training with this dictionary is currently not recommended. Some notes about how I created this dictionary:

1. I got an Italian spelling dictionary.

2. The unmunch command produced more than 20 million Italian words. Because simon is not intended to handle very large lexicons, I decided to use the style-sheet create-graphemes-italian.xsl instead. This style-sheet removes the prefix/suffix information from the spelling dictionary it_IT.dic. The result was an SSML file with about 90,000 Italian words.

3. I generated the corresponding phonemes from the SSML file:

$ espeak -f italian-audio-o -m -v it -q -x --phonout="italian-espeak"

4. Then I combined the grapheme elements with the phoneme elements.

5. The last step was the conversion from eSpeak phonemes to IPA phonemes with the style-sheet espeak2perfectipa-italian.xsl. Here are some of the Italian specific conversions that are contained in the style-sheet:

replace($sierra, 'dZ:', 'ddʒ')
replace($sierra, 'ts:', 'ddz')
replace($sierra, 't:', 'tt')
replace($sierra, 'd:', 'dd')
replace($sierra, 's:', 'ss')
replace($sierra, 'b:', 'bb')
replace($sierra, 'k:', 'kk')
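
The effect of these rules can be reproduced outside XSLT. The sketch below is not the style-sheet itself: espeak_geminates_to_ipa is a hypothetical helper that applies the seven replacements listed above with sed, in the same order, and the input strings are invented for illustration rather than taken from real eSpeak output.

```shell
#!/bin/sh
# Hypothetical helper: applies the seven geminate rules in the listed order.
# 'ts:' is handled before 's:' so the longer pattern is not split up.
espeak_geminates_to_ipa() {
    sed -e 's/dZ:/ddʒ/g' \
        -e 's/ts:/ddz/g' \
        -e 's/t:/tt/g' \
        -e 's/d:/dd/g' \
        -e 's/s:/ss/g' \
        -e 's/b:/bb/g' \
        -e 's/k:/kk/g'
}

# Invented sample strings, one per line.
printf '%s\n' 'at:o' 'as:o' 'ats:o' | espeak_geminates_to_ipa
```

The order of the rules matters: if 's:' were replaced first, 'ts:' would already have been turned into 'tss' and its own rule would never match.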

I tried to follow the IPA for Italian. To make the dictionary work with simon (so that training produces reasonable results), the simon import process has to be adjusted. Effective training is currently not possible.
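
Going back to step 2: removing the prefix/suffix information from a spelling dictionary can be illustrated without XSLT. This is only a sketch under the assumption that it_IT.dic follows the usual Hunspell layout (an entry count on the first line, then one word per line with optional flags after a slash). strip_affix_flags is a hypothetical helper, the sample entries and flags are invented, and unlike create-graphemes-italian.xsl it emits plain words instead of SSML.

```shell
#!/bin/sh
# Hypothetical helper: reduces Hunspell-style .dic content to bare words.
strip_affix_flags() {
    # Drop the entry count on line 1, then cut everything from the first "/".
    sed -e '1d' -e 's|/.*$||'
}

# Invented sample in .dic layout: count line, then word/FLAGS entries.
printf '%s\n' '3' 'casa/S' 'bello/AO' 'ciao' | strip_affix_flags
```

Each word is kept exactly once in its base form, which is why the result stays near the size of the .dic file instead of growing to the millions of forms that unmunch expands.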

Now you know that an Italian pronunciation dictionary exists that you can import into simon.