Import 90.000 Italian words

You can import Ralf's Italian dictionary (version 0.1; GPLv3). Training with this dictionary is currently not recommended. Some notes on how I created this dictionary:

1. I got an Italian spelling dictionary.

2. The unmunch command produced more than 20 million Italian words. Because simon is not intended to handle very large lexicons, I decided to use the style-sheet create-graphemes-italian.xsl instead. This style-sheet removes the prefix/suffix information from the spelling dictionary it_IT.dic. The result was an SSML file with about 90.000 Italian words.

3. I generated the corresponding phonemes from the SSML file: $ espeak -f italian-audio-o -m -v it -q -x --phonout="italian-espeak"

4. Then I combined the grapheme elements with the phoneme elements.

5. The last step was the conversion from eSpeak phonemes to IPA phonemes with the style-sheet espeak2perfectipa-italian.xsl. Here are some of the Italian-specific conversions contained in the style-sheet:

replace($sierra, 'dZ:', 'ddʒ')
replace($sierra, 'ts:', 'ddz')
replace($sierra, 't:', 'tt')
replace($sierra, 'd:', 'dd')
replace($sierra, 's:', 'ss')
replace($sierra, 'b:', 'bb')
replace($sierra, 'k:', 'kk')
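
To illustrate what these conversions do, here is a minimal Python sketch that applies the same seven replacements in the same order. It is not the style-sheet itself: the names `GEMINATE_RULES` and `convert_geminates` are made up for this example, and the input strings in the usage note are invented, not real eSpeak output. The patterns contain no regular-expression metacharacters, so plain string replacement should behave the same way here as the XPath `replace()` calls.

```python
# Ordered (eSpeak, IPA) pairs copied from the list above.
# Order matters: 'ts:' has to be handled before 's:',
# otherwise 'ts:' would turn into 'tss'.
GEMINATE_RULES = [
    ("dZ:", "ddʒ"),
    ("ts:", "ddz"),
    ("t:", "tt"),
    ("d:", "dd"),
    ("s:", "ss"),
    ("b:", "bb"),
    ("k:", "kk"),
]


def convert_geminates(espeak_phonemes: str) -> str:
    """Rewrite eSpeak long-consonant markers as doubled IPA consonants."""
    result = espeak_phonemes
    for source, target in GEMINATE_RULES:
        # Literal replacement of every occurrence, one rule at a time.
        result = result.replace(source, target)
    return result
```

With these rules, an invented input such as `"at:o"` becomes `"atto"`, and `"ak:a"` becomes `"akka"`.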

I tried to follow the IPA for Italian. To make the dictionary work with simon (so that training produces reasonable results), the simon import process has to be adjusted. Effective training is currently not possible.

Now you know that an Italian pronunciation dictionary exists that you can import into simon.


One Response to “Import 90.000 Italian words”

  1. producer says:

    Ralf's Italian dictionary (version 0.1) is no longer available. I replaced it with version 0.1.1, which fixes a major issue concerning the grapheme elements. The phoneme elements now contain eSpeak phonemes instead of IPA phonemes, because I don’t want to introduce too many errors.

    Of course, the new version can be imported into simon, too. But training is not possible.