Import 1.7 million Latin pronunciations

You can import Ralf's Latin dictionary (version 0.1.1) into simon. It contains about 1.7 million Latin words. Some information about how I created the dictionary:

1. The Latin words were extracted from a Latin OpenOffice.org dictionary with the command:

$ unmunch la.dic la.aff > latin-wordlist

2. The phonemes were originally generated with eSpeak (German voice) using the command:

$ espeak -f latin-ssml -m -v de -q -x --phonout="espeak-latin"

This means that no Latin-specific pronunciation rules were applied: the words are transcribed as if they were German.

3. I used this style sheet to transform the eSpeak phonemes into IPA phonemes.
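Step 3 can be pictured with a small Python sketch. It is not the style sheet that was used for the dictionary, only an illustration of the idea: eSpeak phoneme mnemonics are replaced by IPA symbols (longest mnemonic first), and the result is wrapped in a PLS `lexeme` element. The mapping table is a deliberately tiny, assumed subset, the function names `espeak_to_ipa` and `pls_lexeme` are hypothetical, and the example input string is made up rather than real eSpeak output.

```python
from xml.sax.saxutils import escape

# Assumed, incomplete mapping from eSpeak mnemonics to IPA (illustration only).
ESPEAK_TO_IPA = {
    "a:": "aː",
    "e:": "eː",
    "i:": "iː",
    "o:": "oː",
    "u:": "uː",
    "@": "ə",
    "S": "ʃ",
    "N": "ŋ",
    "'": "ˈ",   # primary stress marker
    ",": "ˌ",   # secondary stress marker
}

# Try longer mnemonics first so that "a:" wins over a plain "a".
_KEYS = sorted(ESPEAK_TO_IPA, key=len, reverse=True)


def espeak_to_ipa(mnemonics: str) -> str:
    """Convert a string of eSpeak mnemonics to IPA using the table above."""
    out = []
    i = 0
    while i < len(mnemonics):
        for key in _KEYS:
            if mnemonics.startswith(key, i):
                out.append(ESPEAK_TO_IPA[key])
                i += len(key)
                break
        else:
            # Symbols not in the table are copied unchanged.
            out.append(mnemonics[i])
            i += 1
    return "".join(out)


def pls_lexeme(word: str, mnemonics: str) -> str:
    """Build one PLS <lexeme> entry for a word and its eSpeak transcription."""
    return "<lexeme><grapheme>{}</grapheme><phoneme>{}</phoneme></lexeme>".format(
        escape(word), escape(espeak_to_ipa(mnemonics))
    )


if __name__ == "__main__":
    # Hypothetical input, not verified eSpeak output.
    print(pls_lexeme("rosa", "r'o:za"))
```

A real conversion would need the complete phoneme table of the eSpeak voice in use, and the entries would go inside a PLS `lexicon` root element that declares `alphabet="ipa"`.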

Ralf's Latin dictionary is licensed under the GPLv3. Because of its size, importing the dictionary took about 15 minutes on my computer, so be patient when you import it.

It should be possible to train a few Latin words with simon. You need to pronounce the words as if they were ordinary German words (German accent; no Latin-specific vowel length).


5 Responses to “Import 1.7 million Latin pronunciations”

  1. Peter Grasch says:

    Please note, however, that simon was never intended to handle such large lexicons, and it will slow down the system _noticeably_ in many ways.

  2. producer says:

    Well, my next dictionary (Spanish) will be smaller: just about 850.000 words.

  3. Peter Grasch says:

    I really appreciate your work but I don’t think that this is the right approach for simon (in general there are of course many use cases).

    A dictionary containing 850.000 _automatically_created_ entries is worth _less_ than the “recipe” for building pronunciations on the fly (it retains less information).

    Because simon is interaction driven, it should be no problem to apply your procedure on demand to any word that needs to be added. This would leave the shadow dictionary for “special” words where this build recipe produces less than optimal results.

    Benefits:
    * Shadow lexicon is substantially smaller -> Performance
    * Shadow lexicon doesn’t have to anticipate all possible words / words not in shadow dictionary will still be transcribed -> Performance, Accuracy

  4. producer says:

    Hello Peter,

    I know about the performance issue. Unfortunately, I don’t know how to solve this issue:

    “In some situations the explicit specification of all the morphological variants of a word can lead to extremely large lexicons. A standard scheme for providing prefix and suffix morphological rules would enable more compact lexicon documents.”

    Obviously, OpenOffice.org spelling dictionaries do have such a standard (.dic file and .aff file). It would be useful if such a standard existed for pronunciation dictionaries.

    Nevertheless, it is not much work to create additional dictionaries. I think that it is the right decision to offer several large PLS dictionaries. The performance issue can be solved later.

    Regards,
    Ralf

  5. Peter Grasch says:

    Performance is not an issue that could be solved on this level. Think about mobile clients / networking environments. Expand an 800 word dictionary and it’s well over 50 MB. That is just too much for data of so little value. Compression helps, of course, but it only reduces the problem and adds another significant performance drop for embedded devices with little processing power.

    Moreover, the problem you talk about vanishes once this automatic transcription process is integrated into the simon workflow.

    All in all we’d gain versatility (add arbitrary words) and (a lot of) performance without losing anything in the process. Win/win.
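
The workflow proposed in the comments (generate a pronunciation on demand and keep only a small shadow dictionary for exceptions) might look roughly like the following Python sketch. It does not show simon's actual code or API: `make_transcriber`, the `exceptions` mapping and the `generate` callback are all hypothetical names, and `generate` stands in for whatever automatic transcription step is used.

```python
from typing import Callable, Dict


def make_transcriber(
    exceptions: Dict[str, str],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Return a lookup function that prefers hand-checked exceptions.

    exceptions: small dictionary of words whose generated transcription
                is known to be poor (the "shadow dictionary" role).
    generate:   automatic transcription used for every other word.
    """

    def transcribe(word: str) -> str:
        key = word.lower()
        if key in exceptions:
            return exceptions[key]
        # Not an exception: build the pronunciation on the fly.
        return generate(key)

    return transcribe


if __name__ == "__main__":
    # Toy generator for demonstration only: spells the word letter by letter.
    toy_generate = lambda w: " ".join(w)
    transcribe = make_transcriber({"rosa": "rˈoːza"}, toy_generate)
    print(transcribe("Rosa"))
    print(transcribe("amo"))
```

With this layout the stored data grows only with the number of exceptions, not with the number of word forms.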