Ralf’s French dictionary with 40,000 words

You can import Ralf's French dictionary (license: GPL; version 0.1) into simon. Training simon with this dictionary is not yet recommended: the dictionary itself should be more or less OK, but the simon import process has problems, which I describe later in this article.

I took the words for this dictionary from a French OpenOffice.org spelling dictionary (license: GPL) and generated the phones with eSpeak. I then converted the eSpeak ASCII phones into IPA with this style-sheet. I am not sure whether all French vowels are converted correctly from the eSpeak phone set into IPA. In particular, I am unsure whether the following XPath expressions perform the correct replacements:

select="replace($sierra, 'Y:', 'ø')"
select="replace($sierra, 'Y','œ')"

It is difficult to distinguish correctly between these similar phones. Maybe there is a native speaker out there who is familiar with the French eSpeak phone set? The question is: how can I improve the conversion from French eSpeak phones into IPA phones? The XSLT-style-sheet (see the link above) documents which replacements are made. I am also not sure whether eSpeak generated all French phones correctly. There might be some inconsistencies that have to be fixed.
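The order of the two replacements matters: if the bare 'Y' rule ran first, it would also consume the 'Y' inside 'Y:'. Here is a minimal sketch of that idea in Python (not the XSLT style-sheet itself). It contains only the two rules quoted above, the function name `espeak_to_ipa` is hypothetical, and the input strings are made up for illustration. Whether the two mappings are phonetically right for French remains the open question.

```python
# Ordered eSpeak -> IPA replacement table (a sketch, not the real style-sheet).
# Only the two rules quoted in the article are listed here.
ESPEAK_TO_IPA = [
    ("Y:", "ø"),
    ("Y", "œ"),
]


def espeak_to_ipa(phones: str) -> str:
    """Apply all replacements, longest source pattern first.

    Sorting by pattern length guarantees that "Y:" is handled before the
    bare "Y", regardless of the order in which the rules are listed.
    """
    rules = sorted(ESPEAK_TO_IPA, key=lambda rule: len(rule[0]), reverse=True)
    for source, target in rules:
        phones = phones.replace(source, target)
    return phones


# Made-up phone strings, only to show the effect of the ordering.
print(espeak_to_ipa("pY:"))  # pø
print(espeak_to_ipa("pYR"))  # pœR
```

Sorting by length is one simple way to keep longer patterns from being broken up by shorter ones; it would not help if a replacement result itself matched a later rule, which is not the case for these two rules.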

I am open to suggestions on how to improve the XSLT-style-sheet. The concept is great. Support from a native speaker would be appreciated because it is difficult to catch all the details (especially vowels) of the French language.

When you import the dictionary into simon, not all phones are converted properly. The following IPA symbols are probably affected:

1. ɲ = voiced palatal nasal = U+0272; a future version of simon could transform this phone into SAMPA J. Example:

<lexeme>
<grapheme>Avignon</grapheme>
<phoneme>aviɲɔ̃</phoneme>
</lexeme>

2. θ = voiceless dental fricative = U+03B8; a future version of simon could transform this phone into SAMPA T. Example:

<lexeme>
<grapheme>Commonwealth</grapheme>
<phoneme>kɔmənwɛlθ</phoneme>
</lexeme>

We also need this phone for (a future version of) Ralf’s German dictionary, for words like Thunderbird. Ralf’s German dictionary does not currently use this foreign phone, but it is needed.

3. ɔ̃; see Wikipedia: “in German, in French loanwords such as Balkon, Chanson”. Example:

<lexeme>
<grapheme>Simon</grapheme>
<phoneme>simɔ̃</phoneme>
</lexeme>

4. ɑ̃ = nasalized a. Example:

<lexeme>
<grapheme>diligence</grapheme>
<phoneme>diliʒɑ̃s</phoneme>
</lexeme>

5. ɛ̃ = bright nasal vowel. Example:

<lexeme>
<grapheme>infidélité</grapheme>
<phoneme>ɛ̃fidelite</phoneme>
</lexeme>

6. œ̃ = rounded open-mid nasal vowel. Example:

<lexeme>
<grapheme>jardin</grapheme>
<phoneme>ʒaʀdœ̃</phoneme>
</lexeme>

7. ɥ = ü sound used as a consonant = U+0265. Example:

<lexeme>
<grapheme>minuit</grapheme>
<phoneme>minɥi</phoneme>
</lexeme>

You can see that the French language employs a lot of vowels. Maybe it is possible to adjust the PLS import process?
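To make the idea of an adjusted import more concrete, here is a rough Python sketch of an IPA to SAMPA conversion for the seven symbols listed above. It is not simon's actual code. The targets J, T and H are the ones suggested in this article; writing nasal vowels with a trailing "~" and œ as "9" follows X-SAMPA and is an assumption about what the import should produce. The function name `ipa_to_sampa` is hypothetical, and symbols that are not in the table are passed through unchanged.

```python
import unicodedata

COMBINING_TILDE = "\u0303"  # nasalization mark that follows the base vowel

# J, T and H are taken from the article. "9" for œ and the "~" suffix for
# nasal vowels follow X-SAMPA and are an assumption, not simon's behaviour.
IPA_TO_SAMPA = {
    "ɲ": "J",
    "θ": "T",
    "ɥ": "H",
    "ɔ": "O",
    "ɑ": "A",
    "ɛ": "E",
    "œ": "9",
}


def ipa_to_sampa(ipa: str) -> list[str]:
    """Return one SAMPA phone per IPA symbol, keeping nasal vowels together."""
    phones: list[str] = []
    # NFD makes sure every nasal vowel is "base vowel + combining tilde".
    for char in unicodedata.normalize("NFD", ipa):
        if char == COMBINING_TILDE and phones:
            phones[-1] += "~"  # attach nasalization to the previous vowel
        else:
            phones.append(IPA_TO_SAMPA.get(char, char))
    return phones


print(" ".join(ipa_to_sampa("min\u0265i")))        # m i n H i
print(" ".join(ipa_to_sampa("sim\u0254\u0303")))   # s i m O~
```

Treating the combining tilde as part of the preceding vowel, instead of as a symbol of its own, is the design choice that keeps each nasal vowel a single phone.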

I have imported the dictionary into simon. Let’s take a look at the results (HTK format):

1. AVIGNON [Avignon] aviJOnas – has to be fixed
2. COMMONWEALTH [Commonwealth] kOm@nwElT – this has to be fixed, too.
3. SIMON [Simon] s i m O n a s – interesting: it seems that the phones are separated correctly. But it is easy to recognize that the simon import process should be adjusted.
4. DILIGENCE [diligence] d i l i Z A n a s s – the error is similar to the previous one.
5. INFIDÉLITÉ [infidélité] E n a s f i d e l i t e – has to be fixed.
6. JARDIN [jardin] Z a R d oe n a s – has to be fixed, too.
7. MINUIT [minuit] minHi – the H is the correct SAMPA transcription, but there should be spaces between the phones.

At the moment, the major issue is the simon PLS import process.
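The missing spaces in results 1, 2 and 7 could be handled by splitting the SAMPA string against a list of known phones. The following Python sketch shows one possible approach, a greedy longest-match split, followed by formatting the entry in the same layout as the results above. It is not how simon is implemented: the phone list `KNOWN_PHONES` is a small, made-up inventory, and the function names are hypothetical.

```python
# Made-up phone inventory for illustration; simon's real list is not shown
# in this article. Two-character phones such as "dZ" are the reason why a
# plain character-by-character split is not enough.
KNOWN_PHONES = {"a", "v", "i", "J", "O", "m", "n", "H", "d", "Z", "dZ"}


def split_phones(sampa: str, known: set[str]) -> list[str]:
    """Split an unspaced SAMPA string, always trying the longest phone first."""
    longest = max(len(phone) for phone in known)
    phones: list[str] = []
    pos = 0
    while pos < len(sampa):
        for size in range(min(longest, len(sampa) - pos), 0, -1):
            candidate = sampa[pos:pos + size]
            if candidate in known:
                phones.append(candidate)
                pos += size
                break
        else:
            # No known phone starts here: report it instead of guessing.
            raise ValueError(f"unknown phone at position {pos}: {sampa[pos:]!r}")
    return phones


def htk_line(word: str, phones: list[str]) -> str:
    """Format an entry as 'WORD [word] p h o n e s'."""
    return f"{word.upper()} [{word}] {' '.join(phones)}"


print(htk_line("minuit", split_phones("minHi", KNOWN_PHONES)))
# MINUIT [minuit] m i n H i
```

Raising an error for an unknown phone, instead of silently leaving the string unsplit, would make gaps in the phone list visible during import.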


5 Responses to “Ralf’s French dictionary with 40.000 words”

  1. Peter Grasch says:

    Hi Ralph!

    Well I understand that this is a major issue but the issue is mainly with the IPA. There are _so_many_ IPA phonemes that simon doesn’t correctly translate them all. Also, I don’t know enough about phonetics in general to anticipate all combinations or even know what the correct SAMPA equivalent would be in most cases.

    However, the simon IPA -> SAMPA process is VERY simple and the missing spacing in some cases is just because simon doesn’t know some SAMPA phonemes and thus doesn’t know where to put the spaces (hard to tell because of phonemes like dZ).

    Some time ago you said that you would be interested in getting involved in the development process. This could be a very easy start!

    For the IPA -> SAMPA conversion all you’d need to change is the translation table in this file:
    http://speech2text.svn.sourceforge.net/viewvc/speech2text/branches/scenarios/simon/src/simonmodelmanagementui/ImportDict/dict.cpp?view=markup
    (it is pretty self explanatory, just have a look)

    Fixing the separation issues is even easier. Just add the missing phonemes (“H” in this case) in a new line to this file.

    If you need SVN write access, just drop me a line…

    Greetings,
    Peter

  2. producer says:

    Hello Peter!

    Well, at the moment, my focus is dictionary creation / expansion. As soon as I have
    - expanded Ralf’s French dictionary,
    - created Ralf’s English dictionary,
    - created Ralf’s Spanish dictionary,
    you can give me SVN write access.

    By the way, maybe it would be possible to extract terminal information for the role attribute from an OpenOffice.org dictionary?

