Posts Tagged ‘node17’

French terminals: “verbe” and “nom”

Friday, April 9th, 2010

Ralf's French dictionary (version 0.1.3; April 09, 2010) contains terminal information about some verbs (French: verbe) and a few nouns (French: nom). I added the terminal information with the style-sheet improve-french-dictionary.xsl. I invoked this .xsl style-sheet via the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/french-dictionary$ saxonb-xslt -ext:on -s:french-dictionary-0.1.2.xml -xsl:improve-french-dictionary.xsl -o:french-dictionary-0.1.3.xml

Short explanation:
1. saxonb-xslt = XSLT processor
2. french-dictionary-0.1.2.xml = Ralf's French dictionary version 0.1.2 = source XML file
3. improve-french-dictionary.xsl = XSLT style-sheet that inserts the terminal information into the object XML file
4. french-dictionary-0.1.3.xml = object XML file
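If you prefer to start the transformation from a script instead of typing the command by hand, the call above can be assembled in Python. This is only a sketch: the helper name build_saxon_command is made up for illustration, and the flags are simply the ones from the command line shown above.

```python
import shlex


def build_saxon_command(source, stylesheet, output):
    """Return the argument list for the saxonb-xslt call shown above.

    The function name is hypothetical; the flags (-ext:on, -s:, -xsl:, -o:)
    are copied from the command line in this post.
    """
    return [
        "saxonb-xslt",
        "-ext:on",
        f"-s:{source}",
        f"-xsl:{stylesheet}",
        f"-o:{output}",
    ]


cmd = build_saxon_command(
    "french-dictionary-0.1.2.xml",
    "improve-french-dictionary.xsl",
    "french-dictionary-0.1.3.xml",
)
print(shlex.join(cmd))

# To actually run it (requires saxonb-xslt on the PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file name ever contains spaces.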

Please import Ralf's French dictionary into simon:

import-french-pls-dictionary

Take a look at the terminal information:

terminaient

1. In the left column you can see the word terminaient.
2. The SAMPA pronunciation is t E R m i n E.
3. The category (= terminal) is verbe.

Another example:
4. The French word terminaison
5. … has the SAMPA pronunciation t E R m i n E z O n a s (which is wrong). Maybe it would be a good decision to transcribe this word as follows: t E R m i n E z On.
6. The category of terminaison is nom.

Let’s train the word terminaient:

train-french-word-terminaient

7. Select with the left mouse button the French word terminaient.
8. Press Add to Training.
9. The word terminaient (verbe) is ready for training.
10. Press Train selected Words.

OK, let’s finish here. You now know that Ralf's French dictionary contains some verbs and nouns with terminal information.

Generating French role attributes

Wednesday, April 7th, 2010

Recently, I released Ralf's French dictionary version 0.1.2 with empty role attributes. At the moment, I am preparing version 0.1.3 that will contain some role attributes. Let me explain what I am doing:

generating-french-terminals

1. Take a look at the next version of Ralf's French dictionary (french-dictionary-0.1.3.xml) that I am currently working on.

2. Some <lexeme> elements contain non-empty role attributes. The French word communiaient is marked as verbe.

3. Take a look at the style-sheet improve-french-dictionary.xsl. I am using this style-sheet to add role attributes automatically. Let me explain a few details so that you get a feel for what is happening.

4. The template with the name "create-french-lexeme" creates <lexeme> elements.

5. The template with the name "create-french-role-attribute" creates a French terminal depending on the morphology of the specific <grapheme> element:

6. When the <grapheme> element ends with âmes, the content of the role attribute will be verbe.

7. When it ends with aient, the role attribute will be verbe, too. Take a look at 2. communiaient. This lexeme is marked as verbe because of the style-sheet. Great, isn’t it? I have defined a simple rule in the style-sheet that marks the <lexeme> element as verbe when the morphological condition is met.

8. When the <grapheme> element ends with ais, the <lexeme> element is marked as verbe as well. One example is the <lexeme> element with the <grapheme> element communiais.

Conclusion: You now have a feel for what is happening. It is possible that not all French words that end with âmes, aient, ais, or ait are verbs. But I assume that most words with this specific morphology are verbs. That should be sufficient for the moment. You can see that XSLT is a very powerful language for dictionary improvement.
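The suffix rule can also be written down in a few lines of Python. This is not the XSLT template from improve-french-dictionary.xsl, just a sketch of the same idea; the name guess_role is invented for this example, and the endings are the four mentioned in the conclusion.

```python
# Endings mentioned in this post; every word with one of them is tagged as verbe.
VERB_ENDINGS = ("âmes", "aient", "ais", "ait")


def guess_role(grapheme: str) -> str:
    """Return "verbe" if the word ends with a listed ending, else ""."""
    return "verbe" if grapheme.endswith(VERB_ENDINGS) else ""


for word in ("communiaient", "communiais", "terminaison", "palais"):
    print(word, repr(guess_role(word)))
```

The last word shows the limitation mentioned above: palais is a noun, but the rule tags it as verbe because it ends with ais.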

Advantages of Ralf’s French dictionary

Wednesday, April 7th, 2010

At the moment, I am thinking about the advantages that Ralf's French dictionary has to offer. Let me show you the advantages:

french-advantages

1. The dictionary is stored as an XML file. This means that you can edit the dictionary with gedit. Or you can transform the dictionary with XSLT.

2. The encoding is UTF-8. With this encoding there shouldn’t be encoding problems:

2.a. UTF-8 means no encoding problems within the <grapheme> element: Hebrew or Tamil are no problem. And this means that French accents are displayed correctly inside the <grapheme> element. No crap characters should occur.

2.b. UTF-8 means that the IPA phonemes are displayed correctly inside the <phoneme> elements. The dictionary can easily be edited by human editors. It is difficult to edit a phonetic dictionary that contains X-SAMPA or Kirshenbaum characters. IPA phonemes can easily be read by humans. And it is no problem to process IPA characters with saxonb-xslt (type saxonb-xslt into the Ubuntu terminal). Every detail of the French language can be captured.

3. The license of Ralf's French dictionary is GPLv3. Maybe the simon developers are interested in offering an automatic dictionary import? The license would permit this. You can see that the dictionary design is very developer-friendly.

4. The <lexicon> element contains the attribute alphabet="ipa". Not all of my dictionaries contain IPA characters. Some dictionaries contain eSpeak characters; these dictionaries contain the attribute alphabet="espeak". The PLS standard allows us to use different alphabets for the <phoneme> elements. Personally, I prefer the IPA alphabet. But of course, other alphabets could be used. A future version of simon could differentiate between the different alphabets during dictionary import. I am thinking about the following solution:
- alphabet="ipa" is used by Ralf's German dictionary, Ralf's French dictionary, Ralf's Spanish dictionary;
- alphabet="espeak" can be used by other dictionaries (or alternatively, I transform the eSpeak phonemes into IPA phonemes). I am not sure whether it is good to use eSpeak phonemes inside some of my PLS dictionaries. Maybe it would be better to convert them into IPA phonemes?

Currently, the simon dictionary import process doesn’t differentiate between the attribute values alphabet="ipa" and alphabet="espeak". As far as I know, the eSpeak phonemes are almost the same for all languages. So maybe it would be a good idea if simon were able to import PLS dictionaries with eSpeak phonemes. I say maybe because I am not sure whether this would be a good decision. In the long run, IPA is the better choice (because it is easily editable by humans).

You can see that I spent a lot of time thinking about the different phoneme characters. It is great to see that simon handles almost all German phonemes. At the moment, there are import problems with the French IPA phonemes ɔ̃ — ɛ̃ — ɑ̃ — ɥ. The tilde is imported by simon as the SAMPA sequence n a s. This is wrong, and should be corrected.

5. Each of my dictionaries contains a language tag. Ralf's German dictionary contains xml:lang="de-DE" because it is Standard German. Ralf's French dictionary indicates Standard French by using the language tag xml:lang="fr". It would be possible to develop a phonetic dictionary for Canadian French. In this case, the tag would be xml:lang="fr-CA".

It is possible that a future version of simon automatically “understands” the language of the specific PLS dictionary. E.g. Ralf's Austrian German dictionary contains the language tag xml:lang="de-AT". Maybe it would be possible to ask the simon user via a wizard:

Which language do you want to use for dictation? Please select the appropriate language.

◊ Standard German (BOMP)
◊ Standard German (PLS)
◊ Standard German (PLS) + Austrian German (PLS)
◊ Only Austrian German (PLS)
◊ Standard French (PLS)
◊ Standard German (PLS) + Medical German (PLS)
◊ [Afrikaans, Catalan, Croatian, ...]

Thanks to the language tag xml:lang="fr" (each language has its own language tag) it should not be too difficult to develop an automatic dictionary import function for the 27 PLS dictionaries.
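To show how little is needed to read the language tag and the alphabet, here is a small Python sketch. It is not simon code. The function name describe_lexicon and the LABELS table are made up for illustration, and the sample has no XML namespace, just like the snippets quoted on this blog (a real PLS file may declare one).

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical lookup table for a wizard; the tags are the ones named in this post.
LABELS = {
    "de-DE": "Standard German",
    "de-AT": "Austrian German",
    "fr": "Standard French",
    "fr-CA": "Canadian French",
}

SAMPLE = """<lexicon version="1.0" alphabet="ipa" xml:lang="fr">
  <lexeme>
    <grapheme>mademoiselle</grapheme>
    <phoneme>madəmwazɛl</phoneme>
  </lexeme>
</lexicon>"""


def describe_lexicon(xml_text):
    """Read the language tag and the phoneme alphabet from the root element."""
    root = ET.fromstring(xml_text)
    lang = root.get(XML_LANG)
    return {
        "lang": lang,
        "alphabet": root.get("alphabet"),
        "label": LABELS.get(lang, "unknown"),
    }


info = describe_lexicon(SAMPLE)
print(info)
```

An import function could use the alphabet value to decide between an IPA and an eSpeak conversion, and the label to fill the wizard.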

simon offers automatic import of the German BOMP dictionary (obviously, this dictionary is of very good quality). But what about other languages? It would be good for marketing if simon offered automatic dictionary import for all 27 PLS dictionaries. Almost all of my dictionaries are in an early stage of development. But this is no problem at the moment. Each dictionary can be improved easily. It is just necessary to run an XSLT style-sheet that transforms the eSpeak phonemes into IPA phonemes. No big deal. I did this for German, for French, and for a couple of other languages.

How many phonemes are needed? I would say that we don’t need all IPA phonemes. We can stick to the existing ones plus 4 French phonemes plus 4 Spanish phonemes. That should do the job for all 27 languages. My goal is to help make simon usable for 27 languages. Usable means that all major characteristics (= phonemes) of each specific language are covered by the specific phonetic dictionary.

My personal focus is on the following dictionaries: German, French, Spanish, English (maybe I will transform the English Voxforge dictionary into PLS format). The other languages (Afrikaans, Catalan, Croatian, …) will have to use the phonemes that are used by these four languages. I don’t plan to add more phonemes to my PLS dictionaries. I want to keep it as simple as possible, and as complex as necessary.

6. I am experimenting with the role attribute in Ralf's French dictionary. I tried the following role attributes: lettre and abréviation. simon displays the terminal abréviation correctly:

abreviation

This means that French terminals (= value of the role attribute) can contain accented characters such as é.

Conclusion: Ralf's French dictionary is user-friendly and developer-friendly. I propose that the simon developers do the following two things:
- add the 4 French phonemes ɔ̃ — ɛ̃ — ɑ̃ — ɥ to the simon import process;
- offer automatic PLS dictionary import for 27 languages (simon should download the specific dictionary directly from the internet, and import it automatically after the download has finished).

Ralf’s French dictionary 0.1.2 released

Tuesday, April 6th, 2010

A few minutes ago, I uploaded Ralf's French dictionary version 0.1.2 (license: GPLv3). Download the dictionary, and import it into simon as PLS dictionary. I applied the following changes to the dictionary:

1. I added an empty role attribute to each <lexeme> element. At the moment, Ralf's French dictionary doesn’t contain any terminal information (noun, verb, adjective). It is possible to add terminal information to this dictionary with a simple text editor.

2. I changed another thing: The previous version of Ralf's French dictionary contained about 60,000 duplicate <lexeme> elements. I removed these elements with the following XSLT instruction:

<xsl:for-each-group select="lexeme" group-by="grapheme">

You can find this line in the style-sheet improve-french-dictionary.xsl (license: GPLv3). I used this style-sheet to generate version 0.1.2 using the Ubuntu terminal:

am3msi@am3msi-desktop:~/Documents/201004/french-dictionary$ saxonb-xslt -ext:on -s:french-dictionary.xml -xsl:improve-french-dictionary.xsl -o:french-dictionary-0.1.2.xml

3. The language tag is now correct: xml:lang="fr"
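For readers who don't use XSLT, changes 1 and 2 can be sketched in Python. This is not a translation of improve-french-dictionary.xsl. I assume here that only the first <lexeme> of each group with the same <grapheme> is kept, and that the file has no XML namespace, as in the snippets on this blog. The function name dedupe_lexemes is made up for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<lexicon version="1.0" alphabet="ipa" xml:lang="fr">
<lexeme><grapheme>mademoiselle</grapheme><phoneme>madəmwazɛl</phoneme></lexeme>
<lexeme><grapheme>habituiez</grapheme><phoneme>abitɥie</phoneme></lexeme>
<lexeme><grapheme>mademoiselle</grapheme><phoneme>madəmwazɛl</phoneme></lexeme>
</lexicon>"""


def dedupe_lexemes(lexicon):
    """Keep the first <lexeme> per distinct <grapheme> text and drop the rest.

    Assumption: "first one wins" is what the grouping is meant to achieve.
    Every kept lexeme also gets an empty role attribute if it has none.
    """
    seen = set()
    # Iterate over a copy because elements are removed from the parent.
    for lexeme in list(lexicon.findall("lexeme")):
        key = lexeme.findtext("grapheme")
        if key in seen:
            lexicon.remove(lexeme)
        else:
            seen.add(key)
            lexeme.set("role", lexeme.get("role", ""))
    return lexicon


root = dedupe_lexemes(ET.fromstring(SAMPLE))
graphemes = [lex.findtext("grapheme") for lex in root.findall("lexeme")]
print(graphemes)
```

With the three sample entries, the second mademoiselle entry is dropped and the two remaining lexemes carry role="".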

Unfortunately, at the moment the French phonemes ɔ̃ — ɛ̃ — ɑ̃ — ɥ aren’t transcribed into the correct SAMPA phonemes during the simon import process. It shouldn’t be a big deal to fix this. As soon as this issue has been fixed, Ralf's French dictionary should be usable for training of French words. Remember: the dictionary contains more than 300,000 French <lexeme> elements. It would be nice if a native speaker from France took a closer look at Ralf's French dictionary and made suggestions for improvements.

What happens with French phonemes?

Tuesday, April 6th, 2010

I want to import Ralf's French dictionary (version 0.1.1; November 03, 2009), and see what could be improved.

1. I found a small error. It says in the XML file:

<lexicon version="1.0" alphabet="ipa" xml:lang="de">

Well, the language tag shouldn’t be de. It should be fr.

2. What happens with the French phonemes when I import the dictionary? Here are a few examples with French phonemes:

2.a.

<lexeme>
<grapheme>dissolussions</grapheme>
<phoneme>disɔlysjɔ̃</phoneme>
</lexeme>

2.b.

<lexeme>
<grapheme>spiritains</grapheme>
<phoneme>spiʀitɛ̃</phoneme>
</lexeme>

2.c.

<lexeme>
<grapheme>transportable</grapheme>
<phoneme>tʀɑ̃spoʀtabl</phoneme>
</lexeme>

2.d.

<lexeme>
<grapheme>habituiez</grapheme>
<phoneme>abitɥie</phoneme>
</lexeme>

2.e.

<lexeme>
<grapheme>mademoiselle</grapheme>
<phoneme>madəmwazɛl</phoneme>
</lexeme>

I want to know what happens with these specific <phoneme> elements when I import Ralf's French dictionary into simon:

disɔlysjɔ̃
spiʀitɛ̃
tʀɑ̃spoʀtabl
abitɥie
madəmwazɛl

Here are the results:

2.a. disɔlysjɔ̃

dissolussions

The SAMPA result is crap: d i s O l y s j O n a s
A future version of simon should import the French phoneme ɔ̃ correctly. This phoneme occurs even in the German words Balkon and Ballon, which are of French origin.

2.b. spiʀitɛ̃

spiritains

The SAMPA result is crap, too: s p i R i t E n a s
The simon PLS import process should be adjusted so that the French phoneme ɛ̃ is represented correctly.

2.c. tʀɑ̃spoʀtabl

transportable

The SAMPA result is crap: t R A n a s s p o R t a b l
This is the third import phoneme error that should be corrected. Let’s take a look at the next French word:

2.d. abitɥie

habituiez

Is the SAMPA result abitHie acceptable? No, it isn’t. There should be a space between each phoneme. The French IPA phoneme ɥ has been transformed during the import into the SAMPA phoneme H. I am not sure whether this is an error; as far as I can tell, H is the usual SAMPA symbol for ɥ. In any case, the spaces between the phonemes are missing. So this is the fourth error that should be fixed. Let’s take a look at the next word:

2.e. madəmwazɛl

mademoiselle

The SAMPA phonemes are correct: m a d @ m w a z E l. The w phoneme has been imported correctly by simon.

Conclusion: The simon import process should be adjusted so that all phonemes in the French IPA transcriptions disɔlysjɔ̃, spiʀitɛ̃, tʀɑ̃spoʀtabl, and abitɥie are converted correctly into SAMPA. As soon as these phoneme import errors are corrected, I will think about a video that demonstrates that simon recognises French words.

I demonstrated in this 30 MB video that simon recognizes 200 German words (Ralf's German dictionary). It should be possible to get a similar result with Ralf's French dictionary. But first, the errors during import need to be fixed.
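To make the requested fix more concrete, here is a small Python sketch of an IPA to SAMPA conversion that handles the combining tilde. This is not simon's import code. The base symbols are the ones visible in the results above; writing a nasal vowel as the vowel plus ~ follows the X-SAMPA convention and is only my assumption about what simon would need. The function name ipa_to_sampa is made up for this example.

```python
import unicodedata

COMBINING_TILDE = "\u0303"

# Base symbols taken from the import results quoted in this post.
IPA_TO_SAMPA = {
    "\u0254": "O",  # ɔ
    "\u025b": "E",  # ɛ
    "\u0251": "A",  # ɑ
    "\u0259": "@",  # ə
    "\u0280": "R",  # ʀ
    "\u0265": "H",  # ɥ
}


def ipa_to_sampa(ipa: str) -> str:
    """Convert an IPA string into space-separated SAMPA phonemes.

    A combining tilde is attached to the preceding phoneme as "~"
    (assumption: X-SAMPA style) instead of becoming "n a s".
    Characters without a table entry are passed through unchanged.
    """
    phonemes = []
    for ch in unicodedata.normalize("NFD", ipa):
        if ch == COMBINING_TILDE and phonemes:
            phonemes[-1] += "~"
        else:
            phonemes.append(IPA_TO_SAMPA.get(ch, ch))
    return " ".join(phonemes)


for ipa in (
    "dis\u0254lysj\u0254\u0303",      # dissolussions
    "spi\u0280it\u025b\u0303",        # spiritains
    "t\u0280\u0251\u0303spo\u0280tabl",  # transportable
    "abit\u0265ie",                   # habituiez
    "mad\u0259mwaz\u025bl",           # mademoiselle
):
    print(ipa_to_sampa(ipa))
```

The design point is that the tilde is treated as a modifier of the previous phoneme and not as a phoneme of its own; the spaces come from joining the list at the end, so habituiez gets them as well.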