Posts Tagged ‘paste’

Zuführungsdrähten: two pronunciations

Thursday, December 24th, 2009

I am now adding the word Zuführungsdrähten (which is part of the shadow dictionary):

[Image: zufuehrungsdraehten]

You can see that there are two pronunciation alternatives. And this proves the strength of Ralf’s German dictionary: I am using an XSLT stylesheet to fix recurring pronunciation errors (eSpeak is not perfect), and to add alternate pronunciations with the replace() function, e.g. replace($ieren, 'tən', 'tn̩').
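The effect of such a replace() call can be tried outside XSLT. The following is a minimal shell sketch that uses sed instead of the actual style-sheet; the IPA string is an illustrative sample (an assumption), not an entry copied from Ralf's German dictionary.

```shell
# Illustrative IPA fragment (an assumption, not copied from the dictionary).
original='dʀɛːtən'

# Build the alternate pronunciation by swapping 'tən' for 'tn̩' (syllabic n),
# mirroring what replace(..., 'tən', 'tn̩') does in the XSLT style-sheet.
alternate=$(printf '%s\n' "$original" | sed 's/tən/tn̩/g')

# Keep both strings, so the entry ends up with two pronunciation alternatives.
printf '%s\n%s\n' "$original" "$alternate"
```

The original pronunciation is kept and the variant is added next to it, which is how a single word ends up with two alternatives.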

It would be great if someone volunteered to help develop Ralf's German dictionary. My concept is as follows (compare with the XSLT concept):

[Image: xslt-concept. Image source: Wikipedia]

Ralf’s German dictionary (current version) = XML input
espeak2perfectipa.xsl = XSLT code
saxonb-xslt (Ubuntu terminal) = XSLT processor
Ralf’s German dictionary (future version) = Result document
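On the command line, that mapping might look like the following sketch. The input and output file names are placeholders (the post names only espeak2perfectipa.xsl), and the -s:, -xsl: and -o: options assume the Saxon-B 9 command-line syntax that the saxonb-xslt wrapper passes through.

```shell
# Placeholder file names; only espeak2perfectipa.xsl is named in the post.
INPUT=ralfs-german-dictionary-current.xml
STYLESHEET=espeak2perfectipa.xsl
OUTPUT=ralfs-german-dictionary-next.xml

if command -v saxonb-xslt >/dev/null 2>&1 && [ -f "$INPUT" ] && [ -f "$STYLESHEET" ]; then
    # XML input + XSLT code -> XSLT processor -> result document
    saxonb-xslt -s:"$INPUT" -xsl:"$STYLESHEET" -o:"$OUTPUT"
else
    echo "saxonb-xslt or the input files are missing; nothing transformed" >&2
fi
```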

Ralf's German dictionary is developed outside of, and independently from, simon. simon does a pretty good job of converting IPA to SAMPA during import, so there is no need to worry about that conversion.

Ralf’s German dictionary has a lot of known weaknesses. It would be great if someone who is interested in the improvement of this dictionary would volunteer. It is not that difficult to get involved.

Everyone is permitted to improve Ralf’s German dictionary (and the corresponding XSLT code espeak2perfectipa.xsl) because both are GPLv3 licensed.

Ralf's German dictionary is the flagship. There are other dictionaries which need improvement:

A. Austrian German
Ralf’s Austrian German dictionary – it is a very small dictionary. The target group is very specific. This dictionary should contain only words that are not included in Ralf's German dictionary. So if you live in Austria, it is intended that you import two dictionaries:
1. Ralf's German dictionary (with 300,000 words);
2. Ralf's Austrian German dictionary (with specific words).

I don’t know much about Austrian German. You can get an impression of how Austrian German sounds when watching the simon video tutorial.

B. Swiss German
Maybe I will release a Swiss German dictionary. If someone from Switzerland is interested in the development of such a dictionary, I could create Ralf's Swiss German dictionary (I haven’t done that so far). I think that there is a GPL word list available at OpenOffice.org. So a volunteer would be welcome. I can help you with the first steps (unmunch, eSpeak, paste). The result would be a PLS dictionary with a vocabulary that only contains words that are specific to Swiss German.

The future Ralf's Swiss German dictionary would be of interest to people who live in Switzerland, in particular Germans who have emigrated there. If you have moved from Germany to Switzerland, you should get familiar with Swiss German. If you want to stay in Switzerland, learn the language! simon / Ralf's Swiss German dictionary might help you reach that goal.

C. Medical German
Ralf’s German medical dictionary is targeted at people who are interested in medical education. It is necessary to develop specialised medical dictionaries. The concept is easy:
1. Import Ralf's German dictionary into simon.
2. Import Ralf's German medical dictionary. Then you can train simon to recognize medical terms (e.g. Linsenchirurgie: lɪnzənçiːʀʊʀgiː; Epilepsiebehandlung: ʔeːpiːlɛpsiːbeːandlʊŋ).
3. Develop specialised medical dictionaries: human anatomy, pharmacology, genetics, etc.

simon could be used by medical students. So if you are a medical student (German language), you can improve Ralf's German medical dictionary. You can add words to this dictionary. Later, in a few years when you become a doctor, you might be able to use your experience with simon / Ralf's German medical dictionary. Develop your own medical pronunciation dictionary, and become a better doctor!

Of course, because we are in a very early stage of development, this is just something for medical students who have enough time.

D. Latin (German pronunciation)
Ralf’s Latin dictionary needs improvement. Latin has pronunciation rules that are different from German. One possible approach is to improve Ralf's Latin dictionary with an XSLT stylesheet, following the concept I explained above. The stylesheet needs information that is specific to the Latin language. Sometimes the Latin e is short (e.g. currere), sometimes it is long (e.g. dēbēre). These things need to be fixed.

And of course, 1.7 million Latin words are too many. The size of this dictionary has to be reduced because of performance issues. A dictionary with about 100,000 Latin words would be optimal at the moment. We don’t yet have a routine (compression algorithm) to handle dictionaries with e.g. 1.7 million words. This has to be developed (maybe someone who is familiar with the unmunch command could do that). But a dictionary with 100,000 Latin words would be a good start.

Latin is a “dead language”. But thanks to simon / Ralf's Latin dictionary, it should be possible to make the computer write down Latin words when you speak them into your microphone. From my point of view, Ralf’s Latin dictionary is something for Latin teachers (school or university). So the target group is very specific. I think that it can be fun for Latin students to use simon for the recognition of spoken Latin words.

E. Conclusion
The different dictionaries need improvement. Interested persons (people from Austria, Switzerland; medical students; Latin teachers) are encouraged to improve the specific dictionary on their own. My dictionaries are GPLv3 licensed. It is intended that someone improves them. This is my concept:

1. unmunch an OpenOffice.org spelling dictionary
2. generate phonemes with eSpeak
3. paste
4. convert eSpeak phonemes into IPA phonemes with XSLT
5. import the resulting PLS dictionary into simon
6. record a few words with simon
7. use simon for recognition (dictate e.g. into a gedit window)

This concept should work with dictionaries that use the German pronunciation (Austrian, Swiss, German medical, Latin). I haven’t tested these dictionaries for training / recognition with simon. But the concept is the same. Since Ralf's German dictionary (flagship) works with simon, the other dictionaries with German pronunciation should work, too.
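Steps 1 to 3 of the list above can be sketched as a small shell script. This is only an outline under assumptions: the file names and the voice are placeholders, and it is assumed that eSpeak prints one line of phonemes per input line, which should be checked by comparing line counts before the paste step is trusted. Steps 4 to 7 (XSLT conversion, import into simon, recording, recognition) are not covered.

```shell
# Placeholder file names; substitute the spelling dictionary for your language.
DIC=de_DE.dic
AFF=de_DE.aff
VOICE=de

build_pairs() {
    # Step 1: expand the spelling dictionary into a flat word list, without duplicates.
    unmunch "$1" "$2" | LC_ALL=C sort -u > wordlist.txt
    # Step 2: let eSpeak print phoneme mnemonics (-x) without audio (-q).
    espeak -v "$3" -q -x -f wordlist.txt > phonemes.txt
    # Step 3: join words and phonemes line by line, separated by a tab.
    paste wordlist.txt phonemes.txt > pairs.tsv
}

# Run only when the tools and the input files are actually present.
if command -v unmunch >/dev/null 2>&1 && command -v espeak >/dev/null 2>&1 \
    && [ -f "$DIC" ] && [ -f "$AFF" ]; then
    build_pairs "$DIC" "$AFF" "$VOICE"
else
    echo "unmunch, espeak, or the input files are missing; nothing built" >&2
fi
```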

Expanding Ralf’s French dictionary

Tuesday, November 3rd, 2009

This article is interesting for people who want to create a pronunciation dictionary for their own language that can be imported into simon.

Currently, I am preparing a new version of Ralf's French dictionary. I am adding a lot of words to the dictionary. Here is what I am doing:

1. I got a French spelling dictionary from OpenOffice.org (Orthographe «Réforme 1990»). There are two important files:
a. .dic file: contains the word list
b. .aff file: contains rules about the possible suffixes. French has a lot of suffixes. Let’s take a look at Conjugaison française: Premier groupe:

# INDICATIF

* Présent : -e, -es, -e, -ons, -ez, -ent
* Imparfait : -ais, -ais, -ait, -ions, -iez, -aient
* Futur simple : -erai, -eras, -era, -erons, -erez, -eront
* Passé simple : -ai, -as, -a, -âmes, -âtes, -èrent

# SUBJONCTIF

* Présent : -e, -es, -e, -ions, -iez, -ent
* Imparfait : -asse, -asses, -ât, -assions, -assiez, -assent

# CONDITIONNEL

* Présent : -erais, -erais, -erait, -erions, -eriez, -eraient

The next version of Ralf’s French dictionary should cover most of the French suffixes. And of course, the French accents should be correct (according to the Réforme 1990).

2. I typed into the Ubuntu terminal the command: unmunch fr-1990.dic fr-1990.aff > french-wordlist-o.txt
This created a list of about 3 million French words. But there are many duplicates, so I will use only a subset of these words. I will have to remove the duplicates, either with the XPath function distinct-values() or with sort -u:

“sort sorts the list of ‘words’ into alphabetical order, and the -u switch removes duplicates”
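A minimal sketch of the sort -u route, using a handful of sample verb forms in a temporary directory (the file names are made up for the example). LC_ALL=C makes sort compare raw bytes, so the result does not depend on the locale, although accented words then sort after unaccented ones.

```shell
workdir=$(mktemp -d)

# Sample word list with duplicates, standing in for the unmunch output.
printf '%s\n' aimer aimes aimer aimons aimâmes aimes aimâmes > "$workdir/french-sample.txt"

# Sort bytewise and keep only one copy of each word.
LC_ALL=C sort -u "$workdir/french-sample.txt" > "$workdir/french-sample-unique.txt"

cat "$workdir/french-sample-unique.txt"
```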

I will have to be careful with the special characters. The sort command might be easier, but there may be problems with ISO-8859-1 vs. UTF-8. To avoid this, it might be better to use an XSLT style-sheet. But then there can be a problem with Java heap space (I solved this under Ubuntu 9.04 with VisualVM; unfortunately, I haven’t found out how to deal with it under Ubuntu 9.10). I will have to try it out. The tools that I am using are very good, but they aren’t perfect. Fixing the issues costs me a lot of time.

There are encoding problems. If you want to create a dictionary with non-US-ASCII characters (as in German, with äöüß, or in French), you will probably encounter the following problem:

ISO-8859-1 vs. UTF-8

Both standards are very common, and mixing them up can cause a lot of headaches. I am trying to solve the problem via the style-sheet create-graphemes-french.xsl:

[Image: replace-french]

The unmunch command introduced a lot of crap characters. I don’t know how I can prevent this problem. But I know how I can try to fix the crap characters. One solution would be a simple search and replace with gedit. But I prefer to use a style-sheet because it allows several transformations at a time.
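Another option, sketched here with iconv rather than a style-sheet, is to convert a file explicitly before further processing. This assumes the stray characters come from ISO-8859-1 text being treated as UTF-8, which this post does not establish; the file names are made up for the example.

```shell
workdir=$(mktemp -d)

# "äöüß" written as ISO-8859-1 bytes (octal escapes), one byte per character.
printf '\344\366\374\337\n' > "$workdir/latin1.txt"

# A file that is not valid UTF-8 makes this check fail, which hints at ISO-8859-1.
if iconv -f UTF-8 -t UTF-8 "$workdir/latin1.txt" >/dev/null 2>&1; then
    echo "already valid UTF-8"
    cp "$workdir/latin1.txt" "$workdir/utf8.txt"
else
    echo "not valid UTF-8, converting from ISO-8859-1"
    iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"
fi

cat "$workdir/utf8.txt"
```

Converting once, up front, means every later step (sort, eSpeak, paste, XSLT) sees the same encoding.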

You can see that dictionary development is possible. I am using very powerful tools (the unmunch command for word generation; the .xsl style-sheet for fixing encoding issues; the espeak command for phoneme generation).

Why am I writing this in this blog? Because if you want to use simon, you need a pronunciation dictionary. I want to help people who don’t have access to a dictionary. Create your own dictionary! OpenOffice.org offers spelling dictionaries in a lot of languages. This is your source to get a word list. With eSpeak you transform the words into phonemes. You can use eSpeak for the following languages:

Afrikaans, Bosnian, Catalan, Czech, German (most phonemes of Ralf's German dictionary are created with eSpeak), Greek, Esperanto, Spanish, Finnish, Croatian, Hungarian, Italian, Kurdish, Latvian, Polish, Romanian, Slovak, Serbian, Swedish, Swahili, Tamil, Turkish.

If you are a native speaker of one of those languages, my concept of dictionary creation is something for you.

With paste you combine the word list and the phoneme list.
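As a small illustration of the paste step, the sketch below joins two sample files line by line. The two words and their IPA strings are the medical examples quoted elsewhere on this page; in the real workflow the second file would hold eSpeak phonemes. Because paste simply pairs line 1 with line 1, line 2 with line 2, and so on, the sketch first checks that both files have the same number of lines.

```shell
workdir=$(mktemp -d)

# Sample word list and matching pronunciation list (one entry per line).
printf '%s\n' Linsenchirurgie Epilepsiebehandlung > "$workdir/words.txt"
printf '%s\n' 'lɪnzənçiːʀʊʀgiː' 'ʔeːpiːlɛpsiːbeːandlʊŋ' > "$workdir/phonemes.txt"

# paste would silently misalign the columns if the line counts differed.
if [ "$(wc -l < "$workdir/words.txt")" -eq "$(wc -l < "$workdir/phonemes.txt")" ]; then
    # Each output line is: word <TAB> phonemes
    paste "$workdir/words.txt" "$workdir/phonemes.txt" > "$workdir/pairs.tsv"
    cat "$workdir/pairs.tsv"
else
    echo "line counts differ; word list and phoneme list are out of sync" >&2
fi
```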

[Edited on November 16, 2009]

Create your own PLS dictionary

Tuesday, October 27th, 2009

This article is for people who want to use simon, but don’t have access to a pronunciation dictionary.

If you want to use simon, you need a pronunciation dictionary. Such dictionaries are available for some languages.

What if you don’t have a dictionary for your own language? You can create your own dictionary. Here are a few thoughts about the topic “dictionary creation”:

Today, I created about 100,000 words from an OpenOffice.org spelling dictionary with the style-sheet create-graphemes.xsl. The source spelling dictionary is licensed under the GPL. There is no licensing problem. Ralf’s German dictionary is GPL licensed, too. I “steal” words only from GPL sources.

I recommend OpenOffice.org spelling dictionaries if you want to create a PLS dictionary for your own language. Think about languages like Czech, Greek, or Finnish. Native speakers can use such an OpenOffice.org spelling dictionary for word creation. The word list can then be used with eSpeak for phoneme generation. eSpeak works with several languages. The word list and the phoneme list can be combined with paste. After that, the eSpeak phonemes can be converted into IPA with the style-sheet espeak2perfectipa.xsl. You can write your own style-sheet for your own language. I want to show you the concept of dictionary creation.

I created a word list of Ralf’s German dictionary using the style-sheet output-text.xsl. Then I used comm: $ comm -23 output-o 0.1.7-wordlist-o > compared
The result should be a word list with words that are not part of the current version of Ralf’s German dictionary (I am not sure whether this approach was successful). I will integrate these words into the dictionary.
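One thing worth checking with this approach: comm expects both input files to be sorted in the same order, otherwise its output is unreliable. The sketch below uses made-up sample words and file names to show the comparison with both lists sorted bytewise first.

```shell
workdir=$(mktemp -d)

# Sample lists: candidate words and the words already in the dictionary.
printf '%s\n' Pflaume Apfel Kirsche Birne > "$workdir/candidates.txt"
printf '%s\n' Birne Apfel > "$workdir/current.txt"

# comm needs sorted input; sort both files the same way (bytewise).
LC_ALL=C sort -u "$workdir/candidates.txt" > "$workdir/candidates.sorted"
LC_ALL=C sort -u "$workdir/current.txt" > "$workdir/current.sorted"

# -23 suppresses columns 2 and 3, leaving lines found only in the first file.
LC_ALL=C comm -23 "$workdir/candidates.sorted" "$workdir/current.sorted" > "$workdir/compared"

cat "$workdir/compared"
```

The file compared then holds only the words that are missing from the current dictionary.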

Ralf’s German dictionary is a solution for the German language. Create a solution for your own language!