Posts Tagged ‘XSLT’

Schott’s German dictionary 0.2.5

Tuesday, February 22nd, 2011

Here is how I create Schott's German dictionary version 0.2.5 (former name: Ralf's German dictionary). Within the XSLT style sheet improve-german.xsl, I implement these transformation rules:

1. Replace about 70 occurrences:

<xsl:when test="contains(lower-case(../grapheme), 'alkohol')"><xsl:value-of select="replace($sierra, 'alkoːɔl','alkohoːl')"/></xsl:when>

2. Replace about 50 occurrences:

<xsl:when test="contains(lower-case(../grapheme), 'vertret')"><xsl:value-of select="replace($sierra, 'vɐtʀeːt','fɐtʀeːt')"/></xsl:when>

3. Replace 32 matches:

<xsl:when test="contains(lower-case(../grapheme), 'aluminium')"><xsl:value-of select="replace($sierra, 'aluːmiːniːʊm','alumiːni̯ʊm')"/></xsl:when>

4. Replace 386 matches: (more…)
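
As a minimal sketch, rules 1 to 3 above could sit together in one xsl:choose inside a template like the following. This is not the actual improve-german.xsl: the identity template, the match on phoneme, and the way $sierra is bound are assumptions made for illustration. The sketch also assumes a dictionary without a namespace and with exactly one grapheme per lexeme.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that no other template handles. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rewrite each phoneme; the sibling grapheme selects the rule. -->
  <xsl:template match="phoneme">
    <!-- Assumption: $sierra is simply the current pronunciation string. -->
    <xsl:variable name="sierra" select="string(.)"/>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:choose>
        <!-- Rule 1 -->
        <xsl:when test="contains(lower-case(../grapheme), 'alkohol')">
          <xsl:value-of select="replace($sierra, 'alkoːɔl', 'alkohoːl')"/>
        </xsl:when>
        <!-- Rule 2 -->
        <xsl:when test="contains(lower-case(../grapheme), 'vertret')">
          <xsl:value-of select="replace($sierra, 'vɐtʀeːt', 'fɐtʀeːt')"/>
        </xsl:when>
        <!-- Rule 3 -->
        <xsl:when test="contains(lower-case(../grapheme), 'aluminium')">
          <xsl:value-of select="replace($sierra, 'aluːmiːniːʊm', 'alumiːni̯ʊm')"/>
        </xsl:when>
        <!-- No rule applies: keep the pronunciation unchanged. -->
        <xsl:otherwise>
          <xsl:value-of select="$sierra"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because xsl:choose stops at the first xsl:when that matches, a word containing two of the search strings would only get the first replacement. If the dictionary file declares the PLS namespace, adding an xpath-default-namespace attribute to xsl:stylesheet would let the unprefixed element names match.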

Generating French role attributes

Wednesday, April 7th, 2010

Recently, I released Ralf's French dictionary version 0.1.2 with empty role attributes. At the moment, I am preparing version 0.1.3 that will contain some role attributes. Let me explain what I am doing:

[Image: generating-french-terminals]

1. Take a look at the next version of Ralf's French dictionary (french-dictionary-0.1.3.xml) that I am currently working on.

2. Some <lexeme> elements contain non-empty role attributes. The French word communiaient is marked as verbe.

3. Take a look at the style-sheet improve-french-dictionary.xsl. I am using this style-sheet to add role attributes automatically. Let me explain it in more detail so that you can see what is happening.

4. The template with the name "create-french-lexeme" creates <lexeme> elements.

5. The template with the name "create-french-role-attribute" creates a French terminal depending on the morphology of the specific <grapheme> element:

6. When the <grapheme> element ends with âmes, the content of the role attribute will be verbe.

7. When it ends with aient, the role attribute will be verbe, too. Compare item 2 (communiaient): this lexeme is a verbe because of the style-sheet. Great, isn’t it? I have defined a simple rule in the style-sheet that marks the <lexeme> element as verbe when the morphological condition is met.

8. When the <grapheme> element ends with ais, the <lexeme> element is marked as verbe as well; communiais is an example.

Conclusion: You can see what is happening. Not every French word that ends with âmes, aient, ais, or ait is a verb (jamais and palais, for example, are not). But I assume that most words with this specific morphology are verbs. That should be sufficient for the moment. You can see that XSLT is a very powerful language for dictionary improvement.
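
The suffix rules described above could be sketched roughly like this. It is not the real improve-french-dictionary.xsl: only the template name create-french-role-attribute and the value verbe come from the post. Matching on lexeme instead of using create-french-lexeme, the $word variable, and the single regular expression are assumptions for illustration, and the input is assumed to have no namespace.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that no other template handles. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rebuild each lexeme; the role attribute is decided by the named template. -->
  <xsl:template match="lexeme">
    <xsl:copy>
      <!-- Copy all attributes except role, which is written below. -->
      <xsl:apply-templates select="@* except @role"/>
      <xsl:call-template name="create-french-role-attribute"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Called with a lexeme as context node. -->
  <xsl:template name="create-french-role-attribute">
    <!-- Hypothetical helper variable: the first grapheme in lower case. -->
    <xsl:variable name="word" select="lower-case(grapheme[1])"/>
    <xsl:choose>
      <!-- Endings named in the post: âmes, aient, ais, ait. -->
      <xsl:when test="matches($word, '(âmes|aient|ais|ait)$')">
        <xsl:attribute name="role">verbe</xsl:attribute>
      </xsl:when>
      <!-- Otherwise keep whatever role attribute was already there. -->
      <xsl:otherwise>
        <xsl:copy-of select="@role"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The role attribute is written before the child nodes are copied, because XSLT only allows attributes to be added to an element before its content.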

Zuführungsdrähten: two pronunciations

Thursday, December 24th, 2009

I am now adding the word Zuführungsdrähten (which is part of the shadow dictionary):

[Image: zufuehrungsdraehten]

You can see that there are two pronunciation alternatives. And this proves the strength of Ralf’s German dictionary: I am using an XSLT stylesheet to fix recurring pronunciation errors (eSpeak is not perfect), and to add alternate pronunciations (using replace: replace($ieren, 'tən', 'tn̩')).
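
One way such an alternate pronunciation could be added is sketched below. Only the replace($ieren, 'tən', 'tn̩') call is taken from the text above. The surrounding template, the $alternate variable, and the check against duplicates are assumptions, as is an input without a namespace.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that no other template handles. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Keep each phoneme and, where the rule changes it, add a second one. -->
  <xsl:template match="phoneme">
    <xsl:variable name="ieren" select="string(.)"/>
    <!-- Hypothetical variable: the pronunciation with a syllabic n. -->
    <xsl:variable name="alternate" select="replace($ieren, 'tən', 'tn̩')"/>
    <!-- The original pronunciation is always kept. -->
    <xsl:copy-of select="."/>
    <!-- Add the alternate only if it differs and is not already listed. -->
    <xsl:if test="$alternate ne $ieren and not(../phoneme = $alternate)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:value-of select="$alternate"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

The duplicate check means that running the stylesheet a second time over its own output should not add the same alternate again.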

It would be great if someone were willing to volunteer to help with the development of Ralf's German dictionary. My concept is as follows (compare with the XSLT concept):

[Image: xslt-concept]
Image source: Wikipedia

Ralf’s German dictionary (current version) = XML input
espeak2perfectipa.xsl = XSLT code
saxonb-xslt (Ubuntu terminal) = XSLT processor
Ralf’s German dictionary (future version) = Result document

The development of Ralf's German dictionary takes place outside of simon and independently of it. simon does a pretty good conversion from IPA to SAMPA during import. So there is no need to worry.

Ralf’s German dictionary has a lot of known weaknesses. It would be great if someone who is interested in the improvement of this dictionary would volunteer. It is not that difficult to get involved.

Everyone is permitted to improve Ralf’s German dictionary (and the corresponding XSLT code espeak2perfectipa.xsl) because both are GPLv3 licensed.

Ralf's German dictionary is the flagship. There are other dictionaries which need improvement:

A. Austrian German
Ralf’s Austrian German dictionary – it is a very small dictionary. The target group is very specific. This dictionary should contain only words that are not included in Ralf's German dictionary. So if you live in Austria, it is intended that you import two dictionaries:
1. Ralf's German dictionary (with 300,000 words);
2. Ralf's Austrian German dictionary (with specific words).

I don’t know much about Austrian German. You can get an impression of how Austrian German sounds when watching the simon video tutorial.

B. Swiss German
Maybe I will release a Swiss German dictionary. If someone from Switzerland is interested in the development of such a dictionary, I could create Ralf's Swiss German dictionary (I haven’t done that so far). I think that there is a GPL word list available at OpenOffice.org. So a volunteer would be welcome. I can help you with the first steps (unmunch, eSpeak, paste). The result would be a PLS dictionary with a vocabulary that only contains words that are specific to Swiss German.

The future Ralf's Swiss German dictionary would be of interest to people who live in Switzerland, including Germans who have moved there. If you have emigrated from Germany to Switzerland, you should get familiar with Swiss German. If you want to stay in Switzerland, learn the language! simon / Ralf's Swiss German dictionary might help you to reach that goal.

C. Medical German
Ralf’s German medical dictionary is targeted at people who are interested in medical education. It is necessary to develop specialised medical dictionaries. The concept is easy:
1. Import Ralf's German dictionary into simon.
2. Import Ralf's German medical dictionary. Then you can train simon to recognize medical terms (e.g. Linsenchirurgie: lɪnzənçiːʀʊʀgiː; Epilepsiebehandlung: ʔeːpiːlɛpsiːbeːandlʊŋ).
3. Develop specialised medical dictionaries: Human anatomy, pharmacology, genetics, etc.

simon could be used by medical students. So if you are a medical student (German language), you can improve Ralf's German medical dictionary. You can add words to this dictionary. Later, in a few years when you become a doctor, you might be able to use your experience with simon / Ralf's German medical dictionary. Develop your own medical pronunciation dictionary, and become a better doctor!

Of course, because we are in a very early stage of development, this is just something for medical students who have enough time.

D. Latin (German pronunciation)
Ralf’s Latin dictionary needs improvement. Latin has pronunciation rules that are different from German. The following way is possible: Improve Ralf's Latin dictionary with an XSLT stylesheet. I explained the concept above. The stylesheet needs information that is specific to the Latin language. Sometimes the Latin e is short (e.g. currere), sometimes it is long (e.g. dēbēre). These things need to be fixed.

And of course, 1.7 million Latin words is too many. The size of this dictionary has to be reduced because of performance issues. A dictionary with about 100,000 Latin words would be optimal at the moment. We don’t yet have a routine (compression algorithm) to handle dictionaries with e.g. 1.7 million words. This has to be developed (maybe someone who is familiar with the unmunch command could do that). But a dictionary with 100,000 Latin words would be a good start.

Latin is a “dead language”. But thanks to simon / Ralf's Latin dictionary, it should be possible to make the computer write down Latin words when you speak them into your microphone. From my point of view, Ralf’s Latin dictionary is something for Latin teachers (school or university). So the target group is very specific. I think that it can be fun for Latin students to use simon for the recognition of spoken Latin words.

E. Conclusion
The different dictionaries need improvement. Interested persons (people from Austria, Switzerland; medical students; Latin teachers) are encouraged to improve the specific dictionary on their own. My dictionaries are GPLv3 licensed. It is intended that someone improves them. This is my concept:

1. unmunch an OpenOffice.org spelling dictionary
2. generate phonemes with eSpeak
3. paste
4. convert eSpeak phonemes into IPA phonemes with XSLT
5. import the resulting PLS dictionary into simon
6. record a few words with simon
7. use simon for recognition (dictate e.g. into a gedit window)
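
Step 4 could start from something like the following sketch. It is not espeak2perfectipa.xsl. The four-character table only covers Kirshenbaum-style ASCII symbols (@, S, N and the length mark :) that I am using as an illustration, so it would have to be checked against the eSpeak phoneme documentation and extended. The input is assumed to have no namespace.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that no other template handles. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Map single ASCII symbols inside phoneme text to IPA characters.
       Illustrative table only: @ to ə, S to ʃ, N to ŋ, : to ː -->
  <xsl:template match="phoneme/text()">
    <xsl:value-of select="translate(., '@SN:', 'əʃŋː')"/>
  </xsl:template>

</xsl:stylesheet>
```

translate() works character by character, so symbols that are written with more than one character would need replace() rules instead.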

This concept should work with dictionaries that use the German pronunciation (Austrian, Swiss, German medical, Latin). I haven’t tested these dictionaries for training / recognition with simon. But the concept is the same. Since Ralf's German dictionary (flagship) works with simon, the other dictionaries with German pronunciation should work, too.

Internet Extensions – trying to make it work

Wednesday, May 20th, 2009

A few days ago, I tried to download an XML file from the internet:

Model Extension with Simon

But as far as I can remember, the import failed. I have to think about a way to transform the XML file (it contains <speak> and <audio> tags) into a different format. I could do this with XSLT/XPath. I would just have to navigate inside the XML file with XPath, and write the result to a different file (maybe as a .txt file?). Or maybe I should use XQuery instead of XSLT? I don’t know at the moment.
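
A first attempt with XSLT might look like the sketch below. I have not verified the structure of the downloaded file: that each <audio> element carries a src attribute and contains the prompt text is an assumption, and so is the tab-separated output format.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Write plain text instead of XML. -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- local-name() is used so that the match works with or without a namespace. -->
    <xsl:for-each select="//*[local-name() = 'audio']">
      <!-- Assumed: the audio file reference is in the src attribute. -->
      <xsl:value-of select="@src"/>
      <xsl:text>&#9;</xsl:text>
      <!-- Assumed: the prompt text is the content of the audio element. -->
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Each output line would then hold one audio reference and its text, separated by a tab.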

I find the concept of Model Extensions pretty good. I just have to find a way to make it work with the already published GPL prompts. And of course, I would like to access several prompts.xml files with just one query. Probably, I should take a closer look at XQuery.

This is the reason why I am trying to learn more about XPath: to make the XML prompts compatible with Simon.

I hope that you don’t find this topic annoying. I published an article with a similar problem a few days ago: Is it possible to import an XML text file?