Archive for the ‘dictionary’ Category

Ralf’s Hungarian speech model

Tuesday, May 15th, 2012

Some words about the creation of Ralf’s Hungarian speech model.

1. I downloaded Ralf’s Hungarian dictionary.

2. Press the Manage scenarios button.

(more…)

Ralf’s modern Greek speech model

Tuesday, May 15th, 2012

Some words about the creation of Ralf’s modern Greek speech model.

1. I used Ralf’s Greek dictionary as the source dictionary. I took a look at this word list, and then, using the style sheet compare-popular-words.xsl, I created the target dictionary popular-greek-words-dictionary.xml with the following command:

cat 'greek-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'compare-popular-words.xsl'

I excluded some <lexeme> elements that contain <phoneme> elements with the phoneme [ʎ] or [ɲ]:

not(contains($dictionary-phoneme, 'ʎ')) and not(contains($dictionary-phoneme, 'ɲ'))

I excluded these phonemes because the Simon PLS import process obviously doesn’t “like” these specific phonemes.

(more…)

Long vowels and the colon mark

Friday, May 11th, 2012

I just found out that my General American PLS dictionary often contains the colon character : instead of the specific IPA Unicode character U+02D0 (ː).

Here is an example:

<lexeme role="">
<grapheme>zorro</grapheme>
<phoneme>zˈɔ:roʊ</phoneme>
</lexeme>

The long o-vowel is marked with a plain colon. (more…)
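A fix like this could be scripted. Here is a minimal sketch with GNU sed, assuming (as in the example above) that each phoneme element sits on its own line; the file names are placeholders:

```shell
# Replace the plain colon with the IPA length mark U+02D0 (ː),
# but only on lines that contain a <phoneme> element.
sed '/<phoneme>/ s/:/ː/g' dictionary.xml > dictionary-fixed.xml
```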

Ralf’s General American speech model 0.1

Thursday, May 10th, 2012

Here is how I create Ralf’s General American speech model version 0.1.

1. Schott’s General American dictionary has to be reduced; I want to use just popular words. Where can I find information about popular words? I found a good source. And what about copyright? I combined several sources, mixed them, and extracted a subset following specific criteria. This means I didn’t create a work that can be called derivative. In short: I produced the dictionary Popular words - GA dictionary with the following command:

cat 'general-american-dictionary.xml.bz2' | bunzip2 -k | saxonb-xslt -ext:on -s:- -xsl:'compare-popular-words.xsl'

(more…)

Import of my GA dictionary; thoughts

Monday, May 7th, 2012

I want to test whether the import of Schott’s General American dictionary works better now. Here is what I do:

1. Linux Mint terminal:

git pull origin master
./build.sh

2. Create a scenario test-us.
3. Import of the dictionary. Take a look at the result:

The word gardening looks much better than before, but it isn’t perfect yet. The primary stress mark has been omitted (which is a good decision; we don’t need the stress marks at the moment, though we will need them in the long run, in a few years or so). But what happened to the secondary stress mark? It became a phoneme with the ASCII transcription perc. This isn’t optimal. At least the secondary stress mark is processed by simon.

Some thoughts about stress marks: We don’t need them right now. This is why Schott’s German dictionary doesn’t contain any stress information. There are long vowels and short vowels; this is sufficient information for dictation.

When I generated Schott’s General American dictionary with eSpeak (I chose eSpeak because I “know” its results; some argue that Festival is better, and that might be the case, but I didn’t want to “learn” Festival), I didn’t want to omit the stress marks. I think that in the long run this is valuable information.

So that’s the current situation:
- Schott's German dictionary doesn’t contain stress marks.
- Schott's General American dictionary contains stress marks. I don’t know whether they are in the right positions according to the IPA specification. But at the moment, I don’t care; if the positions were wrong, it would be easy to move the stress marks to the right positions with an XSLT style sheet.

Some thoughts:

By the way, I recommend XSLT for dictionary development. It is easy to use and distinguishes clearly between lower-case and upper-case, and it isn’t necessary to install some extra library to make it UTF-8 compliant. And you can publish PLS/IPA dictionaries as .xml files (e.g. take a look at Ralf’s Basque IPA FLAC files; it is a PLS dictionary combined with an XSLT style sheet; you can download this dictionary and import it directly into simon, and the full PLS standard is met).

Why do I sometimes use the name Ralf’s … and sometimes the name Schott’s …? The reason is the following: First drafts are published under my nickname Ralf. And as soon as they have evolved, I change the name to Schott’s (which is my real name).

The next step will be to rename Ralf’s German speech model into Schott's German speech model. Maybe the next version of my speech model will carry the new name. I am not sure yet.

Recognizing a Hebrew word

Friday, May 4th, 2012

I just imported Ralf’s Hebrew dictionary into simon after creating a scenario with the name “hebrew.” First, I had to delete the General American dictionary that I had imported recently.

Simon just recognized a Hebrew word.

Special training: Please record the text below. It is possible to record a Hebrew word with simon.

My computer isn’t localized for Hebrew. It would be interesting to know if it works on a localized computer.

Import of the verb confiscated

Thursday, May 3rd, 2012

I imported Schott’s General American dictionary 0.2.1 into simon 0.3.80 as shadow dictionary. Some remarks about the result:
1. Let’s open the Shadow vocabulary tab.

2. The pronunciation of the adjective confiscable is correct. The simon IPA import process has converted the IPA phonemes of confiscable correctly into SAMPA phonemes.

3. The pronunciation of the verb confiscated is garbage. Let’s take a look into Schott’s General American dictionary:

<lexeme role="verb">
<grapheme>confiscated</grapheme>
<phoneme>kˈɒnfɪskˌeɪtɪd</phoneme>
</lexeme>

There are three Unicode characters that might cause the simon IPA import process to produce garbage: the [ɒ] and the two IPA stress marks.

4. The category column contains just words that have a role attribute. Words without a role attribute aren’t listed.

Schott’s General American dictionary 0.2.1 (IPA)

Wednesday, May 2nd, 2012

Today, I generated Schott’s General American dictionary 0.2.1 (IPA edition) with the following command in the Linux Mint terminal:

saxonb-xslt -ext:on -s:general-american-dictionary.xml -xsl:american2ipa.xsl -o:'general-american-0.2.xml'

I used as source Schott’s General American dictionary 0.2 (eSpeak edition).

Then I imported the IPA version of this dictionary into simon. I would like to test this dictionary extensively with simon. It would be nice if the simon PLS import process were adjusted for this dictionary; some phonemes are missing. The phonemes that are incorporated into this dictionary are listed on Wikipedia. Here is a list of phonemes that probably aren’t processed correctly by simon (examples are from Wikipedia):

1. Consonants:

ð thy, breathe, father
θ thigh, math

2. Vowels:
ɑː PALM, father, bra
ɒ LOT, pod, John
æ TRAP, pad, shall, ban
ɔː THOUGHT, Maud, dawn, fall, straw
ʌ STRUT, mud, dull, gun
ɑr START, bard, barn, snarl, star
ɜr NURSE, word, girl, fern, furry

3. Reduced vowels:
ɨ roses, emission
ɵ omission
ʉ beautiful, curriculum

4. Stress
ˈ << primary stress mark; it is similar to an apostrophe and has its own Unicode code point: U+02C8
ˌ << secondary stress mark; it is similar to a comma and has its own Unicode code point: U+02CC

The simon import process should eliminate both stress marks during import (please implement this feature).
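Until simon does this itself, the stress marks could be stripped from the PLS file beforehand. A sketch with sed; the output file name is a placeholder:

```shell
# Delete the primary (U+02C8 ˈ) and secondary (U+02CC ˌ) stress marks.
sed -e 's/ˈ//g' -e 's/ˌ//g' general-american-0.2.xml > general-american-no-stress.xml
```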

Advantages of my dictionary:
- It is bigger than cmudict (cmudict has about 133.000 entries). Schott’s General American dictionary has about 390.000 entries.
- The grapheme elements distinguish correctly between upper-case and lower-case.
- About 20% of the words carry a role attribute (noun, verb, adjective).
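The entry counts above are easy to verify on the XML itself. A quick sketch, assuming one lexeme start tag per entry:

```shell
# Count the entries (lexeme elements) in a PLS dictionary.
grep -c '<lexeme' general-american-dictionary.xml
```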

[Edited: May 3, 2012]

Ralf’s Interlingua dictionary

Tuesday, January 10th, 2012

This article explains how I create the dictionary, and what the imported result looks like in simon.

A. Creation of the PLS dictionary:

1. Get spelling dictionary.
2. License is GPL. It says in the file README_en.txt:

This spell check dictionary for Interlingua is licensed under GPL. [...] This hyphenation rules for Interlingua are licensed under GPL.

This means that I can use this spelling dictionary as source.
3. Extract dict-ia-2010-11-29.oxt.
4. ISO 639-1 language code is ia.
5. Probably I will use this table for grapheme to phoneme conversion.

6. Check the encoding of ia_iso.aff and ia_iso.dic. Both files are encoded in ISO 8859-1. Probably it is best if I convert the encoding of both files into UTF-8.
iconv -f ISO-8859-1 -t UTF-8 < ia_iso.dic > interlingua-utf8.dic
iconv -f ISO-8859-1 -t UTF-8 < ia_iso.aff > interlingua-utf8.aff

Change the first line in interlingua-utf8.aff into SET UTF-8. Both files contain CRLF at the end of each line (Windows mode). I don’t know whether this is ok with the unmunch command. I will check it out:

ubuntu@ubuntu:~/Documents/2011-II/Interlingua$ unmunch interlingua-utf8.dic interlingua-utf8.aff > interlingua-wordlist

Obviously, it worked. The CRLF is part of the source files. The target file contains just a LF (Unix mode). There are a lot of duplicate entries. I think that these duplicate entries will be removed later by an .xsl script.
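Both issues noted above (the CRLF line endings and the duplicate entries) could also be handled directly on the word list instead of later in the .xsl script. A sketch with GNU sed and sort; the output file name is a placeholder:

```shell
# Strip the trailing CR from each line, then drop duplicate words.
sed 's/\r$//' interlingua-wordlist | sort -u > interlingua-wordlist-clean
```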

7. Add lexicon tags at the beginning and the end of interlingua-wordlist.
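This wrapping can also be done non-interactively. A minimal sketch (the output name is a placeholder; a complete PLS lexicon element would also carry version, xmlns and xml:lang attributes, which I assume the later style sheets take care of):

```shell
# Prepend <lexicon> and append </lexicon> to the word list.
{ echo '<lexicon>'; cat interlingua-wordlist; echo '</lexicon>'; } > interlingua-wordlist-wrapped
```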

8. Create XML file:

ubuntu@ubuntu:~/Documents/2011-II/Interlingua$ saxonb-xslt -s:interlingua-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:interlingua.xml

9. Create PLS dictionary:

ubuntu@ubuntu:~/Documents/2011-II/Interlingua$ saxonb-xslt -s:interlingua.xml -xsl:'improve-interlingua.xsl' -o:interlingua-dictionary.xml

B. Download the dictionary. Import it into simon.

The left column contains the words. The pronunciation column contains the corresponding SAMPA transcriptions. The Category column contains just “Unknown” entries.

Now you know how I created the dictionary and what the result looks like in simon.

Ralf’s Arabic dictionary

Tuesday, January 10th, 2012

This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon.

A. Creation of the dictionary:

1. Get Arabic spelling dictionary.
2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file:

GPL 2.0/LGPL 2.1/MPL 1.1 tri-license

This means that I can use this tri-licensed spelling dictionary as source for my future GPLv3 PLS dictionary.

3. Now I have to extract dict_ar-3.0.oxt.
4. Let’s try the unmunch command inside the Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ unmunch ar.dic ar.aff > arabic

It failed. I wasn’t able to unmunch the word list.
5. I have to remove all numbers from ar.dic. This can be done with the sed command:

sed 's/[0-9]*//g' ar.dic > arabic-without-numbers

6. Remove the slash (“/”) from arabic-without-numbers with Geany.
7. Add lexicon tags at the beginning and the end of the file.
8. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:arabic.xml

9. ISO 639-1 language code is ar.
10. Maybe I will use this table for the grapheme to phoneme conversion.
11. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic.xml -xsl:'improve-arabic.xsl' -o:arabic-dictionary.xml

I have to remove the number sign (“#”) with Geany from arabic.xml.
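Steps 5 and 6 above (removing the digits and the slashes) could also be combined into one non-interactive sed call. A sketch:

```shell
# Delete all digits and all slashes from the raw word list in one pass.
sed -e 's/[0-9]*//g' -e 's,/,,g' ar.dic > arabic-without-numbers
```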

B. Download the dictionary. Import it into simon.

The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column contains just entries with “Unknown”. This is because the PLS dictionary contains no role attributes.

Now you know how I created the dictionary. And you know what the result looks like in simon.

Ralf’s Hebrew dictionary

Tuesday, January 10th, 2012

In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon.

A. Creation of the dictionary:

1. Get Hebrew spelling dictionary from OpenOffice.org.
2. License is GPL. There is a copyright notice inside the file he_IL.aff.

3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test

4. The source file he_IL.dic contains a lot of numbers. I remove them with the Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ sed 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers

With Geany, I remove the “,” (commas) and the “/” (slashes) that are still included in the file hebrew-without-numbers. Now I have a clean word list with 43.000 Hebrew words.
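Removing the commas and slashes doesn’t require an editor; tr can do it in one pass. A sketch (the output file name is a placeholder):

```shell
# Delete every comma and slash from the word list.
tr -d ',/' < hebrew-without-numbers > hebrew-wordlist
```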

5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.
6. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:hebrew.xml

7. ISO 639-1 language code is he.
8. I need a table for grapheme to phoneme conversion. Maybe I will use this table. There are several tables available at Wikipedia. I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew share the same alphabet. This means I could try to use the Yiddish improve-yiddish.xsl style sheet:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml

The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn’t been converted: [א]. I will add this letter to the .xsl style sheet, which I name improve-hebrew.xsl. Now I try it again:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'improve-hebrew.xsl' -o:hebrew-dictionary.xml

The result is not so good. Maybe I should adjust the grapheme-to-phoneme conversion rules for Modern Standard Israeli Hebrew. Or is this not necessary? I think that for a first draft I can use the Yiddish transformation rules.

B. Download the dictionary. Import it into simon as shadow dictionary.

Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or to be more exact: it displays just Unknown) since the source PLS dictionary contains no role attributes.

Now you know how I created the dictionary. And you know what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to Modern Standard Israeli Hebrew. It shouldn’t be a problem to adjust the style sheet improve-hebrew.xsl so that the phoneme results are better.

Ralf’s Belarusian dictionary

Monday, January 9th, 2012

This article explains how I create this PLS dictionary and what the imported result looks like.

A. Creation of the Belarusian PLS dictionary:

1. Get spelling dictionary. I choose the official orthography.
2. License is LGPL (see hyph_be_BY.dic). I am allowed to “convert any LGPLed piece of software into a GPLed piece of software.” I did this before. And I will do it again. This means that I get a spelling dictionary that is licensed under the LGPL. And I will produce a pronunciation dictionary that is licensed under the GPLv3. By the way, all my dictionaries are licensed under the GPLv3.
3. Extract dict-be-official.oxt.

4. The file be-official.aff is encoded in UTF-8. The file be-official.dic may be encoded in ISO-8859-1; at least that is the encoding Geany displays. I believe that be-official.dic is actually encoded in microsoft-cp1251. I have seen this encoding before (Macedonian and Bulgarian).
Now it is time to use the Ubuntu terminal:
cd /home/ubuntu/Documents/2011-II/Belarusian
iconv -f cp1251 -t UTF-8 <be-official.dic >belarusian-utf8.dic

The text file belarusian-utf8.dic looks fine.

5. Now I change the line SET microsoft-cp1251 in the file be-official.aff into SET UTF-8
6. I don’t know whether the next step is necessary. I could convert the file hyph_be_BY.dic from cp1251 into UTF-8. At the moment, I skip this step.

7. Ubuntu terminal:

unmunch belarusian-utf8.dic be-official.aff > belarusian-wordlist

I think that this step wasn’t necessary; it didn’t extract the word list. At the moment, I have a word list of 1.5 million words. This is way too much. I have to reduce the dictionary size; the target size is 400.000 words.

8. I have to reduce the dictionary size. I found a tip. Ubuntu terminal:

sed -n 'p;N;N;N' belarusian-wordlist > belarusian-wordlist-reduced

Yes, it worked. The word list now contains 391.000 words. This is a good basis for a PLS dictionary.
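The sed one-liner works like this: p prints the current line, and each N pulls one more line into the pattern space without printing it (because of -n), so only every 4th line survives. That matches the reduction from 1.5 million to about 391.000 words. A small demonstration:

```shell
# Keep lines 1, 5, 9, ...: print one line, swallow the next three.
printf 'a\nb\nc\nd\ne\n' | sed -n 'p;N;N;N'
```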

9. Add lexicon elements at the beginning and the end of belarusian-wordlist-reduced.
10. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian-wordlist-reduced -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:belarusian.xml

11. Language code is be.
12. I will use this table for grapheme to phoneme mapping.
13. Creation of the phoneme elements:

ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian.xml -xsl:'improve-belarusian.xsl' -o:belarusian-dictionary.xml

B. Download and import the dictionary.

Let’s take a look at the result. The left column contains 391669 Belarusian words. The pronunciation column contains the corresponding SAMPA transcriptions. All entries in the third column are marked as “Unknown”. This is because the Belarusian PLS dictionary doesn’t contain any role attribute.

Now you know how I created the dictionary. And you got an impression of what the result looks like when imported into simon.

Ralf’s Asturian dictionary

Thursday, January 5th, 2012

This article explains how I create the Asturian PLS dictionary, and some words about the import into simon.

A. How I create the dictionary:
1. Get spelling dictionary.
2. Check license. It is GPLv3.
3. Extract asturianu.oxt.
4. Language code is ast.
5. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist

The result is a file of 70 MB with more than 5 million words. This word list is too big; I should reduce it. I had the same problem with my Latin dictionary and had to reduce its size there, too.

6. Add lexicon elements at the beginning/end of asturian-wordlist.

7. Generate .xml document with lexicon, lexeme and grapheme elements:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

I got an error message because there isn’t enough memory available (“Java heap space”). I think that I should reduce the file size with grep. Or I could install VisualVM. I think I will work with grep:
a. Remove lines that begin with l’: ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ grep -v ^l\’ asturian-wordlist > asturian-wordlist-02
b. Remove lines that begin with t’: grep -v ^t\’ asturian-wordlist-02 > asturian-wordlist-03
c. Remove lines that begin with s’: grep -v ^s\’ asturian-wordlist-03 > asturian-wordlist-04
d. Remove lines that begin with m’: grep -v ^m\’ asturian-wordlist-04 > asturian-wordlist-05
e. Remove lines that begin with n’: grep -v ^n\’ asturian-wordlist-05 > asturian-wordlist-06
f. Remove lines that begin with d’: grep -v ^d\’ asturian-wordlist-06 > asturian-wordlist-07
g. Remove lines that begin with qu’: grep -v ^qu\’ asturian-wordlist-07 > asturian-wordlist-08
h. Remove lines that begin with p’: grep -v ^p\’ asturian-wordlist-08 > asturian-wordlist-09
The dictionary will contain 1.1 million words. I think that number is acceptable.
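The eight grep calls in steps a–h could also be collapsed into a single extended-regex pass. A sketch, assuming the apostrophe in the pattern is the same character that appears in the word list:

```shell
# Drop every line that starts with one of the elided prefixes.
grep -vE "^(l|t|s|m|n|d|qu|p)’" asturian-wordlist > asturian-wordlist-09
```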

8. And now Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-09 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

This command creates a PLS dictionary without phoneme elements. The phoneme elements will be added later.

9. I will use this table for grapheme to phoneme conversion. Here is the command that creates the phoneme elements:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian.xml -xsl:'improve-asturian.xsl' -o:asturian-dictionary.xml

10. I tried to import the resulting dictionary into simon. Unfortunately, simon didn’t react any more after the import had been finished. I assume that the dictionary is way too big. I have to reduce its size, again.
a. Remove lines that contain ‘l: grep -v \’l asturian-wordlist-09 > asturian-wordlist-10
b. Continue to reduce the size of the wordlist: grep -v ylu asturian-wordlist-10 > asturian-wordlist-11
c. This isn’t enough, I have to remove about 80.000 words: grep -v les asturian-wordlist-11 > asturian-wordlist-12
d. Remove 136.000 words: grep -v mos asturian-wordlist-12 > asturian-wordlist-13
e. Remove 67.000 words: grep -v los asturian-wordlist-13 > asturian-wordlist-14
f. Remove 265.000 words: grep -v es asturian-wordlist-14 > asturian-wordlist-15
You can see that it is a lot of work to get a dictionary size that is suitable for simon. At the moment, the word list contains 539.000 words. Is this number OK, or should I continue to reduce the size? I think that I will try it as it is. Again, I create an .xml file:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-15 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

And now I repeat step 9. The file asturian-dictionary.xml has a size of 45 MB. I hope that this size is OK.

B. Download the dictionary. Import it into simon.

Take a look at the result. In the left column, you can see the Asturian words. This dictionary contains 539928 words. The right column contains the corresponding SAMPA transcriptions.

As you can see, it was a lot of work to reduce the size of the dictionary. At least it now has a size that isn’t too big for simon.

Ralf’s Yiddish dictionary

Tuesday, January 3rd, 2012

This article explains some details about the creation of the dictionary, and what the result looks like in simon.

A. How I create Ralf's Yiddish dictionary:

1. Get spelling dictionary.
2. License is GPLv3.
3. Extract jidysz.net.ooo.spellchecker.oxt.
4. Ubuntu terminal:
cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries
sudo apt-get install hunspell-tools
unmunch yi.dic yi.aff > yiddish-wordlist

5. Add <lexicon> at the beginning of yiddish-wordlist. Add </lexicon> at the end of this file.
6. Generate .xml document with lexicon, lexeme and grapheme elements:

ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:yiddish.xml

7. ISO 639-1 language code is yi.
8. I think I will use this table as source for the grapheme to phoneme mapping.
9. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish.xml -xsl:'improve-yiddish.xsl' -o:yiddish-dictionary.xml

B. Download the dictionary, and import it into simon.

Take a look at the result. The left column contains the Yiddish words. This dictionary contains 99980 words. The right column contains the corresponding SAMPA transcription.
Yiddish is written in the Hebrew alphabet. The Hebrew alphabet is written from right to left. Obviously, the corresponding SAMPA transcriptions are written from left to right. This means that the phoneme order should be fine.

There are a lot of other PLS dictionaries available. Find the PLS dictionary that suits your language.

Schott’s German dictionary 0.2.8

Tuesday, November 1st, 2011

Here is how I create Schott’s German dictionary 0.2.8 (with the style sheet improve-german.xsl):

1. Replace 152 matches:

<xsl:when test="contains(lower-case(../grapheme), 'planung')"><xsl:value-of select="replace($sierra, 'planʊŋ','plaːnʊŋ')"/></xsl:when>

2. Replace 178 matches:

<xsl:when test="contains(lower-case(../grapheme), 'fußball')"><xsl:value-of select="replace($sierra, 'fʊsbal','fuːsbal')"/></xsl:when>

A lot of other small changes have been made. Please, import Schott’s German dictionary (author: Kai Schott) into simon.

Schott’s German dictionary 0.2.7

Wednesday, May 25th, 2011

With the style sheet improve-german.xsl, I create Schott's German dictionary version 0.2.7: (more…)

Mapping between IPA and SAMPA

Thursday, March 10th, 2011

I agree with this statement:

The IPA phonemes are completely language independent. Therefore the mapping
between IPA <-> SAMPA is fixed. The mapping should, of course, be extended
to cover all available IPA phonemes. However, as the conversion is static,
I don’t think the rules need to be dynamic.

If you have any IPA phonemes that are not correctly converted to SAMPA
during the import, please report those as bugs!

Schott’s German dictionary 0.2.6

Saturday, March 5th, 2011

Here is how I create Schott's German dictionary version 0.2.6. The style sheet improve-german.xsl contains the following transformation rules:

1. Replace 209 matches:

<xsl:when test="contains(lower-case(../grapheme), 'position')"><xsl:value-of select="replace($sierra, 'poːziːt͡sɪoːn','pozit͡si̯oːn')"/></xsl:when>

2. Replace 983 matches:

<xsl:when test="contains(lower-case(../grapheme), 'tion')"><xsl:value-of select="replace($sierra, 't͡sɪoːn','t͡si̯oːn')"/></xsl:when>

3. Replace 107 matches: (more…)

Schott’s German dictionary 0.2.5

Tuesday, February 22nd, 2011

Here is how I create Schott's German dictionary version 0.2.5 (former name: Ralf's German dictionary). Within the XSLT style sheet improve-german.xsl, I implement these transformation rules:

1. About 70 occurrences have to be replaced:

<xsl:when test="contains(lower-case(../grapheme), 'alkohol')"><xsl:value-of select="replace($sierra, 'alkoːɔl','alkohoːl')"/></xsl:when>

2. Replace about 50 occurrences:

<xsl:when test="contains(lower-case(../grapheme), 'vertret')"><xsl:value-of select="replace($sierra, 'vɐtʀeːt','fɐtʀeːt')"/></xsl:when>

3. Replace 32 matches:

<xsl:when test="contains(lower-case(../grapheme), 'aluminium')"><xsl:value-of select="replace($sierra, 'aluːmiːniːʊm','alumiːni̯ʊm')"/></xsl:when>

4. Replace 386 matches: (more…)

German: pronunciation of ‘Moral’

Thursday, February 17th, 2011

How should the German word “Moral” be transcribed? Take a look into Ralf's German dictionary:

<lexeme role="Substantiv">
<grapheme>Moral</grapheme>
<phoneme>moːʀaːl</phoneme>
</lexeme>

And compare this with the Wikipedia:

If one also takes into account the vowels of borrowed words (loanwords and foreign words), a series of closed short vowels is added. This can be illustrated well with the words “Post” (with a short, open [ɔ]), “Moral” (with a short, closed [o]) and “Koma” (with a long, closed [oː]). (The vowel length differs between “Moral” and “Koma”.) These closed short vowels occur not only among the “o” sounds.

This means that I could use /moʀaːl/ instead of /moːʀaːl/.

And now take a look into the Wiktionary:

[oː] U+006F (o), U+02D0 Bote /[ˈboːtə]/, Ofen /[ˈoːfɱ̍]/, roh /[ʀoː]/
[o] U+006F only in foreign words: Phonologie /[ˌfonoloˈgiː]/, Motiv /[moˈtiːf]/
[ɔ] U+0254 Gott /[gɔt]/, Post /[pɔst]/, offen /[ˈɔfɱ̍]/

Three different o-vowels. I often find it difficult to distinguish between [oː] and [o]. Take a look again into Ralf's German dictionary:

<lexeme role="Substantiv">
<grapheme>Post</grapheme>
<phoneme>pɔst</phoneme>
</lexeme>

It is clear that /pɔst/ is the correct pronunciation.