Posts Tagged ‘saxonb-xslt’

Ralf’s Arabic dictionary

Tuesday, January 10th, 2012

This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon.

A. Creation of the dictionary:

1. Get Arabic spelling dictionary.
2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file:

GPL 2.0/LGPL 2.1/MPL 1.1 tri-license

This means that I can use this tri-licensed spelling dictionary as source for my future GPLv3 PLS dictionary.

3. Now I have to extract dict_ar-3.0.oxt.
4. Let’s try the unmunch command inside the Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ unmunch ar.dic ar.aff > arabic

It failed. I wasn’t able to unmunch the word list.
5. I have to remove all numbers from ar.dic. This can be done with the sed command:

sed 's/[0-9]*//g' ar.dic > arabic-without-numbers

6. Remove the slash (“/”) from arabic-without-numbers with Geany.
7. Add lexicon tags at the beginning and the end of the file.
8. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:arabic.xml

9. ISO 639-1 language code is ar.
10. Maybe I will use this table for the grapheme to phoneme conversion.
11. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic.xml -xsl:'improve-arabic.xsl' -o:arabic-dictionary.xml

I have to remove the number sign (“#”) with Geany from arabic.xml.
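Steps 5 to 7 can also be done without an editor. The following is only a minimal sketch under stated assumptions: the file follows the usual Hunspell .dic layout (the first line holds the entry count, and affix flags follow a slash), and the three entries are hypothetical placeholders, not words taken from ar.dic. Unlike the sed call above, it cuts each line at the first slash instead of deleting digits.

```shell
#!/bin/sh
# Hypothetical three-entry sample in Hunspell .dic layout:
# line 1 is the entry count, affix flags follow a slash.
workdir=$(mktemp -d)
printf '3\nkitab/12\nqalam/7,15\nbayt\n' > "$workdir/sample.dic"

# Drop the count line, cut every entry at the first slash,
# and skip any empty lines that remain.
sed -e '1d' -e 's|/.*$||' -e '/^$/d' "$workdir/sample.dic" > "$workdir/wordlist"

# Wrap the word list in a lexicon element (step 7).
{
  printf '<lexicon>\n'
  cat "$workdir/wordlist"
  printf '</lexicon>\n'
} > "$workdir/wordlist-lexicon"

cat "$workdir/wordlist-lexicon"
```

Whether create-xml-file.xsl accepts exactly this layout is an assumption based on the description in step 7.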

B. Download the dictionary. Import it into simon.

The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column contains just entries with “Unknown”. This is because the PLS dictionary contains no role attributes.

Now you know how I created the dictionary and what the result looks like in simon.

Ralf’s Hebrew dictionary

Tuesday, January 10th, 2012

In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary and what the result looks like when imported into simon.

A. Creation of the dictionary:

1. Get Hebrew spelling dictionary from OpenOffice.org.
2. License is GPL. There is a copyright notice inside the file he_IL.aff.

3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test

4. The source file he_IL.dic contains a lot of numbers. I remove them with the Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ sed 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers

With Geany, I remove the commas (“,”) and slashes (“/”) that are still in the file hebrew-without-numbers. Now I have a clean word list with 43,000 Hebrew words.

5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.
6. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:hebrew.xml

7. ISO 639-1 language code is he.
8. I need a table for grapheme to phoneme conversion. Maybe I will use this table. There are several tables available at Wikipedia. I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew share the same alphabet. This means I could try to use the Yiddish improve-yiddish.xsl style sheet:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml

The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn’t been converted: [א] I will add this phone to the .xsl style sheet with the name improve-hebrew.xsl. Now I try it again:

ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'improve-hebrew.xsl' -o:hebrew-dictionary.xml

The result is not so good: Maybe I should adjust the grapheme to phoneme conversion rules for modern standard Israeli Hebrew. Or is this not necessary? I think for a first draft I can use the Yiddish transformation rules.
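A quick way to spot letters that no rule has converted, like [א] above, is to look for anything non-ASCII in a SAMPA word list. This is a rough sketch, not the improve-hebrew.xsl style sheet: it imitates the letter mapping with sed, the four letter rules are illustrative placeholders rather than a vetted Hebrew or Yiddish table, and it assumes the target transcription is plain ASCII SAMPA.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Two sample graphemes; the second contains א, which has no rule below.
printf 'שלום\nאש\n' > "$workdir/graphemes"

# Illustrative letter rules only (placeholders, not a vetted table).
# Each rule emits the symbol plus a space; the last rule trims the line end.
sed -e 's/ש/S /g' -e 's/ל/l /g' -e 's/ו/o /g' -e 's/ם/m /g' -e 's/ *$//' \
  "$workdir/graphemes" > "$workdir/phonemes"

# Any byte outside printable ASCII means a letter was left unconverted.
LC_ALL=C grep -n '[^ -~]' "$workdir/phonemes" > "$workdir/unconverted" || true

cat "$workdir/unconverted"
```

Every line number listed in the file unconverted points to an entry that still needs a rule.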

B. Download the dictionary. Import it into simon as shadow dictionary.

Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or, to be more exact, displays just “Unknown”) since the source PLS dictionary contains no role attributes.

Now you know how I created the dictionary and what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn’t be a problem to adjust the style sheet improve-hebrew.xsl so that the phoneme results are better.

Ralf’s Asturian dictionary

Thursday, January 5th, 2012

This article explains how I create the Asturian PLS dictionary and says a few words about importing it into simon.

A. How I create the dictionary:
1. Get spelling dictionary.
2. Check license. It is GPLv3.
3. Extract asturianu.oxt.
4. Language code is ast.
5. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist

The result is a file of 70MB with more than 5 million words. This word list is too big. I should reduce it. I had the same problem with my Latin dictionary. I had to reduce the size.

6. Add lexicon elements at the beginning/end of asturian-wordlist.

7. Generate .xml document with lexicon, lexeme and grapheme elements:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

I got an error message because there isn’t enough memory available (“Java heap space”). I think that I should reduce the file size with grep. Alternatively, I could install VisualVM. I think I will work with grep:
a. Remove lines that begin with l’: ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ grep -v ^l\’ asturian-wordlist > asturian-wordlist-02
b. Remove lines that begin with t’: grep -v ^t\’ asturian-wordlist-02 > asturian-wordlist-03
c. Remove lines that begin with s’: grep -v ^s\’ asturian-wordlist-03 > asturian-wordlist-04
d. Remove lines that begin with m’: grep -v ^m\’ asturian-wordlist-04 > asturian-wordlist-05
e. Remove lines that begin with n’: grep -v ^n\’ asturian-wordlist-05 > asturian-wordlist-06
f. Remove lines that begin with d’: grep -v ^d\’ asturian-wordlist-06 > asturian-wordlist-07
g. Remove lines that begin with qu’: grep -v ^qu\’ asturian-wordlist-07 > asturian-wordlist-08
h. Remove lines that begin with p’: grep -v ^p\’ asturian-wordlist-08 > asturian-wordlist-09
The dictionary will contain 1.1 million words. I think that that number is acceptable.
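The eight grep calls in a. to h. can be folded into a single pass. This is a sketch on a hypothetical five-line sample; it assumes that the apostrophe in the word list is the plain ASCII apostrophe (the typographic one shown above may only be a display effect).

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample entries standing in for asturian-wordlist.
printf "l'agua\ncasa\nqu'el\nperru\nd'esti\n" > "$workdir/wordlist"

# Drop every line that starts with one of the elided forms
# l', t', s', m', n', d', qu' or p' in a single grep call.
grep -v -E "^(l|t|s|m|n|d|qu|p)'" "$workdir/wordlist" > "$workdir/reduced"

# Compare the line counts before and after the reduction.
wc -l < "$workdir/wordlist"
wc -l < "$workdir/reduced"
```

One pass reads the 70 MB file once instead of eight times and leaves no intermediate files behind.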

8. And now Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-09 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

This command creates a PLS dictionary without phoneme elements. The phoneme elements will be added later.

9. I will use this table for grapheme to phoneme conversion. Here is the command that creates the phoneme elements:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian.xml -xsl:'improve-asturian.xsl' -o:asturian-dictionary.xml

10. I tried to import the resulting dictionary into simon. Unfortunately, simon didn’t react any more after the import had been finished. I assume that the dictionary is way too big. I have to reduce its size, again.
a. Remove lines that contain ‘l: grep -v \’l asturian-wordlist-09 > asturian-wordlist-10
b. Continue to reduce the size of the wordlist: grep -v ylu asturian-wordlist-10 > asturian-wordlist-11
c. This isn’t enough, I have to remove about 80,000 words: grep -v les asturian-wordlist-11 > asturian-wordlist-12
d. Remove 136,000 words: grep -v mos asturian-wordlist-12 > asturian-wordlist-13
e. Remove 67,000 words: grep -v los asturian-wordlist-13 > asturian-wordlist-14
f. Remove 265,000 words: grep -v es asturian-wordlist-14 > asturian-wordlist-15
You see it is a lot of work to get a dictionary size that is suitable for simon. At the moment, the word list contains 539,000 words. Is this number OK, or should I continue to reduce the size? I think that I will try it again. Again, I will create an .xml file:

ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-15 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml

And now I repeat step 9. The file asturian-dictionary.xml has a size of 45 MB. I hope that this size is OK.
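Before removing lines as in step 10, it can help to count how many lines each pattern would hit. The following sketch uses a hypothetical five-line sample instead of the real word list; the patterns are the ones from step 10.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample entries standing in for the Asturian word list.
printf 'casa\ncases\nfalamos\nperru\nelles\n' > "$workdir/wordlist"

# Report, per pattern, how many lines contain it (fixed-string match).
: > "$workdir/report"
for pattern in les mos los es; do
  # grep -c prints 0 and exits with status 1 when nothing matches.
  count=$(grep -c -F -- "$pattern" "$workdir/wordlist" || true)
  printf '%s %s\n' "$pattern" "$count" >> "$workdir/report"
done

cat "$workdir/report"
```

Note that every line containing les also contains es, so the counts overlap and the order of the removals matters.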

B. Download the dictionary. Import it into simon.

Take a look at the result. In the left column, you can see the Asturian words. This dictionary contains 539928 words. The right column contains the corresponding SAMPA transcriptions.

You could see that it was a lot of work to reduce the size of the dictionary. At least, now it has a size that isn’t too big for simon.

Ralf’s Yiddish dictionary

Tuesday, January 3rd, 2012

This article explains some details about the creation of the dictionary and what the result looks like in simon.

A. How I create Ralf's Yiddish dictionary:

1. Get spelling dictionary.
2. License is GPLv3.
3. Extract jidysz.net.ooo.spellchecker.oxt.
4. Ubuntu terminal:
cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries
sudo apt-get install hunspell-tools
unmunch yi.dic yi.aff > yiddish-wordlist

5. Add <lexicon> at the beginning of yiddish-wordlist. Add </lexicon> at the end of this file.
6. Generate .xml document with lexicon, lexeme and grapheme elements:

ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:yiddish.xml

7. ISO 639-1 language code is yi.
8. I think I will use this table as source for the grapheme to phoneme mapping.
9. Ubuntu terminal:

ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish.xml -xsl:'improve-yiddish.xsl' -o:yiddish-dictionary.xml

B. Download the dictionary, and import it into simon.

Take a look at the result. The left column contains the Yiddish words. This dictionary contains 99980 words. The right column contains the corresponding SAMPA transcription.
Yiddish is written in the Hebrew alphabet. The Hebrew alphabet is written from right to left. Obviously, the corresponding SAMPA transcriptions are written from left to right. This means that the phoneme order should be fine.

There are a lot of other PLS dictionaries available. Find the PLS dictionary that suits your language.

Ralf’s German dictionary 0.1.9.7

Thursday, June 17th, 2010

This article explains how I am preparing Ralf's German dictionary 0.1.9.7. At the moment, I am focusing on the creation of new <phoneme> elements:

1. Add the following rule to improve-german.xsl:

<xsl:when test="ends-with(lower-case(../grapheme), 'ier') and
ends-with(., 'iːʀ') and
not(ends-with(../grapheme, 'eier'))"><xsl:value-of select="replace($sierra, 'iːʀ', 'iːɐ̯')"/></xsl:when>

2. Add this rule:

<xsl:when test="ends-with(../grapheme, 'gen') and
not(ends-with(../grapheme, 'ngen'))"><xsl:value-of select="replace($sierra, 'gən', 'gŋ̩')"/></xsl:when>

3. Invoke the following instruction via Ubuntu terminal because I want to test whether improve-german.xsl produces the desired results:

ubuntu@ubuntu-desktop:~$ saxonb-xslt -ext:on -s:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.6.xml' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.4/improve-german.xsl' -o:'/home/ubuntu/Documents/201006/german-0.1.9.5/german-0.1.9.7.xml'

4. Add the following rule:

<xsl:when test="contains(../grapheme, 'gens') and
not(ends-with(../grapheme, 'ngens'))"><xsl:value-of select="replace($sierra, 'gəns', 'gŋ̩s')"/></xsl:when>

What does this rule create? Here is an example:
Source dictionary:

<lexeme role="Substantiv">
<grapheme>Volkswagens</grapheme>
<phoneme>fɔlksvagəns</phoneme>
</lexeme>

Object dictionary:

<lexeme role="Substantiv">
<grapheme>Volkswagens</grapheme>
<phoneme>fɔlksvagəns</phoneme>
<phoneme>fɔlksvagŋ̩s</phoneme>
</lexeme>

You can see that I am using XSLT to produce additional <phoneme> elements.

5. Add this rule:

<xsl:when test="ends-with(../grapheme, 'bens')"><xsl:value-of select="replace($sierra, 'bəns', 'bm̩s')"/></xsl:when>

Example: Source dictionary:

<lexeme role="Substantiv">
<grapheme>Schneetreibens</grapheme>
<phoneme>ʃneːtʀaɪ̯bəns</phoneme>
</lexeme>

Object dictionary:

<lexeme role="Substantiv">
<grapheme>Schneetreibens</grapheme>
<phoneme>ʃneːtʀaɪ̯bəns</phoneme>
<phoneme>ʃneːtʀaɪ̯bm̩s</phoneme>
</lexeme>

6. Add the following rule:

<xsl:when test="ends-with(../grapheme, 'ben')"><xsl:value-of select="replace($sierra, 'bən', 'bm̩')"/></xsl:when>

Example from the source dictionary:

<lexeme role="Verb">
<grapheme>ausgeben</grapheme>
<phoneme>ʔaʊ̯sgeːbən</phoneme>
</lexeme>

Target dictionary:

<lexeme role="Verb">
<grapheme>ausgeben</grapheme>
<phoneme>ʔaʊ̯sgeːbən</phoneme>
<phoneme>ʔaʊ̯sgeːbm̩</phoneme>
</lexeme>

With this rule, I added about 1400 <phoneme> elements. It would be too much work to do this manually. Thanks to saxonb-xslt I can work efficiently and precisely.
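A rough way to estimate in advance how many entries a rule such as ends-with(../grapheme, 'ben') could touch is to count the matching <grapheme> lines. This sketch assumes one <grapheme> element per line, as in the examples above; the sample file is hypothetical (the word leben is added for illustration) and omits the <phoneme> elements for brevity. Note that the pattern is a plain substring match, so it also counts graphemes ending in bens.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical excerpt of a PLS dictionary, phoneme elements omitted.
cat > "$workdir/sample.xml" <<'EOF'
<lexicon>
<lexeme role="Verb">
<grapheme>ausgeben</grapheme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Schneetreibens</grapheme>
</lexeme>
<lexeme role="Verb">
<grapheme>leben</grapheme>
</lexeme>
</lexicon>
EOF

# Count graphemes that end in "ben" (the closing tag anchors the match).
count=$(grep -c 'ben</grapheme>' "$workdir/sample.xml")
printf '%s\n' "$count"
```

The count is only an upper bound for the new <phoneme> elements, because the style sheet also has to find the matching sound sequence in the existing phoneme.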

7. If you have suggestions for improvements of Ralf's German dictionary, please tell me. At the moment, the dictionary contains 384067 <lexeme> elements. I don’t want to add more words to the dictionary at the moment. I want to improve the phoneme quality. There are a lot of <grapheme> elements that have more than one possible pronunciation, so my current focus is to add a <phoneme> element wherever it seems appropriate. If you are missing something, please tell me.


Clear button; improve phoneme

Wednesday, February 3rd, 2010

1. This is what I did a couple of minutes ago:

$ cd Documents/201001/speech2text
$ git pull origin master
$ ./build_ubuntu.sh

After starting simon, I can see that a Clear button is available:

[Screenshot: Clear button]

It should now be possible to delete the active dictionary.

2. Let me make an additional remark about the phonemes of Ralf's German dictionary. The phonemes

S t E n d e: k E m pf @
S t E n d e: O R g a n i: z a ts I o: n

are not optimal. e: indicates a long vowel. Instead, there should be the short vowel @. If you watch the video “200 German words”, you can hear how I pronounce these words. I pronounce them with e: (long e) instead of @ (short e). Such problems occur often in Ralf's German dictionary.

How do I modify Ralf's German dictionary? I use the Ubuntu terminal:

$ saxonb-xslt -ext:on -s:german-dictionary-0.1.7.xml -xsl:modify-german-dictionary.xsl -o:prepare-0.1.8.xml

To avoid a “Java heap space” error, I run VisualVM in the background. The result is that it is possible to modify Ralf's German dictionary with the XSLT style sheet.

Why am I telling you these details that are not directly related to simon? Because you need a pronunciation dictionary if you want to use simon for dictation. And it is necessary to improve the quality of the phonemes that are contained in Ralf's German dictionary.

3. It is possible to modify phonemes with simon using the Edit Word button. My approach with saxonb-xslt is necessary for dictionary development because this way it is possible to modify lots of <lexeme> elements.

4. I removed the active and the shadow dictionary using the Clear button. What about an Export dictionary button? If someone edits the dictionary with simon (Edit Word button), he may want to export the dictionary.

Import of prompts 01 failed

Tuesday, October 13th, 2009

This is what I am currently doing: I checked out revision 1040.

I want to import the prompts 01 into simon. First, I had to convert the 40 flac files into wav files with the following command:

liberty@liberty-desktop:~/200910/editing-ralfherzog/01$ for f in *.flac; do sox "$f" -t wav -r 16000 -s -c 1 "subfolder/${f%.flac}.wav"; done

Then I transformed the file http://script.blau.in/german/01/prompts.xml into (almost) HTK compatible format with the following command:

liberty@liberty-desktop:~/200910/editing-ralfherzog/01$ saxonb-xslt -ext:on -o:PROMPTS01 -xsl:transform-ssml-prompts.xsl -s:prompts.xml

The stylesheet transform-ssml-prompts.xsl has the following content:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- 20091013; license: GPL -->
<xsl:output method="text"/>
<xsl:template match="speak">
<xsl:for-each select="audio">
<xsl:value-of select="replace(@src, 'flac','wav')"/>
<xsl:text> </xsl:text>
<xsl:value-of select="."/><xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

I think that I found an error. I forgot to capitalize the prompts with the XPath expression upper-case(). I will have to correct the stylesheet. Probably that is the reason why the Import Trainingsdata function didn’t work out.
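Until the stylesheet is corrected, the prompt text could also be capitalized after the transformation. This is a sketch under assumptions: each line of the output consists of a file name, a space, and the prompt text (as the stylesheet above produces), and the two sample lines and file names are hypothetical. Depending on the awk implementation and locale, toupper() may leave non-ASCII letters such as ä unchanged.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample in the layout "file name, space, prompt text".
printf '01/sample-001.wav guten morgen\n01/sample-002.wav vielen dank\n' \
  > "$workdir/PROMPTS-lower"

# Keep the first field (the file name) as it is and upper-case the rest.
# Clearing $1 leaves the remaining text with one leading space.
awk '{ name = $1; $1 = ""; print name toupper($0) }' \
  "$workdir/PROMPTS-lower" > "$workdir/PROMPTS-upper"

cat "$workdir/PROMPTS-upper"
```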

Creating Ralf’s German dictionary

Tuesday, September 22nd, 2009

To get an impression of how I create the German PLS dictionary, watch the video (19.2 MB, WMV):

[20100101: video removed]

Currently, I am preparing a new version of Ralf’s German dictionary. The dictionary should be 100% simon compatible (version 0.1 contains some minor mistakes).

This is what I did yesterday:
1. I created more than 80,000 pronunciations with eSpeak from a set of 300,000 words. Not all words were transcribed; I don’t know what went wrong.
2. Then I created an XSLT stylesheet to transform the eSpeak phoneset into IPA with saxonb-xslt.
3. The result was that I had a list of the phonemes, but the graphemes were missing. What can I do? I decided to start dictating the missing graphemes with DNS 9.5. You can see the dictation process in the video.