<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>testing simon &#187; saxonb-xslt</title>
	<atom:link href="http://spirit.blau.in/simon/tag/saxonb-xslt/feed/" rel="self" type="application/rss+xml" />
	<link>http://spirit.blau.in/simon</link>
	<description>my first steps with the simon speech recognition software</description>
	<lastBuildDate>Tue, 10 Jan 2012 14:59:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Ralf&#8217;s Arabic dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 11:55:50 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Arabic]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[sed]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5726</guid>
		<description><![CDATA[This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon. A. Creation of the dictionary: 1. Get Arabic spelling dictionary. 2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file: GPL 2.0/LGPL [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon.</p>
<p><strong>A. Creation of the dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/Arabicspellchecker">Get</a> Arabic spelling dictionary.<br />
2. Check the license. Inside the file <a href="http://extensions.services.openoffice.org/en/download/4955">dict_ar-3.0.oxt</a> there is a file with the name COPYING (in the docs folder). It says in the file:</p>
<blockquote><p>GPL 2.0/LGPL 2.1/MPL 1.1 tri-license</p></blockquote>
<p>This means that I can use this tri-licensed spelling dictionary as the source for my future GPLv3 PLS dictionary.</p>
<p>3. Now I have to extract <code>dict_ar-3.0.oxt</code>.<br />
4. Let&#8217;s try the <code>unmunch</code> command inside the Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ unmunch ar.dic ar.aff > arabic</code></p></blockquote>
<p>It failed. I wasn&#8217;t able to unmunch the word list.<br />
5. I have to remove all numbers from ar.dic. This can be done with the <code>sed</code> command:</p>
<blockquote><p><code>sed 's/[0-9]*//g' ar.dic > arabic-without-numbers</code></p></blockquote>
<p>6. Remove the slash (&#8220;/&#8221;) from arabic-without-numbers with <a href="http://en.wikipedia.org/wiki/Geany">Geany</a>.<br />
7. Add lexicon tags at the beginning and the end of the file.<br />
8. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:arabic.xml</code></p></blockquote>
<p>9. The ISO 639-1 language code is <code>ar</code>.<br />
10. Maybe I will <a href="http://en.wikipedia.org/wiki/Romanization_of_Arabic#Comparison_table">use this table</a> for the grapheme-to-phoneme conversion.<br />
11. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-arabic.xsl_.zip'>improve-arabic.xsl</a>' -o:<a href="http://script.blau.in/arabic-dictionary.xml.bz2">arabic-dictionary.xml</a></code></p></blockquote>
<p>I have to remove the number sign (&#8220;#&#8221;) from arabic.xml with Geany.</p>
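<p>Steps 5 to 7 could also be scripted instead of being done partly in Geany. The following is only a sketch under assumptions: <code>sample.dic</code> and <code>sample-wordlist</code> are hypothetical file names, the three sample entries are made up for illustration, and it assumes that every digit and slash in the file belongs to a word count or an affix flag and never to a word.</p>

```shell
# Hypothetical sample standing in for ar.dic (the real file is not reproduced here).
printf '3\nكتاب/12\nقلم/7\nبيت\n' > sample.dic

# Delete digits and slashes, drop lines that end up empty,
# then wrap the remaining words in <lexicon> tags.
{
  echo '<lexicon>'
  sed -e 's/[0-9]//g' -e 's#/##g' -e '/^$/d' sample.dic
  echo '</lexicon>'
} > sample-wordlist

cat sample-wordlist
```

<p>The same commands should work on <code>ar.dic</code> itself, but that is untested here.</p>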
<p><strong>B. <a href="http://script.blau.in/arabic-dictionary.xml.bz2">Download</a> the dictionary. Import it into <a href="http://simon-listens.blogspot.com/">simon</a>.</strong></p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/arabic-pronunciation.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/arabic-pronunciation-271x300.jpg" alt="" title="arabic-pronunciation" width="271" height="300" class="alignleft size-medium wp-image-5736" /></a>The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column contains only &#8220;Unknown&#8221; entries because the PLS dictionary contains no <code>role</code> attributes.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary and what the result looks like in simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Hebrew dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 09:43:30 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Hebrew]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5718</guid>
		<description><![CDATA[In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon. A. Creation of the [...]]]></description>
			<content:encoded><![CDATA[<p>In 2009, I made some <a href="http://spirit.blau.in/simon/2009/09/10/confidence-score-with-hebrew/">initial tests with Hebrew</a>. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon.</p>
<p><strong>A. Creation of the dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/dict-he">Get</a> Hebrew spelling dictionary from OpenOffice.org.<br />
2. License is <a href="http://www.gnu.org/licenses/gpl-2.0.html">GPL</a>. There is a copyright notice inside the file <code>he_IL.aff</code>.</p>
<p>3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately it failed:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test</code></p></blockquote>
<p>4. The source file <code>he_IL.dic</code> contains a lot of numbers. I remove them in the Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ <a href="http://www.cyberciti.biz/faq/sed-remove-all-digits-input-from-input/">sed</a> 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers</code></p></blockquote>
<p>With Geany, I remove the commas (&#8220;,&#8221;) and the slashes (&#8220;/&#8221;) that are still in the file hebrew-without-numbers. Now I have a clean word list with about 43,000 Hebrew words.</p>
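<p>The clean-up in step 4 could also be done in one pass in the terminal. This is a sketch, not the method used above: <code>sample-he.dic</code> and <code>sample-hebrew-wordlist</code> are hypothetical names, the two sample entries are invented, and it assumes that digits, commas and slashes appear only as counts and flags.</p>

```shell
# Hypothetical sample standing in for he_IL.dic.
printf '2\nשלום/1,2\nבית/3\n' > sample-he.dic

# tr deletes every digit, comma and slash; sed drops lines left empty.
tr -d '0-9,/' < sample-he.dic | sed '/^$/d' > sample-hebrew-wordlist

# Count the remaining words.
wc -l < sample-hebrew-wordlist
```

<p>Because <code>tr</code> removes only ASCII characters here, the UTF-8 encoded Hebrew letters should pass through unchanged.</p>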
<p>5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.<br />
6. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:hebrew.xml</code></p></blockquote>
<p>7. The ISO 639-1 <a href="http://en.wikipedia.org/wiki/Hebrew_language">language</a> code is <code>he</code>.<br />
8. I need a table for grapheme-to-phoneme conversion. Maybe I will <a href="http://en.wikipedia.org/wiki/Hebrew_alphabet#Pronunciation">use this table</a>. There are several tables available at Wikipedia, and I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew <a href="http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/">share the same alphabet</a>. This means I could try to use the Yiddish <a href="http://spirit.blau.in/simon/files/2012/01/improve-yiddish.xsl_.zip">improve-yiddish.xsl</a> style sheet:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml</code></p></blockquote>
<p>The result is that most Hebrew letters have been converted into IPA. Only one Hebrew letter hasn&#8217;t been converted: [א]. I will add it to a style sheet named <code>improve-hebrew.xsl</code>. Now I try it again:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-hebrew.xsl_.zip'>improve-hebrew.xsl</a>' -o:<a href="http://script.blau.in/hebrew-dictionary.xml.bz2" title="IPA phonetic dictionary, first draft">hebrew-dictionary.xml</a></code></p></blockquote>
<p>The result is not so good. Maybe I should adjust the grapheme-to-phoneme conversion rules for modern standard Israeli Hebrew. Or is this not necessary? I think for a first draft I can use the Yiddish transformation rules.</p>
<p><strong>B. <a href="http://script.blau.in/hebrew-dictionary.xml.bz2">Download</a> the dictionary. Import it into <a href="http://simon-listens.org/index.php?id=122&#038;L=1">simon</a> as a shadow dictionary.</strong></p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/hebrew-SAMPA.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/hebrew-SAMPA-255x300.jpg" alt="" title="hebrew-SAMPA" width="255" height="300" class="alignleft size-medium wp-image-5723" /></a>Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or, to be more exact, displays just <code>Unknown</code>) since the source PLS dictionary contains no <code>role</code> attributes.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary and what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn&#8217;t be a problem to adjust the style sheet <code>improve-hebrew.xsl</code> so that the phoneme results are better.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Asturian dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 20:30:43 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Asturian]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5678</guid>
		<description><![CDATA[This article explains how I create the Asturian PLS dictionary, and adds some words about the import into simon. A. How I create the dictionary: 1. Get spelling dictionary. 2. Check license. It is GPLv3. 3. Extract asturianu.oxt. 4. Language code is ast. 5. Ubuntu terminal: ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist The result is a [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains how I create the Asturian PLS dictionary, and adds some words about the import into simon.</p>
<p>A. How I create the dictionary:<br />
1. <a href="http://extensions.services.openoffice.org/en/project/asturianu">Get</a> spelling dictionary.<br />
2. Check license. It is <a href="http://extensions.services.openoffice.org/en/project/license/3932">GPLv3</a>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/5129">asturianu.oxt</a>.<br />
4. <a href="http://en.wikipedia.org/wiki/Asturian_language">Language</a> code is ast.<br />
5. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist</code></p></blockquote>
<p>The result is a 70 MB file with more than 5 million words. This word list is too big, so I should reduce it. I had the <a href="http://spirit.blau.in/simon/2010/04/13/removing-words-from-latin-dictionary/">same problem</a> with my Latin dictionary, whose size I also had to reduce.</p>
<p>6. Add lexicon elements at the beginning/end of asturian-wordlist.</p>
<p>7. Generate .xml document with lexicon, lexeme and grapheme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>I got an error message because Java ran out of memory (&#8220;Java heap space&#8221;). I think that I should reduce the file size with <a href="http://en.wikipedia.org/wiki/Grep">grep</a>. Or I could install VisualVM. I think I will work with grep:<br />
a. Remove lines that begin with <code>l'</code>: <code>grep -v ^l\' asturian-wordlist > asturian-wordlist-02</code><br />
b. Remove lines that begin with <code>t'</code>: <code>grep -v ^t\' asturian-wordlist-02 > asturian-wordlist-03</code><br />
c. Remove lines that begin with <code>s'</code>: <code>grep -v ^s\' asturian-wordlist-03 > asturian-wordlist-04</code><br />
d. Remove lines that begin with <code>m'</code>: <code>grep -v ^m\' asturian-wordlist-04 > asturian-wordlist-05</code><br />
e. Remove lines that begin with <code>n'</code>: <code>grep -v ^n\' asturian-wordlist-05 > asturian-wordlist-06</code><br />
f. Remove lines that begin with <code>d'</code>: <code>grep -v ^d\' asturian-wordlist-06 > asturian-wordlist-07</code><br />
g. Remove lines that begin with <code>qu'</code>: <code>grep -v ^qu\' asturian-wordlist-07 > asturian-wordlist-08</code><br />
h. Remove lines that begin with <code>p'</code>: <code>grep -v ^p\' asturian-wordlist-08 > asturian-wordlist-09</code><br />
The dictionary will contain 1.1 million words. I think that number is acceptable.</p>
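<p>The eight grep calls above could probably be combined into a single call with an extended regular expression. This is a sketch under assumptions: the file names <code>sample-wordlist</code> and <code>sample-wordlist-reduced</code> and the five sample words are made up, and it assumes that the word list uses the plain ASCII apostrophe.</p>

```shell
# Hypothetical sample standing in for asturian-wordlist.
printf "l'agua\nt'amo\nqu'hai\ncasa\nperru\n" > sample-wordlist

# Drop every line that starts with one of the clitic prefixes
# l', t', s', m', n', d', qu' or p' in a single pass.
grep -v -E "^(l|t|s|m|n|d|qu|p)'" sample-wordlist > sample-wordlist-reduced

cat sample-wordlist-reduced
```

<p>A single pass avoids writing eight intermediate files, which may matter for a 70 MB word list.</p>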
<p>8. And now Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-09 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>This command creates a PLS dictionary without phoneme elements. The phoneme elements will be added later.</p>
<p>9. I will use <a href="http://en.wikipedia.org/wiki/Asturian_language#Orthography">this</a> table for grapheme to phoneme conversion. Here is the command that creates the phoneme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-asturian.xsl_.zip'>improve-asturian.xsl</a>' -o:<a href="http://script.blau.in/asturian-dictionary.xml.bz2">asturian-dictionary.xml</a></code></p></blockquote>
<p>10. I tried to import the resulting dictionary into simon. Unfortunately, simon stopped responding after the import had finished. I assume that the dictionary is way too big. I have to reduce its size again.<br />
a. Remove lines that contain <code>'l</code>: <code>grep -v \'l asturian-wordlist-09 > asturian-wordlist-10</code><br />
b. Continue to reduce the size of the word list: <code>grep -v ylu asturian-wordlist-10 > asturian-wordlist-11</code><br />
c. This isn&#8217;t enough. I have to remove about 80,000 words: <code>grep -v les asturian-wordlist-11 > asturian-wordlist-12</code><br />
d. Remove 136,000 words: <code>grep -v mos asturian-wordlist-12 > asturian-wordlist-13</code><br />
e. Remove 67,000 words: <code>grep -v los asturian-wordlist-13 > asturian-wordlist-14</code><br />
f. Remove 265,000 words: <code>grep -v es asturian-wordlist-14 > asturian-wordlist-15</code><br />
You see it is a lot of work to get a dictionary size that is suitable for simon. At the moment, the word list contains 539,000 words. Is this number OK, or should I continue to reduce the size? I think that I will try the import again. First, I create an <code>.xml</code> file again:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-15 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>And now I repeat step 9. The file <code>asturian-dictionary.xml</code> has a size of 45 MB. I hope that this size is OK.</p>
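<p>Before removing words as in step 10, it may help to count how many lines each pattern would match. The following sketch uses a made-up six-word sample and the hypothetical file names <code>sample-wordlist</code> and <code>sample-wordlist-reduced</code>. Note that these counts are taken on the full list and can overlap, so they will not match the step-by-step numbers given above.</p>

```shell
# Hypothetical sample standing in for asturian-wordlist-09.
printf 'casa\ncases\nfalamos\nperru\npapeles\npelos\n' > sample-wordlist

# Report how many lines contain each fixed string.
for pattern in "'l" ylu les mos los es; do
  printf '%s: %s\n' "$pattern" "$(grep -c -F -- "$pattern" sample-wordlist)"
done

# Remove all matching lines in one pass and count what is left.
grep -v -F -e "'l" -e ylu -e les -e mos -e los -e es sample-wordlist > sample-wordlist-reduced
wc -l < sample-wordlist-reduced
```

<p>Filtering on a substring such as <code>es</code> is crude, because it removes every word that contains these letters anywhere. The counts make that cost visible before the words are gone.</p>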
<p>B. <a href="http://script.blau.in/asturian-dictionary.xml.bz2">Download the dictionary</a>. Import it into simon.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/asturian.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/asturian-238x300.jpg" alt="" title="asturian" width="238" height="300" class="alignleft size-medium wp-image-5693" /></a>Take a look at the result. In the left column, you can see the Asturian words. This dictionary contains 539928 words. The right column contains the corresponding SAMPA transcriptions.</p>
<div style="clear:both"></div>
<p>You can see that it was a lot of work to reduce the size of the dictionary. At least it now has a size that isn&#8217;t too big for simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Yiddish dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 17:05:18 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>
		<category><![CDATA[yiddish]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5663</guid>
		<description><![CDATA[This article explains some details about the creation of the dictionary, and what the result looks like in simon. A. How I create Ralf's Yiddish dictionary: 1. Get spelling dictionary. 2. License is GPLv3. 3. Extract jidysz.net.ooo.spellchecker.oxt. 4. Ubuntu terminal: cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries sudo apt-get install hunspell-tools unmunch yi.dic yi.aff &#62; yiddish-wordlist 5. Add &#60;lexicon&#62; at [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains some details about the creation of the dictionary, and what the result looks like in simon.</p>
<p>A. How I create <code>Ralf's Yiddish dictionary</code>:</p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/jidysz-net-ooo-spellchecker">Get</a> spelling dictionary.<br />
2. License is <code>GPLv3</code>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/4324"><code>jidysz.net.ooo.spellchecker.oxt</code></a>.<br />
4. Ubuntu terminal:<br />
<code>cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries<br />
sudo apt-get install hunspell-tools<br />
unmunch yi.dic yi.aff &gt; yiddish-wordlist</code><br />
5. Add <code>&lt;lexicon&gt;</code> at the beginning of yiddish-wordlist. Add <code>&lt;/lexicon&gt;</code> at the end of this file.<br />
6. Generate <code>.xml</code> document with lexicon, lexeme and grapheme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:yiddish.xml</code></p></blockquote>
<p>7. ISO 639-1 <a href="http://en.wikipedia.org/wiki/Yiddish_language">language code</a> is yi.<br />
8. I think I will use <a href="http://en.wikipedia.org/wiki/Yiddish_orthography#The_Yiddish_alphabet">this table</a> as the source for the grapheme-to-phoneme mapping.<br />
9. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-yiddish.xsl_.zip'>improve-yiddish.xsl</a>' -o:yiddish-dictionary.xml</code></p></blockquote>
<p>B. <a href="http://script.blau.in/yiddish-dictionary.xml.bz2">Download the dictionary</a>, and <a href="http://spirit.blau.in/simon/2010/06/17/tutorial-import-german-dictionary/">import</a> it into simon.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/yiddish.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/yiddish-243x300.jpg" alt="" title="yiddish" width="243" height="300" class="alignleft size-medium wp-image-5670" /></a>Take a look at the result. The left column contains the Yiddish words. This dictionary contains 99980 words. The right column contains the corresponding SAMPA transcription.<br />
<a href="http://en.wikipedia.org/wiki/Yiddish_language">Yiddish</a> is written in the Hebrew alphabet. The <a href="http://en.wikipedia.org/wiki/Hebrew_alphabet">Hebrew alphabet</a> is written from right to left. Obviously, the corresponding SAMPA transcriptions are written from left to right. This means that the phoneme order should be fine.</p>
<div style="clear:both"></div>
<p>There are a lot of other PLS dictionaries available. <a href="http://spirit.blau.in/simon/import-pls-dictionary/">Find the PLS dictionary</a> that suits your language.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s German dictionary 0.1.9.7</title>
		<link>http://spirit.blau.in/simon/2010/06/17/ralfs-german-dictionary-0-1-9-7/</link>
		<comments>http://spirit.blau.in/simon/2010/06/17/ralfs-german-dictionary-0-1-9-7/#comments</comments>
		<pubDate>Thu, 17 Jun 2010 00:48:38 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>
		<category><![CDATA[de]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[ʃneːtʀaɪ̯bm̩s]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=4164</guid>
		<description><![CDATA[This article explains how I am preparing Ralf's German dictionary 0.1.9.7. At the moment, I am focusing on the creation of new &#60;phoneme&#62; elements: 1. Add the following rule to improve-german.xsl: &#60;xsl:when test="ends-with(lower-case(../grapheme), 'ier') and ends-with(., 'iːʀ') and not(ends-with(../grapheme, 'eier'))"&#62;&#60;xsl:value-of select="replace($sierra, 'iːʀ', 'iːɐ̯')"/&#62;&#60;/xsl:when&#62; 2. Add this rule: &#60;xsl:when test="ends-with(../grapheme, 'gen') and not(ends-with(../grapheme, 'ngen'))"&#62;&#60;xsl:value-of select="replace($sierra, 'gən', 'gŋ̩')"/&#62;&#60;/xsl:when&#62; [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains how I am preparing <code>Ralf's German dictionary</code> 0.1.9.7. At the moment, I am focusing on the creation of new <code>&lt;phoneme&gt;</code> elements:</p>
<p>1. Add the following rule to <code>improve-german.xsl</code>:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(lower-case(../grapheme), 'ier') and<br />
ends-with(., 'iːʀ') and<br />
not(ends-with(../grapheme, 'eier'))"&gt;&lt;xsl:value-of select="replace($sierra, 'iːʀ', 'iːɐ̯')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>2. Add this rule:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(../grapheme, 'gen') and<br />
not(ends-with(../grapheme, 'ngen'))"&gt;&lt;xsl:value-of select="replace($sierra, 'gən', 'gŋ̩')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>3. Run the following command in the <code>Ubuntu</code> terminal to test whether <code>improve-german.xsl</code> produces the desired results:</p>
<blockquote><p><code>ubuntu@ubuntu-desktop:~$ <strong>saxonb-xslt</strong> -ext:on -s:'/home/ubuntu/Documents/201006/german-0.1.9.5/<a href="http://script.blau.in/german-previous.xml.bz2">german-0.1.9.6.xml</a>' -xsl:'/home/ubuntu/Documents/201005/german-0.1.9.4/<a href='http://spirit.blau.in/simon/files/2010/06/improve-german2.xsl'>improve-german.xsl</a>' -o:'/home/ubuntu/Documents/201006/german-0.1.9.5/<a href="http://script.blau.in/german-dictionary.xml.bz2">german-0.1.9.7.xml</a>'</code></p></blockquote>
<p>4. Add the following rule:</p>
<blockquote><p><code>&lt;xsl:when test="contains(../grapheme, 'gens') and<br />
not(ends-with(../grapheme, 'ngens'))"&gt;&lt;xsl:value-of select="replace($sierra, 'gəns', 'gŋ̩s')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>What does this rule create? Here is an example:<br />
Source dictionary:</p>
<blockquote><p><code>&lt;lexeme role="Substantiv"&gt;<br />
&lt;grapheme&gt;Volkswagens&lt;/grapheme&gt;<br />
&lt;phoneme&gt;fɔlksvagəns&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</code></p></blockquote>
<p>Target dictionary:</p>
<blockquote><p><code>&lt;lexeme role="Substantiv"&gt;<br />
&lt;grapheme&gt;Volkswagens&lt;/grapheme&gt;<br />
&lt;phoneme&gt;fɔlksvagəns&lt;/phoneme&gt;<br />
&lt;phoneme&gt;<strong>fɔlksvagŋ̩s</strong>&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</code></p></blockquote>
<p>You can see that I am using <code>XSLT</code> to produce additional <code>&lt;phoneme&gt;</code> elements.</p>
<p>5. Add this rule:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(../grapheme, 'bens')"&gt;&lt;xsl:value-of select="replace($sierra, 'bəns', 'bm̩s')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>Example: Source dictionary:</p>
<blockquote><p><code>&lt;lexeme role="Substantiv"&gt;<br />
&lt;grapheme&gt;Schneetreibens&lt;/grapheme&gt;<br />
&lt;phoneme&gt;ʃneːtʀaɪ̯bəns&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</code></p></blockquote>
<p>Target dictionary:</p>
<blockquote><p><code>&lt;lexeme role="Substantiv"&gt;<br />
&lt;grapheme&gt;Schneetreibens&lt;/grapheme&gt;<br />
&lt;phoneme&gt;ʃneːtʀaɪ̯bəns&lt;/phoneme&gt;<br />
&lt;phoneme&gt;<strong>ʃneːtʀaɪ̯bm̩s</strong>&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</code></p></blockquote>
<p>6. Add the following rule:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(../grapheme, 'ben')"&gt;&lt;xsl:value-of select="replace($sierra, 'bən', 'bm̩')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>Example from the source dictionary:</p>
<blockquote><p><code>&lt;lexeme role="Verb"&gt;<br />
&lt;grapheme&gt;ausgeben&lt;/grapheme&gt;<br />
&lt;phoneme&gt;ʔaʊ̯sgeːbən&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</code></p></blockquote>
<p>Target dictionary:</p>
<blockquote><p><code>&lt;lexeme role="Verb"&gt;<br />
&lt;grapheme&gt;ausgeben&lt;/grapheme&gt;<br />
&lt;phoneme&gt;ʔaʊ̯sgeːbən&lt;/phoneme&gt;<br />
&lt;phoneme&gt;<strong>ʔaʊ̯sgeːbm̩</strong>&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</code></p></blockquote>
<p>With this rule, I added about 1400 <code>&lt;phoneme&gt;</code> elements. It would be too much work to do this manually. Thanks to <a href="http://saxon.sourceforge.net/">saxonb-xslt</a> I can work efficiently and precisely.</p>
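<p>To estimate in advance how many entries a rule like the one in step 6 could touch, the matching <code>&lt;grapheme&gt;</code> elements can be counted with grep. This is only a rough sketch: <code>sample-german.xml</code> is a hypothetical two-entry excerpt built from the examples above, and it assumes that each <code>&lt;grapheme&gt;</code> element sits on its own line. The count is an upper bound, because the rule only adds a <code>&lt;phoneme&gt;</code> when the replaced phoneme sequence is actually present.</p>

```shell
# Hypothetical excerpt in the format shown in the examples above.
cat > sample-german.xml <<'EOF'
<lexicon>
<lexeme role="Verb">
<grapheme>ausgeben</grapheme>
<phoneme>ʔaʊ̯sgeːbən</phoneme>
</lexeme>
<lexeme role="Substantiv">
<grapheme>Volkswagens</grapheme>
<phoneme>fɔlksvagəns</phoneme>
</lexeme>
</lexicon>
EOF

# Count the lines whose grapheme text ends in "ben".
grep -c 'ben</grapheme>' sample-german.xml
```

<p>On this sample only <code>ausgeben</code> matches.</p>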
<p>7. If you have suggestions for improving <code>Ralf's German dictionary</code>, please tell me. At the moment, the dictionary contains 384067 <code>&lt;lexeme&gt;</code> elements. I don&#8217;t want to add more words to the dictionary for now. I want to improve the phoneme quality. There are a lot of <code>&lt;grapheme&gt;</code> elements that have more than one possible pronunciation, so my current focus is to add a <code>&lt;phoneme&gt;</code> element where it seems appropriate. If you are missing something, please tell me.</p>
<p><span id="more-4164"></span>8. Add this rule:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(lower-case(../grapheme), 'ur')<br />
and not(ends-with(../grapheme, 'eur'))"&gt;&lt;xsl:value-of select="replace($sierra, 'uːʀ', 'uːɐ̯')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>9. Add this rule:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(lower-case(../grapheme), 'schen')"&gt;&lt;xsl:value-of select="replace($sierra, 'ʃən', 'ʃn̩')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>10. Add rule:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(../grapheme, 'ür')"&gt;&lt;xsl:value-of select="replace($sierra, 'yːʀ', 'yːɐ̯')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>11. Add rule:</p>
<blockquote><p><code>&lt;xsl:when test="ends-with(lower-case(../grapheme), 'ierende')"&gt;&lt;xsl:value-of select="replace($sierra, 'iːʀəndə', 'iːɐ̯nde')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>12. <a href="http://script.blau.in/german-dictionary.xml.bz2">Download <code>Ralf's German dictionary</code></a> 0.1.9.7, and import it into simon. The new version contains about 19,000 additional <code>&lt;phoneme&gt;</code> elements.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2010/06/17/ralfs-german-dictionary-0-1-9-7/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Clear button; improve phoneme</title>
		<link>http://spirit.blau.in/simon/2010/02/03/clear-button-improve-phoneme/</link>
		<comments>http://spirit.blau.in/simon/2010/02/03/clear-button-improve-phoneme/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 11:41:19 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[node11]]></category>
		<category><![CDATA[saxonb-xslt]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=2436</guid>
		<description><![CDATA[1. This is what I did a couple of minutes ago: $ cd Documents/201001/speech2text $ git pull origin master $ ./build_ubuntu.sh After starting simon, I can see that a Clear button is available: It should now be possible to delete the active dictionary. 2. Let me make an additional remark about the phonemes of Ralf's [...]]]></description>
			<content:encoded><![CDATA[<p>1. This is what I did a couple of minutes ago:</p>
<p><code>$ cd Documents/201001/speech2text<br />
$ git pull origin master<br />
$ ./build_ubuntu.sh</code></p>
<p>After starting simon, I can see that a <code>Clear</code> button is available:</p>
<p><a href="http://spirit.blau.in/simon/files/2010/02/clear.png"><img class="alignnone size-medium wp-image-2437" src="http://spirit.blau.in/simon/files/2010/02/clear-300x221.png" alt="clear" width="300" height="221" /></a></p>
<p>It should now be possible to delete the active dictionary.</p>
<p>2. Let me make an additional remark about the phonemes of <code>Ralf's German dictionary</code>. The phonemes</p>
<p>S t E n d <strong>e:</strong> k E m pf @<br />
S t E n d <strong>e:</strong> O R g a n i: z a ts I o: n</p>
<p>are not optimal. <code>e:</code> indicates a long vowel. Instead, there should be the short vowel <code>@</code>. When you watch the <a href="http://spirit.blau.in/simon/2009/12/27/video-recognize-200-german-words/">video 200 German words</a>, you can hear how I pronounce these words. I pronounce them with <code>e:</code> (long e) instead of <code>@</code> (short e). Such problems occur often in <code>Ralf's German dictionary</code>.</p>
<p>How do I modify <code>Ralf's German dictionary</code>? I use the Ubuntu terminal:</p>
<p><code>$ saxonb-xslt -ext:on -s:german-dictionary-0.1.7.xml -xsl:modify-german-dictionary.xsl -o:prepare-0.1.8.xml</code></p>
<p>To avoid a <em>Java heap space</em> error, I run <a href="http://en.wikipedia.org/wiki/VisualVM">VisualVM</a> in the background. With that setup, the XSLT stylesheet can modify <code>Ralf's German dictionary</code>.</p>
<p>Why am I telling you these details that are not directly related to simon? Because you <a href="http://spirit.blau.in/simon/2010/01/08/tutorial-how-to-install-under-ubuntu/#pls-dictionary">need a pronunciation dictionary</a> if you want to use simon for dictation. And it is necessary to improve the quality of the phonemes that are contained in <code>Ralf's German dictionary</code>.</p>
<p>3. simon can modify the phonemes of a word with the <code>Edit Word</code> button. For dictionary development, however, my approach with <code>saxonb-xslt</code> is necessary because it can modify many <code>&lt;lexeme&gt;</code> elements in one run.</p>
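<p>As a rough illustration of this approach (not the actual <code>modify-german-dictionary.xsl</code>, which is not shown here), a stylesheet of this kind could copy the dictionary unchanged and rewrite only selected phonemes. The sketch assumes that the dictionary uses the standard PLS namespace, that the phoneme symbols are separated by single spaces as printed above, and that every transcription beginning with <code>S t E n d e:</code> should get <code>@</code> instead. All three are assumptions:</p>
<blockquote><p><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;<br />
&lt;xsl:stylesheet version="2.0"<br />
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"<br />
xmlns:pls="http://www.w3.org/2005/01/pronunciation-lexicon"&gt;<br />
&lt;!-- hypothetical sketch, not the original stylesheet --&gt;<br />
&lt;!-- identity template: copy every node and attribute unchanged by default --&gt;<br />
&lt;xsl:template match="@*|node()"&gt;<br />
&lt;xsl:copy&gt;&lt;xsl:apply-templates select="@*|node()"/&gt;&lt;/xsl:copy&gt;<br />
&lt;/xsl:template&gt;<br />
&lt;!-- assumed rule: replace the long e: with @ when a phoneme string starts with "S t E n d e: " --&gt;<br />
&lt;xsl:template match="pls:phoneme/text()"&gt;<br />
&lt;xsl:value-of select="replace(., '^S t E n d e: ', 'S t E n d @ ')"/&gt;<br />
&lt;/xsl:template&gt;<br />
&lt;/xsl:stylesheet&gt;</code></p></blockquote>
<p>Such a stylesheet could be passed to the <code>saxonb-xslt</code> command from point 2 with the <code>-xsl:</code> option. Because everything else is copied by the identity template, only the matching phonemes change.</p>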
<p>4. I removed the active and the shadow dictionary using the <code>Clear</code> button. What about an <code>Export dictionary</code> button? Someone who edits the dictionary with simon (<code>Edit Word</code> button) may want to export it afterwards.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2010/02/03/clear-button-improve-phoneme/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Import of prompts 01 failed</title>
		<link>http://spirit.blau.in/simon/2009/10/13/import-of-prompts-01-failed/</link>
		<comments>http://spirit.blau.in/simon/2009/10/13/import-of-prompts-01-failed/#comments</comments>
		<pubDate>Tue, 13 Oct 2009 15:54:28 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[sox]]></category>
		<category><![CDATA[upper-case()]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=1516</guid>
		<description><![CDATA[This is what I am currently doing: I checked out revision 1040. I want to import the prompts 01 into simon. First, I had to convert the 40 flac files into wav files with the following command: liberty@liberty-desktop:~/200910/editing-ralfherzog/01$ for f in *.flac; do sox "$f" -t wav -r 16000 -s -c 1 "subfolder/${f%.flac}.wav"; done Then [...]]]></description>
			<content:encoded><![CDATA[<p>This is what I am currently doing: I <a href="http://spirit.blau.in/speech2text/2009/10/01/revision-1036/#comment-55">checked out revision 1040</a>.</p>
<p>I want to import the <a href="http://script.blau.in/german/01/prompts.xml">prompts 01</a> into simon. First, I had to convert the 40 flac files into wav files with the following command:</p>
<blockquote><p><code>liberty@liberty-desktop:~/200910/editing-ralfherzog/01$ for f in *.flac; do sox "$f" -t wav -r 16000 -s -c 1 "subfolder/${f%.flac}.wav"; done</code></p></blockquote>
<p>Then I transformed the file <code>http://script.blau.in/german/01/<strong>prompts.xml</strong></code> into an (almost) HTK-compatible format with the following command:</p>
<blockquote><p><code>liberty@liberty-desktop:~/200910/editing-ralfherzog/01$ <strong>saxonb-xslt -ext:on -o:PROMPTS01 -xsl:transform-ssml-prompts.xsl -s:prompts.xml</strong><br />
</code></p></blockquote>
<p>The stylesheet <code>transform-ssml-prompts.xsl</code> has the following content:</p>
<blockquote><p><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;<br />
&lt;xsl:stylesheet version="2.0"<br />
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;<br />
&lt;!-- 20091013; license: GPL --&gt;<br />
&lt;xsl:output method="text"/&gt;<br />
&lt;xsl:template match="speak"&gt;<br />
&lt;xsl:for-each select="audio"&gt;<br />
&lt;xsl:value-of select="replace(@src, 'flac','wav')"/&gt;<br />
&lt;xsl:text&gt; &lt;/xsl:text&gt;<br />
&lt;xsl:value-of select="."/&gt;&lt;xsl:text&gt;<br />
&lt;/xsl:text&gt;<br />
&lt;/xsl:for-each&gt;<br />
&lt;/xsl:template&gt;<br />
&lt;/xsl:stylesheet&gt;</code></p></blockquote>
<p>I think that I found an error. I forgot to capitalize the prompts with the XPath function <a href="http://www.w3.org/TR/xpath-functions/#func-upper-case">upper-case()</a>. I will have to correct the stylesheet. That is probably why the <code>Import Trainingsdata</code> function failed.</p>
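<p>A possible correction, as a sketch: the stylesheet below is the one shown above with <code>upper-case()</code> applied to the prompt text and nothing else changed. Whether capitalization alone fixes the import is an assumption:</p>
<blockquote><p><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;<br />
&lt;xsl:stylesheet version="2.0"<br />
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;<br />
&lt;!-- sketch; only change: upper-case() around the prompt text --&gt;<br />
&lt;xsl:output method="text"/&gt;<br />
&lt;xsl:template match="speak"&gt;<br />
&lt;xsl:for-each select="audio"&gt;<br />
&lt;xsl:value-of select="replace(@src, 'flac','wav')"/&gt;<br />
&lt;xsl:text&gt; &lt;/xsl:text&gt;<br />
&lt;xsl:value-of select="upper-case(.)"/&gt;&lt;xsl:text&gt;<br />
&lt;/xsl:text&gt;<br />
&lt;/xsl:for-each&gt;<br />
&lt;/xsl:template&gt;<br />
&lt;/xsl:stylesheet&gt;</code></p></blockquote>
<p>The same <code>saxonb-xslt</code> command as above would then write each line of <code>PROMPTS01</code> as the wav file name followed by the prompt in capital letters.</p>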
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2009/10/13/import-of-prompts-01-failed/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Creating Ralf&#8217;s German dictionary</title>
		<link>http://spirit.blau.in/simon/2009/09/22/creating-ralfs-german-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2009/09/22/creating-ralfs-german-dictionary/#comments</comments>
		<pubDate>Tue, 22 Sep 2009 10:40:01 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>
		<category><![CDATA[node16]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=1426</guid>
		<description><![CDATA[To get an impression how I create the German PLS dictionary, watch the video (19.2 MB, WMV): [20100101: video removed] Currently, I am preparing a new version of Ralf&#8217;s German dictionary. The dictionary should be 100% simon compatible (version 0.1 contains some minor mistakes). This is what I did yesterday: 1. I created more than [...]]]></description>
			<content:encoded><![CDATA[<p>To get an impression of how I create the German PLS dictionary, watch the video (19.2 MB, WMV):</p>
<p>[20100101: video removed]</p>
<p>Currently, I am preparing a new version of <a href="http://spirit.blau.in/simon/2009/09/12/ralfs-german-dictionary/">Ralf&#8217;s German dictionary</a>. The dictionary should be 100% simon compatible (version 0.1 contains some minor mistakes).</p>
<p>This is what I did yesterday:<br />
1. I created more than 80,000 pronunciations with eSpeak from a set of 300,000 words. Not all words were transcribed; I don&#8217;t know what went wrong.<br />
2. Then <a href="http://spirit.blau.in/lexicon/files/2009/09/20090921-espeak2ipa.xsl">I created an XSLT stylesheet</a> to transform the eSpeak phoneset into IPA with <code>saxonb-xslt</code>.<br />
3. The result was a list of phonemes, but the graphemes were missing. What could I do? I decided to dictate the missing graphemes with DNS 9.5. You can see the dictation process in the video.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2009/09/22/creating-ralfs-german-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

