<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>testing simon</title>
	<atom:link href="http://spirit.blau.in/simon/feed/" rel="self" type="application/rss+xml" />
	<link>http://spirit.blau.in/simon</link>
	<description>my first steps with the simon speech recognition software</description>
	<lastBuildDate>Tue, 10 Jan 2012 14:59:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Ralf&#8217;s Interlingua dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/10/ralfs-interlingua-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/10/ralfs-interlingua-dictionary/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 14:59:26 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5743</guid>
		<description><![CDATA[This article explains how I create the dictionary, and what the imported result looks like in simon. A. Creation of the PLS dictionary: 1. Get spelling dictionary. 2. License is GPL. It says in the file README_en.txt: This spell check dictionary for Interlingua is licensed under GPL. [...] This hyphenation rules for Interlingua are licensed [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains how I create the dictionary, and what the imported result looks like in simon.</p>
<p><strong>A. Creation of the PLS dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/dict-ia">Get</a> spelling dictionary.<br />
2. License is <a href="http://www.gnu.org/licenses/gpl.txt">GPL</a>. It says in the file README_en.txt:</p>
<blockquote><p>This spell check dictionary for Interlingua is licensed under GPL. [...] This hyphenation rules for Interlingua are licensed under GPL.</p></blockquote>
<p>This means that I can use this spelling dictionary as source.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/4581">dict-ia-2010-11-29.oxt</a>.<br />
4. ISO 639-1 <a href="http://en.wikipedia.org/wiki/Interlingua">language</a> code is <code>ia</code>.<br />
5. Probably I will <a href="http://en.wikipedia.org/wiki/Interlingua#Interlingua_alphabet">use this table</a> for grapheme to phoneme conversion.</p>
<p>6. Check the encoding of ia_iso.aff and ia_iso.dic. Both files are encoded in ISO 8859-1. Probably it is best if I convert the encoding of both files into UTF-8.<br />
<code>iconv -f ISO-8859-1 -t UTF-8 < ia_iso.dic > interlingua-utf8.dic<br />
iconv -f ISO-8859-1 -t UTF-8 < ia_iso.aff > interlingua-utf8.aff</code><br />
Change the first line in interlingua-utf8.aff to <code>SET UTF-8</code>. Both files contain CRLF at the end of each line (Windows mode). I don&#8217;t know whether the unmunch command can handle this, so I will try it:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Interlingua$ unmunch interlingua-utf8.dic interlingua-utf8.aff > interlingua-wordlist</code></p></blockquote>
<p>It worked. The CRLF line endings come from the source files; the target file contains just LF (Unix mode). There are a lot of duplicate entries. I think that these will be removed later by an <code>.xsl</code> script.</p>
<p>7. Add lexicon tags at the beginning and the end of interlingua-wordlist.</p>
<p>8. Create XML file:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Interlingua$ saxonb-xslt -s:interlingua-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:interlingua.xml</code></p></blockquote>
<p>9. Create PLS dictionary:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Interlingua$ saxonb-xslt -s:interlingua.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-interlingua.xsl_.zip'>improve-interlingua.xsl</a>' -o:<a href="http://script.blau.in/interlingua-dictionary.xml.bz2">interlingua-dictionary.xml</a></code></p></blockquote>
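<p>The encoding conversion from step 6, the CRLF line endings and the duplicate entries could probably be handled in one pipeline. The following is only a minimal sketch, not the procedure used for the published dictionary: the sample file and its entries are made up, and <code>sort -u</code> is swapped in for the later <code>.xsl</code> deduplication.</p>

```shell
# Hypothetical sample: three ISO 8859-1 lines with CRLF endings, one duplicated.
workdir=$(mktemp -d)
printf 'caf\351\r\nabc\r\ncaf\351\r\n' > "$workdir/sample-iso.txt"

# Convert to UTF-8, strip the carriage returns, drop duplicate lines.
iconv -f ISO-8859-1 -t UTF-8 < "$workdir/sample-iso.txt" \
  | tr -d '\r' \
  | LC_ALL=C sort -u > "$workdir/sample-utf8.txt"

cat "$workdir/sample-utf8.txt"
```

<p>Sorting changes the order of the word list, which should not matter for a dictionary.</p>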
<p><strong>B. <a href="http://script.blau.in/interlingua-dictionary.xml.bz2">Download</a> the dictionary. Import it into simon.</strong></p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/interlingua.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/interlingua-293x300.jpg" alt="" title="interlingua" width="293" height="300" class="alignleft size-medium wp-image-5746" /></a>The left column contains the words. The pronunciation column contains the corresponding SAMPA transcriptions. The Category column contains just &#8220;Unknown&#8221; entries.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary and what the result looks like in simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/10/ralfs-interlingua-dictionary/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Arabic dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 11:55:50 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Arabic]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[sed]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5726</guid>
		<description><![CDATA[This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon. A. Creation of the dictionary: 1. Get Arabic spelling dictionary. 2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file: GPL 2.0/LGPL [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon.</p>
<p><strong>A. Creation of the dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/Arabicspellchecker">Get</a> Arabic spelling dictionary.<br />
2. Check the license. Inside the file <a href="http://extensions.services.openoffice.org/en/download/4955">dict_ar-3.0.oxt</a> there is a file with the name COPYING (in the docs folder). It says in the file:</p>
<blockquote><p>GPL 2.0/LGPL 2.1/MPL 1.1 tri-license</p></blockquote>
<p>This means that I can use this tri-licensed spelling dictionary as source for my future GPLv3 PLS dictionary.</p>
<p>3. Now I have to extract <code>dict_ar-3.0.oxt</code>.<br />
4. Let&#8217;s try the <code>unmunch</code> command inside the Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ unmunch ar.dic ar.aff > arabic</code></p></blockquote>
<p>It failed. I wasn&#8217;t able to unmunch the word list.<br />
5. I have to remove all numbers from ar.dic. This can be done with the <code>sed</code> command:</p>
<blockquote><p><code>sed 's/[0-9]*//g' ar.dic > arabic-without-numbers</code></p></blockquote>
<p>6. Remove the slash (&#8220;/&#8221;) from arabic-without-numbers with <a href="http://en.wikipedia.org/wiki/Geany">Geany</a>.<br />
7. Add lexicon tags at the beginning and the end of the file.<br />
8. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:arabic.xml</code></p></blockquote>
<p>9. ISO 639-1 language code is ar.<br />
10. Maybe I will <a href="http://en.wikipedia.org/wiki/Romanization_of_Arabic#Comparison_table">use this table</a> for the grapheme to phoneme conversion.<br />
11. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-arabic.xsl_.zip'>improve-arabic.xsl</a>' -o:<a href="http://script.blau.in/arabic-dictionary.xml.bz2">arabic-dictionary.xml</a></code></p></blockquote>
<p>I have to remove the number sign (&#8220;#&#8221;) with Geany from arabic.xml.</p>
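<p>Steps 5 and 6 could probably be combined into one <code>sed</code> call instead of switching to Geany. This is a minimal sketch, assuming that the flags after the slash consist only of digits; the sample file and its ASCII placeholder words are made up for illustration.</p>

```shell
# Hypothetical stand-in for ar.dic: a count line, then word/flag entries.
workdir=$(mktemp -d)
printf '3\nkitab/12\nqalam/7\nbab\n' > "$workdir/sample.dic"

# Delete every digit and every slash, then drop lines left empty
# (the count line becomes empty once its digits are gone).
sed -e 's/[0-9]//g' -e 's#/##g' "$workdir/sample.dic" \
  | grep -v '^$' > "$workdir/sample-wordlist"

cat "$workdir/sample-wordlist"
```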
<p><strong>B. <a href="http://script.blau.in/arabic-dictionary.xml.bz2">Download</a> the dictionary. Import it into <a href="http://simon-listens.blogspot.com/">simon</a>.</strong></p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/arabic-pronunciation.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/arabic-pronunciation-271x300.jpg" alt="" title="arabic-pronunciation" width="271" height="300" class="alignleft size-medium wp-image-5736" /></a>The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column contains only &#8220;Unknown&#8221; entries. This is because the PLS dictionary contains no <code>role</code> attributes.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary. And you know what the result looks like in simon.</p>
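<p>The word count and the missing <code>role</code> attributes can also be checked in the terminal before the import. This sketch assumes the usual PLS layout with lexeme, grapheme and phoneme elements; the tiny sample file is made up and only stands in for the unpacked <code>arabic-dictionary.xml</code>.</p>

```shell
workdir=$(mktemp -d)
# Hypothetical two-entry PLS file with placeholder words.
cat > "$workdir/sample-pls.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" alphabet="ipa" xml:lang="ar"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
<lexeme><grapheme>bab</grapheme><phoneme>bab</phoneme></lexeme>
<lexeme><grapheme>qalam</grapheme><phoneme>qalam</phoneme></lexeme>
</lexicon>
EOF

# Count lexeme start tags (one per dictionary entry) ...
lexemes=$(grep -o '<lexeme' "$workdir/sample-pls.xml" | wc -l | tr -d ' ')
# ... and count lines that carry a role attribute (none here).
roles=$(grep -c 'role=' "$workdir/sample-pls.xml" || true)

echo "lexemes: $lexemes, lines with role: $roles"
```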
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Hebrew dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 09:43:30 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Hebrew]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5718</guid>
		<description><![CDATA[In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon. A. Creation of the [...]]]></description>
			<content:encoded><![CDATA[<p>In 2009, I made some <a href="http://spirit.blau.in/simon/2009/09/10/confidence-score-with-hebrew/">initial tests with Hebrew</a>. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon.</p>
<p><strong>A. Creation of the dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/dict-he">Get</a> Hebrew spelling dictionary from OpenOffice.org.<br />
2. License is <a href="http://www.gnu.org/licenses/gpl-2.0.html">GPL</a>. There is a copyright notice inside the file <code>he_IL.aff</code>.</p>
<p>3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test</code></p></blockquote>
<p>4. The source file <code>he_IL.dic</code> contains a lot of numbers. I remove them with the Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ <a href="http://www.cyberciti.biz/faq/sed-remove-all-digits-input-from-input/">sed</a> 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers</code></p></blockquote>
<p>With Geany, I remove the &#8220;,&#8221; (commas) and the &#8220;/&#8221; (slashes) that are still in the file hebrew-without-numbers. Now I have a clean word list with 43,000 Hebrew words.</p>
<p>5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.<br />
6. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:hebrew.xml</code></p></blockquote>
<p>7. ISO 639-1 <a href="http://en.wikipedia.org/wiki/Hebrew_language">language</a> code is <code>he</code>.<br />
8. I need a table for grapheme to phoneme conversion. Maybe I will <a href="http://en.wikipedia.org/wiki/Hebrew_alphabet#Pronunciation">use this table</a>. There are several tables available at Wikipedia. I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew <a href="http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/">share the same alphabet</a>. This means I could try to use the Yiddish <a href="http://spirit.blau.in/simon/files/2012/01/improve-yiddish.xsl_.zip">improve-yiddish.xsl</a> style sheet:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml</code></p></blockquote>
<p>The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn&#8217;t been converted: [א]. I will add this letter to the <code>.xsl</code> style sheet, which I name <code>improve-hebrew.xsl</code>. Now I try it again:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-hebrew.xsl_.zip'>improve-hebrew.xsl</a>' -o:<a href="http://script.blau.in/hebrew-dictionary.xml.bz2" title="IPA phonetic dictionary, first draft">hebrew-dictionary.xml</a></code></p></blockquote>
<p>The result is not so good: Maybe I should adjust the grapheme to phoneme conversion rules for modern standard Israeli Hebrew. Or is this not necessary? I think for a first draft I can use the Yiddish transformation rules.</p>
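<p>The cleanup from step 4 (digits, commas and slashes) could likely be done in a single pass without Geany. This is a minimal sketch, assuming that these are the only extra characters in <code>he_IL.dic</code>; the sample file and its ASCII placeholder words are invented.</p>

```shell
# Hypothetical stand-in for he_IL.dic: a count line, then word/flag entries.
workdir=$(mktemp -d)
printf '2\nshalom/1,2\nbayit/14\n' > "$workdir/sample.dic"

# tr deletes all digits, commas and slashes; grep drops the emptied count line.
tr -d '0-9,/' < "$workdir/sample.dic" | grep -v '^$' > "$workdir/sample-wordlist"

cat "$workdir/sample-wordlist"
```

<p>Only ASCII characters are deleted, so the UTF-8 encoded Hebrew letters should pass through unchanged.</p>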
<p><strong>B. <a href="http://script.blau.in/hebrew-dictionary.xml.bz2">Download</a> the dictionary. Import it into <a href="http://simon-listens.org/index.php?id=122&#038;L=1">simon</a> as shadow dictionary.</strong></p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/hebrew-SAMPA.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/hebrew-SAMPA-255x300.jpg" alt="" title="hebrew-SAMPA" width="255" height="300" class="alignleft size-medium wp-image-5723" /></a>Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or, to be more exact, displays just <code>Unknown</code>) since the source PLS dictionary contains no <code>role</code> attributes.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary. And you know what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn&#8217;t be a problem to adjust the style sheet <code>improve-hebrew.xsl</code> so that the phoneme results are better.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Belarusian dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/09/ralfs-belarusian-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/09/ralfs-belarusian-dictionary/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 20:27:52 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Belarusian]]></category>
		<category><![CDATA[iconv]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5699</guid>
		<description><![CDATA[This article explains how I create this PLS dictionary and what the imported result looks like. A. Creation of the Belarusian PLS dictionary: 1. Get spelling dictionary. I choose the official orthography. 2. License is LGPL (see hyph_be_BY.dic). I am allowed to &#8220;convert any LGPLed piece of software into a GPLed piece of software.&#8221; I [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains how I create this PLS dictionary and what the imported result looks like.</p>
<p><strong>A. Creation of the Belarusian PLS dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/dict-be-official">Get</a> spelling dictionary. I choose the official orthography.<br />
2. License is <a href="http://www.gnu.org/licenses/lgpl.html">LGPL</a> (see <code>hyph_be_BY.dic</code>). I am <a href="http://en.wikipedia.org/wiki/GNU_Lesser_General_Public_License#Differences_from_the_GPL">allowed</a> to <em>&#8220;convert any LGPLed piece of software into a GPLed piece of software.&#8221;</em> I <a href="http://spirit.blau.in/simon/2010/05/22/ralfs-northern-sotho-dictionary/">did this before</a>. And I will do it again. This means that I get a spelling dictionary that is licensed under the LGPL. And I will produce a pronunciation dictionary that is licensed under the GPLv3. By the way, all my dictionaries are <a href="http://spirit.blau.in/simon/import-pls-dictionary/" title="phonetic IPA dictionaries">licensed under the GPLv3</a>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/2976">dict-be-official.oxt</a>.</p>
<p>4. The file <code>be-official.aff</code> is encoded in UTF-8. The file <code>be-official.dic</code> may be encoded in ISO-8859-1. At least this encoding is displayed by <a href="http://www.geany.org/">Geany</a>. I believe that <code>be-official.dic</code> is encoded in microsoft-cp1251. I had this encoding before (<a title="Macedonian PLS dictionary - Pronunciation Lexicon Specification" href="http://spirit.blau.in/simon/2010/04/24/ralfs-macedonian-dictionary/">Macedonian</a> and <a href="http://spirit.blau.in/simon/2010/04/17/creating-ralfs-bulgarian-dictionary/" title="phonetic dictionary">Bulgarian</a>).<br />
Now it is time to use the Ubuntu terminal:<br />
<code><a href="http://en.wikipedia.org/wiki/Cd_(command)">cd</a> /home/ubuntu/Documents/2011-II/Belarusian<br />
<a href="http://en.wikipedia.org/wiki/Iconv#Examples">iconv</a> -f cp1251 -t UTF-8 &lt;be-official.dic &gt;belarusian-utf8.dic</code><br />
The text file <code>belarusian-utf8.dic</code> looks fine.</p>
<p>5. Now I change the line <code>SET microsoft-cp1251</code> in the file <code>be-official.aff</code> into <code>SET UTF-8</code><br />
6. I don&#8217;t know whether the next step is necessary. I could convert the file <code>hyph_be_BY.dic</code> from cp1251 into UTF-8. At the moment, I skip this step.</p>
<p>7. Ubuntu terminal: <code>unmunch belarusian-utf8.dic be-official.aff &gt; belarusian-wordlist</code> I think that this step wasn&#8217;t necessary. It didn&#8217;t extract the word list. At the moment, I have a word list of 1.5 million words. This is way too much. The target size is 400,000 words.</p>
<p>8. I have to reduce the dictionary size. I found a <a href="http://www.unix.com/shell-programming-scripting/24845-remove-every-third-line-file.html">tip</a>. Ubuntu terminal:</p>
<blockquote><p><code><a href="http://en.wikipedia.org/wiki/Sed#Usage">sed</a> -n 'p;N;N;N' belarusian-wordlist <a href="http://en.wikipedia.org/wiki/Redirection_%28computing%29#Redirecting_standard_input_and_standard_output">></a> belarusian-wordlist-reduced</code></p></blockquote>
<p>Yes, it worked. The word list now contains 391,000 words. This is a good basis for a PLS dictionary.</p>
<p>9. Add lexicon elements at the beginning and the end of belarusian-wordlist-reduced.<br />
10. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian-wordlist-reduced -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:belarusian.xml</code></p></blockquote>
<p>11. <a href="http://en.wikipedia.org/wiki/Belarusian_language">Language</a> code is be.<br />
12. I will use <a href="http://en.wikipedia.org/wiki/Belarusian_alphabet#Letters">this</a> table for grapheme to phoneme mapping.<br />
13. Creation of the phoneme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-belarusian.xsl_.zip'>improve-belarusian.xsl</a>' -o:<a href="http://script.blau.in/belarusian-dictionary.xml.bz2" title="Belarusian IPA phonetic dictionary">belarusian-dictionary.xml</a></code></p></blockquote>
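<p>The <code>sed</code> call in step 8 prints one line and then swallows the next three, so it keeps every fourth line. That fits the drop from about 1.5 million to 391,000 words. A small sketch on made-up input shows the effect; the <code>awk</code> line is an equivalent alternative and not something the workflow above used.</p>

```shell
# Hypothetical input: the numbers 1 to 10, one per line.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/numbers"

# p prints the current line; each N pulls in the following line without printing it.
sed -n 'p;N;N;N' "$workdir/numbers" > "$workdir/kept-sed"

# Same selection expressed by line number.
awk 'NR % 4 == 1' "$workdir/numbers" > "$workdir/kept-awk"

cat "$workdir/kept-sed"
```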
<p><strong>B. <a href="http://script.blau.in/belarusian-dictionary.xml.bz2" title="Belarusian - Pronunciation Lexicon Specification">Download</a> and import the dictionary.</strong> </p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/belarusian.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/belarusian-300x275.jpg" alt="" title="belarusian" width="300" height="275" class="alignleft size-medium wp-image-5709" /></a>Let&#8217;s take a look at the result. The left column contains 391669 Belarusian words. The pronunciation column contains the corresponding SAMPA transcriptions. All entries in the third column are marked as &#8220;Unknown&#8221;. This is because the Belarusian PLS dictionary doesn&#8217;t contain any <code>role</code> attribute.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary. And you have an impression of what the result looks like when imported into simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/09/ralfs-belarusian-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Asturian dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 20:30:43 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Asturian]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5678</guid>
		<description><![CDATA[This article explains how I create the Asturian PLS dictionary and says a few words about the import into simon. A. How I create the dictionary: 1. Get spelling dictionary. 2. Check license. It is GPLv3. 3. Extract asturianu.oxt. 4. Language code is ast. 5. Ubuntu terminal: ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist The result is a [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains how I create the Asturian PLS dictionary and says a few words about the import into simon.</p>
<p>A. How I create the dictionary:<br />
1. <a href="http://extensions.services.openoffice.org/en/project/asturianu">Get</a> spelling dictionary.<br />
2. Check license. It is <a href="http://extensions.services.openoffice.org/en/project/license/3932">GPLv3</a>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/5129">asturianu.oxt</a>.<br />
4. <a href="http://en.wikipedia.org/wiki/Asturian_language">Language</a> code is ast.<br />
5. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist</code></p></blockquote>
<p>The result is a file of 70MB with more than 5 million words. This word list is too big. I should reduce it. I had the <a href="http://spirit.blau.in/simon/2010/04/13/removing-words-from-latin-dictionary/">same problem</a> with my Latin dictionary. I had to reduce the size.</p>
<p>6. Add lexicon elements at the beginning/end of asturian-wordlist.</p>
<p>7. Generate .xml document with lexicon, lexeme and grapheme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>I got an error message because there isn&#8217;t enough memory available (&#8220;Java heap space&#8221;). I think that I should reduce the file size with <a href="http://en.wikipedia.org/wiki/Grep">grep</a>. Alternatively, I could install VisualVM. I think I will work with grep:<br />
a. Remove lines that begin with <code>l'</code>: <code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ grep -v ^l\' asturian-wordlist > asturian-wordlist-02</code><br />
b. Remove lines that begin with <code>t'</code>: <code>grep -v ^t\' asturian-wordlist-02 > asturian-wordlist-03</code><br />
c. Remove lines that begin with <code>s'</code>: <code>grep -v ^s\' asturian-wordlist-03 > asturian-wordlist-04</code><br />
d. Remove lines that begin with <code>m'</code>: <code>grep -v ^m\' asturian-wordlist-04 > asturian-wordlist-05</code><br />
e. Remove lines that begin with <code>n'</code>: <code>grep -v ^n\' asturian-wordlist-05 > asturian-wordlist-06</code><br />
f. Remove lines that begin with <code>d'</code>: <code>grep -v ^d\' asturian-wordlist-06 > asturian-wordlist-07</code><br />
g. Remove lines that begin with <code>qu'</code>: <code>grep -v ^qu\' asturian-wordlist-07 > asturian-wordlist-08</code><br />
h. Remove lines that begin with <code>p'</code>: <code>grep -v ^p\' asturian-wordlist-08 > asturian-wordlist-09</code><br />
The dictionary will contain 1.1 million words. I think that this number is acceptable.</p>
<p>8. And now Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-09 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>This command creates a PLS dictionary without phoneme elements. The phoneme elements will be added later.</p>
<p>9. I will use <a href="http://en.wikipedia.org/wiki/Asturian_language#Orthography">this</a> table for grapheme to phoneme conversion. Here is the command that creates the phoneme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-asturian.xsl_.zip'>improve-asturian.xsl</a>' -o:<a href="http://script.blau.in/asturian-dictionary.xml.bz2">asturian-dictionary.xml</a></code></p></blockquote>
<p>10. I tried to import the resulting dictionary into simon. Unfortunately, simon stopped responding after the import had finished. I assume that the dictionary is way too big. I have to reduce its size again.<br />
a. Remove lines that contain <code>'l</code>: <code>grep -v \'l asturian-wordlist-09 > asturian-wordlist-10</code><br />
b. Continue to reduce the size of the word list: <code>grep -v ylu asturian-wordlist-10 > asturian-wordlist-11</code><br />
c. This isn&#8217;t enough, I have to remove about 80,000 words: <code>grep -v les asturian-wordlist-11 > asturian-wordlist-12</code><br />
d. Remove 136,000 words: <code>grep -v mos asturian-wordlist-12 > asturian-wordlist-13</code><br />
e. Remove 67,000 words: <code>grep -v los asturian-wordlist-13 > asturian-wordlist-14</code><br />
f. Remove 265,000 words: <code>grep -v es asturian-wordlist-14 > asturian-wordlist-15</code><br />
You see it is a lot of work to get a dictionary size that is suitable for simon. At the moment, the word list contains 539,000 words. Is this number OK, or should I continue to reduce the size? I think that I will try the import again. First, I create an <code>.xml</code> file again:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-15 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>And now I repeat step 9. The file <code>asturian-dictionary.xml</code> has a size of 45 MB. I hope that this size is OK.</p>
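<p>The eight grep calls in step 7 could probably be folded into one extended regular expression. This is only a sketch on an invented sample list, not the command used for the published dictionary.</p>

```shell
# Hypothetical sample word list with some elided forms.
workdir=$(mktemp -d)
printf '%s\n' "l'agua" "agua" "t'amo" "casa" "qu'el" "d'equi" "perru" > "$workdir/sample-wordlist"

# Drop every line that starts with l', t', s', m', n', d', qu' or p'.
grep -v -E "^(l|t|s|m|n|d|qu|p)'" "$workdir/sample-wordlist" > "$workdir/sample-reduced"

cat "$workdir/sample-reduced"
```

<p>The patterns in step 10, such as <code>grep -v es</code>, are not anchored with <code>^</code>, so they remove every word that contains the string anywhere.</p>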
<p>B. <a href="http://script.blau.in/asturian-dictionary.xml.bz2">Download the dictionary</a>. Import it into simon.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/asturian.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/asturian-238x300.jpg" alt="" title="asturian" width="238" height="300" class="alignleft size-medium wp-image-5693" /></a>Take a look at the result. In the left column, you can see the Asturian words. This dictionary contains 539928 words. The right column contains the corresponding SAMPA transcriptions.</p>
<div style="clear:both"></div>
<p>As you can see, it was a lot of work to reduce the size of the dictionary. At least it now has a size that isn&#8217;t too big for simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Yiddish dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 17:05:18 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>
		<category><![CDATA[yiddish]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5663</guid>
		<description><![CDATA[This article explains some details about the creation of the dictionary, and what the result looks like in simon. A. How I create Ralf's Yiddish dictionary: 1. Get spelling dictionary. 2. License is GPLv3. 3. Extract jidysz.net.ooo.spellchecker.oxt. 4. Ubuntu terminal: cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries sudo apt-get install hunspell-tools unmunch yi.dic yi.aff &#62; yiddish-wordlist 5. Add &#60;lexicon&#62; at [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains some details about the creation of the dictionary, and what the result looks like in simon.</p>
<p>A. How I create <code>Ralf's Yiddish dictionary</code>:</p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/jidysz-net-ooo-spellchecker">Get</a> spelling dictionary.<br />
2. License is <code>GPLv3</code>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/4324"><code>jidysz.net.ooo.spellchecker.oxt</code></a>.<br />
4. Ubuntu terminal:<br />
<code>cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries<br />
sudo apt-get install hunspell-tools<br />
unmunch yi.dic yi.aff &gt; yiddish-wordlist</code><br />
5. Add  <code>&lt;lexicon&gt;</code> at the beginning of yiddish-wordlist. Add <code>&lt;/lexicon&gt;</code> at the end of this file.<br />
6. Generate <code>.xml</code> document with lexicon, lexeme and grapheme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:yiddish.xml</code></p></blockquote>
<p>7. ISO 639-1 <a href="http://en.wikipedia.org/wiki/Yiddish_language">language code</a> is yi.<br />
8. I think I will use <a href="http://en.wikipedia.org/wiki/Yiddish_orthography#The_Yiddish_alphabet">this table</a> as source for the grapheme to phoneme mapping.<br />
9. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-yiddish.xsl_.zip'>improve-yiddish.xsl</a>' -o:yiddish-dictionary.xml</code></p></blockquote>
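<p>Step 5 can be scripted instead of editing the file by hand. This is a minimal sketch with an invented two-word list; the file names are placeholders.</p>

```shell
# Hypothetical two-word list standing in for yiddish-wordlist.
workdir=$(mktemp -d)
printf '%s\n' "alef" "beys" > "$workdir/sample-wordlist"

# Write the opening tag, the untouched word list, and the closing tag in order.
{
  printf '%s\n' '<lexicon>'
  cat "$workdir/sample-wordlist"
  printf '%s\n' '</lexicon>'
} > "$workdir/sample-wrapped"

cat "$workdir/sample-wrapped"
```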
<p>B. <a href="http://script.blau.in/yiddish-dictionary.xml.bz2">Download the dictionary</a>, and <a href="http://spirit.blau.in/simon/2010/06/17/tutorial-import-german-dictionary/">import</a> it into simon.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/yiddish.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/yiddish-243x300.jpg" alt="" title="yiddish" width="243" height="300" class="alignleft size-medium wp-image-5670" /></a>Take a look at the result. The left column contains the Yiddish words, and the right column contains the corresponding SAMPA transcriptions. This dictionary contains 99980 words.<br />
<a href="http://en.wikipedia.org/wiki/Yiddish_language">Yiddish</a> is written in the Hebrew alphabet. The <a href="http://en.wikipedia.org/wiki/Hebrew_alphabet">Hebrew alphabet</a> is written from right to left, whereas the corresponding SAMPA transcriptions are written from left to right. Because only the direction of display differs, the phoneme order should be fine.</p>
<div style="clear:both"></div>
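<p>The point about the phoneme order can be checked with a small Python sketch. It relies on the fact that Unicode text stores Hebrew letters in logical (reading) order, regardless of how they are displayed. The three-letter table below is a hypothetical, deliberately tiny mapping for illustration; it is not the mapping in <code>improve-yiddish.xsl</code>, and the space-separated output format is an assumption as well.</p>

```python
# Hypothetical, deliberately tiny letter-to-SAMPA table (illustration only).
LETTER_TO_SAMPA = {
    "\u05d2": "g",  # giml
    "\u05d5": "U",  # vov
    "\u05d8": "t",  # tes
}

def transcribe(word):
    """Map each letter to its SAMPA symbol in reading order."""
    # Index 0 of the string is the first letter a reader pronounces,
    # even though it is displayed at the right edge of the word.
    return " ".join(LETTER_TO_SAMPA[letter] for letter in word)

# The word is spelled giml, vov, tes.
sampa = transcribe("\u05d2\u05d5\u05d8")
print(sampa)
```

<p>The symbols come out in the order in which the letters are read, so no reversal step is needed before writing the transcription from left to right.</p>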
<p>There are a lot of other PLS dictionaries available. <a href="http://spirit.blau.in/simon/import-pls-dictionary/">Find the PLS dictionary</a> that suits your language.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Import of your Grammar</title>
		<link>http://spirit.blau.in/simon/2012/01/02/import-of-your-grammar/</link>
		<comments>http://spirit.blau.in/simon/2012/01/02/import-of-your-grammar/#comments</comments>
		<pubDate>Sun, 01 Jan 2012 22:14:58 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[German]]></category>
		<category><![CDATA[grammar]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5642</guid>
		<description><![CDATA[In this post I want to write some words about the Grammar / Import function of simon 0.3. Here is what I do: 1. Import Schott&#8217;s German dictionary as active dictionary into simon. 2. Open the Grammar tab. Press the Import button. 3. Simon starts a wizard. Press the Next button. 4. Let&#8217;s try and [...]]]></description>
			<content:encoded><![CDATA[<p>In this post I want to say a few words about the Grammar / Import function of simon 0.3. Here is what I do:</p>
<p>1. Import <a href="http://spirit.blau.in/simon/2011/11/01/schott%e2%80%99s-german-dictionary-0-2-8/">Schott&#8217;s German dictionary</a> as active dictionary into simon.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/grammar-01.jpg"><img class="alignleft size-medium wp-image-5643" title="grammar-01" src="http://spirit.blau.in/simon/files/2012/01/grammar-01-300x287.jpg" alt="" width="300" height="287" /></a>2. Open the Grammar tab. Press the <code>Import</code> button.</p>
<div style="clear:both"></div>
<p><a href="http://spirit.blau.in/simon/files/2012/01/grammar-02.jpg"><img class="alignleft size-medium wp-image-5645" title="grammar-02" src="http://spirit.blau.in/simon/files/2012/01/grammar-02-300x202.jpg" alt="" width="300" height="202" /></a>3. Simon starts a wizard. Press the <code>Next</code> button.</p>
<div style="clear:both"></div>
<p><a href="http://spirit.blau.in/simon/files/2012/01/grammar-03.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/grammar-03-300x202.jpg" alt="" title="grammar-03" width="300" height="202" class="alignleft size-medium wp-image-5647" /></a>4. Let&#8217;s check the option <code>Also import unknown sentences</code>. I don&#8217;t know whether this is a good decision, so let&#8217;s give it a try.<br />
This is interesting: &#8220;words with more than one terminal&#8221; &#8211; <strong>is it now possible to use more than one entry for the <code>role</code> attribute?</strong> The current version of Schott&#8217;s German dictionary uses just one entry for each <code>role</code> attribute, although the PLS standard allows <a href="http://www.w3.org/TR/pronunciation-lexicon/#S4.4">more entries</a>.<br />
Please download and extract <a href="http://script.blau.in/etc/15000-german-utterances.zip">Schott&#8217;s German utterances</a>. The compressed folder <code>15000-german-utterances.zip</code> contains a plain text file with more than 15000 utterances. I am the author of these utterances, and I have licensed them under the <a href="http://script.blau.in/etc/GPL_License">GPLv3</a>. The utterances are designed to be used in conjunction with Schott&#8217;s German dictionary (formerly Ralf&#8217;s German dictionary). Every word that occurs in Schott&#8217;s German utterances should also be included in Schott&#8217;s German dictionary. I am 99% sure that this is the case, but I can&#8217;t guarantee it. If any words are missing from Schott&#8217;s German dictionary, please let me know, and I will include them in the next version of the dictionary.<br />
You can import my German utterances using simon&#8217;s <code>Import Text</code> option (copy &#038; paste).</p>
<div style="clear:both"></div>
<p>5. The import has been completed. Many lines contain the <code>Unknown</code> terminal. It would probably have been better if I had not checked the option <code>Also import unknown sentences</code> in step 4.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/system-monitor-zombie.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/system-monitor-zombie-300x248.jpg" alt="" title="system-monitor-zombie" width="300" height="248" class="alignleft size-medium wp-image-5654" /></a>6. Because simon stopped responding, I forced it to quit. I tried to start simon several times, but it wouldn&#8217;t start. The system monitor now lists several simon processes with the status zombie or sleeping. I was able to end these processes, but at the moment it seems to be impossible to start simon again: each attempt creates a new zombie / sleeping process.</p>
<div style="clear:both"></div>
<p><strong>Conclusion:</strong> I don&#8217;t recommend checking the option <code>Also import unknown sentences</code>. I tried the Grammar / Import function before without checking this option. Simon responded normally, and everything seemed to be fine.</p>
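<p>Before importing, it may help to check which words in the utterances are missing from the dictionary, since those are the words that would end up as <code>Unknown</code>. The following Python sketch is an assumption about how such a check could look; it is not part of simon. The function name <code>missing_words</code>, the sample data and the case-insensitive comparison are all chosen for illustration.</p>

```python
import re

def missing_words(utterances, dictionary_words):
    """Return the words used in the utterances that the dictionary lacks."""
    # Assumption: words are compared without regard to upper and lower case.
    known = {word.lower() for word in dictionary_words}
    missing = set()
    for line in utterances:
        # \w+ matches runs of letters and digits, including umlauts.
        for token in re.findall(r"\w+", line):
            if token.lower() not in known:
                missing.add(token)
    return sorted(missing)

# Made-up sample data; real input would come from the utterances file
# and from the grapheme entries of the dictionary.
result = missing_words(["Guten Morgen", "Guten Abend"], ["guten", "Morgen"])
print(result)
```

<p>An empty result would mean that every word of the utterances is covered by the dictionary.</p>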
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/02/import-of-your-grammar/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Schott’s German dictionary 0.2.8</title>
		<link>http://spirit.blau.in/simon/2011/11/01/schott%e2%80%99s-german-dictionary-0-2-8/</link>
		<comments>http://spirit.blau.in/simon/2011/11/01/schott%e2%80%99s-german-dictionary-0-2-8/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 17:05:20 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5569</guid>
		<description><![CDATA[Here is how I create Schott’s German dictionary 0.2.8 (with the style sheet improve-german.xsl): 1. Replace 152 matches: &#60;xsl:when test="contains(lower-case(../grapheme), 'planung')"&#62;&#60;xsl:value-of select="replace($sierra, 'planʊŋ','plaːnʊŋ')"/&#62;&#60;/xsl:when&#62; 2. Replace 178 matches: &#60;xsl:when test="contains(lower-case(../grapheme), 'fußball')"&#62;&#60;xsl:value-of select="replace($sierra, 'fʊsbal','fuːsbal')"/&#62;&#60;/xsl:when&#62; Many other small changes have been made. Please import Schott’s German dictionary (author: Kai Schott) into simon.]]></description>
			<content:encoded><![CDATA[<p>Here is how I create <a href="http://script.blau.in/german-dictionary.xml.bz2">Schott’s German dictionary</a> 0.2.8 (with the style sheet <code>improve-german.xsl</code>):</p>
<p>1. Replace 152 matches:</p>
<blockquote><p><code>&lt;xsl:when test="contains(lower-case(../grapheme), 'planung')"&gt;&lt;xsl:value-of select="replace($sierra, 'planʊŋ','plaːnʊŋ')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>2. Replace 178 matches:</p>
<blockquote><p><code>&lt;xsl:when test="contains(lower-case(../grapheme), 'fußball')"&gt;&lt;xsl:value-of select="replace($sierra, 'fʊsbal','fuːsbal')"/&gt;&lt;/xsl:when&gt;</code></p></blockquote>
<p>Many other small changes have been made. Please <a href="http://spirit.blau.in/simon/2010/06/17/tutorial-import-german-dictionary/">import Schott’s German dictionary</a> (author: Kai Schott) into simon.</p>
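<p>For readers who do not use XSLT, the logic of the two rules above can be sketched in Python. This is an illustration only, not a replacement for <code>improve-german.xsl</code>: the function name <code>fix_phoneme</code> and the sample transcription are made up, and the sketch assumes that only the first matching rule is applied, as in a chain of <code>xsl:when</code> branches.</p>

```python
# Each rule: (text the grapheme must contain, wrong phonemes, corrected phonemes).
RULES = [
    ("planung", "planʊŋ", "plaːnʊŋ"),
    ("fußball", "fʊsbal", "fuːsbal"),
]

def fix_phoneme(grapheme, phoneme):
    """Apply the first rule whose keyword occurs in the lower-cased grapheme."""
    for keyword, wrong, right in RULES:
        if keyword in grapheme.lower():
            # Plain string replacement; the patterns contain no special characters.
            return phoneme.replace(wrong, right)
    return phoneme  # no rule matched, keep the transcription unchanged

# Made-up sample entry for illustration.
fixed = fix_phoneme("Stadtplanung", "ʃtatplanʊŋ")
print(fixed)
```

<p>Entries whose grapheme matches none of the keywords pass through unchanged.</p>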
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2011/11/01/schott%e2%80%99s-german-dictionary-0-2-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>German speech model &#8216;deep&#8217;</title>
		<link>http://spirit.blau.in/simon/2011/08/07/german-speech-model-deep/</link>
		<comments>http://spirit.blau.in/simon/2011/08/07/german-speech-model-deep/#comments</comments>
		<pubDate>Sun, 07 Aug 2011 15:08:00 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[import]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5634</guid>
		<description><![CDATA[Visit Schott&#8217;s German IPA FLAC files (section: deep) or Voxforge. Download and import the German speech model &#8216;deep&#8217;.]]></description>
			<content:encoded><![CDATA[<p>Visit <a href="http://script.blau.in/german/deep.xml">Schott&#8217;s German IPA FLAC files (section: deep)</a> or <a href="http://voxforge.org/home/downloads/speech/german-speech-files/german-ipa-flac-files-section-deep#qPFtzv5z-qCPHAai87hUhQ">Voxforge</a>. Download and import the <a href="http://ubuntuone.com/p/18tF/">German speech model &#8216;deep&#8217;</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2011/08/07/german-speech-model-deep/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>German speech model &#8216;deek&#8217;</title>
		<link>http://spirit.blau.in/simon/2011/08/03/german-speech-model-deek/</link>
		<comments>http://spirit.blau.in/simon/2011/08/03/german-speech-model-deek/#comments</comments>
		<pubDate>Wed, 03 Aug 2011 08:56:20 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[import]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5626</guid>
		<description><![CDATA[Visit Schott’s German IPA FLAC files (section: deek) [source 1] or Voxforge [source 2]. Get the corresponding speech model [object], and import it into simon 0.3.]]></description>
			<content:encoded><![CDATA[<p>Visit <a href="http://script.blau.in/german/deek.xml">Schott’s German IPA FLAC files (section: deek)</a> [source 1] or <a href="http://voxforge.org/home/downloads/speech/german-speech-files/german-ipa-flac-files-section-deek#004PZQQMTKqSO06mkFfihQ">Voxforge</a> [source 2]. Get the corresponding <a href="http://ubuntuone.com/p/1839/">speech model</a> [object], and import it into simon 0.3.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2011/08/03/german-speech-model-deek/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

