<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>testing simon &#187; PLS</title>
	<atom:link href="http://spirit.blau.in/simon/tag/pls/feed/" rel="self" type="application/rss+xml" />
	<link>http://spirit.blau.in/simon</link>
	<description>my first steps with the simon speech recognition software</description>
	<lastBuildDate>Tue, 10 Jan 2012 14:59:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Ralf&#8217;s Arabic dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 11:55:50 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Arabic]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[sed]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5726</guid>
		<description><![CDATA[This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon. A. Creation of the dictionary: 1. Get Arabic spelling dictionary. 2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file: GPL 2.0/LGPL [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon.</p>
<p><strong>A. Creation of the dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/Arabicspellchecker">Get</a> Arabic spelling dictionary.<br />
2. Check the license. Inside the file <a href="http://extensions.services.openoffice.org/en/download/4955">dict_ar-3.0.oxt</a> there is a file with the name COPYING (in the docs folder). It says in the file:</p>
<blockquote><p>GPL 2.0/LGPL 2.1/MPL 1.1 tri-license</p></blockquote>
<p>This means that I can use this tri-licensed spelling dictionary as the source for my future GPLv3 PLS dictionary.</p>
<p>3. Now I have to extract <code>dict_ar-3.0.oxt</code>.<br />
4. Let&#8217;s try the <code>unmunch</code> command inside the Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ unmunch ar.dic ar.aff > arabic</code></p></blockquote>
<p>It failed. I wasn&#8217;t able to unmunch the word list.<br />
5. I have to remove all numbers from ar.dic. This can be done with the <code>sed</code> command:</p>
<blockquote><p><code>sed 's/[0-9]*//g' ar.dic > arabic-without-numbers</code></p></blockquote>
<p>6. Remove the slash (&#8220;/&#8221;) from arabic-without-numbers with <a href="http://en.wikipedia.org/wiki/Geany">Geany</a>.<br />
7. Add lexicon tags at the beginning and the end of the file.<br />
8. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:arabic.xml</code></p></blockquote>
<p>9. The ISO 639-1 language code is <code>ar</code>.<br />
10. Maybe I will <a href="http://en.wikipedia.org/wiki/Romanization_of_Arabic#Comparison_table">use this table</a> for the grapheme-to-phoneme conversion.<br />
11. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Arabic$ saxonb-xslt -s:arabic.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-arabic.xsl_.zip'>improve-arabic.xsl</a>' -o:<a href="http://script.blau.in/arabic-dictionary.xml.bz2">arabic-dictionary.xml</a></code></p></blockquote>
<p>For this command to work, I first have to remove the number sign (&#8220;#&#8221;) from arabic.xml with Geany.</p>
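<p>Steps 5 and 6 could probably be combined into a single <code>sed</code> call, so that Geany isn&#8217;t needed for the slash. This is only a sketch: the sample entries are placeholders, the file names <code>sample.dic</code> and <code>sample-clean</code> are made up, and I haven&#8217;t run it against the real <code>ar.dic</code>.</p>

```shell
# Hypothetical sample in the style of a .dic file: a count line, then word/flags.
workdir=$(mktemp -d)
printf '%s\n' '3' 'kitab/12' 'qalam/7' 'bayt' > "$workdir/sample.dic"

# First expression deletes every digit, second deletes every slash,
# third drops lines that are empty afterwards (such as the count line).
sed -e 's/[0-9]//g' -e 's|/||g' -e '/^$/d' "$workdir/sample.dic" > "$workdir/sample-clean"

cat "$workdir/sample-clean"
```

<p>The third expression is an addition of mine: without it, the count line at the top of the file would remain as an empty line.</p>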
<p><strong>B. <a href="http://script.blau.in/arabic-dictionary.xml.bz2">Download</a> the dictionary. Import it into <a href="http://simon-listens.blogspot.com/">simon</a>.</strong></p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/arabic-pronunciation.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/arabic-pronunciation-271x300.jpg" alt="" title="arabic-pronunciation" width="271" height="300" class="alignleft size-medium wp-image-5736" /></a>The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column shows only &#8220;Unknown&#8221; entries. This is because the PLS dictionary contains no <code>role</code> attributes.
<div style="clear:both"></div>
<p>Now you know how I created the dictionary. And you know what the result looks like in simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/10/ralfs-arabic-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Hebrew dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 09:43:30 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Hebrew]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5718</guid>
		<description><![CDATA[In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon. A. Creation of the [...]]]></description>
			<content:encoded><![CDATA[<p>In 2009, I made some <a href="http://spirit.blau.in/simon/2009/09/10/confidence-score-with-hebrew/">initial tests with Hebrew</a>. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary, and what the result looks like when imported into simon.</p>
<p><strong>A. Creation of the dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/dict-he">Get</a> Hebrew spelling dictionary from OpenOffice.org.<br />
2. License is <a href="http://www.gnu.org/licenses/gpl-2.0.html">GPL</a>. There is a copyright notice inside the file <code>he_IL.aff</code>.</p>
<p>3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ unmunch he_IL.dic he_IL.aff > hebrew-test</code></p></blockquote>
<p>4. The source file <code>he_IL.dic</code> contains a lot of numbers. I remove them in the Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ <a href="http://www.cyberciti.biz/faq/sed-remove-all-digits-input-from-input/">sed</a> 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers</code></p></blockquote>
<p>With Geany, I remove the &#8220;,&#8221; (commas) and the &#8220;/&#8221; (slashes) that are still in the file hebrew-without-numbers. Now I have a clean word list with about 43,000 Hebrew words.</p>
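<p>As an alternative to <code>sed</code> plus Geany, <code>tr</code> could delete the digits, commas and slashes in one go. The following is a sketch under assumptions: the entries are placeholders rather than lines from <code>he_IL.dic</code>, and the file names are made up.</p>

```shell
# Hypothetical entries with numeric flags; the real he_IL.dic is not reproduced here.
workdir=$(mktemp -d)
printf '%s\n' '2' 'alef/1,23' 'bet/4' > "$workdir/sample.dic"

# tr -d deletes every listed character: the digits 0-9, the comma and the slash.
# grep -v '^$' then drops the line that became empty (the former count line).
tr -d '0-9,/' < "$workdir/sample.dic" | grep -v '^$' > "$workdir/sample-clean"

# Count the remaining words, one per line.
word_count=$(wc -l < "$workdir/sample-clean" | tr -d ' ')
echo "$word_count words"
```

<p>Because <code>tr</code> only deletes ASCII characters here, the Hebrew letters in a UTF-8 file should pass through unchanged.</p>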
<p>5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.<br />
6. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew-without-numbers -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:hebrew.xml</code></p></blockquote>
<p>7. ISO 639-1 <a href="http://en.wikipedia.org/wiki/Hebrew_language">language</a> code is <code>he</code>.<br />
8. I need a table for grapheme to phoneme conversion. Maybe I will <a href="http://en.wikipedia.org/wiki/Hebrew_alphabet#Pronunciation">use this table</a>. There are several tables available at Wikipedia. I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew <a href="http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/">share the same alphabet</a>. This means I could try to use the Yiddish <a href="http://spirit.blau.in/simon/files/2012/01/improve-yiddish.xsl_.zip">improve-yiddish.xsl</a> style sheet:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'/home/ubuntu/Documents/2011-II/Yiddish/dictionaries/improve-yiddish.xsl' -o:hebrew-dictionary.xml</code></p></blockquote>
<p>The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn&#8217;t been converted: [א]. I will add this letter to the <code>.xsl</code> style sheet and save it under the name <code>improve-hebrew.xsl</code>. Now I try it again:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ saxonb-xslt -s:hebrew.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-hebrew.xsl_.zip'>improve-hebrew.xsl</a>' -o:<a href="http://script.blau.in/hebrew-dictionary.xml.bz2" title="IPA phonetic dictionary, first draft">hebrew-dictionary.xml</a></code></p></blockquote>
<p>The result is not so good. Maybe I should adjust the grapheme-to-phoneme conversion rules to modern standard Israeli Hebrew. Or is this not necessary? I think for a first draft I can use the Yiddish transformation rules.</p>
<p><strong>B. <a href="http://script.blau.in/hebrew-dictionary.xml.bz2">Download</a> the dictionary. Import it into <a href="http://simon-listens.org/index.php?id=122&#038;L=1">simon</a> as shadow dictionary.</strong></p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/hebrew-SAMPA.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/hebrew-SAMPA-255x300.jpg" alt="" title="hebrew-SAMPA" width="255" height="300" class="alignleft size-medium wp-image-5723" /></a>Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (to be more exact, it displays just <code>Unknown</code>) since the source PLS dictionary contains no <code>role</code> attributes.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary. And you know what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn&#8217;t be a problem to adjust the style sheet <code>improve-hebrew.xsl</code> so that the phoneme results are better.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/10/ralfs-hebrew-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Belarusian dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/09/ralfs-belarusian-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/09/ralfs-belarusian-dictionary/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 20:27:52 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Belarusian]]></category>
		<category><![CDATA[iconv]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5699</guid>
		<description><![CDATA[This article explains how I create this PLS dictionary and what the imported result looks like. A. Creation of the Belarusian PLS dictionary: 1. Get spelling dictionary. I choose the official orthography. 2. License is LGPL (see hyph_be_BY.dic). I am allowed to &#8220;convert any LGPLed piece of software into a GPLed piece of software.&#8221; I [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains how I create this PLS dictionary and what the imported result looks like.</p>
<p><strong>A. Creation of the Belarusian PLS dictionary:</strong></p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/dict-be-official">Get</a> spelling dictionary. I choose the official orthography.<br />
2. License is <a href="http://www.gnu.org/licenses/lgpl.html">LGPL</a> (see <code>hyph_be_BY.dic</code>). I am <a href="http://en.wikipedia.org/wiki/GNU_Lesser_General_Public_License#Differences_from_the_GPL">allowed</a> to <em>&#8220;convert any LGPLed piece of software into a GPLed piece of software.&#8221;</em> I <a href="http://spirit.blau.in/simon/2010/05/22/ralfs-northern-sotho-dictionary/">did this before</a>. And I will do it again. This means that I get a spelling dictionary that is licensed under the LGPL. And I will produce a pronunciation dictionary that is licensed under the GPLv3. By the way, all my dictionaries are <a href="http://spirit.blau.in/simon/import-pls-dictionary/" title="phonetic IPA dictionaries">licensed under the GPLv3</a>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/2976">dict-be-official.oxt</a>.</p>
<p>4. The file <code>be-official.aff</code> is encoded in UTF-8. <a href="http://www.geany.org/">Geany</a> displays ISO-8859-1 as the encoding of <code>be-official.dic</code>, but I believe that this file is actually encoded in microsoft-cp1251. I had this encoding before (<a title="Macedonian PLS dictionary - Pronunciation Lexicon Specification" href="http://spirit.blau.in/simon/2010/04/24/ralfs-macedonian-dictionary/">Macedonian</a> and <a href="http://spirit.blau.in/simon/2010/04/17/creating-ralfs-bulgarian-dictionary/" title="phonetic dictionary">Bulgarian</a>).<br />
Now it is time to use the Ubuntu terminal:<br />
<code><a href="http://en.wikipedia.org/wiki/Cd_(command)">cd</a> /home/ubuntu/Documents/2011-II/Belarusian<br />
<a href="http://en.wikipedia.org/wiki/Iconv#Examples">iconv</a> -f cp1251 -t UTF-8 &lt;be-official.dic &gt;belarusian-utf8.dic</code><br />
The text file <code>belarusian-utf8.dic</code> looks fine.</p>
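<p>Before converting the whole file, the same <code>iconv</code> call can be tried on a tiny sample whose bytes are known. This is a sketch with assumptions: the sample word and the file names are my own choice, and the four bytes are the cp1251 codes for the letters of &#8220;мова&#8221;.</p>

```shell
workdir=$(mktemp -d)
# The four bytes EC EE E2 E0 spell "мова" in cp1251 (written in octal for printf).
printf '\354\356\342\340\n' > "$workdir/sample-cp1251.txt"

# Same call as for the dictionary, applied to the sample file.
iconv -f cp1251 -t UTF-8 < "$workdir/sample-cp1251.txt" > "$workdir/sample-utf8.txt"

cat "$workdir/sample-utf8.txt"
```

<p>If the terminal shows readable Cyrillic letters for the sample, the source encoding was probably guessed correctly.</p>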
<p>5. Now I change the line <code>SET microsoft-cp1251</code> in the file <code>be-official.aff</code> to <code>SET UTF-8</code>.<br />
6. I don&#8217;t know whether the next step is necessary. I could convert the file <code>hyph_be_BY.dic</code> from cp1251 into UTF-8. At the moment, I skip this step.</p>
<p>7. Ubuntu terminal: <code>unmunch belarusian-utf8.dic be-official.aff &gt; belarusian-wordlist</code><br />
I think that this step wasn&#8217;t necessary. It didn&#8217;t extract the word list. At the moment, I have a word list of 1.5 million words. This is way too much. The target size is 400,000 words.</p>
<p>8. I have to reduce the dictionary size. I found a <a href="http://www.unix.com/shell-programming-scripting/24845-remove-every-third-line-file.html">tip</a>. Ubuntu terminal:</p>
<blockquote><p><code><a href="http://en.wikipedia.org/wiki/Sed#Usage">sed</a> -n 'p;N;N;N' belarusian-wordlist <a href="http://en.wikipedia.org/wiki/Redirection_%28computing%29#Redirecting_standard_input_and_standard_output">></a> belarusian-wordlist-reduced</code></p></blockquote>
<p>Yes, it worked. The command prints one line and then skips the next three, so about a quarter of the list remains. The word list now contains 391,000 words. This is a good basis for a PLS dictionary.</p>
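<p>To see which lines survive, the <code>sed</code> command can be tried on a short numbered sample. This is a minimal sketch: the ten numbers stand in for the word list, and the <code>awk</code> line is my own equivalent for comparison, not part of the original tip.</p>

```shell
workdir=$(mktemp -d)
# Ten numbered lines stand in for the word list.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/sample-wordlist"

# p prints the current line; each N pulls in one more line without printing it.
kept_by_sed=$(sed -n 'p;N;N;N' "$workdir/sample-wordlist")

# The same selection written as a condition on the line number.
kept_by_awk=$(awk 'NR % 4 == 1' "$workdir/sample-wordlist")

echo "$kept_by_sed"
```

<p>On the sample, lines 1, 5 and 9 are kept. Changing the number of <code>N</code> commands changes how many lines are skipped after each printed line.</p>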
<p>9. Add lexicon elements at the beginning and the end of belarusian-wordlist-reduced.<br />
10. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian-wordlist-reduced -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:belarusian.xml</code></p></blockquote>
<p>11. <a href="http://en.wikipedia.org/wiki/Belarusian_language">Language</a> code is be.<br />
12. I will use <a href="http://en.wikipedia.org/wiki/Belarusian_alphabet#Letters">this</a> table for grapheme to phoneme mapping.<br />
13. Creation of the phoneme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Belarusian$ saxonb-xslt -s:belarusian.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-belarusian.xsl_.zip'>improve-belarusian.xsl</a>' -o:<a href="http://script.blau.in/belarusian-dictionary.xml.bz2" title="Belarusian IPA phonetic dictionary">belarusian-dictionary.xml</a></code></p></blockquote>
<p><strong>B. <a href="http://script.blau.in/belarusian-dictionary.xml.bz2" title="Belarusian - Pronunciation Lexicon Specification">Download</a> and import the dictionary.</strong> </p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/belarusian.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/belarusian-300x275.jpg" alt="" title="belarusian" width="300" height="275" class="alignleft size-medium wp-image-5709" /></a>Let&#8217;s take a look at the result. The left column contains 391669 Belarusian words. The pronunciation column contains the corresponding SAMPA transcriptions. All entries in the third column are marked as &#8220;Unknown&#8221;. This is because the Belarusian PLS dictionary doesn&#8217;t contain any <code>role</code> attribute.</p>
<div style="clear:both"></div>
<p>Now you know how I created the dictionary. And you have an impression of what the result looks like when imported into simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/09/ralfs-belarusian-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Asturian dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 20:30:43 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Asturian]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5678</guid>
		<description><![CDATA[This article explains how I create the Asturian PLS dictionary, and says a few words about the import into simon. A. How I create the dictionary: 1. Get spelling dictionary. 2. Check license. It is GPLv3. 3. Extract asturianu.oxt. 4. Language code is ast. 5. Ubuntu terminal: ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist The result is a [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains how I create the Asturian PLS dictionary, and says a few words about the import into simon.</p>
<p>A. How I create the dictionary:<br />
1. <a href="http://extensions.services.openoffice.org/en/project/asturianu">Get</a> spelling dictionary.<br />
2. Check license. It is <a href="http://extensions.services.openoffice.org/en/project/license/3932">GPLv3</a>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/5129">asturianu.oxt</a>.<br />
4. <a href="http://en.wikipedia.org/wiki/Asturian_language">Language</a> code is ast.<br />
5. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ unmunch ast.dic ast.aff > asturian-wordlist</code></p></blockquote>
<p>The result is a file of 70MB with more than 5 million words. This word list is too big. I should reduce it. I had the <a href="http://spirit.blau.in/simon/2010/04/13/removing-words-from-latin-dictionary/">same problem</a> with my Latin dictionary. I had to reduce the size.</p>
<p>6. Add lexicon elements at the beginning/end of asturian-wordlist.</p>
<p>7. Generate .xml document with lexicon, lexeme and grapheme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>I got an error message because the available memory isn&#8217;t enough (&#8220;Java heap space&#8221;). I think that I should reduce the file size with <a href="http://en.wikipedia.org/wiki/Grep">grep</a>. Or I could install VisualVM. I think I will work with grep:<br />
a. Remove lines that begin with <code>l'</code>: <code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ grep -v ^l\' asturian-wordlist > asturian-wordlist-02</code><br />
b. Remove lines that begin with <code>t'</code>: <code>grep -v ^t\' asturian-wordlist-02 > asturian-wordlist-03</code><br />
c. Remove lines that begin with <code>s'</code>: <code>grep -v ^s\' asturian-wordlist-03 > asturian-wordlist-04</code><br />
d. Remove lines that begin with <code>m'</code>: <code>grep -v ^m\' asturian-wordlist-04 > asturian-wordlist-05</code><br />
e. Remove lines that begin with <code>n'</code>: <code>grep -v ^n\' asturian-wordlist-05 > asturian-wordlist-06</code><br />
f. Remove lines that begin with <code>d'</code>: <code>grep -v ^d\' asturian-wordlist-06 > asturian-wordlist-07</code><br />
g. Remove lines that begin with <code>qu'</code>: <code>grep -v ^qu\' asturian-wordlist-07 > asturian-wordlist-08</code><br />
h. Remove lines that begin with <code>p'</code>: <code>grep -v ^p\' asturian-wordlist-08 > asturian-wordlist-09</code><br />
The dictionary will contain 1.1 million words. I think that this number is acceptable.</p>
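<p>The eight <code>grep</code> calls above could probably be collapsed into one call with an extended regular expression. This is a sketch, not what I actually ran: the sample lines and the file names <code>sample-wordlist</code> and <code>sample-reduced</code> are made up, and it assumes that the word list uses the plain ASCII apostrophe.</p>

```shell
workdir=$(mktemp -d)
# Made-up sample lines; only the prefixes matter here.
printf '%s\n' "l'agua" "qu'anda" "casa" "d'oru" "llobu" > "$workdir/sample-wordlist"

# -E enables alternation, -v inverts the match: drop every line that starts
# with one of the listed prefixes followed by an apostrophe.
grep -Ev "^(l|t|s|m|n|d|qu|p)'" "$workdir/sample-wordlist" > "$workdir/sample-reduced"

cat "$workdir/sample-reduced"
```

<p>A single pass avoids the intermediate files <code>asturian-wordlist-02</code> to <code>asturian-wordlist-08</code>. Words that merely start with one of the letters, without the apostrophe, stay in the list.</p>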
<p>8. And now Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-09 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>This command creates a PLS dictionary without phoneme elements. The phoneme elements will be added later.</p>
<p>9. I will use <a href="http://en.wikipedia.org/wiki/Asturian_language#Orthography">this</a> table for grapheme to phoneme conversion. Here is the command that creates the phoneme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-asturian.xsl_.zip'>improve-asturian.xsl</a>' -o:<a href="http://script.blau.in/asturian-dictionary.xml.bz2">asturian-dictionary.xml</a></code></p></blockquote>
<p>10. I tried to import the resulting dictionary into simon. Unfortunately, simon stopped responding after the import had finished. I assume that the dictionary is way too big. I have to reduce its size again.<br />
a. Remove lines that contain <code>'l</code>: <code>grep -v \'l asturian-wordlist-09 > asturian-wordlist-10</code><br />
b. Continue to reduce the size of the word list: <code>grep -v ylu asturian-wordlist-10 > asturian-wordlist-11</code><br />
c. This isn&#8217;t enough, I have to remove about 80,000 words: <code>grep -v les asturian-wordlist-11 > asturian-wordlist-12</code><br />
d. Remove 136,000 words: <code>grep -v mos asturian-wordlist-12 > asturian-wordlist-13</code><br />
e. Remove 67,000 words: <code>grep -v los asturian-wordlist-13 > asturian-wordlist-14</code><br />
f. Remove 265,000 words: <code>grep -v es asturian-wordlist-14 > asturian-wordlist-15</code><br />
You can see that it is a lot of work to get a dictionary size that is suitable for simon. At the moment, the word list contains 539,000 words. Is this number OK, or should I continue to reduce the size? I think that I will try the import again. First, I create the <code>.xml</code> file again:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ saxonb-xslt -s:asturian-wordlist-15 -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:asturian.xml</code></p></blockquote>
<p>And now I repeat step 9. The file <code>asturian-dictionary.xml</code> has a size of 45 MB. I hope that this size is OK.</p>
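<p>For the size reduction in step 10, it helps to know in advance how many words a pattern would remove. <code>grep -c</code> can report that without writing a new file. The following is a sketch with made-up sample lines and a made-up file name.</p>

```shell
workdir=$(mktemp -d)
# Made-up sample lines standing in for one of the intermediate word lists.
printf '%s\n' 'cantalesmos' 'casa' 'dameles' 'llobu' > "$workdir/sample-wordlist"

# grep -c counts matching lines; with -v it counts the lines that would remain.
would_remove=$(grep -c 'les' "$workdir/sample-wordlist")
would_keep=$(grep -vc 'les' "$workdir/sample-wordlist")

echo "pattern 'les': removes $would_remove, keeps $would_keep"
```

<p>This way, different patterns can be compared before one of them is applied to the real word list.</p>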
<p>B. <a href="http://script.blau.in/asturian-dictionary.xml.bz2">Download the dictionary</a>. Import it into simon.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/asturian.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/asturian-238x300.jpg" alt="" title="asturian" width="238" height="300" class="alignleft size-medium wp-image-5693" /></a>Take a look at the result. In the left column, you can see the Asturian words. This dictionary contains 539928 words. The right column contains the corresponding SAMPA transcriptions.
<div style="clear:both"></div>
<p>You can see that it was a lot of work to reduce the size of the dictionary. At least it now has a size that isn&#8217;t too big for simon.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/05/ralfs-asturian-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s Yiddish dictionary</title>
		<link>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 17:05:18 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[saxonb-xslt]]></category>
		<category><![CDATA[unmunch]]></category>
		<category><![CDATA[yiddish]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=5663</guid>
		<description><![CDATA[This article explains some details about the creation of the dictionary, and what the result looks like in simon. A. How I create Ralf's Yiddish dictionary: 1. Get spelling dictionary. 2. License is GPLv3. 3. Extract jidysz.net.ooo.spellchecker.oxt. 4. Ubuntu terminal: cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries sudo apt-get install hunspell-tools unmunch yi.dic yi.aff &#62; yiddish-wordlist 5. Add &#60;lexicon&#62; at [...]]]></description>
			<content:encoded><![CDATA[<p>This article explains some details about the creation of the dictionary, and what the result looks like in simon.</p>
<p>A. How I create <code>Ralf's Yiddish dictionary</code>:</p>
<p>1. <a href="http://extensions.services.openoffice.org/en/project/jidysz-net-ooo-spellchecker">Get</a> spelling dictionary.<br />
2. License is <code>GPLv3</code>.<br />
3. Extract <a href="http://extensions.services.openoffice.org/en/download/4324"><code>jidysz.net.ooo.spellchecker.oxt</code></a>.<br />
4. Ubuntu terminal:<br />
<code>cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries<br />
sudo apt-get install hunspell-tools<br />
unmunch yi.dic yi.aff &gt; yiddish-wordlist</code><br />
5. Add  <code>&lt;lexicon&gt;</code> at the beginning of yiddish-wordlist. Add <code>&lt;/lexicon&gt;</code> at the end of this file.<br />
6. Generate <code>.xml</code> document with lexicon, lexeme and grapheme elements:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish-wordlist -xsl:'http://spirit.blau.in/simon/files/2010/04/create-xml-file.xsl' -o:yiddish.xml</code></p></blockquote>
<p>7. ISO 639-1 <a href="http://en.wikipedia.org/wiki/Yiddish_language">language code</a> is yi.<br />
8. I think I will use <a href="http://en.wikipedia.org/wiki/Yiddish_orthography#The_Yiddish_alphabet">this table</a> as source for the grapheme to phoneme mapping.<br />
9. Ubuntu terminal:</p>
<blockquote><p><code>ubuntu@ubuntu:~/Documents/2011-II/Yiddish/dictionaries$ saxonb-xslt -s:yiddish.xml -xsl:'<a href='http://spirit.blau.in/simon/files/2012/01/improve-yiddish.xsl_.zip'>improve-yiddish.xsl</a>' -o:yiddish-dictionary.xml</code></p></blockquote>
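<p>Step 5 (adding the lexicon tags) doesn&#8217;t need a text editor. It could be done in the terminal as sketched below. The two placeholder words and the file names <code>sample-wordlist</code> and <code>sample-wrapped</code> are assumptions for illustration.</p>

```shell
workdir=$(mktemp -d)
# Placeholder words; the real yiddish-wordlist has one word per line.
printf '%s\n' 'word-one' 'word-two' > "$workdir/sample-wordlist"

# Write the opening tag, the untouched word list and the closing tag
# into a new file, in that order.
{
  echo '<lexicon>'
  cat "$workdir/sample-wordlist"
  echo '</lexicon>'
} > "$workdir/sample-wrapped"

cat "$workdir/sample-wrapped"
```

<p>Writing to a new file keeps the original word list untouched, which is useful if the style sheet has to be run again later.</p>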
<p>B. <a href="http://script.blau.in/yiddish-dictionary.xml.bz2">Download the dictionary</a>, and <a href="http://spirit.blau.in/simon/2010/06/17/tutorial-import-german-dictionary/">import</a> it into simon.</p>
<p><a href="http://spirit.blau.in/simon/files/2012/01/yiddish.jpg"><img src="http://spirit.blau.in/simon/files/2012/01/yiddish-243x300.jpg" alt="" title="yiddish" width="243" height="300" class="alignleft size-medium wp-image-5670" /></a>Take a look at the result. The left column contains the Yiddish words. This dictionary contains 99980 words. The right column contains the corresponding SAMPA transcription.<br />
<a href="http://en.wikipedia.org/wiki/Yiddish_language">Yiddish</a> is written in the Hebrew alphabet. The <a href="http://en.wikipedia.org/wiki/Hebrew_alphabet">Hebrew alphabet</a> is written from right to left. Obviously, the corresponding SAMPA transcriptions are written from left to right. This means that the phoneme order should be fine.</p>
<div style="clear:both"></div>
<p>There are a lot of other PLS dictionaries available. <a href="http://spirit.blau.in/simon/import-pls-dictionary/">Find the PLS dictionary</a> that suits your language.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2012/01/03/ralfs-yiddish-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Advantages of Ralf&#8217;s French dictionary</title>
		<link>http://spirit.blau.in/simon/2010/04/07/advantages-of-ralfs-french-dictionary/</link>
		<comments>http://spirit.blau.in/simon/2010/04/07/advantages-of-ralfs-french-dictionary/#comments</comments>
		<pubDate>Wed, 07 Apr 2010 15:33:39 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[French]]></category>
		<category><![CDATA[node17]]></category>
		<category><![CDATA[PLS]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=2559</guid>
		<description><![CDATA[At the moment, I am thinking about the advantages that Ralf's French dictionary has to offer. Let me show you the advantages: 1. The dictionary is stored as XML file. This means that you can edit the dictionary with gedit. Or you can transform the dictionary with XSLT. 2. The encoding is UTF-8. With this [...]]]></description>
			<content:encoded><![CDATA[<p>At the moment, I am thinking about the advantages that <code>Ralf's French dictionary</code> has to offer. Let me show you the advantages:</p>
<p><a href="http://spirit.blau.in/simon/files/2010/04/french-advantages.png"><img class="alignnone size-medium wp-image-2560" src="http://spirit.blau.in/simon/files/2010/04/french-advantages-214x300.png" alt="french-advantages" width="214" height="300" /></a></p>
<p>1. The dictionary is stored as an <code>XML</code> file. This means that you can edit the dictionary with <a href="http://fr.wikipedia.org/wiki/Gedit">gedit</a>. Or you can <a href="http://spirit.blau.in/simon/2009/12/24/zufuhrungsdrahten-two-pronunciations/#volunteer">transform the dictionary with XSLT</a>.</p>
<p>2. The encoding is UTF-8. With this encoding there shouldn&#8217;t be encoding problems:</p>
<p>2.a. UTF-8 means no encoding problems within the <code>&lt;grapheme&gt;</code> element: <a href="http://spirit.blau.in/simon/2009/09/10/confidence-score-with-hebrew/">Hebrew</a> or <a href="http://spirit.blau.in/simon/2009/11/12/import-60000-tamil-words/">Tamil</a> are no problem. And this means that French accents are displayed correctly inside the <code>&lt;grapheme&gt;</code> element. No garbled characters should occur.</p>
<p>2.b. UTF-8 means that the IPA phonemes are displayed correctly inside the <code>&lt;phoneme&gt;</code> elements. The dictionary can easily be edited by human editors. It is difficult to edit a phonetic dictionary that contains <a href="http://en.wikipedia.org/wiki/X-SAMPA">X-SAMPA</a> or <a href="http://en.wikipedia.org/wiki/Kirshenbaum">Kirshenbaum</a> characters. IPA phonemes can easily be read by humans. And it is no problem to process IPA characters with <code>saxonb-xslt</code> (type <code>saxonb-xslt</code> into the Ubuntu terminal). Every detail of the French language can be captured.</p>
<p>3. The license of <code>Ralf's French dictionary</code> is GPLv3. Maybe the simon developers are interested in offering an automatic dictionary import? The license would permit this. You can see that the dictionary design is very developer-friendly.</p>
<p>4. The <code>&lt;lexicon&gt;</code> element contains the attribute <code>alphabet="ipa"</code>. Not all of my dictionaries contain IPA characters. Some dictionaries contain <code>eSpeak</code> characters; these dictionaries contain the attribute <code>alphabet="espeak"</code>. The PLS standard allows us to use different alphabets for the <code>&lt;phoneme&gt;</code> elements. Personally, I prefer the IPA alphabet. But of course, other alphabets could be used. A future version of <code>simon</code> could differentiate between the different alphabets during dictionary import. I am thinking about the following solution:<br />
- <code>alphabet="ipa"</code> is used by <code>Ralf's German dictionary</code>, <code>Ralf's French dictionary</code>, <code>Ralf's Spanish dictionary</code>;<br />
- <code>alphabet="espeak"</code> can be used by other dictionaries (or alternatively, I transform the eSpeak phonemes into IPA phonemes). I am not sure whether it is good to use eSpeak phonemes inside some of my PLS dictionaries. Maybe it would be better to convert them into IPA phonemes?</p>
<p>Currently, the simon dictionary import process doesn&#8217;t differentiate between the attribute values <code>alphabet="ipa"</code> and <code>alphabet="espeak"</code>. As far as I know, the <code>eSpeak</code> phonemes are almost the same for all languages. So maybe it would be a good idea if simon were able to import PLS dictionaries with eSpeak phonemes. I am saying <em>maybe</em> because I am not sure whether this would be a good decision. In the long run, IPA is the better choice (because it is easily editable by humans).</p>
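<p>To illustrate the idea of differentiating by alphabet, here is a minimal Python sketch. It is not simon&#8217;s import code. The converter functions <code>convert_ipa</code> and <code>convert_espeak</code> are hypothetical placeholders that only tag the phoneme string, and the sample lexicon is built in memory instead of being read from one of my dictionaries.</p>

```python
import xml.etree.ElementTree as ET


# Hypothetical converters: a real importer would map the phonemes to
# the recognizer's own phoneme set here. These only tag the input.
def convert_ipa(text):
    return "ipa:" + text


def convert_espeak(text):
    return "espeak:" + text


CONVERTERS = {"ipa": convert_ipa, "espeak": convert_espeak}


def local_name(tag):
    # Drop a "{namespace}" prefix so namespaced PLS files also work.
    return tag.rsplit("}", 1)[-1]


def import_phonemes(lexicon):
    """Return (grapheme, converted phoneme) pairs for one lexicon element."""
    alphabet = lexicon.get("alphabet")
    if alphabet not in CONVERTERS:
        raise ValueError("unsupported alphabet: %r" % alphabet)
    convert = CONVERTERS[alphabet]
    entries = []
    for element in lexicon.iter():
        if local_name(element.tag) != "lexeme":
            continue
        # If a lexeme has several phoneme children, the last one wins here.
        fields = {local_name(child.tag): child.text for child in element}
        entries.append((fields["grapheme"], convert(fields["phoneme"])))
    return entries


def build_sample(alphabet, phoneme):
    # Small in-memory lexicon with a single entry, for demonstration only.
    lexicon = ET.Element("lexicon", {"alphabet": alphabet})
    lexeme = ET.SubElement(lexicon, "lexeme")
    ET.SubElement(lexeme, "grapheme").text = "bon"
    ET.SubElement(lexeme, "phoneme").text = phoneme
    return lexicon


if __name__ == "__main__":
    print(import_phonemes(build_sample("ipa", "b\u0254\u0303")))
```

<p>With such a dispatch table, a dictionary that declares an unknown alphabet is rejected instead of being imported with the wrong phoneme mapping.</p>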
<p>You can see that I spent a lot of time thinking about the different phoneme characters. It is great to see that simon handles <a href="http://spirit.blau.in/simon/2010/04/06/omitting-the-knacklaut/#comment-275">almost all</a> German phonemes. <strong>At the moment, there are <a href="http://spirit.blau.in/simon/2010/04/06/ralfs-french-dictionary-0-1-2-released/">import problems with the French IPA phonemes ɔ̃ — ɛ̃ — ɑ̃ — ɥ</a>. The <a href="http://en.wikipedia.org/wiki/Tilde">tilde</a> is <a href="http://spirit.blau.in/simon/2010/04/06/what-happens-with-french-phonemes/">imported by simon as the SAMPA sequence <code>n a s</code></a>. This is wrong, and should be corrected.</strong></p>
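<p>A possible fix could look like the following Python sketch. It treats a base character plus its combining tilde as one phoneme before looking it up. The target symbols are the X-SAMPA equivalents of the four phonemes; whether simon expects exactly these symbols internally is an assumption on my side, and the function names are made up for this example.</p>

```python
import unicodedata

# X-SAMPA equivalents of the four French phonemes. Assumption: the
# importer wants X-SAMPA; simon's internal table may differ.
FRENCH_IPA_TO_XSAMPA = {
    "\u0254\u0303": "O~",  # open o + combining tilde
    "\u025b\u0303": "E~",  # open e + combining tilde
    "\u0251\u0303": "A~",  # alpha + combining tilde
    "\u0265": "H",         # turned h
}


def split_phonemes(ipa):
    """Split an IPA string into units, keeping combining marks attached."""
    units = []
    for char in unicodedata.normalize("NFD", ipa):
        if unicodedata.combining(char) and units:
            # A combining mark belongs to the preceding base character.
            units[-1] += char
        else:
            units.append(char)
    return units


def to_xsampa(ipa, table=FRENCH_IPA_TO_XSAMPA):
    # Units without an entry in the table are passed through unchanged.
    return [table.get(unit, unit) for unit in split_phonemes(ipa)]


if __name__ == "__main__":
    print(to_xsampa("b\u0254\u0303"))
```

<p>The important point is the splitting step: the tilde is never looked up on its own, so it cannot turn into a separate sequence.</p>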
<p>5. Each of my dictionaries contains a language tag. <code>Ralf's German dictionary</code> contains <code>xml:lang="de-DE"</code> because it is Standard German. <code>Ralf's French dictionary</code> indicates <a href="http://en.wikipedia.org/wiki/Standard_French">Standard French</a> by using the language tag <code>xml:lang="fr"</code>. It would be possible to develop a phonetic dictionary for <a href="http://en.wikipedia.org/wiki/Canadian_French">Canadian French</a>. In this case, the tag would be <a href="http://de.wikipedia.org/wiki/Speech_Recognition_Grammar_Specification"><code>xml:lang="fr-CA"</code></a>.</p>
<p>It is possible that a future version of simon automatically &#8220;understands&#8221; the language of the specific PLS dictionary. E.g. <a href="http://script.blau.in/austrian.xml"><code>Ralf's Austrian German dictionary</code></a> contains the language tag <code>xml:lang="de-AT"</code>. Maybe it would be possible to ask the simon user via a wizard:</p>
<blockquote><p>Which language do you want to use for dictation? Please select the appropriate language.</p>
<p>◊ Standard German (<a href="http://www.cyber-byte.at/wiki/index.php/Deutsch:_Schattenw%C3%B6rterbuch">BOMP</a>)<br />
◊ Standard German (<a href="http://script.blau.in/german-dictionary.xml.bz2">PLS</a>)<br />
◊ Standard German (PLS) + Austrian German (<a href="http://script.blau.in/austrian.xml">PLS</a>)<br />
◊ Only Austrian German (PLS)<br />
◊ Standard French (PLS)<br />
◊ Standard German (PLS) + Medical German (<a href="http://script.blau.in/medical.xml">PLS</a>)<br />
◊ [Afrikaans, Catalan, Croatian, ...]</p></blockquote>
<p>Thanks to the language tag <code>xml:lang="fr"</code> (each dictionary contains a specific language tag) it should not be too difficult to develop an <a href="http://spirit.blau.in/simon/2010/01/06/try-to-install-revision-1112-on-32-bit-ubuntu/#pls-dictionaries">automatic dictionary import function for the 27 PLS dictionaries</a>.</p>
<p>simon offers automatic import of the German BOMP dictionary (obviously, this dictionary is of very good quality). But what about other languages? <strong>It would be good for the marketing if simon offered automatic dictionary import for all 27 PLS dictionaries.</strong> Almost all of my dictionaries are in an early stage of development. But this is no problem at the moment. Each dictionary can be improved easily. It is just necessary to run an XSLT style-sheet that transforms the eSpeak phonemes into IPA phonemes. No big deal. I did this for German, for French, and for a couple of other languages.</p>
<p>How many phonemes are needed? I would say that we don&#8217;t need all IPA phonemes. We can stick to the existing ones <em>plus</em> 4 French phonemes <em>plus</em> 4 Spanish phonemes. That should do the job for all 27 languages. My goal is to help make simon usable for 27 languages. Usable means that all major characteristics (=phonemes) of each specific language are covered by the specific phonetic dictionary.</p>
<p>My personal focus is on the following dictionaries: German, French, Spanish, English (maybe I will transform the <a href="http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Lexicon/">English Voxforge dictionary</a> into PLS format). The other languages (Afrikaans, Catalan, Croatian, &#8230;) will have to use the phonemes that are used by these four languages. I don&#8217;t plan to add more phonemes to my PLS dictionaries. I want to keep it as simple as possible, and as complex as necessary.</p>
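<p>Reading the language tag is simple. The following Python sketch shows one way to do it; the label table and the function name <code>describe_dictionary</code> are my own assumptions for a selection wizard, not something that exists in simon.</p>

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical labels that a selection wizard could display.
LANGUAGE_LABELS = {
    "de-DE": "Standard German",
    "de-AT": "Austrian German",
    "fr": "Standard French",
    "fr-CA": "Canadian French",
}


def describe_dictionary(lexicon):
    """Return (language tag, display label) for a lexicon element."""
    tag = lexicon.get(XML_LANG)
    if tag is None:
        return None, "Unknown language"
    # Fall back to the raw tag when no label is known.
    return tag, LANGUAGE_LABELS.get(tag, tag)


if __name__ == "__main__":
    sample = ET.Element("lexicon", {XML_LANG: "fr", "alphabet": "ipa"})
    print(describe_dictionary(sample))
```

<p>For a real dictionary file, the root element returned by <code>ET.parse(path).getroot()</code> can be passed to the same function.</p>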
<p>6. I am experimenting with the <code>role</code> attribute in <code>Ralf's French dictionary</code>. I tried the following <code>role</code> attributes: <code>lettre</code> and <code>abrévation</code>. simon displays the terminal <code>abrévation</code> correctly:</p>
<p><a href="http://spirit.blau.in/simon/files/2010/04/abreviation.png"><img class="alignnone size-medium wp-image-2571" src="http://spirit.blau.in/simon/files/2010/04/abreviation-300x145.png" alt="abreviation" width="300" height="145" /></a></p>
<p>This means that French terminals (= value of the <code>role</code> attribute) can use an <a href="http://fr.wikipedia.org/wiki/Apostrophe_%28typographie%29#L.E2.80.99apostrophe">apostrophe</a>.</p>
<p>Conclusion: <code>Ralf's French dictionary</code> is user-friendly and developer-friendly. I propose that the simon developers do the following two things:<br />
- add the 4 French phonemes ɔ̃ — ɛ̃ — ɑ̃ — ɥ to the simon import process;<br />
- offer <a href="http://spirit.blau.in/simon/files/2010/01/sidebar-pls.png">automatic PLS dictionary import for 27 languages</a> (simon should download the specific dictionary directly from the internet, and import it automatically after the download has finished).</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2010/04/07/advantages-of-ralfs-french-dictionary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ralf&#8217;s French dictionary 0.1.2 released</title>
		<link>http://spirit.blau.in/simon/2010/04/06/ralfs-french-dictionary-0-1-2-released/</link>
		<comments>http://spirit.blau.in/simon/2010/04/06/ralfs-french-dictionary-0-1-2-released/#comments</comments>
		<pubDate>Tue, 06 Apr 2010 21:33:23 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[node17]]></category>
		<category><![CDATA[PLS]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=2545</guid>
		<description><![CDATA[A few minutes ago, I uploaded Ralf's French dictionary version 0.1.2 (license: GPLv3). Download the dictionary, and import it into simon as PLS dictionary. I applied the following changes to the dictionary: 1. I added an empty role attribute to each &#60;lexeme&#62; element. At the moment, Ralf's French dictionary doesn&#8217;t contain any terminal information (noun, [...]]]></description>
			<content:encoded><![CDATA[<p>A few minutes ago, I uploaded <code>Ralf's French dictionary</code> version 0.1.2 (license: <a href="http://script.blau.in/etc/GPL_License">GPLv3</a>). <a href="http://script.blau.in/french-dictionary.xml.bz2">Download</a> the dictionary, and <a href="http://spirit.blau.in/simon/2010/01/08/tutorial-how-to-install-under-ubuntu/#pls-dictionary">import it into simon as PLS dictionary</a>. I applied the following changes to the dictionary:</p>
<p>1. I added an empty <code>role</code> attribute to each <code>&lt;lexeme&gt;</code> element. At the moment, <code>Ralf's French dictionary</code> doesn&#8217;t contain any terminal information (noun, verb, adjective). It is possible to add terminal information to this dictionary with a simple text editor.</p>
<p>2. I changed another thing: The <a href="http://spirit.blau.in/simon/2009/11/03/more-than-300000-french-words/">previous version</a> of <code>Ralf's French dictionary</code> contained about 60,000 duplicate <code>&lt;lexeme&gt;</code> elements. I removed these elements with the following <code>XSLT</code> instruction:</p>
<blockquote><p><code>&lt;xsl:for-each-group select="lexeme" group-by="grapheme"&gt;</code></p></blockquote>
<p>You can find this line in the style-sheet <a href="http://spirit.blau.in/simon/files/2010/04/improve-french-dictionary.xsl"><code>improve-french-dictionary.xsl</code></a> (license: GPLv3). I used this style-sheet to generate version 0.1.2 using the Ubuntu terminal:</p>
<blockquote><p><code>am3msi@am3msi-desktop:~/Documents/201004/french-dictionary$ <strong>saxonb-xslt</strong> -ext:on -s:french-dictionary.xml -xsl:improve-french-dictionary.xsl -o:french-dictionary-0.1.2.xml</code></p></blockquote>
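<p>For readers who don&#8217;t use XSLT, the same idea can be sketched in Python. This is not a translation of my style-sheet: it simply keeps the first <code>&lt;lexeme&gt;</code> for each grapheme and drops the later ones, which is an assumption about how duplicates should be resolved. The sample words are only for demonstration.</p>

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop a "{namespace}" prefix so namespaced PLS files also work.
    return tag.rsplit("}", 1)[-1]


def grapheme_of(lexeme):
    for child in lexeme:
        if local_name(child.tag) == "grapheme":
            return child.text
    return None


def remove_duplicate_lexemes(lexicon):
    """Keep the first lexeme per grapheme; return how many were removed."""
    seen = set()
    removed = 0
    # Iterate over a copy because the element is modified in the loop.
    for lexeme in list(lexicon):
        if local_name(lexeme.tag) != "lexeme":
            continue
        key = grapheme_of(lexeme)
        if key in seen:
            lexicon.remove(lexeme)
            removed += 1
        else:
            seen.add(key)
    return removed


def build_sample(words):
    # In-memory lexicon with one lexeme per listed word.
    lexicon = ET.Element("lexicon", {"alphabet": "ipa"})
    for word in words:
        lexeme = ET.SubElement(lexicon, "lexeme", {"role": ""})
        ET.SubElement(lexeme, "grapheme").text = word
    return lexicon


if __name__ == "__main__":
    sample = build_sample(["chat", "chat", "chien"])
    print(remove_duplicate_lexemes(sample), len(sample))
```

<p>Note that this variant throws away alternative pronunciations of the same grapheme. If they should be kept, the grouping key would have to include the phoneme as well.</p>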
<p>3. The language tag is now correct: <code>xml:lang="fr"</code></p>
<p>Unfortunately, at the moment the French phonemes ɔ̃  &#8212;  ɛ̃  &#8212;  ɑ̃  &#8212;  ɥ <a href="http://spirit.blau.in/simon/2010/04/06/what-happens-with-french-phonemes/">aren&#8217;t transcribed into the correct SAMPA phonemes</a> during the <code>simon</code> import process. It shouldn&#8217;t be a big deal to fix this. As soon as this issue has been fixed, <code>Ralf's French dictionary</code> should be usable for training French words. Remember: the dictionary contains more than 300,000 French <code>&lt;lexeme&gt;</code> elements. It would be nice if a native speaker from France took a closer look at <code>Ralf's French dictionary</code>, and made suggestions for improvements.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2010/04/06/ralfs-french-dictionary-0-1-2-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What are my targets?</title>
		<link>http://spirit.blau.in/simon/2010/01/22/what-are-my-targets/</link>
		<comments>http://spirit.blau.in/simon/2010/01/22/what-are-my-targets/#comments</comments>
		<pubDate>Fri, 22 Jan 2010 12:19:53 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[PLS]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=2378</guid>
		<description><![CDATA[I have two targets: 1. Produce more PLS dictionaries that can be imported into simon. I am planning to explain the development steps in this blog. It might be a little off-topic, but I think it is important to inform the people. This means that I will provide the reader with details of dictionary development [...]]]></description>
			<content:encoded><![CDATA[<p>I have two targets:</p>
<p>1. Produce <strong>more PLS dictionaries</strong> that can be imported into simon. I am planning to explain the development steps in this blog. It might be a little off-topic, but I think it is important to inform the people. This means that I will provide the reader with details of dictionary development / improvement.</p>
<p>I want people to understand how to handle the development / improvement of PLS dictionaries.</p>
<p>2. I want to <strong>learn about the simon source code</strong>. How does simon work internally? I don&#8217;t need to understand every detail, but I would like to be able to understand what is going on &#8220;behind the scenes&#8221; (scenes = simon GUI; behind = simon source code). Where can I start?</p>
<p>The article <a href="http://www.kde.org/getinvolved/development/">Becoming a KDE developer</a> contains some useful links (e.g. I could type <code>qtdemo</code> into the Ubuntu terminal). Qt looks like a very interesting development framework. How can I get involved?</p>
<p>Let me give an example: I read the <a href="http://www.cplusplus.com/doc/tutorial/">C++ tutorial</a>. The first chapters were easy. Then suddenly, it became extremely difficult. What are pointers? What is an array? At least I know <a href="http://spirit.blau.in/cplusplus/2008/11/11/compiling-a-c-program-under-ubuntu/">how to compile a simple C++ program under Ubuntu</a>. That is a start.</p>
<p>Or I could read the simon source code that is available via Sourceforge. E.g. I could read <a href="http://speech2text.svn.sourceforge.net/viewvc/speech2text/trunk/simond/src/clientsocket.cpp?revision=1119&amp;view=markup"><code>clientsocket.cpp</code></a>. But I understand almost nothing. </p>
<p>3. Conclusion: It is a lot of work to focus on these targets. </p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2010/01/22/what-are-my-targets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Julius dictionary; PLS: role attribute</title>
		<link>http://spirit.blau.in/simon/2010/01/20/julius-dictionary-pls-role-attribute/</link>
		<comments>http://spirit.blau.in/simon/2010/01/20/julius-dictionary-pls-role-attribute/#comments</comments>
		<pubDate>Wed, 20 Jan 2010 11:32:37 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[Auslautverhärtung]]></category>
		<category><![CDATA[German]]></category>
		<category><![CDATA[kɛʀnçeːmɪʃən]]></category>
		<category><![CDATA[kɛʀndʊʀçmɛsɐ]]></category>
		<category><![CDATA[PLS]]></category>
		<category><![CDATA[ʃtʀɔmʃpaːʀən]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=2352</guid>
		<description><![CDATA[This blog post is about (A) Julius dictionary and (B) the import of a PLS dictionary. A. Obviously, it is possible to import a Julius dictionary: I didn&#8217;t know that this kind of dictionary existed. What are the properties of this format? And what are the advantages? B. I want to import Ralf's German dictionary [...]]]></description>
			<content:encoded><![CDATA[<p>This blog post is about (A) Julius dictionary and (B) the import of a PLS dictionary.</p>
<p>A. Obviously, it is possible to import a Julius dictionary:</p>
<p><a href="http://spirit.blau.in/simon/files/2010/01/julius-vocabulary.png"><img class="alignnone size-medium wp-image-2351" src="http://spirit.blau.in/simon/files/2010/01/julius-vocabulary-300x210.png" alt="julius-vocabulary" width="300" height="210" /></a></p>
<p>I didn&#8217;t know that this kind of dictionary existed. What are the properties of this format? And what are the advantages?</p>
<p>B. I want to import <a href="http://script.blau.in/german-dictionary.xml.bz2"><code>Ralf's German dictionary</code></a> (version 0.1.7; October 29, 2009). Great, simon now <a href="http://spirit.blau.in/simon/2009/10/24/ralfs-german-dictionary-version-016/#comment-162">recognizes the role attribute</a>:</p>
<p><a href="http://spirit.blau.in/simon/files/2010/01/adjektiv-substantiv.png"><img class="alignnone size-medium wp-image-2354" src="http://spirit.blau.in/simon/files/2010/01/adjektiv-substantiv-300x229.png" alt="adjektiv-substantiv" width="300" height="229" /></a></p>
<p>1. A few minutes ago, I imported <code>Ralf's German dictionary</code> into simon. I am offering <a href="http://spirit.blau.in/simon/2010/01/06/try-to-install-revision-1112-on-32-bit-ubuntu/#pls-dictionaries">27 PLS dictionaries</a> for 27 different languages. <a href="http://spirit.blau.in/simon/2010/01/08/tutorial-how-to-install-under-ubuntu/#pls-dictionary">Choose the dictionary that suits your native language</a>, and <a href="http://spirit.blau.in/simon/files/2010/01/select-pls-file.png">import it into simon</a>.</p>
<p>2. Let&#8217;s take a look into the shadow dictionary.</p>
<p>3. The word <code>kernchemischen</code> is an <code>Adjektiv</code>. Let&#8217;s take a look at the specific entry in <code>Ralf's German dictionary</code>:</p>
<blockquote><p>&lt;lexeme role=&#8221;Adjektiv&#8221;&gt;<br />
&lt;grapheme&gt;kernchemischen&lt;/grapheme&gt;<br />
&lt;phoneme&gt;kɛʀnçeːmɪʃən&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</p></blockquote>
<p>You can see that the <code>role</code> attribute which is part of the <a href="http://www.w3.org/TR/pronunciation-lexicon/#S4.4"><code>&lt;lexeme&gt;</code> element</a> was imported by simon. <a href="http://spirit.blau.in/simon/2009/10/24/ralfs-german-dictionary-version-016/#comment-164">Thanks</a> for implementing that feature.</p>
<p>4. The word <code>Kerndurchmesser</code> is a <code>Substantiv</code>. The corresponding entry in the PLS dictionary:</p>
<blockquote><p>&lt;lexeme role=&#8221;Substantiv&#8221;&gt;<br />
&lt;grapheme&gt;Kerndurchmesser&lt;/grapheme&gt;<br />
&lt;phoneme&gt;kɛʀndʊʀçmɛsɐ&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</p></blockquote>
<p>You can see the strength of the simon import process: The last two letters of <code>Kerndurchmess<strong>er</strong></code> correspond to a single phoneme in <code>kɛʀndʊʀçmɛs<strong>ɐ</strong></code>. Because such details are implemented, we can get a very good recognition rate, as I showed in the <a href="http://spirit.blau.in/simon/2009/12/27/video-recognize-200-german-words/">video with 200 German words</a>.</p>
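<p>To show how the <code>role</code> attribute can be used, here is a small Python sketch that groups the graphemes of a PLS lexicon by their terminal. The function name <code>words_by_role</code> is made up for this example, and the sample is built in memory from the two entries quoted above.</p>

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop a "{namespace}" prefix so namespaced PLS files also work.
    return tag.rsplit("}", 1)[-1]


def words_by_role(lexicon):
    """Map each role (terminal) to the graphemes that carry it."""
    groups = defaultdict(list)
    for element in lexicon.iter():
        if local_name(element.tag) != "lexeme":
            continue
        # Lexemes without a role are collected under the empty string.
        role = element.get("role", "")
        for child in element:
            if local_name(child.tag) == "grapheme":
                groups[role].append(child.text)
    return dict(groups)


def build_sample():
    # The two entries quoted in this post, built in memory.
    lexicon = ET.Element("lexicon", {"alphabet": "ipa"})
    entries = [
        ("Adjektiv", "kernchemischen", "kɛʀnçeːmɪʃən"),
        ("Substantiv", "Kerndurchmesser", "kɛʀndʊʀçmɛsɐ"),
    ]
    for role, grapheme, phoneme in entries:
        lexeme = ET.SubElement(lexicon, "lexeme", {"role": role})
        ET.SubElement(lexeme, "grapheme").text = grapheme
        ET.SubElement(lexeme, "phoneme").text = phoneme
    return lexicon


if __name__ == "__main__":
    print(words_by_role(build_sample()))
```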
<p>Why is <code>Ralf's German dictionary</code> good? Let me explain about the <strong>history of this dictionary</strong>:</p>
<p>a. The <a href="http://www.voxforge.org/home/forums/other-languages/german/building-a-dictionary-with-the-help-of-espeak#C0rNzf242hz0P8D2x7P_rg">initial steps</a> were done at Voxforge with the development of a German pronunciation dictionary. See for yourself: the script <a href="http://www.dev.voxforge.org/projects/de/wiki/espeak2Phones.pl">espeak2Phones.pl</a> is great because it transforms eSpeak&#8217;s cryptic ASCII output into SAMPA. This approach is good for the German language.</p>
<p>b. Later, we used the <a href="http://www.ling.uni-potsdam.de/~timo/projekte/voxforge.html">dictionary acquisition project</a> for the collection of about 8,000 pronunciations. Each single pronunciation was checked by a human. The <a href="http://de.wiktionary.org/wiki/Wiktionary:Lautschrift#Erkl.C3.A4rung_der_einzelnen_Laute">phoneme concept</a> follows the Wiktionary.</p>
<p>c. I used a German <a href="http://extensions.services.openoffice.org/dictionary">spelling dictionary from OpenOffice.org</a> to get more words for the dictionary (Ubuntu terminal command: <code>unmunch</code>). With eSpeak I created the phonemes. With an XSLT style-sheet (Ubuntu terminal command: <code>saxonb-xslt</code>) I transformed the eSpeak phonemes into IPA phonemes. And I used the XSLT style-sheet for inserting the <code>role</code> attribute (<code>Substantiv</code>, <code>Adjektiv</code>, <code>Zahlwort</code>).</p>
<p>d. The result is the current version of <code>Ralf's German dictionary</code>. It would be nice if someone <a href="http://spirit.blau.in/simon/2009/12/24/zufuhrungsdrahten-two-pronunciations/#volunteer">helped with the improvement</a>. The really difficult work has been done. But it is necessary to fine-tune the dictionary. Let me give you a concrete example:</p>
<blockquote><p>
&lt;lexeme&gt;<br />
&lt;grapheme&gt;stromsparen&lt;/grapheme&gt;<br />
&lt;phoneme&gt;ʃtʀɔm<strong>ʃ</strong>paːʀən&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;<br />
&lt;lexeme&gt;<br />
&lt;grapheme&gt;stromsparend&lt;/grapheme&gt;<br />
&lt;phoneme&gt;ʃtʀɔm<strong>s</strong>paːʀənt&lt;/phoneme&gt;<br />
&lt;/lexeme&gt;</p></blockquote>
<p>What is wrong or could be improved? First, the <code>role</code> attribute is missing. <code>stromsparen</code> is a <code>Verb</code>. <code>stromsparend</code> is an <code>Adverb</code>. It would be good if someone added the missing <code>role</code> attributes. Second, small phoneme corrections are necessary: <code>ʃtʀɔm<strong>ʃ</strong>paːʀən</code> is OK because you say &#8220;schtrom<strong>sch</strong>paren&#8221;. But <code>ʃtʀɔm<strong>s</strong>paːʀənt</code> is wrong because you don&#8217;t say &#8220;schtrom<strong>s</strong>parent&#8221;.</p>
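<p>Entries with a missing <code>role</code> attribute can be found automatically. The following Python sketch lists them; the function name <code>lexemes_without_role</code> is hypothetical, and an empty <code>role=""</code> is treated like a missing one, which is my assumption. The sample is built in memory from words mentioned in this post.</p>

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop a "{namespace}" prefix so namespaced PLS files also work.
    return tag.rsplit("}", 1)[-1]


def lexemes_without_role(lexicon):
    """Return the graphemes of all lexemes with a missing or empty role."""
    missing = []
    for element in lexicon.iter():
        if local_name(element.tag) != "lexeme":
            continue
        if element.get("role"):
            continue
        for child in element:
            if local_name(child.tag) == "grapheme":
                missing.append(child.text)
    return missing


def build_sample():
    # Two entries without a role and one entry with a role.
    lexicon = ET.Element("lexicon", {"alphabet": "ipa"})
    entries = [
        (None, "stromsparen"),
        (None, "stromsparend"),
        ("Substantiv", "Kerndurchmesser"),
    ]
    for role, grapheme in entries:
        attributes = {} if role is None else {"role": role}
        lexeme = ET.SubElement(lexicon, "lexeme", attributes)
        ET.SubElement(lexeme, "grapheme").text = grapheme
    return lexicon


if __name__ == "__main__":
    print(lexemes_without_role(build_sample()))
```

<p>A volunteer could work through such a list word by word and add the correct terminal with a text editor.</p>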
<p>You can see that improvements are necessary. Because <code>Ralf's German dictionary</code> is <a href="http://script.blau.in/etc/GPL_License">GPLv3</a>, everyone is permitted to improve it. </p>
<p>For good recognition results, things like &#8220;schtrom<strong>s</strong>parent&#8221; have to be fixed. It is possible that some dialects (e.g. <a href="http://en.wikipedia.org/wiki/Hamburg">Hamburg</a>) speak &#8220;<strong>s</strong>trom<strong>s</strong>paren&#8221; and not &#8220;<strong>sch</strong>trom<strong>sch</strong>paren&#8221;. I recommend that specific dialect dictionaries should be developed. This would be part of the fine-tuning, too. <code>Ralf's German dictionary</code> covers <a href="http://en.wikipedia.org/wiki/Standard_German">Standard German</a>. You can use my dictionary for the development of a dialect dictionary that can be used by people who prefer to dictate in their own specific dialect.</p>
<p>By the way, did you notice the following detail? <code>ʃtʀɔmspaːʀən<strong>t</strong></code> ends with a &#8220;t&#8221; and not with a &#8220;d&#8221; because of the &#8220;<a href="http://de.wikipedia.org/wiki/Auslautverh%C3%A4rtung">Auslautverhärtung</a>&#8221; (final-obstruent devoicing, which is part of German pronunciation). Such small details are implemented in the dictionary.</p>
<p>C. Conclusion: I know about the strengths of PLS, but I don&#8217;t know which advantages a Julius dictionary would have to offer.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2010/01/20/julius-dictionary-pls-role-attribute/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Benefitting from eSpeak</title>
		<link>http://spirit.blau.in/simon/2010/01/09/benefitting-from-espeak/</link>
		<comments>http://spirit.blau.in/simon/2010/01/09/benefitting-from-espeak/#comments</comments>
		<pubDate>Sat, 09 Jan 2010 09:47:19 +0000</pubDate>
		<dc:creator>producer</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[PLS]]></category>

		<guid isPermaLink="false">http://spirit.blau.in/simon/?p=2277</guid>
		<description><![CDATA[Is eSpeak good or bad? &#8220;Espeak with it amazingly bad speech synthesis quality and even more amazing popularity. Out-of-date synthesis method doesn&#8217;t let it be good with any possible modifications.&#8221; I used eSpeak for the creation of my 27 PLS dictionaries (the phonemes were created with the help of eSpeak). I found out that the [...]]]></description>
			<content:encoded><![CDATA[<p>Is eSpeak good or <a href="http://nshmyrev.blogspot.com/2010/01/greetings-and-random-thoughts.html">bad</a>?</p>
<blockquote><p>&#8220;<a href="http://espeak.sourceforge.net/">Espeak</a> with it amazingly bad speech synthesis quality and even more amazing popularity. Out-of-date synthesis method doesn&#8217;t let it be good with any possible modifications.&#8221;</p></blockquote>
<p>I used eSpeak for the creation of my 27 PLS dictionaries (the phonemes were created with the help of eSpeak). I found out that the phoneme quality for German isn&#8217;t that bad. It is usable for speech recognition after I made some adjustments with an XSLT style-sheet. </p>
<p>What about the other languages? To be honest: at the moment, I don&#8217;t care. I need the 27 PLS dictionaries mainly for promotion. It is necessary to involve more people in the development of an open source ASR solution.</p>
<p>A Polish native speaker <a href="http://www.voxforge.org/home/forums/message-boards/general-discussion/julius-htk-or-sphinx-for-mobile-phone#Rc5-WdEr6pI3_7Gr9peARA">wants to dictate in the Polish language</a>. Or another user <a href="http://www.voxforge.org/home/forums/other-languages-forum/general-discussion/vietnamese-support#SuTlbzZNHLBOxldMOaE7vw">wants to dictate in the Vietnamese language</a>. Or someone <a href="http://www.voxforge.org/home/forums/other-languages-forum/general-discussion/greek-support#EqIKsBbRJYP2d9yq0dAi7A">wants to dictate in the Greek language</a>. These people could take advantage of PLS dictionaries in their own languages.</p>
<p>This is what I want to do: Build a PLS dictionary in a Chinese language (e.g. <a href="http://www.voxforge.org/home/forums/other-languages-forum/general-discussion/please-add-cantonese#Am9oud__x-guXzpjm-sQeA">Cantonese</a> &#8211; eSpeak offers this language as <a href="http://espeak.sourceforge.net/languages.html">provisional language</a>). I need a GPL word list with Cantonese words. But I didn&#8217;t find one on the internet (the description of this word list should be in English because I don&#8217;t understand Cantonese).</p>
<p>Is eSpeak&#8217;s synthesis method out of date? I don&#8217;t know. At least eSpeak creates phonemes that I can implement in my PLS dictionaries. Is there a program available that produces better results than eSpeak? The program has to work out of the box. I can use eSpeak by simply typing &#8220;<code>espeak</code>&#8221; into the Ubuntu terminal. And eSpeak can interpret SSML mark-up. That worked fine for me.</p>
<p>My PLS dictionaries are in an early stage of development. It should be possible to increase the quality substantially with the help of some committed native speakers.</p>
<p>Things have to work. To be more precise: it should be possible for the user to import a PLS dictionary in his own native language into simon. I made a start by offering <a href="http://spirit.blau.in/simon/2010/01/06/try-to-install-revision-1112-on-32-bit-ubuntu/#pls-dictionaries">27 PLS dictionaries</a>. At the moment, I am thinking about whether I should offer many more PLS dictionaries. The problem is: I don&#8217;t know how I can create the phonemes for the specific language. I will find a work-around for this problem.</p>
<p>Which kind of phonemes should the PLS dictionary contain? There are several possibilities:<br />
- IPA phonemes (like in <code>Ralf's German dictionary</code>), advantage: can easily be edited by linguists;<br />
- eSpeak phonemes (like in <a href="http://script.blau.in/polish-dictionary.xml.bz2"><code>Ralf's Polish dictionary</code></a>), advantage: I didn&#8217;t introduce new errors by trying to convert them into IPA;<br />
- SAMPA phonemes (none of my dictionaries uses SAMPA), I don&#8217;t see any advantage at the moment.</p>
<p>In my opinion, good phoneme quality can be achieved by using IPA phonemes, because they are easy for linguists to read. So what can you learn from this post? If you are a native speaker of Vietnamese, Polish, Greek, you may want to take a closer look at Ralf&#8217;s <a href="http://script.blau.in/vietnamese-dictionary.xml.bz2">Vietnamese</a> / <a href="http://script.blau.in/polish-dictionary.xml.bz2">Polish</a> / <a href="http://script.blau.in/greek-dictionary.xml.bz2">Greek</a> dictionary, and think about what you can do to improve the quality.</p>
<p>What advantage do Ralf&#8217;s PLS dictionaries offer? They show you a way to make speech recognition work for your native language. As soon as you have a PLS dictionary with acceptable quality for your own language, you can think about using it for training with simon.</p>
<p>You can learn from my blog that you can import<br />
- <code>Ralf's Vietnamese dictionary</code>,<br />
- <code>Ralf's Polish dictionary</code>,<br />
- <code>Ralf's Greek dictionary</code><br />
into simon. So simon is the target application. If you improve the quality of the specific dictionary, there is a chance that it might work.</p>
<p>And there is another thing that I found out when building / importing each of these 27 PLS dictionaries: the dictionary size should be about 100,000 words (not 1 million words, not 10,000 words). Help is needed to implement a good compression algorithm (similar to the affix compression of OpenOffice.org spelling dictionaries, which <code>unmunch</code> expands).</p>
<p>The focus should be to<br />
- improve the quality of each PLS dictionary &#8211; native speakers should do that;<br />
- integrate an option into simon to <em>automatically download &amp; import</em> each of these PLS dictionaries into simon;<br />
- think about a good compression algorithm for PLS dictionaries (the reverse of what <code>unmunch</code> does) &#8211; languages like Spanish, Dutch, German, Latin need such a compression algorithm &#8211; not necessary for English.</p>
]]></content:encoded>
			<wfw:commentRss>http://spirit.blau.in/simon/2010/01/09/benefitting-from-espeak/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

