Ralf’s Swedish speech model

Thursday, May 17th, 2012

Some words about this speech model:

1. Download Ralf’s Swedish dictionary. Create a scenario. Clear the shadow vocabulary. Import the dictionary as shadow dictionary.
2. Select ten Swedish words for training: anfalla, geologi, gestalta, getskinnets, inknådad, nämnda, tiotal, tiokamps, tingshus, tingat
3. Grammar: Unknown. Dictation plugin. Synchronize. Activate. Dictate: anfalla geologi gestalta nämnda inknådad nämnda nämnda tingshus tiokamps tiotal
4. Export scenario and base model.
5. Get Ralf’s Swedish speech model.

Ralf’s Swedish dictionary 0.1.1

Sunday, May 23rd, 2010

How I improve Ralf's Swedish dictionary:

1. Version 0.1 contains eSpeak phonemes. They should be converted into IPA phonemes.

2. Language code is sv.

3. Convert eSpeak phonemes into IPA phonemes:

$ bunzip2 -c '/media/5f6432a3-9a68-45ee-b4b7-11f3b009825a/home/am3msi/Documents/200911/swedish/dictionaries/swedish-dictionary.xml.bz2' | saxonb-xslt -ext:on -s:- -xsl:'/home/ubuntu/Documents/201005/dict-phonemes-espeak2ipa/espeak2ipa.xsl'
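
To picture what such a conversion does, here is a minimal sketch. It is not the espeak2ipa.xsl stylesheet from the command above (its contents are not shown in this post). The function name espeak_subset_to_ipa is hypothetical, and the mapping covers only a handful of general eSpeak mnemonics (stress marks, the length mark, and three ASCII phoneme symbols). A real conversion for Swedish needs many more rules.

```shell
# Hypothetical helper, NOT the espeak2ipa.xsl stylesheet used above.
# Reads eSpeak phoneme strings on stdin and writes IPA on stdout.
# Only a small, illustrative subset of mnemonics is handled (assumption:
# these follow eSpeak's general ASCII conventions).
espeak_subset_to_ipa() {
    sed -e "s/'/ˈ/g" \
        -e 's/,/ˌ/g' \
        -e 's/:/ː/g' \
        -e 's/@/ə/g' \
        -e 's/N/ŋ/g' \
        -e 's/S/ʃ/g'
}
```

The function should only be fed the phoneme strings themselves, not a whole PLS file, because it would also rewrite commas and colons in any other text.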

4. Download Ralf's Swedish dictionary 0.1.1, and import it into simon. The PLS dictionary contains 398964 words.

Ralf’s Swedish (Finland) dictionary

Thursday, May 6th, 2010

How I create Ralf's Swedish (Finland) dictionary:

1. Get the spelling dictionary. Its license is GPL.

2. The phonology is essentially the same as that of Central Swedish, apart from slightly different vowel qualities. The phoneme /ʉ/ is more centralized and pronounced like [ʉ], quite similar to the American English pronunciation of /u/ (as in moon). Compare this with the Central Swedish [ʉ̟], which is very close to the short vowel [ʏ] and is more rounded.

This means that I can use eSpeak (sv) for phoneme generation.

3. The encoding of the files sv_FI.dic and sv_FI.aff is ISO-8859-1. I have to convert both files to UTF-8.

4. Convert sv_FI.dic:

ubuntu@ubuntu-desktop:~/Documents/201005/swedish-finland/dictionaries$ iconv -f ISO8859-1 -t UTF-8 < sv_FI.dic > swedish-finland.dic

5. Convert sv_FI.aff:

ubuntu@ubuntu-desktop:~/Documents/201005/swedish-finland/dictionaries$ iconv -f ISO8859-1 -t UTF-8 < sv_FI.aff > swedish-finland.aff

6. Change the first line of the file swedish-finland.aff from SET ISO8859-1 to SET UTF-8.
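
Steps 4 to 6 can also be sketched as one small shell function. This is only a sketch under assumptions: the function name to_utf8 is hypothetical, and it assumes the SET declaration is on the first line of the .aff file, as described in step 6.

```shell
# Hypothetical helper: convert one Hunspell file from ISO-8859-1 to UTF-8.
# For .aff files, the SET declaration on the first line is rewritten as
# well, so that it matches the new encoding.
to_utf8() {
    src=$1
    dst=$2
    iconv -f ISO-8859-1 -t UTF-8 < "$src" > "$dst" || return 1
    case $dst in
        *.aff)
            # Edit line 1 only; -i.bak keeps a backup copy next to the file.
            sed -i.bak '1s/^SET ISO8859-1/SET UTF-8/' "$dst"
            ;;
    esac
}
```

With this, the two conversions would be "to_utf8 sv_FI.dic swedish-finland.dic" and "to_utf8 sv_FI.aff swedish-finland.aff". The second call leaves a backup file swedish-finland.aff.bak that can be deleted afterwards.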

7. Generate the word list in the Ubuntu terminal:

ubuntu@ubuntu-desktop:~/Documents/201005/swedish-finland/dictionaries$ unmunch swedish-finland.dic swedish-finland.aff > swedish-finland

The file swedish-finland contains 400,000 Swedish (Finland) words. I consider 400,000 words the optimal size for a dictionary, provided the language offers that many words (for comparison, English offers only 140,000 words).

8. I have to prepare swedish-finland for eSpeak:
8.a. Replace "\n" with "</audio>\n<audio>" in Geany (with "Use escape sequences" enabled).
8.b. Add <speak> tags (<speak> at the beginning of the file; </speak> at the end of the file).
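
The same markup can be produced without an editor. A minimal sketch, assuming a plain word list with one word per line; the function name wrap_for_espeak is hypothetical. Unlike the hand edit, it writes <speak> and </speak> on lines of their own, so the result has two more lines than the word list.

```shell
# Hypothetical helper: read one word per line on stdin and write the
# markup from step 8 to stdout. Each word gets its own <audio> element,
# and the whole file is enclosed in <speak> tags.
wrap_for_espeak() {
    awk 'BEGIN { print "<speak>" }
         NF    { print "<audio>" $0 "</audio>" }   # empty lines are skipped
         END   { print "</speak>" }'
}
```

It reads from standard input, so it would be called as "wrap_for_espeak < wordlist > wordlist-ssml", where both file names are placeholders.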

9. Generate phonemes:

ubuntu@ubuntu-desktop:~/Documents/201005/swedish-finland/dictionaries$ espeak -f swedish-finland -m -v sv -q -x --phonout="swedish-finland-espeak"

Open the file swedish-finland-espeak with Geany. Replace the sequence "\n\n " (backslash-n, backslash-n, space) with the sequence "</phoneme>\n<phoneme>" (with "Use escape sequences" enabled). Add <lexicon> tags at the beginning (<lexicon>) and at the end (</lexicon>).

10. The files swedish-finland and swedish-finland-espeak have the same number of lines. This means that I can combine both files line by line with paste:

ubuntu@ubuntu-desktop:~/Documents/201005/swedish-finland/dictionaries$ paste swedish-finland swedish-finland-espeak > swedish-finland-dictionary.xml
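
Because paste pairs lines purely by position, it is worth checking the line counts first. A minimal sketch; the function name paste_checked is hypothetical.

```shell
# Hypothetical helper: paste two files side by side, but only if both
# have the same number of lines, so that no word ends up next to the
# phonemes of a different word.
paste_checked() {
    words=$1
    phonemes=$2
    # $(( ... )) strips the padding that some wc implementations print.
    n_words=$(( $(wc -l < "$words") ))
    n_phonemes=$(( $(wc -l < "$phonemes") ))
    if [ "$n_words" -ne "$n_phonemes" ]; then
        echo "line count mismatch: $n_words vs $n_phonemes" >&2
        return 1
    fi
    paste "$words" "$phonemes"
}
```

The call "paste_checked swedish-finland swedish-finland-espeak > swedish-finland-dictionary.xml" would then either write the combined file or stop with an error message.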

11. Edit swedish-finland-dictionary.xml with Geany so that it will become a valid PLS dictionary.
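
As an alternative to hand editing, a PLS file can be generated from a tab-separated list. This is a sketch under several assumptions: the two columns hold bare text (a word and its phonemes) without the tags inserted in steps 8 and 9, the phonemes are eSpeak mnemonics (hence the made-up vendor alphabet name "x-espeak"), and the language code is "sv". The function name tsv_to_pls is hypothetical.

```shell
# Hypothetical helper: turn "word<TAB>phonemes" lines on stdin into a
# PLS 1.0 lexicon on stdout.
# Assumptions: bare text in both columns, eSpeak mnemonics in column 2
# (alphabet "x-espeak" is an invented name), language code "sv".
tsv_to_pls() {
    awk -F '\t' '
        function esc(s) {            # escape XML special characters
            gsub(/&/, "\\&amp;", s)
            gsub(/</, "\\&lt;", s)
            gsub(/>/, "\\&gt;", s)
            return s
        }
        BEGIN {
            print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            print "<lexicon version=\"1.0\" xmlns=\"http://www.w3.org/2005/01/pronunciation-lexicon\" alphabet=\"x-espeak\" xml:lang=\"sv\">"
        }
        NF >= 2 {                    # lines without a second column are skipped
            print "  <lexeme><grapheme>" esc($1) "</grapheme><phoneme>" esc($2) "</phoneme></lexeme>"
        }
        END { print "</lexicon>" }'
}
```

Each input line becomes one lexeme element with a grapheme and a phoneme child, which is the basic structure a PLS dictionary needs.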

12. Download Ralf's Swedish (Finland) dictionary, and import it into simon.

Import 380,000 Swedish words

Friday, November 13th, 2009

You can import Ralf's Swedish dictionary (version 0.1; GPLv3) into simon. Training with this dictionary is not possible, because the phoneme elements contain eSpeak phonemes rather than IPA phonemes.

On my computer, the import of this PLS dictionary took three minutes.

Edit May 2010: License of the source dictionary is GPL.