This article explains how I create the dictionary and what the imported result looks like in simon.
A. Creation of the PLS dictionary:
1. Get spelling dictionary.
2. License is GPL. It says in the file README_en.txt:
This spell check dictionary for Interlingua is licensed under GPL. [...] This hyphenation rules for Interlingua are licensed under GPL.
This means that I can use this spelling dictionary as source.
3. Extract dict-ia-2010-11-29.oxt.
4. ISO 639-1 language code is ia.
5. Probably I will use this table for grapheme to phoneme conversion.
6. Check the encoding of ia_iso.aff and ia_iso.dic. Both files are encoded in ISO 8859-1. Probably it is best if I convert the encoding of both files into UTF-8:
iconv -f ISO-8859-1 -t UTF-8 < ia_iso.dic > interlingua-utf8.dic
iconv -f ISO-8859-1 -t UTF-8 < ia_iso.aff > interlingua-utf8.aff
Change the first line in interlingua-utf8.aff into SET UTF-8. Both files contain CRLF at the end of each line (Windows mode). I don’t know whether this is OK with the unmunch command. I will check it out:
Obviously, it worked. The CRLF is part of the source files. The target file contains just a LF (Unix mode). There are a lot of duplicate entries. I think that these duplicate entries will be removed later by an .xsl script.
7. Add lexicon tags at the beginning and the end of interlingua-wordlist.
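The conversion in step 6 can be sketched as a small shell pipeline. This is only an illustration under assumptions: the helper names to_utf8_unix and dedupe are made up, the sample words are placeholders rather than entries from ia_iso.dic, and removing duplicates with sort -u is a stand-in for the .xsl script mentioned above, not the method I actually used.

```shell
# Hypothetical helper: convert an ISO 8859-1 file with CRLF line endings
# to UTF-8 with plain LF line endings.
to_utf8_unix() {
    # $1 = input file (ISO 8859-1, Windows line endings)
    iconv -f ISO-8859-1 -t UTF-8 < "$1" | tr -d '\r'
}

# Hypothetical helper: sort the lines and keep one copy of each.
dedupe() {
    LC_ALL=C sort -u
}

# Demo on a tiny made-up sample instead of the real ia_iso.dic.
workdir=$(mktemp -d)
# \351 is the ISO 8859-1 byte for "é"; the first and last entries are duplicates.
printf 'caf\351\r\nabbatia\r\ncaf\351\r\n' > "$workdir/sample.dic"
to_utf8_unix "$workdir/sample.dic" | dedupe > "$workdir/sample-utf8.txt"
cat "$workdir/sample-utf8.txt"
```

Note that sort -u changes the order of the entries, which should not matter for a word list that is converted into a dictionary afterwards.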
The left column contains the words. The pronunciation column contains the corresponding SAMPA transcriptions. The Category column contains just “Unknown” entries.
Now you know how I created the dictionary and what the result looks like in simon.
This article explains the creation of an Arabic PLS dictionary and what the result looks like in simon.
A. Creation of the dictionary:
1. Get Arabic spelling dictionary.
2. Check the license. Inside the file dict_ar-3.0.oxt there is a file with the name COPYING (in the docs folder). It says in the file:
GPL 2.0/LGPL 2.1/MPL 1.1 tri-license
This means that I can use this tri-licensed spelling dictionary as source for my future GPLv3 PLS dictionary.
3. Now I have to extract dict_ar-3.0.oxt.
4. Let’s try the unmunch command inside the Ubuntu terminal:
The left column contains 457089 Arabic words. The pronunciation column contains the corresponding SAMPA transcriptions. The third column contains just entries with “Unknown”. This is because the PLS dictionary contains no role attributes.
Now you know how I created the dictionary and what the result looks like in simon.
In 2009, I made some initial tests with Hebrew. Now it is time to develop a Hebrew PLS dictionary that is much bigger than the sample dictionary from 2009 (which I have deleted). This article explains how I create the dictionary and what the result looks like when imported into simon.
A. Creation of the dictionary:
1. Get Hebrew spelling dictionary from OpenOffice.org.
2. License is GPL. There is a copyright notice inside the file he_IL.aff.
3. I tried to unmunch the dictionary in the Ubuntu terminal, but unfortunately I failed:
4. The source file he_IL.dic contains a lot of numbers. I remove them with the Ubuntu terminal:
ubuntu@ubuntu:~/Documents/2011-II/Hebrew$ sed 's/[0-9]*//g' he_IL.dic > hebrew-without-numbers
With Geany, I remove the commas (“,”) and the slashes (“/”) that are still included in the file hebrew-without-numbers. Now I have a clean word list with 43,000 Hebrew words.
5. Add lexicon tags at the beginning and the end of hebrew-without-numbers.
6. Ubuntu terminal:
7. ISO 639-1 language code is he.
8. I need a table for grapheme to phoneme conversion. Maybe I will use this table. There are several tables available at Wikipedia. I am not sure which one I should use. I have an idea: as far as I know, Yiddish and Hebrew share the same alphabet. This means I could try to use the Yiddish improve-yiddish.xsl style sheet:
The result is that most Hebrew letters have been converted into IPA. There is only one Hebrew letter that hasn’t been converted: [א] I will add this phone to the .xsl style sheet with the name improve-hebrew.xsl. Now I try it again:
The result is not so good: Maybe I should adjust the grapheme to phoneme conversion rules for modern standard Israeli Hebrew. Or is this not necessary? I think for a first draft I can use the Yiddish transformation rules.
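The cleanup in step 4 (sed for the numbers, then Geany for the commas and slashes) could probably be done in one pass. The following is a minimal sketch under assumptions: strip_flags is a made-up helper name, the sample lines are placeholders in the word/number,number style, and deleting the lines that end up empty is an extra step that I did not describe above.

```shell
# Hypothetical helper: remove digits, commas and slashes from each line,
# then drop lines that are empty afterwards (for example a leading count line).
strip_flags() {
    sed -e 's|[0-9,/]||g' -e '/^$/d'
}

# Demo with made-up sample lines instead of the real he_IL.dic.
printf '%s\n' '3' 'שלום/12,34' 'ספר/7' | strip_flags
```

The bracket expression only lists ASCII characters, so the Hebrew letters themselves pass through unchanged.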
B. Download the dictionary. Import it into simon as shadow dictionary.
Take a look at the result: The left column contains 43933 Hebrew words. The pronunciation column contains the corresponding SAMPA transcriptions. The category column is unused (or to be more exact: it displays just Unknown) since the source PLS dictionary contains no role attributes.
Now you know how I created the dictionary and what the result looks like in simon. This dictionary uses more or less Yiddish pronunciation because I was too lazy to adjust it to modern standard Israeli Hebrew. It shouldn’t be a problem to adjust the style sheet improve-hebrew.xsl so that the phoneme results are better.
This article explains how I create this PLS dictionary and what the imported result looks like.
A. Creation of the Belarusian PLS dictionary:
1. Get spelling dictionary. I choose the official orthography.
2. License is LGPL (see hyph_be_BY.dic). I am allowed to “convert any LGPLed piece of software into a GPLed piece of software.” I did this before. And I will do it again. This means that I get a spelling dictionary that is licensed under the LGPL. And I will produce a pronunciation dictionary that is licensed under the GPLv3. By the way, all my dictionaries are licensed under the GPLv3.
3. Extract dict-be-official.oxt.
4. The file be-official.aff is encoded in UTF-8. The file be-official.dic may be encoded in ISO-8859-1. At least this encoding is displayed by Geany. I believe that be-official.dic is encoded in microsoft-cp1251. I had this encoding before (Macedonian and Bulgarian).
Now it is time to use the Ubuntu terminal:
cd /home/ubuntu/Documents/2011-II/Belarusian
iconv -f cp1251 -t UTF-8 <be-official.dic >belarusian-utf8.dic
The text file belarusian-utf8.dic looks fine.
5. Now I change the line SET microsoft-cp1251 in the file be-official.aff into SET UTF-8
6. I don’t know whether the next step is necessary. I could convert the file hyph_be_BY.dic from cp1251 into UTF-8. At the moment, I skip this step.
7. Ubuntu terminal:
unmunch belarusian-utf8.dic be-official.aff > belarusian-wordlist
I think that this step wasn’t necessary. It didn’t extract the word list. At the moment, I have a word list of 1.5 million words. This is way too much. I have to reduce the dictionary size. The target size is 400,000 words.
8. I found a tip for reducing the size. Ubuntu terminal:
sed -n 'p;N;N;N' belarusian-wordlist > belarusian-wordlist-reduced
Yes, it worked. The word list now contains 391,000 words. This is a good basis for a PLS dictionary.
9. Add lexicon elements at the beginning and the end of belarusian-wordlist-reduced.
10. Ubuntu terminal:
Let’s take a look at the result. The left column contains 391669 Belarusian words. The pronunciation column contains the corresponding SAMPA transcriptions. All entries in the third column are marked as “Unknown”. This is because the Belarusian PLS dictionary doesn’t contain any role attribute.
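A short note on why the sed command in step 8 shrinks the list to roughly a quarter: p prints the current line, and each N pulls the next line into the pattern space without printing it (because of -n). So lines 1, 5, 9 and so on survive. The sketch below shows this on ten numbered lines and adds an awk variant with a configurable step. The helper name keep_every_nth is made up for this illustration.

```shell
# The original command: print one line, then swallow the next three.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | sed -n 'p;N;N;N'

# Hypothetical helper: keep line 1 and then every n-th line after it.
keep_every_nth() {
    # $1 = step size, e.g. 4 keeps lines 1, 5, 9, ...
    awk -v n="$1" '(NR - 1) % n == 0'
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | keep_every_nth 4
```

Both commands print the lines 1, 5 and 9. With the awk variant, a different target size only needs a different step value.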
Now you know how I created the dictionary. And you got an impression of what the result looks like when imported into simon.
This article explains how I create the Asturian PLS dictionary and adds some words about the import into simon.
A. How I create the dictionary:
1. Get spelling dictionary.
2. Check license. It is GPLv3.
3. Extract asturianu.oxt.
4. Language code is ast.
5. Ubuntu terminal:
The result is a file of 70MB with more than 5 million words. This word list is too big. I should reduce it. I had the same problem with my Latin dictionary. I had to reduce the size.
6. Add lexicon elements at the beginning/end of asturian-wordlist.
7. Generate .xml document with lexicon, lexeme and grapheme elements:
I got an error message because the available space isn’t enough (“Java heap space”). I think that I should reduce the file size with grep. Or I install VisualVM. I think I will work with grep:
a. Remove lines that begin with l’: ubuntu@ubuntu:~/Documents/2011-II/Asturian/dictionaries$ grep -v ^l\' asturian-wordlist > asturian-wordlist-02
b. Remove lines that begin with t’: grep -v ^t\' asturian-wordlist-02 > asturian-wordlist-03
c. Remove lines that begin with s’: grep -v ^s\' asturian-wordlist-03 > asturian-wordlist-04
d. Remove lines that begin with m’: grep -v ^m\' asturian-wordlist-04 > asturian-wordlist-05
e. Remove lines that begin with n’: grep -v ^n\' asturian-wordlist-05 > asturian-wordlist-06
f. Remove lines that begin with d’: grep -v ^d\' asturian-wordlist-06 > asturian-wordlist-07
g. Remove lines that begin with qu’: grep -v ^qu\' asturian-wordlist-07 > asturian-wordlist-08
h. Remove lines that begin with p’: grep -v ^p\' asturian-wordlist-08 > asturian-wordlist-09
The dictionary will contain 1.1 million words. I think that that number is acceptable.
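The eight grep calls in steps a to h could probably be combined into a single pass with an extended regular expression. This is a sketch under assumptions: drop_elided_prefixes is a made-up helper name, the sample words are placeholders, and the pattern expects the plain ASCII apostrophe in the word list.

```shell
# Hypothetical helper: drop every line that starts with one of the
# elided prefixes l', t', s', m', n', d', qu' or p'.
drop_elided_prefixes() {
    grep -v -E "^(l|t|s|m|n|d|qu|p)'"
}

# Demo with made-up sample words instead of the real asturian-wordlist.
printf '%s\n' "l'agua" casa "qu'entama" "d'ello" pan | drop_elided_prefixes
```

On the real file this would be called as drop_elided_prefixes < asturian-wordlist > asturian-wordlist-09, which avoids the seven intermediate files.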
10. I tried to import the resulting dictionary into simon. Unfortunately, simon stopped responding after the import had finished. I assume that the dictionary is way too big. I have to reduce its size again.
a. Remove lines that contain 'l: grep -v \'l asturian-wordlist-09 > asturian-wordlist-10
b. Continue to reduce the size of the wordlist: grep -v ylu asturian-wordlist-10 > asturian-wordlist-11
c. This isn’t enough, I have to remove about 80,000 words: grep -v les asturian-wordlist-11 > asturian-wordlist-12
d. Remove 136,000 words: grep -v mos asturian-wordlist-12 > asturian-wordlist-13
e. Remove 67,000 words: grep -v los asturian-wordlist-13 > asturian-wordlist-14
f. Remove 265,000 words: grep -v es asturian-wordlist-14 > asturian-wordlist-15
You see it is a lot of work to get a dictionary size that is suitable for simon. At the moment, the word list contains 539,000 words. Is this number OK, or should I continue to reduce the size? I think that I will try it again. Again, I will create an .xml file:
Take a look at the result. In the left column, you can see the Asturian words. This dictionary contains 539928 words. The right column contains the corresponding SAMPA transcriptions.
You could see that it was a lot of work to reduce the size of the dictionary. At least, now it has a size that isn’t too big for simon.
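Before removing words with grep -v, it can help to count how many lines a pattern would hit. Note that a pattern like es matches every word that contains these letters anywhere, not only words that end in them. The following sketch shows the difference. The helper name count_matches and the sample words are made up for this illustration.

```shell
# Hypothetical helper: print how many lines of a word list match a pattern.
count_matches() {
    # $1 = grep pattern, $2 = word list file
    grep -c -e "$1" "$2"
}

# Demo with made-up sample words instead of the real asturian-wordlist-14.
workdir=$(mktemp -d)
printf '%s\n' casa mesa cantes estrella llibros > "$workdir/sample-wordlist"
echo "contains es: $(count_matches 'es' "$workdir/sample-wordlist")"
echo "ends in es:  $(count_matches 'es$' "$workdir/sample-wordlist")"
```

In the sample, three words contain es but only one ends in es. Anchoring the pattern with $ would be one way to remove only a certain ending.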
This article explains some details about the creation of the dictionary and what the result looks like in simon.
A. How I create Ralf's Yiddish dictionary:
1. Get spelling dictionary.
2. License is GPLv3.
3. Extract jidysz.net.ooo.spellchecker.oxt.
4. Ubuntu terminal: cd /home/ubuntu/Documents/2011-II/Yiddish/dictionaries
sudo apt-get install hunspell-tools
unmunch yi.dic yi.aff > yiddish-wordlist
5. Add <lexicon> at the beginning of yiddish-wordlist. Add </lexicon> at the end of this file.
6. Generate .xml document with lexicon, lexeme and grapheme elements:
Take a look at the result. The left column contains the Yiddish words. This dictionary contains 99980 words. The right column contains the corresponding SAMPA transcription. Yiddish is written in the Hebrew alphabet. The Hebrew alphabet is written from right to left. Obviously, the corresponding SAMPA transcriptions are written from left to right. This means that the phoneme order should be fine.
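Steps 5 and 6 wrap the word list in lexicon, lexeme and grapheme elements. I used a style sheet for this, so the following sed version is only a stand-in sketch to show the target structure. The helper name wordlist_to_lexicon is made up, and the attributes that a complete PLS lexicon element carries (version, namespace, alphabet, language) as well as the phoneme elements are left out here.

```shell
# Hypothetical helper: read one word per line on stdin and write a bare
# lexicon skeleton with one lexeme/grapheme pair per word to stdout.
wordlist_to_lexicon() {
    echo '<lexicon>'
    # Skip empty lines, escape the XML special characters, then wrap each
    # word. In the last expression "&" stands for the whole matched line.
    sed -e '/^$/d' \
        -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
        -e 's|.*|  <lexeme><grapheme>&</grapheme></lexeme>|'
    echo '</lexicon>'
}

# Demo with two sample words instead of the real yiddish-wordlist.
printf '%s\n' 'שבת' 'און' | wordlist_to_lexicon
```

The escaping step matters only if a word contains &, < or >, but without it a single such entry would make the whole .xml file invalid.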
There are a lot of other PLS dictionaries available. Find the PLS dictionary that suits your language.
4. Let’s try and check the option Also import unknown sentences. I don’t know whether this is a good decision. So let’s give it a try.
This is interesting: “words with more than one terminal” – is it now possible to use more than one entry for the role attribute? The current version of Schott’s German dictionary employs just one entry for each role attribute. The PLS standard allows more entries.
Please, download and extract Schott’s German utterances. This compressed folder 15000-german-utterances.zip contains a plain text file with more than 15000 utterances. I am the author of these utterances, and I have licensed them under the GPLv3. The utterances are designed to be used in conjunction with Schott’s German dictionary (former name: Ralf’s German dictionary). Every word that is included within Schott’s German utterances should be included in Schott’s German dictionary, too. I am 99% sure that this is the case, but I can’t guarantee it. If some words are missing in Schott’s German dictionary, please inform me, and I will include them within the next version of Schott’s German dictionary.
You can import my German utterances using simon’s Import Text option (copy & paste).
5. The import has been completed. There are a lot of lines that contain the Unknown terminal. Probably it would have been better if I hadn’t checked the option Also import unknown sentences in step 4.
6. Because simon stopped responding, I forced it to quit. I tried to start simon several times, but it wouldn’t start. Now several simon processes are displayed with zombie status. I was able to end these zombie / sleeping processes. But at the moment, it seems to be impossible to start simon again (a new zombie / sleeping process is created whenever I try to start simon).
Conclusion: I don’t recommend checking the option Also import unknown sentences. I tried the Import Grammar function before without checking this option. simon reacted normally, and everything seemed to be fine.
Yesterday, I installed Qwt 6, and then built simon 0.3.60. It was difficult, but in the end it worked out fine. And look, sam now offers an Export test result button (top right of the screenshot):
I want to export the following information: Filename, Expected result, Actual result, Recognition rate (below 50%). The resulting document should be a simple text file (or XML file or whatever). Is this possible with the current Export test result function of simon?
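To make the wish more concrete, here is a sketch of the filter I have in mind. It is purely hypothetical: I do not know the format that sam’s Export test result function writes, so the sketch assumes a tab-separated text file with the four columns filename, expected result, actual result and recognition rate in percent. The helper name below_rate and the sample rows are made up.

```shell
# Hypothetical helper: print the rows whose recognition rate (column 4)
# is below the given limit. Assumes tab-separated input without a header.
below_rate() {
    # $1 = limit in percent, input on stdin
    awk -F '\t' -v limit="$1" '$4 + 0 < limit + 0'
}

# Demo with made-up rows.
printf 'a.wav\tHaus\tHaus\t92.5\nb.wav\tBaum\tRaum\t41.0\n' | below_rate 50
```

In the sample only the row for b.wav is printed, because its rate is below 50.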
3. System > Administration > Synaptic Package Manager:
Remove the package “simon” (Mark for Complete Removal).
simon is now not visible any more in Synaptic. So obviously, it has been completely removed.
ubuntu@ubuntu:~/Documents/2011-II/speech2text$ ./build_ubuntu.sh
-- The C compiler identification is GNU
-- The CXX compiler identification is GNU
-- Check for working C compiler: /usr/bin/gcc
-- Check for working C compiler: /usr/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
CMake Error at cmake/FindZLIB.cmake:25 (MESSAGE):
Could not find ZLIB
Call Stack (most recent call first):
julius/libsent/CMakeLists.txt:3 (find_package)
-- Configuring incomplete, errors occurred!
touch: cannot touch `./julius/gramtools/mkdfa/mkfa-1.44-flex/*': No such file or directory
ubuntu@ubuntu:~/Documents/2011-II/speech2text$
6. Question: What do I have to do to get simon going from git repository?
ubuntu@ubuntu:~/Documents/2011-II/speech2text$ ./build_ubuntu.sh
-- Found Portaudio: /usr/lib/libportaudio.so
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so
-- Found Pthreads: /usr/lib/x86_64-linux-gnu/libpthread.so
-- Looking for Q_WS_X11
-- Looking for Q_WS_X11 - found
-- Looking for Q_WS_WIN
-- Looking for Q_WS_WIN - not found.
-- Looking for Q_WS_QWS
-- Looking for Q_WS_QWS - not found.
-- Looking for Q_WS_MAC
-- Looking for Q_WS_MAC - not found.
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Looking for XOpenDisplay in /usr/lib/x86_64-linux-gnu/libX11.so;/usr/lib/x86_64-linux-gnu/libXext.so;/usr/lib/x86_64-linux-gnu/libXau.so;/usr/lib/x86_64-linux-gnu/libXdmcp.so
-- Looking for XOpenDisplay in /usr/lib/x86_64-linux-gnu/libX11.so;/usr/lib/x86_64-linux-gnu/libXext.so;/usr/lib/x86_64-linux-gnu/libXau.so;/usr/lib/x86_64-linux-gnu/libXdmcp.so - found
-- Looking for gethostbyname
-- Looking for gethostbyname - found
-- Looking for connect
-- Looking for connect - found
-- Looking for remove
-- Looking for remove - found
-- Looking for shmat
-- Looking for shmat - found
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Looking for include files CMAKE_HAVE_PTHREAD_H
-- Looking for include files CMAKE_HAVE_PTHREAD_H - found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Looking for _POSIX_TIMERS
-- Looking for _POSIX_TIMERS - found
-- Found Automoc4: /usr/bin/automoc4
-- Found Perl: /usr/bin/perl
-- Found Phonon: /usr/include
-- Performing Test _OFFT_IS_64BIT
-- Performing Test _OFFT_IS_64BIT - Success
-- Performing Test HAVE_FPIE_SUPPORT
-- Performing Test HAVE_FPIE_SUPPORT - Success
-- Performing Test __KDE_HAVE_W_OVERLOADED_VIRTUAL
-- Performing Test __KDE_HAVE_W_OVERLOADED_VIRTUAL - Success
-- Performing Test __KDE_HAVE_GCC_VISIBILITY
-- Performing Test __KDE_HAVE_GCC_VISIBILITY - Success
-- Found KDE 4.6 include dir: /usr/include
-- Found KDE 4.6 library dir: /usr/lib
-- Found the KDE4 kconfig_compiler preprocessor: /usr/bin/kconfig_compiler
-- Found automoc4: /usr/bin/automoc4
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found libsamplerate: /usr/lib/libsamplerate.so
-- Found ALSA: /usr/lib/libasound.so
-- Enabling resample support
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Enabling simon scenario support.
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Could NOT find KdepimLibs (missing: KdepimLibs_CONFIG) (Required is at least version "4.5.60")
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
-- Found Qt-Version 4.7.2 (using /usr/bin/qmake)
-- Found X11: /usr/lib/x86_64-linux-gnu/libX11.so
CMake Error at cmake/FindQwt6.cmake:101 (MESSAGE):
Could not find Qwt 6.x
Call Stack (most recent call first):
sam/src/CMakeLists.txt:1 (find_package)
-- Configuring incomplete, errors occurred!
make: *** No targets specified and no makefile found. Stop.
ubuntu@ubuntu:~/Documents/2011-II/speech2text$
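In long configure logs like the two above, the lines that matter are easy to overlook. The following sketch pulls out every “Could not find” message so that the missing development packages can be installed one by one (on Ubuntu, the zlib headers come with the package zlib1g-dev, for example). The helper name missing_deps is made up, and the sample input is a shortened copy of the messages shown above.

```shell
# Hypothetical helper: read CMake output on stdin and print each
# "Could not find ..." message once, without the leading dashes or blanks.
missing_deps() {
    grep -i 'could not find' | sed 's/^[^A-Za-z]*//' | LC_ALL=C sort -u
}

# Demo with a shortened sample of the log lines above.
printf '%s\n' \
    '-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so' \
    '-- Could NOT find KdepimLibs (missing: KdepimLibs_CONFIG)' \
    '  Could not find Qwt 6.x' \
    '-- Configuring incomplete, errors occurred!' | missing_deps
```

On a real build, the output of build_ubuntu.sh would have to be saved to a file or piped into the helper, including the error stream.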
1. Open the file de-dv. It contains a list of 1000 words (only the phonetic transcriptions).
2. Start Audacity.
3. Now I read every word in the list. Between each word, there is a pause of 1-2 seconds. Later, Audacity will find the pauses automatically.
4. Mark the whole recording with a double-click.
5. Then select Analyze > Sound Finder…
6. Set the Label starting point to 1.0. Set the Label ending point to 1.0. Why? Because simon has to know the amount of background noise.
7. Let’s eliminate the error at position 237. There was some noise (above 26 dB, I think) at position 237, and not a word. Mark the area with the mouse. Then press the Silence button (see top right of the picture).
8. Let’s have a look at the text file de-dv, and at the audio file. The text file ends with line 841. The Audacity audio file ends with number 841. Both files correspond to each other.
9. Select Audacity > File > Export Labels… The position (starting point and ending point) of each label will be exported into a simple text file. Export the labels to a file named labels.txt.
10. Open labels.txt with Geany. The file labels.txt ends at line 841. The first number of each line indicates the label starting point. The second number indicates the label ending point. The third number indicates the label itself.
You can see in the picture that de-dv is open, too. Both files – labels.txt and de-dv – have a length of exactly 841 lines.
11. Geany > Search > Replace.
Search for: \t\w+$ (\t means tab; \w means an alphanumeric character; + means “one or more of the preceding element”; $ means end of line). Leave the Replace with field empty.
Don’t forget to mark Use regular expressions.
This procedure removes the third number from each line.
12. You can see that the third column has been removed thanks to the regular expression procedure.
13. Now it is time to merge both files: labels.txt should be merged with de-dv. This is done via the paste command in the Ubuntu terminal:
14. You can see that the document pasted.txt has a third column: The labels are the phonetic transcriptions!
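Steps 10 to 14 can also be sketched without Geany. This is an illustration under assumptions, not the exact command I ran: merge_labels is a made-up helper name, cut -f 1,2 replaces the regular expression from step 11, and the sample times and transcriptions are placeholders. The helper first checks that both files have the same number of lines, because paste would otherwise silently produce shifted or incomplete rows.

```shell
# Hypothetical helper: replace the third column of an Audacity label file
# with the lines of a transcription list.
merge_labels() {
    # $1 = label export (start<TAB>end<TAB>number), $2 = one transcription per line
    labels_count=$(wc -l < "$1" | tr -d ' ')
    words_count=$(wc -l < "$2" | tr -d ' ')
    if [ "$labels_count" -ne "$words_count" ]; then
        echo "line counts differ: $labels_count vs $words_count" >&2
        return 1
    fi
    # Keep start and end time, then append the transcription as column 3.
    cut -f 1,2 "$1" | paste - "$2"
}

# Demo with two made-up labels instead of the real 841 lines.
workdir=$(mktemp -d)
printf '1.000000\t1.800000\t1\n3.100000\t3.900000\t2\n' > "$workdir/labels.txt"
printf 'haUs\nbaUm\n' > "$workdir/de-dv"
merge_labels "$workdir/labels.txt" "$workdir/de-dv" > "$workdir/pasted.txt"
cat "$workdir/pasted.txt"
```

The resulting pasted.txt has the same tab-separated layout as the original label file, so Audacity should be able to import it as labels again.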
15. Now let’s go back to Audacity > File > Import > Labels… Take a look at the result. Each label is a phonetic transcription of the corresponding recording.
16. Audacity > File > Export Multiple…
Export format: FLAC files
Export location: /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/flac-dedv
Split files based on: Labels
Name files: Using Label/Track Name
Press the Export button.
The result is a PLS dictionary at the following location: file:///media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/lexicon-dedv.xml
19. Now I need a prompts file. This is generated, too, via Ubuntu terminal:
20. Now it is time to upload the package file:///media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedv/german-ipa-flac-files-dedv-20110727.tar.bz2 to Voxforge.
21. Delete file:///home/ubuntu/.kde/share/apps/simon and file:///home/ubuntu/.kde/share/apps/simond.
22. Start simon.
23. I am skipping the next steps. Please read my article German speech model ‘dedq’ to get more details.
24. And now it is time to watch the video about this speech model:
Type into the terminal “setxkbmap de” – and then start simon and ksimond.
By the way, my simon version is 0.3.0-1ubuntu8. I installed it a few months ago using this approach.
Yes, the German special characters are displayed correctly. I just dictated: “Mühlingen Müllern Mörike Mörtelgerüche Mühelosigkeit ” – great, it is working now.
This article shows (A.) how I create the German speech model ‘dedq’, and (B.) how I dictate using this speech model.
A. Creation of the German speech model ‘dedq’
1. Delete file:///home/ubuntu/.kde/share/apps/simon and file:///home/ubuntu/.kde/share/apps/simond
2. Start simon.
3. Click the Vocabulary button. Press the Import Dictionary button. Select Target: Active Dictionary. Type of dictionary: PLS dictionary. The location of the PLS dictionary on my computer is as follows: /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedq/german-ipa-flac-files-dedq-20110726/lexicon-dedq.xml You can find this dictionary file at Voxforge (37 MB; you can extract the dictionary file). Or here is an easier way: you can download the file dedq.xml (right click; Save Link as). The file dedq.xml is a valid PLS dictionary that you can import into simon.
5. Click the Training button. Import trainingsdata. Import prompts:
- Prompts: /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedq/german-ipa-flac-files-dedq-20110726/prompts-dedq
- Base directory: /media/104d991d-2062-40d7-89f6-ddde3cb5b781/home/ubuntu/Documents/2011-i/german-0.2.5/object/split/dedq/german-ipa-flac-files-dedq-20110726/flac-dedq
You can get both files from Voxforge (see link above). Please keep in mind that some post-processing has to be enabled.
Go to Settings > Configure simon… > Recordings > Post Processing You can see the post processing command that causes sox to convert the FLAC files to WAV format.
6. Press the Commands button. Manage Plug-ins > Add > Dictation > Dictation > Append text after result: ” ” (enter just a single space, then press the OK button).
7. Start ksimond. simon > Connect button. simon now starts with the compilation of the speech model. Let’s dictate a few words: “M;nchsfisch Mnchskopf Mhe Mllerin Mllerthal Mllschlucker” The German ö and ü aren’t displayed. Press the Activated button to stop the recognition.
You now know how I created the German speech model ‘dedq’.
B. The following video demonstrates the dictation / recognition process. Unfortunately, Youtube limits the length to 15 minutes. Take a look at the German speech model ‘dedq’ in action:
These are the words that were recognized in the video: