Archive for the ‘General’ Category

What are my targets?

Friday, January 22nd, 2010

I have two targets:

1. Produce more PLS dictionaries that can be imported into simon. I am planning to explain the development steps in this blog. It might be a little off-topic, but I think it is important to inform the people. This means that I will provide the reader with details of dictionary development / improvement.

I want people to understand how to handle the development / improvement of PLS dictionaries.

2. I want to learn about the simon source code. How does simon work internally? I don’t need to understand every detail, but I would like to be able to understand what is going on “behind the scenes” (scenes = simon GUI; behind = simon source code). Where can I start?

The article Becoming a KDE developer contains some useful links (e.g. I could type qtdemo into the Ubuntu terminal). Qt looks like a very interesting development framework. How can I get involved?

Let me give an example: I read the C++ tutorial. The first chapters were easy. Then suddenly, it became extremely difficult. What are pointers? What is an array? At least I know how to compile a simple C++ program under Ubuntu. That is a start.

Or I could read the simon source code that is available via Sourceforge. E.g. I could read clientsocket.cpp. But I understand almost nothing.

Conclusion: It is a lot of work to focus on these two targets.

The user wants a GUI

Sunday, January 10th, 2010

DNS 9.5 Preferred is the best speech recognition software that I have used (I didn’t buy DNS 10 because I want to migrate to Ubuntu). Here are some interesting questions (all quotes in this article are from this source):

“What would be the approach to get either Sphinx, Julius or HTK to be as accurate as Dragon Naturally Speaking preferred or pro version in English (USA and or UK)?”

The main problem is not accuracy. The main problem is usability / user-friendliness. The user wants a GUI so that he doesn’t have to use the Ubuntu terminal. The average user would invest 20 minutes in installing some speech recognition software, and then he wants to start dictating. The computer has to react (= simulate key strokes) when the user speaks words into his microphone.

“I assume that a much larger speech corpus than what VoxForge currently has would be required”

How big should the speech corpus be? My personal experience: it is possible to dictate 200 words after I have trained each word 3 times (200*3=600 recorded words). Probably, in the end we don’t need a very large speech corpus. But who knows? I think that wav recordings are needed, and the pronunciation of each recorded word has to correspond with the phonemes that are used in the specific PLS dictionary. When speaking into the microphone, you have to hit every single phoneme. Then you can get high accuracy. So for training and for recognition it is necessary to be as exact as possible (= don’t omit any phoneme).

“a much larger dictionary”

VoxForgeDict is large enough. It would be nice if someone transformed this dictionary into PLS format. This would have several advantages:
- More competition. cmudict (from which VoxForgeDict is obviously derived) could use a competing dictionary. Ralf's English dictionary only serves as a dummy dictionary; it can’t compete because its quality is too low. It would be better to transform VoxForgeDict into PLS format (converting the Arpabet phonemes into IPA phonemes). The result would be a high-quality dictionary that could be developed independently of VoxForgeDict.
- Convert the <grapheme> elements from upper-case to lower-case.
- Add a role attribute to each <lexeme> element. With this feature we might get better recognition results because the content of each role attribute could be used by simon for terminal information (= verb, noun, adjective).
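The three steps above (Arpabet to IPA, lower-case graphemes, a role attribute) can be sketched in a few lines of Python. This is only a rough illustration under several assumptions: the input line is written in cmudict style (word followed by Arpabet symbols with stress digits), the mapping table is a small hand-picked and approximate subset rather than a complete phoneme inventory, and the role value "verb" and the function names are made up for this example.

```python
import re
from xml.sax.saxutils import escape

# Partial, approximate Arpabet -> IPA table (illustrative subset only).
ARPABET_TO_IPA = {
    "AH": "ʌ", "EH": "ɛ", "IH": "ɪ", "IY": "i",
    "G": "g", "L": "l", "NG": "ŋ", "R": "r", "T": "t",
}

def arpabet_to_ipa(phones):
    """Map Arpabet symbols to IPA, dropping stress digits such as the 1 in EH1."""
    ipa = []
    for phone in phones:
        base = re.sub(r"\d", "", phone)
        ipa.append(ARPABET_TO_IPA[base])  # a KeyError points at a gap in the table
    return "".join(ipa)

def dict_line_to_lexeme(line, role=None):
    """Turn one 'WORD PH1 PH2 ...' line into a PLS <lexeme> element (as text)."""
    word, *phones = line.split()
    grapheme = escape(word.lower())  # upper-case entry -> lower-case grapheme
    role_attr = f' role="{role}"' if role else ""
    return (f"<lexeme{role_attr}><grapheme>{grapheme}</grapheme>"
            f"<phoneme>{arpabet_to_ipa(phones)}</phoneme></lexeme>")

if __name__ == "__main__":
    print(dict_line_to_lexeme("REGRETTING R IH0 G R EH1 T IH0 NG", role="verb"))
```

A real conversion would need the complete phoneme table and a decision about how to represent stress in IPA, which this sketch simply throws away.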

So the dictionary size is not the main issue. Of course, we need more PLS dictionaries for a lot of other languages.

“what does it take to get speaker independent accuracy similar to Dragon’s?”

I would say that we need five years to be as good as they are today.

To sum it up: simon is currently the best approach because it offers a GUI. It is necessary to implement existing algorithms and technologies. I think that Qt/C++ is the right approach. HTK and Julius seem to work perfectly. But it is very difficult for the user to use the Ubuntu terminal. It is much easier to have a GUI.

The existing technology (HTK, Julius) is obviously good enough. We need a GUI. That is the great thing about simon because it offers such a GUI.

Benefitting from eSpeak

Saturday, January 9th, 2010

Is eSpeak good or bad?

“Espeak with it amazingly bad speech synthesis quality and even more amazing popularity. Out-of-date synthesis method doesn’t let it be good with any possible modifications.”

I used eSpeak for the creation of my 27 PLS dictionaries (the phonemes were created with the help of eSpeak). I found out that the phoneme quality for German isn’t that bad. It is usable for speech recognition after I made some adjustments with an XSLT style-sheet.

What about the other languages? To be honest: at the moment, I don’t care. I need the 27 PLS dictionaries mainly for propaganda. It is necessary to involve more people in the development of an open source ASR solution.

A Polish native speaker wants to dictate in the Polish language. Or another user wants to dictate in the Vietnamese language. Or someone wants to dictate in the Greek language. These people could take advantage of PLS dictionaries in their own languages.

This is what I want to do: Build a PLS dictionary in a Chinese language (e.g. Cantonese – eSpeak offers this language as a provisional language). I need a GPL word list with Cantonese words. But I didn’t find one on the internet (the description of this word list should be in English because I don’t understand Cantonese).

Is eSpeak’s synthesis method out of date? I don’t know. At least eSpeak creates phonemes that I can implement in my PLS dictionaries. Is there a program available that produces better results than eSpeak? The program has to work out of the box. I can use eSpeak by simply typing “espeak” into the Ubuntu terminal. And eSpeak can interpret SSML mark-up. That worked fine for me.

My PLS dictionaries are in an early state of development. It should be possible to increase the quality substantially with the help of some engaged native speakers.

Things have to work. To be more precise: it should be possible for the user to import a PLS dictionary in his own native language into simon. I made a start by offering 27 PLS dictionaries. At the moment, I am thinking about whether I should offer many more PLS dictionaries. The problem is: I don’t know how I can create the phonemes for the specific language. I will find a work-around for this problem.

Which kind of phonemes should the PLS dictionary contain? There are several possibilities:
- IPA phonemes (like in Ralf's German dictionary), advantage: can easily be edited by linguists;
- eSpeak phonemes (like in Ralf's Polish dictionary), advantage: I didn’t introduce new errors by trying to convert them into IPA;
- SAMPA phonemes (none of my dictionaries uses SAMPA), I don’t see any advantage at the moment.

In my opinion, a good phoneme quality can be achieved by using IPA phonemes, because IPA phonemes are easy for linguists to read. So what can you learn from this post? If you are a native speaker of Vietnamese, Polish, or Greek, you may want to take a closer look at Ralf’s Vietnamese / Polish / Greek dictionary, and think about what you can do to improve the quality.

What advantage do Ralf’s PLS dictionaries offer? They show you a way to make speech recognition work for your native language. As soon as you have a PLS dictionary with acceptable quality for your own language, you can think about using it for training with simon.

You can learn from my blog that you can import
- Ralf's Vietnamese dictionary,
- Ralf's Polish dictionary,
- Ralf's Greek dictionary
into simon. So simon is the target application. If you improve the quality of the specific dictionary, there is a chance that it might work.

And there is another thing that I found out when building / importing each of these 27 PLS dictionaries: the dictionary size should be about 100,000 words (not 1 million words, not 10,000 words). Help is needed to implement a good compression algorithm (like unmunch for OpenOffice.org dictionaries).

The focus should be to
- improve the quality of each PLS dictionary – native speakers should do that;
- integrate an option into simon to automatically download & import each of these PLS dictionaries into simon;
- think about a good compression algorithm for PLS dictionaries (like unmunch) – languages like Spanish, Dutch, German, Latin need such a compression algorithm – not necessary for English.
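To make the compression idea more concrete: as far as I understand it, unmunch expands a compact list of stems plus affix rules into the full list of word forms, so the stored dictionary can stay small. The Python sketch below applies the same idea to pronunciation entries. It is a simplified assumption of how this could look and not the real OpenOffice.org affix format; the Spanish stem, the suffix table and the function name are chosen only for illustration, and stress and sound changes at the stem boundary are ignored.

```python
def expand_entry(stem, stem_phonemes, suffixes):
    """Expand one stored stem into (word, phonemes) pairs, one per suffix."""
    return [(stem + spelling, stem_phonemes + phonemes)
            for spelling, phonemes in suffixes]

# Hypothetical suffix class for a few present-tense forms of Spanish -ar verbs.
# Each suffix carries its spelling and its (simplified) phonemes.
SUFFIXES_AR = [("o", "o"), ("as", "as"), ("a", "a"), ("amos", "amos")]

# One stored entry ("habl" + suffix class) stands for four dictionary entries.
forms = expand_entry("habl", "abl", SUFFIXES_AR)

if __name__ == "__main__":
    for word, phonemes in forms:
        print(word, phonemes)
```

For languages with many inflected forms, storing stems and suffix classes instead of every single form is what would keep the dictionary close to the 100,000-word range.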

Thoughts about comments in this blog

Sunday, January 3rd, 2010

I want to answer some comments that were made in this blog, and add some further thoughts:

1. comment: OK, I understand how to change REGRETTING to lower-case. But my goal is a speech model with 1000 words. And it would be too much work to change that manually. Of course, this is not a simon issue. It is a dictionary issue that should be fixed by someone.

2. comment: I don’t know where to find khotnewstuff3 and libattaca. You could help me by updating the requirements section in the simon wiki. I need to know which packages I should install with sudo apt-get install. These are the answers in the Ubuntu terminal:

$ khotnewstuff3
No command 'khotnewstuff3' found, did you mean:
Command 'khotnewstuff' from package 'kdelibs4c2a' (main)

So I assume that I have to install khotnewstuff.

$ libattaca
libattaca: command not found

I don’t know where to find this package (even Google doesn’t help). Obviously, this command / package doesn’t exist.

Maybe I should try the following:
sudo apt-get install khotnewstuff
./build_ubuntu.sh

I will try that later, not now.

Thanks for the info about trunk / scenarios / no-scenarios. I will continue to use trunk (and forget about scenarios).

3. comment: Yes, I figured out on my own that I can delete both folders: /home/ubuntu-64bit/.kde/share/apps/simon and /home/ubuntu-64bit/.kde/share/apps/simond. But I couldn’t restore my old model (with >200 German words) even though I have all necessary files available (53MB; link will become invalid soon). A tutorial that explains each step needed to restore an old model (one that was backed up while it was working) would be helpful. I tried a lot of things, but something always went wrong.

Let me give an example: My copy of Dragon NaturallySpeaking Preferred offers export/import functionality. This is great. I can be sure that it is possible to restore my own adapted speech model.

And what about simon? At the moment, simon only offers a dysfunctional Import Trainingsdata function. This is a very weak point because I want to train a model with 1000 German words. I also want to use sam to find poor wav files (and to sort them out). sam seems to be great for finding/fixing wav issues. But the combination of simon and sam has to work, which unfortunately isn’t the case. So if something goes wrong, I want to be able to restore my 580 German wav files (and there will be many more wav files in the future).

simon is now OK for me for recording because my USB sound card now works with simon at 16,000 Hz.

Backup/restore with simon should become much easier than it is now. If you have a vocabulary of just 10 words, you don’t need backup/restore because you can start from scratch without losing much time. But when you have a bigger vocabulary (>200 words), you don’t want to start from scratch. There has to be some way to make it work.

In my opinion, a future version of simon should have an Export speech model button. The export file could be a .tar.gz file that contains all files that are included in /home/ubuntu-64bit/.kde/share/apps/simon/model. And with an Import speech model button this .tar.gz file could be imported. This would make things so much easier.
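What such an export/import pair would do behind the buttons can be sketched with Python's tarfile module. This is only an illustration of the idea and not simon code: the function names are invented, the demonstration works on a throwaway directory with a placeholder file instead of a real model folder, and I am assuming that copying the folder contents is enough to restore a model, which my own failed restore attempts suggest may not be the whole story.

```python
import tarfile
import tempfile
from pathlib import Path

def export_model(model_dir, archive_path):
    """Pack the whole model folder into one .tar.gz file."""
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the folder name, not the absolute path.
        tar.add(model_dir, arcname=model_dir.name)

def import_model(archive_path, target_parent):
    """Unpack a previously exported model folder below target_parent."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_parent)

# Demonstration on a temporary directory (not a real simon installation).
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model"
    model.mkdir()
    (model / "placeholder.txt").write_text("dummy model data")

    archive = Path(tmp) / "model-backup.tar.gz"
    export_model(model, archive)

    restore_root = Path(tmp) / "restored"
    restore_root.mkdir()
    import_model(archive, restore_root)
    restored_text = (restore_root / "model" / "placeholder.txt").read_text()
```

With a real installation, model_dir would be the model folder mentioned above, and the archive could be stored anywhere outside the .kde directory.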

Or I need a tutorial which explains exactly what I have to do. As you can see from my posts of the last couple of days, I tried a lot of different things to restore my speech model, but I failed.

4. comment: OK, if I delete the folder /home/ubuntu-64bit/.kde/share/apps/simond, then I delete the file simond.db (which I assume contains information about the username and the password).
If I delete /home/ubuntu-64bit/.kde/share/apps/simond/models, the file simond.db will remain available which means I don’t have to add a user with username and password.

It is good to know about these details because I want to have as much control over the computer as possible.

“document some of your tests on the simon wiki too” – I will think about it. I had written a couple of articles in the simon wiki that were deleted (e.g. an article about PLS). In my opinion, these articles are important for on-line marketing of simon. Information can be 2-3 times redundant. Of course, I understand that the simon team wants to keep the simon wiki lean.

By the way, the marketing of open-source ASR is pretty bad:

Most CMU Sphinx websites are outdated. The problems with the one at sourceforge are:

* Not so modern style
* No interactivity
* Loosely organized outdated information
* Hard to manage/update
* No CMS/search

Also there is a generic problem with the quality of documentation available. A lot is quite outdated and just confusing.

This is exactly my impression of CMU Sphinx. I don’t know where to start: Sphinx-2, Sphinx-3, Sphinx-4, or pocketsphinx? Here is what I want: A system that recognizes words that I speak. So I want one solution for my problem. I tried this Sphinx tutorial. But I failed. I need a tutorial that I can finish within 20 minutes. If it takes longer, I lose interest.

My idea is to reduce the complexity. E.g. you should be able to import Ralf’s German dictionary within 20 minutes. Imagine the following situation: simon is installed, you want to import Ralf's German dictionary. You should be able to achieve this simple goal within 20 minutes. You have success. I want to help people to become successful. This is the reason why I started this blog. I want people to benefit as fast as possible.

Or let’s assume that you are an average Windows XP user, and you want to know whether you can use Ubuntu to recognize your spoken German words. You don’t have to install Ubuntu. You don’t have to install simon. You don’t have to know about Ralf's German dictionary. All you have to do: Download the video Recognize 200 German words, and watch this 13-minute video on your Windows XP computer. Easy, isn’t it? You can learn from me where we are. You don’t lose much time. You can decide when you want to migrate from Windows XP to Ubuntu. I recommend that normal users stay with Win XP. And what about governments and big companies? They should consider migrating to Ubuntu during the next couple of years. ASR for Ubuntu should be usable in, let’s say, five years or so.

Of course, the development could be faster if governments supported the internet instead of fighting against it. E.g. the private households in Germany could use fiber to the home (fiber into every house; inside the house you use WLAN or copper cable). The result would be that we could use simon / simond across the internet via TCP/IP. We don’t need better computers. We need better infrastructure.

Interested people (average computer users who hardly know how to handle the Ubuntu terminal) need some kind of entertainment: Sometimes, they want to lean back, and just consume. I am trying to offer this kind of entertainment: You can watch my 13-minute video where I am showing you that it is possible to dictate more than 200 German words under Ubuntu with 100% accuracy. This video is not scientific. It is easy information about what is possible when you install the following components:
- simon,
- HTK,
- import Ralf's German dictionary into simon, and train a word with simon.

The open-source ASR community does have one major problem: insufficient marketing. People want to have one solution that works out of the box. So we have to make it as easy as possible (if we want to involve more people in the development of open-source ASR software). 10,000 users = one developer. So if we get more users into the system, we might get a few developers.

I want to say what we need:

1. Someone who has C++/Qt knowledge and who knows how to handle Sphinx commands. This person could get the simon source code, and build a simon fork. This simon fork should handle Sphinx commands instead of HTK commands. I understand what has to be done. But I can’t solve this problem on my own because I don’t have the necessary knowledge. It shouldn’t be that hard for someone who is experienced with C++/Qt/Sphinx to solve this problem.

2. We need pronunciation dictionaries for several languages. The ideal format is the Sphinx format (for non-German dictionaries). I am telling you the easiest solution. Why Sphinx format? Because if you have a Sphinx dictionary, you can
- use this dictionary directly with Sphinx;
- import this Sphinx dictionary into simon.

The import of a dictionary in Sphinx format into simon should be possible without problems. This is what should be done:

a. A German native speaker could get involved in the development of Ralf's German dictionary (and similar dictionaries with German pronunciation). My concept is clear (XML + XSLT = PLS). I can’t do all the work. We need specialised dialect dictionaries (Austrian German, Swiss German, Medical German). I made the first step with the creation of Ralf's German dictionary, Ralf's Austrian German dictionary, and Ralf's Medical German dictionary.

b. Is there a person from Austria who wants to improve Ralf's Austrian German dictionary? A linguist who knows about the details of the IPA phonetics could do this. Of course, you don’t have to be a linguist. You can do learning by doing. Learn about the IPA phonemes that are used by Austrian German speakers by improving Ralf's Austrian German dictionary!

c. Maybe there is someone who studies medicine, who wants to use simon for his studies of medicine. The headset will become a tool that will be used by many doctors. Dictionaries with special medicine vocabulary are necessary. Ralf's Medical German dictionary is a start. To achieve this goal, help from a student of medicine is needed.

I want to tell you why the PLS is a great thing:
- no problem to handle right-to-left languages like Hebrew;
- no encoding problems thanks to UTF-8 (OK, I have encoding problems with Ralf’s Polish dictionary);
- the GPLv3 licence can be added at the beginning of the XML document (as comment). There is no need to provide an additional text file containing the GPLv3 license. There is no room for any licensing misunderstanding. XML documents do have a lot of verbosity. But modern computers can handle this verbosity.
- PLS dictionaries are easy for humans to edit. E.g. linguists who are familiar with IPA can edit the PLS dictionary without problems. The IPA is much easier to read than Arpabet.

At the moment, my concept is as follows:
- PLS format for German dictionaries.
- Sphinx format for non-German dictionaries. The non-German dictionaries that I am offering in this blog should be transformed from PLS/eSpeak format to Sphinx/Arpabet format. After the dictionary has been converted into Sphinx format (this should be done by a native speaker), you can import it into simon.

I want simon to become a solution that you can use with the following languages: Icelandic, Vietnamese, Russian, Norwegian, and many more languages. Help from native speakers is needed. You can transform the PLS dictionary of your language into Sphinx format. E.g. let’s say you are from Russia. Then you can transform Ralf's Russian dictionary into Sphinx/Arpabet with a simple text editor. After you have done that, you can import the dictionary into simon.
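Part of this transformation could also be scripted instead of being done by hand. The following Python sketch reads the lexeme entries of a PLS document and writes one line per word in the Sphinx dictionary layout (the word followed by its phone symbols, separated by spaces). Everything specific in it is an assumption for illustration: the two sample entries use simplified pronunciations, the phone table is a tiny made-up mapping from single characters, and a native speaker would still have to decide which phone symbols are right for the language.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C PLS 1.0 specification.
PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Hypothetical mapping from single phoneme characters to Sphinx-style symbols.
PHONE_MAP = {"d": "D", "a": "AA", "n": "N", "e": "EH", "t": "T"}

# Tiny sample lexicon with simplified pronunciations (not taken from a real dictionary).
SAMPLE = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="ru">
  <lexeme><grapheme>да</grapheme><phoneme>da</phoneme></lexeme>
  <lexeme><grapheme>нет</grapheme><phoneme>net</phoneme></lexeme>
</lexicon>"""

def pls_to_sphinx(pls_xml):
    """Return one 'word PH1 PH2 ...' line per <lexeme> in the PLS document."""
    root = ET.fromstring(pls_xml)
    lines = []
    for lexeme in root.findall("pls:lexeme", PLS_NS):
        word = lexeme.findtext("pls:grapheme", namespaces=PLS_NS)
        phoneme = lexeme.findtext("pls:phoneme", namespaces=PLS_NS)
        # One symbol per character; a KeyError points at a gap in PHONE_MAP.
        phones = [PHONE_MAP[ch] for ch in phoneme]
        lines.append(f"{word} {' '.join(phones)}")
    return lines

if __name__ == "__main__":
    print("\n".join(pls_to_sphinx(SAMPLE)))
```

Mapping one character to one symbol only works for very simple phoneme strings; length marks, affricates and eSpeak mnemonics with several characters would need a proper tokenizer.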

HTK website is not reachable

Friday, January 1st, 2010

At the moment, it seems that the HTK website is not reachable. Here is what you can do:
- install simon,
- read the simon handbook,
- wait until the HTK website is available. Probably, it is just a temporary issue. No need to worry.

You can install simon without installing HTK. But to build speech models, you need HTK.

Something for mathematicians who want to learn more about the theoretical background of HMMs in general: read the PDF The Application of Hidden Markov Models in Speech Recognition.

Why should you read the PDF with the theoretical background? Personally, I would understand just about 1% of the things they explain (I read parts of the HTK book – I understood almost nothing). But at least I understand one fundamental thing: HMMs are a great thing for speech recognition. The HTK toolkit handles HMMs. And simon makes use of the HTK toolkit (the HTK toolkit has to be installed separately; it is not part of simon).

In the end, it is working. Watch my newest video where I am dictating 200 German words with simon.

So there are several things that you can do until the HTK toolkit is available again.

About comments in this blog

Wednesday, December 30th, 2009

During the past couple of days, this blog got hit by lots of comment spam. Obviously, the current anti-spam protection isn’t sufficient any more. So I decided to add another spam protection to this blog. From now on, it is necessary to fill out a CAPTCHA if you want to comment.