Posts Tagged ‘confidence score’

Confidence score with Hebrew

Thursday, September 10th, 2009

A few hours ago, I created a sample Hebrew PLS dictionary. It is very short, but it shows the concept.

1. I imported the Hebrew PLS dictionary into simon.
2. I dragged each word to the right side for training.
3. The recorded Hebrew words are stored in the folder /home/liberty/.kde/share/apps/simon/model/training.data.
4. After starting ksimond, I pressed the Synchronize button.
5. Then I activated simon.
6. When dictating several words, simon is not sure which word is the right one. Is it מדפסת, or is it עִבְרִית? Unfortunately, I didn’t get any output in gedit or in Geany. Maybe it has something to do with the right-to-left encoding?

You can see that, thanks to UTF-8, Hebrew shouldn’t be a big problem. I don’t know why the output is missing. But at least the Hebrew words are displayed correctly, so only the last step is missing.

If anyone is interested in building a Hebrew PLS dictionary, I suggest taking a look at the Hebrew Voxforge prompts and adding the words contained in the prompts to the dictionary. You can take my sample Hebrew dictionary and expand it. Later, once you have gained some first experience with simon, you can use the Voxforge prompts for training.
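
The word list for such a dictionary could be pulled straight out of a prompts file. What follows is only a minimal sketch, assuming each line of the prompts file starts with a recording ID followed by the words of the utterance, separated by spaces. The function name prompt_words and the file name in the usage line are placeholders, not part of simon or Voxforge.

```shell
#!/bin/sh
# prompt_words FILE
# Prints every distinct word of a prompts file, one per line.
# Assumed line layout (not checked here): <recording-id> <word> <word> ...
prompt_words() {
    # Skip field 1 (the recording id) and print each remaining field on its own line,
    # then sort bytewise and drop duplicates.
    awk '{ for (i = 2; i <= NF; i++) print $i }' "$1" | LC_ALL=C sort -u
}

# Usage (file name is an example only):
#   prompt_words prompts.txt > wordlist.txt
```

The result is just a list of words. Each word still needs a pronunciation before it can become an entry in a PLS dictionary.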

Import PLS dictionary to active vocabulary

Sunday, August 23rd, 2009

I imported the whole PLS dictionary /home/liberty/200905/voxDE20090209.xml into the active vocabulary. This feature was added to simon a few weeks ago:

“simon can now import dictionaries to the active lexicon.”

You know that my next goal is to hit the 1000 words mark: simon should recognize 1000 words. At the moment, I have major recognition problems. simon isn’t very responsive. It recognizes, for example, the word “abnahmen”, but when I dictate other words (which are of course part of the active vocabulary and which I trained successfully), simon doesn’t react. Maybe it has something to do with the confidence score? Or maybe the speech model was changed while I was playing with sam?

Well, the active vocabulary now contains more than 8000 words. When I dictate, simon now recognizes words that I have never trained. And of course, it recognizes the wrong words. So I will have to figure out how to adjust the speech model.

For example, I could record lots of single words with Audacity (not utterances, because I find it difficult to define an appropriate grammar) and then use the Export Multiple... function. I am using Audacity in combination with my external USB sound card. Under Ubuntu, this sound card only works at 22050 hertz, not at 16000 hertz. This is why I use my onboard sound card when dictating into simon directly (= recognition) or when recording words with simon (= training).

It is a bit complicated to explain. I prefer Audacity for recording because it allows me to record lots of training samples in a short amount of time. So if I record with Audacity at 22050 hertz, I have to resample the wav files with sox. I tested the command from the Sphinx guide. The following command successfully converted a 22050 hertz file to 16000 hertz:

$ sox de27-02.wav -r 16000 -c 1 -s de27-02-test.wav
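
For a whole batch of Audacity exports, the same command could be wrapped in a loop. This is only a sketch: the function name resample_dir and the two directory names are placeholders, and it reuses exactly the sox options from the command above. Newer SoX releases may expect "-e signed-integer" in place of "-s", so treat that flag as an assumption to check against your installed version.

```shell
#!/bin/sh
# resample_dir SRC DST
# Converts every .wav file in SRC to 16000 Hz mono and writes the result
# under the same file name into DST. The originals are left untouched.
resample_dir() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for f in "$src"/*.wav; do
        # If SRC contains no .wav files, the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        # Same options as the single-file command: 16000 Hz, 1 channel, signed samples.
        sox "$f" -r 16000 -c 1 -s "$dst/$(basename "$f")" || return 1
    done
}

# Usage (directory names are examples only):
#   resample_dir ./audacity-export ./resampled-16k
```

Writing into a separate directory keeps the 22050 hertz recordings available in case the conversion has to be repeated with other settings.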

With Audacity, I could record all 8000 words that are now in my active vocabulary, let’s say in packages of 100 words. Two years ago, Audacity allowed me to export only about 30 wav files at a time; otherwise the application would crash. I will have to test the current version of Audacity. This issue has probably been fixed.

My main concern is that the words of my dictionary are often very similar. Here is an example:

DUTZEND [Dutzend] d U ts @ n t
DUTZEND [Dutzend] d U ts n= t
DUTZENDE [Dutzende] d U ts @ n d @
DUTZENDE [Dutzende] d U ts n= d @
DUTZENDEN [Dutzenden] d U ts @ n d @ n
DUTZENDEN [Dutzenden] d U ts n= d @ n
DUTZENDEN [Dutzenden] d U ts n= d n=
DUTZENDS [Dutzends] d U ts n= ts

Eight entries that are very similar. I think that this is very hard to train successfully. I will have to find out how I can achieve my 1000 words goal. Maybe I should reduce the size of the active vocabulary from 8000 words to 1000 words? Then I could use a set of words that aren’t too similar, and thus I would get better recognition results.
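
To get a feeling for how many near-duplicates the lexicon contains, the pronunciation variants per word can be counted. A minimal sketch, assuming the lexicon is a plain text file in the layout shown above (word, bracketed output form, then the phonemes); count_variants and the file name are placeholders.

```shell
#!/bin/sh
# count_variants FILE
# Prints "<count> <WORD>" for every word in the lexicon,
# words with the most pronunciation variants first.
count_variants() {
    # Field 1 is the word; count how many lines share it.
    awk '{ n[$1]++ } END { for (w in n) print n[w], w }' "$1" |
        LC_ALL=C sort -k1,1nr -k2,2
}

# Usage (file name is an example only):
#   count_variants lexicon.txt | head
```

For the eight entries above this lists DUTZENDEN with 3 variants, DUTZEND and DUTZENDE with 2 each, and DUTZENDS with 1. It only counts variants of the same word; it does not measure how similar different words such as DUTZEND and DUTZENDE sound.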

I could also sort out short words. Short words are harder to recognize than long words, so one trick is to train only long words and leave out the short ones.
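
One way to sort out short words is to measure length in phonemes rather than letters. The sketch below assumes the same lexicon layout as the entries above (word, bracketed form, phonemes) and keeps a word only if even its shortest pronunciation variant reaches a minimum number of phonemes. The function name keep_long_words, the threshold and the file name are placeholders.

```shell
#!/bin/sh
# keep_long_words MIN FILE
# Prints all lexicon entries of those words whose shortest pronunciation
# variant has at least MIN phonemes. Reads FILE twice.
keep_long_words() {
    awk -v min="$1" '
        # First pass: remember the smallest phoneme count per word.
        # Fields 1 and 2 are the word and its bracketed form, the rest are phonemes.
        NR == FNR {
            p = NF - 2
            if (!($1 in shortest) || p < shortest[$1]) shortest[$1] = p
            next
        }
        # Second pass: print every entry of a word that is long enough.
        shortest[$1] >= min
    ' "$2" "$2"
}

# Usage (threshold and file name are examples only):
#   keep_long_words 6 lexicon.txt > long-words.txt
```

Run on the eight entries above with a minimum of 6, this keeps the DUTZENDE and DUTZENDEN entries and drops DUTZEND and DUTZENDS, because each of those has a variant with only five phonemes.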

I am interested in the following function:

“simon can now import prompts files through the import training data wizard.”

This is a very interesting function for me. I have recorded lots of utterances, and they could be imported into simon. But I have one problem: I didn’t define an appropriate grammar. I could use a grammar with just one category of words (e.g. all words are marked as nouns, no matter what they really are: adverbs, adjectives, verbs, etc.). So this could be the way to go.

I think that the 1000 words goal could be reached with the present vocabulary of 8000 words. Julius allows dictation with 20,000 words, so 1000 words is a reasonable goal. When I have reached that goal, I will have to think about the following question: How can I hit the 10,000 words mark? First, I need a bigger lexicon. I don’t want to use BOMP, since that would require writing them an email. I prefer to stick to free dictionaries.

Another solution could be that I switch from the German PLS dictionary to the English Voxforge dictionary. I could do the testing in English.