Posts Tagged ‘Sphinx’

Try to train the Polish word “JEDEN”

Thursday, November 19th, 2009

I want to import a small sample dictionary into simon (Sphinx format):

polish-sphinx

The source can be found here (I don’t know how long this link will be valid). The dictionary contains 19 Polish words (US-ASCII). Here is what you have to do next:

universal

1. Select Applications > Universal Access > simon.

import-dictionary

2. Press the Wordlist button.
3. Press Import Dictionary.

shadow-dictionary

4. You can select the target: shadow dictionary or active dictionary. For this Polish example dictionary, choose active dictionary.
5. Press the Next button.

And now it is time to choose the appropriate lexicon format:

import-sphinx

Import the dictionary (with the 19 Polish words; see the screenshot at the beginning of this post) as a SPHINX lexicon.

sphinx-automatic

You have to select the path to the Polish Sphinx dictionary. After pressing the Next button, the following message appears:

finish

The Polish Sphinx dictionary has been imported successfully. Press the Finish button.

Now let’s train a Polish word:

add-polish

a. Select the Polish word JEDEN.
b. Add to Training.
c. Train selected Words.

You can now record the Polish word with simon:

train-polish


Importing a Turkish Sphinx dictionary

Saturday, September 5th, 2009

I have imported a Turkish Sphinx dictionary into simon. This is the result:

turkish

The dictionary contains more than 1000 entries. I read this discussion at VoxForge, which mentions the source of this Turkish dictionary: rapidshare.com [source is not available anymore]. I downloaded the rar file and unpacked the Turkish Sphinx dictionary: file:///home/liberty/200909/myam.dic. Then I imported this dictionary into simon, which transformed it into HTK format:

ABDIL [abdil] a b d i l
ABDULLAH [abdullah] a b d u l l a h
ACENTACI [acentacı] a c e n t a c iy
ACI [acı] a c iy
ACIDIR [acıdır] a c iy d iy r
ACIL [acil] aa c i l
ADALAR [adalar] a d a l a rh

Training should be possible.

Article edited on September 7, 2009.

Importing the Voxforge dictionary

Monday, August 31st, 2009

I am now importing the Voxforge dictionary into simon from this location: /home/liberty/200908/sam/english/VoxForgeDict. I had downloaded it from here (VoxForge.tgz). It is in HTK format. What does the HTK format look like? Here is a small excerpt from the dictionary:

APPROACH [APPROACH] ax p r ow ch
APPROACHABLE [APPROACHABLE] ax p r ow ch ax b ax l
APPROACHED [APPROACHED] ax p r ow ch t
APPROACHES [APPROACHES] ax p r ow ch ax z
APPROACHES(2) [APPROACHES] ax p r ow ch ix z
APPROACHING [APPROACHING] ax p r ow ch ix ng
APPROBATION [APPROBATION] ae p r ax b ey sh ax n

The VoxForge dictionary contains about 130k words.
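To make the layout of such a line explicit, here is a minimal parsing sketch. It only covers the shape seen in the excerpt above (word, optional variant marker such as (2), bracketed output symbol, phonemes); the names `HtkEntry` and `parse_htk_line` are made up for illustration, and other features that an HTK dictionary may contain are not handled.

```python
import re
from typing import List, NamedTuple

# WORD, optional "(n)" variant marker, "[OUTPUT]", then the phonemes.
HTK_LINE = re.compile(r"^(\S+?)(?:\((\d+)\))?\s+\[([^\]]*)\]\s+(.+)$")


class HtkEntry(NamedTuple):
    word: str
    variant: int
    output: str
    phonemes: List[str]


def parse_htk_line(line: str) -> HtkEntry:
    match = HTK_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not in the expected HTK layout: {line!r}")
    word, variant, output, phonemes = match.groups()
    # A missing marker is treated as the first pronunciation.
    return HtkEntry(word, int(variant) if variant else 1, output, phonemes.split())


EXCERPT = """\
APPROACH [APPROACH] ax p r ow ch
APPROACHABLE [APPROACHABLE] ax p r ow ch ax b ax l
APPROACHED [APPROACHED] ax p r ow ch t
APPROACHES [APPROACHES] ax p r ow ch ax z
APPROACHES(2) [APPROACHES] ax p r ow ch ix z
APPROACHING [APPROACHING] ax p r ow ch ix ng
APPROBATION [APPROBATION] ae p r ax b ey sh ax n
"""

entries = [parse_htk_line(line) for line in EXCERPT.splitlines() if line.strip()]
distinct_words = {entry.word for entry in entries}
print(len(entries), "entries,", len(distinct_words), "distinct words")
```

Counting distinct words separately from entries matters for a figure like "130k words", because alternate pronunciations such as APPROACHES(2) add lines without adding words.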

First, I imported /home/liberty/200908/sam/english/cmudict.0.6d. It is in Sphinx format:

APPROACH AH0 P R OW1 CH
APPROACHABLE AH0 P R OW1 CH AH0 B AH0 L
APPROACHED AH0 P R OW1 CH T
APPROACHES AH0 P R OW1 CH AH0 Z
APPROACHES(2) AH0 P R OW1 CH IH0 Z
APPROACHING AH0 P R OW1 CH IH0 NG
APPROBATION AE2 P R AH0 B EY1 SH AH0 N

So you now know the difference between a dictionary that is stored in HTK format and one that is stored in Sphinx format. Both dictionaries (VoxForgeDict and cmudict.0.6d) each contain about 130k words. I don’t know whether they share the same phoneme set or not. My guess is that both lexicons are using CMU-40, but I am not sure, so I could be wrong!
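The phoneme-set question could be checked by collecting the symbols used in each file and comparing the two sets. The sketch below does this for three lines from each excerpt above. The function names are my own, and stripping the stress digits (AH0 becomes ah) is an assumption about how the two notations should be aligned. Run over the full files, the same approach would show how far the two inventories really overlap.

```python
import re

HTK_SAMPLE = """\
APPROACH [APPROACH] ax p r ow ch
APPROACHES(2) [APPROACHES] ax p r ow ch ix z
APPROBATION [APPROBATION] ae p r ax b ey sh ax n
"""

SPHINX_SAMPLE = """\
APPROACH AH0 P R OW1 CH
APPROACHES(2) AH0 P R OW1 CH IH0 Z
APPROBATION AE2 P R AH0 B EY1 SH AH0 N
"""


def htk_phonemes(text: str) -> set:
    phones = set()
    for line in text.splitlines():
        fields = line.split()
        # fields[0] is the word, fields[1] the bracketed output symbol.
        phones.update(fields[2:])
    return phones


def sphinx_phonemes(text: str) -> set:
    phones = set()
    for line in text.splitlines():
        for symbol in line.split()[1:]:
            # Assumption: drop stress digits and lowercase to align with HTK.
            phones.add(re.sub(r"\d+$", "", symbol).lower())
    return phones


htk = htk_phonemes(HTK_SAMPLE)
sphinx = sphinx_phonemes(SPHINX_SAMPLE)
print("only in HTK sample:   ", sorted(htk - sphinx))   # ['ax', 'ix']
print("only in Sphinx sample:", sorted(sphinx - htk))   # ['ah', 'ih']
```

For these sample lines the symbols differ only in ax/ix versus ah/ih. That says nothing definite about the complete dictionaries, but it suggests that the symbols are not identical one to one.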

I think that I will stick to VoxForgeDict because it is in HTK format.

So, what will be my next step? I want to train a few words with simon (words: this, is, a, different, approach). Then I will compile the speech model (= synchronize with ksimond). After that, I will check whether simon recognizes my voice.

If it works, I will try to run a test with sam. I want to test the sentence This is a different approach. with sam. This is the first sentence of my English files.

I have to get familiar with the whole training and testing process. In the long term, I want to use sam for model creation and model testing.

I just imported the lexicon VoxForgeDict into simon. Maybe I should define a grammar now? I just did that: Now I have a grammar with just one category: Unknown. I know that this isn’t sufficient for testing the whole sentence This is a different approach., but I will try to fix that later when the problem occurs.

I am now adding the word this:

this

I can’t record the word because I can’t restart simon. I think that I will have to get the current snapshot via svn.

Import a big French Sphinx lexicon

Saturday, August 29th, 2009

I just downloaded the French dictionary from univ-lemans.fr. It has more than 65K words; in fact, I think that it contains more than 100,000. It is probably stored in Sphinx format. Here are some lines of that dictionary:

juridiction jj uu rr ii dd ii kk ss yy on
juridictionnel jj uu rr ii dd ii kk ss yy oo nn ai ll
juridictionnelle jj uu rr ii dd ii kk ss yy oo nn ai ll
juridictionnelle(2) jj uu rr ii dd ii kk ss yy oo nn ai ll ee
juridictions jj uu rr ii dd ii kk ss yy on
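A rough way to tell the two layouts apart is to look at the second field of a line: in the HTK excerpts of the earlier posts it is a bracketed output symbol, while a Sphinx line goes straight from the word to the phonemes. The helper below is a heuristic sketch based only on the excerpts shown in these posts; the name `guess_lexicon_format` is made up, and this is not how simon detects the format.

```python
def guess_lexicon_format(line: str) -> str:
    """Return 'htk', 'sphinx', or 'unknown' for a single lexicon line."""
    fields = line.split()
    if len(fields) < 2:
        # A word without any pronunciation fits neither layout.
        return "unknown"
    second = fields[1]
    if second.startswith("[") and second.endswith("]"):
        return "htk"
    return "sphinx"


print(guess_lexicon_format("juridiction jj uu rr ii dd ii kk ss yy on"))
print(guess_lexicon_format("APPROACH [APPROACH] ax p r ow ch"))
```

A single line is weak evidence, so in practice it would make sense to check a few hundred lines and take the majority result.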

I assume that this is Sphinx format, but I don’t know. I will try to import this lexicon into simon as a Sphinx lexicon. I will import from the following location: /home/liberty/200908/words_dict. The import process takes a few moments because it is a big dictionary. The import was successful, but there are some minor encoding problems:

french-sphinx

I know this problem. It is probably the result of a wrong encoding: ISO-8859-1 vs. UTF-8. It should be easy to solve. The dictionary could be opened with an editor such as Notepad++ and saved as UTF-8. Maybe that would work (I would have to try that).
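The encoding guess can also be checked without an editor. The sketch below tries a strict UTF-8 decode and falls back to ISO-8859-1 if that fails. The function name `guess_encoding` and the sample word are made up for illustration. Keep in mind that ISO-8859-1 can decode any byte sequence, so the fallback is only a plausible guess, not a proof.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'utf-8' if the bytes decode cleanly as UTF-8, else 'iso-8859-1'."""
    try:
        data.decode("utf-8")  # strict by default: raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        # Fallback guess; every byte sequence is valid ISO-8859-1.
        return "iso-8859-1"


# Made-up sample word, encoded both ways.
print(guess_encoding("français".encode("iso-8859-1")))  # iso-8859-1
print(guess_encoding("français".encode("utf-8")))       # utf-8
```

A pure US-ASCII file is reported as UTF-8, which is harmless because ASCII is a subset of both encodings.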

I have now opened the file /home/liberty/200908/words_dict with Geany. The encoding is ISO-8859-1. This was my first guess. And of course, I was right. Let’s take a closer look at how to fix that encoding issue:

geany-encoding

1. The French dictionary that had been downloaded (see the source at the beginning of this article) has the file name words_dict.
2. The encoding is ISO-8859-1.
3. Let’s try to set the encoding to UTF-8.
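The same conversion can be done without Geany. The following is a minimal sketch, assuming the file really is ISO-8859-1. The function name `reencode_file`, the output file name, and the sample entry are made up for illustration, and the demonstration runs on a throwaway file rather than the real words_dict.

```python
import tempfile
from pathlib import Path


def reencode_file(src, dst, src_encoding="iso-8859-1"):
    """Read src in src_encoding and write the same text to dst as UTF-8."""
    raw = Path(src).read_bytes()
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(raw.decode(src_encoding).encode("utf-8"))


# Demonstration on a temporary file; the entry below is invented.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "words_dict"
    dst = Path(tmp) / "words_dict.utf8"
    src.write_bytes("élève ee ll ai vv\n".encode("iso-8859-1"))
    reencode_file(src, dst)
    converted = dst.read_bytes()

print(converted.decode("utf-8"), end="")
```

Writing to a separate file keeps the original download intact in case the assumed source encoding turns out to be wrong.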

After saving the file, I will try to import the lexicon again (of course, I will rename a specific simon folder beforehand). simon offers to select a specific encoding:

encoding

I will stick to automatic encoding. Let’s see what the result of the import process will be. Have the encoding issues been solved? Yes, everything is OK now. I now have more than 100,000 French words that could be trained with simon.