Dragon NaturallySpeaking (DNS) 9.5 Preferred is the best speech recognition software that I have used (I didn’t buy DNS 10 because I want to migrate to Ubuntu). Here are some interesting questions (all quotes in this article are from this source):
“What would be the approach to get either Sphinx, Julius or HTK to be as accurate as Dragon Naturally Speaking preferred or pro version in English (USA and or UK)?”
The main problem is not accuracy but usability. Users want a GUI so that they don’t have to use the Ubuntu terminal. The average user will invest about 20 minutes in installing speech recognition software and then wants to start dictating. The computer has to react (= simulate keystrokes) when the user speaks into the microphone.
“I assume that a much larger speech corpus than what VoxForge currently has would be required”
How big should the speech corpus be? In my experience, I can dictate 200 words after training each word 3 times (200*3=600 recorded words). So in the end we probably don’t need a very large speech corpus, but who knows? I think that WAV recordings are needed, and the pronunciation of each recorded word has to match the phonemes listed for that word in the PLS dictionary being used. When speaking into the microphone, you have to hit every single phoneme; then you can get high accuracy. So for both training and recognition it is necessary to be as exact as possible (= don’t omit any phoneme).
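The bookkeeping behind such a small training set can be sketched in a few lines of Python. This is only an illustration: the function name training_gaps, the sample vocabulary and the list of recorded words are all made up for this example, and the threshold of 3 takes per word is simply my own experience from above, not a rule of HTK, Julius or simon.

```python
from collections import Counter

def training_gaps(recorded_words, vocabulary, min_takes=3):
    """Return (words with too few takes, recorded words missing from the vocabulary)."""
    counts = Counter(w.lower() for w in recorded_words)
    known = {w.lower() for w in vocabulary}
    # Vocabulary words that were recorded fewer than min_takes times.
    too_few = sorted(w for w in known if counts[w] < min_takes)
    # Recorded words that have no entry (and therefore no phonemes) in the vocabulary.
    unknown = sorted(w for w in counts if w not in known)
    return too_few, unknown

# Hypothetical data: one list entry per recorded WAV file.
vocabulary = ["open", "close", "save"]
recordings = ["open", "open", "open", "close", "close",
              "save", "save", "save", "quit"]

too_few, unknown = training_gaps(recordings, vocabulary)
print("needs more takes:", too_few)
print("not in dictionary:", unknown)
```

With 200 words and 3 takes each, the first list is empty once all 600 recordings exist. The second list points at recordings whose word has no dictionary entry.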
“a much larger dictionary”
VoxForgeDict is large enough. It would be nice if someone converted this dictionary into PLS format. This would have several advantages:
- More competition. cmudict (from which VoxForgeDict is obviously derived) could use a competing dictionary. Ralf's English dictionary serves only as a dummy dictionary; it can’t compete because its quality is too low. It would be better to convert VoxForgeDict into PLS format (converting the Arpabet phonemes into IPA phonemes). The result would be a high-quality dictionary that could be developed independently of VoxForgeDict.
- Convert the <grapheme> elements from upper-case to lower-case.
- Add a role attribute to each <lexeme> element. With this feature we might get better recognition results, because simon could use the content of each role attribute as terminal information (e.g. verb, noun, adjective).
So the dictionary size is not the main issue. Of course, we also need PLS dictionaries for many other languages.
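The conversion described in the list above could look roughly like the following Python sketch. It rests on several assumptions. Input lines are assumed to have the form "WORD [WORD] ph ph ...", with the bracket field optional. The Arpabet-to-IPA table is one common correspondence, not an official one: AH is always mapped to ʌ here, and AX is used for the schwa. The pos prefix and its namespace URI for the role attribute are hypothetical placeholders, as are the function names. Symbols that are not in the table (for example silence markers) would raise a KeyError and need separate handling.

```python
import xml.etree.ElementTree as ET

# One common Arpabet -> IPA correspondence (an assumption, see the text above).
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AX": "ə",
    "AY": "aɪ", "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "EH": "ɛ",
    "ER": "ɝ", "EY": "eɪ", "F": "f", "G": "ɡ", "HH": "h", "IH": "ɪ",
    "IY": "i", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "OW": "oʊ", "OY": "ɔɪ", "P": "p", "R": "ɹ", "S": "s",
    "SH": "ʃ", "T": "t", "TH": "θ", "UH": "ʊ", "UW": "u", "V": "v",
    "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}

def arpabet_to_ipa(phones):
    """Join Arpabet symbols (any case, optional stress digit) into an IPA string."""
    return "".join(ARPABET_TO_IPA[p.upper().rstrip("012")] for p in phones)

def parse_line(line):
    """Split 'WORD [WORD] ph ph ...' into (lower-case word, list of phones)."""
    fields = line.split()
    word, rest = fields[0], fields[1:]
    if rest and rest[0].startswith("["):  # drop the optional [WORD] field
        rest = rest[1:]
    return word.lower(), rest

def build_pls(lines, roles=None, lang="en-US"):
    """Build a PLS <lexicon> element; roles maps lower-case words to a role name."""
    roles = roles or {}
    lexicon = ET.Element("lexicon", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2005/01/pronunciation-lexicon",
        "xmlns:pos": "http://example.org/pos",  # hypothetical namespace for roles
        "alphabet": "ipa",
        "xml:lang": lang,
    })
    for line in lines:
        if not line.strip():
            continue
        word, phones = parse_line(line)
        lexeme = ET.SubElement(lexicon, "lexeme")
        if word in roles:
            lexeme.set("role", "pos:" + roles[word])
        ET.SubElement(lexeme, "grapheme").text = word
        ET.SubElement(lexeme, "phoneme").text = arpabet_to_ipa(phones)
    return lexicon

# Made-up sample entries in the assumed input format.
sample = [
    "SPEECH [SPEECH] s p iy ch",
    "DICTATE [DICTATE] d ih k t ey t",
]
lexicon = build_pls(sample, roles={"speech": "noun", "dictate": "verb"})
print(ET.tostring(lexicon, encoding="unicode"))
```

The role values have to come from somewhere else (for example a part-of-speech list), because VoxForgeDict itself contains only words and phonemes.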
“what does it take to get speaker independent accuracy similar to Dragon’s?”
I would say that we need five years to be as good as they are today.
To sum up: simon is currently the best approach because it offers a GUI. What is needed is to implement existing algorithms and technologies, and I think that Qt/C++ is the right way to do it. HTK and Julius seem to work perfectly, but using the Ubuntu terminal is very difficult for the average user; a GUI is much easier.
The existing technology (HTK, Julius) is evidently good enough. What we need is a GUI, and that is the great thing about simon: it offers one.