I just read a post about XML standards on the simon blog. I want to reply to some of the remarks:
“this might be interesting to other readers”
I agree. That’s the reason why I started blogging about simon. I want to give some feedback to the developers, and other people might be interested as well. People need to know that simon is a project with very high potential: open source speech recognition for the masses might become reality in the near future.
This is important for large corporations and governments to know as well: should they continue to use Win XP, or should they upgrade to Win Vista (or the upcoming Windows 7)? One aspect of this decision can be whether speech recognition is available. Windows Vista does have built-in speech recognition. And Ubuntu Linux? At the moment it doesn’t offer any ASR that would be sufficient. But that could change – hopefully in the not-too-distant future – thanks to simon. So my goal is to influence decisions.
“simon does support importing PLS dictionaries”
That’s great. Why am I so into XML-based standards? Because I understand them. And I want to produce something that is of great value to others (not limited to ASR development). Even search engines should be able to understand what is meant when I offer SSML files. But does a search engine understand what is meant when it analyzes files in the HTK or Julius format? I doubt it. The HTK and Julius formats are obviously very specific formats meant just for ASR developers. But I am thinking in a more general sense.
Let me explain what I believe in: The world is a giant global graph: “I’ll be thinking in the graph. My flights. My friends. Things in my life.” – the inventor of the WWW says that. And I couldn’t agree more. XML is probably the best language to describe this giant graph of knowledge. This is my ideology. If XML doesn’t suit your specific needs with HTK, I understand that.
And, by the way: I don’t like to read SAMPA. I prefer the IPA when editing the pronunciation dictionary. Sometimes I ask myself: why don’t they switch from SAMPA to the IPA? Why don’t they switch their homepage from ISO-8859-1 to UTF-8? OK, they are Americans. They don’t have problems with exotic characters like “äöüß”. Do they care about other languages? Probably not. We don’t live in the time of old-fashioned ASCII any more. There are more spoken languages in the world than just English. The English-speaking developers may be comfortable with ASCII. A lot of modernizations would be useful (ASCII->UTF-8; HTK format->PLS; Voxforge prompts->SSML). I can’t criticize the simon developers for that. It is not their fault.
I understand that there are priorities.
“export functionality is a low priority feature”
OK. From my point of view, Voxforge needs an export functionality. And the export could be done via SSML/XML (<audio> elements). The question is: how can I use the speech collected by Voxforge for training with simon? My proposition is to use SSML as an intermediate step. This is additional work in the short term, but in the long term we might increase our productivity.
PLS and SSML are developed by speech experts. And currently, I am convinced that it is not a wrong decision to stick to these standards. I have been reading the HTK book, and it takes a lot of time to get into it.
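To make the idea more concrete, here is a minimal sketch in Python of how such an export could look. It is only my assumption of a possible mapping, not something Voxforge or simon provides: I assume each prompts line has the form "ID WORD WORD ...", that the recording is named after the ID with a ".wav" extension, and I put the transcript inside the <audio> element, where SSML expects the alternate text. The function name prompts_to_ssml is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SSML 1.0 recommendation.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def prompts_to_ssml(prompts_text, lang="en-US", audio_ext=".wav"):
    """Convert 'ID WORD WORD ...' prompt lines into one SSML document.

    Assumption: the ID (optionally prefixed by a path such as '*/')
    is also the base name of the recorded audio file.
    """
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang},
    )
    for line in prompts_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank lines and lines without a transcript
        sample_id, transcript = parts
        file_stem = sample_id.rsplit("/", 1)[-1]  # drop any path prefix
        paragraph = ET.SubElement(speak, "p")
        audio = ET.SubElement(paragraph, "audio", {"src": file_stem + audio_ext})
        audio.text = transcript.strip()  # alternate text = the prompt itself
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    example = "*/a0001 HELLO WORLD\na0002 GOOD MORNING\n"
    print(prompts_to_ssml(example))
```

Each utterance becomes one <p> with one <audio> element, so an importer only has to read the src attribute and the text content.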
“PLS standard does not allow for any terminal information”
We could add terminal information and create a standard XML file with <terminal> tags. Maybe a future version of the PLS will suit our needs. We can use just XML – and add the missing <terminal> element. I don’t know about the exotic BOMP standard; I couldn’t find an entry on Wikipedia. So I assume that BOMP is not a relevant standard. I want to use common standards that are well understood outside of the ASR development community. The W3C Speech Interface Framework offers a lot of XML-based markup languages. So people who don’t know about the specific needs of HTK/Julius but have a basic understanding of XML can immediately understand what is being offered. They don’t have to do lots of research.
I am not very familiar with HTK and Julius. I tried several times, installed HTK, read the Voxforge tutorial. I made progress, but unfortunately I didn’t achieve sufficient skill to get through the Voxforge HTK tutorial. Maybe I didn’t try hard enough.
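Here is a rough sketch in Python of what I mean by "just XML" with an added <terminal> element. The <lexicon>, <lexeme>, <grapheme> and <phoneme> names are borrowed from PLS, but <terminal> is not part of PLS 1.0, so such a file would be PLS-like and would not validate against the PLS schema. For that reason I leave out the PLS namespace. The function name build_lexicon and the example entries are my own assumptions.

```python
import xml.etree.ElementTree as ET


def build_lexicon(entries, lang="en-US"):
    """Build a PLS-like lexicon from (word, ipa, terminal) tuples.

    Note: <terminal> is a non-standard extension, so the result is
    plain XML modelled on PLS, not a valid PLS 1.0 document.
    """
    lexicon = ET.Element(
        "lexicon",
        {"version": "1.0", "alphabet": "ipa", "xml:lang": lang},
    )
    for word, ipa, terminal in entries:
        lexeme = ET.SubElement(lexicon, "lexeme")
        ET.SubElement(lexeme, "grapheme").text = word
        ET.SubElement(lexeme, "phoneme").text = ipa
        ET.SubElement(lexeme, "terminal").text = terminal  # hypothetical element
    return ET.tostring(lexicon, encoding="unicode")


if __name__ == "__main__":
    # Illustrative entries only; the terminal names are made up.
    print(build_lexicon([("Haus", "haʊs", "Noun"), ("grün", "ɡʁyːn", "Adjective")], lang="de-DE"))
```

Because the IPA is stored as ordinary Unicode text, characters like "ü" or "ʊ" need no special escaping as long as the file is saved as UTF-8.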
“no reason to introduce new file formats”
Then I will try to develop something on my own. Currently, I am wondering whether we should take a closer look at Symfony to develop an evaluation system for the Voxforge prompts. The result would be that we could deliver high quality training material for simon. By the way, I am primarily interested in dictation (not command and control). And for dictation, we need utterances to get good recognition results. simon allows me to record only single words, not utterances. I am not convinced by that concept. Training should be done with utterances, not just single words. Voxforge made the right decision to collect utterances.
“importing of a “normal” HTK prompts file”
That would be sufficient. I would appreciate it if such a feature were implemented.
My proposition is: Voxforge (HTK prompts) -> SSML -> simon
A shorter way would be: Voxforge (HTK prompts) -> simon
Everyone should use the shortest path. But I am still wondering: how can we evaluate the Voxforge prompts? Some of them should be filtered out. And how can we achieve this goal?
You see, there are several aspects. The world is not just about simon. It is about Voxforge, too.
“introduce an additional source of errors”
You were able to implement PLS import. If you don’t want to implement SSML, that would be OK.
I think that I will have to read and try the Voxforge tutorial about HTK again.
It is OK not to focus on PLS export and SSML. Just do what you think is best for the simon project.
I hope that you now understand my point of view better than before. It is an ideological view – describing the world as a graph. Speech recognition is just a small part of this giant graph. I would like Voxforge to offer its prompts in SSML format so that other projects could import the prompts directly. There may be projects out there that focus on speech synthesis. These projects could use the prompts, too.
P.S.: I changed the title of my blog to “testing simon”. Obviously, the developers prefer “simon” over “Simon”.