I’m working on implementing speech recognition in my call center. I am using the Microsoft Speech Platform, and I want to replace my DTMF recognition with speech recognition (for example, “Say the department you are trying to reach” instead of “press one for sales”).
I have the SpeechRecognitionEngine working perfectly to my specifications, with one exception. While recognizing spontaneous speech I must account for disfluencies (‘uh’, ‘um’, ‘er’, ‘you know’, ‘like’). My question is, are there any methods within the .NET framework that allow the recognition engine to bypass these utterances and continue searching for actual speech?
If there aren’t any pre-supplied methods, how would you go about bypassing these disfluencies? I suspect the answer may lie in how I construct my grammar, but any insight would be greatly appreciated.
Thanks!
The way to handle this is in your grammars. You need to add these “disfluencies” to the rules in your grammars. That is where the tuning comes in for speech recognition. You need to look at all of the unrecognized phrases in your application and listen to the audio recordings to figure out what users are saying that is “out of grammar”, and then add those phrases. For example, suppose you ask the user, “What would you like to eat, a pizza or a hamburger?” If your grammar is only set up to handle “pizza” or “hamburger” and the user responds “um pizza”, then recognition will fail as out of grammar. You need to add “um” to the rules in such a way that it is optional. If you are using XML grammars it may look something like this:
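A minimal sketch of such a grammar, assuming the W3C SRGS XML format that the Microsoft Speech Platform loads. The rule names "food" and "filler" and the list of fillers are made up for illustration, so tune them against your own recordings:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="food"
         xmlns="http://www.w3.org/2001/06/grammar">

  <!-- Top-level rule: an optional filler word, then the actual choice -->
  <rule id="food" scope="public">
    <!-- repeat="0-1" makes the filler optional -->
    <item repeat="0-1">
      <ruleref uri="#filler"/>
    </item>
    <one-of>
      <item>pizza</item>
      <item>hamburger</item>
    </one-of>
  </rule>

  <!-- Hypothetical list of disfluencies heard in out-of-grammar audio -->
  <rule id="filler">
    <one-of>
      <item>um</item>
      <item>uh</item>
      <item>er</item>
    </one-of>
  </rule>

</grammar>
```

With this grammar both “pizza” and “um pizza” are in grammar. Keeping the fillers in their own rule means you can reference the same rule from every prompt's grammar instead of repeating the list.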
If you do not want to include the disfluencies in the return values, you would use tags to return the semantic interpretation. How you include this semantic interpretation can vary from platform to platform, but here is one example:
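As a rough sketch, assuming the W3C “semantics/1.0” tag format (your platform may expect a different tag-format value or tag syntax), the same hypothetical "food" and "filler" rules could return only the food item:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="food"
         tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="food" scope="public">
    <!-- Optional filler: matched, but it contributes nothing to the result -->
    <item repeat="0-1">
      <ruleref uri="#filler"/>
    </item>
    <one-of>
      <!-- Each tag sets the rule's semantic value ("out") -->
      <item>pizza <tag>out = "pizza";</tag></item>
      <item>hamburger <tag>out = "hamburger";</tag></item>
    </one-of>
  </rule>

  <!-- Hypothetical filler list; it has no tags, so it never sets "out" -->
  <rule id="filler">
    <one-of>
      <item>um</item>
      <item>uh</item>
      <item>er</item>
    </one-of>
  </rule>

</grammar>
```

The recognized text would still be “um pizza”, but the semantic value your application reads from the recognition result would be just “pizza”.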
Microsoft's documentation also has a discussion of semantic interpretation.