I’m currently working on the Vaani project at Mozilla, and part of that work lets me explore speech recognition and speech assistants. After looking at some of the commercial offerings available, I thought that if we were going to do some kind of add-on API, we’d be best off aping the Amazon Alexa skills JS API. Amazon Echo appears to be doing quite well and people have written a number of skills with its API. There isn’t really an alternative right now, but I happen to think their API is well thought out and concise, and maps well to the sort of data structures you need to do reliable speech recognition.
So, skipping forward a bit, I decided to prototype with Node.js and some existing open source projects to implement an offline version of the Alexa skills JS API. Today it’s gotten to the point where it’s actually usable (for certain values of usable), and I’ve just spent the last 5 minutes asking it to tell me knock-knock jokes, so rather than waste any more time on that, I thought I’d write this about it instead. If you want to try it out, check out this repository and run npm install in the usual way. You’ll need pocketsphinx installed for that to succeed (install sphinxbase and pocketsphinx from GitHub), and you’ll need espeak installed and some skills for it to do anything interesting, so check out the Alexa sample skills and sym-link the ‘samples’ directory as a directory called ‘skills’ in your ferris checkout directory. After that, just run the included example file with node and talk to it via your default recording device (hint: say ‘launch wise guy’).
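To give a flavour of what these skills look like, here’s a rough sketch in the style of that API. The launch/intent handler split mirrors the Alexa skills JS API, but the handler names, joke content and module layout are my own illustrations, not taken from the samples:

```javascript
// A sketch of an Alexa-style skill. The onLaunch/intentHandlers split
// mirrors the skills JS API; everything else here is illustrative.
'use strict';

var skill = {
  // Invoked when the user says something like 'launch wise guy'.
  onLaunch: function (request, response) {
    response.ask('Knock knock!');
  },

  // Each named intent in the skill's interaction model gets a handler.
  intentHandlers: {
    WhosThereIntent: function (request, response) {
      response.ask('Banana.');
    }
  }
};

// A tiny stand-in for the response object the framework would supply;
// ask() is where a real implementation would synthesise speech and listen.
function makeResponse() {
  return {
    spoken: null,
    ask: function (text) { this.spoken = text; }
  };
}

var response = makeResponse();
skill.onLaunch({}, response);
console.log(response.spoken); // Knock knock!
```

The nice property of this shape is that the set of intents and their utterances is known up front, which is exactly what a grammar-based recogniser wants.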
Hopefully someone else finds this useful – I’ll be using this as a base to prototype further voice experiments, and I’ll likely be extending the Alexa API further in non-standard ways. What was quite neat about all this was just how easy it all was. The Alexa API is extremely well documented, Node.js is also extremely well documented and just as easy to use, and there are tons of libraries (of varying quality…) to do what you need to do. The only real stumbling block was pocketsphinx’s lack of documentation (there’s no documentation at all for the Node bindings and the C API documentation is pretty sparse, to say the least), but thankfully other members of my team are much more familiar with this codebase than I am and I could lean on them for support.
I’m reasonably impressed with the state of lightweight open source voice recognition. This is easily good enough to be useful if you can limit the scope of what you need to recognise, and I find the Alexa API is a great way of doing that. I’d be interested to know how close the internal implementation is to how I’ve gone about it if anyone has that insider knowledge.
15 Replies to “Open Source Speech Recognition”
IIRC some of the Amazon engineers work for, or have contributed code to, the Mycroft project. Maybe look there.
I highly recommend you take a look at Kaldi (http://kaldi-asr.org) and the Kaldi GStreamer server for the backend ASR.
We’ve been working on on-device ASR (based on Kaldi). An early release is available for iOS: http://keenresearch.com/kaldi-ios-framework. Next week we will provide a new release that will allow developers to create decoding graphs based on bigram language models or on a list of sentences users are expected to say.
We have been looking at and using Kaldi, but from what I understood, it’s too resource-intensive to use on a local device… But I guess if you’re using it on iOS, that isn’t quite accurate? I’ve been told various services use it server-side?
It depends on what the task is, i.e. what you are trying to recognize (the size of the language model). For large vocabulary with trigram LMs it’s definitely resource-intensive and not (yet) feasible on a local device… but for smaller tasks with a few hundred to a couple of thousand words, it’s definitely possible.
As far as I know, you’re right: most people use it server-side, for large-vocabulary recognition.
To be a bit more precise, by local device I mean an iPhone 6 or equivalent.
I got set up, but I get a bit of an error when I try to run the example file. See below:
I’ve not tested the Minecraft Helper skill; it looks like these errors would come after launching it – is there any output missing from this log? There’s no mention of local files in that backtrace either… Do you have espeak installed and in your path? If not, you can change the speech binary by setting the speechCommand property: https://gitlab.com/Cwiiis/ferris/blob/master/index.js#L146 – note that it expects whatever command you give it to be able to handle SSML.
Had the same issue on OSX.
brew install sox fixed it for me.
Hope that helps,
How does this compare to OpenEars? Why not fork the latter?
I am truly interested in *onboard* speech recognition in phones.
Is it possible to download a set of “likely words and phrases” to bias the recognition results and improve accuracy, or is it all freeform only?
If you’re interested, my company Qbix would be happy to help test this in our apps with a few million users.
No idea how it compares to OpenEars, but from the OpenEars site: “OpenEars works on the iPhone, iPod and iPad and uses the open source CMU Sphinx project” – so I guess OpenEars is just a repackaging of pocketsphinx with Objective-C bindings anyway. That set of “likely words and phrases” is exactly the grammar that gets generated – sphinx will only return results that conform to the set grammar (you can do freeform recognition, but I found the results were practically unusable). What I can’t currently do is weight these sentences – sphinx has that ability, but Alexa skills aren’t annotated with that information. I may add an option for it as an extension.
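To make that concrete: an Alexa skill’s interaction model is an intent schema plus a closed list of sample utterances per intent, and it’s that closed list that can be turned into a recognition grammar. A sketch, with intent names and utterances invented for the example:

```javascript
// Illustration of why a skill's interaction model constrains recognition:
// the recogniser only ever has to pick between a fixed set of utterances.
// The intent names and utterances here are made up for the example.
'use strict';

var sampleUtterances = {
  TellJokeIntent: [
    'tell me a joke',
    'tell me a knock knock joke'
  ],
  StopIntent: [
    'stop',
    'be quiet'
  ]
};

// With a grammar-constrained recogniser, the transcript is guaranteed to
// be one of the sample utterances, so mapping it to an intent is a lookup.
function intentForTranscript(transcript) {
  for (var intent in sampleUtterances) {
    if (sampleUtterances[intent].indexOf(transcript) !== -1) {
      return intent;
    }
  }
  return null; // free-form recognition can land here; a grammar can't
}

console.log(intentForTranscript('tell me a joke')); // TellJokeIntent
```

Per-utterance weights would slot naturally into a structure like this, which is roughly the non-standard extension I have in mind.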
Feel free to use this if it’s helpful to you – I’ve set the licence as GPL-3.0. I’m not sure what I’d specifically want testing, but I’m sure the projects I use here could all do with help (I’d love to see pocketsphinx improved, and other members of my team are working on that). I’ve been developing this with a mind for it to work offline on a reasonably resource-constrained device (think Raspberry Pi); I’ll be testing that in the coming weeks.
Neat! Have you looked at https://www.houndify.com/ ? Folks who would know better than I have told me Houndify’s language work is a bit smarter but I’m not sure how different their pattern is. Also, I’ve read rumors that Apple will be opening up Siri to services rather than just apps in a few months.
I hadn’t, thanks for the link. I’m more interested in offline solutions, and preferably open source ones… Not to say it’s a solved problem, but if you’re willing to use a third-party service and you’re willing to pay, you can get pretty decent free-form voice recognition. Some people on the team have looked into how Android’s offline voice recognition works; apparently it’s pretty good (and may be open source too?)
Is it possible to make the installation steps clearer?
I have installed sphinxbase, pocketsphinx and node. Then I cloned the ferris project and ran the
npm install command, but I am getting the following error:
I’m afraid building and installing software is too complex and variable a task for me to be much clearer than I have been without this post being much longer and mostly being about building software. That said, your error is right at the top:
/bin/sh: pkg-config: command not found – pkg-config isn’t in your path, so it can’t find your installed pocketsphinx. Once you can execute pkg-config and it can find pocketsphinx, npm install should work just fine.
Got it. Thanks.