To train a speech recogniser that works well, we need a large amount of transcribed audio. That’s typically hundreds or thousands of hours of recordings of people speaking, along with the text of what they said. In the past, the transcription has been done by experts, which is slow and expensive but, on the whole, fairly accurate. Because of the cost, a lot of work has gone into using untranscribed or partially transcribed audio (e.g. audio with closed captions), and these unsupervised or lightly supervised approaches do work reasonably well.
Now, with the growth of the web, and especially with the introduction of services like Amazon’s Mechanical Turk, there’s a third option for transcribing audio – crowdsourcing. Mechanical Turk offers easy access to a large number of non-expert ‘workers’, who are willing to do small, simple tasks for small amounts of money, a model ideally suited to audio transcription.
The main issue with using crowdsourcing is that of quality control. Although the task of transcription is easy to understand, the workers are unsupervised and their main motivation is to complete the task as quickly as possible so they can be paid. This can mean bad transcriptions, especially when the worker doesn’t understand the context or vocabulary of the recording, the audio quality is poor or the speaker is stumbling over their words. You also get many genuine mistakes from carelessness, especially spelling mistakes or mistakes in proper nouns.
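One common way to catch bad transcriptions is to have several workers transcribe the same clip and look for the ones that disagree with everyone else. The sketch below is only an illustration of that idea, not a description of any particular system: the function names, the lower-casing normalisation and the 0.5 threshold are all assumptions made for the example. It scores each worker by their mean word-level edit distance to the other workers’ transcripts of the same recording.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_disagreement(transcripts):
    """Map each worker to their mean normalised distance from the others.

    `transcripts` maps a (hypothetical) worker ID to that worker's text
    for one clip. Distances are divided by the longer transcript's length,
    so each score lies between 0 (identical) and 1 (nothing in common).
    """
    # Assumed normalisation: lower-case and split on whitespace only.
    words = {worker: text.lower().split() for worker, text in transcripts.items()}
    scores = {}
    for worker, mine in words.items():
        dists = [
            word_edit_distance(mine, theirs) / max(len(mine), len(theirs), 1)
            for other, theirs in words.items()
            if other != worker
        ]
        scores[worker] = sum(dists) / len(dists) if dists else 0.0
    return scores


def flag_outliers(transcripts, threshold=0.5):
    """Return workers whose transcript disagrees strongly with the rest.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    scores = mean_disagreement(transcripts)
    return sorted(worker for worker, score in scores.items() if score > threshold)


if __name__ == "__main__":
    clip = {
        "w1": "find me a hotel near the station",
        "w2": "find me a hotel near the staton",   # spelling mistake
        "w3": "Find me a hotel near the station",
        "w4": "asdf",                               # careless or spam answer
    }
    print(flag_outliers(clip))
```

A check like this needs at least three or four transcripts per clip to be meaningful, which multiplies the cost, and it won’t help when most workers make the same mistake, for example on an unfamiliar proper noun.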
Aside from transcription, there are other tasks that crowdsourcing is good for in speech technology. These include evaluating synthetic voices and having real people talk to a machine to elicit examples of dialogue. However, quality control is still an issue, and there’s the additional problem that workers might behave unnaturally simply by virtue of the fact they’ve been given an artificial task to complete. Someone who is actually in a strange city talking to a dialogue system because they need to find a hotel will probably have quite a different dialogue from someone who is sitting comfortably at home pretending they’re in a strange city trying to find a hotel.
Even so, crowdsourcing is proving to be a valuable resource in speech technology, where the tasks are normally simple and easy to understand and the cost of obtaining better-quality data is high.