I posted previously about confidence intervals and why they matter for knowing how sure you can be of conclusions you infer from data. A somewhat related problem that crops up a lot in machine learning is that of data reliability and bias. This isn’t just about knowing how reliable your conclusions are, but about whether the data you’ve collected allows you to draw those conclusions legitimately in the first place.
For example, speech recognition systems are trained using large corpora of audio and text data. They often perform really well under test conditions in the lab, but performance plummets in the real world. This is normally down to a mismatch between the data used to build the system and the kind of data you get from real users. People in the real world talk in unexpectedly noisy or reverberant environments, using language that you didn’t anticipate, and these conditions aren’t covered by the training set.
The obvious solution is to collect more real-world data to train your models on, and to some extent this is now done for most speech recognition applications, but it can still be difficult to collect data that covers the full range of real-world conditions. In the related field of human-computer dialogue, there’s no easy answer to the question of how best to obtain data. The most common approach – paying people to talk to a machine – doesn’t necessarily capture the full range of behaviour that you’d like to model. People talk differently to a computer when it’s an artificial task. And it turns out that people are very bad both at following instructions and at realising when they haven’t followed them properly.
So, when designing a data collection for building statistical models, it’s important to examine the data and consider whether your collection method elicits the sort of behaviour that’s typical of the real world, whether it covers the full range of potential users, and whether any processing you do to clean up the data introduces further biases.
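One cheap sanity check along these lines is to compare how often each condition appears in your collected data against how often you expect it in deployment. Here’s a minimal sketch; the attribute (recording environment) and the expected shares are invented for illustration, not taken from any real collection:

```python
from collections import Counter

def coverage_report(samples, expected_share):
    """Compare the observed share of each category in a collected
    dataset against the share expected among real users.
    Returns observed-minus-expected gaps: positive means the
    category is over-represented in the collection."""
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for category, expected in expected_share.items():
        observed = counts.get(category, 0) / total
        gaps[category] = observed - expected
    return gaps

# Hypothetical numbers: a lab collection dominated by quiet-room
# audio, while real users also speak in noisy or reverberant settings.
collected = ["quiet"] * 90 + ["noisy"] * 8 + ["reverberant"] * 2
expected = {"quiet": 0.5, "noisy": 0.3, "reverberant": 0.2}
gaps = coverage_report(collected, expected)
for category, gap in sorted(gaps.items()):
    print(f"{category}: {gap:+.2f}")
```

A large positive gap (here, quiet-room audio is over-represented by 40 percentage points) is a warning that models trained on this collection may not generalise to the conditions your users actually talk in.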
In the case of a deployed speech recognition system, with measurable performance, it’s relatively easy to spot the degradation and uncover the bias in the training data. But you may not always be able to notice that your model is performing badly, let alone understand the cause. The problems get harder if you’re relying on data collected for another purpose, where you may be unaware of the hidden biases, and when you’re using ‘big data’, where there’s too much of it for you to examine manually. Some interesting examples of hidden bias in big data are discussed in this blog post from Kate Crawford.
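To make the “measurable performance” point concrete, here’s a rough sketch of how you might quantify that degradation for a speech recogniser: compute word error rate on a lab test set and on a sample of real-world transcripts, and look at the gap. The example utterances are made up for illustration:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

def degradation(lab_pairs, field_pairs):
    """Mean WER gap between lab and field (reference, hypothesis) pairs."""
    lab = sum(word_error_rate(r, h) for r, h in lab_pairs) / len(lab_pairs)
    field = sum(word_error_rate(r, h) for r, h in field_pairs) / len(field_pairs)
    return field - lab

# Toy data: perfect recognition in the lab, errors in the field.
lab = [("turn the lights on", "turn the lights on")]
field = [("turn the lights on", "turn a light on")]
print(f"WER gap: {degradation(lab, field):+.2f}")  # → +0.50
```

If that gap drifts upward over time, it’s a signal that real-world conditions have moved away from whatever your training and test sets cover; the harder cases in the paragraph above are precisely the ones where no such number is available.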