Thanks to the fantastic institution that is the Big Scream, I got to see Her at the cinema while on maternity leave. There are spoilers below, so go and see the film before you read on!
At its heart, Her is about the developing relationship between Theo, a recent divorcee, and his intelligent operating system Samantha. The film is set in a not-too-distant future where technology is smart but small enough to recede into the background. People interact with their computers mainly using voice, though gesture is in there too, most notably for gaming. Emails are listened to via discreet earpieces and people are comfortable enough to talk naturally to their computers, as if they were talking to an old friend.
Samantha’s voice recognition skills are near perfect, far better than anything that exists now. Yet we forget just how bad humans can be at speech recognition sometimes. People mishear and mispronounce all the time, but we’re really good at combining all our knowledge to seamlessly recover a conversation gone wrong. Much of the time we don’t even realise that we’re doing it. Today’s speech recognition systems are worse than humans, though even at word error rates of around 20% (that’s 1 in 5 words wrongly transcribed) they are useful enough for applications like Siri and Google Now.
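To make the "1 in 5 words" figure concrete, here is a rough sketch of how a word error rate is usually worked out: count the fewest word substitutions, insertions and deletions needed to turn the recogniser's output into what was actually said, then divide by the number of words actually said. The function name and the example sentences are my own inventions for illustration, not taken from any particular system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one word out of five is misrecognised.
said = "read me my new emails"
heard = "read me my new females"
print(word_error_rate(said, heard))
```

With one wrong word out of five, this comes out at 0.2, which is the 20% mentioned above.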
The biggest difference between current dialogue systems and Samantha is the naturalness of the dialogue between her and Theo. Samantha is able to hit the correct emotional note, talk back at appropriate points with no (unintended) awkward silences, and keep track of complex conversations. These are things that our current state-of-the-art systems are not yet capable of.
Of these three active research areas, detecting and synthesising emotion is perhaps the most difficult to define. Identifying emotion is something that humans don’t even agree that well on, and collecting data for research purposes often means relying on acted emotion. Current research tends to focus on a small subset of easily identifiable emotions like happy, sad, angry and excited, ignoring many more nuances.
Our current spoken dialogue systems are also not great at knowing when they should speak, leading to unnatural conversations that don’t flow well. Typically, current systems have to wait for half a second or so after the other party has finished speaking, to be sure that they’re not going to say anything else. In contrast, humans are really good at jumping in, sometimes even before the previous speaker has finished, to minimise the total amount of silence in a conversation.
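The "wait for half a second" behaviour can be sketched in a few lines. This is a simplified illustration rather than how any particular product does it: I am assuming the audio has already been chopped into 10 ms frames and that some voice activity detector has labelled each frame as speech or silence. The function name and default values are made up for the example.

```python
from typing import Optional, Sequence


def end_of_turn_frame(
    is_speech: Sequence[bool],
    frame_ms: int = 10,     # assumed frame length
    silence_ms: int = 500,  # the "half a second or so" of trailing silence
) -> Optional[int]:
    """Return the index of the frame at which the system decides the
    speaker has finished, or None if it never becomes sure."""
    frames_needed = silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# Hypothetical utterance: 0.3 s of speech, a 0.2 s pause for breath,
# another 0.3 s of speech, then silence.
frames = [True] * 30 + [False] * 20 + [True] * 30 + [False] * 60
print(end_of_turn_frame(frames))
```

The short pause in the middle does not trigger a response, but the system then sits through a full half second of silence before it dares to reply. A human listener would usually have jumped in long before that.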
Furthermore, we’re only at the beginning of solving the problem of keeping track of a conversation. Most deployed dialogue systems use a really simple set of handwritten rules to decide how to respond to a person. Such rules can only capture a small subset of human behaviour and conversation topics, and it takes a huge amount of work to write down those rules. For computers to have realistic conversations, we need new models of dialogue that are easily extendable without human intervention. This is the focus of the work done in the dialogue systems group at Cambridge.
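To give a flavour of what a handwritten-rule system looks like, here is a deliberately tiny toy example. The rules, keywords and replies are all invented for illustration and are far simpler than anything actually deployed, but the basic limitation is the same: anything the rule writer did not anticipate falls through to a generic fallback.

```python
import re

# Each hypothetical rule pairs a set of trigger keywords with a canned reply.
RULES = [
    ({"email", "emails"}, "You have no new emails."),
    ({"weather"}, "It looks sunny today."),
    ({"hello", "hi"}, "Hello! How can I help?"),
]
FALLBACK = "Sorry, I didn't understand that."


def respond(utterance: str) -> str:
    """Return the reply of the first rule whose keywords appear in the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for keywords, reply in RULES:
        if words & keywords:
            return reply
    return FALLBACK


print(respond("Can you read my emails?"))
print(respond("How are you feeling today?"))
```

The first request matches the email rule. The second, which is exactly the kind of thing Theo might ask Samantha, matches nothing, so the system can only apologise. Covering it means a person has to sit down and write another rule.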
The film makes the point that the smart technology is far more advanced than we mere mortals can ever dream of being. At one point, Samantha drives this home by confessing that she’s talking to more than 8,000 people at the same time. As the story unfolds, Samantha becomes gradually more and more self-aware, eventually getting bored of Theo, until she (and all the other smart operating systems) leaves. In the end, this is another in a long line of sci-fi films to rely on the age-old idea of intelligent machines becoming self-aware and deciding to rise above us humans (though without the usual killer robots trying to wipe out humanity).