Turn taking in human-computer dialogue

Turn-taking has had a brief moment in the press recently with the news that marmosets ‘take turns in conversation’. A blog post on the topic at National Geographic led me to a 2009 paper which examined turn-taking behaviour across 10 different languages [1]. It has long been thought that there are cross-cultural differences in how quickly people respond in conversation, and in [1] the mean response time to a yes/no question, averaged over the 10 languages, was reported as 200ms. That is, the second speaker responded, on average, 200ms after the first speaker had finished. The fastest average response time, 7ms, came from Japanese conversations, and the slowest, 470ms, from Danish. So, between languages, the overall variation in average response time is just under half a second. That may seem like a very short time, but it is long enough to fit in a three-syllable word.

Over at Language Log, Mark Liberman ran a similar analysis on the Switchboard corpus (a corpus of conversational telephone speech in English) and found similar response times. Intriguingly, response times were noticeably quicker in female-female conversations than in male-male ones.

Another interesting point is that many turns have a negative response time: the second person starts talking before the first has quite finished asking the question. This happens remarkably often, and is a conversational effect we rarely notice. Overlapping speech has been of interest in the speech technology community for some years now, as it can cause problems for speech recognition systems. One estimate for a meeting scenario is that as much as 12% of foreground speech overlaps with speech from a background speaker [2]. Some of this overlap comes from ‘backchannels’ – a second speaker saying ‘uh-huh’, ‘yeah’, ‘hmm’ and the like in the background – but a good portion comes from speakers overlapping at turn transitions.
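To make the sign convention concrete: response time is just the gap between one turn's end and the next turn's start, so overlap shows up as a negative value. A toy illustration (the timestamps are invented for the example):

```python
def response_time(prev_turn_end_ms, next_turn_start_ms):
    """Gap between two turns in milliseconds.

    A negative result means the second speaker started talking
    before the first speaker had finished (overlapping speech).
    """
    return next_turn_start_ms - prev_turn_end_ms

# Invented times: speaker A stops at 5200ms, speaker B starts at 5100ms.
print(response_time(5200, 5100))  # -100: B overlapped A by 100ms
```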

These statistics show that human turn-taking behaviour is complicated and nuanced. We use a variety of cues – semantic, linguistic, prosodic and visual – to decide when to respond in a conversation, and we are very good at minimising the amount of silence. Most of our human-computer dialogue systems are nowhere near as sophisticated. The typical approach to turn-taking in an artificial dialogue system is simply to wait until the user has paused for a set amount of time, say 500ms, before responding. There has been some work on improving computers’ response times by taking these different cues into account, but even these systems still have a minimum response time: [3], for example, only considers responding after 200ms.
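The fixed-pause approach can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: it assumes frame-level voice-activity decisions are already available, and the frame size, threshold and `is_speech` helper are all hypothetical.

```python
FRAME_MS = 10               # assumed duration of each audio frame
SILENCE_THRESHOLD_MS = 500  # respond after this much continuous silence

def end_of_turn(frames, is_speech):
    """Return the index of the frame at which a fixed-threshold
    endpointer would decide the user's turn has ended, or None
    if no endpoint is reached in the given frames.

    `frames` is any sequence of audio frames; `is_speech` is a
    hypothetical voice-activity detector returning True for
    frames containing speech.
    """
    silence_ms = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if is_speech(frame):
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Only count silence once the user has actually spoken,
            # so leading silence does not trigger a response.
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_THRESHOLD_MS:
                return i
    return None
```

With boolean frames as a stand-in for audio (`is_speech = lambda f: f`), 200ms of speech followed by silence would be endpointed 500ms after the last speech frame. Note that a detector like this can never respond faster than its threshold, which is exactly why such systems feel sluggish next to human 200ms response times.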

There are many reasons why human-computer dialogue can feel stilted and unnatural, and poor turn-taking is certainly one of them. However, as research progresses, computers will become better at predicting when they should respond to us and may, one day, be able to take cultural variation into account.

[1] T. Stivers et al., “Universals and cultural variation in turn-taking in conversation” (2009)

[2] O. Cetin and E. Shriberg, “Overlap in Meetings: ASR Effects and Analysis by Dialog Factors, Speakers, and Collection Site” (2006)

[3] A. Raux and M. Eskenazi, “Optimizing Endpointing Thresholds using Dialogue Features in a Spoken Dialogue System” (2008)
