Working remotely

Recently I started a new job, and it’s taken a while to get to grips with the new ways of working that come with switching role. The team I now belong to is a global one, stretched across 6 locations and 4 timezones. Of those, I’m the only person in my timezone, sharing an office with a completely separate team. This setup has its own challenges, over and above those of starting in a new role, and has made me think about how to work best with colleagues in different timezones and locations.

So, my top tips for working remotely:

  • Visit in person as soon as you can; it’s much easier to work with someone if you’ve shared a coffee with them. I’m lucky enough to have met a handful of my new colleagues at past conferences, but making the effort to travel and meet some other colleagues has definitely helped ease the transition.
  • Video chat: again, it’s the face-to-face contact that helps. A good video conference system means you can start to put faces to names. Also, video makes it much easier to work out who’s talking, compared to audio-only!
  • Find a group text chat system that works nicely in the background, and use it! Turn off the notifications though, as there’s nothing worse than a system that beeps at you all the time while you’re trying to concentrate on another task.
  • Reply to email quickly. As Eric Schmidt points out, being unresponsive means that people assume the worst. This is amplified when you’re not there in person.
  • Don’t worry about asking silly questions. Chances are that if you’re confused by something, someone else on your team is too. And asking questions of your colleagues can create an atmosphere where others are unafraid to ask them too, which is beneficial for everyone.
  • Finally, don’t forget the small talk! Working remotely means you don’t run into your colleagues in the kitchen or on the stairs, but it’s still nice to make time and find out what else is going on in your colleagues’ lives.
Posted in Career choices

Visualising waveforms with Python and Bokeh

I’ve recently been playing with the Bokeh Python library for visualisation. One thing I end up doing more often than I should is drawing waveforms for talks and presentations. Turns out that Bokeh is great for this!

Here’s a long waveform:


And a shorter segment of it:



I’ve given up trying to format code properly in WordPress, so it’s on GitHub – you need to supply your own wav file.
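For anyone who just wants the gist, here is a minimal sketch of the idea. It is not the code from GitHub, only an illustration under some assumptions: the wav file is 16-bit PCM, and the function names (read_wav_mono, segment, plot_waveform) are made up for this example. Reading the audio uses only the standard library; Bokeh is imported inside the plotting function.

```python
import array
import sys
import wave


def read_wav_mono(path):
    """Return (samples, sample_rate) for a 16-bit PCM wav file.

    For multi-channel audio only the first channel is kept.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # wav data is stored little-endian
    return list(samples[::channels]), rate


def segment(samples, rate, start_s, end_s):
    """Return the samples between start_s and end_s seconds."""
    return samples[int(start_s * rate):int(end_s * rate)]


def plot_waveform(samples, rate, title="Waveform"):
    """Draw amplitude against time with Bokeh and open the plot in a browser."""
    # Imported here so the helpers above still work without Bokeh installed.
    from bokeh.plotting import figure, show

    times = [i / rate for i in range(len(samples))]
    p = figure(title=title, x_axis_label="Time (s)", y_axis_label="Amplitude")
    p.line(times, samples)
    show(p)
```

With a file of your own (say a hypothetical speech.wav), you would call read_wav_mono("speech.wav") to get the samples and rate, pass them to plot_waveform for the long view, and pass the result of segment(samples, rate, 1.0, 1.5) for a shorter slice. Note that the time axis of a segment plotted this way restarts at zero.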

Posted in Uncategorized

Machine learning in practice

In the last week of my old job, I saw a talk from one of Facebook’s engineers about how they use machine learning in practice. His talk boiled down to four points:

  • More data, better quality data: spend time collecting and cleaning data
  • Practice != theory: simple models often work better in practice, as more sophisticated ones may be too slow
  • Efficiency is key: getting something to work in real time with lots of data is hard
  • 99% Perspiration: actually running the classifier is a tiny fraction of the time


Posted in Machine Learning

Language Models

Language models assign probabilities to sequences of words. They have many applications, including machine translation, smartphone typing and information retrieval, though I’m familiar with them through speech recognition.

For many years, the probabilities of n-grams – that’s words or short sequences of words – have been estimated by counting occurrences in a body of text.
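To make the counting idea concrete, here is a small sketch for bigrams (pairs of words). The probability of a word given the previous word is estimated as the number of times the pair occurs, divided by the number of times the previous word occurs as a history. The toy corpus and the sentence markers `<s>` and `</s>` are assumptions made for illustration, and real systems add smoothing so that unseen pairs don’t get zero probability.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(word | previous word) by relative frequency."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Mark sentence boundaries so the first and last words have context.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        (prev, word): count / history_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
probs = bigram_probabilities(toy_corpus)
# "the" occurs three times and is followed by "cat" in two of them.
print(probs[("the", "cat")])
```

Any pair that never appears in the corpus is simply missing from the dictionary, which is exactly the data sparsity problem that makes the choice of training text so important.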



One of the key problems for speech recognition is obtaining text that represents the way we speak. The web and other archived resources contain a large amount of written text, but the probabilities estimated from these do not match the way people actually speak: ungrammatically, with hesitations, corrections, ums, ahs and ers. It is much more expensive and labour intensive to obtain transcribed spontaneous speech.

More recently, neural network models have had some success in language modelling, and there’s a publicly released toolkit available. The amount of data available for language modelling has also increased, and Google have recently released a 1 billion word language model project.

Posted in Uncategorized

Busy busy busy!

Back to work after maternity leave doesn’t leave me much time to keep the blog up to date! But, I’ve also been busy on a couple of other articles.

The first, over at Statistics Views, is an introduction to the role of statistics in speech recognition.

The second, over at the Software Sustainability Institute, is about my latest project – Cambridge Women and Tech – as part of their blog about women in technology.

I also took the baby along to give a talk at Women in Data, in London!

Posted in Uncategorized

Her: fact vs. fiction

Thanks to the fantastic institution that is the Big Scream, I got to see Her at the cinema while on maternity leave. There are spoilers below, so go and see the film before you read on!

At its heart, Her is about the developing relationship between Theo, a recent divorcee, and his intelligent operating system Samantha. The film is set in a not-too-distant future where technology is smart but small enough to recede into the background. People interact with their computers mainly using voice, though gesture is in there too, most notably for gaming. Emails are listened to via discreet earpieces and people are comfortable enough to talk naturally to their computers, as if they were talking to an old friend.

Samantha’s voice recognition skills are near perfect, far better than anything that exists now. Yet we forget just how bad humans can be at speech recognition sometimes. People mishear and mispronounce all the time, but we’re really good at combining all our knowledge to seamlessly recover a conversation gone wrong. Much of the time we don’t even realise that we’re doing it. Today’s speech recognition systems are worse than humans, though operating at error rates of around 20% (that’s 1 in 5 words wrongly transcribed) is good enough for applications like Siri and Google Now.
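As an aside on what an error rate like that means: the usual measure is word error rate, the smallest number of word substitutions, deletions and insertions needed to turn the recogniser’s output into the correct transcript, divided by the number of words in the correct transcript. Here is a minimal sketch of that calculation; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "call my sister this evening"
hypothesis = "call my sister this morning"
print(word_error_rate(reference, hypothesis))  # one wrong word out of five: 0.2
```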

The biggest difference between current dialogue systems and Samantha is the naturalness of the dialogue between her and Theo. Samantha is able to hit the correct emotional note, talk back at appropriate points with no (unintended) awkward silences, and to keep track of complex conversations. These are things that our current state-of-the-art systems are not yet capable of.

Of these three active research areas, detecting and synthesising emotion is perhaps the most difficult to define. Identifying emotion is something that humans don’t even agree that well on, and collecting data for research purposes often means relying on acted emotion. Current research tends to focus on a small subset of easily identifiable emotions like happy, sad, angry and excited, ignoring many more nuances.

Our current spoken dialogue systems are also not great at knowing when they should speak, leading to unnatural conversations that don’t flow well. Typically, current systems have to wait for half a second or so after the other party has finished speaking, to be sure that they’re not going to say anything else. In contrast, humans are really good at jumping in, sometimes even before the previous speaker has finished, to minimise the total amount of silence in a conversation.
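The waiting strategy described above can be sketched in a few lines. This is only an illustration, not how any particular system works: it assumes a voice activity detector has already labelled each 10 ms frame of audio as speech or not, and the function name, frame length and timeout are all hypothetical choices.

```python
def find_end_of_turn(is_speech_frames, frame_s=0.01, timeout_s=0.5):
    """Return the index of the frame where the turn is judged to be over.

    The turn ends once the speaker has said something and has then been
    silent for timeout_s seconds. Returns None if that never happens.
    """
    frames_needed = round(timeout_s / frame_s)
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(is_speech_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


# 0.2 s of silence, 1 s of speech, a 0.3 s pause, 0.5 s of speech, then silence.
frames = [False] * 20 + [True] * 100 + [False] * 30 + [True] * 50 + [False] * 80
print(find_end_of_turn(frames))
```

The 0.3 second pause in the middle is too short to trigger the timeout, so the system only decides the speaker has finished half a second after the last word. That built-in delay is what makes the conversation feel sluggish compared to a human listener.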

Furthermore, we’re only at the beginning of solving the problem of keeping track of a conversation. Most deployed dialogue systems use a really simple set of handwritten rules to decide how to respond to a person. Such rules can only capture a small subset of human behaviour and conversation topics, and it takes a huge amount of work to write down those rules. For computers to have realistic conversations, we need new models of dialogue that are easily extendable without human intervention. This is the focus of the work done in the dialogue systems group at Cambridge.

The film makes the point that the smart technology is far more advanced than us mere mortals can ever dream of being. At one point, Samantha drives this home by confessing that she’s talking to more than 8,000 people at the same time. As the story unfolds, Samantha gradually becomes more and more self-aware, eventually getting bored of Theo, until she (and all the other smart operating systems) leaves. In the end, this is another in a long line of sci-fi films to rely on the age-old idea of intelligent machines becoming self-aware and deciding to rise above us humans (though without the usual killer robots trying to wipe out humanity).

Posted in Technology

Getting started: data science with Python

The purpose of this post is to collect together online resources for anyone who wants to learn how to do machine learning (data science) in Python, starting from scratch. Some of these sites I’ve used, and others I’ve only glanced at, but I hope they let you get started no matter what your level. I’ll add new stuff as I come across it, but let me know if you have any useful resources to add!

If you’re new to programming, the first step is to get started! Codecademy has an introduction to Python tutorial which will get you started with some basic concepts. Google’s tutorial is a bit more advanced, but should be do-able once you have an understanding of variables, conditionals and loops.

Now you have a basic understanding of Python, install some of the libraries that are useful – numpy, scipy, pandas and scikit-learn. A great place to get a set of useful libraries is Anaconda.

With the tools in place, the best thing to do is dive in. A great place to start is Kaggle. They have some tutorial tasks to get started on, including one from Data Science London. This is a binary supervised classification task so you’ll want to read up on how that works, but it’s essentially about deciding whether a data example is from one class or another. ‘Class’ in this context can be things like:

  • Is an email spam or not?
  • Is a credit card transaction fraudulent or not?
  • Does some audio contain speech or noise?

You can use scikit-learn to get started without knowing too much about what’s going on under the hood. Perhaps the most important thing to get to grips with is the use of training/dev/test data, cross-validation and generalisation. But if you want to really get a good understanding, then Coursera’s Machine Learning course covers a lot of the basics of machine learning, with some practical tasks to complete.
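To show what the training/dev/test idea looks like, here is a small sketch using only the standard library. The function names, the 80/10/10 proportions and the stand-in data are assumptions for illustration. In practice you would more likely reach for scikit-learn’s train_test_split and cross_val_score in sklearn.model_selection, which do the same job.

```python
import random


def split_train_dev_test(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the examples and split them into train, dev and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


def k_fold_splits(examples, k=5):
    """Yield (train, held_out) pairs so each example is held out exactly once."""
    examples = list(examples)
    for fold in range(k):
        held_out = examples[fold::k]
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        yield train, held_out


data = list(range(100))  # stand-in for 100 labelled examples
train, dev, test = split_train_dev_test(data)
print(len(train), len(dev), len(test))  # 80 10 10
```

The point of the split is generalisation: you fit the classifier on the training set, tune it on the dev set, and only look at the test set at the very end, so the final score reflects data the model has never seen. Cross-validation repeats the train and held-out step k times, which gives a steadier estimate when data is scarce.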

Finally, be aware of common pitfalls.

If you can build a classifier to work on the Data Science London Kaggle challenge, and understand how it works, then you’re well on the way to learning about more advanced stuff. But that’s a topic for another post!

Posted in Machine Learning, Technology