FEATURE 19 May 2014

Data basics


Fancy a career in data science? Former Google engineer Sean Owen reveals what you need to succeed in this century’s most desirable vocation. Interview by Brian Tarran.

It was 2012 when the Harvard Business Review declared data scientist to be “the sexiest job of the 21st century”. Thomas H Davenport and DJ Patil defined the data scientist as a new breed of “high-ranking professional with the training and curiosity to make discoveries in the world of big data”.

Demand for this type of role was set to increase as the amount of data that companies had to deal with grew ever larger. But experts in the field still talk of a looming shortage of people with the necessary skills to do the work. The jobs are out there, and they pay well – the Silicon Valley starting rate for an entry-level position is $110,000 (£68,000). So why should there be a shortage of candidates?

“Most of the work is just a lot of engineering and a little statistics – and it doesn’t feel sexy”

Impact Magazine sat down with Sean Owen, Cloudera’s director of data science, and a former Google software engineer, to find out more about the demands of the job, in an attempt to solve the mystery.

So, Sean – how does it feel to be doing the “sexiest job of the 21st century”?
Here I’ll refer you to my counterpart in the US, Josh Wills, who said recently: “I’m a data janitor. That’s the sexiest job of the 21st century? It’s very flattering, but it’s also a little baffling.” Most of the work is just a lot of engineering and a little statistics – and it doesn’t feel sexy.

I suppose it has captured the imagination a bit because of ‘celebrity’ statisticians like Nate Silver. And maybe it is interesting since the output of engineering plus statistics is more relatable than either of those two things independently.

And the job results – occasionally – in real and intriguing insights that can be explained at a cocktail party in a way that the computing and the maths just can’t. So that feels good.

What does a data scientist do all day?
This depends a lot on the organisation but, in my view, from least to most time, it’s modelling, communication and integration.

It always involves modelling of some kind – turning data into an explanation that gives insight into why the data is what it is, or that helps predict future data. Most modelling is about knowing what data answers the question, and how to answer it in a valid way.

Then, a big part of the job is about communicating to the business what can be answered from data; what data would be good to collect; and what it means.

But, honestly, most of the work is connecting the wires. Let’s say that, in theory, you have customer behaviour data and want to predict purchase intent. In practice what you have are log files in four formats from six systems, some incomplete, with noise and errors. These have to be copied, translated and unified. And, once the model predicts purchase intent, that result has to be output, exported and integrated into some business system.
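To make the “connecting the wires” step concrete, here is a minimal sketch in Python of unifying two log formats into one record shape. Everything specific in it is an assumption for illustration: the field names (`user`, `event`, `ts`), the JSON keys (`uid`, `action`, `time`), the sample log lines and the helper names are hypothetical, not taken from any real system described in the interview.

```python
import csv
import io
import json

# Hypothetical unified schema: every record ends up with these three fields.
FIELDS = ("user", "event", "ts")


def parse_csv(text):
    """Parse CSV logs that have a header row of user,event,ts (assumed)."""
    for row in csv.DictReader(io.StringIO(text)):
        # Missing columns come back as None; normalise them to "".
        yield {k: (row.get(k) or "").strip() for k in FIELDS}


def parse_jsonl(text):
    """Parse JSON-lines logs whose keys are uid/action/time (assumed)."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            continue  # noise: skip lines that are not valid JSON
        if not isinstance(raw, dict):
            continue  # valid JSON, but not a record
        yield {
            "user": str(raw.get("uid") or "").strip(),
            "event": str(raw.get("action") or "").strip(),
            "ts": str(raw.get("time") or "").strip(),
        }


def unify(sources):
    """Merge (parser, text) pairs, drop incomplete records, sort by time."""
    records = []
    for parser, text in sources:
        for rec in parser(text):
            if all(rec[k] for k in FIELDS):
                records.append(rec)
    # ISO 8601 timestamps in one format sort correctly as plain strings.
    return sorted(records, key=lambda r: r["ts"])


# Made-up sample data: one incomplete CSV row and one corrupt JSON line.
CSV_LOG = (
    "user,event,ts\n"
    "alice,view,2014-05-01T10:00:00\n"
    "bob,,2014-05-01T10:05:00\n"
)
JSON_LOG = (
    '{"uid": "carol", "action": "purchase", "time": "2014-05-01T09:30:00"}\n'
    "not json at all\n"
)

unified = unify([(parse_csv, CSV_LOG), (parse_jsonl, JSON_LOG)])
for rec in unified:
    print(rec)
```

In this sketch the incomplete row and the corrupt line are simply dropped; a real pipeline would more likely count or quarantine them, since how much data is discarded is itself something the business needs to know.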

The highlights and lowlights of a career in data science

What’s the best thing about being a data scientist?
The problems are real-world and interesting, and different every week. I suppose people in this role get treated a bit like the goose that lays golden eggs: nobody quite knows how it works, but everyone thinks it’s interesting and profitable and wants one around. That goose has a great life.

And what’s the worst thing?
Expectations about how well data can be used to predict answers to arbitrary questions. Sometimes it takes a lot of frustrating trial and error to succeed. The bad news is that it isn’t magic and lots of problems are hard to solve. It’s difficult to have to rein in optimism about the possibilities. The goose can’t lay that golden egg every time, and disappoints at its peril.

How did you end up in this line of work?
Two paths converged. I worked at Google from 2004 with the big data tools we recognise today – things like MapReduce – but from well before they existed outside Google in the form of Hadoop.

Separately, beginning in 2005, I operated an open-source machine learning project. These two things merged from 2008 or thereabouts into the Apache Mahout project, which is focused on developing machine learning libraries on Hadoop.

From 2012, I switched to work on a successor project/company called Myrrix to further advance real-time learning on Hadoop, and Myrrix is now part of Cloudera.

What education and skills do you need to succeed in the data science role?
I think many people took linear algebra and had no idea why matrices, determinants and conjugate transposes mattered at all. This is why.

“I think we’re seeing engineers re-train as statisticians to fill the gap much more than we are seeing statisticians become engineers”

It’s frequently claimed that there is a shortage of data scientists, given the expected growth in demand for this type of role. Why is there a shortage, and how can we plug the gap?
Some of it is hype. One might be sceptical here, since data science is, in part, statistics, and statistics is not new – and we’ve always had more data than we can comfortably deal with. But there are at least two real factors that have opened up a demand/supply gap.

First is that the value of data science has shot up because the cost has dropped. The internet/mobile revolution has increased the potential supply of data over the past 15 years far faster than storage and processing have gotten cheaper. The data, and the value, have been there – but they have been too expensive to use until now.

The cost has recently dropped for mainstream companies, with the emergence of distributed computing platforms that can coordinate many cheap machines. And it’s only in the last one to two years that the industry has started to make proper analysis tools on new platforms such as Hadoop.

So, suddenly it makes economic sense to try to extract value from all this data out there. Hence demand for the skills.

The supply has lagged because the skill set is different. When NoSQL databases became popular, there was also a skills shortage that rapidly closed because – honestly – the way you run, administer and query these things is still quite familiar to anyone that has used a classic relational database. Skills translate fast.

This is not quite the case with machine learning, which is an important element of data science. It’s not purely an engineering question, but needs some applied maths background.

The skills gap will close; it will just take a little longer. And I think we’re seeing engineers re-train as statisticians to fill the gap much more than we are seeing statisticians become engineers.

Let’s say I’m already working as a data analyst – would it be fairly easy for me to become a data scientist?
Data analyst means a lot of things. To me, it’s usually someone who is familiar with the tools of the trade, like SAS or SPSS, and who has domain knowledge and the ability to communicate with the business.

The conventional analyst is probably mostly missing the engineering skills. It becomes necessary to know how to write scripts and code in languages like Python and Java, and to develop software systems, rather than just use modelling tools. Unfortunately, getting up to speed on the engineering takes a lot of work. It’s not hard to acquire these skills, but it does generally take years of real-world experience.

But, if I am seriously interested in making a career in this field, what are the top three things I need to do to start down the data science path?
You probably need to fill a gap in your stats/applied maths/machine learning knowledge. Turn to and self-study. Everyone recommends Andrew Ng’s machine learning course, and it is a great intro – although you should also look into linear algebra and stats.

I also highly recommend Yaser Abu-Mostafa’s Learning from Data as a more rigorous (read: harder) intro to machine learning.

You probably have a small skills gap – the roles you’re interested in probably use tools you haven’t used before. Maybe it’s R, or Sqoop, Hive or Impala in the Hadoop world – so buy the book, get sent to training; whatever you can do to start acquiring this experience.

Finally, you really just have to do the work, and it need not be in the context of a job. Go and sign up for a competition, and figure out how to solve one of the introductory problems with, say, random decision forests on scikit-learn. Just getting that far will mean you’ve learned a great deal.

The data science skill set: Things Sean Owen would want to see on a budding data scientist’s CV

Computer science coursework
Stats and linear algebra
Some machine learning coursework, covering at least:

  • regression
  • classification
  • clustering
  • recommendation
  • graphical models

Data collection tools
Hadoop-based tools like Flume/Sqoop
Text munging languages such as Python, or maybe Perl
Basic SQL

Data modelling tools
A library like scipy/numpy or Weka
A tool such as R (or commercial equivalents like SAS, SPSS)

Model-serving tools
Familiarity with Predictive Model Markup Language
Basic knowledge of a NoSQL store
Systems language skill – Java, for example

Business smarts
Communication skills
Some facility with a visualisation tool, even if gnuplot or Excel
Domain knowledge relevant to business sector/vertical