Data Science

Date posted
23 February 2015
Thomas Swann

In this post, I'm turning my attention to the often misunderstood term "Data Science" and what it actually means to businesses today in a practical sense.
I would argue that the lack of a clear understanding of the business applications of advanced analytical techniques is one of the biggest obstacles to their widespread adoption. From a delivery perspective, data science teams need to arm themselves with the details of algorithms and the tools with which to implement them, such as statistical software packages like R.
There is also the (often prerequisite) need to put in place the infrastructure and data processing software to support these endeavours. Often this requires new engineering skills, such as the ability to provision and support platforms like Hadoop, as well as tackling the usually underestimated challenge of getting the data in from wherever it currently lives. However, beyond this is the broader need for the business to understand the use cases where these techniques can add value, and for managers to assess effectively the viability of an analytics project proposal. It's a field where it's often not totally clear what the language means, so I will try to demystify some of the jargon.

What Is A Data Scientist Anyway?

Given that it's a young field, the question of what Data Science actually means, and who practises it, is still an ongoing conversation among its practitioners and those who would employ them.
Drew Conway offers the following definition for a data scientist by way of a Venn diagram (huzzah!):
[Figure: Venn diagram]
In her excellent book 'Doing Data Science', Cathy O'Neil offers the following:
"And here's one thing we noticed about most of the [data science] job descriptions: they ask data scientists to be experts in computer science, statistics, communication, data visualisation, and to have extensive domain expertise. Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise - together, as a team, they can specialise in all those things."
Just as with the DevOps philosophy that invisible walls shouldn't exist between development and operations, data science encourages a blending of related disciplines to help facilitate a more exploratory and iterative approach to data mining and business intelligence. As part of this team, the job of the Data Scientist is to devise statistical models and to have enough programming knowledge to implement them in tools like SAS or R. They also need to explore and clean the data using more 'traditional' tools like SQL or Python without having to rely entirely on the engineers. Finally, they need to be able to articulate clearly what those models mean and to make them comprehensible to a business audience. What the various competing definitions certainly all agree on is that you need to know an awful lot.

Predictive Analytics

The diagram below was stolen from the slide deck at a ThoughtWorks session on Big Data that I attended back in the summer (highly recommended viewing). I really like it, and try to shoe-horn it in everywhere with my own addition to illustrate where "Data Science" resides on the spectrum of analytical categories:
[Figure: spectrum of analytical categories, annotated to show where "Data Science" resides]
The word 'prediction' usually conjures up images of The Future (as in hover-boards and flying cars), but in the context of analytics this word typically means something far more precise.
Take the example of a newly joined customer to my fictitious online retail site. Normal reporting can tell me a great deal about my existing customers - what their service usage is, what types of item they most often choose to purchase. The new customer by comparison is a blank sheet - at least until the point when they actually start doing *stuff*. Predictive analytics allows me to say things about variables for which I currently don't have a value.
When that new customer joins, I might be able to get an estimate for these unknown values based on what I know about similar, existing customers. And from that estimate I can drive decisions about how I should best engage with them. Even saying what constitutes 'similarity' can be a tough problem in itself, and it requires a good level of domain knowledge on the part of the analytics team. This too has its own term - profiling. It's exactly the technique that's in play whenever your bank flags suspicious activity on your card. Your profile in this case is your normal usage - if you don't typically make bulk electronics purchases in Hong Kong then the fraud detection model might consider this suspicious activity!
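As a rough illustration of the two ideas above, the Python sketch below first estimates an unknown value for a new customer from their most similar existing customers, then flags a card transaction that sits far outside a customer's usual profile. Everything specific here is an assumption made for the example: the customer features, the figures, the use of straight-line (Euclidean) distance as the measure of 'similarity', and the three-standard-deviation threshold. A real model would lean on domain knowledge to choose and scale its features.

```python
import math
import statistics

# Hypothetical existing customers: (age, visits per month) -> monthly spend.
# All figures are invented for illustration.
EXISTING_CUSTOMERS = [
    ((25, 4), 40.0),
    ((27, 5), 50.0),
    ((31, 6), 60.0),
    ((52, 1), 200.0),
    ((58, 2), 220.0),
]


def estimate_spend(new_customer, customers=EXISTING_CUSTOMERS, k=3):
    """Estimate spend as the mean spend of the k most similar customers."""
    # 'Similarity' here is simply Euclidean distance between feature tuples.
    nearest = sorted(customers, key=lambda c: math.dist(c[0], new_customer))[:k]
    return statistics.mean([spend for _, spend in nearest])


def is_suspicious(amount, usual_amounts, threshold=3.0):
    """Flag an amount lying more than `threshold` standard deviations
    away from the mean of the customer's usual amounts."""
    mean = statistics.mean(usual_amounts)
    spread = statistics.stdev(usual_amounts)
    return abs(amount - mean) > threshold * spread


if __name__ == "__main__":
    # A new customer with no purchase history yet.
    print(estimate_spend((28, 5)))

    # The 'profile' is the customer's normal card usage.
    usual = [12.5, 30.0, 22.0, 18.0, 25.0]
    print(is_suspicious(24.0, usual))
    print(is_suspicious(2400.0, usual))
```

The estimate is only as good as the definition of similarity behind it, which is why the choice of features matters more than the arithmetic.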

The Value of Data-Analytic Thinking

"Data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets"
So where is the value in all this? How can an organisation justify the investment in data science, or big data for that matter? The answer, rather unsurprisingly, lies in the value of the data itself. Recall from the diagram above how competitive differentiation increases with the level of complexity of the analysis.
Not that I'd expect you to take my word for it. Rather, refer to the paper from a study carried out by MIT Sloan [Brynjolfsson, Hitt, Kim 2011], which provides some hard figures on the financial performance benefits to organisations that have adopted a policy of data-driven decision making.
Across all industries, a growing challenge is how to view business problems from a data perspective. The continuing challenge for companies like Kainos is to help businesses make sense of the increasingly complex techniques for turning their data into useful, actionable knowledge.
* As an aside, I highly recommend the book quoted above. It's mandatory reading for anyone who wants to better understand the implications of data-driven decision making for business and the steps an organisation might take to get there.
