Apache Spark is one of the biggest reasons that data analytics is such an exciting area of work for technologists such as myself right now.
It’s hugely popular, with the most active community of any open source big data project currently in development.
So, this post is an overview of Spark, the problems that it solves and whether or not the hype is warranted.
The Spark Elevator Pitch
Spark is a general-purpose cluster computing framework. If you’ve got so much data that you need a cluster of machines to process it, Spark provides an execution engine, programming model and set of libraries that can help you do cool and powerful things with that data.
The main reason for its existence is to supersede MapReduce – the original batch processing engine at the heart of Hadoop – and to support the types of use cases MapReduce traditionally struggled with: that is, anything requiring low latency.
Spark, too, is a batch processing engine, but it’s also much, much more. It adds the ability to work with the data in your cluster interactively – something you could never do with MapReduce.
It accomplishes this by using the pool of memory in your cluster to cache data that requires fast access.
The Spark website lists an interesting benchmark that claims a particular machine learning algorithm will run 100x faster on Spark in-memory than the comparable implementation in MapReduce.
As a caveat, I feel this is an appropriate moment at which to invoke the Law of Benchmarketing*:
Given any benchmarking claim c, there exists at least one workload w or at least one query q that will prove claim c correct.
Spark’s architecture makes it practical to run iterative machine learning at scale and still get results within a realistic timeframe.
This type of processing needs to load a set of data, cache it and then process it repeatedly to produce good results. Again, that is not the kind of thing MapReduce was very good at, and it is where we see the biggest disparity in performance.
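To make the load, cache and reprocess pattern concrete, here is a minimal sketch in plain Python. It is an analogy only and does not use Spark’s API: the `Dataset` class, the `fit_mean` function and the toy data are all hypothetical, invented for illustration. It simply counts how many times an “expensive” read happens when an iterative job re-reads its data on every pass, compared with reading it once and keeping it in memory.

```python
# Plain-Python analogy (NOT the Spark API): count how often the "expensive"
# load runs with and without an in-memory cache.


class Dataset:
    """Hypothetical stand-in for data stored on a cluster's disks."""

    def __init__(self, points):
        self._points = points
        self._cached = None
        self.load_count = 0  # how many times we paid for a full read

    def load(self):
        """Simulate an expensive read from disk."""
        self.load_count += 1
        return list(self._points)

    def cache(self):
        """Read once, then serve every later pass from memory."""
        if self._cached is None:
            self._cached = self.load()
        return self._cached


def fit_mean(read, iterations=10, learning_rate=0.5):
    """Estimate the mean of the data by gradient descent.

    `read` is called once per iteration, just as an iterative machine
    learning job scans its training data on every pass.
    """
    estimate = 0.0
    for _ in range(iterations):
        data = read()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate


points = [1.0, 2.0, 3.0, 4.0]

# Without a cache, every iteration pays for a fresh read.
uncached = Dataset(points)
uncached_result = fit_mean(uncached.load)

# With a cache, only the first iteration pays for the read.
cached = Dataset(points)
cached_result = fit_mean(cached.cache)

print("reads without cache:", uncached.load_count)
print("reads with cache:", cached.load_count)
```

Both runs arrive at the same estimate; the only difference is how many full reads they trigger. On a real cluster each of those reads is a scan of data on disk, which is why the gap widens as the number of iterations grows.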
Data Science At Scale
For me, what really sets Spark apart from MapReduce is its ease of use. It’s fun to program in Spark – and that’s a factor you should never underestimate.
The API is highly focused on the needs of its users – that is, data scientists and engineers.
This pays big dividends. The typical workflow for a data scientist is to process and analyse data on their local machine using tools like R or Python.
This workflow is iterative and interactive. You type in commands at a prompt (REPL) and you get instant feedback.
Spark embraces this approach to data analysis. The vision is to make the transition from working on a single machine to working on a cluster a seamless experience.
The generality of Spark is a core part of its power. Out of the box, it supports Scala, Python, R, SQL and Java. Interactive shells are available for each of the first three options (and SQL can be embedded).
A Unified Analytics Stack
As well as a wide range of core language support, Spark provides many options for interpreting your data through its libraries.
You can view your data as a graph to explore relationships with GraphX, utilise a wide range of highly parallel machine learning algorithms with MLlib or interact with your data in standard SQL syntax through Spark SQL.
This unified stack allows you to view the data in your cluster under a variety of lenses using a single toolset.
One of the biggest challenges with the “lambda architecture” – applying unified business logic to historical data in batch, and live data in real-time – was that you had to choose a different technology for each workload.
For example, you might use MapReduce for the batch layer and Apache Storm for the real-time component.
Spark Streaming allows the same code to be applied as both a batch job over historical data and a real-time job against live data streams. This is a big win in terms of reducing architectural complexity and overall code base.
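The idea of writing the business logic once and running it in both modes can be sketched in plain Python. Again, this is an analogy and not Spark Streaming’s API: the `running_totals` function, the `live_feed` generator and the order records are hypothetical names and data made up for illustration.

```python
# Plain-Python analogy (NOT the Spark Streaming API): one piece of business
# logic applied to a finite batch and to a feed of live events.
from collections import Counter
from typing import Iterable, Iterator


def running_totals(orders: Iterable[dict]) -> Iterator[dict]:
    """Shared business logic: running order totals per customer.

    Yields a snapshot of the totals after each order, so the caller decides
    whether to keep only the last one (batch) or act on each one (streaming).
    """
    totals = Counter()
    for order in orders:
        totals[order["customer"]] += order["amount"]
        yield dict(totals)


def live_feed() -> Iterator[dict]:
    """Hypothetical live source; a real one would wait on a socket or queue."""
    yield {"customer": "alice", "amount": 5}
    yield {"customer": "alice", "amount": 10}


historical = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 20},
    {"customer": "alice", "amount": 15},
]

# Batch: run over all the historical records and keep the final totals.
batch_totals = list(running_totals(historical))[-1]

# Streaming: the same function, with a result produced as each event arrives.
stream_snapshots = list(running_totals(live_feed()))

print(batch_totals)
print(stream_snapshots)
```

Only the input differs between the two calls. Without that property you would maintain the same logic twice, once in each technology.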
The core of the Spark framework is now becoming a major production component in the big data workloads of many global institutions.
Alibaba, one of the world’s largest e-commerce platforms, is known to be operating a cluster that processes in the region of 1 petabyte of data per week – primarily to analyse their massive social network with GraphX.
Many of these corporations are committing code back into the core project and are helping to drive the fast expansion of available features supported out-of-the-box.
With this rapid evolution of the API comes a word of caution regarding stability – certain components such as Spark Streaming still have question marks over their maturity, though again the improvements are coming thick and fast.
To reiterate my opening point – projects like Spark are why it’s an exciting time to be involved in the big data and analytics space.
They allow technologists to deliver ever more flexible and powerful ways of unlocking the value in customer data. The project continues to be one to watch in 2015 and beyond.
* Credit to Gregorio (sadly, I missed the surname) from his Strata 2014 SQL on Hadoop talk for this gem!