You may have recently read Tom’s recap of Strata and Hadoop World in San Jose, the key conference hosted by our partners Cloudera. The equivalent event hosted by the other big player in the Hadoop distribution space, Hortonworks, is Hadoop Summit, and this year we were fortunate that Hadoop Summit Europe was held right on our doorstep, in the Convention Centre Dublin.

I was extremely impressed with the variety of talks at Hadoop Summit, and some of my thoughts on the key themes are outlined in this blog. You will probably notice that many of the ideas mentioned in Tom’s blog are reiterated here, but you can take this as confirmation that the focus and roadmap of the Hadoop ecosystem are shared, regardless of which distribution you align yourself with.

Ingestion and Streaming

Streaming is undeniably the hot topic in the Hadoop world at the moment, with more emphasis being placed on real-time analytics, which allows for time-critical decision making and the most up-to-date view of information. The benefits of using Hadoop to process huge volumes of “data at rest” have been widely recognised, but the focus has now switched to identifying how we can gain insight from “data in motion”.

In his talk, Cloudera’s Ted Malaska weighed up the different options for data ingestion and streaming. These included Storm, Spark Streaming, Flink and Kafka, among others. While each has its strengths and weaknesses, the point Ted made repeatedly was “don’t overcomplicate what doesn’t need to be overcomplicated”. Some options may seem boring, but if they work for the particular use case then don’t ignore them simply for the sake of using newer, more interesting technologies.

Another topic which was the focus of several talks was combining streaming data with data already stored in Hadoop data lakes. Particularly interesting talks I attended covered using Spark Streaming together with data in an HBase data store. These talks were more code-heavy, and gave some concrete examples of how to integrate HBase with RDDs in Spark, and also how to use Spark’s newer DataFrame API, primarily as a way of enriching our streamed data.
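To give a flavour of what “enriching” a stream means, here is a minimal sketch of the pattern in plain Python. It is not code from any of the talks and it does not use the Spark or HBase APIs: the dictionary `REFERENCE_STORE` is a hypothetical stand-in for an HBase table, and the names `micro_batches`, `enrich_batch` and `enrich_stream` are made up for illustration. The idea it shows is simply that events arrive in small batches, and each batch is joined against reference data by key before being passed on.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in for an HBase table keyed by row key (here, a user id).
# In a real job this lookup would go through an HBase client or connector.
REFERENCE_STORE: Dict[str, Dict[str, str]] = {
    "u1": {"country": "IE", "segment": "gold"},
    "u2": {"country": "UK", "segment": "silver"},
}


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group an incoming event stream into small fixed-size batches."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def enrich_batch(batch: List[dict], store: Dict[str, Dict[str, str]]) -> List[dict]:
    """Join one batch against the reference store on user_id."""
    # Look up each distinct key once per batch rather than once per event.
    keys = {event["user_id"] for event in batch}
    looked_up = {key: store[key] for key in keys if key in store}
    # Events with no matching reference row pass through unchanged.
    return [{**event, **looked_up.get(event["user_id"], {})} for event in batch]


def enrich_stream(events: Iterable[dict], store: Dict[str, Dict[str, str]],
                  batch_size: int = 2) -> Iterator[dict]:
    """Yield enriched events, processing the stream one micro-batch at a time."""
    for batch in micro_batches(events, batch_size):
        yield from enrich_batch(batch, store)


if __name__ == "__main__":
    incoming = [
        {"user_id": "u1", "amount": 10},
        {"user_id": "u3", "amount": 5},
        {"user_id": "u2", "amount": 7},
    ]
    for enriched in enrich_stream(incoming, REFERENCE_STORE):
        print(enriched)
```

Looking up each distinct key once per batch, rather than once per event, is the design choice worth noting here, since it keeps the number of calls to the reference store down when many events share a key.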

Hadoop – But Not As You Know It

This year marks the 10th anniversary of Hadoop. However, Hadoop as we know it today has evolved greatly from the core Hadoop Distributed File System and Hadoop MapReduce processing engine, which began development a decade ago.

For example, the favoured choice for distributed processing is now widely accepted to be Apache Spark rather than MapReduce, but this is only one of a multitude of additions to the Hadoop ecosystem. The number of projects now considered to be part of the Hadoop ecosystem was estimated by one conference speaker, Matthew Aslett of 451 Research, to be somewhere in the region of 65. This includes tools for data processing, analytics, management and security, to name a few. It can at times seem overwhelming, but it is great that so many contributions have been made to Hadoop in the last ten years.

This does mean that what one customer considers to be Hadoop may be completely different from what another does. Hadoop Summit allowed customers to explain to the conference which implementations of Hadoop their particular business was using, and although each was unique, all were built upon the core principles shared with the original Hadoop stack: distributed data storage and processing.

Making Data Consumable

Data visualisation was not a huge focus of Hadoop Summit, with more emphasis being placed on processing technologies, governance and security. Visualisation did, however, feature in the second keynote, delivered by David McCandless, which was definitely one of the standout points of the entire conference.

David describes himself as a Data Journalist and Information Designer, which, in his own words, involves “converting data into graphical images that anyone can understand”. The crucial part of that phrase for me is the final three words – anyone can understand.

With the staggering increase in the amount of data collected, the advancements in technology for data processing at scale and the professionals trained in this area, the potential for gaining insight from data has never been as real as it is today. But what is the point in analysing data if the results are presented in a way that only the data scientists, statisticians and engineers can understand?

Data visualisations can be extremely powerful, not only for communicating results and answering questions for end users, but also for revealing insight we didn’t know to look for and helping with the classic problem of “you don’t know what you don’t know”.

In this blog post I’ve touched on just some of the topics which featured at the conference, but if you’d like to watch any of the wide range of sessions from Hadoop Summit, the full agenda with links to videos and slides is available here.