Strata + Hadoop World is the world’s biggest and best conference on all aspects of the data economy. I had the pleasure of attending this year’s event in San Jose, and below you can find my thoughts on the major conference themes.
Hadoop continues to mature
With each passing year, it becomes more difficult to pin down exactly what anyone means when they say ‘Hadoop’. Like the term “big data”, with which it is now almost synonymous, Hadoop has come to stand for the whole explosion of data storage and processing technologies that have moved the conversation beyond the relational database as the default option.
Hadoop’s evolution in the past year can really be summed up in one word: Spark. This proved to be a major theme of the conference. Every major analytics tool vendor either currently supports Apache Spark as their underlying processing framework, or has it on their roadmap.
Spark helps to address the traditional limitations of Hadoop as a platform, allowing batch and near real-time processing workloads to be unified under a single framework. It has moved the entire ecosystem forward from the batch-only world view of MapReduce, and provided greater usability for data scientists through its support for interactive Python and R command-line interfaces.
The other major element of this story was Hadoop adoption in the enterprise, and the projects that are emerging to support it.
Projects such as Cloudera’s RecordService add finer-grained access control and data masking to the existing Hadoop security landscape. Perhaps unsurprisingly, the project has its roots in financial services, having been developed and open sourced in collaboration with Capital One.
As Hadoop is increasingly viewed as a place to store core information assets such as customer transactional data, controls which have long been standard in the established relational database world are now seen as required features.
The other major talking point of the conference was real-time processing and analytics on data streams.
The primary component underpinning most approaches to this problem is Apache Kafka.
Kafka can best be described as a pub-sub message queue, but combined with reliable storage and the capability to massively scale out. You can run Kafka on a single machine, or you can run a cluster at web-scale.
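What sets Kafka apart from a classic message queue is that combination of pub-sub semantics with a durable, replayable log: messages are retained rather than deleted on consumption, and each consumer tracks its own position. A minimal sketch of that idea in plain Python (this is a toy model of the log-and-offset semantics, not the Kafka client API):

```python
# Toy model of Kafka's core abstraction: an append-only log per topic,
# with each consumer holding its own read offset. Illustrative only --
# not the real Kafka API.

class TopicLog:
    """Append-only log; messages are retained, not removed on read."""
    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

class Consumer:
    """Each consumer owns its offset, so it can read at its own pace."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.messages[self.offset:]
        self.offset = len(self.log.messages)
        return batch

topic = TopicLog()
topic.publish("page_view:home")
topic.publish("page_view:pricing")

fast = Consumer(topic)
slow = Consumer(topic)
print(fast.poll())  # ['page_view:home', 'page_view:pricing']
topic.publish("signup:alice")
print(fast.poll())  # ['signup:alice'] -- only what arrived since last poll
print(slow.poll())  # slow consumer still sees the full history
```

Because the log is durable and offsets belong to consumers, a slow or newly added consumer can replay history independently of everyone else, which is what makes Kafka useful as shared storage and not just a pipe.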
My favourite talk of the conference was delivered by Alex Silva from Pluralsight, describing the architecture for ‘Project Hydra’.
Pluralsight have developed a microservice architecture to ingest data from numerous elements of their business and store the information as Kafka data streams. This replaces the traditional message queue (e.g. RabbitMQ) as the “dumb pipe” in the microservice equation, with support for fast, advanced analytics baked in.
Hydra should provide a fine case study for anyone wishing to develop a streaming data platform, and the project itself is soon to be open sourced.
The year of streaming analytics
For me, the single most prominent element of Strata was streaming analytics and its increasing footprint in the Hadoop ecosystem.
Whilst Kafka is the common denominator in most streaming architectures, the framework providing the advanced analytics on top looks like it will be a hotly contested space in 2016 and beyond.
The Kafka project has introduced its own stream processing framework, Kafka Streams. This will compete with established options like Spark Streaming, Samza, Storm and Flink.
My bet would be on Spark Streaming (at least for those use cases which do not require millisecond latency), given the growing ubiquity of the full framework.
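The latency caveat above comes from Spark Streaming's micro-batch model: events are grouped into small time-based batches and a batch job runs over each one, whereas Storm or Flink process each record as it arrives. A toy illustration of the micro-batch idea in plain Python (an illustration of the model only, not the Spark API; the fixed batch size stands in for a time window):

```python
# Sketch of the micro-batch model used by Spark Streaming: group the
# incoming stream into small batches, run a batch computation (here, a
# running count of event types) over each one, carrying state between
# batches. Latency is bounded below by the batch interval, which is why
# per-event engines win when millisecond response is required.

from collections import Counter

def micro_batches(events, batch_size):
    """Split a stream of events into fixed-size batches."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

stream = ["error", "ok", "ok", "error", "ok", "timeout"]
running_counts = Counter()

for batch in micro_batches(stream, batch_size=2):
    running_counts.update(batch)  # batch computation over the window
    print(dict(running_counts))   # state carried across batches
```

The same word-count logic written against Spark's DStream API would look structurally similar, which is a large part of Spark Streaming's appeal: existing batch code and skills carry over to the streaming case.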
The capability to reduce the time to react to events is becoming a critical point of differentiation, whether it be in the context of the enterprise, smart cities or connected health. Velocity is starting to trump volume as the main point of discussion.
This will be an interesting and dynamic area of technology to watch in the next 12 months. However, I feel that we can be confident that both Kafka and Spark Streaming will play important roles in lowering the barrier to entry and spreading the adoption of streaming as a viable approach to data processing.
You can also watch a recap of our coverage from the event on the Kainos Software YouTube channel here:
Each video summarises our experiences of the day and provides more detail on some of the areas highlighted in this post.