Manhattan was overrun with almost 6000 big data and analytics enthusiasts last week. The Strata + Hadoop conference was hosted as part of NYC Data Week, and Thomas Swann and I represented Kainos. The enthusiasm and excitement from the attendees were palpable, with the realisation that our industry is now in the midst of a genuine revolution. Below are ten key learnings I have taken away with me.
Embracing big data-driven decision making is now mandatory for enterprises. Where previously big data projects were viewed as value-add niceties for multinational conglomerates, now all enterprises must adopt a data-first approach to their decision making, simply to survive. Competitors all around you will be making more accurate decisions, faster and more cost-effectively than you. It is imperative that you follow suit simply to sustain your position within your marketplace.
Apache Hadoop will continue on its trailblazing path of success across the industry – by disappearing. HDFS, MapReduce, YARN, Hive and Pig will be confined to low-level infrastructure technology discussions, where they belong. Customers won’t need or want to understand the Hadoop stack, as they won’t want to buy Hadoop specifically. They will want to buy analytics solutions, of which Hadoop will form a critical part of the underlying infrastructure. Additional analytics, exploration and visualisation tools, combined with a competitive deployment model, will complete the solution stack for customers.
The future of Hadoop processing is Apache Spark, not MapReduce. It’s that simple. More efficient use of resources, significantly faster on-disk processing, in-memory processing and generic application containers for multi-functional clusters are all features that can be availed of today. The most exciting aspects of Spark are those that are in the pipeline, in particular those worked on by Databricks, such as Spark SQL. Spark will become as synonymous with Hadoop as MapReduce is today. MapReduce will not disappear, but will become the secondary processing engine for Hadoop. Vendors operating at the top of the Hadoop stack must focus on Spark integration. Products that offer only native MapReduce integration will have a limited shelf life, as customer expectations of real-time interaction and deep-diving have been raised by newcomers like Platfora and ZoomData.
The familiar metaphor of a data lake refers to a landing zone for all data, one that operational systems continually add to and that allows a range of access patterns across all users in the organisation. Establishing a data lake is the first step towards being able to store all data, present and future, and it creates a centralised data archive location. A lake is merely the first step in implementing Cloudera’s unique vision of the Enterprise Data Hub (EDH). The EDH is an integrated backbone for mission-critical workloads which processes analytical queries and models across the entire data estate as part of operational business processes. This could be where recommendation engines return specific product sets based on previous experience or where insurance quotes are returned based on anticipated customer behaviour. This insight can be gained using the compute and storage of the EDH and can then be surfaced to line-of-business systems.
In December 2004 Google published “MapReduce: Simplified Data Processing on Large Clusters”. Ten years later Gartner predict that over 70% of enterprises have started or will start a Hadoop-based project this year. It’s worth looking at Google now for an idea of what enterprises will be doing ten years from now, according to M.C. Srivas. Self-driving cars, intelligent manufacturing devices and self-learning robots may well become standard parts of the technology estate for enterprise customers. The price we pay for Google’s progress, however, is the surrender of privacy. To interact and thrive in modern-day society we have a critical dependence on technology. Julia Angwin’s attempts to become anonymous in the digital world proved not just expensive, but ultimately futile. A premium can be paid to improve personal privacy levels by using anonymous search engines and private cloud backups, but escaping the clutches of Google completely proves nigh-on impossible unless you consider parting company with your mobile phone. Full privacy is simply unobtainable and in reality we have to settle for a poor second best – an assurance that nothing bad will deliberately happen.
Emotional analytics is the systematic interpretation of human emotion through combining facial recognition and sentiment analysis. Affectiva hold the world’s largest emotion data repository and they determine emotional reactions to media and advertising between software and customers. This powerful and somewhat intimidating approach to customer interaction will allow companies to tailor communications to customers with different emotional reactions. Imagine the next online advertisement you watch that knows you’re in a bad mood and decides to skip the whimsical, musical intro and move straight to the salient sales message.
When selecting your data discovery, visualisation and analytics tools, take time to understand the architecture and data storage used. Some products run natively on Hadoop, some extract data out into memory, and others extract data into separate operational data stores. Having invested in a Hadoop cluster, you want to maximise use of that investment without risking performance or stability. Determine whether the Apache Spark processing engine is used and, if not today, when it will be.
Modern enterprise data architectures will have two or more data platform technologies. EDHs and enterprise data warehouses (EDWs) will not just co-exist but will coalesce as a logical unified platform. The unification of these technologies proceeds from the numerous partnerships between Hadoop distributors and traditional EDW vendors. High-value data will be kept in the EDW, valuable data will be kept in the EDH and less valuable data will be kept in archive stores.
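The tiering rule in the last sentence can be sketched as a small routing function. This is purely an illustration: the idea of a numeric business-value score, the two thresholds and every name in the code are hypothetical, and nothing here reflects a vendor's actual product or API.

```python
from enum import Enum


class Tier(Enum):
    EDW = "enterprise data warehouse"
    EDH = "enterprise data hub"
    ARCHIVE = "archive store"


# Hypothetical cut-offs on a 0.0-1.0 business-value score.
# A real organisation would choose its own measure and thresholds.
HIGH_VALUE_THRESHOLD = 0.8
VALUABLE_THRESHOLD = 0.4


def choose_tier(value_score: float) -> Tier:
    """Map a dataset's (assumed) business-value score to a storage tier."""
    if not 0.0 <= value_score <= 1.0:
        raise ValueError("value_score must be between 0.0 and 1.0")
    if value_score >= HIGH_VALUE_THRESHOLD:
        return Tier.EDW      # high-value data
    if value_score >= VALUABLE_THRESHOLD:
        return Tier.EDH      # valuable data
    return Tier.ARCHIVE      # less valuable data


if __name__ == "__main__":
    for name, score in [("sales ledger", 0.95), ("clickstream", 0.5), ("old logs", 0.1)]:
        print(f"{name}: {choose_tier(score).value}")
```

In practice the hard part is agreeing how value is measured, not the routing itself, and the unified platform should hide the tier boundaries from the people querying the data.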
It’s best practice to only use local storage with dedicated physical nodes within your Hadoop cluster – right? Not necessarily. Netflix are using Amazon EMR on virtual infrastructure with centralised storage on Amazon S3 to process petabytes of data on multiple clusters. Performance does not suffer as much as expected and transfers to and from S3 are more performant than the remote location would suggest.
Microsoft’s inviting Azure Machine Learning platform will further lower the barrier to entry for budding data scientists. The familiar user experience with API bundling and deployment will entice many novices, but the restrictions on exporting the models in a standard format such as PMML may stoke lock-in fears among some adopters. Azure joins the likes of RapidMiner with their no-coding approach to advanced analytics.