It’s no secret: my life revolves around cars, technology and the opportunity to combine them. So what happens when you realise you have just purchased a second-hand car that tracks your driving style, travel times and efficiency? Well, you analyse it. Just don’t go sharing it with your significant other.
So, what’s in the data?
Paddy – the car I have purchased – has provided a great opportunity to explore the world of Hue Search Dashboards on Hadoop. The car contains logging software which writes a wealth of information to a USB pen on each journey, the most interesting of which are the fuel consumption, distance and journey cost fields.
While it is interesting to sift through and find the most expensive journeys, there isn’t much value to be gained from the journey data alone. This is where the World Weather Online API comes in.
I see you now, lightbulb above your head, turning into a data analyst. We can combine the journey and weather data to see how the weather affects fuel consumption.
Hadoop, Hive, Morphlines and Solr
Much like my obsession with my garage, I need to have many tools for the job. Hadoop gives us the garage and all the tools in it. In this instance I’m using Cloudera’s Distribution of Hadoop as it bundles together all the tools I need and contains the wonderful Hue dashboards.
My first question in any challenge is how to remove the complexities. I could write a few MapReduce jobs to clean the datasets, transform any timestamps and join them; however, a simpler solution is to throw a Hive table over the top and use simple SQL statements. Hive provides a SQL-like query language and a metastore which applies a table schema onto underlying files in Hadoop. This schema is then used by Hive-generated MapReduce jobs to read and manipulate the data.
Before creating any tables we need to think about the end goal – searching and finding trends. Searching means indexing, so why not make use of the distributed power of Hadoop to achieve this? Mainly because distributed indexing is quite the challenge. However, open source saves us again in the form of Apache Solr. Solr 4.0 includes SolrCloud with sharding and the integration of ZooKeeper. Apache ZooKeeper provides synchronisation of distributed services, filling in the distributed index gap.
My main gripe with Solr is its Lucene-based TrieDateField format: it’s very strict in what it can accept by default. To put this in context, the dates in our journey data need to be in the format Solr expects (ISO 8601 in UTC, such as 1995-12-31T23:59:59Z) before they can be indexed. Our journey data doesn’t arrive that way, so we can’t simply join our weather and journey data using Hive; we need a preprocessing step.
Any excuse for another puzzle piece, and this time it’s Morphlines. Morphlines is part of Cloudera’s Kite SDK, aimed at simplifying MapReduce jobs using configuration files. We still need to create a MapReduce job, but it just wraps the Morphline configuration file. We use this to take our journey data, process it line by line and output a freshly formatted CSV.
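To make the date problem concrete, here is a minimal Python sketch of the kind of conversion the preprocessing step has to perform. It is not the Morphlines configuration used for Paddy's data. The input layout (day/month/year with a 24-hour time) and the name to_solr_date are assumptions for illustration, since the car's real log format isn't shown here, and it also assumes the logger records UTC.

```python
from datetime import datetime, timezone

# Assumed input layout; the car's real log format is not shown in this post.
ASSUMED_JOURNEY_FORMAT = "%d/%m/%Y %H:%M:%S"

# The UTC ISO 8601 form that Solr date fields accept, e.g. 1995-12-31T23:59:59Z.
SOLR_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_solr_date(raw, source_format=ASSUMED_JOURNEY_FORMAT):
    """Convert a journey timestamp string into a Solr-friendly date string."""
    parsed = datetime.strptime(raw.strip(), source_format)
    # Assumption: the logger writes UTC. If it writes local time, convert
    # to UTC here instead of simply attaching the UTC timezone.
    as_utc = parsed.replace(tzinfo=timezone.utc)
    return as_utc.strftime(SOLR_DATE_FORMAT)


print(to_solr_date("21/08/2014 07:45:10"))  # prints 2014-08-21T07:45:10Z
```

If the logger turns out to write local time, the strict format still applies, but the conversion to UTC has to happen before the trailing Z is added.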
Now we can get back to Hive and create three tables: one for journey data, one for weather and, for visibility’s sake, the joined table that we will load into Solr.
After grabbing a CSV export of the joined table and setting up Solr, we use Morphlines one final time to post our meaningful data into Solr – where it is indexed.
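For anyone who wants to see the shape of that join without a cluster to hand, the following is a rough Python sketch of the same idea: match each journey to the weather for its calendar date, then compare average fuel consumption on wet and dry days. The real work here was done with Hive tables. The field names (start_time, l_per_100km, precip_mm, temp_c) and the sample rows are made up purely for illustration.

```python
from collections import defaultdict


def join_journeys_with_weather(journeys, weather):
    """Inner-join journey rows to weather rows on the calendar date."""
    weather_by_date = {row["date"]: row for row in weather}
    joined = []
    for journey in journeys:
        day = journey["start_time"][:10]  # date part of an ISO 8601 timestamp
        match = weather_by_date.get(day)
        if match is None:
            continue  # no weather recorded for that day, so drop the journey
        joined.append({**journey,
                       "temp_c": match["temp_c"],
                       "precip_mm": match["precip_mm"]})
    return joined


def mean_consumption_by_condition(joined):
    """Average fuel consumption for wet (any rain) versus dry journeys."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in joined:
        label = "wet" if row["precip_mm"] > 0 else "dry"
        totals[label][0] += row["l_per_100km"]
        totals[label][1] += 1
    return {label: total / count for label, (total, count) in totals.items()}


# Illustrative rows only; these are not readings from the car.
journeys = [
    {"start_time": "2014-08-21T07:45:10Z", "l_per_100km": 6.0},
    {"start_time": "2014-08-21T17:30:00Z", "l_per_100km": 8.0},
    {"start_time": "2014-08-22T07:50:00Z", "l_per_100km": 5.5},
    {"start_time": "2014-08-23T09:00:00Z", "l_per_100km": 6.5},
]
weather = [
    {"date": "2014-08-21", "temp_c": 14.0, "precip_mm": 4.2},
    {"date": "2014-08-22", "temp_c": 18.0, "precip_mm": 0.0},
]

joined = join_journeys_with_weather(journeys, weather)
print(mean_consumption_by_condition(joined))
```

Keeping the joined rows as their own table, as in the Hive version, makes it easy to eyeball what will end up in the index before loading it.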
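As an aside, Morphlines is not the only way to get a CSV into Solr. The sketch below shows a different, simpler technique than the one used here: turning CSV rows into the JSON array of documents that Solr's update handler accepts. It only builds the request body and does not contact a server. The column names and the helper names csv_to_solr_docs and build_update_body are hypothetical, and the sample row is illustrative.

```python
import csv
import io
import json


def csv_to_solr_docs(csv_text, numeric_fields=()):
    """Turn CSV text with a header row into a list of Solr-style documents."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        doc = dict(row)
        for field in numeric_fields:
            # CSV values are strings; convert the ones Solr should treat as numbers.
            doc[field] = float(doc[field])
        docs.append(doc)
    return docs


def build_update_body(docs):
    """Serialise documents as a JSON array, the shape sent to Solr's update handler."""
    return json.dumps(docs)


# Illustrative export; the column names are assumptions, not the real table schema.
sample_csv = (
    "id,start_time,l_per_100km,precip_mm\n"
    "j1,2014-08-21T07:45:10Z,6.0,4.2\n"
)

docs = csv_to_solr_docs(sample_csv, numeric_fields=("l_per_100km", "precip_mm"))
print(build_update_body(docs))
```

In practice the body would be posted to the collection's update endpoint with a JSON content type and followed by a commit, which is roughly the job the Morphline does for us.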
Hue Search Dashboards
Packed neatly into CDH 5.1 is the new dynamic dashboard portion of Hue. This is what all our data loading has been about. Dynamic dashboards make use of Solr collections to build an open source, distributed, interactive search. This is cutting-edge technology that’s been reserved for custom implementations and proprietary software for a very long time, so it’s great to see Hadoop make this option more accessible.
Dynamic Dashboards are neat, simple drag-and-drop WYSIWYG views which support pie charts, time lines, bar charts, faceting and maps.
What’s even better is that each of these widgets is interactive – you can highlight a portion of a time line to filter all other widgets in the view.
Along the top we also have our search field which supports complex searching, powered by Solr.
I am very impressed by the enterprise capability of this search, so much so that here at Kainos we will be integrating it as a core feature in one of our projects.
The outcome of all of this is that I can now identify trends in my driving style and whether they changed with the weather. Call it simple visual analytics, search or shiny graphs, but whatever you call it, be sure to include the sheer accessibility of this platform and its potential.