Five Sins of a Big Data Architect
As a Big Data Architect, you are responsible for the successful implementation of next-generation analytics platforms, and you must oversee their integration into existing technology estates. Below are five common pitfalls you should avoid.
“Should we go for Teradata or Hadoop?” – if you find yourself answering, or worse still, asking this question, you’re doing something wrong. Hadoop should not be viewed as an alternative to a sub-second-latency relational database (or graph database, or time-series database for that matter) but as a complementary big brother to these premium storage systems.
Hadoop gives you fault-tolerance, commodity storage and co-located processing engines, and its query engines go some way towards giving you optimisers and indexes. The reality, however, is that analytical relational databases remain different beasts. We have implemented solutions where Hadoop and PostgreSQL were deployed simultaneously, happily co-existing, with the former acting as both the data source and the active archive for the latter.
As a Big Data Architect, you should embrace the Polyglot Persistence paradigm – adopt the most suitable data storage platform for the task at hand. Hadoop should by default be considered as an augmentation of your existing architecture. A careful assessment of your use cases will highlight those which are prime candidates for cluster-scale processing and those which are candidates for optimised analytical data stores.
You can have your cake and eat it, it would seem.
Gartner has predicted that by 2018, 90% of deployed data lakes will be useless. Hadoop offers a low barrier to entry for data ingestion, but that shouldn’t mean no barrier. As a Big Data Architect, it is your responsibility to govern how external systems and data sources interact with your Hadoop cluster.
You must ensure that ingestion pipelines perform the requisite compaction, compression and metadata auditing, and that deposited files adopt serialisation formats optimised for further downstream processing.
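As a minimal sketch of the compaction-and-compression step, the hypothetical helper below merges many small JSON-lines files into a single gzip-compressed file on a local filesystem. It is a stand-in for what an ingestion pipeline would do on HDFS; a real pipeline would target columnar or binary formats such as Avro or Parquet rather than JSON.

```python
import gzip
import json
import os

def compact_small_files(src_dir, dest_path):
    """Merge many small JSON-lines files into one gzip-compressed file.

    A local-filesystem stand-in for the compaction an ingestion pipeline
    would run on HDFS before downstream processing. Parsing each line on
    the way in doubles as a basic audit of the incoming records.
    """
    written = 0
    with gzip.open(dest_path, "wt", encoding="utf-8") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name), encoding="utf-8") as src:
                for line in src:
                    record = json.loads(line)  # reject malformed records early
                    out.write(json.dumps(record) + "\n")
                    written += 1
    return written
```

Compacting small files matters on Hadoop in particular, because every file consumes NameNode memory and each small file tends to become its own processing task downstream.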
Make people want to visit your data hub; no-one ever wants to visit a graveyard.
One of the most frequently occurring requirements is for users to be able to apply their SQL skills to interrogate the data hub. They want to execute the same scripts atop Hadoop as they have done on SQL Server and Oracle for years. This is only possible if table definitions are registered in the native Hadoop metastore and exposed through query engines like Hive and Impala.
Both query engines provide the ubiquitous JDBC and ODBC drivers, which allow traditional and legacy Business Intelligence and Advanced Analytics tools to connect to your data hub as if it were a relational data store. Hive can interpret HDFS files serialised in the binary Avro format and infer the Hive schema using the AvroSerDe library. Without that metadata of field types and structures, you limit access to your data to the lower-level data processing engines only, such as MapReduce or Spark.
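To sketch why that metadata matters, the snippet below hand-rolls an Avro-style record schema and a tiny conformance check. The schema and field names are invented for illustration, and this is not the real Avro or AvroSerDe library; the point is that once field names and types are declared, any engine can validate and interpret records rather than treating them as opaque bytes.

```python
# An Avro-style record schema, hand-written for illustration.
# This mimics the metadata a query engine would read, not the real
# Avro library or Hive's AvroSerDe.
SENSOR_SCHEMA = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "reading", "type": "double"},
    ],
}

# Mapping from Avro primitive type names to the Python types they admit.
PRIMITIVES = {"string": str, "double": float, "long": int, "boolean": bool}

def conforms(record, schema):
    """Check that a dict has exactly the declared fields, correctly typed."""
    if set(record) != {f["name"] for f in schema["fields"]}:
        return False
    return all(
        isinstance(record[f["name"]], PRIMITIVES[f["type"]])
        for f in schema["fields"]
    )
```

With a declared schema, `conforms({"sensor_id": "s1", "reading": 21.5}, SENSOR_SCHEMA)` holds, while a record with a string where a double belongs is rejected — exactly the type information SQL engines need and schemaless dumps lack.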
No schema up-front does not mean no schema at all.
So we now know that Hadoop will be an addition to, not a replacement for, your existing data warehouse technologies – it integrates well with and augments the functionality of existing systems. It is therefore critical to fully assess and evaluate the capabilities of those existing technologies – Business Intelligence, Data Cleansing, Data Integration, Advanced Analytics and even Backup and Recovery systems – in the context of big data workloads, to ensure the end-to-end journey time stays within agreed SLAs.
Customers will, understandably, look to combine existing middleware, integration and exploration technologies with Hadoop to leverage previous investments. However, just because you now have a 100PB Hadoop cluster that can ingest 1GB/s of sensor data does not automatically mean that the end consumers of analytical insight can receive it accurately and on time through their existing technologies.
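The arithmetic behind this is simple enough to sketch: in steady state, an end-to-end pipeline delivers no faster than its slowest stage, regardless of how quickly the ingest layer can absorb data. The stage names and rates below are invented for illustration.

```python
def bottleneck(stage_rates_mb_s):
    """Return the (stage, rate) pair that caps end-to-end throughput.

    A pipeline in steady state moves data no faster than its slowest
    stage, however fast the other stages are.
    """
    return min(stage_rates_mb_s.items(), key=lambda kv: kv[1])

# Hypothetical stage rates in MB/s for an illustrative estate:
stages = {"hadoop_ingest": 1024, "data_cleansing": 200, "bi_extract": 40}
```

Here `bottleneck(stages)` returns `("bi_extract", 40)`: the cluster can ingest at 1GB/s, but the legacy BI extract caps delivery at 40MB/s, so that is the number your SLA conversation should start from.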
You’re only as fast as your slowest link.
The rate of change across the big data marketplace, and specifically within the Hadoop ecosystem, sees new technology released almost monthly. Adopting the very latest technology immediately is usually not a wise decision. You need to be particular about the technologies you adopt and carefully assess both their current and their eventual functionality. This is where the commercial distributions of Hadoop can help with your decision making.
Take Cloudera, for example, who include over 20 components in their distribution of Hadoop. These components are considered sufficiently mature to be integrated and shipped as part of their enterprise-ready offering. However, Cloudera are careful to assess and publish those sub-components deemed too immature for full operational support. Take Apache Spark, for example: the core Spark libraries are mature and commercially supported within CDH 5, while the less mature Spark SQL, MLlib and GraphX currently are not.
The leading edge should not be the bleeding edge.