Five Sins of a Big Data Architect
13 April 2015 | Posted by Darragh McConville

As a Big Data Architect you are responsible for the successful implementation of next-generation analytics platforms, and for overseeing the integration of these platforms into existing technology estates. Below are five common pitfalls that you should avoid.

1. Making Hadoop an Either/Or Decision

“Should we go for Teradata or Hadoop?” – if you find yourself answering, or worse still, asking this question, you’re doing something wrong. Hadoop should not be viewed as an alternative to a sub-second-latency relational database (or a graph database, or a time-series database for that matter), but as a complementary big brother to these premium storage systems.

Hadoop gives you fault tolerance, commodity storage and co-located processing engines, and its query engines go some way towards giving you optimisers and indexes. The reality, however, is that analytical, relational databases remain different beasts. We have implemented solutions where Hadoop and PostgreSQL were deployed simultaneously, happily co-existing as the former acted as both the data source and active archive for the latter.

As a Big Data Architect, you should embrace the Polyglot Persistence paradigm – adopt the most suitable data storage platform for the task at hand. Hadoop should by default be considered an augmentation of your existing architecture. A careful assessment of your use cases will highlight those which are prime candidates for cluster-scale processing and those which are candidates for optimised analytical data stores.
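That assessment can be made explicit. A minimal sketch of polyglot routing follows – the store names and the latency/volume thresholds are illustrative assumptions for this example, not recommendations:

```python
# Illustrative polyglot-persistence routing: pick a store per use case.
# Thresholds and store labels are assumptions made up for this sketch.

def recommend_store(latency_ms_target: float, data_volume_tb: float) -> str:
    """Suggest a storage platform from two rough axes: latency and volume."""
    if latency_ms_target < 1000 and data_volume_tb < 10:
        return "relational"            # e.g. PostgreSQL for sub-second analytics
    if latency_ms_target < 1000:
        return "optimised-analytical"  # e.g. an MPP columnar store
    return "hadoop"                    # cluster-scale batch and active archive

# A dashboard query versus a full-history batch job
print(recommend_store(200, 2))          # relational
print(recommend_store(3_600_000, 500))  # hadoop
```

In practice the decision has more axes (concurrency, update patterns, cost), but the point stands: the routing is per use case, not a one-off platform choice.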

You can have your cake and eat it, it would seem.

2. Building a Graveyard instead of a Lake

Gartner has predicted that by 2018, 90% of deployed data lakes will be useless. Hadoop offers a low barrier to entry for data ingestion, but that shouldn’t mean no barrier. As a Big Data Architect, it is your responsibility to govern how external systems and data sources interact with your Hadoop cluster.

You must ensure that ingestion pipelines perform the requisite compaction, compression and metadata auditing, and that deposited files adopt serialisation formats optimised for further downstream processing:

  • Compaction refers to the joining of files together to create optimally-sized files for storage and processing on HDFS. File sizes should be a multiple of your configured block size; we usually aim for an average file size of 128MB or thereabouts;
  • Compression refers to the application of a lossless algorithm to represent your data using less space. We typically apply Snappy compression due to its rapid compression/decompression rates, although be mindful that if you intend to process the data in-memory, the data must first undergo CPU-intensive decompression, which can actually degrade performance;
  • Serialisation formats are used to optimally store data structures on disk. You should identify the format that provides you with most options for further downstream processing. Apache Avro, for example, is a binary serialisation format which co-locates schema and data. It’s compatible with Hive, Impala, MapReduce and Spark – maximising your options for technology selection. Alternatives to Avro include Google’s Protocol Buffers and Facebook’s Thrift.
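The compaction step above can be sketched as a simple planning problem: group many small ingested files into batches close to the target size before rewriting each batch as a single HDFS file. The greedy policy below is an illustrative assumption, not a prescribed algorithm:

```python
# Sketch of compaction planning: greedily batch small files up to a
# 128 MB target so output files align with a 128 MB HDFS block size.

TARGET = 128 * 1024 * 1024  # target output size in bytes (assumption: 128 MB blocks)

def plan_compaction(file_sizes, target=TARGET):
    """Return batches of input file sizes whose combined size approaches target."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):  # largest first
        if current and current_size + size > target:
            batches.append(current)                # flush the full batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 1000 small 1 MB files compact down to 8 output files of up to 128 MB each
batches = plan_compaction([1024 * 1024] * 1000)
print(len(batches))  # 8
```

A real pipeline would do this with the filesystem (e.g. a periodic merge job) rather than in memory, but the sizing logic is the same.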

Make people want to visit your data hub; no-one ever wants to visit a graveyard.

3. Forgetting about Schema Entirely

One of the most frequently occurring requirements is the ability for users to use their SQL skills to interrogate their data hub. They want to execute the same scripts atop Hadoop as they have done on SQL Server and Oracle for years. This is only possible by registering schema in the Hive metastore and querying through engines like Hive and Impala.

Both query engines provide the ubiquitous JDBC and ODBC drivers which allow traditional and legacy Business Intelligence and Advanced Analytics tools to connect to your data hub as if it were a relational data store. Hive can interpret HDFS files serialised in the binary Avro format and infer the Hive schema using the AvroSerDe library. Without the metadata of field types and structures you’re limiting access to your data to the lower-level data processing engines only, such as MapReduce or Spark.
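What makes this inference possible is that an Avro file carries its schema in its own header. The sketch below shows the idea with a hypothetical sensor-reading schema (field names are made up for illustration); the schema is ordinary JSON, which is what AvroSerDe reads to derive table structure:

```python
import json

# Illustrative Avro record schema. In an Avro container file this JSON is
# stored in the file header alongside the data, so downstream tools can
# recover field names and types without a separate DDL. The SensorReading
# record and its fields are hypothetical examples.
schema = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "value", "type": "double"},
    ],
}

record = {"sensor_id": "s-42", "timestamp": 1428900000000, "value": 21.5}

# Toy check that the record matches the declared field names
declared = {f["name"] for f in schema["fields"]}
assert set(record) == declared
print(json.dumps(sorted(declared)))
```

Writing actual Avro container files would use a library such as the official `avro` package; the point here is only that schema and data travel together.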

No schema up-front does not mean no schema at all.

4. Failing to Assess Existing Technologies

So we now know that Hadoop will be an addition to and not a replacement of your existing data warehouse technologies – it integrates well and augments functionality of existing systems.  It is therefore critical to fully assess and evaluate the capabilities of existing technologies – Business Intelligence, Data Cleansing, Data Integration, Advanced Analytics and even Backup and Recovery systems – in the context of big data workloads, to ensure the end-to-end journey time is within agreed SLAs.

Customers will, understandably, look to combine existing middleware, integration and exploration technologies with Hadoop to leverage previous investments. However, just because you now have a 100PB Hadoop cluster that can ingest 1GB/s of sensor data does not automatically mean that the end consumers of analytical insight can receive it accurately and in a timely fashion using their existing technologies.

  • Data Integration tools such as Pentaho PDI are maturing to execute transformation jobs natively on Hadoop as MapReduce jobs, and maturation will likely continue to include native Spark support. Many data integration tools do not support native Hadoop execution, however, and the heavy-lifting ETL is performed on the data integration server or existing cluster. Again, throughput should be profiled and bottlenecks identified;
  • Existing messaging infrastructure should have throughput assessed and offset against any new web-scale messaging frameworks such as Kafka or clustered aggregation frameworks such as Flume;
  • Advanced Analytics tools provide rapid prototyping of advanced models which users will want to apply to their Hadoop data.  There will be a huge disparity in processing times between those tools that use native Hadoop processing engines (such as Radoop) and those that rely on data extraction through ODBC/JDBC (such as KNIME);
  • Traditional in-memory Business Intelligence tools, although capable of handling TBs of data, will likely not have been deployed with such capacity in mind.  Scaling these tools upwards or outwards can prove costly and budget limitations may constrain the amount of data accessible to end-users.

You’re only as fast as your slowest link.

5. Immediately Adopting the Latest Tech

The rate of change across the big data marketplace and specifically within the Hadoop ecosystem sees new technology released almost monthly.  Adopting the very latest technology immediately is usually not a wise decision.  You need to be particular about the technologies you adopt and carefully assess the actual and eventual functionality.  This is where the commercial distributions of Hadoop can help with your decision making.

Take Cloudera, for example, who include over 20 components in their distribution of Hadoop. These components are considered sufficiently mature to be integrated and shipped as part of their enterprise-ready offering. However, Cloudera are careful to assess and publish those sub-components which are deemed too immature for full operational support. Take Apache Spark, for example: the core Spark libraries are mature and commercially supported within CDH 5, while the less mature Spark SQL, MLlib and GraphX currently are not.

The leading edge should not be the bleeding edge.
