In recent years Hadoop has proven to be a disruptive technology in the world of data storage and processing. Its rise to prominence as an open source platform for performing “big data” analytics has shaken up how many companies think about the ways in which they can transform and derive value from their most important data assets.
I’ve spent quite a bit of time working with the software over the past year, and I think Hadoop has hit a few important milestones in that time with regard to its readiness for enterprise use.
This post will take a look at the Cloudera distribution of Hadoop (CDH) and how its feature set has developed in 2014 to better address issues of data protection and governance on the Hadoop platform.
Hadoop for the Enterprise
Compliance is fundamentally about defining a common set of rules and processes that guarantee the privacy, integrity and availability of systems. Whether we are discussing an RDBMS, a LOB application or a distributed computing platform like Hadoop – the best way to unlock the value of data is to ensure it is secure.
More specifically, providing assurances that a system meets compliance standards means an expanded range of use cases is possible for the data that it manages.
For a new platform such as Hadoop, it is also crucial for proving that it means business, that it is not just a tool for solving edge cases, and that it can be trusted with sensitive customer information.
Cloudera’s Distribution of Hadoop
As a quick recap, Hadoop is an open source distributed computing platform for reliably storing and processing huge and varied data sets on commodity hardware. Consider Hadoop a flexible toolset, rather than an individual tool.
CDH is one of the major commercially supported distributions that packages the best of those tools and offers additional configuration management and operational monitoring.
Cloudera have introduced two products, Sentry and Navigator, which complement the core open source feature set to address the twin concerns of information security and governance.
One problem that needed to be solved with regard to Hadoop security was the inconsistent set of mechanisms for controlling access to data in the underlying file system. For example, the Hive data warehouse, the HBase key/value store and the HDFS file system itself all provide their own mechanisms for authorisation.
Sentry is an Apache project developed in collaboration between Cloudera and Intel which provides consistent role-based access control across the major Hadoop ecosystem tools.
Sentry takes responsibility for global concepts such as User, Group, Role and Privilege. Its core functionality is ‘glued’ to the various projects by individual bindings.
This modular architecture allows Sentry to provide very fine-grained access control in each tool (down to the column level in a single database, for example) from a centrally managed location.
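To make the User, Group, Role and Privilege relationship concrete, here is a minimal Python sketch of how such a model might resolve an access check. It is a toy illustration only and does not use or reproduce Sentry's actual API: the class names, the `is_allowed` method and the sample users, groups, roles and tables are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    """An action on a resource path such as (database, table, column)."""
    action: str
    resource: tuple  # (database,), (database, table) or (database, table, column)

    def allows(self, action, resource):
        # A privilege on a parent resource covers everything beneath it.
        return self.action == action and resource[:len(self.resource)] == self.resource


@dataclass
class Policy:
    """Hypothetical central policy: users belong to groups, groups hold roles."""
    group_members: dict = field(default_factory=dict)    # group -> set of users
    group_roles: dict = field(default_factory=dict)      # group -> set of roles
    role_privileges: dict = field(default_factory=dict)  # role -> set of Privilege

    def is_allowed(self, user, action, resource):
        # Walk user -> group -> role -> privilege until one privilege matches.
        for group, members in self.group_members.items():
            if user not in members:
                continue
            for role in self.group_roles.get(group, ()):
                for privilege in self.role_privileges.get(role, ()):
                    if privilege.allows(action, resource):
                        return True
        return False


# Illustrative data only: none of these names come from a real deployment.
policy = Policy(
    group_members={"analysts": {"alice"}, "engineers": {"bob"}},
    group_roles={"analysts": {"report_reader"}, "engineers": {"etl_writer"}},
    role_privileges={
        # Column-level grant: analysts may read a single column.
        "report_reader": {Privilege("select", ("sales", "orders", "total"))},
        # Database-level grant: engineers may insert anywhere in "sales".
        "etl_writer": {Privilege("insert", ("sales",))},
    },
)

print(policy.is_allowed("alice", "select", ("sales", "orders", "total")))
print(policy.is_allowed("alice", "select", ("sales", "orders", "customer_id")))
```

In this sketch privileges are never attached directly to users. A user gains access only through a role held by one of their groups, which is what allows the rules to be managed in one place while each tool enforces them locally.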
Where Sentry controls access to data, Navigator is a tool for auditing, metadata management and data lineage.
A common pattern for processing in Hadoop is to ingest the raw data files from external systems into a staging area. Like any staging mechanism, this provides traceability and a fallback point to the original source once the data has entered whatever processing pipeline(s) you have defined.
As Hadoop often functions as a ‘data lake’, bringing together data from many different operational divisions within an organisation, it is very important that data can be traced back to the original sources and owners as it is moved and transformed by the Hadoop tools.
This visibility of data lineage is what Navigator supplies.
As data is transformed by jobs, Navigator can build a visual representation to show how and when data in one location is processed and moved to another in a new format.
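The idea behind lineage tracking can be sketched in a few lines of Python. This is a simplified illustration of the concept, not Navigator's implementation or API: the `LineageGraph` class, the job names, the paths and the timestamps are all hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: records which job produced a dataset from which inputs."""

    def __init__(self):
        # output dataset -> list of (input dataset, job name, timestamp)
        self._parents = defaultdict(list)

    def record(self, job, inputs, output, when):
        # Store one edge per input so each output can be traced backwards.
        for source in inputs:
            self._parents[output].append((source, job, when))

    def sources(self, dataset):
        """Return the datasets with no recorded parents that `dataset` derives from."""
        seen, stack, roots = set(), [dataset], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            parents = self._parents.get(node, [])
            if not parents:
                roots.add(node)  # nothing upstream: an original source
            stack.extend(source for source, _, _ in parents)
        return roots


# Illustrative pipeline: two staged files feed one combined dataset.
lineage = LineageGraph()
lineage.record("ingest_crm", ["/staging/crm/customers.csv"],
               "/warehouse/customers", "day 1, 02:00")
lineage.record("ingest_web", ["/staging/web/clicks.log"],
               "/warehouse/clicks", "day 1, 02:30")
lineage.record("join_activity", ["/warehouse/customers", "/warehouse/clicks"],
               "/marts/customer_activity", "day 1, 03:00")

print(sorted(lineage.sources("/marts/customer_activity")))
```

Walking the recorded edges backwards from a derived dataset leads to the staged files it came from, which is the kind of question a lineage view is meant to answer.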
Auditing of user access is also built in; for example, failed access attempts logged by Sentry will be propagated to Navigator.
A Bright Future For Secure Big Data Analytics
It’s an exciting time to be working with the Hadoop platform and the rate of development in the technical landscape continues to be tremendous.
Tools such as Navigator and Sentry are helping to solve some of the core challenges to the adoption of Hadoop as a platform for enterprise information processing.
We are already seeing customers use these tools to process valuable information that would previously only have been entrusted to an Enterprise Data Warehouse solution.