Becoming a Databricks professional – My certification journey
What is Databricks and why should I care?
It’s entirely possible that you’ve heard of this company but are unaware of the platform they offer. So, if that’s you, I’ll give you a quick-fire introduction.
Databricks is a web-based platform, produced by the creators of Apache Spark, MLflow, and Delta Lake, that allows data engineers, scientists, and analysts to all work collaboratively on one platform.
The huge benefit of Databricks is its ability to condense all your ETL processing, data lake storage, data warehousing, machine learning/deep learning experiments, and business intelligence into one unified platform whilst still logically isolating individuals by their role to make sure data is protected.
Whilst Databricks is a newer alternative to the tools natively found on Microsoft Azure and AWS, it has surged in popularity since its release and is now used by more than 5,000 organisations worldwide. So, whether you believe in buying into the hype or not, Databricks does look like a tool worth becoming familiar with.
What experience did I have before?
Before starting the learning for this certification, I was already working on a project that leveraged Databricks to solve logistical problems. I’d recently challenged myself to learn more, and in line with this, I earned the Databricks Certified Associate Developer for Apache Spark 3.0 qualification.
However, whilst that was certainly useful, I still felt like I was operating as an ETL engineer rather than a well-rounded data engineer. This was when I saw that the Data Engineering Professional exam was available to take.
The course seemed to be exactly what I needed to take my skills to the next level, and it was delivered using Python and SQL, my daily driver languages. What wasn’t to love about that?
My journey and my challenges
Now, if you or I were hoping for an easy ride to a brand-new, shiny certificate to post on LinkedIn, we were about to be disappointed. It’s called professional for a reason.
With this qualification being very new, there were some small issues to get to grips with. The main one was that Advanced Data Engineering, unlike the first two courses on the learning path (Data Science and Engineering Workspace, and Optimising Apache Spark), didn’t exist. I searched for hours; I couldn’t find it. I couldn’t find reference to it by anyone online. After nearly giving up, I stumbled across the Databricks official GitHub account, and lo and behold, there it was, in all its pre-release glory, the source code for the missing course.
It was the day after, when the first two courses disappeared from my learning account, that I saw the news: the official release of this course was being delayed until June 2022.
At this point I’d invested too much time and energy to just give up, so what was the solution? If the learning guide didn’t exist, I’d just make my own.
So, I cloned the GitHub repo, found access to the first two courses by dredging through my browsing history, and set to work. Two days and a copious amount of tea drinking later, I’d done it. A (nearly) comprehensive learning guide that would later help me go on to pass the exam to become a Databricks professional within my first year of being a data engineer.
That all sounds wonderful, but what did you actually learn about?
You’ve made it this far, so you want to know what technical aspects are covered in the exam and what problems I faced. I hear you; grab a seat and a hot drink, here it comes.
First off, I started learning about the Data Science and Engineering Workspace: how to create and manage clusters and jobs, how to install custom libraries on a cluster, and how to link Databricks up to a hosted Git service.
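To give a flavour of the kind of task this part of the learning covers, here’s a minimal sketch of installing a PyPI library on an existing cluster through the REST API. The workspace URL, token, cluster ID, and package are my own placeholders, not values from the course:

```python
# Minimal sketch: install a PyPI library on a running cluster via the
# Databricks Libraries API. All values below are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "cluster_id": "<cluster-id>",
    "libraries": [{"pypi": {"package": "great-expectations"}}],
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()  # the endpoint returns an empty body on success
```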
Next, I moved on to learning how to optimise Spark and solve common problems such as data skew, spill, storage issues, shuffles, and serialisation of command logic. Personally, I found this incredibly interesting, but be prepared to sit through many hours of videos if you want to fully immerse yourself in this topic. A word of advice if you want to take this certification: try to remember where in the Spark UI to find both the symptoms of and the fixes for common optimisation problems.
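As a taste of what the optimisation material looks like in practice, here’s a minimal sketch of two common ways of tackling a skewed join. It assumes you’re in a Databricks notebook where `spark` is already defined, and the table names are made up:

```python
# Minimal sketch of mitigating a skewed join; table names are placeholders.
from pyspark.sql.functions import broadcast

# Adaptive Query Execution can split skewed partitions for you at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

orders = spark.table("orders")        # large table, skewed on customer_id
customers = spark.table("customers")  # small enough to broadcast

# Broadcasting the small side avoids shuffling the skewed large side at all.
joined = orders.join(broadcast(customers), "customer_id")
```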
Finally, the big one: advanced data engineering. This covered a wide range of topics, including the following (I’ve sketched a couple of them in code after the list):
- Architecting for the Lakehouse
- Setting up tables and databases
- Optimising data storage
- Advanced Delta Lake transactions
- Streaming data patterns (including change data capture)
- How to manage data in motion
- Cloning tables using deep and shallow clones, and file retention
- Leveraging Autoloader
- Micro-batch streaming
- Streaming deduplication for data quality
- Quality enforcement
- Privacy and Governance for the Lakehouse
- Stored and materialised views
- Lookup tables for Personally Identifiable Information (PII)
- How to store PII securely
- Conditional access to PII using roles and groups
- Processing records using change data feed (CDF)
- Propagating delete requests using CDF
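To make a couple of those bullets concrete, here’s a rough sketch of Auto Loader ingestion combined with streaming deduplication into a Delta table. The paths, table name, and column names are placeholders of my own, not taken from the course:

```python
# Rough sketch: incremental ingestion with Auto Loader plus streaming
# deduplication into a Delta table. Paths and names are placeholders.
from pyspark.sql.functions import col

raw = (
    spark.readStream
    .format("cloudFiles")                                   # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")
    .load("/mnt/landing/orders")
)

deduped = (
    raw.withColumn("event_time", col("event_time").cast("timestamp"))
    .withWatermark("event_time", "30 minutes")
    .dropDuplicates(["order_id", "event_time"])             # streaming dedup
)

query = (
    deduped.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
    .trigger(availableNow=True)   # process available files in micro-batches
    .toTable("orders_bronze")
)
```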
Now, something to note: although it was never covered (probably because, at the time of writing, the course isn’t finished), the exam specification expects you to have knowledge of the Databricks CLI and REST API, how to integrate MLflow models into an ETL process, and how to assign permissions and policies to users and groups. Because of this, and having taken the exam, I would recommend reading through the documentation for all of these.
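For the MLflow piece in particular, the pattern worth reading up on is wrapping a registered model as a Spark UDF so it can sit inside an ETL step. A rough sketch, where the model name, columns, and tables are my own placeholders:

```python
# Rough sketch: scoring records in an ETL step with a registered MLflow model
# wrapped as a Spark UDF. Model name, columns, and tables are placeholders.
import mlflow.pyfunc

model_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/delivery_time_estimator/Production",
)

features = spark.table("silver_deliveries")

scored = features.withColumn(
    "predicted_delivery_hours",
    model_udf("distance_km", "parcel_weight_kg", "depot_id"),
)

scored.write.mode("overwrite").saveAsTable("gold_delivery_predictions")
```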
What was the exam like?
Hard, sadly, unsurprisingly. Although it does a very good job of examining what was covered, there were some questions that could only realistically be answered using experience.
For example, in the Databricks Jobs API, the create endpoint is shown to create a job using a JSON payload. Now, say you were wondering what would happen if the same payload was sent three times; I assumed it would create one job and overwrite it on the subsequent API calls. As it happens, three identical jobs are created. Who knew?
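For context, the call in question looks roughly like this (the workspace URL, token, cluster ID, and notebook path are my own placeholders). Run it three times and you get three separate jobs with three different job IDs, not one:

```python
# Rough sketch of repeating the same Jobs API create call; all values are
# placeholders. Each POST creates a brand-new job with its own job_id.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

for _ in range(3):
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    print(response.json())  # a different job_id each time
```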
Another gotcha I found was about what Databricks notebooks look like under the hood. For months I’d been saying how great it was that you can export Databricks notebooks as Python source files to run as scripts, which also makes version control so much easier. Never in all that time did I think to memorise what the top of that source file looked like.
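If you haven’t memorised it either, this is roughly what the top of a notebook exported as a Python source file looks like. The cell contents here are filler of my own; it’s the marker comments that matter:

```python
# Databricks notebook source
# MAGIC %md
# MAGIC ### Example markdown cell (filler content for illustration)

# COMMAND ----------

# Each "# COMMAND ----------" marker separates one notebook cell from the next.
print("filler code cell")
```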
Whilst these knowledge gaps may be plugged when the final course is released, a colleague mentioned that a similar thing is happening on the Databricks Data Engineer Associate exam, with some questions being easy to answer if you’ve come across the scenario in real life, but easy to fall into a logic trap with if you’ve only read the learning materials.
How has it improved me?
This exam, and mainly the learning and experiences I had before passing it, introduced me to so many ways of working that I now include in my toolkit.
I returned to my project with a newfound passion for optimising our existing pipeline, improving data quality checking, and securing our solution and the data inside of it. These are skills and considerations that I will use in all future projects, and ones I would encourage others to develop.
That’s great Alex, but is it worth the effort?
Absolutely. This learning has given me a greater understanding of how I can use Databricks to solve data problems.
If you do want an easier experience, patience is a virtue, and you can just wait for the official course to come out.
Having the support to learn so many new things as part of my job is what makes coming to work each day so exciting.
Interested in finding out more about careers in Data? Visit our careers site to learn more about life at Kainos and our current opportunities. https://careers.kainos.com/gb/en