Why Big Data Academy
I’m Barbara and I joined Kainos in early July 2022 as a Trainee Data Engineer. Previously, I worked as an analyst, so working with data is familiar to me. However, I wanted to approach data more from the technical side, and I had long been interested in the architecture and performance of big data systems.
As I am finishing my Master's degree in Computer Science, I decided it was the perfect time to start a new career path! I had already heard about the Data Academy at Kainos from participants in previous editions, and without hesitation, I decided to attend this year's academy.
The choice turned out to be the right one! Kainos is a fast-growing company with years of experience in developing technology and applying the latest IT solutions.
The Big Data Academy
This year's academy consisted of eight weeks of intensive training in the latest Big Data solutions. During the first two weeks, we had the opportunity to attend lectures and workshops conducted by top specialists.
We covered topics such as:
- Introduction to data-intensive systems
- HDFS Architecture
- Azure storage
- Queuing systems (Kafka)
- SQL and NoSQL databases
- Apache Spark and Databricks
- Elasticsearch and Azure Data Factory
The next four weeks were dedicated to hands-on learning. Working in an agile team, we created from scratch a solution for real-time monitoring of aviation data, described in more detail in the next section. At the end of the academy, we had some time dedicated to studying for the Azure Data Fundamentals (DP-900) certification.
Big Data Academy Project 2022
Goal
The goal of the project was to gain practical data engineering experience by designing and implementing a big data solution using real-time flight data. The solution was to cover the end-to-end data flow, from data obtained through dedicated APIs to the visualization stage. As part of our work on the project, we decided to implement four use cases based on the flight data API, with the final visualizations designed in Power BI.
Use case 1: Live flight tracking
Use case 2: Real-time heatmap of air traffic
Use case 3: CO2 emissions from flights
Use case 4: Live schedules for airports
Implementing the above use cases gave every academy participant an opportunity to practice data engineering techniques. Both streaming and batch data were used, as well as additional datasets that were joined to the flight data.
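To give a flavour of what that joining involves, here is a minimal PySpark sketch of enriching a stream of flight events with a static reference dataset, along the lines of the CO2 use case. It assumes a Databricks notebook (where `spark` is predefined); the schema, file paths, and the `kg_co2_per_km` column are hypothetical stand-ins, not the project's actual datasets.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical shape of one flight event; the real API payload was richer.
flight_schema = StructType([
    StructField("icao24", StringType()),        # aircraft transponder ID
    StructField("aircraft_type", StringType()),
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("velocity", DoubleType()),      # ground speed in m/s
])

# Streaming source: JSON files landing in a folder (stand-in for the live feed).
flights = (spark.readStream
           .schema(flight_schema)
           .json("/mnt/landing/flights/"))

# Static reference dataset, e.g. emission factors per aircraft type
# (hypothetical file; any small lookup table is joined the same way).
emissions = (spark.read
             .option("header", True)
             .option("inferSchema", True)
             .csv("/mnt/reference/aircraft_emissions.csv"))  # aircraft_type, kg_co2_per_km

# Stream-static join plus a derived CO2-rate column (m/s * 3.6 = km/h).
enriched = (flights
            .join(F.broadcast(emissions), on="aircraft_type", how="left")
            .withColumn("est_co2_kg_per_hour",
                        F.col("velocity") * 3.6 * F.col("kg_co2_per_km")))

(enriched.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/enriched/")
         .start("/mnt/delta/flights_enriched/"))
```

A stream-static join like this lets the small lookup table be broadcast to every executor, so each incoming micro-batch of flight events is enriched without shuffling the stream.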
Architecture
During the project, the architecture of the solution changed several times. Initially, the architecture did not include a division into Bronze, Silver, and Gold Delta Tables, and the transformed data was sent directly to CosmosDB. However, it turned out that this approach duplicated data-processing work, whereas the division into tables made it possible for the individual use cases to reuse the data. The architecture was sufficiently efficient and met the project requirements, but due to the limited project time, the solution was only tested on a limited basis. The final architecture is shown in the graphic below. Each of the use cases was implemented on top of the same architecture.

Databricks Notebook
Streaming data is obtained by a Databricks notebook calling the flight API and is then sent to a dedicated Event Hub instance. Batch data is stored in files, which are then ingested and processed in a Data Factory pipeline. Then, following the medallion lakehouse architecture, the raw data is saved to the Bronze Delta Table. In the next step, a Databricks notebook or a Data Factory activity reads the data from the Bronze Delta Table; matches, merges, adjusts, and cleanses it; and finally saves it to the Silver Delta Table. Eventually, the data from the Silver Delta Table is aggregated, grouped accordingly, and saved to the Gold Delta Table and CosmosDB.
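As a rough illustration of this flow, the sketch below mirrors the three medallion layers in PySpark Structured Streaming. It assumes a Databricks notebook (with `spark`, `sc`, and `dbutils` predefined) and the Azure Event Hubs Spark connector; the secret scope, schema, and all paths are hypothetical placeholders rather than the project's actual configuration.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical shape of one flight event inside the Event Hub message body.
flight_schema = StructType([
    StructField("icao24", StringType()),
    StructField("origin_country", StringType()),
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
])

# --- Bronze: land the raw Event Hub payload unchanged ------------------------
conn = dbutils.secrets.get("kv-scope", "eventhub-connection-string")  # placeholder
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn),
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

(raw.select(F.col("body").cast("string").alias("json"),
            F.col("enqueuedTime").alias("ingested_at"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze/")
    .start("/mnt/delta/bronze/flights/"))

# --- Silver: parse, cleanse, and deduplicate the Bronze records --------------
bronze = spark.readStream.format("delta").load("/mnt/delta/bronze/flights/")
silver = (bronze
          .select(F.from_json("json", flight_schema).alias("f"), "ingested_at")
          .select("f.*", "ingested_at")
          .filter("latitude IS NOT NULL AND longitude IS NOT NULL")
          .withWatermark("ingested_at", "10 minutes")
          .dropDuplicates(["icao24", "ingested_at"]))

(silver.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver/")
    .start("/mnt/delta/silver/flights/"))

# --- Gold: aggregate for serving, e.g. flights per country per 5 minutes -----
gold = (spark.readStream.format("delta").load("/mnt/delta/silver/flights/")
        .withWatermark("ingested_at", "10 minutes")
        .groupBy(F.window("ingested_at", "5 minutes"), "origin_country")
        .count())

(gold.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/gold/")
    .start("/mnt/delta/gold/air_traffic/"))
```

The actual solution additionally wrote the Gold-level aggregates to CosmosDB for the Power BI reports; that step is omitted here for brevity.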
The orchestration of the entire solution was implemented in Azure Data Factory. Work planning, collaboration, and code development were supported by Azure DevOps Services.
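For a sense of what that orchestration layer involves, below is a minimal sketch of defining a Data Factory pipeline with a single Databricks notebook activity, assuming the azure-mgmt-datafactory Python SDK. All resource names, the notebook path, and the linked service are placeholders, and the project's pipelines were not necessarily authored this way; treat it purely as an illustration.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity, LinkedServiceReference, PipelineResource,
)

# Placeholder identifiers; substitute real Azure resource names.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One activity that runs a Silver-layer transformation notebook on Databricks.
silver_activity = DatabricksNotebookActivity(
    name="TransformBronzeToSilver",
    notebook_path="/Repos/academy/silver_transform",    # placeholder path
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="AzureDatabricksLinkedService",  # placeholder
    ),
)

pipeline = PipelineResource(activities=[silver_activity])
adf.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "flights-silver-pipeline", pipeline
)
```

In practice, chaining such activities and scheduling them with triggers is what lets Data Factory coordinate the notebooks and batch ingestion described above.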
Final results
As a result of the project, we created four interactive Power BI reports that support real-time data updates. Sample visualizations for each use case are shown in the graphics below.




Outcomes for the team
From a team member's perspective, the main takeaway from the project is that collaboration is key. All team members were eager to work together and did so well, bringing fresh ideas to the challenges we faced. Tasks moved forward quickly thanks to shared experience, knowledge, and solutions from similar problems.
The stand-ups and retrospectives gave us a platform to come together as a team and bring ideas to the table. We all feel we learned a lot about the technologies, tools, and processes required for a big data engineering project, and the academy has given us the confidence to carry this into future projects.