Big Data Conference - what, how and why?
Recently, thanks to the support of the Kainos Innovation team, I had the opportunity to attend the Big Data Warsaw Summit 2022. I must admit that I found the whole conference experience amazing, and it was a privilege to participate in such an event.
The whole event consisted of two main parts: a full-day workshop and a two-day conference.
The workshop
The workshop that I chose (due to the characteristics of my daily tasks and personal interests) concerned real-time streaming.
In this one-day workshop, we learned how to process unbounded real-time data streams using popular open-source frameworks.
Our main focus was Apache Flink and Apache Kafka, two of the most promising open-source streaming technologies, which are increasingly used in commercial projects.
During the course, we simulated a real-world, end-to-end scenario: processing logs generated by users interacting with a mobile application in real time. All exercises were performed in a local Docker environment.
Topic focus
We focused on several larger topics broken down into smaller areas, for example:
- Introduction to Apache Kafka
- Introduction to Apache Flink:
  - Key concepts behind stream processing
  - Building a streaming pipeline with Flink
- Timely Stream Processing:
  - Notions of time, windowing and aggregations
- Connecting to the external world:
  - Flink integration with Apache Kafka
- Stateful Stream Processing:
  - Fault tolerance
  - Advanced time handling
  - Stateful operations
Each topic ended with very interesting exercises. Throughout the day, the lecturers not only provided the necessary knowledge but also put a lot of emphasis on the practical side, which was great and taught me a lot of useful things.
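To give a feel for the "notions of time, windowing and aggregations" topic, here is a minimal sketch in plain Python. It is not Flink code and not the workshop's exercise: the event fields (`user_id`, `event_time_s`, `action`), the 60-second window size and the function names are all assumptions made for illustration. It groups app log events into tumbling windows based on the time the event happened, not the time it arrived.

```python
from collections import defaultdict

WINDOW_SIZE_S = 60  # assumed window length in seconds


def window_start(event_time_s: int, size_s: int = WINDOW_SIZE_S) -> int:
    """Return the start of the tumbling window that contains the event."""
    return event_time_s - (event_time_s % size_s)


def count_events_per_window(events, size_s: int = WINDOW_SIZE_S):
    """Count events per (user, window) using each event's own timestamp."""
    counts = defaultdict(int)
    for event in events:
        key = (event["user_id"], window_start(event["event_time_s"], size_s))
        counts[key] += 1
    return dict(counts)


# Hypothetical app log events; they arrive out of order on purpose.
events = [
    {"user_id": "alice", "event_time_s": 5, "action": "open_app"},
    {"user_id": "bob", "event_time_s": 61, "action": "open_app"},
    {"user_id": "alice", "event_time_s": 59, "action": "click"},
    {"user_id": "alice", "event_time_s": 60, "action": "click"},
]

result = count_events_per_window(events)
for (user_id, start), count in sorted(result.items()):
    print(f"{user_id} window [{start}, {start + WINDOW_SIZE_S}) -> {count}")
```

Because the window is derived from the event's own timestamp, arrival order does not change the result. A real streaming engine additionally has to decide when a window is complete and keep its state safe across failures, which is what the later workshop topics covered.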
The 3 presentations
As for the conference part, it was really great and I could write a separate article about each presentation. For now, however, I would like to briefly describe the three talks that I personally found the most interesting.
The first of my favorite talks argued that the Data Mesh paradigm is a strong candidate to replace the centralized data lake and data warehouse as the dominant architectural patterns in data and analytics. It promotes the concept of domain-oriented data products that go beyond file sharing, towards guaranteed quality and clear data ownership.
Thanks to personal experience in applying the Data Mesh concept in practice, as well as dedicated field research, the presenter discovered the most common problems at different stages of the journey and identified effective methods to overcome these challenges.
The lecture offered both technical and organizational insights, covering companies that are just starting to promote a change of mindset when working with data, companies that are already transforming their data infrastructure landscape, and advanced companies that are already working on federated management configurations for a sustainable data-driven future.
Second presentation
The second presentation concerned scaling a data lake with Apache Iceberg. The presenter raised many important issues, among others:
- Common issues with data lakes
- What is Apache Iceberg and what problems does it solve?
- Building a CDC archive at Shopify using Iceberg
- Management / considerations when using Iceberg
We also received a brief introduction to what is next for the presenter's company and Iceberg (Type-1 dimensions using Iceberg's V2 spec with row-level deletion).
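For readers unfamiliar with the term, a Type-1 dimension keeps only the latest version of each row, with no history. The sketch below illustrates that idea in plain Python by folding a list of change events into the current state. It does not use Iceberg and is not the presenter's implementation: the event shape (`seq`, `op`, `id`, `row`) and the function name are hypothetical.

```python
def apply_cdc_type1(events):
    """Fold CDC events into the current state, keeping no history (Type 1)."""
    current = {}
    # Process in commit order so later changes overwrite earlier ones.
    for event in sorted(events, key=lambda e: e["seq"]):
        key = event["id"]
        if event["op"] == "delete":
            current.pop(key, None)  # row-level delete
        else:  # "insert" or "update"
            current[key] = event["row"]
    return current


# Hypothetical change events captured from a source table.
cdc_events = [
    {"seq": 1, "op": "insert", "id": 1, "row": {"name": "Ann", "city": "Warsaw"}},
    {"seq": 2, "op": "insert", "id": 2, "row": {"name": "Ben", "city": "Gdansk"}},
    {"seq": 3, "op": "update", "id": 1, "row": {"name": "Ann", "city": "Krakow"}},
    {"seq": 4, "op": "delete", "id": 2, "row": None},
]

dimension = apply_cdc_type1(cdc_events)
print(dimension)
```

The archive of raw change events keeps the full history, while the Type-1 view answers "what does the row look like now?". Deleting or overwriting individual rows is exactly the kind of operation that row-level deletion is meant to make practical on a data lake table.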
Third presentation
The third and last, but not least, talk was about building data pipelines on a modern data platform with DBT.
In this case, the presentation focused on the fact that data engineering was once a difficult problem that only people with a background in software engineering could solve.
In addition, the number of use cases and analytical needs waiting for a solution in companies exceeds the capacity of data engineering teams, overloading them and leaving business departments waiting a long time for their use cases to be implemented.
On the other hand, business departments would like to implement the data pipelines themselves.
But they couldn't do it well for a long time, mainly because they lacked the engineering skills required to work efficiently and deliver technical quality.
Today, we are watching this obstacle fade thanks to the maturity of modern data platforms and to tools that make it easier to implement pipelines in accordance with DataOps best practices.
Tools like DBT are great, but they are just pieces of a bigger puzzle. There is a need to take those pieces, these advancements in data tooling, and combine them into a coherent, unified structure of data pipelines that guides analytics engineers in developing pipelines from beginning to end.
Conference overview
During this year's edition, we had over 40 presentations and 14 roundtable discussions, presented by over 70 speakers.
After analyzing all the speeches, one can come to the following conclusions:
- AI / ML is everywhere now, and e-commerce and commerce seem to have become some of the hottest AI / ML sectors.
- New technologies that support SQL as the main query language are being used more and more today.
- All innovation happens in the cloud, so all data-driven businesses are migrating there or starting there.
- Nowadays, companies have more and more data, so new data discovery tools are being developed as well as new data access concepts such as data mesh.
- Data ingestion pipelines are often built in a real-time streaming manner while maintaining adequate data quality, and data sources such as video, images, and voice are now adopted more frequently.
It is also worth adding that there are many benefits of participating in such meetings, such as:
- Sharpening your knowledge,
- Gaining and sharing new ideas and best practices with your team,
- Learning about the latest innovations and insights,
- Meeting industry experts face to face,
- Engaging in high-level debates and refining your ideas.
I am glad that Kainos enables employees to take part in such initiatives, and I look forward to more opportunities of this kind in the future. To everyone who has not yet had the chance to participate in this type of event: I highly recommend it.