The Kainos Big Data Academy - analysing traffic data

Date posted
7 June 2018
Reading time
7 Minutes

Previously in our Big Data Academy blog series, we introduced the Academy attendees and asked them for their own perspective on what the Academy is, what they can learn and what they expect from it. In this second part of the series, we focus on what the internal project is actually about.

What?

The Tristar project (the Integrated Traffic Management System in Gdansk, Gdynia and Sopot) collects a massive amount of data and uses it to manage the traffic light system in the Tricity. Some of this data, especially delays in public transport, is open to everyone thanks to the Gdansk Open Data project. Why not combine it with, for example, weather data and a sports/events calendar, and create a system that can be used to check the impact of weather on traffic delays, predict possible delays, and produce heatmaps, statistics and much more?

Why?

All Academy attendees worked on this project. Its goal was to apply newly gained knowledge, and to understand what kinds of problems can appear during development and where the knowledge gaps are. It also prepares new joiners for working within the Scrum framework and the Gitflow process. Last but not least, it's just fun to finally write some code after a couple of weeks of lectures and labs :)

Besides, a system that holds a lot of historical traffic data can actually be useful outside the Academy. Let's say you overslept and you're already late for work. It's raining cats and dogs outside, and you just don't have time to think about which tram or bus you should take. This system can analyse historical data and look for the fastest way to your destination, given the time, weather conditions and starting point.

What are the use cases?

A map that shows current (real-time) delays when you tap on a bus stop.

A heat map with all stops, where the colour represents the mean delay over the last X hours. Maybe an animation showing the situation on different days or at different times of day would be interesting?

A chart that shows the mean delay at different stops. Let's say we want to know which departure time is least likely to be delayed. The red line represents the departure at 8 am, the green one at 8:50 am, and the blue one at 9:15 am. It's clear that the one at 9:15 am has the lowest delay.

Those are ideas based on the traffic data alone. Who knows what could be accomplished when we add weather and calendar data?
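To make the heat map idea a bit more concrete, here is a minimal Python sketch of how a "mean delay over the last X hours" value could be computed per stop. The record layout (stop name, observation time, delay in seconds), the stop names and the function name are all assumptions made for illustration; they are not taken from the Tristar or Gdansk Open Data formats.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical records: (stop name, observation time, delay in seconds).
observations = [
    ("Stop A", datetime(2018, 6, 7, 8, 0), 120),
    ("Stop A", datetime(2018, 6, 7, 8, 30), 60),
    ("Stop B", datetime(2018, 6, 7, 8, 15), 0),
    ("Stop A", datetime(2018, 6, 6, 8, 0), 600),  # a day old, outside the window
]


def mean_delay_per_stop(records, now, hours):
    """Return {stop: mean delay in seconds} for records from the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    delays = defaultdict(list)
    for stop, observed_at, delay_s in records:
        if cutoff <= observed_at <= now:
            delays[stop].append(delay_s)
    return {stop: mean(values) for stop, values in delays.items()}


heat = mean_delay_per_stop(observations, now=datetime(2018, 6, 7, 9, 0), hours=2)
print(heat)
```

Each value in the resulting dictionary could then be mapped to a colour for the corresponding stop on the map.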

How?

We wanted to make this project as 'big-data-like' as possible. First, we download all the data from Tristar, weather prediction systems and event calendars. We use the API Monitor module for that task. The API Monitor asks for fresh data every couple of minutes, parses it using Spark and Akka, and eventually loads it into two branches: the Real-time Layer and the Batch Layer. Together, the two layers form a lambda architecture. If you ask the system for the newest data, you ask the Real-time Layer. It uses Redis as a fast, in-memory cache database and keeps only the newest data. If you want to compute some kind of historical statistics, you ask the Batch Layer. It holds all the data ever collected and, again using Spark, can prepare some interesting figures, such as the median or mean delay over the past week. Sometimes the Batch Layer communicates with the Real-time Layer and updates the Redis database.

We're almost finished with the internal project. Collected data can be presented as heatmaps, charts and tables. We have a lot of ideas and use cases in mind, and we can't wait to implement them!
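As a rough illustration of the two-layer split described in this section, the following Python sketch routes every parsed record to both a real-time store and a batch store. It is a simplification under stated assumptions, not the project's code: a plain dictionary stands in for Redis, a list stands in for the batch storage that Spark would query, and all class and function names are hypothetical.

```python
from statistics import mean, median


class RealTimeLayer:
    """Keeps only the newest delay per stop (a dict stands in for Redis)."""

    def __init__(self):
        self._latest = {}

    def update(self, stop, timestamp, delay_s):
        current = self._latest.get(stop)
        # Overwrite only if this record is at least as new as the stored one.
        if current is None or timestamp >= current[0]:
            self._latest[stop] = (timestamp, delay_s)

    def newest_delay(self, stop):
        entry = self._latest.get(stop)
        return None if entry is None else entry[1]


class BatchLayer:
    """Keeps every record ever collected (a list stands in for batch storage)."""

    def __init__(self):
        self._records = []

    def append(self, stop, timestamp, delay_s):
        self._records.append((stop, timestamp, delay_s))

    def stats(self, stop, start, end):
        """Mean and median delay for one stop within [start, end]."""
        delays = [d for s, t, d in self._records if s == stop and start <= t <= end]
        if not delays:
            return None
        return {"mean": mean(delays), "median": median(delays)}


def ingest(record, real_time, batch):
    """Send one parsed record to both layers."""
    stop, timestamp, delay_s = record
    real_time.update(stop, timestamp, delay_s)
    batch.append(stop, timestamp, delay_s)


real_time, batch = RealTimeLayer(), BatchLayer()
# Hypothetical records: (stop, timestamp in minutes, delay in seconds).
for record in [("Stop A", 1, 30), ("Stop A", 2, 90), ("Stop A", 3, 60)]:
    ingest(record, real_time, batch)

print(real_time.newest_delay("Stop A"))
print(batch.stats("Stop A", start=1, end=3))
```

Questions about "now" go to the real-time object, while historical statistics are answered from the full record list, which mirrors the division of work between the two layers.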