Monitoring Public Transportation with the Kafka Ecosystem (Demo Project)

Intro

As part of taking the Data Streaming Nanodegree at Udacity, I completed a demo project, written in Python, that provides hands-on experience with a variety of concepts, tools, and frameworks around Apache Kafka:

  • Kafka brokers, topics, producers, consumers
  • Serializing Kafka messages with Apache Avro
  • Kafka Connect
  • Kafka REST Proxy
  • Stream Processing with Robinhood Faust
  • KSQL
  • Schema Registry

I believe the project can be a handy reference point if you’re exploring how to add data streaming workflows to your own system.

This article gives you an overview of the system in case you decide to explore it in more depth.

You can find the source code and instructions on how to run it locally on GitHub.

About the Nanodegree

I encourage you to take the Nanodegree yourself. The lectures will arm you with some theoretical background to help you evolve your understanding of Data Streaming systems and the technologies surrounding them.

What I like most about the learning process is that it’s quite interactive, with plenty of hands-on exercises after each section. Plus, you get to complete two larger, more practical projects, one at the end of each section.

Also, you get great support from the mentors, who answered any questions I had within just a few hours.

In my opinion, taking the course is a pretty good investment(*) for your career in the age of Microservices and Big Data systems.

(*) Don’t let the pricing discourage you – at the time I took the Nanodegree, every customer could use a discount coupon that reduced the cost by up to 90%!

Still, don’t expect to become an expert in the field just by completing the course – as you know, that takes years of experience working on and solving all kinds of real-world edge cases.

Project Overview

This demo system constructs an event-driven pipeline that simulates and displays train station data in real time. The data originates from a public dataset provided by the Chicago Transit Authority.

As a final result, you’ll see a web page similar to this one:

Fig. 1 – Demo Project User Interface

For every station, you can see the train arrivals. Additionally, turnstile data is presented together with a weather condition indicator.

Of course, this project is not meant to be extremely useful from a functional perspective.

Instead, the main objective here is to showcase how to integrate various data streaming workflows into a coherent system.

Furthermore, the system is not production-ready. I hope you’ll appreciate that certain tradeoffs have been made (from a coding and design standpoint) to keep things relatively simple and easy to follow.

Workflows

In this section, I will present each of the data flows through the system and the Data Streaming Architecture behind them.

Streaming Stations Data via Kafka JDBC Source Connector and Faust Consumer

The first task is to present a list of all stations:

Fig. 2 – Listing the Stations

Data Flow Diagram

Fig. 3 – Stations Listing Data Flow

  1. Stations data is first loaded into a Postgres database during the project setup.
  2. Then we configure a Kafka JDBC Source connector that polls the stations from the database and loads them into a Kafka topic (see the registration sketch after this list).
  3. We’re using Faust to transform the data stream into a more convenient format (see the Faust agent sketch after this list).
  4. Faust sends the transformed data to an output Kafka topic.
  5. A Kafka consumer is subscribed to the topic and updates the UI models.
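
For step 2, the connector can be registered through the Kafka Connect REST API. Here’s a minimal sketch of such a registration call, assuming default ports; the connector name, database credentials, table, and column names are illustrative assumptions rather than the project’s exact configuration:

    import json

    import requests

    # Kafka Connect's REST API listens on port 8083 by default.
    CONNECT_URL = "http://localhost:8083/connectors"

    connector = {
        "name": "stations-jdbc-source",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:postgresql://localhost:5432/cta",  # assumed DSN
            "connection.user": "cta_admin",           # assumed credentials
            "connection.password": "chicago",
            "table.whitelist": "stations",            # only poll the stations table
            "mode": "incrementing",                   # pick up rows with a higher id
            "incrementing.column.name": "stop_id",
            "topic.prefix": "connect-",               # results in the "connect-stations" topic
            "poll.interval.ms": "60000",
        },
    }

    resp = requests.post(
        CONNECT_URL,
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()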
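
For steps 3 and 4, here’s a minimal Faust agent sketch. The topic names and record fields are assumptions that mirror, but may not exactly match, the project’s code:

    import faust


    class Station(faust.Record):
        """Raw station record, as loaded by the JDBC source connector."""
        stop_id: int
        station_name: str
        red: bool
        blue: bool
        green: bool


    class TransformedStation(faust.Record):
        """Simplified record with a single line field."""
        station_id: int
        station_name: str
        line: str


    app = faust.App("stations-stream", broker="kafka://localhost:9092")
    source_topic = app.topic("connect-stations", value_type=Station)
    sink_topic = app.topic("org.chicago.cta.stations.transformed",
                           value_type=TransformedStation)


    @app.agent(source_topic)
    async def transform(stations):
        async for station in stations:
            # Collapse the three boolean line flags into one descriptive field.
            line = "red" if station.red else "blue" if station.blue else "green"
            await sink_topic.send(
                value=TransformedStation(
                    station_id=station.stop_id,
                    station_name=station.station_name,
                    line=line,
                )
            )

    # Run with: faust -A <module_name> worker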

Streaming Train Arrivals with Kafka Producers and Consumers

The next piece we’ll review is the train arrivals:

Fig. 4 – Train Arrivals

Data Flow Diagram

Fig. 5 – Train Arrivals Data Flow

  1. The data is initially stored in a CSV file provided by the Chicago Transit Authority.
  2. The Simulator is a core component responsible for orchestrating event generation based on the input CSV files.
  3. Every station publishes train arrival data to its own Kafka topic.
  4. A Kafka consumer is subscribed to the arrivals topics and updates the UI when a new message comes in.

This workflow uses Apache Avro for data serialization.
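
As a rough illustration of the producer side, here’s a sketch that uses the confluent_kafka Avro helper to publish an arrival event; the topic name, schema, and URLs are assumptions rather than the project’s exact values:

    from confluent_kafka import avro
    from confluent_kafka.avro import AvroProducer

    # Hypothetical Avro schema for an arrival event.
    value_schema = avro.loads("""
    {
      "type": "record",
      "name": "arrival",
      "namespace": "org.chicago.cta",
      "fields": [
        {"name": "station_id", "type": "int"},
        {"name": "train_id", "type": "string"},
        {"name": "direction", "type": "string"}
      ]
    }
    """)

    producer = AvroProducer(
        {
            "bootstrap.servers": "PLAINTEXT://localhost:9092",
            "schema.registry.url": "http://localhost:8081",
        },
        default_value_schema=value_schema,
    )

    # Each station publishes to its own topic, hence the per-station suffix.
    producer.produce(
        topic="org.chicago.cta.station.arrivals.some_station",
        value={"station_id": 40010, "train_id": "RD101", "direction": "b"},
    )
    producer.flush()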

Aggregating Turnstile Counts per Station via KSQL

Let’s now discuss processing the turnstile data:

Fig. 6 – Turnstiles

Data Flow Diagram

Fig. 7 – Turnstiles Data Flow

  1. The data is initially stored in CSV files provided by the Chicago Transit Authority.
  2. The Simulator reads the data from the CSV files and manages the turnstiles events generation.
  3. Turnstile events are published to a Kafka topic.
  4. A KSQL query is executed against the topic and aggregates the total turnstile count for each station.
  5. The aggregated data is published to another Kafka topic.
  6. A Kafka consumer receives the turnstile summary and updates the UI.

The initial part of this workflow uses Apache Avro for serialization; after the KSQL aggregation, the data is stored as JSON.
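
To give you an idea of what the aggregation can look like, here’s a sketch that submits KSQL statements to the KSQL server’s REST API; the stream/table names, columns, and URL are assumptions, not the project’s exact statements:

    import json

    import requests

    # The KSQL server's REST API listens on port 8088 by default.
    KSQL_URL = "http://localhost:8088/ksql"

    STATEMENTS = """
    CREATE STREAM turnstile (
        station_id INT,
        station_name VARCHAR
    ) WITH (
        KAFKA_TOPIC = 'org.chicago.cta.turnstiles',
        VALUE_FORMAT = 'AVRO'
    );

    CREATE TABLE turnstile_summary
    WITH (VALUE_FORMAT = 'JSON') AS
        SELECT station_id, COUNT(station_id) AS count
        FROM turnstile
        GROUP BY station_id;
    """

    resp = requests.post(
        KSQL_URL,
        headers={"Content-Type": "application/vnd.ksql.v1+json"},
        data=json.dumps({
            "ksql": STATEMENTS,
            "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
        }),
    )
    resp.raise_for_status()

The CREATE TABLE statement materializes the per-station counts into a new topic, which is where the JSON-encoded summary comes from.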

Streaming Weather Data with Kafka REST Proxy

The weather data is displayed in the top right corner of the web page:

Fig. 8 – Weather Indicator

Data Flow Diagram

Fig. 9 – Weather Indicator Workflow

  1. The weather data is generated by the Simulator component.
  2. It is published via Kafka REST Proxy (HTTP POST calls) to a Kafka topic.
  3. A Kafka consumer receives the weather data and updates the UI.

This workflow uses Apache Avro for data serialization.
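
Here’s a sketch of what such an HTTP POST call can look like against REST Proxy’s v2 API; the topic name, schema, and proxy URL are assumptions:

    import json

    import requests

    # Kafka REST Proxy listens on port 8082 by default.
    REST_PROXY_URL = "http://localhost:8082"

    # Hypothetical Avro schema for a weather event.
    value_schema = json.dumps({
        "type": "record",
        "name": "weather",
        "namespace": "org.chicago.cta",
        "fields": [
            {"name": "temperature", "type": "float"},
            {"name": "status", "type": "string"},
        ],
    })

    payload = {
        "value_schema": value_schema,
        "records": [{"value": {"temperature": 70.5, "status": "sunny"}}],
    }

    resp = requests.post(
        f"{REST_PROXY_URL}/topics/org.chicago.cta.weather",
        headers={"Content-Type": "application/vnd.kafka.avro.v2+json"},
        data=json.dumps(payload),
    )
    resp.raise_for_status()

The REST Proxy response also includes a value_schema_id that later calls can reuse instead of resending the full schema.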

Give it a Try

I hope this project will help you gain some practical insight into your use case. If you want to take it for a spin, please follow the instructions on GitHub.

Thanks for reading!

Resources

  1. Udacity Nanodegree
  2. Source Code
