Data-Platform example for architecture lab (Salzburg University of Applied Sciences)

Architecture Lab

This repo provides examples for working with a modern data-platform architecture.

A data platform is a unified system for efficiently managing and analyzing large datasets. It integrates components such as databases, data lakes, and data warehouses to handle structured and/or unstructured data, depending on the use case.

Anatomy of a Data Platform — How to choose your data architecture

Jupyter notebooks are used to create pipelines that implement a simplified medallion architecture.

A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).
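
To make the layering concrete, here is a minimal sketch in plain Python. It is not taken from the notebooks in this repo: the sample records and the function names `to_silver` and `to_gold` are hypothetical and only illustrate how data quality improves from layer to layer.

```python
from collections import defaultdict
from statistics import mean

# Bronze: raw records exactly as ingested (hypothetical sample data).
# Values are still strings and may be missing or malformed.
BRONZE = [
    {"sensor": "a", "value": "1.0"},
    {"sensor": "a", "value": "3.0"},
    {"sensor": "b", "value": ""},  # missing reading
    {"sensor": "b", "value": "4.0"},
]


def to_silver(bronze):
    """Silver: keep only rows whose value parses as a float, with proper types."""
    silver = []
    for row in bronze:
        try:
            value = float(row["value"])
        except ValueError:
            continue  # drop rows that cannot be cleaned
        silver.append({"sensor": row["sensor"], "value": value})
    return silver


def to_gold(silver):
    """Gold: aggregate the cleaned rows into one mean value per sensor."""
    grouped = defaultdict(list)
    for row in silver:
        grouped[row["sensor"]].append(row["value"])
    return {sensor: mean(values) for sensor, values in grouped.items()}


if __name__ == "__main__":
    print(to_gold(to_silver(BRONZE)))
```

Each layer only reads from the layer before it, so the raw Bronze data stays untouched and can be reprocessed if the cleaning rules change.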

Medallion Architecture

Prerequisites

Containers

The examples are provided as docker compose files. A working container setup with docker or a similar tool is needed. For good developer ergonomics, a decent shell is needed as well.

Note

Docker: The compose files were created on Linux with docker-ce and tested on Windows with Docker Desktop and on Mac with OrbStack. Other container environments such as podman may work or may need adaptations.

Shell

In Unix-like environments such as Mac/Linux, a good shell (bash, zsh) is typically available out of the box, in combination with a terminal (terminal, iTerm, Konsole, gnome-terminal, ...).

On Windows, a good shell/terminal combination is PowerShell with Windows Terminal.

Note

PowerShell: Windows users might need to set the execution policy for PowerShell:

Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

Warning

cmd.exe: If you use cmd.exe, you are on your own. Nobody should use this old command interpreter anymore!

Python

A modern package manager for Python should be used to simplify dependency management and environment setup.

Note

uv: I very much recommend uv, "An extremely fast Python package and project manager, written in Rust."

Examples

1_simple

Basic setup to work with Jupyter / PySpark

2_storage_stream

Introduces Apache Kafka for streaming data and MinIO as an S3-compatible storage backend.
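
As a rough illustration of the streaming side only, the sketch below encodes a reading as JSON bytes and sends it to Kafka. Everything specific here is an assumption and not taken from the example: the `kafka-python` client, the topic name `weather`, the broker address `localhost:9092`, and the message fields.

```python
import json
from datetime import datetime, timezone


def encode_reading(city: str, temperature_c: float, observed_at: datetime) -> bytes:
    """Serialize one reading to UTF-8 JSON, a common Kafka message format."""
    payload = {
        "city": city,
        "temperature_c": temperature_c,
        "observed_at": observed_at.isoformat(),
    }
    return json.dumps(payload).encode("utf-8")


def publish(messages, topic="weather", bootstrap_servers="localhost:9092"):
    """Send already-encoded messages to a Kafka topic.

    Assumes the third-party kafka-python package and a reachable broker;
    topic name and broker address are placeholders.
    """
    from kafka import KafkaProducer  # imported lazily, only needed when publishing

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for message in messages:
        producer.send(topic, value=message)
    producer.flush()  # block until all buffered messages are delivered
    producer.close()


if __name__ == "__main__":
    # Encoding works without a broker; call publish([...]) once Kafka is running.
    print(encode_reading("Salzburg", 21.5, datetime.now(timezone.utc)))
```

Keeping the encoding separate from the producer makes the message format easy to check without a running broker.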

3_pipeline

Shows a simple data pipeline with Bronze/Silver/Gold notebooks that stores data in Parquet format and uses DuckDB for data processing.

4_user_interface

A Streamlit app that visualizes the processed pipeline data in the Gold layer.