Architecture Lab
This repo provides examples for working with modern data platform architectures.
A data platform is a unified system for efficiently managing and analyzing large datasets. It integrates components like databases, data lakes, and data warehouses to handle structured and/or unstructured data, depending on the use case.
Anatomy of a Data Platform — How to choose your data architecture
Jupyter notebooks are used to create pipelines implementing a simplified medallion architecture.
A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).
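To make the layering concrete, here is a minimal PySpark sketch of such a Bronze ⇒ Silver ⇒ Gold flow. All paths, column names, and the aggregation are assumptions for illustration, not the repo's actual notebook code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: ingest the raw data as-is, only persisting it in a queryable format.
bronze = spark.read.json("data/raw/events/")  # hypothetical source path
bronze.write.mode("overwrite").parquet("data/bronze/events")

# Silver: clean and conform (deduplicate, drop broken records, derive columns).
silver = (
    spark.read.parquet("data/bronze/events")
    .dropDuplicates(["event_id"])           # hypothetical key column
    .filter(F.col("event_ts").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
)
silver.write.mode("overwrite").parquet("data/silver/events")

# Gold: aggregate into a consumption-ready table.
gold = (
    spark.read.parquet("data/silver/events")
    .groupBy("event_date")
    .agg(F.count("*").alias("event_count"))
)
gold.write.mode("overwrite").parquet("data/gold/daily_event_counts")
```

Persisting each layer separately means later steps can be rerun without re-ingesting the raw data.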
Prerequisites
Containers
The examples are provided as Docker Compose files, so a working container setup with Docker or a similar tool is needed. For good developer ergonomics, a decent shell is also required.
Note
Docker: The compose files were created on Linux with docker-ce and tested on Windows with Docker Desktop and on macOS with OrbStack. Other container environments like podman may work but may need adaptations.
Shell
In Unix-like environments such as macOS and Linux, a good shell (bash, zsh) is typically available out of the box, in combination with a terminal (Terminal, iTerm, Konsole, gnome-terminal, ...).
On Windows, a good shell/terminal combination is PowerShell with Windows Terminal.
Note
PowerShell: For Windows users it might be necessary to set the execution policy for PowerShell:
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
Warning
cmd.exe: If you use cmd.exe, you are on your own. Nobody should be using this legacy command interpreter anymore!
Python
A modern Python package manager should be used to simplify dependency management and environment setup.
Note
uv: I very much recommend uv, "An extremely fast Python package and project manager, written in Rust."
Examples
1_simple
Basic setup for working with Jupyter and PySpark.
2_storage_stream
Introduces Apache Kafka for streaming data and MinIO as an S3-compatible storage backend.
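As a rough sketch of what this example enables, the snippet below produces a few events to Kafka (via the kafka-python package) and writes an object to MinIO through its S3-compatible API (via boto3). The broker address, topic, bucket, and credentials are assumptions based on typical local defaults, not the repo's actual configuration:

```python
import json

import boto3
from kafka import KafkaProducer  # kafka-python

# Produce a few JSON events to a Kafka topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for i in range(3):
    producer.send("events", {"event_id": i, "payload": "hello"})
producer.flush()

# Write an object to MinIO via its S3-compatible API.
# Endpoint and credentials assume MinIO's common local defaults;
# the bucket is assumed to already exist.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.put_object(Bucket="bronze", Key="events/sample.json", Body=b'{"event_id": 0}')
```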
3_pipeline
Shows a simple data pipeline with Bronze/Silver/Gold notebooks, storing data in Parquet format and using DuckDB for data processing.
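For a feel of the DuckDB-on-Parquet approach, here is a minimal sketch of the Silver and Gold steps as SQL. The file layout and column names are illustrative assumptions; the repo's notebooks may differ:

```python
from pathlib import Path

import duckdb

# DuckDB's COPY does not create directories, so create the layer folders first.
for layer in ("silver", "gold"):
    Path(f"data/{layer}").mkdir(parents=True, exist_ok=True)

con = duckdb.connect()

# Silver: read bronze Parquet files, deduplicate, drop broken rows, persist.
con.sql(
    """
    COPY (
        SELECT DISTINCT event_id, event_ts, payload
        FROM read_parquet('data/bronze/*.parquet')
        WHERE event_ts IS NOT NULL
    ) TO 'data/silver/events.parquet' (FORMAT PARQUET)
    """
)

# Gold: aggregate the silver data into a consumption-ready table.
con.sql(
    """
    COPY (
        SELECT CAST(event_ts AS DATE) AS event_date, COUNT(*) AS event_count
        FROM read_parquet('data/silver/events.parquet')
        GROUP BY 1
        ORDER BY 1
    ) TO 'data/gold/daily_counts.parquet' (FORMAT PARQUET)
    """
)
```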
4_user_interface
A Streamlit app to visualize the processed pipeline data from the Gold layer.
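A Streamlit app of this kind can be very small. The sketch below assumes a Gold table like the hypothetical daily counts from the pipeline sketch above; the path and column names are assumptions, not the repo's actual app:

```python
# app.py -- run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Gold layer: daily event counts")

# Load the aggregated Gold table (assumed path and schema).
df = pd.read_parquet("data/gold/daily_counts.parquet")

st.dataframe(df)
st.bar_chart(df.set_index("event_date")["event_count"])
```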