Data Platform architecture example using containers.


Henrik Binggl 3f32165ccf doc: remove question; correct numbering		2026-04-18 12:19:58 +02:00

Lab Exercise: Wire the Medallion Architecture

This is an example architecture of a data-platform. A data-platform typically follows the Medallion Architecture. This example provides the services and an incomplete docker-compose file. It implements an analytics platform and needs to be modified to follow the Medallion Architecture pattern:

| Layer | Purpose | Technology |
|---|---|---|
| Bronze | Raw, unvalidated data as it arrives | Shared volume |
| Silver | Cleaned, validated, deduplicated data | RustFS (S3-compatible object store), Parquet files |
| Gold | Aggregated, analytics-ready data | PostgreSQL |
| Serve | Dashboard / query interface for end-users | Grafana |

All services are already implemented. You do not write any application code. Your task is to define the architecture by completing docker-compose.yml.

Target Picture

The goal is a running data platform where sales data flows through three layers and is visible as charts in Grafana:

  raw CSV files                                                  Browser
      │                                                             │
      ▼                                                             ▼
 [ Bronze ]  ──►  [ Silver / RustFS ]  ──►  [ Gold / PostgreSQL ]  ──►  [ Grafana ]
 (raw data)        (S3 object store)          (aggregated tables)        (dashboards)

Six services need to be wired together to make this work. How they communicate, and which ones should be reachable from outside, is what you need to decide.

Your Task

To implement the data-platform, complete the following tasks:

- Use your own products: change the products and regions in services/data-generator/generate.py.
- Complete docker-compose.yml by answering the architectural questions below.

1. Volumes — Data persistence

Docker volumes are managed storage areas that exist independently of any container. Unlike bind mounts, volumes are fully managed by Docker, survive container restarts and removals, and can be shared across multiple containers simultaneously.

See: Docker volumes

Which data layers must survive a container restart?
Which are transient pipeline staging areas?
How do you give multiple services access to the same volume?

2. Networks — Communication boundaries

By default, Compose attaches every service to a single default network, so all services can reach each other. A user-defined bridge network connects only a chosen group of containers, which can then reach each other by service name. Containers that do not share a network cannot communicate. A container can join multiple networks, making it reachable from each of them. Ports are only accessible from the host machine if explicitly published.

See: Docker networking overview

Which services need to communicate, and which should be isolated?
Should the data-generator be able to reach the gold database? Why not?
Should Grafana be able to reach RustFS directly? Why not?
silver-to-gold reads from RustFS and writes to PostgreSQL — which networks does it need?
Hint: a service can belong to more than one network.
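
The reachability rule behind these questions can be sketched in a few lines of Python: two services can talk to each other only if they share at least one network. The service and network names below (`web`, `api`, `db`, `frontend`, `backend`) are hypothetical and deliberately not the ones used in this lab.

```python
# Hypothetical network membership: service name -> set of networks it joins.
membership = {
    "web": {"frontend"},
    "api": {"frontend", "backend"},  # joins both networks and bridges them
    "db": {"backend"},
}


def can_reach(a: str, b: str, nets: dict[str, set[str]]) -> bool:
    """Return True if services a and b share at least one network."""
    return bool(nets[a] & nets[b])


print(can_reach("web", "api", membership))  # True
print(can_reach("api", "db", membership))   # True
print(can_reach("web", "db", membership))   # False
```

Writing down the membership table for the six lab services in the same way is a quick check of your design before you start editing docker-compose.yml.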

3. Endpoints — Service discovery

Each entry under services: in Compose defines a named container. Docker's embedded DNS automatically resolves a service name to its container IP within the same network — so containers address each other by service name instead of hardcoded IP addresses.

See: Compose networking and service discovery

Both bronze-to-silver and silver-to-gold connect to RustFS via RUSTFS_ENDPOINT. What hostname does Docker assign to a service on the same network?
How does silver-to-gold address the PostgreSQL container in DATABASE_URL?
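
As a rough illustration of what a connection URL contains, the sketch below splits one into the host and port a client would actually dial. The URL shape (`postgresql://user:password@host:port/database`) is an assumption about how DATABASE_URL is formatted. The credentials, host, and database name are the ones this README lists for the Grafana data source.

```python
from typing import Optional
from urllib.parse import urlsplit


def describe_endpoint(url: str) -> tuple[Optional[str], Optional[int]]:
    """Return the (hostname, port) a client would connect to for this URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.port


# Assumed URL shape; the values mirror the Grafana data source settings.
example = "postgresql://analytics:analytics@postgres:5432/gold"
host, port = describe_endpoint(example)
print(host, port)  # postgres 5432
```

The host part is what Docker's embedded DNS has to resolve, so it must be a name that is known on a network the calling container has joined.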

4. Ports — External exposure

Publishing a port maps a port on the host machine to a port inside the container (host:container). Without a published port, a service is only reachable by other containers on the same network — never from outside Docker.

See: Docker published ports

Which services should be reachable from your browser?
The RustFS console (port 9001) lets you browse uploaded files — is that useful during the exercise?
Which services should never be reachable from outside the platform?
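
One way to check your answers from the host is a small TCP probe. This is a minimal sketch using only the standard library. The helper name `is_port_open` is made up for illustration and is not part of the lab code.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not reachable".
        return False
```

With the platform running, `is_port_open("localhost", 3000)` should return True only if you published Grafana's port. A service you decided to keep internal should return False for every port you try from the host.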

5. Startup order — Dependencies

depends_on controls the start order of services. By default it only waits for a container to start, not for the application inside it to be ready. Use condition: service_healthy together with a healthcheck to wait until a service is actually accepting connections.

See: Control startup and shutdown order in Compose

silver-to-gold must not start before PostgreSQL is accepting connections — how do you express this?
The pipeline services have built-in retry logic, so strict ordering between generator and transformer is less critical — but what is the correct conceptual order?
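
The retry logic mentioned above is not shown in this README. The following is a generic sketch of the idea, not the lab's actual implementation: keep calling a readiness check until it succeeds or the attempts run out. The function name and its parameters are assumptions.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool], attempts: int = 10, delay: float = 2.0) -> bool:
    """Call check() until it returns True or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            time.sleep(delay)  # wait before the next try
    return False
```

Application-level retries and `condition: service_healthy` address the same problem from two sides: the former makes a service tolerant of a late dependency, the latter stops Compose from starting it too early.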

Running the Platform

# Build custom service images
docker compose build

# Start everything
docker compose up

# Watch the data flow
docker compose logs -f

# Stop and clean up (add -v to also delete volumes)
docker compose down

Verifying the Architecture

Only attempt this section once all TODOs in docker-compose.yml are complete and docker compose up starts all six services without errors.

Once running, verify each layer is working:

Bronze layer — raw files arriving:

docker compose exec bronze-to-silver ls /bronze

You should see timestamped CSV files and a processed/ subdirectory. New files appear every ~30 seconds, processed ones move into processed/:

orders_20240315_120032_b0.csv
orders_20240315_120102_b1.csv
processed/

If the directory is empty, the data-generator cannot reach the bronze volume — check your volume mounts.

Silver layer — objects uploaded to RustFS:

Open the RustFS console in your browser:

http://localhost:9001

Navigate to the silver bucket. You should see one clean_ prefixed Parquet object for every file that passed through bronze:

clean_orders_20240315_120032_b0.parquet
clean_orders_20240315_120102_b1.parquet

Notice that the files are significantly smaller than the original CSVs despite holding the same data — this is the effect of columnar storage with Snappy compression.

If the bucket is missing or empty, bronze-to-silver cannot reach RustFS — check that both services are on the same network and that RUSTFS_ENDPOINT uses the correct service name.
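
To check the one-to-one relationship between bronze files and silver objects, a short helper can help. The naming rule used here (`clean_` + original name with a `.parquet` extension) is inferred from the example listings above and is an assumption about what bronze-to-silver does. Both function names are hypothetical.

```python
from pathlib import PurePosixPath


def expected_silver_key(bronze_name: str) -> str:
    """Map a bronze CSV file name to the silver object name seen in the examples."""
    return f"clean_{PurePosixPath(bronze_name).stem}.parquet"


def missing_in_silver(bronze_files: list[str], silver_keys: list[str]) -> list[str]:
    """Return expected silver objects that are absent from the bucket listing."""
    present = set(silver_keys)
    return [key for key in map(expected_silver_key, bronze_files) if key not in present]
```

Feeding it the file names from `/bronze/processed` and the object names shown in the RustFS console yields the files that were picked up but never arrived in silver.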

Gold layer — aggregated data in PostgreSQL:

docker compose exec postgres psql -U analytics -d gold \
  -c "SELECT * FROM sales_by_product ORDER BY total_revenue DESC;"

You should see a table of products with aggregated revenue figures:

      product       | total_revenue | order_count | avg_order_value |         last_updated
--------------------+---------------+-------------+-----------------+-------------------------------
 Docking Station    |      18432.50 |          23 |          801.41 | 2024-03-15 12:02:45.123456
 Laptop             |      15987.00 |          18 |          888.17 | 2024-03-15 12:02:45.123456
 Monitor            |      12301.75 |          31 |          396.83 | 2024-03-15 12:02:45.123456
 ...

If the table is empty or does not exist, silver-to-gold cannot reach PostgreSQL — check your DATABASE_URL and network configuration.

Serving layer — open Grafana in your browser:

http://localhost:3000

You should see the Grafana login page. Login: admin / lab

If the page does not load, Grafana's port is not exposed to the host — check your ports configuration.

Once logged in, add a PostgreSQL data source:

Host: postgres:5432 ← the Compose service name, not localhost
Database: gold
User: analytics / Password: analytics
TLS/SSL mode: disable

Click Save & Test — you should see "Database Connection OK". If the test fails, Grafana and PostgreSQL are not on the same network.

Then create a panel with:

SELECT product, total_revenue, order_count FROM sales_by_product ORDER BY total_revenue DESC;

Deliverables

Architecture documentation, including:
- A completed docker-compose.yml with the platform running end-to-end.
- Screenshots of the Grafana dashboard.
- A C4 Deployment Diagram for the platform (https://c4model.com/diagrams/deployment). Map each container onto its deployment node (Docker container) and show how nodes are grouped into networks. Include port assignments for any service that is externally exposed. This is the right level to document network topology and the boundary between internal and external traffic.

Answer the following questions:
- The data-generator cannot reach the gold database in a correctly wired architecture — what would go wrong architecturally if it could?
- If the RustFS silver store is lost (e.g. docker compose down -v), what happens to the gold layer? Is that a problem?
- The silver layer uses object storage (S3) rather than a shared volume. What architectural advantages does that give you compared to a plain volume?
- The silver layer stores Parquet files instead of CSV. Why is Parquet a better choice for an intermediate layer in a data platform? Consider:
  - How are columns stored compared to CSV?
  - What happens to data types when you read a CSV vs. a Parquet file?
  - How does file size compare, and why does that matter in object storage?
  - If silver-to-gold only needed the amount and product columns, how would Parquet help?

Supply-Chain Security: Generating a Software Bill of Materials

A Software Bill of Materials (SBOM) is a formal, machine-readable inventory of every software component in a system — the open-source libraries, OS packages, and runtime dependencies that a platform is built from. Just as a physical product has a bill of materials listing every part, a software platform has an SBOM listing every component.

SBOMs matter because most security vulnerabilities are introduced not by the code you write, but by the third-party components you depend on. The EU Cyber Resilience Act requires manufacturers and operators of products with digital elements to maintain an accurate SBOM — making this a compliance requirement, not just a best practice.

In this exercise you will generate a CycloneDX SBOM for the data platform and analyse what you find.

Note

Current info (04.2026): Cyber-security threats are everywhere. Trivy, the tool used below, was itself attacked recently; its wide adoption makes it an attractive target. The current issue mainly targets CI environments, but local machines are affected as well.

Trivy Blog-Post: https://www.aquasec.com/blog/trivy-supply-chain-attack-what-you-need-to-know/

Microsoft Security-Blog: https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/

EU CERT Advisory: https://cert.europa.eu/blog/european-commission-cloud-breach-trivy-supply-chain

Other supply chain attack: https://www.microsoft.com/en-us/security/blog/2026/04/01/mitigating-the-axios-npm-supply-chain-compromise/

Prerequisites

Install Trivy: https://trivy.dev/docs/latest/getting-started/installation/
Have the platform images built:
```
docker compose build
```
Create a local output directory (this is git-ignored):
```
mkdir -p sbom
```

Note

Make sure to use one of the known safe versions (https://github.com/aquasecurity/trivy/security/advisories/GHSA-69fq-xp46-6x23):

| Component | Safe Version |
|---|---|
| Trivy binary | v0.69.2, v0.69.3 |

Part 1 — Per-Image SBOMs

A container image is a layered filesystem. Every layer may add OS packages, language runtime packages, or application files — all of which become part of your supply chain.

Use trivy image with CycloneDX output format and license scanning enabled to generate an SBOM for each of the six platform images. Consult the Trivy documentation to find the correct flags. Write each SBOM to sbom/<service>.cdx.json.

Images to scan:

| Image | Built by |
|---|---|
| `msa_sai_lab-data-generator` | Custom (this project) |
| `msa_sai_lab-bronze-to-silver` | Custom (this project) |
| `msa_sai_lab-silver-to-gold` | Custom (this project) |
| `rustfs/rustfs:latest` | Third-party |
| `postgres:16` | Third-party |
| `grafana/grafana:latest` | Third-party |

Reflection questions — answer in your deliverable:

Which image has the most components? Which has the fewest? What explains the difference?
Two images are tagged :latest (rustfs/rustfs:latest, grafana/grafana:latest), while one is pinned to a major version (postgres:16). What supply-chain risk does a :latest tag introduce? How does Trivy record the actual image identity in the SBOM?
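
To compare component counts across the generated files, a small script can help. This is a minimal sketch that relies only on the top-level `components` array and each component's `type` field in a CycloneDX JSON document. The function names and the default `sbom` directory are assumptions matching the output path used in this exercise.

```python
import json
from collections import Counter
from pathlib import Path


def count_components(sbom: dict) -> Counter:
    """Count CycloneDX components grouped by their 'type' field."""
    return Counter(c.get("type", "unknown") for c in sbom.get("components", []))


def summarize_directory(directory: str = "sbom") -> dict[str, int]:
    """Return the total component count for each *.cdx.json file in a directory."""
    totals = {}
    for path in sorted(Path(directory).glob("*.cdx.json")):
        sbom = json.loads(path.read_text(encoding="utf-8"))
        totals[path.name] = sum(count_components(sbom).values())
    return totals
```

Calling `summarize_directory()` from the project root after Part 1 gives one number per image. The per-type breakdown from `count_components` helps explain why the numbers differ.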

Part 2 — Platform-Wide SBOM

Now generate a single SBOM that covers the entire project directory using trivy fs. This command scans the filesystem for package manifests (such as requirements.txt, package.json, go.mod) and Dockerfiles.

Write the output to sbom/platform.cdx.json.

Note: Each custom service declares its dependencies in a requirements.txt file, which trivy fs uses to detect Python packages. This is why manifest files matter for supply-chain tooling — dependencies declared only as inline shell commands (e.g. RUN pip install boto3) are invisible to filesystem scanners.

On licenses: You will notice that platform.cdx.json lists components without license information, while the per-image SBOMs from Part 1 do include licenses. This is a fundamental difference between the two scanning modes: trivy image scans the full image filesystem including site-packages, where Python stores license metadata (.dist-info directories). trivy fs only has access to the source tree — it knows what packages are declared but cannot read their license metadata without the installed environment. For complete license information, the per-image SBOMs are authoritative.
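
The license difference described above can be measured rather than eyeballed. The sketch below counts how many components in a CycloneDX document carry a non-empty `licenses` array. The function name is hypothetical, and the inline document is made-up sample data, not output from this platform.

```python
def license_coverage(sbom: dict) -> tuple[int, int]:
    """Return (with_license, without_license) counts for a CycloneDX document."""
    components = sbom.get("components", [])
    with_license = sum(1 for c in components if c.get("licenses"))
    return with_license, len(components) - with_license


# Made-up sample: one component with license data, one without.
sample = {
    "components": [
        {"name": "pkg-a", "licenses": [{"license": {"id": "MIT"}}]},
        {"name": "pkg-b"},
    ]
}
print(license_coverage(sample))  # (1, 1)
```

Running it over sbom/platform.cdx.json and over one of the per-image files from Part 1 makes the gap between the two scanning modes concrete.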

Reflection questions — answer in your deliverable:

A new regulation (e.g. the EU Cyber Resilience Act) requires your organisation to provide an SBOM for any software platform it operates. What would you submit? Which of the two approaches — per-image or platform-wide — better satisfies that requirement, and why?

Architectural Reflection

The data-generator has no pip install step — it relies entirely on the Python standard library. What does its SBOM tell you about its attack surface compared to silver-to-gold? What is the architectural principle at work here?

Deliverables

Add the following to your submission:

The generated SBOM files:
- sbom/data-generator.cdx.json
- sbom/bronze-to-silver.cdx.json
- sbom/silver-to-gold.cdx.json
- sbom/rustfs.cdx.json
- sbom/postgres.cdx.json
- sbom/grafana.cdx.json
- sbom/platform.cdx.json
Written answers to all reflection questions in Parts 1, 2, and the Architectural Reflection above.