Lab Exercise: Wire the Medallion Architecture
This is an example architecture of a data platform. A data platform typically follows the Medallion Architecture. The example provides the services and an incomplete docker-compose file. It implements an analytics platform and needs to be modified to follow the Medallion Architecture pattern:
| Layer | Purpose | Technology |
|---|---|---|
| Bronze | Raw, unvalidated data as it arrives | Shared volume |
| Silver | Cleaned, validated, deduplicated data | RustFS — Parquet files (S3-compatible object store) |
| Gold | Aggregated, analytics-ready data | PostgreSQL |
| Serve | Dashboard / query interface for end-users | Grafana |
All services are already implemented. You do not write any application code.
Your task is to define the architecture by completing docker-compose.yml.
Target Picture
The goal is a running data platform where sales data flows through three layers and is visible as charts in Grafana:
raw CSV files Browser
│ │
▼ ▼
[ Bronze ] ──► [ Silver / RustFS ] ──► [ Gold / PostgreSQL ] ──► [ Grafana ]
(raw data) (S3 object store) (aggregated tables) (dashboards)
Six services need to be wired together to make this work. How they communicate, and which ones should be reachable from outside, is what you need to decide.
Your Task
To implement the data platform, complete the following tasks:
- Use your own products: change the Products and Regions in services/data-generator/generate.py
- Complete docker-compose.yml by answering these architectural questions:
1. Volumes — Data persistence
Docker volumes are managed storage areas that exist independently of any container. Unlike bind mounts, volumes are fully managed by Docker, survive container restarts and removals, and can be shared across multiple containers simultaneously.
See: Docker volumes
- Which data layers must survive a container restart?
- Which are transient pipeline staging areas?
- How do you give multiple services access to the same volume?
2. Networks — Communication boundaries
If you define no networks yourself, Compose attaches every service to a single default network on which all containers can reach each other. A user-defined bridge network connects only a chosen group of containers, which reach each other by service name. Containers that share no network cannot communicate. A container can join multiple networks, making it reachable from each of them. Ports are only accessible from the host machine if explicitly published.
See: Docker networking overview
- Which services need to communicate, and which should be isolated?
- Should the data-generator be able to reach the gold database? Why not?
- Should Grafana be able to reach RustFS directly? Why not?
- silver-to-gold reads from RustFS and writes to PostgreSQL. Which networks does it need?
  - Hint: a service can belong to more than one network.
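The reachability rule described in this section can be modelled in a few lines of Python. This is a minimal sketch and not the lab solution: the service names (svc-a, svc-b, svc-c) and network names (front, back) are hypothetical placeholders, and the model captures only one rule, namely that two containers can talk when they share at least one network.

```python
from typing import Dict, Set

# Hypothetical topology: which networks each service joins.
# These names are placeholders, not the services of this lab.
TOPOLOGY: Dict[str, Set[str]] = {
    "svc-a": {"front"},
    "svc-b": {"front", "back"},  # member of both networks
    "svc-c": {"back"},
}


def can_reach(topology: Dict[str, Set[str]], src: str, dst: str) -> bool:
    """Return True if src and dst share at least one network."""
    return bool(topology[src] & topology[dst])


if __name__ == "__main__":
    for src, dst in [("svc-a", "svc-b"), ("svc-b", "svc-c"), ("svc-a", "svc-c")]:
        print(f"{src} -> {dst}: {can_reach(TOPOLOGY, src, dst)}")
```

Note that svc-b being on both networks does not make it a router: in this model svc-a still cannot reach svc-c.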
3. Endpoints — Service discovery
Each entry under services: in Compose defines a named container. Docker's embedded DNS automatically resolves a service name to its container IP within the same network — so containers address each other by service name instead of hardcoded IP addresses.
See: Compose networking and service discovery
- Both bronze-to-silver and silver-to-gold connect to RustFS via RUSTFS_ENDPOINT. What hostname does Docker assign to a service on the same network?
- How does silver-to-gold address the PostgreSQL container in DATABASE_URL?
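To see which part of a connection URL Docker's DNS actually resolves, the sketch below splits a URL with Python's standard library. The example value is an assumption assembled from the Grafana data-source settings given later in this document (host postgres, port 5432, database gold, user analytics). The URL scheme and the exact format expected by silver-to-gold are not confirmed here.

```python
from urllib.parse import urlsplit

# Assumed example value; the format the real service expects may differ.
EXAMPLE_DATABASE_URL = "postgresql://analytics:analytics@postgres:5432/gold"


def describe_endpoint(url: str) -> dict:
    """Split a connection URL into host, port, database and user."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,  # the name Docker's embedded DNS has to resolve
        "port": parts.port,      # the container port, not a published host port
        "database": parts.path.lstrip("/"),
        "user": parts.username,
    }


if __name__ == "__main__":
    print(describe_endpoint(EXAMPLE_DATABASE_URL))
```

The host component is the only part that depends on the network wiring: it resolves only when both containers share a network.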
4. Ports — External exposure
Publishing a port maps a port on the host machine to a port inside the container (host:container). Without a published port, a service is only reachable by other containers on the same network — never from outside Docker.
- Which services should be reachable from your browser?
- The RustFS console (port 9001) lets you browse uploaded files — is that useful during the exercise?
- Which services should never be reachable from outside the platform?
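One way to check from the host whether a port is reachable is to attempt a TCP connection to it. The helper below is a hypothetical convenience script, not part of the repository. Ports 3000 and 9001 are the ones this document mentions for Grafana and the RustFS console; an open port only shows that something is listening there, not which program it is.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    for port in (3000, 9001):
        state = "open" if is_port_open("localhost", port) else "closed"
        print(f"localhost:{port} is {state}")
```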
5. Startup order — Dependencies
depends_on controls the start order of services. By default it only waits for a container to start, not for the application inside it to be ready. Use condition: service_healthy together with a healthcheck to wait until a service is actually accepting connections.
See: Control startup and shutdown order in Compose
- silver-to-gold must not start before PostgreSQL is accepting connections. How do you express this?
- The pipeline services have built-in retry logic, so strict ordering between generator and transformer is less critical. What is the correct conceptual order nonetheless?
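The retry logic of the pipeline services is not shown in this document. Purely as an illustration, a retry loop of that kind could look like the sketch below. The function name wait_until_ready and its parameters are hypothetical, and the real services may work differently.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 10, delay: float = 2.0) -> bool:
    """Call probe() until it returns True or the attempts are used up.

    probe is any zero-argument readiness check, for example "can I open a
    database connection?". Exceptions raised by probe count as "not ready".
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused: the dependency is not up yet
        if attempt < attempts:
            time.sleep(delay)  # wait before the next attempt
    return False
```

A healthcheck combined with condition: service_healthy moves this waiting out of the application and into Compose, which is why both approaches can coexist.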
Running the Platform
# Build custom service images
docker compose build
# Start everything
docker compose up
# Watch the data flow
docker compose logs -f
# Stop and clean up (add -v to also delete volumes)
docker compose down
Verifying the Architecture
Only attempt this section once all TODOs in docker-compose.yml are complete and docker compose up starts all six services without errors.
Once running, verify each layer is working:
Bronze layer — raw files arriving:
docker compose exec bronze-to-silver ls /bronze
You should see timestamped CSV files and a processed/ subdirectory.
New files appear roughly every 30 seconds, and processed ones move into processed/:
orders_20240315_120032_b0.csv
orders_20240315_120102_b1.csv
processed/
If the directory is empty, the data-generator cannot reach the bronze volume —
check your volume mounts.
Silver layer — objects uploaded to RustFS:
Open the RustFS console in your browser:
http://localhost:9001
Login: rustfsadmin / rustfsadmin
Navigate to the silver bucket. You should see one clean_ prefixed Parquet
object for every file that passed through bronze:
clean_orders_20240315_120032_b0.parquet
clean_orders_20240315_120102_b1.parquet
Notice that the files are significantly smaller than the original CSVs despite holding the same data — this is the effect of columnar storage with Snappy compression.
If the bucket is missing or empty, bronze-to-silver cannot reach RustFS —
check that both services are on the same network and that RUSTFS_ENDPOINT
uses the correct service name.
Gold layer — aggregated data in PostgreSQL:
docker compose exec postgres psql -U analytics -d gold \
-c "SELECT * FROM sales_by_product ORDER BY total_revenue DESC;"
You should see a table of products with aggregated revenue figures:
product | total_revenue | order_count | avg_order_value | last_updated
--------------------+---------------+-------------+-----------------+-------------------------------
Docking Station | 18432.50 | 23 | 801.41 | 2024-03-15 12:02:45.123456
Laptop | 15987.00 | 18 | 888.17 | 2024-03-15 12:02:45.123456
Monitor | 12301.75 | 31 | 396.83 | 2024-03-15 12:02:45.123456
...
If the table is empty or does not exist, silver-to-gold cannot reach PostgreSQL —
check your DATABASE_URL and network configuration.
Serving layer — open Grafana in your browser:
http://localhost:3000
You should see the Grafana login page. Login: admin / lab
If the page does not load, Grafana's port is not exposed to the host — check
your ports configuration.
Once logged in, add a PostgreSQL data source:
- Host: postgres:5432 ← the Compose service name, not localhost
- Database: gold
- User: analytics / Password: analytics
- TLS/SSL mode: disable
Click Save & Test — you should see "Database Connection OK". If the test fails, Grafana and PostgreSQL are not on the same network.
Then create a panel with:
SELECT product, total_revenue, order_count FROM sales_by_product ORDER BY total_revenue DESC;
Deliverables
- Architecture Documentation including:
  - A completed docker-compose.yml with the platform running end-to-end.
    - Screenshots of the Grafana Dashboard
  - A C4 Deployment Diagram for the platform (https://c4model.com/diagrams/deployment).
    Map each container onto its deployment node (Docker container) and show how nodes are grouped into networks. Include port assignments for any service that is externally exposed. This is the right level to document network topology and the boundary between internal and external traffic.
- Answer the following questions:
  - The data-generator cannot reach the gold database in a correctly wired architecture. What would go wrong architecturally if it could?
  - If the RustFS silver store is lost (e.g. docker compose down -v), what happens to the gold layer? Is that a problem?
  - The silver layer uses object storage (S3) rather than a shared volume. What architectural advantages does that give you compared to a plain volume?
  - The silver layer stores Parquet files instead of CSV. Why is Parquet a better choice for an intermediate layer in a data platform? Consider:
    - How are columns stored compared to CSV?
    - What happens to data types when you read a CSV vs. a Parquet file?
    - How does file size compare, and why does that matter in object storage?
    - If silver-to-gold only needed the amount and product columns, how would Parquet help?
- The
Supply-Chain Security: Generating a Software Bill of Materials
A Software Bill of Materials (SBOM) is a formal, machine-readable inventory of every software component in a system — the open-source libraries, OS packages, and runtime dependencies that a platform is built from. Just as a physical product has a bill of materials listing every part, a software platform has an SBOM listing every component.
SBOMs matter because most security vulnerabilities are introduced not by the code you write, but by the third-party components you depend on. The EU Cyber Resilience Act requires manufacturers and operators of products with digital elements to maintain an accurate SBOM — making this a compliance requirement, not just a best practice.
In this exercise you will generate a CycloneDX SBOM for the data platform and analyse what you find.
Note
Current Info (04.2026): Cybersecurity threats are everywhere. The tool used below, trivy, was itself attacked recently. Because trivy is widely used, it is an attractive target. The current issue mainly targets CI environments, but local machines are affected as well.
Trivy blog post: https://www.aquasec.com/blog/trivy-supply-chain-attack-what-you-need-to-know/
Microsoft Security-Blog: https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/
EU CERT Advisory: https://cert.europa.eu/blog/european-commission-cloud-breach-trivy-supply-chain
Other supply chain attack: https://www.microsoft.com/en-us/security/blog/2026/04/01/mitigating-the-axios-npm-supply-chain-compromise/
Prerequisites
- Install Trivy: https://trivy.dev/docs/latest/getting-started/installation/
- Have the platform images built: docker compose build
- Create a local output directory (this is git-ignored): mkdir -p sbom
Note
Make sure to use one of the known safe versions (https://github.com/aquasecurity/trivy/security/advisories/GHSA-69fq-xp46-6x23):
| Component | Safe Version |
|---|---|
| Trivy binary | v0.69.2, v0.69.3 |
Part 1 — Per-Image SBOMs
A container image is a layered filesystem. Every layer may add OS packages, language runtime packages, or application files — all of which become part of your supply chain.
Use trivy image with CycloneDX output format and license scanning enabled to generate an SBOM for each of the six platform images. Consult the Trivy documentation to find the correct flags. Write each SBOM to sbom/<service>.cdx.json.
Images to scan:
| Image | Built by |
|---|---|
| msa_sai_lab-data-generator | Custom (this project) |
| msa_sai_lab-bronze-to-silver | Custom (this project) |
| msa_sai_lab-silver-to-gold | Custom (this project) |
| rustfs/rustfs:latest | Third-party |
| postgres:16 | Third-party |
| grafana/grafana:latest | Third-party |
Reflection questions — answer in your deliverable:
- Which image has the most components? Which has the fewest? What explains the difference?
- Two images are tagged :latest (rustfs/rustfs:latest, grafana/grafana:latest), while one is pinned to a major version (postgres:16). What supply-chain risk does a :latest tag introduce? How does Trivy record the actual image identity in the SBOM?
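To compare the per-image SBOMs, it helps to count their components. CycloneDX JSON documents keep the inventory in a top-level components array, and the sketch below relies only on that array and on each component's type field. The script itself and the sbom/*.cdx.json glob (following the naming used in this exercise) are illustrative assumptions, not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def component_stats(sbom: dict) -> Counter:
    """Count the components of a CycloneDX document, grouped by their type."""
    return Counter(c.get("type", "unknown") for c in sbom.get("components", []))


if __name__ == "__main__":
    # Summarise every SBOM found in the local sbom/ directory.
    for path in sorted(Path("sbom").glob("*.cdx.json")):
        stats = component_stats(json.loads(path.read_text(encoding="utf-8")))
        print(f"{path.name}: {sum(stats.values())} components {dict(stats)}")
```

Grouping by type makes it easier to see whether a large count comes from OS packages or from language libraries.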
Part 2 — Platform-Wide SBOM
Now generate a single SBOM that covers the entire project directory using trivy fs. This command scans the filesystem for package manifests (such as requirements.txt, package.json, go.mod) and Dockerfiles.
Write the output to sbom/platform.cdx.json.
Note: Each custom service declares its dependencies in a requirements.txt file, which trivy fs uses to detect Python packages. This is why manifest files matter for supply-chain tooling: dependencies declared only as inline shell commands (e.g. RUN pip install boto3) are invisible to filesystem scanners.

On licenses: You will notice that platform.cdx.json lists components without license information, while the per-image SBOMs from Part 1 do include licenses. This is a fundamental difference between the two scanning modes: trivy image scans the full image filesystem including site-packages, where Python stores license metadata (.dist-info directories). trivy fs only has access to the source tree; it knows which packages are declared but cannot read their license metadata without the installed environment. For complete license information, the per-image SBOMs are authoritative.
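The difference in license coverage can be measured directly. In CycloneDX, a component carries its license data in an optional licenses array, and the sketch below counts how many components have a non-empty one. The script name and command line are illustrative assumptions, and the file names follow the naming used in this exercise.

```python
import json
import sys
from pathlib import Path


def license_coverage(sbom: dict) -> tuple:
    """Return (components with a non-empty licenses array, total components)."""
    components = sbom.get("components", [])
    licensed = sum(1 for c in components if c.get("licenses"))
    return licensed, len(components)


if __name__ == "__main__":
    # Hypothetical usage:
    #   python license_coverage.py sbom/platform.cdx.json sbom/silver-to-gold.cdx.json
    for name in sys.argv[1:]:
        sbom = json.loads(Path(name).read_text(encoding="utf-8"))
        licensed, total = license_coverage(sbom)
        print(f"{name}: {licensed}/{total} components with license data")
```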
Reflection questions — answer in your deliverable:
- A new regulation (e.g. the EU Cyber Resilience Act) requires your organisation to provide an SBOM for any software platform it operates. What would you submit? Which of the two approaches — per-image or platform-wide — better satisfies that requirement, and why?
Architectural Reflection
- The data-generator has no pip install step and relies entirely on the Python standard library. What does its SBOM tell you about its attack surface compared to silver-to-gold? What is the architectural principle at work here?
Deliverables
Add the following to your submission:
- The generated SBOM files:
  - sbom/data-generator.cdx.json
  - sbom/bronze-to-silver.cdx.json
  - sbom/silver-to-gold.cdx.json
  - sbom/rustfs.cdx.json
  - sbom/postgres.cdx.json
  - sbom/grafana.cdx.json
  - sbom/platform.cdx.json
- Written answers to all reflection questions in Parts 1, 2, and the Architectural Reflection above.