Venturing Beyond Hello World: Mastering Containerization and Orchestration
Context: Everything here lives in personal repos (docker_multilang_project, Car-Match backend, ProjectHub proxy). I’ve never owned production containers.
AI assist: ChatGPT condensed my Docker/EKS notes; I validated each callout against the actual repos and AWS lab scripts on 2025-10-15.
Status: Learning log. Use it to gauge my current level, not as evidence of SRE tenure.
Reality snapshot
- Day-to-day dev happens in Docker Compose (Node + Python + Postgres or Mongo).
- When I want to stretch, I follow AWS workshops to stand up an EKS cluster with Terraform, deploy the same services, and watch how health checks + autoscaling behave.
- Observability equals stdout JSON logs, `/healthz` routes, and occasionally Prometheus/Grafana during labs. No 24/7 pager yet.
Compose: my default sandbox
Stack anatomy
```yaml
services:
  api:
    build: ./api
    env_file: .env.api
    ports: ["4000:4000"]
    depends_on: [db]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
  frontend:
    build:
      context: ./frontend
      target: production
    ports: ["8080:80"]
    depends_on: [api]
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```
- Why it works: Health checks gate traffic, `.env` files keep secrets out of the compose file, and named volumes preserve data between runs.
- Observability: Services log JSON with request IDs so `docker compose logs api` stitches calls together.
- Chaos drills: I kill containers with `docker compose kill api` to verify the frontend fails gracefully and recovers once the container restarts.
Lessons
- Multi-stage builds keep images small (e.g., a `node:20-alpine` + `npm ci` builder stage, then a slim runtime stage).
- Mounting local certs into containers makes HTTPS dev possible without messing with the host (see the override sketch after this list).
- Documenting every command (`docs/dev-runbook.md`) stops classmates from asking, “why doesn’t this container start?”
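To make the cert bullet concrete, here is roughly what that override looks like. It is a sketch: the `./certs` directory, the nginx cert path, and the 8443 port are illustrative stand-ins, not copied from the repo.

```yaml
# docker-compose.override.yml (illustrative) – mount locally generated certs
# (e.g. from mkcert) into the frontend container for HTTPS during development.
services:
  frontend:
    ports:
      - "8443:443"                    # HTTPS alongside the existing 8080:80 mapping
    volumes:
      - ./certs:/etc/nginx/certs:ro   # read-only mount; nothing gets baked into the image
```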
EKS labs: leveling up (still sandboxed)
What I practice
- Terraform provisioning – VPC, node groups, IAM roles. All lives in `labs/eks/terraform`.
- Deployments & services – Basic `Deployment` + `Service` manifests, ConfigMaps for environment variables, Secrets for credentials.
- Ingress – AWS Load Balancer Controller with TLS certs for the sample domain (an example manifest is sketched below).
- Observability – Prometheus + Grafana via Helm, scraping the demo pods.
- Autoscaling & rolling updates – HPA based on CPU + custom metrics, `kubectl rollout status` demos (see the HPA sketch below).
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: $ECR_URI/api:${GITHUB_SHA}
          ports:
            - containerPort: 4000
          envFrom:
            - secretRef:
                name: api-secrets
          readinessProbe:
            httpGet:
              path: /ready
              port: 4000
            initialDelaySeconds: 5
            periodSeconds: 10
```
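The Ingress piece, roughly as the workshops have me write it. This is a sketch: the hostname, certificate ARN, and service port are placeholders, and the annotation set is the minimal one I remember from the AWS Load Balancer Controller docs, so it is worth re-checking before reuse.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:REGION:ACCOUNT:certificate/PLACEHOLDER
spec:
  ingressClassName: alb
  rules:
    - host: demo.example.com          # sample domain, not a real deployment
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 4000
```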
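And the autoscaling piece as a sketch against the `api` Deployment above, assuming the metrics server add-on is installed. The 70% CPU target and the replica bounds are placeholders, and I have left out the custom-metrics half because that wiring changes per workshop.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api            # the Deployment shown above
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # placeholder target from the lab exercises
```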
Honest caveats
- Traffic is synthetic (k6 scripts + curl). No paying customers.
- IAM roles follow workshop defaults. Before touching production I’d need a full review.
- I rely on AWS Cloud9 + workshop accounts. Costs are low, but I still tear everything down immediately after lab time.
Tooling & guardrails
- Container builds: `npm run build:docker` or `scripts/docker-build.sh` ensure multi-stage builds use the same base images.
- Security: `npm audit`, `docker scan`, and occasional Trivy runs keep dependencies honest. Findings go in the repo issues list (a CI version of the Trivy run is sketched after this list).
- Docs: Every repo has `docs/runbook.md` (start/stop commands, health checks, log locations, TODOs). For EKS labs I add Terraform diagrams + `destroy` instructions.
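A sketch of how the occasional Trivy run could become a CI gate. Nothing like this is merged yet; the workflow name, the `api:ci` tag, and the action inputs are from memory of the upstream `aquasecurity/trivy-action` README, so they would need double-checking.

```yaml
# .github/workflows/image-scan.yml (hypothetical) – fail PRs on HIGH/CRITICAL findings.
name: image-scan
on: [pull_request]
jobs:
  trivy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t api:ci ./api     # build the same image the compose stack uses
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: api:ci
          severity: HIGH,CRITICAL
          exit-code: '1'                      # non-zero exit fails the job on findings
```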
What I’m working on next
- Add automated smoke tests (k6 or Playwright) that hit the running Compose stack on CI before merging (see the workflow sketch after this list).
- Package Terraform/EKS lab into a repeatable template so I can spin it up faster (and share with classmates).
- Explore App Mesh or Linkerd to understand service meshes before I make claims about them.
- Figure out how to shrink cold-start time on the Render backend (Car-Match) without leaving the free tier.
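A sketch of the first item above as a GitHub Actions workflow. The file name, the `.env.example` copy, and the `/healthz` wait loop are assumptions about how I would wire it, not something that exists in the repos yet.

```yaml
# .github/workflows/compose-smoke.yml (hypothetical) – boot the stack and hit /healthz before merging.
name: compose-smoke
on: [pull_request]
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cp .env.example .env.api    # assumption: the checked-in example env is enough for a smoke run
      - run: docker compose up -d --build
        env:
          POSTGRES_PASSWORD: ci-only-password   # the compose file expects this; a throwaway value is fine here
      - run: |
          # Poll the API health check for up to ~60s; dump logs and fail if it never goes healthy.
          for i in $(seq 1 30); do
            if curl -fsS http://localhost:4000/healthz; then exit 0; fi
            sleep 2
          done
          docker compose logs api
          exit 1
      - if: always()
        run: docker compose down -v      # always tear the stack down, even on failure
```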
Failure stories and fixes
- Images ballooned past 1 GB: I forgot multi-stage builds. Switched to a `node:20-alpine` builder + slim runtime; the image dropped to ~150 MB.
- Crash-looping pods on EKS: Liveness probes hit `/healthz` before the app bound the port. Added `initialDelaySeconds` and a lightweight `/ready` endpoint (probe sketch after this list).
- Mystery 502s through the ALB: Uppercase headers + missing `te: trailers` on gRPC paths. Normalized headers and documented it in the runbook.
- Compose env drift: Teammates kept stale `.env` files. Added `.env.example` + `scripts/verify-env.sh` to diff and fail fast.
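The probe fix from the crash-loop story, shown as a sketch; the delay and period numbers here are illustrative, since the real values were tuned per service.

```yaml
# Cheap /ready endpoint for readiness; give the process time to bind the port
# before the liveness probe is allowed to restart it.
readinessProbe:
  httpGet:
    path: /ready
    port: 4000
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 4000
  initialDelaySeconds: 15   # placeholder; long enough for the app to start listening
  periodSeconds: 20
```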
Guardrails I follow
- Keep `docker-compose.yml` as the source of truth; avoid “just run `docker run`” instructions.
- Health checks everywhere (Compose, K8s, and app-level) so failures surface quickly.
- Tear down labs immediately (`terraform destroy`, `kubectl delete ns demo`) to avoid surprise bills.
- Document every drill in `docs/runbook.md` with a date, command, and lesson learned.
Interview-ready narratives
- Why Compose first: It’s the quickest way for me to prove a multi-service app works end-to-end. Once stable, I port the same health checks and env patterns to K8s.
- How I debug crashes: Start with container logs, then health checks, then dependencies. If it’s K8s, I add `kubectl describe pod` to see events before blaming app code.
- What I know vs. don’t know: Comfortable with Compose + basic K8s deployments; not experienced with service meshes, multi-tenant clusters, or production on-call. I say that upfront.
Labs I plan to add
- Blue/green + canary rollouts via Argo or plain `kubectl rollout` patterns.
- Pod disruption budgets and eviction drills to see how apps behave during node maintenance (a starting PDB is sketched after this list).
- Centralized logging stack (Fluent Bit + Loki) with correlation IDs wired through HTTP headers.
- Cost tracking on EKS using AWS Cost Explorer tags, just to see how quickly experiments add up.
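For the disruption-budget lab, the starting point would be something like this, assuming the `app: api` label from the Deployment above; the `minAvailable` value is a placeholder that would depend on the replica count.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # keep at least two api pods up during voluntary evictions
  selector:
    matchLabels:
      app: api
```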