AWS Cloud Support Internship: Mastering Troubleshooting and Architecture
Context: Summer 2025 AWS Cloud Support Associate internship in Seattle (Oscar building). My cohort lived inside labs, mock tickets, and certification prep—no direct customer production support.
AI assist: ChatGPT + Amazon Q Business helped summarize service docs and draft troubleshooting checklists; I edited everything before submitting to mentors.
Status: Honest reflection so recruiters see exactly what I touched (and what remains on the “practice only” list).
Reality snapshot
- 12-week program split between mornings (Cloud Practitioner/SAA coursework, architecture reviews) and afternoons (ticket simulations, hands-on labs, runbook writing).
- I rotated through EC2, S3, IAM, networking, and observability labs; each lab ended with a quiz + short retro shared with a senior engineer.
- Capstone was a media metadata pipeline built entirely in AWS sandbox accounts: S3 → Lambda (FFmpeg) → DynamoDB → API Gateway + static dashboard. No external users relied on it.
- Tracked lab completion, quiz scores, ticket MTTR in simulations, and budget alerts; shipped weekly retros (“what broke / what I fixed / what I still don’t know”) to mentors.
Table of contents
- Weekly structure
- Troubleshooting drills I ran
- Capstone: media metadata pipeline
- Tooling & automations I leaned on
- Proof & artifacts
- Gaps & next steps
- Interview stories I reuse
Weekly structure
| Week(s) | Focus | Deliverables |
|---|---|---|
| 1–2 | Orientation, Cloud Practitioner refresh | Daily lab reports, IAM policy walk-through, “how to escalate” checklist. |
| 3–4 | Linux + networking deep dive | Troubleshoot EC2 boot loops, build VPC peering diagrams, script CloudWatch log exports. |
| 5–6 | Storage & security | S3 bucket policy labs, KMS envelope encryption exercises, Bedrock prompt-logging prototype. |
| 7–8 | Observability + automation | CloudWatch dashboard for mock SaaS, Cost Explorer alarms, npm audit playbooks. |
| 9–10 | Capstone build | S3→Lambda→DynamoDB metadata pipeline, runbook, health checks, IaC template. |
| 11 | Support simulations | Pager-style ticket drills, on-call shadowing, Amazon Leadership Principles reviews. |
| 12 | Presentations + retros | Capstone demo, personal growth plan, peer feedback write-up. |
Troubleshooting drills I ran
- EC2 boot loops: Collected console output, diffed failed user-data scripts, rebuilt launch templates. Lesson: user-data must be idempotent, and CloudWatch agent config drifts silently.
- VPC reachability: Reachability Analyzer + traceroute inside bastions; caught mismatched CIDRs and missing return routes during peering labs.
- S3 “Access Denied” mazes: IAM Policy Simulator + CloudTrail to isolate missing `Principal`/`Condition` elements in cross-account bucket policies; wrote a step-by-step playbook (a minimal policy sketch follows this list).
- Cost spikes: “Rapid response” checklist: Budgets alert → Cost Explorer by service → orphaned EBS + idle NAT → snapshot/terminate → tag everything.
- CloudWatch log flooding: Temporary retention policy, metric filters for error bursts, and alarms to detect runaway debug logs.
- Bedrock prompt logging: Prototyped logging wrapper to capture prompts/outputs for audit; documented privacy/legal considerations before any production use.
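For context, here is a minimal sketch of the kind of cross-account bucket policy those “Access Denied” drills revolved around, written as CloudFormation to match the capstone template below. The bucket name, account ID, role, and org ID are placeholders, not the lab's actual values:

```yaml
Resources:
  SharedDataBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: shared-data-bucket                 # placeholder; normally !Ref of a bucket in the stack
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: AllowPartnerAccountRead
            Effect: Allow
            # The element that was usually missing in the drills: an explicit Principal.
            Principal:
              AWS: "arn:aws:iam::111122223333:role/partner-reader"   # placeholder role in the other account
            Action:
              - s3:GetObject
            Resource: "arn:aws:s3:::shared-data-bucket/*"
            # A typo'd Condition block was the other common culprit.
            Condition:
              StringEquals:
                "aws:PrincipalOrgID": "o-exampleorgid"               # placeholder org ID
```

Policy Simulator points at which statement (or missing element) blocks the call, and the matching `AccessDenied` events in CloudTrail confirm it from the request side.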
Capstone: media metadata pipeline (lab-only)
Architecture
- Input: Files land in `media-ingest-bucket`.
- Processing: Node.js 20 Lambda pulls the object, shells out to FFmpeg to extract metadata, pushes a compact JSON doc to DynamoDB.
- API: API Gateway exposes read-only endpoints so a static dashboard (S3 + CloudFront) can query the table.
- Observability: CloudWatch logs, metrics, and alarms track Lambda duration, DynamoDB throttle counts, and FFmpeg exits. Budgets/Cost Explorer alerts guard the lab account.
```yaml
# Excerpt from the capstone CloudFormation template. The Lambda execution role and
# the permission that lets S3 invoke the function are omitted from this excerpt.
Resources:
  MediaBucket:
    Type: AWS::S3::Bucket
    Properties:
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: "s3:ObjectCreated:*"
            Function: !GetAtt MetadataLambda.Arn
  MetadataLambda:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: nodejs20.x
      Handler: index.handler
      Code:
        S3Bucket: !Ref ArtifactBucket      # artifact bucket defined elsewhere in the template
        S3Key: lambda.zip
      Environment:
        Variables:
          TABLE_NAME: !Ref MetadataTable
  MetadataTable:
    Type: AWS::DynamoDB::Table
    Properties:
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: FileKey
          AttributeType: S
      KeySchema:
        - AttributeName: FileKey
          KeyType: HASH
```
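One gap worth flagging in the excerpt above: S3 can only deliver `ObjectCreated` events to `MetadataLambda` if a resource-based permission allows it. A minimal sketch of that resource, added under the same `Resources` block (scoping it further with `SourceArn` is tighter, but then the bucket name has to be spelled out to avoid a circular reference):

```yaml
  MediaBucketInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref MetadataLambda
      Principal: s3.amazonaws.com
      # Limit invocation to buckets in this account.
      SourceAccount: !Ref AWS::AccountId
```

In practice `MediaBucket` should also declare `DependsOn: MediaBucketInvokePermission`, otherwise CloudFormation may create the bucket before the permission exists and the notification configuration fails to validate.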
What worked
- Lambda stayed under 2 GB memory/30 s duration even when FFmpeg processed 250 MB sample files.
- DynamoDB recorded ~300 sample rows with zero throttling thanks to on-demand mode.
- CloudWatch dashboard (latency, invocations, FFmpeg exit codes, DynamoDB consumed RCUs) made it easy to talk through the design review.
- Step Functions “stretch goal” doc lists how I’d fan out enrichment jobs if this ever handled more than demo traffic (a rough state-machine sketch follows this list).
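To make the stretch goal concrete, here is a rough sketch of what that fan-out could look like, written as an inline Amazon States Language definition in CloudFormation; the execution role and the enrichment Lambda ARN are placeholders, and nothing here has been deployed:

```yaml
  EnrichmentStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn: "arn:aws:iam::111122223333:role/states-execution-role"   # placeholder role
      Definition:
        Comment: Fan out per-file enrichment jobs instead of one long Lambda run
        StartAt: FanOutEnrichment
        States:
          FanOutEnrichment:
            Type: Map
            ItemsPath: "$.mediaItems"          # list of S3 keys passed in by the trigger
            MaxConcurrency: 5                  # cap parallelism to stay inside lab limits
            Iterator:
              StartAt: EnrichOne
              States:
                EnrichOne:
                  Type: Task
                  Resource: "arn:aws:lambda:us-west-2:111122223333:function:enrich-metadata"   # placeholder
                  End: true
            End: true
```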
What still needs work
- Replace API key auth with Cognito + IAM authorizers (on the backlog).
- Integration tests exist locally, but CI/CD only runs lint + unit tests. Need to script `sam validate`, deploy to a staging stack, and capture screenshots automatically (a workflow sketch follows this list).
- FFmpeg binary came from a public layer; I owe the team a security review and pinning strategy before recommending it anywhere else.
- Add signed URLs and lifecycle rules on the ingest bucket so lab data ages out automatically (lifecycle sketch after this list).
- Improve cold-starts: explore container-based Lambda vs. slimmer FFmpeg layer; measure impact and document trade-offs.
- Load test with varied media types and longer runs; current numbers are from small samples.
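For the CI/CD gap above, a sketch of the staging-deploy job, assuming GitHub Actions and SAM (the region, stack name, and deploy-role secret are placeholders); screenshot capture would bolt on after the deploy step:

```yaml
# .github/workflows/deploy-staging.yml (sketch, not wired up yet)
name: deploy-staging
on:
  push:
    branches: [main]
permissions:
  id-token: write    # required for OIDC role assumption
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/setup-sam@v2
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: us-west-2                               # placeholder region
          role-to-assume: ${{ secrets.STAGING_DEPLOY_ROLE }}  # placeholder deploy role
      - run: sam validate --lint   # catch template errors before deploying anything
      - run: sam build
      - run: >
          sam deploy --stack-name media-pipeline-staging
          --no-confirm-changeset --no-fail-on-empty-changeset
          --resolve-s3 --capabilities CAPABILITY_IAM
```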
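And for the lifecycle item, the expiration rule I have in mind for the ingest bucket, merged into the same `MediaBucket` resource as the capstone excerpt (7 days is an arbitrary lab-friendly number, not a decided policy):

```yaml
  MediaBucket:
    Type: AWS::S3::Bucket
    Properties:
      LifecycleConfiguration:
        Rules:
          - Id: expire-lab-uploads
            Status: Enabled
            ExpirationInDays: 7              # delete lab media a week after upload
            AbortIncompleteMultipartUpload:
              DaysAfterInitiation: 1         # clean up failed multipart uploads quickly
```

Signed URLs do not show up in the template at all; they are generated by the SDK at request time, so that half of the item is application code rather than IaC.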
Tooling & automations I leaned on
- Docs-as-code: Every lab ended with a markdown runbook + diagram (Mermaid + Excalidraw). These live in `notes/aws-internship/` and were reviewed by mentors weekly.
- Cost visibility: Budgets (email + Slack) triggered at 10% and 25% of the sandbox allowance, mostly to prove I could wire alerts.
- Security workflows: npm audit CI (both frontend + backend), OWASP ZAP baseline workflow for the Render-deployed intern app, Bedrock prompt logging experiments with Amazon Q.
- AI helpers: Amazon Q Business answered “where does this service log?” while ChatGPT helped translate dense docs into playbooks. Every AI-assisted snippet is annotated in the repo so it’s obvious what I edited.
- Observability kits: Reusable CloudWatch dashboards for EC2/Lambda/RDS; alarm templates for errors, latency p90/p99, and throttles (an example alarm follows this list).
- Runbook template: Intro → Symptoms → Timeline → Logs/metrics links → Fix → Prevent → Open questions. Kept consistency across labs.
- Retro cadence: Weekly self-review to mentors: “what worked, what broke, what I still don’t understand,” with ticket IDs and lab links.
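As an example of those alarm templates, here is a p99 latency alarm for the capstone Lambda, sketched in CloudFormation; the threshold and the `AlertsTopic` SNS topic are placeholders I would tune per service, and `MetadataLambda` is assumed to live in the same stack:

```yaml
  MetadataLambdaP99LatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: p99 Lambda duration is creeping toward the timeout
      Namespace: AWS/Lambda
      MetricName: Duration
      Dimensions:
        - Name: FunctionName
          Value: !Ref MetadataLambda
      ExtendedStatistic: p99            # percentile stats use ExtendedStatistic, not Statistic
      Period: 60
      EvaluationPeriods: 5
      Threshold: 20000                  # ms; placeholder, comfortably under the 30 s timeout
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching    # a quiet lab account is not an incident
      AlarmActions:
        - !Ref AlertsTopic              # placeholder SNS topic
```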
Proof & artifacts
- Lab tracker: https://github.com/BradleyMatera/aws-internship-journal (private; screenshots/redacted excerpts available).
- Dashboards: `dashboards/cloudwatch-dashboard.json` plus PNG exports in the repo.
- Runbooks: See `notes/runbooks/*.md`; each links to the relevant log groups, budgets, or config files.
- Capstone deck: PDF stored under `presentations/capstone-media-pipeline.pdf` with the architecture diagram, metrics, and TODO table.
- Quizzes/assessments: Scores + notes per module; gaps highlighted (SCPs, advanced VPC patterns).
- Ticket drills: Stored transcripts/timelines from mock pager escalations; anonymized for interview use.
Gaps & next steps
- Earn Developer Associate and re-run the labs with IaC-first deployments.
- Pair with a real AWS Support engineer on a shadow shift to see how customer tickets differ from our simulations.
- Harden IAM knowledge (resource-level permissions, SCPs, org design) beyond what the internship covered.
- Turn the FFmpeg Lambda into a public tutorial once I replace the binary layer and add full test coverage.
- Build a “cost kill switch” pattern (Budgets → SNS → Lambda to tag/stop idle resources) to prove automated cleanup (starter wiring is sketched after this list).
- Add synthetics against the capstone API (API Gateway → Lambda → DynamoDB) and publish results with dashboards.
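The cost kill switch would start from Budgets-to-SNS wiring along these lines (the cap and threshold are placeholders); the Lambda that tags and stops idle resources is the piece still to be written:

```yaml
Resources:
  CostKillSwitchTopic:
    Type: AWS::SNS::Topic
  CostKillSwitchTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref CostKillSwitchTopic
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: budgets.amazonaws.com   # Budgets must be allowed to publish alerts
            Action: sns:Publish
            Resource: !Ref CostKillSwitchTopic
  SandboxBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: sandbox-monthly
        BudgetType: COST
        TimeUnit: MONTHLY
        BudgetLimit:
          Amount: 50          # placeholder monthly cap, in USD
          Unit: USD
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80     # percent of the cap; fire before the money is gone
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref CostKillSwitchTopic
```

A Lambda subscribed to `CostKillSwitchTopic` would then look up untagged or idle resources, tag them, and stop or snapshot them; that handler is the part I still need to build.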
Interview stories I reuse
- EC2 boot-loop fix: Found bad user-data and broken CloudWatch agent; rewrote idempotently, added alarms—shows calm debugging under pressure.
- S3 Access Denied: Used Policy Simulator + CloudTrail to prove the missing `Principal` condition; fixed cross-account access. Demonstrates systematic troubleshooting.
- Cost spike response: Identified idle NAT + orphaned EBS, tagged/terminated, set budgets/alerts. Shows ownership + cost awareness.
- Capstone cold-starts: Measured FFmpeg cold-start impact, proposed container-based Lambda + signed URLs—trade-off thinking and next steps.