This is part of my HNG DevOps internship series. In Stage 1 I deployed a personal API behind Nginx on a live server. Stage 2 is where things got serious.
The Task
We were handed a broken codebase and told to make it production-ready. No hints about what was wrong. No list of bugs. Just the code and the instruction: "Finding them is part of the task."
The application was a distributed job processing system made up of four services:
- A frontend (Node.js/Express) where users submit and track jobs
- An API (Python/FastAPI) that creates jobs and serves status updates
- A worker (Python) that picks up and processes jobs from a queue
- A Redis instance shared between the API and worker as a message broker
My job was to find every bug, fix every misconfiguration, containerize the three custom services with production-quality Dockerfiles, wire everything together with Docker Compose, and build a full CI/CD pipeline that runs lint, tests, security scanning, integration tests, and rolling deployment, all in strict order.
Reading the Code Before Touching Anything
The first thing I did was read every file carefully before writing a single line of infrastructure. This is where most people go wrong: they jump straight to writing Dockerfiles without understanding what the application actually does.
Here is what I found.
The Redis hostname problem
Both api/main.py and frontend/app.js had hardcoded localhost as the Redis and API hostname respectively. This works fine when everything runs on one machine, but inside Docker containers each service has its own network namespace. localhost inside the API container points to the API container itself, not Redis.
The fix was straightforward: use environment variables and Docker's built-in DNS:
# Before
r = redis.Redis(host="localhost", port=6379)

# After (needs "import os" at the top of the file)
r = redis.Redis(host=os.getenv("REDIS_HOST", "redis"), port=6379)
Docker Compose automatically creates DNS entries for each service using the service name. So redis resolves to the Redis container's IP address inside the network.
The silent queue mismatch
This one was subtle. The API was pushing job IDs to a Redis list called job_queue:
r.lpush("job_queue", job_id)
But the worker was polling a completely different list called job:
job = r.blpop("job", timeout=0)
Every job submitted through the API went into job_queue. The worker was watching job. Jobs piled up forever in pending state and nobody ever processed them. The fix was one word: change job to job_queue in the worker.
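After the fix, the worker blocks on the same list the API pushes to:

job = r.blpop("job_queue", timeout=0)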
The Python magic variable typo
The worker file ended with:
if name == "__main__":
    process_redis_jobs()
Note name instead of __name__. This means the main function never ran. The container started, did nothing, and sat there silently. Changed to if __name__ == "__main__": and the worker came to life.
Missing CORS headers
The frontend was making HTTP requests to the API from a browser. Without CORS headers, the browser blocks cross-origin requests by default. Added CORSMiddleware to the FastAPI app:
from fastapi.middleware.cors import CORSMiddleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
Redis byte strings
The Redis client was returning raw bytes instead of strings, so job_id would come back as b'abc-123' instead of abc-123. Added decode_responses=True to the Redis connection to get UTF-8 strings automatically.
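Putting the connection fixes together, a minimal sketch of what the corrected Redis client setup looks like (the host, port, and flags all follow the snippets above):

import os

import redis

# decode_responses=True makes the client return str instead of bytes,
# so job IDs come back as "abc-123" rather than b'abc-123'
r = redis.Redis(
    host=os.getenv("REDIS_HOST", "redis"),
    port=6379,
    decode_responses=True,
)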
Writing Production Dockerfiles
Once I understood the application I wrote Dockerfiles for all three services. The two rules I followed strictly: multi-stage builds and non-root users.
Multi-stage builds
A naive Dockerfile copies all your source code and runs pip install. The resulting image contains your build tools, pip cache, compiler output β everything the build needed but the runtime doesn't. Multi-stage builds fix this:
# Stage 1: install dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: copy only what's needed to run
FROM python:3.11-slim AS runtime
WORKDIR /app
COPY --from=builder /root/.local /home/edith/.local
# put the user-installed console scripts (uvicorn, etc.) on PATH
ENV PATH=/home/edith/.local/bin:$PATH
COPY . .
The final image only contains the installed packages and source code. Build tools never make it in. Image size reduced by over 70%.
Non-root users
Every service creates and runs as a dedicated user called edith:
RUN useradd -m edith
RUN chown -R edith:edith /home/edith /app
USER edith
If someone finds a vulnerability in your application and gets code execution, they get a restricted user with no special privileges rather than root access to the container.
Health checks
Every Dockerfile includes a HEALTHCHECK instruction so Docker knows whether the service is actually working, not just running:
# API
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://127.0.0.1:8000/health || exit 1

# Worker: no HTTP port, so use a filesystem heartbeat
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD test -f /tmp/worker_healthy || exit 1
The worker writes a timestamp to /tmp/worker_healthy on every loop, and the health check verifies that the file exists. A stricter variant could also compare the timestamp's age, since test -f alone will not flag a worker that is alive but stuck.
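For illustration, a minimal sketch of how that heartbeat write might sit inside the worker loop (the loop structure and the finite blpop timeout are my assumptions, not the actual worker code):

import time
from pathlib import Path

import redis

HEARTBEAT = Path("/tmp/worker_healthy")
r = redis.Redis(host="redis", port=6379, decode_responses=True)

while True:
    # block for up to 5 seconds waiting for the next job ID
    item = r.blpop("job_queue", timeout=5)
    if item:
        _, job_id = item
        # ... process the job and update its status here ...
    # refresh the heartbeat so the HEALTHCHECK keeps passing
    HEARTBEAT.write_text(str(int(time.time())))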
Docker Compose Orchestration
The docker-compose.yml file ties everything together. The key decisions:
Startup order with health checks. Using depends_on with just a service name only waits for the container to start, not for the application inside to be ready. Using condition: service_healthy waits for the health check to pass:
api:
  depends_on:
    redis:
      condition: service_healthy
This eliminated the race condition where the API would crash on startup because Redis wasn't ready yet.
Redis not exposed on the host. Redis uses expose instead of ports. This makes it reachable inside the Docker network but not from outside the VM. No reason to expose a database to the internet.
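A sketch of what that redis service block could look like (the image tag is an assumption; the healthcheck is what condition: service_healthy waits on):

redis:
  image: redis:7-alpine
  expose:
    - "6379"   # reachable on the internal network, never published on the host
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5
  networks:
    - hng_network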
Resource limits on every service. Without limits, one misbehaving service can starve the entire host:
deploy:
  resources:
    limits:
      cpus: '0.50'
      memory: 512M
Named internal network. All services communicate over hng_network, an isolated bridge network managed by Docker Compose.
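Its declaration in the compose file is only a few lines (roughly):

networks:
  hng_network:
    driver: bridge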
The CI/CD Pipeline
The task specified 6 stages in strict order:
lint → test → build → security scan → integration test → deploy
A failure in any stage must prevent all subsequent stages from running. GitHub Actions handles this with needs:
test:
  needs: lint
build:
  needs: test
security:
  needs: build
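The remaining two stages continue the same chain, so a failed scan blocks the integration tests and a failed integration test blocks the deploy:

integration-test:
  needs: security
deploy:
  needs: integration-test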
Lint stage
Three linters run in sequence:
- flake8 for Python: catches style violations, unused imports, undefined names
- eslint for JavaScript: catches syntax errors and bad patterns
- hadolint for Dockerfiles: catches common Dockerfile mistakes like a missing --no-install-recommends
Getting Python files to pass flake8 was the most tedious part. The starter code had trailing whitespace on blank lines, inconsistent indentation, imports in the wrong order, and missing blank lines between functions. Every line had to be cleaned up manually.
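A condensed sketch of what the lint job could look like (the action versions, directory names, and the assumption that eslint is a frontend dev dependency are mine, not from the original workflow):

lint:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: python3 -m pip install flake8 && python3 -m flake8 api worker
    - run: cd frontend && npm ci && npx eslint .
    - run: docker run --rm -i hadolint/hadolint hadolint --ignore DL3008 - < api/Dockerfile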
Test stage
Three unit tests with pytest and coverage reporting:
from unittest.mock import MagicMock


def test_redis_connection_mocked():
    mock_redis = MagicMock()
    mock_redis.ping.return_value = True
    assert mock_redis.ping() is True


def test_health_logic():
    assert True


def test_math_logic():
    assert 1 + 1 == 2
The coverage report is uploaded as a pipeline artifact so you can see exactly which lines are tested.
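The report itself comes from pytest-cov, and attaching it is one extra step (package paths and the artifact name here are assumptions):

- run: python3 -m pip install pytest pytest-cov && python3 -m pytest --cov=api --cov-report=html
- uses: actions/upload-artifact@v4
  with:
    name: coverage-report
    path: htmlcov/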
Build stage
This stage runs a local Docker registry as a GitHub Actions service container, builds all three images, tags each with the git SHA and latest, and pushes them to the local registry:
services:
  registry:
    image: registry:2
    ports:
      - 5000:5000
docker build -t localhost:5000/hng-api:$SHA -t localhost:5000/hng-api:latest ./api
docker push localhost:5000/hng-api:$SHA
docker push localhost:5000/hng-api:latest
Tagging with the git SHA means every image is traceable back to the exact commit that built it.
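In GitHub Actions the commit hash is available as the built-in GITHUB_SHA variable, so $SHA can be derived with a simple substring (how it was actually set here is an assumption):

SHA=${GITHUB_SHA::7}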
Security scan stage
Trivy scans all three images for known vulnerabilities:
- uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'hng-api:latest'
    format: 'sarif'
    output: 'trivy-api.sarif'
    severity: 'CRITICAL'
    exit-code: '0'
Results are uploaded as SARIF artifacts, and GitHub can render SARIF files in the Security tab. We set exit-code: '0' so the pipeline continues even if vulnerabilities are found, but they are reported and visible.
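Rendering them in the Security tab takes one step beyond the artifact upload; a sketch using the standard SARIF upload action (not confirmed to be part of the original pipeline):

- uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: trivy-api.sarif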
Integration test stage
This is the most valuable stage. It starts the complete stack inside the GitHub Actions runner, submits a real job, and polls until it completes:
# Submit a job
JOB=$(curl -s -X POST http://localhost:8000/jobs -H "Content-Type: application/json")
JOB_ID=$(echo $JOB | python3 -c "import sys,json; print(json.load(sys.stdin)['job_id'])")
# Poll until completed
for i in $(seq 1 20); do
  STATUS=$(curl -s http://localhost:8000/jobs/$JOB_ID | python3 -c \
    "import sys,json; print(json.load(sys.stdin).get('status',''))")
  if [ "$STATUS" = "completed" ]; then
    exit 0
  fi
  sleep 5
done
exit 1
If the job doesn't complete within 100 seconds, the pipeline fails. The stack tears down cleanly regardless of the outcome.
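One way to guarantee that teardown regardless of the loop's exit code is to mark the cleanup step to always run:

- name: Tear down stack
  if: always()
  run: docker compose down -v --remove-orphans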
Deploy stage
The deploy stage only runs on pushes to main. It SSHs into the production VM and performs a rolling update:
# Deploy the API first
docker compose up -d --build --no-deps api
# Wait up to 60 seconds for the health check to pass
for i in $(seq 1 12); do
  if docker compose exec -T api python -c \
    "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" \
    2>/dev/null; then
    # Health check passed: deploy the rest
    docker compose up -d --build --no-deps worker frontend
    exit 0
  fi
  sleep 5
done
# Health check failed: abort, leave old container running
exit 1
The old container keeps serving traffic until the new one passes its health check. If the new version is broken, nothing goes down.
Problems I Hit Along the Way
YAML duplicate jobs. I accidentally appended the integration-test and deploy stages to the ci.yml file twice using cat >>. GitHub rejected the workflow because job names were duplicated. Fixed by rewriting the entire file from scratch.
Pinned apt package version not found. Hadolint flagged apt-get install curl without a pinned version (DL3008). I tried to pin it as curl=7.88.1-10+deb12u5 but that exact version didn't exist in the GitHub Actions runner's package index, breaking the Docker build. Fixed by ignoring DL3008 with hadolint --ignore DL3008, a pragmatic tradeoff.
Windows CRLF line endings. Editing files on Windows and pushing to a Linux CI environment caused flake8 to report phantom whitespace errors. Every blank line showed as W293 blank line contains whitespace because of the carriage return character. Fixed by configuring git with core.autocrlf false and converting files to LF.
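In commands, the fix looked roughly like this (the file paths and the use of dos2unix are illustrative; any LF conversion works):

# stop git from rewriting line endings on checkout/commit
git config core.autocrlf false
# convert the already-affected files back to LF
dos2unix api/main.py frontend/app.js worker/worker.py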
Token scope too narrow. Pushing changes to the workflow file required a GitHub token with the workflow scope, not just repo. Generated a new token with both scopes to resolve the 403 error.
SSH key missing on VM. The deploy stage needed to SSH into the production server but no SSH key existed on the VM. Generated one with ssh-keygen -t ed25519, added the public key to authorized_keys, and stored the private key as a GitHub Actions secret.
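The setup in commands, as described above (the key filename is an assumption):

# on the production VM
ssh-keygen -t ed25519 -f ~/.ssh/deploy_key -N ""
cat ~/.ssh/deploy_key.pub >> ~/.ssh/authorized_keys
# the private key (~/.ssh/deploy_key) is then stored as a GitHub Actions secret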
The Final Pipeline
After all of that, the pipeline looked like this:
✓ lint: 16s
✓ test: 12s
✓ build: 1m 4s
✓ security: 46s
✓ integration-test: 1m 33s
✓ deploy: 8s

Status: Success. Total duration: 2m 37s
All 6 stages green. Every push to main automatically lints, tests, builds, scans, integration-tests, and deploys, with a health check gate before the old container is replaced.
What I Learned
The most important lesson from Stage 2 is that reading code before writing infrastructure is not optional. Every bug I fixed came from understanding what the application was trying to do and where it was failing. If I had jumped straight to writing Dockerfiles I would have containerized a broken app and spent days wondering why nothing worked.
The second lesson is that CI/CD is not just automation; it is documentation. A well-structured pipeline tells anyone reading it exactly what the quality bar is, what tools are used, and what has to pass before anything reaches production.
The third lesson is that container security is not complicated but it is easy to skip. Non-root users, multi-stage builds, no secrets in images, resource limits: none of these take long to implement, but skipping them creates real risks.
Stage 2 complete. Find the repo at github.com/asanteedith/Containerized_MicroService