This is part of my HNG DevOps internship series. Follow along as I document every stage.
A Quick Recap
Stage 0 was about securing a Linux server. Stage 1 was deploying an API behind Nginx. Stage 2 was containerizing a microservices app. Stage 3 was building a DDoS detection engine. Stage 4 was writing a declarative deployment tool. Stage 5 is the most ambitious yet.
This time there was no starter code. No bugs to fix. No existing app to containerize. I had to build the entire platform from scratch – a self-service system where users can spin up isolated temporary environments, deploy apps into them, simulate outages, monitor health, and have everything auto-destroyed when the lifetime expires. Think of it as a miniature internal Heroku with a chaos engineering toggle.
The Task
The platform had to do all of this on a single Linux VM:
- Environment Lifecycle – create and destroy isolated Docker environments on demand with a configurable TTL
- Auto Cleanup Daemon – a background process that scans every 60 seconds and destroys expired environments automatically
- Dynamic Nginx Routing – every new environment gets its own Nginx config written and reloaded automatically
- Log Shipping – container logs captured and queryable by environment ID
- Health Monitoring – a poller that hits every environment's /health endpoint every 30 seconds and marks environments as degraded after 3 consecutive failures
- Outage Simulation – a script that can crash, pause, disconnect, or stress-test any environment on demand
- Control API – a REST API with 6 endpoints wrapping all the scripts
- Makefile – every action available as a make target
The stack was Docker, Docker Compose, Nginx, Bash, Python 3, and Flask. Everything had to spin up with one command.
Step 1: Repo Structure and Scaffold
Before writing a single line of logic I set up the repo structure exactly as specified:
devops-sandbox/
├── platform/
│   ├── create_env.sh
│   ├── destroy_env.sh
│   ├── cleanup_daemon.sh
│   ├── simulate_outage.sh
│   └── api.py
├── nginx/
│   ├── nginx.conf
│   └── conf.d/
├── monitor/
│   └── health_poller.sh
├── logs/
├── envs/
├── Makefile
├── docker-compose.yml
├── README.md
├── .env.example
└── .gitignore
Getting this right first saved a lot of headaches later. Every script references paths relative to the project root, and if those paths don't exist at runtime the scripts fail silently. I also set chmod +x on all shell scripts immediately β forgetting this causes confusing permission errors later.
The .gitignore was set up to exclude envs/, logs/, and .env from the start. These directories contain runtime state and secrets that should never be committed.
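Reconstructed from that description, the initial .gitignore is nothing more than:

```gitignore
# Runtime state and secrets -- never committed
envs/
logs/
.env
```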
Step 2: The Demo App
The platform needed something to run inside each environment. The task was clear that the demo app is not the project – the platform is. So I kept it simple: a Flask app with two routes.
from flask import Flask, jsonify
import os

app = Flask(__name__)
ENV_ID = os.environ.get("ENV_ID", "unknown")

@app.route("/")
def index():
    return jsonify({
        "message": "Hello from the sandbox!",
        "env_id": ENV_ID
    })

@app.route("/health")
def health():
    return jsonify({"status": "ok", "env_id": ENV_ID}), 200

app.run(host="0.0.0.0", port=5000)
The /health route is the critical one. The health poller depends on it. Every environment container gets its ENV_ID injected as an environment variable so you can always tell which container you are talking to.
The app binds to 0.0.0.0, not 127.0.0.1. This is a mistake I see constantly: if you bind to localhost inside a container, nothing outside the container can reach it – including Nginx.
Step 3: Nginx Dynamic Routing
Nginx is the front door for every environment. The key insight is that nginx.conf never needs to change. It just includes everything in conf.d/:
events {}

http {
    include /etc/nginx/conf.d/*.conf;

    server {
        listen 80 default_server;
        return 404 "No environment found\n";
    }
}
When create_env.sh runs, it writes a new file to nginx/conf.d/$ENV_ID.conf and reloads Nginx. When destroy_env.sh runs, it deletes that file and reloads Nginx again. No manual config editing ever.
The conf.d/ directory is mounted as a Docker volume into the Nginx container. This means files written to nginx/conf.d/ on the host appear immediately inside the container. Only a reload is needed, not a rebuild.
One critical mistake to avoid: never write the Nginx config before the container is running. Nginx validates upstream hostnames on reload. If you write a config pointing to a container that doesn't exist yet, the reload fails and Nginx goes down. The order matters β start the container first, then write the config.
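For illustration, here is roughly what a generated per-environment file could look like. The container name, port, and hostname scheme here are my assumptions – the real template lives in create_env.sh:

```nginx
# nginx/conf.d/demo-1700000000.conf (sketch -- names are illustrative)
server {
    listen 80;
    server_name demo-1700000000.localhost;

    location / {
        # The container name resolves via Docker's embedded DNS
        # on the shared network, so no IP addresses are hardcoded.
        proxy_pass http://sandbox-demo-1700000000:5000;
        proxy_set_header Host $host;
    }
}
```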
Step 4: Environment Lifecycle
create_env.sh is the heart of the platform. It has to do six things in the right order:
- Generate a unique env ID from the name and a timestamp suffix
- Create a dedicated Docker network for the environment
- Connect the Nginx container to that network
- Start the app container on that network with a sandbox.env=$ENV_ID label
- Write the Nginx config and reload
- Write the state file to envs/$ENV_ID.json atomically
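The ID generation in the first step is the simplest piece. A minimal sketch – the exact format is my assumption, but name plus Unix timestamp is the idea:

```shell
# Unique env ID: user-supplied name plus a Unix-timestamp suffix
ENV_NAME="demo"
ENV_ID="${ENV_NAME}-$(date +%s)"
echo "$ENV_ID"
```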
The atomic write is important. The cleanup daemon reads these state files in a loop. If a write crashes halfway, the daemon reads garbage and fails. The fix is to write to a temp file first and then mv it into place:
TEMP_FILE=$(mktemp "$ENVS_DIR/.tmp.XXXXXX")
cat > "$TEMP_FILE" << JSON
{
"id": "$ENV_ID",
"name": "$ENV_NAME",
"container": "$CONTAINER_NAME",
"network": "$NETWORK_NAME",
"created_at": "$CREATED_AT",
"ttl": $TTL,
"status": "running"
}
JSON
mv "$TEMP_FILE" "$ENVS_DIR/$ENV_ID.json"
mv is atomic on Linux when source and destination are on the same filesystem. The daemon either reads the complete file or nothing.
destroy_env.sh reverses all of this in the correct order – kill the log shipper first, stop and remove containers, disconnect Nginx from the network, remove the network, delete the Nginx config, reload Nginx, archive logs, delete the state file. Order matters here too. You cannot remove a network while containers are still connected to it.
Step 5: The Cleanup Daemon
The daemon runs in an infinite loop with a 60 second sleep. On each iteration it reads every file in envs/, computes how much time has passed since created_at, and calls destroy_env.sh if the TTL has been exceeded.
CREATED_EPOCH=$(date -d "$CREATED_AT" +%s)
NOW_EPOCH=$(date -u +%s)
EXPIRES_AT=$((CREATED_EPOCH + TTL))

if [[ "$NOW_EPOCH" -ge "$EXPIRES_AT" ]]; then
  bash "$DESTROY_SCRIPT" "$ENV_ID"
fi
One thing that breaks this: not using nullglob. If envs/ is empty, *.json expands to the literal string *.json and the loop tries to process a file called *.json which doesn't exist. Fix:
shopt -s nullglob
STATE_FILES=("$ENVS_DIR"/*.json)
shopt -u nullglob
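You can see the difference in a guaranteed-empty directory – with nullglob the array ends up with zero entries instead of one literal `*.json` string:

```shell
# Glob an empty directory with nullglob enabled: zero entries
DIR=$(mktemp -d)
shopt -s nullglob
FILES=("$DIR"/*.json)
shopt -u nullglob
echo "${#FILES[@]}"   # 0
```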
Every action is timestamped and written to logs/cleanup.log. The daemon runs in the background with nohup and its PID is saved so make down can stop it cleanly.
Step 6: Health Monitoring
The health poller runs every 30 seconds. For each active environment it finds the container's IP address, hits GET /health, measures the latency, and writes the result to logs/$ENV_ID/health.log.
Getting latency right was harder than expected. My first approach used date +%s%N for nanosecond timestamps. This failed because the %N format specifier is a GNU coreutils extension and was not supported by the date binary on the VM. The numbers came out as something like 14209454ms for a request that obviously took under a second.
The fix was to use curl's own built-in timing:
RESULT=$(curl -s -o /dev/null \
-w "%{http_code} %{time_total}" \
--max-time 5 \
"http://$CONTAINER_IP:5000/health")
HTTP_STATUS=$(echo "$RESULT" | awk '{print $1}')
TIME_SEC=$(echo "$RESULT" | awk '{print $2}')
LATENCY=$(echo "$TIME_SEC" | awk '{printf "%d", $1 * 1000}')
curl's %{time_total} gives you wall clock time in seconds as a decimal. Multiply by 1000 and you have milliseconds. Accurate and reliable.
After 3 consecutive failures the poller marks the environment as degraded by updating the state file. It also resets the fail counter and restores the status to running when checks pass again. The status update uses the same atomic write pattern as the lifecycle scripts.
Step 7: Outage Simulation
The simulation script accepts --env and --mode flags. The modes map directly to Docker commands:
- crash – docker kill (SIGKILL, not graceful)
- pause – docker pause
- network – docker network disconnect
- recover – inspects current state and reverses whichever mode is active
- stress – runs stress-ng inside the container for 60 seconds
The guard at the top of the script is not optional. It checks whether the target container name matches any protected service names and refuses to run if it does:
PROTECTED=("sandbox-nginx" "cleanup_daemon" "sandbox-api")

for PROTECTED_NAME in "${PROTECTED[@]}"; do
  if [[ "$CONTAINER" == *"$PROTECTED_NAME"* ]]; then
    echo "ERROR: Refusing to simulate outage against protected container"
    exit 1
  fi
done
Without this guard, nothing stops someone from passing the Nginx container ID and taking down the entire platform.
The recover mode was the most interesting to write. It does not know which mode caused the problem – it just inspects the current state and fixes whatever is wrong. Paused? Unpause. Exited? Restart. Network disconnected? Reconnect. This makes recover genuinely useful rather than just a wrapper around one specific undo.
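Stripped of the actual Docker calls, the dispatch boils down to a case statement. A sketch – the function name is mine, and "disconnected" stands in for a network check rather than a docker inspect status:

```shell
# Map an observed container state to the corrective action.
# In the real script each echo would be an actual docker command.
recover_action() {
  case "$1" in
    paused)       echo "docker unpause" ;;
    exited|dead)  echo "docker restart" ;;
    disconnected) echo "docker network connect" ;;
    running)      echo "nothing to do" ;;
    *)            echo "unknown state: $1" ;;
  esac
}

recover_action paused   # docker unpause
recover_action exited   # docker restart
```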
Step 8: The Control API
The Flask API wraps all the scripts via subprocess.run. It has 6 endpoints:
- POST /envs – create env
- GET /envs – list active envs + TTL remaining
- DELETE /envs/:id – destroy env
- GET /envs/:id/logs – last 100 lines of app.log
- GET /envs/:id/health – last 10 health check results
- POST /envs/:id/outage – trigger simulation
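The subprocess.run wrapping is the same for every endpoint, so it is worth a single shared helper. A sketch – the name, timeout, and error handling here are my assumptions, not the actual api.py:

```python
import subprocess

def run_script(script, *args):
    """Run a platform script and return its stdout, raising on failure."""
    result = subprocess.run(
        ["bash", script, *args],
        capture_output=True, text=True, timeout=120,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or "script failed")
    return result.stdout.strip()
```

Centralizing the timeout matters: a hung docker command would otherwise hold a Flask worker forever.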
The TTL remaining calculation happens in Python:
from datetime import datetime, timezone

def ttl_remaining(env):
    created = datetime.fromisoformat(
        env["created_at"].replace("Z", "+00:00")
    )
    now = datetime.now(timezone.utc)
    elapsed = (now - created).total_seconds()
    return max(0, int(env["ttl"] - elapsed))
The API runs inside a Docker container with the project directory mounted as a volume and the Docker socket mounted so it can execute Docker commands. This is the standard pattern for tools that need to manage Docker from inside Docker.
Step 9: The Makefile
Every action has a make target. The two most important ones are up and down.
make up starts Nginx and the API via Docker Compose, then starts the cleanup daemon and health poller as background processes with nohup, saving their PIDs to files:
up:
	docker compose up -d --build
	nohup bash platform/cleanup_daemon.sh > logs/cleanup.log 2>&1 & echo $$! > logs/cleanup_daemon.pid
	nohup bash monitor/health_poller.sh > logs/poller.log 2>&1 & echo $$! > logs/health_poller.pid
Note that the PID capture sits on the same line as the background job. make runs every recipe line in its own shell, so an echo $$! on the following line would run in a fresh shell and print nothing.
make down reads those PID files and kills the processes cleanly:
down:
@if [ -f logs/cleanup_daemon.pid ]; then \
kill $$(cat logs/cleanup_daemon.pid) 2>/dev/null || true; \
rm -f logs/cleanup_daemon.pid; \
fi
Makefile syntax has one rule that catches everyone: indentation must use tabs, not spaces. If you use spaces, make throws a cryptic missing separator error that has nothing to do with separators.
Problems I Hit Along the Way
Docker permission denied on a fresh VM – The ubuntu user is not in the docker group by default. Fix: sudo usermod -aG docker $USER followed by newgrp docker.
Nginx crashing on startup – I left a sample example.conf file in nginx/conf.d/ as a reference. Nginx tried to resolve the upstream hostname example:5000 on startup, failed, and crashed. The fix was obvious in hindsight: delete the sample file before starting Nginx.
Disk full during Docker build – docker system prune -af recovered the space. The build cache had accumulated several GB from previous builds and test runs.
demo-app:latest image lost after prune – docker system prune -a removes all images not referenced by a container. After cleaning disk space the demo app image was gone. Always rebuild the demo app image after a prune: docker build -t demo-app:latest ./demo-app.
Health log latency showing 14 million milliseconds – Caused by date +%s%N not being supported. Fixed by switching to curl's %{time_total} timing.
The Big Picture
| What we built | Why it matters |
|---|---|
| Dedicated Docker network per environment | Complete isolation – environments cannot interfere with each other |
| Atomic state file writes | Prevents corruption when daemon and scripts write concurrently |
| Nginx config as code | Dynamic routing without touching the main config |
| Log shipper PID tracking | Prevents zombie processes on destroy |
| Guard in simulation script | Prevents accidental destruction of platform infrastructure |
| Health-based degraded detection | Automated observability without external tooling |
| REST API over raw scripts | Makes the platform programmable and integratable |
The hardest part of this task was not any single script. It was understanding the correct order of operations. Create the container before writing the Nginx config. Kill the log shipper before removing the container. Disconnect the network before removing it. Write state files atomically. These ordering constraints are not obvious until something breaks, and when they break they break in confusing ways.
That is the difference between infrastructure that works in a demo and infrastructure that works at 3am when something goes wrong.
Stage 5 complete. Find me on Dev.to | GitHub