Some systems look scalable right up until they meet real production traffic.
The tests pass. Dashboards are green. The architecture diagram looks clean. The team feels good about it.
Then traffic grows, usage patterns shift, and the system starts failing in ways that were never obvious in staging.
I have seen this in very different environments: public-sector infrastructure supporting criminal justice workflows across 87 counties, and enterprise AI infrastructure where even tiny per-request costs multiplied across high-volume evaluation pipelines. In both cases, the system did not fail because someone forgot to add servers. It failed because an assumption that was harmless at small volume became expensive at large volume.
Most of those assumptions had never been written down anywhere.
This is the first post in a five-part series on scale assumptions: the design decisions that look harmless early and become painful later.
The assumptions that fool you
The first trap is linear thinking.
A system handles 1,000 requests per second, so it feels natural to assume 100,000 is mostly a hardware problem. Add more machines. Increase capacity. Tune a few indexes.
Sometimes that works.
More often, the bottleneck is not compute. It is a design decision that was reasonable at the beginning and painful later.
I learned this the first time we expanded a system from a pilot with one agency to three. The system slowed down noticeably. Nothing had broken. Nothing had been changed. The code was identical. But three agencies meant three times the concurrent load on shared workflow components, database access patterns, and integration points that had been sized for one. That was the moment I started worrying about statewide rollout in a way I had not before. Not because the architecture was wrong, but because the assumptions it was built on had never been tested at that width.
A synchronous call in the critical path is a good example of how this plays out. At low traffic, a 50ms dependency feels harmless. At scale, that dependency becomes a ceiling. Every request waits. Every downstream timeout becomes your timeout. What looked like a normal service call becomes an architectural anchor you cannot easily remove.
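Here is a minimal sketch of what that anchor looks like in code. The service URL and the two-second timeout are hypothetical, chosen only to show the shape of the problem: every request blocks on the lookup, so a single synchronous worker tops out at roughly 20 of these per second, and the dependency's timeout is now effectively the handler's timeout.

```python
import requests  # hypothetical HTTP call to an internal service

DEPENDENCY_TIMEOUT_S = 2.0  # assumed downstream timeout, for illustration

def handle_request(payload: dict) -> dict:
    # Every request blocks here. At low traffic this ~50ms lookup is invisible;
    # at high traffic a single synchronous worker can do at most ~20 of these
    # per second, and the dependency's timeout becomes this handler's timeout.
    enrichment = requests.get(
        "https://enrichment.internal/lookup",  # hypothetical service
        params={"id": payload["id"]},
        timeout=DEPENDENCY_TIMEOUT_S,
    ).json()
    return {**payload, "enrichment": enrichment}
```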
Data models have the same problem.
A schema that works for a small number of roles, permissions, tenants, or workflow states can become fragile when real business complexity arrives. The problem is not that someone designed a bad schema. The problem is that the schema quietly encoded a belief about the future.
That belief might have been:
"We will only have a few roles."
"This workflow will stay simple."
"This lookup will always be cheap."
"This relationship will never need to be queried at high volume."
Then the system grows, and those beliefs become production incidents.
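As a sketch of how a belief like "we will only have a few roles" gets baked in, consider the snippet below. The role names and the export check are hypothetical, not taken from any system described here; the point is that the belief lives in code and schema, not in a document anyone can revisit.

```python
from enum import Enum

class Role(Enum):
    # The belief "we will only have a few roles", encoded as code.
    ADMIN = "admin"
    ANALYST = "analyst"
    VIEWER = "viewer"

def can_export(role: Role) -> bool:
    # Harmless with three roles. When a fourth role needs some but not all
    # of what ANALYST can do, this check, the enum, and every branch like it
    # across the codebase all have to change at once.
    return role in (Role.ADMIN, Role.ANALYST)

print(can_export(Role.VIEWER))  # False
```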
Logging is the sneaky one. You start logging everything because you want to be able to debug production safely. Good instinct. Then one month the infrastructure bill arrives and the audit trail is costing more to move, store, and retain than the application it was supposed to help you understand. Nobody put that on any architecture diagram. Nobody planned for logging to become a system that needed its own architecture.
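Some rough arithmetic shows how quickly that happens. Every number below is an assumption for illustration, not a measurement from any real system.

```python
# Back-of-the-envelope: per-request log volume at high traffic.
bytes_per_request = 2 * 1024   # ~2 KB of structured logs per request (assumed)
requests_per_second = 100_000  # assumed traffic level
retention_days = 30            # assumed retention policy

per_second = bytes_per_request * requests_per_second  # ~200 MB/s to move
per_day = per_second * 86_400                         # ~18 TB/day to ingest
retained = per_day * retention_days                    # ~530 TB sitting in storage

print(f"{per_second / 1e6:.0f} MB/s, "
      f"{per_day / 1e12:.1f} TB/day, "
      f"{retained / 1e12:.0f} TB retained")
```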
At scale, p99 is not an edge case
Your p99 is someone's normal.
If a system processes 100,000 requests per second and 1% of them are slow, that is not "just the p99." That is 1,000 requests every second with a bad experience. Those users are not theoretical. They create support tickets, trigger retries, increase queue depth, and hit downstream services in ways that change how the system behaves for everyone else.
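The same arithmetic in a few lines, using the numbers from the paragraph above:

```python
requests_per_second = 100_000
slow_fraction = 0.01  # the p99 tail

slow_per_second = requests_per_second * slow_fraction
print(slow_per_second)         # 1000.0 bad experiences every second
print(slow_per_second * 3600)  # 3.6 million every hour
```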
The unhappy path has enough traffic behind it to become its own workload. That changes what you optimize for.
What actually breaks
The thing that breaks is rarely the thing you load tested.
I have seen systems pass every load test and still fall over in production because the test traffic was too clean. Real production traffic is messier: retries stacking on top of retries, duplicate events, one tenant whose data volume is ten times anyone else's, a permissions edge case that only appears for a specific combination of roles that no tester thought to try. Load tests model the traffic you imagined. Production brings the traffic you did not.
One assumption I got wrong: I expected a decision-making component to maintain consistent latency as more systems onboarded. What I had not accounted for was what concurrent writes from multiple systems do to a shared database under real load. The component itself was fine. The contention underneath it was not. In isolation, everything looked healthy. Together, under actual concurrent traffic, the latency climbed in ways the tests had never surfaced.
What breaks is often the interaction between components, not the components themselves.
A retry policy that looked safe starts amplifying failures. Cache invalidation creates churn nobody anticipated. A permission check that costs microseconds in isolation gets called so often it starts showing up on flame graphs. The coordination overhead between services (locks, consensus, invalidation, fanout) can quietly grow faster than the cost of the actual work being done.
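A back-of-the-envelope sketch of how stacked retries amplify load; the layer count and retry budget here are purely illustrative assumptions, not a description of any specific system.

```python
ATTEMPTS_PER_LAYER = 3  # 1 original call + 2 retries per layer, assumed
LAYERS = 3              # e.g. client -> gateway -> service, also assumed

# Worst case, one user action reaches the bottom dependency this many times,
# and it does so exactly when that dependency is already struggling.
worst_case_calls = ATTEMPTS_PER_LAYER ** LAYERS
print(worst_case_calls)  # 27
```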
There is also something harder to describe, and I did not fully understand it until I was debugging an access control issue and could not figure out where to even start looking.
The system had multiple sources of truth for permissions, and they had drifted out of sync. The system itself had no way to resolve the conflict. It was just making a call based on whichever source it happened to check first. There was no single place I could look and understand what was happening. I had to reconstruct the state of several different systems at a specific point in time to understand one decision. That was when it clicked for me that at a certain point you stop debugging code and start debugging a system you no longer fully understand, because the interactions between components only become visible under volume.
The decisions that are hard to undo
Most systems should not be designed for massive scale on day one. Premature optimization is real, and designing for scale you do not have is expensive and often wrong.
But some early decisions are genuinely hard to reverse:
Your data model.
Your synchronous versus asynchronous boundaries.
Your consistency and availability assumptions.
Your authorization model.
Your audit and retention strategy.
Get these wrong and you will eventually rewrite them under pressure, in production, while users are affected and every migration carries risk. That rewrite is never the calm, well-resourced project you wish it were.
In the next post, I will go into the one that has cost me the most: data modeling decisions that look completely reasonable on day one and become load-bearing walls by year three.
What scaling assumption has burned you? What broke in production that looked fine in testing?