Why Bigger Pools Wait Less

There is a result from queueing theory that feels wrong the first time you see it. Take a service behind a load balancer, keep every server exactly as busy as before, and just make the pool bigger. The latency gets better. Not the throughput, the latency. Same per-server utilization, less waiting.

This is Marc Brooker's surprising economics of load-balanced systems. He holds each server's utilization fixed, grows the server count, and watches the queueing fall away. Most engineers already know the single-server version of the intuition from Erik Bernhardsson's "why you need to cut your systems some slack", where you run one machine below 80% to stay off the latency cliff. The multi-server story is the one that is less widely felt, so this post makes Brooker's result something you can drag a slider through, then takes it two steps further: when to merge pools instead of splitting them, and where real systems quietly hand the win back.

Everything below is computed live in your browser by a small open-source library I wrote for this post, queue-economics. Drag the sliders and the math reruns.

The model in one paragraph

Requests arrive at random, independently of each other, with no coordinated bursts. Each one takes some time for a server to handle, and that time varies around an average. That average is the service time, and it is the unit for every wait number on this page, so "0.5 service times" just means half as long as a typical request takes to process. There is one shared queue, served first come first served, and nobody times out or retries. If a request arrives and every server is busy, it waits in line. That is the whole model, and it has a name: M/M/c, where c is the number of servers. The chance an arriving request has to wait is given by the Erlang C formula, which Agner Erlang worked out in 1917 for telephone exchanges. Those assumptions, random independent arrivals and service times that vary around a fixed average, are what make the number exact. They are also where real systems drift, which is the last section of this post.

The thing that makes a system feel slow is not utilization, it is waiting in line. And here is the trick: at a fixed utilization, the chance of having to wait drops sharply as the pool grows. With a handful of servers, one busy moment leaves an arrival nowhere to go. With many servers, there is almost always one free to grab the next request. More servers means more chances to dodge the queue.

Hold the utilization slider wherever you like and grow the pool. The wait probability falls off a cliff, and the p99 tail falls with it. If you do not fully trust a formula you met ninety seconds ago, good instinct, so flip on the Monte Carlo check: a seeded simulation drops its dots right on the analytic curve. Same sanity check Brooker ran on his version.

There is a subtler tax hiding in those stats. The average wait your dashboard reports and the wait people actually feel are not the same number, and the felt one is always worse. The reason is almost obvious once you say it out loud: long waits take longer, so at any moment more people are stuck inside a long wait than a short one. When you average over people instead of over requests, the long waits get counted more, and the number drifts up. For this model the wait people feel comes out to exactly twice the ordinary mean wait, which is the "experienced wait" stat above. (If you want the formula, the felt average is the mean plus the variance divided by the mean, E[W] + Var(W)/E[W], the size-biased mean.) This is the point of Brooker's companion piece Meet Alice. Alice is impatient.: your p50 dashboard and your users can both be right and still disagree, because they are measuring the same thing in two different ways.

The one rule worth remembering

If there is a single takeaway, it is this. To hit a target wait probability you staff roughly

servers ≈ load + β × √load

The bare load keeps the servers fed. The second term, the safety margin, is the slack you add so requests do not pile up. The surprise is that it grows like the square root of the load, not in proportion to it. This is the square-root staffing rule from the Halfin-Whitt regime.

A system handling 100x the traffic needs only about 10x the absolute slack, and proportionally far less. That is the whole economy of scale in one line: big systems can run hotter for the same quality of service, because their cushion is cheaper per unit of load. Small systems cannot, which is the uncomfortable corollary nobody mentions.

Split or merge?

This is where the theory turns into an architecture decision you actually make. Say you have some traffic and some servers. You can split them into several independent pools, or feed everything through one shared queue. Same total servers, same total traffic, same utilization either way. The only difference is whether the servers share a line.

Here is the intuition, because it is the part worth keeping. You only ever wait when every server is busy at the same instant. Split your servers into separate pools and each little pool hits that all busy state on its own, fairly often, while servers in a neighboring pool sit idle and cannot help. That idle time is stranded: real capacity that exists but is walled off from the request that needs it. Merge the pools and no server is ever idle while someone waits, because any free server anywhere takes the next request. It is the supermarket lesson: one line feeding ten checkouts beats ten separate lines, because you never get stuck behind one slow cart while a register opens up two aisles over. The bigger the shared pool, the rarer it is for every server to be busy at once, so the wait probability keeps falling as the pool grows. This is the square-root staffing rule from earlier seen from the other side: pool the load and one proportionally smaller cushion of spare servers covers everyone, instead of every small pool buying its own.

Merging wins, every time, for free. The wait probability drops and the tail shrinks, with no change to how hard each server works. The "cost of staying split" number is the concrete version: how many extra servers you would have to buy, across all your independent pools, just to match the merged pool's p99. That overhead is the price of fragmentation. It is also the argument for consolidating chatty microservices, sizing one big worker pool instead of many small ones, and being suspicious of per-tenant isolation that shards your capacity into pieces too small to benefit.

This is not a thought experiment. Around 2010 Heroku quietly changed its router from a single global request queue, which handed each request to a free dyno, to random routing, which drops each request onto one dyno's own queue. With single-threaded Rails dynos that turned one shared line into hundreds of independent ones, and the tail fell apart: a request could sit waiting while other dynos sat idle. It took James Somers' 2013 writeup for Heroku to admit the change, by which point Rap Genius was paying around twenty thousand dollars a month for the degraded throughput. Erik Bernhardsson compresses the whole lesson into one line: a 2x faster machine with a single queue always beats two 1x machines with their own queues. Sharing the queue is the entire game.

That Heroku split was accidental, which is why it was pure loss. Deliberate splits are a different story, and they are everywhere for good reason. The reason to split is almost never latency, it is failure. One shared queue is also one shared blast radius: a bad deploy, a poison request, or a single overloaded node can take the whole pool down at once. Availability zones, regions, and cells exist precisely so that one failure does not become every failure, and you cannot merge across that boundary without giving up the independence that protects you. Tenant isolation, compliance boundaries, and noisy-neighbor control carve up capacity for similar reasons. So queueing economics are one side of the ledger and redundancy is the other, and they pull in opposite directions. The move that respects both is to pool aggressively inside a failure domain and keep the domains separate: one healthy pool per zone beats ten small pools per zone, and it also beats a single global pool with nowhere to fail over to. Merge for latency and cost, split for blast radius, and put the boundary where a failure should stop, not where the org chart happens to fall.

Where the win leaks away

The clean curve assumes a tidy world: random independent arrivals, service times that vary around a fixed average, infinite patience, no retries. Real systems are not tidy, and the gap between the textbook and production is where most of the interesting failures live. Three distortions matter more than the rest.

Retries. A retry adds load at the worst possible moment, when the pool is already saturated. Here is the model behind the red line above, in words. Start with the offered load, the work actually arriving. Some fraction of requests have to wait, and the retry slider sets what share of those waiting requests give up and try again, piling extra work back on. So the effective load is the original load multiplied by (1 + retry rate × P(wait)). The trap is that P(wait) is itself driven by the load, so more load means more waiting, which means more retries, which means more load. The red curve is the fixed point of that loop: feed the new load back in, recompute, repeat until it stops moving. While the effective load stays under the server count, it settles on a slightly worse but stable number. Once retries push the effective load up to the number of servers, there is no stable point, the loop runs away, and the curve pins to 100%. That cliff is a real outage mode, the classic retry storm: a few timeouts trigger retries, the retries add load, the added load causes more timeouts, and a pool that was comfortable at 80% is suddenly at 100% and climbing. It is why retry budgets, exponential backoff, and circuit breakers exist, and why naive client retries turn a small wobble into a full outage.

Single-threaded actors. A Durable Object, an actor, any "one request at a time" component is an M/M/1 queue. You can run a thousand of them, but each request still hits exactly one, and that one is a queue of size one. Sharding adds capacity without ever pooling it, so a request waits as if it were alone. The flat line in the chart is the point: this architecture gets none of the economies of scale, by construction. When you partition by user or key, you are trading the queueing win for isolation, and that trade is often right, but it is a trade.

Cold starts and fat tails. M/M/c assumes service times that are memoryless. Cold starts, GC pauses, and the occasional slow dependency give you a heavy tail instead, and the tail is exactly what p99 cares about. The economies of scale still apply, but the constant in front is worse than the clean model promises.

So what

Let me step out of the server room for a second. The same math is why one big on-call rotation rides out a bad night better than three small ones, and why a team you keep pinned at 100% utilization always seems to have a mysteriously long queue of half-finished work. People are servers too, and a queue is a queue. The pooling win and the small-system penalty both transfer, which is either comforting or alarming depending on how your org is shaped.

Back to machines. Three things to take away:

Bigger shared pools are not just more capacity, they are lower latency at the same cost, or the same latency at lower cost. The benefit shows up at modest sizes, not just at hyperscale, so you do not need to be Google to collect it. And the moment you shard, retry, or serialize, you start handing the win back, so spend it deliberately.

The math is old and settled. What is usually missing is a way to feel it and a number to put on the decision. That is what the sliders above are for, and the queue-economics package (on npm) is there if you want the same numbers in your own capacity planning.

If your system is sitting on the steep part of one of these curves and you want a second pair of eyes on it, book a call.