Sub-50ms on MSK: the broker tuning that moved the needle

Six knobs. Two mattered. Real p99: 141ms → 47ms, and why the other four were theatre.

The brief was a number: end-to-end p99 under 50ms on an MSK pipeline that was sitting at 141ms. The internet offers about six knobs for this. Two of them did the work. The other four were theatre, and knowing which is which is the difference between a day and a fortnight.

Before / after

produce→consume p99 // 3-broker cluster // ms

Stage	p99
Baseline	141
+ batch & linger tuned	88
+ partitions matched to consumers	47
Net	−67%
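
The table's arithmetic, as a quick sketch. The p99 figures are the ones above; the helper names (`percent_cut`, `report`) are made up for illustration.

```python
# p99 figures from the table above, in milliseconds.
STAGES = [
    ("Baseline", 141),
    ("+ batch & linger tuned", 88),
    ("+ partitions matched to consumers", 47),
]


def percent_cut(before, after):
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


def report(rows):
    """One line per stage: p99, ms saved by that step, cumulative cut vs baseline."""
    baseline = rows[0][1]
    previous = baseline
    lines = []
    for name, p99 in rows:
        saved = previous - p99
        cut = percent_cut(baseline, p99)
        lines.append(f"{name}: {p99} ms (step saved {saved} ms, cumulative cut {cut:.0f}%)")
        previous = p99
    return lines


for line in report(STAGES):
    print(line)
```

Of the 94ms removed, batching accounts for 53ms and partitioning for 41ms.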

The two that mattered

Producer batching. The default linger.ms=0 was shipping tiny batches under load; raising linger.ms and batch.size let the producer amortize round-trips across more records per request. That single change took p99 from 141ms to 88ms.
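Partition count matched to consumer parallelism. Within a consumer group, each partition is read by at most one consumer, so too few partitions starved the consumers: the surplus sat idle while the rest carried the whole load. Matching partitions to the real consumer count took us the rest of the way, from 88ms to 47ms.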
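
A minimal sketch of both changes, under stated assumptions. `linger.ms` and `batch.size` are real Kafka producer settings, but the values below are placeholders, not the ones used on this cluster. The partition and consumer counts are hypothetical too, and the round-robin assignment is a simplification of what Kafka's assignors actually do. What it does model faithfully is the cap: one consumer per partition within a group.

```python
# Producer settings as a plain dict. The keys are real Kafka producer configs;
# the VALUES ARE ILLUSTRATIVE ASSUMPTIONS, not the ones from this post.
producer_config = {
    "linger.ms": 5,       # wait up to 5 ms for a batch to fill (assumed value)
    "batch.size": 65536,  # upper bound in bytes for one partition's batch (assumed value)
}


def idle_consumers(partitions, consumers):
    """Consumers in one group that get no partition at all."""
    return max(0, consumers - partitions)


def assign(partitions, consumers):
    """Simplified round-robin assignment: partition p goes to consumer p % consumers."""
    assignment = {c: [] for c in range(consumers)}
    for p in range(partitions):
        assignment[p % consumers].append(p)
    return assignment


# Hypothetical before/after: six consumers against three, then six, partitions.
print(idle_consumers(partitions=3, consumers=6))  # three consumers do nothing
print(idle_consumers(partitions=6, consumers=6))  # every consumer has work
print(assign(partitions=6, consumers=6))
```

Running more consumers than partitions buys nothing, which is why the partition count has to follow the consumer count you actually run.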

The four that were theatre

Bumping broker instance size, adding brokers, fiddling with num.io.threads, and chasing compression codecs: all popular, all near-zero impact here, because the bottleneck was never broker CPU. It was batching and parallelism. Throwing hardware at a batching problem just makes an idle cluster more expensive.

Tune the thing that’s actually the constraint. Everything else is moving money to look busy.

The general rule travels further than MSK: measure where the time actually goes before you reach for the knob with the best marketing. Two changes, a 67% cut, and a cluster that didn’t grow.
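
One way to do that measuring, sketched under assumptions: stamp each record at a few checkpoints and take p99 per hop, so the slow hop names itself. The checkpoint names (`sent`, `appended`, `fetched`) and the record shape are hypothetical, and comparing timestamps across hosts only works if their clocks are in sync.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def stage_p99(records, stages):
    """p99 latency per hop.

    records: list of dicts mapping checkpoint name -> timestamp in ms.
    stages:  list of (label, start_checkpoint, end_checkpoint) tuples.
    """
    return {
        label: p99([r[end] - r[start] for r in records])
        for label, start, end in stages
    }


# Hypothetical hops; the checkpoint names are assumptions for illustration.
HOPS = [
    ("producer->broker", "sent", "appended"),
    ("broker->consumer", "appended", "fetched"),
    ("end-to-end", "sent", "fetched"),
]

# Synthetic records: a constant 2 ms first hop, a second hop of 1..100 ms.
records = [{"sent": 0, "appended": 2, "fetched": 2 + i} for i in range(1, 101)]
print(stage_p99(records, HOPS))
```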
