Minimize Kafka Consumer Processing Latency
Minimize consumer processing latency — small fetches, fast failure detection.
Recommended starting points curated by Conduktor. Always benchmark with your workload. Some broker configs are not available on managed services (AWS MSK, Confluent Cloud) — check your provider's documentation.
| Config | Change | Why |
|---|---|---|
| Fetching | | |
| fetch.max.wait.ms (Kafka 0.9.0+) | 500ms → 10ms | Caps the broker-side wait at 10ms: the broker responds even if fetch.min.bytes is not yet satisfied, instead of holding the connection for up to half a second waiting to batch data. This is the primary lever for reducing consumer-side latency. • At low message rates the broker returns near-empty FetchResponses every 10ms, generating ~100 round-trips/sec of network and CPU overhead per consumer; this degrades cluster efficiency if applied globally to all consumer groups. |
| max.poll.records (Kafka 0.10.0+) | 500 → 100 | Returns smaller batches per poll(), so the application finishes processing sooner and gets back to poll() faster. This tightens the processing loop and reduces the delay between a record arriving at the broker and being processed. • Can cut throughput by up to 5x versus the default, since poll() is called more often and per-poll overhead grows. Only use if per-record processing latency is a product requirement. |
| max.poll.interval.ms (caution) (Kafka 0.10.1+) | 5min → 30s | For latency-sensitive consumers processing small batches (100 records), 30 seconds is a generous bound. A genuinely hung consumer is evicted within 30s instead of 5 minutes, so its partitions are reassigned quickly. • If a burst of slow external calls (DB timeout, API slowdown) makes poll() take >30s, the consumer is wrongly evicted. To avoid this false positive, set the interval conservatively above your p99.9 per-batch processing time. |
| Timeouts & Sessions | | |
| session.timeout.ms (caution) (Kafka 0.9.0+) | 45s → 10s | Enables faster failover: when a consumer crashes, its partitions are reassigned within 10s instead of 45s, keeping lag low for the rest of the group. Critical for latency-sensitive pipelines where lag accumulation during failover is unacceptable. • A 10s session window is very tight: any GC pause >10s (e.g. a full GC on a JVM with a large heap) triggers a false timeout and a rebalance. Requires tuning JVM GC, or using ZGC/Shenandoah, to keep pauses <5s. |
| heartbeat.interval.ms (Kafka 0.9.0+) | 3s → 2s | With session.timeout.ms=10s, a 2s heartbeat keeps the 1:5 heartbeat-to-session ratio needed for reliable liveness detection: 5 heartbeats per session window, so up to 4 can be missed before timeout, enough to survive transient GC pauses. • The heartbeat thread sends 0.5 RPCs/sec per consumer to the group coordinator; at hundreds of consumers this adds measurable coordinator load. Monitor GroupCoordinator request rates. |
| request.timeout.ms (caution) (Kafka 0.9.0+) | 30s → 15s | Makes the consumer fail fast on unresponsive brokers, surfacing an error after 15s instead of 30s. In a latency-sensitive pipeline, a hung broker should be bypassed quickly so the consumer can reconnect to a replica or another broker. • On overloaded clusters with GC pauses >15s, this can cause spurious TimeoutExceptions and connection resets that themselves add latency. Tune to roughly 2x your broker's p99 response time. |
| Consumer Group | | |
| enable.auto.commit (Kafka 0.9.0+) | true → false | Disable auto-commit and call commitAsync() manually after processing each batch. Offsets are then committed only for processed records, preventing loss of processed-record tracking and giving precise at-least-once semantics. For latency-critical consumers, auto-commit's 5s interval can mean up to 5s of reprocessing on restart, spiking latency during recovery. • Each explicit commitAsync() adds a network round-trip to the group coordinator per poll batch; at 100ms poll intervals that is 10 commits/sec, so monitor OffsetCommit latency. Synchronous commitSync() puts the round-trip directly on the processing critical path. |