Minimize Kafka Producer End-to-End Latency
Minimize end-to-end produce latency: fast sends, fail-fast timeouts.
Recommended starting points curated by Conduktor. Always benchmark with your workload. Some configs are not available on managed services (AWS MSK, Confluent Cloud); check your provider's documentation.
| Config | Change | Why |
|---|---|---|
| **Batching & Compression** | | |
| `linger.ms` (Kafka 0.8.1+) | 5ms → 0 | Setting linger.ms=0 disables the accumulation wait entirely: the sender thread dispatches a ProduceRequest as soon as a record is appended to a batch, eliminating up to 5ms of forced batching delay introduced by the Kafka 4.0 default. • Each record travels in its own near-empty batch, multiplying the number of ProduceRequests and TCP round-trips by the record rate; throughput drops dramatically at high message rates compared to linger.ms>=1. |
| **Delivery Guarantees** | | |
| `acks` **dangerous** (Kafka 0.8.1+) | all → 1 | acks=1 eliminates the ISR round-trip wait: the leader acknowledges as soon as the record is written to its own log, removing follower replication lag (typically 5-50ms) from the produce critical path. • Messages acknowledged but not yet replicated are PERMANENTLY LOST if the leader crashes before followers catch up. Incompatible with enable.idempotence=true: setting acks=1 with idempotence enabled throws a ConfigException, so you must explicitly set enable.idempotence=false. |
| `enable.idempotence` **dangerous** (Kafka 0.11.0+) | true → false | Disabling idempotence is required when acks=1 is set; it also removes the PID assignment handshake at startup and the per-batch sequence-number tracking overhead, shaving a few microseconds per ProduceRequest on the hot path. • Duplicate records on retry are now possible: a network timeout that causes a retry will produce the record twice with no detection. Transactional semantics become impossible. Only acceptable when duplicate delivery is handled downstream (e.g., idempotent consumers or discardable events). |
| `delivery.timeout.ms` *caution* (Kafka 2.1+) | 2min → 10s | Capping the total delivery window at 10s bounds how long a record can sit in the accumulator awaiting delivery; failed records surface as exceptions quickly rather than silently occupying buffer.memory and inflating tail latency. • Transient broker restarts or leader elections lasting >10s cause permanent record loss (delivery failure surfaced to the error callback) rather than transparent retry. Unsuitable for any durability-sensitive workload. |
| **Timeouts & Sessions** | | |
| `request.timeout.ms` *caution* (Kafka 0.8.0+) | 30s → 5s | Reducing the per-request timeout to 5s makes the producer fail fast and surface broker-side stalls quickly rather than silently queueing for 30s; this keeps application-level latency predictable and triggers circuit-breaker logic sooner. • Under momentary broker GC pauses (>5s, common on JVM brokers) this causes spurious TimeoutException and potentially triggers retries, adding latency instead of removing it. Must satisfy: request.timeout.ms < delivery.timeout.ms - linger.ms. |
| `max.block.ms` *caution* (Kafka 0.9.0+) | 1min → 1s | Reducing the send() block timeout to 1s prevents the calling thread from stalling for up to 60s when buffer.memory is exhausted; instead it fails fast with a BufferExhaustedException that the application can handle (drop, circuit-break, or shed load). • Under traffic bursts larger than buffer.memory, records are rejected rather than queued; the application must implement its own backpressure or queue. Not appropriate if the application cannot tolerate send() throwing exceptions. |
| `socket.connection.setup.timeout.ms` *caution* (Kafka 2.6+) | 10s → 3s | Reducing the initial TCP connection setup timeout to 3s ensures the producer fails fast on unreachable brokers and triggers re-bootstrap sooner, avoiding 10s of blocked send() time on the first request to a cold or failed broker. • On high-latency links (e.g., cross-region, >200ms RTT), TLS handshake plus TCP setup may legitimately exceed 3s, causing spurious connection failures and retries that add latency. Only apply on low-latency same-datacenter deployments. |
| **Metadata & Connections** | | |
| `metadata.max.age.ms` (Kafka 0.8.1+) | 5min → 30s | Refreshing metadata every 30s instead of every 5 minutes shortens the window for stale-leader routing after a partition election. Note that errors like NOT_LEADER_OR_FOLLOWER already trigger an immediate refresh, so the practical impact is mainly a smaller burst of retries before the first error-triggered refresh. • More frequent metadata fetch requests add a small background load on the broker (typically negligible); on large clusters with 10,000+ partitions the metadata response size itself becomes a concern. |