
Minimize Kafka Producer End-to-End Latency

Minimize end-to-end produce latency — fast sends, fail-fast timeouts.

Recommended starting points curated by Conduktor. Always benchmark with your workload. Some broker configs are not available on managed services (AWS MSK, Confluent Cloud) — check your provider's documentation.

Producer

Each entry lists the config, the change to apply, and why.
Batching & Compression
linger.ms
Kafka 0.8.1+
5ms → 0. Setting linger.ms=0 disables the accumulation wait entirely: the sender thread dispatches a ProduceRequest as soon as a record is appended to a batch, eliminating up to 5ms of forced batching delay introduced by the default since Kafka 4.0.
• Each record travels in its own (near-empty) batch, so the number of ProduceRequests and TCP round-trips grows with the record rate; throughput drops dramatically at high message rates compared to linger.ms>=1.
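As a concrete sketch, this is how the batching change looks in the raw `java.util.Properties` map handed to a KafkaProducer. The class name and broker address are placeholders, and only the standard library is used so the snippet stands alone:

```java
import java.util.Properties;

/** Minimal sketch of the batching change; names and addresses are placeholders. */
public class LowLatencyBatching {
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker:9092"); // placeholder, not a real cluster
        // Dispatch each record immediately instead of waiting up to 5 ms
        // (the default since Kafka 4.0) for more records to co-batch.
        p.put("linger.ms", "0");
        return p;
    }
}
```

The same `Properties` object is what you would pass to `new KafkaProducer<>(props)` in a real application; it is kept free of kafka-clients imports here so the config keys stay the focus.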
Delivery Guarantees
acks (dangerous)
Kafka 0.8.1+
all → 1. acks=1 eliminates the ISR round-trip wait: the leader acknowledges as soon as the record is written to its own log, removing follower replication lag (typically 5-50ms) from the produce critical path.
• Messages acknowledged but not yet replicated are PERMANENTLY LOST if the leader crashes before followers catch up. Incompatible with enable.idempotence=true — setting acks=1 with idempotence throws ConfigException; you must explicitly set enable.idempotence=false.
enable.idempotence (dangerous)
Kafka 0.11.0+
true → false. Disabling idempotence is required when acks=1 is set; it also removes the PID-assignment handshake at startup and the per-batch sequence-number tracking, shaving a few microseconds per ProduceRequest on the hot path.
• Duplicate records on retry are now possible: a network timeout that causes a retry will produce the record twice with no detection. Transactional semantics become impossible. Only acceptable when duplicate delivery is handled downstream (e.g., idempotent consumers or discardable events).
delivery.timeout.ms (caution)
Kafka 2.1+
2min → 10s. Capping the total delivery window at 10s bounds how long a record can sit in the accumulator awaiting delivery; failed records surface as exceptions quickly rather than silently holding on to buffer.memory and inflating tail latency.
• Transient broker restarts or leader elections lasting >10s will cause permanent record loss (delivery failure surfaced to the error callback) rather than transparent retry. Unsuitable for any durability-sensitive workload.
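Taken together, the three Delivery Guarantees changes form one coherent (and deliberately lossy) configuration; acks=1 and enable.idempotence=false must be set as a pair or client-side validation fails. A hedged sketch, with placeholder class name and broker address:

```java
import java.util.Properties;

/** Latency-over-durability sketch; trades loss safety for speed as the caveats warn. */
public class LowLatencyDelivery {
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker:9092"); // placeholder
        // Leader-only acknowledgement: removes the ISR round-trip.
        p.put("acks", "1");
        // Must be disabled explicitly: acks=1 with idempotence enabled
        // fails client-side validation with a ConfigException.
        p.put("enable.idempotence", "false");
        // Surface delivery failures after 10 s instead of retrying for up to 2 min.
        p.put("delivery.timeout.ms", "10000");
        return p;
    }
}
```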
Timeouts & Sessions
request.timeout.ms (caution)
Kafka 0.8.0+
30s → 5s. Reducing the per-request timeout to 5s makes the producer fail fast and surface broker-side stalls quickly rather than silently queueing for 30s; this keeps application-level latency predictable and triggers circuit-breaker logic sooner.
• Under momentary broker GC pauses (>5s, common on JVM brokers) this causes spurious TimeoutException and potentially triggers retries, adding latency instead of removing it. Must satisfy request.timeout.ms ≤ delivery.timeout.ms - linger.ms; the client rejects configurations where delivery.timeout.ms < linger.ms + request.timeout.ms.
max.block.ms (caution)
Kafka 0.9.0+
1min → 1s. Reducing the send() block timeout to 1s prevents the calling thread from stalling for up to 60s when buffer.memory is exhausted; instead send() fails fast with a BufferExhaustedException that the application can handle (drop, circuit-break, or shed load).
• Under traffic bursts larger than buffer.memory, records are rejected rather than queued; the application must implement its own backpressure or queue. Not appropriate if the application cannot tolerate send() throwing exceptions.
socket.connection.setup.timeout.ms (caution)
Kafka 2.6+
10s → 3s. Reducing the initial TCP connection setup timeout to 3s ensures the producer fails fast on unreachable brokers and moves on to another broker sooner, avoiding up to 10s of blocked send() time on the first request to a cold or failed broker.
• On high-latency links (e.g., cross-region, >200ms RTT) TLS handshake + TCP setup may legitimately exceed 3s, causing spurious connection failures and retries that add latency. Only apply on low-latency same-datacenter deployments.
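The three timeout changes in this section interact with each other and with the earlier settings, so it is worth checking the client's own validation rule (delivery.timeout.ms ≥ linger.ms + request.timeout.ms) when assembling them. A sketch with placeholder names that mirrors that check:

```java
import java.util.Properties;

/** Fail-fast timeout sketch; placeholder class name and broker address. */
public class LowLatencyTimeouts {
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker:9092");           // placeholder
        p.put("linger.ms", "0");                             // no batching wait
        p.put("delivery.timeout.ms", "10000");               // total delivery window
        p.put("request.timeout.ms", "5000");                 // fail-fast per request
        p.put("max.block.ms", "1000");                       // send() throws, never stalls 60s
        p.put("socket.connection.setup.timeout.ms", "3000"); // fast TCP-setup failure
        // Mirror the client's own validation so a misconfiguration fails here,
        // not at KafkaProducer construction time:
        int delivery = Integer.parseInt(p.getProperty("delivery.timeout.ms"));
        int linger   = Integer.parseInt(p.getProperty("linger.ms"));
        int request  = Integer.parseInt(p.getProperty("request.timeout.ms"));
        if (delivery < linger + request) {
            throw new IllegalStateException(
                "delivery.timeout.ms must be >= linger.ms + request.timeout.ms");
        }
        return p;
    }
}
```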
Metadata & Connections
metadata.max.age.ms
Kafka 0.8.1+
5min → 30s. Refreshing metadata every 30s instead of every 5 minutes shortens the window of stale-leader routing after a partition leader election. Note that errors like NOT_LEADER_OR_FOLLOWER already trigger an immediate refresh, so the practical impact is mainly reducing the burst of retried sends before that first error-triggered refresh.
• More frequent metadata fetch requests add a small background overhead to the broker (typically negligible); on large clusters with 10,000+ partitions the metadata response size itself becomes a concern.
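For reference, here is every recommendation on this page collapsed into a flat producer.properties fragment. It is a starting point under the assumptions stated above (same-datacenter deployment, duplicates and loss tolerable downstream), not a drop-in; benchmark with your workload first:

```properties
# Latency-first producer settings -- benchmark before adopting.
# Placeholder address:
bootstrap.servers=broker:9092

# Batching
linger.ms=0

# Delivery guarantees (lossy: see the caveats above)
acks=1
enable.idempotence=false
delivery.timeout.ms=10000

# Timeouts (must keep delivery.timeout.ms >= linger.ms + request.timeout.ms)
request.timeout.ms=5000
max.block.ms=1000
socket.connection.setup.timeout.ms=3000

# Metadata
metadata.max.age.ms=30000
```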