
Reduce Kafka Broker Storage Costs

Reduce disk usage and storage costs with retention tuning, log compaction, and tiered storage.

Recommended starting points curated by Conduktor. Always benchmark with your workload. Some broker configs are not available on managed services (AWS MSK, Confluent Cloud) — check your provider's documentation.

Log Segments & Compaction
log.cleaner.io.max.bytes.per.second (caution)
Kafka 0.9+
unlimited → 50MB/s
Throttles log compaction I/O to 50MB/s. The default is uncapped (Double.MAX_VALUE), so compaction can fully saturate broker I/O, causing produce and fetch latency spikes for all other topics on the same broker. This is a common source of mysterious performance degradation.
• Compaction runs slower, so compacted topics take longer to reclaim space. Monitor log-cleaner metrics for compaction backlog.
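A quick way to sanity-check this trade-off is to estimate how long the cleaner needs to work through a compaction backlog at the capped rate. A rough sketch; the 500GiB backlog is an assumed figure, not a measurement:

```python
def compaction_hours(dirty_bytes: int, throttle_bytes_per_sec: int) -> float:
    """Lower bound on compaction time: the cleaner both reads and rewrites
    data, so real wall-clock time is higher than this estimate."""
    return dirty_bytes / throttle_bytes_per_sec / 3600

# 500 GiB of dirty data at a 50 MB/s cap takes at least ~2.8 hours.
hours = compaction_hours(500 * 1024**3, 50 * 1024**2)
```

If the estimated catch-up time exceeds your compacted topics' tolerance for stale duplicates, raise the throttle rather than removing it.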
log.segment.bytes
Kafka 0.8.0+
1GB → 256MB
Smaller segments (256MB) allow the retention enforcement loop to reclaim disk more granularly. With 1GB segments, a topic that hits its retention byte limit must wait for a full 1GB segment to be eligible for deletion; with 256MB segments, cleanup is 4x more precise.
• More segment files mean more file handles and more sparse index files per partition. On a broker with 10,000 partitions, cutting segment size from 1GB to 256MB quadruples the segment file count.
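The file-handle impact is simple multiplication. A sketch with assumed per-partition retention; note each segment also carries its own index files, so the real handle count is a small multiple of this:

```python
import math

def segment_file_count(partitions: int, retained_bytes: int, segment_bytes: int) -> int:
    """Approximate segment (.log) files on a broker:
    segments per partition x partition count."""
    return partitions * math.ceil(retained_bytes / segment_bytes)

# 10,000 partitions retaining 4 GiB each:
at_1gib = segment_file_count(10_000, 4 * 1024**3, 1024**3)          # 40,000 files
at_256mib = segment_file_count(10_000, 4 * 1024**3, 256 * 1024**2)  # 160,000 files
```

Check the broker's file-descriptor ulimit against the larger figure before rolling the change out.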
log.cleaner.threads
Kafka 0.8.1+
1 → 2
With log.cleanup.policy=compact, a single cleaner thread can fall behind on high-write-rate compacted topics, causing the dirty ratio to grow and increasing storage consumption. Two threads allow parallel compaction across different topic-partitions.
• Each cleaner thread consumes log.cleaner.dedupe.buffer.size (default 128MB) of heap memory; 2 threads means 256MB reserved for cleaner deduplication buffers.
log.cleaner.min.cleanable.ratio
Kafka 0.8.1+
0.5 → 0.3
A partition is eligible for compaction when its dirty (uncompacted) bytes exceed this fraction of total log bytes. Lowering from 0.5 to 0.3 triggers compaction earlier, keeping the log more aggressively compacted at the cost of more frequent cleaner cycles.
• More frequent compaction cycles consume more CPU and disk I/O bandwidth. On clusters where compaction I/O already saturates disk, this worsens the problem.
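A simplified model of the eligibility check makes the knob concrete (a sketch of the documented dirty-ratio rule, not Kafka's actual cleaner code):

```python
def eligible_for_compaction(dirty_bytes: int, clean_bytes: int,
                            min_cleanable_ratio: float) -> bool:
    """A log becomes a compaction candidate once its dirty fraction
    exceeds min.cleanable.ratio."""
    total = dirty_bytes + clean_bytes
    return total > 0 and dirty_bytes / total > min_cleanable_ratio

# 400 MB dirty on a 1 GB log (dirty ratio 0.4):
# a candidate at ratio 0.3, but not at the 0.5 default.
```

In practice this means a partition can carry up to half its size in duplicate keys at the default before the cleaner touches it.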
log.cleaner.dedupe.buffer.size
Kafka 0.8+
128MB → 256MB
Doubles the log compaction deduplication buffer from 128MB to 256MB. This memory is split evenly across all cleaner threads (e.g. 2 threads = 128MB each). Larger per-thread buffer means fewer compaction passes for topics with high key cardinality, reducing I/O.
• Consumes 256MB of broker heap. Only beneficial if you use log.cleanup.policy=compact; with delete-only topics this memory is wasted.
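You can estimate how many unique keys one compaction pass can deduplicate. The 24 bytes per entry (key hash plus offset) and 0.9 load factor follow Kafka's documented offset-map sizing; treat the result as approximate:

```python
def approx_keys_per_pass(total_buffer_bytes: int, cleaner_threads: int,
                         load_factor: float = 0.9,
                         bytes_per_entry: int = 24) -> int:
    """The dedupe buffer is split across cleaner threads and filled to the
    configured load factor; each unique key costs ~24 bytes in the offset map."""
    return int(total_buffer_bytes / cleaner_threads * load_factor / bytes_per_entry)

# 256 MiB shared by 2 threads -> ~5M unique keys deduplicated per pass.
keys = approx_keys_per_pass(256 * 1024**2, 2)
```

If a compacted topic's key cardinality exceeds this, each cleaning requires multiple passes, which is exactly the extra I/O the larger buffer avoids.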
Retention
log.retention.hours (caution)
Kafka 0.7+
168h → 24h
Reducing retention from 7 days to 24 hours cuts disk usage by 7x for constant-throughput topics. Most real-time consumers catch up within minutes; a 24h window provides ample replay buffer without hoarding a week of data.
• Consumers that fall more than 24 hours behind lose the ability to replay; a lagging consumer group whose offset is past the earliest available offset will receive OffsetOutOfRangeException and must decide whether to reset to earliest (data loss) or latest (gap).
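The steady-state footprint is just ingest rate × retention window × replication factor. A sizing sketch with assumed throughput and the common replication factor of 3:

```python
def retained_bytes(ingest_bytes_per_sec: float, retention_hours: float,
                   replication_factor: int = 3) -> float:
    """Steady-state disk footprint across the cluster, replicas included."""
    return ingest_bytes_per_sec * retention_hours * 3600 * replication_factor

week = retained_bytes(10 * 1024**2, 168)  # 10 MiB/s for 7 days, RF 3 -> ~17.3 TiB
day = retained_bytes(10 * 1024**2, 24)    # the same topic at 24h    -> ~2.5 TiB
```

Run the numbers per topic before cutting retention; the savings are exactly proportional to the window for constant-throughput topics.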
log.retention.bytes (caution)
Kafka 0.8.0+
unlimited → 10GB
Setting a hard per-partition cap of 10GB prevents runaway disk usage when a topic suddenly receives much higher write volume than expected. Combined with log.retention.hours, whichever limit is hit first triggers segment deletion.
• Byte-based retention is per-partition, not per-topic; a topic with 12 partitions at 10GB cap uses up to 120GB. Size limits must be set with the full partition-count multiplication in mind.
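The multiplication is worth writing out, since replication compounds it. A sketch assuming replication factor 3:

```python
def topic_disk_cap_bytes(partitions: int, retention_bytes: int,
                         replication_factor: int = 3) -> int:
    """retention.bytes is enforced per partition; every replica stores
    its own copy, so the cluster-wide cap multiplies by RF as well."""
    return partitions * retention_bytes * replication_factor

# 12 partitions x 10 GiB cap x RF 3 = 360 GiB of cluster disk for one topic.
cap = topic_disk_cap_bytes(12, 10 * 1024**3)
```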
log.retention.check.interval.ms
Kafka 0.8.0+
5min → 1min
The log retention cleaner checks for eligible segments every 5 minutes by default; reducing this to 1 minute deletes expired or over-size segments sooner, capping the temporary disk overshoot between checks at roughly check interval × write rate.
• More frequent checks add minor CPU overhead from file stat calls on all segments; negligible on clusters with < 100,000 segments total.
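The worst-case overshoot is easy to bound. A sketch with an assumed aggregate write rate:

```python
def retention_overshoot_bytes(write_bytes_per_sec: float,
                              check_interval_ms: int) -> float:
    """Worst case: partitions keep growing for one full check interval
    after crossing their retention limit."""
    return write_bytes_per_sec * check_interval_ms / 1000

at_5min = retention_overshoot_bytes(50 * 1024**2, 300_000)  # ~14.6 GiB over limit
at_1min = retention_overshoot_bytes(50 * 1024**2, 60_000)   # ~2.9 GiB over limit
```

Size your disk headroom to at least this overshoot on top of the configured retention limits.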
log.segment.delete.delay.ms (caution)
Kafka 0.8.2+
1min → 10s
Segments eligible for deletion are held for this delay before actual removal to allow active readers to finish. Reducing from 60s to 10s recovers disk space faster after retention-triggered deletions, which matters when retention.bytes limits are tight.
• If a consumer or follower is actively reading a segment that is concurrently deleted, it will receive an IOException and must re-fetch from an earlier offset. On clusters where consumers are always caught up this risk is negligible.
Tiered Storage
remote.log.storage.system.enable (caution)
Kafka 3.6+
false → true
Enables tiered storage (KIP-405), offloading older log segments to cheap object storage (S3, GCS, Azure Blob) while keeping only recent data on local disks.
• Adds operational complexity (remote storage credentials, availability dependency on object store). Fetch latency for cold reads increases to 50-200ms.
log.local.retention.ms (caution)
Kafka 3.6+
inherit → 1d
Keeps only 24 hours of data on local disk when tiered storage is enabled. Older segments are served from remote storage, reducing local disk usage by 80-95% for long-retention topics.
• Consumer reads beyond 24h hit remote storage with higher latency. Requires remote.log.storage.system.enable=true.
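The local-disk saving is simply the ratio of the two retention windows. A sketch assuming a 14-day total retention topic:

```python
def local_disk_fraction(local_retention_hours: float,
                        total_retention_hours: float) -> float:
    """Share of a topic's data that stays on broker disks;
    the remainder lives only in object storage."""
    return local_retention_hours / total_retention_hours

# 24h local window on 14-day retention -> ~7% of the data stays local,
# i.e. a ~93% reduction in local disk usage for that topic.
frac = local_disk_fraction(24, 14 * 24)
```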
log.local.retention.bytes (caution)
Kafka 3.6+
inherit → 1GB
Caps local disk usage to 1GB per partition when tiered storage is enabled, complementing time-based local retention for predictable disk capacity planning.
• High-throughput partitions may evict local segments faster than expected, causing more remote reads. Value is per-partition, not per-topic.
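To see how quickly a hot partition churns through its local window, divide the cap by the partition's write rate. A sketch with an assumed per-partition rate:

```python
def local_residency_seconds(local_retention_bytes: int,
                            partition_write_bytes_per_sec: float) -> float:
    """How long newly written data stays on local disk before only the
    remote copy remains."""
    return local_retention_bytes / partition_write_bytes_per_sec

# A 1 GiB cap at 5 MiB/s per partition -> data goes remote-only
# after ~205 seconds (~3.4 minutes).
secs = local_residency_seconds(1024**3, 5 * 1024**2)
```

If this comes out shorter than your consumers' typical lag, most reads will hit remote storage; raise the cap for those topics.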
remote.log.manager.copy.max.bytes.per.second
Kafka 3.6+
unlimited → 50MB/s
Throttles remote copy to 50MB/s per broker to prevent tiered storage uploads from saturating network bandwidth and impacting live produce/fetch traffic.
• May cause upload lag during traffic spikes if throttle is too aggressive. Monitor remote-log-manager copy lag metrics.
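A quick model of how upload backlog builds during a spike helps pick the throttle. The ingest rate and spike duration below are illustrative assumptions:

```python
def copy_backlog_bytes(ingest_bytes_per_sec: float,
                       throttle_bytes_per_sec: float,
                       spike_seconds: float) -> float:
    """Un-uploaded bytes accumulate on local disk whenever ingest
    outpaces the remote copy throttle."""
    return max(0.0, ingest_bytes_per_sec - throttle_bytes_per_sec) * spike_seconds

# 80 MiB/s ingest against a 50 MiB/s throttle for a 10-minute spike
# -> ~17.6 GiB of data waiting on local disk to be uploaded.
backlog = copy_backlog_bytes(80 * 1024**2, 50 * 1024**2, 600)
```

As long as average ingest stays below the throttle, the backlog drains after the spike; if it does not, local disk grows without bound.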
remote.log.reader.max.pending.tasks
Kafka 3.6+
100 → 200
Doubles the tiered storage read task queue from 100 to 200 pending tasks. When consumers read cold data, remote fetch tasks queue up behind the reader threads; once the queue fills, further remote fetches fail. 200 provides headroom for bursty patterns.
• Each pending task holds memory for the fetch context. On brokers with limited heap, a large queue can contribute to memory pressure under sustained cold-read storms.
remote.log.reader.threads
Kafka 3.6+
10 → 20
Doubles the thread pool for reading from remote (tiered) storage. During consumer catch-up after outages, 10 threads can bottleneck cold reads, causing fetch timeouts. 20 threads better handle bursty cold-read patterns.
• More threads consume CPU and increase concurrent object store requests. Monitor remote storage API rate limits.
remote.log.manager.task.interval.ms
Kafka 3.6+
30s → 10s
Reduces the segment archival check interval from 30s to 10s. Segments become eligible for remote copy sooner, reducing the window where data exists only on local disk. Critical for bursty topics that produce large segments quickly.
• More frequent task runs add minor CPU overhead. The tasks themselves are I/O-bound so the scheduler overhead is negligible.