Kafka Error STALE_CONTROLLER_EPOCH
Error code 11 · Non-retriable Broker
The controller moved to another broker.
Common Causes
- Split-brain scenario where two brokers simultaneously believe they are the controller; the controller epoch, incremented on every election through ZooKeeper or KRaft, lets the cluster reject requests from the stale one
- Network partition causing the active controller to lose ZooKeeper connectivity, triggering a new controller election with a higher epoch
- Broker sending requests with an outdated controller epoch after a controller failover; typically seen during rolling restarts
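The epoch check behind all three causes can be pictured with a small model. This is a minimal sketch, not Kafka's source code: the function 'handle_controller_request' and the variable 'highest_seen_epoch' are hypothetical names, and it assumes the receiver simply rejects any request whose controller epoch is lower than the highest one it has already seen.

```shell
#!/usr/bin/env bash
# Illustrative model of controller-epoch fencing (hypothetical names, not Kafka code).
# Assumption: the receiver remembers the highest controller epoch it has seen
# and rejects requests that carry a lower one.

highest_seen_epoch=0

handle_controller_request() {
  local request_epoch=$1
  if (( request_epoch < highest_seen_epoch )); then
    # The sender's epoch is older than one already seen: a newer controller exists.
    echo "STALE_CONTROLLER_EPOCH"
    return 1
  fi
  # Equal or newer epoch: accept and remember it.
  highest_seen_epoch=$request_epoch
  echo "NONE"
}
```

Call the function directly (for example 'handle_controller_request 5 >/dev/null') when the remembered epoch should persist, since '$(...)' runs it in a subshell and discards the update.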
Solutions
- This is usually self-healing: the broker will detect the stale epoch, refresh its metadata, and retry with the new controller. Monitor the 'ActiveControllerCount' JMX metric; summed across all brokers, it should equal 1
- If persistent on a ZooKeeper-based cluster, check ZooKeeper health: 'echo mntr | nc <zk-host> 2181 | grep zk_outstanding_requests'; a backlogged ZooKeeper causes repeated controller flaps. Recent ZooKeeper versions only answer 'mntr' if it is listed in '4lw.commands.whitelist'
- Review controller logs on all brokers: grep for 'Resigned as controller' and 'Elected as controller' to trace controller history and identify instability
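To judge whether controller changes are frequent enough to count as instability, the log search can be reduced to a single count. A minimal sketch: the function name 'count_controller_events' is hypothetical, and the pattern assumes the election and resignation messages contain 'elected as controller' or 'resigned as controller' (optionally with 'the'); the exact wording may differ between Kafka versions, so adjust it to what your logs show.

```shell
#!/usr/bin/env bash
# Count controller election/resignation lines in one log file.
# Hypothetical helper; the message wording in the pattern is an assumption.

count_controller_events() {
  local logfile=$1
  # -c prints the number of matching lines, -i ignores case, -E enables (..)? and |.
  # grep exits non-zero when nothing matches, so '|| true' keeps the function's status 0.
  grep -ciE 'elected as (the )?controller|resigned as (the )?controller' "$logfile" || true
}
```

For example, 'count_controller_events /var/log/kafka/server.log' prints one number; a count that keeps growing outside of planned restarts suggests the controller is flapping.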
Diagnostic Commands
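# Look for controller election events in logs (case-insensitive, since message wording varies by version;
# with the default log4j configuration these events may be written to controller.log instead of server.log)
grep -iE 'elected as (the )?controller|resigned as (the )?controller|StaleControllerEpoch' /var/log/kafka/server.log | tail -20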
# Check ZooKeeper health: outstanding requests, average latency, alive connections
echo mntr | nc <zookeeper-host> 2181 | grep -E 'zk_outstanding|zk_avg_latency|zk_num_alive'
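The ZooKeeper check can also be wrapped into a small pass/fail helper for scripts or alerts. This is a sketch under stated assumptions: 'zk_backlog_check' is a hypothetical name, the default threshold of 10 outstanding requests is an arbitrary placeholder rather than a recommended value, and the 'mntr' output is assumed to consist of whitespace-separated key/value lines.

```shell
#!/usr/bin/env bash
# Read 'mntr' output on stdin and report whether ZooKeeper looks backlogged.
# Hypothetical helper; the threshold (default 10) is a placeholder assumption.

zk_backlog_check() {
  local threshold=${1:-10}
  awk -v max="$threshold" '
    $1 == "zk_outstanding_requests" { found = 1; value = $2 }
    END {
      if (!found) { print "UNKNOWN"; exit 2 }                        # metric missing
      if (value + 0 > max + 0) { print "BACKLOGGED " value; exit 1 } # over threshold
      print "OK " value
    }'
}
```

It can be fed from the same command as above: 'echo mntr | nc <zookeeper-host> 2181 | zk_backlog_check 10'. The exit status is 0 when the value is within the threshold, 1 when it is exceeded, and 2 when the metric is not present in the input.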