Availability
This page covers how to ensure availability for kube-based controllers. Most controllers work fine as a single instance; when HA is needed, leader election is the solution.
Why a Single Instance Is Sufficient
Kubernetes controllers are different from typical web servers. The watch + reconcile loop is a queue consumer model:
- Idempotent reconcile: following the reconciler pattern, every reconcile is idempotent. Even after a temporary interruption and restart, it converges to the desired state
- Automatic watcher recovery: when the watcher restarts, it resumes watching from the last seen resourceVersion, or if that is unavailable, recovers the full state via a re-list
- Scheduler deduplication: even if multiple events for the same object accumulate, the scheduler deduplicates them
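The deduplication point can be illustrated with a minimal work queue in plain Rust (a simplified stand-in, not kube-runtime's actual scheduler): a key enqueued while already pending collapses into a single reconcile.

```rust
use std::collections::{HashSet, VecDeque};

/// Minimal dedup work queue in the spirit of the controller's scheduler:
/// a key can be queued at most once until it is popped for processing.
struct WorkQueue {
    queue: VecDeque<String>,
    pending: HashSet<String>,
}

impl WorkQueue {
    fn new() -> Self {
        Self { queue: VecDeque::new(), pending: HashSet::new() }
    }

    /// Enqueue a reconcile request; keys already waiting are dropped.
    fn enqueue(&mut self, key: &str) {
        if self.pending.insert(key.to_string()) {
            self.queue.push_back(key.to_string());
        }
    }

    /// Pop the next key to reconcile; it may be enqueued again afterwards.
    fn next(&mut self) -> Option<String> {
        let key = self.queue.pop_front()?;
        self.pending.remove(&key);
        Some(key)
    }
}

fn main() {
    let mut q = WorkQueue::new();
    // Three events for the same object and one for another object:
    q.enqueue("ns/app-1");
    q.enqueue("ns/app-1");
    q.enqueue("ns/app-2");
    q.enqueue("ns/app-1");
    assert_eq!(q.next().as_deref(), Some("ns/app-1")); // reconciled once
    assert_eq!(q.next().as_deref(), Some("ns/app-2"));
    assert_eq!(q.next(), None);
}
```

This is why a burst of events during downtime turns into one backlogged reconcile per object after restart, not one per event.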
When a Pod restarts, there is a brief delay, but no data loss or inconsistency occurs.
Recovery Timeline
```text
t=0     Pod termination begins (SIGTERM)
t=0~30  In-progress reconciles complete (graceful shutdown)
t=30    Pod terminates
...     Deployment schedules a new Pod
t=45    New Pod starts, watcher initializes
t=46    Re-list completes; processing of backlogged reconciles begins
```
The default terminationGracePeriodSeconds: 30 on a Deployment is sufficient in most cases. Increase this value if you have long-running reconciles.
Why replicas: 2 Doesn't Work
Watch-based controllers have no load balancer to route requests. If two instances run simultaneously:
| Problem | Description |
|---|---|
| Duplicate reconciles | Both instances reconcile the same object concurrently |
| Conflicts | SSA field manager conflicts, optimistic concurrency (resourceVersion) errors |
| Resource waste | Both instances maintain identical watch streams, doubling API server load |
In particular, with Server-Side Apply, if two instances patch with the same fieldManager, the API server treats them as a single manager: field ownership no longer distinguishes the writers, and conflicting applies can silently undo each other's fields.
With strategy.type: RollingUpdate and maxSurge: 1, two instances briefly run simultaneously during deployment. Since most controllers are idempotent, short periods of duplicate execution are harmless, but leader election makes even this window safe.
Leader Election
A mechanism that ensures only one instance among many is active. kube-rs does not include built-in leader election, so third-party crates are used.
Lease-based Mechanism
How leader election works using the Kubernetes Lease object:
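In outline: each candidate periodically inspects the Lease, renews it if it is already the holder, stands by while another holder's lease is still live, and takes over once the lease is free or expired. Below is a stripped-down, in-memory sketch of that decision logic (plain Rust structs, not the Kubernetes Lease API; takeover atomicity is hand-waved here, whereas the real mechanism relies on the API server's resourceVersion-based optimistic concurrency):

```rust
use std::time::{Duration, Instant};

/// In-memory stand-in for the fields leader election reads off a Lease object.
struct Lease {
    holder_identity: Option<String>,
    renew_time: Instant,
    lease_duration: Duration,
}

/// The decision a candidate makes on each retryPeriod tick.
/// Returns true if this candidate is the leader after the attempt.
fn try_acquire(lease: &mut Lease, me: &str, now: Instant) -> bool {
    let expired = now.duration_since(lease.renew_time) > lease.lease_duration;
    if lease.holder_identity.as_deref() == Some(me) {
        lease.renew_time = now; // we are the leader: renew the lease
        return true;
    }
    if lease.holder_identity.is_some() && !expired {
        return false; // someone else holds a live lease: stand by
    }
    // Lease is free or expired: take it over. In the real mechanism this
    // update only succeeds if no one else updated the Lease concurrently.
    lease.holder_identity = Some(me.to_string());
    lease.renew_time = now;
    true
}

fn main() {
    let t0 = Instant::now();
    let mut lease = Lease {
        holder_identity: None,
        renew_time: t0,
        lease_duration: Duration::from_secs(15),
    };
    assert!(try_acquire(&mut lease, "pod-a", t0));                            // free: acquired
    assert!(!try_acquire(&mut lease, "pod-b", t0 + Duration::from_secs(2)));  // live: rejected
    assert!(try_acquire(&mut lease, "pod-a", t0 + Duration::from_secs(10)));  // holder: renewed
    assert!(try_acquire(&mut lease, "pod-b", t0 + Duration::from_secs(40)));  // expired: taken over
}
```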
Key parameters:
| Parameter | Meaning | Typical Value |
|---|---|---|
| leaseDuration | Leader validity period | 15 seconds |
| renewDeadline | Renewal attempt deadline | 10 seconds |
| retryPeriod | Non-leader retry interval | 2 seconds |
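As a back-of-the-envelope check on these values: if the leader crashes right after a renewal, its lease stays valid for up to leaseDuration, and a standby only notices on its next retry tick, so the worst-case leaderless window is roughly leaseDuration + retryPeriod. A small sketch (the formula and the ordering check are simplifying assumptions of this model, not guarantees of any crate):

```rust
/// Lease timing parameters, mirroring the table above.
struct LeaseTiming {
    lease_duration_secs: u64, // how long an acquired lease is considered valid
    renew_deadline_secs: u64, // leader gives up renewing after this long
    retry_period_secs: u64,   // how often non-leaders retry acquisition
}

impl LeaseTiming {
    /// Rough upper bound on how long the controller can be leaderless after
    /// a crash: standbys wait out the old lease, then notice on the next retry.
    fn worst_case_failover_secs(&self) -> u64 {
        self.lease_duration_secs + self.retry_period_secs
    }

    /// Sanity ordering: the leader must stop acting (renewDeadline) before
    /// its lease can expire, and retries must be faster than the deadline.
    fn is_consistent(&self) -> bool {
        self.renew_deadline_secs < self.lease_duration_secs
            && self.retry_period_secs < self.renew_deadline_secs
    }
}

fn main() {
    let t = LeaseTiming {
        lease_duration_secs: 15,
        renew_deadline_secs: 10,
        retry_period_secs: 2,
    };
    assert!(t.is_consistent());
    assert_eq!(t.worst_case_failover_secs(), 17); // ~17s without a leader, worst case
}
```

If that window is too long for your workload, shrink the parameters together while keeping the ordering, at the cost of more Lease traffic against the API server.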
Third-Party Crates
Leader election crates available in the kube ecosystem:
| Crate | Approach | Features |
|---|---|---|
| kube-leader-election | Lease-based | Simple API, provides renewal loop |
| kube-coordinate | Lease-based | kube-runtime compatible stream API |
| kubert::lease | Lease-based | Used by the Linkerd project |
Usage pattern:
```rust
// Conceptual usage example (API varies by crate)
let lease = LeaseManager::new(client.clone(), "my-controller", "controller-ns");

// Wait until leadership is acquired
lease.wait_for_leadership().await?;

// Run the Controller only while leader
Controller::new(api, wc)
    .shutdown_on_signal()
    .run(reconcile, error_policy, ctx)
    .for_each(|res| async move { /* ... */ })
    .await;
```
Shutdown Coordination
It is important to safely shut down the Controller when leadership is lost:
```rust
Controller::new(api, wc)
    .graceful_shutdown_on(lease.lost_leadership())
    .run(reconcile, error_policy, ctx)
```
When you pass a leadership-lost future to graceful_shutdown_on(), it stops starting new reconciles upon losing leadership, waits for in-progress reconciles to complete, and then shuts down.
Graceful Shutdown
shutdown_on_signal
Controller::shutdown_on_signal() handles SIGTERM and Ctrl+C.
```rust
pub fn shutdown_on_signal(mut self) -> Self
```
Behavior:
- On receiving SIGTERM or SIGINT, stops starting new reconciles
- Waits for in-progress reconciles to complete
- Terminates immediately on receiving a second signal
```rust
Controller::new(api, wc)
    .shutdown_on_signal()
    .run(reconcile, error_policy, ctx)
    .for_each(|res| async move {
        match res {
            Ok(obj) => tracing::info!(?obj, "reconciled"),
            Err(err) => tracing::error!(%err, "reconcile failed"),
        }
    })
    .await;
```
Custom Shutdown Trigger
Use graceful_shutdown_on() to set an arbitrary shutdown condition:
```rust
use tokio::sync::oneshot;

let (tx, rx) = oneshot::channel::<()>();
// Send on tx from anywhere (e.g. an admin endpoint) to trigger shutdown

Controller::new(api, wc)
    .graceful_shutdown_on(async move { rx.await.ok(); })
    .run(reconcile, error_policy, ctx)
```
Deployment Configuration
```yaml
spec:
  replicas: 1
  strategy:
    type: Recreate  # Prevent simultaneous execution (when not using leader election)
  template:
    spec:
      terminationGracePeriodSeconds: 60  # Sufficient shutdown time
      containers:
        - name: controller
          # ...
```
| Strategy | Without leader election | With leader election |
|---|---|---|
| Recreate | Recommended — prevents overlap | Unnecessary |
| RollingUpdate | Brief overlap occurs | Safe — new instance waits |
Elected Shards — HA + Horizontal Scaling
On large-scale clusters, a single leader may not be enough to handle the throughput. In this case, shard the resources so that multiple leaders each handle their own scope.
```text
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Shard 0   │  │   Shard 1   │  │   Shard 2   │
│ ns: team-a  │  │ ns: team-b  │  │ ns: team-c  │
│  (leader)   │  │  (leader)   │  │  (leader)   │
└─────────────┘  └─────────────┘  └─────────────┘
```
Each shard:
- Runs independent leader election with its own Lease
- Watches only resources in its assigned scope (Api::namespaced() for a namespace, or a label selector)
- Ignores resources belonging to other shards
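Assignment of namespaces to shards can be static (one namespace per shard, as above) or derived. As one hypothetical option, a hash of the namespace can map objects onto a fixed number of shards (plain Rust sketch; DefaultHasher's output is not stable across Rust versions, so a real deployment would want a stable hash and a fixed shard count):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical shard assignment: hash each namespace onto one of `shards`
/// shards, so every instance can decide locally which objects are "its own".
fn shard_for(namespace: &str, shards: u64) -> u64 {
    let mut h = DefaultHasher::new();
    namespace.hash(&mut h);
    h.finish() % shards
}

fn main() {
    let shards = 3;
    // Each instance knows its own shard id and skips objects from other shards.
    for ns in ["team-a", "team-b", "team-c"] {
        println!("{ns} -> shard {}", shard_for(ns, shards));
    }
}
```

Whatever the mapping, each shard still runs its own leader election on a shard-specific Lease, so HA and horizontal scaling compose.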
For detailed sharding strategies, see Optimization — Scaling Strategies.
Availability Checklist
| Item | Verified |
|---|---|
| Is the reconciler idempotent? | |
| Is shutdown_on_signal() or graceful_shutdown_on() configured? | |
| Is terminationGracePeriodSeconds sufficient? | |
| Are you avoiding replicas > 1 without leader election? | |
| When using leader election, is leadership loss linked to shutdown? | |
| Is the Deployment strategy appropriate? (Recreate or leader election) | |