
Troubleshooting

This section organizes common problems encountered during controller operation by symptom. Each item links to relevant detailed documentation.

Symptom-Based Diagnosis Tables

Reconciler Infinite Loop

Symptom: Reconcile call count increases endlessly and CPU usage is high.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Writing non-deterministic values to status (timestamps, etc.) | Check with `RUST_LOG=kube=debug` that a patch occurs on every reconcile | Use only deterministic values, or skip the patch when nothing changed |
| `predicate_filter` not applied | Check reconcile logs to see whether status-only changes also trigger | Apply `predicate_filter(predicates::generation, Default::default())` |
| Racing with another controller (annotation ping-pong) | Check `resourceVersion` change patterns with `kubectl get -w` | Separate field ownership with SSA |

Details: Reconciler Patterns — Infinite Loop
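The first fix in the table can be sketched in plain Rust: compare the freshly computed status against the current one and only patch on a real change. The `DocStatus` type and `status_patch` helper are hypothetical names for illustration; a real controller would also derive serde traits for the patch body.

```rust
// Hypothetical status type for illustration; a real controller would also
// derive Serialize/Deserialize for the actual patch body.
#[derive(Clone, Debug, PartialEq)]
struct DocStatus {
    ready: bool,
    replicas: i32,
    // Deliberately no `last_reconciled` timestamp: a field that changes on
    // every pass forces a patch, which triggers another reconcile, forever.
}

/// Returns the patch to apply, or None when the status is already correct.
fn status_patch(current: &DocStatus, desired: &DocStatus) -> Option<DocStatus> {
    if current == desired {
        None // skipping the no-op patch breaks the self-triggering loop
    } else {
        Some(desired.clone())
    }
}

fn main() {
    let current = DocStatus { ready: true, replicas: 3 };
    assert!(status_patch(&current, &current.clone()).is_none());

    let desired = DocStatus { ready: true, replicas: 4 };
    assert_eq!(status_patch(&current, &desired), Some(desired.clone()));
}
```

The same idea applies to conditions: recompute them deterministically from the observed state, never from the current wall-clock time.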

Continuous Memory Growth

Symptom: Pod memory usage is higher than expected.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Initial list allocations | High baseline memory after startup | Use `streaming_lists()`, and/or reduce `page_size` |
| Large objects in the Store cache | Check Store size with jemalloc profiling | Strip `managedFields` etc. with `.modify()`, and/or use `metadata_watcher()` |
| Watch scope too broad | Check the cached object count with the Store's `state().len()` | Narrow the scope with label/field selectors |

Details: Optimization — Reflector optimization, Optimization — re-list memory spikes
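The first two mitigations can be combined on one watcher stream. A sketch, assuming the kube, kube-runtime, k8s-openapi, futures, tokio, and anyhow crates, with Pod standing in for your resource; `streaming_lists()` and `WatchStreamExt::modify` are kube-runtime APIs, and this needs a live cluster to actually run:

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::Api,
    runtime::{watcher, WatchStreamExt},
    Client, ResourceExt,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // streaming_lists avoids buffering the entire initial list in memory
    // (requires the WatchList feature on the API server).
    let cfg = watcher::Config::default().streaming_lists();

    watcher(pods, cfg)
        // Strip fields we never read *before* they enter the Store cache.
        .modify(|pod| {
            pod.managed_fields_mut().clear();
            pod.annotations_mut().clear();
        })
        .default_backoff()
        .applied_objects()
        .try_for_each(|p| async move {
            println!("saw {}", p.name_any());
            Ok(())
        })
        .await?;
    Ok(())
}
```

If even the stripped objects are too large, switching to `metadata_watcher()` keeps only `TypeMeta` + `ObjectMeta` in the cache.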

Watch Connection Not Recovering After Disconnect

Symptom: Controller appears stuck, not receiving any events.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| 410 Gone + bookmarks not configured | Check logs for WatchError 410 | The watcher auto-re-lists with `default_backoff()` |
| Credential expiration | Check logs for 401/403 errors | Verify `Config::infer()` auto-refreshes; check exec plugin configuration |
| RBAC / NetworkPolicies | Log shows 403 Forbidden | Add watch/list permissions to the ClusterRole; check that a NetworkPolicy allows egress to the API server |
| Backoff not configured | Stream terminates on first error | Always use `.default_backoff()` |

Details: Watcher State Machine, Error Handling and Backoff — Watcher errors
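The "always use a backoff" row is worth showing concretely. A sketch, assuming the kube, kube-runtime, k8s-openapi, futures, tokio, and anyhow crates (ConfigMap is just a stand-in resource; this needs a live cluster to run):

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    api::Api,
    runtime::{watcher, WatchStreamExt},
    Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    // Without a backoff, a single transient error (e.g. a 410 Gone after the
    // resourceVersion expires) terminates the stream and the controller hangs.
    // default_backoff() retries with jittered exponential backoff, and the
    // watcher state machine re-lists internally on desync.
    watcher(cms, watcher::Config::default())
        .default_backoff()
        .applied_objects()
        .try_for_each(|cm| async move {
            println!("event for {:?}", cm.metadata.name);
            Ok(())
        })
        .await?;
    Ok(())
}
```

If the stream still terminates, the error that escapes `try_for_each` tells you whether the problem is credentials (401/403) or network reachability.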

API Server Throttling (429)

Symptom: 429 Too Many Requests errors appear frequently in logs.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Too many concurrent reconciles | Check the active reconcile count in metrics | Set `Config::concurrency(N)` (the default is unlimited) |
| Too many watch connections | Check the number of `owns()` / `watches()` calls | Share watches with a shared reflector |
| Too many API calls in the reconciler | Check the HTTP request count in tracing spans | Leverage the Store cache; batch where possible |

Details: Optimization — Reconciler optimization, Optimization — API server load
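Capping reconcile concurrency is a one-line change on the Controller. A sketch, assuming the kube, kube-runtime, k8s-openapi, futures, tokio, and anyhow crates; the `Ctx` type and the requeue intervals are illustrative, and this needs a live cluster to run:

```rust
use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    api::Api,
    runtime::{
        controller::{Action, Config, Controller},
        watcher,
    },
    Client,
};

struct Ctx; // illustrative empty context

async fn reconcile(_obj: Arc<ConfigMap>, _ctx: Arc<Ctx>) -> Result<Action, kube::Error> {
    // ... read from the Store cache instead of issuing GETs per reconcile ...
    Ok(Action::requeue(Duration::from_secs(300)))
}

fn error_policy(_obj: Arc<ConfigMap>, _err: &kube::Error, _ctx: Arc<Ctx>) -> Action {
    Action::requeue(Duration::from_secs(30))
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    Controller::new(cms, watcher::Config::default())
        // Cap concurrent reconciles; the default (unlimited) can flood the
        // API server with request bursts and trigger 429s.
        .with_config(Config::default().concurrency(4))
        .run(reconcile, error_policy, Arc::new(Ctx))
        .for_each(|res| async move {
            if let Err(e) = res {
                eprintln!("reconcile failed: {e:?}");
            }
        })
        .await;
    Ok(())
}
```

Start with a small `N` and raise it while watching the 429 rate; the right value depends on how many API calls each reconcile issues.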

Finalizer Deadlock (Permanently Terminating)

Symptom: Resource is permanently stuck in Terminating state.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| cleanup function failing | Check cleanup errors in logs; monitor via `error_policy` metrics | Design cleanup to eventually succeed (treat missing external resources as success) |
| `predicate_filter` blocking finalizer events | Check whether only `predicates::generation` is used | Use `predicates::generation.combine(predicates::finalizers)` with the `Default::default()` config |
| Controller is down | Check Pod status | Handled automatically once the controller recovers |

Emergency release (skips cleanup): `kubectl patch <resource> -p '{"metadata":{"finalizers":null}}' --type=merge`

Details: Relations and Finalizers — Caveats
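The predicate fix from the table can be sketched on a watcher stream. This assumes the kube, kube-runtime, k8s-openapi, futures, tokio, and anyhow crates; exact import paths for the `Predicate` trait may vary by kube version, and the sketch needs a live cluster to run:

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    api::Api,
    runtime::{predicates, watcher, Predicate, WatchStreamExt},
    Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    // predicates::generation alone swallows the metadata-only event that
    // removes a finalizer, so deletions hang in Terminating. Combining it
    // with predicates::finalizers lets finalizer changes through as well.
    watcher(cms, watcher::Config::default())
        .applied_objects()
        .predicate_filter(predicates::generation.combine(predicates::finalizers))
        .try_for_each(|cm| async move {
            println!("relevant change: {:?}", cm.metadata.name);
            Ok(())
        })
        .await?;
    Ok(())
}
```

A quick way to confirm this class of bug: delete a test resource and check whether the reconciler logs a cleanup pass at all.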

Reconciler Not Running

Symptom: Reconciler logs are not printed even when resources change.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Store not yet initialized (advanced; only with the streams interface) | Readiness probe failing | Verify behavior after `wait_until_ready()` |
| `predicate_filter` blocking all events | Check the predicate logic | Adjust the predicate combination, or temporarily remove it for testing |
| Insufficient RBAC permissions | Check logs for 403 Forbidden | Add watch/list permissions to the ClusterRole |
| NetworkPolicies blocking API server access | Connection timeouts in logs | Check that a NetworkPolicy allows egress to the API server |
| watcher Config selector too narrow | Verify matches with `kubectl get -l <selector>` | Adjust the selector |
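The first row (an uninitialized Store) can be checked with `wait_until_ready()`. A sketch, assuming the kube, kube-runtime, k8s-openapi, futures, tokio, and anyhow crates (ConfigMap is a stand-in resource; needs a live cluster to run):

```rust
use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    api::Api,
    runtime::{reflector, watcher, WatchStreamExt},
    Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    let (reader, writer) = reflector::store::<ConfigMap>();
    let mut stream = reflector(writer, watcher(cms, watcher::Config::default()))
        .default_backoff()
        .applied_objects()
        .boxed();

    // The store only fills while something polls the stream.
    tokio::spawn(async move { while stream.next().await.is_some() {} });

    // Reading the store before this resolves may observe an empty cache,
    // which can look exactly like "the reconciler never runs".
    reader.wait_until_ready().await?;
    println!("cache ready: {} objects", reader.state().len());
    Ok(())
}
```

Wiring `wait_until_ready()` into the readiness probe makes this failure mode visible in `kubectl get pods` instead of silent.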

Debugging Tools

RUST_LOG Configuration

```sh
# Basic debugging: kube internals + controller logic
RUST_LOG=kube=debug,my_controller=debug

# Inspect individual watch events (very verbose)
RUST_LOG=kube=trace

# Check the HTTP request level
RUST_LOG=kube=debug,tower_http=debug

# Suppress noise
RUST_LOG=kube=warn,hyper=warn,my_controller=info
```

Using tracing Spans

Check the `object.ref` and `object.reason` fields in the spans automatically generated by the Controller. Enabling JSON logging allows structured searching.

```sh
# Filter reconcile logs for a specific resource
cat logs.json | jq 'select(.span."object.ref" | contains("my-resource-name"))'
```

Details: Monitoring — Structured logging

Checking State with kubectl

```sh
# Check resource status and events
kubectl describe myresource <name>

# Track changes in real time with watch mode
kubectl get myresource -w

# Check the resourceVersion change pattern (infinite-loop diagnosis)
kubectl get myresource <name> -o jsonpath='{.metadata.resourceVersion}' -w

# Check finalizer state
kubectl get myresource <name> -o jsonpath='{.metadata.finalizers}'
```

Profiling

Memory Profiling (jemalloc)

```toml
# Cargo.toml
[dependencies]
tikv-jemallocator = { version = "*", features = ["profiling"] }
```

```rust
// main.rs
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```

```sh
# Enable heap profiling
MALLOC_CONF="prof:true,prof_active:true,lg_prof_interval:30" ./my-controller

# Analyze profile dumps
jeprof --svg ./my-controller jeprof.*.heap > heap.svg
```

The Store cache is often the primary memory consumer. If the profile shows large AHashMap-related allocations, apply .modify() to strip large fields, or switch to metadata_watcher().

Async Runtime Profiling (tokio-console)

Check whether slow reconciler performance is caused by async task scheduling.

```toml
# Cargo.toml
[dependencies]
console-subscriber = "*"
```

```rust
// Add at the top of the main function
console_subscriber::init();
```

```sh
# Connect with the tokio-console client
tokio-console http://localhost:6669
```

You can monitor per-task poll time, waker count, and wait time in real time. If a reconciler task is blocked for a long time, the cause may be synchronous operations or slow API calls inside it.

For lightweight runtime metrics without the TUI, consider tokio-metrics; its per-task counters can be wired into a Prometheus exporter.