Troubleshooting

This section organizes common problems encountered during controller operation by symptom. Each item links to relevant detailed documentation.

Symptom-Based Diagnosis Tables

Reconciler Infinite Loop

Symptom: Reconcile call count increases endlessly and CPU usage is high.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Writing non-deterministic values to status (timestamps, etc.) | Check with `RUST_LOG=kube=debug` that a patch occurs on every reconcile | Use only deterministic values, or skip the patch when nothing changed |
| `predicate_filter` not applied | Check reconcile logs to see whether status-only changes also trigger reconciles | Apply `predicate_filter(predicates::generation)` |
| Racing with another controller (annotation ping-pong) | Check `resourceVersion` change patterns with `kubectl get -w` | Separate field ownership with SSA |

Details: Reconciler Patterns — Infinite Loop
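One way to break the loop in the first row is to compare the desired status against what is already on the cluster and skip the patch when they match. The sketch below uses a hypothetical `MyStatus` type in plain Rust (no kube types) to show the decision logic only:

```rust
// Hypothetical status type; a real controller derives this from its CRD.
#[derive(Clone, PartialEq, Debug)]
struct MyStatus {
    ready: bool,
    // Deterministic: derived from metadata.generation, not from Utc::now()
    observed_generation: i64,
}

/// Patch only when the status actually differs from what is stored.
fn needs_status_patch(current: Option<&MyStatus>, desired: &MyStatus) -> bool {
    current != Some(desired)
}

fn main() {
    let desired = MyStatus { ready: true, observed_generation: 3 };

    // First reconcile: no status yet, so a patch is needed.
    assert!(needs_status_patch(None, &desired));

    // Steady state: identical status, so the patch (and the watch event
    // it would generate, re-triggering reconcile) is skipped.
    let current = desired.clone();
    assert!(!needs_status_patch(Some(&current), &desired));

    println!("patch decisions are deterministic");
}
```

Because the status is a pure function of the spec and `metadata.generation`, re-running reconcile on unchanged input produces no write, and the loop terminates.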

Continuous Memory Growth

Symptom: Pod memory keeps growing over time, eventually OOMKilled.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Re-list spikes | Check for periodic spikes in the memory graph | Use `streaming_lists()`, reduce `page_size` |
| Large objects in the Store cache | Measure Store size with jemalloc profiling | Strip `managedFields` etc. with `.modify()`, or use `metadata_watcher()` |
| Watch scope too broad | Check the cached object count with the Store's `state().len()` | Narrow the scope with label/field selectors |

Details: Optimization — Reflector optimization, Optimization — re-list memory spikes
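The second row's fix can be sketched as follows. This is a minimal example, not a drop-in implementation; it assumes the `kube` crate with the `runtime` feature and `k8s-openapi`, and method availability can shift between kube versions:

```rust
// Sketch: prune per-object bloat before it enters the reflector Store,
// so the cache holds trimmed copies instead of full objects.
use futures::StreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::Api,
    runtime::{reflector, watcher, WatchStreamExt},
    Client,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    let (reader, writer) = reflector::store::<Pod>();
    let stream = watcher(pods, watcher::Config::default())
        .default_backoff()
        // Drop fields that only inflate the cache; if even the spec is
        // unneeded, switch to metadata_watcher() instead.
        .modify(|pod| {
            pod.metadata.managed_fields = None;
        });

    // Drain the reflector so the Store stays up to date.
    let mut applied = reflector(writer, stream).applied_objects().boxed();
    while let Some(obj) = applied.next().await {
        let _pod = obj?;
        println!("cached {} objects", reader.state().len());
    }
    Ok(())
}
```

The key detail is that `.modify()` runs before the reflector, so the Store never sees the unpruned objects.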

Watch Connection Not Recovering After Disconnect

Symptom: Controller appears stuck, not receiving any events.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| 410 Gone without bookmarks configured | Check logs for `WatchError` 410 | The watcher re-lists automatically with `default_backoff()` |
| Credential expiration | Check logs for 401/403 errors | Verify that `Config::infer()` auto-refreshes; check the exec plugin configuration |
| Backoff not configured | The stream terminates on the first error | Always use `.default_backoff()` |

Details: Watcher State Machine, Error Handling and Backoff — Watcher errors
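A minimal sketch of a watch loop that survives disconnects, assuming kube's `runtime` feature (the exact error variants you see in logs depend on the kube version):

```rust
// Sketch: a watch stream with backoff, so 410 Gone and network drops
// lead to a delayed re-list instead of a terminated stream.
use futures::StreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{api::Api, runtime::{watcher, WatchStreamExt}, Client};

#[tokio::main]
async fn main() -> Result<(), kube::Error> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    let mut stream = watcher(pods, watcher::Config::default())
        // Without this, the first Err item is also the last one.
        .default_backoff()
        .applied_objects()
        .boxed();

    while let Some(event) = stream.next().await {
        match event {
            Ok(pod) => println!("event for {:?}", pod.metadata.name),
            // Errors are surfaced, but the stream keeps running and
            // re-establishes the watch after the backoff.
            Err(err) => eprintln!("watch error: {err}"),
        }
    }
    Ok(())
}
```

Logging the `Err` arm instead of propagating it is what makes the "stuck controller" symptom visible in logs rather than silent.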

API Server Throttling (429)

Symptom: 429 Too Many Requests errors appear frequently in logs.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Too many concurrent reconciles | Check the active reconcile count in metrics | Set `Config::concurrency(N)` |
| Too many watch connections | Count the `owns()` / `watches()` calls | Share watches with a shared reflector |
| Too many API calls in the reconciler | Check the HTTP request count in tracing spans | Use the Store cache; parallelize with `try_join!` |

Details: Optimization — Reconciler optimization, Optimization — API server load
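The first row's fix can be sketched with a built-in type standing in for a CRD. This assumes kube's `runtime` feature; `ConfigMap` and the trivial reconciler are placeholders for your own resource and logic:

```rust
// Sketch: cap concurrent reconciles to reduce API server pressure.
use std::{sync::Arc, time::Duration};
use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    api::Api,
    runtime::{controller::{Action, Config}, watcher, Controller},
    Client,
};

#[derive(Debug)]
struct Error;
impl std::fmt::Display for Error {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "reconcile failed")
    }
}
impl std::error::Error for Error {}

async fn reconcile(cm: Arc<ConfigMap>, _ctx: Arc<()>) -> Result<Action, Error> {
    println!("reconciling {:?}", cm.metadata.name);
    Ok(Action::requeue(Duration::from_secs(300)))
}

fn error_policy(_cm: Arc<ConfigMap>, _err: &Error, _ctx: Arc<()>) -> Action {
    Action::requeue(Duration::from_secs(5))
}

#[tokio::main]
async fn main() -> Result<(), kube::Error> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    Controller::new(cms, watcher::Config::default())
        // At most 2 reconciles in flight at once.
        .with_config(Config::default().concurrency(2))
        .run(reconcile, error_policy, Arc::new(()))
        .for_each(|res| async move {
            if let Err(e) = res {
                eprintln!("reconcile error: {e:?}");
            }
        })
        .await;
    Ok(())
}
```

Lowering concurrency trades reconcile latency for a bounded request rate, which is usually the right trade when 429s appear.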

Finalizer Deadlock (Permanently Terminating)

Symptom: Resource is permanently stuck in Terminating state.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| `cleanup` function failing | Check logs for `cleanup` errors | Design `cleanup` to eventually succeed (treat missing external resources as success) |
| `predicate_filter` blocking finalizer events | Check whether only `predicates::generation` is used | Use `predicates::generation.combine(predicates::finalizers)` |
| Controller is down | Check the Pod status | Handled automatically once the controller recovers |

Emergency release: `kubectl patch <resource> -p '{"metadata":{"finalizers":null}}' --type=merge` (skips `cleanup`)

Details: Relations and Finalizers — Caveats
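The key property behind the first row is idempotence: deleting an external resource that is already gone must count as success, otherwise the finalizer is never removed. A minimal sketch with a hypothetical external-API error type (plain Rust, no kube types):

```rust
// Hypothetical error type for an external service the controller manages.
#[derive(Debug, PartialEq)]
enum ExternalError {
    NotFound,
    Transient(String),
}

/// Treat NotFound as success: the resource we wanted gone is gone.
/// Only transient errors keep the finalizer in place (and get retried).
fn cleanup_outcome(delete: Result<(), ExternalError>) -> Result<(), ExternalError> {
    match delete {
        Ok(()) | Err(ExternalError::NotFound) => Ok(()),
        Err(other) => Err(other),
    }
}

fn main() {
    // Already-deleted external resource: cleanup still succeeds, so the
    // finalizer is removed and the object can finish terminating.
    assert_eq!(cleanup_outcome(Err(ExternalError::NotFound)), Ok(()));

    // Transient failure: keep the finalizer and retry later.
    assert!(cleanup_outcome(Err(ExternalError::Transient("timeout".into()))).is_err());

    println!("cleanup is idempotent");
}
```

Mapping "already gone" to success is what lets `cleanup` eventually succeed even after manual deletions or partial failures.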

Reconciler Not Running

Symptom: Reconciler logs are not printed even when resources change.

| Cause | How to Verify | Solution |
| --- | --- | --- |
| Store not yet initialized | Readiness probe is failing | Verify behavior after `wait_until_ready()` |
| `predicate_filter` blocking all events | Review the predicate logic | Adjust the predicate combination, or temporarily remove it for testing |
| Insufficient RBAC permissions | Check logs for 403 Forbidden | Add `watch`/`list` permissions to the ClusterRole |
| `watcher::Config` selector too narrow | Verify matches with `kubectl get -l <selector>` | Adjust the selector |
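For the first row, anything that reads the Store should start only after the reflector's initial sync. A sketch assuming kube's `runtime` feature (the background-task layout is one possible arrangement, not the only one):

```rust
// Sketch: gate startup on the Store's initial sync.
use futures::StreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{api::Api, runtime::{reflector, watcher, WatchStreamExt}, Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);
    let (reader, writer) = reflector::store::<Pod>();

    // The reflector only fills the Store while it is being polled,
    // so drive it on a background task before waiting.
    let stream = reflector(writer, watcher(pods, watcher::Config::default()))
        .default_backoff()
        .applied_objects();
    tokio::spawn(stream.for_each(|_| async {}));

    // Blocks until the initial LIST has landed in the Store; a readiness
    // probe can be wired to the same signal.
    reader.wait_until_ready().await?;
    println!("store ready with {} objects", reader.state().len());
    Ok(())
}
```

If this never returns, the cause is usually one of the other table rows: RBAC denying the LIST, or a selector that matches nothing.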

Debugging Tools

RUST_LOG Configuration

```bash
# Basic debugging: kube internals + controller logic
RUST_LOG=kube=debug,my_controller=debug

# Inspect individual watch events (very verbose)
RUST_LOG=kube=trace

# Check HTTP request level
RUST_LOG=kube=debug,tower_http=debug

# Suppress noise
RUST_LOG=kube=warn,hyper=warn,my_controller=info
```

Using tracing Spans

Check `object.ref` and `object.reason` in the spans automatically generated by the `Controller`. Enabling JSON logging allows structured searching.

```bash
# Filter reconcile logs for a specific resource
# (the span field name contains a dot, so it must be quoted in jq)
cat logs.json | jq 'select((.span["object.ref"] // "") | contains("my-resource-name"))'
```

Details: Monitoring — Structured logging

Checking State with kubectl

```bash
# Check resource status and events
kubectl describe myresource <name>

# Track changes in real time with watch mode
kubectl get myresource -w

# Check the resourceVersion change pattern (infinite-loop diagnosis)
kubectl get myresource <name> -o jsonpath='{.metadata.resourceVersion}' -w

# Check finalizer state
kubectl get myresource <name> -o jsonpath='{.metadata.finalizers}'
```

Profiling

Memory Profiling (jemalloc)

```toml
[dependencies]
tikv-jemallocator = { version = "0.6", features = ["profiling"] }
```

```rust
// Register jemalloc as the global allocator
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```

```bash
# Enable heap profiling
MALLOC_CONF="prof:true,prof_active:true,lg_prof_interval:30" ./my-controller

# Analyze profile dumps
jeprof --svg ./my-controller jeprof.*.heap > heap.svg
```

Objects cached in the Store often account for the majority of memory. If the profile shows large `AHashMap`-related allocations, apply `.modify()` or `metadata_watcher()`.

Async Runtime Profiling (tokio-console)

Check whether slow reconciler performance is caused by async task scheduling.

```toml
[dependencies]
console-subscriber = "0.4"
```

```rust
// Add at the top of the main function
console_subscriber::init();
```

```bash
# Build with tokio_unstable so task instrumentation is compiled in;
# without it, tokio-console shows no tasks
RUSTFLAGS="--cfg tokio_unstable" cargo build --release

# Connect with the tokio-console client (default port 6669)
tokio-console http://localhost:6669
```

You can monitor per-task poll time, waker count, and wait time in real time. If a reconciler task is blocked for a long time, the cause may be synchronous operations or slow API calls inside it.