Troubleshooting

This section organizes common problems encountered during controller operation by symptom. Each item links to relevant detailed documentation.

Symptom-Based Diagnosis Tables

Reconciler Infinite Loop

Symptom: Reconcile call count increases endlessly and CPU usage is high.

Cause	How to Verify	Solution
Writing non-deterministic values to status (timestamps, etc.)	Check with `RUST_LOG=kube=debug` that a patch occurs every reconcile	Use only deterministic values or skip the patch when nothing changed
`predicate_filter` not applied	Check reconcile logs to see whether status-only changes also trigger a reconcile	Apply `predicate_filter(predicates::generation, Default::default())`
Racing with another controller (annotation ping-pong)	Check resourceVersion change patterns with `kubectl get -w`	Separate field ownership with SSA

Details: Reconciler Patterns — Infinite Loop
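
One way to apply the "skip the patch when nothing changed" advice is to compare the desired status with the stored one before calling the API. The sketch below assumes a hypothetical `WidgetStatus` type and a helper named `status_to_patch`; neither comes from kube, and a real controller would substitute its own status struct.

```rust
/// Hypothetical status type for illustration only. Note that it carries no
/// timestamp, so two reconciles of the same state produce equal values.
#[derive(Clone, Debug, PartialEq)]
pub struct WidgetStatus {
    pub ready_replicas: u32,
    pub phase: String,
}

/// Returns `Some(desired)` when a status patch is needed and `None` when the
/// stored status already matches, so the caller can skip the API call.
pub fn status_to_patch(
    current: Option<&WidgetStatus>,
    desired: WidgetStatus,
) -> Option<WidgetStatus> {
    match current {
        // Identical status: patching would only bump resourceVersion and
        // trigger another reconcile.
        Some(existing) if *existing == desired => None,
        // Missing or different status: a patch is required.
        _ => Some(desired),
    }
}
```

A reconciler would issue the status patch only when this returns `Some`. Keeping non-deterministic values out of the compared struct is what makes the comparison stable across reconciles.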

Continuous Memory Growth

Symptom: Pod memory is higher than expected or keeps growing over time.

Cause	How to Verify	Solution
Initial list allocations	High baseline memory after startup	Use `streaming_lists()`, and/or reduce `page_size`
Large objects in Store cache	Check Store size with jemalloc profiling	Remove managedFields etc. with `.modify()`, and/or use `metadata_watcher()`
Watch scope too broad	Check cached object count with Store's `state().len()`	Narrow scope with label/field selectors

Details: Optimization — Reflector optimization, Optimization — re-list memory spikes
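
The field-stripping idea behind `.modify()` can be sketched without the kube types. The example below uses a simplified stand-in, `CachedMeta`, which is hypothetical and not a kube type; dropping the `last-applied-configuration` annotation alongside managedFields is an assumption about what "etc." might cover in a given cluster.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for object metadata (hypothetical; not a kube type).
#[derive(Debug, Default)]
pub struct CachedMeta {
    pub name: String,
    pub managed_fields: Vec<String>,
    pub annotations: BTreeMap<String, String>,
}

/// Annotation written by `kubectl apply`; it can hold a full copy of the manifest.
pub const LAST_APPLIED: &str = "kubectl.kubernetes.io/last-applied-configuration";

/// Drops fields the reconciler never reads. A closure passed to `.modify()`
/// would perform the same kind of pruning before the object enters the Store.
pub fn strip_heavy_fields(meta: &mut CachedMeta) {
    meta.managed_fields.clear();
    meta.managed_fields.shrink_to_fit(); // release the backing allocation
    meta.annotations.remove(LAST_APPLIED);
}
```

Only strip fields the reconciler never reads, because the cached copy is all the reconciler sees afterwards.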

Watch Connection Not Recovering After Disconnect

Symptom: Controller appears stuck, not receiving any events.

Cause	How to Verify	Solution
410 Gone (bookmarks not configured)	Check logs for `WatchError` 410	No manual action needed: with `.default_backoff()` the watcher re-lists automatically
Credential expiration	Check logs for 401/403 errors	Verify `Config::infer()` auto-refreshes, check exec plugin configuration
RBAC / NetworkPolicies	Log shows 403 Forbidden	Add watch/list permissions to ClusterRole; check NetworkPolicy allows egress to API server
Backoff not configured	Stream terminates on first error	Always use `.default_backoff()`

Details: Watcher State Machine, Error Handling and Backoff — Watcher errors
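
To make the role of backoff concrete, here is a minimal sketch of capped exponential backoff. The function name `backoff_delay` and the parameters used with it are illustrative assumptions; they are not the values or the implementation that `.default_backoff()` uses, and the sketch omits jitter.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// capped at `max`. Overflow at any step also yields `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt overflows u32 for attempt >= 32; saturate instead of panicking.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```

With a 1 second base and a 30 second cap, the first retries wait 1, 2, 4, 8, and 16 seconds, and every later retry waits 30 seconds. Without any backoff, a stream consumer that propagates the first error simply ends, which matches the "stuck" symptom above.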

API Server Throttling (429)

Symptom: 429 Too Many Requests errors appear frequently in logs.

Cause	How to Verify	Solution
Too many concurrent reconciles	Check active reconcile count in metrics	Set `Config::concurrency(N)` (default is unlimited)
Too many watch connections	Check number of `owns()`, `watches()` calls	Share watches with a shared reflector
Too many API calls in reconciler	Check HTTP request count in tracing spans	Leverage Store cache; batch where possible

Details: Optimization — Reconciler optimization, Optimization — API server load
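
The "leverage Store cache" advice amounts to reading from local state first and calling the API server only on a miss. The sketch below models the cache as a plain `HashMap` keyed by `namespace/name`; the helper `get_cached_or_fetch` and the string payloads are hypothetical stand-ins, not kube APIs.

```rust
use std::collections::HashMap;

/// Looks up `key` in the local cache and calls `fetch` (standing in for an
/// API request) only when the cache has no entry.
pub fn get_cached_or_fetch<F>(
    store: &HashMap<String, String>,
    key: &str,
    fetch: F,
) -> Option<String>
where
    F: FnOnce(&str) -> Option<String>,
{
    match store.get(key) {
        // Cache hit: no request reaches the API server.
        Some(cached) => Some(cached.clone()),
        // Cache miss: fall back to a single API call.
        None => fetch(key),
    }
}
```

Whether to fall back to the API on a miss or to requeue and wait for the cache to catch up is a design choice; the fallback shown here favors freshness over request count.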

Finalizer Deadlock (Permanently Terminating)

Symptom: Resource is permanently stuck in Terminating state.

Cause	How to Verify	Solution
cleanup function failing	Check cleanup errors in logs; monitor via `error_policy` metrics	Design cleanup to eventually succeed (treat missing external resources as success)
predicate_filter blocking finalizer events	Check if only `predicates::generation` is used	Use `predicates::generation.combine(predicates::finalizers)` with `Default::default()` config
Controller is down	Check Pod status	No action needed: cleanup runs automatically once the controller recovers

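Emergency release: `kubectl patch <resource> -p '{"metadata":{"finalizers":null}}' --type=merge`. This removes every finalizer and skips cleanup, so anything cleanup would have deleted must be removed manually.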

Details: Relations and Finalizers — Caveats
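
"Treat missing external resources as success" can be sketched as a small mapping over the result of the external delete call. The `DeleteError` enum and `cleanup_outcome` function are hypothetical names introduced for illustration; a real controller would match on the error type of whatever client it uses.

```rust
/// Hypothetical error type for an external system's delete call.
#[derive(Debug, PartialEq)]
pub enum DeleteError {
    NotFound,
    Transient(String),
}

/// Maps the outcome of deleting an external resource to the cleanup result.
/// `NotFound` counts as success: the resource is already gone, so the
/// finalizer can be removed. Any other error propagates so cleanup is retried.
pub fn cleanup_outcome(delete_result: Result<(), DeleteError>) -> Result<(), DeleteError> {
    match delete_result {
        Ok(()) | Err(DeleteError::NotFound) => Ok(()),
        Err(other) => Err(other),
    }
}
```

This makes cleanup idempotent: if a previous attempt deleted the external resource but failed before the finalizer was removed, the retry still succeeds.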

Reconciler Not Running

Symptom: Reconciler logs are not printed even when resources change.

Cause	How to Verify	Solution
Store not yet initialized (advanced; only with streams interface)	Readiness probe failing	Verify behavior after `wait_until_ready()`
predicate_filter blocking all events	Check predicate logic	Adjust predicate combination or temporarily remove for testing
Insufficient RBAC permissions	Check logs for 403 Forbidden	Add watch/list permissions to ClusterRole
NetworkPolicies blocking API server access	Connection timeouts in logs	Check NetworkPolicy allows egress to API server
watcher Config selector too narrow	Verify matches with `kubectl get -l <selector>`	Adjust selector
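
The predicate rows in this table and in the finalizer table share one mechanism: an event is dropped whenever the watched property did not change. The sketch below is a simplified model of that behavior, assuming a stand-in `MetaSnapshot` type and plain comparisons; kube's predicates work on hashed properties, so this is an approximation and not the library's implementation.

```rust
/// Simplified stand-in for the metadata a predicate inspects (not a kube type).
#[derive(Clone, Debug, PartialEq)]
pub struct MetaSnapshot {
    pub generation: i64,
    pub finalizers: Vec<String>,
}

/// Passes only when `.metadata.generation` changed (typically a spec edit).
pub fn generation_changed(prev: &MetaSnapshot, next: &MetaSnapshot) -> bool {
    prev.generation != next.generation
}

/// Passes only when the finalizer list changed.
pub fn finalizers_changed(prev: &MetaSnapshot, next: &MetaSnapshot) -> bool {
    prev.finalizers != next.finalizers
}

/// Combined filter: the event passes if either property changed.
pub fn should_reconcile(prev: &MetaSnapshot, next: &MetaSnapshot) -> bool {
    generation_changed(prev, next) || finalizers_changed(prev, next)
}
```

Adding or removing a finalizer leaves the generation untouched, so a generation-only filter drops that event while the combined filter lets it through.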

Debugging Tools

RUST_LOG Configuration

# Basic debugging: kube internals + controller logic
RUST_LOG=kube=debug,my_controller=debug

# Inspect individual watch events (very verbose)
RUST_LOG=kube=trace

# Check HTTP request level
RUST_LOG=kube=debug,tower_http=debug

# Suppress noise
RUST_LOG=kube=warn,hyper=warn,my_controller=info

Using tracing Spans

Check the `object.ref` and `object.reason` fields in the spans that the Controller generates automatically. Enabling JSON logging makes these fields searchable as structured data.

# Filter reconcile logs for a specific resource (entries without a span are skipped)
jq 'select((.span."object.ref" // "") | contains("my-resource-name"))' logs.json

Details: Monitoring — Structured logging

Checking State with kubectl

# Check resource status and events
kubectl describe myresource <name>

# Track changes in real-time with watch mode
kubectl get myresource -w

# Check resourceVersion change pattern (infinite loop diagnosis); the trailing newline puts each update on its own line
kubectl get myresource <name> -o jsonpath='{.metadata.resourceVersion}{"\n"}' -w

# Check finalizer state
kubectl get myresource <name> -o jsonpath='{.metadata.finalizers}'

Profiling

Memory Profiling (jemalloc)

# Cargo.toml
[dependencies]
tikv-jemallocator = { version = "*", features = ["profiling"] }

// src/main.rs: route all allocations through jemalloc
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

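# Enable heap profiling (a dump is written roughly every 2^30 bytes allocated).
# tikv-jemallocator prefixes jemalloc by default; in that case set _RJEM_MALLOC_CONF instead.
MALLOC_CONF="prof:true,prof_active:true,lg_prof_interval:30" ./my-controller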

# Analyze profile dumps
jeprof --svg ./my-controller jeprof.*.heap > heap.svg

The Store cache is often the primary memory consumer. If the profile shows large `AHashMap`-related allocations, apply `.modify()` to strip large fields, or switch to `metadata_watcher()`.

Async Runtime Profiling (tokio-console)

Check whether slow reconciler performance is caused by async task scheduling.

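# Cargo.toml
[dependencies]
console-subscriber = "*"

// src/main.rs: add at the top of the main function.
// Task instrumentation requires building with RUSTFLAGS="--cfg tokio_unstable".
console_subscriber::init();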

# Connect with the tokio-console client
tokio-console http://localhost:6669

You can monitor per-task poll time, waker count, and wait time in real time. If a reconciler task is blocked for a long time, the cause may be synchronous operations or slow API calls inside it.

For lightweight runtime metrics without the TUI, consider `tokio-metrics`, whose output can be exported to Prometheus.
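
Before reaching for a profiler, a coarse way to spot synchronous work inside a reconciler is to time individual steps against a budget. The helper below, `timed`, is a hypothetical std-only sketch; it measures wall-clock time for a synchronous closure and is not a substitute for the per-task data tokio-console shows.

```rust
use std::time::{Duration, Instant};

/// Runs `step`, returning its value, the elapsed wall-clock time, and
/// whether the step exceeded `budget`.
pub fn timed<T>(budget: Duration, step: impl FnOnce() -> T) -> (T, Duration, bool) {
    let start = Instant::now();
    let value = step();
    let elapsed = start.elapsed();
    (value, elapsed, elapsed > budget)
}
```

A step that regularly exceeds a budget of a few milliseconds without awaiting is a candidate for moving off the async task, for example onto a blocking thread pool.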
