Kubernetes PVC WAL No Space Left CrashLoopBackOff
When a stateful pod fills a PVC with WAL, snapshots, raft logs, database logs, or trace files, the wrong cleanup can turn a recoverable ENOSPC into data loss. Start with read-only evidence, prove the growth bucket, then choose between resize, retention fix, safe archive, or a guarded restart path.
Turn one PVC-full incident into a repeatable cleanup policy.
The $99 team policy covers one representative Kubernetes or CI storage failure: largest PVC bucket, safe/review/do-not-touch paths, temporary rescue order, and the retention guard to prevent the next CrashLoopBackOff.
Copy the read-only Kubernetes PVC evidence checklist.
This does not delete anything. It captures the exact volume, restart reason, top directories, inode pressure, and growth suspects so the cleanup boundary is clear before touching WAL, snapshots, database files, or logs.
kubectl get pvc,pod; kubectl describe pod <pod>; kubectl exec <pod> -- sh -c 'df -h; df -i; du -xh /data | sort -h | tail'
Runbook: Prove The Growth Bucket Before Cleanup
In a PVC-full stateful workload, "free space now" and "do not corrupt state" are separate decisions. Treat the first pass as evidence collection and containment.
- Confirm the failure mode: pod phase, restart count, last state, and whether `no space left on device` happens during normal writes or startup recovery.
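- Measure both bytes and inodes inside the mounted volume. A byte-full PVC and an inode-full PVC lead to different fixes: the first needs bytes reclaimed or a larger volume, while the second usually means too many small files.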
- Identify the top path family: WAL, raft logs, snapshots, database logs, temp files, debug traces, backups, or build/test artifacts.
- Check whether retention or truncation stopped before disk pressure began. Look for the last successful compact/import/export/prune event and the first sustained error.
- If the pod cannot start, resize the PVC or move non-state files (for example debug traces or old logs) off the volume first. Do not delete stateful database/WAL paths until the owner has a recovery invariant.
- Convert the incident into a guard: max retained snapshots, max log age, min free bytes before writes, monotonic growth alert, and a startup escape path.
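To make the "identify the top path family" step concrete, here is a minimal sketch that totals file sizes per first-level directory under the volume root. It is an illustration under stated assumptions, not a definitive tool: `sum_by_bucket` is a hypothetical helper name, `/data` is an assumed mount path, and the input is assumed to be `<bytes> <path>` lines, such as GNU find emits with `-printf "%s %p\n"`.

```shell
#!/bin/sh
# sum_by_bucket ROOT  (hypothetical helper, read-only)
# Reads "<bytes> <path>" lines on stdin and prints total bytes per
# first-level directory under ROOT, smallest first.
sum_by_bucket() {
    awk -v root="${1%/}" '
        {
            size = $1 + 0
            path = $0
            sub(/^[0-9]+ /, "", path)             # keep paths that contain spaces
            if (index(path, root "/") != 1) next  # skip anything outside ROOT
            rest = substr(path, length(root) + 2)
            n = split(rest, parts, "/")
            bucket = (n > 1) ? parts[1] : "(files in root)"
            total[bucket] += size
        }
        END { for (b in total) printf "%.0f %s\n", total[b], b }
    ' | sort -n
}
```

One possible use is to pipe a read-only file listing from the pod into it, for example `kubectl -n <namespace> exec <pod> -- sh -lc 'find /data -xdev -type f -printf "%s %p\n"' | sum_by_bucket /data`. The last line of output is then the largest bucket, which is the one to classify as safe, review, or do-not-touch before any cleanup.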
Use this for WAL or snapshot growth incidents.
The reply keeps the conversation useful: evidence first, no blind deletion, and a concrete retention guard.
I would split this into two tracks: immediate PVC rescue and the retention/truncation guard that prevents the next CrashLoopBackOff.
Read-only evidence I would capture before deleting anything:
kubectl -n <namespace> get pvc,pod -o wide
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> --previous --tail=200
kubectl -n <namespace> exec <pod> -- sh -lc 'df -h; df -i; du -xh /data 2>/dev/null | sort -h | tail -80'
kubectl -n <namespace> exec <pod> -- sh -lc 'find /data -xdev -type f -size +256M -printf "%s %p\n" 2>/dev/null | sort -n | tail -80'
For the fix, I would add a monotonic-growth alert and a retention/truncation health signal: last successful truncate/import/export, current retained snapshot/WAL count, ticks since success, and remaining PVC bytes. If startup recovery also writes temp/index files, it needs a low-space escape path so restart does not just re-enter ENOSPC.
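The two guard signals mentioned in the reply, remaining bytes and monotonic growth, could be sketched roughly as follows. This assumes a POSIX shell with `df` and `awk` available in the container image. The function names and any thresholds are hypothetical placeholders, and how the used-bytes samples are collected (a sidecar, a cron job, a metrics exporter) is left open.

```shell
#!/bin/sh
# free_kb PATH: available 1K blocks on the filesystem that holds PATH.
# -P forces the one-line-per-filesystem POSIX format, so column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_min_free PATH MIN_KB: exit 0 only when at least MIN_KB is still free.
# Intended as a check before a write-heavy step such as startup recovery.
has_min_free() {
    avail=$(free_kb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# is_monotonic_growth N: reads one used-bytes sample per line on stdin
# (oldest first) and exits 0 when the last N samples are strictly increasing.
is_monotonic_growth() {
    awk -v n="$1" '
        { v[NR] = $1 + 0 }
        END {
            n = n + 0
            if (NR < n) exit 1
            for (i = NR - n + 2; i <= NR; i++)
                if (v[i] <= v[i - 1]) exit 1
            exit 0
        }'
}
```

A startup script might call `has_min_free /data <min_kb>` before rebuilding indexes and fall back to a reduced recovery mode when it fails, while an alerting job could feed recent samples to `is_monotonic_growth` and page when growth never plateaus. Both are design suggestions only; the right thresholds depend on the workload.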
Do Not Delete First
- WAL, raft logs, snapshots, or database files without a product-specific recovery invariant.
- Current leader/follower state directories while the workload may still be writing.
- Crash logs that contain the first ENOSPC or first truncation failure until evidence has been captured.
- Kubernetes PVC contents from the host node unless the pod owner confirms path ownership and filesystem semantics.
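Before any crash log is rotated or removed, it can help to record where the first ENOSPC appears. The sketch below is one way to do that on a saved copy of the log. `first_enospc_line` is a hypothetical helper name, and the two match strings are an assumption about how the workload reports the error.

```shell
#!/bin/sh
# first_enospc_line: reads log text on stdin and prints the first line that
# mentions ENOSPC, prefixed with its line number. Exits 1 when nothing matches.
first_enospc_line() {
    awk '
        tolower($0) ~ /no space left on device|enospc/ {
            print NR ": " $0
            found = 1
            exit
        }
        END { exit !found }
    '
}
```

For example, save the previous container's log with `kubectl -n <namespace> logs <pod> --previous > previous.log`, then run `first_enospc_line < previous.log` and keep both the file and the reported line with the incident notes.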
Good Paid Scope
This is a good fit for a $99 team policy when the same class of problem can recur across branches, pods, runners, or environments. It is not a fit if you need emergency database recovery, legal compliance, or deletion of production state without an owner-approved backup and restore plan.