Kubernetes PVC WAL No Space Left CrashLoopBackOff

When a stateful pod fills a PVC with WAL, snapshots, raft logs, database logs, or trace files, the wrong cleanup can turn a recoverable ENOSPC into data loss. Start with read-only evidence, prove the growth bucket, then choose between resize, retention fix, safe archive, or a guarded restart path.

Team storage incident

Turn one PVC-full incident into a repeatable cleanup policy.

The $99 team policy covers one representative Kubernetes or CI storage failure: largest PVC bucket, safe/review/do-not-touch paths, temporary rescue order, and the retention guard to prevent the next CrashLoopBackOff.

1. Capture PVC, pod, and top path evidence read-only.

2. Separate safe archive candidates from database state.

3. Ship a repeatable team runbook instead of one-off deletion.

Free first pass

Copy the read-only Kubernetes PVC evidence checklist.

This does not delete anything. It captures the exact volume, restart reason, top directories, inode pressure, and growth suspects so the cleanup boundary is clear before touching WAL, snapshots, database files, or logs.

kubectl get pvc,pod; kubectl describe pod <pod>; kubectl exec <pod> -- sh -c 'df -h; df -i; du -xh /data | sort -h | tail'
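A minimal sketch of the same checklist as a script that writes each output to its own file, so the evidence survives a later restart. The function name `capture_evidence`, the output file names, and the `/data` default are assumptions for illustration. `kubectl exec` can fail while the container is not running, which is common in CrashLoopBackOff, so each step records its failure instead of aborting.

```sh
#!/bin/sh
# Read-only evidence capture: nothing here deletes or modifies volume contents.
# capture_evidence, the file names, and DATA_DIR are hypothetical; adjust them.
DATA_DIR="${DATA_DIR:-/data}"

capture_evidence() {
  ns="$1"; pod="$2"; out="$3"
  mkdir -p "$out" || return 1

  # run NAME CMD...: save stdout and stderr to $out/NAME.txt, note any failure.
  run() {
    name="$1"; shift
    "$@" >"$out/$name.txt" 2>&1 || echo "$name failed" >>"$out/errors.txt"
  }

  run pvc_pod   kubectl -n "$ns" get pvc,pod -o wide
  run describe  kubectl -n "$ns" describe pod "$pod"
  run prev_logs kubectl -n "$ns" logs "$pod" --previous --tail=200
  # exec needs a running container; a failure is recorded in errors.txt.
  run df_bytes  kubectl -n "$ns" exec "$pod" -- df -h
  run df_inodes kubectl -n "$ns" exec "$pod" -- df -i
  run du_top    kubectl -n "$ns" exec "$pod" -- \
    sh -c "du -xh '$DATA_DIR' 2>/dev/null | sort -h | tail -40"
}

# Run only when a namespace and pod are given, e.g.: sh capture.sh prod db-0
if [ "$#" -ge 2 ]; then
  capture_evidence "$1" "$2" "${3:-evidence-$(date -u +%Y%m%dT%H%M%SZ)}"
fi
```

If `errors.txt` lists the `exec` steps, the container was probably not running at capture time, and the in-volume numbers have to come from another route that the pod owner approves.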


Runbook: Prove The Growth Bucket Before Cleanup

In a PVC-full stateful workload, "free space now" and "do not corrupt state" are separate decisions. Treat the first pass as evidence collection and containment.

1. Confirm the failure mode: pod phase, restart count, last state, and whether `no space left on device` occurs during normal writes or during startup recovery.
2. Measure both bytes and inodes inside the mounted volume. A byte-full PVC and an inode-full PVC lead to different fixes.
3. Identify the top path family: WAL, raft logs, snapshots, database logs, temp files, debug traces, backups, or build/test artifacts.
4. Check whether retention or truncation stopped before disk pressure began. Look for the last successful compact/import/export/prune event and the first sustained error.
5. If the pod cannot start, resize the PVC or move non-state evidence such as logs and traces off the volume first. Do not delete stateful database or WAL paths until the owner has a recovery invariant.
6. Convert the incident into a guard: max retained snapshots, max log age, min free bytes before writes, monotonic growth alert, and a startup escape path.
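The bytes-versus-inodes measurement in the runbook can be sketched as a small classifier. The function names and the 90% threshold are assumptions for illustration, not values from any specific product; run it inside the pod or against the mounted path.

```sh
#!/bin/sh
# Hypothetical sketch: decide whether a volume is byte-full, inode-full, or both.
# THRESHOLD (percent used) is an assumed default; tune it per workload.
THRESHOLD="${THRESHOLD:-90}"

# classify_pressure BYTES_PCT INODE_PCT -> prints bytes | inodes | both | ok
classify_pressure() {
  bytes_pct="$1"; inode_pct="$2"
  # Some filesystems report "-" for inode usage; treat non-numeric input as 0.
  case "$bytes_pct" in ''|*[!0-9]*) bytes_pct=0 ;; esac
  case "$inode_pct" in ''|*[!0-9]*) inode_pct=0 ;; esac
  if [ "$bytes_pct" -ge "$THRESHOLD" ] && [ "$inode_pct" -ge "$THRESHOLD" ]; then
    echo both
  elif [ "$bytes_pct" -ge "$THRESHOLD" ]; then
    echo bytes
  elif [ "$inode_pct" -ge "$THRESHOLD" ]; then
    echo inodes
  else
    echo ok
  fi
}

# used_pct FLAG PATH: used percentage from column 5 of POSIX-format df output.
# FLAG is -k for bytes or -i for inodes.
used_pct() {
  df -P "$1" "$2" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Run only when a path is given, e.g.: sh classify.sh /data
if [ "$#" -ge 1 ]; then
  classify_pressure "$(used_pct -k "$1")" "$(used_pct -i "$1")"
fi
```

A `bytes` result points toward resize or retention work, while an `inodes` result usually means a large number of small files, where adding capacity alone may not help.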

Copy-ready issue reply

Use this for WAL or snapshot growth incidents.

The reply keeps the conversation useful: evidence first, no blind deletion, and a concrete retention guard.

I would split this into two tracks: immediate PVC rescue and the retention/truncation guard that prevents the next CrashLoopBackOff.

Read-only evidence I would capture before deleting anything:

kubectl -n <namespace> get pvc,pod -o wide
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> --previous --tail=200
kubectl -n <namespace> exec <pod> -- sh -lc 'df -h; df -i; du -xh /data 2>/dev/null | sort -h | tail -80'
kubectl -n <namespace> exec <pod> -- sh -lc 'find /data -xdev -type f -size +256M -printf "%s %p\n" 2>/dev/null | sort -n | tail -80'

For the fix, I would add a monotonic-growth alert and a retention/truncation health signal: last successful truncate/import/export, current retained snapshot/WAL count, ticks since success, and remaining PVC bytes. If startup recovery also writes temp/index files, it needs a low-space escape path so restart does not just re-enter ENOSPC.
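As a rough illustration of the monotonic-growth alert and the remaining-bytes check, the sketch below works on used-byte samples collected by whatever monitoring already exists. The function names, the sample format (one integer per line, oldest first), and any thresholds are assumptions, not part of a product API.

```sh
#!/bin/sh
# is_monotonic_growth: reads used-byte samples from stdin, oldest first.
# Exit 0 when there are at least two samples and each exceeds the previous one.
is_monotonic_growth() {
  awk 'NR > 1 && ($1 + 0) <= prev { bad = 1 }
       { prev = $1 + 0 }
       END { if (NR < 2 || bad) exit 1 }'
}

# has_min_free AVAILABLE_KB MIN_FREE_KB: exit 0 when enough space remains.
has_min_free() {
  [ "$1" -ge "$2" ]
}

# available_kb PATH: available kilobytes from column 4 of `df -Pk`.
available_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Run only when a path and a minimum are given, e.g.: sh guard.sh /data 1048576
if [ "$#" -ge 2 ]; then
  has_min_free "$(available_kb "$1")" "$2" || echo "low space on $1" >&2
fi
```

A window that never shrinks suggests that truncation or pruning has stopped; pairing it with the time since the last successful truncate separates a busy volume from a stuck one.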


Do Not Delete First

WAL, raft logs, snapshots, or database files without a product-specific recovery invariant.
Current leader/follower state directories while the workload may still be writing.
Crash logs that contain the first ENOSPC or first truncation failure until evidence has been captured.
Kubernetes PVC contents from the host node unless the pod owner confirms path ownership and filesystem semantics.
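One way to make this list enforceable is a review gate that never proposes protected paths as cleanup candidates. The sketch below is a dry run that only prints labels. The function names and the path patterns (`wal`, `raft`, `snapshots`) are hypothetical examples and need to be replaced with the directories the workload actually owns.

```sh
#!/bin/sh
# PROTECTED_PATTERNS: space-separated shell patterns. These names are
# hypothetical placeholders, not the layout of any particular database.
PROTECTED_PATTERNS="${PROTECTED_PATTERNS:-*/wal/* */raft/* */snapshots/*}"

# is_protected PATH: exit 0 when PATH matches any protected pattern.
is_protected() {
  set -f  # keep the patterns from expanding against the local filesystem
  for pat in $PROTECTED_PATTERNS; do
    case "$1" in
      $pat) set +f; return 0 ;;
    esac
  done
  set +f
  return 1
}

# label_candidates: reads one path per line from stdin and prints a label.
# Nothing is deleted; unprotected paths are marked for human review only.
label_candidates() {
  while IFS= read -r path; do
    if is_protected "$path"; then
      echo "DO-NOT-TOUCH $path"
    else
      echo "REVIEW $path"
    fi
  done
}
```

Inside the pod, something like `find /data -xdev -type f -size +256M | label_candidates` would feed it. Unprotected paths are labeled `REVIEW` rather than safe, because the pattern list cannot prove that a file is not state.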

Good Paid Scope

This is a good fit for a $99 team policy when the same class of problem can recur across branches, pods, runners, or environments. It is not a fit if you need emergency database recovery, legal compliance, or deletion of production state without an owner-approved backup and restore plan.