Full Disk Trash GC Deadlock Recovery
Some stateful services enter a nasty loop when the disk is full: the trash or async deleter is the component that should free space, but it crashes during startup because the filesystem cannot create, list, or write its own recovery directory. Treat that as a storage deadlock, not a normal cache cleanup.
Decide if the cleanup worker can safely boot before anyone deletes state.
For Electric-style async deleters, the first decision is not "what can we delete?" It is whether the deleter can start in degraded mode, surface the original ENOSPC, and retry without taking the tenant stack down.
surface ENOSPC -> boot degraded -> retry cleanup -> gate new trash
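The flow above can be sketched in a language-neutral way. The following is a minimal Python illustration, not the implementation of any particular deleter: `CleanupWorker`, `prepare_trash_dir`, and `boot_with_retry` are hypothetical names, and the retry count and delay are placeholder values.

```python
import errno
import os
import time


def prepare_trash_dir(path):
    """Create and list the trash directory; raises OSError on failure."""
    os.makedirs(path, exist_ok=True)
    return os.listdir(path)


class CleanupWorker:
    """Hypothetical worker that boots degraded instead of crashing."""

    def __init__(self, trash_dir, prepare=prepare_trash_dir):
        self.trash_dir = trash_dir
        self.prepare = prepare
        self.degraded = True      # not ready until a boot attempt succeeds
        self.first_error = None   # original failure, kept for operators

    def try_boot(self):
        """One boot attempt. Storage errors are recorded, not raised."""
        try:
            self.prepare(self.trash_dir)
        except OSError as exc:
            if self.first_error is None:
                self.first_error = exc  # preserve the earliest cause
            self.degraded = True
            return False
        self.degraded = False
        return True

    def accepts_new_trash(self):
        """Gate: refuse new delete requests while degraded."""
        return not self.degraded


def boot_with_retry(worker, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Retry boot a bounded number of times; never raise to the supervisor."""
    for _ in range(attempts):
        if worker.try_boot():
            return True
        sleep(delay_seconds)
    return False
```

If the first attempt fails with ENOSPC and a later one fails with ENOENT, `first_error` still reports ENOSPC, which is the cause an operator needs to see.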
Capture the deadlock boundary before deleting trash.
The goal is to distinguish four states: the trash directory is absent because create failed, present but not listable, full of reclaimable files, or mixed with state that needs owner approval.
df -h; df -i; du -xh <data-root> | sort -h | tail
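The four states can also be told apart with a read-only check. This is a rough sketch under stated assumptions: `TrashState` and `classify_trash_dir` are hypothetical names, and `is_reclaimable` stands in for whatever product-specific rule decides that an entry is plain trash.

```python
import os
from enum import Enum


class TrashState(Enum):
    ABSENT = "absent (create may have failed)"
    UNLISTABLE = "present but not listable"
    RECLAIMABLE = "only reclaimable entries"
    MIXED = "mixed, needs owner approval"


def classify_trash_dir(path, is_reclaimable):
    """Classify a trash directory without writing to the filesystem.

    is_reclaimable(name) is a caller-supplied, product-specific rule.
    An empty directory is reported as RECLAIMABLE (nothing to review).
    """
    try:
        names = os.listdir(path)
    except FileNotFoundError:
        return TrashState.ABSENT
    except OSError:
        # Permission, I/O, or not-a-directory errors: the path exists
        # in some form but cannot be listed.
        return TrashState.UNLISTABLE
    if all(is_reclaimable(name) for name in names):
        return TrashState.RECLAIMABLE
    return TrashState.MIXED
```

An ABSENT result does not by itself prove that create failed; it only says the earlier logs should be checked for an ENOSPC on mkdir.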
Runbook: Break The Self-Reinforcing Loop
- Preserve the first error. A later "directory missing" or "file not found" may only be a consequence of an earlier ENOSPC during mkdir, write, rename, or fsync.
- Measure both bytes and inodes on the mounted data root. A full inode table can make cleanup code fail even when bytes look available.
- Find the trash, deleted, pending-delete, GC, WAL, snapshot, and temp path families separately. Do not mix reclaimable trash with active tenant or database state.
- If the cleanup worker is supervised, make boot tolerant: list the directory with a call that returns an error instead of raising, report the true reason, and keep retrying instead of taking the whole stack down.
- Add a low-space admission check before enqueueing new deletes or snapshots. The system should stop creating more trash before the trash worker loses the ability to run.
- Turn the incident into a guard: minimum free bytes/inodes, max pending trash age/size, oldest cleanup success timestamp, and an emergency owner-approved purge path.
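A low-space admission check along these lines might look like the sketch below. It assumes a Unix host, since `os.statvfs` is not available on Windows, and the names `SpaceReport`, `read_space`, and `admit_new_trash` are hypothetical. The thresholds are inputs, not recommendations.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpaceReport:
    free_bytes: int
    free_inodes: Optional[int]  # None when the filesystem reports no inode count


def read_space(path):
    """Read free bytes and inodes available to unprivileged processes."""
    st = os.statvfs(path)
    # Assumption: a total inode count of 0 means the filesystem allocates
    # inodes dynamically, so the inode check is skipped for it.
    free_inodes = st.f_favail if st.f_files > 0 else None
    return SpaceReport(free_bytes=st.f_bavail * st.f_frsize, free_inodes=free_inodes)


def admit_new_trash(report, min_free_bytes, min_free_inodes):
    """Return (allowed, reason) for enqueueing a new delete or snapshot."""
    if report.free_bytes < min_free_bytes:
        return False, "low free bytes"
    if report.free_inodes is not None and report.free_inodes < min_free_inodes:
        return False, "low free inodes"
    return True, "ok"
```

Calling `admit_new_trash(read_space(data_root), ...)` before every enqueue keeps the decision next to the write path, so the service stops producing trash while the worker can still run.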
Use this when the deleter cannot start on a full disk.
This keeps the thread focused on the actual recovery invariant: make cleanup boot resilient, expose the real ENOSPC, and define safe trash boundaries.
I would treat this as a full-disk cleanup deadlock, not just a missing-directory bug: the component responsible for reclaiming trash has a startup path that itself needs writable/listable storage.
The useful split is:
1. Preserve the real first error. If mkdir/write failed with ENOSPC, do not let a later File.ls/list-dir ENOENT become the surfaced cause.
2. Make the cleanup worker boot resilient on full-disk conditions. Use non-bang list calls, log the true reason, and keep a retry loop alive instead of failing the stack supervisor.
3. Add an admission guard before creating more trash: minimum free bytes/inodes, max pending trash age/size, and last successful delete timestamp.
4. For recovery, capture read-only evidence first:
df -h <data-root>
df -i <data-root>
find <data-root> -maxdepth 4 -type d \( -name "*trash*" -o -name "*deleted*" -o -name "*gc*" \) -print
du -xh <data-root> 2>/dev/null | sort -h | tail -80
Only after that would I define which trash paths are owner-approved for emergency purge versus state that needs product-specific recovery rules.
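The age, size, and last-success parts of the guard in item 3 could be evaluated roughly as follows. This is only a sketch: `TrashGuardLimits` and `guard_violations` are hypothetical names, and how the pending size and timestamps are measured is left to the service.

```python
from dataclasses import dataclass


@dataclass
class TrashGuardLimits:
    max_pending_bytes: int
    max_pending_age_seconds: float
    max_seconds_since_success: float


def guard_violations(limits, pending_bytes, oldest_pending_age_seconds,
                     seconds_since_last_success):
    """Return the list of guard conditions currently violated."""
    violations = []
    if pending_bytes > limits.max_pending_bytes:
        violations.append("pending trash size")
    if oldest_pending_age_seconds > limits.max_pending_age_seconds:
        violations.append("pending trash age")
    if seconds_since_last_success > limits.max_seconds_since_success:
        violations.append("time since last successful delete")
    return violations
```

An empty list means the guard is quiet. Any entry is a reason to alert before the disk reaches 100%.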
Turn one full-disk deadlock into a recovery policy.
The $99 policy is for teams running stateful services where trash, WAL, snapshots, or async deletion can fill a shared data root. You get a safe/review/do-not-touch boundary and the monitoring guard that should have fired before 100% disk.
Do Not Delete First
- Tenant, database, WAL, raft, snapshot, or active state directories without a product owner-approved invariant.
- Trash paths whose ownership is unclear or mixed with active state.
- The first logs that show the original ENOSPC or failed mkdir/write; later errors may be misleading.
- Host-level files from a mounted volume before confirming namespace, tenant, and service ownership.
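One way to make this list mechanical is a conservative path classifier. The sketch below is illustrative only: the name fragments in `PROTECTED` and `TRASH` are assumptions modeled on the path families named earlier, `purge_boundary` is a hypothetical helper, and real services need product-specific rules. Paths are assumed to be absolute and already normalized.

```python
from pathlib import PurePosixPath

# Hypothetical name fragments; substring matching is deliberately crude
# and errs toward protecting more paths rather than fewer.
PROTECTED = ("wal", "raft", "snapshot", "tenant", "db", "database")
TRASH = ("trash", "deleted", "pending-delete", "gc")


def purge_boundary(path, approved_roots):
    """Return "safe", "review", or "do-not-touch" for a candidate path.

    approved_roots lists trash directories an owner has explicitly
    approved for emergency purge.
    """
    p = PurePosixPath(path)
    parts = [part.lower() for part in p.parts]
    has_protected = any(m in part for part in parts for m in PROTECTED)
    has_trash = any(m in part for part in parts for m in TRASH)
    roots = [PurePosixPath(r) for r in approved_roots]
    under_approved = any(p == r or r in p.parents for r in roots)

    if has_protected:
        return "do-not-touch"   # active state, or trash mixed with it
    if has_trash and under_approved:
        return "safe"           # owner-approved trash only
    if has_trash:
        return "review"         # looks like trash, ownership unconfirmed
    return "do-not-touch"       # unknown ownership defaults to protected
```

Nothing is classified "safe" unless an owner has listed its root, so an empty `approved_roots` yields only "review" or "do-not-touch".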