Object Storage Disk Full Metadata Recovery
When an S3-compatible store fills a disk, the first question is not what to delete. It is whether object data, metadata versions, drive fault state, and quorum can still be made consistent without wiping the PVC.
Separate capacity recovery from metadata repair before destructive cleanup.
Use this when a disk-full event marks drives faulty, splits metadata versions, blocks reads/deletes, or leaves operators choosing between manual deletion and data loss.
preserve evidence -> restore headroom -> prove quorum -> repair metadata
Capture headroom, metadata, quorum, and repair status before cleanup.
These checks are intentionally generic and public-safe. Replace paths and admin commands with the object store's equivalents; do not paste secrets, access keys, or object contents into public issues.
Read-only evidence to capture: df -h (block usage), df -i (inode usage), du on the metadata path, drive state, heal status, and the largest temp and trash directories.
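The headroom part of that evidence can be collected without touching the store's own tooling. The following is a minimal read-only sketch, assuming a POSIX host; the function names `capture_headroom` and `largest_entries` are hypothetical, and drive state and heal status still have to come from the object store's own admin commands.

```python
import os
import shutil


def capture_headroom(path):
    """Return block and inode headroom for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # total, used, free in bytes
    vfs = os.statvfs(path)           # POSIX only; exposes inode counters
    return {
        "path": path,
        "bytes_total": usage.total,
        "bytes_free": usage.free,
        "inodes_total": vfs.f_files,
        "inodes_free": vfs.f_favail,
    }


def _tree_size(directory):
    """Sum apparent file sizes under `directory` without following symlinks."""
    total = 0
    for dirpath, _dirs, files in os.walk(directory):
        for name in files:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_entries(root, top=5):
    """Return the `top` largest immediate children of `root` as (size, path)."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = _tree_size(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Both functions only read filesystem metadata, so they are safe to run before any cleanup decision. Apparent size can differ from allocated blocks on sparse or compressed filesystems, so treat the ranking as a pointer to where to look, not as an exact reclaim estimate.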
Runbook: Do Not Turn a Full Disk Into Data Loss
- Pause writes or admission before the emergency reserve is consumed again.
- Preserve metadata, manifests, drive-state files, and repair logs before manual deletion.
- Separate data path fullness from metadata path corruption. A capacity fix is not proof that metadata can rejoin quorum.
- Reclaim only known-disposable space first: logs, temp files, failed multipart staging, expired trash, or documented cache paths.
- Verify the four operator-critical operations after headroom returns: list, read, delete, and restart.
- Document whether an offline repair or heal path exists. If it does not, the incident needs an explicit "wipe required" boundary and a backup restore plan.
- Add a regression test that injects ENOSPC during metadata write, then proves restart, quorum state, and object operations do not require PVC wipe.
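The regression test in the last item can be sketched at the level of a single metadata file. This is an illustration under stated assumptions, not the behavior of any particular object store: it assumes metadata is written with a temp-file-plus-rename pattern, and the names `write_metadata_atomic` and `enospc_after` are hypothetical. A real test would inject the fault at the store's own write path and then also check restart and quorum state.

```python
import errno
import json
import os
import tempfile


def write_metadata_atomic(path, doc, write=os.write):
    """Write `doc` as JSON to `path` via temp file, fsync, and rename.

    `write` is injectable so a test can simulate ENOSPC mid-write.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".meta-")
    try:
        payload = json.dumps(doc).encode()
        offset = 0
        while offset < len(payload):
            offset += write(fd, payload[offset:])
        os.fsync(fd)
        os.close(fd)
        fd = None
        os.replace(tmp, path)  # the old version stays visible until here
    except BaseException:
        if fd is not None:
            os.close(fd)
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave partial temp files behind
        raise


def enospc_after(limit):
    """Build a fake writer that raises ENOSPC once `limit` bytes are written."""
    state = {"written": 0}

    def fake_write(fd, data):
        room = limit - state["written"]
        if room <= 0:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        n = os.write(fd, data[:room])
        state["written"] += n
        return n

    return fake_write
```

A test built on this writes version 1, attempts version 2 with `write=enospc_after(4)`, and then asserts three things: the call raised `OSError` with `errno.ENOSPC`, the file still parses as version 1, and no `.meta-` temp file remains in the directory.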
Use this when disk-full recovery risks metadata corruption.
This keeps the discussion on acceptance criteria: recoverable headroom, metadata preservation, and offline repair boundaries.
I would make the recovery boundary explicit before recommending any wipe or manual deletion.
Acceptance checks I would want:
- Admission control pauses new writes on disk-full before the recovery reserve is gone.
- Metadata and drive-state files are preserved before any manual cleanup.
- Reclaimable temp/trash/multipart/log paths are documented separately from object data and metadata.
- After headroom is restored, list/read/delete/restart are tested before declaring recovery complete.
- If drives were marked faulty in memory, restart or repair reconciles that state without requiring a PVC wipe.
- If no offline metadata repair tool exists yet, the docs say exactly when restore-from-backup is the only safe path.
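The first acceptance check can be expressed as a small admission predicate. This is a sketch under assumptions: the function name `admit_write`, the `recovery` flag, and the idea that recovery traffic may draw on the reserve are illustrative choices, not a description of any specific store's admission logic.

```python
def admit_write(free_bytes, incoming_bytes, reserve_bytes, recovery=False):
    """Decide whether a write may proceed given current headroom.

    Normal writes must leave `reserve_bytes` untouched. Recovery writes
    (for example heal or metadata repair traffic) may use the reserve,
    but still cannot exceed the free space that actually exists.
    """
    if incoming_bytes < 0 or free_bytes < 0 or reserve_bytes < 0:
        raise ValueError("sizes must be non-negative")
    if recovery:
        return incoming_bytes <= free_bytes
    return free_bytes - incoming_bytes >= reserve_bytes
```

For example, with 10 GiB free and a 8 GiB reserve, a 3 GiB client upload is refused while a 3 GiB repair write is admitted. Keeping the two paths separate is what lets the reserve exist for repair at all: if client writes can consume it, the store reaches the same full-disk state with nothing left for recovery.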
Turn one disk-full object-store incident into a reusable recovery policy.
The $99 policy is for S3-compatible object storage, erasure-coded stores, self-hosted backup/object systems, and stateful services where disk-full can split metadata or block read/delete recovery. You get one recovery boundary, a read-only evidence checklist, and regression-test acceptance criteria.
Do Not Delete First
- Metadata, manifests, drive-state, quorum, or erasure-set files.
- Object data that may be needed to heal parity or rejoin quorum.
- Repair logs that identify whether metadata was split before or after disk-full.
- PVCs or volumes before proving the backup restore boundary.
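A cleanup script can enforce this list by classifying every candidate path before anything is removed. The sketch below is a hedged illustration: the function name `classify_for_cleanup` and all example paths are hypothetical, and it only normalizes path strings. A real guard should also resolve symlinks (for example with `os.path.realpath`) on the host where the paths exist.

```python
import posixpath


def _under(path, root):
    """True if normalized `path` equals `root` or lies beneath it."""
    path = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return path == root or path.startswith(root.rstrip("/") + "/")


def classify_for_cleanup(path, reclaimable_roots, protected_roots):
    """Return 'protected', 'reclaimable', or 'unknown' for `path`.

    Protected roots win over reclaimable ones, and anything that matches
    neither list is 'unknown' and should not be deleted automatically.
    """
    if any(_under(path, root) for root in protected_roots):
        return "protected"
    if any(_under(path, root) for root in reclaimable_roots):
        return "reclaimable"
    return "unknown"
```

With protected roots such as `/data/.meta` and `/data/objects` and reclaimable roots such as `/data/tmp`, a path like `/data/tmp/../.meta/format.json` normalizes into the protected tree and is refused, and `/data/tmpfiles/x` is reported as unknown because it only shares a string prefix with `/data/tmp`. Defaulting to "unknown" keeps the script aligned with the rule of reclaiming only known-disposable space.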