Worker Diagnostic Reports Filling Root Disk

Solver, build, and data workers often write optional readiness reports, coverage JSON, temp decode files, and Docker caches onto the same root filesystem. When nothing bounds those writes, the real failure is ENOSPC (no space left on device), but the user sees a misleading artifact or decode error.

Storage cleanup

Get the exact cleanup order before deleting.

Send one request now. The job link, log excerpt, or storage summary can follow after the first reply; we send the $29 payment step only when the issue needs review.

Runbook: Optional Reports Must Not Break Required Work

Separate required artifacts from optional diagnostics. A readiness report, coverage report, or debug JSON should never consume the emergency reserve needed to decode the actual job artifact.
Prune before write, not only on a timer. The large-write path should enforce max age, max count, and max bytes before creating the next report.
Keep a per-job or per-snapshot floor: newest report, latest failure sample, and enough context for debugging, then delete older siblings.
Check the filesystem that actually stores the report and temp decode files. Root may be full even if the object store, database, or artifact checksum is healthy.
Preserve the original storage exception in worker diagnostics. Do not collapse ENOSPC into a misleading missing-table, missing-artifact, or decode fallback.
Add an operational repair path for existing hosts: dry-run prune, staged delete, before/after `df`, and a rollback-free deployment note.
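
The prune-before-write rule and the per-job floor can be sketched as a small retention gate. This is a minimal illustration, not a drop-in implementation: the function names (`plan_prune`, `prune_before_write`), the `protect` set, and the assumption that each job or snapshot has its own flat report directory are all hypothetical. It enforces max age, max count, and max bytes, always keeps the newest reports plus any protected file such as the latest failure sample, and defaults to a dry run.

```python
import time
from pathlib import Path


def plan_prune(report_dir, max_age_s, max_count, max_bytes,
               keep_newest=1, protect=frozenset(), now=None):
    """Return report files to delete, oldest first, without touching disk."""
    now = time.time() if now is None else now
    entries = []
    for path in Path(report_dir).iterdir():
        if path.is_file():
            st = path.stat()
            entries.append((st.st_mtime, path.name, st.st_size, path))
    entries.sort(reverse=True)  # newest first; the name breaks mtime ties

    kept_count = 0
    kept_bytes = 0
    doomed = []
    for index, (mtime, name, size, path) in enumerate(entries):
        # The floor: the newest N reports and any explicitly protected
        # file (for example the latest failure sample) always survive.
        floor = index < keep_newest or name in protect
        over_limit = (
            now - mtime > max_age_s
            or kept_count >= max_count
            or kept_bytes + size > max_bytes
        )
        if over_limit and not floor:
            doomed.append(path)
        else:
            kept_count += 1
            kept_bytes += size
    doomed.reverse()  # oldest first, so a staged delete starts there
    return doomed


def prune_before_write(report_dir, dry_run=True, **limits):
    """Run the retention gate; delete only when dry_run is False."""
    doomed = plan_prune(report_dir, **limits)
    if not dry_run:
        for path in doomed:
            path.unlink(missing_ok=True)
    return doomed
```

Calling `prune_before_write(report_dir, max_age_s=..., max_count=..., max_bytes=...)` immediately before each large report write returns the deletion plan. Passing `dry_run=False` applies it, which also gives operators a staged path on existing hosts: review the plan first, then delete.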

Copy-ready issue reply

Use this when worker diagnostics fill root.

This turns the incident into concrete acceptance checks: retention, preflight, error preservation, and a safe operator cleanup path.

I would make this a pre-write retention gate, not only a background cleanup task.

Acceptance checks I would add:

- Enforce max age, max count, and max bytes for reports/snapshot-coverage before writing the next matrix-readiness report.
- Keep the newest N reports per snapshot/job plus the latest failure sample; prune older siblings first.
- Check both blocks and inodes on the filesystem that stores reports and temporary HDF5 decode files.
- If the emergency reserve would be breached, skip optional report writing and preserve the original worker error.
- Surface ENOSPC/root-disk context in worker_jobs diagnostics instead of falling through to the legacy-table/artifact-missing message.
- Add a dry-run operator cleanup command that prints before/after df and du for reports, temp decode paths, and Docker/build cache.
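
The block and inode check, the emergency-reserve skip, and error preservation might look like the sketch below. It assumes a POSIX host where Python's `os.statvfs` is available; the helper names, the reserve parameters, and the shape of the diagnostics dict are illustrative and are not the actual `worker_jobs` schema.

```python
import errno
import os


def can_write_optional(path, payload_bytes, reserve_bytes, reserve_inodes):
    """True if writing payload_bytes under path keeps the reserve intact."""
    st = os.statvfs(path)  # stats the filesystem that actually holds path
    free_bytes = st.f_bavail * st.f_frsize
    if free_bytes - payload_bytes < reserve_bytes:
        return False
    # Some filesystems report zero total inodes; skip the inode check
    # there instead of treating it as exhaustion.
    if st.f_files > 0 and st.f_favail - 1 < reserve_inodes:
        return False
    return True


def describe_storage_error(exc, path):
    """Keep the original errno and path instead of a generic fallback."""
    return {
        "errno": exc.errno,
        "errno_name": errno.errorcode.get(exc.errno, "UNKNOWN"),
        "is_enospc": exc.errno == errno.ENOSPC,
        "path": str(path),
        "message": str(exc),
    }


def write_optional_report(path, data, reserve_bytes, reserve_inodes):
    """Write an optional report, or skip it; never raise into required work."""
    parent = os.path.dirname(os.path.abspath(path))
    if not can_write_optional(parent, len(data), reserve_bytes, reserve_inodes):
        return {"written": False, "skipped": "emergency reserve would be breached"}
    try:
        with open(path, "wb") as handle:
            handle.write(data)
    except OSError as exc:
        return {"written": False, "storage_error": describe_storage_error(exc, path)}
    return {"written": True}
```

Returning a result instead of raising keeps an optional report failure from replacing the worker's own error. The caller can attach the `storage_error` dict to the job's diagnostics alongside the original exception.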

Do Not Delete First

- The newest successful and failing report sample for the incident.
- The temp decode path before recording whether ENOSPC happened there or in the report path.
- Docker/build caches before measuring whether they are the largest reclaimable bucket.
- Diagnostic logs that contain the original ENOSPC, checksum, or artifact decode exception.
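
Measuring before deleting can be as simple as the read-only sketch below, which sizes each candidate bucket and reads free space without removing anything. The bucket labels and paths are placeholders for wherever a given host keeps its reports, temp decode files, and build caches. It sums apparent file sizes, so the totals can differ from what `du` reports for sparse or heavily fragmented files.

```python
import os
import shutil


def tree_bytes(root):
    """Sum apparent file sizes under root; symlinks are not followed."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def bucket_sizes(buckets):
    """Map label -> bytes, ordered from largest to smallest bucket."""
    sizes = {label: tree_bytes(path) for label, path in buckets.items()}
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


def free_bytes(path):
    """Free bytes on the filesystem holding path, for before/after notes."""
    return shutil.disk_usage(path).free
```

Recording `free_bytes` and `bucket_sizes` before and after each cleanup step shows which bucket actually reclaimed space. A missing path simply counts as zero bytes.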

Deep Cleanup

Still full after the SafeDisk Lite scan?

Start with the SafeDisk Lite scan. If the scan shows review-first storage that still needs judgment, send one request for the $29 Deep Cleanup next step.