Key Concepts
Maintenance actions
Snowpack supports five maintenance actions, always executed in this order:
- rewrite_data_files — Compacts small data files into fewer, optimally sized files. This is the most impactful action for query performance.
- rewrite_position_delete_files — Compacts small position-delete files into fewer, larger ones and drops delete entries that point at data files no longer in the table, reducing the read-time overhead of applying deletes.
- expire_snapshots — Removes snapshots older than the retention threshold, freeing the metadata layer from tracking stale table states.
- rewrite_manifests — Consolidates manifest files to reduce planning time for queries that scan large tables.
- remove_orphan_files — Deletes files in the table's storage location that are no longer referenced by any remaining snapshot or metadata file.
The ordering matters. The rewrite actions run first, so the files they replace end up referenced only by older snapshots. expire_snapshots then removes those snapshots, and only after that can remove_orphan_files tell which files are truly unreferenced. Removing orphan files before expiring snapshots would miss files that are still referenced by soon-to-expire snapshots.
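The ordering rule can be sketched as a small helper that normalizes whatever subset of actions a caller requests into the fixed order above. This is a minimal illustration, not Snowpack's code: the names ACTION_ORDER and order_actions are hypothetical, and rejecting unknown action names is an assumed design choice.

```python
from typing import Iterable, List

# Canonical execution order, mirroring the list above.
ACTION_ORDER = (
    "rewrite_data_files",
    "rewrite_position_delete_files",
    "expire_snapshots",
    "rewrite_manifests",
    "remove_orphan_files",
)


def order_actions(requested: Iterable[str]) -> List[str]:
    """Return the requested actions, deduplicated, in canonical order."""
    requested_set = set(requested)
    # Hypothetical validation: reject names that are not known actions.
    unknown = requested_set.difference(ACTION_ORDER)
    if unknown:
        raise ValueError(f"unknown maintenance actions: {sorted(unknown)}")
    # Walk the fixed order and keep only what was asked for, so a caller
    # can never schedule orphan removal ahead of snapshot expiration.
    return [action for action in ACTION_ORDER if action in requested_set]


print(order_actions(["remove_orphan_files", "expire_snapshots", "rewrite_data_files"]))
# → ['rewrite_data_files', 'expire_snapshots', 'remove_orphan_files']
```

Iterating over the fixed tuple rather than sorting the caller's list keeps the order a single source of truth.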
Health analysis
Snowpack evaluates table health by inspecting Iceberg metadata for four key metrics:
- Small file count — Number of data files below the target file size.
- Snapshot count — Total snapshots retained by the table.
- Manifest count — Number of manifest files in the current metadata.
- Position delete files — Count of outstanding position-delete files.
Each metric is compared against configurable thresholds. When any metric exceeds
its threshold, the table is flagged as needs_maintenance. Health data is
available in two flavors:
- Live — Fetched directly from the Glue catalog and S3 via PyIceberg. Accurate but takes a few seconds per table.
- Cached — Served from Postgres. Returns in roughly 1 ms, refreshed periodically by the health-sync process.
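The threshold check described above can be sketched as follows, assuming the four metrics have already been read from Iceberg metadata (live or cached). The class and function names are hypothetical, the default threshold values are placeholders rather than Snowpack's real defaults, and "exceeds" is interpreted as strictly greater than.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class HealthMetrics:
    small_file_count: int
    snapshot_count: int
    manifest_count: int
    position_delete_file_count: int


@dataclass(frozen=True)
class Thresholds:
    # Placeholder defaults for illustration only, not Snowpack's real values.
    small_file_count: int = 100
    snapshot_count: int = 50
    manifest_count: int = 20
    position_delete_file_count: int = 10


def count_small_files(file_sizes_bytes: List[int], target_size_bytes: int) -> int:
    """Count data files strictly below the target file size."""
    return sum(1 for size in file_sizes_bytes if size < target_size_bytes)


def exceeded_metrics(metrics: HealthMetrics, thresholds: Thresholds) -> List[str]:
    """Names of metrics whose value is strictly greater than its threshold."""
    return [
        f.name
        for f in fields(HealthMetrics)
        if getattr(metrics, f.name) > getattr(thresholds, f.name)
    ]


def needs_maintenance(metrics: HealthMetrics, thresholds: Thresholds) -> bool:
    # A single exceeded metric is enough to flag the table.
    return bool(exceeded_metrics(metrics, thresholds))


MiB = 1024 * 1024
small = count_small_files([4 * MiB, 200 * MiB, 1 * MiB], target_size_bytes=128 * MiB)
metrics = HealthMetrics(small_file_count=small, snapshot_count=75,
                        manifest_count=5, position_delete_file_count=0)
print(small, exceeded_metrics(metrics, Thresholds()), needs_maintenance(metrics, Thresholds()))
# → 2 ['snapshot_count'] True
```

Returning the list of exceeded metrics, not just a boolean, makes it easy to report why a table was flagged.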
Opt-out model
Snowpack maintains tables by default. Any Iceberg table in a database that the platform team has added to the orchestrator allowlist is eligible for automated maintenance — no per-table action is required to enroll.
A table is maintained unless one of these opts it out:
- Explicit opt-out. A data engineer sets the snowpack.maintenance_enabled table property to false:

      ALTER TABLE lakehouse_dev.my_database.my_table
      SET TBLPROPERTIES ('snowpack.maintenance_enabled' = 'false');

  The snowpack.maintenance_enabled property is three-state: true (always maintained), false (never maintained), and unset (maintained in opt-out mode).
- Hard exclude. Setting compaction_skip = 'true' removes the table from all Snowpack maintenance regardless of mode. Use it for tables undergoing migration or manual intervention.
The platform team still controls which databases are in scope via the
databases allowlist in the Helm values. A table is maintained
only if its database is allowlisted and it has not opted out.
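The eligibility rules above can be combined into one decision function. This is a sketch under stated assumptions: is_maintained and its parameters are hypothetical; it assumes compaction_skip overrides even snowpack.maintenance_enabled = 'true', that property values compare case-insensitively, that an unset property means "not maintained" outside opt-out mode, and that any value other than true/false is a configuration error.

```python
from typing import AbstractSet, Mapping


def is_maintained(
    database: str,
    table_properties: Mapping[str, str],
    allowlisted_databases: AbstractSet[str],
    opt_out_mode: bool = True,
) -> bool:
    """Decide whether automated maintenance applies to one table."""
    # Hard exclude: assumed to win over every other setting.
    if table_properties.get("compaction_skip", "").strip().lower() == "true":
        return False
    # Only tables in allowlisted databases are ever in scope.
    if database not in allowlisted_databases:
        return False
    enabled = table_properties.get("snowpack.maintenance_enabled")
    if enabled is None:
        # Unset: maintained in opt-out mode (assumed not maintained otherwise).
        return opt_out_mode
    value = enabled.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    # Assumption: anything other than true/false is treated as a config error.
    raise ValueError(f"invalid snowpack.maintenance_enabled value: {enabled!r}")


allow = {"my_database"}
print(is_maintained("my_database", {}, allow))                                        # → True
print(is_maintained("my_database", {"snowpack.maintenance_enabled": "false"}, allow))  # → False
print(is_maintained("other_db", {}, allow))                                           # → False
```

Checking the hard exclude and the allowlist before the three-state property mirrors the rule that a table is maintained only if its database is allowlisted and nothing has opted it out.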
Job lifecycle
All maintenance operations in Snowpack are asynchronous. A job moves through these states:
- Pending — The job has been accepted and queued for execution.
- Running — Spark is actively executing the maintenance actions.
- Completed — All requested actions finished successfully.
- Failed — One or more actions encountered an error. Partial results may exist.
- Cancelled — The job was cancelled before completion.
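The five states can be modeled as a small state machine. The transitions below are an assumption inferred from the state descriptions (for example, that a Pending job can be cancelled but cannot fail before it runs), and the lowercase enum values are hypothetical; the actual API's status strings and rules may differ.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumed terminal states: no further transitions once reached.
TERMINAL_STATES = frozenset({JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED})

# Assumed transitions, inferred from the state descriptions above.
ALLOWED_TRANSITIONS = {
    JobState.PENDING: frozenset({JobState.RUNNING, JobState.CANCELLED}),
    JobState.RUNNING: frozenset({JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED}),
    JobState.COMPLETED: frozenset(),
    JobState.FAILED: frozenset(),
    JobState.CANCELLED: frozenset(),
}


def advance(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


state = advance(JobState.PENDING, JobState.RUNNING)
state = advance(state, JobState.COMPLETED)
print(state, state in TERMINAL_STATES)  # → JobState.COMPLETED True
```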
The typical flow: submit a maintenance request via POST and receive a
202 Accepted response with a job ID, then poll GET /jobs/{id} until the job
reaches Completed, Failed, or Cancelled. The orchestrator CronJob follows this
same lifecycle automatically for all eligible tables.
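The polling half of that flow might look like the sketch below, using only the standard library. It assumes the GET /jobs/{id} response is JSON with a "status" field holding lowercase state names; that field name, the base URL, and the helper names are hypothetical. The POST endpoint is not shown because its path is not specified here. The fetcher is injected so the loop can be exercised offline.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

# Assumed lowercase status strings for the three terminal states.
TERMINAL = {"completed", "failed", "cancelled"}


def fetch_job(base_url: str, job_id: str) -> Dict:
    """GET {base_url}/jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], Dict],
    poll_interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal state, then return its last payload."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch(job_id)
        status = str(job.get("status", "")).lower()  # "status" field is assumed
        if status in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout_s}s")
        sleep(poll_interval_s)


# Offline demo with a fake fetcher; against a real deployment you would pass
# lambda jid: fetch_job(BASE_URL, jid) with your own BASE_URL.
responses = iter([{"status": "pending"}, {"status": "running"}, {"status": "completed"}])
final = wait_for_job("job-123", lambda _jid: next(responses), sleep=lambda _s: None)
print(final["status"])  # → completed
```

Because a Failed job may still have produced partial results, callers should inspect the returned payload's status rather than treating any return as success.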