Key Concepts

Maintenance actions

Snowpack supports five maintenance actions, always executed in this order:

  1. rewrite_data_files — Compacts small data files into fewer, optimally sized files. This is the most impactful action for query performance.
  2. rewrite_position_delete_files — Merges position-delete files back into their corresponding data files, eliminating the read-time overhead of applying deletes.
  3. expire_snapshots — Removes snapshots older than the retention threshold, freeing the metadata layer from tracking stale table states.
  4. rewrite_manifests — Consolidates manifest files to reduce planning time for queries that scan large tables.
  5. remove_orphan_files — Deletes data files on storage that are no longer referenced by any active snapshot.

The ordering matters: compaction runs before cleanup, and expire_snapshots runs before remove_orphan_files because orphan file removal relies on snapshots having already been expired. Running remove_orphan_files first would leave behind files that are still referenced by soon-to-expire snapshots.
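
As an illustration of the fixed ordering, the sketch below sorts a requested subset of actions into the sequence listed above. It is not Snowpack's own code: the `order_actions` helper and the `ACTION_ORDER` constant are hypothetical names, and it assumes a caller may request only some of the five actions.

```python
# Canonical execution order, copied from the list of maintenance actions.
ACTION_ORDER = (
    "rewrite_data_files",
    "rewrite_position_delete_files",
    "expire_snapshots",
    "rewrite_manifests",
    "remove_orphan_files",
)


def order_actions(requested):
    """Return the requested actions arranged in the canonical order.

    Hypothetical helper: rejects unknown names and drops duplicates.
    """
    wanted = set(requested)
    unknown = wanted - set(ACTION_ORDER)
    if unknown:
        raise ValueError(f"unknown maintenance actions: {sorted(unknown)}")
    return [action for action in ACTION_ORDER if action in wanted]


if __name__ == "__main__":
    # Orphan removal is always placed after snapshot expiry,
    # whatever order the caller listed them in.
    print(order_actions(["remove_orphan_files", "expire_snapshots"]))
```

Keeping the order in one tuple means a caller cannot accidentally schedule remove_orphan_files ahead of expire_snapshots.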

Health analysis

Snowpack evaluates table health by inspecting Iceberg metadata for four key metrics:

  • Small file count — Number of data files below the target file size.
  • Snapshot count — Total snapshots retained by the table.
  • Manifest count — Number of manifest files in the current metadata.
  • Position delete files — Count of outstanding position-delete files.

Each metric is compared against a configurable threshold. When any metric exceeds its threshold, the table is flagged as needs_maintenance. Health data is available in two modes:

  • Live — Fetched directly from the PyIceberg catalog (Glue/S3). Accurate but takes a few seconds per table.
  • Cached — Served from Postgres. Returns in roughly 1 ms, refreshed periodically by the health-sync process.
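
The threshold comparison described above can be sketched as follows. This is a minimal illustration rather than Snowpack's implementation: the `HealthMetrics` class, the field names, and the numbers in `EXAMPLE_THRESHOLDS` are all assumptions chosen for the example, and how the metrics are fetched (live or cached) is left out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HealthMetrics:
    """The four metrics from the list above (field names are hypothetical)."""
    small_file_count: int
    snapshot_count: int
    manifest_count: int
    position_delete_files: int


# Illustrative thresholds only; real values are configurable and not
# stated in this document.
EXAMPLE_THRESHOLDS = HealthMetrics(
    small_file_count=100,
    snapshot_count=50,
    manifest_count=20,
    position_delete_files=10,
)


def exceeded_metrics(metrics, thresholds=EXAMPLE_THRESHOLDS):
    """Return the names of metrics strictly above their threshold."""
    return [
        f.name
        for f in fields(metrics)
        if getattr(metrics, f.name) > getattr(thresholds, f.name)
    ]


def needs_maintenance(metrics, thresholds=EXAMPLE_THRESHOLDS):
    """A table is flagged when any single metric exceeds its threshold."""
    return bool(exceeded_metrics(metrics, thresholds))
```

Returning the list of exceeded metrics, not just a boolean, makes it easy to report why a table was flagged.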

Opt-out model

Snowpack maintains tables by default. Any Iceberg table in a database that the platform team has added to the orchestrator allowlist is eligible for automated maintenance — no per-table action is required to enroll.

A table is maintained unless one of these opts it out:

  1. Explicit opt-out. A data engineer sets the table property to false:

    ALTER TABLE lakehouse_dev.my_database.my_table
    SET TBLPROPERTIES ('snowpack.maintenance_enabled' = 'false');

    The snowpack.maintenance_enabled property is three-state: true (always maintained), false (never maintained), and unset (maintained in opt-out mode).

  2. Hard exclude. Setting compaction_skip = 'true' removes the table from all Snowpack maintenance regardless of mode — use it for tables undergoing migration or manual intervention.

The platform team still controls which databases are in scope via the databases allowlist in the Helm values. A table is maintained only if its database is allowlisted and it has not opted out.
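
The eligibility rule can be summarized in a few lines. The sketch below is an assumption-laden illustration, not Snowpack's code: `is_maintained` is a hypothetical function, table properties are modeled as a plain dict of strings, and treating property values case-insensitively is a guess.

```python
def is_maintained(database, table_properties, allowlist):
    """Decide whether a table is maintained under the opt-out model.

    database          -- name of the table's database
    table_properties  -- dict of string table properties
    allowlist         -- collection of allowlisted database names
    """
    # The database allowlist always gates eligibility.
    if database not in allowlist:
        return False

    # Hard exclude: removes the table from all maintenance.
    if table_properties.get("compaction_skip", "").strip().lower() == "true":
        return False

    # Explicit opt-out. An unset property or 'true' keeps the table enrolled.
    enabled = table_properties.get("snowpack.maintenance_enabled", "")
    if enabled.strip().lower() == "false":
        return False

    return True


if __name__ == "__main__":
    allow = {"my_database"}
    print(is_maintained("my_database", {}, allow))
    print(is_maintained(
        "my_database", {"snowpack.maintenance_enabled": "false"}, allow
    ))
```

Checking the allowlist first mirrors the rule that a table property can opt a table out but cannot opt a non-allowlisted database in.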

Job lifecycle

All maintenance operations in Snowpack are asynchronous. A job moves through these states:

  1. Pending — The job has been accepted and queued for execution.
  2. Running — Spark is actively executing the maintenance actions.
  3. Completed — All requested actions finished successfully.
  4. Failed — One or more actions encountered an error. Partial results may exist.
  5. Cancelled — The job was cancelled before completion.

The typical flow: submit a maintenance request via POST and receive a 202 Accepted response with a job ID. Then poll GET /jobs/{id} to track progress. The orchestrator CronJob follows this same lifecycle automatically for all eligible tables.
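
A client-side polling loop for this lifecycle might look like the sketch below. It is only an outline under stated assumptions: `fetch_state` is a hypothetical callable that wraps GET /jobs/{id} and returns a `JobState`, the lowercase state strings are guessed wire values, and the interval and timeout defaults are arbitrary.

```python
import time
from enum import Enum


class JobState(Enum):
    # State names come from the lifecycle above; the string values
    # are assumptions about the wire format.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# A job stops changing once it reaches one of these states.
TERMINAL_STATES = frozenset(
    {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED}
)


def wait_for_job(fetch_state, job_id, interval_s=2.0, timeout_s=600.0,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return that state.

    fetch_state -- hypothetical callable: job_id -> JobState
    sleep       -- injectable so tests can skip real waiting
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state(job_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {state.value} after {timeout_s}s"
            )
        sleep(interval_s)
```

Because a Failed job may leave partial results, a caller would typically inspect the returned state instead of assuming success.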