Troubleshooting
This page covers the most common failure modes, their symptoms, and how to resolve them.
Workers not scaling
Symptom: The job_queue table has unclaimed rows, but no worker pods are
spawning. kubectl get pods -n snowpack shows no worker pods.
Cause: The KEDA ScaledJob is not triggering. This usually means the postgresql trigger cannot connect to the database, or the trigger query is not returning the expected result.
Diagnosis:

- Check the ScaledJob status:

  ```shell
  kubectl get scaledjob -n snowpack
  ```

  Look at the `READY` column. If it shows `False`, KEDA cannot evaluate the trigger.

- Check the KEDA operator logs for connection errors:

  ```shell
  kubectl logs -n keda -l app=keda-operator --tail=50
  ```

- Verify the `job_queue` has unclaimed work:

  ```sql
  SELECT COUNT(*) FROM job_queue
  WHERE claimed_at IS NULL AND visible_at <= NOW();
  ```

- Verify `activationTargetQueryValue` is set in the ScaledJob trigger metadata. KEDA 2.12+ requires `activationTargetQueryValue` (not `activationLagCount`) to activate from zero replicas. Without it, KEDA will not scale up from zero even when there is work in the queue.
Resolution: Fix the KEDA trigger authentication (check the Secret referenced
by `TriggerAuthentication`), verify Postgres connectivity from the KEDA
namespace, and confirm that `activationTargetQueryValue` is present.
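For reference, a minimal `postgresql` trigger that can activate from zero might look like the sketch below. The query, threshold values, and the `authenticationRef` name are illustrative assumptions, not the deployed configuration:

```yaml
triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM job_queue WHERE claimed_at IS NULL AND visible_at <= NOW();"
      targetQueryValue: "1"
      # Required on KEDA 2.12+ to scale from zero: any query result
      # above this value activates the ScaledJob.
      activationTargetQueryValue: "0"
    authenticationRef:
      name: snowpack-postgres-trigger-auth   # hypothetical TriggerAuthentication name
```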
Jobs stuck in pending
Symptom: Jobs show status: pending for longer than expected. Workers
may or may not be running.
Cause: Several causes are possible:

- The KEDA polling interval is 30 seconds, so there is an inherent delay between a job being queued and a worker pod starting.
- The `visible_at` timestamp on the queue row may be in the future (retry backoff).
- A stale claim from a crashed worker may be blocking the row. The `reclaim_stale` sweeper releases claims older than 30 minutes, but this requires the API process to be running.
Diagnosis:

- Check queue row timestamps:

  ```sql
  SELECT job_id, visible_at, claimed_at
  FROM job_queue
  WHERE claimed_at IS NULL
  ORDER BY visible_at;
  ```

- Check for stale claims (claimed but not progressing):

  ```sql
  SELECT job_id, claimed_at
  FROM job_queue
  WHERE claimed_at IS NOT NULL
    AND claimed_at < NOW() - INTERVAL '30 minutes';
  ```

- Verify the API is running (the `reclaim_stale` sweeper runs inside the API process):

  ```shell
  kubectl get pods -n snowpack -l app.kubernetes.io/component=api
  ```
Resolution: If stale claims exist and the API is running, the sweeper will
reclaim them within 30 seconds. If the API is not running, fix the API first —
the sweeper cannot run without it. For jobs stuck behind a future visible_at,
wait for the backoff window to expire.
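The sweeper behavior described above can be modeled in Python as an in-memory sketch. This is a hypothetical illustration of the rule (release claims older than 30 minutes), not the actual `reclaim_stale` implementation, which operates on the `job_queue` table:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)

def reclaim_stale(rows, now=None):
    """Release claims older than STALE_AFTER so the rows become
    claimable again. `rows` is a list of dicts standing in for
    job_queue rows (an in-memory model, not the real sweeper)."""
    now = now or datetime.now(timezone.utc)
    reclaimed = []
    for row in rows:
        claimed = row.get("claimed_at")
        if claimed is not None and now - claimed > STALE_AFTER:
            row["claimed_at"] = None        # clear the claim
            reclaimed.append(row["job_id"])
    return reclaimed
```

A row claimed 45 minutes ago is released, while a 5-minute-old claim is left alone.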
Health sync OOM
Symptom: The health-sync CronJob pod is OOMKilled. kubectl describe pod
shows the container exceeded its memory limit.
Cause: PyIceberg loads table metadata into memory. With high concurrency and many large tables, the combined memory footprint exceeds the pod’s limit. This was tracked in DL-278.
Diagnosis:

```shell
kubectl get pods -n snowpack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
kubectl describe pod <oom-killed-pod> -n snowpack
```

Look for `Last State: Terminated` with `Reason: OOMKilled` and check the
memory limit in the container spec.
Resolution: Reduce the SNOWPACK_HEALTH_SYNC_CONCURRENCY setting. For the
dev environment the concurrency is set to 2 (down from the default 10). In
the Helm values:
```yaml
healthSync:
  concurrency: 2
  resources:
    limits:
      memory: 768Mi
```

If the problem persists even at concurrency 2, increase the memory limit rather than lowering concurrency further: at concurrency 1 the sync window may exceed the 15-minute CronJob interval.
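Why concurrency is the lever: peak memory grows roughly linearly with the number of tables synced at once, since PyIceberg holds each in-flight table's metadata in memory. A back-of-envelope sketch, with all numbers hypothetical:

```python
def peak_memory_mib(concurrency, per_table_mib, base_mib=100):
    """Linear sizing model: a fixed process baseline plus one
    metadata footprint per concurrent table sync. The baseline
    and per-table figures here are illustrative, not measured."""
    return base_mib + concurrency * per_table_mib
```

With a hypothetical ~80 MiB of metadata per table, the default concurrency of 10 lands around 900 MiB, well past a 768 Mi limit, while concurrency 2 stays near 260 MiB.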
Table not appearing in orchestrator
Symptom: A table has snowpack.maintenance_enabled = true set as a table
property, but the orchestrator never submits maintenance for it.
Cause: The orchestrator only processes tables that satisfy all three conditions:

- The table appears in the API table cache (`GET /tables`).
- The table has `snowpack.maintenance_enabled = true` as a table property.
- The table’s database is listed in `orchestrator.includeDatabases`.

If any condition is not met, the orchestrator skips the table silently.
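The three conditions can be sketched as a single predicate. This is an illustrative model; the names and the empty-allowlist fallback are assumptions, not the orchestrator's actual code:

```python
def should_process(table, cache_keys, include_databases):
    """Hypothetical model of the orchestrator's eligibility check:
    all three conditions must hold or the table is skipped."""
    key = f"{table['database']}.{table['name']}"
    in_cache = key in cache_keys
    opted_in = table.get("properties", {}).get(
        "snowpack.maintenance_enabled") == "true"
    # Assumption: an empty allowlist means no database filtering.
    in_allowlist = (not include_databases
                    or table["database"] in include_databases)
    return in_cache and opted_in and in_allowlist
```

A table missing from the cache, lacking the property, or living in a non-allowlisted database fails the check silently, which matches the symptom above.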
Diagnosis:

- Verify the table appears in the table cache:

  ```shell
  curl -s "https://<snowpack-host>/tables?database=<database>&maintenance_enabled=true" | jq .
  ```

  If the table is not in the response, the API `TableCacheSyncWorker` has not discovered it yet or the catalog cannot list it. (Note the quotes around the URL: an unquoted `&` would background the command in most shells.)

- Check the table’s `snowpack.maintenance_enabled` property from Spark/Kyuubi or the Iceberg catalog.

- Check `orchestrator.includeDatabases`:

  ```shell
  helm get values snowpack -n snowpack | grep -A5 orchestrator
  ```

  The table’s database must be in this list if the allowlist is set.

- If cached health is expected, check `healthSync.databases`:

  ```shell
  helm get values snowpack -n snowpack | grep -A5 healthSync
  ```

  If a database is absent from this list, the orchestrator can still fall back to live health checks, but the first run may be slower and cached health lookups may return 404 until a live check persists a snapshot.
Resolution: Add the database to orchestrator.includeDatabases in the Helm
values, and add it to healthSync.databases when cached health should be
precomputed. Then run terraform apply.
409 Conflict on maintenance submit
Symptom: POST /tables/{db}/{table}/maintenance returns 409 Conflict
with the message “Maintenance already in progress for {db}.{table}”.
Cause: Another job currently holds the lock for this table. Snowpack uses
a table_locks table to ensure only one maintenance job runs per table at a
time. The lock is acquired when a job is submitted and released when the job
completes, fails, or is cancelled.
Diagnosis:

- Check who holds the lock:

  ```sql
  SELECT table_key, holder, acquired_at, expires_at
  FROM table_locks
  WHERE table_key = '<database>.<table>';
  ```

- Check the status of the holding job:

  ```shell
  curl -s https://<snowpack-host>/jobs/<holder-job-id> | jq .status
  ```
Resolution: If the holding job is still running, wait for it to complete or
cancel it with POST /jobs/{id}/cancel. If the lock has expired, the next
submission for the same table can take it over atomically. If a terminal job
still appears to hold a non-expired lock, inspect the worker/API logs before
manually deleting the row.
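The takeover rule can be modeled in memory as follows. This is a sketch of the assumed semantics; the real lock lives in the `table_locks` table and is taken over in a single SQL statement, and the 30-minute TTL here is illustrative:

```python
from datetime import datetime, timedelta, timezone

def try_acquire(locks, table_key, holder, now=None,
                ttl=timedelta(minutes=30)):
    """In-memory sketch of the assumed lock semantics: a new
    submission gets the lock only when no unexpired lock exists."""
    now = now or datetime.now(timezone.utc)
    current = locks.get(table_key)
    if current is not None and current["expires_at"] > now:
        return False  # maps to 409 Conflict on the API side
    locks[table_key] = {          # take over (or create) the lock
        "holder": holder,
        "acquired_at": now,
        "expires_at": now + ttl,
    }
    return True
```

A second submission inside the TTL is rejected; one arriving after expiry takes the lock over.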
Stale table cache
Symptom: The API returns outdated table lists, or newly opted-in tables are not appearing in API responses.
Cause: The table cache is populated by the API’s TableCacheSyncWorker,
which refreshes on SNOWPACK_TABLE_CACHE_REFRESH_SECONDS (300 seconds by
default). If the worker cannot list the catalog or cannot write Postgres, the
cache may become stale and /readyz will fail once the staleness window is
exceeded.
Diagnosis:

Check the cache status endpoint for the last sync timestamp:

```shell
curl -s https://<snowpack-host>/tables/cache-status | jq .
```

The response includes:

```json
{
  "last_synced": "2026-04-25T12:15:00+00:00",
  "table_count": 142
}
```

If `last_synced` is older than the configured staleness window, the API
table-cache sync is failing or blocked.
Resolution:

- Check API readiness and logs:

  ```shell
  kubectl get pods -n snowpack -l app.kubernetes.io/component=api
  kubectl logs -n snowpack -l app.kubernetes.io/component=api --tail=100
  ```

- Check for `table_cache_sync_failed` events, catalog authentication failures, and Postgres connection errors.

- Confirm the configured cache refresh and staleness settings:

  ```shell
  helm get values snowpack -n snowpack | grep -E 'tableCacheRefreshSeconds|tableCacheStalenessSeconds'
  ```

- Common causes include Polaris/Glue API throttling, catalog credential failures, and Postgres connection failures. Fix the underlying issue and the next API sync cycle will repopulate the cache.
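The staleness condition can be sketched as a small predicate. This is illustrative; the real check runs inside the API's `/readyz` handler, and the function name and shape are assumptions:

```python
from datetime import datetime, timedelta, timezone

def cache_is_stale(last_synced_iso, staleness_seconds, now=None):
    """Model of the readiness condition: the cache counts as stale
    once last_synced is older than the staleness window."""
    now = now or datetime.now(timezone.utc)
    last_synced = datetime.fromisoformat(last_synced_iso)  # parses "+00:00" offsets
    return now - last_synced > timedelta(seconds=staleness_seconds)
```

With a 600-second window, a `last_synced` 15 minutes old trips the check while a 5-minute-old one does not.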