Observability
Snowpack exposes metrics via OpenTelemetry, emits structured logs via structlog, and surfaces operational data through a hybrid Grafana dashboard.
Metrics
The API mounts Prometheus-formatted metrics at /metrics on the same FastAPI
service port as the application (8000 in the container). Worker pods are
ephemeral KEDA jobs: they record job/action metrics while executing, and can push
OTLP metrics when OTEL_EXPORTER_OTLP_ENDPOINT is configured, but they do not
serve a long-lived scrape endpoint.
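As a minimal sketch of the "push only when configured" behaviour described above, a worker could gate OTLP export on the environment variable like this. The helper name `otlp_push_endpoint` and the collector URL in the usage lines are hypothetical; only the variable name OTEL_EXPORTER_OTLP_ENDPOINT comes from this document, and Snowpack's actual wiring may differ.

```python
import os
from typing import Mapping, Optional

OTLP_ENV_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def otlp_push_endpoint(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the OTLP endpoint to push to, or None when pushing is disabled.

    Hypothetical helper: an unset or blank variable means "do not push".
    """
    env = os.environ if env is None else env
    endpoint = env.get(OTLP_ENV_VAR, "").strip()
    return endpoint or None


# Usage: an ephemeral worker decides once at startup whether to export.
endpoint = otlp_push_endpoint({OTLP_ENV_VAR: "http://collector:4317"})  # hypothetical URL
if endpoint is None:
    print("OTLP push disabled; metrics are recorded in-process only")
else:
    print(f"pushing OTLP metrics to {endpoint}")
```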
Metric inventory
| Metric | Type | Labels | Description |
|---|---|---|---|
| snowpack.job.duration | Histogram | database, table_name, status | End-to-end job wall-clock time |
| snowpack.job.total | Counter | database, table_name, status | Total jobs by terminal status |
| snowpack.action.duration | Histogram | database, table_name, action, status | Per-action wall-clock time |
| snowpack.action.total | Counter | database, table_name, action, status | Total action executions |
| snowpack.queue.depth | Observable Gauge | | Number of unclaimed, visible jobs in the queue |
| snowpack.workers.active | Observable Gauge | | Count of distinct claimed_by values (active workers) |
| snowpack.tables.discovered | Gauge | | Number of tables in the table cache |
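To illustrate how a duration metric and its labels fit together, here is a stdlib-only sketch that times a job and records the result with the database, table_name, and status labels from the table. It is not Snowpack's code: the `recorded` list stands in for an OpenTelemetry histogram, and the status values "succeeded" and "failed" are assumptions.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple

# Stand-in for a histogram such as snowpack.job.duration:
# each entry is (duration_seconds, labels).
recorded: List[Tuple[float, Dict[str, str]]] = []


@contextmanager
def timed_job(database: str, table_name: str) -> Iterator[Dict[str, str]]:
    """Time the wrapped block and record one sample with a terminal status."""
    labels = {"database": database, "table_name": table_name, "status": "succeeded"}
    start = time.perf_counter()
    try:
        yield labels
    except Exception:
        labels["status"] = "failed"  # assumed status value
        raise
    finally:
        # Always record exactly one sample, whether the job succeeded or failed.
        recorded.append((time.perf_counter() - start, dict(labels)))


# Usage: the database and table names here are placeholders.
with timed_job("analytics", "events"):
    time.sleep(0.01)  # the job's work would run here
print(recorded[-1])
```

Recording in the `finally` branch means a failed job still contributes a duration sample, which keeps counts and durations consistent per status.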
Observable gauges
snowpack.queue.depth and snowpack.workers.active are observable gauges: they
re-query Postgres on every scrape rather than maintaining in-memory counters.
Because no replica holds its own count, state cannot drift across API
replicas; every scrape returns the current value from the database, regardless
of which replica serves the request.
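The re-query-on-read idea can be sketched as follows. This is an illustration under stated assumptions, not the real implementation: sqlite3 stands in for Postgres so the example runs anywhere, and the `jobs` table with its `claimed_by` and `visible_at` columns is a hypothetical schema. In the real service, functions like these would presumably be invoked from the observable-gauge callbacks on each scrape.

```python
import sqlite3


def queue_depth(conn: sqlite3.Connection, now: float) -> int:
    """Count unclaimed jobs that are already visible at time `now`."""
    row = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE claimed_by IS NULL AND visible_at <= ?",
        (now,),
    ).fetchone()
    return row[0]


def active_workers(conn: sqlite3.Connection) -> int:
    """Count distinct workers that currently hold a claim."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT claimed_by) FROM jobs WHERE claimed_by IS NOT NULL"
    ).fetchone()
    return row[0]


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed_by TEXT, visible_at REAL)"
    )
    conn.executemany(
        "INSERT INTO jobs (claimed_by, visible_at) VALUES (?, ?)",
        [(None, 0.0), (None, 0.0), ("worker-a", 0.0), (None, 100.0)],
    )
    return conn


# Usage: each call reads the database, so a change is visible on the next read.
conn = make_demo_db()
print(queue_depth(conn, now=10.0), active_workers(conn))
conn.execute("UPDATE jobs SET claimed_by = 'worker-b' WHERE id = 1")
print(queue_depth(conn, now=10.0), active_workers(conn))
```

Since the functions hold no state of their own, any number of processes sharing the same database would report the same values.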
The health-sync path can also push Iceberg table health gauges to Mimir through
OTLP when SNOWPACK_MIMIR_ENDPOINT is set. Those gauges use the
iceberg.table.* namespace and are separate from the API /metrics endpoint.
Structured logging
All log output uses structlog with JSON formatting. The system emits the following key structured events:
| Event | Emitted by | Description |
|---|---|---|
| job_started | Worker | Job claim succeeded, execution beginning |
| job_completed | Worker | All actions finished successfully |
| job_failed | Worker | Job reached terminal failure |
| job_crashed | Worker | Unhandled exception during execution |
| table_cache_synced | API / TableCacheSyncWorker | Table cache refreshed from catalog |
| executor_started | Worker | Spark query engine connected |
| health_sync_started | Health Sync | Health sync cycle beginning |
| health_sync_discovered | Health Sync | Tables discovered from catalog |
| health_sync_collected | Health Sync | Health data collected for tables |
| health_sync_pg_written | Health Sync | Health snapshots persisted to Postgres |
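Because each log line is a JSON object, the events above are easy to tally with a few lines of code. The sketch below assumes the event name is stored under the `event` key, which is structlog's default; the `job_id` and `error` fields in the sample lines are made up for illustration and are not taken from Snowpack's actual output.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str]) -> Counter:
    """Tally JSON log lines by their "event" field, skipping non-JSON lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output in the stream
        if isinstance(record, dict) and "event" in record:
            counts[record["event"]] += 1
    return counts


# Usage with hypothetical sample lines.
sample_lines = [
    '{"event": "job_started", "job_id": "j-1"}',
    '{"event": "job_completed", "job_id": "j-1"}',
    '{"event": "job_started", "job_id": "j-2"}',
    '{"event": "job_failed", "job_id": "j-2", "error": "timeout"}',
    "plain text line that is not JSON",
]
for event, n in sorted(count_events(sample_lines).items()):
    print(event, n)
```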
Grafana dashboard
Snowpack uses a hybrid Postgres + Athena Grafana dashboard:
- Postgres panels show live operational data — active jobs, queue depth, recent failures, lock status.
- Athena panels query historical job and action data for longer-range trend analysis (e.g., compaction duration percentiles over weeks, action success rates by table).
This split keeps the live dashboard responsive (Postgres queries are fast for small working sets) while supporting deep historical analysis without burdening the operational database.