Observability

Snowpack exposes metrics via OpenTelemetry, emits structured logs via structlog, and surfaces operational data through a hybrid Grafana dashboard backed by Postgres and Athena.

Metrics

The API serves Prometheus-formatted metrics at `/metrics` on the same port as the FastAPI application (8000 in the container). Worker pods are ephemeral KEDA jobs: they record job and action metrics while executing and can push them over OTLP when `OTEL_EXPORTER_OTLP_ENDPOINT` is configured, but they do not serve a long-lived scrape endpoint.
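To illustrate what a scraper sees at `/metrics`, here is a minimal sketch that parses Prometheus text exposition with only the standard library. The host, the sample payload, and the exported name `snowpack_queue_depth` are assumptions: Prometheus exporters for OpenTelemetry typically replace dots with underscores and may append unit suffixes, so check the real output of your deployment before relying on these names.

```python
import re
import urllib.request
from typing import Dict, List, Tuple

# One sample line: metric name, optional {labels}, then a numeric value.
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> List[Sample]:
    """Parse Prometheus text format into (name, labels, value) tuples.

    Simplified on purpose: comment lines are skipped, timestamps are
    ignored, and escaped characters in label values are not unescaped.
    """
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url: str = "http://localhost:8000/metrics") -> List[Sample]:
    """Fetch and parse the endpoint. The host is an assumption."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_exposition(response.read().decode("utf-8"))


# Hypothetical payload for illustration, not captured from a real deployment.
EXAMPLE = """\
# TYPE snowpack_queue_depth gauge
snowpack_queue_depth 3
snowpack_job_total{database="analytics",table_name="events",status="completed"} 12
"""

samples = parse_exposition(EXAMPLE)
```

Because worker pods do not serve a scrape endpoint, a scraper like this only ever talks to the API. Worker metrics reach the backend through the OTLP push path instead.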

Metric inventory

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `snowpack.job.duration` | Histogram | `database`, `table_name`, `status` | End-to-end job wall-clock time |
| `snowpack.job.total` | Counter | `database`, `table_name`, `status` | Total jobs by terminal status |
| `snowpack.action.duration` | Histogram | `database`, `table_name`, `action`, `status` | Per-action wall-clock time |
| `snowpack.action.total` | Counter | `database`, `table_name`, `action`, `status` | Total action executions |
| `snowpack.queue.depth` | Observable Gauge | (none) | Number of unclaimed, visible jobs in the queue |
| `snowpack.workers.active` | Observable Gauge | (none) | Count of distinct `claimed_by` values (active workers) |
| `snowpack.tables.discovered` | Gauge | (none) | Number of tables in the table cache |

Observable gauges

`snowpack.queue.depth` and `snowpack.workers.active` are observable gauges: instead of maintaining in-memory counters, they re-query Postgres on every scrape. This eliminates state drift across API replicas, because every scrape returns the current value from the database regardless of which replica serves the request.
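The query-on-read pattern can be sketched as follows. This is an illustration under stated assumptions, not Snowpack's implementation: `sqlite3` stands in for Postgres so the example runs anywhere, and the `jobs` schema, the `visible_at` column, and the `GaugeRegistry` class are hypothetical (only `claimed_by` appears in the metric inventory). In the real service the two query functions would be registered as OpenTelemetry observable-gauge callbacks.

```python
import sqlite3
import time
from typing import Callable, Dict, Optional


def queue_depth(conn: sqlite3.Connection, now: Optional[float] = None) -> int:
    """Count jobs that are unclaimed and already visible."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE claimed_by IS NULL AND visible_at <= ?",
        (now,),
    ).fetchone()
    return row[0]


def active_workers(conn: sqlite3.Connection) -> int:
    """Count distinct workers that currently hold a claim."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT claimed_by) FROM jobs WHERE claimed_by IS NOT NULL"
    ).fetchone()
    return row[0]


class GaugeRegistry:
    """Stores callbacks rather than values; collect() re-runs every query."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, Callable[[], int]] = {}

    def register(self, name: str, callback: Callable[[], int]) -> None:
        self._callbacks[name] = callback

    def collect(self) -> Dict[str, int]:
        return {name: callback() for name, callback in self._callbacks.items()}


# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed_by TEXT, visible_at REAL)"
)
conn.executemany(
    "INSERT INTO jobs (claimed_by, visible_at) VALUES (?, ?)",
    [(None, 0.0), (None, 0.0), ("worker-a", 0.0), ("worker-a", 0.0), ("worker-b", 0.0)],
)

registry = GaugeRegistry()
registry.register("snowpack.queue.depth", lambda: queue_depth(conn))
registry.register("snowpack.workers.active", lambda: active_workers(conn))

first = registry.collect()
# A new worker claims job 1 between scrapes; no in-memory counter is updated.
conn.execute("UPDATE jobs SET claimed_by = 'worker-c' WHERE id = 1")
second = registry.collect()
```

The second `collect()` reflects the new claim without any process having incremented or decremented anything, which is why replicas cannot disagree. The trade-off is one small query per gauge per scrape.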

The health-sync path can also push Iceberg table health gauges to Mimir through OTLP when `SNOWPACK_MIMIR_ENDPOINT` is set. Those gauges use the `iceberg.table.*` namespace and are separate from the API `/metrics` endpoint.

Structured logging

All log output uses structlog with JSON formatting. Key structured events emitted across the system:

| Event | Emitted by | Description |
| --- | --- | --- |
| `job_started` | Worker | Job claim succeeded, execution beginning |
| `job_completed` | Worker | All actions finished successfully |
| `job_failed` | Worker | Job reached terminal failure |
| `job_crashed` | Worker | Unhandled exception during execution |
| `table_cache_synced` | API / TableCacheSyncWorker | Table cache refreshed from catalog |
| `executor_started` | Worker | Spark query engine connected |
| `health_sync_started` | Health Sync | Health sync cycle beginning |
| `health_sync_discovered` | Health Sync | Tables discovered from catalog |
| `health_sync_collected` | Health Sync | Health data collected for tables |
| `health_sync_pg_written` | Health Sync | Health snapshots persisted to Postgres |
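Because every record is a JSON object, the events above can be filtered mechanically. The sketch below pulls failure events out of a stream of log lines using only the standard library. It assumes the event name is stored under structlog's default `event` key; the `job_id` and `level` fields and the sample lines are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Event names taken from the table above.
FAILURE_EVENTS = {"job_failed", "job_crashed"}


def iter_events(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield parsed JSON log records, skipping blank or non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def failures(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Return only the records whose event marks a failed or crashed job."""
    return [r for r in iter_events(lines) if r.get("event") in FAILURE_EVENTS]


# Hypothetical log lines; only the event names come from the documentation.
EXAMPLE_LINES = [
    '{"event": "job_started", "job_id": "j-1", "level": "info"}',
    '{"event": "job_failed", "job_id": "j-1", "level": "error"}',
    "not json",
    '{"event": "job_crashed", "job_id": "j-2", "level": "error"}',
]

failed = failures(EXAMPLE_LINES)
```

The same function works on an open file or on `sys.stdin`, since both iterate line by line.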

Grafana dashboard

Snowpack uses a hybrid Postgres + Athena Grafana dashboard:

- Postgres panels show live operational data: active jobs, queue depth, recent failures, and lock status.
- Athena panels query historical job and action data for longer-range trend analysis (e.g., compaction duration percentiles over weeks, action success rates by table).

This split keeps the live dashboard responsive (Postgres queries are fast for small working sets) while supporting deep historical analysis without burdening the operational database.