Observability

Snowpack exposes metrics via OpenTelemetry, emits structured logs via structlog, and surfaces operational data through a hybrid Grafana dashboard.

Metrics

The API mounts Prometheus-formatted metrics at /metrics on the same FastAPI service port as the application (8000 in the container). Worker pods are ephemeral KEDA jobs: they record job/action metrics while executing, and can push OTLP metrics when OTEL_EXPORTER_OTLP_ENDPOINT is configured, but they do not serve a long-lived scrape endpoint.
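As a rough sketch of the worker side, the snippet below enables OTLP push only when OTEL_EXPORTER_OTLP_ENDPOINT is set. The function name build_worker_meter_provider is hypothetical, and the choice of the gRPC exporter is an assumption; the source does not say how Snowpack wires this up. The OpenTelemetry imports are deferred so that a worker without an endpoint configured never needs the exporter package.

```python
import os
from typing import Mapping, Optional


def build_worker_meter_provider(environ: Mapping[str, str] = os.environ) -> Optional[object]:
    """Return an OTLP-exporting MeterProvider, or None when no endpoint is set.

    Hypothetical helper: the name and the gRPC exporter choice are assumptions.
    """
    endpoint = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        # No endpoint configured: the worker records metrics but pushes nothing.
        return None

    # Deferred imports: only needed when OTLP push is actually enabled.
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

    reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=endpoint))
    return MeterProvider(metric_readers=[reader])
```

Because worker pods are short-lived, a worker built this way would likely need to call the provider's shutdown() before exiting so that the final batch of metrics is exported rather than lost with the pod.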

Metric inventory

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| snowpack.job.duration | Histogram | database, table_name, status | End-to-end job wall-clock time |
| snowpack.job.total | Counter | database, table_name, status | Total jobs by terminal status |
| snowpack.action.duration | Histogram | database, table_name, action, status | Per-action wall-clock time |
| snowpack.action.total | Counter | database, table_name, action, status | Total action executions |
| snowpack.queue.depth | Observable Gauge | | Number of unclaimed, visible jobs in the queue |
| snowpack.workers.active | Observable Gauge | | Count of distinct claimed_by values (active workers) |
| snowpack.tables.discovered | Gauge | | Number of tables in the table cache |
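To illustrate how the job histogram and counter share one label set, here is a minimal sketch of a timing wrapper. The helper name track_job and the status values "completed" and "failed" are assumptions, not taken from Snowpack's code. The two instruments are expected to behave like OpenTelemetry's Histogram (record) and Counter (add), for example as returned by meter.create_histogram("snowpack.job.duration") and meter.create_counter("snowpack.job.total").

```python
import time
from contextlib import contextmanager


@contextmanager
def track_job(duration_hist, total_counter, database: str, table_name: str):
    """Time a job and record snowpack.job.duration / snowpack.job.total.

    Hypothetical helper: status values are illustrative assumptions.
    """
    start = time.perf_counter()
    status = "completed"
    try:
        yield
    except Exception:
        status = "failed"
        raise  # the caller still sees the original error
    finally:
        # Same attributes on both instruments so they can be joined in queries.
        attrs = {"database": database, "table_name": table_name, "status": status}
        duration_hist.record(time.perf_counter() - start, attributes=attrs)
        total_counter.add(1, attributes=attrs)
```

Recording in the finally block means a failed job is still counted and timed, which keeps the counter consistent with the "by terminal status" description in the table.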

Observable gauges

snowpack.queue.depth and snowpack.workers.active are observable gauges — they re-query Postgres on every scrape rather than maintaining in-memory counters. This design eliminates state drift across API replicas: every scrape returns the true current value from the database, regardless of which replica serves the request.
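The re-query-on-scrape pattern can be sketched as follows. Everything about the schema here is an assumption for illustration: the jobs table, the visible_at column, and the two SQL statements are hypothetical, and the readers accept any DB-API connection factory (sqlite3 is used only as a stand-in for type hints). The OpenTelemetry import is deferred so the readers can be tried without the package installed.

```python
import sqlite3
from typing import Callable

# Hypothetical queries against an assumed "jobs" table.
QUEUE_DEPTH_SQL = (
    "SELECT COUNT(*) FROM jobs "
    "WHERE claimed_by IS NULL AND visible_at <= CURRENT_TIMESTAMP"
)
ACTIVE_WORKERS_SQL = (
    "SELECT COUNT(DISTINCT claimed_by) FROM jobs WHERE claimed_by IS NOT NULL"
)


def make_reader(connect: Callable[[], sqlite3.Connection], sql: str) -> Callable[[], int]:
    """Build a function that runs `sql` on a fresh connection at every call."""
    def read() -> int:
        conn = connect()
        try:
            return conn.execute(sql).fetchone()[0]
        finally:
            conn.close()
    return read


def register_gauges(meter, read_depth: Callable[[], int], read_workers: Callable[[], int]) -> None:
    """Register both observable gauges on an OpenTelemetry meter."""
    from opentelemetry.metrics import Observation  # deferred import

    # Callbacks run at collection time, so no value is cached in process memory.
    def depth_cb(options):
        return [Observation(read_depth())]

    def workers_cb(options):
        return [Observation(read_workers())]

    meter.create_observable_gauge("snowpack.queue.depth", callbacks=[depth_cb])
    meter.create_observable_gauge("snowpack.workers.active", callbacks=[workers_cb])
```

Since each callback goes back to the database, two replicas scraped at the same moment report the same value, at the cost of one query per gauge per scrape.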

The health-sync path can also push Iceberg table health gauges to Mimir through OTLP when SNOWPACK_MIMIR_ENDPOINT is set. Those gauges use the iceberg.table.* namespace and are separate from the API /metrics endpoint.

Structured logging

All log output uses structlog with JSON formatting. Key structured events emitted across the system:

| Event | Emitted by | Description |
| --- | --- | --- |
| job_started | Worker | Job claim succeeded, execution beginning |
| job_completed | Worker | All actions finished successfully |
| job_failed | Worker | Job reached terminal failure |
| job_crashed | Worker | Unhandled exception during execution |
| table_cache_synced | API / TableCacheSyncWorker | Table cache refreshed from catalog |
| executor_started | Worker | Spark query engine connected |
| health_sync_started | Health Sync | Health sync cycle beginning |
| health_sync_discovered | Health Sync | Tables discovered from catalog |
| health_sync_collected | Health Sync | Health data collected for tables |
| health_sync_pg_written | Health Sync | Health snapshots persisted to Postgres |

Grafana dashboard

Snowpack uses a hybrid Postgres + Athena Grafana dashboard:

  • Postgres panels show live operational data — active jobs, queue depth, recent failures, lock status.
  • Athena panels query historical job and action data for longer-range trend analysis (e.g., compaction duration percentiles over weeks, action success rates by table).

This split keeps the live dashboard responsive (Postgres queries are fast for small working sets) while supporting deep historical analysis without burdening the operational database.