Observability & health checks

Every Stanza framework package with runtime state exposes a Stats() method with counters for its key operations. This recipe shows how to build a health check endpoint, wire Stats() into an admin dashboard and a Prometheus metrics endpoint, inject build metadata, and monitor your app in production.

Health check endpoint

A health endpoint answers one question: is the service healthy enough to receive traffic? Keep it simple — check database connectivity and report basic runtime metrics:

package health

import (
    "runtime"
    "time"

    "github.com/stanza-go/framework/pkg/http"
    "github.com/stanza-go/framework/pkg/sqlite"
)

var startTime = time.Now()

type BuildInfo struct {
    Version   string
    Commit    string
    BuildTime string
}

func Register(api *http.Group, db *sqlite.DB, bi BuildInfo) {
    ver := bi.Version
    if ver == "" {
        ver = "dev"
    }

    api.HandleFunc("GET /health", func(w http.ResponseWriter, r *http.Request) {
        dbOK := true
        var dbErr string
        row := db.QueryRow("SELECT 1")
        var one int
        if err := row.Scan(&one); err != nil {
            dbOK = false
            dbErr = err.Error()
        }

        var mem runtime.MemStats
        runtime.ReadMemStats(&mem)

        status := http.StatusOK
        if !dbOK {
            status = http.StatusServiceUnavailable
        }

        stats := db.Stats()
        http.WriteJSON(w, status, map[string]any{
            "status":     statusText(dbOK),
            "version":    ver,
            "commit":     bi.Commit,
            "uptime":     time.Since(startTime).Round(time.Second).String(),
            "go":         runtime.Version(),
            "goroutines": runtime.NumGoroutine(),
            "memory_mb":  mem.Alloc / 1024 / 1024,
            "database": map[string]any{
                "ok":          dbOK,
                "error":       dbErr,
                "total_reads": stats.TotalReads,
                "total_writes": stats.TotalWrites,
                "pool_size":   stats.ReadPoolSize,
                "pool_in_use": stats.ReadPoolInUse,
                "pool_waits":  stats.PoolWaits,
            },
        })
    })
}

func statusText(ok bool) string {
    if ok {
        return "ok"
    }
    return "degraded"
}

The endpoint returns 200 OK when healthy and 503 Service Unavailable when the database is unreachable. Container orchestrators (Railway, Cloud Run) use this to route traffic and restart unhealthy instances.

health.Register(api, db, health.BuildInfo{
    Version:   version,
    Commit:    commit,
    BuildTime: buildTime,
})

Test it:

curl -s http://localhost:23710/api/health | jq .

{
  "status": "ok",
  "version": "dev",
  "commit": "",
  "uptime": "2m30s",
  "go": "go1.26.1",
  "goroutines": 14,
  "memory_mb": 8,
  "database": {
    "ok": true,
    "error": "",
    "total_reads": 1247,
    "total_writes": 83,
    "pool_size": 4,
    "pool_in_use": 1,
    "pool_waits": 0
  }
}

What to check

The health endpoint should verify only critical dependencies — the database and nothing else. Don't check optional services (email, webhooks) here. A failing email provider shouldn't mark the entire service as unhealthy.

Build metadata

Inject version, commit hash, and build time at compile time via -ldflags:

// main.go — these are empty in development, set by the build.
var (
    version   string
    commit    string
    buildTime string
)

The Makefile sets them:

go build -ldflags="-s -w \
    -X main.version=$$(git describe --tags --always --dirty) \
    -X main.commit=$$(git rev-parse --short HEAD) \
    -X main.buildTime=$$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    -o bin/standalone .

This lets you identify exactly which code is running in production by hitting the health endpoint.
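As a minimal sketch of how the empty development values might be handled, the program below falls back to "dev" for the version and, for the commit, to the VCS revision that the Go toolchain (1.18+) embeds on its own. The helper names orDefault and vcsRevision are hypothetical and not part of Stanza; runtime/debug.ReadBuildInfo is standard library.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// Set via -ldflags "-X main.version=... -X main.commit=..."; empty in development.
var (
	version string
	commit  string
)

// orDefault returns v, or fallback when v is empty.
func orDefault(v, fallback string) string {
	if v == "" {
		return fallback
	}
	return v
}

// vcsRevision returns the commit recorded by the Go toolchain at build time.
// It returns "" when no VCS information was embedded in the binary.
func vcsRevision() string {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return ""
	}
	for _, s := range info.Settings {
		if s.Key == "vcs.revision" {
			return s.Value
		}
	}
	return ""
}

func main() {
	v := orDefault(version, "dev")
	c := orDefault(commit, orDefault(vcsRevision(), "unknown"))
	fmt.Printf("version=%s commit=%s\n", v, c)
}
```

The -ldflags values still take precedence; the fallback only matters for builds that skip the Makefile.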

Stats() reference

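Every framework package with runtime state exposes a Stats() method. The in-memory counters use sync/atomic, so reading them is safe from any goroutine and needs no locks or allocations. The queue is the exception: its Stats() runs a database query and can return an error.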

SQLite

stats := db.Stats()

Field	Type	Description
`ReadPoolSize`	`int`	Configured number of read connections
`ReadPoolAvailable`	`int`	Idle connections in pool
`ReadPoolInUse`	`int`	Connections currently checked out
`TotalReads`	`int64`	Total read queries executed
`TotalWrites`	`int64`	Total write operations executed
`PoolWaits`	`int64`	Times a read query waited for a free connection
`PoolWaitTime`	`time.Duration`	Cumulative time waiting for a free connection

What to watch: PoolWaits growing means your read pool is too small for your concurrency. Increase with sqlite.WithReadPoolSize(n).
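PoolWaits alone does not say how costly the waits are. Dividing PoolWaitTime by PoolWaits gives the average wait per blocked query. The sketch below uses a hypothetical poolStats struct that mirrors only the two fields it needs; it is not the framework's type.

```go
package main

import (
	"fmt"
	"time"
)

// poolStats is a stand-in that mirrors two fields of the SQLite stats table above.
type poolStats struct {
	PoolWaits    int64
	PoolWaitTime time.Duration
}

// avgPoolWait returns the mean time a read query spent waiting for a connection.
// It returns 0 when no query has waited yet, which avoids dividing by zero.
func avgPoolWait(s poolStats) time.Duration {
	if s.PoolWaits == 0 {
		return 0
	}
	return s.PoolWaitTime / time.Duration(s.PoolWaits)
}

func main() {
	s := poolStats{PoolWaits: 4, PoolWaitTime: 200 * time.Millisecond}
	fmt.Println(avgPoolWait(s)) // prints "50ms"
}
```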

HTTP

stats := metrics.Stats()

Field	Type	Description
`TotalRequests`	`int64`	Total requests processed
`ActiveRequests`	`int64`	Currently in-flight requests
`Status2xx`	`int64`	Successful responses
`Status3xx`	`int64`	Redirects
`Status4xx`	`int64`	Client errors
`Status5xx`	`int64`	Server errors
`BytesWritten`	`int64`	Total response bytes
`AvgDurationMs`	`float64`	Average request duration in milliseconds

What to watch: Status5xx climbing means handlers are returning server errors, so check the logs. ActiveRequests staying high means requests are backing up; look for slow queries or slow external calls.

Cache

stats := myCache.Stats()

Field	Type	Description
`Size`	`int`	Current entries (including expired not yet cleaned)
`MaxSize`	`int`	Configured maximum (0 = unlimited)
`Hits`	`int64`	Key found and not expired
`Misses`	`int64`	Key not found or expired
`Evictions`	`int64`	Involuntary removals (TTL expiry + LRU)

What to watch: Hit rate = Hits / (Hits + Misses). As a rule of thumb, a hit rate below 80% suggests your TTL is too short or your cache is too small. High Evictions while Size sits at MaxSize means MaxSize is too low.
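The hit-rate formula needs a guard for a cache that has not been queried yet. A minimal sketch, where hitRate is a hypothetical helper that takes the Hits and Misses values directly:

```go
package main

import "fmt"

// hitRate returns Hits / (Hits + Misses) as a fraction between 0 and 1.
// A cache with no lookups yet reports 0 instead of dividing by zero.
func hitRate(hits, misses int64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	fmt.Printf("hit rate: %.1f%%\n", hitRate(3, 1)*100) // prints "hit rate: 75.0%"
}
```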

Queue

stats, err := queue.Stats()

Field	Type	Description
`Pending`	`int`	Jobs waiting to be processed
`Running`	`int`	Jobs currently being processed
`Completed`	`int`	Successfully finished jobs
`Failed`	`int`	Jobs that errored (will retry)
`Dead`	`int`	Jobs that exhausted all retries
`Cancelled`	`int`	Manually cancelled jobs

What to watch: Pending growing faster than Completed means your workers can't keep up. Dead increasing means jobs are permanently failing — check error logs.

Cron

stats := scheduler.Stats()

Field	Type	Description
`Jobs`	`int`	Total registered jobs
`Completed`	`int64`	Successful executions
`Failed`	`int64`	Executions that returned error or panicked
`Skipped`	`int64`	Skipped because previous execution still running

What to watch: Skipped increasing means a cron job takes longer than its interval. Either increase the interval or optimize the job. Failed means a job is erroring — check logs.

Webhook

stats := webhookClient.Stats()

Field	Type	Description
`Sends`	`int64`	Total Send or SendWithRetry calls
`Successes`	`int64`	Deliveries that received 2xx response
`Failures`	`int64`	Non-2xx response after retries exhausted
`Retries`	`int64`	Total retry attempts
`Errors`	`int64`	Network errors (DNS, timeouts, context cancellation)

What to watch: Failures / Sends is your delivery failure rate. High Retries means endpoints are flaky. High Errors means network issues.

Auth

stats := auth.Stats()

Field	Type	Description
`Issued`	`int64`	Access tokens successfully created
`Accepted`	`int64`	Tokens that passed validation
`Rejected`	`int64`	Tokens that failed (expired, malformed, invalid signature)

What to watch: Rejected spiking could indicate token expiry issues (clock skew) or attack attempts (forged tokens).

Email

stats := emailClient.Stats()

Field	Type	Description
`Sent`	`int64`	Emails successfully delivered to API
`Errors`	`int64`	Failed send attempts (transport errors, non-2xx responses)

What to watch: Errors / (Sent + Errors) is your email failure rate. Non-zero Errors usually means an API key issue or provider outage.

Dashboard endpoint

The admin dashboard aggregates Stats() from all packages into a single JSON response. The key pattern: call Stats() live for cheap in-memory counters, cache expensive database queries.

func Register(admin *http.Group, db *sqlite.DB, q *queue.Queue,
    s *cron.Scheduler, m *http.Metrics, wh *webhooks.Dispatcher,
    a *auth.Auth, ec *email.Client) {

    // Cache expensive DB queries (table counts, user counts).
    statsCache := cache.New[*dbStats](
        cache.WithTTL[*dbStats](30 * time.Second),
        cache.WithMaxSize[*dbStats](1),
    )

    admin.HandleFunc("GET /dashboard", statsHandler(db, q, s, m, wh, a, ec, statsCache))
}

Cheap vs expensive stats

Source	Cost	Strategy
`db.Stats()`	Free — atomic reads	Call live
`q.Stats()`	DB query	Call live (fast — indexed)
`s.Stats()`	Free — atomic reads	Call live
`m.Stats()`	Free — atomic reads	Call live
`wh.Stats()`	Free — atomic reads	Call live
`a.Stats()`	Free — atomic reads	Call live
`ec.Stats()`	Free — atomic reads	Call live
Table counts, user counts	Multiple DB queries	Cache 30s
Time-series chart data	Complex aggregation	Cache 5m

Assembling the response

func statsHandler(db *sqlite.DB, q *queue.Queue, s *cron.Scheduler,
    m *http.Metrics, wh *webhooks.Dispatcher, a *auth.Auth,
    ec *email.Client, statsCache *cache.Cache[*dbStats]) func(http.ResponseWriter, *http.Request) {

    return func(w http.ResponseWriter, r *http.Request) {
        var mem runtime.MemStats
        runtime.ReadMemStats(&mem)

        // Cached — expensive DB queries, 30s TTL.
        st, _ := statsCache.GetOrSet("stats", func() (*dbStats, error) {
            return queryDBStats(db)
        })
        if st == nil {
            st = &dbStats{}
        }

        // Live: queue stats run a fast indexed DB query; everything else is an in-memory atomic read.
        queueStats := map[string]any{"pending": 0, "running": 0}
        if qs, err := q.Stats(); err == nil {
            queueStats["pending"] = qs.Pending
            queueStats["running"] = qs.Running
            queueStats["completed"] = qs.Completed
            queueStats["failed"] = qs.Failed
            queueStats["dead"] = qs.Dead
        }

        cronStats := s.Stats()
        dbLive := db.Stats() // read the SQLite counters once instead of once per field

        http.WriteJSON(w, http.StatusOK, map[string]any{
            "system": map[string]any{
                "goroutines":      runtime.NumGoroutine(),
                "memory_alloc_mb": float64(mem.Alloc) / 1024 / 1024,
            },
            "database": map[string]any{
                "size_bytes":   st.DBSizeBytes,
                "total_reads":  dbLive.TotalReads,
                "total_writes": dbLive.TotalWrites,
                "pool_waits":   dbLive.PoolWaits,
            },
            "queue":   queueStats,
            "cron": map[string]any{
                "completed": cronStats.Completed,
                "failed":    cronStats.Failed,
                "skipped":   cronStats.Skipped,
            },
            "http":    m.Stats(),
            "webhook": wh.Stats(),
            "auth":    a.Stats(),
            "email":   ec.Stats(),
            "stats": map[string]any{
                "total_users":     st.TotalUsers,
                "active_sessions": st.ActiveSessions,
            },
        })
    }
}

Wiring in main.go

Pass all stats providers through the lifecycle DI container:

func main() {
    app := lifecycle.New(
        lifecycle.Provide(provideDB),
        lifecycle.Provide(provideAuth),
        lifecycle.Provide(provideEmail),
        lifecycle.Provide(provideQueue),
        lifecycle.Provide(provideWebhookDispatcher),
        lifecycle.Provide(provideCron),
        lifecycle.Provide(provideMetrics),
        lifecycle.Provide(provideRouter),
        lifecycle.Provide(provideServer),
        lifecycle.Invoke(registerModules),
    )
    if err := app.Run(); err != nil {
        fmt.Fprintf(os.Stderr, "fatal: %v\n", err)
        os.Exit(1)
    }
}

func provideMetrics() *http.Metrics {
    return http.NewMetrics()
}

The registerModules function receives all providers and wires them into the dashboard:

func registerModules(router *http.Router, db *sqlite.DB, q *queue.Queue,
    s *cron.Scheduler, m *http.Metrics, wh *webhooks.Dispatcher,
    a *auth.Auth, ec *email.Client) {

    admin := router.Group("/api/admin")
    admin.Use(a.RequireAuth())
    admin.Use(auth.RequireScope("admin"))

    dashboard.Register(admin, db, q, s, m, wh, a, ec)
}

Prometheus metrics endpoint

For external monitoring tools (Prometheus, Grafana, Datadog), expose all framework Stats() in Prometheus text exposition format using http.PrometheusHandler. This is a public endpoint — register it alongside the health check:

api.HandleFunc("GET /metrics", http.PrometheusHandler(
    collectPrometheus(db, m, q, s, whDispatcher, a, emailClient),
))

The collector function gathers metrics from all framework packages on each scrape:

func collectPrometheus(db *sqlite.DB, m *http.Metrics, q *queue.Queue,
    s *cron.Scheduler, wh *webhooks.Dispatcher, a *auth.Auth,
    ec *email.Client) func() []http.PrometheusMetric {

    return func() []http.PrometheusMetric {
        var out []http.PrometheusMetric

        // SQLite — pool and query counters.
        ds := db.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_sqlite_reads_total", Help: "Total read queries", Type: "counter", Value: float64(ds.TotalReads)},
            http.PrometheusMetric{Name: "stanza_sqlite_writes_total", Help: "Total write queries", Type: "counter", Value: float64(ds.TotalWrites)},
            http.PrometheusMetric{Name: "stanza_sqlite_pool_waits_total", Help: "Read pool wait events", Type: "counter", Value: float64(ds.PoolWaits)},
            http.PrometheusMetric{Name: "stanza_sqlite_read_pool_in_use", Help: "Read pool connections in use", Type: "gauge", Value: float64(ds.ReadPoolInUse)},
        )

        // HTTP: request counters and the in-flight gauge.
        hs := m.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_http_requests_total", Help: "Total requests processed", Type: "counter", Value: float64(hs.TotalRequests)},
            http.PrometheusMetric{Name: "stanza_http_requests_active", Help: "Requests in flight", Type: "gauge", Value: float64(hs.ActiveRequests)},
            http.PrometheusMetric{Name: "stanza_http_responses_2xx_total", Help: "2xx responses", Type: "counter", Value: float64(hs.Status2xx)},
            http.PrometheusMetric{Name: "stanza_http_responses_4xx_total", Help: "4xx responses", Type: "counter", Value: float64(hs.Status4xx)},
            http.PrometheusMetric{Name: "stanza_http_responses_5xx_total", Help: "5xx responses", Type: "counter", Value: float64(hs.Status5xx)},
        )

        // Queue — job state counts.
        if qs, err := q.Stats(); err == nil {
            out = append(out,
                http.PrometheusMetric{Name: "stanza_queue_pending", Help: "Pending jobs", Type: "gauge", Value: float64(qs.Pending)},
                http.PrometheusMetric{Name: "stanza_queue_completed_total", Help: "Completed jobs", Type: "counter", Value: float64(qs.Completed)},
                http.PrometheusMetric{Name: "stanza_queue_failed", Help: "Failed jobs awaiting retry", Type: "gauge", Value: float64(qs.Failed)},
                http.PrometheusMetric{Name: "stanza_queue_dead_total", Help: "Dead-lettered jobs", Type: "counter", Value: float64(qs.Dead)},
            )
        }

        // Cron, webhook, auth, email — same pattern.
        cs := s.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_cron_completed_total", Help: "Cron runs completed", Type: "counter", Value: float64(cs.Completed)},
            http.PrometheusMetric{Name: "stanza_cron_failed_total", Help: "Cron runs failed", Type: "counter", Value: float64(cs.Failed)},
        )

        ws := wh.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_webhook_sends_total", Help: "Webhook deliveries attempted", Type: "counter", Value: float64(ws.Sends)},
            http.PrometheusMetric{Name: "stanza_webhook_failures_total", Help: "Webhook deliveries failed", Type: "counter", Value: float64(ws.Failures)},
        )

        as := a.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_auth_tokens_issued_total", Help: "Tokens issued", Type: "counter", Value: float64(as.Issued)},
            http.PrometheusMetric{Name: "stanza_auth_tokens_rejected_total", Help: "Tokens rejected", Type: "counter", Value: float64(as.Rejected)},
        )

        es := ec.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_email_sent_total", Help: "Emails sent", Type: "counter", Value: float64(es.Sent)},
            http.PrometheusMetric{Name: "stanza_email_errors_total", Help: "Email errors", Type: "counter", Value: float64(es.Errors)},
        )

        return out
    }
}

Test it:

curl -s http://localhost:23710/api/metrics

# HELP stanza_sqlite_reads_total Total read queries
# TYPE stanza_sqlite_reads_total counter
stanza_sqlite_reads_total 1247
# HELP stanza_http_requests_total Total requests processed
# TYPE stanza_http_requests_total counter
stanza_http_requests_total 892
# HELP stanza_queue_pending Pending jobs
# TYPE stanza_queue_pending gauge
stanza_queue_pending 0
...
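To make the output above less opaque, here is a rough sketch of how a list of metrics could be rendered into the text exposition format. The metric struct mirrors the fields used with http.PrometheusMetric in this recipe, but the type and the render function are hypothetical; this is not the implementation of http.PrometheusHandler. The sketch also skips the escaping that the format requires for backslashes and newlines in HELP text.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// metric is a hypothetical stand-in with the same four fields used in this recipe.
type metric struct {
	Name  string
	Help  string
	Type  string // "counter" or "gauge"
	Value float64
}

// render writes one HELP line, one TYPE line, and one sample line per metric.
func render(ms []metric) string {
	var b strings.Builder
	for _, m := range ms {
		fmt.Fprintf(&b, "# HELP %s %s\n", m.Name, m.Help)
		fmt.Fprintf(&b, "# TYPE %s %s\n", m.Name, m.Type)
		b.WriteString(m.Name)
		b.WriteByte(' ')
		// 'f' with precision -1 prints the shortest exact decimal, so 1247 stays "1247".
		b.WriteString(strconv.FormatFloat(m.Value, 'f', -1, 64))
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	fmt.Print(render([]metric{
		{Name: "stanza_sqlite_reads_total", Help: "Total read queries", Type: "counter", Value: 1247},
		{Name: "stanza_queue_pending", Help: "Pending jobs", Type: "gauge", Value: 0},
	}))
}
```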

Prometheus scrape config

Point Prometheus at your app's metrics endpoint:

scrape_configs:
  - job_name: stanza
    scrape_interval: 30s
    static_configs:
      - targets: ["your-app.up.railway.app"]
    scheme: https
    metrics_path: /api/metrics

Counters vs gauges

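Use counter for values that only go up (requests, errors, bytes). Use gauge for values that go up and down (active connections, queue depth, pool usage). Prometheus does not store rates; you derive them at query time with PromQL's rate(), which also compensates for counters that reset to zero when the process restarts. For example, rate(stanza_http_requests_total[5m]) gives the average requests per second over the last five minutes.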
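The arithmetic behind a rate can be illustrated with two samples of the same counter. The function below is a simplified approximation, not Prometheus's actual algorithm (which works over many samples and extrapolates to the window edges); perSecond is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// perSecond approximates a rate from two counter samples taken elapsed apart.
// A counter that went down is treated as a restart: the counter began again at
// zero, so the current value is the increase since the reset.
func perSecond(prev, curr float64, elapsed time.Duration) float64 {
	if elapsed <= 0 {
		return 0
	}
	increase := curr - prev
	if increase < 0 {
		increase = curr
	}
	return increase / elapsed.Seconds()
}

func main() {
	fmt.Println(perSecond(100, 400, 30*time.Second)) // prints "10"
	fmt.Println(perSecond(400, 60, 30*time.Second))  // restart between samples; prints "2"
}
```

This is also why the in-memory Stats() counters work well as Prometheus counters even though they start from zero on every deploy.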

Adding observability to a new module

When you add a framework package or standalone service that maintains runtime state, follow this pattern:

1. Define the stats struct

type ServiceStats struct {
    Processed int64
    Errors    int64
}

2. Add atomic counters

type Service struct {
    processed atomic.Int64
    errors    atomic.Int64
}

func (s *Service) Stats() ServiceStats {
    return ServiceStats{
        Processed: s.processed.Load(),
        Errors:    s.errors.Load(),
    }
}

3. Increment in operations

func (s *Service) Do(ctx context.Context) error {
    err := s.process(ctx)
    if err != nil {
        s.errors.Add(1)
        return err
    }
    s.processed.Add(1)
    return nil
}

4. Wire into the dashboard

Add the service to the dashboard's Register function signature and include its stats in the response:

"my_service": svc.Stats(),

Atomic counters are the right choice for Stats() — they're lock-free, allocation-free, and safe to read from any goroutine at any time.
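Put together, steps 1 to 3 can be exercised from several goroutines at once. The sketch below repeats the Service pattern with a simulated operation (Do takes a fail flag instead of doing real work) and adds a hypothetical run helper that drives it concurrently; atomic.Int64 requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type ServiceStats struct {
	Processed int64
	Errors    int64
}

type Service struct {
	processed atomic.Int64
	errors    atomic.Int64
}

// Stats takes a snapshot of the counters; it never blocks writers.
func (s *Service) Stats() ServiceStats {
	return ServiceStats{Processed: s.processed.Load(), Errors: s.errors.Load()}
}

// Do simulates one operation; fail selects the error path.
func (s *Service) Do(fail bool) error {
	if fail {
		s.errors.Add(1)
		return fmt.Errorf("simulated failure")
	}
	s.processed.Add(1)
	return nil
}

// run calls Do from several goroutines and returns the final counters.
func run(workers, perWorker int) ServiceStats {
	var svc Service
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				_ = svc.Do(i%10 == 0) // every tenth call fails
			}
		}()
	}
	wg.Wait()
	return svc.Stats()
}

func main() {
	st := run(8, 1000)
	fmt.Println(st.Processed, st.Errors) // prints "7200 800"
}
```

No increments are lost even without a mutex, which is the property the Stats() pattern relies on.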

Production monitoring

Railway health checks

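Railway can check your health endpoint when it deploys your service. Configure the path in railway.toml:

[deploy]
healthcheckPath = "/api/health"
healthcheckTimeout = 10
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 5

Railway routes traffic to a new deployment only after the health endpoint returns 200 within the timeout, and the restart policy restarts the service when its process exits with an error. Railway's documentation describes the health check as a deploy-time check rather than continuous monitoring, so confirm the current behavior there and use external polling for ongoing checks.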

Key metrics to watch

Metric	Source	Alert threshold
Health status	`GET /health` → `status`	Any response != `"ok"`
Memory usage	`GET /health` → `memory_mb`	> 80% of container limit
Goroutine count	`GET /health` → `goroutines`	Sustained growth (leak)
5xx error rate	`http.Stats()` → `Status5xx / TotalRequests`	> 1%
Queue backlog	`queue.Stats()` → `Pending`	Growing over time
Dead jobs	`queue.Stats()` → `Dead`	Any increase
Pool waits	`db.Stats()` → `PoolWaits`	Sustained growth
Auth rejections	`auth.Stats()` → `Rejected`	Sudden spike
Email failures	`email.Stats()` → `Errors`	Any increase
Webhook failures	`webhook.Stats()` → `Failures / Sends`	> 5%
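One caveat for ratio thresholds such as the 5xx error rate: the counters are cumulative since startup, so a ratio over the whole lifetime hides a recent spike. A possible approach, sketched under that assumption, is to compare two snapshots and compute the ratio for the window between them. The httpSnapshot type and errorRate5xx function are hypothetical.

```go
package main

import "fmt"

// httpSnapshot is a stand-in holding two of the HTTP counters from the table above.
type httpSnapshot struct {
	TotalRequests int64
	Status5xx     int64
}

// errorRate5xx returns the share of 5xx responses among the requests served
// between two snapshots. It returns 0 when there was no traffic in the window
// or when the counters went backwards (for example after a restart).
func errorRate5xx(prev, curr httpSnapshot) float64 {
	reqs := curr.TotalRequests - prev.TotalRequests
	errs := curr.Status5xx - prev.Status5xx
	if reqs <= 0 || errs < 0 {
		return 0
	}
	return float64(errs) / float64(reqs)
}

func main() {
	prev := httpSnapshot{TotalRequests: 1_000_000, Status5xx: 100}
	curr := httpSnapshot{TotalRequests: 1_000_200, Status5xx: 110}

	window := errorRate5xx(prev, curr)
	lifetime := float64(curr.Status5xx) / float64(curr.TotalRequests)
	fmt.Printf("window: %.2f%%  lifetime: %.4f%%\n", window*100, lifetime*100)
	fmt.Println("alert:", window > 0.01)
}
```

In this example the last 200 requests have a 5% error rate and cross the 1% threshold, while the lifetime ratio stays near 0.011%.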

Polling from external monitoring

For external uptime monitoring (UptimeRobot, Pingdom, or a simple cron), poll the health endpoint:

# Simple check: exit code 0 if healthy, non-zero if degraded or unreachable.
curl -sf https://your-app.up.railway.app/api/health > /dev/null

# Detailed check: parse the JSON for specific conditions.
STATUS=$(curl -s https://your-app.up.railway.app/api/health | jq -r '.status')
if [ "$STATUS" != "ok" ]; then
    echo "ALERT: service degraded"
fi
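The same check can be written in Go when a shell script is not enough. This sketch uses only the standard library; healthReport, healthy, and check are hypothetical names, and the struct decodes only the fields of the health response that it inspects. Without arguments the program runs against a sample payload, so it works offline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// healthReport decodes the subset of the health response used for alerting.
type healthReport struct {
	Status   string `json:"status"`
	Database struct {
		OK    bool   `json:"ok"`
		Error string `json:"error"`
	} `json:"database"`
}

// healthy reports whether a health response body describes a healthy service.
func healthy(body []byte) (bool, error) {
	var r healthReport
	if err := json.Unmarshal(body, &r); err != nil {
		return false, fmt.Errorf("decode health response: %w", err)
	}
	return r.Status == "ok" && r.Database.OK, nil
}

// check fetches the endpoint with a timeout and returns an error unless the
// service answers 200 with a healthy body.
func check(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return fmt.Errorf("read body: %w", err)
	}
	ok, err := healthy(body)
	if err != nil {
		return err
	}
	if resp.StatusCode != http.StatusOK || !ok {
		return fmt.Errorf("service degraded (HTTP %d)", resp.StatusCode)
	}
	return nil
}

func main() {
	if len(os.Args) > 1 { // pass the health URL as the first argument
		if err := check(os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, "ALERT:", err)
			os.Exit(1)
		}
		return
	}
	sample := []byte(`{"status":"degraded","database":{"ok":false,"error":"disk I/O error"}}`)
	ok, err := healthy(sample)
	fmt.Println(ok, err) // prints "false <nil>"
}
```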

Dashboard polling

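The admin panel polls the dashboard endpoint every 30 seconds to show live metrics. The 30-second cache TTL caps the expensive queries at one run per 30 seconds no matter how many admins have the panel open. The request that finds the cache expired still waits for them; every other request is served from the cache and the live counters.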

Tips

Health endpoint stays public. No auth — load balancers and monitoring tools need unauthenticated access.
Dashboard endpoint stays protected. It exposes internal counts (users, sessions, queue depth) that shouldn't be public.
Stats() is safe to call at any time. The atomic-backed methods are lock-free and allocation-free, so call them as often as you need. queue.Stats() is the exception: it runs an indexed database query on every call.
Cache expensive queries, not Stats(). Atomic-backed Stats() methods are effectively free. Database counts, file sizes, and aggregation queries are not, so cache those with a 30-second TTL.
Don't add Stats() to everything. Only packages with meaningful runtime state need it. Config, validation, and CLI don't.
Use ReadMemStats sparingly. It triggers a stop-the-world pause. The health and dashboard endpoints call it once per request — don't call it in a hot loop.
Return 503 for degraded, not 500. Container orchestrators treat 503 as "temporarily unavailable" and may retry routing, while 500 suggests a code bug.
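If memory figures are needed on a hot path, one option is to sample ReadMemStats in the background and serve the last sample from an atomic. This is a sketch of that idea rather than something the framework provides; memSampler and its methods are hypothetical names, and atomic.Uint64 requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// memSampler keeps the most recent heap size so readers never call ReadMemStats.
type memSampler struct {
	allocBytes atomic.Uint64
}

// refresh takes one sample; this is the only place that pays the stop-the-world cost.
func (m *memSampler) refresh() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	m.allocBytes.Store(ms.Alloc)
}

// AllocMB returns the latest sample in mebibytes using a single atomic load.
func (m *memSampler) AllocMB() uint64 {
	return m.allocBytes.Load() / 1024 / 1024
}

// start samples immediately, then once per interval until stop is closed.
func (m *memSampler) start(interval time.Duration, stop <-chan struct{}) {
	m.refresh()
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				m.refresh()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	defer close(stop)

	var m memSampler
	m.start(10*time.Second, stop)
	fmt.Println("heap MB:", m.AllocMB())
}
```

The trade-off is staleness: readers see a value up to one interval old, which is usually acceptable for a dashboard.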