Observability & health checks

Every Stanza framework package with runtime state exposes a Stats() method, in most cases backed by atomic counters for its key operations. This recipe shows how to build a health check endpoint, wire Stats() into an admin dashboard and a Prometheus endpoint, inject build metadata, and monitor your app in production.


Health check endpoint

A health endpoint answers one question: is the service healthy enough to receive traffic? Keep it simple — check database connectivity and report basic runtime metrics:

package health

import (
    "runtime"
    "time"

    "github.com/stanza-go/framework/pkg/http"
    "github.com/stanza-go/framework/pkg/sqlite"
)

var startTime = time.Now()

type BuildInfo struct {
    Version   string
    Commit    string
    BuildTime string
}

func Register(api *http.Group, db *sqlite.DB, bi BuildInfo) {
    ver := bi.Version
    if ver == "" {
        ver = "dev"
    }

    api.HandleFunc("GET /health", func(w http.ResponseWriter, r *http.Request) {
        dbOK := true
        var dbErr string
        row := db.QueryRow("SELECT 1")
        var one int
        if err := row.Scan(&one); err != nil {
            dbOK = false
            dbErr = err.Error()
        }

        var mem runtime.MemStats
        runtime.ReadMemStats(&mem)

        status := http.StatusOK
        if !dbOK {
            status = http.StatusServiceUnavailable
        }

        stats := db.Stats()
        http.WriteJSON(w, status, map[string]any{
            "status":     statusText(dbOK),
            "version":    ver,
            "commit":     bi.Commit,
            "uptime":     time.Since(startTime).Round(time.Second).String(),
            "go":         runtime.Version(),
            "goroutines": runtime.NumGoroutine(),
            "memory_mb":  mem.Alloc / 1024 / 1024,
            "database": map[string]any{
                "ok":          dbOK,
                "error":       dbErr,
                "total_reads": stats.TotalReads,
                "total_writes": stats.TotalWrites,
                "pool_size":   stats.ReadPoolSize,
                "pool_in_use": stats.ReadPoolInUse,
                "pool_waits":  stats.PoolWaits,
            },
        })
    })
}

func statusText(ok bool) string {
    if ok {
        return "ok"
    }
    return "degraded"
}

The endpoint returns 200 OK when healthy and 503 Service Unavailable when the database is unreachable. Container orchestrators (Railway, Cloud Run) use this to route traffic and restart unhealthy instances.

Register it on a public route — no auth required:

health.Register(api, db, health.BuildInfo{
    Version:   version,
    Commit:    commit,
    BuildTime: buildTime,
})

Test it:

curl -s http://localhost:23710/api/health | jq .
{
  "status": "ok",
  "version": "dev",
  "commit": "",
  "uptime": "2m30s",
  "go": "go1.26.1",
  "goroutines": 14,
  "memory_mb": 8,
  "database": {
    "ok": true,
    "error": "",
    "total_reads": 1247,
    "total_writes": 83,
    "pool_size": 4,
    "pool_in_use": 1,
    "pool_waits": 0
  }
}

What to check

The health endpoint should verify only critical dependencies — the database and nothing else. Don't check optional services (email, webhooks) here. A failing email provider shouldn't mark the entire service as unhealthy.


Build metadata

Inject version, commit hash, and build time at compile time via -ldflags:

// main.go — these are empty in development, set by the build.
var (
    version   string
    commit    string
    buildTime string
)

The Makefile sets them:

go build -ldflags="-s -w \
    -X main.version=$$(git describe --tags --always --dirty) \
    -X main.commit=$$(git rev-parse --short HEAD) \
    -X main.buildTime=$$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    -o bin/standalone .

This lets you identify exactly which code is running in production by hitting the health endpoint.
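
When the binary is built without the -ldflags above, the variables stay empty. A minimal sketch of one possible fallback follows: it uses "dev" for the version and, for the commit, the VCS revision that the Go toolchain (1.18 and later) embeds when building inside a repository. The helper names resolve and vcsRevision are hypothetical and not part of Stanza.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// Set via -ldflags "-X main.version=... -X main.commit=..."; empty otherwise.
var (
	version string
	commit  string
)

// resolve returns value when it is set, otherwise the fallback.
func resolve(value, fallback string) string {
	if value != "" {
		return value
	}
	return fallback
}

// vcsRevision returns the revision recorded in the embedded build info,
// or "" when none is available (for example under "go run").
func vcsRevision() string {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return ""
	}
	for _, s := range info.Settings {
		if s.Key == "vcs.revision" {
			return s.Value
		}
	}
	return ""
}

func main() {
	v := resolve(version, "dev")
	c := resolve(commit, resolve(vcsRevision(), "unknown"))
	fmt.Printf("version=%s commit=%s\n", v, c)
}
```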


Stats() reference

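Every framework package with runtime state exposes a Stats() method. Most are backed by sync/atomic counters: safe to call from any goroutine, no locks, no allocations. The exception is queue.Stats(), which runs a database query and therefore also returns an error.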

SQLite

stats := db.Stats()
Field | Type | Description
ReadPoolSize | int | Configured number of read connections
ReadPoolAvailable | int | Idle connections in pool
ReadPoolInUse | int | Connections currently checked out
TotalReads | int64 | Total read queries executed
TotalWrites | int64 | Total write operations executed
PoolWaits | int64 | Times a read query waited for a free connection
PoolWaitTime | time.Duration | Cumulative time waiting for a free connection

What to watch: PoolWaits growing means your read pool is too small for your concurrency. Increase with sqlite.WithReadPoolSize(n).

HTTP

stats := metrics.Stats()
Field | Type | Description
TotalRequests | int64 | Total requests processed
ActiveRequests | int64 | Currently in-flight requests
Status2xx | int64 | Successful responses
Status3xx | int64 | Redirects
Status4xx | int64 | Client errors
Status5xx | int64 | Server errors
BytesWritten | int64 | Total response bytes
AvgDurationMs | float64 | Average request duration in milliseconds

What to watch: Status5xx climbing means server errors. ActiveRequests staying high means requests are backing up — check for slow queries or external calls.

Cache

stats := myCache.Stats()
Field | Type | Description
Size | int | Current entries (including expired not yet cleaned)
MaxSize | int | Configured maximum (0 = unlimited)
Hits | int64 | Key found and not expired
Misses | int64 | Key not found or expired
Evictions | int64 | Involuntary removals (TTL expiry + LRU)

What to watch: Hit rate = Hits / (Hits + Misses). A hit rate below roughly 80% usually means your TTL is too short or your cache is too small. High Evictions with a full cache means MaxSize is too low.
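
The hit-rate formula needs a guard for the moment before any lookup has happened. A small sketch, using a local cacheStats struct that mirrors the Hits and Misses fields from the table above (it is a stand-in, not the framework's type):

```go
package main

import "fmt"

// cacheStats is a local stand-in holding the two counters the formula needs.
type cacheStats struct {
	Hits   int64
	Misses int64
}

// hitRate returns Hits / (Hits + Misses) as a value between 0 and 1.
// It returns 0 when there have been no lookups, avoiding a division by zero.
func hitRate(s cacheStats) float64 {
	total := s.Hits + s.Misses
	if total == 0 {
		return 0
	}
	return float64(s.Hits) / float64(total)
}

func main() {
	s := cacheStats{Hits: 900, Misses: 100}
	fmt.Printf("hit rate: %.1f%%\n", hitRate(s)*100) // prints "hit rate: 90.0%"
}
```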

Queue

stats, err := queue.Stats()
Field | Type | Description
Pending | int | Jobs waiting to be processed
Running | int | Jobs currently being processed
Completed | int | Successfully finished jobs
Failed | int | Jobs that errored (will retry)
Dead | int | Jobs that exhausted all retries
Cancelled | int | Manually cancelled jobs

What to watch: Pending growing faster than Completed means your workers can't keep up. Dead increasing means jobs are permanently failing — check error logs.
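
"Growing faster" is a comparison between two samples taken some time apart. The sketch below shows that comparison with a hypothetical queueSnapshot struct holding the two relevant counts; how and how often you sample is up to your monitoring setup.

```go
package main

import "fmt"

// queueSnapshot holds the two counts compared between samples.
type queueSnapshot struct {
	Pending   int
	Completed int
}

// backlogGrowing reports whether Pending rose, and rose by more than
// Completed did, between two samples: jobs arrive faster than they finish.
func backlogGrowing(prev, cur queueSnapshot) bool {
	pendingDelta := cur.Pending - prev.Pending
	completedDelta := cur.Completed - prev.Completed
	return pendingDelta > 0 && pendingDelta > completedDelta
}

func main() {
	prev := queueSnapshot{Pending: 10, Completed: 100}
	cur := queueSnapshot{Pending: 60, Completed: 120}
	fmt.Println("backlog growing:", backlogGrowing(prev, cur)) // prints "backlog growing: true"
}
```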

Cron

stats := scheduler.Stats()
Field | Type | Description
Jobs | int | Total registered jobs
Completed | int64 | Successful executions
Failed | int64 | Executions that returned error or panicked
Skipped | int64 | Skipped because previous execution still running

What to watch: Skipped increasing means a cron job takes longer than its interval. Either increase the interval or optimize the job. Failed means a job is erroring — check logs.

Webhook

stats := webhookClient.Stats()
Field | Type | Description
Sends | int64 | Total Send or SendWithRetry calls
Successes | int64 | Deliveries that received 2xx response
Failures | int64 | Non-2xx response after retries exhausted
Retries | int64 | Total retry attempts
Errors | int64 | Network errors (DNS, timeouts, context cancellation)

What to watch: Failures / Sends is your delivery failure rate. High Retries means endpoints are flaky. High Errors means network issues.

Auth

stats := auth.Stats()
Field | Type | Description
Issued | int64 | Access tokens successfully created
Accepted | int64 | Tokens that passed validation
Rejected | int64 | Tokens that failed (expired, malformed, invalid signature)

What to watch: Rejected spiking could indicate token expiry issues (clock skew) or attack attempts (forged tokens).

Email

stats := emailClient.Stats()
Field | Type | Description
Sent | int64 | Emails successfully delivered to API
Errors | int64 | Failed send attempts (transport errors, non-2xx responses)

What to watch: Errors / (Sent + Errors) is your email failure rate. Non-zero Errors usually means an API key issue or provider outage.


Dashboard endpoint

The admin dashboard aggregates Stats() from all packages into a single JSON response. The key pattern: call Stats() live for cheap in-memory counters, cache expensive database queries.

func Register(admin *http.Group, db *sqlite.DB, q *queue.Queue,
    s *cron.Scheduler, m *http.Metrics, wh *webhooks.Dispatcher,
    a *auth.Auth, ec *email.Client) {

    // Cache expensive DB queries (table counts, user counts).
    statsCache := cache.New[*dbStats](
        cache.WithTTL[*dbStats](30 * time.Second),
        cache.WithMaxSize[*dbStats](1),
    )

    admin.HandleFunc("GET /dashboard", statsHandler(db, q, s, m, wh, a, ec, statsCache))
}

Cheap vs expensive stats

Source | Cost | Strategy
db.Stats() | Free (atomic reads) | Call live
q.Stats() | DB query | Call live (fast, indexed)
s.Stats() | Free (atomic reads) | Call live
m.Stats() | Free (atomic reads) | Call live
wh.Stats() | Free (atomic reads) | Call live
a.Stats() | Free (atomic reads) | Call live
ec.Stats() | Free (atomic reads) | Call live
Table counts, user counts | Multiple DB queries | Cache 30s
Time-series chart data | Complex aggregation | Cache 5m

Assembling the response

func statsHandler(db *sqlite.DB, q *queue.Queue, s *cron.Scheduler,
    m *http.Metrics, wh *webhooks.Dispatcher, a *auth.Auth,
    ec *email.Client, statsCache *cache.Cache[*dbStats]) func(http.ResponseWriter, *http.Request) {

    return func(w http.ResponseWriter, r *http.Request) {
        var mem runtime.MemStats
        runtime.ReadMemStats(&mem)

        // Cached — expensive DB queries, 30s TTL.
        st, _ := statsCache.GetOrSet("stats", func() (*dbStats, error) {
            return queryDBStats(db)
        })
        if st == nil {
            st = &dbStats{}
        }

        // Live: queue stats run one fast indexed DB query; everything else below is in-memory.
        queueStats := map[string]any{"pending": 0, "running": 0}
        if qs, err := q.Stats(); err == nil {
            queueStats["pending"] = qs.Pending
            queueStats["running"] = qs.Running
            queueStats["completed"] = qs.Completed
            queueStats["failed"] = qs.Failed
            queueStats["dead"] = qs.Dead
        }

        cronStats := s.Stats()
        dbLive := db.Stats() // one snapshot, so the three fields below are consistent

        http.WriteJSON(w, http.StatusOK, map[string]any{
            "system": map[string]any{
                "goroutines":      runtime.NumGoroutine(),
                "memory_alloc_mb": float64(mem.Alloc) / 1024 / 1024,
            },
            "database": map[string]any{
                "size_bytes":   st.DBSizeBytes,
                "total_reads":  dbLive.TotalReads,
                "total_writes": dbLive.TotalWrites,
                "pool_waits":   dbLive.PoolWaits,
            },
            "queue":   queueStats,
            "cron": map[string]any{
                "completed": cronStats.Completed,
                "failed":    cronStats.Failed,
                "skipped":   cronStats.Skipped,
            },
            "http":    m.Stats(),
            "webhook": wh.Stats(),
            "auth":    a.Stats(),
            "email":   ec.Stats(),
            "stats": map[string]any{
                "total_users":     st.TotalUsers,
                "active_sessions": st.ActiveSessions,
            },
        })
    }
}

Wiring in main.go

Pass all stats providers through the lifecycle DI container:

func main() {
    app := lifecycle.New(
        lifecycle.Provide(provideDB),
        lifecycle.Provide(provideAuth),
        lifecycle.Provide(provideEmail),
        lifecycle.Provide(provideQueue),
        lifecycle.Provide(provideWebhookDispatcher),
        lifecycle.Provide(provideCron),
        lifecycle.Provide(provideMetrics),
        lifecycle.Provide(provideRouter),
        lifecycle.Provide(provideServer),
        lifecycle.Invoke(registerModules),
    )
    if err := app.Run(); err != nil {
        fmt.Fprintf(os.Stderr, "fatal: %v\n", err)
        os.Exit(1)
    }
}

func provideMetrics() *http.Metrics {
    return http.NewMetrics()
}

The registerModules function receives all providers and wires them into the dashboard:

func registerModules(router *http.Router, db *sqlite.DB, q *queue.Queue,
    s *cron.Scheduler, m *http.Metrics, wh *webhooks.Dispatcher,
    a *auth.Auth, ec *email.Client) {

    admin := router.Group("/api/admin")
    admin.Use(a.RequireAuth())
    admin.Use(auth.RequireScope("admin"))

    dashboard.Register(admin, db, q, s, m, wh, a, ec)
}

Prometheus metrics endpoint

For external monitoring tools (Prometheus, Grafana, Datadog), expose all framework Stats() in Prometheus text exposition format using http.PrometheusHandler. This is a public endpoint — register it alongside the health check:

api.HandleFunc("GET /metrics", http.PrometheusHandler(
    collectPrometheus(db, m, q, s, whDispatcher, a, emailClient),
))

The collector function gathers metrics from all framework packages on each scrape:

func collectPrometheus(db *sqlite.DB, m *http.Metrics, q *queue.Queue,
    s *cron.Scheduler, wh *webhooks.Dispatcher, a *auth.Auth,
    ec *email.Client) func() []http.PrometheusMetric {

    return func() []http.PrometheusMetric {
        var out []http.PrometheusMetric

        // SQLite — pool and query counters.
        ds := db.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_sqlite_reads_total", Help: "Total read queries", Type: "counter", Value: float64(ds.TotalReads)},
            http.PrometheusMetric{Name: "stanza_sqlite_writes_total", Help: "Total write queries", Type: "counter", Value: float64(ds.TotalWrites)},
            http.PrometheusMetric{Name: "stanza_sqlite_pool_waits_total", Help: "Read pool wait events", Type: "counter", Value: float64(ds.PoolWaits)},
            http.PrometheusMetric{Name: "stanza_sqlite_read_pool_in_use", Help: "Read pool connections in use", Type: "gauge", Value: float64(ds.ReadPoolInUse)},
        )

        // HTTP — request counters and latency.
        hs := m.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_http_requests_total", Help: "Total requests processed", Type: "counter", Value: float64(hs.TotalRequests)},
            http.PrometheusMetric{Name: "stanza_http_requests_active", Help: "Requests in flight", Type: "gauge", Value: float64(hs.ActiveRequests)},
            http.PrometheusMetric{Name: "stanza_http_responses_2xx_total", Help: "2xx responses", Type: "counter", Value: float64(hs.Status2xx)},
            http.PrometheusMetric{Name: "stanza_http_responses_4xx_total", Help: "4xx responses", Type: "counter", Value: float64(hs.Status4xx)},
            http.PrometheusMetric{Name: "stanza_http_responses_5xx_total", Help: "5xx responses", Type: "counter", Value: float64(hs.Status5xx)},
        )

        // Queue — job state counts.
        if qs, err := q.Stats(); err == nil {
            out = append(out,
                http.PrometheusMetric{Name: "stanza_queue_pending", Help: "Pending jobs", Type: "gauge", Value: float64(qs.Pending)},
                http.PrometheusMetric{Name: "stanza_queue_completed_total", Help: "Completed jobs", Type: "counter", Value: float64(qs.Completed)},
                http.PrometheusMetric{Name: "stanza_queue_failed_total", Help: "Failed jobs", Type: "counter", Value: float64(qs.Failed)},
                http.PrometheusMetric{Name: "stanza_queue_dead_total", Help: "Dead-lettered jobs", Type: "counter", Value: float64(qs.Dead)},
            )
        }

        // Cron, webhook, auth, email — same pattern.
        cs := s.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_cron_completed_total", Help: "Cron runs completed", Type: "counter", Value: float64(cs.Completed)},
            http.PrometheusMetric{Name: "stanza_cron_failed_total", Help: "Cron runs failed", Type: "counter", Value: float64(cs.Failed)},
        )

        ws := wh.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_webhook_sends_total", Help: "Webhook deliveries attempted", Type: "counter", Value: float64(ws.Sends)},
            http.PrometheusMetric{Name: "stanza_webhook_failures_total", Help: "Webhook deliveries failed", Type: "counter", Value: float64(ws.Failures)},
        )

        as := a.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_auth_tokens_issued_total", Help: "Tokens issued", Type: "counter", Value: float64(as.Issued)},
            http.PrometheusMetric{Name: "stanza_auth_tokens_rejected_total", Help: "Tokens rejected", Type: "counter", Value: float64(as.Rejected)},
        )

        es := ec.Stats()
        out = append(out,
            http.PrometheusMetric{Name: "stanza_email_sent_total", Help: "Emails sent", Type: "counter", Value: float64(es.Sent)},
            http.PrometheusMetric{Name: "stanza_email_errors_total", Help: "Email errors", Type: "counter", Value: float64(es.Errors)},
        )

        return out
    }
}

Test it:

curl -s http://localhost:23710/api/metrics
# HELP stanza_sqlite_reads_total Total read queries
# TYPE stanza_sqlite_reads_total counter
stanza_sqlite_reads_total 1247
# HELP stanza_http_requests_total Total requests processed
# TYPE stanza_http_requests_total counter
stanza_http_requests_total 892
# HELP stanza_queue_pending Pending jobs
# TYPE stanza_queue_pending gauge
stanza_queue_pending 0
...

Prometheus scrape config

Point Prometheus at your app's metrics endpoint:

scrape_configs:
  - job_name: stanza
    scrape_interval: 30s
    static_configs:
      - targets: ["your-app.up.railway.app"]
    scheme: https
    metrics_path: /api/metrics

Counters vs gauges

Use counter for values that only go up (requests, errors, bytes). Use gauge for values that go up and down (active connections, queue depth, pool usage). Prometheus derives rates from counters at query time: rate(stanza_http_requests_total[5m]) gives the average requests per second over the last five minutes.


Adding observability to a new module

When you add a framework package or standalone service that maintains runtime state, follow this pattern:

1. Define the stats struct

type ServiceStats struct {
    Processed int64
    Errors    int64
}

2. Add atomic counters

type Service struct {
    processed atomic.Int64
    errors    atomic.Int64
}

func (s *Service) Stats() ServiceStats {
    return ServiceStats{
        Processed: s.processed.Load(),
        Errors:    s.errors.Load(),
    }
}

3. Increment in operations

func (s *Service) Do(ctx context.Context) error {
    err := s.process(ctx)
    if err != nil {
        s.errors.Add(1)
        return err
    }
    s.processed.Add(1)
    return nil
}

4. Wire into the dashboard

Add the service to the dashboard's Register function signature and include its stats in the response:

"my_service": svc.Stats(),

Atomic counters are the right choice for Stats() — they're lock-free, allocation-free, and safe to read from any goroutine at any time.


Production monitoring

Railway health checks

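Railway calls your health endpoint while a new deployment starts and waits for a 200 response before routing traffic to it. Configure it in railway.toml:

[deploy]
healthcheckPath = "/api/health"
healthcheckTimeout = 10
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 5

If the endpoint does not return 200 within healthcheckTimeout seconds, the deployment is marked as failed. Railway's documentation describes this health check as a deploy-time gate rather than continuous monitoring, so rely on the external polling described below for ongoing checks. The ON_FAILURE restart policy restarts the service when its process exits with an error, up to restartPolicyMaxRetries times.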

Key metrics to watch

Metric | Source | Alert threshold
Health status | GET /health (status) | Any response != "ok"
Memory usage | GET /health (memory_mb) | > 80% of container limit
Goroutine count | GET /health (goroutines) | Sustained growth (leak)
5xx error rate | http.Stats() (Status5xx / TotalRequests) | > 1%
Queue backlog | queue.Stats() (Pending) | Growing over time
Dead jobs | queue.Stats() (Dead) | Any increase
Pool waits | db.Stats() (PoolWaits) | Sustained growth
Auth rejections | auth.Stats() (Rejected) | Sudden spike
Email failures | email.Stats() (Errors) | Any increase
Webhook failures | webhook.Stats() (Failures / Sends) | > 5%
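
"Sustained growth" needs a definition before it can drive an alert. One simple reading, sketched below, is that the last n samples are each higher than the one before. The function name and the choice of n are assumptions; a real alerting rule might instead compare averages over longer windows.

```go
package main

import "fmt"

// sustainedGrowth reports whether the last n samples are strictly increasing.
// It returns false when n < 2 or fewer than n samples are available.
func sustainedGrowth(samples []int, n int) bool {
	if n < 2 || len(samples) < n {
		return false
	}
	recent := samples[len(samples)-n:]
	for i := 1; i < len(recent); i++ {
		if recent[i] <= recent[i-1] {
			return false
		}
	}
	return true
}

func main() {
	// Goroutine counts sampled from the health endpoint, oldest first.
	goroutines := []int{14, 15, 14, 18, 25, 40, 71}
	fmt.Println("possible leak:", sustainedGrowth(goroutines, 5)) // prints "possible leak: true"
}
```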

Polling from external monitoring

For external uptime monitoring (UptimeRobot, Pingdom, or a simple cron), poll the health endpoint:

# Simple check: exit code 0 if healthy, non-zero on a 503 or connection error.
curl -sf https://your-app.up.railway.app/api/health > /dev/null

# Detailed check: parse the JSON for specific conditions.
STATUS=$(curl -s https://your-app.up.railway.app/api/health | jq -r '.status')
if [ "$STATUS" != "ok" ]; then
    echo "ALERT: service degraded"
fi

Dashboard polling

The admin panel polls the dashboard endpoint every 30 seconds to show live metrics. With the 30-second cache TTL on expensive queries, roughly one poll per TTL window pays for the slow aggregation query; the others are served from cache and return immediately.


Tips

  • Health endpoint stays public. No auth — load balancers and monitoring tools need unauthenticated access.
  • Dashboard endpoint stays protected. It exposes internal counts (users, sessions, queue depth) that shouldn't be public.
  • Stats() is safe to call at any time. Atomic reads have no lock contention and no allocations, so call them as often as you need. The one exception in this recipe is queue.Stats(), which runs a fast indexed database query.
  • Cache expensive queries, not Stats(). The in-memory Stats() methods are effectively free. Database counts, file sizes, and aggregation queries are not, so cache those with a 30-second TTL.
  • Don't add Stats() to everything. Only packages with meaningful runtime state need it. Config, validation, and CLI don't.
  • Use ReadMemStats sparingly. It triggers a stop-the-world pause. The health and dashboard endpoints call it once per request — don't call it in a hot loop.
  • Return 503 for degraded, not 500. Container orchestrators treat 503 as "temporarily unavailable" and may retry routing, while 500 suggests a code bug.