back to all skills

monitoring-observability

operationsv1.0.0

Production observability — Grafana, Prometheus, Datadog, alerting strategies, SLOs, distributed tracing, and incident response.

copied ✓
openclawclaude-codecursorcodex
0 installsVirusTotal: cleanSource code


name: monitoring-observability description: "Production monitoring and observability — Prometheus, Grafana, OpenTelemetry, distributed tracing, SLOs, and incident response."

Monitoring & Observability

The Three Pillars — And How They Connect

Monitoring tells you something is broken. Observability tells you why.

Alert fires (metric) → Find error spike in dashboard (metric)
  → Filter logs by time window (logs) → Find correlation ID
    → Trace the request across services (traces) → Find the slow DB query

Metrics: Aggregated numbers over time. Cheap to store, good for alerting. Logs: Individual events with context. Expensive at scale, essential for debugging. Traces: Request flow across services. The connective tissue between metrics and logs.

The key insight: correlation. Every log line and trace should carry the same request ID so you can jump between pillars seamlessly.


Structured Logging That Actually Helps

The Pattern

Every log entry is JSON. No exceptions. No console.log("user signed up").

// lib/logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level(label) {
      return { level: label };  // "info" not 30
    },
  },
  serializers: {
    err: pino.stdSerializers.err,
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
  },
  // Add service metadata to every log
  base: {
    service: process.env.SERVICE_NAME || 'api',
    version: process.env.APP_VERSION || 'unknown',
    environment: process.env.NODE_ENV || 'development',
  },
});

// Request-scoped logger with correlation ID
export function createRequestLogger(requestId: string, userId?: string) {
  return logger.child({
    requestId,
    userId,
  });
}

Express Middleware

import { randomUUID } from 'crypto';
import { createRequestLogger } from './logger';

app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] as string || randomUUID();
  req.log = createRequestLogger(requestId, req.user?.id);
  res.setHeader('x-request-id', requestId);

  const start = performance.now();
  res.on('finish', () => {
    const duration = performance.now() - start;
    req.log.info({
      method: req.method,
      url: req.originalUrl,
      statusCode: res.statusCode,
      duration: Math.round(duration),
      contentLength: res.getHeader('content-length'),
    }, 'request completed');
  });

  next();
});

Log Levels That Actually Mean Something

LevelWhen to UseExample
fatalProcess is about to crashUncaught exception, out of memory
errorOperation failed, needs attentionPayment processing failed, DB connection lost
warnSomething unexpected, but handledRate limit approaching, deprecated API called
infoBusiness events worth recordingUser signed up, order placed, deploy completed
debugTechnical details for debuggingSQL queries, cache hit/miss, request/response bodies
traceExtremely verbose, rarely enabledFunction entry/exit, variable values

Rule of thumb: If you'd want to see it in production logs during an incident, it's info. If you'd only want it when actively debugging, it's debug.


Prometheus: PromQL Deep Dive

Metric Types and When to Use Each

import { Counter, Histogram, Gauge, Summary, Registry } from 'prom-client';

const registry = new Registry();

// Counter: things that only go up
// Use for: requests, errors, bytes transferred
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status_code'] as const,
  registers: [registry],
});

// Histogram: distribution of values (request duration, response size)
// Use for: latency, size — anything you want percentiles of
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status_code'] as const,
  buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});

// Gauge: values that go up and down
// Use for: queue depth, active connections, temperature
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [registry],
});

// In your request handler:
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer({
    method: req.method,
    path: routePattern(req),  // "/users/:id" not "/users/12345"
  });

  res.on('finish', () => {
    const labels = { method: req.method, path: routePattern(req), status_code: String(res.statusCode) };
    httpRequestsTotal.inc(labels);
    end({ status_code: String(res.statusCode) });
    activeConnections.dec();
  });

  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});

PromQL: Queries You'll Actually Use

# Request rate (requests per second over last 5 minutes)
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
* 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P95 latency per endpoint
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
)

# Apdex score (satisfied < 0.5s, tolerating < 2s)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  + sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m]))
) / 2
/ sum(rate(http_request_duration_seconds_count[5m]))

# Top 5 slowest endpoints
topk(5,
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
  )
)

# Rate of change (is error rate increasing?)
deriv(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))[30m:1m]
)

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0

Recording Rules

Pre-compute expensive queries to speed up dashboards:

# prometheus/recording-rules.yml
groups:
  - name: http_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)

      - record: job:http_error_ratio:rate5m
        expr: |
          job:http_errors:rate5m / job:http_requests:rate5m

      - record: job:http_latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )

      - record: job:http_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )

Alerting Rules

# prometheus/alerting-rules.yml
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: job:http_error_ratio:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: job:http_latency:p95_5m > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency on {{ $labels.job }}"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} crash looping"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: critical

Grafana: Dashboard as Code

Provisioning with Docker Compose

# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/recording-rules.yml:/etc/prometheus/recording-rules.yml
      - ./prometheus/alerting-rules.yml:/etc/prometheus/alerting-rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - '9090:9090'

  grafana:
    image: grafana/grafana:10.3.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - '3001:3000'

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - '9093:9093'

  loki:
    image: grafana/loki:2.9.4
    ports:
      - '3100:3100'
    command: -config.file=/etc/loki/local-config.yaml

volumes:
  prometheus-data:
  grafana-data:

Grafana Datasource Provisioning

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: '15s'

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceId":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo

Dashboard Provisioning

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Alertmanager Routing

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts-warnings'

OpenTelemetry: Auto-Instrumentation

Node.js Setup

// tracing.ts — MUST be imported before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || 'api',
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || '0.0.0',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/metrics',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/healthz', '/ready', '/metrics'],
      },
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

Custom Spans

import { trace, SpanStatusCode, context } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('payment.process', async (span) => {
    try {
      span.setAttributes({
        'payment.order_id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Nested span for the Stripe API call
      const result = await tracer.startActiveSpan('payment.stripe.charge', async (stripeSpan) => {
        try {
          const charge = await stripe.charges.create({ amount, currency: 'usd' });
          stripeSpan.setAttributes({
            'stripe.charge_id': charge.id,
            'stripe.status': charge.status,
          });
          return charge;
        } catch (err) {
          stripeSpan.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
          stripeSpan.recordException(err);
          throw err;
        } finally {
          stripeSpan.end();
        }
      });

      span.setAttributes({ 'payment.status': 'success' });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}

Python Auto-Instrumentation

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install  # Auto-install instrumentations
# Run with auto-instrumentation
opentelemetry-instrument \
  --service_name my-service \
  --exporter_otlp_endpoint http://localhost:4318 \
  python app.py
# Custom spans in Python
from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("payment.process") as span:
        span.set_attribute("payment.order_id", order_id)
        span.set_attribute("payment.amount", amount)

        with tracer.start_as_current_span("payment.stripe.charge") as stripe_span:
            charge = stripe.Charge.create(amount=int(amount * 100), currency="usd")
            stripe_span.set_attribute("stripe.charge_id", charge.id)
            return charge

Distributed Tracing: Practical Patterns

Span Naming Conventions

# Good — consistent, searchable, useful for aggregation
http.request GET /api/users/:id
db.query SELECT users
cache.get user:profile:123
queue.publish order.created
payment.stripe.charge
email.send welcome

# Bad — too specific (high cardinality) or too vague
GET /api/users/12345          ← every user ID creates a unique span
processRequest                ← useless for filtering
doStuff                       ← really?

Sampling Strategies

import { TraceIdRatioBasedSampler, ParentBasedSampler } from '@opentelemetry/sdk-trace-base';

// Sample 10% of traces, but always sample if parent was sampled
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Custom sampler: always sample errors, sample 10% of success
import { SamplingDecision, SamplingResult } from '@opentelemetry/sdk-trace-base';

class ErrorAwareSampler {
  shouldSample(context, traceId, spanName, spanKind, attributes): SamplingResult {
    // Always sample errors
    if (attributes?.['http.status_code'] >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // 10% sampling for everything else
    const hash = parseInt(traceId.slice(-8), 16);
    if (hash % 100 < 10) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    return { decision: SamplingDecision.NOT_RECORD };
  }
}

Context Propagation Across Services

// Service A — outgoing HTTP request
import { context, propagation } from '@opentelemetry/api';

async function callServiceB() {
  const headers: Record<string, string> = {};
  // Inject trace context into outgoing headers
  propagation.inject(context.active(), headers);

  const response = await fetch('http://service-b/api/data', { headers });
  return response.json();
}

// Service B — incoming request (auto-instrumented by OTel HTTP instrumentation)
// The trace context is automatically extracted from incoming headers
// No manual code needed — just ensure both services use OTel

SLOs, SLIs, and Error Budgets

Defining SLIs

# SLI definitions
slis:
  availability:
    description: "Percentage of successful requests"
    query: |
      1 - (
        sum(rate(http_requests_total{status_code=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      )

  latency:
    description: "Percentage of requests faster than 500ms"
    query: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
      / sum(rate(http_request_duration_seconds_count[5m]))

  throughput:
    description: "Requests per second"
    query: sum(rate(http_requests_total[5m]))

SLO Targets and Error Budgets

SLO: 99.9% availability over 30 days
Error budget: 0.1% = 43.2 minutes of downtime per month

SLO: 99% of requests under 500ms
Error budget: 1% of requests can be slow

Burn Rate Alerts

# Multi-window, multi-burn-rate alerts (Google SRE book pattern)
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 14.4x burn rate over 1h (uses 2% of monthly budget in 1h)
      - alert: SLOErrorBudgetFastBurn
        expr: |
          (
            job:http_error_ratio:rate5m > (14.4 * 0.001)
            and
            job:http_error_ratio:rate1h > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn on {{ $labels.job }}"
          description: "At current rate, monthly error budget exhausted in ~2 days"

      # Slow burn: 3x burn rate over 6h
      - alert: SLOErrorBudgetSlowBurn
        expr: |
          (
            job:http_error_ratio:rate30m > (3 * 0.001)
            and
            job:http_error_ratio:rate6h > (3 * 0.001)
          )
        for: 15m
        labels:
          severity: warning

On-Call and Incident Response

Runbook Template

# Runbook: High Error Rate

## Severity: Critical

## Symptoms
- Error rate exceeds 5% for 5+ minutes
- PagerDuty alert: HighErrorRate

## First Response (< 5 minutes)
1. Check Grafana dashboard: https://grafana.internal/d/http-overview
2. Check if it's a single endpoint or service-wide
3. Check recent deployments: `kubectl rollout history deployment/app`
4. If a recent deploy correlates: `kubectl rollout undo deployment/app`

## Diagnosis
1. Check error logs in Loki:
   `{job="api"} |= "error" | json | status_code >= 500`
2. Check dependent services:
   - Database: `pg_isready -h db.internal`
   - Redis: `redis-cli -h redis.internal ping`
   - External APIs: Check status pages
3. Check resource usage:
   - CPU: `kubectl top pods -n production`
   - Memory: Same command
   - Connections: Check connection pool metrics

## Mitigation
- **Bad deploy:** Roll back immediately
- **Database overload:** Enable read replicas, kill long queries
- **External dependency:** Enable circuit breaker, serve degraded
- **Traffic spike:** Scale up pods: `kubectl scale deployment/app --replicas=10`

## Escalation
- If not resolved in 30 minutes: Page the team lead
- If data loss suspected: Page the CTO

PagerDuty Integration via Alertmanager

Already shown above in alertmanager config. Key decisions:

  • Critical alerts → PagerDuty (wakes people up)
  • Warning alerts → Slack (checked during business hours)
  • Info alerts → Dashboard only (no notification)

Post-Incident Template

# Incident Post-Mortem: [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P1/P2/P3
**Impact:** X% of users affected, $Y revenue impact

## Timeline
- HH:MM — Alert fired
- HH:MM — On-call acknowledged
- HH:MM — Root cause identified
- HH:MM — Mitigation applied
- HH:MM — Full resolution

## Root Cause
[What actually broke and why]

## What Went Well
- [Quick detection, good runbooks, etc.]

## What Went Wrong
- [Slow response, missing alerts, etc.]

## Action Items
- [ ] [Action] — Owner — Due Date
- [ ] [Action] — Owner — Due Date

## Lessons Learned
[What we'll do differently]

Datadog vs Self-Hosted: Decision Matrix

FactorDatadogSelf-hosted (Prometheus/Grafana/Loki)
Setup timeMinutesDays to weeks
Monthly cost (10 services)$2,000-5,000$200-500 (infra) + engineer time
Monthly cost (100 services)$20,000-50,000$2,000-5,000 + dedicated SRE
MaintenanceZeroSignificant (upgrades, scaling, backups)
CorrelationExcellent (built-in)Good (requires setup)
Custom dashboardsGreatGreat (Grafana)
APM/tracingBuilt-inOTel + Jaeger/Tempo
Log managementBuilt-inLoki or ELK
Learning curveLowMedium-High

Use Datadog when:

  • Team is < 20 engineers
  • No dedicated SRE/platform team
  • You need to move fast and budget allows it
  • Compliance requires vendor-managed infrastructure

Self-host when:

  • Cost is a primary concern at scale
  • You have SRE capacity
  • Data sovereignty requirements
  • You want full control over retention and queries

Hybrid approach: Use Datadog for APM/tracing, self-host Prometheus for metrics (it's just better for Kubernetes), use Loki for logs.


Quick Reference: Essential Queries

Prometheus

# Golden signals
rate(http_requests_total[5m])                          # Traffic
rate(http_requests_total{status_code=~"5.."}[5m])       # Errors
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # Latency
sum(active_connections)                                 # Saturation

Loki (LogQL)

# Error logs with JSON parsing
{job="api"} |= "error" | json | level="error" | line_format "{{.msg}}"

# Logs for a specific request
{job="api"} | json | requestId="abc-123"

# Count errors per minute
sum(count_over_time({job="api"} |= "error" [1m])) by (level)

# Top 10 error messages
topk(10, sum(count_over_time({job="api"} | json | level="error" [1h])) by (msg))

Checklist: Production Observability

  • Structured JSON logging with correlation IDs
  • Request ID propagated across all services
  • RED metrics exposed (Rate, Errors, Duration)
  • Prometheus scraping all services
  • Recording rules for expensive queries
  • Alerting rules with severity levels
  • Alertmanager routing (critical → PagerDuty, warning → Slack)
  • Grafana dashboards for each service
  • Distributed tracing with OpenTelemetry
  • Trace-to-log correlation configured
  • SLOs defined with error budget tracking
  • Burn rate alerts for SLO violations
  • Runbooks linked in alert annotations
  • On-call rotation configured
  • Post-incident process documented
  • Log retention policy (30d hot, 90d cold)
  • Dashboard provisioned as code (version controlled)
  • Sampling strategy for traces (don't sample 100% in production)