
Observability Design

This document defines the observability foundation for TradePsykl across logging, metrics, tracing, dashboards, alerting, and operational runbooks. It references file paths and components per the documentation style policy, including short illustrative snippets only where they clarify a convention.

Implementation Status

Phase 1 Complete ✅ (Local Instrumentation + Grafana Cloud Integration):

  • ✅ API and Engine services instrumented with OpenTelemetry SDK
  • ✅ JSON structured logging with trace/span correlation fields (pythonjsonlogger)
  • ✅ FastAPI auto-instrumentation active
  • ✅ Prometheus /metrics endpoints serving runtime metrics (API port 8000, Engine port 8080)
  • ✅ Engine HTTP server with /health endpoint for container healthchecks
  • ✅ Prometheus service scraping API and Engine metrics every 30s
  • ✅ Remote Write to Grafana Cloud Mimir working (metrics visible in Grafana Cloud)
  • ✅ OTLP traces flowing to Grafana Cloud Tempo (API + Engine)
  • ✅ Resource attributes: service.name, service.version, deployment.environment, project
  • ✅ Grafana dashboards created and validated: api-overview.json, engine-metrics.json
  • ✅ Datasource configured: grafanacloud-t4apps-prom
  • ✅ Environment filtering working: deployment_environment="dev"
  • ✅ Tests passing (3/3): metrics endpoint, root endpoint, span creation
  • ✅ Per-service venvs for editor import resolution (multi-root workspace setup)
  • ✅ .env configured with Grafana Cloud credentials

Phase 1 Completion Date: November 9, 2025

Phase 2 Complete ✅ (Log Shipping to Grafana Cloud Loki):

  • ✅ Promtail service deployed with Docker socket access for log collection
  • ✅ Docker service discovery scraping logs from API, Engine, Data, and other containers
  • ✅ JSON log parsing with field extraction (timestamp, level, logger, message, trace_id, span_id)
  • ✅ Logs shipped to Grafana Cloud Loki with basic auth
  • ✅ Container and service labels added to log streams (container, service_name, project, deployment_environment)
  • ✅ Deployment environment filtering for log queries
  • ✅ Log/trace correlation via trace_id and span_id fields
  • ✅ Promtail configuration with runtime environment variable substitution
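
The shipped configuration follows this general shape (a sketch only; the Loki endpoint and credential variable names are illustrative, and the real file relies on runtime environment-variable substitution as noted above):

```yaml
clients:
  - url: ${LOKI_URL}            # Grafana Cloud Loki push endpoint (illustrative name)
    basic_auth:
      username: ${LOKI_USER}
      password: ${LOKI_API_KEY}

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock   # requires Docker socket access
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: container
    pipeline_stages:
      - json:                    # extract fields from the structured JSON logs
          expressions:
            level: levelname
            trace_id: trace_id
            span_id: span_id
```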

Phase 2 Completion Date: November 9, 2025

Known Limitations:

  • Alert rules not yet defined (future enhancement)
  • Only process metrics exposed (no custom application metrics yet)
  • MongoDB metrics not yet exposed

Next Steps (Incremental as Features Are Built):

  • [ ] ETL Pipeline: Add metrics (records_processed, etl_duration_seconds, etl_errors_total) and traces
  • [ ] Strategy Engine: Add metrics (signals_generated, rules_evaluated, strategy_latency_seconds)
  • [ ] Order Flow: Add metrics (orders_placed, orders_filled, order_errors, slippage)
  • [ ] Data Quality: Add metrics (schema_violations, data_staleness_seconds, missing_data_count)
  • [ ] MongoDB: Add mongodb_exporter for database metrics
  • [ ] Alert Rules: Define 2-3 critical alerts in Grafana Cloud as SLO thresholds become clear
  • [ ] Runbooks: Create docs/devops/runbooks.md with incident response procedures
  • [ ] Log Dashboards: Add "Recent Errors" panels to service dashboards for operational awareness (logs primarily viewed in Explore)

Later (Advanced Infrastructure):

  • [ ] OpenTelemetry Collector for unified telemetry pipeline
  • [ ] Tail-based sampling for trace volume management
  • [ ] Derived metrics and aggregations for long-term analysis

Example Log Output:

```json
{"asctime": "2025-11-08 13:02:39,691", "levelname": "INFO", "name": "uvicorn.access", "message": "127.0.0.1:50466 - \"GET / HTTP/1.1\" 200", "trace_id": null, "span_id": null}
```

Note: trace_id/span_id populate when Grafana OTLP endpoint is configured and requests flow through traced code.


Scope & Goals

  • Unified visibility for weekly batch strategies and real-time event triggers.
  • Facilitate rapid incident detection (data feed failures, strategy anomalies, broker errors).
  • Provide quantitative performance insights (latency to ingest → signal → order, strategy PnL attribution, ETL durations).
  • Establish a baseline now that can evolve toward ML feature quality monitoring and advanced anomaly detection later.

Principles

  • Minimal first: instrument core flows early; expand breadth iteratively.
  • Standards-based: OpenTelemetry (OTel) APIs and semantic conventions.
  • Destination-agnostic: collection/export pipeline can point to local, self-hosted, or managed backends without code changes.
  • Cost awareness: retain high-cardinality data only where actionable; aggregate early for long-term metrics.
  • Separation of concerns: business events (strategy signals) vs. infra telemetry (CPU, memory) vs. data quality metrics.

Core Pillars

Monitoring

  • System resources to monitor (cloud and bare metal):
    • CPU utilization and saturation.
    • Memory usage and pressure.
    • Network throughput and errors (services communicate across boundaries).
  • Agents/collectors:
    • Host/container metrics via an OTel Collector with hostmetrics/docker receivers, or node_exporter + cAdvisor when Prometheus is preferred directly.
    • Keep the collector optional in dev; enable by profile in docker-compose.yml when introduced.

Logging

  • Structure: JSON logs with explicit fields: ts, level, service, trace_id, span_id, event, strategy_id, symbol.
  • Levels: INFO (flow), WARN (degraded), ERROR (failed action), DEBUG (diagnostics; disabled in prod). Avoid print statements.
  • Categories emphasized: errors, info, workflow interactions, requests/events.
  • Correlation: inject trace/span IDs via middleware in the API (src/api/) and engine (src/engine/).
  • Rotation & retention: local dev ephemeral; production aggregated (central store) with ~14 days full logs, 90 days sampled.
  • Redaction: exclude secrets, auth tokens, PII; sanitize external API payloads before logging.
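
A minimal sketch of this log structure, using only the standard library (the services use pythonjsonlogger, and in practice middleware injects the real trace/span IDs from the active OTel span; the hardcoded values here are placeholders):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields listed above."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "api",  # illustrative; set from service config
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "event": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("tradepsykl.api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# trace_id/span_id arrive via `extra`; middleware supplies the real IDs.
logger.info("signal_generated", extra={"trace_id": "abc123", "span_id": "def456"})
```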

Metrics

  • Categories:
    • System: CPU, memory, event loop lag (api/engine services).
    • Pipeline: ingest duration, transform duration, load latency (data ingestion components).
    • Strategy: signals generated, signals suppressed, orders placed, rejected orders, PnL (daily/weekly), position exposure.
    • Reliability: error counts per external provider (e.g., data API, broker), retry counts.
  • Types: counters (events), gauges (current values), histograms (latency distributions). Avoid unbounded label sets (e.g., raw order IDs).
  • Naming: tradepsykl_<domain>_<metric> (e.g., tradepsykl_strategy_signals_total).
  • Label hygiene: limit labels to service, strategy, symbol (when needed), env.
  • SLO candidates (future): ingest freshness < 5m; order submission success ≥ 99%; strategy engine cycle latency P95 < 2s.
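
The naming and label-hygiene rules can be enforced with a small check at metric-registration time; the helper below is an illustrative sketch, not part of the current codebase:

```python
import re

# Allowed label keys per the hygiene rule above; bounding the label set
# bounds time-series cardinality.
ALLOWED_LABELS = {"service", "strategy", "symbol", "env"}
NAME_PATTERN = re.compile(r"^tradepsykl_[a-z0-9]+(_[a-z0-9]+)+$")


def validate_metric(name: str, labels: set[str]) -> None:
    """Reject names outside the tradepsykl_<domain>_<metric> scheme
    and labels outside the approved set."""
    if not NAME_PATTERN.match(name):
        raise ValueError(f"bad metric name: {name}")
    extra = labels - ALLOWED_LABELS
    if extra:
        raise ValueError(f"disallowed labels: {sorted(extra)}")


validate_metric("tradepsykl_strategy_signals_total", {"service", "strategy"})
```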

Tracing

  • Span coverage: HTTP requests (API), scheduled batch runs, rule evaluation, broker order submission, ETL flow and workflow orchestration sub-steps.
  • Propagation: W3C Trace Context headers through API and internal async tasks.
  • Sampling: 100% in dev for validation; production base sample (e.g., 10%) plus tail-based sampling for error spans.
  • Span attributes: include strategy_id, symbol, order_type, external_provider where applicable.
  • Use traces to reconstruct signal life cycle from market data ingestion to executed order.
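
W3C propagation reduces to a single traceparent header; the sketch below builds and parses one by hand purely to show the format (real services would use the OTel propagator API rather than hand-rolling this):

```python
import secrets


def make_traceparent() -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)    # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}


print(parse_traceparent(make_traceparent()))
```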

Dashboards

  • Initial dashboards (one per domain): ingestion health, strategy performance, order flow reliability.
  • Visualizations: latency histograms, error ratios, daily PnL, open positions, retry spikes, data freshness gauge.
  • Service-overview dashboard: top-level golden signals (error rate, request latency P95, order success, engine loop time).

Alerting

  • Tiering:
    • Critical: broker order failures spike, ingest stalled > 10m, strategy producing anomalous volume (possible loop).
    • Warning: latency degradation (P95 > threshold), elevated retries for data provider.
    • Informational: successful deploy, strategy activation/deactivation events.
  • Channels: initially email/webhook; later chat integration.
  • Noise control: multi-condition rules (e.g., sustained error rate over 5 min) to avoid flapping.
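
A sustained-rate rule of that kind might look like the following Prometheus-style sketch (metric names and the 5% threshold are illustrative, since custom metrics and alert rules do not exist yet):

```yaml
groups:
  - name: tradepsykl-critical
    rules:
      - alert: OrderFailureSpike
        # Fire only if the failure ratio stays above 5% for 5 minutes,
        # avoiding flapping on short error bursts.
        expr: |
          sum(rate(tradepsykl_order_errors_total[5m]))
            / sum(rate(tradepsykl_orders_placed_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Broker order failures above 5% for 5 minutes"
```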

Runbooks

  • Each alert links to a runbook section: diagnostic steps, log queries, rollback guidance.
  • Runbook locations: deployment procedures live in docs/devops/deployment.md; observability runbooks are appended here until they grow large enough to split into docs/devops/runbooks.md.

Architecture Overview

  • Instrumentation libraries embedded in services (src/api/, src/engine/).
  • OTel SDK config initialized at service startup (resource attributes: service.name, deployment.environment).
  • Export pipeline:
    • Local dev: in-process console/simple file for logs, Prometheus endpoint for metrics, OTLP -> local collector (optional).
    • CI ephemeral: minimal metrics/logging to aid debugging; traces optional.
    • Prod: OTLP to collector; collector routes logs/metrics/traces to chosen backends.
  • Collector (future): central otel-collector container added to docker-compose.yml for standardized exporters.
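
Because export configuration stays in the environment, switching destinations requires no code changes. For the direct-export case, the standard OpenTelemetry SDK environment variables suffice (values illustrative):

```shell
# Standard OTel SDK environment variables (values illustrative).
export OTEL_EXPORTER_OTLP_ENDPOINT="${GRAFANA_OTLP_ENDPOINT}"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic ${GRAFANA_API_KEY}"
export OTEL_RESOURCE_ATTRIBUTES="service.name=api,service.version=0.1.0,deployment.environment=dev"
```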

Tooling & Backends (Phased)

Decision (discussion summary):

  • Primary stack: Grafana Free (Grafana Cloud Free tier) using Prometheus (metrics), Loki (logs), Tempo (traces). Selected for multi-user allowance; Axiom was considered but excluded due to the free tier 1-user limit.
  • Keep implementations backend-agnostic via OTel where possible to preserve optionality.

Phase 1 (MVP):

  • Logging: structured stdout (centralized via container runtime); include environment labels in log context.
  • Metrics: expose Prometheus /metrics (api, engine) with labels service, strategy (if applicable), and deployment.environment.
  • Tracing: emit OTLP traces directly to Grafana Cloud (Tempo) using service resource attributes (service.name, deployment.environment).
  • Dashboards: create initial Grafana dashboards filtered by deployment.environment variable.

Phase 2:

  • Collector: introduce OpenTelemetry Collector to consolidate exporters and apply sampling policies. Configure routes for:
    • Metrics: OTLP metrics to Grafana Cloud metrics (or Prometheus remote_write if preferred).
    • Logs: OTel Collector logs pipeline or Promtail to Loki.
    • Traces: OTLP to Tempo (Grafana Cloud).
  • Alerting: Grafana alert rules on Prometheus metrics (scope by deployment.environment).
  • Derived metrics: PnL & exposure aggregated daily for long-term trend view.
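
Those routes would map onto a Collector configuration of roughly this shape (a sketch, not a tested config; the single otlphttp exporter and endpoint variable are illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: ${GRAFANA_OTLP_ENDPOINT}
    headers:
      Authorization: Basic ${GRAFANA_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```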

Phase 3:

  • Advanced analytics: anomaly detection for strategy signal volume.
  • Trace-metrics correlation (e.g., trace latency gating the order-placement SLO).
  • ML feature quality metrics (drift, completeness).

Environments

  • Dev (local Docker): full fidelity, high sampling; ephemeral retention. Used for day-to-day development and validation. See shared definitions in docs/devops/environments.md.
  • CI (ephemeral): smoke metrics only; reduce noise; used for performance regression checks. Not a persistent environment.
  • Prod: tuned sampling; alert tiers active; retention policies enforced.

Data Retention & Storage

  • Logs: 14 days full; 90 days sampled (error/warn only) — adjustable pending volume.
  • Metrics: high-resolution (≤30s) for 7 days; downsampled (5m) for 60 days; daily aggregates for 1 year.
  • Traces: unsampled errors for 30 days; sampled normal traces for 7 days.
  • Strategy performance snapshots persisted separately for historical analysis (outside telemetry) under the analytics storage layer.

Secrets & Keys

  • Manage sensitive values with GitHub Environments and Secrets (shared guidance: docs/devops/secrets.md); integrate via workflows in .github/workflows/ and inject into runtime configuration.
  • Grafana API keys and endpoints stored in environment-specific secrets; services read them via environment variables.
  • Do not embed credentials in the repository; see docs/devops/deployment.md for environment and access patterns.

Access & Governance

  • Principle of least privilege: dashboards read-only for most users; editing restricted.
  • Sensitive fields removed at source; avoid shipping raw API payloads to tracing/log backends.
  • Versioned configuration: observability settings tracked (future) in dedicated config files (referenced, not embedded in docs).

Implementation Phases & Tasks

Follow these steps in order. Phase 1 uses direct OTLP to Grafana Cloud; no local observability containers are required.

Phase 0 — Prerequisite

  1. Create Grafana Cloud Free stack and service account; generate API key with scopes (metrics:write, logs:write, traces:write).
  2. Record endpoints: GRAFANA_OTLP_ENDPOINT (OTLP gateway), optional GRAFANA_LOKI_ENDPOINT (if pushing logs directly).
  3. Store secrets:
  • GitHub Environment (prod): GRAFANA_OTLP_ENDPOINT, GRAFANA_API_KEY (and optional GRAFANA_LOKI_ENDPOINT).
  • Local .env (git-ignored): same names for dev.
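
The local file then contains only the names already listed (values redacted; the endpoint shape is illustrative):

```shell
# .env (git-ignored) — names match the GitHub Environment secrets.
GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-<region>.grafana.net/otlp
GRAFANA_API_KEY=<redacted>
GRAFANA_LOKI_ENDPOINT=<optional, only if pushing logs directly>
DEPLOYMENT_ENVIRONMENT=dev
```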

Phase 1 — Instrumentation and direct export

  1. Add resource attributes in API and Engine startup: service.name, service.version, deployment.environment.
  2. Implement structured JSON logging with optional trace correlation (trace_id, span_id when available).
  3. Expose Prometheus /metrics (api, engine) including an environment label (e.g., deployment_environment).
  4. Configure OTLP exporters (metrics, traces) to use GRAFANA_OTLP_ENDPOINT with GRAFANA_API_KEY.
  5. Create initial Grafana dashboards filtered by deployment.environment.
  6. Define initial alert thresholds (error rate, ingest freshness) and document them here.
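
For step 4, Grafana Cloud's OTLP gateway expects HTTP basic auth built from the stack's instance ID and the API key; a stdlib sketch of assembling the exporter headers (the GRAFANA_INSTANCE_ID variable name is hypothetical):

```python
import base64
import os


def otlp_headers(instance_id: str, api_key: str) -> dict:
    """Build the Authorization header for the Grafana Cloud OTLP gateway
    (HTTP basic auth: base64 of "<instance_id>:<api_key>")."""
    token = base64.b64encode(f"{instance_id}:{api_key}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


# In the services these would come from the environment, e.g.:
# headers = otlp_headers(os.environ["GRAFANA_INSTANCE_ID"],
#                        os.environ["GRAFANA_API_KEY"])
print(otlp_headers("123456", "example-key"))
```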

Phase 2 — Consolidation and logs pipeline

  1. Add Promtail or an OTel Collector logs pipeline to ship structured logs to Loki.
  2. Introduce the OpenTelemetry Collector to centralize exporters, apply sampling policies, and route signals.
  3. Add derived daily PnL/exposure metrics publishing job.

Phase 3 — Enhancements

  1. Expand dashboards, add SLOs, and refine alerts; consider anomaly detection for strategy signal volume.

Future Considerations

  • Self-service instrumentation guidelines (developer checklist).
  • Synthetic checks (scheduled strategies verifying data provider uptime).
  • SLA/SLO formalization and quarterly review.
  • Integration with deployment workflows to annotate traces on release.

References

  • CI/CD: docs/devops/ci-cd.md
  • Deployment: docs/devops/deployment.md
  • Docker & local stack: docs/devops/docker.md
  • Registry: docs/devops/registry.md
  • Secrets: docs/devops/secrets.md (standard names and scopes)
  • Secrets (names): GRAFANA_OTLP_ENDPOINT, GRAFANA_API_KEY, GRAFANA_LOKI_ENDPOINT (if direct), future GRAFANA_TEMPO_ENDPOINT (if separated); environment variable DEPLOYMENT_ENVIRONMENT used across services.

Notes

Stack choices reflect the internal discussion: adopt Grafana Free (Prometheus/Loki/Tempo) with OpenTelemetry across services; Axiom excluded due to free-tier user limits. Sampling/retention values remain adjustable as real traffic patterns emerge.
