
This document records key architectural decisions, their rationale, and alternatives considered.

Decision Record

Decision: Use Python for backend and strategy logic

  • Rationale: Python is widely used in finance, has strong data science libraries, and is familiar to the developer.
  • Alternatives: Node.js, .NET

Decision: Rule engine

  • Rationale: A custom Python DSL offers maximum control, is versionable, and integrates with tests. Recommended when rules are domain-specific.
  • Alternatives: Drools / Durable Rules: existing engines, but Java-heavy or backed by immature Python ports.
  • Recommendation: Build a thin Python rule DSL with declarative rules stored as YAML/JSON and evaluated in a sandboxed environment, as sketched below.
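
A minimal sketch of what such a DSL could look like, assuming rules are stored as JSON and evaluated against a whitelist of operators rather than eval(); the rule schema, field names, and evaluate helper are illustrative, not an existing implementation:

```python
import json
import operator

# Hypothetical rule document as it might be stored in YAML/JSON.
RULES_JSON = """
[
  {"name": "momentum-entry",
   "when": [["rsi_14", "<", 30], ["volume_ratio", ">", 1.5]],
   "then": {"action": "BUY", "size_pct": 2}}
]
"""

# The "sandbox" is a closed vocabulary of whitelisted operators, not arbitrary code.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}

def evaluate(rules: list[dict], snapshot: dict) -> list[dict]:
    """Return the actions of every rule whose conditions all hold for a market snapshot."""
    fired = []
    for rule in rules:
        if all(OPS[op](snapshot[field], value) for field, op, value in rule["when"]):
            fired.append(rule["then"])
    return fired

rules = json.loads(RULES_JSON)
print(evaluate(rules, {"rsi_14": 27.4, "volume_ratio": 2.1}))  # -> [{'action': 'BUY', 'size_pct': 2}]
```

Because rules are plain data, they diff cleanly in Git and can be tested independently of the engine.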

Decision: Use C4 model for architecture documentation

  • Rationale: Provides clear, maintainable diagrams at multiple levels; suitable for a solo developer.
  • Alternatives: 4+1, TOGAF, ArchiMate

Database Abstraction Layer

Decision: Store historical data in Parquet files and/or NoSQL database

  • Rationale: Parquet is efficient for analytics; NoSQL provides schema flexibility. A short Parquet sketch follows this decision.
  • Alternatives: Only SQL, only NoSQL
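
A minimal sketch of the Parquet side, assuming pandas with the pyarrow (or fastparquet) engine installed; the file layout and column names are illustrative:

```python
import pandas as pd

# Write one day's bars to a per-symbol Parquet file (directory assumed to exist).
bars = pd.DataFrame({
    "ts": pd.date_range("2025-11-03 09:30", periods=3, freq="1min"),
    "symbol": ["EURUSD"] * 3,
    "close": [1.0721, 1.0724, 1.0719],
})
bars.to_parquet("data/eurusd/2025-11-03.parquet", index=False)

# Columnar reads stay cheap as history grows: load only the columns needed.
closes = pd.read_parquet("data/eurusd/2025-11-03.parquet", columns=["ts", "close"])
print(closes.head())
```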

Decision: Latency requirements

  • Rationale: Weekly strategies run on batch pipelines and cron jobs. Event trading needs streaming with millisecond-to-second decision loops. For ultra-low latency (<100 ms), this architecture may need colocated execution and native-language adapters (C++/Java); for news/event trading, second-scale latency is usually acceptable.

Decision: Use VitePress for strategy documentation and technical reference

  • Rationale: Vue-based static site generator that integrates well with GitHub workflows. Supports Markdown, code syntax highlighting, and custom Vue components for financial visualization and validation. The ability to embed interactive charts and backtesting results makes it well suited to strategy vetting by financial experts.
  • Alternatives:
    • Docusaurus (React-based, more complex)
    • MkDocs (Python-based, fewer financial plugins)
    • Custom Svelte documentation (would require more development effort)
  • Recommendation: Implement VitePress early in the project lifecycle to establish documentation patterns for strategies, including standardized metrics for backtest results and risk profiles.

Decision: Hosting approach (Cloudflare + Railway)

  • Status: Active (Nov 2025)
  • Rationale: Adopt Cloudflare Pages for static site hosting (docs, future UI) and Railway for backend Docker containers (API, Engine) to minimize operational overhead and focus on trading system development. Total cost: $5-20/month vs $50-200/month for Azure Container Apps.
  • Architecture:
    • Cloudflare Pages (FREE): Static documentation and future UI deployment with global CDN
    • Cloudflare Access (FREE for 50 users): Authentication layer using GitHub org membership, with migration path to Auth0 IdP integration
    • Railway ($5-20/month): Managed platform for backend Docker containers with Git-based deployment
    • MongoDB Atlas (FREE tier): 512MB database storage
  • Key advantages:
    • Minimal cognitive load: fully managed platforms, no infrastructure maintenance
    • Cost efficiency: ~90% cheaper than Azure while meeting current needs
    • Developer experience: Git-based deployments, good dashboards, simple configuration
    • Unified authentication: Cloudflare Access supports GitHub org now, Auth0 later
    • Focus preservation: less time on infrastructure = more time on trading logic
  • Decision factors:
    • Personal/early-stage project prioritizing development velocity over platform control
    • Trading system complexity requires focus on domain logic, not infrastructure operations
    • Cost minimization critical for solo/bootstrap phase
    • Team access needed for collaborators and future trading partners
  • Alternatives considered:
    • Azure Container Apps ($50-200/month): More expensive, enterprise-focused, overkill for current scale
    • Coolify + Contabo VPS ($6-13/month): Cheaper but adds operational burden (server management, updates, monitoring)
    • Vercel: Good DX but commercial usage restrictions on free tier
    • GitHub Pages: Requires Enterprise plan ($21/user/month) for private repository access
    • Netlify: Limited authentication options on free tier
  • Implementation phases:
    1. Documentation deployment to Cloudflare Pages with GitHub org access (current)
    2. Backend services (API, Engine) to Railway when production-ready
    3. Future UI to Cloudflare Pages alongside documentation
    4. Migration to Auth0 IdP in Cloudflare Access as team grows
  • References: docs/infra/hosting/cloudflare-pages.md (documentation hosting), docs/infra/devops/deployment.md (backend deployment)

Decision: Use Auth0 (Free) for user authentication

  • Rationale: Managed identity provider with quick setup, OIDC-compliant flows, hosted Universal Login, and SDKs for SPAs; the free tier suits a solo/trader project.
  • Alternatives: Self-host Keycloak, Firebase Auth, Azure AD B2C, custom OAuth/OIDC.
  • Implications:
    • SPA uses Authorization Code with PKCE to obtain ID/Access tokens.
    • Backend API validates RS256 JWTs against Auth0's JWKS; no application secret is needed for validation (see the sketch after this list).
    • Required environment variables (domain, audience, client ID) are non-secret in the SPA but treated as configuration; secrets go to a vault if any M2M flows are added later.
    • Free-tier constraints apply (e.g., usage limits, branding, no custom domains); revisit if usage grows.
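
A minimal sketch of the backend validation path, assuming the PyJWT library (pip install pyjwt[crypto]); the tenant domain and audience are placeholders:

```python
import jwt  # PyJWT
from jwt import PyJWKClient

AUTH0_DOMAIN = "your-tenant.us.auth0.com"  # placeholder tenant
API_AUDIENCE = "https://api.example.com"   # placeholder audience identifier

# The JWKS client fetches and caches Auth0's public signing keys; no app secret involved.
jwks_client = PyJWKClient(f"https://{AUTH0_DOMAIN}/.well-known/jwks.json")

def verify_token(token: str) -> dict:
    """Validate an RS256 access token and return its claims; raises jwt.PyJWTError on failure."""
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=API_AUDIENCE,
        issuer=f"https://{AUTH0_DOMAIN}/",
    )
```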

Decision: Docker Compose + Systemd for host orchestration

  • Rationale: Covers automatic restarts after container crashes, health checks, rolling updates, load balancing, service discovery, and scaling without the overhead of a full orchestrator.
  • Alternatives:
    • Portainer: GUI-based container management
    • K3s: Lightweight Kubernetes

Decision: OpenTelemetry + Grafana stack (Prometheus/Loki/Tempo)

  • Status: Active (Nov 2025). Supersedes the earlier SigNoz decision based on updated constraints and discussion.
  • Rationale: Use OpenTelemetry across services as the standard API/SDK with an optional Collector as the single ingestion point. Choose Grafana Free (Cloud Free tier) as the backend due to multi-user support on the free plan and good fit for a personal project. Core needs:
    • Metrics: Prometheus (service and host/container metrics)
    • Logs: Loki (structured JSON logs with trace correlation)
    • Traces: Tempo (W3C propagation end-to-end)
    • Destination-agnostic instrumentation via OTel preserves optionality if backends change later (see the instrumentation sketch after this decision)
  • Alternatives:
    • OpenTelemetry + SigNoz (previous choice; opinionated, solid UX)
    • ELK: Elasticsearch + Logstash + Kibana (powerful, heavier ops)
    • Axiom (nice DX; free tier user limit is a blocker)
  • Notes:
    • Add exporters/sidecars or route via OpenTelemetry Collector; sampling configured centrally where possible.
    • Secrets for backend endpoints/keys managed via GitHub Environments/Secrets.
    • See observability design for phased adoption, alerting, dashboards, and retention.
  • Reference: docs/devops/observability.md
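
As a concrete illustration of destination-agnostic instrumentation, a minimal tracing setup assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages; the Collector endpoint and names are placeholders:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Spans go to a local Collector; swapping Tempo for another backend later is a
# Collector configuration change, not an application code change.
provider = TracerProvider(resource=Resource.create({"service.name": "strategy-engine"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("strategy-engine")
with tracer.start_as_current_span("evaluate-signals") as span:
    span.set_attribute("strategy.name", "momentum-entry")  # correlates traces with Loki logs
```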

Decision: Prefect for workflow orchestration

  • Rationale: A general-purpose workflow orchestration tool, not limited to data pipelines. Provides a framework for:
    • Backtesting the rule engine
      • Parallelized strategy evaluation
    • DevOps and infra automation
    • ML/AI training on a schedule
    • ETL pipelines
    • Web scraping
    • Report generation
    • APIs and microservices
      • Orchestrated calls to external APIs
      • Conditional logic and branching
    • Business process automation
  • Alternatives:
    • Airflow: powerful but heavyweight, with significant operational overhead
    • Dagster: data-centric, treating data assets as first-class citizens
  • Which to choose?
    • Choose Prefect if you need a workflow orchestrator with a strong focus on developer experience, dynamic workflows, and fast iteration, especially for tasks like ML pipelines where data dependencies are not fixed beforehand.
    • Choose Dagster if you are building a mature data platform and require strong data lineage, asset management, and observability; it is well suited when data quality and governance are primary concerns.
  • Note: Observability: built-in logging, retries, and state tracking. A minimal flow sketch follows this decision.
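
A minimal sketch of the backtesting use case, assuming Prefect 2.x (pip install prefect); task bodies and symbols are illustrative:

```python
from prefect import flow, task

@task(retries=2, retry_delay_seconds=30)  # built-in retries and state tracking
def backtest_symbol(symbol: str) -> float:
    # Placeholder for running the rule engine against historical data.
    return 0.0  # e.g., the strategy's Sharpe ratio

@flow(log_prints=True)
def backtest_all(symbols: list[str]) -> None:
    # .map fans the task out so strategy evaluation can run in parallel.
    futures = backtest_symbol.map(symbols)
    for symbol, sharpe in zip(symbols, [f.result() for f in futures]):
        print(f"{symbol}: sharpe={sharpe:.2f}")

if __name__ == "__main__":
    backtest_all(["EURUSD", "GBPUSD"])
```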

Decision: Microsoft Semantic Kernel for agent AI framework

  • Rationale: Microsoft's open-source agentic AI framework provides enterprise-grade orchestration for LLM-powered agents with strong integration capabilities. Key advantages:
    • Native integration with Azure OpenAI Service and other LLM providers (OpenAI, Anthropic, local models)
    • Built-in prompt engineering and template management for consistent agent behavior
    • Plugin architecture for extending agent capabilities with custom tools and functions
    • Memory and context management for stateful agent interactions
    • Production-ready with strong typing (C#/.NET and Python SDKs) and observability hooks
    • Active development and Microsoft backing ensures long-term support
  • Use Cases:
    • Strategy validation agents: LLM-assisted review of trading rules and backtest results
    • Market intelligence agents: Automated news/sentiment analysis with contextual understanding
    • Documentation agents: Automated generation of strategy documentation from code
    • Trading assistant: Natural language interface for querying positions, metrics, and system state
  • Alternatives:
    • LangChain (Python-first, ecosystem-rich but less enterprise-focused)
    • AutoGen (Microsoft Research, multi-agent focus, less production-ready)
    • Custom LLM integration (full control but higher development overhead)
    • Haystack (NLP-focused, less suitable for agentic workflows)
  • Integration Points:
    • Strategy engine: AI agents provide second-opinion validation on generated signals
    • Observability: Agents analyze metrics/logs and suggest optimizations
    • Documentation: Auto-generate strategy docs from code annotations and backtest results
    • Risk management: AI-powered anomaly detection and circuit breaker recommendations
  • Note: Initial integration will focus on non-critical workflows (documentation, analysis) before considering any agent involvement in live trading decisions. All agent actions are subject to human review until proven reliable; a minimal sketch follows this decision.
  • Reference: Future implementation tracked in docs/05-implementation.md under AI/ML integration phase
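
A minimal sketch of a non-critical use case (backtest review), assuming the semantic-kernel 1.x Python SDK; the SDK's surface has shifted between releases, so treat the imports and method names as indicative, and the model ID and key as placeholders:

```python
import asyncio

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion

async def main() -> None:
    kernel = Kernel()
    # Register a chat-completion service; model and key are placeholders.
    kernel.add_service(OpenAIChatCompletion(ai_model_id="gpt-4o-mini", api_key="sk-..."))

    # Non-critical workflow: an LLM second opinion on a backtest summary.
    review = await kernel.invoke_prompt(
        "Review this backtest summary and flag signs of overfitting: "
        "Sharpe 3.8, max drawdown 2%, 14 trades over 5 years."
    )
    print(review)  # advisory output only; a human reviews before any action

asyncio.run(main())
```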

Outstanding Decisions

These are candidate or partially-considered decisions captured out-of-order to avoid losing context. Promote any of these to a full decision record (with rationale, alternatives, recommendation) once you commit.

  • Data retention fine-tuning for observability backends (granularity vs. cost) — refine after first month of metrics volume data.
  • Strategy performance attribution model standardization (naming + schema for PnL breakdown) — needed before multi-strategy backtests.
  • Unified configuration format (consider pydantic settings vs. environment variable layering) — impacts deployment reproducibility.
  • Secret rotation automation (GitHub Actions vs. external tool) — decide prior to first credential expiry.
  • ML feature store approach (ad hoc Parquet vs. lightweight feature store library) — revisit when first ML prototype begins.
  • Prefect deployment topology (single agent vs. multiple work pools) — decide when more than one workflow category is active.
  • API versioning strategy (URI vs. header, when to introduce v1) — revisit before first breaking change.
  • Alert notification channel expansion (email/webhook → chat integration) — after initial alert noise is characterized.
  • Risk controls formalization (circuit breaker thresholds + global kill switch implementation detail) — before first live trade.

Add new backlog items above; keep the list curated (remove or promote items once resolved) to prevent drift.
