
Generation Got Cheap, Trust Didn’t: A Multi-Agent Trading System with Guardrails and Traceability

Intro

Getting an LLM to produce code is cheap now; verifying what it produced is the engineering work. Production agentic systems are increasingly bottlenecked by verification, not generation. The design effort moves to where the trust gap is: governance at every step, output validation that catches the model’s hallucinations, and per-run audit trails that prove what the agent actually saw and emitted without relying on the agent’s self-reporting.

This system is a concrete example: four AI traders that independently research stocks and trade over real market data. Each trader runs a two-phase cycle — research, then decision — before execution. Guardrails at every step, traceability across every run.

You can see the system running live at agentic-trading.vkontech.com, and the full source code is available on the GitHub repository.

The system runs continuously with all four traders making independent decisions. Here is what the traders dashboard looks like:

Each trader starts with $100K in virtual capital and trades according to its investment philosophy — Warren models Warren Buffett’s value investing, George follows George Soros’s contrarian macro approach, Ray applies Ray Dalio’s risk parity principles, and Cathie pursues Cathie Wood’s growth/disruptive tech strategy. The performance chart shows how these strategies diverge: Cathie shows high volatility, Warren and George show steadier paths, Ray emphasizes diversification. The traders run continuously against real market data, making actual decisions — the capital is virtual, but the decisions and performance are real.

The Trading Runs view shows the activity stream:

Each row is one trading cycle: George buying GOLD, Warren holding, Cathie buying CRWD, Ray selling PG.

Output validation guardrails enforce constraints at each phase boundary, structured schemas at every hand-off keep the data the agent emits parseable downstream, and per-phase audit captures the actual prompts sent to the model, the web sources it consulted, the tool calls it made, and the reasoning fields it returned. Not what the agent says it did — what it actually did.

Here’s what some of this looks like in practice. The Research Phase view shows the agent’s complete research process and reasoning:

The Market Analyst — running Cathie’s Growth Innovation strategy — identified four high-conviction candidates: Palantir (PLTR), UiPath (PATH), CRISPR Therapeutics (CRSP), and Coinbase (COIN). The research notes walk through the disruptive-innovation thesis for each — enterprise AI orchestration, agentic-AI automation, gene editing, crypto exchange infrastructure. Web sources are clickable links to the original articles. The tool calls at the bottom show the exact Brave Search queries and per-symbol price lookups. If the agent read it or searched for it, it’s captured.

The Decision Phase view shows how the agent made the actual trading decision:

The agent decided to BUY 665 shares of UiPath (PATH) — chosen over PLTR, CRSP, and COIN for durable ARR expansion from RPA→agentic-AI workflow monetization, lower valuation risk than the high-multiple Palantir, less binary downside than CRISPR, and lower regulatory exposure than Coinbase. Below the decision: research context (a head-to-head comparison of the four candidates), portfolio context (cash $29,720.50, positions 9/10; this purchase brings the book to 10/10 within the $7,430.12 max-position cap), and historical context (17 PATH trades in the last 90 days — 9 buys, 8 sells, net accumulating). Every input that informed the decision is visible.

The Execution Phase view captures the trade that actually went out:

The decision became a trade: BUY 665 PATH at $11.17, total $7,428.05, status COMPLETED, trade ID 1025. The audit row captures action (BUY/SELL), symbol, share count, price per share, total cost, completion status, and the trade ID that links back to the underlying transaction record.
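The figures in this run can be cross-checked with a few lines of arithmetic. The sketch below assumes the share count is the largest whole number of shares that fits under both the position cap and the available cash; whether the system sizes positions this way is an assumption, and `max_whole_shares` is a hypothetical helper, not a function from the repository.

```python
from decimal import Decimal


def max_whole_shares(price: Decimal, position_cap: Decimal, cash: Decimal) -> int:
    """Largest whole-share count whose cost fits both the position cap and the cash on hand."""
    budget = min(position_cap, cash)
    return int(budget // price)


# Figures quoted in the run above.
price = Decimal("11.17")
position_cap = Decimal("7430.12")
cash = Decimal("29720.50")

shares = max_whole_shares(price, position_cap, cash)
total = shares * price

print(shares, total)  # 665 7428.05
```

One more share would cost $7,439.22, which is over the cap, so 665 shares is the most the quoted cap allows at that price.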

The trading runs table links to these detailed audit views so you can drill into any run.

Note: This is not investment advice. The system trades with virtual capital. Some details may change as the system evolves — check the repo for the current state.


Meet the Agents

The traders are named after well-known investors whose trading philosophies inspire their strategies:

| Agent | Named After | Style | Focus |
| --- | --- | --- | --- |
| Warren | Warren Buffett | Value | Undervalued companies, strong fundamentals |
| George | George Soros | Contrarian | Macro trends, market consensus gaps |
| Ray | Ray Dalio | Risk Parity | Diversification across uncorrelated assets |
| Cathie | Cathie Wood | Growth | Disruptive technology, high-growth sectors |

Every trader runs the same two-phase cycle:

Every few hours, each trader runs a trading cycle as a two-agent pipeline. The researcher scans the market for candidates — there are no predefined stock lists — using MCP servers (Brave Search, Fetch) for web research and Finnhub for real-time prices. The decision maker then evaluates those candidates against the trader’s current holdings and recent trading history, and picks BUY / SELL / HOLD with reasoning. Both phases see the same portfolio context and the same price data, so neither runs blind to what the trader already owns.


Architecture

Four layers, each with one job: the agents drive the trading cycle, the backend mediates state behind a REST API, the frontend reads through that same API, the database persists.

| Layer | Technology | Purpose |
| --- | --- | --- |
| Agents | Python, OpenAI Agents SDK | Trading cycle orchestration and LLM agents |
| MCP | Brave Search, Fetch | Web research, article retrieval |
| Market data | Finnhub.io | Real-time price quotes |
| Backend | Java, Spring Boot | Accounts, trade execution, run/audit persistence, prompt composition |
| Frontend | React, TypeScript, Vite | Dashboard and audit views over REST |
| Database | PostgreSQL (3 schemas) | Persistence and analytics |

A bit more on what each layer does at the boundary:

Python Agents — The orchestrator owns sequencing across the trading cycle: it calls the backend to create runs, broadcast status, execute trades, and finalize the run with a full audit payload. Both agents share the MCP toolset, so the Decision Maker can pull additional context beyond the Analyst’s research notes if it needs to.

Spring Boot Backend — REST surface in front of everything. A price cache fronts Finnhub quotes so per-cycle agent calls don’t hit the rate limit, and prompts are composed from base templates plus per-trader personality files.
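To make the price-cache idea concrete, here is a minimal time-to-live cache. The real backend is Java, so this Python version only illustrates the pattern and is not a port of the actual code. The class name, the 60-second TTL, and the fake quote function are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class PriceCache:
    """Serve a recent quote from memory; call the upstream fetcher only when it is stale."""

    def __init__(
        self,
        fetch_quote: Callable[[str], float],
        ttl_seconds: float = 60.0,  # hypothetical TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_quote = fetch_quote
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, float]] = {}  # symbol -> (price, fetched_at)

    def get(self, symbol: str) -> float:
        now = self._clock()
        entry = self._entries.get(symbol)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]
        price = self._fetch_quote(symbol)
        self._entries[symbol] = (price, now)
        return price


# Demo with a fake upstream and a fake clock so the behaviour is deterministic.
upstream_calls = []


def fake_finnhub_quote(symbol: str) -> float:
    upstream_calls.append(symbol)
    return 11.17


now = [0.0]
cache = PriceCache(fake_finnhub_quote, ttl_seconds=60.0, clock=lambda: now[0])

cache.get("PATH")  # miss: calls upstream
cache.get("PATH")  # hit: served from memory
now[0] = 61.0
cache.get("PATH")  # stale: calls upstream again

print(len(upstream_calls))  # 2
```

Three lookups cost two upstream calls. With four traders asking about overlapping symbols in the same cycle, a cache like this keeps the request count under a provider's rate limit.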

React App — Reads through the REST API. The Run Detail views surface every audit field (research, decision, execution, sources, tool calls, prompts, metrics).

PostgreSQL — Three schemas hold everything that needs to survive a restart: trader metadata, the run lifecycle and its audit trail, accounts and trades, portfolio snapshots, and a market-data quote cache.


Design Decisions

The OpenAI Agents SDK documents two ways to orchestrate multi-agent flows: LLM-driven (agents-as-tools, handoffs) and code-driven (chaining agents by passing structured output). I picked code-driven for determinism — phases run in a fixed sequence. The official OpenAI Cookbook “Multi-Agent Portfolio Collaboration” takes the LLM-driven road instead — a hub-and-spoke Portfolio Manager calling specialists as tools — great for open-ended research, less so for a loop that executes trades.

On the framework: I chose the OpenAI Agents SDK over CrewAI, AutoGen, and LangGraph because it’s a lightweight primitives library, not a full orchestration framework, which fits a system where explicit Python code already owns the orchestration. Structured output, tool support, output validation hooks, and MCP integration are all first-class, without extra abstractions on top.

Two-Agent Pipeline

The hand-off between phases is structured output only — no free conversation. The orchestrator pre-fetches the shared facts (holdings, recent activity, prices) and pushes them into both prompts.

What the Decision Maker doesn’t inherit is the Analyst’s reasoning trace. That’s the “fresh, uncontaminated mind” subagent pattern for reasoning state: shared facts are consolidated, private deliberation is firewalled. Research-time commitment to a particular name doesn’t carry forward as pressure to act on it.
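A small sketch of that boundary, using plain dataclasses in place of the SDK's structured-output types. All type names, field names, and sample values here are illustrative assumptions. Both prompts are built from the same `SharedFacts`, but the decision prompt builder only accepts the candidates, so the analyst's reasoning trace has no path across.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SharedFacts:
    holdings: Dict[str, int]
    recent_activity: Tuple[str, ...]
    prices: Dict[str, float]


@dataclass(frozen=True)
class Candidate:
    symbol: str
    thesis: str


@dataclass(frozen=True)
class AnalystResult:
    candidates: Tuple[Candidate, ...]  # structured output: crosses the boundary
    reasoning_trace: Tuple[str, ...]   # private deliberation: stays behind


def render_facts(facts: SharedFacts) -> str:
    lines = ["Holdings: " + ", ".join(f"{s} x{n}" for s, n in sorted(facts.holdings.items()))]
    lines.append("Recent activity: " + "; ".join(facts.recent_activity))
    lines.append("Prices: " + ", ".join(f"{s} ${p:.2f}" for s, p in sorted(facts.prices.items())))
    return "\n".join(lines)


def build_research_prompt(facts: SharedFacts) -> str:
    return render_facts(facts) + "\n\nFind candidates that fit the strategy."


def build_decision_prompt(facts: SharedFacts, candidates: Tuple[Candidate, ...]) -> str:
    listing = "\n".join(f"- {c.symbol}: {c.thesis}" for c in candidates)
    return render_facts(facts) + "\n\nCandidates:\n" + listing + "\n\nChoose BUY, SELL, or HOLD."


# Illustrative values only.
facts = SharedFacts(holdings={"CRWD": 20}, recent_activity=("BUY CRWD",), prices={"PATH": 11.17})
analyst = AnalystResult(
    candidates=(Candidate("PATH", "agentic-AI automation"),),
    reasoning_trace=("I am already convinced PATH is the one to buy.",),
)

research_prompt = build_research_prompt(facts)
decision_prompt = build_decision_prompt(facts, analyst.candidates)
print(decision_prompt)
```

Passing `analyst.candidates` instead of the whole `AnalystResult` puts the firewall in the function signature, so it does not depend on remembering a convention.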

Here’s the full sequence for a single trading cycle:

Output Guardrails with Self-Correction

Handing execution flow to the LLM is the point of agentic systems — and the risk. The same loop that lets the agent shine on a new problem also lets it invent a ticker or fabricate a URL. The guardrail sits where the LLM hands structured data downstream into something that will execute a trade. After each Market Analyst run the SDK validates the output against the schema; if it trips — non-positive price, malformed URL, empty candidate list — the orchestrator feeds the failed validators back as a corrective message and the agent retries, up to three attempts. Canonical verification-loop shape: structured success condition, descriptive error on miss, bounded retry budget.
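The three structural checks named above could look roughly like this. It is a standard-library sketch, not the SDK's schema validation. The field names (`candidates`, `price`, `sources`), the URL pattern, and the wording of the corrective message are assumptions.

```python
import re
from typing import Any, Dict, List

# Deliberately simple pattern: scheme, then a host-like start, no whitespace.
URL_PATTERN = re.compile(r"^https?://[^\s/$.?#][^\s]*$", re.IGNORECASE)


def validate_research(output: Dict[str, Any]) -> List[str]:
    """Return one descriptive error per failed check; an empty list means the output passed."""
    errors: List[str] = []
    candidates = output.get("candidates") or []
    if not candidates:
        errors.append("candidates: must contain at least one entry")
    for i, candidate in enumerate(candidates):
        price = candidate.get("price")
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if not is_number or price <= 0:
            errors.append(f"candidates[{i}].price: must be a positive number, got {price!r}")
    for i, url in enumerate(output.get("sources") or []):
        if not isinstance(url, str) or not URL_PATTERN.match(url):
            errors.append(f"sources[{i}]: not a well-formed http(s) URL: {url!r}")
    return errors


def corrective_message(errors: List[str]) -> str:
    """Turn failed validators into the message fed back to the agent before a retry."""
    bullet_list = "\n".join(f"- {e}" for e in errors)
    return "Your previous output failed validation. Fix these and answer again:\n" + bullet_list


good = {"candidates": [{"symbol": "PATH", "price": 11.17}], "sources": ["https://example.com/a"]}
bad = {"candidates": [{"symbol": "PATH", "price": -1}], "sources": ["htp:/broken"]}

print(validate_research(good))  # []
print(corrective_message(validate_research(bad)))
```

Returning every failure at once, not only the first, gives the agent a chance to fix them all in a single retry.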

Recovery is recorded at the same fidelity as a clean run. Each phase row carries the outcome (first_try, recovered, exhausted), attempt count, the validation errors from the last failed attempt, and the rejected JSON the model produced — so you can count tripwire rates per agent or inspect what the model first proposed. The same machinery is in place on the Decision Maker side, waiting for the symmetric guardrail there.
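The bounded retry loop and the audit fields it produces can be sketched together. The outcome labels and the retry budget of three come from the text above; `PhaseAudit`, `run_with_guardrail`, and the fake model are hypothetical names for illustration, and the real orchestrator may be structured differently.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

MAX_ATTEMPTS = 3


@dataclass
class PhaseAudit:
    outcome: str                  # "first_try", "recovered", or "exhausted"
    attempts: int
    last_errors: List[str]        # validation errors from the last failed attempt
    rejected_json: Optional[str]  # what the model produced on that failed attempt


def run_with_guardrail(
    generate: Callable[[Optional[str]], Dict[str, Any]],
    validate: Callable[[Dict[str, Any]], List[str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[Optional[Dict[str, Any]], PhaseAudit]:
    feedback: Optional[str] = None
    last_errors: List[str] = []
    rejected: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        errors = validate(output)
        if not errors:
            outcome = "first_try" if attempt == 1 else "recovered"
            return output, PhaseAudit(outcome, attempt, last_errors, rejected)
        last_errors = errors
        rejected = json.dumps(output)
        feedback = "Your previous output failed validation:\n" + "\n".join(f"- {e}" for e in errors)
    return None, PhaseAudit("exhausted", max_attempts, last_errors, rejected)


# Fake model: fails once, then corrects itself when it receives feedback.
def fake_model(feedback: Optional[str]) -> Dict[str, Any]:
    if feedback is None:
        return {"candidates": []}
    return {"candidates": [{"symbol": "PATH", "price": 11.17}]}


def non_empty(output: Dict[str, Any]) -> List[str]:
    return [] if output["candidates"] else ["candidates: list must not be empty"]


result, audit = run_with_guardrail(fake_model, non_empty)
print(audit.outcome, audit.attempts)  # recovered 2
print(audit.rejected_json)            # {"candidates": []}
```

Because a recovered run keeps the rejected payload and its errors, trip rates and first-proposal inspection come straight from the audit row.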

The current guardrail does structural validation — schema, types, ranges, regex-checked URLs — not semantic. A reachable URL pointing at the wrong article, or a valid ticker for a company the analyst didn’t actually research, passes the schema; only the audit trail catches those after the fact. Semantic checks (hallucination detection, source grounding) are the natural next layer — typically an asynchronous LLM-as-judge alongside the synchronous validation.


What Gets Tracked

Each trading cycle is captured as a per-phase trace, rendered by the dashboard for audit and traceability — what the LLM was shown, what it returned, the tool calls it made, the reasoning behind every decision.

Token metrics — Input, output, cached, and reasoning tokens per agent phase (research and decision). The execution phase doesn’t call the LLM, so it has no token metrics — only an outcome status.

Cost calculation: USD per agent phase, computed from token counts and per-million-token model pricing (vendored from LiteLLM’s model_prices_and_context_window.json). Summing across phases gives the per-cycle cost and, over time, the actual cost of running four traders continuously.
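A worked version of that arithmetic, under two stated assumptions: cached tokens are counted inside the input total and billed at a discounted rate, and reasoning tokens are already included in the output total. The prices and token counts below are made-up round numbers, not any model's real rates or this system's real usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int   # total prompt tokens, cached ones included (assumption)
    cached_tokens: int  # subset of input_tokens served from the prompt cache
    output_tokens: int  # completion tokens, reasoning tokens included (assumption)


@dataclass(frozen=True)
class Pricing:
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float


def phase_cost_usd(usage: TokenUsage, pricing: Pricing) -> float:
    uncached = usage.input_tokens - usage.cached_tokens
    micro_usd = (
        uncached * pricing.input_per_million
        + usage.cached_tokens * pricing.cached_input_per_million
        + usage.output_tokens * pricing.output_per_million
    )
    return micro_usd / 1_000_000


# Hypothetical prices in USD per million tokens.
pricing = Pricing(input_per_million=2.00, cached_input_per_million=0.50, output_per_million=8.00)

research = TokenUsage(input_tokens=12_000, cached_tokens=4_000, output_tokens=1_500)
decision = TokenUsage(input_tokens=6_000, cached_tokens=0, output_tokens=1_000)

research_cost = phase_cost_usd(research, pricing)
decision_cost = phase_cost_usd(decision, pricing)
cycle_cost = research_cost + decision_cost

print(f"research ${research_cost:.4f}  decision ${decision_cost:.4f}  cycle ${cycle_cost:.4f}")
```

With these numbers the research phase comes to $0.03 and the decision phase to $0.02, so one cycle costs $0.05. The execution phase adds nothing because it never calls the model.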

Tool call logging — Every invocation stored with tool name, parameters, an error flag, and truncated error message. When research fails, the log shows exactly which search or fetch call went wrong — cutting debugging from “reproduce and guess” to “open the log and read.”
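As a sketch of what one logged invocation might carry, the wrapper below records the four fields listed above. The record type, the 200-character truncation limit, and the fake tools are assumptions for the example; the real schema lives in the repository.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

MAX_ERROR_CHARS = 200  # hypothetical truncation limit


@dataclass
class ToolCallRecord:
    tool_name: str
    parameters: Dict[str, Any]
    is_error: bool
    error_message: Optional[str]


def call_and_log(
    log: List[ToolCallRecord],
    tool_name: str,
    tool: Callable[..., Any],
    parameters: Dict[str, Any],
) -> Any:
    """Invoke the tool and append exactly one record, whether it succeeds or fails."""
    try:
        result = tool(**parameters)
    except Exception as exc:
        log.append(ToolCallRecord(tool_name, dict(parameters), True, str(exc)[:MAX_ERROR_CHARS]))
        return None
    log.append(ToolCallRecord(tool_name, dict(parameters), False, None))
    return result


# Fake tools standing in for a search call and a failing fetch call.
def fake_search(query: str) -> List[str]:
    return ["result for " + query]


def fake_fetch(url: str) -> str:
    raise TimeoutError("timed out fetching " + url + " " + "x" * 500)


log: List[ToolCallRecord] = []
call_and_log(log, "search", fake_search, {"query": "UiPath agentic AI"})
call_and_log(log, "fetch", fake_fetch, {"url": "https://example.com/article"})

for record in log:
    print(record.tool_name, record.is_error, record.parameters)
```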

Phase latency — Wall-clock time per phase, in milliseconds. Research suddenly taking 45 seconds instead of 15 surfaces immediately.
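One common way to capture per-phase wall-clock time is a small context manager. This is an illustrative sketch with hypothetical names, not the orchestrator's actual timing code; the clock is injectable so the demo is deterministic.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


@contextmanager
def timed_phase(
    latencies_ms: Dict[str, int],
    phase: str,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[None]:
    """Record elapsed wall-clock milliseconds for the wrapped block, even if it raises."""
    start = clock()
    try:
        yield
    finally:
        latencies_ms[phase] = round((clock() - start) * 1000)


# Fake clock: the phase "starts" at 100.0 s and "ends" at 100.25 s.
ticks = iter([100.0, 100.25])
latencies_ms: Dict[str, int] = {}

with timed_phase(latencies_ms, "research", clock=lambda: next(ticks)):
    pass  # the research agent would run here

print(latencies_ms)  # {'research': 250}
```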

Prompt capture — System prompt (agent instructions) and task prompt (dynamic per-cycle context), stored per phase. Lets you compare runs before and after a prompt-template change to see the effect on behavior.

Decision reasoning — Four structured fields from the LLM’s output: rationale, portfolioContext, historicalContext, and researchContext (summarizing the Analyst’s hand-off). Stored as separate columns, so you can scan and filter without parsing free text.
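To show why separate columns help, the sketch below loads two decision records into an in-memory SQLite table and filters on a single reasoning field. SQLite stands in for PostgreSQL here, and the table name, column names, and run IDs are assumptions. The first row paraphrases the PATH decision described earlier; the second is a placeholder.

```python
import json
import sqlite3

# Structured decision output as the model might return it (field names from the text above).
raw = json.dumps({
    "rationale": "Durable ARR expansion; lower valuation risk than PLTR.",
    "portfolioContext": "Cash $29,720.50, positions 9/10.",
    "historicalContext": "17 PATH trades in the last 90 days, net accumulating.",
    "researchContext": "Head-to-head comparison of PLTR, PATH, CRSP, COIN.",
})
fields = json.loads(raw)

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decision_audit (
        run_id INTEGER PRIMARY KEY,
        action TEXT,
        symbol TEXT,
        rationale TEXT,
        portfolio_context TEXT,
        historical_context TEXT,
        research_context TEXT
    )
    """
)
insert_sql = "INSERT INTO decision_audit VALUES (?, ?, ?, ?, ?, ?, ?)"
conn.execute(
    insert_sql,
    (1, "BUY", "PATH", fields["rationale"], fields["portfolioContext"],
     fields["historicalContext"], fields["researchContext"]),
)
# Placeholder second row so the filter has something to exclude.
conn.execute(insert_sql, (2, "HOLD", None, "No candidate cleared the bar.", "n/a", "n/a", "n/a"))

# Filter on one reasoning column without parsing free text.
rows = conn.execute(
    "SELECT run_id, symbol FROM decision_audit WHERE historical_context LIKE ?",
    ("%net accumulating%",),
).fetchall()

print(rows)  # [(1, 'PATH')]
```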

Operational health (was the system up, did cycles complete on time, is anything erroring) is a different concern, and it’s not the dashboard’s job. That lives in Grafana over Loki and Postgres. The provisioned dashboards cover heartbeat freshness on the last completed cycle, per-agent status for recent cycles with research/decision latencies, log volume and error rate over time, top error messages for triage, agent status timelines, and the full guardrail-event story (trip rate by event type, per-agent trip breakdown, recent exhausted events, finalize-failure events). They are wired up to email alerts on 5xx/4xx spikes, fatal errors, pod crashloops, sustained guardrail exhaustion, and any finalize failure. Below is a small sample of the available visualizations:


Get Involved

I built this project to learn and explore agentic systems, a space the industry is still figuring out. It is a work in progress with plenty of room for improvement. The full source is on GitHub; the README has setup instructions for running it locally. Issues, pull requests, and forks for your own exploration are all welcome.


Resources

The agentic systems space is vast — there’s no shortage of papers, blog posts, and frameworks worth reading. These three were among the most influential for this project:

I strongly recommend the OpenAI Agents SDK documentation if you want to learn more about building multi-agent systems with the tools used here.
