Validating AI agent behavior before production is one of the hardest problems in applied AI. Agents are non-deterministic and multi-step — errors in early steps cascade through the rest. A single bad tool call can derail an entire workflow.
This post is a practical guide to evaluating a text-to-SQL autonomous agent built with NucleusIQ. You will learn how to:
- Apply five evaluation patterns for multi-step agents
- Build offline evaluations using pytest and a typed
AgentResultscorecard - Configure production-style evaluators (safety, reference-free quality, composite score)
- Run the full lifecycle on the Chinook sample database with Autonomous mode (planning + Critic/Refiner)
The walkthrough uses Claude Haiku via nucleusiq-anthropic. Every pattern is runnable from the repo — no hosted eval platform required for offline tests.
Latest run (June 2026): 6/6 patterns passed, pass^k 100% (k=3). Reproduce: cd notebooks/agents/text_to_sql_eval && python run_all.py.
NucleusIQ text-to-SQL agent evaluation scorecard: 6 of 6 patterns passed, 100% pass^k, Autonomous execution mode
The structure of an agent evaluation
An evaluation is a test for an AI system: give an AI an input, apply grading logic to its output, and measure success. For a large language model (LLM) call, this is straightforward. For agents, every component becomes more complex.
Key terminology
Before diving into patterns, here are the terms used throughout this post:
| Term | Definition |
|---|---|
| Task | A single test with defined inputs and success criteria. Example: “How many customers are from Canada?” with expected answer eight. |
| Trial | A single attempt at a task. Because model outputs are non-deterministic, running multiple trials per task produces more reliable results. |
| Grader | Logic that scores some aspect of the agent’s performance. A task can have multiple graders, each evaluating a different dimension. |
| Transcript | The complete record of a trial: tool calls, reasoning steps, intermediate results, and interactions. In NucleusIQ this is AgentResult.tool_calls, messages, and llm_calls. |
| Outcome | The final state of the environment at the end of a trial. An agent might say “The answer is eight,” but the outcome is whether it actually executed the correct SQL query against the database. |
| Evaluation harness | Infrastructure that runs evaluations end-to-end: provides instructions and tools, runs tasks, records steps, grades outputs, and aggregates results. |
| Evaluation suite | A collection of tasks designed to measure specific capabilities or behaviors. |
Why agent evaluations are harder
Three properties make agent evaluation fundamentally different from evaluating straightforward LLM outputs:
- Non-determinism — Agent behavior varies between runs. The same task might succeed 90% of the time and fail 10%. A single pass/fail result doesn’t tell you much. You need multiple trials to estimate actual performance. Two metrics help:
- pass@k — likelihood of at least one success in k attempts (use when one success suffices).
- pass^k — probability that all k trials succeed (use when consistency matters).
- Error propagation — In a multi-step agent, a mistake in step 3 can cascade through the following steps. A text-to-SQL agent that misidentifies the schema early will construct an incorrect JOIN, producing wrong results in its final answer. Evaluating only the final output misses where things went wrong.
- Creative solutions — Frontier models sometimes find valid approaches that eval designers didn’t anticipate. Grade what the agent produced, not the exact path it took.
What you can evaluate
For an agent run, there are three categories you can test:
| Category | Question | NucleusIQ source |
|---|---|---|
| Trajectory | Did it explore the schema? Did it use sql_query_checker before executing? | result.tool_calls → tool_name, args, round, success |
| Final response | Is the answer correct? Well formatted? | result.output |
| Other state | Plans, sub-tasks, workspace notes, intermediate results | result.autonomous, result.context_telemetry, result.memory_snapshot |
Evaluation patterns for AI agents
Agent evaluations typically combine three types of graders. The key to effective evaluation design is choosing the right mix for your use case.
Code-based graders
Code-based graders use deterministic logic to verify specific conditions: string matching, regex patterns, binary pass/fail tests, static analysis, tool call verification, and transcript analysis (turn counts, token usage).
Strengths — Fast, cheap, objective, reproducible, and straightforward to debug. When you can express success criteria as code, do it.
Weaknesses — Brittle to variations that don’t match expected patterns exactly. A query result formatted as “eight customers” compared to “There are eight” might fail a strict string match even though both are correct.
Example — Verifying a tool was called:
from evals.extract import tool_names
tool_names_list = tool_names(result)
assert "sql_query" in tool_names_list, "Agent must execute sql_query"
Model-based graders (LLM-as-judge)
Model-based graders use another LLM to evaluate the agent’s output. Methods include rubric-based scoring, natural language assertions, pairwise comparison, and multi-judge consensus.
Strengths — Flexible, scalable, captures nuance, and handles open-ended tasks where the agent’s answer can take many valid forms.
Weaknesses — Non-deterministic, more expensive than code, and requires calibration with human graders to validate accuracy. Give the judge LLM a way out (for example, “return low scores if you don’t have enough information”) to avoid hallucinated scores.
Example — Rubric from our showcase (evals/graders.py):
JUDGE_RUBRIC = """Score each dimension 0.0 to 1.0. Return ONLY valid JSON.
1. correctness: Does the answer identify {expected}?
2. completeness: Does it address every part of the question?
3. clarity: Is the answer well-formatted and easy to understand?
...
Return: {{"correctness": <float>, "completeness": <float>, "clarity": <float>}}"""
scores = await llm_judge(judge_llm, question, answer, expected="Jane Peacock")
assert scores["correctness"] >= 0.5
Human graders
Human graders (subject matter expert review, crowdsourced judgment, spot-check sampling) are often considered the gold standard for subjective quality assessments. Compared to programmatic options, human graders are expensive and slow, but essential for calibrating model-based graders.
Use them judiciously: calibrate LLM-as-judge rubrics against expert human judgment initially, then use human review periodically to verify that automated graders haven’t drifted.
Combining graders: the practical recommendation
| Grader type | What to check (text-to-SQL) |
|---|---|
| Code-based | Did the agent call sql_query? Does the answer contain “8”? Were DML statements executed? |
| LLM-as-judge | For complex queries — is the analysis correct, complete, and well structured? |
| Human | Periodic spot-checks to verify LLM grading aligns with expert judgment |
Capability vs. regression evaluations
| Type | Question | Target pass rate |
|---|---|---|
| Capability evaluation | What can this agent do well? | Start low, climb upward |
| Regression evaluation | Does the agent still handle what it used to? | Nearly 100% |
As your agent matures, capability evaluations that reach high pass rates can graduate into your regression suite. Our Canada customer count trials (run 3×) are regression tests — pass^k should stay at 100%.
Evaluating autonomous (multi-step) agents
Autonomous mode in NucleusIQ is the production-shaped agent: Decomposer (task breakdown) → tool loop → Critic (independent verification) → Refiner (targeted correction). Complex tasks spawn parallel sub-agents; simple tasks run a single execute → validate → Critic pipeline.
Multi-step agents break the assumption that every test case can be run through the same application logic and scored by the same evaluator. The five patterns below apply broadly.
Pattern 1: Custom test logic per datapoint
Traditional LLM evaluation treats every datapoint identically. Multi-step agents break this assumption. Each test case may have its own success criteria, involving specific assertions against trajectory and state, not just the final message.
- “How many customers are from Canada?” → code grader:
"8"in answer. - “Which employee generated the most revenue and from which countries?” → LLM judge (answer format varies).
async def eval_pattern_1a(llm):
r = await run_question(llm, "How many customers are from Canada?")
ok = "8" in final_answer(r).lower()
assert ok
async def eval_pattern_1b(llm):
q = "Which employee generated the most revenue and from which countries?"
r = await run_question(llm, q)
scores = await llm_judge(llm, q, final_answer(r), "Jane Peacock")
assert scores["correctness"] >= 0.5
Pattern 2: Single-step evaluations
About half of production test cases for multi-step agents are single-step evaluations: what did the agent decide to do after a specific input? Useful for validating individual decision points — did it call SQL tools instead of guessing?
async def eval_pattern_2(llm):
r = await run_question(llm, "How many customers are from Canada?")
names = tool_names(r)
sql_tools = {"sql_list_tables", "sql_schema", "sql_query", "sql_query_checker"}
assert sql_tools & set(names), f"Agent must use SQL tools; got: {names}"
Single-step evaluations are your unit tests — fast, focused, efficient on tokens.
Pattern 3: Full agent turns
Run the agent end-to-end on a single input and evaluate trajectory plus answer:
async def eval_pattern_3(llm):
r = await run_question(llm, "How many customers are from Canada?")
assert "sql_query" in tool_names(r)
assert "8" in final_answer(r).lower()
Key insight: Assert that certain tools appeared, but not exact order. The agent might list tables before schema, or go directly to schema. Both are valid. Grade what the agent produced, not the exact path it took.
With Autonomous mode, inspect Critic verdicts on result.autonomous.critic_verdicts.
Pattern 4: Multi-turn evaluations
Test agents across multi-turn conversations. Turn 1: “What are the top 5 best-selling artists?” Turn 2: “For the top artist, how many albums do they have?”
Use conditional logic plus FullHistoryMemory so turn 2 sees turn 1 without hand-built message lists:
async def eval_pattern_4(llm):
memory = MemoryFactory.create_memory(MemoryStrategy.FULL_HISTORY)
agent = build_sql_agent(llm, memory=memory)
await agent.initialize()
r1 = await agent.execute(Task(
id="t1", objective="What are the top 5 best-selling artists?"
))
a1 = final_answer(r1)
if not a1 or len(a1) < 20:
pytest.fail("Turn 1 failed — skipping turn 2")
r2 = await agent.execute(Task(
id="t2", objective="For the top artist, how many albums do they have?"
))
assert len(final_answer(r2)) > 20
Pattern 5: Safety and state checks
Complex queries should use planning; SQL must stay read-only:
async def eval_pattern_5(llm):
q = "What is the total revenue per genre, and which genre has the most tracks?"
r = await run_question(llm, q)
assert no_dml_executed(r)
assert len(final_answer(r)) > 50
assert has_autonomous_planning(r) # sub_tasks or Critic loop
Our June run decomposed this into two sub-tasks with Critic score 0.92.
End-to-end example: Evaluating a text-to-SQL autonomous agent
Architecture overview
The text-to-SQL agent uses NucleusIQ Autonomous mode, ContextEngine, and four read-only SQL tools on the Chinook database (sample digital media store).
User question
▼
┌─────────────────────────────────────────────────────────┐
│ NucleusIQ Agent (ExecutionMode.AUTONOMOUS) │
│ Decomposer → Tool loop → Critic → Refiner │
│ ContextEngine (mask / recall) · FullHistoryMemory │
└─────────────────────────────────────────────────────────┘
▼
sql_list_tables │ sql_schema │ sql_query_checker │ sql_query
▼
AgentResult → scorecard.json / showcase-scorecard.png
See ARCHITECTURE.md.
Prerequisites
- Python 3.12+
ANTHROPIC_API_KEYin environment or repo.env- Packages:
nucleusiq,nucleusiq-anthropic,pytest,python-dotenv
Setup
git clone https://github.com/nucleusbox/NucleusIQ.git
cd NucleusIQ/notebooks/agents/text_to_sql_eval
python -m venv .venv
# Windows:
.venv\Scripts\Activate.ps1
# macOS/Linux:
# source .venv/bin/activate
pip install -r requirements.txt
python scripts/download_chinook.py
python scripts/build_fat_db.py # optional: context stress demo
The agent uses Claude Haiku via nucleusiq-anthropic:
from nucleusiq_anthropic import BaseAnthropic
llm = BaseAnthropic(
model_name="claude-haiku-4-5-20251001",
temperature=0.0,
async_mode=True,
)
Minimal .env:
ANTHROPIC_API_KEY=sk-ant-...
Every agent.execute() returns a typed AgentResult — tool calls, LLM calls, autonomous metadata, and context telemetry — without a separate tracing product.
Building the evaluation suite
The showcase implements all patterns in [evals/runner.py](../../notebooks/agents/text_to_sql_eval/evals/runner.py). Run everything with:
python run_all.py
# or: pytest evals/test_patterns.py -v
Eval 1: Single-step — did the agent use SQL tools?
async def eval_pattern_2(llm):
r = await run_question(llm, "How many customers are from Canada?")
names = tool_names(r)
ok = bool(SQL_TOOLS & set(names))
# feedback_scores logs: used_sql_tools, executed_query, sql_safety, ...
assert ok
Eval 2: Full turn with deterministic grading
async def eval_pattern_3(llm):
r = await run_question(llm, "How many customers are from Canada?")
assert executed_a_query(r)
assert "8" in final_answer(r).lower()
Eval 3: Complex query with LLM-as-judge
async def eval_pattern_1b(llm):
q = "Which employee generated the most revenue and from which countries?"
r = await run_question(llm, q)
scores = await llm_judge(llm, q, final_answer(r), "Jane Peacock")
assert scores["correctness"] >= 0.5
Eval 4: Multi-turn follow-up
See Pattern 4 above — uses FullHistoryMemory across two execute() calls.
Eval 5: Safety and Autonomous planning
See Pattern 5 above — no_dml_executed, substantive answer, has_autonomous_planning.
Non-determinism: pass@k and pass^k
async def eval_trials(llm, question, check, k=3):
results = [await run_question(llm, question, task_id=f"trial-{i}") for i in range(k)]
passes = [check(r) for r in results]
return {
"pass@k": any(passes),
"pass^k": all(passes),
"pass_rate": sum(passes) / k,
"eval_type": "regression",
}
Canada count regression: 3/3 passes, pass^k = 100%.
Viewing results: owned scorecard (no hosted UI required)
Hosted experiment UIs are optional. NucleusIQ emits scorecard.json and report.md on every run_all.py — commit them, diff in PRs, gate CI.
What you can inspect per pattern
| Capability | How |
|---|---|
| Full trajectory | patterns[].trajectory — tool names in order |
| Feedback metrics | patterns[].feedback — used_sql_tools, sql_safety, correct_answer, … |
| Autonomous state | patterns[].autonomous — sub_tasks, critic_verdicts, refined |
| Duration | patterns[].duration_ms — ~12s simple, ~21s complex (Haiku, June 2026) |
| Compare runs | Diff two scorecard.json files after prompt or model changes |
| Debug failures | Open trajectory + detail on failed pattern; replay question locally |
Example excerpt:
{
"patterns_passed": 6,
"patterns_total": 6,
"mode": "autonomous",
"patterns": [
{
"pattern_id": "p1a",
"passed": true,
"trajectory": ["sql_list_tables", "sql_schema", "sql_query"],
"feedback": { "sql_safety": 1.0, "correct_answer": 1.0 }
}
],
"trials": { "pass^k": true, "k": 3 }
}
Regenerate the blog image: python scripts/render_showcase_image.py.
Context stress testing (NucleusIQ differentiator)
Early schema mistakes cascade. On a fat schema (+60 wide tables) plus “return CREATE for every table, then count Canada customers,” context pressure is real.
We compare ContextStrategy.NONE vs PROGRESSIVE and read context_telemetry (peak utilization, artifacts offloaded). This is the eval dimension most generic posts skip — context survival, not only final answer text.
async def eval_context_stress(llm):
question = (
"Return the CREATE statement for every table in the database, "
"then tell me how many customers are from Canada."
)
off = await run_question(llm, question, context=ContextStrategy.NONE)
on = await run_question(llm, question, context=ContextStrategy.PROGRESSIVE)
# Compare telemetry_dict(off) vs telemetry_dict(on)
From offline to online: Production monitoring
Everything built so far runs offline, before deployment. The next step is monitoring live traces.
| Offline evaluation | Production monitoring | |
|---|---|---|
| Runs on | Curated datasets with reference outputs | Live AgentResult traces |
| When | Pre-deployment (development, CI/CD) | Post-deployment |
| Purpose | Benchmarking, regression testing, unit testing | Real-time monitoring, anomaly detection |
| Data | Inputs + outputs + reference answers | Inputs + outputs only (no reference) |
| Setup | pytest + run_all.py | Hook evaluators into your logging pipeline |
NucleusIQ is a library — export structured AgentResult to OpenTelemetry or JSON sinks and run the same evaluator functions post-hoc.
Production evaluator 1: Code evaluator — SQL safety check
def sql_safety_check(result: AgentResult) -> dict[str, float]:
dangerous_keywords = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE"}
for q in executed_sql(result):
tokens = set(q.upper().replace(";", " ").split())
if tokens & dangerous_keywords:
return {"sql_safety": 0.0}
return {"sql_safety": 1.0}
Production evaluator 2: LLM-as-judge — answer quality (reference-free)
REFERENCE_FREE_RUBRIC = """You are evaluating a text-to-SQL agent. You do NOT have the database.
User question: {question}
Agent answer: {answer}
Score: correctness_confidence, clarity, completeness (0.0 to 1.0). Return JSON only."""
Implemented in [production_evaluators.py](../../notebooks/agents/text_to_sql_eval/evals/production_evaluators.py) as answer_quality_judge().
Production evaluator 3: Composite — overall quality score
| Component | Weight |
|---|---|
| sql_safety | 0.4 |
| correctness_confidence | 0.3 |
| clarity | 0.15 |
| completeness | 0.15 |
def composite_quality(scores: dict[str, float]) -> float:
weights = {"sql_safety": 0.4, "correctness_confidence": 0.3,
"clarity": 0.15, "completeness": 0.15}
...
Filter traces where overall_quality < 0.7. Chart trends as you iterate prompts and agent logic.
Latest results (Autonomous mode, June 2026)
| Metric | Result |
|---|---|
| Mode | autonomous (Decomposer + Critic/Refiner) |
| Model | claude-haiku-4-5-20251001 |
| Patterns | 6/6 passed |
| pass@k / pass^k (k=3) | 100% / 100% |
| P1b LLM judge | correctness 1.00 |
| P5 complex query | 2 sub-tasks; Critic 0.92 |
| Simple tasks | Critic 0.85–0.92; trajectory list_tables → schema → query |
Full artifact: [results/scorecard.json](../../notebooks/agents/text_to_sql_eval/results/scorecard.json).
What NucleusIQ adds beyond a generic eval tutorial
| Capability | Generic deep-agent eval post | NucleusIQ showcase |
|---|---|---|
| Tool trajectory grading | Yes | Yes — richer ToolCallRecord (round, args, duration) |
| Hosted trace UI | Required for some workflows | Optional — owned JSON scorecard |
| Planning / verification | Framework-dependent | Autonomous Critic/Refiner in scorecard |
| Context overflow on fat schema | Often omitted | Context OFF vs ON + telemetry |
| Provider portability | Often single cloud | Swap Anthropic / OpenAI / Gemini / Groq |
FAQ
How do you evaluate a text-to-SQL agent in production?
Grade trajectory, final answer, and agent state via AgentResult in pytest. Add production evaluators on live exports.
What is pass@k vs pass^k?
pass@k: at least one success in k runs. pass^k: all k succeed — use for regression.
Does NucleusIQ replace a semantic layer?
No. Bring dbt metrics, glossaries, and governance in your application layer.
Why Autonomous mode?
Score Decomposer plans and Critic verdicts — not only the final string.
How long per query?
~12 s simple, ~19–21 s complex (Haiku, June 2026 showcase).
Conclusion
AI agents require a fundamentally different set of evaluation strategies. The five patterns — custom per datapoint, single-step, full turn, multi-turn, and safety/state — give you that framework.
The text-to-SQL showcase shows them across the lifecycle:
- Development: offline pytest + owned scorecard
- Production: same evaluators on live
AgentResulttraces - The loop: production failures become test cases; metrics replace guesswork
NucleusIQ adds Autonomous verification, context telemetry, and a portable scorecard — measure agents that survive production, not demos that only look correct.
cd notebooks/agents/text_to_sql_eval && python run_all.py