Your Text-to-SQL Agent Passed the Demo. Did It Survive Production?

Validating AI agent behavior before production is one of the hardest problems in applied AI. Agents are non-deterministic and multi-step — errors in early steps cascade through the rest. A single bad tool call can derail an entire workflow.

This post is a practical guide to evaluating a text-to-SQL autonomous agent built with NucleusIQ. You will learn how to:

Apply five evaluation patterns for multi-step agents
Build offline evaluations using pytest and a typed AgentResult scorecard
Configure production-style evaluators (safety, reference-free quality, composite score)
Run the full lifecycle on the Chinook sample database with Autonomous mode (planning + Critic/Refiner)

The walkthrough uses Claude Haiku via nucleusiq-anthropic. Every pattern is runnable from the repo — no hosted eval platform required for offline tests.

Latest run (June 2026): 6/6 patterns passed, pass^k 100% (k=3). Reproduce: cd notebooks/agents/text_to_sql_eval && python run_all.py.

NucleusIQ text-to-SQL agent evaluation scorecard: 6 of 6 patterns passed, 100% pass^k, Autonomous execution mode

NucleusIQ

v0.6.0 · Open Source · MIT Licensed

Tired of complex agent frameworks? NucleusIQ gives you 3 execution modes, 10 production plugins, and provider portability — in pure Python. Try it →

Gearbox Strategy 10 Production Plugins Provider Portable

      $ pip install nucleusiq nucleusiq-openai
    

★ Star on GitHub Read Docs PyPI

The structure of an agent evaluation

An evaluation is a test for an AI system: give an AI an input, apply grading logic to its output, and measure success. For a large language model (LLM) call, this is straightforward. For agents, every component becomes more complex.

Key terminology

Before diving into patterns, here are the terms used throughout this post:

Term	Definition
Task	A single test with defined inputs and success criteria. Example: “How many customers are from Canada?” with expected answer eight.
Trial	A single attempt at a task. Because model outputs are non-deterministic, running multiple trials per task produces more reliable results.
Grader	Logic that scores some aspect of the agent’s performance. A task can have multiple graders, each evaluating a different dimension.
Transcript	The complete record of a trial: tool calls, reasoning steps, intermediate results, and interactions. In NucleusIQ this is `AgentResult.tool_calls`, `messages`, and `llm_calls`.
Outcome	The final state of the environment at the end of a trial. An agent might say “The answer is eight,” but the outcome is whether it actually executed the correct SQL query against the database.
Evaluation harness	Infrastructure that runs evaluations end-to-end: provides instructions and tools, runs tasks, records steps, grades outputs, and aggregates results.
Evaluation suite	A collection of tasks designed to measure specific capabilities or behaviors.

Why agent evaluations are harder

Three properties make agent evaluation fundamentally different from evaluating straightforward LLM outputs:

Non-determinism — Agent behavior varies between runs. The same task might succeed 90% of the time and fail 10%. A single pass/fail result doesn’t tell you much. You need multiple trials to estimate actual performance. Two metrics help:
- pass@k — likelihood of at least one success in k attempts (use when one success suffices).
- pass^k — probability that all k trials succeed (use when consistency matters).
Error propagation — In a multi-step agent, a mistake in step 3 can cascade through the following steps. A text-to-SQL agent that misidentifies the schema early will construct an incorrect JOIN, producing wrong results in its final answer. Evaluating only the final output misses where things went wrong.
Creative solutions — Frontier models sometimes find valid approaches that eval designers didn’t anticipate. Grade what the agent produced, not the exact path it took.

What you can evaluate

For an agent run, there are three categories you can test:

Category	Question	NucleusIQ source
Trajectory	Did it explore the schema? Did it use `sql_query_checker` before executing?	`result.tool_calls` → `tool_name`, `args`, `round`, `success`
Final response	Is the answer correct? Well formatted?	`result.output`
Other state	Plans, sub-tasks, workspace notes, intermediate results	`result.autonomous`, `result.context_telemetry`, `result.memory_snapshot`

Evaluation patterns for AI agents

Agent evaluations typically combine three types of graders. The key to effective evaluation design is choosing the right mix for your use case.

Code-based graders

Code-based graders use deterministic logic to verify specific conditions: string matching, regex patterns, binary pass/fail tests, static analysis, tool call verification, and transcript analysis (turn counts, token usage).

Strengths — Fast, cheap, objective, reproducible, and straightforward to debug. When you can express success criteria as code, do it.

Weaknesses — Brittle to variations that don’t match expected patterns exactly. A query result formatted as “eight customers” compared to “There are eight” might fail a strict string match even though both are correct.

Example — Verifying a tool was called:

from evals.extract import tool_names

tool_names_list = tool_names(result)
assert "sql_query" in tool_names_list, "Agent must execute sql_query"

Model-based graders (LLM-as-judge)

Model-based graders use another LLM to evaluate the agent’s output. Methods include rubric-based scoring, natural language assertions, pairwise comparison, and multi-judge consensus.

Strengths — Flexible, scalable, captures nuance, and handles open-ended tasks where the agent’s answer can take many valid forms.

Weaknesses — Non-deterministic, more expensive than code, and requires calibration with human graders to validate accuracy. Give the judge LLM a way out (for example, “return low scores if you don’t have enough information”) to avoid hallucinated scores.

Example — Rubric from our showcase (evals/graders.py):

JUDGE_RUBRIC = """Score each dimension 0.0 to 1.0. Return ONLY valid JSON.
1. correctness: Does the answer identify {expected}?
2. completeness: Does it address every part of the question?
3. clarity: Is the answer well-formatted and easy to understand?
...
Return: {{"correctness": <float>, "completeness": <float>, "clarity": <float>}}"""

scores = await llm_judge(judge_llm, question, answer, expected="Jane Peacock")
assert scores["correctness"] >= 0.5

Human graders

Human graders (subject matter expert review, crowdsourced judgment, spot-check sampling) are often considered the gold standard for subjective quality assessments. Compared to programmatic options, human graders are expensive and slow, but essential for calibrating model-based graders.

Use them judiciously: calibrate LLM-as-judge rubrics against expert human judgment initially, then use human review periodically to verify that automated graders haven’t drifted.

Combining graders: the practical recommendation

Grader type	What to check (text-to-SQL)
Code-based	Did the agent call `sql_query`? Does the answer contain “8”? Were DML statements executed?
LLM-as-judge	For complex queries — is the analysis correct, complete, and well structured?
Human	Periodic spot-checks to verify LLM grading aligns with expert judgment

Capability vs. regression evaluations

Type	Question	Target pass rate
Capability evaluation	What can this agent do well?	Start low, climb upward
Regression evaluation	Does the agent still handle what it used to?	Nearly 100%

As your agent matures, capability evaluations that reach high pass rates can graduate into your regression suite. Our Canada customer count trials (run 3×) are regression tests — pass^k should stay at 100%.

Evaluating autonomous (multi-step) agents

Autonomous mode in NucleusIQ is the production-shaped agent: Decomposer (task breakdown) → tool loop → Critic (independent verification) → Refiner (targeted correction). Complex tasks spawn parallel sub-agents; simple tasks run a single execute → validate → Critic pipeline.

Multi-step agents break the assumption that every test case can be run through the same application logic and scored by the same evaluator. The five patterns below apply broadly.

Pattern 1: Custom test logic per datapoint

Traditional LLM evaluation treats every datapoint identically. Multi-step agents break this assumption. Each test case may have its own success criteria, involving specific assertions against trajectory and state, not just the final message.

“How many customers are from Canada?” → code grader: "8" in answer.
“Which employee generated the most revenue and from which countries?” → LLM judge (answer format varies).

async def eval_pattern_1a(llm):
    r = await run_question(llm, "How many customers are from Canada?")
    ok = "8" in final_answer(r).lower()
    assert ok

async def eval_pattern_1b(llm):
    q = "Which employee generated the most revenue and from which countries?"
    r = await run_question(llm, q)
    scores = await llm_judge(llm, q, final_answer(r), "Jane Peacock")
    assert scores["correctness"] >= 0.5

Pattern 2: Single-step evaluations

About half of production test cases for multi-step agents are single-step evaluations: what did the agent decide to do after a specific input? Useful for validating individual decision points — did it call SQL tools instead of guessing?

async def eval_pattern_2(llm):
    r = await run_question(llm, "How many customers are from Canada?")
    names = tool_names(r)
    sql_tools = {"sql_list_tables", "sql_schema", "sql_query", "sql_query_checker"}
    assert sql_tools & set(names), f"Agent must use SQL tools; got: {names}"

Single-step evaluations are your unit tests — fast, focused, efficient on tokens.

Pattern 3: Full agent turns

Run the agent end-to-end on a single input and evaluate trajectory plus answer:

async def eval_pattern_3(llm):
    r = await run_question(llm, "How many customers are from Canada?")
    assert "sql_query" in tool_names(r)
    assert "8" in final_answer(r).lower()

Key insight: Assert that certain tools appeared, but not exact order. The agent might list tables before schema, or go directly to schema. Both are valid. Grade what the agent produced, not the exact path it took.

With Autonomous mode, inspect Critic verdicts on result.autonomous.critic_verdicts.

Pattern 4: Multi-turn evaluations

Test agents across multi-turn conversations. Turn 1: “What are the top 5 best-selling artists?” Turn 2: “For the top artist, how many albums do they have?”

Use conditional logic plus FullHistoryMemory so turn 2 sees turn 1 without hand-built message lists:

async def eval_pattern_4(llm):
    memory = MemoryFactory.create_memory(MemoryStrategy.FULL_HISTORY)
    agent = build_sql_agent(llm, memory=memory)
    await agent.initialize()

    r1 = await agent.execute(Task(
        id="t1", objective="What are the top 5 best-selling artists?"
    ))
    a1 = final_answer(r1)
    if not a1 or len(a1) < 20:
        pytest.fail("Turn 1 failed — skipping turn 2")

    r2 = await agent.execute(Task(
        id="t2", objective="For the top artist, how many albums do they have?"
    ))
    assert len(final_answer(r2)) > 20

Pattern 5: Safety and state checks

Complex queries should use planning; SQL must stay read-only:

async def eval_pattern_5(llm):
    q = "What is the total revenue per genre, and which genre has the most tracks?"
    r = await run_question(llm, q)
    assert no_dml_executed(r)
    assert len(final_answer(r)) > 50
    assert has_autonomous_planning(r)  # sub_tasks or Critic loop

Our June run decomposed this into two sub-tasks with Critic score 0.92.

End-to-end example: Evaluating a text-to-SQL autonomous agent

Architecture overview

The text-to-SQL agent uses NucleusIQ Autonomous mode, ContextEngine, and four read-only SQL tools on the Chinook database (sample digital media store).

User question
     ▼
┌─────────────────────────────────────────────────────────┐
│  NucleusIQ Agent (ExecutionMode.AUTONOMOUS)             │
│  Decomposer → Tool loop → Critic → Refiner              │
│  ContextEngine (mask / recall) · FullHistoryMemory      │
└─────────────────────────────────────────────────────────┘
     ▼
 sql_list_tables │ sql_schema │ sql_query_checker │ sql_query
     ▼
 AgentResult → scorecard.json / showcase-scorecard.png

See ARCHITECTURE.md.

Prerequisites

Python 3.12+
ANTHROPIC_API_KEY in environment or repo .env
Packages: nucleusiq, nucleusiq-anthropic, pytest, python-dotenv

Setup

git clone https://github.com/nucleusbox/NucleusIQ.git
cd NucleusIQ/notebooks/agents/text_to_sql_eval
python -m venv .venv
# Windows:
.venv\Scripts\Activate.ps1
# macOS/Linux:
# source .venv/bin/activate

pip install -r requirements.txt
python scripts/download_chinook.py
python scripts/build_fat_db.py   # optional: context stress demo

The agent uses Claude Haiku via nucleusiq-anthropic:

from nucleusiq_anthropic import BaseAnthropic

llm = BaseAnthropic(
    model_name="claude-haiku-4-5-20251001",
    temperature=0.0,
    async_mode=True,
)

Minimal .env:

ANTHROPIC_API_KEY=sk-ant-...

Every agent.execute() returns a typed AgentResult — tool calls, LLM calls, autonomous metadata, and context telemetry — without a separate tracing product.

Building the evaluation suite

The showcase implements all patterns in [evals/runner.py](../../notebooks/agents/text_to_sql_eval/evals/runner.py). Run everything with:

python run_all.py
# or: pytest evals/test_patterns.py -v

Eval 1: Single-step — did the agent use SQL tools?

async def eval_pattern_2(llm):
    r = await run_question(llm, "How many customers are from Canada?")
    names = tool_names(r)
    ok = bool(SQL_TOOLS & set(names))
    # feedback_scores logs: used_sql_tools, executed_query, sql_safety, ...
    assert ok

Eval 2: Full turn with deterministic grading

async def eval_pattern_3(llm):
    r = await run_question(llm, "How many customers are from Canada?")
    assert executed_a_query(r)
    assert "8" in final_answer(r).lower()

Eval 3: Complex query with LLM-as-judge

async def eval_pattern_1b(llm):
    q = "Which employee generated the most revenue and from which countries?"
    r = await run_question(llm, q)
    scores = await llm_judge(llm, q, final_answer(r), "Jane Peacock")
    assert scores["correctness"] >= 0.5

Eval 4: Multi-turn follow-up

See Pattern 4 above — uses FullHistoryMemory across two execute() calls.

Eval 5: Safety and Autonomous planning

See Pattern 5 above — no_dml_executed, substantive answer, has_autonomous_planning.

Non-determinism: pass@k and pass^k

async def eval_trials(llm, question, check, k=3):
    results = [await run_question(llm, question, task_id=f"trial-{i}") for i in range(k)]
    passes = [check(r) for r in results]
    return {
        "pass@k": any(passes),
        "pass^k": all(passes),
        "pass_rate": sum(passes) / k,
        "eval_type": "regression",
    }

Canada count regression: 3/3 passes, pass^k = 100%.

Viewing results: owned scorecard (no hosted UI required)

Hosted experiment UIs are optional. NucleusIQ emits scorecard.json and report.md on every run_all.py — commit them, diff in PRs, gate CI.

What you can inspect per pattern

Capability	How
Full trajectory	`patterns[].trajectory` — tool names in order
Feedback metrics	`patterns[].feedback` — `used_sql_tools`, `sql_safety`, `correct_answer`, …
Autonomous state	`patterns[].autonomous` — `sub_tasks`, `critic_verdicts`, `refined`
Duration	`patterns[].duration_ms` — ~12s simple, ~21s complex (Haiku, June 2026)
Compare runs	Diff two `scorecard.json` files after prompt or model changes
Debug failures	Open `trajectory` + `detail` on failed pattern; replay question locally

Example excerpt:

{
  "patterns_passed": 6,
  "patterns_total": 6,
  "mode": "autonomous",
  "patterns": [
    {
      "pattern_id": "p1a",
      "passed": true,
      "trajectory": ["sql_list_tables", "sql_schema", "sql_query"],
      "feedback": { "sql_safety": 1.0, "correct_answer": 1.0 }
    }
  ],
  "trials": { "pass^k": true, "k": 3 }
}

Regenerate the blog image: python scripts/render_showcase_image.py.

Context stress testing (NucleusIQ differentiator)

Early schema mistakes cascade. On a fat schema (+60 wide tables) plus “return CREATE for every table, then count Canada customers,” context pressure is real.

We compare ContextStrategy.NONE vs PROGRESSIVE and read context_telemetry (peak utilization, artifacts offloaded). This is the eval dimension most generic posts skip — context survival, not only final answer text.

async def eval_context_stress(llm):
    question = (
        "Return the CREATE statement for every table in the database, "
        "then tell me how many customers are from Canada."
    )
    off = await run_question(llm, question, context=ContextStrategy.NONE)
    on = await run_question(llm, question, context=ContextStrategy.PROGRESSIVE)
    # Compare telemetry_dict(off) vs telemetry_dict(on)

From offline to online: Production monitoring

Everything built so far runs offline, before deployment. The next step is monitoring live traces.

	Offline evaluation	Production monitoring
Runs on	Curated datasets with reference outputs	Live `AgentResult` traces
When	Pre-deployment (development, CI/CD)	Post-deployment
Purpose	Benchmarking, regression testing, unit testing	Real-time monitoring, anomaly detection
Data	Inputs + outputs + reference answers	Inputs + outputs only (no reference)
Setup	pytest + `run_all.py`	Hook evaluators into your logging pipeline

NucleusIQ is a library — export structured AgentResult to OpenTelemetry or JSON sinks and run the same evaluator functions post-hoc.

Production evaluator 1: Code evaluator — SQL safety check

def sql_safety_check(result: AgentResult) -> dict[str, float]:
    dangerous_keywords = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE"}
    for q in executed_sql(result):
        tokens = set(q.upper().replace(";", " ").split())
        if tokens & dangerous_keywords:
            return {"sql_safety": 0.0}
    return {"sql_safety": 1.0}

Production evaluator 2: LLM-as-judge — answer quality (reference-free)

REFERENCE_FREE_RUBRIC = """You are evaluating a text-to-SQL agent. You do NOT have the database.
User question: {question}
Agent answer: {answer}
Score: correctness_confidence, clarity, completeness (0.0 to 1.0). Return JSON only."""

Implemented in [production_evaluators.py](../../notebooks/agents/text_to_sql_eval/evals/production_evaluators.py) as answer_quality_judge().

Production evaluator 3: Composite — overall quality score

Component	Weight
sql_safety	0.4
correctness_confidence	0.3
clarity	0.15
completeness	0.15

def composite_quality(scores: dict[str, float]) -> float:
    weights = {"sql_safety": 0.4, "correctness_confidence": 0.3,
               "clarity": 0.15, "completeness": 0.15}
    ...

Filter traces where overall_quality < 0.7. Chart trends as you iterate prompts and agent logic.

Latest results (Autonomous mode, June 2026)

Metric	Result
Mode	autonomous (Decomposer + Critic/Refiner)
Model	`claude-haiku-4-5-20251001`
Patterns	6/6 passed
pass@k / pass^k (k=3)	100% / 100%
P1b LLM judge	correctness 1.00
P5 complex query	2 sub-tasks; Critic 0.92
Simple tasks	Critic 0.85–0.92; trajectory `list_tables → schema → query`

Full artifact: [results/scorecard.json](../../notebooks/agents/text_to_sql_eval/results/scorecard.json).

What NucleusIQ adds beyond a generic eval tutorial

Capability	Generic deep-agent eval post	NucleusIQ showcase
Tool trajectory grading	Yes	Yes — richer `ToolCallRecord` (round, args, duration)
Hosted trace UI	Required for some workflows	Optional — owned JSON scorecard
Planning / verification	Framework-dependent	Autonomous Critic/Refiner in scorecard
Context overflow on fat schema	Often omitted	Context OFF vs ON + telemetry
Provider portability	Often single cloud	Swap Anthropic / OpenAI / Gemini / Groq

FAQ

How do you evaluate a text-to-SQL agent in production?
Grade trajectory, final answer, and agent state via AgentResult in pytest. Add production evaluators on live exports.

What is pass@k vs pass^k?
pass@k: at least one success in k runs. pass^k: all k succeed — use for regression.

Does NucleusIQ replace a semantic layer?
No. Bring dbt metrics, glossaries, and governance in your application layer.

Why Autonomous mode?
Score Decomposer plans and Critic verdicts — not only the final string.

How long per query?
~12 s simple, ~19–21 s complex (Haiku, June 2026 showcase).

Conclusion

AI agents require a fundamentally different set of evaluation strategies. The five patterns — custom per datapoint, single-step, full turn, multi-turn, and safety/state — give you that framework.

The text-to-SQL showcase shows them across the lifecycle:

Development: offline pytest + owned scorecard
Production: same evaluators on live AgentResult traces
The loop: production failures become test cases; metrics replace guesswork

NucleusIQ adds Autonomous verification, context telemetry, and a portable scorecard — measure agents that survive production, not demos that only look correct.

cd notebooks/agents/text_to_sql_eval && python run_all.py

Post Views: 105