The Answer Isn't the Artifact: Designing Data Agents for Complex Analytical Work
Q&A data agents are great at quick lookups. Ask "what was our retention rate in region Y last quarter?" and a good one gives you an answer in seconds.
The problem shows up when the work gets more serious.
I don't mean "gnarly SQL with eight joins and three window functions" serious, where the query is complex but still produces one answer you can sanity-check. I mean "finance is presenting this to the board, and it rests on 15 upstream assumptions joining ERP actuals to headcount plans and CRM pipeline" serious.
Getting a plausible-looking answer is the easy part. Knowing whether to trust it is the hard part.
Because here's the thing: the analyst isn't just consuming the answer. They're presenting it. Their boss is going to drill in. Why is this number different from last month's forecast? Were the one-time restructuring charges excluded from opex? What FX rate was used for international bookings? How were mid-quarter hires counted in the headcount plan?
And once those are answered: what does this look like if Q4 pipeline converts 5 points lower?
"The AI said so" isn't a career-safe answer.
The answer is 42. Great. Now explain why.
This distinction, quick lookup versus ongoing analytical work, is why a lot of teams end up back in spreadsheets after data agent pilots that looked great in demos.
Two types of work
Most data agents are built for questions. The harder thing is models.
Q&A agents answer discrete questions. "What is our CAC by channel this month?" "Show me week-over-week churn for the last 90 days." These are well-scoped, the human can sanity-check the output intuitively, and if the answer is wrong it's often obviously wrong. Q&A agents are good at this.
Model-building agents construct analytical models: collections of dependent transformations with named intermediate steps, tunable assumptions, and reusable logic that teams revisit over time. Take a driver-based rolling forecast. You're combining ERP actuals, payroll and headcount plans, CRM pipeline, and FX rates at different grains. You're maintaining scenario branches from input assumption tables. You're producing a P&L and cash runway that finance adjusts and presents every month. That's not a query. It's a system with dependencies and intermediate logic that needs to stay correct as inputs change.
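To make "a system with dependencies" concrete, here is a minimal sketch of such a forecast expressed as a DAG of named steps. All node names are invented for illustration; the point is that a topological order over explicit dependencies tells the agent what to build, and what to re-validate, when an input changes.

```python
from graphlib import TopologicalSorter

# Hypothetical rolling-forecast DAG: each node is a named step, mapped
# to the upstream steps it depends on. Names are illustrative only.
DAG = {
    "erp_actuals":       [],
    "headcount_plan":    [],
    "crm_pipeline":      [],
    "fx_rates":          [],
    "actuals_usd":       ["erp_actuals", "fx_rates"],        # normalize currency
    "payroll_forecast":  ["headcount_plan"],                 # drive costs from heads
    "bookings_forecast": ["crm_pipeline", "fx_rates"],       # convert pipeline
    "pnl":               ["actuals_usd", "payroll_forecast",
                          "bookings_forecast"],              # assemble the P&L
    "cash_runway":       ["pnl"],                            # derive runway
}

# A topological order is the order in which steps must be built (and
# re-checked after an upstream assumption changes).
order = list(TopologicalSorter(DAG).static_order())
print(order)
```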
Most data agents are designed around the first mode. The second is harder, and the difficulty doesn't go away with a smarter LLM. It comes from a structural problem.
The verification problem
Here's a failure mode that doesn't get talked about enough. The agent spends 15 minutes producing a spreadsheet: dozens of sheets, 100k rows, nested formulas cascading across tabs. It's impressive. It's also completely opaque. Now someone has to audit it. That's hours of tracing cell expressions to find the 0.1% that's wrong. Did the agent actually save you any time?
Generating the work faster doesn't help if verification means reverse-engineering the answer.
A join at the wrong grain. A cost allocation using the wrong denominator. A date filter off by a week. None of these throw an error. The query runs, the rows come back, everything looks fine. Code bugs announce themselves. SQL business logic bugs just quietly return the wrong answer.
The agent doesn't catch this automatically. If it generated a 1000-line SQL script or a long pandas chain in one pass, errors become even harder to find. The agent sees only the final output, not whether each intermediate step is conceptually correct. The human reviewer is in the same position: you can't efficiently audit 1000 lines of nested CTEs (common table expressions) to find a subtle assumption error before it compounds through the rest of the pipeline.
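A wrong-grain join is worth seeing in miniature. The sketch below (invented tables, in-memory SQLite) joins order amounts to shipments at a finer grain before aggregating: the query runs, returns rows, and raises no error, but the revenue total is silently inflated.

```python
import sqlite3

# Minimal sketch of a silent wrong-grain join. Table names are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INT, amount REAL);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);

    -- Shipments are at a finer grain: order 1 ships in two parts.
    CREATE TABLE shipments (shipment_id INT, order_id INT);
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

correct = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Joining before aggregating duplicates order 1's amount once per
# shipment. No error, no warning -- just the wrong number.
inflated = con.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN shipments s ON o.order_id = s.order_id
""").fetchone()[0]

print(correct, inflated)   # 150.0 vs 250.0
```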
This is the ceiling for Q&A-style agents applied to complex analytical work. Not a model quality problem. A problem of how the work gets built.
Build it so you can check your work
The fix is to break the work into small, named steps (a directed acyclic graph, or DAG), where each step does exactly one type of transformation. With that structure, both halves of the verification problem, the agent's and the human's, get easier.
The view-level DAG captures dependency flow and intermediate model outputs as the work evolves.
The agent validates each step right after writing it. An error in step 3 (wrong join grain, unexpected null behavior, row count that doesn't match expectations) surfaces immediately, before it compounds through steps 4 to 10. The agent doesn't wait until the end to discover something went wrong.
The human reviewer can look at one view at a time. A 10-line SQL view that filters a table to active customers is something a non-technical analyst can read and evaluate without understanding the whole pipeline.
We classify every view by transformation type: filter, combine, aggregate, reshape, calculate, and a few others. Each view does exactly one. Not because combining them is impossible, but because "is this join correct?" is a question a reviewer can answer in 30 seconds. "Is this join-then-aggregate-then-pivot correct all at once?" takes considerably longer, and errors in the first step are invisible until you check the last.
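Here is a sketch of that discipline, using in-memory SQLite and invented names: each view does exactly one transformation, and gets a cheap validation immediately after it is created, before anything downstream is built on top of it.

```python
import sqlite3

# One-transformation-per-view sketch (names invented for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INT, region TEXT, status TEXT, mrr REAL);
    INSERT INTO customers VALUES
        (1, 'EMEA', 'active',  100.0),
        (2, 'EMEA', 'churned',  80.0),
        (3, 'AMER', 'active',  120.0);
""")

# Step 1: filter only. Short enough for a reviewer to read in seconds.
con.execute("""CREATE VIEW active_customers AS
    SELECT id, region, mrr FROM customers WHERE status = 'active'""")
# Validate immediately, before building on it: a filter can only drop rows.
n_src = con.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
n_act = con.execute("SELECT COUNT(*) FROM active_customers").fetchone()[0]
assert n_act <= n_src, "filter produced more rows than its source"

# Step 2: aggregate only, built on the already-validated step.
con.execute("""CREATE VIEW mrr_by_region AS
    SELECT region, SUM(mrr) AS mrr FROM active_customers GROUP BY region""")
# Validate: the aggregate must preserve the filtered total.
total = con.execute("SELECT SUM(mrr) FROM active_customers").fetchone()[0]
agg   = con.execute("SELECT SUM(mrr) FROM mrr_by_region").fetchone()[0]
assert total == agg, "aggregation dropped or duplicated value"
```

Each assertion is trivial on its own; the point is that an error in step 1 is caught before step 2 exists, instead of surfacing as a wrong number at the end of a ten-step pipeline.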
Dev note: why SQL and not Python/pandas
Three things pushed us toward SQL for DAG logic.
- SQL is readable by business analysts without an engineering background. A short SELECT is something a finance lead can actually evaluate.
- The expressiveness argument against SQL is weaker than it looks: almost all pandas transformations reduce to an expression tree anyway (Polars makes this explicit), so we're not meaningfully limiting what the agent can express.
- No code generation means no arbitrary code execution, which narrows the security surface substantially. For cases where we genuinely need code, we run Python in a WASM sandbox. For ML-style work, we use declarative specs rather than having the agent write sklearn.
Reactivity: the spreadsheet instinct, at scale
Part of why spreadsheets have such staying power is that they're reactive. Change a number in one cell, everything downstream updates immediately. That feedback loop makes a spreadsheet feel like a live model. Tweak an assumption and see the effect propagate. That's exactly what you want when the whole point of the analysis is to explore "what if."
Spreadsheets break down past roughly 100k rows, or when formulas get deeply nested and start to lag. Analysts hit this regularly and either simplify the model or hand it off to a data scientist. Either way, the reactivity disappears along with their ownership of it.
Our DAG uses SQL views instead of materialized tables, which gives the same reactive behavior: change an upstream input or assumption, every downstream view recomputes immediately.
The reactivity also matters directly for the agent. After any DAG change, the agent verifies its work against the current state. If intermediate nodes were materialized, a stale snapshot could silently mislead it. The agent would see last run's data, not today's. That reactive loop depends on having the right data in the workbook to begin with.
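The reactive behavior of non-materialized views is easy to demonstrate in miniature. In this sketch (invented tables, in-memory SQLite), a downstream forecast view reads an assumption table at query time, so editing the assumption is visible on the very next read, with no rebuild step:

```python
import sqlite3

# Reactivity-via-views sketch (names invented). Nothing downstream is
# materialized, so upstream edits propagate on the next read.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE assumptions (name TEXT, value REAL);
    INSERT INTO assumptions VALUES ('pipeline_conversion', 0.30);

    CREATE TABLE pipeline (deal_id INT, amount REAL);
    INSERT INTO pipeline VALUES (1, 1000.0), (2, 2000.0);

    -- The view reads the assumption at query time, not at build time.
    CREATE VIEW forecast AS
        SELECT (SELECT SUM(amount) FROM pipeline) * value AS expected_bookings
        FROM assumptions WHERE name = 'pipeline_conversion';
""")

before = con.execute("SELECT expected_bookings FROM forecast").fetchone()[0]

# "What if pipeline converts 5 points lower?" Edit the assumption...
con.execute("UPDATE assumptions SET value = 0.25 "
            "WHERE name = 'pipeline_conversion'")

# ...and the downstream view reflects it immediately. No stale snapshot.
after = con.execute("SELECT expected_bookings FROM forecast").fetchone()[0]
print(before, after)
```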
Dev note: right-sizing for interactive analysis
We've designed this for interactive analysis: sub-second query response for iterative modeling and scenario testing, up to about 10M records. The vast majority of real business analytics involves time-bounded extracts or aggregates that fit comfortably in that range. Analysis that requires processing 100M+ raw rows is typically non-interactive batch work, better handled by a cloud OLAP warehouse.
Starting with the right data
Before any modeling happens, there's a prior question to answer: out of hundreds of tables in an enterprise warehouse, which ones are actually relevant to this analytical task? That's the job of a dedicated data import agent.
The import agent runs first. Given the analytical task, it identifies the superset of source tables and pre-aggregated or joined extracts that might be relevant, and brings those into the workbook. The workbook agent then works exclusively on those local extracts. It never queries the warehouse directly, avoiding the round-trip latency of live warehouse calls during iterative modeling.
Splitting this into two specialized agents matters because warehouse access isn't something you hand to every process. Enterprise warehouses contain data with access controls, regulatory scope, and audit requirements that don't apply uniformly across all tables. Keeping that scoped to a single, purpose-built agent (one that can be audited, permissioned, and constrained independently) keeps the governance story simple. The workbook agent never sees anything it wasn't explicitly handed.
The scoping problem is also fundamentally different from the modeling problem. The import agent can afford to be opportunistic, erring toward including tables that might be useful, because unused extracts cost storage, not correctness. It also runs with a restricted toolset.
Dev note: data source discovery
The import agent uses many techniques for identifying relevant tables: semantic search over table and column descriptions, query history, sample data previews, column lineage, etc. The separation of concerns is intentional: these discovery capabilities stay in the import agent; the modeling agent never touches the warehouse.
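As a toy illustration of one such discovery signal (all names invented, and deliberately cruder than semantic search): score catalog tables by keyword overlap between the analytical task and each table's description, and shortlist anything that matches at all. The real agent combines several signals; this just shows the shape of the scoping step.

```python
# Invented mini-catalog: table name -> human-written description.
CATALOG = {
    "erp_actuals":     "monthly general ledger actuals by cost center and account",
    "headcount_plan":  "planned headcount and payroll cost by department and month",
    "crm_pipeline":    "open sales pipeline deals with stage amount and close date",
    "web_clickstream": "raw page view events collected on the marketing site",
}

def score(task: str, description: str) -> int:
    # Count shared lowercase keywords; crude, but opportunistic by design:
    # false positives cost storage, not correctness.
    return len(set(task.lower().split()) & set(description.lower().split()))

task = "build a rolling forecast using actuals headcount plan and pipeline"
ranked = sorted(CATALOG, key=lambda t: score(task, CATALOG[t]), reverse=True)
shortlist = [t for t in ranked if score(task, CATALOG[t]) > 0]
print(shortlist)   # clickstream scores zero and is left out
```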
The DAG outlasts the conversation
A well-built DAG does more than run correctly. It supports multiple analysts working on it at the same time. It documents itself as it's built. And it persists as the agent's memory across sessions.
Each view is a named, discrete step. The analyst working on revenue normalization and the one working on headcount allocation can build concurrently without stepping on each other. Review one view at a time, not the whole workbook. No more coordinating edits through a shared notebook or a chat thread.
The DAG also documents itself, in a specific sense. Every view the agent creates requires a plain-English description of its business purpose and a step-by-step explanation of the SQL logic that is specific about join types, filter conditions, and what rows get dropped and why. This isn't optional metadata. It's a required field in the tool contract the agent calls to create a view. The documentation exists because the construction protocol requires it, not because someone remembered to write it afterward.
That matters when your CFO questions an assumption six weeks later, or when a new team member tries to understand what the model does. The DAG doesn't just run. It explains itself, one step at a time.
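The "required field in the tool contract" idea can be sketched in a few lines. This is a hypothetical contract, not the actual system's API: the view-creation call simply refuses any view whose business purpose or logic notes are missing, so an undocumented view cannot exist.

```python
# Hypothetical tool contract: documentation fields are mandatory inputs
# to view creation, not optional metadata added afterward.
REQUIRED_FIELDS = ("name", "sql", "business_purpose", "logic_notes")

def create_view(spec: dict) -> dict:
    missing = [f for f in REQUIRED_FIELDS if not spec.get(f)]
    if missing:
        raise ValueError(f"view rejected, missing fields: {missing}")
    return spec   # a real system would register the view in the DAG here

# An undocumented view fails at the contract boundary.
try:
    create_view({"name": "active_customers",
                 "sql": "SELECT * FROM customers WHERE status = 'active'"})
except ValueError as err:
    print(err)

# A documented view is accepted.
view = create_view({
    "name": "active_customers",
    "sql": "SELECT * FROM customers WHERE status = 'active'",
    "business_purpose": "Restrict downstream revenue views to live accounts.",
    "logic_notes": "Filter on status; churned rows are dropped, nothing joined.",
})
```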
Dev note: conversation threads and context
The DAG's per-view metadata is prefilled into every agent session, making the DAG a part of the agent's permanent memory. We haven't enabled conversation compaction; each analytical task can start as a fresh conversation thread with the full workbook context already loaded, no chat history replay needed.
What you're left with
The point of this architecture isn't that complex analytical work is impossible with simpler tools. Teams have built complex models in spreadsheets, SQL scripts, and Python notebooks for years.
The point is that the work has to be reviewable by the people who are accountable for it. An answer no one can verify isn't something a team can act on with confidence, regardless of how sophisticated the agent that produced it is.
The conversation is how you get there. The DAG is what you're left with.
About the Author
Di Wu
Co-founder & CTO
Principal Engineer at Snowflake, Distinguished Engineer at Rubrik, CTO at BetterWorks, and Engineer at Palantir.
LinkedIn