
The Sandbox Is Already Compromised

9 min read
Di Wu, Co-founder & CTO

When you build a product where an AI agent writes and executes code against real data, you cannot trust that code any more than you'd trust a SQL query from an anonymous HTTP request. The sandbox you build to contain it has to be designed as if it's already been breached.

That's not paranoia. It's the only defensible posture for agent infrastructure.

The question isn't whether the sandbox can be exploited. It's what happens when it is.

The Common Response Isn't Enough

Prompt injection via data is a known attack vector. A table comment buried in schema metadata. A shared file with a hidden command. The standard response is to monitor and sanitize LLM inputs. For a chat interface, that's at least tractable; the input surface is bounded. For a data agent, it isn't: the agent samples rows from tables, reads column names, ingests arbitrary values from whatever the user connected.

So we design around a different assumption: prompt injection will succeed. What matters is what it can do when it does.

SQL injection, network escape, credential theft, cross-tenant data access: everything in our threat model is downstream of prompt injection via data. Every feature that touches the sandbox is built under the assumption that it is already compromised. When a new capability requires the sandbox to have access to something, the bar is: minimum access, maximum containment if that too gets abused.

Three Components, Three Trust Levels

The system has three components.

The web app authenticates users and renders agent output. It never communicates with a sandbox directly.

The control plane is the high-trust zone. It holds LLM API keys, database credentials, and encryption keys. It makes all LLM API calls directly. The sandbox never touches an API key or sees a model response before the control plane does. All sandbox interactions are mediated here: authentication, authorization, and audit logging happen before any request reaches a sandbox.

The sandbox is a disposable container: one per session, untrusted by design. It has access to that session's data and nothing else. It has no credentials, no path to the warehouse, no ability to call back to the control plane. We assume it will be compromised and design accordingly.

Two things follow from this. First, no direct frontend-to-sandbox communication; every interaction is gated through the control plane. No exceptions. Second, and less obvious: the sandbox never connects to the customer's data warehouse. When an analyst works with data from Snowflake or BigQuery, the control plane pulls a scoped extract of only the data the analysis requires into a DuckDB instance inside the sandbox. The agent operates entirely against that extract.

The agent's job is to explore and try things. You don't want that iterative work running live queries against a Snowflake warehouse with production data and real billing costs. More importantly, you don't want a prompt-injected agent with direct warehouse access. The local extract is the blast radius boundary for the data layer: even full sandbox compromise can't touch the source warehouse, because the sandbox was never given the connection string.

Dev note: local extracts and query performance

Cloud data warehouses are OLAP systems, optimized for throughput rather than low-latency interactive queries. Snowflake's Interactive Tables try to address this, but they're a specialized feature rather than a general solution. Our sandboxes achieve the same effect by loading the session's data extract directly onto the pod's ephemeral storage. Typical queries return in under a second, and the p99 is under 5 seconds. The approach is also warehouse-agnostic: the performance profile is the same whether the source is Snowflake, BigQuery, or anything else. Every agent session gets its own isolated, dedicated compute and memory.

The Layers

The architecture works because each layer assumes the previous one has already failed. The first two layers (auth controls and container isolation) are standard for any multi-tenant SaaS. The layers specific to running agent-generated code are where the design gets more involved: node isolation, network restrictions, and the constraints on what the agent can actually execute.

Auth and browser controls

Session tokens expire after 60 seconds. The client refreshes them automatically in the background, so there's no visible interruption, but a stolen token has a hard expiry, and session revocation takes effect on the next refresh cycle. A leaked credential is a ticking clock, not a persistent foothold.

The browser also enforces a strict Content Security Policy: deny by default, explicit whitelist of allowed hosts, HTTPS only. Even if agent output somehow contained malicious content that made it into the rendered UI, the browser won't allow it to phone home to an unauthorized host.
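An illustrative policy in this spirit; the `connect-src` host is a placeholder, not our actual allowlist:

```
Content-Security-Policy:
  default-src 'none';
  script-src 'self';
  style-src 'self';
  img-src 'self';
  connect-src https://api.example.com;
  frame-ancestors 'none';
  upgrade-insecure-requests
```

With `default-src 'none'` as the baseline, anything not explicitly whitelisted, including outbound fetches to an attacker-controlled host, is refused by the browser itself.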

Container per session

Every workbook session gets its own dedicated Docker container (Kubernetes pod in production) with its own filesystem, network namespace, and resource limits. When a session ends, the container is force-removed. No persistent state survives.

Per-session rather than per-tenant: a tenant can have multiple active sessions. Per-session isolation means one compromised session can't affect the same tenant's other work. Cleanup is unconditional: force-remove the container, and there's no persistent state to clean up. The audit record lives in the control plane: every agent action, every query is logged before it ever reaches the sandbox. Forensics don't depend on the container surviving.

No secrets are injected into the container. Credentials stay in the control plane; any data a sandbox needs from external sources is fetched and loaded by the control plane.

Node isolation: GKE Autopilot and gVisor

We run on GKE Autopilot, which means we have no privileged access to the underlying nodes; Google manages them, there's no SSH, and privileged pods are not allowed. A container escape doesn't give an attacker much to work with; there's no node identity to abuse and no path to the broader cluster.

On top of that, sandbox pods run under gVisor. Rather than letting container syscalls pass through to the host kernel (as with runc) or filtering them with seccomp rules, gVisor intercepts every syscall and handles it inside a per-sandbox userspace kernel written in memory-safe Go, so the host kernel is never directly reachable. The two layers together mean a successful container escape is both much harder to achieve and lands somewhere with very little to exploit.
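On GKE, selecting gVisor comes down to a RuntimeClass on the pod. A hedged sketch of what a per-session sandbox pod might look like; the name, labels, image, and resource limits are illustrative, not our actual manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sandbox-session-abc123        # hypothetical: one pod per workbook session
  labels:
    role: sandbox
spec:
  runtimeClassName: gvisor            # GKE Sandbox: syscalls hit gVisor's userspace kernel
  automountServiceAccountToken: false # no cluster identity inside the sandbox
  containers:
    - name: sandbox
      image: example.com/sandbox:latest   # placeholder image
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
          ephemeral-storage: 10Gi     # holds the session's local data extract
```

Disabling the service account token mount complements the Autopilot posture: even code running as root in the container has no Kubernetes identity to present.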

Dev note: gVisor syscall overhead

gVisor adds overhead relative to runc or seccomp-only filtering, because every syscall is handled in userspace rather than passed to the host kernel. For syscall-heavy workloads (lots of I/O, network calls, process forking), that cost is real. DuckDB query execution isn't that workload. Once the extract is loaded into memory, execution is CPU and memory-bandwidth bound: vectorized scans, aggregations, joins, all in userspace. The syscall rate is low, and the overhead is negligible.

Network isolation

Sandbox pods have tightly restricted network egress. Kubernetes firewall rules cut them off from the public internet, from the private VPC (which contains the control plane, databases, and internal services), and explicitly block the cloud metadata server. The metadata endpoint is the dangerous one: in any major cloud provider, a pod that can reach the metadata endpoint can request the node's service account credentials, and from there potentially access any resource that node's identity has permissions for.
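In Kubernetes NetworkPolicy terms, the egress posture can be sketched roughly as follows. The pod label is hypothetical, and the allowed range shown is the Private Google Access restricted VIP; the real rules also live partly at the cloud firewall layer:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-deny-egress
spec:
  podSelector:
    matchLabels:
      role: sandbox              # hypothetical label on sandbox pods
  policyTypes:
    - Egress
  egress:
    # Only this range is reachable. Everything else, including the
    # VPC-internal services and 169.254.169.254 (the metadata server),
    # is dropped by default.
    - to:
        - ipBlock:
            cidr: 199.36.153.8/30   # restricted.googleapis.com (Private Google Access)
      ports:
        - protocol: TCP
          port: 443
```

Because NetworkPolicy egress is deny-by-default once a policy selects the pod, there is no rule to forget: a new destination is unreachable until someone deliberately adds it.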

The one exception is object storage. Sandboxes are allowed to reach it via Private Google Access reserved routes because session snapshots (saving and restoring the state of a workbook) require it. But access is never open-ended: whenever a sandbox needs to snapshot or restore, the control plane mints a downscoped short-lived credential that grants access only to that specific sandbox's namespace in object storage. It cannot read or write any other sandbox's data, and the credential expires shortly after it's issued.

Constraining agent-generated code

The agent produces two kinds of executable output: SQL queries and Python visualization code. Each runs in its own constrained environment, using different containment strategies because they have different properties.

SQL is easy to analyze statically. Each sandbox session runs a DuckDB instance with a hardened configuration applied at startup: extensions blocked, network disabled, filesystem access restricted to specific paths, memory and disk bounded, then locked so no subsequent connection can change those settings. Every SQL string passes through an AST parser before it reaches DuckDB, and we reject anything that isn't exactly one SELECT statement. DROP TABLE, COPY TO, CREATE FUNCTION, batched multi-statements, all rejected before execution, regardless of how they're formatted or obfuscated.
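The production gate parses a full AST; the sketch below approximates the same single-SELECT rule with conservative token-level checks (so it may over-reject, e.g. a forbidden keyword inside a string literal), alongside the kind of settings DuckDB exposes for hardening:

```python
import re

# Applied once at sandbox startup, then locked against later connections.
HARDENING = [
    "SET enable_external_access = false;",  # no filesystem/network reachable from SQL
    "SET lock_configuration = true;",       # subsequent SET statements are rejected
]

# Statement types the gate refuses outright.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|copy|attach|install|load|pragma|call|set)\b",
    re.IGNORECASE,
)

def validate_sql(sql: str) -> bool:
    """Simplified sketch of the gate: exactly one statement, and it must be
    a SELECT (or a WITH ... SELECT). Anything else is rejected pre-execution."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # batched multi-statements
        return False
    if not re.match(r"(?is)^\s*(with\b.*?\bselect\b|select\b)", stripped):
        return False
    return not FORBIDDEN.search(stripped)
```

A real AST-based check is strictly stronger, since it sees through formatting and obfuscation rather than pattern-matching around them; the shape of the decision is the same.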

Python visualization code can't be meaningfully restricted through static analysis the way SQL can. Instead, when the agent produces a visualization, it generates Altair Python code that needs to be compiled into a Vega JSON spec. We execute that inside a stateless Pyodide runtime running server-side in a WebAssembly virtual machine. There's no host filesystem access, no network, no subprocess spawning: not because permission checks forbid them, but because the WASM runtime simply doesn't support them. It starts in under 500ms using pre-built memory snapshots, making it practical as a stateless per-render tool.

Credentials never enter the sandbox

LLM API keys, database credentials, SSO secrets: none of them enter the sandbox. Warehouse credentials (Snowflake, BigQuery, whatever the customer connected) don't enter the sandbox either, because the sandbox doesn't need them; it's working against a local DuckDB extract, not the live warehouse. There's no connection string to steal because there was never one to give.

Communication is strictly one-directional: the control plane talks to the sandbox, and the sandbox has no mechanism to call back. Each sandbox session is provisioned with a uniquely keyed ephemeral API token; this ensures only the control plane can issue instructions to that sandbox, and that no two sessions share an identity. The only credentials it ever receives are the downscoped object storage tokens described above, issued on demand and scoped to its own namespace.

If an attacker achieves full code execution in a sandbox, bypassing every other layer, they still cannot call LLM APIs, access the database containing all tenants' metadata, access other tenants' storage, or persist access beyond the session. The blast radius is bounded by what the sandbox was ever given.

The Sandbox Is Already Compromised

You cannot sanitize an LLM. You cannot guarantee the agent will never be manipulated by something in its context window. What you can do is make sure the infrastructure assumes it already has been, and build the blast radius boundaries accordingly.

The sandbox is already compromised. Build from there.

About the Author

Di Wu

Co-founder & CTO

Principal Engineer at Snowflake, Distinguished Engineer at Rubrik, CTO at BetterWorks, and Engineer at Palantir.
