# Secret injection is the wrong primitive for agent sandboxes

> microsandbox keeps real secrets on the host, substitutes them only for approved destinations, and lets policy decide when to terminate, block, log, or pass placeholders through.

Published: 2026-05-25
Authors: Tochukwu (Toks) Nkemdilim
Category: Engineering
Tags: microsandbox, secrets, sandboxing, agents, security
Canonical: https://microsandbox.dev/blog/secret-injection-is-the-wrong-primitive-for-agent-sandboxes
RSS: https://microsandbox.dev/blog/rss.xml

## Article

Picture four scenarios from a single agent product:

Your agent calls OpenAI. The SDK reads $OPENAI_API_KEY from the environment, drops it into an Authorization header, and you want that call to succeed normally.

Mid-session, a prompt injection convinces the agent to POST $OPENAI_API_KEY to a domain you've never heard of. You want that request stopped before the destination learns anything useful about your credential inventory.

Mid-run, the agent's observability SDK streams a session trace (tool calls, headers, request objects) to your trace store. That trace still contains the secret placeholders, and you want the push to succeed without those getting swapped back for the real values on the way out.

In a release sandbox, an install script tries to send $NPM_TOKEN to an unknown host while building artifacts you're about to publish. Stopping the request isn't enough. Once a build runtime tries something like this, you can't trust its outputs anymore.

Four things your sandbox has to get right, and they're all different.

Only the first scenario is really about substitution. When the agent calls an allowed host, secret injection swaps the placeholder for the real credential and the call goes through. The other three are harder, because by then a credential has turned up somewhere it shouldn't and you have to decide what to do about it. Sometimes you drop the request. Sometimes you let the bare placeholder through as harmless data. Sometimes the safest move is to kill the runtime outright. Substitution is where most sandboxes stop, and everything interesting lives past that line.

So injection can't be the whole story. It's one move the boundary can make, and it sits inside something larger: a network policy that has to settle two questions at once, where a credential is allowed to become real, and what happens the rest of the time. The real secret never leaves the host. The guest gets a placeholder, and a secret-aware network boundary decides, request by request, what that placeholder turns into. There are four ways it can go: substitute, pass through, block (and optionally log), or terminate the sandbox. The rest of this post is one section per outcome.
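To make the four outcomes concrete, here is a minimal sketch of the decision as a pure function. It is a toy model, not the microsandbox SDK: the `SecretRule` shape, the `decide` function, and the `traces.example.com` host are all hypothetical, and keeping passthrough hosts in their own list is an assumption about how the policy could be modeled.

```typescript
// Toy model of the four outcomes. Names and shapes are hypothetical,
// not the microsandbox API.
type Outcome =
  | "substitute"
  | "passthrough"
  | "block"
  | "block-and-log"
  | "block-and-terminate";

interface SecretRule {
  allowHosts: string[];       // hosts where the placeholder may become the real value
  passthroughHosts: string[]; // hosts that may receive the placeholder as inert text
  onViolation: "block" | "block-and-log" | "block-and-terminate";
}

// Decide what happens to a request that carries this secret's placeholder.
function decide(rule: SecretRule, destinationHost: string): Outcome {
  if (rule.allowHosts.indexOf(destinationHost) !== -1) return "substitute";
  if (rule.passthroughHosts.indexOf(destinationHost) !== -1) return "passthrough";
  return rule.onViolation; // every other destination is a secret violation
}

const exampleRule: SecretRule = {
  allowHosts: ["api.openai.com"],
  passthroughHosts: ["traces.example.com"], // hypothetical trace store
  onViolation: "block",
};

console.log(decide(exampleRule, "api.openai.com"));
console.log(decide(exampleRule, "traces.example.com"));
console.log(decide(exampleRule, "unknown.example.net"));
```

The point of the sketch is that substitution is one branch among several, and the fallthrough branch is where the policy does most of its work.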

Three things matter here: the real secret, the placeholder the guest sees, and the network boundary where policy applies. The real secret stays on the host. The guest VM only ever sees a placeholder string in its environment.

The workload behaves normally. It reads the env var, hands it to an SDK, drops it into an Authorization header, passes it to a CLI. The value just isn't the real credential. The placeholder only matters once outbound traffic reaches a host the secret is allowed to reach.

Taken together, that's a policy. OPENAI_API_KEY is exposed to the guest as a placeholder. The real value can only be substituted for api.openai.com. TLS identity must be verified first. The injection flags can be spelled out for clarity, but they're also the defaults: headers and Basic auth get substitution, while query params and request bodies stay off unless you opt in.
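As a rough stand-in for the shape of that policy, the sketch below models the injection defaults in plain TypeScript. Every name in it (`InjectionScopes`, `SecretPolicy`, `secretPolicy`, and the field names) is hypothetical; only the default values come from the description above. See the Secrets docs for the real configuration surface.

```typescript
// Hypothetical policy shape. Field names are illustrative, not the SDK's.
interface InjectionScopes {
  headers: boolean;
  basicAuth: boolean;
  queryParams: boolean;
  body: boolean;
}

interface SecretPolicy {
  env: string;                 // env var name exposed to the guest as a placeholder
  allowHosts: string[];        // hosts where the real value may be substituted
  requireTlsIdentity: boolean; // destination identity must verify before substitution
  inject: InjectionScopes;
}

// Defaults as described in the text: headers and Basic auth on,
// query params and request bodies off unless you opt in.
const DEFAULT_SCOPES: InjectionScopes = {
  headers: true,
  basicAuth: true,
  queryParams: false,
  body: false,
};

function secretPolicy(
  env: string,
  allowHosts: string[],
  inject: Partial<InjectionScopes> = {},
): SecretPolicy {
  return {
    env,
    allowHosts,
    requireTlsIdentity: true,
    inject: { ...DEFAULT_SCOPES, ...inject },
  };
}

const openAiPolicy = secretPolicy("OPENAI_API_KEY", ["api.openai.com"]);
console.log(openAiPolicy.inject);
```

Spelling the flags out or omitting them produces the same policy in this model, which mirrors the "explicit for clarity, but also the default" point.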

The placeholder string itself is deterministic. For an env var named OPENAI_API_KEY, the guest sees $MSB_OPENAI_API_KEY, not an opaque token like msb_placeholder_8f2e1c. When that value shows up in a log, a stack trace, an error message, or an exported transcript, you can tell at a glance which slot it represents. Traces become self-describing, snapshot tests stay stable across runs, and post-incident review doesn't require decoding a substitution table.
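A small sketch of why deterministic naming is convenient: the placeholder can be derived from the env name, and a log line can be mapped back to slots with a regex. The `$MSB_` prefix is from the example above; the helper names and the assumption that env names are upper-case identifiers are mine.

```typescript
// Derive the placeholder the guest would see for a given env name.
function placeholderFor(envName: string): string {
  return "$MSB_" + envName;
}

// List the secret slots mentioned in a piece of text, in first-seen order.
// Assumes env names look like upper-case identifiers (A-Z, 0-9, underscore).
function slotsIn(text: string): string[] {
  const pattern = /\$MSB_([A-Z][A-Z0-9_]*)/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const name = match[1];
    if (name !== undefined && found.indexOf(name) === -1) found.push(name);
  }
  return found;
}

const logLine =
  "Authorization: Bearer $MSB_OPENAI_API_KEY; npm auth $MSB_NPM_TOKEN (retry with $MSB_NPM_TOKEN)";
console.log(placeholderFor("OPENAI_API_KEY")); // $MSB_OPENAI_API_KEY
console.log(slotsIn(logLine));                 // [ 'OPENAI_API_KEY', 'NPM_TOKEN' ]
```

With an opaque token you would need a lookup table to answer the same question.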

A deterministic placeholder does advertise what credentials this runtime carries. That's exactly what Case 2 below addresses: blocking unknown destinations at the network boundary keeps the inventory private, while deterministic naming inside the runtime keeps it readable.

See the Secrets docs and TypeScript SDK reference for the full surface.

The expected case. Your agent calls OpenAI, GitHub, Stripe, npm, or another service it actually needs. The placeholder appears in outbound traffic, the destination matches the secret's allow list, and microsandbox substitutes the real value at the boundary before letting the request continue.

What makes this more than plain substitution is that the destination is part of the policy. The credential only becomes real for a specific host, and only with the right TLS identity on the other side. Everything else the workload might try with that placeholder is still subject to the other three outcomes.

For HTTPS traffic this runs through microsandbox's host-side TLS proxy. The proxy inspects the request, verifies the destination identity, substitutes only in the configured places, and re-encrypts upstream. SNI alone isn't enough for the identity check, since a client can claim one name while connecting somewhere else. To handle that, microsandbox ties domain rules and host-scoped secret injection back to the DNS answers the sandbox actually received. A hard-coded IP can't borrow an allowed hostname.
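The DNS tie-back can be sketched as a lookup: a claimed hostname only counts if the destination IP is one the sandbox actually received for that name. This is an illustration of the idea, not microsandbox's implementation; the `DnsAnswers` type and `destinationMatchesHost` function are hypothetical, and the addresses come from documentation-only ranges.

```typescript
// hostname -> IPs the sandbox actually received in DNS answers (hypothetical model)
type DnsAnswers = { [host: string]: string[] | undefined };

// A claimed name (for example from SNI) is only honored when the connection's
// destination IP was returned for that name.
function destinationMatchesHost(
  answers: DnsAnswers,
  claimedHost: string,
  destinationIp: string,
): boolean {
  const ips = answers[claimedHost] ?? [];
  return ips.indexOf(destinationIp) !== -1;
}

const observedAnswers: DnsAnswers = {
  "api.openai.com": ["203.0.113.10", "203.0.113.11"], // documentation-range IPs
};

// Resolved normally, then connected: name and IP agree.
console.log(destinationMatchesHost(observedAnswers, "api.openai.com", "203.0.113.10")); // true
// Hard-coded IP claiming an allowed name: no matching answer.
console.log(destinationMatchesHost(observedAnswers, "api.openai.com", "198.51.100.7")); // false
```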

See TLS Interception and the Networking Security Model for the underlying mechanics.

Now the placeholder appears in outbound traffic, but the destination isn't on the secret's allow list and isn't on any passthrough list. That's a secret violation. The default action is block: the request is dropped and the guest sees a connection reset.

Why block when the real credential never leaves? Because the placeholder advertises the credential inventory. An attacker who learns that this runtime has a slot called NPM_TOKEN, even without ever seeing its value, knows it publishes packages. The next attack doesn't aim at the token. It aims at the package: a typosquat, a dependency-chain injection, a malicious postinstall in something the agent is likely to fetch.

Same shape for STRIPE_SECRET (prompt injections aimed at refund or transfer flows), AWS_SECRET_ACCESS_KEY (IAM patterns worth probing), or any other slot. The placeholder names what kinds of credentials the runtime carries. That's enough to plan the next attack.

Blocking keeps that inventory private. If you want the same network behavior plus a host-side warning, use block-and-log.

The remote server never receives the placeholder, and the host still gets enough signal to investigate.

onSecretViolation on the network sets the default action for every secret in the sandbox. An individual secret can override that default with its own onViolation, which is exactly what Case 3 does next.
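The precedence between the two settings can be sketched in a few lines. The option names `onSecretViolation` and `onViolation` are from the text; the surrounding interfaces are hypothetical, and the action type only lists the three block variants named in this post, so the real options may accept more.

```typescript
type ViolationAction = "block" | "block-and-log" | "block-and-terminate";

// Hypothetical config shapes; only the two option names come from the post.
interface NetworkConfig {
  onSecretViolation: ViolationAction; // default for every secret in the sandbox
}

interface SecretConfig {
  env: string;
  onViolation?: ViolationAction; // optional per-secret override
}

// The per-secret setting wins when present; otherwise the network default applies.
function effectiveAction(network: NetworkConfig, secret: SecretConfig): ViolationAction {
  return secret.onViolation ?? network.onSecretViolation;
}

const network: NetworkConfig = { onSecretViolation: "block-and-log" };
const openAiKey: SecretConfig = { env: "OPENAI_API_KEY" };
const npmToken: SecretConfig = { env: "NPM_TOKEN", onViolation: "block-and-terminate" };

console.log(effectiveAction(network, openAiKey)); // block-and-log
console.log(effectiveAction(network, npmToken));  // block-and-terminate
```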

This case is easy to miss if you only think in "allowed" and "blocked." A placeholder outside the substitution allow list isn't always an attack. Sometimes it's just text, sitting inside a log, a trace, a transcript, a replay.

This shows up constantly with agents. An agent calls an allowed API with a placeholder-backed credential. That request is legitimate: the gateway permits it, the secret allow list matches it, microsandbox substitutes the real value at the boundary. But agent systems also keep transcripts. Any of these can end up in that transcript: a tool call, a header, a CLI invocation, a request object, an error message. The transcript now contains the placeholder string.

When that transcript gets shipped to a trusted trace store, eval runner, or session replay system, you want the request to go through with the placeholder unchanged. You don't want substitution. You also don't want trace export to fail just because the transcript contains placeholder text.

Passthrough fixes that. Thanks to @BunkerWells, who implemented it in PR #771. A passthrough host can receive placeholder strings as inert data: no substitution, no block. For every other host, the fallback violation action still applies.

A placeholder can end up in an agent transcript after a legitimate tool call. Sending that transcript to a passthrough host is fine: the destination only sees the placeholder. Sending the same transcript somewhere else is still blocked.
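Here is a minimal sketch of that behavior for a transcript export. It models only the passthrough and block outcomes, with plain block as the fallback; `shipTranscript` and `traces.example.com` are hypothetical names, and the placeholder pattern assumes the `$MSB_` naming described earlier.

```typescript
// Matches any placeholder of the form $MSB_<ENV_NAME>.
const PLACEHOLDER_PATTERN = /\$MSB_[A-Z][A-Z0-9_]*/;

// Returns the text that would reach the destination, or null if the
// request is dropped. No substitution ever happens in this path.
function shipTranscript(
  transcript: string,
  host: string,
  passthroughHosts: string[],
): string | null {
  if (!PLACEHOLDER_PATTERN.test(transcript)) return transcript; // no placeholder involved
  if (passthroughHosts.indexOf(host) !== -1) return transcript; // inert data, unchanged
  return null; // fallback violation action (plain block in this sketch)
}

const transcript = 'tool_call: curl -H "Authorization: Bearer $MSB_OPENAI_API_KEY" ...';
const trusted = ["traces.example.com"]; // hypothetical trace store

console.log(shipTranscript(transcript, "traces.example.com", trusted) === transcript); // true
console.log(shipTranscript(transcript, "unknown.example.net", trusted));               // null
```

The trace store sees exactly what the guest saw: the placeholder, never the real value.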

For pull-shaped exports you can sidestep this entirely with the file API: extract the transcript host-side, upload it yourself, no sandbox egress involved.

Most of the time you don't fully control the workload, though. Customer code, AI-generated tools, and any off-the-shelf agent observability SDK all push during execution, and you can't refactor them into a host-side pull. Anything that needs live traces, like alerts or streaming dashboards, has to push too. Passthrough makes that path safe.

Cases 1, 2, and 3 all decide about a single request: let the real value through, drop the request before it leaves, or let the placeholder ship as data. The runtime survives all three. The next request gets its own decision.

Case 4 doesn't fit that pattern. What's being decided isn't a request anymore. It's the runtime.

Picture a release sandbox with access to an npm publish token, a deploy key, or a signing credential. During npm install, tests, packaging, or signing, something in there tries to send that placeholder to an unknown host. Blocking the request stops the leak. It doesn't undo the fact that something inside the runtime decided to exfiltrate it.

By that point the same process may have already modified build outputs, poisoned generated artifacts, written data somewhere else, registered persistence inside the workspace, or prepared a second egress path through DNS, logs, telemetry, or one of the allowed endpoints. You stop asking whether the secret left and start asking whether anything this runtime produced is still trustworthy.

For build, signing, and deploy workloads, the answer is usually no.

block-and-terminate covers that case. If the token appears in egress to a disallowed host, the request is dropped, the violation is logged, and the sandbox shuts down. The follow-up is operational: discard artifacts from that run, preserve the evidence, rotate or revoke the credential.
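The difference between the three violation actions can be sketched as a state transition on the run. This is a toy model: `RunState`, `Violation`, and `handleViolation` are hypothetical, and the `artifactsTrusted` flag is my way of representing the "discard artifacts" follow-up, not something the sandbox tracks for you.

```typescript
type Action = "block" | "block-and-log" | "block-and-terminate";

interface RunState {
  running: boolean;
  artifactsTrusted: boolean; // illustrative flag for the operational follow-up
  hostLog: string[];         // host-side log, never visible to the guest
}

interface Violation {
  secret: string;
  host: string;
}

// The request is dropped under every action; what differs is what happens next.
function handleViolation(state: RunState, action: Action, v: Violation): RunState {
  const next: RunState = {
    running: state.running,
    artifactsTrusted: state.artifactsTrusted,
    hostLog: state.hostLog.slice(),
  };
  if (action !== "block") {
    next.hostLog.push("secret violation: " + v.secret + " -> " + v.host);
  }
  if (action === "block-and-terminate") {
    next.running = false;          // sandbox shuts down
    next.artifactsTrusted = false; // outputs from this run should be discarded
  }
  return next;
}

const fresh: RunState = { running: true, artifactsTrusted: true, hostLog: [] };
const leak: Violation = { secret: "NPM_TOKEN", host: "unknown.example.net" };
console.log(handleViolation(fresh, "block-and-terminate", leak));
```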

Secret substitution sits on top of a broader network policy. All sandbox traffic flows through a host-controlled networking stack. By default, microsandbox gives sandboxes public internet egress while denying private IP ranges, loopback, link-local addresses, the host gateway, and the cloud metadata endpoint.
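For a feel of what that default covers, the sketch below checks an IPv4 address against the well-known private, loopback, and link-local ranges. It is an approximation under assumptions: `isDeniedByDefault` is a hypothetical helper, it ignores IPv6, it does not model the host gateway (whose address depends on the setup), and it treats the common cloud metadata address 169.254.169.254 as covered by the link-local range.

```typescript
// Rough IPv4-only approximation of the default deny list described above.
function isDeniedByDefault(ipv4: string): boolean {
  const octets = ipv4.split(".").map(Number);
  const valid =
    octets.length === 4 &&
    octets.every((o) => o >= 0 && o <= 255 && Math.floor(o) === o);
  if (!valid) return true; // this sketch fails closed on anything it cannot parse

  const a = octets[0];
  const b = octets[1];
  if (a === 10) return true;                         // 10.0.0.0/8 private
  if (a === 172 && b >= 16 && b <= 31) return true;  // 172.16.0.0/12 private
  if (a === 192 && b === 168) return true;           // 192.168.0.0/16 private
  if (a === 127) return true;                        // 127.0.0.0/8 loopback
  if (a === 169 && b === 254) return true;           // 169.254.0.0/16 link-local, incl. metadata
  return false;                                      // public egress allowed
}

console.log(isDeniedByDefault("169.254.169.254")); // true
console.log(isDeniedByDefault("93.184.216.34"));   // false
```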

You can disable networking entirely, or write an explicit allow/deny policy with first-match-wins rules. DNS is part of the policy too: a DNS lookup is evaluated as egress before the forwarder answers, and domain rules are tied to the DNS responses the sandbox actually received, so connections get checked against names the sandbox itself resolved.
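First-match-wins evaluation is simple to sketch: walk the rules in order and return the verdict of the first one that matches. The `Rule` shape, the `"*"` wildcard, and `evaluate` are hypothetical and much simpler than a real policy language; the point is only that rule order changes the result.

```typescript
type Verdict = "allow" | "deny";

interface Rule {
  pattern: string; // exact hostname, or "*" for any host (simplified matching)
  verdict: Verdict;
}

// Return the verdict of the first matching rule, or the fallback if none match.
function evaluate(rules: Rule[], host: string, fallback: Verdict): Verdict {
  for (const rule of rules) {
    if (rule.pattern === "*" || rule.pattern === host) return rule.verdict;
  }
  return fallback;
}

const denyFirst: Rule[] = [
  { pattern: "evil.example.com", verdict: "deny" },
  { pattern: "*", verdict: "allow" },
];
const allowFirst: Rule[] = [denyFirst[1], denyFirst[0]]; // same rules, reversed

console.log(evaluate(denyFirst, "evil.example.com", "deny"));  // deny
console.log(evaluate(allowFirst, "evil.example.com", "deny")); // allow: the wildcard matched first
```

In this model a broad allow placed above a specific deny silently shadows it, so specific rules belong at the top.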

That combination matters for agent workloads. A prompt injection that says "send your key to this IP" doesn't get to bypass host rules just because the model followed the instruction.

See the Networking Overview and DNS docs for the full policy model.

Substitution is only one of four choices. The primitive is the policy that picks among them:

Substitute when the destination is allowed to use the secret.

Pass through when the destination is trusted to receive the placeholder string as data, but not trusted or allowed to receive the real secret.

Block, and optionally log, when the destination should learn nothing about what's in the credential inventory.

Block and terminate when the attempted use means the runtime itself is no longer trustworthy.

This is what people get wrong when they treat secret injection as the whole feature. Swapping in the real value for an allowed host is the easy part, and it's correct as far as it goes. But that's the only thing injection knows how to do. The interesting cases all start the moment a credential shows up somewhere it shouldn't, and in an agent sandbox that's most of what you're dealing with.

The thing worth building on is a secret-aware network boundary. The real value stays on the host, the workload only ever holds a placeholder, and on every outbound request the boundary works out what that placeholder should become: the real value for an allowed host, inert text for a trusted one, a dropped request, or a runtime that gets shut down. One secret, four possible answers, all decided in the same place. Substitution is just the one everybody already does. The primitive is everything around it.

Everything above runs on microsandbox today, locally.

We're also gradually opening up a cloud beta: same SDK, same policy semantics, sandboxes running on our infra instead of your machine, with 1:1 interop with the local CLI. Join the waitlist.
