How we made our OCI filesystem 47× faster

We replaced our user-space filesystem with a real disk image that the VM mounts directly. Here's how we got there, and what fell out along the way.

A user in our Discord said microsandbox felt slow. Listing every file in the Python standard library took 5.3 seconds inside a sandbox; in Docker it took milliseconds. We went digging.

We fixed it in v0.4: we replaced our user-space filesystem with a Linux disk image that the VM mounts directly. The geometric mean speedup across our mixed guest-visible filesystem suite is 47×, with the worst-case rows more than 1,000× faster, and the host filesystem code is about 5,300 lines shorter.

Where this started

My first try was monofs: a content-addressed filesystem with block-level dedup, compression, and distributed read replicas. It stored images at 1.3× their original size on disk, and microsandbox is local-first, so the long-tail dedup payoff wasn't worth the up-front cost. For v0.3 I switched to OCI plus a user-space overlay built on a libkrun hook; we got layer dedup and identical behavior on Linux and macOS, but everything still ran outside the kernel.

Where the time was going

Every file operation inside the VM had to bounce out to the host through FUSE, which is Linux's mechanism for letting an ordinary program act as a filesystem. To open a file, the VM hands the request to our host process, which walks every layer looking for the file and sends the answer back; the same trip happens for every stat, every readdir, and every cache miss. A single Python import triggers dozens of these round trips before your code even starts running, and a ten-layer image multiplies the cost of each one.
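To make the multiplication concrete, here is a toy model of a layered lookup (not microsandbox's actual v0.3 code; the layer contents and paths are made up). Each probe stands in for work done on the host during one round trip.

```python
def lookup(layers, path):
    """Search layers from topmost to lowest; return (layer_index, probes)."""
    probes = 0
    for index in range(len(layers) - 1, -1, -1):
        probes += 1  # one per-layer check on the host
        if path in layers[index]:
            return index, probes
    return None, probes


# Ten hypothetical layers; only the base layer (index 0) holds the file.
layers = [{"/usr/lib/python3/os.py"}] + [set() for _ in range(9)]

found_in, hit_probes = lookup(layers, "/usr/lib/python3/os.py")
missing_in, miss_probes = lookup(layers, "/usr/lib/python3/nope.py")
print(f"hit:  layer {found_in} after {hit_probes} probes")
print(f"miss: {missing_in} after {miss_probes} probes")
```

Files in the base layer and files that don't exist at all are the expensive cases in this model, and imports probe for plenty of paths that don't exist.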

We spent the next stretch of v0.3 trying to make that path faster: better caching, fewer syscalls, smaller responses. Each change shaved a few percent. None of them changed the order of magnitude.

Docker doesn't have this problem because Docker uses the kernel's own layered-filesystem driver (overlayfs), so file operations never leave the kernel. We were trying to match a kernel filesystem from outside the kernel; no cache could close that gap.

So we deleted the filesystem.

The new plan

The new plan was to stop bouncing every file operation between the VM and the host. We'd build a Linux filesystem image ahead of time, hand it to the VM as a virtual disk, and let the VM's own kernel mount it. With FUSE out of the loop, file operations inside the VM would stay inside the VM.

Before: app → guest VFS → virtiofs / FUSE boundary → host filesystem code → layer lookup / overlay logic → response back into the VM
After: app → guest VFS → guest overlayfs → guest EROFS → virtio-blk → cached block-backed image

Before, every lookup crossed the VM/host boundary. After, normal reads and lookups stay inside the guest kernel.

The filesystem we picked is EROFS: read-only, part of the mainline Linux kernel (it was originally built for Android devices), and easy to author. EROFS also solved the macOS problem: the VM's own kernel is Linux regardless of what's running outside it, so once the disk image is built, the host's filesystem stops mattering.

No mkfs, no mount, no helpers

microsandbox runs on both Linux and macOS, and macOS lacks the host-side tools you'd normally use to build a filesystem image: no mkfs.ext4, no mkfs.erofs, no loopback mounts. If our image pipeline depended on any of them, we'd either have to ship a helper VM (heavy, slow to start) or live with a permanent split between platforms, and neither option fit microsandbox's "single self-contained binary" promise. So we wrote the image writers ourselves in Rust. A filesystem is a byte layout on disk; the writers just produce that layout. Three small pieces do the work:

  • An EROFS writer that emits the read-only image of an OCI layer.
  • An ext4 writer that emits the sparse, journaled scratch area each sandbox gets.
  • A VMDK descriptor that stitches everything into one virtual disk.

Nothing in the pipeline shells out, asks for root, or mounts a loopback device, and the same Rust code path builds the images on Linux and Apple Silicon without depending on host-only filesystem tools. The EROFS artifacts round-trip through a reader we also wrote, and CI boots the full stack under the real VM kernel. If a byte is wrong, two different readers tell us about it.
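The "byte layout" idea and the writer/reader round trip can be sketched in a few lines. This is a toy, not a mountable image: it fills in only the EROFS magic number at the superblock offset the kernel expects, and the function names are invented for illustration. A real writer has to lay out the rest of the superblock, inodes, and directory blocks too.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # EROFS places its superblock 1024 bytes in
EROFS_MAGIC = 0xE0F5E1E2   # EROFS_SUPER_MAGIC_V1 in the Linux kernel


def write_toy_image(block_size=4096):
    """Return a one-block buffer with only the magic number filled in."""
    image = bytearray(block_size)
    # The magic is stored as a little-endian 32-bit integer.
    struct.pack_into("<I", image, SUPERBLOCK_OFFSET, EROFS_MAGIC)
    return bytes(image)


def read_magic(image):
    """Independent reader: pull the magic back out of the raw bytes."""
    (magic,) = struct.unpack_from("<I", image, SUPERBLOCK_OFFSET)
    return magic


image = write_toy_image()
if read_magic(image) != EROFS_MAGIC:
    raise SystemExit("writer and reader disagree about the layout")
print(f"magic round-trips: {read_magic(image):#010x}")
```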

The first cut

The obvious way to use these writers was one EROFS image per OCI layer. The VM would get one virtual disk per layer plus one for the scratch area, and the kernel's overlayfs would merge them at boot. It worked: the first measurements landed between 10× and 175× faster than v0.3 depending on the workload, and we were ready to ship.

First cut: layer 1 → /dev/vda, layer 2 → /dev/vdb, layer 3 → /dev/vdc, ... layer 30 → /dev/vd?

One EROFS image per OCI layer. Python images attached ~10 disks; some custom builds pushed past the microVM's virtio device cap.

Then we counted the layers. A Python image runs around ten; CUDA images more; some user-built ones push thirty or forty. microVMs cap how many devices they can carry, and we were attaching one disk per layer. We raised the cap, but the real fix was to stop using virtual disks to tell the VM "this image has layers" when the filesystem could carry that information itself.

The cleaner shape

The EROFS folks pointed us at a feature we hadn't been using: EROFS can build a metadata-only image, just the merged directory tree plus a pointer per file saying which underlying blob holds its bytes and at what offset. The kernel reads that image, treats the whole bundle as one virtual disk, and answers every lookup with a single calculation instead of a search across layers.

The pipeline becomes:

  • Pull the OCI layers as usual.
  • Build one small metadata image describing the merged tree.
  • Hand the VM one virtual disk that stitches the metadata and the layer blobs together.
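A simplified model of that "single calculation", assuming the disk is just the metadata followed by the layer blobs back to back (the `Extent` type, the blob contents, and the index are all hypothetical; real EROFS metadata is a binary on-disk format, not a Python dict):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    blob: int     # index of the layer blob that holds the file's bytes
    offset: int   # byte offset inside that blob
    length: int


def stitch(meta, blobs):
    """Concatenate metadata and blobs; return the disk and each blob's start."""
    disk = bytearray(meta)
    starts = []
    for blob in blobs:
        starts.append(len(disk))
        disk += blob
    return bytes(disk), starts


def read_file(disk, starts, extent):
    # One addition turns (blob, offset) into an absolute disk offset.
    begin = starts[extent.blob] + extent.offset
    return disk[begin:begin + extent.length]


blobs = [b"base-layer:hello.py", b"top-layer:app.py"]
index = {"/hello.py": Extent(0, 11, 8), "/app.py": Extent(1, 10, 6)}
disk, starts = stitch(b"META", blobs)
print(read_file(disk, starts, index["/hello.py"]))
print(read_file(disk, starts, index["/app.py"]))
```

The lookup cost no longer depends on how many blobs there are: the merged index already says which blob to read.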

The VM now only has to attach two rootfs block devices, no matter how many layers the original image had: one read-only VMDK-backed stack for the image (which internally references the merged-metadata image plus the per-layer EROFS extents), and one writable ext4 upper for the sandbox. Overlayfs only ever combines those two. This is the version we shipped, with a small libkrunfw kernel config tweak (CONFIG_EROFS_FS_XATTR + CONFIG_EROFS_FS_SECURITY) so EROFS exposes the xattrs overlayfs needs for whiteouts.
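For readers who haven't assembled an overlay by hand, the guest-side result is equivalent to three mounts. The sketch below only builds the command lines; the mount points (`/lower`, `/upper`, `/newroot`) and the `upper`/`work` directory names are assumptions, and the real guest init may well call mount(2) directly rather than run `mount`.

```python
def rootfs_mount_plan(lower_dev="/dev/vda", upper_dev="/dev/vdb",
                      lower_mnt="/lower", upper_mnt="/upper", root="/newroot"):
    """Return the mount commands that would assemble the overlay root."""
    # overlayfs needs upperdir and workdir on the same filesystem,
    # so both live on the writable ext4 device.
    overlay_opts = ",".join([
        f"lowerdir={lower_mnt}",
        f"upperdir={upper_mnt}/upper",
        f"workdir={upper_mnt}/work",
    ])
    return [
        ["mount", "-t", "erofs", "-o", "ro", lower_dev, lower_mnt],
        ["mount", "-t", "ext4", upper_dev, upper_mnt],
        ["mount", "-t", "overlay", "overlay", "-o", overlay_opts, root],
    ]


plan = rootfs_mount_plan()
for argv in plan:
    print(" ".join(argv))
```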

Inside the pipeline

At pull time, the host materializes each OCI layer into an EROFS artifact keyed by its diff ID, merges the layer metadata with provenance, writes fsmeta.erofs, and emits a VMDK descriptor over fsmeta.erofs plus the layer extents. At sandbox create time, microsandbox creates a sparse upper.ext4 for that sandbox. At boot, the guest sees /dev/vda for the read-only lower stack and /dev/vdb for the writable upper, and Linux overlayfs assembles /.

Image stack: OCI layers → per-layer EROFS artifacts → merged metadata + provenance → fsmeta.erofs → VMDK descriptor over fsmeta + layer extents → /dev/vda (read-only lower)
Sandbox upper: sparse upper.ext4 → /dev/vdb (writable upper/work)
Both feed guest overlayfs, which assembles /.

Two block devices per sandbox boot, regardless of how many layers the original image declared. The host pays the layer-walking cost once at pull time; the guest pays nothing for it at runtime.
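A VMDK descriptor is a small text file that lists extents in order, which is what makes it a convenient way to present several files as one disk. The generator below is a guess at the general shape, not the descriptor microsandbox actually emits: the `createType` value, the CID fields, and the `layer-*.erofs` file names are assumptions. Only `fsmeta.erofs` comes from the pipeline described above.

```python
SECTOR = 512  # VMDK extent sizes are expressed in 512-byte sectors


def vmdk_descriptor(extents, create_type="monolithicFlat"):
    """extents: list of (file_name, size_in_bytes), in on-disk order."""
    lines = [
        "# Disk DescriptorFile",
        "version=1",
        "CID=fffffffe",
        "parentCID=ffffffff",
        f'createType="{create_type}"',  # assumption: the real value may differ
        "",
        "# Extent description",
    ]
    for name, size in extents:
        if size % SECTOR:
            raise ValueError(f"{name}: size must be a multiple of {SECTOR} bytes")
        # access, size in sectors, extent type, backing file, offset in that file
        lines.append(f'RDONLY {size // SECTOR} FLAT "{name}" 0')
    return "\n".join(lines) + "\n"


descriptor = vmdk_descriptor([
    ("fsmeta.erofs", 4096),
    ("layer-0.erofs", 1024 * 1024),   # hypothetical layer artifact names
    ("layer-1.erofs", 2 * 1024 * 1024),
])
print(descriptor)
```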

What it bought us

We ran the same benchmark suite three times against both versions on a Python image, with fresh state between runs. Across fourteen mixed guest-visible filesystem workloads, the geometric mean speedup is 47.18×; eight of those workloads are charted below.

The bars fall into two groups:

  • Rootfs path: the cleanest measure of the new OCI path; these operations now stay inside the guest kernel instead of bouncing through the host.
  • /tmp tmpfs: real guest-visible wins, but from cutting out the FUSE round-trip on guest tmpfs workloads rather than from the new EROFS lower-rootfs path.

Speedup over the v0.3.14 baseline (baseline = 1×; the original chart uses a log scale from 1× to 1,000×; higher is better). The chart legend distinguishes two groups, "Rootfs path" and "/tmp tmpfs".

Workload                 Speedup
file_delete_1k           1109.94×
rename_1k                876.58×
small_file_create_1k     240.78×
metadata_scan_stdlib     240.28×
read_all_py_stdlib       116.40×
deep_tree_traverse       47.16×
concurrent_read_4t       20.93×
random_read_stdlib       4.01×
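A note on why the summary number is a geometric mean: speedups are ratios, and an arithmetic mean would be dominated by the 1,000× rows. The snippet below computes both over the eight charted rows only. The other six workloads aren't listed here, so it will not reproduce the suite-wide 47.18× figure.

```python
import math

# Speedups from the chart above (eight of the fourteen workloads).
charted = {
    "file_delete_1k": 1109.94,
    "rename_1k": 876.58,
    "small_file_create_1k": 240.78,
    "metadata_scan_stdlib": 240.28,
    "read_all_py_stdlib": 116.40,
    "deep_tree_traverse": 47.16,
    "concurrent_read_4t": 20.93,
    "random_read_stdlib": 4.01,
}


def geometric_mean(values):
    """exp of the mean of logs; every ratio gets equal weight."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


geo = geometric_mean(charted.values())
arith = sum(charted.values()) / len(charted)
print(f"geometric mean of charted rows:  {geo:.1f}x")
print(f"arithmetic mean of charted rows: {arith:.1f}x")
```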

metadata_scan_stdlib scans the metadata of every file in the Python standard library. It used to take half a second. It now takes about 2 milliseconds.
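For a feel of what a workload like this exercises, here is a stand-in (not the actual sandbox-bench implementation) that stats every file under the interpreter's standard library directory and times it. Run it inside the sandbox, not on the host, so the number reflects what a user inside the VM would wait for.

```python
import os
import sysconfig
import time


def metadata_scan(root):
    """lstat() every file under root without reading any file contents."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return count


stdlib = sysconfig.get_paths()["stdlib"]
start = time.perf_counter()
files = metadata_scan(stdlib)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{files} files under {stdlib} in {elapsed_ms:.1f} ms")
```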

What we stopped having to worry about

Linux's overlayfs is a large spec, covering whiteouts, opaque directories, hardlinks across copy-up, directory renames, and a handful of xattr conventions that all have to behave exactly right. Our v0.3 reimplemented most of it in user space, and we were still chasing edge cases the day we deleted it. v0.4 doesn't reimplement any of it, because the VM's own kernel does the merging, and the bugs we used to have aren't fixed; they're gone.

The host still has to understand OCI layer semantics, but only once, at pull time. Whiteouts, opaque directories, hardlinks, xattrs, and case-sensitive paths get normalized into the merged metadata tree before fsmeta.erofs is written. After that, the runtime path is ordinary kernel EROFS plus overlayfs.
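The whiteout part of that normalization can be modelled compactly. The marker names come from the OCI image spec (`.wh.<name>` deletes an entry, `.wh..wh..opq` hides everything lower layers put in a directory). The dict-based tree and the `merge_layers` helper are illustrative only, and hardlinks and xattrs are left out.

```python
WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def merge_layers(layers):
    """layers: list of {path: payload} dicts, lowest first. Returns merged tree."""
    merged = {}
    for layer in layers:
        # First apply this layer's deletions to what lower layers contributed.
        for path in layer:
            parent, _, name = path.rpartition("/")
            if name == OPAQUE_MARKER:
                doomed = [p for p in merged if p.startswith(parent + "/")]
            elif name.startswith(WHITEOUT_PREFIX):
                target = f"{parent}/{name[len(WHITEOUT_PREFIX):]}"
                doomed = [p for p in merged
                          if p == target or p.startswith(target + "/")]
            else:
                continue
            for p in doomed:
                del merged[p]
        # Then add this layer's real entries; markers never reach the output.
        for path, payload in layer.items():
            if not path.rpartition("/")[2].startswith(WHITEOUT_PREFIX):
                merged[path] = payload
    return merged


base = {"/etc/a.conf": "a", "/etc/b.conf": "b",
        "/var/cache/x": "x", "/var/cache/y": "y"}
top = {"/etc/.wh.a.conf": "", "/var/cache/.wh..wh..opq": "", "/var/cache/z": "z"}
merged = merge_layers([base, top])
print(sorted(merged))
```

Because this runs once at pull time, the guest never sees a marker file. It sees only the already-merged tree.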

macOS's APFS is case-insensitive by default. Plenty of Linux images contain files whose names differ only by case, and extracting them onto a Mac used to collapse the second into the first. v0.4 never extracts to the host filesystem; the EROFS writer streams the tar straight into a binary image where both names live as distinct entries on disk.
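The difference between extracting and streaming is easy to show with Python's `tarfile` in stream mode. The two-file layer below is synthetic, and an in-memory dict stands in for the image writer. Keys are exact byte-for-byte names, so nothing collapses regardless of the host filesystem.

```python
import io
import tarfile


def build_demo_layer():
    """Build a tiny tar with two names that differ only by case."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in [("README", b"upper"), ("readme", b"lower")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def index_layer(tar_bytes):
    """Stream the tar member by member; never touch the host filesystem."""
    entries = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r|") as tar:
        for member in tar:
            if member.isfile():
                entries[member.name] = tar.extractfile(member).read()
    return entries


entries = index_layer(build_demo_layer())
# How many names would have been lost on a case-insensitive extraction?
collisions = len(entries) - len({name.casefold() for name in entries})
print(sorted(entries), f"case collisions: {collisions}")
```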

What this lets us build on

Because the rootfs is now a real disk image, the surrounding product surface gets cheaper.

  • OCI patches. Rootfs patches users want on top of the image get baked into upper.ext4 before boot, instead of bolted on through a runtime overlay protocol.
  • Shared lower layers. The per-layer EROFS artifacts are content-addressed by diff ID, so two sandboxes that share a base image share those bytes on disk and in cache.
  • Snapshots. A sandbox's writable state is a single ext4 file; preserving or copying it is a file copy.
  • Disk-image roots. A custom non-OCI disk-image rootfs reuses the same block-device boot machinery, minus the fsmerge step in front of it.
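The shared-lower-layers point follows directly from content addressing. In the OCI image spec, a layer's diff ID is the SHA-256 digest of its uncompressed tar, so identical layers map to the same cache entry. The cache directory layout below is invented for illustration.

```python
import hashlib
from pathlib import Path


def diff_id(uncompressed_tar):
    """OCI diff ID: digest of the layer's uncompressed tar bytes."""
    return "sha256:" + hashlib.sha256(uncompressed_tar).hexdigest()


def artifact_path(cache_root, layer_diff_id):
    """Hypothetical cache layout: <root>/<algorithm>/<hex>.erofs"""
    algorithm, _, hexdigest = layer_diff_id.partition(":")
    return Path(cache_root) / algorithm / f"{hexdigest}.erofs"


base_layer = b"pretend this is an uncompressed layer tar"
path_for_image_a = artifact_path("/cache", diff_id(base_layer))
path_for_image_b = artifact_path("/cache", diff_id(base_layer))
other_path = artifact_path("/cache", diff_id(b"a different layer"))
print(path_for_image_a == path_for_image_b, path_for_image_a != other_path)
```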

What this doesn't fix

OCI rootfs only. Bind volumes (host directories you share into the VM) still go through the old path. Their contents can change at any time while the VM is reading from them, which a read-only disk image cannot represent.

First pulls aren't faster. We do more work at pull time now to build the images, though it is parallel across layers and bounded by tar decompression, so it lands close to where it was. Subsequent sandbox creates are faster, because we only emit a sparse scratch image.
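Creating a sparse file is cheap because the filesystem records the length without allocating data blocks. The sketch below shows only that part. A real `upper.ext4` also needs ext4 metadata written into it, which is omitted here, and the 64 MiB size is arbitrary.

```python
import os
import tempfile


def create_sparse(path, size_bytes):
    """Set the file length without writing any data blocks."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


with tempfile.TemporaryDirectory() as tmp:
    upper = os.path.join(tmp, "upper.ext4")
    create_sparse(upper, 64 * 1024 * 1024)
    st = os.stat(upper)
    apparent = st.st_size
    # st_blocks is counted in 512-byte units; it is not available on every platform.
    blocks = getattr(st, "st_blocks", None)
    allocated = None if blocks is None else blocks * 512

print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

On filesystems that support holes, the allocated figure stays far below the apparent size until the sandbox starts writing.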

Writes to the image are still copy-on-write through overlayfs. Modifying a file from the image copies it up into the writable upper, exactly as in any overlay setup. The rootfs wins here are on lookup- and read-heavy paths; the /tmp lifecycle wins in the chart come from /tmp being a guest-side tmpfs by default, which is a separate runtime decision.
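Copy-up is simple to model. In this toy overlay (plain dicts, not the kernel's implementation), reads fall through to the read-only lower, and the first write to an image file copies the whole file into the upper before modifying it.

```python
class ToyOverlay:
    def __init__(self, lower):
        self.lower = dict(lower)  # read-only image contents
        self.upper = {}           # per-sandbox writable layer

    def read(self, path):
        if path in self.upper:
            return self.upper[path]
        return self.lower[path]

    def append(self, path, data):
        if path not in self.upper:
            # Copy-up: the full file lands in the upper before the write.
            self.upper[path] = self.lower.get(path, b"")
        self.upper[path] += data


fs = ToyOverlay({"/etc/motd": b"hello\n"})
before = fs.read("/etc/motd")
fs.append("/etc/motd", b"world\n")
after = fs.read("/etc/motd")
print(before, after, fs.lower["/etc/motd"])
```

The lower copy is untouched after the write, which is why the image can be shared between sandboxes.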

What we would tell our past selves

The boring primitive in the kernel often beats the clever one in user space. Both monofs and our v0.3 overlay were ambitious designs, but EROFS is a boring, in-kernel file format, and for a sandbox rootfs the boring one won. We spent months tuning user-space code before accepting that the structural answer was to stop competing with the kernel and use it.

NIH is fine when the existing thing breaks your design. Shelling out to mkfs.ext4 or mkfs.erofs would have meant either a helper VM or a Linux-only split, both of which would have undone microsandbox's "single self-contained binary" promise. Writing the writers ourselves was the cost of keeping that promise, and we'd make the same trade again.

Stay open to better ideas while shipping. Our first cut was already a big win, and we were tempted to ship it as is. The cleaner shape EROFS suggested looked like a nice-to-have at the time, but holding the PR open another week to absorb it turned a one-off optimization into something we are happy to support long term.

Run benchmarks inside the VM. Timing from the host would have hidden the worst of the FUSE round-trip costs and made the win look smaller than it was. Time the thing your user actually waits on.

Try it

This ships in microsandbox 0.4 and later. Install the CLI:

curl -sSL https://install.microsandbox.dev | sh

Or use the SDK for your language:

uv add microsandbox       # python
npm install microsandbox  # typescript
cargo add microsandbox    # rust

The benchmarks live in their own repo so they can grow into a cross-runtime comparison. With msb on PATH and a fresh ~/.microsandbox:

git clone https://github.com/superradcompany/sandbox-bench
cd sandbox-bench/benches/fs
just bench-quick

Requires just and uv.