<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Amit Limaye]]></title><description><![CDATA[Amit Limaye]]></description><link>https://amitlimaye1.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!XCfR!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Famitlimaye1.substack.com%2Fimg%2Fsubstack.png</url><title>Amit Limaye</title><link>https://amitlimaye1.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 14 Apr 2026 20:16:11 GMT</lastBuildDate><atom:link href="https://amitlimaye1.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Amit Limaye]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[amitlimaye1@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[amitlimaye1@substack.com]]></itunes:email><itunes:name><![CDATA[Amit Limaye]]></itunes:name></itunes:owner><itunes:author><![CDATA[Amit Limaye]]></itunes:author><googleplay:owner><![CDATA[amitlimaye1@substack.com]]></googleplay:owner><googleplay:email><![CDATA[amitlimaye1@substack.com]]></googleplay:email><googleplay:author><![CDATA[Amit Limaye]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[A VM With No Kernel: A Minimal Syscall Shim and a Hypervisor Backstop ]]></title><description><![CDATA[A VM With No Kernel: A Minimal Syscall Shim and a Hypervisor Backstop]]></description><link>https://amitlimaye1.substack.com/p/a-vm-with-no-kernel-a-minimal-syscall</link><guid isPermaLink="false">https://amitlimaye1.substack.com/p/a-vm-with-no-kernel-a-minimal-syscall</guid><dc:creator><![CDATA[Amit Limaye]]></dc:creator><pubDate>Tue, 14 Apr 2026 05:37:57 
GMT</pubDate><content:encoded><![CDATA[<h1>A VM With No Kernel: A Minimal Syscall Shim and a Hypervisor Backstop</h1><p><em>This is post 2 of a series. <a href="https://amitlimaye1.substack.com/p/rewriting-every-syscall-in-a-linux">Post 1 covers the binary rewriter</a>.</em></p><h2>Picking up where we left off</h2><p>Yesterday&#8217;s post covered how a binary rewriter replaces every <code>syscall</code> instruction in a Linux binary with a trap. The process thinks it&#8217;s making system calls. Instead, a small shim intercepts each one, checks policy, and decides what to do. The process runs inside a lightweight KVM-based VM with no operating system &#8212; just the shim.</p><p>That raises two immediate questions: if there&#8217;s no kernel, who handles the syscalls? And what does a VM look like when there&#8217;s no OS inside it?</p><p>This post answers both.</p><h2>What a kernel actually does</h2><p>A Linux kernel is a massive piece of software. It manages hardware, schedules processes, enforces permissions, routes signals, maintains dozens of filesystem types, handles networking from raw Ethernet to TCP congestion control, and implements roughly 450 system calls.
All of this exists because the kernel must handle the general case &#8212; any number of processes, any hardware, any workload.</p><p>But look at a single-process container workload. A Python script that reads data, calls an API, and writes output. What does it actually need from the kernel?</p><table><thead><tr><th>Kernel subsystem</th><th>What it does</th><th>Does a single process need it?</th></tr></thead><tbody><tr><td>Process scheduler</td><td>Decides which process runs next</td><td>No &#8212; one process, it always runs</td></tr><tr><td>IPC (pipes, shared memory, message queues)</td><td>Processes communicate with each other</td><td>No &#8212; nobody to talk to</td></tr><tr><td>User/group permissions</td><td>Controls access between users</td><td>No &#8212; one process, one identity</td></tr><tr><td>Device drivers</td><td>Manages hardware devices</td><td>No &#8212; no hardware to access</td></tr><tr><td>Virtual filesystem layer</td><td>Manages mount namespaces, overlayfs, procfs</td><td>No &#8212; just needs to read and write files</td></tr><tr><td>Signal routing</td><td>Delivers signals between processes</td><td>Minimal &#8212; no sender to receive from</td></tr><tr><td>Multi-process memory management</td><td>COW fork, shared mappings, per-process page tables</td><td>No &#8212; one address space</td></tr><tr><td>Networking stack</td><td>Full TCP/IP, routing, netfilter, socket buffers</td><td>Partial &#8212; needs socket I/O, but not the full stack</td></tr></tbody></table><p>Most of what a kernel does is coordination between competing processes and abstraction over diverse hardware. A single-process workload with known I/O patterns needs almost none of this.</p><h2>The shim: a kernel in 60 syscalls</h2><p>Instead of stripping down a Linux kernel &#8212; which, as discussed in the previous post, leads to entangled dependencies and hacks &#8212; the approach is to write the syscall handlers from scratch. Implement just what the process needs. Nothing more.</p><p>Here&#8217;s the actual dispatch table from the shim.
Every syscall the process makes ends up here:</p><p><strong>File I/O</strong> &#8212; the basics of reading and writing data:</p><ul><li><code>read</code>, <code>write</code>, <code>open</code>, <code>close</code>, <code>openat</code> &#8212; standard file operations</li><li><code>lseek</code> &#8212; seek within a file</li><li><code>stat</code>, <code>fstat</code>, <code>statx</code>, <code>newfstatat</code> &#8212; file metadata</li><li><code>access</code> &#8212; check if a path exists</li><li><code>readlink</code>, <code>readlinkat</code> &#8212; resolve symbolic links</li><li><code>getdents64</code> &#8212; list directory entries</li><li><code>getcwd</code> &#8212; current working directory</li><li><code>pread64</code> &#8212; read at a specific offset</li><li><code>writev</code> &#8212; scatter-gather write</li><li><code>ioctl</code> &#8212; device control (mostly stubbed &#8212; returns <code>ENOTTY</code> for terminals)</li></ul><p><strong>Memory management</strong> &#8212; the process needs to allocate memory:</p><ul><li><code>brk</code> &#8212; extend the heap</li><li><code>mmap</code> &#8212; map anonymous memory (with page tracking)</li><li><code>munmap</code> &#8212; release mapped memory</li><li><code>mprotect</code> &#8212; change page permissions</li><li><code>mremap</code> &#8212; resize a mapping (allocate new, copy, free old)</li><li><code>madvise</code> &#8212; advisory hints (accepted, ignored)</li></ul><p><strong>Networking</strong> &#8212; socket operations for HTTP/API calls:</p><ul><li><code>socket</code> &#8212; create a socket</li><li><code>connect</code> &#8212; connect to a remote host</li><li><code>bind</code> &#8212; bind to a local address</li><li><code>sendto</code>, <code>recvfrom</code> &#8212; send and receive data</li><li><code>getsockname</code>, <code>getpeername</code> &#8212; socket address queries</li><li><code>poll</code> &#8212; wait for I/O readiness</li></ul><p><strong>I/O multiplexing</strong> &#8212; event loops for async runtimes:</p><ul><li><code>epoll_create1</code>, <code>epoll_ctl</code>, <code>epoll_wait</code> &#8212; epoll interface</li><li><code>pipe2</code> &#8212; create a pipe pair</li><li><code>eventfd2</code> &#8212; event notification</li></ul><p><strong>Process identity and time</strong>:</p><ul><li><code>getpid</code>, <code>gettid</code> &#8212; process/thread ID (returns 1)</li><li><code>getuid</code>, <code>getgid</code>, <code>geteuid</code>, <code>getegid</code> &#8212; user/group IDs (returns 1000)</li><li><code>uname</code> &#8212; system identification</li><li><code>clock_gettime</code> &#8212; high-resolution timestamps (computed from TSC, no VM exit)</li><li><code>getrandom</code> &#8212; random bytes</li></ul><p><strong>Process lifecycle</strong>:</p><ul><li><code>exit</code>, <code>exit_group</code> &#8212; terminate</li><li><code>clone</code>, <code>fork</code>, <code>vfork</code> &#8212; spawn (escalated to hypervisor &#8212; creates a new VM)</li><li><code>execve</code> &#8212; execute a new binary (escalated to hypervisor)</li><li><code>wait4</code> &#8212; wait for child (escalated to hypervisor)</li></ul><p><strong>Runtime stubs</strong> &#8212; syscalls that runtimes like Python/musl probe for during startup. They don&#8217;t do real work, but returning an error would cause the runtime to crash or fall into slow paths:</p><ul><li><code>rt_sigaction</code>, <code>rt_sigprocmask</code> &#8212; signal handling (returns 0, no signals delivered)</li><li><code>sigaltstack</code> &#8212; alternate signal stack (returns 0)</li><li><code>set_tid_address</code>, <code>set_robust_list</code> &#8212; thread setup (returns safe defaults)</li><li><code>arch_prctl</code> &#8212; set FS/GS base for TLS</li><li><code>prlimit64</code> &#8212; resource limits (returns configured maximums)</li><li><code>futex</code> &#8212; futex operations (returns 0 &#8212; safe for single-threaded)</li><li><code>rseq</code> &#8212; restartable sequences (returns <code>ENOSYS</code>, glibc handles this gracefully)</li></ul><p>That&#8217;s roughly 60 syscalls that the shim handles today &#8212; enough to run a statically-linked CPython 3.12 binary through startup, HTTP calls, file I/O, and shutdown. Other runtimes will have different requirements.
Go&#8217;s runtime probes different syscalls at startup. Node.js with V8 exercises a different set. The dispatch table grows as we test against more workloads &#8212; each new runtime might add a handful of cases. But the shape holds: single-process workloads use a small fraction of the 450 syscalls Linux provides, and the hypervisor backstop means we don&#8217;t need to implement everything before we start.</p><h2>No ring 3: the process runs at ring 0</h2><p>Here&#8217;s a detail that surprises people: the guest process runs at ring 0 &#8212; the same privilege level as the shim. There&#8217;s no user/kernel boundary inside the VM. No ring 3 to ring 0 transition on each syscall. No <code>SYSENTER</code>/<code>SYSEXIT</code> overhead.</p><p>In a traditional OS, the ring 0/ring 3 split exists to protect the kernel from the process. But in a single-process VM, there&#8217;s nothing to protect &#8212; the shim <em>is</em> the kernel, and the process is the only thing running. The real isolation boundary isn&#8217;t between rings inside the VM. It&#8217;s the VM itself. KVM and the hypervisor enforce that the guest &#8212; shim and process together &#8212; can&#8217;t touch host memory, can&#8217;t access host devices, can&#8217;t escape the VM. That boundary is enforced by hardware (VT-x, EPT page tables), not by ring transitions.</p><p>Running everything at ring 0 has practical benefits. The <code>INT3</code> trap from the rewritten <code>syscall</code> instruction stays within ring 0 &#8212; same-privilege interrupt, no stack switch, no segment reload. The CPU pushes three values (RIP, CS, RFLAGS) instead of five (which includes a stack switch for cross-ring interrupts). The shim handler runs, computes the result, and <code>IRETQ</code> returns to the process. The round-trip is faster because there&#8217;s no privilege transition to perform.</p><p>It also simplifies the shim. No need for separate kernel and user page tables. 
No need for <code>SWAPGS</code> to switch segment bases. No need to manage TSS entries for stack switching. The shim&#8217;s page tables are the process&#8217;s page tables. Everything that would normally exist to maintain the user/kernel boundary &#8212; and everything that can go wrong at that boundary &#8212; is simply absent.</p><p>The natural concern is: if the process runs at ring 0, can&#8217;t it overwrite the shim? Today, the shim&#8217;s code pages are mapped read-only in the guest page tables, so a direct write faults. But the process also runs at ring 0, which means it could in theory modify the page tables themselves. The answer to this is hardware memory protection keys &#8212; Intel&#8217;s PKS (Protection Keys for Supervisor). With PKS, the shim&#8217;s pages are tagged with a key that the process cannot write to, even at ring 0. The page table pages themselves get the same treatment. This is a hardware-enforced separation within a single privilege level &#8212; no ring transition needed, no performance cost. This is on the roadmap before shipping &#8212; it&#8217;s what protects the shim&#8217;s process state, the policy table, and the page tables from modification by the guest code. Only the shim can operate on that memory. The architecture is designed around PKS from the start; it just hasn&#8217;t been wired up yet.</p><h2>Three tiers of handling</h2><p>Not all syscalls are handled the same way. The shim has three distinct paths, and the choice matters for both performance and security:</p><p><strong>Tier 1: Emulate in the shim (nanoseconds)</strong></p><p>Most syscalls never leave the VM. <code>brk</code> extends a pointer. <code>getpid</code> returns 1. <code>clock_gettime</code> reads the TSC and converts it to nanoseconds using a frequency the hypervisor provided at boot. <code>mmap</code> tracks allocations in a simple list. 
<code>read</code> and <code>write</code> on file descriptors operate on ring buffers in shared memory.</p><p>No VM exit. No hypervisor involvement. The shim computes the result and returns it directly. This is the fast path, and it&#8217;s where the vast majority of syscalls go.</p><p><strong>Tier 2: Delegate to the hypervisor (microseconds)</strong></p><p>Some operations genuinely need host resources. <code>connect()</code> needs to open a real TCP connection. <code>fork()</code> needs to create a new VM. Writing to <code>stdout</code> (fd 1) needs to reach the host terminal.</p><p>For these, the shim writes the syscall number and arguments into a shared memory region &#8212; the governance mailbox &#8212; and triggers a VM exit via an I/O port write. The hypervisor reads the mailbox, performs the real operation, writes the result back, and resumes the VM. One round-trip, a few microseconds.</p><p><strong>Tier 3: Deny (nanoseconds)</strong></p><p>Anything not in the dispatch table falls through to the default case. The shim escalates it to the hypervisor, which checks policy. If the syscall isn&#8217;t authorized &#8212; and for the ~390 syscalls that aren&#8217;t implemented, it never is &#8212; the hypervisor returns <code>-EPERM</code> or <code>-ENOSYS</code>.</p><p>The denied syscall never executes. No side effects, no partial state changes. The process gets an error code and continues.</p><pre><code><code>Syscall arrives at shim
  &#9474;
  &#9500;&#9472; In dispatch table?
  &#9474;    &#9500;&#9472; Can emulate locally? &#8594; handle in shim (ns)
  &#9474;    &#9492;&#9472; Needs host resources? &#8594; governance mailbox &#8594; VMexit &#8594; hypervisor (&#181;s)
  &#9474;
  &#9492;&#9472; Not in dispatch table &#8594; escalate &#8594; hypervisor denies (-EPERM / -ENOSYS)
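
// The routing above as a small Rust sketch (illustrative only: the enum,
// function, and tier assignments are ours; syscall numbers are x86-64 Linux).
```rust
// Illustrative routing sketch, not the shim's actual dispatch code.
enum Route {
    Emulate,  // Tier 1: handle in the shim, no VM exit
    Delegate, // Tier 2: governance mailbox -> hypervisor
    Deny,     // Tier 3: falls through, hypervisor returns -EPERM / -ENOSYS
}

fn route(nr: u64) -> Route {
    match nr {
        12 | 39 | 228 => Route::Emulate,  // brk, getpid, clock_gettime
        42 | 57 | 59 => Route::Delegate,  // connect, fork, execve
        _ => Route::Deny,                 // not in the dispatch table
    }
}
```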
</code></code></pre><h2>The hypervisor as escape hatch</h2><p>This is the key difference from unikernels.</p><p>Unikernels compile the application and a minimal kernel into a single image. When the unikernel encounters something it can&#8217;t handle &#8212; a system call it didn&#8217;t implement, a device it doesn&#8217;t have a driver for, a networking edge case &#8212; it&#8217;s stuck. There&#8217;s no fallback. The scope of what you must implement up front is enormous, and getting it wrong means the application crashes.</p><p>The shim doesn&#8217;t have this problem. Anything it can&#8217;t handle, it escalates to the hypervisor. The hypervisor is a normal Linux process on the host, with full access to the host kernel. It can make real system calls, open real sockets, access real files. The guest process doesn&#8217;t know the difference &#8212; it made a system call and got a result back.</p><p>This changes the engineering economics completely:</p><table><thead><tr><th></th><th>Unikernel</th><th>Shim + Hypervisor</th></tr></thead><tbody><tr><td>Must implement before shipping</td><td>Everything the app needs</td><td>Just the hot path</td></tr><tr><td>Handling edge cases</td><td>Crash or return error</td><td>Delegate to hypervisor</td></tr><tr><td>Adding new syscall support</td><td>Rebuild and redeploy the image</td><td>Add a case to the dispatch table</td></tr><tr><td>Unmodified binaries</td><td>Usually no &#8212; need to recompile</td><td>Yes &#8212; binary rewriter handles it</td></tr></tbody></table><p>This doesn&#8217;t mean delegation is free. Every syscall that gets escalated to the hypervisor is a per-syscall design decision. How much guest state does the hypervisor need to read? Can it access the guest&#8217;s memory buffers directly, or does data need to be copied? Does the hypervisor need to maintain state across multiple calls (like a file position for sequential reads)?
Can the operation be performed asynchronously, or does the guest block until the hypervisor responds?</p><p>For <code>write(1, buf, len)</code> this is straightforward &#8212; the hypervisor reads <code>len</code> bytes from the guest&#8217;s buffer and writes them to the host&#8217;s stdout. For <code>connect()</code> it&#8217;s more involved &#8212; the hypervisor needs to perform a real TCP handshake on the host, manage the resulting socket, and set up a ring buffer pair for subsequent I/O. For <code>fork()</code> it&#8217;s a major operation &#8212; snapshot the guest&#8217;s memory, send it to a pool daemon, spin up a new VM with the snapshot.</p><p>Each delegated syscall is a small protocol between the shim and the hypervisor. The governance mailbox carries the arguments, but the hypervisor needs to know what those arguments mean and how to act on them. This is engineering work &#8212; not as much as implementing a full kernel, but not zero either.</p><p>The system works because the set of syscalls that actually need delegation is small. Most calls are emulated locally in the shim. The few that need the hypervisor are well-defined, and you add them one at a time as workloads demand them.</p><h2>Two examples from the code</h2><p>Abstract tiers are useful, but code makes it concrete. Here&#8217;s one syscall from each of the first two tiers.</p><h3>Tier 1: <code>clock_gettime</code> &#8212; emulated in the shim, zero VM exits</h3><p>When a process calls <code>clock_gettime(CLOCK_REALTIME, &amp;ts)</code>, a normal kernel goes through the vDSO, reads clock sources, applies NTP adjustments. The shim does this instead:</p><pre><code><code>pub fn now_ns() -&gt; u64 {
    let tsc_at_boot = clock_field(0);
    let unix_ns_at_boot = clock_field(8);
    let tsc_freq_khz = clock_field(16);

    if tsc_freq_khz == 0 {
        return 0;
    }

    let elapsed_tsc = rdtsc().wrapping_sub(tsc_at_boot);
    let elapsed_us = elapsed_tsc / (tsc_freq_khz / 1000);
    unix_ns_at_boot + elapsed_us * 1000
}
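
// For concreteness, the same arithmetic can be modeled as a pure function and
// checked with sample numbers. This helper is ours, for illustration; note the
// integer math truncates to microsecond granularity, then scales back to ns.
```rust
// Illustrative model of now_ns()'s conversion; not shim source.
fn tsc_to_unix_ns(tsc_now: u64, tsc_at_boot: u64, unix_ns_at_boot: u64, tsc_freq_khz: u64) -> u64 {
    if tsc_freq_khz == 0 {
        return 0; // clock data not yet written by the hypervisor
    }
    let elapsed_tsc = tsc_now.wrapping_sub(tsc_at_boot);
    // kHz / 1000 = TSC ticks per microsecond
    let elapsed_us = elapsed_tsc / (tsc_freq_khz / 1000);
    unix_ns_at_boot + elapsed_us * 1000
}

// e.g. a 3 GHz TSC (3_000_000 kHz): 3 billion ticks after boot is one second.
```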
</code></code></pre><p>The hypervisor writes three values into a known memory location at boot: the TSC value at boot time, the corresponding Unix timestamp, and the TSC frequency. The shim reads the TSC directly with <code>rdtsc</code> &#8212; which doesn&#8217;t cause a VM exit on modern CPUs &#8212; computes the elapsed time, and returns wall-clock nanoseconds.</p><p>No VM exit. No host interaction. A Python process calling <code>time.time()</code> millions of times in a loop pays nanoseconds per call instead of microseconds.</p><p>This is a practical shortcut, not the final design. TSC-based time drifts over long-running VMs because there&#8217;s no NTP correction. For short-lived agent tasks (seconds to minutes), the drift is negligible. For longer workloads, we&#8217;d use KVM&#8217;s paravirtual clock (<code>kvm-clock</code>) or periodically sync against the host. The architecture supports either &#8212; the shim just reads from a fixed memory location, and what the hypervisor writes there can change without touching the shim code.</p><h3>Tier 2: <code>write(1, ...)</code> &#8212; delegated via the governance mailbox</h3><p>When a process writes to stdout, the output needs to reach the host terminal. The shim can&#8217;t handle this locally &#8212; it needs the hypervisor. This is the escalation path:</p><pre><code><code>pub fn escalate(nr: u64, a1: u64, a2: u64, a3: u64, a4: u64, a5: u64) -&gt; u64 {
    unsafe {
        let mb = mailbox();

        // Write syscall number and arguments into the shared mailbox
        mb.add(0).write_volatile(nr);
        mb.add(1).write_volatile(a1);
        mb.add(2).write_volatile(a2);
        mb.add(3).write_volatile(a3);
        mb.add(4).write_volatile(a4);
        mb.add(5).write_volatile(a5);
        mb.add(7).write_volatile(0);  // pre-clear return value

        compiler_fence(Ordering::SeqCst);

        // Trigger KVM_EXIT_IO &#8594; hypervisor wakes and handles the request
        outl(NEXUS_GOV_PORT, nr as u32);

        compiler_fence(Ordering::SeqCst);

        mb.add(7).read_volatile()  // return value written by hypervisor
    }
}
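
// The round-trip protocol can be modeled in plain userspace Rust, with a
// closure standing in for the hypervisor's KVM_EXIT_IO handler (illustrative
// only: the real mailbox sits at a fixed guest-physical address and the
// trigger is the outl above, not a function call).
```rust
// Userspace model of the mailbox protocol; names are ours, not shim source.
fn guest_escalate(
    mb: &mut [u64; 8],
    nr: u64,
    args: [u64; 6],
    hypervisor: impl FnOnce(&mut [u64; 8]), // stands in for outl -> KVM_EXIT_IO
) -> u64 {
    mb[0] = nr;                   // syscall number
    mb[1..7].copy_from_slice(&args);
    mb[7] = 0;                    // pre-clear return value
    hypervisor(mb);               // host reads nr/args, acts, writes mb[7]
    mb[7]                         // return value written by "hypervisor"
}
```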
</code></code></pre><p>The mailbox is a fixed-address struct in guest memory &#8212; 8 qwords: syscall number, 6 arguments, and a return value. The shim fills in the arguments, executes <code>outl</code> on a designated I/O port, which triggers <code>KVM_EXIT_IO</code> on the host side. The hypervisor reads the mailbox, performs the real <code>write(1, buf, len)</code> on the host, writes the result back into the mailbox, and resumes the VM. One round-trip, a few microseconds.</p><p>No virtio queues. No shared ring buffers for control. No feature negotiation. Just a struct in memory and an I/O port trigger. The simplicity is deliberate &#8212; this is a hot path that needs to be auditable in minutes, not days.</p><h2>The guest memory layout: every page accounted for</h2><p>So the shim handles syscalls and the hypervisor handles delegation. But where does all of this live in memory? In a normal VM, the guest OS manages its own address space &#8212; the hypervisor hands it a chunk of RAM and lets it allocate. Here, there&#8217;s no guest OS to manage anything. The hypervisor places every page with a specific purpose, and the guest gets exactly what it needs.</p><p>The address space is split into two halves via two PDPT entries:</p><pre><code><code>PDPT[0] &#8594; Low memory  (0x00000000 - 0x3FFFFFFF, 1 GiB)
           ELF binaries, brk heap. This is where the process lives.

PDPT[1] &#8594; High memory (0x40000000 - 0x5FFFFFFF, 512 MiB)
           System area: page tables, shim, stack, rings, mmap pool.
           No user code loads here.
</code></code></pre><p>Low memory is entirely for the guest binary. ELF segments load at their toolchain-native addresses (0x200000, 0x400000, etc.) without conflicting with system infrastructure. The first 2MB (PD_low[0]) is not present &#8212; any NULL pointer dereference traps immediately.</p><p>The system area at 1 GiB has a fixed layout:</p><pre><code><code>SYS_BASE + Offset    Size     Purpose
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
0x0000               4KB      PML4 (page map level 4)
0x1000               4KB      PDPT (page directory pointer table)
0x2000               4KB      PD_low (page directory for 0-1GiB)
0x3000               4KB      PD_high (page directory for 1-2GiB)
0x4000               4KB      GDT + IDT + governance mailbox
  +0x100                        IDT (256 entries)
  +0x4E00                       GOV mailbox (64 bytes)
0x5000               ~28KB    Shim code (.text + .rodata)
0x18000              96KB     mmap page tables (24 PTs)
0x20000              4KB      Shim data page
  +0x800                        Initial brk GPA
  +0x808                        Heap limit
  +0x900                        Policy table (128 &#215; 8 bytes)
  +0xD00                        Clock data (TSC freq, boot time)
0x200000             2MB      Guest stack (grows down)
0x1000000            48MB     mmap region (bump allocator)
</code></code></pre><p>Every page has a specific purpose. There&#8217;s no general-purpose allocator, no free list, no dynamic allocation of system structures. The hypervisor knows the exact state of every byte before the guest starts.</p><h3>Page tables are read-only</h3><p>The page table pages (PML4, PDPT, PD_low, PD_high) are mapped read-only in the guest. The guest cannot modify its own address space. It can&#8217;t mark new pages as executable. It can&#8217;t remap shim memory as writable. It can&#8217;t create new mappings outside the regions the hypervisor set up.</p><p>This isn&#8217;t enforced by a policy check &#8212; it&#8217;s enforced by the page table permissions themselves. A write to any page table page triggers a fault. There&#8217;s no syscall to call, no privilege to escalate. The mechanism for changing the memory layout simply doesn&#8217;t exist inside the guest.</p><h3>The governance mailbox</h3><p>At offset 0x4E00 in the GDT page sits the governance mailbox &#8212; 64 bytes of shared memory between the shim and the hypervisor:</p><pre><code><code>Offset   Type    Field
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
0x00     u64     syscall_nr
0x08     u64     arg1
0x10     u64     arg2
0x18     u64     arg3
0x20     u64     arg4
0x28     u64     arg5
0x30     u64     arg6
0x38     u64     return value
</code></code></pre><p>The shim writes the syscall number and arguments, triggers <code>KVM_EXIT_IO</code> via an <code>outl</code> instruction, and the hypervisor reads the mailbox on the host side. After handling the request, the hypervisor writes the return value and resumes the VM.</p><p>Why not virtio? Virtio is designed for high-throughput data transfer between guest and host &#8212; descriptor rings, available/used ring buffers, feature negotiation, driver initialization. For a syscall escalation path that carries 8 values per invocation, that&#8217;s enormous overhead. The mailbox is a fixed struct at a fixed address. No initialization. No negotiation. No driver code in the guest.</p><h3>Ring buffers for network I/O</h3><p>The governance mailbox handles control flow &#8212; syscall escalation. But for streaming network I/O, a synchronous mailbox is too slow. Each <code>send()</code> or <code>recv()</code> would need a VM exit to move data.</p><p>Network sockets use ring buffers in shared memory. Each connected socket gets a ring pair &#8212; one for transmit, one for receive &#8212; mapped at a fixed GPA region starting at 0xC0000000 (3 GiB):</p><pre><code><code>Ring pair layout (per socket):

  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
  &#9474; Control (64 bytes)               &#9474;
  &#9474;   head, tail, capacity, flags    &#9474;
  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
  &#9474; TX ring (64KB)                   &#9474;
  &#9474;   shim writes &#8594; hypervisor reads &#9474;
  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
  &#9474; RX ring (64KB)                   &#9474;
  &#9474;   hypervisor writes &#8594; shim reads &#9474;
  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;

  64 ring pairs total, ~8 MiB
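
// In code, a ring of this shape reduces to a few lines. This is a
// single-threaded sketch with monotonic head/tail counters; the real rings
// use volatile, cache-line-aligned indices in shared memory.
```rust
// SPSC byte ring sketch; illustrative, not the shim's implementation.
struct Ring { buf: Vec<u8>, head: usize, tail: usize } // head: read, tail: write

impl Ring {
    fn new(cap: usize) -> Self { Ring { buf: vec![0; cap], head: 0, tail: 0 } }

    // Producer side: copy as many bytes as fit, advance tail.
    fn push(&mut self, data: &[u8]) -> usize {
        let cap = self.buf.len();
        let n = data.len().min(cap - (self.tail - self.head));
        for i in 0..n { self.buf[(self.tail + i) % cap] = data[i]; }
        self.tail += n;
        n
    }

    // Consumer side: copy out what's available, advance head.
    fn pop(&mut self, out: &mut [u8]) -> usize {
        let cap = self.buf.len();
        let n = out.len().min(self.tail - self.head);
        for i in 0..n { out[i] = self.buf[(self.head + i) % cap]; }
        self.head += n;
        n
    }
}
```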
</code></code></pre><p>Single producer, single consumer. Lock-free. The shim writes to the TX ring and the hypervisor drains it on the host side, sending the data over the real TCP connection. The hypervisor fills the RX ring with incoming data from the network, and the shim reads from it. Head and tail pointers are cache-line aligned to avoid false sharing.</p><p>What we&#8217;re doing here is essentially emulating what a network device provides at its lowest level &#8212; a TX ring and an RX ring for moving bytes between software and the wire. The difference is that a real NIC&#8217;s ring buffer feeds into a full TCP/IP stack in the guest kernel: socket buffers, congestion control, routing tables, netfilter, segmentation offload. The guest doesn&#8217;t need any of that. The entire TCP/IP stack is delegated to the hypervisor. The shim presents a socket API to the process, translates it into ring buffer reads and writes, and the hypervisor handles the actual protocol work on the host &#8212; where a mature, battle-tested network stack already exists.</p><p>When a process calls <code>send(fd, buf, len)</code>, the shim copies data into the socket&#8217;s TX ring and returns immediately &#8212; no VM exit. For <code>recv()</code>, the shim checks the RX ring; if data is available, it copies it out without a VM exit. Only when the ring is empty does the shim stall or escalate.</p><p>The ring buffers are backed by a shared <code>memfd</code> &#8212; the hypervisor and the guest see the same physical pages. The common case (data available, ring not full) involves zero VM exits.</p><h3>File I/O: page-backed VFS cache</h3><p>File I/O is fundamentally different from network I/O &#8212; files have random access, seek positions, and known sizes. A ring buffer doesn&#8217;t make sense here.</p><p>Instead, file contents live in a page-backed cache region at GPA 0x80000000 (2 GiB). 
The hypervisor loads file data into this region before or during guest execution, backed by hugepages when available. Each file in the VFS metadata table points to a data GPA within this cache where its contents reside.</p><p>This isn&#8217;t a traditional buffer cache. In a multi-process environment, the kernel&#8217;s page cache is a shared resource &#8212; it has to handle eviction pressure from competing processes, speculative readahead that may be wasted, and cache coherency across processes accessing the same file. None of that applies here. There&#8217;s one process. We know exactly what it will need at startup (the runtime), and the policy tells us what task-specific files it will access. There&#8217;s no contention, no eviction, no wasted readahead. We can make informed decisions about what to pre-load and what to load lazily &#8212; the noise of a multi-process environment is simply absent.</p><p>When the process calls <code>read(fd, buf, len)</code>, the shim looks up the fd&#8217;s data GPA and current file position, then copies directly from the cache pages into the process&#8217;s buffer. No ring buffer, no VM exit &#8212; just a <code>memcpy</code> from one guest address to another. <code>lseek()</code> updates the position. <code>pread()</code> reads at an arbitrary offset. Random access works naturally because the file is just a contiguous region of memory.</p><p>Where does the file data come from? The hypervisor reads it from the host filesystem. The policy configuration specifies a mapping &#8212; host path to guest path &#8212; and the hypervisor loads the contents into the cache region. But not all files are loaded the same way.</p><p><strong>Pre-cached files</strong> are loaded before the guest starts. These are files that the runtime needs immediately at startup &#8212; Python&#8217;s standard library, SSL certificates, shared configuration. 
If CPython tries to <code>import os</code> and the file isn&#8217;t there yet, the interpreter crashes before user code ever runs. Pre-caching ensures the runtime boots without stalling. The cost is paid once, before <code>KVM_RUN</code>, and can be amortized across a warm pool of pre-booted VMs that all share the same base files.</p><p><strong>Lazy-loaded files</strong> are loaded on demand &#8212; the metadata entry exists in the table (the guest can see the file in directory listings and stat it), but the data pages aren&#8217;t populated yet. When the process actually reads the file, the shim checks a <code>data_ready</code> flag in the metadata entry. If the data isn&#8217;t loaded, it signals the hypervisor via an I/O port and spins until the flag is set. The hypervisor loads the data from the host into the cache pages, sets the flag, and the shim resumes.</p><p>Lazy loading matters for several reasons. First, task-specific data: a warm pool of pre-booted VMs can&#8217;t know what task they&#8217;ll run &#8212; the input data is different every time. The VM boots with the runtime pre-cached, gets assigned a task, and task-specific files are loaded lazily when the process first touches them. The first read pays a VM exit; every subsequent read is a memcpy.</p><p>Second, memory. A Python installation with its standard library and common packages can be hundreds of megabytes. Most of those files are never touched in a given task &#8212; the process imports a handful of modules, not the entire stdlib. Pre-caching everything would make each VM far too memory-hungry, especially when running hundreds of them on a single machine. Lazy loading means the VM only pays memory for files the process actually reads.</p><p>Third, and this is something we&#8217;re still exploring: dynamic loading opens the door to content hooks. 
Because the hypervisor controls when and how file data is loaded into the cache, it has an interception point &#8212; it can inspect, transform, or substitute file contents before the guest sees them. A file that contains sensitive data on the host could be redacted based on the agent&#8217;s policy before being loaded into the guest cache. A configuration file could be rewritten per-task based on the agent&#8217;s permissions. The guest&#8217;s <code>read()</code> returns whatever the hypervisor placed in the cache pages, and the guest has no way to know whether that matches what&#8217;s on disk. This is future work, but the architecture supports it because of the lazy loading design.</p><p>The guest never sees host paths. It sees whatever guest paths the policy defines. A file at <code>/data/customers.csv</code> in the guest might come from <code>/var/tasks/job-4821/input.csv</code> on the host. The mapping is entirely controlled by the hypervisor &#8212; the guest has no way to discover or access host paths that aren&#8217;t in its VFS configuration.</p><p>To be clear: this is not a filesystem. There&#8217;s no ext4, no overlayfs, no mount table, no inodes, no dentries. It&#8217;s a flat list of memory regions with names. <code>open("/data/input.csv")</code> walks the metadata table for a match. Found &#8594; allocate an fd. Not found &#8594; <code>-ENOENT</code>. 
The file isn&#8217;t hidden behind an access-denied error; it doesn&#8217;t exist because there is literally nothing at that path.</p><p>This eliminates entire categories of filesystem attacks:</p><table><thead><tr><th>Attack</th><th>Traditional filesystem</th><th>In-memory VFS</th></tr></thead><tbody><tr><td>Path traversal (<code>../../etc/shadow</code>)</td><td>Possible if misconfigured</td><td>No directory tree to traverse</td></tr><tr><td>Symlink following</td><td>Possible</td><td>Not implemented today &#8212; if added, resolved in the metadata table, not the filesystem</td></tr><tr><td>TOCTOU race conditions</td><td>Possible</td><td>Single process, no concurrent mutation &#8212; contents are fixed once loaded</td></tr><tr><td>Mount namespace escape</td><td>Possible with privileges</td><td>No mount concept</td></tr><tr><td>Inode/dentry exhaustion</td><td>Possible</td><td>Fixed table, no allocation</td></tr></tbody></table><div><hr></div><h2>What this means for security</h2><p>Every syscall that isn&#8217;t implemented in the shim doesn&#8217;t exist as an attack surface. But more importantly, even the syscalls that <em>are</em> implemented are simpler &#8212; and simpler means more auditable.</p><p>Consider <code>open()</code> in the Linux kernel. It handles path resolution across mount namespaces, follows symbolic links (up to 40 levels deep), checks permissions against uid/gid/capabilities, negotiates with the filesystem driver, allocates inodes and dentries, handles O_CREAT/O_EXCL atomicity, manages file locks, updates access timestamps. It&#8217;s thousands of lines of code with decades of CVEs.</p><p><code>open()</code> in the shim: walk a flat metadata table. If the path is there, allocate a file descriptor. If not, return <code>-ENOENT</code>.
No mount namespaces, no symlinks, no permission checks beyond &#8220;does this entry exist.&#8221; The table itself defines what exists.</p><p>The same applies across the board:</p><table><thead><tr><th>Syscall</th><th>Linux kernel complexity</th><th>Shim complexity</th></tr></thead><tbody><tr><td><code>mmap</code></td><td>VMA trees, page fault handlers, COW, shared mappings, file-backed mappings, huge pages</td><td>Bump allocator, anonymous only</td></tr><tr><td><code>connect</code></td><td>Full TCP state machine, routing table, netfilter, congestion control</td><td>Write destination to mailbox, hypervisor does the connect</td></tr><tr><td><code>getpid</code></td><td>Namespace-aware PID lookup</td><td>Return 1</td></tr><tr><td><code>clock_gettime</code></td><td>vDSO, multiple clock sources, NTP adjustments</td><td>Read TSC, multiply by pre-computed frequency</td></tr></tbody></table><p>The shim&#8217;s <code>clock_gettime</code> implementation is instructive. The hypervisor reads the host&#8217;s TSC frequency at boot and writes it into a known memory location. The shim reads the TSC directly (which doesn&#8217;t cause a VM exit on modern CPUs), multiplies by the frequency, and returns the result. No VM exit, no host interaction, nanosecond precision. Ten lines of code where the kernel has hundreds.</p><h2>What we didn&#8217;t implement &#8212; and why that&#8217;s the point</h2><p>The ~390 syscalls that aren&#8217;t in the dispatch table aren&#8217;t bugs to fix. They&#8217;re attack surface that doesn&#8217;t exist.</p><p>No <code>ptrace</code> &#8212; the process can&#8217;t debug or inspect itself. No <code>mount</code> &#8212; the process can&#8217;t modify its filesystem view. No <code>setuid</code> &#8212; the process can&#8217;t escalate privileges. No <code>kexec_load</code> &#8212; the process can&#8217;t replace the kernel. No <code>bpf</code> &#8212; the process can&#8217;t install eBPF programs. No <code>perf_event_open</code> &#8212; the process can&#8217;t access performance counters.</p><p>These syscalls are regularly the source of privilege escalation CVEs in containers. In the shim, they return <code>-ENOSYS</code>.
The code path that could be exploited doesn&#8217;t exist &#8212; not in the shim, not anywhere in the VM. There is no kernel to exploit.</p><h2>Running unmodified binaries</h2><p>An important consequence of honoring the syscall ABI: the process doesn&#8217;t need to be recompiled, re-linked, or modified in any way beyond the binary rewrite from post 1. A statically-linked Python binary, compiled by someone else, downloaded from a package repository, runs unmodified. So does a Go binary, a Rust binary, or a C program linked against musl.</p><p>The shim doesn&#8217;t care what language the binary was written in or how it was compiled. It cares about what syscalls the binary makes and what arguments it passes. As long as those fall within the ~60 implemented syscalls &#8212; which they do for typical single-process workloads &#8212; the binary runs.</p><p>This is where the library OS and unikernel approaches fall apart in practice. They require rebuilding the application against a custom runtime. That means maintaining compatibility with every language ecosystem, every package manager, every build system. The binary rewrite + syscall ABI approach sidesteps all of this: if it runs on Linux, and it uses fewer than ~60 syscalls, it runs on the shim.</p><div><hr></div><h2>What&#8217;s next</h2><p>A single VM with a shim and a hypervisor backstop is useful. A thousand of them on one machine is a platform. But a thousand VMs with 512MB of system area each would consume half a terabyte of RAM. That doesn&#8217;t work.</p><p>Tomorrow&#8217;s post covers how shared memory, warm pools, and copy-on-write make it possible to run thousands of these VMs on a single machine &#8212; and why the fixed, deterministic memory layout described here is what makes sharing possible in the first place.</p><div><hr></div><p><em>This is post 2 of a 7-part series on building a minimal VM runtime. 
Subscribe to get the rest.</em></p><p><em>If you have questions or want to discuss &#8212; reach out on <a href="https://www.linkedin.com/in/alimaye/">LinkedIn</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Rewriting Every Syscall in a Linux Binary at Load Time]]></title><description><![CDATA[The problem that started this]]></description><link>https://amitlimaye1.substack.com/p/rewriting-every-syscall-in-a-linux</link><guid isPermaLink="false">https://amitlimaye1.substack.com/p/rewriting-every-syscall-in-a-linux</guid><dc:creator><![CDATA[Amit Limaye]]></dc:creator><pubDate>Mon, 13 Apr 2026 05:38:16 GMT</pubDate><content:encoded><![CDATA[<h2>The problem that started this</h2><p>There&#8217;s something odd about the way we run software today. Most containers &#8212; the dominant unit of deployment in production &#8212; run a single process. One Python script, one Node.js server, one Go binary. But that single process sits on top of a full Linux kernel &#8212; roughly 450 system calls, most of which it will never use.
The kernel knows about devices, schedulers, multi-process coordination, signal routing, dozens of filesystem types, and hundreds of other things that a single-process workload doesn&#8217;t care about. The process doesn&#8217;t care how the machine is laid out or what hardware is available. It needs CPU, memory, and I/O. That&#8217;s it.</p><p>Think about what this gap means. We&#8217;re running on a platform that provides a vast surface of features these workloads will never touch. And increasingly, the code running inside these containers is code that isn&#8217;t fully trusted &#8212; third-party libraries, generated code, autonomous agents making decisions at runtime. That 450-syscall interface becomes difficult to reason about and even harder to secure.</p><p>This is not a unique observation, and it&#8217;s not the first time people have tried to address it. Stripping a kernel of features you don&#8217;t need is an old idea &#8212; sometimes motivated by security, sometimes by resource constraints. It&#8217;s been common in the embedded world for decades, where you can&#8217;t afford a full kernel on a device with 256KB of RAM.
In the server world, the same instinct shows up as hardened kernel configs with hundreds of options disabled, custom builds with subsystems removed, and unikernels that compile app and kernel together. And people have shown real success with this approach &#8212; substantial gains in memory footprint and measurably better security postures. But they always end up with more than they want, because of how deeply things are entangled in the kernel. Subsystems have deep interdependencies. Removing the scheduler breaks assumptions in the memory manager. Disabling networking pulls threads that unravel into the VFS layer. You end up with hacks &#8212; <code>#ifdef</code>s around code paths that might still be reachable, stub functions that return success without doing anything, and a constant fear that some corner case will hit a code path you thought you&#8217;d removed.</p><p>Unikernels have tried to address the entire kernel surface from the other direction &#8212; building up from nothing. But the problem turns out to be vast. Once you need to support processes that talk to real devices, care about hardware topology, or depend on specific OS features, the scope explodes. You end up rebuilding large chunks of what you were trying to avoid.</p><p>But what if you don&#8217;t need any of that? What if the process doesn&#8217;t care about devices, doesn&#8217;t need hardware access, and only uses a small slice of the OS interface? Instead of subtracting from a full kernel or rebuilding one from scratch, what if you started from zero and only implemented what the process actually calls?</p><p><code>strace</code> a Python process &#8212; a script that reads data, makes HTTPS calls, and writes output. It uses about 40 distinct syscalls.
<code>read</code>, <code>write</code>, <code>open</code>, <code>close</code>, <code>socket</code>, <code>connect</code>, <code>send</code>, <code>recv</code>, <code>brk</code>, <code>mmap</code>, <code>clock_gettime</code>, <code>exit</code> &#8212; and a couple dozen more for memory management and file metadata. The other 410 syscalls? Multi-process coordination, device management, signal handling, things a single-process workload never touches.</p><p>So implement those ~40 syscalls as a library. A &#8220;library kernel&#8221; &#8212; just the syscalls the process needs, written from scratch, no Linux baggage.</p><p>The idea isn&#8217;t new. Unikernels tried this. So did various library OS projects. But they all hit the same wall: how do you get the process to call your library instead of the real kernel?</p><p>The standard approaches are:</p><p><strong>Compiler integration</strong>: Modify the toolchain to emit calls to your library instead of <code>syscall</code> instructions. This works, but now you need to support every compiler, every language runtime, every version. GCC, Clang, rustc, Go&#8217;s compiler, CPython&#8217;s build system, Node.js&#8217;s V8 &#8212; each with its own way of emitting syscalls. The maintenance surface is enormous.</p><p><code>LD_PRELOAD</code><strong> / libc interposition</strong>: Intercept at the C library level by overriding libc functions like <code>write()</code> and <code>open()</code>. But not everything goes through libc. Go makes syscalls directly. So does musl in some paths. JIT compilers emit raw <code>syscall</code> instructions. Anything that bypasses libc bypasses your interposition. You&#8217;re playing whack-a-mole with an ever-growing list of exceptions.</p><p><strong>Custom libc</strong>: Build a libc that routes to your implementation. Similar problem &#8212; you need the process to link against your libc, which means controlling the build. 
And statically-linked binaries ignore your libc entirely.</p><p>Every approach that works at the source level, the compiler level, or the library level has the same fundamental problem: there are too many paths to a <code>syscall</code> instruction, and you need to cover all of them. Miss one, and the process escapes your control.</p><p>And containers aren&#8217;t secure enough as-is. A container shares the host kernel. Every one of those 450 syscalls is a potential attack surface &#8212; the process can probe them, exploit bugs in their implementation, or use them in unintended combinations. The kernel&#8217;s syscall interface is the largest attack surface in the system, and containers expose all of it.</p><p>But eventually you realize something: every one of these approaches &#8212; compiler-generated code, libc wrappers, JIT output, hand-written assembly, Go&#8217;s raw syscalls, musl&#8217;s internal paths &#8212; they all converge on the same point. Whatever language, whatever toolchain, whatever runtime, the process eventually executes the same two-byte instruction: <code>0F 05</code>. The <code>syscall</code> opcode is the single most consistent hook point across the entire software stack. It doesn&#8217;t matter how the code got there. It always arrives at the same place.</p><p>Work at that level &#8212; below the language, below the compiler, below libc &#8212; and you only have one thing to catch.</p><p>The syscall interface is just an ABI &#8212; a contract. A process puts a number in <code>rax</code>, arguments in <code>rdi</code> through <code>r9</code>, and executes <code>syscall</code>. It gets a result back in <code>rax</code>. The process doesn&#8217;t care who honored that contract. If you implement those ~40 syscalls yourself &#8212; returning the same values, honoring the same error codes &#8212; the process can&#8217;t tell the difference. You control its entire view of the world. 
What happens with the other 410 syscalls &#8212; how you handle the ones you don&#8217;t implement, what to do when a process needs something outside your set &#8212; is a design question I&#8217;ll get into in later posts. For now, the foundational problem: how do you get every syscall, from every toolchain and runtime, to land in your implementation?</p><p>The answer: rewrite the binary at load time. Replace every <code>syscall</code> instruction with a trap that redirects to your own implementation.</p><div><hr></div><h2>Why not ptrace, seccomp, or eBPF?</h2><p>There are established ways to intercept system calls on Linux. Each one has a limitation that matters when your goal is to enforce policy on untrusted code &#8212; not just observe it.</p><p><strong>ptrace</strong> (strace, gdb):<br>The kernel stops the process, notifies the tracer, the tracer inspects and resumes. That&#8217;s two context switches per syscall &#8212; roughly 10-20&#181;s of overhead each time. For a process making thousands of syscalls per second, ptrace adds double-digit milliseconds of delay. More fundamentally, ptrace is designed for debugging, not enforcement. The API is awkward for building a policy engine.</p><p><strong>seccomp-bpf</strong>:<br>Seccomp lets you install a BPF filter that the kernel evaluates on every syscall. It&#8217;s fast &#8212; the filter runs in-kernel. But the actions are coarse: allow, kill the process, return an error, or trap to a user-space handler (via <code>SECCOMP_RET_TRACE</code>, which brings back ptrace overhead). You can&#8217;t inspect pointer arguments &#8212; the BPF filter only sees register values, not the memory they point to. You can&#8217;t read the filename being <code>open()</code>ed or the buffer being passed to <code>write()</code>. And you can&#8217;t modify anything &#8212; seccomp is a one-way gate.</p><p><strong>eBPF</strong>:<br>eBPF programs attached to tracepoints or LSM hooks can observe and enforce at the syscall level with low overhead &#8212; LSM hooks can deny calls outright.
But eBPF is deliberately restricted from modifying process state. You can deny a <code>connect()</code>, but you can&#8217;t change the destination address, return a custom result, or emulate the call with different behavior. You can&#8217;t inspect the buffer contents a <code>write()</code> is about to send. The verifier guarantees safety, which means eBPF enforcement is binary &#8212; allow or deny &#8212; without the ability to intercept, inspect, and rewrite at the level a full policy engine needs.</p><p>What&#8217;s needed is something different:</p><table><thead><tr><th>Requirement</th><th>ptrace</th><th>seccomp</th><th>eBPF</th><th>Binary rewrite</th></tr></thead><tbody><tr><td>Low overhead per syscall</td><td>No (~10-20&#181;s)</td><td>Yes</td><td>Yes</td><td>Yes</td></tr><tr><td>Inspect pointer arguments (filenames, buffers)</td><td>Yes (slow)</td><td>No</td><td>Read-only</td><td>Yes</td></tr><tr><td>Modify return values</td><td>Yes (slow)</td><td>No</td><td>No</td><td>Yes</td></tr><tr><td>Gracefully deny (return -EPERM, process continues)</td><td>Yes (slow)</td><td>Partial (ERRNO mode)</td><td>No</td><td>Yes</td></tr><tr><td>Emulate the syscall entirely</td><td>Yes (slow)</td><td>No</td><td>No</td><td>Yes</td></tr><tr><td>No kernel module required</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr></tbody></table><p>Binary rewriting gives you the full set: low overhead, full argument inspection, return value control, and complete emulation &#8212; all without a kernel module.</p><p>The idea is simple: if you can replace every <code>syscall</code> instruction in a binary with a trap that redirects to your own handler, you get complete control over the process&#8217;s interaction with the outside world. Let&#8217;s look at what it takes to build one. The primary reference for this work is the <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html">Intel 64 and IA-32 Architectures Software Developer&#8217;s Manual, Volume 2</a> (specifically the opcode maps in Appendix A), and we validated our instruction length decoder against <a href="https://www.capstone-engine.org/">Capstone</a>, the open-source disassembly framework.</p><div><hr></div><h2>How the rewriter works</h2><p>The rewrite is a single pass over the <code>.text</code> section of an ELF binary.
When it runs is a deployment choice &#8212; it could happen at container build time, when a container image is pulled from a registry (via a webhook or admission controller), or at load time just before execution. We currently do it at load time: after loading the ELF into memory, before the first instruction runs. You could also do it once for a binary and cache the result &#8212; the rewrite is deterministic, so there&#8217;s no reason to repeat it.</p><h3>Step 1: Instruction Length Decoding</h3><p>You can&#8217;t just scan for the byte sequence <code>0F 05</code> (the encoding of <code>syscall</code>). Those two bytes might appear as part of a larger instruction &#8212; an immediate operand, a displacement, or a prefix combination. Naively replacing them would corrupt unrelated instructions.</p><p>Instead, the rewriter walks the code at <strong>instruction boundaries</strong> using an Instruction Length Decoder (ILD). The ILD doesn&#8217;t fully disassemble each instruction &#8212; it only computes its length. That&#8217;s enough to advance to the next instruction boundary and know exactly where opcodes are versus operands.</p><p>The ILD handles the full x86-64 encoding complexity:</p><ul><li><p><strong>Legacy prefixes</strong> (up to 4): <code>LOCK</code>, <code>REP</code>, segment overrides, operand-size (<code>66</code>), address-size (<code>67</code>)</p></li><li><p><strong>REX prefix</strong>: the <code>40-4F</code> byte that extends registers and changes operand width</p></li><li><p><strong>Opcode</strong>: 1-byte, 2-byte (<code>0F</code> escape), or 3-byte (<code>0F 38</code>, <code>0F 3A</code>)</p></li><li><p><strong>ModRM + SIB + displacement</strong>: addressing mode encoding</p></li><li><p><strong>Immediates</strong>: variable width depending on opcode and prefixes</p></li></ul><p>The core is two lookup tables &#8212; one for 1-byte opcodes, one for 2-byte opcodes &#8212; derived from the Intel Software Developer&#8217;s Manual (Vol. 
2A, Tables A-2 and A-3). Each table entry encodes whether the opcode has a ModRM byte and what size immediate follows:</p><pre><code><code>const HAS_MODRM: u8 = 0x80;
const IMM8: u8     = 1;
const IMM32: u8    = 3;

static OP1_TAB: [u8; 256] = [
    /* 00 ADD r/m,r */  HAS_MODRM,
    /* 01 ADD r/m,r */  HAS_MODRM,
    ...
    /* 68 PUSH imm32 */ IMM32,
    /* 69 IMUL r,imm */ HAS_MODRM | IMM32,
    ...
    // 256 entries covering the full primary opcode space
];
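For illustration, a drastically simplified sketch of the same walk (assumptions: no VEX/EVEX, no 16-bit forms, and only a whitelist of opcodes; this toy <code>ild_length</code> stands in for the real 440-line decoder):

```rust
// Toy instruction-length decoder: prefixes -> REX -> opcode -> ModRM ->
// SIB -> displacement -> immediate. Whitelist-only; a sketch, not the
// full Intel SDM opcode tables.
fn is_legacy_prefix(b: u8) -> bool {
    matches!(b, 0xF0 | 0xF2 | 0xF3                      // LOCK / REPNE / REP
              | 0x2E | 0x36 | 0x3E | 0x26 | 0x64 | 0x65 // segment overrides
              | 0x66 | 0x67)                            // operand/address size
}

fn ild_length(code: &[u8]) -> Option<usize> {
    let mut i = 0;
    while i < code.len() && is_legacy_prefix(code[i]) { i += 1; }
    if i < code.len() && (code[i] & 0xF0) == 0x40 { i += 1; } // REX
    let op = *code.get(i)?; i += 1;
    let (modrm, imm) = match op {
        0x0F => match *code.get(i)? {
            0x05 => { i += 1; (false, 0) }  // SYSCALL
            _ => return None,               // other 0F opcodes: out of scope
        },
        0x89 | 0x8B => (true, 0),           // MOV r/m,r and MOV r,r/m
        0x68 => (false, 4),                 // PUSH imm32
        0x90 | 0xC3 | 0xCC => (false, 0),   // NOP / RET / INT3
        _ => return None,                   // everything else: out of scope
    };
    if modrm {
        let m = *code.get(i)?; i += 1;
        let (md, rm) = (m >> 6, m & 7);
        if md != 3 && rm == 4 { i += 1; }   // SIB byte present
        i += match md { 0 if rm == 5 => 4,  // RIP-relative disp32
                        1 => 1, 2 => 4, _ => 0 }; // disp8 / disp32 / none
    }
    i += imm;
    if i <= code.len() { Some(i) } else { None }
}

fn main() {
    assert_eq!(ild_length(&[0x0F, 0x05]), Some(2));                   // syscall
    assert_eq!(ild_length(&[0x48, 0x89, 0xC7]), Some(3));             // mov rdi, rax
    assert_eq!(ild_length(&[0x68, 0x0F, 0x05, 0x00, 0x00]), Some(5)); // push imm32
    assert_eq!(ild_length(&[0x8B, 0x44, 0x24, 0x08]), Some(4));       // mov eax,[rsp+8]
}
```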
</code></code></pre><p>The decoder walks prefix &#8594; REX &#8594; opcode &#8594; ModRM &#8594; SIB &#8594; displacement &#8594; immediate, accumulating the length at each step. For the curious, the full decoder handles VEX (2-byte and 3-byte) and EVEX prefixes for AVX/AVX-512, <code>MOV r64, imm64</code> (the only instruction with a 64-bit immediate), the F6/F7 <code>TEST</code> special case (where the ModRM <code>/r</code> field determines whether an immediate follows), and a dozen other encoding quirks.</p><p>It&#8217;s about 440 lines of Rust. Not pretty, but complete.</p><h3>Step 2: Find and patch</h3><p>With the ILD, the rewriter walks the code instruction-by-instruction. At each position, it skips past prefixes and REX to find the opcode bytes. If it finds <code>0F 05</code> at the opcode position, that&#8217;s a real <code>syscall</code> instruction &#8212; not a coincidental byte pattern inside an immediate:</p><pre><code><code>pub fn rewrite_syscalls(code: &amp;mut [u8]) -&gt; usize {
    let mut count = 0;
    let mut pos = 0;

    while pos &lt; code.len() {
        let ilen = match ild_length(&amp;code[pos..]) {
            Some(n) =&gt; n,
            None =&gt; { pos += 1; continue; }  // skip undecodable byte
        };

        // Skip prefixes and REX to find the opcode
        let mut opc = pos;
        while opc &lt; pos + ilen {
            let b = code[opc];
            if is_legacy_prefix(b) { opc += 1; }
            else { break; }
        }
        if opc &lt; pos + ilen &amp;&amp; (code[opc] &amp; 0xF0) == 0x40 {
            opc += 1;  // skip REX
        }

        // Patch: SYSCALL (0F 05) &#8594; INT3 (CC) + NOP (90)
        if opc + 1 &lt; pos + ilen
            &amp;&amp; code[opc] == 0x0F
            &amp;&amp; code[opc + 1] == 0x05
        {
            code[opc] = 0xCC;     // INT3
            code[opc + 1] = 0x90; // NOP
            count += 1;
        }

        pos += ilen;
    }
    count
}
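A miniature of why the boundary-aware walk above matters (illustrative bytes and a toy helper, not part of the rewriter):

```rust
// One real syscall in this buffer, but the 0F 05 byte pair also appears
// inside the PUSH immediate. A naive byte scan finds two "matches" and
// would corrupt the PUSH; the ILD walk patches exactly one.
fn naive_matches(code: &[u8]) -> usize {
    code.windows(2).filter(|w| w[0] == 0x0F && w[1] == 0x05).count()
}

fn main() {
    let code = [
        0x68, 0x0F, 0x05, 0x00, 0x00, // push 0x0000050F (immediate contains 0F 05)
        0x0F, 0x05,                   // syscall (the only real one)
    ];
    assert_eq!(naive_matches(&code), 2); // naive scan: one false positive
}
```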
</code></code></pre><p>The replacement is <code>INT3</code> (<code>0xCC</code>) followed by <code>NOP</code> (<code>0x90</code>). <code>INT3</code> is a one-byte instruction that triggers interrupt vector 3. <code>NOP</code> pads the second byte so the instruction lengths stay aligned &#8212; <code>syscall</code> is 2 bytes, <code>INT3 + NOP</code> is 2 bytes. No instruction boundary shifts, no relocation needed.</p><h3>A real example: CPython 3.12</h3><p>To make this concrete &#8212; here&#8217;s what happens when the rewriter runs on a statically-linked Python 3.12 binary:</p><pre><code><code>$ nexus-vmm test_policy.json
Loaded ELF: python3.elf (entry=0x683C6C, 4 segments)
  segment 1 (0x683000, 8745266 bytes, exec): patched 363 syscalls
Rewriter: patched 363 total syscall instructions
</code></code></pre><p>A 19MB static binary. 8.7MB of executable code. The ILD walks every instruction in the <code>.text</code> section and finds 363 <code>syscall</code> instructions &#8212; scattered across musl libc, Python&#8217;s posixmodule, the socket module, signal handling, memory allocation, and dozens of other call sites. Each one is replaced with <code>INT3 + NOP</code>. The rewrite takes about 48ms including loading the binary into guest memory. After this, not a single <code>syscall</code> instruction remains in the process image. Every path to the kernel goes through the shim.</p><h3>Step 3: The shim catches the trap</h3><p>The rewritten binary doesn&#8217;t run on the host kernel &#8212; it runs inside a lightweight VM (we use KVM). The VM has no operating system. Instead, a small shim &#8212; a few kilobytes of Rust code loaded into the VM&#8217;s memory &#8212; acts as the only thing between the process and the hardware. Before the guest runs, the hypervisor sets up an IDT (Interrupt Descriptor Table) with vector 3 pointing to the shim&#8217;s handler. When the rewritten <code>INT3</code> fires, the CPU pushes <code>RIP</code>, <code>CS</code>, and <code>RFLAGS</code> onto the stack and jumps to the handler.</p><p>The handler reads the syscall number from <code>rax</code> and arguments from <code>rdi</code>, <code>rsi</code>, <code>rdx</code>, <code>r10</code>, <code>r8</code>, <code>r9</code> &#8212; the standard Linux syscall ABI. It calls into a dispatch function that decides what to do:</p><pre><code><code>INT3 fires
  &#8594; CPU pushes RIP/CS/RFLAGS, jumps to shim handler
  &#8594; handler reads rax (syscall number), rdi-r9 (arguments)
  &#8594; dispatch:
      - check policy table &#8594; DENY? return -EPERM
      - can emulate locally? &#8594; emulate, return result
      - need host help? &#8594; escalate to hypervisor
  &#8594; handler writes result to rax
  &#8594; IRETQ back to guest
</code></code></pre><p>The entire path &#8212; trap, dispatch, emulate, return &#8212; runs in <strong>under a microsecond</strong> for the common case. The process resumes at the instruction after the original <code>syscall</code>, with the result in <code>rax</code>, exactly as if a real syscall had executed.</p><div><hr></div><h2>The edge cases that nearly killed us</h2><p>The basic rewriter is straightforward. The edge cases are where most of the time goes.</p><h3>JIT&#8217;d code: the LSTAR self-healing trick</h3><p>The binary rewriter runs once at load time. But what about code generated at runtime &#8212; JIT compilers like V8 (Node.js), LuaJIT, or the Python regex engine? These emit fresh machine code containing <code>syscall</code> instructions that don&#8217;t exist when the rewrite pass runs.</p><p>The solution: the LSTAR MSR as a safety net.</p><p>On x86-64, the <code>syscall</code> instruction transfers control to the address in the LSTAR register. Normally, this points to the kernel&#8217;s syscall entry point. Set LSTAR to point to a <strong>self-healing handler</strong> in the shim.</p><p>If a JIT&#8217;d <code>syscall</code> executes (because the rewriter never had a chance to patch it), it jumps to the LSTAR handler instead of the kernel. The handler:</p><ol><li><p>Records the address of the <code>syscall</code> instruction that triggered it</p></li><li><p><strong>Patches it in place</strong> &#8212; overwrites <code>0F 05</code> with <code>CC 90</code> right there in the JIT&#8217;d code</p></li><li><p>Constructs an interrupt frame and falls through to the same INT3 handler</p></li></ol><p>The first execution of a JIT&#8217;d <code>syscall</code> takes the slow LSTAR path. Every subsequent execution hits the patched <code>INT3</code> and takes the fast path. The system self-heals &#8212; every new <code>syscall</code> instruction gets caught and patched on first encounter.</p><pre><code><code>_syscall_entry:                    ; LSTAR target
  ; SYSCALL saved: RIP &#8594; rcx, RFLAGS &#8594; r11
  ; Record the syscall instruction address for patching
  lea   r15, [rcx - 2]            ; address of the syscall opcode (2 bytes back)
  mov   [rip + LSTAR_PATCH_GPA], r15
  ; Save original RSP, then build the 5-qword iretq frame
  ; that the shared INT3 handler expects
  mov   r15, rsp                  ; save guest RSP
  push  0x10                      ; SS
  push  r15                       ; RSP (guest's original)
  push  r11                       ; RFLAGS
  push  0x08                      ; CS
  push  rcx                       ; RIP (return address)
  jmp   _start                    ; fall through to shared INT3 handler
</code></code></pre><p>The dispatch function checks <code>LSTAR_PATCH_GPA</code> on every invocation. If non-zero, it patches the two bytes at that address to <code>CC 90</code> and clears the variable. Next time that code runs, <code>INT3</code> fires directly &#8212; no LSTAR round-trip.</p><p>The remaining edge cases aren&#8217;t about the rewriter itself &#8212; they&#8217;re about building the shim that catches the rewritten traps. But they cost enough debugging time to be worth sharing.</p><h3>LLVM dead-code elimination</h3><p>Our shim has a policy table &#8212; an array of <code>u64</code> values at a fixed memory address, written by the hypervisor before boot. The shim reads these values to make fast-path policy decisions.</p><p>The first implementation:</p><pre><code><code>#[link_section = ".shim_policy"]
static POLICY_TABLE: [AtomicU64; 128] = [const { AtomicU64::new(0) }; 128];
</code></code></pre><p>This worked in debug builds. In release builds with optimizations, the policy table vanished. LLVM saw a <code>static</code> initialized to all zeros, determined that every load from it must produce zero, constant-folded the policy checks to <code>false</code>, eliminated them as dead code, and removed the static entirely.</p><p>The fix: don&#8217;t use a static at all. Read from the fixed guest physical address with <code>read_volatile</code>:</p><pre><code><code>pub fn check_write(gov_idx: u32, len: u64) -&gt; bool {
    // POLICY_TABLE_ADDR and ALLOW_WRITE are shim-wide constants
    // (defined elsewhere); the table lives at a fixed guest physical
    // address that the hypervisor populates before boot.
    let ptr = POLICY_TABLE_ADDR as *const u64;
    unsafe {
        // Volatile load: this slot was written from outside the
        // program, so the optimizer must not assume it is zero.
        let slot = core::ptr::read_volatile(ptr.add(gov_idx as usize));
        slot &amp; ALLOW_WRITE != 0
    }
}
</code></code></pre><p><code>read_volatile</code> tells the compiler: this memory may change at any time, you cannot assume its value. The optimizer cannot fold it, cannot eliminate it, cannot reason about what&#8217;s behind the pointer. The hypervisor writes the real policy values to that physical address before boot, and the shim reads them at runtime &#8212; exactly as intended.</p><h3>Dynamic linking sections in a no_std binary</h3><p>Our shim is built as a <code>no_std</code> Rust binary &#8212; no standard library, no runtime, no dynamic linking. But when we extracted the raw binary with <code>objcopy -O binary</code>, the output was 10x larger than expected.</p><p>The culprit: the LLVM linker (<code>lld</code>) emits <code>.dynamic</code>, <code>.dynsym</code>, <code>.gnu.hash</code>, <code>.hash</code>, and <code>.dynstr</code> sections even in <code>no_std</code> binaries. These sections are marked <code>ALLOC</code>, which tells <code>objcopy</code> they belong in the binary image. So our shim.bin contained megabytes of empty dynamic linking metadata that would never be used.</p><p>The fix is in the linker script:</p><pre><code><code>/DISCARD/ : {
    *(.dynamic)
    *(.dynsym)
    *(.gnu.hash)
    *(.hash)
    *(.dynstr)
}
</code></code></pre><p>Explicit discard. The sections don&#8217;t appear in the output, <code>objcopy</code> produces a minimal binary, and the shim fits in a single 4KB page as intended.</p><div><hr></div><h2>What this enables</h2><p>Once every syscall routes through your handler, you can enforce arbitrary policy on untrusted code with near-native performance:</p><ul><li><p><code>open("/data/customers.csv")</code> &#8594; <strong>allow</strong> (it&#8217;s in the policy)</p></li><li><p><code>open("/etc/shadow")</code> &#8594; <strong>deny</strong>, return <code>-ENOENT</code> (file doesn&#8217;t exist in the controlled environment)</p></li><li><p><code>connect("api.openai.com:443")</code> &#8594; <strong>allow</strong>, route through TLS inspection</p></li><li><p><code>connect("pastebin.com:443")</code> &#8594; <strong>deny</strong>, return <code>-EPERM</code></p></li><li><p><code>write(socket_fd, buffer_containing_pii)</code> &#8594; <strong>block</strong> before a single byte hits the network</p></li></ul><p>The process sees standard syscall return values. <code>-EPERM</code> when denied, normal results when allowed. It doesn&#8217;t know it&#8217;s being intercepted. It can&#8217;t detect the interception (the <code>syscall</code> instruction was replaced, there&#8217;s nothing to probe). And it can&#8217;t bypass it &#8212; every path to the kernel goes through our shim.</p><p>This is the foundation of a minimal VM runtime where untrusted code runs inside a controlled environment with full visibility into every action it takes. The binary rewriter is the first layer: it creates a position between the process and the kernel where you can see everything, control everything, and audit everything, at a cost of under a microsecond per syscall.</p><p>But a binary rewriter alone isn&#8217;t a runtime. The next question is: what handles those intercepted syscalls? A full Linux kernel is overkill for a single-process workload that does file I/O and network calls. 
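</p><p>Before moving on, the policy verdicts listed above can be sketched in a few lines. This is a toy stand-in for the shim&#8217;s real check, which reads the hypervisor-written table; the paths and errno choices simply mirror the examples in the bullets:</p><pre><code><code>fn main() {
    // Errno values; the guest receives them negated as syscall results.
    const ENOENT: i64 = 2;
    const EPERM: i64 = 1;

    // Toy verdict for an intercepted open(2): 0 means forward to the
    // host, a negative value goes straight back to the guest in rax.
    let decide_open = |path: String| {
        if path.starts_with("/data/") {
            0 // listed in the policy: allow
        } else if path == "/etc/shadow" {
            -ENOENT // pretend the file does not exist
        } else {
            -EPERM // default deny
        }
    };

    assert_eq!(decide_open("/data/customers.csv".to_string()), 0);
    assert_eq!(decide_open("/etc/shadow".to_string()), -ENOENT);
    assert_eq!(decide_open("/tmp/exfil".to_string()), -EPERM);
}</code></code></pre><p>To the guest, a denied <code>open</code> is indistinguishable from a missing file.</p><p>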
Tomorrow&#8217;s post covers why a ~40-syscall &#8220;kernel&#8221; is enough, and how a hypervisor backstop handles everything else.</p><div><hr></div><p><em>This is post 2 of a 7-part series on building a minimal VM runtime for AI agent execution. If you&#8217;re working on agent sandboxing, runtime security, or just want to follow along &#8212; subscribe for the rest.</em></p><p><em>If you&#8217;re dealing with agent security at your company, I&#8217;d love to hear what you&#8217;ve tried and what&#8217;s missing &#8212; reach out on <a href="https://www.linkedin.com/in/alimaye/">LinkedIn</a>.</em></p>]]></content:encoded></item></channel></rss>