Rewriting Every Syscall in a Linux Binary at Load Time
The problem that started this
There’s something odd about the way we run software today. Most containers — the dominant unit of deployment in production — run a single process. One Python script, one Node.js server, one Go binary. But that single process sits on top of a full Linux kernel — roughly 450 system calls, most of which it will never use. The kernel knows about devices, schedulers, multi-process coordination, signal routing, dozens of filesystem types, and hundreds of other things that a single-process workload doesn’t care about. The process doesn’t care how the machine is laid out or what hardware is available. It needs CPU, memory, and I/O. That’s it.
Think about what this gap means. We’re running on a platform that provides a vast surface of features these workloads will never touch. And increasingly, the code running inside these containers is code that isn’t fully trusted — third-party libraries, generated code, autonomous agents making decisions at runtime. That 450-syscall interface becomes difficult to reason about and even harder to secure.
This is not a unique observation, and it’s not the first time people have tried to address it. Stripping a kernel of features you don’t need is an old idea — sometimes motivated by security, sometimes by resource constraints. It’s been common in the embedded world for decades, where you can’t afford a full kernel on a device with 256KB of RAM. In the server world, the same instinct shows up as hardened kernel configs with hundreds of options disabled, custom builds with subsystems removed, and unikernels that compile app and kernel together. And people have shown real success with this approach — substantial gains in memory footprint and measurably better security postures. But they always end up with more than they want, because the kernel’s subsystems are deeply entangled. Removing the scheduler breaks assumptions in the memory manager. Disabling networking pulls threads that unravel into the VFS layer. You end up with hacks — #ifdefs around code paths that might still be reachable, stub functions that return success without doing anything, and a constant fear that some corner case will hit a code path you thought you’d removed.
Unikernels have tried to address the entire kernel surface from the other direction — building up from nothing. But the problem turns out to be vast. Once you need to support processes that talk to real devices, care about hardware topology, or depend on specific OS features, the scope explodes. You end up rebuilding large chunks of what you were trying to avoid.
But what if you don’t need any of that? What if the process doesn’t care about devices, doesn’t need hardware access, and only uses a small slice of the OS interface? Instead of subtracting from a full kernel or rebuilding one from scratch, what if you started from zero and only implemented what the process actually calls?
strace a Python process — a script that reads data, makes HTTPS calls, and writes output. It uses about 40 distinct syscalls. read, write, open, close, socket, connect, send, recv, brk, mmap, clock_gettime, exit — and a couple dozen more for memory management and file metadata. The other 410 syscalls? Multi-process coordination, device management, signal handling, things a single-process workload never touches.
So implement those ~40 syscalls as a library. A “library kernel” — just the syscalls the process needs, written from scratch, no Linux baggage.
The idea isn’t new. Unikernels tried this. So did various library OS projects. But they all hit the same wall: how do you get the process to call your library instead of the real kernel?
The standard approaches are:
Compiler integration: Modify the toolchain to emit calls to your library instead of syscall instructions. This works, but now you need to support every compiler, every language runtime, every version. GCC, Clang, rustc, Go’s compiler, CPython’s build system, Node.js’s V8 — each with its own way of emitting syscalls. The maintenance surface is enormous.
LD_PRELOAD / libc interposition: Intercept at the C library level by overriding libc functions like write() and open(). But not everything goes through libc. Go makes syscalls directly. So does musl in some paths. JIT compilers emit raw syscall instructions. Anything that bypasses libc bypasses your interposition. You’re playing whack-a-mole with an ever-growing list of exceptions.
Custom libc: Build a libc that routes to your implementation. Similar problem — you need the process to link against your libc, which means controlling the build. And statically-linked binaries ignore your libc entirely.
Every approach that works at the source level, the compiler level, or the library level has the same fundamental problem: there are too many paths to a syscall instruction, and you need to cover all of them. Miss one, and the process escapes your control.
And containers aren’t secure enough as-is. A container shares the host kernel. Every one of those 450 syscalls is a potential attack surface — the process can probe them, exploit bugs in their implementation, or use them in unintended combinations. The kernel’s syscall interface is the largest attack surface in the system, and containers expose all of it.
But eventually you realize something: every one of these approaches — compiler-generated code, libc wrappers, JIT output, hand-written assembly, Go’s raw syscalls, musl’s internal paths — they all converge on the same point. Whatever language, whatever toolchain, whatever runtime, the process eventually executes the same two-byte instruction: 0F 05. The syscall opcode is the single most consistent hook point across the entire software stack. It doesn’t matter how the code got there. It always arrives at the same place.
Work at that level — below the language, below the compiler, below libc — and you only have one thing to catch.
The syscall interface is just an ABI — a contract. A process puts a number in rax, arguments in rdi through r9, and executes syscall. It gets a result back in rax. The process doesn’t care who honored that contract. If you implement those ~40 syscalls yourself — returning the same values, honoring the same error codes — the process can’t tell the difference. You control its entire view of the world. What happens with the other 410 syscalls — how you handle the ones you don’t implement, what to do when a process needs something outside your set — is a design question I’ll get into in later posts. For now, the foundational problem: how do you get every syscall, from every language and runtime, to land in your implementation?
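The contract is concrete enough to exercise in a few lines. Here’s a raw write(2) made directly against that ABI on Linux x86-64, with no libc in the path — a sketch, not production code:

```rust
use std::arch::asm;

// Make a write(2) syscall directly: number in rax, args in rdi/rsi/rdx,
// result comes back in rax. SYSCALL clobbers rcx and r11.
fn raw_write(fd: i64, data: &[u8]) -> i64 {
    let ret: i64;
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") 1i64 => ret, // 1 = __NR_write on x86-64
            in("rdi") fd,
            in("rsi") data.as_ptr(),
            in("rdx") data.len(),
            lateout("rcx") _, // SYSCALL stores the return RIP here
            lateout("r11") _, // SYSCALL stores RFLAGS here
            options(nostack),
        );
    }
    ret
}

fn main() {
    let msg = b"hello from a raw syscall\n";
    // On success, write(2) returns the number of bytes written
    assert_eq!(raw_write(1, msg), msg.len() as i64);
}
```

As far as this code is concerned, anything that honors the register contract and puts the right value back in rax is “the kernel.”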
The answer: rewrite the binary at load time. Replace every syscall instruction with a trap that redirects to your own implementation.
Why not ptrace, seccomp, or eBPF?
There are established ways to intercept system calls on Linux. Each one has a limitation that matters when your goal is to enforce policy on untrusted code — not just observe it.
ptrace (strace, gdb):
The kernel stops the process, notifies the tracer, the tracer inspects and resumes. That’s two context switches per syscall — roughly 10-20µs of overhead each time. For a process making thousands of syscalls per second, ptrace adds double-digit milliseconds of delay. More fundamentally, ptrace is designed for debugging, not enforcement. The API is awkward for building a policy engine.
seccomp-bpf:
Seccomp lets you install a BPF filter that the kernel evaluates on every syscall. It’s fast — the filter runs in-kernel. But the actions are coarse: allow, kill the process, return an error, or trap to a user-space handler (via SECCOMP_RET_TRACE, which brings back ptrace overhead). You can’t inspect pointer arguments — the BPF filter only sees register values, not the memory they point to. You can’t read the filename passed to open() or the buffer handed to write(). And you can’t modify anything — seccomp is a one-way gate.
eBPF:
eBPF programs attached to tracepoints or LSM hooks can observe and enforce at the syscall level with low overhead — LSM hooks can deny calls outright. But eBPF is deliberately restricted from modifying process state. You can deny a connect(), but you can’t change the destination address, return a custom result, or emulate the call with different behavior. You can’t inspect the buffer contents a write() is about to send. The verifier guarantees safety, which means eBPF enforcement is binary — allow or deny — without the ability to intercept, inspect, and rewrite at the level a full policy engine needs.
What’s needed is something different:
| Requirement | ptrace | seccomp | eBPF | Binary rewrite |
|---|---|---|---|---|
| Low overhead per syscall | No (~10-20µs) | Yes | Yes | Yes |
| Inspect pointer arguments (filenames, buffers) | Yes (slow) | No | Read-only | Yes |
| Modify return values | Yes (slow) | No | No | Yes |
| Gracefully deny (return -EPERM, process continues) | Yes (slow) | Partial (ERRNO mode) | No | Yes |
| Emulate the syscall entirely | Yes (slow) | No | No | Yes |
| No kernel module required | Yes | Yes | Yes | Yes |
Binary rewriting gives you the full set: low overhead, full argument inspection, return value control, and complete emulation — all without a kernel module.
The idea is simple: if you can replace every syscall instruction in a binary with a trap that redirects to your own handler, you get complete control over the process’s interaction with the outside world. Let’s look at what it takes to build one. The primary reference for this work is the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (specifically the opcode maps in Appendix A), and we validated our instruction length decoder against Capstone, the open-source disassembly framework.
How the rewriter works
The rewrite is a single pass over the .text section of an ELF binary. When it runs is a deployment choice — it could happen at container build time, when a container image is pulled from a registry (via a webhook or admission controller), or at load time just before execution. We currently do it at load time: after loading the ELF into memory, before the first instruction runs. You could also do it once for a binary and cache the result — the rewrite is deterministic, so there’s no reason to repeat it.
Step 1: Instruction Length Decoding
You can’t just scan for the byte sequence 0F 05 (the encoding of syscall). Those two bytes might appear as part of a larger instruction — an immediate operand, a displacement, or a prefix combination. Naively replacing them would corrupt unrelated instructions.
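The false positive is easy to demonstrate: the two syscall bytes can appear inside an immediate operand, where a naive byte scan will flag them.

```rust
fn main() {
    // mov eax, 0x050F: the immediate's low bytes are 0F 05, the same
    // encoding as SYSCALL, but here they are data, not an opcode.
    let code: [u8; 7] = [
        0xB8, 0x0F, 0x05, 0x00, 0x00, // mov eax, 0x050F
        0x0F, 0x05,                   // syscall (the real one)
    ];
    // A naive scan reports both offsets; only offset 5 sits on an
    // instruction boundary and is safe to patch.
    let hits: Vec<usize> = (0..code.len() - 1)
        .filter(|&i| code[i] == 0x0F && code[i + 1] == 0x05)
        .collect();
    assert_eq!(hits, vec![1, 5]);
}
```

Patching offset 1 would turn the MOV into garbage; only instruction-boundary walking distinguishes the two.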
Instead, the rewriter walks the code at instruction boundaries using an Instruction Length Decoder (ILD). The ILD doesn’t fully disassemble each instruction — it only computes its length. That’s enough to advance to the next instruction boundary and know exactly where opcodes are versus operands.
The ILD handles the full x86-64 encoding complexity:
- Legacy prefixes (up to 4): LOCK, REP, segment overrides, operand-size (66), address-size (67)
- REX prefix: the 40-4F byte that extends registers and changes operand width
- Opcode: 1-byte, 2-byte (0F escape), or 3-byte (0F 38, 0F 3A)
- ModRM + SIB + displacement: addressing mode encoding
- Immediates: variable width depending on opcode and prefixes
The core is two lookup tables — one for 1-byte opcodes, one for 2-byte opcodes — derived from the Intel Software Developer’s Manual (Vol. 2A, Tables A-2 and A-3). Each table entry encodes whether the opcode has a ModRM byte and what size immediate follows:
const HAS_MODRM: u8 = 0x80;
const IMM8: u8 = 1;
const IMM32: u8 = 3;
static OP1_TAB: [u8; 256] = [
/* 00 ADD r/m,r */ HAS_MODRM,
/* 01 ADD r/m,r */ HAS_MODRM,
...
/* 68 PUSH imm32 */ IMM32,
/* 69 IMUL r,imm */ HAS_MODRM | IMM32,
...
// 256 entries covering the full primary opcode space
];
The decoder walks prefix → REX → opcode → ModRM → SIB → displacement → immediate, accumulating the length at each step. For the curious, the full decoder handles VEX (2-byte and 3-byte) and EVEX prefixes for AVX/AVX-512, MOV r64, imm64 (the only instruction with a 64-bit immediate), the F6/F7 TEST special case (where the ModRM /r field determines whether an immediate follows), and a dozen other encoding quirks.
It’s about 440 lines of Rust. Not pretty, but complete.
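To show the shape of that walk, here’s a toy length decoder covering just a handful of opcodes — illustrative only; the real decoder replaces the match with the full 256-entry tables:

```rust
// Toy ILD: prefix -> REX -> opcode walk for a tiny opcode subset.
// Returns Some(total instruction length in bytes) or None if the
// opcode is outside this sketch's coverage.
fn tiny_ild(code: &[u8]) -> Option<usize> {
    let mut i = 0;
    // 1. Skip legacy prefixes (LOCK, REP*, segment and size overrides)
    while matches!(
        *code.get(i)?,
        0xF0 | 0xF2 | 0xF3 | 0x2E | 0x36 | 0x3E | 0x26 | 0x64 | 0x65 | 0x66 | 0x67
    ) {
        i += 1;
    }
    // 2. Skip a single REX prefix (40-4F)
    if *code.get(i)? & 0xF0 == 0x40 {
        i += 1;
    }
    // 3. Decode the opcode (tiny subset of the tables)
    match *code.get(i)? {
        0x0F => match *code.get(i + 1)? {
            0x05 => Some(i + 2), // SYSCALL
            _ => None,           // other 2-byte opcodes: not modeled here
        },
        0x89 | 0x8B => {
            // MOV r/m,r has a ModRM byte; handle register-direct form only
            let modrm = *code.get(i + 1)?;
            if modrm >> 6 == 0b11 { Some(i + 2) } else { None }
        }
        0x68 => Some(i + 5), // PUSH imm32: opcode + 4-byte immediate
        0x90 => Some(i + 1), // NOP
        0xCC => Some(i + 1), // INT3
        _ => None,
    }
}
```

For example, 48 89 C7 (mov rdi, rax) decodes as REX + opcode + ModRM, length 3, and 0F 05 as length 2.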
Step 2: Find and patch
With the ILD, the rewriter walks the code instruction-by-instruction. At each position, it skips past prefixes and REX to find the opcode bytes. If it finds 0F 05 at the opcode position, that’s a real syscall instruction — not a coincidental byte pattern inside an immediate:
pub fn rewrite_syscalls(code: &mut [u8]) -> usize {
let mut count = 0;
let mut pos = 0;
while pos < code.len() {
let ilen = match ild_length(&code[pos..]) {
Some(n) => n,
None => { pos += 1; continue; } // skip undecodable byte
};
// Skip prefixes and REX to find the opcode
let mut opc = pos;
while opc < pos + ilen {
let b = code[opc];
if is_legacy_prefix(b) { opc += 1; }
else { break; }
}
if opc < pos + ilen && (code[opc] & 0xF0) == 0x40 {
opc += 1; // skip REX
}
// Patch: SYSCALL (0F 05) → INT3 (CC) + NOP (90)
if opc + 1 < pos + ilen
&& code[opc] == 0x0F
&& code[opc + 1] == 0x05
{
code[opc] = 0xCC; // INT3
code[opc + 1] = 0x90; // NOP
count += 1;
}
pos += ilen;
}
count
}
The replacement is INT3 (0xCC) followed by NOP (0x90). INT3 is a one-byte instruction that triggers interrupt vector 3. NOP pads the second byte so the instruction lengths stay aligned — syscall is 2 bytes, INT3 + NOP is 2 bytes. No instruction boundary shifts, no relocation needed.
A real example: CPython 3.12
To make this concrete — here’s what happens when the rewriter runs on a statically-linked Python 3.12 binary:
$ nexus-vmm test_policy.json
Loaded ELF: python3.elf (entry=0x683C6C, 4 segments)
segment 1 (0x683000, 8745266 bytes, exec): patched 363 syscalls
Rewriter: patched 363 total syscall instructions
A 19MB static binary. 8.7MB of executable code. The ILD walks every instruction in the .text section and finds 363 syscall instructions — scattered across musl libc, Python’s posixmodule, the socket module, signal handling, memory allocation, and dozens of other call sites. Each one is replaced with INT3 + NOP. The rewrite takes about 48ms including loading the binary into guest memory. After this, not a single syscall instruction remains in the process image. Every path to the kernel goes through the shim.
Step 3: The shim catches the trap
The rewritten binary doesn’t run on the host kernel — it runs inside a lightweight VM (we use KVM). The VM has no operating system. Instead, a small shim — a few kilobytes of Rust code loaded into the VM’s memory — acts as the only thing between the process and the hardware. Before the guest runs, the hypervisor sets up an IDT (Interrupt Descriptor Table) with vector 3 pointing to the shim’s handler. When the rewritten INT3 fires, the CPU pushes RIP, CS, and RFLAGS onto the stack and jumps to the handler.
The handler reads the syscall number from rax and arguments from rdi, rsi, rdx, r10, r8, r9 — the standard Linux syscall ABI. It calls into a dispatch function that decides what to do:
INT3 fires
→ CPU pushes RIP/CS/RFLAGS, jumps to shim handler
→ handler reads rax (syscall number), rdi-r9 (arguments)
→ dispatch:
- check policy table → DENY? return -EPERM
- can emulate locally? → emulate, return result
- need host help? → escalate to hypervisor
→ handler writes result to rax
→ IRETQ back to guest
The entire path — trap, dispatch, emulate, return — runs in under a microsecond for the common case. The process resumes at the instruction after the original syscall, with the result in rax, exactly as if a real syscall had executed.
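The dispatch step can be sketched as a three-way decision. The syscall numbers below are the real x86-64 ones, but the policy split and helper names are invented for illustration:

```rust
const EPERM: i64 = 1;
const ENOSYS: i64 = 38;

enum Action {
    Deny,     // policy says no: fail gracefully with an errno
    Emulate,  // handled locally by the library kernel
    Escalate, // punt to the hypervisor
}

// Hypothetical policy lookup: a small allowlist, everything else denied.
fn policy_for(nr: u64) -> Action {
    match nr {
        // read, write, mmap, clock_gettime, exit_group
        0 | 1 | 9 | 228 | 231 => Action::Emulate,
        // openat: needs host help in this toy model
        257 => Action::Escalate,
        _ => Action::Deny,
    }
}

fn dispatch(nr: u64, _args: &[u64; 6]) -> i64 {
    match policy_for(nr) {
        Action::Deny => -EPERM,      // process sees an ordinary errno
        Action::Emulate => 0,        // local implementation elided
        Action::Escalate => -ENOSYS, // hypervisor round-trip elided
    }
}

fn main() {
    assert_eq!(dispatch(41, &[0; 6]), -1); // socket: denied -> -EPERM
    assert_eq!(dispatch(1, &[0; 6]), 0);   // write: emulated locally
}
```

The key property is that every branch ends with a plausible Linux return value in rax, so the guest never observes anything but a normal syscall result.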
The edge cases that nearly killed us
The basic rewriter is straightforward. The edge cases are where most of the time goes.
JIT’d code: the LSTAR self-healing trick
The binary rewriter runs once at load time. But what about code generated at runtime — JIT compilers like V8 (Node.js), LuaJIT, or the Python regex engine? These emit fresh machine code containing syscall instructions that don’t exist when the rewrite pass runs.
The solution: the LSTAR MSR as a safety net.
On x86-64, the syscall instruction transfers control to the address in the LSTAR register. Normally, this points to the kernel’s syscall entry point. Set LSTAR to point to a self-healing handler in the shim.
If a JIT’d syscall executes (because the rewriter never had a chance to patch it), it jumps to the LSTAR handler instead of the kernel. The handler:
1. Records the address of the syscall instruction that triggered it
2. Patches it in place — overwrites 0F 05 with CC 90 right there in the JIT’d code
3. Constructs an interrupt frame and falls through to the same INT3 handler
The first execution of a JIT’d syscall takes the slow LSTAR path. Every subsequent execution hits the patched INT3 and takes the fast path. The system self-heals — every new syscall instruction gets caught and patched on first encounter.
_syscall_entry: ; LSTAR target
; SYSCALL saved: RIP → rcx, RFLAGS → r11
; Record the syscall instruction address for patching
lea r15, [rcx - 2] ; address of the syscall opcode (2 bytes back)
mov [rip + LSTAR_PATCH_GPA], r15
; Save original RSP, then build the 5-qword iretq frame
; that the shared INT3 handler expects
mov r15, rsp ; save guest RSP
push 0x10 ; SS
push r15 ; RSP (guest's original)
push r11 ; RFLAGS
push 0x08 ; CS
push rcx ; RIP (return address)
jmp _start ; fall through to shared INT3 handler
The dispatch function checks LSTAR_PATCH_GPA on every invocation. If non-zero, it patches the two bytes at that address to CC 90 and clears the variable. Next time that code runs, INT3 fires directly — no LSTAR round-trip.
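The healing step itself is tiny. A minimal model, with guest code as a byte slice and the LSTAR_PATCH_GPA slot as a plain variable (names and types are illustrative):

```rust
// Apply a pending self-heal patch recorded by the LSTAR stub.
// `pending` holds the offset of the syscall instruction (0 = none).
fn heal_pending(code: &mut [u8], pending: &mut usize) -> bool {
    let addr = *pending;
    if addr == 0 || addr + 2 > code.len() {
        return false; // nothing recorded, or out of range
    }
    // The stub recorded rcx - 2, so these two bytes are the SYSCALL
    assert_eq!(code[addr..addr + 2], [0x0F, 0x05]);
    code[addr] = 0xCC;     // INT3: future executions trap directly
    code[addr + 1] = 0x90; // NOP pad keeps instruction lengths intact
    *pending = 0;          // consume the slot, like clearing LSTAR_PATCH_GPA
    true
}

fn main() {
    // "JIT'd" buffer: nop; nop; nop; syscall
    let mut code = [0x90, 0x90, 0x90, 0x0F, 0x05];
    let mut pending = 3; // offset the LSTAR stub would have recorded
    assert!(heal_pending(&mut code, &mut pending));
    assert_eq!(code[3..5], [0xCC, 0x90]); // patched in place
    assert_eq!(pending, 0);               // slot cleared
}
```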
The remaining edge cases aren’t about the rewriter itself — they’re about building the shim that catches the rewritten traps. But they cost enough debugging time to be worth sharing.
LLVM dead-code elimination
Our shim has a policy table — an array of u64 values at a fixed memory address, written by the hypervisor before boot. The shim reads these values to make fast-path policy decisions.
The first implementation:
#[link_section = ".shim_policy"]
static POLICY_TABLE: [AtomicU64; 128] = [const { AtomicU64::new(0) }; 128];
This worked in debug builds. In release builds with optimizations, the policy table vanished. LLVM saw a static initialized to all zeros, determined that every load from it must produce zero, constant-folded the policy checks to false, eliminated them as dead code, and removed the static entirely.
The fix: don’t use a static at all. Read from the fixed guest physical address with read_volatile:
pub fn check_write(gov_idx: u32, len: u64) -> bool {
let ptr = POLICY_TABLE_ADDR as *const u64;
unsafe {
let slot = core::ptr::read_volatile(ptr.add(gov_idx as usize));
slot & ALLOW_WRITE != 0
}
}
read_volatile tells the compiler: this memory may change at any time, you cannot assume its value. The optimizer cannot fold it, cannot eliminate it, cannot reason about what’s behind the pointer. The hypervisor writes the real policy values to that physical address before boot, and the shim reads them at runtime — exactly as intended.
Dynamic linking sections in a no_std binary
Our shim is built as a no_std Rust binary — no standard library, no runtime, no dynamic linking. But when we extracted the raw binary with objcopy -O binary, the output was 10x larger than expected.
The culprit: the LLVM linker (lld) emits .dynamic, .dynsym, .gnu.hash, .hash, and .dynstr sections even in no_std binaries. These sections are marked ALLOC, which tells objcopy they belong in the binary image. So our shim.bin contained megabytes of empty dynamic linking metadata that would never be used.
The fix is in the linker script:
/DISCARD/ : {
*(.dynamic)
*(.dynsym)
*(.gnu.hash)
*(.hash)
*(.dynstr)
}
Explicit discard. The sections don’t appear in the output, objcopy produces a minimal binary, and the shim fits in a single 4KB page as intended.
What this enables
Once every syscall routes through your handler, you can enforce arbitrary policy on untrusted code with near-native performance:
open("/data/customers.csv")→ allow (it’s in the policy)open("/etc/shadow")→ deny, return-ENOENT(file doesn’t exist in the controlled environment)connect("api.openai.com:443")→ allow, route through TLS inspectionconnect("pastebin.com:443")→ deny, return-EPERMwrite(socket_fd, buffer_containing_pii)→ block before a single byte hits the network
The process sees standard syscall return values. -EPERM when denied, normal results when allowed. It doesn’t know it’s being intercepted. It can’t detect the interception (the syscall instruction was replaced, there’s nothing to probe). And it can’t bypass it — every path to the kernel goes through our shim.
This is the foundation of a minimal VM runtime where untrusted code runs inside a controlled environment with full visibility into every action it takes. The binary rewriter is the first layer: it creates a position between the process and the kernel where you can see everything, control everything, and audit everything, at a cost of under a microsecond per syscall.
But a binary rewriter alone isn’t a runtime. The next question is: what handles those intercepted syscalls? A full Linux kernel is overkill for a single-process workload that does file I/O and network calls. Tomorrow’s post covers why a ~40-syscall “kernel” is enough, and how a hypervisor backstop handles everything else.
This is post 1 of a 7-part series on building a minimal VM runtime for AI agent execution. If you’re working on agent sandboxing, runtime security, or just want to follow along — subscribe for the rest.
If you’re dealing with agent security at your company, I’d love to hear what you’ve tried and what’s missing — reach out on LinkedIn.