Per-VM Guest Networking Without a Bridge


Tobi Ogundiyan | SpaceScale

Building the Layer 2 local host TAP networking was not as easy as I thought. Firecracker is dumb by design: it does not manage guest networking or IP allocation. It only attaches the guest's network interface to a TAP device on the host, and the entire responsibility for local networking falls on the host. This is a blessing, because it gives us the flexibility to manage the network topology from the host daemon without fighting the VMM. In the current setup, the guest IP settings are passed to the guest daemon inside the VM as custom kernel command line flags:

    guestd.ipv4=
    guestd.gateway=

The Hardcoded Starting Point

To get the boot path working end to end, I hardcoded a single private address, 172.16.0.2/30, for the VM and 172.16.0.1/30 for the gateway. For every VM we want a point-to-point link where the guest and its default gateway are the only nodes on their own network. A /30 prefix leaves two host bits, so it covers exactly 4 addresses, of which only two are usable:

    172.16.0.0 -- network address [unusable]
    172.16.0.1 -- default gateway (host)
    172.16.0.2 -- guest VM address
    172.16.0.3 -- broadcast address [unusable]

The first VM worked. I was able to ping it and it responded. So how do we scale beyond one guest?
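The /30 breakdown can be checked with Go's standard `net/netip` package. This is a minimal sketch, not code from the host daemon: the `block30` type and the `splitSlash30` helper are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// block30 names the four addresses of one /30. The type and its field
// names are illustrative; they do not come from the article's code.
type block30 struct {
	Network, Host, Guest, Broadcast netip.Addr
}

// splitSlash30 parses an IPv4 CIDR string such as "172.16.0.2/30" and
// returns the four addresses of the /30 block that contains it.
func splitSlash30(cidr string) (block30, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return block30{}, err
	}
	if !p.Addr().Is4() || p.Bits() != 30 {
		return block30{}, errors.New("expected an IPv4 /30")
	}
	network := p.Masked().Addr() // zero the two host bits
	host := network.Next()
	guest := host.Next()
	return block30{Network: network, Host: host, Guest: guest, Broadcast: guest.Next()}, nil
}

func main() {
	b, err := splitSlash30("172.16.0.2/30")
	if err != nil {
		panic(err)
	}
	fmt.Println("network:  ", b.Network)
	fmt.Println("host:     ", b.Host)
	fmt.Println("guest:    ", b.Guest)
	fmt.Println("broadcast:", b.Broadcast)
}
```

Passing any address inside the block, for example `172.16.0.1/30` or `172.16.0.3/30`, yields the same four addresses, because masking discards the two host bits.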

Scaling Beyond One Guest

The initial design was just for one guest. If two TAPs share the same IP, the host acting as the router will not know where to send packets addressed to that IP and will eventually drop them. This is commonly called an IP conflict, which is why we cannot reuse 172.16.0.2 for another VM.

The solution was to carve out a larger private address pool. Per RFC 1918 and CIDR rules, we can reserve 172.16.0.0/16 as our per-host VM address pool and slice it into smaller point-to-point blocks. The host portion is 16 bits. Doing the maths:

    172.16.0.0/16
    ├── 65,536 total addresses [2^16]
    ├── ÷ 4 addresses per /30 [network, gateway, guest, broadcast]
    └── = 16,384 isolated VM networks per host

With this larger address pool, we can hand out 16,384 unique point-to-point links per host, far beyond what a single bare-metal host will ever schedule.
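The same arithmetic, written out as a small Go sketch. The helper names `addressesIn` and `blocksIn` are hypothetical and exist only to make the calculation explicit.

```go
package main

import "fmt"

// addressesIn returns how many IPv4 addresses a prefix of the given
// length covers: 2^(32 - prefixLen).
func addressesIn(prefixLen int) int {
	return 1 << (32 - prefixLen)
}

// blocksIn returns how many blocks of length blockLen fit into a pool of
// length poolLen. It assumes poolLen <= blockLen <= 32.
func blocksIn(poolLen, blockLen int) int {
	return addressesIn(poolLen) / addressesIn(blockLen)
}

func main() {
	fmt.Println("addresses in /16:", addressesIn(16))
	fmt.Println("addresses in /30:", addressesIn(30))
	fmt.Println("/30 blocks in /16:", blocksIn(16, 30))
}
```

Because the highest block index is 16,383, every index fits comfortably in a `uint16`.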

VMs will get consecutive /30 blocks using this addressing scheme:

    network            host            guest           broadcast
    ──────────────────────────────────────────────────────────────────
    172.16.0.0/30      172.16.0.1      172.16.0.2      172.16.0.3
    172.16.0.4/30      172.16.0.5      172.16.0.6      172.16.0.7
    172.16.0.8/30      172.16.0.9      172.16.0.10     172.16.0.11
    ...
    172.16.0.252/30    172.16.0.253    172.16.0.254    172.16.0.255
    172.16.1.0/30      172.16.1.1      172.16.1.2      172.16.1.3

Now that the maths is done, the next challenge was teaching the host daemon how to hand out these /30 blocks to VMs and reclaim them when VMs die.

Leasing and Clawbacks

I had previously built a context ID allocator for host-to-guest communication over vsock that uses a mutex-protected ring to lease out IDs and claw them back when a lease expires. I decided to reuse that exact pattern to build a subnet allocator.

The idea is simple. We maintain a pool of 16,384 subnet indices (0 to 16,383). When a VM boots, it acquires the next free index. When a VM dies, it releases its index back to the pool. The index maps deterministically to a concrete /30. Index 0 is always 172.16.0.0/30, index 1 is always 172.16.0.4/30, and so on. No state needs to be persisted. The mapping is pure arithmetic.
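The lease lifecycle can be illustrated on a toy pool before looking at the real allocator. Everything here is a hypothetical sketch: `indexPool`, `newIndexPool`, `Release`, and `errPoolExhausted` are names invented for the example, and the pool is shrunk to three indices so that exhaustion and reuse are easy to see.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPoolExhausted = errors.New("no free index")

// indexPool is a simplified, hypothetical lease pool over indices [0, size).
type indexPool struct {
	mu   sync.Mutex
	size uint16
	next uint16 // cursor: where the next scan starts
	used map[uint16]struct{}
}

func newIndexPool(size uint16) *indexPool {
	return &indexPool{size: size, used: make(map[uint16]struct{})}
}

// Acquire leases the first free index at or after the cursor, wrapping
// around once. The arithmetic is done in int to avoid uint16 overflow.
func (p *indexPool) Acquire() (uint16, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := 0; i < int(p.size); i++ {
		idx := uint16((int(p.next) + i) % int(p.size))
		if _, taken := p.used[idx]; !taken {
			p.used[idx] = struct{}{}
			p.next = uint16((int(idx) + 1) % int(p.size))
			return idx, nil
		}
	}
	return 0, errPoolExhausted
}

// Release returns an index to the pool. Releasing a free index is a no-op.
func (p *indexPool) Release(idx uint16) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.used, idx)
}

func main() {
	pool := newIndexPool(3)
	a, _ := pool.Acquire()   // leases 0
	b, _ := pool.Acquire()   // leases 1
	pool.Release(a)          // 0 is free again
	c, _ := pool.Acquire()   // leases 2: the cursor keeps moving forward
	d, _ := pool.Acquire()   // leases 0: the scan wraps around
	_, err := pool.Acquire() // all three indices are leased
	fmt.Println(a, b, c, d, err)
}
```

One consequence of scanning from a cursor rather than from zero is that a freshly released index is not handed out again until the scan wraps around to it.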

First we define constants for the network base addresses and the valid index range:

    const (
        subnetBaseA      byte   = 172
        subnetBaseB      byte   = 16
        maxSubnetIndex   uint16 = 16383 // (65536 / 4) - 1
        firstSubnetIndex uint16 = 0
    )

The allocator struct tracks active leases with a mutex-protected map and a cursor that remembers where to scan next:

    type subnetAllocator struct {
        mu   sync.Mutex
        next uint16
        used map[uint16]struct{}
    }

To lease the next available /30, the Acquire method scans forward from the current cursor. If the current index is free, it claims it, advances the cursor, and returns the computed subnet. If the index is already in use, it advances and tries the next one. If it comes full circle back to where it started, the pool is exhausted:

    func (a *subnetAllocator) Acquire() (Subnet, error) {
        a.mu.Lock()
        defer a.mu.Unlock()

        start := a.next
        for {
            index := a.next // candidate index under the cursor
            if _, exists := a.used[index]; !exists {
                a.used[index] = struct{}{}
                a.advanceLocked()
                return subnetForIndex(index), nil
            }

            a.advanceLocked()
            if a.next == start {
                return Subnet{}, errNoSubnetAvailable
            }
        }
    }

The subnetForIndex function is where the address arithmetic lives. Each /30 block consumes 4 addresses, so we multiply the index by 4 to get the starting offset within the /16 pool. We then split that offset across the third and fourth octets using bit shifting. Shifting right by 8 gives us the third octet, that is, how many full 256-address blocks we have crossed. Masking with 0xFF gives us the remainder for the fourth octet. The host takes offset + 1 and the guest takes offset + 2:

    func subnetForIndex(index uint16) Subnet {
        offset :=...
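The arithmetic described above can be sketched in full as follows. This is one possible reading, not the daemon's actual code: the fields of `Subnet` and the use of `net/netip` are assumptions, since the real struct is not shown. The two base constants are repeated so that the example compiles on its own.

```go
package main

import (
	"fmt"
	"net/netip"
)

const (
	subnetBaseA byte = 172
	subnetBaseB byte = 16
)

// Subnet is a hypothetical shape for the allocator's result.
type Subnet struct {
	Network netip.Prefix // the /30 itself
	Host    netip.Addr   // gateway address on the host side of the TAP
	Guest   netip.Addr   // address handed to the VM
}

// subnetForIndex maps an index in [0, 16383] to its /30 inside 172.16.0.0/16.
func subnetForIndex(index uint16) Subnet {
	offset := uint32(index) * 4    // each /30 consumes 4 addresses
	third := byte(offset >> 8)     // full 256-address blocks crossed
	fourth := byte(offset & 0xFF)  // remainder; always a multiple of 4

	addr := func(last byte) netip.Addr {
		return netip.AddrFrom4([4]byte{subnetBaseA, subnetBaseB, third, last})
	}
	return Subnet{
		Network: netip.PrefixFrom(addr(fourth), 30),
		Host:    addr(fourth + 1), // cannot overflow: fourth <= 252
		Guest:   addr(fourth + 2),
	}
}

func main() {
	for _, i := range []uint16{0, 1, 63, 64, 16383} {
		s := subnetForIndex(i)
		fmt.Printf("index %5d: %-18s host %-15s guest %s\n", i, s.Network, s.Host, s.Guest)
	}
}
```

Index 63 is the last block in the 172.16.0.x range (172.16.0.252/30), and index 64 is the first one that carries into the third octet (172.16.1.0/30), matching the addressing table earlier in the article.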
