Moving a 6-node NETLAB+ cluster off VMware to Proxmox | Solomon Neas
The cluster worked fine. That was the problem.
Six Lenovo SR630 nodes, 196 cores, about 2.75 TB of RAM, running NETLAB+ for live coursework: Cisco, Palo Alto, Security+, CySA+, ethical hacking, A+. Students logged in every week and spun up real pods on it. Nothing was broken.
Then Broadcom ended perpetual licensing and moved VMware to per-core subscription, and the number stopped making sense. At Broadcom’s 2024 list prices, our 196 cores penciled out to roughly $26,000 a year on vSphere Foundation ($135/core) or about $69,000 on the full Cloud Foundation bundle ($350/core), and the rates have only climbed since. Opening renewal quotes across the industry ran two to five times prior spend; negotiated deals often settled nearer 1.3 to 2x. Either way, over a five-year cycle it’s a low-hundreds-of-thousands line item to keep running software that was already installed and already doing the job. On a community-college lab budget, that’s not a renewal. It’s a wall.
So the cluster moved to Proxmox. The hypervisor swap was the easy part. The parts worth writing about came after: restoring hundreds of gigs of pod images over the public internet, and a Proxmox 9 upgrade that detonated the environment and sent me rebuilding it from my own notes.
The money
| | VMware (Broadcom subscription) | Proxmox VE |
|---|---|---|
| License model | Per-core annual subscription; perpetual licenses ended early 2024 | AGPLv3, no license cost |
| List price per core/yr (2024 launch list) | $135 (vSphere Foundation) to $350 (Cloud Foundation) | $0 |
| 196 cores, annual | $26,000 to $69,000 (before renewal multipliers) | $0 |
| 196 cores, 5-year | $130,000 to $345,000 | $0 |
This was never “VMware bad.” It did the job for years and the feature set was never the complaint. The problem is that a vendor can re-price a working platform overnight and your only move is pay or leave. That’s not a technology risk, it’s an architecture risk that happens to land on an invoice.
Proxmox covered what the lab actually used: KVM/QEMU, clustering, live migration, and a real Linux host underneath instead of an appliance you poke at through a sanctioned API. Licensing went to zero. The work didn’t disappear, it moved from defending a renewal to running the platform. That’s the better bill.
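For anyone who wants to rerun the licensing arithmetic with their own core count, here is a minimal sketch. The per-core prices and the 196-core figure come from this post; the function names are mine, and the flat five-year projection assumes the price never moves, which is the optimistic case.

```python
# Per-core list prices quoted in this post (USD per core per year, 2024 list).
CORES = 196
PRICE_PER_CORE = {
    "vSphere Foundation": 135,
    "Cloud Foundation": 350,
}


def annual_cost(cores: int, price_per_core: int) -> int:
    """List-price subscription cost for one year, before any renewal multiplier."""
    return cores * price_per_core


def multi_year_cost(cores: int, price_per_core: int, years: int = 5) -> int:
    """Flat-rate cost over a cycle; assumes the per-core price never changes."""
    return annual_cost(cores, price_per_core) * years


if __name__ == "__main__":
    for bundle, price in PRICE_PER_CORE.items():
        print(
            f"{bundle}: ${annual_cost(CORES, price):,}/yr, "
            f"${multi_year_cost(CORES, price):,} over 5 years"
        )
```

The exact products are $26,460 and $68,600 a year, or $132,300 and $343,000 over five; the table rounds those.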
How the VMs came across
Everyone asks this first, so: there was no VMDK conversion on my end. A normal self-hosted VMware-to-Proxmox move does involve qemu-img or an OVF import; mine didn’t, because NDG, the company behind NETLAB+, does that ingest upstream and distributes everything as Proxmox Backup Server snapshots from a server they host. You add their PBS as a datastore, point it at the netlab namespace, and pull: the NETLAB-VE management appliance first, then every course pod, restored into qcow2 on a node-local SSD datastore.
So “migrating” the student workloads wasn’t a lift-and-shift. It was a clean re-pull onto rebuilt infrastructure. Which sounds simple, right up until you do the math on moving that much data over a WAN.
Pulling pods over the internet is the slow, flaky part
The NETLAB-VE management appliance alone restores as nine virtual disks, one of them 100 GB, and every course pod stacks more behind it. Each disk comes down with pbs-restore from NDG’s backup server over the public internet, slow and flaky in equal measure. One appliance disk, mostly empty, still logged a pbs-restore speed of 3.57 MB/s; the big data disks were the real wait. When a transfer dropped, it looked like this:
progress 4% (read 1291845632 bytes, zeroes = 9% ...)
restore failed: connection reset
error before or during data restore, some or all disks were not completely restored.
TASK ERROR: ... pbs-restore ... failed: exit code 255
A reset partway through doesn’t unwind cleanly. pbs-restore clears its own temp qcow2 volumes, but it leaves the VM config behind (the task log says the state “is NOT cleaned up”), so you delete that by hand before retrying the whole disk set from zero. Multiply that across every pod for every course, on a back-to-school deadline.
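The clean-up-then-retry loop is easy to get wrong by hand at 2 a.m., so here is a rough sketch of its shape. It is not what I ran: `restore` and `cleanup` are hypothetical callables you would supply yourself (for example, wrappers around whatever commands you use to start the restore and to remove the leftover VM config), and the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Callable


def restore_with_cleanup(
    restore: Callable[[], None],
    cleanup: Callable[[], None],
    max_attempts: int = 3,
    delay_s: float = 30.0,
) -> int:
    """Run restore(); on any failure run cleanup() and start over from zero.

    Returns the attempt number that succeeded. If every attempt fails, the
    last error is re-raised after a final cleanup.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            restore()
            return attempt
        except Exception:
            # A failed restore leaves state behind; remove it before the
            # next attempt (or before giving up).
            cleanup()
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)
    raise ValueError("max_attempts must be at least 1")
```

Running cleanup on every failure, including the last one, means a pod that never restores at least does not leave a half-built VM config behind to trip up the next run.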
The lesson that stuck: when the images live on a server you don’t control, your migration timeline is bounded by the link to it, not by your local switch. Stage early, restore in parallel where the storage can take it, and assume a few will die partway and need a cleanup before the retry.
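To put a number on "bounded by the link", here is a back-of-the-envelope estimator. It is a sketch under loud assumptions: it treats the rate as sustained, uses decimal units, and ignores any skipping of empty regions, so it is a pessimistic ceiling and not a measurement. The 100 GB and 3.57 MB/s inputs are the figures quoted above; the function names are mine.

```python
from typing import Iterable


def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours to move size_gb at a sustained rate (decimal units: 1 GB = 1000 MB)."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1000 / rate_mb_per_s / 3600


def total_hours(disk_sizes_gb: Iterable[float], rate_mb_per_s: float) -> float:
    """Back-to-back time for several disks pulled one after another at the same rate."""
    return sum(transfer_hours(size, rate_mb_per_s) for size in disk_sizes_gb)


if __name__ == "__main__":
    # Assumption: a fully populated 100 GB disk moving at the logged 3.57 MB/s.
    print(f"{transfer_hours(100, 3.57):.1f} hours")
```

Under those assumptions a single full 100 GB disk takes a little under eight hours, which is why one dropped connection near the end hurts so much.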
This is also where LACP earns its keep and lets you down in the same breath. A bond across the 1G links gives you resilience and aggregate throughput across many flows, but a single restore stream is one flow and rides one link. Bonding is not a speed-up for one big transfer, and expecting it to be is a good way to lose an afternoon staring at a port graph wondering why the other link is idle.
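Why one stream rides one link is easier to see as a toy model. The sketch below is illustrative only: real bonds and switches pick a link with their own configured hash policy, and I am substituting SHA-256 over the flow tuple as a stand-in. The IP addresses, ports, and function names are made up for the example; the 1 Gbit/s per link matches the bond described above.

```python
import hashlib
from typing import Iterable, Tuple

# A flow is identified by (src_ip, dst_ip, src_port, dst_port).
Flow = Tuple[str, str, int, int]


def pick_link(flow: Flow, n_links: int) -> int:
    """Map a flow to one member link. Stand-in hash, not a real bond policy."""
    src_ip, dst_ip, src_port, dst_port = flow
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def aggregate_ceiling_gbps(
    flows: Iterable[Flow], n_links: int, link_gbps: float = 1.0
) -> float:
    """Best-case throughput: only links that carry at least one flow contribute."""
    used = {pick_link(flow, n_links) for flow in flows}
    return len(used) * link_gbps


if __name__ == "__main__":
    one_restore = [("10.0.0.10", "203.0.113.5", 40000, 8007)]
    many_flows = [("10.0.0.10", "203.0.113.5", 40000 + i, 8007) for i in range(1000)]
    print(aggregate_ceiling_gbps(one_restore, n_links=2))  # one flow, one link
    print(aggregate_ceiling_gbps(many_flows, n_links=2))   # many flows spread out
```

The same tuple always hashes to the same link, so a single restore stream never sees more than one link's worth of bandwidth no matter how many links are in the bond.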
The network
I treated this as an infrastructure project, not a hypervisor swap, and the network is why. Each host had a dual-port 10G NIC split across two switches: one port carried management to the Cisco as a plain access port,...