The backup SSH daemon I run before every do-release-upgrade · ma.ttias.be
Mattias Geniar

I spent a chunk of the last few weeks upgrading a fleet of Ubuntu servers in place, one LTS to the next, with do-release-upgrade. Dozens of boxes, mostly stateless, mostly boring once you’ve done the first few.

The thing that kept trying to lock me out wasn’t the new OS, the kernel, or some package that wouldn’t configure. It was SSH itself, dying in the middle of the upgrade, on the exact connection I was using to run it.

If you only ever upgrade one box every couple of years, you might never notice this, or you’ll blame the network and reconnect. Do it thirty times in a row and the pattern is impossible to miss. So here’s what happens, why the built-in safety net doesn’t fire when you’ve done everything else right, and the backup daemon I now start before touching anything.

SSH dies mid-upgrade

Partway through every hop, new SSH connections to the box start failing. From my laptop:

$ ssh user@server -p 22
kex_exchange_identification: read: Connection reset by peer
Connection reset by 167.x.x.x port 22

The session I was already in stays alive. systemctl status ssh on the box says active (running). sshd -t says the config is fine. But every new connection gets reset, for minutes at a time.

This isn’t a bug; it’s openssh-server upgrading itself out from under you. sshd re-executes itself for every new connection: the listening parent forks a child, and that child execs the on-disk sshd binary. During the upgrade, the new binary lands on disk while the parent process in memory is still the old one. The two disagree about the format of the state they pass to each other, and the handshake collapses. The machinery is right there in the binary:

$ strings /usr/sbin/sshd | grep rexec
rexec of %s failed: %s
send_rexec_state
incomplete message
rexec version mismatch

On the server side it shows up in the auth log as recv_rexec_state: buffer error: incomplete message. Nothing is broken in a way you need to fix. The box is fine.
You just can’t get a new shell on it until the upgrade finishes replacing openssh and the parent gets restarted (which a reboot does cleanly).

The catch is that “you can’t get a new shell” is exactly the situation you do not want to be in halfway through an OS upgrade, when something else might need your attention.

tmux disables the safety net

Ubuntu’s own upgrader knows about this. When it detects you’re running over SSH, it starts a second sshd on port 1022 specifically so you have a spare door if the main one jams. Here’s the actual code in ubuntu-release-upgrader-core (1:22.04.20 on jammy):

def _sshMagic(self):
    """ this will check for server mode and if we run over ssh.
        if this is the case, we will ask and spawn a additional
        daemon (to be sure we have a spare one around in case
        of trouble)
    """
    pidfile = os.path.join("/var/run/release-upgrader-sshd.pid")
    if (not os.path.exists(pidfile) and
            os.path.isdir("/proc") and
            is_child_of_process_name("sshd")):
        ...
        port = 1022
        res = subprocess.call(["/usr/sbin/sshd",
                               "-o", "PidFile=%s" % pidfile,
                               "-p", str(port)])

Read the if. It only starts the spare sshd when is_child_of_process_name("sshd") is true, meaning the upgrader’s process has sshd somewhere up its parent chain. That function literally walks /proc/<pid>/stat from itself up to PID 1, looking for a process called sshd:

def is_child_of_process_name(processname, pid=None):
    if not pid:
        pid = os.getpid()
    while pid > 0:
        with open("/proc/%s/stat" % pid) as stat_f:
            stat = stat_f.read()
        command = stat.partition("(")[2].rpartition(")")[0]
        if command == processname:
            return True
        pid = int(stat.rpartition(")")[2].split()[1])
    return False

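If you want to see that walk for yourself rather than a bare True or False, here is a minimal sketch that reuses the same /proc parsing but records every hop. It assumes a Linux /proc filesystem; parse_stat and parent_chain are hypothetical names chosen for illustration and are not part of the upgrader.

```python
import os
from typing import List, Optional, Tuple


def parse_stat(stat: str) -> Tuple[str, int]:
    """Return (command, ppid) from the contents of /proc/<pid>/stat."""
    # The command sits between the first "(" and the last ")". It can
    # itself contain spaces or parentheses, e.g. "(tmux: server)".
    command = stat.partition("(")[2].rpartition(")")[0]
    # After the closing ")" come the process state and then the parent PID.
    ppid = int(stat.rpartition(")")[2].split()[1])
    return command, ppid


def parent_chain(pid: Optional[int] = None) -> List[Tuple[int, str]]:
    """List (pid, command) pairs from pid upwards until the parent is 0."""
    if pid is None:
        pid = os.getpid()
    chain = []
    while pid > 0:
        with open("/proc/%s/stat" % pid) as stat_f:
            command, ppid = parse_stat(stat_f.read())
        chain.append((pid, command))
        pid = ppid
    return chain


if __name__ == "__main__":
    chain = parent_chain()
    for pid, command in chain:
        print("pid %d (%s)" % (pid, command))
    # Same test the upgrader applies: is any ancestor called exactly "sshd"?
    print("sshd in parent chain:", any(c == "sshd" for _, c in chain))
```

Running it once from a plain SSH login shell and once from whatever wrapper you normally use for long jobs shows whether sshd appears in the chain in each case.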
What disables that fallback is your next step, and it’s the right step to take. A release upgrade can take 20 minutes or more, and you should never run something that long on a raw SSH connection, because if your laptop’s wifi hiccups, the upgrade dies with it. So you run it inside tmux (or screen, which I’ve been preaching since 2008).

But tmux daemonizes. When you start a session, the tmux server forks off and gets reparented to init; it is not a child of your shell. So anything running inside tmux has a parent chain that goes to PID 1, never back through sshd. You can watch it happen:

$ tmux new-session -d -s upg 'sleep 120'
$ # walk the parent chain of that sleep:
pid 4090 (sleep) -> ppid 4089
pid 4089 (tmux: server) -> ppid 1

The sleep inside tmux is a child of the tmux server, which is a child of init. sshd appears nowhere. So is_child_of_process_name("sshd") returns False, _sshMagic does nothing, and the upgrader’s spare sshd never starts.

The two correct decisions cancel each other out. You run the upgrade in tmux so a dropped connection can’t kill it, and that single act silently switches off the one...