My homelab server hung on every reboot for years. The fix was one line of YAML.
My rock4 — a Radxa RockPi4 running DietPi with four SATA SSDs on a Penta HAT — has never rebooted cleanly. For as long as I've had it in the rack, issuing sudo shutdown -r now meant walking over to the machine, waiting ten minutes to confirm it was definitely stuck, and flipping the power switch. Every single time.
It worked perfectly otherwise. Services ran fine. Drives mounted fine. The machine was solid right up until the moment you asked it to restart.
This is the story of finding the actual cause — and why the fix I thought would work made no difference at all.
The obvious culprit (that wasn't)
When a server hangs on shutdown, the usual suspects are slow-stopping services. The systemd-analyze blame output on rock4 had an obvious candidate: unattended-upgrades.service, which by default gets a TimeoutStopSec of 1800 seconds (30 minutes). If an apt upgrade happened to be running at shutdown time, systemd would sit there for half an hour waiting for it to finish before giving up.
I applied a drop-in to cap it at 5 minutes. It still hung. For over two hours.
I dug deeper and found a second culprit: apt-daily-upgrade.service, a separate timer-triggered unit that calls unattended-upgrades. It has its own TimeoutStopSec of 900 seconds. I capped that too.
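For anyone wanting to reproduce the cap, here's a minimal sketch of writing such a drop-in. The function name cap_stop_timeout and the file name timeout.conf are my own picks for illustration, not anything systemd mandates; the optional third argument exists only so the function can be dry-run against a scratch directory.

```shell
# Write a systemd drop-in that caps TimeoutStopSec for one unit.
# Usage: cap_stop_timeout UNIT SECONDS [SYSTEMD_DIR]
# SYSTEMD_DIR defaults to /etc/systemd/system (needs root); the function
# name and "timeout.conf" are illustrative assumptions.
cap_stop_timeout() {
    local unit="$1" seconds="$2" base="${3:-/etc/systemd/system}"
    local dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    # A drop-in only needs the section header plus the setting it overrides.
    printf '[Service]\nTimeoutStopSec=%s\n' "$seconds" > "$dir/timeout.conf"
}
```

On the real machine that would be cap_stop_timeout unattended-upgrades.service 300 run as root, followed by systemctl daemon-reload. Afterwards, systemctl show -p TimeoutStopUSec unattended-upgrades.service should report the new value.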
Still hung.
At this point I was fairly sure the apt theory was wrong, but I didn't have a better one yet.
The diagnostic that changed everything
Here's the thing about a “hung” server: it's worth checking whether the machine is actually dead or whether only systemd is stuck.
After triggering a shutdown and watching rock4 go dark, I opened LanScan and scanned the local network. rock4 was still there. Still responding to pings. Port 111 (rpcbind) still open.
That's not a dead machine. That's a machine with a live kernel where systemd has frozen mid-shutdown.
systemd shuts down in phases: it stops services, then unmounts filesystems, then hands off to the kernel for the actual reboot. If it gets stuck at the filesystem unmount step, the kernel never gets the reboot request, and the machine just idles there indefinitely, still on the network, lights still on, going nowhere.
The question was: which mount was blocking?
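One way to narrow that question down is to list only the network-backed mounts, since those are the ones that depend on something outside the local disks. This is a rough sketch rather than what I actually ran at the time: the function name network_mounts and the set of filesystem types it matches are my assumptions.

```shell
# Print the mountpoints of network-backed filesystems from a mounts table.
# Usage: network_mounts [MOUNTS_FILE]   (defaults to /proc/mounts)
# The list of filesystem types below is an assumption; extend it as needed.
network_mounts() {
    # /proc/mounts columns: device, mountpoint, fstype, options, dump, pass
    awk '$3 ~ /^(nfs|nfs4|cifs|fuse\.glusterfs)$/ { print $2 }' "${1:-/proc/mounts}"
}
```

Running network_mounts with no arguments reads the live table. findmnt -t nfs,nfs4 gives a similar view if util-linux is installed.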
rock4 has four local SATA drives and one NFS mount — /mnt/media, served from my itx machine over the local network. I pulled up the running containers:
docker inspect jackett --format '{{ json .Mounts }}'
There it was:
/mnt/media/media/Downloads → /downloads
jackett — my torrent indexer — had an NFS-backed path bound as a Docker volume.
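Checking one container at a time gets tedious, so here's a sketch of the same check across every running container. Both function names are made up for illustration, and the parsing assumes no spaces in container names or source paths.

```shell
# Read lines of "container source1 source2 ..." on stdin and print
# "container source" for every source at or below PREFIX.
# Assumes paths contain no whitespace.
mounts_under() {
    awk -v prefix="${1%/}" '{
        for (i = 2; i <= NF; i++)
            if ($i == prefix || index($i, prefix "/") == 1)
                print $1, $i
    }'
}

# Produce that input from the running containers (requires the docker CLI).
container_mount_sources() {
    docker ps -q | xargs -r docker inspect \
        --format '{{.Name}}{{range .Mounts}} {{.Source}}{{end}}'
}
```

Piping container_mount_sources into mounts_under /mnt/media should print one line per container that binds something from the NFS mount.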
Why this hangs forever
When Docker mounts a volume into a container, the kernel creates a bind mount that keeps a reference count on that filesystem. Even after Docker stops the container, the overlay filesystem machinery can retain a reference to the underlying mountpoint.
So when systemd later runs umount /mnt/media, the kernel sees that something still holds a reference to that mount and returns EBUSY. Systemd retries. The NFS server is still up, healthy, and reachable — but that doesn't matter. The umount call isn't failing because the server is gone; it's failing because the local kernel thinks something still has the filesystem open.
And here's the critical part: umount has no timeout. The TimeoutStopSec settings on services don't help. The soft,timeo=30 NFS mount option doesn't help — that governs read/write operation timeouts, not the unmount syscall itself. Without something explicitly forcing a lazy unmount, systemd will wait forever.
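While the machine is still up, one way to look for whatever references the NFS export is to scan each process's mount table for it. A bind mount inside a container shows up under a different mountpoint (here /downloads), so the sketch below matches on the mount source instead. The function name is hypothetical, and reading other users' mountinfo files generally needs root.

```shell
# Print PIDs whose mount namespace contains a mount of SRC
# (e.g. "itx.local:/mnt/media"), by scanning PROC_DIR/<pid>/mountinfo.
# Usage: pids_with_mount_source SRC [PROC_DIR]   (defaults to /proc)
pids_with_mount_source() {
    local src="$1" proc="${2:-/proc}" f pid
    for f in "$proc"/[0-9]*/mountinfo; do
        [ -r "$f" ] || continue
        # mountinfo: six fixed fields, optional fields, a "-" separator,
        # then fstype and mount source.
        if awk -v src="$src" '
            { for (i = 7; i < NF; i++)
                  if ($i == "-") { if ($(i + 2) == src) found = 1; break } }
            END { exit !found }' "$f"; then
            pid="${f%/mountinfo}"
            echo "${pid##*/}"
        fi
    done
}
```

This only finds live processes. If the reference survives after the container's processes have exited, as described above, it won't show up here, and an empty result doesn't prove the mount is free.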
The fix
jackett is a torrent indexer. It speaks to tracker APIs and returns search results to Radarr and Sonarr. It does not need to read or write files on disk. The downloads volume was there because at some point, someone (me, almost certainly) copy-pasted a docker-compose snippet from the internet without thinking about whether every line was necessary.
The fix was removing one line from services/jackett.yml:
# Before
volumes:
  - /bricks/rock4-2/jackett:/config
  - /mnt/media/media/Downloads:/downloads # ← this line
# After
volumes:
  - /bricks/rock4-2/jackett:/config
Redeployed jackett, issued sudo shutdown -r now, and watched. Three minutes later, rock4 was back online. No power cycle. First clean reboot in years.
The general rule
If you're running Docker containers on a machine that also has NFS mounts, think hard before binding any NFS-backed path into a container volume. The risk isn't that Docker will do something wrong — it's that the combination of Docker's bind mount lifecycle and the kernel's umount semantics creates a window where shutdown can hang indefinitely with no error message and no timeout.
If you genuinely need an NFS path inside a container, the belt-and-suspenders fix is to add x-systemd.mount-timeout=30 to the relevant fstab entry. This caps the mount's teardown time at 30 seconds rather than forever — not ideal, but it bounds the hang.
itx.local:/mnt/media /mnt/media nfs soft,timeo=30,x-systemd.mount-timeout=30 0 0
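To check whether every NFS entry on a machine carries that option, something like the following sketch works. The function name is my own invention, and it only looks at fstab-format files, not at mounts created some other way.

```shell
# Print NFS mountpoints in an fstab-format file that lack
# x-systemd.mount-timeout in their options.
# Usage: nfs_without_timeout [FSTAB_FILE]   (defaults to /etc/fstab)
nfs_without_timeout() {
    # fstab columns: device, mountpoint, fstype, options, dump, pass
    awk '$1 !~ /^#/ && $3 ~ /^nfs4?$/ && $4 !~ /(^|,)x-systemd\.mount-timeout=/ { print $2 }' \
        "${1:-/etc/fstab}"
}
```

No output means every NFS line already has the option set.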
But better is to audit your container volume mounts and ask: does this service actually need filesystem access, or is it just inheriting a volume that was copy-pasted into the config at some point?
Why it was so hard to diagnose
A few things made this particularly hard to spot:
No error message. The machine doesn't log “stuck waiting for NFS umount.” It just sits there. Systemd is doing exactly what it's supposed to do: retrying an unmount that keeps returning EBUSY. There's nothing in the journal because journald itself has already stopped by the time the hang happens.
The wrong hypothesis was plausible. Unattended-upgrades with a 1800s timeout genuinely can cause shutdown hangs. Capping it was the right thing to do regardless. It just wasn't the root cause here.
The symptom was intermittent enough to seem random. Sometimes rock4 rebooted. When the NFS server (itx) was down or the jackett container had been recently restarted, Docker might have already released the reference by the time shutdown reached the umount step. This made it feel like a timing issue rather than a deterministic one.
The diagnostic breakthrough — checking whether the machine was still pingable after it “hung” — was the key. A dead machine and a machine stuck mid-shutdown look identical from across the room. They look very different from a network scanner.
The problem is probably older than NFS
After fixing the hang, I realised something. rock4 ran GlusterFS for years before the NFS migration — a distributed filesystem where each node contributes “brick” drives to a replicated pool. The containers on rock4 mounted GlusterFS paths like /mnt/storage/jackett, and those mounts have the same property as NFS: they're network-backed filesystems that can't unmount cleanly while something holds a kernel reference to them.
GlusterFS uses FUSE (Filesystem in Userspace) to expose its mounts locally. FUSE unmounts are actually harder to complete cleanly than NFS: to release a GlusterFS FUSE mount, the userspace GlusterFS client process that backs the mount has to shut down and tear down its brick connections across the network. If Docker is still holding a reference to the mountpoint, that teardown can't complete, and umount returns EBUSY. That's the same outcome as NFS, but with more moving parts and more ways to stall.
So the sequence was almost certainly: Docker container with GlusterFS volume → indefinite hang → GlusterFS decommissioned → NFS mounted → same container config carried across with updated paths → Docker container with NFS volume → still hangs.
Different filesystem, identical mechanism, years of continuity. The jackett config probably got its downloads volume added once, years ago, and nobody thought to question it during the storage migration.
The GlusterFS angle matters beyond this one machine. Between roughly 2018 and 2022, GlusterFS was enormously popular in self-hosted circles — TrueNAS Scale shipped it as the default clustered storage backend, and countless homelab builds adopted it for redundant storage across a few nodes. Many of those setups ran Docker containers with GlusterFS-backed volumes. Many of those setups probably had machines that wouldn't reboot cleanly. It's a reasonable bet that a lot of those people never connected the reboot hang to the storage layer.
Red Hat deprecated GlusterFS in RHEL 9 (announced 2022). The official framing was “focus on other storage solutions,” but the operational complexity was a significant part of the story: GlusterFS was difficult to run at small scale, prone to split-brain, and had long-running issues with graceful shutdown and FUSE lifecycle management. The Docker reboot hang described here is a concrete example of that class of problem: the kind of subtle, hard-to-diagnose operational failure that accumulates over time and eventually makes a piece of software too difficult to maintain and recommend.
If you ran GlusterFS and your server never quite rebooted cleanly: this was probably why.
Setup
- rock4: Radxa RockPi4, DietPi (Armbian kernel 6.18), 4× 3.6TB SATA SSDs via Penta HAT
- itx: Rock 5 ITX, NFS server, mergerfs pool at /mnt/media
- Container management: uncloud
- jackett: lscr.io/linuxserver/jackett