I tried self-hosting a PDS. It didn't work and the bug was IPv6
I wanted to self-host a tangled server. I refuse to call it a “knot”, because that's just fucking stupid.
FYI, it's supposed to be a small, federated, GitHub-like git server built on the AT Protocol. The catch: tangled identities are AT Protocol identities, and the obvious way to get one is a Bluesky account.
I'm in Australia, where Bluesky now gates sign-up behind age verification (facial recognition, a credit card, or an ID scan).
I object to this on principle, so I said “fuck that”, and besides I do NOT need another social media account, so a bsky.social account was off the table.
The good news is that the AT Protocol doesn't require Bluesky at all. Your identity is just a DID anchored to a PDS (Personal Data Server) — and you can run your own. So I did: the official Bluesky reference PDS, in a container on eon (a Raspberry-Pi-class DietPi box), behind my cluster's Caddy, on my own domain. I created an account, verified the handle, pointed the knot at it. Everything came up green.
Then I tried to log in to tangled with my shiny self-hosted identity, and got:
Failed to start auth flow: auth request failed: PAR request failed (HTTP 400): invalid_client_metadata
This is the story of how that error sent me down a completely wrong path for hours — and how a single differential test dragged me back to a mundane, and much more interesting, root cause.
The obvious culprit (that wasn't)
Thirty-second primer, because OAuth is a swamp of acronyms and I won't pretend otherwise. “Log in with Google” is OAuth: one service vouches for you to another. Here, tangled is the app asking for access (in the jargon, the client), and my PDS — my little self-hosted identity server — is the bouncer deciding whether to let it in (the authorization server).
The handshake starts before my browser is even involved: tangled shoves a request straight at my PDS, server to server. That shove has a name — a Pushed Authorization Request, or PAR. And invalid_client_metadata is the bouncer sniffing the request and going: I don't like the look of this club's paperwork.
So I read the paperwork:
- tangled's client metadata asks for a long list of scopes, scopes being the specific permissions an app wants: the “this app would like to access your repos and your handle” checklist.
- tangled asks for fine-grained ones: repo:sh.tangled.repo, rpc:sh.tangled.repo.create, blob:*/*, identity:handle
Then I checked what my PDS said it understood:
scopes_supported: ['atproto', 'transition:email', 'transition:generic', 'transition:chat.bsky']
The old, coarse set. None of the granular ones. Case closed, right?
My PDS (@atproto/pds version 0.5.9) was too old to grasp the permissions tangled was asking for, so it spat the request out. Bluesky's own docs even say granular permissions are “rolling out to the self-hosted PDS distribution” — translation: not in a shipping self-hosted PDS yet.
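A minimal sketch of that by-eye comparison, for anyone who wants to repeat it. It assumes the client's scopes arrive as a space-separated string (the usual OAuth convention) and that the PDS publishes its metadata at the standard /.well-known/oauth-authorization-server path. The function names are made up for illustration, and the plain string match is deliberately naive: it mirrors the reasoning above, not whatever the PDS actually does internally.

```javascript
// Naive check: which requested scopes are absent from scopes_supported?
// This is a plain string comparison, NOT the server's real matching logic.
function missingScopes(requested, supported) {
  const known = new Set(supported);
  return requested.split(/\s+/).filter((scope) => scope && !known.has(scope));
}

// Fetch the scopes a PDS advertises in its authorization-server metadata.
// pdsOrigin is a placeholder such as "https://pds.example.com".
async function fetchScopesSupported(pdsOrigin) {
  const url = new URL("/.well-known/oauth-authorization-server", pdsOrigin);
  const res = await fetch(url);
  if (!res.ok) throw new Error(`metadata fetch failed: HTTP ${res.status}`);
  const meta = await res.json();
  return meta.scopes_supported ?? [];
}
```

Feeding it tangled's granular scopes and the legacy list above reports every granular scope as missing, which is exactly the evidence that looked so convincing.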
I went a long way down this road.
- Checked npm for a newer PDS — latest stable was 0.5.10, one patch ahead, useless.
- Checked the pre-release tags — somehow older than stable, also useless.
- Then I found Tranquil, a PDS rewritten in Rust that explicitly supports the granular scopes, read its install docs, confirmed it ran on my hardware, and got within a hair of migrating my brand-new identity onto a multi-container, Postgres-backed, community-alpha server.
That is a colossal amount of effort and risk to dodge a problem I had not actually confirmed was the problem. Mercifully, it was not the problem.
The question that broke the theory
The thing that saved me was one stupid question: if my PDS can't do this, how is tangled doing it for everyone else right now?
People log in to tangled every day with handles on .bsky.social or .tngl.sh. That second one — .tngl.sh — is tangled's own PDS hosting. Whatever software it runs, it demonstrably works with the exact login I was failing. So instead of theorising, I could just go poke it.
I had a .tngl.sh handle lying around from earlier, so I worked out which PDS hosts it and interrogated it directly. Two things fell out immediately, and both upended the original theory:
- tngl.sh runs the Bluesky reference PDS, the same @atproto/pds software I installed, version 0.4.208. Not Tranquil. So migrating to Tranquil would have been “solving” a problem the working system doesn't have.
- Its scopes_supported was the exact same legacy set as mine. And tangled login works there fine.
If the box that works advertises the same scopes as the box that doesn't, then the scopes were never it. The theory was built on a coincidence I'd mistaken for a clue.
So I fired the identical PAR request at both PDSes and put the answers side by side:
# my PDS:
{"error":"invalid_client_metadata",
"error_description":"Unable to obtain client metadata for \"https://tangled.org/oauth/client-metadata.json\""}
# tngl.sh:
{"error":"invalid_request",
"error_description":"client authentication method \"private_key_jwt\" required a \"client_assertion\""}
There it is, in the error text. tngl.sh fetched and read tangled's paperwork and moved on to the next step — it was only complaining that my bare test didn't include a signature it expected. Mine never got that far. “Unable to obtain client metadata.” My PDS couldn't even download the document.
This was never about permissions. It was a failed HTTP request. My server tried to fetch one URL and couldn't.
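The differential test can be sketched roughly like this. Treat it as an illustration under assumptions: the post doesn't show the exact request, so the parameter set in sendBarePar is a guess at a "bare" PAR, the PAR endpoint is discovered from the server's metadata rather than hard-coded, and both function names are hypothetical. The classifier only recognises the two error bodies quoted above.

```javascript
// Decide how far the authorization server got, based on its error body.
function classifyParError(body) {
  const description = body.error_description ?? "";
  if (
    body.error === "invalid_client_metadata" &&
    description.startsWith("Unable to obtain client metadata")
  ) {
    return "fetch-failed"; // server never managed to download the document
  }
  if (description.includes("client_assertion")) {
    return "metadata-read"; // server read the document and wants a signature
  }
  return "unknown";
}

// Send a deliberately incomplete PAR to a PDS and return its answer.
// ASSUMPTION: the parameters below are illustrative, not the post's request.
async function sendBarePar(pdsOrigin, clientId) {
  const metaUrl = new URL("/.well-known/oauth-authorization-server", pdsOrigin);
  const meta = await (await fetch(metaUrl)).json();
  const res = await fetch(meta.pushed_authorization_request_endpoint, {
    method: "POST",
    headers: { "content-type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({ client_id: clientId, response_type: "code" }),
  });
  return { status: res.status, body: await res.json() };
}
```

Run against a broken and a known-good server, the useful signal is simply whether the two classifications differ.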
Why couldn't it fetch one URL?
From inside the PDS container, using the same mechanism the PDS itself uses (Node's built-in fetch):
https://tangled.org/oauth/client-metadata.json -> ETIMEDOUT
https://plc.directory -> 200
https://example.com -> 200
The internet at large was fine. Only tangled.org hung until it gave up (ETIMEDOUT is the network politely telling you nothing ever answered). One host timing out while everything else sails through has a particular smell to it.
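A probe along these lines reproduces that table. It is a sketch, assuming Node 18 or later (for the built-in fetch and AbortSignal.timeout); the helper names and the ten-second default are arbitrary choices, not anything the PDS uses.

```javascript
// Node's fetch reports network failures as TypeError("fetch failed") with the
// underlying error in .cause, so dig the errno-style code out of there.
function describeFailure(err) {
  const code = err?.cause?.code;
  return typeof code === "string" ? code : (err?.name ?? "unknown");
}

// Fetch one URL and summarise the result as "url -> status-or-error".
async function probe(url, timeoutMs = 10_000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return `${url} -> ${res.status}`;
  } catch (err) {
    return `${url} -> ${describeFailure(err)}`;
  }
}
```

Calling probe on each of the three URLs from inside the container, and printing the results, gives the same kind of side-by-side view.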
I ruled out the boring stuff first. MTU — the largest chunk a network link will carry in one go — was a normal 1500 on both the container and the host, so I wasn't looking at the classic “big packets silently vanish” trap. To be sure, I forced a full encrypted handshake to tangled.org over IPv4 with the right hostname attached, big packets and all. It completed. Not an MTU black hole.
The tell was in DNS. tangled.org publishes two addresses: an A record (its IPv4 address) and an AAAA record (its IPv6 one). And my container was listening on :::3000 — that's IPv6 notation, so IPv6 was very much in the game. So I told Node to stop being clever about which one it picked:
node --no-network-family-autoselection -e 'fetch("https://tangled.org/oauth/client-metadata.json").then(r=>console.log(r.status))'
# -> 200
Instant 200. The bug was IPv6.
Here's the clever-that-backfired bit. Modern Node (I'm on 24) uses a trick called Happy Eyeballs: when a host has both an IPv4 and an IPv6 address, it races connections to both and uses whichever replies first. The whole point is resilience — if one path is dead, the other wins and you never notice.
Except my IPv6 path wasn't dead. It was black-holed — packets marched out and nothing ever came back, no rejection, no error, just silence. So Node's connection sat there waiting for a reply that would never come, instead of failing fast and letting IPv4 win the race. plc.directory and example.com happened not to provoke it; tangled.org's particular setup did. A dead path is fine. A path that eats packets and says nothing is poison.
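One way to tell "dead" from "black-holed" is to test each address family on its own instead of letting Node choose. The sketch below uses net.connect with an explicit family; the function name, the five-second timeout, and the outcome strings are illustrative choices.

```javascript
const net = require("node:net");

// Open a TCP connection using only one address family (4 or 6) and report
// how it ended: connected, a fast error code, or silence until the timeout.
function tryFamily(host, port, family, timeoutMs = 5000) {
  return new Promise((resolve) => {
    const started = Date.now();
    const socket = net.connect({ host, port, family, autoSelectFamily: false });
    const finish = (outcome) => {
      clearTimeout(timer);
      socket.destroy();
      resolve({ family, outcome, ms: Date.now() - started });
    };
    // Our own timer: a black-holed path produces no event at all.
    const timer = setTimeout(() => finish("timeout"), timeoutMs);
    socket.once("connect", () => finish("connected"));
    socket.once("error", (err) => finish(err.code ?? err.message));
  });
}
```

A healthy dual-stack host should come back "connected" for both families. A fast error code on family 6 (for example ENETUNREACH) is the harmless dead-path case, while "timeout" on family 6 with "connected" on family 4 is the silent kind described above.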
The quick fix was one environment variable on the PDS container — tell Node to prefer IPv4 and stop racing:
NODE_OPTIONS: "--dns-result-order=ipv4first --no-network-family-autoselection"
Redeploy, replay the PAR, and now my PDS gives the identical answer tngl.sh did. Login unblocked. No Tranquil, no migration, no waiting for a release that was never the issue.
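NODE_OPTIONS is the right lever for software you don't own, like the PDS. For a Node program you do control, the same two preferences can be set in code. This is a sketch assuming a recent Node (20 or later); preferIPv4 is a made-up wrapper name, while the two calls inside it are Node's documented equivalents of the flags.

```javascript
const dns = require("node:dns");
const net = require("node:net");

// Programmatic counterpart of:
//   --dns-result-order=ipv4first --no-network-family-autoselection
function preferIPv4() {
  dns.setDefaultResultOrder("ipv4first"); // sort A records ahead of AAAA
  net.setDefaultAutoSelectFamily(false);  // stop trying both address families
}
```

It needs to run before the first outbound connection is made, and it is just as much a tourniquet as the environment variable.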
But slapping “never use IPv6” on a machine that's supposed to have IPv6 isn't a fix, it's a tourniquet. The real question was still standing there: why was IPv6 black-holing in the first place?
The actual root cause: Docker quietly knifed my IPv6
This is the part that was worth the whole detour.
First, whose fault — the network, the router, or the box? Easy to settle by comparison. My Mac, sitting on the same LAN, has a proper public IPv6 address handed down from the ISP and pings the IPv6 internet in about 22ms. So the router and the ISP do IPv6 perfectly well. But eon had no public IPv6 address and no IPv6 route to the outside world at all — just a link-local address (the IPv6 equivalent of “I can talk to the room I'm in”) and Tailscale's private range. It had never picked up the router's IPv6 at all.
I'd assumed IPv6 just worked everywhere because it works on my Mac. It works on my Mac because macOS does SLAAC out of the box.
SLAAC — Stateless Address Autoconfiguration — is the no-paperwork way devices get IPv6: the router periodically shouts its network details onto the LAN (these shouts are called Router Advertisements), and any device that hears one configures its own address from it. No DHCP server, no central list, just “here's the network, help yourself.”
My Mac does that natively. So do my Ubuntu boxes — their network managers handle the router's shouts themselves, in normal software. But my DietPi nodes use a barebones networking setup where listening for those shouts is the Linux kernel's job. And the kernel was sitting there with its thumb up its arse.
The relevant kernel switches (sysctl is just the knob-panel for kernel settings) were identical on every DietPi node:
net.ipv6.conf.eth0.accept_ra = 1
net.ipv6.conf.eth0.forwarding = 1
And that pair together is the trap. forwarding = 1 means the host is willing to act as a router, shuffling packets between interfaces on behalf of others. accept_ra is whether it listens to the router's shouts. The catch the kernel enforces: a box that's acting as a router will not listen to other routers' shouts — because routers are supposed to be authoritative, not take their config secondhand. So with accept_ra = 1 and forwarding = 1, the kernel silently bins every Router Advertisement. No shouts heard, no SLAAC, no address, no route. Silently. There's the magic word again.
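The rule is small enough to write down. The sketch below models the kernel's documented behaviour for accept_ra (0 never, 1 only when not forwarding, 2 always) and reads the live values from /proc on a Linux host; the function names are invented for illustration.

```javascript
const fs = require("node:fs");

// Model of the kernel rule: will this interface process Router Advertisements?
function acceptsRouterAdvertisements(acceptRa, forwarding) {
  if (acceptRa === 2) return true;             // accept even while forwarding
  if (acceptRa === 1) return forwarding === 0; // accept only if not a router
  return false;                                // 0: never accept
}

// Read one per-interface IPv6 setting, e.g. readIPv6Setting("eth0", "accept_ra").
// Linux only: these files live under /proc/sys/net/ipv6/conf/<iface>/.
function readIPv6Setting(iface, name) {
  const path = `/proc/sys/net/ipv6/conf/${iface}/${name}`;
  return Number(fs.readFileSync(path, "utf8").trim());
}
```

With the values from the DietPi nodes, acceptsRouterAdvertisements(1, 1) is false, which is the whole bug in one line.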
So who switched on forwarding? Docker. The moment Docker starts, it flips IPv6 forwarding on so it can route packets to your containers. Every Docker-running DietPi node I had was therefore sitting at forwarding=1 + accept_ra=1 — meaning every one of them quietly lost outbound IPv6 the instant Docker came up. And I never noticed, because nothing I ran needed outbound IPv6 until a Node app went looking for an IPv6-published host and hung itself on the silence.
My Ubuntu nodes dodged it purely by luck: their network managers do the listening in software, sidestepping the kernel switch entirely, so the forwarding clash never touched them.
The actual fix is one knob, set to stick across reboots:
net.ipv6.conf.<iface>.accept_ra = 2
accept_ra = 2 means “listen to the router's shouts even while forwarding.” Flip that, and the kernel starts hearing Router Advertisements again, SLAAC kicks in, and the node grabs a real public IPv6 address and a route to the world like every other device on the network. I'm rolling it out across the whole fleet with Ansible, so every machine is a proper IPv6 citizen — not just the ones whose networking happened to paper over the gap by accident.
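A quick way to confirm a node really did pick up a public address is to list its global-unicast IPv6 addresses. The sketch below assumes "public" means the 2000::/3 range, which excludes link-local (fe80::) and private ranges such as Tailscale's; the function name is hypothetical.

```javascript
const os = require("node:os");

// List "<interface> <address>" for every global-unicast IPv6 address.
function globalIPv6Addresses(interfaces = os.networkInterfaces()) {
  const found = [];
  for (const [name, addrs] of Object.entries(interfaces)) {
    for (const a of addrs ?? []) {
      const isV6 = a.family === "IPv6" || a.family === 6;
      // Global unicast is 2000::/3, so the first hextet starts with 2 or 3.
      if (isV6 && !a.internal && /^[23][0-9a-f]{3}:/i.test(a.address)) {
        found.push(`${name} ${a.address}`);
      }
    }
  }
  return found;
}
```

An empty list on a box whose LAN neighbours have public IPv6 is the symptom described above; a non-empty one after the accept_ra change suggests SLAAC is working again.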
What I'd tell past-me
- Go find a working version and diff against it. I torched hours on a tidy, well-evidenced, completely wrong theory. One question — “how does the thing that works do it?” — and one identical request fired at a known-good server flattened the whole story in minutes. When something should work and doesn't, stop theorising and go interrogate a copy that does.
- Same symptom does not mean same cause. Both PDSes advertised the same scopes. That coincidence is exactly what sold me the wrong story for an afternoon. The error messages, read slowly and side by side, were trying to tell me the truth the whole time.
- “It works on my machine” can be three different accidents wearing a trenchcoat. IPv6 worked on my Mac and my Ubuntu boxes for three unrelated reasons, none of which applied to my DietPi nodes. Same behaviour across a fleet is something you check, not something you assume.
- Docker turns on IP forwarding and that can silently murder kernel SLAAC. If you run containers on a Debian-ish host with barebones networking and your IPv6 has “mysteriously” vanished, go look at accept_ra versus forwarding. You almost certainly want accept_ra=2.
- A black-holed network path is worse than a dead one. A cleanly-dead IPv6 stack fails fast and the IPv4 fallback wins. A path that swallows packets and says nothing makes well-behaved software hang forever. If connections to some hosts time out while everything else is fine, suspect a half-configured address family long before you blame MTU, DNS, or the far end.
So, to continue my yak-shaving, here's where I've got to:
- I host my own identity server
- so now I have an identity
- I didn't need to make a stupid social media account to get it
- now I can attempt to run a federated git server with a stupid name
And all this is so I can host my own code repos instead of dealing with Microsoft's continued enshittification of GitHub, and the ongoing collapse of the USA up its own arsehole of late-stage capitalism as it goose-steps into oligarchic cronyism.
Hoo-fucking-ray!