Suranyami

Polyglot developer, geometric tessellation fan, ambient DJ.

This is the sequel to how to self-host a tangled git server without Bluesky. That post gets you a live git server, owned by your own AT Protocol identity, on your own domain. This one is the next five minutes — actually pushing a repository to it and watching the wire close.

Same honesty disclaimer as before: I'm not a protocol person. What follows is a recipe I worked out by doing it, getting it wrong, and reading the logs with an LLM that knew where to look. It works. It's not authoritative.

Here's the whole thing, up front:

git remote add knot git@knot.suranyami.com:did:plc:akmmkxg66qtexw6pl6erhwfe
git push knot main

Those are real commands against my real server. The did:plc:... is the repo's permanent ID, not yours. We'll get to why it looks like that. Read on for the three steps and the two ways I embarrassed myself getting there.


The thing nobody explains: there are two DIDs

Before the steps, one concept that had me stuck for an embarrassing amount of time. A tangled server deals in two different AT Protocol identities, and the clone URL uses the one you're not expecting.

  • Your identity — the DID that owns the repo. Mine is did:plc:tg42msv45ief3qphccenrogh, handle forge.suranyami.com. This is you. You logged in as this.
  • The repo's identity — when you create a repo, the server mints a separate DID for the repo itself. That's the did:plc:akmmkxg66qtexw6pl6erhwfe in the clone URL. It's not you. It's the repo's own permanent ID, anchored to the server.

So the clone URL is not git@your-server:you/repo, the way it would be on GitHub or Forgejo. It's git@your-server:<repoDid> — the repo's own DID, bare, no username, no repo name, no .git. That's the permalink form tangled gives you, and it's the one that actually works.

If that feels weird, it is. It's also the whole point of federated identity — the repo is a first-class object with its own ID that survives any one server, not a path under someone's account. Whether that's worth the cognitive tax is a separate question. Today we're just making it work.
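If you'd rather script the URL than paste it, here is a minimal sketch of that shape as a shell helper. The function name knot_remote_url is made up for illustration, and the check on the DID is only a loose sanity check on its shape, not real DID validation.

```shell
# Hypothetical helper: build the permalink-style remote URL for a tangled server.
# Usage: knot_remote_url <server-hostname> <repo-did>
knot_remote_url() {
    host=$1
    repo_did=$2
    case $repo_did in
        *.git)      echo "error: drop the .git suffix" >&2; return 1 ;;
        */*)        echo "error: expected a bare repo DID, not owner/name" >&2; return 1 ;;
        did:plc:?*) ;;  # loose shape check only
        *)          echo "error: expected something starting with did:plc:" >&2; return 1 ;;
    esac
    printf 'git@%s:%s\n' "$host" "$repo_did"
}

knot_remote_url knot.suranyami.com did:plc:akmmkxg66qtexw6pl6erhwfe
# prints: git@knot.suranyami.com:did:plc:akmmkxg66qtexw6pl6erhwfe
```

Something like git remote add knot "$(knot_remote_url knot.suranyami.com did:plc:akmmkxg66qtexw6pl6erhwfe)" would then add the remote, and the helper refuses the owner/name and .git forms that other git hosts have trained into your fingers.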


Step 1 — make the repo on tangled.org

Sign in at tangled.org as your self-hosted handle. Create a repo:

  • Owner: your identity (forge.suranyami.com).
  • Host: your server (knot.suranyami.com).
  • Name: whatever you want it called.

Tick “Use permalink” when it offers a clone URL. That gives you the git@your-server:did:plc:<repoDid> form — the one that works. Copy it. (There's no .git on the end. There is never a .git on the end. More on that in a minute.)

Which repo to push? I used a real one: maze, a Phoenix app that's already public on GitHub with no live secrets in its history. A tangled server is publicly browsable federated git. There is no private-repo toggle. It is a different threat model from GitHub, where you can shove secrets into a private repo and trust the platform's access control. Here you cannot. So pick something you'd happily publish to the world, because you are.

If your repo has ever tracked real credentials, that's a job for git filter-repo and a quiet rotation, not a knot push. I have one of those coming myself. Not today.


Step 2 — register your SSH key

Same settings page on tangled.org: paste your public key (~/.ssh/id_ed25519.pub or equivalent), give it a name.

One thing I had backwards: the key lands on your identity immediately, at registration — not lazily on your first push. The model I was carrying in my head (from half-reading the docs) was “first git sign mints the key.” That's a real concept, but it's the server's own signing keypair, a separate thing. Your SSH key, the one that authenticates you to the server over SSH, is written the moment you register it. So if you can ssh git@your-server and get accepted before you've ever pushed, that's why.

Test it:

ssh -T git@knot.suranyami.com
# Welcome to this knot!        ← see the note below before you panic

If you get Permission denied (publickey), the key isn't registered (or isn't the one your agent is offering). If you get the welcome line, you're in.
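For the "isn't the one your agent is offering" case, a rough sketch of a check: compare the key in your .pub file against what ssh-add -L lists. The helper name key_in_listing is hypothetical. It only compares the base64 key blob (the second field), because the trailing comment can differ between the file and the agent.

```shell
# Hypothetical helper: does the agent hold the key in this .pub file?
# Reads an `ssh-add -L` style listing on stdin.
# Returns 0 if found, 1 if not, 2 if the .pub file has no key blob.
key_in_listing() {
    pubfile=$1
    # Field 2 of a public key line is the base64 key blob.
    blob=$(awk 'NR == 1 { print $2 }' "$pubfile")
    [ -n "$blob" ] || return 2
    awk -v want="$blob" '$2 == want { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage (needs a running agent, so it is left commented out here):
#   ssh-add -L | key_in_listing ~/.ssh/id_ed25519.pub && echo "agent has it"
```

If this says the agent has the key and the server still answers Permission denied, the likelier explanation is that the key pasted into tangled.org is a different one.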


Step 3 — add the remote and push

cd path/to/your-repo
git remote add knot git@knot.suranyami.com:did:plc:<repoDid>
git push knot main

knot is just a remote name — call it tangled, origin2, whatever. I used knot because the env vars already call it that and I'd lost the energy to fight the branding in two places.

That's the happy path. It worked for me, second try. The first try — and a chunk of lost afternoon — is the actual story.


Two ways I broke this, and the lesson in each

The clone URL has no .git, and that is not optional

I did what you do with every git remote I've ever configured: I appended .git. Muscle memory. The server returned 404 repo not found.

The reason is in the server's lookup table. When you ask for did:plc:.../maze.git, the server's guard looks up the name with the .git suffix — but it stores repo aliases under the bare name (maze). So maze.git never matches, and you get a 404 for a repo that definitely exists. Drop the .git and it resolves.

This is the kind of bug that's invisible until you read the guard's own log, which lives at /home/git/guard.log inside the server container:

docker exec <tangled-container> cat /home/git/guard.log
# status=200 OK  fullPath=/home/git/repositories/did:plc:akmmkxg66qtexw6pl6erhwfe
# command completed success=true

That log is the authoritative source of truth for “did the server accept my request.” git's own terminal output is a liar by omission here, because of the second bug:

Welcome to this knot! is not an error

The server prints a welcome banner — a MOTD — on stderr before it exec's the actual git command. For a brand-new empty repo, git-upload-pack emits zero refs, so the only thing you see is the MOTD. To the naked eye that looks identical to “the push failed and printed a message.” It didn't. exit=0 is the signal. The banner is noise.

I spent a genuinely silly amount of time convinced my first push had failed because of that banner, and then a sillier amount of time convinced the clone URL David had pasted me was truncated (no repo name, no .git — surely that's wrong). It wasn't truncated. It was the correct permalink form. I just couldn't see past the MOTD.

The lesson, for the third time in as many weeks: when a command's human-readable output and its exit code disagree, the exit code is the one telling the truth. Read the log the server itself keeps, not the string git happened to surface.
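A small sketch of that habit as a wrapper, assuming nothing about tangled itself: it runs any command, sets the stderr chatter aside, and reports the exit code first. The names run_and_judge and fake_push are invented for the example, and fake_push only imitates a push that succeeds while printing a banner.

```shell
# Hypothetical wrapper: judge a command by its exit code, not by what it prints.
run_and_judge() {
    err_file=$(mktemp)
    rc=0
    "$@" 2>"$err_file" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok (exit 0); stderr was:"
    else
        echo "FAILED (exit $rc); stderr was:"
    fi
    sed 's/^/    /' "$err_file"   # indent the captured stderr so it reads as a quote
    rm -f "$err_file"
    return "$rc"
}

# Stand-in for a push that succeeds but prints a banner on stderr.
fake_push() { echo "Welcome to this knot!" >&2; return 0; }

run_and_judge fake_push
# prints:
# ok (exit 0); stderr was:
#     Welcome to this knot!
```

In real use that would be run_and_judge git push knot main. The verdict line comes from the exit code alone, so a banner can no longer be mistaken for a failure.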


Proving it landed

Once the push returns * [new branch] main -> main, verify from the outside:

# clone it back, from anywhere
git clone git@knot.suranyami.com:did:plc:akmmkxg66qtexw6pl6erhwfe /tmp/knot-clone
cd /tmp/knot-clone && git log --oneline    # your full history, from the server

Or ask the server's own API what branches it knows about:

curl -s "https://knot.suranyami.com/xrpc/sh.tangled.repo.branches?repo=did:plc:akmmkxg66qtexw6pl6erhwfe" | jq .
# [{ "name": "main", "hash": "8bf07b2...", "is_default": true }]

And because this is federated identity, the push also wrote records back to your PDS — the git event is signed into your identity's record store, which is the whole mechanism by which other servers discover and mirror your repo. Check the collections on your owner DID:

curl -s "https://pds.suranyami.com/xrpc/com.atproto.repo.describeRepo?repo=did:plc:tg42msv45ief3qphccenrogh" \
  | jq '.collections'
# ["io.atcr.sailor.profile","sh.tangled.actor.profile","sh.tangled.knot",
#  "sh.tangled.publicKey","sh.tangled.repo"]

sh.tangled.repo turning up is the push having made it all the way through to the identity layer. That's the wire closing.
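If you want that check to be a yes/no rather than something you eyeball, here is a crude sketch. The helper name has_collection is hypothetical, and it does a quoted-string match on the response text rather than a proper JSON parse, so treat it as a smoke test only.

```shell
# Hypothetical helper: is a collection name present in describeRepo output?
# Reads the JSON on stdin. Crude quoted-string match, not a JSON parse.
has_collection() {
    grep -qF "\"$1\""
}

# Sample response shaped like the one shown above.
sample='{"collections":["io.atcr.sailor.profile","sh.tangled.actor.profile","sh.tangled.knot","sh.tangled.publicKey","sh.tangled.repo"]}'

if printf '%s\n' "$sample" | has_collection sh.tangled.repo; then
    echo "repo record present"
else
    echo "repo record missing"
fi
# prints: repo record present
```

Against the live PDS you would pipe the curl output from above into has_collection sh.tangled.repo instead of the sample string.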


What you end up with

  • A repo on a git host you own, on a federated network that no single company controls.
  • A clone URL that's a permanent ID, not a path under someone's account — ugly, but it survives a server move in a way you/repo never could.
  • The same one-file-backup discipline as the self-hosting post: your .env (and the rotation key inside it) is your identity. Lose it and the repo's ownership goes with it.

None of this needed me to be a protocol expert, and I'm still not one. It needed the working command, the two gotchas that stop it from working, and the willingness to read the server's own log instead of trusting git's summary. If you've already got the server up, the push is five minutes. The debugging around it took longer than both posts combined.


A confession before the recipe, because I'd rather be honest than look clever: I am not an AT Protocol expert, and I'm no great sysadmin either. What follows is a recipe I stitched together from trial, error, docs read sideways, and a lot of pairing with an LLM that knew the bits I didn't. It works — I'm running it right now — but if any of it reads as authoritative, that's hard-won, not pre-loaded.

Right. What are we actually building?

  • tangled — a federated git host. Think GitHub, except no single company owns the whole thing; anyone can run a piece and the pieces talk to each other.
  • The piece you run is, in tangled's words, a “knot”. I think that name is daft, so I'll call it a tangled server. It's the bit that actually holds your repositories.
  • To own one you need an AT Protocol identity. AT Protocol is the open plumbing under Bluesky — and, crucially, not Bluesky. Your identity there is a DID (a permanent ID that's genuinely yours, not rented from anyone) anchored to a PDS (Personal Data Server — the box that holds your identity and data).

The lazy way to get that identity is to sign up for Bluesky. I'm in Australia, where that now means handing over a face scan, a credit card, or a photo of my ID for “age verification”. Hard no. So I run my own PDS instead, and the identity never touches Bluesky.

See this post for how painful that was to set up.

At the end of this you'll have a working git host on your own domain, registered on the tangled network, owned by an identity you fully control.

Two moving parts:

  1. A PDS — the official Bluesky reference PDS. It's open source; “Bluesky the company” and “Bluesky's software” are different things, and this is the software. It mints and hosts your DID.
  2. A tangled server — configured to be owned by the DID from part 1.

I run both on eon, a Raspberry-Pi-class box, behind my cluster's reverse proxy (Caddy) via uncloud. The uc deploy / x-ports lines below are uncloud's way of saying “publish this service”; if you're on plain Docker Compose, swap them for however you do reverse-proxy ingress. Everything else carries over.


What you'll need

  • A host that can run a container, reachable on ports 80/443 (I'm assuming a reverse proxy out front terminating TLS).
  • A domain you control DNS for. I point a wildcard *.suranyami.com at my cluster edge, so pds.suranyami.com and knot.suranyami.com both resolve and get auto-issued certs with no per-host records.
  • Somewhere durable to keep data on local disk. Not NFS — these services use SQLite, and SQLite over a network share is a recipe for corruption. (I learned that one the hard way on an unrelated service. Different post.)
  • openssl, curl, and jq on whatever machine you run the setup commands from.

Step 1 — stand up the PDS

Make the three secrets

The PDS needs three secret values. These are the commands the upstream installer uses; I didn't invent them, I lifted them:

# JWT signing secret
openssl rand --hex 16

# admin password (HTTP basic-auth for the admin API)
openssl rand --hex 16

# PLC rotation key — a secp256k1 private key, in hex. THIS IS YOUR IDENTITY.
openssl ecparam --name secp256k1 --genkey --noout --outform DER \
  | tail --bytes=+8 | head --bytes=32 | xxd --plain --cols 32

⚠️ Back up that rotation key like your identity depends on it, because it does. It's the cryptographic key that controls your DID. Lose it and the identity is gone for good — and your git server's ownership goes with it. Drop all three into a gitignored .env, and put the rotation key in your password manager on top of that:

# .env (gitignored)
PDS_JWT_SECRET=...
PDS_ADMIN_PASSWORD=...
PDS_PLC_ROTATION_KEY_K256_PRIVATE_KEY_HEX=...
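Before deploying, it can be worth a quick sanity check that the file is complete. This is a sketch, not part of the upstream installer: check_pds_env is a made-up name, and it assumes plain NAME=value lines with no quotes around the values. It never prints the secrets themselves.

```shell
# Hypothetical pre-flight check for the .env described above.
# Returns 0 if all three values are present and the rotation key looks sane.
check_pds_env() {
    env_file=$1
    rc=0
    for name in PDS_JWT_SECRET PDS_ADMIN_PASSWORD PDS_PLC_ROTATION_KEY_K256_PRIVATE_KEY_HEX; do
        # Take whatever follows "NAME=" on the first matching line.
        value=$(sed -n "s/^${name}=//p" "$env_file" | head -n 1)
        if [ -z "$value" ]; then
            echo "missing: $name"
            rc=1
        fi
    done
    # A secp256k1 private key is 32 bytes, so expect exactly 64 hex characters.
    key=$(sed -n 's/^PDS_PLC_ROTATION_KEY_K256_PRIVATE_KEY_HEX=//p' "$env_file" | head -n 1)
    case $key in
        *[!0-9a-fA-F]*) echo "rotation key has non-hex characters"; rc=1 ;;
    esac
    if [ -n "$key" ] && [ "${#key}" -ne 64 ]; then
        echo "rotation key is ${#key} characters, expected 64"
        rc=1
    fi
    return "$rc"
}

# Usage: check_pds_env ./.env && echo "looks complete"
```

The length check is the useful part: a rotation key that lost a character in a copy-paste is exactly the kind of damage you want to catch before an identity is minted against it.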

The service definition

This is the upstream compose file with everything I didn't want stripped out (the bundled reverse proxy, host networking, and the auto-updater), because my cluster already terminates TLS and proxies through to port 3000:

# services/pds.yml
services:
  pds:
    image: ghcr.io/bluesky-social/pds:0.4
    environment:
      PDS_HOSTNAME: pds.suranyami.com
      PDS_PORT: "3000"
      PDS_JWT_SECRET: ${PDS_JWT_SECRET}
      PDS_ADMIN_PASSWORD: ${PDS_ADMIN_PASSWORD}
      PDS_PLC_ROTATION_KEY_K256_PRIVATE_KEY_HEX: ${PDS_PLC_ROTATION_KEY_K256_PRIVATE_KEY_HEX}
      PDS_DATA_DIRECTORY: /pds
      PDS_BLOBSTORE_DISK_LOCATION: /pds/blocks
      PDS_DID_PLC_URL: https://plc.directory
      PDS_BSKY_APP_VIEW_URL: https://api.bsky.app
      PDS_BSKY_APP_VIEW_DID: did:web:api.bsky.app
      PDS_CRAWLERS: https://bsky.network
      PDS_SERVICE_HANDLE_DOMAINS: .suranyami.com   # allows handles like forge.suranyami.com
      PDS_INVITE_REQUIRED: "true"                  # no open signups; admin creates accounts
    volumes:
      - /bricks/eon-1/pds:/pds                      # SQLite + blobstore (local disk)
    x-ports:
      - pds.suranyami.com:3000/https               # web/API via cluster Caddy (TLS auto)
    x-machines:
      - eon
    restart: always

Make the data dir on the host first (mkdir -p /bricks/eon-1/pds), then deploy. The ${...} placeholders in the service file are filled from the environment at deploy time, so export .env into your shell first:

set -a; . ./.env; set +a
uc deploy -y -f services/pds.yml

Check it's alive:

curl -s https://pds.suranyami.com/xrpc/_health
# {"version":"0.4.x"}

Step 2 — create your identity

Heads up: this particular PDS image ships without the pdsadmin helper that the docs assume you have. Took me a minute to work out you can just talk to the admin API directly instead. Signups are invite-only (that's the PDS_INVITE_REQUIRED line above), so mint yourself an invite first — the -u admin:... is the admin password doing HTTP basic auth:

set -a; . ./.env; set +a

curl -s -X POST https://pds.suranyami.com/xrpc/com.atproto.server.createInviteCode \
  -u "admin:${PDS_ADMIN_PASSWORD}" \
  -H "Content-Type: application/json" \
  -d '{"useCount": 1}'
# → {"code":"pds.suranyami.com-xxxxx-xxxxx"}

Then create the account with that code:

curl -s -X POST https://pds.suranyami.com/xrpc/com.atproto.server.createAccount \
  -H "Content-Type: application/json" \
  -d '{
    "email": "admin@suranyami.com",
    "handle": "forge.suranyami.com",
    "password": "<a-strong-account-password>",
    "inviteCode": "<code-from-above>"
  }'
# → { "did": "did:plc:...", "handle": "forge.suranyami.com", ... }

Run that exactly once. Run it twice and the PDS throws Handle already taken back at you — which isn't a new failure, it's the first run having succeeded. To redo an account you delete it through the admin API first; you don't re-create over the top of it.

Save the returned DID and the account password into .env (PDS_ACCOUNT_DID, PDS_ACCOUNT_HANDLE, PDS_ACCOUNT_PASSWORD). The DID is the thing your git server gets owned by, so keep it handy.

A handle gotcha that caught me out: the PDS quietly reserves a pile of role-ish handles. admin, git, code, repo, dev and source are all taken before you even start, and createAccount just rejects you. forge, ops, vcs, scm, tangled, knot and hub were free — I went with forge.suranyami.com because it names the job (a git host), not me. If your handle gets bounced, this is why.
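A tiny sketch for checking a candidate handle before you burn an invite code on it. The list here is only the six labels named above, not the PDS's actual reserved list (which is longer), and the function name is made up.

```shell
# Hypothetical pre-check. The list is ONLY the labels found reserved above,
# not the PDS's full reserved list.
handle_label_looks_reserved() {
    label=${1%%.*}     # first DNS label of the handle, e.g. "forge"
    case $label in
        admin|git|code|repo|dev|source) return 0 ;;
        *) return 1 ;;
    esac
}

for h in git.suranyami.com forge.suranyami.com; do
    if handle_label_looks_reserved "$h"; then
        echo "$h: likely rejected"
    else
        echo "$h: not on the known list"
    fi
done
# prints:
# git.suranyami.com: likely rejected
# forge.suranyami.com: not on the known list
```

A "not on the known list" answer is not a promise the PDS will accept the handle, only that it isn't one of the six that bounced here.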


Step 3 — prove the handle is yours

The wildcard DNS gets the web address resolving, but the handle itself needs a separate proof so tangled (and the login flow) will trust that forge.suranyami.com really is you. That proof is a DNS TXT record — a little text note attached to a DNS name:

host:  _atproto.forge.suranyami.com
value: did=did:plc:tg42msv45ief3qphccenrogh

The explicit _atproto record wins over the wildcard for that one name. Check it resolved, and that the PDS agrees the handle maps to your DID:

dig +short TXT _atproto.forge.suranyami.com
# "did=did:plc:tg42msv45ief3qphccenrogh"

curl -s "https://pds.suranyami.com/xrpc/com.atproto.repo.describeRepo?repo=forge.suranyami.com" \
  | jq '{handle, did: .didDoc.id, handleIsCorrect}'
# handleIsCorrect: true

handleIsCorrect: true is the bit you're after.
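The DNS half of that check can be scripted too. A minimal sketch, with invented helper names: the first two functions just spell out the record name and value from the example above, and the third compares dig's output (which wraps TXT values in double quotes) against the DID you expect.

```shell
# Hypothetical helpers for the _atproto TXT record.
atproto_txt_name()  { printf '_atproto.%s\n' "$1"; }   # arg: handle
atproto_txt_value() { printf 'did=%s\n' "$1"; }        # arg: DID

# Reads `dig +short TXT ...` output on stdin; succeeds if one line is
# exactly did=<expected DID> once dig's double quotes are stripped.
txt_matches_did() {
    want=$(atproto_txt_value "$1")
    tr -d '"' | grep -qxF "$want"
}

atproto_txt_name forge.suranyami.com
# prints: _atproto.forge.suranyami.com
atproto_txt_value did:plc:tg42msv45ief3qphccenrogh
# prints: did=did:plc:tg42msv45ief3qphccenrogh
```

Wired together it looks like dig +short TXT "$(atproto_txt_name forge.suranyami.com)" | txt_matches_did did:plc:tg42msv45ief3qphccenrogh, which exits 0 only when the record is live and points at the right DID.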


Step 4 — stand up the tangled server

tangled's server image lives in tangled's own registry, atcr.io, which is private and checks AT Protocol identities at the door. So you log in to it with your self-hosted handle and an app-password — a throwaway password scoped to one tool, so you're not handing your real account password to the docker CLI.

Mint one from your PDS:

set -a; . ./.env; set +a

ACCESS=$(curl -s -X POST https://pds.suranyami.com/xrpc/com.atproto.server.createSession \
  -H "Content-Type: application/json" \
  -d "{\"identifier\":\"${PDS_ACCOUNT_DID}\",\"password\":\"${PDS_ACCOUNT_PASSWORD}\"}" \
  | jq -r .accessJwt)

curl -s -X POST https://pds.suranyami.com/xrpc/com.atproto.server.createAppPassword \
  -H "Authorization: Bearer ${ACCESS}" \
  -H "Content-Type: application/json" \
  -d '{"name": "atcr"}'
# → {"password":"xxxx-xxxx-xxxx-xxxx", ...}   ← store as KNOT_ATCR_APPPW in .env

Log the host that'll run the server into the registry (run this on that machine):

docker login atcr.io -u forge.suranyami.com   # paste the app-password

Now the service itself. The one line that ties this whole thing to you is KNOT_SERVER_OWNER — your DID from Step 2. (Yes, the env vars insist on calling it a knot. I've made my peace with the YAML if not the branding.)

# services/knot.yml
services:
  knot:
    image: atcr.io/tangled.org/knot:latest
    environment:
      KNOT_SERVER_HOSTNAME: knot.suranyami.com
      KNOT_SERVER_OWNER: did:plc:tg42msv45ief3qphccenrogh   # your self-hosted DID
      KNOT_SERVER_DB_PATH: /app/knotserver.db
      KNOT_REPO_SCAN_PATH: /home/git/repositories
      KNOT_SERVER_INTERNAL_LISTEN_ADDR: localhost:5444
    volumes:
      - /bricks/eon-1/knot/keys:/etc/ssh/keys              # stable SSH host keys
      - /bricks/eon-1/knot/repositories:/home/git/repositories
      - /bricks/eon-1/knot/server:/app                     # SQLite db + app state
    x-ports:
      - knot.suranyami.com:5555/https   # web UI via cluster Caddy (TLS auto)
      - 2222:22@host                    # git-over-SSH, raw TCP on the host
    x-machines:
      - eon
    restart: always

Deploy, and confirm the web UI answers on a valid cert:

uc deploy -y -f services/knot.yml
curl -s -o /dev/null -w '%{http_code}\n' https://knot.suranyami.com
# 200

Don't reach for curl -I here — the knot only allows GET on /, so a HEAD comes back 405 and you'll convince yourself the server's broken when it isn't. (What you'll actually get at / is an ASCII banner saying “this is a knot server”. The real surface is /xrpc/; the human UI lives on tangled.org, rendered on top of your knot's data.)

The web UI and registration work over 443 alone. Cloning and pushing over SSH from outside your own network needs one more thing: a port-forward on your router (public TCP 22 → eon:2222) and a fixed local IP for the host. That's optional and entirely separate from getting registered — skip it for now if you just want the thing live.


Step 5 — register it with tangled

Sign in at tangled.org as forge.suranyami.com, using your account password (not the app-password, not the admin password). What happens next is the nice part of running your own identity: tangled bounces you to your own PDS to approve the login, because your PDS — not Bluesky, not tangled — is the thing that vouches for you. Approve it, then:

Settings → Knots → add knot.suranyami.com.

tangled fetches your server over HTTPS, checks it's owned by your DID, and links them. That's it. Push a repo.

If you want to watch the wire actually close, ask your PDS what records your DID is now carrying:

curl -s "https://pds.suranyami.com/xrpc/com.atproto.repo.describeRepo?repo=did:plc:tg42msv45ief3qphccenrogh" \
  | jq '.collections'
# ["io.atcr.sailor.profile", "sh.tangled.actor.profile", "sh.tangled.knot"]

sh.tangled.knot is the registration record, and its key is your knot's hostname. Not sh.tangled.publicKey — that one turns up later, the first time the knot signs a git push, so its absence right after registering is normal. (I spent a confused few minutes waiting for it. Don't.)


Two things that'll bite you

1. The login fails and the error is a lie. When I first tried to register, the login died with invalid_client_metadata and I lost hours chasing a permissions theory that turned out to be completely wrong. The real cause was that my PDS couldn't make one outbound network request — a broken IPv6 path it should never have had. If your registration login throws that error, don't trust where it's pointing you; the whole saga (and the actual fix) is its own post: the bug was IPv6. The quick tourniquet, if you hit it before sorting IPv6 properly, is one line on the PDS:

NODE_OPTIONS: "--dns-result-order=ipv4first --no-network-family-autoselection"

2. “Wrong identifier or password” when the password is right. A 32-character password with no word breaks is trivially easy to truncate when you copy it, and then you'll swear blind it's correct. If a createSession call from the command line works but the browser login keeps rejecting you, the value reaching the form has drifted — your account is fine. Reset the password without recreating the account (recreating mints a brand-new DID and orphans your git server — don't):

curl -s -X POST https://pds.suranyami.com/xrpc/com.atproto.admin.updateAccountPassword \
  -u "admin:${PDS_ADMIN_PASSWORD}" \
  -H "Content-Type: application/json" \
  -d "{\"did\":\"${PDS_ACCOUNT_DID}\",\"password\":\"<new-password>\"}"

Then confirm it with createSession and paste the new value straight off your clipboard, not retyped.
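One cheap way to catch a truncated paste is to count what is actually on the clipboard. A sketch, with a made-up function name; it reads from stdin so the password never lands in your shell history or the process list.

```shell
# Hypothetical helper: print the length of the first line on stdin,
# ignoring a trailing carriage return if the clipboard added one.
pw_len() {
    IFS= read -r pw || [ -n "$pw" ]   # tolerate input with no final newline
    pw=${pw%"$(printf '\r')"}
    printf '%s\n' "${#pw}"
}

printf 'abcd1234' | pw_len
# prints: 8
```

On macOS that would be pbpaste | pw_len. If the number isn't the length you generated (32 for a value from openssl rand --hex 16), the clipboard is the problem, not the account.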


Two log lines that look fatal and aren't

Once the knot's up, two ERROR lines will scroll past and both are nothing. I lost ten minutes on each before I trusted the rest of the stack, so:

1. failed to resolve did/handle handle=.well-known — random bots hammer /.well-known/acme-challenge/<anything> on any public hostname, looking for exposed ACME endpoints. Caddy only intercepts the challenge tokens it issued; everything else falls through to the knot, and the knot's router treats the first path segment as a handle to resolve. .well-known fails the handle regex, so it 500s and logs ERROR. Your cert is fine — Caddy already issued it. Ignore it, or add a Caddy rule to 404 /.well-known/acme-challenge/* before it reaches the proxy.

2. database is locked on a backfill-collaborators migration — two code paths touch the SQLite file at the same instant during startup and the loser hits the busy timeout. On a fresh knot there's nothing to backfill (count=0), so losing the race costs nothing; it just never marks the migration done, so it re-runs every boot. Ugly, harmless. A clean restart once the container's settled clears it. I didn't bother.


What you end up with

  • A DID you own outright, on your own PDS, with a custom-domain handle — no Bluesky account, no face scan, no age check.
  • A git host owned by that DID, on a federated network you don't depend on any single company to keep running.
  • One file (.env) you absolutely must back up — the rotation key inside it is your identity.

None of this needs you to be a protocol wizard. I'm certainly not one. It needs an afternoon, a domain, and a stubborn refusal to hand your ID to a website just to host your own code.


I wanted to self-host a tangled server. I refuse to call it a “knot”, because that's just fucking stupid.

FYI, it's supposed to be a small, federated, GitHub-like git server built on the AT Protocol. The catch: tangled identities are AT Protocol identities, and the obvious way to get one is a Bluesky account.

I'm in Australia, where Bluesky now gates sign-up behind age verification (facial recognition, a credit card, or an ID scan).

I object to this on principle, so I said “fuck that”, and besides I do NOT need another social media account, so a bsky.social account was off the table.

The good news is that the AT Protocol doesn't require Bluesky at all. Your identity is just a DID anchored to a PDS (Personal Data Server) — and you can run your own. So I did: the official Bluesky reference PDS, in a container on eon (a Raspberry-Pi-class DietPi box), behind my cluster's Caddy, on my own domain. I created an account, verified the handle, pointed the knot at it. Everything came up green.

Then I tried to log in to tangled with my shiny self-hosted identity, and got:

Failed to start auth flow: auth request failed: PAR request failed (HTTP 400): invalid_client_metadata

This is the story of how that error sent me down a completely wrong path for hours — and how a single differential test dragged me back to a mundane, and much more interesting, root cause.


The obvious culprit (that wasn't)

Thirty-second primer, because OAuth is a swamp of acronyms and I won't pretend otherwise. “Log in with Google” is OAuth: one service vouches for you to another. Here, tangled is the app asking for access (in the jargon, the client), and my PDS — my little self-hosted identity server — is the bouncer deciding whether to let it in (the authorization server).

The handshake starts before my browser is even involved: tangled shoves a request straight at my PDS, server to server. That shove has a name — a Pushed Authorization Request, or PAR. And invalid_client_metadata is the bouncer sniffing the request and going: I don't like the look of this club's paperwork.

So I read the paperwork:

tangled's client metadata asks for a long list of scopes. Scopes are the specific permissions an app wants: the “this app would like to access your repos and your handle” checklist. tangled asks for fine-grained ones: repo:sh.tangled.repo, rpc:sh.tangled.repo.create, blob:*/*, identity:handle.

Then I checked what my PDS said it understood:

scopes_supported: ['atproto', 'transition:email', 'transition:generic', 'transition:chat.bsky']

The old, coarse set. None of the granular ones. Case closed, right?

My PDS (@atproto/pds version 0.5.9) was too old to grasp the permissions tangled was asking for, so it spat the request out. Bluesky's own docs even say granular permissions are “rolling out to the self-hosted PDS distribution” — translation: not in a shipping self-hosted PDS yet.

I went a long way down this road.

  • Checked npm for a newer PDS — latest stable was 0.5.10, one patch ahead, useless.
  • Checked the pre-release tags — somehow older than stable, also useless.
  • Then I found Tranquil, a PDS rewritten in Rust that explicitly supports the granular scopes, read its install docs, confirmed it ran on my hardware, and got within a hair of migrating my brand-new identity onto a multi-container, Postgres-backed, community-alpha server.

That is a colossal amount of effort and risk to dodge a problem I had not actually confirmed was the problem. Mercifully, it was not the problem.


The question that broke the theory

The thing that saved me was one stupid question: if my PDS can't do this, how is tangled doing it for everyone else right now?

People log in to tangled every day with handles on .bsky.social or .tngl.sh. That second one — .tngl.sh — is tangled's own PDS hosting. Whatever software it runs, it demonstrably works with the exact login I was failing. So instead of theorising, I could just go poke it.

I had a .tngl.sh handle lying around from earlier, so I worked out which PDS hosts it and interrogated it directly. Two things fell out immediately, and both upended the original theory:

  1. tngl.sh runs the Bluesky reference PDS — the same @atproto/pds software I installed, version 0.4.208. Not Tranquil. So migrating to Tranquil would have been “solving” a problem the working system doesn't have.
  2. Its scopes_supported was the exact same legacy set as mine. And tangled login works there fine.

If the box that works advertises the same scopes as the box that doesn't, then the scopes were never it. The theory was built on a coincidence I'd mistaken for a clue.

So I fired the identical PAR request at both PDSes and put the answers side by side:

# my PDS:
{"error":"invalid_client_metadata",
 "error_description":"Unable to obtain client metadata for \"https://tangled.org/oauth/client-metadata.json\""}

# tngl.sh:
{"error":"invalid_request",
 "error_description":"client authentication method \"private_key_jwt\" required a \"client_assertion\""}

There it is, in the error text. tngl.sh fetched and read tangled's paperwork and moved on to the next step — it was only complaining that my bare test didn't include a signature it expected. Mine never got that far. “Unable to obtain client metadata.” My PDS couldn't even download the document.

This was never about permissions. It was a failed HTTP request. My server tried to fetch one URL and couldn't.
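The side-by-side reading can be reduced to a small classifier. This is a sketch built only from the two responses quoted above; classify_par_error is an invented name, it matches on the error text rather than parsing JSON, and the middle "metadata-rejected" case is my assumption about what invalid_client_metadata means when the fetch itself succeeded.

```shell
# Hypothetical classifier for a PAR error body read from stdin.
# Matches on text, not a JSON parse.
classify_par_error() {
    body=$(cat)
    case $body in
        *'Unable to obtain client metadata'*)
            echo "fetch-failed: the PDS could not download the client metadata" ;;
        *'"invalid_client_metadata"'*)
            # Assumption: fetched fine, but the PDS disliked the contents.
            echo "metadata-rejected: the PDS fetched the metadata and refused it" ;;
        *'"error"'*)
            echo "past-metadata: the PDS got further and failed on something else" ;;
        *)
            echo "no-error-field" ;;
    esac
}

printf '%s' '{"error":"invalid_request","error_description":"client authentication method \"private_key_jwt\" required a \"client_assertion\""}' \
    | classify_par_error
# prints: past-metadata: the PDS got further and failed on something else
```

Run against both servers, a fetch-failed on one and a past-metadata on the other is the whole diagnosis: same request, different stage of failure, so the difference is in the box, not the request.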


Why couldn't it fetch one URL?

From inside the PDS container, using the same mechanism the PDS itself uses (Node's built-in fetch):

https://tangled.org/oauth/client-metadata.json   -> ETIMEDOUT
https://plc.directory                             -> 200
https://example.com                               -> 200

The internet at large was fine. Only tangled.org hung until it gave up (ETIMEDOUT is the network politely telling you nothing ever answered). One host timing out while everything else sails through has a particular smell to it.

I ruled out the boring stuff first. MTU — the largest chunk a network link will carry in one go — was a normal 1500 on both the container and the host, so I wasn't looking at the classic “big packets silently vanish” trap. To be sure, I forced a full encrypted handshake to tangled.org over IPv4 with the right hostname attached, big packets and all. It completed. Not an MTU black hole.

The tell was in DNS. tangled.org publishes two addresses: an A record (its IPv4 address) and an AAAA record (its IPv6 one). And my container was listening on :::3000 — that's IPv6 notation, so IPv6 was very much in the game. So I told Node to stop being clever about which one it picked:

node --no-network-family-autoselection -e 'fetch("https://tangled.org/oauth/client-metadata.json").then(r=>console.log(r.status))'
# -> 200

Instant 200. The bug was IPv6.

Here's the clever-that-backfired bit. Modern Node (I'm on 24) uses a trick called Happy Eyeballs: when a host has both an IPv4 and an IPv6 address, it races connections to both and uses whichever replies first. The whole point is resilience — if one path is dead, the other wins and you never notice.

Except my IPv6 path wasn't dead. It was black-holed — packets marched out and nothing ever came back, no rejection, no error, just silence. So Node's connection sat there waiting for a reply that would never come, instead of failing fast and letting IPv4 win the race. plc.directory and example.com happened not to provoke it; tangled.org's particular setup did. A dead path is fine. A path that eats packets and says nothing is poison.
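A way to see a black-holed family directly, without Node's racing in the way, is to force each address family in turn with a hard timeout. This is a curl-based sketch rather than the Node fetch test used above, and it assumes curl exists wherever you run it (the PDS container may not ship it). probe_families is a made-up name.

```shell
# Hypothetical probe: try one URL over IPv4 only, then IPv6 only.
# Usage: probe_families <url> [timeout-seconds]
probe_families() {
    url=$1
    timeout_s=${2:-5}
    for fam in 4 6; do
        # -4 / -6 force the address family; --max-time caps the whole attempt,
        # so a path that swallows packets shows up as a timeout, not a hang.
        if code=$(curl "-$fam" -s -o /dev/null --max-time "$timeout_s" \
                       -w '%{http_code}' "$url"); then
            echo "IPv$fam: HTTP $code"
        else
            echo "IPv$fam: failed (curl exit $?)"
        fi
    done
}

# Usage: probe_families https://tangled.org/oauth/client-metadata.json 5
```

curl's exit code 28 means the operation timed out. An IPv4 line with an HTTP status next to an IPv6 line that fails with exit 28 is the black-hole signature: nothing refused the connection, nothing answered either.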

The quick fix was one environment variable on the PDS container — tell Node to prefer IPv4 and stop racing:

NODE_OPTIONS: "--dns-result-order=ipv4first --no-network-family-autoselection"

Redeploy, replay the PAR, and now my PDS gives the identical answer tngl.sh did. Login unblocked. No Tranquil, no migration, no waiting for a release that was never the issue.

But slapping “never use IPv6” on a machine that's supposed to have IPv6 isn't a fix, it's a tourniquet. The real question was still standing there: why was IPv6 black-holing in the first place?


The actual root cause: Docker quietly knifed my IPv6

This is the part that was worth the whole detour.

First, whose fault — the network, the router, or the box? Easy to settle by comparison. My Mac, sitting on the same LAN, has a proper public IPv6 address handed down from the ISP and pings the IPv6 internet in about 22ms. So the router and the ISP do IPv6 perfectly well. But eon had no public IPv6 address and no IPv6 route to the outside world at all — just a link-local address (the IPv6 equivalent of “I can talk to the room I'm in”) and Tailscale's private range. It had never picked up the router's IPv6 at all.

I'd assumed IPv6 just worked everywhere because it works on my Mac. It works on my Mac because macOS does SLAAC out of the box.

SLAAC — Stateless Address Autoconfiguration — is the no-paperwork way devices get IPv6: the router periodically shouts its network details onto the LAN (these shouts are called Router Advertisements), and any device that hears one configures its own address from it. No DHCP server, no central list, just “here's the network, help yourself.”

My Mac does that natively. So do my Ubuntu boxes — their network managers handle the router's shouts themselves, in normal software. But my DietPi nodes use a barebones networking setup where listening for those shouts is the Linux kernel's job. And the kernel was sitting there with its thumb up its arse.

The relevant kernel switches (sysctl is just the knob-panel for kernel settings) were identical on every DietPi node:

net.ipv6.conf.eth0.accept_ra   = 1
net.ipv6.conf.eth0.forwarding  = 1

And that pair together is the trap. forwarding = 1 means the host is willing to act as a router, shuffling packets between interfaces on behalf of others. accept_ra is whether it listens to the router's shouts. The catch the kernel enforces: a box that's acting as a router will not listen to other routers' shouts — because routers are supposed to be authoritative, not take their config secondhand. So with accept_ra = 1 and forwarding = 1, the kernel silently bins every Router Advertisement. No shouts heard, no SLAAC, no address, no route. Silently. There's the magic word again.
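
That rule fits in a tiny truth table. The sketch below paraphrases the documented accept_ra values (0, 1 and 2); it is not kernel code. The function names are made up for illustration, and read_sysctl assumes a Linux host with /proc mounted.

```python
from pathlib import Path


def accepts_router_advertisements(accept_ra: int, forwarding: int) -> bool:
    """Paraphrase of the kernel's documented accept_ra behaviour."""
    if accept_ra == 0:
        return False              # RAs are never accepted
    if accept_ra == 1:
        return forwarding == 0    # accepted only while the box is not a router
    return accept_ra == 2         # 2 overrides the forwarding check


def read_sysctl(iface: str, knob: str) -> int:
    """Read one per-interface IPv6 setting (hypothetical helper, Linux only)."""
    return int(Path(f"/proc/sys/net/ipv6/conf/{iface}/{knob}").read_text())


if __name__ == "__main__":
    for accept_ra, forwarding in [(1, 0), (1, 1), (2, 1)]:
        verdict = accepts_router_advertisements(accept_ra, forwarding)
        print(f"accept_ra={accept_ra} forwarding={forwarding} -> hears RAs: {verdict}")
```

On a real node, accepts_router_advertisements(read_sysctl("eth0", "accept_ra"), read_sysctl("eth0", "forwarding")) gives the verdict for that interface.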

So who switched on forwarding? Docker. The moment Docker starts, it flips IPv6 forwarding on so it can route packets to your containers. Every Docker-running DietPi node I had was therefore sitting at forwarding=1 + accept_ra=1 — meaning every one of them quietly lost outbound IPv6 the instant Docker came up. And I never noticed, because nothing I ran needed outbound IPv6 until a Node app went looking for an IPv6-published host and hung itself on the silence.

My Ubuntu nodes dodged it purely by luck: their network managers do the listening in software, sidestepping the kernel switch entirely, so the forwarding clash never touched them.

The actual fix is one knob, set to stick across reboots:

net.ipv6.conf.<iface>.accept_ra = 2

accept_ra = 2 means “listen to the router's shouts even while forwarding.” Flip that, and the kernel starts hearing Router Advertisements again, SLAAC kicks in, and the node grabs a real public IPv6 address and a route to the world like every other device on the network. I'm rolling it out across the whole fleet with Ansible, so every machine is a proper IPv6 citizen — not just the ones whose networking happened to paper over the gap by accident.
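
To make the setting stick, the usual place is a drop-in file under /etc/sysctl.d/. The sketch below only renders the file content for a list of interfaces. It is not the Ansible rollout mentioned above, and the function name and any filename you save it under (say 99-accept-ra.conf) are assumptions.

```python
def render_accept_ra_dropin(interfaces):
    """Return sysctl.d file content that sets accept_ra = 2 per interface."""
    lines = ["# Keep accepting Router Advertisements even when forwarding is on."]
    for iface in interfaces:
        lines.append(f"net.ipv6.conf.{iface}.accept_ra = 2")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the drop-in for a single interface named eth0.
    print(render_accept_ra_dropin(["eth0"]), end="")
```

Saving that output under /etc/sysctl.d/ and running sysctl --system should apply it without a reboot.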


What I'd tell past-me

  • Go find a working version and diff against it. I torched hours on a tidy, well-evidenced, completely wrong theory. One question — “how does the thing that works do it?” — and one identical request fired at a known-good server flattened the whole story in minutes. When something should work and doesn't, stop theorising and go interrogate a copy that does.
  • Same symptom does not mean same cause. Both PDSes advertised the same scopes. That coincidence is exactly what sold me the wrong story for an afternoon. The error messages, read slowly and side by side, were trying to tell me the truth the whole time.
  • “It works on my machine” can be three different accidents wearing a trenchcoat. IPv6 worked on my Mac and my Ubuntu boxes for three unrelated reasons, none of which applied to my DietPi nodes. Same behaviour across a fleet is something you check, not something you assume.
  • Docker turns on IP forwarding and that can silently murder kernel SLAAC. If you run containers on a Debian-ish host with barebones networking and your IPv6 has “mysteriously” vanished, go look at accept_ra versus forwarding. You almost certainly want accept_ra=2.
  • A black-holed network path is worse than a dead one. A cleanly-dead IPv6 stack fails fast and the IPv4 fallback wins. A path that swallows packets and says nothing makes well-behaved software hang forever. If connections to some hosts time out while everything else is fine, suspect a half-configured address family long before you blame MTU, DNS, or the far end.

So, to continue my yak-shaving, I have now managed to:

  • host an identity server
  • get an identity from it
  • avoid making a stupid social media account along the way
  • get to the point where I can attempt to run a federated git server with a stupid name

And all this is so I can host my own code repos instead of dealing with Microsoft's continued enshittification of GitHub, and the ongoing collapse of the USA up its own arsehole of late-stage capitalism as it goose-steps into oligarchic cronyism.

Hoo-fucking-ray!

Discuss...

My rock4 — a Radxa RockPi4 running DietPi with four SATA SSDs on a Penta HAT — has never rebooted cleanly. For as long as I've had it in the rack, issuing sudo shutdown -r now meant walking over to the machine, waiting ten minutes to confirm it was definitely stuck, and flipping the power switch. Every single time.

It worked perfectly otherwise. Services ran fine. Drives mounted fine. The machine was solid right up until the moment you asked it to restart.

This is the story of finding the actual cause — and why the fix I thought would work made no difference at all.


The obvious culprit (that wasn't)

When you have a server that hangs on shutdown, the usual suspects are slow-stopping services, or so I was led to believe. The systemd-analyze blame output on rock4 had an obvious candidate: unattended-upgrades.service, which by default gets a TimeoutStopSec of 1800 seconds — 30 minutes. If an apt upgrade happened to be running at shutdown time, systemd would sit there for half an hour waiting for it to finish before giving up.

I applied a drop-in to cap it at 5 minutes. It still hung. For over two hours.

I dug deeper and found a second culprit: apt-daily-upgrade.service, a separate timer-triggered unit that calls unattended-upgrades. It has its own TimeoutStopSec of 900 seconds. I capped that too.

Still hung.

At this point I was fairly sure the apt theory was wrong, but I didn't have a better one yet.


The diagnostic that changed everything

Here's the thing about a “hung” server: it's worth checking whether the machine is actually dead or whether it's just systemd that's stuck.

After triggering a shutdown and watching rock4 go dark, I opened LanScan and scanned the local network. rock4 was still there. Still responding to pings. Port 111 (rpcbind) still open.

That's not a dead machine. That's a machine with a live kernel where systemd has frozen mid-shutdown.

systemd shuts down in phases, supposedly: it stops services, then unmounts filesystems, then hands off to the kernel for the actual reboot. If it gets stuck at the filesystem unmount step, the kernel never gets the reboot signal — the machine just idles there indefinitely, still on the network, lights still on, going nowhere.

The question was: which mount was blocking?

rock4 has four local SATA drives and one NFS mount — /mnt/media, served from my itx machine over the local network. I pulled up the running containers:

docker inspect jackett --format '{{ json .Mounts }}'

There it was:

/mnt/media/media/Downloads → /downloads

jackett — my torrent indexer — had an NFS-backed path bound as a Docker volume.


Why this hangs forever

When Docker mounts a volume into a container, the kernel creates a bind mount that keeps a reference count on that filesystem. Even after Docker stops the container, the overlay filesystem machinery can retain a reference to the underlying mountpoint.

So when systemd later runs umount /mnt/media, the kernel sees that something still holds a reference to that mount and returns EBUSY. Systemd retries. The NFS server is still up, healthy, and reachable — but that doesn't matter. The umount call isn't failing because the server is gone; it's failing because the local kernel thinks something still has the filesystem open.

And here's the critical part: umount has no timeout. The TimeoutStopSec settings on services don't help. The soft,timeo=30 NFS mount option doesn't help — that governs read/write operation timeouts, not the unmount syscall itself. Without something explicitly forcing a lazy unmount, systemd will wait forever.


The fix

jackett is a torrent indexer. It speaks to tracker APIs and returns search results to Radarr and Sonarr. It does not need to read or write files on disk. The downloads volume was there because at some point, someone (me, almost certainly) copy-pasted a docker-compose snippet from the internet without thinking about whether every line was necessary.

The fix was removing one line from services/jackett.yml:

# Before
volumes:
  - /bricks/rock4-2/jackett:/config
  - /mnt/media/media/Downloads:/downloads  # ← this line

# After
volumes:
  - /bricks/rock4-2/jackett:/config

Redeployed jackett, issued sudo shutdown -r now, and watched. Three minutes later, rock4 was back online. No power cycle. First clean reboot in years.


The general rule

If you're running Docker containers on a machine that also has NFS mounts, think hard before binding any NFS-backed path into a container volume. The risk isn't that Docker will do something wrong — it's that the combination of Docker's bind mount lifecycle and the kernel's umount semantics creates a window where shutdown can hang indefinitely with no error message and no timeout.

If you genuinely need an NFS path inside a container, the belt-and-suspenders fix is to add x-systemd.mount-timeout=30 to the relevant fstab entry. This caps the mount's teardown time at 30 seconds rather than forever — not ideal, but it bounds the hang.

itx.local:/mnt/media  /mnt/media  nfs  soft,timeo=30,x-systemd.mount-timeout=30  0  0

But better is to audit your container volume mounts and ask: does this service actually need filesystem access, or is it just inheriting a volume that was copy-pasted into the config at some point?
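
One way to run that audit mechanically is to cross-reference the host's network mounts with each container's bind sources. A minimal sketch, assuming its inputs are the text of /proc/mounts and the JSON printed by docker inspect --format '{{ json .Mounts }}'. The function names and the list of filesystem types are illustrative choices, and mountpoints containing spaces are not handled.

```python
import json

# Filesystem types treated as network-backed (an assumption, extend as needed).
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "fuse.glusterfs", "fuse.sshfs"}


def network_mountpoints(proc_mounts_text: str) -> list[str]:
    """Return mountpoints from /proc/mounts whose fstype is network-backed."""
    points = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()  # device, mountpoint, fstype, options, ...
        if len(fields) >= 3 and fields[2] in NETWORK_FS_TYPES:
            points.append(fields[1])
    return points


def risky_binds(inspect_mounts_json: str, mountpoints: list[str]) -> list[str]:
    """Return container mount sources that live on a network mountpoint."""
    risky = []
    for mount in json.loads(inspect_mounts_json):
        source = mount.get("Source", "")
        for point in mountpoints:
            if source == point or source.startswith(point.rstrip("/") + "/"):
                risky.append(source)
                break
    return risky
```

Fed rock4's mounts and jackett's inspect output, this would flag /mnt/media/media/Downloads and leave /bricks/rock4-2/jackett alone.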


Why it was so hard to diagnose

A few things made this particularly hard to spot:

No error message. The machine doesn't log “stuck waiting for NFS umount.” It just sits there. Systemd is doing exactly what it's supposed to do: retrying an unmount that keeps returning EBUSY. There's nothing in the journal because journald itself has already stopped by the time the hang happens.

The wrong hypothesis was plausible. Unattended-upgrades with a 1800s timeout genuinely can cause shutdown hangs. Capping it was the right thing to do regardless. It just wasn't the root cause here.

The symptom was intermittent enough to seem random. Sometimes rock4 rebooted. When the NFS server (itx) was down or the jackett container had been recently restarted, Docker might have already released the reference by the time shutdown reached the umount step. This made it feel like a timing issue rather than a deterministic one.

The diagnostic breakthrough — checking whether the machine was still pingable after it “hung” — was the key. A dead machine and a machine stuck mid-shutdown look identical from across the room. They look very different from a network scanner.


The problem is probably older than NFS

After fixing the hang, I realised something. rock4 ran GlusterFS for years before the NFS migration — a distributed filesystem where each node contributes “brick” drives to a replicated pool. The containers on rock4 mounted GlusterFS paths like /mnt/storage/jackett, and those mounts have the same property as NFS: they're network-backed filesystems that can't unmount cleanly while something holds a kernel reference to them.

GlusterFS uses FUSE (Filesystem in Userspace) to expose its mounts locally. FUSE unmounts are actually harder to complete cleanly than NFS: to release a GlusterFS FUSE mount, the glusterd daemon has to coordinate across the network, consult its peers, and tear down brick connections in order. If Docker is still holding a reference to the mountpoint, glusterd can't complete that teardown, and umount returns EBUSY — the same outcome as NFS, but with more moving parts and more ways to stall.

So the sequence was almost certainly: Docker container with GlusterFS volume → indefinite hang → GlusterFS decommissioned → NFS mounted → same container config carried across with updated paths → Docker container with NFS volume → still hangs.

Different filesystem, identical mechanism, years of continuity. The jackett config probably got its downloads volume added once, years ago, and nobody thought to question it during the storage migration.

The GlusterFS angle matters beyond this one machine. Between roughly 2018 and 2022, GlusterFS was enormously popular in self-hosted circles — TrueNAS Scale shipped it as the default clustered storage backend, and countless homelab builds adopted it for redundant storage across a few nodes. Many of those setups ran Docker containers with GlusterFS-backed volumes. Many of those setups probably had machines that wouldn't reboot cleanly. It's a reasonable bet that a lot of those people never connected the reboot hang to the storage layer.

Red Hat deprecated GlusterFS in RHEL 9 (announced 2022). The official framing was “focus on other storage solutions,” but the operational complexity was a significant part of the story: GlusterFS was difficult to run at small scale, prone to split-brain, and had long-running issues with graceful shutdown and FUSE lifecycle management. The Docker reboot hang described here is a concrete example of that class of problem, the kind of subtle, hard-to-diagnose operational failure that accumulates over time and eventually makes a piece of software too difficult to maintain and recommend.

If you ran GlusterFS and your server never quite rebooted cleanly: this was probably why.


Setup

  • rock4: Radxa RockPi4, DietPi (Armbian kernel 6.18), 4× 3.6TB SATA SSDs via Penta HAT
  • itx: Rock 5 ITX, NFS server, mergerfs pool at /mnt/media
  • Container management: uncloud
  • jackett: lscr.io/linuxserver/jackett

Discuss...

Just watched this:

https://www.imdb.com/title/tt0460791/?ref_=nv_sr_srsg_6_tt_8_nm_0_in_0_q_the%20fall

“The Fall”, by director Tarsem Singh.

Outstandingly beautiful visuals. Like watching a graphic novel by Moebius brought to life. An opiate-filled fever-dream of over-the-top sensations for the pure sake of it.

Simply incredible.

Discuss...

Uncloud is obviously still a work in progress, so bugs will happen. Because I've drunk the Kool-Aid and dived into using it on my homelab, I'm doing my best to be a good netizen and help troubleshoot any issues I find.

Today I had a great session with the author while my cluster was unresponsive to everything except commands like uc machine ls. The services were still running, but I'd lost the ability to perform new deploys or check the state of services. It seems like a bug, and might be related to WireGuard + Tailscale + Uncloud and IPv6 addresses.

Along the way, though, I learned a new command, which got me completely unblocked:

uc ctx conn

Run this and you get a list of all the machines in your cluster and you can choose a different default machine to proxy traffic through. In my case, it seems like there's something wrong with only the Ubuntu machines I have… the others on DietPi work just fine.

One caveat I had, though: because my machines are behind NAT on a home network, I have ports 80 and 443 redirected to the “default machine”, so I had to change that redirection to point to the local IP address of the new default machine, then do a new deploy of whatever services I needed to update. I believe the issue here is probably to do with registering the Let's Encrypt certificate… they require the machine to be discoverable for that to complete.

Discuss...

This is my compose file for beszel, saved as beszel.yml:

services:
  beszel:
    image: henrygd/beszel:latest
    x-ports:
      - beszel.your-domain.com:8090/https
    volumes:
      - ./beszel_data:/beszel_data
      - ./beszel_socket:/beszel_socket

  • Deploy the beszel webapp with uc deploy -f beszel.yml
  • Sign up and log in
  • Go to settings/tokens and activate “Universal Token”
  • Under the ••• drop-down menu, select “Copy Docker Compose”. This will give you something like this:

services:
  beszel-agent:
    image: henrygd/beszel-agent
    container_name: beszel-agent
    restart: unless-stopped
    network_mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./beszel_agent_data:/var/lib/beszel-agent
      # monitor other disks / partitions by mounting a folder in /extra-filesystems
      # - /mnt/disk/.beszel:/extra-filesystems/sda1:ro
    environment:
      LISTEN: 45876
      KEY: 'ssh-ed25519 xxxxxxxxxxxxxxxxxxxxxxxxx'
      TOKEN: xxxx-xxxxx-xxxxx-xxxxx
      HUB_URL: https://beszel.your-domain.com

Add these lines to the bottom of it:

    deploy:
      mode: global

This will ensure that the agent is installed on all your machines.

I usually just paste the beszel-agent bit into the first docker-compose, then re-run:

uc deploy -f beszel.yml

This will give you some output like this:

[+] Deploying services 8/8
 ✔ Container beszel-agent-xmai on eon    Started         1.4s 
 ✔ Container beszel-agent-os6i on itx    Started         0.6s 
 ✔ Container beszel-agent-hkhd on node2  Started         0.6s 
 ✔ Container beszel-agent-w84p on node3  Started         1.4s 
 ✔ Container beszel-agent-qd42 on node4  Started         0.6s 
 ✔ Container beszel-agent-c79q on pico   Started         0.5s 
 ✔ Container beszel-agent-v7ff on rock4  Started         0.8s 
 ✔ Container beszel-agent-odec on rock5  Started         0.7s 

Then you might want to rename the nodes in the beszel web UI for easier machine identification. I still haven't worked out how to make that process automatic, but it's not a big deal.

Discuss...

Just before Christmas, I spent an hour or so setting up uncloud on my homelab, and I am stunned at how easy it was to get working.

The motivation is that I’ve known for a long time that Swarmpit is basically abandoned. Disappointing, but true. The latest release of DietPi, my preferred distro for my Raspberry Pi and Rockchip SBCs, included an update to docker and docker-compose that completely broke compatibility with Swarmpit. Cue panicked hunting for alternatives and a fortuitous discovery of Uncloud.

Here's how I set it up.

DNS

  • Added a wildcard CNAME DNS record pointing *.suranyami.com to my dynamic DNS address: suranyami.duckdns.org.

Tailscale

  • Installed tailscale on each of the machines (Installation Instructions)
  • Because I'm using DietPi, the recommended way to install tailscale is using dietpi-software.
  • Connect each machine to my free tailnet (free tier allows up to 100 nodes) using sudo tailscale up and following the on-screen instructions.

This gives me a stable URL for each individual machine that I can SSH into without needing to do NAT redirection on the router. For instance, my machine called node1 is available to me (and only me) at ssh dietpi@node1.tailxxxxx.ts.net.

SSH config

  • Updated my ~/.ssh/config with entries for all the machines that look like this:

Host node1
  Hostname node1.tailxxxxx.ts.net
  User dietpi
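
With more than a couple of machines, those entries can be generated rather than typed. A small sketch, assuming every node uses the same user and tailnet suffix. The function name is made up, and tailxxxxx.ts.net is the same placeholder as above, to be swapped for your own tailnet name.

```python
def ssh_config_entries(hosts, tailnet, user="dietpi"):
    """Render one ~/.ssh/config block per host, separated by blank lines."""
    blocks = []
    for host in hosts:
        blocks.append(f"Host {host}\n  Hostname {host}.{tailnet}\n  User {user}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Print blocks for two example nodes; append the output to ~/.ssh/config.
    print(ssh_config_entries(["node1", "node2"], "tailxxxxx.ts.net"), end="")
```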

Uncloud

  • Installed uncloud on my laptop: curl -fsS https://get.uncloud.run/install.sh | sh
  • Initialized the cluster by picking one of the above machines as a first server: uc machine init dietpi@node1.tailxxxxx.ts.net --name node1

NAT port redirection

  • Because all my machines are behind NAT, I configured my router to map ports 80 and 443 to the above machine. This can ultimately be any of the machines configured after this. The important point is that at least one of the machines running the caddy reverse-proxy that uncloud installs needs to receive ports 80 and 443 from the outside world.

Add more machines

  • Add other machines using uc machine add dietpi@node2.tailxxxxx.ts.net --name node2

Deploy services

  • Deploy services using uc deploy -f plex.yml where plex.yml is a subset of a docker-compose file, but with minor changes. For instance, to deploy to a specific machine (which I have to do because I need to redirect port 32400 from the router to a specific machine, because plex is annoying like that), I do this:

services:
  plex:
    image: linuxserver/plex:arm64v8-latest
    # ...
    x-machines:
      - node2
    x-ports:
      - 32400:32400@host
      - plex.suranyami.com:32400/https

And that's about it. No manual reverse-proxy configuration, no manual entry of IP addresses, everything is just automatically given a letsencrypt SSL certificate and load-balanced to wherever the servers are running.

This is honestly the easiest way to self-host anything I've found.

It's been 2 weeks or so now, and now that I've got the knack of the x-ports port-mapping syntax, I've also managed to get all my other services running everywhere.

Notable edge cases were:

Minecraft

x-ports:
  - 25565:25565@host

Plex

x-ports:
  - 32400:32400@host
  - plex.suranyami.com:32400/https

Needed 2 mappings, one for the internal subnet for use by the AppleTV, because of some idiosyncrasy of the way the native Plex app works behind the NAT versus over t'interwebz.

Jellyfin

x-ports:
  - 1900:1900@host
  - 7359:7359@host
  - jellyfin.suranyami.com:8096/https

Only outages I've had so far were purely hardware-related: robo-vacuum somehow knocked out a power cord that was already loose… derp. That won't happen again. And, the fan software wasn't installed on my RockPi 4 NAS box, so it overheated and shut down. Fixed that this morning.

global deployment

I'm currently using Netdata to monitor my nodes. It's WAY overkill for what I'm running, but hey, whatever. For this we need to do a global deployment:

services:
  netdata:
    image: netdata/netdata:latest
    hostname: "{{.Node.Hostname}}"
# ...
    volumes:
# ...
      - /etc/hostname:/host/etc/hostname:ro
    deploy:
      mode: global

This is essentially the same as a normal docker-swarm compose file, but because it's not actually docker-swarm, this line is a hack to get the hostname: - /etc/hostname:/host/etc/hostname:ro.

There is also a quirk that (hopefully) might be fixed in future versions of uncloud: the volumes don't get created automatically on each machine. For that I had to execute a bunch of uc volume create commands like this:

uc volume create netdataconfig -m node2
uc volume create netdataconfig -m node3
uc volume create netdataconfig -m node4
uc volume create netdatalib -m node2
uc volume create netdatalib -m node3
uc volume create netdatalib -m node4
uc volume create netdatacache -m node2
uc volume create netdatacache -m node3
uc volume create netdatacache -m node4
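
Typing those out by hand gets old quickly. The sketch below generates the same command list from two short lists. It only builds the strings, using the uc volume create NAME -m MACHINE form shown above; the function name is an illustrative choice.

```python
from itertools import product


def volume_create_commands(volumes, machines):
    """Build one 'uc volume create' command per (volume, machine) pair."""
    return [f"uc volume create {vol} -m {machine}"
            for vol, machine in product(volumes, machines)]


if __name__ == "__main__":
    commands = volume_create_commands(
        ["netdataconfig", "netdatalib", "netdatacache"],
        ["node2", "node3", "node4"],
    )
    print("\n".join(commands))
```

The output can be reviewed and then pasted into a shell, which keeps the generator itself free of side effects.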

Replicated deployment

One very nice feature is replicated deployment with automatic load balancing. There's not a lot of documentation about how it works at the moment, so I'm a bit suss on it, but essentially it looks like this in the compose file:

    deploy:
      mode: replicated
      replicas: 4

This will cause it to pick a random set of machines and deploy a container on each, and load-balance incoming requests.

There are caveats to this, of course. The service configuration will need to be on a shared volume, for instance, and some services do NOT behave well in this situation. plex is the worst example of this… if you store its configuration, caches and DB on a shared volume, you are gonna have a very bad time indeed because of race-conditions, non-atomicity, file corruption etc.

Which is a shame, because Plex is the service I'd most like to be replicated. I dunno what the solution is. Use something other than Plex seems like the most obvious answer, but as far as I know the alternatives have the same issue.

Discuss...

  1. Oil, filtered from the previous deep fry, repackaged in a Nikka whisky bottle, because they’re cute, and Nikka whisky is frikkin’ fantastic, so of course we’ve got some old bottles lying around.

  2. Soya sauce, decanted daintily into a maple syrup bottle, which is totally not at all confusing sometimes.

  3. Salt and pepper shakers that are big and chunky and sit in a really crappy wooden base I made to stop them falling over all the time because I’m the clumsiest person I know.

  4. HomePod, playing bleepy noises. The other half of the stereo pair is on the other side of the kitchen, because stereo separation is important and don’t lecture me about how speakers are arranged. Tonight there was a decent selection featuring Wolfram Spyra, Space Frogs, International People’s Gang, Woob, Si Begg, Basement Jaxx, Tosca, and a fun Grimes track from before she went a bit mental after hanging out with mister nazi-baby-maker.

  5. The handle of my Chinese cleaver, the knife I use for literally everything.

  6. The handles of 2 quite nice Global brand knives that were a wedding present from a family member. These are very good Japanese knives. The other knives (and these) are all hanging from the magnetic knife holder I installed.

  7. Preserved lemons. Gotta start using them now, because it’s been long enough. Better decant some into smaller bottles to give away. Funny that I don’t have any Moroccan cook books. Probably still traumatized by “that event in Morocco” 25+ years ago.

  8. Garlic oil. Such an amazingly simple thing to make, and you end up with 2 x awesome things. Chop up garlic. Fry in oil till crispy. Scoop out crispy garlic and use as topping. Add garlic-infused oil to literally anything to make it taste amazing.

  9. Super-cheap oil spray thingy from Aldi. Think it was $10. So useful.

  10. Rosemary-salt. In a blender, combine rosemary, salt crystals, lemon rind, peppercorns. Whizz. Zero to hero.

  11. Left-over Sichuan pepper-salt. Grind Sichuan peppercorns with salt. Left over from making 白切鸡, bái qiē jī, “white-cooked chicken”, then shallow-frying half the chicken the following day for a lovely crispy-skin experience.

Discuss...
