We are three weeks away from shipping NavEngine v4, an echo of the previous piece on Business-Driven Development.
I say "shipping" loosely. There is no deployment script I run, no SSH session I open, no Kubernetes rollout I watch. The software lives on infrastructure I have never seen, behind firewalls I cannot reach, on machines whose specs I do not know. Shipping, in this context, means pushing an image to a registry and trusting that a process running inside a container on a customer's server will eventually notice and do something about it.
That gap - between what I push and what the customer runs - is what this piece is about.
The Assumption Collapse
Every CI/CD tool I have ever used was built on a premise so foundational that nobody thought to state it: you control the deployment target. You own the server. You have the keys. Deployment, in the conventional sense, is just automation wrapped around access you already have.
NavEngine is quite the opposite. It ships as a custom image - qcow2 - onto the customer's infrastructure. The customer owns the machine. I do not have SSH access unless I go through DWService, and even then that is a support channel, not a deployment one. Yet somehow, we need to continuously deliver updates to machines we cannot reach, across connections we cannot guarantee, without breaking software that is actively in use.
So the question became: if you cannot push, how do you deliver?
Indulge: The moment you decide to ship software you don't host, you have made a decision with consequences that will follow you for the life of the product. Not just operationally. Architecturally. Every assumption your codebase makes about the environment it runs in now belongs to someone else's infrastructure. That is not a deployment problem. It is a design problem that shows up at deployment time.
The Answer is Pull
CI/CD flow diagram.
Watchtower. It runs as a container alongside the rest of the stack, polls the Docker registry on a configured interval, and when it detects a new image digest on the tag it is watching, it pulls and restarts the relevant containers. No webhook, no push, no SSH. The installation phones home for updates and takes what it finds.
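For concreteness, a Watchtower sidecar of this shape can be started with something like the following. The interval and flags are illustrative, not NavEngine's actual configuration:

```shell
# Run Watchtower next to the stack. It needs the Docker socket to
# inspect running containers and restart them when their image changes.
# --interval 300: poll the registry every 5 minutes (value illustrative)
# --cleanup: remove the superseded image after an update
docker run -d --name watchtower \
  -v /var/run/docker.sock:/var/run/docker.sock \
  containrrr/watchtower \
  --interval 300 --cleanup
```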
The key design decision here was the floating tag. Every customer's Butane config ships with core:stable. Not core:v3.0.39. Not a digest pin. stable. When Watchtower polls the registry and sees that stable now resolves to a different digest than what is currently running, it pulls. What "stable" points to is entirely under my control, from the registry side, without touching anything on the customer's machine.
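From the registry side, you can check what a floating tag currently resolves to before and after a release. The hostname here is a placeholder, not the real registry:

```shell
# Print the manifest and digest that the floating tag currently
# points at. When this digest changes, Watchtower pulls.
docker buildx imagetools inspect registry.example.com/core:stable
```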
This sounds obvious once you say it. It took longer than I would like to admit to get there.
Two registries, two floating tags, one gate
Here is the full pipeline as it actually runs.
Every push to main triggers a build on the dev registry. The image gets tagged with a version identifier and a floating tag - staging. A staging environment - running the same Compose stack, same Butane configuration, same structure as a customer installation - pulls from staging. This is where the image lives until I am satisfied it works.
When staging looks good, I create a release. The production registry builds from that release, tags the image with the version (v3.0.40) and overwrites the floating stable tag. Customer installations, on their next Watchtower poll interval, see a new digest on stable and update.
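In registry terms, a release does something like the following. Hostnames and the version are placeholders, and the real pipeline runs this in CI rather than by hand:

```shell
# Build once, tag twice: the immutable version for audit and rollback,
# and the floating tag that customer installations watch.
docker build -t prod.example.com/core:v3.0.40 \
             -t prod.example.com/core:stable .
docker push prod.example.com/core:v3.0.40
# Pushing stable is the moment the release actually "ships": every
# Watchtower poll after this sees the new digest.
docker push prod.example.com/core:stable
```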
The critical detail: stable is never overwritten by a push to main. Only by a release. The staging gate is the only thing standing between a broken image and a customer's running installation. There is no automated rollout percentage, no canary fleet, no gradual traffic shift. The gate is a human decision, made after watching staging run and deciding it is ready.
For a solo-operated product at this stage, that is the right call. Complexity in release infrastructure that you do not need is just surface area for things to go wrong.
Indulge: A staging environment that does not accurately reflect production is a very expensive placebo. It gives you confidence without giving you information. The hardest thing about shipping self-hosted software is that your staging environment runs on infrastructure you understand, with data you created, on a network you control. Your customer's environment is none of those things. No pipeline fully closes that gap. The best you can do is know exactly where your confidence ends.
What happens when stable is broken
It will happen. An image that passes staging will break in a customer environment for a reason that staging did not surface - a schema migration that assumed a clean database, a dependency that behaves differently on older hardware, a configuration value that was present in staging and absent in the field.
The recovery flow is: fix on main, watch it pass staging, cut a new release. v3.0.41 overwrites stable. Watchtower picks it up on the next poll interval. The customer, who may or may not have noticed anything, is now running the fixed image.
The window between the broken image landing and the fix arriving is real. Depending on how fast the hotfix moves through staging and how long the Watchtower poll interval is, a customer could be running broken software for anywhere from minutes to hours. There is no remote kill switch. There is no way to reach in and restart a service. There is DWService if the situation is bad enough to warrant it, but that is a support escalation, not a deployment tool.
This is the honest cost of not controlling the deployment target. You accept a recovery latency that you cannot compress below a certain floor. The mitigation is not a cleverer pipeline. It is investing deeply in staging fidelity and in making sure the image fails loudly rather than silently - health checks that surface problems immediately, startup validation that refuses to run on bad configuration rather than running badly.
A system that fails loudly is a system that can be fixed. A system that degrades quietly is a system that erodes trust before anyone knows there is a problem.
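"Refuses to run on bad configuration" can be as simple as validating the environment before the service starts. A minimal sketch, with made-up variable names rather than NavEngine's actual configuration:

```shell
# Sketch of fail-loudly startup validation for a container entrypoint.
# Variable names are illustrative.

fail() { echo "startup validation: $1" >&2; return 1; }

validate_config() {
  # Reject a missing connection string outright.
  [ -n "${DATABASE_URL:-}" ] || fail "DATABASE_URL is not set" || return 1
  # Reject a license file path that does not exist on disk.
  [ -f "${LICENSE_FILE:-/nonexistent}" ] || fail "license file missing" || return 1
}

# In the real entrypoint this would run before exec'ing the service:
#   validate_config || exit 1
#   exec "$@"
```

The point is that the container exits immediately with a message naming the problem, instead of starting and failing somewhere deep in the application later.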
Enterprise and standard: different cadences, same pattern
How do we manage enterprise-license customers who diverge from the main product line?
NavEngine has two license tiers. Enterprise customers are on a separate release cadence from standard customers. The mechanism is straightforward: separate image names on the production registry, each with its own floating tag - core-enterprise:stable and core-standard:stable. The Butane config shipped to each customer points at the appropriate tag. Enterprise releases can go out on a different schedule, carry different feature sets, and move more cautiously than standard releases.
What prevents a standard customer from pointing Watchtower at the enterprise tag? Mostly friction. The Compose file is baked into the Butane config at provisioning time. There is no SSH access to change it. A customer would need console access and the motivation to go looking - unlikely for most, impossible to rule out for all.
The proper answer is registry-level access control: pull tokens scoped to the tags each customer is entitled to, issued at license activation and revoked at expiry. This means the registry enforces entitlement, not just the application. An expired license means an expired pull token means no updates, enforced at the point of delivery rather than after the fact.
This is on the roadmap. For v4, the answer is friction and trust.
Indulge: License enforcement in self-hosted software is a negotiation between what you can technically control and what you have to trust. You cannot fully control what runs on a machine you do not own. At some point, a sufficiently motivated customer can circumvent almost any enforcement mechanism you build. The goal is not to make circumvention impossible. It is to make compliance easier than circumvention, and to make the value proposition strong enough that the question rarely comes up.
The license server is not in the update path
One decision I am glad we made early: the licensing server and the Docker registry are separate infrastructure. They do not share a failure domain.
Watchtower polls the registry. The license server is called from within one of the running containers as part of normal application operation. If the registry is unreachable, the software keeps running. If the license server is unreachable, the backend falls back to its last known state - persisted to disk, not held in memory, so it survives a container restart. The check runs periodically. The grace window is generous enough that a license server outage does not immediately affect customers, but not so generous that expired licenses can run indefinitely.
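The fallback logic can be sketched as: on every periodic check, try the license server; on success, persist the result; on failure, trust the last persisted result only while it is younger than the grace window. Everything below - paths, window length, the curl call - is illustrative, not the real implementation:

```shell
# Illustrative grace-window license check.
# STATE_FILE, GRACE_SECONDS and LICENSE_SERVER are placeholders.
STATE_FILE="${STATE_FILE:-/var/lib/app/license_state}"
GRACE_SECONDS="${GRACE_SECONDS:-1209600}"   # 14 days

license_ok() {
  # Try the license server; persist a fresh state file on success.
  if curl -fsS --max-time 5 "${LICENSE_SERVER:-}/check" \
       -o "$STATE_FILE.tmp" 2>/dev/null; then
    mv "$STATE_FILE.tmp" "$STATE_FILE"
    return 0
  fi
  # Server unreachable: fall back to the last persisted check,
  # accepted only while it is younger than the grace window.
  [ -f "$STATE_FILE" ] || return 1
  age=$(( $(date +%s) - $(stat -c %Y "$STATE_FILE") ))
  [ "$age" -le "$GRACE_SECONDS" ]
}
```

Because the state lives on disk rather than in memory, a container restart during a license server outage lands in exactly the same place: last known state, within the window.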
This matters because the failure modes compound. A product update that requires a license check to proceed has just made the license server a dependency of your deployment pipeline. Any outage that hits your license infrastructure also hits your ability to ship updates to paying customers. Keeping these paths separate means they fail independently, and independent failures are recoverable in ways that cascading ones are not.
What the pipeline actually looks like
CI/CD flow diagram (drawio).
The immutable version tags are not just for auditing. They are the rollback reference. If v3.0.40 breaks everything, v3.0.39 still exists in the registry. I can retag it as stable manually and customers will roll back on the next poll. This has not been needed yet. It is there for the day it is.
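Operationally, that rollback is just a retag: pull the last good immutable version and push it back over stable. Hostnames and versions here are placeholders:

```shell
# Roll stable back to the previous known-good release.
docker pull prod.example.com/core:v3.0.39
docker tag  prod.example.com/core:v3.0.39 prod.example.com/core:stable
docker push prod.example.com/core:stable
# Customers pick this up on their next Watchtower poll, exactly like
# a normal release - rollback rides the same delivery path forward.
```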
Indulge: Most CI/CD writing treats deployment as the end of the story. Ship it, watch the metrics, move on. Self-hosted software inverts this. Deployment is the beginning of a period during which software you cannot reach is running in an environment you cannot see, on behalf of a customer whose experience you will only hear about if something goes wrong. The pipeline is not a delivery mechanism. It is a trust mechanism. Every decision in it is a decision about how much you trust the image before it leaves your hands.
Three Weeks Out
NavEngine v4 is three weeks away. The pipeline is running. Staging has held. The tags are in place.
None of that answers the only question that matters: what happens when the software leaves you?
It will run on machines you have never touched, against data you have never seen, in environments that do not care about your assumptions. By the time it fails, if it fails, it will already be someone else’s problem - and still entirely yours. The customer notices before you do. That is the thing. I am shit scared.
I suppose that this is the essence of these notes - to document real systems in real time. This is the inversion self-hosted systems force on you: treating deployment not as the end of control but as the beginning of its absence.
So you design for that absence.
You design for recovery over prevention.
For visibility over certainty and trust over control.
Everything else is just what it takes to make that possible.