Reboot-tolerance in Weave Net 1.5
Weave Net has always been pretty robust: instead of a central store of config data, everything is done peer-to-peer, and machine failures don’t hold up any other peers because our data structures use eventual consistency.
However, there were a few corner cases. You could get unlucky with a series of failures, say a machine reboot coupled with a network delay, and have to restart that node again because it was out of sync with the rest of the network.
Another case: say your hosts are split across two datacenters, and the link between them is down (the network is “partitioned”). Weave Net has always let each side carry on working independently, but if every host in one of the datacenters was then rebooted, those hosts would have to wait until the link was restored to learn the network state and carry on.
These are pretty unlikely corner cases, but Murphy’s law applies: whatever can go wrong will go wrong, so in version 1.5 we gave Weave Net the power to recover.
There are two key features that make this work:
- Each peer now saves its state to disk so it will re-sync correctly with the rest of the network on reboot.
- We pick up the unique identity of the underlying machine from the BIOS or hypervisor, so we can retain ownership over a reboot.
Persistence is implemented using the BoltDB library, a lightning-fast store that is simple and reliable and perfect for what we need. We put the files in a Docker volume container named weavedb so they can be managed alongside all other Weave containers – you should leave this container alone unless you want to remove that host from the Weave network.
One thing to note: this new persistence means that you may see a lingering reference to peers you have permanently removed from the network; the advice as usual is to <a href="http://live-weavewww.pantheonsite.io/documentation/net-1.5.0-ipam/net-1.5.0-stop-remove-peers-ipam/">weave reset</a> them before shutdown, or run the <a href="http://live-weavewww.pantheonsite.io/documentation/net-1.5.0-ipam/net-1.5.0-stop-remove-peers-ipam/">weave rmpeer</a> command on one other peer once it has gone.