Restarting HA-enabled servers that are in the terminated state
I know of at least two cases in which Kea users tried to recover from the HA "terminated" state after fixing a clock skew problem and failed: both servers ended up back in the terminated state even though their clocks were in sync. To the best of my knowledge, this is because our ARM is not specific about how HA-enabled servers should be restarted in this situation. It merely says to restart them. Reading this, a user would likely restart one server first and leave the other running to avoid a total loss of the DHCP service. Or perhaps it is simply easier to restart one server and then move on to the other. The effect of such a sequential restart is that when the first server starts up, it sees the second one still in the terminated state. The HA state machine is programmed to transition a server to the terminated state when it sees its partner in the terminated state, so the restart has no effect: both servers are still in the terminated state. Restarting the second server gives the same result.
The correct way to restart the servers is to stop both and then start both. Alternatively, with both servers stopped, start one of them, wait for it to settle in the partner-down state, and then start the second one.
The running server must not be in the terminated state when its partner is restarted, because that causes the restarted partner to transition to the terminated state as well!
Our ARM must be crystal clear about it!
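To illustrate the difference between the two restart orders, here is a toy model in Python. It is only a sketch and does not reflect Kea's actual HA state machine: it encodes just the rules described in this issue, the server names `server1` and `server2` are made up, and the `normal` state is a hypothetical stand-in for whatever regular operating state the pair reaches once both servers are up.

```python
# Toy model only. It is NOT Kea's HA implementation; it encodes just the
# rules described in this issue. "normal" is a hypothetical stand-in for
# the regular operating state of the pair.
from typing import Dict, Optional

# Maps a server name to its state; None means the server is stopped.
Pair = Dict[str, Optional[str]]


def partner_of(name: str) -> str:
    """Return the name of the other server in the pair."""
    return "server2" if name == "server1" else "server1"


def stop(pair: Pair, name: str) -> None:
    """Stop a server. A partner in the terminated state stays terminated."""
    pair[name] = None
    partner = partner_of(name)
    if pair[partner] == "normal":
        pair[partner] = "partner-down"


def start(pair: Pair, name: str) -> None:
    """Start a server and let it settle based on what it sees in the partner."""
    partner = partner_of(name)
    if pair[partner] == "terminated":
        # The rule that causes the problem: a terminated partner drags
        # the freshly started server into the terminated state.
        pair[name] = "terminated"
    elif pair[partner] is None:
        # Partner is not running, so this server settles in partner-down.
        pair[name] = "partner-down"
    else:
        # Partner is running and healthy; both return to normal operation.
        pair[name] = "normal"
        pair[partner] = "normal"


def sequential_restart() -> Pair:
    """Restart one server at a time while the other keeps running."""
    pair: Pair = {"server1": "terminated", "server2": "terminated"}
    for name in ("server1", "server2"):
        stop(pair, name)
        start(pair, name)
    return pair


def stop_both_then_start() -> Pair:
    """Stop both servers first, then start them one after the other."""
    pair: Pair = {"server1": "terminated", "server2": "terminated"}
    stop(pair, "server1")
    stop(pair, "server2")
    start(pair, "server1")  # settles in partner-down
    start(pair, "server2")  # sees a healthy partner
    return pair


if __name__ == "__main__":
    print("sequential restart:  ", sequential_restart())
    print("stop both, then start:", stop_both_then_start())
```

In this model the sequential restart leaves both servers in the terminated state, while stopping both before starting either brings the pair back to normal operation.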