Andrei Pavel · e3c1f9fd
--- a/Designs/HA-split-brain-issues.md
+++ b/Designs/HA-split-brain-issues.md
+# Split Brain Problems in HA
+## Premature Partner Down
+### The Problem
+During normal operation in the `hot-standby` mode, the two DHCP servers should remain in the `hot-standby` state. In this state, the primary server responds to all DHCP traffic, and the standby server remains passive and only records lease updates sent by the primary. Each server sends periodic heartbeats to its partner to check the partner's availability. As long as both servers respond to the heartbeats, the setup is considered healthy and no failover actions are triggered.
+Communication problems between the two servers may arise for various reasons, e.g. one of the servers terminated unexpectedly, the network connection between the servers is broken (network partitioning), or firewall settings prevent communication in one or both directions. In all those cases, the HA setup is expected to guarantee that the DHCP service remains available to the connecting DHCP clients and that any outage period is as short as possible.
+The DHCP servers experiencing communication problems with their partners should ensure that the communication interruption is not transient and that the partner is unable to provide DHCP service. In this case the server should transition to the `partner-down` state, in which it is responsible for handling the entire DHCP traffic and is not attempting to send lease updates to the partner which presumably can't receive those updates anyway.
+Kea HA can be configured to use a two-step failover procedure to ensure that the partner is really unable to serve DHCP clients. In the first step, the server keeps sending heartbeats for the configured amount of time (`max-response-delay`) before it proceeds to step two. If the communication issue is transient, one of the heartbeats sent within this period may show that the partner is still alive, and the servers continue normal operation. However, if the communication is not re-established within this period, the standby server proceeds to the next failover step. At this point, the standby server starts monitoring the Discover or Rebind messages sent to the primary and inspects the `secs` field value in these messages. A `secs` value of 0 means that the client is sending its first message to the server. A value greater than 0 indicates that the client is retrying, presumably because the server that should handle this request has been unavailable. If the `secs` field value exceeds the configured threshold, the standby server considers the particular DHCP client `unacked`. If the number of `unacked` clients exceeds the configured value, the standby server considers the primary to be down and transitions to the `partner-down` state.
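The second failover step can be sketched in a few lines of Python. This is only an illustration of the counting rule described above, not the Kea implementation: the class and method names are hypothetical, and it assumes that the `secs` threshold is the `max-ack-delay` setting expressed in milliseconds while `secs` itself is carried in seconds.

```python
from dataclasses import dataclass, field


@dataclass
class UnackedTracker:
    """Counts clients whose `secs` value suggests the partner is not answering."""

    max_ack_delay_ms: int = 10000   # assumed to map to `max-ack-delay`
    max_unacked_clients: int = 10   # assumed to map to `max-unacked-clients`
    unacked: set = field(default_factory=set)

    def observe(self, client_id: str, secs: int) -> None:
        # `secs` is in seconds; the configured delay is assumed to be in ms.
        if secs * 1000 > self.max_ack_delay_ms:
            self.unacked.add(client_id)

    def partner_considered_down(self) -> bool:
        # The count has to exceed the configured value, not merely reach it.
        return len(self.unacked) > self.max_unacked_clients


tracker = UnackedTracker(max_unacked_clients=2)
for client, secs in [("a", 0), ("b", 12), ("c", 15), ("d", 11)]:
    tracker.observe(client, secs)

# Clients b, c and d exceed the 10 s threshold, which is more than 2 clients.
print(tracker.partner_considered_down())
```

Tracking clients in a set means that repeated retries from the same client count only once, which matches the idea of counting `unacked` clients rather than `unacked` packets.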
+Note that the primary server running in the `hot-standby` configuration does not perform the described step 2, when it has communication problems with the standby. That's because the standby is normally not expected to respond to DHCP. The primary transitions to the `partner-down` state right after it finds that the heartbeats have been failing for longer than `max-response-delay`.
+We have found that in certain cases the two-step failover mechanism may not prevent both servers from transitioning to the `partner-down` state, which is highly undesirable. If both servers are in this state, each considers the other unavailable and both respond to DHCP queries. This may result in lease database inconsistencies. In particular, two different clients may be offered the same IP address. We are going to call this a `split brain` situation.
+### The Mitigation
+There are some ideas for improving the failover logic in HA to avoid the split brain problem. However, any such change has to go through the standard development process, and it will take a considerable amount of time before it is widely adopted. To help avoid split brain issues in current deployments, the diagrams below explain the source of the problem and how it can be mitigated using `asymmetrical` configurations.
+The following diagram presents two cases. The first one is the split brain case occurring when using a typical symmetric configuration.
+BOTH SERVERS
+- heartbeat-delay: 10000
+- max-ack-delay: 10000
+- max-response-delay: 45000
+- max-unacked-clients: 10
+The second one demonstrates how the split brain situation is avoided with asymmetric configuration.
+PRIMARY
+- heartbeat-delay: 10000
+- max-ack-delay: 10000
+- max-response-delay: 45000
+- max-unacked-clients: 10
+STANDBY
+- heartbeat-delay: 10000
+- max-ack-delay: 10000
+- max-response-delay: 65000
+- max-unacked-clients: 10
+![HA-race](uploads/1c78606ba300399f356b66e122037f0b/HA-race.png)
+Both servers send heartbeats to each other every 10 seconds. The heartbeats are not in sync: each server decides independently when to send its heartbeat. The communication failure occurs as a result of network partitioning. The standby sees the communication problem before the primary because it sends its heartbeat earlier. From now on, the standby counts down 45s (`max-response-delay`). A bit later, the primary also experiences the communication problem and starts counting 45s. While the two servers experience communication issues, DHCP clients are sending DHCPDISCOVERs that would normally be answered by the primary. The primary processes the DHCP traffic but fails to send lease updates to the standby. As a result, the primary drops the DHCP responses and the clients don't get leases. This is expected to last no longer than 45s, until the primary transitions to the `partner-down` state.
+However, the clients keep retrying and bump up the `secs` field values accordingly. When 45s elapse on the standby server, it starts monitoring the `secs` field values. Soon after that, the primary transitions to the `partner-down` state because it has been unable to communicate with the standby for longer than 45s. There is, however, a narrow window between the moment the standby starts monitoring the `secs` field values in the DHCP requests and the moment the primary transitions to the `partner-down` state. In this window the clients still don't get responses from the primary, and the standby starts considering these clients unacked. In fact, even after the primary transitions to the `partner-down` state and starts responding to DHCP traffic, the standby still sees requests arriving with high `secs` values. The standby has no means to tell whether they have been responded to, because it can only see the clients' requests (broadcast), not the responses (unicast). Therefore, even when the primary is already in the `partner-down` state, the standby may assume that certain clients are unacked. As a result, the standby also transitions to the `partner-down` state and both servers now actively respond to the DHCP traffic.
+Let's now take a look at the second picture.
+Both servers are still sending heartbeats every 10s. This time again the standby experiences the communication failure before the primary. From that moment it starts counting 65s and keeps trying to send heartbeats to the primary. Soon, the primary observes the communication failure with the standby. As in the previous case, the primary starts counting 45s before transitioning to the `partner-down` state. The DHCP clients are again sending DHCP requests to the primary, which fails to respond to them because it can't send lease updates to the standby. However, this time the `max-response-delay` on the standby is offset by 20s compared to the primary. When the 45s elapse on the primary, it transitions to the `partner-down` state, and the 20s offset gives the primary extra time to deal with DHCP retries from the clients which tried to communicate with it over the period of 45s. The longer the offset, the more probable it is that the primary will deal with all retrying clients. After 65s, if the standby no longer sees the unacked clients (strictly speaking, more unacked clients than the `max-unacked-clients` value), it simply remains in the `hot-standby` state. This is the desired situation. The primary is now responding to all DHCP traffic and the standby remains passive, occasionally sending a heartbeat to check the primary's availability.
+The offset of 20s has been selected such that most of the retrying clients should come back within this time to get a lease from the primary. Obviously, the longer the offset, the more likely that will be the case. On the other hand, selecting too long an offset is also undesirable, because the standby will then take longer to transition to the `partner-down` state when the primary has actually terminated unexpectedly.
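The timing argument can be checked with simple arithmetic. The sketch below computes how long the primary has already been in the `partner-down` state when the standby begins inspecting `secs` values. The function name and the 5s detection lag are illustrative assumptions. The 45s and 65s delays are the values from the configurations above.

```python
def primary_head_start(detect_lag_s, primary_delay_s, standby_delay_s):
    """Seconds the primary has spent in partner-down when the standby starts
    monitoring `secs` values. A negative result means the standby starts
    monitoring while the primary is still dropping responses.

    detect_lag_s: how much later than the standby the primary notices the
    communication failure (the standby notices it at t = 0).
    """
    primary_partner_down_at = detect_lag_s + primary_delay_s
    standby_monitoring_at = standby_delay_s
    return standby_monitoring_at - primary_partner_down_at


# Symmetric configuration: both servers wait 45 s.
print(primary_head_start(5, 45, 45))   # -5

# Asymmetric configuration: the standby waits 65 s.
print(primary_head_start(5, 45, 65))   # 15
```

Because heartbeats are sent every 10s, the detection lag in this model stays below 10s, so a 20s offset leaves the primary at least 10s to answer retrying clients before the standby starts counting them.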
+## Lost Leases on Standby Server
+### The Problem
+Consider a pair of servers in the hot-standby mode. Breaking communication between them causes the primary to transition to the partner-down state after a few unsuccessful heartbeats. The primary server stops sending the lease updates to the standby server. The standby server remains in the hot-standby state.
+When the communication is back, there is a potential race condition between the heartbeats sent by the servers to each other. If the primary sends the heartbeat first, it sees the standby in the hot-standby state and transitions to the same state. As a result, the standby server doesn't synchronize its lease database and lacks all leases allocated by the primary server in the partner-down state. If the standby server sends the heartbeat first, it sees the primary server in the partner-down state and synchronizes its database. Eventually, the standby server indicates it is ready, and the primary transitions to the hot-standby state. The behavior therefore differs depending on which server first succeeds in sending a heartbeat after the communication is restored.
+This issue was first described in https://gitlab.isc.org/isc-projects/kea/-/issues/1959.
+### The Solution
+A primary server seeing its partner in the hot-standby state should not immediately transition to the hot-standby state. If the primary server allocated any leases in the partner-down state, the standby server did not receive the corresponding lease updates, so the standby server must synchronize its database. The primary server must signal this to the standby: it should stay in the partner-down state until the standby sees the primary in this state. The standby will react by transitioning to the waiting state, then synchronizing its database and indicating it is ready.
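The proposed behavior can be summarized as two small transition functions. This is a hedged sketch of the rule described above, not code taken from Kea. The function names are hypothetical, and the branch that lets the primary return to hot-standby when it allocated no leases is one reading of the text rather than something it states explicitly.

```python
def primary_next_state(allocated_while_partner_down: bool, partner_state: str) -> str:
    """Next state for a primary that is currently in partner-down."""
    if partner_state == "ready":
        # The standby has finished synchronizing; resume normal operation.
        return "hot-standby"
    if partner_state == "hot-standby" and not allocated_while_partner_down:
        # Assumption: with no unsent leases the standby's database is current.
        return "hot-standby"
    # Otherwise stay put so the standby can observe partner-down and resync.
    return "partner-down"


def standby_next_state(partner_state: str) -> str:
    """Next state for a standby that is currently in hot-standby."""
    return "waiting" if partner_state == "partner-down" else "hot-standby"


# The primary allocated leases and sends its heartbeat first: it must not move.
print(primary_next_state(True, "hot-standby"))
# The standby then sees partner-down and starts the synchronization path.
print(standby_next_state("partner-down"))
```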
+## Partner Down: Synchronize or not Synchronize
+### The Problem
+There are two scenarios in which a server can transition to the partner-down state. The first is a communication problem described above. The second is the termination of the partner server.
+A server remains in the partner-down state while the other is synchronizing. Then, the server has two choices: transition to the normal operation state or synchronize its lease database with the partner. The choice depends on whether the partner allocated any leases while the server was in the partner-down state and the server didn't receive the lease updates. If the partner is recovering from termination, it didn't allocate any leases. If the partner is recovering from a split-brain situation when both servers were in the partner-down state, it could have allocated some leases.
+The server must be able to determine which of these two cases is present.
+### The Solution
+The heartbeat reply can be extended to convey an indicator of whether the replying server has allocated any leases for which it did not send lease updates to the partner. The partner can then decide whether it should synchronize the lease database.
+The indicator can be implemented as a counter incremented for each allocated lease for which the lease update was not sent. The querying server should remember the last counter value and compare it with each new value it receives. If the values differ and the new value is not 0, the server should synchronize the lease database with the partner. The value 0 indicates that the partner is recovering from termination (restarting).
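The comparison rule fits in a single function. The sketch below uses a hypothetical function name and assumes that a server which has not yet seen any counter value (represented as `None`) treats any non-zero value as a change.

```python
from typing import Optional


def should_synchronize(last_seen: Optional[int], received: int) -> bool:
    """Decide whether to synchronize, based on the counter in a heartbeat reply.

    last_seen: the previously remembered counter value, or None if there is none.
    received: the counter value carried in the newest heartbeat reply.
    """
    # Zero means the partner restarted and allocated nothing on its own.
    # An unchanged value means no new unsent leases since the last check.
    return received != 0 and received != last_seen


print(should_synchronize(3, 3))   # unchanged counter
print(should_synchronize(3, 7))   # partner allocated leases without updates
print(should_synchronize(3, 0))   # partner restarted
```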
+## Post Synchronization Race
+### The Problem
+A server synchronizing its lease database first disables the partner's DHCP service to ensure that new leases are not allocated while the synchronization is in progress. When the synchronization is finished, the server re-enables the partner's DHCP service and moves to the ready state. There is a short period between enabling the service and the moment the partner observes the ready state on the synchronized server. During this period, the server in the partner-down state can potentially allocate new leases, which the freshly synchronized server will miss.
+### The Solution
+To better coordinate the synchronization process, the synchronizing server should send a new command, synced-notify, instead of the dhcp-enable at the end of the synchronization. The receiving server should ensure that it doesn't allocate new leases until it reaches the state when it can successfully send lease updates to the partner. It should also ensure that the DHCP service is enabled at the appropriate time.
+For example, suppose the server receives the new command in the partner-down state. In that case, the server should transition to the waiting state or the normal operation state, depending on the failure scenario. Upon entering either of these states, the DHCP service will be enabled or disabled accordingly.
+To ensure backward compatibility between old and new Kea versions, a server sending the synced-notify command must not fail upon receiving a response from its partner that this command is unsupported. Instead, the server should send the dhcp-enable command to the partner.
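The fallback can be sketched as follows. This is an illustration only: `finish_sync` and the `send_command` callable are hypothetical, and the numeric result codes (0 for success, 2 for an unsupported command) are assumptions about the control channel reply rather than something stated in this document.

```python
from typing import Callable, Dict

RESULT_SUCCESS = 0       # assumed result code for success
RESULT_UNSUPPORTED = 2   # assumed result code for "command not supported"


def finish_sync(send_command: Callable[[str], Dict]) -> str:
    """Notify the partner that synchronization is done.

    Returns the name of the command the partner finally accepted.
    """
    used = "synced-notify"
    reply = send_command(used)
    if reply.get("result") == RESULT_UNSUPPORTED:
        # Older partner: fall back to the command it understands.
        used = "dhcp-enable"
        reply = send_command(used)
    if reply.get("result") != RESULT_SUCCESS:
        raise RuntimeError(f"{used} failed: {reply}")
    return used


def old_partner(command: str) -> Dict:
    """Stand-in for a partner that predates synced-notify."""
    if command == "synced-notify":
        return {"result": RESULT_UNSUPPORTED, "text": "unsupported command"}
    return {"result": RESULT_SUCCESS}


def new_partner(command: str) -> Dict:
    """Stand-in for a partner that understands both commands."""
    return {"result": RESULT_SUCCESS}


print(finish_sync(old_partner))
print(finish_sync(new_partner))
```

Treating only the "unsupported" result as a trigger for the fallback keeps genuine failures of synced-notify visible instead of silently masking them with dhcp-enable.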