Split Brain Problems in HA
Problem
During normal operation in the hot-standby mode, the two DHCP servers should remain in the hot-standby state. In this state, the primary server responds to all DHCP traffic, while the standby server remains passive and only records lease updates sent by the primary. Each server sends periodic heartbeats to its partner to check the partner's availability. As long as both servers respond to the heartbeats, the setup is considered healthy and no failover actions are triggered.
Communication problems between the two servers may arise for various reasons: one of the servers terminated unexpectedly, the network connection between the servers is broken (network partitioning), or firewall settings prevent communication in one or both directions. In all those cases, the HA setup is expected to guarantee that the DHCP service remains available to the connecting DHCP clients and that any outage is as short as possible.
A DHCP server experiencing communication problems with its partner should first make sure that the interruption is not transient and that the partner is really unable to provide DHCP service. Only then should the server transition to the partner-down state, in which it handles all DHCP traffic and does not attempt to send lease updates to the partner, which presumably can't receive them anyway.
Kea HA can be configured to use a two-step failover procedure to ensure that the partner is really unable to serve DHCP clients. In the first step, the server keeps sending heartbeats for the configured amount of time (max-response-delay) before it proceeds to step two. If the communication issue is transient, one of the heartbeats sent within this period may show that the partner is still alive, and the servers continue normal operation. However, if communication is not re-established within this period, the standby server proceeds to the second step. At this point, the standby server starts monitoring the Discover or Rebind messages sent to the primary. The server inspects the secs field value in these messages. A secs value of 0 means that the client is sending its first message to the server. A secs value greater than 0 indicates that the client is retrying, presumably because the server that should handle the request has been unavailable. If the secs field value exceeds the configured threshold (max-ack-delay), the standby server considers that DHCP client unacked. If the number of unacked clients exceeds the configured value (max-unacked-clients), the standby server considers the primary to be down and transitions to the partner-down state.
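The second failover step described above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not Kea's implementation: the class name, the method names and the use of a plain client identifier string are assumptions made for the example. It also assumes that the secs value is expressed in seconds while max-ack-delay is expressed in milliseconds.

```python
from dataclasses import dataclass, field


@dataclass
class UnackedClientTracker:
    """Hypothetical helper mirroring step two of the failover procedure."""

    max_ack_delay_ms: int = 10000      # threshold for the secs field
    max_unacked_clients: int = 10      # threshold for the client count
    unacked: set = field(default_factory=set)

    def observe(self, client_id: str, secs: int) -> None:
        # secs is in seconds, max-ack-delay in milliseconds (assumption).
        # A client is unacked only when secs EXCEEDS the threshold.
        if secs * 1000 > self.max_ack_delay_ms:
            self.unacked.add(client_id)

    def partner_considered_down(self) -> bool:
        # The count must EXCEED max-unacked-clients, not merely reach it.
        return len(self.unacked) > self.max_unacked_clients


tracker = UnackedClientTracker()
for n in range(11):
    tracker.observe(f"client-{n}", secs=12)   # each client retrying for 12s
print(tracker.partner_considered_down())
```

Each client is counted once, no matter how many times it retries, because the sketch stores client identifiers in a set.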
Note that the primary server running in the hot-standby configuration does not perform step 2 when it has communication problems with the standby, because the standby is not normally expected to respond to DHCP traffic. The primary transitions to the partner-down state as soon as it finds that the heartbeats have been failing for longer than max-response-delay.
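The difference between the two roles can be summarized as a single decision function. The sketch below is a simplified model under stated assumptions: the function name and its arguments are hypothetical, and it assumes max-unacked-clients is greater than 0 so that step two is actually in effect on the standby.

```python
def should_enter_partner_down(role: str,
                              interrupted_ms: int,
                              max_response_delay_ms: int,
                              unacked_clients: int,
                              max_unacked_clients: int) -> bool:
    """Simplified hot-standby failover decision (illustrative only)."""
    # Step one: communication must have been failing for LONGER than
    # max-response-delay before any failover is considered.
    if interrupted_ms <= max_response_delay_ms:
        return False
    # The primary skips step two: the standby is not expected to answer
    # DHCP traffic, so unacked clients tell the primary nothing.
    if role == "primary":
        return True
    # The standby additionally requires more unacked clients than allowed.
    # Assumes max_unacked_clients > 0.
    return unacked_clients > max_unacked_clients


print(should_enter_partner_down("primary", 46000, 45000, 0, 10))
print(should_enter_partner_down("standby", 46000, 45000, 0, 10))
```

With the same interruption time, the primary fails over while the standby keeps waiting until enough clients appear unacked.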
We have found that in certain cases the two-step failover mechanism may not prevent both servers from transitioning to the partner-down state, which is highly undesirable. If both servers are in this state, each considers the other unavailable and both respond to DHCP queries. This may result in lease database inconsistencies. In particular, two different clients may be offered the same IP address. We call this a split brain situation.
Mitigation
There are some ideas for improving the failover logic in HA to avoid the split brain problem. However, any such change has to go through the standard development process, and it will take considerable time before it is widely adopted. To help avoid split brain issues in current deployments, the diagrams below explain the source of the problem and how it can be mitigated using asymmetric configurations.
The following diagram presents two cases. The first one is the split brain case occurring when using a typical symmetric configuration.
BOTH SERVERS
- heartbeat-delay: 10000
- max-ack-delay: 10000
- max-response-delay: 45000
- max-unacked-clients: 10
The second one demonstrates how the split brain situation is avoided with asymmetric configuration.
PRIMARY
- heartbeat-delay: 10000
- max-ack-delay: 10000
- max-response-delay: 45000
- max-unacked-clients: 10
STANDBY
- heartbeat-delay: 10000
- max-ack-delay: 10000
- max-response-delay: 65000
- max-unacked-clients: 10
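As an illustration, the asymmetric values listed above could be generated from one shared set of settings, so that the two servers differ only in max-response-delay. The sketch below is hypothetical: the helper name is made up, and the output is only a fragment of the HA parameters. A real configuration also needs the remaining settings (such as the server names and peer definitions), which are omitted here.

```python
import json

# Settings shared by both servers (values taken from the lists above).
COMMON = {
    "mode": "hot-standby",
    "heartbeat-delay": 10000,
    "max-ack-delay": 10000,
    "max-unacked-clients": 10,
}


def ha_parameters(max_response_delay_ms: int) -> dict:
    """Return a partial set of HA parameters for one server."""
    params = dict(COMMON)                       # copy, do not mutate COMMON
    params["max-response-delay"] = max_response_delay_ms
    return params


primary = ha_parameters(45000)
standby = ha_parameters(65000)   # 20s longer than on the primary

print(json.dumps({"PRIMARY": primary, "STANDBY": standby}, indent=2))
```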
Both servers send heartbeats to each other every 10 seconds. The heartbeats are not synchronized: each server decides independently when to send its heartbeat. The communication failure occurs as a result of network partitioning. The standby notices the communication problem before the primary because its next heartbeat happens to be due earlier. From that moment, the standby counts 45s (max-response-delay). A bit later, the primary also experiences the communication problem and starts counting its own 45s. While the two servers are unable to communicate, DHCP clients keep sending DHCPDISCOVER messages that would normally be answered by the primary. The primary processes the DHCP traffic but fails to send lease updates to the standby. As a result, the primary drops its DHCP responses and the clients don't get leases. This is expected to last no longer than 45s, until the primary transitions to the partner-down state.
However, the clients keep retrying and increase the secs field values accordingly. When 45s elapses on the standby, it starts monitoring the secs field values. Soon after that, the primary transitions to the partner-down state because it has been unable to communicate with the standby for longer than 45s. However, there is a narrow window between the moment the standby starts monitoring the secs field values in the DHCP requests and the moment the primary transitions to the partner-down state. In this window the clients still don't get responses from the primary, and the standby starts considering these clients unacked. In fact, even after the primary transitions to the partner-down state and starts responding to DHCP traffic, the standby still sees requests arriving with high secs values. The standby has no means to tell whether they have been answered, because it can only see the clients' requests (broadcast), not the responses (unicast). Therefore, even when the primary is already in the partner-down state, the standby may assume that certain clients are unacked. As a result, the standby also transitions to the partner-down state, and both servers now actively respond to DHCP traffic.
Let's now take a look at the second picture.
Both servers are still sending heartbeats every 10s. Again, the standby experiences the communication failure before the primary. From that moment it starts counting 65s and keeps trying to send heartbeats to the primary. Soon, the primary observes the communication failure with the standby. As in the previous case, the primary starts counting 45s before transitioning to the partner-down state. The DHCP clients are again sending DHCP requests to the primary, which fails to respond to them because it can't send lease updates to the standby. However, this time the max-response-delay on the standby is offset by 20s compared to the primary. When the 45s elapses on the primary, it transitions to the partner-down state, and the 20s offset gives the primary extra time to deal with DHCP retries from the clients which tried to communicate with it during those 45s. The longer the offset, the more likely it is that the primary will deal with all retrying clients. After 65s, if the standby does not see unacked clients (strictly speaking, more unacked clients than the max-unacked-clients value), it simply remains in the hot-standby state. This is the desired situation. The primary is now responding to all DHCP traffic and the standby remains passive, occasionally sending a heartbeat to check the primary's availability.
The offset of 20s has been selected such that most of the retrying clients should come back within this time to get a lease from the primary. Obviously, the longer the offset, the more likely that is. On the other hand, an overly long offset is also undesirable, because it delays the standby's transition to the partner-down state when the primary has actually terminated unexpectedly.
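The timing of both cases can be checked with simple arithmetic. The sketch below is a rough model, not a simulation of Kea: the function name is hypothetical, and the 4s gap between the two servers noticing the failure is an arbitrary assumption (the source only states that the standby notices first). The head start is the time between the primary entering partner-down and the standby starting to monitor the secs field. A negative value means the standby starts monitoring while the primary is still dropping responses.

```python
def failover_timeline(primary_detect_ms: int,
                      standby_detect_ms: int,
                      primary_max_response_delay_ms: int,
                      standby_max_response_delay_ms: int) -> dict:
    """Compute key failover moments relative to the network partition."""
    # The primary enters partner-down right after its own delay elapses.
    primary_partner_down_ms = primary_detect_ms + primary_max_response_delay_ms
    # The standby only starts watching the secs field after its delay.
    standby_monitoring_ms = standby_detect_ms + standby_max_response_delay_ms
    return {
        "primary_partner_down_ms": primary_partner_down_ms,
        "standby_monitoring_ms": standby_monitoring_ms,
        "head_start_ms": standby_monitoring_ms - primary_partner_down_ms,
    }


# Assumption: the standby notices the failure at 0s, the primary at 4s.
symmetric = failover_timeline(4000, 0, 45000, 45000)
asymmetric = failover_timeline(4000, 0, 45000, 65000)

print("symmetric: ", symmetric)
print("asymmetric:", asymmetric)
```

Under these assumptions the symmetric configuration yields a head start of -4000 ms, while the asymmetric one gives the primary 16000 ms to serve retrying clients before the standby starts counting unacked clients.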
Sending Lease Updates While Communication Interrupted
...