When communication between HA peers is interrupted the servers should continue to respond to DHCP
This is a proposal to improve our failover transition. The two servers exchange heartbeats to check whether the partner is online. If the communication between them breaks, they keep retrying the heartbeats for a configured amount of time, after which they optionally start analyzing the DHCP traffic. They determine whether the partner server is responding by checking the value of the secs field/elapsed time option. While communication is broken, the servers don't respond to the traffic because lease updates don't go through. As a result, a number of clients may fail to get a lease, and by the time the traffic analysis starts they have already set the secs/elapsed time to high values. That may cause a premature transition of both servers to the partner-down state.
Two solutions to consider:
- Respond to DHCP traffic even if the communication is broken and lease updates don't go through the control channel. That way, the client gets the lease even in the event of a broken connection between the servers. The drawback is that unsent lease updates pile up and need to be sent when the connection is re-established.
- If the secs field/elapsed time value is greater than the time elapsed since the DHCP traffic analysis started, the client started applying for the lease while the communication was already broken and before the analysis started. Such clients should not be counted as unacked. Alternatively, we could compute (secs - max-response-delay) and use the result as the input to traffic analysis.
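
To make the second option concrete, here is a minimal sketch of the two variants. It is an illustration only, not the existing implementation: the function names are hypothetical, all values are assumed to be already normalized to seconds, and the threshold check against `max_ack_delay` is an assumption about how a client ends up counted as unacked.

```python
def counts_as_unacked(secs: float, analysis_elapsed: float, max_ack_delay: float) -> bool:
    """Variant A: skip clients that started before the traffic analysis began.

    secs             -- value of the secs field / elapsed time option (seconds)
    analysis_elapsed -- time since this server started analyzing traffic (seconds)
    max_ack_delay    -- assumed threshold above which a client is unacked (seconds)
    """
    if secs > analysis_elapsed:
        # The client has been retrying for longer than we have been watching,
        # so its high value says nothing about the partner's current state.
        return False
    return secs > max_ack_delay


def adjusted_secs(secs: float, max_response_delay: float) -> float:
    """Variant B: discount the time spent before the analysis could start.

    Returns (secs - max_response_delay), clamped at zero, to be used as the
    input to traffic analysis instead of the raw value.
    """
    return max(0.0, secs - max_response_delay)
```

For example, with the analysis running for 10 seconds and a 5 second threshold, a client reporting 30 seconds is ignored under variant A, while a client reporting 8 seconds is still counted. Variant A drops early clients entirely; variant B keeps them but with a reduced value, so a client that keeps failing is eventually counted.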