Logging and access to current count of 'unacked-clients' during HA heartbeat communication failures
Suggested by questions raised on Support ticket #15920 ...
- The first question is 'how do I know that my HA peer is ready and waiting to take over if my server fails?'
Particularly in the case of hot-standby, there's not much visibility into the readiness of the standby server to take over. For example, is it 'seeing' the client traffic that the other server is handling? The conclusion we came to was that it would be good, operationally, either to increase the logging level of the server (dhcp4 and/or dhcp6) so that it reports the client packets that are being seen but dropped, or to monitor the pkt4-received and pkt6-received statistics.
But it might be more helpful if the heartbeat logging (to confirm that the peers are in communication with each other) also indicated whether or not this server is 'seeing' the traffic that it is expecting its partner to handle, unless/until a peer-down state is reached.
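As a stop-gap, the statistics-monitoring option mentioned above could be sketched roughly as follows. This is only an illustration, not a tested monitoring tool: it assumes the standby is queried with the `statistic-get` command through the Control Agent, that the reply is wrapped in a list, and that each statistic is returned as a list of `[value, timestamp]` samples with the newest first. The helper names (`build_statistic_get`, `latest_sample`, `standby_sees_traffic`) are made up for this example.

```python
def build_statistic_get(service, name):
    """Build the JSON body of a statistic-get request for one statistic."""
    return {
        "command": "statistic-get",
        "service": [service],          # e.g. "dhcp4" or "dhcp6"
        "arguments": {"name": name},   # e.g. "pkt4-received"
    }


def latest_sample(response, name):
    """Return the newest value of a statistic from a reply.

    Assumption: the Control Agent wraps the reply in a list, and the
    statistic is a list of [value, timestamp] pairs, newest first.
    """
    reply = response[0] if isinstance(response, list) else response
    if reply.get("result") != 0:
        raise RuntimeError(f"statistic-get failed: {reply.get('text')}")
    samples = reply["arguments"][name]
    return samples[0][0]


def standby_sees_traffic(previous, current):
    """True if the received-packet counter grew between two polls."""
    return current > previous
```

Polling `pkt4-received` (or `pkt6-received`) on the standby at intervals and comparing successive values with `standby_sees_traffic` would give a rough indication that client packets are reaching it, although it says nothing about why they were dropped.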
- The second question is 'why hasn't my HA server taken over yet from its peer?'
In most HA configurations, there are two triggers for a server to take over from its peer. The first is that the HA heartbeat has failed (sending and/or receiving) and HA starts logging communication interrupted. If the server is also configured with a non-zero max-unacked-clients, it does not take over right away. Instead, at this point it starts watching the client traffic directed to the other server to determine whether that server is responding, which provides the second trigger.
There is no direct visibility into the status of unacked-clients while it is doing this (although clearly the server that is monitoring and counting them is keeping records). Please could this be logged (along with the HA heartbeat status), so that sysadmins can know the full status of HA in a communications-interrupted situation, before the peer-down state is reached?
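To make the request concrete, here is a minimal sketch of the two-trigger decision described above and of the kind of status line being asked for. It is not the server's implementation: the `PeerStatus` type, the state labels, the log format, and the assumption that takeover happens once the unacked count strictly exceeds max-unacked-clients are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PeerStatus:
    heartbeat_ok: bool          # did the last HA heartbeat succeed?
    unacked_clients: int        # clients the peer appears not to be answering
    max_unacked_clients: int    # configured threshold (0 disables the check)


def takeover_state(status):
    """Return an illustrative state label for the given peer status."""
    if status.heartbeat_ok:
        return "in-communication"
    # Heartbeat failed: with the check disabled, take over immediately.
    if status.max_unacked_clients == 0:
        return "peer-down"
    # Assumption: takeover once the count strictly exceeds the threshold.
    if status.unacked_clients > status.max_unacked_clients:
        return "peer-down"
    return "communications-interrupted"


def format_ha_status(status):
    """Build a hypothetical log line combining heartbeat and unacked count."""
    heartbeat = "ok" if status.heartbeat_ok else "failed"
    return (
        f"HA heartbeat {heartbeat}; state={takeover_state(status)}; "
        f"unacked-clients={status.unacked_clients}/{status.max_unacked_clients}"
    )
```

For example, `format_ha_status(PeerStatus(False, 3, 10))` produces `HA heartbeat failed; state=communications-interrupted; unacked-clients=3/10`, which is the sort of line that would answer 'why hasn't my HA server taken over yet?' at a glance.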