Kea HA-hook terminating both servers on clock skew
Describe the bug When the HA-hook in hot-standby mode detects a too big clock skew it terminates both DHCP servers causing a service disruption.
To Reproduce Steps to reproduce the behavior:
- Run Kea dhcpv4, ca with basic ha-config
- During operation get a clock skew that is considered too large (in this incident 63s)
- Wait for HA to recognize skew/cause a resync of leases (there was a connection issue 30s before causing partner-down, then resync on when issue went away)
- HA "terminates" (disables) dhcp4 on both servers
Expected behavior Expected "Primary" server to continue dhcp4 in standalone
Environment:
- Kea version: 1.6 from package repository
- OS: Ubuntu 18.04 x64
- If/which hooks where loaded in: lease-cmds, ha
Additional Information Log from Primary:
2019-11-17 02:29:19.800 WARN [kea-dhcp4.ha-hooks/10133] HA_HEARTBEAT_COMMUNICATIONS_FAILED failed to send heartbeat to dhcp2-kea (http://10.0.0.6:8080/): Connection timed out
2019-11-17 02:29:29.812 WARN [kea-dhcp4.ha-hooks/10133] HA_LEASE_UPDATE_COMMUNICATIONS_FAILED [hwtype=1 00:00:00:00:00:e4], cid=[no info], tid=0xba32a81e: failed to communicate with dhcp2-kea (http://10.0.0.6:8080/): Connection timed out
2019-11-17 02:29:39.824 WARN [kea-dhcp4.ha-hooks/10133] HA_LEASE_UPDATE_COMMUNICATIONS_FAILED [hwtype=1 00:00:00:00:00:e4], cid=[no info], tid=0xba32a81e: failed to communicate with dhcp2-kea (http://10.0.0.6:8080/): Connection timed out
2019-11-17 02:29:49.836 WARN [kea-dhcp4.ha-hooks/10133] HA_HEARTBEAT_COMMUNICATIONS_FAILED failed to send heartbeat to dhcp2-kea (http://10.0.0.6:8080/): Connection timed out
2019-11-17 02:29:58.807 WARN [kea-dhcp4.ha-hooks/10133] HA_HEARTBEAT_FAILED heartbeat to dhcp2-kea (http://10.0.0.6:8080/) failed: unable to forward command to the dhcp4 service: Operation canceled. The server is likely to be offline, error code 1
2019-11-17 02:30:03.808 INFO [kea-dhcp4.ha-hooks/10133] HA_STATE_TRANSITION server transitions from PARTNER-DOWN to WAITING state, partner state is HOT-STANDBY
2019-11-17 02:30:03.808 INFO [kea-dhcp4.ha-hooks/10133] HA_LEASE_UPDATES_DISABLED lease updates will not be sent to the partner while in WAITING state
2019-11-17 02:30:03.808 INFO [kea-dhcp4.ha-hooks/10133] HA_LOCAL_DHCP_DISABLE local DHCP service is disabled while the dhcp1-kea is in the WAITING state
2019-11-17 02:30:03.808 INFO [kea-dhcp4.ha-hooks/10133] HA_STATE_TRANSITION server transitions from WAITING to SYNCING state, partner state is HOT-STANDBY
2019-11-17 02:30:03.808 INFO [kea-dhcp4.ha-hooks/10133] HA_LEASE_UPDATES_DISABLED lease updates will not be sent to the partner while in SYNCING state
2019-11-17 02:30:03.808 INFO [kea-dhcp4.ha-hooks/10133] HA_SYNC_START starting lease database synchronization with dhcp2-kea
2019-11-17 02:30:03.881 INFO [kea-dhcp4.ha-hooks/10133] HA_LEASES_SYNC_LEASE_PAGE_RECEIVED received 15 leases from dhcp2-kea
2019-11-17 02:30:03.909 INFO [kea-dhcp4.ha-hooks/10133] HA_SYNC_SUCCESSFUL lease database synchronization with dhcp2-kea completed successfully in 100.921 ms
2019-11-17 02:30:03.909 INFO [kea-dhcp4.ha-hooks/10133] HA_STATE_TRANSITION server transitions from SYNCING to READY state, partner state is HOT-STANDBY
2019-11-17 02:30:03.909 INFO [kea-dhcp4.ha-hooks/10133] HA_LEASE_UPDATES_DISABLED lease updates will not be sent to the partner while in READY state
2019-11-17 02:30:03.909 INFO [kea-dhcp4.ha-hooks/10133] HA_STATE_TRANSITION server transitions from READY to HOT-STANDBY state, partner state is HOT-STANDBY
2019-11-17 02:30:03.909 INFO [kea-dhcp4.ha-hooks/10133] HA_LEASE_UPDATES_ENABLED lease updates will be sent to the partner while in HOT-STANDBY state
2019-11-17 02:30:03.909 INFO [kea-dhcp4.ha-hooks/10133] HA_LOCAL_DHCP_ENABLE local DHCP service is enabled while the dhcp1-kea is in the HOT-STANDBY state
2019-11-17 02:30:08.910 INFO [kea-dhcp4.ha-hooks/10133] HA_STATE_TRANSITION server transitions from HOT-STANDBY to TERMINATED state, partner state is TERMINATED
2019-11-17 02:30:08.910 INFO [kea-dhcp4.ha-hooks/10133] HA_LEASE_UPDATES_DISABLED lease updates will not be sent to the partner while in TERMINATED state
2019-11-17 02:30:08.910 ERROR [kea-dhcp4.ha-hooks/10133] HA_TERMINATED HA service terminated due to an unrecoverable condition. Check previous error message(s), address the problem and restart!
2019-11-17 02:33:06.953 INFO [kea-dhcp4.commands/10133] COMMAND_RECEIVED Received command 'ha-heartbeat' ## this is from our monitoring, not the standby
Logs from standby:
2019-11-17 02:28:54.120 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'ha-heartbeat'
2019-11-17 02:28:58.796 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'lease4-update'
2019-11-17 02:29:59.225 ERROR [kea-dhcp4.commands/14468] COMMAND_SOCKET_WRITE_FAIL Error while writing to command socket 28 : Broken pipe
2019-11-17 02:29:59.228 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'ha-heartbeat'
2019-11-17 02:29:59.228 ERROR [kea-dhcp4.ha-hooks/14468] HA_HIGH_CLOCK_SKEW_CAUSES_TERMINATION my time: 2019-11-17 01:29:59, partner's time: 2019-11-17 01:28:56, partner's clock is 63s behind, causing HA service to terminate
2019-11-17 02:29:59.228 INFO [kea-dhcp4.ha-hooks/14468] HA_STATE_TRANSITION server transitions from HOT-STANDBY to TERMINATED state, partner state is HOT-STANDBY
2019-11-17 02:29:59.229 INFO [kea-dhcp4.ha-hooks/14468] HA_LEASE_UPDATES_DISABLED lease updates will not be sent to the partner while in TERMINATED state
2019-11-17 02:29:59.229 ERROR [kea-dhcp4.ha-hooks/14468] HA_TERMINATED HA service terminated due to an unrecoverable condition. Check previous error message(s), address the problem and restart!
2019-11-17 02:30:03.821 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'ha-heartbeat'
2019-11-17 02:30:03.844 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'dhcp-disable'
2019-11-17 02:30:03.868 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'lease4-get-page'
2019-11-17 02:30:03.898 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'dhcp-enable'
2019-11-17 02:30:08.922 INFO [kea-dhcp4.commands/14468] COMMAND_RECEIVED Received command 'ha-heartbeat' ## probably monitoring again, after this guaranteed just monitoring
What is weird to me is that only the standby saw the clock skew... could the heartbeat response been delayed that long causing a false-positive? It might not be a bad idea to make a second check for clock skew if that is really what happened.