FQDN next-server setting can cause server stalls and failover interruptions
name: Bug report
about: FQDN next-server settings can cause server stalls and failover interruptions
Describe the bug
If next-server on a host is set to an FQDN, that FQDN is resolved in-line on the main thread via a call to gethostbyname. If DNS is unresponsive for any reason, the main thread stalls for 20-40 seconds while the host resolution cycles through its retries and timeouts. During that time the DHCP server is unresponsive to other clients and to failover peers. If the stall exceeds the failover settings, the failover server goes into communications-interrupted.
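The effect can be illustrated with a minimal Python sketch. This is not dhcpd code: `slow_resolve`, `handle_request`, `serve_serially`, and the client names are hypothetical, and a short sleep stands in for the 20-40 second resolver timeout. It only shows why one blocking lookup in a single-threaded loop delays every request queued behind it.

```python
import time

STALL_SECONDS = 0.2  # stand-in for the 20-40 s resolver timeout (scaled down)

def slow_resolve(name):
    """Hypothetical blocking lookup that simulates an unresponsive DNS server."""
    time.sleep(STALL_SECONDS)
    raise OSError(f"{name}: temporary name server failure")

def handle_request(client, next_server=None):
    """Serve one client in-line; resolve next-server first if one is given."""
    if next_server is not None:
        try:
            slow_resolve(next_server)
        except OSError:
            pass  # the reply still goes out, but only after the timeout
    return f"DHCPACK to {client}"

def serve_serially(requests):
    """Single-threaded loop: returns (client, seconds-until-answered) pairs."""
    start = time.monotonic()
    answered = []
    for client, next_server in requests:
        handle_request(client, next_server)
        answered.append((client, time.monotonic() - start))
    return answered

requests = [
    ("host-with-fqdn", "dns1.example.com"),  # triggers the blocking lookup
    ("unrelated-client", None),              # needs no DNS at all
]
for client, waited in serve_serially(requests):
    print(f"{client}: answered after {waited:.2f}s")
```

In this sketch the unrelated client is answered only after the simulated timeout expires, even though it needs no DNS lookup itself.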
Here are some log messages from when this behavior occurs. Note the 40-second delay between the first two messages.
Apr 8 09:42:37 dhcp03 dhcpd: DHCPACK on 10.176.235.107 to 18:a9:05:f8:64:21 (xxx01x1x9x3) via bond1
Apr 8 09:43:17 dhcp03 dhcpd: dns01.example.com: temporary name server failure
Apr 8 09:43:17 dhcp03 dhcpd: DHCPREQUEST for 10.176.253.168 from cc:dd:ff:99:aa:11 (ET456789) via 10.176.253.190
Apr 8 09:43:17 dhcp03 dhcpd: DHCPACK on 10.176.253.168 to cc:dd:ff:99:aa:11 (ET456789) via 10.176.253.190
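The gap in the log excerpt above can be checked mechanically. The following is a small sketch, assuming standard syslog timestamps with no year; the helper names `syslog_time` and `gaps` are made up for illustration.

```python
from datetime import datetime

# First two lines of the log excerpt, copied verbatim.
LOG_LINES = [
    "Apr 8 09:42:37 dhcp03 dhcpd: DHCPACK on 10.176.235.107 to 18:a9:05:f8:64:21 (xxx01x1x9x3) via bond1",
    "Apr 8 09:43:17 dhcp03 dhcpd: dns01.example.com: temporary name server failure",
]

def syslog_time(line):
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp (the year is not logged)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(stamp, "%b %d %H:%M:%S")

def gaps(lines):
    """Seconds elapsed between each pair of consecutive log lines."""
    times = [syslog_time(line) for line in lines]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

print(gaps(LOG_LINES))  # [40.0]
```

Running the same helper over a longer dhcpd log and looking for unusually large gaps may be a quick way to spot other stalls of this kind.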
The sanitized configuration for the enclosing pool is:
subnet 10.176.253.160 netmask 255.255.255.224 {
    pool {
        range 10.176.253.161 10.176.253.182;
        failover peer "DHCP03-DHCP04";
        deny dynamic bootp clients;
    }
    # policies
    authoritative;
    default-lease-time 43200;
    filename "/smsboot/x86/w1234.com";
    next-server dns1.example.com;
    ping-check True;
    # options
    option routers 10.176.253.190;
    option domain-name "example.com";
}
To Reproduce
Steps to reproduce the behavior:
- Run dhcpd with next-server configured on a particular host, where the value of that option is an unresponsive DNS server. This is not an unresolvable name; rather, once the name is resolved, the IP will not respond.
- The host client requests that lease.
- The server then attempts to resolve the FQDN.
- The logs stop, even if other clients attempt discovers or renews.
- Eventually the server responds with an offer, but only after 30 seconds or so.
Expected behavior
The server should not come to a complete halt due to the configuration of one host (or one pool/subnet/etc.). Only the response to that host should be affected by any delays or issues with DNS resolution. The server should continue to service other clients and maintain the failover connection while the DNS resolution is in progress.
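One way to picture the requested behavior is to hand the lookup to a worker thread so that the main loop keeps answering other clients. The Python sketch below only illustrates that idea and is not a proposal for how dhcpd should implement it; `slow_resolve`, `serve`, and the client names are hypothetical, and the sleep stands in for a DNS timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STALL_SECONDS = 0.2  # stand-in for a DNS timeout (scaled down)

def slow_resolve(name):
    """Hypothetical blocking lookup against an unresponsive DNS server."""
    time.sleep(STALL_SECONDS)
    return None  # resolution failed

def serve(requests):
    """Answer DNS-free clients at once; defer clients whose next-server needs a lookup."""
    start = time.monotonic()
    answered = {}
    pending = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        for client, next_server in requests:
            if next_server is None:
                answered[client] = time.monotonic() - start
            else:
                # The lookup runs off the main thread; only this client waits.
                pending[client] = pool.submit(slow_resolve, next_server)
        for client, future in pending.items():
            future.result()  # waited on only after the other clients are served
            answered[client] = time.monotonic() - start
    return answered

times = serve([
    ("host-with-fqdn", "dns1.example.com"),
    ("unrelated-client", None),
])
for client, waited in times.items():
    print(f"{client}: answered after {waited:.2f}s")
```

Here only the client whose next-server needs resolving waits for the simulated timeout. A real server would poll for completed lookups inside its event loop rather than waiting at the end of a batch as this sketch does.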
Environment:
- ISC DHCP version: 4.4.2-P1
- OS: Linux custom distribution, kernel 5.1.19
- enable-failover, enable-dhcpv6, enable-paranoia, enable-binary-leases
Additional Information
Some initial questions
- Are you sure your feature is not already implemented in the latest ISC DHCP version? Yes.
- Are you sure your requested feature is not already implemented in Kea? Perhaps it's a good time to consider migration? Yes. Kea allows next-server to be configured with an FQDN, but does not resolve it correctly.
- Are you sure what you would like to do is not possible using some other mechanisms? All mitigations are global system changes to resolv.conf. Shortening timeouts and retries on DNS would reduce the stall time, but might also result in a resolution failure if the DNS service is just slow, rather than unavailable.
- Have you discussed your idea on dhcp-users and/or dhcp-workers mailing lists?
Is your feature request related to a problem? Please describe.
I want to use FQDNs in next-server clauses. When I do this and DNS becomes unavailable, the DHCP server slows down and failover breaks.
Describe the solution you'd like
If I use FQDNs in a next-server clause and DNS becomes unavailable, I want the server to continue serving clients that have no FQDN next-server clause and to keep the failover connection uninterrupted.
Describe alternatives you've considered
- Using IPs: we are trying to spread the load across TFTP servers via DNS round-robin lookups. If we hard-code the IPs, we have to spread the load manually, which is very difficult to predict.
- resolv.conf changes: we've lowered the timeout and retries in resolv.conf. This has reduced the stall times accordingly. However, it is a system-wide change, and we are concerned about DNS timeouts should DNS merely slow down for some reason.
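The trade-off behind the resolv.conf changes can be sketched with a rough estimate. The model below assumes the worst case is simply timeout × attempts × number of nameservers. That is a simplification: the real resolver's behavior is more involved, and search-domain expansion can add further queries. The function name and the tuned values are illustrative, not our actual settings.

```python
def estimated_stall_seconds(timeout, attempts, nameservers):
    """Rough upper bound: every attempt against every nameserver times out."""
    return timeout * attempts * nameservers

# resolv.conf(5) documents defaults of timeout:5 and attempts:2;
# two nameservers is an assumption made for this example.
print(estimated_stall_seconds(timeout=5, attempts=2, nameservers=2))  # 20

# Illustrative tuned values: shorter stalls, but less tolerance for slow DNS.
print(estimated_stall_seconds(timeout=1, attempts=2, nameservers=2))  # 4
```

Lowering the timeout shrinks the stall in proportion, but it also shortens the window in which a slow yet working DNS server can still answer, which is the concern described above.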
Additional context
None.
Funding its development
No
Participating in development
Yes
Contacting you
schtever@gmail.com steve.j.thompson@bt.com