ISC_R_HOSTDOWN was not being handled by resolver.c
Summary
Treat ISC_R_HOSTDOWN like ISC_R_HOSTUNREACH in resolver.c (xfrin.c already does this). Similarly, treat ISC_R_NETDOWN like ISC_R_NETUNREACH.
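To illustrate the proposed grouping, here is a minimal sketch at the errno level. It is not BIND code: `send_error_is_retryable` is a hypothetical helper, and the real change would operate on the ISC_R_* result codes inside resolver.c. The idea is only that all four "host/network down or unreachable" errors are classified the same way.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a BIND function): returns true when a send
 * error only says "this particular server cannot be reached right now",
 * so the caller should move on to the next server instead of failing.
 */
static bool
send_error_is_retryable(int err) {
	switch (err) {
	case EHOSTUNREACH: /* already handled this way today */
	case ENETUNREACH:  /* already handled this way today */
	case EHOSTDOWN:    /* proposed: same treatment as EHOSTUNREACH */
	case ENETDOWN:     /* proposed: same treatment as ENETUNREACH */
		return true;
	default:
		return false;
	}
}
```

With this grouping, a caller that sees EHOSTDOWN takes the same path as for EHOSTUNREACH, while unrelated errors (for example EACCES) are still treated as hard failures.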
BIND version affected
This has been encountered with both BIND 9.18.25-S1 and 9.18.27-S1. It did not occur with BIND 9.16-S. The likely significant difference is that 9.16(-S) was still using the legacy BIND sockets code, whereas 9.18(-S) uses libuv for dispatch (backend resolver), so we might now see more or different socket errors coming back up the networking stack that we didn't encounter before.
BIND 9.18.25-S1 (Supported Preview Version) <id:>
running on FreeBSD amd64 13.3-RELEASE-p1 FreeBSD 13.3-RELEASE-p1
<--snipped confidential customer build environment information-->
built by make with '--disable-linux-caps' '--localstatedir=/var' '--sysconfdir=/usr/local/etc/namedb' '--with-dlopen=yes' '--with-libxml2' '--with-openssl=/usr' '--enable-dnsrps' '--with-readline=libedit' '--enable-dnstap' '--disable-fixed-rrset' '--disable-geoip' '--without-maxminddb' '--without-gssapi' '--with-libidn2=/usr/local' '--with-json-c' '--disable-largefile' '--with-lmdb=/usr/local' '--disable-querytrace' '--enable-tcp-fastopen' '--prefix=/usr/local' '--mandir=/usr/local/man' '--disable-silent-rules' '--infodir=/usr/local/share/info/' '--build=amd64-portbld-freebsd13.2' 'build_alias=amd64-portbld-freebsd13.2' 'CC=cc' 'CFLAGS=-O2 -pipe -g -DLIBICONV_PLUG -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing ' 'LDFLAGS= -L/usr/local/lib -ljson-c -fstack-protector-strong ' 'LIBS=-L/usr/local/lib' 'CPPFLAGS=-DLIBICONV_PLUG -isystem /usr/local/include' 'CPP=cpp' 'PKG_CONFIG=pkgconf' 'PKG_CONFIG_LIBDIR=/wrkdirs/usr/ports/dns/edgebsd-bind918/work/.pkgconfig:/usr/local/libdata/pkgconfig:/usr/local/share/pkgconfig:/usr/libdata/pkgconfig' 'PYTHON=/usr/local/bin/python3.9' 'READLINE_CFLAGS=-L/usr/local/lib'
compiled by CLANG FreeBSD Clang 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386ae247c)
compiled with OpenSSL version: OpenSSL 1.1.1t-freebsd 7 Feb 2023
linked to OpenSSL version: OpenSSL 1.1.1w-freebsd 11 Sep 2023
compiled with libuv version: 1.47.0
linked to libuv version: 1.48.0
compiled with libnghttp2 version: 1.58.0
linked to libnghttp2 version: 1.61.0
compiled with libxml2 version: 2.10.4
linked to libxml2 version: 21107
compiled with json-c version: 0.17
linked to json-c version: 0.17
compiled with zlib version: 1.2.13
linked to zlib version: 1.3.1
compiled with protobuf-c version: 1.4.1
linked to protobuf-c version: 1.4.1
threads support is enabled
DNSSEC algorithms: RSASHA1 NSEC3RSASHA1 RSASHA256 RSASHA512 ECDSAP256SHA256 ECDSAP384SHA384 ED25519 ED448
DS algorithms: SHA-1 SHA-256 SHA-384
HMAC algorithms: HMAC-MD5 HMAC-SHA1 HMAC-SHA224 HMAC-SHA256 HMAC-SHA384 HMAC-SHA512
TKEY mode 2 support (Diffie-Hellman): yes
TKEY mode 3 support (GSS-API): no
default paths:
named configuration: /usr/local/etc/namedb/named.conf
rndc configuration: /usr/local/etc/namedb/rndc.conf
DNSSEC root key: /usr/local/etc/namedb/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/pid
named lock file: /var/run/named/named.lock
Steps to reproduce
- Set up a resolver with global forwarding to more than one nameserver.
- One of the forwarders needs to be a) on the same subnet as the resolver and b) down
- Start the resolver
- Query the resolver with a series of unique names. Send enough queries that require recursion for the resolver to have set initial SRTTs for all of its forwarders and to have gotten around to trying them all, including the one that is currently unavailable.
You should find that eventually the resolver is responding SERVFAIL to all of the queries, along with the following symptoms:
- There is no evidence in pcaps of it sending out any queries to any of the forwarders
- The ARP table shows this destination IP address as incomplete, e.g.: downserver.example.com, (192.168.2.25) at (incomplete) on eth0 expired [ethernet]
- This 'absent' forwarder is the one with the shortest SRTT in ADB in a cache dump
- Logging at -d99 shows that senddone fails with EHOSTDOWN. We then immediately cancel the send, do no SRTT processing, and don't retry the fetch with any of the other (good) forwarders in the list:
@40000000664bb57815114d8c dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: connected: success
@40000000664bb578151168e4 dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: attaching handle 0x8e9f10780 to 0x8e9ec8010
@40000000664bb578151170b4 dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: reading
@40000000664bb57815118824 dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: connect callback: success
@40000000664bb5781512dbfc resolver: debug 11: sending packet to 192.168.2.25#53
@40000000664bb5781513224c dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: sending
@40000000664bb5781513aeec dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: sent: host down
@40000000664bb5781513ca44 dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: canceling response: operation canceled, connected/reading (none/not reading), requests 1
@40000000664bb5781513e59c dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: canceling read on 0x8e9f10780
@40000000664bb578151400f4 dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: read callback: operation canceled
@40000000664bb57815145acc dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: canceling response: host down, canceled/reading (none/not reading), requests 1
@40000000664bb5781514b88c rate-limit: debug 99: client @0x86ecdf960 ::1#22250 (16-fail.isc.org): rrl=0x0, HAVECOOKIE=0, result=ISC_R_HOSTDOWN, fname=0x8c0499580(0), is_zone=0, RECURSIONOK=1, query.rpz_st=0x0(0), RRL_CHECKED=0
@40000000664bb578151583ac dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: read callback:operation canceled, requests 1
@40000000664bb5781515a2ec dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: canceling response: operation canceled, canceled/not reading (none/not reading), requests 1
@40000000664bb5781515c22c query-errors: debug 4: fetch completed at resolver.c:1966 for 16-fail.isc.org/A in 0.000465: host down/success [domain:.,referral:0,restart:1,qrysent:1,timeout:0,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
@40000000664bb5781515e554 dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: destroying
@40000000664bb5781515e93c dispatch: debug 90: dispatch 0x98cfa7b80: UDP response 0x8e9ec8000: detaching handle 0x8e9f10780 from 0x8e9ec8010
@40000000664bb57815168d4c security: debug 3: client @0x86ecdf960 ::1#22250 (16-fail.isc.org): reset client
What is the current bug behavior?
- If the send fails with EHOSTDOWN, then instead of penalising this server's SRTT and trying again with another server in the list, we send back SERVFAIL to the client right away.
What is the expected correct behavior?
The failure should go through proper SRTT processing, and the fetch should be retried with another server in the list (a forwarder, or a server from an NS RRset for a specific domain).
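The expected behavior can be sketched as a small self-contained model. This is not resolver.c or ADB code: `struct server`, `pick_best`, `resolve_with_retry`, and the `SRTT_PENALTY_US` value are all hypothetical names chosen for illustration, and the send is simulated by a stored errno value. The sketch only shows the intended control flow: pick the untried server with the lowest SRTT, and on a "down/unreachable" send error penalize that server and move on rather than failing the whole fetch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a forwarder entry; not BIND's ADB structures. */
struct server {
	const char *addr;
	unsigned int srtt_us; /* smoothed RTT, microseconds */
	int send_errno;       /* simulated send result: 0 means success */
	int tried;            /* nonzero once this fetch has tried it */
};

#define SRTT_PENALTY_US 200000U /* arbitrary illustrative penalty */

/* Return the untried server with the lowest SRTT, or NULL if none is left. */
static struct server *
pick_best(struct server *list, size_t n) {
	struct server *best = NULL;
	for (size_t i = 0; i < n; i++) {
		if (list[i].tried) {
			continue;
		}
		if (best == NULL || list[i].srtt_us < best->srtt_us) {
			best = &list[i];
		}
	}
	return best;
}

/*
 * Try servers in SRTT order. Returns the address of the server the query
 * was sent to, or NULL if every server failed (the SERVFAIL case).
 */
static const char *
resolve_with_retry(struct server *list, size_t n) {
	struct server *s;
	while ((s = pick_best(list, n)) != NULL) {
		s->tried = 1;
		switch (s->send_errno) {
		case 0:
			return s->addr;
		case EHOSTDOWN:
		case EHOSTUNREACH:
		case ENETDOWN:
		case ENETUNREACH:
			/* Penalize this server, then try the next one. */
			s->srtt_us += SRTT_PENALTY_US;
			continue;
		default:
			return NULL; /* unrelated error: give up */
		}
	}
	return NULL;
}
```

For example, with a down forwarder at 192.168.2.25 (lowest SRTT, simulated EHOSTDOWN) and a hypothetical healthy forwarder at 192.0.2.53, `resolve_with_retry` penalizes the first entry and returns the second address instead of failing immediately.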