host: Search stopped after SERVFAIL (regression)
Summary
In versions before 9.10, the host command tried all domains from the search list until it got a non-error response.
Commit 8afea636 changed it to continue searching only after an NXDOMAIN response.
This causes problems in some of our setups, where some search domains are served by a local named and others are delegated to upstream servers. During network problems, the upstream servers for domains early in the search list can become unreachable, so lookups in those domains return SERVFAIL, even though lookups in the locally hosted search domains later in the list would succeed.
Older versions of host returned the correct answer, but newer ones report SERVFAIL to the user as soon as the server returns the first SERVFAIL response.
The -s CLI argument (a SERVFAIL response should stop the query) doesn't seem to make any difference here.
BIND version used
BIND 9.16.23-RH (Extended Support Version) <id:fde3b1f>
running on Linux x86_64 <version-edited> #1 SMP Mon Dec 5 17:22:42 PST 2022
built by make with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-python=/usr/bin/python3' '--with-libtool' '--localstatedir=/var' '--with-pic' '--disable-static' '--includedir=/usr/include/bind9' '--with-tuning=large' '--with-libidn2' '--with-maxminddb' '--with-dlopen=yes' '--with-gssapi=yes' '--with-lmdb=yes' '--without-libjson' '--with-json-c' '--enable-dnstap' '--enable-fixed-rrset' '--enable-full-report' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CC=gcc' 'CFLAGS= -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' 'LDFLAGS=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 ' 'LT_SYS_LIBRARY_PATH=/usr/lib64:' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
compiled by GCC 11.3.1 20220421 (Red Hat 11.3.1-2)
compiled with OpenSSL version: OpenSSL 3.0.1 14 Dec 2021
linked to OpenSSL version: OpenSSL 3.0.1 14 Dec 2021
compiled with libuv version: 1.42.0
linked to libuv version: 1.42.0
compiled with libxml2 version: 2.9.13
linked to libxml2 version: 20913
compiled with json-c version: 0.14
linked to json-c version: 0.14
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
linked to maxminddb version: 1.5.2
compiled with protobuf-c version: 1.3.3
linked to protobuf-c version: 1.3.3
threads support is enabled
default paths:
named configuration: /etc/named.conf
rndc configuration: /etc/rndc.conf
DNSSEC root key: /etc/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/named.pid
named lock file: /var/run/named/named.lock
geoip-directory: /usr/share/GeoIP
Steps to reproduce
(Stripped down versions of our configs below.)
We run named in a network namespace to simulate connectivity problems:
cd /tmp
sudo ip netns add test1
sudo ip netns exec test1 ip link set lo up
sudo ip netns exec test1 named -c /tmp/host20.named.conf -p 1054 -g
Actual test:
sudo ip netns exec test1 host -p 1054 test3
What is the current bug behavior?
Host test3.baddomain.com not found: 2(SERVFAIL)
What is the expected correct behavior?
test3.subtest3.domaintest.com has address 10.1.1.3
Relevant configuration files
/tmp/host20.named.conf:
options { listen-on-v6 {any;}; };
zone "domaintest.com" { type master; file "/tmp/host20.zone.domaintest.com"; };
/tmp/host20.zone.domaintest.com:
$ORIGIN domaintest.com.
$TTL 1h
domaintest.com. IN SOA ns.domaintest.com. username.domaintest.com. (
2007120710 ; serial number of this zone file
1d ; slave refresh (1 day)
2h ; slave retry time in case of a problem (2 hours)
4w ; slave expiration time (4 weeks)
1h ; maximum caching time in case of failed lookups (1 hour)
)
domaintest.com. NS .
domaintest.com. A 10.1.1.1
test2.subtest2 IN A 10.1.1.2
test3.subtest3 IN A 10.1.1.3
test4.subtest4 IN A 10.1.1.4
/etc/resolv.conf:
search baddomain.com subtest2.domaintest.com subtest3.domaintest.com subtest4.domaintest.com
nameserver 127.0.0.1
Possible fixes
The problem was introduced by commit 8afea636.
We can't access the linked RT#34711, so we don't know the story behind this change, but it could either be reverted, or an additional explicit check for dns_rcode_servfail could be added. If the current behaviour is still required, it could be tied to the -s CLI flag.