BIND 9.11.28 server hangs under load with 'dnssec-validation auto' configured
$ uname -a
Linux agr-centos7-ipc-test1 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ named -V
BIND 9.11.28 (Extended Support Version) <id:60f9417>
running on Linux x86_64 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020
built by make with '--enable-ipv6' '--enable-filter-aaaa' '--enable-largefile' '--enable-fixed-rrset' '--enable-threads' '--enable-dnstap' '--disable-shared' '--with-dlopen=no' '--with-openssl=/opt/incontrol/dns/openssl' '--with-geoip2=/opt/incontrol/dns/maxminddb' '--with-protobuf-c=/opt/incontrol/dns/protobuf-c' '--with-libfstrm=/opt/incontrol/dns/fstrm' '--without-gssapi' '--without-pkcs11' '--with-libxml2=yes' '--with-libjson=yes' '--with-tuning=large' '--prefix=/opt/incontrol/dns' 'LDFLAGS=-ldl' 'PKG_CONFIG_PATH=/opt/incontrol/dns/openssl/lib/pkgconfig'
compiled by GCC 4.8.5 20150623 (Red Hat 4.8.5-44)
compiled with OpenSSL version: OpenSSL 1.1.1i 8 Dec 2020
linked to OpenSSL version: OpenSSL 1.1.1i 8 Dec 2020
compiled with libxml2 version: 2.9.1
linked to libxml2 version: 20901
compiled with libjson-c version: 0.11
linked to libjson-c version: 0.11
compiled with zlib version: 1.2.7
linked to zlib version: 1.2.7
linked to maxminddb version: 1.4.3
compiled with protobuf-c version: 1.3.3
linked to protobuf-c version: 1.3.3
threads support is enabled
default paths:
named configuration: /opt/incontrol/dns/etc/named.conf
rndc configuration: /opt/incontrol/dns/etc/rndc.conf
DNSSEC root key: /opt/incontrol/dns/etc/bind.keys
nsupdate session key: /opt/incontrol/dns/var/run/named/session.key
named PID file: /opt/incontrol/dns/var/run/named/named.pid
named lock file: /opt/incontrol/dns/var/run/named/named.lock
geoip-directory: opt/incontrol/dns/maxminddb/share/GeoIP
How to reproduce
named.conf
options {
    directory "/opt/incontrol/dns/db";
    pid-file "/opt/incagent-12.0.24/etc/named.pid";
    allow-transfer { none; };
    allow-query { any; };
    dnssec-validation auto;
};
zone "." IN {
    type hint;
    file "db.cache";
};
zone "localhost" IN {
    type master;
    file "db.localhost";
    allow-update { none; };
};
zone "0.0.127.in-addr.arpa" IN {
    type master;
    file "db.127.0.0";
    allow-update { none; };
};
- Get the latest bind.keys from https://downloads.isc.org/isc/bind9/keys/9.11/ (this step is not necessary to reproduce the problem)
- rm -f db/managed-keys*
- Start named
- Run resperf
$ resperf -s <server_ip> -d queryfile-example-current
queryfile-example-current has 10,000,000 records
- In another window run 'rndc status' in a loop
#!/bin/bash
while true
do
    rndc status
done
- Sometimes 'rndc status' hangs, and when that happens named stops responding to any queries.
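The plain loop above simply blocks forever once 'rndc status' hangs, so the moment of the hang is not recorded. A possible variant, sketched here under the assumption that GNU coreutils `timeout` is installed, stops and prints a timestamp as soon as one invocation fails to return in time. The function name `watch_for_hang` and the 10-second limit in the usage note are illustrative choices, not part of the original report.

```bash
#!/bin/bash
# Run a command repeatedly; stop and report when one invocation does not
# return within the given number of seconds.
# Usage: watch_for_hang <timeout_seconds> <command> [args...]
# (hypothetical helper; relies on coreutils `timeout`)
watch_for_hang() {
    local limit="$1"
    shift
    local rc
    while true; do
        timeout "$limit" "$@" > /dev/null 2>&1
        rc=$?
        if [ "$rc" -eq 124 ]; then
            # timeout exits with 124 when it had to kill the command
            echo "hang detected at $(date -u '+%Y-%m-%dT%H:%M:%SZ'): '$*' did not return within ${limit}s"
            return 1
        elif [ "$rc" -eq 126 ] || [ "$rc" -eq 127 ]; then
            # 126/127: the command could not be executed or was not found
            echo "cannot run '$*'" >&2
            return 2
        fi
    done
}
```

It would be started as `watch_for_hang 10 rndc status` in the second window. The reported timestamp can then be matched against the named log and the resperf run.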
Last 'rndc status' output before the hang:
version: BIND 9.11.28 (Extended Support Version) <id:60f9417> (not available)
running on latest: Linux x86_64 4.15.12 #1 SMP Wed Mar 21 12:30:16 EDT 2018
boot time: Fri, 19 Mar 2021 17:05:02 GMT
last configured: Fri, 19 Mar 2021 17:05:02 GMT
configuration file: /opt/incontrol/dns/etc/named.conf
CPUs found: 4
worker threads: 4
UDP listeners per interface: 3
number of zones: 102 (99 automatic)
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is OFF
recursive clients: 422/900/1000
tcp clients: 4/150
TCP high-water: 4
server is up and running
When it hangs, 'recursive clients:' is sometimes 2/900/1000, but never 0/900/1000.
At that time, strace shows the following:
# strace -f -p <pid>
[pid 6826] epoll_wait(8, <unfinished ...>
[pid 6825] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 6824] select(10, [9], [], NULL, NULL <unfinished ...>
[pid 6822] rt_sigsuspend([], 8 <unfinished ...>
[pid 6823] futex(0x7f8e8449e0d4, FUTEX_WAIT_PRIVATE, 5, NULL <unfinished ...>
[pid 6824] <... select resumed>) = 1 (in [9])
[pid 6824] read(9, ";\275\10vt\376", 36) = 6
[pid 6824] read(9, 0x7f8e82007d30, 30) = -1 EAGAIN (Resource temporarily unavailable)
[pid 6824] select(10, [9], [], NULL, NULL
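To complement the strace output above, the per-thread state can also be read directly from /proc without attaching a tracer. This is only a sketch, assuming a Linux /proc filesystem; the function name `list_thread_states` is hypothetical. It prints the thread ID, thread name, and scheduler state of every thread of the given process, which helps confirm whether all named threads are sleeping at the time of the hang.

```bash
#!/bin/bash
# Print one line per thread of a process: TID, thread name, scheduler state.
# Reads only /proc, so no debugger or tracer is needed (Linux-specific).
# Usage: list_thread_states <pid>
# (hypothetical helper, not part of BIND)
list_thread_states() {
    local pid="$1" task tid name state
    for task in /proc/"$pid"/task/*; do
        # Skip if the process does not exist or the thread has already exited
        [ -r "$task/status" ] || continue
        tid="${task##*/}"
        name=$(cat "$task/comm" 2>/dev/null)
        # The State line looks like "State:<TAB>S (sleeping)"
        state=$(awk -F'\t' '/^State:/ {print $2}' "$task/status")
        printf '%s\t%s\t%s\n' "$tid" "$name" "$state"
    done
}
```

It would be called as `list_thread_states <pid>` with the same named PID that was passed to strace.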
The problem does not happen in 9.11.22 with the same configuration.