"could not get query source dispatcher" error after reconfig
Summary
We are running a hidden-master-only BIND server with many zones (about 73000), most of them signed with auto-dnssec maintain; inline-signing yes;
Every 7 to 8 days we get the following errors in the log:
Jul 9 11:32:37 nsmaster named[68180]: general: info: received control channel command 'reconfig'
Jul 9 11:32:48 nsmaster named[68180]: general: error: could not get query source dispatcher (213.36.252.194#0)
Jul 9 11:32:48 nsmaster named[68180]: general: error: reloading configuration failed: out of memory
and the server must be restarted.
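A monitoring check for this failure could scan the syslog output for the failed reload. The following is only a minimal sketch: it assumes the line layout shown above (print-category and print-severity enabled, print-time disabled), and the names LINE_RE and reconfig_failures are made up for illustration.

```python
import re

# Matches syslog lines in the layout shown above, for example:
# "Jul 9 11:32:48 nsmaster named[68180]: general: error: <message>"
LINE_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"named\[(?P<pid>\d+)\]:\s+(?P<category>[\w-]+):\s+"
    r"(?P<severity>\w+):\s+(?P<message>.*)$"
)


def reconfig_failures(lines):
    """Return parsed records for error lines that report a failed reload."""
    failures = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a named line in the expected layout
        record = match.groupdict()
        if (record["severity"] == "error"
                and record["message"].startswith("reloading configuration failed")):
            failures.append(record)
    return failures
```

Feeding it the three log lines above would yield one record, for the "reloading configuration failed: out of memory" line.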
BIND version used
# /usr/local/sbin/named -V
BIND 9.16.4 (Stable Release) <id:0849b42>
running on FreeBSD amd64 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 GENERIC
built by make with '--disable-linux-caps' '--localstatedir=/var' '--sysconfdir=/usr/local/etc/namedb' '--with-dlopen=yes' '--with-libxml2' '--with-openssl=/usr' '--with-readline=-L/usr/local/lib -ledit' '--with-dlz-filesystem=yes' '--disable-dnstap' '--disable-fixed-rrset' '--disable-geoip' '--without-maxminddb' '--without-gssapi' '--with-libidn2=/usr/local' '--with-json-c' '--disable-largefile' '--with-lmdb=/usr/local' '--disable-native-pkcs11' '--without-python' '--disable-querytrace' 'STD_CDEFINES=-DDIG_SIGCHASE=1' '--enable-tcp-fastopen' '--with-tuning=large' '--disable-symtable' '--prefix=/usr/local' '--mandir=/usr/local/man' '--infodir=/usr/local/share/info/' '--build=amd64-portbld-freebsd12.1' 'build_alias=amd64-portbld-freebsd12.1' 'CC=cc' 'CFLAGS=-O2 -pipe -DLIBICONV_PLUG -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing ' 'LDFLAGS= -L/usr/local/lib -ljson-c -fstack-protector-strong ' 'LIBS=-L/usr/local/lib' 'CPPFLAGS=-DLIBICONV_PLUG -isystem /usr/local/include' 'CPP=cpp' 'PKG_CONFIG=pkgconf'
compiled by CLANG 4.2.1 Compatible FreeBSD Clang 8.0.1 (tags/RELEASE_801/final 366581)
compiled with OpenSSL version: OpenSSL 1.1.1d-freebsd 10 Sep 2019
linked to OpenSSL version: OpenSSL 1.1.1d-freebsd 10 Sep 2019
compiled with libxml2 version: 2.9.10
linked to libxml2 version: 20910
compiled with json-c version: 0.13.1
linked to json-c version: 0.13.1
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
threads support is enabled
default paths:
named configuration: /usr/local/etc/namedb/named.conf
rndc configuration: /usr/local/etc/namedb/rndc.conf
DNSSEC root key: /usr/local/etc/namedb/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/pid
named lock file: /var/run/named/named.lock
It is the FreeBSD port compiled with --with-tuning=large.
We had the same issue before with --with-tuning=default.
We also had the same issue before with BIND 9.11.20.
I tried starting named with -U 20, but this did not change anything.
We have been running the same configuration WITHOUT DNSSEC-signed zones for years on smaller servers without this issue.
Steps to reproduce
We periodically regenerate our configuration to add/update/remove zones. When needed, we run "rndc reconfig".
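To watch the memory growth while reproducing, something like the sketch below could be used. It is only an illustration under several assumptions: rndc is on the PATH and can reach the server with its default configuration, the pause length is arbitrary, and the function names are hypothetical. It does not regenerate the configuration between runs, which our real workflow does.

```python
import subprocess
import time


def named_rss_kib(pid, run=subprocess.run):
    """Return the resident set size of `pid` in KiB, as reported by ps."""
    result = run(["ps", "-o", "rss=", "-p", str(pid)],
                 capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def reconfig_and_sample(pid, rounds, pause_s=60,
                        run=subprocess.run, sleep=time.sleep):
    """Run `rndc reconfig` `rounds` times and record RSS after each run.

    Returns a list of (rndc exit status, RSS in KiB) tuples.
    """
    samples = []
    for _ in range(rounds):
        outcome = run(["rndc", "reconfig"], capture_output=True, text=True)
        sleep(pause_s)  # assumed delay to let named finish loading
        samples.append((outcome.returncode, named_rss_kib(pid, run=run)))
    return samples
```

Called as reconfig_and_sample(<pid of named>, rounds=10), it returns one sample per reconfig, so a steadily increasing RSS column would show the growth described in this report.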
What is the current bug behavior?
After a number of rndc reconfig runs, the named server stops working with the errors:
Jul 9 11:32:48 nsmaster named[68180]: general: error: could not get query source dispatcher (213.36.252.194#0)
Jul 9 11:32:48 nsmaster named[68180]: general: error: reloading configuration failed: out of memory
and top reports:
last pid: 62441; load averages: 0.29, 0.57, 0.70 up 169+19:01:18 11:47:47
15 processes: 1 running, 14 sleeping
CPU: 1.4% user, 0.0% nice, 4.8% system, 0.0% interrupt, 93.8% idle
Mem: 7100M Active, 5153M Inact, 18G Laundry, 14G Wired, 582M Buf, 18G Free
ARC: 7506M Total, 4261M MFU, 1187M MRU, 30M Anon, 287M Header, 1742M Other
3994M Compressed, 17G Uncompressed, 4.45:1 Ratio
Swap: 64G Total, 422M Used, 64G Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
68180 bind 170 52 0 30G 26G sigwai 3 42.3H 0.00% named
When named has just been restarted, top reports:
last pid: 64008; load averages: 0.49, 0.68, 1.00 up 169+21:04:34 13:51:03
21 processes: 1 running, 20 sleeping
CPU: 1.4% user, 0.0% nice, 4.8% system, 0.0% interrupt, 93.8% idle
Mem: 8509M Active, 2467M Inact, 22M Laundry, 16G Wired, 582M Buf, 36G Free
ARC: 7491M Total, 4043M MFU, 1410M MRU, 11M Anon, 286M Header, 1740M Other
3990M Compressed, 18G Uncompressed, 4.52:1 Ratio
Swap: 64G Total, 98M Used, 64G Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
63924 bind 170 52 0 9180M 7033M sigwai 25 13:36 0.00% named
There must be some memory or resource leak.
The server has 64G of RAM and 56 cores, so it should not be running out of memory.
When it happens, we need to stop and start the named server.
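To put numbers on the two top snapshots above, here is a small worked calculation. It assumes that top's K/M/G suffixes are powers of 1024; since top rounds the larger values (30G, 26G), the ratios are approximate. The helper names are made up for this example.

```python
# Size suffixes used by top, expressed in MiB (assumed to be powers of 1024).
UNITS_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def to_mib(size):
    """Convert a top-style size such as '9180M' or '30G' to MiB."""
    return float(size[:-1]) * UNITS_MIB[size[-1]]


def growth(before, after):
    """Return how many times larger `after` is than `before`."""
    return to_mib(after) / to_mib(before)


# (freshly restarted named, named at the time of the failure)
snapshots = {"SIZE": ("9180M", "30G"), "RES": ("7033M", "26G")}
for column, (fresh, failed) in snapshots.items():
    print(f"{column}: {fresh} -> {failed} ({growth(fresh, failed):.1f}x)")
```

This gives roughly 3.3x for SIZE and 3.8x for RES between the freshly restarted process and the one that failed.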
What is the expected correct behavior?
There should not be any errors after the reconfig, and the server should not stop working. Its memory usage should not grow that much.
Relevant configuration files
There are too many zones (about 73000) to paste them all here.
logging {
    channel stdlog {
        syslog local1;
        print-category yes;
        print-severity yes;
        print-time no;
    };
    category default { stdlog; };
    category queries { "null"; };
    category query-errors { "null"; };
    category update { "null"; };
    category update-security { "null"; };
    category security { "null"; };
};
options {
    // All file and path names are relative to the chroot directory,
    // if any, and should be fully qualified.
    directory "/usr/local/etc/namedb/working";
    pid-file "/var/run/named/pid";
    dump-file "/var/dump/named_dump.db";
    statistics-file "/var/stats/named.stats";
    listen-on { 127.0.1.4; 213.36.252.194; };
    listen-on-v6 { 2a01:e0d:1:2:58bf:f9c2:0:1; };
    disable-empty-zone "255.255.255.255.IN-ADDR.ARPA";
    disable-empty-zone "0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.IP6.ARPA";
    disable-empty-zone "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.IP6.ARPA";
    query-source address 213.36.252.194 port *;
    query-source-v6 address 2a01:e0d:1:2:58bf:f9c2:0:1 port *;
    allow-transfer {
        127.0.1.4;
        213.36.252.128/25;
        2a01:e0b:1:e:0:0:0:0/64;
        213.36.252.32/27;
        62.210.98.15;
        213.36.253.14;
    };
    startup-notify-rate 100;
    notify-source 213.36.252.194;
    recursion no;
    notify no;
    check-integrity no;
    minimal-responses yes;
    max-transfer-idle-out 5;
    max-transfer-time-out 10;
    tcp-clients 1000;
    tcp-listen-queue 100;
    transfers-out 1000;
    dnssec-enable yes;
    sig-validity-interval 60 30;
    masterfile-format text;
    request-ixfr no;
    provide-ixfr no;
};
zone "." { type hint; file "/usr/local/etc/namedb/named.root"; };
key "rndc-key" {
    algorithm hmac-sha256;
    secret "xxx";
};
controls {
    inet 127.0.1.4
        port 953
        allow { any; } keys { "rndc-key"; };
};
// the zones
include "/usr/local/etc/namedb/named.conf.custom.inc";
include "/usr/local/etc/namedb/named.conf.custom-old.inc";
Most zones are signed like:
zone "bookmyname.be" {
    type master;
    file "custom/b/o/bookmyname.be/bookmyname.be";
    notify explicit;
    also-notify { 213.36.252.135; 62.210.98.15; 213.36.253.14; };
    auto-dnssec maintain;
    inline-signing yes;
    key-directory "custom/b/o/bookmyname.be";
};
A few are not signed and have the following config:
zone "bookmyname.lu" {
    type master;
    file "custom/b/o/bookmyname.lu/bookmyname.lu";
    notify explicit;
    also-notify { 213.36.252.135; 62.210.98.15; 213.36.253.14; };
};
Relevant logs and/or screenshots
# rndc status
version: BIND 9.16.4 (Stable Release) <id:0849b42>
running on nsmaster.free.org: FreeBSD amd64 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 GENERIC
boot time: Thu, 09 Jul 2020 09:52:25 GMT
last configured: Thu, 09 Jul 2020 10:42:49 GMT
configuration file: /usr/local/etc/namedb/named-custom.conf
CPUs found: 56
worker threads: 56
UDP listeners per interface: 56
number of zones: 144025 (0 automatic)
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/900/1000
tcp clients: 0/1000
TCP high-water: 17
server is up and running
Possible fixes
No idea