Pre-fork deadlock during key-directory locking, due to the "in-view" statement
Summary
While attempting to test the fixes for #2778 (closed), I found that named deadlocks during initialization. In my particular case, named is invoked like so:
/usr/sbin/named -u bind -t /var/named
On downstream servers (which have no DNSSEC maintenance enabled and no key-directory clauses), named starts normally and performs correctly. On the upstream server, which has several views with corresponding per-view key-directory clauses, the invocation of named never completes: the command line shown above never returns to its caller.
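For reference, a minimal sketch of the configuration shape involved; the view names, zone name, and paths here are hypothetical, not copied from the affected server:

view "external" {
    zone "example.com" {
        type master;
        file "zones/external/example.com";
        key-directory "keys/external";
        dnssec-policy "default";
    };
};

view "internal" {
    // The zone is shared from the "external" view via in-view,
    // rather than being defined a second time:
    zone "example.com" {
        in-view "external";
    };
};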
BIND version used
BIND 9.16.18 (Stable Release) <id:1c027c4>
running on Linux x86_64 5.12.12 #1 SMP Fri Jun 18 05:38:32 EDT 2021
built by make with '--build=x86_64-slackware-linux-gnu' '--docdir=/usr/doc/bind-9.16.18' '--sysconfdir=/etc/bind' '--infodir=/usr/info' '--libdir=/usr/lib64' '--mandir=/usr/man' '--prefix=/usr' '--localstatedir=/var' '--enable-largefile' '--with-libtool' '--enable-shared' '--with-gssapi=/usr' '--with-libidn2' '--with-dlopen' '--with-dlz-filesystem' '--with-dlz-stub' '--enable-auto-validation' '--enable-dnsrps' '--enable-full-report' '--enable-fixed-rrset' '--enable-querytrace' '--with-python=/usr/bin/python3' '--with-openssl' 'build_alias=x86_64-slackware-linux-gnu' 'CFLAGS=-g -O2 -fPIC -march=opteron' 'PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig:/usr/local/share/pkgconfig:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
compiled by GCC 10.3.0
compiled with OpenSSL version: OpenSSL 1.1.1k 25 Mar 2021
linked to OpenSSL version: OpenSSL 1.1.1k 25 Mar 2021
compiled with libuv version: 1.41.0
linked to libuv version: 1.41.0
compiled with libxml2 version: 2.9.12
linked to libxml2 version: 20912
compiled with json-c version: 0.15
linked to json-c version: 0.15
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
threads support is enabled
default paths:
named configuration: /etc/bind/named.conf
rndc configuration: /etc/bind/rndc.conf
DNSSEC root key: /etc/bind/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/named.pid
named lock file: /var/run/named/named.lock
Steps to reproduce
A reproducer is still in the works. At the moment, I'm wading through strace and eu-stack output, looking for anything that can narrow down where the deadlock is occurring.
What is the current bug behavior?
named deadlocks during the initialization phase, in which it checks its configuration for usability before forking into the background and returning a 0 (OK) status to its invoker.
What is the expected correct behavior?
named should check its configuration and return to its caller with a 0 exit status (configuration looks OK) or a nonzero value (incorrect configuration).
Output of the eu-stack utility
Here, named has been invoked with "-n 1" to reduce the number of threads to keep track of.
PID 16077 - process
TID 16077:
#0 0x00007f8cf0939632 __sigtimedwait
#1 0x00007f8cf0aee014 sigwait
#2 0x00007f8cf1075d99 isc_app_ctxrun
#3 0x00007f8cf1076024 isc_app_run
#4 0x000000000041a8ce main
#5 0x00007f8cf09211ad __libc_start_main
#6 0x000000000041b3ea _start
TID 16078:
#0 0x00007f8cf0aed2cb __lll_lock_wait
#1 0x00007f8cf0ae6172 __pthread_mutex_lock
#2 0x00007f8cf13f5a9d dns__zone_lockunlock_keyfiles
#3 0x00007f8cf13d8fb9 dns_update_signaturesinc
#4 0x00007f8cf13dc409 dns_update_signatures
#5 0x00007f8cf13fbf0f rss_post
#6 0x00007f8cf13fc270 setnsec3param
#7 0x00007f8cf10a7de4 isc_task_run
#8 0x00007f8cf1092781 process_netievent
#9 0x00007f8cf10928f5 process_queue
#10 0x00007f8cf10930a1 async_cb
#11 0x00007f8cf0e78f65
#12 0x00007f8cf0e89af5 uv__io_poll
#13 0x00007f8cf0e797a4 uv_run
#14 0x00007f8cf10929ae nm_thread
#15 0x00007f8cf10aa1c1 isc__trampoline_run
#16 0x00007f8cf0ae3e85 start_thread
#17 0x00007f8cf0a08b3f __clone
TID 16079:
#0 0x00007f8cf0af023e __futex_abstimed_wait_cancelable64
#1 0x00007f8cf0aea116 pthread_cond_timedwait@@GLIBC_2.3.2
#2 0x00007f8cf10bd698 isc_condition_waituntil
#3 0x00007f8cf10ac1a7 run
#4 0x00007f8cf10aa1c1 isc__trampoline_run
#5 0x00007f8cf0ae3e85 start_thread
#6 0x00007f8cf0a08b3f __clone
TID 16080:
#0 0x00007f8cf0a08e66 epoll_wait
#1 0x00007f8cf10b5514 netthread
#2 0x00007f8cf10aa1c1 isc__trampoline_run
#3 0x00007f8cf0ae3e85 start_thread
#4 0x00007f8cf0a08b3f __clone
I will also attach the strace log (trace.log), showing file and process events. It is somewhat interesting in that only the first view (external) is processed; none of the various internal views show any file-based activity.
Analysis
It turns out, after adding some debugging statements to the code, that the culprit is an attempt to lock the same key-directory more than once, resulting in a deadlock. There is a loop that walks over a per-zone linked list of views, locking the key directory of each in turn. However, if one of those views is referenced from another view by way of the in-view zone statement, then that view (and its key directory) is included in the list more than once.
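This matches the stack trace above: TID 16078 is blocked in __lll_lock_wait under dns__zone_lockunlock_keyfiles, which is what a thread re-locking a mutex it already holds looks like on glibc. Here is a minimal standalone illustration of the failure mode (deliberately not BIND's code, just the same pattern with a hypothetical keydir structure):

#include <pthread.h>
#include <stdio.h>

struct keydir {
    const char *path;
    pthread_mutex_t lock;
};

int
main(void) {
    /* Stand-in for a key directory and its lock. */
    static struct keydir kd = { "keys/external", PTHREAD_MUTEX_INITIALIZER };

    /* Stand-in for the per-zone list of views: the shared view
     * (and therefore its key directory) appears twice, as happens
     * when a view is pulled into another via in-view. */
    struct keydir *list[] = { &kd, &kd };

    for (size_t i = 0; i < 2; i++) {
        printf("locking %s (entry %zu)\n", list[i]->path, i);
        /* On glibc, the default mutex type is non-recursive, so the
         * second iteration blocks forever -- the same deadlock that
         * keeps named from completing initialization. */
        pthread_mutex_lock(&list[i]->lock);
    }

    return 0; /* never reached */
}

Built with cc -pthread, this prints both "locking" lines and then hangs on the second pthread_mutex_lock, never exiting.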
Possible fixes
The code that maintains and walks the linked list of per-zone views will need to be updated to keep track of which key directories have already been locked, and to skip any attempt at a double lock; the same logic will need to be applied on the unlock pass.
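A sketch of what that could look like, using the same hypothetical keydir stand-in as the illustration above rather than BIND's actual internals: before locking an entry, scan the earlier part of the list and skip entries whose key directory has already been seen; the unlock pass applies the same test so each mutex is released exactly once.

#include <pthread.h>
#include <stddef.h>

struct keydir {
    const char *path;
    pthread_mutex_t lock;
};

/* Return 1 if the same key directory already occurs in list[0..upto). */
static int
already_seen(struct keydir **list, size_t upto, struct keydir *kd) {
    for (size_t i = 0; i < upto; i++) {
        if (list[i] == kd) {
            return 1;
        }
    }
    return 0;
}

static void
lock_keydirs(struct keydir **list, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Only lock the first occurrence of each key directory. */
        if (!already_seen(list, i, list[i])) {
            pthread_mutex_lock(&list[i]->lock);
        }
    }
}

static void
unlock_keydirs(struct keydir **list, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Mirror the lock pass so each mutex is unlocked exactly once. */
        if (!already_seen(list, i, list[i])) {
            pthread_mutex_unlock(&list[i]->lock);
        }
    }
}

Comparing pointers is sufficient here because the deadlock comes from the very same view object (and hence the very same mutex) appearing in the list twice, and the quadratic scan is harmless given how short a zone's list of views is.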