Pre-fork deadlock during key-directory locking, caused by the in-view statement.
While attempting to test the fixes for #2778 (closed), I found that named deadlocks during initialization. In my particular case, named is invoked like so:
/usr/sbin/named -u bind -t /var/named
On downstream servers (which have no DNSSEC maintenance enabled and no key-directory: clauses), named starts normally and performs correctly. On the upstream server, which has several views with corresponding per-view key-directory: clauses, the invocation of named never completes; the command line shown above never returns to its caller.
BIND version used
BIND 9.16.18 (Stable Release) <id:1c027c4>
running on Linux x86_64 5.12.12 #1 SMP Fri Jun 18 05:38:32 EDT 2021
built by make with '--build=x86_64-slackware-linux-gnu' '--docdir=/usr/doc/bind-9.16.18' '--sysconfdir=/etc/bind' '--infodir=/usr/info' '--libdir=/usr/lib64' '--mandir=/usr/man' '--prefix=/usr' '--localstatedir=/var' '--enable-largefile' '--with-libtool' '--enable-shared' '--with-gssapi=/usr' '--with-libidn2' '--with-dlopen' '--with-dlz-filesystem' '--with-dlz-stub' '--enable-auto-validation' '--enable-dnsrps' '--enable-full-report' '--enable-fixed-rrset' '--enable-querytrace' '--with-python=/usr/bin/python3' '--with-openssl' 'build_alias=x86_64-slackware-linux-gnu' 'CFLAGS=-g -O2 -fPIC -march=opteron' 'PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig:/usr/local/share/pkgconfig:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
compiled by GCC 10.3.0
compiled with OpenSSL version: OpenSSL 1.1.1k 25 Mar 2021
linked to OpenSSL version: OpenSSL 1.1.1k 25 Mar 2021
compiled with libuv version: 1.41.0
linked to libuv version: 1.41.0
compiled with libxml2 version: 2.9.12
linked to libxml2 version: 20912
compiled with json-c version: 0.15
linked to json-c version: 0.15
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
threads support is enabled
default paths:
  named configuration:  /etc/bind/named.conf
  rndc configuration:   /etc/bind/rndc.conf
  DNSSEC root key:      /etc/bind/bind.keys
  nsupdate session key: /var/run/named/session.key
  named PID file:       /var/run/named/named.pid
  named lock file:      /var/run/named/named.lock
Steps to reproduce
A reproducer is still in the works. At the moment, I'm wading through eu-stack output, looking for anything that can narrow down where the deadlock is occurring.
What is the current bug behavior?
named deadlocks during the initialization phase, in which it checks its configuration for usability before returning a 0 (OK) status to its invoker.
What is the expected correct behavior?
named should check its configuration and return to its caller with a 0 exit status (configuration looks OK) or some other value (incorrect configuration).
Output of eu-stack utility
Here, named has been invoked with "-n 1" to reduce the number of threads to keep track of.
PID 16077 - process
TID 16077:
#0  0x00007f8cf0939632 __sigtimedwait
#1  0x00007f8cf0aee014 sigwait
#2  0x00007f8cf1075d99 isc_app_ctxrun
#3  0x00007f8cf1076024 isc_app_run
#4  0x000000000041a8ce main
#5  0x00007f8cf09211ad __libc_start_main
#6  0x000000000041b3ea _start
TID 16078:
#0  0x00007f8cf0aed2cb __lll_lock_wait
#1  0x00007f8cf0ae6172 __pthread_mutex_lock
#2  0x00007f8cf13f5a9d dns__zone_lockunlock_keyfiles
#3  0x00007f8cf13d8fb9 dns_update_signaturesinc
#4  0x00007f8cf13dc409 dns_update_signatures
#5  0x00007f8cf13fbf0f rss_post
#6  0x00007f8cf13fc270 setnsec3param
#7  0x00007f8cf10a7de4 isc_task_run
#8  0x00007f8cf1092781 process_netievent
#9  0x00007f8cf10928f5 process_queue
#10 0x00007f8cf10930a1 async_cb
#11 0x00007f8cf0e78f65
#12 0x00007f8cf0e89af5 uv__io_poll
#13 0x00007f8cf0e797a4 uv_run
#14 0x00007f8cf10929ae nm_thread
#15 0x00007f8cf10aa1c1 isc__trampoline_run
#16 0x00007f8cf0ae3e85 start_thread
#17 0x00007f8cf0a08b3f __clone
TID 16079:
#0  0x00007f8cf0af023e __futex_abstimed_wait_cancelable64
#1  0x00007f8cf0aea116 pthread_cond_timedwait@@GLIBC_2.3.2
#2  0x00007f8cf10bd698 isc_condition_waituntil
#3  0x00007f8cf10ac1a7 run
#4  0x00007f8cf10aa1c1 isc__trampoline_run
#5  0x00007f8cf0ae3e85 start_thread
#6  0x00007f8cf0a08b3f __clone
TID 16080:
#0  0x00007f8cf0a08e66 epoll_wait
#1  0x00007f8cf10b5514 netthread
#2  0x00007f8cf10aa1c1 isc__trampoline_run
#3  0x00007f8cf0ae3e85 start_thread
#4  0x00007f8cf0a08b3f __clone
I will also attach the strace log (trace.log), showing file and process events. It is somewhat interesting in that only the first view (external) is processed; none of the various internal views show any file-based activity.
It turns out, after adding some debugging statements to the code, that the culprit is an attempt to lock the same key-directory more than once, resulting in a deadlock. There is a loop that walks a per-zone linked list of views, locking the key directory of each in turn. However, if one of those views is referenced from another view by way of the in-view zone statement, then that view (and its key directory) appears in the list more than once.
The code that maintains and walks the linked list of per-zone views will need to be updated to keep track of whether each key directory has already been locked, so that it never attempts a double lock; the same logic will need to be applied on the unlock pass.