NTA-related crash in checkbogus() after an "rndc reload"
The following crash occurred in the dnssec
system test:
https://gitlab.isc.org/isc-private/bind9/-/jobs/1070127
I:dnssec:checking positive and negative validation with negative trust anchors (134)
I:dnssec:ns4 Negative trust anchor added: bogus.example/_default, expires 06-Aug-2020 09:49:06.000
I:dnssec:ns4 Negative trust anchor added: badds.example/_default, expires 06-Aug-2020 09:48:58.000
I:dnssec:ns4 Negative trust anchor added: secure.example/_default, expires 06-Aug-2020 09:48:59.000
I:dnssec:ns4 Negative trust anchor added: fakenode.secure.example/_default, expires 06-Aug-2020 09:49:03.000
I:dnssec:ns4 rndc: connection to remote host closed
I:dnssec:ns4 This may indicate that
I:dnssec:ns4 * the remote server is using an older version of the command protocol,
I:dnssec:ns4 * this host is not authorized to connect,
I:dnssec:ns4 * the clocks are not synchronized, or
I:dnssec:ns4 * the key is invalid.
rndc: connect failed: 10.53.0.4#5009: connection refused
rndc: connect failed: 10.53.0.4#5009: connection refused
I:dnssec:stopping servers
I:dnssec:Core dump(s) found: dnssec/ns4/core.14410
R:dnssec:FAIL
D:dnssec:backtrace from dnssec/ns4/core.14410:
D:dnssec:--------------------------------------------------------------------------------
D:dnssec:Core was generated by `/builds/isc-private/bind9/bind-9.16.6/bin/named/.libs/named -D dnssec-ns4 -X na'.
D:dnssec:Program terminated with signal SIGSEGV, Segmentation fault.
D:dnssec:#0 0x00007fef72ffb8c5 in isc__mem_get (mctx=0xdededededededede, size=24, file=0x7fef733ea7b9 "resolver.c", line=10758) at mem.c:2430
D:dnssec:2430 REQUIRE(ISCAPI_MCTX_VALID(mctx));
D:dnssec:[Current thread is 1 (Thread 0x7fef692e0700 (LWP 14436))]
D:dnssec:#0 0x00007fef72ffb8c5 in isc__mem_get (mctx=0xdededededededede, size=24, file=0x7fef733ea7b9 "resolver.c", line=10758) at mem.c:2430
D:dnssec:#1 0x00007fef73328737 in dns_resolver_createfetch (res=0x7fef6c14c4d0, name=0x7feeef4c7350, type=type@entry=47, domain=domain@entry=0x0, nameservers=nameservers@entry=0x0, forwarders=forwarders@entry=0x0, client=0x0, id=0, options=1024, depth=0, qc=0x0, task=0x7feeedd31c80, action=0x7fef732bb080 <fetch_done>, arg=0x7feeef4c7230, rdataset=0x7feeef4c7260, sigrdataset=0x7feeef4c72d8, fetchp=0x7feeef4c7258) at resolver.c:10758
D:dnssec:#2 0x00007fef732bafff in checkbogus (task=0x7feeedd31c80, event=<optimized out>) at nta.c:265
D:dnssec:#3 0x00007fef7300e72d in dispatch (threadid=<optimized out>, manager=<optimized out>) at task.c:1152
D:dnssec:#4 run (queuep=<optimized out>) at task.c:1344
D:dnssec:#5 0x00007fef72969fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
D:dnssec:#6 0x00007fef725914cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
D:dnssec:--------------------------------------------------------------------------------
D:dnssec:full backtrace from dnssec/ns4/core.14410 saved in core.14410-backtrace.txt
D:dnssec:core dump dnssec/ns4/core.14410 archived as dnssec/ns4/core.14410.gz
E:dnssec:2020-08-06T09:49:17+0000
What happens here is that checkbogus() (the timer callback used for starting an
NTA recheck query) calls dns_resolver_createfetch() while the resolver is being
destroyed (after an rndc reload). I believe the root cause is that the NTA code
does not take any reference on the resolver when the recheck timer is first set up.
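For illustration, here is a minimal sketch of the kind of reference counting I have
in mind, built around the existing dns_resolver_attach()/dns_resolver_detach() pair;
the nta_t fields and helper names below are hypothetical and not the actual nta.c code:

#include <dns/resolver.h>

/* Hypothetical sketch only -- not the actual nta.c code. */
typedef struct nta {
	dns_resolver_t *resolver; /* reference held while a recheck can still fire */
	/* ... timer, name, expiry, etc. ... */
} nta_t;

static void
nta_recheck_schedule(nta_t *nta, dns_resolver_t *res) {
	/*
	 * Take our own reference before arming the recheck timer, so the
	 * resolver cannot be destroyed while checkbogus() may still run.
	 */
	dns_resolver_attach(res, &nta->resolver);
	/* ... arm the timer that will eventually call checkbogus() ... */
}

static void
nta_recheck_done(nta_t *nta) {
	/* Drop the reference once no further recheck can be started. */
	dns_resolver_detach(&nta->resolver);
}

(Whether the NTA code should hold the resolver itself or the owning view is
something I have not verified.)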
I managed to reproduce this with the following patch applied:
diff --git a/lib/dns/resolver.c b/lib/dns/resolver.c
index 8956baee81..efd18e1f3c 100644
--- a/lib/dns/resolver.c
+++ b/lib/dns/resolver.c
@@ -14,6 +14,7 @@
 #include <ctype.h>
 #include <inttypes.h>
 #include <stdbool.h>
+#include <unistd.h>

 #include <isc/atomic.h>
 #include <isc/counter.h>
@@ -10066,6 +10067,7 @@ destroy(dns_resolver_t *res) {
 	isc_timer_detach(&res->spillattimer);
 	res->magic = 0;
 	isc_mem_put(res->mctx, res, sizeof(*res));
+	sleep(5);
 }

 static void
I set nta-recheck 2s; in named.conf, started named, then added an NTA
(rndc nta dnssec-failed.org.), and finally stopped the server with CTRL+C.
The added sleep(5) stalls shutdown right after the resolver structure has been
freed, giving the two-second recheck timer a chance to fire against the
already-destroyed resolver:
06-Aug-2020 13:06:37.741 running
06-Aug-2020 13:06:37.765 resolver priming query complete
06-Aug-2020 13:06:37.765 managed-keys-zone: Key 20326 for zone . is now trusted (acceptance timer complete)
06-Aug-2020 13:06:38.548 received control channel command 'nta dnssec-failed.org.'
06-Aug-2020 13:06:38.555 flush tree 'dnssec-failed.org.' in cache view '_default': success
06-Aug-2020 13:06:38.555 added NTA 'dnssec-failed.org.' (3600 sec) in view '_default'
^C06-Aug-2020 13:06:40.028 no longer listening on 127.0.0.1#5300
06-Aug-2020 13:06:40.028 no longer listening on ::1#5300
06-Aug-2020 13:06:40.031 shutting down
06-Aug-2020 13:06:40.031 stopping statistics channel on 127.0.0.1#8080
06-Aug-2020 13:06:40.031 stopping statistics channel on ::1#8080
06-Aug-2020 13:06:40.031 stopping command channel on 127.0.0.1#9953
06-Aug-2020 13:06:40.031 stopping command channel on ::1#9953
06-Aug-2020 13:06:46.568 resolver.c:10741: REQUIRE((__builtin_expect(!!((res) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(res))->magic == ((('R') << 24 | ('e') << 16 | ('s') << 8 | ('!')))), 1))) failed, back trace
06-Aug-2020 13:06:46.568 #0 0x5648605f3403 in ??
06-Aug-2020 13:06:46.568 #1 0x7f123e0f987a in ??
06-Aug-2020 13:06:46.568 #2 0x7f123e647e77 in ??
06-Aug-2020 13:06:46.568 #3 0x7f123e5434c3 in ??
06-Aug-2020 13:06:46.568 #4 0x7f123e142c2e in ??
06-Aug-2020 13:06:46.568 #5 0x7f123e143543 in ??
06-Aug-2020 13:06:46.568 #6 0x7f123dbc6422 in ??
06-Aug-2020 13:06:46.568 #7 0x7f123daf5bf3 in ??
06-Aug-2020 13:06:46.568 exiting (due to assertion failure)
Aborted (core dumped)
The crash is not exactly the same (more on this below), but the backtrace points to the same code location:
(gdb) bt
#0 0x00007f123da32355 in raise () from /usr/lib/libc.so.6
#1 0x00007f123da1b853 in abort () from /usr/lib/libc.so.6
#2 0x00005648605f3701 in assertion_failed (file=0x7f123e77d1d0 "resolver.c", line=10741,
type=isc_assertiontype_require,
cond=0x7f123e780b80 "(__builtin_expect(!!((res) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(res))->magic == ((('R') << 24 | ('e') << 16 | ('s') << 8 | ('!')))), 1))") at ./main.c:261
#3 0x00007f123e0f987a in isc_assertion_failed (file=0x7f123e77d1d0 "resolver.c", line=10741,
type=isc_assertiontype_require,
cond=0x7f123e780b80 "(__builtin_expect(!!((res) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(res))->magic == ((('R') << 24 | ('e') << 16 | ('s') << 8 | ('!')))), 1))") at assertions.c:46
#4 0x00007f123e647e77 in dns_resolver_createfetch (res=0x0, name=0x7f1214012c80, type=47,
domain=0x0, nameservers=0x0, forwarders=0x0, client=0x0, id=0, options=1024, depth=0,
qc=0x0, task=0x7f12309fa970, action=0x7f123e543056 <fetch_done>, arg=0x7f1214012b60,
rdataset=0x7f1214012b90, sigrdataset=0x7f1214012c08, fetchp=0x7f1214012b88)
at resolver.c:10741
#5 0x00007f123e5434c3 in checkbogus (task=0x7f12309fa970, event=0x0) at nta.c:265
#6 0x00007f123e142c2e in dispatch (manager=0x7f123b93b010, threadid=1) at task.c:1152
#7 0x00007f123e143543 in run (queuep=0x7f123b93c0a0) at task.c:1344
#8 0x00007f123dbc6422 in start_thread () from /usr/lib/libpthread.so.0
#9 0x00007f123daf5bf3 in clone () from /usr/lib/libc.so.6
I believe the reason for the discrepancy between the two call stacks is that in
my reproducer the resolver is fully destroyed before checkbogus() calls
dns_resolver_createfetch(), which is why the latter fails immediately on the
REQUIRE(VALID_RESOLVER(res)); assertion. In the original crash (in GitLab CI),
the resolver must still have been valid when dns_resolver_createfetch() was
called, because the process crashed later on (after res->mctx became
0xdededededededede) and it crashed due to a segfault rather than an assertion
failure.
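To make the two failure modes concrete, here is a small standalone program; it is
not BIND code, the structure is simplified, and I am assuming the 0xde bytes are
the allocator's freed-memory fill pattern. It shows why a resolver that is already
invalid on entry trips the REQUIRE, while one destroyed after that check leaves a
wild mctx pointer to be dereferenced later:

#include <stdio.h>
#include <string.h>

/* Simplified stand-in for dns_resolver_t; not the real structure. */
struct resolver {
	unsigned int magic;
	void *mctx;
};

#define RES_MAGIC 0x52657321u /* 'R' 'e' 's' '!' */
#define VALID_RESOLVER(r) ((r) != NULL && (r)->magic == RES_MAGIC)

static void
destroy(struct resolver *r) {
	/* Mimic the freed-memory fill; the real code also frees the block. */
	memset(r, 0xde, sizeof(*r));
}

int
main(void) {
	struct resolver r = { RES_MAGIC, NULL };
	struct resolver *stale = &r;

	destroy(&r);

	/*
	 * If the resolver is already gone when the function is entered
	 * (my reproducer), the entry check catches it -- this is the
	 * REQUIRE at resolver.c:10741.
	 */
	printf("VALID_RESOLVER(stale) = %d\n", VALID_RESOLVER(stale));

	/*
	 * If destruction happens only after that check (the CI crash),
	 * the later allocation from res->mctx -- resolver.c:10758 --
	 * dereferences the poisoned pointer instead.
	 */
	printf("stale->mctx = %p\n", stale->mctx);

	return 0;
}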
Furthermore, inspecting the core dump shows that the resolver had in fact been
destroyed by the time the process crashed:
(gdb) frame 2
#2 0x00007fef732bafff in checkbogus (task=0x7feeedd31c80, event=<optimized out>) at nta.c:265
265 result = dns_resolver_createfetch(
(gdb) print view->resolver
$17 = (dns_resolver_t *) 0xdededededededede
I am marking this as confidential out of an abundance of caution, until someone can confirm my findings.