Possible race between resolver query timeout and validation in BIND 9.16
A customer's BIND 9.16.27 named instance crashed after days of normal operation. Here is the backtrace:
Thread 1 (Thread 0x7faa1ecba700 (LWP 2298)):
#0 0x00007faa22f2b5f7 in raise () from /lib64/libc.so.6
#1 0x00007faa22f2cce8 in abort () from /lib64/libc.so.6
#2 0x000000000042ac65 in assertion_failed (file=<optimized out>, line=<optimized out>, type=<optimized out>,
cond=<optimized out>) at ./main.c:272
#3 0x00007faa2477afea in isc_assertion_failed (file=file@entry=0x7faa25b66c3f "resolver.c", line=line@entry=4561,
type=type@entry=isc_assertiontype_require,
cond=cond@entry=0x7faa25b67660 "(__builtin_expect(!!((fctx) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(fctx))->magic == ((('F') << 24 | ('!') << 16 | ('!') << 8 | ('!')))), 1))") at assertions.c:48
#4 0x00007faa25ab2aee in fctx_unlink (fctx=fctx@entry=0x7fa9b9b27ab0) at resolver.c:4561
#5 0x00007faa25ab4532 in maybe_destroy (fctx=fctx@entry=0x7fa9b9b27ab0, locked=true) at resolver.c:5722
#6 0x00007faa25ac1728 in validated (task=<optimized out>, event=0x7fa854ad0f60) at resolver.c:5815
#7 0x00007faa247aa0fe in task_run (task=0x7fa7db68a690) at task.c:851
#8 isc_task_run (task=0x7fa7db68a690) at task.c:944
#9 0x00007faa2478ebfd in isc__nm_async_task (ev0=ev0@entry=0x7fa977674df0, worker=0x1669e40) at netmgr.c:873
#10 0x00007faa24794598 in process_netievent (worker=worker@entry=0x1669e40, ievent=0x7fa977674df0) at netmgr.c:952
#11 0x00007faa24794779 in process_queue (worker=worker@entry=0x1669e40, type=type@entry=NETIEVENT_TASK) at netmgr.c:1021
#12 0x00007faa24794ec3 in process_all_queues (worker=0x1669e40) at netmgr.c:792
#13 async_cb (handle=0x166a1a0) at netmgr.c:821
#14 0x00007faa23dbb4b3 in uv__async_io () from /lib64/libuv.so.1
#15 0x00007faa23dce983 in uv__io_poll () from /lib64/libuv.so.1
#16 0x00007faa23dbbc88 in uv_run () from /lib64/libuv.so.1
#17 0x00007faa2479482f in nm_thread (worker0=0x1669e40) at netmgr.c:727
#18 0x00007faa247ac845 in isc__trampoline_run (arg=0x190d2d0) at trampoline.c:198
#19 0x00007faa232bedc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007faa22fec1cd in clone () from /lib64/libc.so.6
Here is only the most important part, isolated:
#3 0x00007faa2477afea in isc_assertion_failed (file=file@entry=0x7faa25b66c3f "resolver.c", line=line@entry=4561,
type=type@entry=isc_assertiontype_require,
cond=cond@entry=0x7faa25b67660 "(__builtin_expect(!!((fctx) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(fctx))->magic == ((('F') << 24 | ('!') << 16 | ('!') << 8 | ('!')))), 1))") at assertions.c:48
#4 0x00007faa25ab2aee in fctx_unlink (fctx=fctx@entry=0x7fa9b9b27ab0) at resolver.c:4561
#5 0x00007faa25ab4532 in maybe_destroy (fctx=fctx@entry=0x7fa9b9b27ab0, locked=true) at resolver.c:5722
#6 0x00007faa25ac1728 in validated (task=<optimized out>, event=0x7fa854ad0f60) at resolver.c:5815
#7 0x00007faa247aa0fe in task_run (task=0x7fa7db68a690) at task.c:851
Both at the beginning of validated() and at the beginning of fctx_unlink() there is a REQUIRE(VALID_FCTX(fctx)); check. The first one passed successfully, while the second one caused the crash. So somewhere between validated() and fctx_unlink(), the fetch context (fctx) was unexpectedly destroyed.
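The condition string in the assertion frame above shows what VALID_FCTX() checks: a live fetch context carries the 'F!!!' magic value in its first field, and REQUIRE() aborts the process as soon as the check fails. Here is a minimal stand-alone sketch of that pattern (simplified, not the actual BIND macros; it assumes the destroy path clears the magic before freeing, which is the usual ISC magic-number convention):

/* Stand-alone sketch of the magic-number validity check seen in the
 * assertion condition above (simplified; not the actual BIND macros). */
#include <assert.h>
#include <stdlib.h>

#define FCTX_MAGIC    0x46212121U            /* 'F','!','!','!', as in the trace */
#define VALID_FCTX(p) ((p) != NULL && (p)->magic == FCTX_MAGIC)
#define REQUIRE(cond) assert(cond)           /* stand-in: BIND's REQUIRE() aborts */

struct fetchctx {
        unsigned int magic;                  /* liveness marker, checked first */
        /* ... the rest of the fetch context ... */
};

static struct fetchctx *fctx_create(void) {
        struct fetchctx *fctx = calloc(1, sizeof(*fctx));
        fctx->magic = FCTX_MAGIC;            /* marked valid while alive */
        return fctx;
}

static void fctx_destroy(struct fetchctx *fctx) {
        REQUIRE(VALID_FCTX(fctx));
        fctx->magic = 0;                     /* invalidated before freeing */
        free(fctx);
}

int main(void) {
        struct fetchctx *fctx = fctx_create();
        REQUIRE(VALID_FCTX(fctx));           /* first check: passes */
        fctx_destroy(fctx);                  /* someone destroys the context... */
        REQUIRE(VALID_FCTX(fctx));           /* ...second check: use-after-free, aborts */
        return 0;
}

The crash in the backtrace corresponds to the last REQUIRE() in main(): the context was valid when validated() checked it, and already destroyed by the time fctx_unlink() checked it again.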
Destroyed in the same thread? (not likely)
There are two possible scenarios: either this happened during the execution of the same callback (which should be easy to find), or another thread interfered and destroyed the context while validated() was doing its job. So let's start with the first one. For a moment it seemed that there is such a code path, leading to a double fctx_destroy():
validated() -> maybe_destroy() -> dns_validator_cancel() -> dns_resolver_destroyfetch() -> fctx_decreference() -> fctx_destroy() -> fctx_destroy()
But the problem is that, before calling dns_validator_cancel(), maybe_destroy() makes sure that the bucket is locked:

bucketnum = fctx->bucketnum;
if (!locked) {
        LOCK(&res->buckets[bucketnum].lock);
}
... and dns_resolver_destroyfetch() also locks that same bucket:

FTRACE("destroyfetch");
bucketnum = fctx->bucketnum;
LOCK(&res->buckets[bucketnum].lock);

If that code path really executed, dns_resolver_destroyfetch() would be trying to take a bucket lock that the same thread already holds, which would show up as a deadlock, not a crash. We do not actually experience any deadlocks, which most likely means that this exact code path never(?) executes.
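For completeness, here is what that self-deadlock would look like, reduced to a stand-alone sketch with stand-in names (not BIND code), assuming the bucket lock behaves like a default, non-recursive pthread mutex:

#include <pthread.h>
#include <stdio.h>

/* Stand-in for res->buckets[n].lock: a default (non-recursive) mutex. */
static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for dns_resolver_destroyfetch(): it unconditionally takes the
 * bucket lock. */
static void destroyfetch(void) {
        pthread_mutex_lock(&bucket_lock);    /* second acquisition by the same
                                              * thread: typically blocks forever */
        pthread_mutex_unlock(&bucket_lock);
}

/* Stand-in for validated() -> maybe_destroy(locked = true): the bucket lock
 * is already held when the cancel/destroy chain is entered. */
static void validated_path(void) {
        pthread_mutex_lock(&bucket_lock);
        destroyfetch();                      /* never returns */
        pthread_mutex_unlock(&bucket_lock);
        printf("not reached\n");
}

int main(void) {
        validated_path();
        return 0;
}

Running this hangs on the second pthread_mutex_lock() call instead of aborting, which is why a crash with the backtrace above points away from this single-threaded path.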
A multi-threading error?
Can two different tasks (and therefore threads) work with, and try to destroy, the same fetch context? All of the fetch's callbacks run on the same task (derived from the bucket number), apart from the final callback, which runs on a task chosen by the creator of the fetch.
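As a rough illustration of that routing (plain C with invented names such as post_to_queue(); this is not BIND's task API): everything internal to the fetch is funnelled to the queue that belongs to the fetch's bucket and is therefore serialized, while the final fetch-done event goes to whichever queue the creator supplied, which may be drained by a different thread.

#include <stdio.h>

#define NBUCKETS 8

struct fetchctx {
        unsigned int bucketnum;              /* fixed when the context is created */
};

/* Stand-in for posting an event to a task: just report which queue gets it. */
static void post_to_queue(unsigned int queue, const char *event) {
        printf("queue %u <- %s\n", queue, event);
}

/* Internal events (queries, validations, timeouts) always use the bucket's
 * queue, so they are serialized with each other. */
static void post_internal_event(const struct fetchctx *fctx, const char *event) {
        post_to_queue(fctx->bucketnum % NBUCKETS, event);
}

/* The final "fetch done" event uses whatever queue the creator of the fetch
 * picked, which may be served by a different thread than the bucket's queue. */
static void post_done_event(unsigned int creator_queue, const char *event) {
        post_to_queue(creator_queue, event);
}

int main(void) {
        struct fetchctx fctx = { .bucketnum = 3 };
        post_internal_event(&fctx, "validated");
        post_done_event(5, "fetch done (timeout)");
        return 0;
}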
So I can only imagine the following scenario for how this could have happened (a rough code sketch of the sequence follows the list):

- [T1] The fetch times out; the timeout callback fctx_timeout() gets called, which initiates sending the final timeout event of the fetch to the creator's callback, through fctx_done() -> fctx_sendevents().
- [T1] Simultaneously, a validation completes, so the task starts executing the validated() function.
- [T1] validated() locks the bucket lock, unlinks the current validator from the fctx->validators list (which happens to clear the list), destroys the validator, and for a short moment unlocks the bucket lock.
- [T2] At that same brief moment, the creator's final callback sees that there are no more active fetches on fctx and that the fctx->validators list is empty, so it destroys the fetch context.
- [T1] validated() locks the bucket lock again and calls maybe_destroy(), which tries to destroy the fetch context again, and so it crashes.
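Here is that hypothesis as a deliberately racy, stand-alone sketch (invented names, and a simple counter in place of the real fctx->validators list; this is not BIND code). Thread 1 plays the resolver task running validated(); thread 2 plays the creator's task handling the final timeout event. The sleeps steer thread 2 into thread 1's brief unlocked window, so thread 2 destroys the context first and thread 1 then hits the same kind of REQUIRE(VALID_FCTX()) failure seen in the backtrace:

#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define FCTX_MAGIC    0x46212121U
#define VALID_FCTX(p) ((p) != NULL && (p)->magic == FCTX_MAGIC)

struct fetchctx {
        unsigned int magic;
        int nvalidators;                     /* stand-in for fctx->validators */
};

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fetchctx *fctx;                /* the shared fetch context */

static void fctx_destroy(struct fetchctx *p) {
        assert(VALID_FCTX(p));               /* the REQUIRE that fires in the crash */
        p->magic = 0;
        free(p);
}

/* [T1] The resolver task running validated(). */
static void *validated_thread(void *arg) {
        (void)arg;
        pthread_mutex_lock(&bucket_lock);
        fctx->nvalidators = 0;               /* unlink and destroy the last validator */
        pthread_mutex_unlock(&bucket_lock);  /* the brief unlocked window */

        usleep(100 * 1000);                  /* leave time for [T2] to interfere */

        pthread_mutex_lock(&bucket_lock);
        fctx_destroy(fctx);                  /* maybe_destroy(): second destroy, aborts */
        pthread_mutex_unlock(&bucket_lock);
        return NULL;
}

/* [T2] The creator's task handling the final (timeout) fetch event. */
static void *fetch_done_thread(void *arg) {
        (void)arg;
        usleep(50 * 1000);                   /* land inside [T1]'s unlocked window */
        pthread_mutex_lock(&bucket_lock);
        if (fctx->nvalidators == 0) {
                fctx_destroy(fctx);          /* no validators left: destroy the fctx */
        }
        pthread_mutex_unlock(&bucket_lock);
        return NULL;
}

int main(void) {
        fctx = calloc(1, sizeof(*fctx));
        fctx->magic = FCTX_MAGIC;
        fctx->nvalidators = 1;

        pthread_t t1, t2;
        pthread_create(&t1, NULL, validated_thread, NULL);
        pthread_create(&t2, NULL, fetch_done_thread, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

The window is tiny, which would be consistent with the crash appearing only after days of operation.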