Crashes during resolver shutdown
In the past week[^1], `named` crashed on the same line in three distinct system tests:

```
resolver.c:9837: INSIST(((res->dbuckets[i].list).head == ((void *)0))) failed, back trace
```
The crash can be triggered both on shutdown and during a reconfiguration. It looks like something has messed up the order in which resolver shutdown steps are taken: in the `addzone` and `rndc` occurrences of this crash, the resolver is being destroyed in `destroy()`, so all of the counters storing per-domain fetch counts (`res->dbuckets`) are expected to have already been released, and yet at least one of them has not been freed because another thread is concurrently executing `fctx_destroy()` and has not yet reached the `fcount_decr(fctx);` call inside it.
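To make the suspected interleaving easier to see, here is a minimal, self-contained sketch; it is not BIND code, and the `bucket`/`counter` types and thread bodies are simplified stand-ins for `res->dbuckets[i]`, the per-domain `fctxcount_t` objects, `destroy()` and `fctx_destroy()`. One thread asserts that the counter list is already empty (the role of the `INSIST()` in `destroy()`), while the other only unlinks its counter at the very end of its cleanup (the role of `fcount_decr()` at the end of `fctx_destroy()`):

```c
/*
 * Hypothetical sketch of the suspected race; NOT taken from BIND.
 * Build with: cc -pthread race_sketch.c
 */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Simplified stand-in for a per-domain fetch counter (fctxcount_t). */
struct counter {
	struct counter *next;
	unsigned int count;
};

/* Simplified stand-in for one entry of res->dbuckets[]. */
struct bucket {
	pthread_mutex_t lock;
	struct counter *head;
};

static struct bucket bucket = { PTHREAD_MUTEX_INITIALIZER, NULL };

/* Plays the tail end of fctx_destroy(): the counter is only unlinked
 * here, after the rest of the fetch context teardown has run. */
static void *
fctx_cleanup(void *arg) {
	(void)arg;
	usleep(1000);	/* other teardown work; makes the bad ordering likely */
	pthread_mutex_lock(&bucket.lock);
	struct counter *c = bucket.head;
	bucket.head = NULL;	/* the "fcount_decr()" moment */
	pthread_mutex_unlock(&bucket.lock);
	free(c);
	return NULL;
}

/* Plays destroy(): expects every bucket to be empty by now. */
static void *
resolver_destroy(void *arg) {
	(void)arg;
	pthread_mutex_lock(&bucket.lock);
	/* Analogous to INSIST(res->dbuckets[i].list.head == NULL). */
	assert(bucket.head == NULL);
	pthread_mutex_unlock(&bucket.lock);
	return NULL;
}

int
main(void) {
	struct counter *c = calloc(1, sizeof(*c));
	if (c == NULL) {
		return 1;
	}
	c->count = 1;
	bucket.head = c;	/* one fetch counter still outstanding */

	pthread_t cleanup, destroy;
	pthread_create(&cleanup, NULL, fctx_cleanup, NULL);
	pthread_create(&destroy, NULL, resolver_destroy, NULL);
	pthread_join(destroy, NULL);
	pthread_join(cleanup, NULL);
	printf("cleanup won the race this time; no assertion failure\n");
	return 0;
}
```

With the delay in place the assertion fires on almost every run, mirroring the `INSIST()` failure above; dropping the `usleep()` turns it into the kind of rarely-lost race that would explain why the crash only shows up occasionally in CI.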
The obvious suspect is 9da902a2 (part of !1952 (merged)), but I was unable to reproduce this either under stress or by adding `sleep()` calls, so I could not do a bisect. Interestingly enough, all of those crashes happened on Debian sid (amd64).
This looks like a non-exploitable race between threads, but I am marking it confidential for the time being out of an abundance of caution.
I am attaching three files with GDB output, one for each of the failed system tests (`addzone`, `masterformat`, `rndc`). Note that in the `masterformat` case, all the counters in `res->dbuckets` are already released, which may be misleading (presumably because the other thread managed to release the problematic counter between the moment the `INSIST()` condition was checked and the moment the process exited), but the value of `i` (the bucket number) is readable (406). For the other two cases, the domain for which the fetch counter is yet to be released is the root domain:
```
(gdb) print res->dbuckets[420].list
$533 = {head = 0x7febfc04ae70, tail = 0x7febfc04ae70}
(gdb) print res->dbuckets[420].list.head
$534 = (fctxcount_t *) 0x7febfc04ae70
(gdb) print *res->dbuckets[420].list.head
$535 = {
  fdname = {
    name = {
      magic = 1145983854,
      ndata = 0x7febfc04af80 "",
      length = 1,
      labels = 1,
      attributes = 1,
      offsets = 0x7febfc04aec0 "",
      buffer = 0x7febfc04af40,
      link = {
        prev = 0xffffffffffffffff,
        next = 0xffffffffffffffff
      },
      list = {
        head = 0x0,
        tail = 0x0
      }
    },
    offsets = '\000' <repeats 127 times>,
    buffer = {
      magic = 1114990113,
      base = 0x7febfc04af80,
      length = 255,
      used = 1,
      current = 0,
      active = 0,
      link = {
        prev = 0xffffffffffffffff,
        next = 0xffffffffffffffff
      },
      mctx = 0x0,
      autore = false
    },
    data = '\000' <repeats 254 times>
  },
  domain = 0x7febfc04ae70,
  count = 1,
  allowed = 2,
  dropped = 0,
  logged = 0,
  link = {
    prev = 0x0,
    next = 0x0
  }
}
```
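One supporting detail from the dump above: the two magic values decode to the expected ISC four-character tags ('DNSn' for `dns_name_t`, 'Buf!' for `isc_buffer_t`), and a name with `length = 1`, a single label and empty `ndata` is how the root name is stored, which is consistent with the leftover counter belonging to the root domain. A tiny standalone check of the byte packing (my own helper, not code from BIND):

```c
/*
 * Standalone helper (not from BIND) that reverses the ISC_MAGIC packing:
 * four ASCII bytes stored most-significant-first in an unsigned int.
 */
#include <stdio.h>

static void
decode(unsigned int magic) {
	unsigned char tag[4] = {
		(magic >> 24) & 0xff, (magic >> 16) & 0xff,
		(magic >> 8) & 0xff, magic & 0xff
	};
	printf("%u -> '%c%c%c%c'\n", magic, tag[0], tag[1], tag[2], tag[3]);
}

int
main(void) {
	decode(1145983854U);	/* magic of the dns_name_t above: 'DNSn' */
	decode(1114990113U);	/* magic of the isc_buffer_t above: 'Buf!' */
	return 0;
}
```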
@wpk, @ondrej: I think you would be best equipped to investigate this.
Attachments: `addzone.log`, `masterformat.log`, `rndc.log`
[^1]: Unfortunately, I am unable to look back further than a week because that is how long we have configured GitLab to keep test artifacts for; i.e. this could have started happening earlier than a week ago, but it was definitely not happening a month ago.