Resolver issues with refactored dispatch code
This issue collects the various resolver problems observed after merging !4601 (merged) (#2401 (closed)). Most of them are intermittent, so it is important to track them in one place so that we do not forget they exist. We should get to the bottom of all of them before we release BIND 9.18.0.
- Recursive Perflab tests cause the resolver to stop responding. This issue might be the simplest one to start with because the observed behavior seems to be consistent rather than intermittent. Namely, all Perflab jobs which test a resolver seem to crank out a response rate of some 70-120 kQPS at the beginning of the test and then... the resolver stops responding indefinitely. While Perflab was not designed with recursive tests in mind, and its recursive results should therefore be taken with a grain of salt, it certainly should not be reporting zeros all over the place.
  - https://perflab.isc.org/#/config/run/5bf195dd83ba91a870b2976f/
  - https://perflab.isc.org/#/config/run/5cd6a166643076f6c1f6c26f/
  - https://perflab.isc.org/#/config/run/5db74b6264458967f762143a/
  - https://perflab.isc.org/#/config/run/5db74b7264458967f762143b/
  - https://perflab.isc.org/#/config/run/5db74c2764458967f7621440/
  - https://perflab.isc.org/#/config/run/5db74c3464458967f7621441/

  (Resolved by !5500 (merged).)
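
  For the record, this kind of symptom can be observed outside of Perflab by repeatedly poking the resolver with a trivial query and watching for the moment it stops answering. The sketch below is just such a throwaway probe (the address, port, and query name are placeholders, and this is not how Perflab itself drives or measures its tests):

  ```python
  #!/usr/bin/env python3
  # Rough responsiveness probe: sends one simple A query per second to the
  # resolver under test and reports whether an answer came back.  This is NOT
  # how Perflab measures throughput; the address, port, and query name below
  # are placeholders.
  import socket
  import struct
  import time

  SERVER = ("127.0.0.1", 53)   # resolver under test (placeholder)
  QNAME = "example.com."       # query name (placeholder)

  def build_query(qname: str, qid: int) -> bytes:
      # DNS header: ID, RD bit set, one question, no other sections.
      header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
      question = b"".join(
          bytes([len(label)]) + label.encode("ascii")
          for label in qname.rstrip(".").split(".")
      ) + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
      return header + question

  def probe_once(sock: socket.socket, qid: int) -> bool:
      sock.sendto(build_query(QNAME, qid), SERVER)
      try:
          data, _ = sock.recvfrom(4096)
      except socket.timeout:
          return False
      # Any reply with a matching ID counts; we only care that an answer arrived.
      return len(data) >= 2 and struct.unpack("!H", data[:2])[0] == qid

  def main() -> None:
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.settimeout(2.0)
      qid = 0
      while True:
          qid = (qid + 1) & 0xFFFF
          status = "answered" if probe_once(sock, qid) else "NO RESPONSE"
          print(time.strftime("%H:%M:%S"), status, flush=True)
          time.sleep(1)

  if __name__ == "__main__":
      main()
  ```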
- `respdiff` tests are sometimes slow. Ever since we merged the dispatch branch, the `respdiff` tests started failing intermittently for `main` (and only `main`) because of timeouts:
  - job 2016337: pass, ~2m30s per 10,000 queries
  - job 2016622: pass, ~2m45s per 10,000 queries
  - job 2017990: pass, ~2m30s per 10,000 queries
  - job 2020093: fail, 7+ minutes per 10,000 queries
  - job 2023057: fail, 16+ minutes per 10,000 queries
  - job 2023490: pass, ~2m40s per 10,000 queries

  I do not think varying CI runner load can be blamed for this, not for discrepancies this large. It also never happened before !4601 (merged) was merged, as far as I know.
- A lot of "stress" test graphs indicate growing memory use (#3002 (closed)). While testing the October BIND 9 releases, one of the 1-hour "stress" tests run in recursive mode for BIND 9.17.19 yielded a graph which indicates that memory use growing over time might be an issue. However, that phenomenon was not observable for other OS/arch combinations this specific code revision was tested with. It was also not observable on the same OS/arch combination for a very similar code revision (whose code differences should not have any effect on memory use patterns).

  Pre-release tests run for BIND 9.17.20 confirmed that memory leaks are a common occurrence when `named` is used as a recursive resolver; more details are available in #3002 (closed). The "stress" tests are run on isolated VMs and, despite being pretty synthetic (fixed traffic pattern, everything happening on one machine, etc.), they have a history of being very stable, so typical issues like test host load varying over time are not a factor here.
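
  As a side note, when a graph like this looks suspicious, the trend can be double-checked outside the test harness by sampling the RSS of the `named` process over time. The sketch below is only an illustration of that idea (the `pidof`-based lookup and the one-second interval are arbitrary choices, not part of the actual "stress" test setup):

  ```python
  #!/usr/bin/env python3
  # Illustration only: periodically sample the resident set size (VmRSS) of a
  # running named process so that memory growth over time can be eyeballed.
  # The pidof lookup and the one-second interval are arbitrary choices, not
  # part of the actual "stress" test harness.
  import subprocess
  import sys
  import time

  def find_pid(name: str = "named") -> int:
      result = subprocess.run(["pidof", name], capture_output=True, text=True)
      if result.returncode != 0:
          sys.exit(f"no running process called {name!r}")
      return int(result.stdout.split()[0])

  def rss_kib(pid: int) -> int:
      # /proc/<pid>/status reports VmRSS in kB on Linux.
      with open(f"/proc/{pid}/status") as status:
          for line in status:
              if line.startswith("VmRSS:"):
                  return int(line.split()[1])
      raise RuntimeError("VmRSS not found")

  def main() -> None:
      pid = find_pid()
      start = time.time()
      while True:
          print(f"{time.time() - start:8.0f} s  {rss_kib(pid)} kB", flush=True)
          time.sleep(1)

  if __name__ == "__main__":
      main()
  ```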
- Lame servers with IPv6 unreachable cause a hang on shutdown. #2927 (closed)
- The `resolver` system test fails intermittently. #3013 (closed) See https://gitlab.isc.org/isc-projects/bind9/-/jobs/2054296:

      I:resolver:query count error: 6 NS records: expected queries 10, actual 11
      I:resolver:failed
- Assertion failed in `dns_resolver_logfetch()`. #2962 (closed)
- Assertion failed in `dns_dispatch_gettcp()`. #2963 (closed)
- Assertion failed in `dns_resolver_destroyfetch()`. #2969 (closed)
- ThreadSanitizer issues with adb. #2978 (closed) #2979 (closed)
- `fctx_cancelquery()` attempts to process a query which has already been freed. #3018 (closed)
- Premature TCP connection closure leaks fetch contexts (hang on shutdown). #3026 (closed)
- Validator loops can cause a shutdown hang. #3033
- ADB finds for a broken zone may cause fetch contexts to hang. #3037
- ASAN error in `fctx_cancelquery()`. #3102 (closed)
I decided to open a single issue for all of the problems above because I suspect they are somehow related, and I hope that fixing the root cause of one of them will eliminate the others as well.