cacheclean system test fails intermittently and does not preserve forensic data

https://gitlab.isc.org/isc-private/bind9/-/jobs/474768:

S:cacheclean:Thu Dec 12 10:24:18 UTC 2019
T:cacheclean:1:A
A:cacheclean:System test cacheclean
I:cacheclean:PORTRANGE:5900 - 5999
I:cacheclean:check correctness of routine cache cleaning (1)
I:cacheclean:only one tcp socket was used (2)
I:cacheclean:reset and check that records are correctly cached initially (3)
I:cacheclean:check flushing of the full cache (4)
I:cacheclean:check flushing of individual nodes (interior node) (5)
I:cacheclean:check flushing of individual nodes (leaf node, under the interior node) (6)
I:cacheclean:check flushing of individual nodes (another leaf node, with both positive and negative cache entries) (7)
I:cacheclean:failed
I:cacheclean:check flushing a nonexistent name (8)
I:cacheclean:check flushing of namespaces (9)
I:cacheclean:check flushing a nonexistent namespace (10)
I:cacheclean:check the number of cached records remaining (11)
I:cacheclean:check the check that flushname of a partial match works (12)
I:cacheclean:check the number of cached records remaining (13)
I:cacheclean:check flushtree clears adb correctly (14)
I:cacheclean:check expire option returned from master zone (15)
I:cacheclean:check expire option returned from slave zone (16)
I:cacheclean:exit status: 1
R:cacheclean:FAIL
E:cacheclean:Thu Dec 12 10:28:25 UTC 2019

This failure was probably caused by an almost two-second stall in execution (ns2/named.run) between querying for third1.second1.top1.flushtest.example/A and then requerying for it with +noanswer +auth¹:

12-Dec-2019 10:27:53.849 dns_adb_destroyfind on find 0x42a37f84d20
12-Dec-2019 10:27:53.849 client @0x42a5c0351c0 (no-peer): free
12-Dec-2019 10:27:53.849 sockmgr 0x42a43959010: watcher got message -2 for socket -1
12-Dec-2019 10:27:53.849 fctx 0x42aae098010(third1.second1.top1.flushtest.example/A): unlink
12-Dec-2019 10:27:53.849 fctx 0x42aae098010(third1.second1.top1.flushtest.example/A): destroy
12-Dec-2019 10:27:55.400 socket 0x42a439712e0: dispatch_recv:  event 0x42abf336200 -> task 0x42a4396ee70
12-Dec-2019 10:27:55.401 socket 0x42a43971880: dispatch_recv:  event 0x42ac4225200 -> task 0x42a4396ef28
12-Dec-2019 10:27:55.402 socket 0x42a439712e0: internal_recv: task 0x42a4396ee70 got event 0x42a43971388

This probably caused the second query to yield a 3598 negative TTL, which led the test code to believe that the answer had come from cache even though it had not.

The "probably" bits above stem from the fact that proving this theory is impossible because the cacheclean system test preserves almost no forensic information, making intermittent failure diagnosis mostly an exercise in tasseography.

Let's ignore for a while the fact that the test could easily do away with grabbing that record just once, but it nevertheless issues two queries for it because the first dig invocation uses flags which prevent a negative answer TTL to be extracted from dig output... ↩