Investigate/fix how CATZ update post-processing can block servicing of inbound queries in BIND 9.16

Related to Support Ticket RT #19629

After upgrading a busy authoritative server from BIND 9.11 to BIND 9.16, the RTTs of the primary zone monitoring test queries (sampled at a rate of 2 QPS) changed significantly and became much more 'spiky'. Overall, the RTTs are lower (9.16 is faster at servicing queries than 9.11), but there are some significant 'spikes' where the RTTs of the test queries are much larger than average.

Investigation of the causes (carried out by eliminating potential candidates while running and monitoring a 'test' server) highlighted that the 'spikes' corresponded to the period immediately after an update had been received for a catalog zone.

(It was also noted that the spikes didn't occur after every catalog zone update, but this would tally with the way that inbound client queries are hashed to a netmgr thread, which may or may not be the one that gets temporarily blocked by CATZ post-processing.)

My understanding is that catalog zone post-processing is a task that runs to completion: it does not iterate or pause while it sorts out adds/deletes/changes to catalog zones on the receiving secondary server. This makes it a plausible cause of these test RTT spikes.
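To illustrate the suspected effect, here is a toy model of a single worker thread with a virtual clock. It is not BIND code: the function name `worst_query_latency`, the "tick" unit, and the rule that queued queries are served whenever the job yields are all assumptions made for the sketch. It compares a job that runs to completion with the same job split into small chunks.

```python
from collections import deque


def worst_query_latency(catz_units, chunk, query_arrivals):
    """Toy model of ONE worker thread (hypothetical, not BIND internals).

    catz_units: ticks of catalog-zone post-processing queued at t=0.
    chunk: max ticks the job runs before yielding to queued queries
           (chunk == catz_units models a run-to-completion task).
    query_arrivals: sorted arrival ticks of test queries; each costs 1 tick.
    Returns the worst latency (completion tick minus arrival tick).
    """
    clock = 0
    remaining = catz_units
    pending = deque(query_arrivals)
    worst = 0
    while remaining > 0 or pending:
        # Serve every query that has already arrived before resuming the job.
        while pending and pending[0] <= clock:
            arrival = pending.popleft()
            clock += 1
            worst = max(worst, clock - arrival)
        if remaining > 0:
            # Run the next slice of post-processing without interruption.
            step = min(chunk, remaining)
            clock += step
            remaining -= step
        elif pending:
            # Job finished: idle until the next query arrives.
            clock = pending[0]
    return worst


arrivals = [10, 50, 90]
run_to_completion = worst_query_latency(100, 100, arrivals)
chunked = worst_query_latency(100, 5, arrivals)
print(run_to_completion, chunked)
```

In this model the run-to-completion job makes the first query wait for almost the whole job, while the chunked job keeps the worst-case wait close to the chunk size. The model ignores lock contention with other threads, which would need separate investigation.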

Another potential candidate is inbound AXFR/IXFR processing that follows on from CATZ updates.

Please can this be investigated/tested/confirmed and solutions considered. It may be that CATZ post-processing should be considered as another candidate to migrate to threadpools, although it might also need to iterate if it ends up locking resources that could block inbound client queries too (as in, it's not just about it sitting on the netmgr thread doing its thing; what it is doing may also block other threads - ref !5151 (merged)).

Edited Dec 31, 2021 by Cathy Almond