fetches-per-server quota is lower-bounded to 1 instead of to 2% of quota
On a server with "fetches-per-server 4000;" I was surprised to see a cache dump with the ADB values for a server showing me a quota set to 1.
; problem-server.example.com [v4 TTL 2658] [v4 not_found] [v6 unexpected]
; 192.0.2.25 [srtt 948570] [flags 00004000] [ttl -342230] [atr 0.62] [quota 1]
Although we didn't document the lower bound in the ARM (this also needs to be addressed), the KB article (https://kb.isc.org/docs/aa-01304) explaining how fetchlimits work, based on information from Engineering, describes the adjustment algorithm thus:
The fetches-per-server option sets a hard upper limit to the number of outstanding fetches allowed for a single server. The lower limit is 2% of fetches-per-server, but never below 1. It also allows you to select what to do with the queries that are being limited - either drop them, or send back a SERVFAIL response.
Clearly, however, this is not what the code does. In adb.c, the last thing maybe_adjust_quota() does is:
	/* Ensure we don't drop to zero */
	if (addr->entry->quota == 0)
		addr->entry->quota = 1;
}
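For comparison, here is a minimal sketch of the lower bound as the KB article describes it. The helper names quota_floor() and clamp_quota() are hypothetical and do not exist in adb.c. The sketch also assumes fetches-per-server is non-zero, since 0 disables the quota.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in adb.c): the documented lower bound,
 * 2% of the configured fetches-per-server, but never below 1.
 */
static uint32_t
quota_floor(uint32_t fetches_per_server) {
	uint32_t floor = fetches_per_server / 50; /* 2% */

	return ((floor == 0) ? 1 : floor);
}

/*
 * Hypothetical helper: keep an adjusted quota within
 * [quota_floor(fetches_per_server), fetches_per_server].
 * Assumes fetches_per_server > 0.
 */
static uint32_t
clamp_quota(uint32_t quota, uint32_t fetches_per_server) {
	uint32_t floor = quota_floor(fetches_per_server);

	if (quota < floor) {
		return (floor);
	}
	if (quota > fetches_per_server) {
		return (fetches_per_server);
	}
	return (quota);
}
```

With "fetches-per-server 4000;" this would stop the quota at 80 rather than letting it fall to 1.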
The background to this, although very much a corner case, is a misconfigured server that responds to A queries but sends back nothing for AAAA queries for the same name (so those fetches time out). This interacts particularly badly with fetches-per-server because the 'good' queries all get answers that are cached, whereas the 'bad' ones all time out, return SERVFAIL to the client, and are not cached.
Turning on servfail cache would mitigate that to some extent.
But nevertheless, the quota going all the way down to 1 (instead of to 80, which is 2% of 4000) is making matters much worse.
Please fix. This corner case is not our problem as such (the misbehaving server is being fixed), but with the quota this low it is very hard to get enough queries processed to recalculate the atr often enough for it to be reasonably representative of the query rate to this server.
There was also no evidence of the low quota in the logging - presumably because it had been at rock bottom for longer than the logfile sample I looked at, which covered several hours. It therefore needed a cache dump to identify the problem.
(P.S. I'm assuming it's inefficient to calculate "2% of quota" every time we pass this way, so the bottom limit probably wants calculating initially to use here, and might well be something we want to add to adb on a per-server basis too, in anticipation of future work on fetches-per-server to allow for server-specific quota overrides)
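As a rough illustration of the P.S., the floor could be computed once and stored alongside the per-server limit. Everything below is hypothetical: struct server_quota, its fields, and both functions are invented for this sketch and are not the real ADB entry layout; override_limit stands in for a possible future server-specific setting, with 0 meaning "no override".

```c
#include <stdint.h>

/* Hypothetical per-server quota state; not the real ADB entry layout. */
struct server_quota {
	uint32_t limit; /* effective fetches-per-server for this server */
	uint32_t floor; /* cached lower bound: 2% of limit, at least 1 */
	uint32_t quota; /* current adjusted quota */
};

/*
 * Compute the floor once, when the limit is set, rather than on
 * every adjustment. Assumes the effective limit is non-zero.
 */
static void
server_quota_init(struct server_quota *sq, uint32_t global_limit,
		  uint32_t override_limit) {
	sq->limit = (override_limit != 0) ? override_limit : global_limit;
	sq->floor = sq->limit / 50; /* 2% */
	if (sq->floor == 0) {
		sq->floor = 1;
	}
	sq->quota = sq->limit;
}

/* Apply a newly calculated quota, clamped to [floor, limit]. */
static void
server_quota_set(struct server_quota *sq, uint32_t new_quota) {
	if (new_quota < sq->floor) {
		new_quota = sq->floor;
	} else if (new_quota > sq->limit) {
		new_quota = sq->limit;
	}
	sq->quota = new_quota;
}
```

Caching the floor next to the limit means a later per-server override only has to change the initialisation step; the adjustment path stays a pair of comparisons.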
Reference: https://support.isc.org/Ticket/Display.html?id=13720