Do a better job of logging when fetches-per-zone is triggered

Based on a lot of trawling and deduction about what could be happening, in Support ticket #20945...

BLAT (bottom line at the top) - I originally thought this had to be a bug in the logging or the maintenance of the counters, but I was wrong - we just, and particularly in this situation, don't really log anything useful for the administrator. We could do better!

Detail:

We are presented with a sequence of log messages about fetches-per-zone having been triggered for the com domain - should we worry about them?

/var/log/named/default.2:15-Jul-2022 02:12:28.654 spill: info: too many simultaneous fetches for com (allowed 1162 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:12:33.165 spill: info: too many simultaneous fetches for com (allowed 2382 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:12:41.151 spill: info: too many simultaneous fetches for com (allowed 1522 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:12:49.593 spill: info: too many simultaneous fetches for com (allowed 1357 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:12:57.290 spill: info: too many simultaneous fetches for com (allowed 904 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:13:05.679 spill: info: too many simultaneous fetches for com (allowed 3854 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:13:10.944 spill: info: too many simultaneous fetches for com (allowed 912 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:13:36.487 spill: info: too many simultaneous fetches for com (allowed 5206 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:13:37.712 spill: info: too many simultaneous fetches for com (allowed 762 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:13:47.819 spill: info: too many simultaneous fetches for com (allowed 760 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:13:49.927 spill: info: too many simultaneous fetches for com (allowed 906 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:13:54.215 spill: info: too many simultaneous fetches for com (allowed 1757 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:14:05.718 spill: info: too many simultaneous fetches for com (allowed 1444 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:14:21.880 spill: info: too many simultaneous fetches for com (allowed 2159 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:14:25.377 spill: info: too many simultaneous fetches for com (allowed 1708 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:14:34.921 spill: info: too many simultaneous fetches for com (allowed 887 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:14:56.110 spill: info: too many simultaneous fetches for com (allowed 1950 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:14:58.813 spill: info: too many simultaneous fetches for com (allowed 2773 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:15:08.737 spill: info: too many simultaneous fetches for com (allowed 1997 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:15:11.528 spill: info: too many simultaneous fetches for com (allowed 1210 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:15:17.300 spill: info: too many simultaneous fetches for com (allowed 1281 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:15:19.483 spill: info: too many simultaneous fetches for com (allowed 941 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:15:29.511 spill: info: too many simultaneous fetches for com (allowed 2122 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:15:36.315 spill: info: too many simultaneous fetches for com (allowed 960 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:15:38.065 spill: info: too many simultaneous fetches for com (allowed 1153 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:16:11.206 spill: info: too many simultaneous fetches for com (allowed 810 spilled 1)
/var/log/named/default.2:15-Jul-2022 02:16:12.717 spill: info: too many simultaneous fetches for com (allowed 897 spilled 1)

It is mighty peculiar that we only ever log a single spill (=dropped fetch). Usually when we see a zone being rate limited, we see logging every 60s or so, and the counters going up and up.

But read on ...

The way fetches-per-zone and its logging works is this.

There is a structure that is created (and which persists for as long as there are any outstanding fetches) for each zone that we might want to limit - in this case com.
The structure has in it:
- count - a gauge that is incremented or decremented and reflects how many outstanding fetches there are in progress
- allowed - a counter that is only incremented - it goes up each time we allow a fetch to proceed
- dropped - a counter that is only incremented - it goes up each time we decide to drop a fetch
- logged - the timestamp of when we last logged that we dropped anything
The structure goes away once 'count' has gone back down to zero
We ONLY think about logging if 'dropped' is incremented. We also only log (having incremented 'dropped') if it's more than 60s since the last time we logged for this zone.

COROLLARY: We ALWAYS log the first drop for a zone that is being limited by fetches-per-zone, but thereafter, will only log at most every 60s for the duration of the existence of this specific counting structure.

====

Now looking at the logs for com - I see that:

a) We're only ever logging a spill (drop) of 1. This is the first drop we did against com for this fetch-limits counting structure.

b) We never see anything bigger than 1 being logged. This can't be because we never dropped anything ever again because: - We only think about logging WHEN we drop something - therefore if the counting structure persisted, the next time we log would be because we dropped another fetch - but it's still logging only 1! - Even if we think about logging, we do it at most every 60s, yet we're logging more frequently than that, which means that the counting structure has to have gone away and then been recreated for com, for us to be able to log more frequently.

c) See b) - we're logging more frequently than 60s...

====

My hypothesis on what's happening therefore is that there are periodic spikes in getting query responses from com, but between times, all is normal (and/or all is in cache) so the outstanding fetches count goes back to zero and the counting structure is deleted.

The logging is happening more frequently than 60s (and we only ever see "spilled 1" which means that we see only the fist time we drop a fetch to the com servers.

We may well drop a bunch more (and have no way of knowing that we did from the logs), but whatever is causing us to drop them, subsides before the 60s is up (after which we'd log again the next time we drop one) - and moreover, subsides so much so, that the count of outstandings went back to zero and we deleted the fetches counter structure entirely!

====

What I don't know, is how many drops this server actually did, before things went back to normal, because we deleted the structure that didn't the counting before we reached 60s after the first log message.

Also, now that I understand it, it's really annoying that I'm seeing the first instance of a drop being logged as if it contains useful facts about how many fetches we're spilling. Nope - that's not it. What we're logging is "fetches-per-zone has been triggered for the first time for com, and we dropped our first fetch, we might drop a few more, who knows? ..."

My suggestion for an improved logging strategy is something like this:

a) First time we log, we say something like:

too many simultaneous fetches for com (allowed 1997 spilled 1; fetches-per-zone initial trigger event)

b) Subsequent logging (based on 'we dropped something else AND it's more than 60s since we last logged') should look moreorless the same:

too many simultaneous fetches for com (allowed 6423 spilled 24; cumulative since initial trigger event)

c) New, when we delete the counting structure and IFF the spill counter>0, we have a new log emitted:

fetch counters for com now being discarded (allowed 2345 spilled 6; cumulative since initial trigger event)

Or something like that.

Note that we only want to emit the new log IF we ever emitted the first one. Also that we can't log unless something triggers it (in the case of the other logs, the trigger is that we spill something, this last one is that we destroy the counter structure).

Obviously we don't want to log every time we hit 60s if we're no longer spilling. Also this means that we might not log that final event until after the server has been running for hours, or if we're shutting it down, if we're no longer spilling. But it's better than nothing. And it at least gives us a final count of 'we spilled this many..', which is the piece of information that was woefully lacking in the log sequence I investigated above.

And yes, I agree that logging meaningfully and useful for fetches-per-zone is hard, plus there's a balance to be struck between too much and too little.

I'm also delighted that we can now dump all the information about the current fetch-limits status 'now', should we want to:

#665 (closed)

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information