Serve-stale feature is not operationally useful - some suggestions for making it better
As described in Support ticket #16171:
The problem with serve-stale was (and still is after some testing on 9.16.1),
that every client that asks for e.g. "isc.org A" will all have to wait for 10
seconds before they get the stale answer. There seems to be no table of stale
resolvers so each time a request comes in, BIND seems to try the resolver
again to find out if it answers or not.
This really is not helpful - most clients will have given up and gone away and will never get a usable answer.
If the name is a popular one, then late arrivals stand a fighting chance of getting a response from stale cache before they give up, because of 'clients-per-query' and the fact that we attach any later clients waiting on the same query to the already-existing fetch. But the majority won't.
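To make the timing concrete, here is a rough sketch in Python (not BIND code). It assumes the 10-second wait from the ticket applies to the fetch as a whole, and that a client attached to an in-flight fetch only waits for whatever time remains. The function name and the 5-second client timeout used in the example are assumptions for illustration only.

```python
from typing import Optional

FETCH_TIMEOUT = 10.0  # seconds until the failed fetch falls back to stale data (figure from the ticket)


def wait_for_stale(arrival_offset: float, client_timeout: float) -> Optional[float]:
    """Return how long a client waits for the stale answer, or None if it gives up first.

    arrival_offset: seconds after the fetch started that this client's query arrived.
    client_timeout: how long the client is prepared to wait (assumed; varies by stub resolver).
    """
    if not 0.0 <= arrival_offset <= FETCH_TIMEOUT:
        raise ValueError("this sketch only models clients arriving while the fetch is in flight")
    remaining = FETCH_TIMEOUT - arrival_offset
    return remaining if remaining <= client_timeout else None
```

With an assumed 5-second client timeout, `wait_for_stale(0.0, 5.0)` returns `None` (the first client gives up), while `wait_for_stale(7.0, 5.0)` returns `3.0` (a late arrival gets the stale answer).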
See also #1688 - we haven't documented very thoroughly how this works anyway, and we certainly have not documented how it interacts with fetch-limits and other resolver-protecting features.
Here's a sample config that was being used for testing:
stale-answer-enable yes;
stale-answer-ttl 600;
max-stale-ttl 1w;
Nothing there provides a configurable period of 'staleness', so that once a refresh attempt has failed, the server can immediately serve the stale content to any clients who come along later instead of repeating the refresh attempt (and likely failing again).
I think the issue is that although we do have some control over how stale an answer can be before we stop serving it, we haven't thought sufficiently about how long clients will be prepared to wait for a query response if we have to attempt to refresh and then fail for each client (or set of clients) when queried.
Note: I do not think we should immediately serve stale answers whenever there's cache content available that has recently expired - this is not what we're trying to achieve. The idea of serve-stale as the converse of pre-fetch ('post-fetch'?) is somehow terribly tempting because it feels like it would be faster and a better experience for the clients, plus there's this nice symmetry with pre-fetch logic. But I think it's wrong - and would absolutely break how we handle TTL=0 answers today. Authoritative server operators expect resolvers to come back to them as soon as their cached content expires. We should not skip this step.
But what would be more helpful (to both clients and servers) when authoritative servers are not responding is a way to flag a stale answer with the timestamp of the last failed refresh attempt. If a client then queries the same name again within a suitable time period (configurable? Something like 10s feels like a good default here), the stale answer gets used right away.
Doing this preserves resolver resources (and anyway, if we couldn't resolve this name 1s ago, why try again immediately when we've got something usable-but-stale in cache that we could use instead?).
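The proposal can be sketched roughly as follows. This is a minimal Python model, not BIND code: the class and field names, the `retry_holdoff` parameter, the `fetch` callback signature and the placeholder address are all hypothetical, and the one-week default simply mirrors the `max-stale-ttl 1w` from the sample config. Expired entries (including TTL=0) still trigger a refresh first; the stale answer is only served immediately when a refresh has failed within the hold-off window.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    value: str
    expires_at: float                            # when the TTL runs out
    stale_until: float                           # expires_at plus the maximum stale period
    last_failed_refresh: Optional[float] = None  # proposed: time of the last failed refresh


class StaleCache:
    def __init__(self, fetch: Callable[[str], Tuple[str, float]],
                 retry_holdoff: float = 10.0,    # hypothetical knob, 10s as suggested above
                 max_stale: float = 604800.0,    # one week, as in the sample config
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch                       # returns (value, ttl) or raises TimeoutError
        self.retry_holdoff = retry_holdoff
        self.max_stale = max_stale
        self.clock = clock
        self.entries: Dict[str, Entry] = {}

    def put(self, name: str, value: str, ttl: float) -> None:
        now = self.clock()
        self.entries[name] = Entry(value, now + ttl, now + ttl + self.max_stale)

    def lookup(self, name: str) -> Optional[str]:
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and now < entry.expires_at:
            return entry.value                   # still fresh
        usable_stale = entry is not None and now < entry.stale_until
        if (usable_stale and entry.last_failed_refresh is not None
                and now - entry.last_failed_refresh < self.retry_holdoff):
            return entry.value                   # recent failure: serve stale right away
        try:
            value, ttl = self.fetch(name)        # otherwise always try the authoritative servers
        except TimeoutError:
            if usable_stale:
                entry.last_failed_refresh = self.clock()  # remember when the refresh failed
                return entry.value
            return None
        self.put(name, value, ttl)
        return value
```

In this model only the first client after expiry pays for the failed fetch. Clients arriving inside the hold-off window get the stale answer without another attempt, and the next query after the window closes retries the authoritative servers.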