warning: checkhints: unable to get root NS rrset from cache: not found
Periodically we see reports of resolvers that are failing to respond to clients successfully - perhaps with a build-up of recursive clients, inbound UDP packet drops, late or missing query responses, and so on. Rebooting the server entirely usually fixes the problem - for a time. Flushing cache may also buy some relief, but generally this does not last as long as if the server is rebooted entirely.
Plus there is one more symptom in the logs - repeated spates of messages like this:
31-May-2021 16:08:38.110 general: warning: checkhints: unable to get root NS rrset from cache: not found
31-May-2021 16:08:41.110 general: warning: checkhints: unable to get root NS rrset from cache: not found
31-May-2021 16:08:42.151 general: warning: checkhints: unable to get root NS rrset from cache: not found
This error message occurs when the root nameservers have just been primed, but by the time checkhints goes to look at them, they're no longer available in cache (they have expired, and possibly also been removed) - all within a very short period of time.
Reports of this have been seen intermittently for many years and from many versions of BIND 9. Typically (in the older reports) this was a rare occurrence seen on a resolver that had been running for a long time - months, possibly years. Therefore, after rebooting, the error and the problem were never seen again (or at least not within the shelf-life of the admin who originally reported it to us).
We suspect that what is happening is that the cache structure and content have become unmaintainable over a long period of content being added, expired, and removed, and that it has become impossible to add new RRsets to cache without evicting existing content because of max-cache-size. The cache tree structure itself also occupies memory, and we've seen a few instances where a long-lived cache has become 'straggly' but also sparsely populated.
What we haven't been able to catch (yet) is the exact path taken that causes this error to be logged, although we have been hoping that improved stats, along with a 'catch it earlier' assertion (the server needs to be restarted anyway once it has reached this state), might help. See #2082 (closed).
We have also seen that in one or two instances of this warning being logged, there was additionally a problem reaching some of the root nameserver addresses listed in the root hints and used for priming. Either the root hints were out of date and an older IP address was unreachable, or there were local routing issues (typically IPv6-related) reaching some root server addresses. This shouldn't be a problem, given the way that root hints priming is designed, but 'fixing' the root hints appears to have made the problem go away in some instances, as has fixing the routing or unreachability of some root hint addresses.
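If you want to rule this out, a quick sanity check (assuming dig is available) is to send a priming-style query directly to each address in your hints file, and to compare the file with the currently published one:

    # Priming query, much as named would send it, to a.root-servers.net
    dig +norecurse @198.41.0.4 . NS

    # Repeat for each hint address, including the IPv6 ones, e.g.:
    dig +norecurse @2001:503:ba3e::2:30 . NS

    # Current published hints, for comparison with your local file:
    # https://www.internic.net/domain/named.root

Every address should answer promptly with the root NS RRset; any that time out point at stale hints or a routing problem.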
For anyone experiencing this problem for the first time, the likelihood is that one or more things have changed in your operating environment, and that these are causing cache content to be more substantial than before, or potentially distributed differently. For example:
- Installing a version of BIND that has stale-cache-enable yes by default (see the snippet after this list)
- An increase in client queries overall
- Client query patterns changing - perhaps causing a higher rate than usual of cached negative responses
- An increase in dual-stack clients querying for AAAA records
- An increase in client querying for HTTPS records
- A new client application that uses DNS-based probing
- Clients using a tunnelling-over-DNS service
- Using a client filtering service that works by first resolving a modified copy of the original client query (with another private zone name appended to it) and checking the response status before allowing the original query to pass - thus adding the filtering RRsets to cache as well as the actual client query responses.
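On the first item: if serve-stale is not needed in your environment, turning it off removes one source of extra cache content. A minimal named.conf sketch, assuming an otherwise stock configuration:

    options {
        // Don't retain expired RRsets in cache for serve-stale
        // (newer BIND 9 releases default this to "yes")
        stale-cache-enable no;
    };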
Currently, clues may be found in the BIND statistics and also in a dump of cache.
Firstly, these counters (available either in the output of rndc stats or via the XML or JSON statistics interface) can be a good indicator that too much cache-cleaning is taking place due to memory pressure, as opposed to normal RRset TTL expiration:
DeleteLRU - "cache records deleted due to memory exhaustion"
DeleteTTL - "cache records deleted due to TTL expiration"
These are cumulative counters, so although seeing DeleteLRU far exceed DeleteTTL in a single snapshot of the stats is a good indicator that all is not well with cache, ideally you want to monitor the trend over time.
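As a sketch, assuming the statistics channel has been enabled in named.conf and jq is installed (the JSON path below matches recent BIND 9 releases, but verify it against your own output):

    # In named.conf:
    # statistics-channels { inet 127.0.0.1 port 8080 allow { 127.0.0.1; }; };

    curl -s http://127.0.0.1:8080/json/v1 |
      jq '.views._default.resolver.cachestats | {DeleteLRU, DeleteTTL}'

Sampling this pair regularly and graphing the rate of change makes a memory-pressure problem far easier to spot than any single snapshot.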
Also worth watching are these:
HeapMemInUse - "cache heap memory in use"
TreeMemInUse - "cache tree memory in use"
HeapMemMax - "cache heap highest memory in use"
TreeMemMax - "cache tree highest memory in use"
All of the above are gauges - they tell you 'this is where we are now' - so a snapshot can be useful, as well as monitoring the pattern over time. The 'Max' values are high-water marks.
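A minimal way to capture that pattern, using the same statistics channel as above (jq's now builtin supplies a timestamp):

    while true; do
      curl -s http://127.0.0.1:8080/json/v1 |
        jq -r '[now, (.views._default.resolver.cachestats |
                .HeapMemInUse, .TreeMemInUse)] | @csv'
      sleep 60
    done >> cache-mem.csv

A few days of samples appended to a CSV like this will show whether the in-use figures are creeping steadily towards max-cache-size.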
Aside: don't be tempted to look at either of these - they are not useful operationally and aren't counting what you might think they are from their names:
HeapMemTotal - "cache heap memory total"
TreeMemTotal - "cache tree memory total"
And finally, there are counters available of what's in cache currently, by RRtype. These are prefixed with ! for counters of NXRRSET (a pseudo-RR indicating that the queried name existed but the type didn't), with # for stale content, and with ~ for content that has expired and is waiting on housekeeping/deletion.
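These per-RRtype counts also appear in the JSON statistics in recent BIND 9 releases; assuming the same statistics channel as above, something like this shows them at a glance (again, verify the exact path against your own output):

    curl -s http://127.0.0.1:8080/json/v1 |
      jq '.views._default.resolver.cache'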
If there is any kind of unexpected skew, it might be worth dumping cache to see what's in there.
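Dumping cache does not interrupt service; by default rndc dumpdb writes to named_dump.db in named's working directory:

    rndc dumpdb -cache            # cache contents for all views
    rndc dumpdb -cache _default   # or name a specific view

The dump is in (roughly) zone-file format, so ordinary text tools are enough to count names, RRtypes, or unexpectedly large subtrees.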
And then decide: is it just that max-cache-size is now insufficient, or does something else need to be done to reduce cache content?
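If you conclude the limit is simply too small for the current query load, raising it is a one-line change followed by rndc reconfig (the value here is illustrative - size it against your own memory budget):

    options {
        max-cache-size 2g;    // or a percentage of available memory, e.g. 90%
    };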