[ISC-support #17264] ADB overmem condition and cleaning - very difficult to detect and causes erratic behavior
The ADB's mctx size is set to 1/8 of the max-cache-size, if set. This is the only means to control the ADB memory limit. There is also a hard-coded maximum ADB size applied to ADBs for views that share a cache.
When it goes overmem, the ADB starts removing names and entries. The strategy for removing entries doesn't seem to be tied strongly to utility.
This can lead to erratic behavior as BIND is constantly forgetting information about server SRTTs, EDNS capabilities, and other useful data.
In some cases, if not all of the entries for servers associated with a zone are affected by the overmem purge, this can cause the resolver to fixate on a small subset of the servers authoritative for the zone - and not necessarily the subset with the best SRTT.
There is no logging at any level related to ADB overmem activities, nor are there any stats directly related to ADB memory usage.
There are stats for counts of names and entries, along with the number of buckets for each type, but there's no reliable way to map those to memory usage.
The stats channel does contain detail for the ADB memory contexts, but there's no reliable way to map those memory contexts to a particular view.
It seems likely that most of the time the symptoms of an overmem ADB will be minor and nearly impossible to directly measure - small delays and increases in CPU usage associated with the repeated creation and destruction of ADB entries and/or fixation on suboptimal upstream servers - but will definitely degrade the quality of service that the resolver is providing.
This behavior was noticed by a customer when their monitoring zone happened to, by chance, be negatively affected.
Most of the symptoms described here are theoretical, based on my understanding of the code and various customer-described, but unreproducable and otherwise unexplained, behaviors.
This issue is a feature request covering:
- specific testing by ISC to better understand the impact and range of behaviors with the ADB is overmem #2441
- additional logging related to ADB overmem activities #2435
- additional stats/metrics relating to ADB overmem #2436
- improvements to ADB overmem behavior (ideally based on some utility metric) #2437
- ability to directly control ADB size independent of cache size #2438
- revisit the hard-coded shared-cache maximum ADB size (e.g. remove in favor of configuration) #2439
- system tests related to any/all items above