Skip to content

GitLab

  • Menu
Projects Groups Snippets
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in / Register
  • BIND BIND
  • Project information
    • Project information
    • Activity
    • Labels
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 527
    • Issues 527
    • List
    • Boards
    • Service Desk
    • Milestones
  • Merge requests 96
    • Merge requests 96
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Deployments
    • Deployments
    • Environments
    • Releases
  • Packages & Registries
    • Packages & Registries
    • Container Registry
  • Monitor
    • Monitor
    • Incidents
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Repository
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • ISC Open Source Projects
  • BINDBIND
  • Issues
  • #2495
Closed
Open
Created Feb 16, 2021 by Cathy Almond@cathyaDeveloper

Mitigation for cache bloat due to negative RRsets (NXDOMAIN and NXRRSET) when preserving expired RRsets using max-stale-ttl

I'm tagging this as 'Customer' because it is affecting/has affected several customer caches (unanticipated cache bloat) with 'stale-cache-enable yes;'


By design, negative cache content is not used in the same way as positive cache content. 'stale-answer-client-timeout' does not apply to negative content, thus the clients will timeout and likely retry the same query several times before giving up, long before the resolver-query-timeout. Only when named has failed to get a response from the authoritative server(s) does the 'stale-refresh-time' window commence.

There is a good reason for this - when querying the authoritative servers for a name that is already NXDOMAIN or NXRRSET in cache, we don't know if this is going to be replaced with positive content or not. Therefore, to ensure that positive content is preferred, we have to wait to be sure that it's sensible to serve the stale negative content as a last resort.

UNFORTUNATELY:

  1. Most negative content is single-use - so adding max-stale-ttl to its retention period can significantly increase cache bloat - for no useful reason whatsoever.

  2. Most negative content has (also by design of sane zone administrators, or overridden by max-ncache-ttl) a much shorter TTL in cache than positive content.

So when we're not preserving stale content, the negative content is fairly quickly removed from cache, in comparison with the positive RRsets. But adding max-stale-ttl to the retention period can quite significantly tilt the balance in favour of all of this one-use-only cached negative content.

We don't have any statistics that measure cache hits for different RTYPEs, so we don't know for sure, what percentage of negative content is used again (cache hits) versus just sitting there for no good reason (cache misses) - perhaps we should? (Perhaps we should also have stats on stale hits and misses by RTYPE?)

But in any case, having seen too many caches overloaded with stale negative content, I should like to propose an option that can be used either to shorten the max-stale-ttl for negative content, but that also takes a value of zero, to disable its retention entirely.

Assignee
Assign to
Time tracking