[CVE-2023-6516] Specific recursive query patterns may lead to an out-of-memory condition
| Quick Links | |
| --- | --- |
| Incident Manager: | @michal |
| Deputy Incident Manager: | @chuck |
| Public Disclosure Date: | 2024-02-13 |
| CVSS Score: | 7.5 |
| Security Advisory: | isc-private/printing-press!80 |
| Mattermost Channel: | CVE-2023-6516: cache tree pruning may exhaust available memory |
| Support Ticket: | SF#1406 |
| Release Checklist: | #4515 (closed) & #4555 (closed) |
Earlier Than T-5

- 🔗 (IM) Pick a Deputy Incident Manager
- 🔗 (IM) Respond to the bug reporter
- 🔗 (SwEng) Ensure there are no public merge requests which inadvertently disclose the issue
- 🔗 (IM) Assign a CVE identifier
- 🔗 (SwEng) Update this issue with the assigned CVE identifier and the CVSS score
- 🔗 (SwEng) Determine the range of product versions affected (including the Subscription Edition)
- 🔗 (SwEng) Determine whether workarounds for the problem exist
- 🔗 (SwEng) If necessary, coordinate with other parties
- 🔗 (Support) Prepare "earliest" notification text and hand it off to Marketing
- 🔗 (Marketing) Update "earliest" notification document in SF portal and send bulk email to earliest customers
- 🔗 (Support) Create a merge request for the Security Advisory and include all readily available information in it
- 🔗 (SwEng) Prepare a private merge request containing a system test reproducing the problem
- 🔗 (SwEng) Notify Support when a reproducer is ready
- 🔗 (SwEng) Prepare a detailed explanation of the code flow triggering the problem
- 🔗 (SwEng) Prepare a private merge request with the fix
- 🔗 (SwEng) Ensure the merge request with the fix is reviewed and has no outstanding discussions
- 🔗 (Support) Review the documentation changes introduced by the merge request with the fix
- 🔗 (SwEng) Prepare backports of the merge request addressing the problem for all affected (and still maintained) branches of a given product
- 🔗 (Support) Finish preparing the Security Advisory
- 🔗 (QA) Create (or update) the private issue containing links to fixes & reproducers for all CVEs fixed in a given release cycle
- 🔗 (QA) (BIND 9 only) Reserve a block of CHANGES placeholders once the complete set of vulnerabilities fixed in a given release cycle is determined
- 🔗 (QA) Merge the CVE fixes in CVE identifier order
- 🔗 (QA) Prepare a standalone patch for the last stable release of each affected (and still maintained) product branch
- 🔗 (QA) Prepare ASN releases (as outlined in the Release Checklist)
At T-5

- 🔗 (Marketing) Update the text on the T-5 (from the Printing Press project) and "earliest" ASN documents in the SF portal
- 🔗 (Marketing) (BIND 9 only) Update the BIND -S information document in SF with download links to the new versions
- 🔗 (Marketing) Bulk email eligible customers to check the SF portal
- 🔗 (Marketing) (BIND 9 only) Send a pre-announcement email to the bind-announce mailing list to alert users that the upcoming release will include security fixes
At T-1

- 🔗 (First IM) Send notifications to OS packagers
On the Day of Public Disclosure

- 🔗 (IM) Grant QA & Marketing clearance to proceed with public release
- 🔗 (QA/Marketing) Publish the releases (as outlined in the release checklist)
- 🔗 (Support) (BIND 9 only) Add the new CVEs to the vulnerability matrix in the Knowledge Base
- 🔗 (Support) Bump Document Version for the Security Advisory and publish it in the Knowledge Base
- 🔗 (First IM) Send notification emails to third parties
- 🔗 (First IM) Advise MITRE about the disclosed CVEs
- 🔗 (First IM) Merge the Security Advisory merge request
- 🔗 (IM) Inform original reporter (if external) that the security disclosure process is complete
- 🔗 (Marketing) Update the SF portal to clear the ASN
- 🔗 (Marketing) Email ASN recipients that the embargo is lifted
After Public Disclosure

- 🔗 (QA) Merge a regression test reproducing the bug into all affected (and still maintained) branches
Version: 9.16.38-S1

Note: you might want to handle this case as a security bug.
We've noticed that BIND 9.16 (tested with 9.16.38-S1) can consume much more memory than max-cache-size under certain conditions. Specifically, it can be reproduced in my test environment as follows:
- run named with the attached configuration (named.conf)
- run another named instance on the same host (named-auth.conf and example.zone). The first instance forwards all recursive queries to the second instance.
- run the attached script (cachetest.py; you need Python and dnspython); a sketch of this kind of query driver is shown after these steps
- watch memory footprint of the first instance
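The attached script is not reproduced here, but a minimal sketch of the kind of query driver it implements might look like the following. The server address, port, and name pattern are illustrative assumptions, not the reporter's actual script; the point is that a stream of unique names produces constant cache misses, so the cache keeps adding leaf nodes while the overmem cleaner purges old ones:

```python
# Minimal sketch of a cachetest.py-style query driver (illustrative only;
# the address, port, and name pattern are assumptions, not the attached
# script). Querying a stream of unique names forces the cache to keep
# adding leaf nodes while the overmem cleaner purges older ones.
import random

import dns.exception
import dns.message
import dns.query

SERVER = "127.0.0.1"  # first (recursive) named instance
PORT = 5300           # assumed listen port


def main():
    while True:
        # A fresh random label per query ensures a cache miss every time.
        name = f"x{random.getrandbits(64):016x}.example."
        query = dns.message.make_query(name, "A")
        try:
            dns.query.udp(query, SERVER, port=PORT, timeout=1.0)
        except dns.exception.Timeout:
            pass  # keep hammering the cache even if a query times out


if __name__ == "__main__":
    main()
```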
While max-cache-size is set to 256MB, the process memory footprint grows well beyond that value. At around 1.3GB of process memory, the statistics channel indeed shows the cache using a lot more memory than 256MB:
```json
{
"id":"0x7fdad09fe630",
"name":"cache",
"references":8,
"total":7132065424,
"inuse":1009703195,
"maxinuse":1009703195,
"malloced":1021440461,
"maxmalloced":1021440461,
"pools":0,
"hiwater":234881024,
"lowater":201326592
}
```
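(For reference, the hiwater and lowater values above are exactly 7/8 and 3/4 of the configured 256MB max-cache-size: 268435456 × 7/8 = 234881024 and 268435456 × 3/4 = 201326592.) A small script along these lines can pull the same cache-context numbers from the statistics channel; it assumes statistics-channels is enabled on 127.0.0.1:8053 (adjust to your named.conf) and searches the JSON generically rather than depending on its exact layout:

```python
# Fetch the cache memory-context stats from named's statistics channel.
# Assumes statistics-channels is enabled on 127.0.0.1:8053 in named.conf;
# adjust STATS_URL to match your configuration.
import json
import urllib.request

STATS_URL = "http://127.0.0.1:8053/json/v1/mem"


def find_cache_context(node):
    """Recursively search the stats JSON for the 'cache' memory context."""
    if isinstance(node, dict):
        if node.get("name") == "cache" and "inuse" in node:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_cache_context(child)
        if found is not None:
            return found
    return None


with urllib.request.urlopen(STATS_URL) as resp:
    stats = json.load(resp)

cache = find_cache_context(stats)
if cache is not None:
    mib = 1024 * 1024
    print(f"cache inuse:   {int(cache['inuse']) / mib:8.1f} MiB")
    print(f"cache hiwater: {int(cache['hiwater']) / mib:8.1f} MiB")
```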
Also, rndc dumpdb indicates that only a handful of cache entries exist:

```
$ grep 192.0.2.1 named_dump.db | wc -l
13
```
And when we stop named, it takes about 3 minutes to complete shutdown:

```
20-Oct-2023 20:56:45.292 stopping command channel on 127.0.0.1#953
20-Oct-2023 20:59:49.988 exiting
```
Our analysis concluded that this happens because:

- a lot of "leaf" cache entries are purged due to an overmemory condition (the Python script's query pattern is chosen to cause this)
- a large number of "prune_tree" events are sent to the rbtdb's task
- these events are not handled fast enough, so many rbt nodes are kept in memory while even more are added by new queries
We are not entirely sure why the event handling is so slow, but we confirmed that a patch (attached as cache.patch) that prevents excessive sending of these task events avoids the problem.
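The patch is attached rather than inlined, but the coalescing idea described above can be modelled abstractly (made-up names, not BIND code): post a prune event only when none is already pending, so purging N leaf nodes queues at most one event rather than N.

```python
# Abstract model of event coalescing (made-up names, not BIND code).
# Instead of posting one prune event per purged leaf node, post a new
# event only when none is pending, so the task queue stays bounded no
# matter how quickly entries are purged.
from collections import deque


class PruneScheduler:
    def __init__(self):
        self.task_queue = deque()   # stands in for the rbtdb task's queue
        self.prune_pending = False  # one flag replaces N queued events
        self.nodes_to_prune = []

    def on_leaf_purged(self, node):
        self.nodes_to_prune.append(node)
        if not self.prune_pending:
            self.prune_pending = True
            self.task_queue.append(self.prune_event)

    def prune_event(self):
        # One event handles every node accumulated since the last run.
        self.prune_pending = False
        nodes, self.nodes_to_prune = self.nodes_to_prune, []
        for node in nodes:
            pass  # the real code would free the node here


sched = PruneScheduler()
for n in range(1_000_000):
    sched.on_leaf_purged(n)
print(len(sched.task_queue))  # 1, not 1,000,000
```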
Interestingly, BIND 9.18.19-S1 didn't show this problem in my experiment. I've not figured out why.
You'll probably want to prevent the problem in 9.16, either with the attached patch or in some other way. We'd also appreciate an explanation of why it doesn't happen in 9.18.