[CVE-2023-6516] Specific recursive query patterns may lead to an out-of-memory condition
| Quick Links | |
| --- | --- |
| Incident Manager: | @michal |
| Deputy Incident Manager: | @chuck |
| Public Disclosure Date: | 2024-02-13 |
| CVSS Score: | 7.5 |
| Security Advisory: | isc-private/printing-press!80 |
| Mattermost Channel: | CVE-2023-6516: cache tree pruning may exhaust available memory |
| Support Ticket: | SF#1406 |
| Release Checklist: | #4515 (closed) & #4555 (closed) |
Earlier Than T-5

- 🔗 (IM) Pick a Deputy Incident Manager
- 🔗 (IM) Respond to the bug reporter
- 🔗 (SwEng) Ensure there are no public merge requests which inadvertently disclose the issue
- 🔗 (IM) Assign a CVE identifier
- 🔗 (SwEng) Update this issue with the assigned CVE identifier and the CVSS score
- 🔗 (SwEng) Determine the range of product versions affected (including the Subscription Edition)
- 🔗 (SwEng) Determine whether workarounds for the problem exist
- 🔗 (SwEng) If necessary, coordinate with other parties
- 🔗 (Support) Prepare "earliest" notification text and hand it off to Marketing
- 🔗 (Marketing) Update "earliest" notification document in SF portal and send bulk email to earliest customers
- 🔗 (Support) Create a merge request for the Security Advisory and include all readily available information in it
- 🔗 (SwEng) Prepare a private merge request containing a system test reproducing the problem
- 🔗 (SwEng) Notify Support when a reproducer is ready
- 🔗 (SwEng) Prepare a detailed explanation of the code flow triggering the problem
- 🔗 (SwEng) Prepare a private merge request with the fix
- 🔗 (SwEng) Ensure the merge request with the fix is reviewed and has no outstanding discussions
- 🔗 (Support) Review the documentation changes introduced by the merge request with the fix
- 🔗 (SwEng) Prepare backports of the merge request addressing the problem for all affected (and still maintained) branches of a given product
- 🔗 (Support) Finish preparing the Security Advisory
- 🔗 (QA) Create (or update) the private issue containing links to fixes & reproducers for all CVEs fixed in a given release cycle
- 🔗 (QA) (BIND 9 only) Reserve a block of CHANGES placeholders once the complete set of vulnerabilities fixed in a given release cycle is determined
- 🔗 (QA) Merge the CVE fixes in CVE identifier order
- 🔗 (QA) Prepare a standalone patch for the last stable release of each affected (and still maintained) product branch
- 🔗 (QA) Prepare ASN releases (as outlined in the Release Checklist)
At T-5

- 🔗 (Marketing) Update the text on the T-5 (from the Printing Press project) and "earliest" ASN documents in the SF portal
- 🔗 (Marketing) (BIND 9 only) Update the BIND -S information document in SF with download links to the new versions
- 🔗 (Marketing) Bulk email eligible customers to check the SF portal
- 🔗 (Marketing) (BIND 9 only) Send a pre-announcement email to the bind-announce mailing list to alert users that the upcoming release will include security fixes
At T-1

- 🔗 (First IM) Send notifications to OS packagers
On the Day of Public Disclosure

- 🔗 (IM) Grant QA & Marketing clearance to proceed with public release
- 🔗 (QA/Marketing) Publish the releases (as outlined in the release checklist)
- 🔗 (Support) (BIND 9 only) Add the new CVEs to the vulnerability matrix in the Knowledge Base
- 🔗 (Support) Bump Document Version for the Security Advisory and publish it in the Knowledge Base
- 🔗 (First IM) Send notification emails to third parties
- 🔗 (First IM) Advise MITRE about the disclosed CVEs
- 🔗 (First IM) Merge the Security Advisory merge request
- 🔗 (IM) Inform original reporter (if external) that the security disclosure process is complete
- 🔗 (Marketing) Update the SF portal to clear the ASN
- 🔗 (Marketing) Email ASN recipients that the embargo is lifted
After Public Disclosure

- 🔗 (QA) Merge a regression test reproducing the bug into all affected (and still maintained) branches
Version: 9.16.38-S1

Note: you might want to handle this case as a security bug.
We've noticed that BIND 9.16 (tested with 9.16.38-S1) can consume much more memory than max-cache-size under certain conditions. Specifically, it can be reproduced in my test environment as follows:
- run named with the attached configuration (named.conf)
- run another named instance on the same host (named-auth.conf and example.zone). The first instance forwards all recursive queries to the second instance.
- run the attached script (cachetest.py; you need Python and dnspython); a sketch of this kind of query driver is shown after these steps
- watch memory footprint of the first instance
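The attached script is not reproduced here, but a minimal sketch of the kind of query driver it implements might look like the following. The server address, port, and name pattern are illustrative assumptions, not the reporter's actual script; the point is that a stream of unique names produces constant cache misses, so the cache keeps adding leaf nodes while the overmem cleaner purges old ones:

```python
# Minimal sketch of a cachetest.py-style query driver (illustrative only;
# the address, port, and name pattern are assumptions, not the attached
# script). Querying a stream of unique names forces the cache to keep
# adding leaf nodes while the overmem cleaner purges older ones.
import random

import dns.exception
import dns.message
import dns.query

SERVER = "127.0.0.1"  # first (recursive) named instance
PORT = 5300           # assumed listen port


def main():
    while True:
        # A fresh random label per query ensures a cache miss every time.
        name = f"x{random.getrandbits(64):016x}.example."
        query = dns.message.make_query(name, "A")
        try:
            dns.query.udp(query, SERVER, port=PORT, timeout=1.0)
        except dns.exception.Timeout:
            pass  # keep hammering the cache even if a query times out


if __name__ == "__main__":
    main()
```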
While max-cache-size is set to 256MB, the process memory footprint grows well beyond that value. At around 1.3GB of process memory, the statistics channel indeed shows the cache using a lot more memory than 256MB:
```json
{
"id":"0x7fdad09fe630",
"name":"cache",
"references":8,
"total":7132065424,
"inuse":1009703195,
"maxinuse":1009703195,
"malloced":1021440461,
"maxmalloced":1021440461,
"pools":0,
"hiwater":234881024,
"lowater":201326592
}
```
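(For reference, the hiwater and lowater values above are exactly 7/8 and 3/4 of the configured 256MB max-cache-size: 268435456 × 7/8 = 234881024 and 268435456 × 3/4 = 201326592.) A small script along these lines can pull the same cache-context numbers from the statistics channel; it assumes statistics-channels is enabled on 127.0.0.1:8053 (adjust to your named.conf) and searches the JSON generically rather than depending on its exact layout:

```python
# Fetch the cache memory-context stats from named's statistics channel.
# Assumes statistics-channels is enabled on 127.0.0.1:8053 in named.conf;
# adjust STATS_URL to match your configuration.
import json
import urllib.request

STATS_URL = "http://127.0.0.1:8053/json/v1/mem"


def find_cache_context(node):
    """Recursively search the stats JSON for the 'cache' memory context."""
    if isinstance(node, dict):
        if node.get("name") == "cache" and "inuse" in node:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_cache_context(child)
        if found is not None:
            return found
    return None


with urllib.request.urlopen(STATS_URL) as resp:
    stats = json.load(resp)

cache = find_cache_context(stats)
if cache is not None:
    mib = 1024 * 1024
    print(f"cache inuse:   {int(cache['inuse']) / mib:8.1f} MiB")
    print(f"cache hiwater: {int(cache['hiwater']) / mib:8.1f} MiB")
```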
Also, rndc dumpdb indicates that only a handful of cache entries exist:

```
$ grep 192.0.2.1 named_dump.db | wc -l
13
```
And when we stop named, it takes about 3 minutes to complete shutdown:

```
20-Oct-2023 20:56:45.292 stopping command channel on 127.0.0.1#953
20-Oct-2023 20:59:49.988 exiting
```
Our analysis concluded that this happens because:

- a lot of "leaf" cache entries are purged due to an overmemory condition (the Python script's query pattern is chosen to cause this)
- a large number of "prune_tree" events are sent to the rbtdb's task
- these events are not handled fast enough, so many rbt nodes are kept in memory while even more are added by new queries
We are not entirely sure why the event handling is so slow, but we confirmed that a patch (attached as cache.patch) that prevents excessive sending of these task events avoids the problem.
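The patch is attached rather than inlined, but the coalescing idea described above can be modelled abstractly (made-up names, not BIND code): post a prune event only when none is already pending, so purging N leaf nodes queues at most one event rather than N.

```python
# Abstract model of event coalescing (made-up names, not BIND code).
# Instead of posting one prune event per purged leaf node, post a new
# event only when none is pending, so the task queue stays bounded no
# matter how quickly entries are purged.
from collections import deque


class PruneScheduler:
    def __init__(self):
        self.task_queue = deque()   # stands in for the rbtdb task's queue
        self.prune_pending = False  # one flag replaces N queued events
        self.nodes_to_prune = []

    def on_leaf_purged(self, node):
        self.nodes_to_prune.append(node)
        if not self.prune_pending:
            self.prune_pending = True
            self.task_queue.append(self.prune_event)

    def prune_event(self):
        # One event handles every node accumulated since the last run.
        self.prune_pending = False
        nodes, self.nodes_to_prune = self.nodes_to_prune, []
        for node in nodes:
            pass  # the real code would free the node here


sched = PruneScheduler()
for n in range(1_000_000):
    sched.on_leaf_purged(n)
print(len(sched.task_queue))  # 1, not 1,000,000
```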
Interestingly, BIND 9.18.19-S1 didn't show this problem in my experiment. I've not figured out why.
You'll probably want to prevent the problem in 9.16, either with the attached patch or in some other way. We'd also appreciate an explanation of why it doesn't happen in 9.18.