[CVE-2024-0760] Flood of DNS messages over a single TCP/DoT connection makes server unusable
Quick Links | |
---|---|
Incident Manager: | @aram |
Deputy Incident Manager: | @dankney |
Public Disclosure Date: | 2024-07-23 |
CVSS Score: | 7.5 |
Security Advisory: | isc-private/printing-press!98 |
Mattermost Channel: | CVE-2024-0760 |
Support Ticket: | N/A |
Release Checklist: | #4735 (closed) |
Earlier Than T-5
- 🔗 (IM) Pick a Deputy Incident Manager
- 🚫 🔗 (IM) Respond to the bug reporter - found internally by @pspacek
- 🔗 (SwEng) Ensure there are no public merge requests which inadvertently disclose the issue
- 🔗 (IM) Assign a CVE identifier
- 🔗 (SwEng) Update this issue with the assigned CVE identifier and the CVSS score
- 🔗 (SwEng) Determine the range of product versions affected (including the Subscription Edition)
- 🔗 (SwEng) Determine whether workarounds for the problem exist
- 🔗 (SwEng) If necessary, coordinate with other parties
- 🔗 (Support) Prepare "earliest" notification text and hand it off to Marketing
- 🔗 (Marketing) Update "earliest" notification document in SF portal and send bulk email to earliest customers
- 🔗 (Support) Create a merge request for the Security Advisory and include all readily available information in it
- 🔗 (SwEng) Prepare a private merge request containing a system test reproducing the problem
- 🔗 (SwEng) Notify Support when a reproducer is ready
- 🔗 (SwEng) Prepare a detailed explanation of the code flow triggering the problem
- 🔗 (SwEng) Prepare a private merge request with the fix
- 🔗 (SwEng) Ensure the merge request with the fix is reviewed and has no outstanding discussions
- 🔗 (Support) Review the documentation changes introduced by the merge request with the fix
- 🔗 (SwEng) Prepare backports of the merge request addressing the problem for all affected and still maintained branches of a given product (bind-9.18)
- 🔗 (Support) Finish preparing the Security Advisory
- 🔗 (QA) Create (or update) the private issue containing links to fixes & reproducers for all CVEs fixed in a given release cycle
- 🔗 (QA) (BIND 9 only) Reserve a block of CHANGES placeholders once the complete set of vulnerabilities fixed in a given release cycle is determined
- 🔗 (QA) Merge the CVE fixes in CVE identifier order
- 🔗 (QA) Prepare a standalone patch for the last stable release of each affected (and still maintained) product branch
- 🔗 (QA) Prepare ASN releases (as outlined in the Release Checklist)
At T-5
- 🔗 (Marketing) Update the text on the T-5 (from the Printing Press project) and "earliest" ASN documents in the SF portal
- 🔗 (Marketing) (BIND 9 only) Update the BIND -S information document in SF with download links to the new versions
- 🔗 (Marketing) Bulk email eligible customers to check the SF portal
- 🔗 (Marketing) (BIND 9 only) Send a pre-announcement email to the bind-announce mailing list to alert users that the upcoming release will include security fixes
At T-1
- 🔗 (First IM) Send notifications to OS packagers
On the Day of Public Disclosure
- 🔗 (IM) Grant QA & Marketing clearance to proceed with public release
- 🔗 (QA/Marketing) Publish the releases (as outlined in the release checklist)
- 🔗 (Support) (BIND 9 only) Add the new CVEs to the vulnerability matrix in the Knowledge Base
- 🔗 (Support) Bump Document Version for the Security Advisory and publish it in the Knowledge Base
- 🔗 (First IM) Send notification emails to third parties
- 🔗 (First IM) Advise MITRE about the disclosed CVEs
- 🔗 (First IM) Merge the Security Advisory merge request
- 🚫 🔗 (IM) Inform original reporter (if external) that the security disclosure process is complete
- 🔗 (Marketing) Update the SF portal to clear the ASN
- 🔗 (Marketing) Email ASN recipients that the embargo is lifted
After Public Disclosure
- 🔗 (QA) Merge a regression test reproducing the bug into all affected (and still maintained) branches
Summary
A flood of DNS messages over a single TCP connection makes the server unusable.
CVE Identifier
Reserved the following CVE ID(s):
CVE-2024-0760
├─ State: RESERVED
├─ Owning CNA: isc
├─ Reserved by: ondrej@isc.org (isc)
└─ Reserved on: Fri Jan 19 20:26:52 2024
Remaining quota: 994
BIND versions affected
- Affects v9.19: 64ef6968
- Affects v9.18: 418a1ad7 (with OPCODE=QUERY)
It does NOT affect these versions:
Preconditions and assumptions
None.
Attacker's abilities
The attacker is able to open a TCP connection to the server under attack and keep flooding the server with messages.
Impact
Legitimate QPS drops to ~0. If the attack lasts long enough, the server eventually consumes all memory and crashes.
The table below shows QPS over time during the test.
- Time 0 - only legitimate traffic (REFUSED because of the ACL, not much work involved)
- Time 1 - attack starts
- Time 16 - attack stops
- Time 23 - `named` finally recovers
After the attack stops at time 16, it still takes another 7 seconds to recover; the CPU is busy during that period.
test time (s) | legit QPS | attack QPS |
---|---|---|
0 | 155 528 | 0 |
1 | 0 | 103 675 |
2 | 261 | 69 462 |
3 | 257 | 71 908 |
4 | 0 | 71 595 |
5 | 258 | 71 304 |
6 | 260 | 70 678 |
7 | 0 | 70 065 |
8 | 266 | 71 200 |
9 | 0 | 69 448 |
10 | 257 | 72 393 |
11 | 258 | 71 443 |
12 | 0 | 69 474 |
13 | 259 | 71 073 |
14 | 257 | 69 471 |
15 | 0 | 71 176 |
16 | 260 | 0 |
17 | 518 | 0 |
18 | 0 | 0 |
19 | 0 | 0 |
20 | 0 | 0 |
21 | 186 | 0 |
22 | 37 792 | 0 |
23 | 156 413 | 0 |
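The figures in the table can be summarized with a few lines of Python. This is only a sketch over the numbers copied from the table above; the 90% threshold used to decide that the server has "recovered" is an arbitrary assumption, not something defined in this issue.

```python
# QPS per one-second interval, copied from the table (list index = test time).
legit = [155528, 0, 261, 257, 0, 258, 260, 0, 266, 0, 257, 258,
         0, 259, 257, 0, 260, 518, 0, 0, 0, 186, 37792, 156413]
attack = [0, 103675, 69462, 71908, 71595, 71304, 70678, 70065, 71200,
          69448, 72393, 71443, 69474, 71073, 69471, 71176,
          0, 0, 0, 0, 0, 0, 0, 0]

baseline = legit[0]

# Intervals in which the attacker was sending traffic.
attack_times = [t for t, qps in enumerate(attack) if qps > 0]
attack_start, attack_end = attack_times[0], attack_times[-1]

during = legit[attack_start:attack_end + 1]
mean_during = sum(during) / len(during)
stalled = sum(1 for qps in during if qps == 0)

# Assumption: "recovered" means legit QPS is back to at least 90% of baseline.
recovered_at = next(t for t in range(attack_end + 1, len(legit))
                    if legit[t] >= 0.9 * baseline)

print(f"baseline legit QPS:        {baseline}")
print(f"mean legit QPS in attack:  {mean_during:.1f}")
print(f"share of baseline:         {100 * mean_during / baseline:.2f} %")
print(f"intervals with 0 answers:  {stalled} of {len(during)}")
print(f"seconds until recovery:    {recovered_at - (attack_end + 1)}")
```

Under these assumptions the script reports the recovery delay as the number of seconds between the first attack-free interval (time 16) and the first interval where legitimate traffic is back near its baseline.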
Steps to reproduce
- The attached named.conf is just an example to demonstrate that ACLs cannot stop the attack. Any config is vulnerable.
- Start the BIND server with the command:
named -g -c named.conf &> /dev/null
  - The attached config produces lots of logging because of the ACLs, but all the ACLs can be commented out.
- Simulate legitimate clients using the command:
yes '. A' | dnsperf -S1 -O suppress=timeout -c 256
  - Of course, real legitimate clients would not be getting RCODE=REFUSED, but it is good enough for a demonstration.
- Simulate attack traffic using the command:
python tcploop.py 127.0.0.1 53 query.tcpdns --report-interval 1
As the name of the file suggests, it's an OPCODE=IQUERY message, which gets NOTIMP right away. It does not really matter what the message is (see #4481 (comment 422446)), I just wanted to have something which is syntactically valid and causes minimal processing on the server.
To test with memory size limited, try this command:
systemd-run -p MemoryMax=1G -p MemorySwapMax=0 --user --same-dir -t named -g -c named.conf &> /dev/null
Reproducer for DoT
- query.tcpdns
- tlsloop.py
- Usage:
python tlsloop.py 127.0.0.1 853 query.tcpdns
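The attached query.tcpdns file is not reproduced here. As a rough sketch, assuming it holds a single DNS message in TCP wire format (the message preceded by a two-byte length, as described in RFC 1035 section 4.2.2), an equivalent payload could be built like this. The helper names `build_query` and `tcp_frame` are made up for illustration and are not part of the attached scripts.

```python
import struct


def build_query(qname: str, qtype: int = 1, opcode: int = 0,
                msg_id: int = 0x1234) -> bytes:
    """Build a minimal DNS message with one question and no other records."""
    # Flags word: QR=0 (query), OPCODE in bits 11-14, all other bits clear.
    flags = (opcode & 0xF) << 11
    header = struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels terminated by the root label.
    labels = [label for label in qname.split(".") if label]
    name = b"".join(bytes([len(label)]) + label.encode("ascii")
                    for label in labels) + b"\x00"
    question = name + struct.pack("!HH", qtype, 1)  # QCLASS=IN
    return header + question


def tcp_frame(message: bytes) -> bytes:
    """Prefix a DNS message with its two-byte length for TCP/DoT transport."""
    return struct.pack("!H", len(message)) + message


if __name__ == "__main__":
    payload = tcp_frame(build_query(".", qtype=1))
    print(len(payload), payload.hex())
```

The same framing applies to DoT, where the length-prefixed message is carried inside the TLS session instead of directly over TCP.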
What is the current bug behavior?
Legitimate QPS drops towards 0.
Memory consumption increases until the server consumes almost all available memory and is killed by OOM killer.
EDIT: The test instructions were formerly incorrect. I forgot to limit the swap size, and that led to the incorrect conclusion that memory consumption stays at that level while the attack is in progress and that artificially limiting the memory available to the server process does not cause a crash/OOM condition.
What is the expected correct behavior?
I would expect roughly proportional resources dedicated to each client.
Maybe also tear down TCP connection if the client is doing weird things with it.
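To make "roughly proportional resources" concrete, here is a toy model of per-connection budgeting. It is purely illustrative: the `round_robin` function and the `budget` parameter are hypothetical and do not describe how named schedules work or how this issue was eventually fixed.

```python
from collections import deque


def round_robin(queues, budget):
    """Serve at most `budget` messages per connection per round.

    `queues` maps a connection name to its list of pending messages.
    Returns the order in which (connection, message) pairs are served.
    """
    pending = {name: deque(msgs) for name, msgs in queues.items()}
    order = []
    while any(pending.values()):
        for name, queue in pending.items():
            for _ in range(min(budget, len(queue))):
                order.append((name, queue.popleft()))
    return order


if __name__ == "__main__":
    # One connection floods 1000 messages, another sends just two.
    queues = {"flooder": list(range(1000)), "legit": [0, 1]}
    order = round_robin(queues, budget=10)
    last_legit = max(i for i, (name, _) in enumerate(order) if name == "legit")
    print(f"last legit message served at position {last_legit} of {len(order)}")
```

In this model the two legitimate messages are served within the first round instead of waiting behind the whole flood, which is the behavior the expectation above describes.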
Relevant logs
With the default config, none at all.
With the -d 99 log level it is actually interesting. See the log from a simple run with 1000 "attack" messages within a single connection, followed by connection closure.
Here is an excerpt:
Phase 1
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 (no-peer): allocate new client
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: TCP request
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: using view '_default'
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: request is not signed
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: recursion not available (recursion not enabled for view)
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 (no-peer): allocate new client
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: TCP request
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: using view '_default'
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: request is not signed
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: recursion not available (recursion not enabled for view)
...
Phase 2
06-Dec-2023 14:26:57.836 client @0x7f0b8f647400 127.0.0.1#55642: send failed: connection reset
06-Dec-2023 14:26:57.836 client @0x7f0b8f647400 127.0.0.1#55642: reset client
06-Dec-2023 14:26:57.836 client @0x7f0b8f666000 127.0.0.1#55642: send failed: connection reset
06-Dec-2023 14:26:57.836 client @0x7f0b8f666000 127.0.0.1#55642: reset client
...
Phase 3
06-Dec-2023 14:26:57.846 client @0x7f0b8f647400 127.0.0.1#55642: freeing client
06-Dec-2023 14:26:57.846 client @0x7f0b8f666000 127.0.0.1#55642: freeing client
...
That looks innocent, except for two things:
- The timestamps! There is quite a gap between the "recursion not available" and "send failed: connection reset" messages, and then there is another time gap before reaching "freeing client".
- All the attacker-sent messages first go through "phase 1", then all of them in order go through "phase 2", and then reach "phase 3". No wonder it eats lots of memory.
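The time each client object stays alive can be measured directly from such a log. The following is a sketch that only assumes the line layout visible in the excerpt above (timestamp, `client @<pointer>`, peer, message); the embedded sample contains a few lines copied from that excerpt, and `client_lifetimes` is a hypothetical helper name.

```python
import re
from datetime import datetime

SAMPLE = """\
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 (no-peer): allocate new client
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 (no-peer): allocate new client
06-Dec-2023 14:26:57.836 client @0x7f0b8f647400 127.0.0.1#55642: send failed: connection reset
06-Dec-2023 14:26:57.836 client @0x7f0b8f666000 127.0.0.1#55642: send failed: connection reset
06-Dec-2023 14:26:57.846 client @0x7f0b8f647400 127.0.0.1#55642: freeing client
06-Dec-2023 14:26:57.846 client @0x7f0b8f666000 127.0.0.1#55642: freeing client
"""

# timestamp, client pointer, peer (ignored), message
LINE = re.compile(r"^(\S+ \S+) client @(0x[0-9a-f]+) [^:]+: (.*)$")


def client_lifetimes(log: str) -> dict:
    """Return milliseconds between first and last log line per client object."""
    first, last = {}, {}
    for line in log.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip "..." and unrelated lines
        stamp = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S.%f")
        client = match.group(2)
        first.setdefault(client, stamp)
        last[client] = stamp
    return {c: round((last[c] - first[c]).total_seconds() * 1000)
            for c in first}


if __name__ == "__main__":
    for client, ms in client_lifetimes(SAMPLE).items():
        print(f"{client}: {ms} ms from first to last log line")
```

Run against a full -d 99 log, a large number of client objects with overlapping lifetimes would match the observation that every message is held until all of them have passed the earlier phases.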