[CVE-2024-0760] Flood of DNS messages over a single TCP/DoT connection makes server unusable

Quick Links	🔗
Incident Manager:	@aram
Deputy Incident Manager:	@dankney
Public Disclosure Date:	2024-07-23
CVSS Score:	7.5
Security Advisory:	isc-private/printing-press!98
Mattermost Channel:	CVE-2024-0760
Support Ticket:	N/A
Release Checklist:	#4735 (closed)

💡 Click here (internal resource) for general information about the security incident handling process.

Earlier Than T-5

At T-5

🔗 (Marketing) Update the text on the T-5 (from the Printing Press project) and "earliest" ASN documents in the SF portal
🔗 (Marketing) (BIND 9 only) Update the BIND -S information document in SF with download links to the new versions
🔗 (Marketing) Bulk email eligible customers to check the SF portal
🔗 (Marketing) (BIND 9 only) Send a pre-announcement email to the bind-announce mailing list to alert users that the upcoming release will include security fixes

At T-1

🔗 (First IM) Send notifications to OS packagers

On the Day of Public Disclosure

🔗 (IM) Grant QA & Marketing clearance to proceed with public release
🔗 (QA/Marketing) Publish the releases (as outlined in the release checklist)
🔗 (Support) (BIND 9 only) Add the new CVEs to the vulnerability matrix in the Knowledge Base
🔗 (Support) Bump Document Version for the Security Advisory and publish it in the Knowledge Base
🔗 (First IM) Send notification emails to third parties
🔗 (First IM) Advise MITRE about the disclosed CVEs
🔗 (First IM) Merge the Security Advisory merge request
🚫 🔗 (IM) Inform original reporter (if external) that the security disclosure process is complete
🔗 (Marketing) Update the SF portal to clear the ASN
🔗 (Marketing) Email ASN recipients that the embargo is lifted

After Public Disclosure

🔗 (QA) ~~Merge a regression test reproducing the bug into all affected (and still maintained) branches~~

Summary

A flood of DNS messages over a single TCP connection makes the server unusable.

CVE Identifier

Reserved the following CVE ID(s):

CVE-2024-0760
├─ State:	RESERVED
├─ Owning CNA:	isc
├─ Reserved by:	ondrej@isc.org (isc)
└─ Reserved on:	Fri Jan 19 20:26:52 2024

Remaining quota: 994

BIND versions affected

Affects v9.19: 64ef6968
Affects v9.18: 418a1ad7 (with OPCODE=QUERY)

It does NOT affect these versions:

v9.16: 0e282066
~"v9.11 (EoL)": 43a2e6aa
Other versions were not tested

Preconditions and assumptions

None.

Attacker's abilities

The attacker is able to open a TCP connection to the server under attack and keep flooding the server with messages.

Impact

Legitimate QPS drops to ~0. Eventually the server consumes all memory and crashes (if the attack lasts long enough).

Here is a table showing QPS over time during the test.

Time 0 - only legitimate traffic (REFUSED because of the ACL, not much work involved)
Time 1 - attack starts
Time 16 - attack stops
Time 23 - named finally recovers

After the attack stops, it still takes another 7 seconds for named to recover; the CPU is busy during that period.

test time (s)	legit (QPS)	attack (QPS)
0	155 528	0
1	0	103 675
2	261	69 462
3	257	71 908
4	0	71 595
5	258	71 304
6	260	70 678
7	0	70 065
8	266	71 200
9	0	69 448
10	257	72 393
11	258	71 443
12	0	69 474
13	259	71 073
14	257	69 471
15	0	71 176
16	260	0
17	518	0
18	0	0
19	0	0
20	0	0
21	186	0
22	37 792	0
23	156 413	0
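
To put a number on the drop, the table can be summarized with a few lines of Python. This is only a sketch: the samples are copied from the table above, and the variable names are made up for illustration.

```python
# QPS samples copied from the table above: (test time, legit QPS, attack QPS)
samples = [
    (0, 155528, 0), (1, 0, 103675), (2, 261, 69462), (3, 257, 71908),
    (4, 0, 71595), (5, 258, 71304), (6, 260, 70678), (7, 0, 70065),
    (8, 266, 71200), (9, 0, 69448), (10, 257, 72393), (11, 258, 71443),
    (12, 0, 69474), (13, 259, 71073), (14, 257, 69471), (15, 0, 71176),
    (16, 260, 0), (17, 518, 0), (18, 0, 0), (19, 0, 0),
    (20, 0, 0), (21, 186, 0), (22, 37792, 0), (23, 156413, 0),
]

# Baseline: legitimate QPS before the attack starts (time 0).
baseline = samples[0][1]

# Legitimate QPS in every interval where attack traffic was observed.
attack_window = [legit for _, legit, attack in samples if attack > 0]

mean_legit = sum(attack_window) / len(attack_window)
stalled = sum(1 for legit in attack_window if legit == 0)
drop = 1 - mean_legit / baseline

print(f"mean legit QPS during attack: {mean_legit:.1f}")
print(f"intervals with zero legit QPS: {stalled} of {len(attack_window)}")
print(f"drop relative to baseline: {drop:.2%}")
```

During the attack, the legitimate clients get roughly a thousandth of the baseline throughput, and in several intervals no answers at all.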

Steps to reproduce

The attached named.conf is just an example to demonstrate that ACLs cannot stop the attack. Any config is vulnerable.
Start the BIND server with the command: named -g -c named.conf &> /dev/null

The attached config produces lots of logging because of the ACLs, but all the ACLs can be commented out.

Simulate legitimate clients using the command: yes '. A' | dnsperf -S1 -O suppress=timeout -c 256

Of course, real legitimate clients would not be getting RCODE=REFUSED, but it is good enough for a demonstration.

Simulate attack traffic using the command: python tcploop.py 127.0.0.1 53 query.tcpdns --report-interval 1

~~As the name of the file suggests, it's an OPCODE=IQUERY message, which gets NOTIMP right away. It does not really matter what the message is~~ (see #4481 (comment 422446)), I just wanted to have something which is syntactically valid and causes minimal processing on the server.
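
For readers without access to the attachment: DNS over TCP prefixes every message with a two-byte, big-endian length field (RFC 1035, section 4.2.2). The sketch below builds one such framed message for a minimal `. A` query. It assumes query.tcpdns holds a single framed message of this kind, which this issue does not state explicitly; `build_query` and `frame_tcp` are hypothetical helper names, not part of the attached scripts.

```python
import struct


def build_query(qname: str = ".", qtype: int = 1, msg_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (OPCODE=QUERY, no flags set, one question)."""
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", msg_id, 0x0000, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by the zero-length root label
    labels = [label for label in qname.rstrip(".").split(".") if label]
    encoded = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    encoded += b"\x00"
    # QTYPE (1 = A) and QCLASS (1 = IN)
    question = encoded + struct.pack("!HH", qtype, 1)
    return header + question


def frame_tcp(message: bytes) -> bytes:
    """Prepend the two-byte big-endian length used for DNS over TCP."""
    return struct.pack("!H", len(message)) + message


framed = frame_tcp(build_query())
print(len(framed), framed.hex())
```

Any syntactically valid message that causes minimal processing on the server would serve the same purpose.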

To test with memory size limited, try this command:

systemd-run -p MemoryMax=1G -p MemorySwapMax=0 --user --same-dir -t named -g -c named.conf &> /dev/null

Reproducer for DoT

query.tcpdns
tlsloop.py
Usage: python tlsloop.py 127.0.0.1 853 query.tcpdns

What is the current bug behavior?

Legitimate QPS drops towards 0.

Memory consumption increases until the server consumes almost all available memory and is killed by the OOM killer.

EDIT: Formerly the test instructions were incorrect. I forgot to limit the swap size and that led to an incorrect conclusion: ~~and memory consumption stays like that while attack is in progress. Artificially limiting memory available to server process does not cause crash/OOM condition.~~

What is the expected correct behavior?

I would expect roughly proportional resources dedicated to each client.

Maybe also tear down TCP connection if the client is doing weird things with it.
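
One way to picture "proportional resources per client" is a cap on the number of requests a single connection may have outstanding, with reading paused once the cap is reached. The sketch below is a toy model of that idea only. It is not how BIND works, nor a description of the eventual fix; `ConnectionBudget` and `max_pending` are hypothetical names.

```python
from collections import deque


class ConnectionBudget:
    """Toy per-connection limit on requests awaiting a response."""

    def __init__(self, max_pending: int = 16):
        self.max_pending = max_pending
        self.pending = deque()

    def can_read(self) -> bool:
        # Reading from the socket should pause once the budget is used up.
        return len(self.pending) < self.max_pending

    def on_request(self, msg_id: int) -> bool:
        # Returns False when the request must wait (or the connection be dropped).
        if not self.can_read():
            return False
        self.pending.append(msg_id)
        return True

    def on_response_sent(self) -> None:
        # A completed response frees one slot for this connection.
        if self.pending:
            self.pending.popleft()


budget = ConnectionBudget(max_pending=2)
accepted = [budget.on_request(i) for i in range(3)]
print(accepted)
```

With a limit like this, memory held on behalf of one connection is bounded by `max_pending` rather than by how fast the client can send.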

Relevant logs

With the default config, none at all.

With the -d 99 log level it is actually interesting. See the log from a simple run with 1000 "attack" messages within a single connection, followed by connection closure.

Here is an excerpt:

Phase 1

06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 (no-peer): allocate new client
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: TCP request
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: using view '_default'
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: request is not signed
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 127.0.0.1#55642: recursion not available (recursion not enabled for view)

06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 (no-peer): allocate new client
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: TCP request
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: using view '_default'
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: request is not signed
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 127.0.0.1#55642: recursion not available (recursion not enabled for view)
...

Phase 2

06-Dec-2023 14:26:57.836 client @0x7f0b8f647400 127.0.0.1#55642: send failed: connection reset
06-Dec-2023 14:26:57.836 client @0x7f0b8f647400 127.0.0.1#55642: reset client

06-Dec-2023 14:26:57.836 client @0x7f0b8f666000 127.0.0.1#55642: send failed: connection reset
06-Dec-2023 14:26:57.836 client @0x7f0b8f666000 127.0.0.1#55642: reset client
...

Phase 3

06-Dec-2023 14:26:57.846 client @0x7f0b8f647400 127.0.0.1#55642: freeing client
06-Dec-2023 14:26:57.846 client @0x7f0b8f666000 127.0.0.1#55642: freeing client
...

That looks innocent, except for two things:

The timestamps! There is quite a gap between the "recursion not available" and "send failed: connection reset" messages, and then there is another time gap before reaching "freeing client".
All the attacker-sent messages first go through "phase 1", then all of them in order go through "phase 2", and then reach "phase 3". No wonder it eats lots of memory.
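
The gaps between phases can be measured directly from the log. The following is a minimal sketch that uses a few lines from the excerpt above; the regular expression and the `phase_gaps` helper are written for this illustration and only cover the message formats shown here.

```python
import re
from datetime import datetime, timedelta

# Lines taken from the excerpt above (one marker line per phase and client).
LOG = """\
06-Dec-2023 14:26:57.819 client @0x7f0b8f647400 (no-peer): allocate new client
06-Dec-2023 14:26:57.819 client @0x7f0b8f666000 (no-peer): allocate new client
06-Dec-2023 14:26:57.836 client @0x7f0b8f647400 127.0.0.1#55642: reset client
06-Dec-2023 14:26:57.836 client @0x7f0b8f666000 127.0.0.1#55642: reset client
06-Dec-2023 14:26:57.846 client @0x7f0b8f647400 127.0.0.1#55642: freeing client
06-Dec-2023 14:26:57.846 client @0x7f0b8f666000 127.0.0.1#55642: freeing client
"""

# timestamp, client object address, peer, message
LINE = re.compile(r"^(\S+ \S+) client @(0x[0-9a-f]+) [^:]+: (.*)$")
PHASES = {"allocate new client": 1, "reset client": 2, "freeing client": 3}


def phase_gaps(log: str) -> dict:
    """Return {client: (ms from phase 1 to 2, ms from phase 2 to 3)}."""
    seen = {}
    for line in log.splitlines():
        m = LINE.match(line)
        if not m or m.group(3) not in PHASES:
            continue
        ts = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S.%f")
        seen.setdefault(m.group(2), {})[PHASES[m.group(3)]] = ts
    one_ms = timedelta(milliseconds=1)
    return {
        client: ((p[2] - p[1]) // one_ms, (p[3] - p[2]) // one_ms)
        for client, p in seen.items()
        if len(p) == 3
    }


gaps = phase_gaps(LOG)
for client, (to_reset, to_free) in gaps.items():
    print(client, to_reset, "ms to reset,", to_free, "ms more to free")
```

Every client object stays allocated for the whole span between its first and last line, so all of them are alive at the same time.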

Edited Sep 12, 2024 by Nicki Křížek
