Potential cache poisoning due to unexpected recursion instead of following delegation when serve-stale is enabled
Summary
As reported in Support ticket #22830
The server is authoritative for some zones as well as supporting recursion for others. Some zones delegate subdomains to other nameservers. For those, the NS RRset in the delegation is unreachable/unresolvable.
With stale-answer-enable no;
the expected SERVFAIL is returned to the clients querying for names in these subdomains.
With stale-answer-enable yes;
the resolver appears not to follow the delegation and instead attempts resolution directly from the root nameservers, sometimes giving clients answers that differ from those intended by the configuration and the (albeit broken) delegation.
BIND version used
BIND 9.16.43-Ubuntu (Extended Support Version) <id:de6f1a0>
running on Linux x86_64 5.15.0-1041-aws #46~20.04.1-Ubuntu SMP Wed Jul 19 15:40:00 UTC 2023
built by make with '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=/usr/lib/x86_64-linux-gnu' '--libexecdir=/usr/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--libdir=/usr/lib/x86_64-linux-gnu' '--sysconfdir=/etc/bind' '--with-python=python3' '--localstatedir=/' '--enable-threads' '--enable-largefile' '--with-libtool' '--enable-shared' '--enable-static' '--with-gost=no' '--with-openssl=/usr' '--with-gssapi=/usr' '--with-libidn2' '--with-json-c' '--with-lmdb=/usr' '--with-gnu-ld' '--with-maxminddb' '--with-atf=no' '--enable-ipv6' '--enable-rrl' '--enable-filter-aaaa' '--disable-native-pkcs11' '--enable-dnstap' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/build/bind9-FMDtLY/bind9-9.16.43=. -fstack-protector-strong -Wformat -Werror=format-security -fno-strict-aliasing -fno-delete-null-pointer-checks -DNO_VERSION_DATE -DDIG_SIGCHASE' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'
compiled by GCC 9.4.0
compiled with OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
linked to OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
compiled with libuv version: 1.44.2
linked to libuv version: 1.44.2
compiled with libxml2 version: 2.9.10
linked to libxml2 version: 20910
compiled with json-c version: 0.13.1
linked to json-c version: 0.13.1
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
linked to maxminddb version: 1.4.2
compiled with protobuf-c version: 1.3.3
linked to protobuf-c version: 1.3.3
threads support is enabled
DNSSEC algorithms: RSASHA1 NSEC3RSASHA1 RSASHA256 RSASHA512 ECDSAP256SHA256 ECDSAP384SHA384 ED25519 ED448
DS algorithms: SHA-1 SHA-256 SHA-384
HMAC algorithms: HMAC-MD5 HMAC-SHA1 HMAC-SHA224 HMAC-SHA256 HMAC-SHA384 HMAC-SHA512
TKEY mode 2 support (Diffie-Hellman): yes
TKEY mode 3 support (GSS-API): yes
default paths:
named configuration: /etc/bind/named.conf
rndc configuration: /etc/bind/rndc.conf
DNSSEC root key: /etc/bind/bind.keys
nsupdate session key: //run/named/session.key
named PID file: //run/named/named.pid
named lock file: //run/named/named.lock
geoip-directory: /usr/share/GeoIP
Steps to reproduce
Pasting here from the report to the Support team: an in-addr.arpa zone is set up locally for a private /16 network, 59.10.in-addr.arpa,
with a delegation for a /24:
$ORIGIN 59.10.in-addr.arpa.
1       NS nss1.example.net.
        NS nss2.example.net.
        NS nss3.example.net.
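To make it clear which addresses this delegation covers, here is a minimal Python sketch. The helper names `ptr_name` and `under_delegation` and the sample addresses are illustrative and not part of the report. It maps an IPv4 address to its reverse-lookup name and checks whether that name sits at or below the delegation point `1.59.10.in-addr.arpa`.

```python
import ipaddress

# Delegation point taken from the zone snippet above.
DELEGATION = "1.59.10.in-addr.arpa"


def ptr_name(addr: str) -> str:
    """Return the reverse-lookup name for an IP address (no trailing dot)."""
    return ipaddress.ip_address(addr).reverse_pointer


def under_delegation(addr: str, delegation: str = DELEGATION) -> bool:
    """True if the PTR name for addr is at or below the delegation point."""
    name = ptr_name(addr)
    # Compare on label boundaries so 11.59.10.in-addr.arpa does not match.
    return name == delegation or name.endswith("." + delegation)


if __name__ == "__main__":
    for sample in ("10.59.1.5", "10.59.2.5", "10.59.11.5"):
        print(sample, ptr_name(sample), under_delegation(sample))
```

Only names for 10.59.1.0/24 fall under the delegation, so those are the queries expected to fail with SERVFAIL while the listed nameservers are unreachable.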
As these NSs are fake, we can never contact them, and without serve-stale enabled we always receive SERVFAIL.
But as soon as serve-stale is enabled, named starts recursing from the root, and we start getting NXDOMAIN (which is a cacheable answer):
# dig 1.59.10.in-addr.arpa @127.0.0.1
; <<>> DiG 9.16.43-Ubuntu <<>> 1.59.10.in-addr.arpa @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 21068
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: c6da2a29e4d47c6a01000000651554607e40e4b8c2a75c00 (good)
;; QUESTION SECTION:
;1.59.10.in-addr.arpa. IN A
;; AUTHORITY SECTION:
10.in-addr.arpa. 10800 IN SOA prisoner.iana.org. hostmaster.root-servers.org. 1 604800 60 604800 604800
;; Query time: 156 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 28 10:24:32 UTC 2023
;; MSG SIZE rcvd: 172
The same thing happens with forward-lookup zones.
For example:
- Test zone: myctl.com
- Test record: test.myctl.com (on the external NSs it returns 127.0.0.1)
Delegation in the local zone file:
$ORIGIN myctl.com.
test    NS nss1.example.net.
        NS nss2.example.net.
        NS nss3.example.net.
Without serve-stale enabled, I always get a SERVFAIL answer.
With serve-stale enabled, I get SERVFAIL twice, then recursion starts from the root, and I get an answer from external nameservers that are not specified in the local zone file:
# dig test.myctl.com. @127.0.0.1
; <<>> DiG 9.16.43-Ubuntu <<>> test.myctl.com. @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29218
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 42aa29742ab548b00100000065155561f82a157e5a4464ae (good)
;; QUESTION SECTION:
;test.myctl.com. IN A
;; Query time: 60 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 28 10:28:49 UTC 2023
;; MSG SIZE rcvd: 71
# dig test.myctl.com. @127.0.0.1
; <<>> DiG 9.16.43-Ubuntu <<>> test.myctl.com. @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42780
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: c14a6db123fa69bc010000006515556372206781ce6b989d (good)
;; QUESTION SECTION:
;test.myctl.com. IN A
;; ANSWER SECTION:
test.myctl.com. 300 IN A 127.0.0.1
;; AUTHORITY SECTION:
test.myctl.com. 3600 IN NS nss2.example.net.
test.myctl.com. 3600 IN NS nss1.example.net.
test.myctl.com. 3600 IN NS nss3.example.net.
;; Query time: 4 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 28 10:28:51 UTC 2023
;; MSG SIZE rcvd: 155
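When repeating the queries above, the switch from SERVFAIL to another answer is easy to miss. The following is a minimal Python sketch for spotting it. The helper names `dig_status` and `unexpected_statuses` are hypothetical, and it assumes the captured text has the same header layout as the dig output shown above. It extracts the `status:` field from each response and lists any status other than the expected SERVFAIL.

```python
import re
from typing import List, Optional

# Matches the status field on dig's ";; ->>HEADER<<-" line.
STATUS_RE = re.compile(r"->>HEADER<<-.*?status:\s*([A-Z]+)")


def dig_status(output: str) -> Optional[str]:
    """Return the response status (e.g. 'SERVFAIL') from dig output, or None."""
    match = STATUS_RE.search(output)
    return match.group(1) if match else None


def unexpected_statuses(outputs: List[str],
                        expected: str = "SERVFAIL") -> List[Optional[str]]:
    """Return, in order, every status that differs from the expected one."""
    statuses = [dig_status(text) for text in outputs]
    return [status for status in statuses if status != expected]


if __name__ == "__main__":
    # Header lines copied from the two test.myctl.com responses above.
    first = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 29218"
    second = ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42780"
    print(unexpected_statuses([first, second]))
```

With serve-stale disabled, the list should stay empty across repeated queries. With serve-stale enabled, the NOERROR (or NXDOMAIN, for the reverse zone) responses show up in it.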
What is the current bug behavior?
Unexpected recursion from the root down (ignoring the delegation from the locally authoritative parent zones) by the resolver when stale-answer-enable is 'yes'.
What is the expected correct behavior?
Consistent SERVFAIL (by following the delegation NS RRset and being unable to resolve the delegated nameserver names in the parent zone).
Relevant configuration files
Relevant snippets from named.conf (for the full configuration, see the Support ticket):
options {
    directory "/var/cache/bind";
    listen-on-v6 {
        "none";
    };
    dnssec-validation no;
    minimal-responses no;
    qname-minimization off;
    stale-answer-enable yes;
    stale-answer-client-timeout 1800;
    stale-cache-enable yes;
    stale-refresh-time 30;
    masterfile-format text;
};
zone "59.10.in-addr.arpa" in {
    type master;
    file "zones/59.10.rev";
    forwarders {
    };
};
zone "myctl.com" in {
    type master;
    file "zones/myctl.com";
    forwarders {
    };
};
zone "localhost" {
    type master;
    file "/etc/bind/db.local";
};
zone "127.in-addr.arpa" {
    type master;
    file "/etc/bind/db.127";
};
zone "0.in-addr.arpa" {
    type master;
    file "/etc/bind/db.0";
};
zone "255.in-addr.arpa" {
    type master;
    file "/etc/bind/db.255";
};
# cat /var/cache/bind/zones/myctl.com
$ORIGIN .
$TTL 3600 ; 1 hour
myctl.com IN SOA dc1.example.net. corporate.example.net. (
                428210 ; serial
                900 ; refresh (15 minutes)
                600 ; retry (10 minutes)
                86400 ; expire (1 day)
                3600 ; minimum (1 hour)
                )
        NS ns1.myctl.com.
        NS ns2.myctl.com.
ns1.myctl.com. IN A 127.0.0.1
ns2.myctl.com. IN A 127.0.0.1
$ORIGIN myctl.com.
test    NS nss1.example.net.
        NS nss2.example.net.
        NS nss3.example.net.
# cat /var/cache/bind/zones/59.10.rev
$ORIGIN .
$TTL 3600 ; 1 hour
59.10.in-addr.arpa IN SOA dc1.example.net. corporate.example.net. (
                428210 ; serial
                900 ; refresh (15 minutes)
                600 ; retry (10 minutes)
                86400 ; expire (1 day)
                3600 ; minimum (1 hour)
                )
        NS ns1.example.net.
        NS ns2.example.net.
$ORIGIN 59.10.in-addr.arpa.
1       NS nss1.example.net.
        NS nss2.example.net.
        NS nss3.example.net.
Notably, this server IS authoritative for the parent zones but delegates to an NS RRset that it's not authoritative for and where the names can't be resolved to anything useful.
Therefore the resolver should use the delegation NS RRset for these internal-only zones and the delegations from them, and should not attempt resolution from the root down (but it DOES nevertheless attempt that with stale-answer-enable yes;).
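The expected choice of delegation can be modelled with a short Python sketch. This only illustrates the behaviour described above and is not how named implements it. The function name `closest_delegation` and the query names other than those in the report are made up for the example. Given the delegation points from the two local zones, it returns the closest enclosing delegation for a query name, or None when the name is not below any delegation.

```python
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Lower-case a domain name and drop any trailing dot."""
    return name.rstrip(".").lower()


def closest_delegation(qname: str, delegations: Iterable[str]) -> Optional[str]:
    """Return the deepest delegation point at or above qname, if any."""
    query = normalize(qname)
    best: Optional[str] = None
    for point in delegations:
        candidate = normalize(point)
        # Match on whole labels only: equal, or a suffix preceded by a dot.
        if query == candidate or query.endswith("." + candidate):
            if best is None or candidate.count(".") > best.count("."):
                best = candidate
    return best


if __name__ == "__main__":
    # Delegation points taken from the zone files above.
    local_delegations = ["test.myctl.com.", "1.59.10.in-addr.arpa."]
    for name in ("test.myctl.com.", "5.1.59.10.in-addr.arpa.", "www.myctl.com."):
        print(name, "->", closest_delegation(name, local_delegations))
```

For any name where this returns a delegation point, the expectation in this report is that the resolver tries only the NS RRset at that point and returns SERVFAIL if those nameservers cannot be reached.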
Why this is potentially a security defect:
Quoting the reporter:
We expect that internal customers will get internal IPs in answers, or get nothing if something is wrong (such as broken internal NSs), or get answers from the cache when the NSs configured in the zone file are unavailable but the answers are already cached. But not external IPs or unexpected answers.
Let's assume something goes wrong with our internal NSs (nss[1-3].example.net, as in the example above), and everyone inside the company gets some external IP (or loopback IP) for a requested record (as in the example above). Say it is “supernewfeature.myctl.com” (purely as an example), and everyone inside the company starts sending queries for that record and receives an answer pointing to an unexpected place; the service there might then be overloaded, unexpected responses might be returned, or anything else might happen.
When serve-stale is disabled, everyone gets SERVFAIL, and external services are not impacted.
I also see this as a potential attack vector: someone could place a victim's IP address in the external nameservers, which could cause a DDoS, or point the record at a phishing website.
That's why I consider this a security issue.