# BIND issues
Source: https://gitlab.isc.org/isc-projects/bind9/-/issues

## Issue #2588: Fixup the copyright texts
Ondřej Surý · updated 2023-11-01 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2588

The following discussion from !4807 should be addressed:
- [ ] @sgoldlust started a [discussion](https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/4807#note_200295): (+2 comments)
> I know this is not what you asked for my review on, but there's no reason for the word "you" to be capitalized in these messages.

Status: Not planned

## Issue #2524: Coccinelle unhappy about lib/ns/tests/nstest.c
Michal Nowak · updated 2023-11-01 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2524

Coccinelle is [unhappy](https://gitlab.isc.org/isc-projects/bind9/-/jobs/1522218) about `lib/ns/tests/nstest.c` on `main` and `v9_16`:
```
EXN: Failure("rule starting on line 26: already tagged token:\nC code context\nFile \"./lib/ns/tests/nstest.c\", line 716, column 1, charpos = 16367\n around = 'if',\n whole content = \tif (qctx != NULL) {")
```

Status: Not planned

## Issue #2518: IPv4 Flamethrower quit abruptly on FreeBSD 12
Michal Nowak · updated 2024-01-22 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2518

IPv4 Flamethrower quit abruptly in the [stress:recursive:freebsd12:amd64](https://gitlab.isc.org/isc-private/bind9/-/jobs/1517463) CI job over the `v9_11_sub` branch:
```
2021-02-23:03:31:55 INFO: starting TCP generators
2021-02-23:03:31:55 INFO: starting generator #3 (tcp) on 10.53.0.3 in /var/tmp/gitlab_runner/builds/YdCaoq4b/0/isc-private/bind9/output/generator3
2021-02-23:03:31:55 INFO: (using query file /var/tmp/gitlab_runner/builds/YdCaoq4b/0/isc-private/bind9/output/query_datafile)
2021-02-23:03:31:55 INFO: starting generator #4 (tcp) on [fd92:7065:b8e:ffff::3] in /var/tmp/gitlab_runner/builds/YdCaoq4b/0/isc-private/bind9/output/generator4
2021-02-23:03:31:55 INFO: (using query file /var/tmp/gitlab_runner/builds/YdCaoq4b/0/isc-private/bind9/output/query_datafile)
2021-02-23:03:31:55 INFO: checking processes, 1 hours 0 minutes left
2021-02-23:03:32:56 INFO: checking processes, 59 minutes left
2021-02-23:03:32:56 ERROR: process with PID file /var/tmp/gitlab_runner/builds/YdCaoq4b/0/isc-private/bind9/output/generator3/generator.pid (pid = 60111) is no longer running
```
`generator3/generator.log` does not have the usual final summary that it produces when it exits correctly:
```
--class: "IN"
--dnssec: true
--help: false
--qps-flow: null
--targets: null
--version: false
-F: "inet"
-M: "GET"
-P: "tcp"
-Q: "10000"
-R: false
-T: "A"
-b: null
-c: "10"
-d: "1"
-f: "/var/tmp/gitlab_runner/builds/YdCaoq4b/0/isc-private/bind9/output/query_datafile"
-g: "file"
-l: "0"
-n: "0"
-o: null
-p: "5300"
-q: "10"
-r: "test.com"
-t: "3"
-v: "99"
GENOPTS: []
TARGET: "10.53.0.3"
file: push "091195.test.example.."
file: push "099598.test.example.."
file: push "011761.test.example.."
file: push "097867.test.example.."
file: push "025447.test.example.."
file: push "037011.test.example.."
file: push "050838.test.example.."
file: push "022788.test.example.."
file: push "093318.test.example.."
file: push "076772.test.example.."
0 key/value generator arguments
binding to 0.0.0.0
flaming target(s) [10.53.0.3] on port 5300 with 30 concurrent generators, each sending 100 queries every 1000ms on protocol tcp
query generator [file] contains 105000 record(s)
rate limit @ 10000 QPS (333.333 QPS per concurrent sender)
0.00130632s: send: 0, avg send: 0, recv: 0, avg recv: 0, min/avg/max resp: 0/nan/0ms, in flight: 0, timeouts: 0
1.01032s: send: 3000, avg send: 3000, recv: 2200, avg recv: 2200, min/avg/max resp: 39.3282/225.582/593.608ms, in flight: 822, timeouts: 0
```
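As a quick sanity check, the rate-limit line in the log is internally consistent: 10000 QPS split across 30 concurrent generators gives the logged per-sender rate. A sketch of the arithmetic:

```python
# Arithmetic behind the generator's rate-limit log line
# ("rate limit @ 10000 QPS (333.333 QPS per concurrent sender)").
total_qps = 10000   # overall rate limit (-Q "10000")
senders = 30        # concurrent generators reported in the log

per_sender = total_qps / senders
print(f"{per_sender:.3f} QPS per concurrent sender")  # → 333.333 QPS per concurrent sender
```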
There's no core dump file in the artifact archive.
According to the logs, the `named` instances appear to have been working at the time generator #3 quit.

Status: Not planned

## Issue #2429: Consider TLS session caching for DoH (and/or DoT)
Artem Boldariev · updated 2022-12-14 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2429

### Description
Establishing a TLS connection can be costly and involves handshake exchanges that may exceed the size of a simple DNS query. We haven't looked into it much, but web servers can cache TLS sessions, so BIND should too.
This feature is worth at least researching: DoH clients (mostly browsers) can be expected to return shortly after making a query, so session caching could improve performance.

Status: Not planned · Assignee: Artem Boldariev

## Issue #2197: 9.16.6 exited with assertion failure after minor in-flight configuration change
Håvard Eidnes · updated 2022-01-25 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2197

<!--
If the bug you are reporting is potentially security-related - for example,
if it involves an assertion failure or other crash in `named` that can be
triggered repeatedly - then please do *NOT* report it here, but send an
email to [security-officer@isc.org](security-officer@isc.org).
-->
### Summary
10s after a minor configuration change, BIND exited with an assertion failure.
After a restart, it has continued working as normal.
### BIND version used
```
BIND 9.16.6 (Stable Release) <id:25846cf>
running on NetBSD amd64 9.0_RC1 NetBSD 9.0_RC1 (GENERIC) #0: Sat Dec 14 12:36:33 UTC 2019 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC
built by make with '--with-libxml2=yes' '--with-tuning=large' '--enable-dnstap' '--with-protobuf-c=/usr/pkg' '--with-libfstrm=/usr/pkg' '--sysconfdir=/etc' '--localstatedir=/var'
compiled by GCC 7.4.0
compiled with OpenSSL version: OpenSSL 1.1.1c 28 May 2019
linked to OpenSSL version: OpenSSL 1.1.1c 28 May 2019
compiled with libuv version: 1.38.0
linked to libuv version: 1.38.0
compiled with libxml2 version: 2.9.10
linked to libxml2 version: 20910
compiled with zlib version: 1.2.10
linked to zlib version: 1.2.10
compiled with protobuf-c version: 1.3.2
linked to protobuf-c version: 1.3.2
threads support is enabled
default paths:
named configuration: /etc/named.conf
rndc configuration: /etc/rndc.conf
DNSSEC root key: /etc/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/named.pid
named lock file: /var/run/named/named.lock
```
### Steps to reproduce
The configuration changes introduced were:
```
--- named.conf 2020/10/02 10:02:54 1.23
+++ named.conf 2020/10/02 10:06:18 1.24
@@ -1,7 +1,6 @@
options {
directory "/etc/namedb";
- dnssec-enable yes;
dnssec-validation yes;
managed-keys-directory "keys";
@@ -17,6 +16,10 @@
// minimization for now. May be related to forwarding...
//qname-minimization off;
+ // Be nice, conform to DNS flag day 2020
+ edns-udp-size 1232;
+ max-udp-size 1232;
+
// Force these in preparation for anycast addresses
// which we never want to use as query source
query-source address 158.37.2.68;
@@ -82,7 +85,7 @@
};
};
-managed-keys {
+trust-anchors {
"." initial-key 257 3 8
"AwEAAagAIKlVZrpC6Ia7gEzahOR+9W29euxhJhVVLOyQbSEW0O8gcCjF
FVQUTf6v58fLjwBd0YI0EzrAcQqBGCzh/RStIoO8g0NfnfL2MTJRkxoX
```
### What is the current *bug* behavior?
```
Oct 2 12:06:07 tos-res named[6701]: reloading configuration succeeded
Oct 2 12:06:07 tos-res named[6701]: scheduled loading new zones
Oct 2 12:06:07 tos-res named[6701]: any newly configured zones are now loaded
Oct 2 12:06:07 tos-res named[6701]: running
Oct 2 12:06:17 tos-res named[6701]: resolver.c:10193: INSIST(((res->dbuckets[i].list).head == ((void *)0))) failed, back trace
Oct 2 12:06:17 tos-res named[6701]: #0 0x434978 in assertion_failed()+0x4d
Oct 2 12:06:17 tos-res named[6701]: #1 0x5edd88 in isc_assertion_failed()+0xa
Oct 2 12:06:17 tos-res named[6701]: #2 0x550e33 in dns_resolver_detach()+0x501
Oct 2 12:06:17 tos-res named[6701]: #3 0x5896cf in destroy()+0x129
Oct 2 12:06:17 tos-res named[6701]: #4 0x58a427 in adb_shutdown()+0x52
Oct 2 12:06:17 tos-res named[6701]: #5 0x610f77 in run()+0x6b2
Oct 2 12:06:17 tos-res named[6701]: #6 0x72753c20c1d8 in _fini()+0x72753bbc5778
Oct 2 12:06:17 tos-res named[6701]: #7 0x72753bc87af0 in _fini()+0x72753b641090
Oct 2 12:06:17 tos-res named[6701]: exiting (due to assertion failure)
```
### What is the expected *correct* behavior?
BIND should have continued working as normal.
It *may* be coincidental, but 10s is "too close for comfort". Moreover, the fact that BIND has continued working after a full restart suggests that the problem was triggered by the in-flight configuration change rather than by the new configuration itself.
### Relevant configuration files
`named-checkconf -px` output follows:
```
logging {
channel "normal" {
syslog "local2";
severity dynamic;
};
channel "trash" {
syslog "local3";
severity dynamic;
};
channel "security" {
syslog "local4";
severity dynamic;
};
channel "qerrs" {
syslog "local1";
severity dynamic;
};
channel "queries" {
syslog "local0";
severity dynamic;
};
channel "client_log" {
file "/var/log/client.log" versions 30 size 10485760;
severity dynamic;
print-time yes;
};
channel "rpzlog" {
file "/var/log/named.rpz" versions 50 size 10485760;
severity info;
print-time yes;
print-severity yes;
print-category yes;
};
channel "null" {
null ;
};
category "default" {
"normal";
"default_debug";
};
category "general" {
"normal";
"default_debug";
};
category "config" {
"normal";
"default_debug";
};
category "network" {
"normal";
"default_debug";
};
category "notify" {
"normal";
"default_debug";
};
category "xfer-in" {
"normal";
"default_debug";
};
category "xfer-out" {
"normal";
"default_debug";
};
category "dnssec" {
"security";
};
category "security" {
"security";
};
category "rpz" {
"rpzlog";
};
category "database" {
"null";
};
category "lame-servers" {
"null";
};
category "update-security" {
"null";
};
category "update" {
"null";
};
category "query-errors" {
"qerrs";
};
category "queries" {
"queries";
};
category "client" {
"client_log";
};
};
options {
datasize 8589934592;
directory "/etc/namedb";
dnstap-output unix "/var/run/named/dnstap.sock";
hostname "tos-res.uninett.no";
listen-on {
"any";
};
listen-on-v6 {
"any";
};
managed-keys-directory "keys";
querylog no;
server-id "tos-res.uninett.no";
dnssec-validation yes;
dnstap {
client query;
};
edns-udp-size 1232;
max-udp-size 1232;
qname-minimization relaxed;
query-source address 158.37.2.68 port 0;
query-source-v6 address 2001:700:0:804f::ca53 port 0;
recursion yes;
response-policy {
zone "dns-rpz.uninett.no";
zone "zone3.ph.rpz.switch.ch" policy disabled;
zone "zone3.mw.rpz.switch.ch" policy disabled;
zone "zone3.misc.rpz.switch.ch" policy disabled;
} break-dnssec yes;
allow-query {
"localnets";
78.91.0.0/16;
128.39.0.0/16;
129.177.0.0/16;
129.240.0.0/15;
129.242.0.0/16;
144.164.0.0/16;
151.157.0.0/16;
152.94.0.0/16;
156.116.0.0/16;
157.249.0.0/16;
158.36.0.0/14;
161.4.0.0/16;
192.111.33.0/24;
192.133.32.0/24;
192.146.238.0/23;
193.156.0.0/15;
2001:700::/32;
146.172.4.0/23;
148.122.20.52/31;
148.123.37.165/32;
2001:67c:29f4::/48;
44.141.124.0/24;
44.141.132.0/24;
193.35.52.0/22;
};
forward first;
forwarders {
158.38.0.168;
128.39.2.24;
};
};
statistics-channels {
inet 127.0.0.1 port 8053 allow {
127.0.0.1/32;
};
inet 158.37.2.68 port 8053 allow {
158.38.62.0/23;
158.38.10.0/24;
};
};
server 54.209.136.173/32 {
send-cookie no;
};
server 204.153.45.2/32 {
send-cookie no;
};
trust-anchors {
"." initial-key 257 3 8 "AwEAAagAIKlVZrpC6Ia7gEzahOR+9W29euxhJhVVLOyQbSEW0O8gcCjF
FVQUTf6v58fLjwBd0YI0EzrAcQqBGCzh/RStIoO8g0NfnfL2MTJRkxoX
bfDaUeVPQuYEhg37NZWAJQ9VnMVDxP/VHL496M/QZxkjf5/Efucp2gaD
X6RS6CXpoY68LsvPVjR0ZSwzz1apAzvN9dlzEheX7ICJBBtuA6G3LQpz
W5hOA2hzCTMjJPJ8LbqF6dsV6DoBQzgul0sGIcGOYl7OyQdXfZ57relS
Qageu+ipAdTTJ25AsRTAoub8ONGcLmqrAmRLKBP1dfwhYB4N7knNnulq
QxA+Uk1ihz0=";
"." initial-key 257 3 8 "AwEAAaz/tAm8yTn4Mfeh5eyI96WSVexTBAvkMgJzkKTOiW1vkIbzxeF3
+/4RgWOq7HrxRixHlFlExOLAJr5emLvN7SWXgnLh4+B5xQlNVz8Og8kv
ArMtNROxVQuCaSnIDdD5LKyWbRd2n9WGe2R8PzgCmr3EgVLrjyBxWezF
0jLHwVN8efS3rCj/EWgvIWgb9tarpVUDK/b58Da+sqqls3eNbuv7pr+e
oZG+SrDK6nWeL3c6H5Apxz7LjVc1uTIdsIXxuOLYA4/ilBmSVIzuDWfd
RUfhHdY6+cn8HFRm+2hM8AnXGXws9555KrUB5qihylGa8subX2Nn6UwN
R1AkUTV74bU=";
"7.4.nrenum.net" initial-key 257 3 8 "AwEAAdyLRICD7vMGdRG+uwF9176xm5u+E22zJehX7luBrY8LeUsw0aT9
WxBe2aKYSoBbAROVcuQJ/8EbbL+XhX5RKieRZFLDS1hQc+BpLY4Vse5G
2OeWYbH9lWEUM6/XErTsUikYfchXxWg6PkidN/howfNmo7iHDgeG/Xfz
E+i2MLZHCCnNND6v2DE8aP4qYzmU/jEc7n4814z2HR1dzpK/eXZwY3Tv
MjnTh3cqayi8b2B7+tedwV874plFOtMdTwywnMnXf1R3C3HBIZXHu55F
Ptd7cMbikW0lEc7BRRYL50knDMk7jcnsnA7MI1hOu3vI1cNAUWM+CmWX
DXShJKcLF0s=";
};
zone "." {
type hint;
file "root.cache";
};
zone "localhost" {
type master;
file "localhost";
};
zone "127.IN-ADDR.ARPA" {
type master;
file "127";
};
zone "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa" {
type master;
file "loopback.v6";
};
zone "dns-rpz.uninett.no" {
type slave;
file "sz/dns-rpz.uninett.no";
masters {
158.38.212.119;
};
};
zone "zone3.ph.rpz.switch.ch" {
type slave;
file "sz/zone3.ph.rpz.switch.ch";
masters {
158.38.212.119;
};
};
zone "zone3.mw.rpz.switch.ch" {
type slave;
file "sz/zone3.mw.rpz.switch.ch";
masters {
158.38.212.119;
};
};
zone "zone3.misc.rpz.switch.ch" {
type slave;
file "sz/zone3.misc.rpz.switch.ch";
masters {
158.38.212.119;
};
};
```
### Relevant logs and/or screenshots
See above.
### Possible fixes
Sorry, I don't know.

Milestone: BIND 9.19.x

## Issue #2154: Investigate rpz system test failure
Mark Andrews · updated 2023-10-23 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2154

Job [#1161531](https://gitlab.isc.org/isc-projects/bind9/-/jobs/1161531) failed for c5c2a4820b6dd705443e42a515cd20fc4293d35b.

Milestone: BIND 9.19.x

## Issue #2153: Rebuild RBTDB while rehashing
Brian Conry · updated 2021-10-05 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2153

@ondrej had an idea related to rebuilding the RBTDB while rehashing as a means of clearing out empty interior nodes.
This issue is a reminder.
Description to be updated and amended.

Milestone: BIND 9.19.x · Assignee: Ondřej Surý

## Issue #2110: dnssec-signzone report() missing newline
Scott Nicholas · updated 2023-10-31 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2110

### Summary
A regression in the report() function of dnssec-signzone: it no longer prints a newline after each message.
### BIND version used
### Steps to reproduce
### What is the current *bug* behavior?
```
[root@foo named]# dnssec-signzone -3 deadc0ffee -E pkcs11 -S -K /var/named/keys -X +90d -x -o example.org example.org.zone
Fetching example.org/RSASHA256/26302 (KSK) from key repository.Fetching example.org/RSASHA256/27193 (ZSK) from key repository.Verifying the zone using the following algorithms: RSASHA256.
Zone fully signed:
Algorithm: RSASHA256: KSKs: 1 active, 0 stand-by, 0 revoked
ZSKs: 1 active, 0 present, 0 revoked
example.org.zone.signed
```
### What is the expected *correct* behavior?
```
[root@foo named]# dnssec-signzone -3 deadc0ffee -E pkcs11 -S -K /var/named/keys -X +90d -x -o example.org example.org.zone
Fetching example.org/RSASHA256/26302 (KSK) from key repository.
Fetching example.org/RSASHA256/27193 (ZSK) from key repository.
Verifying the zone using the following algorithms: RSASHA256.
Zone fully signed:
Algorithm: RSASHA256: KSKs: 1 active, 0 stand-by, 0 revoked
ZSKs: 1 active, 0 present, 0 revoked
example.org.zone.signed
```
### Relevant configuration files
N/A
### Relevant logs and/or screenshots
N/A
### Possible fixes
https://gitlab.isc.org/isc-projects/bind9/-/blob/main/bin/dnssec/dnssec-signzone.c#L2729
BIND 9.11 has a `putc('\n')` there.

Milestone: BIND 9.19.x

## Issue #2082: Cache Cleaning Diagnostic Information
Brian Conry · updated 2021-10-05 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2082

1. A counter in `cachestats` (JSON/XML) and `++ Cache Statistics ++` (named.stats) for the number of nodes without data
1. A counter in `cachestats` (JSON/XML) and `++ Cache Statistics ++` (named.stats) for the number of deadnodes
1. A counter in `cachestats` (JSON/XML) and `++ Cache Statistics ++` (named.stats) for the number of times that the function `lib/dns/cache.c:incremental_cleaning_action` is called
1. A counter in `cachestats` (JSON/XML) and `++ Cache Statistics ++` (named.stats) for the number of times that the function `lib/dns/cache.c:overmem_cleaning_action` is called
1. A counter in `cachestats` (JSON/XML) and `++ Cache Statistics ++` (named.stats) for the number of times that the function `lib/dns/rbtdb.c:overmem_purge` is called
1. A counter in `cachestats` (JSON/XML) and `++ Cache Statistics ++` (named.stats) for the number of times that the function `lib/dns/rbtdb.c:cleanup_dead_nodes` is called
Additional logging at `DNS_LOGCATEGORY_DATABASE`, `DNS_LOGMODULE_CACHE`, `ISC_LOG_DEBUG(1)` in the following functions:
1. `lib/dns/rbtdb.c:overmem_purge` - log node name (local name plus tree origin?) purged; log mctx in_use delta for both heap and tree after the purge
1. `lib/dns/rbtdb.c:cleanup_dead_nodes` - log `bucketnum`; log number of nodes purged; log mctx in_use delta for both heap and tree after the purge
Noting for the record that we already have CacheNodes/"cache database nodes" giving us the total node count.
Also noting that counts of fully expired and fully ancient nodes would be nice, but there aren't usually code events marking a node's transition from one group to another, so that will have to be something left for core dumps or full database traversals.
Finally, if this is to be prepared as a patch, can it please also include adding an `INSIST(0)` in `lib/dns/rootns.c:dns_root_checkhints()` immediately following the logging of "unable to get root NS rrset from cache"?

Milestone: BIND 9.19.x

## Issue #2017: auto-dnssec zones lose NSEC3 params when the zone journal is removed
Klaus Darilion · updated 2023-02-18 · https://gitlab.isc.org/isc-projects/bind9/-/issues/2017

### Summary
When BIND removes the journal ("journal file is out of date: removing journal file"), the zone also forgets its NSEC3 settings.
### BIND version used
```
BIND 9.12.2-P2 <id:b2bf278>
running on Linux x86_64 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019
built by make with '--prefix=/dns/bind/9.12.2-P2' '--enable-threads' '--enable-static' '--enable-ipv6=yes' '--with-openssl=yes' '--with-gssapi=no' '--enable-rrl' 'CFLAGS=-g'
compiled by GCC 4.8.4
compiled with OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
linked to OpenSSL version: OpenSSL 1.0.2n 7 Dec 2017
compiled with libxml2 version: 2.9.1
linked to libxml2 version: 20904
compiled with zlib version: 1.2.8
linked to zlib version: 1.2.11
threads support is enabled
```
I noticed this problem also with older versions, but have not tested newer versions.
### Steps to reproduce
I use BIND as a bump-in-the-wire signer, i.e.:
```
zone "nxdomain.at" {
type slave;
file "/tmp/nxdomain.at";
masters { 176.9.98.135; };
auto-dnssec maintain;
dnssec-dnskey-kskonly no;
inline-signing yes;
key-directory "/tmp/";
};
```
I do not know why, but the journal is lost quite often. I can often trigger it by retransferring the zone with `rndc retransfer`:
```
received control channel command 'retransfer nxdomain.at'
transfer of 'nxdomain.at/IN (unsigned)' from 176.9.98.135#53: connected using 83.136.34.11#52809
zone nxdomain.at/IN (unsigned): transferred serial 2020070901
transfer of 'nxdomain.at/IN (unsigned)' from 176.9.98.135#53: Transfer status: success
transfer of 'nxdomain.at/IN (unsigned)' from 176.9.98.135#53: Transfer completed: 1 messages, 13 records, 16168 bytes, 0.025 secs (646720 bytes/sec)
zone nxdomain.at/IN (signed): journal file is out of date: removing journal file
zone nxdomain.at/IN (signed): loaded serial 2020073557
zone nxdomain.at/IN (signed): receive_secure_serial: unchanged
zone nxdomain.at/IN (signed): receive_secure_serial: unchanged
zone nxdomain.at/IN (signed): sending notifies (serial 2020073557)
```
### What is the current *bug* behavior?
The problem is that when this happens, a zone that uses NSEC3 for zone-walking protection suddenly becomes vulnerable to zone walking.
### What is the expected *correct* behavior?
The NSEC3 params should be recovered when the journal is broken, or they should be stored in a separate file, e.g. in the keys directory.
### Relevant configuration files
```
options {
directory "/var/cache/bind";
// Disable recursion
allow-recursion {"none";};
allow-update { none; };
recursion no;
// Allow new zones to be added via rdnc tool
allow-new-zones yes;
//========================================================================
// If BIND logs error messages about the root key being expired,
// you will need to update your keys. See https://www.isc.org/bind-keys
//========================================================================
dnssec-validation auto;
auth-nxdomain no; # conform to RFC1035
listen-on-v6 { any; };
// lifetime of DNSSEC signatures (RRSIGs) in days
sig-validity-interval 30;
max-journal-size 1m;
version none;
};
```

Milestone: BIND 9.19.x

## Issue #1962: Integrate GitHub's super-linter image with CI
Michal Nowak · updated 2021-10-18 · https://gitlab.isc.org/isc-projects/bind9/-/issues/1962

Recently GitHub [introduced](https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/) [Super Linter](https://github.com/github/super-linter), a Docker image that provides an easy way to lint various source files.
`sudo docker run -e RUN_LOCAL=true -v $PWD:/tmp/lint github/super-linter`:
```
The script has completed
ERRORS FOUND in MARKDOWN:[16]
ERRORS FOUND in BASH:[373]
ERRORS FOUND in PERL:[44]
ERRORS FOUND in PYTHON:[12]
Exiting with errors found!
```
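If Super Linter were wired into our CI, the job could parse these summary lines to report per-language failures. A minimal sketch (the output format is inferred from the sample above and may vary between Super Linter versions):

```python
import re

def parse_superlinter_summary(output: str) -> dict:
    """Extract per-language error counts from lines like
    'ERRORS FOUND in BASH:[373]'."""
    counts = {}
    for lang, n in re.findall(r"ERRORS FOUND in ([A-Z]+):\[(\d+)\]", output):
        counts[lang] = int(n)
    return counts

sample = """The script has completed
ERRORS FOUND in MARKDOWN:[16]
ERRORS FOUND in BASH:[373]
ERRORS FOUND in PERL:[44]
ERRORS FOUND in PYTHON:[12]
Exiting with errors found!"""

print(parse_superlinter_summary(sample))
# → {'MARKDOWN': 16, 'BASH': 373, 'PERL': 44, 'PYTHON': 12}
```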
Given how easy it is to use in a GitHub Action, it may become very popular with upstream projects and sideline the linter versions distributed with the OS.
On the pro side, it would spare us from keeping our own linters updated; we would simply go with whatever Super Linter provides.
On the con side, it may not be ideal for use outside GitHub.

Milestone: BIND 9.19.x

## Issue #1903: Authoritative server leaks 260 KB every 1-2 hours
Michal Nowak · updated 2023-11-03 · https://gitlab.isc.org/isc-projects/bind9/-/issues/1903

I ran a [stress test](https://gitlab.isc.org/isc-private/bind-qa/-/tree/master/bind9/stress) against a BIND 9.16.3 authoritative server on Alpine Linux 3.12 (which uses musl libc) for 18 hours and noticed that in many cases `named`'s VSZ usage bumps by 260 KB every 1-2 hours. I haven't spotted this on Linux distributions with glibc. There are a few discrepancies from the "260 KB rule", but it seems too regular to be a coincidence. Although each bump is really tiny, a structure might be leaking.
![named-memory-use-graph-alpine-3.12](/uploads/de49507ad28ed28fababf1b713f29b6d/named-memory-use-graph-alpine-3.12.png)
Here are `VSZ`/`RSS` data every 30 seconds: [alpine-vm.txt](/uploads/9942d3ad62e9c8ec145a2e9fb9bc815b/alpine-vm.txt)
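The regularity of the bumps can be checked mechanically by diffing successive unique VSZ readings. A small sketch (the sample values are hypothetical, shaped like the `uniq`'d data; values in KB as reported by `ps`):

```python
def vsz_deltas(samples):
    """Return the step sizes between successive unique VSZ readings."""
    unique = [s for i, s in enumerate(samples) if i == 0 or s != samples[i - 1]]
    return [b - a for a, b in zip(unique, unique[1:])]

# Hypothetical 30-second VSZ samples (KB); repeats collapse like `uniq`.
samples = [422496, 422496, 422756, 423016, 423016, 423276]
print(vsz_deltas(samples))  # → [260, 260, 260]
```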
Here's a sample of the last few lines, funneled through `uniq`:
```
...
422496
422756
423016
423276
423536
423796
424056
424316
424576
424836
425096
425356
425616
425876
```

Status: Not planned

## Issue #1892: Reuse nmsockets in TCP
Witold Krecicki · updated 2022-12-14 · https://gitlab.isc.org/isc-projects/bind9/-/issues/1892

We currently don't reuse `isc_nmsocket_t` sockets at all; we destroy them after a connection is closed. That's a performance hit for TCP. We should put semi-ready (allocated + cond/mutex initialized) objects on a stack for reuse, just like we do with uvreqs and handles.

Milestone: BIND 9.19.x

## Issue #1890: Automation of DS record submission to the registrar/parent, integrated with the new kasp/dnssec-policy support in BIND
pgnd · updated 2021-10-05 · https://gitlab.isc.org/isc-projects/bind9/-/issues/1890

### Description
i'm migrating/implementing the new `dnssec-policy` usage & KASP workflow in my bind 9.16.3.
the new policy does a nice job of streamlining the signing/key mgmt.
after key generation/rotation, the 'last step' is submitting new/changed DS Records to the relevant registrar
i'd like to automate the process of submitting generated DS Records to the registrar/parent using a capable registrar's DNSSEC API.
as i understand, there is neither any mechanism in Bind for automating the DS Record submit, nor is there
an external hook mechanism to external scripts that can handle the task.
offline, it's been suggested to me that with the current version of bind, a 'best' approach would be to write a simple script that checks for the existence of the CDS/CDNSKEY RRset in each signed zone.
then, when a new record is added, trigger a submission of the DS to the parent. and, similarly, when a record is removed, trigger a withdrawal of the DS.
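The suggested polling approach boils down to a set difference between the previously seen CDS/DS records and the current ones. A sketch of that core decision logic only — `ds_actions` and the record strings are hypothetical, and the registrar-API calls that would consume the result are out of scope:

```python
def ds_actions(previous: set, current: set):
    """Compare two CDS/DS record sets and decide what to send to the parent.

    Returns (to_submit, to_withdraw) as sets of record strings.
    """
    to_submit = current - previous    # newly published records -> submit DS
    to_withdraw = previous - current  # removed records -> withdraw DS
    return to_submit, to_withdraw

# Hypothetical record strings; real ones would be full DS/CDS rdata.
prev = {"257 3 8 AwEA...old"}
curr = {"257 3 8 AwEA...old", "257 3 8 AwEA...new"}
submit, withdraw = ds_actions(prev, curr)
print(submit)    # → {'257 3 8 AwEA...new'}
print(withdraw)  # → set()
```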
rather than re-inventing the wheel ... i'm guessing i'm not the only one who'd like to automate this.
### Request
an additional response on ML
> This is where we need to get the registrars to follow standards. They are written
> so everyone doesn’t have to cobble together ad-hoc solutions. Hourly scans of all
> the DNSSEC delegations by the registrars would do.
>
> Personally I prefer push solutions but I couldn’t get the IETF to agree.
> https://tools.ietf.org/html/draft-andrews-dnsop-update-parent-zones-04
sounds reasonable. at very least, better than nothing.
in the absence of a standards-based solution, integrated in bind's dnssec-policy/kasp feature set, an option for script/execution hooks in bind to external scripts, would be a good 1st step, even if ad-hoc
e.g., "if when change in DS Record in local bind, then fire this external script which will manage the DS submit/withdraw via API to registrar"
failing any/all of that^, a well documented example of a completely de-coupled solution, independent of bind itself, ideally registrar/API agnostic, but demonstrated to work, would be useful.
that's of course doable -- but again, ad-hoc, and seems a step backwards given the nice progress with dnssec-policy/kasp simplifications in recent versions.
### Links / references

Milestone: BIND 9.19.x · Assignee: Matthijs Mekking (matthijs@isc.org)

## Issue #1871: RNDC command to dump stats
Vicky Risk (vicky@isc.org) · updated 2023-05-31 · https://gitlab.isc.org/isc-projects/bind9/-/issues/1871

We have a request from a resolver operator to be able to issue an rndc command that dumps a file of statistics in a format that can be imported and processed by Prometheus. The use case is that the node might be in a remote PoP with limited connectivity, where setting up ongoing streaming may be difficult, or where the necessary access isn't available.
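A de-coupled converter along these lines is conceivable even without the rndc command: fetch the statistics-channels JSON and re-emit it in the Prometheus text exposition format. A sketch of the conversion step only (the input layout here is a simplified assumption, not BIND's actual statistics-channels schema):

```python
def to_prometheus(stats: dict, prefix: str = "bind") -> str:
    """Flatten a {category: {name: value}} stats dict into Prometheus
    text exposition format."""
    lines = []
    for category, counters in stats.items():
        for name, value in counters.items():
            # Prometheus metric names allow [a-zA-Z0-9_:]; normalize the rest.
            metric = f"{prefix}_{category}_{name}".replace("-", "_").lower()
            lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"

# Simplified example input (not BIND's real statistics-channels schema).
sample = {"opcodes": {"QUERY": 1234}, "qtypes": {"A": 900, "AAAA": 300}}
print(to_prometheus(sample))
# prints:
# bind_opcodes_query 1234
# bind_qtypes_a 900
# bind_qtypes_aaaa 300
```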
Entered on behalf of cmosher@quad9.net.

Milestone: BIND 9.19.x

## Issue #1864: Commit 54fe75b9b76eba92efd0fc1cded4a0ac0adc0ba9 introduced a regression in 9.11
Brian Conry · updated 2021-10-05 · https://gitlab.isc.org/isc-projects/bind9/-/issues/1864

This commit causes a problem in 9.11 when using `query-source` with `port 53`.
After a reconfig or reload the server will stop receiving data.
I've confirmed that 9.14 and later versions are not affected.
I've also confirmed that both 9.11 and 9.11-S are affected.
This was discovered via a customer ticket, though the customer has since removed the query-source port configuration.

Status: Not planned

## Issue #1863: dnssec-verify: specify a current time
Peter Davies · updated 2024-02-14 · https://gitlab.isc.org/isc-projects/bind9/-/issues/1863

### Description
dnssec-verify should accept a specified current time: if we generate a signed zone via dnssec-signzone with an inception time in the future, we then have a way to fully verify it (including the signature validity time checks) with dnssec-verify beforehand.
### Request
A command-line parameter that allows specifying the current time to be used by dnssec-verify.
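The time check itself is simple once "now" is overridable. An illustrative sketch of a signature validity-window test with an explicit current time (this is not dnssec-verify's internal implementation):

```python
from datetime import datetime, timezone

def sig_time_valid(inception, expiration, now=None):
    """Return True if `now` falls inside [inception, expiration].
    An explicit `now` plays the role of the requested command-line
    parameter; by default the real wall clock is used."""
    if now is None:
        now = datetime.now(timezone.utc)
    return inception <= now <= expiration

# Hypothetical signature window with an inception time in the future.
inc = datetime(2030, 6, 1, tzinfo=timezone.utc)
exp = datetime(2030, 7, 1, tzinfo=timezone.utc)

# With the wall clock this fails today, but an overridden "current time"
# inside the window lets the zone be checked ahead of time:
print(sig_time_valid(inc, exp, now=datetime(2030, 6, 15, tzinfo=timezone.utc)))  # → True
```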
### Links / references
[GL #743 ](https://gitlab.isc.org/isc-projects/bind9/-/issues/743)
[RT #16466](https://support.isc.org/Ticket/Display.html?id=16466)

Not planned

https://gitlab.isc.org/isc-projects/bind9/-/issues/1837
[ISC-support #14369] Considering changing ECS caching strategy or negative data (2024-03-13T13:17:48Z, Brian Conry)

Currently our ECS implementation caches all negative responses at the global scope.
A customer has requested that we store them at the learned scope.
RFC 7871 (Informational only) section 7.4, paragraph 4 notes that the original spec was ambiguous and hints that the interpretation of scoping negative answers is more likely to be used in future protocol specifications (if any) than caching them globally.
> This issue is expected to be revisited in a future revision of the protocol, possibly blessing the mixing of positive and negative answers. There are implications for cache data structures that developers should consider when writing new ECS code.

Not planned

https://gitlab.isc.org/isc-projects/bind9/-/issues/1784
RFC9103: DNS Zone Transfer over TLS (XoT) (2022-08-01T11:52:34Z, Peter Davies)

### Description
The [RFC9103](https://datatracker.ietf.org/doc/html/rfc9103) describes the use of TLS to encrypt zones transfers in order to provide confidentiality, known as XFR-over-TLS (XoT). The standard has been adopted by the DPRIVE WG.
### Feature Request
The feature request is for BIND to support XFR-over-TLS as described in the above RFC. This will obviously be dependent on DoT (RFC7858) being implemented in BIND. The specific aspects of the XoT implementation that are desired are:
- [x] Support for both AXFR and IXFR
- [x] XoT requires the `dot` ALPN token to be negotiated (See: #2794)
- [x] XoT requires TLSv1.3 or higher (See: #2795, and related #2796)
- [x] Support for XFR-over-TLS both when BIND is acting as a primary and as a secondary
- [x] XFR-over-TLS (XoT): primaries need to be able to restrict XFR to just TLS (#2776)
- [ ] Related: replace `tcp-only` with a more generic option (#2992)
- [x] Support for authentication of TLS connections via X.509 certificates (Strict TLS and Mutual TLS)
  - Related MR: !5600
- [x] A TLS context cache needs to be implemented for context reuse and fast retrieval of the data associated with contexts (like the CA intermediates chain): #3067, !5672
- [x] Add remote TLS certificate verification support, implement Strict and Mutual TLS authentication (#3163)
- [ ] Optimisation of TCP/TLS connections such that persistent connections can be re-used for multiple IXFRs for the same zone, and also for IXFRs for different zones.
- [x] Client TLS session resumption support: !6274
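For illustration, a sketch of how the secondary side fits together in 9.18-era configuration syntax; the zone name, address, and file paths are hypothetical:

```
# Hypothetical example: transfer "example.com" in over TLS, authenticating
# the primary via Strict TLS (remote-hostname plus a CA bundle).
tls xot-upstream {
    remote-hostname "primary.example.com";
    ca-file "/etc/bind/ca.pem";
};

zone "example.com" {
    type secondary;
    file "secondary/example.com.db";
    primaries { 192.0.2.1 port 853 tls xot-upstream; };
};
```

On the primary side, the restriction from #2776 is expressed by limiting transfers to the TLS transport, e.g. `allow-transfer port 853 transport tls { ... };`.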
### Related issues/bugs
- [x] #2450 - Follow-up from "Draft: Resolve "XoT xfrin""
  - See !5602, which addresses the most important points from the issue
- [x] #2884 - Sometimes dig aborts on an AXFR query over TLS
- [x] #2986 - TLS not working on the client side (dig/named)
- [x] #3004 - dig and named crash when receiving XFR over TLS
See RT [#16298](https://support.isc.org/Ticket/Display.html?id=16298)

BIND 9.19.x · Artem Boldariev

https://gitlab.isc.org/isc-projects/bind9/-/issues/1776
BIND 9.16 and cache node locks for name cleaning vs. 'the thundering herd' (2021-10-05T12:07:29Z, Cathy Almond)

From [Support ticket #16212](https://support.isc.org/Ticket/Display.html?id=16212)
During investigations of intermittent 'brownouts' - periods in which named seemingly stops actioning client queries for a short period and then resumes processing a second or two later (yes, delays of seconds, not ms) - we 'caught' one interesting scenario on BIND 9.16 in which it appeared that the vast majority of the active threads (both netmgr and taskmgr - so both client queries being answered from cache AND client queries for which recursion had just taken place) were competing for the same cache node lock.
The pstack output demonstrating the problem was automatically triggered by monitoring for anomalies in inbound versus outbound network traffic.
The symptoms when this issue occurs are that:
* Outbound client-facing traffic rates plummet (well below the proportion that you would expect to see if it was only cache misses not being serviced)
* Recursive query rates plummet too
* CPU use increases - but in user space not in system space
* Recursive clients backlog increases (and may hit the limit)
* Fetchlimits may be triggered (we suspect this and its predecessor are symptoms, not causes; however, triggering fetchlimits will exacerbate the situation, both from the client perspective and via increased traffic rates as clients retry/re-send)
What we saw in the pstacks was that the majority of netmgr threads (these answer directly from cache) were attempting to get a write lock on the node - for example:
```
Thread 74 (Thread 0x7f3ff366e700 (LWP 11713)):
#0 isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980, type=type@entry=isc_rwlocktype_write) at rwlock.c:57
#1 0x000000000051d826 in decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510, least_serial=least_serial@entry=0, nlock=nlock@entry=isc_rwlocktype_read, tlock=tlock@entry=isc_rwlocktype_none, pruning=pruning@entry=false) at rbtdb.c:2040
#2 0x00000000005215bf in detachnode (db=0x7f3fc6457010, targetp=targetp@entry=0x7f3ff366da88) at rbtdb.c:5352
#3 0x00000000005217be in rdataset_disassociate (rdataset=<optimized out>) at rbtdb.c:8691
#4 0x00000000005657e8 in dns_rdataset_disassociate (rdataset=rdataset@entry=0x7f3fad30cf28) at rdataset.c:111
#5 0x00000000004ebb21 in msgresetnames (first_section=0, msg=0x7f3fad2e1a50, msg@entry=0x7f3fad30b5f0) at message.c:438
#6 msgreset (msg=msg@entry=0x7f3fad2e1a50, everything=everything@entry=false) at message.c:524
#7 0x00000000004ec95a in dns_message_reset (msg=0x7f3fad2e1a50, intent=intent@entry=1) at message.c:760
#8 0x00000000004797ba in ns_client_endrequest (client=0x7f3fae5b8550) at client.c:229
#9 ns__client_reset_cb (client0=0x7f3fae5b8550) at client.c:1586
#10 0x0000000000632989 in isc_nmhandle_unref (handle=handle@entry=0x7f3fae5b83e0) at netmgr.c:1158
#11 0x0000000000632c30 in isc__nm_uvreq_put (req0=req0@entry=0x7f3ff366dbb8, sock=<optimized out>) at netmgr.c:1291
#12 0x00000000006357c4 in udp_send_cb (req=<optimized out>, status=<optimized out>) at udp.c:465
#13 0x00007f3ff5375153 in uv__udp_run_completed () from /lib64/libuv.so.1
#14 0x00007f3ff53754d3 in uv__udp_io () from /lib64/libuv.so.1
#15 0x00007f3ff5367c43 in uv_run () from /lib64/libuv.so.1
#16 0x0000000000632fda in nm_thread (worker0=0x138e3e0) at netmgr.c:481
#17 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3ff484488d in clone () from /lib64/libc.so.6
```
A handful of threads are attempting to get a read lock on the same node - for example:
```
Thread 59 (Thread 0x7f3feab0e700 (LWP 11734)):
#0 0x00007f3ff4f3d144 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1 0x000000000063cc6e in isc_rwlock_lock (rwl=0x7f3f59523980, type=type@entry=isc_rwlocktype_read) at rwlock.c:48
#2 0x00000000005129c6 in rdataset_getownercase (rdataset=<optimized out>, name=0x7f3feaaffde0) at rbtdb.c:9770
#3 0x000000000056620a in towiresorted (rdataset=rdataset@entry=0x7f3ec42dee70, owner_name=owner_name@entry=0x7f3ec42dd0a0, cctx=<optimized out>, target=<optimized out>, order=<optimized out>, order_arg=order_arg@entry=0x7f3ec42b8718, partial=true, options=1, countp=0x7f3feab005dc, state=<optimized out>) at rdataset.c:444
#4 0x0000000000566e3f in dns_rdataset_towirepartial (rdataset=rdataset@entry=0x7f3ec42dee70, owner_name=owner_name@entry=0x7f3ec42dd0a0, cctx=<optimized out>, target=<optimized out>, order=<optimized out>, order_arg=order_arg@entry=0x7f3ec42b8718, options=<optimized out>, options@entry=1, countp=<optimized out>, countp@entry=0x7f3feab005dc, state=<optimized out>, state@entry=0x0) at rdataset.c:565
#5 0x00000000004ecc71 in dns_message_rendersection (msg=0x7f3ec42b8550, sectionid=sectionid@entry=1, options=options@entry=6) at message.c:2086
#6 0x00000000004780f3 in ns_client_send (client=client@entry=0x7f3ec5d4b510) at client.c:555
#7 0x0000000000485b7c in query_send (client=0x7f3ec5d4b510) at query.c:552
#8 0x000000000048de23 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10921
#9 0x000000000048f76d in query_respond (qctx=0x7f3feab09a70) at query.c:7414
#10 query_prepresponse (qctx=qctx@entry=0x7f3feab09a70) at query.c:9913
#11 0x000000000049181c in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=0) at query.c:6836
#12 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#13 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#14 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#15 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#16 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#17 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#18 0x000000000048de05 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10853
#19 0x0000000000492420 in query_dname (qctx=<optimized out>) at query.c:9806
#20 query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65568) at query.c:6872
#21 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#22 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#23 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#24 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#25 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#26 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#27 0x000000000048de05 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10853
#28 0x0000000000492420 in query_dname (qctx=<optimized out>) at query.c:9806
#29 query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65568) at query.c:6872
#30 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#31 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#32 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#33 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#34 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#35 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#36 0x0000000000494b26 in query_setup (client=client@entry=0x7f3ec5d4b510, qtype=<optimized out>) at query.c:5217
#37 0x0000000000497056 in ns_query_start (client=client@entry=0x7f3ec5d4b510) at query.c:11318
#38 0x000000000047b101 in ns__client_request (handle=<optimized out>, region=<optimized out>, arg=<optimized out>) at client.c:2209
#39 0x0000000000635462 in udp_recv_cb (handle=<optimized out>, nrecv=48, buf=0x7f3feab0ab00, addr=<optimized out>, flags=<optimized out>) at udp.c:329
#40 0x00007f3ff53755db in uv__udp_io () from /lib64/libuv.so.1
#41 0x00007f3ff53779c8 in uv__io_poll () from /lib64/libuv.so.1
#42 0x00007f3ff5367c70 in uv_run () from /lib64/libuv.so.1
#43 0x0000000000632fda in nm_thread (worker0=0x13926e8) at netmgr.c:481
#44 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#45 0x00007f3ff484488d in clone () from /lib64/libc.so.6
```
Meanwhile, the threads run by taskmgr (this bunch would have recursed) were attempting to get write locks (unsurprisingly, although depending on the node and the client query, I guess it's also possible that one might want to get a read lock):
Here's a writer:
```
Thread 50 (Thread 0x7f3fe587b700 (LWP 11746)):
#0 isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980, type=type@entry=isc_rwlocktype_write) at rwlock.c:57
#1 0x000000000051d826 in decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510, least_serial=least_serial@entry=0, nlock=nlock@entry=isc_rwlocktype_read, tlock=tlock@entry=isc_rwlocktype_none, pruning=pruning@entry=false) at rbtdb.c:2040
#2 0x00000000005215bf in detachnode (db=0x7f3fc6457010, targetp=0x7f3fe587acc0) at rbtdb.c:5352
#3 0x00000000004bdd83 in dns_db_detachnode (db=<optimized out>, nodep=nodep@entry=0x7f3fe587acc0) at db.c:588
#4 0x00000000004804cb in qctx_clean (qctx=qctx@entry=0x7f3fe587a830) at query.c:5097
#5 0x000000000048db5a in ns_query_done (qctx=qctx@entry=0x7f3fe587a830) at query.c:10834
#6 0x000000000048f76d in query_respond (qctx=0x7f3fe587a830) at query.c:7414
#7 query_prepresponse (qctx=qctx@entry=0x7f3fe587a830) at query.c:9913
#8 0x000000000049181c in query_gotanswer (qctx=qctx@entry=0x7f3fe587a830, res=res@entry=0) at query.c:6836
#9 0x0000000000496870 in query_resume (qctx=0x7f3fe587a830) at query.c:6134
#10 fetch_callback (task=<optimized out>, event=0x7f3ead5c9c18) at query.c:5716
#11 0x000000000064007a in dispatch (threadid=<optimized out>, manager=<optimized out>) at task.c:1152
#12 run (queuep=<optimized out>) at task.c:1344
#13 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f3ff484488d in clone () from /lib64/libc.so.6
```
In this particular instance, every single one of the legacy i/o-handler threads was twiddling its thumbs (sitting in epoll_wait()) - which is probably not too surprising if no taskmgr workers are sending out queries to auth servers?
Doing stats on this particular capture (74 threads - 24x netmgr, 24x taskmgr, 24x legacy i/o plus 1 each main and the timer thread), we have:
* 33 instances of `isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980`
* 31 instances of `rbtdb=rbtdb@entry=0x7f3fc6457010`
* 30 instances of `node=node@entry=0x7f3eace34510`
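Counts like these can be reproduced mechanically by counting how often the shared lock/db/node addresses recur across the captured frames. A small sketch over an abbreviated, illustrative excerpt (not the real capture):

```python
# Sketch: derive per-address contention counts from a pstack capture by
# counting how often the same lock/db/node addresses appear in the frames.
# The excerpt below is abbreviated and illustrative only.
capture = """
#0 isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980, type=type@entry=isc_rwlocktype_write)
#1 decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510)
#1 isc_rwlock_lock (rwl=0x7f3f59523980, type=type@entry=isc_rwlocktype_read)
#1 decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510)
"""

def count_hits(text: str, needles: list[str]) -> dict[str, int]:
    """Count occurrences of each address substring in the capture."""
    return {n: text.count(n) for n in needles}

hits = count_hits(capture, ["0x7f3f59523980", "0x7f3fc6457010", "0x7f3eace34510"])
print(hits)  # many threads stuck on the same rwlock/db/node show up here
```

If most threads share one node address, that supports the "same cache node" reading over unrelated contention.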
It might be possible to prove from the pstack output whether this is a series of different names all attached to the same node, or a single expiring name that all of the threads are attempting to clean up simultaneously.
Either way, the locking is not working well in this situation - there appears to be a lot of spinning in user space.
Hypotheses being tendered currently include:
* This scenario has always potentially existed, but using pthread-rwlocks amplifies it considerably
* Could this be a case where prefetching (enabled with default settings in this example) hits a surprise edge case?
* Is it possible we're seeing the after-effects of another delay which has resulted in late client query-response processing for something that has a very short TTL in cache?
* Is this a scenario where a client comes along and queries near-simultaneously (and probably quite innocently) for a lot of similar names under the same domain/apex very close to the time where they would all be naturally expiring from cache?
* Could it be that TTL=0 handling has broken in 9.16 with the introduction of netmgr (noting that TTL=0 responses from auth servers would be expected to be available solely to the clients that recursed and waited for the fetch completion - not to anyone who came along after the fetch had populated cache for the waiting client request to be fulfilled - this should all be in taskmgr and none of it in netmgr)?
* Do we perhaps have too many threads running (detected CPUs = 24)?

BIND 9.19.x · Ondřej Surý