ISC Open Source Projects issues
https://gitlab.isc.org/groups/isc-projects/-/issues
(feed updated 2020-05-04T10:37:46Z)

https://gitlab.isc.org/isc-projects/bind9/-/issues/1723
Replace fputs() with fprintf() (Michał Kępień, updated 2020-05-04)

The following discussion from !985 should be addressed:

- [ ] @michal started a [discussion](https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/985#note_120581): (+3 comments)
> Replacing `fputs()` with `fprintf()` sounds like something to do
> tree-wide in another issue (may be possible with Coccinelle)?

Milestone: May 2020 (9.11.19, 9.11.19-S1, 9.14.12, 9.16.3)

https://gitlab.isc.org/isc-projects/bind9/-/issues/1722
Ensure unit test core dumps are collected for Automake builds (Michał Kępień, updated 2020-06-08)

The following discussion from !985 should be addressed:
- [ ] @michal started a [discussion](https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/985#note_120597): (+1 comment)
> We need to make sure that - with this MR merged (post-factum is okay) -
> if any unit test crashes, usable core dumps are placed among job
> artifacts in GitLab CI. Most of the work put into the
> `unit/unittest.sh.in` script was done to ensure that.

Milestone: June 2020 (9.11.20, 9.11.20-S1, 9.16.4, 9.17.2). Assignee: Michal Nowak.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1721
Grow and shrink dnssec-sign statistics on key rollover events (Matthijs Mekking <matthijs@isc.org>, updated 2021-08-31)

!2067 introduced dnssec-sign statistics (#513) to the zone statistics. This introduced an operational issue because when using `zone-statistics full;` the memory usage is going through the roof. It turns out that using the key id as index wasn't the greatest idea.
!3304 fixes this (#1179) by allocating just four key slots per zone. If a zone exceeds that number of keys, for example through a key rollover, keys are rotated out on a FIFO basis.
This works for most cases and fixes the immediate problem of high memory usage, but if you sign your zone with many, many keys, or sign with a ZSK/KSK double-algorithm strategy, you may see weird statistics.
A better strategy would be to grow the number of key slots per zone on key rollover events: grow during key rollover, shrink on an LRU basis.
In addition, if a zone is signed with two algorithms there is a very small chance that two keys will have the same key tag. The dnssec-sign statistics then prevent operators from identifying which key is which when key ids collide across algorithms.
When dumping stats, rather than passing just the key tag, pass `kval` so that an appropriate value label can be constructed, or emit something like this:
```
{
"algorithm": value,
"tag": value,
    "sign-count": value,
"refresh-count": value
},
```

Milestone: September 2021 (9.16.21, 9.16.21-S1, 9.17.18). Assignee: Matthijs Mekking.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1720
Building documentation is broken with Automake (Michał Kępień, updated 2020-05-11)

The following discussion from !985 should be addressed:
- [ ] @michal started a [discussion](https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/985#note_120576): (+1 comment)
> This breaks the existing release process and will either need to be
> tweaked before we release a version with this MR merged or we will need
> to fix building documentation and revert this removal.

Milestone: June 2020 (9.11.20, 9.11.20-S1, 9.16.4, 9.17.2)

https://gitlab.isc.org/isc-projects/stork/-/issues/223
Update Stork ARM describing host reservations support (Marcin Siodelski, updated 2020-04-02)

Following tickets #210 and #214 we need to describe host reservation support in Stork.

Milestone: 0.6. Assignee: Marcin Siodelski.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1719
Observed stats underflow in multiple stats (Brian Conry, updated 2021-12-03)

While looking at a customer-provided named.stats file, it was observed that several counters had underflowed in BIND version 9.11.16-S1.
A selection of the underflowing stats:
```
+++ Statistics Dump +++ (1584405000)
++ Cache DB RRsets ++
[View: default]
18446744073709551614 ~A
18446744073709551615 ~NS
++ Socket I/O Statistics ++
18446744073709551556 UDP/IPv4 sockets active
18446744073709551584 UDP/IPv6 sockets active
18446744073709551600 TCP/IPv4 sockets active
+++ Statistics Dump +++ (1584405300)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~CNAME
18446744073709551614 ~!AAAA
+++ Statistics Dump +++ (1584405900)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~AAAA
+++ Statistics Dump +++ (1584408000)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~RRSIG
+++ Statistics Dump +++ (1584418500)
++ Resolver Statistics ++
[View: default]
18446744073709551615 active fetches
+++ Statistics Dump +++ (1584447000)
++ Cache DB RRsets ++
[View: default]
18446744073709551608 ~NXDOMAIN
+++ Statistics Dump +++ (1584485100)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~NULL
+++ Statistics Dump +++ (1584504300)
++ Cache DB RRsets ++
[View: default]
18446744073709551521 NULL
+++ Statistics Dump +++ (1584515400)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~TXT
+++ Statistics Dump +++ (1584518400)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~NSEC
+++ Statistics Dump +++ (1584676500)
++ Cache DB RRsets ++
[View: default]
18446744073709551557 !AAAA
+++ Statistics Dump +++ (1584737700)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 !CNAME
+++ Statistics Dump +++ (1584886800)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~MX
+++ Statistics Dump +++ (1585119900)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 !MX
+++ Statistics Dump +++ (1585137000)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 ~DS
+++ Statistics Dump +++ (1585282500)
++ Cache DB RRsets ++
[View: default]
18446744073709551610 !A
+++ Statistics Dump +++ (1585457400)
++ Cache DB RRsets ++
[View: default]
18446744073709551506 CNAME
```
I'd like to specifically call out `active fetches` which might get lost in the noise near the middle of those lines.
The same stats file, with history going back to unknown prior versions (August 2019), also has underflows on:
```
+++ Statistics Dump +++ (1566255596)
++ Cache DB RRsets ++
[View: default]
18446744073709547273 #A
18446744073709551608 #TXT
18446744073709551247 #AAAA
18446744073709551615 #SRV
18446744073709548950 #!AAAA
+++ Statistics Dump +++ (1566260101)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #NS
+++ Statistics Dump +++ (1566307500)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #RRSIG
+++ Statistics Dump +++ (1566355500)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #!A
+++ Statistics Dump +++ (1567977300)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 !A6
+++ Statistics Dump +++ (1568136600)
++ Name Server Statistics ++
18446744073709551466 recursing clients
+++ Statistics Dump +++ (1570120200)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #!SRV
+++ Statistics Dump +++ (1571707501)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #CNAME
+++ Statistics Dump +++ (1571768400)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #!TXT
+++ Statistics Dump +++ (1574442900)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #DS
+++ Statistics Dump +++ (1574886601)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #NXDOMAIN
+++ Statistics Dump +++ (1575558600)
++ Cache DB RRsets ++
[View: default]
18446744073709551615 #PTR
```
The most significant of the last set is probably `recursing clients`.

Milestone: August 2020 (9.11.22, 9.11.22-S1, 9.16.6, 9.17.4). Assignee: Diego dos Santos Fronza.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1718
[CVE-2020-8619] An asterisk character in an empty non-terminal can cause an assertion failure in rbtdb.c (Holger Wirtz, updated 2020-08-05)

### Summary
Sudden crash of the named process (1-10 minutes after restart)
### BIND version used
```
BIND 9.11.17 (Extended Support Version) <id:65c9496>
running on Linux x86_64 3.16.0-10-amd64 #1 SMP Debian 3.16.81-1 (2020-01-17)
built by make with '--prefix=/usr' '--mandir=/usr/share/man' '--libdir=/usr/lib/x86_64-linux-gnu' '--infodir=/usr/share/info' '--sysconfdir=/etc/bind' '--with-python=python3' '--localstatedir=/' '--enable-threads' '--enable-largefile' '--with-libtool' '--enable-shared' '--enable-static' '--with-openssl=/usr' '--with-gssapi=/usr' '--with-gnu-ld' '--enable-ipv6' '--enable-filter-aaaa'
compiled by GCC 4.9.2
compiled with OpenSSL version: OpenSSL 1.0.1t 3 May 2016
linked to OpenSSL version: OpenSSL 1.0.1t 3 May 2016
compiled with libxml2 version: 2.9.1
linked to libxml2 version: 20901
compiled with libjson-c version: 0.11.99
linked to libjson-c version: 0.11.99
compiled with zlib version: 1.2.8
linked to zlib version: 1.2.8
threads support is enabled
default paths:
named configuration: /etc/bind/named.conf
rndc configuration: /etc/bind/rndc.conf
DNSSEC root key: /etc/bind/bind.keys
nsupdate session key: //run/named/session.key
named PID file: //run/named/named.pid
named lock file: //run/named/named.lock
```
### Steps to reproduce
```
# Created bind as usual (works with <= 9.11.14):
VERSION=9.11.17
DEB_HOST_MULTIARCH=x86_64-linux-gnu  # multiarch triplet for this system
wget -O bind-${VERSION}.tar.gz https://downloads.isc.org/isc/bind9/${VERSION}/bind-${VERSION}.tar.gz
wget -O bind-${VERSION}.tar.gz.sha512.asc https://downloads.isc.org/isc/bind9/${VERSION}/bind-${VERSION}.tar.gz.sha512.asc
gpg --verify bind-${VERSION}.tar.gz.sha512.asc bind-${VERSION}.tar.gz
tar -zxf bind-${VERSION}.tar.gz
cd bind-${VERSION}
./configure --prefix=/usr \
    --mandir='${prefix}/share/man' \
    --libdir='${prefix}'/lib/${DEB_HOST_MULTIARCH} \
    --infodir='${prefix}/share/info' \
    --sysconfdir=/etc/bind \
    --with-python=python3 \
    --localstatedir=/ \
    --enable-threads \
    --enable-largefile \
    --with-libtool \
    --enable-shared \
    --enable-static \
    --with-openssl=/usr \
    --with-gssapi=/usr \
    --with-gnu-ld \
    --enable-ipv6 \
    --enable-filter-aaaa
make && make install
```
### What is the current *bug* behavior?
After a few minutes, bind crashes with the following message in general.log:
```
01-Apr-2020 11:24:11.101 general: rbtdb.c:2097: INSIST(!((void *)((node)->deadlink.prev) != (void *)(-1))) failed, back trace
01-Apr-2020 11:24:11.101 general: #0 0x43fecd in ??
01-Apr-2020 11:24:11.101 general: #1 0x7ff0f7cedcaa in ??
01-Apr-2020 11:24:11.101 general: #2 0x7ff0f8fb2da5 in ??
01-Apr-2020 11:24:11.101 general: #3 0x7ff0f8fc2d6c in ??
01-Apr-2020 11:24:11.101 general: #4 0x44e3fd in ??
01-Apr-2020 11:24:11.101 general: #5 0x4585b8 in ??
01-Apr-2020 11:24:11.101 general: #6 0x4353f6 in ??
01-Apr-2020 11:24:11.101 general: #7 0x7ff0f7d179c7 in ??
01-Apr-2020 11:24:11.101 general: #8 0x7ff0f6e98064 in ??
01-Apr-2020 11:24:11.101 general: #9 0x7ff0f686662d in ??
01-Apr-2020 11:24:11.101 general: exiting (due to assertion failure)
```
### What is the expected *correct* behavior?
No crash.
### Relevant configuration files
named.conf:
```
include "/etc/bind/named.conf.local"; // only ACLs, logging and statistic channels
include "/etc/bind/named.conf.options"; // see below
include "/etc/bind/bind.keys";
include "/etc/bind/named.conf.namedboot";
include "/etc/bind/tsig.key";
```
named.options:
```
options {
directory "/var/cache/bind";
pid-file "/var/run/named/named.pid";
auth-nxdomain no; # conform to RFC1035
listen-on-v6 { ::1; ********; };
listen-on { 127.0.0.1; *********; };
allow-query { any; };
allow-transfer { ******; };
recursion no;
version "0";
dnssec-enable yes;
dnssec-validation yes;
tcp-clients 1500;
rate-limit {
responses-per-second 50;
};
};
controls {
inet 127.0.0.1 allow { 127.0.0.1; ::1; };
};
```
### Relevant logs and/or screenshots
general.log:
```
...
01-Apr-2020 11:24:11.101 general: rbtdb.c:2097: INSIST(!((void *)((node)->deadlink.prev) != (void *)(-1))) failed, back trace
01-Apr-2020 11:24:11.101 general: #0 0x43fecd in ??
01-Apr-2020 11:24:11.101 general: #1 0x7ff0f7cedcaa in ??
01-Apr-2020 11:24:11.101 general: #2 0x7ff0f8fb2da5 in ??
01-Apr-2020 11:24:11.101 general: #3 0x7ff0f8fc2d6c in ??
01-Apr-2020 11:24:11.101 general: #4 0x44e3fd in ??
01-Apr-2020 11:24:11.101 general: #5 0x4585b8 in ??
01-Apr-2020 11:24:11.101 general: #6 0x4353f6 in ??
01-Apr-2020 11:24:11.101 general: #7 0x7ff0f7d179c7 in ??
01-Apr-2020 11:24:11.101 general: #8 0x7ff0f6e98064 in ??
01-Apr-2020 11:24:11.101 general: #9 0x7ff0f686662d in ??
01-Apr-2020 11:24:11.101 general: exiting (due to assertion failure)
```
### Possible fixes
see above...

Milestone: June 2020 (9.11.20, 9.11.20-S1, 9.16.4, 9.17.2). Assignee: Mark Andrews.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1717
rwlock contention in isc_log_wouldlog() API (performance impact) (Ondřej Surý, updated 2020-04-08)

In !3229, we introduced a rwlock contention in `isc_log_wouldlog()` which has a significant performance impact.

Milestone: April 2020 (9.11.18, 9.16.2, 9.17.1). Assignee: Ondřej Surý.

https://gitlab.isc.org/isc-projects/stork/-/issues/221
Link to BIND/Kea documentation on RTD (Vicky Risk <vicky@isc.org>, updated 2020-04-14)

As an administrator of BIND or Kea it would be convenient to have a link to the documentation for the product. I would like to click on the link and open a new browser window or tab on Read the Docs. I would be fine with seeing the LATEST version of the docs, and would be able to select another version if that is what I need.

Milestone: 0.7. Assignee: Michal Nowikowski.

https://gitlab.isc.org/isc-projects/kea/-/issues/1174
move to c++11 std::chrono library (Francis Dupont, updated 2020-07-16)

In #1005 I explain that boost::posix_time::microsec_clock::local_time() is a performance pig. The idea is to move to the C++11 std::chrono standard library:
- check whether the std::chrono library is implemented on all supported platforms
- find the first boost version which implements chrono to posix time conversion
- move the internal stats library code to chrono and verify the performance change (could win a few percent)

Milestone: kea1.7.10. Assignee: Francis Dupont.

https://gitlab.isc.org/isc-projects/stork/-/issues/219
container with stork agent and kea with hosts db is needed (Michal Nowikowski, updated 2020-04-01)

Milestone: 0.6. Assignee: Michal Nowikowski.

https://gitlab.isc.org/isc-projects/kea/-/issues/1173
handle congestion recovery in multi-threading mode (Francis Dupont, updated 2020-05-08)

Currently when the task queue is full, the server run_one method skips the receivePacket() call and returns.
This has a bad impact on mechanisms using external sockets because they are not served.
I propose to handle the congestion recovery inside the multi-threading code itself:
- disable congestion recovery at configuration time when multi-threading is enabled
- replace the task queue (aka ThreadPool) with a ring

Milestone: kea1.7.8. Assignee: Francis Dupont.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1716
Shutdown assertion failure crash in resolver.c (Witold Krecicki, updated 2021-10-05)

https://gitlab.isc.org/isc-projects/bind9/-/jobs/792484
```
#3 0x00007fa2985c8ed9 in isc_assertion_failed (file=<value optimized out>, line=<value optimized out>, type=<value optimized out>, cond=<value optimized out>) at assertions.c:46
No locals.
#4 0x00007fa2993c0350 in destroy (resp=<value optimized out>) at resolver.c:9997
i = <value optimized out>
a = <value optimized out>
#5 dns_resolver_detach (resp=<value optimized out>) at resolver.c:10541
res = 0x7fa288166e90
#6 0x00007fa2993fd047 in destroy (view=0x7fa2840055c0) at view.c:407
dns64 = 0x7fa283efc1a8
dlzdb = 0x7fa283efc050
#7 0x00007fa2993fdcda in dns_view_weakdetach (viewp=<value optimized out>) at view.c:724
view = <value optimized out>
#8 0x00007fa2993febd3 in adb_shutdown (task=<value optimized out>, event=0x0) at view.c:758
view = 0x0
#9 0x00007fa2985eff29 in dispatch (queuep=<value optimized out>) at task.c:1152
dispatch_count = 0
done = false
finished = false
requeue = false
event = 0x7fa2840056c8
task = 0x7fa2881732d8
```

https://gitlab.isc.org/isc-projects/bind9/-/issues/1715
kasp system test timing issue with view zones (Matthijs Mekking <matthijs@isc.org>, updated 2020-04-09)

Most zones are checked for completed signing before the kasp-specific checks are executed. There is not yet such a check for zones in views. This may result in intermittent failures.

Milestone: April 2020 (9.11.18, 9.16.2, 9.17.1)

https://gitlab.isc.org/isc-projects/kea/-/issues/1172
applying lease timer values to a class (Peter Davies, updated 2021-08-20)

Customer request: [#RT16196](https://support.isc.org/Ticket/Display.html?id=16196)
The ability to assign lease timer values to Client Class definitions.

Milestone: kea1.9.11. Assignee: Tomek Mrugalski.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1714
'provide-ixfr no;' should still send up-to-date responses. (Mark Andrews, updated 2020-06-18)

Milestone: June 2020 (9.11.20, 9.11.20-S1, 9.16.4, 9.17.2). Assignee: Ondřej Surý.

https://gitlab.isc.org/isc-projects/kea/-/issues/1171
thread sanitizer reporting unit test in perfdhcp (Francis Dupont, updated 2020-05-06)

Looking at the report and the code, FakeScenPerfSocket should protect access to planned_responses_ between the receiveX and send methods.

Milestone: kea1.7.8. Assignee: Francis Dupont.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1713
When BIND is built with --with-tuning=large, we're setting RCVBUFSIZE far too big for most production servers (Cathy Almond, updated 2021-05-17)

See Support ticket [#16171](https://support.isc.org/Ticket/Display.html?id=16171) and also KB article [--with-tuning=large - about using this build-time option](https://kb.isc.org/docs/aa-01314)
From BIND 9.16.0, we changed the default so that when building BIND, you get --with-tuning=large automatically.
Per the KB article, the difference between the tunings is this:
> 3. RCVBUFSIZE changed from 32K to 16M
>
> Increasing RCVBUFSIZE (the receive buffer size) will reduce dropped packets,
> but it may also hurt socket performance on some platforms; the Linux kernel
> allocates the receive buffer space when creating a socket, and an increase
> from 32k to 16m allocated per socket is potentially significant.
In lib/isc/unix/socket.c on master, it looks like this:
```c
#ifdef TUNE_LARGE
#ifdef sun
#define RCVBUFSIZE (1 * 1024 * 1024)
#define SNDBUFSIZE (1 * 1024 * 1024)
#else /* ifdef sun */
#define RCVBUFSIZE (16 * 1024 * 1024)
#define SNDBUFSIZE (16 * 1024 * 1024)
#endif /* ifdef sun */
#else /* ifdef TUNE_LARGE */
#define RCVBUFSIZE (32 * 1024)
#define SNDBUFSIZE (32 * 1024)
#endif /* TUNE_LARGE */
```
(Although we did it slightly differently in the old socket code, the size increase remains the same).
Let's do some sums on that. Assuming an average client query is 70 bytes, then without TUNE_LARGE our socket receive buffer of 32 KB can hold just under 470 client queries before it's full.
With TUNE_LARGE, our socket receive buffer of 16 MB can hold just under 240K queries.
More sums - let's suppose a server can handle a maximum of 50K qps. If queries come in faster than that, the backlog grows; once the buffer is full (and named is reading and processing queries first in, first out), each query has already waited in the buffer for just under 5s before being handled.
That is hopeless - most clients give up and stop waiting for an answer in under 2s. If an overloaded server capable of 50Kqps is to have any chance at all of giving some clients answers that they're still interested in, then it needs a much smaller RCVBUFSIZE. Something that doesn't hold any more than 1s (or even less?) worth of client queries.
----
Now consider another scenario. A resolver is capable of handling 50Kqps, and has been provisioned so that its normal load is half of that - 25Kqps.
It gets distracted (perhaps it was processing a large inbound IXFR zone update) and gets slower and a backlog builds up.
The clients start to time-out and re-send their original queries - those go into the backlog queue too. Maybe they even stop listening for a reply to their original queries. Now any query response from the resolver is going to be ignored.
Observationally, when this happens, the client query rate ramps up - with dual-stack clients that also retry over IPv6, we can see up to 6x or 7x the query rate than normal. This is way higher than the 50Kqps that is the peak rate the server can handle.
The server recovers and starts handling the backlog. All the queries it is responding to are several seconds stale, so query responses fall on deaf ears (closed sockets). The clients continue to send queries at 6-7x the normal load, so the backlog can never be cleared.
The server has been rendered effectively useless because it can never recover.
This has already been demonstrated in ticket [#16171](https://support.isc.org/Ticket/Display.html?id=16171), and is the reason why it's important not to run named with socket buffers that allow a backlog to build up of longer than 0-1s worth of queries, assuming that the server is running at its maximum QPS.
I also suspect that it might be the reason why we've not been successful in replicating the total server hang-up in ticket [#14339](https://support.isc.org/Ticket/Display.html?id=14339). Our test tools don't increase the client pressure with re-sends and retries when the server under test doesn't respond promptly.
Conclusion:
a) The default receive buffer size with --with-tuning=large is potentially too big for most production servers anyway, and we'd be better served by being more conservative
b) We need a knob, so that administrators (who know what they're doing, and why) can tune the receive buffer sizes appropriate (potentially per listening socket for some obscure set-ups where the admins are technically sophisticated and capable?)
c) We need to provide some tuning advice to BIND consumers - in the ARM and/or in the KB.

Milestone: May 2020 (9.11.19, 9.11.19-S1, 9.14.12, 9.16.3). Assignee: Witold Krecicki.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1712
Serve-stale feature is not operationally useful - some suggestions for making it better (Cathy Almond, updated 2022-01-19)

As described in Support ticket [#16171](https://support.isc.org/Ticket/Display.html?id=16171):
```
The problem with serve-stale was (and still is after some testing on 9.16.1),
that every client that asks for e.g. "isc.org A" will all have to wait for 10
seconds before they get the stale answer. There seem to be no table of stale
resolvers so each time a request comes in, BIND seems to try the resolver
again to find out if it answers or not.
```
This really is not helpful - most clients will have given up and gone away and will never get a usable answer.
IF the name is one that is popular, then because of 'clients-per-query' and the fact that we attach any future waiting clients for the same query to the already-existing fetch process, the late arrivals stand a fighting chance of getting a response from stale cache before they give up - but the majority won't.
See also #1688 - we haven't documented very thoroughly how this works anyway, and we certainly have not documented how it interacts with fetch-limits and other resolver-protecting features.
Here's a sample config that was being used for testing:
```
stale-answer-enable yes;
stale-answer-ttl 600;
max-stale-ttl 1w;
```
There is nothing there that provides for a configurable period of 'staleness' so that after the first time the failure to refresh has taken place, a server can immediately serve this stale content to any clients who come along later instead of repeating the refresh attempt (and likely failing again).
I think the issue is that although we do have some control over how stale an answer can be before we stop serving it, we haven't thought sufficiently about how long clients will be prepared to wait for a query response if we have to attempt to refresh and then fail for each client (or set of clients) when queried.
Note: I **do not** think we should immediately serve stale answers whenever there's cache content available that has recently expired - this is not what we're trying to achieve. The idea of serve-stale as the converse of pre-fetch ('post-fetch'?) is somehow terribly tempting because it feels like it would be faster and a better experience for the clients, plus there's this nice symmetry with pre-fetch logic. But I think it's wrong - and would absolutely break how we handle TTL=0 answers today. Authoritative server operators **expect** resolvers to come back to them as soon as their cached content expired. We should not skip this step.
But what would be more helpful (to both clients and to servers) when there are non-responding authoritative servers, would be a way to flag a stale answer with the timestamp of when the last failing refresh attempt occurred, and if a client queries the same name again within a suitable time period (configurable? Something like 10s feels like a good default here), then the stale answer gets used right away.
We're preserving resolver resources by doing this (and anyway, if we couldn't resolve this name 1s ago, why are we trying again immediately if we've got something usable-but-stale in cache we could use instead?)

Milestone: August 2020 (9.11.22, 9.11.22-S1, 9.16.6, 9.17.4). Assignee: Matthijs Mekking.

https://gitlab.isc.org/isc-projects/bind9/-/issues/1711
Assertion failure crash reported in 9.14.10 (Michael McNally, updated 2021-10-05)

Received via email to security-officer@isc.org:
>>>
Hi,
We recently started to get constant crashes (core dump) of Bind (v 9.14.10) in our production environment. The crash seems to happen once every 3-4 days. We were wondering if we could get some more information as to what exactly is causing bind to crash in this manner. From the ISC website, it looks like these kinds of assertion failures usually happen when some security constraint is violated.
Please find attached the complete stack trace from the core dump for reference.
Please let us know if any more information is required from our side. Any insights you might have on this would be greatly appreciated.
>>>