Repeated crashes from BIND 9.16.40-S1 and 9.16.42-S1, consistently from either db.c:1263 or db.c:138
Summary
This organisation runs 20 or so servers, all on BIND 9.16.40-S1 or 9.16.42-S1. The servers running 9.16.42-S1 were upgraded recently; the servers running 9.16.40-S1 have been on that version for almost as long as it has been available. The crashes started around June 29/30, so something has changed that is triggering them.
Here is a sample of the crashes as reported in their logs:
mpcold1/named.log:2023-07-02T11:43:09+00:00 mpcold1 named[28600]: 02-Jul-2023 11:43:09.241 general: critical: db.c:1263: REQUIRE((__builtin_expect(!!((db) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(db))->magic == ((('D') << 24 | ('N') << 16 | ('S') << 8 | ('D')))), 1))) failed
mpcold1/named.log:2023-07-02T11:43:09+00:00 mpcold1 named[28600]: 02-Jul-2023 11:43:09.241 general: critical: exiting (due to assertion failure)
mpcold2/named.log:2023-07-05T14:50:13+00:00 mpcold2 named[2841]: 05-Jul-2023 14:50:13.161 general: critical: db.c:138: REQUIRE((__builtin_expect(!!((source) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(source))->magic == ((('D') << 24 | ('N') << 16 | ('S') << 8 | ('D')))), 1))) failed
mpcold2/named.log:2023-07-05T14:50:13+00:00 mpcold2 named[2841]: 05-Jul-2023 14:50:13.161 general: critical: exiting (due to assertion failure)
mpcold3/named.log:2023-07-02T09:18:11+00:00 mpcold3 named[60441]: 02-Jul-2023 09:18:11.174 general: critical: db.c:1263: REQUIRE((__builtin_expect(!!((db) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(db))->magic == ((('D') << 24 | ('N') << 16 | ('S') << 8 | ('D')))), 1))) failed
mpcold3/named.log:2023-07-02T09:18:11+00:00 mpcold3 named[60441]: 02-Jul-2023 09:18:11.174 general: critical: exiting (due to assertion failure)
mpcold3/named.log:2023-07-04T19:41:07+00:00 mpcold3 named[28125]: 04-Jul-2023 19:41:07.874 general: critical: db.c:1263: REQUIRE((__builtin_expect(!!((db) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(db))->magic == ((('D') << 24 | ('N') << 16 | ('S') << 8 | ('D')))), 1))) failed
mpcold3/named.log:2023-07-04T19:41:07+00:00 mpcold3 named[28125]: 04-Jul-2023 19:41:07.874 general: critical: exiting (due to assertion failure)
mpcold4/named.log:2023-07-02T09:54:16+00:00 mpcold4 named[81687]: 02-Jul-2023 09:54:16.146 general: critical: db.c:1263: REQUIRE((__builtin_expect(!!((db) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(db))->magic == ((('D') << 24 | ('N') << 16 | ('S') << 8 | ('D')))), 1))) failed
mpcold4/named.log:2023-07-02T09:54:16+00:00 mpcold4 named[81687]: 02-Jul-2023 09:54:16.146 general: critical: exiting (due to assertion failure)
mpcold4/named.log:2023-07-03T06:39:10+00:00 mpcold4 named[14845]: 03-Jul-2023 06:39:10.816 general: critical: db.c:138: REQUIRE((__builtin_expect(!!((source) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(source))->magic == ((('D') << 24 | ('N') << 16 | ('S') << 8 | ('D')))), 1))) failed
mpcold4/named.log:2023-07-03T06:39:10+00:00 mpcold4 named[14845]: 03-Jul-2023 06:39:10.816 general: critical: exiting (due to assertion failure)
mpcold4/named.log:2023-07-04T11:12:28+00:00 mpcold4 named[71958]: 04-Jul-2023 11:12:28.531 general: critical: db.c:1263: REQUIRE((__builtin_expect(!!((db) != ((void *)0)), 1) && __builtin_expect(!!(((const isc__magic_t *)(db))->magic == ((('D') << 24 | ('N') << 16 | ('S') << 8 | ('D')))), 1))) failed
BIND version used
9.16.40-S1 and 9.16.42-S1 (I don't have named -V output to hand, but can get it). Self-built; RPMs are created and installed on their suite of resolvers. The binaries are stripped - we're working on this, as we have a lot of core dumps ...
Steps to reproduce
No reproduction, just wait...
What is the current bug behavior?
BIND crashes.
What is the expected correct behavior?
BIND doesn't crash.
Relevant configuration files
Significantly, what we have here is resolvers acting as proxies in front of Intranet auth servers. The queries they receive have first been fielded by load balancers, which flip the RD bit and then send them on to one of two views (one is auth-only, the other accepts recursion); a minimal sketch of this split follows the options excerpt below. We're interested in the recursive side of things. The fact that the query header RD bit has been flipped is useful background, but shouldn't affect the scenario of a resolver repeatedly crashing. The important components of the options are:
options {
	directory "/var/named";
	//listen-on port 53 { _obscured_; _obscured_; _obscured_; _obscured_; };
	listen-on port 53 { _obscured_; _obscured_; _obscured_; _obscured_; _obscured_; _obscured_; };
	query-source address _obscured_;
	query-source-v6 address _obscured_;
	dnssec-enable yes;
	dnssec-validation no;
	//edns-udp-size 1024;
	version "";
	stale-cache-enable no;
	allow-update { none; };
	allow-transfer { none; };
	transfer-source _obscured_;
	transfers-per-ns 1;
	transfers-in 10;
	recursive-clients 5000;
	tcp-idle-timeout 30;
	tcp-clients 750;
	notify no;
	send-cookie no;
	require-server-cookie no;
	ecs-forward { any; };
	ecs-zones { obscured; };
	fetches-per-zone 35 drop;
	fetch-quota-params 100 .1 .3 .7;
	fetches-per-server 50 drop;
	rate-limit {
		slip 2;                   // Every other response truncated
		window 15;                // Seconds to bucket
		responses-per-second 100; // # of good responses per prefix-length/sec
		referrals-per-second 80;  // referral responses
		nodata-per-second 25;     // nodata responses
		nxdomains-per-second 20;  // nxdomain responses
		errors-per-second 25;     // error responses
		all-per-second 500;       // When we drop all
		log-only no;              // Debugging mode
		qps-scale 5000;           // x / 1000 * per-second = new drop limit
		//exempt-clients { 127.0.0.1; _obscured_; _obscured_ !};
		ipv4-prefix-length 24;    // Define the IPv4 block size
		ipv6-prefix-length 56;    // Define the IPv6 block size
		max-table-size 1200000;   // 40 bytes * this number = max memory
		min-table-size 500;       // pre-allocate to speed startup
	};
};
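For context, here is a minimal sketch of how the RD-bit-driven view split described above is typically expressed in named.conf. This is an assumption on my part - the real view stanzas are not included in this report, the view names are guessed from the "view Rec-" fragments in the logs, and the use of match-recursive-only is simply the usual way of selecting on the RD bit:

// Hypothetical sketch only; not this deployment's actual view definitions.
// Views are matched in order, so the recursive-only view is listed first.
view "Rec-instance" {
	match-recursive-only yes;   // queries arriving with RD=1 (set by the load balancers)
	recursion yes;
	// resolver/cache options ...
};

view "Auth-instance" {
	// everything else, i.e. queries arriving with RD=0
	recursion no;
	// authoritative zones ...
};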
Relevant logs and/or screenshots
See crashes logged above.
Significantly, in the one I first looked at, just ahead of the crash was this:
Jul 4 11:22:09 mppacd2 named[2884]: client @0x7f7eb03864e0 obscured#44786 (obscured): view Rec- instance: recursive-clients soft limit exceeded (4901/4900/5000), aborting oldest query
Jul 4 11:22:09 mppacd2 named[2884]: client @0x7f7e7c720f20 obscured-different#62182 (obscured-different): view Rec- instance: no more recursive clients (5000/4900/5000): quota reached
However, exploring further, this is not consistently associated with the crashes: the quota is being reached without any crash, and the crashes occur without "quota reached" being logged.
Note however that both fetch-limits and RRL are being triggered.
More data is available in Support ticket #22339.
Possible fixes
I checked the crash code locations; here are my musings.
db.c:1263
isc_result_t
dns_db_getservestalerefresh(dns_db_t *db, uint32_t *interval) {
	REQUIRE(DNS_DB_VALID(db));
	REQUIRE((db->attributes & DNS_DBATTR_CACHE) != 0);

	if (db->methods->getservestalerefresh != NULL) {
		return ((db->methods->getservestalerefresh)(db, interval));
	}
	return (ISC_R_NOTIMPLEMENTED);
}
db.c:138
void
dns_db_attach(dns_db_t *source, dns_db_t **targetp) {
	/*
	 * Attach *targetp to source.
	 */
	REQUIRE(DNS_DB_VALID(source));
	REQUIRE(targetp != NULL && *targetp == NULL);

	(source->methods->attach)(source, targetp);

	ENSURE(*targetp == source);
}
The crash is for the exact same reason in each case - the same check fails, on db at db.c:1263 and on source at db.c:138:
REQUIRE(DNS_DB_VALID(source));
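For reference, DNS_DB_VALID() is the standard ISC magic-number pattern: the first member of the object is a magic word, set on creation and cleared on destruction. The toy program below is my own standalone paraphrase of that pattern (not BIND code; the DEMO_* names are made up) and shows why a non-NULL pointer to a torn-down object still fails the check:

#include <stdio.h>
#include <stdlib.h>

/* Same construction as ISC_MAGIC('D', 'N', 'S', 'D') in <isc/magic.h>. */
#define DEMO_MAGIC (('D' << 24) | ('N' << 16) | ('S' << 8) | 'D')

typedef struct {
	unsigned int magic; /* first member, like isc__magic_t */
	/* ... rest of the object ... */
} demo_db_t;

/* Equivalent of DNS_DB_VALID(): non-NULL and magic word intact. */
#define DEMO_DB_VALID(p) ((p) != NULL && (p)->magic == DEMO_MAGIC)

int
main(void) {
	demo_db_t *db = malloc(sizeof(*db));
	db->magic = DEMO_MAGIC;
	printf("freshly created: valid=%d\n", DEMO_DB_VALID(db)); /* 1 */

	/*
	 * On teardown the magic is cleared before the memory is released,
	 * so a stale pointer fails the check even before the allocator
	 * reuses the memory.  A REQUIRE() in BIND would abort here, which
	 * is exactly what the db.c:138 / db.c:1263 assertions above show.
	 */
	db->magic = 0;
	printf("after teardown:  valid=%d\n", DEMO_DB_VALID(db)); /* 0 */
	free(db);
	return 0;
}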
But why do we not have a valid DB pointer? dns_db_attach() is used all over the place - too many call sites for me to do anything useful with, without stack traces showing the circumstances around the call. But there are only a handful of locations where we call dns_db_getservestalerefresh(), so I took a meander...
The likeliest candidate is in query.c, from query_lookup(). Note that we're getting the pointer to the db from the client. (Aside: we're also fetching the length of time for which stale answers are returned directly from cache before another refresh is attempted (in case a previous attempt failed), without first checking whether we're going to use stale content at all, or even whether the stale cache is enabled. So the call is unnecessary at this point, except that we then decide what to do with the value. But this lookup from the cache DB should work anyway.)
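For reference, the value dns_db_getservestalerefresh() fetches is the one set via the serve-stale options. A hedged sketch of the relevant knobs is below, with illustrative values of my own - not taken from this deployment, which notably has stale-cache-enable no:

options {
	// Serve-stale settings related to the value fetched by
	// dns_db_getservestalerefresh(); values here are illustrative only.
	stale-cache-enable yes;   // retain stale RRsets in cache
	stale-answer-enable yes;  // permit returning them to clients
	stale-refresh-time 30;    // window (seconds) during which stale data is
	                          // served straight from cache after a failed refresh
};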
I'm bothered that the pointer to the cachedb is a dud. Might this be a race between looking in the cache to construct a query response (on a timeout) and the same client being dropped simultaneously because it hit the recursive-clients quota?
	(void)dns_db_getservestalerefresh(qctx->client->view->cachedb,
					  &stale_refresh);
That's the call - the invalid cachedb pointer has come from the client structure, via the view. Why is it 'bad'? Where did we get the pointer to the client struct from, and if this is a race between 'do something with this client' and 'this client is being dropped because of some quota', why is only this operation crashing?
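To make the race hypothesis concrete, here is a standalone sketch (my own illustration, not BIND code - all demo_* names are made up) of the difference between dereferencing a shared db pointer directly and taking a counted reference first. If teardown can run concurrently with the lookup, only the second pattern guarantees the validity check still holds when the lookup executes:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_MAGIC (('D' << 24) | ('N' << 16) | ('S' << 8) | 'D')

typedef struct {
	unsigned int magic;
	unsigned int references;
} demo_db_t;

typedef struct {
	demo_db_t *cachedb; /* like view->cachedb */
} demo_view_t;

static demo_db_t *
demo_db_new(void) {
	demo_db_t *db = calloc(1, sizeof(*db));
	db->magic = DEMO_MAGIC;
	db->references = 1;
	return db;
}

/* Like dns_db_attach(): take our own counted reference before using it. */
static void
demo_db_attach(demo_db_t *source, demo_db_t **targetp) {
	assert(source != NULL && source->magic == DEMO_MAGIC);
	source->references++;
	*targetp = source;
}

static void
demo_db_detach(demo_db_t **dbp) {
	demo_db_t *db = *dbp;
	*dbp = NULL;
	if (--db->references == 0) {
		db->magic = 0; /* invalidate before freeing */
		free(db);
	}
}

int
main(void) {
	demo_view_t view = { .cachedb = demo_db_new() };

	/*
	 * Safe pattern: the lookup holds its own reference, so even if the
	 * view drops view->cachedb "concurrently", our copy stays valid.
	 */
	demo_db_t *mine = NULL;
	demo_db_attach(view.cachedb, &mine);

	demo_db_detach(&view.cachedb); /* "teardown" wins the race */

	assert(mine->magic == DEMO_MAGIC); /* still valid: we hold a ref */
	printf("lookup via counted reference: still valid\n");
	demo_db_detach(&mine);

	/*
	 * Unsafe pattern (what a lost race would look like): dereferencing
	 * view->cachedb directly after teardown has already detached it
	 * would hand NULL - or a freed object - to the REQUIRE() check,
	 * which is the failure mode seen at db.c:138 / db.c:1263.
	 */
	return 0;
}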
Looking at the other locations, I doubt that the crash has come from server.c - that's just a call with a direct pointer to the cache DB, made in order to report the value in response to an rndc query.
Meanwhile, I see in cache.c that the call to dns_db_getservestalerefresh() has a wrapper: dns_cache_getservestalerefresh().
But I think this is also not a useful lead. I see it used in server.c when checking its value to determine whether or not it's OK for two views to share the same cache. And note what dns_cache_getservestalerefresh() does before it calls dns_db_getservestalerefresh(): it calls REQUIRE(VALID_CACHE(cache)). I do not think we passed this way on the road to the crash...