Outstanding BIND statistics issues - umbrella ticket
Per https://pad.isc.org/p/bind9-call-2020-08-05 line 249 action item on Cathy.
Summary: I looked at open and closed GL issues and BUGS, and also went a little way into Support (but searching on the various keywords was not helpful because we use the stats a lot - mostly we've opened issue/feature tickets in BUGS/GL as needed). I also trawled the BIND 9.11, 9.11-S, 9.16 and 9.17 CHANGES for inspiration.
Here's what I think is outstanding:
3 Dec 2021: Still a problem with underflow of counter of Recursive Clients in BIND 9.16 #3044
- NOT a customer issue, but one affecting two (helpful) BIND users. This is despite the fixes (allegedly) committed to BIND 9.11.10-S1, 9.15.3 and 9.14.5 with change 5277 (#1074 (closed) and #602 (closed)).
- Users were reporting this issue against 9.14.6, 9.14.7 and 9.14.8. One of the users said that updating to 9.14.11 appears to have fixed it; another said that updating to 9.16.3 had the same outcome.
- But nobody knows what did it, although Matthijs at one point suspected not the counting itself, but something in the processing 'passing that way too often' and decrementing twice, with the bug being in the rbtdb management.
- The reporter requested that the issue be closed. Subsequently Ondrej said that some of the stats are still misbehaving, and one of the reporters came back saying that 9.16.4 still has issues.
- I've reopened it. The 9.16.4 reporter has been offered a couple of diagnostic 'patches'
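To make the 'decrementing twice' theory concrete, here is a minimal sketch of how an unsigned 64-bit counter wraps when the same client is released twice. It is an illustration only, not BIND's code: the `Counter` class, its method names and the guarded variant are all hypothetical.

```python
MASK64 = (1 << 64) - 1  # emulate an unsigned 64-bit counter


class Counter:
    """Hypothetical stand-in for a statistics counter (not BIND's API)."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value = (self.value + 1) & MASK64

    def decrement(self):
        # Unguarded: going below zero wraps to a huge positive value.
        self.value = (self.value - 1) & MASK64

    def decrement_guarded(self):
        # Refuse to go below zero; return False so the caller can log it.
        if self.value == 0:
            return False
        self.value -= 1
        return True


clients = Counter()
clients.increment()   # one recursive client arrives
clients.decrement()   # ...and is released
clients.decrement()   # released a second time by mistake
print(clients.value)  # 18446744073709551615
```

A guarded decrement hides the symptom rather than the double release, but the False return gives a natural place to log where the extra decrement came from.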
I think #1274 (closed) might turn out to be related to #1087 (closed):
5374. [bug] Statistics counters tracking recursive clients and
9.16.2 & 9.17.1 active connections could underflow. [GL #1087 (closed)]
But note again - this fix has NOT been back-ported to 9.11. Why not? Was the underlying cause a feature that was added later? Is it a feature that was back-ported to 9.11-S?
#1475 (closed) This is fixed in 9.16.6 but hasn't been backported (as it should have been IMHO) to 9.11-S:
5461. [bug] The STALE rdataset header attribute was updated while
the write lock was not being held, leading to incorrect
statistics. The header attributes are now converted to
use atomic operations. [GL #1475]
(9.16.6 - not back-ported!)
#1934 (closed) This is a request to backport #1719 (closed) to 9.11 - although it's not totally clear whether or not it's needed. There seems to be some relationship between #1719 (closed) and #1067 (closed) that I wasn't able to disentangle. History of those two fixes is:
5466. [bug] Addressed an error in recursive clients stats reporting.
9.16.6 [GL #1719]
5249. [bug] Fix a possible underflow in recursion clients
9.11.9 statistics when hitting recursive clients
9.15.2 soft quota. [GL #1067]
We have one active customer frustrated by these stats (on 9.16.3 - https://support.isc.org/Ticket/Display.html?id=16711), but the general principle is that we want the stats to work on 9.11 too please, because Support uses them when analyzing customer issues and 9.11 (and 9.11-S) have at least another year of support life.
#1793 (closed) This has both been reported on bind-users and been seen in a customer ticket (#16297) in which Support was using the stats to help diagnose the issues. Amusingly, the statistic was counting something else entirely, which turned out to be useful to count (failures to get anything back from the global forwarders), so perhaps there is something else here we need to count anyway.
It should be trivial to fix, yet nobody has done so.
New feature request:
#1871 (closed) We have a request, from a resolver operator (Customer), to be able to issue an rndc command to dump a file of statistics in some format that can be imported and processed by Prometheus.
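As a rough idea of what such a dump could look like, here is a minimal sketch that renders a nested dictionary of counters in the Prometheus text exposition format. Everything in it is an assumption for illustration: the `to_prometheus` helper, the `bind` metric prefix, the `counter` label and the sample values are made up, and this is not an existing rndc or named feature.

```python
import re


def to_prometheus(stats, prefix="bind"):
    """Render {group: {counter_name: value}} as Prometheus text format."""
    lines = []
    for group in sorted(stats):
        # Metric names may only contain [a-zA-Z0-9_:]; replace anything else.
        metric = re.sub(r"[^a-zA-Z0-9_]", "_", f"{prefix}_{group}")
        # "untyped" avoids claiming counter vs gauge semantics per value.
        lines.append(f"# TYPE {metric} untyped")
        for name in sorted(stats[group]):
            # Escape backslashes and double quotes inside the label value.
            label = name.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{metric}{{counter="{label}"}} {stats[group][name]}')
    return "\n".join(lines) + "\n"


# Made-up sample data; group and counter names are illustrative only.
sample = {"nsstats": {"QrySuccess": 120, "QryNXDOMAIN": 7}}
print(to_prometheus(sample), end="")
```

The sample prints one `# TYPE` line followed by one line per counter, which is the shape a Prometheus textfile-style importer expects.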
Feature request:
#1287 additional BIND stats/counters - DLZ invocations (Customer)
Feature request:
#1286 additional BIND stats/counters - NXDOMAIN redirection (Customer)
This was opened by Michał and seems a worthwhile thing to tackle:
#1621 (closed) "statistics" system test is prone to races and does not preserve forensic data
Now on to some of the real oldies that (to my knowledge) have never been tackled:
https://bugs.isc.org/Ticket/Display.html?id=41794 BIND cache consumption stats anomaly/improvement suggestion (bconry)
https://bugs.isc.org/Ticket/Display.html?id=46108 JSON and XML stats channel data misrepresents shared caches. Bit of an anomaly - the JSON version doesn't give any clue that the cache is shared; XML does. named.stats gets it right. (not customer-reported)
https://bugs.isc.org/Ticket/Display.html?id=46112 stats channel (named.stats/XML/JSON) deficiencies in handling shared zones. This is primarily only an issue when 'zone-statistics full' is in use. (not customer-reported)
(THIS one looks like it ought to be quite significant?) https://bugs.isc.org/Ticket/Display.html?id=42679 Another memory context 'total' accounting error? (bconry)
- CHANGE 4046 does not fix this
- This appears to be a design-decision fault?
- Was uncovered investigating a Customer ticket
https://bugs.isc.org/Ticket/Display.html?id=44529 Document BIND stats counters much more comprehensively
And finally, just so that I don't lose it - here's my trawl of what was fixed and when - it may not be TOTALLY complete, but it's an interesting history that might turn out to be useful somewhere along the line:
5461. [bug] The STALE rdataset header attribute was updated while
the write lock was not being held, leading to incorrect
statistics. The header attributes are now converted to
use atomic operations. [GL #1475]
(9.16.6 - not back-ported!)
Should go to 9.11-S?
5466. [bug] Addressed an error in recursive clients stats reporting.
9.16.6 [GL #1719]
(not back-ported - related to -> https://gitlab.isc.org/isc-projects/bind9/-/issues/1934 ?)
5407. [func] Zone timers are now exported via statistics channel.
9.16.4 Thanks to Paul Frieden, Verizon Media. [GL #1232]
5374. [bug] Statistics counters tracking recursive clients and
9.16.2 & 9.17.1 active connections could underflow. [GL #1087]
(not backported?)
5373. [bug] Collecting statistics for DNSSEC signing operations
9.16.2 (change 5254) caused an array of significant size (over
100 kB) to be allocated for each configured zone. Each
of these arrays is tracking all possible key IDs; this
could trigger an out-of-memory condition on servers with
a high enough number of zones configured. Fixed by
tracking up to four keys per zone and rotating counters
when keys are replaced. This fixes the immediate problem
of high memory usage, but should be improved in a future
release by growing or shrinking the number of keys to
track upon key rollover events. [GL #1179]
5343. [func] Add statistics counters to the netmgr. [GL #1311]
9.15.8
5327. [func] Added a statistics counter to track queries
9.11.14 dropped because the recursive-clients quota was
exceeded. [GL #1399]
5314. [func] Added a new statistics variable "tcp-highwater"
9.11.13 that reports the maximum number of simultaneous TCP
clients BIND has handled while running. [GL #1206]
5310. [bug] TCP failures were affecting EDNS statistics. [GL #1059]
9.11.13
5277. [bug] Cache DB statistics could underflow when serve-stale
was in use, because of a bug in counter maintenance
when RRsets become stale.
Functions for dumping statistics have been updated
to dump active, stale, and ancient statistic
counters. Ancient RRset counters are prefixed
with '~'; stale RRset counters are still prefixed
with '#'. [GL #602]
9.11.10-S1, 9.15.3, 9.14.5
(serve-stale from 9.12, and 9.11.4-S1 stats from 9.16)
5254. [func] Collect metrics to report to the statistics-channel
9.15.2 DNSSEC signing operations (dnssec-sign) and refresh
operations (dnssec-refresh) per zone and per keytag.
[GL #513]
5249. [bug] Fix a possible underflow in recursion clients
9.11.9 statistics when hitting recursive clients
9.15.2 soft quota. [GL #1067]
(Some relationship with GL #1719?)
4723. [bug] Statistics counter DNSTAPdropped was misidentified
9.11.3 as DNSSECdropped. [RT #46002]
4715. [bug] TreeMemMax was mis-identified as a second HeapMemMax
9.11.3 in the Json cache statistics. [RT #45980]
4600. [bug] Adjust RPZ trigger counts only when the entry
9.11.2 being deleted exists. [RT #43386]
4584. [bug] A number of memory usage statistics were not properly
9.11.1 reported when they exceeded 4G. [RT #44750]
4296. [bug] TCP packet sizes were calculated incorrectly in the
9.11.0 stats channel; they could be counted in the wrong
histogram bucket. [RT #40587]
4248. [performance] Add an isc_atomic_storeq() function, use it in
9.11.0 stats counters to improve performance.
[RT #39972] [RT #39979]
4243. [func] Improved stats reporting from Timothe Litt. [RT #38941]
9.11.0
4220. [doc] Improve documentation for zone-statistics.
9.11.0 [RT #36955]
4156. [func] Added statistics counters to track the sizes
9.11.0 of incoming queries and outgoing responses in
histogram buckets, as specified in RSSAC002.
[RT #39049]
4144. [func] Add statistics counters for nxdomain redirections.
9.11.0 [RT #39790]
4136. [bug] Stale statistics counters with the leading
9.11.0 '#' prefix (such as #NXDOMAIN) were not being
updated correctly. This has been fixed. [RT #39141]
4084. [bug] Fix a possible race in updating stats counters.
9.11.0 [RT #38826]
4046. [bug] Accounting of "total use" in memory context
9.11.0 statistics was not correct. [RT #38370]
3938. [func] Added quotas to be used in recursive resolvers
9.11.0 that are under high query load for names in zones
whose authoritative servers are nonresponsive or
are experiencing a denial of service attack.
- "fetches-per-server" limits the number of
simultaneous queries that can be sent to any
single authoritative server. The configured
value is a starting point; it is automatically
adjusted downward if the server is partially or
completely non-responsive. The algorithm used to
adjust the quota can be configured via the
"fetch-quota-params" option.
- "fetches-per-zone" limits the number of
simultaneous queries that can be sent for names
within a single domain. (Note: Unlike
"fetches-per-server", this value is not
self-tuning.)
- New stats counters have been added to count
queries spilled due to these quotas.
See the ARM for details of these options. [RT #37125]
3790. [bug] Handle broken nameservers that send BADVERS in
9.11.0 response to unknown EDNS options. Maintain
statistics on BADVERS responses.
3755. [func] Add stats counters for known EDNS options + others.
9.10.0 [RT #35447]
3739. [func] Added per-zone stats counters to track TCP and
9.10.0 UDP queries. [RT #35375]
3622. [tuning] Eliminate an unnecessary lock when incrementing
9.10.0 cache statistics. [RT #34339]
3554. [bug] RRL failed to correctly rate-limit upward
9.10.0 referrals and failed to count dropped error
responses in the statistics. [RT #33225]
3520. [bug] 'mctx' was not being referenced counted in some places
9.10.0 where it should have been. [RT #32794]
3472. [bug] The active-connections counter in the socket
9.10.0 statistics could underflow. [RT #31747]
3392. [func] Keep statistics on REFUSED responses. [RT #31412]
9.10.0
3336. [func] Maintain statistics for RRsets tagged as "stale".
9.10.0 [RT #29514]
3326. [func] Added task list statistics: task model, worker
9.10.0 threads, quantum, tasks running, tasks ready.
[RT #27678]
3325. [func] Report cache statistics: memory use, number of
9.10.0 nodes, number of hash buckets, hit and miss counts.
[RT #27056]
3323. [func] Report the number of buckets the resolver is using.
9.10.0 [RT #27020]
3322. [func] Monitor the number of active TCP and UDP dispatches.
9.10.0 [RT #27055]
3321. [func] Monitor the number of recursive fetches and the
9.10.0 number of open sockets, and report these values in
the statistics channel. [RT #27054]
3320. [func] Added support for monitoring of recursing client
9.10.0 count. [RT #27009]
3319. [func] Added support for monitoring of ADB entry count and
9.10.0 hash size. [RT #27057]
2580. [bug] UpdateRej statistics counter could be incremented twice
9.7.0 for one rejection. [RT #19476]
2577. [doc] Clarified some statistics counters. [RT #19454]
9.7.0
2541. [bug] Conditionally update dispatch manager statistics.
9.7.0 [RT #19247]
2537. [func] Added more statistics counters including those on socket
9.7.0 I/O events and query RTT histograms. [RT #18802]
2367. [bug] Improve counting of dns_resstatscounter_retry
9.6.0 [RT #18030]
2361. [bug] "recursion" statistics counter could be counted
9.6.0 multiple times for a single query. [RT #17990]
2355. [func] Extend the number statistics counters available.
9.6.0 [RT #17590]
2346. [func] Memory statistics now cover all active memory contexts
9.6.0 in increased detail. [RT #17580]
2320. [func] Make statistics counters thread-safe for platforms
9.6.0 that support certain atomic operations. [RT #17466]
2274. [func] Log zone transfer statistics. [RT #17336]
9.5.0
1878. [func] Detect duplicates of UDP queries we are recursing on
and drop them. New stats category "duplicate".
[RT #2471]