BIND issues (https://gitlab.isc.org/isc-projects/bind9/-/issues)

too long CNAME chains do not elicit SERVFAIL or even log message
https://gitlab.isc.org/isc-projects/bind9/-/issues/4449
Updated: 2023-11-23T15:08:44Z by Petr Špaček (pspacek@isc.org)

### Summary
CNAME chain length is currently limited to ~16 steps. Chains longer than this limit are cut short, but the RCODE is still NOERROR. This creates the impression that the final hop might be a NODATA answer.
Also, I can't see any log message indicating that resolution was terminated prematurely.
### BIND version used
* ~"Affects v9.19": a819d3644634997a78b162988156e90f409e1ce8
* ~"Affects v9.18": 6817bf1284fe8aea303365d2dd17bc5523e7a41b
* ~"Affects v9.16": 161d69aba357fa830bb6ef2b097b0447929041f0
* ~"Affects v9.11 (EoL)" : v9.11.37-S1
* Other versions were not tested
### Steps to reproduce
* Set up an auth zone with a too-long CNAME chain:
- [local.zone](/uploads/af4b4f699adb8b3bf87d5cac31b5d33f/local.zone)
- [named.conf](/uploads/2a0c44310bfbe99a2ffe6bbc1b36bacc/named.conf)
Query for it with a default resolver config.
### What is the current *bug* behavior?
RCODE=NOERROR despite the incomplete CNAME chain.
```
$ dig c0000.local.testiscorg.ch. A
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20544
;; flags: qr rd ra; QUERY: 1, ANSWER: 17, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 102100bdc30994f717d0d76d655f5b7d3a1f039d04cd3a86 (good)
;; QUESTION SECTION:
;c0000.local.testiscorg.ch. IN A
;; ANSWER SECTION:
c0000.local.testiscorg.ch. 0 IN CNAME c0001.local.testiscorg.ch.
c0001.local.testiscorg.ch. 0 IN CNAME c0002.local.testiscorg.ch.
c0002.local.testiscorg.ch. 0 IN CNAME c0003.local.testiscorg.ch.
c0003.local.testiscorg.ch. 0 IN CNAME c0004.local.testiscorg.ch.
c0004.local.testiscorg.ch. 0 IN CNAME c0005.local.testiscorg.ch.
c0005.local.testiscorg.ch. 0 IN CNAME c0006.local.testiscorg.ch.
c0006.local.testiscorg.ch. 0 IN CNAME c0007.local.testiscorg.ch.
c0007.local.testiscorg.ch. 0 IN CNAME c0008.local.testiscorg.ch.
c0008.local.testiscorg.ch. 0 IN CNAME c0009.local.testiscorg.ch.
c0009.local.testiscorg.ch. 0 IN CNAME c0010.local.testiscorg.ch.
c0010.local.testiscorg.ch. 0 IN CNAME c0011.local.testiscorg.ch.
c0011.local.testiscorg.ch. 0 IN CNAME c0012.local.testiscorg.ch.
c0012.local.testiscorg.ch. 0 IN CNAME c0013.local.testiscorg.ch.
c0013.local.testiscorg.ch. 0 IN CNAME c0014.local.testiscorg.ch.
c0014.local.testiscorg.ch. 0 IN CNAME c0015.local.testiscorg.ch.
c0015.local.testiscorg.ch. 0 IN CNAME c0016.local.testiscorg.ch.
c0016.local.testiscorg.ch. 0 IN CNAME c0017.local.testiscorg.ch.
```
### What is the expected *correct* behavior?
Same output but SERVFAIL.
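The limit handling asked for here can be sketched as follows. This is illustrative C only, not BIND's resolver code; the 16-hop limit and the SERVFAIL-on-overflow mapping are taken from this report, and `lookup_cname()` is a stand-in toy zone.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_CNAME_HOPS 16 /* approximate limit described above */

enum rcode { NOERROR, SERVFAIL };

/* Toy zone: name i has a CNAME pointing to name i+1 for the first
 * chain_len names; after that the name is terminal. */
static int lookup_cname(int chain_len, int idx, int *next) {
    if (idx < chain_len) {
        *next = idx + 1;
        return 1;
    }
    return 0;
}

/* Follow the chain; on exceeding the hop limit, log and SERVFAIL
 * instead of silently returning a truncated NOERROR answer. */
static enum rcode resolve(int chain_len, int *hops_out) {
    int idx = 0, next, hops = 0;
    while (lookup_cname(chain_len, idx, &next)) {
        if (++hops > MAX_CNAME_HOPS) {
            fprintf(stderr, "CNAME chain too long at hop %d; "
                            "terminating resolution\n", hops);
            return SERVFAIL;
        }
        idx = next;
    }
    *hops_out = hops;
    return NOERROR;
}
```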
### Relevant logs and/or screenshots
There is no log message indicating that the chain was cut prematurely. Here's named log running at `-d 99` from the main branch: [named.log](/uploads/4e9e9f8c70bf4fbc187082914a4b06ac/named.log)
### Other implementations
- PowerDNS Recursor 4.9.1 SERVFAILs and cuts the chain on c0011
- Knot Resolver 5.7.0 SERVFAILs and cuts the chain on c0013
- Unbound 1.19.0 commit 197bf154 SERVFAILs and does not return anything in the ANSWER section. [PCAP](/uploads/abb00b0409388e4a5cedf867a934e9f7/dns.pcap) suggests it stops chasing after encountering c0011.

stats channels and `rndc dumpstats` do not expose all counters from `rndc status`
https://gitlab.isc.org/isc-projects/bind9/-/issues/3691
Updated: 2022-11-29T13:42:48Z by Petr Špaček (pspacek@isc.org)

### Summary
The JSON and XML stats channels, and the `rndc dumpstats` command, do not expose the counters from `rndc status`. This forces users to scrape both channels to get a complete picture.
### BIND version used
9.19.8-dev (Development Release) 9128e54, but it certainly dates a long way back.
### What is the current *bug* behavior?
Compare lines produced by `rndc status` with content of JSON stats channel:
| rndc status line | evaluation | JSON key |
|-------------------------------------------------------------|-------------------|-------------|
| version: BIND 9.19.8-dev (Development Release) <id:9128e54> | different format, just `9.19.8-dev` | version |
| running on p: Linux x86_64 6.0.8-arch1-1 #1 SMP … | missing | |
| boot time: Tue, 22 Nov 2022 08:43:49 GMT | different format | boot-time |
| last configured: Tue, 22 Nov 2022 09:34:20 GMT | different format | config-time |
| configuration file: /etc/named.conf | missing | |
| CPUs found: 8 | missing | |
| worker threads: 8 | missing | |
| UDP listeners per interface: 8 | missing | |
| number of zones: 103 (98 automatic) | missing | |
| debug level: 0 | missing | |
| xfers running: 0 | missing | |
| xfers deferred: 0 | missing | |
| soa queries in progress: 0 | missing | |
| query logging is OFF | missing | |
| recursive clients: 0/900/1000 | missing | |
| tcp clients: 0/150 | missing | |
| TCP high-water: 0 | missing | |
| server is up and running | missing | |
### What is the expected *correct* behavior?
All information from `rndc status` is also exposed in the other stats channels.

Milestone: Not planned

rndc status "soa queries in progress" counter includes also AXFR in progress
https://gitlab.isc.org/isc-projects/bind9/-/issues/3688
Updated: 2023-11-02T17:05:06Z by Petr Špaček (pspacek@isc.org)

### Summary
The rndc status "soa queries in progress" counter also includes AXFRs in progress, even transfers that have already made their SOA query and received a valid answer with an SOA before initiating the transfer itself.
### BIND version used
- ~"Affects v9.19": 8272cc2
- ~"Affects v9.18": v9_18_9
- ~"Affects v9.16": v9_16_35
- ~"Affects v9.11 (EoL)": v9_11_37
### Steps to reproduce
1. Use the following config to transfer the (public) se. zone:
```
zone se {
type secondary;
primaries { 45.155.96.61; };
notify no;
};
```
2. Run `tcpdump` and watch SOA queries go by:
```
sudo tcpdump -i any 'udp and host 45.155.96.61'
```
3. Run BIND:
```
named -g -c secondary.conf
```
4. Observe output from `rndc status` before the transfer finishes.
### What is the current *bug* behavior?
tcpdump shows:
```
18:43:53.463606 enp0s13f0u1u2u3 Out IP p.50306 > zonedata.iis.se.domain: 21273 [1au] SOA? se. (35)
18:43:53.512928 enp0s13f0u1u2u3 In IP zonedata.iis.se.domain > p.50306: 21273*- 1/0/1 SOA (107)
```
`rndc status` at the same time shows:
```
xfers running: 1
xfers deferred: 0
soa queries in progress: 1
```
`named` log at the time:
```
21-Nov-2022 18:43:53.509 zone se/IN: Transfer started.
21-Nov-2022 18:43:53.556 transfer of 'se/IN' from 45.155.96.61#53: connected using 45.155.96.61#53
```
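For contrast, the bookkeeping I would expect can be sketched like this (illustrative only, not BIND's internal counters; the struct and function names are made up): the gauge drops as soon as the SOA answer arrives, before the transfer it triggered is counted.

```c
#include <assert.h>

/* Hypothetical stats bookkeeping for illustration. */
struct xfer_stats {
    int soa_queries_in_progress;
    int xfers_running;
};

static void soa_query_sent(struct xfer_stats *s) {
    s->soa_queries_in_progress++;
}

static void soa_answer_received(struct xfer_stats *s) {
    /* The SOA check has finished: the gauge should drop here, even
     * though the transfer it triggered is about to start. */
    s->soa_queries_in_progress--;
}

static void transfer_started(struct xfer_stats *s) {
    s->xfers_running++;
}
```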
### What is the expected *correct* behavior?
I would expect the "soa queries in progress" counter to be 0 at this point in time.

Milestone: Not planned
Assignee: Mark Andrews

[ISC-support #20070] Wildcards, literal asterisk labels, and RPZ zones
https://gitlab.isc.org/isc-projects/bind9/-/issues/3192
Updated: 2024-02-14T14:54:18Z by Chuck Stearns

### Summary
A literal asterisk in an RR label can be used to bypass RPZ records.
### BIND version used
9.11.33-S1 (though I think this also affects 9.16 and 9.18)
### Steps to reproduce
RPZ entries:
```
test.com CNAME .
*.test.com CNAME .
```
AND either:
- a test.com zone containing `*.test.com {type} {value}` (must not be delegated), OR
- a sub.*.test.com zone definition
Example test:
```
$ dig @0 test.sub.\*.test.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @0 test.sub.*.test.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62448
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;test.sub.*.test.com. IN A
;; ANSWER SECTION:
test.sub.*.test.com. 3600 IN A 127.0.0.1
;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(0.0.0.0)
;; WHEN: Wed Jan 26 16:47:19 EST 2022
;; MSG SIZE rcvd: 64
```
### What is the current *bug* behavior?
Queries containing a literal asterisk (such as `sub.*.test.com` or `*.test.com`) will be answered, rather than caught by RPZ.
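The matching the report expects can be sketched as a plain suffix match on the owner name, where query labels are treated as opaque bytes, so a literal `*` label does not defeat the rule. Illustrative C only; this is not RPZ's actual matching algorithm.

```c
#include <assert.h>
#include <string.h>

/* Does `qname` match the wildcard rule "*." + suffix?  Treats every
 * label as opaque data, so names containing a literal '*' label still
 * match if they end in the suffix on a label boundary. */
static int rpz_wildcard_match(const char *qname, const char *suffix) {
    size_t ql = strlen(qname), sl = strlen(suffix);
    if (ql <= sl) {
        return 0; /* must have at least one extra label */
    }
    if (strcmp(qname + (ql - sl), suffix) != 0) {
        return 0;
    }
    return qname[ql - sl - 1] == '.'; /* label boundary check */
}
```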
### What is the expected *correct* behavior?
RPZ expected to catch the query, like so:
```
$ dig @0 sub.test.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @0 sub.test.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 31154
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 2
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;sub.test.com. IN A
;; ADDITIONAL SECTION:
localhost.rpz. 1 IN SOA localhost. postmaster.localhost. 2004052401 3600 1800 604800 3600
;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(0.0.0.0)
;; WHEN: Wed Jan 26 16:40:21 EST 2022
;; MSG SIZE rcvd: 110
```

Milestone: Not planned

dnstap-read: add incremental packet number to the output
https://gitlab.isc.org/isc-projects/bind9/-/issues/3136
Updated: 2022-02-10T15:06:10Z by martinvonwittich

### Description
Currently, `dnstap-read` doesn't print any kind of unique identifier per packet. E.g. if I'm using it without any options to get the short summary format:
```
09-Feb-2022 02:33:36.294 CQ 127.0.0.1:41718 -> 127.0.0.1:0 UDP 33b example.com/IN/MX
09-Feb-2022 02:33:36.334 RR 192.168.1.2:55163 <- 192.168.1.1:53 UDP 71b example.com/IN/MX
09-Feb-2022 02:33:36.294 RQ 192.168.1.2:55163 -> 192.168.1.1:53 UDP 33b example.com/IN/MX
09-Feb-2022 02:33:36.334 CR 127.0.0.1:41718 <- 127.0.0.1:0 UDP 102b example.com/IN/MX
09-Feb-2022 02:33:38.453 CQ 127.0.0.1:57293 -> 127.0.0.1:0 UDP 33b example.com/IN/MX
09-Feb-2022 02:33:38.453 CR 127.0.0.1:57293 <- 127.0.0.1:0 UDP 102b example.com/IN/MX
```
and then I want to look up the details of one of these packets in the `-p` format, I have to search for the whole line to find it.
It's even worse in the `-y` format because, unlike the `-p` format, the YAML representation doesn't contain the original summary line, and while the summary and `-p` print timestamps in the local timezone, `-y` prints UTC timestamps.
### Request
I would like `dnstap-read` to prefix each packet with an incremental number in the summary and in the `-p` output, so that the details for a packet can easily be searched. The YAML representation should contain the number in an additional YAML field.
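A sketch of the requested numbering (assumption: a simple process-wide counter shared by the summary, `-p`, and `-y` output paths, so the same packet gets the same number everywhere; `dnstap-read` has no such feature today):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Incremental packet number, tshark-style. */
static unsigned long packet_no = 0;

/* Prefix a summary line with the next packet number. */
static int format_summary(char *buf, size_t len, const char *line) {
    return snprintf(buf, len, "%lu %s", ++packet_no, line);
}
```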
### Links / references
I like the way it works in `tshark` - each line in the summary is prefixed with an incremental packet number:
```
server ~ # tshark -i ens3 -w test.pcap
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ens3'
4 ^C
server ~ # tshark -r test.pcap
Running as user "root" and group "root". This could be dangerous.
1 2022-02-09 17:43:01,528756106 02:00:62:3e:71:f5 → 02:00:62:3e:71:f9 ARP 42 Who has 172.16.56.10? Tell 172.16.0.1
2 2022-02-09 17:43:01,528792938 02:00:62:3e:71:f9 → 02:00:62:3e:71:f5 ARP 42 172.16.56.10 is at 02:00:62:3e:71:f9
3 2022-02-09 17:43:02,068971390 172.16.56.10 → 172.21.0.10 SSH 102 Server: Encrypted packet (len=36)
4 2022-02-09 17:43:02,170798836 172.21.0.10 → 172.16.56.10 TCP 66 55798 → 22 [ACK] Seq=1 Ack=37 Win=990 Len=0 TSval=1691964300 TSecr=1370499909
```
When printing the capture file with the `-V` option, the first line of each frame is prefixed with `Frame n`, which makes it easy to search in a pager:
```
server ~ # tshark -r test.pcap -V | head -n 5
Running as user "root" and group "root". This could be dangerous.
Frame 1: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Interface id: 0 (ens3)
Interface name: ens3
Encapsulation type: Ethernet (1)
Arrival Time: Feb 9, 2022 17:43:01.528756106 CET
```

Milestone: Not planned

dnssec-verify detect and support multiple cores
https://gitlab.isc.org/isc-projects/bind9/-/issues/3063
Updated: 2023-11-02T17:02:21Z by Daniel Stirnimann

### Description
We use `dnssec-verify` (from BIND 9.16) to validate large DNSSEC-signed zones. I noticed that on a multi-core processor (e.g. 16 cores) only one CPU is ever used. I suspect validation time could be sped up a lot if all available cores were used.
### Request
Make `dnssec-verify` automatically use all available cores for operations where this is possible, e.g. signature verification.
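Assuming individual signature checks are independent, the work could be sharded across POSIX threads roughly like this. Illustrative only, not dnssec-verify code; `verify_one()` is a stand-in for the real per-signature check.

```c
#include <assert.h>
#include <pthread.h>

#define NITEMS 1000

static int results[NITEMS];

struct shard { int lo, hi; };

/* Stand-in for a real signature verification. */
static int verify_one(int i) { (void)i; return 1; }

/* Each worker verifies its half-open range [lo, hi). */
static void *worker(void *arg) {
    struct shard *s = arg;
    for (int i = s->lo; i < s->hi; i++) {
        results[i] = verify_one(i);
    }
    return NULL;
}

/* Split NITEMS independent checks over nthreads workers. */
static int verify_all(int nthreads) {
    pthread_t tids[nthreads];
    struct shard shards[nthreads];
    int per = NITEMS / nthreads, ok = 1;
    for (int t = 0; t < nthreads; t++) {
        shards[t].lo = t * per;
        shards[t].hi = (t == nthreads - 1) ? NITEMS : (t + 1) * per;
        pthread_create(&tids[t], NULL, worker, &shards[t]);
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tids[t], NULL);
    }
    for (int i = 0; i < NITEMS; i++) {
        ok &= results[i];
    }
    return ok;
}
```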
`dnssec-signzone` already automatically detects and uses all available cores, and even has a command-line switch to specify a specific number (`man dnssec-signzone`). I think something like this would be very useful:
```
-n ncpus
This option specifies the number of threads to use. By default, one thread is started for each detected CPU.
```

Milestone: Not planned

Move setting of @SO@ to copy_setports
https://gitlab.isc.org/isc-projects/bind9/-/issues/3027
Updated: 2022-03-01T09:43:31Z by Mark Andrews

Setting @SO@ in conf files is currently done by configure. copy_setports should be capable of doing this.

Milestone: Not planned

remove duplicate code between DLZ example & dlzexternal test
https://gitlab.isc.org/isc-projects/bind9/-/issues/2999
Updated: 2022-01-07T09:40:28Z by Petr Špaček (pspacek@isc.org)

Version: main branch, 4bebcd45033400a6b1a3057c60c3a2cfd5fd4029
These two files are substantially the same:
- contrib/dlz/example/dlz_example.c
- bin/tests/system/dlzexternal/driver/driver.c
I think we should remove one of them and use a symlink instead of a copy. Updating two files is ... suboptimal.

Milestone: Long-term

ECS-IP not visible in the "rpz.log"
https://gitlab.isc.org/isc-projects/bind9/-/issues/2984
Updated: 2023-04-12T12:34:45Z by Thomas Amgarten
### Summary
If an RPZ-enabled BIND is behind a proxy/loadbalancer (for example dnsdist) which injects the ECS IP, there is currently no way to see the client IP address (the ECS IP) in the "rpz.log". Instead, one sees only the IP address of the proxy/dnsdist itself, not the address of the effective source.
### BIND version used
Tested with BIND-9.16.21
### Steps to reproduce
- Place a proxy/dnsdist in front of BIND and inject the ECS-IP.
```
Domain Name System (response)
Transaction ID: 0x5d00
Flags: 0x8183 Standard query response, No such name
1... .... .... .... = Response: Message is a response
.000 0... .... .... = Opcode: Standard query (0)
.... .0.. .... .... = Authoritative: Server is not an authority for domain
.... ..0. .... .... = Truncated: Message is not truncated
.... ...1 .... .... = Recursion desired: Do query recursively
.... .... 1... .... = Recursion available: Server can do recursive queries
.... .... .0.. .... = Z: reserved (0)
.... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
.... .... ...0 .... = Non-authenticated data: Unacceptable
.... .... .... 0011 = Reply code: No such name (3)
Questions: 1
Answer RRs: 0
Authority RRs: 0
Additional RRs: 1
Queries
example.ch: type A, class IN
Name: example.ch
[Name Length: 8]
[Label Count: 2]
Type: A (Host Address) (1)
Class: IN (0x0001)
Additional records
<Root>: type OPT
Name: <Root>
Type: OPT (41)
UDP payload size: 1232
Higher bits in extended RCODE: 0x00
EDNS0 version: 0
Z: 0x0000
0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
.000 0000 0000 0000 = Reserved: 0x0000
Data length: 40
Option: COOKIE
Option Code: COOKIE (10)
Option Length: 24
Option Data: faf2434380c56c3d01000000617a1b9c7be4e739ff1b30de
Client Cookie: faf2434380c56c3d
Server Cookie: 01000000617a1b9c7be4e739ff1b30de
Option: CSUBNET - Client subnet
Option Code: CSUBNET - Client subnet (8)
Option Length: 8
Option Data: 00012000c0a8ec02
Family: IPv4 (1)
Source Netmask: 32
Scope Netmask: 0
Client Subnet: 172.16.16.33 <------------------
[Request In: 13]
[Time: 0.000221000 seconds]
```
- Then query a domain via proxy, which triggers RPZ
### What is the current *bug* behavior?
- Verify the "rpz.log", which only shows the proxy-ip
```
27-Oct-2021 15:41:27.940 rpz: info: client @0x7f3db81aa0f8 127.0.0.1#44353 (example.ch): rpz QNAME NXDOMAIN rewrite example.ch/A/IN via example.ch.blacklist-rpz.test.local
```
### What is the expected *correct* behavior?
A way to see the ECS IP, the effective client IP address, as is already implemented when enabling the builtin "rndc querylog on":
```
27-Oct-2021 15:41:27.940 queries: info: client @0x7f3db81aa0f8 127.0.0.1#44353 (example.ch): query: example.ch IN A +E(0)K (127.0.0.1) [ECS 172.16.16.33/32/0]
```
### Relevant configuration files
### Relevant logs and/or screenshots
### Possible fixes

Milestone: Not planned

Bad html generated for 9.11 doc on web site
https://gitlab.isc.org/isc-projects/bind9/-/issues/2955
Updated: 2022-03-01T09:46:14Z by Mark Andrews

See https://bind.isc.org/doc/arm/9.11/man.named.conf.html for example. White space is incorrect.

Milestone: Not planned

CID 339072 (#1 of 1): Unchecked return value (CHECKED_RETURN)
https://gitlab.isc.org/isc-projects/bind9/-/issues/2938
Updated: 2023-01-09T11:11:24Z by Mark Andrews

lib/dns/rpz.c:
```
2246
CID 339072 (#1 of 1): Unchecked return value (CHECKED_RETURN)
25. check_return: Calling isc_timer_reset without checking return value (as is done elsewhere 9 out of 10 times).
2247 isc_timer_reset(rpz->updatetimer, isc_timertype_inactive, NULL,
2248 NULL, true);
```

Milestone: Not planned
Assignee: Mark Andrews

warning: checkhints: unable to get root NS rrset from cache: not found
https://gitlab.isc.org/isc-projects/bind9/-/issues/2744
Updated: 2024-03-27T00:34:35Z by Cathy Almond

Periodically we see reports of resolvers that are failing to respond to clients successfully, perhaps with a build-up of recursive clients, inbound UDP packet drops, late and missing query responses and so on. Rebooting the server entirely usually fixes the problem - for a time. Flushing cache may also buy some relief, but generally this does not last as long as if the server is rebooted entirely.
Plus one symptom in the logs - repeated spates of messages like this:
```
31-May-2021 16:08:38.110 general: warning: checkhints: unable to get root NS rrset from cache: not found
31-May-2021 16:08:41.110 general: warning: checkhints: unable to get root NS rrset from cache: not found
31-May-2021 16:08:42.151 general: warning: checkhints: unable to get root NS rrset from cache: not found
```
This error message occurs when the root nameservers have just been primed, but when checkhints goes to look at them, they're no longer available in cache (they have expired, possibly also been removed), all in a very short period of time.
Reports of this have been seen intermittently for many years and from many versions of BIND 9. Typically (in the older reports) this was a rare occurrence seen on a resolver that had been running for a long time; months, possibly years. Therefore after rebooting, the error and the problem was never seen again (or at least not within the shelf-life of the admin who reported it to us originally).
We suspect that what is happening is that the cache structure and content have become unmaintainable over a long period of content being added, expired, and removed, and that it has become impossible to add new RRsets to cache without expiring existing content because of max-cache-size. The cache tree structure itself also occupies memory, and we've seen a few instances where a long-lived cache has become 'straggly' but also sparsely populated.
What we haven't been able to catch (yet), is the exact path taken that causes this error to be logged, although we have been hoping that improved stats, along with a `catch it earlier` assertion (the server anyway needs to be restarted when it has reached this state) might help. See #2082 .
We have also seen that in one or two instances of this warning being logged, in addition there was a problem reaching some of the root nameserver addresses listed in the root hints and used for priming. Either the root hints were out of date and an older IP address was unreachable, or there were local routing issues (typically IPv6-related) reaching some root server addresses. **This shouldn't be a problem**, per the way that root hints priming is designed, _but 'fixing' the root hints appears to have made the problem go away in some instances, as has fixing the routing and unreachability of some root hint addresses._
----
For anyone experiencing this problem for the first time, the likelihood is that one or more things have changed in your operating environment, and that these are causing cache content to be more substantial than before, or potentially distributed differently. For example:
- Installing a version of BIND that has `stale-cache-enable yes` by default
- An increase in client queries overall
- Client query patterns changing - perhaps causing a higher rate than usual of cached negative responses
- An increase in dual-stack clients querying for AAAA records
- An increase in client querying for HTTPS records
- A new client application that uses DNS-based probing
- Clients using a tunnelling-over-DNS service
- Using a client filtering service that operates by means of resolving the original client query first by appending another private zone name to it and checking the response status before allowing the original query to pass - thus adding the filtering RRsets to cache as well as the actual client query responses.
Currently, clues may be found in the BIND statistics and also in a dump of cache.
Firstly, these counters (available either from the output from `rndc stats` or using the xml or json statistics interface), can be a good indicator that there is too much cache cleaning taking place due to memory pressure, versus RRset TTL expiration:
DeleteLRU - "cache records deleted due to memory exhaustion"
DeleteTTL - "cache records deleted due to TTL expiration"
These are counters, therefore although seeing DeleteLRU far exceeding DeleteTTL in a single snapshot of the stats is a good indicator that all is not well with cache, ideally you want to monitor the trend over time.
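For example, the trend can be watched by differencing successive snapshots of the two counters. This is an illustrative sketch only; the struct and function names are made up, but the values are the DeleteLRU/DeleteTTL counters described above.

```c
#include <assert.h>

/* Two snapshots of the cumulative delete counters. */
struct cache_deletes {
    unsigned long lru; /* DeleteLRU: deleted due to memory exhaustion */
    unsigned long ttl; /* DeleteTTL: deleted due to TTL expiration */
};

/* Percentage of this interval's deletions that were LRU-driven.
 * A rising share signals memory-pressure cleaning, not normal expiry. */
static int lru_share_pct(struct cache_deletes prev,
                         struct cache_deletes cur) {
    unsigned long dlru = cur.lru - prev.lru;
    unsigned long dttl = cur.ttl - prev.ttl;
    if (dlru + dttl == 0) {
        return 0;
    }
    return (int)(100 * dlru / (dlru + dttl));
}
```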
Also these:
HeapMemInUse - "cache heap memory in use"
TreeMemInUse - "cache tree memory in use"
HeapMemMax - "cache heap highest memory in use"
TreeMemMax - "cache tree highest memory in use"
All of the above are gauges - they tell you 'this is where we are now', so a snapshot can be useful, as well as monitoring pattern over time. The 'Max' is a high water mark.
Aside: don't be tempted to look at either of these - they are not useful operationally and aren't counting what you might think they are from their names:
HeapMemTotal - "cache heap memory total"
TreeMemTotal - "cache tree memory total"
And finally, there are counters available of what's in cache currently by RRtype. These are prefixed with `!` for counters of NXRRSET (a pseudo-RR indicating that a name that was queried existed but the type didn't), `#` for stale content, and `~` for content that has expired and is waiting on housekeeping/deletion.
If there is any kind of unexpected skew, it might be worth dumping cache to see what's in there.
And then decide: is it just that max-cache-size is now insufficient, or does something else need to be done to reduce cache content?

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24)
Assignee: Mark Andrews

failed query to a `forward only` forwarder increments `serverquota` counter (spilled due to server quota)
https://gitlab.isc.org/isc-projects/bind9/-/issues/1793
Updated: 2023-11-02T16:58:14Z by Cathy Almond

As observed in [Support ticket #16297](https://support.isc.org/Ticket/Display.html?id=16297)
I was inspecting the stats output and was very surprised to see this:
` 13779 spilled due to server quota`
The server in question does not have `fetches-per-server` configured, so this defaults to zero (unlimited). But yet...
Looking at the code - I suspect there's a failure mode that drops through the 'out' block in fctx_getaddresses() without resetting all_spilled (which starts at 'true').
```c
static isc_result_t
fctx_getaddresses(fetchctx_t *fctx, bool badcache) {
dns_rdata_t rdata = DNS_RDATA_INIT;
isc_result_t result;
dns_resolver_t *res;
isc_stdtime_t now;
unsigned int stdoptions = 0;
dns_forwarder_t *fwd;
dns_adbaddrinfo_t *ai;
bool all_bad;
dns_rdata_ns_t ns;
bool need_alternate = false;
bool all_spilled = true;
```
...
```c
/*
* If all of the addresses found were over the
* fetches-per-server quota, return the configured
* response.
*/
if (all_spilled) {
result = res->quotaresp[dns_quotatype_server];
inc_stats(res, dns_resstatscounter_serverquota);
}
```
This is a server that is using global forwarding, so we skip case 'normal_nses', which is where 'all_spilled' is normally reset from true to false during processing:
```c
if (fctx->fwdpolicy == dns_fwdpolicy_only)
goto out;
```
So I'm guessing that what's been 'counted' and then reported here is failures to get responses back from any of the global forwarders (which tallies quite nicely with the problem I'm investigating - even though this wasn't a counter I was expecting to see in the stats!).
The assumption seems to be that if it's a failure for any reason other than fetch-limits, something will reset the 'all_spilled' flag - it would appear that assumption is flawed for some configurations and situations. Could someone have a look at this please - it should be an easy one to fix.
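The suspected flag bug can be reduced to a toy version (a sketch, not the real fctx_getaddresses() control flow; the names mirror the excerpts above): a flag initialized to true is only cleared inside the address-processing loop, which the forward-only path jumps over, so a plain forwarder failure gets reported as "spilled due to server quota".

```c
#include <assert.h>
#include <stdbool.h>

enum report { REPORT_OTHER, REPORT_SPILLED };

/* Toy model: all_spilled starts true and is only cleared inside the
 * address-processing loop, which the forward-only path skips (the
 * "goto out"), so any forward-only failure reports as spilled. */
static enum report classify(bool forward_only, int usable_addrs) {
    bool all_spilled = true;
    if (!forward_only) {
        for (int i = 0; i < usable_addrs; i++) {
            all_spilled = false; /* cleared per non-spilled address */
        }
    }
    /* out: */
    return all_spilled ? REPORT_SPILLED : REPORT_OTHER;
}
```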
I note that this has also been noticed before on bind-users:
https://lists.isc.org/pipermail/bind-users/2016-June/097011.html
I observed this in 9.11.15-S1, but the code path looks the same still on master.
Requested changes:
- [ ] fix serverquota counter
- [ ] add a new counter specifically for the situation when all forwarders have failed

Milestone: Not planned