# BIND issues

https://gitlab.isc.org/isc-projects/bind9/-/issues

---

# slightly worse cold cache performance after the KeyTrap fix
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/4564
Author: Tom Krizek · 2024-02-24

The KeyTrap fix (merged in isc-private/bind9!628) has a significant impact on client query latency during startup (cold cache), as well as increased memory consumption. Using `recursive-clients 1000;` yielded results quite similar to `10000` (which was used for most charts in here).
## Cold cache latency
### :question: 9.18 cold cache UDP
![keytrap-cold-cache-latency-9.18](/uploads/9953460d5a50fb571f18c21ece5f7415/keytrap-cold-cache-latency-9.18.png)
#### load 15x
| | before | after |
| ------ | ------ | ----- |
| responses <2.0s | 91 % | 79 % |
| responses <100ms | 88 % | 75 % |
#### load 10x
| | before | after |
| ------ | ------ | ----- |
| responses <2.0s | 96 % | 89 % |
| responses <100ms | 92 % | 85 % |
#### load 5x
| | before | after |
| ------ | ------ | ----- |
| responses <2.0s | 99 % | 98 % |
| responses <100ms | 93 % | 92 % |
### :question: 9.16 cold cache UDP
![keytrap-cold-cache-latency-9.16](/uploads/795943f2c6c6959c15f30c9f41597976/keytrap-cold-cache-latency-9.16.png)
### :white_check_mark: 9.19 cold cache UDP
The impact for ~"v9.19" is quite minimal, and I don't consider it an issue.
![keytrap-cold-cache-latency-9.19](/uploads/d86be8aeb8170b8e596bd85666937a81/keytrap-cold-cache-latency-9.19.png)
### :question: 9.18 cold cache TCP
The performance drop can also be observed with TCP, with much lower overall throughput.
![keytrap-hot-cache-latency-tcp-9.18](/uploads/b581e94d77c5c1cdc9e143fdfff3bac2/keytrap-hot-cache-latency-tcp-9.18.png)
## :white_check_mark: Hot cache latency
The good news is that it doesn't really affect performance with a hot cache.
![keytrap-hot-cache-latency-9.18](/uploads/e1704c0c853dc5b6ab798e1821b4a7db/keytrap-hot-cache-latency-9.18.png)
## Memory consumption
### :question: 9.18 memory consumption UDP
The initial memory consumption also gets slightly higher:
![keytrap-memory-9.18](/uploads/7d89b5dbc594c64c57802e11418c03b6/keytrap-memory-9.18.png)
### :white_check_mark: 9.18 memory consumption TCP
While the memory consumption is slightly higher for TCP under load, I think this can be explained by the fact that some queries take longer to resolve -> some connections might stay open longer and thus consume more resources than before.
![keytrap-memory-tcp-9.18](/uploads/88be13f12e69bb69fc7f6d917dc21e6f/keytrap-memory-tcp-9.18.png)

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24)

---

# Various improvements to hashing and hash table management
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/4427
Author: Michał Kępień · 2024-02-24

This is a meta issue to keep track of various improvements to hashing
and hash table management that were implemented since ~"v9.18".
Sparked by a [Mattermost discussion][1].

---

- [x] #4306/!8288 Implement incremental hashing

---
[1]: https://mattermost.isc.org/isc/pl/rsyemrkwhtfcbxhtyxddhkn58y

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24)

---

# incoming AXFR sometimes does not close TCP connection
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3792
Author: Petr Špaček <pspacek@isc.org> · 2024-02-24

### Summary
I've noticed in PCAPs that sometimes BIND does not close TCP connection after successful incoming AXFR. This might cause source port depletion on a busy server.
### BIND version used
* ~"Affects v9.19": 9.19.9 56d7e01
* Not reproducible on ~"v9.18" (9.18.11 equivalent, b04ab06) - although closing the connection can take more than one second, it happens from the secondary side as expected
* ~"Affects v9.16": (9.16.37, b4a65aaea19762a3712932aa2270e8a833fbde22) - reproducible
Don't ask me how that is possible ...
### Steps to reproduce
1. Configure a primary with 100k zones + catalog - can be BIND or Knot DNS (recommended to take BIND out of the equation on one side)
2. Configure BIND as secondary for the catalog
3. Start secondary with clean state
### What is the current *bug* behavior?
PCAPs show that sometimes the primary closes the hanging connection after a primary-side timeout.
### What is the expected *correct* behavior?
Connections are closed as soon as possible.
### Relevant configuration files
#### Primary
* [named.conf](/uploads/863bf85788384d2e4893ea94cc606c89/named.conf)
* [catalog.db](/uploads/c515216922d648acf6065f7a50b36233/catalog.db)
* [empty.db](/uploads/5686c122ffb6fd4eb035bc1b88931e0f/empty.db)
Knot DNS version: [knotd.conf](/uploads/e59561f0b1f2047d348a51303d5a2119/knotd.conf)
#### Secondary
* [named.conf](/uploads/984a16e8322400cc6465b14ca45710ef/named.conf)
### Relevant logs and/or screenshots
* Primary: [primary.log.zst](/uploads/e50efe9e008cd762b3a671245e207b7d/primary.log.zst)
* Secondary: [secondary-for-knotd-conf3000.log.zst](/uploads/e8732553815f19ebf3b483629afd6279/secondary-for-knotd-conf3000.log.zst)
* search for `z19823.test` and look at timestamps
* PCAP: [bindconf3000.pcap.zst](/uploads/86b89c9ddc6d4e1c6066cfd1a997c25b/bindconf3000.pcap.zst)
* search for `tcp.stream eq 37322` in Wireshark to get `z19823.test` transfer
Suspicious conversation from the PCAP, times relative to the previous packet:
|No. | Time | Source | Source Port | Destination | Reply code | Info|
|--- | --- | --- | --- | --- | --- | ---|
|484345 | 0 | 192.0.2.2 | 40571 | 192.0.2.1 | | 40571 → 53 [SYN] Seq=0 Win=64660 Len=0 MSS=1220 SACK_PERM TSval=3661096036 TSecr=0 WS=128|
|484346 | 0,000027 | 192.0.2.1 | 53 | 192.0.2.2 | | 53 → 40571 [SYN, ACK] Seq=0 Ack=1 Win=65232 Len=0 MSS=1220 SACK_PERM TSval=1123290483 TSecr=3661096036 WS=128|
|484347 | 0,000008 | 192.0.2.2 | 40571 | 192.0.2.1 | | 40571 → 53 [ACK] Seq=1 Ack=1 Win=64768 Len=0 TSval=3661096036 TSecr=1123290483|
|511718 | 1,98078 | 192.0.2.2 | 40571 | 192.0.2.1 | | Standard query 0x47aa AXFR z19823.test|
|511719 | 0,000019 | 192.0.2.1 | 53 | 192.0.2.2 | | 53 → 40571 [ACK] Seq=1 Ack=32 Win=65280 Len=0 TSval=1123292464 TSecr=3661098017|
|511724 | 0,000107 | 192.0.2.1 | 53 | 192.0.2.2 | No error | Standard query response 0x47aa AXFR z19823.test SOA <Root> NS invalid SOA <Root>|
|511726 | 0,000009 | 192.0.2.2 | 40571 | 192.0.2.1 | | 40571 → 53 [ACK] Seq=32 Ack=121 Win=64768 Len=0 TSval=3661098017 TSecr=1123292464|
|601979 | 9,49634 | 192.0.2.1 | 53 | 192.0.2.2 | | 53 → 40571 [FIN, ACK] Seq=121 Ack=32 Win=65280 Len=0 TSval=1123301960 TSecr=3661098017|
|602469 | 0,040942 | 192.0.2.2 | 40571 | 192.0.2.1 | | 40571 → 53 [ACK] Seq=32 Ack=122 Win=64768 Len=0 TSval=3661107554 TSecr=1123301960|
|621475 | 1,959518 | 192.0.2.2 | 40571 | 192.0.2.1 | | 40571 → 53 [FIN, ACK] Seq=32 Ack=122 Win=64768 Len=0 TSval=3661109514 TSecr=1123301960|
|621476 | 0,000019 | 192.0.2.1 | 53 | 192.0.2.2 | | 53 → 40571 [ACK] Seq=122 Ack=33 Win=65280 Len=0 TSval=1123303961 TSecr=3661109514|
### Possible fixes

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24)

---

# Lock contention in the RBTDB
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3811
Author: Ondřej Surý · 2023-11-02
Assignee: Ondřej Surý

@pspacek discovered lock contention on the RBTDB node locks in the `rdataset_getownercase()` and `decrement_reference()` functions.

Milestone: Not planned

---

# Run cache cleaning as offloaded work
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3261
Author: Ondřej Surý · 2024-03-01

Cache cleaning is an on-task incremental process, which makes it an ideal candidate for running as offloaded work.

NOTE for myself or whoever is going to do the job: great care needs to be taken with signaling the end of cleaning. Currently this is serialized by the task, but if we move it into the thread pool, the signaling needs to be done by an atomic variable (or something like that).

Milestone: Not planned

---

# suspicious test results with 'reuseport no'
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3861
Author: Petr Špaček <pspacek@isc.org> · 2023-02-09

### Summary
This is a suspicion, not a clear-cut bug.
### BIND version used
- ~"Affects v9.18": v9_18_11
Other versions were not tested.
### Steps to reproduce
- use 16-thread machine
- I was testing in AWS VM type c5n.4xlarge for server and c5n.xlarge for client machine
- configure `reuseport no;` and "refuse everything"
- use dnsperf 2.11 [+ JSON output MR](https://github.com/pspacek/dnsperf/tree/json_output):
- `-Q 100000 -S1 -O suppress=timeout,unexpected -c 256 -q 65535 -t 1 -l 60 -O verbose-interval-stats -O json -O latency-histogram`
- query file with single line: `z123.test. SOA` - it should result in REFUSED because we don't have any such zone
- cycle through 10 dnsperf runs without restarting server
- hope that it reproduces the erratic behavior; in my experiments it happens in at least ~1/10 runs, sometimes more
### What is the current *bug* behavior?
**Sometimes** the server has high latency. Dunno why. I was testing against echo server and it did not exhibit this behavior.
![histogram-raw.svg](/uploads/e20182379751faeabcd967567e2978f9/histogram-raw.svg)
### What is the expected *correct* behavior?
Even performance.
### Relevant configuration files
[refused.conf](/uploads/f7255a304baba60c8dff5d02058b7e4d/refused.conf)
### Relevant logs and/or screenshots
Nothing suspicious.

---

# rndc can get very very slow if large number of requests is made over short period of time
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3698
Author: Petr Špaček <pspacek@isc.org> · 2023-01-12

### Summary
rndc can get very, very slow if a large number of requests is made over a short period of time. The primary cause seems to be that packet deduplication is O(n^2).
### BIND version used
- ~"Affects v9.19" : e2bbf38cdb42de70d504b2ff281fb360cd0f27c0
- I assume that supported versions are affected
### Steps to reproduce
TL;DR call `rndc addzone` in a tight loop, and measure response time.
Helper scripts:
* [addconfprim.py](/uploads/129fa9ac79cd590c9172e6d61f28a59f/addconfprim.py) - run this
* [rndc.py](/uploads/23c2eba46244ac7decbe865d84a92052/rndc.py)
### What is the current *bug* behavior?
Adding zones slows down very quickly.
```
33000 zones present; adding last 1000 took 1.93 secs
...
65000 zones present; adding last 1000 took 4.42 secs
```
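The quadratic claim above can be illustrated with a toy model (hypothetical, not BIND code): if every incoming control message is compared against all previously seen messages still within `DUP_LIFETIME`, and messages arrive faster than they expire, total work grows as O(n^2).

```python
def dedup_comparisons(n_messages: int, dup_lifetime: int) -> int:
    """Count comparisons made by a naive dedup that linearly scans all
    previously seen messages still within the dedup lifetime."""
    seen = []          # (timestamp, message_id) pairs, scanned linearly
    comparisons = 0
    for now in range(n_messages):          # one message per time unit
        seen = [(t, m) for (t, m) in seen if now - t < dup_lifetime]  # expire
        comparisons += len(seen)           # compare against every live entry
        seen.append((now, now))
    return comparisons

# When nothing expires during the burst, doubling the message count
# roughly quadruples the work:
fast = dedup_comparisons(1000, dup_lifetime=10**9)
slow = dedup_comparisons(2000, dup_lifetime=10**9)
```

With a `dup_lifetime` that is short relative to the arrival rate, the table stays small and the total cost is linear, which is consistent with the slowdown only showing up under bursts of requests.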
### What is the expected *correct* behavior?
No speed degradation.
### Relevant configuration files
```
key "key" {
algorithm hmac-sha256;
secret "ptCZS/77Xm2sIzCdO/oxEoer2BbDgCfvF0CrqrcdRWM=";
};
options {
max-cache-size 10M;
recursion no;
notify no;
allow-new-zones yes;
lmdb-mapsize 110M;
};
```
* [empty.db](/uploads/da7366d7d37edc16d43b7280f7cbaf6f/empty.db)
### Possible fixes
From a quick glance, the problem centers around `DUP_LIFETIME` defined in `lib/isccc/cc.c` and the inefficiency of `isccc_cc_cleansymtab()` and its use.

---

# Histograms for timing and memory statistics
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3464
Author: Tony Finch · 2023-11-02

BIND needs to be able to record statistics covering a wide range of possible values (several decimal orders of magnitude):
* latency times, from submillisecond queries on the same LAN to multi-minute zone transfers
* memory usage, for zones from a handful of records to tens of millions
* message sizes, from 64 bytes to 64 kilobytes
In this issue I'm outlining a possible design for a general-purpose histogram data structure that could be added to `libisc` for collecting statistics efficiently in several places in BIND.
## existing histograms in BIND
The statistics channel has histograms for request and response sizes, which use buckets that
are defined manually with some tediously repetitive code. These could be replaced by the
proposed self-tuning histograms, although the bucketing will be somewhat different.
## examples of general-purpose histograms
It's possible to record histograms of values covering a wide range, with bucket sizes chosen automatically to provide a particular level of accuracy (e.g. 1% or 10%), and without using more than a few KiB for each histogram. Existing examples are:
* [circllhist, Circonus log-linear histogram](https://github.com/openhistogram/libcircllhist),
aka [OpenHistogram](https://openhistogram.io/)
Uses decimal floating point with two digits of mantissa and a 1 byte exponent,
to record values with 1% accuracy.
* [DDSketch from DataDog](https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/)
Uses the floating-point logarithm to a base derived from the required accuracy, rounded to an integer to make a bucket index.
Has an alternative "fast" mode more like HdrHistogram.
* [HdrHistogram, high dynamic range histogram](http://www.hdrhistogram.org/)
Uses low-precision floating point numbers as bucket indexes.
* [hg64, 64-bit histograms](https://github.com/fanf2/hg64)
My prototype implementation intended for use in BIND.
The DataDog blog article has a nice overview, and compares a quantile sketch implementation (designed for a particular rank error) with a histogram (designed for a particular value error). From my reading on this topic I concluded that histograms are easier to understand, simpler to implement, and have similar or better CPU and memory usage compared to rank-error-based quantile sketches.
## key idea
The histogram counts how many measurements (time or space) have a particular `uint64_t` value or fall within a range of values, according to the histogram's configured precision (e.g. 1% or 10%).
Each range of values corresponds to a bucket or counter.
My prototype `hg64` uses a log-linear bucket spacing, which has two parts:
* a logarithm of the value to cover a large dynamic range with a few bits;
specifically, the log base 2 of a `uint64_t` varies from 0 to 63, which fits in 6 bits.
* linear, evenly spaced buckets between logarithms, to provide more precision
than you can get from just a power of 2 or 10. 4 buckets per log are enough
for 10% precision; 32 buckets per log gives 1% precision.
This log-linear bucketing is the same thing as decimal scientific notation,
like 1e9 (1 significant digit, 10% precision) or 2.2e8 (2 significant digits, 1% precision).
It's also the same as a (low-precision) binary floating point number:
the FP exponent is the logarithmic part, and the FP mantissa is the linear part.
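As a rough sketch of the log-linear bucketing described above (illustrative, not the actual `hg64` code; the `mantissa_bits` parameter and key layout are assumptions), a `uint64_t` value can be mapped to a bucket key like this:

```python
def bucket_key(value: int, mantissa_bits: int = 5) -> int:
    """Log-linear bucket index: 2**mantissa_bits linear buckets per
    power of two. 5 mantissa bits = 32 buckets per log2 (a few percent
    relative precision); 2 bits = 4 buckets per log2 (~10%-class)."""
    if value < (1 << mantissa_bits):
        return value  # each small value gets its own exact bucket
    exponent = value.bit_length() - 1        # floor(log2(value)); CLZ in C
    # the top mantissa_bits bits below the leading 1 pick the linear bucket
    mantissa = (value >> (exponent - mantissa_bits)) & ((1 << mantissa_bits) - 1)
    return ((exponent - mantissa_bits + 1) << mantissa_bits) | mantissa
```

The keys are monotonic in the value, and for 64-bit inputs with 5 mantissa bits the largest key stays below 2^11, in the same ballpark as the bucket key sizes discussed under efficient storage below.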
## measurements and values
When counting time measurements, it makes sense for the `uint64_t` value to be the time measured in nanoseconds. This allows the histogram to count any time measurements we are likely to need, from submicrosecond up to a few centuries. There is no point using lower-precision time measurements because the histogram bucketing algorithm will reduce the precision as required.
Unlike nanosecond measurements, whose values are towards the logarithmic mid-range of `uint64_t`, memory measurements tend to cluster around zero. The `hg64` bucketing algorithm provides one counter for each distinct small integer; for instance, with 1% precision `hg64` has a counter for each value from 0 to 63, above which multiple values share each counter. To make the best use of these small-value counters, it makes sense to divide a memory measurement to get the desired resolution. For example, if the allocator quantum is 16 bytes, divide an allocation size by 16 before using it as a histogram value.
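The divide-by-quantum trick can be as simple as the following (the 16-byte quantum is an assumption for illustration):

```python
ALLOC_QUANTUM = 16  # bytes; assumed allocator quantum, illustration only

def memory_histogram_value(alloc_size: int) -> int:
    """Scale an allocation size down by the allocator quantum so the
    exact per-integer buckets for small values (e.g. 0..63 at 1%
    precision) cover the allocation sizes that actually occur."""
    return alloc_size // ALLOC_QUANTUM
```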
## incrementing counters quickly
It is very cheap to turn a `uint64_t` value into a bucket number, using CLZ to get the logarithm
with some bit shuffling to move things into place. The basic principle is
roughly the same as used by HdrHistogram and fast-mode DDSketch.
[Paul Khuong encouraged me to use his algorithm](https://twitter.com/pkhuong/status/1571831293335277573)
which is smaller and faster than the version I developed for my proof-of-concept.
As in BIND's existing statistics code, we use a relaxed atomic increment to update a counter.
When the histogram is in cache and uncontended, the whole operation (calculating the bucket
number and incrementing the counter) takes less than 2.5ns in my prototype code.
## efficient storage
The `hg64` bucket keys are small, e.g. 8 bits for 10% precision, or 11 bits for 1% precision.
We could store the buckets as a simple array of counters, which would use 2 KiB for 10%
precision, or 16 KiB for 1% precision. However a large fraction of that space will be
unused, because the values we are recording do not cover anywhere near 20 orders of
magnitude.
My prototype code has a 64-entry top-level array (one for each possible exponent)
and allocates each sub-array on demand (with a counter for each possible mantissa).
Most of the sub-arrays will remain unused. This layout supports lock-free multithreading.
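A sketch of that two-level layout (illustrative, not the `hg64` implementation; in C the sub-array pointers would be installed with atomic operations to stay lock-free):

```python
class SparseHistogram:
    """64-entry top-level array (one slot per log2 exponent) with
    per-exponent counter sub-arrays allocated only on first use."""

    def __init__(self, mantissa_bits: int = 5):
        self.mantissa_bits = mantissa_bits
        self.top = [None] * 64            # one slot per possible exponent

    def add(self, value: int) -> None:
        exponent = max(value.bit_length() - 1, 0)
        shift = max(exponent - self.mantissa_bits, 0)
        mantissa = (value >> shift) & ((1 << self.mantissa_bits) - 1)
        if self.top[exponent] is None:    # allocate sub-array on demand
            self.top[exponent] = [0] * (1 << self.mantissa_bits)
        self.top[exponent][mantissa] += 1

    def populated(self) -> int:
        """Number of sub-arrays actually allocated."""
        return sum(1 for sub in self.top if sub is not None)
```

Because real measurements span far fewer than 20 orders of magnitude, only a handful of the 64 sub-arrays ever get allocated.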
## operations on histograms
* given a value, find its rank (or percentile)
* find the value at a given rank (or percentile)
* get the mean and standard deviation of the data recorded in the histogram
* merge two histograms (which may differ in precision)
* dump and load a histogram in text (e.g. csv, xml, json) and/or binary (for efficiency)
* export a histogram to a user-selected collection of buckets (e.g. for prometheus)
I have implementations of the first four.
The rank and percentile queries work on a snapshot of the working histogram, to avoid multithreading races and to make the calculations more efficient.
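For instance, the value-at-rank query reduces to a cumulative walk over a snapshot of the counters (a sketch over a generic `{bucket_lower_bound: count}` snapshot, independent of the bucketing scheme):

```python
def value_at_quantile(snapshot: dict, q: float):
    """Return the lower bound of the bucket containing the q-quantile
    (0 <= q <= 1); accuracy is limited by the histogram's bucket width,
    as described above."""
    total = sum(snapshot.values())
    target = q * total
    running = 0
    for lower in sorted(snapshot):
        running += snapshot[lower]
        if running >= target:
            return lower
    return max(snapshot)  # guard for floating-point slack at q == 1
```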
## exporting data
An important consumer for data recorded in histograms is Prometheus.
The docs <https://prometheus.io/docs/practices/histograms/> say it supports
* a "histogram" type (actually a cumulative frequency digest) where quantiles are calculated on the server
* a "summary" type, where quantiles are calculated on the client and the server aggregates them over a sliding window
Prometheus has its own textual format for exposing / ingesting data,
<https://prometheus.io/docs/instrumenting/exposition_formats/>.
It looks like it would be fairly easy for `hg64` and BIND to support it,
though it isn't clear whether the server is able to re-bucket data that
is exposed with a different bucketing than configured on the server.
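For reference, a minimal rendering of cumulative buckets in the Prometheus text exposition format might look like this (the metric name and bucket bounds are made up for illustration; a real exporter would also emit `_sum` and `HELP`/`TYPE` lines):

```python
def to_prometheus(snapshot: dict, name: str = "bind_query_duration_seconds") -> str:
    """Render {upper_bound_seconds: count} as cumulative Prometheus
    histogram buckets with 'le' labels, plus the +Inf bucket and _count."""
    lines, cumulative = [], 0
    for le in sorted(snapshot):
        cumulative += snapshot[le]
        lines.append(f'{name}_bucket{{le="{le}"}} {cumulative}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {cumulative}')
    lines.append(f'{name}_count {cumulative}')
    return "\n".join(lines)
```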
## elsewhere on gitlab
Related issues: #598, #2101, #3455

Milestone: Not planned
Assignee: Tony Finch

---

# Use liburcu QSBR flavor
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/4102
Author: Ondřej Surý · 2023-07-26
Assignee: Ondřej Surý

The QSBR flavor is faster, but also requires rcu_quiescent_state() to be called periodically from every RCU thread.

Milestone: Not planned

---

# dnssec-verify detect and support multiple cores
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3063
Author: Daniel Stirnimann · 2023-11-02

### Description
We use `dnssec-verify` (from BIND 9.16) to validate large DNSSEC-signed zones. I noticed that on a multi-core processor (e.g. 16 cores), only one CPU is ever used. I guess validation time could be sped up a lot if all available cores were used.
### Request
Make `dnssec-verify` automatically use all available cores for operations where this is possible, e.g. signature verification.
`dnssec-signzone` already automatically detects and uses all available cores and even has an argument switch to specify a specific number (`man dnssec-signzone`). I think something like this would be very useful:
```
-n ncpus
This option specifies the number of threads to use. By default, one thread is started for each detected CPU.
```

Milestone: Not planned

---

# validator repeatedly checks insecurity proofs for insecure domains
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/3465
Author: Petr Špaček <pspacek@isc.org> · 2022-09-06

<!--
If the bug you are reporting is potentially security-related - for example,
if it involves an assertion failure or other crash in `named` that can be
triggered repeatedly - then please do *NOT* report it here, but send an
email to [security-officer@isc.org](security-officer@isc.org).
-->
### Summary
It seems that the DNSSEC validator rechecks insecurity proofs several times while processing an answer from an insecure domain.
### BIND version used
- main: 2c3eea19173bacc496d3b82a94e4b43fd067f2c4
### Steps to reproduce
- Increase logging for validator. I've used this patch: [log.patch](/uploads/e647ce5846c8c9218ecf5ef53cd5fefb/log.patch)
- Query for a subdomain of an insecure domain: `dig a.b.c.d.e.f.google.com`
- Query for another subdomain: `dig 0.b.c.d.e.f.google.com`
- Check logs
### What is the current *bug* behavior?
```
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: starting
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: attempting negative response validation from message
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: validate_neg_rrset: creating validator for google.com SOA
20-Jul-2022 16:20:23.334 validating google.com/SOA: starting
20-Jul-2022 16:20:23.334 validating google.com/SOA: attempting insecurity proof
20-Jul-2022 16:20:23.334 validating google.com/SOA: checking existence of DS at 'com'
20-Jul-2022 16:20:23.334 validating google.com/SOA: checking existence of DS at 'google.com'
20-Jul-2022 16:20:23.334 validating google.com/SOA: marking as answer (proveunsecure (4))
20-Jul-2022 16:20:23.334 validator @0x7f1c4d264400: dns_validator_destroy
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: in validator_callback_nsec
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: resuming validate_nx
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: nonexistence proof(s) not found
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: checking existence of DS at 'com'
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: checking existence of DS at 'google.com'
20-Jul-2022 16:20:23.334 validating _.f.google.com/A: marking as answer (proveunsecure (4))
20-Jul-2022 16:20:23.334 validator @0x7f1c4d263a00: dns_validator_destroy
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: starting
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: attempting negative response validation from message
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: validate_neg_rrset: creating validator for google.com SOA
20-Jul-2022 16:20:23.371 validating google.com/SOA: starting
20-Jul-2022 16:20:23.371 validating google.com/SOA: attempting insecurity proof
20-Jul-2022 16:20:23.371 validating google.com/SOA: checking existence of DS at 'com'
20-Jul-2022 16:20:23.371 validating google.com/SOA: checking existence of DS at 'google.com'
20-Jul-2022 16:20:23.371 validating google.com/SOA: marking as answer (proveunsecure (4))
20-Jul-2022 16:20:23.371 validator @0x7f1c4d264400: dns_validator_destroy
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: in validator_callback_nsec
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: resuming validate_nx
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: nonexistence proof(s) not found
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: checking existence of DS at 'com'
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: checking existence of DS at 'google.com'
20-Jul-2022 16:20:23.371 validating 0.b.c.d.e.f.google.com/A: marking as answer (proveunsecure (4))
20-Jul-2022 16:20:23.371 validator @0x7f1c4d263000: dns_validator_destroy
```
This is probably related to the fact that validation proceeds "down" even if the parent zone is proved to be insecure. This can be seen e.g. on a `dualstack.osff2.map.fastly.net A` query. I would expect it to stop doing things at the `fastly.net DS` level.
### What is the expected *correct* behavior?
I would expect the validation to be cut short at `checking existence of DS at 'com'`. In the log excerpt for google.com subdomains it repeats three times, along with insecurity proofs for domains that should already be marked as insecure in the cache (`b.c.d.e.f.google.com` and everything up to `google.com`).
### Relevant configuration files
Default config.

Milestone: Not planned

---

# Create a BIND benchmarking package for users to download and use in their own environments
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/2549
Author: Cathy Almond · 2022-06-03

This feature issue ticket is to track discussions on a potential downloadable benchmark testing framework for users (performance testing is hard...).

Milestone: Long-term

---

# Resolver testing - building an Internet Simulator device
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/2548
Author: Cathy Almond · 2022-06-03

This ticket is being opened to capture discussions on how to create a device to take the place of 'The Internet' for benchmarking resolver performance more realistically.

Milestone: Long-term

---

# Resolver benchmarking testing - corralling the generic plus the important edge cases to include when testing
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/2547
Author: Cathy Almond · 2022-06-03

Opening this issue to capture discussions on what we need to include in BIND resolver performance benchmark tests, beyond the generic case.

Milestone: Long-term

---

# Resolver issues with refactored dispatch code
Issue: https://gitlab.isc.org/isc-projects/bind9/-/issues/2953
Author: Michał Kępień · 2023-11-02

This issue attempts to describe various issues with resolver behavior
found after merging !4601 (#2401). Most of these issues are
intermittent, so it is important to keep track of them somewhere in
order to not forget that they exist. We should get to the bottom of all
of these issues before we release BIND 9.18.0.
1. [x] **Recursive Perflab tests cause the resolver to stop responding.**
This issue might be the simplest to start with because the behavior
observed seems to be consistent rather than intermittent. Namely,
all Perflab jobs which test a resolver seem to crank out a response
rate of some 70-120 kQPS at the beginning of the test and then...
the resolver stops responding indefinitely. While Perflab was not
designed with recursive tests in mind and therefore we can treat its
recursive results with a grain of salt, it certainly should not be
reporting zeros all over the place.
- https://perflab.isc.org/#/config/run/5bf195dd83ba91a870b2976f/
- https://perflab.isc.org/#/config/run/5cd6a166643076f6c1f6c26f/
- https://perflab.isc.org/#/config/run/5db74b6264458967f762143a/
- https://perflab.isc.org/#/config/run/5db74b7264458967f762143b/
- https://perflab.isc.org/#/config/run/5db74c2764458967f7621440/
- https://perflab.isc.org/#/config/run/5db74c3464458967f7621441/
(Resolved by !5500.)
2. [x] **`respdiff` tests are *sometimes* slow.**
Ever since we merged the dispatch branch, the `respdiff` tests
started failing *intermittently* for `main` (and only `main`)
because of timeouts.
- [job 2016337][1]: pass, ~2m30s per each 10,000 queries
- [job 2016622][2]: pass, ~2m45s per each 10,000 queries
- [job 2017990][3]: pass, ~2m30s per each 10,000 queries
- [job 2020093][4]: fail, 7+ minutes per each 10,000 queries
- [job 2023057][5]: fail, 16+ minutes per each 10,000 queries
- [job 2023490][6]: pass, ~2m40s per each 10,000 queries
I do not think varying CI runner stress can be blamed for this, not
for discrepancies this large. It also never happened before merging
!4601, AFAIK.
3. [x] **A lot of "stress" test graph indicate growing memory use.** #3002
While testing October BIND 9 releases, one of the 1-hour "stress"
tests ran in recursive mode for BIND 9.17.19 yielded a graph which
indicates that memory use growth over time might be an issue.
https://wiki.isc.org/bin/viewfile/QA/BindQaResults_9_11_36?filename=bind-9.17.19-linux-amd64-recursive-1h.png;rev=1
However, that phenomenon was not observable for other OS/arch
combinations this specific code revision was tested with.
It was also not observable on the *same* OS/arch combination for a
very similar code revision (the code differences should not have any
effect on memory use patterns):
https://wiki.isc.org/bin/viewfile/QA/BindQaResults_9_11_36?filename=bind-9.17.19-linux-amd64-recursive-1h.png;rev=2
Pre-release tests run for BIND 9.17.20 confirmed that memory leaks
are a common thing when `named` is used as a recursive resolver.
More details are available in #3002.
The "stress" tests are run on isolated VMs and despite being pretty
synthetic (fixed traffic pattern, everything happens on one machine,
etc.), they have a history of being very stable, so typical issues
like test host load varying over time etc. are not a factor here.
4. [x] **Lame servers with IPv6 unreachable cause hang on shutdown.** #2927
5. [x] **resolver test fails intermittently** #3013
See https://gitlab.isc.org/isc-projects/bind9/-/jobs/2054296
```
I:resolver:query count error: 6 NS records: expected queries 10, actual 11
I:resolver:failed
```
6. [x] **Assertion failed in `dns_resolver_logfetch()`** #2962
7. [x] **Assertion failed in `dns_dispatch_gettcp()`** #2963
8. [x] **Assertion failed in `dns_resolver_destroyfetch()`** #2969
9. [x] **ThreadSanitizer issues with adb** #2978 #2979
10. [x] **fctx_cancelquery() attempts to process a query which has already been freed** #3018
11. [x] **premature TCP connection closure leaks fetch contexts (hang on shutdown)** #3026
12. [ ] **validator loops can cause shutdown hang** #3033
13. [ ] **ADB finds for a broken zone may cause fetch contexts to hang** #3037
14. [ ] **ASAN error in fctx_cancelquery()** #3102
I decided to open a single issue for all of the above problems because I
sense they are somehow related, and I hope that fixing the root cause of
one of them will eliminate the others as well.
[1]: https://gitlab.isc.org/isc-projects/bind9/-/jobs/2016337
[2]: https://gitlab.isc.org/isc-projects/bind9/-/jobs/2016622
[3]: https://gitlab.isc.org/isc-projects/bind9/-/jobs/2017990
[4]: https://gitlab.isc.org/isc-projects/bind9/-/jobs/2020093
[5]: https://gitlab.isc.org/isc-projects/bind9/-/jobs/2023057
[6]: https://gitlab.isc.org/isc-projects/bind9/-/jobs/2023490

*Status: Not planned*

---

## Issue #1831: Feature request: Separate NXDOMAIN cache with its own max-ncache-size
*Cathy Almond · 2024-03-01 · Not planned*

This relates to PRSD DDoS attacks, and the effect on participating resolvers when the domain under onslaught is able to keep responding and does not die or rate-limit the resolvers.
The scenario is one in which a very large number of unique names are being queried, the objective being to bypass cached NXDOMAINs in resolvers and to force every name to become a query to the authoritative servers for the domain (or hosting provider) that is being attacked.
Typically, the target servers will either die, or will commence rate-limiting their perceived attackers. In the case of a resolver, this will result in a large number of recursive queries being backlogged while they wait for the server responses that never arrive.
BIND uses fetch-limits to mitigate the non-responding servers scenario.
But in the situation where the servers never die and never rate-limit, the outcome is rather different. Resolvers that can cope with the increase in traffic (which usually isn't actually that much) instead see a rapid increase in memory consumption (and a decrease in cache hits!) due to the NXDOMAIN responses that are received and then cached (never to be used again).
One mitigation for resolver operators has been to reduce `max-ncache-ttl` to very small values - but the effectiveness of this depends on the structure of the cache nodes and how often opportunistic cache cleaning hits those nodes.
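For context, a sketch of the knobs that exist today versus the proposed one (values illustrative; `max-ncache-size` is the proposal, not an existing option):

```
options {
    max-ncache-ttl 60;       // existing: cap TTLs of negative answers
    max-cache-size 512M;     // existing: one limit shared by positive
                             // and negative cache content
    // max-ncache-size 64M;  // proposed: bound negative answers separately
};
```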
Yes, overmem (LRU-based logic) cache-cleaning will help with this, but for many, it is going to be at the expense of 'positive' cache content, and regular clients will start to suffer with more cache-misses, as well as cache churn increasing as negative and positive cache content keeps being 'swapped'.
Mark suggested keeping negative answers in a separate cache, where they could have their own `max-ncache-size` and churn all by themselves, without affecting the main cache.
This sounds like A Good Idea - but one that we've never quite got around to as part of the ongoing DDoS mitigation work.
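A minimal sketch of the split-cache idea, with hypothetical structure names and FIFO eviction standing in for BIND's real LRU/overmem logic (this is not the actual BIND cache code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define POS_CACHE_MAX 4
#define NEG_CACHE_MAX 2 /* would be configured via the proposed max-ncache-size */

typedef struct {
    char names[8][64];
    int count;
    int max;
    int evictions;
} toy_cache_t;

static void toy_cache_init(toy_cache_t *c, int max) {
    memset(c, 0, sizeof(*c));
    c->max = max;
}

/* FIFO eviction stands in for the real LRU/overmem cleaning. */
static void toy_cache_store(toy_cache_t *c, const char *name) {
    if (c->count == c->max) {
        memmove(&c->names[0], &c->names[1],
                sizeof(c->names[0]) * (size_t)(c->count - 1));
        c->count--;
        c->evictions++;
    }
    snprintf(c->names[c->count++], sizeof(c->names[0]), "%s", name);
}

typedef struct {
    toy_cache_t positive;
    toy_cache_t negative;
} resolver_cache_t;

static void resolver_cache_init(resolver_cache_t *rc) {
    toy_cache_init(&rc->positive, POS_CACHE_MAX);
    toy_cache_init(&rc->negative, NEG_CACHE_MAX);
}

/*
 * The key point: an NXDOMAIN answer only ever displaces other negative
 * entries, so a random-subdomain flood churns the negative cache alone.
 */
static void resolver_cache_store(resolver_cache_t *rc, const char *name,
                                 bool nxdomain) {
    toy_cache_store(nxdomain ? &rc->negative : &rc->positive, name);
}
```

With this split, a flood of unique NXDOMAINs saturates only the negative cache, while positive entries stay put.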
(Also tagging this as 'Customer' since I can find many a customer ticket where customers have been bitten by this when one specific and well-known DNS hosting company has been under attack, and their servers never faltered in sending back NXDOMAIN responses to their 'attackers').

---

## Issue #1278: Improvements to the performance of view matching
*Brian Conry · 2023-11-02 · Not planned*

A customer has requested that some effort be made to improve the performance of matching an incoming request to a view. This does not have to be a perfect general solution.
The specific case that they would be delighted to see as a starting point would be if all of the views have only `match-destinations` ACLs. In a case like this, the multiple ACLs could be combined into something very much like the existing ACL structure, except producing a view selection answer rather than the usual positive-neutral-negative answer.

---

## Issue #4616: Resolver cache redesign
*Petr Špaček · 2024-03-01*

This is a meta issue to collect the current problems & ideas about what to do about them.
Current known problems:
- LRU cleaning can get into a weird state: #2744
- Cache cleaning can block things, and is generally a mess: #3261, #4383
- Negative answers from e.g. a random subdomain attack can push out useful things: #2495, #1831
- ADB vs. cache size is hardcoded and nobody knows if this is optimal or not: #2483, #2405
- Sizing is hard to get right: #614
- Cache is child-centric: #3311
- RRSIGs are not tightly bound to their respective RRs: #3396
- Data structures referenced by RBTDB are a mess: #4356, #3403, #3405

*Assignee: Štěpán Balážik*

---

## Issue #4505: Implement kTLS support in BIND
*Artem Boldariev · 2024-02-14*

Recent versions of Linux and FreeBSD support TLS encryption in the kernel (kTLS). One benefit is that when TLS encryption is performed by the kernel, it can use additional hardware features that are not available in user space, including offloading TLS encryption to NICs that support it (e.g. [NVIDIA Mellanox ConnectX-6 Dx](https://www.nvidia.com/en-us/networking/ethernet/connectx-6-dx/)), almost completely freeing the CPU from this task; even with hardware-accelerated encryption inside the CPU, some cycles are still consumed. Using kTLS may also reduce memory copying in some cases.
Of course, kernel-space encryption is more limited than what OpenSSL and its derivatives provide in user space: the limitations are imposed by hardware - e.g. NICs might not support anything but AES-128 (aka `TLS_AES_128_GCM_SHA256`), as it is the only cipher mandatory for TLS v1.3. If that is good enough for web servers, it should be good enough for DNS, too.
Even when kTLS is used, the handshake itself happens in the user space (e.g. using OpenSSL) with negotiated parameters passed to the kernel using `setsockopt()` calls on a TCP socket descriptor.
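On Linux, that handover is fairly small; a sketch following the kernel TLS documentation (the key/iv/salt/rec_seq arguments are placeholders for material produced by the user-space handshake, e.g. extracted from OpenSSL):

```c
#include <assert.h>
#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef TCP_ULP
#define TCP_ULP 31 /* older userspace headers may lack it */
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/*
 * Hand the transmit direction of an established TCP connection over to
 * kernel TLS; TLS_RX is set up the same way for the receive direction.
 */
static int enable_ktls_tx(int fd, const unsigned char key[16],
                          const unsigned char iv[8],
                          const unsigned char salt[4],
                          const unsigned char rec_seq[8]) {
    struct tls12_crypto_info_aes_gcm_128 ci;

    /* Attach the "tls" upper-layer protocol to the TCP socket. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) != 0) {
        return -1;
    }

    memset(&ci, 0, sizeof(ci));
    ci.info.version = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
    memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

    /* From here on, plain send()/write() on fd is encrypted in-kernel. */
    return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}
```

Note the kernel only accepts the handover on an established connection, which fits the plan above: the handshake completes first, then the socket flips into "kTLS mode".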
OpenSSL has supported kTLS natively since version 3.x (see the `SSL_OP_ENABLE_KTLS` [option](https://www.openssl.org/docs/manmaster/man3/SSL_set_options.html)) but, as far as I understand it, only when OpenSSL itself manages the underlying TCP socket file descriptor - which is not our case, as we use LibUV for that. However, the idea of kTLS is that, once enabled, we pass unencrypted data to `send()` and `recv()`; in other words, a kTLS-enabled socket works (mostly) like a plain TCP socket from the higher-level perspective. So we might try the following approach, which *might* work:
1. We use our existing code (`tlsstream.c`) to handle handshake, just like we do now;
2. After completing the handshake, we pass the negotiated information to the kernel. OpenSSL might have some interfaces for that. In the worst case, we might need to do that by hand using `setsockopt()`;
3. Then, we add new code paths to `tlsstream.c` to bypass TLS connection objects (`isc_tls_t`) and use the underlying TCP connection directly, which, by now, works in "kTLS-mode", providing transparent TLS encryption;
4. Control messages, like TLS shutdown, will require additional care.
That is how I see the initial plan; it might or might not work. There can (and likely will) be unforeseen obstacles that overcomplicate the code base so much as to make this unfeasible to implement, short of adding a kTLS-only transport. Furthermore, it might require some assistance from LibUV. It will take some trial and error.
That is mostly written with Linux in mind. If the kTLS interface in FreeBSD is similar enough (it seems so at first glance), we should support both platforms.
The issue is created mostly to dump this information from my mind and keep kTLS on our radar: we might want to do it, as at least `dnsdist` has experimental support for it. It will only grow in importance, as encrypted DNS transports seem set to keep gaining ground, to the point of replacing the good ol' Do53 at some point.
For sure, it is not 9.20 material - rather 9.21-9.22 if we are lucky, as it is a big feature. Also, I foresee a similar concept eventually appearing for QUIC, too (kQUIC?). Also, I am quite certain that we *will* need #3504 for this (implemented here: !8576).
See also:
1. https://docs.kernel.org/networking/tls.html
2. https://man.freebsd.org/cgi/man.cgi?query=ktls&apropos=0&sektion=0&manpath=FreeBSD+13.0-RELEASE+and+Ports&arch=default&format=html
3. https://delthas.fr/blog/2023/kernel-tls/ - mostly discusses it in the context of HTTP and `sendfile()` acceleration, but contains many references on the topic.
4. https://docs.nvidia.com/networking/display/ofedv512580/kernel+transport+layer+security+(ktls)+offloads

*Milestone: Long-term · Assignee: Artem Boldariev*

---

## Issue #4164: Investigate performance impact of UDP_GRO
*Petr Špaček · 2023-06-27*

### Description
Mention of UDP_GRO in [Linux udp man page](https://man.archlinux.org/man/udp.7) sounds worth investigating:
> #### UDP_GRO (since Linux 5.0)
> Enables UDP receive offload. If enabled, the socket may receive multiple datagrams worth of data as a single large buffer, together with a cmsg(3) that holds the segment size. This option is the inverse of segmentation offload. It reduces receive cost by handling multiple datagrams worth of data as a single large packet in the kernel receive path, even when that exceeds MTU. This option should not be used in code intended to be portable.
More reading:
https://developers.redhat.com/articles/2021/11/05/improve-udp-performance-rhel-85
### Request
- Investigate if UDP receipt is even a bottleneck.
- Investigate if UDP_GRO makes a difference and is worth messing with (I suppose in libuv).
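For reference, the plumbing involved is small; a sketch of enabling GRO and reading the segment-size cmsg (constants guarded in case older userspace headers lack them):

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_GRO
#define UDP_GRO 104 /* from linux/udp.h, kernel >= 5.0 */
#endif

/* Ask the kernel to coalesce incoming datagrams on this UDP socket. */
static int enable_udp_gro(int fd) {
    int on = 1;
    return setsockopt(fd, SOL_UDP, UDP_GRO, &on, sizeof(on));
}

/*
 * After recvmsg(), the segment size of a coalesced buffer arrives as a
 * cmsg; 0 means the kernel delivered a single ordinary datagram.
 */
static int gro_segment_size(struct msghdr *msg) {
    for (struct cmsghdr *c = CMSG_FIRSTHDR(msg); c != NULL;
         c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == SOL_UDP && c->cmsg_type == UDP_GRO) {
            int gso_size;
            memcpy(&gso_size, CMSG_DATA(c), sizeof(gso_size));
            return gso_size;
        }
    }
    return 0;
}
```

The receiving code would then walk the returned buffer in `gso_size`-sized chunks (the last one possibly shorter), which is where libuv integration would come in.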
### Links / references

*Milestone: Long-term*