# BIND issues
Source: https://gitlab.isc.org/isc-projects/bind9/-/issues

## All the things that need to be fixed before 9.20
https://gitlab.isc.org/isc-projects/bind9/-/issues/4371 (Matthijs Mekking, matthijs@isc.org, updated 2024-03-27)

This is an overarching issue for keeping track of all the things that need to be completed before the 9.20.0 release.
### Features
- [ ] #1128 Offline KSK (:gear: @matthijs)
- [x] #1129 HSM support via pkcs11-provider
- [x] #4363 Enforce stricter NSEC3 parameter limits
- [x] #4388 Accepting PROXYv2
- [x] #4241 Expose data about 'first time' zone maintenance in-progress
- [ ] #2099 Implement ZoneMD signature generation and verification. (:gear: !5217 @marka, @each)
### Config incompatibilities
- [x] #4364 named-compilezone defaults
- [x] #4373 safer "dnssec-validation yes"
- [x] #4447 "stale-answer-client-timeout" must be zero (:gear: !8699 @aram)
### Refactoring
- [x] #4411 QPDB lite (:gear: !8726 @matthijs, @each)
- [x] #4251 system test runner
### Bugs
- [x] #4340 "max-cache-size" is a no-op since BIND 9.19.16
- [x] #4213 BIND shutdown hang in checkds/ns9/ in cross-version-config-tests job
- [x] #4060 named doesn't shut down after receiving rndc stop command
- [x] #4211 AssertionError: named crashed, shutdown crash
- [ ] #4403 Resolve spike in memory at start of named (:gear: @ondrej)
- [ ] #4481 TCP issue (:gear: isc-private/bind9!639 @ondrej)
- [ ] #4475 Data races in isc_buffer_peekuint8, rdataset_settrust, and memmove (:gear: !8645 @marka)
- [x] #4625 DNSSEC validation incompatibility
- [ ] #4652 Server crash caused by external UDP queries

Milestone: BIND 9.19.x (due 2024-05-02)

## Enable logging of rpz re-writes to dnstap
https://gitlab.isc.org/isc-projects/bind9/-/issues/2340 (Peter Davies, updated 2024-03-27)

### Description
Enable logging of rpz re-writes to dnstap.
The ability to send rpz rewrite information that is generated by category rpz to the dnstap output stream.
[RT #17273](https://support.isc.org/Ticket/Display.html?id=17273)

Status: Not planned. Assignee: Evan Hunt.

## When is BIND ready?
https://gitlab.isc.org/isc-projects/bind9/-/issues/3081 (Greg Choules, updated 2024-03-27)

Related to Support ticket [19717](https://support.isc.org/Ticket/Display.html?id=19717)
The purpose of this issue is to make BIND more verbose and precise about reporting various stages of readiness when starting up, leading to a definitive "I'm ready now" log message.
The question an operator will want an answer to is, when can I send queries to this server again?
Different features will all have their own completeness check. For example: RPZ, local zones, remote zones, mirror zones, CATZ. The request is for new log messages to allow operators to track progress of each of these features and a new (or redefined) final log message when all tasks are complete.
What is a task? When is it complete and when is BIND ready to do that thing?
We and the customer have, in parallel, come up with similar thinking on what needs to be done. The principle is: at startup time, create a one-time todo list from the zone configuration statements. As each list item is completed, generate a signal and remove it from the list. When all items are completed, generate a final completion signal and set the state of an indicator that can be queried via rndc, so that users can periodically test the current complete/not-complete state.
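The startup todo list described above can be sketched as follows. This is a minimal illustration, not BIND code; the class and method names are invented for this sketch:

```python
# Minimal sketch (names are illustrative, not BIND's) of the proposed
# startup todo list: tasks are registered from the zone configuration,
# each completion emits a signal (here, a log line), and a flag that
# rndc could query flips once the list drains.
class ReadinessTracker:
    def __init__(self, tasks):
        self.pending = set(tasks)
        self.log = []  # stands in for named's log stream / signals

    def complete(self, task, ready=True):
        # a task leaves the list whether it ended 'ready' or 'not ready'
        self.pending.discard(task)
        self.log.append(f"{task}: {'ready' if ready else 'not ready'}")
        if not self.pending:
            self.log.append("all startup tasks complete")

    def is_ready(self):
        # the complete/not-complete state an operator could poll via rndc
        return not self.pending
```

For example, a secondary zone that hits its transfer retry limit would call `complete(task, ready=False)`: it is removed from the list with a 'not ready' signal, so the server can still reach the overall completion state.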
Taking some different types of zones as examples, we would expect behaviour like this:
Primary zones:
- Read zone data from local storage. Once this has been read into memory the zone is 'ready', a signal is generated and no further readiness checks need to be made: this task is complete.
Secondary zones:
- If a zone has been configured with a file, read zone data from local storage. Once this has been read into memory the zone is 'ready', a signal is generated and no further readiness checks need to be made: this task is complete. NOTE: checking whether the zone is up to date (SOA queries and possible subsequent zone transfer) is specifically excluded from this task.
- If a zone has **not** been configured with a file, make SOA queries and attempt zone transfers as necessary in order to load the zone. If zone transfer succeeds and zone data is loaded into memory the zone is 'ready', a signal is generated and no further readiness checks need to be made: this task is complete. If zone transfer fails there needs to be a limit - number of tries without success - to how long this task remains on the todo list. In this case generate a 'not ready' signal and remove the task from the list.
Catalog zones:
- These can be treated similarly to Primary or Secondary zones for the catalog itself. Once the catalog is loaded generate a ready signal and remove it from the todo list.
- However, during processing of each catalog a further list of (member) zones will be generated, each of which need to be added to the todo list and treated as a Secondary zone with no previous local data storage - i.e. needing to be transferred from a primary server.
Response Policy Zones:
- These can be treated similarly to Primary or Secondary zones for the zone data itself, but with the (possible?) additional step of needing to build the policy once it has been loaded. An RPZ should be considered ready only when the policy is active and responses would be re-written.
Mirror zones:
- These are similar to secondary zones.
Anything else?

## Resolver cache redesign
https://gitlab.isc.org/isc-projects/bind9/-/issues/4616 (Petr Špaček, pspacek@isc.org, updated 2024-03-01)

This is a meta issue to collect the current problems and ideas about what to do about them.
Current known problems:
- LRU cleaning can get into a weird state: #2744
- Cache cleaning can block things, and is generally a mess: #3261, #4383
- Negative answers from e.g. a random subdomain attack can push out useful things: #2495, #1831
- ADB vs. cache size is hardcoded and nobody knows if this is optimal or not: #2483, #2405
- Sizing is hard to get right: #614
- Cache is child-centric: #3311
- RRSIGs are not tightly bound to their respective RRs: #3396
- Data structures referenced by RBTDB are a mess: #4356, #3403, #3405

Assignee: Štěpán Balážik.

## Consider parent-centric delegations
https://gitlab.isc.org/isc-projects/bind9/-/issues/3311 (Ondřej Surý, updated 2024-03-01)

This is an umbrella issue to discuss parent- vs. child-centric delegations.
## Child-centric NS
The child-centric NS way lets the child NS records override the delegation NS, but the parent NS has to be used at least once. This works fine as long as the parent and child NS records are in sync. When they are not in sync (both inter and intra), the used delegation NS can vary between runs based on what's in the cache.
## Parent-centric NS
The parent-centric NS way always uses the parent NS records for delegations, but requires a separate "delegation" database that's distinct from the resource-record cache. The parent-centric NS doesn't suffer from the problems that could happen when the child-NS and parent-NS are out of sync - there's only one "authority" for the delegation NS (parent).
This approach is not without problems: because of the way DNS is (under-)specified, child-centric NS handling has been used for a long time, and changing BIND 9 to use the parent NS will break some users' expectations. Fortunately for us, this path has already been paved by (at least) Nominum Vantio and Google Public DNS (and apparently the world didn't collapse).
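The structural difference described above can be sketched as follows. This is a toy illustration (not BIND code): the parent-centric approach keeps delegation NS records in a database separate from the ordinary RR cache, so answers from the child zone can never override the data used for iteration:

```python
# Toy sketch of parent-centric delegation handling: NS records learned
# from referrals (the parent side) live in a dedicated delegation DB,
# while authoritative answers - including the child's own NS RRset -
# go to the ordinary cache and are only used for client responses.
class ParentCentricResolver:
    def __init__(self):
        self.delegation_db = {}  # zone name -> NS names seen in the referral
        self.rr_cache = {}       # (name, type) -> rdata from answers

    def learn_referral(self, zone, ns_names):
        # delegation data from the parent; the single "authority" for it
        self.delegation_db[zone] = list(ns_names)

    def learn_answer(self, name, rtype, rdata):
        # answers (including child NS) are cached for clients...
        self.rr_cache[(name, rtype)] = rdata

    def delegation_for(self, zone):
        # ...but iteration always consults the parent-side data
        return self.delegation_db.get(zone)
```

In the child-centric model, `learn_answer` for the NS type would also overwrite what `delegation_for` returns, which is exactly the out-of-sync variability described above.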
## To be considered
- [ ] DS vs apex-CNAME
- [ ] parent vs child NSEC RRsets
- [ ] glue records from the parent pointing into the child zone
- [ ] Debug/query options
(add more as stuff comes up in the discussion)

Status: Not planned.

## [ISC-support #17264] ADB overmem condition and cleaning - very difficult to detect and causes erratic behavior
https://gitlab.isc.org/isc-projects/bind9/-/issues/2405 (Brian Conry, updated 2024-03-01)

The ADB's mctx size is set to 1/8 of the max-cache-size, if set. This is the only means to control the ADB memory limit. There is also a hard-coded maximum ADB size applied to ADBs for views that share a cache.
When it goes overmem, the ADB starts removing names and entries. The strategy for removing entries doesn't seem to be tied strongly to utility.
This can lead to erratic behavior as BIND is constantly forgetting information about server SRTTs, EDNS capabilities, and other useful data.
In some cases, if not all of the entries for servers associated with a zone are affected by the overmem purge, this can cause the resolver to fixate on a small subset of the servers authoritative for the zone - and not necessarily the subset with the best SRTT.
There is no logging at any level related to ADB overmem activities, nor are there any stats directly related to ADB memory usage.
There are stats for counts of names and entries, along with the number of buckets for each type, but there's no reliable way to map those to memory usage.
The stats channel does contain detail for the ADB memory contexts, but there's no reliable way to map those memory contexts to a particular view.
It seems likely that most of the time the symptoms of an overmem ADB will be minor and nearly impossible to directly measure - small delays and increases in CPU usage associated with the repeated creation and destruction of ADB entries and/or fixation on suboptimal upstream servers - but will definitely degrade the quality of service that the resolver is providing.
This behavior was noticed by a customer when their monitoring zone happened to, by chance, be negatively affected.
Most of the symptoms described here are theoretical, based on my understanding of the code and various customer-described, but unreproducible and otherwise unexplained, behaviors.
This issue is a feature request covering:
* specific testing by ISC to better understand the impact and range of behaviors when the ADB is overmem #2441
* additional logging related to ADB overmem activities #2435
* additional stats/metrics relating to ADB overmem #2436
* improvements to ADB overmem behavior (ideally based on some utility metric) #2437
* ability to directly control ADB size independent of cache size #2438
* revisit the hard-coded shared-cache maximum ADB size (e.g. remove in favor of configuration) #2439
* system tests related to any/all items above

Status: Not planned.

## Mitigation for cache bloat due to negative RRsets (NXDOMAIN and NXRRSET) when preserving expired RRsets using max-stale-ttl
https://gitlab.isc.org/isc-projects/bind9/-/issues/2495 (Cathy Almond, updated 2024-03-01)

I'm tagging this as 'Customer' because it is affecting/has affected several customer caches (unanticipated cache bloat) with 'stale-cache-enable yes;'
----
By design, negative cache content is not used in the same way as positive cache content. 'stale-answer-client-timeout' does not apply to negative content, thus the clients will time out and likely retry the same query several times before giving up, long before the resolver-query-timeout. Only when named has failed to get a response from the authoritative server(s) does the 'stale-refresh-time' window commence.
There is a good reason for this - when querying the authoritative servers for a name that is already NXDOMAIN or NXRRSET in cache, we don't know if this is going to be replaced with positive content or not. Therefore, to ensure that positive content is preferred, we have to wait to be sure that it's sensible to serve the stale negative content as a last resort.
UNFORTUNATELY:
1. Most negative content is single-use - so adding max-stale-ttl to its retention period can significantly increase cache bloat - for no useful reason whatsoever.
2. Most negative content has (also by design of sane zone administrators, or overridden by max-ncache-ttl) a much shorter TTL in cache than positive content.
So when we're not preserving stale content, the negative content is fairly quickly removed from cache, in comparison with the positive RRsets. But adding max-stale-ttl to the retention period can quite significantly tilt the balance in favour of all of this one-use-only cached negative content.
We don't have any statistics that measure cache hits for different RRtypes, so we don't know for sure what percentage of negative content is used again (cache hits) versus just sitting there for no good reason (cache misses) - perhaps we should? (Perhaps we should also have stats on stale hits and misses by RRtype?)
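A hypothetical shape for the kind of knob this issue argues for is shown below. Note that `max-stale-negative-ttl` is an invented name used purely for illustration; it is not an existing named.conf option, while `stale-cache-enable` and `max-stale-ttl` are:

```
options {
    stale-cache-enable yes;
    max-stale-ttl 86400;        // existing option: how long stale RRsets are retained
    // hypothetical, not implemented: cap (or, with 0, disable) retention
    // of stale negative content independently of max-stale-ttl
    max-stale-negative-ttl 0;
};
```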
But in any case, having seen too many caches overloaded with stale negative content, I should like to propose an option that can be used to shorten the max-stale-ttl for negative content, and that also takes a value of zero, to disable its retention entirely.

## The XFR unreachable cache redesign
https://gitlab.isc.org/isc-projects/bind9/-/issues/3992 (Ondřej Surý, updated 2024-03-01)

The unreachable cache for **dead primaries** was added to BIND 9 in 2006 via 1372e172d0e0b08996376b782a9041d1e3542489. It features a 10-slot LRU array with a fixed 600-second (10-minute) delay. During this time, any primary with a hiccup would be blocked for the whole block duration (unless overwritten by a different dead primary).
One can argue:
- 10 minutes is too long for a fixed, non-configurable delay
- 10 slots are not enough - servers could be running 1M and more zones with different primaries; and especially in situations like these, there's a high chance that more primaries would be having problems
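The current mechanism can be sketched as follows. The constants come from the description above; the class itself is an illustrative stand-in, not BIND's actual implementation:

```python
import time

# Sketch of the behavior described above: a 10-slot cache of unreachable
# primaries, each blocked for a fixed hold time after a failure. When the
# cache is full, an existing slot is reused (stand-in for the LRU reuse).
UNREACH_CACHE_SIZE = 10
UNREACH_HOLD_TIME = 600  # seconds; the issue suggests dropping this to 10-60

class UnreachableCache:
    def __init__(self, now=time.monotonic):
        self.now = now
        self.slots = {}  # primary address -> block expiry time

    def add(self, primary):
        if primary not in self.slots and len(self.slots) >= UNREACH_CACHE_SIZE:
            # evict the entry expiring soonest to make room
            self.slots.pop(min(self.slots, key=self.slots.get))
        self.slots[primary] = self.now() + UNREACH_HOLD_TIME

    def is_blocked(self, primary):
        expiry = self.slots.get(primary)
        return expiry is not None and expiry > self.now()
```

With only 10 slots, a burst of failures across many distinct primaries evicts entries quickly, while a single hiccup blocks a primary for the full hold time - both problems noted above.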
I think this needs a redesign, but meanwhile I think we can drop the `UNREACH_HOLD_TIME` to something like 10 seconds (or 60?). This should still prevent a thundering herd hitting the unresponsive server, but the recovery is going to be much faster.

Status: Not planned.

## Feature request - client.bind chaos class queries
https://gitlab.isc.org/isc-projects/bind9/-/issues/4426 (Ray Bellis, updated 2023-11-20)

Status: Not planned.

## Review BIND Performance suggestions KB
https://gitlab.isc.org/isc-projects/bind9/-/issues/2032 (Vicky Risk, vicky@isc.org, updated 2023-11-02)

Draft is in document360.
Preview is at https://kb.isc.org/preview/v1/dbe412aa-9e0c-4071-ab12-90bfd02b877f/1
What we need is not so much Editing as Improvement:
- this is (sadly) not going to be much help to the more sophisticated users, because most of the advice boils down to, you have to test on your own platform, with your own traffic, so
- given this advice has to be tailored more for people with less background in performance tuning, we should provide some sample cli or log messages to look for diagnosing whether the condition is present (e.g. low memory, buffer overflow, problems with fragmented packets...)
If we can provide any better advice on how to best measure performance on a production system (in this case on a resolver), IMHO that will be useful to a lot of people. I am sort of assuming most people are using something like Prometheus/Grafana today and looking at those charts.

Status: Not planned.

## dig (and other tools) may send queries with QID=0, which confuses Net::DNS
https://gitlab.isc.org/isc-projects/bind9/-/issues/4199 (Michał Kępień, updated 2023-11-02)

Unless specified manually using `+qid=<value>`, `dig` uses a random
query ID for the DNS messages it sends out:
https://gitlab.isc.org/isc-projects/bind9/-/blob/bf8acd455693edef03881fd2180c5561bc0db66d/bin/dig/dighost.c#L2334
In particular, the value chosen can be 0. While QID=0 is perfectly
legal protocol-wise, it seems that some code bases, e.g. Net::DNS, are
unable to properly handle queries with QID=0. Here is an example:
https://gitlab.isc.org/isc-private/bind9/-/jobs/3509123
```
2023-07-06 14:14:45 INFO:serve-stale I:serve-stale_tmp_iwl06k82:disable responses from authoritative server (89)
2023-07-06 14:14:57 INFO:serve-stale I:serve-stale_tmp_iwl06k82:failed
```
`bin/tests/system/serve-stale_tmp_iwl06k82/dig.out.test89`:
```
;; Warning: ID mismatch: expected ID 0, got 46879
;; communications error to 10.53.0.2#19223: timed out
; <<>> DiG 9.19.15 <<>> +time +tries -p 19223 @10.53.0.2 txt disable
; (1 server found)
;; global options: +cmd
;; no servers could be reached
```
This looked weird to me, so I started `ans2/ans2.pl` manually and sent a
query to it using `dig @10.53.0.2 -p 5300 disable. TXT +qid=0 +tries=1`.
Guess what:
```
;; Warning: ID mismatch: expected ID 0, got 27885
;; communications error to 10.53.0.2#5300: timed out
; <<>> DiG 9.19.15 <<>> @10.53.0.2 -p 5300 disable. TXT +qid=0 +tries=1
; (1 server found)
;; global options: +cmd
;; no servers could be reached
```
Looking at [Net::DNS sources][1], the documentation says:
```
=head2 id
print "query id = ", $packet->header->id, "\n";
$packet->header->id(1234);
Gets or sets the query identification number.
A random value is assigned if the argument value is undefined.
```
However, the above seems to be imprecise: apparently if the ID is
*defined*, but *set to 0*, Net::DNS treats it as an undefined value.
This causes the `$packet->header->id` call to return a random value
instead of 0 for queries with QID=0, breaking responses to such queries.
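The pitfall can be shown in Python terms (the actual bug is in Net::DNS's Perl accessor, but the pattern is the same in any language): testing "is the ID set?" with a truthiness check conflates 0 with "unset", so QID=0 gets silently replaced:

```python
import random

# Sketch of the falsy-zero pitfall described above. buggy_id mirrors
# "assign a random value if the argument is undefined" implemented with
# a truthiness test; fixed_id uses an explicit None check, which keeps 0
# as a perfectly legal query ID.
def buggy_id(stored_id):
    # 0 is falsy, so it is wrongly treated as "unset"
    return stored_id if stored_id else random.randrange(1, 65536)

def fixed_id(stored_id):
    # only a truly missing value triggers random assignment
    return stored_id if stored_id is not None else random.randrange(1, 65536)
```

This is why `ans2.pl` answers a QID=0 query with a random ID, producing the `ID mismatch: expected ID 0` warning above.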
I don't see any reasonable way to work around this problem in our Perl
code (apart from converting it to Python). Adding `+qid` to every `dig`
invocation in the system test suite also seems over the top for working
around something this silly. However, until we do something about this,
we might be seeing a whole class of surprising failures in the system
test suite caused by this behavior.
[1]: https://www.net-dns.org/svn/net-dns/trunk/lib/Net/DNS/Header.pm

Status: Not planned.

## Post load checking of missing delegations
https://gitlab.isc.org/isc-projects/bind9/-/issues/3050 (Mark Andrews, updated 2023-11-02)

Is it worthwhile to perform a post-load DS lookup for each primary/secondary zone against the other loaded zones, looking for an NXDOMAIN response that would indicate a missing delegation? This would catch cases like bhutan.gov.bt, where both it and the parent zone are served by the same servers but there isn't a delegation for bhutan.gov.bt in the gov.bt zone.

Status: Not planned.

## Add --disable-doh to a CI build?
https://gitlab.isc.org/isc-projects/bind9/-/issues/2879 (Mark Andrews, updated 2023-11-02)

The following discussion from !5353 should be addressed:
- [ ] @marka started a [discussion](https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/5353#note_231504): (+1 comment)
> One remaining question is "do we add yet another system with --disable-doh to CI?"
- [ ] Also, should we have a CI build that does not have libnghttp2 installed?

Status: Not planned.

## Adjust default tcp-clients value upward
https://gitlab.isc.org/isc-projects/bind9/-/issues/3958 (Vicky Risk, vicky@isc.org, updated 2023-09-12)

The default for tcp-clients is set at 150. As more users are now supporting encrypted DNS, whose sessions use TCP, it is likely that the percentage of overall DNS sessions using TCP will increase, and the current default quota will be too low for many users.
Although it is impossible to determine the ideal setting for all users, it seems likely that users who need to limit TCP sessions can support at least an order of magnitude more sessions, like maybe 2,000.
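For operators hitting the limit today, the quota can already be raised per server with the existing `tcp-clients` option; the value below is simply the order-of-magnitude figure floated above, not a tested recommendation:

```
options {
    tcp-clients 2000;   // default is 150; raised for encrypted-DNS (DoT/DoH) load
};
```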
If we are very worried about impacting small-system users of BIND, perhaps we could just change the setting for BIND -S, which is not available to hobbyists?

## DNS protocol cleanup: require correct AA bit
https://gitlab.isc.org/isc-projects/bind9/-/issues/2485 (Petr Špaček, pspacek@isc.org, updated 2023-08-16)

### Description
Allegedly different resolvers treat AA bit in responses differently, and this is causing different operational problems for each implementation. PowerDNS and Knot Resolver have had issues with that.
The proposal by Peter van Dijk is to be strict on the AA bit and punish non-compliance. The main motivation seems to be code simplification when it comes to various combinations of NXDOMAIN/NOERROR without a SOA RR and/or "extra" NS records in the authority section, which are sometimes added for "good measure" but do not actually mean a referral.
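The check itself is cheap: AA is bit 0x0400 of the flags word in the DNS header (RFC 1035, section 4.1.1). The following is an illustration of what a strict resolver would test on the wire format, not BIND's actual validation code:

```python
import struct

# AA is bit 0x0400 of the DNS header flags word (RFC 1035 section 4.1.1);
# QR (0x8000) marks a response. The header is: ID (2 bytes), flags
# (2 bytes), then the QD/AN/NS/AR counts (2 bytes each).
AA = 0x0400
QR = 0x8000

def is_authoritative(wire: bytes) -> bool:
    (flags,) = struct.unpack_from("!H", wire, 2)
    return bool(flags & AA)

# hand-built 12-byte headers (no question/answer sections), for illustration:
aa_response = struct.pack("!HHHHHH", 0x1234, QR | AA, 0, 0, 0, 0)
non_aa_response = struct.pack("!HHHHHH", 0x1234, QR, 0, 0, 0, 0)
```

The hard part, as the anecdotes below show, is not detecting a missing AA bit but deciding what to do about the servers that omit it.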
Anecdotes from the field:
a) Ralf Weber from Akamai has some reservations:
> Given that a lot of people use resolvers in front of their authoritative servers who don't send AA I fail to envision what resolvers should do. If we drop non AA answers I expect huge portion of the Internet to go dark, though I don't have hard numbers on that.
b) Recent versions of PowerDNS switched to stricter mode and insist on AA bit being correct. A person from Deutsche Telecom claims this:
> To give a sense of possible impact, we have tens of millions of subscribers and only 5-10 cases per year estimated. So I guess nothing would "go dark" :slightly_smiling_face:
### Links / references
Thread https://chat.dns-oarc.net/community/pl/57pcpenfkf86tr8onmhn1q5a4a
Personally I argue this is
a) not significant enough
b) not widespread enough
to warrant a full-fledged flag day, but we can start being stricter on the AA bit if we decide to do so. PowerDNS already went in that direction, so the first-mover disadvantage is already paid :-)

Status: Not planned.

## Fix stub zone operation when talking to minimal authoritative servers
https://gitlab.isc.org/isc-projects/bind9/-/issues/3679 (Greg Choules, updated 2023-04-17)

https://gitlab.isc.org/isc-projects/bind9/-/issues/1736 was created to fix the issue that when the authoritative server to which a stub zone is pointing is configured with "minimal-responses yes;" and the NS are in-zone, operation of the stub zone would fail because it did not receive address records for the zone's NS in the Additional section of the NS response.
I believe that the fix was applied at the wrong end of the conversation, by making the authoritative server's response to NS queries return Additional data even if "minimal-responses" is set to "yes".
A BIND recursive server (with stub zones configured) may be talking to a non-BIND authoritative server with a more strict minimal response configuration, in which case the problem would still exist.
I think the correct fix would be to modify the stub zone code so that, if it needs address records for the given NS, it queries for them. This should work without any extra user configuration because stub zones **must** be defined with at least one primary (a bit like root hints), which is used to send the initial SOA and NS queries. The same address(es) can be used for the address queries for the NS records, given that they are in the zone itself.

## Allow for scripts / hooks for key rollovers
https://gitlab.isc.org/isc-projects/bind9/-/issues/4010 (Karol Babioch, updated 2023-04-11)

### Description
It seems that there is currently no good way to automate a KSK rollover, since the corresponding DS record has to be published in the parent zone. While there is [RFC7344](https://datatracker.ietf.org/doc/html/rfc7344), in reality it is not widely adopted. Personally I don't know of any registrar who supports it yet. Besides, it would require TSIG to be secure anyway.
One of my registrars offers an HTTPS-based API to manage DNSSEC records. Hence, it's possible to write scripts that will automate the key rollover process.
### Request
There should be a way to trigger a script (with some inputs such as the key id, the DS record, etc.) whenever BIND is about to rotate a key. This way it should be possible to use `dnssec-policy` and fully automate the key rollover process, including the `KSK` key (rather than only the `ZSK` key).
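One concrete value such a hook could be handed (or compute itself from the DNSKEY) is the DS digest. The sketch below follows RFC 4034 section 5.1.4 with the SHA-256 digest type from RFC 4509 (digest = hash(canonical owner name | DNSKEY RDATA)); the zone name and key bytes used in testing are illustrative only:

```python
import hashlib

# Compute the DS record digest for a DNSKEY (SHA-256, digest type 2).
# Per RFC 4034 section 5.1.4, the digest covers the owner name in
# canonical (lowercase) wire format followed by the DNSKEY RDATA.
def ds_sha256_digest(owner: str, dnskey_rdata: bytes) -> str:
    # canonical wire format: length-prefixed lowercase labels, zero-terminated
    wire_name = b"".join(
        bytes([len(label)]) + label.lower().encode("ascii")
        for label in owner.rstrip(".").split(".")
    ) + b"\x00"
    return hashlib.sha256(wire_name + dnskey_rdata).hexdigest().upper()
```

A hook mechanism would presumably pass the key tag, algorithm, and this digest to the operator's script, which could then push it to a registrar API such as the one mentioned above.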
### Links / references

Status: Not planned.

## Add the ability to specify TLS configuration at the zone level for catalog zones
https://gitlab.isc.org/isc-projects/bind9/-/issues/3687 (Mark Andrews, updated 2023-03-29)

Currently the only way to specify which TLS configuration to use with catalog member zones is to inherit it from the default-primaries settings.
One possible mechanism would be to support multiple fields in the TXT record that currently specifies the TSIG key with "" indicating that field is empty.
e.g.
- TXT keyname TLS-configuration
- TXT keyname
- TXT "" TLS-configuration
- TXT "" ""
- TXT keyname ""
Deploying such a change would require the servers involved to be upgraded prior to the use of the new record format.

Status: Not planned.

## catalog zone grammar does not enforce default-primaries key / should we support primary zones in catalog?
https://gitlab.isc.org/isc-projects/bind9/-/issues/3779 (Petr Špaček, pspacek@isc.org, updated 2023-03-16)

### Summary
The catalog zones grammar in named.conf does not enforce/require the `default-primaries` key. This can be either a bug or an opportunity to extend the feature in a meaningful way.
### BIND version used
* ~"Affects v9.16": d14a22b3d9fa8e8bb21dfe3bb0bca216a5b93910
* ~"Affects v9.18": f5e7192691568d4c089fbdd4ed4e93c7af785bae
* ~"Affects v9.19": 0e489b9ed4ba7821c50038dade014bf2b706bd12
### Steps to reproduce
1. Define catalog zone **without** `default-primaries` key. E.g.
```
catalog-zones {
    zone "catalog.invalid"
        //default-masters { 127.0.0.2; }
        in-memory no
        zone-directory "catzones"
        min-update-interval 1;
};
```
2a. Start **with** matching files on disk
2b. Start **without** matching files on disk
### What is the current *bug* behavior?
The config is accepted by parser but causes surprising behavior later on.
Variant 2A:
The zone is on disk under correct name, and it loads just fine when the file is available in `catzones` directory. `rndc zonestatus` then reports:
```
name: .
type: secondary
files: catzones/__catz___default_catalog.invalid_..db
serial: 2023010600
nodes: 8438
last loaded: Fri, 06 Jan 2023 16:03:08 GMT
next refresh: Fri, 06 Jan 2023 16:12:19 GMT
expires: Fri, 13 Jan 2023 16:03:08 GMT
secure: yes
inline signing: no
key maintenance: none
dynamic: no
reconfigurable via modzone: yes
```
The next time the refresh timer hits, it errors out with
```
zone ./IN: cannot refresh: no primaries
```
but continues serving the zone until it expires. It kind of works, but not really, because it can never refresh and is bound to expire eventually.
Variant 2B:
File is not on disk. It fails to load as expected, and logs
```
zone ./IN: cannot refresh: no primaries
```
immediately.
### Possible fixes
I can see two options:
a) Require `default-primaries` and error out if it is not present. That would be the same as for regular secondary zones, I believe.
b) Make this behavior "supported", probably by switching zone type to "primary" in case there is no `default-primaries` defined for the respective catalog. (In that case `in-memory` must be configured as `no`.)
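Under option a), the reproduction config from the summary would only be accepted once it names primaries, e.g. (a sketch reusing the same illustrative names; `default-primaries` is the current spelling of `default-masters`):

```
catalog-zones {
    zone "catalog.invalid"
        default-primaries { 127.0.0.2; }
        in-memory no
        zone-directory "catzones"
        min-update-interval 1;
};
```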
Personally I think it makes sense to do b) because it eliminates the need to have two different per-zone config management procedures for primaries.
I mean, with the "strict" variant, adding a new primary zone always requires `rndc addzone` plus a catalog zone modification on the primary side.
With the less strict variant, `rndc addzone` is not necessary and the whole state is in the catalog zone, which has to be maintained for secondaries anyway.

Status: Not planned.

## Investigate the hot-keys problem
https://gitlab.isc.org/isc-projects/bind9/-/issues/3864 (Ondřej Surý, updated 2023-02-14)

### Hot Keys
Sharded data structures are somewhat vulnerable to hot keys, i.e. a single key that is frequently operated on. In this case there will be high contention on the shard to which this key belongs. This can be mitigated by introducing a non-deterministic sharding function that places hot keys in more than one shard. This solution does complicate things, though, and introduces a probabilistic component to the data structure (e.g. lookups may result in the key not being found when in fact it is actually in another shard; this trade-off is generally acceptable in cache scenarios, where failed lookups just result in the key being re-populated).
Source: http://quinnftw.com/sharding-to-reduce-mutex-contention/
(This seems like something that might be happening in a tree structure like the DNS itself, for the portions of the hierarchy close to the "trunk".)
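The mitigation described above can be sketched as follows. This is an illustration of the technique from the linked article, not BIND code; the constants are arbitrary:

```python
import random

# Non-deterministic sharding sketch: a key flagged as hot is placed in
# one of several candidate shards at random, spreading write contention.
# A lookup may then miss (the key lives in a sibling shard) and simply
# repopulate the entry - the trade-off noted above, acceptable for caches.
NUM_SHARDS = 8
HOT_REPLICAS = 4  # a hot key may live in any of this many shards

def shard_for(key, hot=False):
    base = hash(key) % NUM_SHARDS
    if not hot:
        return base  # cold keys: deterministic shard, normal behavior
    # hot keys: pick one of HOT_REPLICAS consecutive shards at random
    return (base + random.randrange(HOT_REPLICAS)) % NUM_SHARDS
```

For a DNS cache, the "hot" keys would be names near the trunk of the hierarchy (the root, popular TLDs), which every resolution path touches.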