ISC Open Source Projects issues
https://gitlab.isc.org/groups/isc-projects/-/issues
Feed updated: 2024-03-01

https://gitlab.isc.org/isc-projects/bind9/-/issues/2405
[ISC-support #17264] ADB overmem condition and cleaning - very difficult to detect and causes erratic behavior
Brian Conry, 2024-03-01

The ADB's mctx size is set to 1/8 of the max-cache-size, if set. This is the only means to control the ADB memory limit. There is also a hard-coded maximum ADB size applied to ADBs for views that share a cache.
When it goes overmem, the ADB starts removing names and entries. The strategy for removing entries doesn't seem to be tied strongly to utility.
This can lead to erratic behavior as BIND is constantly forgetting information about server SRTTs, EDNS capabilities, and other useful data.
In some cases, if not all of the entries for servers associated with a zone are affected by the overmem purge, this can cause the resolver to fixate on a small subset of the servers authoritative for the zone - and not necessarily the subset with the best SRTT.
There is no logging at any level related to ADB overmem activities, nor are there any stats directly related to ADB memory usage.
There are stats for counts of names and entries, along with the number of buckets for each type, but there's no reliable way to map those to memory usage.
The stats channel does contain detail for the ADB memory contexts, but there's no reliable way to map those memory contexts to a particular view.
It seems likely that most of the time the symptoms of an overmem ADB will be minor and nearly impossible to directly measure - small delays and increases in CPU usage associated with the repeated creation and destruction of ADB entries and/or fixation on suboptimal upstream servers - but will definitely degrade the quality of service that the resolver is providing.
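As a sketch of what eviction "tied to utility" might look like (none of this is BIND's actual data model - `AdbEntry` and the scoring are hypothetical), one could score entries by how expensive their state is to re-learn and evict the lowest-scoring ones first:

```python
import heapq
import time

class AdbEntry:
    """Hypothetical stand-in for an ADB entry; the real structures differ."""
    def __init__(self, name):
        self.name = name
        self.srtt = None           # smoothed RTT in ms; None if never measured
        self.edns_ok = False       # learned EDNS capability
        self.last_used = time.monotonic()

def utility(entry, now):
    # Entries carrying hard-won knowledge (a measured SRTT, a learned EDNS
    # capability) are expensive to re-learn, so score them higher; recently
    # used entries are also more likely to be needed again.
    score = 2.0 if entry.srtt is not None else 0.0
    score += 1.0 if entry.edns_ok else 0.0
    score += 1.0 / (1.0 + now - entry.last_used)   # decays as the entry idles
    return score

def purge_overmem(entries, n_to_evict):
    """Evict the n lowest-utility entries instead of arbitrary ones."""
    now = time.monotonic()
    return heapq.nsmallest(n_to_evict, entries, key=lambda e: utility(e, now))
```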
This behavior was noticed by a customer when their monitoring zone happened to, by chance, be negatively affected.
Most of the symptoms described here are theoretical, based on my understanding of the code and on various customer-described, but unreproducible and otherwise unexplained, behaviors.
This issue is a feature request covering:
* specific testing by ISC to better understand the impact and range of behaviors when the ADB is overmem #2441
* additional logging related to ADB overmem activities #2435
* additional stats/metrics relating to ADB overmem #2436
* improvements to ADB overmem behavior (ideally based on some utility metric) #2437
* ability to directly control ADB size independent of cache size #2438
* revisit the hard-coded shared-cache maximum ADB size (e.g. remove in favor of configuration) #2439
* system tests related to any/all items above

Milestone: Not planned

https://gitlab.isc.org/isc-projects/bind9/-/issues/1831
Feature request: Separate NXDOMAIN cache with its own max-ncache-size
Cathy Almond, 2024-03-01

This relates to PRSD DDoS attacks, and the effect on participating resolvers when the domain under onslaught is able to keep responding and does not die or rate-limit the resolvers.
The scenario is one in which a very large number of unique names are being queried, the objective being to bypass cached NXDOMAINs in resolvers and to force every name to become a query to the authoritative servers for the domain (or hosting provider) that is being attacked.
Typically, the target servers will either die, or will commence rate-limiting their perceived attackers. In the case of a resolver, this will result in a large number of recursive queries being backlogged while they wait for the server responses that never arrive.
BIND uses fetch-limits to mitigate the non-responding servers scenario.
But in the situation where the servers never die or never rate-limit, the outcome is rather different. Resolvers that can cope with the increase in traffic (which usually isn't actually that much) instead see a rapid increase in memory consumption (and a decrease in cache hits!) due to the NXDOMAIN responses that are received and then cached (never to be used again).
One mitigation for resolver operators has been to reduce max-ncache-ttl to silly small values - but the effectiveness of this depends on the structure of the cache nodes and how often opportunistic cache cleaning hits these nodes.
Yes, overmem (LRU-based logic) cache-cleaning will help with this, but for many, it is going to be at the expense of 'positive' cache content, and regular clients will start to suffer with more cache-misses, as well as cache churn increasing as negative and positive cache content keeps being 'swapped'.
Mark suggested keeping negative answers in a separate cache, where they could have their own max-ncache-size and churn all by themselves, without affecting main cache.
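A minimal sketch of that suggestion, assuming a simple LRU bound on each side; the class and parameter names are made up for illustration and are not actual BIND options:

```python
from collections import OrderedDict

class SplitCache:
    """Negative answers (NXDOMAIN/NXRRSET) live in their own LRU-bounded
    cache, so churn from a PRSD attack cannot evict positive answers."""

    def __init__(self, max_positive, max_negative):
        self.positive = OrderedDict()
        self.negative = OrderedDict()   # bounded by a max-ncache-size analogue
        self.max_positive = max_positive
        self.max_negative = max_negative

    def _put(self, cache, limit, key, value):
        cache[key] = value
        cache.move_to_end(key)
        while len(cache) > limit:
            cache.popitem(last=False)   # LRU eviction within this cache only

    def store(self, qname, qtype, answer, is_negative):
        if is_negative:
            self._put(self.negative, self.max_negative, (qname, qtype), answer)
        else:
            self._put(self.positive, self.max_positive, (qname, qtype), answer)
```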
This separate-cache idea sounds like A Good Idea - but one that we've never quite got around to as part of ongoing DDoS mitigation work.
(Also tagging this as 'Customer' since I can find many a customer ticket where customers have been bitten by this when one specific and well-known DNS hosting company have been under attack, and their servers never falter in sending back NXDOMAIN responses to their 'attackers'.)

Milestone: Not planned

https://gitlab.isc.org/isc-projects/bind9/-/issues/2495
Mitigation for cache bloat due to negative RRsets (NXDOMAIN and NXRRSET) when preserving expired RRsets using max-stale-ttl
Cathy Almond, 2024-03-01

I'm tagging this as 'Customer' because it is affecting/has affected several customer caches (unanticipated cache bloat) with 'stale-cache-enable yes;'
----
By design, negative cache content is not used in the same way as positive cache content. 'stale-answer-client-timeout' does not apply to negative content, thus clients will time out and likely retry the same query several times before giving up, long before the resolver-query-timeout. Only when named has failed to get a response from the authoritative server(s) does the 'stale-refresh-time' window commence.
There is a good reason for this - when querying the authoritative servers for a name that is already NXDOMAIN or NXRRSET in cache, we don't know if this is going to be replaced with positive content or not. Therefore, to ensure that positive content is preferred, we have to wait to be sure that it's sensible to serve the stale negative content as a last resort.
UNFORTUNATELY:
1. Most negative content is single-use - so adding max-stale-ttl to its retention period can significantly increase cache bloat - for no useful reason whatsoever.
2. Most negative content has (also by design of sane zone administrators, or overridden by max-ncache-ttl) a much shorter TTL in cache than positive content.
So when we're not preserving stale content, the negative content is fairly quickly removed from cache, in comparison with the positive RRsets. But adding max-stale-ttl to the retention period can quite significantly tilt the balance in favour of all of this one-use-only cached negative content.
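A sketch of the retention arithmetic, with a hypothetical `max_stale_ttl_negative` override of the kind proposed at the end of this report (a value of 0 disabling stale retention of negative entries entirely):

```python
def retention_seconds(ttl, max_stale_ttl, is_negative, max_stale_ttl_negative=None):
    """How long an RRset stays in cache when stale answers are enabled.

    max_stale_ttl_negative is the hypothetical new option: it overrides
    max-stale-ttl for negative (NXDOMAIN/NXRRSET) entries, and 0 disables
    their stale retention entirely.
    """
    if is_negative and max_stale_ttl_negative is not None:
        return ttl + max_stale_ttl_negative
    return ttl + max_stale_ttl

# A 60-second NXDOMAIN kept for a 1-day max-stale-ttl lives 1441 times
# longer than it would without stale retention:
print(retention_seconds(60, 86400, is_negative=True))                            # 86460
print(retention_seconds(60, 86400, is_negative=True, max_stale_ttl_negative=0))  # 60
```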
We don't have any statistics that measure cache hits for different RTYPEs, so we don't know for sure what percentage of negative content is used again (cache hits) versus just sitting there for no good reason (cache misses) - perhaps we should? (Perhaps we should also have stats on stale hits and misses by RTYPE?)
But in any case, having seen too many caches overloaded with stale negative content, I should like to propose an option that can be used to shorten the max-stale-ttl for negative content, and that also takes a value of zero to disable its retention entirely.

https://gitlab.isc.org/isc-projects/bind9/-/issues/3261
Run cache cleaning as offloaded work
Ondřej Surý, 2024-03-01

The cache cleaning is an on-task incremental process, which makes it an ideal candidate for running as offloaded work.
NOTE for myself or whomever is going to do the job - great care needs to be taken with signalling the end of cleaning. Currently this is serialized by the task, but if we move this into the threadpool, the signalling needs to be done by an atomic variable (or something like that).
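A minimal sketch of that signalling concern, assuming a generic thread pool rather than BIND's actual task/loop machinery; the counter must be updated atomically (a lock-guarded integer stands in for an atomic here), and all batches must be counted before any is submitted, or the "done" callback can fire early:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class OffloadedCleaner:
    def __init__(self, pool, on_done):
        self.pool = pool
        self.on_done = on_done
        self.pending = 0              # stands in for an atomic counter
        self.lock = threading.Lock()

    def submit_all(self, batches):
        with self.lock:
            self.pending += len(batches)          # count everything up front...
        for batch in batches:
            self.pool.submit(self._clean, batch)  # ...then start the workers

    def _clean(self, batch):
        for node in batch:
            node.clear()              # the actual incremental cleaning work
        with self.lock:
            self.pending -= 1
            last = self.pending == 0
        if last:
            self.on_done()            # only the last worker signals, exactly once

pool = ThreadPoolExecutor(max_workers=4)
cleaner = OffloadedCleaner(pool, on_done=lambda: print("cleaning finished"))
cleaner.submit_all([[{"a": 1}], [{"b": 2}]])      # dicts stand in for cache nodes
pool.shutdown(wait=True)
```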
Milestone: Not planned

https://gitlab.isc.org/isc-projects/bind9/-/issues/3992
The XFR unreachable cache redesign
Ondřej Surý, 2024-03-01

The unreachable cache for **dead primaries** was added to BIND 9 in 2006 via 1372e172d0e0b08996376b782a9041d1e3542489. It features a 10-slot LRU array with a fixed 600-second (10-minute) delay. During this time, any primary that has a hiccup is blocked for the whole duration (unless overwritten by a different dead primary).

One can argue:
- 10 minutes is too long for a fixed, non-configurable delay
- 10 slots are not enough - servers could be running 1M or more zones with different primaries; and especially in situations like these, there's a high chance that more primaries would be having problems
I think this needs a redesign (a sketch follows below), but meanwhile I think we can drop `UNREACH_HOLD_TIME` to something like 10 seconds (or 60?) - this should still prevent a thundering herd against the unresponsive server, while making recovery much faster.
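A sketch of one possible shape for the redesign - a map keyed by the primary's address with per-entry expiry instead of a fixed 10-slot array, and a configurable hold time; the names and the 10-second default are illustrative only:

```python
import time

class UnreachableCache:
    """Tracks dead primaries without a fixed slot count; hold_time replaces
    the hard-coded 600-second UNREACH_HOLD_TIME and could be configurable."""

    def __init__(self, hold_time=10.0):
        self.hold_time = hold_time
        self.entries = {}                      # (remote, local) -> expiry

    def add(self, remote, local):
        self.entries[(remote, local)] = time.monotonic() + self.hold_time

    def is_unreachable(self, remote, local):
        expiry = self.entries.get((remote, local))
        if expiry is None:
            return False
        if time.monotonic() >= expiry:
            del self.entries[(remote, local)]  # expired; allow retries again
            return False
        return True
```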
Milestone: Not planned

https://gitlab.isc.org/isc-projects/bind9/-/issues/4237
Remove "dialup" and "heartbeat-interval"
Evan Hunt, 2024-03-01

The "dialup" and "heartbeat-interval" options have been deprecated in 9.20 (see #3700, !8080) and will need to be removed later.
The due date for this issue has been set to an arbitrary date that is presumed to fall within the BIND 9.21 development cycle.

Milestone: Not planned. Assignee: Evan Hunt. Due: 2024-08-01.

https://gitlab.isc.org/isc-projects/bind9/-/issues/3442
Remove the "max-zone-ttl" option (on options/zone level)
Michał Kępień, 2024-03-01
The `max-zone-ttl` option should now be configured as part of
`dnssec-policy`. The option with the same name on `options` and `zone`
levels should be removed.
!6542 deprecated that sort of use. This issue serves as a reminder to
ultimately remove the relevant code altogether.
The due date for this issue has been set to an arbitrary date that is
presumed to fall within the BIND 9.21 development cycle.
See #2918, !6542

Milestone: Not planned. Assignee: Evan Hunt. Due: 2024-08-01.

https://gitlab.isc.org/isc-projects/bind9/-/issues/598
Wishlist: statistics for DNS-over-TCP and TLS
Tony Finch, 2024-02-29

A couple of suggestions:
1. For DNS-over-TLS using a proxy, it would be nice to have separate statistics counters for queries that came from the proxy. When the TLS proxy is running on the same server, it would be enough to have separate counters when the client address is in the interface list that BIND keeps track of. Is this generally useful enough to be worthwhile?
2. For DNS-over-TCP (and by implication, DNS-over-TLS) it would be helpful to have some guide to setting TCP idle timeouts. Two things would help:
* include the connection age in the query log - useful for later analysis, but no good if query logging needs to be left off
* keep an overall histogram of connection age - I don't know of any smaller summary statistics that would be useful, because the distribution of queries is very skewed (see the sketch after this list)
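A sketch of such a connection-age histogram, using logarithmic buckets since the distribution is heavily skewed; the bucket boundaries here are arbitrary:

```python
import bisect

# Bucket upper bounds in seconds, roughly log-spaced.
BOUNDS = [0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]

class ConnectionAgeHistogram:
    def __init__(self):
        self.buckets = [0] * (len(BOUNDS) + 1)   # final bucket is the overflow

    def record(self, age_seconds):
        self.buckets[bisect.bisect_left(BOUNDS, age_seconds)] += 1

    def dump(self):
        lo = 0.0
        for bound, count in zip(BOUNDS, self.buckets):
            print(f"{lo:7.1f}s - {bound:7.1f}s : {count}")
            lo = bound
        print(f"        > {BOUNDS[-1]:7.1f}s : {self.buckets[-1]}")

# Record the age of each TCP/TLS connection when it closes:
hist = ConnectionAgeHistogram()
for age in (0.05, 0.2, 0.2, 2.5, 45.0):
    hist.record(age)
hist.dump()
```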
Milestone: BIND 9.19.x. Assignee: Aydın Mercan.

https://gitlab.isc.org/isc-projects/bind9/-/issues/4615
Improve dnssec-keygen warnings when unnecessary parameters are ignored
Cathy Almond, 2024-02-29

### Summary
The specific instance that inspires this bug report is that these commands
> dnssec-keygen -b 2048 -a ECDSAP256SHA256 -f KSK example.com
> dnssec-keygen -b 2048 -a ECDSAP256SHA256 example.com
.. don't generate a warning that the -b 2048 is ignored because key algorithm ECDSAP256SHA256 has a predefined length
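A sketch of the requested check; dnssec-keygen itself is written in C, so the table and function here are purely illustrative:

```python
import sys

# Algorithms whose key length is fixed by definition; -b is meaningless here.
FIXED_LENGTH_ALGORITHMS = {
    "ECDSAP256SHA256": 256,
    "ECDSAP384SHA384": 384,
    "ED25519": 256,
    "ED448": 456,
}

def effective_key_size(algorithm, requested_bits):
    fixed = FIXED_LENGTH_ALGORITHMS.get(algorithm.upper())
    if fixed is not None:
        if requested_bits is not None and requested_bits != fixed:
            # The warning this issue asks dnssec-keygen to emit:
            print(f"warning: key algorithm {algorithm} has a predefined "
                  f"length; ignoring -b {requested_bits}", file=sys.stderr)
        return fixed
    return requested_bits

print(effective_key_size("ECDSAP256SHA256", 2048))   # warns, returns 256
```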
There may be other scenarios worth checking at the same time?
### BIND version affected
Noted against 9.16.28 (a long time ago), but I don't think the situation has changed.
### Steps to reproduce
See above - just do it?
### What is the current *bug* behavior?
No warning. dnssec-keygen goes its own sweet way and uses its built-in default length for this key.
### What is the expected *correct* behavior?
It would have been really helpful to have known that the keys didn't have the requested length - this caused a bunch of other problems during migration to dnssec-policy using these keys!
What actually happened is that after restarting named and switching to dnssec-policy with these parameters:
> ksk lifetime unlimited algorithm ECDSAP256SHA256 2048;
> zsk lifetime unlimited algorithm ECDSAP256SHA256 2048;
named didn't recognise the existing keys as matching the policy and generated new ones for the zone, retiring the old keys - which is just what you don't want when migrating your existing zone's configuration and not intending to abruptly re-sign it with new keys (aargh!)
In fact, named-checkconf does fuss about the 2048:
> /etc/namedb/named.conf:54: dnssec-policy: key algorithm ECDSAP256SHA256 has predefined length; ignoring length value 2048
> /etc/namedb/named.conf:55: dnssec-policy: key algorithm ECDSAP256SHA256 has predefined length; ignoring length value 2048
So perhaps this is another small bug too - if the length is irrelevant and ignored - why did it not just recognise the existing keys?
It was perfectly happy with the same keys and with:
> ksk lifetime unlimited algorithm ECDSAP256SHA256;
> zsk lifetime unlimited algorithm ECDSAP256SHA256;

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24)

https://gitlab.isc.org/isc-projects/bind9/-/issues/4202
Running the "mkeys" system test around the top of the hour may cause it to fail
Michał Kępień, 2024-02-29

https://gitlab.isc.org/isc-private/bind9/-/jobs/3509369
The `mkeys` system test failed silently on the following step:
2023-07-06 15:00:59 INFO:mkeys I:mkeys_tmp_5qq267_u:revoke key with bad signature, check revocation is ignored (19)
This means that it was more than likely `set -e` that triggered the
failure. Unfortunately, the `mkeys` system test is written in a way
that does not make debugging easy when `set -e` is in effect. There are
a *lot* of steps in the relevant check and each of them could trigger
the failure:
```sh
n=$((n+1))
echo_i "revoke key with bad signature, check revocation is ignored ($n)"
ret=0
revoked=$($REVOKE -K ns1 "$original")
rkeyid=$(keyfile_to_key_id "$revoked")
rm -f ns1/root.db.signed.jnl
# We need to activate at least one valid DNSKEY to prevent dnssec-signzone from
# failing. Alternatively, we could use -P to disable post-sign verification,
# but we actually do want post-sign verification to happen to ensure the zone
# is correct before we break it on purpose.
$SETTIME -R none -D none -K ns1 "$standby1" > /dev/null
$SIGNER -Sg -K ns1 -N unixtime -O full -o . -f signer.out.$n ns1/root.db > /dev/null 2>/dev/null
cp -f ns1/root.db.signed ns1/root.db.tmp
BADSIG="SVn2tLDzpNX2rxR4xRceiCsiTqcWNKh7NQ0EQfCrVzp9WEmLw60sQ5kP xGk4FS/xSKfh89hO2O/H20Bzp0lMdtr2tKy8IMdU/mBZxQf2PXhUWRkg V2buVBKugTiOPTJSnaqYCN3rSfV1o7NtC1VNHKKK/D5g6bpDehdn5Gaq kpBhN+MSCCh9OZP2IT20luS1ARXxLlvuSVXJ3JYuuhTsQXUbX/SQpNoB Lo6ahCE55szJnmAxZEbb2KOVnSlZRA6ZBHDhdtO0S4OkvcmTutvcVV+7 w53CbKdaXhirvHIh0mZXmYk2PbPLDY7PU9wSH40UiWPOB9f00wwn6hUe uEQ1Qg=="
# Less than a second may have passed since ns1 was started. If we call
# dnssec-signzone immediately, ns1/root.db.signed will not be reloaded by the
# subsequent "rndc reload ." call on platforms which do not set the
# "nanoseconds" field of isc_time_t, due to zone load time being seemingly
# equal to master file modification time.
sleep 1
sed -e "/ $rkeyid \./s, \. .*$, . $BADSIG," signer.out.$n > ns1/root.db.signed
mkeys_reload_on 1 || ret=1
mkeys_refresh_on 2 || ret=1
mkeys_status_on 2 > rndc.out.$n 2>&1 || ret=1
# one key listed
count=$(grep -c "keyid: " rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'keyid:' count ($count) != 1"; ret=1; }
# it's the original key id
count=$(grep -c "keyid: $originalid" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'keyid: $originalid' count ($count) != 1"; ret=1; }
# not revoked
count=$(grep -c "REVOKE" rndc.out.$n) || true
[ "$count" -eq 0 ] || { echo_i "'REVOKE' count ($count) != 0"; ret=1; }
# trust is still current
count=$(grep -c "trust" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'trust' count != 1"; ret=1; }
count=$(grep -c "trusted since" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'trusted since' count != 1"; ret=1; }
if [ $ret != 0 ]; then echo_i "failed"; fi
status=$((status+ret))
```
However, it is possible to look at the presence/absence of certain files
among the test artifacts and also to look at file timestamps, so that
some scenarios can be ruled out. In this case, `ns1/root.db.tmp` did
not exist, so execution did not reach the `cp -f` line. This meant that
only `dnssec-revoke`, `dnssec-settime`, and `dnssec-signzone` could have
failed. However, since there were three key files in `ns1` newer than
the `$original` one (meaning that `dnssec-revoke` and `dnssec-settime`
did their job), `dnssec-signzone` was the primary suspect. I ran its
invocation from the test manually on the artifacts and...
```
$ dnssec-signzone -Sg -K ns1 -N unixtime -O full -o . -f signer.out.19 ns1/root.db
Fetching ./ECDSAP384SHA384/25503 (KSK) from key repository.
Fetching ./ECDSAP256SHA256/37163 (KSK) from key repository.
Fetching ./ECDSAP384SHA384/24825 (ZSK) from key repository.
dnssec-signzone: warning: Serial number would not advance, using increment method instead
Verifying the zone using the following algorithms:
- ECDSAP256SHA256
Missing ZSK for algorithm ECDSAP256SHA256
Missing self-signed KSK for algorithm ECDSAP384SHA384
No correct ECDSAP256SHA256 signature for . NSEC
No correct ECDSAP256SHA256 signature for . SOA
No correct ECDSAP256SHA256 signature for . NS
No correct ECDSAP256SHA256 signature for example NSEC
No correct ECDSAP256SHA256 signature for example TXT
No correct ECDSAP256SHA256 signature for a.root-servers.nil NSEC
No correct ECDSAP256SHA256 signature for a.root-servers.nil A
No correct ECDSAP256SHA256 signature for tld NSEC
The zone is not fully signed for the following algorithms:
ECDSAP256SHA256
ECDSAP384SHA384
.
DNSSEC completeness test failed.
Zone verification failed (failure)
```
But wait, how is it even possible that this zone is signed using
multiple algorithms?
```
$ git grep -F _ALGORITHM bin/tests/system/mkeys/ | wc -l
13
$ git grep -F _ALGORITHM bin/tests/system/mkeys/ | grep -vF DEFAULT_ALGORITHM
$
```
The `*_ALGORITHM` environment variables are set by
`bin/tests/system/get_algorithms.py`. The script is written in a way
that allows the algorithms used to be chosen randomly from a specific
set. The `mkeys` test takes advantage of that feature and sets
`ALGORITHM_SET` to `ecc_default`. The script tries to ensure a stable
set of algorithms is used for each system test run by [seeding its RNG
with a value derived from the current time][1]. This works most of the
time, but if we get really unlucky, `setup.sh` can be run during one
"time slot" while `tests.sh` is run during another. This is exactly
what happened in this case:
```
------------------------------ Captured log setup ------------------------------
2023-07-06 14:59:58 INFO:mkeys switching to tmpdir: /builds/isc-private/bind9/bin/tests/system/mkeys_tmp_5qq267_u
2023-07-06 14:59:58 INFO:mkeys test started: mkeys/tests_sh_mkeys.py
2023-07-06 14:59:58 INFO:mkeys using port range: <20583, 20602>
------------------------------ Captured log call -------------------------------
2023-07-06 15:00:14 INFO:mkeys I:mkeys_tmp_5qq267_u:check for signed record (1)
2023-07-06 15:00:14 INFO:mkeys I:mkeys_tmp_5qq267_u:check positive validation with valid trust anchor (2)
2023-07-06 15:00:14 INFO:mkeys I:mkeys_tmp_5qq267_u:check for failed validation due to wrong key in managed-keys (3)
...
```
(Note that when the `test started: ...` line is logged, the script
actually runs `setup.sh` first. `tests.sh` is run afterwards.)
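A sketch of the hazard, assuming the seed is simply the current time truncated to a fixed slot width (the actual derivation lives in get_algorithms.py, linked below); two invocations a few seconds apart can fall into different slots and therefore pick different algorithms:

```python
import random
import time

SLOT_SECONDS = 3600  # assumed slot width; the real script's value may differ

def pick_algorithm(now, candidates=("ECDSAP256SHA256", "ECDSAP384SHA384")):
    # Seeding from the current "time slot" makes repeated invocations within
    # the same slot deterministic - but not across a slot boundary.
    rng = random.Random(int(now) // SLOT_SECONDS)
    return rng.choice(candidates)

# setup.sh ran at 14:59:58 and tests.sh at 15:00:14 - different slots,
# so the two invocations may disagree on the algorithm set:
setup_time = time.mktime((2023, 7, 6, 14, 59, 58, 0, 0, -1))
tests_time = time.mktime((2023, 7, 6, 15, 0, 14, 0, 0, -1))
print(pick_algorithm(setup_time), pick_algorithm(tests_time))
```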
This problem happens very rarely, so I am not sure if we need to do
anything about it, but it felt right to open an issue so that others are
aware that this is a thing. `mkeys` is the only system test that
currently sets the `ALGORITHM_SET` variable, so exposure is minimal. If
we migrate more tests to variable algorithms, this might become a more
pressing issue to address.
[1]: https://gitlab.isc.org/isc-projects/bind9/-/blob/bf8acd455693edef03881fd2180c5561bc0db66d/bin/tests/system/get_algorithms.py#L171-175

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24). Assignee: Tom Krizek.

https://gitlab.isc.org/isc-projects/bind9/-/issues/3810
Replace system test runner with pytest
Tom Krizek, 2024-02-29

The legacy solution for running system tests has evolved over the course of years and is currently a mix of shell & perl scripts intermingled with the build system, while some of the system tests utilize pytest. Implementing a more consistent solution using just pytest as a runner could bring the following benefits:
- better test run isolation (i.e. artifacts from previous run don't interfere with current test run)
- more precise control over test selection (running just a single test case)
- getting rid of perl+shell glue scripts
- a simpler and more standard way to run and parallelize test runs
- solid foundation for future extensions (e.g. wrapping test execution inside a network/pid namespace)
For a transitory period of time, the legacy test framework should be supported, since it'd be difficult to replace everything at once. The pytest runner should be available in 9.18+; it'd be prudent to keep the legacy runner support until 9.16 reaches EOL. By that time, we should have enough insight to determine whether pytest proves to be a suitable replacement and throw away the legacy runner from supported branches at that point.
Migration plan for moving to pytest runner and dropping the legacy runner support:
- Phase I - pytest runner development, legacy runner supported
- [x] initial implementation of the pytest runner (#3978, !6809)
- [x] support out-of-tree tests (#4246)
- [x] resolve support on CI systems with old pytest (OpenBSD, CentOS 7) (!8193)
- [x] implement any missing (and desired) features from legacy runner (#4252)
- [x] configure `make check` to invoke pytest (#4262)
- Phase II - deprecating legacy runner - 9.19-only
- [ ] remove legacy runner control script(s) - legacy.run.sh, get_ports.sh ...
- [ ] remove no longer needed scripts from system tests (e.g. clean.sh)
- [ ] remove conf.sh(.common) and declare variables in pytest only
- [ ] remove the Makefile entanglement
- [ ] declare python and pytest-xdist as required dependencies for tests + document
- [ ] address any `FUTURE` comments in the pytest runner code
- Phase III - cleanup after legacy runner
- [ ] rewrite start.pl/stop.pl to python (related https://gitlab.isc.org/isc-projects/bind9/-/issues/3198)
- [ ] rewrite remaining setup/teardown perl&shell scripts to python
- [ ] rewrite setup.sh/prereq.sh system tests scripts to pytest fixtures
- [ ] ensure system test documentation is up to date

Milestone: BIND 9.19.x. Assignee: Tom Krizek.

https://gitlab.isc.org/isc-projects/bind9/-/issues/4422
No supported algorithms on platform
Mark Andrews, 2024-02-29

Job [#3783240](https://gitlab.isc.org/isc-projects/bind9/-/jobs/3783240) failed for 5d20a7ce254dabe1d4a99f7bd0fd1cfa6309124b:
```
$ PYTHON="$(source bin/tests/system/conf.sh; echo $PYTHON)"
Traceback (most recent call last):
File "/builds/isc-projects/bind9/bin/tests/system/get_algorithms.py", line 241, in <module>
main()
File "/builds/isc-projects/bind9/bin/tests/system/get_algorithms.py", line 227, in main
algs = filter_supported(algs)
^^^^^^^^^^^^^^^^^^^^^^
File "/builds/isc-projects/bind9/bin/tests/system/get_algorithms.py", line 138, in filter_supported
raise RuntimeError(
RuntimeError: no DEFAULT algorithm from "stable" set supported on this platform
$
```

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24). Assignee: Tom Krizek.

https://gitlab.isc.org/isc-projects/kea/-/issues/2778
HA service UT sporadically fail on macOS
Razvan Becheriu, 2024-02-29

```
[ RUN ] HAServiceTest.sendUpdatesControlResultErrorMultiThreading
ha_service_unittest.cc:1378: Failure
Expected equality of these values:
2
factory3_->getResponseCreator()->getReceivedRequests().size()
Which is: 1
ha_service_unittest.cc:1384: Failure
Value of: update_request3
Actual: false
Expected: true
[ FAILED ] HAServiceTest.sendUpdatesControlResultErrorMultiThreading (2 ms)
```
```
[ RUN ] HAServiceTest.sendSuccessfulUpdates6MultiThreading
ha_service_unittest.cc:1542: Failure
Expected equality of these values:
1
factory3_->getResponseCreator()->getReceivedRequests().size()
Which is: 0
ha_service_unittest.cc:1549: Failure
Value of: update_request3
Actual: false
Expected: true
[ FAILED ] HAServiceTest.sendSuccessfulUpdates6MultiThreading (2 ms)
```
```
[ RUN ] HAServiceTest.sendSuccessfulUpdatesMultiThreading
ha_service_unittest.cc:1120: Failure
Expected equality of these values:
2
factory3_->getResponseCreator()->getReceivedRequests().size()
Which is: 1
ha_service_unittest.cc:1126: Failure
Value of: update_request3
Actual: false
Expected: true
[ FAILED ] HAServiceTest.sendSuccessfulUpdatesMultiThreading (2 ms)
```
It seems that all are caused by checks on the backup server.
It might be related to the stop condition on the IO service (waiting for requests, not for replies):
```
// Actually perform the lease updates.
ASSERT_NO_THROW(runIOService(TEST_TIMEOUT, [this]() {
// Finish running IO service when there are no more pending requests.
return (service_->pendingRequestSize() == 0);
}));
```
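A sketch of the suspected race, in Python for brevity (the real test is C++ gtest): if the stop condition only tracks that requests have been *sent*, the assertions can run before the backup server's response creator has *recorded* them:

```python
import threading
import time

received_requests = []           # stands in for getReceivedRequests()
sent = threading.Semaphore(0)

def backup_server(request):
    time.sleep(0.01)             # network and processing delay
    received_requests.append(request)    # recorded after the send completes

def send_update(request):
    threading.Thread(target=backup_server, args=(request,)).start()
    sent.release()               # "no more pending requests" condition met

send_update("lease-update-1")
send_update("lease-update-2")
sent.acquire(); sent.acquire()   # wait until both requests were *sent*

# Racy: may print 0, 1 or 2, because "sent" is not the same as "received".
print(len(received_requests))
```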
Milestone: next-stable-2.6

https://gitlab.isc.org/isc-projects/stork/-/issues/1320
Duplicated rows in the service table
Slawek Figiel, 2024-02-28

The problem was reported [on the Stork-users mailing list](https://lists.isc.org/pipermail/stork-users/2024-February/000245.html).
The `service` table rows may be duplicated under some unknown conditions. This causes the HA status displayed on the Dashboard to diverge from the status presented on the application page.
The user reports that the problem occurs in Stork 1.15 but was also observed in the previous versions. The first installed version was 1.12.
Stork was installed long after configuring HA in Kea.
It seems the same problem was reported in #616 and #818.
We should check whether the problem was fixed correctly in 1.7 and whether the invalid table state may have been preserved from previous versions.
We should also analyze if adding the unique constraint on the `service` table would be beneficial to avoid similar issues.

Milestone: 1.16. Assignee: Marcin Siodelski.

https://gitlab.isc.org/isc-projects/kea/-/issues/3007
Kea builds are not reproducible
Sudip Mukherjee, 2024-02-28
---
name: Bug report
about: The latest version of kea is failing the reproducible build as it adds the build path in kea-admin script.
---
**Describe the bug**
The latest version of kea is failing the reproducible build as it adds the build path in kea-admin script.
**To Reproduce**
Steps to reproduce the behavior:
1. Build kea
2. Again build kea at a different build location
3. Use diffoscope to compare kea-admin
4. See error
The result can be seen at https://autobuilder.yocto.io/pub/repro-fail/oe-reproducible-20230806-_h282f1z/packages/diff-html/
**Expected behavior**
The built kea-admin should not contain any reference to build path.
**Environment:**
- Kea version: v2.5.0
- OS: All
- Which features were compiled in (in particular which backends): NA
- If/which hooks were loaded in: NA
**Additional Information**
The attached patch will fix the reproducible build and verified with diffoscope. [0001-kea-fix-reproducible-build-failure.patch](/uploads/7b4b13a72d4953a65e6768bdc4f78483/0001-kea-fix-reproducible-build-failure.patch)
**Contacting you**
Please email at sudipm.mukherjee@gmail.com

Milestone: outstanding

https://gitlab.isc.org/isc-projects/bind9/-/issues/4606
"dry-run" mode to help with dnssec-policy migration
Carsten Strotmann, 2024-02-28

### Description
For some users of BIND 9, especially people who are only part-time DNS admins, migrating from manual DNSSEC key management with "auto-dnssec maintain;" towards "dnssec-policy" is difficult.
While the documentation provided by ISC is good, there is currently no way to "verify" the new "dnssec-policy" configuration before enabling it. Experience has shown (in DNS training classes, but also in real world deployments) that there are many things that can go wrong:
- differences in the DNSSEC key configuration (old vs. new)
- file system permissions on the old key material
- file system location of the old key material
- issues with the time-events stored in the old key material
Going online with a slightly wrong configuration can cause an immediate key rollover, which might break the zone. Recovering from this situation is possible, but requires good knowledge of BIND 9 DNSSEC workings.
### Request
Provide a "dnssec-policy dry-run" mode, where BIND 9 will log the next steps in the automatic DNSSEC management to the log files (e.g. category "DNSSEC"), but will not execute any changes to the DNSSEC signed zone or the key material. This will enable the user to test drive the new "dnssec-policy" to see if it will act as expected.
Admins can create a configuration with "dry-run" mode enabled, check the logfiles, and if the actions in the log-file match the expectations, the "dry-run" mode can be removed and the new configuration will become active.
### Links / references

Assignee: Matthijs Mekking <matthijs@isc.org>

https://gitlab.isc.org/isc-projects/bind9/-/issues/4609
ADB memory growth in 9.19
Ondřej Surý, 2024-02-28

During the 25h test, it was discovered that the ADB and main memory contexts grow suspiciously:
![bindstats.memory.contexts.ADB._sum_inuse-http_3A_2F_2F127.0.0.1_3A8888_2Fjson_2Fv1-9.19](/uploads/5e5f039e83e4a892554001b6c7348e92/bindstats.memory.contexts.ADB._sum_inuse-http_3A_2F_2F127.0.0.1_3A8888_2Fjson_2Fv1-9.19.png)
![bindstats.memory.contexts.main._sum_inuse-http_3A_2F_2F127.0.0.1_3A8888_2Fjson_2Fv1-main-9.19](/uploads/bad7883a65948bfd2946b84fe6505cdf/bindstats.memory.contexts.main._sum_inuse-http_3A_2F_2F127.0.0.1_3A8888_2Fjson_2Fv1-main-9.19.png)
The growth is much slower in 9.18:
![bindstats.memory.contexts.ADB._sum_inuse-http_3A_2F_2F127.0.0.1_3A8888_2Fjson_2Fv1](/uploads/4cc202485a129130ecd978cf23ad452a/bindstats.memory.contexts.ADB._sum_inuse-http_3A_2F_2F127.0.0.1_3A8888_2Fjson_2Fv1.png)

Milestone: May 2024 (9.18.27, 9.18.27-S1, 9.19.24)

https://gitlab.isc.org/isc-projects/stork/-/issues/1214
Shared network address utilization not consistent
Victor Petrescu, 2024-02-27

Hi everyone,
I've encountered an issue related to the values of the Shared Network Address Utilization. It seems that the values from the /metrics endpoint of the Stork Server frequently do not show the same values as the Stork Web Application.
For example:
Information from Stork Web App:
![Screenshot_1205](/uploads/0d06970d5bdb83da790e7521dcde773e/Screenshot_1205.png)
Information from Stork Server /metrics:
storkserver_shared_network_address_utilization{name="1"} 0.005
storkserver_shared_network_address_utilization{name="2"} 0.154
storkserver_shared_network_address_utilization{name="3"} 0.004
storkserver_shared_network_address_utilization{name="4"} 0.003
storkserver_shared_network_address_utilization{name="5"} 0.003
storkserver_shared_network_address_utilization{name="6"} 0
As you can see, the values don't match. Strangely, sometimes the values do match.
The Stork Web App is showing the correct values; the problem is with the ones from /metrics.
Thank you!

Milestone: 1.16

https://gitlab.isc.org/isc-projects/stork/-/issues/1313
Incorrect default *.env file.
Andreas Jentsch, 2024-02-27
---
name: Bug report
about: Create a report to help us improve
---
If you believe your bug report is a security issue (e.g. a packet that can kill the server), DO NOT
REPORT IT HERE. Please use https://www.isc.org/community/report-bug/ instead or send mail to
security-office(at)isc(dot)org.
**Describe the bug**
If you start the agent with the following command, this output is generated.
/usr/bin/stork-agent --use-env-file
FATA[2024-02-08 10:39:55] main.go:406 invalid environment file: '/etc/stork/server.env': ...
**To Reproduce**
Steps to reproduce the behavior:
1. /usr/bin/stork-agent --env-file '/etc/stork/agent.env' --use-env-file
**Expected behavior**
You should adjust the default value for the *.env file.
**Environment:**
/usr/bin/stork-agent -v
1.15.0
Static hostname: dhcp-01-xgs.glattnet.ch
Icon name: computer-server
Chassis: server 🖳
Machine ID: cf6ef911a0974dbfa031e35e8f775125
Boot ID: 96edd982af2d4745a030003205e7fae2
Operating System: Rocky Linux 9.3 (Blue Onyx)
CPE OS Name: cpe:/o:rocky:rocky:9::baseos
Kernel: Linux 5.14.0-362.18.1.el9_3.x86_64
Architecture: x86-64
Hardware Vendor: HPE
Hardware Model: ProLiant DL360 Gen10
Firmware Version: U32
- Kea version: 2.2.0
tarball
linked with:
log4cplus 2.0.5
OpenSSL 3.0.7 1 Nov 2022
database:
MySQL backend 14.0, library 3.2.6
PostgreSQL backend 13.0, library 130013
Memfile backend 2.1
- Stork: 1.15.0
- OS: Rocky Linux 9.3 (Blue Onyx)
- Kea: Which features were compiled in (in particular which backends)
- Kea: If/which hooks were loaded in
**Contacting you**
E-Mail

Milestone: 1.16

https://gitlab.isc.org/isc-projects/bind9/-/issues/4603
Comments to CVE-2023-5680
Peter Davies, 2024-02-26

Comments to CVE-2023-5680:
Description: When reviewing the fix for CVE-2023-5680 due to the crash we reported separately, we've noticed many other suspicious points in its implementation. Though these are based on code inspection and we haven't checked whether the issue is real or whether it can cause any practical problem like a crash, we're deeply concerned about the overall quality of this implementation, and would like to suggest that ISC revisit it, perhaps fundamentally.
The issues we've noticed are as follows (there may be more):
- it looks like a longer prefix match in ->old_ecs_root will not be found if a shorter prefix match is found in ->ecs_root. When using two address prefix trees, we ought to search both and use the longest prefix match, with ->ecs_root in preference if both have equal prefix lengths (see the sketch after this list).
- On a related note, it seems possible that copying (moving) data in old_ecs_root
to ecs_root can result in separate rdatasetheaders at the top level for the same record type.
- unlikely to be a big deal in practice, but this code in clean_iptree_nodedata()
probably doesn't do what it appears to intend; it results in cleaning up to 12
- as a meta issue, we're afraid the introduction of old_ecs_root and incremental cleaning needs a lot more tests, especially low-level unit tests, given its complexity. For example, if the last point is indeed an oversight, it could have been caught by a unit test easily.
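A sketch of the lookup rule from the first bullet above, using a toy prefix representation (the real BIND trees and types differ): search both trees and keep the longest match, preferring ecs_root on ties:

```python
def longest_prefix(tree, addr):
    """Toy stand-in: tree is a list of (prefix_len, prefix, data) entries;
    return the (prefix_len, data) of the longest match, or None."""
    best = None
    for prefix_len, prefix, data in tree:
        if addr.startswith(prefix) and (best is None or prefix_len > best[0]):
            best = (prefix_len, data)
    return best

def ecs_lookup(ecs_root, old_ecs_root, addr):
    new = longest_prefix(ecs_root, addr)
    old = longest_prefix(old_ecs_root, addr)
    if new is None:
        return old
    if old is None:
        return new
    # Prefer the longer prefix; on a tie, prefer ecs_root over old_ecs_root.
    return new if new[0] >= old[0] else old

# A /24 entry in old_ecs_root must beat a /16 entry in ecs_root:
ecs_root = [(16, "192.0.", "short match")]
old_ecs_root = [(24, "192.0.2.", "long match")]
print(ecs_lookup(ecs_root, old_ecs_root, "192.0.2.1"))   # (24, 'long match')
```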
See also #4587