Running the "mkeys" system test around the top of the hour may cause it to fail
https://gitlab.isc.org/isc-private/bind9/-/jobs/3509369
The mkeys
system test failed silently on the following step:
2023-07-06 15:00:59 INFO:mkeys I:mkeys_tmp_5qq267_u:revoke key with bad signature, check revocation is ignored (19)
This means that it was more than likely set -e
that triggered the
failure. Unfortunately, the mkeys
system test is written in a way
that does not make debugging easy when set -e
is in effect. There are
a lot of steps in the relevant check and each of them could trigger
the failure:
Click to expand/collapse
n=$((n+1))
echo_i "revoke key with bad signature, check revocation is ignored ($n)"
ret=0
revoked=$($REVOKE -K ns1 "$original")
rkeyid=$(keyfile_to_key_id "$revoked")
rm -f ns1/root.db.signed.jnl
# We need to activate at least one valid DNSKEY to prevent dnssec-signzone from
# failing. Alternatively, we could use -P to disable post-sign verification,
# but we actually do want post-sign verification to happen to ensure the zone
# is correct before we break it on purpose.
$SETTIME -R none -D none -K ns1 "$standby1" > /dev/null
$SIGNER -Sg -K ns1 -N unixtime -O full -o . -f signer.out.$n ns1/root.db > /dev/null 2>/dev/null
cp -f ns1/root.db.signed ns1/root.db.tmp
BADSIG="SVn2tLDzpNX2rxR4xRceiCsiTqcWNKh7NQ0EQfCrVzp9WEmLw60sQ5kP xGk4FS/xSKfh89hO2O/H20Bzp0lMdtr2tKy8IMdU/mBZxQf2PXhUWRkg V2buVBKugTiOPTJSnaqYCN3rSfV1o7NtC1VNHKKK/D5g6bpDehdn5Gaq kpBhN+MSCCh9OZP2IT20luS1ARXxLlvuSVXJ3JYuuhTsQXUbX/SQpNoB Lo6ahCE55szJnmAxZEbb2KOVnSlZRA6ZBHDhdtO0S4OkvcmTutvcVV+7 w53CbKdaXhirvHIh0mZXmYk2PbPLDY7PU9wSH40UiWPOB9f00wwn6hUe uEQ1Qg=="
# Less than a second may have passed since ns1 was started. If we call
# dnssec-signzone immediately, ns1/root.db.signed will not be reloaded by the
# subsequent "rndc reload ." call on platforms which do not set the
# "nanoseconds" field of isc_time_t, due to zone load time being seemingly
# equal to master file modification time.
sleep 1
sed -e "/ $rkeyid \./s, \. .*$, . $BADSIG," signer.out.$n > ns1/root.db.signed
mkeys_reload_on 1 || ret=1
mkeys_refresh_on 2 || ret=1
mkeys_status_on 2 > rndc.out.$n 2>&1 || ret=1
# one key listed
count=$(grep -c "keyid: " rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'keyid:' count ($count) != 1"; ret=1; }
# it's the original key id
count=$(grep -c "keyid: $originalid" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'keyid: $originalid' count ($count) != 1"; ret=1; }
# not revoked
count=$(grep -c "REVOKE" rndc.out.$n) || true
[ "$count" -eq 0 ] || { echo_i "'REVOKE' count ($count) != 0"; ret=1; }
# trust is still current
count=$(grep -c "trust" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'trust' count != 1"; ret=1; }
count=$(grep -c "trusted since" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'trusted since' count != 1"; ret=1; }
if [ $ret != 0 ]; then echo_i "failed"; fi
status=$((status+ret))
However, it is possible to look at the presence/absence of certain files
among the test artifacts and also to look at file timestamps, so that
some scenarios can be ruled out. In this case, ns1/root.db.tmp
did
not exist, so execution did not reach the cp -f
line. This meant that
only dnssec-revoke
, dnssec-settime
, and dnssec-signzone
could have
failed. However, since there were three key files in ns1
newer than
the $original
one (meaning that dnssec-revoke
and dnssec-settime
did their job), dnssec-signzone
was the primary suspect. I ran its
invocation from the test manually on the artifacts and...
$ dnssec-signzone -Sg -K ns1 -N unixtime -O full -o . -f signer.out.19 ns1/root.db
Fetching ./ECDSAP384SHA384/25503 (KSK) from key repository.
Fetching ./ECDSAP256SHA256/37163 (KSK) from key repository.
Fetching ./ECDSAP384SHA384/24825 (ZSK) from key repository.
dnssec-signzone: warning: Serial number would not advance, using increment method instead
Verifying the zone using the following algorithms:
- ECDSAP256SHA256
Missing ZSK for algorithm ECDSAP256SHA256
Missing self-signed KSK for algorithm ECDSAP384SHA384
No correct ECDSAP256SHA256 signature for . NSEC
No correct ECDSAP256SHA256 signature for . SOA
No correct ECDSAP256SHA256 signature for . NS
No correct ECDSAP256SHA256 signature for example NSEC
No correct ECDSAP256SHA256 signature for example TXT
No correct ECDSAP256SHA256 signature for a.root-servers.nil NSEC
No correct ECDSAP256SHA256 signature for a.root-servers.nil A
No correct ECDSAP256SHA256 signature for tld NSEC
The zone is not fully signed for the following algorithms:
ECDSAP256SHA256
ECDSAP384SHA384
.
DNSSEC completeness test failed.
Zone verification failed (failure)
But wait, how is it even possible that this zone is signed using multiple algorithms?
$ git grep -F _ALGORITHM bin/tests/system/mkeys/ | wc -l
13
$ git grep -F _ALGORITHM bin/tests/system/mkeys/ | grep -vF DEFAULT_ALGORITHM
$
The *_ALGORITHM
environment variables are set by
bin/tests/system/get_algorithms.py
. The script is written in a way
that allows the algorithms used to be chosen randomly from a specific
set. The mkeys
test takes advantage of that feature and sets
ALGORITHM_SET
to ecc_default
. The script tries to ensure a stable
set of algorithms is used for each system test run by seeding its RNG
with a value derived from the current time. This works most of the
time, but if we get really unlucky, setup.sh
can be run during one
"time slot" while tests.sh
is run during another. This is exactly
what happened in this case:
------------------------------ Captured log setup ------------------------------
2023-07-06 14:59:58 INFO:mkeys switching to tmpdir: /builds/isc-private/bind9/bin/tests/system/mkeys_tmp_5qq267_u
2023-07-06 14:59:58 INFO:mkeys test started: mkeys/tests_sh_mkeys.py
2023-07-06 14:59:58 INFO:mkeys using port range: <20583, 20602>
------------------------------ Captured log call -------------------------------
2023-07-06 15:00:14 INFO:mkeys I:mkeys_tmp_5qq267_u:check for signed record (1)
2023-07-06 15:00:14 INFO:mkeys I:mkeys_tmp_5qq267_u:check positive validation with valid trust anchor (2)
2023-07-06 15:00:14 INFO:mkeys I:mkeys_tmp_5qq267_u:check for failed validation due to wrong key in managed-keys (3)
...
(Note that when the test started: ...
line is logged, the script
actually runs setup.sh
first. tests.sh
is run afterwards.)
This problem happens very rarely, so I am not sure if we need to do
anything about it, but it felt right to open an issue so that others are
aware that this is a thing. mkeys
is the only system test that
currently sets the ALGORITHM_SET
variable, so exposure is minimal. If
we migrate more tests to variable algorithms, this might become a more
pressing issue to address.