Running the "mkeys" system test around the top of the hour may cause it to fail

https://gitlab.isc.org/isc-private/bind9/-/jobs/3509369

The mkeys system test failed silently on the following step:

2023-07-06 15:00:59 INFO:mkeys     I:mkeys_tmp_5qq267_u:revoke key with bad signature, check revocation is ignored (19)

This means that it was more than likely set -e that triggered the failure. Unfortunately, the mkeys system test is written in a way that does not make debugging easy when set -e is in effect. There are a lot of steps in the relevant check and each of them could trigger the failure:

Click to expand/collapse

n=$((n+1))
echo_i "revoke key with bad signature, check revocation is ignored ($n)"
ret=0
revoked=$($REVOKE -K ns1 "$original")
rkeyid=$(keyfile_to_key_id "$revoked")
rm -f ns1/root.db.signed.jnl
# We need to activate at least one valid DNSKEY to prevent dnssec-signzone from
# failing.  Alternatively, we could use -P to disable post-sign verification,
# but we actually do want post-sign verification to happen to ensure the zone
# is correct before we break it on purpose.
$SETTIME -R none -D none -K ns1 "$standby1" > /dev/null
$SIGNER -Sg -K ns1 -N unixtime -O full -o . -f signer.out.$n ns1/root.db > /dev/null 2>/dev/null
cp -f ns1/root.db.signed ns1/root.db.tmp
BADSIG="SVn2tLDzpNX2rxR4xRceiCsiTqcWNKh7NQ0EQfCrVzp9WEmLw60sQ5kP xGk4FS/xSKfh89hO2O/H20Bzp0lMdtr2tKy8IMdU/mBZxQf2PXhUWRkg V2buVBKugTiOPTJSnaqYCN3rSfV1o7NtC1VNHKKK/D5g6bpDehdn5Gaq kpBhN+MSCCh9OZP2IT20luS1ARXxLlvuSVXJ3JYuuhTsQXUbX/SQpNoB Lo6ahCE55szJnmAxZEbb2KOVnSlZRA6ZBHDhdtO0S4OkvcmTutvcVV+7 w53CbKdaXhirvHIh0mZXmYk2PbPLDY7PU9wSH40UiWPOB9f00wwn6hUe uEQ1Qg=="
# Less than a second may have passed since ns1 was started.  If we call
# dnssec-signzone immediately, ns1/root.db.signed will not be reloaded by the
# subsequent "rndc reload ." call on platforms which do not set the
# "nanoseconds" field of isc_time_t, due to zone load time being seemingly
# equal to master file modification time.
sleep 1
sed -e "/ $rkeyid \./s, \. .*$, . $BADSIG," signer.out.$n > ns1/root.db.signed
mkeys_reload_on 1 || ret=1
mkeys_refresh_on 2 || ret=1
mkeys_status_on 2 > rndc.out.$n 2>&1 || ret=1
# one key listed
count=$(grep -c "keyid: " rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'keyid:' count ($count) != 1"; ret=1; }
# it's the original key id
count=$(grep -c "keyid: $originalid" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'keyid: $originalid' count ($count) != 1"; ret=1; }
# not revoked
count=$(grep -c "REVOKE" rndc.out.$n) || true
[ "$count" -eq 0 ] || { echo_i "'REVOKE' count ($count) != 0"; ret=1; }
# trust is still current
count=$(grep -c "trust" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'trust' count != 1"; ret=1; }
count=$(grep -c "trusted since" rndc.out.$n) || true
[ "$count" -eq 1 ] || { echo_i "'trusted since' count != 1"; ret=1; }
if [ $ret != 0 ]; then echo_i "failed"; fi
status=$((status+ret))

However, it is possible to look at the presence/absence of certain files among the test artifacts and also to look at file timestamps, so that some scenarios can be ruled out. In this case, ns1/root.db.tmp did not exist, so execution did not reach the cp -f line. This meant that only dnssec-revoke, dnssec-settime, and dnssec-signzone could have failed. However, since there were three key files in ns1 newer than the $original one (meaning that dnssec-revoke and dnssec-settime did their job), dnssec-signzone was the primary suspect. I ran its invocation from the test manually on the artifacts and...

$ dnssec-signzone -Sg -K ns1 -N unixtime -O full -o . -f signer.out.19 ns1/root.db 
Fetching ./ECDSAP384SHA384/25503 (KSK) from key repository.
Fetching ./ECDSAP256SHA256/37163 (KSK) from key repository.
Fetching ./ECDSAP384SHA384/24825 (ZSK) from key repository.
dnssec-signzone: warning: Serial number would not advance, using increment method instead
Verifying the zone using the following algorithms:
- ECDSAP256SHA256
Missing ZSK for algorithm ECDSAP256SHA256
Missing self-signed KSK for algorithm ECDSAP384SHA384
No correct ECDSAP256SHA256 signature for . NSEC
No correct ECDSAP256SHA256 signature for . SOA
No correct ECDSAP256SHA256 signature for . NS
No correct ECDSAP256SHA256 signature for example NSEC
No correct ECDSAP256SHA256 signature for example TXT
No correct ECDSAP256SHA256 signature for a.root-servers.nil NSEC
No correct ECDSAP256SHA256 signature for a.root-servers.nil A
No correct ECDSAP256SHA256 signature for tld NSEC
The zone is not fully signed for the following algorithms:
 ECDSAP256SHA256
 ECDSAP384SHA384
.
DNSSEC completeness test failed.
Zone verification failed (failure)

But wait, how is it even possible that this zone is signed using multiple algorithms?

$ git grep -F _ALGORITHM bin/tests/system/mkeys/ | wc -l
13
$ git grep -F _ALGORITHM bin/tests/system/mkeys/ | grep -vF DEFAULT_ALGORITHM
$

The *_ALGORITHM environment variables are set by bin/tests/system/get_algorithms.py. The script is written in a way that allows the algorithms used to be chosen randomly from a specific set. The mkeys test takes advantage of that feature and sets ALGORITHM_SET to ecc_default. The script tries to ensure a stable set of algorithms is used for each system test run by seeding its RNG with a value derived from the current time. This works most of the time, but if we get really unlucky, setup.sh can be run during one "time slot" while tests.sh is run during another. This is exactly what happened in this case:

------------------------------ Captured log setup ------------------------------
2023-07-06 14:59:58 INFO:mkeys switching to tmpdir: /builds/isc-private/bind9/bin/tests/system/mkeys_tmp_5qq267_u
2023-07-06 14:59:58 INFO:mkeys test started: mkeys/tests_sh_mkeys.py
2023-07-06 14:59:58 INFO:mkeys using port range: <20583, 20602>
------------------------------ Captured log call -------------------------------
2023-07-06 15:00:14 INFO:mkeys     I:mkeys_tmp_5qq267_u:check for signed record (1)
2023-07-06 15:00:14 INFO:mkeys     I:mkeys_tmp_5qq267_u:check positive validation with valid trust anchor (2)
2023-07-06 15:00:14 INFO:mkeys     I:mkeys_tmp_5qq267_u:check for failed validation due to wrong key in managed-keys (3)
...

(Note that when the test started: ... line is logged, the script actually runs setup.sh first. tests.sh is run afterwards.)

This problem happens very rarely, so I am not sure if we need to do anything about it, but it felt right to open an issue so that others are aware that this is a thing. mkeys is the only system test that currently sets the ALGORITHM_SET variable, so exposure is minimal. If we migrate more tests to variable algorithms, this might become a more pressing issue to address.