Inline signing: AXFR before zone load on a secondary server prevents zone from being served
Consider an inline-signing secondary server ("bump-in-the-wire") that is starting up with a previous version of a signed zone available locally.
The typical chain of events is that the server first loads the zone from storage and then listens for incoming NOTIFY messages indicating that the unsigned zone has changed. When the unsigned zone gets updated, its signed counterpart gets updated accordingly and the process repeats itself. This works as expected.
However, if the secondary server receives a NOTIFY for an inline-signed zone and manages to transfer the unsigned zone in before attempting to load the previous version of the signed zone from storage, things will break in a way which prevents the inline-signed zone from being served:
13-Oct-2020 15:38:01.994 client @0x7f41ac0012f8 10.53.0.2#60440: received notify for zone 'bits'
13-Oct-2020 15:38:01.994 zone bits/IN (unsigned): notify from 10.53.0.2#60440: no serial
13-Oct-2020 15:38:01.994 queue_soa_query: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.461 soa_query: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.464 refresh_callback: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.464 refresh_callback: zone bits/IN (unsigned): serial: new 2011072450, old not loaded
13-Oct-2020 15:38:02.464 queue_xfrin: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.464 zone bits/IN (unsigned): Transfer started.
13-Oct-2020 15:38:02.464 zone bits/IN (unsigned): no database exists yet, requesting AXFR of initial version from 10.53.0.2#32589
13-Oct-2020 15:38:02.464 transfer of 'bits/IN (unsigned)' from 10.53.0.2#32589: connected using 10.53.0.3#43661
13-Oct-2020 15:38:02.464 transfer of 'bits/IN (unsigned)' from 10.53.0.2#32589: sent request data
13-Oct-2020 15:38:02.467 transfer of 'bits/IN (unsigned)' from 10.53.0.2#32589: received 168 bytes
;bits. IN AXFR
bits. 0 IN SOA ns2.bits. . 2011072450 20 20 1814400 3600
bits. 300 IN NS ns3.bits.
added.bits. 0 IN A 1.2.3.4
ns2.bits. 300 IN A 10.53.0.2
ns3.bits. 300 IN A 10.53.0.3
bits. 0 IN SOA ns2.bits. . 2011072450 20 20 1814400 3600
13-Oct-2020 15:38:02.467 transfer of 'bits/IN (unsigned)' from 10.53.0.2#32589: got nonincremental response
13-Oct-2020 15:38:02.467 dns_zone_verifydb: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.467 zone bits/IN (unsigned): replacing zone database
13-Oct-2020 15:38:02.467 zone bits/IN (unsigned): zone transfer finished: success
13-Oct-2020 15:38:02.467 zone bits/IN (unsigned): transferred serial 2011072450
13-Oct-2020 15:38:02.467 zone_needdump: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.467 zone_settimer: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.467 zone_settimer: zone bits/IN (unsigned): enter
13-Oct-2020 15:38:02.467 transfer of 'bits/IN (unsigned)' from 10.53.0.2#32589: Transfer status: success
13-Oct-2020 15:38:02.467 transfer of 'bits/IN (unsigned)' from 10.53.0.2#32589: Transfer completed: 1 messages, 6 records, 168 bytes, 0.003 secs (56000 bytes/sec) (serial 2011072450)
13-Oct-2020 15:38:02.467 transfer of 'bits/IN (unsigned)' from 10.53.0.2#32589: freeing transfer context
13-Oct-2020 15:38:02.467 zone bits/IN (signed): number of nodes in database: 4
13-Oct-2020 15:38:02.467 zone bits/IN (signed): journal rollforward failed: journal out of sync with zone
13-Oct-2020 15:38:02.467 zone bits/IN (signed): not loaded due to errors.
13-Oct-2020 15:38:02.467 zone_postload: zone bits/IN (signed): done
13-Oct-2020 15:38:02.467 zone_needdump: zone bits/IN (signed): enter
13-Oct-2020 15:38:02.467 zone bits/IN (signed): receive_secure_db: out of range
The underlying cause is that in the broken case, lib/dns/journal.c:roll_forward() retrieves the latest SOA serial number for the zone from the AXFR rather than from the local copy of the signed zone.
The time window during which this can happen is rather slim, so I do not think it is a serious issue, but I decided to open a bug report anyway, because things are certainly working suboptimally here - it seems to me that named should be able to recover from such a sequence of events just fine, serving the signed zone in the end.
This was found during release testing for BIND 9.16.8. I prepared a crude patch which allows reliably triggering this issue in the inline system test:
diff --git a/bin/tests/system/inline/tests.sh b/bin/tests/system/inline/tests.sh
index 7d7df7487f4..430f6fcbb77 100755
--- a/bin/tests/system/inline/tests.sh
+++ b/bin/tests/system/inline/tests.sh
@@ -475,7 +475,9 @@ status=`expr $status + $ret`
n=`expr $n + 1`
echo_i "restart bump in the wire signer server ($n)"
ret=0
+export SKIPLOAD="bits"
start_server --noclean --restart --port ${PORT} inline ns3 || ret=1
+unset SKIPLOAD
if [ $ret != 0 ]; then echo_i "failed"; fi
status=`expr $status + $ret`
diff --git a/lib/dns/zone.c b/lib/dns/zone.c
index b8cd90b129a..57bb239c9d5 100644
--- a/lib/dns/zone.c
+++ b/lib/dns/zone.c
@@ -1995,6 +1995,16 @@ zone_load(dns_zone_t *zone, unsigned int flags, bool locked) {
REQUIRE(DNS_ZONE_VALID(zone));
+ {
+ char origin[DNS_NAME_FORMATSIZE];
+ char *skipload = getenv("SKIPLOAD");
+
+ dns_name_format(&zone->origin, origin, sizeof(origin));
+ if (skipload != NULL && !strcmp(skipload, origin)) {
+ return (ISC_R_SUCCESS);
+ }
+ }
+
if (!locked) {
LOCK_ZONE(zone);
}
Applying the above patch should result in:
I:inline:stop bump in the wire signer server (29)
I:inline:restart bump in the wire signer server (30)
I:inline:checking YYYYMMDDVV (2011072450) serial on hidden primary (31)
I:inline:checking YYYYMMDDVV (2011072450) serial in signed zone (32)
I:inline:failed
I:inline:checking YYYYMMDDVV (2011072450) serial on hidden primary, noixfr (33)
I:inline:checking YYYYMMDDVV (2011072450) serial in signed zone, noixfr (34)
and log lines similar to the ones quoted above appearing in ns3/named.run.
AFAICT, all maintained branches are affected.