Windows-specific system test framework glitches
On Windows, system tests may (rarely) fail because of system test framework imperfections.
Two types of intermittent issues have been observed in the past few months:
- Issues with starting servers (example).
```
S:addzone:2020-07-03T09:45:21+0100
T:addzone:1:A
A:addzone:System test addzone
I:addzone:PORTS:,5150,5151,5152,5153,5154,5155,5156,5157,5158
Value "" invalid for option p (number expected)
I:ns3:ns3/sign.sh
I:addzone:starting servers
Value "" invalid for option port (number expected)
usage: start.pl [--noclean] [--restart] [--port <port>] test-directory [server-directory [server-options]]
I:addzone:starting servers failed
R:addzone:FAIL
E:addzone:2020-07-03T09:45:25+0100
```
This failure mode has not been investigated closely, but it looks like an issue with the `bin/tests/system/get_ports.sh` script on `main` (this script is only present on `main`): it seems that it can fail to set the `PORT` environment variable in certain circumstances, which prevents test `named` instances from being started.

- Issues with PID reuse (example).
```
S:rndc:2020-11-16T06:56:52-0800
T:rndc:1:A
A:rndc:System test rndc
I:rndc:PORTRANGE:11000 - 11099
I:rndc:starting servers
I:rndc:preparing (1)
I:rndc:rndc freeze
I:rndc:checking zone was dumped (2)
...
S:rrsetorder:2020-11-16T06:57:52-0800
T:rrsetorder:1:A
A:rrsetorder:System test rrsetorder
I:rrsetorder:PORTRANGE:11500 - 11599
I:rndc:exit status: 0
I:rndc:stopping servers
I:rrsetorder:starting servers
I:rrsetorder:Order 'fixed' disabled at compile time
I:rrsetorder:Checking order fixed behaves as cyclic when disabled (master)
I:rrsetorder:Checking order cyclic (master + additional)
I:rrsetorder:Checking order cyclic (master)
I:rrsetorder:Checking order random (master)
I:rrsetorder:Random selection return 12 of 24 possible orders in 36 samples
I:rrsetorder:Checking order none (primary)
I:rrsetorder:Checking order cyclic (slave + additional)
I:rrsetorder:Checking order cyclic (slave)
I:rrsetorder:Checking order random (slave)
I:rndc:ns4 didn't die when sent a SIGTERM
I:rndc:stopping servers failed
R:rndc:FAIL
I:rrsetorder:Random selection return 12 of 24 possible orders in 36 samples
E:rndc:2020-11-16T06:58:55-0800
I:rrsetorder:Checking order none (secondary)
I:rrsetorder:Shutting down slave
I:rrsetorder:Checking for slave's on disk copy of zone
I:rrsetorder:Re-starting slave
I:rrsetorder:Checking order cyclic (slave + additional, loaded from disk)
I:rrsetorder:Checking order cyclic (slave loaded from disk)
I:rrsetorder:Checking order random (slave loaded from disk)
I:rrsetorder:Random selection return 12 of 24 possible orders in 36 samples
I:rrsetorder:Checking order none (secondary loaded from disk)
I:rrsetorder:Checking order cyclic (cache + additional)
I:rrsetorder:failed
I:rrsetorder:Checking order cyclic (cache)
I:rrsetorder:failed
I:rrsetorder:Checking order random (cache)
I:rrsetorder:Random selection return 0 of 24 possible orders in 36 samples
I:rrsetorder:failed
I:rrsetorder:Checking order none (cache)
I:rrsetorder:failed
I:rrsetorder:Checking default order (cache)
I:rrsetorder:Default selection return 0 of 24 possible orders in 36 samples
I:rrsetorder:failed
I:rrsetorder:Checking default order no match in rrset-order (cache)
I:rrsetorder:failed
I:rrsetorder:exit status: 5
I:rrsetorder:stopping servers
I:rrsetorder:ns1 died before a SIGTERM was sent
I:rrsetorder:stopping servers failed
R:rrsetorder:FAIL
E:rrsetorder:2020-11-16T07:10:23-0800
```
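Since the first failure mode only surfaces as `start.pl`'s opaque `Value "" invalid for option port` error, a guard that fails fast when the port variable is missing would make such runs easier to diagnose. The sketch below is hypothetical (the `check_port` helper is not part of the framework); it assumes only that `get_ports.sh` is supposed to set `PORT`:

```shell
#!/bin/sh
# Hypothetical guard: fail fast when get_ports.sh did not set PORT,
# instead of letting start.pl fail later with the opaque
# 'Value "" invalid for option port (number expected)' error.
check_port() {
    if [ -z "${PORT:-}" ]; then
        echo "FATAL: PORT is unset; get_ports.sh likely failed" >&2
        return 1
    fi
    echo "starting servers on base port $PORT"
}

PORT=5300   # example value; normally assigned by get_ports.sh
check_port  # prints: starting servers on base port 5300
```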
A similar failure mode was triggered in the course of BIND 9.17.1 release testing. The root cause of this problem is that signal handlers do not work on Windows, so when SIGTERM is sent to a `named` process, it dies immediately without cleaning up its PID file. To work around this, the system test framework relies on `kill` returning an error for non-existent PIDs to detect when a given `named` instance is no longer alive. However, Windows tends to recycle PIDs. If `named` instances belonging to one system test are shut down while `named` instances belonging to another system test are just starting up, the system test framework may "confuse" `named` instances from these two tests with each other:
- `stop.pl` attempts to stop `named` instance `ns1` for system test `testA`. It sends it a SIGTERM. `ns1` for `testA` exits without cleaning up its PID file.
- `start.pl` starts up `named` instance `ns1` for system test `testB`. It gets assigned the same PID as `ns1` for `testA`, which has just exited.
- `stop.pl` tests whether `ns1` for `testA` is still alive. It reads its PID file and attempts to `kill` the PID it read. Since `ns1` for `testB` has the same PID, `stop.pl` assumes `ns1` for `testA` is still alive.
- After 1 minute, `stop.pl` decides to send a SIGABRT to `ns1` for `testA`, but that one is already long gone; instead, the signal hits `ns1` for `testB`, killing it (possibly in the middle of `testB`). `stop.pl` reports that `ns1` for `testA` did not die when it was sent a SIGTERM (even though it did).
- `stop.pl` attempts to `kill` `ns1` for `testB`, but it was already `kill`ed beforehand. `stop.pl` reports that `ns1` for `testB` died before it was sent a SIGTERM.
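The race above hinges on using `kill`'s success or failure as a liveness test for a PID read from a file. A minimal illustration of that kind of check (not the actual `stop.pl` code) shows why it is only as trustworthy as the PID itself:

```shell
#!/bin/sh
# Liveness check in the style the framework relies on: kill -0 only
# answers "does *some* process with this PID exist?". If the PID has been
# recycled by an unrelated named instance, the check passes even though
# the original process is long gone.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

is_alive "$$" && echo "PID $$ looks alive"       # our own PID exists

sh -c 'exit 0' &
dead_pid=$!
wait "$dead_pid"                                  # reap the child
is_alive "$dead_pid" || echo "PID $dead_pid looks dead"
```

On POSIX systems a stronger check can compare the process start time against the PID file's timestamp; on Windows, where signal handlers are unavailable to `named`, avoiding PID reuse windows (e.g. by not overlapping test teardown and startup) is the more practical mitigation.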