Windows-specific system test framework glitches
On Windows, system tests may (rarely) fail because of system test framework imperfections.
Two types of intermittent issues have been observed in the past few months:
- Issues with starting servers (example).
```
S:addzone:2020-07-03T09:45:21+0100
T:addzone:1:A
A:addzone:System test addzone
I:addzone:PORTS:,5150,5151,5152,5153,5154,5155,5156,5157,5158
Value "" invalid for option p (number expected)
I:ns3:ns3/sign.sh
I:addzone:starting servers
Value "" invalid for option port (number expected)
usage: start.pl [--noclean] [--restart] [--port <port>] test-directory [server-directory [server-options]]
I:addzone:starting servers failed
R:addzone:FAIL
E:addzone:2020-07-03T09:45:25+0100
```
This failure mode has not been investigated closely, but it looks like an issue with the `bin/tests/system/get_ports.sh` script on `main` (this script is only present on `main`): it seems that it can fail to set the `PORT` environment variable in certain circumstances, which prevents test `named` instances from being started.

- Issues with PID reuse (example).
```
S:rndc:2020-11-16T06:56:52-0800
T:rndc:1:A
A:rndc:System test rndc
I:rndc:PORTRANGE:11000 - 11099
I:rndc:starting servers
I:rndc:preparing (1)
I:rndc:rndc freeze
I:rndc:checking zone was dumped (2)
...
S:rrsetorder:2020-11-16T06:57:52-0800
T:rrsetorder:1:A
A:rrsetorder:System test rrsetorder
I:rrsetorder:PORTRANGE:11500 - 11599
I:rndc:exit status: 0
I:rndc:stopping servers
I:rrsetorder:starting servers
I:rrsetorder:Order 'fixed' disabled at compile time
I:rrsetorder:Checking order fixed behaves as cyclic when disabled (master)
I:rrsetorder:Checking order cyclic (master + additional)
I:rrsetorder:Checking order cyclic (master)
I:rrsetorder:Checking order random (master)
I:rrsetorder:Random selection return 12 of 24 possible orders in 36 samples
I:rrsetorder:Checking order none (primary)
I:rrsetorder:Checking order cyclic (slave + additional)
I:rrsetorder:Checking order cyclic (slave)
I:rrsetorder:Checking order random (slave)
I:rndc:ns4 didn't die when sent a SIGTERM
I:rndc:stopping servers failed
R:rndc:FAIL
I:rrsetorder:Random selection return 12 of 24 possible orders in 36 samples
E:rndc:2020-11-16T06:58:55-0800
I:rrsetorder:Checking order none (secondary)
I:rrsetorder:Shutting down slave
I:rrsetorder:Checking for slave's on disk copy of zone
I:rrsetorder:Re-starting slave
I:rrsetorder:Checking order cyclic (slave + additional, loaded from disk)
I:rrsetorder:Checking order cyclic (slave loaded from disk)
I:rrsetorder:Checking order random (slave loaded from disk)
I:rrsetorder:Random selection return 12 of 24 possible orders in 36 samples
I:rrsetorder:Checking order none (secondary loaded from disk)
I:rrsetorder:Checking order cyclic (cache + additional)
I:rrsetorder:failed
I:rrsetorder:Checking order cyclic (cache)
I:rrsetorder:failed
I:rrsetorder:Checking order random (cache)
I:rrsetorder:Random selection return 0 of 24 possible orders in 36 samples
I:rrsetorder:failed
I:rrsetorder:Checking order none (cache)
I:rrsetorder:failed
I:rrsetorder:Checking default order (cache)
I:rrsetorder:Default selection return 0 of 24 possible orders in 36 samples
I:rrsetorder:failed
I:rrsetorder:Checking default order no match in rrset-order (cache)
I:rrsetorder:failed
I:rrsetorder:exit status: 5
I:rrsetorder:stopping servers
I:rrsetorder:ns1 died before a SIGTERM was sent
I:rrsetorder:stopping servers failed
R:rrsetorder:FAIL
E:rrsetorder:2020-11-16T07:10:23-0800
```
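Since the first failure mode only surfaces as `start.pl`'s opaque `Value "" invalid for option port` error, a guard that fails fast when the port variable is missing would make such runs easier to diagnose. The sketch below is hypothetical (the `check_port` helper is not part of the framework); it assumes only that `get_ports.sh` is supposed to set `PORT`:

```shell
#!/bin/sh
# Hypothetical guard: fail fast when get_ports.sh did not set PORT,
# instead of letting start.pl fail later with the opaque
# 'Value "" invalid for option port (number expected)' error.
check_port() {
    if [ -z "${PORT:-}" ]; then
        echo "FATAL: PORT is unset; get_ports.sh likely failed" >&2
        return 1
    fi
    echo "starting servers on base port $PORT"
}

PORT=5300   # example value; normally assigned by get_ports.sh
check_port  # prints: starting servers on base port 5300
```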
A similar failure mode was triggered in the course of BIND 9.17.1 release testing. The root cause of this problem is that signal handlers do not work on Windows, so when SIGTERM is sent to a `named` process, it dies immediately without cleaning up its PID file. To work around this, the system test framework relies on `kill` returning an error for non-existent PIDs to detect when a given `named` instance is no longer alive. However, Windows tends to recycle PIDs. If `named` instances belonging to one system test are shut down while `named` instances belonging to another system test are just starting up, the system test framework may "confuse" `named` instances from these two tests with each other:
- `stop.pl` attempts to stop `named` instance `ns1` for system test `testA`. It sends it a SIGTERM. `ns1` for `testA` exits without cleaning up its PID file.
- `start.pl` starts up `named` instance `ns1` for system test `testB`. It gets assigned the same PID as `ns1` for `testA`, which has just exited.
- `stop.pl` tests whether `ns1` for `testA` is still alive. It reads its PID file and attempts to `kill` the PID it read. Since `ns1` for `testB` has the same PID, `stop.pl` assumes `ns1` for `testA` is still alive.
- After 1 minute, `stop.pl` decides to send a SIGABRT to `ns1` for `testA`, but that one is already long gone; instead, the signal hits `ns1` for `testB`, killing it (possibly in the middle of `testB`). `stop.pl` reports that `ns1` for `testA` did not die when it was sent a SIGTERM (even though it did).
- `stop.pl` attempts to `kill` `ns1` for `testB`, but it was already `kill`ed beforehand. `stop.pl` reports that `ns1` for `testB` died before it was sent a SIGTERM.
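The race above hinges on using `kill`'s success or failure as a liveness test for a PID read from a file. A minimal illustration of that kind of check (not the actual `stop.pl` code) shows why it is only as trustworthy as the PID itself:

```shell
#!/bin/sh
# Liveness check in the style the framework relies on: kill -0 only
# answers "does *some* process with this PID exist?". If the PID has been
# recycled by an unrelated named instance, the check passes even though
# the original process is long gone.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

is_alive "$$" && echo "PID $$ looks alive"       # our own PID exists

sh -c 'exit 0' &
dead_pid=$!
wait "$dead_pid"                                  # reap the child
is_alive "$dead_pid" || echo "PID $dead_pid looks dead"
```

On POSIX systems a stronger check can compare the process start time against the PID file's timestamp; on Windows, where signal handlers are unavailable to `named`, avoiding PID reuse windows (e.g. by not overlapping test teardown and startup) is the more practical mitigation.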