"statistics" system test is prone to races and does not preserve forensic data
It was hard to figure out what was happening in https://gitlab.isc.org/isc-private/bind9/-/jobs/674674 because the statistics
system test does not preserve enough forensic data to troubleshoot this particular failure mode easily, but I think I got to the bottom of what is happening. I believe the problem is that ns3
retries the SOA query for the a-slave
zone right before the first use of rndc stats
but the response to that query is received after the first use of rndc stats
. This retried SOA query is usually sent after the first use of rndc stats
and replied to before the second use of rndc stats
; if that is not the case, statistics get skewed. Here is how things look typically:
$ grep -E "(created socket| soa_query:|refresh:|'stats')" ns3/named.run | sed -E "/'stats'/{s|(.*)|---------- \1 ----------|}"
14-Feb-2020 13:39:51.279 dispatch 0x7faf54030ce0: created socket 0x7faf5ec9f600
14-Feb-2020 13:39:51.279 dispatch 0x7faf54030720: created socket 0x7faf5ec9f778
14-Feb-2020 13:39:51.279 dispatch 0x7faf54030160: created socket 0x7faf5ec9f8f0
14-Feb-2020 13:39:51.279 dispatch 0x7faf5402fba0: created socket 0x7faf5ec9fa68
14-Feb-2020 13:39:51.303 soa_query: zone a-slave/IN: enter
14-Feb-2020 13:39:51.303 dispatch 0x7faf54652fa0: created socket 0x7faf5a41e600
14-Feb-2020 13:39:51.304 zone a-slave/IN: refresh: unexpected rcode (NXDOMAIN) from master 10.53.0.1#5300 (source 10.53.0.3#0)
---------- 14-Feb-2020 13:39:51.537 received control channel command 'stats' ----------
14-Feb-2020 13:39:51.803 soa_query: zone a-slave/IN: enter
14-Feb-2020 13:39:51.803 dispatch 0x7faf54652fa0: created socket 0x7faf5a41ebe0
14-Feb-2020 13:39:51.807 zone a-slave/IN: refresh: unexpected rcode (NXDOMAIN) from master 10.53.0.1#5300 (source 0.0.0.0#0)
---------- 14-Feb-2020 13:39:53.564 received control channel command 'stats' ----------
while this is what happened in the job linked to above:
$ grep -E "(created socket| soa_query:|refresh:|'stats')" ns3/named.run | sed -E "/'stats'/{s|(.*)|---------- \1 ----------|}"
13-Feb-2020 14:47:44.500 dispatch 0x80545b000: created socket 0x80508d838
13-Feb-2020 14:47:44.500 dispatch 0x80545aa00: created socket 0x80508d990
13-Feb-2020 14:47:44.500 dispatch 0x80545a400: created socket 0x80508dae8
13-Feb-2020 14:47:44.500 dispatch 0x805459e00: created socket 0x80508dc40
13-Feb-2020 14:47:44.549 soa_query: zone a-slave/IN: enter
13-Feb-2020 14:47:44.550 dispatch 0x805743400: created socket 0x8054e0508
13-Feb-2020 14:47:44.552 zone a-slave/IN: refresh: unexpected rcode (NXDOMAIN) from master 10.53.0.1#12400 (source 10.53.0.3#0)
13-Feb-2020 14:47:45.048 soa_query: zone a-slave/IN: enter
13-Feb-2020 14:47:45.048 dispatch 0x805743400: created socket 0x80508d430
---------- 13-Feb-2020 14:47:45.058 received control channel command 'stats' ----------
13-Feb-2020 14:47:45.069 zone a-slave/IN: refresh: unexpected rcode (NXDOMAIN) from master 10.53.0.1#12400 (source 0.0.0.0#0)
---------- 13-Feb-2020 14:47:47.149 received control channel command 'stats' ----------
Thus, I believe that the active socket count was 1 less than the expected value in the second ns3/named.stats
file being examined.
AFAICT, the a-slave
zone is only relevant in the checking bind9.xsl vs xml
check, it is not configured on ns1
. If it was actually configured, the SOA query would not need to be retried due to an NXDOMAIN response, hopefully making the test more reliable.
Forensic data preservation should also be improved for this test.