bin/tests/system/start.pl improperly detects server startup after all servers are restarted
bin/tests/system/start.pl
waits until a running
message is logged by
a given name server instance before attempting to send a
version.bind/CH/TXT
query to it. The idea behind this was to make the
script wait until named
loads all the zones it is configured to serve
before telling the system test framework that a given server is ready to
use; this prevents the need to add boilerplate code that waits for a
specific zone to be loaded to each test expecting that.
However, when looking for running
messages,
bin/tests/system/start.pl
assumes that the existence of any such
message in the named.run
file means that a given instance has finished
loading all zones. Meanwhile, some system tests restart all the server
instances throughout their lifetime (some even do that a few times), for
example to run Python-based tests. bin/tests/system/start.pl
handles
such a scenario incorrectly - as soon as it finds any running
message in the named.run
file it inspects and it gets a response to a
version.bind/CH/TXT
query, it tells the system test framework that a
given server is ready to use, which might not be true - it is possible
that only the version.bind
zone has been loaded at that point and the
running
message was logged by a previously-shutdown named
instance.
This triggers intermittent failures for Python-based tests, e.g.:
https://gitlab.isc.org/isc-private/bind9/-/jobs/2819525
D:doth: # Check whether a response was received and whether it is sane.
D:doth: assert response
D:doth: assert query.id == response.id
D:doth:> assert len(response.answer) == 1
D:doth:E assert 0 == 1
D:doth:E +0
D:doth:E -1
D:doth:
D:doth:tests_gnutls.py:102: AssertionError
In this case, the response received was a SERVFAIL response (while a NOERROR response was expected):
10-Oct-2022 08:08:10.578 TLS server session created for 10.53.0.1#36440 on 10.53.0.1#9042
10-Oct-2022 08:08:10.578 clientmgr @0x7b2000014100 attach: 2
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(<unknown-query>): query_reset
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 (no-peer): allocate new client
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 10.53.0.1#36440: TCP request
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 10.53.0.1#36440: using view '_default'
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 10.53.0.1#36440: request is not signed
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 10.53.0.1#36440: recursion not available (recursion not enabled for view)
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(<unknown-query>): ns_query_start
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(example/SOA): qctx_init
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(example/SOA): client attr:0x20001, query attr:0x700, restarts:0, origqname:example, timer:0, authdb:0, referral:0
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(example/SOA): ns__query_start
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(example/SOA): ns__query_start: query_getdb failed
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(example/SOA): ns_query_done
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 10.53.0.1#36440 (example): query failed (zone not loaded) for example/IN/SOA at query.c:5644
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 10.53.0.1#36440 (example): reset client
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(example/SOA): query_reset
10-Oct-2022 08:08:10.578 client @0x7b7c00030368 10.53.0.1#36440: freeing client
10-Oct-2022 08:08:10.578 query client=0x7b7c00030368 thread=0x7f6305f7b640(<unknown-query>): query_reset
...
10-Oct-2022 08:08:10.726 zone example/IN: loaded serial 1397051952