System test setup sometimes fails because of missing port assignments
Example test job failures:
- https://gitlab.isc.org/isc-projects/bind9/-/jobs/1505564
- https://gitlab.isc.org/isc-projects/bind9/-/jobs/1505566
- https://gitlab.isc.org/isc-projects/bind9/-/jobs/1505642
- https://gitlab.isc.org/isc-projects/bind9/-/jobs/1505647
- https://gitlab.isc.org/isc-projects/bind9/-/jobs/1506024
The problem here is that:
-
the sleep period in case of failure to acquire the
bin/tests/system/get_ports.lock
lock file is fixed, which leads to a "thundering herd" type of problem, where (depending on how processes are scheduled by the operating system) multiple system tests try to acquire the lock file at the same time and subsequently sleep for 1 second, only for the same situation to likely happen the next time around, -
the lock file is being locked and then unlocked for every single port assignment made, not just once for the entire range of ports a system test should use; in other words, the lock file is currently locked and unlocked 13 times per system test.
Given the above, in certain cases, with the retry count set to 10
attempts and up to 6 system tests being run in parallel in every
GitLab CI job, a given system test may simply not manage to acquire the
lock file before it reaches the retry limit. This is what happened in
all the jobs linked to above (search for PORTS:,
in the job logs).
Another (arguably less severe) problem with this design is that it
results in delayed test startup when multiple system tests are started
in parallel using make -jX check
(also due to the lock file contention
issue described above). Here are some sample timings produced with a
version of bin/tests/system/run.sh
which only runs
bin/tests/system/get_ports.sh
and then immediately exits (without
running any setup.sh
or tests.sh
script):
$ time make -C bin/tests/system/ -j6 check TESTS="acl"
...
real 0m0,248s
user 0m0,217s
sys 0m0,054s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional"
...
real 0m1,294s
user 0m0,345s
sys 0m0,061s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone"
...
real 0m2,327s
user 0m0,458s
sys 0m0,126s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query"
...
real 0m3,312s
user 0m0,605s
sys 0m0,164s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth"
...
real 0m4,327s
user 0m0,754s
sys 0m0,207s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign"
...
real 0m5,352s
user 0m0,859s
sys 0m0,274s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign builtin"
...
real 0m5,343s
user 0m0,941s
sys 0m0,259s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign builtin cacheclean"
...
real 0m5,384s
user 0m1,053s
sys 0m0,276s
What this shows is that it takes almost 6 seconds to actually start all of the requested 6 tests in parallel. This initial "problem" resolves itself shortly afterwards, though, as the started tests take various amounts of time to finish and thus the lock file contention issue mostly disappears for tests started later on.
It would be nice to come up with a version of the get_ports.sh
script
which does not suffer from the above issues.