System test setup sometimes fails because of missing port assignments
Example test job failures:
The problem here is that:
the sleep period in case of failure to acquire the
bin/tests/system/get_ports.locklock file is fixed, which leads to a "thundering herd" type of problem, where (depending on how processes are scheduled by the operating system) multiple system tests try to acquire the lock file at the same time and subsequently sleep for 1 second, only for the same situation to likely happen the next time around,
the lock file is being locked and then unlocked for every single port assignment made, not just once for the entire range of ports a system test should use; in other words, the lock file is currently locked and unlocked 13 times per system test.
Given the above, in certain cases, with the retry count set to 10
attempts and up to 6 system tests being run in parallel in every
GitLab CI job, a given system test may simply not manage to acquire the
lock file before it reaches the retry limit. This is what happened in
all the jobs linked to above (search for
PORTS:, in the job logs).
Another (arguably less severe) problem with this design is that it
results in delayed test startup when multiple system tests are started
in parallel using
make -jX check (also due to the lock file contention
issue described above). Here are some sample timings produced with a
bin/tests/system/run.sh which only runs
bin/tests/system/get_ports.sh and then immediately exits (without
$ time make -C bin/tests/system/ -j6 check TESTS="acl" ... real 0m0,248s user 0m0,217s sys 0m0,054s $ time make -C bin/tests/system/ -j6 check TESTS="acl additional" ... real 0m1,294s user 0m0,345s sys 0m0,061s $ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone" ... real 0m2,327s user 0m0,458s sys 0m0,126s $ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query" ... real 0m3,312s user 0m0,605s sys 0m0,164s $ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth" ... real 0m4,327s user 0m0,754s sys 0m0,207s $ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign" ... real 0m5,352s user 0m0,859s sys 0m0,274s $ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign builtin" ... real 0m5,343s user 0m0,941s sys 0m0,259s $ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign builtin cacheclean" ... real 0m5,384s user 0m1,053s sys 0m0,276s
What this shows is that it takes almost 6 seconds to actually start all of the requested 6 tests in parallel. This initial "problem" resolves itself shortly afterwards, though, as the started tests take various amounts of time to finish and thus the lock file contention issue mostly disappears for tests started later on.
It would be nice to come up with a version of the
which does not suffer from the above issues.