System test setup sometimes fails because of missing port assignments

Example test job failures:

The problem here is that:

the sleep period in case of failure to acquire the bin/tests/system/get_ports.lock lock file is fixed, which leads to a "thundering herd" type of problem, where (depending on how processes are scheduled by the operating system) multiple system tests try to acquire the lock file at the same time and subsequently sleep for 1 second, only for the same situation to likely happen the next time around,
the lock file is being locked and then unlocked for every single port assignment made, not just once for the entire range of ports a system test should use; in other words, the lock file is currently locked and unlocked 13 times per system test.

Given the above, in certain cases, with the retry count set to 10 attempts and up to 6 system tests being run in parallel in every GitLab CI job, a given system test may simply not manage to acquire the lock file before it reaches the retry limit. This is what happened in all the jobs linked to above (search for PORTS:, in the job logs).

Another (arguably less severe) problem with this design is that it results in delayed test startup when multiple system tests are started in parallel using make -jX check (also due to the lock file contention issue described above). Here are some sample timings produced with a version of bin/tests/system/run.sh which only runs bin/tests/system/get_ports.sh and then immediately exits (without running any setup.sh or tests.sh script):

$ time make -C bin/tests/system/ -j6 check TESTS="acl"
...
real	0m0,248s
user	0m0,217s
sys	0m0,054s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional"
...
real	0m1,294s
user	0m0,345s
sys	0m0,061s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone"
...
real	0m2,327s
user	0m0,458s
sys	0m0,126s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query"
...
real	0m3,312s
user	0m0,605s
sys	0m0,164s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth"
...
real	0m4,327s
user	0m0,754s
sys	0m0,207s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign"
...
real	0m5,352s
user	0m0,859s
sys	0m0,274s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign builtin"
...
real	0m5,343s
user	0m0,941s
sys	0m0,259s
$ time make -C bin/tests/system/ -j6 check TESTS="acl additional addzone allow-query auth autosign builtin cacheclean"
...
real	0m5,384s
user	0m1,053s
sys	0m0,276s

What this shows is that it takes almost 6 seconds to actually start all of the requested 6 tests in parallel. This initial "problem" resolves itself shortly afterwards, though, as the started tests take various amounts of time to finish and thus the lock file contention issue mostly disappears for tests started later on.

It would be nice to come up with a version of the get_ports.sh script which does not suffer from the above issues.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information