Recursive "stress" tests hang intermittently for BIND 9.16(-S)
For the past few days, recursive "stress" tests for v9_16
and
v9_16_sub
have been intermittently failing in weird ways:
-
hangs during statistics collection at the end of the test:
-
Fedora, amd64:
-
Fedora, arm64:
-
-
core dumps:
-
Fedora, amd64:
-
Not all recursive "stress" test jobs are failing, though, and no such
problems have been occurring for v9_18
or main
. Given that the
problem started within the past week, !6419 (merged) (9.16-specific) is the most
likely culprit, but a quick glance at the changes it contains does not
raise any immediate red flags.
I managed to collect a core dump from a hung named
instance and I
found out that all working threads are blocked on the same mutex (one of
the fetch context bucket locks). This suggests a missing mutex unlock
somewhere and explains why the server is unresponsive to queries, rndc
commands, etc. The question is how it got into that state, given that
it was seemingly working fine before !6419 (merged) and that the latter does not
look immediately suspicious.
This issue is a blocker for releasing BIND 9.16.31 and 9.16.31-S1.