"notify" system test fails intermittently due to wait_for_log() timing out for "retries exceeded"

https://gitlab.isc.org/isc-private/bind9/-/jobs/2441061

I:notify:checking notify retries expire within 45 seconds (10) (11)
I:notify:exceeded time limit waiting for 'retries exceeded' in ns3/named.run
I:notify:failed

I sense that the test code does not fulfill the author's intentions. Specifically, wait_for_log() is used without nextpartreset(), which the code comment for wait_for_log() warns against:

# wait_for_log: wait until message $2 in file $3 appears.  Bail out after
# $1 seconds.  This needs to be used in conjunction with a prior call to
# nextpart() or nextpartreset() on the same file to guarantee the offset is
# set correctly.  Tests using wait_for_log() are responsible for cleaning up
# the created <file>.prev files.

What I believe happened in the job linked to above is that a prior call to wait_for_log() (wait_for_log 45 "transfer of 'example/IN' from 10.53.0.2#.*success") "ate" the retries exceeded log lines which the failing wait_for_log() invocation waits for. wait_for_log() is not idempotent!

There are some retries exceeded lines in ns3/named.run, so I doubt there is some problem with the actual code being tested here:

$ grep -F "retries exceeded" ns3/named.run
11-Apr-2022 14:15:24.141 zone example/IN: notify to 10.53.10.53#30905: retries exceeded
11-Apr-2022 14:15:38.648 zone example/IN: notify to 10.53.10.53#30905: retries exceeded
11-Apr-2022 14:15:46.309 zone example/IN: notify to 10.53.10.53#30905: retries exceeded

I am not sure what the original intention was precisely, but there are two possibilities:

if retries exceeded is expected to appear anywhere in ns3/named.run, just use grep -F "retries exceeded" ns3/named.run
if retries exceeded is expected to appear in ns3/named.run between two specific points in time, nextpartreset()/nextpart() must be used to indicate the "starting point".

Assigning @fanf as the author of !6051 (merged)/!6096 (merged).

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information