"notify" system test fails intermittently due to wait_for_log() timing out for "retries exceeded"
https://gitlab.isc.org/isc-private/bind9/-/jobs/2441061
I:notify:checking notify retries expire within 45 seconds (10) (11)
I:notify:exceeded time limit waiting for 'retries exceeded' in ns3/named.run
I:notify:failed
I sense that the test code does not fulfill the author's intentions.
Specifically, wait_for_log()
is used without nextpartreset()
, which
the code comment for wait_for_log()
warns against:
# wait_for_log: wait until message $2 in file $3 appears. Bail out after
# $1 seconds. This needs to be used in conjunction with a prior call to
# nextpart() or nextpartreset() on the same file to guarantee the offset is
# set correctly. Tests using wait_for_log() are responsible for cleaning up
# the created <file>.prev files.
What I believe happened in the job linked to above is that a prior
call to wait_for_log()
(wait_for_log 45 "transfer of 'example/IN' from 10.53.0.2#.*success"
) "ate" the retries exceeded
log lines which
the failing wait_for_log()
invocation waits for. wait_for_log()
is
not idempotent!
There are some retries exceeded
lines in ns3/named.run
, so I doubt
there is some problem with the actual code being tested here:
$ grep -F "retries exceeded" ns3/named.run
11-Apr-2022 14:15:24.141 zone example/IN: notify to 10.53.10.53#30905: retries exceeded
11-Apr-2022 14:15:38.648 zone example/IN: notify to 10.53.10.53#30905: retries exceeded
11-Apr-2022 14:15:46.309 zone example/IN: notify to 10.53.10.53#30905: retries exceeded
I am not sure what the original intention was precisely, but there are two possibilities:
-
if
retries exceeded
is expected to appear anywhere inns3/named.run
, just usegrep -F "retries exceeded" ns3/named.run
-
if
retries exceeded
is expected to appear inns3/named.run
between two specific points in time,nextpartreset()
/nextpart()
must be used to indicate the "starting point".
Assigning @fanf as the author of !6051 (merged)/!6096 (merged).