dig network issue detection in system tests
When investigating a system test failure, the cause often turns out to be a network error or timeout in dig, triggered by CI system load or instability. For example, in #3207126 the serve-stale
test failed because a query timed out:
$ cat serve-stale/dig.out.test31
;; communications error to 10.53.0.1#13253: timed out
; <<>> DiG 9.19.11-dev <<>> +time +tries -p 13253 @10.53.0.1 data.example TXT
; (1 server found)
;; global options: +cmd
;; no servers could be reached
To speed up the investigation of such failures, it would be useful to grep for common failure patterns whenever a test fails, e.g.:
$ grep 'timed out' */dig.out*
serve-stale/dig.out.test147:;; communications error to 10.53.0.3#13253: timed out
serve-stale/dig.out.test149:;; communications error to 10.53.0.3#13253: timed out
serve-stale/dig.out.test31:;; communications error to 10.53.0.1#13253: timed out
serve-stale/dig.out.test98:;; communications error to 10.53.0.3#13253: timed out
If the information above were displayed in the job's log, investigation would be quicker. Of course, the context of the test still has to be taken into account, since some tests actually do expect a timeout. Nevertheless, I believe it could be helpful.
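As a rough illustration of the idea, here is a minimal sketch of a helper that could run after a failed test. The function name `report_dig_network_errors` and the variable `DIG_NET_ERR_PATTERN` are hypothetical and not part of the existing test framework; the pattern list only contains strings that appear in the dig output quoted above, and would likely need extending.

```shell
#!/bin/sh
# Hypothetical helper; the names below are illustrative assumptions,
# not existing parts of the system test framework.

# Extended regex of dig messages that usually indicate a network problem.
# All three strings are taken from the dig output quoted in this issue.
DIG_NET_ERR_PATTERN='communications error|timed out|no servers could be reached'

# Print every matching line found in <dir>/*/dig.out*, prefixed with the
# file name. Returns 0 if at least one match was found, 1 otherwise.
report_dig_network_errors() {
    dir=${1:-.}
    # -s silences the error when the glob matches no files;
    # the extra /dev/null argument forces grep to print file names
    # even if only a single dig.out file exists.
    matches=$(grep -E -s "$DIG_NET_ERR_PATTERN" "$dir"/*/dig.out* /dev/null)
    [ -n "$matches" ] || return 1
    printf '%s\n' "Possible dig network issues (note: some tests expect timeouts):"
    printf '%s\n' "$matches"
}
```

Calling `report_dig_network_errors .` from the directory that contains the per-test subdirectories would print the same kind of lines as the manual `grep` above. The non-zero return value when nothing matches lets a caller skip the report entirely for failures unrelated to dig.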