Improve accuracy of query error logging
There are two issues related to query error logging that bug me:
-
Until BIND 9.11 (inclusive), when something goes wrong while recursing for an answer to a query, the
default:
branch of aswitch
statement inquery_gotanswer()
generates a SERVFAIL response. When serve-stale was implemented in BIND 9.12, it was done in a way which causes thatdefault:
branch to only set a flag (want_stale
) in the query context for such queries, effectively moving theQUERY_ERROR()
call for such queries toquery_done()
(and makingquery_done()
call itself recursively, which hinders readability). This may be a source of confusion1 as the "query failed" log messages now point at a line inquery_done()
which appears to have something to do with serve-stale (if (qctx->want_stale)
) even if the latter is not enabled. -
Apparently we are setting
qctx->result
toDNS_R_SERVFAIL
for a lot of different failure modes:$ sed -n 's|.*\(QUERY_ERROR([^)]*);\)$|\1|p;' lib/ns/query.c | sort | uniq -c | sort 2 QUERY_ERROR(qctx, DNS_R_DROP); 2 QUERY_ERROR(qctx, DNS_R_REFUSED); 4 QUERY_ERROR(qctx, result); 19 QUERY_ERROR(qctx, DNS_R_SERVFAIL);
In some situations, this really is the only sensible thing to do (e.g. when the root key sentinel mechanism overrides the RCODE or when the SERVFAIL cache is hit) but in most others, doing this feels like replacing potentially useful diagnostic information with a rather less useful "oops, something went wrong" type of message:
client @0x7fa62c0403b0 ::1#52563 (servfail.nl): query failed (SERVFAIL) for servfail.nl/IN/A at query.c:10681
While in some cases the root cause of failed queries is indicated by the preceding log messages, troubleshooting others requires recompiling BIND with
--enable-querytrace
, which is not always a viable option for the user.I think we can improve
named
's behavior in this regard. Note that the response message's RCODE is derived fromqctx->result
usingdns_result_torcode()
, which handles a number of possibleisc_result_t
values and returns SERVFAIL for anything not explicitly listed. If we usedQUERY_ERROR(qctx, result);
instead ofQUERY_ERROR(qctx, DNS_R_SERVFAIL);
where possible, the specific error could be logged byquery_error()
and the response's RCODE would still be SERVFAIL. The win here would be that some light could be shed on the less obvious query errors just by bumping the debug level usingrndc trace
(or enabling query logging) rather than recompiling with--enable-querytrace
.
-
see e.g. https://lists.isc.org/pipermail/bind-users/2018-September/100868.html, though the issue caught me by surprise a couple of times as well while debugging
↩