Concurrently running dns_request-based UDP queries may inadvertently cancel each other

lib/dns/request.c:find_udp() does not set the DNS_DISPATCHATTR_EXCLUSIVE attribute when requesting a non-port-specific UDP dispatch to use for sending a request created by dns_request_createvia() or dns_request_createraw(). This may result in multiple different dns_request-based queries using the same UDP dispatch (i.e. UDP socket). Given that lib/dns/request.c:req_cancel() is used for cleaning up both successful and failed requests, destroying a properly answered request can cause an unrelated, queued send operation (for a different request using the same UDP socket) to be canceled. The problem manifests itself in the form of the following log message:

21-Feb-2021 19:28:45.734 zone verify-ixfr/IN: refresh: failure trying master 10.53.0.2#9500 (source 10.53.0.3#0): operation canceled

(source: job 1514369, bin/tests/system/mirror/ns3/named.run)

Note that the above log message is the same as the one mentioned in https://kb.isc.org/docs/aa-01213, but the root cause is different (the message above was generated on Windows and thus could not be caused by Netfilter interference).

Both outgoing NOTIFY messages (on primaries) and SOA refresh queries (on secondaries) are sent using the dns_request API.

An example high-level sequence of events leading to this problem is:

request A is created using dns_request_createvia(),
request A gets assigned a non-exclusive UDP dispatch,
request A's query is sent out,
response to request A's query arrives and its processing starts,
request B is created using dns_request_createvia(),
request B gets assigned the same UDP dispatch as request A,
request B's query is queued for sending,
request A is cleaned up, req_cancel() calls isc_socket_cancel(),
request B's query is canceled before it gets put on the wire,
request B's callback is called with ISC_R_CANCELED.

The simplest way to prevent this is to do what named does for its outgoing resolver queries, which is to use exclusive UDP dispatches unless notify-source and/or transfer-source includes a specific port number (e.g. transfer-source 10.53.0.1 port 1234;). Here is a patch which achieves that:

diff --git a/lib/dns/request.c b/lib/dns/request.c
index f2c3d3d94d..47429c9850 100644
--- a/lib/dns/request.c
+++ b/lib/dns/request.c
@@ -628,6 +630,11 @@ find_udp_dispatch(dns_requestmgr_t *requestmgr, const isc_sockaddr_t *srcaddr,
 	default:
 		return (ISC_R_NOTIMPLEMENTED);
 	}
+
+	if (isc_sockaddr_getport(srcaddr) == 0) {
+		attrs |= DNS_DISPATCHATTR_EXCLUSIVE;
+	}
+
 	attrmask = 0;
 	attrmask |= DNS_DISPATCHATTR_UDP;
 	attrmask |= DNS_DISPATCHATTR_TCP;

Without this patch, I could reproduce this problem in the mirror system test within minutes. With this patch applied, I could not reproduce it in the same environment for hours, which I think is solid enough proof to confirm that the root cause analysis above is accurate.

However, the proposed patch still does not solve the problem for query sources using an explicit port number.

Furthermore, fixing this might not be worth the effort as the dispatch code is in the process of being migrated to netmgr (!4601 (merged)), which means the offending part of the code will be dropped altogether soon enough. As for BIND 9.11, AFAICT, the problem does not have any severe real-world consequences:

SOA queries which fail to be sent out get requeued,
NOTIFY messages which fail to be sent out are not requeued, but the secondaries should still eventually manage to refresh the zone (according to the SOA refresh timer).

My guess is that we have been unknowingly hitting this issue in GitLab CI, but in order for this bug to trigger a test failure, certain conditions must be met. In the case of the failed mirror system test mentioned above, canceling the first SOA query caused a fallback to the alternative zone transfer source address (alt-transfer-source), which was not allowed to transfer the verify-ixfr zone from ns2. This triggered an AXFR retry down the road, which was unexpected and rightly caused the test to fail.