Dispatch code intermittently reports "response doesn't match" during the "reclimit" system test
https://gitlab.isc.org/isc-private/bind9/-/jobs/3068391
S:reclimit:2023-01-13T13:29:59+0000
T:reclimit:1:A
A:reclimit:System test reclimit
I:reclimit:PORTS:12586,12587,12588,12589,12590,12591,12592,12593,12594,12595,12596,12597,12598
I:reclimit:starting servers
I:reclimit:set max-recursion-depth=12
I:reclimit:attempt excessive-depth lookup (1)
I:reclimit:attempt permissible lookup (2)
I:reclimit:set max-recursion-depth=5
I:reclimit:attempt excessive-depth lookup (3)
I:reclimit:attempt permissible lookup (4)
I:reclimit:set max-recursion-depth=100, max-recursion-queries=50
I:reclimit:attempt excessive-queries lookup (5)
I:reclimit:attempt permissible lookup (6)
I:reclimit:failed
I:reclimit:set max-recursion-depth=100, max-recursion-queries=40
I:reclimit:attempt excessive-queries lookup (7)
I:reclimit:attempt permissible lookup (8)
I:reclimit:attempting NS explosion (9)
I:reclimit:exit status: 1
I:reclimit:stopping servers
R:reclimit:FAIL
E:reclimit:2023-01-13T13:30:05+0000
The reason for the test failure is that one of the "intermediate"
queries sent by ns3
to ans4
(ns1.1.example.org/A
) got a response
that the dispatch code did not like, necessitating a requery, which in
turn caused the max-recursion-queries
limit to be exceeded for the
indirect6.example.org/A
query, causing the test to fail.
The interesting part is why the dispatch code did not like the
response that the ans4
"server" sent. Here are the relevant log
excerpts:
13-Jan-2023 13:30:02.298 sending packet to 10.53.0.4#12586
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57016
;; flags:; QUESTION: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;ns1.1.example.org. IN A
13-Jan-2023 13:30:02.298 dispatch 0x7f55202f6400: UDP response 0x7f5508de0800: sending
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: connected: success
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: attaching handle 0x7f5508a05980 to 0x7f5508ce4a18
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: reading
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: connect callback: success
13-Jan-2023 13:30:02.298 sending packet to 10.53.0.4#12586
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56518
;; flags:; QUESTION: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;ns1.1.example.org. IN AAAA
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: sending
13-Jan-2023 13:30:02.298 dispatch 0x7f55202f6400: UDP response 0x7f5508de0800: sent: success
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: sent: success
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: read callback:success, requests 1
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: got valid DNS message header, /QR 1, id 57016
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: response doesn't match
13-Jan-2023 13:30:02.298 dispatch 0x7f55202b8600: UDP response 0x7f5508ce4a00: continue reading
The relevant piece of code is:
594 /*
595 * The QID and the address must match the expected ones.
596 */
597 if (resp->id != id || !isc_sockaddr_equal(&peer, &resp->peer)) {
598 dispentry_log(resp, LVL(90), "response doesn't match");
599 inc_stats(disp->mgr, dns_resstatscounter_mismatch);
600 goto next;
601 }
Given the got valid DNS message header
log message with the expected
ID, it seems that isc_sockaddr_equal(&peer, &resp->peer)
must have
returned false
. I do not know why that happened because I do not have
a PCAP file containing the traffic exchanged between the servers in
question.
This only happens intermittently, so for now I am merely opening this issue to record it and make others aware of it. My gut feeling is that there is no way to figure out what happened here without a PCAP file or improved dispatch debugging, but I would love to be proved wrong :-)