intermittent resolver test failure
Periodically the resolver system test fails with a message similar to this:
I:resolver:query count error: 6 NS records: expected queries 10, actual 11
I:resolver:failed
I made a stripped-down version of the test containing only this particular check and ran it in a loop until I had caught numerous instances of the failure. The cause was two iterative queries being sent at the same time from the same UDP source address. Both replies were therefore delivered to that same address, and if the wrong one was processed first, it was dropped due to a header mismatch. 0.8 seconds later, the dispatch timed out and resent the query; resolution then succeeded, but an extra query was logged in the statistics.
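To illustrate the ordering problem, here is a toy model in Python. It is not the dispatch code: `PendingQuery`, `total_queries_sent` and the "each query reads from a shared inbox until it sees its own ID" behaviour are all simplifying assumptions of mine, and the retry is simply assumed to succeed.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingQuery:
    """Hypothetical stand-in for one outstanding query (not BIND's dispentry)."""
    msg_id: int
    sent: int = 1  # number of queries sent so far, including the first


def total_queries_sent(arrival_order):
    """Count queries sent when two queries share one UDP source address.

    arrival_order lists the message IDs of the replies in the order they
    reach the shared address.
    """
    inbox = deque(arrival_order)
    queries = [PendingQuery(msg_id=1), PendingQuery(msg_id=2)]
    for query in queries:
        matched = False
        while inbox and not matched:
            reply_id = inbox.popleft()
            # A reply whose ID does not match is consumed and dropped.
            matched = reply_id == query.msg_id
        if not matched:
            # Nothing left to read: time out and resend once.
            # Assumption: the retry is answered correctly.
            query.sent += 1
    return sum(query.sent for query in queries)


in_order = total_queries_sent([1, 2])   # replies arrive in send order
swapped = total_queries_sent([2, 1])    # the "wrong" reply arrives first
```

In this model the in-order case sends two queries and the swapped case sends three, which mirrors the off-by-one in the logged count.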
Prior to the netmgr dispatch refactoring, there was code to avoid using the same source address more than once, but we removed it on the suspicion that its complexity outweighed its benefits. That's probably why the test started failing. Since the test only fails occasionally, and resolution succeeds anyway, removing that code still seems like the right call to me.
I see three options to fix this:
- put back the socket-avoidance code (IMHO this is still probably not worth it)
- instead of dropping messages when the header doesn't match, try to find the correct dispentry object, or
- change the test so it can tolerate an extra logged query.
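As a rough idea of what the second option could look like, here is a Python sketch that routes each reply to whichever pending query it belongs to instead of dropping it. The names (`register`, `route_reply`, the `pending` table keyed by peer address and message ID) are hypothetical and do not correspond to the real dispatch API; the actual change would have to work with the existing dispentry bookkeeping.

```python
# Pending queries that share one local UDP address, keyed by
# (peer address, DNS message ID). Hypothetical structure for illustration.
pending = {}


def register(peer, msg_id, name):
    """Record an outstanding query so its reply can be found later."""
    pending[(peer, msg_id)] = name


def route_reply(peer, msg_id):
    """Return the query this reply answers, or None if nothing matches.

    A reply is only discarded when no pending query matches it at all,
    rather than when it mismatches whichever query happened to read first.
    """
    return pending.pop((peer, msg_id), None)


register("10.53.0.4", 1, "query A")
register("10.53.0.4", 2, "query B")

# Replies arrive in the "wrong" order, but both still find their query.
first = route_reply("10.53.0.4", 2)
second = route_reply("10.53.0.4", 1)
stray = route_reply("10.53.0.4", 99)  # unknown ID: safe to drop
```

With a lookup like this neither query would need to time out and resend, so the logged query count would stay at the expected value without touching the test.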