BIND responds with SERVFAIL for EDNS=1 query
I have a very puzzling case with three of our BIND 9.14.8 servers. These three are part of a cluster of 10 servers (the others are Knot and NSD).
Each server has a management interface (for SSH, monitoring, zone transfers, etc.), a dummy interface carrying several addresses from an anycast prefix, and a service interface that connects to its upstream router. The router delivers DNS queries to the server over the service interface, and the DNS responses go back out to the router over that same interface.
When I send an EDNS version 1 (EDNS1) query to the BIND server via its management interface, it always correctly responds with BADVERS:
; <<>> DiG 9.16.2 <<>> +norec +edns +noednsnegotiation +nocookie apnic.net soa @10.111.0.21
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: BADVERS, id: 30243
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
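For anyone who wants to reproduce the probe without dig, here is a minimal Python sketch of the same kind of query. It is only an illustration: the helper names (encode_name, build_edns_query, send_query) are hypothetical, and it assumes an SOA query with RD clear, no cookie option, and the EDNS version field set to 1 in the OPT record as laid out in RFC 6891. For apnic.net this produces a 38-byte packet, which matches the length tcpdump reports for the query.

```python
import socket
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name in DNS wire format (length-prefixed labels)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"  # root label terminates the name


def build_edns_query(qname, qtype=6, edns_version=1, udp_size=4096, qid=0x1234):
    """Build a non-recursive query carrying an OPT record with the given version.

    qtype 6 is SOA. All flag bits are zero, so RD is clear (like dig +norec).
    """
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT RR)
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 1)
    question = encode_name(qname) + struct.pack("!HH", qtype, 1)  # class IN
    # OPT pseudo-RR: root name, TYPE=41, CLASS=UDP payload size,
    # TTL = extended-rcode (8 bits) | version (8 bits) | flags (16 bits), RDLEN=0
    ttl = (0 << 24) | (edns_version << 16) | 0
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, ttl, 0)
    return header + question + opt


def send_query(server, packet, timeout=3.0):
    """Send the packet over UDP to port 53 and return the raw response bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(packet, (server, 53))
        data, _ = s.recvfrom(4096)
        return data


# Example (not run here): probe the management address used in this post.
# response = send_query("10.111.0.21", build_edns_query("apnic.net", qid=30243))
```

Sending the same packet to the management address and to the anycast address takes dig's own option handling out of the picture, so any difference in the answers comes from the server side.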
tcpdump on the BIND server shows:
22.214.171.124.27513 > 10.111.0.21.53: [udp sum ok] 30243 [1au] SOA? apnic.net. ar: . OPT UDPsize=4096 (38)
10.111.0.21.53 > 126.96.36.199.27513: [bad udp cksum 0xe0a9 -> 0x5d15!] 30243- q: SOA? apnic.net. 0/0/1 ar: . OPT UDPsize=4096 (38)
However, when I send the same query to the anycast address, and it hits BIND, I get SERVFAIL:
; <<>> DiG 9.16.2 <<>> +norec +edns +noednsnegotiation +nocookie apnic.net soa @188.8.131.52
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14810
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
tcpdump on the BIND server shows this:
184.108.40.206.60666 > 220.127.116.11.53: [udp sum ok] 14810 [1au] SOA? apnic.net. ar: . OPT UDPsize=4096 (38)
18.104.22.168.53 > 22.214.171.124.60666: [bad udp cksum 0xdcc9 -> 0x1cbb!] 14810 ServFail- q: SOA? apnic.net. 0/0/1 ar: . OPT UDPsize=4096 (38)
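A note on reading the two captures: BADVERS is RCODE 16, which does not fit in the 4-bit RCODE field of the DNS header. Per RFC 6891 the upper 8 bits travel in the TTL field of the OPT record, so a BADVERS response has header RCODE 0 and extended RCODE 1, whereas SERVFAIL (2) sits entirely in the header. That would explain why the first tcpdump line shows no status while the second shows ServFail, assuming tcpdump prints only the header RCODE here. The following Python sketch shows how the two parts combine; the names full_rcode and skip_name are hypothetical, and it handles only what is needed to walk a simple response.

```python
import struct

RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 5: "REFUSED", 16: "BADVERS"}


def skip_name(msg: bytes, pos: int) -> int:
    """Return the offset just past a (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1          # root label ends the name
        if length & 0xC0 == 0xC0:
            return pos + 2          # compression pointer is 2 bytes and ends the name
        pos += 1 + length


def full_rcode(msg: bytes) -> int:
    """Combine the header RCODE (low 4 bits) with the OPT extended RCODE (high 8 bits)."""
    _qid, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", msg[:12])
    low = flags & 0x000F
    pos = 12
    for _ in range(qd):
        pos = skip_name(msg, pos) + 4            # skip QTYPE and QCLASS
    high = 0
    for _ in range(an + ns + ar):
        pos = skip_name(msg, pos)
        rtype, _rclass, ttl, rdlen = struct.unpack("!HHIH", msg[pos:pos + 10])
        pos += 10 + rdlen
        if rtype == 41:                          # OPT pseudo-RR
            high = ttl >> 24                     # top byte of TTL is the extended RCODE
    return (high << 4) | low
```

Looking up RCODE_NAMES.get(full_rcode(response)) on the raw response bytes gives the same status that dig prints, which makes it easy to tally BADVERS versus SERVFAIL per zone.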
And now the most puzzling part: this is not consistent. If I send EDNS1 queries for various zones to the same BIND servers via their anycast addresses, they respond with BADVERS for some zones and SERVFAIL for others. The neighbouring servers (Knot and NSD) consistently send BADVERS and exhibit no problems. It's only BIND, and only for some zones.
Has anyone else seen this behaviour? Does anyone have any hint as to why this might be happening? Could there be some combination of query attributes that make BIND take a different code path and return SERVFAIL instead of BADVERS?