DNS resolution fails temporarily
We run two BIND (named) servers in our production system: BIND 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 (Extended Support Version) on Linux x86_64 4.9.215-36.el7.x86_64.
Recently, we have started to see DNS resolution fail during a specific time window for random domain names (out of more than 100 records). Each record may fail for up to 10 minutes. At all other times, resolution works fine.
AT THE TIME OF ISSUE:
dig @MY_DNS_SERVER docker.mycompany.net
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> @MY_DNS_SERVER docker.mycompany.net
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 33467
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;docker.mycompany.net. IN A
;; AUTHORITY SECTION:
mycompany.net. 58 IN SOA ns-604.awsdns-11.net. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
;; Query time: 0 msec
;; SERVER: MY_DNS_SERVER#53(MY_DNS_SERVER)
;; WHEN: Thu Nov 04 10:22:01 UTC 2021
;; MSG SIZE rcvd: 128
NORMAL TIMES:
dig @MY_DNS_SERVER docker.mycompany.net
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> @MY_DNS_SERVER docker.mycompany.net
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28581
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 7
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;docker.mycompany.net. IN A
;; ANSWER SECTION:
docker.mycompany.net. 54 IN A PROPER IP
docker.mycompany.net. 54 IN A PROPER IP
;; AUTHORITY SECTION:
mycompany.net. 81323 IN NS ns-16.awsdns-02.com.
mycompany.net. 81323 IN NS ns-1158.awsdns-16.org.
mycompany.net. 81323 IN NS ns-604.awsdns-11.net.
mycompany.net. 81323 IN NS ns-1731.awsdns-24.co.uk.
;; ADDITIONAL SECTION:
ns-1158.awsdns-16.org. 47096 IN A 205.251.196.134
ns-16.awsdns-02.com. 25941 IN A 205.251.192.16
ns-604.awsdns-11.net. 81323 IN A 205.251.194.92
ns-1158.awsdns-16.org. 68043 IN AAAA 2600:9000:5304:8600::1
ns-16.awsdns-02.com. 60863 IN AAAA 2600:9000:5300:1000::1
ns-1731.awsdns-24.co.uk. 38132 IN AAAA 2600:9000:5306:c300::1
;; Query time: 0 msec
;; SERVER: MY_DNS_SERVER#53(MY_DNS_SERVER)
;; WHEN: Thu Nov 04 10:29:03 UTC 2021
;; MSG SIZE rcvd: 347
The BIND cache metrics show an upward trend that starts almost exactly when the issue begins (2:30 pm IST). Although DNS resolution recovers within an hour, the graph keeps climbing and only drops after midnight that day.
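To pin down the failure window more precisely, the response code could be logged on a schedule with something like the following. This is only a minimal sketch: `dig_status` and `log_status` are hypothetical helper names, and `MY_DNS_SERVER` and `docker.mycompany.net` are the same placeholders used in the dig commands in this post.

```shell
#!/bin/sh
# Extract the response code (e.g. NOERROR, NXDOMAIN) from dig output on stdin.
# Hypothetical helper; it only parses the ";; ->>HEADER<<-" line.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Print one timestamped line per query: UTC time, name, response code.
# $1 = resolver to query, $2 = name to look up (both placeholders).
log_status() {
    server=$1
    name=$2
    status=$(dig +noall +comments "@$server" "$name" | dig_status)
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$name" "${status:-NO_RESPONSE}"
}

# Example: poll once a minute (uncomment to run).
# while :; do log_status MY_DNS_SERVER docker.mycompany.net; sleep 60; done
```

Running the same loop against one of the authoritative servers listed in the AUTHORITY section (for example ns-604.awsdns-11.net) at the same time would show whether the NXDOMAIN originates upstream or only in the local BIND cache.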
This is badly affecting Production users. Please share your suggestions as soon as possible.