Bad performance in BIND 9.16.6
Summary
Hello everybody,
I am creating this issue because I am having problems with my VMs running BIND 9.16.6 on openSUSE Leap 15.3 under ESXi 7.0: I am getting poor performance when querying them over UDP.
After several days of research I have not found a solution yet, so I am coming here hoping that someone has an idea that could help me move forward.
For information, everything was working fine when I was using the 9.11 branch, so the issue seems to be related in some way to #2143 (closed).
BIND version used
BIND 9.16.6 (Stable Release) <id:25846cf>
running on Linux x86_64 5.3.18-150300.59.87-default #1 SMP Thu Jul 21 14:31:28 UTC 2022 (cc90276)
built by make with '--host=x86_64-suse-linux-gnu' '--build=x86_64-suse-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/lib' '--localstatedir=/var' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--disable-dependency-tracking' '--with-python=/usr/bin/python3' '--includedir=/usr/include/bind' '--disable-static' '--with-openssl' '--enable-threads' '--with-libtool' '--with-libxml2' '--with-libjson' '--with-libidn2' '--with-dlz-mysql' '--with-dlz-ldap' '--with-randomdev=/dev/urandom' '--enable-ipv6' '--with-pic' '--disable-openssl-version-check' '--with-tuning=large' '--with-geoip' '--with-dlopen' '--with-gssapi=yes' '--disable-isc-spnego' '--enable-fixed-rrset' '--enable-filter-aaaa' '--with-systemd' '--enable-full-report' 'build_alias=x86_64-suse-linux-gnu' 'host_alias=x86_64-suse-linux-gnu' 'CFLAGS=-fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g -fPIE -DNO_VERSION_DATE' 'LDFLAGS=-pie' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
compiled by GCC 7.5.0
compiled with OpenSSL version: OpenSSL 1.1.1d 10 Sep 2019
linked to OpenSSL version: OpenSSL 1.1.1d 10 Sep 2019
compiled with libuv version: 1.18.0
linked to libuv version: 1.18.0
compiled with libxml2 version: 2.9.7
linked to libxml2 version: 20907
compiled with json-c version: 0.13
linked to json-c version: 0.13
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
threads support is enabled
default paths:
named configuration: /etc/named.conf
rndc configuration: /etc/rndc.conf
DNSSEC root key: /etc/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/named.pid
named lock file: /var/run/named/named.lock
Steps to reproduce
The problem occurs constantly from the moment BIND is started.
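The traffic itself is just ordinary UDP query traffic; if a synthetic load is needed to reproduce, something like dnsperf should generate a comparable pattern (queries.txt is a hypothetical file of "name type" lines, and the client/rate values are placeholders, not my real traffic):
# hypothetical repro load: 10 clients, capped at 5000 q/s, for 60 seconds
dnsperf -s 172.31.18.10 -p 53 -d queries.txt -c 10 -Q 5000 -l 60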
What is the current bug behavior?
I have observed that there is always one socket with a high Recv-Q value. Another odd thing: the VM has 2 vCPUs and 1 NIC, with 2 UDP listeners per interface, yet when I check with the ss command I can see 3 UDP sockets bound to the NIC's IP address.
What is the expected correct behavior?
The Recv-Q should drain instead of staying stuck at a high value. I would also expect to see only 2 UDP sockets for my NIC (matching the number of UDP listeners per interface).
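One way to confirm that the stuck Recv-Q actually results in kernel-side UDP drops would be to check the standard Linux counters (nothing BIND-specific here):
# UdpRcvbufErrors increases when datagrams are dropped because the socket
# receive buffer is already full, which is what a permanently high Recv-Q suggests
nstat -az UdpInErrors UdpRcvbufErrors
# the same information appears in "netstat -su" under "receive buffer errors"
netstat -su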
Relevant logs and/or screenshots
Output of the ss -u -a -n '( sport = :53 )' command:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
UNCONN 213760 0 172.31.18.10:53 0.0.0.0:*
UNCONN 0 0 172.31.18.10:53 0.0.0.0:*
UNCONN 0 0 172.31.18.10:53 0.0.0.0:*
UNCONN 0 0 127.0.0.1:53 0.0.0.0:*
UNCONN 0 0 127.0.0.1:53 0.0.0.0:*
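If it helps, the same ss query can be extended to show which named threads own these sockets and the receive buffer limit that Recv-Q is stuck against (root is needed for the process information):
# -p shows the owning process/threads, -m shows socket memory (skmem),
# including the receive buffer limit (rb)
ss -u -a -n -p -m '( sport = :53 )'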
Output of the rndc status command:
version: BIND 9.16.6 (Stable Release) <id:25846cf> (None of your business)
running on vmxdns: Linux x86_64 5.3.18-150300.59.87-default #1 SMP Thu Jul 21 14:31:28 UTC 2022 (cc90276)
boot time: Thu, 25 Aug 2022 09:40:54 GMT
last configured: Thu, 25 Aug 2022 09:40:54 GMT
configuration file: /etc/named.conf (/var/lib/named/etc/named.conf)
CPUs found: 2
worker threads: 2
UDP listeners per interface: 2
number of zones: 116 (99 automatic)
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is OFF
recursive clients: 0/900/1000
tcp clients: 0/150
TCP high-water: 8
server is up and running
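In case it helps diagnosis, a quick way to check whether one of the named listener threads is idle or stuck while the queue builds up would be to look at per-thread CPU usage (pidstat comes from the sysstat package; top -H gives roughly the same view):
# one sample per second of per-thread CPU usage for the named process
pidstat -t -p "$(pidof named)" 1
# alternative: interactive per-thread view
top -H -p "$(pidof named)"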
I have opened a VMware case to check with them whether something at the VM level could explain this behavior, but for the moment they think the problem comes from BIND not being able to read data from the network buffer fast enough.
Do you have any ideas, please?
Thank you very much for your help.
Regards.