lsof very slow - too many open file descriptors by named
Summary
"lsof" on my Linux machine ran unusually slowly, taking tens of seconds to produce output. On investigation, the culprit turned out to be the named process, which appeared to be holding hundreds of thousands of open file descriptors.
BIND version used
BIND 9.16.37-Debian (Extended Support Version) <id:2b2afb2>
running on Linux x86_64 4.19.261-deb11 #1 SMP PREEMPT Fri Oct 21 22:53:59 PDT 2022
built by make with '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=/usr/lib/x86_64-linux-gnu' '--runstatedir=/run' '--disable-maintainer-mode' '--disable-dependency-tracking' '--libdir=/usr/lib/x86_64-linux-gnu' '--sysconfdir=/etc/bind' '--with-python=python3' '--localstatedir=/' '--enable-threads' '--enable-largefile' '--with-libtool' '--enable-shared' '--enable-static' '--with-gost=no' '--with-openssl=/usr' '--with-gssapi=/usr' '--with-libidn2' '--with-json-c' '--with-lmdb=/usr' '--with-gnu-ld' '--with-maxminddb' '--with-atf=no' '--enable-ipv6' '--enable-rrl' '--enable-filter-aaaa' '--disable-native-pkcs11' '--enable-dnstap' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/build/bind9-t8MKLi/bind9-9.16.37=. -fstack-protector-strong -Wformat -Werror=format-security -fno-strict-aliasing -fno-delete-null-pointer-checks -DNO_VERSION_DATE -DDIG_SIGCHASE' 'LDFLAGS=-Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'
compiled by GCC 10.2.1 20210110
compiled with OpenSSL version: OpenSSL 1.1.1n 15 Mar 2022
linked to OpenSSL version: OpenSSL 1.1.1n 15 Mar 2022
compiled with libuv version: 1.40.0
linked to libuv version: 1.40.0
compiled with libxml2 version: 2.9.10
linked to libxml2 version: 20910
compiled with json-c version: 0.15
linked to json-c version: 0.15
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
linked to maxminddb version: 1.5.2
compiled with protobuf-c version: 1.3.3
linked to protobuf-c version: 1.3.3
threads support is enabled
DNSSEC algorithms: RSASHA1 NSEC3RSASHA1 RSASHA256 RSASHA512 ECDSAP256SHA256 ECDSAP384SHA384 ED25519 ED448
DS algorithms: SHA-1 SHA-256 SHA-384
HMAC algorithms: HMAC-MD5 HMAC-SHA1 HMAC-SHA224 HMAC-SHA256 HMAC-SHA384 HMAC-SHA512
TKEY mode 2 support (Diffie-Hellman): yes
TKEY mode 3 support (GSS-API): yes
default paths:
named configuration: /etc/bind/named.conf
rndc configuration: /etc/bind/rndc.conf
DNSSEC root key: /etc/bind/bind.keys
nsupdate session key: //run/named/session.key
named PID file: //run/named/named.pid
named lock file: //run/named/named.lock
geoip-directory: /usr/share/GeoIP
Steps to reproduce
"systemctl start named" on a Debian Linux server with a large number of interfaces.
Any subsequent invocation of the "lsof" command runs very slowly.
What is the current bug behavior?
date; lsof -nP |grep named | wc; date
Wed Apr 19 15:12:48 PDT 2023
305698 3493060 43406181
Wed Apr 19 15:12:59 PDT 2023
What is the expected correct behavior?
systemctl stop named
date; lsof -nP >> /dev/null; date
Wed Apr 19 15:20:50 PDT 2023
Wed Apr 19 15:20:51 PDT 2023
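Comparing the two date stamps by hand works, but a small wrapper can print the elapsed time directly. This is only a sketch: elapsed_seconds is a hypothetical helper name, not part of lsof or BIND, and it assumes a date command that supports +%s (GNU and BSD date both do).

```shell
# Hypothetical helper: run a command, discard its output, and print
# how many whole seconds it took.
elapsed_seconds() {
    start=$(date +%s)
    # Output is discarded so that only the run time is reported.
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}
```

For example, running `elapsed_seconds lsof -nP` once with named running and once with it stopped gives the two figures to compare. The resolution is one second, which is enough for the difference described here.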
Relevant configuration files
n/a
Relevant logs and/or screenshots
systemctl start named
sleep 5
lsof -nP |grep named | cat -n | tail -6
305693 named 9656 9737 isc-socket-39 bind 3684u IPv6 4081192 0t0 TCP [fe80::fc54:ff:fe24:738c]:53 (LISTEN)
305694 named 9656 9737 isc-socket-39 bind 3685u IPv6 4081193 0t0 TCP [fe80::fc54:ff:fe24:738c]:53 (LISTEN)
305695 named 9656 9737 isc-socket-39 bind 3686u IPv6 4081194 0t0 TCP [fe80::fc54:ff:fe24:738c]:53 (LISTEN)
305696 named 9656 9737 isc-socket-39 bind 3687u IPv6 4081195 0t0 TCP [fe80::fc54:ff:fe24:738c]:53 (LISTEN)
305697 named 9656 9737 isc-socket-39 bind 3688u IPv6 4081196 0t0 TCP [fe80::fc54:ff:fe24:738c]:53 (LISTEN)
305698 named 9656 9737 isc-socket-39 bind 3689u IPv6 2707016 0t0 TCP [::1]:953 (LISTEN)
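In the listing above the highest descriptor number is 3689, while lsof prints 305,698 lines for named. One possible reading, an assumption not confirmed in this report, is that lsof repeats the process's descriptors once per thread (the TID and task-name columns, 9737 and isc-socket-39 here, point that way). A rough way to check is to read the counts straight from procfs. The sketch below uses a hypothetical helper name, fd_thread_counts, and takes an optional procfs root so it can be tried against a test directory.

```shell
# Hypothetical helper: print "<open descriptors> <threads>" for one process.
# Usage: fd_thread_counts PID [PROC_ROOT]   (PROC_ROOT defaults to /proc)
fd_thread_counts() {
    pid=$1
    proc_root=${2:-/proc}
    # Each entry under fd/ is one open descriptor of the process.
    fds=$(ls "$proc_root/$pid/fd" | wc -l | tr -d ' ')
    # Each entry under task/ is one thread of the process.
    threads=$(ls "$proc_root/$pid/task" | wc -l | tr -d ' ')
    echo "$fds $threads"
}
```

Run as root (or as the user named runs as), for example `fd_thread_counts "$(pidof -s named)"`. If the product of the two numbers is close to the lsof line count, the slowdown is probably lsof walking every thread's view of a few thousand descriptors rather than named holding hundreds of thousands of distinct ones.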
Possible fixes
Partial workaround:
listen-on-v6 { ::1; 2000::/3; };
listen-on { 127.0.0.0/8; 192.168.99.0/24; 192.168.95.0/24; };
This reduced the count from about 306,000 to about 76,000. That is still excessive and still slows down lsof, though not by as much.
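To see how much a given listen-on / listen-on-v6 setting changes things, it may help to count the TCP listeners on port 53 directly instead of going through lsof. The sketch below is an assumption-laden helper, not anything BIND ships: count_dns_listeners is a hypothetical name, and it assumes the usual `ss -ltn` layout in which the fourth column is the local address and port.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and print
# "<listening sockets on port 53> <distinct local addresses>".
count_dns_listeners() {
    awk '
        # Assumes column 4 is "Local Address:Port"; the header line and
        # other ports (such as 953) do not end in ":53" and are skipped.
        $4 ~ /:53$/ {
            total++
            if (!seen[$4]++) distinct++
        }
        END { print total + 0, distinct + 0 }
    '
}
```

Usage would be `ss -ltn | count_dns_listeners`, run before and after changing the configuration. A first number much larger than the second would suggest several sockets per address, which is consistent with the repeated [fe80::fc54:ff:fe24:738c]:53 entries in the lsof output above.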
Edited by William Herrin