BIND 9.16 and cache node locks for name cleaning vs. 'the thundering herd'
During investigations of intermittent 'brownouts' - periods in which named seemingly stops actioning client queries and then resumes processing a second or two later (yes, delays of seconds, not milliseconds) - we 'caught' one interesting scenario on BIND 9.16 in which the vast majority of the active threads (netmgr and taskmgr both - so both client queries being answered from cache AND client queries for which recursion had just taken place) were competing for the same cache node lock.
The pstack output demonstrating the problem was captured automatically, triggered by monitoring that watches for anomalies in inbound versus outbound network traffic.
The symptoms when this issue occurs are:
- Outbound client-facing traffic rates plummet (well below the level you would expect if only cache misses were going unserviced)
- Recursive query rates plummet too
- CPU use increases - but in user space, not in system space
- The recursive-clients backlog grows (and may hit the configured limit)
- Fetch limits may be triggered (we suspect that this and the previous symptom are effects rather than causes, although triggering fetch limits will exacerbate the situation, both from the client perspective and through the increased traffic rates as clients retry/re-send)
What we saw in the pstacks was that the majority of netmgr threads (these answer directly from cache) were attempting to get a write lock on the node - for example:
Thread 74 (Thread 0x7f3ff366e700 (LWP 11713)):
#0 isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980, type=type@entry=isc_rwlocktype_write) at rwlock.c:57
#1 0x000000000051d826 in decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510, least_serial=least_serial@entry=0, nlock=nlock@entry=isc_rwlocktype_read, tlock=tlock@entry=isc_rwlocktype_none, pruning=pruning@entry=false) at rbtdb.c:2040
#2 0x00000000005215bf in detachnode (db=0x7f3fc6457010, targetp=targetp@entry=0x7f3ff366da88) at rbtdb.c:5352
#3 0x00000000005217be in rdataset_disassociate (rdataset=<optimized out>) at rbtdb.c:8691
#4 0x00000000005657e8 in dns_rdataset_disassociate (rdataset=rdataset@entry=0x7f3fad30cf28) at rdataset.c:111
#5 0x00000000004ebb21 in msgresetnames (first_section=0, msg=0x7f3fad2e1a50, msg@entry=0x7f3fad30b5f0) at message.c:438
#6 msgreset (msg=msg@entry=0x7f3fad2e1a50, everything=everything@entry=false) at message.c:524
#7 0x00000000004ec95a in dns_message_reset (msg=0x7f3fad2e1a50, intent=intent@entry=1) at message.c:760
#8 0x00000000004797ba in ns_client_endrequest (client=0x7f3fae5b8550) at client.c:229
#9 ns__client_reset_cb (client0=0x7f3fae5b8550) at client.c:1586
#10 0x0000000000632989 in isc_nmhandle_unref (handle=handle@entry=0x7f3fae5b83e0) at netmgr.c:1158
#11 0x0000000000632c30 in isc__nm_uvreq_put (req0=req0@entry=0x7f3ff366dbb8, sock=<optimized out>) at netmgr.c:1291
#12 0x00000000006357c4 in udp_send_cb (req=<optimized out>, status=<optimized out>) at udp.c:465
#13 0x00007f3ff5375153 in uv__udp_run_completed () from /lib64/libuv.so.1
#14 0x00007f3ff53754d3 in uv__udp_io () from /lib64/libuv.so.1
#15 0x00007f3ff5367c43 in uv_run () from /lib64/libuv.so.1
#16 0x0000000000632fda in nm_thread (worker0=0x138e3e0) at netmgr.c:481
#17 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3ff484488d in clone () from /lib64/libc.so.6
A handful of threads were attempting to get a read lock on the same node - for example:
Thread 59 (Thread 0x7f3feab0e700 (LWP 11734)):
#0 0x00007f3ff4f3d144 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1 0x000000000063cc6e in isc_rwlock_lock (rwl=0x7f3f59523980, type=type@entry=isc_rwlocktype_read) at rwlock.c:48
#2 0x00000000005129c6 in rdataset_getownercase (rdataset=<optimized out>, name=0x7f3feaaffde0) at rbtdb.c:9770
#3 0x000000000056620a in towiresorted (rdataset=rdataset@entry=0x7f3ec42dee70, owner_name=owner_name@entry=0x7f3ec42dd0a0, cctx=<optimized out>, target=<optimized out>, order=<optimized out>, order_arg=order_arg@entry=0x7f3ec42b8718, partial=true, options=1, countp=0x7f3feab005dc, state=<optimized out>) at rdataset.c:444
#4 0x0000000000566e3f in dns_rdataset_towirepartial (rdataset=rdataset@entry=0x7f3ec42dee70, owner_name=owner_name@entry=0x7f3ec42dd0a0, cctx=<optimized out>, target=<optimized out>, order=<optimized out>, order_arg=order_arg@entry=0x7f3ec42b8718, options=<optimized out>, options@entry=1, countp=<optimized out>, countp@entry=0x7f3feab005dc, state=<optimized out>, state@entry=0x0) at rdataset.c:565
#5 0x00000000004ecc71 in dns_message_rendersection (msg=0x7f3ec42b8550, sectionid=sectionid@entry=1, options=options@entry=6) at message.c:2086
#6 0x00000000004780f3 in ns_client_send (client=client@entry=0x7f3ec5d4b510) at client.c:555
#7 0x0000000000485b7c in query_send (client=0x7f3ec5d4b510) at query.c:552
#8 0x000000000048de23 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10921
#9 0x000000000048f76d in query_respond (qctx=0x7f3feab09a70) at query.c:7414
#10 query_prepresponse (qctx=qctx@entry=0x7f3feab09a70) at query.c:9913
#11 0x000000000049181c in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=0) at query.c:6836
#12 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#13 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#14 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#15 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#16 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#17 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#18 0x000000000048de05 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10853
#19 0x0000000000492420 in query_dname (qctx=<optimized out>) at query.c:9806
#20 query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65568) at query.c:6872
#21 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#22 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#23 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#24 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#25 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#26 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#27 0x000000000048de05 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10853
#28 0x0000000000492420 in query_dname (qctx=<optimized out>) at query.c:9806
#29 query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65568) at query.c:6872
#30 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#31 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#32 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#33 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#34 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#35 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#36 0x0000000000494b26 in query_setup (client=client@entry=0x7f3ec5d4b510, qtype=<optimized out>) at query.c:5217
#37 0x0000000000497056 in ns_query_start (client=client@entry=0x7f3ec5d4b510) at query.c:11318
#38 0x000000000047b101 in ns__client_request (handle=<optimized out>, region=<optimized out>, arg=<optimized out>) at client.c:2209
#39 0x0000000000635462 in udp_recv_cb (handle=<optimized out>, nrecv=48, buf=0x7f3feab0ab00, addr=<optimized out>, flags=<optimized out>) at udp.c:329
#40 0x00007f3ff53755db in uv__udp_io () from /lib64/libuv.so.1
#41 0x00007f3ff53779c8 in uv__io_poll () from /lib64/libuv.so.1
#42 0x00007f3ff5367c70 in uv_run () from /lib64/libuv.so.1
#43 0x0000000000632fda in nm_thread (worker0=0x13926e8) at netmgr.c:481
#44 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#45 0x00007f3ff484488d in clone () from /lib64/libc.so.6
Meanwhile, the threads run by taskmgr (these are the ones that would have recursed) were attempting to get write locks - unsurprisingly, although depending on the node and the client query it is presumably also possible for one of these to want a read lock instead.
Here's a writer:
Thread 50 (Thread 0x7f3fe587b700 (LWP 11746)):
#0 isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980, type=type@entry=isc_rwlocktype_write) at rwlock.c:57
#1 0x000000000051d826 in decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510, least_serial=least_serial@entry=0, nlock=nlock@entry=isc_rwlocktype_read, tlock=tlock@entry=isc_rwlocktype_none, pruning=pruning@entry=false) at rbtdb.c:2040
#2 0x00000000005215bf in detachnode (db=0x7f3fc6457010, targetp=0x7f3fe587acc0) at rbtdb.c:5352
#3 0x00000000004bdd83 in dns_db_detachnode (db=<optimized out>, nodep=nodep@entry=0x7f3fe587acc0) at db.c:588
#4 0x00000000004804cb in qctx_clean (qctx=qctx@entry=0x7f3fe587a830) at query.c:5097
#5 0x000000000048db5a in ns_query_done (qctx=qctx@entry=0x7f3fe587a830) at query.c:10834
#6 0x000000000048f76d in query_respond (qctx=0x7f3fe587a830) at query.c:7414
#7 query_prepresponse (qctx=qctx@entry=0x7f3fe587a830) at query.c:9913
#8 0x000000000049181c in query_gotanswer (qctx=qctx@entry=0x7f3fe587a830, res=res@entry=0) at query.c:6836
#9 0x0000000000496870 in query_resume (qctx=0x7f3fe587a830) at query.c:6134
#10 fetch_callback (task=<optimized out>, event=0x7f3ead5c9c18) at query.c:5716
#11 0x000000000064007a in dispatch (threadid=<optimized out>, manager=<optimized out>) at task.c:1152
#12 run (queuep=<optimized out>) at task.c:1344
#13 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f3ff484488d in clone () from /lib64/libc.so.6
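Our reading of why so many of these threads end up queueing for the write lock: decrement_reference() is entered with the node lock held for reading (nlock=isc_rwlocktype_read in the frames above), and once the reference count may have dropped to zero it needs that same lock for writing before it can clean the node. Since pthread rwlocks cannot be upgraded in place, that means releasing the read lock and contending for the write lock with everyone else. A heavily simplified sketch of the pattern (NOT the actual rbtdb.c code - the names and details here are illustrative only):

#include <pthread.h>
#include <stdatomic.h>

struct node {
    atomic_uint references;
    pthread_rwlock_t *nlock;    /* per-bucket lock, shared by many nodes */
};

static void clean_dead_names(struct node *node) {
    /* free expired names/rdatasets hanging off the node; elided */
    (void)node;
}

/* Called with *node->nlock held for reading. */
void decrement_reference_sketch(struct node *node) {
    if (atomic_fetch_sub(&node->references, 1) != 1) {
        return;    /* other holders remain; nothing to clean yet */
    }
    /*
     * Possibly the last reference: cleaning needs the write lock, and
     * a read->write upgrade isn't possible with pthread rwlocks, so
     * drop the read lock and queue for the write lock.
     */
    pthread_rwlock_unlock(node->nlock);
    pthread_rwlock_wrlock(node->nlock);    /* <- the rwlock.c:57 frames above */
    if (atomic_load(&node->references) == 0) {
        clean_dead_names(node);    /* re-check after re-locking */
    }
    pthread_rwlock_unlock(node->nlock);
    /* the real code restores the caller's locking state; elided */
}

If a popular node's reference count is repeatedly bouncing through zero - each query attaches, answers, detaches - then every detaching thread takes this slow path, which would be consistent with the 30 threads above all queueing on the lock for a single node.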
In this particular instance, every single one of the legacy i/o-handler threads was twiddling its thumbs (sitting in epoll_wait()) - which is probably not too surprising, if no taskmgr workers are sending out queries to authoritative servers.
Doing stats on this particular capture (74 threads: 24x netmgr, 24x taskmgr, 24x legacy i/o, plus the main thread and the timer thread), we have:
- 33 instances of isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980
- 31 instances of rbtdb=rbtdb@entry=0x7f3fc6457010
- 30 instances of node=node@entry=0x7f3eace34510
It may be possible to prove from the pstack output whether this is a series of different names all attached to the same node, or a single expiring name that all of the threads are attempting to clean up simultaneously.
Either way, the locking is not working well in this situation - there appears to be a lot of spinning in user space.
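To make that concrete, here is a minimal standalone toy (not BIND code - the thread counts and loop bodies are invented to roughly mirror this capture: many 'detaching' writers, a handful of readers) in which every thread hammers a single pthread rwlock the way the threads above are hammering the one node lock. Even with tiny critical sections, the writers all serialize on the one lock word, and you would expect a pstack of the running toy to look much like the capture above, with almost every thread inside pthread_rwlock_wrlock():

/* rwlock_herd.c - build with: cc -O2 -pthread rwlock_herd.c -o rwlock_herd */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define WRITERS 20      /* roughly the write-lockers in this capture */
#define READERS 4       /* roughly the handful of read-lockers */
#define RUN_SECONDS 5

static pthread_rwlock_t node_lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_long writes, reads;
static atomic_bool stop;

static void *writer(void *arg) {
    (void)arg;
    while (!atomic_load(&stop)) {
        pthread_rwlock_wrlock(&node_lock);
        atomic_fetch_add(&writes, 1);    /* stands in for node clean-up */
        pthread_rwlock_unlock(&node_lock);
    }
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    while (!atomic_load(&stop)) {
        pthread_rwlock_rdlock(&node_lock);
        atomic_fetch_add(&reads, 1);     /* stands in for rdataset_getownercase() */
        pthread_rwlock_unlock(&node_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tids[WRITERS + READERS];
    int i;

    for (i = 0; i < WRITERS; i++)
        pthread_create(&tids[i], NULL, writer, NULL);
    for (; i < WRITERS + READERS; i++)
        pthread_create(&tids[i], NULL, reader, NULL);

    sleep(RUN_SECONDS);
    atomic_store(&stop, true);
    for (i = 0; i < WRITERS + READERS; i++)
        pthread_join(tids[i], NULL);

    printf("writes: %ld reads: %ld\n",
           atomic_load(&writes), atomic_load(&reads));
    return 0;
}

Comparing the counters with WRITERS=1 versus WRITERS=20 gives a feel for how badly a single lock word scales as the herd grows.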
Hypotheses being tendered currently include:
- This scenario has always potentially existed, but using pthread rwlocks amplifies it considerably (one way to probe this is sketched after this list)
- Could this be a case where prefetching (enabled with default settings in this example) hits a surprising edge case?
- Is it possible we're seeing the after-effects of another delay that has resulted in late processing of client query-responses for something that has a very short TTL in cache?
- Is this a scenario where a client comes along and queries near-simultaneously (and probably quite innocently) for a lot of similar names under the same domain/apex very close to the time at which they would all naturally be expiring from cache?
- Could TTL=0 handling have broken in 9.16 with the introduction of netmgr? (Note that TTL=0 responses from auth servers would be expected to be available only to the clients that recursed and waited for the fetch to complete - not to anyone who came along after the fetch had populated the cache to fulfil the waiting client request. This should all happen in taskmgr, and none of it in netmgr.)
- Do we perhaps have too many threads running (detected CPUs = 24)?
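On the first hypothesis: if pthread rwlocks are the amplifier, one cheap experiment (an illustrative sketch only, not a proposed fix - the helper name is ours) is to vary the lock kind in the toy above. glibc rwlocks default to reader preference, and the GNU extension pthread_rwlockattr_setkind_np() can select PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP instead, which changes how waiting writers are scheduled relative to readers:

/* Writer-preferring rwlock initialization (GNU extension). */
#define _GNU_SOURCE
#include <pthread.h>

int init_writer_pref_lock(pthread_rwlock_t *lock) {
    pthread_rwlockattr_t attr;
    int ret;

    pthread_rwlockattr_init(&attr);
    /* Don't let a continuous stream of readers starve waiting writers. */
    ret = pthread_rwlockattr_setkind_np(
        &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    if (ret == 0) {
        ret = pthread_rwlock_init(lock, &attr);
    }
    pthread_rwlockattr_destroy(&attr);
    return ret;
}

Whether or not the brownout changes shape under a different lock kind would at least narrow down whether the lock's scheduling behaviour, rather than the sheer number of threads wanting it, is what turned this pattern pathological in 9.16.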