# ISC Open Source Projects issues (https://gitlab.isc.org/groups/isc-projects/-/issues)

## [Easier Kea sabotage (demo only)](https://gitlab.isc.org/isc-projects/stork/-/issues/250) (Tomek Mrugalski)

Ok, sabotage is maybe not the best word, but now that I have your attention, here's what the problem is:
we want to be able to kill and later revive a Kea server that is working in an HA pair, so that demonstrating HA problems is easier.
During the demo I did the following (a consolidated shell sketch follows the list):
1. `docker-compose exec agent-kea-ha2 /bin/bash`
2. edit `/etc/supervisor.conf`: in the `[kea-dhcp4]` section I set `autorestart = false`
3. `kill 1` (the supervisor process). This restarted the container (and kicked me out in the process).
4. I was then able to get in again (`docker-compose ...`) and kill Kea: `killall kea-dhcp4`
<observe how Stork detects HA failures, wait a bit>
5. start Kea with `kea-dhcp4 -c /etc/kea/kea-dhcp4.conf`
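
Put together as one transcript, a minimal sketch of the same sequence (assuming, as in the demo, that the service is named `agent-kea-ha2` and supervisord runs as PID 1 in the container):

```
docker-compose exec agent-kea-ha2 /bin/bash
# inside the container:
vi /etc/supervisor.conf                  # set "autorestart = false" in the [kea-dhcp4] section
kill 1                                   # supervisord is PID 1; restarts the container and ends this shell
docker-compose exec agent-kea-ha2 /bin/bash
killall kea-dhcp4                        # Kea stays down because autorestart is now disabled
# <observe how Stork detects HA failures, wait a bit>
kea-dhcp4 -c /etc/kea/kea-dhcp4.conf     # revive Kea (runs in the foreground)
```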
The absolute minimum we could do is this:
1. tweak docker recipes to build the containers with autorestart disabled
2. maybe tweak the DHCP traffic generator, to make it easier to start/stop the service.
Alternatively, we should describe somewhere how a non-developer can stop and start the Kea service in the container.
I don't know, maybe it would be easier to figure out some easy way to stop/start the whole container altogether?
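
If we go that route, the simplest form is probably docker-compose itself; a sketch, assuming the service name from the demo setup:

```
docker-compose stop agent-kea-ha2     # take the whole HA peer down
docker-compose start agent-kea-ha2    # bring it back
```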
In any case there should be some wiki update regarding how to do it.

## [AX_CHECK_COMPILE_FLAG -fno-delete-null-pointer-checks does not fail for clang](https://gitlab.isc.org/isc-projects/bind9/-/issues/1783) (Mark Andrews)

## [9.16.x: listen-on-v6 { any; }; no longer works as documented on FreeBSD](https://gitlab.isc.org/isc-projects/bind9/-/issues/1782) (msinatra)
### Summary
In 9.14.x running on FreeBSD, `listen-on-v6 { any; };` functions as documented in the ARM:

> When { any; } is specified as the address_match_list for the listen-on-v6 option, the server does not bind a separate socket to each IPv6 interface address as it does for IPv4 if the operating system has enough API support for IPv6 (specifically if it conforms to RFC 3493 and RFC 3542). Instead, it listens on the IPv6 wildcard address. If the system only has incomplete API support for IPv6, however, the behavior is the same as that for IPv4.
In 9.16.x, it does not function as documented.
9.14.x:
```
root@devns1:~ # fgrep listen-on-v6 /etc/namedb/named.conf
listen-on-v6 { any; };
root@devns1:~ # sockstat | grep named
bind named 44277 3 dgram -> /var/run/logpriv
bind named 44277 21 tcp6 *:53 *:*
bind named 44277 23 tcp4 127.0.0.1:53 *:*
bind named 44277 24 tcp4 127.0.0.1:953 *:*
bind named 44277 25 tcp6 ::1:953 *:*
bind named 44277 512 udp6 *:53 *:*
bind named 44277 514 udp4 127.0.0.1:53 *:*
```
9.16.1 (also verified on 9.16.2):
```
root@devns1:~ # fgrep listen-on-v6 /etc/namedb/named.conf
listen-on-v6 { any; };
root@devns1:~ # sockstat | grep named
bind named 617 27 udp6 ::1:53 *:*
bind named 617 28 tcp6 ::1:53 *:*
bind named 617 29 tcp6 ::1:53 *:*
bind named 617 30 udp6 fe80::1%lo0:53 *:*
bind named 617 31 tcp6 fe80::1%lo0:53 *:*
bind named 617 32 tcp6 fe80::1%lo0:53 *:*
bind named 617 33 udp4 127.0.0.1:53 *:*
bind named 617 34 tcp4 127.0.0.1:53 *:*
bind named 617 35 tcp4 127.0.0.1:53 *:*
bind named 617 36 tcp4 127.0.0.1:953 *:*
bind named 617 37 tcp6 ::1:953 *:*
```
### BIND version used
9.16.2 exhibits the bug:
```
BIND 9.16.2 (Stable Release) <id:b310dc7>
running on FreeBSD amd64 11.3-RELEASE-p7 FreeBSD 11.3-RELEASE-p7 #0: Tue Mar 17 08:32:23 UTC 2020 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
built by make with '--disable-linux-caps' '--localstatedir=/var' '--sysconfdir=/usr/local/etc/namedb' '--with-dlopen=yes' '--with-libxml2' '--with-openssl=/usr/local' '--with-readline=-L/usr/local/lib -ledit' '--with-dlz-filesystem=yes' '--disable-dnstap' '--disable-fixed-rrset' '--disable-geoip' '--without-maxminddb' '--without-gssapi' '--with-libidn2=/usr/local' '--with-json-c' '--disable-largefile' '--with-lmdb=/usr/local' '--disable-native-pkcs11' '--without-python' '--disable-querytrace' 'STD_CDEFINES=-DDIG_SIGCHASE=1' '--enable-tcp-fastopen' '--with-tuning=default' '--disable-symtable' '--prefix=/usr/local' '--mandir=/usr/local/man' '--infodir=/usr/local/share/info/' '--build=amd64-portbld-freebsd11.3' 'build_alias=amd64-portbld-freebsd11.3' 'CC=cc' 'CFLAGS=-O2 -pipe -DLIBICONV_PLUG -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing ' 'LDFLAGS= -L/usr/local/lib -ljson-c -Wl,-rpath,/usr/local/lib -fstack-protector-strong ' 'LIBS=-L/usr/local/lib' 'CPPFLAGS=-DLIBICONV_PLUG -isystem /usr/local/include' 'CPP=cpp' 'PKG_CONFIG=pkgconf'
compiled by CLANG 4.2.1 Compatible FreeBSD Clang 8.0.0 (tags/RELEASE_800/final 356365)
compiled with OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
linked to OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
compiled with libxml2 version: 2.9.10
linked to libxml2 version: 20910
compiled with json-c version: 0.13.1
linked to json-c version: 0.13.1
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
threads support is enabled
default paths:
named configuration: /usr/local/etc/namedb/named.conf
rndc configuration: /usr/local/etc/namedb/rndc.conf
DNSSEC root key: /usr/local/etc/namedb/bind.keys
nsupdate session key: /var/run/named/session.key
named PID file: /var/run/named/pid
named lock file: /var/run/named/named.lock
```
### Steps to reproduce
1. System running FreeBSD 11.3-RELEASE or 12.1-RELEASE.
2. Install BIND916 from ports (with default options).
3. Create an lo1 interface with (an) IPv6 address(es). We use lo1 for the service addresses of our anycast instances.
4. `ifconfig lo1 down`
5. Start named with a basic recursive or authoritative config, with `listen-on-v6 { any; };` configured.
6. `sockstat | grep named`. named will not be listening on the wildcard -nor- on the IPv6 addresses configured on lo1. This is because FreeBSD supports Enhanced DAD on all interfaces and marks all v6 addresses as 'tentative' until the interface comes up.
7. `ifconfig lo1 up`
8. `sockstat | grep named`. named is still not listening on lo1's IPv6 addresses.
9. Attempt to query the server on the IPv6 address on lo1. It will time out.
10. `rndc scan`
11. repeat steps 8 and 9. Still not listening on the lo1 addresses and not responding.
12. RESTART named. It is now listening on the new lo1 addresses.
With 9.14.x, queries to the new address do not time out because named is properly listening on the wildcard.
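
A condensed shell sketch of the reproduction (the lo1 address is only an example, and `onestart`/`onerestart` assume the rc script from the port without `named_enable` set):

```
ifconfig lo1 create
ifconfig lo1 inet6 2001:db8::10/64
ifconfig lo1 down
service named onestart
sockstat -6 | grep named        # no wildcard socket, no lo1 addresses
ifconfig lo1 up
rndc scan
sockstat -6 | grep named        # still nothing on lo1's addresses
dig @2001:db8::10 example.com   # times out
service named onerestart        # only after a restart does named listen on lo1
```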
WORKAROUND:
`ifconfig lo1 no_dad`. This disables DAD processing on the loopback (not clear why you need it there anyway) and clears the `tentative` flag even if lo1 is down. named will listen explicitly on the IPv6 addresses whether lo1 is marked "UP" or "DOWN." Note that this does not work reliably on FreeBSD 11.3-RELEASE, but does work on 12.1-RELEASE.

## [look up DNS names for assigned addresses](https://gitlab.isc.org/isc-projects/stork/-/issues/249) (Vicky Risk)

Can we do a reverse lookup on addresses we are managing, and display any configured DNS names in Stork?
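
For a single managed address this is just a PTR query; a minimal sketch of the lookup Stork would issue (the address and server are placeholders):

```
dig +short -x 192.0.2.10 @192.0.2.53
```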
It is likely that the overwhelming proportion of addresses will NOT have a DNS name, so we have to consider that in the UI. We also don't want to DDoS the DNS server by looking up too much stuff that isn't in there. Dynamically assigned addresses are far less likely to have a DNS name (despite DDNS).
We might start by looking up DNS names for host reservations only and reporting them in that panel.
Alternatively, we could have a button to look up the DNS name for a specific host reservation, cache it, and report it along with that HR, but only look names up when explicitly triggered to do so.

## [Fix system tests failing with Automake](https://gitlab.isc.org/isc-projects/bind9/-/issues/1780) (Michał Kępień)

1. `.gitlab-ci.yml` script for running tests is currently broken as it
hides test failures[^1].
2. A number of system tests (e.g. `rrsetorder`) are consistently
failing.
We should first make CI jobs fail when tests fail and then fix the
failures one by one.
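
The reason the current script hides failures is plain shell semantics: when `make check` fails, the `||` branch runs `cat`, and the job's exit status becomes that of `cat` (success). A sketch of a variant that still dumps the log but propagates the failure (an illustration, not the actual `.gitlab-ci.yml` change):

```
( cd bin/tests/system && make -j${TEST_PARALLEL_JOBS:-1} -k check V=1 ) || { cat bin/tests/system/test-suite.log; exit 1; }
```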
[^1]: `( cd bin/tests/system && make -j${TEST_PARALLEL_JOBS:-1} -k check V=1 ) || cat bin/tests/system/test-suite.log`

## [Cleanup the final remnants of platform.h](https://gitlab.isc.org/isc-projects/bind9/-/issues/1778) (Ondřej Surý)

There are still a few remaining bits in the `platform.h` header that we need to remove before we can finally get rid of the header.

## [Update the build instructions for automake](https://gitlab.isc.org/isc-projects/bind9/-/issues/1777) (Ondřej Surý)

Go through various README and other documentation files and update the instructions on how to build BIND 9 now that automake is in place.
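
For the updated instructions, the build itself is the standard autotools sequence; a minimal sketch (the prefix is a placeholder, not the final documented invocation):

```
autoreconf -fi          # only needed when building from a git checkout
./configure --prefix=/usr/local
make
make check              # optional: run the test suites
make install
```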
## [BIND 9.16 and cache node locks for name cleaning vs. 'the thundering herd'](https://gitlab.isc.org/isc-projects/bind9/-/issues/1776) (Cathy Almond)

From [Support ticket #16212](https://support.isc.org/Ticket/Display.html?id=16212)
During investigations of intermittent 'brownouts' - periods in which named seemingly stops actioning client queries for a short period and then resumes processing a second or two later (yes, delays of seconds, not ms) - we 'caught' one interesting scenario on BIND 9.16 in which it appeared that the vast majority of the active threads (netmgr and taskmgr both - so both client queries being answered from cache, AND client queries for which recursion had just taken place) were competing for the same cache node lock.
The pstack output demonstrating the problem was automatically triggered by monitoring for anomalies in inbound versus outbound network traffic.
The symptoms when this issue occurs are that:
* Outbound client-facing traffic rates plummet (well below the proportion that you would expect to see if it was only cache misses not being serviced)
* Recursive query rates plummet too
* CPU use increases - but in user space not in system space
* Recursive clients backlog increases (and may hit the limit)
* Fetch limits may be triggered (we suspect this, and its predecessor, are a symptom rather than the cause; triggering fetch limits will, however, exacerbate the situation, both from the client perspective and through increased traffic rates as clients retry/re-send)
What we saw in the pstacks was that the majority of netmgr threads (these answer directly from cache) were attempting to get a write lock on the node - for example:
```
Thread 74 (Thread 0x7f3ff366e700 (LWP 11713)):
#0 isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980, type=type@entry=isc_rwlocktype_write) at rwlock.c:57
#1 0x000000000051d826 in decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510, least_serial=least_serial@entry=0, nlock=nlock@entry=isc_rwlocktype_read, tlock=tlock@entry=isc_rwlocktype_none, pruning=pruning@entry=false) at rbtdb.c:2040
#2 0x00000000005215bf in detachnode (db=0x7f3fc6457010, targetp=targetp@entry=0x7f3ff366da88) at rbtdb.c:5352
#3 0x00000000005217be in rdataset_disassociate (rdataset=<optimized out>) at rbtdb.c:8691
#4 0x00000000005657e8 in dns_rdataset_disassociate (rdataset=rdataset@entry=0x7f3fad30cf28) at rdataset.c:111
#5 0x00000000004ebb21 in msgresetnames (first_section=0, msg=0x7f3fad2e1a50, msg@entry=0x7f3fad30b5f0) at message.c:438
#6 msgreset (msg=msg@entry=0x7f3fad2e1a50, everything=everything@entry=false) at message.c:524
#7 0x00000000004ec95a in dns_message_reset (msg=0x7f3fad2e1a50, intent=intent@entry=1) at message.c:760
#8 0x00000000004797ba in ns_client_endrequest (client=0x7f3fae5b8550) at client.c:229
#9 ns__client_reset_cb (client0=0x7f3fae5b8550) at client.c:1586
#10 0x0000000000632989 in isc_nmhandle_unref (handle=handle@entry=0x7f3fae5b83e0) at netmgr.c:1158
#11 0x0000000000632c30 in isc__nm_uvreq_put (req0=req0@entry=0x7f3ff366dbb8, sock=<optimized out>) at netmgr.c:1291
#12 0x00000000006357c4 in udp_send_cb (req=<optimized out>, status=<optimized out>) at udp.c:465
#13 0x00007f3ff5375153 in uv__udp_run_completed () from /lib64/libuv.so.1
#14 0x00007f3ff53754d3 in uv__udp_io () from /lib64/libuv.so.1
#15 0x00007f3ff5367c43 in uv_run () from /lib64/libuv.so.1
#16 0x0000000000632fda in nm_thread (worker0=0x138e3e0) at netmgr.c:481
#17 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3ff484488d in clone () from /lib64/libc.so.6
```
A handful of threads are attempting to get a read lock on the same node - for example:
```
Thread 59 (Thread 0x7f3feab0e700 (LWP 11734)):
#0 0x00007f3ff4f3d144 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1 0x000000000063cc6e in isc_rwlock_lock (rwl=0x7f3f59523980, type=type@entry=isc_rwlocktype_read) at rwlock.c:48
#2 0x00000000005129c6 in rdataset_getownercase (rdataset=<optimized out>, name=0x7f3feaaffde0) at rbtdb.c:9770
#3 0x000000000056620a in towiresorted (rdataset=rdataset@entry=0x7f3ec42dee70, owner_name=owner_name@entry=0x7f3ec42dd0a0, cctx=<optimized out>, target=<optimized out>, order=<optimized out>, order_arg=order_arg@entry=0x7f3ec42b8718, partial=true, options=1, countp=0x7f3feab005dc, state=<optimized out>) at rdataset.c:444
#4 0x0000000000566e3f in dns_rdataset_towirepartial (rdataset=rdataset@entry=0x7f3ec42dee70, owner_name=owner_name@entry=0x7f3ec42dd0a0, cctx=<optimized out>, target=<optimized out>, order=<optimized out>, order_arg=order_arg@entry=0x7f3ec42b8718, options=<optimized out>, options@entry=1, countp=<optimized out>, countp@entry=0x7f3feab005dc, state=<optimized out>, state@entry=0x0) at rdataset.c:565
#5 0x00000000004ecc71 in dns_message_rendersection (msg=0x7f3ec42b8550, sectionid=sectionid@entry=1, options=options@entry=6) at message.c:2086
#6 0x00000000004780f3 in ns_client_send (client=client@entry=0x7f3ec5d4b510) at client.c:555
#7 0x0000000000485b7c in query_send (client=0x7f3ec5d4b510) at query.c:552
#8 0x000000000048de23 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10921
#9 0x000000000048f76d in query_respond (qctx=0x7f3feab09a70) at query.c:7414
#10 query_prepresponse (qctx=qctx@entry=0x7f3feab09a70) at query.c:9913
#11 0x000000000049181c in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=0) at query.c:6836
#12 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#13 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#14 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#15 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#16 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#17 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#18 0x000000000048de05 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10853
#19 0x0000000000492420 in query_dname (qctx=<optimized out>) at query.c:9806
#20 query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65568) at query.c:6872
#21 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#22 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#23 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#24 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#25 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#26 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#27 0x000000000048de05 in ns_query_done (qctx=qctx@entry=0x7f3feab09a70) at query.c:10853
#28 0x0000000000492420 in query_dname (qctx=<optimized out>) at query.c:9806
#29 query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65568) at query.c:6872
#30 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#31 0x00000000004950f6 in query_zone_delegation (qctx=0x7f3feab09a70) at query.c:8003
#32 query_delegation (qctx=qctx@entry=0x7f3feab09a70) at query.c:8031
#33 0x0000000000491a1a in query_gotanswer (qctx=qctx@entry=0x7f3feab09a70, res=res@entry=65565) at query.c:6842
#34 0x0000000000493a22 in query_lookup (qctx=qctx@entry=0x7f3feab09a70) at query.c:5617
#35 0x0000000000494036 in ns__query_start (qctx=qctx@entry=0x7f3feab09a70) at query.c:5493
#36 0x0000000000494b26 in query_setup (client=client@entry=0x7f3ec5d4b510, qtype=<optimized out>) at query.c:5217
#37 0x0000000000497056 in ns_query_start (client=client@entry=0x7f3ec5d4b510) at query.c:11318
#38 0x000000000047b101 in ns__client_request (handle=<optimized out>, region=<optimized out>, arg=<optimized out>) at client.c:2209
#39 0x0000000000635462 in udp_recv_cb (handle=<optimized out>, nrecv=48, buf=0x7f3feab0ab00, addr=<optimized out>, flags=<optimized out>) at udp.c:329
#40 0x00007f3ff53755db in uv__udp_io () from /lib64/libuv.so.1
#41 0x00007f3ff53779c8 in uv__io_poll () from /lib64/libuv.so.1
#42 0x00007f3ff5367c70 in uv_run () from /lib64/libuv.so.1
#43 0x0000000000632fda in nm_thread (worker0=0x13926e8) at netmgr.c:481
#44 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#45 0x00007f3ff484488d in clone () from /lib64/libc.so.6
```
Meanwhile, the threads run by taskmgr (this bunch would have recursed) were attempting to get write locks (unsurprisingly, although depending on the node and the client query, I guess it's also possible that one might want to get a read lock):
Here's a writer:
```
Thread 50 (Thread 0x7f3fe587b700 (LWP 11746)):
#0 isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980, type=type@entry=isc_rwlocktype_write) at rwlock.c:57
#1 0x000000000051d826 in decrement_reference (rbtdb=rbtdb@entry=0x7f3fc6457010, node=node@entry=0x7f3eace34510, least_serial=least_serial@entry=0, nlock=nlock@entry=isc_rwlocktype_read, tlock=tlock@entry=isc_rwlocktype_none, pruning=pruning@entry=false) at rbtdb.c:2040
#2 0x00000000005215bf in detachnode (db=0x7f3fc6457010, targetp=0x7f3fe587acc0) at rbtdb.c:5352
#3 0x00000000004bdd83 in dns_db_detachnode (db=<optimized out>, nodep=nodep@entry=0x7f3fe587acc0) at db.c:588
#4 0x00000000004804cb in qctx_clean (qctx=qctx@entry=0x7f3fe587a830) at query.c:5097
#5 0x000000000048db5a in ns_query_done (qctx=qctx@entry=0x7f3fe587a830) at query.c:10834
#6 0x000000000048f76d in query_respond (qctx=0x7f3fe587a830) at query.c:7414
#7 query_prepresponse (qctx=qctx@entry=0x7f3fe587a830) at query.c:9913
#8 0x000000000049181c in query_gotanswer (qctx=qctx@entry=0x7f3fe587a830, res=res@entry=0) at query.c:6836
#9 0x0000000000496870 in query_resume (qctx=0x7f3fe587a830) at query.c:6134
#10 fetch_callback (task=<optimized out>, event=0x7f3ead5c9c18) at query.c:5716
#11 0x000000000064007a in dispatch (threadid=<optimized out>, manager=<optimized out>) at task.c:1152
#12 run (queuep=<optimized out>) at task.c:1344
#13 0x00007f3ff4f39e65 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f3ff484488d in clone () from /lib64/libc.so.6
```
In this particular instance, every single one of the legacy I/O-handler threads was twiddling its thumbs (sitting on `epoll_wait()`) - which is probably not too surprising if no taskmgr workers are sending out queries to auth servers?
Doing stats on this particular capture (74 threads - 24x netmgr, 24x taskmgr, 24x legacy i/o, plus the main and timer threads), we have:
* 33 instances of `isc_rwlock_lock (rwl=rwl@entry=0x7f3f59523980`
* 31 instances of `rbtdb=rbtdb@entry=0x7f3fc6457010`
* 30 instances of `node=node@entry=0x7f3eace34510`
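
These counts can be pulled straight out of the captured stack dump; a sketch, with `pstack.out` as a placeholder filename:

```
grep -c 'rwl=rwl@entry=0x7f3f59523980'     pstack.out   # threads contending on this one rwlock
grep -c 'rbtdb=rbtdb@entry=0x7f3fc6457010' pstack.out   # threads inside the same cache db
grep -c 'node=node@entry=0x7f3eace34510'   pstack.out   # threads working on the same cache node
```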
It might be that it's possible to prove from the pstack output that this is a series of different names all attached to the same node, versus a single name that is expiring that all of the threads are attempting to clean-up simultaneously.
Either way, the locking is not working well in this situation - there's a lot of spinning in user space it would appear.
Hypotheses being tendered currently include:
* This scenario has always potentially existed, but using pthread-rwlocks amplifies it considerably
* Could this be a case where prefetching (enabled with default settings in this example) hits a surprise edge case?
* Is it possible we're seeing the after-effects of another delay which has resulted in late client query-response processing for something that has a very short TTL in cache?
* Is this a scenario where a client comes along and queries near-simultaneously (and probably quite innocently) for a lot of similar names under the same domain/apex very close to the time where they would all be naturally expiring from cache?
* Could it be that TTL=0 handling has broken in 9.16 with the introduction of netmgr (noting that TTL=0 responses from auth servers would be expected to be available solely to the clients that recursed and waited for the fetch completion - not to anyone who came along after the fetch had populated cache for the waiting client request to be fulfilled - this should all be in taskmgr and none of it in netmgr)?
* Do we perhaps have too many threads running (detected CPUs = 24)?

## [after entering kea app page with ha status there are lots of errors in the web browser console](https://gitlab.isc.org/isc-projects/stork/-/issues/248) (Michal Nowikowski)

The errors:
```
ERROR TypeError: "this._receivedStatus is undefined"
```

## [Resizing (growing) of cache hash tables causes delays in processing of client queries](https://gitlab.isc.org/isc-projects/bind9/-/issues/1775) (Cathy Almond)

From [Support ticket #16212](https://support.isc.org/Ticket/Display.html?id=16212)
During investigations of intermittent 'brownouts' - periods in which named seemingly stops actioning client queries for a short period and then resumes processing a second or two later (yes, delays of seconds, not ms) - we 'caught' one culprit red-handed in a pstack run that was automatically triggered by an 'alarm' while monitoring inbound and outbound server traffic rates.
The thread in question was holding the cache tree lock, while growing the hash table:
```
Thread 21 (Thread 0x7f54d8b2f700 (LWP 19115)):
#0 0x000000000052bc7b in rehash (rbt=0x7f54b8c04058, newcount=<optimized out>) at rbt.c:2376
#1 0x000000000052da99 in hash_node (name=0x7f53d9562bb0, node=0x7f541cf79538, rbt=0x7f54b8c04058) at rbt.c:2389
#2 dns_rbt_addnode (rbt=0x7f54b8c04058, name=0x7f53d9562bb0, nodep=0x7f54d8b2dd28) at rbt.c:1451
#3 0x00000000005367ef in rbt_addnode_withdata (rbtdb=0x7f54b8c03010, rbt=0x7f54b8c04058, name=<optimized out>, nodep=0x7f54d8b2dd28) at rbtdb.c:2016
#4 0x000000000053ba42 in findnodeintree (rbtdb=0x7f54b8c03010, tree=0x7f54b8c04058, name=0x7f53d9562bb0, create=true, nodep=0x7f54d8b2ed30) at rbtdb.c:3339
#5 0x00000000005babb5 in cache_name (now=1587326409, zerottl=false, name=0x7f53d9562bb0, section=1, query=0x7f54600100d0, fctx=0x7f5449e172d0) at resolver.c:5876
#6 cache_message (now=1587326409, zerottl=false, query=0x7f54600100d0, fctx=0x7f5449e172d0) at resolver.c:6336
#7 resquery_response (task=0x7f5387cbb628, event=<optimized out>) at resolver.c:9166
#8 0x000000000068a8b1 in dispatch (manager=0x7f54dedc7010) at task.c:1157
#9 run (uap=0x7f54dedc7010) at task.c:1331
#10 0x00007f54dd90cdd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f54dd635ead in clone () from /lib64/libc.so.6
```
The other cause of similar problems is growing the ADB tables - that one, however, is logged, whereas it doesn't look like 'rehash', or anything that calls it, owns up (via logging) to what it is doing.
Our immediate quick-fix wish is for a solution to the delays caused by growing hash tables that is along the lines of being able to specify the starting size as named is launched. This needs to be either run-time or configurable in named.conf. (It is *not* helpful to make it build-time only because in many environments there will be a single build that is distributed to many servers whose needs/sizing can vary.)
It would also be really helpful if any hash table growing could be logged - to include what the size is expanding to (this will help admins to tune their servers accordingly).
====
Longer term, I understand that the wish is to replace the current and now fairly ancient hashing solution with something more modern and faster that, in particular, doesn't need to block access when resizing - I'll leave engineering to open a new and independent ticket for that. For the here and now, we need a quicker fix, not a new development feature that can't be back-ported or easily applied.

## [update stork deps (for backend in go and for webui)](https://gitlab.isc.org/isc-projects/stork/-/issues/247) (Michal Nowikowski)

Deps for update:
- angular to 9.1
- and more

## [Get Windows builds working again](https://gitlab.isc.org/isc-projects/bind9/-/issues/1774) (Michał Kępień)

The following discussion from !985 should be addressed:
- [ ] @michal started a [discussion](https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/985#note_125128): (+1 comment)
> Fair enough, but there is a ton of `#if _MSC_VER ...` conditional blocks
> still out there. Something to address in a separate issue? (Or when we
> try to fix Windows builds?)

## [Consider including all compile flags in "named -V" output](https://gitlab.isc.org/isc-projects/bind9/-/issues/1773) (Michał Kępień)

The following discussion from !985 should be addressed:
- [ ] @michal started a [discussion](https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/985#note_125108): (+2 comments)
> Nit: I think this would be better written as:
>
> echo "CFLAGS: $STD_CFLAGS $CFLAGS"
>
> because that is the order that Automake uses in the `Makefile`s it
> generates with `AM_CFLAGS` set to `$(STD_CFLAGS)`.
>
> Also, something for the future perhaps, but it would be nice to see
> *all* (effective) `CFLAGS`/`CPPFLAGS`/`LDFLAGS` in `named -V` output,
> not just those overridden using environment variables set when
> `./configure` is run.

## [Properly test GSSAPI TSIG against Windows client / server](https://gitlab.isc.org/isc-projects/bind9/-/issues/1772) (Ondřej Surý)

Since we now have Windows as part of the CI, we can (most probably) properly test GSSAPI TSIG against Windows. This would involve discovering how to configure both a Windows client and a Windows server, and walking through the available authentication mechanisms to test them all.

## [Refactor how we load librpz.so](https://gitlab.isc.org/isc-projects/bind9/-/issues/1771) (Ondřej Surý)

Currently, there are three ways `librpz.so` can be linked into BIND 9. In BIND 9.17, `dlopen()` is mandatory (via libltdl), so this needs a little bit of refactoring.

## [Review how we use sys/un.h](https://gitlab.isc.org/isc-projects/bind9/-/issues/1770)

There's:
* `#ifdef ISC_PLATFORM_HAVESYSUNH`
* `#ifdef ISC_PLAFORM_HAVESYSUNH`
* `#ifdef AF_UNIX`
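
A quick way to locate every variant guard before fixing them; a sketch, run from the top of the source tree (the directory list is an assumption):

```
grep -rnE 'ISC_PLA(T)?FORM_HAVESYSUNH|ifdef AF_UNIX' lib/ bin/
```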
Let's fix this properly in stable releases...

## [Display the source of information for hosts in the UI](https://gitlab.isc.org/isc-projects/stork/-/issues/246) (Marcin Siodelski)

We do record in the database whether the host reservation comes from the config file or from the hosts_cmds hooks library. We want to display that in the UI next to each reservation.

## [Subnet filtering bug: showing the same subnet several times](https://gitlab.isc.org/isc-projects/stork/-/issues/245) (Tomek Mrugalski)

There's a bug in filtering subnets. I have only agent-kea configured. It reports there are 9 subnets. I went to DHCP->Subnets and used "6" as a filtering string, hoping to see only the 192.0.6.0 subnet. However, it now shows 11 subnets, including 3 copies of 192.0.6.0.
If I use a longer filter string, 0.6.0, it now limits the subnets correctly, but still shows 3 copies of the 192.0.6.0 subnet.
This is what I have in the db:
```
stork=> select * from subnet;
id | created_at | prefix | shared_network_id | client_class | addr_utilization | pd_utilization
----+---------------------------+---------------+-------------------+--------------+------------------+----------------
1 | 2020-04-20 11:35:49.21212 | 192.0.5.0/24 | 1 | class-01-00 | |
2 | 2020-04-20 11:35:49.21212 | 192.0.6.0/24 | 1 | class-01-01 | |
3 | 2020-04-20 11:35:49.21212 | 192.0.7.0/24 | 1 | class-01-02 | |
4 | 2020-04-20 11:35:49.21212 | 192.0.8.0/24 | 1 | class-01-03 | |
5 | 2020-04-20 11:35:49.21212 | 192.0.9.0/24 | 1 | class-01-04 | |
6 | 2020-04-20 11:35:49.21212 | 192.1.15.0/24 | 2 | class-02-00 | |
7 | 2020-04-20 11:35:49.21212 | 192.1.16.0/24 | 2 | class-02-01 | |
8 | 2020-04-20 11:35:49.21212 | 192.1.17.0/24 | 2 | class-02-02 | |
9 | 2020-04-20 11:35:49.21212 | 192.0.2.0/24 | | class-00-00 | |
(9 rows)
```
and this is how it looks filtered:
![bug-duplicate-subnets](/uploads/93df4695f51dd1872cb44ec193bcf405/bug-duplicate-subnets.png)

## [UI cosmetic tweaks](https://gitlab.isc.org/isc-projects/stork/-/issues/244) (Tomek Mrugalski)

Here's a list of small issues I found during demo preparation:
1. Help => BIND9 Manual links to a broken page (bind9 is broken on rtd)
1. Login page "ver:" => "version"
1. On the hosts page, there should be information about where the host info came from: the Kea config file or a database. (taken care of in #246)
1. Need to add help for hosts (taken care of in #217)
1. Grafana is a service and should be shown in the Services menu (justification: 1. it is a service, 2. we should save space on the screen, the menu already looks ugly wrapped when doing screenshots on non-full screen)
1. The BIND Grafana template should have Stork added to its name (and a stork tag), so it would look similar to the Kea DHCPv4 dashboard.

## [Client throttling hook (limits)](https://gitlab.isc.org/isc-projects/kea/-/issues/1193) (Tomek Mrugalski)

We should develop a solution that would do client throttling, somewhat similar to RRL in DNS.