BIND as an authoritative server stops responding to queries during `rndc reconfig`
Summary
The setup is a server with a large number of primary zones (10,000 in the customer example) that has reached steady state under a heavy test query load, with nearly all queries receiving responses.
A new zone is added to the configuration and `rndc reconfig` is sent to the server, which results in a large loss of responses for the duration of the reconfiguration.
BIND version affected
- Affects v9.19: v9.19.23
- Affects v9.18: v9.18.26
- Affects v9.16: v9.16.50
Originally observed on 9.18.18 and 9.19.16, compared with 9.16.43, which did not exhibit the same problem.
Steps to reproduce
- 10,000 small primary zones (SOA and two NS records only). Recursion is disabled. Zone file: 1rrexample.db
- Load generator: `yes 'zr7hakuwn2.example A' | dnsperf -S1 -c 128`
- Issue `rndc reconfig` in a loop: `while true; do rndc -s 127.0.0.1 reconfig; sleep 1; done`
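For anyone who wants to rebuild a similar setup from scratch, here is a minimal sketch of how the zone configuration could be generated. It is not the configuration used in the test above: the function name `gen_zones`, the zone names `zoneNNNNN.example`, the file name `zones.conf`, and the out-of-zone name server names under `example.net` are all assumptions made for illustration. Only the shape (one SOA and two NS records per zone, all zones of type primary) follows the description.

```shell
#!/bin/sh
# Sketch only: generate COUNT minimal primary zones that share one zone file.
# gen_zones, zoneNNNNN.example, zones.conf and the example.net name servers
# are hypothetical names, not taken from the original test setup.

gen_zones() {
    count=$1      # number of zone stanzas to emit
    outdir=$2     # directory receiving zones.conf and the shared zone file

    mkdir -p "$outdir" || return 1

    # Shared zone file: one SOA and two NS records, all owned by the zone apex.
    cat > "$outdir/1rrexample.db" <<'EOF'
$TTL 300
@ IN SOA ns1.example.net. hostmaster.example.net. 1 3600 900 604800 300
@ IN NS ns1.example.net.
@ IN NS ns2.example.net.
EOF

    # One zone statement per line, each pointing at the shared zone file.
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'zone "zone%05d.example" { type primary; file "%s/1rrexample.db"; };\n' \
            "$i" "$outdir"
        i=$((i + 1))
    done > "$outdir/zones.conf"
}
```

Calling `gen_zones 10000 /path/to/zones` and adding `include "/path/to/zones/zones.conf";` to named.conf should give a comparable zone count. An absolute output directory is the safer choice, since named resolves relative file names against its `directory` option.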
What is the current bug behavior?
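QPS drops and queries may be lost while the `rndc reconfig` loop is running. In the run below, it falls from about 146,000 to between roughly 97,000 and 137,000.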
Here's QPS as reported by dnsperf:
1713437900.912125: 146445.751469
1713437901.913167: 146366.486121
1713437902.914202: 146622.245975
1713437903.915236: 146122.908912
1713437904.916352: 135082.248211 <<-- rndc reconfig loop started here
1713437905.917388: 114448.431425
1713437906.918426: 107053.878075
1713437907.918797: 108374.792952
1713437908.919924: 108645.556458
1713437909.920957: 134391.173917
1713437910.922015: 103220.792402
1713437911.922123: 101256.064345
1713437912.923159: 96982.526103
1713437913.924211: 130121.112590
1713437914.925252: 118918.206147
1713437915.925452: 107023.595281
1713437916.926491: 105333.558433
1713437917.927541: 110063.433395
1713437918.928580: 136672.996756
1713437919.928787: 106074.042673 <<-- rndc reconfig loop stopped here
1713437920.929823: 143345.494068
1713437921.930859: 144386.415673
1713437922.931896: 141657.101586
1713437923.932122: 144858.262033
1713437924.933159: 143524.165440
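To compare runs, the per-second figures can be averaged over a time window. The helper below is a small sketch, not part of dnsperf: `mean_qps` is a hypothetical name, and it assumes input lines in the `timestamp: qps` form shown above (trailing annotations are ignored).

```shell
#!/bin/sh
# Sketch only: mean_qps START END
# Reads "timestamp: qps" lines on stdin and prints the mean QPS of the
# samples whose timestamp lies within [START, END] (epoch seconds).
# mean_qps is a hypothetical helper, not a dnsperf feature.

mean_qps() {
    awk -v start="$1" -v end="$2" '
        {
            ts = $1
            sub(/:$/, "", ts)                 # drop the trailing colon
            if (ts + 0 >= start + 0 && ts + 0 <= end + 0) {
                sum += $2                     # second field is the QPS value
                n++
            }
        }
        END {
            if (n) printf "%.0f\n", sum / n
            else   print "no samples"
        }
    '
}
```

For example, with the output above saved to a file (here called qps.log as a placeholder), `mean_qps 1713437904 1713437920 < qps.log` averages the samples taken while the reconfig loop was running, and `mean_qps 1713437900 1713437904 < qps.log` gives the baseline before it.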
What is the expected correct behavior?
Minimal impact, assuming some system capacity is available for the extra work.
Extra comments
Here are CPU profiles, generated without a sleep between `rndc reconfig` calls to make the problem more visible: