Re-evaluate BIND performance with pthread rwlocks
BIND 9.16.1+ uses pthread rwlocks by default, with a build-time option
(--disable-pthread-rwlock) to use the ISC rwlock implementation
instead. The decision to enable use of pthread rwlocks by default was
based on testing various deployment scenarios in Perflab.
Every BIND release goes through a "stress" test before the BIND QA
team signs off on it. This test starts three named instances ("root",
authoritative, recursive) on the same host and uses a few Flamethrower
instances (also started on the same host) to query the recursive server
for names in a domain served by the authoritative server at a specified
rate.
Historically, this test tracked only memory use over time, as its sole purpose was detecting memory leaks. Recently, however, we also started looking at other statistics generated during this test.
While testing BIND 9.16.2 on his development machine (6 cores + HT), @mnowak noticed that the number of incoming queries received by the recursive server during the "stress" test was unexpectedly low. Flamethrower statistics were showing average response times in the range of hundreds of milliseconds for all combinations of layer 3 (IPv4, IPv6) and layer 4 (UDP, TCP) protocols.
I tried reproducing this phenomenon on my laptop (2 cores + HT) to no avail, so we initially assumed this was some local problem on @mnowak's machine.
Yet, @mnowak tested older versions: the problem persisted with BIND
9.16.1 but was no longer present with BIND 9.16.0. So he did a
git bisect between BIND 9.16.0 and BIND 9.16.1 and found out that the
commit which causes BIND performance on his laptop to plummet was
16cedf6e ("Use pthread rwlocks by default"). So he retested BIND 9.16.2
built with ISC rwlocks instead, and the problem was no longer there.
I managed to reproduce this behavior on
cores + HT). I did a few 5-minute "stress" tests (the problem seems to
be triggered immediately, not after some time, so 5 minutes seem good
enough for initial experiments) with:
- BIND 9.16.0
- BIND 9.16.1
- BIND 9.16.2
For each release, I tested 2 builds: one using pthread rwlocks and one
using ISC rwlocks.
The requested query rates for each L3/L4 protocol tuple were:
- UDP/IPv4: 10,000 qps
- UDP/IPv6: 10,000 qps
- TCP/IPv4: 3,000 qps (pipelined over up to 30 TCP conns/s)
- TCP/IPv6: 3,000 qps (pipelined over up to 30 TCP conns/s)
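
To put the requested load in perspective, here is a minimal sketch that adds up the rates listed above and estimates how many queries a single run would send. The 5-minute duration is taken from the test description earlier in this report; the names used (REQUESTED_QPS, TEST_DURATION_SECONDS and the two helpers) are made up for illustration and are not part of any test script.

```python
# Requested query rates (qps) per L3/L4 protocol tuple, as listed above.
REQUESTED_QPS = {
    ("UDP", "IPv4"): 10_000,
    ("UDP", "IPv6"): 10_000,
    ("TCP", "IPv4"): 3_000,
    ("TCP", "IPv6"): 3_000,
}

# ASSUMPTION: a 5-minute run, as used for the initial experiments.
TEST_DURATION_SECONDS = 5 * 60


def total_requested_qps(rates):
    """Sum the requested rates over all protocol tuples."""
    return sum(rates.values())


def total_requested_queries(rates, duration_seconds):
    """Upper bound on queries sent if every requested query goes out."""
    return total_requested_qps(rates) * duration_seconds


print(total_requested_qps(REQUESTED_QPS))
print(total_requested_queries(REQUESTED_QPS, TEST_DURATION_SECONDS))
```

This is only an upper bound: a traffic generator may send fewer queries than requested when the server falls behind.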
During the "stress" test, each Flamethrower instance logs the min/avg/max response times for the queries it sent within each one-second period. I plotted the average response times that Flamethrower logged in those lines for each test run. All plots have the same scale for easier comparison. Here are the results:
And here are the same plots cut off at 200ms on the Y axis:
From these plots, it looks like the problem was still relatively benign in BIND 9.16.0 but got aggravated in BIND 9.16.1 and then became even worse in BIND 9.16.2.
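
For anyone wanting to redo the plots, the per-second averages could be pulled out of the Flamethrower logs along these lines. This is a rough sketch, not the script I used: the log line layout matched by the regular expression is an assumption and may need adjusting to the actual Flamethrower output, and the sample lines are invented purely to exercise the parser.

```python
import re

# ASSUMPTION: each per-second log line contains a fragment shaped like
# "min/avg/max resp: <min>/<avg>/<max>ms". Adjust the pattern if the
# real Flamethrower output differs.
RESP_RE = re.compile(r"min/avg/max resp: ([\d.]+)/([\d.]+)/([\d.]+)ms")


def average_response_times(lines):
    """Return the average response time (ms) from every matching line."""
    averages = []
    for line in lines:
        match = RESP_RE.search(line)
        if match:
            averages.append(float(match.group(2)))
    return averages


# Invented sample lines, not real measurements.
sample_log = [
    "1.0s: send: 10000, recv: 9990, min/avg/max resp: 0.2/1.5/12.0ms",
    "2.0s: send: 10000, recv: 9700, min/avg/max resp: 0.3/180.4/950.0ms",
    "a line without response time statistics",
]

averages = average_response_times(sample_log)
print(averages)
```

The resulting list holds one value per one-second period and can be fed to any plotting tool.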
During these tests, recursive named instances generated a lot more
SERVFAIL responses for pthread builds than for ISC builds:
gitlab-ci-03-9.16.0-pthread/ns3/named.stats: 6,797,809 NOERROR
gitlab-ci-03-9.16.0-pthread/ns3/named.stats: 824,498 SERVFAIL
gitlab-ci-03-9.16.0-isc/ns3/named.stats: 7,679,453 NOERROR
gitlab-ci-03-9.16.0-isc/ns3/named.stats: 12,105 SERVFAIL
gitlab-ci-03-9.16.1-pthread/ns3/named.stats: 6,375,052 NOERROR
gitlab-ci-03-9.16.1-pthread/ns3/named.stats: 1,089,098 SERVFAIL
gitlab-ci-03-9.16.1-isc/ns3/named.stats: 7,675,143 NOERROR
gitlab-ci-03-9.16.1-isc/ns3/named.stats: 13,650 SERVFAIL
gitlab-ci-03-9.16.2-pthread/ns3/named.stats: 6,264,550 NOERROR
gitlab-ci-03-9.16.2-pthread/ns3/named.stats: 1,102,360 SERVFAIL
gitlab-ci-03-9.16.2-isc/ns3/named.stats: 7,681,557 NOERROR
gitlab-ci-03-9.16.2-isc/ns3/named.stats: 9,032 SERVFAIL
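
The raw counts are easier to compare as ratios. Below is a small sketch that computes the SERVFAIL share for each build from the numbers above. The share is defined here as SERVFAIL / (NOERROR + SERVFAIL), which ignores any other response codes; that definition and the names COUNTS and servfail_share are my own choices for illustration.

```python
# (NOERROR, SERVFAIL) counts copied from the named.stats excerpt above.
COUNTS = {
    "9.16.0-pthread": (6_797_809, 824_498),
    "9.16.0-isc": (7_679_453, 12_105),
    "9.16.1-pthread": (6_375_052, 1_089_098),
    "9.16.1-isc": (7_675_143, 13_650),
    "9.16.2-pthread": (6_264_550, 1_102_360),
    "9.16.2-isc": (7_681_557, 9_032),
}


def servfail_share(noerror, servfail):
    """SERVFAIL as a fraction of NOERROR + SERVFAIL (other rcodes ignored)."""
    return servfail / (noerror + servfail)


for build, (noerror, servfail) in COUNTS.items():
    print(f"{build:16} {servfail_share(noerror, servfail):7.2%}")
```

Under this definition, the SERVFAIL share is roughly 11% to 15% for the pthread builds and below 0.2% for the ISC builds.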
I checked Perflab data for the cold cache scenario and it seems a lot of
SERVFAIL responses are also generated there (though I admit I have not
looked closely at how the Perflab cold cache test looks!). Here are
some results from the last dnsgen runs for a few Perflab tests:
- NOERROR: 784,492
- SERVFAIL: 2,606,729
- NXDOMAIN: 53,152
- NOERROR: 1,839,048
- SERVFAIL: 14,607,259
- NXDOMAIN: 135,103
- NOERROR: 1,824,782
- SERVFAIL: 14,950,392
- NXDOMAIN: 136,069
Given that Perflab tests run on multiple physical machines while all processes in the "stress" test are confined to a single host, this looks like more than just a testing glitch, and I believe it should be investigated further.
I may have been looking at this for too long, so feel free to point out any mistakes in my reasoning and/or testing methodology.
I have not yet written nice scripts that someone else could easily use to reproduce the graphs and logs presented above, but I can surely work on that if it would help.