Re-evaluate BIND performance with pthread rwlocks

BIND 9.16.1+ uses pthread rwlocks by default, with a build-time option (--disable-pthread-rwlock) to use the ISC rwlock implementation instead. The decision to enable use of pthread rwlocks by default was based on testing various deployment scenarios in Perflab.

Every BIND release goes through a "stress" test before the BIND QA team signs off on it. This test starts three named instances ("root", authoritative, recursive) on the same host and uses a few Flamethrower instances (also started on the same host) to query the recursive server for names in a domain served by the authoritative server at a specified rate.

Historically, this test tracked only memory use over time, since its sole purpose was detecting memory leaks. Recently, however, we also started looking at other statistics generated during this test.

While testing BIND 9.16.2 on his development machine (6 cores + HT), @mnowak noticed that the number of incoming queries received by the recursive server during the "stress" test was unexpectedly low. Flamethrower statistics showed average response times in the range of hundreds of milliseconds for all combinations of layer 3 (IPv4, IPv6) and layer 4 (UDP, TCP) protocols.

I tried reproducing this phenomenon on my laptop (2 cores + HT) to no avail, so we initially assumed this was some local problem on @mnowak's machine.

@mnowak then tested older versions: the problem persisted with BIND 9.16.1¹ but was absent from BIND 9.16.0. He ran git bisect between BIND 9.16.0 and BIND 9.16.1 and found that the commit which caused BIND performance on his machine to plummet was 16cedf6e ("Use pthread rwlocks by default"). He then retested BIND 9.16.2 built with --disable-pthread-rwlock and the problem was no longer there.

I managed to reproduce this behavior on gitlab-ci-03.lab.isc.org (4 cores + HT). I did a few 5-minute "stress" tests (the problem seems to be triggered immediately, not after some time, so 5 minutes seem good enough for initial experiments) with:

- BIND 9.16.0
- BIND 9.16.1
- BIND 9.16.2

For each release, I tested 2 builds: one using --enable-pthread-rwlock and one using --disable-pthread-rwlock.

The requested query rates for each L3/L4 protocol tuple were:

- UDP/IPv4: 10,000 qps
- UDP/IPv6: 10,000 qps
- TCP/IPv4: 3,000 qps (pipelined over up to 30 TCP conns/s)
- TCP/IPv6: 3,000 qps (pipelined over up to 30 TCP conns/s)

During the "stress" test, each Flamethrower instance logs the min/avg/max response times for the queries it sent within each one-second period. I plotted the average response times from those log lines for each test run. All plots have the same scale for easier comparison. Here are the results²:

And here are the same plots cut off at 200ms on the Y axis:

From these plots, it looks like the problem was still relatively benign in BIND 9.16.0 but got aggravated in BIND 9.16.1 and then became even worse in BIND 9.16.2.
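For anyone who wants to redo the plots, here is a minimal sketch of how the per-second averages could be pulled out of the Flamethrower logs. The log line format matched by the regular expression is an assumption on my side, and the sample lines are made up for illustration; adjust the pattern to whatever your Flamethrower version actually prints. The `clip` helper only mirrors the 200 ms cut-off used for the second set of plots.

```python
import re

# ASSUMPTION: each Flamethrower summary line contains a fragment like
#   "min/avg/max resp: 0.4/1.5/12.0ms"
# This format is not confirmed by this report; adapt the pattern as needed.
RESP_RE = re.compile(r"min/avg/max resp: ([\d.]+)/([\d.]+)/([\d.]+)ms")


def average_response_times(lines):
    """Return the average response time (in ms) from every matching line."""
    averages = []
    for line in lines:
        match = RESP_RE.search(line)
        if match:
            # Group 2 is the "avg" field of min/avg/max.
            averages.append(float(match.group(2)))
    return averages


def clip(values, ceiling_ms=200.0):
    """Cap values at ceiling_ms, mirroring a Y axis cut off at 200 ms."""
    return [min(value, ceiling_ms) for value in values]


# Made-up sample lines, not taken from a real test run.
sample = [
    "1.0s: send: 10000, recv: 9990, min/avg/max resp: 0.4/1.5/12.0ms, timeouts: 0",
    "2.0s: send: 10000, recv: 4000, min/avg/max resp: 0.5/350.2/900.0ms, timeouts: 3",
    "some unrelated line",
]
averages = average_response_times(sample)
print(averages)
print(clip(averages))
```

The resulting lists can be fed to any plotting tool; lines that do not match the assumed pattern are silently skipped.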

During these tests, recursive named instances generated a lot more SERVFAIL responses for pthread builds than for ISC builds:

gitlab-ci-03-9.16.0-pthread/ns3/named.stats:    6,797,809 NOERROR
gitlab-ci-03-9.16.0-pthread/ns3/named.stats:      824,498 SERVFAIL
gitlab-ci-03-9.16.0-isc/ns3/named.stats:        7,679,453 NOERROR
gitlab-ci-03-9.16.0-isc/ns3/named.stats:           12,105 SERVFAIL

gitlab-ci-03-9.16.1-pthread/ns3/named.stats:    6,375,052 NOERROR
gitlab-ci-03-9.16.1-pthread/ns3/named.stats:    1,089,098 SERVFAIL
gitlab-ci-03-9.16.1-isc/ns3/named.stats:        7,675,143 NOERROR
gitlab-ci-03-9.16.1-isc/ns3/named.stats:           13,650 SERVFAIL

gitlab-ci-03-9.16.2-pthread/ns3/named.stats:    6,264,550 NOERROR
gitlab-ci-03-9.16.2-pthread/ns3/named.stats:    1,102,360 SERVFAIL
gitlab-ci-03-9.16.2-isc/ns3/named.stats:        7,681,557 NOERROR
gitlab-ci-03-9.16.2-isc/ns3/named.stats:            9,032 SERVFAIL
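To make the difference easier to compare across builds, here is a small sketch that turns the counters above into a SERVFAIL share. It only considers the NOERROR and SERVFAIL counters quoted here, so other response codes are ignored; the build labels are shorthand for the directory names in the listing.

```python
# (NOERROR, SERVFAIL) counts copied from the named.stats excerpts above.
COUNTS = {
    "9.16.0-pthread": (6_797_809, 824_498),
    "9.16.0-isc": (7_679_453, 12_105),
    "9.16.1-pthread": (6_375_052, 1_089_098),
    "9.16.1-isc": (7_675_143, 13_650),
    "9.16.2-pthread": (6_264_550, 1_102_360),
    "9.16.2-isc": (7_681_557, 9_032),
}


def servfail_share(noerror, servfail):
    """Return SERVFAIL as a percentage of NOERROR + SERVFAIL responses."""
    return 100.0 * servfail / (noerror + servfail)


shares = {build: servfail_share(*counts) for build, counts in COUNTS.items()}
for build, share in shares.items():
    print(f"{build:16s} {share:6.2f}% SERVFAIL")
```

With these numbers, the pthread builds end up at roughly 11-15% SERVFAIL while the ISC builds stay below 0.2%.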

I checked Perflab data for the cold cache scenario and it seems a lot of SERVFAIL responses are also generated there (though I admit I have not looked closely at how the Perflab cold cache test works!). Here are some results from the last dnsgen runs for a few Perflab tests:

https://perflab.isc.org/#/run/test/5e47037df297b05d32ebcd6a/ (9.16.0)
- NOERROR: 784,492
- SERVFAIL: 2,606,729
- NXDOMAIN: 53,152
https://perflab.isc.org/#/run/test/5e6aeb91f297b05d3250c06b/ (9.16.1)
- NOERROR: 1,839,048
- SERVFAIL: 14,607,259
- NXDOMAIN: 135,103
https://perflab.isc.org/#/run/test/5e8f639bf297b05d326f7cd2/ (9.16.2)
- NOERROR: 1,824,782
- SERVFAIL: 14,950,392
- NXDOMAIN: 136,069

Given that Perflab tests run on multiple physical machines while all processes in the "stress" test are confined to a single host, this looks like something more than just a testing glitch and I believe it should be investigated further.

I may have been looking at this for too long, so feel free to point out any mistakes in my reasoning and/or testing methodology.

¹ We did not notice the problem while testing BIND 9.16.1 because we were too focused on investigating a different problem: a suspected memory leak.
² I have not yet written nice scripts that someone else could easily use to reproduce the graphs and logs presented above, but I can surely work on that if it would help.