# glue-cache scales very poorly on multi-CPU systems
## Summary

The glue cache scales very poorly and suffers from lock contention. On a 16-thread system with a delegation-heavy workload it eventually improves max QPS by about 1/3, but reaching that max QPS takes over 300 seconds of query load.
## BIND version used

- Affects v9.18: v9.18.14
- Other versions were not tested, but are assumed to be affected
## Steps to reproduce

- Configure a delegation-heavy zone, e.g. SE from https://zonedata.iis.se/
- Issue queries which hit delegations, preferably unique: querydb.xz
## What is the current bug behavior?

Initially QPS is very low, and adding CPUs does not improve performance. Gradually BIND builds the glue cache and overall QPS improves.
## What is the expected correct behavior?

- Initial QPS should not be that low.
- Adding CPUs should improve performance, including initially.
## Workaround

```
options {
	glue-cache no;
};
```

This provides more predictable performance but incurs a ~1/3 performance hit (in terms of max QPS).
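For context, a minimal `named.conf` sketch with the workaround applied and a zone matching the reproduction steps; the zone file path is a placeholder, not from this report:

```
options {
	glue-cache no;  // workaround: disable the glue cache entirely
};

zone "se" {
	type primary;
	file "/var/named/se.zone";  // placeholder path for the downloaded SE zone
};
```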
## Benchmarks

- 16-thread machine in AWS, type c5n.4xlarge
- BIND v9.18.4 with glue-cache on / off
- SE zone serial 2021122008
- client kxdpgun: `kxdpgun -t 5 -Q $QPS -i querydb 10.10.126.46 -p 5300`
- 5-second tests; QPS in the table below is the average
- Individual lines in the table are successive tests
- Each step starts with the same query set (so successive tests repeat some of the queries)
- QPS step is +50k QPS
- QPS is incremented only if the response rate was >= 99 %
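The stepping procedure above can be sketched as a small driver loop. This is an illustrative sketch, not the tooling actually used: the hypothetical `run_benchmark(qps)` callback stands in for one 5-second kxdpgun run and returns its response rate.

```python
def ramp_up(run_benchmark, start_qps=50_000, step=50_000,
            threshold=0.99, max_attempts=20):
    """Drive the +50k QPS ramp described above.

    run_benchmark(qps) returns the response rate in [0.0, 1.0] for one
    5-second test at the given offered QPS.  A step is repeated until the
    rate reaches `threshold`; the ramp stops ("max reached") when a step
    fails to hit the threshold within `max_attempts` runs.
    Returns a list of (qps, rate) tuples, one per individual test.
    """
    results = []
    qps = start_qps
    while True:
        for _ in range(max_attempts):
            rate = run_benchmark(qps)
            results.append((qps, rate))
            if rate >= threshold:
                break  # step passed; move to the next QPS level
        else:
            return results  # max QPS reached at this step
        qps += step
```

For example, with a toy workload that sustains 99 % up to 100k QPS and collapses at 150k, the driver records two passing steps and then the failing attempts at 150k before stopping.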
You can see that the `glue-cache yes;` configuration requires significant warm-up time and eventually provides up to 1/3 higher max QPS than the configuration with `glue-cache no;`. The problem is the ridiculously long warm-up phase. The `glue-cache no` config hits max QPS right away, without any warm-up.
Raw data - each line is one 5-second benchmark:
| QPS (glue-cache yes) | Response rate | QPS (glue-cache no) | Response rate |
|---|---|---|---|
50000 | 77 % | 300000 | 99 % | |
50000 | 90 % | 350000 | 97 % | |
50000 | 79 % | 350000 | 96 % | |
50000 | 99 % | 350000 | 96 % | |
100000 | 69 % | 350000 | 97 % | |
100000 | 80 % | 350000 | max reached | |
100000 | 99 % | |||
150000 | 74 % | |||
150000 | 77 % | |||
150000 | 83 % | |||
150000 | 96 % | |||
150000 | 99 % | |||
200000 | 79 % | |||
200000 | 80 % | |||
200000 | 82 % | |||
200000 | 82 % | |||
200000 | 17 % | |||
200000 | 22 % | |||
200000 | 28 % | |||
200000 | 39 % | |||
200000 | 62 % | |||
200000 | 99 % | |||
250000 | 82 % | |||
250000 | 83 % | |||
250000 | 84 % | |||
250000 | 85 % | |||
250000 | 87 % | |||
250000 | 90 % | |||
250000 | 95 % | |||
250000 | 99 % | |||
300000 | 85 % | |||
300000 | 85 % | |||
300000 | 86 % | |||
300000 | 86 % | |||
300000 | 87 % | |||
300000 | 88 % | |||
300000 | 90 % | |||
300000 | 93 % | |||
300000 | 98 % | |||
300000 | 99 % | |||
350000 | 86 % | |||
350000 | 87 % | |||
350000 | 87 % | |||
350000 | 87 % | |||
350000 | 88 % | |||
350000 | 88 % | |||
350000 | 89 % | |||
350000 | 90 % | |||
350000 | 92 % | |||
350000 | 94 % | |||
350000 | 98 % | |||
350000 | 99 % | |||
400000 | 88 % | |||
400000 | 88 % | |||
400000 | 88 % | |||
400000 | 88 % | |||
400000 | 89 % | |||
400000 | 89 % | |||
400000 | 90 % | |||
400000 | 90 % | |||
400000 | 91 % | |||
400000 | 92 % | |||
400000 | 93 % | |||
400000 | 95 % | |||
400000 | 98 % | |||
400000 | 99 % | |||
450000 | 82 % | |||
450000 | 82 % | |||
450000 | 84 % | |||
450000 | 83 % | |||
450000 | 83 % | |||
450000 | 83 % | |||
450000 | 84 % | |||
450000 | 84 % | |||
450000 | 85 % | |||
450000 | 85 % | |||
450000 | 86 % | |||
450000 | 85 % | |||
450000 | 90 % | |||
450000 | 88 % | |||
450000 | 91 % | |||
450000 | 92 % | |||
450000 | max reached | | |
Flame chart with sleeper + waker threads generated by offwaketime.py:
(Sorry for missing stack frames, but you get the point.)
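Off-CPU flame data like this is typically captured as "folded stacks", one stack per line with a trailing blocked-time value (`frame1;frame2;... <value>`). As a sketch of how such a capture can be summarized without rendering a chart, the helper below picks out the heaviest stacks; the format assumption and sample frames are illustrative, not taken from the actual capture:

```python
from collections import defaultdict

def top_blocked_stacks(folded_lines, n=3):
    """Aggregate folded-stack lines ("frame;frame;... value") and return
    the n stacks with the largest total blocked time, descending."""
    totals = defaultdict(int)
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # the value is the last whitespace-separated token on the line
        stack, _, value = line.rpartition(" ")
        totals[stack] += int(value)
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]
```

On a capture like this one, a lock-contention problem would show up as a single mutex-wait stack dominating the totals.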