small updates to large catalog zones cause CPU consumption spikes
Summary
A single RR modification to a catalog zone (CATZ) causes a CPU spike.
Here is a latency comparison for a 1-minute test run at 100 k QPS (about 1/4 of capacity) while removing and then adding one zone, with 20 seconds in between. The green line shows the same configuration without the addition/deletion:
Timeout = 10 seconds, axes are log-log, Y in usec.
Test environment:
- A beefy AWS VM type - I've used c5n.4xlarge, 16 CPU threads.
- One CATZ with 100 k small zones.
Result: One update -> one CPU core is blocked for 10 seconds.
BIND version used
- Affects v9.18: v9_18_10
Steps to reproduce
- Generate a large CATZ and configure the primary. Here is a script to do that: genconf.py + empty.db
- Configure the secondary to pull the catalog and zones:
options {
    catalog-zones {
        zone "catalog.invalid." min-update-interval 1 default-primaries { 2600::1; };
    };
};
zone "catalog.invalid." {
    primaries { 2600::1; };
    type secondary;
};
- Do one RR modification at a time to the CATZ, adding or deleting a single zone each time. Script to do that: gencatupd.py. Usage:
while true; do python gencatupd.py | nsupdate; done
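For readers without the attachment, here is a rough sketch of what a single-RR catalog update fed to nsupdate could look like. This is not the attached gencatupd.py: the function names, the TTL, and the choice of a hex SHA-1 of the zone name as the unique member label are my assumptions for illustration. Only the catalog name and primary address come from the configuration above.

```python
import hashlib

CATALOG = "catalog.invalid."  # catalog zone name from the configuration above
PRIMARY = "2600::1"           # primary address from the configuration above


def member_label(zone):
    """Unique label for a member zone (assumption: hex SHA-1 of the lowercased name)."""
    return hashlib.sha1(zone.lower().encode("ascii")).hexdigest()


def catalog_update(zone, action, ttl=3600):
    """Return nsupdate commands that add or delete exactly one member-zone PTR record."""
    owner = f"{member_label(zone)}.zones.{CATALOG}"
    if action == "add":
        change = f"update add {owner} {ttl} IN PTR {zone}"
    elif action == "delete":
        change = f"update delete {owner} PTR"
    else:
        raise ValueError(f"unknown action: {action!r}")
    # One 'send' per invocation, so each run produces one small catalog change.
    return "\n".join([f"server {PRIMARY}", f"zone {CATALOG}", change, "send", ""])


if __name__ == "__main__":
    # Emit a single add; pipe the output into nsupdate.
    print(catalog_update("z100000.test.", "add"), end="")
```

Alternating between "add" and "delete" for the same zone should be enough to trigger one catalog transfer per run.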
What is the current bug behavior?
2022-12-14T11:59:02.929Z general: zone catalog.invalid/IN: notify from 2600:1f18:634c:d17e::da5d#42256: serial 2670950588
2022-12-14T11:59:02.929Z xfer-in: zone catalog.invalid/IN: Transfer started.
2022-12-14T11:59:02.929Z xfer-in: transfer of 'catalog.invalid/IN' from 2600:1f18:634c:d17e::da5d#53: connected using 2600:1f18:634c:d17e::da5d#53
2022-12-14T11:59:02.929Z xfer-in: zone catalog.invalid/IN: transferred serial 2670950588
2022-12-14T11:59:02.929Z general: catz: updating catalog zone 'catalog.invalid' with serial -1624016708
2022-12-14T11:59:02.929Z xfer-in: transfer of 'catalog.invalid/IN' from 2600:1f18:634c:d17e::da5d#53: Transfer status: success
2022-12-14T11:59:02.929Z xfer-in: transfer of 'catalog.invalid/IN' from 2600:1f18:634c:d17e::da5d#53: Transfer completed: 1 messages, 5 records, 223 bytes, 0.001 secs (223000 bytes/sec) (serial 2670950588)
<- CPU spins here for 10 sec
2022-12-14T11:59:12.449Z general: catz: adding zone 'z100000.test' from catalog 'catalog.invalid' - success
2022-12-14T11:59:12.449Z xfer-in: zone z100000.test/IN: Transfer started.
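To put a number on the stall, the gap between the last transfer line and the first "catz: adding zone" line can be computed from the timestamps. A small sketch, assuming the log format shown above; the helper name is made up:

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # timestamp format used in the log above


def stall_seconds(line_before, line_after):
    """Seconds elapsed between two log lines, based on their leading timestamps."""
    t0 = datetime.strptime(line_before.split()[0], LOG_TS_FORMAT)
    t1 = datetime.strptime(line_after.split()[0], LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    before = "2022-12-14T11:59:02.929Z xfer-in: transfer of 'catalog.invalid/IN'"
    after = "2022-12-14T11:59:12.449Z general: catz: adding zone 'z100000.test'"
    print(f"{stall_seconds(before, after):.3f} s")
```

For the two timestamps above this gives roughly 9.5 seconds, which matches the "10 sec" note.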
Relevant logs and/or screenshots
Here is a CPU profile taken while the CPU is spinning madly:
Possible fixes
I suppose one improvement could be to reuse IXFR change sets somehow, when they are available.
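To illustrate the idea only: below is a toy model, not BIND code. It assumes the cost comes from comparing the whole old and new catalogs on every update, which the profile would have to confirm. The function names and the dict-based catalog representation are invented for the sketch.

```python
def full_rescan(old, new):
    """Diff two complete catalogs; work grows with the total number of member zones."""
    added = {label: zone for label, zone in new.items() if old.get(label) != zone}
    removed = {label: zone for label, zone in old.items() if new.get(label) != zone}
    return added, removed


def from_changeset(deleted_rrs, added_rrs):
    """Derive the same result from an IXFR-style change set; work grows with the change size."""
    added = dict(added_rrs)
    removed = dict(deleted_rrs)
    # A record deleted and re-added unchanged within one change set is a no-op.
    for label in set(added) & set(removed):
        if added[label] == removed[label]:
            del added[label], removed[label]
    return added, removed


if __name__ == "__main__":
    old = {f"m{i}": f"z{i}.test." for i in range(100_000)}
    new = dict(old)
    new["m100000"] = "z100000.test."
    # Both approaches agree, but the second never touches the 100 k unchanged entries.
    assert full_rescan(old, new) == from_changeset([], [("m100000", "z100000.test.")])
```

With a change set of one or two records, the second approach does a constant amount of work regardless of catalog size.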
Edited by Petr Špaček