Repeatedly hitting max-cache-size leads to all-SERVFAIL answers
Summary
This needs proper investigation.
BIND version affected
- v9.16.49-to-be
- v9.16.45 - even worse, response rate goes down
Other versions were not tested, but I assume the same problem exists in other branches, too.
Steps to reproduce
Run resolver test pipeline with these settings:
- SHOTGUN_SCENARIO = udp
- SHOTGUN_TRAFFIC_MULTIPLIER = 10
- SHOTGUN_DURATION = 600
- CACHE_SIZE_MB = 64
This ridiculously overloads a resolver with a 64 MB cache by flooding it with 100 k QPS.
What is the current bug behavior?
Initially the SERVFAIL rate spikes (that's okay, probably the recursive clients limit) and then goes down, which is also expected. But then it goes up again to the point where the resolver only SERVFAILs (by the end of the tenth minute).
The second problem is that the response rate goes down from time to time. The resolver should not drop answers. But that might be an artifact of the measurement, which uses a 2-second timeout.
What is the expected correct behavior?
Well, I would expect a very roughly constant SERVFAIL rate.
Relevant configuration files
Auto-generated by the pipeline. Recursive-clients = probably 10 000.
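For anyone reproducing this outside the pipeline, the relevant settings probably correspond to something like the named.conf fragment below. This is only a sketch: the actual auto-generated file was not checked, the mapping from CACHE_SIZE_MB to max-cache-size is an assumption, and the recursive-clients value is the unconfirmed guess from above.

```
options {
    // Assumed equivalent of CACHE_SIZE_MB = 64 from the pipeline settings
    max-cache-size 64m;

    // Unconfirmed: the pipeline "probably" sets 10 000
    recursive-clients 10000;
};
```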
Relevant logs
https://gitlab.isc.org/isc-projects/bind9-shotgun-ci/-/pipelines/166603 (artifacts retained for a while)
Have a look at charts v9.1*/results-shotgun/charts/response-rate-rcodes.svg in individual jobs.