lazy catalog zone configuration makes a server lame
I restarted my test server while another task was frequently repeating several queries; the server got stuck in a funny state where it would SERVFAIL these queries persistently.
tl;dr: I think this problem might affect normal authoritative servers that use catalog zones: there's a chance that lazy zone configuration might cause the server to return referrals rather than proper answers during startup, leading it to being unfairly marked lame by recursive servers for 10 minutes.
I reproduced this with vanilla 9.12.2 and a simplified configuration. The setup is:
- an auth view, containing:
- a catalog zone listing all our local zones
- a local root zone
- a rec view, containing:
- static-stub zone configurations for all the auth zones pointing at the auth view
If I make frequent queries for
cam.ac.uk while the server starts, it usually gets into this SERVFAIL state.
When a cam.ac.uk query happens while the server is starting (before catz configuration is complete), the rec view gets a referral from the root in the auth view. It then decides to mark the auth view as lame for cam.ac.uk, so everything is broken for 10 minutes (the
If there's no root zone, the rec view gets REFUSED; it SERVFAILs a few queries while the server is starting but it doesn't mark the auth view as lame - surprisingly, REFUSED is treated a lot more gently than an unexpected referral.
I can trigger the same problem with only the catalog zone, with no root zone, but in this case it is harder to lose the race. If I repeatedly query for www.cl.cam.ac.uk, and the server loads
cl.cam.ac.uk, then it can get a referral instead of the expected answer.
while sleep 0.05; do dig +tries=1 +timeout=1 @::1 www.cl.cam.ac.uk | grep status & done
I can't trigger the same effect with explicitly configured zones, i.e. if I replace the auth view configuration with
These attachments have the relevant config; I can provide access to a VM if you want to try this config out for yourselves.