When BIND is built with --with-tuning=large, we're setting RCVBUFSIZE far too big for most production servers
See Support ticket #16171 and also the KB article on --with-tuning=large for more about using this build-time option.
From BIND 9.16.0 we changed the default, so that --with-tuning=large is now what you get automatically when building BIND.
Per the KB article, the difference between the tunings is this:
- RCVBUFSIZE changed from 32 KB to 16 MB
Increasing RCVBUFSIZE (the receive buffer size) reduces dropped packets, but it may also hurt socket performance on some platforms: the Linux kernel allocates the receive buffer space when creating a socket, and an increase from 32 KB to 16 MB allocated per socket is potentially significant.
In lib/isc/unix/socket.c on master, it looks like this:
#ifdef TUNE_LARGE
#ifdef sun
#define RCVBUFSIZE (1 * 1024 * 1024)
#define SNDBUFSIZE (1 * 1024 * 1024)
#else /* ifdef sun */
#define RCVBUFSIZE (16 * 1024 * 1024)
#define SNDBUFSIZE (16 * 1024 * 1024)
#endif /* ifdef sun */
#else /* ifdef TUNE_LARGE */
#define RCVBUFSIZE (32 * 1024)
#define SNDBUFSIZE (32 * 1024)
#endif /* TUNE_LARGE */
(Although we did it slightly differently in the old socket code, the size increase remains the same).
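For context, a value like RCVBUFSIZE is applied per socket via setsockopt(SO_RCVBUF). The following is just a minimal stand-alone sketch of that mechanism (not the actual socket.c code); it also shows how to read back what the kernel actually granted - on Linux the kernel doubles the requested value for its own bookkeeping and caps it at net.core.rmem_max:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define RCVBUFSIZE (16 * 1024 * 1024) /* the TUNE_LARGE value */

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Ask the kernel for a 16 MB receive buffer on this socket. */
    int size = RCVBUFSIZE;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
    }

    /*
     * Read back what was actually granted; on Linux this is twice the
     * requested value, limited by net.core.rmem_max.
     */
    socklen_t len = sizeof(size);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) == 0) {
        printf("effective SO_RCVBUF: %d bytes\n", size);
    }

    close(fd);
    return 0;
}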
Let's do some sums on that. Assuming an average client query is 70 bytes, then without TUNE_LARGE our socket receive buffer of 32 KB can hold just under 470 client queries before it's full.
With TUNE_LARGE, our socket receive buffer of 16 MB can hold just under 240K queries.
More sums - let's suppose a server can handle a maximum of 50K qps. If queries come in faster than that, the backlog grows; once the buffer is full (with named reading and processing queries first in, first out), each query has been waiting in the buffer for just under 5s by the time it is handled.
That is hopeless - most clients give up and stop waiting for an answer in under 2s. If an overloaded server capable of 50K qps is to have any chance at all of giving some clients answers that they're still interested in, then it needs a much smaller RCVBUFSIZE: something that holds no more than 1s (or even less?) worth of client queries.
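To make these sums concrete, here's a tiny stand-alone program that reproduces them (the 70-byte average query and the 50K qps capacity are the same assumptions as above, not measurements):

#include <stdio.h>

int main(void) {
    const double query_bytes = 70.0;    /* assumed average query size */
    const double max_qps = 50000.0;     /* assumed server capacity */

    const double small_buf = 32.0 * 1024;          /* default: 32 KB */
    const double large_buf = 16.0 * 1024 * 1024;   /* TUNE_LARGE: 16 MB */

    /*
     * How many queries fit in each buffer, and how stale the oldest one
     * is by the time a server running flat out gets around to it.
     */
    printf("32 KB buffer: %.0f queries, %.2f s backlog\n",
           small_buf / query_bytes, small_buf / query_bytes / max_qps);
    printf("16 MB buffer: %.0f queries, %.2f s backlog\n",
           large_buf / query_bytes, large_buf / query_bytes / max_qps);

    /* Size of a buffer that holds no more than 1 second of queries. */
    printf("1 s of queries at 50K qps: %.1f MB\n",
           max_qps * query_bytes / (1024 * 1024));

    return 0;
}

At 50K qps and 70 bytes per query, a buffer holding roughly 1s worth of queries works out to about 3.4 MB - far smaller than the 16 MB TUNE_LARGE default.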
Now consider another scenario. A resolver is capable of handling 50K qps, and has been provisioned so that its normal load is half of that - 25K qps.
It gets distracted (perhaps it was processing a large inbound IXFR zone update), slows down, and a backlog builds up.
The clients start to time out and re-send their original queries - those go into the backlog queue too. They may even stop listening for a reply to their original queries, at which point any response from the resolver is going to be ignored.
Observationally, when this happens, the client query rate ramps up - with dual-stack clients that also retry over IPv6, we can see up to 6-7x the normal query rate. This is far higher than the 50K qps peak rate the server can handle.
The server recovers and starts handling the backlog. All the queries it is responding to are several seconds stale, so its responses fall on deaf ears (closed sockets). The clients continue to send queries at 6-7x the normal load, so the backlog can never be cleared.
The server has been rendered effectively useless because it can never recover.
This has already been demonstrated in ticket #16171, and is the reason why it's important not to run named with socket buffers that allow a backlog of more than 0-1s worth of queries to build up, assuming that the server is running at its maximum QPS.
I also suspect that it might be the reason why we've not been successful in replicating the total server hang-up in ticket #14339. Our test tools don't increase the client pressure with re-sends and retries when the server under test doesn't respond promptly.
Conclusion:
a) The default receive buffer size with --with-tuning=large is potentially too big for most production servers anyway, and we'd be better served by being more conservative
b) We need a knob, so that administrators (who know what they're doing, and why) can tune the receive buffer sizes appropriately (potentially per listening socket, for some obscure set-ups where the admins are technically sophisticated and capable?)
c) We need to provide some tuning advice to BIND consumers - in the ARM and/or in the KB.