Revisit BIND features to prevent the scenario where a second BIND instance is running accidentally
This came out of a discussion with Engineering rather than a specific customer issue. However, the Support team regularly encounters sites with multiple instances of named running unintentionally, so I'm tagging this one as 'Customer'.
With newer BIND using SO_REUSEPORT (reuseport yes;) there is no longer anything to stop multiple running instances of named from listening on the same sockets. The kernel will distribute incoming queries among the listening threads/processes according to however the kernel and NICs implement and support rx-flow-hash.
This could be seen as a 'feature' by some. But for others, it allows multiple instances of named to be launched accidentally, followed by much confusion and pain troubleshooting the ensuing problems, for example when the intent was to restart named with a different version or a different configuration. Different instances fielding different queries could produce different outcomes!
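To make the socket behaviour concrete, here is a minimal Python sketch (not BIND code) of what a "can I listen on this TCP address?" check sees with and without SO_REUSEPORT. The function name `can_listen_twice` is hypothetical, and the sketch assumes a platform that exposes `socket.SO_REUSEPORT` (Linux and the BSDs do); exact semantics differ between kernels.

```python
import socket


def can_listen_twice(reuseport: bool) -> bool:
    """Return True if two TCP sockets can listen on the same address and port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuseport:
            # Both sockets must set the option before bind().
            first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        first.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            # Stand-in for a second named instance starting up.
            second.bind(("127.0.0.1", port))
            second.listen()
        except OSError:
            return False  # typically EADDRINUSE: the listen check would fire
        return True
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("without SO_REUSEPORT:", can_listen_twice(False))
    print("with SO_REUSEPORT:   ", can_listen_twice(True))
```

On Linux I would expect the second listener to be refused without the option and accepted with it, which is why a listen-based check for an already-running instance stops being useful once reuseport is on.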
This new behaviour (mostly) negates this change, introduced in BIND 9.11.0:
- [func] Stop multiple spawns of named by limiting number of processes to 1. This is done by using a lockfile and checking whether we can listen on any configured TCP interfaces. [RT #37908]
Significantly, with the introduction of reuseport the TCP listen check no longer works, and lock-file is not enabled by default.
lock-file
This is the pathname of a file on which named attempts to acquire a file lock when starting for the first time; if
unsuccessful, the server terminates, under the assumption that another server is already running. If not specified,
the default is none.
Specifying lock-file none disables the use of a lock file. lock-file is ignored if named was run using the
-X option, which overrides it. Changes to lock-file are ignored if named is being reloaded or reconfigured;
it is only effective when the server is first started.
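For illustration, the singleton behaviour described in the lock-file text above can be sketched in Python roughly as follows. This is not named's actual implementation: the helper name `acquire_singleton_lock`, the use of `flock()`, and the temporary lock path are all assumptions made for the example.

```python
import fcntl
import os
import tempfile


def acquire_singleton_lock(path: str):
    """Try to take an exclusive, non-blocking lock on path.

    Returns the open file descriptor on success (keep it open for the
    lifetime of the process), or None if another holder has the lock.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Lock is held elsewhere: a real server would log and terminate here.
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    # Hypothetical lock path in a scratch directory, for demonstration only.
    lock_path = os.path.join(tempfile.mkdtemp(), "named.lock")
    first = acquire_singleton_lock(lock_path)
    second = acquire_singleton_lock(lock_path)
    print("first instance got lock:", first is not None)
    print("second instance got lock:", second is not None)
    if first is not None:
        os.close(first)
```

Unlike the TCP listen check, a file lock of this kind does not depend on socket options, so it keeps working when reuseport is enabled. It only helps, though, if lock-file is actually configured.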
This is distinct from pid-file, whose purpose is to identify the pid of the running named instance, so that signals can be sent to it:
pid-file
This is the pathname of the file the server writes its process ID in. If not specified, the default is
/var/run/named/named.pid. The PID file is used by programs that send signals to the running name server. Specifying
pid-file none disables the use of a PID file; no file is written and any existing one is removed. Note that
none is a keyword, not a filename, and therefore is not enclosed in double quotes.
What to do? I'm not sure. I think this is a 'gotcha' rather than a bug at this point, but it's subtle enough to derail the unwary who aren't aware that it could happen. Should we perhaps change the default for lock-file, so that a lock file is used unless it is explicitly disabled?