Better data collection from assertions.
Description
Not so much a "feature request" as an observation...
In a recent bind-users thread:
On 07-Mar-19 17:19, Mark Andrews wrote:
Running named from a nanny program that will restart it is useful. Some OS’s come with such programs already installed. e.g. launchd and Windows Services manager.
Observation
It occurred to me that `named` could self-nanny, which would enable better data collection from assertion failures, with less pain for users and developers alike.
A rough outline:
- On assertion failure, create a dump by spawning a sub-process to run a dump tool (`gcore`, `ProcDump`) against the `named` process. This produces a full process dump (with threads). If it fails (no `gcore`, disk full, etc), fall back to the current behavior.
- Once the dump has been taken, if `named` uptime >= (reasonable threshold), named re-execs itself (restating `argv`). (else, exit) The threshold is to prevent tight crash-dump-restart-crash loops. Might also check for minimum disk space.
There are some considerations that make this not quite trivial:
- If `named` has dropped privileges, the restarted copy needs to re-acquire them.
- If `named` is `chroot`ed, `named` itself needs to be accessible.
These might be handled by having named `exec` a script with (only) the privileges necessary to cold-start itself.
- Any existing nanny (e.g. `systemd`) might need to be educated about the change of PID.
- If you can't get a thread-aware dump (e.g. no `gcore`), you could fall back to the `fork()` & `abort()` scheme - this loses threads, but may produce a useful dump. Or `clone()`, which gets the threads, but is Linux specific and would trigger demands for other OS-specific mechanisms. I'd not want to add "too much" complexity to dump & restart, however.
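For reference, the `fork()` & `abort()` fallback mentioned above might look something like this. It is a minimal sketch under the usual POSIX assumptions; `fork_and_abort` is a hypothetical name, and where (or whether) the core file gets written depends on the system's core-dump settings.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a copy of the current process and have the copy abort().
 * Only the calling thread exists in the child, so the core records
 * just that thread's registers and stack trace.
 * Returns the signal that terminated the child, or -1 on error.
 */
int
fork_and_abort(void) {
	int status;
	pid_t child = fork();

	if (child < 0) {
		return -1;
	}
	if (child == 0) {
		/* Use the default action even if the parent installed a handler. */
		signal(SIGABRT, SIG_DFL);
		abort();
	}
	if (waitpid(child, &status, 0) < 0) {
		return -1;
	}
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The parent keeps running after the call, so it can still log the event and then decide whether to restart or exit.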
The advantages would be:
- not necessary to install, configure, or rely on an external nanny (`launchd`, `systemd`, custom script, WSM)
- the latency of restart would be minimized
- automagically reduces severity of CVEs caused by assertion failures - to the extent that you can rely on self-restart.
- it would be fairly easy to allow `rndc` to enable (or disable) dumps - removing the requirement for distribution-specific instructions for troubleshooting.
- Likewise, one could trigger dumps via `rndc` when `named` is observed to be (allegedly) misbehaving - rather than expecting administrators to attach to `named` with a debugger and obtain stack/thread traces.
- It becomes possible to have a new type of assertion - "dump & continue", for cases where there is a clear and safe remedy for a logical inconsistency - but the developers would like to understand how it arose. (For example, when a deprecated code path is encountered.) These have been useful in debugging other complex software - including operating systems.
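A "dump & continue" assertion could be as small as a macro that reports the failure through a hook and lets the caller apply its remedy. The sketch below is an assumption about what such a thing might look like, not an existing BIND macro: `DUMP_AND_CONTINUE`, `soft_assert_hook` and `soft_assert_failures` are all hypothetical names, and the default hook only logs and counts where a real one would request the dump.

```c
#include <stdio.h>

/* Hypothetical hook type: called whenever a soft assertion fails. */
typedef void (*soft_assert_hook_t)(const char *file, int line,
				   const char *expr);

/* Number of soft assertion failures seen so far. */
int soft_assert_failures = 0;

/* Default hook: log and count. A real hook would also trigger a dump. */
static void
default_hook(const char *file, int line, const char *expr) {
	soft_assert_failures++;
	fprintf(stderr, "%s:%d: soft assertion failed: %s\n", file, line,
		expr);
}

soft_assert_hook_t soft_assert_hook = default_hook;

/*
 * Evaluate cond once. If it is false, call the hook and yield 0 so the
 * caller can take its remedy; otherwise yield 1. Execution continues
 * in both cases.
 */
#define DUMP_AND_CONTINUE(cond) \
	((cond) ? 1 : (soft_assert_hook(__FILE__, __LINE__, #cond), 0))
```

A caller on a deprecated code path would then write something like `if (!DUMP_AND_CONTINUE(entry != NULL)) { return; }`, where `entry` stands for whatever state the check protects.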
Once `named` is in control of its dumps, there could be some less obvious benefits - e.g. `named` could log its assertion failures, including the name and path of the `core` file. This would make event correlation and packaging for bug reports easier to automate.
I rather suspect that the code required is not much longer than this note :-) Well, maybe `autoconf` `--disable-self-nanny`, `--with-coredumper=` would stretch it a bit - I'm not an `autoconf` person.
The current nanny providers needn't fear for their job security - they can still scan log files, send e-mails, trigger pagers, and deal with pathological failures to restart...