Better data collection from assertions.
Not so much a "feature request" as an observation...
In a recent bind-users thread:
On 07-Mar-19 17:19, Mark Andrews wrote:
Running named from a nanny program that will restart it is useful. Some OS’s come with such programs already installed. e.g. launchd and Windows Services manager.
It occurred to me that `named` could self-nanny, which would enable better data collection from assertion failures, with less pain for users and developers alike.
A rough outline:
- On assertion failure, create a dump by spawning a sub-process to run `gcore` (or ProcDump) against the `named` process. This produces a full process dump (with threads). If it fails (no `gcore`, disk full, etc.), fall back to the current behavior.
- Once the dump has been taken, if `named` uptime >= (reasonable threshold), `named` re-execs itself (restating `argv`); otherwise it exits. The threshold is to prevent tight crash-dump-restart-crash loops. It might also check for minimum disk space.
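
The two steps above can be sketched roughly as follows. This is only an illustration, not BIND code: the function names, the 60-second threshold, and the `/usr/bin/gcore` and `/var/tmp/named-core` paths are all hypothetical, and the `-o prefix` option assumes the `gcore` script shipped with gdb.

```c
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical threshold: refuse to restart if we ran for less than this. */
#define MIN_UPTIME_SECONDS 60

/* Return 1 if the process has been up long enough to allow a self-restart. */
int
should_restart(time_t started, time_t now, time_t min_uptime) {
	return (now - started >= min_uptime);
}

/*
 * Spawn "<gcore_path> -o <prefix> <our pid>" and wait for it.
 * Returns 0 if the dumper exited successfully, -1 on any failure
 * (dumper missing, disk full, fork failure, ...).
 */
int
dump_self(const char *gcore_path, const char *prefix) {
	char pidstr[32];
	int status;
	pid_t child;

	snprintf(pidstr, sizeof(pidstr), "%ld", (long)getpid());
	child = fork();
	if (child < 0) {
		return -1;
	}
	if (child == 0) {
		/* Child: become the dumper, targeting the parent's PID. */
		execl(gcore_path, gcore_path, "-o", prefix, pidstr, (char *)NULL);
		_exit(127); /* exec failed, e.g. gcore is not installed */
	}
	if (waitpid(child, &status, 0) < 0) {
		return -1;
	}
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Hypothetical assertion-failure hook: dump, then restart or exit. */
void
on_assertion_failure(char **argv, time_t started) {
	if (dump_self("/usr/bin/gcore", "/var/tmp/named-core") != 0) {
		abort(); /* fall back to the current behavior */
	}
	if (should_restart(started, time(NULL), MIN_UPTIME_SECONDS)) {
		execv(argv[0], argv); /* only returns on failure */
	}
	_exit(EXIT_FAILURE);
}
```

On Linux systems that restrict `ptrace` (for example through Yama's `ptrace_scope`), a child process may not be allowed to attach to its parent without extra steps, so a real implementation would probably need more care here.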
There are some considerations that make this not quite trivial:
- If `named` has dropped privileges, the restarted copy needs to re-acquire them.
- `named` itself needs to be accessible.

These might be handled by having `named` exec a script with (only) the privileges necessary to cold-start itself.
- Any existing nanny (e.g. `systemd`) might need to be educated about the change of PID.
- If you can't get a thread-aware dump (e.g. no `gcore`), you could fall back to the `abort()` scheme - this loses threads, but may produce a useful dump. Or `clone()`, which gets the threads, but is Linux-specific and would trigger demands for other OS-specific mechanisms. I'd not want to add "too much" complexity to dump & restart, however.
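
One possible reading of the `abort()` fallback is to `fork()` and call `abort()` in the child, so the parent can carry on to the restart step. The sketch below assumes that reading; `fork_and_abort` is a hypothetical name. Since `fork()` duplicates only the calling thread, the child's core (if the system is configured to write one) would not contain the other threads, which matches the caveat above.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that immediately abort()s, leaving the parent running.
 * Returns the signal that terminated the child (SIGABRT is expected),
 * or -1 if the fork or wait failed or the child did not die by signal.
 */
int
fork_and_abort(void) {
	int status;
	pid_t child = fork();

	if (child < 0) {
		return -1;
	}
	if (child == 0) {
		abort(); /* child: only the forking thread exists here */
	}
	if (waitpid(child, &status, 0) < 0) {
		return -1;
	}
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Whether a core file is actually written still depends on the usual system settings (core size limit, core pattern), which this sketch does not touch.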
The advantages would be:
- not necessary to install, configure, or rely on an external nanny (`systemd`, custom script, WSM)
- the latency of restart would be minimized
- automagically reduces severity of CVEs caused by assertion failures - to the extent that you can rely on self-restart.
- it would be fairly easy to allow `rndc` to enable (or disable) dumps - removing the requirement for distribution-specific instructions for troubleshooting.
- Likewise, one could trigger dumps via `rndc` when `named` is observed to be (allegedly) misbehaving - rather than expecting administrators to attach to `named` with a debugger and obtain stack/thread traces.
- It becomes possible to have a new type of assertion - "dump & continue", for cases where there is a clear and safe remedy for a logical inconsistency - but the developers would like to understand how it arose. (For example, when a deprecated code path is encountered.) These have been useful in debugging other complex software - including operating systems.
Since `named` is in control of its dumps, there could be some less obvious benefits - e.g. `named` could log its assertion failures, including the name and path of the core file. This would make event correlation and packaging for bug reports easier to automate.
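
A "dump & continue" assertion might look something like the sketch below. Everything here is hypothetical: `SOFT_INSIST`, `soft_assert_failed`, and `clamp_ttl` are illustrative names rather than existing BIND macros or functions, and the hook only logs and counts where a real one would take the dump and log the core file's path.

```c
#include <stdio.h>

/* Number of soft assertion failures seen so far (illustrative only). */
unsigned int soft_failures = 0;

/*
 * Hypothetical hook called when a soft assertion fails. A real version
 * would request a process dump and log where the core file was written.
 */
void
soft_assert_failed(const char *file, int line, const char *cond) {
	soft_failures++;
	fprintf(stderr, "%s:%d: SOFT_INSIST(%s) failed; dump requested, continuing\n",
		file, line, cond);
}

/* Unlike a hard assertion, this reports the failure and keeps running. */
#define SOFT_INSIST(cond) \
	((void)((cond) || (soft_assert_failed(__FILE__, __LINE__, #cond), 0)))

/* Example caller: the inconsistency has a clear and safe remedy. */
int
clamp_ttl(int ttl) {
	SOFT_INSIST(ttl >= 0);
	return (ttl < 0) ? 0 : ttl;
}
```

The caller remains responsible for applying the safe remedy itself; the macro only records that the unexpected state was reached.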
I rather suspect that the code required is not much longer than this note :-) Well, maybe `autoconf` `--disable-self-nanny`, `--with-coredumper=` would stretch it a bit - I'm not an
The current nanny providers needn't fear for their job security - they can still scan log files, send e-mails, trigger pagers, and deal with pathological failures to restart...