Better data collection from assertions.

Description

Not so much a "feature request" as an observation...

In a recent bind-users thread:

On 07-Mar-19 17:19, Mark Andrews wrote:

Running named from a nanny program that will restart it is useful. Some OS’s come with such programs already installed. e.g. launchd and Windows Services manager.

Observation

It occurred to me that named could self-nanny, which would enable better data collection from assertion failures, with less pain for users and developers alike.

A rough outline:

On assertion failure, create a dump by spawning a sub-process that runs a dumper (gcore, ProcDump) against the named process. This produces a full process dump (with threads). If it fails (no gcore, disk full, etc.), fall back to the current behavior.
Once the dump has been taken, if named's uptime >= (reasonable threshold), named re-execs itself with its original argv; otherwise it exits. The threshold is to prevent tight crash-dump-restart-crash loops. named might also check for minimum disk space before dumping.
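
A rough sketch of those two steps in C, to make the outline concrete. Everything named here is an assumption for illustration: the function names, the 60-second threshold and the dump prefix are made up, and the `-o prefix pid` argument order is the one accepted by the gcore script shipped with GDB (other platforms' gcore differ). It also assumes argv[0] is a path that execv() can still resolve.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical threshold: only restart if we ran at least this long. */
#define MIN_UPTIME_SECS 60

/* True if the process has been up long enough to risk a restart. */
bool should_restart(time_t started, time_t now) {
    return difftime(now, started) >= MIN_UPTIME_SECS;
}

/*
 * Spawn an external dumper against our own PID and wait for it.
 * Returns true only if the dumper ran and exited with status 0.
 */
bool take_dump(const char *dumper, const char *prefix) {
    char pidbuf[32];
    snprintf(pidbuf, sizeof(pidbuf), "%ld", (long)getpid());

    pid_t child = fork();
    if (child < 0)
        return false;
    if (child == 0) {
        /* Assumed argument order: <dumper> -o <prefix> <pid> */
        execlp(dumper, dumper, "-o", prefix, pidbuf, (char *)NULL);
        _exit(127); /* exec failed, e.g. no gcore installed */
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return false;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Hypothetical hook called from the assertion-failure path. */
void on_assertion_failure(char *const argv[], time_t started) {
    if (!take_dump("gcore", "/var/tmp/named-core"))
        abort(); /* fall back to the current behavior */
    if (should_restart(started, time(NULL)))
        execv(argv[0], argv); /* only returns on failure */
    _exit(1);
}
```

Keeping the threshold check in its own function makes it easy to add the minimum-disk-space test next to it later.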

There are some considerations that make this not quite trivial:

If named has dropped privileges, the restarted copy needs to re-acquire them.
If named is chrooted, named itself needs to be accessible.

These might be handled by having named exec a script with (only) the privileges necessary to cold-start itself.

Any existing nanny (e.g. systemd) might need to be educated about the change of PID.
If you can't get a thread-aware dump (e.g. no gcore), you could fall back to the fork() & abort() scheme. This loses threads, but may still produce a useful dump. Alternatively, clone() gets the threads, but is Linux-specific and would trigger demands for other OS-specific mechanisms. I'd not want to add "too much" complexity to dump & restart, however.
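
The fork() & abort() fallback could look roughly like this. It is a minimal sketch, and `dump_via_fork_abort` is a made-up name. Whether a core file actually gets written still depends on the usual system settings (core size limit, core pattern), which the sketch does not check; it only confirms that the child died from SIGABRT.

```c
#include <signal.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a copy of the current process and abort() in the child, so the
 * system can write a core for the copy while the parent carries on.
 * Only the calling thread exists in the child, so other threads' stacks
 * are lost. Returns true if the child was terminated by SIGABRT.
 */
bool dump_via_fork_abort(void) {
    pid_t child = fork();
    if (child < 0)
        return false;
    if (child == 0) {
        signal(SIGABRT, SIG_DFL); /* make sure no handler intercepts it */
        abort();
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```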

The advantages would be:

It is not necessary to install, configure, or rely on an external nanny (launchd, systemd, custom script, WSM).
The latency of restart would be minimized.
It automagically reduces the severity of CVEs caused by assertion failures - to the extent that you can rely on self-restart.
It would be fairly easy to allow rndc to enable (or disable) dumps - removing the requirement for distribution-specific instructions for troubleshooting.
Likewise, one could trigger dumps via rndc when named is observed to be (allegedly) misbehaving - rather than expecting administrators to attach to named with a debugger and obtain stack/thread traces.
It becomes possible to have a new type of assertion - "dump & continue", for cases where there is a clear and safe remedy for a logical inconsistency - but the developers would like to understand how it arose. (For example, when a deprecated code path is encountered.) These have been useful in debugging other complex software - including operating systems.
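
A "dump & continue" assertion might be shaped like the sketch below. `SOFT_INSIST`, `soft_assert_failed` and `clamp_ttl` are hypothetical names, not existing BIND macros or functions, and the hook only logs and counts; a real one would also request the dump at that point.

```c
#include <stdbool.h>
#include <stdio.h>

/* Number of soft assertion failures seen so far (illustration only). */
static unsigned long soft_assert_hits;

/* Hypothetical hook: report the failure, then let the caller continue. */
static void soft_assert_failed(const char *expr, const char *file, int line) {
    soft_assert_hits++;
    fprintf(stderr, "%s:%d: soft assertion failed: %s (continuing)\n",
            file, line, expr);
    /* A real implementation would trigger the process dump here. */
}

/*
 * Evaluates to true when cond holds. Otherwise it reports the failure
 * and evaluates to false, so the caller can apply its safe remedy.
 */
#define SOFT_INSIST(cond) \
    ((cond) ? true : (soft_assert_failed(#cond, __FILE__, __LINE__), false))

/* Usage: an inconsistency with a clear and safe remedy. */
int clamp_ttl(int ttl) {
    if (!SOFT_INSIST(ttl >= 0))
        ttl = 0; /* the remedy: treat a negative TTL as zero */
    return ttl;
}
```

Making the macro an expression rather than a statement lets the remedy sit right next to the check.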

Once named is in control of its dumps, there could be some less obvious benefits - e.g. named could log its assertion failures, including the name and path of the core file. This would make event correlation and packaging for bug reports easier to automate.

I rather suspect that the code required is not much longer than this note :-) Well, maybe autoconf options such as --disable-self-nanny and --with-coredumper= would stretch it a bit - I'm not an autoconf person.

The current nanny providers needn't fear for their job security - they can still scan log files, send e-mails, trigger pagers, and deal with pathological failures to restart...