[ISC-support #15905] rndc stop issued after server (or single zone) rndc reload and during ixfr-from-differences processing leaves the .jnl file corrupted

From Support ticket #15905

9.11.x (but I don't anticipate that the BIND version makes any difference)

This problem is readily reproducible (and, I suspect, occurs because "rndc stop" doesn't recognise that the zone is effectively 'dynamic' because it has been reloaded with 'ixfr-from-differences yes;').

Here's what is being done, per the server logs. First the reload (the outcome is also the same with 'reload zone'):

09-Jan-2020 14:50:29.157 general: info: received control channel command 'reload' ============>>> RELOAD COMMAND
09-Jan-2020 14:50:29.179 general: info: loading configuration from '/etc/named.conf'
... etcetera
09-Jan-2020 14:50:29.202 general: info: reloading configuration succeeded
09-Jan-2020 14:50:29.202 general: info: reloading zones succeeded
09-Jan-2020 14:50:29.202 general: notice: all zones loaded
09-Jan-2020 14:50:29.202 general: notice: running

Note that the reload has completed, as far as the logging is concerned, but, it would appear that the regeneration of the .jnl files via 'ixfr-from-differences yes;' has not (high CPU use by named - suggests that it is busy doing this).

Then the 'rndc stop' is issued - and it completes almost immediately (no evidence that it is waiting for the processing to complete), in fact, it seems to log that it has aborted a zone reload, even though the previous logging said that the reload had completed:

09-Jan-2020 14:50:33.211 general: info: received control channel command 'stop -p'
09-Jan-2020 14:50:33.212 general: info: shutting down: flushing changes

.. etcetera (just the logs of the various sockets being closed here)

09-Jan-2020 14:50:33.216 general: error: zone test.com/IN: loading from master file dynamic/test.com.zone failed: operation canceled
09-Jan-2020 14:50:33.216 general: error: zone test.com/IN: not loaded due to errors.
09-Jan-2020 14:50:34.265 general: notice: exiting

And then after this, restarting named - the zone can no longer be loaded - the journal file does not tally with the zone itself:

...
09-Jan-2020 14:51:28.141 general: error: zone test.com/IN: journal rollforward failed: journal out of sync with zone
09-Jan-2020 14:51:28.141 general: error: zone test.com/IN: not loaded due to errors.
09-Jan-2020 14:51:28.142 general: notice: all zones loaded
09-Jan-2020 14:51:28.142 general: notice: running

======

Something has gone badly wrong during the 'rndc stop' - which is supposed to be a graceful shutdown of named. I'm assuming that the problem is that the .jnl file itself is corrupt, rather than that something has happened to the zone file on disk - but will ask for more data to confirm this.

stop [-p]
Stop the server, making sure any recent changes made through dynamic update
or IXFR are first saved to the master files of the updated zones. If -p is
specified named’s process id is returned. This allows an external process
to determine when named had completed stopping.

Now, since ixfr-from-differences processing could take an age to complete, I don't think it's reasonable to wait forever on the rndc stop. Possibly one solution would be have a (configurable?) timeout, after which any pending ixfr-from-differences .jnl file generation is terminated and the incomplete .jnl file discarded/removed. After all, the administrator in this scenario has just presented named with a new zone file that it asked named to load - so we know that the full copy of the zone on disk should be good and valid - we just loaded from it.

====

The workaround if this happens, is presumably to manually discard the corrupted .jnl file and restart named.

Edited Mar 16, 2021 by Support RT

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information