rndc stop issued after server (or single zone) rndc reload and during ixfr-from-differences processing leaves the .jnl file corrupted
From Support ticket #15905
9.11.x (but I don't anticipate that the BIND version makes any difference)
This problem is readily reproducible (and, I suspect, occurs because "rndc stop" doesn't recognise that the zone is effectively 'dynamic' because it has been reloaded with 'ixfr-from-differences yes;').
Here's what is being done, per the server logs. First the reload (the outcome is also the same with 'reload zone'):
09-Jan-2020 14:50:29.157 general: info: received control channel command 'reload' ============>>> RELOAD COMMAND 09-Jan-2020 14:50:29.179 general: info: loading configuration from '/etc/named.conf' ... etcetera 09-Jan-2020 14:50:29.202 general: info: reloading configuration succeeded 09-Jan-2020 14:50:29.202 general: info: reloading zones succeeded 09-Jan-2020 14:50:29.202 general: notice: all zones loaded 09-Jan-2020 14:50:29.202 general: notice: running
Note that the reload has completed, as far as the logging is concerned, but, it would appear that the regeneration of the .jnl files via 'ixfr-from-differences yes;' has not (high CPU use by named - suggests that it is busy doing this).
Then the 'rndc stop' is issued - and it completes almost immediately (no evidence that it is waiting for the processing to complete), in fact, it seems to log that it has aborted a zone reload, even though the previous logging said that the reload had completed:
09-Jan-2020 14:50:33.211 general: info: received control channel command 'stop -p' 09-Jan-2020 14:50:33.212 general: info: shutting down: flushing changes .. etcetera (just the logs of the various sockets being closed here) 09-Jan-2020 14:50:33.216 general: error: zone test.com/IN: loading from master file dynamic/test.com.zone failed: operation canceled 09-Jan-2020 14:50:33.216 general: error: zone test.com/IN: not loaded due to errors. 09-Jan-2020 14:50:34.265 general: notice: exiting
And then after this, restarting named - the zone can no longer be loaded - the journal file does not tally with the zone itself:
... 09-Jan-2020 14:51:28.141 general: error: zone test.com/IN: journal rollforward failed: journal out of sync with zone 09-Jan-2020 14:51:28.141 general: error: zone test.com/IN: not loaded due to errors. 09-Jan-2020 14:51:28.142 general: notice: all zones loaded 09-Jan-2020 14:51:28.142 general: notice: running
Something has gone badly wrong during the 'rndc stop' - which is supposed to be a graceful shutdown of named. I'm assuming that the problem is that the .jnl file itself is corrupt, rather than that something has happened to the zone file on disk - but will ask for more data to confirm this.
stop [-p] Stop the server, making sure any recent changes made through dynamic update or IXFR are first saved to the master files of the updated zones. If -p is specified named’s process id is returned. This allows an external process to determine when named had completed stopping.
Now, since ixfr-from-differences processing could take an age to complete, I don't think it's reasonable to wait forever on the rndc stop. Possibly one solution would be have a (configurable?) timeout, after which any pending ixfr-from-differences .jnl file generation is terminated and the incomplete .jnl file discarded/removed. After all, the administrator in this scenario has just presented named with a new zone file that it asked named to load - so we know that the full copy of the zone on disk should be good and valid - we just loaded from it.
The workaround if this happens, is presumably to manually discard the corrupted .jnl file and restart named.