Signed version of an inline-signed zone may be dumped without unsigned serial number information

When the signed version of an inline-signed zone is dumped to disk, the serial number of the unsigned version of the zone is written in the raw-format header so that the contents of the signed zone can be resynchronized after named restart if the unsigned zone file is modified while named is not running (see RT #26676).

In order for the serial number of the unsigned zone to be determined during the dump, zone->raw must be set to a non-NULL value. This should always be the case as long as the signed version of the zone is used for anything by named.

However, a scenario exists in which the signed version of the zone has zone->raw set to NULL while it is being dumped:

Zone dump is requested; zone_dump() is invoked.
Another zone dump is already in progress, so the dump gets deferred until I/O is available (see zonemgr_getio()).
The last external reference to the zone is released. zone_shutdown() gets queued to the zone's task.
I/O becomes available for zone dumping. zone_gotwritehandle() gets queued to the zone's task.
The zone's task runs zone_shutdown(). zone->raw gets set to NULL.
The zone's task runs zone_gotwritehandle(). zone->raw is determined to be NULL, causing the serial number of the unsigned version of the zone to be omitted from the raw-format dump of the signed zone file.

I believe this issue became easier to trigger in BIND 9.12.0. That was the first BIND 9 release containing change 4613 (see RT #38324), specifically this hunk:

@@ -9773,7 +9822,7 @@ dns_zone_flush(dns_zone_t *zone) {
                dumping = ISC_TRUE;
        UNLOCK_ZONE(zone);
        if (!dumping)
-               result = zone_dump(zone, ISC_FALSE);    /* Unknown task. */
+               result = zone_dump(zone, ISC_TRUE);     /* Unknown task. */
        return (result);
 }

zone_dump() can either perform the zone->raw check itself or defer it until zone dump I/O becomes available. Before the above change, deferring the check was only possible if zone_dump() was called from zone_maintenance() (which itself is timer-based). The above change enables the zone->raw check to also be deferred when zone_dump() is called from dns_zone_flush(), i.e. essentially from anywhere, particularly from zone table cleanup callbacks which are run when the zone's reference count is likely to drop to zero, triggering zone_shutdown() and ultimately causing the bug above.

The above change was originally introduced in commit 980611a3fe3ececeb0049b9e7c2e380b577f5e68 without any detailed explanation. I am not entirely sure why, but the change seems to be necessary in order for some tests related to max-journal-size to pass. I ran out of time to determine why that is. Note, however, that zone_dump() warns against setting compact to true for non-task-locked call sites (see also the code comments next to zone_dump() invocations).

At any rate, I believe that the bug could be triggered even without the above change - when the zone's reference count drops to zero while zone_maintenance() is running. I have not confirmed that it is practically possible and I can certainly be missing some implicit protection against such a triggering scenario happening. It does not matter much anyway with BIND 9.11 reaching EoL soon and this not being a critical problem.

The only quick way to fix this issue that I see is to defer detaching from zone->raw in zone_shutdown() if the zone is in the process of being dumped to disk.

The problem is easily reproducible, though I need to find a clean way of turning it into a system test.

This problem was discovered in the process of attempting to fix an unrelated issue.