dnssec-policy unexpectedly breaks a DNSSEC-signed zone by creating new keys if it can't access the current keyset
As noted in Support ticket #18133 it's possible to quite thoroughly 'break' a DNSSEC-signed zone if you're using dnssec-policy and the private keys are inaccessible when named starts up (and I would not be surprised if this also happens when doing a reconfig or a reload zone).
This could be quite serious if the same thing happened because of a temporary glitch with access to keys stored in an HSM. In this case the keys were inaccessible, but that on its own could have been remedied.
The creation of new keys and their addition to the zone, combined with the inability to refresh the DNSKEY RRset RRSIGs from the old KSK (which was the only key with a DS RR in the parent zone), was catastrophic for DNSSEC validation of this zone.
Here's what happened.
- The zone was signed originally using dnssec-policy which generated the initial keys:
01-Feb-2021 16:53:35.845 zoneload: info: managed-keys-zone: loaded serial 0
01-Feb-2021 16:53:35.845 zoneload: info: zone example.com/IN: loaded serial 1000
01-Feb-2021 16:53:35.845 general: notice: all zones loaded
01-Feb-2021 16:53:35.845 general: notice: running
01-Feb-2021 16:53:35.845 notify: info: zone example.com/IN: sending notifies (serial 1000)
01-Feb-2021 16:53:35.846 dnssec: info: zone example.com/IN: reconfiguring zone keys
01-Feb-2021 16:53:35.846 dnssec: info: keymgr: DNSKEY example.com/ECDSAP256SHA256/57955 (KSK) created for policy example_DNSSEC_DEFAULT
01-Feb-2021 16:53:35.846 dnssec: info: keymgr: DNSKEY example.com/ECDSAP256SHA256/32569 (ZSK) created for policy example_DNSSEC_DEFAULT
01-Feb-2021 16:53:35.847 dnssec: info: Fetching example.com/ECDSAP256SHA256/57955 (KSK) from key repository.
01-Feb-2021 16:53:35.847 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/57955 (KSK) is now published
01-Feb-2021 16:53:35.847 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/57955 (KSK) is now active
01-Feb-2021 16:53:35.847 dnssec: info: Fetching example.com/ECDSAP256SHA256/32569 (ZSK) from key repository.
01-Feb-2021 16:53:35.847 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/32569 (ZSK) is now published
01-Feb-2021 16:53:35.847 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/32569 (ZSK) is now active
01-Feb-2021 16:53:35.851 dnssec: info: zone example.com/IN: next key event: 03-Feb-2021 17:53:35.846
- All was good, with the DS in the parent zone and the signatures being refreshed automatically, until named was restarted with the keys inaccessible (some directory changes had been made whose ramifications were not fully understood, and the bad outcome was not anticipated). This is what was logged:
The zone is loaded and initiates the notifies:
18-Mar-2021 18:26:06.346 notify: info: zone example.com/IN: sending notifies (serial 1616023774)
But then runs into trouble because it can't access the DNSSEC keys (which is interesting, because I don't think this is how named used to behave when loading a zone with RRSIGs from keys it doesn't have):
18-Mar-2021 18:26:06.346 dnssec: info: zone example.com/IN: reconfiguring zone keys
18-Mar-2021 18:26:06.347 general: warning: dns_dnssec_keylistfromrdataset: error reading Kexample.com.+013+32569.private: file not found
^^^^ Oh dear
18-Mar-2021 18:26:06.347 general: warning: dns_dnssec_keylistfromrdataset: error reading Kexample.com.+013+57955.private: file not found
^^^^ Also oh dear...
And this is what dnssec-policy decided to do about it. This turned a small oops (RRSIG maintenance broken) into something far, far worse. We end up with:
- an updated DNSKEY RRset whose signature from the old KSK can't be refreshed (so fails validation).
- a covering signature for the new RRset from the new KSK (which also fails validation because it's not in the parent zone as a signed DS record).
- named starts replacing zone RR RRSIGs with ones created from the new ZSK as the old ones expire (which is also bad, because the ZSK wasn't pre-published for long enough for the operators to be sure that it has reached all caches).
- and even if the zone operators were able to fix the missing DS in the parent as well as refresh the RRSIG covering the DNSKEY set with the old KSK too - there are still issues with the new ZSK-signed RRSIGs because the key hasn't been 'known' for long enough.
18-Mar-2021 18:26:06.347 dnssec: info: keymgr: DNSKEY example.com/ECDSAP256SHA256/39037 (KSK) created for policy example_DNSSEC_DEFAULT
18-Mar-2021 18:26:06.348 dnssec: info: keymgr: DNSKEY example.com/ECDSAP256SHA256/27753 (ZSK) created for policy example_DNSSEC_DEFAULT
18-Mar-2021 18:26:06.354 dnssec: info: Fetching example.com/ECDSAP256SHA256/39037 (KSK) from key repository.
18-Mar-2021 18:26:06.354 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/39037 (KSK) is now published
18-Mar-2021 18:26:06.354 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/39037 (KSK) is now active
18-Mar-2021 18:26:06.354 dnssec: info: Fetching example.com/ECDSAP256SHA256/27753 (ZSK) from key repository.
18-Mar-2021 18:26:06.354 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/27753 (ZSK) is now published
18-Mar-2021 18:26:06.354 dnssec: info: DNSKEY example.com/ECDSAP256SHA256/27753 (ZSK) is now active
18-Mar-2021 18:26:06.355 general: warning: dns_dnssec_findzonekeys2: error reading Kexample.com.+013+32569.private: file not found
And the DNSKEY RRset now has four keys instead of two:
- Original KSK ID 57955
- New KSK ID 39037
- Original ZSK ID 32569
- New ZSK ID 27753
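As an operator-side safeguard, a pre-flight check before starting named could confirm that the private key files for the keys already in the zone are still where named will look for them. The following is a minimal sketch, not part of BIND: the function names and the `keys` list are hypothetical, and the file name pattern is taken from the `Kexample.com.+013+32569.private` names in the log messages above.

```python
from pathlib import Path


def key_file_stem(zone: str, algorithm: int, key_tag: int) -> str:
    """Build the K<zone>.+<alg>+<tag> stem seen in the log messages."""
    # Key file names use the fully qualified zone name (trailing dot).
    zone = zone if zone.endswith(".") else zone + "."
    return f"K{zone}+{algorithm:03d}+{key_tag:05d}"


def missing_private_keys(key_dir, zone, keys):
    """Return the (algorithm, key_tag) pairs with no .private file in key_dir.

    `keys` is a hypothetical list of (algorithm, key_tag) pairs that the
    operator expects to be present, e.g. taken from the zone's DNSKEY RRset.
    """
    key_dir = Path(key_dir)
    return [
        (alg, tag)
        for alg, tag in keys
        if not (key_dir / (key_file_stem(zone, alg, tag) + ".private")).is_file()
    ]
```

For the zone in this report, `missing_private_keys(key_dir, "example.com", [(13, 57955), (13, 32569)])` would have returned both pairs on 18-Mar-2021, which an init script could treat as a reason not to start named.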
This is a bad thing that dnssec-policy has done to the zone, in response to an 'oops' - one that is hard to get back from gracefully, as resolvers start to query and cache the zone content. Better would have been (IMHO) to have not rolled the keys and to have done something like:
- Log a lot of errors
- Maybe just not start named or not load this zone when starting?
- If doing a reconfig or reload, not load the zone (and also log a lot of errors)
- If this scenario is encountered dynamically, while named is running (I'm imagining e.g. an HSM that has become unavailable unexpectedly, or the file system holding the private key files being unmounted), then also log a lot of errors.
I'm not sure if unloading the zone would be a good response or not (discuss?)
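The behaviour suggested in the list above can be sketched as a small decision function. This only illustrates the proposal and is not how named's keymgr is implemented. The function name, the action labels, and the idea of comparing key tags are all assumptions made for the example.

```python
def plan_key_action(published_tags, usable_tags):
    """Decide what to do with a zone's keys (hypothetical sketch).

    published_tags: key tags present in the zone's DNSKEY RRset.
    usable_tags:    key tags whose private keys can currently be read.
    Returns (action, missing_tags).
    """
    published = set(published_tags)
    usable = set(usable_tags)
    missing = sorted(published - usable)

    if not published:
        # Zone has no DNSKEYs yet: initial key generation is safe.
        return ("generate", [])
    if missing:
        # Zone is already signed but some private keys are unreachable:
        # log errors and do not create replacement keys.
        return ("refuse", missing)
    # All published keys are usable: carry on with normal maintenance.
    return ("maintain", [])
```

With the key tags from this report, `plan_key_action([57955, 32569], [])` gives `("refuse", [32569, 57955])`, whereas the observed behaviour corresponded to the "generate" path. Zone-migration cases, where the private keys of a previous host are legitimately absent, would need a separate explicit path that this sketch does not cover.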
I think that removing the old RRSIGs and NSEC(3) chains from the zone would also be bad (I'm thinking about what used to happen before dnssec-policy when adopting a zone that has RRSIGs from the previous zone host, where we don't have the private key). But we should maybe also think about some of these other zone migration cases too, not just about accidental issues.