Skip to content

Prevent generating broken glueless referrals

I think this MR really needs a high-level description, so here I go. (As usually, I tried hard to be very thorough both in both commit log messages and code comments, but still.)

#1967 (closed) turned out to be surprisingly frustrating to fix across all supported versions of BIND as:

  • v9.11 (EoL) uses non-refactored query handling code (bin/named/query.c) and also has an optional additional cache, but no glue cache (which is 9.12+),

  • v9.16+ has refactored query handling code (lib/ns/query.c) and does not support the additional cache feature, but does have an optional glue cache (different feature).

All in all, fixing this in all maintained branches would require preparing four different variants of a non-trivial fix:

  • bin/named/query.c with additional cache enabled,
  • bin/named/query.c with additional cache disabled,
  • lib/ns/query.c with glue cache enabled,
  • lib/ns/query.c with glue cache disabled.

This looks like overkill to me for something that is not a critical issue.

This MR only contains the last two variants of the fix from the list above, i.e. it addresses the problem in v9.16 and v9.17 (EoL) both with and without the glue cache.

But honestly, I would like to go a step further. My suggestion is to:

  • only fix the "glue cache enabled" variant,

  • drop the glue-cache option in v9.17 (EoL) and leave it always on (in a separate MR, of course).

The rationale for introducing the glue-cache option was:

The rationale for this option is that we don't have any feedback on how the glue cache is going to behave in the real world just yet. There should be an option to bypass it.

To the best of my knowledge, we have received zero problem reports which mention this feature in the past 3 years (since glue cache was introduced). I also have not seen the glue-cache no; setting in configuration files people use these days.

Thoughts?

Here is a brief overview of the branch (doc updates excluded):

  • 8eb90b69: glue system test cleanup; nothing fancy and there are just two checks in that test right now,

  • 8344c604: add tests reproducing #1967 (closed) in various configurations,

  • 0cd1dbe5: "glue cache on" fix, part 1,

  • dddbe70d: "glue cache on" fix, part 2,

  • 7684320f: "glue cache off" fix.

The only reason I split the "glue cache on" fix into two parts is to make the fix easier to follow as each of these parts touches a different section of rdataset_addglue().

I would be more than happy to drop the "glue cache off" part of this MR, please do not treat the amount of comments I put in there as a factor :)

Finally, both "glue cache on" and "glue cache off" variants could probably be replaced with a fix that affects a common code path, similarly to e.g. query_glueanswer(), but:

  • the code implementing such a fix would be more complicated than the changes suggested in this MR,

  • I expect such a fix to perform worse than the changes suggested in this MR (by putting the fix in glue cache code, each referral is only "analyzed" once - a "common path" fix would need to do that for every response sent).

Once the path forward with this MR is agreed upon, I would like to run it through Perflab before merging it.

Closes #1967 (closed)

Edited by Michał Kępień

Merge request reports