QP lookup() creates incorrect chain when an iterator is provided

Summary

QP trie with specific content leads to incorrect chain return value after a lookup(). See below.

BIND version affected

Affects v9.19: BIND 9.19.25-dev commit 93570194

Steps to reproduce

High-level:

Create an empty dns_qp_t.
insert name a.
insert name d.b.a.
insert name z.d.b.a.

At this point the QP should contain names - in DNSSEC-order - [a., d.b.a., z.d.b.a.]

lookup() name f.c.b.a. and provide chain and iterator.
Expected return code = DNS_R_PARTIALMATCH - that's okay.
call dns_qpchain_length() on the returned iterator or iterate through it.

⚠ EDIT: The issue manifests only when an iterator is provided when calling lookup().

What is the current bug behavior?

In my tests:

dns_qpchain_length() returns 3
iteration over the chain with idx set to 0..2 returns names:
- idx 0 name a.
- idx 1 name d.b.a.
- idx 2 name a.

Huh?! This certainly does not match description in qp.h:

 * If 'chain' is not NULL, it is updated to contain a QP chain with
 * references to the populated nodes in the tree between the root and
 * the name that was found. If the return code is DNS_R_PARTIALMATCH
 * then the chain terminates at the closest ancestor found; if it is
 * ISC_R_SUCCESS then it terminates at the name that was requested.
 * If the result is ISC_R_NOTFOUND, 'chain' will not be updated.

What is the expected correct behavior?

I would expect chain length == 1 and on index 0 name a..

Hacky code reproducer

This is modified code from #4702 (closed).

Still extremely ugly and very much work in progress. It is sort of "fuzzer" which wraps our libdns in C into a state machine written in Python, and uses library Python hypothesis to verify that all state transitions produce expected results. In pratical terms in compares behavior of our libdns with approximation based on dnspython library.

Usage:

Compile and install BIND version under test.
Install python-cffi library
cd tests/dns
python qp_test_pythonbuild.py # compiles Python-C glue library

Execute the failing test case:

cd tests/dns
LD_PRELOAD=/usr/lib/libjemalloc.so.2 python qp_test.py

Execute fuzzing machinery:

cd tests/dns
LD_PRELOAD=/usr/lib/libjemalloc.so.2 pytest qp_test.py --hypothesis-show-statistics -k TestTrees

It might take ~ 10 minutes until it fails. If it succeeds run it again :-) It's doing pseudo-random walk through various QP-related operations so it might not find bugs right away.

When it fails it might or might not print a nice test case - it's random after all. In case of failure the text output should give you an idea about sequence of steps it tried in sort-of human terms. Beware - the text output typically contains dozens of "retries" from clean state. Read carefully and start with the last one.

Edited May 10, 2024 by Petr Špaček