Fix query context management issues in dighost.c (v9.18)
For the reference, the _cancel_lookup() function iterates through the lookup's queries list and detaches them. In the ideal scenario, that should be the last reference and the query will be destroyed after that, but it is also possible that we are still expecting a callback, which also holds a reference (for example, _cancel_lookup() could have been called from recv_done(), when send_done() was still not executed).
The start_udp() and start_tcp() functions are currently designed in
slightly different ways: start_udp() creates a new query attachment
connectquery
, to be called in the callback function, while
start_tcp() does not, which is a bug, but is hidden by the fact
that when the query is being erroneously destroyed prematurely (before
_cancel_lookup() is called) in the result of that, it also gets
de-listed from the lookup's queries' list, so _cancel_lookup() doesn't
even try to detach it.
For better understanding, here's an illustration of the query's references count changes, and from where it was changed:
UDP
- _new_query() -> refcount = 1 (initial)
- start_udp() -> refcount = 2 (lookup->current_query)
- start_udp() -> refcount = 3 (connectquery)
- udp_ready() -> refcount = 4 (readquery)
- udp_ready() -> refcount = 5 (sendquery)
- udp_ready() -> refcount = 4 (lookup->current_query)
- udp_ready() -> refcount = 3 (connectquery)
- send_done() -> refcount = 2 (sendquery)
- recv_done() -> refcount = 1 (readquery)
- _cancel_lookup() -> refcount = 0 (initial)
- the query gets destroyed and removed from
lookup->q
TCP, fortunate scenario
- _new_query() -> refcount = 1 (initial)
- start_tcp() -> refcount = 2 (lookup->current_query)
- launch_next_query() -> refcount = 3 (readquery)
- launch_next_query() -> refcount = 4 (sendquery)
- tcp_connected() -> refcount = 3 (lookup->current_query)
- tcp_connected() -> refcount = 2 (bug, there was no connectquery)
- send_done() -> refcount = 1 (sendquery)
- recv_done() -> refcount = 0 (readquery)
- the query gets prematurely destroyed and removed from
lookup->q
- _cancel_lookup() -> the query is not in
lookup->q
TCP, unfortunate scenario, revealing the bug
- _new_query() -> refcount = 1 (initial)
- start_tcp() -> refcount = 2 (lookup->current_query)
- launch_next_query() -> refcount = 3 (readquery)
- launch_next_query() -> refcount = 4 (sendquery)
- tcp_connected() -> refcount = 3 (lookup->current_query)
- tcp_connected() -> refcount = 2 (bug, there was no connectquery)
- recv_done() -> refcount = 1 (readquery)
- _cancel_lookup() -> refcount = 0 (the query was in
lookup->q
) - we hit an assertion here when trying to destroy the query, because sendhandle is not detached (which is done by send_done()).
- send_done() -> this never happens
This commit does the following:
- Add a
connectquery
attachment in start_tcp(), like done in start_udp(). - Add missing _cancel_lookup() calls for error scenarios, which were possibly missing because before fixing the bug, calling _cancel_lookup() and then calling query_detach() would cause an assertion.
- Log a debug message and call isc_nm_cancelread(query->readhandle) for every query in the lookup from inside the _cancel_lookup() function, like it is done in _cancel_all().
- Add a
canceled
property for the query which becomestrue
when the lookup (and subsequently, its queries) are canceled. - Use the
canceled
property in the network manager callbacks to know that the query was canceled, and act likeeresult
was equal toISC_R_CANCELED
.
(cherry picked from commit 98820aef)
Closes #3184 (closed)