Skip to content

Fix query context management issues in dighost.c

For the reference, the _cancel_lookup() function iterates through the lookup's queries list and detaches them. In the ideal scenario, that should be the last reference and the query will be destroyed after that, but it is also possible that we are still expecting a callback, which also holds a reference (for example, _cancel_lookup() could have been called from recv_done(), when send_done() was still not executed).

The start_udp() and start_tcp() functions are currently designed in slightly different ways: start_udp() creates a new query attachment connectquery, to be called in the callback function, while start_tcp() does not, which is a bug, but is hidden by the fact that when the query is being erroneously destroyed prematurely (before _cancel_lookup() is called) in the result of that, it also gets de-listed from the lookup's queries' list, so _cancel_lookup() doesn't even try to detach it.

For better understanding, here's an illustration of the query's references count changes, and from where it was changed:

UDP

  1. _new_query() -> refcount = 1 (initial)
  2. start_udp() -> refcount = 2 (lookup->current_query)
  3. start_udp() -> refcount = 3 (connectquery)
  4. udp_ready() -> refcount = 4 (readquery)
  5. udp_ready() -> refcount = 5 (sendquery)
  6. udp_ready() -> refcount = 4 (lookup->current_query)
  7. udp_ready() -> refcount = 3 (connectquery)
  8. send_done() -> refcount = 2 (sendquery)
  9. recv_done() -> refcount = 1 (readquery)
  10. _cancel_lookup() -> refcount = 0 (initial)
  11. the query gets destroyed and removed from lookup->q

TCP, fortunate scenario

  1. _new_query() -> refcount = 1 (initial)
  2. start_tcp() -> refcount = 2 (lookup->current_query)
  3. launch_next_query() -> refcount = 3 (readquery)
  4. launch_next_query() -> refcount = 4 (sendquery)
  5. tcp_connected() -> refcount = 3 (lookup->current_query)
  6. tcp_connected() -> refcount = 2 (bug, there was no connectquery)
  7. send_done() -> refcount = 1 (sendquery)
  8. recv_done() -> refcount = 0 (readquery)
  9. the query gets prematurely destroyed and removed from lookup->q
  10. _cancel_lookup() -> the query is not in lookup->q

TCP, unfortunate scenario, revealing the bug

  1. _new_query() -> refcount = 1 (initial)
  2. start_tcp() -> refcount = 2 (lookup->current_query)
  3. launch_next_query() -> refcount = 3 (readquery)
  4. launch_next_query() -> refcount = 4 (sendquery)
  5. tcp_connected() -> refcount = 3 (lookup->current_query)
  6. tcp_connected() -> refcount = 2 (bug, there was no connectquery)
  7. recv_done() -> refcount = 1 (readquery)
  8. _cancel_lookup() -> refcount = 0 (the query was in lookup->q)
  9. we hit an assertion here when trying to destroy the query, because sendhandle is not detached (which is done by send_done()).
  10. send_done() -> this never happens

This commit does the following:

  1. Add a connectquery attachment in start_tcp(), like done in start_udp().
  2. Add missing _cancel_lookup() calls for error scenarios, which were possibly missing because before fixing the bug, calling _cancel_lookup() and then calling query_detach() would cause an assertion.
  3. Log a debug message and call isc_nm_cancelread(query->readhandle) for every query in the lookup from inside the _cancel_lookup() function, like it is done in _cancel_all().
  4. Add a canceled property for the query which becomes true when the lookup (and subsequently, its queries) are canceled.
  5. Use the canceled property in the network manager callbacks to know that the query was canceled, and act like eresult was equal to ISC_R_CANCELED.

Closes #3184 (closed)

Edited by Ondřej Surý

Merge request reports