premature TCP connection closure leaks fetch contexts (hang on shutdown)
Summary
A bug in the error-handling code causes resource leaks when an authoritative server closes a TCP connection without responding. This ultimately leads to a hang on shutdown.
BIND version used
Steps to reproduce
- Force the resolver to contact an authoritative server which closes the TCP connection without responding.
- Shut down the resolver and observe the hang.
- Repeat if the hang does not reproduce on the first try.
Commands & files
- localhost.test.zone - a zone with a delegation to localhost
- named.conf - config for the resolver to load the delegation
- udpresp.py - UDP responder which always responds with TC=1
- tcpresp.py - TCP responder which keeps the connection open for a couple of seconds after accept() and then closes the connection without replying
- Run: named -g -c named.conf -d 99
- Run: python3 udpresp.py &
- Run: python3 tcpresp.py &
- Run: dig @::1 -p 5300 sub.localhost.test.
- Attempt to shut down the named process
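For readers without the attachments, the following is a minimal sketch of what a responder like udpresp.py might do. It is not the attached script: the function names (question_end, build_truncated_reply, serve_udp) are hypothetical, and the sketch assumes queries carry uncompressed question names. It echoes the question back with QR=1 and TC=1 and no records, which is what pushes a resolver to retry over TCP.

```python
import socket
import struct


def question_end(msg: bytes, qdcount: int) -> int:
    """Return the offset just past the question section (no name compression)."""
    pos = 12  # fixed-size DNS header
    for _ in range(qdcount):
        while True:
            length = msg[pos]
            pos += 1
            if length == 0:  # root label terminates the name
                break
            pos += length
        pos += 4  # QTYPE and QCLASS
    return pos


def build_truncated_reply(query: bytes) -> bytes:
    """Build an empty reply to `query` with QR=1 and TC=1."""
    if len(query) < 12:
        raise ValueError("shorter than a DNS header")
    msg_id, flags, qdcount = struct.unpack("!HHH", query[:6])
    # Keep opcode (0x7800) and RD (0x0100); set QR (0x8000) and TC (0x0200).
    reply_flags = (flags & 0x7900) | 0x8000 | 0x0200
    header = struct.pack("!HHHHHH", msg_id, reply_flags, qdcount, 0, 0, 0)
    # Echo only the question; drop any EDNS OPT record from the query.
    return header + query[12:question_end(query, qdcount)]


def serve_udp(sock: socket.socket) -> None:
    """Answer every datagram on `sock` with a truncated reply, forever."""
    while True:
        query, peer = sock.recvfrom(4096)
        try:
            sock.sendto(build_truncated_reply(query), peer)
        except (ValueError, IndexError):
            continue  # ignore malformed input in this toy responder
```

To use it, bind a UDP socket to the address and port that named sends its queries to (this depends on the attached named.conf and is not stated here) and pass the socket to serve_udp().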
What is the current bug behavior?
Fetch contexts leak and the resolver hangs on shutdown. Investigation by @michal:
AFAICT, the bug lies in the ISC_R_CONNECTIONRESET case for a TCP dispatch. In short, when tcp_recv() is called with ISC_R_CONNECTIONRESET, execution jumps here:
if (resp != NULL) {
    /* We got a matching response, or timed out */
    resp->response(eresult, region, resp->arg);
    dispentry_detach(&resp);
} else {
    /* We're being shut down; cancel all outstanding resps */
    for (resp = ISC_LIST_HEAD(resps); resp != NULL; resp = next) {
        next = ISC_LIST_NEXT(resp, rlink);
        ISC_LIST_UNLINK(resps, resp, rlink);
        resp->response(ISC_R_SHUTTINGDOWN, region, resp->arg);
        dispentry_detach(&resp);
    }
}
However, in this case, resp is NULL and resps is an empty list, so neither branch calls resp->response (the resolver callback, resquery_response()). The last reference to the query is therefore never detached, which prevents the whole fetch context from shutting down.
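The control flow of the quoted branch can be illustrated with a toy Python model. This is only a sketch of the shape of the logic, not BIND code: Resp and finish_read are hypothetical names, and result codes are represented as plain strings. It shows that when resp is None and resps is empty, no callback runs at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Resp:
    """Stand-in for a dispatch entry holding the resolver's callback."""
    response: Callable[[str], None]


def finish_read(resp: Optional[Resp], resps: List[Resp], eresult: str) -> int:
    """Mimic the quoted if/else; return how many callbacks were invoked."""
    called = 0
    if resp is not None:
        # A matching response (or timeout): notify exactly one waiter.
        resp.response(eresult)
        called += 1
    else:
        # Treated as shutdown: notify every entry still on the list.
        while resps:
            pending = resps.pop(0)
            pending.response("ISC_R_SHUTTINGDOWN")
            called += 1
    return called


if __name__ == "__main__":
    # The problematic case: no matching entry and an empty list.
    print(finish_read(None, [], "ISC_R_CONNECTIONRESET"))
```

In this model the problematic case returns zero, which corresponds to the resolver never being told that its TCP query failed.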
If I am right, it should be easy to reproduce with a toy authoritative server which forces a named resolver to retry a UDP query over TCP and then resets the TCP connection once it is established. This should leave the fetch context hanging around.
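The TCP half of such a toy server might look like the sketch below. It is an assumption-laden illustration, not the attached tcpresp.py: drop_without_reply and serve_tcp are hypothetical names, and the sketch forces a reset by enabling SO_LINGER with a zero timeout before close(), which may differ from how the original script ends the connection.

```python
import socket
import struct
import time
from typing import Optional


def drop_without_reply(conn: socket.socket, hold_seconds: float) -> None:
    """Hold an accepted connection open, then reset it without sending data."""
    time.sleep(hold_seconds)
    # SO_LINGER with l_onoff=1 and l_linger=0 makes close() abort the
    # connection (RST) instead of performing an orderly shutdown (FIN).
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def serve_tcp(listener: socket.socket, hold_seconds: float = 2.0,
              max_conns: Optional[int] = None) -> None:
    """Accept connections on `listener` and drop each one without replying."""
    served = 0
    while max_conns is None or served < max_conns:
        conn, _peer = listener.accept()
        drop_without_reply(conn, hold_seconds)
        served += 1
```

A listener can be created with socket.create_server((host, port)) and passed to serve_tcp(). The host and port must match wherever named sends its queries, which depends on the reproducer's configuration.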
What is the expected correct behavior?
named does not leak resources and does not hang on shutdown.
Relevant configuration files
It's not configuration-specific; the configuration in the reproducer is just for convenience.