Resolve "heap-use-after-free and invalid vptr on PingCheckMgr destruction"
Closes #3190 (closed)
Some explanations about the changes:
- Some IO objects require the IOService (boost::asio::io_service) to still be valid on destruction (e.g. boost::asio::ip::tcp::socket, boost::asio::ip::udp::socket). Most classes requiring this can be detected by the getInternalIOService call which gives them access to the reference of the boost::asio::io_service object.
- The IOService object needs to be kept alive by objects which register some callback functions so that in case of IoServiceThreadPool, the IOService object is not the one run by the main thread but the one created by the instance. This means that when the IoServiceThreadPool instance goes out of scope, the objects using it are referencing a destructed object if they don't acquire the smart pointer to the IOService to keep it still valid.
- For the callback functions (cancel/close/stop) to be called, the IOService restart and poll (always call in a try-catch block) functions must be called before the objects are destructed.
- These objects must implement a cancel/close/stop function which needs to be called before the object is destroyed so that they are still valid when the callback is called. This is usually done by the parent 'Object' which has one of such type as a member, just before it can safely destroy the member (can be the destructor).
- To keep the members alive, they can be added to a lambda function which uses a smart pointer to capture the object, but does not use it. It then must be added to the IOService queue using the post function. This ensured that after the cancel/close/stop callbacks have been called, the lambda is called and only at that stage the object captured is destroyed.
- IOService restart, run, runOne, poll and pollOne can not be called on any context, they need to be called by the main thread either on the normal code path or on specific events like server reconfiguration or server shutdown.
- Some IOService objects are created locally, so the same rationale applies here as well, the cancel/close/stop on respective objects using the IOServie must be called and then restart and poll to execute the handlers, just before the IOService is destroyed.
- In the case of TlsStreamBase(const IOServicePtr& io_service, TlsContextPtr context), the StreamService(io_service, context), base class needed to be created because the IOServicePtr needs to be held valid until the destructor of the TlsStreamBase and TlsStreamImpl are called, so that the destructor and release of the IOService is done last, on the StreamService destructor.
- Most of the changes are related to the fact that the IntervalTimer uses a smart pointer of IntervalTimerImpl which registers a callback function on the IOService. So even if the IntervalTimer instance is destroyed, the underlying IntervalTimerImpl is not released until the cancel callback has been called. This required that before every IntervalTimer goes out of scope, the cancel and IOService restart and poll are called, which in turn call the callbacks and release the reference to the IntervalTimerImpl.
- Some objects like IOFetch and DNSClientImpl perpetuate the execution by registering the same instance on the IOService on the callbacks, so a cancel function needed to be created to set a 'stopped_' flag to true so that when the callback it is executed, the cycle breaks and the function stops registering itself.
Insights:
An Object depending on an IOService must not generate an undefined behavior and must not leak memory.
The Object and IOService should be able to end their scope/visibility in any order, and RAII should properly handle their dependencies and lifetime.
- for an Object to be properly destroyed, it needs to have the IOService still valid
- any cancel/close/stop callback must be called before the Object's destructor, if the callbacks access any Object member (including non static functions)
- by doing so, it means that the IOService must call reset and poll before the Object is destroyed
- the IOService is destroyed when the Object is destroyed, at the earliest, if using the IOServicePtr RAII mechanics
- order of operations is: Object cancel/close/stop followed by IOService poll followed by Object destruction followed by IOService destruction
Conclusions:
-
current implementation: the Object does not hold smart pointer to IOService
- wrong operations:
- destroying IOService before destroying the Object causes undefined behavior even if cancel/close/stop are called on the Object's destructor - even if the callbacks are not called
- destroy the Object (which calls cancel/close/stop) and then call restart and poll on the IOService causes use after free if the callbacks access any Object member (including non static functions)
- right operations:
-
destroy the Object (which calls cancel/close/stop), and never access the Object members (including non static functions) in the callbacks on cancel/close/stop events
to be able to continue calling poll on the IOService and finally destroy the IOService
-
- wrong operations:
-
proposed implementations: the Object does hold smart pointer to IOService
- wrong operations:
- not calling cancel/close/stop on the Object before the Object's destruction causes undefined behavior when IOService poll is called after the Object's destruction
- not calling restart and poll on IOService causes memory leak as the Object is destroyed and then IOServicePtr is forever captured by the callback which is still in the IOService event queue (e.g. IntervalTimer is destroyed when IntervalTimerPtr reaches reference count of 0, but the internal member IntervalTimerImplPtr never reaches reference count of 0 because it is also bound to the close callback, and also holds a reference of the IOService, so neither the IOService or the IntervalTimerImpl gets released)
- right operations:
- call cancel/stop/close on the Object, then call restart and poll on IOService, then destroy the Object, which in turn destroys IOService if it has the last reference
e.g:
// Stop the thread pool. thread_pool_->stop(); // Stop the listener. http_listener_->stop(); thread_io_service_->restart(); try { thread_io_service_->poll(); } catch (...) { } // Get rid of the thread pool. thread_pool_.reset(); // Get rid of the listener. http_listener_.reset(); // Ditch the IOService. thread_io_service_.reset();
- wrong operations:
Alternatives to IOService post:
Using boost::enable_shared_from_this might keep the members alive on the IOService queue until the callbacks are called, so the need for IOService post with the lambda capturing the smart pointer of the member can be avoided.
Possible problems with proposed implementations:
- There are some hook libraries which use objects with this problem, so registering the objects on the IOService post, the hook library must rely on the core functionality to call reset and poll, just before dlclose is used to unload the library symbols. This can be easily done on each server shutdown and on each reconfiguration even.
- A real problem is that a new requirement needs to be satisfied: all hook libraries must destroy all global objects which might have a reference to such objects on the unload hook point. If they don't release the objects by that stage, the unloading of the library might not work, or worse, the reference is destroyed on dlclose and all hanging references used by next call to IOService run/runOne/poll/pollOne will lead to a crash, or at least an undefined behavior. One example of a global instance with references to such objects is PingCheckMgr, but it is properly destroyed on unload hook point.