Use event loops in task manager
In named built with threads enabled on Unix, there are a variety of threads:
The main named thread is an isc_app which sleeps once startup completes and essentially waits for shutdown to occur at some later time. Typically the app thread waits for SIGTERM, which another thread sends using pthread_kill(). As part of startup, named creates an isc_taskmgr and sets up the system so that events flow to its thread pool.
The isc_taskmgr is a thread pool containing, by default, as many threads as there are processors; it is used for all processing within named. Work is distributed to the threads via lists of isc_event structs, with synchronization by condition variable + mutex. Processing an event results in a callback to an action function, with a void * argument, in the worker thread that picks it up. An isc_task is a queue of such events, and the task manager guarantees that only one event of a given task is executing at a time within the thread pool. In other words, tasks sequence the related callbacks (delivered by events) of a particular activity, with mutual exclusion among them.
An isc_timermgr creates a thread that waits for timer events to occur. The timer manager maintains its timers in a heap structure; the thread waits for the next timeout using pthread_cond_timedwait(). When a timeout occurs, it is dispatched as an event whose callback runs in the main thread pool (task manager).
An isc_socketmgr creates a single socket listening thread which watches all network sockets; any network I/O events are handled first by this thread. It uses an event loop (select/epoll_wait/kevent/etc.) to monitor the registered sockets for activity, and when a socket becomes ready for reading or writing the following occurs:
- taking read as the example, the listening thread, via process_fd() -> dispatch_recv(), sends an event to internal_recv() to be run under the client task, indicating that the socket is ready for reading, and then stops monitoring that socket for read readiness (i.e., the listening thread is no longer watching for such events on it).
- the client task receives the event in internal_recv() within the thread pool and reads from the socket. The read buffer is then passed via another event to the originally supplied action callback within the client task (e.g., to ns__client_request()). If the calling code has requested further reads on the socket, internal_recv() re-enables monitoring of read readiness in the listening thread.
With this description, the following can be noted:
- a single incoming message so far requires switching across three threads, A->B->C (the listening thread, the worker running internal_recv(), and the worker running the action callback), i.e., at least 2 context switches.
- an isc_taskmgr worker thread currently waits for work by calling pthread_cond_wait(); when the condvar is signaled, the thread unblocks and looks for work. Because the pool synchronizes with a condition variable + mutex, a worker cannot also monitor descriptors for work to be done. So if one wanted to bind to the port from multiple threads using SO_REUSEPORT, yet another pool of threads running event loops would be needed, and that would still not avoid the context switches for processing.
What is needed, then, is to convert the main thread pool so that individual threads run event loops and can monitor for a variety of events:
- work to be sent to the threads
- ready indications on sockets
- timer timeouts firing
... and synchronously process them within the same thread as much as possible.
(Add notes about sending too)