Skip to content

Fix a race in taskmgr between worker and task pausing/unpausing.

To reproduce the race - create a task, send two events to it, first one must take some time. Then, from the outside, pause(), unpause() and detach() the task.

When the long-running event is processed by the task it is in task_state_running state. When we called pause() the state changed to task_state_paused, on unpause we checked that there are events in the task queue, changed the state to task_state_ready and enqueued the task on the workers readyq. We then detach the task. The dispatch() is done with processing the event, it processes the second event in the queue, and then shuts down the task and frees it (as it's not referenced anymore). Dispatcher then takes the, already freed, task from the queue where it was wrongly put, causing an use-after free and, subsequently, either an assertion failure or a segmentation fault.

The probability of this happening is very slim, yet it might happen under a very high load, more probably on a recursive resolver than on an authoritative.

The fix introduces a new 'task_state_pausing' state - to which tasks are moved if they're being paused while still running. They are moved to task_state_paused state when dispatcher is done with them, and if we unpause a task in paused state it's moved back to task_state_running and not requeued.

Closes #1571 (closed)

Edited by Witold Krecicki

Merge request reports