host: hang in epoll_wait due to interface status changes
Summary
Certain interface state changes appear to trigger a permanent hang in host. This came up in Ubuntu bug 1752411; I am merely summarizing here what our community found while implementing a mitigation.
Steps to reproduce
We have had reports of two ways to trigger this (neither is a reliable reproducer, I know):
- setting interfaces online/offline while running host, but that seems dependent on the interface type
- a VPN solution creating its virtual interfaces, with host being run while those are still initializing
The Ubuntu bug has more detail on the different approaches people have taken to reproduce this.
What is the current bug behavior?
host hangs indefinitely.
Although it was not our initial setup, people trying alternatives report that this also happens when "-W" is set.
What is the expected correct behavior?
host should give up with an error after some time, or at least time out when -W is set.
Relevant configuration files
Relevant logs and/or screenshots
The process appears to be sleeping:
root 14606 0.0 0.0 187532 8384 ? Sl 13:05 0:00 host -t soa local.
At the same time the kernel wchan for the process is sigsuspend.
GDB backtrace showing the epoll_wait call with timeout=-1:
(gdb) t a a bt full
Thread 4 (Thread 0x7ffff0fe1700 (LWP 9916)):
#0 0x00007ffff6be9bb7 in epoll_wait (epfd=5, events=0x7ffff7f81010, maxevents=64, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
resultvar = 18446744073709551612
sc_cancel_oldtype = 0
sc_ret = <optimized out>
#1 0x00007ffff712a49b in watcher (uap=0x7ffff7f80010) at ../../../../lib/isc/unix/socket.c:4292
manager = 0x7ffff7f80010
done = isc_boolean_false
cc = <optimized out>
fnname = 0x7ffff714389a "epoll_wait()"
strbuf = '\000' <repeats 127 times>
#2 0x00007ffff6ec06db in start_thread (arg=0x7ffff0fe1700) at pthread_create.c:463
pd = 0x7ffff0fe1700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737236571904, 6142566376078845154, 140737236569984, 0, 140737353613328, 140737488347808, -6142586158121474846, -6142581914974542622}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
#3 0x00007ffff6be988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
strace just shows it blocked in epoll_wait:
[pid 6197] 0.000076 epoll_wait(5, <unfinished ...>
These snippets are trimmed for readability; the full logs, including the other threads of the GDB backtrace, can be found on the Ubuntu bug.
Possible fixes
Per the epoll_wait documentation a timeout can be set; it is currently -1, which means wait indefinitely:
cc = epoll_wait(manager->epoll_fd, manager->events,
manager->nevents, -1);
That should be this code in current master.
Maybe set this to a value other than -1 and loop on the epoll timeout, so that the caller's timeout or other failures can be noticed? On the other hand, this is a library function and I don't know where else it is used.
Mitigations
For now, since -W did not help, we have wrapped the call in an external 'timeout' command, which mitigates the issue.
Long term it would be nice to have this bug fixed, both to help other potential users and because we are considering moving from host to another resolver tool if one is better suited and works. In the actual case (an avahi script) we might even drop most of the code these days, but that is not relevant to the issue "in" host that we are discussing here.