host: hang in epoll_wait due to interface status changes

Summary

Certain interface state changes appear to trigger a permanent hang in host. This came up in Ubuntu bug 1752411; I am summarizing here what the community found while implementing a mitigation.

Steps to reproduce

We have two kinds of reports on how to trigger this (neither is a reliable reproducer, I know):

Setting interfaces online/offline while host is running; this seems to depend on the interface type.
A VPN solution creating its virtual interfaces, with host being run while those are still initializing.

See the Ubuntu bug for the different reproduction approaches people have taken.

What is the current bug behavior?

host hangs indefinitely. That was not our initial setup, but according to reports from people trying alternatives, this happens even when "-W " is set.

What is the expected correct behavior?

host should give up with an error after some time, or at least time out when -W is set.

Relevant configuration files

Relevant logs and/or screenshots

The process appears to be sleeping:

root 14606 0.0 0.0 187532 8384 ? Sl 13:05 0:00 host -t soa local.

At the same time, the kernel wchan for the process is sigsuspend.

GDB backtrace showing the epoll_wait call with timeout -1:

(gdb) t a a bt full

Thread 4 (Thread 0x7ffff0fe1700 (LWP 9916)):
#0  0x00007ffff6be9bb7 in epoll_wait (epfd=5, events=0x7ffff7f81010, maxevents=64, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        resultvar = 18446744073709551612
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x00007ffff712a49b in watcher (uap=0x7ffff7f80010) at ../../../../lib/isc/unix/socket.c:4292
        manager = 0x7ffff7f80010
        done = isc_boolean_false
        cc = <optimized out>
        fnname = 0x7ffff714389a "epoll_wait()"
        strbuf = '\000' <repeats 127 times>
#2  0x00007ffff6ec06db in start_thread (arg=0x7ffff0fe1700) at pthread_create.c:463
        pd = 0x7ffff0fe1700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737236571904, 6142566376078845154, 140737236569984, 0, 140737353613328, 140737488347808, -6142586158121474846, -6142581914974542622}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#3  0x00007ffff6be988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.

strace just shows it blocked in epoll_wait:

[pid  6197]      0.000076 epoll_wait(5,  <unfinished ...>

These snippets are trimmed for readability; the full logs, including the other threads of the GDB backtrace, can be found on the Ubuntu bug.

Possible fixes

Per the epoll_wait documentation a timeout can be set; it is currently -1, which means wait indefinitely:

cc = epoll_wait(manager->epoll_fd, manager->events,
                manager->nevents, -1);

That should be this code in current master.

Maybe set this to a value other than -1 and loop on the epoll timeout, so that the -W timeout or other failures can be noticed? On the other hand, this is a library function and I don't know where else it is used.

Mitigations

Since -W did not work, for now we have wrapped host in an external 'timeout' call, which mitigates the issue. Long term it would be nice to have this bug fixed to help other potential users; we are also considering moving from host to another resolver tool if one is better suited and works. In the actual case (an avahi script) we might even drop most of the code these days, but that is not relevant to the issue in host discussed here.
