Skip to content
GitLab
Projects Groups Snippets
  • /
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in / Register
  • BIND BIND
  • Project information
    • Project information
    • Activity
    • Labels
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 577
    • Issues 577
    • List
    • Boards
    • Service Desk
    • Milestones
  • Merge requests 117
    • Merge requests 117
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Deployments
    • Deployments
    • Environments
    • Releases
  • Packages and registries
    • Packages and registries
    • Container Registry
  • Monitor
    • Monitor
    • Incidents
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Repository
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • ISC Open Source ProjectsISC Open Source Projects
  • BINDBIND
  • Issues
  • #2626
Closed
Open
Issue created Apr 12, 2021 by Benjamin Gentil@bgentil

Deadlock with concurrent `rndc addzone`/`rndc delzone` commands

Summary

A deadlock can happens between concurrent rndc delzone and rndc addzone commands (on different zones) making the named process and every following rndc commands stuck.

BIND version used

bind9 debian buster-backports package 1:9.16.13-1~bpo10+1

BIND 9.16.13-Debian (Stable Release) <id:072e758>
running on Linux x86_64 5.11.8-200.fc33.x86_64 #1 SMP Mon Mar 22 01:35:01 UTC 2021
built by make with '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=/usr/lib/x86_64-linux-gnu' '--runstatedir=/run' '--disable-maintainer-mode' '--disable-dependency-tracking' '--libdir=/usr/lib/x86_64-linux-gnu' '--sysconfdir=/etc/bind' '--with-python=python3' '--localstatedir=/' '--enable-threads' '--enable-largefile' '--with-libtool' '--enable-shared' '--enable-static' '--with-gost=no' '--with-openssl=/usr' '--with-gssapi=/usr' '--with-libidn2' '--with-json-c' '--with-lmdb=/usr' '--with-gnu-ld' '--with-maxminddb' '--with-atf=no' '--enable-ipv6' '--enable-rrl' '--enable-filter-aaaa' '--disable-native-pkcs11' '--enable-dnstap' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/build/bind9-xJGMKz/bind9-9.16.13=. -fstack-protector-strong -Wformat -Werror=format-security -fno-strict-aliasing -fno-delete-null-pointer-checks -DNO_VERSION_DATE -DDIG_SIGCHASE' 'LDFLAGS=-Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'
compiled by GCC 8.3.0
compiled with OpenSSL version: OpenSSL 1.1.1d  10 Sep 2019
linked to OpenSSL version: OpenSSL 1.1.1d  10 Sep 2019
compiled with libuv version: 1.24.1
linked to libuv version: 1.24.1
compiled with libxml2 version: 2.9.4
linked to libxml2 version: 20904
compiled with json-c version: 0.12.1
linked to json-c version: 0.12.1
compiled with zlib version: 1.2.11
linked to zlib version: 1.2.11
linked to maxminddb version: 1.3.2
compiled with protobuf-c version: 1.3.1
linked to protobuf-c version: 1.3.1
threads support is enabled

default paths:
  named configuration:  /etc/bind/named.conf
  rndc configuration:   /etc/bind/rndc.conf
  DNSSEC root key:      /etc/bind/bind.keys
  nsupdate session key: //run/named/session.key
  named PID file:       //run/named/named.pid
  named lock file:      //run/named/named.lock
  geoip-directory:      /usr/share/GeoIP

Has been also reproduced on a 9.11.25 build, but not on the debian buster version (1:9.11.5.P4+dfsg-5.1+deb10u3)

Steps to reproduce

The issue can be triggered by running many rndc addzone & rndc delzone in parallel.

A repository with a POC is available here: https://github.com/bengentil/bind-deadlock

What is the current bug behavior?

When the bug happens the named process is deadlocked and all rndc commands are stuck.

What is the expected correct behavior?

The expected behavior is to have all rndc commands to be executed to completion without any deadlock.

Relevant configuration files

I don't have any specific configuration in place to reproduce this issue, as far as I know this issue is not avoidable by configuration.

Relevant logs and/or screenshots

An extract of a gdb backtrace of the deadlocked process (only the 2 relevant threads):

Thread 16 (Thread 0x7f549b6bf700 (LWP 58)):
#0  0x00007f54a5a9e00c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007f54a5f560a9 in isc_task_beginexclusive (task0=<optimized out>) at task.c:1662
        task = <optimized out>
        manager = 0x7f54a02ba010
#2  0x000055a0f169bf99 in do_addzone (zoneconf=<optimized out>, text=0x7f549b6bebc8, redirect=false, zoneobj=0x7f5495d48a38, name=0x7f549b6be830, view=0x7f5490bc1ea0, cfg=0x7f5495d4e048, server=0x7f54a02c0010) at ./server.c:13398
        tresult = <optimized out>
        zone = 0x0
        dbi = 2428247712
        result = 0
        txn = 0x0
        result = <optimized out>
        tresult = <optimized out>
        zone = <optimized out>
        txn = <optimized out>
        dbi = <optimized out>
        dbp = <optimized out>
#3  named_server_changezone (server=0x7f54a02c0010, command=<optimized out>, text=text@entry=0x7f549b6bebc8) at ./server.c:13800
        result = <optimized out>
        addzone = true
        redirect = false
        cfg = 0x7f5495d4e048
        zoneconf = 0x7f5495d48b00
        zoneobj = 0x7f5495d48a38
        zonename = 0x7f5458008430 "test1.local"
        view = 0x7f5490bc1ea0
        buf = {magic = 1114990113, base = 0x7f5458008430, length = 11, used = 11, current = 11, active = 0, link = {prev = 0xffffffffffffffff, next = 0xffffffffffffffff}, mctx = 0x0, autore = false}
        fname = {name = {magic = 1145983854, ndata = 0x7f549b6be940 "\005test1\005local", length = 13, labels = 3, attributes = 1, offsets = 0x7f549b6be880 "", buffer = 0x7f549b6be900, link = {prev = 0xffffffffffffffff, next = 0xffffffffffffffff}, list = {head = 0x0, tail = 0x0}}, offsets = "\000\006\f\000\000\000\000\000\240\307Z\245T\177\000\000\000\200\255\373\320\224\004\255\200\314\005XT\177\000\000\200\314\005XT\177\000\000\200\314\005XT\177\000\000\200\314\005XT\177\000\000\000E\027^\345\027Y\223\310\353k\233T\177\000\000\350\203,\240T\177\000\000\000\352k\233T\177\000\000\320\351k\233T\177\000\000@\366+\240T\177\000\000\277\306X\245T\177\000\000 \000\000\000\060\000\000\000\320\351k\233T\177\000", buffer = {magic = 1114990113, base = 0x7f549b6be940, length = 255, used = 13, current = 0, active = 0, link = {prev = 0xffffffffffffffff, next = 0xffffffffffffffff}, mctx = 0x0, autore = false}, data = "\005test1\005local\000\177\000\000\000\020\000\000\000\000\000\000\000E\027^\345\027Y\223\a\000\000\000\000\000\000\000\310\353k\233T\177\000\000\000\000\000\000\000\000\000\000\270\352k\233T\177\000\000p_\316\225T\177\000\000\300!\371\245T\177\000\000\000\000\000\000\000\000\000\000\242\347\362\245T\177\000\000\260\351k\233T\177\000\000\260\216\025\243T\177\000\000\060\000\000\000\060\000\000\000\220\352k\233T\177\000\000\320\351k\233T\177\000\000\000E\027^\345\027Y\223buffer-0x7f549\000\000\223\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000w_\316\225T\177\000\000\260E\005XT\177\000\000\002\000\000\000\000\000\000\000"...}
        dnsname = 0x7f549b6be830
#4  0x000055a0f167d7d7 in named_control_docommand (message=<optimized out>, readonly=<optimized out>, text=text@entry=0x7f549b6bebc8) at control.c:209
        data = <optimized out>
        cmdline = 0x7f54580545b0 "addzone test1.local {type master; file \"/var/cache/bind/test1.local.db\"; auto-dnssec maintain; inline-signing yes; key-directory \"/var/cache/bind/keys\";};"
        command = 0x7f5495ce5f70 "addzone"
        result = 0
        log_level = <optimized out>
        src = {magic = 1114990113, base = 0x7f54580545b0, length = 154, used = 154, current = 8, active = 0, link = {prev = 0xffffffffffffffff, next = 0xffffffffffffffff}, mctx = 0x0, autore = false}
        lex = 0x7f54a02c83e8
#5  0x000055a0f16809f3 in control_recvmessage (task=0x7f5495d25010, event=<optimized out>) at controlconf.c:476
        conn = 0x7f5495d43e50
        listener = 0x7f5495d4d120
        key = 0x7f54a02bf640
        request = 0x7f5458005240
        response = 0x0
        algorithm = 163
        secret = {rstart = 0x7f54a315bab0 "\342\061\016`\001\201\065\067\242U\313\033|\321\334\350\303\335\213OX\252u\f\210\033v\271K\221x\342\020\273\025\243T\177", rend = 0x7f54a315bad0 "\020\273\025\243T\177"}
        now = 1617996781
        b = <optimized out>
        r = {base = 0x7f54a1ec4f0d ";", length = 2716618509}
        text = 0x7f544c046db0
        result = <optimized out>
        eresult = <optimized out>
        _ctrl = 0x7f54580540e0
        sent = 1617996781
        exp = 1617996841
        nonce = 459112386
        data = 0x0
#6  0x00007f54a5f53e4b in dispatch (threadid=<optimized out>, manager=0x7f54a02ba010) at task.c:1153
        dispatch_count = 24
        done = false
        finished = false
        requeue = false
        event = 0x7f5495d43ed0
        task = 0x7f5495d25010
        task = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        dispatch_count = <optimized out>
        done = <optimized out>
        requeue = <optimized out>
        finished = <optimized out>
        event = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        was_idle = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        empty = <optimized out>
        i = <optimized out>
        __atomic_store_ptr = <optimized out>
        __atomic_store_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
#7  run (queuep=<optimized out>) at task.c:1345
        tq = <optimized out>
        manager = 0x7f54a02ba010
        threadid = <optimized out>
#8  0x00007f54a5f5a282 in isc__trampoline_run (arg=0x55a0f2d9d970) at trampoline.c:191
        trampoline = 0x55a0f2d9d970
        result = <optimized out>
#9  0x00007f54a5a97fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#10 0x00007f54a562d4cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Thread 15 (Thread 0x7f549bec0700 (LWP 57)):
#0  0x00007f54a5aa129c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007f54a5a9a87a in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#2  0x000055a0f168caef in rmzone (task=<optimized out>, event=<optimized out>) at ./server.c:13890
        dz = 0x7f5495d34ff0
        zone = 0x7f5464008d10
        raw = 0x0
        mayberaw = <optimized out>
        zonename = "test2.local", '\000' <repeats 17 times>, "T\177", '\000' <repeats 26 times>, "\377\377\377\377\377\377\377\377RSND\377\377\377\377\000\000\000\000\000\000\000\000", '\377' <repeats 16 times>, '\000' <repeats 20 times>, "\377\377\377\377", '\000' <repeats 72 times>, "X\002\000\000\000\000\000\000"...
        view = 0x7f5490bc1ea0
        cfg = <optimized out>
        dbp = 0x0
        added = true
        result = <optimized out>
        txn = 0x0
        dbi = 4294967295
#3  0x00007f54a5f53e4b in dispatch (threadid=<optimized out>, manager=0x7f54a02ba010) at task.c:1153
        dispatch_count = 0
        done = false
        finished = false
        requeue = false
        event = 0x7f549559be48
        task = 0x7f5495d25be0
        task = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        dispatch_count = <optimized out>
        done = <optimized out>
        requeue = <optimized out>
        finished = <optimized out>
        event = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        was_idle = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        empty = <optimized out>
        i = <optimized out>
        __atomic_store_ptr = <optimized out>
        __atomic_store_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
        __atomic_load_ptr = <optimized out>
        __atomic_load_tmp = <optimized out>
#4  run (queuep=<optimized out>) at task.c:1345
        tq = <optimized out>
        manager = 0x7f54a02ba010
        threadid = <optimized out>
#5  0x00007f54a5f5a282 in isc__trampoline_run (arg=0x55a0f2d9d7e0) at trampoline.c:191
        trampoline = 0x55a0f2d9d7e0
        result = <optimized out>
#6  0x00007f54a5a97fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#7  0x00007f54a562d4cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

The corresponding relevant logs:

Apr  9 19:33:01 f402d8b027a0 named[42]: received control channel command 'delzone test2.local'
Apr  9 19:33:01 f402d8b027a0 named[42]: zone test2.local scheduled for removal via delzone
Apr  9 19:33:01 f402d8b027a0 named[42]: received control channel command 'addzone test1.local {type master; file "/var/cache/bind/test1.local.db"; auto-dnssec maintain; inline-signing yes; key-directory "/var/cache/bind/keys";};'
Apr  9 19:33:01 f402d8b027a0 named[42]: deleting zone test2.local in view _default via delzone

We can see that rndc addzone test1.local [...] has been executed during an already running rndc delzone test2.local and both threads are now waiting.

The thread 15/LWP 57 (delzone) is waiting for a lock owned by thread 16/LWP 58 (addzone)

(gdb) thread 15
[Switching to thread 15 (Thread 0x7f549bec0700 (LWP 57))]
#0  0x00007f54a5aa129c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) frame 2
#2  0x000055a0f168caef in rmzone (task=<optimized out>, event=<optimized out>) at ./server.c:13890
13890   in ./server.c
(gdb) p *(&view->new_zone_lock->__data->__owner)
$9 = 58

isc_task_beginexclusive waits while this is true: manager->halted + 1 < manager->workers

(gdb) thread 16
[Switching to thread 16 (Thread 0x7f549b6bf700 (LWP 58))]
#0  0x00007f54a5a9e00c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) frame 1
#1  0x00007f54a5f560a9 in isc_task_beginexclusive (task0=<optimized out>) at task.c:1662
1662    task.c: No such file or directory.
(gdb) p manager->halted
$10 = 6
(gdb) p manager->workers
$11 = 8

Observations

The rmzone relevant code is:

  • server.c:13890
      13855 rmzone(isc_task_t *task, isc_event_t *event) {
    [...]
      13888 #ifdef HAVE_LMDB
      13889         /* Make sure we can open the NZD database */
    ->13890         LOCK(&view->new_zone_lock);

The addzone relevant code is:

  • server.c:13350

      13337 do_addzone(named_server_t *server, ns_cfgctx_t *cfg, dns_view_t *view,
      13338        dns_name_t *name, cfg_obj_t *zoneconf, const cfg_obj_t *zoneobj,
      13339        bool redirect, isc_buffer_t **text) {
    [...]
      13350     LOCK(&view->new_zone_lock);
    [...]
      13387     /* Make sure we can open the NZD database */
      13388     result = nzd_writable(view);
      13389     if (result != ISC_R_SUCCESS) {
      13390         TCHECK(putstr(text, "unable to open NZD database for '"));
      13391         TCHECK(putstr(text, view->new_zone_db));
      13392         TCHECK(putstr(text, "'"));
      13393         result = ISC_R_FAILURE;
      13394         goto cleanup;
      13395     }
      13396 #endif /* HAVE_LMDB */
      13397
    ->13398     result = isc_task_beginexclusive(server->task);
  • task.c:1662

      1634 isc_task_beginexclusive(isc_task_t *task0) {
    [...]
      1660     while (manager->halted + 1 < manager->workers) {
      1661         wake_all_queues(manager);
    ->1662         WAIT(&manager->halt_cond, &manager->halt_lock);
      1663     }

My understanding is that the rmzone is executed between the LOCK(&view->new_zone_lock) and the isc_task_beginexclusive in do_addzone resulting in:

  • do_addzone waits for other workers to stop (ie. waits for rmzone to stop)
  • rmzone waits for the view->new_zone_lock locked by do_addzone
  • do_addzone sees always 2 workers and the view->new_zone_lock is never unlocked
Assignee
Assign to
Time tracking