1. 30 Jul, 2020 6 commits
  2. 28 Jul, 2020 1 commit
  3. 24 Jul, 2020 2 commits
  4. 23 Jul, 2020 1 commit
  5. 21 Jul, 2020 2 commits
    • Drop feature test for dlopen() · 2064e01c
      Michal Nowak authored
      Since libtool is mandatory from 9.17 onwards, so is dlopen() (via libltdl).
    • Fix the rbt hashtable and grow it when setting max-cache-size · e24bc324
      Ondřej Surý authored
      There were several problems with the rbt hashtable implementation:
      
      1. Our internal hashing function returns a uint64_t value, but it was
         silently truncated to unsigned int in the dns_name_hash() and
         dns_name_fullhash() functions.  As the higher bits of SipHash 2-4
         are more random, we need to use the upper half of the return value.
      
      2. The hashtable implementation in rbt.c was using modulo to pick the
         slot number for the hash table.  This has several problems because
         modulo is: a) slow, b) oblivious to patterns in the input data.
         This could lead to a very uneven distribution of the hashed data in
         the hashtable.  Combined with the singly-linked lists we use, it
         could really bog down the lookup and removal of nodes from the rbt
         tree[a].  Fibonacci Hashing is a much better fit for the hashtable
         function here (see the sketch after this commit message).  For a
         longer description, read "Fibonacci Hashing: The Optimization that
         the World Forgot"[b] or just look at the Linux kernel.  Also, this
         will make Diego very happy :).
      
      3. The hashtable would rehash every time the number of nodes in the
         rbt tree exceeded 3 * (hashtable size).  The overcommit makes the
         uneven distribution in the hashtable even worse, but the main
         problem lies in the rehashing - every time the database grows
         beyond the limit, each subsequent rehashing becomes much slower.
         The mitigation here is to let the rbt know how big the cache can
         grow and pre-allocate the hashtable to be big enough to never need
         to rehash.  This will consume more memory at the start, but since
         the size of the hashtable is capped at `1 << 32` slots (roughly
         4 billion entries), it will consume a maximum of 32GB of memory
         for the hashtable in the worst case (and max-cache-size would need
         to be set to more than 4TB).  Calling dns_db_adjusthashsize() will
         also cap the maximum size of the hashtable to the pre-computed
         number of bits, so it won't try to consume more gigabytes of
         memory than are available for the database.
      
         FIXME: What is the average size of the rbt node that gets hashed?
         I chose the pagesize (4k) as the initial value to precompute the
         size of the hashtable, but that value is based on a gut feeling
         and not on any real data.
      
      For future work, there are more places where we use the result of the
      hash value modulo some small number; those would also benefit from
      Fibonacci Hashing to get a better distribution.
      
      Notes:
      a. A doubly-linked list should be used here to speed up the removal
         of entries from the hashtable.
      b. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/
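
      The following is a minimal sketch of the ideas above, not the actual
      rbt.c code: the names fib_hash_slot(), hashtable_bits_for_size(), and
      NODE_SIZE_GUESS are hypothetical, and the 4k node-size guess is the
      same assumption as in the FIXME.  It shows taking the upper half of
      the 64-bit hash, picking a slot with Fibonacci hashing (multiply by
      2^32/phi, then shift) instead of modulo, and precomputing the number
      of hashtable bits from max-cache-size.

        #include <stdint.h>
        #include <stdio.h>

        #define FIB_CONST_32    2654435769U /* 2^32 divided by the golden ratio */
        #define NODE_SIZE_GUESS 4096        /* assumed average size of a hashed rbt node */

        /* Map a 64-bit hash value to a slot in a table with (1 << bits) entries. */
        static inline uint32_t
        fib_hash_slot(uint64_t hash64, unsigned int bits) {
                uint32_t h = (uint32_t)(hash64 >> 32); /* upper half: more random SipHash bits */
                return ((uint32_t)(h * FIB_CONST_32) >> (32 - bits));
        }

        /* Smallest number of bits whose slot count covers the expected node
         * count for a given max-cache-size, capped at 32 bits (with 8-byte
         * slots that is the 32GB worst case mentioned above). */
        static unsigned int
        hashtable_bits_for_size(uint64_t max_cache_size) {
                uint64_t nodes = max_cache_size / NODE_SIZE_GUESS;
                unsigned int bits = 1;

                while (bits < 32 && ((uint64_t)1 << bits) < nodes) {
                        bits++;
                }
                return (bits);
        }

        int
        main(void) {
                unsigned int bits = hashtable_bits_for_size(8ULL << 30); /* 8GB cache */

                printf("hashtable bits: %u (%llu slots)\n", bits,
                       (unsigned long long)1 << bits);
                printf("slot for 0x1234abcd5678ef90: %u\n",
                       fib_hash_slot(0x1234abcd5678ef90ULL, bits));
                return (0);
        }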
  6. 17 Jul, 2020 3 commits
    • Check tests for core files regardless of test status · 1b13123c
      Michal Nowak authored
      Failed tests should also be checked for core files etc. and have
      backtraces generated.
    • Rationalize backtrace logging · 05c13e50
      Michal Nowak authored
      GDB backtraces generated via "thread apply all bt full" are too long
      for standard output; let's save them to a .txt file among the other
      log files.
    • Ensure various test issues are treated as failures · b232e858
      Michal Nowak authored
      Make sure bin/tests/system/run.sh returns a non-zero exit code if any of
      the following happens:
      
        - the test being run produces a core dump,
        - assertion failures are found in the test's logs,
        - ThreadSanitizer reports are found after the test completes,
        - the servers started by the test fail to shut down cleanly.
      
      This change is necessary to always fail a test in such cases (before the
      migration to Automake, test failures were determined based on the
      presence of "R:<test-name>:FAIL" lines in the test suite output and thus
      it was not necessary for bin/tests/system/run.sh to return a non-zero
      exit code).
  7. 16 Jul, 2020 1 commit
    • rewrite statschannel to use netmgr · 69c1ee1c
      Evan Hunt authored
      modify isc_httpd to use the network manager instead of the
      isc_socket API.
      
      also cleaned up bin/named/statschannel.c to use CHECK.
  8. 14 Jul, 2020 2 commits
    • Add regression test for [GL !3735] · 11ecf790
      Mark Andrews authored
      Check that the re-sign interval is actually in days rather than hours
      by verifying that the RRSIGs are all within the allowed day range.
    • Fix re-signing when `sig-validity-interval` has two arguments · 030674b2
      Tony Finch authored
      Since October 2019 I have had complaints from `dnssec-cds` reporting
      that the signatures on some of my test zones had expired. These were
      zones signed by BIND 9.15 or 9.17, with a DNSKEY TTL of 24h and
      `sig-validity-interval 10 8`.
      
      This is the same setup we have used for our production zones since
      2015, which is intended to re-sign the zones every 2 days, keeping
      at least 8 days signature validity. The SOA expire interval is 7
      days, so even in the presence of zone transfer problems, no-one
      should ever see expired signatures. (These timers are a bit too
      tight to be completely correct, because I should have increased
      the expiry timers when I increased the DNSKEY TTLs from 1h to 24h.
      But that should only matter when zone transfers are broken, which
      was not the case for the error reports that led to this patch.)
      
      For example, this morning my test zone contained:
      
              dev.dns.cam.ac.uk. 86400 IN RRSIG DNSKEY 13 5 86400 (
                                      20200701221418 20200621213022 ...)
      
      But one of my resolvers had cached:
      
              dev.dns.cam.ac.uk. 21424 IN RRSIG DNSKEY 13 5 86400 (
                                      20200622063022 20200612061136 ...)
      
      This TTL was captured at 20200622105807 so the resolver cached the
      RRset 64976 seconds previously (18h02m56s), at 20200621165511
      only about 12h before expiry.
      
      The other symptom of this error was incorrect `resign` times in
      the output from `rndc zonestatus`.
      
      For example, I have configured a test zone
      
              zone fast.dotat.at {
                      file "../u/z/fast.dotat.at";
                      type primary;
                      auto-dnssec maintain;
                      sig-validity-interval 500 499;
              };
      
      The zone is reset to a minimal zone containing only SOA and NS
      records, and when `named` starts it loads and signs the zone. After
      that, `rndc zonestatus` reports:
      
              next resign node: fast.dotat.at/NS
              next resign time: Fri, 28 May 2021 12:48:47 GMT
      
      The resign time should be within the next 24h, but instead it is
      near the signature expiry time, which the RRSIG(NS) says is
      20210618074847. (Note 499 hours is a bit more than 20 days.)
      May/June 2021 is less than 500 days from now because expiry time
      jitter is applied to the NS records.
      
      Using this test I bisected this bug to 09990672, which contained a
      mistake that caused the re-signing interval to always be calculated
      in hours when days were expected.
      
      This bug only occurs for configurations that use the two-argument form
      of `sig-validity-interval`.
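
      The following is a minimal sketch of the arithmetic described above,
      not the actual zone.c code: with `sig-validity-interval 500 499`, the
      next re-sign time should sit 499 days before expiry, i.e. within
      about a day of signing, while the buggy calculation subtracted 499
      hours and left the re-sign time close to the expiry time.

        #include <stdio.h>
        #include <time.h>

        int
        main(void) {
                const long hour = 3600, day = 24 * 3600;
                long validity = 500, resign = 499; /* sig-validity-interval 500 499 */

                time_t now = time(NULL);
                time_t expiry = now + validity * day;  /* ignoring expiry-time jitter */

                time_t ok = expiry - resign * day;     /* intended: re-sign ~1 day from now */
                time_t bug = expiry - resign * hour;   /* buggy: re-sign ~479 days from now */

                printf("intended next resign: %ld days from now\n",
                       (long)(ok - now) / day);
                printf("buggy next resign:    %ld days from now\n",
                       (long)(bug - now) / day);
                return (0);
        }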
  9. 13 Jul, 2020 8 commits
    • purge pending command events when shutting down · 29dcdeba
      Evan Hunt authored
      When we're shutting the system down via "rndc stop" or "rndc halt",
      or reconfiguring the control channel, there are potential shutdown
      races between the server task and network manager.  These are addressed by:
      
      - purging any pending command tasks when shutting down the control channel
      - adding an extra handle reference before the command handler to
        ensure the handle can't be deleted out from under us before calling
        command_respond()
    • use an isc_task to execute rndc commands · 45ab0603
      Evan Hunt authored
      - using an isc_task to execute all rndc functions makes it relatively
        simple for them to acquire task exclusive mode when needed
      - control_recvmessage() has been separated into two functions,
        control_recvmessage() and control_respond(). the respond function
        can be called immediately from control_recvmessage() when processing
        a nonce, or it can be called after returning from the task event
        that ran the rndc command function.
    • convert rndc and control channel to use netmgr · 3551d3ff
      Evan Hunt authored
      - updated libisccc to use netmgr events
      - updated rndc to use isc_nm_tcpconnect() to establish connections
      - updated control channel to use isc_nm_listentcp()
      
      open issues:
      
      - the control channel timeout was previously 60 seconds, but it is now
        overridden by the TCP idle timeout setting, which defaults to 30
        seconds. we should add a function that sets the timeout value for
        a specific listener socket, instead of always using the global value
        set in the netmgr. (for the moment, since 30 seconds is a reasonable
        timeout for the control channel, I'm not prioritizing this.)
      - the netmgr currently has no support for UNIX-domain sockets; until
        this is addressed, it will not be possible to configure rndc to use
        them. we will need to either fix this or document the change in
        behavior.
    • don't use exclusive mode for rndc commands that don't need it · 002c3284
      Evan Hunt authored
      "showzone" and "tsig-list" both used exclusive mode unnecessarily;
      changing this will simplify future refactoring a bit.
    • style cleanup · 0580d9cd
      Evan Hunt authored
      clean up style in rndc and the control channel in preparation for
      changing them to use the new network manager.
    • make sure new_zone_lock is locked before unlocking it · ed37c63e
      Evan Hunt authored
      it was possible for the count_newzones() function to try to
      unlock view->new_zone_lock on return without having locked it
      first, which caused a crash on shutdown.
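
      The following is a generic sketch of the bug pattern being fixed, not
      the actual named code; the config_ok parameter and the "locked" flag
      are illustrative.  An early jump to the cleanup label taken before
      the lock is acquired must not fall through to the unlock.

        #include <pthread.h>

        static pthread_mutex_t new_zone_lock = PTHREAD_MUTEX_INITIALIZER;

        static int
        count_newzones_sketch(int config_ok) {
                int count = 0;
                int locked = 0;

                if (!config_ok) {
                        goto cleanup;   /* early exit before the lock is taken */
                }

                pthread_mutex_lock(&new_zone_lock);
                locked = 1;
                /* ... walk the new-zone database and count the zones ... */

        cleanup:
                if (locked) {           /* only unlock what we actually locked */
                        pthread_mutex_unlock(&new_zone_lock);
                }
                return (count);
        }

        int
        main(void) {
                return (count_newzones_sketch(0));
        }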
    • Fallback to built in trust-anchors, managed-keys, or trusted-keys · d02a14c7
      Mark Andrews authored
      if the bind.keys file cannot be parsed.
    • Don't verify the zone when setting expire to "now+1s" as it can fail · a0e8a11c
      Mark Andrews authored
      as too much wall clock time may have elapsed.
      
      Also capture the signzone output for forensic analysis.
  10. 12 Jul, 2020 1 commit
  11. 10 Jul, 2020 1 commit
    • Fix locking for LMDB 0.9.26 · 53120279
      Michał Kępień authored
      When "rndc reconfig" is run, named first configures a fresh set of views
      and then tears down the old views.  Consider what happens for a single
      view with LMDB enabled; "envA" is the pointer to the LMDB environment
      used by the original/old version of the view, "envB" is the pointer to
      the same LMDB environment used by the new version of that view:
      
       1. mdb_env_open(envA) is called when the view is first created.
       2. "rndc reconfig" is called.
       3. mdb_env_open(envB) is called for the new instance of the view.
       4. mdb_env_close(envA) is called for the old instance of the view.
      
      This seems to have worked so far.  However, an upstream change [1] in
      LMDB which will be part of its 0.9.26 release prevents the above
      sequence of calls from working as intended because the locktable mutexes
      will now get destroyed by the mdb_env_close() call in step 4 above,
      causing any subsequent mdb_txn_begin() calls to fail (because all of the
      above steps are happening within a single named process).
      
      Preventing the above scenario from happening would require either
      redesigning the way we use LMDB in BIND, which is not something we can
      easily backport, or redesigning the way BIND carries out its
      reconfiguration process, which would be an even more severe change.
      
      To work around the problem, set MDB_NOLOCK when calling mdb_env_open()
      to stop LMDB from controlling concurrent access to the database and do
      the necessary locking in named instead.  Reuse the view->new_zone_lock
      mutex for this purpose to prevent the need for modifying struct dns_view
      (which would necessitate library API version bumps).  Drop use of
      MDB_NOTLS as it is made redundant by MDB_NOLOCK: MDB_NOTLS only affects
      where LMDB reader locktable slots are stored while MDB_NOLOCK prevents
      the reader locktable from being used altogether.
      
      [1] https://git.openldap.org/openldap/openldap/-/commit/2fd44e325195ae81664eb5dc36e7d265927c5ebc
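
      The following is a minimal sketch of the workaround described above,
      not the actual named code: the nzd_lock mutex stands in for
      view->new_zone_lock, and open_nzd()/read_nzd() are hypothetical
      helpers.  The environment is opened with MDB_NOLOCK and every
      transaction is serialized by our own mutex instead of LMDB's reader
      locktable.

        #include <pthread.h>
        #include <lmdb.h>

        static pthread_mutex_t nzd_lock = PTHREAD_MUTEX_INITIALIZER;

        static int
        open_nzd(MDB_env **envp, const char *path) {
                int rc = mdb_env_create(envp);
                if (rc != 0) {
                        return (rc);
                }
                /* MDB_NOLOCK: we serialize access ourselves; MDB_NOTLS is no
                 * longer needed because the reader locktable is not used. */
                return (mdb_env_open(*envp, path, MDB_NOSUBDIR | MDB_NOLOCK, 0600));
        }

        static int
        read_nzd(MDB_env *env) {
                MDB_txn *txn = NULL;
                int rc;

                pthread_mutex_lock(&nzd_lock);  /* our locking, not LMDB's */
                rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
                if (rc == 0) {
                        /* ... read the new-zone configuration ... */
                        mdb_txn_abort(txn);     /* read-only transaction: just abort */
                }
                pthread_mutex_unlock(&nzd_lock);
                return (rc);
        }

        int
        main(void) {
                MDB_env *env = NULL;
                int rc = open_nzd(&env, "example.nzd"); /* hypothetical NZD file name */

                if (rc == 0) {
                        (void)read_nzd(env);
                        mdb_env_close(env);
                }
                return (rc);
        }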
  12. 06 Jul, 2020 2 commits
  13. 03 Jul, 2020 1 commit
    • Increase "rndc dnssec -status" output size · 9347e7db
      Matthijs Mekking authored
      BUFSIZ (512 bytes on Windows) may not be enough to fit the status of a
      DNSSEC policy and three DNSSEC keys.
      
      Set the size of the relevant buffer to a hardcoded value of 4096 bytes,
      which should be enough for most scenarios.
  14. 02 Jul, 2020 5 commits
  15. 01 Jul, 2020 4 commits