1. 29 Apr, 2021 1 commit
  2. 28 Apr, 2021 4 commits
    • Matthijs Mekking's avatar
      Serve-stale nit fixes · 104b6762
      Matthijs Mekking authored
      While working on the serve-stale backports, I noticed the following
      oddities:
      
      1. In the serve-stale system test, in one case we keep track of the
         time how long it took for dig to complete. In commit
         aaed7f9d, the code removed the
         exception to check for result == ISC_R_SUCCESS on stale found
         answers, and adjusted the test accordingly. This failed to update
         the time tracking accordingly. Move the t1/t2 time track variables
         back around the two dig commands to ensure the lookups resolved
         faster than the resolver-query-timeout.
      
      2. We can remove the setting of NS_QUERYATTR_STALEOK and
         DNS_RDATASETATTR_STALE_ADDED on the "else if (stale_timeout)"
         code path, because they are added later when we know we have
         actually found a stale answer on a stale timeout lookup.
      
      3. We should clear the NS_QUERYATTR_STALEOK flag from the client
         query attributes instead of DNS_RDATASETATTR_STALE_ADDED (that
         flag is set on the rdataset attributes).
      
      4. In 'bin/named/config.c' we should set the configuration options
         in alpabetical order.
      
      5. In the ARM, in the backports we have added "(stale)" between
         "cached" and "RRset" to make more clear a stale RRset may be
         returned in this scenario.
      104b6762
    • Michał Kępień's avatar
      Warn when log files grow too big in system tests · 241e85ef
      Michał Kępień authored
      Exerting excessive I/O load on the host running system tests should be
      avoided in order to limit the number of false positives reported by the
      system test suite.  In some cases, running named with "-d 99" (which is
      the default for system tests) results in a massive amount of logs being
      generated, most of which are useless.  Implement a log file size check
      to draw developers' attention to overly verbose named instances used in
      system tests.  The warning threshold of 200,000 lines was chosen
      arbitrarily.
      241e85ef
    • Michał Kępień's avatar
      Prevent useless logging in the "tcp" system test · 17e5c2a5
      Michał Kępień authored
      The regression test for CVE-2020-8620 causes a lot of useless messages
      to be logged.  However, globally decreasing the log level for the
      affected named instance would be a step too far as debugging information
      may be useful for troubleshooting other checks in the "tcp" system test.
      Starting a separate named instance for a single check should be avoided
      when possible and thus is also not a good solution.  As a compromise,
      run "rndc trace 1" for the affected named instance before starting the
      regression test for CVE-2020-8620.
      17e5c2a5
    • Michał Kępień's avatar
      Limit logging for verbose system tests · 4a8d4048
      Michał Kępień authored
      The system test framework starts all named instances with the "-d 99"
      command line option (unless it is overridden by a named.args file in a
      given instance's working directory).  This causes a lot of log messages
      to be written to named.run files - currently over 5 million lines for a
      single test suite run.  While debugging information preserved in the log
      files is essential for troubleshooting intermittent test failures, some
      system tests involve sending hundreds or even thousands of queries,
      which causes the relevant log files to explode in size.  When multiple
      tests (or even multiple test suites) are run in parallel, excessive
      logging contributes considerably to the I/O load on the test host,
      increasing the odds of intermittent test failures getting triggered.
      
      Decrease the debug level for the seven most verbose named instances:
      
        - use "-d 3" for ns2 in the "cacheclean" system test (it is the lowest
          logging level at which the test still passes without the need to
          apply any changes to tests.sh),
      
        - use "-d 1" for the other six named instances.
      
      This roughly halves the number of lines logged by each test suite run
      while still leaving enough information in the logs to allow at least
      basic troubleshooting in case of test failures.
      
      This approach was chosen as it results in a greater decrease in the
      number of lines logged than running all named instances with "-d 3",
      without causing any test failures.
      4a8d4048
  3. 26 Apr, 2021 1 commit
    • Michał Kępień's avatar
      Test "tkey-gssapi-credential" conditionally · 6feac68b
      Michał Kępień authored
      If "tkey-gssapi-credential" is set in the configuration and GSSAPI
      support is not available, named will refuse to start.  As the test
      system framework does not support starting named instances
      conditionally, ensure that "tkey-gssapi-credential" is only present in
      named.conf if GSSAPI support is available.
      6feac68b
  4. 24 Apr, 2021 1 commit
    • Mark Andrews's avatar
      Wait for named to start · 8d5870f9
      Mark Andrews authored
      If we don't wait for named to finish starting, 'rndc stop' may
      fail due to the listen limit being reached in named leading
      to a false negative test
      8d5870f9
  5. 23 Apr, 2021 1 commit
    • Michał Kępień's avatar
      Test handling of non-apex RRSIG(SOA) RRsets · 24bf4b94
      Michał Kępień authored
      Add a check to the "dnssec" system test which ensures that RRSIG(SOA)
      RRsets present anywhere else than at the zone apex are automatically
      removed after a zone containing such RRsets is loaded.
      24bf4b94
  6. 22 Apr, 2021 3 commits
    • Diego Fronza's avatar
      Add system test for the deadlock fix · d6224035
      Diego Fronza authored
      The test spawns 4 parallel workers that keep adding, modifying and
      deleting zones, the main thread repeatedly checks wheter rndc
      status responds within a reasonable period.
      
      While environment and timing issues may affect the test, in most
      test cases the deadlock that was taking place before the fix used to
      trigger in less than 7 seconds in a machine with at least 2 cores.
      d6224035
    • Petr Špaček's avatar
      Add tests for the "tkey-gssapi-credential" option · 1746d2e8
      Petr Špaček authored
      Four named instances in the "nsupdate" system test have GSS-TSIG support
      enabled.  All of them currently use "tkey-gssapi-keytab".  Configure two
      of them with "tkey-gssapi-credential" to test that option.
      
      As "tkey-gssapi-keytab" and "tkey-gssapi-credential" both provide the
      same functionality, no test modifications are required.  The difference
      between the two options is that the value of "tkey-gssapi-keytab" is an
      explicit path to the keytab file to acquire credentials from, while the
      value of "tkey-gssapi-credential" is the name of the principal whose
      credentials should be used; those credentials are looked up in the
      keytab file expected by the Kerberos library, i.e. /etc/krb5.keytab by
      default.  The path to the default keytab file can be overridden using by
      setting the KRB5_KTNAME environment variable.  Utilize that variable to
      use existing keytab files with the "tkey-gssapi-credential" option.
      
      The KRB5_KTNAME environment variable should not interfere with the
      "tkey-gssapi-keytab" option.  Nevertheless, rename one of the keytab
      files used with "tkey-gssapi-keytab" to something else than the contents
      of the KRB5_KTNAME environment variable in order to make sure that both
      "tkey-gssapi-keytab" and "tkey-gssapi-credential" are actually tested.
      1746d2e8
    • Mark Andrews's avatar
  7. 20 Apr, 2021 1 commit
    • Ondřej Surý's avatar
      Refactor taskmgr to run on top of netmgr · b540722b
      Ondřej Surý authored
      This commit changes the taskmgr to run the individual tasks on the
      netmgr internal workers.  While an effort has been put into keeping the
      taskmgr interface intact, couple of changes have been made:
      
       * The taskmgr has no concept of universal privileged mode - rather the
         tasks are either privileged or unprivileged (normal).  The privileged
         tasks are run as a first thing when the netmgr is unpaused.  There
         are now four different queues in in the netmgr:
      
         1. priority queue - netievent on the priority queue are run even when
            the taskmgr enter exclusive mode and netmgr is paused.  This is
            needed to properly start listening on the interfaces, free
            resources and resume.
      
         2. privileged task queue - only privileged tasks are queued here and
            this is the first queue that gets processed when network manager
            is unpaused using isc_nm_resume().  All netmgr workers need to
            clean the privileged task queue before they all proceed normal
            operation.  Both task queues are processed when the workers are
            finished.
      
         3. task queue - only (traditional) task are scheduled here and this
            queue along with privileged task queues are process when the
            netmgr workers are finishing.  This is needed to process the task
            shutdown events.
      
         4. normal queue - this is the queue with netmgr events, e.g. reading,
            sending, callbacks and pretty much everything is processed here.
      
       * The isc_taskmgr_create() now requires initialized netmgr (isc_nm_t)
         object.
      
       * The isc_nm_destroy() function now waits for indefinite time, but it
         will print out the active objects when in tracing mode
         (-DNETMGR_TRACE=1 and -DNETMGR_TRACE_VERBOSE=1), the netmgr has been
         made a little bit more asynchronous and it might take longer time to
         shutdown all the active networking connections.
      
       * Previously, the isc_nm_stoplistening() was a synchronous operation.
         This has been changed and the isc_nm_stoplistening() just schedules
         the child sockets to stop listening and exits.  This was needed to
         prevent a deadlock as the the (traditional) tasks are now executed on
         the netmgr threads.
      
       * The socket selection logic in isc__nm_udp_send() was flawed, but
         fortunatelly, it was broken, so we never hit the problem where we
         created uvreq_t on a socket from nmhandle_t, but then a different
         socket could be picked up and then we were trying to run the send
         callback on a socket that had different threadid than currently
         running.
      b540722b
  8. 19 Apr, 2021 2 commits
    • Ondřej Surý's avatar
      Fixup win32 paths for moved bin/tests/system/resolve · f14e6786
      Ondřej Surý authored
      When resolve.c was moved from lib/samples to bin/tests/system, the
      resolve.vcxproj.in would still contain old paths to the directory
      root. This commit adds one more ..\ to match the directory depth.
      
      Additionally, fixup the path in BINDInstall.vcxproj.in to be
      bin/tests/system and not bin/tests/samples.
      f14e6786
    • Ondřej Surý's avatar
      Cleanup the isc_<*>mgr_createinc() constructors · 3388ef36
      Ondřej Surý authored
      Previously, the taskmgr, timermgr and socketmgr had a constructor
      variant, that would create the mgr on top of existing appctx.  This was
      no longer true and isc_<*>mgr was just calling isc_<*>mgr_create()
      directly without any extra code.
      
      This commit just cleans up the extra function.
      3388ef36
  9. 16 Apr, 2021 4 commits
    • Evan Hunt's avatar
      move samples/resolve.c to bin/tests/system · d0ec7d1f
      Evan Hunt authored
      "resolve" is used by the resolver system tests, and I'm not
      certain whether delv exercises the same code, so rather than
      remove it, I moved it to bin/tests/system.
      d0ec7d1f
    • Evan Hunt's avatar
      remove dns_client_update() and related code · fb2a352e
      Evan Hunt authored
      the libdns client API is no longer being maintained for
      external use, we can remove the code that isn't being used
      internally, as well as the related tests.
      fb2a352e
    • Ondřej Surý's avatar
      Refactor dns_journal_rollforward() to work over opened journal · 55b942b4
      Ondřej Surý authored
      Too much logic was cramped inside the dns_journal_rollforward() that
      made it harder to follow.  The dns_journal_rollforward() was refactored
      to work over already opened journal and some of the previous logic was
      moved to new static zone_journal_rollforward() that separates the
      journal "rollforward" logic from the "zone" logic.
      55b942b4
    • Mark Andrews's avatar
      Fixing a recoverable journal should not result in the zone being written · ec7a9af3
      Mark Andrews authored
      when dns_journal_rollforward returned ISC_R_RECOVERABLE the distintion
      between 'up to date' and 'success' was lost, as a consequence
      zone_needdump() was called writing out the zone file when it shouldn't
      have been.   This change restores that distintion.  Adjust system
      test to reflect visible changes.
      ec7a9af3
  10. 15 Apr, 2021 1 commit
    • Matthijs Mekking's avatar
      Fix view-related issues in the "keymgr2kasp" test · 0de5a576
      Matthijs Mekking authored
      Due to the lack of "match-clients" clauses in ns4/named2.conf.in, the
      same view is incorrectly chosen for all queries received by ns4 in the
      "keymgr2kasp" system test.  This causes only one version of the
      "view-rsasha256.kasp" zone to actually be checked.  Add "match-clients"
      clauses to ns4/named2.conf.in to ensure the test really checks what it
      claims to.
      
      Use identical view names ("ext", "int") in ns4/named.conf.in and
      ns4/named2.conf.in so that it is easier to quickly identify the
      differences between these two files.
      
      Update tests.sh to account for the above changes.  Also fix a copy-paste
      error in a comment to prevent confusion.
      0de5a576
  11. 14 Apr, 2021 2 commits
    • Matthijs Mekking's avatar
      Fix inline test with missing $INCLUDE · 96583e7c
      Matthijs Mekking authored
      The test case for a zone with a missing include file was wrong for two
      reasons:
      1. It was loading the wrong file (master5 instead of master6)
      2. It did actually not set the $ret variable to 1 if the test failed
         (it should default to ret=1 and clear the variable if the
         appropriate log is found).
      96583e7c
    • Matthijs Mekking's avatar
      Add inline-signing with $INCLUDE test · 6463ee0f
      Matthijs Mekking authored
      Add a test case for inline-signing for a zone with an $INCLUDE
      statement. There is already a test for a missing include file, this
      one adds a test for a zone with an include file that does exist.
      
      Test if the record in the included file is loaded.
      6463ee0f
  12. 13 Apr, 2021 4 commits
    • Matthijs Mekking's avatar
      Implement draft-vandijk-dnsop-nsec-ttl · 9af8caa7
      Matthijs Mekking authored
      The draft says that the NSEC(3) TTL must have the same TTL value
      as the minimum of the SOA MINIMUM field and the SOA TTL. This was
      always the intended behaviour.
      
      Update the zone structure to also track the SOA TTL. Whenever we
      use the MINIMUM value to determine the NSEC(3) TTL, use the minimum
      of MINIMUM and SOA TTL instead.
      
      There is no specific test for this, however two tests need adjusting
      because otherwise they failed: They were testing for NSEC3 records
      including the TTL. Update these checks to use 600 (the SOA TTL),
      rather than 3600 (the SOA MINIMUM).
      9af8caa7
    • Matthijs Mekking's avatar
      Use stale TTL as RRset TTL in dumpdb · a83c8cb0
      Matthijs Mekking authored
      It is more intuitive to have the countdown 'max-stale-ttl' as the
      RRset TTL, instead of 0 TTL. This information was already available
      in a comment "; stale (will be retained for x more seconds", but
      Support suggested to put it in the TTL field instead.
      a83c8cb0
    • Matthijs Mekking's avatar
      Check staleness in bind_rdataset · debee615
      Matthijs Mekking authored
      Before binding an RRset, check the time and see if this record is
      stale (or perhaps even ancient). Marking a header stale or ancient
      happens only when looking up an RRset in cache, but binding an RRset
      can also happen on other occasions (for example when dumping the
      database).
      
      Check the time and compare it to the header. If according to the
      time the entry is stale, but not ancient, set the STALE attribute.
      If according to the time is ancient, set the ANCIENT attribute.
      
      We could mark the header stale or ancient here, but that requires
      locking, so that's why we only compare the current time against
      the rdh_ttl.
      
      Adjust the test to check the dump-db before querying for data. In the
      dumped file the entry should be marked as stale, despite no cache
      lookup happened since the initial query.
      debee615
    • Matthijs Mekking's avatar
      Fix nonsensical stale TTL values in cache dump · 2a5e0232
      Matthijs Mekking authored
      When introducing change 5149, "rndc dumpdb" started to print a line
      above a stale RRset, indicating how long the data will be retained.
      
      At that time, I thought it should also be possible to load
      a cache from file. But if a TTL has a value of 0 (because it is stale),
      stale entries wouldn't be loaded from file. So, I added the
      'max-stale-ttl' to TTL values, and adjusted the $DATE accordingly.
      
      Since we actually don't have a "load cache from file" feature, this
      is premature and is causing confusion at operators. This commit
      changes the 'max-stale-ttl' adjustments.
      
      A check in the serve-stale system test is added for a non-stale
      RRset (longttl.example) to make sure the TTL in cache is sensible.
      
      Also, the comment above stale RRsets could have nonsensical
      values. A possible reason why this may happen is when the RRset was
      marked a stale but the 'max-stale-ttl' has passed (and is actually an
      RRset awaiting cleanup). This would lead to the "will be retained"
      value to be negative (but since it is stored in an uint32_t, you would
      get a nonsensical value (e.g. 4294362497).
      
      To mitigate against this, we now also check if the header is not
      ancient. In addition we check if the stale_ttl would be negative, and
      if so we set it to 0. Most likely this will not happen because the
      header would already have been marked ancient, but there is a possible
      race condition where the 'rdh_ttl + serve_stale_ttl' has passed,
      but the header has not been checked for staleness.
      2a5e0232
  13. 08 Apr, 2021 3 commits
    • Michał Kępień's avatar
      Use the same port selection method on all systems · c3718b92
      Michał Kępień authored
      When system tests are run on Windows, they are assigned port ranges that
      are 100 ports wide and start from port number 5000.  This is a different
      port assignment method than the one used on Unix systems.  Drop the "-p"
      command line option from bin/tests/system/run.sh invocations used for
      starting system tests on Windows to unify the port assignment method
      used across all operating systems.
      c3718b92
    • Michał Kępień's avatar
      Rework get_ports.sh to make it not use a lock file · 31e5ca4b
      Michał Kępień authored
      The get_ports.sh script is used for determining the range of ports a
      given system test should use.  It first determines the start of the port
      range to return (the base port); it can either be specified explicitly
      by the caller or chosen randomly.  Subsequent ports are picked
      sequentially, starting from the base port.  To ensure no single port is
      used by multiple tests, a state file (get_ports.state) containing the
      last assigned port is maintained by the script.  Concurrent access to
      the state file is protected by a lock file (get_ports.lock); if one
      instance of the script holds the lock file while another instance tries
      to acquire it, the latter retries its attempt to acquire the lock file
      after sleeping for 1 second; this retry process can be repeated up to 10
      times before the script returns an error.
      
      There are some problems with this approach:
      
        - the sleep period in case of failure to acquire the lock file is
          fixed, which leads to a "thundering herd" type of problem, where
          (depending on how processes are scheduled by the operating system)
          multiple system tests try to acquire the lock file at the same time
          and subsequently sleep for 1 second, only for the same situation to
          likely happen the next time around,
      
        - the lock file is being locked and then unlocked for every single
          port assignment made, not just once for the entire range of ports a
          system test should use; in other words, the lock file is currently
          locked and unlocked 13 times per system test; this increases the
          odds of the "thundering herd" problem described above preventing a
          system test from getting one or more ports assigned before the
          maximum retry count is reached (assuming multiple system tests are
          run in parallel); it also enables the range of ports used by a given
          system test to be non-sequential (which is a rather cosmetic issue,
          but one that can make log interpretation harder than necessary when
          test failures are diagnosed),
      
        - both issues described above cause unnecessary delays when multiple
          system tests are started in parallel (due to high lock file
          contention among the system tests being started),
      
        - maintaining a state file requires ensuring proper locking, which
          complicates the script's source code.
      
      Rework the get_ports.sh script so that it assigns non-overlapping port
      ranges to its callers without using a state file or a lock file:
      
        - add a new command line switch, "-t", which takes the name of the
          system test to assign ports for,
      
        - ensure every instance of get_ports.sh knows how many ports all
          system tests which form the test suite are going to need in total
          (based on the number of subdirectories found in bin/tests/system/),
      
        - in order to ensure all instances of get_ports.sh work on the same
          global port range (so that no port range collisions happen), a
          stable (throughout the expected run time of a single system test
          suite) base port selection method is used instead of the random one;
          specifically, the base port, unless specified explicitly using the
          "-p" command line switch, is derived from the number of hours which
          passed since the Unix Epoch time,
      
        - use the name of the system test to assign ports for (passed via the
          new "-t" command line switch) as a unique index into the global
          system test range, to ensure all system tests use disjoint port
          ranges.
      31e5ca4b
    • Michal Nowak's avatar
      Move fromhex.pl script to bin/tests/system/ · cd0a34df
      Michal Nowak authored
      The fromhex.pl script needs to be copied from the source directory to
      the build directory before any test is run, otherwise the out-of-tree
      fails to find it. Given that the script is used only in system test,
      move it to bin/tests/system/.
      cd0a34df
  14. 07 Apr, 2021 2 commits
  15. 02 Apr, 2021 1 commit
    • Matthijs Mekking's avatar
      Remove result exception on staleonly lookup · aaed7f9d
      Matthijs Mekking authored
      When implementing "stale-answer-client-timeout", we decided that
      we should only return positive answers prematurely to clients. A
      negative response is not useful, and in that case it is better to
      wait for the recursion to complete.
      
      To do so, we check the result and if it is not ISC_R_SUCCESS, we
      decide that it is not good enough. However, there are more return
      codes that could lead to a positive answer (e.g. CNAME chains).
      
      This commit removes the exception and now uses the same logic that
      other stale lookups use to determine if we found a useful stale
      answer (stale_found == true).
      
      This means we can simplify two test cases in the serve-stale system
      test: nodata.example is no longer treated differently than data.example.
      aaed7f9d
  16. 01 Apr, 2021 1 commit
    • Mark Andrews's avatar
      Test dynamic libraries should not be installed · 35e8f56b
      Mark Andrews authored
      Tag the libraries with check_ to prevent them being installed
      by "make install".  Additionally make check requires .so to be
      create which requires .lai files to be constructed which, in
      turn, requires -rpath <dir> as part of "linking" the .la file.
      35e8f56b
  17. 25 Mar, 2021 2 commits
    • Diego Fronza's avatar
      Added tests for tries=1 and retry=0 on TCP EOF · 4f82cc41
      Diego Fronza authored
      Added tests to ensure that dig won't retry sending a query over tcp
      (+tcp) when a TCP connection is closed prematurely (EOF is read) if
      either +tries=1 or retry=0 is specified on the command line.
      4f82cc41
    • Diego Fronza's avatar
      Adjusted dig system tests · e6808960
      Diego Fronza authored
      Now that premature EOF on tcp connections take +tries and +retry into
      account, the dig system tests handling TCP EOF with +tries=1 were
      expecting dig to do a second attempt in handling the tcp query, which
      doesn't happen anymore.
      
      To make the test work as expected +tries value was adjusted to 2, to
      make it behave as before after the new update on dig.
      e6808960
  18. 24 Mar, 2021 1 commit
    • Matthijs Mekking's avatar
      Add kasp.sh to run.sh.in script · 93ed2150
      Matthijs Mekking authored
      Add kasp.sh to the list of scripts copied from the source directory to
      the build directory before any test is run. This will fix
      the out-of-tree test failures introduced in commit
      ecb073bd on the 'main' branch.
      93ed2150
  19. 22 Mar, 2021 5 commits
    • Matthijs Mekking's avatar
      Fix some intermittent kasp failures · 82d667e1
      Matthijs Mekking authored
      When calling "rndc dnssec -checkds", it may take some milliseconds
      before the appropriate changes have been written to the state file.
      Add retry_quiet mechanisms to allow the write operation to finish.
      
      Also retry_quiet the check for the next key event. A "rndc dnssec"
      command may trigger a zone_rekey event and this will write out
      a new "next key event" log line, but it may take a bit longer than
      than expected in the tests.
      82d667e1
    • Matthijs Mekking's avatar
      Rekey immediately after rndc checkds/rollover · 82f72ae2
      Matthijs Mekking authored
      Call 'dns_zone_rekey' after a 'rndc dnssec -checkds' or 'rndc dnssec
      -rollover' command is received, because such a command may influence
      the next key event. Updating the keys immediately avoids unnecessary
      rollover delays.
      
      The kasp system test no longer needs to call 'rndc loadkeys' after
      a 'rndc dnssec -checkds' or 'rndc dnssec -rollover' command.
      82f72ae2
    • Matthijs Mekking's avatar
      Delete CDS/CDNSKEY records when zone is unsigned · 6f31f62d
      Matthijs Mekking authored
      CDS/CDNSKEY DELETE records are only useful if they are signed,
      otherwise the parent cannot verify these RRsets anyway. So once the DS
      has been removed (and signaled to BIND), we can remove the DNSKEY and
      RRSIG records, and at this point we can also remove the CDS/CDNSKEY
      records.
      6f31f62d
    • Matthijs Mekking's avatar
      Allow CDS/CDNSKEY DELETE records in unsigned zone · f211c7c2
      Matthijs Mekking authored
      While not useful, having a CDS/CDNSKEY DELETE record in an unsigned
      zone is not an error and "named-checkzone" should not complain.
      f211c7c2
    • Matthijs Mekking's avatar
      Retry quiet check keys · d5531df7
      Matthijs Mekking authored
      Change the 'check_keys' function to try three times. Some intermittent
      kasp test failures are because we are inspecting the key files
      before the actual change has happen. The 'retry_quiet' approach allows
      for a bit more time to let the write operation finish.
      d5531df7