1. 29 Apr, 2021 2 commits
  2. 28 Apr, 2021 4 commits
    • Matthijs Mekking's avatar
      Serve-stale nit fixes · 104b6762
      Matthijs Mekking authored
      While working on the serve-stale backports, I noticed the following
      oddities:
      
      1. In the serve-stale system test, in one case we keep track of the
         time how long it took for dig to complete. In commit
         aaed7f9d, the code removed the
         exception to check for result == ISC_R_SUCCESS on stale found
         answers, and adjusted the test accordingly. This failed to update
         the time tracking accordingly. Move the t1/t2 time track variables
         back around the two dig commands to ensure the lookups resolved
         faster than the resolver-query-timeout.
      
      2. We can remove the setting of NS_QUERYATTR_STALEOK and
         DNS_RDATASETATTR_STALE_ADDED on the "else if (stale_timeout)"
         code path, because they are added later when we know we have
         actually found a stale answer on a stale timeout lookup.
      
      3. We should clear the NS_QUERYATTR_STALEOK flag from the client
         query attributes instead of DNS_RDATASETATTR_STALE_ADDED (that
         flag is set on the rdataset attributes).
      
      4. In 'bin/named/config.c' we should set the configuration options
         in alpabetical order.
      
      5. In the ARM, in the backports we have added "(stale)" between
         "cached" and "RRset" to make more clear a stale RRset may be
         returned in this scenario.
      104b6762
    • Michał Kępień's avatar
      Warn when log files grow too big in system tests · 241e85ef
      Michał Kępień authored
      Exerting excessive I/O load on the host running system tests should be
      avoided in order to limit the number of false positives reported by the
      system test suite.  In some cases, running named with "-d 99" (which is
      the default for system tests) results in a massive amount of logs being
      generated, most of which are useless.  Implement a log file size check
      to draw developers' attention to overly verbose named instances used in
      system tests.  The warning threshold of 200,000 lines was chosen
      arbitrarily.
      241e85ef
    • Michał Kępień's avatar
      Prevent useless logging in the "tcp" system test · 17e5c2a5
      Michał Kępień authored
      The regression test for CVE-2020-8620 causes a lot of useless messages
      to be logged.  However, globally decreasing the log level for the
      affected named instance would be a step too far as debugging information
      may be useful for troubleshooting other checks in the "tcp" system test.
      Starting a separate named instance for a single check should be avoided
      when possible and thus is also not a good solution.  As a compromise,
      run "rndc trace 1" for the affected named instance before starting the
      regression test for CVE-2020-8620.
      17e5c2a5
    • Michał Kępień's avatar
      Limit logging for verbose system tests · 4a8d4048
      Michał Kępień authored
      The system test framework starts all named instances with the "-d 99"
      command line option (unless it is overridden by a named.args file in a
      given instance's working directory).  This causes a lot of log messages
      to be written to named.run files - currently over 5 million lines for a
      single test suite run.  While debugging information preserved in the log
      files is essential for troubleshooting intermittent test failures, some
      system tests involve sending hundreds or even thousands of queries,
      which causes the relevant log files to explode in size.  When multiple
      tests (or even multiple test suites) are run in parallel, excessive
      logging contributes considerably to the I/O load on the test host,
      increasing the odds of intermittent test failures getting triggered.
      
      Decrease the debug level for the seven most verbose named instances:
      
        - use "-d 3" for ns2 in the "cacheclean" system test (it is the lowest
          logging level at which the test still passes without the need to
          apply any changes to tests.sh),
      
        - use "-d 1" for the other six named instances.
      
      This roughly halves the number of lines logged by each test suite run
      while still leaving enough information in the logs to allow at least
      basic troubleshooting in case of test failures.
      
      This approach was chosen as it results in a greater decrease in the
      number of lines logged than running all named instances with "-d 3",
      without causing any test failures.
      4a8d4048
  3. 26 Apr, 2021 2 commits
    • Diego Fronza's avatar
      Fix following up lookup failure if more resolvers are available · 4d6408b8
      Diego Fronza authored
      _query_detach function was incorrectly unliking the query object from
      the lookup->q query list, this made it impossible to follow a query
      lookup failure with the next one in the list (possibly using a separate
      resolver), as the link to the next query in the list was dissolved.
      
      Fix by unliking the node only when the query object is about to be
      destroyed, i.e. there is no more references to the object.
      4d6408b8
    • Michał Kępień's avatar
      Test "tkey-gssapi-credential" conditionally · 6feac68b
      Michał Kępień authored
      If "tkey-gssapi-credential" is set in the configuration and GSSAPI
      support is not available, named will refuse to start.  As the test
      system framework does not support starting named instances
      conditionally, ensure that "tkey-gssapi-credential" is only present in
      named.conf if GSSAPI support is available.
      6feac68b
  4. 24 Apr, 2021 1 commit
    • Mark Andrews's avatar
      Wait for named to start · 8d5870f9
      Mark Andrews authored
      If we don't wait for named to finish starting, 'rndc stop' may
      fail due to the listen limit being reached in named leading
      to a false negative test
      8d5870f9
  5. 23 Apr, 2021 1 commit
    • Michał Kępień's avatar
      Test handling of non-apex RRSIG(SOA) RRsets · 24bf4b94
      Michał Kępień authored
      Add a check to the "dnssec" system test which ensures that RRSIG(SOA)
      RRsets present anywhere else than at the zone apex are automatically
      removed after a zone containing such RRsets is loaded.
      24bf4b94
  6. 22 Apr, 2021 4 commits
    • Diego Fronza's avatar
      Add system test for the deadlock fix · d6224035
      Diego Fronza authored
      The test spawns 4 parallel workers that keep adding, modifying and
      deleting zones, the main thread repeatedly checks wheter rndc
      status responds within a reasonable period.
      
      While environment and timing issues may affect the test, in most
      test cases the deadlock that was taking place before the fix used to
      trigger in less than 7 seconds in a machine with at least 2 cores.
      d6224035
    • Diego Fronza's avatar
      Fix deadlock between rndc addzone/delzone/modzone · 9298dceb
      Diego Fronza authored
      It follows a description of the steps that were leading to the deadlock:
      
      1. `do_addzone` calls `isc_task_beginexclusive`.
      
      2. `isc_task_beginexclusive` waits for (N_WORKERS - 1) halted tasks,
         this blocks waiting for those (no. workers -1) workers to halt.
      ...
      isc_task_beginexclusive(isc_task_t *task0) {
          ...
      	while (manager->halted + 1 < manager->workers) {
      		wake_all_queues(manager);
      		WAIT(&manager->halt_cond, &manager->halt_lock);
      	}
      ```
      
      3. It is possible that in `task.c / dispatch()` a worker is running a
         task event, if that event blocks it will not allow this worker to
         halt.
      
      4. `do_addzone` acquires `LOCK(&view->new_zone_lock);`,
      
      5. `rmzone` event is called from some worker's `dispatch()`, `rmzone`
         blocks waiting for the same lock.
      
      6. `do_addzone` calls `isc_task_beginexclusive`.
      
      7. Deadlock triggered, since:
      	- `rmzone` is wating for the lock.
      	- `isc_task_beginexclusive` is waiting for (no. workers - 1) to
      	   be halted
      	- since `rmzone` event is blocked it won't allow the worker to halt.
      
      To fix this, we updated do_addzone code to call isc_task_beginexclusive
      before the lock is acquired, we postpone locking to the nearest required
      place, same for isc_task_beginexclusive.
      
      The same could happen with rndc modzone, so that was addressed as well.
      9298dceb
    • Petr Špaček's avatar
      Add tests for the "tkey-gssapi-credential" option · 1746d2e8
      Petr Špaček authored
      Four named instances in the "nsupdate" system test have GSS-TSIG support
      enabled.  All of them currently use "tkey-gssapi-keytab".  Configure two
      of them with "tkey-gssapi-credential" to test that option.
      
      As "tkey-gssapi-keytab" and "tkey-gssapi-credential" both provide the
      same functionality, no test modifications are required.  The difference
      between the two options is that the value of "tkey-gssapi-keytab" is an
      explicit path to the keytab file to acquire credentials from, while the
      value of "tkey-gssapi-credential" is the name of the principal whose
      credentials should be used; those credentials are looked up in the
      keytab file expected by the Kerberos library, i.e. /etc/krb5.keytab by
      default.  The path to the default keytab file can be overridden using by
      setting the KRB5_KTNAME environment variable.  Utilize that variable to
      use existing keytab files with the "tkey-gssapi-credential" option.
      
      The KRB5_KTNAME environment variable should not interfere with the
      "tkey-gssapi-keytab" option.  Nevertheless, rename one of the keytab
      files used with "tkey-gssapi-keytab" to something else than the contents
      of the KRB5_KTNAME environment variable in order to make sure that both
      "tkey-gssapi-keytab" and "tkey-gssapi-credential" are actually tested.
      1746d2e8
    • Mark Andrews's avatar
  7. 20 Apr, 2021 1 commit
    • Ondřej Surý's avatar
      Refactor taskmgr to run on top of netmgr · b540722b
      Ondřej Surý authored
      This commit changes the taskmgr to run the individual tasks on the
      netmgr internal workers.  While an effort has been put into keeping the
      taskmgr interface intact, couple of changes have been made:
      
       * The taskmgr has no concept of universal privileged mode - rather the
         tasks are either privileged or unprivileged (normal).  The privileged
         tasks are run as a first thing when the netmgr is unpaused.  There
         are now four different queues in in the netmgr:
      
         1. priority queue - netievent on the priority queue are run even when
            the taskmgr enter exclusive mode and netmgr is paused.  This is
            needed to properly start listening on the interfaces, free
            resources and resume.
      
         2. privileged task queue - only privileged tasks are queued here and
            this is the first queue that gets processed when network manager
            is unpaused using isc_nm_resume().  All netmgr workers need to
            clean the privileged task queue before they all proceed normal
            operation.  Both task queues are processed when the workers are
            finished.
      
         3. task queue - only (traditional) task are scheduled here and this
            queue along with privileged task queues are process when the
            netmgr workers are finishing.  This is needed to process the task
            shutdown events.
      
         4. normal queue - this is the queue with netmgr events, e.g. reading,
            sending, callbacks and pretty much everything is processed here.
      
       * The isc_taskmgr_create() now requires initialized netmgr (isc_nm_t)
         object.
      
       * The isc_nm_destroy() function now waits for indefinite time, but it
         will print out the active objects when in tracing mode
         (-DNETMGR_TRACE=1 and -DNETMGR_TRACE_VERBOSE=1), the netmgr has been
         made a little bit more asynchronous and it might take longer time to
         shutdown all the active networking connections.
      
       * Previously, the isc_nm_stoplistening() was a synchronous operation.
         This has been changed and the isc_nm_stoplistening() just schedules
         the child sockets to stop listening and exits.  This was needed to
         prevent a deadlock as the the (traditional) tasks are now executed on
         the netmgr threads.
      
       * The socket selection logic in isc__nm_udp_send() was flawed, but
         fortunatelly, it was broken, so we never hit the problem where we
         created uvreq_t on a socket from nmhandle_t, but then a different
         socket could be picked up and then we were trying to run the send
         callback on a socket that had different threadid than currently
         running.
      b540722b
  8. 19 Apr, 2021 2 commits
    • Ondřej Surý's avatar
      Fixup win32 paths for moved bin/tests/system/resolve · f14e6786
      Ondřej Surý authored
      When resolve.c was moved from lib/samples to bin/tests/system, the
      resolve.vcxproj.in would still contain old paths to the directory
      root. This commit adds one more ..\ to match the directory depth.
      
      Additionally, fixup the path in BINDInstall.vcxproj.in to be
      bin/tests/system and not bin/tests/samples.
      f14e6786
    • Ondřej Surý's avatar
      Cleanup the isc_<*>mgr_createinc() constructors · 3388ef36
      Ondřej Surý authored
      Previously, the taskmgr, timermgr and socketmgr had a constructor
      variant, that would create the mgr on top of existing appctx.  This was
      no longer true and isc_<*>mgr was just calling isc_<*>mgr_create()
      directly without any extra code.
      
      This commit just cleans up the extra function.
      3388ef36
  9. 16 Apr, 2021 7 commits
  10. 15 Apr, 2021 1 commit
    • Matthijs Mekking's avatar
      Fix view-related issues in the "keymgr2kasp" test · 0de5a576
      Matthijs Mekking authored
      Due to the lack of "match-clients" clauses in ns4/named2.conf.in, the
      same view is incorrectly chosen for all queries received by ns4 in the
      "keymgr2kasp" system test.  This causes only one version of the
      "view-rsasha256.kasp" zone to actually be checked.  Add "match-clients"
      clauses to ns4/named2.conf.in to ensure the test really checks what it
      claims to.
      
      Use identical view names ("ext", "int") in ns4/named.conf.in and
      ns4/named2.conf.in so that it is easier to quickly identify the
      differences between these two files.
      
      Update tests.sh to account for the above changes.  Also fix a copy-paste
      error in a comment to prevent confusion.
      0de5a576
  11. 14 Apr, 2021 2 commits
    • Matthijs Mekking's avatar
      Fix inline test with missing $INCLUDE · 96583e7c
      Matthijs Mekking authored
      The test case for a zone with a missing include file was wrong for two
      reasons:
      1. It was loading the wrong file (master5 instead of master6)
      2. It did actually not set the $ret variable to 1 if the test failed
         (it should default to ret=1 and clear the variable if the
         appropriate log is found).
      96583e7c
    • Matthijs Mekking's avatar
      Add inline-signing with $INCLUDE test · 6463ee0f
      Matthijs Mekking authored
      Add a test case for inline-signing for a zone with an $INCLUDE
      statement. There is already a test for a missing include file, this
      one adds a test for a zone with an include file that does exist.
      
      Test if the record in the included file is loaded.
      6463ee0f
  12. 13 Apr, 2021 4 commits
    • Matthijs Mekking's avatar
      Implement draft-vandijk-dnsop-nsec-ttl · 9af8caa7
      Matthijs Mekking authored
      The draft says that the NSEC(3) TTL must have the same TTL value
      as the minimum of the SOA MINIMUM field and the SOA TTL. This was
      always the intended behaviour.
      
      Update the zone structure to also track the SOA TTL. Whenever we
      use the MINIMUM value to determine the NSEC(3) TTL, use the minimum
      of MINIMUM and SOA TTL instead.
      
      There is no specific test for this, however two tests need adjusting
      because otherwise they failed: They were testing for NSEC3 records
      including the TTL. Update these checks to use 600 (the SOA TTL),
      rather than 3600 (the SOA MINIMUM).
      9af8caa7
    • Matthijs Mekking's avatar
      Use stale TTL as RRset TTL in dumpdb · a83c8cb0
      Matthijs Mekking authored
      It is more intuitive to have the countdown 'max-stale-ttl' as the
      RRset TTL, instead of 0 TTL. This information was already available
      in a comment "; stale (will be retained for x more seconds", but
      Support suggested to put it in the TTL field instead.
      a83c8cb0
    • Matthijs Mekking's avatar
      Check staleness in bind_rdataset · debee615
      Matthijs Mekking authored
      Before binding an RRset, check the time and see if this record is
      stale (or perhaps even ancient). Marking a header stale or ancient
      happens only when looking up an RRset in cache, but binding an RRset
      can also happen on other occasions (for example when dumping the
      database).
      
      Check the time and compare it to the header. If according to the
      time the entry is stale, but not ancient, set the STALE attribute.
      If according to the time is ancient, set the ANCIENT attribute.
      
      We could mark the header stale or ancient here, but that requires
      locking, so that's why we only compare the current time against
      the rdh_ttl.
      
      Adjust the test to check the dump-db before querying for data. In the
      dumped file the entry should be marked as stale, despite no cache
      lookup happened since the initial query.
      debee615
    • Matthijs Mekking's avatar
      Fix nonsensical stale TTL values in cache dump · 2a5e0232
      Matthijs Mekking authored
      When introducing change 5149, "rndc dumpdb" started to print a line
      above a stale RRset, indicating how long the data will be retained.
      
      At that time, I thought it should also be possible to load
      a cache from file. But if a TTL has a value of 0 (because it is stale),
      stale entries wouldn't be loaded from file. So, I added the
      'max-stale-ttl' to TTL values, and adjusted the $DATE accordingly.
      
      Since we actually don't have a "load cache from file" feature, this
      is premature and is causing confusion at operators. This commit
      changes the 'max-stale-ttl' adjustments.
      
      A check in the serve-stale system test is added for a non-stale
      RRset (longttl.example) to make sure the TTL in cache is sensible.
      
      Also, the comment above stale RRsets could have nonsensical
      values. A possible reason why this may happen is when the RRset was
      marked a stale but the 'max-stale-ttl' has passed (and is actually an
      RRset awaiting cleanup). This would lead to the "will be retained"
      value to be negative (but since it is stored in an uint32_t, you would
      get a nonsensical value (e.g. 4294362497).
      
      To mitigate against this, we now also check if the header is not
      ancient. In addition we check if the stale_ttl would be negative, and
      if so we set it to 0. Most likely this will not happen because the
      header would already have been marked ancient, but there is a possible
      race condition where the 'rdh_ttl + serve_stale_ttl' has passed,
      but the header has not been checked for staleness.
      2a5e0232
  13. 12 Apr, 2021 1 commit
  14. 08 Apr, 2021 3 commits
    • Michał Kępień's avatar
      Use the same port selection method on all systems · c3718b92
      Michał Kępień authored
      When system tests are run on Windows, they are assigned port ranges that
      are 100 ports wide and start from port number 5000.  This is a different
      port assignment method than the one used on Unix systems.  Drop the "-p"
      command line option from bin/tests/system/run.sh invocations used for
      starting system tests on Windows to unify the port assignment method
      used across all operating systems.
      c3718b92
    • Michał Kępień's avatar
      Rework get_ports.sh to make it not use a lock file · 31e5ca4b
      Michał Kępień authored
      The get_ports.sh script is used for determining the range of ports a
      given system test should use.  It first determines the start of the port
      range to return (the base port); it can either be specified explicitly
      by the caller or chosen randomly.  Subsequent ports are picked
      sequentially, starting from the base port.  To ensure no single port is
      used by multiple tests, a state file (get_ports.state) containing the
      last assigned port is maintained by the script.  Concurrent access to
      the state file is protected by a lock file (get_ports.lock); if one
      instance of the script holds the lock file while another instance tries
      to acquire it, the latter retries its attempt to acquire the lock file
      after sleeping for 1 second; this retry process can be repeated up to 10
      times before the script returns an error.
      
      There are some problems with this approach:
      
        - the sleep period in case of failure to acquire the lock file is
          fixed, which leads to a "thundering herd" type of problem, where
          (depending on how processes are scheduled by the operating system)
          multiple system tests try to acquire the lock file at the same time
          and subsequently sleep for 1 second, only for the same situation to
          likely happen the next time around,
      
        - the lock file is being locked and then unlocked for every single
          port assignment made, not just once for the entire range of ports a
          system test should use; in other words, the lock file is currently
          locked and unlocked 13 times per system test; this increases the
          odds of the "thundering herd" problem described above preventing a
          system test from getting one or more ports assigned before the
          maximum retry count is reached (assuming multiple system tests are
          run in parallel); it also enables the range of ports used by a given
          system test to be non-sequential (which is a rather cosmetic issue,
          but one that can make log interpretation harder than necessary when
          test failures are diagnosed),
      
        - both issues described above cause unnecessary delays when multiple
          system tests are started in parallel (due to high lock file
          contention among the system tests being started),
      
        - maintaining a state file requires ensuring proper locking, which
          complicates the script's source code.
      
      Rework the get_ports.sh script so that it assigns non-overlapping port
      ranges to its callers without using a state file or a lock file:
      
        - add a new command line switch, "-t", which takes the name of the
          system test to assign ports for,
      
        - ensure every instance of get_ports.sh knows how many ports all
          system tests which form the test suite are going to need in total
          (based on the number of subdirectories found in bin/tests/system/),
      
        - in order to ensure all instances of get_ports.sh work on the same
          global port range (so that no port range collisions happen), a
          stable (throughout the expected run time of a single system test
          suite) base port selection method is used instead of the random one;
          specifically, the base port, unless specified explicitly using the
          "-p" command line switch, is derived from the number of hours which
          passed since the Unix Epoch time,
      
        - use the name of the system test to assign ports for (passed via the
          new "-t" command line switch) as a unique index into the global
          system test range, to ensure all system tests use disjoint port
          ranges.
      31e5ca4b
    • Michal Nowak's avatar
      Move fromhex.pl script to bin/tests/system/ · cd0a34df
      Michal Nowak authored
      The fromhex.pl script needs to be copied from the source directory to
      the build directory before any test is run, otherwise the out-of-tree
      fails to find it. Given that the script is used only in system test,
      move it to bin/tests/system/.
      cd0a34df
  15. 07 Apr, 2021 5 commits
    • Mark Andrews's avatar
      Check that upgrade of managed-keys.bind.jnl succeeded · bb6f0fae
      Mark Andrews authored
      Update the system to include a recoverable managed.keys journal created
      with <size,serial0,serial1,0> transactions and test that it has been
      updated as part of the start up process.
      bb6f0fae
    • Evan Hunt's avatar
      Ensure dig lookup is detached on UDP connect failure · d2ea8f42
      Evan Hunt authored
      dig could hang when UDP connect failed due to a dangling lookup object.
      d2ea8f42
    • Ondřej Surý's avatar
      isc_nm_*connect() always return via callback · 86f4872d
      Ondřej Surý authored
      The isc_nm_*connect() functions were refactored to always return the
      connection status via the connect callback instead of sometimes returning
      the hard failure directly (for example, when the socket could not be
      created, or when the network manager was shutting down).
      
      This commit changes the connect functions in all the network manager
      modules, and also makes the necessary refactoring changes in places
      where the connect functions are called.
      86f4872d
    • Evan Hunt's avatar
      move UDP connect retries from dig into isc_nm_udpconnect() · a70cd026
      Evan Hunt authored
      dig previously ran isc_nm_udpconnect() three times before giving
      up, to work around a freebsd bug that caused connect() to return
      a spurious transient EADDRINUSE. this commit moves the retry code
      into the network manager itself, so that isc_nm_udpconnect() no
      longer needs to return a result code.
      a70cd026
    • Matthijs Mekking's avatar
      Change default stale-answer-client-timeout to off · e443279b
      Matthijs Mekking authored
      Using "stale-answer-client-timeout" turns out to have unforeseen
      negative consequences, and thus it is better to disable the feature
      by default for the time being.
      e443279b