Inconsistent lock ordering?
As part of the process of seeing whether Helgrind will produce any useful output for BIND, I've been looking at the lock order conflict reports it has produced. These are cases where Helgrind has detected that a pair of locks (or a cycle of locks) was acquired in a particular order in one part of the program and in another order elsewhere, something that could lead to a deadlock. (See the relevant part of the Helgrind documentation for information about how Helgrind detects inconsistent lock ordering.)
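For anyone not familiar with this class of report, here is a minimal sketch of the pattern being flagged. The lock names and functions are hypothetical and have nothing to do with BIND's code; the point is only that two code paths take the same pair of mutexes in opposite orders, which should be enough for a lock-order checker to complain even if the paths never run concurrently.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical locks standing in for any pair named in a report. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;

/* Lock a mutex; the call is kept outside assert() so it survives NDEBUG. */
static void take(pthread_mutex_t *m) {
    int rc = pthread_mutex_lock(m);
    assert(rc == 0);
    (void)rc;
}

/* Unlock a mutex, checking the return code the same way. */
static void drop(pthread_mutex_t *m) {
    int rc = pthread_mutex_unlock(m);
    assert(rc == 0);
    (void)rc;
}

/* Path 1: acquires A, then B. This establishes the order "A before B". */
int update_a_then_b(void) {
    take(&lock_a);
    take(&lock_b);
    int value = ++shared_counter;
    drop(&lock_b);
    drop(&lock_a);
    return value;
}

/* Path 2: acquires B, then A. This is the reverse of the order above. */
int update_b_then_a(void) {
    take(&lock_b);
    take(&lock_a);
    int value = ++shared_counter;
    drop(&lock_a);
    drop(&lock_b);
    return value;
}
```

Calling both functions one after the other from a single thread cannot deadlock, but if two threads ran them at the same time, each could end up holding one lock while waiting for the other.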
I ran BIND (set up as a recursive server) under Helgrind for 60 minutes while sending it a moderately heavy query load. The attached file gives the first three lock order violation reports from the output. (The full output can be found on the internal ISC Jenkins web site.) I'd like to get an opinion as to whether these are useful reports (and whether they reveal significant problems) before we spend too much time on this.
The configuration used for the test was:
- BIND instance 1: the configuration defined a "root" zone with a single TLD, served by BIND instance 2.
- BIND instance 2: serves a single (signed) zone of 100,000 records.
- BIND instance 3: set up as a recursive server. The queries sent to it by queryperf consisted of A queries: 95% were for records in the zone served by instance 2; the remainder were for names that did not exist.
FWIW, my take on the lock reports is:
==7409== Thread #8: lock order "0x111CA668 before 0x81EE348" violated
These locks appear to be associated with a zone (the "root" zone?) and view (the default view?) created during the call to load_configuration. In one part of load_configuration a call is made to dns_view_setviewcommit which acquires the locks in one order. Later on, a call is made to dns_zone_setview which acquires the locks in the reverse order.
This is not really a problem in the test as only one thread is executing load_configuration and (presumably) no other threads are able to access the locks and structures they protect until it returns. However, I don't know whether this would hold true in a system where a reconfiguration is executed. (No reconfigurations were executed while the test was running.)
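If the ordering did turn out to matter during a reconfiguration, one common remedy is to make every path take the two locks in a single fixed order. The following is only a sketch of that idea under stated assumptions: view_t, zone_t, lock_both, unlock_both, zone_attach_view and view_detach_zone are all hypothetical names, not BIND's structures or API, and ordering by address is just one possible convention.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a view and a zone, each with its own lock. */
typedef struct view { pthread_mutex_t lock; int nzones; } view_t;
typedef struct zone { pthread_mutex_t lock; view_t *view; } zone_t;

/* Take both locks, always lowest address first, whatever the argument order. */
static void lock_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    int rc = pthread_mutex_lock(a);
    assert(rc == 0);
    rc = pthread_mutex_lock(b);
    assert(rc == 0);
    (void)rc;
}

/* Release both locks; the release order does not affect lock ordering. */
static void unlock_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    int rc = pthread_mutex_unlock(a);
    assert(rc == 0);
    rc = pthread_mutex_unlock(b);
    assert(rc == 0);
    (void)rc;
}

/* Zone-side path: names the zone lock first, but lock_both fixes the order. */
void zone_attach_view(zone_t *z, view_t *v) {
    lock_both(&z->lock, &v->lock);
    z->view = v;
    v->nzones++;
    unlock_both(&z->lock, &v->lock);
}

/* View-side path: names the view lock first, same real order as above. */
void view_detach_zone(view_t *v, zone_t *z) {
    lock_both(&v->lock, &z->lock);
    if (z->view == v) {
        z->view = NULL;
        v->nzones--;
    }
    unlock_both(&v->lock, &z->lock);
}
```

Whether something like this is appropriate for the view and zone code would depend on what else those locks protect; the sketch only shows that both call paths can end up acquiring in the same order.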
==7409== Thread #8: lock order "0x112935E8 before 0x81EE348" violated
This seems similar to the previous report: both locks are accessed from within load_configuration. dns_view_setviewcommit acquires the locks in one order, but dns_zone_setview acquires them in the reverse order, although the detailed call stack is different.
==7409== Thread #4: lock order "0x11B3C068 before 0x10E2CFC8" violated
Looking at lock 0x10E2CFC8, this is accessed as res->buckets[bucketnum].lock in the function validated (resolver.c, line 5292), where res is of type dns_resolver_t*. When accessed in dns_resolver_createfetch (resolver.c, line 10499), the same address is reached via res->lock (again, res is of type dns_resolver_t*). This makes me wonder whether the lock has been created and destroyed, and another lock then created at the same address. I don't believe that Helgrind would detect that.
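To illustrate what I mean by address reuse, here is a minimal sketch. It uses a single hypothetical static slot so that the effect is deterministic; in a real program the more likely mechanism would be heap memory being freed and handed out again by the allocator. lock_create and lock_destroy are made-up helpers, not BIND or Helgrind functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical storage slot that successive locks are created in. */
static pthread_mutex_t slot;

/* Create a "new" lock in the slot and return its address as an integer. */
uintptr_t lock_create(void) {
    int rc = pthread_mutex_init(&slot, NULL);
    assert(rc == 0);
    (void)rc;
    return (uintptr_t)&slot;
}

/* Destroy the lock currently living in the slot. */
void lock_destroy(void) {
    int rc = pthread_mutex_destroy(&slot);
    assert(rc == 0);
    (void)rc;
}
```

Calling lock_create, lock_destroy and lock_create again yields the same address twice, even though the two mutexes are logically unrelated. A report that identifies locks purely by address would not distinguish between them.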