• Ondřej Surý's avatar
    Fix the rbt hashtable and grow it when setting max-cache-size · e24bc324
    Ondřej Surý authored
    There were several problems with rbt hashtable implementation:
    1. Our internal hashing function returns uint64_t value, but it was
       silently truncated to unsigned int in dns_name_hash() and
       dns_name_fullhash() functions.  As the SipHash 2-4 higher bits are
       more random, we need to use the upper half of the return value.
    2. The hashtable implementation in rbt.c was using modulo to pick the
       slot number for the hash table.  This has several problems because
       modulo is: a) slow, b) oblivious to patterns in the input data.  This
       could lead to very uneven distribution of the hashed data in the
       hashtable.  Combined with the single-linked lists we use, it could
       really hog-down the lookup and removal of the nodes from the rbt
       tree[a].  The Fibonacci Hashing is much better fit for the hashtable
       function here.  For longer description, read "Fibonacci Hashing: The
       Optimization that the World Forgot"[b] or just look at the Linux
       kernel.  Also this will make Diego very happy :).
    3. The hashtable would rehash every time the number of nodes in the rbt
       tree would exceed 3 * (hashtable size).  The overcommit will make the
       uneven distribution in the hashtable even worse, but the main problem
       lies in the rehashing - every time the database grows beyond the
       limit, each subsequent rehashing will be much slower.  The mitigation
       here is letting the rbt know how big the cache can grown and
       pre-allocate the hashtable to be big enough to actually never need to
       rehash.  This will consume more memory at the start, but since the
       size of the hashtable is capped to `1 << 32` (e.g. 4 mio entries), it
       will only consume maximum of 32GB of memory for hashtable in the
       worst case (and max-cache-size would need to be set to more than
       4TB).  Calling the dns_db_adjusthashsize() will also cap the maximum
       size of the hashtable to the pre-computed number of bits, so it won't
       try to consume more gigabytes of memory than available for the
       FIXME: What is the average size of the rbt node that gets hashed?  I
       chose the pagesize (4k) as initial value to precompute the size of
       the hashtable, but the value is based on feeling and not any real
    For future work, there are more places where we use result of the hash
    value modulo some small number and that would benefit from Fibonacci
    Hashing to get better distribution.
    a. A doubly linked list should be used here to speedup the removal of
       the entries from the hashtable.
    b. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/
rbt.c 92.2 KB