Histograms for timing and memory statistics

BIND needs to be able to record statistics covering a wide range of possible values (several decimal orders of magnitude):

latency times, from submilliseond queries on the same LAN to multi-minute zone transfers
memory usage, for zones from a handful of records to tens of millions
message sizes, from 64 bytes to 64 kilobytes

In this issue I'm outlining a possible design for a general-purpose histogram data structure, that could be added to libisc for collecting statistics efficiently in several places in BIND.

existing histograms in BIND

The statistics channel has histograms for request and response sizes, which use buckets that are defined manually with some tediously repetitive code. These could be replaced by the proposed self-tuning histograms, although the bucketing will be somewhat different.

examples of general-purpose histograms

It's possible to record histograms of values covering a wide range, with bucket sizes chosen automatically to provide a particular level of accuracy (e.g. 1% or 10%), and without using more than a few KiB for each histogram. Existing examples are:

circllhist, Circonus log-linear histogram, aka OpenHistogram

Uses decimal floating point with two digits of mantissa and a 1 byte exponent, to record values with 1% accuracy.
DDSketch from DataDog

Uses the floating-point logarithm to a base derived from the required accuracy, rounded to an integer to make a bucket index. Has an alternative "fast" mode more like HdrHistogram.
HdrHistogram, high dynamic range histogram

Uses low-precision floating point numbers as bucket indexes.
hg64, 64-bit histograms

My prototype implementation intended for use in BIND.

The DataDog blog article has a nice overview, and compares a quantile sketch implementation (that is designed for a particular rank error) with a histogram (designed for a particular value error). From my reading on this topic I concluded that histograms are both easier to understand, simpler to implement, and have similar or better CPU and memory usage compared to rank-error-based quantile sketches.

key idea

The histogram counts how many measurements (time or space) have particular uint64_t value or range of values, according to the histogram's configured precision (e.g. 1% or 10%). Each range of values corresponds to a bucket or counter.

My prototype hg64 uses a log-linear bucket spacing, which has two parts:

a logarithm of the value to cover a large dynamic range with a few bits; specifically, the log base 2 of a uint64_t varies from 0 to 63, which fits in 6 bits.
linear, evenly spaced buckets between logarithms, to provide more precision than you can get from just a power of 2 or 10. 4 buckets per log are enough for 10% precision; 32 buckets per log gives 1% precision.

This log-linear bucketing is the same thing as decimal scientific notation, like 1e9 (1 significant digit, 10% precision) or 2.2e8 (2 significant digits, 1% precision). It's also the same as a (low-precision) binary floating point number: the FP exponent is the logarithmic part, and the FP mantissa is the linear part.

measurements and values

When counting time measurements, it makes sense for the uint64_t value to be the time measured in nanoseconds. This allows the histogram to count any time measurements we are likely to need, from submicrosecond up to a few centuries. There is no point using lower-precision time measurements because the histogram bucketing algorithm will reduce the precision as required.

Unlike nanosecond measurements, whose values are towards the logarithmic mid-range of uint64_t, memory measurements tend to cluster around zero. The hg64 bucketing algorithm provides one counter for each distinct small integer; for instance, with 1% precision hg64 has a counter for each value from 0 to 63, above which multiple values share each counter. To make the best use of these small-value counters, it makes sense to divide a memory measurement to get the desired resolution. For example, if the allocator quantum is 16 bytes, divide an allocation size by 16 before using it as a histogram value.

incrementing counters quickly

It is very cheap to turn a uint64_t value into a bucket number, using CLZ to get the logarithm with some bit shuffling to move things into place. The basic principle is roughly the same as used by HdrHistogram and fast-mode DDSketch. Paul Khuong encouraged me to use his algorithm which is smaller and faster than the version I developed for my proof-of-concept.

As in BIND's existing statistics code, we use a relaxed atomic increment to update a counter. When the histogram is in cache and uncontended, the whole operation (calculating the bucket number and incrementing the counter) takes less than 2.5ns in my prototype code.

efficient storage

The hg64 bucket keys are small, e.g. 8 bits for 10% precision, or 11 bits for 1% precision. We could store the buckets as a simple array of counters, which would use 2 KiB for 10% precision, or 16 KiB for 1% precision. However a large fraction of that space will be unused, because the values we are recording do not cover anywhere near 20 orders of magnitude.

My prototype code has a 64 entry top-level array (one for each possible exponent) and allocate each sub-array on demand (with a counter for each possible mantissa). Most of the sub-arrays will remain unused. This layout supports lock-free multithreading.

operations on histograms

given a value, find its rank (or percentile)
find the value at a given rank (or percentile)
get the mean and standard deviation of the data recorded in the histogram
merge two histograms (which may differ in precision)
dump and load a histogram in text (e.g. csv, xml, json) and/or binary (for efficiency)
export a histogram to a user-selected collection of buckets (e.g. for prometheus)

I have implementations of the first four.

The rank and percentile queries work on a snapshot of the working histogram, to avoid multithreading races and to make the calculations more efficient.

exporting data

An important consumer for data recorded in histograms is Prometheus. The docs https://prometheus.io/docs/practices/histograms/ say it supports

a "histogram" type (actually a cumulative frequency digest) where quantiles are calculated on the server
a "summary" type, where quantiles are calculated on the client and the server aggregates them over a sliding window

Prometheus has its own textual format for exposing / ingesting data, https://prometheus.io/docs/instrumenting/exposition_formats/. It looks like it would be fairly easy for hg64 and BIND to support it, though it isn't clear whether the server is able to re-bucket data that is exposed with a different bucketing than configured on the server.

elsewhere on gitlab

Related issues #598 #2101 #3455 (closed)

Edited Oct 26, 2022 by Tony Finch