Histograms for timing and memory statistics
BIND needs to be able to record statistics covering a wide range of possible values (several decimal orders of magnitude):
- latency times, from submillisecond queries on the same LAN to multi-minute zone transfers
- memory usage, for zones from a handful of records to tens of millions
- message sizes, from 64 bytes to 64 kilobytes
In this issue I'm outlining a possible design for a general-purpose histogram data structure that could be added to libisc, for collecting statistics efficiently in several places in BIND.
existing histograms in BIND
The statistics channel has histograms for request and response sizes, which use buckets that are defined manually with some tediously repetitive code. These could be replaced by the proposed self-tuning histograms, although the bucketing will be somewhat different.
examples of general-purpose histograms
It's possible to record histograms of values covering a wide range, with bucket sizes chosen automatically to provide a particular level of accuracy (e.g. 1% or 10%), and without using more than a few KiB for each histogram. Existing examples are:
- circllhist, the Circonus log-linear histogram, aka OpenHistogram. Uses decimal floating point with two digits of mantissa and a 1-byte exponent, to record values with 1% accuracy.
- DDSketch, the DataDog quantile sketch. Uses the floating-point logarithm to a base derived from the required accuracy, rounded to an integer to make a bucket index. Has an alternative "fast" mode more like HdrHistogram.
- HdrHistogram, the high dynamic range histogram. Uses low-precision floating point numbers as bucket indexes.
- `hg64`, my prototype implementation intended for use in BIND.
The DataDog blog article has a nice overview, and compares a quantile sketch implementation (designed for a particular rank error) with a histogram (designed for a particular value error). From my reading on this topic I concluded that histograms are easier to understand and simpler to implement, and have similar or better CPU and memory usage compared to rank-error-based quantile sketches.
key idea
The histogram counts how many measurements (time or space) have a particular `uint64_t` value or range of values, according to the histogram's configured precision (e.g. 1% or 10%). Each range of values corresponds to a bucket or counter.
My prototype `hg64` uses a log-linear bucket spacing, which has two parts:
- a logarithm of the value, to cover a large dynamic range with a few bits; specifically, the log base 2 of a `uint64_t` varies from 0 to 63, which fits in 6 bits
- linear, evenly spaced buckets between logarithms, to provide more precision than you can get from just a power of 2 or 10; 4 buckets per log are enough for 10% precision, and 32 buckets per log give 1% precision
This log-linear bucketing is the same thing as decimal scientific notation, like 1e9 (1 significant digit, 10% precision) or 2.2e8 (2 significant digits, 1% precision). It's also the same as a (low-precision) binary floating point number: the FP exponent is the logarithmic part, and the FP mantissa is the linear part.
measurements and values
When counting time measurements, it makes sense for the `uint64_t` value to be the time measured in nanoseconds. This allows the histogram to count any time measurements we are likely to need, from submicrosecond up to a few centuries. There is no point using lower-precision time measurements because the histogram bucketing algorithm will reduce the precision as required.
Unlike nanosecond measurements, whose values are towards the logarithmic mid-range of `uint64_t`, memory measurements tend to cluster around zero. The `hg64` bucketing algorithm provides one counter for each distinct small integer; for instance, with 1% precision `hg64` has a counter for each value from 0 to 63, above which multiple values share each counter. To make the best use of these small-value counters, it makes sense to divide a memory measurement to get the desired resolution. For example, if the allocator quantum is 16 bytes, divide an allocation size by 16 before using it as a histogram value.
incrementing counters quickly
It is very cheap to turn a `uint64_t` value into a bucket number, using CLZ (count leading zeros) to get the logarithm, with some bit shuffling to move things into place. The basic principle is roughly the same as used by HdrHistogram and fast-mode DDSketch. Paul Khuong encouraged me to use his algorithm, which is smaller and faster than the version I developed for my proof-of-concept.
As in BIND's existing statistics code, we use a relaxed atomic increment to update a counter. When the histogram is in cache and uncontended, the whole operation (calculating the bucket number and incrementing the counter) takes less than 2.5ns in my prototype code.
efficient storage
The `hg64` bucket keys are small, e.g. 8 bits for 10% precision, or 11 bits for 1% precision. We could store the buckets as a simple array of counters, which would use 2 KiB for 10% precision, or 16 KiB for 1% precision. However, a large fraction of that space would be unused, because the values we are recording do not cover anywhere near 20 orders of magnitude.
My prototype code has a 64-entry top-level array (one entry for each possible exponent) and allocates each sub-array on demand (with a counter for each possible mantissa). Most of the sub-arrays will remain unused. This layout supports lock-free multithreading.
operations on histograms
- given a value, find its rank (or percentile)
- find the value at a given rank (or percentile)
- get the mean and standard deviation of the data recorded in the histogram
- merge two histograms (which may differ in precision)
- dump and load a histogram in text (e.g. CSV, XML, JSON) and/or binary (for efficiency)
- export a histogram to a user-selected collection of buckets (e.g. for Prometheus)
I have implementations of the first four.
The rank and percentile queries work on a snapshot of the working histogram, to avoid multithreading races and to make the calculations more efficient.
exporting data
An important consumer for data recorded in histograms is Prometheus. The docs https://prometheus.io/docs/practices/histograms/ say it supports:
- a "histogram" type (actually a cumulative frequency digest), where quantiles are calculated on the server
- a "summary" type, where quantiles are calculated on the client and the server aggregates them over a sliding window
Prometheus has its own textual format for exposing / ingesting data, https://prometheus.io/docs/instrumenting/exposition_formats/. It looks like it would be fairly easy for `hg64` and BIND to support it, though it isn't clear whether the server is able to re-bucket data that is exposed with a different bucketing than configured on the server.
elsewhere on gitlab
Related issues: #598 (closed), #2101, #3455 (closed)