On Caches¶
Evaluating Caches¶
UA-Parser tries to provide a somewhat decent cache by default, but cache algorithms react differently to traffic patterns, and setups can have different amounts of space to dedicate to cache overhead.
Thus, ua-parser also provides some tooling to try and evaluate fitness, in the form of two built-in command-line scripts. Both scripts take a mandatory sample file in order to provide evaluation on representative traffic. Thus this sample file should be a representative sample of your real world traffic (no sorting, no deduplicating, …).
python -mua_parser hitrates¶
As its name indicates, the hitrates script allows measuring the
hit rates of ua-parser’s available caches by simulating cache use at
various sizes on the sample file. It also provides the memory overhead
of each cache implementation at those sizes, both in total and per
entry.
Warning
The cache overhead does not include the size of the cached entries themselves, which is generally 500~700 bytes for a complete entry (all three domains matched).
hitrates also includes Bélády’s MIN (aka OPT) algorithm for
reference. MIN is not a practical cache as it requires knowledge of
the future, but it provides the theoretical upper bound at a given
cache size (very theoretical, practical cache algorithms tend to be
way behind until cache sizes close in on the total number of unique
values in the dataset).
hitrates has the advantage of being very cheap as it only
exercises the caches themselves and barely looks at the data.
python -mua_parser bench¶
bench is much more expensive in both CPU and wallclock as it
actually runs the resolvers on the sample file, combined with various
caches of various sizes. For usability, it can report its data (the
average parse time per input entry) in both human-readable text with
one result per line and CSV with resolver configurations as the
columns and cache sizes as the rows.
hitrates is generally sufficient as generally speaking for the
same base resolver performances tend to more or less follow hit rates:
a cache hit is close to free compared to a cache miss. Although this
is truer for the basic resolver (for which misses tend to be very
expensive). bench is mostly useful to validate or tie-break
decisions based on hitrates, and allows creating nice graphs in
your spreadsheet software of choice.
Cache Algorithms¶
[S3-FIFO]¶
[S3-FIFO] is a novel fifo-based cache algorithm. It might seem odd to pick that as default rather than a “tried and true” LRU, but the principles are interesting and on our sample it shows very good performances for an acceptable implementation complexity.
Advantages¶
excellent hit rates
thread-safe on hits
excellent handling of one hit wonders (entries unique to the data set) and rare fews (multiple entries with a lot of separation)
flexible implementation
Drawbacks¶
O(n) eviction
somewhat demanding on memory, especially at small sizes
Space¶
An S3Fifo of size n is composed of:
one dict of size 1.9*n
three collections.deque of sizes 0.1 * n, 0.9 * n, and 0.9 * n
[SIEVE]¶
[SIEVE] is an other novel fifo-based algorithm, a cousin of S3Fifo it works off of somewhat different principle. It has good performances and a more straightforward implementation than S3, but it is strongly wedded to linked lists as it needs to remove entries from the middle of the fifo (whereas S3 uses strict fifo).
Advantages¶
good hit rates
thread-safe on hits
memory efficient
Drawbacks¶
O(n) eviction
Space¶
A SIEVE of size n is composed of:
a dict of size n
a linked list with n nodes of 4 pointers each
LRU¶
The grandpappy of non-trivial cache eviction, it’s mostly included as a safety in case users encounter workloads for which the fifo-based algorithms completely fall over (do report them, I’m sure the authors would be interested).
Advantages¶
basically built in the Python stdlib (via
OrderedDict)O(1) eviction
nobody ever got evicted for using an LRU
Drawbacks¶
must be synchronised on hit: entries are moved
poor hit rates
Space¶
An LRU of size n is composed of:
an ordered dict of size n
Memory analysis of Python objects¶
Measures as of Python 3.11, on a 64b platform. Information is the overhead of the object itself, not the data it stores e.g. if an object stores strings the sizes of the strings are not included in the calculations.
class¶
With __slots__, a Python object is 32 bytes + 8 bytes for each
member. An additional 8 bytes is necessary for weakref support
(slotted objects in UA-Parser don’t have weakref support).
Without __slots__, a Python object is 48 bytes plus an instance
dict.
Note
The instance dict is normally key-sharing, which is not included in the analysis, see PEP 412.
dict¶
Python’s dict is a relatively standard hash map, but it has a bit
of a twist in that it stores the entries in a dense array, which
only needs to be sized up to the dict’s load factor, while the shallow
array used for hash lookups (which needs to be sized to match
capacity) only holds indexes into the dense array. This also allows
the size of the indices to only be as large as needed to index into
the dense array, so for small dicts the sparse array is an array of
bytes (8 bits).
However because the dense array of entries is used as a stack (only the last entry can be replaced) in case a dict “churns” (entries get added and removed without the size changing) if the size of the dict is close to the next break-point it would need to be compacted frequently leading to poor performances.
As a result, although a dictionary being created or added to will be
just the next size up a dict with a lot of churn will be two sizes up
to limit the amout of compaction necessary e.g. 10000 entries would
fit in 2**14 (capacity 16384, for a usable size of 10922) but the
dict may be sized up to 2**15 (capacity 32768, for a usable size
of 21845).
Python dicts also have a concept of key kinds which influences parts
of the layout. As of 3.12 there are 3 kinds called
DICT_KEYS_GENERAL, DICT_KEYS_UNICODE, and DICT_KEYS_SPLIT.
This is relevant here because UA-Parser caches are keyed on strings,
which means they should always use the DICT_KEYS_UNICODE kind.
In the DICT_KEYS_GENERAL layout, each entry of the dense array has
to store three pointer-sized items: a pointer to the key, a pointer to
the value, and a cached version of the key hash. However since strings
memoize their hash internally, the DICT_KEYS_UNICODE layout
retrieves the hash value from the key itself when needed and can save
8 bytes per entry.
Thus the space necessary for a dict is:
the standard 4 pointers object header (
prev,next, and type pointers, and reference count)ma_size, 8 bytes, the number of entriesma_version_tag, 8 bytes, deprecatedma_keys, a pointer to the dict entriesma_values, a pointer to the split values inDICT_KEYS_SPLITlayout (not relevant for UA-Parser)
The dict entries then are:
dk_refcnt, an 8 bytes refcount (used for theDICT_KEYS_SPLITlayout)dk_log2_size, 1 byte, the total capacity of the hash map, as a power of twodk_log2_index_bytes, 1 byte, the size of the sparse indexes array in bytes, as a power of two, it essentially memoizes the log2 size of the sparse indexes array by incrementingdk_log2_sizeby 3 if above 32, 2 if above 16, and 1 if above 8Note
This means the dict bumps up the indexes array a bit early to avoids having to resize again within a
dk_log2_sizee.g. at 171 elements the dict will move to size 9 (total capacity 512, usable capacity 341) and the index size will immediately get bumped to 10 even though it can still fit ~80 additional items with a u8 index.dk_kind, 1 byte, the key kind explained abovedk_version, 4 bytes, used for some internal optimisations of cpythondk_usable, 8 bytes, the number of usable entries in the dense arraydk_nentries, 8 bytes, the number of used entries in the dense array, this can’t be computed fromdk_usableanddk_log2_sizebecause ??? from the mention ofDKIX_DUMMYI assume it’s becausedk_usableis used to know when the dict needs to be compacted or resized, and because python uses open addressing and leaves tombstone (DKIX_DUMMY) in the sparse array they matter for collision performances, and thus load calculationsdk_indices, the sparse array of size1<<dk_log2_size_index_bytesdk_entries, the dense array of sizeUSABLE_FRACTION(1<<dk_log2_size) * 16Note
USABLE_FRACTIONis 2/3
Thus the space formula for dicts – in the context of string-indexed caches – is:
32 + 32 + 32
+ 2**(ceil(log2(n)) + 1) * ceil(log256(n))
+ floor(2/3 * 2**ceil(log2(n)) + 1) * 16
collections.OrderedDict¶
While CPython has a pure-python OrderedDict it’s not actually
used, instead a native implementation with a native doubly linked list
and a bespoke secondary hashmap is used, leading to a much denser
collection than achievable in Python. The broad strokes are similar
though:
a regular
dictlinks keys to valuesa secondary hashmap links keys to nodes of the linked list, allowing reordering entries easily
The secondary hashmap is only composed of a dense array of nodes,
using the internal details of the dict in order to handle lookups in
the sparse array and collision resolution. Unlike dict however
it’s sized to the dict’s capacity rather than USABLE_FRACTION
thereof.
The entire layout is:
a full dict object (see above), inline
pointers to the first and last nodes of the doubly linked list
a pointer to the array of nodes
od_fast_nodes_size, 8 bytes, which is used to see if the underlying dict has been resized*od_resize_sentinelwhich is also used to see if the underlying dict has been redized (a pointer to the dict entries object)od_state, 8 bytes, to check for concurrent mutations during iterationod_inst_dict, 8 bytes, used to provide a fake__dict__and better imitateod_inst_dict, 8 bytes, weakref support
And each node in the linked list is 4 pointers: previous, next, key, and hash.
Note
Hash is (likely) to speed up lookup since going from odict node to
dict entry requires a full lookup, and such a lookup is what
happens during iteration, except it uses a regular
PyDict_GetItem instead of a low-level lookup, why?
So the ordereddict space requirement formula is:
dict(n) + 64 + 8 * 2**(ceil(log2(n)) + 1) + 32 * n
Because it matches dict’s, like dict’s the capacity is double what’s strictly required due to amortising churn.
collections.deque¶
Deque is an unrolled doubly linked list of order 64, that is every node of the linked list stores 64 items, plus two pointers for the previous and next links. Note that the deque always allocates a block upfront (nb: why not allocate on use?).
The deque metadata (excluding the blocks) is 232 bytes:
the 32 bytes standard object of an object header (next pointer, previous pointer, refcount, and type pointer)
the
ob_sizeof a VAR_OBJ, apparently used to store the number of items as the deque does not track its blocks sizepointers to the left and right blocks
offsets into the left and right blocks (as they may only be partially filled)
state, a mutation counter used to track mutations during iterationmaxlen, in case the deque is length-boundednumfreeblocks, the actual size of the freelistfreelist, 16 pointers to already allocated available blocksweakreflist, the weakref support pointer
So the deque space requirement formula is:
232 + max(1, ceil(n / 64)) * 66 * 8
lru_cache()¶
While not strictly relevant to ua-parser, it should be noted that
lru_cache() is not built on
OrderedDict, it has its own native
implementation which uses a single dict and a different bespoke doubly
linked list with larger nodes (9 pointers).
Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, Rashmi Vinayak. 2023. FIFO queues are all you need for cache eviction. SOSP ‘23. https://dl.acm.org/doi/10.1145/3600006.3613147
Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson, K. V. Rashmi. 2023. SIEVE is Simpler than LRU: an Efficient Turn-Key Eviction Algorithm for Web Caches. NSDI24. https://junchengyang.com/publication/nsdi24-SIEVE.pdf