======
Guides
======

.. _guide-custom-rulesets:

Custom Rulesets
===============

ua-parser defaults to the latest stable release of `ua-core`_ via
`precompiled regexes.yaml`__. That is a suitable default, but there are
plenty of reasons to use custom rulesets:

- trim down the default ruleset to only the most current or relevant
  rules for efficiency, e.g. you might not care about CalDav or podcast
  applications
- add new rules relevant to your own traffic but which aren't (possibly
  can't be) in the main project
- experiment with the creation of new rules
- use a completely bespoke ruleset to track UA-identified API clients
- use "experimental" rules which haven't been released yet (although
  `ua-parser-builtins`_ provides regular prerelease versions which may
  be suitable for this)

ua-parser provides easy ways to load custom rulesets:

- :mod:`ua_parser.loaders` converts whatever external storage format
  the rules are in to the internal representation
- :meth:`ua_parser.Parser.from_matchers` can directly create a parser
  from the loaded data, using the default resolver stack

.. code-block:: python

    from ua_parser import Parser
    from ua_parser.loaders import load_yaml  # requires PyYAML

    parser = Parser.from_matchers(load_yaml("regexes.yaml"))
    parser.parse(some_ua)

.. _ua-parser-builtins: https://pypi.org/project/ua-parser-builtins
__ ua-parser-builtins_
.. _ua-core: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml

.. _guide-custom-global-parser:

Custom Global Parser
====================

The global utility functions :func:`~ua_parser.parse`,
:func:`~ua_parser.parse_user_agent`, :func:`~ua_parser.parse_os`, and
:func:`~ua_parser.parse_device` just call the global
:data:`ua_parser.parser` internally.

This means it's possible to customise their behaviour by just
*setting* the global parser, although obviously that affects all users
in the process, which is both the advantage and the risk:

.. code-block:: pycon

    >>> import ua_parser
    >>> import ua_parser.loaders
    >>> ua_parser.parse("foo")
    Result(user_agent=None, os=None, device=None, string='foo')
    >>> ua_parser.parser = ua_parser.Parser.from_matchers(
    ...     ua_parser.loaders.load_data((
    ...         [{"regex": "(foo)"}],
    ...         [],
    ...         [],
    ...     ))
    ... )
    >>> ua_parser.parse("foo") # doctest: +NORMALIZE_WHITESPACE
    Result(user_agent=UserAgent(family='foo',
                                major=None,
                                minor=None,
                                patch=None,
                                patch_minor=None),
           os=None,
           device=None,
           string='foo')

Cache And Other Advanced Parser Customisation
=============================================

While loading custom rulesets has built-in support, other forms of
parser customisation don't, and require manually instantiating and
composing :class:`~ua_parser.Resolver` objects.

The most basic such customisation is simply configuring caching away
from the default setup.

As an example, all resolvers are cached with an
:class:`~ua_parser.caching.S3Fifo` of size 2000, but a user's workload
might lead them to favor a larger LRU.

The process is uncomplicated as the APIs are designed to compose
together.

The first step is to instantiate a base resolver with the relevant
:class:`Matchers` data::

    import ua_parser.caching
    import ua_parser.loaders
    import ua_parser.regex

    base = ua_parser.regex.Resolver(
        ua_parser.loaders.load_lazy_builtins())

The next step is to instantiate the cache [#cache]_ suitably
configured::

    cache = ua_parser.caching.Lru(10000)

And compose the base resolver and cache together::

    resolver = ua_parser.caching.CachingResolver(
        base,
        cache,
    )

Finally, for convenience a :class:`ua_parser.Parser` can be wrapped
around the resolver, and that can either be used as-is, or set as the
global parser for all the library users to use this new configuration
from here on::

    ua_parser.parser = ua_parser.Parser(resolver)

.. note::

   To be honest, aside from configuring the presence, algorithm, and
   size of caches there currently isn't much to compose that's built
   in. The only remaining member of the cast is
   :class:`~ua_parser.caching.Local`, which is also caching-related,
   and serves to use thread-local caches rather than a shared cache.
   Although this does not remove the cache locks, it avoids contention
   on the locks, which can be useful when using a free-threaded
   interpreter, especially when using the LRU cache as it has to
   synchronise on hit.

.. _builtin-resolvers:

Builtin Resolvers
=================

.. list-table::
   :header-rows: 1
   :stub-columns: 1

   * -
     - speed
     - portability
     - memory use
     - safety
   * - ``regex``
     - great
     - good
     - fine
     - great
   * - ``re2``
     - good
     - bad
     - good
     - good
   * - ``basic``
     - terrible
     - great
     - great
     - great

``regex``
---------

The ``regex`` resolver is a bespoke effort as part of the `uap-rust `_
sibling project, built on `rust-regex `_ and `a bespoke
regex-prefiltering implementation `_. It:

- Is the fastest available resolver, usually edging out ``re2`` by a
  significant margin (when that is even available).
- Is fully controlled by the project, and thus can be built for all
  interpreters and platforms supported by pyo3 (currently: cpython,
  pypy, and graalpy, on linux, macos and windows, intel and arm). It is
  also built as a cpython abi3 wheel and should thus suffer from no
  compatibility issues with new releases of cpython at least.
- Is built entirely out of safe rust code, so its safety risks are
  entirely in ``regex`` and ``pyo3``.
- Uses somewhat more memory than the other resolvers (~85MB high water
  mark on a real-world dataset).

If available, it is the default resolver, without a cache.

``re2``
-------

The ``re2`` resolver is built atop the widely used `google-re2 `_ via
its built-in Python bindings. It:

- Is quite fast, though only about half the speed of ``regex`` on
  real-world data.
- Is only compatible with CPython, and uses pure API wheels, so needs a
  different release for each cpython version, for each OS, for each
  architecture.
- Is built entirely in C++, but by experienced Google developers.
- Is more memory intensive than the pure-python ``basic`` resolver, but
  quite slim all things considered (~55MB high water mark on a
  real-world dataset).

If available, it is the second-preferred resolver, without a cache.

At the end of the day, it is really only useful if the codebase
already uses ``re2``.

``basic``
---------

The ``basic`` resolver is a naive linear traversal of all rules, using
the standard library's ``re``. It:

- Is *extremely* slow: about 10x slower than ``re2`` on cpython, and
  pypy and graal's regex implementations do *not* like the workload and
  are 3x-4x slower than *cpython*.
- Has perfect compatibility, with the caveat above, by virtue of being
  built entirely out of standard library code.
- Is basically as safe as Python software can be by virtue of being
  just Python, with the native code being the standard library's.
- Is the slimmest resolver at about 40MB.

This is caveated by a hard requirement to use caches, which makes it
workably faster on real-world datasets (if still nowhere near
*uncached* ``re2`` or ``regex``) but increases its memory requirement
significantly, e.g. using "sieve" and a cache size of 20000 on a
real-world dataset, it is about 4x slower than ``re2`` for about the
same memory requirements.

It is the fallback and least preferred resolver, with a medium
(currently 2000 entries) cache by default.

Writing Custom Resolvers
========================

It is unclear if there would be any fun or profit to it, but an
express goal of the new API is to allow writing and composing
resolvers, so what is a resolver?

:class:`~ua_parser.Resolver` is a structural
:py:class:`typing.Protocol` for implementation convenience (nothing to
inherit, and not even a class to write). Here it is in full::

    class Resolver(Protocol):
        @abc.abstractmethod
        def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
            ...
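As a minimal sketch of what "structural" means in practice, the snippet
below shows a plain function and a class instance both being usable as
resolvers without inheriting from anything. ``Domain`` and
``PartialResult`` here are simplified stand-ins defined locally so the
snippet runs without ua-parser installed; they are assumptions for
illustration, not the library's definitions, and ``never_matches`` and
``Constant`` are hypothetical names.

```python
import enum
from dataclasses import dataclass
from typing import List, Optional, Protocol


# Stand-in for ua_parser.Domain (assumption: a flag of requestable domains).
class Domain(enum.Flag):
    USER_AGENT = enum.auto()
    OS = enum.auto()
    DEVICE = enum.auto()
    ALL = USER_AGENT | OS | DEVICE


# Stand-in for ua_parser.PartialResult, with plain strings as payloads.
@dataclass(frozen=True)
class PartialResult:
    domains: Domain
    user_agent: Optional[str]
    os: Optional[str]
    device: Optional[str]
    string: str


class Resolver(Protocol):
    def __call__(self, ua: str, domain: Domain, /) -> PartialResult: ...


def never_matches(ua: str, domain: Domain, /) -> PartialResult:
    """A plain function: reports failure for every requested domain."""
    return PartialResult(domain, None, None, None, ua)


class Constant:
    """An instance with a compatible ``__call__`` qualifies as well."""

    def __init__(self, family: str) -> None:
        self.family = family

    def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
        return PartialResult(domain, self.family, None, None, ua)


# Neither value inherits from Resolver, yet both fit the annotation.
resolvers: List[Resolver] = [never_matches, Constant("example")]
results = [resolve("some ua", Domain.ALL) for resolve in resolvers]
```

A type checker accepts both entries of ``resolvers`` purely because
their call signatures match the protocol.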
So a :class:`~ua_parser.Resolver` is just a callable which takes a
string and a :class:`~ua_parser.Domain`, and returns a
:class:`~ua_parser.PartialResult`.

For our first resolver, let's say that we have an API and a mobile
application, and as we expect the mobile application to be the main
caller we want to special-case it. We could do it in many ways, but the
way we're doing it is a bespoke :class:`~ua_parser.Resolver` which
matches the application's user agent and performs trivial parsing::

    from ua_parser import Device, Domain, PartialResult, UserAgent

    def foo_resolver(ua: str, domain: Domain, /) -> PartialResult:
        if not ua.startswith('fooapp/'):
            # not our application, match failure
            return PartialResult(domain, None, None, None, ua)

        # we've defined our UA as $appname/$version/$user-token;
        # splitting at most twice keeps any "/" inside the user token
        app, version, user = ua.split('/', 2)
        major, minor = version.split('.')
        return PartialResult(
            domain,
            UserAgent(app, major, minor),
            None,
            Device(user),
            ua,
        )

This resolver is not hugely interesting as it resolves a very limited
number of user agent strings and fails everything else, although it
does demonstrate two important requirements of the protocol:

- If a domain is requested, it must be returned, even if ``None``
  (signaling a matching failure).
- If it's efficient there is nothing wrong with returning data for
  domains which were not requested, at worst they will be ignored.

For a more interesting resolver, we can write a *fallback* resolver:
it's a higher-order resolver which tries to call multiple
sub-resolvers in sequence until the UA is resolved. This means we
could then use something like::

    Parser(FallbackResolver([
        foo_resolver,
        regex.Resolver(load_lazy_builtins()),
    ]))

to prioritise cheap resolving of our application while still resolving
third party user agents::

    from typing import List

    from ua_parser import Domain, PartialResult, Resolver

    class FallbackResolver:
        def __init__(self, resolvers: List[Resolver]) -> None:
            self.resolvers = resolvers

        def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
            if domain:
                for resolver in self.resolvers:
                    r = resolver(ua, domain)
                    # if any value is non-none the resolver found a match
                    if r.user_agent is not None \
                    or r.os is not None \
                    or r.device is not None:
                        return r

            # if no resolver found a match (or nothing was requested),
            # resolve to failure
            return PartialResult(domain, None, None, None, ua)

.. [#cache] If it has been written yet, see :doc:`advanced/caches` for
   way too much information you probably don't care about if you just
   want to parse user agent strings. The tl;dr is that a bigger cache
   increases hit rates, which decreases costs but uses more memory, and
   while really easy to write in Python an
   :class:`~ua_parser.caching.Lru` is a pretty bad cache all things
   considered.
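To see the ordering and short-circuit behaviour of the fallback
resolver from the previous section in isolation, here is a small
self-contained sketch. It uses local stand-ins for ``Domain`` and
``PartialResult`` (assumptions, simplified so the snippet runs without
ua-parser installed), and ``app_only``, ``catch_all`` and the ``calls``
log are hypothetical helpers that exist only to make the call order
visible.

```python
import enum
from dataclasses import dataclass
from typing import Callable, List, Optional


class Domain(enum.Flag):  # stand-in for ua_parser.Domain
    USER_AGENT = enum.auto()
    OS = enum.auto()
    DEVICE = enum.auto()
    ALL = USER_AGENT | OS | DEVICE


@dataclass(frozen=True)
class PartialResult:  # stand-in for ua_parser.PartialResult
    domains: Domain
    user_agent: Optional[str]
    os: Optional[str]
    device: Optional[str]
    string: str


Resolver = Callable[[str, Domain], PartialResult]


class FallbackResolver:
    def __init__(self, resolvers: List[Resolver]) -> None:
        self.resolvers = resolvers

    def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
        if domain:
            for resolver in self.resolvers:
                r = resolver(ua, domain)
                # the first resolver returning any non-None value wins
                if (r.user_agent is not None
                        or r.os is not None
                        or r.device is not None):
                    return r
        # nothing matched, or nothing was requested
        return PartialResult(domain, None, None, None, ua)


calls: List[str] = []  # records which sub-resolvers were invoked


def app_only(ua: str, domain: Domain, /) -> PartialResult:
    calls.append("app_only")
    family = "fooapp" if ua.startswith("fooapp/") else None
    return PartialResult(domain, family, None, None, ua)


def catch_all(ua: str, domain: Domain, /) -> PartialResult:
    calls.append("catch_all")
    return PartialResult(domain, "other", None, None, ua)


resolve = FallbackResolver([app_only, catch_all])
first = resolve("fooapp/1.2/token", Domain.ALL)  # handled by app_only
second = resolve("curl/8.0", Domain.ALL)         # falls through to catch_all
```

After both calls, ``calls`` holds ``app_only`` twice and ``catch_all``
once: the second resolver is only consulted when the first one fails.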