Benchmarks
The relative performance of hirola against other methods for indexing or
deduplicating keys depends heavily on the type of keys being indexed. Hash table
based methods such as hirola.HashTable or Python’s builtin set prefer high
entropy inputs whereas sorting based methods like numpy.unique or
numpy_indexed prefer keys with patterns (where a pattern could be something
as benign as all keys being within a given numerical range).
Performance is also affected by the number of keys being indexed, the size of
the keys, and in hirola‘s case, the amount of padding between the table’s
maximum and actual capacities (table.max over
table.length).
Also worth bearing in mind when looking at these benchmarks is that very few of
these indexing methods are really functionally equivalent. hirola
deduplicates, preserving the order of first appearance and returning indices
which can be used to reproduce the original input whereas numpy.unique sorts
the keys without returning indices by default – if your use case requires order
of first appearance or those indices then setting numpy.unique‘s
return_indices=True flag and doing some awkward numpy.argsort would
greatly reduce performance whereas if you wanted your keys sorted then this is a
boon to numpy.unique. Similarly, since converting between a Python set and a
numpy.ndarray is remarkably expensive, using one when your data is in the
other format will most likely slow your code down in spite of what these
benchmarks say is the fastest method.
The following graphs show the per-key indexing time of various indexing methods (low numbers mean better performance) when given key sets of a range of sizes. Each graph represents a different kind of data (random binary, floats, integers, strings, UTF-8 encoded strings) being indexed. Choose the graph that best matches whatever it is that you are trying to index. Each trace on a graph represents a different method for indexing keys, whose implementations are summarised below:
hirola x{multiplier} is
hirola.HashTable().add()where the multiplier is the ratio of the table’smaxsize over the number of unique keys in the dataset. The time to construct the table is included in the timings.set() is as you’d expect – it calls
set()on the list of keys. Since NumPy types generally aren’t hashable, where necessary, the input NumPy array is converted into Python native types (normally a list of bytes) beforehand. The conversion time is excluded from the benchmark times.numpy.unique(), numpy.unique(return_indices=True) and numpy_indexed.unique() are just function calls. Since they all only accept one dimensional inputs, any multi dimensional arrays are converted to single dimensional arrays of bytes. This conversion time is excluded from benchmarks.
pandas.Categorical() is
pandas.Categorical(keys).codes.
random_16
Highest entropy short strings of 16 random bytes.
random_128
Highest entropy long strings of 128 random bytes.
id_like
Low entropy pairs of small 32 bit integers.
permutative
Lowest entropy all possible pairs of integers from (0, 0) to (sqrt(n), sqrt(n)).
floating_32
High entropy 32-bit triplets of floats from -30 to 30.
floating_64
Like floating_32 but with 64-bit floats.
textual
Short (up to 20 character) non-random strings from a common passwords database.
utf8
Like textual but using bytes instead of strings.