Benchmarks
The relative performance of hirola against other methods for indexing or deduplicating keys depends heavily on the type of keys being indexed. Hash table based methods such as hirola.HashTable or Python's builtin set prefer high entropy inputs, whereas sorting based methods like numpy.unique or numpy_indexed prefer keys with patterns (where a pattern could be something as benign as all keys being within a given numerical range).

Performance is also affected by the number of keys being indexed, the size of the keys and, in hirola's case, the amount of padding between the table's maximum and actual capacities (table.max over table.length).
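To see why that padding matters, here is a minimal linear-probing sketch — a toy model, not hirola's actual implementation — showing that the fuller a hash table gets, the more slots each insertion has to inspect before finding a free one:

```python
import random

def average_probes(n_keys, capacity, seed=0):
    """Insert n_keys random keys into a linear-probing table and
    return the mean number of slots inspected per insertion."""
    rng = random.Random(seed)
    slots = [None] * capacity
    probes = 0
    for _ in range(n_keys):
        key = rng.getrandbits(64)
        i = key % capacity
        probes += 1
        while slots[i] is not None:  # collision: walk to the next free slot
            i = (i + 1) % capacity
            probes += 1
        slots[i] = key
    return probes / n_keys

roomy = average_probes(1000, 1500)  # generous padding: few probes per key
tight = average_probes(1000, 1020)  # nearly full: far more probes per key
```

The same trade-off is what the x{multiplier} in the benchmark traces below controls: more padding costs memory but keeps probe chains short.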
Also worth bearing in mind when looking at these benchmarks is that very few of these indexing methods are truly functionally equivalent. hirola deduplicates, preserving the order of first appearance, and returns indices which can be used to reproduce the original input, whereas numpy.unique sorts the keys and returns no indices by default. If your use case requires order of first appearance or those indices, then setting numpy.unique's return_index and return_inverse flags and doing some awkward numpy.argsort work would greatly reduce its performance; conversely, if you want your keys sorted, then this is a boon to numpy.unique. Similarly, since converting between a Python set and a numpy.ndarray is remarkably expensive, using one when your data is in the other format will most likely slow your code down, in spite of what these benchmarks say is the fastest method.
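The extra work described above can be made concrete. This sketch recovers first-appearance order and reproducing indices from numpy.unique's return_index/return_inverse outputs (numpy's actual flags; the variable names are just for illustration):

```python
import numpy as np

keys = np.array([30, 10, 30, 20, 10])

# By default numpy.unique sorts the keys and returns no indices:
np.unique(keys)  # array([10, 20, 30])

# Recovering hirola-style first-appearance order takes extra argsort work:
_, first, inverse = np.unique(keys, return_index=True, return_inverse=True)
unique_first = keys[np.sort(first)]   # uniques in first-appearance order
rank = np.argsort(np.argsort(first))  # map sorted order -> appearance order
indices = rank[inverse]               # index of each key into unique_first
reconstructed = unique_first[indices] # equals keys again
```

hirola.HashTable().add() returns the equivalent of `indices` directly, which is why the return_index variant of numpy.unique is benchmarked separately below.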
The following graphs show the per-key indexing time of various indexing methods (low numbers mean better performance) when given key sets of a range of sizes. Each graph represents a different kind of data (random binary, floats, integers, strings, UTF-8 encoded strings) being indexed. Choose the graph that best matches whatever it is that you are trying to index. Each trace on a graph represents a different method for indexing keys, whose implementations are summarised below:
hirola x{multiplier} is hirola.HashTable().add() where the multiplier is the ratio of the table's max size over the number of unique keys in the dataset. The time to construct the table is included in the timings.

set() is as you'd expect – it calls set() on the list of keys. Since NumPy types generally aren't hashable, where necessary, the input NumPy array is converted into Python native types (normally a list of bytes) beforehand. The conversion time is excluded from the benchmark times.

numpy.unique(), numpy.unique(return_index=True) and numpy_indexed.unique() are just function calls. Since they all only accept one dimensional inputs, any multi dimensional arrays are converted to single dimensional arrays of bytes. This conversion time is excluded from the benchmarks.

pandas.Categorical() is pandas.Categorical(keys).codes.
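A toy illustration of the conversions mentioned above (this is not the benchmark harness itself, just a sketch of the idea):

```python
import numpy as np
import pandas as pd

# Three 2-integer keys, one duplicated (stand-ins for the benchmark data).
keys = np.array([[0, 1], [2, 3], [0, 1]], dtype=np.int32)

# For set(): NumPy rows aren't hashable, so each row is converted to
# Python bytes beforehand (the benchmarks exclude this conversion time).
as_bytes = [row.tobytes() for row in keys]
unique_keys = set(as_bytes)  # 2 distinct keys

# pandas.Categorical() digests one dimensional keys directly; .codes maps
# each key to an integer index into the (sorted) categories.
codes = pd.Categorical(["b", "a", "b", "c"]).codes
```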
random_16
Highest entropy short strings of 16 random bytes.
random_128
Highest entropy long strings of 128 random bytes.
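One hypothetical way to build such keys (a guess at the construction, not necessarily how the benchmark data was generated):

```python
import numpy as np

rng = np.random.default_rng(0)
# n keys of 16 random bytes each; swap 16 for 128 for the random_128 set.
keys = rng.integers(0, 256, size=(1000, 16), dtype=np.uint8)
```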
id_like
Low entropy pairs of small 32-bit integers.
permutative
Lowest entropy: all possible pairs of integers from (0, 0) to (sqrt(n), sqrt(n)).
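Such a dataset could be generated along these lines (again a sketch, not the benchmark's actual code):

```python
import numpy as np

k = 4  # sqrt(n), so n = k**2 = 16 keys
# Every possible (x, y) pair with 0 <= x, y < k — maximally patterned input:
pairs = np.stack(np.meshgrid(np.arange(k), np.arange(k)), axis=-1).reshape(-1, 2)
```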
floating_32
High entropy triplets of 32-bit floats from -30 to 30.
floating_64
Like floating_32 but with 64-bit floats.
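A plausible construction for these float datasets (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Triplets of floats drawn from [-30, 30); drop .astype() for floating_64.
triplets = rng.uniform(-30, 30, size=(1000, 3)).astype(np.float32)
```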
textual
Short (up to 20 character) non-random strings from a common passwords database.
utf8
Like textual but using bytes instead of strings.