The figures below show JudyL compared to a commonly used hashing algorithm.
The hash algorithm is a simple static table sized to be a power of
two with a linear (linked list) collision chain. The
table is a fixed size of 1,048,576 buckets (220).
Each bucket is a pointer to a chain. Each chain element is 3 words
(key, value, next pointer). The key, value and next pointer are
all 64-bit quantities; therefore, the hash memory used should asymptotically
approach 24 bytes per inserted data element.
To insert into the hash table, the key is hashed and then
ANDed with the table size - 1 to find a bucket. It is also common
to have a prime-number-sized hash table and use the mod function
to get a table index. However, mod is very
slow (~1 microsecond on the system we were using to acquire the
above data) so we used the power of two table method
instead. Chains are linear searched for a match. If there is no match,
a new chain element is created. Finally, the data value is incremented
(the data value becomes a usage count to verify the number of unique
keys inserted).
The hash table was seriously overloaded by a factor of 40:1.
Overloading is very undesirable but all too common with hashing.
The resulting collision chains were an average of 39.15 in length,
showing that our hash distribution was very good.
The hash table creation time is not shown in the data. Judy
table creation is an integral part of JudyLIns since a Judy array
starts at zero length and grows dynamically.
The hashing code for this benchmark was all in-lined.
This is common with hashing but not a practical option with Judy. Mallopt(3C)
was used to speed up the allocation of each hash element. Measurements
were done on a 2GB HP 9000 J5000 workstation running HP-UX 11i at
440 MHz. The raw data is shown below.
 |
 |  |
 |
 | NOTE: The data was collected using random numbers as keys.
Random keys provide the best-case data for hashing and the worst case
data for Judy. Judy's performance and memory usage will both improve
with any data clustering. |
 |
 |  |
 |
Most programmers understand that hashing with a static table
can deliver excellent performance. However when you use hashing,
you must remember to:
Size your hash
table correctly.
Beware of operators like mod (i.e.
%) on a PA-RISC system.
Use any available optimizations, like mallopt(3C).
When you use Judy, you don't need to remember anything
on this checklist! Judy takes care of all the optimizations for
you.
In summary
Judy is a hybrid digital tree that chooses
data structures for each node appropriate to the population below
the node, thereby allowing tree-level compression and balanced memory
usage. Over large populations, the Judy advantage becomes
clear when Judy is benchmarked against other search and retrieval
methods, such as hashing. |