Efficient Membership Testing With Quaternary Filter
In the realm of machine learning and data structures, one common problem is efficiently mapping a 32-bit unsigned integer (uint32) to a boolean value (bool). This problem arises in scenarios such as feature hashing, Bloom filters, and exact membership testing. Traditional solutions like hash tables or Bloom filters come with trade-offs: hash tables require significant memory overhead, while Bloom filters introduce false positives. In this blog post, we introduce the Quaternary Filter, a dynamic and memory-efficient data structure designed to solve the map[uint32]bool problem.
This article builds on the concepts introduced in the Ternary XOR Table blog post, extending the ideas to a four-state system with additional collision-handling mechanisms. Lookup can be guaranteed in 32 hops under the rotate collision-handling scheme. Here’s why:
- Rotate Mechanism: When a collision (value 3 in a filter cell) occurs, rotating the integer ensures that subsequent probes use different bits of the 32-bit integer. After 32 rotations, all bits have been cycled through, ensuring the hash function explores distinct cells.
- Salt Increment: The hash salt increments on each collision, randomizing the next cell visit. This prevents repeated collisions in the same cell sequence.
- Termination: Since all 32 bits are exhausted after 32 hops, the lookup either resolves via parity (empty cell) or matches a stored bit (1/2).
The Problem Revisited: map[uint32]bool Learning
Given a set of uint32 integers, we want to map each integer to a boolean value (true or false). For example:
- Insert phase: {12345: true, 67890: false, ...}.
- Query phase: Is 12345 mapped to true or false?
The challenge is to achieve this mapping efficiently in terms of both memory and computation, while ensuring exact results for inserted integers.
Quaternary Filter: Overview
The Quaternary Filter is a compact data structure that uses four states per cell:
- 0: Empty (no value stored; the answer is the least significant bit of the current integer).
- 1: Represents false. Terminates processing with answer 0.
- 2: Represents true. Terminates processing with answer 1.
- 3: Collision (requires further processing and a cell hop).
Each cell in the filter is indexed using a modular hash function (such as the Hashtron Hash), and the filter dynamically resizes to accommodate new entries.
Insert Phase
The insert phase ensures that all integers in the input set are correctly mapped to their respective boolean values. Here’s how it works:
- Initialization: Start with an empty filter of size m = α · n, where n is the number of integers to insert and α is the initial allocation factor (e.g., 1.5). The initial salt is zero (s = 0).
- Probing: For each integer x to be inserted into the filter, mapping to answer a:
  - Compute the modular hash index: h = hash(x, s) % m.
  - If the cell at index h is empty (0), inspect the parity of x. If it matches the answer a, do nothing and terminate; otherwise store the answer’s boolean value (1 or 2).
  - If the cell contains a conflicting value, i.e. one not matching the answer’s encoding a + 1 (1 or 2), mark it as a collision (3) and proceed as if the cell had contained a collision to begin with (see below).
  - If the cell contains a collision (3), rotate the integer x, increment the salt s, and visit the next cell using hash probing.
- Resizing: If the filter becomes too full (e.g., load factor > 0.67), enlarge the filter by a factor of 1.5 and reinsert all elements.
The insert phase continues until no further mutations occur or the filter reaches its maximum size.
Lookup Phase
The lookup phase determines the boolean value associated with a given integer x:
- Probing: Compute the hash index: h = hash(x, s) % m. The initial salt s is zero.
- Cell Check:
  - If the cell is empty (0), return a default value based on parity: false if x is even, true if x is odd.
  - If the cell contains a boolean value (1 or 2), return the corresponding answer (0 or 1).
  - If the cell contains a collision (3), rotate the integer x, increment the salt s, and reprobe.
- Termination: The lookup terminates when a boolean value is found or a maximum number of probes (e.g., 32) is reached.
Warning: For integers not inserted during the insert phase, there is no guarantee what default boolean value will be returned. The result depends on the state of the filter and the hash function.
Collision Handling: Rotate Mechanism
When a collision (state 3) is encountered, the integer x is rotated to derive a new hash index. The rotation operation is defined as:
\[x_{\text{rotated}} = (x \gg k) \, | \, (x \ll (32 - k))\]
where k is the number of bits to rotate (e.g., k = 1). This ensures that all 32 bits of the integer are eventually used for odd/even probing on the terminating empty cells.
Proof by Contradiction: 33 Collision Hops Are Impossible
Assumption for Contradiction:
Suppose a lookup in the Quaternary Filter requires 33 collision-handling hops (cell visits) for a 32-bit integer.
Key Properties of the System:
- 32-Bit Integer: The integer has 32 bits, so rotating it one bit per hop cycles through all 32 distinct rotations.
- Salt Increment: Each collision increments the hash salt, ensuring no repeated cell sequence.
- Filter Size: The filter is initialized with size \( m \geq 1.5n \), where \( n \) is the number of inserted integers.
Deriving the Contradiction:
- Exhausting Bit Configurations: After 32 one-bit rotations, every rotation of the integer has been fed into the hash computation.
\[ \text{Unique bit configurations} = 32 \]
- Salt Randomization: The salt increments on each collision, so even when rotation is a no-op (such as rotating the integer 0), the hash input differs on all 32 probes.
\[ \text{Unique salts} = 32 \]
- Cell Probing: Over 32 collision hops, the system generates 32 distinct hash indices due to the combination of rotated bits and incremented salts.
- 33rd Hop Impossibility: A 33rd collision hop would require either:
  - Reusing a previous bit configuration (contradicts exhaustive rotation).
  - Reusing a salt (contradicts the salt increment).
  - Colliding in a new cell despite 32 prior distinct indices (contradicts the filter size \( m \geq 1.5n \)).
Since the filter size \( m \) exceeds the number of inserted integers \( n \), a birthday-problem argument makes it implausible that all 32 distinct probes land on collision cells across \( m \geq 1.5n \) cells.
Conclusion:
The assumption of 33 collision hops leads to a contradiction:
- The system’s design (32-bit rotations + salt increments + filter sizing) ensures all lookups resolve in ≤32 hops.
- Thus, 33 collision hops are impossible.
- Therefore, lookup is a constant-time operation on average.
Performance Analysis
Memory Efficiency
The Quaternary Filter achieves high memory efficiency by using only 2 bits per cell (4 states). For a filter of size m, the total memory usage is:
\[\text{Memory} = 2 \cdot m \, \text{bits}\]
Time Complexity
- Insert Phase: O(n) on average, with occasional resizing overhead.
- Lookup Phase: O(1) on average, with a worst-case bound of 32 probes for 32-bit integers.
Resize Factor
Considered resize factors include 3/2 (1.5), 2 (doubling), the golden ratio (1.618), and prime-based increments.
Derivation of Optimal Resize Factor \( r \)
The sum \( \sum_{k=0}^{K} r^k \) of successive filter sizes is a geometric series. For large \( K \), it approximates \( \frac{r^{K+1} - 1}{r - 1} \), whose dominant term is \( \frac{r^{K+1}}{r - 1} \). To minimize the total reinsertion work up to capacity \( n \):
\[\min_r \frac{r^{\log_r n}}{r - 1} = \min_r \frac{n}{r - 1}\]
Minimizing \( \frac{n}{r - 1} \) in isolation favors ever larger \( r \), but large \( r \) inflates memory overhead, while small \( r \) increases resize frequency. Balancing the two costs, the optimal \( r \) satisfies:
\[\frac{\partial}{\partial r} \left( \text{resize cost} + \text{memory cost} \right) = 0\]
Assuming the memory cost is linear in \( m \), working through this trade-off yields \( r = \phi \approx 1.618 \) (the golden ratio). However, engineering practice favors \( r = 1.5 \) for:
- Lower memory bloat (1.5x vs 1.618x).
- Simpler integer arithmetic.
Resize Factor 1.5
Advantages
- Memory Efficiency: Resizing by 1.5x reduces memory overhead compared to doubling (2x). For example, resizing from 1,000 to 1,500 saves 25% memory compared to resizing to 2,000.
- Smooth Growth: Smaller increments reduce the frequency of large rehashing operations, which can be costly in terms of time and resources.
- Load Factor Control: A 1.5x resize factor helps maintain a balanced load factor (e.g., 0.67), reducing the risk of collisions and improving lookup performance.
Disadvantages
- Frequent Resizing: Smaller increments mean more frequent resizes, which can increase the amortized cost of insertions.
- Non-Prime Sizes: Resizing by 1.5x often results in non-prime filter sizes, which can lead to uneven key distribution and increased collisions if the hash function relies on modulo operations.
Resize Factor 2
Advantages
- Amortized O(1) Performance: Doubling ensures that the amortized cost of insertions remains O(1), as the cost of resizing is spread over many insertions.
- Power-of-Two Sizes: Many hash functions (e.g., Java’s HashMap) use bitwise operations for indexing, which are efficient when the filter size is a power of two.
- Fewer Resizes: Larger increments reduce the frequency of resizing, which can improve performance for large datasets.
Disadvantages
- Memory Overhead: Doubling can lead to significant memory waste, especially for large filters.
- Prime Number Ignorance: Like 1.5x, doubling does not guarantee prime filter sizes, which can be problematic for certain hash functions.
Prime-Number-Based Increments
Advantages
- Optimal Key Distribution: Prime filter sizes reduce collisions by ensuring a more uniform distribution of keys, especially for hash functions using modulo operations.
- Flexible Growth: Incrementing to the next prime number (e.g., 11 → 23 → 47) provides a balance between memory efficiency and collision avoidance.
Disadvantages
- Computational Overhead: Finding the next prime number can be computationally expensive, especially for large filters.
- Less Predictable Growth: Prime-based increments do not follow a fixed multiplicative factor, making it harder to predict memory usage and performance.
Comparison of Effectiveness
Factor | Memory Efficiency | Collision Avoidance | Resize Frequency | Computational Cost
---|---|---|---|---
1.5x | High | Moderate | High | Low
2.0x (Doubling) | Low | Moderate | Low | Low
Prime-Based | Moderate | High | Moderate | High
Comparison with Ternary XOR Table
The Quaternary Filter extends the Ternary XOR Table by:
- Introducing a fourth state (collision) for better collision handling.
- Using a rotate mechanism to ensure all bits of the integer are utilized during probing.
- Providing a more robust solution for adversarial inputs.
Example Use Case
Imagine a machine learning pipeline where you need to map feature IDs (32-bit integers) to binary labels (true or false). The Quaternary Filter provides an efficient and exact solution for this problem, ensuring that all inserted feature IDs are correctly mapped while minimizing memory usage.
Conclusion
The Quaternary Filter is a powerful data structure for solving the map[uint32]bool learning problem.
By leveraging four states and a rotate mechanism, it achieves exact membership testing with minimal
memory overhead. While it provides no guarantees for non-inserted integers, it is an excellent choice
for applications requiring efficient and dynamic boolean mappings.
For more details on the foundational concepts, refer to the Ternary XOR Table article. Try implementing the Quaternary Filter in your next project, and let us know how it performs!
References
- Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley.
- Cormen, T. H., et al. (2009). Introduction to Algorithms. MIT Press.
- Ternary XOR Table Blog Post.
- Wikipedia - Bloom Filter
- Wikipedia - Geometric Series
- Wikipedia - Golden Ratio
- Wikipedia - Amortized Analysis
- Wikipedia - Hash Table - Load Factor
- Wikipedia - Feature Hashing in Machine Learning
- Wikipedia - Exact Membership Testing in Data Structures