Efficient Membership Testing With Quaternary Filter
In the realm of machine learning and data structures, one common problem is efficiently mapping a 32-bit unsigned integer (uint32) to a boolean value (bool). This problem arises in scenarios such as feature hashing, Bloom filters, and exact membership testing. Traditional solutions like hash tables or Bloom filters come with trade-offs: hash tables require significant memory overhead, while Bloom filters introduce false positives. In this blog post, we introduce the Quaternary Filter, a dynamic and memory-efficient data structure designed to solve the map[uint32]bool problem.
This article builds on the concepts introduced in the Ternary XOR Table blog post, extending the ideas to a four-state system with additional collision-handling mechanisms. Lookup can be guaranteed in 32 hops under the rotate collision-handling scheme. Here’s why:
- Rotate Mechanism: When a collision (value 3 in a filter cell) occurs, rotating the integer ensures that subsequent probes use different bits of the 32-bit integer. After 32 rotations, all bits have been cycled through, ensuring the hash function explores distinct cells.
- Salt Increment: The hash salt increments on each collision, randomizing the next cell visit. This prevents repeated collisions in the same cell sequence.
- Termination: Since all 32 bits are exhausted after 32 hops, the lookup either resolves via parity (empty cell) or matches a stored bit (1/2).
The Problem Revisited: map[uint32]bool Learning
Given a set of uint32 integers, we want to map each integer to a boolean value (true or false). For example:
- Insert phase: {12345: true, 67890: false, ...}.
- Query phase: Is 12345 mapped to true or false?
The challenge is to achieve this mapping efficiently in terms of both memory and computation, while ensuring exact results for inserted integers.
Quaternary Filter: Overview
The Quaternary Filter is a compact data structure that uses four states per cell:
- 0: Empty (no value stored; the answer is the least significant bit of the current integer).
- 1: Represents false. Terminates processing with answer 0.
- 2: Represents true. Terminates processing with answer 1.
- 3: Collision (requires further processing and a cell hop).
Each cell in the filter is indexed using a modular hash function (such as the Hashtron Hash), and the filter dynamically resizes to accommodate new entries.
Insert Phase
The insert phase ensures that all integers in the input set are correctly mapped to their respective boolean values. Here’s how it works:
- Initialization: Start with an empty filter of size m = α · n, where n is the number of integers to insert and α is the initial allocation factor (e.g., 1.5). The initial salt is zero (s = 0).
- Probing: For each integer x to be inserted into the filter, mapping to answer a:
  - Compute the modular hash index: h = hash(x, s) % m.
  - If the cell at index h is empty (0), inspect the parity of x. If it matches the answer a, do nothing and terminate; otherwise store the answer’s boolean value (1 or 2).
  - If the cell contains a conflicting value, i.e. one not matching the answer’s encoding a + 1 (1 or 2), mark it as a collision (3) and proceed as if the cell had contained a collision to begin with (see below).
  - If the cell contains a collision (3), rotate the integer x, increment the salt s, and visit the next cell using hash probing.
- Resizing: If the filter becomes too full (e.g., load factor > 0.67), enlarge the filter by a factor of 1.5 and reinsert all elements.
The insert phase continues until no further mutations occur or the filter reaches its maximum size.
Lookup Phase
The lookup phase determines the boolean value associated with a given integer x:
- Probing: Compute the hash index: h = hash(x, s) % m. The initial salt s is zero.
- Cell Check:
  - If the cell is empty (0), return a default value based on parity: false if x is even, true if x is odd.
  - If the cell contains a boolean value (1 or 2), return the corresponding answer (0 or 1).
  - If the cell contains a collision (3), rotate the integer x, increment the salt s, and reprobe.
- Termination: The lookup terminates when a boolean value is found or a maximum number of probes (e.g., 32) is reached.
Warning: For integers not inserted during the insert phase, there is no guarantee what default boolean value will be returned. The result depends on the state of the filter and the hash function.
Collision Handling: Rotate Mechanism
When a collision (state 3) is encountered, the integer x is rotated to derive a new hash index. The rotation operation is defined as:
\[x_{\text{rotated}} = (x \gg k) \, | \, (x \ll (32 - k))\]
where k is the number of bits to rotate (e.g., k = 1). This ensures that all 32 bits of the integer are eventually used for odd/even probing on the terminating empty cells.
Proof by Contradiction: 33 Collision Hops Are Impossible
Assumption for Contradiction:
Suppose a lookup in the Quaternary Filter requires 33 collision-handling hops (cell visits) for a 32-bit integer.
Key Properties of the System:
- 32-Bit Integer: The integer has 32 bits, so rotating it one bit per hop cycles through all 32 distinct rotations.
- Salt Increment: Each collision increments the hash salt, ensuring no repeated cell sequence.
- Filter Size: The filter is initialized with size \( m \geq 1.5n \), where \( n \) is the number of inserted integers.
Deriving the Contradiction:
- Exhausting Bit Configurations: After 32 one-bit rotations, every rotation of the integer has been fed into the hash computation.
\[ \text{Unique bit configurations} = 32 \]
- Salt Randomization: The salt increments on each collision, so even when rotation is a no-op (such as rotating the integer 0), the hash input differs on all 32 probes.
\[ \text{Unique salts} = 32 \]
- Cell Probing: Over 32 collision hops, the system generates 32 distinct hash indices due to the combination of rotated bits and incremented salts.
- 33rd Hop Impossibility: A 33rd collision hop would require either:
  - Reusing a previous bit configuration (contradicts exhaustive rotation).
  - Reusing a salt (contradicts the salt increment).
  - Colliding in a new cell despite 32 prior distinct indices (contradicts the filter size \( m \geq 1.5n \)).
Since the filter size \( m \) exceeds the number of inserted integers \( n \), a birthday-problem argument makes it implausible that all 32 distinct probes land on collision cells across \( m \geq 1.5n \) cells.
Conclusion:
The assumption of 33 collision hops leads to a contradiction:
- The system’s design (32-bit rotations + salt increments + filter sizing) ensures all lookups resolve in ≤32 hops.
- Thus, 33 collision hops are impossible.
- Therefore, lookup is a constant-time operation on average.
Performance Analysis
Memory Efficiency
The Quaternary Filter achieves high memory efficiency by using only 2 bits per cell (4 states). For a filter of size m, the total memory usage is:
\[\text{Memory} = 2 \cdot m \, \text{bits}\]
Time Complexity
- Insert Phase: O(n) on average, with occasional resizing overhead.
- Lookup Phase: O(1) on average, with a worst-case bound of 32 probes for 32-bit integers.
Resize Factor
Considered resize factors include 3/2 (1.5), 2 (doubling), the golden ratio (1.618), and prime-based increments.
Derivation of Optimal Resize Factor \( r \)
The sum \( \sum_{k=0}^{K} r^k \) of successive filter sizes is a geometric series. For large \( K \), it approximates \( \frac{r^{K+1} - 1}{r - 1} \), whose dominant term is \( \frac{r^{K+1}}{r - 1} \). To minimize the total reinsertion work up to capacity \( n \):
\[\min_r \frac{r^{\log_r n}}{r - 1} = \min_r \frac{n}{r - 1}\]
Minimizing \( \frac{n}{r - 1} \) in isolation favors ever larger \( r \), but large \( r \) inflates memory overhead, while small \( r \) increases resize frequency. Balancing the two costs, the optimal \( r \) satisfies:
\[\frac{\partial}{\partial r} \left( \text{resize cost} + \text{memory cost} \right) = 0\]
Assuming the memory cost is linear in \( m \), working through this trade-off yields \( r = \phi \approx 1.618 \) (the golden ratio). However, engineering practice favors \( r = 1.5 \) for:
- Lower memory bloat (1.5x vs 1.618x).
- Simpler integer arithmetic.
Resize Factor 1.5
Advantages
- Memory Efficiency: Resizing by 1.5x reduces memory overhead compared to doubling (2x). For example, resizing from 1,000 to 1,500 saves 25% memory compared to resizing to 2,000.
- Smooth Growth: Smaller increments reduce the frequency of large rehashing operations, which can be costly in terms of time and resources.
- Load Factor Control: A 1.5x resize factor helps maintain a balanced load factor (e.g., 0.67), reducing the risk of collisions and improving lookup performance.
Disadvantages
- Frequent Resizing: Smaller increments mean more frequent resizes, which can increase the amortized cost of insertions.
- Non-Prime Sizes: Resizing by 1.5x often results in non-prime filter sizes, which can lead to uneven key distribution and increased collisions if the hash function relies on modulo operations.
Resize Factor 2
Advantages
- Amortized O(1) Performance: Doubling ensures that the amortized cost of insertions remains O(1), as the cost of resizing is spread over many insertions.
- Power-of-Two Sizes: Many hash functions (e.g., Java’s HashMap) use bitwise operations for indexing, which are efficient when the filter size is a power of two.
- Fewer Resizes: Larger increments reduce the frequency of resizing, which can improve performance for large datasets.
Disadvantages
- Memory Overhead: Doubling can lead to significant memory waste, especially for large filters.
- Prime Number Ignorance: Like 1.5x, doubling does not guarantee prime filter sizes, which can be problematic for certain hash functions.
Prime-Number-Based Increments
Advantages
- Optimal Key Distribution: Prime filter sizes reduce collisions by ensuring a more uniform distribution of keys, especially for hash functions using modulo operations.
- Flexible Growth: Incrementing to the next prime number (e.g., 11 → 23 → 47) provides a balance between memory efficiency and collision avoidance.
Disadvantages
- Computational Overhead: Finding the next prime number can be computationally expensive, especially for large filters.
- Less Predictable Growth: Prime-based increments do not follow a fixed multiplicative factor, making it harder to predict memory usage and performance.
Comparison of Effectiveness
Factor | Memory Efficiency | Collision Avoidance | Resize Frequency | Computational Cost
---|---|---|---|---
1.5x | High | Moderate | High | Low
2.0x (Doubling) | Low | Moderate | Low | Low
Prime-Based | Moderate | High | Moderate | High
Comparison with Ternary XOR Table
The Quaternary Filter extends the Ternary XOR Table by:
- Introducing a fourth state (collision) for better collision handling.
- Using a rotate mechanism to ensure all bits of the integer are utilized during probing.
- Providing a more robust solution for adversarial inputs.
Example Use Case
Imagine a machine learning pipeline where you need to map feature IDs (32-bit integers) to binary labels (true or false). The Quaternary Filter provides an efficient and exact solution for this problem, ensuring that all inserted feature IDs are correctly mapped while minimizing memory usage.
Conclusion
The Quaternary Filter is a powerful data structure for solving the map[uint32]bool learning problem.
By leveraging four states and a rotate mechanism, it achieves exact membership testing with minimal
memory overhead. While it provides no guarantees for non-inserted integers, it is an excellent choice
for applications requiring efficient and dynamic boolean mappings.
For more details on the foundational concepts, refer to the Ternary XOR Table article. Try implementing the Quaternary Filter in your next project, and let us know how it performs!
References
- Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley.
- Cormen, T. H., et al. (2009). Introduction to Algorithms. MIT Press.
- Ternary XOR Table Blog Post.
- Wikipedia - Bloom Filter
- Wikipedia - Geometric Series
- Wikipedia - Golden Ratio
- Wikipedia - Amortized Analysis
- Wikipedia - Hash Table - Load Factor
- Wikipedia - Feature Hashing in Machine Learning
- Wikipedia - Exact Membership Testing in Data Structures