Efficient Membership Testing With Quaternary Filter

Neurlang

2025/02/20

Efficient Membership Testing With Quaternary Filter

In the realm of machine learning and data structures, one common problem is efficiently mapping a 32-bit unsigned integer (uint32) to a boolean value (bool). This problem arises in scenarios such as feature hashing, bloom filters, and exact membership testing. Traditional solutions like hash tables or bloom filters come with trade-offs — hash tables require significant memory overhead, while bloom filters introduce false positives. In this blog post, we introduce the Quaternary Filter, a dynamic and memory-efficient data structure designed to solve the map[uint32]bool problem.

This article builds on the concepts introduced in the Ternary XOR Table blog post, extending the ideas to a four-state system with additional collision-handling mechanisms. Lookup can be guaranteed in 32 hops under the rotate collision-handling scenario. Here’s why:


The Problem Revisited: map[uint32]bool learning

Given a set of uint32 integers, we want to map each integer to a boolean value (true or false). For example:

The challenge is to achieve this mapping efficiently in terms of both memory and computation, while ensuring exact results for inserted integers.


Quaternary Filter: Overview

The Quaternary Filter is a compact data structure that uses four states per cell:

Each cell in the filter is indexed using a modular hash function (such as the Hashtron Hash), and the filter dynamically resizes to accommodate new entries.


Insert Phase

The insert phase ensures that all integers in the input set are correctly mapped to their respective boolean values. Here’s how it works:

  1. Initialization: Start with an empty filter of size m = α · n, where n is the number of integers to insert and α is the initial allocation factor (e.g., 1.5). Initial salt is zero (s = 0).
  2. Probing: For each integer x to be inserted into the filter, x mapping to answer a:
    • Compute the modular hash index: h = hash(x, s) % m.
    • If the cell at index h is empty (0), inspect the parity of x. If it matches the answer a, do nothing and terminate, otherwise store the answer’s boolean value (1 or 2).
    • If the cell contains a conflicting value - not matching the answer a+1 (1 or 2), mark it as a collision (3) and act as if the cell contained a collision (3) to begin with (see below).
    • If the cell contains a collision (3), rotate the integer x, increase the salt s and visit the next cell using hash probing.
  3. Resizing: If the filter becomes too full (e.g., load factor > 0.67), enlarge the filter by a factor of 1.5 and reinsert all elements.

The insert phase continues until no further mutations occur or the filter reaches its maximum size.


Lookup Phase

The lookup phase determines the boolean value associated with a given integer x:

  1. Probing: Compute the hash index: h = hash(x, s) % m. The initial salt s is zero.
  2. Cell Check:
    • If the cell is empty (0), return a default value (e.g., false if the integer x is even, true if the integer x is odd).
    • If the cell contains a boolean value (1 or 2), return the corresponding answer value (0 or 1).
    • If the cell contains a collision (3), rotate the integer x, increase the salt s and reprobe.
  3. Termination: The lookup terminates when a boolean value is found or a maximum number of probes (e.g., 32) is reached.

Warning: For integers not inserted during the insert phase, there is no guarantee what default boolean value will be returned. The result depends on the state of the filter and the hash function.


Collision Handling: Rotate Mechanism

When a collision (state 3) is encountered, the integer x is rotated to derive a new hash index. The rotation operation is defined as: \[x_{\text{rotated}} = (x \gg k) \, | \, (x \ll (32 - k))\] where k is the number of bits to rotate (e.g., k = 1). This ensures that all 32 bits of the integer are eventually used for odd/even probing on the terminating empty cells.

Proof by Contradiction: 33 Collision Hops Are Impossible

Assumption for Contradiction:
Suppose a lookup in the Quaternary Filter requires 33 collision-handling hops (cell visits) for a 32-bit integer.

Key Properties of the System:

  1. 32-Bit Integer: The integer has 32 unique bits. Rotating/shifting it 32 times exhausts all bit configurations.
  2. Salt Increment: Each collision increments the hash salt, ensuring no repeated cell sequence.
  3. Filter Size: The filter is initialized with size \( m \geq 1.5n \), where \( n \) is the number of inserted integers.

Deriving the Contradiction:

  1. Exhausting Bit Configurations:
    After 32 rotations/shifts, all 32 bits of the integer have been used in the hash computation.
    \[ \text{Unique bit configurations} = 32 \]

  2. Salt Randomization:
    The salt increments on each collision, ensuring that even with repeated rotations (such as rotating the integer 0), the hash index is unique for all 32 probes.
    \[ \text{Unique salts} = 32 \]

  3. Cell Probing:
    For 32 collision hops, the system generates 32 unique hash indices due to the combination of rotated bits and incremented salts.

  4. 33rd Hop Impossibility:
    A 33rd collision hop would require either:

    • Reusing a previous bit configuration (contradicts exhaustive rotation).
    • Reusing a salt (contradicts salt increment).
    • Colliding in a new cell despite all prior 32 unique indices (contradicts filter size \( m \geq 1.5n \)).

    Since the filter size \( m \) exceeds the number of inserted integers \( n \), the birthday problem guarantees that 32 unique probes cannot collide across \( m \geq 1.5n \) cells.

Conclusion:

The assumption of 33 collision hops leads to a contradiction:


Performance Analysis

Memory Efficiency

The Quaternary Filter achieves high memory efficiency by using only 2 bits per cell (4 states). For a filter of size m, the total memory usage is: \[\text{Memory} = 2 \cdot m \, \text{bits}\]

Time Complexity

Resize Factor

Considered resize factors include 3÷2 (1.5), 2 (doubling), the golden ratio (1.618), and prime based increments.

Derivation of Optimal Resize Factor \( r \)

The sum \( \sum_{k=0}^{K} r^k \) is a geometric series. For large \( K \), it approximates \( \frac{r^{K+1} - 1}{r - 1} \). The dominant term is \( \frac{r^{K+1}}{r - 1} \). To minimize: \[\min_r \frac{r^{\log_r n}}{r - 1} = \min_r \frac{n}{r - 1}\]

This simplifies to minimizing \( \frac{n}{r - 1} \), which occurs as \( r \to 1^+ \). However, small \( r \) increases resize frequency. Balancing resize frequency and memory overhead, the optimal \( r \) satisfies: \[\frac{\partial}{\partial r} \left( \text{resize cost} + \text{memory cost} \right) = 0\]

Assuming memory cost is linear in \( m \), calculus of variations gives \( r = \phi \approx 1.618 \) (golden ratio). However, engineering practice favors \( r = 1.5 \) for:

Resize Factor 1.5

Advantages
Disadvantages

Resize Factor 2

Advantages
Disadvantages

Prime-Number-Based Increments

Advantages
Disadvantages

Comparison of Effectiveness

Factor Memory Efficiency Collision Avoidance Resize Frequency Computational Cost
1.5x High Moderate High Low
2.0x (Doubling) Low Moderate Low Low
Prime-Based Moderate High Moderate High

Comparison with Ternary XOR Table

The Quaternary Filter extends the Ternary XOR Table by:

  1. Introducing a fourth state (collision) for better collision handling.
  2. Using a rotate mechanism to ensure all bits of the integer are utilized during probing.
  3. Providing a more robust solution for adversarial inputs.

Example Use Case

Imagine a machine learning pipeline where you need to map feature IDs (32-bit integers) to binary labels (true or false). The Quaternary Filter provides an efficient and exact solution for this problem, ensuring that all inserted feature IDs are correctly mapped while minimizing memory usage.


Conclusion

The Quaternary Filter is a powerful data structure for solving the map[uint32]bool learning problem. By leveraging four states and a rotate mechanism, it achieves exact membership testing with minimal memory overhead. While it provides no guarantees for non-inserted integers, it is an excellent choice for applications requiring efficient and dynamic boolean mappings.

For more details on the foundational concepts, refer to the Ternary XOR Table article. Try implementing the Quaternary Filter in your next project, and let us know how it performs!


References

  1. Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley.
  2. Cormen, T. H., et al. (2009). Introduction to Algorithms. MIT Press.
  3. Ternary XOR Table Blog Post.
  4. Wikipedia - Bloom Filter
  5. Wikipedia - Geometric Series
  6. Wikipedia - Golden Ratio
  7. Wikipedia - Amortized Analysis
  8. Wikipedia - Hash Table - Load Factor
  9. Wikipedia - Feature Hashing in Machine Learning
  10. Wikipedia - Exact Membership Testing in Data Structures