Skip to main content

Vac 101: Membership with Bloom Filters and Cuckoo Filters

by
13 min read

We examine two data structures: Bloom filters and Cuckoo filters.

Membership with Bloom Filters and Cuckoo Filters

The ability to efficiently query the membership of an element in a given data set is crucial. In certain applications, it is more important to output a result quickly than to have a 'perfect' result. In particular, false positives may be an acceptable tradeoff for speed. In this blog, we examine Bloom and Cuckoo data filters. Both of these filters are data structures that can be used for membership proofs.

Everyone is familiar with the process of creating a new account for various websites, whether it is an e-mail account or a social media account. Consider when you enter your desired username. Many sites provide real-time feedback, as you type, on the availability of a given string. In this scenario, it is necessary that the result is seemingly instant, regardless of the number of existing accounts. However, it is not important that the usernames that are flagged as unavailable are, in fact, in use. That is, it is sufficient to have a probabilistic check for membership.

Bloom filters and Cuckoo filters are data structures that can be used to accumulate data with a fixed amount of space. The associated filter FF for a digest of data DD can be queried to determine whether an element is (possibly) a member of DD:

  • 0: The queried element is definitely not a member of digest DD.
  • 1: The entry is possibly a member of the digest DD.

The algorithms associated with Bloom filters and Cuckoo filters, which we will discuss shortly, are deterministic. The possibility of false positives arises from the query algorithm.

Bloom filters

A Bloom filter is a data structure that can be used to accumulate an arbitrary amount of data with a fixed amount of space. Bloom filters have been a popular data structure for proof of non-membership due to their small storage size. Specifically, a Bloom filter consists of a binary string v{0,1}n{\bf{v}} \in \{0,1\}^n and kk hash functions {hi:{0,1}{0,,n1}}i=0k1\{h_i: \{0,1\}^* \rightarrow \{0,\dots,n-1\}\}_{i=0}^{k-1}. We note that each hash function hih_i is used to determine an index of our binary string v{\bf{v}} to flip the associated bit to 1. The binary string v{\bf{v}} is initialized with every entry as 0. The hash functions do not need to be cryptographic hash functions.

  • Append: Suppose that we wish to add the element xx to the Bloom filter.
    • Define the vector b{0,,n1}k{\bf{b}} \in \{0,\dots,n-1\}^k so that b[i]:=hi(x){\bf{b}}[i] := h_i(x) for each i{0,,k1}i \in \{0,\dots,k-1\}.
    • Update the binary string v[b[i]]1{\bf{v}}[{\bf{b}}[i]] \leftarrow 1 for each i{0,,k1}i \in \{0,\dots,k-1\}.
  • Query: Suppose that we wish to query the Bloom filter for element yy.
    • Return 1 provided v[hi(y)]=1{\bf{v}}[h_i(y)] = 1 for every i{0,,k1}i \in \{0,\dots,k-1\}. Otherwise, return 0.

The algorithm Query will output 1 for every element yy that has been added to the Bloom filter. This is a consequence of the Append algorithm. However, due to potential collisions over a set of hash functions, it is possible for false positives to occur. Moreover, the possibility of collisions makes it impossible to remove elements from the Bloom filter.

Complexity

The storage of a Bloom filter requires constant space. Specifically, the Bloom filter uses nn bits regardless of the size of the digest. So, regardless of the number of elements that we append, the Bloom filter will use nn bits. Further, if we assume that each of the kk hash functions runs in constant time, then we can append/query an entry in O(k)O(k).

Example

Suppose that k=3k = 3 and n=10n = 10. Our Bloom filter is initialized as v=(0000000000).\bf{v} = \begin{pmatrix}0&0&0&0&0&0&0&0&0&0\end{pmatrix}. Now, we will append the words addadd, sumsum, and equalequal. Suppose that

h0(add)=1h1(add)=4h2(add)=7h0(sum)=9h1(sum)=2h2(sum)=1h0(equal)=5h1(equal)=8h2(equal)=0.\begin{matrix} h_0(add) = 1 & h_1(add) = 4 & h_2(add) = 7\\ h_0(sum) = 9 & h_1(sum) = 2 & h_2(sum) = 1\\ h_0(equal) = 5 & h_1(equal) = 8 & h_2(equal) = 0. \end{matrix}

After appending these words, the Bloom filter is v=(1110110111).\bf{v} = \begin{pmatrix}1&1&1&0&1&1&0&1&1&1\end{pmatrix}.

Now, suppose that we query the words subtractsubtract and multiplemultiple so that

h0(subtract)=3h1(subtract)=5h2(subtract)=1h0(multiple)=7h1(multiple)=1h2(multiple)=4\begin{matrix} h_0(subtract) = 3 & h_1(subtract) = 5 & h_2(subtract) = 1\\ h_0(multiple) = 7 & h_1(multiple) = 1 & h_2(multiple) = 4\\ \end{matrix}.

The query for subtractsubtract returns 0 since v[3]=0{\bf{v}}[3]=0. On the other hand, the query for multiplemultiple returns 1 since v[1]=1,v[4]=1{\bf{v}}[1]=1, {\bf{v}}[4] = 1, and v[7]=1{\bf{v}}[7]=1. Even though multiplemultiple was not used to generate the Bloom filter v{\bf{v}}, our query returns the false positive.

Probability of false positives

For our analysis, we will assume that the probabilities that arise in our analysis are independent. However, this assumption can be removed to gain the same approximation.

We note that for a single hash function, the probability that a specific bit is flipped to 1 is 1/n1/n. So, the probability that the specific bit is not flipped by the hash function is 11/n1-1/n. Applying our assumption that the kk hash functions are 'independent,' the probability that the specific bit is not flipped by any of the hash functions is (11/n)k(1-1/n)^k.

Recall the calculus fact lim(11/n)n=e1\lim_{\infty} (1-1/n)^n = e^{-1}. That is, as we increase the number of bits that our Bloom filter uses, the approximate probability that a given bit is not flipped by any of the kk hash functions is ek/ne^{-k/n}.

Suppose that \ell entries have been added to the Bloom filter. The probability that a specific bit is still 0 after the \ell entries have been added is approximately ek/ne^{-\ell k/n}. The probability that a queried element is erroneously claimed as a member of the digest is approximately (1ek/n)k(1-e^{-\ell k/n})^k.

The following table provides concrete values for these approximations.

nnkk\ell(1ek/n)k(1-e^{-\ell k/n})^k
32330.01474
32370.11143
323120.30802
323170.50595
323280.79804

Notice that the probability of false positives increases as the number of elements (\ell) that have been added to the digest increases.

Sliding-Window Bloom filter

Our toy example and table illustrated an issue concerning Bloom filters. The number of entries that can be added to a Bloom filter is restricted by our choice of kk and nn. Not only does the probability that false positives will occur increase, but it is possible that our vector v{\bf{v}} can be a string of all 1s. Szepieniec and Værge proposed a modification to Bloom filters to handle this.

Instead of having a fixed number of bits for our Bloom filter, we dynamically allot memory based on the number of entries that have been added to the filter. Given a predetermined threshold (bb) for the number of entries, we shift our 'window' of flipping bits by ss bits. Note that this means that it is necessary to keep track of when a given entry is added to the digest. This means that querying the Sliding-Window Bloom filter will yield different results when different timestamps are used.

This can be done with kk hash functions as we used earlier. Alternatively, Szepieniec and Værge proposed using the same hash function but to produce kk entries in the current window. Specifically, we obtain the bits we wish to flip to 1s by computing h(Xi)h(X || i) for each i{0,,k1}i \in \{0,\dots, k-1\} and XX as we will define next. For Sliding-Window Bloom filters, XX is more than just the element we wish to append to the filter. Instead, XX consists of the element xx and a timestamp tt. The timestamp tt is used to locate the correct window for bits, as we see below:

  • Append: Suppose that we wish to add the element xx with timestamp tt to the Sliding-Window Bloom filter.
    • Define the vector b{0,,n1}k{\bf{b}} \in \{0,\dots,n-1\}^k so that b[i]:=h(xti){\bf{b}}[i] := h(x||t||i) for each i{0,,k1}i \in \{0,\dots,k-1\}.
    • Update the binary string v[b[i]+t/bs]1{\bf{v}}[{\bf{b}}[i]+\lfloor t/b \rfloor s] \leftarrow 1 for each i{0,,k1}i \in \{0,\dots,k-1\}.
  • Query: Suppose that we wish to query the Bloom filter for element yy with timestamp tt.
    • Return 1 provided v[h(yti)+t/bs]=1{\bf{v}}[h(y||t||i) + \lfloor t/b \rfloor s] = 1 for every i{0,,k1}i \in \{0,\dots,k-1\}. Otherwise, return 0.

By incorporating a shifting window, we maintain efficient querying and appending at the cost of constant space. However, by losing constant space, we gain 'infinite' scalability.

Cuckoo filters

A Cuckoo filter is a data structure for probabilistic membership proofs based on Cuckoo hash tables. The specific design goal for Cuckoo filters is to address the inability to remove elements from a Bloom Filter. This is done by replacing a list of bits with a list of 'fingerprints.' A fingerprint can be thought of as the hash value for an entry in the digest. A Cuckoo filter is a fixed-length list of 'fingerprints.' If the maximum number of entries that a Cuckoo filter can hold is nn and a fingerprint occupies ff bits, then the Cuckoo filter occupies nfnf bits.

Now, we describe the algorithms associated with the Cuckoo filter CC with hash function hash(X)hash(X) and fingerprint function fingerprint(X)fingerprint(X).

  • Append: Suppose that we wish to add the element xx to the Cuckoo filter.

    • If either position ix:=hash(x)i_x := hash(x) or jx:=ihash(fingerprint(x))j_x := i \otimes hash(fingerprint(x)) of CC is empty, then fingerprint(x)fingerprint(x) is inserted into an empty position.
    • If both ixi_x and jxj_x are occupied with a fingerprint that is distinct from fingerprint(x)fingerprint(x), then we select either ixi_x or jxj_x to insert fingerprint(x)fingerprint(x). The fingerprint that had previously occupied this position cannot be discarried. Instead, we insert this fingerprint into its alternate location. This reshuffling process either ends with fingerprints all having their own bucket or one that cannot be inserted. In the case that we have a fingerprint that cannot be inserted, then the Cuckoo filter is overfilled.
  • Query: Suppose that we wish to query the Cuckoo filter for element yy.

    • Return 1 provided fingerprint(y)fingerprint(y) is either in position iyi_y or jyj_y.
  • Delete: Suppose that we wish to delete the element yy from the Cuckoo filter.

    • If yy has been added to the Cuckoo filter, then fingerprint(y)fingerprint(y) is either in position iyi_y or jyj_y. We remove fingerprint(y)fingerprint(y) from the appropriate position.

We note that false positives in Cuckoo filters only occur when an element shares a fingerprint and hash with a value that has already been added to the Cuckoo filter.

Example

In this example, we will append the words addadd, sumsum, and equalequal to a Cuckoo filter with 8 slots.

For each word xx, we compute two indices: ix:=hash(x) and jx:=hash(x)hash(fingerprint(x)).i_x := hash(x) \text{ and } j_x := hash(x) \otimes hash(fingerprint(x)). Suppose that we have the following values for our words:

wordixi_xjxj_x
addadd(0,1,0)(0,1,0)(1,0,0)(1,0,0)
sumsum(1,0,1)(1,0,1)(1,1,0)(1,1,0)
equalequal(0,1,0)(0,1,0)(1,0,1)(1,0,1)

For clarity of the example, we append the words directly to the buckets instead of fingerprints of our data.

01234567
append addaddaddadd
append sumsumaddaddsumsum

Notice that both of the buckets (2 and 5) that equalequal can map to are occupied. So, we select one of these buckets (say 2) to insert equalequal into. Then, we have to insert addadd to its possible bucket (1). This leaves us with the Cuckoo filter:

01234567
addaddequalequalsumsum

Complexity

Notice that deletions and queries to Cuckoo filters are done in constant time. Specifically, only two locations need to be checked for any data xx. Appends may require shuffling previously added elements to their alternate locations. As such, the append does not run in constant time.

Bloom filters vs Cuckoo filters

The design of Bloom filters is focused on space efficiency and quick query time. Even though they occupy constant space, Cuckoo filters require significantly more space for nn items than Bloom filters. The worst-case append in a Cuckoo filter is slower than the append in a Bloom filter. However, an append that does not require any shuffling in a Cuckoo filter can be quicker than appends in Bloom filters. Cuckoo filters make up for these disadvantages with quicker query time and the ability to delete entries. Further, the probability of false positives in Cuckoo filters is lower than the probability of false positives in Bloom filters.

Combining Filters with RLN

In a series of posts (1,2,3), various versons of rate limiting nullifiers (RLN) that are used by Waku has been discussed. RLN uses a sparse Merkle tree for the membership set. The computational power required to construct the Merkle tree prevent light clients from participating in verifying membership proofs. In Verifying RLN Proofs in Light Clients with Subtrees, it was proposed to move the membership set on-chain so that it would not be necessary for a light client to construct the entire Merkle tree locally. Unfortunately, the naive approach is not practical as the gas limit for a single call is too restrictive for an appropriately sized tree. Instead, it was proposed to make utilize of subtrees. In this section, we provide a discussion of an alternate solution for light clients by using filters for the membership set. The two parts of RLN that we will focus on are user registration and deletion.

Both Bloom and Cuckoo filters support user registration as this is can be done as an append. The fixed size of these filters would restrict the total number of users that can register. This can be migitated by using Sliding-Window Bloom filter as this supports system growth. The Sliding-Window can be adapted to Cuckoo filters as well. In the case of a Sliding-Window filter, an user would maintain the epoch of when they registered. The registration of new users to Bloom filters can be done in constant time which is a significant improvement over appending to subtrees. Unfortunately, the complexity of registration to Cuckoo filters cannot be as easily computed.

A user could be slashed from the RLN by sending too many messages in a given epoch. Unfortunately, Bloom filters do not support the deletion of members. Luckily, Cuckoo filters allow for deletions that can performed in constant time.

Cuckoo filter that use Sliding-Window could be used so that light clients are able to verify proofs of membership in the RLN. These proofs are not a substitute to the usual proofs that a heavy client can verify due to the allowance of false positives. However, with the allowance of false positives, a light client can participate in verification RLN proofs in an efficient manner.

References