August 07, 2004

Bloom Filters

I woke up this morning, got out of bed and promptly stepped in a Bloom filter. On my way to work I drove by a dead Bloom filter on the side of the road, bloated and surrounded by clouds of buzzing flies. Seems like they're suddenly everywhere, so maybe it's time to come up with a Lisp implementation.

Never heard of a Bloom filter? This post is mostly about code, but here's the super brief intro: They're like hash tables used for keeping track of set membership, only (i) they're probabilistic, (ii) you can't easily remove keys and (iii) they use a fraction of the memory of a standard hash table. Beyond that, you'll want to look at Maciej Ceglowski's “Using Bloom Filters” article on perl.com, which is a good tutorial and also lists several references.

Code time.

Our Bloom filter class will consist of three things: the filter's bitmap, the maximum size of the bitmap, and a list of the hash functions the filter is using.

(defclass bloom-filter () ((bitmap :accessor bitmap :initform 0 :initarg :bitmap) (hash-functions :accessor hash-functions :initform nil :initarg :hash-functions) (size :accessor size :initform nil :initarg :size)))

Adding a key to the filter is simple: Construct a bitmap from the key, `OR` it with the filter's current bitmap and then assign the result back to the filter's current bitmap.

(defmethod add ((self bloom-filter) key) "Adds a key to a Bloom filter." (setf (bitmap self) (logior (bitmap self) (make-bitmap self key))))

Testing for key membership is also simple: `AND` the key's bitmap with the filter's current bitmap, and if the result is equal to the key's bitmap then the key is (probably) in the filter.

(defmethod contains ((self bloom-filter) key) "Returns T if the specified key is in the Bloom filter." (let ((bitmap (make-bitmap self key))) (= (logand (bitmap self) bitmap) bitmap)))

So how do we construct a bitmap from a key? We map over the hash functions, use the value returned
by each hash function for the key as an index into the filter's bitmap, and turn on the bit at those indices. The bitmap itself will be represented as a bignum, and we'll use `ldb` to access individual bits.

The SHA1 algorithm will be used as our source of good hash bits. The particular SHA1 implementation I'm using, by Nathan Froyd, returns its 160 bits of hash in the form of a vector containing 20 byte values:

CL(10): (sha:sha1sum-sequence "foo") #(11 238 199 181 234 63 15 219 201 93 13 212 127 60 91 194 117 218 138 51)

We'll divide these 20 bytes into 5 blocks of 4 bytes each, consider each block as its own 32 bit number, `XOR` the numbers together, and use the resulting value as an index into the bitmap (modulo the bitmap size). We'll turn on one bit in the bitmap for each hash function.

(defmethod make-bitmap ((self bloom-filter) key) (flet ((subproduct (index hash-seq) (* (elt hash-seq index) (elt hash-seq (+ index 1)) (elt hash-seq (+ index 2)) (elt hash-seq (+ index 3))))) (let ((vector 0)) (dolist (hash-function (hash-functions self)) (let* ((hash-bytes (funcall hash-function key)) (hashes (list (subproduct 0 hash-bytes) (subproduct 4 hash-bytes) (subproduct 8 hash-bytes) (subproduct 12 hash-bytes) (subproduct 16 hash-bytes))) (combined-hash (apply #'logxor hashes))) (let ((index (mod combined-hash (size self)))) (setf (ldb (byte 1 index) vector) 1)))) vector)))

[Update 2004/08/10]: Bill Clagett pointed out that the above method's `subproduct` function wasn't correctly building 32 bit numbers out of blocks of 5 bytes. Here's the corrected version. Thanks, Bill! Note that I'm too lazy to go back and generate new sample output, so what you see might be different from what I have here.

(defmethod make-bitmap ((self bloom-filter) key) (flet ((subproduct (index hash-seq) (reduce #'(lambda (a b) (logior (ash a 8) b)) hash-seq :start index :end (+ index 4)))) (let ((vector 0)) (dolist (hash-function (hash-functions self)) (let* ((hash-bytes (funcall hash-function key)) (hashes (list (subproduct 0 hash-bytes) (subproduct 4 hash-bytes) (subproduct 8 hash-bytes) (subproduct 12 hash-bytes) (subproduct 16 hash-bytes))) (combined-hash (apply #'logxor hashes))) (let ((index (mod combined-hash (size self)))) (setf (ldb (byte 1 index) vector) 1)))) vector)))

To turn our single source of hash bits, SHA1, into the multiple hash functions that Bloom filters require, we will use multiple “salts” that we prefix to our keys before hashing them.

(defun make-hashing-functions (salts) (mapcar #'(lambda (salt) #'(lambda (seq) (sha:sha1sum-sequence (concatenate 'string salt seq)))) salts))

We can use arbitrary strings as salts:

CL(11): (make-hashing-functions '("aa" "bb" "cc")) (#<Closure (:INTERNAL MAKE-HASHING-FUNCTIONS 0) @ #x572b49a> #<Closure (:INTERNAL MAKE-HASHING-FUNCTIONS 0) @ #x572b4b2> #<Closure (:INTERNAL MAKE-HASHING-FUNCTIONS 0) @ #x572b4ca>)

Now we can finally construct some filters and play around with them.

CL(12): (defparameter *filter1* (make-instance 'bloom-filter :size 10 :hash-functions (make-hashing-functions '("a" "b" "c")))) *FILTER1* CL(13): (add *filter1* "P. J. Harvey") 148 CL(14): (contains *filter1* "P. J. Harvey") T CL(15): (contains *filter1* "The Sounds") NIL

Since we specified that the filter should only use 10 bits, we can easily demonstrate the probabilistic nature of the Bloom filter. Let's add a few more keys:

CL(16): (add *filter1* "Royal Trux") 725 CL(17): (add *filter1* "Moldy Peaches") 981 CL(18): (add *filter1* "N*E*R*D*") 981

Now let's check some keys that we haven't entered:

CL(19): (contains *filter1* "Beth Orton") NIL CL(20): (contains *filter1* "Rick James") NIL CL(21): (contains *filter1* "Rick Springfield") T

Oops. Well, that's what you get for using just 10 bits of memory.

Further work: Write a LOAF implementation in Lisp. Check which has the best performance: Using a bignum to hold the bitmap (can use straightforward `logior` and `logand` to add and compare bitmaps) or using a bit vector (have to iterate over bits manually).

Comments

This is probably the hippest example toy program ever.

Posted by: rps on August 7, 2004 07:01 PMNice! This program would also make a good post to the small-cl-src sourcecode-only mailing list, http://www.cliki.net/small-cl-src

Posted by: Luke Gorrie on August 9, 2004 05:22 AMYou can remove items by keeping a ref-count of collisions. Use a vector of nibbles, rather than bits. Increment/decrement the nibble when adding/removing a key.

test2

Post a comment