I woke up this morning, got out of bed and promptly stepped in a Bloom filter. On my way to work I drove by a dead Bloom filter on the side of the road, bloated and surrounded by clouds of buzzing flies. Seems like they're suddenly everywhere, so maybe it's time to come up with a Lisp implementation.
Never heard of a Bloom filter? This post is mostly about code, but here's the super brief intro: They're like hash tables used for keeping track of set membership, only (i) they're probabilistic, (ii) you can't easily remove keys and (iii) they use a fraction of the memory of a standard hash table. Beyond that, you'll want to look at Maciej Ceglowski's “Using Bloom Filters” article on perl.com, which is a good tutorial and also lists several references.
Our Bloom filter class will consist of three things: the filter's bitmap, the maximum size of the bitmap, and a list of the hash functions the filter is using.
(defclass bloom-filter () ((bitmap :accessor bitmap :initform 0 :initarg :bitmap) (hash-functions :accessor hash-functions :initform nil :initarg :hash-functions) (size :accessor size :initform nil :initarg :size)))
Adding a key to the filter is simple: Construct a bitmap from the key, OR it with the filter's current bitmap and then assign the result back to the filter's current bitmap.
(defmethod add ((self bloom-filter) key) "Adds a key to a Bloom filter." (setf (bitmap self) (logior (bitmap self) (make-bitmap self key))))
Testing for key membership is also simple: AND the key's bitmap with the filter's current bitmap, and if the result is equal to the key's bitmap then the key is (probably) in the filter.
(defmethod contains ((self bloom-filter) key) "Returns T if the specified key is in the Bloom filter." (let ((bitmap (make-bitmap self key))) (= (logand (bitmap self) bitmap) bitmap)))
So how do we construct a bitmap from a key? We map over the hash functions, use the value returned by each hash function for the key as an index into the filter's bitmap, and turn on the bit at those indices. The bitmap itself will be represented as a bignum, and we'll use ldb to access individual bits.
The SHA1 algorithm will be used as our source of good hash bits. The particular SHA1 implementation I'm using, by Nathan Froyd, returns its 160 bits of hash in the form of a vector containing 20 byte values:
CL(10): (sha:sha1sum-sequence "foo") #(11 238 199 181 234 63 15 219 201 93 13 212 127 60 91 194 117 218 138 51)
We'll divide these 20 bytes into 5 blocks of 4 bytes each, consider each block as its own 32 bit number, XOR the numbers together, and use the resulting value as an index into the bitmap (modulo the bitmap size). We'll turn on one bit in the bitmap for each hash function.
(defmethod make-bitmap ((self bloom-filter) key) (flet ((subproduct (index hash-seq) (* (elt hash-seq index) (elt hash-seq (+ index 1)) (elt hash-seq (+ index 2)) (elt hash-seq (+ index 3))))) (let ((vector 0)) (dolist (hash-function (hash-functions self)) (let* ((hash-bytes (funcall hash-function key)) (hashes (list (subproduct 0 hash-bytes) (subproduct 4 hash-bytes) (subproduct 8 hash-bytes) (subproduct 12 hash-bytes) (subproduct 16 hash-bytes))) (combined-hash (apply #'logxor hashes))) (let ((index (mod combined-hash (size self)))) (setf (ldb (byte 1 index) vector) 1)))) vector)))
[Update 2004/08/10]: Bill Clagett pointed out that the above method's subproduct function wasn't correctly building 32 bit numbers out of blocks of 5 bytes. Here's the corrected version. Thanks, Bill! Note that I'm too lazy to go back and generate new sample output, so what you see might be different from what I have here.
(defmethod make-bitmap ((self bloom-filter) key) (flet ((subproduct (index hash-seq) (reduce #'(lambda (a b) (logior (ash a 8) b)) hash-seq :start index :end (+ index 4)))) (let ((vector 0)) (dolist (hash-function (hash-functions self)) (let* ((hash-bytes (funcall hash-function key)) (hashes (list (subproduct 0 hash-bytes) (subproduct 4 hash-bytes) (subproduct 8 hash-bytes) (subproduct 12 hash-bytes) (subproduct 16 hash-bytes))) (combined-hash (apply #'logxor hashes))) (let ((index (mod combined-hash (size self)))) (setf (ldb (byte 1 index) vector) 1)))) vector)))
To turn our single source of hash bits, SHA1, into the multiple hash functions that Bloom filters require, we will use multiple “salts” that we prefix to our keys before hashing them.
(defun make-hashing-functions (salts) (mapcar #'(lambda (salt) #'(lambda (seq) (sha:sha1sum-sequence (concatenate 'string salt seq)))) salts))
We can use arbitrary strings as salts:
CL(11): (make-hashing-functions '("aa" "bb" "cc")) (#<Closure (:INTERNAL MAKE-HASHING-FUNCTIONS 0) @ #x572b49a> #<Closure (:INTERNAL MAKE-HASHING-FUNCTIONS 0) @ #x572b4b2> #<Closure (:INTERNAL MAKE-HASHING-FUNCTIONS 0) @ #x572b4ca>)
Now we can finally construct some filters and play around with them.
CL(12): (defparameter *filter1* (make-instance 'bloom-filter :size 10 :hash-functions (make-hashing-functions '("a" "b" "c")))) *FILTER1* CL(13): (add *filter1* "P. J. Harvey") 148 CL(14): (contains *filter1* "P. J. Harvey") T CL(15): (contains *filter1* "The Sounds") NIL
Since we specified that the filter should only use 10 bits, we can easily demonstrate the probabilistic nature of the Bloom filter. Let's add a few more keys:
CL(16): (add *filter1* "Royal Trux") 725 CL(17): (add *filter1* "Moldy Peaches") 981 CL(18): (add *filter1* "N*E*R*D*") 981
Now let's check some keys that we haven't entered:
CL(19): (contains *filter1* "Beth Orton") NIL CL(20): (contains *filter1* "Rick James") NIL CL(21): (contains *filter1* "Rick Springfield") T
Oops. Well, that's what you get for using just 10 bits of memory.
Further work: Write a LOAF implementation in Lisp. Check which has the best performance: Using a bignum to hold the bitmap (can use straightforward logior and logand to add and compare bitmaps) or using a bit vector (have to iterate over bits manually).Posted by jjwiseman at August 07, 2004 05:25 PM