Bloom Filters with ORC Files

Published Jul 31, 2022  ∙  Updated Aug 1, 2022

A Bloom filter is a space-efficient, probabilistic data structure that is used to test whether an element is a member of a set.

Given an element, a Bloom filter index will return whether it is:

  1. Definitely not in a set, or
  2. Possibly in a set

Let’s go through a sample scenario.

Suppose we are storing Optimized Row Columnar (ORC) format files in S3. We can easily query these ORC files using standard SQL in Athena.

We’ll get to how Bloom filters work, but let’s first see how we might arrive at the decision to use them.

In our case, it all starts with speeding up Athena queries.

How can we speed up Athena queries of ORC files?

1. Partition pruning

One way we might be able to speed up Athena queries is to use partition pruning, where Athena prunes the dataset (e.g. a table with partition columns) to only the partitions that apply to a query.

Partition pruning reduces the number of files Athena needs to open and close for a query.

2. Predicate pushdown

Another way to speed up Athena queries is to use predicate pushdown, or predicate filtering, which enables Athena to reduce the number of rows fetched in an ORC file for the query.

The columnar nature of the ORC format allows us to avoid reading unnecessary columns, but predicate pushdown allows us to avoid reading unnecessary rows.

Intro to ORC file indexes

Data in ORC files is divided into stripes, each of which contains many rows.

ORC provides three levels of indexes within each file to determine whether to read or skip chunks of data:

  1. File level: column statistics across the entire file
  2. Stripe level: column statistics for each stripe within a file
  3. Row level: column statistics for each row group (set of 10,000 rows within a stripe)

Column statistics may include column-level aggregates (when applicable) such as count, sum, min, and max. They may also record whether the column contains null values.
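
We can inspect this layout directly. Here’s a minimal sketch using pyarrow’s ORC reader (assuming a local file named products.orc as a placeholder; the attribute names come from pyarrow and are worth double-checking against your version):

import pyarrow.orc as orc

# open an ORC file and inspect its layout
f = orc.ORCFile("products.orc")
print(f.nrows)             # total rows in the file
print(f.nstripes)          # number of stripes
print(f.row_index_stride)  # rows per row group (10,000 by default)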

Read more about the ORC file structure in the Apache ORC documentation.

SQL query example

Suppose we have predicate pushdown enabled: set hive.optimize.ppd = true.

SELECT SUM(cost) FROM products
WHERE purchase_date BETWEEN '2022-07-29' AND '2022-07-30';

A trivial query engine implementation would scan the entire dataset, deserialize cost and purchase_date, apply the predicate on purchase_date, and sum the matching rows.

A predicate is a boolean expression that evaluates to TRUE, FALSE, or UNKNOWN. In a SQL query, predicates appear in the WHERE and HAVING clauses.

If we run the query above with predicate pushdown, the predicate (i.e. the WHERE clause) will get executed by the scan operator (i.e. pushing the predicate to the scan) using ORC indexes.

For example, suppose the scan operator encounters a stripe with an index of:

  • purchase_date.min=2022-01-01, and
  • purchase_date.max=2022-01-02

It can deduce that the predicate will always evaluate to false for this query and skip the stripe entirely.
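
This pruning logic is easy to sketch in code. Here’s a hypothetical version of the stripe-level min/max check (the function and argument names are made up for illustration):

def can_skip_stripe(stripe_min, stripe_max, pred_lo, pred_hi):
    # a BETWEEN predicate can only match if the stripe's [min, max]
    # range overlaps the predicate's [lo, hi] range; skip when it can't
    return stripe_max < pred_lo or stripe_min > pred_hi

# stripe statistics from the example above vs. the query's date range
can_skip_stripe("2022-01-01", "2022-01-02",
                "2022-07-29", "2022-07-30")  # True: skip the whole stripe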

While the columnar nature of the ORC format reduces the number of columns read, predicate pushdown reduces the number of rows read, resulting in a massive reduction of file and disk I/O per query. The performance gain due to lower I/O is inversely proportional to the selectivity (i.e. the percentage of matching rows).

3. Bloom filters

One more way to speed up queries is with Bloom filters, which were added to the ORC format in Hive 1.2.0.

Alongside the standard indexes that are created with every ORC file (e.g. sum, min, max), predicate pushdown can use Bloom filter indexes to further reduce the number of rows read.

Bloom filters guarantee no false negatives, so we can use them to test whether an element is definitely not present in a set. From there, the ORC file reader can decide whether to skip an entire file, stripe, or row group.

How does a Bloom filter work?

A Bloom filter is a data structure that can tell us, rapidly and space-efficiently, whether an element is present in a dataset.

However, in order to be rapid and space-efficient, Bloom filters are designed to be a probabilistic data structure.

Bloom filters are based on simple bit arrays. Suppose we have a bit array of size m=8.

Value 0 0 0 0 0 0 0 0
Index 0 1 2 3 4 5 6 7

When we add an element to the Bloom filter, we first calculate k hashes of this element.

Let’s say we want to use k=2 hash functions.

The results of these 2 hash functions are the indexes of the bit array whose values will be set to 1.

Let’s store the string "hello" into our dataset. Suppose we’re using FNV and Murmur for our 2 hash functions.

FNVHash(hello) = 6
MurmurHash(hello) = 0

The Bloom filter would be updated accordingly.

Value 1 0 0 0 0 0 1 0
Index 0 1 2 3 4 5 6 7

To test for membership, we simply hash the value with the same hash functions and check if those bits are set in the bit array.

If they are not set, we know the element definitely is not in our dataset.

If they are set, we know the element might be in the dataset. This is why Bloom filters are probabilistic: we get a false positive when every bit checked for an element happens to have been set by other elements.
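
Here’s a toy Bloom filter in Python that mirrors this walkthrough. The two hash functions are stand-ins derived from hashlib rather than the FNV and Murmur implementations ORC uses:

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m           # number of bits
        self.k = k           # number of hash functions
        self.bits = [0] * m

    def _indexes(self, item):
        # simulate k hash functions by seeding a digest with each function's number
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present
        return all(self.bits[i] for i in self._indexes(item))

bf = BloomFilter(m=8, k=2)
bf.add("hello")
bf.might_contain("hello")    # True: possibly in the dataset
bf.might_contain("goodbye")  # False (definitely absent), unless its bits collide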

To reduce the probability of collisions, or false positive probability (FPP), we can increase the number of bits in our Bloom filter.

The size of a Bloom filter depends on the number of elements in the dataset and the specified FPP. The lower the FPP, the more accurate the filter will be, at the cost of more disk space.

For ORC files, the FPP is set to 0.05 by default. This means that 5% of the time, some chunk of indexed data (e.g. a file, stripe, or row group) will be scanned unnecessarily.

Computing the optimal configuration

Adjusting the number of bits and hash functions in our Bloom filters seems like a tough optimization problem, but we luckily have a few formulas on hand.

Note that to use Bloom filters with ORC, all we need to set are the properties orc.bloom.filter.columns and orc.bloom.filter.fpp. The optimal number of bits and optimal number of hash functions will be internally computed based on the formulas presented below.
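
For instance, if you write ORC files from Python, pyarrow’s ORC writer exposes equivalent options. A minimal sketch (assuming a recent pyarrow, where bloom_filter_columns and bloom_filter_fpp map onto the ORC properties above):

import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    "product_id": [1, 2, 3],
    "cost": [9.99, 4.50, 12.00],
})

# build Bloom filter indexes on product_id with a 5% false positive probability
orc.write_table(
    table,
    "products.orc",
    bloom_filter_columns=["product_id"],
    bloom_filter_fpp=0.05,
)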

1. Number of bits

Given the following:

  • n: size of dataset to be represented by the Bloom filter (e.g. 10,000)
  • p: acceptable false positive probability between 0 and 1 (e.g. 0.05)

We can calculate m, the optimal number of bits in the Bloom filter:

m = -n · ln(p) / (ln 2)²

Intuitively, m will need to increase as p decreases.

Similarly, m will need to increase as n increases.
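
As a quick sanity check, here’s that formula in Python (a toy calculation with example values, not ORC’s internal code):

import math

n = 10_000  # number of elements in the dataset
p = 0.05    # acceptable false positive probability

# optimal number of bits: m = -n * ln(p) / (ln 2)^2
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
print(m)  # 62353 bits, roughly 7.6 KiB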

2. Number of hash functions

Given the following:

  • n: size of dataset to be represented by the Bloom filter (e.g. 10,000)
  • m: the number of bits in the Bloom filter

We can calculate k, the optimal number of hash functions in the Bloom filter.

Why isn’t the optimal k value some huge number? Because if we have too many hash functions, we’ll set nearly all the bits in our Bloom filter to 1, resulting in a ~100% false positive rate.

For a given m and n, the value of k that minimizes the false positive probability is:

k = (m / n) · ln 2
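
Continuing the example above in Python (again, just a sketch; ORC performs this computation internally):

import math

n = 10_000  # number of elements in the dataset
m = 62_353  # number of bits from the previous calculation

# optimal number of hash functions: k = (m / n) * ln 2
k = round(m / n * math.log(2))
print(k)  # 4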

See the calculations in action using this Bloom Filter Calculator.

Bloom filter vs. Set

Hold on. We already know of a data structure that can answer whether or not an element exists in a dataset.

A set.

However, Bloom filters have a few advantages over sets.

They are space-efficient, space constant, and time constant.

Space efficiency. Once created, a Bloom filter’s size is fixed and independent of how much data it holds: a filter of m bits takes up the same amount of space whether it holds 0 elements, 10^4 elements, or 10^30 (though packing in far more elements than it was sized for drives up the false positive rate).

The amount of space a Bloom filter takes up is up to the user, based on the acceptable false positive probability.

Space constant. When we save data to a set, we have to actually store the data somewhere. If we store "abcdefghijklmnopqrstuvwxyz" into a set, we use 26 bytes of space. With a Bloom filter, however, adding an element only ever sets at most k bits in the m-bit array, whether that element is a single integer or a 99-byte string.

That being said, we will, of course, need a place to store the data itself. In our scenario above, the Bloom filters are stored in the ORC files that reside in S3.

Time constant. All Bloom filter operations run in constant time (computing k hashes), which is not the same as the amortized constant time of a hash set. If a hash set has many collisions, a lookup can degrade to O(n) time.