Sunday, March 8, 2015

HBase Scanning Internals - SEEK vs SKIP

By Lars Hofhansl, March 8th, 2015

Recently we ran into a problem where a mapper that scanned a region of about 10GB with a timerange that did not include any Cell (KeyValue) took upwards of 20 minutes to finish; it processed only about 8MB/s.

It turns out this was a known problem that has eluded a fix for while: Deciding at the scanner level whether to SEEK ahead past potentially many Cells or whether to power through and repeatedly SKIP the next Cell until the next useful Cell is reached.

Background

Scanning forward through a file, HBase has no knowledge about how many columns are to follow for the current row or how many versions there are for the current column (remember that every version of every column in HBase has its own Cell in the HFiles).

In order to deal with many columns or versions, HBase can issue SEEKs to the next row (seeking past all versions for all remaining columns for the row) or the next column (seeking past all remaining versions). HBase errs on the side of SEEK'ing frequently since SKIP'ing potentially 1000' or 100000's of times can be disastrous for performance (imagine a row with 100 columns and 10000 versions each - unlikely, but possible).

The problem is: SEEK'ing is about 10x as expensive as a single SKIP - depending on how many seek pointers into HFiles have to be reset.

Yet, in many cases we have rows with only a few or even just one column and just one version each. Now the SEEKs will cause a significant slowdown.

After much trying finally there is a proposed solution:
HBASE-13109 Make better SEEK vs SKIP decisions during scanning
(0.98.12 and later)

How does it work?

HFiles are organized like B-Trees, and it is possible to determine the start key of the next block in each file.

A heuristic is now:
Will the SEEK we are about to execute get us into the next block of the HFile that is at top of the heap used for the merge sorting between the HFiles?

If so, we will definitely benefit from seeking (the repeated SKIPs would eventually exhaust the current block and load the next one anyway).

If not, we'll likely benefit from repeated SKIP'ing. This is a heuristic only, the SEEK might allow us to seek past manys Cell in HFiles not at the top of the heap, but that is unlikely.

In all tests I did this performs equal to or (much) better than the current behavior.

The upshot is that now the HBase plumbing (filters, coprocessors, delete marker handling, skipping columns, etc) can continue to issue SEEKs where that is logically possible, and then at the scanner level it can decide whether to act upon the SEEK or to keep SKIP'ing inside the current block, with almost optimal performance.

TL;DR:
HBASE-13109 Allows many queries against HBase to execute 2-3x faster. Especially those that select specific columns, or those that filter on timeranges, or where many deleted columns for column families are encountered.
Queries that request all columns and that do not filter the data in any way will not benefit from this change.

You do need to do anything to get that benefit (other than upgrading your HBase to at least the upcoming 0.98.12).