Monday, July 29, 2013

HBase and data locality

By Lars Hofhansl

A quick note about data locality.

In distributed systems data locality is an important property. Data locality, here, refers to the ability to move the computation close to where the data is.

This is one of the key concepts of MapReduce in Hadoop. The distributed file system provides the framework with locality information, which is then used to split the computation into chunks with maximum locality.

When data is written in HDFS, one copy is written locally, another is written to another node in a different rack (if possible) and a third copy is written to another node in the same rack. For all practical purposes the two extra copies are written to random nodes in the cluster.

In typical HBase setups a RegionServer is co-located with an HDFS DataNode on the same physical machine. Thus every write is written locally and then to the two nodes as mentioned above. As long as the regions are not moved between RegionServers there is good data locality: a RegionServer can serve most reads just from the local disk (and cache), provided short circuit reads are enabled (see HDFS-2246 and HDFS-347).
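
Short-circuit reads are not on by default. As a minimal, hedged sketch, the client-side settings look roughly like this (the exact keys depend on the Hadoop version - these are from the HDFS-347 based implementation - and in a real deployment they belong in hdfs-site.xml/hbase-site.xml on the RegionServer nodes rather than in code):

import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfig {
  // Sketch only: enable short-circuit reads for the HDFS client embedded
  // in the RegionServer. The socket path is an example and must match the
  // datanode's dfs.domain.socket.path.
  public static Configuration enableShortCircuit(Configuration conf) {
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
    return conf;
  }
}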

When regions are moved some data locality is lost and the RegionServers in question need to request the data over the network from remote DataNodes, until the data is rewritten locally (for example by running some compactions).
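
You can also trigger that proactively by requesting a major compaction of the affected tables yourself. A minimal sketch against the 0.94-era admin API (the table name is just a placeholder; majorCompact only queues the compaction, it does not wait for it):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RestoreLocality {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Rewriting all store files on the hosting RegionServers restores
    // data locality for the regions of this table.
    admin.majorCompact("mytable");
    admin.close();
  }
}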

There are four basic events that can potentially destroy data locality:
  1. The HBase balancer decides to move a region to balance data sizes across RegionServers.
  2. A RegionServer dies. All its regions need to be relocated to another server.
  2. A table is disabled and re-enabled.
  4. A cluster is stopped and restarted.
Case #1 is a trade-off: either a RegionServer hosts a disproportionately large portion of the data, or some of the data loses its locality.

In case #2 there is not much choice. That RegionServer (and likely its machine) died, so the regions need to be migrated somewhere else, where not all of their data is local (remember the "random" placement of the two extra replicas).

In case #4 HBase reads the previous locality information from the .META. table (whose regions - currently only one - are assigned to a RegionServer first), and then attempts to assign the regions of all tables back to the same RegionServers (fixed with HBASE-4402).

In case #3 the regions are reassigned in a round-robin fashion...

Wait, what? If a table is simply disabled and then re-enabled, its regions are just randomly assigned?

Yep. In a large enough cluster you'll end up with close to 0% data locality, which means all data for all requests is pulled remotely from another machine. In some setups that effectively amounts to an outage: a period of unacceptably slow read performance.
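
If you want to see how bad it actually got, you can estimate a region's locality by checking how many HDFS blocks of its store files have a replica on the local host (HBase also exposes a related RegionServer metric, hdfsBlocksLocalityIndex). A rough, hypothetical sketch using the HDFS client API - the /hbase/<table>/<region> directory layout and the plain hostname comparison are simplifying assumptions:

import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionLocality {
  // Fraction of HDFS blocks under a region directory that have a replica
  // on this host: 1.0 means fully local, 0.0 means everything is remote.
  public static double localityIndex(Configuration conf, Path regionDir) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    String localHost = InetAddress.getLocalHost().getHostName();
    long totalBlocks = 0, localBlocks = 0;
    for (FileStatus cf : fs.listStatus(regionDir)) {          // column family directories
      if (!cf.isDir()) continue;
      for (FileStatus hfile : fs.listStatus(cf.getPath())) {  // the store files
        if (hfile.isDir()) continue;
        for (BlockLocation loc : fs.getFileBlockLocations(hfile, 0, hfile.getLen())) {
          totalBlocks++;
          for (String host : loc.getHosts()) {
            if (host.equals(localHost)) { localBlocks++; break; }
          }
        }
      }
    }
    return totalBlocks == 0 ? 1.0 : (double) localBlocks / totalBlocks;
  }
}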

This was brought up on the mailing lists again this week and we're fixing the old issue now (HBASE-6143, HBASE-9080).

Tuesday, July 2, 2013

Protecting HBase against data center outages

By Lars Hofhansl

As HBase relies on HDFS for durability, many of HDFS's idiosyncrasies need to be considered when deploying HBase into a production environment.

HBASE-5954 is still not finished, and in any event it is not clear whether the penalty incurred by syncing each log entry (i.e. each individual change) to disk is an acceptable performance trade-off.

There are, however, a few practical steps that can be taken right now.
It is important to keep in mind how HBase works:
  1. new data is collected in memory
  2. the memory contents are written into new immutable files (flush)
  3. smaller files are combined into larger immutable files and then deleted (compaction)
#3 is interesting when it comes to power outages. As described here, data written to HDFS is typically only guaranteed to have reached N datanodes (3 by default), but not guaranteed to be persistent on disk. If the wrong N machines fail at the same time - as can happen during a data center power outage - recently written data can be lost.

In case #3 above, HBase rewrites an arbitrary amount of old data into new files and then deletes the old files. This in turn means that during a DC outage an arbitrary amount of data can be lost and, more importantly, that the data loss is not limited to the newest data.

Since version 1.1.1 HDFS supports syncing a closed block to disk (HDFS-1539). When used together with HBase this guarantees that new files are persisted to disk before the old files are deleted, and thus compactions no longer lead to data loss following DC outages.

Of course there is a price to pay. The client now needs to wait until the data is synced on the N datanodes, and, more problematically, this is likely to happen all at once when the data block (64MB or larger) is closed. I found that with this setting new file creation takes between 50 and 100ms!

This performance penalty can be alleviated to some extent by syncing the data to disk early and asynchronously. Syncing early has no detrimental effect here, as all data written by HBase is immutable (a delay is only beneficial if a block is changed multiple times).

Linux/POSIX can do this automatically with the proper fadvise and sync_file_range calls.
HDFS has supported these since version 1.1.0 (HDFS-2465). Specifically, dfs.datanode.sync.behind.writes is useful here, since it smooths out the sync cost as much as possible.

So... TL;DR: In our cluster we enable
  • dfs.datanode.sync.behind.writes
    and
  • dfs.datanode.synconclose
for HDFS, in order to get some amount of safety against DC power outages with acceptable performance.
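
To spell that out, here is a sketch of the two settings in code form (in a real deployment they belong in the datanodes' hdfs-site.xml, not in client code):

import org.apache.hadoop.conf.Configuration;

public class DataNodeSyncSettings {
  public static Configuration apply(Configuration conf) {
    // Sync data to disk in the background while a block is still being
    // written, smoothing out the sync cost (HDFS-2465).
    conf.setBoolean("dfs.datanode.sync.behind.writes", true);
    // fsync a block when it is closed, so newly compacted files are on
    // disk before the old files are deleted (HDFS-1539).
    conf.setBoolean("dfs.datanode.synconclose", true);
    return conf;
  }
}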

Monday, May 6, 2013

HBase durability guarantees

By Lars Hofhansl

Like most other databases HBase logs changes to a write ahead log (WAL) before applying them (i.e. making them visible).

I wrote in more detail about HDFS's flush and sync semantics here. Also check out HDFS-744 and HBASE-5954.

Recently I got back to this area in the code and committed HBASE-7801 to HBase 0.94.7+, which streamlines the durability API for the HBase client.
The existing API was a mess, with some settings configured only via the table on the server (such as deferred log flush) and some only via the client, per Mutation (i.e. Put or Delete), such as Mutation.setWriteToWAL(...). In addition some table settings could be overridden by a client and some could not.

HBASE-7801 includes the necessary changes on both server and client to support a new API: Mutation.setDurability(Durability).
A client can now control durability per Mutation - by passing a Durability enum - as follows:
  1. USE_DEFAULT: do whatever the default is for the table. This is the client default.
  2. SKIP_WAL: do not write to the WAL
  3. ASYNC_WAL: write to the WAL asynchronously as soon as possible
  4. SYNC_WAL: write WAL synchronously
  5. FSYNC_WAL: write to the WAL synchronously and guarantee that the edit is on disk (not currently supported)
Prior to this change only the equivalents of USE_DEFAULT and SKIP_WAL were supported.
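
For example, a client that is happy with slightly weaker guarantees for a particular write could do something like this (a sketch against the 0.94.7 client API; table, row and column names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DurabilityExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    // Override the table default for this mutation only: write the WAL
    // entry asynchronously rather than waiting for the sync.
    put.setDurability(Durability.ASYNC_WAL);
    table.put(put);
    table.close();
  }
}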

The changes are backward compatible. A server prior to 0.94.7 will gracefully ignore any client settings that it does not support.

The Mutation.setWriteToWAL(boolean) API is deprecated in 0.94.7+ and was removed in 0.95+.

I filed a follow-up jira, HBASE-8375, to do the same refactoring for the table descriptors.

Monday, April 22, 2013

HBaseCon 2013

I am lucky enough to be on the HBaseCon Program Committee again this year.

We received a great set of talks and we organized them into four tracks: Operations, Internals, Ecosystem, and Case Studies.

Today the Session List was announced:

http://blog.cloudera.com/blog/2013/04/hbasecon-2013-speakers-tracks-and-sessions-announced/

Obviously I am biased, but I think this will be a great event.

Friday, February 15, 2013

Managing HBase releases

Last year I (perhaps foolishly?) signed up to be the release manager for the 0.94 branch of HBase. I released 0.94.0 in May 2012. Since then I have learned a lot about the open source release process (which, in the end, is not that different from releasing proprietary software).

There are no defined responsibilities per se for such a role other than actually doing the release.

When I started, HBase had relatively infrequent releases, and releases were often discussed and delayed in order to get some more "essential" features in.
The partial cure for this is twofold:
  1. Frequent releases on a somewhat strict schedule. If a feature or fix does not get in, it'll be in the next release a few weeks later.
    This reduces the pressure to push a feature into the next point release.
    The only discussions now are typically around serious bugs that have been discovered during the round of release candidates.
    This is the "release train" model. The train stops every few weeks, changes that are ready get on board, the other changes wait for the next train.
  2. A passing, comprehensive test suite, so that we can do the frequent releases with confidence. Problems are identified early (if the tests fail regularly nobody will look into new test failures, or the new failures just drown in the noise of the failing tests).
We're now on HBase 0.94.5 (released today actually!), and the pattern that emerged after some initial adjustment is one point release (0.94.0, 0.94.1, ...) about every four to six weeks (depending on how many rounds of release candidates were needed), with a relatively constant rate of change of two fixes and improvements per day (hence a point release ends up having 60-80 changes).

As you can see HBase is pretty actively maintained!

So to me being the release manager includes all of the following:
  • Help decide what features or fixes should be included in the release.
  • Help channel the discussion about whether a feature in (unstable) trunk is important enough to be backported to 0.94.
  • Try to review all the changes that go into 0.94. Due to the rate of change I cannot have a detailed look at every fix (I have other responsibilities in my day job too), but I try to at least skim the changes to see if anything risky or incorrect sticks out.
  • Make sure the test suite passes reliably. This is a pet-peeve of mine and has been especially challenging, but we're now at a pass rate of about 70% (up from 20-30% a few months back, but still needs to be improved).
    (Note that many of the failures are due to timing issues in the virtual build machines, and not due to a bug in the HBase code base. A single failing test out of over 1800 tests will make the test suite fail. So 70% is not as bad as it sounds.)
  • Keep releases timely. This is my pet-peeve number two.
    Releases should be frequent, on a semi strict schedule, and backward compatible.
    That allows users to get features and fixes sooner and does not require cumbersome serial upgrades (where you need to upgrade from version 0.94.0 to 0.94.1 first in order to then upgrade from 0.94.1 to 0.94.2, and so on). Intermediary releases can be skipped (it is possible to upgrade from 0.94.0 to 0.94.5 directly).
    At the same time - as mentioned above - it allows developers to finish a feature or fix correctly rather than rushing it to "get it in", just because the next release would otherwise be six months away.
  • (Sometimes) coordinate with vendors (such as Cloudera and Hortonworks) to time a release or a fix with their releases. This is on a best-effort basis - the Apache release is independent of any vendor - but let's be honest: a significant fraction of our users run a release from these vendors.
  • Doing the actual release:
    • Tagging the release in SVN
    • Creating the release artifacts (currently we use the ones generated by the Jenkins build for this).
    • Going through a round of one or more RCs and getting other committers to test and vote for each RC. Here we need to improve with more automated integration tests.
    • Uploading the release to the official Apache mirrors.
    • Pushing the release to the Maven repository (which involves a lot of black voodoo).
0.94 is the current stable release branch of HBase. As long as the next version (0.96) does not have a stable release we will keep backporting new features to 0.94 and keep the frequent releases going.

So far this has been fun (with the occasional frustration about the flaky test suite in the past).

The HBase community is very friendly and invites outside patches and improvements. So download HBase 0.94.5, and start contributing :)

Wednesday, January 30, 2013

SQL on HBase

My Salesforce colleague and cubicle neighbor, James Taylor, just released Phoenix, a SQL layer on top of HBase, to the open source world.

Phoenix is implemented as a JDBC driver. It makes use of various HBase features such as coprocessors and filters to push predicates into the server as much as possible. Queries are parallelized across RegionServers.

Phoenix has a formal data model that includes making use of the row key structure for optimization.

Currently Phoenix is limited to single table operations.
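
Since it is plain JDBC, using Phoenix looks like any other Java database code. Here is a hypothetical sketch (the "jdbc:phoenix:<zookeeper quorum>" URL format and the sample table are assumptions; check the Phoenix documentation for the exact driver setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
  public static void main(String[] args) throws Exception {
    // The Phoenix JDBC driver registers itself when it is on the classpath;
    // the URL points at the ZooKeeper quorum of the HBase cluster.
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT host, COUNT(*) FROM web_stat GROUP BY host");
    while (rs.next()) {
      System.out.println(rs.getString(1) + ": " + rs.getLong(2));
    }
    conn.close();
  }
}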

Here's James' blog entry announcing Phoenix.

Friday, January 18, 2013

HBase region server memory sizing

Sizing a machine for HBase is somewhat of a black art.

Unlike a pure storage machine that would just be optimized for disk size and throughput, an HBase RegionServer is also a compute node.

Every byte of disk space needs to be matched with a fraction of a byte in the RegionServer's Java heap.

You can estimate the ratio of raw disk space to required Java heap as follows:

RegionSize / MemstoreSize *
ReplicationFactor * HeapFractionForMemstores

Or in terms of HBase/HDFS configuration parameters:

hbase.hregion.max.filesize /
hbase.hregion.memstore.flush.size *
dfs.replication *
hbase.regionserver.global.memstore.lowerLimit

Say you have the following parameters (these are the defaults in 0.94):
  • 10GB regions
  • 128MB memstores
  • HDFS replication factor of 3 
  • 40% of the heap used for the memstores

Then: 10GB/128MB*3*0.4 = 96.

Now think about this. With the default settings this means that if you wanted to serve 10T worth of disk space per region server you would need a 107GB Java heap!
Or if you give a region server a 10G heap you can only utilize about 1T of disk space per region server machine.
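
The arithmetic is easy to codify. A small sketch with the defaults plugged in (pure arithmetic, no HBase APIs involved):

public class HeapSizing {
  public static void main(String[] args) {
    double regionSizeMB = 10 * 1024;     // 10GB regions
    double memstoreSizeMB = 128;         // 128MB memstore flush size
    double replication = 3;              // HDFS replication factor
    double memstoreHeapFraction = 0.4;   // 40% of the heap for memstores

    // Ratio of raw disk space to required Java heap.
    double diskToHeap = regionSizeMB / memstoreSizeMB * replication * memstoreHeapFraction;
    System.out.println("disk-to-heap ratio: " + diskToHeap);   // 96

    double diskTB = 10;                  // raw disk space to serve
    double heapGB = diskTB * 1024 / diskToHeap;
    System.out.println("required heap (GB): " + heapGB);       // ~107
  }
}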

Most people are surprised by this. I know I was.

Let's double check:
In order to serve 10T worth of raw disk space - 3.3T of effective space after 3-way replication - with 10GB regions, you'd need ~338 regions. At 128MB each that's about 43GB. But only 40% of the heap is by default used for the memstores, so what you actually need is 43GB/0.4 ~ 107GB. Yep, it's right.

Maybe we can get away with a bit less by assuming that not all memstores are 100% full at all times. That is offset by the fact that not all regions will be exactly the same size or 100% filled.

Now. What can you do?
There are several options:
  1. Increase the region size. 20GB is about the maximum. Although some people claim they have 200GB regions. (hbase.hregion.max.filesize)
  2. Decrease the memstore size. Depending on your write load you can go smaller, 64MB or even less. (hbase.hregion.memstore.flush.size).
    You can allow a memstore to grow beyond this size temporarily.
    (hbase.hregion.memstore.block.multiplier)
  3. Increase the HDFS replication factor. That does not really help per se, but if you have more disk space than you can utilize, increasing the replication factor would at least put your disks to good use.
  4. Fiddle with the heap fractions used for the memstores. If your load is write-heavy, maybe up that to 50% of the heap (hbase.regionserver.global.memstore.upperLimit, hbase.regionserver.global.memstore.lowerLimit).
These parameters (except the replication factor, which is an HDFS setting) are described in the hbase-default.xml that ships with HBase.
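
As a hedged illustration, here are the knobs from the list above in code form (in practice you would set them in hbase-site.xml; the values are just the examples mentioned above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SizingTuning {
  public static Configuration tuned() {
    Configuration conf = HBaseConfiguration.create();
    // Larger regions: fewer regions (and less memstore heap) per TB of disk.
    conf.setLong("hbase.hregion.max.filesize", 20L * 1024 * 1024 * 1024);  // 20GB
    // Smaller memstores: less heap per region, but more frequent flushes.
    conf.setLong("hbase.hregion.memstore.flush.size", 64L * 1024 * 1024);  // 64MB
    // Give the memstores a larger share of the heap for write-heavy loads.
    conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.5f);
    conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.45f);
    return conf;
  }
}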

Personally I would place the maximum disk space per machine that can be served exclusively with HBase at around 6T, unless you have a very read-heavy workload.
For 6T the Java heap should be 32GB (20G regions, 128M memstores, the rest defaults). With MSLAB in 0.94 that works.

Of course your needs may vary. You may have a mostly read-only load, in which case you can shrink the memstores. Or the disk space might be shared with other applications.
Maybe you need smaller regions or larger memstores. In that case the maximum disk space you can serve per machine would be less.

Future JVMs might support bigger heaps effectively (JDK7's G1 comes to mind).

In any case, the formula above provides a reasonable starting point.


Update Monday, June 10, 2013:

Kevin O'dell pointed out on the mailing list that the overall WAL size should match the overall memstore size. So here's an update adding this to the sizing considerations.

HBase uses three config options to size the HLogs per RegionServer:

  1. hbase.regionserver.hlog.blocksize, defaults to the HDFS block size
  2. hbase.regionserver.logroll.multiplier, defaults to 0.95
  3. hbase.regionserver.maxlogs, default: 32

Hence with the default 64MB HDFS block size the HLogs would be capped at just under 2GB (64MB * 0.95 * 32 is ~1.9GB).

The total memstore size (i.e. the heap size multiplied by
hbase.regionserver.global.memstore.lowerLimit) should be <=
hbase.regionserver.hlog.blocksize *
hbase.regionserver.logroll.multiplier *
hbase.regionserver.maxlogs

I would only change hbase.regionserver.maxlogs. Individual HLogs should remain smaller than a single HDFS block.
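
A quick sketch of that check with the defaults and an example heap size (again just arithmetic; plug in your own numbers):

public class WalSizing {
  public static void main(String[] args) {
    double blockSizeMB = 64;       // hbase.regionserver.hlog.blocksize (HDFS block size)
    double rollMultiplier = 0.95;  // hbase.regionserver.logroll.multiplier
    int maxLogs = 32;              // hbase.regionserver.maxlogs

    double hlogCapacityMB = blockSizeMB * rollMultiplier * maxLogs;
    System.out.println("total HLog capacity (MB): " + hlogCapacityMB);   // ~1945, just under 2GB

    double heapMB = 32 * 1024;     // example: 32GB RegionServer heap
    double memstoreLowerLimit = 0.4;
    double memstoreCapacityMB = heapMB * memstoreLowerLimit;
    System.out.println("memstore capacity (MB): " + memstoreCapacityMB); // ~13107

    // With these numbers the memstore capacity far exceeds the HLog capacity,
    // so hbase.regionserver.maxlogs is the knob to raise.
    double neededLogs = Math.ceil(memstoreCapacityMB / (blockSizeMB * rollMultiplier));
    System.out.println("maxlogs needed: " + neededLogs);                 // ~216
  }
}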