HBase Benchmarking

Currently I am working with a new Apache HBase cluster, set up to query data using Phoenix on top of the HDP distribution. After setting up the cluster, the values for heap, cache, and timeouts were all at their defaults. Now I needed to know how good the cluster is in its current shape, and how it can be improved.

Now, for the improvement part, an understanding of HBase internals is needed: how a write works in HBase, what the read path looks like, and what the data access and data writing patterns are. By analyzing these aspects, you vary parameters. But after varying them, one needs to see the effect of the variation, right? And for that you need something to measure performance and benchmark the cluster.

I found two tools for this purpose:

PerformanceEvaluation

This comes built-in with HBase. It has various parameters to run different kinds of workloads. I am using the --nomapred option because I have not installed YARN. To get the list of supported options and parameters, just run the command without any options:
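
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation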

So here we first load the data. I am using randomWrite here, with a single thread:

$ time hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 1

The output reports how long the run took, along with latency statistics for the writes.

To write in parallel:

$ time hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 3

And to read the data back, there is a matching randomRead command:
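
$ time hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomRead 1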

Again, there are many options to this utility; try them as needed.
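
For instance, if I recall the usage text correctly, the number of rows each client writes can be changed with the --rows option (the default is one million):

$ time hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=100000 randomWrite 1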

YCSB

Yahoo! Cloud Serving Benchmark is a tool for benchmarking various kinds of databases like Cassandra, MongoDB, Voldemort, HBase, etc. The steps for setup are explained here. These are the steps to follow while using this tool:

  • Create a table named usertable (see the shell sketch after this list)
  • Load the data
  • Run workload tests
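
For the first step, a minimal sketch in the HBase shell. The column family name (cf0 here) is my choice; whatever you pick must match the columnfamily property you pass to YCSB later:

$ hbase shell
hbase> create 'usertable', 'cf0'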

Now, after creating the table, we first need to load data into it. There are various kinds of workloads for different purposes. We will use workload A to load the data into the table we created.

Loading the data:
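
The exact command depends on your YCSB version and HBase binding; a sketch of a typical invocation (the hbase10 binding name, the cf0 column family, and the record count are my assumptions) looks like:

$ bin/ycsb load hbase10 -P workloads/workloada -p table=usertable -p columnfamily=cf0 -p recordcount=1000000 -s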

The first parameter, load, tells YCSB to write data to the table. The second parameter is the type of database. -P specifies the workload. The next three parameters are self-explanatory, and -s prints progress as the loading happens.

From the throughput numbers in the output, I was getting around 2500 operations/second for the load.

Reading the data

I will use workload C, which is read-only. You can read more on the types of workloads at the link given above. Here the first parameter is run, as we are running the tests as opposed to loading the data. All the other parameters should look familiar.
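
Again, a sketch under the same assumptions as the load command (hbase10 binding, column family cf0):

$ bin/ycsb run hbase10 -P workloads/workloadc -p table=usertable -p columnfamily=cf0 -p operationcount=1000000 -s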

For the above test, the output showed I was getting around 7800 operations/second for reads.

Find a workload suitable for your use case, or create one. Run tests with it, and let me know how your cluster is doing 🙂 I am also planning to write up some of my thoughts and findings on tuning the cluster. Until then, happy hadooping…
