You are on page 1of 21

HBase @ Meetup

Gary Helmling – Lead SW Engineer


The Problem
circa Jan 2009

● Groups doing great


things, but how do
you find it all?
● Wait til the next event
● Click around (a lot)
● Wanted to show
what's happening in
groups
● Discussions, photos,
new members, RSVPs,
etc.
● But requires 10 different
queries!
The Solution
● Show activity from all
your groups in one
place
● real-time updates
● better discovery of
what's going on
● find new ways to
participate and get to
know your groups
Challenges
● Normalized schema
● Each type of activity requires querying a separate table
– already wasn't scaling at the group level
● Query efficiency
● Activity occurs at group level
● Members can be in hundreds of groups
● For member home page we need activity from all groups ordered by
most recent
– N subqueries by group ID merged back by descending timestamp
Options
● De-normalize MySQL ● Something new
● Stuff different activity types into ● the Cloud
a common table (with different – Google App Engine
fields for different types of
activity) – Amazon SimpleDB
● Hadoop/HBase
● Duplicate entity data (or we're
still doing N queries) ● CouchDB
● Start to lose a lot of the ● MongoDB
benefits of RDBMS ● Voldemort
● Query efficiency still a problem ● Cassandra
● Single system scaling limit
Why HBase?
● We own infrastructure, no usage limits
● Data model
● Semi-structured data in HBase (easily handles multiple types in same
table)
● Time-series ordered
● Scaling is built in (just add more servers)
● But extra indexing is DIY
● Very active developer community
● Established, mature project (in relative terms!)
● Matches our own toolset (java/linux based)
What is HBase?
● Clone of Google's BigTable
● Distributed (automatic partitioning)
● Column-oriented
● Semi-structured (columns can be added just by inserting)
● Built-in versioning
● Not an RDBMS
● No joins
● No SQL
● Data usually not normalized
● Transactions & built-in secondary indexes available (as contrib) but immature
● Need to think differently about how you structure data
● Denormalize your data where necessary
● Structure data & row keys around common access
What is HBase?
Data Storage

● Table
● Regions, defined by row [start key, end key)
– Store, 1 per family
● 1+ Store Files (Hfile format on HDFS)
● (table, rowkey, family, column, timestamp) = value
● Everything is byte[]
● Rows are ordered sequentially by key
● Special tables: -ROOT-, .META.
● Tell clients where to find user data
HBase Architecture
Courtesy of Lars George

from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
What is HBase?
Data Access

● Random access (Gets)


● by rowkey only
● Sequential reads (Scans)
● starting row key
● where you stop is as important as where you start
– ending row key (optional)
– server-side filter (optional)
● Writes (Puts)
● No insert vs. update distinction
How It Works
Storing activity data in HBase
● FeedItem: stores activity data for all types
● keyed by group and descending timestamp
– ch<chapterID>-ts<Long.MAX_VALUE–timestamp>-<type>-<entityID>
● each row only contains data for that type
Row Key info: content:
ch1261585-ts9223... item_type = chapter_greeting greeting = “Hi, Gary”
target_greeting = 8104438
ch1261585-ts9223... item_type = new_discussion title = “Improvements”
target_forum = 847743 body = “When a
target_thread = 7369603 discussion is created...”

● MemberFeedIndex: index of FeedItem rows from all of a member's groups


● one row per member (keyed by member ID)
● columns store refs to FeedItem row keys for that member's groups
● TTL of 2 months expires old index values
Row Key item:
4679998 ch176399-ts9223370788400750807-mem-10044424 = new_member
ch1261585-ts9223370787431124807-ptag-8525047 = photo_tag
...
How It Works
MemberFeedIndex

● Steps in displaying member home page feed


● lookup member record in MemberFeedIndex by ID
● grab the X most recent columns & values
– use a time range for paging (older pages start with an earlier start time)
● get each row from FeedItem using (column key as row key)
– N gets, where N is number of items to display
● populate some basic info about members and aggregate the results
– still query MySQL for core entity info (member, group, event)
How it Works
Secondary index tables
● Still need to find rows by column values
● tried “tableindexed” contrib (0.19 release), high CPU usage & contention
on scans
● decided to update to 0.20 release for other performance improvements
● built secondary indexing into app layer
● Separate table per indexed column
● FeedItem info:actor_member indexed by FeedItem-by_actor_member
● Index table rows keyed by column value and descending timestamp
– <column value>-<Long.MAX_VALUE–timestamp>-<orig row key>
● Zero pad numeric values (or big-endian representation) for correct byte
ordering
How it Works
Secondary index tables

ex. FeedItem-by_actor_member
Row Key info: __idx__:
0002851766-9223370783553935005-rowkey actor_member = 2851766 row = ch1143475-
item_type = new_rsvp ts9223370783553935005-rsvp-54704795
pub_date =
0004679998-9223370783650851832-rowkey actor_member = 4679998 row = ch1261585-
item_type = new_discussion ts9223370783650851832-disc-7369603
pub_date =

indexes FeedItem
Row Key info: content:
ch1143475-ts9223370783553935005-rsvp-54704795 actor_member = 2851766 comment = “See you there”
item_type = new_rsvp
pub_date =
ch1261585-ts9223370783650851832-disc-7369603 actor_member = 4679998 title = “Next month”
item_type = new_discussion body = “...”
pub_date =
Interacting with HBase
Meetup.Beeno

Java Beans mapped to HBase tables


package com.meetup.feeds.db;

...

@HEntity(name="FeedItem")
public class FeedItem implements Externalizable {

...

@HRowKey
public String getId() { return this.id; }
public void setId(String id) { this.id = id; }

@HProperty(family="info", name="actor_member",
indexes = { @HIndex(date_col="info:pub_date", date_invert=true,
extra_cols={"info:item_type"}) } )
public Integer getMemberId() { return this.memberId; }
public void setMemberId(Integer id) { this.memberId = id; }
Interacting with HBase
Services

Base service class provides round-tripping based on annotations

public class EntityService<T> {

public T get(String rowKey) throws HBaseException {…}

public void save(T entity) throws HBaseException {…}

public void saveAll(List<T> entities) throws HBaseException {…}

public void delete(String rowKey) throws HBaseException {…}

public Query<T> query() throws MappingException {…}

easily extended for specific needs

Almost all HBase interaction through service instances.


Interacting with HBase
Queries

Simple Query API uses mappings and secondary index tables

Find all items related to a discussion


FeedItemService service = new FeedItemService(DiscussionItem.class);
Query query =
service.query()
.using( Criteria.eq("threadId", threadId) );
List items = query.execute();

Find all greetings from a given member


FeedItemService service = new FeedItemService(GreetingItem.class);
Query query =
service.query()
.using( Criteria.eq("memberId", memberId) )
.where( Criteria.eq(“type”,
FeedItem.ItemType.CHAPTER_GREETING) );
List items = query.execute();
Interacting with HBase
Member Feed Retrieval

Get latest activity from all a member's groups using MemberFeedIndex


// retrieve the member's index record
HTable mfiTable = HUtil.getTable("MemberFeedIndex");
Get get = new Get( Bytes.toBytes(String.valueOf(memberId)) );
get.addFamily( Bytes.toBytes("item") );
Result r = mfiTable.get(get);

FeedItemService service = new FeedItemService();


Set<IndexKey> sortedKeys = sortKeys(r);
List<FeedItem> items = new ArrayList<FeedItem>();

// for each index col get the entity record


for (IndexKey key : sortedKeys) {
FeedItem item = service.get(key.getKey());
if (item != null)
items.add(item);
}

// populate member and chapter info



HBase @ Meetup
Issues along the way

● Performance testing
● Product targeting 3 of our highest traffic pages, simulating load is hard
● Started with load scripts
● Moved to testing with live traffic
– Use AJAX calls to simulate requests
– Selective enable for X% of traffic
● Launched data collection/write traffic first
– Allowed tweaking configuration before impacting user experience
HBase @ Meetup
Issues along the way

● High CPU / Concurrency issues


● Updated to 0.20 release for performance gains across the board
● Replaced “tableindexed” usage with application level secondary indexing
● “Hot regions” - profile page hits small table every page
load
● Force split table to distribute across multiple servers
● “Newest” region still handling high load
– changed index keying to <value % 100>-<value>-<timestamp> for even
distribution
● I/O Heavy load / MemberFeedIndex table growing
● Lowered MemberFeedIndex time-to-live to 2 months
● Enabled LZO compression
HBase @ Meetup
Current Status

● Live traffic growing


● Cluster handling ~2.5k – 3k request/sec
● 50+% still write traffic
● ~17% of page views hit HBase (for reads)
● Expanding to 30% of page views in coming months
● Meetup.Beeno now open-source on Github:
● http://github.com/ghelmling/meetup.beeno
● Next up
● Continue tweaking
● Site analytics

You might also like