HBase at Meetup

HBase @ Meetup
Gary Helmling – Lead SW Engineer

The Problem
circa Jan 2009
● Groups doing great things, but how do you find it all?
● Wait til the next event
● Click around (a lot)
● Wanted to show what's happening in groups
● Discussions, photos, new members, RSVPs, etc.
● But requires 10 different queries!
The Solution
● Show activity from all your groups in one place
● real-time updates
● better discovery of what's going on
● find new ways to participate and get to know your groups
Challenges
● Normalized schema
● Each type of activity requires querying a separate table
– already wasn't scaling at the group level
● Query efficiency
● Activity occurs at group level
● Members can be in hundreds of groups
● For member home page we need activity from all groups ordered by most recent
– N subqueries by group ID merged back by descending timestamp
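To make that merge step concrete, here is a minimal sketch of combining N per-group result lists (each already newest first) into a single newest-first list. The `FeedMerge` and `Activity` names and fields are hypothetical and only illustrate the shape of the problem; they are not Meetup's schema or code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class FeedMerge {
    /** One activity row; the fields are illustrative, not Meetup's schema. */
    static final class Activity {
        final long groupId;
        final long timestamp;

        Activity(long groupId, long timestamp) {
            this.groupId = groupId;
            this.timestamp = timestamp;
        }
    }

    /** Merges per-group lists (each already newest first) into one newest-first list. */
    static List<Activity> mergeNewestFirst(List<List<Activity>> perGroup, int limit) {
        // Each heap entry is a cursor {list index, position}; the newest item sits on top.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingLong((int[] c) -> perGroup.get(c[0]).get(c[1]).timestamp).reversed());
        for (int g = 0; g < perGroup.size(); g++) {
            if (!perGroup.get(g).isEmpty()) {
                heap.add(new int[] {g, 0});
            }
        }
        List<Activity> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < limit) {
            int[] cursor = heap.poll();
            List<Activity> source = perGroup.get(cursor[0]);
            merged.add(source.get(cursor[1]));
            // Advance the cursor within the list it came from.
            if (cursor[1] + 1 < source.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Activity>> perGroup = Arrays.asList(
            Arrays.asList(new Activity(1, 50), new Activity(1, 20)),
            Arrays.asList(new Activity(2, 40), new Activity(2, 30), new Activity(2, 10)));
        for (Activity a : mergeNewestFirst(perGroup, 4)) {
            System.out.println("group " + a.groupId + " @ " + a.timestamp);
        }
    }
}
```

The merge itself is cheap; the cost in the relational design comes from issuing one subquery per group (times one per activity type) before the merge can start.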
Options
● De-normalize MySQL
● Stuff different activity types into a common table (with different fields for different types of activity)
● Duplicate entity data (or we're still doing N queries)
● Start to lose a lot of the benefits of RDBMS
● Query efficiency still a problem
● Single system scaling limit
● Something new
● the Cloud
– Google App Engine
– Amazon SimpleDB
● Hadoop/HBase
● CouchDB
● MongoDB
● Voldemort
● Cassandra
Why HBase?
● We own infrastructure, no usage limits
● Data model
● Semi-structured data in HBase (easily handles multiple types in same table)
● Time-series ordered
● Scaling is built in (just add more servers)
● But extra indexing is DIY
● Very active developer community
● Established, mature project (in relative terms!)
● Matches our own toolset (java/linux based)
What is HBase?
● Clone of Google's BigTable
● Distributed (automatic partitioning)
● Column-oriented
● Semi-structured (columns can be added just by inserting)
● Built-in versioning
● Not an RDBMS
● No joins
● No SQL
● Data usually not normalized
● Transactions & built-in secondary indexes available (as contrib) but immature
● Need to think differently about how you structure data
● Denormalize your data where necessary
● Structure data & row keys around common access patterns
What is HBase?
Data Storage
● Table
● Regions, defined by row [start key, end key)
– Store, 1 per family
● 1+ Store Files (HFile format on HDFS)
● (table, rowkey, family, column, timestamp) = value
● Everything is byte[]
● Rows are ordered sequentially by key
● Special tables: -ROOT-, .META.
● Tell clients where to find user data
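The cell addressing above can be pictured as nested sorted maps. The following is a toy in-memory sketch, not HBase code: `CellStore` is a hypothetical name, and it uses `String` keys and values for readability where HBase stores everything as `byte[]`.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

final class CellStore {
    // row key -> "family:column" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
        new TreeMap<>();

    /** Stores one version of a cell; older versions of the same cell are kept. */
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(rowKey);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).firstEntry().getValue();
    }

    /** Rows are kept sorted by key, so the first key is the smallest one. */
    String firstRowKey() {
        return rows.firstKey();
    }

    public static void main(String[] args) {
        CellStore store = new CellStore();
        store.put("b-row", "info:item_type", 1L, "new_rsvp");
        store.put("a-row", "info:item_type", 1L, "new_member");
        store.put("a-row", "info:item_type", 2L, "photo_tag");
        System.out.println(store.get("a-row", "info:item_type"));
        System.out.println(store.firstRowKey());
    }
}
```

Writing the same cell twice adds a second version instead of overwriting it, which is the built-in versioning mentioned earlier.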
HBase Architecture
Courtesy of Lars George
from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
What is HBase?
Data Access
● Random access (Gets)
● by rowkey only
● Sequential reads (Scans)
● starting row key
● where you stop is as important as where you start
– ending row key (optional)
– server-side filter (optional)
● Writes (Puts)
● No insert vs. update distinction
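The point about where a scan stops can be illustrated with a prefix scan over a sorted map. This is a sketch under assumptions: `PrefixScan` and its methods are hypothetical, a `TreeMap` of `String` keys stands in for a table of `byte[]` row keys, and the stop key is derived by incrementing the last character of the prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class PrefixScan {
    /**
     * Smallest key that sorts after every key starting with the prefix.
     * Assumes a non-empty prefix whose last character is not Character.MAX_VALUE.
     */
    static String stopKeyFor(String prefix) {
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    /** Returns row keys in [startRow, stopRow), mirroring the half-open region ranges. */
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("ch126158-ts1-x", "other group");
        table.put("ch1261585-ts1-a", "newest");
        table.put("ch1261585-ts2-b", "older");
        table.put("ch1261586-ts1-c", "other group");

        String prefix = "ch1261585-";
        // Without a stop key the scan would keep running into the next group's rows.
        System.out.println(scan(table, prefix, stopKeyFor(prefix)));
    }
}
```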
How It Works
Storing activity data in HBase
● FeedItem: stores activity data for all types
● keyed by group and descending timestamp
– ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>
● each row only contains data for that type
Row Key              info:                         content:
ch1261585-ts9223...  item_type = chapter_greeting  greeting = “Hi, Gary”
                     target_greeting = 8104438
ch1261585-ts9223...  item_type = new_discussion    title = “Improvements”
                     target_forum = 847743         body = “When a discussion is created...”
                     target_thread = 7369603
● MemberFeedIndex: index of FeedItem rows from all of a member's groups

● one row per member (keyed by member ID)
● columns store refs to FeedItem row keys for that member's groups
● TTL of 2 months expires old index values
Row Key   item:
4679998   ch176399-ts9223370788400750807-mem-10044424 = new_member
          ch1261585-ts9223370787431124807-ptag-8525047 = photo_tag
          ...
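A minimal sketch of building a FeedItem row key in the format shown above. The `FeedItemKeys` class and `rowKey` method are hypothetical names, and the timestamp is assumed to be in epoch milliseconds. Subtracting from `Long.MAX_VALUE` makes newer items sort first, and for realistic timestamps the result always has 19 digits, so string order matches numeric order.

```java
final class FeedItemKeys {
    /** Builds ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>. */
    static String rowKey(long chapterId, long timestampMillis, String type, long entityId) {
        // Larger (newer) timestamps produce smaller inverted values, so they sort first.
        long inverted = Long.MAX_VALUE - timestampMillis;
        return "ch" + chapterId + "-ts" + inverted + "-" + type + "-" + entityId;
    }

    public static void main(String[] args) {
        String older = rowKey(1261585L, 1000L, "disc", 1L);
        String newer = rowKey(1261585L, 2000L, "disc", 2L);
        System.out.println(newer);
        System.out.println(older);
        System.out.println("newer sorts first: " + (newer.compareTo(older) < 0));
    }
}
```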
How It Works
MemberFeedIndex
● Steps in displaying member home page feed

● lookup member record in MemberFeedIndex by ID
● grab the X most recent columns & values
– use a time range for paging (older pages start with an earlier start time)
● get each row from FeedItem (using the column key as the row key)
– N gets, where N is number of items to display
● populate some basic info about members and aggregate the results
– still query MySQL for core entity info (member, group, event)
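The time-range paging step can be sketched in memory as follows. `FeedPaging` and `IndexCell` are hypothetical names; the sketch assumes each index column carries the time it was written and that the next page starts just before the oldest item of the previous one. Items sharing the exact boundary timestamp would be skipped, which a real implementation would need to handle.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class FeedPaging {
    /** One column of a member's index row: a FeedItem row key plus the time it was written. */
    static final class IndexCell {
        final String feedItemRowKey;
        final long writtenAt;

        IndexCell(String feedItemRowKey, long writtenAt) {
            this.feedItemRowKey = feedItemRowKey;
            this.writtenAt = writtenAt;
        }
    }

    /** Returns up to pageSize cells written strictly before olderThan, newest first. */
    static List<IndexCell> page(List<IndexCell> cells, long olderThan, int pageSize) {
        return cells.stream()
            .filter(c -> c.writtenAt < olderThan)
            .sorted(Comparator.comparingLong((IndexCell c) -> c.writtenAt).reversed())
            .limit(pageSize)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<IndexCell> cells = Arrays.asList(
            new IndexCell("ch1-a", 100L), new IndexCell("ch2-b", 300L),
            new IndexCell("ch1-c", 200L), new IndexCell("ch3-d", 400L));
        List<IndexCell> first = page(cells, Long.MAX_VALUE, 2);
        // The next page starts before the oldest item already shown.
        List<IndexCell> second = page(cells, first.get(first.size() - 1).writtenAt, 2);
        for (IndexCell c : second) {
            System.out.println(c.feedItemRowKey + " @ " + c.writtenAt);
        }
    }
}
```

Each returned `feedItemRowKey` would then be fetched from FeedItem with a Get, giving the N gets described in the steps.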
How it Works
Secondary index tables
● Still need to find rows by column values
● tried “tableindexed” contrib (0.19 release), high CPU usage & contention on scans
● decided to update to 0.20 release for other performance improvements
● built secondary indexing into app layer
● Separate table per indexed column
● FeedItem info:actor_member indexed by FeedItem-by_actor_member
● Index table rows keyed by column value and descending timestamp
– <column value>-<Long.MAX_VALUE - timestamp>-<orig row key>
● Zero pad numeric values (or big-endian representation) for correct byte ordering
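A short sketch of why the zero padding matters when keys are compared as bytes. `IndexKeys` is a hypothetical helper; the 10-digit width is an assumption taken from the example keys in this deck, and negative values are not handled.

```java
import java.util.Locale;

final class IndexKeys {
    /** Left-pads a non-negative number to 10 digits so string order matches numeric order. */
    static String pad(long value) {
        return String.format(Locale.ROOT, "%010d", value);
    }

    /** Builds <column value>-<Long.MAX_VALUE - timestamp>-<orig row key>. */
    static String indexKey(long columnValue, long timestampMillis, String origRowKey) {
        return pad(columnValue) + "-" + (Long.MAX_VALUE - timestampMillis) + "-" + origRowKey;
    }

    public static void main(String[] args) {
        // Unpadded, "10" sorts before "9" because '1' < '9'.
        System.out.println("unpadded in order: " + ("9".compareTo("10") < 0));
        System.out.println("padded in order:   " + (pad(9L).compareTo(pad(10L)) < 0));
        System.out.println(indexKey(2851766L, 1253300840802L, "rowkey"));
    }
}
```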
How it Works
Secondary index tables
ex. FeedItem-by_actor_member
Row Key                                info:                        __idx__:
0002851766-9223370783553935005-rowkey  actor_member = 2851766       row = ch1143475-ts9223370783553935005-rsvp-54704795
                                       item_type = new_rsvp
                                       pub_date =
0004679998-9223370783650851832-rowkey  actor_member = 4679998       row = ch1261585-ts9223370783650851832-disc-7369603
                                       item_type = new_discussion
                                       pub_date =
indexes FeedItem
Row Key                                        info:                        content:
ch1143475-ts9223370783553935005-rsvp-54704795  actor_member = 2851766       comment = “See you there”
                                               item_type = new_rsvp
                                               pub_date =
ch1261585-ts9223370783650851832-disc-7369603   actor_member = 4679998       title = “Next month”
                                               item_type = new_discussion   body = “...”
                                               pub_date =
Interacting with HBase
Meetup.Beeno
Java Beans mapped to HBase tables

package com.meetup.feeds.db;
...
@HEntity(name="FeedItem")
public class FeedItem implements Externalizable {
    ...
    @HRowKey
    public String getId() { return this.id; }
    public void setId(String id) { this.id = id; }

    @HProperty(family="info", name="actor_member",
               indexes = { @HIndex(date_col="info:pub_date", date_invert=true,
                                   extra_cols={"info:item_type"}) } )
    public Integer getMemberId() { return this.memberId; }
    public void setMemberId(Integer id) { this.memberId = id; }
    ...
}
Services
Base service class provides round-tripping based on annotations
public class EntityService<T> {
    public T get(String rowKey) throws HBaseException {…}
    public void save(T entity) throws HBaseException {…}
    public void saveAll(List<T> entities) throws HBaseException {…}
    public void delete(String rowKey) throws HBaseException {…}
    public Query<T> query() throws MappingException {…}
}
easily extended for specific needs
Almost all HBase interaction through service instances.

Queries
Simple Query API uses mappings and secondary index tables
Find all items related to a discussion

FeedItemService service = new FeedItemService(DiscussionItem.class);
Query query =
    service.query()
           .using( Criteria.eq("threadId", threadId) );
List items = query.execute();
Find all greetings from a given member

FeedItemService service = new FeedItemService(GreetingItem.class);
Query query =
    service.query()
           .using( Criteria.eq("memberId", memberId) )
           .where( Criteria.eq("type",
                               FeedItem.ItemType.CHAPTER_GREETING) );
List items = query.execute();
Member Feed Retrieval
Get latest activity from all a member's groups using MemberFeedIndex

// retrieve the member's index record
HTable mfiTable = HUtil.getTable("MemberFeedIndex");
Get get = new Get( Bytes.toBytes(String.valueOf(memberId)) );
get.addFamily( Bytes.toBytes("item") );
Result r = mfiTable.get(get);

FeedItemService service = new FeedItemService();
Set<IndexKey> sortedKeys = sortKeys(r);
List<FeedItem> items = new ArrayList<FeedItem>();

// for each index col get the entity record
for (IndexKey key : sortedKeys) {
    FeedItem item = service.get(key.getKey());
    if (item != null)
        items.add(item);
}

// populate member and chapter info
…
HBase @ Meetup
Issues along the way
● Performance testing
● Product targets 3 of our highest-traffic pages; simulating load is hard
● Started with load scripts
● Moved to testing with live traffic
– Use AJAX calls to simulate requests
– Selective enable for X% of traffic
● Launched data collection/write traffic first
– Allowed tweaking configuration before impacting user experience
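One way to enable a feature for X% of traffic is to bucket requests deterministically. The slides do not say how the selection was made, so the following is only an assumed approach: `TrafficSampler` is a hypothetical name and bucketing by member ID is an illustration.

```java
final class TrafficSampler {
    /** Returns true for roughly `percent` out of every 100 member IDs, always the same ones. */
    static boolean enabledFor(long memberId, int percent) {
        // floorMod keeps the bucket in 0..99 even for negative IDs.
        return Math.floorMod(memberId, 100L) < percent;
    }

    public static void main(String[] args) {
        int enabled = 0;
        for (long id = 0; id < 1000; id++) {
            if (enabledFor(id, 17)) {
                enabled++;
            }
        }
        System.out.println(enabled + " of 1000 members enabled");
    }
}
```

Because the decision depends only on the ID, a given member sees consistent behavior while the percentage is raised step by step.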
HBase @ Meetup
Issues along the way
● High CPU / Concurrency issues

● Updated to 0.20 release for performance gains across the board
● Replaced “tableindexed” usage with application level secondary indexing
● “Hot regions” - profile page hits small table every page load
● Force split table to distribute across multiple servers
● “Newest” region still handling high load
– changed index keying to <value % 100>-<value>-<timestamp> for even distribution
● I/O Heavy load / MemberFeedIndex table growing
● Lowered MemberFeedIndex time-to-live to 2 months
● Enabled LZO compression
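The re-keyed index format mentioned above, `<value % 100>-<value>-<timestamp>`, can be sketched as follows. `SaltedKeys` is a hypothetical name, and the zero-padding widths (2 digits for the bucket, 10 for the value) are assumptions added so that the keys sort consistently.

```java
import java.util.Locale;

final class SaltedKeys {
    /** Builds <value % 100>-<value>-<timestamp>; the leading bucket spreads writes across regions. */
    static String saltedKey(long value, long timestampPart) {
        long bucket = Math.floorMod(value, 100L);
        return String.format(Locale.ROOT, "%02d-%010d-%d", bucket, value, timestampPart);
    }

    public static void main(String[] args) {
        // Consecutive values land under different prefixes instead of in one "newest" region.
        System.out.println(saltedKey(4679998L, 9223370783650851832L));
        System.out.println(saltedKey(4679999L, 9223370783650851832L));
    }
}
```

Since the bucket is derived from the value itself, a lookup for one value can still compute the full key prefix; only range scans across many values have to visit every bucket.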
HBase @ Meetup
Current Status
● Live traffic growing

● Cluster handling ~2.5k – 3k requests/sec
● 50+% still write traffic
● ~17% of page views hit HBase (for reads)
● Expanding to 30% of page views in coming months
● Meetup.Beeno now open-source on Github:
● http://github.com/ghelmling/meetup.beeno
● Next up
● Continue tweaking
● Site analytics
