Real-Time Searching of Big Data with Solr and Hadoop

Rod Cope, CTO & Founder OpenLogic, Inc.

Agenda
- Introduction
- The Problem
- The Solution
- Lessons Learned
- Public Clouds & Big Data
- Final Thoughts
- Q&A


Introduction
- Rod Cope
  - CTO & Founder of OpenLogic
  - 25 years of software development experience
  - IBM Global Services, Anthem, General Electric
- OpenLogic
  - Open Source Provisioning, Support, and Governance Solutions
  - Certified library w/SLA support on 500+ Open Source packages
  - http://olex.openlogic.com
  - Over 200 Enterprise customers


The Problem: "Big Data"
- All the world's Open Source Software
  - Metadata, code, indexes
- Individual tables contain many terabytes
- Relational databases aren't scale-free
- Growing every day
- Need real-time random access to all data
- Long-running and complex analysis jobs

The Solution: Hadoop, HBase, and Solr
- Hadoop – distributed file system
- HBase – "NoSQL" data store – column-oriented
- Solr – search server based on Lucene
- All are scalable, flexible, fast, well-supported, and used in production environments
- And a supporting cast of thousands…
  - Stargate, MySQL, Redis, Resque, Ruby, JRuby, Rails, Nginx, Unicorn, HAProxy, Memcached, CentOS, …

Solution Architecture (diagram): components include a Web Browser, Scanner Client, Maven Client, and Maven Repo on the Internet side; Ruby on Rails behind Nginx & Unicorn plus Resque Workers on the Application LAN; and Stargate, MySQL, Redis, Solr, and HBase on the Data LAN, connected by live replication (3x into HBase). Caching and load balancing not shown.

Getting Data out of HBase
- HBase is NoSQL
  - Think hash table, not relational database
- Scanning vs. querying
  - How do I find my data if the primary key won't cut it?
- Solr to the rescue (see the sketch below)
  - Very fast, highly scalable search server with built-in sharding and replication – based on Lucene
  - Dynamic schema, powerful query language, faceted search, accessible via a simple REST-like web API w/XML, JSON, Ruby, and other data formats
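As a rough illustration of the two access paths, here is a minimal sketch using the SolrJ 1.4-era CommonsHttpSolrServer and the HBase 0.90-era client API. It is not from the talk: the table name, column family, Solr URL, and query field are assumptions, and it assumes each Solr document stores its HBase row key in the "id" field.

// Hypothetical sketch: direct HBase lookup by row key vs. a Solr search when
// the key isn't known. Table/core/field names are assumptions, not OpenLogic's.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class HBaseVsSolrLookup {
  public static void main(String[] args) throws Exception {
    // Fast path: you already know the row key -- HBase behaves like a giant hash table.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "source_files");              // assumed table name
    Result row = table.get(new Get(Bytes.toBytes("some-row-key")));
    byte[] contents = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));

    // No usable key: ask Solr, which carries the HBase row key in each document.
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://solr1:8080/solr/master");  // assumed URL
    SolrQuery query = new SolrQuery("package_name:log4j");           // assumed field
    query.setRows(10);
    for (SolrDocument doc : solr.query(query).getResults()) {
      String rowKey = (String) doc.getFieldValue("id");  // points back into HBase
      table.get(new Get(Bytes.toBytes(rowKey)));         // random access by key
    }
    table.close();
  }
}

The idea is simply that HBase answers "give me this row" instantly, while Solr answers "which rows match this query" and hands back keys for random access.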

Solr Sharding
- Query any server – it executes the same query against all other servers in the group (a sharded-query sketch follows)
  - Returns an aggregated result to the original caller
- Async replication (slaves poll their masters)
  - Can use repeaters if replicating across data centers
- OpenLogic Solr farm: sharded, cross-replicated, fronted with HAProxy
  - Load balanced writes across masters, reads across masters and slaves
  - Be careful not to over-commit
- Billions of lines of code in HBase, all indexed in Solr for real-time search in multiple ways
  - Over 20 Solr fields indexed per source file
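A sketch of what a manually sharded query can look like in SolrJ, using the pre-SolrCloud "shards" request parameter: the server you hit fans the query out to the listed cores and merges the results. Host and core names are illustrative, not OpenLogic's actual farm.

// Hypothetical sharded query: any core can aggregate results for the whole group.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
  public static void main(String[] args) throws Exception {
    // Ask any one server; it queries the other shards and merges the hits.
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://solr1:8080/solr/coreA");
    SolrQuery query = new SolrQuery("license:GPL");   // illustrative query
    query.set("shards",
        "solr1:8080/solr/coreA,solr2:8080/solr/coreB,solr3:8080/solr/coreC");
    QueryResponse response = solr.query(query);
    System.out.println("Aggregated hits: " + response.getResults().getNumFound());
  }
}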

Solr Implementation – Sharding + Replication (diagram: two HAProxy load balancers in front of Machines 1–26; each machine hosts one master core, Solr Core A through Z, plus a slave replica of the previous machine's core, Solr Core A' through Z').

Write Flow (diagram: same sharded layout, with writes routed through HAProxy to the master cores; slaves pick up changes via replication).

Read Flow (diagram: same layout, with reads load balanced through HAProxy across master and slave cores).

Delete Flow (diagram: same layout; deletes, like other writes, are routed through HAProxy to the master cores).

Write Flow – Failover (diagram: same layout, showing how writes are redirected when a master core is unavailable).

Read Flow – Failover (diagram: same layout, showing how reads are redirected when a core is unavailable).

Loading Big Data
- Experiment with different Solr merge factors
  - During huge loads, it can help to use a higher factor for load performance
    - Minimizes index manipulation gymnastics
    - Start with something like 25
  - When you're done with the massive initial load/import, turn it back down for search performance
    - Fewer index segments to query
    - Start with something like 5
    - Example: curl 'http://solr1:8080/solr/master/update?optimize=true&maxSegments=5' (a SolrJ version is sketched below)
    - This can take a few minutes, so you might need to adjust various timeouts
- Note that a small merge factor will hurt indexing performance if you need to do massive loads on a frequent basis or continuous indexing
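For reference, a SolrJ version of the same optimize call might look like the sketch below; one overload of SolrServer.optimize accepts waitFlush, waitSearcher, and maxSegments arguments. The URL and timeout values are illustrative.

// Hedged sketch: optimize the master down to roughly 5 segments after a bulk load.
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeAfterLoad {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer master =
        new CommonsHttpSolrServer("http://solr1:8080/solr/master");
    master.setConnectionTimeout(60 * 60 * 1000);  // optimizes can run a long time,
    master.setSoTimeout(60 * 60 * 1000);          // so raise the client timeouts
    // waitFlush=true, waitSearcher=true, maxSegments=5
    master.optimize(true, true, 5);
  }
}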

Loading Big Data (cont.)
- Test your write-focused load balancing
  - Look for large skews in Solr index size
  - Note: you may have to commit, optimize, write again, and commit before you can really tell
- Make sure your replication slaves are keeping up
  - Using identical hardware helps
  - If index directories don't look the same, something is wrong

Loading Big Data (cont.) Don’t commit to Solr too frequently It’s easy to auto-commit or commit after every record Doing this 100’s of times per second will take Solr down. so use more of them instead OpenLogic. 17 . Inc. especially if you have serious warm up queries configured Avoid putting large values in HBase (> 5MB) Works. but may cause instability and/or performance issues Rows and columns are cheap.

18 .) Don’t use a single machine to load the cluster You might not live long enough to see it finish At OpenLogic.Loading Big Data (cont. we spread raw source data across many machines and hard drives via NFS Be very careful with NFS configuration – can hang machines Load data into HBase via Hadoop map/reduce jobs Turn off WAL for much better performance put.setWriteToWAL(false) Index in Solr as you go Good way to test your load balancing write schemes and replication set up This will find your weak spots! OpenLogic. Inc.

Scripting Languages Can Help
- Writing data loading jobs can be tedious
- Scripting is faster and easier than writing Java
  - Great for system administration tasks, testing
- Standard HBase shell is based on JRuby
- Very easy Map/Reduce jobs with JRuby and Wukong
- Used heavily at OpenLogic
  - Productivity of Ruby, power of the Java Virtual Machine
  - Ruby on Rails, Hadoop integration, GUI clients

Java (27 lines)

public class Filter {
  public static void main( String[] args ) {
    List list = new ArrayList();
    list.add( "Rod" );
    list.add( "Neeta" );
    list.add( "Eric" );
    list.add( "Missy" );
    Filter filter = new Filter();
    List shorts = filter.filterLongerThan( list, 4 );
    System.out.println( shorts.size() );
    Iterator iter = shorts.iterator();
    while ( iter.hasNext() ) {
      System.out.println( iter.next() );
    }
  }

  public List filterLongerThan( List list, int length ) {
    List result = new ArrayList();
    Iterator iter = list.iterator();
    while ( iter.hasNext() ) {
      String item = (String) iter.next();
      if ( item.length() <= length ) {
        result.add( item );
      }
    }
    return result;
  }
}

Scripting languages (4 lines)

JRuby
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.find_all { |name| name.size <= 4 }
puts shorts.size
shorts.each { |name| puts name }
-> 2
-> Rod
   Eric

Groovy
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.findAll { name -> name.size() <= 4 }
println shorts.size
shorts.each { name -> println name }
-> 2
-> Rod
   Eric

Cutting Edge
- Hadoop
  - SPOF around Namenode, append functionality
- HBase
  - Backup, replication, and indexing solutions in flux
- Solr
  - Several competing solutions around cloud-like scalability and fault-tolerance, including ZooKeeper and Hadoop integration
  - No clear winner, none quite ready for production

Configuration is Key
- Many moving parts; it's easy to let typos slip through
- Consider automated configuration via Chef, Puppet, or similar
- Pay attention to the details
  - Operating system – max open files, sockets, and other limits
  - Hadoop and HBase configuration: http://wiki.apache.org/hadoop/Hbase/Troubleshooting
  - Solr merge factor and norms
- Don't starve HBase or Solr for memory – swapping will cripple your system

24 . 32GB RAM. 4+ disks Don’t bother with RAID on Hadoop data disks Be wary of non-enterprise drives Expect ugly hardware issues at some point OpenLogic.Commodity Hardware “Commodity hardware” != 3 year old desktop Dual quad-core. Inc.

OpenLogic's Hadoop Implementation
- Environment
  - 100+ CPU cores
  - 100+ terabytes of disk
  - Machines don't have identity; add capacity by plugging in new machines
- Why not Amazon EC2?
  - Great for computational bursts
  - Expensive for long-term storage of Big Data
  - Not yet consistent enough for mission-critical usage of HBase

OpenLogic's Hadoop and Solr Deployment
- Dual quad-core and dual hex-core Dell boxes
- 32–64GB ECC RAM (highly recommended by Google)
- 6 x 2TB enterprise hard drives
  - RAID 1 on two of the drives: OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, key "source" data backups, etc.
  - Hadoop datanode gets the remaining drives
- Redundant enterprise switches
- Dual- and quad-gigabit NICs

Public Clouds and Big Data
- Amazon EC2
  - EBS storage: 100TB * $0.10/GB/month = $120k/year
  - Double Extra Large instances (13 EC2 compute units, 34.2GB RAM)
    - On-demand: 20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year
    - 3-year reserved instances: 20 * $4k = $80k up front to reserve; (20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate
  - Totals for 20 virtual machines
    - 1st year cost: $120k + $80k + $86k = $286k
    - 2nd & 3rd year costs: $120k + $86k = $206k
    - Average: ($286k + $206k + $206k) / 3 = $232k/year

Private Clouds and Big Data
- Buy your own
  - 20 * Dell servers w/12 CPU cores, 32GB RAM, 5TB disk = $160k
  - Over 33 EC2 compute units each
  - Total: $53k/year (amortized over 3 years)

Public Clouds are Expensive for Big Data
- Amazon EC2: 20 instances * 13 EC2 compute units = 260 EC2 compute units
  - Cost: $232k/year
- Buy your own: 20 machines * 33 EC2 compute units = 660 EC2 compute units
  - Cost: $53k/year (the per-compute-unit arithmetic is worked out below)
  - Does not include hosting & maintenance costs
- Don't think system administration goes away
  - You still "own" all the instances – monitoring, debugging, support
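A quick worked version of that comparison, using only the figures above, expressed as yearly cost per EC2 compute unit:

// Rough arithmetic behind the comparison above (figures from the preceding slides).
public class CostPerComputeUnit {
  public static void main(String[] args) {
    double ec2 = 232000.0 / (20 * 13);   // $232k/yr over 260 compute units
    double own = 53000.0 / (20 * 33);    // $53k/yr over 660 compute units
    System.out.printf("EC2:      $%.0f per compute unit per year%n", ec2);  // ~$892
    System.out.printf("Your own: $%.0f per compute unit per year%n", own);  // ~$80
  }
}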

Expect Things to Fail – A Lot
- Hardware
  - Power supplies, hard drives
- Operating System
  - Kernel panics, zombie processes, dropped packets
- Software Servers
  - Hadoop datanodes, HBase regionservers, Stargate servers, Solr servers
- Your Code and Data
  - Stray Map/Reduce jobs, strange corner cases in your data leading to program failures

Not Possible Without Open Source

Not Possible Without Open Source
- Hadoop, HBase, Solr
- Apache, Tomcat, Jetty, Geronimo, Apache Commons
- Lucene, ZooKeeper, HAProxy, Stargate, JRuby, HSQLDB, JUnit
- CentOS
- Dozens more
- Too expensive to build or buy everything

Final Thoughts
- You can host your own big data
  - Tools are available today that didn't exist a few years ago
  - Fast to prototype – production readiness takes time
  - Expect to invest in training and support
- HBase and Solr are fast
  - 100+ random queries/sec per instance
  - Give them memory and stand back
- HBase scales; Solr scales (to a point)
  - Don't worry about outgrowing a few machines
  - Do worry about outgrowing a rack of Solr instances
  - Look for ways to partition your data other than "automatic" sharding (a key-hash routing sketch follows)
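One simple way to partition outside of automatic sharding is to route documents at the application level, for example by hashing a stable key to pick a master core so the same key always lands on the same shard. The sketch below is illustrative only, not OpenLogic's actual scheme; host list and key choice are assumptions.

// Hypothetical application-level partitioning across Solr masters.
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PartitionedWriter {
  private final SolrServer[] masters;

  public PartitionedWriter(String[] masterUrls) throws Exception {
    masters = new SolrServer[masterUrls.length];
    for (int i = 0; i < masterUrls.length; i++) {
      masters[i] = new CommonsHttpSolrServer(masterUrls[i]);
    }
  }

  // The same key always maps to the same master, so readers know where to look.
  public void index(String rowKey, SolrInputDocument doc) throws Exception {
    int shard = (rowKey.hashCode() & 0x7fffffff) % masters.length;
    masters[shard].add(doc);
  }
}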

Q&A
- Any questions for Rod?
- rod.cope@openlogic.com
- Slides: http://www.openlogic.com/presentations/rod

* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com
