Big Data

Presented By :

Sanjay Sharma

© 2012 Impetus Technologies

Outline
• About
• Big Data: Recap
• Big Data Technologies Landscape
• Hadoop Overview
• Other Big Data Tools Overview
• Impact on IT and us
About
• Big Data Solution Architect
• Work for Impetus Technologies
• Based out of San Jose, Atlanta & India (1300+ Engineers)
• Thought Leaders in Big Data consulting
• Started Big Data Labs 4 years ago
Big Data
• Volume
• Velocity
• Variety
• Structured and Unstructured

Big Data Opportunity
Business Opportunity
• Time to Market
• Visualization
• Optimization
• Personalization
• Advanced Predictive Analytics

Big Data Users - 2009

Source: http://www.cloudera.com/company/press-center/hadoop-world-nyc-2009/, Google

Big Data Users - 2010

Source: http://www.cloudera.com/company/press-center/hadoop-world-nyc/agenda/

Big Data Job Trends
Top Job Trends (Indeed.com, July 2012)
• HTML5
• MongoDB
• iOS
• Android
• Mobile app
• Puppet
• Hadoop
• jQuery
• PaaS
• Social Media

Source: indeed.com

Big Data Everywhere
Search results on indeed.com, 10/16/2012

Search term    Atlanta, GA    USA
"Big Data"     82             5169
"Hadoop"       78             5174
"NoSQL"        117            3820
MPP DBs        192            4581

Source: indeed.com

Big Data : Future

Source: McKinsey - http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation

Big Data Landscape

Hadoop
• Petabyte Scale
• Commodity

MPP
• DW Vendors
• Appliances

NewSQL
• Scalability Limits
• Online vs. Batch

NoSQL
• Open Source
• Writes/Reads

Big Data Vendor Galore

Hadoop: Glory to the Elephant

Hadoop

Distributed File System
• Petabyte Scale
• Thousands of Commodity Servers
• High Availability
• Highly Fault Tolerant

Distributed Processing System
• Simple, easy to code Algorithm
• Code once, Run on PBs
• High Fault Tolerance
• Data Locality

Map Concept: RDBMS

BIG COMBINED TABLE (Ids 1, 2, 3, 4 ... 256 million)
Columns: Id, Name, Other Columns
Sample names: Scott, Bob, Lisa, Sanjay, Bob ...

The table is split across four machines:
TABLE 1 on m/c 1: Ids 1 to 64 million
TABLE 2 on m/c 2: Ids 64 million and 1 to 128 million
TABLE 3 on m/c 3: Ids 128 million and 1 to 192 million
TABLE 4 on m/c 4: Ids 192 million and 1 to 256 million

Each machine runs the same query on its own table:
SELECT 'Bob', count(*) FROM table WHERE name = 'Bob';
(same queries for 'Scott' and 'Sanjay')

Partial results, one set per machine:
"Bob",3  "Scott",1  "Sanjay",2
"Bob",3  "Scott",1  "Sanjay",0
"Bob",3  "Scott",1  "Sanjay",1
"Bob",3  "Scott",1  "Sanjay",1

Results on the combined table:
SELECT count(*), 'Scott' FROM table WHERE name = 'Scott';  -> 4,'Scott'
SELECT count(*), 'Bob' FROM table WHERE name = 'Bob';  -> 12,'Bob'
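The map step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four `partitions` lists are made-up toy data standing in for the per-machine tables (chosen so the per-machine counts match the partial results on the slide), and `map_partition` is a hypothetical helper name, not part of Hadoop or any database.

```python
from collections import Counter

# Hypothetical toy data: one short list of names per machine,
# standing in for the four 64-million-row tables on the slide.
partitions = [
    ["Scott", "Bob", "Lisa", "Sanjay", "Bob", "Sanjay", "Bob"],
    ["Bob", "Scott", "Lisa", "Bob", "Bob"],
    ["Scott", "Bob", "Lisa", "Sanjay", "Bob", "Bob"],
    ["Bob", "Lisa", "Scott", "Bob", "Sanjay", "Bob"],
]


def map_partition(names, wanted=("Bob", "Scott", "Sanjay")):
    """Count each wanted name locally, like running the COUNT query per name."""
    counts = Counter(names)
    # Counter returns 0 for a name that never appears in this partition.
    return [(name, counts[name]) for name in wanted]


# Each machine works only on its own partition; no data moves between them.
partials = [map_partition(p) for p in partitions]

for machine, result in enumerate(partials, start=1):
    print(f"m/c {machine}: {result}")
```

Each machine only ever sees its own slice of the table, which is why the same query can run on all four at once.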

Reduce Concept: RDBMS
"Scott", list([1,1,1,1])   List[1,1,1,1].iterate -> Sum(EACH)
"Bob", list([3,3,3,3])     List[3,3,3,3].iterate -> Sum(EACH)
"Sanjay", list([2,0,1,1])  List[2,0,1,1].iterate -> Sum(EACH)
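The reduce step above amounts to summing each name's list of partial counts. A minimal sketch in Python, using the lists from the slide as input; `reduce_counts` is a hypothetical helper name chosen for illustration.

```python
# Partial counts per name, exactly as listed on the slide
# (one entry per machine).
grouped = {
    "Scott": [1, 1, 1, 1],
    "Bob": [3, 3, 3, 3],
    "Sanjay": [2, 0, 1, 1],
}


def reduce_counts(name, partial_counts):
    """Iterate over the partial counts for one name and sum them."""
    return name, sum(partial_counts)


totals = dict(reduce_counts(name, counts) for name, counts in grouped.items())
print(totals)  # {'Scott': 4, 'Bob': 12, 'Sanjay': 4}
```

The totals for Scott (4) and Bob (12) match the results shown for the combined table on the Map Concept slide.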

Hadoop DFS

Source : http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html
Hadoop Map Reduce
map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(v2)

Source : http://architects.dzone.com/articles/how-hadoop-mapreduce-works
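The two signatures on this slide can be read as a contract between user code and the framework. The following single-process Python sketch shows that contract only; it is not the Hadoop API. `run_mapreduce`, `count_name` and `sum_ones` are hypothetical names, and the sample records are toy data reusing the names from the earlier slides.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_mapreduce(
    records: Iterable[Tuple[object, object]],
    mapper: Callable[[object, object], Iterable[Tuple[object, object]]],
    reducer: Callable[[object, List[object]], List[object]],
) -> Dict[object, List[object]]:
    """Toy driver: map every (k1, v1), group by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        # map (k1, v1) -> list(k2, v2)
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # reduce (k2, list(v2)) -> list(v2)
    return {k2: reducer(k2, values) for k2, values in groups.items()}


def count_name(record_id, name):
    """Mapper: ignore the id and emit (name, 1)."""
    return [(name, 1)]


def sum_ones(name, ones):
    """Reducer: collapse the list of 1s into a single total."""
    return [sum(ones)]


records = [(1, "Scott"), (2, "Bob"), (3, "Lisa"), (4, "Sanjay"), (5, "Bob")]
result = run_mapreduce(records, count_name, sum_ones)
print(result)  # {'Scott': [1], 'Bob': [2], 'Lisa': [1], 'Sanjay': [1]}
```

In a real cluster the grouping step is a distributed shuffle and the mappers and reducers run on different machines; the sketch keeps everything in one process to make the data flow visible.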
Hadoop Map Reduce
(brown, (1,1)) (fox, (1,1)) (how, (1)) (now,(1)) (the, (1,1,1))
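Grouped pairs like the ones above are what the reducer receives after the shuffle in a word count job. The sketch below reproduces that intermediate state in plain Python. The three input lines are an assumption (the slide does not show its input text); they were picked so that the groups for "brown", "fox", "how", "now" and "the" match the slide. `map_words` and `shuffle` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed input text; the slide only shows the grouped output.
lines = [
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
]


def map_words(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


pairs = [pair for line in lines for pair in map_words(line)]
grouped = shuffle(pairs)

# Reduce: sum the list of 1s for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}

for word in ("brown", "fox", "how", "now", "the"):
    print(word, grouped[word], counts[word])
```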

Hadoop Ecosystem

NoSQL:
“No to SQL” OR “Not Only SQL”

NoSQL Overview
NoSQL Models

Volatile Storage
• Memcached, Ehcache

Persistent Storage
• Key / Value Databases: Voldemort, Redis, Scalaris
• Columnar Databases: HBase, Cassandra, Hypertable
• Document Databases: MongoDB, CouchDB
• Graph Databases: InfoGrid, Neo4j
• Other Databases: Kyoto Cabinet, Berkeley DB

NoSQL Characteristics
• AutoSharding
• Schemaless
• In-memory, flush to disk
• Failover
• Intelligent client
• Dynamic clustering

TYPICAL BENEFITS
• Scalability
• Availability
• Near-Real time Performance
• Modeling flexibility
• Deployment flexibility

MPP:
Massively Parallel Processing DW

MPP/Columnar Stores
• Oracle Exadata
• IBM Netezza
• Teradata
• EMC Greenplum
• HP Vertica
• ParAccel
• Microsoft SQL Server PDW

Source : http://www.oracle.com/us/products/database/big-data-for-enterprise-519135.pdf

Big Data: Microsoft

Source : http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx

NewSQL:
New Generation DB

NewSQL / Cloud DB
• VoltDB
• NimbusDB
• Clustrix
• Xeround
• SimpleDB
• DynamoDB
• NuoDB
• Tokutek

ETL, BI & Reporting

ETL, BI & Reporting
Hadoop/ MPP/ NoSQL support in:
• Informatica, DataStage
• Talend, Pentaho
• MicroStrategy, SAS
• Tableau, QlikView, Intellicus

Big Data & Cloud

Big Data & Cloud
• Marriage made in heaven
• Big data demands met by Cloud scalability
• IaaS, PaaS and DaaS offerings
• AWS: EMR, SimpleDB, RDS
• Azure: SQL Server, Hadoop
• Google

Real Time Analytics

Real Time Analytics
• Storm
• HStreaming, StreamBase
• Microsoft StreamInsight
• IBM Streams
• Oracle SQLstream
• Complex Event Processing engines - Esper etc.

Big Data Impact on us

Big Data Careers
• Java Developers
• ETL Developers
• Database Administrators
• Database SQL Developers
• Solution/ Technical Architects
• Data Scientists

Enhance OR Extend NOT Replace

Some Big Data Careers
• Hadoop/Hive Developers - Java, Hive
• Hadoop Architects - Java, DW, ETL
• Hadoop Administrators - Linux, Java
• NoSQL Developers - Java/ Python/ Ruby
• MPP DW Admin - Linux, SQL
• MPP DW Developers - SQL, Data Modeling
• Data Scientist - Machine Learning, Algorithms
• Big Data Architect - Solution/ Technical Architecture

Typical Big Data Architecture
Credits and Acknowledgements

• Company Logos - Creative Commons/ Company Copyrighted/ Trademarked
• Hadoop Elephant Images - Apache Trademarked
• Cloudera.com, hadoop.apache.org, Oracle big data web site, Indeed.com, McKinsey report, dzone.com, microsoft.com
• The Awesome !! Team of Big Data architects and practitioners at Impetus
