
An Introduction to Big Data, Apache Hadoop, and Cloudera
Ian Wrigley, Curriculum Manager, Cloudera

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
The Motivation for Hadoop

Traditional Large-Scale Computation

Traditionally, computation has been processor-bound
Relatively small amounts of data
Significant amount of complex processing performed on that data
For decades, the primary push was to increase the computing power of a single machine
Faster processor, more RAM

The Data Explosion

1.8 trillion gigabytes of data was created in 2011
More than 90% is unstructured data
Approx. 500 quadrillion files
Quantity doubles every 2 years

[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data]

Source: IDC 2011
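
The doubling claim on this slide can be turned into a quick projection. The sketch below assumes the 2011 figure of 1.8 trillion gigabytes as the baseline and a strict two-year doubling period. The function name and the years chosen are illustrative and do not come from IDC.

```python
def projected_gigabytes(year: int,
                        base_year: int = 2011,
                        base_gigabytes: float = 1.8e12,
                        doubling_years: float = 2.0) -> float:
    """Project data created per year, assuming steady exponential doubling."""
    return base_gigabytes * 2 ** ((year - base_year) / doubling_years)


# Print the projection for the baseline year and the next two doublings.
for year in (2011, 2013, 2015):
    print(year, f"{projected_gigabytes(year):.2e} GB")
```

Under these assumptions the volume quadruples between 2011 and 2015. Real growth need not follow a clean exponential.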

Current Solutions

Current database solutions are designed for structured data.
Optimized to answer known questions quickly
Schemas dictate form/context
Difficult to adapt to new data types and new questions
Expensive at petabyte scale

[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data, with a 10% marker]

Why Use Hadoop?

Move beyond rigid legacy frameworks

Hadoop handles any data type, in any quantity
Structured, unstructured
Schema, no schema
High volume, low volume
All kinds of analytic applications

Hadoop grows with your business
Proven at petabyte scale
Capacity and performance grow simultaneously
Leverages commodity hardware to mitigate costs

Hadoop is 100% Apache licensed and open source
No vendor lock-in
Community development
Rich ecosystem of related projects

Hadoop helps you derive the complete value of all your data
Drives revenue by extracting value from data that was previously out of reach
Controls costs by storing data more affordably than any other platform

The Origins of Hadoop

[Timeline figure, 2002 to 2012. Event labels recovered from the figure:]
Open source web crawler project created by Doug Cutting
Publishes MapReduce and GFS paper
Open source MapReduce and HDFS project created by Doug Cutting
Runs 4,000-node Hadoop cluster
Hadoop wins Terabyte sort benchmark
Launches SQL support for Hadoop
Releases CDH and Cloudera Enterprise

Core Hadoop: HDFS

Self-healing, high bandwidth

[Diagram: a file split into blocks 1 to 5, with copies of each block spread across the HDFS cluster]

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
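
The idea of splitting a file into blocks and storing each block on several nodes can be sketched in a few lines. This is a toy model, not HDFS code: the tiny block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS placement is rack-aware and uses much larger blocks).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin stands in for the real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement: dict[int, list[str]], failed_node: str) -> list[int]:
    """Return the blocks that still have at least one replica after a node fails."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(placement)
print(readable_blocks(placement, failed_node="node2"))
```

Because every block lives on more than one node, losing a single node leaves every block readable, which is the property the slide calls self-healing storage.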

Core Hadoop: MapReduce

framework

[Diagram: MapReduce (MR) over the same blocks 1 to 5 distributed across the cluster nodes]

Processes large jobs in parallel across many nodes and combines the results.
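
The map, shuffle, and reduce steps can be illustrated with the classic word count. The sketch below runs in a single Python process and only imitates the programming model. It does not use Hadoop's actual Java API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    """Run map over every line, shuffle, then reduce each key."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

In a real cluster the map calls would run in parallel on the nodes holding each block, and the shuffle would move data between nodes. Here both are simulated with ordinary lists and dictionaries.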

Hadoop and Databases

You need

Databases. Best Used For:
Interactive OLAP Analytics (<1sec)
Multistep ACID Transactions
100% SQL Compliance

Hadoop. Best Used For:
Structured or Not (Flexibility)
Scalability of Storage/Compute
Complex Data Processing

Typical Datacenter Architecture

[Diagram components: Enterprise web site; Interactive database; Data export; OLAP load; Oracle, SAP...; Business intelligence apps]

Adding Hadoop To The Mix

[Diagram components: Enterprise web site; Interactive database; New data; Hadoop; Oracle, SAP...; Dynamic OLAP queries; Business intelligence apps; Recommendations, etc...]

Why Cloudera?

Cloudera is

[The leading figure or ranking for each item was a graphic and is missing from the text]

in Customers and Users
in banking, telecommunications, mobile services, defense & intelligence, media and retail depend on Cloudera
than all other Hadoop systems combined

in Integrated Partners
across hardware, platforms, database and business intelligence (BI)
for hardware, platforms, software and services

in Training and Certification
developers, administrators and managers trained on 6 continents since 2009
for developers, administrators and managers

in Nodes Under Management

in Open Source Contributions

in Data Science

Experienced and Proven Across Hundreds of Deployments

The Only Vendor With a Complete Solution

Cloudera's Distribution Including Apache Hadoop (CDH)
Big Data storage, processing and analytics platform based on Apache Hadoop; 100% open source
STORAGE, COMPUTATION, ACCESS, INTEGRATION

Cloudera Enterprise 4.0

Cloudera Manager
End-to-end management application for the deployment and operation of CDH
DEPLOYMENT, CONFIGURATION, MONITORING, DIAGNOSTICS AND REPORTING

Production Support
Our team of experts on call to help you meet your Service Level Agreements (SLAs)
ISSUE RESOLUTION, ESCALATION PROCESSES, OPTIMIZATION, KNOWLEDGE BASE

Cloudera University
Equipping the Big Data workforce; 12,000+ trained

Partner Ecosystem
250+ partners across hardware, software, platforms and services

Professional Services
Use case discovery, pilots, process & team development

Solving Problems with Hadoop

Eight Common Hadoop-able Problems

1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox

1. Modeling True Risk

Challenge:
How much risk exposure does an organization really have with each customer?
Multiple sources of data across multiple lines of business
Solution with Hadoop:
Source and aggregate disparate data sources to build data picture
e.g., credit card records, call recordings, chat sessions, emails, banking activity
Structure and analyze
Sentiment analysis, graph creation, pattern recognition
Typical Industry:
Financial Services (banks, insurance companies)

2. Customer Churn Analysis

Challenge:
Why is an organization really losing customers?
Data on these factors comes from different sources
Solution with Hadoop:
Rapidly build behavioral model from disparate data sources
Structure and analyze with Hadoop
Traversing
Graph creation
Pattern recognition
Typical Industry:
Telecommunications, Financial Services

3. Recommendation Engine/Ad Targeting

Challenge:
Using user data to predict which products to recommend
Solution with Hadoop:
Batch processing framework
Allows execution in parallel over large datasets
Collaborative filtering
Collecting taste information from many users
Utilizing information to predict what similar users like
Typical Industry:
Ecommerce, Manufacturing, Retail
Advertising
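
Collaborative filtering, as described on this slide, can be sketched as follows. This is a minimal user-based variant using cosine similarity. The slide does not say which variant or similarity measure is meant, so both are assumptions, as are the function names and the sample ratings.

```python
from math import sqrt


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two users' sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target: str, ratings: dict[str, dict[str, float]],
              top_n: int = 3) -> list[str]:
    """Rank items the target has not rated by similarity-weighted ratings."""
    scores: dict[str, float] = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine_similarity(ratings[target], other_ratings)
        if similarity <= 0:
            continue  # ignore users with no overlap in taste
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical taste data collected from three users.
ratings = {
    "alice": {"tent": 5, "stove": 4},
    "bob": {"tent": 5, "stove": 5, "lantern": 4},
    "carol": {"novel": 5, "lamp": 3},
}
print(recommend("alice", ratings))
```

On Hadoop the same computation would be split into batch jobs over the full ratings dataset. The in-memory dictionaries here only show the logic.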

4. Point of Sale Transaction Analysis

Challenge:
Analyzing Point of Sale (PoS) data to target promotions and manage operations
Sources are complex and data volumes grow across chains of stores and other sources
Solution with Hadoop:
Batch processing framework
Allows execution in parallel over large datasets
Pattern recognition
Optimizing over multiple data sources
Utilizing information to predict demand
Typical Industry:
Retail

5. Analyzing Network Data to Predict Failure

Challenge:
Analyzing real-time data series from a network of sensors
Calculating average frequency over time is extremely tedious because of the need to analyze terabytes of data
Solution with Hadoop:
Take the computation to the data
Expand from simple scans to more complex data mining
Better understand how the network reacts to fluctuations
Discrete anomalies may, in fact, be interconnected
Identify leading indicators of component failure
Typical Industry:
Utilities, Telecommunications, Data Centers

6. Threat Analysis/Trade Surveillance

Challenge:
Detecting threats in the form of fraudulent activity or attacks
Large data volumes involved
Like looking for a needle in a haystack
Solution with Hadoop:
Parallel processing over huge datasets
Pattern recognition to identify anomalies, i.e., threats
Typical Industry:
Security, Financial Services
General: spam fighting, click fraud

7. Search Quality

Challenge:
Providing real-time, meaningful search results
Solution with Hadoop:
Analyzing search attempts in conjunction with structured data
Pattern recognition
Browsing pattern of users performing searches in different categories
Typical Industry:
Web, Ecommerce

8. Data Sandbox

Challenge:
Data Deluge
Don't know what to do with the data or what analysis to run
Solution with Hadoop:
Dump all this data into an HDFS cluster
Use Hadoop to start trying out different analyses on the data
See patterns to derive value from data
Typical Industry:
Common across all industries

Orbitz: Major Online Travel Booking Service

Challenge:
Orbitz performs millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day
Not all of that data has value (i.e., it is logged for historic reasons)
Much is quite valuable
Want to capture even more data
Solution with Hadoop:
Hadoop provides Orbitz with efficient, economical, scalable, and reliable storage and processing of these large amounts of data
Hadoop places no constraints on how data is processed

Before Hadoop

Orbitz's data warehouse contains a full archive of all transactions
Every booking, refund, cancellation, etc.
Non-transactional data was thrown away because it was uneconomical to store

[Diagram: Non-transactional Data (e.g., Searches); Transactional Data (e.g., Bookings); Data Warehouse]

After Hadoop

Hadoop was deployed late 2009/early 2010 to begin collecting this non-transactional data
Orbitz has been using CDH for that entire period with great success.
Much of this non-transactional data is contained in Web analytics logs

[Diagram: Non-transactional Data (e.g., Searches); Transactional Data (e.g., Bookings); Hadoop; Data Warehouse]

What Now?

Access to this non-transactional data enables a number of applications
Optimizing hotel search
E.g., optimize hotel ranking and show consumers hotels more closely matching their preferences
User-specific product recommendations
Web page performance tracking
Analyses to optimize search result cache performance
User segments analysis, which can drive personalization
Lots of press coverage in June 2012: company discovered that people using Macs are willing to spend 30% more on hotels than PC users
Mac users are now presented with pricier hotels first in the list

Major National Bank

Background
100M customers
Relational data: 2.5B records/month
Card transactions, home loans, auto loans, etc.
Data volume growing by hundreds of TB/year
Needs to incorporate non-relational data as well
Web clicks, check images, voice data
Uses Hadoop to
Identify credit risk, fraud
Proactively manage capital

Financial Regulatory Body

Stringent data reliability requirements
Must store seven years of data
850TB of data collected from every Wall Street trade each year
Data volumes growing at 40% each year
Replacing EMC Greenplum + SAN with CDH
Goal is to store data from years two to seven in Hadoop
Will have 5PB of data in Hadoop by the end of 2013
Cost savings predicted to be tens of millions of dollars
Application performance testing is showing speed gains of 20x in some cases
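
The retention and growth figures on this slide imply a compounding storage requirement, which a short calculation makes concrete. The sketch below assumes 850TB is the volume collected in the first year, that each following year is 40% larger, and that 1PB = 1,000TB. It does not try to reproduce the 5PB figure, because the slide does not state the baseline year or any replication overhead.

```python
def yearly_volumes_tb(first_year_tb: float, growth_rate: float,
                      years: int) -> list[float]:
    """Volume collected in each year, compounding at `growth_rate` per year."""
    return [first_year_tb * (1 + growth_rate) ** k for k in range(years)]


def retained_tb(first_year_tb: float, growth_rate: float,
                retention_years: int) -> float:
    """Total raw data held when `retention_years` of history must be kept."""
    return sum(yearly_volumes_tb(first_year_tb, growth_rate, retention_years))


volumes = yearly_volumes_tb(850, 0.40, 7)
print([round(v) for v in volumes])
print(round(retained_tb(850, 0.40, 7) / 1000, 1), "PB of raw data over seven years")
```

The point of the exercise is the shape of the curve: at 40% annual growth, the most recent years dominate the total, which is why cost per terabyte matters so much for long retention windows.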

Leading North American Retailer

Storing 400TB of data in CDH cluster
Capture and analysis of data on individual customers and SKUs across 4,000 locations
Using Hadoop for:
Loyalty program analytics and personal pricing
Fraud detection
Supply chain optimization
Marketing and promotions
Locating and pricing overstocked items for clearance

Digital Media Company

Needs to quickly and reliably process high volume clickstream and pageview data
Experienced database bottlenecks and reliability issues
Now using CDH
A cluster of just 20 nodes
Ingesting 75 million clickstream, page view, and user profile events per day
15GB of data
Processes 430 million records from six million users in 11 minutes
Alternative solution would have required 10x more investment in database software, high-end servers, developer time
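
The figures on this slide translate into throughput numbers that are easier to compare across systems. The arithmetic below assumes the 15GB is the daily volume of the 75 million events, that 1GB = 10^9 bytes, and that work is spread evenly over the 20 nodes. None of these is stated on the slide.

```python
def records_per_second(records: int, minutes: float) -> float:
    """Average processing rate over the whole job."""
    return records / (minutes * 60)


def bytes_per_event(total_bytes: float, events: int) -> float:
    """Average size of one ingested event."""
    return total_bytes / events


cluster_rate = records_per_second(430_000_000, 11)
per_node_rate = cluster_rate / 20            # assumes an even split across nodes
avg_event_size = bytes_per_event(15e9, 75_000_000)

print(f"{cluster_rate:,.0f} records/s across the cluster")
print(f"{per_node_rate:,.0f} records/s per node")
print(f"{avg_event_size:.0f} bytes per event")
```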

Leader in Real-Time Advertising Technology

Hundreds of customers need unique views of the data
Were using Netezza; unable to run more than 2-3 big jobs per day
Too expensive to scale
Now using CDH
Processing hundreds of jobs concurrently
200-300GB/hour per job
Ingesting 10TB of data per day
Moving data between CDH, Netezza, and Vertica

Netflix

Before Hadoop
Nightly processing of logs
Imported into a database
Analysis/BI
As data volume grew, it took more than 24 hours to process and load a day's worth of logs
Today, an hourly Hadoop job processes logs for quicker availability of the data for analysis/BI
Currently ingesting approximately 1TB of data per day

Hadoop as Cheap Storage

Yahoo
Before Hadoop: $1 million for 10TB storage
With Hadoop: $1 million for 1PB of storage
Other Large Company
Before Hadoop: $5 million to store data in Oracle
With Hadoop: $240K to store the data in HDFS
Facebook
Hadoop as unified storage
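
The cost comparison on this slide is easier to read as cost per terabyte. The calculation below uses only the dollar figures from the slide and assumes 1PB = 1,000TB. The function name is illustrative.

```python
def cost_per_tb(total_cost_usd: float, capacity_tb: float) -> float:
    """Storage cost normalized to one terabyte."""
    return total_cost_usd / capacity_tb


yahoo_before = cost_per_tb(1_000_000, 10)      # $1 million for 10TB
yahoo_after = cost_per_tb(1_000_000, 1_000)    # $1 million for 1PB
yahoo_ratio = yahoo_before / yahoo_after

other_ratio = 5_000_000 / 240_000              # Oracle cost vs. HDFS cost

print(f"Yahoo: ${yahoo_before:,.0f}/TB before, ${yahoo_after:,.0f}/TB with Hadoop")
print(f"Yahoo cost ratio: {yahoo_ratio:.0f}x")
print(f"Other large company cost ratio: {other_ratio:.1f}x")
```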

Hadoop Jobs

The Roles People Play

System Administrators
Developers
Analysts
Data Stewards

System Administrators

Required skills:
Strong Linux administration skills
Networking knowledge
Understanding of hardware
Job responsibilities:
Install, configure and upgrade Hadoop software
Manage hardware components
Monitor the cluster
Integrate with other systems (e.g., Flume and Sqoop)

Developers

Required skills:
Strong Java or scripting capabilities
Understanding of MapReduce and algorithms
Job responsibilities:
Write, package and deploy MapReduce programs
Optimize MapReduce jobs and Hive/Pig programs

Data Analyst/Business Analyst

Required skills:
SQL
Understanding data analytics/data mining
Job responsibilities:
Extract intelligence from the data
Write Hive and/or Pig programs

Data Steward

Required skills:
Data modeling and ETL
Scripting skills
Job responsibilities:
Cataloging the data (analogous to a librarian for books)
Manage data lifecycle, retention
Data quality control with SLAs

Combining Roles

System Administrator + Steward analogous to DBA
Required skills:
Data modeling and ETL
Scripting skills
Strong Linux administration skills
Job responsibilities:
Manage data lifecycle, retention
Data quality control with SLAs
Install, configure and upgrade Hadoop software
Manage hardware components
Monitor the cluster
Integrate with other systems (e.g., Flume and Sqoop)

Finding The Right People

Hiring Hadoop experts
Strong Hadoop skills are scarce and expensive
Hadoop User Groups
Key words
Developers: MapReduce, Cloudera Certified Developer for Apache Hadoop (CCDH)
System Admins: distributed systems (e.g., Teradata, RedHat Cluster), Linux, Cloudera Certified Administrator for Apache Hadoop (CCAH)
Consider cross-training, especially system administrators and data librarians

Cloudera's Academic Partnership Program

Cloudera's Academic Partnerships: Overview

Cloudera's Academic Partnerships (CAP)
An essential component of Cloudera's strategy to provide comprehensive Apache Hadoop training to current and future data professionals
Designed to be a mutually beneficial relationship
Universities are enabled to deliver new and relevant areas of study to their students
Cloudera is able to help fill the demand for qualified data professionals to help the market continue its explosive growth
With CDH and Cloudera Manager available for free, and our curriculum and Virtual Machine, we provide universities the foundation to start experimenting with Hadoop and developing expertise among their students

Cloudera's Academic Partnerships: Goals

Introduce students to Apache Hadoop
Provide students and instructors with quality course materials and virtual machine images to complete hands-on labs
Grant 50% discount on certification costs to students associated with the program who are interested in attempting Cloudera's industry-leading Hadoop certification exams
Highly recommended they take the class and attempt the certification exam
Allow academic institutions options to augment their degree program requirements

Cloudera's Academic Partnerships: Financial Overview

Cloudera does not currently charge Academic Partners for usage of the training materials
This is a program designed solely to facilitate students' learning of an emerging technology
Our reward is helping the industry grow, and ideally the exposure to Cloudera is a positive one which will be remembered when the students we serve today are making decisions for their business tomorrow
Instructors who are delivering the Cloudera courses are eligible for a 50% discount on commercial training courses delivered by Cloudera
We want to make sure the folks leading the classes have the skillset to help their students be successful
Normally we provide universities with courses focused on the roles of Hadoop Developer or Administrator

Ian Wrigley
ian@cloudera.com