
Introduction to Big Data

Outline
• What is Big Data and Why Does It Matter?
• What Is Big Data?
• How Is Big Data Different and More of the Same?
• Risks of Big Data
• The Structure of Big Data
• Most Big Data Doesn’t Matter
• Mixing Big Data with Traditional Data
• Today’s Big Data Is Not Tomorrow’s Big Data

• Web Data: The Original Big Data


• Web Data Overview
• What Web Data Reveals
• Web Data in Action
What Is Big Data?
• There is no consensus on how to define big data
“Big data exceeds the reach of commonly used hardware
environments and software tools to capture, manage, and process it
within a tolerable elapsed time for its user population.” - Teradata
Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and
analyze.” - The McKinsey Global Institute, 2011

SIZE OF DATA
What Is Big Data?
• The “BIG” in big data isn’t just about volume

* IOPS (Input/Output Operations Per Second)


12V’s
Big Data Analysis Example
• How does location tracking work?
• Recognize the dead zone

Big Data Analysis Example
• Big data can generate significant financial value across sectors

How Is Big Data Different?
1) Automatically generated by a machine (e.g., a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g., use of the internet)
3) Not designed to be friendly (e.g., text streams)
4) May not have much value
• Need to focus on the important part

How Is Big Data More of the Same?
• Most new data sources were considered big and difficult when they first arrived
• Just the next wave of new, bigger data

< The past > < The present > < The future >

Risks of Big Data
• An organization can be overwhelmed by the data
• Need the right people solving the right problems

• Costs escalate too fast
• It isn’t necessary to capture 100% of the data

• Many sources of big data raise privacy concerns
• Self-regulation
• Legal regulation

Why You Need to Tame Big Data
• Analyzing big data is already standard in some industries (e.g., e-commerce)

• Organizations that ignore big data risk being left behind in a few years
• So far, they have only missed the chance to be on the bleeding edge

• Capturing data and using analysis to make decisions
• Just an extension of what you are already doing today

The Structure of Big Data
• Structured
• Most traditional data sources

• Semi-structured
• Many sources of big data

• Unstructured
• Video data, audio data

Various types of data formats
Exploring Big Data
• The time for developing an analysis
• Gathering & preparing data (70~80%)
• Analyzing data (20~30%)

• The time for developing an analysis (initially working with big data)
• Gathering & preparing data (95%)
• Analyzing data (5%)

Filtering Big Data Effectively
• Sipping from the hose: focus on the important pieces of the data
• It makes big data easier to handle

• The extract, transform, and load (ETL) processes
• Taking a raw feed of data, reading it, and producing a usable set of
output
• Extract
• Transform
• Load
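
The three ETL steps can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a production pipeline: the feed layout (`customer_id`, `event`, `amount`), the sample rows, and the dictionary standing in for the target store are all hypothetical.

```python
import csv
import io

# Hypothetical raw feed; the column names and values are made up for illustration.
RAW_FEED = """customer_id,event,amount
101,purchase,19.99
102,page_view,
101,purchase,5.01
"""


def extract(raw_text):
    """Extract: read the raw feed into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: keep only purchases and total the amount per customer."""
    totals = {}
    for row in rows:
        if row["event"] != "purchase":
            continue  # filter out the pieces of the feed that do not matter here
        customer = row["customer_id"]
        totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    return totals


def load(totals, target):
    """Load: write the usable output into the target store (a dict here)."""
    target.update(totals)
    return target


warehouse = load(transform(extract(RAW_FEED)), {})
```

Only the purchase rows survive the transform step, which is the "sipping from the hose" idea: most of the feed is read once and then discarded.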
The Example of RFID Tags
• Some data has only short-term value
• e.g., the responses at 10-second intervals between tags and readers

• Some data has long-term value
• e.g., the entry and exit of the pallet
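
As a rough sketch of this filtering, the repeated tag responses can be collapsed into one entry time and one exit time per pallet. The pallet names and timestamps (in seconds) below are hypothetical.

```python
def entry_and_exit(pings):
    """Collapse repeated tag pings into (first seen, last seen) per pallet."""
    summary = {}
    for pallet_id, seconds in pings:
        first, last = summary.get(pallet_id, (seconds, seconds))
        summary[pallet_id] = (min(first, seconds), max(last, seconds))
    return summary


# Hypothetical pings: pallet-A answers every 10 seconds for a minute.
pings = [("pallet-A", t) for t in range(0, 60, 10)]
pings += [("pallet-B", 30), ("pallet-B", 40)]

summary = entry_and_exit(pings)
```

Eight raw pings shrink to two summary records, which are the part worth keeping long term.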
Mixing Big Data with Traditional Data
• The biggest value in big data can be driven by combining big data with
other corporate data

Big data + other data: create a synergy effect

Mixing Big Data with Traditional Data
• Browsing history
• Knowing how valuable a customer is
• What they have bought in the past

• Smart-grid data
• For a utility company
• Knowing the historical billing patterns
• Dwelling type

• Text (online chat and e-mails)
• Knowing the detailed product specification being discussed
• The sales data related to those products
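
The browsing-history case can be sketched as a simple join: each browsing event is enriched with what is already known about the customer. The customer IDs, products, and spend figures are hypothetical, and treating unknown customers as zero past spend is an assumption made for the example.

```python
# Big data source: hypothetical browsing events.
browsing = [
    {"customer_id": 1, "viewed": "laptop"},
    {"customer_id": 2, "viewed": "laptop"},
    {"customer_id": 3, "viewed": "phone"},
]

# Traditional corporate data: hypothetical past spend per customer.
past_spend = {1: 4200.0, 2: 35.0}


def enrich(events, spend_by_customer):
    """Attach past spend to each browsing event and rank by customer value."""
    enriched = []
    for event in events:
        row = dict(event)
        # Assumption: a customer with no purchase history counts as 0.0.
        row["past_spend"] = spend_by_customer.get(event["customer_id"], 0.0)
        enriched.append(row)
    return sorted(enriched, key=lambda r: r["past_spend"], reverse=True)


ranked = enrich(browsing, past_spend)
```

Two customers viewed the same laptop, but only the combined data shows which of them has been the more valuable customer.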

The Need for Standards
• Big data will become more structured over time
• Sources will be fine-tuned to be friendlier for analysis
• Standardizing enough will make life much easier

Today’s Big Data Is Not Tomorrow’s Big Data
• Data from these industries was very hard to handle even a decade ago
• Banking
• Retail
• Telecommunications

• What counts as “BIG” will change
• Big data will continue to evolve
• Another new data source will come

Web Data Overview (1/6)

360-Degree View
• Organizations have talked about a 360-degree view of their
customers
• What is a 360-degree view?

Names & Addresses

Web Data Overview (2/6)

What Are You Missing?


• About 2% of browsing sessions complete a purchase
• Information is missing on more than 98% of web sessions
• If only transactions are tracked

98% of Information

Web Data Overview (3/6)

Importance of Missing Information


• For every purchase transaction
• There might be dozens or hundreds of specific actions
• That information needs to be collected and analyzed

Action flow

Web Data Overview (4/6)

New Ways of Communicating


• You have visibility into the entire buying process
• Instead of seeing just the results

Figure labels: Motivation 1, Motivation 2, Intention 1, Intention 2, Preference 1, Preference 2, etc.

Web Data Overview (5/6)

Data That Should Be Collected


• Collects detailed event history from any customer touch point
• Web sites
• Kiosks
• Mobile apps
• Social media
• Etc.

Behaviors That Can Be Captured
• Purchases
• Product views
• Shopping basket additions
• Watching a video
• Accessing a download
• Reading / writing a review
• Requesting help
• Forwarding a link
• Posting a comment
• Registering for a webinar
• Executing a search
• And many more!
Web Data Overview (6/6)

Privacy
• Privacy may become an even bigger issue as time passes
• Faceless customer analysis
• An arbitrary ID number, rather than personal details, can be matched to each customer
• What matters is finding the pattern, not the behavior of any specific customer
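
One way to sketch faceless analysis is to replace each customer name with an arbitrary ID before counting behaviors. The keyed hash used here is an assumption chosen for illustration (the slides do not say how the ID is produced), and the key, names, and events are hypothetical.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret key; in practice it would be stored away from the analysts.
SECRET_KEY = b"hypothetical-secret"


def faceless_id(customer_name):
    """Map a customer to a stable, arbitrary-looking ID (keyed SHA-256, truncated)."""
    digest = hmac.new(SECRET_KEY, customer_name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


events = [("Alice", "product_view"), ("Bob", "product_view"), ("Alice", "purchase")]

# The analysis only ever sees the faceless IDs.
faceless_events = [(faceless_id(name), action) for name, action in events]

# Pattern across all customers, not the behavior of any named individual.
pattern = Counter(action for _, action in faceless_events)
```

The same customer always maps to the same ID, so sessions can still be linked, but the name itself never appears in the analytic data.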

Behavioral
Pattern

What Web Data Reveals (1/7)

Shopping Behaviors
• How customers come to a site to begin shopping
• What search engine do they use?
• What specific search terms are entered?
• Do they use a bookmark they created previously?
Associated with higher sales rates

Search keywords

What Web Data Reveals (2/7)

Shopping Behaviors (cont.)


• Start to examine all the products they explore
• Who looked at a product landing page?
• Who drilled down further?
• Who looked at detailed product specifications?
• Who looked at shipping information?

What Web Data Reveals (3/7)
Shopping Behaviors (cont.)

• Start to examine all the products they explore


• Who took advantage of any other information?
• Which products were added to, or later removed from, a wish list or basket?

What Web Data Reveals (4/7)

Research Behaviors
• Understanding how customers utilize the research content can lead
to tremendous insights into
• How to interact with each individual customer
• How different aspects of the site do or do not add value

What Web Data Reveals (5/7)

Research Behaviors - An Example


• An organization may see an unusual number of customers dropping a
specific product

Detailed specification

What Web Data Reveals (6/7)

Feedback Behaviors
• Some of the best information is detailed feedback on products and services
• By using text mining, we can understand
• Tone
• Intent
• Topic

What Web Data Reveals (7/7)

Feedback Behaviors - Examples


• Some customers post reviews on a regular basis
• It is smart to give special incentives to keep the good words coming

Figure labels: customers in general, each specific customer

• By parsing the questions and comments via online help


• It is possible to get a feel for what each specific customer is asking about

Web Data in Action (1/8)

The Next Best Offer


• A common marketing analysis is to predict what the next best offer is
for each customer
• To maximize the chances of success
• Having web behavior data can be very useful

Web Data in Action (2/8)

The Next Best Offer - An Example


• At a bank, information about Mr. Smith
▪ He has four accounts: checking, savings, credit card, and a car loan
▪ He makes five deposits and 25 withdrawals per month
▪ He never visits a branch in person
▪ He has a total of $50,000 in assets deposited
▪ He owes a total of $15,000 between his credit card and car loan

What is the best offer to place in an e-mail to Mr. Smith?


• A lower credit card interest rate
• An offer of a CD for his sizable cash holdings

But, how about offering a mortgage?


Web Data in Action (3/8)

The Next Best Offer - An Example (cont.)


• We have nothing that says it is remotely relevant
• If Mr. Smith’s web behavior is examined, we get additional
information
▪ He browsed mortgage rates five times in the past month
▪ He viewed information about homeowners’ insurance
▪ He viewed information about flood insurance
▪ He explored home loan options (i.e., fixed versus variable, 15- versus
30-year) twice in the past month

It’s pretty easy to decide what to discuss next with Mr. Smith
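
The reasoning in the Mr. Smith example can be sketched as counting recent web behaviors by topic and letting the dominant topic override the default offer. The topic labels, the topic-to-offer mapping, and the choice to count home loan browsing as mortgage interest are assumptions made for this illustration, not a real next-best-offer model.

```python
from collections import Counter

# Hypothetical mapping from a browsing topic to the offer it suggests.
TOPIC_TO_OFFER = {
    "mortgage": "mortgage",
    "credit_card": "lower credit card interest rate",
    "savings": "CD",
}


def next_best_offer(web_topics, default_offer):
    """Pick the offer tied to the most-browsed topic, else fall back to the default."""
    counts = Counter(topic for topic in web_topics if topic in TOPIC_TO_OFFER)
    if not counts:
        return default_offer
    topic, _ = counts.most_common(1)[0]
    return TOPIC_TO_OFFER[topic]


# Mr. Smith's past month, per the slide: mortgage rates five times, two insurance
# pages, and home loan options twice (counted here as mortgage interest).
smith_topics = ["mortgage"] * 5 + ["home_insurance", "flood_insurance"] + ["mortgage"] * 2

offer_with_web_data = next_best_offer(smith_topics, default_offer="CD")
offer_without_web_data = next_best_offer([], default_offer="CD")
```

Without the web data the function falls back to the default offer drawn from account data alone; with it, the mortgage offer becomes the obvious choice.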

Web Data in Action (4/8)

Attrition Modeling
• In the telecommunications industry, companies have invested massive
amounts of time and effort in “churn” models
• It is critical to understand patterns of customer usage and
profitability

Web Data in Action (5/8)

Attrition modeling: an example


• Mrs. Smith
• A customer of telecom Provider 101
• Searches for “How do I cancel my Provider 101 contract?”
• Visits Provider 101’s cancellation policies page

Knowing about these actions is very important for a churn model!

By capturing Mrs. Smith’s actions on the web, Provider 101 is able to
move more quickly to avert losing Mrs. Smith
Web Data in Action (6/8)

Response Modeling
• It is similar to attrition modeling
• The goal is predicting a positive behavior (purchase or response)
rather than a negative behavior
• In a response model, all customers are scored and ranked
• In theory, every customer has a unique score
• In practice, a small number of variables define most models
• Many customers end up with identical or nearly identical scores
• Web data can help increase differentiation among customers

Web Data in Action (7/8)

Response Modeling - An Example


• Four customers are scored by a response model
• All have the exact same score, 0.62, because they have the same values for every variable
▪ Last purchase was within 90 days
▪ Six purchases in the past year
▪ Spent $200 to $300 in total
▪ Homeowner with estimated household income of $100,000 to $150,000
▪ Member of the loyalty program
▪ Has purchased the featured product category in the past year

• Using web data, the scores change drastically
▪ Customer 1 has never browsed your site: 0.62 → 0.54
▪ Customer 2 viewed the product category featured in the offer within the past month:
0.62 → 0.67
▪ Customer 3 viewed the specific product featured in the offer within the past month:
0.62 → 0.78
▪ Customer 4 browsed the specific product featured 3 times last week, added it to a
basket once, abandoned the basket, then viewed the product again later: 0.62 → 0.86
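
The effect on differentiation can be checked directly with the scores from this example. The snippet below only compares the before and after scores given above; it does not reproduce the response model itself.

```python
# Scores taken from the example above.
base_score = 0.62
web_adjusted = {
    "Customer 1": 0.54,  # never browsed the site
    "Customer 2": 0.67,  # viewed the featured category
    "Customer 3": 0.78,  # viewed the featured product
    "Customer 4": 0.86,  # browsed repeatedly and abandoned a basket
}

# Before web data: every customer shares one score, so they cannot be ranked.
before = {name: base_score for name in web_adjusted}
distinct_before = len(set(before.values()))

# After web data: every score is different, so a ranking exists.
distinct_after = len(set(web_adjusted.values()))
ranking = sorted(web_adjusted, key=web_adjusted.get, reverse=True)
```

One shared score becomes four distinct scores, and Customer 4 moves to the top of the contact list.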
Web Data in Action (8/8)

Customer Segmentation
• Web data makes it possible to segment customers based upon typical
browsing patterns

Dreamer

The Evolution of Analytic Processes
Outline
• Introduction
• The Analytic Sandbox
• Analytic Data Set (ADS)
• Enterprise Analytic Data Set (EADS)
• Scoring Routines

Introduction
• Upgrading technologies won’t provide a lot of value if the same old
analytical processes remain in place
1. Change the process of configuring and maintaining workspace: the Analytic Sandbox
2. Consistently leverage a database platform through a sandbox: the Enterprise Analytic Data Set (EADS)
3. Keep scores up to date on a daily basis: Embedded Scoring

The Analytical Sandbox (1/5)
Definition
• A set of resources that enable analytic professionals to experiment
and reshape data in whatever fashion they need to
• Data exploration
• Development of analytical processes
• Proofs of concept
• Prototyping

The Analytical Sandbox (2/5)
An Internal Sandbox
• A portion of an enterprise data warehouse or data mart is carved out
to serve as the analytic sandbox
• Strengths
• Leverage existing hardware resources and infrastructure already in place
• Ability to directly join production data with sandbox data
• Cost-effective since no new hardware is needed
• Weaknesses
• An additional load on the existing enterprise data warehouse or data mart
• Can be constrained by production policies and procedures

Figure: the sandbox sits inside the enterprise data warehouse or data mart, next to the core database tables and the analytic views & enterprise analytic data sets, and can hold additional data
The Analytical Sandbox (3/5)
An External Sandbox
• A physically separate analytic sandbox is created for testing and
development of analytic processes
• Strengths
• A stand-alone environment, no impact on other processes
• Reduces workload management
• Weaknesses
• The additional cost of the stand-alone system
• Some data movement is required

Figure: data is extracted from the enterprise data warehouse or data mart into the separate sandbox
The Analytical Sandbox (4/5)
A Hybrid Sandbox
• The combination of an internal sandbox and an external sandbox
• Strengths
• Flexibility in the approach taken for an analysis
• Can be run in a ‘pseudo-production’ mode temporarily
• Weaknesses
• Both an internal and an external sandbox environment must be maintained
• Two-way data feeds may be required, which adds complexity

Figure: an internal sandbox inside the enterprise data warehouse or data mart, plus an external sandbox fed by an extract
The Analytical Sandbox (5/5)
Benefits
• From the view of an analytic professional
• Independence
• Flexibility
• Efficiency
• Freedom
• Speed

• From the view of IT


• Centralization
• Streamlining
• Simplicity
• Control
• Costs

Analytic Data Set (1/2)
Definition
• The data that is pulled together in order to create an analysis or
model
• In the format required for the specific analysis at hand
• Generated by transforming, aggregating, and combining data
• Helps to bridge the gap between efficient storage and ease of use

Analytic Data Set (2/2)
Two Primary Kinds of Analytic Data Sets
• A development ADS
• Used to build an analytic process
• Has many variables or metrics within it
• Very wide but not very deep
• A production ADS
• Needed for scoring and deployment
• Contains only the specific metrics that were actually in the final solution
• Not very wide but very deep

Figure: base tables (Table1 to Table6) are derived, aggregated, combined, and transformed into a development ADS (wide & shallow) and a production ADS (narrow & deep)
Enterprise Analytic Data Set (1/5)
Traditional Analytic Data Sets
• All analytic data sets are created outside of the database
• Each analytic professional creates their own data sets independently
• The risk of inconsistencies
• The repetitious work

A dedicated ADS is generated outside the database for every project

Enterprise Analytic Data Set (2/5)
Enterprise Analytic Data Set
• A shared and reusable set of centralized, standardized analytic data
sets for use in analytics
• A standardized view of data to support multiple analysis efforts
• Streamline the data preparation process
• Provides greater consistency, accuracy, and visibility to analytics processes
• Build once, use many

Centralized ADS tables and views are utilized across many projects

Enterprise Analytic Data Set (3/5)
Structure
EADS Logical View: Customer ADS Table
Customer | Total Sales | Total Purchases | Homeowner | Gender | Mail Responder | E-mail Opt In

EADS Potential Physical View:
Customer Sales: Customer | Total Sales | Total Purchases
Customer Demographics: Customer | Homeowner | Gender
Customer Sales: Customer | Mail Responder | E-mail Opt In

For updating, an EADS could very well be stored differently!

Enterprise Analytic Data Set (4/5)
Summary Table or View?
• Summary tables that are updated via a scheduled process
• Benefits
• Compute once, use many
• Most advanced analytics efforts involve a heavy use of historical data
• Very low latency in getting data
• Downsides
• May not be fully up to date with the latest data
• Use disk space on the system, potentially a whole lot of it

Enterprise Analytic Data Set (5/5)
Summary Table or View?
• A series of views that are run on demand
• Benefits
• Always completely fresh and up to date
• Good performance in real-time analysis
• Changes are immediately available
• Consistency and transparency of the computations
• Downsides
• The system load won’t necessarily be reduced that much
• Users have to wait longer to get their data back
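
The trade-off between the two options can be sketched with Python's built-in `sqlite3` module. The table and view names and the sample rows are hypothetical, and SQLite stands in for whatever database platform hosts the EADS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer_id INTEGER, amount REAL);
INSERT INTO sales VALUES (1, 10.0), (1, 15.0), (2, 7.5);

-- Summary table: computed once, as a scheduled process would do.
CREATE TABLE customer_sales_summary AS
    SELECT customer_id, SUM(amount) AS total_sales
    FROM sales GROUP BY customer_id;

-- View: recomputed on demand every time it is queried.
CREATE VIEW customer_sales_view AS
    SELECT customer_id, SUM(amount) AS total_sales
    FROM sales GROUP BY customer_id;
""")

# A new sale arrives after the summary table was last refreshed.
conn.execute("INSERT INTO sales VALUES (2, 2.5)")


def total_for(source, customer_id):
    """Read one customer's total from either the summary table or the view."""
    query = f"SELECT total_sales FROM {source} WHERE customer_id = ?"
    return conn.execute(query, (customer_id,)).fetchone()[0]


stale_total = total_for("customer_sales_summary", 2)  # still the old value
fresh_total = total_for("customer_sales_view", 2)     # includes the new sale
```

The summary table answers from stored results and so misses the new sale until its next refresh, while the view pays the aggregation cost on every query but is always current.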

Scoring Routines (1/2)
Embedded Scoring
• Score
• Something generated from a predictive model, or any other type of output from
an analytic process

• Embedded Scoring
• Deploying each individual scoring routine
• A process to manage and track the various scoring routines

• Benefits
• Scores run in batches will be available on demand
• Real-time scoring
• Abstract complexity from users
• Have all the models contained in a centralized repository so they are all in one place
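
A centralized repository of scoring routines can be sketched as a registry that hides each model's logic behind one call. The model names, input fields, and scoring rules below are hypothetical placeholders, not real models.

```python
# Central repository: every deployed scoring routine is registered by name.
SCORING_ROUTINES = {}


def register(name):
    """Decorator that adds a scoring routine to the central repository."""
    def decorator(func):
        SCORING_ROUTINES[name] = func
        return func
    return decorator


@register("churn_v1")
def churn_score(customer):
    # Hypothetical rule standing in for a real churn model.
    return 0.8 if customer["visited_cancellation_page"] else 0.1


@register("response_v1")
def response_score(customer):
    # Hypothetical rule standing in for a real response model.
    return min(1.0, 0.1 * customer["purchases_past_year"])


def score(model_name, customer):
    """Users ask for a score by model name; the model's complexity stays hidden."""
    return SCORING_ROUTINES[model_name](customer)


customer = {"visited_cancellation_page": True, "purchases_past_year": 6}
churn = score("churn_v1", customer)
response = score("response_v1", customer)
```

Because every routine lives in one place, the same `score` call can serve a nightly batch job or a single real-time request.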

Scoring Routines (2/2)
Model and Score Management
• Model and score management procedures will need to be in place to
scale the use of models by an organization
Analytic Data Set Inputs

Model Definitions

Model Validation & Reporting

Model Scoring Outputs

The Evolution of Analytic Tools and Methods
Outline
• Introduction
• The Evolution of Analytic Methods
• The Evolution of Analytic Tools

Introduction
• Analytic professionals have used a range of tools over the years
• Execute analytic algorithms
• Assess the results

But Now

The Evolution of Analytic Methods(1/7)

The Evolution of Analytic Methods


• Until the advent of computers, it wasn’t feasible to run
• Many iterations of a model
• Highly advanced methods
• Large datasets

Figure: in the past, data → naïve algorithm → output; now, data → sophisticated algorithm → output

The Evolution of Analytic Methods(2/7)
Ensemble Methods

• Ensemble methods are built using multiple techniques
• They go beyond any individual performer

Figure: linear regression, logistic regression, decision tree, and neural network → aggregator → final result
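
The aggregator step can be sketched as averaging the predictions of several models. The three toy models below are hypothetical stand-ins for the techniques named on the slide, and simple averaging is only one possible way to aggregate.

```python
from statistics import mean


# Hypothetical stand-ins for individually trained models.
def model_a(x):
    return 0.70 if x > 5 else 0.20


def model_b(x):
    return min(1.0, x / 10)


def model_c(x):
    return 0.90 if x % 2 == 0 else 0.40


def ensemble_predict(models, x):
    """Aggregator: average the individual predictions into one final result."""
    return mean(model(x) for model in models)


models = [model_a, model_b, model_c]
individual = [model(8) for model in models]
final_result = ensemble_predict(models, 8)
```

Each model sees the same input, and the aggregator smooths out the places where any single model is off.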

The Evolution of Analytic Methods(3/7)

The Wisdom of Crowds


• One reason ensemble models are gaining traction is the wisdom of crowds

The Evolution of Analytic Methods(4/7)
Commodity Model
• A commodity model is one that has been produced rapidly
• A commodity modeling process stops when something good enough
is found

Figure: a quality scale where 100 is great and 90 is sometimes acceptable
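
The "stop when good enough" rule can be sketched as a loop that tries candidate models in order of effort and returns the first one that clears a threshold. The candidate names, quality numbers, and the threshold of 90 are hypothetical.

```python
def first_good_enough(candidates, evaluate, threshold):
    """Try candidates in order; stop at the first one whose quality meets the threshold."""
    for name in candidates:
        quality = evaluate(name)
        if quality >= threshold:
            return name, quality
    return None, None


# Hypothetical quality per candidate, listed from least to most effort.
QUALITY = {"quick rule": 72, "simple regression": 91, "tuned ensemble": 99}
tried = []


def evaluate(name):
    """Record which candidates were actually built, then report their quality."""
    tried.append(name)
    return QUALITY[name]


chosen, quality = first_good_enough(list(QUALITY), evaluate, threshold=90)
```

The most expensive candidate is never built, which is the point of a commodity model: accept 90 quickly rather than chase 100.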
The Evolution of Analytic Methods(5/7)
Uses for Commodity Models

• Traditionally, building models was time-intensive and expensive
• Modeling for low-value problems didn’t make sense
• A commodity model provides an option for low-value problems

VS

The Evolution of Analytic Methods(6/7)
Text Analysis
• Analysis of text and other unstructured data sources is growing
rapidly
• Some structure is applied to unstructured data as it is processed
• The structured results are what is analyzed

Figure: unstructured text → parser → structured data
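
A very small parser shows the idea: free text goes in and a structured table of word counts comes out. Word counting is an assumption chosen for brevity; real text analysis would extract richer features such as tone, intent, or topic, and the sample review is made up.

```python
import re
from collections import Counter


def parse(text):
    """Turn free text into structured data: a count for each lowercase word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical customer review.
structured = parse("Great phone. Great battery, poor camera.")
```

The analysis then works on the counts, not on the raw text.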

The Evolution of Analytic Methods(7/7)

Ambiguity
• Applying context to the text is no easy task
• “Read a book” vs. “book a ticket”
• Emphasis can change the meaning

Varying the emphasis in “I didn’t say Bill’s book stinks” changes the meaning:
• Emphasis on “I”: But my buddy Bob did!
• Emphasis on “didn’t”: How dare you accuse me of such a thing
• Emphasis on “say”: But I admit that I did write it in an e-mail
• Emphasis on “Bill’s”: It’s that other guy’s book that stinks
• Emphasis on “book”: I said his blog stinks
• Emphasis on “stinks”: I simply said it wasn’t my favorite

The Evolution of Analytic Tools(1/7)

Previous Tools
• Analytics work was done against a mainframe in the 1980s
• Not user-friendly
• Analysts had to program code directly to do analytics

The Evolution of Analytic Tools(2/7)
Graphical User Interface

• Graphical user interfaces can accelerate the generation of code while
ensuring it is bug-free and optimized
• Point-and-click environment
• Generate the code automatically
• Users still should understand the code to validate the intention

The Evolution of Analytic Tools(3/7)
The Explosion of Point Solutions

• Analytic point solutions are software packages that address a set of
specific problems
• Price optimization applications
• Fraud applications
• Demand forecasting applications
• One downside of point solutions is the high price
• Can be $10 million
• Implementing point solutions in a serial way is preferred

The Evolution of Analytic Tools(4/7)
Open Source

• Open-source software has been around for some time


• In many cases, open-source products are outside the mainstream
• Many individuals are contributing to improving the functionality
• Bugs can be patched soon

The Evolution of Analytic Tools(5/7)

The R Project for Statistical Computing


• The R Project is open-source software for statistical computing
• Features of the R Project
• More object-oriented
• Integrates new features faster
• Free of charge
• Programming-intensive

The Evolution of Analytic Tools(6/7)

Data Visualization
• An effective visualization can make a pattern jump right off the page
at you
• Today’s visualization tools allow
• Multiple tabs
• Link the graphs and charts with underlying data
• New idea for data visualization
• 3-D

The Evolution of Analytic Tools(7/7)

Importance of Data Visualization


• Appropriate visualization will increase an audience’s
comprehension
• Understanding how to visualize data will help analytic
professionals become better
