6 DW
2018
AGENDA
• About Instructors
• What is Data Warehouse?
• Data Mining & Analytics
• Big Data
• Questions
About me
We want to know….
Big Data : Big data is an evolving term that describes any voluminous
amount of structured, semi-structured and unstructured data that has the
potential to be mined for information.
[Figure: data sources (Finance, Marketing, HR, Sales, CRM, Web, Legacy, External) shown alongside Depts, Lines of Business, Customers, and Suppliers, with the questions the business wants answered:]
• Customer Insight
• Customer Churn
• Product Assortment
• Revenue Assurance
• Target Marketing
• Cross Sell / Up Sell
• Fraud Detection
What is Data Warehouse?
[Figure: Sources (HR, Sales, CRM, Web, Legacy, External) feed People (Depts, Lines of Business, Customers, Suppliers) and each use below through many separate copies ("Copy") and transformations ("Trns"):]
• Product Assortment
• Revenue Assurance
• Target Marketing
• Cross Sell / Up Sell
• Fraud Detection
What is Data Warehouse?
[Figure: Sources (HR, Sales, CRM, Web, Branch / Store Ops, Legacy, External) are copied through Extract, Transform, Load (ETL) into the data warehouse, which serves People (Depts, Lines of Business, Customers, Suppliers).]
Data warehouse characteristics:
• Comprehensive
• Cross Enterprise
• Atomic, detailed data
• History
• Recalculation
• Consistent
• Source independent
• Accessible
• Understandable
• Trusted
Outputs labeled in the figure: Scorecards & Dashboards, Data Mining, Product Development
Uses:
• Product Assortment
• Revenue Assurance
• Target Marketing
• Trends
• Cross Sell / Up Sell
• Fraud Detection
• Value add services
Data Warehouse Architecture
[Figure: Files and sources (Source1, Source2, ..., Source-n) feed the Data Warehouse, shown with Metadata and Data Marts; the outputs are Reports, Data Mining, and Machine Learning.]
What is Data Mining?
• Usually, the goal is either to discover preliminary insights in an area where little was
known beforehand, or to predict future observations accurately.
• Data mining procedures can be 'unsupervised' (we don't know the answer: discovery) or
'supervised' (we know the answer: prediction).
• Knowledge discovery from hidden patterns
• Supports finding associations, constructing analytical models, performing classification and
prediction, and presenting the mining results using visualization tools
• Draws ideas from machine learning/AI, pattern recognition, statistics and database
systems
• Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
[Figure: Data Mining drawing on Machine Learning, Statistics, Pattern Recognition, and Database Systems]
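The difference between 'unsupervised' and 'supervised' mining can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data values, the two-cluster split, and the nearest-centroid rule are all hypothetical choices made for this sketch, not methods named in the slides.

```python
from statistics import mean

def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two groups (a tiny k-means, k=2)."""
    low, high = min(values), max(values)          # initial cluster centers
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(low_group), mean(high_group)
    return low_group, high_group

def fit_nearest_centroid(samples, labels):
    """Supervised: learn one centroid per known label."""
    return {
        label: mean(s for s, l in zip(samples, labels) if l == label)
        for label in set(labels)
    }

def predict(centroids, sample):
    """Predict the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: abs(centroids[label] - sample))

# Discovery: no labels, we only look for structure (hypothetical monthly spend).
monthly_spend = [10, 12, 11, 95, 100, 98]
segments = two_means(monthly_spend)

# Prediction: labels are known for past customers (hypothetical support calls).
support_calls = [0, 1, 1, 6, 7, 8]
outcome = ["stay", "stay", "stay", "leave", "leave", "leave"]
model = fit_nearest_centroid(support_calls, outcome)
print(segments, predict(model, 5))
```

The first function never sees an answer and only groups similar customers; the second learns from past outcomes and then predicts the outcome for a new customer.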
Data Mining Techniques
[Figure: Historical Data in the DW feeds Descriptive Analytics and Predictive Analytics, which use rules and algorithms to drive Actions.]
APPLICATION: DESCRIPTION
Market segmentation: Identifies the common characteristics of customers who buy the same
products from the company
Customer churn: Predicts which customers are likely to leave your company and go to a
competitor
• “Big Data is the frontier of a firm's ability to store, process, and access (SPA) all the data it
needs to operate effectively, make decisions, reduce risks, and serve customers.” (Forrester)
• “Big Data in general is defined as high volume, velocity and variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight and
decision making.” (Gartner)
• “Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To
gain value from this data, you must choose an alternative way to process it.” (O’Reilly)
• “Big data is the data characterized by 4 key attributes: volume, variety, velocity and
value.” (IBM)
Big Data in a different way.
[Figure: data size units in increasing order: Byte, Megabyte, Gigabyte, Petabyte, Exabyte, Zettabyte, Yottabyte]
Big Data in a different way.
• How many emails, Facebook likes, tweets and photos do we send every minute?
Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278
thousand tweets, and upload 200,000 photos to Facebook.
• Big data has been used to predict crimes before they happen – a “predictive policing” trial in
California was able to identify areas where crime will occur three times more accurately than
existing methods of forecasting.
• By better integrating big data analytics into healthcare, the industry could save $300bn a
year – that’s the equivalent of reducing the healthcare costs of every man, woman and child
by $1,000 a year.
• The big data industry is expected to grow from US$10.2 billion in 2013 to about US$54.3
billion by 2017.
Three Characteristics of Big Data: the 3 Vs (volume, velocity, variety)
• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• Handles unstructured to semi-structured to structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.
The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS))
and a processing part (MapReduce). Hadoop splits files into large blocks and distributes
them amongst the nodes in the cluster.
History of Hadoop
HDFS Storage
• Redundant (3 copies)
• For large files – large blocks
• 64 or 128 MB / block
• Can scale to 1000s of nodes
MapReduce API
• Batch (Job) processing
• Distributed and Localized to clusters (Map)
• Auto-Parallelizable for huge amounts of data
• Fault-tolerant (auto retries)
• Adds high availability and more
Other Libraries
• Pig
• Hive
• HBase
• Others
Hadoop Cluster: HDFS (Physical) Storage
HDFS Illustrated
[Figure, animation frames: a client puts a file through the NameNode; the file is split into numbered blocks that are placed step by step on the DataNodes, ending with the block sets 1,4,6 / 2,5,3 / 3,2,6.]
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
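The idea of splitting a file into blocks and storing three copies of each can be sketched as follows. This is a simplified illustration, not HDFS code: the round-robin placement, the node names, and the 700 MB file size are assumptions made for the example, and real HDFS placement also takes racks into account.

```python
import math

BLOCK_SIZE_MB = 128   # the slides mention 64 or 128 MB per block
REPLICATION = 3       # "Redundant (3 copies)"

def count_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption of this sketch, chosen for simplicity.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: [] for node in nodes}
    for block in range(1, num_blocks + 1):
        for copy in range(replication):
            node = nodes[(block - 1 + copy) % len(nodes)]
            placement[node].append(block)
    return placement

nodes = [f"DataNode {i}" for i in range(1, 7)]   # hypothetical cluster
num_blocks = count_blocks(700)                   # hypothetical 700 MB file
placement = place_replicas(num_blocks, nodes)
for node, blocks in placement.items():
    print(node, blocks)
```

Because every block lives on three different nodes, losing a single node does not lose any block.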
Power of Hadoop
[Figure, animation frames: a client reads a file through the NameNode, pulling blocks from several DataNodes in parallel (DataNode 2 to DataNode 6 are labeled; block sets 5,4,6 / 2,5,3 / 3,2,6).]
Read rate = Transfer rate x Number of machines*
100 MB/s x 3 = 300 MB/s
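The slide's arithmetic (100 MB/s per machine, 3 machines, 300 MB/s in total) can be written out as a small sketch. The product is an aggregate transfer rate; the read time follows by dividing the file size by it. The 3000 MB file size is a hypothetical value, and the model ignores network and coordination overhead.

```python
def aggregate_read_rate(rate_per_machine_mb_s, machines):
    """Combined transfer rate when all machines read in parallel."""
    return rate_per_machine_mb_s * machines

def read_time_seconds(file_size_mb, rate_per_machine_mb_s, machines):
    """Idealized time to read a file spread evenly over the machines."""
    return file_size_mb / aggregate_read_rate(rate_per_machine_mb_s, machines)

print(aggregate_read_rate(100, 3))        # 300 MB/s, as on the slide
print(read_time_seconds(3000, 100, 1))    # one machine
print(read_time_seconds(3000, 100, 3))    # three machines in parallel
```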
What is MapReduce?
MapReduce is a framework for processing parallelizable problems across huge datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous
hardware).
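A word count is the usual first example of the MapReduce pattern. The sketch below runs in a single Python process and only mimics the three phases (map, shuffle, reduce); it does not use the Hadoop API, and the function names are chosen for this illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # On a cluster, each document (split) would be mapped on a different node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big cluster"]))
```

Each call to `map_phase` is independent of the others, which is what makes the problem parallelizable across nodes.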
Hadoop Distributions
Cloudera (www.cloudera.com)
• Oldest distro
• Very polished
• Comes with good tools to install and manage a Hadoop cluster
• Free / Premium model (depending on cluster size)
HortonWorks (www.hortonworks.com)
• Newer distro
• Tracks Apache Hadoop closely
• Comes with tools to manage and administer a cluster
• Completely open source
MapR (www.mapr.com)
• MapR has its own file system (alternative to HDFS)
• Boasts higher performance
• Nice set of tools to manage and administer a cluster
• Does not suffer from Single Point of Failure
• Offers some cool features like mirroring, snapshots, etc.
• Free / Premium model
Ways to MapReduce
Libraries: HBase, Hive, Pig, Spark, Sqoop, Oozie, Mahout, Others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, Scala, JavaScript, R, More…
Hive
• Map-Reduce is scalable
• SQL has a huge user base
• SQL is easy to code
• Solution: Combine SQL and Map-Reduce
• Hive on top of Hadoop (open source)
• Efficient implementations of SQL filters, joins and
group-bys on top of MapReduce
• SQL queries are converted to MapReduce code in the
background
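To make the "SQL on top of MapReduce" idea concrete, the sketch below shows how a filter plus group-by could be expressed as a map step and a reduce step. It is a simplified illustration in plain Python, not Hive's actual query plan; the `sales` table, its columns, and the query are hypothetical.

```python
from collections import defaultdict

# Hypothetical query being illustrated:
#   SELECT region, SUM(amount) FROM sales WHERE amount > 0 GROUP BY region
sales = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 30},
    {"region": "south", "amount": -5},   # dropped by the WHERE clause
]

def mapper(row):
    """WHERE + projection: emit (group key, value) for rows passing the filter."""
    if row["amount"] > 0:
        yield row["region"], row["amount"]

def reducer(key, values):
    """Aggregation: SUM(amount) for one group."""
    return key, sum(values)

def run_query(rows):
    groups = defaultdict(list)            # stands in for the shuffle step
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(run_query(sales))
```

The user writes only the SQL; the map and reduce functions are the kind of code generated behind the scenes.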
Impala
• Impala brings scalable parallel database technology to
Hadoop, enabling users to issue low-latency SQL
queries to data stored in HDFS and Apache HBase
without requiring data movement or transformation
HBase
Spark
• Speed
Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Ease of Use
Write applications quickly in Java, Scala, Python, R.
• Combine SQL, streaming, and complex analytics
Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the
same application.
Sqoop
Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop
jobs.
• Oozie Coordinator jobs are recurrent Oozie Workflow jobs
triggered by time (frequency) and data availability.
• Oozie is integrated with the rest of the Hadoop stack supporting
several types of Hadoop jobs out of the box (such as Java map-
reduce, Streaming map-reduce, Spark, Hive, Sqoop) as well as
system specific jobs (such as Java programs and shell scripts).
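The coordinator rule (run when the time interval has elapsed and the input data is available) can be sketched as a simple check. This is not Oozie's API or configuration format; the function names, paths, and dates are hypothetical and only illustrate the triggering logic.

```python
from datetime import datetime, timedelta

def is_due(now, last_run, frequency):
    """Time trigger: has at least one full interval passed since the last run?"""
    return now - last_run >= frequency

def should_trigger(now, last_run, frequency, required_inputs, available_inputs):
    """Run the workflow only if it is due AND all required inputs exist."""
    data_ready = set(required_inputs) <= set(available_inputs)
    return is_due(now, last_run, frequency) and data_ready

last_run = datetime(2018, 1, 1)
daily = timedelta(days=1)
required = ["/data/sales/2018-01-01"]     # hypothetical input path

print(should_trigger(datetime(2018, 1, 2), last_run, daily, required, required))
print(should_trigger(datetime(2018, 1, 2), last_run, daily, required, []))
```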
Apache Hadoop Ecosystem