6 DW

USE OF INFORMATION TECHNOLOGIES AT FINANCE
DWH: Transforming big data into a strategic asset

Caner Sayın
2018
AGENDA
• About Instructors
• What is Data Warehouse?
• Data Mining & Analytics
• Big Data
• Questions
About me
We want to know…
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
Some Definitions
Data Warehouse : Collects and stores integrated sets of historical data from
multiple systems for analysis and reporting.
Data Mining : A set of techniques that use mathematical algorithms to segment
the data and evaluate the probability of future events.
Big Data : An evolving term that describes any voluminous amount of structured,
semi-structured and unstructured data that has the potential to be mined for
information.
Machine Learning : The subfield of computer science that «gives computers the
ability to learn without being explicitly programmed».
What is Data Warehouse?
[Diagram with columns: Sources, Reports, People / Business Processes, Business Benefits]
Sources : ERP, Finance, HR, CRM, Web, Branch / Store Ops, Legacy, External
People / Business Processes : HQ, CEO, Board; Marketing; Sales; Depts, Lines of
Business; Product Development; Customers, Suppliers
Business Benefits : Customer Insight, Customer Churn, Product Assortment,
Revenue Assurance, Target Marketing, Cross Sell / Up Sell, Fraud Detection,
Value add services
Without Data Warehouse
[Diagram: the same Sources, Reports, People / Business Processes and Business
Benefits columns, linked through many separate "Copy" and "Trns" steps]
Business Benefits : Product Assortment, Revenue Assurance, Target Marketing,
Cross Sell / Up Sell, Fraud Detection, Value add services
With Data Warehouse
[Diagram: Sources -> ETL (Extract, Transform, Load) -> Warehouse (Data Model) ->
Analytics -> People / Business Processes -> Business Benefits]
Warehouse data model : Comprehensive, Cross Enterprise, Atomic detailed data,
History, Recalculation, Consistent, Source independent, Accessible,
Understandable, Trusted
Analytics : Reports, Scorecards & Dashboards, Trends, Data Mining
Business Benefits : Product Assortment, Revenue Assurance, Target Marketing,
Cross Sell / Up Sell, Fraud Detection, Value add services
Data Warehouse Architecture
[Diagram: Files, Source1, Source2 ... Source-n -> Data Warehouse (Data, Metadata,
Data Marts) -> Reports, Data Mining, Machine Learning]
What is Data Mining?
• Usually, the goal is either to discover or generate some preliminary insights in an area
where there was little knowledge beforehand, or to predict future observations
accurately. Data mining procedures can be either 'unsupervised' (we don't know the
answer: discovery) or 'supervised' (we know the answer: prediction).
• Knowledge discovery from hidden patterns
• Supports finding associations, constructing analytical models, performing classification
and prediction, and presenting the mining results using visualization tools
• Draws ideas from machine learning/AI, pattern recognition, statistics and database
systems
• Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
[Diagram: Data Mining at the intersection of Machine Learning, Statistics, Pattern
Recognition and Database Systems]
Data Mining Techniques
Descriptive Analytics
[Diagram: Historical Data -> DW]
Uses : Only historical and current data
Represents : Classical Data Warehouse, Business Intelligence and reporting
Responses : What happened in the past
Predictive Analytics
[Diagram: Historical Data + Rules, Algorithms -> DW + D.Mining -> Future prediction]
Inputs : Historical data, current data and external data
Uses : In addition to the historical data, rules, algorithms and external
data support predictions about the future
Responses : Why did it happen and what will happen
Prescriptive Analytics
[Diagram: Historical Data + Rules, Algorithms -> DW + D.Mining + Big Data +
Machine Learning -> Future prediction -> Optimisation -> Actions]
Inputs : Historical data, current data and external data
Uses : Goes beyond prediction: recommends actions to achieve optimum benefit
by using optimisation techniques.
Responses : What is the best that could happen
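The three levels can be contrasted on a toy example. This is only a sketch: the monthly sales figures, the function names and the two pricing options are invented for illustration, and the straight-line fit stands in for whatever model a real project would use.

```python
# Toy data (invented for illustration): units sold in six consecutive months.
monthly_sales = [100, 110, 120, 130, 140, 150]


def describe(history):
    """Descriptive: summarize what happened in the past."""
    return {"total": sum(history), "average": sum(history) / len(history)}


def predict_next(history):
    """Predictive: fit a least-squares line and extrapolate one period ahead."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


def prescribe(forecast, options):
    """Prescriptive: choose the action with the highest expected revenue.

    `options` maps a hypothetical action name to (unit_price, demand_multiplier).
    """
    def expected_revenue(name):
        price, multiplier = options[name]
        return price * forecast * multiplier

    return max(options, key=expected_revenue)


summary = describe(monthly_sales)
forecast = predict_next(monthly_sales)
best_action = prescribe(forecast, {"keep_price": (10.0, 1.0), "discount": (9.0, 1.2)})
```

Here `describe` answers "what happened", `predict_next` answers "what will happen", and `prescribe` picks between candidate actions given the forecast.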
Examples of data mining applications
APPLICATION : DESCRIPTION
Market segmentation : Identifies the common characteristics of customers who
buy the same products from the company
Customer churn : Predicts which customers are likely to leave your company
and go to a competitor
Fraud detection : Identifies which transactions are most likely to be fraudulent
Direct marketing : Identifies which prospects should be included in a mailing
list to obtain the highest response rate
Market basket analysis : Understands what products or services are commonly
purchased together
Trend analysis : Reveals the difference between a typical customer this
month versus last month
Science : Simulates nuclear explosions; visualizes quantum physics
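As an illustration of the "commonly purchased together" idea, the sketch below counts how often each pair of products shares a basket. The baskets and the function name `pair_counts` are made up for the example; production tools typically use association-rule algorithms rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Invented example baskets
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "milk", "eggs"],
    ["butter", "eggs"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
support = top_count / len(baskets)  # share of baskets containing the top pair
```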

How big is BIG?
• “Big Data is the frontier of a firm's ability to store, process, and access (SPA) all the data it
needs to operate effectively, make decisions, reduce risks, and serve customers.” Forrester
• “Big Data in general is defined as high volume, velocity and variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight and
decision making.” Gartner
• “Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To
gain value from this data, you must choose an alternative way to process it.” O’Reilly
• “Big data is the data characterized by 4 key attributes: volume, variety, velocity and
value.” IBM
Big Data in a different way.
Byte : one grain of rice
Kilobyte : cup of rice (Hobbyist)
Megabyte : 8 bags of rice
Gigabyte : 3 semi trucks (Desktop)
Terabyte : 2 container ships
Petabyte : blankets Manhattan (Internet)
Exabyte : blankets west coast states
Zettabyte : fills the Pacific Ocean
Yottabyte : an Earth-size rice ball! (The Future?)
Why Big Data
• What percentage of all the data in the world was created in the past 2 years?
Over 90% of all the data in the world was created in the past 2 years.
• How many emails, Facebook likes, tweets and photos do we send every minute?
Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278
thousand Tweets, and upload 200,000 photos to Facebook.
• How many searches does Google process per second?
Google alone processes on average over 40 thousand search queries per second, making it
over 3.5 billion in a single day.
• How many hours of video are uploaded to YouTube every minute?
Around 100 hours of video are uploaded to YouTube every minute and it would take you
around 15 years to watch every video uploaded by users in one day.
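The daily totals can be checked with a few lines of arithmetic. The sketch assumes exactly 40,000 queries per second and exactly 100 hours of video per minute, the round figures quoted above. It gives about 3.46 billion searches per day and roughly 16 years of nonstop viewing, the same order of magnitude as the rounded figures on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 24 * 365

# Search queries per day at a steady 40,000 queries per second
searches_per_day = 40_000 * SECONDS_PER_DAY

# Hours of video uploaded per day at 100 hours per minute
video_hours_per_day = 100 * MINUTES_PER_DAY

# Years of continuous, nonstop viewing needed to watch one day of uploads
years_to_watch = video_hours_per_day / HOURS_PER_YEAR
```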
Big Data Facts
• Big data has been used to predict crimes before they happen – a “predictive policing” trial in
California was able to identify areas where crime will occur three times more accurately than
existing methods of forecasting.
• By better integrating big data analytics into healthcare, the industry could save $300bn a
year – that’s the equivalent of reducing the healthcare costs of every man, woman and child
by $1,000 a year.
• The big data industry is expected to grow from US$10.2 billion in 2013 to about US$54.3
billion by 2017.
Three Characteristics of Big Data (3 Vs)
Volume : Data quantity
Velocity : Data speed
Variety : Data types
Also shown : Verification, Value
Characteristics of Big Data
Volume
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smart phones, the data they create and consume, and sensors embedded into everyday
objects will soon result in billions of new, constantly-updated data feeds containing
environmental, location, and other information, including video.
Velocity
• Clickstreams and ad impressions capture user behavior at millions of events per second.
• Machine to machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real-time.
• On-line gaming systems support millions of concurrent users, each producing multiple
inputs per second.
Variety
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data,
audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured
data, fewer updates or a predictable, consistent data structure.
• Big Data analysis includes different types of data.
The Structure of Data
Structured
Relational Databases - SQL
• Oracle
• MySQL
• SybaseIQ
• MsSQL
Semi-structured
• JSON
• XML
Unstructured
NoSQL
• MongoDB
• Cassandra
• Neo4j
• Redis
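The difference between structured and semi-structured data can be seen by parsing both with the Python standard library. The customer records are invented for illustration. Every row of the tabular data has the same columns, while each JSON record may carry its own fields and nesting.

```python
import csv
import io
import json

# Structured: a fixed set of columns, as in a relational table
csv_text = "customer_id,name,city\n1,Ada,Istanbul\n2,Alan,Ankara\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing records whose fields can differ and nest
json_text = """
[
  {"customer_id": 1, "name": "Ada", "phones": ["555-0100"]},
  {"customer_id": 2, "name": "Alan", "address": {"city": "Ankara"}}
]
"""
records = json.loads(json_text)

structured_columns = [set(row) for row in rows]              # identical for every row
semi_structured_fields = [set(record) for record in records]  # varies per record
```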
2 Types of Big Data
Types of tools used in Big Data
• Where is processing hosted?
Distributed Servers / Cloud (e.g. Amazon EC2, Hadoop Clusters)
• Where is data stored?
Distributed Storage (e.g. Amazon S3, HDFS)
• What is the programming model?
Distributed Processing (e.g. MapReduce, Apache Spark)
• How is data stored & indexed?
High-performance schema-free databases (e.g. MongoDB, Neo4j)
What is Hadoop?
• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• Handles unstructured to semi-structured to structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.
The core of Apache Hadoop consists of a storage part, the Hadoop Distributed
File System (HDFS), and a processing part (MapReduce). Hadoop splits files into
large blocks and distributes them among the nodes in the cluster.
History of Hadoop
• Early in Google’s history, developers there codified a style of

programming that they called MapReduce that is surprisingly
effective at processing very large amounts of data and yet is able
to express a wide range of algorithms.
• In 2004, these developers published an article that described
their methods and results.
• Fortunately, two enterprising engineers — Doug Cutting and
Mike Cafarella — had been working on their own web crawling
technology named Nutch.
• After reading the Google research paper, they set out to create
the foundations of what would later be known as Hadoop, in
2005.
What are the core parts of a Hadoop distribution?
HDFS Storage
• Redundant (3 copies)
• For large files – large blocks
• 64 or 128 MB / block
• Can scale to 1000s of nodes
MapReduce API
• Batch (Job) processing
• Distributed and Localized to clusters (Map)
• Auto-Parallelizable for huge amounts of data
• Fault-tolerant (auto retries)
• Adds high availability and more
Other Libraries
• Pig
• Hive
• HBase
• Others
Hadoop Cluster HDFS (Physical) Storage
• Hadoop Distributed File System.
• Stores files in blocks across many nodes in a cluster.
• Replicates the blocks across nodes for durability.
• Master/Slave architecture.
• NameNode
  • Runs on a single node as a master process
  • Holds file metadata (which blocks are where)
  • Directs client access to files in HDFS
• SecondaryNameNode
  • Not a hot failover
  • Maintains a copy of the NameNode metadata
• DataNode
  • Generally runs on all nodes in the cluster
  • Block creation/replication/deletion/reads
  • Takes orders from the NameNode
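The block and replication ideas can be sketched as a small simulation. It is not how HDFS is implemented: the round-robin placement is an assumption made for simplicity (real HDFS uses a rack-aware policy), and the function and node names are hypothetical.

```python
import math


def split_into_blocks(file_size_mb, block_size_mb=128):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Returns the kind of metadata a NameNode keeps: block id -> list of nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block + 1] = [
            data_nodes[(block + offset) % len(data_nodes)]
            for offset in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"]
num_blocks = split_into_blocks(300)          # a 300 MB file with 128 MB blocks
block_map = place_replicas(num_blocks, nodes)
```

Losing any single node in this sketch still leaves two copies of every block, which is the durability argument behind replication.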
HDFS Illustrated
[Animation: Put File. A File is sent to the NameNode, which tracks blocks 1, 2 and 3
across DataNodes 1 to 6]
Final NameNode entries, one row per block (apparently the DataNodes holding its three copies):
1,4,6
2,5,3
3,2,6
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
Power of Hadoop
[Animation: Read File. The NameNode entries start as 1,4,6 / 2,5,3 / 3,2,6. In a later
frame DataNode 1 is no longer shown and the first entry drops to 4,6; it is then
restored to three copies as 5,4,6.]
Read rate = Transfer Rate x Number of Machines*
100 MB/s x 3 = 300 MB/s
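The rate arithmetic on this slide can be written as a small helper. It is an idealized sketch: it assumes the blocks are spread evenly and ignores network and disk contention, and the function names are illustrative.

```python
def aggregate_read_rate(transfer_rate_mb_s, machines):
    """Combined read rate when every machine streams its blocks in parallel."""
    return transfer_rate_mb_s * machines


def parallel_read_seconds(file_size_mb, transfer_rate_mb_s, machines):
    """Idealized time to read a file spread evenly over the machines."""
    return file_size_mb / aggregate_read_rate(transfer_rate_mb_s, machines)


rate = aggregate_read_rate(100, 3)                 # the slide's 100 MB/s x 3
one_machine = parallel_read_seconds(300, 100, 1)   # a 300 MB file on one machine
three_machines = parallel_read_seconds(300, 100, 3)
```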
What is MapReduce?
MapReduce is a framework for processing parallelizable problems across huge datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous
hardware).
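The programming model itself fits in a few lines. The sketch below runs the classic word-count example in a single Python process to show the map, shuffle and reduce steps. It is not the Hadoop API, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or block) would be mapped on a different node.
    mapped = [pair for document in documents for pair in map_phase(document)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


result = word_count(["big data", "Big deal"])
```

Because each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own key, both steps can run in parallel across nodes.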
Hadoop Distributions
What are Hadoop distributions?
• Hadoop is Apache software, so it is freely available for download and use. So why do we need
distributions at all?
• This is very similar to Linux a few years back and Linux distributions like Red Hat, SUSE and
Ubuntu. The software is free to download and use, but distributions offer an easier-to-use
bundle.
What do Hadoop distros offer?
• Distributions provide easy-to-install media such as RPMs
• Distros package multiple components that work well together
• Tested
• Performance patches
• Predictable upgrade path
• And most importantly: SUPPORT
Distro : Remarks : Free / Premium

Apache (hadoop.apache.org)
• The Hadoop source
• No packaging except TAR balls
• No extra tools
Free / Premium : Completely free and open source

Cloudera (www.cloudera.com)
• Oldest distro
• Very polished
• Comes with good tools to install and manage a Hadoop cluster
Free / Premium : Free / Premium model (depending on cluster size)

HortonWorks (www.hortonworks.com)
• Newer distro
• Tracks Apache Hadoop closely
• Comes with tools to manage and administer a cluster
Free / Premium : Completely open source

MapR (www.mapr.com)
• MapR has its own file system (alternative to HDFS)
• Boasts higher performance
• Nice set of tools to manage and administer a cluster
• Does not suffer from Single Point of Failure
• Offers some cool features like mirroring, snapshots, etc.
Free / Premium : Free / Premium model
Ways to MapReduce
Libraries & Frameworks
• HBase
• Hive
• Pig
• Spark
• Sqoop
• Oozie
• Mahout
• Others…
Languages
• Java*
• HiveQL (HQL)
• Pig Latin
• Python
• Scala
• JavaScript
• R
• More…
Hive
• Map-Reduce is scalable
• SQL has a huge user base
• SQL is easy to code
• Solution: Combine SQL and Map-Reduce
• Hive on top of Hadoop (open source)
• Efficient implementations of SQL filters, joins and
group-by’s on top of map reduce
• SQL queries are converted to Map-Reduce code in
background
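To make the SQL-to-MapReduce idea concrete, the sketch below evaluates a filter plus group-by in map and reduce style. It is a conceptual illustration only, not the code Hive generates; the `sales` rows and the query in the comment are hypothetical.

```python
from collections import defaultdict

# Conceptually equivalent query (hypothetical table and columns):
#   SELECT channel, SUM(amount) FROM sales WHERE amount > 0 GROUP BY channel

sales = [
    {"channel": "web", "amount": 120},
    {"channel": "branch", "amount": 80},
    {"channel": "web", "amount": 30},
    {"channel": "branch", "amount": -10},  # dropped by the WHERE filter
]


def map_row(row):
    """Map: apply the WHERE filter and emit (group key, value)."""
    if row["amount"] > 0:
        yield row["channel"], row["amount"]


def run_query(rows):
    """Shuffle the emitted pairs by key, then reduce each group with SUM."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


totals = run_query(sales)
```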
Impala
• Impala brings scalable parallel database technology to
Hadoop, enabling users to issue low-latency SQL
queries to data stored in HDFS and Apache HBase
without requiring data movement or transformation
HBase
• Apache HBase is the Hadoop database, a distributed, scalable, big data store.
• This project's goal is the hosting of very large tables: billions of rows x
millions of columns.
• HBase's data model is similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop Distributed File System (HDFS).
Spark
• Speed
Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Ease of Use
Write applications quickly in Java, Scala, Python, R.
• Combine SQL, streaming, and complex analytics
Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the
same application.
Sqoop
• Apache Sqoop is a tool designed for efficiently transferring bulk

data between Apache Hadoop and structured data stores such
as relational databases.
Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop
jobs.
• Oozie Coordinator jobs are recurrent Oozie Workflow jobs
triggered by time (frequency) and data availability.
• Oozie is integrated with the rest of the Hadoop stack supporting
several types of Hadoop jobs out of the box (such as Java map-
reduce, Streaming map-reduce, Spark, Hive, Sqoop) as well as
system specific jobs (such as Java programs and shell scripts).
Apache Hadoop Ecosystem
Data Access: Pig, Hive
Data Storage: HBase, Cassandra
Interaction, Visualization, Execution,
Development: HCatalog, Lucene, Hama, Crunch
Data Serialization: Avro, Thrift
Data Intelligence: Drill, Mahout
Data Integration: Sqoop, Flume, Chukwa
Management: Ambari (Portal)
Monitoring: ZooKeeper
Orchestration: Oozie
Applications for Big Data Analytics
Thank You !
