
USE OF INFORMATION TECHNOLOGIES IN FINANCE

DWH: Transforming big data into a strategic asset


Caner Sayın

2018
AGENDA
• About Instructors
• What is Data Warehouse?
• Data Mining & Analytics
• Big Data
• Questions

About me
AGENDA
• About Instructors
• What is Data Warehouse?
• Data Mining & Analytics
• Big Data
• Questions

We want to know….

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
Some Definitions

Data Warehouse : collects and stores integrated sets of historical data from multiple systems to support analysis and reporting.

Data Mining : a set of techniques that uses mathematical algorithms to segment the data and evaluate the probability of future events.

Big Data : an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

Machine Learning : the subfield of computer science that "gives computers the ability to learn without being explicitly programmed".
What is Data Warehouse?

Sources: ERP, Finance, HR, CRM, Web, Branch / Store Ops, Legacy, External
People: HQ (CEO, Board); Depts / Lines of Business; Customers, Suppliers
Outputs: Reports, Business Processes

Business Benefits:
• Customer Insight
• Customer Churn
• Product Assortment
• Revenue Assurance
• Target Marketing
• Cross Sell / Up Sell
• Fraud Detection
• Value-add services
• Product Development
What is Data Warehouse?

Without a Data Warehouse: every consumer (HQ, departments, lines of business, customers, suppliers) pulls its own copies and transformations ("Copy" / "Trns") directly from each source system (ERP, Finance, HR, CRM, Web, Branch / Store Ops, Legacy, External). The result is a tangle of redundant, point-to-point data flows all feeding the same reports and business processes.
What is Data Warehouse?

With a Data Warehouse: each source is extracted, transformed and loaded (ETL) once into a central analytics warehouse built on a data model. The warehouse is:
• Comprehensive and cross-enterprise
• Atomic, detailed data with history and trends
• Supports recalculation
• Consistent and source independent
• Accessible, understandable, trusted

From this single store the business gets Reports, Scorecards & Dashboards and Data Mining, delivering the same business benefits (customer insight, churn, product assortment, revenue assurance, target marketing, cross sell / up sell, fraud detection, value-add services, product development) without the point-to-point copies.
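The consolidation described above is, at its core, an Extract-Transform-Load (ETL) pipeline. A minimal sketch in Python; the source records, field names and mappings here are all hypothetical, invented only to show the shape of the process:

```python
# Minimal ETL sketch: two source systems with inconsistent schemas
# are conformed into one warehouse-style fact list.

# Extract: raw records as each (hypothetical) source system exposes them
crm_rows = [{"cust": "C1", "rev": "120.50"}, {"cust": "C2", "rev": "80.00"}]
erp_rows = [{"customer_id": "C1", "revenue_eur": 40.0}]

def transform(row, mapping):
    """Rename source fields to the warehouse schema and coerce types."""
    out = {mapping[k]: v for k, v in row.items()}
    out["revenue"] = float(out["revenue"])
    return out

# Load: append conformed rows into a single integrated store
warehouse = []
warehouse += [transform(r, {"cust": "customer_id", "rev": "revenue"}) for r in crm_rows]
warehouse += [transform(r, {"customer_id": "customer_id", "revenue_eur": "revenue"}) for r in erp_rows]

# One consistent view across sources: total revenue per customer
totals = {}
for row in warehouse:
    totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["revenue"]

print(totals)  # {'C1': 160.5, 'C2': 80.0}
```

The point of the sketch is the single conformed schema: once every source maps into it, every downstream report reads one store instead of copying from each system.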
Data Warehouse Architecture

Source1, Source2, …, Source-n → Files → Data Warehouse (with Metadata) → Data Marts → Reports, Data Mining, Machine Learning
AGENDA
• About Instructors
• What is Data Warehouse?
• Data Mining & Analytics
• Big Data
• Questions
What is Data Mining?
• Usually, the goal is either to discover / generate some preliminary insights in an area
where there really was little knowledge beforehand, or to be able to predict future
observations accurately. Moreover, data mining procedures could be either
'unsupervised' (we don't know the answer--discovery) or 'supervised' (we know the
answer--prediction)
• Knowledge discovery from hidden patterns
• Supports associations, constructing analytical models, performing classification and
prediction, and presenting the mining results using visualization tools
• Draws ideas from machine learning/AI, pattern recognition, statistics and database systems; data mining sits at the intersection of these fields
• Traditional techniques may be unsuitable due to:
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
Data Mining Techniques

Descriptive Analytics

Historical Data → DW

Uses : only historical and current data
Represents : classical Data Warehouse, Business Intelligence and reporting
Responds to : "What happened in the past?"
Data Mining Techniques

Predictive Analytics

Historical Data → DW + Data Mining (Rules, Algorithms) → Future prediction

Inputs : historical data, current data and external data
Uses : in addition to the historical data, rules, algorithms and external data help predict the future
Responds to : "Why did it happen, and what will happen?"
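As a toy illustration of the predictive idea (historical data plus a learned rule produces a forward-looking score), here is a minimal frequency-based churn model in pure Python. The segments and outcomes are invented for the example; a real model would use far richer features and algorithms:

```python
# Toy predictive model: estimate churn probability per customer segment
# from historical outcomes, then use it to score future customers.

history = [  # (segment, churned?) -- hypothetical historical records
    ("prepaid", True), ("prepaid", True), ("prepaid", False),
    ("contract", False), ("contract", False), ("contract", True),
    ("contract", False),
]

def fit(records):
    """Learn churn rate per segment: churned count / total count."""
    totals, churned = {}, {}
    for segment, did_churn in records:
        totals[segment] = totals.get(segment, 0) + 1
        churned[segment] = churned.get(segment, 0) + int(did_churn)
    return {s: churned[s] / totals[s] for s in totals}

model = fit(history)
print(model["prepaid"])   # ~0.67: prepaid customers churn most in this toy data
print(model["contract"])  # 0.25
```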
Data Mining Techniques

Prescriptive Analytics

Historical Data → DW + Data Mining + Big Data + Machine Learning (Rules, Algorithms) → Future prediction → Optimisation → Actions

Inputs : historical data, current data and external data
Uses : goes beyond prediction; recommends actions to achieve optimum benefit by using optimisation techniques
Responds to : "What is the best that could happen?"
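Prescriptive analytics adds an optimisation step on top of the predictions. A minimal sketch; the candidate actions, predicted uplifts and costs are invented for illustration:

```python
# Toy prescriptive step: given predicted outcomes for each candidate
# action, recommend the action with the highest expected net benefit.

actions = {  # action -> (predicted revenue uplift, cost) -- hypothetical
    "discount_10pct": (12000.0, 5000.0),
    "loyalty_points": (8000.0, 2000.0),
    "do_nothing": (0.0, 0.0),
}

def recommend(candidates):
    """Pick the action maximising predicted uplift minus cost."""
    return max(candidates, key=lambda a: candidates[a][0] - candidates[a][1])

best = recommend(actions)
print(best)  # discount_10pct (net 7000 beats net 6000 and net 0)
```

Real prescriptive systems use proper optimisation techniques (linear programming, simulation) over many constraints; the structure, prediction feeding an argmax over actions, is the same.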
Examples of data mining applications

Market segmentation : identifies the common characteristics of customers who buy the same products from the company
Customer churn : predicts which customers are likely to leave your company and go to a competitor
Fraud detection : identifies which transactions are most likely to be fraudulent
Direct marketing : identifies which prospects should be included in a mailing list to obtain the highest response rate
Market basket analysis : understands what products or services are commonly purchased together
Trend analysis : reveals the difference between a typical customer this month versus last month
Science : simulates nuclear explosions; visualizes quantum physics
AGENDA
• About Instructors
• What is Data Warehouse?
• Data Mining & Analytics
• Big Data
• Questions
How big is BIG?

• “Big Data is the frontier of a firm's ability to store, process, and access (SPA) all the data it
needs to operate effectively, make decisions, reduce risks, and serve customers.” Forrester
• “Big Data in general is defined as high volume, velocity and variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight and
decision making.” Gartner
• “Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To
gain value from this data, you must choose an alternative way to process it.” O’Reilly
• “Big data is the data characterized by 4 key attributes: volume, variety, velocity and
value.” IBM
Big Data in a different way…

Byte : one grain of rice
Kilobyte : a cup of rice (hobbyist scale)
Megabyte : 8 bags of rice
Gigabyte : 3 semi trucks (desktop scale)
Terabyte : 2 container ships
Petabyte : blankets Manhattan (internet scale)
Exabyte : blankets the west coast states
Zettabyte : fills the Pacific Ocean
Yottabyte : an Earth-sized rice ball! (the future?)
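Each step in the rice analogy is a factor of roughly a thousand (1024 in binary units). A quick sketch to make the jumps concrete:

```python
# Each unit is 1024x the previous one (binary prefixes).
units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human(n_bytes):
    """Express a byte count in the largest sensible unit."""
    value, idx = float(n_bytes), 0
    while value >= 1024 and idx < len(units) - 1:
        value /= 1024
        idx += 1
    return f"{value:.1f} {units[idx]}"

print(human(500 * 1024**4))  # "500.0 TB" -- Facebook's daily intake per this deck
print(human(1024**8))        # "1.0 YB" -- the Earth-sized rice ball
```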
Why Big Data?

• What percentage of all the data in the world was created in the past 2 years?
Over 90% of all the data in the world was created in the past 2 years.

• How many emails, Facebook likes, tweets and photos do we send every minute?
Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand tweets, and upload 200,000 photos to Facebook.

• How many searches does Google handle per second?
Google alone processes on average over 40 thousand search queries per second, which is over 3.5 billion in a single day.

• How much video is uploaded to YouTube every minute?
Around 100 hours of video are uploaded to YouTube every minute, and it would take you around 15 years to watch every video uploaded by users in one day.
Big Data Facts

• Big data has been used to predict crimes before they happen – a “predictive policing” trial in
California was able to identify areas where crime will occur three times more accurately than
existing methods of forecasting.
• By better integrating big data analytics into healthcare, the industry could save $300bn a
year – that’s the equivalent of reducing the healthcare costs of every man, woman and child
by $1,000 a year.
• The big data industry is expected to grow from US$10.2 billion in 2013 to about US$54.3
billion by 2017.
Three Characteristics of Big Data: the 3 Vs

• Volume : data quantity
• Velocity : data speed
• Variety : data types

Two further Vs are often added: Verification and Value.
Characteristics of Big Data

Volume:
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smart phones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

Velocity:
• Clickstreams and ad impressions capture user behavior at millions of events per second.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.

Variety:
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates and a predictable, consistent data structure.
• Big Data analysis includes different types of data.
The Structure of Data

• Structured : relational databases (SQL) such as Oracle, MySQL, SybaseIQ, MsSQL
• Semi-structured : JSON, XML
• Unstructured : free-form content, typically held in NoSQL stores such as MongoDB, Cassandra, Neo4j, Redis
Types of tools used in Big-Data

• Where processing is hosted?


Distributed Servers / Cloud (e.g. Amazon EC2, Hadoop Clusters)

• Where data is stored?


Distributed Storage (e.g. Amazon S3, HDFS)

• What is the programming model?


Distributed Processing (e.g. MapReduce, Apache Spark)

• How data is stored & indexed?


High-performance schema-free databases (e.g. MongoDB, Neo4j)
What is Hadoop?

• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• Handles unstructured, semi-structured and structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.

The core of Apache Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster.
History of Hadoop

• Early in Google’s history, developers there codified a style of


programming that they called MapReduce that is surprisingly
effective at processing very large amounts of data and yet is able
to express a wide range of algorithms.
• In 2004, these developers published an article that described
their methods and results.
• Meanwhile, two enterprising engineers, Doug Cutting and Mike Cafarella, had been working on their own web crawling technology named Nutch.
• After reading the Google research paper, they set out to create
the foundations of what would later be known as Hadoop, in
2005.
What are the core parts of a Hadoop distribution?

HDFS Storage:
• Redundant (3 copies)
• For large files: large blocks, 64 or 128 MB per block
• Can scale to 1000s of nodes

MapReduce API:
• Batch (job) processing
• Distributed and localized to clusters (Map)
• Auto-parallelizable for huge amounts of data
• Fault-tolerant (auto retries)

Other Libraries: Pig, Hive, HBase and others add high availability and more.
Hadoop Cluster HDFS (Physical) Storage

• Hadoop Distributed File System.


• Stores files in blocks across many nodes in a cluster.
• Replicates the blocks across nodes for durability.
• Master/Slave architecture.
• NameNode
• Runs on a single node as a master process
• Holds file metadata (which blocks are where)
• Directs client access to files in HDFS
• SecondaryNameNode
• Not a hot failover
• Maintains a copy of the NameNode metadata
• DataNode
• Generally runs on all nodes in the cluster
• Block creation/replication/deletion/reads
• Takes orders from the NameNode
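The NameNode's bookkeeping can be sketched as a simple mapping from each block to the DataNodes holding its replicas. The placement policy below is a naive round-robin invented for the example; real HDFS placement is rack-aware:

```python
# Toy NameNode metadata: assign each block of a file to 3 DataNodes
# using naive round-robin placement (real HDFS is rack-aware).

DATANODES = [1, 2, 3, 4, 5, 6]
REPLICATION = 3

def place_blocks(n_blocks):
    """Return {block_id: [datanode, ...]} with REPLICATION copies each."""
    placement = {}
    for b in range(n_blocks):
        placement[b] = [DATANODES[(b + r) % len(DATANODES)] for r in range(REPLICATION)]
    return placement

metadata = place_blocks(3)
print(metadata)  # {0: [1, 2, 3], 1: [2, 3, 4], 2: [3, 4, 5]}

# Any single DataNode can fail and every block still has 2 live replicas.
assert all(len(set(nodes)) == REPLICATION for nodes in metadata.values())
```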
HDFS Illustrated

A client asks the NameNode to put a file; the file is split into blocks and each block is written to three DataNodes:
• Block 1 → DataNodes 1, 4, 6
• Block 2 → DataNodes 2, 5, 3
• Block 3 → DataNodes 3, 2, 6
The NameNode records only this metadata; the DataNodes (1–6) hold the actual block replicas.

Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
Power of Hadoop

On a read, the client gets the block locations from the NameNode and fetches blocks from several DataNodes in parallel. If DataNode 1 fails, every block it held is still available from its replicas (block 1 is re-served from DataNode 5, 4 or 6), so the read completes uninterrupted.

Aggregate read rate = transfer rate × number of machines*
e.g. 100 MB/s × 3 = 300 MB/s
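The aggregate-throughput claim is easy to check with a back-of-the-envelope sketch; the 100 MB/s single-node rate is the illustrative figure from the slide, not a measured value:

```python
# Back-of-the-envelope: reading blocks from N DataNodes in parallel
# multiplies throughput by N (network and NameNode overhead ignored).

PER_NODE_MB_S = 100  # illustrative single-disk transfer rate

def parallel_throughput(n_nodes):
    """Aggregate MB/s when n_nodes serve blocks simultaneously."""
    return PER_NODE_MB_S * n_nodes

def read_seconds(file_mb, n_nodes):
    """Time to read a file whose blocks are spread over n_nodes."""
    return file_mb / parallel_throughput(n_nodes)

print(parallel_throughput(3))  # 300 MB/s, as on the slide
print(read_seconds(9000, 3))   # 30.0 s for a 9 GB file over 3 nodes
print(read_seconds(9000, 1))   # 90.0 s on a single node
```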
What is MapReduce?
MapReduce is a framework for processing parallelizable problems across huge datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous
hardware).
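The canonical MapReduce example is word count: the map phase emits (word, 1) pairs, a shuffle groups the pairs by key, and the reduce phase sums each group. A single-process Python sketch of those three phases (a real framework runs the map and reduce calls on many nodes in parallel):

```python
from collections import defaultdict

# Map: each input record -> list of (key, value) pairs
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Reduce: one key plus all its values -> aggregated result
def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(records):
    # Shuffle: group intermediate values by key, as the framework would
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["big data is big", "data is data"]))
# {'big': 2, 'data': 3, 'is': 2}
```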
Hadoop Distributions

What are Hadoop distributions?

• Hadoop is Apache software, so it is freely available for download and use. Why, then, do we need distributions at all?
• This is very akin to Linux a few years back and Linux distributions like RedHat, Suse and Ubuntu: the software is free to download and use, but distributions offer an easier-to-use bundle.

What do Hadoop distros offer?
• Easy-to-install packages such as RPMs
• Multiple components packaged to work well together
• Tested
• Performance patches
• A predictable upgrade path
• And, most importantly: SUPPORT
Hadoop Distributions

Apache (hadoop.apache.org) : the Hadoop source; no packaging except TAR balls; no extra tools. Completely free and open source.

Cloudera (www.cloudera.com) : oldest distro; very polished; comes with good tools to install and manage a Hadoop cluster. Free / premium model (depending on cluster size).

HortonWorks (www.hortonworks.com) : newer distro; tracks Apache Hadoop closely; comes with tools to manage and administer a cluster. Completely open source.

MapR (www.mapr.com) : has its own file system (an alternative to HDFS); boasts higher performance; nice set of tools to manage and administer a cluster; does not suffer from a single point of failure; offers features like mirroring, snapshots, etc. Free / premium model.
Ways to MapReduce

Libraries & Frameworks: HBase, Hive, Pig, Spark, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, Scala, JavaScript, R, more…
Hive
• Map-Reduce is scalable
• SQL has a huge user base
• SQL is easy to code
• Solution: Combine SQL and Map-Reduce
• Hive on top of Hadoop (open source)
• Efficient implementations of SQL filters, joins and
group-by’s on top of map reduce
• SQL queries are converted to Map-Reduce code in the background
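What Hive does under the hood can be mimicked in miniature: a query like `SELECT dept, SUM(salary) FROM employees GROUP BY dept` becomes a map phase keyed on `dept` and a reduce phase that sums each group. A pure-Python sketch; the table, columns and values are invented for illustration:

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (dept, salary)
rows = [("sales", 100), ("it", 120), ("sales", 90), ("it", 130)]

# Equivalent of: SELECT dept, SUM(salary) FROM employees GROUP BY dept
def group_by_sum(records):
    # Map: emit (dept, salary); Shuffle: group by dept; Reduce: sum each group
    groups = defaultdict(int)
    for dept, salary in records:
        groups[dept] += salary
    return dict(groups)

print(group_by_sum(rows))  # {'sales': 190, 'it': 250}
```

Hive's real query planner generates distributed map and reduce jobs rather than an in-memory loop, but the mapping from SQL clauses to map/shuffle/reduce stages is the same idea.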

Impala
• Impala brings scalable parallel database technology to
Hadoop, enabling users to issue low-latency SQL
queries to data stored in HDFS and Apache HBase
without requiring data movement or transformation
HBase

• Apache HBase is the Hadoop database, a distributed,


scalable, big data store.
• This project's goal is the hosting of very large tables: billions of rows × millions of columns
• HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
Spark

• Speed
Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Ease of Use
Write applications quickly in Java, Scala, Python, R.
• Combine SQL, streaming, and complex analytics
Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the
same application.
Sqoop

• Apache Sqoop is a tool designed for efficiently transferring bulk


data between Apache Hadoop and structured data stores such
as relational databases.

Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop
jobs.
• Oozie Coordinator jobs are recurrent Oozie Workflow jobs
triggered by time (frequency) and data availability.
• Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Spark, Hive and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
Apache Hadoop Ecosystem

Data Access: Pig, Hive
Data Storage: HBase, Cassandra
Interaction, Visualization, Execution, Development: HCatalog, Lucene, Hama, Crunch
Data Serialization: Avro, Thrift
Data Intelligence: Drill, Mahout
Data Integration: Sqoop, Flume, Chukwa
Management: Ambari (portal)
Coordination: Zookeeper
Orchestration: Oozie
Applications for Big Data Analytics
Thank You !
