
BIG DATA ARCHITECTURE

Lecture - 1
Lecture Goals

1. Issues and solutions in current business strategy


2. What is Big Data?
3. Distributed approaches
4. Arrival of Hadoop
5. Case studies

Slide 2
Current status of Business

1. Overwhelmed with data


Byte < Kilobyte < Megabyte < Gigabyte < Terabyte < Petabyte < Exabyte < Zettabyte
< Yottabyte
2. Starved for knowledge
3. Result for Companies
a) Lost productivity
b) Lost opportunities
c) Lost revenues
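
To make the unit ladder above concrete, here is a minimal sketch that formats a raw byte count with the largest fitting unit. It assumes decimal (SI) units, where each step is a factor of 1000; the function name `human_readable` is hypothetical and chosen only for illustration.

```python
# Decimal (SI) unit ladder, smallest to largest; each step is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: int) -> str:
    """Format a byte count using the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(human_readable(5 * 10**15))
```

With binary units (factor 1024) the same loop applies; only the divisor and the unit names would change.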

Slide 3
Pathway of Solution

1. Issues faced by the company


a) Storage Space
b) Efficient analysis of data
c) Scalable solutions
d) Cost
2. Organization → Software Architect → Software Developer → Data Analyst
3. Proprietary software
a) Lack in scalability
b) Low performance
4. Apache Hadoop
Slide 4
Advantages of understanding the data

1. Input – Process – Output → Store – Process & Understand – Analyse


2. Benefits
a) Business competition
b) Importance of customers
c) Situational awareness
d) Productivity
e) Increase profit
f) Science and Innovation

Slide 5
Assets of Organization

1) Employees
2) Machinery
3) Liabilities
4) Information
a) Volume
b) Accessibility
c) Trustworthiness
d) Capability to make sense of it
e) Availability within a reasonable time
f) Empowers intelligent decision making
Slide 6
What is Big Data?

Big Data refers to large data sets that come from many sources in many formats and that
can be processed and analyzed to find insights and patterns for making informed
decisions.

According to the American IT research and advisory firm Gartner Inc.,

“Big Data is high-volume, high-velocity and/or high-variety information assets
that demand cost-effective, innovative forms of information processing
that enable enhanced insight and decision making.”

Slide 7
Big Data consists of...

• Structured data
• Unstructured data
• Graph data
• Images
• Videos
• Voice
• Text and so on

Big data is a term that describes large volumes of data, both structured and
unstructured.

Slide 8
Challenges with Big Data

1. Storage of Big Data


2. Processing of Big Data
3. Manual Distributed Computing (with respect to System Set Up)

Slide 9
Analytics / Analysis

1. Analytics is the analysis of business data to get meaningful insights
2. These insights enable businesses to act and make strategic business decisions
3. Businesses can
a) Compete more effectively with their competitors
b) Look for trends
c) Derive statistics
d) Identify new business possibilities

Slide 10
OLTP (Online Transaction Processing) versus OLAP (Online Analytical
Processing) Systems

1. OLTP Systems (Online Transaction Processing)


a. They do / execute the business for organizations
b. Data ??
2. OLAP Systems (Online Analytical Processing)
a. They analyse the business for organizations
b. Data ??
3. In enterprise systems, OLTP systems act as the source of data. (Otherwise the
sources are sensors, IoT devices, etc.)
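
The contrast between the two kinds of system can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `orders` table and its columns are hypothetical, and a single in-memory database stands in for what would normally be separate transactional and analytical systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many small writes, each touching a single row,
# committed together as one transaction.
with conn:
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("north", 120.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("south", 80.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("north", 50.0))
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (90.0, 2))

# OLAP-style work: one read-only query that scans and aggregates many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The writes "do the business" (record and amend individual orders); the aggregate query "analyses the business" (revenue per region).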

Slide 11
RDBMS cannot store big data because...

1. They were designed with OLTP systems in mind, which need
a. High throughput
b. Low-latency operations
c. CRUD (create, read, update, delete) operations
2. RDBMSs were built to scale up (a more powerful single machine), not to scale out (more machines)

Slide 12
Computer Cluster

1. A computer cluster is a set of connected computers that work together
2. In many respects, they can be viewed as a single system
3. They work toward a common goal

Slide 13
Distributed Computing

1. A distributed computing system is a system whose components (machines / slaves) are
located on different networked computers (forming a cluster, as described above).
2. Such systems are set up so that one can harness the combined processing power of all
the networked computers.
3. They are also known as scale-out systems.
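
The scale-out idea can be sketched as "split the work, let each machine handle its share, combine the partial results". In the sketch below, threads on one machine stand in for networked computers, which is an assumption made purely for illustration; the helper names `split` and `distributed_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` nearly equal chunks, one per simulated machine."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks get one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def distributed_sum(data, workers=4):
    """Each worker sums its own chunk; the partial sums are then combined."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, split(data, workers)))
    return sum(partials)


print(distributed_sum(list(range(1, 101))))
```

Adding capacity here means raising `workers`, not making any single worker faster, which is the difference between scaling out and scaling up.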

Slide 15
Apache Hadoop

1. Apache Hadoop is an open source framework that provides an automated distributed
computing environment for storing big data sets. It stores the data on a cluster of
commodity machines (reasonably priced hardware) and then analyses the stored data
using a very simple programming model.

○ The storage mechanism is known as HDFS (Hadoop Distributed File System). It is
based on Google's GFS (Google File System) paper.
○ The analytical mechanism is known as MapReduce and is based on Google's
MapReduce paper.
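
The "very simple programming model" can be illustrated with the classic word count, written here in plain Python rather than against the Hadoop API. This is a single-process sketch of the map, shuffle and reduce steps only; the function names are hypothetical, and a real Hadoop job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The developer writes only the map and reduce functions; distribution, grouping and fault handling are left to the framework.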

Slide 16
Hadoop and previous Distributed approaches

1. Data is distributed in advance


2. Data is replicated throughout the cluster of computers for reliability and availability
3. Data processing tries to occur where the data is stored
4. Eliminates the bandwidth bottlenecks
5. Simple programming approaches
6. Abstracts complexity in distributed environment implementations
7. Powerful mechanism for data analytics
a) Huge storage
b) High processing power
c) Reliability, fail over and scalability
8. Provides easy-to-use platform
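
Points 2 and 3 above (replication and processing where the data is stored) can be sketched as a small simulation. The assumptions are loud ones: the round-robin placement is a simplification and not HDFS's actual placement policy, the node and block names are made up, and `place_blocks` and `schedule` are hypothetical helpers. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (simplified)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def schedule(block, placement, free_nodes):
    """Return (node, is_local): prefer a free node that already stores the block."""
    for node in placement[block]:
        if node in free_nodes:
            return node, True
    # No replica holder is free: fall back to any free node, which must
    # then read the block over the network.
    return sorted(free_nodes)[0], False


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["b0", "b1"], nodes)
print(placement)
print(schedule("b0", placement, {"n3", "n4"}))
print(schedule("b0", placement, {"n4"}))
```

Because every block lives on several nodes, losing one node leaves other copies available, and the scheduler usually finds a node where the task can read its input from local disk instead of the network.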
Slide 17
Data Science in Business World

1. Goal: Extract meaning from data


2. Basic operations
a) Math
b) Statistical analysis
c) Pattern recognition
d) Machine learning
e) High-performance computing
f) Data warehousing and more

Slide 18
Hadoop Development Environment

1. Powerful computational platform


2. Highly scalable environment
3. Parallelizable execution
4. Development of Big Data analysis applications
5. Development of enterprise applications

Slide 19
Typical Business Use cases

• Consumer Behavioural Analytics
• Sentiment Analysis
• CRM Onboarding
• Prediction
• Etc.

Slide 20
Typical Business Use cases (contd.,)

Use Case | Input Data | Source | Purpose
CBA | Unstructured clickstream data | E-Commerce web site | Impression on products and ads
SA | Social data | Social network | Opinion of social communities about their product
CRM-OB | CBA + SA | Online + offline data | Accurate customer segmentation to provide offers to specific customers

Also listed under Purpose: analysis & prediction; product catalog selection; special offers;
maintain the reputation of their brand

Slide 21
Typical Business Use cases : Big Data in Education

Slide 22
Typical Business Use cases : Big Data in Education (contd.,)

Slide 23
Typical Business Use cases : Big Data in Education (contd.,)

Slide 24
Typical Business Use cases : Big Data in E-Commerce

Slide 25
Typical Business Use cases : Big Data in E-Commerce (contd.,)

Slide 26
Typical Business Use cases

1. Consumer Behavioural Analytics
2. Sentiment Analysis
3. CRM Onboarding
4. Prediction
5. Fraud detection for banks and credit card companies
6. Social media marketing analysis
7. Shopping pattern analysis for retail product placement
8. Traffic pattern recognition for urban development
9. Context optimization and engagement
10. Network analytics and mediation
11. Google Maps
12. Netflix
13. Cloudera and Hortonworks
14. AWS
15. Nutch
16. IBM Education System
17. E-Commerce

Slide 27
Typical Business Use cases : Big Data in E-Commerce (contd.,)

Slide 28
