
BIG DATA AND ANALYTICS

Subject Code : 18CS72 CIE Marks : 40

Lecture Hours : 50 SEE Marks : 60

Credits : 04

Arun Kumar P
Dept. of ISE, JNNCE, Shivamogga
Introduction to Big Data Analytics (10 Hours)

Introduction to Big Data
Definitions of Data

• Collins English Dictionary: Data is information, usually in the form of facts or statistics, that one can analyze or use for further calculations

• Computing: Data is information that can be stored and used by a computer program

• Electrical Engineering, Circuits, Computing and Control: Data is information presented in numbers, letters or other form

• Science: Data is information from a series of observations, measurements or facts

• Social Sciences: Data is information from a series of behavioural observations, measurements or facts
Definitions of Web Data

• Web data in the form of documents and other resources

• Data present in web servers: text, images, video, audio

• Web sites, web services, web portals, online business applications, emails, chats, tweets and social networks provide and consume web data

Examples of web data: Wikipedia, Google Maps, McGraw-Hill, Oxford Bookstore, YouTube, etc.

Classification of Data

1. Structured: has a schema and data model; stored in tables
2. Semi-structured
3. Unstructured: no data model

About 15 to 20% of data is structured or semi-structured; the rest is unstructured.

Using Structured Data

Structured data enables the following:

1. Insert, delete, update and append operations
2. Indexing for faster retrieval
3. Scalability
4. Transaction processing (follows ACID rules)
5. Encryption and decryption for security
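These operations can be sketched with Python's built-in sqlite3 module; the student table and its rows are illustrative assumptions, not data from the course.

```python
import sqlite3

# In-memory database; a production RDBMS would be a server such as MySQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: a table with a fixed schema.
cur.execute("CREATE TABLE student (usn TEXT PRIMARY KEY, name TEXT, marks INTEGER)")

# 1. Insert and update (delete and append work the same way)
cur.execute("INSERT INTO student VALUES ('4JN19IS001', 'Asha', 82)")
cur.execute("UPDATE student SET marks = 85 WHERE usn = '4JN19IS001'")

# 2. Indexing for faster retrieval
cur.execute("CREATE INDEX idx_marks ON student(marks)")

# 4. Transaction processing (ACID): commits on success, rolls back on error
with conn:
    conn.execute("INSERT INTO student VALUES ('4JN19IS002', 'Ravi', 74)")

rows = cur.execute("SELECT name, marks FROM student ORDER BY usn").fetchall()
print(rows)  # [('Asha', 85), ('Ravi', 74)]
```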

Using Semi-structured Data

1. In the form of XML and JSON:

<college>JNNCE, Shivamogga</college>

[{ "college":"JNNCE, Shivamogga" } ]

2. Contains tags or other markers
3. Does not conform to data models such as relational table models
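The two snippets above can be parsed with Python's standard library; the tag and the key act as the markers mentioned in point 2, so no external table schema is needed:

```python
import json
import xml.etree.ElementTree as ET

# The two snippets from the slide: the same fact in two self-describing formats.
xml_text = "<college>JNNCE, Shivamogga</college>"
json_text = '[{ "college": "JNNCE, Shivamogga" }]'

# The <college> tag and the "college" key describe the value themselves.
college_from_xml = ET.fromstring(xml_text).text
college_from_json = json.loads(json_text)[0]["college"]

print(college_from_xml == college_from_json)  # True
```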

Using Unstructured Data

1. Does not possess data features such as tables or databases
2. Found in file types such as .txt and .csv
3. Key-value pairs, emails
4. Does not reveal relationships, hierarchy or object-oriented features

Examples:
Mobile data: text messages, chat messages, tweets, blogs and comments
Website content data: YouTube, browsing data, e-payments, web stores, user-generated maps

Social media data: exchanging data in various forms

Texts and documents: personal documents and emails

Text internal to an organization: text within documents, logs, survey results

Satellite images, atmospheric data, surveillance and traffic videos, Instagram images, Flickr, etc.

Using Multi-structured Data

Consists of multiple formats: structured, semi-structured and/or unstructured

• Can have many formats
• Found in non-transactional systems

Examples: streaming data from customer interactions, data from multiple sensors, data at the web

Big Data Definitions
Gartner (2012): Big data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing for enhanced decision making, insight discovery and process optimization.

Doug Laney described the 3 Vs (volume, variety and velocity) as the key data management challenges for enterprises. Analytics literature also describes the 4 Vs: volume, velocity, variety and veracity.

Wikipedia: A collection of data sets so large and complex that traditional data processing applications are inadequate.

Oxford English Dictionary: Data of a very large size, typically to the extent
that its manipulation and management present significant logistical challenges.


• The McKinsey Global Institute [2011]: Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse.

By 2025, it is estimated that the global datasphere will grow to 175 zettabytes. Around 2,000,000,000,000,000,000 (2 exa) bytes of data are generated each day across all industries.
Source: https://www.sigmacomputing.com

Big Data Characteristics
• Volume: the amount or quantity of data generated by an application or applications. The size determines the processing considerations needed for handling the data.

• Velocity: the speed at which data is generated.


• Variety: data generated from multiple sources in a variety of forms; this leads to complexity.

• Veracity: how accurate or truthful a data set may be.

Data with these 4 Vs needs tools for mining, discovering patterns, business intelligence, AI, ML, text analytics, and descriptive and predictive analytics, along with data visualization tools.

Big Data Types
1. Social networks and web data: Facebook, Twitter, e-mails, blogs and YouTube
2. Transactions data and Business Process (BP) data: credit card transactions, flight bookings, public agency data (medical records, insurance)
3. Customer master data: data for facial recognition, and for name, date of birth, anniversary, gender, location and income category
4. Machine-generated data: IoT, machine-to-machine (M2M) data, sensor data, trackers, web and computer logs
5. Human-generated data: biometrics, human-machine interaction, notebooks/diaries, photos, audio and video

Data Classification
Based on characteristics and analytics

Traditional data
1. Data sources: records, RDBMSs, distributed databases, row-oriented in-memory data tables, column-oriented in-memory data tables, data warehouses, servers, machine-generated data, human-sourced data, business-process data and BI data
2. Data formats: structured and semi-structured
3. Processing data rates: batch, near-time, real-time, streaming
4. Analysis types: batch, scheduled and near real-time dataset analytics

Big Data Classification
Based on characteristics and analytics

Big data
1. Data sources: distributed file systems, operational data sources, data marts, data warehouses, NoSQL databases, sensor data, audit trails of financial transactions, and external data such as the web, social media, weather data and health records
2. Data formats: unstructured, semi-structured and multi-structured
3. Data store structures: web, enterprise or cloud servers, data warehouses, row-oriented data for OLTP, column-oriented data for OLAP, records, graph databases, hashed entries for key-value pairs
4. Processing data rates: high volume, velocity, variety and veracity; batch, near real-time and streaming data processing
5. Processing methods: batch processing (using MapReduce, Hive or Pig) and real-time processing (using Spark Streaming, Spark SQL or Apache Drill)
6. Analysis methods: statistical, predictive, regression, Mahout and other ML algorithms, clustering algorithms, classifiers, text analysis, social network analysis, location-based analysis, diagnostic analysis, cognitive analysis
7. Data usage: humans, business processes, knowledge discovery, enterprise applications, data stores
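The batch processing method in point 5 follows the MapReduce model; a toy word count in plain Python (illustrative documents, not Hadoop itself) shows the map, shuffle and reduce steps:

```python
from collections import defaultdict
from itertools import chain

# Toy corpus standing in for a distributed input split.
docs = ["big data needs batch processing",
        "batch and streaming processing of big data"]

def mapper(doc):
    # Map phase: emit (word, 1) pairs from each record.
    return [(word, 1) for word in doc.split()]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, one in chain.from_iterable(mapper(d) for d in docs):
    groups[word].append(one)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["big"], counts["batch"])  # 2 2
```

In a real MapReduce job the map and reduce steps run in parallel on many nodes, and the shuffle moves data across the network.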

Big Data Handling Techniques
The following techniques are deployed for Big Data storage, applications, data management and analytics:

1. Huge-volume data storage, data distribution, high-speed networks and high-performance computing

2. Application scheduling using open-source, reliable, scalable distributed file systems, distributed databases, and parallel and distributed computing systems such as Hadoop or Spark

3. Open-source tools which are scalable and elastic, and which provide a virtualized environment, clusters of data nodes, and task and thread management


4. Data management using NoSQL, document databases, column-oriented databases, graph databases and other forms of databases

5. Data mining and analytics, data retrieval, data reporting, data visualization and ML Big Data tools

Scalability and Parallel Processing

• Processing complex applications with large datasets needs hundreds of computing nodes.

• Processing in a short time with minimum cost is problematic.

• When workload and complexity exceed system capacity, scale the system up and scale it out.

• Big Data processing and analytics require scaling up and scaling out, i.e., both vertical and horizontal computing resources.

Scalability and Parallel Processing: Concepts
1. Analytics Scalability to Big Data

Vertical scalability: scale up the system resources; design algorithms that use resources efficiently.

Horizontal scalability: increase the number of systems; scale out using more resources and distribute tasks in parallel.

Alternatively, deploy MPPs, cloud, grid, clusters and distributed computing software.

2. Massively Parallel Processing (MPP) platforms
- Distributing separate tasks onto separate threads on the same CPU
- Distributing separate tasks onto separate CPUs on the same computer
- Distributing separate tasks onto separate computers

Software must therefore be implemented to support parallel processing.

- Distributed Computing Model
- Uses cloud, grid or clusters to process and analyse large data sets
- Nodes are connected by high-speed networks

- Cloud Computing
- Cloud computing is a type of Internet-based computing that provides shared processing resources and data to computers and other devices on demand (Wikipedia)
- Considered the best approach for big data processing
- Issues: single point of failure; the need for high data security

Examples: Amazon Web Services (AWS) Elastic Compute Cloud (EC2), Microsoft Azure, Apache CloudStack


Features are:

1. On-demand service
2. Resource pooling
3. Scalability
4. Accountability
5. Broad Network Access

Cloud services can be classified into three types:

IaaS: hard disks, network connections, database storage, data-centre and virtual server space

PaaS: managed services for storage, networking, deploying, testing, collaborating, hosting and maintaining applications (IBM BigInsights, MS Azure HDInsight, Oracle Big Data Cloud Service)

SaaS: applications hosted by a service provider and made available to customers over the Internet (GoogleSQL, IBM BigSQL, HPE Vertica, MS PolyBase and Oracle Big Data SQL)

- Grid and cluster computing
Grid Computing: distributed computing in which several computers from several locations are connected with each other to carry out a common task

Large-scale resource sharing; flexible and secure

Data grids store large data over the grid nodes

Features: scalable; forms a distributed network for resource integration

Drawbacks: single point of failure; storage capacity varies with the number of users, instances and the amount of data transferred at a given time

Cluster Computing: a group of computers connected by a network
The group works together to accomplish a task
Performs load balancing


- Volunteer Computing

- A distributed computing environment which uses the computing resources of volunteers

- Volunteers are organizations or members who own personal computers

Example: science-related projects executed by universities or academia


- Issues with Volunteer computing

1. Heterogeneity of volunteer systems
2. Drop-outs from the network over time
3. Sporadic availability
4. Incorrect results are unaccountable

Designing Data Architecture

[Techopedia] Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment. The architecture logically defines how the Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.

Designing Big Data architecture is a complex process.

Managing Data for Analysis

- Enabling, controlling, protecting and enhancing the value of data and information assets
- Reports, analyses and visualizations need well-defined data

Data management functions include:
1. Data asset creation, maintenance and protection
2. Data governance: ensures availability, usability, integrity, security and high quality of data
3. Data architecture creation, modelling and analysis
4. Database maintenance, administration and management

5. Managing data security: data access control, deletion, privacy and security
6. Managing data quality
7. Data collection using the ETL process
8. Managing documents, records and contents
9. Creation of reference and master data, data control and supervision
10. Data and application integration

11. Integrated data management, enterprise-ready data creation, fast access and analysis, automation and simplification of operations on data
12. Data warehouse management
13. Maintenance of business intelligence
14. Data mining and analytics algorithms

Data Sources, Quality, Pre-processing and Storing

1. Structured Data Sources
- Can be a file, a database or a stream
- Examples: SQL Server, MySQL, MS Access, Oracle DBMS, IBM DB2, Informix, Amazon SimpleDB or a file-collection directory at a server

Microsoft applications consider two types of sources for processing:
1. Machine sources
2. File sources

Oracle applications consider two types of sources for processing:
1. Databases
2. Logic machine: the source can be on a network; a data source points to:
- a database in a specific location or in the data library of the OS
- a specific machine in the enterprise that processes the logic
- a data-source master table at the enterprise server or server-map

IBM applications consider these data sources:
1. A specific database instance
2. A file on a remote system

2. Unstructured Data Sources
- Distributed over high-speed networks
- Need high-velocity processing
- Sourced from distributed file systems
- File types: text files, CSV, key-value pairs
- May have internal structure: e-mails, Facebook pages, Twitter messages, etc.
- Do not model or reveal relationships or object-oriented features

Other data sources: sensors, signals and GPS


Data quality: relevancy, recency, range, robustness and reliability

- Data integrity: maintenance of consistency and accuracy in data

Data noise, outliers, missing and duplicate values:
- Noise: data giving additional meaningless information
- Outliers: data that do not belong to the dataset
- Missing values: data not appearing in the data set
- Duplicate values: the same data repeating
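A small pure-Python sketch of spotting these problems in a toy list of sensor readings (the readings and the outlier threshold are invented for illustration):

```python
from statistics import median

# None marks a missing value; 999.0 is an injected outlier.
readings = [21.5, 22.0, None, 21.8, 999.0, 22.0, 21.5, 21.5]

# Missing values: data not appearing in the data set.
missing = [i for i, v in enumerate(readings) if v is None]
present = [v for v in readings if v is not None]

# Outliers: far from the median (fixed threshold here; real pipelines
# typically use IQR or z-score rules).
med = median(present)
outliers = [v for v in present if abs(v - med) > 100]

# Duplicate values: the same data repeating.
seen, duplicates = set(), set()
for v in present:
    (duplicates if v in seen else seen).add(v)

print(missing, outliers, sorted(duplicates))  # [2] [999.0] [21.5, 22.0]
```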


- Data Pre-processing
Pre-processing needs are:
- Dropping out-of-range, inconsistent and outlier values
- Filtering unreliable, irrelevant and redundant information
- Data cleaning, editing, reduction and/or wrangling
- Data validation, transformation or transcoding
- ELT processing


- Data cleaning: the process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data
Tools: OpenRefine and DataCleaner
- Data enrichment: operations or processes which refine, enhance or improve the raw data
- Data editing: reviewing and adjusting acquired datasets
Methods: interactive, selective, automatic, aggregating and distribution


- Data reduction: transformation of information into an ordered, correct and simplified form; uses editing, scaling, coding, sorting, collating, smoothing, interpolating and tabular summaries
- Data wrangling: transforming and mapping data from one format into another


- Data formats used during pre-processing:
i) Comma-Separated Values (CSV)
ii) JavaScript Object Notation (JSON)
iii) Tag-Length-Value (TLV)
iv) Key-value pairs
v) Hash-key-value pairs
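A sketch of moving one record through three of these formats; the usn/name/marks fields are illustrative assumptions:

```python
import csv
import io
import json

csv_text = "usn,name,marks\n4JN19IS001,Asha,82\n"

# i) CSV -> iv) key-value pairs (a dict)
record = dict(next(csv.DictReader(io.StringIO(csv_text))))

# ii) key-value pairs -> JSON
as_json = json.dumps(record)

# iii) Tag-Length-Value: one tag byte, one length byte, then the value bytes
def tlv_encode(tag, value):
    data = value.encode("utf-8")
    return bytes([tag, len(data)]) + data

packet = tlv_encode(0x01, record["name"])
print(as_json)
print(packet)  # b'\x01\x04Asha'
```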

- Data store export to the cloud (export of data to AWS and Rackspace clouds)
Data Storage and Analysis : Traditional Systems
Data stores with structured or semi-structured data:
- Traditional systems use structured or semi-structured data

SQL
- An RDBMS uses SQL
- SQL is based on relational calculus and algebra
- SQL can be embedded within other languages using modules, libraries and pre-compilers

SQL provides:
1. Create schema
2. Create catalog
3. Data Definition Language (DDL)
4. Data Manipulation Language (DML)
5. Data Control Language (DCL)
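A sketch of SQL embedded in a host language via modules, here Python's sqlite3 module (the course table is an illustrative assumption; DCL statements such as GRANT are omitted because SQLite does not support them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema.
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, title TEXT)")

# DML: manipulate the data.
conn.execute("INSERT INTO course VALUES ('18CS72', 'Big Data and Analytics')")
(title,) = conn.execute(
    "SELECT title FROM course WHERE code = '18CS72'").fetchone()
print(title)  # Big Data and Analytics
```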


Large Data Storage using RDBMS
- Supports privacy, security, integration, compaction and fusion
- Uses machine-generated data, human-sourced data, and data from business processes and BI
- Sets of keys and relational keys access the fields of tables and retrieve data using queries

DDBMS
1. A collection of logically related databases
2. Cooperation between the databases in a transparent manner
3. Should be location-independent

In-Memory Column Format Data
- Data in a column are kept together in memory in columnar format
- A single memory access loads many values of the column
- Enables real-time analytics (OLAP)

In-Memory Row Format Databases
- Allow much faster data processing during OLTP

Enterprise Data-Store Server and Data Warehouse
- An enterprise data server uses data from several distributed sources
- All data merge using an integration tool

Some Business processes defined in Oracle application-integration architecture
1. Integrating and enhancing the existing system and processes
2. Business Intelligence
3. Data security and integrity
4. New business services/products (web services)
5. Collaboration / Knowledge management
6. Enterprise architecture / SOA
7. E-commerce
8. External customer services
9. Supply chain automation / visualization
10.Data center optimization
Data Storage and Analysis : Big Data Storage
Big Data NoSQL or Not Only SQL
- NoSQL databases hold semi-structured data
- Big Data stores use NoSQL
- The stores do not integrate with applications using SQL

Features:
1. A class of non-relational data storage systems with flexible data models and multiple schemas:
i) Uninterpreted key/value store or big hash table [Dynamo (Amazon S3)]
ii) Unordered keys using JSON (PNUTS)
iii) Ordered keys and semi-structured data storage systems [BigTable, Cassandra (Facebook/Apache) and HBase]


iv) JSON document stores (MongoDB)
v) Name/value pairs in text (CouchDB)
vi) May not use a fixed table schema
vii) Do not use JOINs
viii) Data written at one node replicates at multiple nodes, so the storage is fault-tolerant
ix) May relax ACID rules during transactions
x) The data store can be partitioned and follows the CAP theorem
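Points (i) and (viii) can be sketched as a toy replicated key-value store; this is a teaching sketch with invented class names, nothing like a production NoSQL system:

```python
class KeyValueNode:
    """One storage node: hashed entries of key -> value."""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

class KeyValueCluster:
    def __init__(self, n_nodes=3):
        self.nodes = [KeyValueNode() for _ in range(n_nodes)]

    def put(self, key, value):
        # Data written at one node replicates to all nodes (fault tolerance).
        for node in self.nodes:
            node.put(key, value)

    def get(self, key):
        # Any surviving replica can answer the read.
        for node in self.nodes:
            v = node.get(key)
            if v is not None:
                return v

cluster = KeyValueCluster()
cluster.put("college", "JNNCE, Shivamogga")
cluster.nodes[0].store.clear()  # simulate one node failing
print(cluster.get("college"))  # JNNCE, Shivamogga
```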

Coexistence of Big Data, NoSQL and traditional data stores

Various data sources, usage examples and tools
Data Storage and Analysis : Big Data Platform
A Big Data platform should provide tools and services for:

1. Storage, processing and analytics

2. Developing, deploying, operating and managing a Big Data environment

3. Reducing the complexity of multiple data sources and integrating applications into one platform

4. Custom development, querying and integration with other systems

5. Traditional as well as Big Data techniques

Services require the following:
1. New, innovative, non-traditional methods of storage, processing and analytics
2. Distributed data stores
3. Creating a scalable, elastic, virtualized platform
4. Huge-volume data stores
5. Massive parallelism
6. High-speed networks
7. High-performance processing, optimization and tuning
8. Data management models based on NoSQL
9. In-memory data in column as well as row formats for OLAP and OLTP

10. Data retrieval, mining, reporting, visualization and analytics
11. Graph databases to enable analytics on social network messages, pages and data
12. ML or other approaches
13. Big Data stores: data storages, data warehouses, Oracle Big Data, MongoDB, Cassandra
14. Data sources: sensors, audit trails of financial transactions, external data (web, social media, weather data, health records)

Hadoop:

- The Hadoop Distributed File System (HDFS) is an open-source storage system
- A scaling file system
- A self-managing file system
- A self-healing file system
- A reliable, parallel computing platform
- Manages distributed databases of Big Data

Hadoop-based Big Data environment
Mesos

- Mesos v0.9 is a resource-management platform
- Enables sharing of a cluster of nodes by multiple frameworks
- Compatible with the open analytics stack:
Data processing: Hive, Hadoop, HBase, Storm
Data management: HDFS

Big Data Stack

- Consists of a set of software components and data store units

- Applications, ML algorithms, analytics and visualization tools

- Cloud services: Amazon EC2, Azure or a private cloud

Tools for Big Data environment

Data Storage and Analysis : Big Data Analytics
Data Analytics Definition

Wikipedia: "Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making."

- Statistical and mathematical data analysis that clusters, segments, ranks and predicts future possibilities

- Uses historical data and forecasts new values and results

- Suggests techniques which will provide the most efficient and beneficial results for the enterprise

- Helps in business intelligence and decision making

Phases in Analytics

1. Descriptive analytics: enables deriving additional value from visualizations and reports

2. Predictive analytics: enables extraction of new facts and knowledge, and then predicts/forecasts

3. Prescriptive analytics: enables derivation of additional value and better decisions on new options to maximize profits

4. Cognitive analytics: enables derivation of additional value and better decisions
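The first two phases can be contrasted on toy monthly sales figures (invented numbers): a descriptive summary of the history, then a simple least-squares trend line that forecasts the next month:

```python
# Monthly sales for six months (illustrative data).
sales = [100, 110, 125, 135, 150, 160]
months = list(range(len(sales)))

# Descriptive analytics: what happened?
mean_sales = sum(sales) / len(sales)

# Predictive analytics: fit a trend line y = a + b*x and forecast month 6.
n = len(sales)
mx, my = sum(months) / n, mean_sales
b = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / \
    sum((x - mx) ** 2 for x in months)
a = my - b * mx
forecast = a + b * n
print(round(mean_sales, 1), round(forecast, 1))  # 130.0 173.0
```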

Analytics Architecture reference model

Berkeley Data Analytics Stack (BDAS)

- An open-source data analytics stack for complex computations on Big Data

- Supports efficient, large-scale in-memory data processing

- Achieves accurate, timely and cost-effective processing

Three layers:

1. Applications: AMP-Genomics and Carat run at BDAS (AMP: Berkeley's Algorithms, Machines and People Laboratory)

2. Data Processing: combines batch, streaming and interactive computations

3. Resource Management: provides sharing of the infrastructure across various frameworks

Four layers architecture for Big Data Stack

Big Data Analytics Applications and Case Studies
Application areas: marketing, sales, health care, medicine, advertising, etc.

Big Data in Marketing and Sales

- Data are important for marketing, sales and advertising

- Customer value depends on three factors: quality, service and price

- Big Data analytics identifies and derives intelligence using predictive models

- Enables marketing companies to decide what products to sell


Marketing: the process of creation, communication and delivery of value to customers

Customer value: what a customer desires from a product

Customer value analytics (CVA): analyzing what a customer really needs


Five application areas of Big Data use cases:

1. CVA using inputs of evaluated purchase patterns, preferences, quality, price and post-sale service requirements

2. Operational analytics for optimizing company operations

3. Detection of fraud and compliance

4. New products and innovations in service

5. Enterprise data warehouse optimization


Big Data provides marketing insight into:

i) Effective content at each stage of a sales cycle

ii) Investment in improving customer relationship management

iii) Additions to strategies for increasing customer lifetime value

iv) Lowering of customer acquisition cost


Big Data analytics in detection of marketing frauds:

i) Fusing existing data at an enterprise data warehouse with data from sources such as social media, websites, blogs and e-mails, thus enriching the existing data

ii) Using multiple sources of data and connecting with many applications

iii) Providing greater insights using querying of the multiple-source data


iv) Analyzing data which enables structured reports and visualization

v) Providing high-volume data mining and new innovative applications, thus leading to new business intelligence and knowledge discovery

vi) Making detection of threats less difficult and faster, and predicting frauds by using various publicly available data and information


Big Data Risks

Five data risks are described by Bernard Marr:

1. Data security
2. Data privacy breach
3. Costs affecting profits
4. Bad analytics
5. Bad data


Big Data Credit Risk Management

The main risks are:

i) Loan defaults

ii) Untimely return of interest and the principal amount


Big Data analytics supports credit risk management by:

1) Identifying high-credit-rating business groups and individuals
2) Identifying the risks involved before lending money
3) Identifying industrial sectors with greater risks
4) Identifying types of employees and businesses with greater risks
5) Anticipating liquidity issues over the years


The data insights from analytics lead to faster reactions; the benefits are:

i) Minimizing non-payments and frauds

ii) Identifying new credit opportunities, new customers and revenue streams

iii) Marketing to low-risk businesses and households


Big Data and Algorithmic Trading

Wikipedia: Algorithmic trading is a method of executing a large order using automated, pre-programmed trading instructions accounting for variables such as time, price and volume.


Big Data and Health Care

Data sources:

1. Clinical records
2. Pharmacy records
3. Electronic medical records
4. Diagnosis logs and notes
5. Additional data (deviations from a person's usual activities, medical leave from a job, social interactions)


Health care analytics using Big Data can facilitate the following:

1. Provisioning of value-based and customer-centric health care
2. Utilizing the IoT for health care
3. Preventing fraud (excessive or duplicate claims), waste (unnecessary tests) and abuse (unnecessary use of medicines or testing facilities)
4. Improving outcomes
5. Monitoring patients in real time


Big Data in Medicine

Building health profiles of individual patients and predictive models:

1. Aggregating information from sources ranging from DNA, proteins and metabolites to cells, tissues, organs and ecosystems enhances the diagnosis of disease.


Big Data creates patterns and models by data mining that help better understanding.

2. Deploying wearable-device data: device data recorded during active as well as inactive periods provide a better understanding of patient health and better risk profiling for certain diseases.


Big Data in Advertising

- Big Data helps digital advertisers discover new relationships and less competitive regions

- Success of an advertisement depends on the collection, analysis and mining of data

- New insights enable hyper-localized advertising (personalized and targeted advertisements online, on social media and on mobile)
