
BIG DATA ANALYTICS

18CS72
Module 1
Introduction to Big Data Analytics: Big Data, Scalability and Parallel Processing, Designing
Data Architecture, Data Sources, Quality, Pre-Processing and Storing, Data Storage and
Analysis, Big Data Analytics Applications and Case Studies.

Module 2
Introduction to Hadoop: Introduction, Hadoop and its Ecosystem, Hadoop Distributed File
System, MapReduce Framework and Programming Model, Hadoop Yarn, Hadoop Ecosystem
Tools.

Hadoop Distributed File System Basics: HDFS Design Features, Components, HDFS User
Commands.

Essential Hadoop Tools: Using Apache Pig, Hive, Sqoop, Flume, Oozie, HBase.
Module 3
NoSQL Big Data Management, MongoDB and Cassandra: Introduction, NoSQL Data
Store, NoSQL Data Architecture Patterns, NoSQL to Manage Big Data, Shared-Nothing
Architecture for Big Data Tasks, MongoDB Databases, Cassandra Databases.

Module 4

MapReduce, Hive and Pig: Introduction, MapReduce Map Tasks, Reduce Tasks and
MapReduce Execution, Composing MapReduce for Calculations and Algorithms, Hive,
HiveQL, Pig.
Module 5
Machine Learning Algorithms for Big Data Analytics: Introduction, Estimating the
relationships, Outliers, Variances, Probability Distributions, and Correlations, Regression
analysis, Finding Similar Items, Similarity of Sets and Collaborative Filtering, Frequent
Itemsets and Association Rule Mining.

Text, Web Content, Link, and Social Network Analytics: Introduction, Text mining, Web
Mining, Web Content and Web Usage Analytics, Page Rank, Structure of Web and analyzing a
Web Graph, Social Network as Graphs and Social Network Analytics.
Course Outcomes (CO)

CO1 - Understand fundamentals of Big Data analytics.[KL2]

CO2 - Investigate Hadoop framework and Hadoop Distributed File system.[KL4]

CO3 - Illustrate the concepts of NoSQL using MongoDB and Cassandra for Big
Data.[KL3]

CO4 - Demonstrate the MapReduce programming model to process the big data along
with Hadoop tools.[KL3]

CO5 - Use Machine Learning algorithms for real world big data.[KL3]

CO6 - Analyze web contents and Social Networks to provide analytics with relevant
visualization tools.[KL4]
MODULE 1 - INTRODUCTION TO BIG DATA
ANALYTICS

Introduction

Need of Big Data


KEY TERMS
• Application
• Application Programming Interface (API)
• Data Model
• Data Repository
• Data Store
• Distributed Data Store
• Database
• Table
• Flat File Database
• Name-Value Pair
• Key-Value Pair
• Hash Key-Value
• Spreadsheet
• Stream Analytics
• Database Maintenance (DBM)
• Database Administration (DBA)
• Database Management System (DBMS)
• Relational Database
• Relational Database Management System (RDBMS)
• Transaction
• Structured Query Language (SQL)
• Database Connection
• Database Connectivity (DBC)
• Database Connectivity Driver
• DB2
• Data Warehouse
• Data Mart
• Process
• Process Matrix
• Business Process
• Business Intelligence
• Batch Processing
• Batch Transaction Processing
• Streaming Transaction Processing
• In-memory
• Interactive Transaction Processing
• Real-Time Processing
• Real-Time Transaction Processing
• Extract, Transform and Load (ETL)
• Machine
• Server
• Service
• Service-Oriented Architecture (SOA)
BIG DATA

DEFINITION OF DATA

Data is information, usually in the form of facts or statistics, that one can analyze or use
for further calculations.

Data is information from a series of observations, measurements or facts.


WEB DATA

WEB DATA is the data present on web servers in the form of text, images, videos,
audios and multimedia files for web users.

A user interacts with the data. A client can access data in a response from a server.

Data can also be published or posted from the server.


CLASSIFICATION OF DATA
Using Structured Data

• Data insert, delete, update and append

• Indexing

• Scalability

• Transactions Processing

• Encryption and Decryption

Using Semi-Structured Data

Semi-structured data contain tags or other markers, which separate semantic elements and
enforce hierarchies of records and fields within the data.
Using Multi-Structured Data

Multi-structured data refers to data consisting of multiple formats, i.e., a mix of structured, semi-structured and unstructured data.

Using Unstructured Data

Unstructured data does not possess data features such as a table or a database.
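
To make the distinction concrete, here is a minimal Python sketch (all values are hypothetical) showing the three forms of data side by side:

```python
import json

# Structured data: fixed schema, fits a table row (column -> value).
structured_row = {"emp_id": 101, "name": "Asha", "salary": 52000.0}

# Semi-structured data: tags/markers (here JSON) separate semantic
# elements and enforce a hierarchy of records and fields.
semi_structured = json.loads("""
{
  "order": {
    "id": "A-17",
    "items": [{"sku": "B12", "qty": 2}, {"sku": "C07", "qty": 1}]
  }
}
""")

# Unstructured data: no table- or database-like features.
unstructured = "Customer called at 4 pm; reported the device overheats."

print(structured_row["salary"])                     # query by column name
print(semi_structured["order"]["items"][0]["sku"])  # navigate the hierarchy
print("overheats" in unstructured)                  # only text search possible
```

Only the structured and semi-structured forms can be queried by field; the unstructured text supports only search.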
BIG DATA DEFINITIONS

Big Data refers to data sets whose size is beyond the ability of typical database
software tools to capture, store, manage and analyze.

Data of a very large size, typically to the extent that its manipulation and management
present significant logistical challenges.
BIG DATA CHARACTERISTICS

• Volume – The size of data.

• Velocity – The speed of generation of data.

• Variety – The different forms of data generated from multiple sources in a system.

• Veracity – The quality of data captured.

• Value – The business value and insights derived from the data.
BIG DATA TYPES

• Social Networks and Web Data

• Transactions data and Business Processes data

• Machine-Generated data

• Human-Generated data
BIG DATA CLASSIFICATION
BIG DATA HANDLING TECHNIQUES

• Storage of huge data volumes, data distribution and high-speed networks.

• Open source tools which are scalable and elastic.

• Application scheduling using open source.

• Data Management using NoSQL, document database.

• Data mining and analytics, data retrieval, data reporting, data visualization.
SCALABILITY AND PARALLEL PROCESSING

• Big Data needs processing of large data volumes, which requires processing this much
distributed data within a short time and at minimum cost.

• Convergence of data environments and analytics.

• Scalability is the capability of a system to handle the workload as per the
magnitude of the work.


ANALYTICS SCALABILITY TO BIG DATA

Vertical scalability means scaling up the given system’s resources and increasing the
system’s analytics, reporting and visualization capabilities.

Horizontal scalability means increasing the number of systems working in coherence
and scaling out the workload.


MASSIVELY PARALLEL PROCESSING PLATFORMS

• Distributing separate tasks onto separate threads on the CPU

• Distributing separate tasks onto separate CPUs on the same computer

• Distributing separate tasks onto separate computers.
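
A minimal Python sketch of the first two options, assuming a summation task split into four partitions; distributing onto separate computers follows the same pattern with a cluster framework such as MapReduce or Spark:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def partial_sum(chunk):
    """Independent task: sum one partition of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # four partitions

    # Separate tasks on separate threads of one CPU.
    with ThreadPoolExecutor(max_workers=4) as pool:
        thread_total = sum(pool.map(partial_sum, chunks))

    # Separate tasks on separate CPUs (processes) of the same computer.
    with ProcessPoolExecutor(max_workers=4) as pool:
        process_total = sum(pool.map(partial_sum, chunks))

    assert thread_total == process_total == sum(data)
```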


DISTRIBUTED COMPUTING MODEL
CLOUD COMPUTING
Cloud computing is a type of internet-based computing that provides shared processing
resources and data to the computers and other devices on demand.
Features
• On-demand service
• Resource pooling.
• Scalability.
• Accountability.
• Broad network access
Fundamental Types
• Infrastructure as a Service(IaaS)
• Platform as a Service (PaaS)
• Software as a service (SaaS)
GRID AND CLUSTER COMPUTING
Grid Computing refers to distributed computing, in which a group of computers from several
locations are connected with each other to achieve a common task.
Features of Grid Computing
Drawbacks of Grid Computing
Cluster computing is a group of computers connected by a network. The group works together
to accomplish the same task.
VOLUNTEER COMPUTING
Volunteer computing is a distributed computing paradigm which uses computing
resources of the volunteers.

Issues

• Heterogeneity of the volunteered computers.

• Drop-outs from the network over time.

• Their sporadic availability.

• Incorrect results at volunteers are unaccountable, as they essentially come from anonymous
volunteers.
DATA ARCHITECTURE DESIGN

• Big data architecture is the logical and physical layout/structure of how big data will
be stored, accessed and managed within a big data or IT environment.

• Architecture logically defines how the big data solution will work, the core components
used, the flow of information, security and more.
LAYERS OF DATA ARCHITECTURE

Layer 1: Identification of data sources.

Layer 2: Acquisition, ingestion, extraction, pre-processing and transformation of data.

Layer 3: Data storage at files, servers, cluster or cloud.

Layer 4: Data processing.

Layer 5: Data consumption in a number of programs and tools.


LAYER 1

L1 considers the following aspects in a design:

• Amount of data needed at ingestion Layer 2 (L2). Ingestion means a process of absorbing
information; it is the process of obtaining and importing data for immediate use or transfer.

• Push from L1 or pull by L2 as per the mechanism for the usage.

• Source data types: database, files, web or service.

• Source formats, i.e., structured, semi-structured or unstructured.


LAYER 2

L2 considers the following aspects in a design:

• Ingestion and ETL processes either in real time, which means store and use the data
as generated, or in batches. Batch processing is using discrete datasets at scheduled or
periodic intervals of time.


LAYER 3

L3 considers the following aspects in a design:

• Data storage type, format, compression, incoming data frequency, querying patterns
and consumption requirements for L4 or L5.

• Data storage using Hadoop Distributed File System or NoSQL data stores – HBase,
Cassandra, MongoDB.
LAYER 4

L4 considers the following aspects in a design:

• Data processing software such as MapReduce, Hive, Pig, Spark, Spark Mahout, Spark
Streaming.

• Processing in scheduled batches, in real time, or hybrid.

• Processing as per synchronous or asynchronous processing requirements at L5.


LAYER 5

L5 considers the following aspects in a design:

• Data integration.

• Dataset usage for reporting and visualization.

• Analytics, business processing, knowledge discovery.

• Export of datasets to cloud, web or other systems.
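
A minimal end-to-end Python sketch of the five layers (function and file names are hypothetical, and a flat file stands in for HDFS/NoSQL storage):

```python
import json

def identify_source():            # Layer 1: identification of a data source
    return [{"sensor": "t1", "temp": 71.0}, {"sensor": "t1", "temp": None}]

def ingest_and_preprocess(rows):  # Layer 2: ingestion, cleaning, transformation
    return [r for r in rows if r["temp"] is not None]

def store(rows, path="clean.json"):  # Layer 3: storage in a file (or HDFS/NoSQL)
    with open(path, "w") as f:
        json.dump(rows, f)
    return path

def process(path):                # Layer 4: batch processing of the stored data
    with open(path) as f:
        rows = json.load(f)
    return sum(r["temp"] for r in rows) / len(rows)

def consume(result):              # Layer 5: consumption, e.g., reporting
    print(f"average temperature: {result:.1f}")

consume(process(store(ingest_and_preprocess(identify_source()))))
```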


MANAGING DATA FOR ANALYSIS
Data management functions include:
• Data assets creation, maintenance and protection.
• Data governance, which includes establishing the processes for ensuring the availability,
usability, integrity, security and high quality of data.
• Data architecture creation, modelling and analysis.
• DB maintenance, administration and management system.
• Managing data security, data access control, deletion, privacy and security.
• Managing the data quality.
• Data collection using the ETL process
• Managing documents, record and contents.

• Creation of reference and master data and data control and supervision.

• Data and application integration.

• Integrated data management, enterprise-ready data creation, fast access and analysis,
automation and simplification of operations on the data.

• Data warehouse management.

• Maintenance of business intelligence.

• Data mining and analytics algorithms.


DATA SOURCES, QUALITY, PRE-PROCESSING AND STORING

Data Source
External – sensors, trackers, web logs, computer system logs and feeds.
Internal – Data repositories, database, relational database, flat file, spreadsheet, web
server.
Structured Data Sources
Types of sources for processing
• Machine Sources.
• File Sources.
Types of Data sources

• Database.

• Logic Machine.

Data source can point to:

• A database in a specific location or in a data library of OS.

• A specific machine in the enterprise that processes logic

• A data source master which stores data source definitions. The table may be at a
centralized source or at server-map for the source.
Unstructured Data Sources
Data sources – Sensors, Signals and GPS
Sensors are electronic devices that sense the physical environment. Sensors play an
active role in the automotive industry.
Data Quality
High-quality data enables all the required operations, analysis, decisions, planning
and knowledge discovery to be done correctly. The characteristics of high-quality data are:
• Relevancy
• Recency
• Range
• Robustness
• Reliability
Data Integrity refers to the maintenance of consistency and accuracy in data over its
usable life.

Noise in data refers to data giving additional meaningless information besides the true
information.

Outliers in data refer to data which appear not to belong to the dataset.

Missing Values implies data not appearing in the dataset.

Duplicate Values implies the same data appearing two or more times in a dataset.
DATA PRE-PROCESSING
Pre-processing needs are:

• Dropping out-of-range, inconsistent and outlier values.

• Filtering unreliable, irrelevant and redundant information.

• Data cleaning, editing, reduction and wrangling.

• Data validation, transformation or transcoding.

• ETL processing.

Data Cleaning refers to the process of removing or correcting incomplete, incorrect, inaccurate
or irrelevant parts of the data after detecting them.
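
A short pandas sketch of these cleaning steps (column names and the valid range are hypothetical):

```python
import pandas as pd

# Hypothetical raw readings containing the defects described above.
raw = pd.DataFrame({
    "sensor": ["t1", "t1", "t2", "t2", "t3"],
    "temp":   [71.0, 71.0, None, 19.5, 999.0],  # duplicate, missing, outlier
})

clean = (
    raw.drop_duplicates()           # remove duplicate values
       .dropna(subset=["temp"])     # remove missing values
       .query("0 <= temp <= 150")   # drop out-of-range/outlier values
)
print(clean)
```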

Data Cleaning Tools


Data Enrichment refers to operations or processes which refine, enhance or improve the
raw data.
Data editing refers to the process of reviewing and adjusting the acquired datasets.
Editing methods are:
• Interactive
• Selective
• Automatic
• Aggregating
• Distribution
Data Reduction enables the transformation of acquired information into an ordered,
correct and simplified form.
Data Wrangling refers to the process of transforming and mapping the data.
Data format used during pre-processing
• Data Storage.
• Analytics Application.
• Service.
• Cloud
CSV Format
Data Format Conversions
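
A minimal sketch of one such conversion, CSV to JSON, using Python's standard csv and json modules (the CSV content is hypothetical; in practice it would come from a file):

```python
import csv, io, json

# Hypothetical CSV content with a header row.
csv_text = "sensor,temp\nt1,71.0\nt2,19.5\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))  # parse CSV records
print(json.dumps(rows, indent=2))                   # emit the same data as JSON
```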
DATA STORE EXPORT TO CLOUD
CLOUD SERVICES
DATA STORAGE AND ANALYSIS

Data Storage & Management: Traditional Systems


Data Store with Structured or Semi-Structured Data
SQL
• Create Schema
• Creating catalog
• Data Definition Language (DDL)
• Data Manipulation Language (DML)
• Data Control Language (DCL)
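
A minimal sketch of DDL and DML statements using Python's built-in sqlite3 module (table and column names are hypothetical; DCL statements such as GRANT/REVOKE are not supported by SQLite, so they are omitted):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database for the example

# DDL: define the schema.
con.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")

# DML: manipulate the data.
con.execute("INSERT INTO employee (id, name) VALUES (?, ?)", (101, "Asha"))
for row in con.execute("SELECT id, name FROM employee"):
    print(row)  # (101, 'Asha')

con.close()
```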
Large Data Storage using RDBMS
Distributed Database Management System
In-Memory Column Formats Data
In-Memory Row Formats Data
Enterprise Data-Store Server and Data Warehouse
• Integrating & enhancing the existing systems and processes
• Business Intelligence
• Data security & integrity
• New business services/products (web services)
• Collaboration/knowledge management
• Enterprise architecture/SOA
• E-commerce
• External customer services
• Supply chain automation/ visualization
• Data centre optimization.
BIG DATA STORAGE

Big Data NoSQL or Not Only SQL


Features of NoSQL are:
• It is a class of non-relational data storage systems with flexible data models
and multiple schemas.
• Consistency
• Availability
• Partition tolerance
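
A minimal sketch of the flexible data model, assuming a local MongoDB server and the pymongo driver (database and collection names are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumes a local server
orders = client["shop_db"]["orders"]                # hypothetical names

# Flexible data model: documents in one collection need not share a schema.
orders.insert_one({"id": "A-17", "items": [{"sku": "B12", "qty": 2}]})
orders.insert_one({"id": "A-18", "status": "pending"})  # different fields

print(orders.find_one({"id": "A-17"}))
```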
Coexistence of Big Data, NoSQL and Traditional Data Stores
BIG DATA PLATFORM

Big data platforms provision tools and services for:

• Storage, processing and analytics.

• Developing, deploying, operating and managing big data environment

• Reducing the complexity of multiple data sources and integration of applications into
one cohesive solution.

• Custom development, querying and integration with other systems.

• The traditional as well as big data techniques


Data management, storage and analytics of big data captured at the companies and
services require the following:

• New innovative non-traditional methods of storage, processing and analytics.

• Distributed Data Stores.

• Creating scalable as well as elastic virtualized platform (cloud computing).

• Huge volume of Data Stores.

• Massive parallelism.

• High-speed networks.

• High performance processing, optimization and tuning.

• Data management model based on not only SQL or NoSQL


• HADOOP
• MESOS
• BIG DATA STACK
BIG DATA ANALYTICS

Big Data Analytics

Data Analytics Definition

Analysis of data is a process of inspecting, cleaning, transforming and modeling data
with the goal of discovering useful information, suggesting conclusions and supporting
decision making.
Phases in Analytics
• Descriptive analytics
• Predictive analytics
• Prescriptive analytics
• Cognitive analytics
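
A toy Python sketch contrasting the first two phases on hypothetical sales data (statistics.linear_regression requires Python 3.10 or later):

```python
from statistics import mean, linear_regression

monthly_sales = [100, 110, 125, 130, 150]  # hypothetical data

# Descriptive analytics: summarize what happened.
print("average sales:", mean(monthly_sales))

# Predictive analytics: fit a trend and extrapolate the next month.
months = list(range(1, len(monthly_sales) + 1))
slope, intercept = linear_regression(months, monthly_sales)
print("forecast for month 6:", slope * 6 + intercept)
```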
Berkeley Data Analytics Stack (BDAS)

• Cost reduction

• Time reduction

• New product planning and development

• Smart decision making using predictive analytics

• Knowledge discovery
BIG DATA ANALYTICS APPLICATIONS AND CASE STUDIES
Big Data in Marketing and Sales
Big Data Analytics in Detection of Marketing Frauds
Big data usage has the following features for enabling detection and prevention of frauds:
• Fusing of existing data at an enterprise data warehouse with data from sources such as social
media, websites, blogs and e-mails, thus enriching existing data.
• Using multiple sources of data and connecting with many applications.
• Providing greater insights using querying of the multiple-source data.
• Analyzing data which enables structured reports and visualization.
• Providing high-volume data mining and new innovative applications, thus leading to new
business intelligence and knowledge discovery.
• Making detection of threats and prediction of likely frauds less difficult and faster by using
various data and information publicly available.
Big Data Risks

Big Data Credit Risk Management

• Identifying high credit rating business groups and individuals

• Identifying risk involved before lending money

• Identifying industrial sectors with greater risks

• Identifying types of employees and business with greater risks

• Anticipating liquidity issues over the years

Big Data and Algorithmic Trading


Big Data and healthcare

Healthcare analytics using big data can facilitate the following

• Provisioning of value-based and customer-centric healthcare

• Utilizing the ‘Internet of things’ for health care

• Preventing fraud, waste and abuse in the healthcare industry, and reducing healthcare costs

• Improving outcomes

• Monitoring patients in real time

Big Data in Medicine

Big Data in Advertising
