
BIG DATA AND ANALYTICS

Subject Code : 18CS72 CIE Marks : 40

Lecture Hours : 50 SEE Marks : 60

Credits : 04

Arun Kumar P
Dept. of ISE, JNNCE, Shivamogga
Introduction to Big Data Analytics (10 Hours)

Introduction to Big Data
Definitions of Data

• Collins English Dictionary: Data is information, usually in the form of facts or statistics, that one can analyze or use for further calculations

• Computing: Data is information that can be stored and used by a computer program

• Electrical Engineering, Circuits, Computing and Control: Data is information presented in numbers, letters or other form

• Science: Data is information from a series of observations, measurements or facts

• Social Sciences: Data is information from a series of behavioural observations, measurements or facts
Definitions of Web Data

• Web data in the form of documents and other resources

• Data present in web servers: text, images, video, audio

• Web sites, web services, web portals, online business applications, emails, chats, tweets and social networks provide and consume web data

Examples of web data: Wikipedia, Google Maps, McGraw-Hill, Oxford Bookstore, YouTube, etc.

Classification of Data

1. Structured: has a schema and data model; stored in tables
2. Semi-structured
3. Unstructured: no data model

About 15 to 20% of data is structured or semi-structured; the rest is unstructured.

Using Structured Data

Structured data enables the following:

1. Insert, delete, update and append operations
2. Indexing for faster retrieval
3. Scalability
4. Transaction processing (follows ACID rules)
5. Encryption and decryption for security
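These operations can be sketched with Python's built-in sqlite3 module; the student table and its rows are illustrative assumptions, not data from the course.

```python
import sqlite3

# In-memory database; a production RDBMS would be a server such as MySQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: a table with a fixed schema.
cur.execute("CREATE TABLE student (usn TEXT PRIMARY KEY, name TEXT, marks INTEGER)")

# 1. Insert and update (delete and append work the same way)
cur.execute("INSERT INTO student VALUES ('4JN19IS001', 'Asha', 82)")
cur.execute("UPDATE student SET marks = 85 WHERE usn = '4JN19IS001'")

# 2. Indexing for faster retrieval
cur.execute("CREATE INDEX idx_marks ON student(marks)")

# 4. Transaction processing (ACID): commits on success, rolls back on error
with conn:
    conn.execute("INSERT INTO student VALUES ('4JN19IS002', 'Ravi', 74)")

rows = cur.execute("SELECT name, marks FROM student ORDER BY usn").fetchall()
print(rows)  # [('Asha', 85), ('Ravi', 74)]
```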

Using Semi-structured Data

1. In the form of XML and JSON:

<college>JNNCE, Shivamogga</college>

[{ "college":"JNNCE, Shivamogga" } ]

2. Contains tags or other markers
3. Does not conform to data models such as relational table models
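The two snippets above can be parsed with Python's standard library; the tag and the key act as the markers mentioned in point 2, so no external table schema is needed:

```python
import json
import xml.etree.ElementTree as ET

# The two snippets from the slide: the same fact in two self-describing formats.
xml_text = "<college>JNNCE, Shivamogga</college>"
json_text = '[{ "college": "JNNCE, Shivamogga" }]'

# The <college> tag and the "college" key describe the value themselves.
college_from_xml = ET.fromstring(xml_text).text
college_from_json = json.loads(json_text)[0]["college"]

print(college_from_xml == college_from_json)  # True
```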

Using Unstructured Data

1. Does not possess data features such as tables or databases
2. Found in file types such as .txt and .csv
3. Key-value pairs, emails
4. Does not reveal relationships, hierarchy or object-oriented features

Examples:
Mobile data: text messages, chat messages, tweets, blogs and comments
Website content data: YouTube, browsing data, e-payments, web stores, user-generated maps

Social media data: exchanging data in various forms

Texts and documents: personal documents and emails

Text internal to an organization: text within documents, logs, survey results

Satellite images, atmospheric data, surveillance and traffic videos, Instagram images, Flickr, etc.

Using Multi-structured Data

Consists of multiple formats: structured, semi-structured and/or unstructured

• Can have many formats
• Found in non-transactional systems

Examples: streaming data from customer interactions, data from multiple sensors, data at the web

Big Data Definitions
Gartner (2012): Big data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing for enhanced decision making, insight discovery and process optimization.

Doug Laney described the 3 Vs (volume, variety and velocity) as the key data management challenges for enterprises. Analytics literature also describes the 4 Vs: volume, velocity, variety and veracity.

Wikipedia: A collection of data sets so large and complex that traditional data processing applications are inadequate.

Oxford English Dictionary: Data of a very large size, typically to the extent
that its manipulation and management present significant logistical challenges.


• The McKinsey Global Institute [2011]: Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse.

By 2025, it is estimated that the global datasphere will grow to 175 zettabytes. Around 2,000,000,000,000,000,000 (2 exa) bytes of data are generated each day across all industries.
Source: https://www.sigmacomputing.com

Big Data Characteristics
• Volume: the amount or quantity of data generated by an application or applications. The size determines the processing considerations needed for handling the data.

• Velocity: the speed at which data is generated.


• Variety: data generated from multiple sources in a variety of forms; this leads to complexity.

• Veracity: how accurate or truthful a data set may be.

Data with these 4 Vs needs tools for mining, discovering patterns, business intelligence, AI, ML, text analytics, and descriptive and predictive analytics, along with data visualization tools.

Big Data Types
1. Social networks and web data: Facebook, Twitter, e-mails, blogs and YouTube
2. Transactions data and Business Process (BP) data: credit card transactions, flight bookings, public agency data (medical records, insurance)
3. Customer master data: data for facial recognition, and for name, date of birth, anniversary, gender, location and income category
4. Machine-generated data: IoT, machine-to-machine (M2M) data, sensor data, trackers, web and computer logs
5. Human-generated data: biometrics, human-machine interaction, notebooks/diaries, photos, audio and video

Data Classification
Based on characteristics and analytics

Traditional data
1. Data sources: records, RDBMSs, distributed databases, row-oriented in-memory data tables, column-oriented in-memory data tables, data warehouses, servers, machine-generated data, human-sourced data, business-process data and BI data
2. Data formats: structured and semi-structured
3. Processing data rates: batch, near-time, real-time, streaming
4. Analysis types: batch, scheduled and near real-time dataset analytics

Big Data Classification
Based on characteristics and analytics

Big data
1. Data sources: distributed file systems, operational data sources, data marts, data warehouses, NoSQL databases, sensor data, audit trails of financial transactions, and external data such as the web, social media, weather data and health records
2. Data formats: unstructured, semi-structured and multi-structured
3. Data store structures: web, enterprise or cloud servers, data warehouses, row-oriented data for OLTP, column-oriented data for OLAP, records, graph databases, hashed entries for key-value pairs
4. Processing data rates: high volume, velocity, variety and veracity; batch, near real-time and streaming data processing
5. Processing methods: batch processing (using MapReduce, Hive or Pig) and real-time processing (using Spark Streaming, Spark SQL or Apache Drill)
6. Analysis methods: statistical, predictive, regression, Mahout and other ML algorithms, clustering algorithms, classifiers, text analysis, social network analysis, location-based analysis, diagnostic analysis, cognitive analysis
7. Data usage: humans, business processes, knowledge discovery, enterprise applications, data stores
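The batch processing method in point 5 follows the MapReduce model; a toy word count in plain Python (illustrative documents, not Hadoop itself) shows the map, shuffle and reduce steps:

```python
from collections import defaultdict
from itertools import chain

# Toy corpus standing in for a distributed input split.
docs = ["big data needs batch processing",
        "batch and streaming processing of big data"]

def mapper(doc):
    # Map phase: emit (word, 1) pairs from each record.
    return [(word, 1) for word in doc.split()]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, one in chain.from_iterable(mapper(d) for d in docs):
    groups[word].append(one)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["big"], counts["batch"])  # 2 2
```

In a real MapReduce job the map and reduce steps run in parallel on many nodes, and the shuffle moves data across the network.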

Big Data Handling Techniques
The following techniques are deployed for Big Data storage, applications, data management and analytics:

1. Huge-volume data storage, data distribution, high-speed networks and high-performance computing

2. Application scheduling using open-source, reliable, scalable distributed file systems, distributed databases, and parallel and distributed computing systems such as Hadoop or Spark

3. Open-source tools which are scalable and elastic, and which provide a virtualized environment, clusters of data nodes, and task and thread management


4. Data management using NoSQL, document databases, column-oriented databases, graph databases and other forms of databases

5. Data mining and analytics, data retrieval, data reporting, data visualization and ML Big Data tools

Scalability and Parallel Processing

• Processing complex applications with large datasets needs hundreds of computing nodes.

• Processing in a short time with minimum cost is problematic.

• When workload and complexity exceed system capacity, scale the system up and scale it out.

• Big Data processing and analytics require scaling up and scaling out, i.e., both vertical and horizontal computing resources.

Scalability and Parallel Processing: Concepts
1. Analytics Scalability to Big Data

Vertical scalability: scale up the system resources; design algorithms that use resources efficiently.

Horizontal scalability: increase the number of systems; scale out using more resources and distribute tasks in parallel.

Alternatively, deploy MPPs, cloud, grid, clusters and distributed computing software.

2. Massively Parallel Processing (MPP) platforms
- Distributing separate tasks onto separate threads on the same CPU
- Distributing separate tasks onto separate CPUs on the same computer
- Distributing separate tasks onto separate computers

Software must therefore be implemented to support parallel processing.

- Distributed Computing Model
- Uses cloud, grid or clusters to process and analyse large data sets
- Nodes are connected by high-speed networks

- Cloud Computing
- Cloud computing is a type of Internet-based computing that provides shared processing resources and data to computers and other devices on demand (Wikipedia)
- Considered the best approach for big data processing
- Issues: single point of failure; the need for high data security

Examples: Amazon Web Services (AWS) Elastic Compute Cloud (EC2), Microsoft Azure, Apache CloudStack


Features are:

1. On-demand service
2. Resource pooling
3. Scalability
4. Accountability
5. Broad Network Access

Cloud services can be classified into three types:

IaaS: hard disks, network connections, database storage, data-centre and virtual server space

PaaS: managed services for storage, networking, deploying, testing, collaborating, hosting and maintaining applications (IBM BigInsights, MS Azure HDInsight, Oracle Big Data Cloud Service)

SaaS: applications hosted by a service provider and made available to customers over the Internet (GoogleSQL, IBM BigSQL, HPE Vertica, MS PolyBase and Oracle Big Data SQL)

- Grid and cluster computing
Grid Computing: distributed computing in which several computers from several locations are connected with each other to carry out a common task

Large-scale resource sharing; flexible and secure

Data grids store large data over the grid nodes

Features: scalable; forms a distributed network for resource integration

Drawbacks: single point of failure; storage capacity varies with the number of users, instances and the amount of data transferred at a given time

Cluster Computing: a group of computers connected by a network
The group works together to accomplish a task
Performs load balancing


- Volunteer Computing

- A distributed computing environment which uses the computing resources of volunteers

- Volunteers are organizations or members who own personal computers

Example: science-related projects executed by universities or academia


- Issues with Volunteer computing

1. Heterogeneity of volunteer systems
2. Drop-outs from the network over time
3. Sporadic availability
4. Incorrect results are unaccountable

Designing Data Architecture

[Techopedia] Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment. The architecture logically defines how the Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.

Designing Big Data architecture is a complex process.

Managing Data for Analysis

- Enabling, controlling, protecting and enhancing the value of data and information assets
- Reports, analyses and visualizations need well-defined data

Data management functions include:
1. Data asset creation, maintenance and protection
2. Data governance: ensures availability, usability, integrity, security and high quality of data
3. Data architecture creation, modelling and analysis
4. Database maintenance, administration and management

5. Managing data security: data access control, deletion, privacy and security
6. Managing data quality
7. Data collection using the ETL process
8. Managing documents, records and contents
9. Creation of reference and master data, data control and supervision
10. Data and application integration

11. Integrated data management, enterprise-ready data creation, fast access and analysis, automation and simplification of operations on data
12. Data warehouse management
13. Maintenance of business intelligence
14. Data mining and analytics algorithms

Data Sources, Quality, Pre-processing and Storing

1. Structured Data Sources
- Can be a file, a database or a stream
- Examples: SQL Server, MySQL, MS Access, Oracle DBMS, IBM DB2, Informix, Amazon SimpleDB or a file-collection directory at a server

Microsoft applications consider two types of sources for processing:
1. Machine sources
2. File sources

Oracle applications consider two types of sources for processing:
1. Databases
2. Logic machine: the source can be on a network; a data source points to:
- a database in a specific location or in the data library of the OS
- a specific machine in the enterprise that processes the logic
- a data-source master table at the enterprise server or server-map

IBM applications consider these data sources:
1. A specific database instance
2. A file on a remote system

2. Unstructured Data Sources
- Distributed over high-speed networks
- Need high-velocity processing
- Sourced from distributed file systems
- File types: text files, CSV, key-value pairs
- May have internal structure: e-mails, Facebook pages, Twitter messages, etc.
- Do not model or reveal relationships or object-oriented features

Other data sources: sensors, signals and GPS


Data quality: relevancy, recency, range, robustness and reliability

- Data integrity: maintenance of consistency and accuracy in data

Data noise, outliers, missing and duplicate values:
- Noise: data giving additional meaningless information
- Outliers: data that do not belong to the dataset
- Missing values: data not appearing in the data set
- Duplicate values: the same data repeating
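A small pure-Python sketch of spotting these problems in a toy list of sensor readings (the readings and the outlier threshold are invented for illustration):

```python
from statistics import median

# None marks a missing value; 999.0 is an injected outlier.
readings = [21.5, 22.0, None, 21.8, 999.0, 22.0, 21.5, 21.5]

# Missing values: data not appearing in the data set.
missing = [i for i, v in enumerate(readings) if v is None]
present = [v for v in readings if v is not None]

# Outliers: far from the median (fixed threshold here; real pipelines
# typically use IQR or z-score rules).
med = median(present)
outliers = [v for v in present if abs(v - med) > 100]

# Duplicate values: the same data repeating.
seen, duplicates = set(), set()
for v in present:
    (duplicates if v in seen else seen).add(v)

print(missing, outliers, sorted(duplicates))  # [2] [999.0] [21.5, 22.0]
```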


- Data Pre-processing
Pre-processing needs are:
- Dropping out-of-range, inconsistent and outlier values
- Filtering unreliable, irrelevant and redundant information
- Data cleaning, editing, reduction and/or wrangling
- Data validation, transformation or transcoding
- ELT processing


- Data cleaning: the process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data
Tools: OpenRefine and DataCleaner
- Data enrichment: operations or processes which refine, enhance or improve the raw data
- Data editing: reviewing and adjusting acquired datasets
Methods: interactive, selective, automatic, aggregating and distribution


- Data reduction: transformation of information into an ordered, correct and simplified form; uses editing, scaling, coding, sorting, collating, smoothing, interpolating and tabular summaries
- Data wrangling: transforming and mapping data from one format into another


- Data formats used during pre-processing:
i) Comma-Separated Values (CSV)
ii) JavaScript Object Notation (JSON)
iii) Tag-Length-Value (TLV)
iv) Key-value pairs
v) Hash-key-value pairs
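A sketch of moving one record through three of these formats; the usn/name/marks fields are illustrative assumptions:

```python
import csv
import io
import json

csv_text = "usn,name,marks\n4JN19IS001,Asha,82\n"

# i) CSV -> iv) key-value pairs (a dict)
record = dict(next(csv.DictReader(io.StringIO(csv_text))))

# ii) key-value pairs -> JSON
as_json = json.dumps(record)

# iii) Tag-Length-Value: one tag byte, one length byte, then the value bytes
def tlv_encode(tag, value):
    data = value.encode("utf-8")
    return bytes([tag, len(data)]) + data

packet = tlv_encode(0x01, record["name"])
print(as_json)
print(packet)  # b'\x01\x04Asha'
```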

- Data store export to the cloud (export of data to AWS and Rackspace clouds)
Data Storage and Analysis : Traditional Systems
Data stores with structured or semi-structured data:
- Traditional systems use structured or semi-structured data

SQL
- An RDBMS uses SQL
- SQL is based on relational calculus and algebra
- SQL can be embedded within other languages using modules, libraries and pre-compilers

SQL provides:
1. Create schema
2. Create catalog
3. Data Definition Language (DDL)
4. Data Manipulation Language (DML)
5. Data Control Language (DCL)
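A sketch of SQL embedded in a host language via modules, here Python's sqlite3 module (the course table is an illustrative assumption; DCL statements such as GRANT are omitted because SQLite does not support them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema.
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, title TEXT)")

# DML: manipulate the data.
conn.execute("INSERT INTO course VALUES ('18CS72', 'Big Data and Analytics')")
(title,) = conn.execute(
    "SELECT title FROM course WHERE code = '18CS72'").fetchone()
print(title)  # Big Data and Analytics
```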


Large Data Storage using RDBMS
- Supports privacy, security, integration, compaction and fusion
- Uses machine-generated data, human-sourced data, and data from business processes and BI
- Sets of keys and relational keys access the fields of tables and retrieve data using queries

DDBMS
1. A collection of logically related databases
2. Cooperation between the databases in a transparent manner
3. Should be location-independent

In-Memory Column Format Data
- Data in a column are kept together in memory in columnar format
- A single memory access loads many values of the column
- Enables real-time analytics (OLAP)

In-Memory Row Format Databases
- Allow much faster data processing during OLTP

Enterprise Data-Store Server and Data Warehouse
- An enterprise data server uses data from several distributed sources
- All data merge using an integration tool

Some Business processes defined in Oracle application-integration architecture
1. Integrating and enhancing the existing system and processes
2. Business Intelligence
3. Data security and integrity
4. New business services/products (web services)
5. Collaboration / Knowledge management
6. Enterprise architecture / SOA
7. E-commerce
8. External customer services
9. Supply chain automation / visualization
10.Data center optimization
Data Storage and Analysis : Big Data Storage
Big Data NoSQL or Not Only SQL
- NoSQL databases hold semi-structured data
- Big Data stores use NoSQL
- The stores do not integrate with applications using SQL

Features:
1. A class of non-relational data storage systems with flexible data models and multiple schemas:
i) Uninterpreted key/value store or big hash table [Dynamo (Amazon S3)]
ii) Unordered keys using JSON (PNUTS)
iii) Ordered keys and semi-structured data storage systems [BigTable, Cassandra (Facebook/Apache) and HBase]


iv) JSON document stores (MongoDB)
v) Name/value pairs in text (CouchDB)
vi) May not use a fixed table schema
vii) Do not use JOINs
viii) Data written at one node replicates at multiple nodes, so the storage is fault-tolerant
ix) May relax ACID rules during transactions
x) The data store can be partitioned and follows the CAP theorem
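Points (i) and (viii) can be sketched as a toy replicated key-value store; this is a teaching sketch with invented class names, nothing like a production NoSQL system:

```python
class KeyValueNode:
    """One storage node: hashed entries of key -> value."""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

class KeyValueCluster:
    def __init__(self, n_nodes=3):
        self.nodes = [KeyValueNode() for _ in range(n_nodes)]

    def put(self, key, value):
        # Data written at one node replicates to all nodes (fault tolerance).
        for node in self.nodes:
            node.put(key, value)

    def get(self, key):
        # Any surviving replica can answer the read.
        for node in self.nodes:
            v = node.get(key)
            if v is not None:
                return v

cluster = KeyValueCluster()
cluster.put("college", "JNNCE, Shivamogga")
cluster.nodes[0].store.clear()  # simulate one node failing
print(cluster.get("college"))  # JNNCE, Shivamogga
```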

Coexistence of Big Data, NoSQL and traditional data stores

Various data sources, usage examples and tools
Data Storage and Analysis : Big Data Platform
A Big Data platform should provide tools and services for:

1. Storage, processing and analytics

2. Developing, deploying, operating and managing a Big Data environment

3. Reducing the complexity of multiple data sources and integrating applications into one platform

4. Custom development, querying and integration with other systems

5. Traditional as well as Big Data techniques

Services require the following:
1. New, innovative, non-traditional methods of storage, processing and analytics
2. Distributed data stores
3. Creating a scalable, elastic, virtualized platform
4. Huge-volume data stores
5. Massive parallelism
6. High-speed networks
7. High-performance processing, optimization and tuning
8. Data management models based on NoSQL
9. In-memory data in column as well as row formats for OLAP and OLTP

10. Data retrieval, mining, reporting, visualization and analytics
11. Graph databases to enable analytics on social network messages, pages and data
12. ML or other approaches
13. Big Data stores: data storages, data warehouses, Oracle Big Data, MongoDB, Cassandra
14. Data sources: sensors, audit trails of financial transactions, external data (web, social media, weather data, health records)

Hadoop:

- The Hadoop Distributed File System (HDFS) is an open-source storage system
- A scaling file system
- A self-managing file system
- A self-healing file system
- A reliable, parallel computing platform
- Manages distributed databases of Big Data

Hadoop-based Big Data environment
Mesos

- Mesos v0.9 is a resource-management platform
- Enables sharing of a cluster of nodes by multiple frameworks
- Compatible with the open analytics stack:
Data processing: Hive, Hadoop, HBase, Storm
Data management: HDFS

Big Data Stack

- Consists of a set of software components and data store units

- Applications, ML algorithms, analytics and visualization tools

- Cloud services: Amazon EC2, Azure or a private cloud

Tools for Big Data environment

Data Storage and Analysis : Big Data Analytics
Data Analytics Definition

Wikipedia: "Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making."

- Statistical and mathematical data analysis that clusters, segments, ranks and predicts future possibilities

- Uses historical data and forecasts new values and results

- Suggests techniques which will provide the most efficient and beneficial results for the enterprise

- Helps in business intelligence and decision making

Phases in Analytics

1. Descriptive analytics: enables deriving additional value from visualizations and reports

2. Predictive analytics: enables extraction of new facts and knowledge, and then predicts/forecasts

3. Prescriptive analytics: enables derivation of additional value and better decisions on new options to maximize profits

4. Cognitive analytics: enables derivation of additional value and better decisions
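The first two phases can be contrasted on toy monthly sales figures (invented numbers): a descriptive summary of the history, then a simple least-squares trend line that forecasts the next month:

```python
# Monthly sales for six months (illustrative data).
sales = [100, 110, 125, 135, 150, 160]
months = list(range(len(sales)))

# Descriptive analytics: what happened?
mean_sales = sum(sales) / len(sales)

# Predictive analytics: fit a trend line y = a + b*x and forecast month 6.
n = len(sales)
mx, my = sum(months) / n, mean_sales
b = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / \
    sum((x - mx) ** 2 for x in months)
a = my - b * mx
forecast = a + b * n
print(round(mean_sales, 1), round(forecast, 1))  # 130.0 173.0
```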

Analytics Architecture reference model

Berkeley Data Analytics Stack (BDAS)

- An open-source data analytics stack for complex computations on Big Data

- Supports efficient, large-scale in-memory data processing

- Achieves accurate, timely and cost-effective processing

Three layers:

1. Applications: AMP-Genomics and Carat run at BDAS (AMP: Berkeley's Algorithms, Machines and People Laboratory)

2. Data Processing: combines batch, streaming and interactive computations

3. Resource Management: provides sharing of the infrastructure across various frameworks

Four layers architecture for Big Data Stack

Big Data Analytics Applications and Case Studies
Application areas: marketing, sales, health care, medicine, advertising, etc.

Big Data in Marketing and Sales

- Data are important for marketing, sales and advertising

- Customer value depends on three factors: quality, service and price

- Big Data analytics identifies and derives intelligence using predictive models

- Enables marketing companies to decide what products to sell


Marketing: the process of creation, communication and delivery of value to customers

Customer value: what a customer desires from a product

Customer value analytics (CVA): analyzing what a customer really needs


Five application areas of Big Data use cases:

1. CVA using inputs of evaluated purchase patterns, preferences, quality, price and post-sale service requirements

2. Operational analytics for optimizing company operations

3. Detection of fraud and compliance

4. New products and innovations in service

5. Enterprise data warehouse optimization


Big Data provides marketing insight into:

i) Effective content at each stage of a sales cycle

ii) Investment in improving customer relationship management

iii) Additions to strategies for increasing customer lifetime value

iv) Lowering of customer acquisition cost


Big Data analytics in detection of marketing frauds:

i) Fusing existing data at an enterprise data warehouse with data from sources such as social media, websites, blogs and e-mails, thus enriching the existing data

ii) Using multiple sources of data and connecting with many applications

iii) Providing greater insights using querying of the multiple-source data


iv) Analyzing data which enables structured reports and visualization

v) Providing high-volume data mining and new innovative applications, thus leading to new business intelligence and knowledge discovery

vi) Making detection of threats less difficult and faster, and predicting frauds by using various publicly available data and information


Big Data Risks

Five data risks are described by Bernard Marr:

1. Data security
2. Data privacy breach
3. Costs affecting profits
4. Bad analytics
5. Bad data


Big Data Credit Risk Management

The main risks are:

i) Loan defaults

ii) Untimely return of interest and the principal amount


Big Data analytics supports credit risk management by:

1) Identifying high-credit-rating business groups and individuals
2) Identifying the risks involved before lending money
3) Identifying industrial sectors with greater risks
4) Identifying types of employees and businesses with greater risks
5) Anticipating liquidity issues over the years


The data insights from analytics lead to faster reactions; the benefits are:

i) Minimizing non-payments and frauds

ii) Identifying new credit opportunities, new customers and revenue streams

iii) Marketing to low-risk businesses and households


Big Data and Algorithmic Trading

Wikipedia: Algorithmic trading is a method of executing a large order using automated, pre-programmed trading instructions accounting for variables such as time, price and volume.


Big Data and Health Care

Data sources:

1. Clinical records
2. Pharmacy records
3. Electronic medical records
4. Diagnosis logs and notes
5. Additional data (deviations from a person's usual activities, medical leave from a job, social interactions)


Health care analytics using Big Data can facilitate the following:

1. Provisioning of value-based and customer-centric health care
2. Utilizing the IoT for health care
3. Preventing fraud (excessive or duplicate claims), waste (unnecessary tests) and abuse (unnecessary use of medicines or testing facilities)
4. Improving outcomes
5. Monitoring patients in real time


Big Data in Medicine

Building health profiles of individual patients and predictive models:

1. Aggregating information from sources ranging from DNA, proteins and metabolites to cells, tissues, organs and ecosystems enhances the diagnosis of disease.


Big Data creates patterns and models by data mining that help better understanding.

2. Deploying wearable-device data: device data recorded during active as well as inactive periods provide a better understanding of patient health and better risk profiling for certain diseases.


Big Data in Advertising

- Big Data helps digital advertisers discover new relationships and less competitive regions

- Success of an advertisement depends on the collection, analysis and mining of data

- New insights enable hyper-localized advertising (personalized and targeted advertisements online, on social media and on mobile)
