
COLLEGE OF ENGINEERING ROORKEE

Established in 1998

DATA ANALYTICS
UNIT-2
Kshitij Jain
Assistant Professor
Department of Computer Science & Engineering
COER
UNIT-2 SYLLABUS
INTRODUCTION TO BIG DATA: Big Data and its
Importance, Four V’s of Big Data, Drivers for Big
Data, Introduction to Big Data Analytics, Big
Data Analytics applications. BIG DATA
TECHNOLOGIES: Hadoop’s Parallel World, Data
discovery, Open source technology for Big Data
Analytics, cloud and Big Data, Predictive
Analytics, Mobile Business Intelligence and Big
Data, Crowd Sourcing Analytics, Inter- and Trans-
Firewall Analytics, Information Management.
List of Program Outcomes
PO1 Engineering Knowledge: Apply knowledge of mathematics and science, with
fundamentals of Computer Science & Engineering to be able to solve complex
engineering problems related to CSE.

PO2 Problem Analysis: Identify, formulate, review research literature, and analyze complex
engineering problems related to CSE, reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.

PO3 Design/Development of Solutions: Design solutions for complex engineering problems
related to CSE and design system components or processes that meet the specified
needs with appropriate consideration for public health and safety and for cultural,
societal, and environmental considerations.

PO4 Conduct Investigations of Complex Problems: Use research-based knowledge and
research methods, including design of experiments, analysis and interpretation of data,
and synthesis of information, to provide valid conclusions.

PO5 Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools, including prediction and modeling, to complex
engineering activities related to computer science, with an understanding of the limitations.

PO6 The Engineer and Society: Apply reasoning informed by contextual knowledge to
assess societal, health, safety, legal, and cultural issues and the consequent responsibilities
relevant to professional CSE engineering practice.

PO7 Environment and Sustainability: Understand the impact of professional CSE
engineering solutions in societal and environmental contexts and demonstrate
knowledge of, and the need for, sustainable development.

PO8 Ethics: Apply ethical principles and commit to professional ethics, responsibilities, and
norms of engineering practice.

PO9 Individual and Team Work: Function effectively as an individual and as a member or
leader in diverse teams and in multidisciplinary settings.

PO10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.

PO11 Project Management and Finance: Demonstrate knowledge and understanding of
engineering management principles and apply these to one's own work, as a member
and leader in a team, to manage projects in multi-disciplinary environments.

PO12 Life-Long Learning: Recognize the need for, and have the preparation and ability to
engage in, independent and life-long learning in the broadest context of technological
change.
Big Data Types

• Relational databases are examples of structured data.
• Examples of unstructured data include audio and video files.
• Semi-structured data includes XML data, JSON files, and similar formats.
5 V’s of BIG DATA
Volume
• Volume refers to the amount of data produced. When this volume
becomes huge, it is called big data. Data is generally produced in
different formats, structured and unstructured, by various sources.
• Examples of the various data formats are Word and Excel documents,
PDFs, and reports, along with videos and images.
• In addition, data is continuously produced in large chunks by digital
and social media platforms. Such data is difficult for organizations to
store and process with traditional data analysis methods.
• Organizations should therefore focus on implementing up-to-date
tools and techniques to collect, store, and analyze such massive data
faster.
Velocity

The speed of data generation, collection, and
analysis is known as velocity. Data flows continuously
from multiple sources, such as computer systems,
networks, machines, mobile phones, social media, etc.
Today, data can be accumulated in real time, within a
fraction of a second. Fast processing and analysis of
data directly affects decision-making: the goal is to
extract the valuable information present in real time
and feed it back to the organization so that it can
make effective decisions about its operations.
Value
• Merely collecting a large volume of data is not useful in itself.
The data should be used to add value to an organization through
insights. In big data terms, value distinguishes data that is useful
to an enterprise from data that is not. Hence, we need data
analysis techniques to accomplish this task.
• Although numerous organizations have established data
estimation and storage bases, many cannot tell the difference
between data estimation and value addition.
• Using modern data analytics, we can draw important insights
from the gathered data. These insights are what add value to an
organization's decisions.
Variety
• Along with the volume and velocity of data, variety is one of the
5 Vs of big data. Data of many types is collected from various sources
and processed in diverse ways. These data sources can be external or
internal units of any organization.
• Big data is mainly divided into 3 categories:
• Structured data: this type of data has a clearly defined format,
length, and volume.
• Semi-structured data: the format of this data is only partially defined.
• Unstructured data: unorganized data such as images,
videos, and other content from social media platforms.
• Collecting and analyzing this large amount of data is an arduous task.
Moreover, almost 80% of data, including photos, videos, mobile,
and social media data, is unstructured by nature (a small illustration
of the distinction follows below).
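A minimal illustration of the structured/semi-structured distinction, with invented field names, in Python: a fixed-schema row alongside a self-describing JSON record.

```python
import json

# Structured: fixed schema, every record has the same fields in the same order.
structured_row = ("C1001", "Asha", 29)  # (customer_id, name, age)

# Semi-structured: self-describing JSON whose fields may vary between records.
json_record = '{"customer_id": "C1001", "name": "Asha", "interests": ["video", "music"]}'

parsed = json.loads(json_record)                 # schema discovered at read time
print(parsed["customer_id"], parsed["interests"])
print(parsed.get("age", "not recorded"))         # missing fields are tolerated
```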
Veracity

• How truthful big data is remains a difficult question. Data
quickly becomes outdated, and information shared via the
internet and social media is not necessarily correct. Many
managers and directors in the business community do not
dare to make decisions based on Big Data.
• Data scientists and IT professionals are building methods
for accessing the right data, and it is very important that
they find a good way to do this: if Big Data is organized and
used in the right way, it can be of great value in our lives,
from predicting business trends to preventing disease and
crime.
Importance of Big Data
A Big Data analytics solution helps organizations unlock strategic
value and take full advantage of their assets. It helps organizations:
• Understand where, when, and why their customers buy.
• Protect the company's client base with improved loyalty programs.
• Seize cross-selling and upselling opportunities.
• Provide targeted promotional information.
• Optimize workforce planning and operations.
• Reduce inefficiencies in the company's supply chain.
• Predict market trends.
• Predict future needs.
• Become more innovative and competitive.
• Discover new sources of revenue.
• The importance of Big Data doesn't revolve around the amount of data a
company has; it lies in how the company utilizes the gathered data.
• Every company uses its collected data in its own way. The more effectively a
company uses its data, the more rapidly it grows.
• Companies in the present market need to collect and analyze data because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc., bring cost-saving benefits to
businesses that have to store large amounts of data. These tools also help
organizations identify more effective ways of doing business.

2. Time-Saving
Real-time in-memory analytics helps companies collect data from various
sources. Tools like Hadoop help them analyze data immediately, thus
enabling quick decisions based on the learnings.
3. Understand the market conditions
Big Data analysis helps businesses get a better understanding of market conditions.
For example, analysis of customer purchasing behavior helps companies identify the
products that sell most and produce them accordingly. This helps companies get ahead of
their competitors.

4. Social Media Listening
Companies can perform sentiment analysis using Big Data tools. This enables them to get
feedback about their company, that is, who is saying what about it.
Companies can use Big Data tools to improve their online presence.

5. Boost Customer Acquisition and Retention
Customers are a vital asset on which any business depends. No business can achieve
success without building a robust customer base, yet even with a solid customer base,
companies can't ignore the competition in the market.
If a company doesn't know what its customers want, its success will suffer, resulting
in a loss of clientele and an adverse effect on business growth.
Big Data analytics helps businesses identify customer-related trends and patterns.
Customer behavior analysis leads to a profitable business.
6. Solve Advertisers' Problems and Offer Marketing Insights
Big Data analytics shapes all business operations. It enables companies to fulfill
customer expectations, helps adapt the company's product line, and ensures
powerful marketing campaigns.

7. The Driver of Innovations and Product Development
Big Data makes companies capable of innovating and redeveloping their products.
Real-Time Benefits of Big Data
Big Data analytics has expanded its roots into all fields. As a result, Big Data is
used in a wide range of industries including finance and banking, healthcare,
education, government, retail, manufacturing, and many more.
Many companies, such as Amazon, Netflix, Spotify, LinkedIn, Swiggy, etc., use
Big Data analytics. The banking sector makes the most extensive use of Big Data
analytics. The education sector is also using data analytics to enhance students'
performance as well as to make teaching easier for instructors.
Big Data analytics helps retailers, from traditional to e-commerce, understand
customer behavior and recommend products matching customer interests. This helps
them develop new and improved products, which benefits the firm enormously.
Credit Card Fraud Detection
As millions of people use credit cards nowadays, it has become very
necessary to protect them from fraud. It is a challenge for credit card
companies to identify whether a requested transaction is fraudulent or
not.
A credit card transaction takes barely 2-4 seconds to complete, so
companies need an innovative solution that can identify potentially
fraudulent transactions within this short window and thus protect their
customers from becoming victims.
An abnormal number of clicks from the same IP address, or a pattern in
the access times, may be one criterion.
Likewise, if you made a transaction in Mumbai today and the very next
minute there is a transaction from your card in Singapore, chances are
that the second transaction is fraudulent and not made by you.
So companies need to process data in real time (Data-in-Motion analytics,
DIM) and analyze it against individual history in a very short span of time
to identify whether a transaction is actually fraudulent (a simplified rule
is sketched below).
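A minimal sketch of one such rule, the "impossible travel" check from the Mumbai/Singapore example (city names, the distance table, and the speed threshold are invented for illustration; real systems score each transaction against the cardholder's full history in milliseconds):

```python
from datetime import datetime

# Precomputed great-circle distances between city pairs (illustrative values).
DISTANCE_KM = {("Mumbai", "Singapore"): 3900}
MAX_SPEED_KMH = 900  # roughly the speed of a commercial flight

def looks_fraudulent(prev_txn, curr_txn):
    """Flag curr_txn if the travel speed implied by the previous
    transaction's location is physically impossible."""
    hours = (curr_txn["time"] - prev_txn["time"]).total_seconds() / 3600
    km = DISTANCE_KM.get((prev_txn["city"], curr_txn["city"]), 0)
    return hours > 0 and km / hours > MAX_SPEED_KMH

prev_txn = {"city": "Mumbai", "time": datetime(2024, 1, 1, 12, 0)}
curr_txn = {"city": "Singapore", "time": datetime(2024, 1, 1, 12, 1)}
print(looks_fraudulent(prev_txn, curr_txn))  # True: 3900 km in one minute
```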
Sentiment Analysis
Sentiment analysis provides substance behind social data.
A basic task in sentiment analysis is classifying the
polarity of a given text at the document, sentence, or
feature/aspect level: whether the expressed opinion in
a document, a sentence, or an entity feature/aspect is
positive, negative, or neutral.
Advanced, "beyond polarity" sentiment classification
looks, for instance, at emotional states such as "angry,"
"sad," and "happy."
In sentiment analysis, language is processed to identify
and understand consumer feelings and attitudes towards
brands or topics in online conversations, i.e., what
consumers think about a particular product or service,
whether they are happy with it or not, etc.
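A minimal lexicon-based polarity classifier in the spirit of the description above (the word lists are toy examples; production systems use trained models and handle negation, sarcasm, and context):

```python
# Toy sentiment lexicons; real systems use far larger, weighted lexicons
# or machine-learned classifiers.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "angry", "terrible"}

def polarity(text: str) -> str:
    """Classify a text as positive, negative, or neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("love the new update and battery life is excellent"))  # positive
print(polarity("terrible service and poor build quality"))            # negative
```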
Data Processing (Retail)

• Let us now look at an application for a leading retail client in
India. The client received about 100 GB of invoice data daily,
in XML format.
• Generating a report from this data with the conventional method
took about 10 hours, and the client had to wait that long to get
the report.
• The conventional method was implemented in C; its long runtime
made it infeasible, and the client was not happy with it.
• The invoice data, being in XML format, needed to be transformed
into a structured format before the report could be generated. This
involved validation and verification of the data and the
implementation of complex business rules (a small sketch of the
XML-to-rows step follows below).
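A minimal sketch of the XML-to-structured-rows transformation (the invoice layout, field names, and the validation rule are invented, since the client's real schema and business rules are not shown):

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice feed; the client's real schema and rules differ.
invoice_xml = """
<invoices>
  <invoice id="INV-1"><store>Delhi</store><amount>1250.50</amount></invoice>
  <invoice id="INV-2"><store>Pune</store><amount>-15.00</amount></invoice>
</invoices>
"""

rows = []
for inv in ET.fromstring(invoice_xml).iter("invoice"):
    amount = float(inv.findtext("amount"))  # validation: amount must be numeric
    if amount <= 0:                         # a stand-in business rule: drop bad rows
        continue
    rows.append((inv.get("id"), inv.findtext("store"), amount))

print(rows)  # [('INV-1', 'Delhi', 1250.5)]
```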
Orbitz.com

• Orbitz is a leading travel company that uses the latest technologies
to transform the way clients around the world plan travel. It
operates the customer travel-planning sites Orbitz, Ebookers, and
CheapTickets.
• It handles 1.5 million flight searches and 1 million hotel searches
daily, and the log data generated by this activity is approximately
500 GB in size. The raw logs are stored for only a few days because
of costly data warehousing.
• Storing and analyzing such huge data with conventional data-warehouse
storage and analysis infrastructure was becoming ever more
expensive and time-consuming.
Customer Churn Analysis

Churn analysis is the calculation of the rate of attrition in a
company's customer base. It involves identifying those
consumers who are most likely to stop using your service or
product.
The best way to manage churn is to predict, well in advance,
which subscribers are likely to leave, so that the business can
take the required measures to mitigate it and win back the
customers or re-activate the sleeping base.
Industries are also interested in finding the root cause of
customer churn: why customers are leaving and which factor
matters most. To find the root cause, companies need to
analyze customer data that may range from terabytes to
petabytes in size (the basic churn-rate calculation is
sketched below).
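The underlying attrition-rate calculation is simple even though the data volumes are not; a minimal sketch with illustrative figures:

```python
# Churn (attrition) rate for a period: customers lost during the period
# divided by customers at the start of it. Figures are illustrative.
customers_at_start = 10_000
customers_lost = 450

churn_rate = customers_lost / customers_at_start
print(f"Monthly churn rate: {churn_rate:.1%}")  # Monthly churn rate: 4.5%
```

Predicting *which* customers will churn, as opposed to measuring the rate, requires classification models trained on the behavioral data described above.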
Big Data Technology

Technology is radically changing the way data is produced, processed,
analyzed, and consumed. On one hand, technology helps evolve new and
more effective data sources. On the other, as more and more data gets
captured, technology steps in to process this data quickly and efficiently
and to visualize it to drive informed decisions. Now, more than at any other
time in the short history of analytics, technology plays an increasingly
pivotal role in the entire process of how we gather and use data.
Hadoop’s Parallel World

There are many Big Data technologies that have been making an impact on the
new technology stacks for handling Big Data, but Apache Hadoop is one
technology that has been the darling of Big Data talk. Hadoop is an open-
source platform for storage and processing of diverse data types that enables
data-driven enterprises to rapidly derive the complete value from all their
data. The original creators of Hadoop are Doug Cutting and Mike Cafarella.
Doug and Mike were building a project called “Nutch” with the goal of creating
a large Web index. They saw the MapReduce and GFS papers from Google,
which were obviously super relevant to the problem Nutch was trying to solve.
They integrated the concepts from MapReduce and GFS into Nutch; then later
these two components were pulled out to form the genesis of the Hadoop
project.
Moving beyond rigid legacy frameworks, Hadoop gives organizations the
flexibility to ask questions across their structured and unstructured data
that were previously impossible to ask or solve:
The scale and variety of data have permanently overwhelmed the ability
to cost-effectively extract value using traditional platforms.
The scalability and elasticity of free, open-source Hadoop running on
standard hardware allow organizations to hold onto more data than ever
before, at a transformationally lower TCO than proprietary solutions and
thereby take advantage of all their data to increase operational efficiency
and gain a competitive edge. At one-tenth the cost of traditional solutions,
Hadoop excels at supporting complex analyses—including detailed,
special-purpose computation—across large collections of data.
Hadoop handles a variety of workloads, including search, log processing,
recommendation systems, data warehousing, and video/image analysis.
Today's explosion of data types and volumes means that Big Data equals big
opportunities, and Apache Hadoop empowers organizations to work on the
most modern scale-out architectures using a clean-sheet design data
framework, without vendor lock-in.
Apache Hadoop is an open-source project administered by the Apache
Software Foundation. The software was originally developed by the world’s
largest Internet companies to capture and analyze the data that they
generate. Unlike traditional, structured platforms, Hadoop is able to store
any kind of data in its native format and to perform a wide variety of
analyses and transformations on that data. Hadoop stores terabytes, and
even petabytes, of data inexpensively. It is robust and reliable and handles
hardware and system failures automatically, without losing data or
interrupting data analyses. Hadoop runs on clusters of commodity servers
and each of those servers has local CPUs and disk storage that can be
leveraged by the system.
The two critical components of Hadoop are:

1. The Hadoop Distributed File System (HDFS). HDFS is the storage system for a
Hadoop cluster. When data lands in the cluster, HDFS breaks it into pieces and
distributes those pieces among the different servers participating in the
cluster. Each server stores just a small fragment of the complete data set, and
each piece of data is replicated on more than one server.
2. MapReduce. Because Hadoop stores the entire dataset in small pieces across
a collection of servers, analytical jobs can be distributed, in parallel, to each of
the servers storing part of the data. Each server evaluates the question against
its local fragment simultaneously and reports its results back for collation into
a comprehensive answer. MapReduce is the agent that distributes the work
and collects the results (a word-count sketch in this style follows the list).
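A word-count sketch in the MapReduce style, written as a single pure-Python stand-in (in a real Hadoop job the mapper and reducer would run as separate tasks on the nodes holding the data, e.g. via Hadoop Streaming):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the local fragment.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum the partial counts collected for one word.
    return word, sum(counts)

lines = ["hadoop stores data", "mapreduce processes data"]

# Shuffle/sort: group all pairs by word, as the framework does between phases.
pairs = sorted(kv for line in lines for kv in mapper(line))
result = [reducer(word, (c for _, c in group))
          for word, group in groupby(pairs, key=itemgetter(0))]
print(result)  # [('data', 2), ('hadoop', 1), ('mapreduce', 1), ...]
```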
Both HDFS and MapReduce are designed to continue to work in the
face of system failures. HDFS continually monitors the data stored on
the cluster. If a server becomes unavailable, a disk drive fails, or data is
damaged, whether due to hardware or software problems, HDFS
automatically restores the data from one of the known good replicas
stored elsewhere on the cluster. Likewise, when an analysis job is
running, MapReduce monitors progress of each of the servers
participating in the job. If one of them is slow in returning an answer
or fails before completing its work, MapReduce automatically starts
another instance of that task on another server that has a copy of the
data. Because of the way that HDFS and MapReduce work, Hadoop
provides scalable, reliable, and fault-tolerant services for data storage
and analysis at very low cost.
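A toy sketch of the block-splitting and replica-placement idea described above (the block size, replica count, and round-robin placement are simplifications; real HDFS defaults to 128 MB blocks and three replicas, placed rack-aware):

```python
import itertools

BLOCK_SIZE = 4      # bytes, for illustration only (HDFS defaults to 128 MB)
REPLICATION = 3     # copies of each block (the HDFS default)
servers = ["node1", "node2", "node3", "node4"]

data = b"abcdefghij"  # stands in for a large incoming file

# Split the file into fixed-size blocks, as HDFS does on ingest.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Assign each block to REPLICATION servers (round-robin here; HDFS is smarter).
ring = itertools.cycle(servers)
placement = {i: [next(ring) for _ in range(REPLICATION)]
             for i in range(len(blocks))}
print(placement)  # block id -> the servers holding its replicas
```

Losing any one server leaves at least two replicas of every block intact, which is what lets HDFS restore data from a known good copy.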
BIG DATA AS OPPORTUNITY
Big Data Collected by Smart Meter
Problem with Smart Meter Big Data
How Smart Meter Big Data Is Analyzed
Types of Big Data Analytics

• Descriptive, which answers the question, "What happened?"
• Diagnostic, which answers the question, "Why did this happen?"
• Predictive, which answers the question, "What might happen in the future?"
• Prescriptive, which answers the question, "What should we do next?"
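A miniature contrast between the descriptive and predictive flavors (illustrative sales figures and a deliberately naive linear forecast):

```python
# Quarterly sales figures (illustrative).
sales = [120, 132, 140, 151]

# Descriptive: what happened?
average = sum(sales) / len(sales)

# Predictive: what might happen next? (naive linear-trend extrapolation)
trend = (sales[-1] - sales[0]) / (len(sales) - 1)  # average change per period
forecast = sales[-1] + trend

print(f"average={average:.1f}, next-period forecast={forecast:.1f}")
# average=135.8, next-period forecast=161.3
```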
Story of Big Data & Traditional System
Failure of Traditional System
Need of an Effective Solution
Effective Solution
Need of a Framework
Apache Hadoop: Framework to Process Big Data
MapReduce Word Count Program
Hadoop: Master/Slave Architecture
Sources of Big Data Deluge
Data Science vs. Business Intelligence
Data Analytics Use-Cases
Social Media Sentiment Analysis
Data Analytics Lifecycle
Hadoop 2.x Daemons
Hadoop Cluster Architecture: Master-Slave Topology
NameNode and DataNode
Secondary NameNode & Checkpointing
Distributed File System
Hadoop Architecture: Block Replication
HDFS Blocks
Hadoop Architecture: Rack Awareness
Rack Awareness
Hadoop Cluster
HDFS Architecture
HDFS Write Mechanism: Pipeline Setup
HDFS Write Mechanism: Writing a Block
HDFS Read Mechanism
Hadoop Architecture: HDFS & YARN
Hadoop Cluster Modes
Hadoop Ecosystem
THANK YOU!
