
Overview of Big Data

Module 1
Evaluation Criteria - Theory

Criteria | Marks
Mid Marks (Best of Three) | 30M
Assignment | 5M
Quiz | 5M
Total | 40M
Evaluation Criteria - Lab
Criteria | Marks
Continuous Evaluation | 40M
Lab Exam 1 | 15M
Lab Exam 2 | 15M
Coursera | 20M
Case Study | 10M
Total | 100M
Coursera Course Link

Introduction to Big Data with Spark and Hadoop, offered by IBM
https://www.coursera.org/learn/introduction-to-big-data-with-spark-hadoop
Syllabus – Module 1

Getting an overview of Big Data


Big Data definition, History of Data Management, Structuring
Big Data, Elements of Big Data, Big Data Analytics.
Exploring use of Big Data in Business Context:
Use of Big Data in Social Networking, Use of Big Data in
preventing Fraudulent Activities in Insurance Sector & in
Retail Industry.
Why Big Data?

• Soaring Demand for Analytics Professionals
• Salary Aspects
• Big Data Analytics: A Top Priority in a lot of Organizations
• Big Data Analytics is Used Everywhere!

Soaring Demand for Analytics Professionals
Salary Aspects
Big Data – Job Titles
Big Data – Required Skills
Big Data/Analytics Jobs (Toronto)
• Banks: RBC, TD, CIBC, Scotiabank, AMEX, CapitalOne, ING Direct
• Telecommunications: Rogers, Telus, Bell, etc.
• Technology: BlackBerry, Huawei, CGI
• Manufacture/Services: GM, Canada Post, Workopolis
• Insurance: SunLife, Manulife
• Web/Mobile/Startup: Google, Mozilla
• Digital Media/Agencies: Globe and Mail, Kobo
• Consulting: Accenture, IBM, Deloitte, SAS
• Retail/e-commerce: Amazon, HR, Hudson Bay, Sears, Shoppers, Canadian Tire, Sobeys
• Pharmaceutical/Healthcare: Hospitals, Clinical Research Companies, etc.
Job Market

• Where are big data jobs?
• North America
• Silicon Valley, Seattle, NYC, Toronto
• India/China

Big data jobs around the nation: http://www.tableausoftware.com/public/gallery/big-data-jobs


Big Data Salary: http://goo.gl/4et998
O'Reilly Media Data Science Salary Survey: http://www.oreilly.com/data/free/files/stratasurvey.pdf
KDNuggets 2014 Analytics/Data Science Salary Poll: http://goo.gl/VhO9IW
Why is Big Data important now?
'Big Data' has many valuable applications:

• Product recommendation
• Prediction
• Market Analysis
• Fraud detection

And many, many more ... Data must be processed to glean insights from it and derive the
value from it.
Big Data Made Possible
Hardware
‒ Big cluster of commodity machines at lower cost
• Faster processor
• Cheaper memory
• Bigger hard drive space
• Faster network bandwidth
Software
‒ Algorithms to allow parallel computing (map-reduce)
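The map-reduce idea mentioned above can be illustrated with a word count. This is only a minimal single-machine sketch in plain Python, not Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase) and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values of each key to obtain the final counts."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    # In a cluster each document could be mapped on a different machine;
    # here the map calls simply run one after another.
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))

# Hypothetical input, standing in for data split across machines.
documents = ["big data is big", "data is valuable", "big value"]
counts = word_count(documents)
print(counts)
```

Because each map call looks at one document only and each reduce call looks at one key only, both phases can in principle run in parallel on commodity machines.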
What is Big Data?
Think of the following:
• Every second, there are around 822 tweets on Twitter.
• Every minute, nearly 510 comments are posted, 293,000 statuses are updated and
136,000 photos are uploaded on Facebook.
• Every hour, Walmart handles more than 1 million customer transactions.
• Every day, customers make around 11.5 million payments using PayPal.
- In the digital world, data is increasing rapidly because the use of the internet and sensors is growing
at a very high rate.
- The sheer volume, variety, velocity and veracity of such data is signified by the
term ‘Big Data’.
What is Big Data?
• Big data is structured, unstructured and semi-structured in nature.
• It is difficult for computing systems to handle because of its high speed and volume.
• Traditional data management, warehousing and analysis systems fail to
analyze data arriving at such high speed.
• Apache Hadoop is widely used for storing and managing Big Data.
• According to IBM, every day we create 2.5 quintillion bytes of data – so
much that 90% of the data in the world today has been created in the last two
years alone.
• Examples include sensor data, climate data, GPS data and bank data, to name a
few. All of this is Big Data.
Big Data - Definition

• “Big data is high-volume, high-velocity, and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.”
• In simple words, Big Data is a collection of data that is
huge in volume and still growing exponentially with time.
Its size and complexity are so large that no traditional
data management tool can store or process it efficiently.
• One major source of data production is smart electronic
devices.
Data Expansion – Day by Day
• Amount of data – 175 zettabytes by 2025.
• Total volume of data – doubles every two years.
Tabular Representation of Various Memory Sizes
Name | Equal to | Size (in bytes)
Bit | 1 bit | 1/8
Nibble | 4 bits | 1/2 (rare)
Byte | 8 bits | 1
Kilobyte (KB) | 1,024 bytes | 1,024
Megabyte (MB) | 1,024 kilobytes | 1,048,576
Gigabyte (GB) | 1,024 megabytes | 1,073,741,824
Terabyte (TB) | 1,024 gigabytes | 1,099,511,627,776
Petabyte (PB) | 1,024 terabytes | 1,125,899,906,842,624
Exabyte (EB) | 1,024 petabytes | 1,152,921,504,606,846,976
Zettabyte (ZB) | 1,024 exabytes | 1,180,591,620,717,411,303,424
Yottabyte (YB) | 1,024 zettabytes | 1,208,925,819,614,629,174,706,176
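Each row of the table is 1,024 times the previous one, so the byte counts can be recomputed rather than memorized. The sketch below is illustrative only; the helper names unit_size and human_readable are hypothetical.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def unit_size(name):
    """Size of one unit in bytes, using the 1,024 multiplier from the table."""
    return 1024 ** UNITS.index(name)

def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

# Reproduce the right-hand column of the table.
for name in UNITS[1:]:
    print(f"{name}: {unit_size(name):,} bytes")

print(human_readable(5 * unit_size("GB")))
```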
In simple words, various memory sizes
Sources of Big Data

• Social media
• Sensors placed in various cities
• Customer satisfaction feedback
• IoT appliances
• E-Commerce
• Global Positioning System(GPS)
Sources of Big Data
Social Media
• WhatsApp, Facebook, Instagram, Twitter, YouTube, etc.
• Each activity (uploading a photo or video, commenting, sending a
message, liking a post, etc.) creates data.
Sensors
• Sensors in a city gather temperature, humidity, etc.
• Cameras beside roads gather information.
• Security cameras in airports and banks create a lot of data.
Customer Satisfaction Feedback
• Amazon, Flipkart, FirstCry, Licious, Swiggy, Blinkit, Zepto, etc.
gather customer feedback on product quality and delivery time. This
creates a lot of data.
Sources of Big Data

IoT Appliance
• Electronic devices connected to the internet create data for their smart functionality. Example:
Samsung SmartThings.
E-Commerce
• Payments made by credit card, debit card, pay later, or any other electronic means are recorded as data.
Global Positioning System (GPS)
• Vehicle movement (directions, traffic congestion) creates a lot of data on vehicle position and
movement.
1. Volume
Volume defines how much data we have – what we used to measure in Gigabytes is now measured in
Zettabytes (ZB) or even Yottabytes (YB). The Internet of Things (IoT) creates exponential growth in data.
Projections show the volume of data changing significantly in the coming years.

2. Velocity
Velocity represents the speed at which data is processed and becomes accessible. Today, if delivery is not
real-time, it’s usually not fast enough.

3. Variety
Variety describes one of the biggest challenges of big data. The insights may come without structure. The
total asset may include many data types, from XML to video to SMS. Organizing the data in a meaningful
way is no simple task when the data itself changes rapidly.

4. Variability
Variability is different from variety. A coffee shop may offer six different blends of coffee, but if you get
the same blend every day and it tastes different every day, that is variability. The same is true of data. If
the meaning constantly changes, it can significantly impact your data homogenization.
5. Veracity
Veracity is about ensuring the data is accurate, which requires processes to keep inaccurate data from
accumulating in your systems. The simplest example is when contacts enter your marketing
automation system with false names and inaccurate contact information. How many times have you
seen Mickey Mouse in your database? It’s the classic “garbage in, garbage out” challenge.

6. Visualization
Visualization is critical in today’s world. Using charts and graphs to visualize large amounts of complex
data is much more effective in conveying meaning than spreadsheets and reports chock-full of
numbers and formulas.

7. Value
Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization
(which takes a lot of time, effort, and resources), you want to be sure your organization is getting
value from the data.
Main Features of Big Data
Big Data:
• Is a new data challenge that requires leveraging existing systems differently
• Is classified in terms of 4 V’s: Volume, Variety, Velocity, Veracity
• Is usually unstructured and qualitative in nature
Real world examples – Big data
• Social media analytics – Consumer product companies and retail
organizations are observing data on social media websites to analyze
customer behaviour, preferences, etc.
• Insurance companies use Big Data analytics to see which home insurance
applications can be processed immediately and which ones need an
in-person validation visit from an agent.
• Hospitals are analysing medical data and patient records to predict
which patients are likely to be readmitted within a few months of
discharge.
• Relying on social networks and analytics, companies are gathering
volumes of data from the web to help musicians and music
companies better understand their audiences.
Types and Sources of data

Type | Description | Source
Social Data | Information collected from various social networking sites and online portals | Facebook, Twitter and LinkedIn
Machine Data | Information generated from RFID chips, bar code scanners and sensors | RFID chip readings, Global Positioning System (GPS) results
Transactional Data | Information generated from online shopping sites, retailers and Business to Business (B2B) transactions | Retail websites like eBay and Amazon
Caselet
History of Data Management – Evolution of Big Data

• Big data is the new term of data evolution directed by velocity, variety
and volume of data.
• Velocity implies the speed with which the data flows in an
organization.
• Variety refers to the varied forms of data, such as structured,
semi-structured or unstructured.
• Volume defines the amount or quantity of data an organization has to
deal with.
Challenges faced while handling data over the past few decades
• In the early 60’s, technology witnessed problems with velocity. This need inspired the evolution of databases.
• In the 90’s, technology witnessed issues with variety (emails, documents, videos), leading to the emergence of non-SQL stores.
• Today, technology is facing issues related to huge volume, leading to new storage and processing solutions.
Structuring Big Data
• In simple terms, structuring Big Data means arranging the available data
so that it becomes easy to study, analyse, and derive conclusions from it.
• Information processing systems can analyse data on the basis of what you
searched, what you looked at, and how long you remained on a particular
page or website.
• When a user regularly visits or purchases from Amazon, each time he/she
logs in, the system can present a recommended list of products that may
interest the user on the basis of his/her purchases or searches. This
is the power of Big Data Analytics.
Types of Data

• Data that comes from multiple sources, such as databases, ERP systems,
weblogs, chat history and GPS maps, varies in its format.
• Data is obtained primarily from the following types of sources:
(a) Internal sources, such as organizational data
(b) External sources, such as social data
Types of Data
Data Source | Definition | Examples | Application
Internal | Provides structured data that originates within the enterprise and helps run the business | Customer Relationship Management; Enterprise Resource Planning; customers’ details; products and sales data | This data is used to support the daily business operations of an organization
External | Provides unstructured data that originates from the external environment of an organization | Business partners; Internet; market research organizations | This data is analyzed to understand entities that are mostly external to the organization, such as customers, competitors, market and environment
Types of Data
• Big data comprises
- Structured data
- Unstructured data
- Semi-structured data
Structured data
• Is organized data in a predefined format
• Is stored in tabular form
• Is the data that resides in fixed fields within a record or file
• Is formatted data that has entities and their attributes
mapped
• Is used to query and report against predetermined
datatypes
• SQL is used for managing and querying structured data
• Represents only 5 to 10% of all data
• When data grows beyond the capacity of an RDBMS, it can be
stored and analyzed in data warehouses, but only up to a
certain limit
Example – Sample of Structured Data

Customer ID | Name | Product ID | City | State
123 | Jack | 4689 | Graz | Styria
321 | Sandy | 5688 | Wolfsberg | Carinthia
459 | Robert | 459 | Enns | Upper Austria
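Because structured data sits in fixed fields, it can be queried with SQL directly. The following is a small sketch using Python's built-in sqlite3 module with the sample rows above; the table name customers, the column names and the helper customers_in_state are assumptions made for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, name TEXT, product_id INTEGER, city TEXT, state TEXT)"
)
rows = [
    (123, "Jack", 4689, "Graz", "Styria"),
    (321, "Sandy", 5688, "Wolfsberg", "Carinthia"),
    (459, "Robert", 459, "Enns", "Upper Austria"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", rows)

def customers_in_state(state):
    """Return (name, city) pairs for one state, using a parameterized query."""
    cursor = conn.execute(
        "SELECT name, city FROM customers WHERE state = ? ORDER BY name",
        (state,),
    )
    return cursor.fetchall()

print(customers_in_state("Styria"))
```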
Unstructured Data

• Lacks a predefined structure
• About 85% of total data is unstructured.
Examples:
• e-mail messages,
• word processing documents,
• videos, photos, audio files, presentations,
• web pages
• other kinds of business documents.
Semi-Structured Data
• Also known as schema-less or self-describing data, it is a form of
structured data that contains tags to separate elements and generate
hierarchies of records and fields, as in the table below.

Sl No | Name | E-Mail
1 | Sam | smj@xyz.com
2 | First Name: David, Second Name: Brown | davidb@xyz.com
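JSON is one common way to carry such self-describing records. In the sketch below the two rows of the table are written as JSON, where the second record nests its name into first and second parts; the key names and the helper full_name are assumptions chosen for illustration.

```python
import json

# The same two records as in the table; record 2 has a nested name field.
raw = """
[
  {"sl_no": 1, "name": "Sam", "email": "smj@xyz.com"},
  {"sl_no": 2,
   "name": {"first": "David", "second": "Brown"},
   "email": "davidb@xyz.com"}
]
"""

def full_name(record):
    """Handle both shapes of the name field, since no fixed schema is enforced."""
    name = record["name"]
    if isinstance(name, dict):
        return f'{name["first"]} {name["second"]}'
    return name

records = json.loads(raw)
names = [full_name(r) for r in records]
print(names)
```

The code has to inspect each record's shape at read time, which is the practical difference from the fixed columns of structured data.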
Elements of Big Data

• According to Gartner, data is growing at the rate of 59% every year.
This growth can be depicted in terms of the following four Vs:
(i) Volume
(ii) Velocity
(iii) Variety
(iv) Veracity
Department of CSE, GIT Course Code: EID449 Course Title: BIG DATA ANALYTICS
8 December 2022
Volume
• Volume is the amount of data generated by organizations
or individuals.
• At present, the volume of data is measured in exabytes.
• In the coming years, it will be measured in zettabytes.
• Organizations are doing their best to handle this ever-increasing
volume of data.
Example:
- Every minute, more than 571 new websites are created.
- A Boeing 737 generates 240 terabytes of flight data during
a single flight across the US.
Velocity
• Velocity describes the rate at which data is generated, captured and shared.
• Information processing systems struggle because data keeps accumulating
faster than it can be processed.
Example: eBay analyses around 5 million transactions per day in real time to detect
and prevent fraud arising from the use of PayPal.
Sources of high-velocity data:
- IT devices, including routers, firewalls and switches, generate valuable data.
- Social media, including Facebook posts and tweets, creates huge amounts of data that must be
analyzed quickly because its value degrades with time.
Variety
• Refers to structured, unstructured, and
semi-structured data that is gathered from
multiple sources and comes in different
formats, such as images, text, videos, etc.
• While in the past data could only be
collected from spreadsheets and
databases, today data comes in an array of
forms such as emails, PDFs, photos, videos,
audio, social media posts, and much more.
Veracity

• Refers to the uncertainty of data: the available data can sometimes
be messy, and its quality and accuracy are difficult to control.

Example: Data in bulk could create confusion, whereas too little
data could convey only partial or incomplete information.
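One way to keep messy records under control is to validate them before they enter a system. The sketch below filters a contact list, echoing the earlier remark about finding Mickey Mouse in a database; the list of fake names, the email pattern and the sample contacts are all assumptions, and real data-quality rules would be far richer.

```python
import re

# Assumed deny-list of placeholder names; a real system would maintain its own.
KNOWN_FAKE_NAMES = {"mickey mouse", "test test", "john doe"}
# Deliberately simple pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_trustworthy(contact):
    """Accept a contact only if the name is plausible and the email is well formed."""
    name = contact.get("name", "").strip().lower()
    email = contact.get("email", "").strip()
    if not name or name in KNOWN_FAKE_NAMES:
        return False
    return bool(EMAIL_PATTERN.match(email))

# Hypothetical incoming records.
contacts = [
    {"name": "Asha Rao", "email": "asha.rao@example.com"},
    {"name": "Mickey Mouse", "email": "mickey@example.com"},
    {"name": "Vikram", "email": "not-an-email"},
]
clean = [c for c in contacts if is_trustworthy(c)]
print(clean)
```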
In short, Simple 4V’s
Big Data Analytics
• Big Data analytics is a process used to extract meaningful
insights, such as hidden patterns, unknown correlations,
market trends, and customer preferences.
• Big Data analytics provides various advantages: it can be
used for better decision making and for preventing fraudulent
activities, among other things.
• There are four main types of business/data analytics:
(a) Descriptive Analytics
(b) Diagnostic Analytics
(c) Predictive Analytics
(d) Prescriptive Analytics
Big Data Analytics - Descriptive analytics –
“What happened in the business?”
• Descriptive analytics analyses a database to provide
information on the trends of past or current business
events, which helps managers, planners and leaders
develop a roadmap for future actions.
• In short, it helps identify the root cause of a problem and
the underlying reason for failures.
Example: During the pandemic, a leading
pharmaceuticals company conducted data analysis on
its offices and research labs. Descriptive analytics
helped them identify unutilized spaces and departments
that were consolidated, saving the company millions of
dollars.
Big Data Analytics - Diagnostic analytics

• Diagnostic analytics helps companies understand
why a problem occurred. Big data technologies and
tools allow users to mine and recover data that helps
dissect an issue and prevent it from happening in the
future.

Example: A clothing company’s sales have decreased
even though customers continue to add items to their
shopping carts. Diagnostic analytics helped to
understand that the payment page was not working
properly for a few weeks.
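The shopping-cart example can be sketched as a funnel check: compare how many carts turn into orders and flag the days on which that rate collapses. The daily counts, the three-day baseline and the 50% drop factor below are hypothetical values chosen only to illustrate the idea.

```python
from statistics import mean

# Hypothetical daily funnel counts: carts created vs. orders completed.
daily_funnel = [
    {"day": "Mon", "carts": 200, "orders": 120},
    {"day": "Tue", "carts": 210, "orders": 126},
    {"day": "Wed", "carts": 190, "orders": 114},
    {"day": "Thu", "carts": 205, "orders": 41},
    {"day": "Fri", "carts": 198, "orders": 36},
]

def conversion_rate(row):
    """Share of carts that became paid orders on one day."""
    return row["orders"] / row["carts"] if row["carts"] else 0.0

def days_with_broken_checkout(rows, baseline_days=3, drop_factor=0.5):
    """Flag days whose conversion falls below drop_factor times the baseline."""
    baseline = mean([conversion_rate(r) for r in rows[:baseline_days]])
    return [
        r["day"]
        for r in rows[baseline_days:]
        if conversion_rate(r) < drop_factor * baseline
    ]

suspect_days = days_with_broken_checkout(daily_funnel)
print(suspect_days)
```

Cart volume stays steady while conversion drops, which points at the step between cart and order (the payment page) rather than at demand.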
Big Data Analytics - Predictive analytics
– “What could happen?”

• Predictive analytics is about understanding and predicting
the future using statistical models and different forecasting
techniques. It uses statistics, data mining techniques and
machine learning to forecast future outcomes.

Example: In the manufacturing sector, companies
can use algorithms based on historical data to
predict if or when a piece of equipment will
malfunction or break down.
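A very simple version of the equipment example fits a straight line to historical sensor readings and extrapolates to a failure threshold. This sketch uses ordinary least squares written out by hand; the vibration readings, the threshold of 6.0 and the function names are assumptions, and production systems would normally use richer models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def hours_until_threshold(xs, ys, threshold):
    """Extrapolate the trend; return remaining hours, or None if not rising."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])

# Hypothetical history: operating hours vs. vibration level of a machine.
operating_hours = [0, 100, 200, 300, 400]
vibration = [2.0, 2.5, 3.0, 3.5, 4.0]

remaining = hours_until_threshold(operating_hours, vibration, threshold=6.0)
print(f"Estimated hours until the failure threshold: {remaining:.0f}")
```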
Big Data Analytics - Prescriptive
analytics – “What should we do?”

• Prescriptive analytics builds on complex data from
descriptive and predictive analyses.
• Using optimization techniques, it determines the
best alternative to minimize or maximize some
objective in marketing and many other areas.
Example: If we have to find the best way of
shipping goods from a factory to a destination
to minimize costs, we will use the prescriptive
analytics.
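The shipping example can be sketched as a tiny optimization: list the candidate routes, cost each one, and pick the cheapest. Exhaustive comparison stands in here for the optimization techniques a real solver would use; the hubs, leg costs and function names are hypothetical.

```python
# Hypothetical cost per unit shipped on each leg of the journey.
LEG_COSTS = {
    ("factory", "hub_a"): 4.0,
    ("factory", "hub_b"): 6.0,
    ("hub_a", "destination"): 5.0,
    ("hub_b", "destination"): 2.0,
    ("factory", "destination"): 10.0,
}

# Every route that is allowed from the factory to the destination.
CANDIDATE_ROUTES = [
    ["factory", "destination"],
    ["factory", "hub_a", "destination"],
    ["factory", "hub_b", "destination"],
]

def route_cost(route):
    """Add up the cost of each consecutive leg in the route."""
    return sum(LEG_COSTS[(a, b)] for a, b in zip(route, route[1:]))

def cheapest_route(routes):
    """Prescribe the route with the minimum total cost."""
    return min(routes, key=route_cost)

best = cheapest_route(CANDIDATE_ROUTES)
print(best, route_cost(best))
```

With only three routes a direct comparison is enough; with thousands of factories, hubs and constraints the same question is usually handed to a linear-programming or similar solver.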
Questions

• List the four elements of Big Data.
• As an HR manager of a company providing Big Data
solutions to clients, what characteristics would you look for
when recruiting a potential candidate for the position of data
analyst?
• You are planning the marketing strategy for a new product
in your company. Identify and list some limitations of
structured data related to the work.
Exploring the Use of Big Data in
Business Context
Use of Big Data in Social Networking
Use of Big Data in preventing fraudulent activities
Use of Big Data in preventing fraudulent activities in Insurance Sector
Use of Big Data in Retail Industry
Exploring the Use of Big Data in Business Context
• An organization generally has to spend huge amounts to collect data
and information.
• For example, the cost of customer surveys keeps escalating as an
organization collects more information. This continuously
increasing cost decreases the value of the collected
information.
• In other words, collecting and maintaining a pool of data and
information is just a waste of resources unless logical conclusions
and business insights can be derived from it.
• This is where Big Data analytics comes into the picture.
Use of Big Data in Social Networking
• Social network data refers to the data generated from people
socializing on social media.
• Some popular social networking sites are Twitter, Facebook, LinkedIn,
etc.
• On a social networking site, different people constantly add and
update comments, statuses, likes, preferences, etc. All these activities
generate large amounts of data.
• This data can be segregated on the basis of different age groups,
locations and genders for the purpose of analysis.
Use of Big Data in Social Networking
• Social Networking Analysis (SNA) – analysis
performed on the data from social media.
Example: Mobile Network Operator (MNO)
• The data captured by an MNO each day, in the
form of phone calls, text messages and other
record details of all its customers, is very
large in volume.
• The company should study the data of the
people whom the customer called and
also of the people who called back. Such a
network is called a social network.
Use of Big Data in Social Networking
• The data analysis process can go
deeper and deeper within the network
to get a complete picture of a social
network.
• As the analysis goes deeper, the
volume of data to be analyzed also
becomes massive.
• The same structure of SNA is followed
when it comes to social networking
sites.
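The MNO example can be sketched by turning call records into a network and then widening the search one level at a time. The call records, customer names and function names below are hypothetical; the point is that each extra level of depth pulls more people into the analysis.

```python
from collections import defaultdict, deque

# Hypothetical call detail records: (caller, callee).
call_records = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
    ("frank", "alice"),
]

def build_network(records):
    """Link two customers whenever one has called the other."""
    network = defaultdict(set)
    for caller, callee in records:
        network[caller].add(callee)
        network[callee].add(caller)  # calls in either direction count as a tie
    return network

def contacts_within(network, customer, depth):
    """Breadth-first search: everyone reachable in at most `depth` hops."""
    seen = {customer}
    queue = deque([(customer, 0)])
    while queue:
        person, level = queue.popleft()
        if level == depth:
            continue
        for neighbour in network.get(person, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, level + 1))
    seen.discard(customer)
    return seen

network = build_network(call_records)
for depth in (1, 2, 3):
    print(depth, sorted(contacts_within(network, "alice", depth)))
```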
Use of Big Data in Social Networking

• The following are the areas in which decision-making
processes are influenced by social network data:
(a) Business Intelligence
(b) Marketing
(c) Product design and development
Use of Big Data in Social Networking – Business Intelligence (BI)

• BI is a data analysis process that converts a raw dataset into
meaningful information.
• It allows a company to collect, store, access and
analyse data to add value to decision
making.
• The data generated from different social media is
analyzed using Social Customer Relationship
Management (CRM), which is used to describe the
data.
Use of Big Data in Social Networking – Business Intelligence (BI)

Example:
• Consider a mobile service provider that has a low-value customer.
• If the low-value customer is not satisfied with the services and
wants to leave, the company generally has no problem letting
the customer go, as he brings in little revenue.
• With the help of SNA, the organization may find that some
connections in the customer’s network make a large number of
calls and text messages and have a large network of friends.
• With such an analysis, the organization might take an
altogether different decision and start valuing the
customer more, because the influence of a customer is very
important to the organization.
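The low-value customer example can be made concrete by adding the revenue of a customer's direct connections to the customer's own revenue. This is a deliberately crude sketch of "influence"; the revenue figures, the friend lists and the name network_value are assumptions.

```python
# Hypothetical monthly revenue per customer.
monthly_revenue = {"ravi": 5, "meena": 60, "arun": 45, "lata": 8}

# Hypothetical direct connections found through SNA.
connections = {"ravi": ["meena", "arun"], "lata": []}

def network_value(customer, revenue, links):
    """Own revenue plus the revenue of directly connected customers."""
    friends = links.get(customer, [])
    return revenue.get(customer, 0) + sum(revenue.get(f, 0) for f in friends)

for customer in ("ravi", "lata"):
    own = monthly_revenue[customer]
    total = network_value(customer, monthly_revenue, connections)
    print(f"{customer}: own revenue {own}, network value {total}")
```

On own revenue alone the two customers look similar; once their connections are counted, losing the first one puts far more revenue at risk.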
Use of Big Data in Social Networking – Marketing

• Today, customer preferences have changed because of busy
schedules: customers have no time to read newspapers, watch TV
commercials or go through marketing emails.
• Customers can now make their preferences clear and select the
marketing messages they wish to receive.
• In today’s world, marketers aim to deliver what consumers want by
using interactive communication across digital channels such as
e-mail, mobile, social and the Web, which in turn generates social
data.
Product Design and Development
• By listening to customers’ needs and understanding where the gaps in the offering are,
organizations can make the right decisions about the direction of their product
development and offerings.
Example: YouTube – rate a brand on a scale of 1-10, know a brand, etc.
• Once a brand has received 300 or more ratings, the application sends out a report on
what customers feel about the product, along with a detailed analysis of
the brand’s reputation.
• In this way, social networks can help organizations improve product development
by making sure they understand customer needs.
• Sentiment analysis analyses human emotions, attitudes and views across popular social
networks.
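A minimal form of sentiment analysis counts positive and negative words in a post. The sketch below is lexicon-based; the two word lists and the sample posts are assumptions, and real systems typically rely on trained language models rather than fixed lists.

```python
import re

# Tiny illustrative lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken"}

def sentiment_score(text):
    """Positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives

def sentiment_label(text):
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical social media posts about a product.
posts = [
    "I love this phone, great battery",
    "Delivery was slow and the box was broken",
    "It arrived on Monday",
]
labels = [sentiment_label(p) for p in posts]
print(labels)
```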
Use of Big Data in Fraudulent Activities

• Most common types of financial fraud:
(a) Credit card fraud
(b) Exchange or return policy fraud – Amazon/Flipkart
(c) Personal information fraud –
The fraudster obtains the login details of a customer, purchases a product online, and then
changes the delivery address to a different location. The actual customer keeps calling
the retailer for a refund, as he did not make the transaction.
Preventing Fraud using Big Data Analytics
Analyzing Big Data allows organizations to:
• Keep track of and process huge volumes of data.
• Differentiate between real and fraudulent entries.
• Identify new methods of fraud and add them to the list of
fraud-prevention checks.
• Verify whether a product has actually been delivered to a valid
recipient.
• Determine the location of the customer and the time when the
product was actually delivered.
Use of Big Data in Detecting Fraudulent Activities
in Insurance Sector

• An insurance company wants to improve its ability to take decisions while
processing claims.
• It decides to implement a Big Data analytical platform, which will use data
from social media to provide a real-time view of the case in hand.
• The information obtained will enable the insurance agent to diagnose
patterns in the customer’s claims, behavior and other issues.
Example: In some cases, social media could also provide strong triggers for identifying
fraud. A customer might indicate that his car was destroyed in a flood, but the
social media feed may show that the car was actually in
another city on the day the flood occurred.
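The flood example amounts to cross-checking a claim against dated, located social media posts by the same customer. The sketch below assumes such posts are already available as simple records; the field names, cities and dates are hypothetical.

```python
from datetime import date

# Hypothetical claim: the car was destroyed in a flood in Chennai.
claim = {
    "customer": "c42",
    "event_date": date(2022, 7, 14),
    "event_city": "Chennai",
}

# Hypothetical posts collected from the customer's social media feed.
posts = [
    {"customer": "c42", "date": date(2022, 7, 14), "city": "Pune"},
    {"customer": "c42", "date": date(2022, 7, 10), "city": "Chennai"},
    {"customer": "c77", "date": date(2022, 7, 14), "city": "Delhi"},
]

def conflicting_posts(claim, posts):
    """Posts by the claimant on the event date that place them in another city."""
    return [
        p for p in posts
        if p["customer"] == claim["customer"]
        and p["date"] == claim["event_date"]
        and p["city"] != claim["event_city"]
    ]

conflicts = conflicting_posts(claim, posts)
needs_investigation = bool(conflicts)
print(needs_investigation, conflicts)
```

A conflict is only a trigger for investigation by an agent, not proof of fraud.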
Fraud Detection
• Insurance companies have identified fraudulent claims by using
statistical models.
• Social Networking Analysis (SNA) is an innovative way to identify and
detect fraud.
• An SNA tool uses a mix of analytical methods, including statistical
methods, pattern analysis and link analysis, to identify
relationships or patterns within large amounts of data collected from
different sources.
• When link analysis is used in fraud detection, one looks for clusters of
data and how these clusters are linked to other data clusters.
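Link analysis can be sketched by treating claims as connected whenever they share a phone number or an address, and then grouping connected claims into clusters. The claims, the choice of linking fields and the union-find helper below are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical claims with two attributes that might link them.
claims = [
    {"id": "C1", "phone": "555-0101", "address": "12 Lake Rd"},
    {"id": "C2", "phone": "555-0101", "address": "9 Hill St"},
    {"id": "C3", "phone": "555-0199", "address": "9 Hill St"},
    {"id": "C4", "phone": "555-0142", "address": "3 Park Ave"},
]

def linked_clusters(claims, keys=("phone", "address")):
    """Group claims that are connected through any shared attribute value."""
    # Index claim ids by each (field, value) pair.
    by_value = defaultdict(list)
    for claim in claims:
        for key in keys:
            by_value[(key, claim[key])].append(claim["id"])

    # Union-find: every claim starts as its own cluster.
    parent = {claim["id"]: claim["id"] for claim in claims}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Merge the clusters of claims that share a value.
    for ids in by_value.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    clusters = defaultdict(set)
    for claim_id in parent:
        clusters[find(claim_id)].add(claim_id)
    return sorted((sorted(m) for m in clusters.values()), key=len, reverse=True)

clusters = linked_clusters(claims)
print(clusters)
```

C1 and C3 share nothing directly, yet they land in the same cluster through C2; such indirect links are what an investigator would look at first.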
Fraud detection using SNA method
Social Customer Relationship Management (CRM)

• Social CRM enables effective fraud detection in the insurance sector.
• Social CRM is a process; it is not a platform or technology.
• This makes it critical for insurance companies to link social media sites to
their CRM systems.
• When social media is integrated within an organization, it provides high
transparency on various customer-related issues.
Social CRM Process
• Data is collected from the organization’s existing CRM and from different social
media platforms.
• Reference data obtained from the social media platforms and the data
stored in the CRM are loaded into the claim management system, which
compares and analyses the data and provides results.
• The response received from the claim management system is then
investigated.
Use of Big Data in Retail Industry
• Big Data has huge potential for the retail industry, given the immense
number of transactions and their correlations.
• A single retail location has a small customer database, so it is easy to answer
simple questions like:
(a) How many basic tees did we sell today?
(b) What time of the year do we sell the most leggings?
(c) What else has the customer bought, and what kind of coupons can we send to the
customer?
• However, with millions of transactions spread across multiple locations, it is
nearly impossible to find answers to such questions without Big Data analytics.
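Questions (a) and (b) above are aggregations over transaction records. The sketch below answers them for a handful of hypothetical transactions from two stores; the field names and function names are assumptions. At the scale of millions of records the same grouping logic would be distributed across a cluster instead of run in one loop.

```python
from collections import Counter

# Hypothetical transactions gathered from multiple store locations.
transactions = [
    {"date": "2024-01-05", "store": "S1", "item": "basic tee", "qty": 3},
    {"date": "2024-01-05", "store": "S2", "item": "basic tee", "qty": 2},
    {"date": "2024-01-05", "store": "S1", "item": "leggings", "qty": 1},
    {"date": "2024-11-20", "store": "S2", "item": "leggings", "qty": 4},
    {"date": "2024-11-21", "store": "S1", "item": "leggings", "qty": 5},
]

def units_sold(transactions, item, day):
    """Question (a): how many units of an item were sold on a given day?"""
    return sum(
        t["qty"] for t in transactions
        if t["item"] == item and t["date"] == day
    )

def best_month(transactions, item):
    """Question (b): in which month (MM) does the item sell the most?"""
    per_month = Counter()
    for t in transactions:
        if t["item"] == item:
            per_month[t["date"][5:7]] += t["qty"]
    return per_month.most_common(1)[0][0] if per_month else None

print(units_sold(transactions, "basic tee", "2024-01-05"))
print(best_month(transactions, "leggings"))
```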
Use of Big Data in Detecting Fraudulent Activities in Retail Sector
Retail fraud:
It is an illegal transaction that a fraudster performs using stolen credit
card details or loopholes in the order placement and payment systems
and company policies. As technology has grown, so has the fraudsters’
sophistication in executing fraud online.
Types of Retail fraud:
(a) Transaction fraud
(b) Return fraud
(c) Chargeback guarantee fraud
Types of Retail fraud
• Transaction fraud
It is also called card-not-present (CNP) fraud where the fraudster uses a stolen credit card
for online purchases. The company loses money when the original owner of the card
demands a chargeback.
• Return fraud
Example - e-commerce industry
• Chargeback guarantee fraud
Many online retail fraud prevention solutions guarantee that they will block all fraudulent transactions
and friendly frauds, and even pay the admin fee out of their own pocket. The problem arises
when the company also blocks legitimate customers. This is called a false positive, which not
only damages your reputation but also results in loss of revenue.
Use of Big Data in Detecting Fraudulent Activities
in Retail Sector - Fraud Detection in Real Time
• Big Data helps to detect fraud in real time.
Example:
(a) In an online transaction, a Big Data system compares the incoming IP address with
the geotag received from the customer’s smartphone apps. A valid match between
the two confirms the authenticity of the transaction.
(b) It also examines the entire historical data to track suspicious patterns in the
customer’s orders.
• Retailers perform Big Data analysis in real time to know the actual time at which
a product was delivered.
• Costly products have sensors attached that transmit their location
information, thereby preventing fraud.
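The IP-versus-geotag check in example (a) can be sketched as a distance test between two coordinates. The code assumes the IP address has already been resolved to a latitude and longitude by some geolocation lookup (not shown), and the 50 km tolerance and sample coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def locations_match(ip_location, geotag, max_km=50.0):
    """Treat the transaction as consistent if both locations are close together."""
    return haversine_km(*ip_location, *geotag) <= max_km

# Hypothetical (latitude, longitude) pairs.
ip_location = (17.3850, 78.4867)      # location resolved from the IP address
phone_nearby = (17.4400, 78.3500)     # geotag from the phone, same city
phone_far_away = (17.6868, 83.2185)   # geotag from a different city

print(locations_match(ip_location, phone_nearby))
print(locations_match(ip_location, phone_far_away))
```

A mismatch would normally raise the transaction's risk score rather than block it outright, since blocking a legitimate customer is the false positive described earlier.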
Questions
• Discuss some areas in which decision-making processes are
influenced by social network data.
• List some common types of financial frauds prevalent in the current
business scenario.
• In what ways does analyzing Big Data help organizations prevent
fraud?
• List some methods used for verification of credit cards.
• List the steps that SNA follows to detect fraud.
• What is Social Customer Relationship Management (CRM)?
