
Big Data Unit 1:

Introduction:
Overview of Big Data: Big data is data that arrives in ever-greater variety, in increasing
volumes, and with higher velocity. The term refers to very large collections of data that
keep growing rapidly over time. Big data includes structured, semi-structured, and
unstructured data that is too large and complex to be managed by traditional data
management tools; specialized big data management tools are required to store and
process it.

5V’s of Big Data


• Volume – The amount of data generated
• Velocity - The speed at which data is generated, collected and analysed
• Variety - The different types of structured, semi-structured and unstructured data
• Value - The ability to turn data into useful insights
• Veracity - Trustworthiness in terms of quality and accuracy

Data Analytics: Data analytics is the process of analysing raw data and drawing
conclusions from it. It takes raw data and examines it to uncover patterns and extract
valuable insights. The aim of data analytics is to improve productivity and business gain. It
helps companies better understand their customers, plan strategies accordingly, and
develop products. Descriptive, diagnostic, predictive, and prescriptive analytics are the
four basic types of data analytics.

Sources of Big Data:


• Social Media: Today a large share of the world's population is active on social media
platforms such as Facebook, WhatsApp, Twitter, YouTube, and Instagram. Every activity
on these platforms, such as uploading a photo or video, sending a message, commenting,
or liking a post, creates data.
• Sensors placed in various locations: Sensors placed around a city gather data on
temperature, humidity, and other conditions. Cameras placed beside roads gather
information about traffic conditions and create data. Security cameras placed in
sensitive areas such as airports, railway stations, and shopping malls also create a lot of data.
• Customer Satisfaction Feedback: Customer feedback on the products or services of
various companies, submitted on their websites, creates data. For example, retail
e-commerce sites such as Amazon, Walmart, Flipkart, and Myntra gather customer feedback
on product quality and delivery time. Telecom companies and other service providers also
ask customers about their experience with the service. All of this creates a lot of
data.
• IoT Appliances: Electronic devices connected to the internet, such as smart TVs, smart
washing machines, smart coffee machines, and smart air conditioners, create data as part
of their smart functionality. This is machine-generated data created by sensors embedded
in the devices. For example, a smart printer is connected to the internet, and a number of
such printers connected to a network can exchange data with each other. If someone loads
a file on one printer, the system stores the file content, and another printer in a different
building or on a different floor can print a hard copy of that file. Such data transfer
between printers generates data.
• E-commerce: In e-commerce, business transactions, banking, and the stock market, the
large volume of stored records is one of the major sources of big data. Payments made
through credit cards, debit cards, or other electronic means are all recorded as data.
• Global Positioning System (GPS): GPS in a vehicle helps monitor its movement and find
the shortest path to a destination, cutting fuel and time consumption. This system creates
huge amounts of data on vehicle position and movement.
• Transactional Data: Transactional data, as the name implies, is information obtained
through online and offline transactions at various points of sale. The data contains
important information about transactions, such as the date and time of the
transaction, the location where it took place, the items bought, their prices, the
methods of payment, the discounts or coupons that were applied, and other
pertinent quantitative data. Some sources of transactional data are payment orders,
invoices, e-receipts, and other records.
• Machine Data: Machine data is generated automatically, either in reaction to an event
or according to a set schedule. It is compiled from a variety of sources, including
satellites, desktop computers, mobile phones, industrial machines, smart sensors, SIEM
logs, medical and wearable devices, road cameras, IoT devices, and more. These sources
let businesses monitor consumer behaviour, and the data they produce grows
exponentially as the external market environment shifts. In a broader sense, machine
data also includes data generated by servers, user applications, websites, cloud
programmes, and other sources.
State of the Practice in Analytics: In this new digital world, data is being generated in
enormous amounts, which opens up new paradigms. Because we now have high computing
power as well as large amounts of data, we can use this data for data-driven decision
making. The main benefit of data-driven decisions is that they are based on observed past
trends that have produced good results.
In short, data analytics is the process of manipulating data to extract useful trends and
hidden patterns, which can help us derive valuable insights and make business predictions.

Use of Data Analytics: There are some key domains and strategic planning techniques in
which data analytics has played a very important role:
• Improved Decision-Making – When supporting data is available in favour of a decision,
it can be implemented with a higher probability of success. For example, if a certain
decision or plan has led to better outcomes in the past, there is little doubt about
implementing it again.
• Better Customer Service – Churn modelling is the best example: we try to predict or
identify what leads to customer churn and change those things so that customer
attrition stays as low as possible, which is critical for any organization.
• Efficient Operations – Data analytics helps us understand what a situation demands
and what should be done to get better results, so that we can streamline our processes,
which in turn leads to more efficient operations.
• Effective Marketing – Market segmentation techniques are used to find the marketing
approaches that will increase sales and leads, resulting in more effective marketing
strategies.

The Data Scientist: Data scientists are a new breed of analytical data experts who have
the technical skills to solve complex problems – and the curiosity to explore what problems
need to be solved. They are part mathematician, part computer scientist, and part trend-
spotter.
A data scientist uses data to understand and explain the phenomena around them,
and help organizations make better decisions.
Data scientists are analytical data experts with technical skills to solve complex
problems. They work with several elements related to mathematics, statistics, and
computer science and collect, analyse, and interpret large amounts of data. They are
responsible for providing insights beyond statistical analyses.

Role of data scientist:


• Find patterns and trends in datasets to uncover insights
• Create algorithms and data models to forecast outcomes
• Use machine learning techniques to improve the quality of data or product offerings
• Communicate recommendations to other teams and senior staff
• Deploy data tools such as Python, R, SAS, or SQL in data analysis
• Stay on top of innovations in the data science field
• Identify business challenges and opportunities to improve products/services.
• Collecting a large chunk of data from various sources
• Using programming tools to structure the data, convert it into usable information,
and make strategic or tactical recommendations.
• Apply expertise in data cleaning and handling, quantitative analysis, and data mining.
• See beyond the numbers and understand how users interact with products.
• Collaborate with Product and Engineering teams to solve problems and identify
trends and opportunities.
• Building a blueprint or model of a project from the insights
• Creating data visualizations for stakeholders to understand data better
• Maintaining and analysing the data and gathering insights
• Utilizing machine learning frameworks for numerical computation
• Extending the company’s data with third-party sources of information when needed
• Prediction and goal setting of the product team, designing and evaluating
experiments.
• Monitor key product metrics, and understand the causes of changes in metrics.
• Create and analyse dashboards and reports.
• Creation of key data sets to enhance operational and exploratory analysis.
• Enhancing data collection procedures for building analytic systems
• Creating automated anomaly detection systems and tracking their performance
• Creating data dashboards, graphs, and visualizations

Big Data Analytics in Industry Verticals:


• Banking and Securities: The Securities Exchange Commission (SEC) is using Big Data
to monitor financial market activity. They are currently using network analytics and
natural language processors to catch illegal trading activity in the financial markets.
This industry also relies heavily on Big Data for risk analytics, including anti-money
laundering, demand enterprise risk management, "Know Your Customer," and fraud
mitigation.
• Healthcare Providers: The healthcare sector has access to huge amounts of data but
has been plagued by failures to use the data to curb the rising cost of healthcare
and by inefficient systems that stifle faster and better healthcare benefits across the
board.
• Education: Organizations conducting online educational courses use big data to find
candidates who are interested in those courses. If someone searches for a YouTube
tutorial video on a subject, online and offline course providers for that subject send
online advertisements about their courses to that person.
• Manufacturing and Natural Resources: Manufacturing companies install IoT sensors in
machines to collect operational data. By analysing such data, it can be predicted how
long a machine will run without problems and when it will need repair, so the company
can act before the machine develops serious issues or breaks down completely. In this
way, the cost of replacing the whole machine can be saved.
• Government: In governments, the most significant challenges are the integration
and interoperability of Big Data across different government departments and
affiliated organizations. In public services, Big Data has an extensive range of
applications, including energy exploration, financial market analysis, fraud detection,
health-related research, and environmental protection.
• Insurance: Big data has been used in the industry to provide customer insights for
transparent and simpler products, by analysing and predicting customer behaviour
through data derived from social media, GPS-enabled devices, and CCTV footage.
Big data also helps insurance companies achieve better customer retention.
• Retail and Wholesale Trade: By tracking customers' spending habits and shopping
behaviour, big retail stores provide recommendations to customers. E-commerce sites
like Amazon, Walmart, and Flipkart make product recommendations: they track what
products a customer searches for and, based on that data, recommend similar products
to that customer.
• Transportation: Data about traffic conditions on different roads is collected through
cameras placed beside roads and at city entry and exit points, and through GPS devices
placed in vehicles (Ola, Uber cabs, etc.). All such data are analysed, and jam-free or
less congested, faster routes are recommended. In this way a smart traffic system can
be built in a city through big data analysis. An additional benefit is that fuel
consumption can be reduced.
• Energy Sector: A smart electric meter reads the power consumed every 15 minutes and
sends this reading to the server, where the data is analysed to estimate at what times of
day the power load is lowest across the city. Based on this, manufacturing units or
households are advised to run their heavy machines at night, when the power load is
low, so that they enjoy a lower electricity bill.
• Media and Entertainment Sector: Media and entertainment service providers such as
Netflix, Amazon Prime, and Spotify analyse data collected from their users. Data such as
which videos or music users watch or listen to most and how long users spend on the site
are collected and analysed to set the next business strategy.

Data Analytics Lifecycle Challenges of Conventional Systems:

• Storage: With vast amounts of data generated daily, the greatest challenge is
storage (especially when the data is in different formats) within legacy systems.
Unstructured data cannot be stored in traditional databases.
• Processing: Processing big data refers to the reading, transforming, extraction, and
formatting of useful information from raw information. The input and output of
information in unified formats continue to present difficulties.
• Security: Security is a big concern for organizations. Non-encrypted information is at
risk of theft or damage by cyber-criminals. Therefore, data security professionals
must balance access to data against maintaining strict security protocols.
• Finding and Fixing Data Quality Issues: Many organizations deal with challenges
related to poor data quality, but solutions are available.
• Scaling Big Data Systems: Database sharding, memory caching, moving to the cloud
and separating read-only and write-active databases are all effective scaling
methods. While each one of those approaches is fantastic on its own, combining
them will lead you to the next level.
• Evaluating and Selecting Big Data Technologies: Companies are spending millions on
new big data technologies, and the market for such tools is expanding rapidly. In
recent years, the IT industry has caught on to the potential of big data and analytics,
which makes choosing the right tools harder.
• Big Data Environments: In an extensive data set, data is constantly being ingested
from various sources, making it more dynamic than a data warehouse. The people in
charge of the big data environment can quickly lose track of where each data
collection came from and what it contains.
• Real-Time Insights: The term "real-time analytics" describes the practice of
performing analyses on data as a system is collecting it. Decisions may be made
more efficiently and with more accurate information thanks to real-time analytics
tools, which use logic and mathematics to deliver insights on this data quickly.
• Data Validation: Before using data in a business process, its integrity, accuracy, and
structure must be validated. The output of a data validation procedure can be used
for further analysis, BI, or even to train a machine learning model.
• Healthcare Challenges: Electronic health records (EHRs), genomic sequencing,
medical research, wearables, and medical imaging are just a few examples of the
many sources of health-related big data.

Statistical Concepts:

• Sampling Distributions: A sampling distribution is a probability distribution of a
statistic obtained by choosing random samples of a given population. Also known as a
finite-sample distribution, it represents the distribution of frequencies of how spread
apart various outcomes will be for a specific population.
The sampling distribution depends on multiple factors – the statistic, sample size,
sampling process, and the overall population. It is used to help calculate statistics
such as means, ranges, variances, and standard deviations for a given sample.
Types of Sampling Distribution
1. Sampling distribution of the mean: Calculate the mean of every sample group
chosen from the population and plot all the data points. The graph will show a normal
distribution, and its centre will be the mean of the entire population (see the sketch
after this list).
2. Sampling distribution of proportion: It gives you information about
proportions in a population. We select samples from the population and get the
sample proportion. The mean of all the sample proportions that you calculate from
each sample group would become the proportion of the entire population.
3. T-distribution: T-distribution is used when the sample size is very small or
not much is known about the population. It is used to estimate the mean of the
population, confidence intervals, statistical differences, and linear regression.
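
As a rough illustration of the sampling distribution of the mean, the following Python
sketch draws many random samples from a made-up population and shows that the mean of
the sample means sits close to the population mean. The population of 100,000 values, the
sample size of 30, and the random seed are all illustrative assumptions, not part of the
original notes.

    import random
    import statistics

    random.seed(42)

    # Made-up population of 100,000 values with a known mean (illustrative only)
    population = [random.gauss(50, 10) for _ in range(100_000)]

    # Draw many random samples of size n = 30 and record each sample's mean
    sample_means = []
    for _ in range(2_000):
        sample = random.sample(population, k=30)
        sample_means.append(statistics.mean(sample))

    # The sampling distribution of the mean centres on the population mean
    print("Population mean      :", round(statistics.mean(population), 2))
    print("Mean of sample means :", round(statistics.mean(sample_means), 2))
    print("Std. error (observed):", round(statistics.stdev(sample_means), 2))

The spread of the sample means (the standard error) shrinks as the sample size grows.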

Factors influencing the sampling distribution


o The number observed in a population: The symbol for this variable is "N." It is
the measure of observed activity in a given group of data.
o The number observed in the sample: The symbol for this variable is "n." It is
the measure of observed activity in a random sample of data that is part of
the larger grouping.
o The method of choosing the sample: How you chose the samples can account
for variability in some cases.

• Re-Sampling: Resampling is a series of techniques used in statistics to gather more
information about a sample. This can include retaking a sample or estimating its
accuracy. With these additional techniques, resampling often improves the overall
accuracy and estimates any uncertainty within a population.

Methods: There are four main methods:


o Simple random sampling: Simple random sampling is when every person or
data piece within a population or a group has an equal chance of selection.
You might generate random numbers or have another random selection
process.
o Systematic sampling: Systematic sampling is often still random, but people
might receive numbers or values at the start. The person running the
experiment might then select an interval to divide the group, such as every third
person.
o Stratified sampling: Stratified sampling is when you divide the main
population into several subgroups based on certain qualities. This can mean
collecting samples from groups of different ages, cultures or other
demographics.
o Cluster sampling: Cluster sampling is similar to stratified sampling, as you can
divide populations into separate subgroups. Rather than coordinated groups
with similar qualities, you select these groups randomly, often causing
differences in results.
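
The following short Python sketch illustrates three of the methods above on a small
made-up population of 20 labelled records. The record names, the group split, the interval,
and the sample sizes are arbitrary choices made only for illustration.

    import random

    random.seed(0)

    # Made-up population of 20 labelled records (illustrative only)
    population = [f"person_{i}" for i in range(20)]

    # Simple random sampling: every record has an equal chance of selection
    simple = random.sample(population, k=5)

    # Systematic sampling: take every k-th record after a random start
    step = 4
    start = random.randrange(step)
    systematic = population[start::step]

    # Stratified sampling: divide into subgroups (strata), then sample each one
    strata = {"young": population[:10], "old": population[10:]}
    stratified = [random.choice(group) for group in strata.values()]

    print("Simple random:", simple)
    print("Systematic   :", systematic)
    print("Stratified   :", stratified)

In practice, stratified sampling usually draws a proportional number of records from each
stratum rather than a single record per group as in this sketch.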
Types of Resampling: Two common methods of resampling, both sketched in code below, are:
o K-fold cross-validation: In this method the data is divided into k equal sets, of
which one set is used as the test set for the experiment while all the other sets
are used to train the model.
o Bootstrapping: In bootstrapping, samples are drawn with replacement (i.e.
one observation can be repeated in more than one group), and the remaining
data not used in the samples are used to test the model.
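
A minimal Python sketch of both resampling methods on a toy dataset of 10 numbers; the
data, the choice k = 5, and the random seed are illustrative assumptions only.

    import random

    random.seed(1)
    data = list(range(10))  # a small toy dataset (illustrative only)

    # Bootstrapping: draw a sample of the same size WITH replacement; the
    # observations never drawn ("out-of-bag") can be used to test the model
    bootstrap_sample = [random.choice(data) for _ in data]
    out_of_bag = [x for x in data if x not in bootstrap_sample]
    print("Bootstrap sample:", bootstrap_sample)
    print("Out-of-bag      :", out_of_bag)

    # K-fold cross-validation: split the data into k equal folds; each fold is
    # used once as the test set while the remaining folds train the model
    k = 5
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [x for fold in folds if fold is not test_fold for x in fold]
        print(f"Fold {i}: test = {sorted(test_fold)}, train size = {len(train)}")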

• Statistical Inference: Statistical inference is a method of making decisions about
the parameters of a population based on random sampling. It helps to assess the
relationship between the dependent and independent variables. The purpose of
statistical inference is to estimate the uncertainty, or sample-to-sample variation.

The main types of statistical inference are:


o Estimation: Statistics from a sample are used to estimate population
parameters. The most likely value is called a point estimate. There is always
uncertainty when estimating. The uncertainty is often expressed as
confidence intervals defined by a likely lowest and highest value for the
parameter.
o Hypothesis testing: Hypothesis testing is a method to check whether a claim about a
population is true. More precisely, it checks how likely it is that a hypothesis
is true, based on the sample data.
A statistical hypothesis is an assumption about a population which
may or may not be true. Hypothesis testing is a set of formal procedures used
by statisticians to either accept or reject statistical hypotheses.

Procedure:
1. State the hypotheses - This step involves stating both the null and
alternative hypotheses. The hypotheses should be stated in such a
way that they are mutually exclusive: if one is true, the other must be
false.
2. Formulate an analysis plan - The analysis plan describes how to
use the sample data to evaluate the null hypothesis. The evaluation
process focuses on a single test statistic.
3. Analyse sample data - Find the value of the test statistic (using
properties like the mean score, proportion, t statistic, z-score, etc.) stated
in the analysis plan.
4. Interpret results - Apply the decision rule stated in the analysis plan. If the
value of the test statistic is very unlikely under the null hypothesis,
then reject the null hypothesis.
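
A small worked example of the four-step procedure, written in Python as a one-sample
z-test. The sample of 30 delivery times, the hypothesised mean of 30 minutes, and the
0.05 significance level are all made-up assumptions for illustration.

    from math import sqrt
    from statistics import NormalDist, mean, stdev

    # Made-up sample of 30 delivery times in minutes (illustrative values only)
    sample = [31.2, 29.8, 30.5, 32.1, 28.9, 31.7, 30.2, 29.5, 31.0, 30.8,
              32.4, 29.1, 30.6, 31.3, 30.0, 29.7, 31.9, 30.4, 28.8, 31.5,
              30.9, 29.6, 31.1, 30.3, 29.9, 32.0, 30.7, 29.4, 31.6, 30.1]

    # Step 1: state the hypotheses
    #   H0: population mean = 30 minutes   (null hypothesis)
    #   H1: population mean != 30 minutes  (alternative hypothesis)
    mu_0 = 30.0

    # Steps 2-3: the analysis plan uses a z statistic based on the sample mean
    n = len(sample)
    z = (mean(sample) - mu_0) / (stdev(sample) / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

    # Step 4: interpret - reject H0 if the result is very unlikely under H0
    alpha = 0.05
    print(f"z = {z:.3f}, p-value = {p_value:.4f}")
    print("Reject H0" if p_value < alpha else "Fail to reject H0")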

• Prediction Error: In statistics, prediction error refers to the difference between
the predicted values made by some model and the actual values.
o In regression analysis, it’s a measure of how well the model predicts the
response variable.
o In classification (machine learning), it’s a measure of how well samples are
classified into the correct category.
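
A minimal Python sketch of both notions of prediction error, using made-up actual and
predicted values; mean squared error is only one common choice of regression error
measure.

    # Regression: prediction error measured as the mean squared error (MSE)
    actual    = [3.0, 5.0, 2.5, 7.0, 4.5]  # made-up observed values
    predicted = [2.8, 5.4, 2.0, 6.5, 4.9]  # made-up model predictions
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    print("Mean squared error:", round(mse, 3))

    # Classification: prediction error as the fraction of misclassified samples
    true_labels = ["spam", "ham", "spam", "ham", "spam", "ham"]
    pred_labels = ["spam", "ham", "ham",  "ham", "spam", "spam"]
    errors = sum(t != p for t, p in zip(true_labels, pred_labels))
    print("Classification error rate:", errors / len(true_labels))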

• Regression Modelling: Regression is a method used to determine the relationship
between one or more independent variables and a dependent variable. Regression is
mainly of two types:
o Linear regression: It is used to fit a regression model that explains the
relationship between a numeric response variable and one or more predictor
variables.
o Logistic regression: It is used to fit a regression model that explains the
relationship between a binary response variable and one or more predictor
variables.
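
A small sketch, assuming made-up data, of fitting a simple linear regression by ordinary
least squares in plain Python. Logistic regression has no closed-form solution and is
normally fitted iteratively with a statistics or machine-learning library.

    # Made-up data: simple linear regression fitted by ordinary least squares
    x = [1, 2, 3, 4, 5, 6]        # predictor (e.g. years of experience)
    y = [30, 35, 42, 48, 55, 61]  # numeric response (e.g. salary in thousands)

    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    print(f"Fitted model: y = {intercept:.2f} + {slope:.2f} * x")

    # Use the fitted model to predict the response for a new predictor value
    print("Prediction at x = 7:", round(intercept + slope * 7, 2))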

• Multivariate Analysis: Multivariate analysis is based on the observation and analysis
of more than one statistical outcome variable at a time. It is the statistical study of
data where multiple measurements are made on each experimental unit and where the
relationships among the multivariate measurements and their structure are important.
Multivariate analysis takes a whole host of variables into consideration. This
makes it a complicated but essential tool. The greatest virtue of such a model
is that it takes as many factors into consideration as possible. This results in a
tremendous reduction of bias and gives a result closest to reality.
Multivariate analysis encompasses all statistical techniques that are used to
analyse more than two variables at once. The aim is to find patterns and correlations
between several variables simultaneously, allowing for a much deeper, more
complex understanding of a given scenario.
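
As a very simple example of looking at several variables at once, the following Python
sketch computes the pairwise Pearson correlations between three made-up variables measured
on the same experimental units. The variable names and values are purely illustrative;
fuller multivariate techniques such as principal component analysis or factor analysis
build on this idea.

    from math import sqrt

    def pearson(a, b):
        """Pearson correlation between two equal-length lists of numbers."""
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        return cov / sqrt(var_a * var_b)

    # Three made-up variables measured on the same six experimental units
    age      = [23, 35, 41, 29, 52, 46]
    income   = [28, 45, 60, 38, 75, 66]   # in thousands
    spending = [20, 30, 38, 27, 44, 41]   # in thousands

    variables = {"age": age, "income": income, "spending": spending}
    for name_a, a in variables.items():
        for name_b, b in variables.items():
            if name_a < name_b:  # print each pair of variables once
                print(f"corr({name_a}, {name_b}) = {pearson(a, b):.2f}")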

• Bayesian Modelling: Bayesian modelling is a statistical approach in which probability
expresses a degree of belief in the likelihood of a certain outcome. A Bayesian
approach means that probabilities can be assigned to events that are neither
repeatable nor random.
In traditional statistics it would not make much sense to assign probabilities to such
events, but with a Bayesian approach you can use a prior probability to inform the
outcome and then continually update that probability when new evidence is
received. Bayes' theorem states that P(A|B) = P(B|A) × P(A) / P(B), where:
o P(A|B): How often A occurs given that B happens, also known as the posterior
probability.
o P(B|A): How often B occurs given that A happens, also known as the likelihood.
o P(A): How likely it is for A to occur on its own (the prior probability).
o P(B): How likely it is for B to occur on its own.
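
A small worked example of Bayes' theorem in Python, using made-up probabilities for a
customer-churn scenario; the event names and all the numbers are illustrative assumptions.

    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    # Made-up scenario: A = "customer churns", B = "customer complained"
    p_a = 0.10              # prior: assume 10% of customers churn
    p_b_given_a = 0.60      # likelihood: assume 60% of churners had complained
    p_b_given_not_a = 0.05  # assume 5% of non-churners had complained

    # Total probability of observing a complaint
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

    # Posterior: probability of churn given that the customer complained
    p_a_given_b = p_b_given_a * p_a / p_b
    print(f"P(churn | complaint) = {p_a_given_b:.2f}")  # prints 0.57

Observing the complaint raises the probability of churn from the 10% prior to roughly 57%,
which is exactly the "update the belief with new evidence" idea described above.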
