
PROFESSIONAL SKILLS - FUNDAMENTALS OF DATA ANALYTICS

Unit-I
What Can We Do With Data- Big Data and Data Science- Big Data Architectures-Small Data -
What is Data? -A Short Taxonomy of Data Analytics -Descriptive Statistics: Scale Types- Descriptive
Univariate Analysis : Univariate Frequencies -Univariate Data Visualization - Univariate Statistics -
Common Univariate Probability Distributions -Descriptive Bivariate Analysis

What is Data?
Data is defined as individual facts, such as numbers, words, measurements, observations
or just descriptions of things.
For example, data might include individual prices, weights, addresses, ages, names,
temperatures, dates, or distances.
Data is a set of characters used to collect, store and transmit information for a specific
purpose. Data can be in any form, i.e., text, image, audio, etc. The word data comes from
the Latin word 'datum', which means 'something given'. When data is processed, it is
termed 'information'.

There are two main types of data:


1. Quantitative data - is provided in numerical form, like the weight, volume, or cost of
an item.
2. Qualitative data - is descriptive, but non-numerical, like the name, sex, or eye
colour of a person.

What is Big Data


Big Data refers to a collection of data sets so large and complex that they are difficult
to process using traditional or manual database management tools. The size of such data
grows exponentially and is generally measured in terabytes or more.
For example, over 500 million tweets are generated on Twitter daily; Netflix has over
220 million paid memberships globally; Facebook has over 2 billion daily users. These
figures are very large and keep growing every year, and the data behind them can
therefore be classified as Big Data.
Normally we work with data in the megabyte range (Word documents, Excel sheets) or, at
most, gigabytes (movies, code), but data at the petabyte scale (10^15 bytes) is called
Big Data. It is stated that almost 90% of today's data has been generated in the past
three years.
Examples of Big Data:
The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that more than 500 terabytes of new data are ingested into the databases
of the social media site Facebook every day. This data is mainly generated through photo
and video uploads, message exchanges, comments, etc.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, data generation reaches many petabytes.

Sources of Big Data

These data come from many sources like


o Social networking sites: Facebook, Google, LinkedIn and similar sites generate huge
amounts of data on a day-to-day basis as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of
logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data,
which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data
through their daily transactions.
Advantages of using Big Data
1. Improved business processes
2. Fraud detection
3. Improved customer service
4. Better decision-making
5. Increased productivity
6. Reduced costs
7. Increased revenue
8. Increased agility
9. Greater innovation
10. Faster speed to market

Disadvantages of Big Data


1. Privacy and security concerns
2. Need for technical expertise
3. Need for talent
4. Data quality
5. Need for cultural change
6. Compliance
7. Cybersecurity risks
8. Rapid change
9. Hardware needs
10. Costs
11. Difficulty integrating legacy systems

Types of Big Data

There are three types of Big Data: structured, semi-structured and unstructured data.
1. Structured Data: Any data in a fixed format is known as structured data. It can
only be accessed, stored, or processed in that particular format. This type of data is
stored in the form of tables with rows and columns. Any Excel file or SQL table is an
example of structured data.
2. Unstructured Data: Unstructured data does not have a fixed format or a known
schema. An example of unstructured data is a web page containing text, images,
videos, etc.
3. Semi-structured Data: Semi-structured data is a combination of structured and
unstructured forms of data. It does not contain tables to show relations; instead, it
contains tags or other markers to show hierarchy. JSON files, XML files, and
CSV (comma-separated) files are examples of semi-structured data. The e-mails
we send or receive are also an example of semi-structured data.
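
As a small illustration (a minimal sketch using only Python's standard library; the file
contents and field names are made up), structured data maps directly onto rows and
columns, while semi-structured JSON is navigated through its tags/keys:

import csv
import io
import json

# Structured data: fixed rows and columns, like a table exported from Excel or SQL
csv_text = "id,name,age\n1,Asha,21\n2,Ravi,23\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])               # every row has the same columns

# Semi-structured data: keys describe a hierarchy, and fields may vary per record
json_text = '{"id": 1, "name": "Asha", "contacts": {"email": "asha@example.com"}}'
record = json.loads(json_text)
print(record["contacts"]["email"])   # nested structure, no fixed table schema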

Characteristics of Big Data: The 3 V's

Big data can be described by the following characteristics:


 Volume
 Variety
 Velocity
1. Velocity: Data is being generated at a very fast rate. It is estimated that the
volume of data doubles roughly every two years.
2. Variety: Nowadays, data is not stored only in rows and columns. Data is both
structured and unstructured. Log files and CCTV footage are unstructured data, while
data that can be stored in tables, such as bank transaction data, is structured data.
3. Volume: The amount of data we deal with is of very large size, of the order of petabytes.
Use Cases of Big Data

1. Social Media and Entertainment: You must have noticed streaming apps such as
Netflix recommending shows and movies based on your previous searches and what you
have watched. This is done using Big Data. Netflix and other streaming services create
a custom user profile in which they store user data, including search history, watch
history, the genres watched most, the time of day the user prefers to watch, streaming
time per day, etc., analyze it, and give recommendations accordingly. This provides a
better streaming experience for users.
2. Shopping: Websites like Amazon, Flipkart, etc., also use Big Data to recommend
products based on your previous purchases, search history, and interests. This is done
to maximize their profits and provide a better shopping experience to their customers.
3. Education: Big Data helps in analyzing and monitoring the behavior and activities of
students, such as the time they need to answer a question, the number of questions
skipped, and the difficulty level of the skipped questions, and thus helps students
analyze their overall preparation, weak topics, strong topics, etc.
4. Healthcare: Healthcare providers use Big Data to track and analyze the health and
fitness of patients, the number of visits, the number of skipped appointments per
patient, etc. Mass outbreaks of diseases can be predicted by analyzing this data with
suitable algorithms.
5. Transportation: Traffic can be controlled by collecting and analyzing data from
sensors and cameras installed on roads and highways. Accident-prone areas can be
detected with the help of Big Data analysis, so that the required measures can be taken
to avoid accidents.
Evolution of Big Data
o The earliest record to track and analyze data was not decades back but thousands
of years back when accounting was first introduced in Mesopotamia.
o In the 20th century, IBM developed the first large-scale data project, punch
card systems, which tracked the information of millions of Americans.
o With the emergence of the World Wide Web and supercomputers in the 1990s, the
creation of data on a large scale started to grow at an exponential rate. It was in
the early 1990s when the term 'Big Data' was first used.
o The two main challenges regarding 'Big Data' were storing and processing such a
huge volume of data.
o In 2005, Yahoo created the open-source framework Hadoop, which stores and
processes large data sets.
o The storage solution in Hadoop was named HDFS (Hadoop Distributed File System),
and the processing solution was named MapReduce.
o Later, Hadoop was handed over to the Apache Software Foundation, an
open-source, non-profit organization.
o In 2008, Cloudera became the first company to provide commercial Hadoop
distribution.
o In 2013, the Creators of Apache Spark founded a company, Databricks, which
offers a platform for Big Data and Machine Learning solutions.
o Over the past few years, top Cloud providers such as Microsoft, Google, and
Amazon also started to provide Big Data solutions. These Cloud providers made it
much easier for users and companies to work on Big Data.
Importance of Big Data
1. A better understanding of market conditions.
2. Time and cost saving.
3. Solving advertisers' problems.
4. Offering better market insights.
5. Boosting customer acquisition and retention.

Applications of Big Data


Big Data finds applications in various sectors, such as-
1. Banking and Security
2. Social Media and Entertainment
3. E-commerce websites
4. HealthCare
5. Education
6. Transportation

Types of Big Data Analytics


1. Descriptive Analytics: This type of analytics summarizes incoming data and extracts
insights from it, producing a description of what has happened. For example, insights
drawn for your YouTube channel are based on data such as likes, shares, and views on
your videos.
2. Predictive Analytics: This type of analytics predicts what might happen. Questions
such as 'how' and 'why' reveal particular patterns that help predict future trends.
Machine Learning concepts are also used for such types of analysis. For example,
prediction of weather, prediction of malfunctioning in the parts of an airplane, etc.
3. Prescriptive Analytics: These types of analytics are based on rules and
recommendations and thus prescribe an analytical path. The analysis is generally
based on the question, 'what actions should be taken?' Google's self-driving car is
an example of prescriptive analysis.
4. Diagnostic Analytics: These analytics look into past trends and diagnose questions
such as how and why something happened. It is also called behavioral analytics. This
analysis aims to answer the question, 'why did this happen?' For example, if the
sales report of a company shows a rise in sales, then the company can analyze the
internal and external causes responsible for the increase.

DATA SCIENCE
What is Data Science?

Data science combines math and statistics, specialized programming, advanced analytics,
artificial intelligence (AI), and machine learning with specific subject matter expertise to
uncover actionable insights hidden in an organization’s data. These insights can be used to
guide decision making and strategic planning.

Data science technologies


Data science practitioners work with complex technologies such as:
1. Artificial intelligence: Machine learning models and related software are used for
predictive and prescriptive analysis.
2. Cloud computing: Cloud technologies have given data scientists the flexibility and
processing power required for advanced data analytics.
3. Internet of things: IoT refers to various devices that can automatically connect to the
internet. These devices collect data for data science initiatives. They generate massive
data which can be used for data mining and data extraction.
4. Quantum computing: Quantum computers can perform complex calculations at high speed.
Skilled data scientists use them for building complex quantitative algorithms.

The data science lifecycle


A data science lifecycle is defined as the iterative set of data science steps required to
deliver a project or analysis. There is no one-size-fits-all lifecycle that defines every
data science project, so you need to determine the one that best fits your business
requirements. Each step in the lifecycle should be performed carefully.

The data science lifecycle involves various roles, tools, and processes, which enable
analysts to glean actionable insights. Typically, a data science project undergoes the
following stages:

1. Business Understanding: The complete cycle revolves around the business goal.
What will you solve if you do not have a specific problem? It is extremely important to
understand the business objective clearly, because that is the ultimate aim of the analysis.
2. Data Understanding: After business understanding, the next step is data
understanding. This involves collecting all of the available data. Here you need to work
closely with the business team, as they know what data exists, which data should be
used for the business problem, and other relevant details.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps
such as selecting the relevant data, integrating the data by merging data sets, cleaning
it, treating missing values by either removing or imputing them, removing erroneous
records, and checking for outliers using box plots and handling them. It also involves
constructing new data, deriving new features from existing ones, formatting the data
into the preferred structure, and removing unwanted columns and features. Data
preparation is the most time-consuming yet arguably the most important step in the
complete lifecycle: your model will only be as accurate as your data.
4. Exploratory Data Analysis: This step involves getting some idea about the solution
and the factors affecting it before building the actual model. The distribution of data
within individual variables is explored graphically using bar graphs, and relations between
different features are captured through graphical representations such as scatter plots
and heat maps. Many data visualization techniques are used extensively to explore every
feature individually and in combination with other features.
5. Data Modeling: Data modeling is the heart of data analysis. A model takes the
prepared data as input and gives the desired output. This step involves choosing the
appropriate type of model, depending on whether the problem is a classification,
regression, or clustering problem. After choosing the model family, we carefully select
and implement the specific algorithms within that family.
6. Model Evaluation: Here the model is evaluated to check whether it is ready to be
deployed. The model is tested on unseen data and evaluated against a carefully chosen
set of evaluation metrics. We also need to make sure that the model conforms to
reality. If we do not obtain a satisfactory result in the evaluation, we have to iterate
over the whole modelling process until the desired level of the metrics is achieved.
7. Model Deployment: After rigorous evaluation, the model is finally deployed in the
desired format and channel. This is the last step in the data science lifecycle.
Each step in the data science lifecycle described above must be worked on carefully;
if any step is performed improperly, it affects the following steps and the quality of
the final outcome.
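
As an illustrative example of the modeling and evaluation steps (a minimal sketch using
scikit-learn on its bundled Iris data set; the choice of model and metric is only for
illustration), the flow of splitting data, fitting a model, and evaluating it on unseen
data looks roughly like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Prepared data: features X and target y
X, y = load_iris(return_X_y=True)

# Hold out unseen data for model evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Data modeling: choose and fit a model (a classification problem here)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation: test on unseen data with a chosen metric
predictions = model.predict(X_test)
print("Accuracy on unseen data:", accuracy_score(y_test, predictions))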
What is Big Data Architecture?
It is a comprehensive system of processing a vast amount of data. The Big data
architecture framework lays out the blueprint of providing solutions and infrastructures
to handle big data depending on an organization’s needs. It clearly defines the
architecture components of big data analytics, layers to be used, and the flow of
information. The reference points are ingesting, processing, storing, managing, accessing,
and analyzing the data. A typical big data architecture framework has the following layers:

Big Data Sources Layer


The architecture of big data heavily depends on the type of data and its sources. The
data sources are both open and third-party data providers. Several data sources range
from relational database management systems, data warehouses, cloud-based data
warehouses, SaaS applications, real-time data from the company servers and sensors such
as IoT devices, third-party data providers, and also static files such as Windows logs. The
data can be handled through both batch processing and real-time processing (more on this below).
Storage Layer
The storage layer is the second layer in the big data architecture; it receives the data
from the sources. It provides the infrastructure for storing past data, converts it into
the required format, and stores it in that format. For instance, structured data may be
stored in an RDBMS, while the Hadoop Distributed File System (HDFS) is used to store
batch processing data. Typically, the information is stored in a data lake according to the
system's requirements.
Batch processing and real-time processing Layer
The architecture of big data needs to incorporate both types of data: batch (static) data
and real-time or streaming data.
 Batch processing: Batch processing is needed to manage large volumes of static data
(in gigabytes and beyond) efficiently. In batch processing, the data is filtered,
aggregated, processed, and prepared for analysis. Batches are long-running jobs: the
batch process reads the data from the storage layer, processes it, and writes the
outputs into new files. A common solution for batch processing is Hadoop (a sketch of
such a job follows this list).
 Real-Time Processing: Real-time processing is required for capturing, storing, and
processing data on a real-time basis. It first ingests the data, typically through a
publish-subscribe kind of messaging tool.
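
For a flavour of what a batch job looks like (a minimal sketch assuming PySpark is
available; the HDFS paths, column names, and filter condition are hypothetical), the
read-process-write pattern described above could be expressed as:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch_example").getOrCreate()

# Read a batch of static data from the storage layer (path is hypothetical)
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Filter, aggregate, and prepare the data for analysis
daily_totals = (orders
                .filter(F.col("status") == "COMPLETED")
                .groupBy("order_date")
                .agg(F.sum("amount").alias("total_amount")))

# Write the processed output to new files in the analytical data store
daily_totals.write.mode("overwrite").parquet("hdfs:///analytics/daily_totals")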

Stream processing
Stream processing differs from real-time message ingestion. Stream processing handles the
incoming streaming data in windows over the stream and writes the processed data to the
output area. The tools used here are Apache Spark, Apache Flink, and Storm.

Analytical datastore
The analytical data store is like a one-stop place for the prepared data. This data is
either presented in an interactive database that offers the metadata or in a NoSQL data
warehouse. The prepared data set can then be queried and used for analysis with tools
such as Hive, Spark SQL, and HBase.

Analytics Layer
The analysis layer interacts with the storage layer to extract valuable insights and
business intelligence. The architecture needs a mix of multiple data tools for handling the
unstructured data and performing the analysis.

Consumption or Business Intelligence (BI) Layer


This layer is the output of the big data analytics architecture. It receives the final
analysis from the analytics layer and presents and replicates it to the relevant output
system. The results acquired are used for making decisions and for visualization. It is also
referred to as the business intelligence (BI) layer.

SMALL DATA

Small Data: Small data can be defined as small datasets that are capable of influencing
current decisions. It covers almost anything currently ongoing whose data can be
accumulated in an Excel file. Small data is also helpful in making decisions, but it does
not aim to impact the business to a great extent; rather, it is useful over a short span
of time.

In a nutshell, data that is simple enough for human understanding, in a volume and
structure that makes it accessible, concise, and workable, is known as small data.
Big Data: Big data can be described as large chunks of structured and unstructured data.
The amount of data stored is immense, so analysts must dig through all of it thoroughly
to make it relevant and useful for proper business decisions.
In short, datasets that are so huge and complex that conventional data processing
techniques cannot manage them are known as big data.

Feature | Small Data | Big Data
Variety | Data is typically structured and uniform | Data is often unstructured and heterogeneous
Veracity | Data is generally high quality and reliable | Data quality and reliability can vary widely
Processing | Data can often be processed on a single machine or in-memory | Data requires distributed processing frameworks such as MapReduce or Spark
Technology | Traditional | Modern
Analytics | Traditional statistical techniques can be used to analyze the data | Advanced analytics techniques such as machine learning are often required
Collection | Generally obtained in an organized manner and then inserted into the database | Collection is done using pipelines with queues such as AWS Kinesis or Google Pub/Sub to handle high-speed data
Volume | Data in the range of tens or hundreds of gigabytes | Size of data is more than terabytes
Analysis Areas | Data marts (analysts) | Clusters (data scientists), data marts (analysts)
Quality | Contains less noise, as the data is collected in a controlled manner | Usually, the quality of data is not guaranteed
Processing Pipelines | Requires batch-oriented processing pipelines | Has both batch and stream processing pipelines
Database | SQL | NoSQL
Velocity | A regulated and constant flow of data; data aggregation is slow | Data arrives at extremely high speeds; large volumes of data are aggregated in a short time
Structure | Structured data in tabular format with a fixed schema (relational) | A wide variety of data, including tabular data, text, audio, images, video, logs, JSON, etc. (non-relational)
Scalability | Usually vertically scaled | Mostly based on horizontally scalable architectures, which give more versatility at a lower cost
Query Language | SQL only | Python, R, Java, SQL
Hardware | A single server is sufficient | Requires more than one server
Value | Business intelligence, analysis and reporting | Complex data mining techniques for pattern finding, recommendation, prediction, etc.
Optimization | Data can be optimized manually (human-powered) | Requires machine learning techniques for data optimization
Storage | Storage within enterprises, local servers, etc. | Usually requires distributed storage systems on the cloud or in external file systems
People | Data analysts, database administrators and data engineers | Data scientists, data analysts, database administrators and data engineers
Security | Security practices include user privileges, data encryption, hashing, etc. | Securing Big Data systems is much more complicated; best practices include data encryption, cluster network isolation, strong access control protocols, etc.
Nomenclature | Database, data warehouse, data mart | Data lake
Infrastructure | Predictable resource allocation, mostly vertically scalable hardware | More agile infrastructure with horizontally scalable hardware
Applications | Small-scale applications, such as personal or small business data management | Large-scale applications, such as enterprise-level data management, the Internet of Things (IoT), and social media analysis

A SHORT TAXONOMY OF DATA ANALYTICS


There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Data Analytics and its Types

Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to
determine the probable outcome of an event or the likelihood of a situation occurring.
Predictive analytics draws on a variety of statistical techniques from modeling, machine
learning, data mining, and game theory that analyze current and historical facts to make
predictions about future events. Techniques that are used for predictive analytics are:
 Linear Regression
 Time Series Analysis and Forecasting
 Data Mining
Basic Cornerstones of Predictive Analytics
 Predictive modeling
 Decision Analysis and optimization
 Transaction profiling
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to
approach future events. It looks at past performance and understands it by mining
historical data to find the causes of success or failure in the past.
Almost all management reporting, such as sales, marketing, operations, and finance, uses
this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to
classify customers or prospects into groups. Unlike a predictive model that focuses on
predicting the behavior of a single customer, descriptive analytics identifies many
different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic
reviews, such as:
 Data Queries
 Reports
 Descriptive Statistics
 Data dashboard
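
To make this concrete (a minimal sketch using pandas; the column names and revenue
figures are made up), a simple descriptive report over historical sales data could look
like this:

import pandas as pd

# Hypothetical historical sales records
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "revenue": [1200, 950, 1430, 800, 1100, 980],
})

# Descriptive statistics over past performance
print(sales["revenue"].describe())

# A simple historic review: total revenue by region
print(sales.groupby("region")["revenue"].sum())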

Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business
rules, and machine learning to make a prediction and then suggests decision options to
take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions
that benefit from the predictions and showing the decision maker the implications of each
decision option. Prescriptive analytics not only anticipates what will happen and when it
will happen but also why it will happen. Further, prescriptive analytics can suggest
decision options on how to take advantage of a future opportunity or mitigate a future
risk, and illustrate the implications of each decision option.

For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors
such as economic data, population demography, etc.

Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or solve a
problem. We try to find dependencies and patterns in the historical data of the
particular problem.
For example, companies go for this analysis because it gives great insight into a problem
and keeps detailed information at their disposal; otherwise, data collection would have to
be repeated for every individual problem, which is very time-consuming.
Common techniques used for Diagnostic Analytics are:
 Data discovery
 Data mining
 Correlations
Descriptive statistics help you to understand the data, but before we understand what
data is, we should know the different data types used in descriptive statistical analysis.
The following sections give an overview of them.

Types of Data
A data set is a grouping of information that is related to each other. A data set can be
either qualitative or quantitative. A qualitative data set consists of words that can be
observed, not measured. A quantitative data set consists of numbers that can be directly
measured. Months in a year would be an example of qualitative, while the weight of persons
would be an example of quantitative data.
Now, suppose you go to KFC with your friends to eat some burgers. You place the order at
the coupon counter and, after receiving the food, everyone eats what you ordered on their
behalf. If you ask the others about the taste, the ratings will vary from one person to
another, but if you ask how many burgers were ordered, everyone will arrive at the same
definite count. Here, the taste ratings represent categorical data and the number of
burgers is numerical data.
Types of Categorical Data:
1. Nominal Data: When there is no natural order between categories then data
is nominal type.
Example: Eye colour, Gender (male, female), Blood type, Political party,
Zip code, Type of living accommodation (house, apartment, trailer, other),
Religious preference (Hindu, Buddhist, Muslim, Jewish, Christian, other), etc.
2. Ordinal Data: When there is natural order between categories then data is
ordinal type. But here, the difference between the values in order does not
matter.
Example: Exam grades, Socio-economic status (poor, middle class, rich),
Education level (kindergarten, primary, secondary, higher secondary,
graduation), Satisfaction rating (extremely dislike, dislike, neutral, like,
extremely like), Time of day (dawn, morning, noon, afternoon, evening, night),
Level of agreement (yes, maybe, no), the Likert scale (strongly disagree,
disagree, neutral, agree, strongly agree), etc.

Types of Numerical Data:


1. Discrete Data: The data is said to be discrete if the measurements are
integers. It represents count or an item that can be counted.
Example: Number of people in a family, the number of kids in class, the
number of cricket players in a team, the number of cricket playing nations in
the world.
Discrete data is a special kind of data because each value is separate and
distinct. For any data, ask the two questions below: if it can be counted but
cannot be divided into smaller and smaller parts, it is discrete.
1. Can you count it?
2. Can it be divided into smaller and smaller parts?
2. Continuous Data: The data is said to be continuous if the measurements can
take any value, usually within some range. It is a scale of measurement that
can consist of numbers other than whole numbers, such as decimals and fractions.
Example: height, weight, length, temperature
Continuous data usually requires a tool, like a ruler, measuring tape, scale, or
thermometer, to produce the values in a continuous data set.
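
As a small illustration (a minimal sketch using pandas; the column names and values are
made up), nominal, ordinal, discrete, and continuous variables can be represented like this:

import pandas as pd

df = pd.DataFrame({
    "eye_colour": ["brown", "blue", "green"],     # nominal: no natural order
    "grade": ["B", "A", "C"],                     # ordinal: ordered categories
    "num_children": [2, 0, 3],                    # discrete: countable integers
    "height_cm": [171.5, 180.2, 165.0],           # continuous: any value in a range
})

# Mark the ordinal column with an explicit category order
df["grade"] = pd.Categorical(df["grade"], categories=["C", "B", "A"], ordered=True)

print(df.dtypes)
print(df["grade"].min())   # ordering makes comparisons like min/max meaningful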

Scales of Measurement:
Data can be classified as being on one of four scales: nominal, ordinal, interval or
ratio. Each level of measurement has some important properties that are useful to know.
1. Nominal Scale: The nominal data type defined above falls into this
category. Nominal values have no numeric meaning, so they can neither be
added, subtracted, multiplied, nor divided. They also have no order; if they
appear to have an order, then you probably have ordinal variables instead.
2. Ordinal Scale: Ordinal datatype defined above can be placed into this
category. The ordinal scale contains things that you can place in order. For
example, hottest to coldest, lightest to heaviest, richest to poorest. So, if
you can rank data by 1st, 2nd, 3rd place (and so on), then you have data that
is on an ordinal scale.
3. Interval Scale: An interval scale has ordered numbers with meaningful
divisions. Temperature is on the interval scale: a difference of 10 degrees
between 90 and 100 means the same as 10 degrees between 150 and 160.
Compare that to Olympic running race (which is ordinal), where the time
difference between the winner and runner up might be 0.01 second and
between second-last and last 0.5 seconds. If you have meaningful divisions,
you have something on the interval scale.
4. Ratio Scale: The ratio scale has all the properties of the interval scale
with one major difference: zero is meaningful. When a value on the scale is
0.0, there is none of that quantity. For example, a height of zero is
meaningful (it means no height at all), and a temperature of 0.0 Kelvin really
does mean "no heat". Compare that to a temperature of zero degrees Celsius,
which, while it exists, doesn't mean "no temperature" in particular (it is
simply the freezing point of water).

Univariate, Bivariate and Multivariate data and its analysis


1. Univariate data –
This type of data consists of only one variable. The analysis of univariate data is thus
the simplest form of analysis since the information deals with only one quantity that
changes. It does not deal with causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist within it. The example of a
univariate data can be height.

Suppose that the heights of seven students in a class are recorded (figure 1); there is
only one variable, height, and it does not deal with any cause or relationship. The
patterns found in this type of data can be described by drawing conclusions using
measures of central tendency (mean, median and mode), dispersion or spread of data
(range, minimum, maximum, quartiles, variance and standard deviation), and by using
frequency distribution tables, histograms, pie charts, frequency polygons and bar
charts.
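
As a quick illustration of these univariate summaries (a minimal sketch using Python's
statistics module; the seven heights are made-up values):

import statistics

heights = [150, 152, 155, 155, 160, 162, 165]   # heights (cm) of seven students

# Measures of central tendency
print("mean:", statistics.mean(heights))
print("median:", statistics.median(heights))
print("mode:", statistics.mode(heights))

# Measures of dispersion
print("range:", max(heights) - min(heights))
print("variance:", statistics.variance(heights))
print("std dev:", statistics.stdev(heights))
print("quartiles:", statistics.quantiles(heights, n=4))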
2. Bivariate data
This type of data involves two different variables. The analysis of this type of data
deals with causes and relationships and the analysis is done to find out the relationship
among the two variables. Example of bivariate data can be temperature and ice cream
sales in summer season.

Suppose temperature and ice cream sales are the two variables of a bivariate data set
(figure 2). Here, the relationship is visible from the table: temperature and sales are
directly proportional to each other and thus related, because as the temperature
increases, the sales also increase. Thus bivariate data analysis involves comparisons,
relationships, causes and explanations. These variables are often plotted on the X and Y
axes of a graph for a better understanding of the data, with one variable independent
and the other dependent.
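
For example (a minimal sketch using pandas, with matplotlib assumed available for the
plot; the temperature and sales figures are made up), the strength of such a relationship
can be checked with a correlation coefficient and a scatter plot:

import pandas as pd

data = pd.DataFrame({
    "temperature_c": [20, 25, 28, 30, 33, 35],
    "ice_cream_sales": [120, 180, 220, 260, 310, 350],
})

# Pearson correlation: close to +1 indicates a strong direct relationship
print(data["temperature_c"].corr(data["ice_cream_sales"]))

# Scatter plot of the dependent variable against the independent variable
data.plot.scatter(x="temperature_c", y="ice_cream_sales")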

3. Multivariate data
When the data involves three or more variables, it is categorized as multivariate. An
example of this type of data: suppose an advertiser wants to compare the popularity of
four advertisements on a website; the click rates could then be measured for both men
and women, and the relationships between the variables examined. It is similar to
bivariate analysis but contains more than one dependent variable. The way to analyse
such data depends on the goals to be achieved. Some of the techniques are regression
analysis, path analysis, factor analysis and multivariate analysis of variance
(MANOVA), as sketched below.
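
As one concrete technique from that list (a minimal regression sketch using scikit-learn;
the advertising data is entirely made up), a model with several explanatory variables
looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: ad slot shown (0-3) and viewer gender (0 = male, 1 = female)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [3, 0], [3, 1]])
# Click rate observed for each combination
y = np.array([0.05, 0.07, 0.09, 0.08, 0.04, 0.06, 0.11, 0.12])

# Regression with more than one explanatory variable
model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)
print("predicted click rate for ad 2 shown to women:", model.predict([[2, 1]]))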
There are lots of different tools, techniques and methods that can be used to conduct
your analysis, including software libraries, visualization tools and statistical testing
methods. The table below compares univariate, bivariate and multivariate analysis.
Univariate | Bivariate | Multivariate
It summarizes only one variable at a time. | It summarizes only two variables. | It summarizes more than two variables.
It does not deal with causes and relationships. | It deals with causes and relationships, and analysis is done accordingly. | It deals with causes and relationships among several variables, and analysis is done accordingly.
It does not contain any dependent variable. | It contains only one dependent variable. | It is similar to bivariate but contains more than two variables.
The main purpose is to describe. | The main purpose is to explain. | The main purpose is to study the relationships among the variables.
Example: height. | Example: temperature and ice cream sales in summer. | Example: an advertiser comparing the popularity of four advertisements on a website, where click rates are measured for both men and women and relationships between the variables are examined.

Common Univariate Probability Distributions


What is Probability Distribution?
A probability distribution is a mathematical function that defines the likelihood of
different outcomes or values of a variable. This function is commonly represented by a
graph or probability table, and it provides the probabilities of various possible results
of an experiment or random phenomenon based on the sample space and the
probabilities of events. Probability distributions are fundamental in probability theory
and statistics for analyzing data and making predictions.
Example of Probability Distribution
Suppose you are a teacher at a university. After checking assignments for a week, you
graded all the students. You gave these graded papers to a data entry guy in the
university and told him to create a spreadsheet containing the grades of all the
students. But the guy only stores the grades and not the corresponding students.
He made another blunder; he missed a few entries in a hurry, and we have no idea
whose grades are missing. One way to find this out is by visualizing the grades and
seeing if you can find a trend in the data.

The graph you plotted is called the frequency distribution of the data. You see that
there is a smooth curve-like structure that defines our data, but do you notice an
anomaly? There is an abnormally low frequency at a particular score range. So the best
guess is that the missing grades lie in that range, which is what creates the dent in
the distribution.
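
The frequency distribution in this example could be plotted along these lines (a minimal
sketch assuming matplotlib is installed; the grades are randomly generated rather than
real student marks):

import random
import matplotlib.pyplot as plt

# Simulated grades for 200 students (values and distribution are made up)
random.seed(0)
grades = [min(100, max(0, int(random.gauss(70, 12)))) for _ in range(200)]

# Frequency distribution: how many students fall in each score range
plt.hist(grades, bins=range(0, 105, 5), edgecolor="black")
plt.xlabel("Score range")
plt.ylabel("Number of students")
plt.title("Frequency distribution of grades")
plt.show()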

Types of Distributions
Here is a list of distributions types
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal or Gaussian Distribution
5. Exponential Distribution
6. Poisson Distribution

Bernoulli Distribution
Let’s start with the easiest distribution, which is Bernoulli Distribution. It is actually
easier to understand than it sounds!
All you cricket junkies out there! At the beginning of any cricket match, how do you
decide who will bat or bowl? A toss! It all depends on whether you win or lose the toss,
right? Let's say if the toss results in a head, you win; else, you lose. There's no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0
(failure), and a single trial. So the random variable X with a Bernoulli distribution can
take the value 1 with the probability of success, say p, and the value 0 with the
probability of failure, say q or 1 - p.
Here, the occurrence of a head denotes success, and the occurrence of a tail denotes
failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only
two possible outcomes.
The probability mass function is given by: P(X = x) = p^x (1 - p)^(1 - x), where x ∈ {0, 1}.
It can also be written as:
P(X = 1) = p and P(X = 0) = 1 - p.
The probabilities of success and failure need not be equally likely, like the result of a
fight between the Undertaker and me. He is pretty much certain to win. So, in this case,
the probability of my success is 0.15, while that of my failure is 0.85.
Bernoulli Distribution Example
Here, the probability of success (p) is not the same as the probability of failure, as
the distribution of our fight shows.

Here, the probability of success = 0.15, and the probability of failure = 0.85. The
expected value is exactly what it sounds like: if I punch you, I may expect you to punch
me back. Basically, the expected value of any distribution is the mean of the
distribution. The expected value of a random variable X from a Bernoulli distribution is
found as follows:
E(X) = 1*p + 0*(1-p) = p
The variance of a random variable from a Bernoulli distribution is:
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
There are many examples of the Bernoulli distribution, such as whether it will rain
tomorrow or not, where rain denotes success and no rain denotes failure, or winning
(success) versus losing (failure) a game.
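
A quick numerical check of these formulas (a minimal sketch using scipy.stats; p = 0.15
is the made-up fight example above):

from scipy.stats import bernoulli

p = 0.15                                # probability of success

dist = bernoulli(p)
print("P(X = 1):", dist.pmf(1))         # p
print("P(X = 0):", dist.pmf(0))         # 1 - p
print("E(X):", dist.mean())             # equals p
print("V(X):", dist.var())              # equals p * (1 - p)

# Simulate a few Bernoulli trials (e.g. toss outcomes)
print(dist.rvs(size=10, random_state=42))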
Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these
outcomes are equally likely, which is the basis of a uniform distribution. Unlike
Bernoulli Distribution, all the n number of possible outcomes of a uniform distribution
are equally likely.
A variable X is said to be uniformly distributed over the interval [a, b] if its density
function is:
f(x) = 1 / (b - a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
The graph of a uniform distribution curve is flat over the interval [a, b]. You can see
that the shape of the uniform distribution curve is rectangular, which is why the uniform
distribution is also called the rectangular distribution.
For a uniform distribution, a and b are the parameters.
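
For instance (a minimal sketch using scipy.stats; the interval [a, b] = [0, 6] is chosen
arbitrarily), the density and a few samples of a continuous uniform distribution can be
computed as:

from scipy.stats import uniform

a, b = 0, 6                             # parameters of the distribution
dist = uniform(loc=a, scale=b - a)      # uniform on [a, b]

print("density at any point in [a, b]:", dist.pdf(3))   # 1 / (b - a)
print("mean:", dist.mean())                             # (a + b) / 2
print("variance:", dist.var())                          # (b - a)^2 / 12
print(dist.rvs(size=5, random_state=0))                 # random samples from [a, b]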
