
Z-Score Normalization: Definition & Examples

Data normalization is a crucial element of data analysis. It’s what allows analysts to compile
and compare numbers of different sizes, from various data sources. And yet, normalization is
little understood and little used.

The reason normalization goes under-appreciated is probably linked to confusion surrounding what it is. There are easy normalization techniques, such as removing decimal places, and there are advanced normalization techniques, such as z-score normalization. Common normalization techniques include:

• Decimal place normalization
• Data type normalization
• Formatting normalization (date abbreviations, date order, and delimiters)
• Z-score normalization
• Linear normalization (or “Max-Min,” and how to normalize to 100)
• Clipping normalization
• Standard deviation normalization

Z-score normalization refers to the process of normalizing every value in a dataset such that
the mean of all of the values is 0 and the standard deviation is 1.

We use the following formula to perform a z-score normalization on every value in a dataset:

New value = (x – μ) / σ

where:

• x: Original value
• μ: Mean of the data
• σ: Standard deviation of the data
Example: Performing Z-Score Normalization
Suppose we have the following dataset:
Using a calculator, we can find that the mean of the dataset is 21.2 and the standard deviation
is 29.8.

To perform a z-score normalization on the first value in the dataset, we can use the following
formula:

New value = (x – μ) / σ
New value = (3 – 21.2) / 29.8
New value = -0.61
We can apply the same formula to perform a z-score normalization on every value in the dataset.
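As a minimal illustration in Python (the small dataset below is hypothetical; only its first value, 3, matches the example above):

import statistics

# Hypothetical dataset; only the first value (3) appears in the example above.
data = [3, 8, 13, 21, 61]

mu = statistics.mean(data)       # mean of the dataset
sigma = statistics.pstdev(data)  # population standard deviation

# z-score normalization: (x - mu) / sigma for every value
z_scores = [(x - mu) / sigma for x in data]
print(z_scores)  # the normalized values have mean 0 and standard deviation 1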
Advanced ranking techniques
1. Elo ranking
The Elo ranking system is a method used for ranking players or teams in competitive games and sports. It was originally developed by Arpad Elo for ranking chess players but has since been adapted for various other games and sports, including esports, soccer, basketball, and tennis.
The Elo rating system assigns a numerical rating to each player or team, representing their
skill level or performance. The basic principle behind Elo Rankings is that players' ratings
change after each match based on the outcome and the relative ratings of the opponents.
Here's how the Elo Rankings system typically works:
1. Initial Ratings: Each player or team starts with an initial rating, usually a default value
determined by the organizer or based on their previous performance.
2. Expected Score: Before a match, the Elo system calculates the expected score for each
player based on their ratings. The higher-rated player is expected to win, and the lower-
rated player is expected to lose. The expected score is calculated using a logistic
function.
3. Actual Score: After the match, the actual outcome is compared to the expected outcome.
If the player performs better than expected (e.g., wins against a higher-rated opponent),
their rating increases. Conversely, if the player performs worse than expected (e.g., loses
against a lower-rated opponent), their rating decreases.
4. Rating Adjustment: The amount by which a player's rating changes after a match
depends on the difference between the actual score and the expected score, as well as a
parameter known as the K-factor, which determines the sensitivity of the rating changes
to individual match results. A higher K-factor means that ratings change more
dramatically with each match, while a lower K-factor means that ratings change more
gradually.
5. Iterative Process: The process of updating ratings after each match is iterative, meaning
that ratings continue to adjust over time as players compete against each other and their
performance fluctuates.
Overall, Elo Rankings provide a dynamic and objective method for assessing the relative skill
levels of players or teams in competitive games and sports, with ratings evolving based on
actual match results.
The difference in the ratings between two players serves as a predictor of the outcome of a match. If players A and B have ratings Rᴬ and Rᴮ, then their expected scores are given by:

Eᴬ = 1 / (1 + 10^((Rᴮ − Rᴬ)/400))
Eᴮ = 1 / (1 + 10^((Rᴬ − Rᴮ)/400))

A player’s expected score = their probability of winning + half their probability of drawing.
If two players have equal ratings (Rᴬ = Rᴮ), then the expected scores of A and B evaluate to
1/2 each. That makes sense — if both players are equally good, then both are expected to
score an equal number of wins.
Sometimes a player's actual tournament score differs from their expected score, and their rating then needs to be adjusted upwards or downwards. Elo's original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player over-performed or under-performed. The maximum possible adjustment per game, called the K-factor, was set at K = 16 for masters and K = 32 for weaker players. If Player A was expected to score Eᴬ points but actually scored Sᴬ points, the player's rating is updated using the formula:

New Rᴬ = Rᴬ + K (Sᴬ − Eᴬ)


Let's consider an example:
• Anand has a rating of 2600
• Boris has a rating of 2300
Their expected scores are therefore:
• Anand: 1 / (1 + 10^((2300 − 2600)/400)) = 0.849
• Boris: 1 / (1 + 10^((2600 − 2300)/400)) = 0.151
If the organizers set K = 16 and Anand wins, then the new ratings would be:
• Anand = 2600 + 16 × (1 − 0.849) ≈ 2602
• Boris = 2300 + 16 × (0 − 0.151) ≈ 2298
If the organizers set K = 16 and Boris wins, then the new ratings would be:
• Anand = 2600 + 16 × (0 − 0.849) ≈ 2586
• Boris = 2300 + 16 × (1 − 0.151) ≈ 2314
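A minimal Python sketch of these two steps (the expected score and the rating update); the function names here are illustrative, not from any particular library:

def expected_score(r_a, r_b):
    # Expected score of the player rated r_a against the player rated r_b
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_rating(rating, expected, actual, k=16):
    # Linear adjustment proportional to over- or under-performance
    return rating + k * (actual - expected)

e_anand = expected_score(2600, 2300)   # ~0.849
e_boris = expected_score(2300, 2600)   # ~0.151

# If Anand wins (Anand scores 1, Boris scores 0):
print(round(update_rating(2600, e_anand, 1)))  # ~2602
print(round(update_rating(2300, e_boris, 0)))  # ~2298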
A couple of cool points:
• An Elo rating is only valid within the rating pool where it was established. For example, consider a person with an Elo rating of 2150 in the All India Chess Federation and another person with an Elo rating of 2080 in the US Chess Federation; given only these two ratings and no other information, one cannot determine who is better.
• In a pure Elo system, each game ends in an equal transaction of rating points. If the winner gains 15 rating points, the loser will drop by 15 rating points. However, because players tend to enter the system as novices with a low rating and retire from the system as experienced players with a high rating, the system faces rating deflation. (A way to combat this is to use a K-factor that decreases with experience.)
• In the long run, the Elo rating system is self-correcting. If a player's rating is too high, they will perform worse than the rating system predicts, and if a player's rating is too low, they will perform better than it predicts. Thus, their rating eventually settles to the correct value.
Where else is Elo used?
Chess Boxing
I'm not even kidding: chessboxing is a real thing. As the name implies, you play chess, then you box, then you play chess, then you box, and this keeps going until you either checkmate your opponent or knock them out in the boxing ring. The current minimum requirements to fight in a Chess Boxing Global event include an Elo rating of 1600 and a record of at least 50 amateur bouts fought in boxing or another similar martial art.
Board Games
National Scrabble organizations compute normally distributed Elo ratings, except in the United Kingdom, where a different system is used. The popular First Internet Backgammon Server (FIBS) calculates ratings based on a modified Elo system, and the UK Backgammon Federation uses the FIBS formula for its UK national ratings. The European Go Federation also adopted an Elo-based rating system initially pioneered by the Czech Go Federation.
Card Games
Video Games
PUBG is one of the few games that uses the original Elo system. Winning increases the rating
and losing decreases it. The change in the ratings isn’t abrupt, so losing one game isn’t a
determining factor. PUBG has separate ranking systems for each game mode.
FaceMash
FaceMash was Facebook’s predecessor, and was developed by Mark Zuckerberg in his
second year at Harvard. If you’ve seen The Social Network, you might remember this scene
when Zuckerberg asks Eduardo Saverin for the equation used to rank chess players, except
this time it was for rating the ‘attractiveness’ of female Harvard students.
The movie claims that the Elo equations (although written incorrectly in the above scene)
were used in the algorithm in the original FaceMash website. Two students were shown side-
by-side and users could vote on who was more attractive. In this scene, Rᴬ can be interpreted
as student A’s Elo rating, Rᴮ as student B’s Elo rating, Eᴬ as the probability that student A is
more attractive than student B and Eᴮ as the probability that student B is more attractive than
student A.
In the movie, it is shown that the site became extremely popular in a short amount of time
and even crashed the Harvard servers. Zuckerberg was even charged by the administration
with breach of security, violating copyrights, and violating individual privacy, but these
charges were ultimately dropped.
Tinder
Tinder, a social networking and online dating app, allows users to swipe left or right based on
whether they like or dislike a profile. If two users right swipe on each other, they are
‘matched’ and can then exchange messages.
In a blog post dated 15 March 2019, Tinder acknowledged that it did use an Elo score as part of its algorithm, which considered how others engaged with one's profile, but that this score is no longer being used.
Merging Rankings
Merging Rankings is a technique used in data science to combine or merge multiple ranking
systems or sources of data into a unified ranking. This approach is often employed when
dealing with diverse datasets or when different ranking algorithms have been applied to the
same data.
Here's an explanation of how Merging Rankings works in data science:
1. Multiple Ranking Systems: In many cases, there may be multiple ranking systems or
algorithms applied to the same set of data. Each ranking system may prioritize
different criteria, use different algorithms, or have different biases.
2. Integration: Merging Rankings involves integrating or combining the rankings
generated by these different systems into a single, cohesive ranking. This integration
process aims to create a unified ranking that reflects the strengths and weaknesses of
each individual ranking system.
3. Weighting: One common approach in Merging Rankings is to assign weights to each
individual ranking system based on factors such as their reliability, accuracy, or
relevance to the specific task at hand. These weights are then used to combine the
rankings, giving more influence to the more reliable or relevant ranking systems.
4. Aggregation: Once the rankings from different systems are weighted, they are
aggregated or combined using a suitable method. Common aggregation methods
include averaging, weighted averaging, and rank-order fusion techniques.
5. Normalization: In some cases, it may be necessary to normalize the rankings before
merging them to ensure that they are on a consistent scale. This can involve rescaling
the rankings or adjusting them to have a common baseline.
6. Evaluation: After merging the rankings, it's essential to evaluate the performance of
the merged ranking compared to the individual rankings. This evaluation helps ensure
that the merged ranking accurately represents the underlying data and achieves the
desired objectives.
Overall, Merging Rankings is a valuable technique in data science for synthesizing diverse
sources of ranking information and creating a unified ranking that captures the collective
insights of multiple ranking systems. It enables data scientists to leverage the strengths of
different algorithms and datasets while mitigating the limitations of individual rankings.
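As an illustrative sketch (not a standard library routine), the snippet below merges two hypothetical rankings of the same items by min-max normalizing their scores, weighting each source, and averaging; the item names, scores, and weights are all made up:

# Two hypothetical ranking systems score the same items on different scales.
scores_a = {"item1": 92, "item2": 75, "item3": 60}     # e.g. a relevance model
scores_b = {"item1": 0.4, "item2": 0.9, "item3": 0.7}  # e.g. a popularity model
weights = {"a": 0.6, "b": 0.4}  # assumed reliability weights for each source

def min_max_normalize(scores):
    # Rescale scores to a common 0-1 range before merging
    lo, hi = min(scores.values()), max(scores.values())
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}

norm_a = min_max_normalize(scores_a)
norm_b = min_max_normalize(scores_b)

# Weighted average of the normalized scores gives the merged ranking
merged = {item: weights["a"] * norm_a[item] + weights["b"] * norm_b[item]
          for item in norm_a}
print(sorted(merged, key=merged.get, reverse=True))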
Digraph Rankings
Digraph Rankings, also known as Directed Graph Rankings, are a method used in data science and network analysis to rank entities within a directed graph structure. In a directed graph, also called a digraph, entities are represented as nodes, and the relationships between entities are represented as directed edges or arrows.
Here's an explanation of how Digraph Rankings work:
1. Directed Graph Representation: Digraph Rankings operate on a directed graph, where each node represents an entity, and each directed edge represents a relationship or connection between entities. These relationships can be asymmetric, meaning that the direction of the edge matters.
2. Node Importance: In Digraph Rankings, the importance or centrality of nodes within the graph is determined based on their relationships with other nodes. Nodes that are highly connected to other important nodes tend to have higher rankings.
3. Algorithms: Several algorithms can be used to calculate Digraph Rankings. One common algorithm is the PageRank algorithm, which was originally developed by Google to rank web pages based on their importance in the World Wide Web. PageRank assigns scores to nodes in the graph based on the number and importance of incoming edges.
4. Centrality Measures: In addition to PageRank, other centrality measures can be used to calculate Digraph Rankings, such as betweenness centrality, closeness centrality, and eigenvector centrality. Each centrality measure quantifies a different aspect of node importance within the graph.
5. Applications: Digraph Rankings have various applications in data science, network analysis, and social network analysis. They can be used to identify influential nodes within a network, recommend relevant content or connections to users, detect communities or clusters within the network, and predict the flow of information or influence within the network.
Overall, Digraph Rankings provide a powerful framework for analyzing and ranking entities within directed graph structures, allowing data scientists to uncover valuable insights and patterns in complex networks.
PageRank is an algorithm used to rank web pages in search engine results. It was developed
by Larry Page and Sergey Brin, the co-founders of Google, and is a fundamental component
of the Google search engine. PageRank assigns a numerical value to each web page,
representing its importance or relevance in the context of a particular search query.
Here's how the PageRank algorithm works:
1. Graph Representation: PageRank operates on a graph representation of the World
Wide Web, where web pages are represented as nodes, and hyperlinks between pages
are represented as directed edges or arrows. In this graph, a link from page A to page
B is considered a vote of confidence from page A to page B.
2. Random Surfer Model: PageRank is based on the concept of a random surfer
navigating the web by clicking on links at random. The probability that the surfer will
arrive at a particular page is determined by the number and quality of inbound links to
that page.
3. Iterative Calculation: The PageRank algorithm iteratively calculates the PageRank
score for each web page based on the votes of confidence it receives from other pages.
Initially, each page is assigned an equal PageRank score. In subsequent iterations, the
PageRank score of each page is updated based on the PageRank scores of the pages
linking to it.
4. Damping Factor: To prevent the PageRank scores from becoming too concentrated
on a small subset of highly connected pages, a damping factor is introduced. This
damping factor represents the probability that the random surfer will continue clicking
on links rather than jumping to a random page. Typically, the damping factor is set to
around 0.85.
5. Convergence: The PageRank algorithm continues iterating until the PageRank scores
converge to stable values. At this point, the PageRank scores represent the relative
importance or relevance of each web page in the context of the entire web graph.
6. Applications: PageRank is used by search engines to rank web pages in search
results, with pages receiving higher PageRank scores being displayed more
prominently in search results. It is also used in various other applications, such as
ranking academic papers, social network analysis, and recommendation systems.
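To make the iteration concrete, here is a compact sketch of PageRank by power iteration on a small made-up link graph, with a damping factor of 0.85; this is a simplified illustration, not Google's production algorithm:

# Hypothetical web graph: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
ranks = {p: 1 / len(pages) for p in pages}  # start from a uniform distribution

for _ in range(50):  # iterate until the scores stabilize
    new_ranks = {}
    for page in pages:
        # Rank "votes" flowing in from pages that link to this one, each vote
        # divided by the number of outbound links on the linking page
        incoming = sum(ranks[src] / len(out)
                       for src, out in links.items() if page in out)
        new_ranks[page] = (1 - damping) / len(pages) + damping * incoming
    ranks = new_ranks

print(ranks)  # C ends up with the highest score in this toy graph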
Simplified algorithm
Assume a small universe of four web pages: A, B, C, and D. Links from a page to
itself, or multiple outbound links from one single page to another single page, are
ignored. PageRank is initialized to the same value for all pages. In the original form
of PageRank, the sum of PageRank over all pages was the total number of pages on
the web at that time, so each page in this example would have an initial value of 1.
However, later versions of PageRank, and the remainder of this section, assume a
probability distribution between 0 and 1. Hence the initial value for each page in this
example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon
the next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would
transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.
PR(A) = PR(B) + PR(C) + PR(D) = 0.75
Suppose instead that page B had a link to pages C and A, page C had a link to page A,
and page D had links to all three pages. Thus, upon the first iteration, page B would
transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to
page C. Page C would transfer all of its existing value, 0.25, to the only page it links
to, A. Since D had three outbound links, it would transfer one-third of its existing
value, or approximately 0.083, to A. At the completion of this iteration, page A will
have a PageRank of approximately 0.458.
PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 ≈ 0.458
Let us see how to work through the PageRank algorithm by hand: compute the PageRank of every node at the end of the second iteration, using a teleportation (damping) factor β = 0.8. The example graph has six nodes, A to F; from the calculations below, the link structure is A → B, C; B → A, D, E, F; and C → A, F (D, E, and F do not appear as sources in any formula).
The formula is:

PR(A) = (1 − β) + β × [PR(B)/Cout(B) + PR(C)/Cout(C) + ... + PR(N)/Cout(N)]

where the sum runs over the pages that link to A, and Cout(X) is the number of outbound links of page X. Note that within an iteration the freshly updated values are reused for the nodes computed later, which is why PR(B) below already uses the new PR(A).

Let us build a table of iteration 0, iteration 1, and iteration 2.
Iteration 0:
For iteration 0, assume that each page has PageRank = 1 / (total number of nodes).
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 ≈ 0.167
Iteration 1:
Using the above formula:
PR(A) = (1 − 0.8) + 0.8 × (PR(B)/4 + PR(C)/2)
      = 0.2 + 0.8 × (0.167/4 + 0.167/2)
      ≈ 0.3
What we have done here: for node A, we look at its incoming links, which come from B and C. For each linking node, we divide its PageRank by its number of outgoing links; B has 4 outgoing links and C has 2.
The same procedure applies to the remaining nodes and iterations.
PR(B) = (1 − 0.8) + 0.8 × (PR(A)/2)
      = 0.2 + 0.8 × (0.3/2)
      = 0.32
PR(C) = (1 − 0.8) + 0.8 × (PR(A)/2)
      = 0.2 + 0.8 × (0.3/2)
      = 0.32
PR(D) = (1 − 0.8) + 0.8 × (PR(B)/4)
      = 0.2 + 0.8 × (0.32/4)
      = 0.264
PR(E) = (1 − 0.8) + 0.8 × (PR(B)/4)
      = 0.2 + 0.8 × (0.32/4)
      = 0.264
PR(F) = (1 − 0.8) + 0.8 × (PR(B)/4 + PR(C)/2)
      = 0.2 + 0.8 × (0.32/4 + 0.32/2)
      = 0.392
This was iteration 1; now let us calculate iteration 2.
Iteration 2:
Using the same formula:
PR(A) = (1 − 0.8) + 0.8 × (PR(B)/4 + PR(C)/2)
      = 0.2 + 0.8 × (0.32/4 + 0.32/2)
      = 0.392
PR(B) = (1 − 0.8) + 0.8 × (PR(A)/2)
      = 0.2 + 0.8 × (0.392/2)
      = 0.3568
PR(C) = (1 − 0.8) + 0.8 × (PR(A)/2)
      = 0.2 + 0.8 × (0.392/2)
      = 0.3568
PR(D) = (1 − 0.8) + 0.8 × (PR(B)/4)
      = 0.2 + 0.8 × (0.3568/4)
      ≈ 0.2714
PR(E) = (1 − 0.8) + 0.8 × (PR(B)/4)
      = 0.2 + 0.8 × (0.3568/4)
      ≈ 0.2714
PR(F) = (1 − 0.8) + 0.8 × (PR(B)/4 + PR(C)/2)
      = 0.2 + 0.8 × (0.3568/4 + 0.3568/2)
      ≈ 0.4141
So, the final PageRank values for the above question are:

NODE   ITERATION 0    ITERATION 1   ITERATION 2
A      1/6 ≈ 0.167    0.3           0.392
B      1/6 ≈ 0.167    0.32          0.3568
C      1/6 ≈ 0.167    0.32          0.3568
D      1/6 ≈ 0.167    0.264         0.2714
E      1/6 ≈ 0.167    0.264         0.2714
F      1/6 ≈ 0.167    0.392         0.4141
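As a check, the short Python sketch below reproduces the two iterations above; the link structure is the one inferred from the formulas (A → B, C; B → A, D, E, F; C → A, F), and values are updated in place in alphabetical order, exactly as in the hand calculation:

# Links inferred from the hand calculation; D, E and F contribute no outgoing rank here.
out_links = {"A": ["B", "C"], "B": ["A", "D", "E", "F"], "C": ["A", "F"],
             "D": [], "E": [], "F": []}
beta = 0.8  # teleportation (damping) factor used in the example

pr = {node: 1 / 6 for node in out_links}  # iteration 0

for iteration in range(2):
    for node in pr:  # in-place updates, so later nodes see the new values
        incoming = sum(pr[src] / len(targets)
                       for src, targets in out_links.items() if node in targets)
        pr[node] = (1 - beta) + beta * incoming
    print(iteration + 1, {n: round(v, 4) for n, v in pr.items()})
# Iteration 2 prints approximately A=0.392, B=0.3568, C=0.3568,
# D=0.2714, E=0.2714, F=0.4141, matching the table above.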

Statistical Analysis
As Karl Pearson put it, “Statistics is the grammar of science.”
Statistical analysis is the process of collecting and analysing large volumes of data in order
to identify trends and develop valuable insights.
In the professional world, statistical analysts take raw data and find correlations between
variables to reveal patterns and trends to relevant stakeholders. Working in a wide range of
different fields, statistical analysts are responsible for new scientific discoveries, improving
the health of our communities, and guiding business decisions.
There are six types of statistical analysis.
• Descriptive Statistics
In descriptive statistics, the data from the given observations is summarized. The summarization is done by considering a sample from the population and computing statistics such as the mean or standard deviation.
There are four different categories of descriptive statistics:
• Measures of frequency
• Measures of dispersion
• Measures of central tendency
• Measures of position
The measure of frequency is defined by the number of times a particular value occurs. Measures of dispersion are defined by the range, variance, standard deviation, and so on. The mean, median, mode, and skewness of the data fall under measures of central tendency. Finally, position is measured using percentiles and quartiles.
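A small Python sketch computing each category of measure for a made-up sample:

import numpy as np
from scipy import stats

data = np.array([12, 15, 15, 18, 21, 24, 24, 24, 30, 35])  # hypothetical sample

# Measure of frequency: how often each value occurs
values, counts = np.unique(data, return_counts=True)

# Measures of dispersion
data_range = data.max() - data.min()
variance = data.var(ddof=1)     # sample variance
std_dev = data.std(ddof=1)      # sample standard deviation

# Measures of central tendency
mean = data.mean()
median = np.median(data)
mode = values[counts.argmax()]  # most frequent value
skewness = stats.skew(data)

# Measures of position
q1, q3 = np.percentile(data, [25, 75])
print(mean, median, mode, data_range, variance, q1, q3)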
For example, if a business gave you a book of its expenses and you summarized the
percentage of money it spent on different categories of items, then you would be performing
a form of descriptive statistics.
When performing descriptive statistics, you will often use data visualization to present
information in the form of graphs, tables, and charts to clearly convey it to others in an
understandable format. Typically, leaders in a company or organization will then use this data
to guide their decision making going forward.
• Inferential Statistics
Inferential statistics takes the results of descriptive statistics one step further by drawing
conclusions from the data and then making recommendations. The inferences are drawn
based upon sampling variation and observational error.
For example, instead of only summarizing the business's expenses, you might go on to
recommend in which areas to reduce spending and suggest an alternative budget.
Inferential statistical analysis is often used by businesses to inform company decisions and in
scientific research to find new relationships between variables.
• Predictive Analysis
Predictive statistical analysis is a type of statistical analysis that analyzes data to derive past
trends and predict future events on the basis of them. It uses machine
learning algorithms, data mining, data modelling, and artificial intelligence to conduct the
statistical analysis of data.
• Prescriptive Analysis
Prescriptive analysis analyzes the data and prescribes the best course of action based on the results. It is a type of statistical analysis that helps you make an informed decision.
• Exploratory Data Analysis
Exploratory analysis is similar to inferential analysis, but the difference is that it involves exploring unknown data associations. It analyzes the potential relationships within the data.
• Causal Analysis
Causal statistical analysis focuses on determining the cause-and-effect relationship between different variables within the raw data. In simple words, it determines why something happens and its effect on other variables. This methodology can be used by businesses to determine the reasons for failure.
Importance of Statistical Analysis
Statistical analysis eliminates unnecessary information and catalogues important data in an uncomplicated manner, making the otherwise monumental work of organizing inputs far more manageable. Once the data has been collected, statistical analysis may be utilized for a variety of purposes. Some of them are listed below:
• Statistical analysis helps summarize enormous amounts of data into clearly digestible chunks.
• Statistical analysis supports solid and efficient planning in any field of study.
• Statistical analysis helps establish broad generalizations and forecast how much of something will occur under particular conditions.
• Statistical methods, which are effective tools for interpreting numerical data, are applied in practically every field of study. Statistical approaches have been created and are increasingly applied in the physical and biological sciences, such as genetics.
• Statistical approaches are used in the work of businesspeople, manufacturers, and researchers. Statistics departments can be found in banks, insurance businesses, and government agencies.
• A modern administrator, whether in the public or commercial sector, relies on statistical data to make correct decisions.
• Politicians can utilize statistics to support and validate their claims while also explaining the issues they address.
Benefits of Statistical Analysis
• It can help you determine the monthly, quarterly, and yearly figures of sales, profits, and costs, making it easier to make decisions.
• It can help you make informed and correct decisions.
• It can help you identify the problem or cause of a failure and make corrections. For example, it can identify the reason for an increase in total costs and help you cut wasteful expenses.
• It can help you conduct market analysis and build an effective marketing and sales strategy.
• It helps improve the efficiency of different processes.
Statistical Analysis Process
Given below are the five steps of a statistical analysis:
• Step 1: Identify and describe the nature of the data that you are supposed to analyse.
• Step 2: Establish the relation between the analysed data and the sample population from which the data was drawn.
• Step 3: Create a model that clearly presents and summarizes the relationship between the population and the data.
• Step 4: Validate the model.
• Step 5: Use predictive analysis to predict future trends and events likely to happen.

Sampling distribution
Sampling distribution in statistics refers to studying many random samples collected from a
given population based on a specific attribute. The results obtained provide a clear picture of
variations in the probability of the outcomes derived. As a result, the analysts remain aware
of the results beforehand, and hence, they can make preparations to take action accordingly.
As the data is based on one population at a time, the information gathered is easy to manage
and is more reliable as far as obtaining accurate results is concerned. Therefore, the sampling
distribution is an effective tool in helping researchers, academicians, financial analysts,
market strategists, and others make well-informed and wise decisions.
How Does Sampling Distribution Work?
In statistics, a population is a group of people sharing the same attribute, from which random samples are collected.
With sampling distribution, the samples are studied to determine the probability of various
outcomes occurring with respect to certain events.
It is also known as finite-sample distribution. In the process, users collect samples
randomly but from one chosen population. For example, deriving data to understand the
adverts that can help attract teenagers would require selecting a population of those aged
between 13 and 19 only.
Using finite-sample distribution, users can calculate the mean, range, standard deviation,
mean absolute value of the deviation, variance, and unbiased estimate of the variance of the
sample. No matter for what purpose users wish to use the collected data, it helps strategists,
statisticians, academicians, and financial analysts make necessary preparations and take
relevant actions with respect to the expected outcome.
As soon as users decide to utilize the data for further calculation, the next step is to develop
a frequency distribution with respect to individual sample statistics as calculated through the
mean, variance, and other methods. Next, they plot the frequency distribution for each of
them on a graph to represent the variation in the outcome. This representation is indicated on
the distribution graph.

Influencing Factors
Moreover, the accuracy of the distribution depends on various factors, and the major ones that influence the results include:
• The number of observations in the population, denoted by “N.”
• The number of observations in the sample, denoted by “n.”
• The method adopted for choosing samples randomly, which leads to variation in the outcome.
Types
The finite-sample distribution can be expressed in various forms. Here is a list of some
of its types:

#1 – Sampling Distribution of Mean


It is the probabilistic spread of all the means of fixed-size samples that users choose randomly from a particular population. When the individual sample means are plotted on a graph, they follow an approximately normal distribution, and the center of the graph is the mean of the finite-sample distribution, which is also the mean of that population.
#2 – Sampling Distribution of Proportion
This type of finite-sample distribution identifies the proportions of the population. The users select samples and calculate the sample proportion. They then plot the resulting figures on the graph. The mean of the sample proportions gathered from each sample group signifies the mean proportion of the population as a whole. For example, a vlogger might collect data from sample groups to estimate the proportion of the audience interested in watching their upcoming videos.
#3 – T-Distribution
People use this type of distribution when they are not well acquainted with the chosen population or when the sample size is very small. This symmetric, bell-shaped distribution has heavier tails than the standard normal distribution. As the sample size increases, the t-distribution becomes very close to the normal distribution. Users use it to estimate the population mean, test statistical differences, and so on.
Example #1
Sarah wants to compare bicycle usage among teens aged 13-18 across two regions. Instead of considering every individual aged 13-18 in the two regions, she randomly selects 200 samples from each area.
Here,
• The average count of bicycle usage is the sample mean.
• Each chosen sample has its own mean, and the distribution of these sample means is the sampling distribution.
• The deviation obtained is termed the standard error.
She plots the data gathered from the samples on a graph to get a clear view of the finite-sample distribution.
Central Limit Theorem
The discussion on sampling distribution is incomplete without the mention of the central limit
theorem, which states that the shape of the distribution will depend on the size of the sample.
According to this theorem, increasing the sample size reduces the standard error, keeping the distribution of the sample mean close to normal. When users plot the sample means on a graph, the shape will be close to a bell curve. In short, the more samples one studies, the better and more normal the resulting representation.
The CLT is a statistical theory stating that if you take sufficiently large samples from a population with a finite level of variance, the mean of all samples from that population will be roughly equal to the population mean.
Consider there are 15 sections in class X, and each section has 50 students. Our task is to
calculate the average marks of students in class X.
The standard approach would be to calculate the average directly:
• Gather the marks of all the students in Class X
• Add all the marks
• Divide the total marks by the total number of students
But what if the data is extremely large? Is this a good approach? Not really: calculating the marks of all the students would be a tedious and time-consuming process. So, what are the alternatives? Let's take a look at another approach (a short simulation sketch follows the steps below).
• To begin, select groups of students from the class at random. This will be referred to as a sample. Create several samples, each with 30 students.
• Calculate each sample's individual mean.
• Calculate the average of these sample means.
• The value will give us the approximate average marks of the students in Class X.
• The histogram of the sample means of the students' marks will resemble a bell curve, or normal distribution.
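A minimal simulation of this sampling procedure (the marks below are randomly generated, purely for illustration):

import random

random.seed(0)
# Hypothetical population: 15 sections x 50 students, marks centred around 65
population = [random.gauss(65, 12) for _ in range(15 * 50)]

# Draw many samples of 30 students and record each sample's mean
sample_means = []
for _ in range(500):
    sample = random.sample(population, 30)
    sample_means.append(sum(sample) / len(sample))

estimate = sum(sample_means) / len(sample_means)
true_mean = sum(population) / len(population)
print(round(estimate, 2), round(true_mean, 2))  # the two values are close
# A histogram of sample_means would look roughly bell-shaped (normal).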
Significance of Central Limit Theorem
The CLT has several applications. Here are some places where you can use it.
• Political/election polling is a great example of how you can use the CLT. These polls are used to estimate the number of people who support a specific candidate. You may have seen these results with confidence intervals on news channels. The CLT aids in this calculation.
• The CLT is used in various census fields to calculate population details, such as family income, electricity consumption, individual salaries, and so on.
Central Limit Theorem Formula
Let us assume we have a random variable X with mean μ and standard deviation σ. As per the central limit theorem, the sample mean X̄ of a sample of size n is approximately normally distributed:

X̄ ~ N(μ, σ²/n)

that is, X̄ has mean μ and standard deviation σ/√n.

Statistical distributions
Statistical distributions help us understand a problem better by assigning a range of possible
values to the variables, making them very useful in data science and machine learning. Here
are 7 types of distributions with intuitive examples that often occur in real-life data.
Common Types of Data

When you roll a die or pick a card from a deck, you have a limited number of outcomes
possible. This type of data is called Discrete Data, which can only take a specified number of
values. For example, in rolling a die, the specified values are 1, 2, 3, 4, 5, and 6.
In contrast, many everyday measurements have infinitely many possible outcomes. Recording time or measuring a person's height can take infinitely many values within a given interval. This type of data is called Continuous Data, which can have any value within a given range. That range can be finite or infinite.
For example, suppose you measure a watermelon's weight. It can be any value, such as 10.2 kg, 10.24 kg, or 10.243 kg, making it measurable but not countable, and hence continuous. On the other hand, suppose you count the number of boys in a class; since the value is countable, it is discrete.
Types of Statistical Distributions
Depending on the type of data we use, we have grouped distributions into two categories,
discrete distributions for discrete data (finite outcomes) and continuous distributions for
continuous data (infinite outcomes).

Discrete Distributions
DISCRETE UNIFORM DISTRIBUTION: ALL OUTCOMES ARE EQUALLY LIKELY
In statistics, uniform distribution refers to a statistical distribution in which all outcomes are
equally likely. Consider rolling a six-sided die. You have an equal probability of obtaining all
six numbers on your next roll, i.e., obtaining precisely one of 1, 2, 3, 4, 5, or 6, equalling a
probability of 1/6, hence an example of a discrete uniform distribution.
As a result, the uniform distribution graph contains bars of equal height representing each
outcome. In our example, the height is a probability of 1/6 (0.166667).
Fair Dice Uniform Distribution Graph
Uniform distribution is represented by the function U(a, b), where a and b represent the
starting and ending values, respectively. Similar to a discrete uniform distribution, there is a
continuous uniform distribution for continuous variables.
The drawback of this distribution is that it often provides us with little useful information. Using our example of rolling a die, the expected value is 3.5, which gives us no actionable intuition, since there is no such thing as rolling a 3.5 on a die. And since all values are equally likely, the distribution gives us no real predictive power.
BERNOULLI DISTRIBUTION: SINGLE-TRIAL WITH TWO POSSIBLE OUTCOMES
The Bernoulli distribution is one of the easiest distributions to understand. It can be used as a
starting point to derive more complex distributions. Any event with a single trial and only
two outcomes follows a Bernoulli distribution. Flipping a coin or choosing between True and
False in a quiz are examples of a Bernoulli distribution.
They have a single trial and only two outcomes. Let's assume you flip a coin once; this is a single trial. The only two outcomes are heads or tails. This is an example of a Bernoulli distribution.
Usually, when following a Bernoulli distribution, we have the probability of one of the
outcomes (p). From (p), we can deduce the probability of the other outcome by subtracting it
from the total probability (1), represented as (1-p).
It is represented by Bern(p), where p is the probability of success. The expected value of a Bernoulli variable x is E(x) = p, and its variance is Var(x) = p(1 − p).
Loaded Coin Bernoulli Distribution Graph
The graph of a Bernoulli distribution is simple to read. It consists of only two bars, one rising
to the associated probability p and the other growing to 1-p.
BINOMIAL DISTRIBUTION: A SEQUENCE OF BERNOULLI EVENTS
The Binomial Distribution can be thought of as the sum of outcomes of an event following a
Bernoulli distribution. Therefore, Binomial Distribution is used in binary outcome events,
and the probability of success and failure is the same in all successive trials. An example of a
binomial event would be flipping a coin multiple times to count the number of heads and
tails.
Binomial vs Bernoulli distribution.
The difference between these distributions can be explained through an example. Consider
you’re attempting a quiz that contains 10 True/False questions. Trying a single T/F question
would be considered a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions
would be categorized as a Binomial trial. The main characteristics of the Binomial Distribution are:
• Given multiple trials, each of them is independent of the others; that is, the outcome of one trial does not affect another.
• Each trial can lead to just two possible results (e.g., winning or losing), with probabilities p and (1 − p).
A binomial distribution is represented by B (n, p), where n is the number of trials and p is the
probability of success in a single trial. A Bernoulli distribution can be shaped as a binomial
trial as B (1, p) since it has only one trial. The expected value of a binomial trial “x” is the
number of times a success occurs, represented as E(x) = np. Similarly, variance is represented
as Var(x) = np(1-p).
Given the probability of success (p) and the number of trials (n), we can calculate the probability of exactly x successes in those n trials using the formula below:

P(X = x) = C(n, x) × p^x × (1 − p)^(n − x)

where C(n, x) = n! / (x!(n − x)!) is the number of ways to choose x successes out of n trials.
For example, suppose that a candy company produces both milk chocolate and dark chocolate
candy bars. The total products contain half milk chocolate bars and half dark chocolate bars.
Say you choose ten candy bars at random and choosing milk chocolate is defined as a
success. The probability distribution of the number of successes during these ten trials with p
= 0.5 is shown here in the binomial distribution graph:

Binomial Distribution Graph
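A quick sketch of this candy-bar example using scipy.stats (n = 10 trials, p = 0.5):

from scipy.stats import binom

n, p = 10, 0.5  # ten candy bars picked, half of the production is milk chocolate
for x in range(n + 1):
    # Probability of picking exactly x milk chocolate bars out of ten
    print(x, round(binom.pmf(x, n, p), 4))

print(binom.mean(n, p), binom.var(n, p))  # E(x) = np = 5.0, Var(x) = np(1-p) = 2.5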


POISSON DISTRIBUTION: THE PROBABILITY THAT AN EVENT MAY OR MAY
NOT OCCUR
Poisson distribution deals with the frequency with which an event occurs within a specific
interval. Instead of the probability of an event, Poisson distribution requires knowing how
often it happens in a particular period or distance. For example, a cricket chirps two times in
7 seconds on average. We can use the Poisson distribution to determine the likelihood of it
chirping five times in 15 seconds.
A Poisson process is represented with the notation Po(λ), where λ represents the expected number of events that can take place in a period. The expected value and variance of a Poisson process are both λ. X represents the discrete random variable. A Poisson distribution can be modelled using the following formula:

P(X = k) = (λ^k × e^(−λ)) / k!
The main characteristics which describe Poisson processes are:
• The events are independent of each other.
• An event can occur any number of times (within the defined period).
• Two events cannot take place simultaneously.
Poisson Distribution Graph
The graph of Poisson distribution plots the number of instances an event occurs in the
standard interval of time and the probability of each one.
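A sketch of the cricket-chirp example above: two chirps per 7 seconds on average gives an expected count of λ = 2 × 15/7 ≈ 4.29 chirps in 15 seconds, and we ask for the probability of exactly five chirps:

from scipy.stats import poisson

lam = 2 * 15 / 7  # expected number of chirps in a 15-second window
# Probability of exactly 5 chirps in 15 seconds, P(X = 5), comes out near 0.17
print(round(poisson.pmf(5, lam), 4))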
Continuous Distributions
NORMAL DISTRIBUTION: SYMMETRIC DISTRIBUTION OF VALUES AROUND
THE MEAN
Normal distribution is the most used distribution in data science. In a normal distribution
graph, data is symmetrically distributed with no skew. When plotted, the data follows a bell
shape, with most values clustering around a central region and tapering off as they go further
away from the center.
The normal distribution frequently appears in nature and life in various forms. For example,
the scores of a quiz follow a normal distribution. Many of the students scored between 60 and
80 as illustrated in the graph below. Of course, students with scores that fall outside this
range are deviating from the center.

Normal Distribution Bell Curve Graph


Here, you can see the “bell-shaped” curve around the central region, indicating that most data points lie there. The normal distribution is represented as N(µ, σ²), where µ represents the mean and σ² represents the variance. The expected value of a normal distribution is equal to its mean. Some of the characteristics which can help us recognize a normal distribution are:
• The curve is symmetric about the center. Therefore the mean, mode, and median are all equal, and the values are distributed symmetrically around the mean.
• The area under the distribution curve equals 1 (all the probabilities must sum to 1).
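A short numerical check of these properties, assuming an illustrative N(70, 10²) quiz-score distribution:

import numpy as np
from scipy.stats import norm

mu, sigma = 70, 10  # assumed quiz-score distribution N(70, 10^2)
scores = np.random.default_rng(0).normal(mu, sigma, 100_000)

print(scores.mean(), np.median(scores))  # mean and median are both close to 70 (symmetry)
print(norm.cdf(np.inf, mu, sigma))       # total area under the curve = 1.0
print(norm.cdf(80, mu, sigma) - norm.cdf(60, mu, sigma))  # P(60 < X < 80), about 0.68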

STUDENT'S T-DISTRIBUTION: SMALL SAMPLE SIZE APPROXIMATION OF A NORMAL DISTRIBUTION
The student’s t-distribution, also known as the t distribution, is a type of statistical
distribution similar to the normal distribution with its bell shape but has heavier tails. The t
distribution is used instead of the normal distribution when you have small sample sizes.

Student t-Test Distribution Curve


For example, suppose we deal with the total apples sold by a shopkeeper in a month. In that
case, we will use the normal distribution. Whereas, if we are dealing with the total amount of
apples sold in a day, i.e., a smaller sample, we can use the t distribution.
Overall, the student t distribution is frequently used when conducting statistical analysis and
plays a significant role in performing Hypothesis Testing with limited data.
EXPONENTIAL DISTRIBUTION: MODEL ELAPSED TIME BETWEEN TWO
EVENTS
Exponential distribution is one of the widely used continuous distributions. It is used to
model the time taken between different events. For example, in physics, it is often used to
measure radioactive decay; in engineering, to measure the time associated with receiving a
defective part on an assembly line; and in finance, to measure the likelihood of the next
default for a portfolio of financial assets. Another common application of exponential distributions is in survival analysis (e.g., the expected life of a device or machine).
The exponential distribution is commonly represented as Exp(λ), where λ is the distribution parameter, often called the rate parameter. We can find the value of λ from the mean μ using λ = 1/μ. For an exponential distribution the standard deviation is equal to the mean, and the variance is Var(x) = 1/λ².

Exponential Distribution Curve


An exponential graph is a curved line representing how the probability changes
exponentially. Exponential distributions are commonly used in calculations of product
reliability or the length of time a product lasts.
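A brief sketch with scipy, assuming an illustrative mean time between events of 2 hours (so λ = 1/2):

from scipy.stats import expon

mean_time = 2.0               # assumed mean time between events, in hours
lam = 1 / mean_time           # rate parameter λ

dist = expon(scale=mean_time) # scipy parameterizes the exponential by scale = 1/λ
print(dist.mean(), dist.std())  # both equal 2.0: mean = standard deviation
print(dist.var())               # variance = 1/λ² = 4.0
print(1 - dist.cdf(3))          # probability the next event takes more than 3 hours, about 0.223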
