Data normalization is a crucial element of data analysis. It is what allows analysts to compile and compare numbers of different magnitudes from various data sources. And yet, normalization is little understood and little used.
Z-score normalization refers to the process of normalizing every value in a dataset such that
the mean of all of the values is 0 and the standard deviation is 1.
We use the following formula to perform a z-score normalization on every value in a dataset:
New value = (x – μ) / σ
where:
x: Original value
μ: Mean of data
σ: Standard deviation of data
Example: Performing Z-Score Normalization
Suppose we have the following dataset:
Using a calculator, we can find that the mean of the dataset is 21.2 and the standard deviation
is 29.8.
To perform a z-score normalization on the first value in the dataset, we can use the following
formula:
New value = (x – μ) / σ
New value = (3 – 21.2) / 29.8
New value = -0.61
We can use this formula to perform a z-score normalization on every value in the dataset:
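As a sketch, the same transformation can be applied in code. The original dataset is not reproduced here, so the values below are purely illustrative:

```python
import statistics

def z_score_normalize(values):
    """Apply (x - mu) / sigma to every value in the dataset."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    return [(x - mu) / sigma for x in values]

# Illustrative values only; the article's dataset is not shown here
normalized = z_score_normalize([1, 2, 3, 4, 5])
```

After normalization the values have mean 0 and standard deviation 1, which is easy to verify on any input.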
Advanced ranking techniques
1. ELO ranking
Elo Rankings is a method used for ranking players or teams in competitive games or sports. It
was originally developed by Arpad Elo for ranking chess players but has since been adapted
for various other games and sports, including esports, soccer, basketball, and tennis.
The Elo rating system assigns a numerical rating to each player or team, representing their
skill level or performance. The basic principle behind Elo Rankings is that players' ratings
change after each match based on the outcome and the relative ratings of the opponents.
Here's how the Elo Rankings system typically works:
1. Initial Ratings: Each player or team starts with an initial rating, usually a default value
determined by the organizer or based on their previous performance.
2. Expected Score: Before a match, the Elo system calculates the expected score for each
player based on their ratings. The higher-rated player is expected to win, and the lower-
rated player is expected to lose. The expected score is calculated using a logistic
function.
3. Actual Score: After the match, the actual outcome is compared to the expected outcome.
If the player performs better than expected (e.g., wins against a higher-rated opponent),
their rating increases. Conversely, if the player performs worse than expected (e.g., loses
against a lower-rated opponent), their rating decreases.
4. Rating Adjustment: The amount by which a player's rating changes after a match
depends on the difference between the actual score and the expected score, as well as a
parameter known as the K-factor, which determines the sensitivity of the rating changes
to individual match results. A higher K-factor means that ratings change more
dramatically with each match, while a lower K-factor means that ratings change more
gradually.
5. Iterative Process: The process of updating ratings after each match is iterative, meaning
that ratings continue to adjust over time as players compete against each other and their
performance fluctuates.
Overall, Elo Rankings provide a dynamic and objective method for assessing the relative skill
levels of players or teams in competitive games and sports, with ratings evolving based on
actual match results.
The difference in the ratings between two players serves as a predictor of the outcome of a match. If players A and B have ratings Rᴬ and Rᴮ, then the expected scores are given by:
Eᴬ = 1 / (1 + 10^((Rᴮ – Rᴬ) / 400))
Eᴮ = 1 / (1 + 10^((Rᴬ – Rᴮ) / 400))
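The update rule described in steps 1–5 can be sketched as follows. The K-factor of 32 is a common choice, not something specified above:

```python
def expected_score(r_a, r_b):
    """Expected score of player A against B (logistic curve, base 10, scale 400)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_rating(rating, expected, actual, k=32):
    """New rating; actual is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k * (actual - expected)

# Two equally rated players: the winner gains k * 0.5 points
new_rating = update_rating(1500, expected_score(1500, 1500), actual=1)
```

Evenly matched players each have an expected score of 0.5, so a win here moves the rating up by 16 points; beating a much higher-rated opponent would move it considerably more.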
2. PageRank
Let us create a table of the 0th iteration, 1st iteration, and 2nd iteration.
Iteration 0:
For iteration 0, assume that each page has PageRank = 1 / (total number of nodes).
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 ≈ 0.16
Iteration 1:
Using the PageRank formula PR(p) = (1 – d) + d × Σ (PR(q) / C(q)), where the sum runs over the pages q that link to p, C(q) is the number of outgoing links of q, and the damping factor d = 0.8:
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.16/4 + 0.16/2)
≈ 0.3
What we have done here: for node A, we look at its incoming links, which come from B and C. For each of those linking nodes, we divide its PageRank by its number of outgoing links: B has 4 outgoing links and C has 2.
The same procedure applies to the remaining nodes and iterations.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.32/4 + 0.32/2)
= 0.392
This was for iteration 1, now let us calculate iteration 2.
Iteration 2:
By using the above-mentioned formula
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.32/4 + 0.32/2)
= 0.392
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.3568/4 + 0.3568/2)
= 0.4141
So, the final PageRank values for the above-given question are PR(A) = 0.392, PR(B) = 0.3568, PR(C) = 0.3568, PR(D) = 0.2714, PR(E) = 0.2714, and PR(F) = 0.4141.
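The iterations above can be reproduced in code. The link structure below is an assumption inferred from the formulas (A has 2 outgoing links, B has 4, C has 2); starting every page at 1/6 and updating nodes sequentially within an iteration, as the worked example does, reproduces the values above up to rounding:

```python
# Link structure inferred from the formulas above (assumed, since the
# original graph figure is not shown); D, E, F have no outgoing links here.
links = {
    "A": ["B", "C"],
    "B": ["A", "D", "E", "F"],
    "C": ["A", "F"],
    "D": [], "E": [], "F": [],
}

def pagerank(links, d=0.8, n_iter=2):
    pr = {node: 1 / len(links) for node in links}  # iteration 0
    incoming = {n: [q for q in links if n in links[q]] for n in links}
    for _ in range(n_iter):
        for node in links:  # sequential update, as in the worked example
            pr[node] = (1 - d) + d * sum(pr[q] / len(links[q])
                                         for q in incoming[node])
    return pr

ranks = pagerank(links)
```

Running two iterations gives PR(A) ≈ 0.392 and PR(F) ≈ 0.4141, matching the hand calculation.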
Statistical Analysis
Karl Pearson defined the field memorably: “Statistics is the grammar of science.”
Statistical analysis is the process of collecting and analysing large volumes of data in order
to identify trends and develop valuable insights.
In the professional world, statistical analysts take raw data and find correlations between
variables to reveal patterns and trends to relevant stakeholders. Working in a wide range of
different fields, statistical analysts are responsible for new scientific discoveries, improving
the health of our communities, and guiding business decisions.
There are six types of statistical analysis: descriptive, inferential, predictive, prescriptive, exploratory, and causal.
Descriptive Statistics
In descriptive statistics, the given observations are summarized. The summarization is done by describing a sample from the population using measures such as the mean or standard deviation.
There are four different categories in Descriptive Statistics. They are,
Measure of frequency
Measure of dispersion
Measure of central tendency
Measure of position.
The measure of frequency is defined by the number of times a particular value occurs. The measure of dispersion is defined in terms of the range, variance, standard deviation, etc. The mean, median, mode, and skewness of the data come under the measure of central tendency. Finally, position is measured using percentiles and quartiles.
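All four categories of descriptive measures can be computed with Python's standard library; the dataset here is made up for illustration:

```python
import statistics
from collections import Counter

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Measure of frequency: how often each value occurs
freq = Counter(data)

# Measures of central tendency
mean, median, mode = statistics.mean(data), statistics.median(data), statistics.mode(data)

# Measures of dispersion
value_range = max(data) - min(data)
variance, sd = statistics.pvariance(data), statistics.pstdev(data)

# Measures of position: quartiles (Q1, Q2, Q3)
q1, q2, q3 = statistics.quantiles(data, n=4)
```

For this sample the mean is 5, the mode is 4, and the population standard deviation is 2 — the kind of one-line summary that descriptive statistics aims for.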
For example, if a business gave you a book of its expenses and you summarized the
percentage of money it spent on different categories of items, then you would be performing
a form of descriptive statistics.
When performing descriptive statistics, you will often use data visualization to present
information in the form of graphs, tables, and charts to clearly convey it to others in an
understandable format. Typically, leaders in a company or organization will then use this data
to guide their decision making going forward.
Inferential Statistics
Inferential statistics takes the results of descriptive statistics one step further by drawing
conclusions from the data and then making recommendations. The inferences are drawn
based upon sampling variation and observational error.
For example, instead of only summarizing the business's expenses, you might go on to
recommend in which areas to reduce spending and suggest an alternative budget.
Inferential statistical analysis is often used by businesses to inform company decisions and in
scientific research to find new relationships between variables.
Predictive Analysis
Predictive statistical analysis is a type of statistical analysis that analyzes data to identify past trends and predict future events on the basis of them. It uses machine learning algorithms, data mining, data modelling, and artificial intelligence to conduct the statistical analysis of data.
Prescriptive Analysis
The prescriptive analysis conducts the analysis of data and prescribes the best course of
action based on the results. It is a type of statistical analysis that helps you make an informed
decision.
Exploratory Data Analysis
Exploratory analysis is similar to inferential analysis, but the difference is that it involves
exploring the unknown data associations. It analyzes the potential relationships within the
data.
Causal Analysis
The causal statistical analysis focuses on determining the cause and effect relationship
between different variables within the raw data. In simple words, it determines why
something happens and its effect on other variables. This methodology can be used by
businesses to determine the reason for failure.
Importance of Statistical Analysis
Statistical analysis eliminates unnecessary information and catalogues important data in an uncomplicated manner, making the otherwise monumental work of organizing inputs manageable.
Once the data has been collected, statistical analysis may be utilized for a variety of purposes.
Some of them are listed below:
The statistical analysis aids in summarizing enormous amounts of data into clearly
digestible chunks.
Statistical analysis may help with solid and efficient planning in any subject of study.
Statistical analysis aids in establishing broad generalizations and forecasting how much
of something will occur under particular conditions.
Statistical methods, which are effective tools for interpreting numerical data, are
applied in practically every field of study. Statistical approaches have been created
and are increasingly applied in physical and biological sciences, such as genetics.
Statistical approaches are used in the job of a businessman, a manufacturer, and a
researcher. Statistics departments can be found in banks, insurance businesses, and
government agencies.
A modern administrator, whether in the public or commercial sector, relies on
statistical data to make correct decisions.
Politicians can utilize statistics to support and validate their claims while also
explaining the issues they address.
Benefits of Statistical Analysis
It can help you determine the monthly, quarterly, and yearly figures for sales, profits, and
costs, making it easier to make your decisions.
It can help you make informed and correct decisions.
It can help you identify the problem or cause of the failure and make corrections. For
example, it can identify the reason for an increase in total costs and help you cut the
wasteful expenses.
It can help you conduct market analysis and make an effective marketing and sales
strategy.
It helps improve the efficiency of different processes.
Statistical Analysis Process
Given below are the 5 steps to conduct a statistical analysis that you should follow:
Step 1: Identify and describe the nature of the data that you are supposed to analyse.
Step 2: The next step is to establish a relation between the data analysed and the
sample population to which the data belongs.
Step 3: The third step is to create a model that clearly presents and summarizes the
relationship between the population and the data.
Step 4: Prove if the model is valid or not.
Step 5: Use predictive analysis to predict future trends and events likely to happen.
Sampling distribution
Sampling distribution in statistics refers to studying many random samples collected from a
given population based on a specific attribute. The results obtained provide a clear picture of
variations in the probability of the outcomes derived. As a result, the analysts remain aware
of the results beforehand, and hence, they can make preparations to take action accordingly.
As the data is based on one population at a time, the information gathered is easy to manage
and is more reliable as far as obtaining accurate results is concerned. Therefore, the sampling
distribution is an effective tool in helping researchers, academicians, financial analysts,
market strategists, and others make well-informed and wise decisions.
How Does Sampling Distribution Work?
A population is a group of people having the same attribute used for random sample
collection in terms of statistics.
With sampling distribution, the samples are studied to determine the probability of various
outcomes occurring with respect to certain events.
It is also known as finite-sample distribution. In the process, users collect samples
randomly but from one chosen population. For example, deriving data to understand the
adverts that can help attract teenagers would require selecting a population of those aged
between 13 and 19 only.
Using finite-sample distribution, users can calculate the mean, range, standard deviation,
mean absolute value of the deviation, variance, and unbiased estimate of the variance of the
sample. No matter for what purpose users wish to use the collected data, it helps strategists,
statisticians, academicians, and financial analysts make necessary preparations and take
relevant actions with respect to the expected outcome.
As soon as users decide to utilize the data for further calculation, the next step is to develop
a frequency distribution with respect to individual sample statistics as calculated through the
mean, variance, and other methods. Next, they plot the frequency distribution for each of
them on a graph to represent the variation in the outcome. This representation is indicated on
the distribution graph.
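The procedure just described — draw many random samples, compute a statistic for each, and look at how that statistic varies — can be simulated directly. The population parameters below are assumptions for illustration:

```python
import random
import statistics

random.seed(0)  # reproducible illustration
# Hypothetical population: 10,000 values with mean ~50 and sd ~10
population = [random.gauss(50, 10) for _ in range(10_000)]

# Sampling distribution of the mean: 1,000 random samples of size 30
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(1_000)]

center = statistics.mean(sample_means)   # close to the population mean
spread = statistics.stdev(sample_means)  # close to sigma / sqrt(n), the standard error
```

Plotting a histogram of `sample_means` would produce the frequency-distribution graph described above: centred on the population mean, with spread governed by the sample size n.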
Influencing Factors
Moreover, the accuracy of the distribution depends on various factors, and the major ones
that influence the results include:
Number of observations in the population. It is denoted by “N.”
Number of observations in the sample. It is denoted by “n.”
Methods adopted for choosing samples randomly. It leads to variation in the outcome.
Types
The finite-sample distribution can be expressed in various forms, depending on which statistic is computed for each sample — for example, the sample mean or a sample proportion.
Statistical distributions
Statistical distributions help us understand a problem better by assigning a range of possible
values to the variables, making them very useful in data science and machine learning. Here
are 7 types of distributions with intuitive examples that often occur in real-life data.
Common Types of Data
When you roll a die or pick a card from a deck, you have a limited number of outcomes
possible. This type of data is called Discrete Data, which can only take a specified number of
values. For example, in rolling a die, the specified values are 1, 2, 3, 4, 5, and 6.
Similarly, we can find examples with infinitely many outcomes in our daily environment. Recording time or measuring a person’s height can take infinitely many values within a given interval. This type of data is called Continuous Data, which can have any value within a given range. That range can be finite or infinite.
For example, suppose you measure a watermelon’s weight. It can be 10.2 kg, 10.24 kg, or 10.243 kg: measurable but not countable, hence continuous. On the other hand, suppose you count the number of boys in a class; since the value is countable, it is discrete.
Types of Statistical Distributions
Depending on the type of data we use, we have grouped distributions into two categories,
discrete distributions for discrete data (finite outcomes) and continuous distributions for
continuous data (infinite outcomes).
Discrete Distributions
DISCRETE UNIFORM DISTRIBUTION: ALL OUTCOMES ARE EQUALLY LIKELY
In statistics, a uniform distribution is a statistical distribution in which all outcomes are equally likely. Consider rolling a six-sided die: each of the six numbers 1, 2, 3, 4, 5, and 6 is equally likely on your next roll, each with probability 1/6, hence an example of a discrete uniform distribution.
As a result, the uniform distribution graph contains bars of equal height representing each
outcome. In our example, the height is a probability of 1/6 (0.166667).
Fair Dice Uniform Distribution Graph
Uniform distribution is represented by the function U(a, b), where a and b represent the
starting and ending values, respectively. Similar to a discrete uniform distribution, there is a
continuous uniform distribution for continuous variables.
The drawback of this distribution is that it often provides no useful information. Using our die-rolling example, we get an expected value of 3.5, which offers no real intuition, since there is no such thing as half a number on a die. And since all values are equally likely, the distribution gives us no real predictive power.
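The expected value of 3.5 comes straight from the definition; a tiny computation with exact fractions makes it concrete:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # every face of a fair die is equally likely

# E(X) = sum of (value * probability) over all outcomes
expected_value = sum(x * p for x in outcomes)  # 7/2, i.e. 3.5
```

The result, 3.5, is not itself a possible roll — which is exactly the limited predictive power discussed above.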
BERNOULLI DISTRIBUTION: SINGLE-TRIAL WITH TWO POSSIBLE OUTCOMES
The Bernoulli distribution is one of the easiest distributions to understand. It can be used as a
starting point to derive more complex distributions. Any event with a single trial and only
two outcomes follows a Bernoulli distribution. Flipping a coin or choosing between True and
False in a quiz are examples of a Bernoulli distribution.
They have a single trial and only two outcomes. Let’s assume you flip a coin once; this is a single trial. The only two outcomes are heads or tails. This is an example of a Bernoulli distribution.
Usually, when following a Bernoulli distribution, we have the probability of one of the
outcomes (p). From (p), we can deduce the probability of the other outcome by subtracting it
from the total probability (1), represented as (1-p).
It is represented by bern(p), where p is the probability of success. The expected value of a
Bernoulli trial ‘x’ is represented as, E(x) = p, and similarly Bernoulli variance is, Var(x) =
p(1-p).
Loaded Coin Bernoulli Distribution Graph
The graph of a Bernoulli distribution is simple to read. It consists of only two bars, one rising
to the associated probability p and the other growing to 1-p.
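The formulas E(x) = p and Var(x) = p(1 − p) can be checked with a quick simulation; the success probability 0.7 below is an arbitrary choice for illustration:

```python
import random

def bernoulli_stats(p):
    """Theoretical mean and variance of a Bernoulli(p) trial."""
    return p, p * (1 - p)

random.seed(1)
p = 0.7  # hypothetical probability of success
# Simulate many single trials: 1 for success, 0 for failure
draws = [1 if random.random() < p else 0 for _ in range(100_000)]
empirical_mean = sum(draws) / len(draws)  # should land near p
```

With 100,000 simulated trials the empirical mean sits very close to p, as E(x) = p predicts.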
BINOMIAL DISTRIBUTION: A SEQUENCE OF BERNOULLI EVENTS
The Binomial Distribution can be thought of as the sum of outcomes of an event following a
Bernoulli distribution. Therefore, Binomial Distribution is used in binary outcome events,
and the probability of success and failure is the same in all successive trials. An example of a
binomial event would be flipping a coin multiple times to count the number of heads and
tails.
Binomial vs Bernoulli distribution.
The difference between these distributions can be explained through an example. Consider
you’re attempting a quiz that contains 10 True/False questions. Trying a single T/F question
would be considered a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions
would be categorized as a Binomial trial. The main characteristics of Binomial Distribution
are:
Given multiple trials, each of them is independent of the other. That is, the outcome of
one trial doesn’t affect another one.
Each trial can lead to just two possible results (e.g., winning or losing), with
probabilities p and (1 – p).
A binomial distribution is represented by B (n, p), where n is the number of trials and p is the
probability of success in a single trial. A Bernoulli distribution can be shaped as a binomial
trial as B (1, p) since it has only one trial. The expected value of a binomial trial “x” is the
number of times a success occurs, represented as E(x) = np. Similarly, variance is represented
as Var(x) = np(1-p).
Let’s consider the probability of success (p) and the number of trials (n). We can then calculate the probability of exactly x successes in these n trials using the formula below:
P(x) = C(n, x) · p^x · (1 – p)^(n – x)
where C(n, x) = n! / (x! (n – x)!) counts the ways to choose which x of the n trials are successes.
For example, suppose that a candy company produces both milk chocolate and dark chocolate
candy bars. The total products contain half milk chocolate bars and half dark chocolate bars.
Say you choose ten candy bars at random and choosing milk chocolate is defined as a
success. The probability distribution of the number of successes during these ten trials with p
= 0.5 is shown here in the binomial distribution graph:
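The candy-bar example can be computed directly. Assuming the standard binomial probability mass function P(x) = C(n, x) · p^x · (1 − p)^(n − x) with n = 10 and p = 0.5:

```python
from math import comb

n, p = 10, 0.5  # ten bars chosen; milk chocolate counts as a success

# Probability of exactly x successes in n trials
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

most_likely = max(range(n + 1), key=lambda x: pmf[x])  # peak of the distribution
```

The probabilities sum to 1, the distribution peaks at x = 5 (probability 252/1024 ≈ 0.246), and the expected value works out to np = 5 — the shape the binomial distribution graph depicts.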