
UNIT II - DESCRIPTIVE ANALYTICS USING STATISTICS (9 Hrs)

Types of Data – Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability Density Function – Types of Data Distribution – Percentiles and Moments – Correlation and Covariance – Conditional Probability – Bayes Theorem – Introduction to Univariate, Bivariate and Multivariate Analysis.
1. Types of Data
1.1 Qualitative Data Type:
Qualitative (categorical) data describes characteristics that cannot be measured numerically. It is further divided into nominal and ordinal types.
1.1.1 Nominal:
These are sets of values that have no natural ordering. Let's understand this with some examples. The color of a smartphone is a nominal data type, as we cannot compare one color with another: it is not possible to state that 'Red' is greater than 'Blue'. The gender of a person is another example, where no ordering exists among male, female, or others. Mobile phone categories, whether midrange, budget segment, or premium, are also treated as nominal data.
1.1.2 Ordinal:
These types of values have a natural ordering while still maintaining their classes. If we consider clothing sizes, we can easily sort them by their name tags in the order small < medium < large. The grading system used for marking candidates in a test is also an ordinal data type, where an A+ is definitely better than a B grade.
1.2 Quantitative Data Type:
This data type quantifies things using numerical values, which makes the data countable in nature. The price of a smartphone, the discount offered, the number of ratings on a product, the processor frequency of a smartphone, or the RAM of that particular phone all fall under the category of quantitative data types.
1.2.1 Discrete:
Numerical values that are integers or whole numbers are placed under this category. The number of speakers in the phone, cameras, cores in the processor, or the number of SIMs supported are all examples of the discrete data type.
1.2.2 Continuous:
Values that can take fractional forms are considered continuous. These include the operating frequency of the processors, the Android version of the phone, Wi-Fi frequency, the temperature of the cores, and so on.
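
To make these four types concrete, here is a minimal sketch (with assumed example data) of how they can be represented in Python with pandas; `ordered=True` is what distinguishes an ordinal column from a nominal one:

```python
# Assumed toy data illustrating the four data types from this section.
import pandas as pd

phones = pd.DataFrame({
    "color":    ["Red", "Blue", "Black"],           # nominal: no natural order
    "segment":  ["budget", "midrange", "premium"],  # ordinal: has a natural order
    "num_sims": [1, 2, 2],                          # discrete: whole numbers
    "cpu_ghz":  [1.8, 2.4, 3.1],                    # continuous: fractional values
})

# Nominal: categories without an order
phones["color"] = pd.Categorical(phones["color"])

# Ordinal: categories with an explicit order, so comparisons are meaningful
phones["segment"] = pd.Categorical(
    phones["segment"],
    categories=["budget", "midrange", "premium"],
    ordered=True,
)

print(phones["segment"] > "budget")  # works only because the order is defined
```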
2. Mean, Median and Mode

2.1. Mean:
• The mean or average of a number of observations is the sum of the values of all observations divided by the total number of observations:

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

• It is denoted by \( \bar{x} \) and read as "x bar".

Find the mean for the following list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13

The mean is the usual average, so add the values and then divide by the count:

(8 + 9 + 10 + 10 + 10 + 11 + 11 + 11 + 12 + 13) ÷ 10 = 105 ÷ 10 = 10.5


Mean = 10.5
2.2. Median:
• The median is the value that divides the given observations into exactly two equal parts. When the data are arranged in ascending or descending order, the median of ungrouped data is calculated based on whether the number of terms is odd or even.

Find the Median for the following list of values:


8, 9, 10, 10, 10, 11, 11, 11, 12, 13

The median is the middle value. In a list of ten values, that will be the (10 + 1) ÷ 2 = 5.5-th value; the "point-five" indicates that we need to average the fifth and sixth numbers to find the median. The fifth and sixth numbers are the last 10 and the first 11, so:

(10 + 11) ÷ 2 = 21 ÷ 2 = 10.5


Median = 10.5
2.3. Mode:

• The mode is the number repeated most often.

Find the mode for the following list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13

• This list has two modes, 10 and 11, each repeated three times. (The largest value is 13 and the smallest is 8, so the range is 13 − 8 = 5.)
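
The worked examples above can be checked with Python's standard library; `multimode` returns both modes of this bimodal list:

```python
from statistics import mean, median, multimode

values = [8, 9, 10, 10, 10, 11, 11, 11, 12, 13]

print(mean(values))               # 10.5
print(median(values))             # 10.5 (average of the 5th and 6th values)
print(multimode(values))          # [10, 11] -- two modes, three times each
print(max(values) - min(values))  # range: 13 - 8 = 5
```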
3. Standard Deviation and Variance

3.1. Standard Deviation


The spread of statistical data is measured by the standard deviation. It measures how far the data deviate from their mean or average position. The degree of dispersion is computed by estimating the deviation of the data points from the mean. It is denoted by the symbol 'σ'.
Properties of Standard Deviation:
• It is the square root of the mean of the squared deviations of all values from their mean, and is also called the root-mean-square deviation.
• The smallest value of the standard deviation is 0, since it cannot be negative.
• When the data values of a group are similar, the standard deviation will be very low or close to zero. But when the data values vary from each other, the standard deviation is high or far from zero.
Standard Deviation Formula
The population standard deviation formula is given as:

\[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2} \]

σ = Population standard deviation

Similarly, the sample standard deviation formula is:

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \]

s = Sample standard deviation

3.2 Variance
In layman's terms, the variance is a measure of how far a set of data points is spread out from their mean or average value. It is denoted as 'σ²'.
Properties of Variance
• It is always non-negative since each term in the variance sum is squared and therefore the
result is either positive or zero.
• Variance always has squared units. For example, the variance of a set of weights measured in kilograms is given in kg². Because the variance is in squared units, we cannot compare it directly with the mean or with the data themselves.

Variance Formula:
The population variance formula is given by:

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2 \]

σ² = Population variance
N = Number of observations in the population
Xi = ith observation in the population
μ = Population mean

The sample variance formula is given as:

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]

s² = Sample variance
n = Number of observations in the sample
xi = ith observation in the sample
x̄ = Sample mean

Variance and Standard deviation Relationship


Variance is the average of the squared deviations from the mean, and the standard deviation is the square root of the variance. Both measures express variability in a distribution, but their units differ: standard deviation is expressed in the same units as the original values, whereas variance is expressed in squared units.
Example
Question: If a die is rolled, find the variance and standard deviation of the possibilities.
Solution: When a die is rolled, there are 6 possible outcomes. So the sample size n = 6 and the data set = {1, 2, 3, 4, 5, 6}.
To find the variance, first we need to calculate the mean of the data set.
Mean, x̄ = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5
Putting the data values and the mean into the formula gives:
σ² = Σ(xi − x̄)²/n
σ² = (1/6)(6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25)
σ² = 2.917
Now, the standard deviation, σ = √2.917 = 1.708
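
As a quick check of this example, the standard library's population variance and standard deviation functions reproduce the same numbers:

```python
from statistics import pstdev, pvariance

outcomes = [1, 2, 3, 4, 5, 6]  # the die's sample space

print(pvariance(outcomes))  # 2.9166... (population variance: divides by n)
print(pstdev(outcomes))     # 1.7078... (its square root)
```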
4. Probability Density Function
• A probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would be close to that sample.
• In other words, while the absolute likelihood for a continuous random variable to take on any particular
value is 0 (since there is an infinite set of possible values to begin with), the value of the PDF at two
different samples can be used to infer, in any particular draw of the random variable, how much more
likely it is that the random variable would be close to one sample compared to the other sample.
4.1. Probability Density Function Formula
In the case of a continuous random variable, the probability that X takes any single given value x is always 0, so evaluating P(X = x) does not work. Instead, we must calculate the probability of X lying in an interval (a, b), i.e. P(a < X < b), and we can calculate this using the PDF. The probability density function formula is given as:

\[ P(a < X < b) = \int_{a}^{b} f(x)\,dx \]

or

\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \]
This is because, when X is continuous, we can ignore the endpoints of intervals while finding
probabilities of continuous random variables. That means, for any constants a and b,
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b).
4.2 Probability Density Function Graph
The probability that the variable lies in a given range is defined as the integral of the density over that range. The density is denoted by f(x). This function is non-negative at every point of the graph, and its definite integral over the entire space is always equal to one. The graph of a PDF typically resembles a bell curve, with the probability of the outcomes given by the area below the curve. The figure below depicts the graph of a probability density function for a continuous random variable x with density f(x).
4.3. Probability Density Function Example

Question:
Let X be a continuous random variable with the PDF given by:

\[ f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2 \\ 0, & \text{otherwise} \end{cases} \]

Find P(0.5 < X < 1.5).
Solution:
Given the PDF above,

\[ P(0.5 < X < 1.5) = \int_{0.5}^{1.5} f(x)\,dx \]
Let us split the integral by taking the intervals as given below:
\[ = \int_{0.5}^{1} f(x)\,dx + \int_{1}^{1.5} f(x)\,dx \]

Substituting the corresponding values of f(x) based on the intervals, we get:

\[ = \int_{0.5}^{1} x\,dx + \int_{1}^{1.5} (2 - x)\,dx \]

Integrating the functions, we get:

\[ = \left( \frac{x^{2}}{2} \right)_{0.5}^{1} + \left( 2x - \frac{x^{2}}{2} \right)_{1}^{1.5} \]
= [(1)²/2 − (0.5)²/2] + {[2(1.5) − (1.5)²/2] − [2(1) − (1)²/2]}
= [(1/2) − (1/8)] + {[3 − (9/8)] − [2 − (1/2)]}
= (3/8) + [(15/8) − (3/2)]
= (3 + 15 − 12)/8
= 6/8
= 3/4
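
The integral can also be verified numerically; this is a small sketch using scipy's `quad`, with the piecewise density coded directly from the definition above:

```python
from scipy.integrate import quad

def f(x):
    """The piecewise PDF from the example."""
    if 0 < x < 1:
        return x
    if 1 <= x < 2:
        return 2 - x
    return 0.0

prob, _ = quad(f, 0.5, 1.5)
print(prob)  # 0.75, matching the 3/4 computed by hand
```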
4.4. Applications of Probability Density Function
The probability density function has many applications in different fields of study such as statistics, science, and engineering. Some important applications of the probability density function are listed below:
• In statistics, it is used to calculate the probabilities associated with random variables.
• It is used in modelling the annual data of atmospheric NOx temporal concentration.
• It is used to model diesel engine combustion.
5. Types of Data Distribution

1. Bernoulli's Distribution

This is one of the simplest distributions that can be used as an initial point to derive more
complex distributions. Bernoulli’s distribution has possibly two outcomes (success or
failure) and a single trial.

For example, when tossing a coin, if the probability that the outcome is heads is p, then the probability of having tails as the outcome is (1 − p). Bernoulli's distribution is the special case of the binomial distribution with a single trial.

The density function can be given as

\[ f(x) = p^{x}(1 - p)^{1 - x}, \quad x \in \{0, 1\} \]

The graph of Bernoulli's distribution is shown below, where the probability of success is less than the probability of failure.

The distribution has following characteristics;


• The number of trials to be performed needs to be predefined for the experiment.
• Each trial has only two possible outcomes: success or failure.
• The probability of success must be the same for each trial.
• Each trial must be independent of the others.
2. Binomial Distribution

The binomial distribution applies to binary-outcome events where the probability of success remains the same across all successive trials. An example is tossing a biased or unbiased coin a repeated number of times.
As input, the distribution takes two parameters, and is thus called a bi-parametric distribution. The two parameters are:
• The number of trials, n, and
• The probability, p, assigned to one of the two classes.
For n trials with success probability p, the probability of exactly x successes within the n trials is given by the following formula:

\[ P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x} \]
The graph of binomial distribution is shown below when the probability of success is
equal to probability of failure.

The binomial distribution holds the following properties;


• Each of the multiple trials is independent of the others, i.e., the result of one trial cannot influence the other trials.
• Each trial has two possible outcomes, success or failure, with probabilities p and (1 − p).
• A total of n identical trials are conducted, and the probability of success and failure is the same for all trials.
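
As an illustration, the binomial formula translates directly into Python; the fair-coin numbers below are an assumed example, not from the text:

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. the probability of exactly 5 heads in 10 tosses of a fair coin
print(binomial_pmf(5, 10, 0.5))  # 0.24609375
```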
3. Normal (Gaussian) Distribution

Being a continuous distribution, the normal distribution is the one most commonly used in data science. Many quantities from day-to-day life are modelled with this distribution: income distribution, average employee reports, the average weight of a population, etc.

The formula for the normal distribution is:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}} \]

where μ = mean value,
σ = standard deviation,
x = random variable.

The distribution is called the standard normal when the mean (μ) = 0 and the standard deviation (σ) = 1.
The graph of normal distribution is shown below which is symmetric about the centre
(mean).

Normal distribution has the following properties;

• Mean, mode and median coincide with each other.


• The distribution has a bell-shaped distribution curve.
• The distribution curve is symmetrical to the centre.
• The area under the curve is equal to 1.
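
For reference, here is the density formula above coded directly (a minimal sketch; the evaluation point is an assumption):

```python
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

print(normal_pdf(0.0))  # ~0.3989, the peak of the standard normal curve
```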
4. Poisson Distribution

As a discrete probability distribution, the Poisson distribution gives the probability of a given number of events taking place in a fixed period of time or space, or in other specified intervals such as distance, area, or volume.

For example, conducting risk analysis by the insurance/banking industry, anticipating the number of car
accidents in a particular time interval and in a specific area.

Poisson distribution considers following assumptions;


• The probability of a success in a short interval is proportional to the length of the interval.
• The probability of more than one success in a very small interval approaches zero as the interval shrinks.
• A successful event cannot impact the result of another successful event.

A Poisson distribution can be modeled using the formula below:

\[ P(X = x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!} \]

where λ represents the average number of events that take place in the fixed period of time, and X is the number of events in that time period.
The graph of the Poisson distribution is shown below;

Poisson distribution has the following characteristics;

• The events are independent of each other, i.e., if an event occurs, it doesn't affect the probability of another event occurring.
• An event can occur any number of times in a defined period of time.
• No two events can occur at exactly the same instant.
• The average rate at which events take place is constant.
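
The Poisson formula is equally direct to code; the accident-rate numbers below are assumed for illustration:

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) = e^(-lambda) * lambda^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

# e.g. with an average of 3 accidents per month, the probability of exactly 2:
print(poisson_pmf(2, 3.0))  # ~0.2240
```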
5. Exponential Distribution

Like the Poisson distribution, the exponential distribution involves a time element: it gives the probability of a time duration elapsing before an event takes place.
The exponential distribution is used for survival analysis, for example, the life of an air conditioner, the expected life of a machine, or the length of time between metro arrivals.

A variable X is said to possess an exponential distribution when

\[ f(x) = \lambda e^{-\lambda x}, \quad x \ge 0 \]

where λ stands for the rate and always has a value greater than zero.
The graph of exponential distribution is shown below;

The exponential distribution has following characteristics;

• As shown in the graph, the higher the rate, the faster the curve drops; the lower the rate, the flatter the curve.
• In survival analysis, λ is termed the failure rate of a machine at any time t, with the assumption that the machine has survived up to time t.
6. Uniform Distribution

A uniform distribution can be either discrete or continuous, where each event is equally likely to occur. It has a constant probability, forming a rectangular distribution.
In this type of distribution, any number of outcomes may be possible, and all the events have the same probability, similar to Bernoulli's distribution. For example, when rolling a die, the outcomes are 1 to 6, each with an equal probability of 1/6, and they represent a uniform distribution.

A variable X is said to have a uniform distribution on [a, b] if the probability density function is

\[ f(x) = \frac{1}{b - a}, \quad a \le x \le b \]

(and 0 otherwise). The graph of a uniform distribution looks as below.

The uniform distribution has the following properties;


• The probability density function integrates to unity.
• Every value in the support has an equal weight.
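
A quick way to see the shapes of these last two distributions is to sample them; this sketch uses numpy with assumed parameters (rate λ = 2 and the die-like interval [0, 6)):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Exponential: numpy's scale parameter is 1/lambda
waits = rng.exponential(scale=1 / 2.0, size=10_000)

# Continuous uniform on [0, 6): every value is equally likely
rolls = rng.uniform(low=0.0, high=6.0, size=10_000)

print(waits.mean())  # ~0.5, i.e. 1/lambda
print(rolls.mean())  # ~3.0, the midpoint of [0, 6)
```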
6. Correlation and Covariance
6.1. Covariance:
Covariance is a quantitative measure of the degree to which the deviation of one variable (X) from
its mean is related to the deviation of another variable (Y) from its mean. To simplify, covariance
measures the joint variability of two random variables. For example, if greater values of one
variable tend to correspond with greater values of another variable, this suggests positive
covariance. We’ll explore the different types of covariance shortly. First, let’s look at how
covariance is calculated in mathematical terms.

Covariance formula
The covariance formula measures how data points vary jointly around their average values. For example, the covariance between two random variables X and Y can be computed using the following formula (for a sample; a population covariance divides by n instead of n − 1):

\[ \mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]

Where:
• xi represents the values of the X-variable
• yi represents the values of the Y-variable
• x̄ represents the mean (average) of the X-variable
• ȳ represents the mean (average) of the Y-variable
• n represents the number of data points
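
The formula translates line by line into code; the height/weight numbers below are assumed toy data:

```python
def covariance(xs, ys):
    """Sample covariance: mean-centered products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

heights = [150, 160, 170, 180]  # X
weights = [50, 58, 69, 80]      # Y

print(covariance(heights, weights))  # positive: the variables move together
```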

Different types of covariance

Covariance can have both positive and negative values. Depending on the diverse values, there are
two main types: Positive and negative covariance.
1. Positive covariance
Positive covariance means both the variables (X, Y) move in the same direction (i.e. show similar
behavior). So, if greater values of one variable (X) seem to correspond with greater values of
another variable (Y), then the variables are considered to have positive covariance. This tells you
something about the linear relationship between the two variables. So, for example, if an increase
in a person’s height corresponds with an increase in a person’s weight, there is positive covariance
between the two.
2. Negative covariance
Negative covariance means both the variables (X, Y) move in the opposite direction. As opposed
to positive covariance, if the greater values of one variable (X) correspond to lesser values of
another variable (Y) and vice-versa, then the variables are considered to have negative covariance.

Case 1, Cov(x, y) > 0: If the covariance is greater than zero (i.e. X is, on average, greater than its mean when Y is greater than its mean and, similarly, X is, on average, less than its mean when Y is less than its mean), then the covariance of the two variables is positive and they move in the same direction.

Case 2, Cov(x, y) < 0: If the covariance is less than zero (i.e. X is, on average, less than its mean when Y is greater than its mean, and vice versa), then the covariance of the two variables is negative and they move in opposite directions.

Case 3, Cov(x, y) = 0: If the covariance is zero, there is no linear relationship between the two variables.
Covariance matrix
For multi-dimensional data, covariance generalizes to a covariance matrix. The covariance matrix is also known as the variance-covariance matrix, as the diagonal values of the matrix are variances and the other values are covariances. The covariance matrix is a square matrix.

If the data points have zero mean, the covariance matrix can be calculated from the positive semi-definite matrix XXᵀ as follows:

\[ C = \frac{1}{n} X X^{T} \]

In simple terms, the covariance matrix for two-dimensional data can be represented as follows:

\[ C = \begin{pmatrix} \mathrm{Cov}(x, x) & \mathrm{Cov}(x, y) \\ \mathrm{Cov}(y, x) & \mathrm{Cov}(y, y) \end{pmatrix} \]

Here:
• C represents the covariance matrix
• Cov(x, x) and Cov(y, y) represent the variances of variables X and Y
• Cov(x, y) and Cov(y, x) represent the covariance of X and Y
The covariance of X and Y is commutative in nature. In math, commutative simply means that the values can be swapped in the formula and the answer will still be the same, so Cov(x, y) = Cov(y, x). Thus, the covariance matrix is symmetric.
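
With numpy, the same assumed toy data yields the full covariance matrix, whose symmetry and diagonal variances match the description above:

```python
import numpy as np

heights = [150, 160, 170, 180]
weights = [50, 58, 69, 80]

C = np.cov(heights, weights)  # 2 x 2, uses n - 1 by default
print(C)
# C[0, 0] and C[1, 1] are the variances of X and Y (the diagonal);
# C[0, 1] == C[1, 0] is Cov(X, Y), so the matrix is symmetric.
```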
6.2 Correlation
So far, we’ve established that covariance indicates the extent to which two random variables increase
or decrease in tandem with each other. Correlation tells us both the strength and the direction of this
relationship. Correlation is best used for multiple variables that express a linear relationship with one
another. When we assume a correlation between two variables, we are essentially deducing that a
change in one variable impacts a change in another variable. Correlation helps us to determine
whether or not, and how strongly, changes in various variables relate to each other.

Types of correlation
Correlation is classified into the following types based on its value: positive correlation, negative correlation, and no correlation. Let's explore those now.

1. Positive correlation
2. Negative correlation
3. Zero or no correlation
1. Positive correlation
Two variables are considered to have a positive correlation if they are directly proportional. That
is, if the value of one variable increases, then the value of the other variable will also increase. A
perfect positive correlation holds a value of “1”. On a graph, positive correlation appears as
follows:
2. Negative correlation
A perfect negative correlation holds a value of “-1” which means that, as the value of one variable
increases, the value of the second variable decreases (and vice versa). In graph form, this is how
negative correlation might look:
3. Zero or no correlation
The value “0” denotes that there is no correlation. It indicates that there is no relationship between
the two variables, so an increase or decrease in one variable is unrelated to an increase or decrease in
the other variable. A graph showing zero correlation will follow a random distribution of data points,
as opposed to a clear line:

The table below displays the varying degrees of correlation between two variables.
Correlation coefficient:
Correlation is calculated using a method known as "Pearson's Product-Moment Correlation," or simply the "correlation coefficient." Correlation is usually denoted by the italic letter r. The following formula is normally used to find r for two variables X and Y:

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}} \]

Where:
• r represents the correlation coefficient
• xi represents a value of variable X in the data sample
• x̄ represents the mean (average) of the values of X
• yi represents a value of variable Y in the data sample
• ȳ represents the mean (average) of the values of Y
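
Pearson's r can be computed in one call with numpy; reusing the assumed height/weight data from the covariance example:

```python
import numpy as np

heights = [150, 160, 170, 180]
weights = [50, 58, 69, 80]

r = np.corrcoef(heights, weights)[0, 1]
print(r)  # close to +1: a strong positive linear relationship
```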
Correlation matrix:
We use correlation coefficients to determine the relationship between two variables, for example, to find
the number of hours a student must spend working to complete a project within the desired timeline. But
what if we want to evaluate the correlation among multiple pairs of variables? Then we use a correlation
matrix.
A correlation matrix is essentially a table depicting the correlation coefficients for various variables. The rows and columns correspond to the variables, and each cell shows the correlation coefficient for the corresponding pair.
Example:
In the following correlation matrix, we can see the correlation coefficient for each possible combination of
variables. In this example, we’re looking at:
•Hours spent exercising (per week)
•Cardio fitness level
•Height
•Age
A correlation matrix is used:
• To conveniently summarize datasets: For large datasets containing thousands of rows, the correlation matrix is an effective way of summarizing the correlations among the various variables of the dataset. The relationship between any two variables can easily be read off the matrix.
• To perform regression testing: Multiple linear regression is difficult to interpret when two
independent variables in the dataset are highly correlated. Two variables which are highly
correlated can easily be located using a correlation matrix, as its convenient structure helps with
quick and easy detection.
• Input for various analyses: Analysis methods such as structural equation models can use the
correlation matrix as an input for their calculation process.
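
As a sketch of such a matrix in practice, pandas' `corr()` builds it directly from a data frame; the column names and numbers below are assumptions mirroring the fitness example above:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_exercising": [2, 5, 1, 7, 4],
    "cardio_fitness":   [55, 70, 50, 82, 66],
    "height_cm":        [170, 165, 180, 175, 168],
    "age":              [25, 32, 40, 29, 35],
})

print(df.corr())  # each cell is the Pearson r for a pair of columns
```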
7. Conditional Probability
Conditional probability is the probability of an event occurring given that another event has already occurred.
The concept is one of the quintessential concepts in probability theory. The concept of conditional probability
is primarily related to the Bayes’ theorem, which is one of the most influential theories in statistics.

Formula for Conditional Probability

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

Note that conditional probability does not state that there is always a causal relationship between the two events, nor does it indicate that the two events occur simultaneously.

Where:
• P(A|B) – the conditional probability; the probability of event A occurring given that event B has
already occurred
• P(A ∩ B) – the joint probability of events A and B; the probability that both events A and B occur
• P(B) – the probability of event B
The formula above is applied to the calculation of the conditional probability of events that are
neither independent nor mutually exclusive.

Another way of calculating conditional probability is by using Bayes' theorem. The theorem can be used to determine the conditional probability of event A, given that event B has occurred, from the conditional probability of event B given event A, together with the individual probabilities of events A and B. Mathematically, Bayes' theorem can be written as:

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

Finally, conditional probabilities can be found using a tree diagram. In the tree diagram, the
probabilities in each branch are conditional.
Conditional Probability for Independent Events
Two events are independent if the probability of the outcome of one event does not influence the probability of the outcome of the other event. For this reason, the conditional probability of two independent events A and B is:

\[ P(A \mid B) = P(A) \qquad \text{and} \qquad P(B \mid A) = P(B) \]

Conditional Probability for Mutually Exclusive Events


In probability theory, mutually exclusive events are events that cannot occur simultaneously. In other words, if one event has already occurred, the other event cannot occur. Thus, the conditional probability of mutually exclusive events is always zero:

\[ P(A \mid B) = 0 \]
8. Bayes Theorem

Bayes' Theorem -- commonly also referred to as Bayes' Rule or Bayes' Law -- is a mathematical property that allows us to express a conditional probability in terms of the inverse of the conditional:

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

Example: Clouds at Sunrise and Rain

We want to predict the probability that it will rain on a given day based only on whether there are clouds at sunrise. In this example, our hypothesis is that it will rain and our data is that there are clouds at sunrise. Therefore, we're answering the question: P(rain | clouds at sunrise).
Unfortunately, we only know:
• It rains on 25% of all days, or P(rain) = 25%,
• It is cloudy at sunrise on only 15% of days, or P(clouds at sunrise) = 15%,
• From our data, we discovered there were clouds at sunrise on 50% of the days it rained, or P(clouds at sunrise | rain) = 50%.

This is a classic application of Bayes' Theorem, since we have data about the information described by the inverse conditional probability! Applying this problem to Bayes' Rule:



Solving the formula with the data we know:

P(rain | clouds at sunrise) = ???


...using Bayes' Theorem...

= P(clouds at sunrise | rain) × ( P(rain) / P(clouds at sunrise) )


...we know that there is a 50% chance of clouds at sunrise on days when it rained...

= 50% × (P(rain) / P( clouds at sunrise) )


...we know the probability of rain on any day, with no conditionals, is 25%...

= 50% × ( 25% / P( clouds at sunrise) )


...we know the probability of clouds at sunrise on any day, with no conditionals, is 15%...

= 50% × ( 25% / 15% )


...solving the equation:

= 83.33%
...and finally, stating the complete answer:
P(rain | clouds at sunrise) = 83.33%

This is a really useful result -- if we had no information at all, we only know there's a 25% chance of
rain. However, by spotting clouds at sunrise, we know there's now over an 80% probability it will rain!
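
The whole calculation fits in a tiny helper function (names here are illustrative, not from the text):

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_rain_given_clouds = bayes(
    p_b_given_a=0.50,  # P(clouds at sunrise | rain)
    p_a=0.25,          # P(rain)
    p_b=0.15,          # P(clouds at sunrise)
)
print(p_rain_given_clouds)  # 0.8333..., i.e. 83.33%
```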
9. Univariate, Bivariate and Multivariate Analysis

1. Univariate: Univariate analysis is the most basic form of statistical data analysis technique. When the data contains only one variable and we are not dealing with cause-and-effect relationships, a univariate analysis technique is used.
Here is one example of univariate analysis:
In a survey of a classroom, the researcher may be looking to count the number of boys and girls. In this instance, the data would simply reflect the count, i.e. a single variable and its quantity. The key objective of univariate analysis is simply to describe the data and find patterns within it. This is done by looking at the mean, median, mode, dispersion, variance, range, standard deviation, etc.
Univariate analysis is conducted in several ways, most of which are descriptive in nature –
•Frequency Distribution Tables
•Histograms
•Frequency Polygons
•Pie Charts
•Bar Charts
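
For instance, a frequency distribution table for the single classroom variable can be produced with pandas (the counts below are assumed):

```python
import pandas as pd

students = pd.Series(["boy", "girl", "girl", "boy", "girl", "boy", "girl"])

print(students.value_counts())  # frequency distribution table
print(students.describe())      # count, number of unique values, most frequent
```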
2. Bivariate analysis

Bivariate analysis is slightly more analytical than univariate analysis. When the data set contains two variables and researchers aim to undertake comparisons between them, bivariate analysis is the right type of analysis technique.
Here is one simple example of bivariate analysis:
In a survey of a classroom, the researcher may be looking to analyze the ratio of students who scored above 85% with respect to their genders. In this case, there are two variables: gender = X (independent variable) and result = Y (dependent variable). A bivariate analysis will measure the correlation between the two variables.
Bivariate analysis is conducted using –
•Correlation coefficients
•Regression analysis
3. Multivariate analysis:
Multivariate analysis is a more complex form of statistical analysis technique, used when there are more than two variables in the data set.
Here is an example of multivariate analysis:
A doctor has collected data on cholesterol, blood pressure, and weight. She has also collected data on the eating habits of the subjects (e.g., how many ounces of red meat, fish, dairy products, and chocolate are consumed per week). She wants to investigate the relationship between the three measures of health and the eating habits.

In this instance, a multivariate analysis would be required to understand the relationship of each
variable with each other.

Commonly used multivariate analysis techniques include –


• Factor Analysis
• Cluster Analysis
• Variance Analysis
• Discriminant Analysis
• Multidimensional Scaling
• Principal Component Analysis
• Redundancy Analysis
Percentiles
Percentiles divide ordered data into hundredths. In a sorted dataset, a given percentile is the value below which that percentage of the data falls.

The 50th percentile is pretty much the median.

For instance, imagine the growth chart of baby girls from birth until 2 years old. By following the lines, we can see that 98% of one-year-old baby girls weigh less than 11.5 kg.
Girls' growth chart. Courtesy: World Health Organization Child Growth Standards
Another popular example is a country’s income distribution. The 99th percentile is the income at which
99% of the rest of the country is making less than that amount, and 1% is making more. In the case of the
UK on the graph below, this is £75,000.
Quartiles
Quartiles are special percentiles, which divide the data into quarters. The first quartile, Q1, is the same
as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median is called
both the second quartile, Q2, and the 50th percentile.

Interquartile Range (IQR)


The IQR is a number that indicates how spread out the middle half (i.e. the middle 50%) of the dataset is, and it can help determine outliers. It is the difference between Q3 and Q1:
IQR = Q3 - Q1

Generally speaking, outliers are those data points that fall outside the range from Q1 − 1.5 × IQR to Q3 + 1.5 × IQR.
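
Here is a minimal numpy sketch of the quartiles, the IQR, and the 1.5 × IQR outlier fences (the data are assumed, with 30 planted as an outlier):

```python
import numpy as np

data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(iqr, outliers)  # 4.5 [30]
```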
Box Plots
Box plots (also called box and whisker plots) illustrate:
how concentrated the data is, and
how far the extreme values are from most of the data.

Elements of a boxplot.

A box plot consists of a scaled horizontal or vertical axis and a rectangular box.

The minimum and maximum values are the endpoints of the axis (−15 and 5 in this case). Q1 marks one end of the box and Q3 marks the other end.

The 'whiskers' (shown in purple) extend from the ends of the box to the smallest and largest data values. There are also box plots that mark outlier values with dots (shown in red). In those cases, the whiskers do not extend to the minimum and maximum values.
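
A box plot like the one described can be drawn in a few lines with matplotlib (the sample data are assumed):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=0, scale=2, size=200)

plt.boxplot(data)  # box = Q1..Q3, line = median, dots = outliers
plt.ylabel("value")
plt.show()
```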
Boxplots on a Normal Distribution
There is a subtle nuance with boxplots on normal distributions: even though they are called quartile 1 (Q1) and quartile 3 (Q3), they don't really represent 25% of the data! They represent 34.135%, and the area in between is not 50%, but 68.27%.

Comparison of a boxplot of a nearly normal distribution (top) and a PDF for a normal distribution
(bottom). Courtesy: Wikipedia
Moments
Moments describe various aspects of the nature and shape of our distribution.

#1 — The first moment is the mean of the data, which describes the location of the distribution.

#2 — The second moment is the variance, which describes the spread of the distribution. Higher values mean the data are more spread out than lower values.

#3 — The third moment is the skewness, which is basically a measure of how lopsided a distribution is. A positive skew means the bulk of the data leans left with a long right tail, so the mean is to the right of the bulk of our data. And vice versa for a negative skew.
#4 — The fourth moment is the kurtosis, which describes how thick the tail is and how sharp
the peak is. It indicates how likely it is to find extreme values in our data. Higher values make outliers
more likely. This sounds a lot like spread (variance) but is subtly different.

Kurtosis illustration of three curves


We can see that the higher peak values have a higher kurtosis value, i.e. the topmost curve has a higher kurtosis than the
bottommost curve.
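
All four moments can be computed with numpy and scipy; an exponential sample is used here (an assumption) because its positive skew and heavy tail make the third and fourth moments easy to see:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(seed=2)
data = rng.exponential(scale=1.0, size=10_000)

print(np.mean(data))   # 1st moment: location (~1.0)
print(np.var(data))    # 2nd moment: spread (~1.0)
print(skew(data))      # 3rd moment: positive, i.e. a long right tail (~2)
print(kurtosis(data))  # 4th moment: excess kurtosis (0 for a normal curve)
```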
