2.1. Mean:
• The mean or average of a number of observations is the sum of the values of all observations divided
by the total number of observations.
• It is denoted by \(\overline x \) and read as \(x\) bar.
Worked example: find the mean, median, mode, and range of the following list of ten values: 8, 9, 10, 10, 10, 11, 11, 11, 12, 13.
The mean is the usual average, so add the values up and then divide by the count:
(8 + 9 + 10 + 10 + 10 + 11 + 11 + 11 + 12 + 13) ÷ 10 = 105 ÷ 10 = 10.5
The median is the middle value. In a list of ten values, that will be the (10 + 1) ÷ 2 = 5.5-th value; the "point-five" in the formula is a reminder to average the fifth and sixth numbers to find the median. The fifth and sixth numbers are the last 10 and the first 11, so the median is (10 + 11) ÷ 2 = 10.5.
• The mode is the most frequently occurring value. This list has two values that are each repeated three times, namely 10 and 11, so both are modes.
• The largest value is 13 and the smallest is 8, so the range is 13 – 8 = 5.
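These calculations can be checked with Python's standard statistics module. A minimal sketch, using the ten-value list from the worked example above:

```python
import statistics

# The ten observations from the worked example
data = [8, 9, 10, 10, 10, 11, 11, 11, 12, 13]

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # averages the 5th and 6th values here
modes = statistics.multimode(data)  # 10 and 11 each appear three times
data_range = max(data) - min(data)  # largest minus smallest

print(mean, median, modes, data_range)
```

`statistics.multimode` (Python 3.8+) returns every most-common value, which matches a list with two modes.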
3. Standard Deviation and Variance
3.2 Variance
In layman's terms, the variance is a measure of how far a set of data points is spread out from its mean or average value. It is denoted as σ².
Properties of Variance
• It is always non-negative, since each term in the variance sum is squared and therefore the result is either positive or zero.
• Variance always has squared units. For example, the variance of a set of weights measured in kilograms will be given in kg squared. Since the variance is in squared units, we cannot compare it directly with the mean or with the data themselves.
Variance Formula:
The population variance formula is given by:
[latex]\sigma^{2} = \frac{\sum_{i=1}^{N}(X_{i}-\mu)^{2}}{N}[/latex]
Where:
• σ² = Population variance
• N = Number of observations in the population
• Xi = ith observation in the population
• μ = Population mean
The sample variance formula is given as:
[latex]s^{2} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}[/latex]
Where:
• s² = Sample variance
• n = Number of observations in the sample
• xi = ith observation in the sample
• x̄ = Sample mean
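Both formulas can be sketched in plain Python. The function names and the sample weights below are illustrative, not from the text:

```python
# Population variance: divide the sum of squared deviations by N.
def population_variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

# Sample variance: divide by n - 1 (Bessel's correction).
def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical weights in kilograms; the variance comes out in kg squared
weights_kg = [60, 62, 65, 70, 73]
print(population_variance(weights_kg), sample_variance(weights_kg))
```

Note how the sample estimate is larger than the population one, because of the smaller denominator.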
Question:
Let X be a continuous random variable with the PDF given by:
[latex]f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2 \\ 0, & \text{otherwise} \end{cases}[/latex]
Find P(0.5 < x < 1.5).
Solution:
Given PDF is:
[latex]f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2 \\ 0, & \text{otherwise} \end{cases}[/latex]
[latex]P(0.5 < X < 1.5) =\int_{0.5}^{1.5}f(x)dx[/latex]
Let us split the integral by taking the intervals as given below:
[latex]=\int_{0.5}^{1}f(x)dx+\int_{1}^{1.5}f(x)dx[/latex]
Substituting the corresponding values of f(x) based on the intervals, we get;
[latex]=\int_{0.5}^{1}xdx+\int_{1}^{1.5}(2-x)dx[/latex]
Integrating the functions, we get;
[latex]=\left ( \frac{x^{2}}{2} \right )_{0.5}^{1}+\left ( 2x-\frac{x^{2}}{2} \right )_{1}^{1.5}[/latex]
= [(1)²/2 – (0.5)²/2] + {[2(1.5) – (1.5)²/2] – [2(1) – (1)²/2]}
= [(1/2) – (1/8)] + {[3 – (9/8)] – [2 – (1/2)]}
= (3/8) + [(15/8) – (3/2)]
= (3 + 15 – 12)/8
= 6/8
= 3/4
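The result can be double-checked numerically. A minimal sketch (function names are illustrative) that integrates the piecewise PDF over (0.5, 1.5) with a simple midpoint rule:

```python
# The piecewise PDF from the question above
def f(x):
    if 0 < x < 1:
        return x
    if 1 <= x < 2:
        return 2 - x
    return 0.0

# Midpoint-rule numerical integration of g over [a, b]
def integrate(g, a, b, n=100_000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

p = integrate(f, 0.5, 1.5)
print(round(p, 4))
```

The midpoint rule is exact on each linear piece, so the numeric answer agrees with the hand-computed 3/4.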
4.4. Applications of Probability Density Function
The probability density function (PDF) has many applications in different fields of study, such as statistics, science, and engineering. Some of its important applications are listed below:
• In statistics, it is used to calculate the probabilities associated with random variables.
• It is used in modelling the annual temporal concentration data of atmospheric NOx.
• It is used to model diesel engine combustion.
5. Types of Data Distributions
1. Bernoulli's Distribution
This is one of the simplest distributions that can be used as an initial point to derive more
complex distributions. Bernoulli’s distribution has possibly two outcomes (success or
failure) and a single trial.
For example, when tossing a coin, if the probability of the outcome being heads is p, then the probability of having tails as the outcome is (1 − p). Bernoulli's distribution is the special case of the binomial distribution with a single trial.
2. Binomial Distribution
The binomial distribution applies to events with binary outcomes, where the probability of success remains the same across all successive trials. Examples include tossing a biased or unbiased coin a repeated number of times.
As input, the distribution takes two parameters, and is thus called a bi-parametric distribution. The two parameters are:
•The number of times an event occurs, n, and
•Assigned probability, p, to one of the two classes
For n trials and success probability p, the probability of x successful events within n trials can be determined by the following formula:
[latex]P(X = x) = \binom{n}{x}p^{x}(1-p)^{n-x}[/latex]
The graph of binomial distribution is shown below when the probability of success is
equal to probability of failure.
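The binomial PMF, C(n, x) p^x (1 − p)^(n−x), can be evaluated directly with Python's `math.comb`; the fair-coin numbers below are illustrative:

```python
from math import comb

# Binomial PMF: P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Fair coin tossed 4 times: probability of exactly 2 heads
print(binomial_pmf(2, 4, 0.5))  # 6 * 0.25 * 0.25 = 0.375
```

With p = 0.5 (success as likely as failure), the PMF is symmetric, which is the case the graph above depicts.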
4. Poisson Distribution
As a discrete probability distribution, the Poisson distribution outlines the probability of a given number of events taking place in a fixed time period or space, or in particularized intervals such as distance, area, or volume.
For example, conducting risk analysis by the insurance/banking industry, anticipating the number of car
accidents in a particular time interval and in a specific area.
Its probability mass function is:
[latex]P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}[/latex]
Where λ represents the expected number of events that take place in a fixed period of time, and X is the number of events in that time period.
The graph of the Poisson distribution is shown below.
The Poisson distribution relies on the following assumptions:
• The events are independent of each other, i.e., if an event occurs, it doesn't affect the probability of another event occurring.
• An event can occur any number of times in a defined period of time.
• No two events can occur at the same time.
• The average rate at which events take place is constant.
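A quick sketch of the Poisson PMF, P(X = x) = e^(−λ) λ^x / x!, using only the standard library; the accident-rate figures are made up for illustration:

```python
from math import exp, factorial

# Poisson PMF: P(X = x) = e^{-lam} * lam^x / x!
def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

# Hypothetical example: on average 2 car accidents per day in some area;
# probability of observing exactly 3 accidents tomorrow
print(round(poisson_pmf(3, 2), 4))
```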
5. Exponential Distribution
Like the Poisson distribution, the exponential distribution has a time element: it gives the probability of the time duration before an event takes place.
The exponential distribution is used for survival analysis; examples include the life of an air conditioner, the expected life of a machine, and the length of time between metro arrivals.
Its probability density function is:
[latex]f(x) = \lambda e^{-\lambda x}, \quad x \geq 0[/latex]
Where λ stands for the rate and always has a value greater than zero.
The graph of the exponential distribution is shown below.
• As shown in the graph, the higher the rate, the faster the curve drops; the lower the rate, the flatter the curve.
• In survival analysis, λ is termed the failure rate of a machine at any time t, given that it has survived up to time t.
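A small sketch of the exponential density and its survival function e^(−λt); the failure rate used below is hypothetical:

```python
from math import exp

# Exponential PDF: f(t) = lam * e^{-lam * t} for t >= 0
def exponential_pdf(t, lam):
    return lam * exp(-lam * t) if t >= 0 else 0.0

# Survival function: P(T > t) = e^{-lam * t}
def survival(t, lam):
    return exp(-lam * t)

# Hypothetical failure rate of 0.5 per year:
# probability a machine survives beyond 3 years
print(round(survival(3, 0.5), 4))
```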
6. Uniform Distribution
The uniform distribution can be either discrete or continuous, with each event equally likely to occur. It has a constant probability, forming a rectangular distribution.
In this type of distribution, all possible outcomes have the same probability. For example, when rolling a die, the outcomes 1 to 6 each have an equal probability of 1/6 and represent a uniform distribution.
Covariance formula
The covariance formula measures how two variables vary together around their respective means. For example, the covariance between two random variables X and Y can be computed using the following formula (for a sample; for a population, divide by n instead of n − 1):
[latex]\text{cov}(x,y) = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}[/latex]
Where:
• xi represents the values of the X-variable
• yi represents the values of the Y-variable
• x̄ represents the mean (average) of the X-variable
• ȳ represents the mean (average) of the Y-variable
• n represents the number of data points
Covariance can have both positive and negative values. Based on these values, there are two main types: positive and negative covariance.
1. Positive covariance
Positive covariance means both the variables (X, Y) move in the same direction (i.e. show similar
behavior). So, if greater values of one variable (X) seem to correspond with greater values of
another variable (Y), then the variables are considered to have positive covariance. This tells you
something about the linear relationship between the two variables. So, for example, if an increase
in a person’s height corresponds with an increase in a person’s weight, there is positive covariance
between the two.
2. Negative covariance
Negative covariance means both the variables (X, Y) move in the opposite direction. As opposed
to positive covariance, if the greater values of one variable (X) correspond to lesser values of
another variable (Y) and vice-versa, then the variables are considered to have negative covariance.
Case 1, where cov(x,y) > 0: If cov(x,y) is greater than zero (i.e. if X is, on average, greater than its mean when Y is greater than its mean and, similarly, X is, on average, less than its mean when Y is less than its mean), then the covariance of the two variables is positive and they move in the same direction.
Case 2, where cov(x,y) < 0: If cov(x,y) is less than zero (i.e. X is, on average, less than its mean when Y is greater than its mean, and vice versa), then the covariance of the two variables is negative and they move in opposite directions.
Case 3, where cov(x,y) = 0: If cov(x,y) is zero, then there is no linear relationship between the two variables.
Covariance matrix
For multi-dimensional data, covariance generalizes to a covariance matrix. The covariance matrix is also known as the variance-covariance matrix, since the diagonal entries of the matrix are variances and the off-diagonal entries are covariances. It is a square matrix whose (i, j)-th entry is the covariance between the i-th and j-th variables:
[latex]C_{i,j} = \text{cov}(x_{i}, x_{j})[/latex]
If the data points have zero mean, the covariance matrix can be calculated from the positive semi-definite matrix XXᵀ as follows:
[latex]C = \frac{1}{n}XX^{T}[/latex]
In simple terms, the covariance matrix for two-dimensional data can be represented as follows:
[latex]C = \begin{pmatrix} \text{cov}(x,x) & \text{cov}(x,y) \\ \text{cov}(y,x) & \text{cov}(y,y) \end{pmatrix}[/latex]
Here:
• C represents the covariance matrix
• cov(x,x) and cov(y,y) represent the variances of variables X and Y
• cov(x,y) and cov(y,x) represent the covariances of X and Y
The covariances of variables X and Y are commutative in nature. In math, commutative simply means that the values can be moved around in the formula and the answer will still be the same, so cov(x,y) = cov(y,x). Thus, the covariance matrix is symmetric.
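The 2×2 covariance matrix can be sketched in plain Python. This uses the sample covariance (n − 1 denominator); the height/weight data and helper names are invented for illustration:

```python
# Helper: arithmetic mean of a list
def mean(xs):
    return sum(xs) / len(xs)

# Sample covariance of two equal-length lists (n - 1 denominator)
def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical heights (cm) and weights (kg)
x = [160, 165, 170, 175, 180]
y = [55, 60, 62, 68, 72]

# Diagonal entries are variances; off-diagonal entries are equal (symmetry)
C = [[cov(x, x), cov(x, y)],
     [cov(y, x), cov(y, y)]]
print(C)
```

The positive off-diagonal value reflects positive covariance: taller people in this made-up sample tend to weigh more.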
6.2 Correlation
So far, we’ve established that covariance indicates the extent to which two random variables increase
or decrease in tandem with each other. Correlation tells us both the strength and the direction of this
relationship. Correlation is best used for multiple variables that express a linear relationship with one
another. When we assume a correlation between two variables, we are essentially deducing that a
change in one variable impacts a change in another variable. Correlation helps us to determine
whether or not, and how strongly, changes in various variables relate to each other.
Types of correlation
Correlation is classified into the following types based on diverse values: positive correlation, negative correlation, and no correlation. Let's explore those now.
1. Positive correlation
2. Negative correlation
3. Zero or no correlation
1. Positive correlation
Two variables are considered to have a positive correlation if they are directly proportional. That
is, if the value of one variable increases, then the value of the other variable will also increase. A
perfect positive correlation holds a value of “1”. On a graph, positive correlation appears as
follows:
2. Negative correlation
A perfect negative correlation holds a value of “-1” which means that, as the value of one variable
increases, the value of the second variable decreases (and vice versa). In graph form, this is how
negative correlation might look:
3. Zero or no correlation
The value “0” denotes that there is no correlation. It indicates that there is no relationship between
the two variables, so an increase or decrease in one variable is unrelated to an increase or decrease in
the other variable. A graph showing zero correlation will follow a random distribution of data points,
as opposed to a clear line:
This table displays the varying degrees of correlation between two values.
Correlation coefficient:
Correlation is calculated using a method known as "Pearson's Product-Moment Correlation," or simply the "correlation coefficient." Correlation is usually denoted by the italic letter r. The following formula is normally used to find r for two variables X and Y:
[latex]r = \frac{\sum_{i}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i}(x_{i}-\bar{x})^{2}\sum_{i}(y_{i}-\bar{y})^{2}}}[/latex]
Where:
• r represents the correlation coefficient
• xi represents the values of variable X in the data sample
• x̄ represents the mean (average) of the values of X
• yi represents the values of variable Y in the data sample
• ȳ represents the mean (average) of the values of Y
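Pearson's r can be computed directly from its formula. The helper name `pearson_r` and the data below are illustrative:

```python
from math import sqrt

# Pearson's product-moment correlation coefficient
def pearson_r(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# Perfectly linear data gives r = 1; real data falls between -1 and 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```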
Correlation matrix:
We use correlation coefficients to determine the relationship between two variables, for example, to find
the number of hours a student must spend working to complete a project within the desired timeline. But
what if we want to evaluate the correlation among multiple pairs of variables? Then we use a correlation
matrix.
A correlation matrix is essentially a table depicting the correlation coefficients for various variables. The
rows and columns contain the value of the variables, and each cell shows the correlation coefficient.
Example:
In the following correlation matrix, we can see the correlation coefficient for each possible combination of
variables. In this example, we’re looking at:
•Hours spent exercising (per week)
•Cardio fitness level
•Height
•Age
A correlation matrix is used for the following:
• To conveniently encapsulate datasets: For large datasets containing thousands of rows, the
correlation matrix is an effective way of summarizing the correlation among various variables
of that dataset. The relationship between two variables can easily be interpreted by looking at
raw data in the matrix.
• To perform regression testing: Multiple linear regression is difficult to interpret when two
independent variables in the dataset are highly correlated. Two variables which are highly
correlated can easily be located using a correlation matrix, as its convenient structure helps with
quick and easy detection.
• Input for various analyses: Analysis methods such as structural equation models can use the
correlation matrix as an input for their calculation process.
7. Conditional Probability
Conditional probability is the probability of an event occurring given that another event has already occurred.
The concept is one of the quintessential concepts in probability theory. The concept of conditional probability
is primarily related to the Bayes’ theorem, which is one of the most influential theories in statistics.
The conditional probability formula is:
[latex]P(A|B) = \frac{P(A \cap B)}{P(B)}[/latex]
Where:
• P(A|B) – the conditional probability; the probability of event A occurring given that event B has
already occurred
• P(A ∩ B) – the joint probability of events A and B; the probability that both events A and B occur
• P(B) – the probability of event B
The formula above is applied to the calculation of the conditional probability of events that are
neither independent nor mutually exclusive.
Another way of calculating conditional probability is by using Bayes' theorem. The theorem can be used to determine the conditional probability of event A, given that event B has occurred, by knowing the conditional probability of event B given that event A has occurred, as well as the individual probabilities of events A and B. Mathematically, Bayes' theorem can be denoted in the following way:
[latex]P(A|B) = \frac{P(B|A)P(A)}{P(B)}[/latex]
Finally, conditional probabilities can be found using a tree diagram. In the tree diagram, the
probabilities in each branch are conditional.
Conditional Probability for Independent Events
Two events are independent if the probability of the outcome of one event does not influence
the probability of the outcome of another event. Due to this reason, the conditional probability
of two independent events A and B is:
[latex]P(A|B) = P(A) \qquad P(B|A) = P(B)[/latex]
We want to predict the probability that it will rain on a given day based only on whether there are clouds at sunrise. In this example, our hypothesis is that it will rain, and our data is that there are clouds at sunrise. Therefore, we're answering the question: P(rain | clouds at sunrise).
Unfortunately, we only know:
• It rains 25% of all days, or P(rain) = 25%,
• It is cloudy at sunrise only 15% of days, or P(clouds at sunrise) = 15%,
• From our data, we discovered there were clouds at sunrise on 50% of the days it rained, or P(clouds at
sunrise | rain) = 50%
This is a classic application of Bayes' theorem, since we have a dataset about the information described by our conditional probability. Applying Bayes' rule:
[latex]P(\text{rain}|\text{clouds at sunrise}) = \frac{P(\text{clouds at sunrise}|\text{rain})\,P(\text{rain})}{P(\text{clouds at sunrise})} = \frac{0.50 \times 0.25}{0.15} = 83.33\%[/latex]
...and finally, stating the complete answer:
P(rain | clouds at sunrise) = 83.33%
This is a really useful result: with no information at all, we only know there's a 25% chance of rain. However, by spotting clouds at sunrise, we know there's now over an 80% probability it will rain!
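The same calculation as a short Python sketch:

```python
# Bayes' theorem applied to the rain-given-clouds example above
p_rain = 0.25                # P(rain)
p_clouds = 0.15              # P(clouds at sunrise)
p_clouds_given_rain = 0.50   # P(clouds at sunrise | rain)

# P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds)
p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds
print(f"{p_rain_given_clouds:.2%}")  # 83.33%
```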
Univariate, Bivariate, and Multivariate Analysis
1. Univariate analysis: Univariate analysis is the most basic statistical data analysis technique. When the data contains only one variable and doesn't deal with cause-and-effect relationships, a univariate analysis technique is used.
Here is one example of univariate analysis:
In a survey of a classroom, the researcher may be looking to count the number of boys and girls. In this instance, the data would simply reflect the number, i.e. a single variable and its quantity, as per the table below. The key objective of univariate analysis is simply to describe the data and find patterns within it. This is done by looking at the mean, median, mode, dispersion, variance, range, standard deviation, etc.
Univariate analysis is conducted through several ways which are mostly descriptive in nature –
•Frequency Distribution Tables
•Histograms
•Frequency Polygons
•Pie Charts
•Bar Charts
2. Bivariate analysis
Bivariate analysis is slightly more analytical than univariate analysis. When the data set contains two variables and researchers aim to undertake comparisons between them, bivariate analysis is the right technique.
Here is one simple example of bivariate analysis:
In a survey of a classroom, the researcher may be looking to analyse the ratio of students who scored above 85% with respect to their genders. In this case, there are two variables: gender = X (independent variable) and result = Y (dependent variable). A bivariate analysis will measure the correlation between the two variables.
Bivariate analysis is conducted using –
•Correlation coefficients
•Regression analysis
3. Multivariate analysis:
Multivariate analysis is a more complex form of statistical analysis, used when there are more than two variables in the data set.
Here is an example of multivariate analysis –
A doctor has collected data on cholesterol, blood pressure, and weight. She also collected data on
the eating habits of the subjects (e.g., how many ounces of red meat, fish, dairy products, and
chocolate consumed per week). She wants to investigate the relationship between the three
measures of health and eating habits.
In this instance, a multivariate analysis would be required to understand the relationship of each
variable with each other.
Percentiles
A percentile indicates the value below which a given percentage of observations falls. For instance, imagine the growth chart of baby girls from birth until 2 years old. By following the lines, we can see that 98% of the one-year-old baby girls weigh less than 11.5 kg.
Girls’ growth chart. Courtesy: World Health Organisation Child Growth Standards
Another popular example is a country's income distribution. The 99th percentile is the income below which 99% of the country earns, while the remaining 1% earns more. In the case of the UK on the graph below, this is £75,000.
Quartiles
Quartiles are special percentiles, which divide the data into quarters. The first quartile, Q1, is the same
as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median is called
both the second quartile, Q2, and the 50th percentile.
Generally speaking, outliers are those data points that fall outside the range from Q1 − 1.5 × IQR to Q3 + 1.5 × IQR, where the interquartile range IQR = Q3 − Q1.
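Quartiles and the 1.5 × IQR outlier fences can be sketched with `statistics.quantiles` (which uses the "exclusive" interpolation method by default); the data list below is invented:

```python
import statistics

# Ten invented observations; 30 looks suspicious
data = [2, 4, 5, 5, 6, 7, 8, 9, 11, 30]

# n=4 splits the data into quarters: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, outliers)
```

Note that different quantile interpolation methods (and different libraries) can return slightly different quartile values for small samples.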
Box Plots
Box plots (also called box-and-whisker plots) illustrate:
• how concentrated the data is, and
• how far the extreme values are from most of the data.
The minimum and maximum values are the endpoints of the axis (−15 and 5 in this case). Q1 marks one end of the blue box and Q3 the other end.
The 'whiskers' (shown in purple) extend from the ends of the box to the smallest and largest data values.
There are also box plots with dots marking outlier values (shown in red). In those cases, the whiskers do not extend to the minimum and maximum values.
Boxplots on a Normal Distribution
There is a subtle nuance with boxplots on normal distributions: even though they are called quartile 1 (Q1) and quartile 3 (Q3), they don't really represent 25% of the data! They represent 34.135% each, and the area in between is not 50%, but 68.27%.
Comparison of a boxplot of a nearly normal distribution (top) and a PDF for a normal distribution
(bottom). Courtesy: Wikipedia
Moments
Moments describe various aspects of the nature and shape of our distribution.
#1 — The first moment is the mean of the data, which describes the location of the distribution.
#2 — The second moment is the variance, which describes the spread of the distribution. Higher values mean the data is more spread out.
#3 — The third moment is the skewness, which is basically a measure of how lopsided a distribution is. A positive skew means the bulk of the data leans left with a long right tail, so the mean lies to the right of the bulk of the data; and vice versa for a negative skew.
#4 — The fourth moment is the kurtosis, which describes how thick the tails are and how sharp the peak is. It indicates how likely it is to find extreme values in the data: higher values make outliers more likely. This sounds a lot like spread (variance) but is subtly different.
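The four moments can be computed directly from their definitions. A minimal sketch using the population forms (the function name and the sample data are illustrative):

```python
from statistics import mean

# Mean, variance, skewness, kurtosis (population definitions)
def moments(xs):
    m = mean(xs)                                       # 1st moment: location
    n = len(xs)
    var = sum((x - m) ** 2 for x in xs) / n            # 2nd moment: spread
    sd = var ** 0.5
    skew = sum(((x - m) / sd) ** 3 for x in xs) / n    # 3rd: lopsidedness
    kurt = sum(((x - m) / sd) ** 4 for x in xs) / n    # 4th: tail thickness
    return m, var, skew, kurt

# A right-skewed sample: the long right tail (10) pulls the mean
# above the bulk of the data, so the skewness comes out positive
print(moments([1, 2, 2, 3, 3, 3, 4, 10]))
```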