
Miafe B. Almendralejo

Statistical Analysis with Software Applications

BSA – 2, 2nd Sem, 2021 – 22

1. What is descriptive statistics?

Descriptive statistics are a form of statistical analysis used to provide a
summary of a dataset. Descriptive statistics describe the basic features of the
data in a study. They provide simple summaries about the sample and enable us to present
data in a meaningful way. They allow a simpler interpretation of the data and let you
characterize your data based on its properties.

Descriptive statistics concerning measures of position provide information about
the normality of the distribution of the sample. This is needed to identify what type of statistical
analysis can be used later, for instance, parametric or non-parametric tests. Descriptive
statistics can only provide summary information about datasets; they can be summaries of
samples, variables, or results. There are four types of descriptive statistics: the
measures of frequency, measures of central tendency, measures of variability or dispersion, and
measures of position.

Researchers summarize the data, in a useful way, with the help of numerical and
graphical tools such as charts, tables, and graphs, to represent the data accurately.
Moreover, text is presented in support of the diagrams to explain what they represent.

2. Enumerate, discuss, include important principles, formulas and provide examples of the following:

Measures of Central Tendency

Measures of central tendency (also called measures of location or central location) are
methods to describe what is typical for a group (set) of data. Central tendency does not
show us what is typical about each individual piece of data; rather, it gives us an overview of the whole
picture of the entire data set. It tells us what is normal or average for a given set of data.

Measures of central tendency locate the distribution at various points, and you can use
them when you want to show the average or the most commonly indicated response. Measures of
central tendency allow researchers to determine the typical numerical point in a set of data. They
also tell researchers where the center value lies in the distribution of the data. Measures of central
tendency use a single value to describe the center of a data set. In the following sections, we
will look at the mean, mode and median, learn how to calculate them, and see under what
conditions each is most appropriate to be used.

 Mean

The mean is an essential concept in mathematics and statistics. The mean or average is the
most popular and well-known measure of central tendency, and it is calculated by finding the
sum of the study data and dividing it by the total number of data points. One of its important properties
is that it minimizes error in the prediction of any one value in your data set. That is, it is the
value that produces the lowest amount of error compared with all other values in the data set.
Another important property of the mean is that it includes every value in your data set as part
of the calculation. In addition, the mean is the only measure of central tendency for which the sum
of the deviations of each value from the mean is always zero.

The mean formula for a set of given observations can be expressed as:
Mean = (Sum of Observations) ÷ (Total Numbers of Observations)

Similarly, we have a mean formula for grouped data, which is expressed as: x̄ = Σfx / Σf

where:
x̄ = the mean value of the set of given data
f = frequency of each class
x = mid-interval value of each class

Example:

Let’s say you have a sample of 5 girls and 6 boys.


The girls’ heights in inches are: 62, 70, 60, 63, 65.

To calculate the mean height for the group of girls you need to add the data together: 62 + 70 +
60 + 63 + 65 = 320. Now, you take the sum (320) and divide it by the total number of girls (5):
320 / 5 = 64. So, our mean is 64.
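To illustrate, here is a small Python sketch of both mean formulas. The heights come from the example above, while the grouped-data classes are made-up values for demonstration only:

# Simple mean: (sum of observations) / (total number of observations)
girls_heights = [62, 70, 60, 63, 65]
mean = sum(girls_heights) / len(girls_heights)
print(mean)  # 64.0

# Grouped data: x = mid-interval value of each class, f = class frequency
midpoints = [10, 20, 30]   # hypothetical class midpoints
frequencies = [4, 6, 2]    # hypothetical class frequencies
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)
print(round(grouped_mean, 2))  # (40 + 120 + 60) / 12 = 18.33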

 Median

The median is the middle value in a set of data. It is one of the measures used to find
the central value of the given data. In a set of numerical data, the median is the point at which an
equal number of data values lie above and below it. It is calculated by first listing the
data in numerical order and then locating the value in the middle of the list. When working with an
odd-sized set of data, the median is the middle number. The importance of the median value is that it
provides an idea of the distribution of the data. If the mean and the median of the data set
are the same, then the dataset is evenly distributed from the smallest to the highest values.

The median formula is: {(n + 1) ÷ 2}th

where:
“n” – is the number of items in the set and
“th” – just means the (n)th number.

Example:

The median in a set of 9 data values is the number in the fifth place: 21, 22, 24, 24, 26, 27, 28, 29, 31.
The middle number in the above set is 26, as there are 4 numbers above it and 4 numbers
below. When working with an even-sized set of data, you find the average of the two middle numbers.
For example, in a data set of 10, you would find the average of the numbers in the fifth and sixth
places.

For example, in the dataset of 10 numbers 21, 22, 24, 24, 26, 27, 28, 29, 31, 32, the average
of the two middle numbers is: (26 + 27) / 2 = 26.5
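The {(n + 1) ÷ 2}th-position rule can be checked with a short Python sketch (a minimal implementation, not a library routine):

def median(values):
    # Sort the data, then take the middle value (odd n)
    # or the average of the two middle values (even n).
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median([21, 22, 24, 24, 26, 27, 28, 29, 31]))      # 26
print(median([21, 22, 24, 24, 26, 27, 28, 29, 31, 32]))  # 26.5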
 Mode

The mode is the number that appears most frequently in the set of data. The mode of a
set of data is the number in the set that occurs most often. A set of data may have one mode,
more than one mode, or no mode at all.


The mode is important because it lets us know which value(s) in a dataset occur most
commonly. It is useful for finding the most frequently occurring value in categorical data, where the
mean and median cannot be calculated, and it gives us an idea of where the “center” of a dataset
is located, although the median and mean are more commonly used.

The mode formula for grouped data is given as:

Mode = L + h × (fm − f1) / ((fm − f1) + (fm − f2))

where:
L – the lower limit of the modal class
h – the size of the class interval
fm – the frequency of the modal class
f1 – the frequency of the class preceding the modal class
f2 – the frequency of the class succeeding the modal class

We might find the above formula written in different forms in some references, for example:

Mode = L + ((fm − f1) / (2fm − f1 − f2)) × h

Example:

Consider this dataset showing the retirement age of 11 people, in whole years: 54, 54, 54, 55,
56, 57, 57, 58, 58, 60, 60. This table shows a simple frequency distribution of the retirement age
data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
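Python's standard library can report the mode directly; statistics.multimode also handles data sets with more than one mode and works on nominal data. A quick check of the retirement-age example:

from statistics import multimode

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(multimode(ages))  # [54] -- the most frequent value

# multimode also works on nominal (label) data and reports ties:
print(multimode(["brown", "blue", "brown", "green", "blue"]))  # ['brown', 'blue']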
The mean and median can only be used with numerical data. The mode can be used
with both numerical and nominal data, that is, data in the form of names or labels. Eye color, gender,
and hair color are all examples of nominal data. The mean is the preferred measure of central
tendency since it considers all of the numbers in a data set. However, the mean is extremely
sensitive to outliers, or extreme values that are much higher or lower than the rest of the values
in a data set. The median is preferred in cases where there are outliers, since the median only
considers the middle values.

Please use the following summary table to identify the best measure of central
tendency with respect to the different types of variable.

Type of Variable               Best Measure of Central Tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

Measures of Variability

Measures of variability describe how far apart data points lie from each other and from
the center of a distribution. Along with measures of central tendency, measures of variability
give you descriptive statistics that summarize your data. Variability is also referred to as spread,
scatter, or dispersion. Measures of variability complement the averages and allow us to interpret
them much better. In statistics, variability describes the spread of the data values in a given dataset. In
other words, it shows how the data are dispersed around the mean (the central value). The study
of dispersion has a key role in statistical analysis.

The simplest measure of dispersion or variability is the range. This tells us how spread
out our data is. To calculate the range, you subtract the smallest number from the
largest number. Just like the mean, the range is very sensitive to outliers. The variability of a
data set is important because it helps researchers understand how much the data spread out
around the data set's midpoint, and it also helps researchers compare different sets of data.

Example:

Imagine you have to compare the performance of two groups of students on the final math exam.
You find that the average math test results are identical for both groups.

Group of students A: 56, 58, 60, 62, 64


Group of students B: 40, 50, 60, 70, 80

Both of these groups have mean scores of 60. However, in group A the individual scores are
concentrated around the center – 60. All students in A have a very similar performance. There is
consistency.

On the other hand, in group B the mean is also 60 but the individual scores are not even close
to the center. One score is quite small – 40 and one score is very large – 80. We can conclude
that there is greater dispersion in group B.
 Range

The range is the most straightforward measure of variability to calculate, and its purpose
is simple and easy to understand as well. The range is simply the difference between the largest and
smallest values in a data set. It shows how much variation from the average exists. You might
guess that a low range tells us that the data points are very close to the mean, and a high range
shows the opposite.

The importance of the range is that it quickly and easily informs us of how wide the scores are.
The range also gives us a good indicator of variability when we have a distribution without
extreme values. When paired with measures of central tendency, the range can tell us about the
span of the distribution.

Here is the formula for calculating the range: Range = maximum value – minimum value

Example:

73, 79, 84, 87, 88, 91, 94. We can easily calculate the range: Range = 94 – 73 = 21.

For the groups of students from the previous example:
Group A: 56, 58, 60, 62, 64; Range = 64 – 56 = 8
Group B: 40, 50, 60, 70, 80; Range = 80 – 40 = 40

You see that the data values in Group A are much closer to the mean than the ones in Group B.
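The range calculation is a one-liner in Python; here it is applied to the two groups above:

def value_range(values):
    # Range = maximum value - minimum value
    return max(values) - min(values)

print(value_range([56, 58, 60, 62, 64]))  # Group A: 8
print(value_range([40, 50, 60, 70, 80]))  # Group B: 40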

 Interquartile range

The interquartile range (IQR) measures the spread of the middle half of your data. It is
the range of the middle 50% of your sample. Use the IQR to assess the variability where most
of your values lie. Larger values indicate that the central portion of your data spreads out further,
while smaller values show that the middle values cluster more tightly. The interquartile
range is one of several measures of variability; equivalently, it is the difference between the median
of the upper half and the median of the lower half of a dataset.

The interquartile range is the best measure of variability for skewed distributions or data
sets with outliers. Because it's based on values that come from the middle half of the
distribution, it's unlikely to be influenced by outliers. The formula for finding the interquartile
range takes the third quartile value and subtracts the first quartile value.

IQR = Q3 – Q1
Equivalently, the interquartile range is the region between the 75th and 25th percentiles
(75 – 25 = 50% of the data). Using the IQR formula, we need to find the values for Q3 and Q1.
To do that, simply order your data from low to high and split the data into four equal portions.

With an Even Sample Size:

For the sample (n=10) the median diastolic blood pressure is 71 (50% of the values are above
71, and 50% are below). The quartiles can be determined in the same way we determined the
median, except we consider each half of the data set separately.

Figure 9 - Interquartile Range with Even Sample Size:

There are 5 values below the median (the lower half); the middle value of that half is 64, which is the first
quartile. There are 5 values above the median (the upper half); the middle value of that half is 77, which is the
third quartile. The interquartile range is 77 – 64 = 13; the interquartile range is the range of the
middle 50% of the data.

With an Odd Sample Size:

When the sample size is odd, the median and quartiles are determined in the same
way. Suppose in the previous example, the lowest value (62) were excluded, and the sample
size was n=9. The median and quartiles are indicated below.

Figure 10 - Interquartile Range with Odd Sample Size:

When the sample size is 9, the median is the middle number, 72. The quartiles are determined in
the same way, looking at the lower and upper halves, respectively.

There are 4 values in the lower half, so the first quartile is the mean of the 2 middle values in the
lower half [(64+64)/2=64]. The same approach is used in the upper half to determine the third
quartile [(77+81)/2=79].
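A small Python sketch of this median-of-halves method follows. The full blood-pressure data set is not reproduced in the text, so the values below are assumed; they are chosen to be consistent with the medians and quartiles quoted above:

def iqr(values):
    # Split the ordered data into halves (excluding the overall
    # median when n is odd) and take the median of each half.
    data = sorted(values)
    n = len(data)

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    q1 = med(data[: n // 2])      # median of the lower half
    q3 = med(data[n - n // 2:])   # median of the upper half
    return q3 - q1

bp = [62, 63, 64, 64, 70, 72, 76, 77, 81, 81]  # assumed n = 10 sample
print(iqr(bp))      # 77 - 64 = 13
print(iqr(bp[1:]))  # odd case (62 excluded): 79 - 64 = 15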

 Variance

The term variance refers to a statistical measurement of the spread between numbers in
a data set. More specifically, variance measures how far each number in the set is from the
mean and thus from every other number in the set.
It is important for calculating other statistics, such as the standard deviation. The
higher the variance, the more spread out your data are. Researchers also compare the variances of
samples to assess whether the populations they come from differ from each other.

The formula for the (population) variance is: σ² = Σ(x − μ)² / N, where μ is the mean and N is the number of data points. For a sample, the sum of squared deviations is divided by (n − 1) instead.

Example:

Let’s calculate the variance of the following data set: 2, 7, 3, 12, 9.

The first step is to calculate the mean. The sum is 33 and there are 5 data points; therefore, the
mean is 33 ÷ 5 = 6.6. Then you take each value in the data set, subtract the mean, and square the
difference. For instance, for the first value: (2 − 6.6)² = 21.16. The squared differences for all
values are added: 21.16 + 0.16 + 12.96 + 29.16 + 5.76 = 69.20. The sum is then divided by the
number of data points: 69.20 ÷ 5 = 13.84. The variance is 13.84.
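The same steps translate directly into Python (population variance, dividing by N as in the worked example):

def variance(values):
    # 1) mean, 2) squared deviations, 3) average of the squares
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

print(variance([2, 7, 3, 12, 9]))  # ≈ 13.84, matching the worked example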

 Standard Deviation

The standard deviation also provides information on how much variation from the mean
exists. However, the standard deviation goes further than the range and shows how each value in
a dataset varies from the mean. As with the range, a low standard deviation tells us that the data
points are very close to the mean, and a high standard deviation shows the opposite.

The standard deviation is important because it identifies the spread of scores by stating intervals.
You can use it when you want to show how "spread out" the data are, and it is helpful to know
when your data are so spread out that the mean is affected.

The standard deviation formula for a sample of a population is: s = √( Σ(x − x̄)² / (n − 1) )

Example:

If we use the math results from the earlier example:

Group of students A: 56, 58, 60, 62, 64. The mean is 60.

Let’s find the standard deviation of the math exam scores by hand. We use simple values for
the purposes of easy calculations. Now, let’s replace the values in the formula:

s = √( ((56−60)² + (58−60)² + (60−60)² + (62−60)² + (64−60)²) / (5 − 1) ) = √(40 / 4) = √10 ≈ 3.16

The result above shows that, on average, every math exam score in group A is
approximately 3.16 points away from the mean of 60. Of course, you can calculate the above
values with a calculator instead of by hand.

Note: The above formula is for a sample of a population. The standard deviation of an entire
population is represented by the Greek lowercase letter sigma and divides by N instead:

σ = √( Σ(x − μ)² / N )
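Both versions of the formula can be sketched in a few lines of Python:

import math

def sample_std(values):
    # Sample formula: divide the squared deviations by (n - 1).
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))

def population_std(values):
    # Population formula (sigma): divide by N instead.
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

scores_a = [56, 58, 60, 62, 64]
print(round(sample_std(scores_a), 2))      # 3.16
print(round(population_std(scores_a), 2))  # 2.83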

Measures of Position

A measure of position determines the position of a single value in relation to other values
in a sample or a population data set. Unlike the mean and the standard deviation, descriptive
measures based on quantiles are not sensitive to the influence of a few extreme observations.
For this reason, descriptive measures based on quantiles are often preferred over those based
on the mean and standard deviation (Weiss 2010).

Measures of position give us a way to see where a certain data point or value falls in a
sample or distribution. A measure can tell us whether a value is about the average, or whether
it’s unusually high or low.

Measures of position are used for quantitative data that falls on some numerical scale.
Sometimes, measures can be applied to ordinal variables, those variables that have an order,
like first, second…fiftieth. Statisticians often talk about the position of a value, relative to other
values in a set of data. The most common measures of position are percentiles, quartiles, and
standard scores (aka, z-scores).
 Percentile Ranks

Percentile rank is a common statistical measurement that you can use for everything
from comparing standardized test scores to analyzing weight distribution in a sample.
Statisticians often use percentile rank to get an idea of how a particular assessment score or
result compares with others in a set. Additionally, understanding percentile rank can give you an
insight into how well you're performing on any given assessment.

The percentile rank (PR) of a given score is the percentage of scores in its frequency
distribution that are less than that score. Percentile ranks are commonly used to clarify the
interpretation of scores on standardized tests.

The percentile rank formula is: R = (P / 100) × (n + 1)

where:
R – the rank order (position) of the score
P – the percentile
n – the number of scores in the distribution

Example: Consider a data set of the following numbers: 122, 112, 114, 17, 118, 116, 111, 115, 112.
You are required to calculate the 25th percentile rank.

R = (P / 100) × (n + 1)
= (25 / 100) × (9 + 1) = 2.5

So the 25th percentile lies halfway between the 2nd and 3rd scores of the ordered data
(17, 111, 112, 112, 114, 115, 116, 118, 122), i.e. (111 + 112) / 2 = 111.5.
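A short Python sketch of the R = (P/100)(n + 1) rule follows; interpolating between ranks when R is fractional is a common convention and is assumed here:

data = sorted([122, 112, 114, 17, 118, 116, 111, 115, 112])

p, n = 25, len(data)
r = p / 100 * (n + 1)          # rank position: 2.5
k, frac = int(r), r - int(r)
value = data[k - 1] + frac * (data[k] - data[k - 1])  # interpolate (1-based ranks)
print(r, value)                # 2.5 111.5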
 Decile

A decile is a quantitative method of splitting up a set of ranked data into 10 equally large
subsections. A decile rank arranges the data in order from lowest to highest on a
scale of one to ten, where each successive number corresponds to an increase of 10 percentage
points. Deciles are similar to quartiles, but where quartiles split the data into four equal parts,
deciles split the data into ten parts: the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and
100th percentiles.

Deciles facilitate the categorization of data sets and observations into samples for
convenience of analysis and measurement. Dividing the data set into ten equal parts requires
nine cut points (D1 through D9). The purpose of deciles is to determine the largest and smallest values
based on a metric.

Decile formula for ungrouped data: D(x) = the [x(n + 1) / 10]th ordered value, x = 1, 2, …, 9

Decile formula for grouped data: D(x) = l + ((xN/10 − cf) / f) × h, where l is the lower boundary of the decile class, N the total frequency, cf the cumulative frequency of the class before the decile class, f the frequency of the decile class, and h the class width.

Example:

Suppose a data set consists of the following numbers: 24, 32, 27, 32, 23, 62, 45, 80, 59, 63, 36,
54, 57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30. The values of the first two deciles have to be
calculated. Sorted, the data are 23, 24, 27, 30, 32, 32, 32, 33, 36, 36, 42, 45, 51, 54, 55, 55, 56,
57, 59, 62, 63, 72, 80, with n = 23. D(1) is the [(23 + 1)/10]th = 2.4th ordered value:
24 + 0.4 × (27 − 24) = 25.2. D(2) is the 4.8th ordered value: 30 + 0.8 × (32 − 30) = 31.6.
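The same computation in Python, using the x(n + 1)/10 position rule with linear interpolation:

def decile(values, i):
    # D(i) is the i(n + 1)/10-th ordered value (1-based position).
    data = sorted(values)
    pos = i * (len(data) + 1) / 10
    k, frac = int(pos), pos - int(pos)
    if k >= len(data):            # clamp at the largest observation
        return data[-1]
    return data[k - 1] + frac * (data[k] - data[k - 1])

data = [24, 32, 27, 32, 23, 62, 45, 80, 59, 63, 36,
        54, 57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30]
print(decile(data, 1))  # 25.2
print(decile(data, 2))  # 31.6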
 Quartile Ranks

Quartile, as it sounds phonetically, is a statistical term describing a division of the data into four
quarters. Quartiles divide the data points of a data set into four quarters on the number line: the
lowest quarter, the two middle quarters, and the highest quarter. Quartiles tell us about the spread of
a data set by breaking the data set into quarters, just like the median breaks it in half.

Quartiles help to give us a fuller picture of our data set as a whole. The first and third
quartiles give us information about the internal structure of our data. The middle half of the data
falls between the first and third quartiles, and is centered about the median.

Quartile formulas:

• First Quartile (Q1) = [(n+1)/4]th term, also known as the lower quartile.
• Second Quartile (Q2) = [(n+1)/2]th term, which is the 50th percentile or the median.
• Third Quartile (Q3) = [3(n+1)/4]th term, the 75th percentile, also known as the upper quartile.
• The interquartile range is calculated as: Upper Quartile – Lower Quartile.

Example: 5, 7, 4, 4, 6, 2, 8. Put them in order: 2, 4, 4, 5, 6, 7, 8

Cut the list into quarters and the result is:

Quartile 1 (Q1) = 4
Quartile 2 (Q2), which is also the Median, = 5
Quartile 3 (Q3) = 7

Sometimes a "cut" is between two numbers, the Quartile is the average of the two numbers.

Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8. The numbers are already in order

Cut the list into quarters: In this case Quartile 2 is half way between 5 and 6: Q2 = (5+6)/2 = 5.5

And the result is:


Quartile 1 (Q1) = 3
Quartile 2 (Q2) = 5.5
Quartile 3 (Q3) = 7
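The cut-into-quarters procedure from these examples can be written as a short Python function. Note that several quartile conventions exist; this one leaves the overall median out of both halves when n is odd, matching the examples above:

def quartiles(values):
    data = sorted(values)
    n = len(data)

    def med(xs):  # median of a sorted list
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    return med(data[: n // 2]), med(data), med(data[n - n // 2:])

print(quartiles([5, 7, 4, 4, 6, 2, 8]))           # (4, 5, 7)
print(quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8]))  # (3, 5.5, 7)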
 Standard Scores (z-scores)

Z-scores are a way to compare results from a test to a “normal” population. A z-score
gives you an idea of how far from the mean a data point is. But more technically it’s a measure
of how many standard deviations below or above the population mean a raw score is. A z-score
can be placed on a normal distribution curve. Z-scores range from -3 standard deviations (which
would fall to the far left of the normal distribution curve) up to +3 standard deviations (which
would fall to the far right of the normal distribution curve). In order to use a z-score, you need to
know the mean μ and also the population standard deviation σ.

Z-scores are a way to compare results to a “normal” population. Results from tests or
surveys have thousands of possible results and units; those results can often seem
meaningless.

The basic z score formula for a sample is: z = (x – μ) / σ

Example:

Let’s say you have a test score of 190. The test has a mean (μ) of 150 and a standard deviation
(σ) of 25. Assuming a normal distribution, your z score would be: (190 – 150) / 25 = 1.6.

The z score tells you how many standard deviations from the mean your score is. In this
example, your score is 1.6 standard deviations above the mean.
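The test-score example in Python:

def z_score(x, mu, sigma):
    # z = (x - mu) / sigma: distance from the mean in standard deviations
    return (x - mu) / sigma

print(z_score(190, 150, 25))  # 1.6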

Empirical Rule, Chebyshev's Theorem, and Coefficient of Variation

 Empirical Rule

The Empirical Rule is an approximation that applies only to data sets with a bell-shaped
relative frequency histogram. It estimates the proportion of the measurements that lie within
one, two, and three standard deviations of the mean. The empirical rule, also known
as the 68-95-99.7 rule, states that for normal distributions, 68% of observed data points will lie
within one standard deviation of the mean, 95% will fall within two standard deviations, and
99.7% will occur within three standard deviations. In short, the empirical rule tells you what
percentage of your data falls within a certain number of standard deviations from the mean:

 68% of the data falls within one standard deviation of the mean.
 95% of the data falls within two standard deviations of the mean.
 99.7% of the data falls within three standard deviations of the mean.

To calculate the data ranges associated with the empirical rule percentages of 68%,
95%, and 99.7%, start by calculating the sample mean (x̅) and standard deviation (s). Then
input those values into the formulas below to derive the ranges.

Data Range            Percentage of Data in the Range
[x̅ − s, x̅ + s]        68%
[x̅ − 2s, x̅ + 2s]      95%
[x̅ − 3s, x̅ + 3s]      99.7%
Example:

Scores on IQ tests have a bell-shaped distribution with mean μ = 100 and standard deviation
σ = 10. Discuss what the Empirical Rule implies concerning individuals with IQ scores of 110,
120, and 130.

Solution:

A sketch of the IQ distribution is given in Figure 3.2.2.3. The Empirical Rule states that
approximately 68% of the IQ scores in the population lie between 90 and 110,
approximately 95% of the IQ scores in the population lie between 80 and 120, and
approximately 99.7% of the IQ scores in the population lie between 70 and 130.

Figure 3.2.2.3: Distribution of IQ Scores.

Since 68% of the IQ scores lie within the interval from 90 to 110, it must be the case that
32% lie outside that interval. By symmetry approximately half of that 32%, or 16% of all IQ
scores, will lie above 110. If 16% lie above 110, then 84% lie below. We conclude that the IQ
score 110 is the 84th percentile.

The same analysis applies to the score 120. Since approximately 95% of all IQ scores
lie within the interval from 80 to 120, only 5% lie outside it, and half of them, or 2.5% of all
scores, are above 120. The IQ score 120 is thus higher than 97.5% of all IQ scores, and is quite
a high score.

By a similar argument, only 15/100 of 1% of all adults, or about one or two in every
thousand, would have an IQ score above 130. This fact makes the score 130 extremely high.
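The rule is easy to verify empirically. The following sketch simulates IQ-like scores with NumPy (assuming it is installed) and counts the share of values within one, two, and three standard deviations:

import numpy as np

rng = np.random.default_rng(0)
iq = rng.normal(loc=100, scale=10, size=100_000)  # simulated bell-shaped scores

for k in (1, 2, 3):
    share = np.mean(np.abs(iq - 100) <= k * 10)
    print(f"within {k} sd: {share:.1%}")  # close to 68%, 95%, 99.7%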

 Chebyshev’s Theorem

Chebyshev’s Theorem is a fact that applies to all possible data sets. It describes the
minimum proportion of the measurements that must lie within one, two, or more standard
deviations of the mean: for any k > 1, at least 1 − 1/k² of the data lie within k standard
deviations of the mean.

The Empirical Rule does not apply to all data sets, only to those that are bell-shaped,
and even then it is stated in terms of approximations. A result that applies to every data set is
known as Chebyshev’s Theorem.
Example:

A sample of size n = 50 has mean x̄ = 28 and standard deviation s = 3. Without knowing anything
else about the sample, what can be said about the number of observations that lie in the interval
(22, 34)? What can be said about the number of observations that lie outside that interval?

Solution:

The interval (22, 34) is the one formed by adding and subtracting two standard deviations
from the mean. By Chebyshev’s Theorem, at least 3/4 of the data are within this interval. Since
3/4 of 50 is 37.5, this means that at least 37.5 observations are in the interval. But one cannot
take a fractional observation, so we conclude that at least 38 observations must lie inside the
interval (22, 34).

If at least 3/4 of the observations are in the interval, then at most 1/4 of them are outside it.
Since 1/4 of 50 is 12.5, at most 12.5 observations are outside the interval. Since again a fraction
of an observation is impossible, we conclude that at most 12 observations lie outside the
interval (22, 34).
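The bound in the example can be reproduced with a couple of lines of Python:

import math

def chebyshev_min_count(n, k):
    # At least (1 - 1/k**2) * n observations lie within k standard deviations.
    return math.ceil((1 - 1 / k**2) * n)

print(chebyshev_min_count(50, 2))  # 38, as in the worked example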

 Coefficient of variation

The coefficient of variation (relative standard deviation) is a statistical measure of the
dispersion of data points around the mean. The metric is commonly used to compare the data
dispersion between distinct series of data.

Unlike the standard deviation, which must always be considered in the context of the mean
of the data, the coefficient of variation provides a relatively simple and quick tool to compare
different data series.

Formula for the coefficient of variation:

CV = σ / μ

where:
 σ – the standard deviation
 μ – the mean
Example:

Fred wants to find a new investment for his portfolio. He is looking for a safe investment that
provides stable returns. He considers the following options for investment:

 Stocks: Fred was offered stock of ABC Corp. It is a mature company with strong
operational and financial performance. The volatility of the stock is 10%, and the
expected return is 14%.
 ETFs: Another option is an Exchange-Traded Fund (ETF) which tracks the performance
of the S&P 500 index. The ETF offers an expected return of 13% with a volatility of 7%.
 Bonds: Bonds with excellent credit ratings offer an expected return of 3% with 2%
volatility.

In order to select the most suitable investment opportunity, Fred decided to calculate the
coefficient of variation of each option. Using the formula above, he obtained the following
results:

Stocks: CV = 10% / 14% ≈ 0.71
ETF: CV = 7% / 13% ≈ 0.54
Bonds: CV = 2% / 3% ≈ 0.67

Based on the calculations above, Fred wants to invest in the ETF because it offers the
lowest coefficient of variation and therefore the most optimal risk-to-reward ratio.
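Fred's comparison in Python, using volatility as the standard deviation and expected return as the mean, as in the example:

def coefficient_of_variation(sigma, mu):
    # CV = standard deviation / mean
    return sigma / mu

options = [("Stocks", 0.10, 0.14), ("ETF", 0.07, 0.13), ("Bonds", 0.02, 0.03)]
for name, vol, ret in options:
    print(name, round(coefficient_of_variation(vol, ret), 2))
# Stocks 0.71, ETF 0.54, Bonds 0.67 -> the ETF has the lowest CV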

3. Enumerate, discuss, include important principles, formulas and provide examples of the following:

Normal Distribution

Normal distribution, also known as the Gaussian distribution, is a probability distribution
that is symmetric about the mean, showing that data near the mean are more frequent in
occurrence than data far from the mean. In graph form, normal distribution will appear as a bell
curve.

The normal distribution is the most common type of distribution assumed in technical
stock market analysis and in other types of statistical analyses. The standard normal distribution
has two parameters: the mean and the standard deviation. For a normal distribution, 68% of the
observations are within +/- one standard deviation of the mean, 95% are within +/- two standard
deviations, and 99.7% are within +/- three standard deviations. The normal distribution is
produced by the normal density function:

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))

where:
x is the variable
μ is the mean
σ is the standard deviation
Example:

If the value of the random variable is 2, the mean is 5, and the standard deviation is 4, find the
probability density function of the Gaussian distribution.

Solution: Given: variable x = 2, mean = 5 and standard deviation = 4. By the formula for the
probability density of the normal distribution, we can write:

f(2; 5, 4) = (1/(4√(2π))) e^(−(2 − 5)²/(2 × 4²)) = 0.0997 × e^(−9/32)
f(2; 5, 4) ≈ 0.0753
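The density formula and the example can be checked with a few lines of Python:

import math

def normal_pdf(x, mu, sigma):
    # f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(round(normal_pdf(2, 5, 4), 4))  # 0.0753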

Hypothesis Testing

Hypothesis testing in statistics is a way for you to test the results of a survey or
experiment to see if you have meaningful results. You’re basically testing whether your results
are valid by figuring out the odds that your results have happened by chance. If your results
may have happened by chance, the experiment won’t be repeatable and so has little use.
Hypothesis testing can be one of the most confusing aspects for students, mostly because
before you can even perform a test, you have to know what your null hypothesis is.

Hypothesis testing allows the researcher to determine whether the data from the sample
is statistically significant. Hypothesis testing is one of the most important processes for
measuring the validity and reliability of outcomes in any systematic investigation.
The formula for the test statistic in a hypothesis test about a mean (used in the example below) is:

z = (x̄ − μ) / (σ / √n)

 Null Hypothesis

The null hypothesis is always the accepted fact. Simple examples of null hypotheses
that are generally accepted as being true are:

1. DNA is shaped like a double helix.
2. There are 8 planets in the solar system (excluding Pluto).
3. Taking Vioxx can increase your risk of heart problems (a drug now taken off the market).

 P-value

The p-value provides a convenient basis for drawing conclusions in hypothesis-testing
applications. The p-value is a measure of how likely the sample results are, assuming
the null hypothesis is true; the smaller the p-value, the less likely the sample results. If the p-
value is less than α, the null hypothesis can be rejected; otherwise, the null hypothesis cannot
be rejected. The p-value is often called the observed level of significance for the test.

Example:

A principal at a certain school claims that the students in his school are of above-average
intelligence. A random sample of thirty students’ IQ scores has a mean score of 112.5. Is there
sufficient evidence to support the principal’s claim? The mean population IQ is 100 with a
standard deviation of 15.
Step 1: State the null hypothesis. The accepted fact is that the population mean is 100,
so: H0: μ = 100.
Step 2: State the alternate hypothesis. The claim is that the students have above-average
IQ scores, so: H1: μ > 100. The fact that we are looking for scores “greater than” a certain
point means that this is a one-tailed test.
Step 3: Draw a picture to help you visualize the problem.

Step 4: State the alpha level. If you aren’t given an alpha level, use 5% (0.05).
Step 5: Find the rejection region area (given by your alpha level above) from the z-table.
An area of 0.05 corresponds to a z-score of 1.645.
Step 6: Find the test statistic using the formula above.
For this set of data: z = (112.5 – 100) / (15/√30) = 4.56
Step 7: If the value from Step 6 is greater than the value from Step 5, reject the null hypothesis;
if it is less, you cannot reject the null hypothesis. In this case, it is greater (4.56 > 1.645), so you
can reject the null hypothesis.
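The whole test fits in a few lines of Python:

import math

mu0, sigma = 100, 15   # population mean and standard deviation under H0
x_bar, n = 112.5, 30   # sample mean and sample size
crit_z = 1.645         # one-tailed critical value at alpha = 0.05

z = (x_bar - mu0) / (sigma / math.sqrt(n))
print(round(z, 2))                                        # 4.56
print("reject H0" if z > crit_z else "fail to reject H0") # reject H0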

Correlation & Regression Analysis

Correlation and regression are the two analyses based on multivariate distributions; a
multivariate distribution is a distribution of multiple variables. Correlation is described as the
analysis that allows us to know the relationship between two variables 'x' and 'y', or the absence
of it. Correlation is observed when, over the course of studying the two variables, a unit change
in one variable is accompanied by an equivalent change in the other variable, whether direct or
indirect. Otherwise, the variables are said to be uncorrelated, when a change in one variable
does not produce any movement in a specific direction in the other variable. Correlation is a
statistical technique that represents the strength of the linkage between variable pairs.

Correlation can be either negative or positive. If the two variables move in the same
direction, i.e. an increase in one variable results in the corresponding increase in another
variable, and vice versa, then the variables are considered to be positively correlated. For
example, Investment and profit. On the contrary, if the two variables move in different directions
so that an increase in one variable leads to a decline in another variable and vice versa, this
situation is known as a negative correlation. For example, Product price and demand.

On the other hand, regression analysis predicts the value of the dependent variable
based on the known value of the independent variable, assuming that there is an average
mathematical relation between two or more variables. Regression is a statistical technique,
based on the average mathematical relationship between two or more variables, used to
estimate the change in the metric dependent variable due to the change in one or more
independent variables. It plays an important role in many human activities since it is a powerful
and flexible tool that can be used to forecast past, present, or future events based on past or
present events.

For example, the future profit of a business can be estimated on the basis of past
records. There are two variables x and y in a simple linear regression, wherein y depends on x,
or we say y is influenced by x. Here y is called the dependent variable, or criterion, and x the
independent variable, or predictor. The line of regression of y on x is expressed as: Y = a + bx

where:
a = constant (intercept)
b = regression coefficient (slope)

For example, in patients attending an accident and emergency unit (A&E), we could use
correlation and regression to determine whether there is a relationship between age and urea
level, and whether the level of urea can be predicted for a given age.
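A minimal least-squares sketch in Python follows; the (age, urea) pairs are invented for illustration, since the A&E data are not given in the text:

def fit_line(xs, ys):
    # Least-squares estimates: b = Sxy / Sxx, a = y_bar - b * x_bar
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

ages = [20, 35, 50, 65, 80]       # hypothetical patient ages
urea = [4.2, 5.0, 5.9, 6.8, 7.9]  # hypothetical urea levels (mmol/L)
a, b = fit_line(ages, urea)
print(round(a, 2), round(b, 3))   # a ≈ 2.89, b ≈ 0.061
print(round(a + b * 40, 2))       # predicted urea at age 40 ≈ 5.35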

Analysis of Variance

Analysis of Variance (ANOVA) is a statistical formula used to compare variances across
the means (or averages) of different groups. A range of scenarios use it to determine if there is
any difference between the means of different groups. ANOVA helps to compare these group
means to find out if they are statistically different or if they are similar. ANOVA also indirectly
reveals if an independent variable is influencing the dependent variable.

The outcome of ANOVA is the ‘F statistic’. This ratio compares the between-group variance
to the within-group variance, which ultimately produces a figure that allows a conclusion that
the null hypothesis is supported or rejected. If there is a significant difference between the
groups, the null hypothesis is not supported, and the F-ratio will be larger. The ANOVA formula
is given by:

F = MST / MSE

where:
F – the ANOVA coefficient
MST – the mean sum of squares due to the treatment
MSE – the mean sum of squares due to error

 One-Way ANOVA

The one-way analysis of variance is also known as single-factor ANOVA or simple
ANOVA. As the name suggests, the one-way ANOVA is suitable for experiments with only one
independent variable (factor) with two or more levels. For instance, the factor might be the
month of the year in which there are more flowers in the garden; there would be twelve levels.
(A runnable sketch follows the two example situations below.)

Example:

Situation1: You have a group of individuals randomly split into smaller groups and completing
different tasks. For example, you might be studying the effects of tea on weight loss and form
three groups: green tea, black tea, and no tea.

Situation 2: Similar to situation 1, but in this case the individuals are split into groups based on
an attribute they possess. For example, you might be studying leg strength of people according
to weight. You could split participants into weight categories (obese, overweight and normal)
and measure their leg strength on a weight machine.
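Situation 1 can be analyzed with SciPy's one-way ANOVA (assuming SciPy is available; the weight-loss numbers are invented for illustration):

from scipy import stats

green_tea = [2.1, 3.0, 2.6, 2.2, 2.8]  # hypothetical weight loss (kg)
black_tea = [1.8, 2.0, 2.4, 1.6, 2.1]
no_tea    = [0.9, 1.2, 1.5, 1.1, 1.0]

f_stat, p_value = stats.f_oneway(green_tea, black_tea, no_tea)
print(round(f_stat, 2), round(p_value, 4))  # a large F and small p suggest the means differ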
 Full Factorial ANOVA (also called two-way ANOVA)

Full factorial ANOVA is used when there are two or more independent variables, each
of which can have multiple levels. Full factorial ANOVA can only be used in the case of a
full factorial experiment, where every possible permutation of factors and their levels is used.
This might be the month of the year when there are more flowers in the garden, and then
the number of sunshine hours. A two-way ANOVA not only measures each independent
variable against the dependent variable, but also whether the two factors affect each other
(a sketch follows the example below).

Example:

You might want to find out if there is an interaction between income and gender for anxiety level
at job interviews. The anxiety level is the outcome, or the variable that can be measured.
Gender and Income are the two categorical variables. These categorical variables are also the
independent variables, which are called factors in a Two Way ANOVA.

The factors can be split into levels. In the above example, income level could be split into three
levels: low, middle and high income. Gender could be split into three levels: male, female, and
transgender. Treatment groups are all possible combinations of the factors. In this example
there would be 3 x 3 = 9 treatment groups.
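With pandas and statsmodels (both assumed installed), the 3 x 3 design can be fitted and the interaction tested; the anxiety scores below are invented for illustration:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Two observations per treatment group (3 genders x 3 income levels).
df = pd.DataFrame({
    "anxiety": [6, 7, 5, 4, 5, 3, 8, 6, 7, 5, 4, 6, 3, 4, 5, 7, 6, 8],
    "gender":  ["male", "female", "transgender"] * 6,
    "income":  ["low"] * 6 + ["middle"] * 6 + ["high"] * 6,
})

model = ols("anxiety ~ C(gender) * C(income)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects and the gender:income interaction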

References:

1. https://www.intellspot.com/descriptive-statistics-examples/amp/
2. https://www.scribbr.com/statistics/descriptive-statistics/
3. https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
4. https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Position/index.html
5. https://saylordotorg.github.io/text_introductory-statistics/s06-05-the-empirical-rule-and-chebysh.html
6. https://www.mathsisfun.com/data/quartiles.html
7. https://www.statisticshowto.com/measures-of-position/
8. https://www.researchconnections.org/research-tools/descriptive-statistics
9. https://cosmologist.info/teaching/STAT/CHAP4.pdf
10. http://samples.jbpub.com/9781449686697/45670_ch06_101_132.pdf
11. https://www.vedantu.com/maths/differences-between-correlation-and-regression
12. https://byjus.com/maths/normal-distribution/
13. https://stats.libretexts.org/Las_Positas_College_Math_Statistics_and_Probability
14. https://corporatefinanceinstitute.com/resources/knowledge/other/coefficient-of-variation.
15. https://www.ncbi.nlm.nih.gov/pmc/articles/
