➢ Functions of Statistics
The functions of statistics may be enumerated as follows:
Query:hello@everstudy.in www.everstudy.co.in 2
➢ Main Limitations of Statistics
1. Qualitative Aspect Ignored
2. It does not deal with individual items
3. It does not depict entire story of phenomenon
4. It is liable to be misused
5. Results are true only on average
6. Too many methods to study problems
Averages provide us the gist and give a bird’s eye view of the huge mass of
unwieldy numerical data.
Averages are the typical values around which other items of the distribution
congregate.
An average lies between the two extreme observations of the distribution and
gives us an idea about the concentration of the values in the central part of the
distribution.
And so they are called the measures of central tendency.
Averages are also called measures of location since they enable us to locate
the position or place of the distribution in question.
Since an average represents the statistical data and is used for purposes of
comparison, it must possess the following properties:
1. It must be rigidly defined and not left to the mere estimation of the observer. If
the definition is rigid, the computed value of the average obtained by different
persons shall be similar.
2. The average must be based upon all values given in the distribution.
3. It should be easily understood. The average should possess simple and obvious
properties. It should not be too abstract for the common people.
Different methods of measuring “Central Tendency” provide us with different
kinds of averages.
The following are the main types of averages that are commonly used:
I. Mean
(i) Arithmetic mean
(ii) Weighted average
(iii) Geometric mean
(iv) Harmonic mean
II. Median
III. Mode
1. Arithmetic Mean:
The mean is the arithmetic average, and it is probably the measure of central
tendency that you are most familiar with.
Calculating the mean is very simple.
Add up all of the values and divide by the number of observations in the
dataset.
The calculation of the mean incorporates all values in the data. If you change
any value, the mean changes.
Mathematical Properties of the Arithmetic Mean:
1. The sum of the deviations of a given set of individual observations from the
arithmetic mean is always zero.
2. The sum of squares of deviations of a set of observations is the minimum
when deviations are taken from the arithmetic average.
3. If each value of a variable X is increased or decreased by a constant k, the
arithmetic mean increases or decreases by the same constant; if each value is
multiplied by k, the arithmetic mean is also multiplied by k.
4. If we are given the arithmetic mean and number of items of two or more
groups, we can compute the combined average of these groups by applying the
following formula:
Combined mean: $\bar{X}_{12} = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$
where $n_1$, $n_2$ are the numbers of items and $\bar{X}_1$, $\bar{X}_2$ the arithmetic means of the two groups.
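Properties 1 and 4 can be checked numerically. The following is a minimal sketch; the two groups are hypothetical data chosen only for illustration, and the function names are not from the original notes.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)


def combined_mean(means, sizes):
    """Combined mean of several groups, from each group's mean and size."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)


group_a = [2, 4, 6]   # hypothetical group 1
group_b = [10, 20]    # hypothetical group 2

mean_a = arithmetic_mean(group_a)
mean_b = arithmetic_mean(group_b)

# Property 1: deviations from the arithmetic mean sum to zero.
deviation_sum = sum(x - mean_a for x in group_a)

# Property 4: the combined mean equals the mean of the pooled data.
pooled = combined_mean([mean_a, mean_b], [len(group_a), len(group_b)])
```

Here `pooled` agrees with `arithmetic_mean(group_a + group_b)`, which is what the combined-mean formula promises.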
Merits of Arithmetic Mean:
(i) Simplicity.
(ii) Certainty.
(iii) Based on all values.
(iv) Algebraic treatment possible.
(v) Basis of comparison.
(vi) Accuracy test possible.
(vii) No scope for estimated value.
Demerits of Arithmetic Mean:
(i) Effect of extreme values.
(ii) Mean value may not figure in the series.
(iii) Unsuitability.
(iv) Misleading conclusions.
(v) Cannot be used in case of qualitative phenomenon.
(vi) Gets distorted by extreme value of the series.
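Methods of calculating the arithmetic mean, by type of series:
Individual Series: Direct Method $\bar{X} = \dfrac{\sum X}{N}$; Shortcut Method $\bar{X} = A + \dfrac{\sum d}{N}$; Step deviation Method $\bar{X} = A + \dfrac{\sum d'}{N} \times i$
Discrete Series: Direct Method $\bar{X} = \dfrac{\sum fX}{\sum f}$; Shortcut Method $\bar{X} = A + \dfrac{\sum fd}{\sum f}$; Step deviation Method $\bar{X} = A + \dfrac{\sum fd'}{\sum f} \times i$
Continuous Series: Direct Method $\bar{X} = \dfrac{\sum fm}{\sum f}$; Shortcut Method $\bar{X} = A + \dfrac{\sum fd}{\sum f}$; Step deviation Method $\bar{X} = A + \dfrac{\sum fd'}{\sum f} \times i$
Here A is the assumed mean, m is the mid-point of a class, d is the deviation from A (X − A, or m − A for a continuous series), i is the common factor (class interval) and d' = d / i.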
2. Weighted Average
A weighted average is a type of average where each observation in the data set
is multiplied by a predetermined weight before calculation.
In calculating a simple average (arithmetic mean) all observations are treated
equally and assigned equal weight.
A weighted average assigns weights that determine the relative importance of
each data point.
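Weighted Mean: $\bar{X}_w = \dfrac{\sum w_i x_i}{\sum w_i}$, where $w_i$ is the weight assigned to observation $x_i$.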
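As a minimal sketch of the weighted average, assuming the usual definition (sum of weight times value, divided by the sum of the weights), the marks and weights below are hypothetical:

```python
def weighted_mean(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


marks = [80, 60, 90]    # hypothetical marks in three papers
weights = [2, 1, 1]     # the first paper counts double

weighted = weighted_mean(marks, weights)      # (160 + 60 + 90) / 4
simple = weighted_mean(marks, [1, 1, 1])      # equal weights give the simple mean
```

With equal weights the function reduces to the simple arithmetic mean, which is the contrast the text draws.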
3. Geometric Mean:
A geometric mean is a mean or average which shows the central tendency of a set
of numbers by using the product of their values.
For a set of n observations, a geometric mean is the nth root of their product.
The geometric mean G.M. for a set of numbers x1, x2, …, xn is given as
$\mathrm{G.M.} = (x_1 \cdot x_2 \cdots x_n)^{1/n}$
The geometric mean of two numbers, say x and y, is the square root of their product,
$\sqrt{xy}$. For three numbers x, y and z, it is the cube root of their product, i.e., $(xyz)^{1/3}$.
In order to make our calculation easy and less time consuming we use the concept
of logarithms in the calculation of geometric means.
For a grouped frequency distribution, the geometric mean G.M. is
$\log \mathrm{G.M.} = \dfrac{1}{N}\left(f_1 \log x_1 + f_2 \log x_2 + \dots + f_n \log x_n\right) = \dfrac{1}{N}\sum_{i=1}^{n} f_i \log x_i$, where $N = \sum f_i$.
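The logarithmic route to the geometric mean can be sketched as follows. This is an illustrative sketch: it uses natural logarithms and `math.exp` in place of base-10 logs and antilogs (the result is the same), and the function names and data are assumptions.

```python
import math


def geometric_mean(values):
    """nth root of the product, computed as exp(mean of logs)."""
    if any(x <= 0 for x in values):
        raise ValueError("all observations must be positive")
    return math.exp(sum(math.log(x) for x in values) / len(values))


def grouped_geometric_mean(values, frequencies):
    """G.M. of a frequency distribution: log G.M. = (1/N) * sum(f * log x)."""
    if any(x <= 0 for x in values):
        raise ValueError("all observations must be positive")
    total = sum(frequencies)
    log_sum = sum(f * math.log(x) for x, f in zip(values, frequencies))
    return math.exp(log_sum / total)


gm_pair = geometric_mean([2, 8])                    # square root of 16
gm_grouped = grouped_geometric_mean([2, 8], [3, 1]) # fourth root of 2*2*2*8
```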
2. If all the observations assumed by a variable are constants, say K >0, then the
G.M. of the observation is also K
3. The geometric mean of the ratio of two variables is the ratio of the geometric
means of the two variables
4. The geometric mean of the product of two variables is the product of their
geometric means
Suppose G1 and G2 are the geometric means of two series of sizes n1 and
n2 respectively. The geometric mean G of the combined groups is:
$\log G = \dfrac{n_1 \log G_1 + n_2 \log G_2}{n_1 + n_2}$
or, $G = \operatorname{antilog}\left[\dfrac{n_1 \log G_1 + n_2 \log G_2}{n_1 + n_2}\right]$
The geometric Mean has certain specific uses, some of them are:
• It gives more weight to small items.
• If any item is negative, the geometric mean may become imaginary.
4. Harmonic Mean
The most important criterion for it is that none of the observations should be
zero.
The most common examples of ratios are that of speed and time, cost and unit
of material, work and time etc.
➢ Properties of Harmonic Mean
1. If all the observations taken by a variable are constant, say k, then the
harmonic mean of the observations is also k
2. For positive observations, the harmonic mean has the least value when compared
to the geometric mean and the arithmetic mean (H.M. ≤ G.M. ≤ A.M.)
If x, a, y is an arithmetic progression then 'a' is called arithmetic mean.
If x, a, y is a geometric progression then 'a' is called geometric mean.
If x, a, y form a harmonic progression then 'a' is called harmonic mean.
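For two positive numbers x and y, with AM = arithmetic mean, GM = geometric mean
and HM = harmonic mean:
$\mathrm{AM} \times \mathrm{HM} = \mathrm{GM}^2$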
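The relation between the three means can be verified on a pair of numbers. This sketch assumes the standard definition of the harmonic mean (the number of observations divided by the sum of their reciprocals), which is not spelled out in these notes; the values 4 and 9 are illustrative.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """n divided by the sum of reciprocals; no observation may be zero."""
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean is undefined when an observation is zero")
    return len(values) / sum(1 / x for x in values)


pair = [4, 9]                 # illustrative pair of positive numbers
am = arithmetic_mean(pair)    # 6.5
gm = geometric_mean(pair)     # 6.0
hm = harmonic_mean(pair)      # 72/13, about 5.54
```

For this pair, `am * hm` equals `gm ** 2` (both are 36), and the ordering `hm <= gm <= am` holds.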
II. Median
➢ Median is the middle value of the series when arranged in order of the magnitude.
➢ When a series is divided into more than two parts, the dividing values are called
Partition values.
The very first thing to be done with raw data is to arrange them in ascending or
descending order.
In Layman’s terms:
As we have 5 numbers, the middle number will be the 3rd number, which can also
be calculated as
{(n+1)/2}th number = (5+1)/2 = 6/2 = 3rd number, which is 8.
So the Median is 8.
For an even number of items, there is no value exactly in the middle of the series.
In such a situation the median is conventionally taken to be halfway between the two
middle items, i.e. their arithmetic mean.
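The odd and even cases can be sketched in a few lines. The data below are illustrative values (the original five numbers are not reproduced in these notes), chosen so that the middle item is 8.

```python
def median(values):
    """Middle value of the ordered series; mean of the two middle items if n is even."""
    ordered = sorted(values)          # the first step: arrange the raw data in order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty series is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n + 1) / 2)th item
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([11, 8, 3, 5, 12])   # ordered: 3, 5, 8, 11, 12
even_case = median([3, 5, 8, 11])      # halfway between 5 and 8
```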
Related Positional Measures:
The median divides the series into two equal parts.
Similarly there are certain other measures which divide the series into certain equal
parts
➢ Quartiles:
Quartiles are the measures which divide the data into four equal parts; each portion
contains an equal number of observations.
1. The lower half of a data set is the set of all values that are to the left of the
median value when the data has been put into increasing order.
2. The upper half of a data set is the set of all values that are to the right of the
median value when the data has been put into increasing order.
1. The first quartile, denoted by Q1, is the median of the lower half of the data
set. This means that about 25% of the numbers in the data set lie below Q1 and
about 75% lie above Q1.
2. The second quartile also called median and denoted by Q2, has 50% of the
items below it and 50% of the items above it.
3. The third quartile, denoted by Q3, is the median of the upper half of the data
set. This means that about 75% of the numbers in the data set lie below Q3 and
about 25% lie above Q3.
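The "median of the lower half, median of the upper half" description can be sketched directly. This follows that description only; other conventions (such as taking the size of the ((N+1)/4)th item) can give slightly different quartile values. The function names and data are illustrative.

```python
def _median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]                                   # values left of the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values right of the median
    return _median_of_sorted(lower), _median_of_sorted(ordered), _median_of_sorted(upper)


q_odd = quartiles([1, 2, 3, 4, 5, 6, 7])
q_even = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
```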
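Formulae for calculating the median and partition values:
Median: Individual Series and Discrete Series: size of the ((N+1)/2)th item; Continuous Series: size of the (N/2)th item, with formula $M = L + \dfrac{N/2 - c.f.}{f} \times i$
First Quartile: Individual Series and Discrete Series: size of the ((N+1)/4)th item; Continuous Series: size of the (N/4)th item, with formula $Q_1 = L + \dfrac{N/4 - c.f.}{f} \times i$
Third Quartile: Individual Series and Discrete Series: size of the (3(N+1)/4)th item; Continuous Series: size of the (3N/4)th item, with formula $Q_3 = L + \dfrac{3N/4 - c.f.}{f} \times i$
Here L is the lower limit of the class containing the item, c.f. is the cumulative frequency of the preceding class, f is the frequency of that class and i is its class interval.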
➢ Deciles
Deciles: Deciles divide the series into ten equal parts and are generally denoted by
D.
➢ There are nine deciles, denoted by D1, D2, …, D9, which are called the first
decile, second decile and so on
➢ Percentiles
Percentiles: Percentiles divide the series into hundred equal parts and are generally
denoted by P.
Merits of Median:
(i) Simple measure of central tendency.
(ii) It is not affected by extreme observations.
(iii) Possible even when data is incomplete.
(iv) Median can be determined by graphic presentation of data.
(v) It has a definite value.
(vi) Simple to calculate and understand.
(vii) It is a positional value, not a calculated value.
Demerits of Median:
(i) Not based on all the items in the series, as it indicates the value of middle items.
(ii) Not suitable for algebraic treatment.
(iii) Arranging the data in ascending order takes much time.
(iv) Affected by fluctuations of items.
(v) It cannot be computed exactly where the number of items in a series is even.
III. Mode
Mode is that value of the variable which occurs or repeats itself the maximum
number of times.
The mode is the most “fashionable” size in the sense that it is the most common and
typical, and is defined by Zizek as “the value occurring most frequently in a series
of items and around which the other items are distributed most densely.”
In the words of Croxton and Cowden, the mode of a distribution is the value at
the point where the items tend to be most heavily concentrated.
According to A.M. Tuttle, Mode is the value which has the greatest frequency
density in its immediate neighborhood.
In the case of individual observations, the mode is that value which is repeated
the maximum number of times in the series. The value of the mode can also be denoted
by the letter ‘Z’.
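For individual observations, finding the mode amounts to counting repetitions. The sketch below returns every value that attains the highest frequency, since a series can have more than one mode; the function name and data are illustrative.

```python
from collections import Counter


def modes(values):
    """Return the value(s) repeated the maximum number of times, in sorted order."""
    if not values:
        raise ValueError("mode of an empty series is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


single_mode = modes([2, 3, 3, 5, 3, 7, 5])   # 3 occurs three times
two_modes = modes([1, 1, 2, 2, 3])           # 1 and 2 both occur twice
```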
(vi) It is less affected by extreme values.
… to identify the modal value.
DISPERSION
• According to Dr. Bowley, “dispersion is the measure of the variation between
items.”
➢ Objectives of Dispersion
a) To determine the reliability of an average.
b) To compare the variability of two or more series.
c) To serve as the basis of other statistical measures such as correlation etc.
d) To serve as the basis of statistical quality control.
➢ Classification of Measures of Dispersion
Measures of dispersion are classified into absolute measures and relative measures.
➢ Range:
It is the simplest method of studying dispersion. Range is the difference
between the largest value and the smallest value of a series.
Range = X max − X min
Merits of Range Demerits of Range
➢ Quartile Deviation
✓ The concept of ‘Quartile Deviation’ takes into account only the values of
the ‘Upper quartile’ (Q3) and the ‘Lower quartile’ (Q1).
✓ Inter quartile range is the difference between the Upper Quartile (Q3) and the Lower
Quartile (Q1).
Inter-quartile Range
= Q3 – Q1
Semi-quartile Deviation (Quartile Deviation)
= (Q3 – Q1) / 2
Co-efficient of quartile deviation
= (Q3 – Q1) / (Q3 + Q1)
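The range and the quartile-based measures can be sketched together. Q1 and Q3 are taken as given inputs here, and the numbers are hypothetical; the quartile deviation is assumed to be half the inter-quartile range, as its "semi" name indicates.

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


def quartile_measures(q1, q3):
    """Inter-quartile range, quartile deviation and its coefficient."""
    iqr = q3 - q1
    return {
        "interquartile_range": iqr,
        "quartile_deviation": iqr / 2,          # semi-interquartile range
        "coefficient_of_qd": iqr / (q3 + q1),   # relative measure, unit-free
    }


spread = value_range([4, 9, 15, 21])        # 21 - 4
measures = quartile_measures(q1=20, q3=60)  # hypothetical quartiles
```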
➢ Mean Deviation:
Mean deviation (also called average deviation) is defined as the value obtained by taking
the average of the deviations of the various items from a measure of central tendency
(mean, median or mode), ignoring negative signs.
Merits of Mean Deviation Demerits of Mean Deviation
➢ Standard Deviation:
“Standard deviation or S.D. is the square root of the mean of the squared deviations of
the individual scores from the mean of the distribution.”
Standard deviation is calculated as the square root of average of squared deviations
taken from actual mean.
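But the divisor differs depending on whether the data form a sample or the whole population:
Standard deviation for Sample: $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$
Standard deviation for Population: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$
(Some authors use ‘x’ as the deviation of individual scores from the mean)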
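The difference between the sample and the population standard deviation can be sketched as follows, assuming the usual convention that the sum of squared deviations is divided by n − 1 for a sample and by N for a population. The scores are illustrative.

```python
import math


def population_sd(values):
    """Square root of the mean of squared deviations from the mean (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sample_sd(values):
    """Same computation, but with divisor n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, squared deviations sum to 32
sd_population = population_sd(scores)   # sqrt(32 / 8) = 2.0
sd_sample = sample_sd(scores)           # sqrt(32 / 7), slightly larger
```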
Co-efficient of Standard deviation = S.D. / Mean
➢ Uses of S.D:
(i) When the most accurate, reliable and stable measure of variability is wanted.
(ii) When more weight is to be given to extreme deviations from the mean.
(iii) When coefficient of correlation and other statistics are subsequently computed.
(v) When scores are to be properly interpreted with reference to the normal curve.
(vii) When we want to test the significance of the difference between two statistics.
➢ Coefficient of Dispersion
The coefficient of dispersion is used whenever we want to compare the variability
of two series which differ widely in their averages.
The coefficient of variation (C.V.) is 100 times the coefficient of dispersion based
on standard deviation.
C.V. = 100 x (S.D. / Mean)
CV gives the percentage which σ is of the test mean. It is thus a ratio which
is independent of the units of measurement.
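The unit-independence of the coefficient of variation can be illustrated with the same heights expressed in two units. This is a sketch with hypothetical data, using the population standard deviation (divisor N) as an assumption.

```python
import math


def coefficient_of_variation(values):
    """C.V. = 100 * (S.D. / Mean), with S.D. computed using divisor N."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return 100 * sd / mean


heights_cm = [150, 160, 170, 180, 190]   # hypothetical heights
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]    # the same heights in metres

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
```

Both calls give the same percentage (up to floating-point rounding), even though the standard deviations themselves differ by a factor of 100.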
CV is restricted in its use owing to certain ambiguities in its interpretation. It is
defensible when used with ratio scales—scales in which the units are equal and
there is a true zero or reference point.
Two cases arise in the use of CV with ratio scales:
(2) When the means are unequal, the units of the scale being the same.
➢ Types of Distributions
Bernoulli Distribution
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and
0 (failure), and a single trial.
So the random variable X which has a Bernoulli distribution can take value 1 with
the probability of success, say p, and the value 0 with the probability of failure,
say q or 1-p.
Probability of getting a head = Probability of getting a tail = 0.5 since there are
only two possible outcomes.
$P(X = x) = p^x (1-p)^{1-x}$, for x = 0 or 1
E(X) = p
The variance of a random variable from a Bernoulli distribution is:
$\mathrm{Var}(X) = p(1-p)$
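A minimal sketch of the Bernoulli probability mass function, mean and variance follows; the function names are illustrative.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    return p ** x * (1 - p) ** (1 - x)


def bernoulli_mean(p):
    return p


def bernoulli_variance(p):
    return p * (1 - p)


# Fair coin: head and tail are equally likely.
p_head = bernoulli_pmf(1, 0.5)
p_tail = bernoulli_pmf(0, 0.5)

# Mean and variance recomputed from the pmf for p = 0.3, as a cross-check.
p = 0.3
mean_from_pmf = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
var_from_pmf = sum(x ** 2 * bernoulli_pmf(x, p) for x in (0, 1)) - mean_from_pmf ** 2
```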
Binomial Distribution
A distribution where only two outcomes are possible, such as success or failure,
gain or loss, win or lose and where the probability of success and failure is same
for all the trials is called a Binomial Distribution.
2. There are only two possible outcomes in a trial- either a success or a failure.
4. The probability of success and failure is same for all trials. (Trials are
identical.)
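The mathematical representation of binomial distribution (Probability mass
function) is given by:
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, for x = 0, 1, …, n
where n is the number of trials, p is the probability of success and q = 1 − p.
Mean = µ = n*p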
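The binomial probabilities and the mean n*p can be checked numerically. This sketch assumes the standard probability mass function C(n, x) p^x (1 − p)^(n − x); the choice of 10 trials with p = 0.5 is illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent, identical trials."""
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n_trials, p_success = 10, 0.5     # e.g. ten tosses of a fair coin
probabilities = [binomial_pmf(x, n_trials, p_success) for x in range(n_trials + 1)]

total_probability = sum(probabilities)                            # should be 1
mean_from_pmf = sum(x * pr for x, pr in enumerate(probabilities)) # should be n * p
```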
• Asking 200 people if they watch ABC news.
• Drawing 5 cards from a deck for a poker hand (done without replacement, so
not independent)
Normal distribution :
The normal distribution describes the behavior of many natural and social
phenomena.
The sum of a large number of (small) random variables often turns out to be
approximately normally distributed, which contributes to its widespread application.
Any distribution is known as Normal distribution if it has the following
characteristics:
1. The mean, median and mode of the distribution coincide.
2. The curve of the distribution is bell-shaped and symmetrical about the line
x=μ.
3. The total area under the curve is 1.
4. The mean divides the curve into 2 equal parts
5. Its quartile deviation, Q.D. = (2/3) σ (approximately)
Limits        Area %
µ ± σ         68.27
µ ± 1.96σ     95
µ ± 2σ        95.45
µ ± 3σ        99.73
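These areas can be reproduced from the error function. The sketch below uses the identity that, for a normal distribution, the area within k standard deviations of the mean is erf(k / √2); the function name is illustrative.

```python
import math


def area_within(k):
    """Proportion of a normal distribution lying between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2))


# Percentages for the limits listed in the table.
areas = {k: 100 * area_within(k) for k in (1, 1.96, 2, 3)}
```

The resulting percentages are approximately 68.27, 95.00, 95.45 and 99.73.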
The parameters of normal distribution are µ (mean) and σ (standard
deviation)
Mean= E(X) = µ
Variance = Var(X) = σ²
Poisson distribution
The Poisson distribution is a discrete distribution with a single parameter ‘m’
(also written λ), the mean number of occurrences.
If the probability of success “p” is small and the number of trials “n” is large, the
binomial distribution can be approximated by the Poisson distribution with m = np.
Every Poisson distribution is skewed to the right. This is the reason why the Poisson
probability distributions have been called the probability distributions of rare
events.
1. Any successful event should not influence the outcome of another successful
event.
2. The average rate of success must be the same over a short interval as over a
longer interval, so that the expected number of successes is proportional to the
length of the interval.
Some examples are
Mean = E(X) = µ = λ
Variance = Var(X) = µ = λ
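That the mean and the variance both equal λ can be checked numerically. This sketch assumes the standard Poisson probability mass function e^(−λ) λ^x / x!, which is not reproduced in these notes; λ = 3 is an illustrative value, and the infinite sum is truncated at 60 terms, where the remaining probability is negligible.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x! for x = 0, 1, 2, ..."""
    if x < 0:
        raise ValueError("x must be a non-negative integer")
    return math.exp(-lam) * lam ** x / math.factorial(x)


lam = 3.0                  # illustrative rate, e.g. three calls per interval
support = range(60)        # truncated support; the tail beyond is negligible

mean_from_pmf = sum(x * poisson_pmf(x, lam) for x in support)
var_from_pmf = sum(x ** 2 * poisson_pmf(x, lam) for x in support) - mean_from_pmf ** 2
```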
Exponential Distribution
From the expected life of a machine to the expected life of a human, exponential
distribution successfully delivers the result.
Consider the call center example one more time. What about the interval of time
between the calls ?
Here λ > 0 is the parameter of the distribution, often called the rate
parameter.
For survival analysis, λ is called the failure rate of a device at any time t, given that
it has survived up to t.
Data collection:
Collection of data is the first and most important stage in any statistical survey.
The method for collection of data depends upon various factors such as
objective, scope, nature of investigation and availability of resources.
Sources of Data
1. Statistical sources refer to data that are collected for some official purposes and
include censuses and officially conducted surveys.
2. Non-statistical sources refer to the data that are collected for other administrative
purposes or for the private sector.
data and analyzing it using statistical methods. This is done to make estimations
about population characteristics.
Types of Data
There are two types of data – primary data and secondary data.
1. Primary data is the data collected for the first time keeping in view the
objective of the survey. Interviews, questionnaires and telephone/mail surveys
are all methods of collecting primary data.
2. Secondary data is any information that is used for the current investigation
but is obtained from data which has been collected and used by some other
agency or person in a separate investigation or survey.
Both primary and secondary data may be collected either by census or by
sampling methods. Depending on how accurate the data needs to be for the
statistical survey, an appropriate method can be adopted.
➢ Primary data
Primary data is data collected by the investigator for the purpose
of a specific inquiry or study.
Such data is original in character and is generated by a survey conducted by
individuals or a research institution or any organization.
They are likely to be more reliable. However, cost of collection of such data is
much higher.
Primary data is collected by either a census method or a sampling method
They are sent to the respondents with a covering letter soliciting
cooperation from the respondents (respondents are the people who respond
to questions in the questionnaire).
The respondents are asked to give correct information and to mail the
questionnaire back.
5. Information through a schedule filled by investigators:
Information can be collected through schedules filled by investigators
through personal contact. In order to get reliable information, the
investigator should be well trained, tactful, unbiased and hard working.
A schedule is suitable for an extensive area of investigation through
investigator’s personal contact.
The problem of non-response is minimized. There is a difference between a
schedule and a questionnaire.
A schedule is a form that the investigator fills personally, while surveying
the units or individuals from the sample (respondent).
A questionnaire is a form sent (usually mailed) by an investigator to
respondents. The respondent has to fill it and then send it back to the
investigator.
➢ Secondary data:
Any information, that is used for the current investigation but is obtained from
some data, which has been collected and used by some other agency or
person in a separate investigation, or survey, is known as secondary data.
They are available in a published or unpublished form.
1. Published sources: The various sources of published data are:
• Reports and official publications of international and national
organizations as well as central and state governments
• Publications of several local bodies such as municipal corporations and
district boards
• Financial and economic journals
• Annual reports of various companies
• Publications brought out by research agencies and research scholars
• Some of the journals (both academic and non-academic) are published at
regular intervals like yearly, monthly, weekly whereas, other
publications are more ad hoc.
• Internet is a powerful source of secondary data, which can be accessed at
any time for any further analysis of the study.
2. Unpublished sources: Unpublished data such as records maintained by
various government and private offices, studies made by research
institutions and scholars can also be used where necessary.
1) Reliability of data
2) Suitability of the data
3) Adequacy of data
Primary data is collected by the researcher himself, whereas secondary data was
collected earlier by some other agency or person.
➢ Questionnaire design
Questionnaire design is the process of designing the format and questions in the
survey instrument that will be used to collect data about a particular phenomenon.
A. Closed ended questions
i. Dichotomous
ii. Multiple choice (4 to 5 options; neutral point)
iii. Likert scale (Agree or disagree)
iv. Semantic differential (scale connecting bipolar words)
v. Importance scale (importance of some attribute)
vi. Rating scale (Excellent to poor)
➢ Essentials of a good questionnaire
Success of this method of collection of data depends mainly on proper drafting of the
questionnaire. You have to keep the following points in mind while preparing a
questionnaire:
The task of composing questionnaire may be considered more an art than a science. It
needs a great deal of experience, expertise, and creativity.
Determine the Data to be collected
➢ Statistical Survey
A Statistical Survey is a scientific process of collection and analysis of numerical
data.
Surveys differ from each other with regard to their purpose, field of study, scope, and
the source of information. The standard tools for any statistical study are:
• relevance
• timeliness
➢ Stages of Statistical surveys
Statistical surveys involve two stages namely –
1. Planning and
2. Execution.
Planning: A properly planned investigation can lead to the best results with least
cost and time. There are five steps involved in planning the survey.
Sampling – Concept, Process and Techniques
➢ Sampling:
Sampling is the process of selecting a number of individuals for a study in such a way
that the individuals represent the larger group from which they were selected.
Defining the population of concern
Specifying a sampling method for selecting items or events from the frame
➢ Types of sample:
Cluster sample
1. Simple random sample:
✓ It is applicable when population is small, homogeneous & readily
available
✓ All subsets of the frame are given an equal probability.
✓ Each element of the frame thus has an equal probability of selection.
✓ It provides for greatest number of possible samples. This is done by
assigning a number to each unit in the sampling frame.
✓ A table of random number or lottery system is used to determine which
units are to be selected.
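The lottery procedure can be sketched with Python's standard `random` module: each unit in the frame is numbered and a sample is drawn without replacement, so every subset of the given size is equally likely. The frame of 100 numbered units and the seed are illustrative assumptions.

```python
import random


def simple_random_sample(frame, sample_size, seed=None):
    """Draw sample_size distinct units from the frame, all subsets equally likely."""
    if sample_size > len(frame):
        raise ValueError("sample size cannot exceed the size of the frame")
    rng = random.Random(seed)       # a seed makes the draw reproducible
    return rng.sample(frame, sample_size)


frame = list(range(1, 101))         # units numbered 1 to 100
sample = simple_random_sample(frame, 10, seed=42)
```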
Pros: Cons:
2. Systematic sample:
It involves a random start and then proceeds with the selection of every kth
element from then onwards. In this case, k = (population size / sample size).
It is important that the starting point is not automatically the first in the list, but
is instead randomly chosen from within the first to the kth element in the list.
In a systematic sample, after you decide the sample size, arrange the elements
of the population in some order and select terms at regular intervals from the
list.
A simple example would be to select every 10th name from the telephone
directory (an 'every 10th' sample, also referred to as 'sampling with a skip of
10').
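The "random start, then every kth element" rule can be sketched as follows. The list of 100 numbered entries stands in for a directory and is purely illustrative; k is taken as the integer part of population size divided by sample size.

```python
import random


def systematic_sample(frame, sample_size, seed=None):
    """Pick a random start within the first k elements, then every kth element."""
    k = len(frame) // sample_size            # the skip interval
    if sample_size < 1 or k < 1:
        raise ValueError("sample size must be between 1 and the size of the frame")
    rng = random.Random(seed)
    start = rng.randrange(k)                 # random position among the first k
    return [frame[start + i * k] for i in range(sample_size)]


directory = list(range(1, 101))              # entries numbered 1 to 100
every_tenth = systematic_sample(directory, 10, seed=1)
```

With 100 entries and a sample of 10, k is 10, so consecutive selections are exactly ten entries apart.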
ADVANTAGES: DISADVANTAGES:
3. Stratified random sample
ADVANTAGES: DISADVANTAGES:
4. Cluster Sampling:
Cluster sampling is the process of randomly selecting intact groups, not individuals,
within the defined population that share similar characteristics.
➢ Selection process
1. One-stage sampling: All of the elements within selected clusters are included
in the sample.
2. Two-stage sampling: A subset of elements within selected clusters is
randomly selected for inclusion in the sample.
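One-stage and two-stage selection can be sketched with a single function. The clusters (four hypothetical schools with three pupils each), the parameter names and the seed are all illustrative assumptions.

```python
import random


def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly select whole clusters; optionally subsample within each one.

    per_cluster=None  -> one-stage: every element of a chosen cluster is kept.
    per_cluster=m     -> two-stage: m elements are drawn at random per cluster.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)
        else:
            sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen, sample


schools = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]}

chosen_one, one_stage = cluster_sample(schools, 2, seed=0)
chosen_two, two_stage = cluster_sample(schools, 2, per_cluster=2, seed=0)
```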
One-stage sampling. Two-stage sampling
5. Multi-Stage Sampling
It is the combination of two or more of the methods described above.
The population is divided into multiple clusters, and these clusters are further
divided and grouped into various sub groups (strata) based on similarity.
One or more clusters can be randomly selected from each stratum. This
process continues until the clusters can’t be divided any more.
For example, a country can be divided into states, cities, and urban and rural areas,
and all the areas with similar characteristics can be merged together to form a stratum.
Non- probability sampling:
1. Convenience Sampling
The process of including whoever happens to be available at the time, that is,
whoever is readily available and convenient.
Advantages: Disadvantages
2. Purposive sample:
The researcher chooses the sample based on who they think would be
appropriate for the study.
This is used primarily when there is a limited number of people that have
expertise in the area being researched
It is the process whereby the researcher selects a sample based on experience
or knowledge of the group to be sampled
It is also called “judgment” sampling
Advantages: Disadvantages
3. Quota Sampling
Quota sampling is the non-probability equivalent of stratified sampling that we
discussed earlier.
It starts with characterizing the population based on certain desired features and
assigns a quota to each subset of the population.
Then judgment is used to select subjects or units from each segment based on a
specified proportion.
For example, an interviewer may be told to sample 200 females and 300 males
between the age of 45 and 60.
Advantages Disadvantages
4. Snowball Sampling
Just as the snowball rolls and gathers mass, the sample constructed in this way
will grow in size as you move through the process of conducting a survey.
In this technique, you rely on your initial respondents to refer you to the next
respondents whom you may connect with for the purpose of your survey.
Snowball sampling can be useful when you need the sample to reflect certain
features that are difficult to find.
To conduct a survey of people who go jogging in a certain park every morning,
for example, snowball sampling would be a quick, accurate way to create the
sample.
Advantages:
The costs associated with this method are significantly lower, and you will end up
with a sample that is very relevant to your study.
Disadvantages:
The clear downside of this approach is that you may restrict yourself to only a
small, largely homogenous section of the population.
➢ Hypothesis Testing
A hypothesis is an assumption which is tested to check whether the inference
drawn from the sample of data holds true for the entire population or not.
1. Set up a Hypothesis:
The statistical hypothesis is an assumption about the value of some unknown
parameter, and the hypothesis provides some numerical value or range of
values for the parameter.
Here two hypotheses about the population are constructed: the Null
Hypothesis and the Alternative Hypothesis.
The Null Hypothesis, denoted by H0, asserts that there is no true difference
between the sample statistic and the population parameter, and that any observed
difference is accidental, caused by fluctuations in sampling.
Thus,
a null hypothesis states that
HYPOTHESIS TESTING
Null hypothesis, H0:
• States the hypothesized value of the parameter before sampling.
• The assumption we wish to test (or the assumption we are trying to reject).
• E.g. population mean µ = 20
• There is no difference between coke and diet coke.
Alternative hypothesis, HA:
• All possible alternatives other than the null hypothesis.
• E.g. µ ≠ 20, µ > 20, µ < 20
• There is a difference between coke and diet coke.
Once the hypothesis about the population is constructed, the researcher has to
decide the level of significance, i.e. the probability of rejecting the null
hypothesis when it is in fact true.
The significance level is denoted by ‘α’ and is usually defined before the
samples are drawn such that results obtained do not influence the choice.
In practice, we either take 5% or 1% level of significance.
After the hypothesis is constructed, and the significance level is decided upon,
the next step is to determine a suitable test statistic and its distribution.
Most test statistics assume the following form:
Test statistic = (Sample statistic − Hypothesized population parameter) / (Standard error of the sample statistic)
Before the samples are drawn, it must be decided which values of the test
statistic will lead to the acceptance of H0 and which will lead to its rejection.
The values that lead to rejection of H0 are called the critical region.
5. Performing Computations:
Once the critical region is identified, we compute several values for the random
sample of size ‘n.’
Then we will apply the formula of the test statistic as shown in step (3) to
check whether the sample results falls in the acceptance region or the rejection
region.
6. Decision-making:
Once all the steps are performed, the statistical conclusions can be drawn, and
the management can take decisions.
The decision involves either accepting the null hypothesis or rejecting it.
The decision that the null hypothesis is accepted or rejected depends on
whether the computed value falls in the acceptance region or the rejection
region.
Thus, to test the hypothesis, it is necessary to follow these steps systematically so that
the results obtained are accurate and do not suffer from either of the statistical errors,
viz. Type-I error and Type-II error.
For example
H0: there is no difference between the two drugs on average.
Type I error will occur if we conclude that the two drugs produce different
effects when actually there isn’t a difference.
The probability of making a Type I error when the null hypothesis is true as an
equality is called the level of significance.
Applications of hypothesis testing that only control the Type I error are often
called significance tests.
Prob(Type I error) = significance level = α
Type II error
Type II error refers to the situation when we accept the null hypothesis when
it is false.
H0: there is no difference between the two drugs on average. Type II error will
occur if we conclude that the two drugs produce the same effect when actually
there is a difference.
Prob(Type II error) = β
Statisticians avoid the risk of making a Type II error by concluding “do not reject
H0” rather than “accept H0”.
One Tail Test
An upper tailed test will reject the null hypothesis if the sample
mean is significantly higher than the hypothesized mean.
It is appropriate when H0 : µ = µ0 and HA: µ > µ0
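As a rough illustration of how α and β relate in an upper tailed test, the sketch below fixes α, finds the cutoff for the sample mean, and then computes β for one assumed true mean. All figures (µ0 = 100, sigma = 15, n = 36, true mean 105) are made up, the population standard deviation is assumed known, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def upper_tailed_test(mu0, sigma, n, alpha, true_mu):
    """Return the rejection cutoff for the sample mean and beta at true_mu."""
    std_error = sigma / sqrt(n)
    # H0 is rejected when the sample mean exceeds this cutoff (upper tail only).
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * std_error
    # beta = P(sample mean <= cutoff) when the true mean is really true_mu,
    # i.e. the probability of failing to reject a false H0.
    beta = NormalDist(mu=true_mu, sigma=std_error).cdf(cutoff)
    return cutoff, beta


cutoff, beta = upper_tailed_test(mu0=100.0, sigma=15.0, n=36, alpha=0.05, true_mu=105.0)
print(f"reject H0 if sample mean > {cutoff:.2f}; beta = {beta:.3f}")
```

Lowering α pushes the cutoff to the right and so raises β for the same sample size, which is why the two errors cannot both be made small without a larger sample.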
T test:
The T-statistic was introduced by W.S. Gosset under the pen name “Student”.
He developed the T test around 1905 for dealing with small samples in brewing
quality control, and it was published in 1908.
The T test is used to compare two samples to determine whether they came from the
same population.
Application of T Test
1. Test of Hypothesis about population(One sample t-Test )
The larger the degrees of freedom, the more closely it approximates the normal
distribution.
The curve does not touch the X axis.
[Flowchart: How many samples? Branch labels: Independent, Dependent. Outcomes: One Sample T Test, Z-Test, Independent Samples T Test, Dependent Samples T Test]
One sample t-test:
H0 : µ = µ0
It tests whether the sample mean of a variable differs significantly from a
known population mean µ0.
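A minimal sketch of the one sample t-test follows. The ten measurements and the hypothesized mean of 12.0 are made up for illustration, and the table value 2.262 is the usual two-tailed 5% value for 9 degrees of freedom taken from a standard t table. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: mu = mu0."""
    n = len(sample)
    # stdev() uses the n - 1 divisor, as the t test requires.
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


# Illustrative (made-up) measurements.
sample = [12.1, 11.8, 12.4, 12.0, 12.3, 11.9, 12.2, 12.5, 12.1, 12.2]
t_stat, df = one_sample_t(sample, mu0=12.0)

TABLE_T = 2.262  # two-tailed 5% table value for df = 9
reject = abs(t_stat) > TABLE_T
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject}")
```

Here the computed value (about 2.18) is smaller than the table value, so H0 is not rejected at the 5% level.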
One Sample T Test, Independent Sample T Test, Paired Sample T Test
➢ Z test
Given by Prof. Fisher
The Z test is also called the Standard Normal Deviate Test, Standard Normal Test,
Approximate Test and Large Sample Test.
➢ Conditions of Z Test
1. Data points should be independent of each other
3. The variances of the samples should be the same
➢ Application of Z Test
1. Test of significance for single mean
1. If the Table value > Calculated value, we accept the Null Hypothesis
2. If the Table value < Calculated value, we Reject the Null Hypothesis
➢ Table Values
2 Tailed Test ±1.645 ±1.96 ±2.58 ±2.81
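The two-tailed table values can be reproduced from the standard normal distribution. The header row of the table is missing here, so the pairing of the four values with the 10%, 5%, 1% and 0.5% levels of significance is an assumption, although the computed figures agree with it.

```python
from statistics import NormalDist

# Assumed levels of significance for the four table values.
levels = [0.10, 0.05, 0.01, 0.005]

# For a two-tailed test, alpha is split equally between the two tails.
two_tailed = {alpha: NormalDist().inv_cdf(1 - alpha / 2) for alpha in levels}

for alpha, z in two_tailed.items():
    print(f"alpha = {alpha:<5}  table value = ±{z:.3f}")
```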
Testing difference of Two proportions
The T-test is more adaptable than the Z-test, since the Z-test often requires certain
conditions to be met before it is reliable.
Additionally, the T-test has several variants that suit different needs, and T-tests are more
commonly used than Z-tests.
Z-tests are preferred over T-tests when the population standard deviations are known.
➢ ANOVA Test:
Analysis of Variance (ANOVA) is a parametric statistical technique used to
compare the means of several datasets (groups).
This technique was invented by R.A. Fisher in 1920 and is thus often referred to
as Fisher’s ANOVA as well.
The F test mainly arises when models have been fitted to the data using least
squares.
➢ Types of ANOVA
One way analysis: When we are comparing three or more groups based on one
factor variable, it is said to be one way analysis of variance (ANOVA).
For example, we may want to compare whether or not the mean output of three workers
is the same based on the working hours of the three workers.
Two way analysis: When there are two factor variables, it is said to be two
way analysis of variance (ANOVA).
For example, based on working condition and working hours, we can compare
whether or not the mean output of three workers is the same.
➢ Steps ANOVA
State Alpha
N - Total observations (total sample size)
K - Number of groups
SSb - Sum of squares between the groups
SSW - Sum of squares within the groups
MSSW - Mean sum of squares within the groups
MSSb - Mean sum of squares between the groups
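The quantities listed above combine into the F ratio as sketched below. The three groups of outputs are made-up numbers chosen to keep the arithmetic easy to follow, and the function name `one_way_anova` is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Compute SSb, SSW, MSSb, MSSW and F for a one way ANOVA."""
    all_values = [x for g in groups for x in g]
    N = len(all_values)          # total observations
    K = len(groups)              # number of groups
    grand_mean = mean(all_values)

    # Between groups: spread of the group means around the grand mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: spread of each observation around its own group mean.
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)

    mssb = ssb / (K - 1)         # df between = K - 1
    mssw = ssw / (N - K)         # df within  = N - K
    return {"SSb": ssb, "SSW": ssw, "MSSb": mssb, "MSSW": mssw, "F": mssb / mssw}


# Illustrative (made-up) outputs of three workers.
result = one_way_anova([[4, 5, 6], [6, 7, 8], [8, 9, 10]])
print(result)
```

For these figures SSb = 24 and SSW = 6, so MSSb = 12, MSSW = 1 and F = 12, which is then compared with the F table value for (K - 1, N - K) = (2, 6) degrees of freedom.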
➢ Additional tests called Post Hoc tests can be done to determine where differences
lie.
➢ It may be between first and second or second and third or may be between all of
them.
➢ Chi-Square Test
✓ The chi-square test is an important test amongst the several tests of significance;
it was developed by the statistician Karl Pearson in 1900.
✓ The distributions are positively skewed. The research hypothesis for the chi-
square is always a one-tailed test.
4. The frequency data must have a precise numerical value and must be
organized into categories or groups.
6. No group should contain very few items, say less than 10.
7. The overall number of items must also be reasonably large. It should normally
be at least 50.
df = n-1
In a contingency table
df = (r – 1)(c – 1)
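A small sketch of how the contingency-table degrees of freedom are used in a test of independence. The 2 x 2 table of observed frequencies is made up, and the function name is hypothetical; the table value 3.841 is the usual 5% chi-square value for 1 degree of freedom.

```python
def chi_square_independence(table):
    """Return the chi-square statistic and df = (r - 1)(c - 1) for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Illustrative (made-up) observed frequencies.
chi2, df = chi_square_independence([[30, 20], [20, 30]])
print(f"chi-square = {chi2:.3f}, df = {df}")
```

Every expected frequency here is 25, so the statistic is 4.0 with df = 1; since 4.0 exceeds 3.841, independence is rejected at the 5% level.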
➢ Types of Chi-Square Test:
CHI-SQUARE
Parametric: Test of Comparing Variance
Non-Parametric: Test of Goodness of fit, Test of Independence, Test of Homogeneity
Test Of Comparing Variance
Goodness Of Fit
Test Of Independence
Test Of Homogeneity
Variance value. This test can be either a two-sided test or a one-sided
test.
Goodness of fit: In the chi-square goodness of fit test, the observed sample
distribution is compared with the expected probability distribution.
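As an illustration of goodness of fit, the sketch below compares made-up observed counts from 120 rolls of a die with the counts expected if the die were fair. The function name is hypothetical; the table value 11.070 is the usual 5% chi-square value for 5 degrees of freedom.

```python
def chi_square_goodness_of_fit(observed, expected):
    """Return the chi-square statistic and df = n - 1 for n categories."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


# Illustrative (made-up) counts for the six faces of a die, 120 rolls in all.
observed = [18, 22, 20, 19, 21, 20]
expected = [20] * 6  # a fair die gives each face 120 / 6 = 20 times

chi2_fit, df_fit = chi_square_goodness_of_fit(observed, expected)
TABLE_CHI2 = 11.070  # 5% table value for df = 5
print(f"chi-square = {chi2_fit:.2f}, df = {df_fit}, reject H0: {chi2_fit > TABLE_CHI2}")
```

The statistic is only 0.5, far below the table value, so the observed counts fit the expected distribution well.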
➢ Decision rule:
➢ Correlation:
The degree of relationship between the variables under consideration is measured
through correlation analysis.
The measure of correlation is called the correlation coefficient.
The degree of relationship is expressed by a coefficient which ranges from
-1 to +1 ( -1 ≤ r ≤ +1).
The direction of change is indicated by a sign.
The correlation analysis enables us to have an idea about the degree & direction
of the relationship between the two variables under study.
Correlation is a statistical tool that helps to measure and analyze the degree of
relationship between two variables.
Correlation analysis deals with the association between two or more variables.
➢ Types of Correlation:
On the basis of Degree of Correlation
1. Positive Correlation: The correlation is said to be positive if the
values of the two variables change in the same direction. As X increases, Y
increases; as X decreases, Y decreases.
Ex. Expenses & sales, Height & weight.
2. Negative Correlation: The correlation is said to be negative when the
values of the variables change in opposite directions. As X increases, Y
decreases; as X decreases, Y increases. Ex. Price & qty. demanded.
3. No correlation: There may be cases where a change in one variable is not
accompanied by any change in the other. In this case, there is said to be no
correlation between the two.
For instance, in the above example, if we study the relationship between the yield and
the fertilizers used during periods when a certain average temperature existed, then
it is a problem of partial correlation.
For example, from the values of two variables given below, it is clear that the ratio of
change between the variables is the same:
X: 10 20 30 40 50
Y: 20 40 60 80 100
➢ Scatter Diagram:
If the line goes upward from left to right, it will
show positive correlation.
Similarly, if the line moves downward from left to right,
it will show negative correlation.
The degree of slope will indicate the degree of correlation.
Karl Pearson’s Coefficient of Correlation is denoted by r, where -1 ≤ r ≤ +1
Degree of Correlation is expressed by the value of the Coefficient
Direction of change is indicated by the sign ( - ve) or ( + ve)
The value of the coefficient of correlation (r) always lies between -1 and +1.
Such as: r=+1, perfect positive correlation
r=-1, perfect negative correlation
r=0, no correlation
The coefficient of correlation is independent of the origin and scale.
By origin, it means that subtracting any constant from the given values of X
and Y leaves the value of “r” unchanged.
By scale, it means that there is no effect on the value of “r” if the values of X and Y are
divided or multiplied by any positive constant.
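The independence of r from origin and scale can be checked numerically. The sketch below computes r from deviations about the means; the first data set is made up, the second is the perfectly linear X and Y series given earlier, and the function name `pearson_r` is hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Karl Pearson's coefficient of correlation from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
# Change of origin: subtract 3 from every X and add 7 to every Y.
r_shifted = pearson_r([v - 3 for v in x], [v + 7 for v in y])
# Change of scale: multiply X by 10 and divide Y by 2 (positive constants).
r_scaled = pearson_r([v * 10 for v in x], [v / 2 for v in y])
# The X and Y series from the earlier example, which change in a fixed ratio.
r_linear = pearson_r([10, 20, 30, 40, 50], [20, 40, 60, 80, 100])

print(f"r = {r:.4f}, shifted = {r_shifted:.4f}, scaled = {r_scaled:.4f}, linear = {r_linear:.1f}")
```

All three versions of the made-up data give the same r (about 0.7746), and the fixed-ratio series gives r = +1.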
The coefficient of correlation is the geometric mean of the two regression
coefficients.
The coefficient of correlation is “zero” when the variables X and Y are
independent. However, the converse is not true: a zero coefficient does not imply independence.
where rho denotes the correlation in a population
1. The data must approximate to the bell-shaped curve, i.e. a normal frequency
curve.
2. The statistical measure for which the probable error is computed must have been
calculated from the sample.
3. The sample items must be selected in an unbiased manner and must be
independent of each other.
Thus, the probable error is calculated to check the reliability of the value of
coefficient calculated from the random sampling.
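A short sketch of the probable error calculation follows. The formula is not written out in this part of the notes, so the usual textbook form P.E. = 0.6745 × (1 - r²) / √N is assumed here; r = 0.8 and N = 25 are made-up figures and the function name is hypothetical.

```python
from math import sqrt


def probable_error(r, n):
    """Probable error of r, assuming the form 0.6745 * (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


# Illustrative (made-up) figures.
r, n = 0.8, 25
pe = probable_error(r, n)

# Limits within which the population correlation (rho) is expected to lie.
lower, upper = r - pe, r + pe
print(f"P.E. = {pe:.4f}, rho lies between {lower:.4f} and {upper:.4f}")
```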
R = Rank correlation coefficient
D = Difference of ranks between paired items in the two series
N = Total number of observations
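Using the symbols R, D and N defined above, the rank correlation coefficient can be sketched as below. The formula assumed is the standard Spearman form R = 1 - 6ΣD² / (N(N² - 1)) for untied ranks; the two series of ranks are made up and the function name is hypothetical.

```python
def rank_correlation(ranks_1, ranks_2):
    """Spearman's R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), for untied ranks."""
    n = len(ranks_1)
    # D is the difference of ranks between the paired items of the two series.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Illustrative (made-up) ranks given by two judges to five items.
R = rank_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(f"R = {R:.2f}")
```

Here ΣD² = 4 and N = 5, so R = 1 - 24/120 = 0.8.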
➢ Types of problems:
An individual must follow the following steps to calculate the correlation coefficient:
2. Where ranks are not assigned:
In case the ranks are not given, the individual may assign the ranks by taking
either the highest value or the lowest value as 1. Whichever criterion is
chosen, the same method should be applied to all the variables.
3. Equal Ranks or Tie in Ranks or when ranks are repeated:
In case the same ranks are assigned to two or more entities, then the ranks are
assigned on an average basis.
Such as if two individuals are ranked equal at third position, then the ranks shall
be calculated as: (3+4)/2 = 3.5
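Assigning ranks on an average basis can be sketched as follows. The scores are made up and the function name `average_ranks` is hypothetical; by default the highest value is taken as rank 1.

```python
def average_ranks(values, highest_first=True):
    """Rank the values, giving tied values the average of the positions they occupy."""
    ordered = sorted(values, reverse=highest_first)
    rank_of = {}
    for v in set(values):
        # Positions (1-based) that this value occupies in the ordered list.
        positions = [i + 1 for i, item in enumerate(ordered) if item == v]
        rank_of[v] = sum(positions) / len(positions)
    return [rank_of[v] for v in values]


# Illustrative (made-up) scores: two individuals tie at the third position.
scores = [90, 85, 80, 80, 70]
ranks = average_ranks(scores)
print(ranks)  # [1.0, 2.0, 3.5, 3.5, 5.0]
```

The two tied scores occupy positions 3 and 4, so each receives (3+4)/2 = 3.5 and the next item is ranked 5.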
The formula to calculate the rank correlation coefficient when there is a tie in the
ranks is:
➢ Regression
It estimates the values of dependent variables from the values of the
independent variable. This means, the value of the unknown variable can be
estimated from the known value of another variable.
➢ Regression Line:
The degree to which the variables are correlated with each other depends on the
Regression Line.
The regression line is the single line that best fits the data, i.e. it is drawn
through the plotted points in such a manner that the sum of the squared distances
from the points to the line is the smallest.
➢ Regression Coefficient
The constant ‘b’ in the regression equation (Ye = a + bX) is called the
Regression Coefficient.
It determines the slope of the line, i.e. the change in the value of Y
corresponding to a unit change in X, and therefore it is also called the
“Slope Coefficient.”
The correlation coefficient is the geometric mean of the two regression
coefficients.
r² = byx × bxy
r = ±√(byx × bxy), where r takes the same sign as the two regression coefficients
The sign of both the regression coefficients will be same, i.e. they will be
either positive or negative.
It is an absolute measure
The average value of the two regression coefficients will be greater than or equal to the
value of the correlation coefficient.
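The properties of the regression coefficients listed above can be checked on a small made-up data set. The sketch computes byx and bxy by least squares, the constant a of the equation Ye = a + bX, and r as the geometric mean of the two coefficients. The function name is hypothetical.

```python
from math import copysign, sqrt
from statistics import mean


def regression_coefficients(xs, ys):
    """Return (byx, bxy): the slopes of Y on X and of X on Y by least squares."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / syy


# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

byx, bxy = regression_coefficients(x, y)
a = mean(y) - byx * mean(x)              # constant in Ye = a + bX
# r is the geometric mean of the two coefficients and carries their common sign.
r = copysign(sqrt(byx * bxy), byx)
average = (byx + bxy) / 2

print(f"Ye = {a:.1f} + {byx:.1f}X, bxy = {bxy:.1f}, r = {r:.4f}, average = {average:.1f}")
```

Both coefficients are positive (0.6 and 1.0), r is about 0.7746, and their average of 0.8 is not less than r, in line with the properties stated above.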