Chanderprabhu Jain College of Higher Studies
&
School of Law
An ISO 9001:2008 Certified Quality Institute
(Recognized by Govt. of NCT of Delhi, Affiliated to GGS Indraprastha University, Delhi)
Class : BBA (CAM)
Unit 1
Statistics: Definition, Importance & Limitations
Definition of statistics: "By statistics we mean aggregates of facts affected to a
marked extent by a multiplicity of causes, numerically expressed, enumerated or
estimated according to reasonable standards of accuracy, collected in a systematic
manner for a pre-determined purpose and placed in relation to each other."
Importance of Statistics
These days statistical methods are applicable everywhere; there is hardly any field
of work in which they are not applied. According to A. L. Bowley, "A knowledge of
statistics is like a knowledge of foreign languages or of algebra; it may prove of
use at any time under any circumstances." The importance of statistical science is
increasing in almost all spheres of knowledge, e.g., astronomy, biology,
meteorology, demography, economics and mathematics. Economic planning without
statistics is bound to be baseless.
Statistics serve in administration and facilitate the formulation of new policies.
Financial institutions and investors utilise statistical data to summarise past
experience. Statistics are also helpful to an auditor when he uses sampling
techniques or test checking to audit the accounts of his client.
Limitations of statistics
The scope of the science of statistics is restricted by certain limitations:
1. The use of statistics is limited to numerical studies: Statistical methods
cannot be applied to study all types of phenomena. Statistics deal only with facts
that are capable of being numerically expressed.
In a discrete series, the data are presented in such a way that exact measurements
of units are indicated. In a discrete frequency distribution, we count the number
of times each value of the variable occurs in the given data. This is facilitated
through the technique of tally bars.
In the first column, we write all values of the variable. In the second column, a
vertical bar called a tally bar is placed against the variable each time its value
occurs. After a particular value has occurred four times, for the fifth occurrence
we put a cross tally mark ( / ) on the four tally bars to make a block of 5. The
technique of putting cross tally bars at every fifth repetition facilitates the
counting of the number of occurrences of the value. After putting tally bars for
all the values in the data, we count the number of times each value is repeated and
write it against the corresponding value of the variable in the third column,
entitled frequency. This type of representation of the data is called a discrete
frequency distribution.
We are given marks of 42 students:
55 51 57 40 26 43 46 41 46 48 33 40 26 40 40 41
43 53 45 53 33 50 40 33 55 26 53 59 33 39 55 48
15 26 43 59 51 39 15 45 26 15
We can construct a discrete frequency distribution from the above given marks.
Marks of 42 Students
------------------------------------------
Marks Tally Bars Frequency
------------------------------------------
15          |||                    3
26          |||| /                 5
33          ||||                   4
39          ||                     2
40          |||| /                 5
41          ||                     2
43          |||                    3
45          ||                     2
46          ||                     2
48 || 2
50 | 1
51 || 2
53 ||| 3
55 ||| 3
57 | 1
59 || 2
Total 42
The presentation of data in the form of a discrete frequency distribution is better
than a simple arrangement, but it does not condense the data as much as needed, and
it is still quite difficult to grasp and comprehend. This distribution is quite
simple where the values of the variable are repeated; otherwise there will be
hardly any condensation.
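The tallying procedure described above can be sketched in Python with `collections.Counter`, using the 42 marks from the example; this is a minimal illustration, not part of the original text.

```python
from collections import Counter

# Marks of the 42 students from the example above.
marks = [55, 51, 57, 40, 26, 43, 46, 41, 46, 48, 33, 40, 26, 40, 40, 41,
         43, 53, 45, 53, 33, 50, 40, 33, 55, 26, 53, 59, 33, 39, 55, 48,
         15, 26, 43, 59, 51, 39, 15, 45, 26, 15]

# Counter tallies how many times each value occurs, exactly as tally bars do.
freq = Counter(marks)

for value in sorted(freq):
    print(value, freq[value])

print("Total:", sum(freq.values()))  # 42
```

Each printed row corresponds to one row of the discrete frequency table above.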
Continuous Frequency Distribution:-
If neither the identity of the units about which a particular piece of information
is collected nor the order in which the observations occur is relevant, then the
first step of condensation is to classify the data into different classes by
dividing the entire group of values of the variable into a suitable number of
groups and then recording
the number of observations in each group. Thus, if we divide the total range of
values of the variable (marks of 42 students), i.e. 59 – 15 = 44, into groups of 10
each, then we shall get 44/10 ≈ 5 groups, and the distribution of marks is
displayed by the following frequency distribution:
Marks of 42 Students
---------------------------------------------------------------------
Marks (X)          Tally Bars                    Number of Students (f)
---------------------------------------------------------------------
15-25       |||                            3
25-35       |||| / ||||                    9
35-45       |||| / |||| / ||              12
45-55       |||| / |||| / ||              12
55-65       |||| / |                       6
----------------------------------------------------------------------
Total 42
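The grouping step above can be sketched in Python; this is a minimal illustration of sorting the 42 marks into the classes 15-25, ..., 55-65 (lower limit inclusive, upper limit exclusive).

```python
# The 42 marks from the earlier example.
marks = [55, 51, 57, 40, 26, 43, 46, 41, 46, 48, 33, 40, 26, 40, 40, 41,
         43, 53, 45, 53, 33, 50, 40, 33, 55, 26, 53, 59, 33, 39, 55, 48,
         15, 26, 43, 59, 51, 39, 15, 45, 26, 15]

classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65)]
freq = {c: 0 for c in classes}

# Record the number of observations falling in each class.
for m in marks:
    for lo, hi in classes:
        if lo <= m < hi:
            freq[(lo, hi)] += 1
            break

for (lo, hi), f in freq.items():
    print(f"{lo}-{hi}: {f}")
```

The printed counts reproduce the continuous frequency distribution shown above.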
Graphs of Frequency Distributions
The guiding principles for the graphic representation of frequency distributions
are the same as for the diagrammatic and graphic representation of other types of
data. The information contained in a frequency distribution can be shown in graphs
which reveal the important characteristics and relationships that are not easily
discernible on a simple examination of the frequency tables. The most commonly
used graphs for charting a frequency distribution are :
1. Histogram
2. Frequency polygon
3. Smoothed frequency curves
4. Ogives or cumulative frequency curves.
1. Histogram
The term ‘histogram’ must not be confused with the term ‘historigram’
which relates to time charts. Histogram is the best way of presenting
graphically a simple frequency distribution. The statistical meaning of
histogram is that it is a graph that represents the class frequencies in a
frequency distribution by vertical adjacent rectangles.
While constructing a histogram, the variable is always taken on the X-axis, each
rectangle standing on its corresponding class interval. The distance for each
rectangle on the X-axis shall remain the same in case the class intervals are
uniform throughout; if they are different, the widths of the rectangles shall also
change proportionately. The Y-axis represents the frequency of each class, which
constitutes the height of its rectangle. We thus get a series of rectangles, each
having the class-interval distance as its width and the frequency distance as its
height. The area of the histogram represents the total frequency.
The histogram should be clearly distinguished from a bar diagram. A bar diagram is
one-dimensional, where only the length of the bar is important and not the width; a
histogram is two-dimensional, where both the length and the width are important.
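The statement that the area of a histogram represents the total frequency can be checked numerically; the sketch below computes rectangle heights as frequency densities (frequency divided by class width), which is how histogram heights are chosen when class widths are unequal.

```python
# Classes and frequencies of the grouped marks distribution above.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65)]
freqs = [3, 9, 12, 12, 6]

total_area = 0
for (lo, hi), f in zip(classes, freqs):
    width = hi - lo
    height = f / width          # frequency density
    total_area += width * height
    print(f"class {lo}-{hi}: width={width}, height={height}")

print("total area =", total_area)   # equals the total frequency, 42
```

With equal widths the density is just the frequency scaled by a constant, so the rectangle heights keep the same proportions as the frequencies.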
2. Frequency Polygon
This is a graph of frequency distribution which has more than four sides. It is
particularly effective in comparing two or more frequency distributions. There are
two ways of constructing a frequency polygon.
(i) We may draw a histogram of the given data and then join by straight lines the
mid-points of the upper horizontal side of each rectangle with the adjacent ones.
The figure so formed is a frequency polygon. Both ends of the polygon should be
extended to the base line in order to make the area under the frequency polygon
equal to the area under the histogram.
(ii) Another method of constructing a frequency polygon is to take the mid-points
of the various class-intervals, plot the frequency corresponding to each point, and
join all these points by straight lines. The figure obtained by both methods would
be identical.
Frequency polygon has an advantage over the histogram. The frequency polygons
of several distributions can be drawn on the same axis, which makes comparisons
possible whereas histogram cannot be used in the same way. To compare
histograms we need to draw them on separate graphs.
3. Cumulative Frequency Curves or Ogives
We have discussed the charting of simple distributions where each frequency refers
to the measurement of the class-interval against which it is placed. Sometimes it
becomes necessary to know the number of items whose values are greater or less
than a certain amount. We may, for example, be interested in knowing the number
of students whose weight is less than 65 lbs. or more than, say, 15.5 lbs. To get this
information, it is necessary to change the form of frequency distribution from a
simple to a cumulative distribution. In a cumulative frequency distribution, the
frequency of each class is made to include the frequencies of all the lower or all the
upper classes depending upon the manner in which cumulation is done. The graph
of such a distribution is called a cumulative frequency curve or an Ogive.
There are two methods of constructing ogives, namely:
(i) less than method, and
(ii) more than method.

--------------------------------------------------------------
Less than (Weights)                 Cumulative Frequencies
--------------------------------------------------------------
100.5 5
110.5 39
120.5 178
130.5 478
140.5 845
150.5 1164
160.5 1369
170.5 1445
180.5 1488
190.5 1504
200.5 1507
210.5 1511
220.5 1514
230.5 1515
--------------------------------------------------------------
Plot these frequencies and weights on a graph paper. The curve so formed is called
an Ogive. Now we calculate the cumulative frequencies of the given data by the more
than method.
--------------------------------------------------------------
More than (Weights) Cumulative Frequencies
--------------------------------------------------------------
90.5 1515
100.5 1510
110.5 1476
120.5 1337
130.5 1037
140.5 670
150.5 351
160.5 146
170.5 70
180.5 27
190.5 11
200.5 8
210.5 4
220.5 1
--------------------------------------------------------------
By plotting these frequencies on a graph paper, we will get a declining curve which
will be our cumulative frequency curve or Ogive by more than method.
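Both cumulations can be sketched in Python from the class frequencies implied by the two weight tables above (classes 90.5-100.5, 100.5-110.5, ..., 220.5-230.5); this is a minimal illustration with `itertools.accumulate`.

```python
import itertools

# Class frequencies recovered by differencing the "less than" column above.
freqs = [5, 34, 139, 300, 367, 319, 205, 76, 43, 16, 3, 4, 3, 1]

# "Less than" cumulates downward; "more than" cumulates upward.
less_than = list(itertools.accumulate(freqs))
more_than = list(itertools.accumulate(freqs[::-1]))[::-1]

print(less_than)   # 5, 39, 178, ..., 1515
print(more_than)   # 1515, 1510, 1476, ..., 1
```

Plotting `less_than` against upper class boundaries gives the rising ogive; plotting `more_than` against lower boundaries gives the declining one.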
Although graphs are a powerful and effective method of presenting statistical data,
they are not, under all circumstances and for all purposes, complete substitutes
for tabular and other forms of presentation. The specialist in this field is one
who recognizes not only the advantages but also the limitations of these
techniques. He knows when to use and when not to use these methods, and from his
experience and expertise is able to select the most appropriate method for every
purpose.
Example: Draw an ogive by less than method and determine the number of companies
earning profits between Rs. 45 crores and Rs. 75 crores:
------------------------------------------------------------------------
Profits (Rs. crores)            No. of Companies
------------------------------------------------------------------------
10—20                                  8
20—30                                 12
30—40                                 20
40—50                                 24
50—60                                 15
60—70                                 10
70—80                                  7
80—90                                  3
90—100                                 1
------------------------------------------------------------------------
Solution :
OGIVE BY LESS THAN METHOD
-----------------------------------------------
Profits No.of
(Rs. crores) Companies
----------------------------------------------
Less than 20 8
Less than 30 20
Less than 40 40
Less than 50 64
Less than 60 79
Less than 70 89
Less than 80 96
Less than 90 99
Less than 100 100
-----------------------------------------------
It is clear from the graph that the number of companies getting profits less than
Rs.75 crores is 92 and the number of companies getting profits less than Rs. 45
crores is 51. Hence the number of companies getting profits between Rs. 45 crores
and Rs. 75 crores is 92 – 51 = 41.
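Reading an ogive by eye gives approximate values; the same figures can be obtained by straight-line interpolation within the class, as this sketch shows (the exact interpolated values come out slightly different from the graph readings quoted above).

```python
# (upper class boundary, cumulative frequency) pairs from the "less than"
# table in the solution above; 10 is the lower boundary of the first class.
points = [(10, 0), (20, 8), (30, 20), (40, 40), (50, 64),
          (60, 79), (70, 89), (80, 96), (90, 99), (100, 100)]

def ogive(x):
    """Cumulative frequency at x by straight-line interpolation."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the distribution")

# Companies earning between Rs. 45 crores and Rs. 75 crores.
print(ogive(75) - ogive(45))
```

Interpolation gives 92.5 − 52 = 40.5 companies, close to the 41 read off the graph.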
Example :The following distribution is with regard to weight in grams of mangoes
of a given variety. If mangoes of weight less than 443 grams be considered
unsuitable for foreign market, what is the percentage of total mangoes suitable for
it? Assume the given frequency distribution to be typical of the variety:
---------------------------------------------------------------------------------
Weight in gms.    No. of mangoes        Weight in gms.    No. of mangoes
---------------------------------------------------------------------------------
410 – 419              10               450 – 459              45
420 – 429              20               460 – 469              18
430 – 439              42               470 – 479               7
440 – 449              54
---------------------------------------------------------------------------------
Draw an ogive of ‘more than’ type of the above data and deduce how many
mangoes will be more than 443 grams.
Solution: Mangoes weighing more than 443 gms. are suitable for the foreign market.
The number of mangoes weighing more than 443 gms. lies in the last four classes.
The number of mangoes weighing between 444 and 449 grams would be
(6/10) × 54 = 32.4
Total number of mangoes weighing more than 443 gms. = 32.4 + 45 + 18 + 7 = 102.4
Percentage of mangoes = (102.4 / 196) × 100 = 52.24
Therefore, the percentage of the total mangoes suitable for the foreign market is
about 52.24.
OGIVE BY MORE THAN METHOD
------------------------------------------------------------------
Weight more than (gms.) No. of Mangoes
------------------------------------------------------------------
410 196
420 186
430 166
440 124
450 70
460 25
470 7
------------------------------------------------------------------
From the graph it can be seen that there are 103 mangoes whose weight will be
more than 443 gms. and are suitable for foreign market.
DIAGRAMS:
Statistical data can be presented by means of frequency tables, graphs and
diagrams. In this lesson, so far we have discussed the graphical presentation. Now
we shall take up the study of diagrams. There are many varieties of diagrams, but
here we are concerned with the following types only:
(i) Bar diagrams
Bar Diagram:
A bar diagram may be simple, component or multiple. A simple bar diagram is used to
represent only one variable. The length of the bars is proportional to the
magnitude to be represented. But when we are interested in showing various parts of
a total, a component bar diagram is used.
Averages are also called measures of location, since they enable us to locate the
position or place of the distribution in question. Averages are statistical
constants which enable us to comprehend in a single value the significance of the
whole group. According to Croxton and Cowden, an average value is a single value
within the range of the data that is used to represent all the values in that
series. Since an average is somewhere within the range of the data, it is
sometimes called a measure of central value. An average is the most typical
representative item of the group to which it belongs and which is capable of
revealing all important characteristics of that group or distribution.
What are the Objects of Central Tendency
The most important object of calculating an average or measuring central tendency
is to determine a single figure which may be used to represent a whole series
involving magnitudes of the same variable. The second object is that, since an
average represents the entire data, it facilitates comparison within one group or
between groups of data. Thus, the performance of the members of a group can be
compared with the average performance of different groups.
The third object is that an average helps in computing various other statistical
measures such as dispersion, skewness, kurtosis, etc.
Essentials of a Good Average
Since an average represents the statistical data and is used for purposes of
comparison, it must possess the following properties:
1. It must be rigidly defined and not left to the mere estimation of the observer.
If the definition is rigid, the computed value of the average obtained by different
persons will be the same.
2. The average must be based upon all values given in the distribution. If it is
not based on all the values, it might not be representative of the entire group of
data.
3. It should be easily understood. The average should possess simple and obvious
properties. It should not be too abstract for the common people.
4. It should be capable of being calculated with reasonable care and rapidity.
5. It should be stable and unaffected by sampling fluctuations.
6. It should be capable of further algebraic manipulation.
Different methods of measuring “Central Tendency” provide us with different
kinds of averages. The following are the main types of averages that are commonly
used:
1. Mean
(i) Arithmetic mean
(ii) Weighted mean
(iii) Geometric mean
(iv) Harmonic mean
2. Median
3. Mode
Arithmetic Mean: The arithmetic mean of a series is the quotient obtained by
dividing the sum of the values by the number of items. In algebraic language, if
X1, X2, X3, ......, Xn are the n values of a variate X,
then the Arithmetic Mean is defined by the following formula:
X̄ = (X1 + X2 + X3 + ...... + Xn) / N = ∑X / N
Example: The following are the monthly salaries (Rs.) of ten employees in an
office. Calculate the mean salary of the employees: 250, 275, 265, 280, 400, 490,
670, 890, 1100, 1250.
Solution: X̄ = ∑X / N = (250 + 275 + 265 + 280 + 400 + 490 + 670 + 890 + 1100 +
1250) / 10 = 5870 / 10 = Rs. 587
Short-cut Method: The direct method is suitable where the number of items is
moderate and the figures are small and integral. But if the number of items is
large and/or the values of the variate are big, then the process of adding together
all the values may be a lengthy one. To overcome this difficulty of computation, a
short-cut method may be used. The short-cut method is based on an important
characteristic of the arithmetic mean, namely that the algebraic sum of the
deviations of a series of individual observations from their mean is always equal
to zero. Thus the deviations of the various values of the variate from an assumed
mean are computed, and their sum is divided by the number of items. The quotient
obtained is added to the assumed mean to find the arithmetic mean.
Symbolically, X̄ = A + ∑dx / N, where A is the assumed mean and dx are the
deviations (X – A).
We can solve the previous example by short-cut method.
Computation of Arithmetic Mean
----------------------------------------------------------------------------------
S. No.            Salary (X)            dx = (X – 400)
----------------------------------------------------------------------------------
1. 250 –150
2. 275 –125
3. 265 –135
4. 280 –120
5. 400 0
6. 490 +90
7. 670 +270
8. 890 +490
9. 1100 + 700
10. 1250 + 850
----------------------------------------------------------------
N = 10 ∑dx = 1870
--------------------------------------------------------------
By substituting the values in the formula, we get
X̄ = A + ∑dx / N = 400 + 1870/10 = 400 + 187 = Rs. 587
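The short-cut method worked above can be sketched in Python; the assumed mean A = 400 is the one used in the computation table.

```python
# Monthly salaries from the example, and the assumed mean A = 400.
salaries = [250, 275, 265, 280, 400, 490, 670, 890, 1100, 1250]
A = 400

dx = [x - A for x in salaries]          # deviations from the assumed mean
mean = A + sum(dx) / len(salaries)      # X-bar = A + sum(dx)/N

print(sum(dx))   # 1870
print(mean)      # 587.0, the same as the direct method
```

Whatever value is chosen for A, the correction term ∑dx/N brings the result back to the true mean.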
Computation of Arithmetic Mean in Discrete series. In discrete series, arithmetic
mean may be computed by both direct and short cut methods. The formula
according to the direct method is:
X̄ = ∑fX / N
where the variable values X1 X2, .......... Xn, have frequencies f1, f2, ................fn
and N = ∑f.
Example : The following table gives the distribution of 100 accidents during seven
days of the week in a given month. During that month there were 5 Fridays and 5
Saturdays, and only four of each of the other days. Calculate the average number of
accidents per day.
Days : Sun. Mon. Tue. Wed. Thur. Fri. Sat. Total
Number of
accidents : 20 22 10 9 11 8 20 = 100
Solution: Taking the number of accidents as X and the number of such days in the
month as f (5 Fridays, 5 Saturdays, 4 of each other day), N = ∑f = 30 and
∑fX = 428.
X̄ = ∑fX / N = 428 / 30 = 14.27 ≈ 14 accidents per day
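The weighted mean in the accidents example can be sketched as follows; the weights are the number of days of each kind in the month.

```python
# Accidents per day of week (Sun..Sat) and number of such days in the month.
X = [20, 22, 10, 9, 11, 8, 20]
f = [4, 4, 4, 4, 4, 5, 5]

# Weighted arithmetic mean: sum(f*X) / sum(f).
mean = sum(w * x for w, x in zip(f, X)) / sum(f)
print(round(mean, 2))   # 14.27 accidents per day
```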
Calculation of arithmetic mean for Continuous Series: The arithmetic mean can
be computed both by direct and short-cut method. In addition, a coding method or
step deviation method is also applied for simplification of calculations. In any case,
it is necessary to find out the mid-values of the various classes in the frequency
distribution before arithmetic mean of the frequency distribution can be computed.
Once the mid-points of various classes are found out, then the process of the
calculation of arithmetic mean is same as in the case of discrete series. In case of
direct method, the formula to be used:
X̄ = ∑fm / N, where m = mid-points of the various classes and N = total frequency.
In the short-cut method, the following formula is applied:
X̄ = A + ∑fdx / N, where dx = (m – A) and N = ∑f
The short-cut method can further be simplified in practice and is named coding
method. The deviations from the assumed mean are divided by a common factor to
reduce their size. The sum of the products of the deviations and frequencies is
multiplied by this common factor and then it is divided by the total frequency and
added to the assumed mean. Symbolically
X̄ = A + (∑fd′ / N) × i, where d′ = (m – A) / i and i = common factor
Geometric Mean :
In general, if we have n numbers (none of them being zero), then the G.M. is
defined as
G.M. = (X1 × X2 × ...... × Xn)^(1/n)
In case of a discrete series, if x1, x2, ......, xn occur f1, f2, ......, fn times
respectively and N is the total frequency (i.e. N = f1 + f2 + ...... + fn), then
G.M. = (x1^f1 × x2^f2 × ...... × xn^fn)^(1/N)
For convenience, extensive use of logarithms is made to calculate the nth root. In
terms of logarithms,
G.M. = antilog (∑f log x / N)
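The two routes to the geometric mean (direct nth root and antilog of the mean logarithm) can be sketched as below; the data 4, 8, 16 is a hypothetical set chosen for illustration.

```python
import math

values = [4, 8, 16]   # hypothetical data for illustration

n = len(values)
gm_direct = math.prod(values) ** (1 / n)                   # nth root of product
gm_logs = 10 ** (sum(math.log10(x) for x in values) / n)   # antilog of mean log

print(gm_direct)   # 8.0 by either route
```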
If the number of items is even, then there is no value exactly in the middle of the
series. In such a situation the median is arbitrarily taken to be halfway between
the two middle items. Symbolically,
Median = [size of (N/2)th item + size of (N/2 + 1)th item] / 2
Location of Median in Discrete series: In a discrete series, the median is computed
in the following manner:
(i) Arrange the given variable data in ascending or descending order.
(ii) Find cumulative frequencies.
(iii) Apply Median = size of the (N + 1)/2 th item.
(iv) Locate the median according to this size, i.e., the variable corresponding to
the cumulative frequency that first equals or exceeds it.
Example: Following are the number of rooms in the houses of a particular locality.
Find median of the data:
No. of rooms: 3 4 5 6 7 8
No of houses: 38 654 311 42 12 2
Solution: Computation of Median
------------------------------------------------------------------------
No. of Rooms No. of Houses cumulative Frequency
X f Cf
-----------------------------------------------------------------------
3 38 38
4 654 692
5 311 1003
6 42 1045
7 12 1057
8 2 1059
------------------------------------------------------------------
Median = size of the (N + 1)/2 th item = size of the (1059 + 1)/2 th item = 530th item.
Median lies in the cumulative frequency of 692 and the value corresponding to this
is 4
Therefore, Median = 4 rooms.
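The location of the median via cumulative frequencies, as in the rooms example above, can be sketched as:

```python
import itertools

rooms = [3, 4, 5, 6, 7, 8]
houses = [38, 654, 311, 42, 12, 2]

N = sum(houses)                  # 1059
position = (N + 1) / 2           # the 530th item

# The median is the first value whose cumulative frequency covers the position.
cf = list(itertools.accumulate(houses))
median = next(x for x, c in zip(rooms, cf) if c >= position)
print("Median =", median, "rooms")   # 4 rooms
```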
In a continuous series, median is computed in the following manner:
(i) Arrange the given variable data in ascending or descending order.
(ii) If an inclusive series is given, it must be converted into an exclusive series
to find the real class intervals.
(iii) Find cumulative frequencies.
(iv) Apply Median = size of the (N/2)th item to ascertain the median class.
(v) Apply the formula of interpolation to ascertain the value of the median:
Median = l1 + ((N/2 – cf0) / f) × (l2 – l1)  or  Median = l2 – ((cf – N/2) / f) × (l2 – l1)
where, l1 refers to the lower limit of the median class,
l2 refers to the upper limit of the median class,
cf0 refers to the cumulative frequency of the class previous to the median class,
cf refers to the cumulative frequency of the median class,
f refers to the frequency of the median class.
Example: The following table gives you the distribution of marks secured by some
students in an examination:
Marks No. of Students
0—20 42
21—30 38
31—40 120
41—50 84
51— 60 48
61—70 36
71—80 31
Find the median marks.
Solution: Calculation of Median Marks
---------------------------------------------------
Marks No. of Students cf
(x) (f)
--------------------------------------------------
0 – 20 42 42
21 – 30 38 80
31 – 40 120 200
41 – 50 84 284
51 – 60 48 332
61 – 70 36 368
71 – 80 31 399
---------------------------------------------------
Median = size of the (N/2)th item = size of the (399/2)th item = 199.5th item,
which lies in the (31 – 40) group; therefore the median class is 30.5 – 40.5.
Applying the formula of interpolation,
Median = l1 + ((N/2 – cf0) / f) × (l2 – l1)
= 30.5 + ((199.5 – 80) / 120) × 10 = 30.5 + 9.96 = 40.46 marks.
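The interpolation step above can be sketched numerically:

```python
# Median class 30.5-40.5 of the marks distribution: N = 399, cumulative
# frequency before the class cf0 = 80, class frequency f = 120, width i = 10.
l1, f, cf0, i, N = 30.5, 120, 80, 10, 399

median = l1 + (N / 2 - cf0) / f * i
print(round(median, 2))   # 40.46 marks
```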
Mode
Mode is that value of the variable which occurs or repeats itself the maximum
number of times. The mode is the most "fashionable" size in the sense that it is
the most common and typical, and it is defined by Zizek as "the value occurring
most frequently in a series of items and around which the other items are
distributed most densely." In the words of Croxton and Cowden, the mode of a
distribution is the value at the point where the items tend to be most heavily
concentrated. According to A. M. Tuttle, mode is the value which has the greatest
frequency density in its immediate neighbourhood. In the case of individual
observations, the mode is that value which is repeated the maximum number of times
in the series. The value of mode can be denoted by the letter Z also.
Calculation of Mode in Discrete series: In a discrete series, the mode is quite
often determined by inspection. We can understand this with the help of an example:
X 1 2 3 4 5 6 7
f 4 5 13 6 12 8 6
By inspection, the modal size is 3, as it has the maximum frequency. But this test
of greatest frequency is not foolproof, as it is not only the frequency of a single
class but also the frequencies of the neighbouring classes that decide the mode. In
such cases, we use the method of Grouping and the Analysis table.
Size of shoe 1 2 3 4 5 6 7
Frequency 4 5 13 6 12 8 6
Solution: By inspection, the mode is 3, but the actual mode may be 5. This is so
because the neighbouring frequencies of size 5 are greater than the neighbouring
frequencies of size 3. This effect of the neighbouring frequencies is examined with
the help of the grouping and analysis table technique.
Measures of dispersion
For the study of dispersion, we need some measures which show whether the
dispersion is small or large. The main measures of dispersion include:
1. The Range
Range
The range is the simplest measure of dispersion: by computing the difference
between the maximum and minimum values, we get an estimate of the spread of the
data.
For example, suppose an experiment involves finding out the weight of lab rats and
the values in grams are 320, 367, 423, 471 and 480. In this case, the range is
480 − 320 = 160 grams.
Range is quite a useful indication of how spread out the data is, but it has some
serious limitations. This is because sometimes data can have outliers that are
widely off the other data points. In these cases, the range might not give a true
indication of the spread of data.
For example, in our previous case, consider a small baby rat added to the data set
that weighs only 50 grams. Now the range is computed as 480-50 = 430 grams,
which gives a misleading indication of the dispersion of the data.
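The two range computations above can be sketched as:

```python
# Lab-rat weights in grams from the example above.
weights = [320, 367, 423, 471, 480]
r1 = max(weights) - min(weights)
print(r1)   # 160 grams

# Adding the 50-gram outlier stretches the range misleadingly.
weights.append(50)
r2 = max(weights) - min(weights)
print(r2)   # 430 grams
```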
Mean deviation
We're going to discuss methods to compute the Mean Deviation for three types of
series:
• Individual Data Series
• Discrete Data Series
• Continuous Data Series
Individual Data Series
When data is given on an individual basis, e.g.:
Items: 5 10 20 30 40 50 60 70
Discrete Data Series
When data is given along with frequencies, e.g.:
Items: 5 10 20 30 40 50 60 70
Frequency: 2 5 1 3 12 0 5 7
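The original text breaks off before giving the mean deviation formula; for an individual series it is conventionally the mean of the absolute deviations from the mean, Σ|x − x̄|/n, and that standard form is what this sketch uses.

```python
# Individual series from the example above.
items = [5, 10, 20, 30, 40, 50, 60, 70]

mean = sum(items) / len(items)                       # 35.625
# Mean deviation about the mean: average absolute deviation.
md = sum(abs(x - mean) for x in items) / len(items)
print(round(md, 3))   # 19.375
```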
The mean difference (more correctly, 'difference in means') is a standard statistic
that measures the absolute difference between the mean value in two groups in a
clinical trial. It estimates the amount by which the experimental intervention
changes the outcome on average compared with the control.
Formula
Mean Difference = (∑x1 / n) − (∑x2 / n)
Where −
• ∑x1 / n = mean of group one
• ∑x2 / n = mean of group two
• n = sample size
Example
Problem Statement:
There are 2 dance groups whose data is listed below. Find the mean difference
between these dance groups.
Group 1 3 9 5 7
Group 2 5 3 4 4
Solution:
∑x1 = 3 + 9 + 5 + 7 = 24;  ∑x2 = 5 + 3 + 4 + 4 = 16
M1 = ∑x1 / n = 24/4 = 6;  M2 = ∑x2 / n = 16/4 = 4
Mean Difference = 6 − 4 = 2
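The mean-difference calculation for the two dance groups can be sketched as:

```python
# The two dance groups from the problem statement above.
group1 = [3, 9, 5, 7]
group2 = [5, 3, 4, 4]

m1 = sum(group1) / len(group1)   # 6.0
m2 = sum(group2) / len(group2)   # 4.0
print(m1 - m2)                   # 2.0
```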
Standard Deviation
The population standard deviation is computed in steps: first find the mean, then
the squared deviations from it, then their average, and finally the square root.
Step 1. Work out the mean. In the formula, μ (the Greek letter "mu") is the mean of
all our values:
μ = (9 + 2 + 5 + 4 + 12 + 7 + 8 + 11 + 9 + 3 + 7 + 4 + 12 + 5 + 4 + 10 + 9 + 6 + 9
+ 4) / 20 = 140/20 = 7
So:
μ = 7
Step 2. Then for each number: subtract the Mean and square the result
So it says "for each value, subtract the mean and square the result", like this
Example (continued):
(9 − 7)² = (2)² = 4
(2 − 7)² = (−5)² = 25
(5 − 7)² = (−2)² = 4
(4 − 7)² = (−3)² = 9
(7 − 7)² = (0)² = 0
(8 − 7)² = (1)² = 1
... and so on for the remaining values.
To work out the mean, add up all the values then divide by how many.
But how do we say "add them all up" in mathematics? We use "Sigma": Σ
Sigma Notation
We want to add up all the values from 1 to N, where N=20 in our case because
there are 20 values:
Example (continued):
We already calculated (x1 − 7)² = 4 etc. in the previous step, so we just sum them up:
= 4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9 = 178
But that isn't the mean yet, we need to divide by how many, which is done
by multiplying by 1/N (the same as dividing by N):
Example (continued):
178 / 20 = 8.9
Step 4. Take the square root:
σ = √(8.9) = 2.983...
DONE!
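The four steps worked above can be sketched end to end:

```python
import math

# The 20 values from the population standard deviation example above.
values = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3,
          7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

mu = sum(values) / len(values)                               # step 1: mean = 7.0
variance = sum((x - mu) ** 2 for x in values) / len(values)  # steps 2-3: 178/20 = 8.9
sigma = math.sqrt(variance)                                  # step 4
print(round(sigma, 3))   # 2.983
```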
Example: Sam has 20 rose bushes, but only counted the flowers on 6 of them!
and the "sample" is the 6 bushes that Sam counted the flowers of.
9, 2, 5, 4, 12, 7
But when we use the sample as an estimate of the whole population, the Standard
Deviation formula changes to this:
The symbols also change to reflect that we are working on a sample instead of the
whole population:
• The mean is now x (for sample mean) instead of μ (the population mean),
• And the answer is s (for Sample Standard Deviation) instead of σ.
But that does not affect the calculations. Only N-1 instead of N changes the
calculations.
So:
x̄ = 39/6 = 6.5
Step 2. Then for each number: subtract the Mean and square the result
Example 2 (continued):
To work out the mean, add up all the values then divide by how many.
But hang on ... we are calculating the Sample Standard Deviation, so instead of
dividing by how many (N), we will divide by N-1
Example 2 (continued):
s = √(65.5 / 5) = √(13.1) = 3.619...
DONE!
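The sample version, with the N−1 divisor, can be sketched for Sam's six counted bushes:

```python
import math

# The sample of 6 bushes from the example above.
sample = [9, 2, 5, 4, 12, 7]

xbar = sum(sample) / len(sample)                               # 6.5
# Sample variance divides by N-1, not N.
s2 = sum((x - xbar) ** 2 for x in sample) / (len(sample) - 1)  # 65.5/5 = 13.1
s = math.sqrt(s2)
print(round(s, 3))   # 3.619
```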
Coefficient of variation
Formula
The formula for the coefficient of variation is:
Coefficient of Variation = (Standard Deviation / Mean) * 100.
In symbols: CV = (SD / x̄) * 100.
Multiplying the coefficient by 100 is an optional step to get a percentage, as
opposed to a decimal.
       Regular test    Randomized answers
SD     10.2            12.7
CV     17.03           28.35
Looking at the standard deviations of 10.2 and 12.7, you might think that the tests
have similar results. However, when you adjust for the difference in the means, the
results have more significance:
Regular test: CV = 17.03
Randomized answers: CV = 28.35
The coefficient of variation can also be used to compare variability between
different measures. For example, you can compare IQ scores to scores on the
Woodcock-Johnson III Tests of Cognitive Abilities.
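A quick sketch of the CV calculation. The means used here (59.9 and 44.8) are not quoted in the text above; they are back-computed from the given SD and CV figures, so treat them as assumptions:

```python
def coefficient_of_variation(sd, mean):
    # CV = (SD / Mean) * 100, expressed as a percentage
    return sd / mean * 100

# Means back-computed from the SD and CV values quoted above (assumed).
regular = coefficient_of_variation(10.2, 59.9)      # regular test
randomized = coefficient_of_variation(12.7, 44.8)   # randomized answers

print(round(regular, 2), round(randomized, 2))      # -> 17.03 28.35
```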
Why Sample?
• Pool of possible cases is too large (e.g., 260 million Americans) -- would
cost too much and take too long
• Don't want to use up the cases: e.g., when testing light bulbs to see how long
they last, you take a bulb and leave it on until it burns out. You can't test all
the bulbs this way, because the manufacturer's objective is to sell the bulbs, not
burn them out.
• It's not necessary to survey all cases: for most purposes, taking a sample
yields estimates that are accurate enough.
• The trade-off is that sampling does introduce some error. You didn't
interview everybody, so certain opinions or combinations of opinions won't
be represented in your data. When the population is very diverse, your
sample can't include all the possible combinations of attributes that are
found in the population, such as blacks and whites, men and women, cardiac
patients and non-patients, black women, white men, white women with heart
trouble who like Oprah and don't like Ally McBeal, etc.
• Population is the universe of cases. It is the group that you ultimately want
to say something about. For example, if you want to report 'what Americans
think about Clinton', then the population is all Americans.
• Elements are the individual cases in the population (usually, persons)
• Sampling ratio is size of sample divided by size of population. Contrary to
popular belief, a large sampling ratio is not crucial.
• Sampling frame is a specific list of names from which sample elements will
be chosen. The Literary Digest poll in 1936 used a sample of 10 million,
drawn from government lists of automobile and telephone owners. It predicted
Alf Landon would beat Franklin Roosevelt by a wide margin, but instead
Roosevelt won by a landslide. The reason was that the sampling frame did
not match the population: only the rich owned automobiles and telephones,
and they were the ones who favored Landon.
• Replacement. Sampling with replacement means that after you draw a name
out of the hat and record it, you put the name back and it can be chosen
again. Sampling without replacement means that once you draw the name
out, it is not available to be chosen again.
• Bias. Systematic errors produced by your sampling procedure. For example,
if you sample people and ask them whether they watch Ally McBeal, the
percentage may always come out too high (maybe because you are interviewing
your friends, and your whole group really likes Ally McBeal).
Non-Probability Sampling
Haphazard/Convenience
• Whoever happens to walk by your office; who's on the street when the
camera crews come out
• If you have a choice, don't use this method. Often produces really wrong
answers, because certain attributes tend to cluster with certain geographic
and temporal variables. For example, at 8am in NYC, most of the people on
the street are workers heading for their jobs. At 10am, there are many more
people who don't work, and the proportion of women is much higher. At
midnight, there are young people and muggers.
Quota
Purposive/Judgement
Snowball
Probability Sampling
Probability sampling methods are those in which the probability of selecting each
element is known and can be computed mathematically. These are also called
random sampling. They require more work, but are much more accurate. They also
allow the researcher to calculate the amount of error she can expect, and this is
really important.
Simple Random
• Develop a sampling frame, then randomly select elements (place all names
on cards, then randomly draw cards from hat; in Excel, there is a function
for attaching a random number to each cell, then sort and take N largest)
• Typically use sampling without replacement, but with replacement can be
done (and is easier mathematically)
• Any one sample is likely to yield statistics (such as the average income or
the percentage of respondents that watch Ally McBeal) that are different
from the population parameters
• The average statistic from many random samples should equal the
population parameter. In other words, if you took 150 different samples of
Americans, each of 300 people, and calculated the percentage that like Ally
McBeal in each of the samples, then averaged all those percentages together,
that should equal the "real" percentage of all Americans that like Ally
McBeal
• It is the Central Limit Theorem that guarantees that as the number of random
samples increases, the average of those samples converges on the population
parameter
• Because of these mathematical guarantees, we can estimate how far off a
sample might be from the population, giving rise to confidence intervals
• Random samples are unbiased and, on average, representative of the
population.
Example. Suppose a company wants to know what percentage of its employees
use drugs. If the percentage is high enough, the company will consider instituting
a mandatory drug testing program. Given this objective, a simple random sampling
design is perfect: the results will generalize to the whole company.
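As a sketch, here is how a simple random sample without replacement can be drawn in Python; the employee IDs are made up for illustration:

```python
import random

# Hypothetical sampling frame: ID numbers for 1,000 employees.
frame = list(range(1, 1001))

random.seed(42)                       # fixed seed so the draw is reproducible
sample = random.sample(frame, 50)     # 50 elements, without replacement

# Without replacement means no element can appear twice.
assert len(sample) == len(set(sample))
```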
Stratified Sampling
Cluster Sampling
• Used when (a) sampling frame not available or too expensive, and (b) cost
of reaching an individual element is too high
Example. Once a quarter, a large retail chain sends auditors to randomly chosen
stores to check that proper procedures are being carried out. They look at the
physical layout, the interactions between staff and customers, backroom
procedures, and so on. A simple random sample could have an auditor visiting a
California store one day, a New York store the next, then another California store,
and so on. Using cluster sampling, the auditor might first select a random sample
of states, then visit a random sample of stores within each state, thus reducing
travel time.
Sample Size
• The bigger the better, up to 2500. Beyond 2500, it doesn't really matter
(accuracy increases very slowly after this point)
• The smaller the population, the bigger the sampling ratio that is needed.
• For populations under 1000, you need sampling ratio of 30% (300 elements)
to be really accurate.
• For populations of about 10,000 need sampling ratio of about 10%
• This lesson will show the difference between sampling and nonsampling
errors. Using a sample in order to get information about a population is often
better than conducting a census for many reasons.
• Sampling is less costly and it can be done more quickly than a census which
requires data for the entire population.
• Suppose we need to find the sampling error for the mean. Suppose also
there is no nonsampling error, which we define below.
• Sampling error = x̄ – μ
• For example, in the lesson about sampling distribution, the 5 scores below
are for the entire population and μ = 86.4
• 80 85 85 90 92
• The mean score estimated from the sample is 2.6 higher than the mean score
from the population.
• 0.34 does not really represent the sampling error, since we already calculated
it as 2.6.
• The difference between 2.6 and 0.34, that is 2.6 − 0.34 = 2.26, is the
nonsampling error, because the value of 2.26 occurred as a result of a human
mistake.
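The sampling-error arithmetic can be sketched as follows. The sample {85, 90, 92} is an assumption chosen to be consistent with the 2.6 figure above (its mean is 89, which is 2.6 higher than μ = 86.4):

```python
population = [80, 85, 85, 90, 92]
mu = sum(population) / len(population)   # population mean = 86.4

# Hypothetical sample consistent with the 2.6 sampling error quoted above.
sample = [85, 90, 92]
x_bar = sum(sample) / len(sample)        # sample mean = 89.0

sampling_error = x_bar - mu              # x-bar minus mu
print(round(sampling_error, 1))          # -> 2.6
```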
Reliability of samples
The central limit theorem is a statistical theory that states that, given a sufficiently
large sample size from a population with a finite level of variance, the mean of all
samples from the same population will be approximately equal to the mean of the
population.
Unit-2
If A and B are any two events of a sample space such that P(A) ≠0 and P(B)≠0,
then
P(A∩B) = P(A) * P(B|A) = P(B) * P(A|B).
INDEPENDENT EVENTS:
Two events A and B are said to be independent if the occurrence of one event
does not change the probability of the occurrence of the other.
i.e. Two events A and B are said to be independent if
P(A|B) = P(A) where P(B)≠0.
Example:
While drawing from a pack of cards, let A be the event of drawing a diamond and
B be the event of drawing an ace. Then P(A) = 13/52 = 1/4, P(B) = 4/52 = 1/13,
and P(A∩B) = P(ace of diamonds) = 1/52 = P(A) * P(B), so A and B are
independent.
Note:
(1) If 3 events A, B and C are independent, then
P(A∩B∩C) = P(A)*P(B)*P(C).
Example:
The event of getting 2 heads, A and the event of getting 2 tails, B when two
coins are tossed are mutually exclusive.
Because A = {HH}; B = {TT}.
If A and B are exhaustive events (together they make up the whole sample space),
then the probability of their union is 1,
i.e. P(AUB)=1.
Example:
The event of getting a head and the event of getting a tail when a coin is tossed
are mutually exhaustive.
Example:
If the probability of solving a problem by two students George and James are 1/2
and 1/3 respectively then what is the probability of the problem to be solved.
Solution:
Let A and B be the events of solving the problem by George and James
respectively.
Then P(A) = 1/2 and P(B) = 1/3.
Since A and B are independent, P(A∩B) = P(A) * P(B) = 1/6, so
P(AUB) = 1/2 + 1/3 – 1/2 * 1/3 = 1/2 + 1/3 – 1/6 = (3+2-1)/6 = 4/6 = 2/3
Note:
If A and B are any two mutually exclusive events then P(A∩B)=0.
Then P(AUB) = P(A)+P(B).
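The George and James calculation above can be checked with exact fractions; a minimal sketch:

```python
from fractions import Fraction

p_a = Fraction(1, 2)   # George solves the problem
p_b = Fraction(1, 3)   # James solves the problem

# A and B are independent, so P(A and B) = P(A) * P(B)
p_union = p_a + p_b - p_a * p_b

print(p_union)   # -> 2/3
```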
Conditional Probability
The conditional probability of an event B is the probability that the event will
occur given the knowledge that an event A has already occurred. This probability is
written P(B|A), notation for the probability of B given A. In the case where
events A and B are independent (where event A has no effect on the probability of
event B), the conditional probability of event B given event A is simply the
probability of event B, that is P(B).
If events A and B are not independent, then the probability of the intersection
of A and B (the probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).
Examples
In a card game, suppose a player needs to draw two cards of the same suit in order
to win. Of the 52 cards, there are 13 cards in each suit. Suppose first the player
draws a heart. Now the player wishes to draw a second heart. Since one heart has
already been chosen, there are now 12 hearts remaining in a deck of 51 cards. So
the conditional probability P(Draw second heart|First card a heart) = 12/51.
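The same conditional probability can be computed directly from the definition P(B|A) = P(A and B) / P(A); a sketch with exact fractions:

```python
from fractions import Fraction

p_first_heart = Fraction(13, 52)                      # 13 hearts in 52 cards
p_both_hearts = Fraction(13, 52) * Fraction(12, 51)   # first AND second hearts

# P(second heart | first heart) = P(both) / P(first)
p_second_given_first = p_both_hearts / p_first_heart

print(p_second_given_first)   # -> 4/17 (the reduced form of 12/51)
```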
To calculate the probability of the intersection of more than two events, the
conditional probabilities of all of the preceding events must be considered. In
the case of three events, A, B, and C, the probability of the intersection P(A
and B and C) = P(A)P(B|A)P(C|A and B).
Consider the college applicant who has determined that he has 0.80 probability of
acceptance and that only 60% of the accepted students will receive dormitory
housing. Of the accepted students who receive dormitory housing, 80% will have
at least one roommate. The probability of being accepted and receiving dormitory
housing and having no roommates is calculated by:
P(Accepted and Dormitory Housing and No Roommates) =
P(Accepted)P(Dormitory Housing|Accepted)P(No Roommates|Dormitory Housing
and Accepted) = (0.80)*(0.60)*(0.20) = 0.096. The student has about a 10% chance
of receiving a single room at the college.
Example
Suppose a voter poll is taken in three states. In state A, 50% of voters support the
liberal candidate, in state B, 60% of the voters support the liberal candidate, and in
state C, 35% of the voters support the liberal candidate. Of the total population of
the three states, 40% live in state A, 25% live in state B, and 35% live in state C.
Given that a voter supports the liberal candidate, what is the probability that she
lives in state B?
By Bayes's formula,
P(B | supports) = P(supports | B)P(B) / [P(supports | A)P(A) +
P(supports | B)P(B) + P(supports | C)P(C)]
= (0.60)(0.25) / [(0.50)(0.40) + (0.60)(0.25) + (0.35)(0.35)]
= 0.15 / 0.4725 ≈ 0.317.
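The computation goes through the law of total probability; a minimal sketch in Python:

```python
# Share of the three-state population living in each state...
p_state = {"A": 0.40, "B": 0.25, "C": 0.35}
# ...and support for the liberal candidate within each state.
p_support = {"A": 0.50, "B": 0.60, "C": 0.35}

# Law of total probability: P(voter supports the liberal candidate).
p_supports = sum(p_state[s] * p_support[s] for s in p_state)       # 0.4725

# Bayes: P(lives in B | supports) = P(supports | B) * P(B) / P(supports)
p_b_given_supports = p_support["B"] * p_state["B"] / p_supports

print(round(p_supports, 4), round(p_b_given_supports, 4))   # -> 0.4725 0.3175
```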
Independent Events
LO 6.7: Determine whether two events are independent or dependent and justify
your conclusion.
Independent Events:
• Two events A and B are said to be independent if the fact that one event has
occurred does not affect the probability that the other event will occur.
• If whether or not one event occurs does affect the probability that the other
event will occur, then the two events are said to be dependent.
Here are a few examples:
EXAMPLE:
A woman’s pocket contains two quarters and two nickels.
She randomly extracts one of the coins and, after looking at it, replaces it before
picking a second coin.
Let Q1 be the event that the first coin is a quarter and Q2 be the event that the
second coin is a quarter.
She randomly extracts one of the coins, and without placing it back into her
pocket, she picks a second coin.
As before, let Q1 be the event that the first coin is a quarter, and Q2 be the event
that the second coin is a quarter.
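Both versions of the coin example can be checked with exact fractions; a sketch:

```python
from fractions import Fraction

# Pocket: two quarters and two nickels, so P(Q2) = 2/4 = 1/2 overall.
p_q2 = Fraction(2, 4)

# With replacement: the pocket is restored before the second draw.
p_q2_given_q1_with = Fraction(2, 4)      # equals P(Q2) -> independent

# Without replacement: only 1 quarter remains among 3 coins.
p_q2_given_q1_without = Fraction(1, 3)   # differs from P(Q2) -> dependent

assert p_q2_given_q1_with == p_q2
assert p_q2_given_q1_without != p_q2
```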
In these last two examples, we could actually have done some calculation in order
to check whether or not the two events are independent.
Sometimes we can just use common sense to guide us as to whether two events are
independent. Here is an example.
EXAMPLE:
Two people are selected simultaneously and at random from all people in the
United States.
Let B1 be the event that one of the people has blue eyes and B2 be the event that
the other person has blue eyes.
In this case, since they were chosen at random, whether one of them has blue eyes
has no effect on the likelihood that the other one has blue eyes, and therefore B1
and B2 are independent.
On the other hand …
EXAMPLE:
A family has 4 children, two of whom are selected at random.
Let B1 be the event that one child has blue eyes, and B2 be the event that the other
chosen child has blue eyes.
In this case, B1 and B2 are not independent, since we know that eye color is
hereditary.
Thus, whether or not one child is blue-eyed will increase or decrease the chances
that the other child has blue eyes, respectively.
Comments:
• It is quite common for students to initially get confused about the distinction
between the idea of disjoint events and the idea of independent events. The
purpose of this comment (and the activity that follows it) is to help students
develop more understanding about these very different ideas.
The idea of disjoint events is about whether or not it is possible for the events to
occur at the same time (see the examples on the page for Basic Probability Rules).
The idea of independent events is about whether or not the events affect each
other in the sense that the occurrence of one event affects the probability of the
occurrence of the other (see the examples above).
The following activity deals with the distinction between these concepts.
The purpose of this activity is to help you strengthen your understanding about the
concepts of disjoint events and independent events, and the distinction between
them.
Learn by Doing: Independent Events
If two events (each with positive probability) are disjoint, then they cannot be
independent, i.e. they must be dependent events.
Why is that?
• Recall: If A and B are disjoint then they cannot happen together.
• In other words, A and B being disjoint events implies that if event A occurs
then B does not occur and vice versa.
• Well… if that’s the case, knowing that event A has occurred dramatically
changes the likelihood that event B occurs – that likelihood is zero.
• This implies that A and B are not independent.
Now that we understand the idea of independent events, we can finally get to rules
for finding P(A and B) in the special case in which the events A and B are
independent.
Later we will present a more general version for use when the events are not
necessarily independent.
Using a Venn diagram, we can visualize “A and B,” which is represented by the
overlap between events A and B:
EXAMPLE:
Recall the blood type example:
Two people are selected simultaneously and at random from all people in the
United States.
Comments:
• We now have an Addition Rule that says
P(A or B) = P(A) + P(B) for disjoint events,
and a Multiplication Rule that says
P(A and B) = P(A) * P(B) for independent events.
Since probabilities are never negative, the probability of one event or another is
always at least as large as either of the individual probabilities.
Since probabilities are never more than 1, the probability of one event and another
generally involves multiplying numbers that are less than 1, and so it can never be
more than either of the individual probabilities.
Here is an example:
EXAMPLE:
Consider the event A that a randomly chosen person has blood type A.
Modify it to a more general event — that a randomly chosen person has blood type
A or B — and the probability increases.
Modify it to a more specific (or restrictive) event — that not just one randomly
chosen person has blood type A, but that out of two simultaneously randomly
chosen people, person 1 will have type A and person 2 will have type B — and the
probability decreases.
• The word “and” is associated in our minds with “adding more stuff.” Therefore,
some students incorrectly think that P(A and B) should be larger than either one
of the individual probabilities, while it is actually smaller, since it is a more
specific (restrictive) event.
• Also, the word “or” is associated in our minds with “having to choose between”
or “losing something,” and therefore some students incorrectly think that P(A
or B) should be smaller than either one of the individual probabilities, while it is
actually larger, since it is a more general event.
Practically, you can use this comment to check yourself when solving problems.
For example, if you solve a problem that involves “or,” and the resulting
probability is smaller than either one of the individual probabilities, then you know
you have made a mistake somewhere.
Comment:
• Probability rule six can be used as a test to see if two events are independent or
not.
• If you can easily find P(A), P(B), and P(A and B) using logic or are provided
these values, then we can test for independent events using the multiplication
rule for independent events:
IF P(A)*P(B) = P(A and B) THEN A and B are independent events,
otherwise, they are dependent events.
As you’ve seen, the last three rules that we’ve introduced (the Complement Rule,
the Addition Rules, and the Multiplication Rule for Independent Events) are
frequently used in solving problems.
Before we move on to our next rule, here are two comments that will help you use
these rules in broader types of problems and more effectively.
Comment:
• As we mentioned before, the Addition Rule for Disjoint events (rule four) can
be extended to more than two disjoint events.
• Likewise, the Multiplication Rule for independent events (rule six) can be
extended to more than two independent events.
• So if A, B and C are three independent events, for example, then P(A and B and
C) = P(A) * P(B) * P(C).
• These extensions are quite straightforward, as long as you remember that “or”
requires us to add, while “and” requires us to multiply.
EXAMPLE:
Three people are chosen simultaneously and at random.
We’ll use the usual notation of B1, B2 and B3 for the events that persons 1, 2 and
3 have blood type B, respectively.
We need to find P(B1 and B2 and B3). Let’s solve this one together:
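A sketch of the solution, assuming P(blood type B) = 0.1 for a randomly chosen person; the value is not restated in this excerpt, so treat it as an assumption:

```python
# Assumed probability that one random person has blood type B.
p_b = 0.1

# The three selections are independent, so multiply:
p_all_three = p_b ** 3        # P(B1 and B2 and B3)

print(round(p_all_three, 3))  # -> 0.001
```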
EXAMPLE:
A fair coin is tossed 10 times. Which of the following two outcomes is more
likely?
(a) HHHHHHHHHH
(b) HTTHHTHTTH
In fact, they are equally likely. The 10 tosses are independent, so we’ll use the
Multiplication Rule for Independent Events:
• While it is true that an outcome with 5 heads and 5 tails is more likely than an
outcome with only heads,
since there is only one possible outcome which gives all heads
and many possible outcomes which give 5 heads and 5 tails,
• if we are comparing 2 specific outcomes, as we do here, they are equally likely:
each specific sequence of 10 tosses has probability (1/2)^10 = 1/1024.
IMPORTANT Comments:
• Only use the multiplication rule for independent events, rule six, which says
P(A and B) = P(A)P(B) if you are certain the two events are independent.
o Probability rule six is ONLY true for independent events.
• When finding P(A or B) using the general addition rule: P(A) + P(B) – P(A
and B),
o do NOT use the multiplication rule for independent events to calculate P(A
and B), use only logic and counting.
Bayes’ theorem
In most cases, you can’t just plug numbers into an equation; you have to figure out
what your “tests” and “events” are first. For two events, A and B, Bayes’ theorem
allows you to figure out P(A|B) (the probability that event A happened, given that
test B was positive) from P(B|A) (the probability that test B was positive, given
that event A happened). It can be a little tricky to wrap your head around as
technically you’re working backwards; you may have to switch your tests and
events around,
which can get confusing. An example should clarify what I mean by “switch the
tests and events around.”
Bayes’ Theorem Example #1
You might be interested in finding out a patient’s probability of having liver
disease if they are an alcoholic. “Being an alcoholic” is the test (kind of like a
litmus test) for liver disease.
• A could mean the event “Patient has liver disease.” Past data tells you that 10%
of patients entering your clinic have liver disease. P(A) = 0.10.
• B could mean the litmus test that “Patient is an alcoholic.” Five percent of the
clinic’s patients are alcoholics. P(B) = 0.05.
• You might also know that among those patients diagnosed with liver disease,
7% are alcoholics. This is your B|A: the probability that a patient is alcoholic,
given that they have liver disease, is 7%.
Bayes’ theorem tells you:
P(A|B) = (0.07 * 0.1)/0.05 = 0.14
In other words, if the patient is an alcoholic, their chance of having liver disease is
0.14 (14%). This is a large increase from the 10% suggested by past data, but it is
still unlikely that any particular patient has liver disease.
More Bayes’ Theorem Examples
Bayes’ Theorem Problems Example #2
Another way to look at the theorem is to say that one event follows another. Above
I said “tests” and “events”, but it’s also legitimate to think of it as the “first event”
that leads to the “second event.” There’s no one right way to do this: use the
terminology that makes most sense to you.
In a particular pain clinic, 10% of patients are prescribed narcotic pain killers.
Overall, five percent of the clinic’s patients are addicted to narcotics (including
pain killers and illegal substances). Out of all the people prescribed pain pills, 8%
are addicts. If a patient is an addict, what is the probability that they will be
prescribed pain pills?
Step 1: Figure out what your event “A” is from the question. That information
is in the italicized part of this particular question. The event that happens first (A)
is being prescribed pain pills. That’s given as 10%.
Step 2: Figure out what your event “B” is from the question. That information
is also in the italicized part of this particular question. Event B is being an addict.
That’s given as 5%.
Step 3: Figure out the probability of event B (Step 2) given event A (Step
1). In other words, find what P(B|A) is. We want to know “Given that people are
prescribed pain pills, what’s the probability they are an addict?” That is given in
the question as 8%, or 0.08.
Step 4: Insert your answers from Steps 1, 2 and 3 into the formula and solve.
P(A|B) = P(B|A) * P(A) / P(B) = (0.08 * 0.1)/0.05 = 0.16
The probability of an addict being prescribed pain pills is 0.16 (16%).
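Both worked examples plug into the same formula; a small helper function makes that explicit (a sketch, not part of the original text):

```python
def bayes(p_b_given_a, p_a, p_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# Example #1: liver disease (A) given alcoholic (B).
print(round(bayes(0.07, 0.10, 0.05), 2))   # -> 0.14

# Example #2: prescribed pain pills (A) given addict (B).
print(round(bayes(0.08, 0.10, 0.05), 2))   # -> 0.16
```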
Expected Values
The expected value of a random variable is also known as the EV, average, mean
value, mean, or first moment. More practically, the expected value is a
probability-weighted average of all the possible values.
Binomial Distribution
A binomial distribution applies to experiments where each trial has exactly two
possible outcomes: flipping a coin has two possible outcomes, heads or tails, and
taking a test could have two possible outcomes, pass or fail.
The first variable in the binomial formula, n, stands for the number of times the
experiment runs. The second variable, p, represents the probability of one specific
outcome. For example, let’s suppose you wanted to know the probability of getting
a 1 on a die roll. If you were to roll a die 20 times, the probability of rolling a one
on any throw is 1/6. Roll twenty times and you have a binomial distribution of
(n=20, p=1/6). SUCCESS would be “roll a one” and FAILURE would be “roll
anything else.” If the outcome in question was the probability of the die landing on
an even number, the binomial distribution would then become (n=20, p=1/2).
That’s because your probability of throwing an even number is one half.
Criteria
Binomial distributions must also meet the following three criteria:
1. The number of observations or trials is fixed. In other words, you can only
figure out the probability of something happening if you do it a certain number
of times. This is common sense—if you toss a coin once, your probability of
getting tails is 50%. If you toss a coin 20 times, the probability of getting at least
one tail is very, very close to 100%.
2. Each observation or trial is independent. In other words, none of your trials
have an effect on the probability of the next trial.
3. The probability of success (tails, heads, fail or pass) is exactly the same from
one trial to another.
80% of people who purchase pet insurance are women. If 9 pet insurance
owners are randomly selected, find the probability that exactly 6 are women.
Step 1: Identify ‘n’ from the problem. Using our sample question, n (the number
of randomly selected items) is 9.
Step 2: Identify ‘X’ from the problem. X (the number you are asked to find the
probability for) is 6.
Step 3: Work the first part of the formula. The first part of the formula is
n! / ((n – X)! × X!)
Substitute your variables:
9! / ((9 – 6)! × 6!)
Which equals 84. Set this number aside for a moment.
Step 4: Find p and q. p is the probability of success and q is the probability of
failure. We are given p = 80%, or .8. So the probability of failure is 1 – .8 = .2
(20%).
Step 5: Work the second part of the formula.
p^X
= .8^6
= .262144
Set this number aside for a moment.
Step 6: Work the third part of the formula.
q^(n – X)
= .2^(9 – 6)
= .2^3
= .008
Step 7: Multiply your answers from steps 3, 5, and 6 together.
84 × .262144 × .008 = 0.176.
Example 3
60% of people who purchase sports cars are men. If 10 sports car owners are
randomly selected, find the probability that exactly 7 are men.
Step 1: Identify ‘n’ and ‘X’ from the problem. Using our sample question, n (the
number of randomly selected items—in this case, sports car owners are randomly
selected) is 10, and X (the number you are asked to “find the probability” for) is
7.
Step 2: Figure out the first part of the formula, which is:
n! / ((n – X)! × X!)
Substituting the variables:
10! / ((10 – 7)! × 7!)
Which equals 120. Set this number aside for a moment.
Step 3: Find “p” the probability of success and “q” the probability of failure. We
are given p = 60%, or .6. Therefore, the probability of failure is 1 – .6 = .4 (40%).
Step 4: Work the next part of the formula.
p^X
= .6^7
= .0279936
Set this number aside while you work the third part of the formula.
Step 5: Work the third part of the formula.
q^(n – X)
= .4^(10 – 7)
= .4^3
= .064
Step 6: Multiply the three answers from steps 2, 4 and 5 together.
120 × 0.0279936 × 0.064 = 0.215.
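Both binomial examples follow the same three-part formula; a sketch using Python's math.comb for the combinations term:

```python
from math import comb

def binomial_pmf(n, x, p):
    # P(exactly x successes in n trials) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(round(binomial_pmf(9, 6, 0.8), 3))    # pet insurance example -> 0.176
print(round(binomial_pmf(10, 7, 0.6), 3))   # sports car example    -> 0.215
```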
Poisson Distribution
If you are using GeoGebra, then you will immediately see that the software tells
you P(64 < X < 69) = 0.5393. If you are using the calculator, then you need to find
the normalcdf (not normalpdf) function. Enter the number on the left where the
shading begins, the number on the right where it ends, the mean of the distribution,
and its standard deviation, all separated by commas, normalcdf (64, 69, 65, 3), and
you will get 0.539347. Round this to the nearest ten-thousandth (four places after
the decimal point), or equivalently to the nearest hundredth of a percent, and you
come up with the correct answer: 0.5393, or 53.93%.
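Python's standard library can reproduce normalcdf; a sketch using statistics.NormalDist:

```python
from statistics import NormalDist

heights = NormalDist(mu=65, sigma=3)   # women's heights, in inches

# Same quantity as normalcdf(64, 69, 65, 3):
p = heights.cdf(69) - heights.cdf(64)

print(round(p, 4))   # -> 0.5393
```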
In the last lecture, we mentioned that in the old days, everyone had to learn how to
look up a Z-table, the table that shows the relationship between area and Z-score
for the standard normal. Then how do GeoGebra and normalcdf do it? Well, it’s no
magic. The software simply converts any normal distribution to a standard normal,
using the familiar relationship of the Z-score: Z = (X – μ) / σ.
It’s not necessary that you always convert all normal distributions to Z, but it’s
useful to recognize how it is handled by the software, since we will be doing the
same later in inferential statistics.
2) What is the probability that a woman is taller than 5 feet, 10 inches, or 70
inches? Put another way, what fraction of women are taller than 70 inches?
This would be written as P(X > 70).
Start the same way as in Problem 1, but you have to mark and label only one
number besides the mean, the 70. Then shade to the right of the 70, because that’s
where the taller heights are:
In the problems above, we found the probability that the random variable falls
within a certain range. Now we’re going to reverse the process. We’ll start with the
probability of a certain range, and then we’ll have to find the values of the random
variable that determine that range. I’ll call these values cut-offs. Sometimes they
are also called “inverse probability” problems.
In these three problems, we’ll use the same situation as above: Women’s heights
are normally distributed with a mean of 65 inches and a standard deviation of 3
inches.
1) How short does a woman have to be to be in the shortest 10% of women?
If we call this cut-off c, this could be written as finding c such that P(X < c)
= 0.10.
We’ll do the same kind of diagram as before, but this time we’ll label the known
probability, 10%, and we do this above the shaded area, definitely not on the x-
axis, because it’s an area, not a height. The hardest part of the diagram is deciding
which side of the mean to put the c on and which side of the c to shade.
You really have to think about it. In this case, since by definition 50% of women
are shorter than the mean, the cut-off for 10% has to be less than the mean.
The picture here shows how GeoGebra can be used to find the cut-off value:
instead of entering the cut-off value, you can enter 0.10 as the probability, and
GeoGebra will solve for the cut-off value (61.1553).
Using the calculator, you will need to resort to the invNorm function, followed by
the percent of data under the normal curve to the left of (always to the left of, no
matter which side of c the shading is on) the cut-off, then the mean and standard
deviation, separated by commas.
So in our example, we will do invNorm(0.10, 65, 3), which gives 61.1553, or, rounded
to the nearest inch like the mean and standard deviation, 61 inches. So about 10%
of women are shorter than 61 inches. You can check this using normalcdf, and you
might as well use more digits of the cut-off than we rounded to, for greater
assurance that your check shows you got the right answer. You get
normalcdf(0, 61.1553, 65, 3), which comes to 0.0999997, or 10%.
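The same look-up can be sketched with scipy.stats, where norm.ppf plays the role of invNorm (again an assumption; the text uses a calculator):

```python
from scipy.stats import norm

mu, sigma = 65, 3

# invNorm(0.10, 65, 3): the height with 10% of the area to its left
cutoff = norm.ppf(0.10, mu, sigma)
print(round(cutoff, 4))  # 61.1553

# Check: exactly 10% of the area lies to the left of this cut-off
print(round(norm.cdf(cutoff, mu, sigma), 4))  # 0.1
```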
2) How tall does a woman have to be to be in the tallest fourth of women?
(What is the cut-off for the tallest 25% of women?) If we call this height c,
we want to find the value of c such that P(X > c) = 0.25. Here’s the diagram:
In GeoGebra it’s quite simple: you will just have to switch the left to the right tail.
In the calculator, when we use invNorm we must put in 0.75, because the
calculator finds cut-offs for areas to the left only: invNorm (0.75, 65, 3). Here 0.75
comes from the fact that the total area must be equal to 1. When we subtract the
area to the right, we are getting the area to the left of the cut-off.
Using the algebra you have learned, you will find x = 3*0.67 + 65 = 67.0, which is
how the software arrived at the answer. You won’t have to do it this way every
time, but it’s helpful to keep in mind, since this relation is used later on in finding
the margin of error for confidence intervals.
3) What if we’re interested in finding cut-offs for a middle group of
women’s heights, say the middle 40%? Obviously, we’re looking for two
numbers here, one on either side of the mean, each the same distance from the
mean. Call them c1 and c2. Then we are looking for these values so
that P(c1 < X < c2) = 0.40.
You probably noticed that the normal calculator in GeoGebra can’t really find two
cut-offs at once; in fact, the figure above was drawn using a different tool. But c1
and c2 are not two independent values, since they are equally far from 65, the
mean. To use the normal calculator, we must find out how much area is under the
curve to the left of c1. Well, if 100% of the area is under the entire curve, then what’s
left over after taking away the middle 40% is 1 − 0.40 = 0.60, and since that 60% is
split evenly between the two tails (the parts at the sides), that gives 30% for each
tail. So c1 is the number such that P(X < c1) = 0.30.
How much area is there under the curve to the left of c2? Either subtract the 30% to
the right from 100%, or add up the 30% in the left tail and the 40% in the middle,
and you’ll get 70% either way. So c2 is the number such that P(X < c2) = 0.70, and
you will find that c2 ≈ 66.6 inches. So to the first decimal, the middle 40% of
heights go from 63.4 to 66.6 inches. If you use invNorm on a calculator, the
process will be similar.
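The two-cut-off reasoning above can be sketched numerically (scipy.stats is my assumption for the tool):

```python
from scipy.stats import norm

mu, sigma = 65, 3

# The middle 40% leaves 60% in the tails, so 30% in each tail
c1 = norm.ppf(0.30, mu, sigma)  # 30% of the area to the left of c1
c2 = norm.ppf(0.70, mu, sigma)  # 30% + 40% = 70% to the left of c2

print(round(c1, 1), round(c2, 1))  # 63.4 66.6

# Check: the area between the two cut-offs is the middle 40%
print(round(norm.cdf(c2, mu, sigma) - norm.cdf(c1, mu, sigma), 2))  # 0.4
```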
Summary
Here are a few tips that may help you solve problems related to the normal
distribution:
1) First identify the distribution: is it continuous? Is it Normal?
2) Draw a graph of the normal PDF with the mean and standard deviation
3) Examine the question to see whether you are looking for a probability, or
cut-off values.
4) Shade the approximate areas under the normal PDF.
5) Use the software/calculator to solve for the unknown, and compare the output
with your graph.
Unit-3
Gather sample data and calculate a test statistic, where the sample statistic is
compared to the parameter value. The test statistic is calculated under the
assumption that the null hypothesis is true.
Level of Significance
To bring it to life, I’ll add the significance level and P value to the graph in my
previous post in order to perform a graphical version of the 1 sample t-test. It’s
easier to understand when you can see what statistical significance truly means!
Here’s where we left off in my last post. We want to determine whether our sample
mean (330.6) indicates that this year's average energy cost is significantly different
from last year’s average energy cost of $260.
I left you with a question: where do we draw the line for statistical significance on
the graph? Now we'll add in the significance level and the P value, which are the
decision-making tools we'll need.
• Null hypothesis: The population mean equals the hypothesized mean (260).
• Alternative hypothesis: The population mean differs from the hypothesized
mean (260).
The significance level, also denoted as alpha or α, is the probability of rejecting the
null hypothesis when it is true. For example, a significance level of 0.05 indicates a
5% risk of concluding that a difference exists when there is no actual difference.
The significance level determines how far out from the null hypothesis value we'll
draw that line on the graph. To graph a significance level of 0.05, we need to shade
the 5% of the distribution that is furthest away from the null hypothesis.
In the graph above, the two shaded areas are equidistant from the null hypothesis
value and each area has a probability of 0.025, for a total of 0.05. In statistics, we
call these shaded areas the critical region for a two-tailed test. If the population
mean is 260, we’d expect to obtain a sample mean that falls in the critical region
5% of the time. The critical region defines how far away our sample statistic must
be from the null hypothesis value before we can say it is unusual enough to reject
the null hypothesis.
Our sample mean (330.6) falls within the critical region, which indicates it is
statistically significant at the 0.05 level.
We can also check whether it is statistically significant at the other common
significance level of 0.01.
The two shaded areas each have a probability of 0.005, which adds up to a total
probability of 0.01. This time our sample mean does not fall within the critical
region and we fail to reject the null hypothesis. This comparison shows why you
need to choose your significance level before you begin your study. It protects you
from choosing a significance level because it conveniently gives you significant
results!
Thanks to the graph, we were able to determine that our results are statistically
significant at the 0.05 level without using a P value. However, when you use
statistical software, you will usually be given a P value.
P-values are the probability of obtaining an effect at least as extreme as the one in
your sample data, assuming the truth of the null hypothesis.
This definition of P values, while technically correct, is a bit convoluted. It’s easier
to understand with a graph!
To graph the P value for our example data set, we need to determine the distance
between the sample mean and the null hypothesis value (330.6 - 260 = 70.6). Next,
we can graph the probability of obtaining a sample mean that is at least as extreme
in both tails of the distribution (260 +/- 70.6).
In the graph above, the two shaded areas each have a probability of 0.01556, for a
total probability 0.03112. This probability represents the likelihood of obtaining a
sample mean that is at least as extreme as our sample mean in both tails of the
distribution if the population mean is 260. That’s our P value!
When a P value is less than or equal to the significance level, you reject the null
hypothesis. If we take the P value for our example and compare it to the common
significance levels, it matches the previous graphical results. The P value of
0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01
level.
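The text does not give the sample size or standard error behind this example, so as a hedged illustration assume a normal sampling distribution for the sample mean with a hypothetical standard error of about 32.8; this roughly reproduces the P value above:

```python
from scipy.stats import norm

null_mean = 260
sample_mean = 330.6
se = 32.8  # hypothetical standard error, not given in the text

# Two-tailed P value: probability of a sample mean at least this far
# from 260 in either direction, if the population mean really is 260
distance = sample_mean - null_mean  # 70.6
p_value = 2 * norm.sf(distance / se)
print(round(p_value, 3))  # roughly 0.031

print(p_value <= 0.05)  # True: significant at the 0.05 level
print(p_value <= 0.01)  # False: not significant at the 0.01 level
```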
If we stick to a significance level of 0.05, we can conclude that the average energy
cost for the population is greater than 260.
A common mistake is to interpret the P-value as the probability that the null
hypothesis is true. To understand why this interpretation is incorrect, please read
my blog post How to Correctly Interpret P Values.
Putting it together, the graphs illustrate three things:
• The assumption that the null hypothesis is true—the graphs are centered on
the null hypothesis value.
• The significance level—how far out do we draw the line for the critical
region?
• Our sample statistic—does it fall in the critical region?
Keep in mind that there is no magic significance level that distinguishes between
the studies that have a true effect and those that don’t with 100% accuracy. The
common alpha values of 0.05 and 0.01 are simply based on tradition. For a
significance level of 0.05, expect to obtain sample means in the critical region 5%
of the time when the null hypothesis is true. In these cases, you won’t know that
the null hypothesis is true but you’ll reject it because the sample mean falls in the
critical region. That’s why the significance level is also referred to as an error rate!
This type of error doesn’t imply that the experimenter did anything wrong or
require any other unusual explanation. The graphs show that when the null
hypothesis is true, it is possible to obtain these unusual sample means for no reason
other than random sampling error. It’s just luck of the draw.
Significance levels and P values are important tools that help you quantify and
control this type of error in a hypothesis test. Using these tools to decide when to
reject the null hypothesis increases your chance of making the correct decision.
mean. This is a possibility, but only one of many possibilities. To cover all
alternative outcomes, we resort to a verbal statement of ‘not all equal’ and then
follow up with mean comparisons to find out where differences among means
exist. In our example, this means that fertilizer 1 may result in plants that are
really tall, but fertilizers 2, 3 and the plants with no fertilizers don't differ from one
another. A simpler way of thinking about this is that at least one mean is different
from all others.
Step 3: Set α
If we look at what can happen in a hypothesis test, we can construct the following
contingency table:
                     In Reality
               H0 is true            H0 is false
Accept H0      OK                    Type II Error
                                     (β = probability of Type II Error)
Reject H0      Type I Error          OK
               (α = probability of Type I Error)
You should be familiar with Type I and Type II errors from your introductory
course. It is important to note that we want to set α before the experiment
(a priori) because the Type I error is the more ‘grievous’ error to make. The
typical value of α is 0.05, establishing a 95% confidence level. For this course we
will assume α = 0.05.
Step 4: Collect Data
Remember the importance of recognizing whether data are collected through an
experimental design or an observational study.
into the rejection region and the p-value becomes less than α. So the decision rule
is as follows:
If the p-value obtained from the ANOVA is less than α, then Reject H0 and Accept
HA.
Z-test vs T-test
Sometimes, measuring every single item is just not practical. That is why we
developed and use statistical methods to solve problems. The most practical way
is to measure just a sample of the population. Some methods test hypotheses
by comparison. Two of the better-known statistical hypothesis tests are the
T-test and the Z-test. Let us try to break down the two.
A T-test is a statistical hypothesis test in which the test statistic follows a
Student’s T-distribution if the null hypothesis is true. The T-statistic was
introduced by W.S. Gossett under the pen name “Student”, so the T-test is also
referred to as the “Student’s T-test”. The T-test is probably the most commonly
used statistical data-analysis procedure for hypothesis testing, since it is
straightforward and easy to use. Additionally, it is flexible and adaptable to a broad
range of circumstances.
There are various T-tests; the two most commonly applied are the one-sample
and two-sample T-tests. One-sample T-tests are used to compare a sample mean
with a known population mean. Two-sample T-tests, on the other hand, are used to
compare either independent samples or dependent (paired) samples.
The T-test is best applied, at least in theory, when you have a limited sample size
(n < 30), as long as the variables are approximately normally distributed and the
variation of scores in the two groups is not reliably different. It is also appropriate
when you do not know the population’s standard deviation. If the standard deviation is known, then it
would be best to use another type of statistical test, the Z-test. The Z-test is also
applied to compare sample and population means to see if there is a significant
difference between them. Z-tests always use the normal distribution and are
ideally applied when the standard deviation is known. Z-tests are often applied
when certain conditions are met; otherwise, other statistical tests like T-tests are
applied instead. Z-tests are often applied with large samples (n > 30). When the
T-test is used with large samples, it becomes very similar to the Z-test.
Fluctuations may occur in the sample variances used by T-tests that do not exist
in Z-tests, which can lead to differences in the results of the two tests.
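A hedged sketch with made-up sample data (not from the text), showing a one-sample T-test with scipy and the corresponding Z-test when σ is assumed known:

```python
from math import sqrt
from statistics import mean
from scipy.stats import norm, ttest_1samp

# Hypothetical small sample (n < 30) and hypothesized population mean
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
mu0 = 12.0
n = len(sample)

# T-test: population standard deviation unknown, estimated from the sample
t_stat, p_t = ttest_1samp(sample, mu0)

# Z-test: requires a known population standard deviation; assume sigma = 0.3
sigma = 0.3
z_stat = (mean(sample) - mu0) / (sigma / sqrt(n))
p_z = 2 * norm.sf(abs(z_stat))

print(round(t_stat, 3), round(p_t, 3))  # small t, large p: fail to reject H0
print(round(z_stat, 3), round(p_z, 3))
```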
Chi-square Test
The Chi-square test is one of the important nonparametric tests, used to compare
more than two variables for randomly selected data. The expected frequencies are
calculated based on the conditions of the null hypothesis. The rejection of the null
hypothesis is based on the differences between the observed and expected values.
The data can be examined by using the two types of Chi-square test, which are
given below:
1. Chi-square goodness of fit test
It is used to observe how closely a sample matches a population. The Chi-square
test statistic is
χ² = Σ (O − E)² / E,
where O is an observed frequency and E is the corresponding expected frequency.
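A minimal goodness-of-fit sketch with scipy.stats (the die-roll counts below are hypothetical, chosen only to illustrate the mechanics):

```python
from scipy.stats import chisquare

# Hypothetical observed counts for 60 rolls of a die;
# under H0 (fair die) each face is expected 10 times
observed = [8, 9, 12, 11, 9, 11]
expected = [10, 10, 10, 10, 10, 10]

# chi^2 = sum((O - E)^2 / E)
stat, p_value = chisquare(observed, f_exp=expected)
print(round(stat, 2))  # 1.2
print(p_value > 0.05)  # True: no evidence against a fair die
```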
F-Test
The F-test is designed to test if two population variances are equal. It does this by
comparing the ratio of two variances. So, if the variances are equal, the ratio of the
variances will be 1.
All hypothesis testing is done under the assumption that the null hypothesis is true.
If the null hypothesis is true, then the F test-statistic given above can be
simplified (dramatically). This ratio of sample variances will be the test
statistic used. If the null hypothesis is false, then we reject the null
hypothesis that the ratio was equal to 1, along with our assumption that the
variances were equal.
There are several different F-tables. Each one has a different level of significance.
So, find the correct level of significance first, and then look up the numerator
degrees of freedom and the denominator degrees of freedom to find the critical
value.
You will notice that all of the tables only give level of significance for right tail
tests. Because the F distribution is not symmetric, and there are no negative values,
you may not simply take the opposite of the right critical value to find the left
critical value. The way to find a left critical value is to reverse the degrees of
freedom, look up the right critical value, and then take the reciprocal of this value.
For example, the critical value with 0.05 on the left with 12 numerator and 15
denominator degrees of freedom is found by taking the reciprocal of the critical
value with 0.05 on the right with 15 numerator and 12 denominator degrees of
freedom.
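The reciprocal rule can be checked numerically with scipy's F distribution (a sketch, not the textbook's table):

```python
from scipy.stats import f

# Left critical value: 0.05 in the left tail, with 12 numerator
# and 15 denominator degrees of freedom
left = f.ppf(0.05, 12, 15)

# Same value by the reciprocal rule: swap the degrees of freedom,
# look up the right critical value, and take its reciprocal
right_swapped = f.ppf(0.95, 15, 12)
print(round(left, 4), round(1 / right_swapped, 4))  # the two match
```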
Since the left critical values are a pain to calculate, they are often avoided
altogether. This is the procedure followed in the textbook. You can force the F test
into a right tail test by placing the sample with the large variance in the numerator
and the smaller variance in the denominator. It does not matter which sample has
the larger sample size, only which sample has the larger variance.
The numerator degrees of freedom will be the degrees of freedom for whichever
sample has the larger variance (since it is in the numerator) and the denominator
degrees of freedom will be the degrees of freedom for whichever sample has the
smaller variance (since it is in the denominator).
If a two-tail test is being conducted, you still have to divide alpha by 2, but you
only look up and compare the right critical value.
Assumptions / Notes
Non-parametric tests are tests that do not require that the underlying population be
normal, or indeed that it have any single mathematical form, and some even
apply to non-numerical data. Non-parametric methods are also known as
distribution-free methods, since they do not assume any particular underlying
population distribution.
Definition
Non-parametric tests are defined as the mathematical methods used in statistical
hypothesis testing which, unlike parametric tests, do not make assumptions about
the frequency distribution of the variables to be assessed. Non-parametric tests are
used when the data are skewed, and they cover techniques that do not rely on the
data belonging to any particular distribution.
The word non-parametric does not mean that these models have no parameters at
all. Rather, the nature and number of the parameters are flexible and not fixed in
advance. This is why non-parametric models are also known as distribution-free
models.
Sign Test
The Sign Test is based merely on the signs (+ or −) of the deviations x − y and not
on their magnitudes. The test is simplest when ties or zero differences between the
paired observations cannot occur. If ties or zero differences do occur, they must be
excluded from the analysis, and the number of paired observations counted is
reduced accordingly. This method can also be used to analyze individual data.
If both np > 5 and nq > 5,
σp = √(pq/n),
where zp is the value obtained from the standard normal table at the α level of
significance. If α is not given, use α = 0.05.
Case 3 (n ≥ 30)
If n ≥ 30, we use mean = np and standard deviation = √(npq).
Find the normal table value for the given α. If |z| ≤ table value, we accept the
null hypothesis of the sign test; otherwise we reject it.
Kruskal-Wallis H-test
The Kruskal-Wallis H test is used to test whether two or more populations are
identical. In this test, the null hypothesis is H0: μ1 = μ2 = μ3 (when there are
three populations), and the alternative hypothesis is H1: not all the means are equal.
In the Kruskal-Wallis test, we first calculate the ranks of the observations in the
samples (treating all samples as one group) and then determine the rank sums for
each sample.
H = [12 / (n(n+1))] Σ (Ri² / ni) − 3(n+1), summed over i = 1, ..., m,
where
n is the total number of observations in all samples,
m is the number of samples,
ni is the number of observations in the ith sample,
Ri is the sum of the ranks in the ith sample.
Here, we use the χ² distribution with m − 1 degrees of freedom and α level of
significance to find the critical value. If the calculated value of H is less than
the χ² critical value, the null hypothesis is accepted; otherwise it is rejected.
Whenever some assumptions about the given population are uncertain, we use non-
parametric tests, which can be regarded as counterparts of the parametric tests.
When data are not normally distributed, or when they are on an ordinal level of
measurement, non-parametric tests should be used. The basic rule is to use a
parametric t-test for normally distributed data and a non-parametric test for
skewed data.
Solved Examples
Question 1: Use the Kruskal-Wallis test to test for differences in means among 3
samples, for α = 0.05.
Sample 1 : 100, 65, 102, 86, 80, 89, 98, 96, 91, 101
Sample 2: 84, 103, 126, 62, 92, 97, 95, 90, 94, 76
Sample 3: 90, 99, 57, 106, 88, 91, 88, 102, 77, 90.
Solution:
We first find the rank of each item in the samples (treating the whole group as one)
and then find the rank sums of each sample.

Sample 1   Rank     Sample 2   Rank     Sample 3   Rank
100        24       84         7        90         13
65         3        103        28       99         23
102        26.5     126        30       57         1
86         8        62         2        106        29
80         6        92         17       88         9.5
89         11       97         21       91         15.5
98         22       95         19       88         9.5
96         20       90         13       102        26.5
91         15.5     94         18       77         5
101        25       76         4        90         13

Rank sums: R1 = 161, R2 = 159, R3 = 145.
Here
n = 30,
ni = 10 for all i,
m = 3.
Degrees of freedom = m − 1 = 3 − 1 = 2.
Test statistic:
H = [12 / (30 × 31)] × (161²/10 + 159²/10 + 145²/10) − 3 × 31
= (12/930) × (2592.1 + 2528.1 + 2102.5) − 93
= 0.196
From the χ² distribution with m − 1 = 2 degrees of freedom and α = 0.05 we get the
critical value 5.991. Since H = 0.196 < 5.991, we accept the null hypothesis and
conclude that there is no difference in the means among the 3 samples.
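The same test can be run with scipy.stats.kruskal (which applies a small correction for the tied ranks, so its H differs slightly from the hand computation):

```python
from scipy.stats import kruskal

sample1 = [100, 65, 102, 86, 80, 89, 98, 96, 91, 101]
sample2 = [84, 103, 126, 62, 92, 97, 95, 90, 94, 76]
sample3 = [90, 99, 57, 106, 88, 91, 88, 102, 77, 90]

h_stat, p_value = kruskal(sample1, sample2, sample3)
print(round(h_stat, 3))  # close to the hand-computed 0.196
print(p_value > 0.05)    # True: fail to reject H0
```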
Question 2: The following data show the employees’ rate of defective work before
and after a change in the wage incentive plan. Compare the two sets of data to see
whether the change has lowered the number of defective units produced. Use the
sign test with α = 0.01.
Before: 9, 8, 7, 10, 8, 11, 9, 7, 6, 9, 11, 9
After: 7, 6, 9, 7, 10, 9, 10, 8, 6, 7, 10, 9.
Solution:
Here we use a one-tailed test, since we have to check whether the change has
lowered the defective rate. α = 0.01.
Among the 10 non-zero differences there are 6 plus signs and 4 minus signs (the 2
ties are excluded).
np = (4/10) × 10 = 4 < 5, so the normal approximation cannot be used and the
binomial distribution is applied.
Since P > 0.01, we accept the null hypothesis of the sign test.
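A sketch of this sign test with the binomial distribution (the exact P value is not shown in the text; here it is computed under H0: + and − are equally likely, p = 0.5):

```python
from scipy.stats import binom

before = [9, 8, 7, 10, 8, 11, 9, 7, 6, 9, 11, 9]
after  = [7, 6, 9, 7, 10, 9, 10, 8, 6, 7, 10, 9]

diffs = [b - a for b, a in zip(before, after)]
plus  = sum(1 for d in diffs if d > 0)  # defects lowered after the change
minus = sum(1 for d in diffs if d < 0)
n = plus + minus                        # ties are excluded

# One-tailed P value: probability of 4 or fewer minus signs
# out of 10, if + and - were equally likely under H0
p_value = binom.cdf(minus, n, 0.5)
print(plus, minus, n)     # 6 4 10
print(round(p_value, 3))  # 0.377
print(p_value > 0.01)     # True: accept H0 at alpha = 0.01
```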
... ANOVA is used to test general rather than specific differences among means.
Assumptions
ANOVA models are parametric, relying on assumptions about the distribution of
the dependent variables (DVs) for each level of the independent variable(s) (IVs).
Initially the array of assumptions for various types of ANOVA may seem
bewildering. In practice, the first two assumptions here are the main ones to check.
Note that the larger the sample size, the more robust ANOVA is to violation of the
first two assumptions: normality and homoscedasticity (homogeneity of variance).
Two-Way ANOVA
Assumptions
• The populations from which the samples were obtained must be normally or
approximately normally distributed.
• The samples must be independent.
• The variances of the populations must be equal.
Hypotheses
The null hypotheses for each of the sets are given below.
1. The population means of the first factor are equal. This is like the one-way
ANOVA for the row factor.
2. The population means of the second factor are equal. This is like the one-
way ANOVA for the column factor.
3. There is no interaction between the two factors. This is similar to performing
a test for independence with contingency tables.
Factors
The two independent variables in a two-way ANOVA are called factors. The idea
is that there are two variables, factors, which affect the dependent variable. Each
factor will have two or more levels within it, and the degrees of freedom for each
factor is one less than the number of levels.
Treatment Groups
Treatment groups are formed by making all possible combinations of the two
factors. For example, if the first factor has 3 levels and the second factor has 2
levels, then there will be 3×2 = 6 different treatment groups.
As an example, let's assume we're planting corn. The type of seed and type of
fertilizer are the two factors we're considering in this example. This example has
15 treatment groups. There are 3-1=2 degrees of freedom for the type of seed, and
5-1=4 degrees of freedom for the type of fertilizer. There are 2*4 = 8 degrees of
freedom for the interaction between the type of seed and type of fertilizer.
The data that actually appears in the table are samples. In this case, 2 samples from
each treatment group were taken.
Seed A-402 106, 110 95, 100 94, 107 103, 104 100, 102
Seed B-894 110, 112 98, 99 100, 101 108, 112 105, 107
Main Effect
The main effect involves the independent variables one at a time. The interaction is
ignored for this part. Just the rows or just the columns are used, not mixed. This is
the part which is similar to the one-way analysis of variance. Each of the variances
calculated to analyze the main effects are like the between variances
Interaction Effect
The interaction effect is the effect that one factor has on the other factor. The
degrees of freedom here is the product of the two degrees of freedom for each
factor.
Within Variation
The Within variation is the sum of squares within each treatment group. You have
one less than the sample size (remember all treatment groups must have the same
sample size for a two-way ANOVA) for each treatment group. The total number of
treatment groups is the product of the number of levels for each factor. The within
variance is the within variation divided by its degrees of freedom.
F-Tests
There is an F-test for each of the hypotheses, and the F-test is the mean square for
each main effect and the interaction effect divided by the within variance. The
numerator degrees of freedom come from each effect, and the denominator degrees
of freedom is the degrees of freedom for the within variance in each case.
It is assumed that main effect A has a levels (and A = a-1 df), main effect B has b
levels (and B = b-1 df), n is the sample size of each treatment, and N = abn is the
total sample size. Notice the overall degrees of freedom is once again one less than
the total sample size.
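The degrees-of-freedom bookkeeping described above can be sketched for the corn example (a = 3 seed types, b = 5 fertilizer types, n = 2 samples per treatment group):

```python
# Degrees of freedom for a two-way ANOVA, corn example:
# a = 3 seed types, b = 5 fertilizers, n = 2 samples per group
a, b, n = 3, 5, 2
N = a * b * n                 # total sample size: 30

df_a = a - 1                  # main effect A (seed): 2
df_b = b - 1                  # main effect B (fertilizer): 4
df_interaction = df_a * df_b  # interaction: 8
df_within = a * b * (n - 1)   # within (error): 15
df_total = N - 1              # total: 29

# The pieces must add back up to the total degrees of freedom
assert df_a + df_b + df_interaction + df_within == df_total
print(df_a, df_b, df_interaction, df_within, df_total)  # 2 4 8 15 29
```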
Source SS df MS F
Statistical Inference
An estimator is unbiased if, over a large number of samples, the average over all
estimates lies near the true parameter. For two estimators θ̂1 and θ̂2:
• Estimator θ̂1 is relatively efficient compared to θ̂2 if it has the smaller variance.
• An estimator is efficient if it has the smallest variance among all unbiased
estimators of ϑ.
Maximum likelihood method: choose ϑ to maximize the likelihood L(ϑ). Least
squares method: minimize the quadratic form Q(ϑ) = Σi=1..n (xi − E[Xi | ϑ])².
Introduction to Estimation
To estimate means to esteem (to give value to). An estimator is any quantity
calculated from the sample data which is used to give information about an
unknown quantity in the population. For example, the sample mean is an estimator
of the population mean µ.
Again, the usual estimator of the population mean µ is x̄ = Σxᵢ / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample. If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.
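As a minimal sketch of this estimator (with hypothetical sample values, chosen so that the estimate comes out to 5, echoing the example above):

```python
sample = [4, 6, 5, 7, 3]       # hypothetical sample values x1..xn
n = len(sample)
estimate = sum(sample) / n     # the estimator  x̄ = Σxi / n
```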
A "Good" estimator is the one which provides an estimate with the following
qualities:
Efficiency: An efficient estimate is one which has the smallest standard error
among all unbiased estimators.
The "best" estimator is the one which is the closest to the population parameter
being estimated.
The above figure illustrates the concept of closeness by means of aiming at the
center for unbiased with minimum variance. Each dart board has several samples:
The first one has all its shots clustered tightly together, but none of them hit the
center. The second one has a large spread, but around the center. The third one is
worse than the first two. Only the last one has a tight cluster around the center,
therefore has good efficiency.
The following chart depicts the quality of a few popular estimators for the
population mean µ:
The widely used estimator of the population mean µ is x̄ = Σxᵢ/n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample; it has all of the above good properties. Therefore, it is a "good" estimator.
You might like to use Descriptive Statistics Applet for obtaining "good" estimates.
Know that a confidence interval computed from one sample will be different from
a confidence interval computed from another sample.
Understand the relationship between sample size and width of confidence interval,
moreover, know that sometimes the computed confidence interval does not contain
the true value.
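The width relationship can be seen directly from the half-width z·s/√n of a confidence interval for the mean: quadrupling the sample size halves the interval width. A sketch with assumed values:

```python
import math

s, z = 10.0, 1.96   # assumed sample SD and 95% standard-normal multiplier
# half-width of the confidence interval for each sample size
half_widths = {n: z * s / math.sqrt(n) for n in (25, 100, 400)}
# each quadrupling of n halves the half-width: 3.92 -> 1.96 -> 0.98
```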
Let's say you compute a 95% confidence interval for a mean µ. The way to interpret this is to imagine an infinite number of samples from the same population: 95% of the computed intervals will contain the population mean µ, and the remaining 5% will not. However, it is wrong to state, "I am 95% confident that the population mean µ falls within the interval."
Tolerance Interval and CI: A good approximation for the single-measurement tolerance interval is √n times the confidence interval of the mean.
You need to use Sample Size Determination JavaScript at the design stage of your
statistical investigation in decision making with specific subjective requirements.
One should examine the confidence interval for the difference explicitly. Even if
the confidence intervals are overlapping, it is hard to find the exact overall
confidence level. However, the sum of individual confidence levels can serve as an
upper limit. This is evident from the fact that P(A and B) ≤ P(A) + P(B).
Interval estimation
Test of hypothesis
One process is to reject the null hypothesis if the observed value tobs of the test statistic is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.
An alternative process is commonly used:
1. Compute from the observations the observed value tobs of the test statistic T.
2. Calculate the p-value. This is the probability, under the null hypothesis, of
sampling a test statistic at least as extreme as that which was observed.
3. Reject the null hypothesis, in favor of the alternative hypothesis, if and only
if the p-value is less than the significance level (the selected probability)
threshold.
The two processes are equivalent.[6] The former process was advantageous in the
past when only tables of test statistics at common probability thresholds were
available. It allowed a decision to be made without the calculation of a probability.
It was adequate for classwork and for operational use, but it was deficient for
reporting results.
The latter process relied on extensive tables or on computational support not
always available. The explicit calculation of a probability is useful for reporting.
The calculations are now trivially performed with appropriate software.
Assume that a biological population is sampled and you wish to estimate the mean
value of some variable within that population. In chapter 3, we saw that the Central
Limit Theorem indicates that, when the population distribution is normal, the
sampling distribution of the mean also will be normal. In addition, we saw that,
when using the sample standard deviation, s, to estimate σ, the t distribution can
be used to represent the sampling distribution of the mean. Thus, the t distribution
can be used to test hypotheses about the population mean, μ. This is referred to as
the "one sample t test."
The t test evaluates the hypothesis that the parametric mean, μ, is equal to a particular value. That is, it tests H₀: μ = μ₀, where μ₀ is the specific value of interest. If H₀ is true, then the value

t = (ȳ − μ₀) / (s / √n)

follows the t distribution with n − 1 degrees of freedom.
Ultimately we will measure statistics (e.g. sample proportions and sample means)
and use them to draw conclusions about unknown parameters (e.g. population
proportion and population mean). This process, using statistics to make judgments
or decisions regarding population parameters is called statistical inference.
Example 4 – This is a test of a mean:
Is there a difference between the mean amount that men and women study per
week? Competing hypotheses are:
Null hypothesis: There is no difference between mean weekly hours of study for
men and women, written in statistical language as μ1 = μ2.
Alternative hypothesis: There is a difference between mean weekly hours of study
for men and women, written in statistical language as μ1 ≠ μ2.
This notation is used since the study would consider two independent samples: one
from Women and another from Men.
Test Statistic and p-value
▪ A test statistic is a summary of a sample that is in some way sensitive to
differences between the null and alternative hypothesis.
▪ A p-value is the probability that the test statistic would "lean" as much (or more)
toward the alternative hypothesis as it does if the real truth is the null hypothesis.
That is, the p-value is the probability that the sample statistic would occur under
the presumption that the null hypothesis is true.
A small p-value favors the alternative hypothesis. A small p-value means the
observed data would not be very likely to occur if we believe the null hypothesis is
true. So we believe in our data and disbelieve the null hypothesis. An easy
(hopefully!) way to grasp this is to consider the situation where a professor states
that you are just a 70% student. You doubt this statement and want to show that
you are better than a 70% student. If you took a random sample of 10 of your
previous exams and calculated the mean percentage of these 10 tests, which mean
would be less likely to occur if in fact you were a 70% student (the null
hypothesis): a sample mean of 72% or one of 90%? Obviously the 90% would be
less likely and therefore would have a small probability (i.e. p-value).
Using the p-value to Decide between the Hypotheses
▪ The significance level of a test is the border used for deciding between the null and
alternative hypotheses.
▪ Decision Rule: We decide in favor of the alternative hypothesis when a p-value is
less than or equal to the significance level. The most commonly used significance
level is 0.05.
In general, the smaller the p-value the stronger the evidence is in favor of the
alternative hypothesis.
Example 3 Continued:
In a recent elementary statistics survey, the sample proportion (of women) saying
they felt overweight was 37 /129 = .287. Note that this leans toward the alternative
hypothesis that the "true" proportion is less than .40. [Recall that the Tufts
University study finds that 40% of 12th grade females feel they are overweight. Is
this percent lower for college age females?]
Step 1: Let p = proportion of college age females who feel they are overweight.
Ho: p = .40 (or greater); that is, no difference from the Tufts study finding.
Ha: p < .40 (the proportion feeling they are overweight is less for college age females).
Step 2:
If npo ≥ 10 and n(1 – po) ≥ 10 then we can use the following Z-test statistic: Since
both (129) × (0.4) and (129) × (0.6) > 10 [or consider that the number of successes
and failures, 37 and 92 respectively, are at least 10] we calculate the test statistic
by:
z = (p̂ − p₀) / √(p₀(1 − p₀)/n)
Note: In computing the Z-test statistic for a proportion we use the hypothesized
value po here not the sample proportion p-hat in calculating the standard error! We
do this because we "believe" the null hypothesis to be true until evidence says
otherwise.
z = (0.287 − 0.40) / √(0.40(1 − 0.40)/129) = −2.62
Step 3: The p-value can be found from Standard Normal Table
Calculating p-value:
The method for finding the p-value is based on the alternative hypothesis:
2 × P(Z ≥ | z | ) for Ha : p ≠ po where |z| is the absolute value of z
P(Z ≥ z ) for Ha : p > po
P(Z ≤ z) for Ha : p < po
In our example we are using Ha : p < .40 so our p-value will be found from P(Z ≤
z) = P(Z ≤ -2.62) and from Standard Normal Table this is equal to 0.0044.
Step 4: We compare the p-value to alpha, which we take to be 0.05. Since 0.0044
is less than 0.05, we reject the null hypothesis and decide in favor of the
alternative, Ha.
Step 5: We’d conclude that the percentage of college age females who felt they
were overweight is less than 40%. [Note: we are assuming that our sample, since
not random, is representative of all college age females.]
The p-value = .0044 indicates that we should decide in favor of the alternative
hypothesis. Thus we decide that less than 40% of college women think they are
overweight.
The "Z-value" (-2.62) is the test statistic. It is a standardized score for the
difference between the sample p and the null hypothesis value p = .40. The p-
value is the probability that the z-score would lean toward the alternative
hypothesis as much as it does if the true population really was p = .40.
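The whole Example 3 computation can be sketched in a few lines, using the standard normal CDF written via the error function (the values are those in the text):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p_hat, p0, n = 0.287, 0.40, 129
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)  # Z-test statistic
p_value = phi(z)                                  # left-tailed: P(Z <= z)
# z is about -2.62 and p_value about 0.0044, matching the table lookup
```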
Chi-Square-tests and F-tests for variance or standard deviation both require that the
original population be normally distributed.
To test a claim about the value of the variance or the standard deviation of a
population, the test statistic follows a chi-square distribution with n − 1
degrees of freedom and is given by the following formula:

χ² = (n − 1)s² / σ₀²
The television habits of 30 children were observed. The sample mean was found to
be 48.2 hours per week, with a standard deviation of 12.4 hours per week. Test the
claim that the standard deviation was at least 16 hours per week.
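A minimal sketch of the test statistic for this example (the claim σ ≥ 16 makes it a left-tailed test; the resulting χ² would then be compared with the lower-tail critical value for 29 degrees of freedom):

```python
n, s, sigma0 = 30, 12.4, 16.0          # values from the example above
chi2 = (n - 1) * s ** 2 / sigma0 ** 2  # (n - 1) s^2 / sigma0^2
# chi2 is about 17.42 on n - 1 = 29 degrees of freedom
```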
We can use the F-distribution to test against a null hypothesis of equal variances.
Note that this approach does not allow us to test for a particular magnitude of
difference between variances or standard deviations.
Given sample sizes of n₁ and n₂, the test statistic will have n₁ − 1 and
n₂ − 1 degrees of freedom, and is given by the following formula:

F = s₁² / s₂²
If the larger variance (or standard deviation) is present in the first sample, then the
test is right-tailed. Otherwise, the test is left-tailed. Most tables of the F-
distribution assume right-tailed tests, but that requirement may not be necessary
when using technology.
Samples from two makers of ball bearings are collected, and their diameters (in
inches) are measured, with the following results:
Assuming that the diameters of the bearings from both companies are normally
distributed, test the claim that there is no difference in the variation of the
diameters between the two companies.
If the two samples had been reversed in our computations, we would have obtained
the test statistic F = 1.1741, and performing a right-tailed test, found the
p-value p = Fcdf(1.1741, ∞, 119, 79) = 0.2232.
Of course, the answer is the same.
UNIT-4
Correlation Analysis:
Correlation is a statistical tool that helps to measure and analyze the degree of
relationship between two variables. Correlation analysis deals with the
association between two or more variables.
Types of Correlation (Type I): 1) Positive Correlation; 2) Negative Correlation.
Positive Correlation: The correlation is said to be positive when the values of
the two variables change in the same direction, e.g., Pub. Exp. & sales,
height & weight.
Negative Correlation: The correlation is said to be negative when the values of
the variables change in opposite directions, e.g., price & quantity demanded.
Direction of the Correlation:-
Example:-The scores for nine students in physics and math are as follows:
Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28
Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31
Compute the student’s ranks in the two subjects and compute the Spearman rank
correlation.
Step 1: Find the ranks for each individual subject. I used the Excel rank function to
find the ranks. If you want to rank by hand, order the scores from greatest to
smallest; assign the rank 1 to the highest score, 2 to the next highest and so on:
Step 2: Add a third column, d, to your data. The d is the difference between ranks.
For example, the first student's physics rank is 3 and math rank is 5, so the
difference is −2. In a fourth column, square your d values.
Step 3: Sum the d² column (Σd² = 12) and substitute into the formula:
ρ = 1 − 6Σd² / (n(n² − 1))
  = 1 − (6 × 12) / (9(81 − 1))
  = 1 − 72/720
  = 1 − 0.1
  = 0.9
The Spearman Rank Correlation for this set of data is 0.9.
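The whole computation for this example can be sketched directly (this assumes no tied scores, which holds for this data, so simple ordinal ranking suffices):

```python
physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths   = [30, 33, 45, 23,  8, 49, 12, 4, 31]

def ranks(scores):
    # rank 1 goes to the highest score (no ties in this data)
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

# sum of squared rank differences, then the Spearman formula
d2 = sum((rp - rm) ** 2 for rp, rm in zip(ranks(physics), ranks(maths)))
n = len(physics)
rho = 1 - 6 * d2 / (n * (n * n - 1))
# d2 = 12 and rho = 0.9, as in the worked example
```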
Regression Analysis
In statistical modeling, regression analysis is a set of statistical processes
for estimating the relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the relationship
between a dependent variable and one or more independent variables (or
'predictors'). More specifically, regression analysis helps one understand how the
typical value of the dependent variable (or 'criterion variable') changes when any
one of the independent variables is varied, while the other independent variables
are held fixed.
Most commonly, regression analysis estimates the conditional expectation of the
dependent variable given the independent variables – that is, the average value of
the dependent variable when the independent variables are fixed. Less commonly,
the focus is on a quantile, or other location parameter of the conditional
distribution of the dependent variable given the independent variables. In all cases,
a function of the independent variables called the regression function is to be
estimated. In regression analysis, it is also of interest to characterize the variation
of the dependent variable around the prediction of the regression function using
a probability distribution. A related but distinct approach is Necessary Condition
Analysis[1] (NCA), which estimates the maximum (rather than average) value of
the dependent variable for a given value of the independent variable (ceiling line
rather than central line) in order to identify what value of the independent variable
is necessary but not sufficient for a given value of the dependent variable.
Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning. Regression analysis is also
used to understand which among the independent variables are related to the
dependent variable, and to explore the forms of these relationships. In restricted
circumstances, regression analysis can be used to infer causal
relationships between the independent and dependent variables. However this can
lead to illusions or false relationships, so caution is advisable;[2] for
example, correlation does not prove causation.
Many techniques for carrying out regression analysis have been developed.
Familiar methods such as linear regression and ordinary least squares regression
are parametric, in that the regression function is defined in terms of a finite number
of unknown parameters that are estimated from the data. Nonparametric
regression refers to techniques that allow the regression function to lie in a
specified set of functions, which may be infinite-dimensional.
The performance of regression analysis methods in practice depends on the form of
the data generating process, and how it relates to the regression approach being
used. Since the true form of the data-generating process is generally not known,
regression analysis often depends to some extent on making assumptions about this
process. These assumptions are sometimes testable if a sufficient quantity of data is
available. Regression models for prediction are often useful even when the
assumptions are moderately violated, although they may not perform optimally.
However, in many applications, especially with small effects or questions
of causality based on observational data, regression methods can give misleading
results.[3][4]
In a narrower sense, regression may refer specifically to the estimation of
continuous response (dependent) variables, as opposed to the discrete response
variables used in classification.[5] The case of a continuous dependent variable may
be more specifically referred to as metric regression to distinguish it from related
problems.[6]
Definition: The Regression Line is the line that best fits the data, such that the
overall distance from the line to the points (variable values) plotted on a graph is
the smallest. In other words, a line used to minimize the squared deviations of
predictions is called the regression line.
There are as many numbers of regression lines as variables. Suppose we take two
variables, say X and Y, then there will be two regression lines:
▪ Regression line of Y on X: This gives the most probable values of Y from the
given values of X.
▪ Regression line of X on Y: This gives the most probable values of X from the
given values of Y.
The algebraic expressions of these regression lines are called the Regression
Equations. There will be two regression equations for the two regression lines.
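A sketch of both regression slopes from hypothetical paired data; note that the product of the two slopes recovers r², which ties the two lines back to the correlation coefficient:

```python
import math

# hypothetical paired observations (assumed for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

b_yx = sxy / sxx            # slope of the regression line of Y on X
b_xy = sxy / syy            # slope of the regression line of X on Y
r = math.sqrt(b_yx * b_xy)  # valid here since both slopes are positive
```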
The correlation between the variables depends on the distance between these two
regression lines: the nearer the regression lines are to each other, the higher
the degree of correlation, and the farther apart they are, the lower the degree
of correlation.
The correlation is said to be either perfect positive or perfect negative when the
two regression lines coincide, i.e. only one line exists. In case the variables are
independent, the correlation will be zero, and the lines of regression will be at
right angles, i.e. parallel to the X axis and Y axis.
Note: The regression lines cut each other at the point of the averages of X and Y.
This means that if a perpendicular is dropped from the point where the lines
intersect onto the X axis, it meets it at the mean value of X. Similarly, a
horizontal line drawn from that point to the Y axis meets it at the mean value of Y.
Question: Find the equation of the two lines of regression and hence find
correlation coefficient from the following data.
For example: marks achieved by a student out of 100 in each subject. Find the
percentage:
Mathematics = 75
Statistics = 90
English = 80
Physics = 75
Chemistry = 85
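The percentage over these five subjects follows directly:

```latex
\text{Percentage} = \frac{75 + 90 + 80 + 75 + 85}{500} \times 100
                  = \frac{405}{500} \times 100 = 81\%
```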
Data like this is easy to operate on, isn't it? But when a large amount of data is
available, a frequency table is used. For example:
Marks (interval)   Frequency (number of students achieving marks in that interval)
0 - 10 0
11 - 20 1
21 - 30 2
31 - 40 2
41 - 50 5
51 - 60 10
61 - 70 10
71 - 80 20
81 - 90 10
91 - 100 0
Total 60
1. Scatter plot:
In scatter plots, it is possible to get an idea about the relationship between the
two variables at a glance. In a scatter plot, points are plotted on the X and Y
axes: the dependent variable is taken on the Y axis and the independent variable
on the X axis. The scatter plot looks as follows:
2. Regression analysis:
Regression analysis allows us to estimate future trends of data. It fits a straight
line to the data, and then, by substituting values of the independent variable,
future values of the dependent variable can easily be found. It also gives the
slope and intercept of the line, which can then be tested for the whole population
from which the sample was drawn.
3. Correlation coefficients:
Correlation coefficient indicates how much two variables are related to each other.
Steps and calculations to be performed are shown below. The value of the
correlation coefficient is always between −1 and 1: −1 means there is perfect
negative correlation, 1 stands for perfect positive correlation, and a value of
zero indicates no relationship between x and y at all. [A negative relationship
means that when one variable increases, the other decreases; a positive
relationship means that when one variable increases, the other increases as well.]
If the given data has numerical values on both sides and it is required to
recognize how much they are related to each other, there is a way to find out
whether there is correlation between the two variables and, if so, how much:
the "correlation coefficient (r)".
Consider a bivariate frequency table with cell frequencies Aij, marginal
frequencies Ti (of the Xi values) and Si (of the Yi values), and total frequency G.
Then

r = (E[XY] − x̅·y̅) / (σx·σy), where
E[XY] = (1/G) ΣΣ Xi·Yi·Aij
σx = √[ (1/ΣTi) Σ Ti(Xi − x̅)² ]
σy = √[ (1/ΣSi) Σ Si(Yi − y̅)² ]
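For ungrouped paired data the coefficient reduces to the familiar form r = (E[XY] − x̅y̅)/(σx σy); a minimal sketch with hypothetical values:

```python
import math

# hypothetical paired observations (assumed for illustration only)
x = [2, 4, 6, 8]
y = [3, 5, 4, 8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
e_xy = sum(a * b for a, b in zip(x, y)) / n        # E[XY]
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)  # sigma_x
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)  # sigma_y
r = (e_xy - mx * my) / (sx * sy)
```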
If two variables are related, then we may perform bivariate analysis on them to
find out their relationship.
For example:
The standard error of the estimate is a measure of the accuracy of predictions made
with a regression line. Consider the following data.
The second column (Y) is predicted by the first column (X). The slope and Y
intercept of the regression line are 3.2716 and 7.1526 respectively. The third
column, (Y'), contains the predictions and is computed according to the formula:
Y' = 3.2716X + 7.1526.
The fourth column (Y-Y') is the error of prediction. It is simply the difference
between what a subject's actual score was (Y) and what the predicted score is (Y').
The sum of the errors of prediction is zero. The last column, (Y-Y')², contains the
squared errors of prediction.
The regression line seeks to minimize the sum of the squared errors of prediction.
The square root of the average squared error of prediction is used as a measure of
the accuracy of prediction. This measure is called the standard error of the
estimate and is designated as σest. The formula for the standard error of the
estimate is:

σest = √( Σ(Y − Y')² / N )

where N is the number of pairs of (X,Y) points. For this example, the sum of the
squared errors of prediction (the numerator) is 70.77 and the number of pairs is 12.
The standard error of the estimate is therefore equal to:

σest = √(70.77 / 12) = 2.43
A time series is a series of data points indexed (or listed or graphed) in time order.
Most commonly, a time series is a sequence taken at successive equally spaced
points in time. Thus it is a sequence of discrete-time data. Examples of time series
are heights of ocean tides, counts of sunspots, and the daily closing value of
the Dow Jones Industrial Average.
Time series are very frequently plotted via line charts. Time series are used
in statistics, signal processing, pattern recognition, econometrics, mathematical
finance, weather forecasting, earthquake
prediction, electroencephalography, control
engineering, astronomy, communications engineering, and largely in any domain
of applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to
extract meaningful statistics and other characteristics of the data. Time
series forecasting is the use of a model to predict future values based on
previously observed values.
There are many objectives related to time series analysis; they may be classified as:
1. Description
2. Explanation
3. Prediction
4. Control
The factors that are responsible for bringing about changes in a time series,
also called the components of time series, are as follows:
Secular Trends
The secular trend is the main component of a time series which results from long
term effects of socio-economic and political factors. This trend may show the
growth or decline in a time series over a long period. This is the type of tendency
which continues to persist for a very long period. Prices and export and import
data, for example, reflect obviously increasing tendencies over time.
Seasonal Trends
These are short term movements occurring in data due to seasonal factors. The
short term is generally considered as a period in which changes occur in a time
series with variations in weather or festivities. For example, it is commonly
observed that the consumption of ice-cream during summer is generally high and
hence an ice-cream dealer's sales would be higher in some months of the year
while relatively lower during winter months. Employment, output, exports, etc.,
are subject to change due to variations in weather. Similarly, the sale of garments,
umbrellas, greeting cards and fire-works are subject to large variations during
festivals like Valentine’s Day, Eid, Christmas, New Year's, etc. These types of
variations in a time series are isolated only when the series is provided biannually,
quarterly or monthly.
Cyclic Movements
These are long term oscillations occurring in a time series. These oscillations are
mostly observed in economics data and the periods of such oscillations are
generally extended from five to twelve years or more. These oscillations are
associated with the well known business cycles. These cyclic movements can be
studied provided a long series of measurements, free from irregular fluctuations, is
available.
Irregular Fluctuations
These are sudden changes occurring in a time series which are unlikely to be
repeated. They are components of a time series which cannot be explained by
trends, seasonal or cyclic movements. These variations are sometimes called
residual or random components. These variations, though accidental in nature, can
cause a continual change in the trends, seasonal and cyclical oscillations during the
forthcoming period. Floods, fires, earthquakes, revolutions, epidemics, strikes etc.,
are the root causes of such irregularities.
In the Multiplicative model, a particular observation in a time series is the
product of these four components:
O = T × S × C × I
where O refers to the original data, T refers to the trend, S refers to seasonal
variations, C refers to cyclical variations and I refers to irregular variations.
This is the most commonly used model in the decomposition of time series.
There is another model called Additive model in which a particular observation in
a time series is the sum of these four components.
O=T+S+C+I
To prevent confusion between the two models, it should be made clear that in the
Multiplicative model S, C, and I are indices expressed as decimal percents,
whereas in the Additive model S, C and I are quantitative deviations about trend
that can be expressed as seasonal, cyclical and irregular in nature. If in a
multiplicative model T = 500, S = 1.4, C = 1.20 and I = 0.7, then
O = T × S × C × I
By substituting the values we get
O = 500 × 1.4 × 1.20 × 0.7 = 588
In the additive model, with T = 500, S = 100, C = 25, I = –50:
O = 500 + 100 + 25 – 50 = 575
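Substituting the illustrative component values into the two schemes (a sketch; in the multiplicative model S, C and I are index numbers, while in the additive model they are absolute deviations from trend):

```python
# multiplicative model: O = T * S * C * I
T, S, C, I = 500, 1.4, 1.20, 0.7
o_mult = T * S * C * I         # about 588

# additive model: O = T + S + C + I
T, S, C, I = 500, 100, 25, -50
o_add = T + S + C + I          # 575
```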
The assumption underlying the two schemes of analysis is that whereas there is no
interaction among the different constituents or components under the additive
scheme, such interaction is very much present in the multiplicative scheme. Time
series analysis generally proceeds on the assumption of the multiplicative formulation.
Methods of Measuring Trend
Trend can be determined by: (i) the freehand curve method; (ii) the moving
averages method; (iii) the semi-averages method; and (iv) the least-squares
method. Each of these methods is described below:
(i) Freehand Curve Method : The term freehand is applied to any non-mathematical
curve in statistical analysis, even if it is drawn with the aid of drafting
instruments. This is the simplest method of studying the trend of a time series.
The procedure for drawing a freehand curve is as follows:
Year :     1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
Quantity : 239  242  238  252  257  250  273  270  268  288  284
Year :     2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Quantity : 282  300  303  298  313  317  309  329  333  327
Solution :
Year Quantity 5-yearly moving total 5-yearly moving average
1990 239
1991 242
1992 238 1228 245.6
1993 252 1239 247.8
1994 257 1270 254.0
1995 250 1302 260.4
1996 273 1318 263.6
1997 270 1349 269.8
1998 268 1383 276.6
1999 288 1392 278.4
2000 284 1422 284.4
2001 282 1457 291.4
2002 300 1467 293.4
2003 303 1496 299.2
2004 298 1531 306.2
2005 313 1540 308.0
2006 317 1566 313.2
2007 309 1601 320.2
2008 329 1615 323.0
2009 333
2010 327
To simplify the calculation work: obtain the total of the first five years' data. Find
the difference between the first and the sixth term and add it to the total to obtain the
total of the second to the sixth term. In this way the difference between the term to be omitted and
the term to be included is added to the preceding total in order to obtain the next
successive total.
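The rolling-total shortcut just described can be sketched in Python for the quantities in the solution table (1990 to 2010):

```python
# 5-yearly moving averages via the shortcut: add the incoming term and drop
# the outgoing one instead of re-summing each window.
quantity = [239, 242, 238, 252, 257, 250, 273, 270, 268, 288, 284,
            282, 300, 303, 298, 313, 317, 309, 329, 333, 327]
period = 5
totals = [sum(quantity[:period])]                 # total of the first five years
for i in range(period, len(quantity)):
    totals.append(totals[-1] + quantity[i] - quantity[i - period])
averages = [round(t / period, 1) for t in totals]
# Each average is centred against the middle year of its five-year span,
# so no trend value exists for the first two and last two years.
```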
Illustration : Fit a trend line by the method of four-yearly moving average to the
following time series data.
Year : 1995 1996 1997 1998 1999 2000 2001 2002
Sugar production (lakh tons) : 5 6 7 7 6 8 9 10
Year : 2003 2004 2005 2006
Sugar production (lakh tons) : 9 10 11 11
Solution :
Remark : Observe carefully the placement of the totals and averages between the lines.
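With an even period the four-yearly averages fall between two years, so a further two-item average is taken to centre them against actual years. A minimal sketch for the sugar data above:

```python
# Four-yearly centred moving average for the sugar production data (1995-2006)
production = [5, 6, 7, 7, 6, 8, 9, 10, 9, 10, 11, 11]
four_totals = [sum(production[i:i + 4]) for i in range(len(production) - 3)]
four_avgs = [t / 4 for t in four_totals]
# centre by averaging each successive pair of 4-yearly averages
centred = [round((a + b) / 2, 3) for a, b in zip(four_avgs, four_avgs[1:])]
# centred[0] is the trend value for 1997, the first year with a centred average
```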
Merits
1. This is a very simple method.
2. The element of flexibility is always present in this method, as the earlier
calculations need not be altered if new data are added; the additions merely provide
further trend values.
3. If there is a coincidence of the period of moving averages and the period of
cyclical fluctuations, the fluctuations automatically disappear.
4. The pattern of the moving average is determined by the trend in the data and
remains unaffected by the personal judgement of the investigator.
5. It can be put to utmost use in case of series having strikingly irregular trend.
Limitations
1. It is not possible to have a trend value for each and every year. As the period of
the moving average increases, the number of years for which trend values cannot be
calculated also increases. For example, in a five-yearly moving average, trend values
cannot be obtained for the first two and last two years; in a seven-yearly moving
average, for the first three and last three years; and so on. But usually the values of
the extreme years are of great interest.
2. There is no hard and fast rule for the selection of a period of moving average.
3. Forecasting is one of the leading objectives of trend analysis. But this objective
remains unfulfilled because moving average is not represented by a mathematical
function.
4. Theoretically it is claimed that cyclical fluctuations are ironed out if the period of
the moving average coincides with the period of the cycle, but in practice cycles are
not perfectly periodic.
Trend by the Method of Semi-averages : This method can be used if a straight-line
trend is to be obtained. Since the location of only two points is necessary to obtain a
straight-line equation, we may select two representative points and connect them by
a straight line. The data are divided into two halves and an average is obtained for
each half. When each such average is plotted against the mid-point of its half period,
we obtain two points on a graph paper. By joining these points, a straight-line trend
is obtained.
The method is to be commended for its simplicity and used to some extent in
practical work. This method is also flexible, for it is permissible to select
representative periods to determine the two points. Unrepresentative years may be
ignored.
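The method reduces to two averages and the line through them; a minimal sketch on hypothetical yearly data:

```python
# Semi-averages method: split the series into halves, average each half, and
# join the two averages (plotted at the mid-points of the halves) by a line.
values = [10, 12, 13, 15, 16, 18, 19, 21]       # eight years of hypothetical data
half = len(values) // 2
avg1 = sum(values[:half]) / half                # average of the first half
avg2 = sum(values[half:]) / half                # average of the second half
# The mid-points of the halves are 'half' years apart, so the yearly slope
# of the trend line joining the two points is:
slope = (avg2 - avg1) / half
```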
Method of Least Squares : If a straight line is fitted to the data, it will serve as a
satisfactory trend; perhaps the most accurate method of fitting is that of least
squares. This method is designed to accomplish two results.
(i) The sum of the vertical deviations from the straight line must equal zero.
(ii) The sum of the squares of all deviations must be less than the sum of the
squares for any other conceivable straight line.
There will be many straight lines which can meet the first condition. Among all
different lines, only one line will satisfy the second condition. It is because of this
second condition that this method is known as the method of least squares. It may
be mentioned that a line fitted to satisfy the second condition, will automatically
satisfy the first condition.
The formula for a straight-line trend can most simply be expressed as
Yc = a + bX
where X represents time variable, Yc is the dependent variable for which trend
values are to be calculated, and a and b are the constants of the straight line to be
found by the method of least squares.
Constant a is the Y-intercept. This is the distance between the point of origin
(O) and the point where the trend line intersects the Y-axis. It shows the value of Y
when X = 0. Constant b indicates the slope, which is the change in Y for each unit
change in X.
Let us assume that we are given observations of Y for n years. We wish to find the
values of the constants a and b in such a manner that the two conditions laid down
above are satisfied by the fitted equation.
Mathematical reasoning suggests that, to obtain the values of constants a and b
according to the Principle of Least Squares, we have to solve simultaneously the
following two equations.
∑Y = na + b∑X ...(i)
∑XY = a∑X + b∑X2 ...(ii)
Solution of the two normal equations yield the following values for the constants a
and b :
b = (n∑XY – ∑X∑Y) / (n∑X² – (∑X)²)
and a = (∑Y – b∑X) / n
Least Squares Long Method : It makes use of the above mentioned two normal
equations without attempting to shift the time variable to convenient mid-year.
This method is illustrated by the following example.
Illustration : Fit a linear trend curve by the least-squares method to the following
data :
Year Production (Kg.)
2001 3
2002 5
2003 6
2004 6
2005 8
2006 10
2007 11
2008 12
2009 13
2010 15
Solution : The first year, 2001, is assumed to be 0; 2002 then becomes 1, 2003
becomes 2, and so on. The various steps are outlined in the following table.
----------------------------------------------------
Year Production
Y X XY X2
1 2 3 4 5
----------------------------------------------------
2001 3 0 0 0
2002 5 1 5 1
2003 6 2 12 4
2004 6 3 18 9
2005 8 4 32 16
2006 10 5 50 25
2007 11 6 66 36
2008 12 7 84 49
2009 13 8 104 64
2010 15 9 135 81
Total 89 45 506 285
-----------------------------------------------------
The above table yields the following values for various terms mentioned below :
n = 10, ∑X = 45, ∑X2 = 285, ∑Y = 89, and ∑XY = 506
Substituting these values in the two normal equations, we obtain
89 = 10a + 45b ...(i)
506 = 45a + 285b ...(ii)
Multiplying equation (i) by 9 and equation (ii) by 2, we obtain
801 = 90a + 405b ...(iii)
1012 = 90a + 570b ...(iv)
Subtracting equation (iii) from equation (iv), we obtain
211 = 165b or b = 211/165 = 1.28
Substituting the value of b in equation (i), we obtain
89 = 10a + 45 × 1.28
89 = 10a + 57.60
10a = 89 – 57.6
10a = 31.4
a = 31.4/10 = 3.14
Substituting these values of a and b in the linear equation, we obtain the following
trend line
Yc = 3.14 + 1.28X
Inserting various values of X in this equation, we obtain the trend values as below :
-----------------------------------------------------------------
Year Observed Y a b×X Yc (Col. 3 plus Col. 4)
1 2 3 4 5
-----------------------------------------------------------------
2001 3 3.14 1.28 × 0 3.14
2002 5 3.14 1.28 × 1 4.42
2003 6 3.14 1.28 × 2 5.70
2004 6 3.14 1.28 × 3 6.98
2005 8 3.14 1.28 × 4 8.26
2006 10 3.14 1.28 × 5 9.54
2007 11 3.14 1.28 × 6 10.82
2008 12 3.14 1.28 × 7 12.10
2009 13 3.14 1.28 × 8 13.38
2010 15 3.14 1.28 × 9 14.66
-------------------------------------------------------------------
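The long method above can be sketched directly from the normal-equation solution; rounding in the worked solution gives a = 3.14 and b = 1.28:

```python
# Least-squares "long method" for the production data (2001-2010, X = 0..9)
y = [3, 5, 6, 6, 8, 10, 11, 12, 13, 15]
x = list(range(len(y)))
n = len(y)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2 = sum(xi * xi for xi in x)
# Solving the two normal equations simultaneously gives:
b = (n * sxy - sx * sy) / (n * sx2 - sx * sx)
a = (sy - b * sx) / n
trend = [a + b * xi for xi in x]                # fitted trend values
```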
Least Squares Method : We can take any other year as the origin, and for that
year X would be 0. Considerable saving of both time and effort is possible if the
origin is taken in the middle of the whole time span covered by the entire series.
The origin would then be located at the mean of the X values. Sum of the X values
would then equal 0. The two normal equations would then be simplified to
∑Y = Na ...(i) or a = ∑Y/N
∑XY = b∑X² ...(ii) or b = ∑XY/∑X²
Solution : Here there are two mid-years, viz., 2006 and 2007. The mid-point of the
two years is assumed to be 0 and a period of six months is treated as the unit.
On this basis the calculations are as shown below:
----------------------------------------------
Years Observed Y X XY X2
----------------------------------------------
2003 6.7 – 7 – 46.9 49
2004 5.3 – 5 – 26.5 25
2005 4.3 – 3 – 12.9 9
2006 6.1 – 1 – 6.1 1
2007 5.6 1 5.6 1
2008 7.9 3 23.7 9
2009 5.8 5 29.0 25
2010 6.1 7 42.7 49
----------------------------------------------
Total 47.8 0 8.6 168
----------------------------------------------
From the above computations, we get the following values.
n = 8, ∑Y = 47.8, ∑X = 0, ∑XY = 8.6, ∑X2 = 168
Substituting these values in the two normal equations, we obtain
47.8 = 8a or a = 47.8/8 = 5.98, and 8.6 = 168b or b = 8.6/168 = 0.051
The equation for the trend line is : Yc = 5.98 + 0.051X
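With the origin at the middle of the series the computation collapses to two divisions, as a short sketch on the data in the table above shows:

```python
# Least-squares shortcut: origin at the mid-point of 2006/2007, X in
# half-year units so that sum(x) = 0 and the normal equations decouple.
y = [6.7, 5.3, 4.3, 6.1, 5.6, 7.9, 5.8, 6.1]
x = [-7, -5, -3, -1, 1, 3, 5, 7]
n = len(y)
a = sum(y) / n                                        # since sum(x) == 0
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
```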
Trend values generated by this equation are below :
(ii) This method gives the line of best fit because from this line the sum of the
positive and negative deviations is zero and the total of the squares of these
deviations is minimum.
Limitations
The best practicable use of mathematical trends is for describing movements in time
series; they do not provide a clue to the causes of such movements. Therefore,
forecasting on this basis may be quite risky.
Forecasting will be valid only if there is a functional relationship between the
variable under consideration and time. A trend describes past behaviour; it hardly
throws light on the causes which may influence future behaviour.
The other limitation is that if some items are added to the original data, a new
equation has to be obtained.
Curvilinear Trend
Sometimes a time series may not be represented by a straight-line trend. Such
trends are known as curvilinear trends. If the curvilinear trend can be represented by
a straight line on semi-log paper, by a polynomial of second or higher degree, or by
a double-logarithmic function, then the method of least squares is also applicable to
such cases.
MEASUREMENT OF SEASONAL VARIATIONS
Seasonal variations are those rhythmic changes in the time series data that occur
regularly each year. They have their origin in climatic or institutional factors that
affect either supply or demand or both. It is important that these variations be
measured accurately for three reasons. First, the investigator may want to eliminate
seasonal variations from the data he is studying. Second, a precise knowledge of
the seasonal pattern aids in planning future operations. Lastly, complete knowledge
of seasonal variations is of use to those who are trying to remove the causes of
seasonals or are attempting to mitigate the problem by diversification, by offsetting
opposing seasonal patterns, or by some other means.
Since the number of calendar days and working days varies from month to month, it
is essential to adjust the monthly figures if they are based on daily quantities. No
such adjustment is needed when we deal with the volume of inventories or of bank
deposits, because then the values are not
averages by 12.
(iv) In column No. 9 each monthly average has been expressed as a percentage of
the average of the monthly averages. Thus,
Percentage for January = (monthly average for January ÷ average of monthly averages) × 100
Percentage for February = (monthly average for February ÷ average of monthly averages) × 100
If instead of monthly data, we are given weekly or quarterly data, we shall
compute weekly or quarterly averages by following the same procedure.
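The procedure can be sketched on hypothetical quarterly data: average each quarter across the years, then express each quarterly average as a percentage of the average of those averages.

```python
# Method of simple averages on hypothetical quarterly data (3 years per quarter)
data = {
    "Q1": [30, 34, 40],
    "Q2": [40, 52, 58],
    "Q3": [36, 40, 54],
    "Q4": [34, 44, 48],
}
q_avgs = {q: sum(v) / len(v) for q, v in data.items()}       # quarterly averages
grand = sum(q_avgs.values()) / len(q_avgs)                   # average of averages
indices = {q: round(100 * avg / grand, 2) for q, avg in q_avgs.items()}
# By construction the four indices average 100, i.e. they total about 400.
```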
Ratio-to-moving average method : The method of monthly totals or monthly
averages does not give any consideration to the trend which may be present in the
data. The ratio-to-moving-average method is one of the simplest of the commonly
used devices for measuring seasonal variation which takes the trend into
consideration: The steps to compute seasonal variation are as follows :
(i) Arrange the unadjusted data by years and months.
(ii) Compute the trend values by the method of moving averages. For this purpose
take a 12-month moving average followed by a two-month moving average to
recentre the trend values.
(iii) Express the data for each month as a percentage ratio of the corresponding
moving-average trend value.
(iv) Arrange these ratios by months and years.
(v) Aggregate the ratios for January, February etc.
(vi) Find the average ratio for each month.
(vii) Adjust the average monthly ratios found in step (vi) so that they will
themselves average 100 percent. These adjusted ratios will be the seasonal indices
for various months.
A seasonal index computed by the ratios-to-moving-average method ordinarily
does not fluctuate so much as the index based on straight-line trends. This is
because the 12-month moving average follows the cyclical course of the actual
data quite closely. Therefore the index ratios obtained by this method are often
more representative of the data from which they are obtained than is the case in the
ratio-to-trend method which will be discussed later on.
Illustration : Prepare a monthly seasonal index from the following data, using the
moving-averages method :
Monthly Sales of XYZ Products Co. Ltd. (Rs.)
Year
2000 2001 2002
January 3,639 3,913 4,393
February 3,591 3,856 4,530
March 3,326 3,714 4,287
April 3,469 3,820 4,405
May 3,321 3,647 4,024
June 3,320 3,498 3,992
July 3,205 3,476 3,795
August 3,205 3,354 3,492
September 3,255 3,594 3,571
October 3,550 3,830 3,923
November 3,771 4,183 3,984
December 3,772 4,482 3,880
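Steps (i) to (vii) can be sketched in Python for the sales data above; the worked tables are not reproduced here, but the indices are adjusted so that they total 1200:

```python
# Ratio-to-moving-average method on the monthly sales for 2000-2002
sales = [
    3639, 3591, 3326, 3469, 3321, 3320, 3205, 3205, 3255, 3550, 3771, 3772,  # 2000
    3913, 3856, 3714, 3820, 3647, 3498, 3476, 3354, 3594, 3830, 4183, 4482,  # 2001
    4393, 4530, 4287, 4405, 4024, 3992, 3795, 3492, 3571, 3923, 3984, 3880,  # 2002
]
n = len(sales)
# (ii) 12-month moving totals, then a 2-item average to centre the trend values
totals12 = [sum(sales[i:i + 12]) for i in range(n - 11)]
centred = [(totals12[i] + totals12[i + 1]) / 24 for i in range(len(totals12) - 1)]
# centred[0] is the trend value for month index 6 (July 2000)
ratios = {m: [] for m in range(12)}       # (iii)-(v) ratios grouped by month
for i, ma in enumerate(centred):
    month = (i + 6) % 12                  # calendar month of this ratio
    ratios[month].append(100 * sales[i + 6] / ma)
# (vi) average the ratios for each month, (vii) adjust so they average 100
means = [sum(r) / len(r) for _, r in sorted(ratios.items())]
k = 1200 / sum(means)
seasonal_index = [round(m * k, 2) for m in means]
```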
say, 20 or 15, are used in the computation, it is not uncommon to omit extremely
erratic ratios from the computation of the average of monthly ratios. Only the
arithmetic average should be used for a small number of years.
This method has the advantage of simplicity and ease of interpretation. Although it
makes allowance for the trend, it may be influenced by errors in the calculation of
the trend. The method may also be influenced by cyclical and erratic influences.
This source of possible error is reduced by selecting a period of time in which
depression is offset by prosperity.
Illustration : Find seasonal variations by the ratio-to-trend method from the
following data :
Year 1st Quarter 2nd Quarter 3rd Quarter 4th Quarter
2000 30 40 36 34
2001 34 52 40 44
2002 40 58 54 48
2003 54 76 68 62
2004 80 92 86 82
Solution : For finding seasonal variations by the ratio-to-trend method, first the
trend values for the yearly data are obtained and then converted into quarterly values.
Average 92.78 118.28 102.92 89.12
The average of the quarterly averages of the trend ratios = (92.78 + 118.28 + 102.92 + 89.12) / 4 = 100.775
Quarterly seasonal index for 1st Quarter : (92.78 / 100.775) × 100 = 92.07
Quarterly seasonal index for 2nd Quarter : (118.28 / 100.775) × 100 = 117.37
Quarterly seasonal index for 3rd Quarter : (102.92 / 100.775) × 100 = 102.13
Quarterly seasonal index for 4th Quarter : (89.12 / 100.775) × 100 = 88.43
The total of seasonal indices should be equal to 400 and that for monthly indices
should be 1200.
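The final adjustment step can be checked with the average quarterly ratios given in the solution above:

```python
# Converting average quarterly trend-ratios into seasonal indices totalling 400
avg_ratios = [92.78, 118.28, 102.92, 89.12]    # averages of the quarterly ratios
grand = sum(avg_ratios) / 4                    # 100.775
indices = [round(100 * r / grand, 2) for r in avg_ratios]
```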
Merits
(i) This method is based on a logical procedure for measuring seasonal variations.
It has an advantage over the moving-average method, for it gives a ratio-to-trend
value for each month for which data are available. This method thus avoids the loss
of data inherent in the case of moving averages, an advantage that becomes more
prominent when the time series is very short.
A second-degree trend equation is appropriate for the secular trend component of a
time series when the data do not fall in a straight line.
Illustration: Fit a parabola (Yc = a + bX + cX2) from the following
Years 1 2 3 4 5 6 7
Values 35 38 40 42 36 39 45
– 84c = – 4
c = 4/84 ≈ 0.05
By substituting the value of c in equation (i) we get the value of a
7a + 28 × 4/84 = 275
7a = 275 – 1.33
a = 273.67/7 = 39.09
We may get the value of b with the help of equation (ii)
28b = 28
b = 1
The required equation would be:
Yc = 39.09 + 1X + 0.05 X2
= 39.09 + X + 0.05 X2
With the help of the above equation we can estimate the value for year 8, where X = 4
Yc = 39.09 + 4 + 0.05 (4)2
= 39.09 + 4 + 0.8 = 43.89
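The parabola fit can be sketched by solving the normal equations directly; with the exact value c = 4/84 the year-8 estimate comes to about 43.86, against 43.89 in the rounded working above:

```python
# Fitting Yc = a + bX + cX^2 to the values above, origin at year 4 (X = -3..3)
y = [35, 38, 40, 42, 36, 39, 45]
x = [-3, -2, -1, 0, 1, 2, 3]
n = len(y)
sy = sum(y)                                   # 275
sxy = sum(xi * yi for xi, yi in zip(x, y))    # 28
sx2 = sum(xi * xi for xi in x)                # 28
sx2y = sum(xi * xi * yi for xi, yi in zip(x, y))
sx4 = sum(xi ** 4 for xi in x)                # 196
b = sxy / sx2                                 # odd powers of X sum to zero
# remaining normal equations: sy = n*a + c*sx2 ; sx2y = a*sx2 + c*sx4
c = (n * sx2y - sx2 * sy) / (n * sx4 - sx2 * sx2)
a = (sy - c * sx2) / n
y8 = a + b * 4 + c * 16                       # estimate for year 8 (X = 4)
```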
Exponential Trend
The equation for an exponential trend is of the form y = ab^x.
Taking logs of both sides we get: log y = log a + x log b
To get the values of a and b we have the normal equations:
∑log y = N log a + log b ∑x
∑(x · log y) = log a ∑x + log b ∑x²
When we solve these equations (choosing the origin so that ∑x = 0) we get:
log a = ∑log y / N and log b = ∑(x · log y) / ∑x²
Illustration : The production of certain raw material by a company in lakh tons for
the years 1996 to 2002 are given below:
Year : 1996 1997 1998 1999 2000 2001 2002
Production : 32 47 65 92 132 190 275
Estimate the production figure for the year 2003 using an equation of the form y = ab^x,
where x = years and y = production
Solution :
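The worked solution is not reproduced in full here; a minimal sketch, taking the origin at 1999 so that ∑x = 0, gives an estimate of roughly 387 lakh tons for 2003:

```python
# Exponential trend y = a * b**x fitted by least squares on the logarithms
from math import log10

production = [32, 47, 65, 92, 132, 190, 275]      # 1996-2002
x = [-3, -2, -1, 0, 1, 2, 3]                      # origin at 1999
logs = [log10(v) for v in production]
log_a = sum(logs) / len(logs)                     # since sum(x) == 0
log_b = sum(xi * li for xi, li in zip(x, logs)) / sum(xi * xi for xi in x)
a, b = 10 ** log_a, 10 ** log_b
y_2003 = a * b ** 4                               # 2003 corresponds to x = 4
```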
The best fit line is the line for which the sum of the squared vertical distances
between each of the n data points and the line is as small as possible. A
mathematically useful approach is therefore to find the line for which the following
sum of squares is minimum: SSE = Σ (yi – (a + bxi))²
Theorem 1: The best fit line for the points (x1, y1), …, (xn, yn) is given by ŷ = a + bx, where
b = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)² and a = ȳ – b x̄
Definition 1: The best fit line is called the regression line.
Observation: The theorem shows that the regression line passes through the point
(x̄, ȳ) and has equation ŷ = ȳ + b(x – x̄).
Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can
be viewed as either the population covariance and variance or the sample
covariance and variance. Thus a and b can be calculated in Excel as follows where
R1 = the array of y values and R2 = the array of x values:
b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)
a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)
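The two Excel formulas above can be mirrored in Python using the population covariance and variance, which is a convenient way to check them on a small data set:

```python
# Python equivalent of b = COVAR(R1, R2) / VARP(R2) and
# a = AVERAGE(R1) - b * AVERAGE(R2)
def slope_intercept(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # COVAR
    varp = sum((x - mx) ** 2 for x in xs) / n                    # VARP
    b = cov / varp                     # SLOPE(R1, R2)
    a = my - b * mx                    # INTERCEPT(R1, R2)
    return a, b

# On points lying exactly on y = 2x + 1 the fit recovers the line
a, b = slope_intercept([1, 2, 3, 4], [3, 5, 7, 9])
```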
Property 1: b = r · sy / sx, where r is the correlation coefficient and sx, sy are the standard deviations of x and y.
Proof: By Definition 2 of Correlation, r = cov(x, y) / (sx sy); since b = cov(x, y) / sx², the result follows.
Excel Functions: Excel provides the following functions for forecasting the value
of y for any x based on the regression line. Here R1 = the array of y data values
and R2 = the array of x data values:
SLOPE(R1, R2) = slope of the regression line as described above
INTERCEPT(R1, R2) = y-intercept of the regression line as described above
FORECAST(x, R1, R2) calculates the predicted value y for the given value of x.
Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b =
SLOPE(R1, R2).
TREND(R1, R2) = array function which produces an array of predicted y values
corresponding to x values stored in array R2, based on the regression line
calculated from x values stored in array R2 and y values stored in array R1.
TREND(R1, R2, R3) = array function which predicts the y values corresponding
to the x values in R3 based on the regression line based on the x values stored in
array R2 and y values stored in array R1.
To use TREND(R1, R2), highlight the range where you want to store the predicted
values of y. Then enter TREND and a left parenthesis. Next highlight the array of
observed values for y (array R1), enter a comma and highlight the array of
observed values for x (array R2) followed by a right parenthesis. Finally
press Ctrl-Shift-Enter.
To use TREND(R1, R2, R3), highlight the range where you want to store the
predicted values of y. Then enter TREND and a left parenthesis. Next highlight the
array of observed values for y (array R1), enter a comma and highlight the array of
observed values for x (array R2) followed by another comma and highlight the
array R3 containing the values for x for which you want to predict y values based
on the regression line. Now enter a right parenthesis and press Ctrl-Shift-Enter.
Excel 2016 Function: Excel 2016 introduces a new
function FORECAST.LINEAR, which is equivalent to FORECAST.
Example 1: Calculate the regression line for the data in Example 1 of One Sample
Hypothesis Testing for Correlation and plot the results.