Introduction to Statistics:
Definitions :
Statistics are not just numbers and facts. You know, things like 4 out of 5 dentists prefer a
specific toothpaste. Instead, it’s an array of knowledge and procedures that allow you to
learn from data reliably. Statistics allow you to evaluate claims based on quantitative
evidence and help you differentiate between reasonable and dubious conclusions. That
aspect is particularly vital these days because data are so plentiful along with
interpretations presented by people with unknown motivations.
2.Statistics offer critical guidance in producing trustworthy analyses and predictions. Along
the way, statisticians can help investigators avoid a wide variety of analytical traps.
When analysts use statistical procedures correctly, they tend to produce accurate results.
In fact, statistical analyses account for uncertainty and error in the results. Statisticians
ensure that all aspects of a study follow the appropriate methods to produce trustworthy
results. These methods include:
a. Producing reliable data.
b. Analyzing the data appropriately.
c. Drawing reasonable conclusions.
Statisticians Know How to Avoid Common Pitfalls
3.Using statistical analyses to produce findings for a study is the culmination of a long
process. This process includes constructing the study design, selecting and measuring the
variables, devising the sampling technique and sample size, cleaning the data, and
determining the analysis methodology among numerous other issues. The overall quality
of the results depends on the entire chain of events. A single weak link might produce
unreliable results. The following list provides a small taste of potential problems and
analytical errors that can affect a study.
4. Use of Statistics to Make an Impact in Your Field: Statistical analyses are used in
almost all fields to make sense of the vast amount of data that are available. Even if the
field of statistics is not your primary field of study, it can help you make an impact in your
chosen field. Chances are very high that you’ll need working knowledge of statistical
methodology both to produce new findings in your field and to understand the work of
others.
Scope of statistics :
Statistics plays a vital role in every field of human activity. Statistics helps in
determining the existing position of per capita income, unemployment, population growth
rates, housing, schooling, medical facilities, etc. in a country.
Now statistics holds a central position in almost every field, including industry,
commerce, trade, physics, chemistry, economics, mathematics, biology, botany,
psychology, astronomy, etc., so the application of statistics is very wide. Now we shall
discuss some important fields in which statistics is commonly applied.
(1) Business
Statistics plays an important role in business. A successful businessman must be
very quick and accurate in decision making. He knows what his customers want; he
should therefore know what to produce and sell and in what quantities.
(2) Economics
Economics largely depends upon statistics. National income accounts are
multipurpose indicators for economists and administrators, and statistical methods are
used to prepare these accounts. In economics research, statistical methods are used to
collect and analyze the data and test hypotheses. The relationship between supply and
demand is studied by statistical methods; imports and exports, inflation rates, and per
capita income are problems which require a good knowledge of statistics.
(3) Mathematics
Statistics plays a central role in almost all natural and social sciences. The methods
used in natural sciences are the most reliable, but conclusions drawn from them are only
probable because they are based on incomplete evidence. Statistics helps in describing
these measurements more precisely. Statistics is a branch of applied mathematics. A
large number of statistical methods, such as probability, averages, dispersion, and estimation,
are used in mathematics, and different techniques of pure mathematics, such as integration,
differentiation, and algebra, are used in statistics.
(4) Banking
Statistics plays an important role in banking. Banks make use of statistics for a
number of purposes. They work on the principle that everyone who deposits their money
with the banks does not withdraw it at the same time. The bank earns profits out of these
deposits by lending it to others on interest. Bankers use statistical approaches based on
probability to estimate the number of deposits and their claims for a certain day.
(8) Astronomy
Astronomy is one of the oldest branches of statistical study; it deals with the
measurement of the distances, sizes, masses, and densities of heavenly bodies by means
of observations. During these measurements errors are unavoidable, so the most probable
measurements are found by using statistical methods.
Example: The distance of the moon from the earth is measured. Since ancient times,
astronomers have used statistical methods such as the method of least squares to find the
movements of stars.
Applications of Statistics:
State Administration :
Economics:
Economics is about allocating limited resources among unlimited ends in the most
optimal manner. Statistics offers information to answer some basic questions in economics –
What to produce?
How to produce?
For whom to produce?
Economic Planning:
Economic planning is an important aspect of a country. For effective economic
planning, the authorities require information regarding different components of the
economy. This allows them to plan for the future efficiently. Statistics helps by providing data as
well as tools to analyze the data. Some powerful techniques are index numbers, time series
analysis, and forecasting. These are immensely useful in the analysis of data in
economic planning.
Further, statistical techniques help in framing planning models too. In India, the five-
year plans extensively use statistical tools.
Advantages of Statistics :
Plenty of companies naturally collect lots of data in the course of business. This is
especially true in the Internet age, when it's often possible to gather detailed information
about when customers do everything from open emails to access particular items on a
company website. The role of statistics in business is in evaluating all of this information
to determine what it says about the company's operations and strategy.
In Data Collection :
Collecting data to use in statistics, or summarizing the data, is only an advantage
in business if a manager uses a logical approach and collects and reports data in an
ethical manner. For example, he might use statistics to determine if sales levels the
company achieved for the last few products launched were even close to projected sales
levels. He might decide that the least-performing product needs extra investment or
perhaps the company should shift resources from that product to a new product.
In some cases, it might be necessary to anonymize customer data or strip out
unimportant confidential parts to reduce the risk of a data breach or abuses by
employees or data consultants. Privacy laws also increasingly govern how companies
can use or store personal data, so it's important to make sure your business follows the
rules in jurisdictions where it's active.
Limitations of Statistics:
(1) Statistical laws are true only on average. Statistics are aggregates of facts, so a single
observation is not a statistic. Statistics deals with groups and aggregates only.
(2) Statistical methods are best applicable to quantitative data.
(3) Statistics cannot be applied to heterogeneous data.
(4) If sufficient care is not exercised in collecting, analyzing and interpreting the data,
statistical results might be misleading.
(5) Only a person who has an expert knowledge of statistics can handle statistical data
efficiently.
(6) Some errors are possible in statistical decisions. In particular, inferential statistics
involves certain errors. We do not know whether an error has been committed or not.
Sources of data :
Types of Data
A) Primary Data
Secondary Sources:
A secondary source interprets and analyzes primary sources. These sources are one or
more steps removed from the event. Secondary sources may contain pictures, quotes or
graphics of primary sources.
From a statistical point of view, the term ‘universe’ refers to the total of the items or units in
any field of inquiry, whereas the term ‘population’ refers to the total of items about which
information is desired. The attributes that are the object of study are referred to as
characteristics, and the units possessing them are called elementary units. The
aggregate of such units is generally described as the population. Thus, all units in any field of
inquiry constitute the universe, and all elementary units (on the basis of one characteristic or
more) constitute the population. Quite often, we do not find any difference between population
and universe, and as such the two terms are taken as interchangeable. However, a
researcher must necessarily define these terms precisely.
The population or universe can be finite or infinite. The population is said to be finite if it
consists of a fixed number of elements so that it is possible to enumerate it in its totality.
For instance, the population of a city, the number of workers in a factory are examples of
finite populations. The symbol ‘N’ is generally used to indicate how many elements (or
items) are there in case of a finite population. An infinite population is that population in
which it is theoretically impossible to observe all the elements. Thus, in an infinite
population the number of items is infinite i.e., we cannot have any idea about the total
number of items. The number of stars in a sky, possible rolls of a pair of dice are examples
of infinite population. One should remember that no truly infinite population of physical
objects does actually exist in spite of the fact that many such populations appear to be
very large. From a practical consideration, we then use the term infinite population for a
population that cannot be enumerated in a reasonable period of time. This way we use the
theoretical concept of infinite population as an approximation of a very large finite
population.
Sample:
A finite subset of the population, selected from it with the objective of investigating its
properties, is called a sample, and the number of units in the sample is known as the
sample size.
Concept of Sampling:
Although the scientific development of the theory of sampling has taken place
only during the last few decades, the idea of sampling is very old. From times immemorial,
people have been using it without knowing that some scientific procedure has been used
in arriving at the conclusion. On inspecting the sample of a particular stuff, we arrive at a
conclusion about accepting or rejecting it. For example, the consumer examines only a
handful of the rice, pulses or any commodity in a shop to assess its quality and then
decides to buy it or not. The housewife usually tastes a spoonful of a cooked dish
to ascertain whether it is properly cooked and whether it contains the proper quantity of salt or
sugar. The consumer ascertains the quality of the grapes by tasting one or two from the
seller’s basket. The intelligence of the individuals in a subject is estimated by the university
by giving them a 3-hour test. A businessman orders products after examining only a sample of them.
The error involved in approximations about the population characteristics on the basis of
the sample is known as sampling error and is inherent and unavoidable in any sampling
scheme.
Population :
In any Statistical investigation the interest usually lies in studying the various
characteristics relating to items or individuals belonging to a particular group. This group of
individuals under study is known as the population or universe. For example, if an enquiry
is intended to determine the average per capita income of the people in a particular city,
the population will comprise all the earning people in the city. On the other hand if we want
to study the expenditure habits of the families in that city, then the population will consist of
all the households in that city. Further, if we want to study the quality of the
manufactured product in an industrial concern during the day, then the population will
consist of the day’s total production.
On the other hand, if the population does not consist of concrete objects, then it
is called a hypothetical population. For instance, the populations of the throws of a die or a
coin thrown an infinite number of times are hypothetical populations.
Types of Sampling :
If the unit selected in any draw is not replaced in the population before making
the next draw, then it is known as simple random sampling without replacement and if it is
replaced back before making the next draw, then the sampling plan is called simple
random sampling with replacement . Thus, simple random sampling with replacement
always amounts to sampling from an infinite population, even though the population is
finite.
The example in which the names of 25 employees out of 250 are chosen out of
a hat is an example of the lottery method at work. Each of the 250 employees would be
assigned a number between 1 and 250, after which 25 of those numbers would be chosen
at random.
Because individuals who make up the subset of the larger group are chosen at
random, each individual in the large population set has the same probability of being
selected. This creates, in most cases, a balanced subset that carries the greatest potential
for representing the larger group as a whole, free from any bias.
For larger populations, a manual lottery method can be quite onerous. Selecting
a random sample from a large population usually requires a computer-generated process,
by which the same methodology as the lottery method is used, only the number
assignments and subsequent selections are performed by computers, not humans.
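A computer-generated version of the lottery method might look like the sketch below. It assumes the 250 employees are simply numbered 1 to 250. The function names and the fixed seed are illustrative choices and do not come from the text.

```python
import random

def lottery_sample(population_size, sample_size, seed=None):
    """Simple random sampling WITHOUT replacement from units numbered 1..population_size."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    units = range(1, population_size + 1)
    return rng.sample(units, sample_size)  # each unit can be drawn at most once

def lottery_sample_with_replacement(population_size, sample_size, seed=None):
    """Simple random sampling WITH replacement: a unit may be drawn more than once."""
    rng = random.Random(seed)
    return [rng.randint(1, population_size) for _ in range(sample_size)]

# 25 employees chosen out of 250, as in the example in the text.
chosen = lottery_sample(250, 25, seed=1)
chosen_with_replacement = lottery_sample_with_replacement(250, 25, seed=1)
print(sorted(chosen))
```

Sampling without replacement never returns the same number twice, whereas the with-replacement version can.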
2.Unlike more complicated sampling methods, such as stratified random sampling and
probability sampling, no need exists to divide the population into sub-populations or take
any other additional steps before selecting members of the population at random.
2.Other disadvantages include the fact that for sampling from large populations, the
process can be time-consuming and costly compared to other methods.
Stratified random sampling is also called proportional random sampling or quota random
sampling.
Now assume that the team looks at the different attributes of the sample
participants and wonders if there are any differences in GPAs and students’ majors.
Suppose it finds that 560 students are English majors, 1,135 are science majors, 800 are
computer science majors, 1,090 are engineering majors, and 415 are math majors. The
team wants to use a proportional stratified random sample, in which the size of each
stratum in the sample is proportional to the size of that stratum in the population.
The team then needs to confirm that the stratum of the population is in
proportion to the stratum in the sample; however, they find the proportions are not equal.
The team then needs to re-sample 4,000 students from the population and randomly
select 480 English, 1,120 science, 960 computer science, 840 engineering, and 600
mathematics students. With those, it has a proportionate stratified random sample of
college students, which provides a better representation of students' college majors in the
U.S. The researchers can then highlight a specific stratum, observe the varying studies of
U.S. college students, and observe the various grade point averages.
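The arithmetic behind this example can be checked with a short sketch. It assumes that the re-sampled stratum sizes (480, 1,120, 960, 840, 600) mirror the population proportions, which is what a proportionate sample implies. The names `shares` and `proportional_allocation` are illustrative.

```python
# Counts by major in the first simple random sample of 4,000 students (from the text).
initial_sample = {"English": 560, "Science": 1135, "Computer science": 800,
                  "Engineering": 1090, "Math": 415}

# Stratum sizes of the proportionate re-sample (from the text).
proportional_sample = {"English": 480, "Science": 1120, "Computer science": 960,
                       "Engineering": 840, "Math": 600}

def shares(counts):
    """Return each stratum's share of the total."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

def proportional_allocation(population_shares, sample_size):
    """Allocate a sample across strata in proportion to the population shares."""
    return {name: round(share * sample_size) for name, share in population_shares.items()}

# Assumption: the re-sample reflects the population shares (12%, 28%, 24%, 21%, 15%).
population_shares = shares(proportional_sample)
allocation = proportional_allocation(population_shares, 4000)

for name in initial_sample:
    print(name, initial_sample[name], allocation[name])
```

Comparing the two columns shows why the first sample was rejected: for example, it held 560 English majors where a proportionate sample of 4,000 calls for 480.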
1. The main advantage of stratified random sampling is that it captures key population
characteristics in the sample. Similar to a weighted average, this method of sampling
produces characteristics in the sample that are proportional to the overall population.
2.Stratified random sampling works well for populations with a variety of attributes but is
otherwise ineffective if subgroups cannot be formed.
1.Unfortunately, this method of research cannot be used in every study. The method's
disadvantage is that several conditions must be met for it to be used properly.
2.Researchers must identify every member of a population being studied and classify each
of them into one, and only one, subpopulation. As a result, stratified random sampling
is disadvantageous when researchers can't confidently classify every member of the
population into a subgroup. Also, finding an exhaustive and definitive list of an
entire population can be challenging.
3.Overlapping can be an issue if there are subjects that fall into multiple subgroups. When
simple random sampling is performed, those who are in multiple subgroups are more likely
to be chosen. The result could be a misrepresentation or inaccurate reflection of the
population.
4. The sorting process becomes more difficult, rendering stratified random sampling an
ineffective and less than ideal method.
Cluster Sampling:
Cluster sampling refers to a type of sampling method. With cluster sampling, the
researcher divides the population into separate groups, called clusters. Then, a simple
random sample of clusters is selected from the population. The researcher conducts his
analysis on data from the sampled clusters.
Assuming the sample size is constant across sampling methods, cluster sampling
generally provides less precision than either simple random sampling or stratified
sampling. This is the main disadvantage of cluster sampling.
Given this disadvantage, it is natural to ask: Why use cluster sampling?
Sometimes, the cost per sample point is less for cluster sampling than for other sampling
methods. Given a fixed budget, the researcher may be able to use a bigger sample with
cluster sampling than with the other methods. When the increased sample size is sufficient
to offset the loss in precision, cluster sampling may be the best choice.
Multistage Sampling:
1.It is not as accurate as Simple Random Sample if the sample is the same size.
2.It is difficult to go for more testing .
Quota Sampling:
The data collected for the first time is raw data, so it is arranged in a haphazard
manner that does not provide a clear picture. The classification of data reduces the
large volume of raw data into homogeneous groups, i.e. data having common
characteristics or nature are placed in one group, and thus the whole data is divided
into a number of groups. There are four types of classification:
Tabulation
Tabulation refers to a logical data presentation, wherein raw data is summarized and
displayed in a compact form, i.e. in statistical tables. In other words, it is a systematic
arrangement of data in columns and rows, that represents data in concise and attractive
way. One should follow the given guidelines for tabulation.
A serial number should be allotted to the table, in addition to the self explanatory
title.
The statistical table is required to be divided into four parts, i.e. Box head, Stub,
Caption and Body. The complete upper part of the table that contains columns and
sub-columns, along with caption, is the Box Head. The left part of the table, giving
description of rows is called stub. The part of table that contains numerical figures
and other content is its body.
Length and Width of the table should be perfectly balanced.
Presentation of data should be such that it takes less time and labor to make
comparison between various figures.
Footnotes, explaining the source of data or any other thing, are to be presented at
the bottom of the table.
2.Stability : To make the data suitable for comparison and to meaningfully compare the
results, it is necessary that the classification has stability.
6.Homogeneity : Units of each class should be homogeneous. All the units (data-items)
included in a class or group should be present according to the property on basis of which
the classification was done.
Types of classification:
Geographical Classification
Under this type of classification, the data are classified on the basis of area or place, and
as such, this type of classification is also known as areal or spatial classification. The
areas may be in terms of countries, states, districts, or zones, according to how the data are
distributed.
For the purpose of ready reference and ranking, the different classes formed under the
classification should be arranged in alphabetical order or in order of the size of their frequencies,
respectively. Generally, in case of reference tables, alphabetical arrangements are made,
while in case of summary tables, ranking arrangements are made.
However, this type of classification is suitable for those data which are distributed
geographically relating to a phenomenon viz. population, mineral resources, production,
sales, students of universities etc.
Chronological Classification
Under this type of classification, the data collected are classified on the basis of the time of
their occurrence. As such, the series obtained under this classification is known as
a time series. This type of classification is suitable for those data which occur over the
course of time, viz. population, production, sales, results, etc. The different classes obtained
under this classification are arranged in order of time, which may begin either with the
earliest or with the latest period.
Qualitative Classification
Under this type of classification, the data obtained are classified on the basis of certain
descriptive character or qualitative aspect of a phenomenon viz. sex, beauty, literacy,
honesty, intelligence, religion, eye-sight etc.
As such, this sort of classification is also known as ‘descriptive classification’.
Classifications of this type are usually dichotomous in nature: the whole data are
divided into two groups, viz. a group with the attribute and a group without it, such as blind and
not-blind, or deaf and not-deaf, etc.
Quantitative Classification
Under this type of classification, the collected data are classified on the basis of a certain
variable, viz. marks, income, expenditure, profit, loss, height, weight, age, price, production,
etc., which is capable of quantitative measurement. This type of classification is also known as
‘classification by variables’.
The frequency is the number of times a particular data point occurs in the set of
data. A frequency distribution is a table that lists each data point and its frequency. The
relative frequency is the frequency of a data point expressed as a percentage of the total
number of data points.
Example 1.Find the frequency of data items 1 and 6 in the following list.
1, 3, 6, 4, 5, 6, 3, 4, 6, 3, 6 .
Ans : Frequency of the data point 1 is 1 and the frequency of the data point 6 is 4 .
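The count in Example 1 can be reproduced with a short sketch using Python's `collections.Counter`. The variable names are illustrative.

```python
from collections import Counter

# Data list from Example 1.
data = [1, 3, 6, 4, 5, 6, 3, 4, 6, 3, 6]

# Frequency: number of times each data point occurs.
frequency = Counter(data)

# Relative frequency: frequency expressed as a percentage of the number of data points.
relative_frequency = {value: 100 * count / len(data) for value, count in frequency.items()}

for value in sorted(frequency):
    print(value, frequency[value], round(relative_frequency[value], 1))
```

Here `frequency[1]` is 1 and `frequency[6]` is 4, which matches the answer in the text.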
When the data items are few in number, an ungrouped frequency distribution is
prepared.
Ans: Here the number of family members varies from 2 to 8, and hence an ungrouped
frequency distribution can be prepared.
When the data items are large in number and the range is high,
a grouped frequency distribution table is prepared. In this case the data items are
grouped into classes, and each class has a lower limit and an upper limit. Generally the
classes are formed in such a way that the number of classes is neither too large
nor too small. About 5 to 12 classes is considered ideal.
Example : Prepare grouped frequency distribution table for the following data showing
marks obtained by the students.
25 45 56 47 22 45 56 78 89 45 46 45 52
12 13 09 15 07 56 58 54 56 57 42 28 56
23 45 51 01 26 55 66 54 77 38
Solution: Here the largest number is 89 and smallest number is 01. Therefore
Range= 89-01=88
As the range is large, we should prepare a grouped frequency distribution table. The
classes are 00-10, 10-20, 20-30, and so on up to 80-90. This gives 9 classes, which is
good enough.
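The grouping step for this example can be sketched as follows. It assumes each class includes its lower limit and excludes its upper limit (so a mark of 10 would fall in 10-20). The text does not state this convention explicitly.

```python
# Marks from the example (36 students).
marks = [25, 45, 56, 47, 22, 45, 56, 78, 89, 45, 46, 45, 52,
         12, 13, 9, 15, 7, 56, 58, 54, 56, 57, 42, 28, 56,
         23, 45, 51, 1, 26, 55, 66, 54, 77, 38]

class_width = 10
number_of_classes = 9  # classes 00-10, 10-20, ..., 80-90

# Tally each mark into its class; lower limit inclusive, upper limit exclusive (assumption).
counts = [0] * number_of_classes
for mark in marks:
    counts[mark // class_width] += 1

# Build the grouped frequency table as (class label, frequency) pairs.
table = [(f"{i * class_width:02d}-{(i + 1) * class_width:02d}", count)
         for i, count in enumerate(counts)]

for label, count in table:
    print(label, count)
```

The frequencies should add up to the 36 marks listed, which is a quick check that no value was dropped or counted twice.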
Data can also be represented graphically. This is useful for drawing quick conclusions
regarding data trends.
There are many methods of data representation with diagrams and graphs.
Bar diagrams
Example : In a firm ,the percentage of monthly salary saved by each employee is given in
the following table. Represent it through a bar graph.
Savings (in percentage):            10   20   30   40   50   60
Number of employees (frequency):   100  110  130  120  140  120
Ans:
[Bar graph: horizontal axis shows savings (in percentage) from 10 to 60; vertical axis shows number of employees, marked from 20 to 140.]
Pie chart:
A pie chart is a type of graph that represents the data in the circular graph. The
slices of pie show the relative size of the data. It is a type of pictorial representation of
data. A pie chart requires a list of categorical variables and the numerical variables. Here,
the term “pie” represents the whole, and the “slices” represents the parts of the whole.
The “pie chart” is also known as a “circle chart”. It divides a circular statistical
graphic into sectors or slices in order to illustrate numerical proportions. Each sector
denotes a proportionate part of the whole. A pie chart works best for showing the
composition of something. In most cases, pie charts can replace other
graphs such as bar graphs, line plots, histograms, etc.
The entire circle spans 360 degrees. Each data value is converted into an angle,
(value / total) × 360 degrees, and then the pie chart is drawn.
Ans: First we have to find the total expenditure, which is not given here.
Then we find the angle for each head, as shown in the following table. The first two angles are
calculated.
Food
Education
Bills
Loan
Rent
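The conversion of values into angles can be sketched as follows. The expenditure amounts below are hypothetical, chosen only to illustrate the calculation, because the original figures for this example are not available here. Only the five heads come from the text.

```python
# Hypothetical monthly expenditures (NOT from the source; for illustration only).
expenditure = {"Food": 4000, "Education": 2000, "Bills": 1000, "Loan": 1500, "Rent": 1500}

# Step 1: find the total expenditure.
total = sum(expenditure.values())

# Step 2: convert each value into an angle out of 360 degrees.
angles = {head: amount * 360 / total for head, amount in expenditure.items()}

for head, angle in angles.items():
    print(f"{head}: {angle:.1f} degrees")
```

Whatever the actual amounts are, the angles must add up to 360 degrees, which is a useful check on the table.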
Histogram:
Ans: Here the classes are continuous and hence we can construct histogram directly.
[Histogram: horizontal axis shows class boundaries from 0 to 90; vertical axis shows frequency, marked from 10 to 70.]
Frequency polygon :
A frequency polygon is a graph constructed by using lines to join the midpoints of each interval, or
bin. The heights of the points represent the frequencies. A frequency polygon can be created from
the histogram or by calculating the midpoints of the intervals from the frequency distribution
table.
We can draw a histogram and then after joining the midpoints of heads of each bar in consecutive
manner, we get frequency polygon.
Ans: Here the classes are continuous, and hence we can construct the histogram as shown in the above
problem. Now we join the midpoints of the tops of the bars of the histogram.
[Frequency polygon: horizontal axis shows class boundaries from 0 to 100; vertical axis shows frequency, marked from 10 to 70.]
Ogive curves:
In order to draw ogive curves, the data should be in the form of a continuous distribution. We have
to find the cumulative frequencies, either of the "less than upper class limit" type or of the "more than lower class limit" type.
We then plot the points (x, y), where x is the upper class limit for a less-than ogive (or the lower class limit for a
more-than ogive) and y is the cumulative frequency of that class.
Example: Draw less than upper class ogive curve from the following data.
Solution: Here the classes are continuous, and hence we can proceed to draw the ogive curve.
Points Table.
[Less-than ogive: horizontal axis shows upper class limits from 10 to 100; vertical axis shows cumulative frequency, marked from 0 to 330.]
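The plotting points for a less-than ogive can be computed with a short sketch. The class frequencies below are hypothetical, because the data table for this example is not available here. The sketch uses the upper class limit as the x value, which is the usual convention for a less-than ogive.

```python
from itertools import accumulate

# Hypothetical continuous classes 0-10, 10-20, ..., 40-50 (NOT from the source).
upper_limits = [10, 20, 30, 40, 50]
frequencies = [4, 9, 15, 8, 4]

# Less-than cumulative frequency: running total of the class frequencies.
less_than_cf = list(accumulate(frequencies))

# Points (x, y) = (upper class limit, cumulative frequency) to plot and join.
points = list(zip(upper_limits, less_than_cf))
print(points)
```

The last cumulative frequency always equals the total number of observations, so the curve ends at the height N.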
Unit 2
An average represents all the features of a group; hence the results about the whole group
can be deduced from it.
(ii) Brief description:
An average gives us simple and brief description of the main features of the whole data.
Statistical Averages –
1. Arithmetic mean:
The arithmetic mean is the most commonly used and readily understood measure
of central tendency in a data set. In statistics, the term average refers to any of the
measures of central tendency. The arithmetic mean of a set of observed data is defined as
being equal to the sum of the numerical values of each and every observation, divided by
the total number of observations. Symbolically, if we have a data set consisting of the
values $x_1, x_2, x_3, \ldots, x_n$, then the arithmetic mean, denoted as $\bar{X}$, is defined by the formula:
$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Weight in kgs: 67 45 78 89 98 89 77
67 78 45 65 56 68 45 49
9.16 8.89 7.87 9.23 5.89 5.45 4.56 9.25 9.90 5.00
Ans:
Mean = $\frac{75.20}{10}$
Mean = 7.52 grades
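The mean of a simple list of values can be checked with a short sketch. The variable names are illustrative.

```python
# The ten grade values listed in the example.
grades = [9.16, 8.89, 7.87, 9.23, 5.89, 5.45, 4.56, 9.25, 9.90, 5.00]

# Arithmetic mean = sum of observations / number of observations.
total = sum(grades)
mean = total / len(grades)

print(round(total, 2), round(mean, 2))
```

Summing the ten values as printed gives 75.20, so the mean comes to 7.52.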
Type II Problems: When the data is given in the form of x and f, i.e. a discrete distribution.
Mean = $\frac{\sum f_i X_i}{\sum f_i}$
Q 1. Find the mean of the following data.
X 11 13 15 17 19 20 21 24
f 4 7 10 18 15 11 8 7
Solution :
X f fX
11 4 44
13 7 91
15 10 150
17 18 306
19 15 285
20 11 220
21 8 168
24 7 168
∑ f =80 ∑ f X=1432
By using
Mean = $\frac{\sum fX}{\sum f}$
Mean = $\frac{1432}{80}$
Mean = 17.9
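The fX column and the mean for this discrete distribution can be reproduced with a short sketch. The variable names are illustrative.

```python
# Values (X) and frequencies (f) from Q 1.
x = [11, 13, 15, 17, 19, 20, 21, 24]
f = [4, 7, 10, 18, 15, 11, 8, 7]

# fX column: each value multiplied by its frequency.
fx = [fi * xi for fi, xi in zip(f, x)]

sum_f = sum(f)
sum_fx = sum(fx)

# Mean = (sum of fX) / (sum of f).
mean = sum_fx / sum_f
print(sum_f, sum_fx, mean)
```

The totals agree with the table: the frequencies sum to 80 and the fX column sums to 1432.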
Height in cms:     151  152  154  156  157  158  160  161  163  164
No. of students:    05   07   08   14   18   22   15   08   02   01
Mean = $\frac{\sum fX}{\sum f} = \frac{15715}{100}$ = 157.15 cms
No of viruses 1 2 3 4 5 6 7 8 9 10
No of PCs infected 0 12 90 12 23 45 26 86 9 5
Solution :
No of viruses(X) No of PCs infected (f) fX
1 0 0
2 12 24
3 90 270
4 12 48
5 23 115
6 45 270
7 26 182
8 86 688
9 9 81
10 5 50
∑ f =308 ∑ fX =1728
Mean = $\frac{\sum fX}{\sum f} = \frac{1728}{308}$ = 5.61 viruses
Wages in Rs:       210  220  230  240  250  260  270  280  290  300
No. of workers:      5   19   38   57   76   42   36   21   12   04
(Ans: Rs. 251.13)
Type III data: When the data has a continuous distribution (when classes and
frequencies are given).
Steps:
Step 1: Find the midpoint of each class, Xm.
Step 2: Find the value of f Xm for each class.
Step 3: Mean = $\frac{\sum f X_m}{\sum f}$
Solved Examples :
Q 1 . Find mean marks from the following data
Marks:             0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of students:     05     13     19     34     32     28     18     07
Solution:
Mean = $\frac{\sum f X_m}{\sum f} = \frac{6560}{156}$ = 42.05 marks
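The three steps for a continuous distribution can be sketched as follows, using the marks data from Q 1. The variable names are illustrative.

```python
# Class limits and frequencies from Q 1.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
f = [5, 13, 19, 34, 32, 28, 18, 7]

# Step 1: midpoint of each class, Xm = (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Step 2: f * Xm for each class.
f_xm = [fi * xm for fi, xm in zip(f, midpoints)]

# Step 3: Mean = (sum of f * Xm) / (sum of f).
sum_f = sum(f)
sum_f_xm = sum(f_xm)
mean = sum_f_xm / sum_f

print(sum_f, sum_f_xm, round(mean, 2))
```

The totals match the worked solution: the frequencies sum to 156 and the f Xm column sums to 6560.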
Q 2. Find mean salary paid to the employees from the following data.
Salary in Rs. lac:   2-5  5-10  10-12  12-14  14-18  18-24  24-26  26-30  30-40
No. of employees:     22    36     48     88    102     45     19      6      2
Solution:
Mean = $\frac{\sum f X_m}{\sum f} = \frac{5309}{368}$ = Rs. 14.43 lac
Solution:
Mean = $\frac{\sum f X_m}{\sum f} = \frac{34695}{133}$ = Rs. 260.86
Height less than:   150  154  158  162  166  170  174  178  182
No. of students:     05   25   38   63   98  130  148  159  160
Solution: This data is of type III, because "height less than" (cumulative) values are given here. We first convert it
into the standard type III format and then solve it.
Mean = $\frac{\sum f X_m}{\sum f} = \frac{26136}{160}$ = 163.35 units
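The conversion from "less than" cumulative figures to the standard class format can be sketched as follows. It assumes a class width of 4 with the first class running from 146 to 150. The text does not state this, but it reproduces the total of 26136 used in the solution.

```python
# "Height less than" limits and cumulative numbers of students from the example.
upper_limits = [150, 154, 158, 162, 166, 170, 174, 178, 182]
cumulative = [5, 25, 38, 63, 98, 130, 148, 159, 160]

class_width = 4  # assumption: equal classes 146-150, 150-154, ..., 178-182

# Class frequency = difference between successive cumulative frequencies.
frequencies = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

# Midpoint of each class = upper limit minus half the class width.
midpoints = [upper - class_width / 2 for upper in upper_limits]

sum_f = sum(frequencies)
sum_f_xm = sum(fi * xm for fi, xm in zip(frequencies, midpoints))
mean = sum_f_xm / sum_f

print(frequencies, sum_f_xm, round(mean, 2))
```

Once the class frequencies are recovered, the calculation is the same as for any other type III problem.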
2.Median :
Median is the size of middlemost data item when the data items are arranged in
ascending or descending order.
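a) When N is odd, then
Median = the size of the data item at the $\left(\frac{N+1}{2}\right)$th position in the list created in step 1.
b) When N is even, then
Median = the average of the sizes of the data items at the $\left(\frac{N}{2}\right)$th and $\left(\frac{N}{2}+1\right)$th positions in the list created
in step 1.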
Solved Examples :
233 231 245 321 211 322 268 254 201 204 206 289
Solution :
201 204 206 211 231 233 245 254 268 289 321 322
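Step 3: Median = $\frac{\left(\frac{N}{2}\right)\text{th item} + \left(\frac{N}{2}+1\right)\text{th item}}{2}$
= $\frac{\left(\frac{12}{2}\right)\text{th item} + \left(\frac{12}{2}+1\right)\text{th item}}{2}$
= $\frac{6\text{th item} + 7\text{th item}}{2}$
= $\frac{233 + 245}{2}$
= $\frac{478}{2}$
= 239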
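The median of a plain list can be sketched as follows, covering both the odd and the even case. The function name `median_of_list` is illustrative.

```python
def median_of_list(values):
    """Median of a list: the middle item (odd N) or the mean of the two middle items (even N)."""
    ordered = sorted(values)      # step 1: arrange in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]    # the ((N + 1) / 2)th item, using 0-based indexing
    # Average of the (N/2)th and (N/2 + 1)th items.
    return (ordered[middle - 1] + ordered[middle]) / 2

# Data from the solved example (N = 12, an even number of items).
data = [233, 231, 245, 321, 211, 322, 268, 254, 201, 204, 206, 289]
median = median_of_list(data)
print(median)
```

For this data the 6th and 7th items of the ordered list are 233 and 245, so the function returns 239.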
Step 2: Median = The size of data item corresponding to N/2 th item in c f column.
X 12 13 14 15 16 17 18 19
f 05 24 45 65 44 33 21 04
Ans :
X f cf
12 05 05
13 24 29
14 45 74
15 65 139
16 44 183
17 33 216
18 21 237
19 04 241
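N = ∑f = 241, so N/2 = 120.5
The first cumulative frequency that reaches 120.5 is 139, which corresponds to X = 15.
Median = 15 units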
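The cumulative-frequency lookup for a discrete distribution can be sketched as follows. It interprets "the item corresponding to N/2 in the cf column" as the first value whose cumulative frequency is at least N/2, which is the rule the worked answer appears to follow.

```python
from itertools import accumulate

# Values (X) and frequencies (f) from the example.
x = [12, 13, 14, 15, 16, 17, 18, 19]
f = [5, 24, 45, 65, 44, 33, 21, 4]

# Step 1: cumulative frequencies (cf column).
cf = list(accumulate(f))
n = cf[-1]

# Step 2: first value of X whose cumulative frequency reaches N/2.
median = next(xi for xi, c in zip(x, cf) if c >= n / 2)

print(cf, n, median)
```

With N = 241, N/2 is 120.5; the cf column first reaches this at 139, so the median is 15.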
=157 cms
No of viruses 1 2 3 4 5 6 7 8 9 10
No of PCs infected 0 12 90 12 23 45 26 86 9 5
Type III: Continuous distribution, i.e. when classes and frequencies are given.
Steps:
1.Find cf.
2.Find the value of N/2 and locate median class corresponding to N/2 th value in
c.f. column.
3. Median = $L + \frac{\frac{N}{2} - p.c.f.}{f} \times i$
Where L = lower limit of the median class
N = ∑f
p.c.f. = cumulative frequency of the class preceding the median class
f = frequency of the median class
i = class interval
Example:
1.Find median from the following data.
Ans:
Median class = 30-40
Median = 31.17
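The median formula for a continuous distribution can be sketched as follows. The data table for the example above is not available here, so the sketch reuses the marks distribution from the mean section (classes 0-10 to 70-80) as an illustration. The function name `grouped_median` is illustrative.

```python
def grouped_median(lower_limits, frequencies, class_width):
    """Median = L + ((N/2 - p.c.f.) / f) * i for equal-width continuous classes."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the distribution has no observations")
    half = n / 2
    preceding_cf = 0  # cumulative frequency before the current class (p.c.f.)
    for lower, f in zip(lower_limits, frequencies):
        if preceding_cf + f >= half:
            # This is the median class: its cumulative frequency reaches N/2.
            return lower + (half - preceding_cf) / f * class_width
        preceding_cf += f
    raise ValueError("median class not found")

# Marks distribution reused from the mean example (an illustrative choice).
lower_limits = [0, 10, 20, 30, 40, 50, 60, 70]
frequencies = [5, 13, 19, 34, 32, 28, 18, 7]
median = grouped_median(lower_limits, frequencies, class_width=10)
print(median)
```

For this data N/2 = 78, the median class is 40-50 with p.c.f. = 71 and f = 32, so the median is 40 + (7/32) × 10 = 42.1875.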
3.Mode:
Type I Data: When the data items are listed in the form of list
Steps :
Example :
12 34 23 43 12 32 45 34 56 34 21 12 67
34 45 70 23 42 34
Ans :
Ascending order :
12 12 12 21 23 23 32 34 34 34 34 34 42
43 45 45 56 67 70
Mode =34
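The mode of a plain list can be found by counting occurrences, as in the sketch below. Returning every value that shares the highest count is an illustrative choice that also exposes ties.

```python
from collections import Counter

# Data from the example.
data = [12, 34, 23, 43, 12, 32, 45, 34, 56, 34, 21, 12, 67,
        34, 45, 70, 23, 42, 34]

# Count how many times each value occurs.
counts = Counter(data)
highest = max(counts.values())

# All values that occur the maximum number of times (more than one means a tie).
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)
```

Here 34 occurs 5 times, more than any other value, so the list of modes contains only 34.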
Ans: 213.5 is repeated 5 times and 212.5 is also repeated 5 times.
Hence Mode = $\frac{212.5 + 213.5}{2} = \frac{426}{2}$ = 213.0
Hence mode = 213
Example :
X 10 11 12 13 14 15 16 17 18
f 05 10 19 25 26 19 10 8 2
X 12 13 14 15 16 17 18 19
f 05 24 45 65 44 33 21 04
Mode = 15
Wages in Rs:       210  215  220  230  245  250  255  260  270
No. of workers:     05   29   35   48   48   32   19   12    3
Ans: Here wages 230 and 245 occur most often (48 times each).
Mode = $\frac{230 + 245}{2} = \frac{475}{2}$ = Rs. 237.5
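Type III : When the data is in continuous distribution i.e. when classes and frequencies are given.
Steps :
1.Find the largest frequency and note it as f1. Take the class corresponding to it as the modal class. Find L, the lower limit of the modal class.
2.Note the frequency of the class preceding the modal class as f0 and the frequency of the class following it as f2.
3.Find i, the class interval of the modal class.
4. $\text{Mode} = L + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$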
Example
Salary in Rs’000   10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
No of employees    03     16     28     39=f0  58=f1  42=f2  30     16     4
L=50
i=60-50=10
$\text{Mode} = L + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 50 + \dfrac{58-39}{2 \times 58 - 39 - 42} \times 10 = 50 + \dfrac{19}{35} \times 10 = 50 + \dfrac{190}{35} = 55.43$
Height           130-135  135-140  140-145  145-150  150-155  155-160  160-165  165-170  170-175
No of students   02       25       46=f0    82=f1    81=f2    70       52       16       5
Modal class=145-150
L=145 , i=150-145=5
f0=46 , f1=82 , f2=81
$\text{Mode} = L + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 145 + \dfrac{82-46}{2 \times 82 - 46 - 81} \times 5 = 145 + \dfrac{36}{37} \times 5 = 145 + \dfrac{180}{37} = 149.86$
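As a quick check of the grouped-mode formula, the following Python sketch evaluates it for the two worked examples. The function name `grouped_mode` and its parameter names are illustrative assumptions, not notation from the notes.

```python
def grouped_mode(lower, width, f0, f1, f2):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * i for a continuous distribution."""
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * width

# Salary example: modal class 50-60
mode_salary = grouped_mode(lower=50, width=10, f0=39, f1=58, f2=42)
# Height example: modal class 145-150
mode_height = grouped_mode(lower=145, width=5, f0=46, f1=82, f2=81)

print(round(mode_salary, 2), round(mode_height, 2))
```

The formula is undefined when 2*f1 - f0 - f2 is zero, i.e. when the modal class and both neighbours have equal frequencies; the sketch does not guard against that case.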
Merits:
Demerits of mean :
Merits of median :
3) It can also be computed in case of frequency distribution with open ended classes.
6) It is proper average for qualitative data where items are not measured but are scored.
7)It is only suitable average when the data are qualitative & it is possible to rank various
items according to qualitative characteristics.
Demerits of median :
Merits of Mode :
Demerits of mode :
1) It is ill defined. It is not always possible to find clearly defined mode. In some cases, we
may come across distributions with two modes. Such distributions are called Bimodal. If a
distribution has more than two modes, it is said to be Multimodal.
2) It is not based upon all the observation.
3) Mode can be calculated by various formulae as such the value may differ from one to
other. Therefore, it is not rigidly defined.
Measures of Dispersion
1.Range:
The difference between the largest value and smallest value is called as range.
Range = L-S
Example: Find the mean and range from the following data:
25 65 87 49 28 89 90 80 87 54
Solution:
Mean = $\dfrac{654}{10}$ = 65.4 units
Range=L-S
=90-25
=65
Ex2: Find range from the following data.Also find the coefficient of range
212.2 231.5 203.5 245.5 233.4 289.0
Solution:
L=289.0 S=201.4
Range= L-S
= 289.0 – 201.4
=87.6
$\text{Coefficient of Range} = \dfrac{L-S}{L+S} = \dfrac{289.0-201.4}{289.0+201.4} = \dfrac{87.6}{490.4} = 0.1786$
Type II: When data is in the form of discrete distribution i.e. x and f are given.
Example: Find range and coefficient of range from the following data.
x 11 12 13 14 15 16 17 18 19
f 04 09 20 29 38 45 24 15 7
Range= L-S
=19-11
=8
$\text{Coefficient of range} = \dfrac{L-S}{L+S} = \dfrac{19-11}{19+11} = \dfrac{8}{30} = 0.2667$
Type 3: When the data is in the form of continuous distribution i.e. when the classes and frequencies are given.
Ex. Find the range and coefficient of range from the following data
Class       0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90
Frequency   05    12     19     28     38     42     36     22     10
Solution:
Class X     0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90
Frequency   05    12     19     28     38     42     36     22     10
Xm          5     15     25     35     45     55     65     75     85
L=85, S=5
Range = L-S
= 85-5
= 80
$\text{Coefficient of range} = \dfrac{L-S}{L+S} = \dfrac{85-5}{85+5} = \dfrac{80}{90} = 0.89$
Mean Deviation :
It is the sum of the absolute differences of the actual data items from a central value (mean, median or mode) divided by the total number of items.
$M.D. = \dfrac{\sum |X_i - X_c|}{N}$
where Xc is the chosen central value.
Example 1: Find the mean deviation from the mean of the following data
23 45 76 45 78 45 64 34 67 43
Solution :
Mean = $\dfrac{520}{10}$ = 52
X     X - Xmean = X - 52    |X - Xmean|
23    23-52 = -29           29
45    45-52 = -7            7
76    76-52 = 24            24
45    45-52 = -7            7
78    78-52 = 26            26
45    45-52 = -7            7
64    64-52 = 12            12
34    34-52 = -18           18
67    67-52 = 15            15
43    43-52 = -9            9
                            ∑|X - Xmean| = 154
M.D. = $\dfrac{154}{10}$ = 15.4
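The same calculation can be reproduced with a short Python sketch. The helper name `mean_deviation` is hypothetical; by default it measures deviations from the mean, and any other central value (median or mode) can be passed in instead.

```python
def mean_deviation(data, centre=None):
    """Average absolute deviation of the data from a central value."""
    if centre is None:
        centre = sum(data) / len(data)  # default: deviation from the mean
    return sum(abs(x - centre) for x in data) / len(data)

data = [23, 45, 76, 45, 78, 45, 64, 34, 67, 43]
md = mean_deviation(data)
print(md)
```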
Type III: When the data is in the form of continuos data i.e. when classes and frequencies
are given
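$\text{Mean} = \dfrac{\sum f X_m}{\sum f}$
$\text{Median} = L + \dfrac{\frac{N}{2} - p.c.f.}{f} \times i$
$\text{Mode} = L + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
Xmean = 81492/360
= 226.37
M.D. = $\dfrac{1452}{360}$ = 4.033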
Example 2 : Find mean deviation from median and mode from the following data.
Marks             0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90
No. of students   05    13     18     22     24     26     25     16     5
Solution :
Median=47.91
Mode= 56.66
Example 3: Find mean deviation from mean and mode from the following data.
Salary in Rs Lac   5-8  8-11  11-14  14-17  17-20  20-23  23-26  26-29  29-32
No of employees    05   12    19     24     25     21     18     08     02
Solution:
$X_{mean} = \dfrac{\sum f X_m}{\sum f} = \dfrac{2392}{134} = 17.85$
$M.D. = \dfrac{\sum f\,|X_m - X_{mean}|}{\sum f} = \dfrac{630.1}{134} = 4.70$
From mode :
Modal class = 17-20
L=17, i=20-17=3 , f0=24 , f1=25 , f2=21
$\text{Mode} = L + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 17 + \dfrac{25-24}{2 \times 25 - 24 - 21} \times 3 = 17 + \dfrac{3}{5} = 17.6$
M.D. = 633.6/134
= 4.73
Standard Deviation
It is the square root of the sum of squares of the deviations of the actual values from the mean value, divided by the total number of items.
$S.D.(\sigma) = \sqrt{\dfrac{\sum (X - X_{mean})^2}{N}}$
where N=∑ f
23 45 32 45 34 28 76 45 43 39
Solution :
Mean = $\dfrac{410}{10}$ = 41
X     X-Xmean = X-41    (X-Xmean)²
23    -18               324
45    4                 16
32    -9                81
45    4                 16
34    -7                49
28    -13               169
76    35                1225
45    4                 16
43    2                 4
39    -2                4
                        ∑(X-Xmean)² = 1904
$S.D.(\sigma) = \sqrt{\dfrac{\sum (X - X_{mean})^2}{N}} = \sqrt{\dfrac{1904}{10}} = \sqrt{190.4} = 13.80$ units.
X X-Xmean (X-Xmean)2
=X-230.58
213 -17.58 309.05
213 -17.58 309.05
243 12.42 154.25
222 -8.58 73.61
240 9.42 88.73
236 5.42 29.37
234 3.42 11.69
242 11.42 130.41
245 14.42 207.93
248 17.42 303.45
219 -11.58 134.09
212 -18.58 345.21
∑X = 2767      ∑(X-Xmean)² = 2096.83
$\text{Mean} = \dfrac{\sum X}{N} = \dfrac{2767}{12} = 230.58$
$S.D.(\sigma) = \sqrt{\dfrac{\sum (X - X_{mean})^2}{N}} = \sqrt{\dfrac{2096.83}{12}} = \sqrt{174.7} = 13.22$
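$S.D.(\sigma) = \sqrt{\dfrac{\sum f (X - X_{mean})^2}{\sum f}}$
where $X_{mean} = \dfrac{\sum fX}{\sum f}$
X   1  2  3   4   5   6   7  8
f   2  9  18  16  21  14  7  3
Solution :
$\text{Mean} = \dfrac{\sum fX}{\sum f} = \dfrac{400}{90} = 4.44$
$S.D.(\sigma) = \sqrt{\dfrac{\sum f (X - X_{mean})^2}{\sum f}} = \sqrt{\dfrac{242.22}{90}} = \sqrt{2.69} = 1.64$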
Marks            0   1   2   3   4   5   6   7   8   9   10
No of students   02  08  12  20  25  26  29  20  09  04  01
Ans :
$\text{Mean} = \dfrac{\sum fX}{\sum f} = \dfrac{754}{156} = 4.83$
$S.D.(\sigma) = \sqrt{\dfrac{\sum f (X - X_{mean})^2}{\sum f}} = \sqrt{\dfrac{665.6684}{156}} = \sqrt{4.2671} = 2.066$
Type III: Continuous Distribution i.e. when classes and frequencies are given
$S.D.(\sigma) = \sqrt{\dfrac{\sum f (X_m - X_{mean})^2}{\sum f}}$
where $X_{mean} = \dfrac{\sum f X_m}{\sum f}$
Ex 1: Find standard deviation from the following data
Sales in Rs. lac   10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
No of companies    05     17     29     45     41     34     19     08     05
Solution :
X        f     Xm   fXm     Xm-52.1428   (Xm-52.1428)²   f*(Xm-52.1428)²
10-20    5     15   75      -37.1428     1379.587592     6897.937959
20-30    17    25   425     -27.1428     736.7315918     12524.43706
30-40    29    35   1015    -17.1428     293.8755918     8522.392163
40-50    45    45   2025    -7.1428      51.01959184     2295.881633
50-60    41    55   2255    2.8572       8.16359184      334.7072654
60-70    34    65   2210    12.8572      165.3075918     5620.458123
70-80    19    75   1425    22.8572      522.4515918     9926.580245
80-90    8     85   680     32.8572      1079.595592     8636.764735
90-100   5     95   475     42.8572      1836.739592     9183.697959
Total    203        10585                                63942.85714
Mean = 10585/203 = 52.1428
$S.D. = \sqrt{\dfrac{63942.85}{203}} = 17.75$
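The table above can be reproduced programmatically. The sketch below is a minimal illustration that takes class mid-points as Xm and divides by ∑f, as in the formula of this section; the variable names are assumptions made for the example.

```python
from math import sqrt

classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60),
           (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [5, 17, 29, 45, 41, 34, 19, 8, 5]

mids = [(lo + hi) / 2 for lo, hi in classes]             # Xm for each class
n = sum(freqs)                                           # sum of f
mean = sum(f * m for f, m in zip(freqs, mids)) / n       # sum(f*Xm) / sum(f)
variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / n
sd = sqrt(variance)

print(round(mean, 4), round(sd, 2))
```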
Quartiles :
1. First Quartile Q1
2. Third Quartile Q3
First Quartile Q1:
The first quartile is the value that divides the data items, arranged in ascending order, in such a way that 25% of the data items lie below it and 75% of the data items lie above it.
[Figure: data line marking Q1, Median and Q3]
Third Quartile Q3:
The third quartile is the value that divides the data items, arranged in ascending order, in such a way that 75% of the data items lie below it and 25% of the data items lie above it.
Here N=12
Q1 = Size of data item at N/4th position
= 12/4
= 3rd item
=21
Q3= Size of data item at 3N/4th position
= Size of 3*12/4
= Size of 9th item
= 25
$\text{Quartile deviation } QD = \dfrac{Q_3 - Q_1}{2}$
$\text{Coefficient of } QD = \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
211 213 213 222 231 231 234 234 235 239 243 243 245 245
N=14
Q1 = Size of data item at N/4th value
= size of data item at 14/4th value
= size of 3.5th item
= average of 3rd and 4th item
= (213+222)/2
= 435/2
= 217.5
Q3 = Size of data item at 3N/4th value
= size of data item at (3*14)/4th value
= size of 10.5th item
= average of 10th and 11th item
= (239+243)/2
= 482/2
= 241
Q.D. = (Q3 - Q1)/2
= (241 - 217.5)/2
= 23.5/2
= 11.75
Steps:
1.Find cf
2. Q1 = Size of data item at N/4th position in ascending order list.
Q3= Size of item at 3N/4th position in ascending order list.
X f cf
20 5 5
22 15 20
24 25 45
25 25 70
27 27 97
29 20 117
30 15 132
32 05 N=137
Type III- Continuous distribution i.e. when classes and frequencies are given
$Q_1 = L + \dfrac{\frac{N}{4} - p.c.f.}{f} \times i$
$Q_3 = L + \dfrac{\frac{3N}{4} - p.c.f.}{f} \times i$
Example : Find Q1,Q3 and coefficient of Q.D from the following data.
Marks            00-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
No of students   05     22     42     65     62     45     26     13     7      3
Ans :
Marks No of students cf
00-10 05 05
10-20 22 27
20-30 42 69
30-40 65 134
40-50 62 196
50-60 45 241
60-70 26 267
70-80 13 280
80-90 7 287
90-100 3 290=N
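Q1 class = 30-40 (the class in which the N/4 = 72.5th value lies)
$Q_1 = L + \dfrac{\frac{N}{4} - p.c.f.}{f} \times i = 30 + \dfrac{\frac{290}{4} - 69}{65} \times 10 = 30 + \dfrac{72.5 - 69}{65} \times 10 = 30 + \dfrac{35}{65} = 30 + 0.54 = 30.54$
Q3 class = 50-60 (the class in which the 3N/4 = 217.5th value lies)
$Q_3 = L + \dfrac{\frac{3N}{4} - p.c.f.}{f} \times i = 50 + \dfrac{\frac{3 \times 290}{4} - 196}{45} \times 10 = 50 + 4.78 = 54.78$
$\text{Coefficient of Q.D.} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{54.78 - 30.54}{54.78 + 30.54} = \dfrac{24.24}{85.32} = 0.28$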
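The quartile formulas for a continuous distribution differ only in the position sought (N/4 or 3N/4), so one helper can serve both. The following Python sketch is an illustration under that assumption; `grouped_quantile` is a hypothetical name.

```python
def grouped_quantile(classes, freqs, fraction):
    """L + (fraction*N - p.c.f.) / f * i for the class containing the target position."""
    target = fraction * sum(freqs)
    cum_before = 0                      # p.c.f.: cumulative frequency before the class
    for (lo, hi), f in zip(classes, freqs):
        if cum_before + f >= target:
            return lo + (target - cum_before) / f * (hi - lo)
        cum_before += f

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
           (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [5, 22, 42, 65, 62, 45, 26, 13, 7, 3]

q1 = grouped_quantile(classes, freqs, 0.25)
q3 = grouped_quantile(classes, freqs, 0.75)
coefficient_qd = (q3 - q1) / (q3 + q1)
print(round(q1, 2), round(q3, 2), round(coefficient_qd, 2))
```

Calling the same helper with fraction 0.5 gives the median of the grouped data.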
3. As it takes middle 50% terms hence it is a measure better than Range and Percentile
Range.
4. It is not affected by extreme terms as 25% of upper and 25% of lower terms are left out.
5. Quartile Deviation also provides a short cut method to calculate Standard Deviation
using the formula 6 Q.D. = 5 M.D. = 4 S.D.
6. In case we are to deal with the center half of a series this is the best measure to use.
Demerits :
1. As Q1 and Q3 are both positional measures, they are not capable of further algebraic treatment.
2. The calculations are lengthy, but the result obtained is not of much importance.
4. 50% of the terms play no role; since the first and last 25% of the items are ignored, the result may not be reliable.
6. It is not a true measure of dispersion as it does not show the scatter around any average.
7. The value of Q.D. may be the same for two or more series, as Q.D. is not affected by the distribution of terms between Q1 and Q3 or outside these positions.
Mean deviation:
3. MD is less affected by the values of extreme items than the Standard deviation.
Demerits of Mean Deviation:
1. The greatest drawback of this method is that algebraic signs are ignored while taking
the deviations of the items.
Standard Deviation:
Merits:
Demerits:
1.Standard deviation is only used to measure spread or dispersion around the mean of a
data set.
3.Standard deviation is sensitive to outliers. A single outlier can raise the standard
deviation and in turn, distort the picture of spread.
4.For data with approximately the same mean, the greater the spread, the greater the
standard deviation.
5.If all values of a data set are the same, the standard deviation is zero
Variance:
The term variance refers to a statistical measurement of the spread between
numbers in a data set. More specifically, variance measures how far each number in the
set is from the mean and thus from every other number in the set. Variance is often
depicted by this symbol: σ². It is used by both analysts and traders to
determine volatility and market security. The square root of the variance is the standard
deviation (σ), which helps determine the consistency of an investment's returns over a
period of time.
Coefficient of Variation:
$\text{Coefficient of variation (C.V.)} = \dfrac{\sigma}{\text{Mean}} \times 100$
If the value of C.V. is smaller, the data items are more consistent.
If the value of C.V. is larger, the data items are less consistent.
Ex 1: The following data is related with the scores made by Sehwag and Rahul Dravid in
last ten innings.
Inn 1 2 3 4 5 6 7 8 9 10
Runs made by Sehwag 45 98 02 04 57 89 12 18 05 90
Runs made by Rahul 45 56 78 56 48 57 88 40 12 18
Find
1.The highest run getter. 2. average runs made by each of them 3. Which batsman is
more consistent?
Ans:
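Let X denote the runs made by Sehwag (Xm = 42) and Y the runs made by Rahul (Ym = 49.8).
$S.D.(X) = \sqrt{\dfrac{\sum (X - X_m)^2}{N}} = \sqrt{\dfrac{13772}{10}} = \sqrt{1377.2} = 37.11$
$\text{Coefficient of Variation}(X) = \dfrac{S.D.(X)}{X_m} \times 100 = \dfrac{37.11}{42} \times 100 = 88.36$
$S.D.(Y) = \sqrt{\dfrac{\sum (Y - Y_m)^2}{N}} = \sqrt{\dfrac{4945.6}{10}} = 22.24$
$\text{Coefficient of Variation}(Y) = \dfrac{S.D.(Y)}{Y_m} \times 100 = \dfrac{22.24}{49.8} \times 100 = 44.66$
Since the coefficient of variation of Y is smaller than that of X, Rahul is the more consistent batsman.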
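The comparison of the two batsmen can be checked with Python's standard `statistics` module. This is a sketch only: `pstdev` is used because the notes divide by N (population standard deviation), and the helper name `coefficient_of_variation` is an assumption.

```python
from statistics import mean, pstdev

sehwag = [45, 98, 2, 4, 57, 89, 12, 18, 5, 90]
rahul = [45, 56, 78, 56, 48, 57, 88, 40, 12, 18]

def coefficient_of_variation(runs):
    """C.V. = sigma / mean * 100, with sigma computed by dividing by N."""
    return pstdev(runs) / mean(runs) * 100

cv_sehwag = coefficient_of_variation(sehwag)
cv_rahul = coefficient_of_variation(rahul)

# The batsman with the smaller C.V. is the more consistent one
more_consistent = "Rahul" if cv_rahul < cv_sehwag else "Sehwag"
print(round(cv_sehwag, 2), round(cv_rahul, 2), more_consistent)
```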
Definition :
Correlation is the degree to which two variables depend on each other.
Example 1:
1. Age
2. Weight (Up to 18 yrs)
Example 2:
Example 3:
3.Supply
Types of correlation:
1.Positive Correlation:
When the value of one variable increases (or decreases) and the value of the dependent variable also increases (or decreases), it is called positive correlation.
X 1 2 3 4 5 6 7 8
Y(dependent ) 10 12 18 24 29 32 39 40
2.Negative correlation: When the value of one variable increases (or decreases) and the value of the dependent variable decreases (or increases), it is called negative correlation.
X 1 2 3 4 5 6 7 8
Y(dependent ) 25 24 20 18 12 6 2 1
Measuring the degree of correlations:
1.Scatter Diagram:
It gives us rough idea regarding the type of correlation between the two variables.
In this method, points are plotted on XY plane and the trend of points is noticed.
When the points are moving in upward direction from left to right then we conclude that
there is positive correlation between the two variables.
When the points are moving in downward direction from left to right then we conclude that
there is negative correlation between the two variables.
Example 1 : State the type of correlation between the two variables by using scatter
diagram.
X 2 5 8 19 28 35 45 65 78
Y 10 20 30 40 50 60 70 80 90
Ans :
[Scatter diagram: the nine (X, Y) points plotted with X on the horizontal axis and Y on the vertical axis; the points rise from left to right.]
There is positive correlation between X and Y as the points are moving in upward
direction.
Ex 2: State the type of correlation between the two variables by using scatter diagram.
X 12 19 45 59 66 75 88 90 99
Y 98 85 72 65 45 35 26 18 09
Ans:
[Scatter diagram: the nine (X, Y) points plotted with X on the horizontal axis and Y on the vertical axis; the points fall from left to right.]
There is negative correlation between X and Y as the points are moving in downward
direction from left to right.
3.No correlation
When the points do not show any trend either in upward or downward direction then we
can say that there is no correlation between the given variables.
Ex: 3
X 10 20 30 40 50 60 70 80 90
y 25 63 45 57 49 88 42 48 12
Ans :
[Scatter diagram: the nine (X, Y) points plotted with X on the horizontal axis and Y on the vertical axis; the points show no upward or downward trend.]
2.Analytical Method:
Karl Pearson’s Coefficient of Correlation is a widely used mathematical method that gives a numerical measure of the degree of relationship between two linearly related variables. The coefficient of correlation is denoted by “r”.
Depending upon the value of r ,we can conclude types of correlation in the following
manner.
3.r=0…… No correlation
More specifically when r=1 , there is a perfect positive correlation between the variables.
$r = \dfrac{\sum xy}{\sqrt{\sum x^2} \times \sqrt{\sum y^2}}$
where x = X - Xm
y = Y - Ym
Ex. Find Karl Pearson’s correlation coefficient from the following data
X 2 4 6 8 10 12 14 16 18
Y 5 9 13 17 21 25 29 33 37
Ans:
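$X_m = \dfrac{\sum X}{N} = \dfrac{90}{9} = 10$        $Y_m = \dfrac{\sum Y}{N} = \dfrac{189}{9} = 21$
$r = \dfrac{\sum xy}{\sqrt{\sum x^2} \times \sqrt{\sum y^2}} = \dfrac{480}{\sqrt{240} \times \sqrt{960}} = \dfrac{480}{480} = 1$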
X 13 25 36 48 56 69 45 58 29 11
Y 26 56 48 45 23 34 45 99 104 200
Ans:
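$X_m = \dfrac{\sum X}{N} = \dfrac{390}{10} = 39$        $Y_m = \dfrac{\sum Y}{N} = \dfrac{680}{10} = 68$
$r = \dfrac{\sum xy}{\sqrt{\sum x^2} \times \sqrt{\sum y^2}} = \dfrac{-4277}{\sqrt{3432} \times \sqrt{26228}} = \dfrac{-4277}{9487} = -0.45$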
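The deviation-based formula for r translates directly into code. The following Python sketch (the function name `pearson_r` is an assumption) recomputes the two worked examples above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = sum(xy) / (sqrt(sum(x^2)) * sqrt(sum(y^2))), with x, y as deviations from the means."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / (sqrt(sxx) * sqrt(syy))

# First example: Y = 2X + 1, so the correlation is perfect
r_perfect = pearson_r([2, 4, 6, 8, 10, 12, 14, 16, 18],
                      [5, 9, 13, 17, 21, 25, 29, 33, 37])
# Second example
r_second = pearson_r([13, 25, 36, 48, 56, 69, 45, 58, 29, 11],
                     [26, 56, 48, 45, 23, 34, 45, 99, 104, 200])
print(round(r_perfect, 2), round(r_second, 2))
```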
Age of X 1 2 3 4 5 6 7 8 9 10
His IQ 2.05 3.42 3.05 2.65 2.05 1.69 1.02 0.99 0.85 0.23
Ans:
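$X_m = \dfrac{\sum X}{N} = \dfrac{55}{10} = 5.5$        $Y_m = \dfrac{\sum Y}{N} = \dfrac{18}{10} = 1.8$
$r = \dfrac{\sum xy}{\sqrt{\sum x^2} \times \sqrt{\sum y^2}} = \dfrac{-24.96}{\sqrt{82.5} \times \sqrt{9.6784}} = \dfrac{-24.96}{28.26} = -0.88$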
Ex 4. State whether there is any kind of correlation between the cost incurred by the company on advertisement and the sales that they achieve from the following data.
Cost of Advertisement Rs. lac   12  16  28  35  40  38  45  34  24  18
Sales in Rs. lac                45  40  35  48  25  30  12  34  15  26
Ans : -0.40
$R = 1 - \dfrac{6 \sum D^2}{N^3 - N}$
where D=R1-R2
When R=1, we can say there is perfect positive correlation between the two variables.
Ex: In a competition, two judges J1 and J2 were invited to judge 10 participants. They
gave marks to the participants as shown below.
Participant No   1   2   3   4   5   6   7   8   9   10
Marks by J1      25  45  89  65  57  48  91  56  95  78
Marks by J2      45  89  57  68  69  50  95  58  80  90
Participant No   Marks by J1   Rank R1   Marks by J2   Rank R2   D=R1-R2   D²
1 25 10 45 10 0 0
2 45 9 89 3 6 36
3 89 3 57 8 -5 25
4 65 5 68 6 -1 1
5 57 6 69 5 1 1
6 48 8 50 9 -1 1
7 91 2 95 1 1 1
8 56 7 58 7 0 0
9 95 1 80 4 -3 9
10 78 4 90 2 2 4
∑D² = 78
N=10
$R = 1 - \dfrac{6 \sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 78}{10^3 - 10} = 1 - \dfrac{468}{1000 - 10} = 1 - \dfrac{468}{990} = 1 - 0.47 = 0.53$
Interpretation : As R = 0.53, there is only moderate agreement between the rankings given by the two judges.
Ex2 : From the following data, find spearman’s rank correlation coefficient.
X 25 78 89 98 56 54 57 45 59
Y 45 75 89 87 58 62 55 48 65
Ans: r=0.91
Ex. 3. Find Spearman’s Rank correlation coefficient from the following data.
Roll No          1   2   3   4   5   6   7   8   9   10
Marks in Maths   45  56  78  48  98  49  88  57  79  51
Marks in C       54  59  80  40  85  43  79  60  81  46
Ans :
Marks in Maths   Rank in maths R1   Marks in C   Rank in C R2   D=R1-R2   D²
45 10 54 7 3 9
56 6 59 6 0 0
78 4 80 3 1 1
48 9 40 10 -1 1
98 1 85 1 0 0
49 8 43 9 -1 1
88 2 79 4 -2 4
57 5 60 5 0 0
79 3 81 2 1 1
51 7 46 8 -1 1
∑D² = 18
$R = 1 - \dfrac{6 \sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 18}{10^3 - 10} = 1 - \dfrac{108}{1000 - 10} = 1 - \dfrac{108}{990} = 1 - 0.11 = 0.89$
There is a strong positive relation between marks in maths and C, i.e. the students are ranked almost the same in these two subjects. The students who are good in maths are good in C and vice versa.
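When no two items share a value, the ranking and the formula can be sketched as follows in Python. The helper names `ranks_desc` and `spearman` are assumptions, and rank 1 is given to the highest mark, as in the tables above.

```python
def ranks_desc(values):
    """Rank 1 for the largest value; assumes there are no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """R = 1 - 6*sum(D^2) / (N^3 - N), where D is the difference of ranks."""
    n = len(xs)
    d_squared = sum((r1 - r2) ** 2
                    for r1, r2 in zip(ranks_desc(xs), ranks_desc(ys)))
    return 1 - 6 * d_squared / (n ** 3 - n)

maths = [45, 56, 78, 48, 98, 49, 88, 57, 79, 51]
c_marks = [54, 59, 80, 40, 85, 43, 79, 60, 81, 46]
judge1 = [25, 45, 89, 65, 57, 48, 91, 56, 95, 78]
judge2 = [45, 89, 57, 68, 69, 50, 95, 58, 80, 90]

r_subjects = spearman(maths, c_marks)
r_judges = spearman(judge1, judge2)
print(round(r_subjects, 2), round(r_judges, 2))
```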
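Type II problems: When there is a tie between two data items or amongst more than two items.
Tied items are each given the average of the ranks they occupy, and for each group of tied items the correction (ni³-ni)/12 is added to ∑D², where ni stands for the number of data items with the same rank.
$r = 1 - \dfrac{6\left[\sum D^2 + \sum \frac{n_i^3 - n_i}{12}\right]}{N^3 - N}$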
Ex.: Find Spearman’s rank correlation coefficient from the following data.
X 78 45 89 78 56 55 89 56 56
Y 88 89 45 56 88 96 88 45 45
ANS:
n1=2
n2=2
n3=3
n4=3
5.For 45 in Y, the ranks are 7,8 and 9
n5=3
N³-N = 9³-9 = 720
In this case,
$r = 1 - \dfrac{6\left[\sum D^2 + \sum \frac{n_i^3 - n_i}{12}\right]}{N^3 - N}$
$r = 1 - 6 \times \dfrac{172}{720}$
r = 1 - 6 * 0.2389
r = 1 - 1.433
r = -0.433
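The tie-corrected calculation can be sketched in Python as below. The helper names are assumptions; tied items receive the average of the ranks they occupy, and the correction (ni³-ni)/12 is added once for every group of ties in either series.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 for the largest value; tied values share the average of their ranks."""
    order = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = order.index(v) + 1          # first rank occupied by this value
        count = order.count(v)              # number of tied items
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def tie_correction(values):
    """Sum of (n_i^3 - n_i) / 12 over every group of tied values."""
    return sum((c ** 3 - c) / 12 for c in Counter(values).values() if c > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    d_squared = sum((r1 - r2) ** 2
                    for r1, r2 in zip(average_ranks(xs), average_ranks(ys)))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (d_squared + correction) / (n ** 3 - n)

X = [78, 45, 89, 78, 56, 55, 89, 56, 56]
Y = [88, 89, 45, 56, 88, 96, 88, 45, 45]
r_tied = spearman_with_ties(X, Y)
print(round(r_tied, 3))
```

For this data the bracketed total works out to 172, matching the numerator used in the solution above.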
3. The Pearson product-moment correlation does not take into consideration whether a
variable has been classified as a dependent or independent variable. It treats all variables
equally.
4. A change of origin, or any scaling of the variables, does not affect the magnitude of r. Only the sign of r changes if one of the variables is scaled by a negative factor.
Association of attributes:
Technically, we say that the two attributes are associated if they appear together in a
greater number of cases than is to be expected if they are independent and not simply on
the basis that they are appearing together in a number of cases as is done in ordinary life.
The association may be positive or negative (negative association is also known as
disassociation). If class frequency of AB, symbolically written as (AB), is greater than the
expectation of AB being together if they are independent, then we say the two attributes
are positively associated; but if the class frequency of AB is less than this expectation, the
two attributes are said to be negatively associated. In case the class frequency of AB is
equal to expectation, the two attributes are considered as independent i.e., are said to
have no association. It can be put symbolically as shown hereunder:
The value of this coefficient will be somewhere between +1 and –1. If the attributes
are completely associated (perfect positive association) with each other, the coefficient will
be +1, and if they are completely disassociated (perfect negative association), the
coefficient will be –1. If the attributes are completely independent of each other, the
coefficient of association will be 0. The varying degrees of the coefficients of association
are to be read and understood according to their positive and negative nature between +1
and –1.
Sometimes the association between two attributes, A and B, may be regarded as
unwarranted when we find that the observed association between A and B is due to the
association of both A and B with another attribute C. For example, we may observe
positive association between inoculation and exemption for small-pox, but such
association may be the result of the fact that there is positive association between
inoculation and richer section of society and also that there is positive association between
exemption from small-pox and richer section of society. The sort of association between A
and B in the population of C is described as partial association as distinguished from total
association between A and B in the overall universe. We can workout the coefficient of
partial association between A and B in the population of C by just modifying the above
stated formula for finding association between A and B as shown below:
where,
QAB.C = Coefficient of partial association between A and B in the population of C; and all
other values are the class frequencies of the respective classes (A, B, C denotes the
presence of concerning attributes and a, b, c denotes the absence of concerning
attributes)
Unit No : 4
Regression
Regression analysis:
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables.
It can be utilized to assess the strength of the relationship between variables and for
modeling the future relationship between them.
Regression analysis includes several variations, such as linear, multiple linear, and
nonlinear. The most common models are simple linear and multiple linear. Nonlinear
regression analysis is commonly used for more complicated data sets in which the
dependent and independent variables show a nonlinear relationship.
1.The dependent and independent variables show a linear relationship between the slope
and the
intercept.
5.The value of the residual (error) is not correlated across all observations.
6.The residual (error) values follow the normal distribution.
Regression Analysis
It is a method of establishing the relation between two variables in the form of a linear equation. This relation is used for determining the value of one variable on the basis of the given value of the second variable.
Regression Equations:
1.Regression Equation of X on Y:
X-Xmean = bxy(Y-Ymean)
$b_{xy} = \dfrac{\sum xy}{\sum y^2}$
where x = X - Xmean
y = Y - Ymean
2.Regression Equation of Y on X:
Y-Ymean = byx(X-Xmean)
$b_{yx} = \dfrac{\sum xy}{\sum x^2}$
where x = X - Xmean
y = Y - Ymean
X 2 4 6 8 10 12 14 16
Y 7 13 19 25 31 37 43 49
Ans :
X Y x=X −Xmean y=Y −Ymean xy x2 y2
=X - 9 =Y - 28
2 7 -7 -21 +147 49 441
4 13 -5 -15 +75 25 225
6 19 -3 -9 +27 9 81
8 25 -1 -3 +3 1 9
10 31 1 3 3 1 9
12 37 3 9 27 9 81
14 43 5 15 75 25 225
16 49 7 21 147 49 441
∑X = 72   ∑Y = 224   ∑xy = 504   ∑x² = 168   ∑y² = 1512
$X_{mean} = \dfrac{\sum X}{n} = \dfrac{72}{8} = 9$
$Y_{mean} = \dfrac{\sum Y}{n} = \dfrac{224}{8} = 28$
1.Regression of X on Y is given by
X-Xmean = bxy(Y-Ymean)
$b_{xy} = \dfrac{\sum xy}{\sum y^2} = \dfrac{504}{1512} = 0.3333$
X-Xmean = bxy(Y-Ymean)
X- 9 =0.3333(Y-28)
X-9 =0.3333Y-9.33
X=0.3333Y-9.33+9
X=0.3333Y-0.33
2.Regression of Y on X.
Y-Ymean = byx(X-Xmean)
$b_{yx} = \dfrac{\sum xy}{\sum x^2} = \dfrac{504}{168} = 3$
Y-28=3(X-9)
Y-28=3X-27
Y=3X-27+28
Y=3X+1
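Both regression lines can be obtained from the same three sums. The Python sketch below is an illustration only; `regression_lines` is a hypothetical helper that returns each line as a (slope, intercept) pair.

```python
def regression_lines(xs, ys):
    """Return ((byx, a), (bxy, c)) for Y = byx*X + a and X = bxy*Y + c."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    byx = sxy / sxx                     # regression coefficient of Y on X
    bxy = sxy / syy                     # regression coefficient of X on Y
    return (byx, y_mean - byx * x_mean), (bxy, x_mean - bxy * y_mean)

X = [2, 4, 6, 8, 10, 12, 14, 16]
Y = [7, 13, 19, 25, 31, 37, 43, 49]
(byx, a), (bxy, c) = regression_lines(X, Y)
print(f"Y = {byx:.4f}X + {a:.4f}")
print(f"X = {bxy:.4f}Y + {c:.4f}")
```

The product byx * bxy equals r², which is how the correlation coefficient is recovered from the two regression coefficients in the later examples.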
1.regression equation of X on Y
2.regression equation of Y on X
3.Correlation coefficient
X 10 12 14 16 18 20 22 24 26
Y 29 35 41 47 53 59 65 71 77
Ans:
$X_{mean} = \dfrac{\sum X}{n} = \dfrac{162}{9} = 18$
$Y_{mean} = \dfrac{\sum Y}{n} = \dfrac{477}{9} = 53$
1.Regression of X on Y is given by
X-Xmean = bxy(Y-Ymean)
$b_{xy} = \dfrac{\sum xy}{\sum y^2} = \dfrac{720}{2160} = 0.3333$
X-Xmean = bxy(Y-Ymean)
X- 18 =0.3333(Y-53)
X-18 =0.3333Y-17.6649
X=0.3333Y-17.6649+18
X=0.3333Y+0.3351
2.Regression of Y on X.
Y-Ymean = byx(X-Xmean)
$b_{yx} = \dfrac{\sum xy}{\sum x^2} = \dfrac{720}{240} = 3$
Y-53=3(X-18)
Y-53=3X-54
Y=3X-54+53
Y=3X-1
3. $r = \sqrt{b_{xy} \times b_{yx}}$
$r = \sqrt{0.3333 \times 3}$
r = 1 (up to rounding of bxy)
b. regression equations
c.correlation coefficient
X 12 15 19 27 56 62 70 82
Y 06 09 15 21 24 27 29 35
Ans : Students are expected to calculate the data in the blank spaces
X    Y    x = X - 42.87   y = Y - 20.75   xy        x²       y²
12   06   -30.87          -14.75          +455.33   ……       ……
15   09   -27.87          -11.75          +327.47   ……       ……
19   15   -23.87          -5.75           +137.25   ……       ……
27   21   -15.87          0.25            -3.96     ……       ……
56   24   13.13           3.25            42.67     ……       ……
62   27   19.13           6.25            119.56    ……       ……
70   29   27.13           8.25            223.82    ……       ……
82   35   39.13           14.25           557.60
∑X = 343   ∑Y = 166   ∑xy = 1859.74   ∑x² = 5356.82   ∑y² = 709.48
$X_{mean} = \dfrac{\sum X}{n} = \dfrac{343}{8} = 42.87$
$Y_{mean} = \dfrac{\sum Y}{n} = \dfrac{166}{8} = 20.75$
1.Regression of X on Y is given by
X-Xmean = bxy(Y-Ymean)
$b_{xy} = \dfrac{\sum xy}{\sum y^2} = \dfrac{1859.74}{709.48} = 2.62$
X-Xmean = bxy(Y-Ymean)
X- 42.87=2.62(Y-20.75)
X-42.87 =2.62Y-54.36
X=2.62Y-54.36+42.87
X=2.62Y-11.49
2.Regression of Y on X.
Y-Ymean = byx(X-Xmean)
$b_{yx} = \dfrac{\sum xy}{\sum x^2} = \dfrac{1859.74}{5356.82} = 0.3472$
Y-20.75=0.3472(X-42.87)
Y-20.75=0.3472X-14.88
Y=0.3472X-14.88+20.75
Y=0.3472X+5.87
3. $r = \sqrt{b_{xy} \times b_{yx}} = \sqrt{2.62 \times 0.3472}$
r=0.954
X=2.62Y-11.49
=2.62*34-11.49
X=89.08-11.49
X=77.59
b. regression equations
c.correlation coefficient
Business application:
1.Regression analysis in finance :
Regression analysis has several applications in finance. For example, the statistical
method is fundamental to the Capital Asset Pricing Model (CAPM). Essentially, the CAPM
equation is a model that determines the relationship between the expected return of an
asset and the market risk premium.
The analysis is also used to forecast the returns of securities, based on different factors, or
to forecast the performance of a business.
Elementary probability
Probability :
i) Classical approach
ii) Empirical Approach
iii) Axiomatic Approach
Random Experiment :
Event :
Two or more events are said to be mutually exclusive if the occurrence of any one of them excludes the occurrence of all the other events in the same experiment. When a coin is tossed, the appearance of a head automatically rules out the appearance of a tail and vice versa.
Equally Likely :
The outcomes are said to be equally likely or equally probable if none of them is
expected to appear or occur in preference to other.
Independent Events:
Events are said to be independent of each other if happening of any one of them is
not affected and does not affect the happening of any one of others.
Types of Probability
Probability = Number of favorable outcomes/Total number of trials
Solved Examples
Q.1) An unbiased die is thrown what is the probability that on upper most face
S={1,2,3,4,5,6}
Here n(s)=6
A={2,3,5}
∴ n(A)=3
P(A)=n(A)/n(S)
=3/6
=0.50
ii)Let B event that even number appears
B={2,4,6}
∴ n(B)=3
P(B)=n(B)/n(S)
=3/6
=0.50
C={1,2}
∴ n(C)=2
P(C)=n(C)/n(S)
=2/6
=0.3333
D={3,4,5,6}
∴n(D)=4
P(D)=n(D)/n(S)
=4/6
=0.6666
Q.2)Two unbiased dice are thrown. What is the probability that the total score on upper
most faces of two dice is
i)multiple of 5 ?
iv)greater than 5?
v)at least 9?
vi)at most 6
S={(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),(3,1),(3,2),(3,3),
(3,4),(3,5),(3,6),(4,1),
(4,2),(4,3), (4,4),(4,5),(4,6),(5,1),(5,2),(5,3),(5,4),(5,5),(5,6),(6,1),(6,2),(6,3),(6,4),(6,5),
(6,6) }
∴ n(S)=36
A={(1,4),(2,3),(3,2),(4,1),(4,6),(5,5),(6,4)}
∴ n(A)=7
P(A)=n(A)/n(S)
=7/36
=0.194
ii) Let B event that the total score is perfect square i.e. 4,9
B={(1,3),(2,2),(3,1),(3,6),(4,5),(5,4),(6,3)}
∴n(B)=7
P(B)=n(B)/n(S)
=7/36
=0.19
iii) Let C event that the total score is prime number i.e. 2,3,5,7,11
C={(1,1),(1,2),(1,4),(1,6),(2,1),(2,3),(2,5),(3,2),(3,4),(4,1),(4,3),(5,2),(5,6),(6,1),(6,5)}
∴ n(C)=15
P(C)=n(C)/n(S)=15/36=0.4167
(6,2)(6,3)(6,4)(6,5)(6,6)}
∴n(D)=26
P(D)=n(D)/n(s)
=26/36
=0.72
E={(3,6),(4,5),(4,6),(5,4),(5,5),(5,6),(6,3),(6,4),(6,5),(6,6)}
∴n(E)=10
P(E)=n(E)/n(S)
=10/36
=0.2778
F={(1,1),(1,2),(1,3),(1,4),(1,5),(2,1),(2,2),(2,3),(2,4),(3,1),(3,2),(3,3),(4,1),(4,2),(5,1)}
∴n(F)=15
P(F)=n(F)/n(S)
=15/36
=0.4167
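The counts in Q.2 can be verified by enumerating the 36 outcomes. The Python sketch below is an illustration of the classical definition P(A) = n(A)/n(S); the helper name `probability` is an assumption, and exact fractions are used so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two dice
sample_space = list(product(range(1, 7), repeat=2))

def probability(condition):
    """n(event)/n(S), where the event is a condition on the total score."""
    favourable = [outcome for outcome in sample_space if condition(sum(outcome))]
    return Fraction(len(favourable), len(sample_space))

p_multiple_of_5 = probability(lambda total: total % 5 == 0)
p_perfect_square = probability(lambda total: total in (4, 9))
p_prime = probability(lambda total: total in (2, 3, 5, 7, 11))
p_greater_than_5 = probability(lambda total: total > 5)
p_at_least_9 = probability(lambda total: total >= 9)
p_at_most_6 = probability(lambda total: total <= 6)

for p in (p_multiple_of_5, p_perfect_square, p_prime,
          p_greater_than_5, p_at_least_9, p_at_most_6):
    print(p, round(float(p), 4))
```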
1-K
1-Q
1-A (Ace)
9- 2 to 10 Number cards
Hence
Q.3) Three cards are drawn from a well shuffled pack of 52 playing cards. Find the probability that
n(S)= 52C3
=52 x 51x 50 /3 x 2 x 1
=22100
i)Let A event that all 3 Cards are club cards
n(A) = 13C3
= 13x12x11/3x2x1
= 286
P(A) =n(A)/n(S)
=286/22100
=0.0129
ii) Let B event that all 3 cards are red picture cards
n(B) = 6C3
=6 x 5 x 4 /3x 2x 1
=20
P(B) =n(B)/n(S)
=20/22100
=0.0009
n(C)= 16C3
=16 x 15 x 14 /3 x 2 x 1
=560
P(C) =n(C)/n(S)
=560/22100
=0.0253
n(D) = 18C3
=18 x 17 x 16/3x 2 x 1
=816
P(D) =n(D)/n(S)
=816/22100
=0.0369
n(E) = 9C3
n(E) = 9 x 8 x 7/3 x 2x 1
=84
P(E) =n(E)/n(S)
=84/22100
=0.0038
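Each part of Q.3 has the same shape: choose 3 cards from a group of k cards out of the 52. A minimal Python sketch using `math.comb` (available from Python 3.8) is given below; the function name `p_all_three_from` is an assumption, and the group sizes 13, 6, 16, 18 and 9 are taken from the worked solution.

```python
from math import comb

n_S = comb(52, 3)  # number of ways of drawing 3 cards from 52

def p_all_three_from(group_size):
    """Probability that all 3 cards drawn come from a group of the given size."""
    return comb(group_size, 3) / n_S

for k in (13, 6, 16, 18, 9):
    print(k, comb(k, 3), round(p_all_three_from(k), 4))
```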
Proof : Let A and B be any two events. Assume that there are m elements in event A and n elements in event B, and that p elements appear in both A and B.
The corresponding Venn diagram is as follows.
[Venn diagram: two overlapping circles A and B, with m-p elements in A only, p elements in the overlap and n-p elements in B only]
(m-p)+p+(n-p) = Total number of elements in A ∪ B
∴ m+n-p = n(A ∪ B)
∴ n(A)+n(B)-n(A ∩ B) = n(A ∪ B)
Divide both sides by n(S),
∴ $\dfrac{n(A)}{n(S)} + \dfrac{n(B)}{n(S)} - \dfrac{n(A \cap B)}{n(S)} = \dfrac{n(A \cup B)}{n(S)}$
By definition of probability,
P(A)+P(B)-P(A ∩ B) = P(A ∪ B)
∴ P(A ∪ B) = P(A)+P(B)-P(A ∩ B)
If A and B are mutually exclusive, then P(A ∩ B) = 0 and
P(A ∪ B) = P(A)+P(B)
Solved Examples :
Q1) Two Cards are drawn from a well shuffled pack of 52 playing cards.Find the
probability that
n(S) =52C2
∴ n(S)=52 x 51 / 2 x 1
∴n(S) =1326
n(A) = 26C2
∴ P(A)=n(A)/n(S)
∴ P(A) = 325/1326
∴ P(A) =0.2450
Let B event that both cards are picture card
n(B) =12C2
∴n(B)=12 x 11/ (2 x 1)
∴ n(B)=66
P(B) =n(B)/n(S)
= 66/1326
= 0.0497
Let A ∩ B event that both cards are red as well as picture cards
n(A ∩ B) = 6C2
= 6 x 5 / (2 x 1)
=15
= 15/1326
= 0.0113
=0.2450+0.0497-0.0113
=0.2834
n(A) =26C2
∴ n(A) =26*25/2*1
=325
P(A)= n(A)/n(S)
= 325/1326
= 0.2450
n(B) = 36C2
∴ n(B) = 36*35/(2*1)
= 630
P(B) = n(B)/n(S)
= 630/1326
= 0.4751
Let A ∩ B event that both cards are red as well as number cards
n(A ∩ B) = 18C2
= 18*17/(2*1)
= 153
P(A ∩ B) = n(A ∩ B)/n(S)
= 153/1326
= 0.1153
By addition law of probability, the probability that both cards are red cards or number cards
P(A ∪ B) = P(A)+P(B)-P(A ∩ B)
= 0.2450+0.4751-0.1153
= 0.6048
n(A) = 12C2
= 12*11/(2*1)
= 66
P(A) = n(A)/n(S)
= 66/1326
= 0.0497
n(B) = 13C2
= 13*12/(2*1)
= 78
P(B) = n(B)/n(S)
= 78/1326
= 0.0588
Let A ∩ B event that both cards are picture cards as well as spade cards
n(A ∩ B) = 3C2
= 3*2/(2*1)
= 3
P(A ∩ B) = n(A ∩ B)/n(S)
= 3/1326
= 0.0022
By addition law of the probability , the probability that both cards are picture cards or
spade cards
P(A ∪ B)=P(A)+P(B)-P(A∩B)
=0.0497+0.0588-0.0022
=0.1063
n(A) = 26C2
= 26*25/(2*1)
= 325
P(A) = n(A)/n(S)
= 325/1326
= 0.2450
n(B) = 13C2
∴ n(B) = 13*12/(2*1) = 78
P(B) = n(B)/n(S)
= 78/1326
= 0.0588
Let A ∩ B event that both cards are red as well as heart cards
n(A ∩ B) = 13C2
= 13*12/(2*1)
= 78
P(A ∩ B) = n(A ∩ B)/n(S)
= 78/1326
= 0.0588
By addition law of probability, the probability that both cards are red or heart cards
P(A ∪ B) = P(A)+P(B)-P(A ∩ B)
= 0.2450+0.0588-0.0588
= 0.2450
OR
P(A ∩ B)=P(B)*P(A/B)
If A and B are independent events, then
P(A ∩ B)=P(A)*P(B)
Solved Examples:
Q1.Two tickets are drawn one after the other from a box containing 20 tickets numbered
from 1 to 20 .
n(A)= 10C1
= 10
n(S1 )= 20C1
= 20
P(A)=n(A)/n(S1)
=10/20
=0.5
n(B/A)= 9C1
=9
n(S2) = 19C1
= 19
P(B/A)=n(B/A)/n(S2)
= 9/19
=0.4736
=0.5*0.4736
=0.2368
i.e. 2,3,5,7,11,13,17,19
n(A) = 8C1
=8
n(S1)= 20C1
= 20
P(A)=n(A)/n(S1)
=8/20
Let B/A event that second ticket also shows prime number
n(B/A) = 7C1
= 7
n(S2) = 19C1
= 19
P(B/A) = n(B/A)/n(S2)
= 7/19
= 0.3684
P(A ∩ B) = P(A)*P(B/A)
= (8/20)*(7/19)
= 0.1474
Probability Distributions:
Random Variable:
Random variable
Continuous random variables can represent any value within a specified range or
interval and can take on an infinite number of possible values. An example of a continuous
random variable would be an experiment that involves measuring the amount of rainfall in
a city over a year, or the average height of a random group of 25 people.
However, the two coins land in four different ways: TT, HT, TH, HH. Therefore,
the P(Y=0) = 1/4 since we have one chance of getting no heads (i.e., two tails [TT] when
the coins are tossed). Similarly, the probability of getting two heads (HH) is also 1/4.
Notice that getting one head has a likelihood of occurring twice: in HT and TH. In this
case, P (Y=1) = 2/4 = 1/2.
Solved Examples :
Q1. Find mean and variance from the following probability distribution:
xi 0 1 2 3 4 5
pi 1/12 4/12 3/12 1/12 2/12 1/12
Solution :
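xi      pi     xi·pi    xi²·pi
0       1/12   0        0
1       4/12   4/12     4/12
2       3/12   6/12     12/12
3       1/12   3/12     9/12
4       2/12   8/12     32/12
5       1/12   5/12     25/12
Total   1      26/12    82/12
Mean = ∑xi·pi = 26/12 = 2.1667
Variance = ∑xi²·pi - (Mean)² = 82/12 - (26/12)² = 6.8333 - 4.6944 = 2.1389
$S.D. = \sqrt{\dfrac{82}{12} - \left(\dfrac{26}{12}\right)^2} = \sqrt{6.8333 - 4.6944} = 1.4625$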
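The mean and variance of a discrete probability distribution can be checked with exact fractions. The sketch below is only an illustration of E(X) = ∑xi·pi and Var(X) = ∑xi²·pi - (E(X))²; the variable names are assumptions.

```python
from fractions import Fraction
from math import sqrt

xs = [0, 1, 2, 3, 4, 5]
ps = [Fraction(k, 12) for k in (1, 4, 3, 1, 2, 1)]
assert sum(ps) == 1  # the probabilities must add up to 1

mean = sum(x * p for x, p in zip(xs, ps))                  # sum of xi * pi
second_moment = sum(x * x * p for x, p in zip(xs, ps))     # sum of xi^2 * pi
variance = second_moment - mean ** 2
sd = sqrt(variance)

print(mean, variance, round(sd, 4))
```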
P(X)= ∫ f ( x ) dx
−∞
Solved Example :
P(X)= ∫ f ( x ) dx
−∞
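$P(X) = \int_{0}^{3} (x^2 - 1)\,\dfrac{1}{18}\,dx = \dfrac{1}{18}\left[\dfrac{x^3}{3} - x\right]_{0}^{3} = \dfrac{1}{18}\left\{\left(\dfrac{3^3}{3} - 3\right) - \left(\dfrac{0^3}{3} - 0\right)\right\} = \dfrac{1}{18} \times 6 = 0.3333$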
where the right-hand side represents the probability that the random variable X takes on a value less than or equal to x. The probability that X lies in the semi-closed interval (a,b], where a<b, is therefore
$P(a < X \le b) = F(b) - F(a)$
The CDF of a continuous random variable X can be expressed as the integral of its probability density function f(x) as follows:
$F(x) = \int_{-\infty}^{x} f(t)\,dt$
Moments :
The n-th moment of a real-valued continuous function f(x) of a real variable about a value c is
$\mu_n = \int_{-\infty}^{\infty} (x - c)^n f(x)\,dx$
The n-th moment about the origin is
$\mu'_n = E(x^n) = \int_{-\infty}^{\infty} x^n\,dF(x)$
a. Binomial distribution :-
1) The random experiment is performed repeatedly a finite and fixed number of times, that is,
n, the number of trials, is fixed
2) The outcome of the random experiment results in only two mutually disjoint categories, that
is, success and failure
3) All trials are independent, that is, the result of any trial is not affected in any way by
the preceding trials
4) The probability of success in any trial is p and is constant for each trial; q=1-p is then
termed the probability of failure
Statement:-
Solved Examples
i) Exactly 6 heads
iii) No head
Ans:-Here n=10
∴p=1/2
q=1-p
∴q=1-1/2
∴q=1/2
=210 x 1/1024
= 210/1024
= 0.205
P(X≥8)=P(X=8)+P(X=9)+P(X=10)
= 10C8 (½)^8 (½)^(10-8) + 10C9 (½)^9 (½)^(10-9) + 10C10 (½)^10 (½)^(10-10)
=(45+10+1) X 1/1024
= 56/1024
=0.0547
= 1/1024
P(X≤3)=P(X=0)+P(X=1)+P(X=2)+P(X=3)
= 10C0 (½)^0 (½)^(10-0) + 10C1 (½)^1 (½)^(10-1) + 10C2 (½)^2 (½)^(10-2) + 10C3 (½)^3 (½)^(10-3)
=1*1/1024+10*1/1024+45*1/1024+120*1/1024
=(1+10+45+120)*1/1024
=176*1/1024
=176/1024
=0.1719.
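The binomial probabilities in this example can be recomputed with a short script. The sketch assumes the experiment is 10 tosses of a fair coin (n=10, p=1/2, as in the answer); the function name `binom_pmf` is illustrative. It reports the sum over r = 1, 2, 3 and the full "at most 3" probability separately, since the latter also includes r = 0.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r) for a binomial distribution: nCr * p^r * q^(n-r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.5   # assumption: 10 tosses of a fair coin

p_exactly_6 = binom_pmf(6, n, p)
p_at_least_8 = sum(binom_pmf(r, n, p) for r in range(8, n + 1))
p_no_head = binom_pmf(0, n, p)
p_1_to_3 = sum(binom_pmf(r, n, p) for r in range(1, 4))   # r = 1, 2, 3 only
p_at_most_3 = p_no_head + p_1_to_3                        # r = 0, 1, 2, 3

for name, value in [("exactly 6", p_exactly_6), ("at least 8", p_at_least_8),
                    ("no head", p_no_head), ("1 to 3", p_1_to_3),
                    ("at most 3", p_at_most_3)]:
    print(f"{name}: {value:.4f}")
```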
In any practical case we will already know n, the number of trials. How can we
estimate p, the probability of “success” in a single trial? An intuitive answer is that we can
estimate p by the fraction of all the trials which were “successes,” that is, the proportion or
relative frequency of “success.” It is possible to show mathematically that this intuitive
answer is correct, an unbiased estimate of the parameter p.
Example: The following data give the number of seeds germinating (X) out of 10 on damp
filter paper for 80 sets of seeds. Fit a binomial distribution to the data.
X 0 1 2 3 4 5 6 7 8 9 10
f 6 20 28 12 8 6 0 0 0 0 0
Solution: Here the random variable X denotes the number of seeds germinating out of a set
of 10 seeds.
The total number of trials n = 10.
The mean of the given data
X̄ = (0*6+1*20+2*28+3*12+4*8+5*6)/80 = 174/80 = 2.175
Since mean of a binomial distribution is np,
∴ np = 2.175.
Thus, we get p = 2.175/10 = 0.2175 and q = 1 − p = 0.7825.
N=∑f =80
The calculated probabilities and the respective expected frequencies are shown in the
following table:
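The fitting procedure can be sketched in code: estimate p from the sample mean, compute the binomial probabilities, and multiply by N to get expected frequencies. The frequencies below are the ones given in the example; the variable names are illustrative.

```python
from math import comb

x_values = list(range(11))
freq = [6, 20, 28, 12, 8, 6, 0, 0, 0, 0, 0]   # observed frequencies from the example
n_trials = 10                                  # seeds per set

N = sum(freq)                                            # total number of sets
mean = sum(x * f for x, f in zip(x_values, freq)) / N    # sample mean
p_hat = mean / n_trials                                  # since mean = n * p
q_hat = 1 - p_hat

probs = [comb(n_trials, x) * p_hat ** x * q_hat ** (n_trials - x) for x in x_values]
expected = [N * pr for pr in probs]

print(f"N={N}, mean={mean}, p={p_hat:.4f}, q={q_hat:.4f}")
for x, pr, e in zip(x_values, probs, expected):
    print(f"x={x:2d}  P(x)={pr:.4f}  expected frequency={e:.2f}")
```

The expected frequencies sum to N because the binomial probabilities sum to 1.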
b. Poisson Distribution:-
i.e. n → ∞
i.e. p → 0
3) np = m is finite
Under the above three conditions the probability function of Poisson distribution is
given as
$P(x = r) = \frac{e^{-m}\, m^{r}}{r!}$
Following are some of the practical situations wherein the Poisson distribution can be
used:
1) The number of telephone calls arriving at a telephone switchboard in unit time
Solved Example: The following table gives the number of days in a 50-day period during
which automobile accidents occurred in a city:
No. of accidents 0 1 2 3 4
No of days 21 18 7 3 1
Fit a Poisson distribution to the data.
Solution :
No. of accidents (x)    No. of days (f)    fx
0                       21                 0
1                       18                 18
2                       7                  14
3                       3                  9
4                       1                  4
Total                   n=50               ∑fx=45
m = ∑fx / n
m = 45/50
m = 0.9
In order to fit a Poisson distribution, we compute the probabilities P(x=r) = e^(−0.9) (0.9)^r / r!
and multiply each of these values by n, i.e. by 50. Hence the required Poisson distribution is:
x 0 1 2 3 4
f 0.4066x50=20.33 18.30 8.23 2.47 0.56
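The Poisson fit can be reproduced with a few lines of code. The sketch uses the accident data from the example and the probability function P(x=r) = e^(−m) m^r / r!; the function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

accidents = [0, 1, 2, 3, 4]
days = [21, 18, 7, 3, 1]          # observed number of days from the example

n = sum(days)                                         # 50 days
m = sum(x * f for x, f in zip(accidents, days)) / n   # mean number of accidents per day

def poisson_pmf(r, m):
    """P(x = r) for a Poisson distribution with mean m."""
    return exp(-m) * m ** r / factorial(r)

# Expected frequency = n * P(x = r)
expected = [n * poisson_pmf(x, m) for x in accidents]

for x, e in zip(accidents, expected):
    print(f"x={x}  expected days={e:.2f}")
```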
Mode of Poisson Distribution :
Solved Example:
Between the hours of 2 pm and 4 pm, the average number of phone calls per minute coming into
the switchboard of a company is 2.35. Find the probability that during one particular minute
there will be at most 2 phone calls.
P(x≤2) = P(x=0)+P(x=1)+P(x=2)
= e^(−2.35) (1 + 2.35 + 2.35²/2!)
= 0.095369*6.11125
= 0.5828
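The same probability can be obtained by summing the Poisson probability function directly. This is a minimal sketch with m = 2.35 calls per minute, as stated in the example.

```python
from math import exp, factorial

m = 2.35   # average number of calls per minute

def poisson_pmf(r, m):
    """P(x = r) = e^(-m) * m^r / r!"""
    return exp(-m) * m ** r / factorial(r)

p_at_most_2 = sum(poisson_pmf(r, m) for r in range(3))   # r = 0, 1, 2

print(round(exp(-m), 6), round(p_at_most_2, 4))
```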
c. Normal Distribution:-
where e=2.7183
Normal Curve :
[Figure: normal curve, with the horizontal axis X centred at X = µ]
2) Normal curve is symmetrical about the line x=µ, i.e. it has the same shape on either side
of the line x=µ.
3) The mean deviation from the mean in normal distribution is approximately equal to 4/5 of
its standard deviation.
4) Mean = Median = Mode
2. The normal distribution is a probability function that describes how the values of a
variable are distributed. It is a symmetric distribution where most of the observations
cluster around the central peak and the probabilities for values further away from
the mean taper off equally in both directions. Extreme values in both tails of the distribution
are similarly unlikely.
5. The central limit theorem states that as the sample size increases, the sampling
distribution of the mean approaches a normal distribution even when the underlying distribution
of the original variable is non-normal (provided it has a finite variance).
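The central limit theorem can be illustrated with a small simulation. The sketch below assumes an exponential population with mean 1 (a clearly non-normal choice made only for illustration), draws many samples of size 50, and checks that the sample means behave as a normal distribution would predict; the seed and sizes are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)    # fixed seed so the run is repeatable
sample_size = 50
repetitions = 4000

# Assumption: the population is exponential with mean 1 and standard deviation 1.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(repetitions)
]

theoretical_sd = 1 / sqrt(sample_size)    # sigma / sqrt(n)
mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)

# For a normal distribution about 95% of values lie within 1.96 standard deviations.
within = sum(abs(m - 1) <= 1.96 * theoretical_sd for m in sample_means) / repetitions

print(round(mean_of_means, 3), round(sd_of_means, 3), round(within, 3))
```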
Solved Example :
1. In normal distribution whose mean is 2 and standard deviation is 3, find the value of
the variate such that the probability of the interval from the mean to the value is
0.4115.
$z = \frac{x - m}{\sigma}$
$z = \frac{x - 2}{3}$
Also from the tables of the areas under the standard normal curve, the value of z for
which the area is 0.4115 is z = 1.35.
$1.35 = \frac{x - 2}{3}$
x = 3 × 1.35 + 2
x = 6.05
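The table lookup in this example can be replaced by the inverse CDF of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library and, like the worked solution, takes the value lying above the mean.

```python
from statistics import NormalDist

m, sigma = 2, 3      # mean and standard deviation from the example
area = 0.4115        # probability of the interval from the mean to x

# The area from the mean to z is 0.4115, so the cumulative area up to z is 0.5 + 0.4115.
z = NormalDist().inv_cdf(0.5 + area)
x = m + sigma * z

print(round(z, 2), round(x, 2))
```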
Unit no 6
Hypothesis
Hypothesis testing is an act in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology employed by the analyst depends on
the nature of the data used and the reason for the analysis. Hypothesis testing is used to
judge, from sample data, whether a hypothesis about the larger population is plausible.
The null hypothesis is usually a statement of equality or of no effect about a
population parameter, and it is assumed to be true until the sample provides sufficient
evidence against it. The alternative hypothesis is effectively the opposite of the null
hypothesis. Thus, they are mutually exclusive, and only one can be true. However, one of
the two hypotheses will always be true.
If, for example, a person wants to test that a penny has exactly a 50% chance of
landing on heads, the null hypothesis would be that 50% is correct, and the alternative
hypothesis would be that 50% is not correct. Mathematically, the null hypothesis would be
represented as H0: P = 0.5. The alternative hypothesis would be denoted as "Ha" and be
identical to the null hypothesis, except with the equal sign struck through, meaning that it
does not equal 50% (Ha: P ≠ 0.5).
A random sample of 100 coin flips is taken from a random population of coin
flippers, and the null hypothesis is then tested. If it is found that the 100 coin flips were
distributed as 40 heads and 60 tails, the analyst would assume that a penny does not
have a 50% chance of landing on heads and would reject the null hypothesis and accept
the alternative hypothesis. Afterward, a new hypothesis would be tested, this time that a
penny has a 40% chance of landing on heads.
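The coin example can be made concrete with a test statistic. The text does not name a test, so the sketch below assumes a two-sided large-sample z-test for a proportion at the 5% significance level; this is a normal approximation, and with a result this close to the threshold other tests may give a slightly different p-value.

```python
from math import sqrt
from statistics import NormalDist

n, heads = 100, 40    # sample from the example: 40 heads in 100 flips
p0 = 0.5              # proportion under the null hypothesis H0: P = 0.5
alpha = 0.05          # assumed significance level

p_hat = heads / n
standard_error = sqrt(p0 * (1 - p0) / n)    # standard error under H0
z = (p_hat - p0) / standard_error

# Two-sided p-value: probability of a |z| at least this large if H0 is true.
p_value = 2 * NormalDist().cdf(-abs(z))
reject_null = p_value < alpha

print(round(z, 2), round(p_value, 4), reject_null)
```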
(i) State the two hypotheses so that only one can be right, i.e. the null hypothesis (H0) and
the alternative hypothesis (Ha).
(ii) The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
(iii) The third step is to carry out the plan and physically analyze the sample data.
(iv) The fourth and final step is to analyze the results and either reject the null
hypothesis or fail to reject it.
Significance Level:
However, this does not mean that there is a 95% probability that the research
hypothesis is true. The p-value is conditional upon the null hypothesis being true and is
unrelated to the truth or falsity of the research hypothesis.
A p-value higher than 0.05 (> 0.05) is not statistically significant and indicates
that the evidence against the null hypothesis is insufficient. This means we retain the
null hypothesis. Note that we cannot accept the null hypothesis; we can only reject it or
fail to reject it.
                                 Null hypothesis (H0) is
                                 True                      False
Decision about    Don't reject   Correct inference         Type II error
null hypothesis                  (true negative)           (false negative)
(H0)                             (probability = 1−α)       (probability = β)
                  Reject         Type I error              Correct inference
                                 (false positive)          (true positive)
                                 (probability = α)         (probability = 1−β)
2.In the chi square test, a sample with a sufficiently large size is assumed. If the chi
square test is conducted on a sample with a smaller size, then the chi square test will yield
inaccurate inferences. The researcher, by using the chi square test on small samples,
might end up committing a Type II error.
3.In the chi square test, the observations are always assumed to be independent of each
other.
4. In the chi square test, the observations must have the same fundamental distribution.
A chi-squared test, also written as χ² test, is any statistical hypothesis
test where the sampling distribution of the test statistic is a chi-squared distribution when
the null hypothesis is true. Without other qualification, 'chi-squared test' often is used as
short for Pearson's chi-squared test. The chi-squared test is used to determine whether
there is a significant difference between the expected frequencies and the observed
frequencies in one or more categories.
In the standard applications of this test, the observations are classified into
mutually exclusive classes, and there is some theory, or say null hypothesis, which gives
the probability that any observation falls into the corresponding class. The purpose of the
test is to evaluate how likely the observations that are made would be, assuming the null
hypothesis is true.
Chi-squared tests are often constructed from a sum of squared errors, or
through the sample variance. Test statistics that follow a chi-squared distribution arise
from an assumption of independent normally distributed data, which is valid in many cases
due to the central limit theorem. A chi-squared test can be used to attempt rejection of the
null hypothesis that the data are independent.
Also considered a chi-squared test is a test in which this is asymptotically true,
meaning that the sampling distribution (if the null hypothesis is true) can be made to
approximate a chi-squared distribution as closely as desired by making the sample size
large enough.
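The formula for the chi-square statistic used in the chi square test is:
$\chi^{2} = \sum \frac{(O - E)^{2}}{E}$
where O is an observed frequency and E is the corresponding expected frequency.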
Solved Example :
A random sample of 395 people were surveyed and each person was asked to report
the highest education level they obtained. The data that resulted from the survey is
summarized in the following table:
Solution :
$\chi^{2} = \frac{(60 - 50.886)^{2}}{50.886} + \cdots + \frac{(57 - 48.132)^{2}}{48.132} = 8.006$
Since 8.006 > 7.815, we reject the null hypothesis and conclude that the
education level depends on gender at the 5% level of significance.
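The calculation of the chi-square statistic can be sketched in code. The survey table itself is not reproduced in the text, so the 2 × 4 table of counts below is an assumed reconstruction, chosen only because it is consistent with the figures that do appear (395 people, observed counts 60 and 57, expected counts 50.886 and 48.132, χ² = 8.006). It should not be read as the original data.

```python
# ASSUMED counts (not from the source): two gender rows by four education levels.
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(observed) - 1) * (len(col_totals) - 1)   # (rows - 1) * (columns - 1)
critical_value = 7.815                              # 5% critical value quoted in the text
reject_null = chi_square > critical_value

print(round(chi_square, 3), dof, reject_null)
```

With these assumed counts the degrees of freedom come out as 3, which matches the critical value of 7.815 used in the solution.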