
Statistical Techniques

Unit No : 1 Introduction to Statistics

Introduction to Statistics:

Definitions :

Statistics refers to techniques or methods relating to the collection,
classification, presentation, analysis and interpretation of quantitative data.
According to Seligman, “Statistics is the science which deals with the methods of
collecting, classifying, presenting, comparing and interpreting numerical data collected to
throw some light on any sphere of enquiry”.
Statistics is a mathematical body of science that pertains to the collection, analysis,
interpretation or explanation, and presentation of data; it is also described as a branch
of mathematics. Some consider statistics to be a distinct mathematical science rather than
a branch of mathematics. While many scientific investigations make use of data, statistics is
concerned with the use of data in the context of uncertainty and decision making in the
face of uncertainty.
In applying statistics to a problem, it is common practice to start with
a population or process to be studied. Populations can be diverse topics such as "all
people living in a country" or "every atom composing a crystal". Ideally, statisticians
compile data about the entire population. This may be organized by governmental
statistical institutes.
  Descriptive statistics can be used to summarize the population data. Numerical
descriptors include mean and standard deviation for continuous data types like income,
while frequency and percentage are more useful in terms of describing categorical
data like education.
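The two kinds of descriptors mentioned above can be sketched in a few lines of Python. This is only an illustration: the income and education values are made-up sample figures, not data from any real survey, and the variable names are chosen for this example.

```python
import statistics
from collections import Counter

# Hypothetical incomes (continuous data) for eight people.
incomes = [32000, 45000, 28000, 51000, 39000, 45000, 60000, 36000]

# Numerical descriptors for continuous data: mean and standard deviation.
mean_income = statistics.mean(incomes)
sd_income = statistics.pstdev(incomes)  # treats the eight values as the whole population

# Hypothetical education levels (categorical data) for the same people.
education = ["Graduate", "School", "Graduate", "Postgraduate",
             "School", "Graduate", "Postgraduate", "Graduate"]

# Descriptors for categorical data: frequency and percentage of each category.
frequency = Counter(education)
percentage = {level: 100 * count / len(education)
              for level, count in frequency.items()}

print("Mean income:", mean_income)
print("Standard deviation:", round(sd_income, 2))
print("Frequency:", dict(frequency))
print("Percentage:", percentage)
```

If the eight values were a sample rather than the whole population, `statistics.stdev` (which divides by n - 1) would be the usual choice instead of `statistics.pstdev`.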
When a census is not feasible, a chosen subset of the population called a sample is
studied. Once a sample that is representative of the population is determined, data is
collected for the sample members in an observational or experimental setting. Again,
descriptive statistics can be used to summarize the sample data. However, the drawing of
the sample has been subject to an element of randomness, hence the established
numerical descriptors from the sample are also subject to uncertainty. To still draw meaningful
conclusions about the entire population, inferential statistics is needed. It uses patterns in
the sample data to draw inferences about the population represented, accounting for
randomness. These inferences may take the form of answering yes/no questions about
the data (hypothesis testing), estimating numerical characteristics of the data (estimation),
describing associations within the data (correlation), and modeling relationships within the
data.

Importance of Statistics in modern business environment:


The field of statistics is the science of learning from data. Statistical knowledge
helps you use the proper methods to collect the data, employ the correct analyses, and
effectively present the results. Statistics is a crucial process behind how we make
discoveries in science, make decisions based on data, and make predictions. Statistics
allows you to understand a subject much more deeply.

1. Statistics Uses Numerical Evidence to Draw Valid Conclusions

Statistics are not just numbers and facts. You know, things like 4 out of 5 dentists prefer a
specific toothpaste. Instead, it’s an array of knowledge and procedures that allow you to
learn from data reliably. Statistics allow you to evaluate claims based on quantitative
evidence and help you differentiate between reasonable and dubious conclusions. That
aspect is particularly vital these days because data are so plentiful along with
interpretations presented by people with unknown motivations.

2. Statistics offer critical guidance in producing trustworthy analyses and predictions. Along
the way, statisticians can help investigators avoid a wide variety of analytical traps.
When analysts use statistical procedures correctly, they tend to produce accurate results.
In fact, statistical analyses account for uncertainty and error in the results. Statisticians
ensure that all aspects of a study follow the appropriate methods to produce trustworthy
results. These methods include:
a. Producing reliable data.
b. Analyzing the data appropriately.
c. Drawing reasonable conclusions.

3. Statisticians Know How to Avoid Common Pitfalls

Using statistical analyses to produce findings for a study is the culmination of a long
process. This process includes constructing the study design, selecting and measuring the
variables, devising the sampling technique and sample size, cleaning the data, and
determining the analysis methodology, among numerous other issues. The overall quality
of the results depends on the entire chain of events. A single weak link might produce
unreliable results, and many potential problems and analytical errors can affect a study.

4. Use of Statistics to Make an Impact in the Desired Field

Statistical analyses are used in
almost all fields to make sense of the vast amount of data that are available. Even if the
field of statistics is not your primary field of study, it can help you make an impact in your
chosen field. Chances are very high that you’ll need working knowledge of statistical
methodology both to produce new findings in your field and to understand the work of
others.

Scope of statistics :

Statistics plays a vital role in every field of human activity. Statistics helps in
determining the existing position of per capita income, unemployment, population growth
rates, housing, schooling, medical facilities, etc. in a country.
Now statistics holds a central position in almost every field, including industry,
commerce, trade, physics, chemistry, economics, mathematics, biology, botany,
psychology, astronomy, etc., so the application of statistics is very wide. Now we shall
discuss some important fields in which statistics is commonly applied.

(1) Business
Statistics plays an important role in business. A successful businessman must be
very quick and accurate in decision making. He knows what his customers want; he
should therefore know what to produce and sell and in what quantities.

Statistics helps businessmen to plan production according to the taste of the
customers, and the quality of the products can also be checked more efficiently by using
statistical methods. Thus, it can be seen that all business activities are based on statistical
information. Businessmen can make correct decisions about the location of business,
marketing of the products, financial resources, etc.

 (2) Economics
  Economics largely depends upon statistics. National income accounts are
multipurpose indicators for economists and administrators, and statistical methods are
used to prepare these accounts. In economics research, statistical methods are used to
collect and analyze the data and test hypotheses. The relationship between supply and
demand is studied by statistical methods; imports and exports, inflation rates, and per
capita income are problems which require a good knowledge of statistics.

(3) Mathematics
Statistics plays a central role in almost all natural and social sciences. The methods
used in natural sciences are the most reliable, but conclusions drawn from them are only
probable because they are based on incomplete evidence. Statistics helps in describing
these measurements more precisely. Statistics is a branch of applied mathematics. A
large number of statistical methods like probability, averages, dispersions, estimation, etc.,
are used in mathematics, and different techniques of pure mathematics like integration,
differentiation and algebra are used in statistics.

(4) Banking
Statistics plays an important role in banking. Banks make use of statistics for a
number of purposes. They work on the principle that not all depositors withdraw their
money from the bank at the same time. The bank earns profits out of these
deposits by lending them to others on interest. Bankers use statistical approaches based on
probability to estimate the number of deposits and their claims for a certain day.

(5) State Management (Administration)


Statistics is essential to a country. Different governmental policies are based on
statistics. Statistical data are now widely used in making all administrative decisions.
Suppose the government wants to revise the pay scales of employees in view of an
increase in the cost of living; statistical methods will be used to determine the rise in
the cost of living. The preparation of federal and provincial government budgets mainly
depends upon statistics because it helps in estimating the expected expenditures and
revenue from different sources. So statistics are the eyes of the administration of the state.

 (6) Accounting and Auditing


Accounting is impossible without exactness. But for decision making purposes, so
much precision is not essential; the decision may be made on the basis of approximation,
known as statistics. The correction of the values of current assets is made on the basis of
the purchasing power of money or its current value. In auditing, sampling techniques are
commonly used. An auditor determines the sample size to be audited on the basis of error.

(7) Natural and Social Sciences


Statistics plays a vital role in almost all the natural and social sciences. Statistical
methods are commonly used for analyzing experimental results and testing their
significance in biology, physics, chemistry, mathematics, meteorology, research,
chambers of commerce, sociology, business, public administration, communications and
information technology, etc.

(8) Astronomy
Astronomy is one of the oldest branches of statistical study; it deals with the
measurement of the distances, sizes, masses and densities of heavenly bodies by means
of observations. During these measurements errors are unavoidable, so the most probable
measurements are found by using statistical methods.
Example: The distance of the moon from the earth is measured. Since early times,
astronomers have been using statistical methods like the method of least squares to find the
movements of stars.
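As a rough illustration of how the method of least squares yields a "most probable" value from error-prone observations, the sketch below takes several repeated measurements of one distance and picks the single value that minimizes the sum of squared deviations. For this simplest case that value is the arithmetic mean. The measurements are invented for the example, and the function name `sum_squared_errors` is hypothetical.

```python
import statistics

# Hypothetical repeated measurements of the same distance (in thousand km),
# each affected by an unavoidable measurement error.
measurements = [384.1, 384.6, 384.3, 384.9, 384.2, 384.5]

def sum_squared_errors(candidate, data):
    """Sum of squared deviations of the data from a candidate value."""
    return sum((x - candidate) ** 2 for x in data)

# For a single unknown quantity, the least-squares estimate is the mean.
least_squares_estimate = statistics.mean(measurements)

# Any other candidate value gives a larger sum of squared errors.
for candidate in (least_squares_estimate - 0.1,
                  least_squares_estimate,
                  least_squares_estimate + 0.1):
    print(round(candidate, 3), round(sum_squared_errors(candidate, measurements), 4))
```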

Applications of Statistics:

In the age of information technology, statistics has a wide range of applications. Let’s
look at some important areas of application of statistics.

State Administration :

For the effective functioning of the State, Statistics is indispensable. Different
departments and authorities require various facts and figures on different matters. They use
this data to frame policies and guidelines in order to perform smoothly. Traditionally, people
used statistics to collect data pertaining to manpower, crimes, wealth, income, etc. for the
formation of suitable military and fiscal policies.
Over the years, with the change in the nature of functions of the State from maintaining law
and order to promoting human welfare, the scope of the application of statistics has changed
too.
Today, the State authorities collect statistics through their agencies on multiple
aspects like population, agriculture, defense, national income, oceanography, natural
resources, space research, etc.
Further, nearly all ministries at the Central as well as State level, rely heavily on statistics for
their smooth functioning. Also, the availability of statistical information enables
the government to frame policies and guidelines to improve the overall working of the
system.

Economics:
Economics is about allocating limited resources among unlimited ends in the most
optimal manner. Statistics offers information to answer some basic questions in economics:
• What to produce?
• How to produce?
• For whom to produce?

Statistical information helps in understanding economic problems and in formulating
economic policies. Traditionally, the application of statistics was limited since the economic
theories were based on deductive logic. Also, most statistical techniques were not developed
enough for application in all disciplines.
However, today, with computers and information technology, statistical data and
advanced techniques of statistical analysis are a boon to many. In economics, many scholars
have now shifted their stand from deductive logic to inductive logic in order to explain any
economic proposition. This inductive logic requires the observation of economic behavior of a
large number of units. Hence, it needs strong statistical support in the form of data and
techniques.
Economists have developed various theories and principles based on deductive
reasoning in the areas of production, distribution, exchange, consumption, business cycles,
taxation, etc.
These theories are for academic interest only unless they are put through an empirical test or
verification. Statistics enables us to compare these theories in real-life situations.
Statistics also help us in understanding various economic problems with precision and
clarity. Further, they enable us to frame policies in relevant areas for better results. To give
an example, wealth and income statistics help in the framing of policies for reducing
disparities of income. On the other hand, price statistics help us in understanding the
problem of inflation and the cost of living in the economy.

Economic Planning:
Economic planning is an important aspect of a country. For effective economic
planning, the authorities require information regarding different components of the
economy. This allows them to plan for the future efficiently. Statistics help in providing data as
well as tools to analyze the data. Some powerful techniques are index numbers, time series
analysis, and also forecasting. These are immensely useful in the analysis of data in
economic planning.
Further, statistical techniques help in framing planning models too. In India, the five-
year plans extensively use statistical tools.

Measurement of National Income and Components:


Statistics also allows the study and measurement of various national income components
and their compilation. It collects information on income, investment, saving, expenditure, etc.
and establishes the relationships between them.

Business Management and Industry:


In today’s world, business management is a complex process. This is due to a change
in:
• Size
• Technical know-how
• Quantum of production
• Number of employees
• Capital deployed
• Competition levels, etc.
Also, while planning, organizing, controlling, and communicating, the management is
confronted with many alternative courses of action. The trial and error method is not a great
way of making decisions.
Therefore, statistical data and powerful statistical techniques of probability, expectations,
sampling, significance tests, estimation theory, forecasting, etc. play an important role.

Social Sciences and Natural Science:
In social sciences, especially sociology, statistics are used in the field of demography
for studying mortality, fertility, marriage, population, and growth. Also, in psychology and
education, the intelligence quotient (IQ) is determined using statistics.

Biology and Medicine:


In biology and medical sciences, there is regular use of statistical tools for collecting,
presenting, and also analyzing the observed data pertaining to the causes of the incidence of
diseases.
For example, statistics on the pulse rate, body temperature, blood pressure, etc. of
patients help the physician in diagnosing the disease properly. Additionally, statistics help in
testing the efficacy of manufactured drugs, injections or medicines for controlling or curing
certain diseases.

Research:
Statistics help in the conduct of research in new areas and the opening of newer
vistas of knowledge to mankind.

Advantages of Statistics :

Plenty of companies naturally collect lots of data in the course of business. This is
especially true in the Internet age, when it's often possible to gather detailed information
about when customers do everything from opening emails to accessing particular items on a
company website. The role of statistics in business is in evaluating all of this information
to determine what it says about the company's operations and strategy.

Advantages of Statistics in Performance

One role of statistics in business is informing a manager working on employee
performance management. A manager collects data about employee productivity, such
as the number of tasks completed or the number of units produced. He or she must
analyze data to find ways in which an employee should improve to achieve maximum
productivity. Many companies also collect data about employee engagement and
happiness on the job, which can be tracked to not only keep workers motivated but
ensure they don't leave for other positions elsewhere.
For example, if a manager finds that an employee's number of finished outputs
drops by 20 percent every Friday, he or she should communicate with the employee,
setting the expectation that his or her output will remain above a minimum level every
day of the work week.
Many companies will also compile aggregate statistics about employee
performance. If a company finds that employees overall are doing less work right before
or after the weekend, its managers will want to consider ways to either motivate
employees or, if it turns out to be due to external factors, provide them with alternative
tasks they can do during downtimes. Companies may want to avoid collecting too much
data about employee activities, however, since it may come off as creepy to workers.

Evaluating Alternative Scenarios:

Beyond managing the performance of his or her own workers, a manager participates in
joint decision making with other managers. Statistics help the managers to compare
alternative scenarios and choose the best option for the company. For example, suppose
a team must decide which software to use for automating the customer ordering process.
They might consider which software products have been successfully used by
competitors and choose the most popular one, or they might find out how many orders
an ordering system can process on average daily. The team collects performance data
from software makers and independent sources, such as trade magazines, to inform
its purchasing decision.

In Data Collection :
Collecting data to use in statistics, or summarizing the data, is only an advantage
in business if a manager uses a logical approach and collects and reports data in an
ethical manner. For example, he might use statistics to determine if sales levels the
company achieved for the last few products launched were even close to projected sales
levels. He might decide that the least-performing product needs extra investment or
perhaps the company should shift resources from that product to a new product.
In some cases, it might be necessary to anonymize customer data or strip out
unimportant confidential parts to reduce the risk of a data breach or abuses by
employees or data consultants. Privacy laws also increasingly govern how companies
can use or store personal data, so it's important to make sure your business follows the
rules in jurisdictions where it's active.

Statistics in Research and Development:

A company also uses statistics in market research and product development,
using different surveys, such as random samples of consumers, to gauge the market for
a proposed product. A manager conducts surveys to determine if there is sufficient
demand among target consumers.
Survey results might justify spending on developing the product. A product launch
decision might also include a break-even analysis, such as finding out what percentage
of consumers must try a new product for it to be successful.
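A break-even calculation of this kind can be sketched as follows. All figures (development cost, price, unit cost, number of target consumers) are hypothetical, and the sketch makes the simplifying assumption that each consumer who tries the product buys exactly one unit.

```python
import math

# Hypothetical inputs, chosen only for illustration.
fixed_cost = 500_000         # one-time development and launch cost
price_per_unit = 50          # selling price of one unit
variable_cost_per_unit = 30  # cost of producing one unit
target_consumers = 200_000   # size of the target market

# Each unit sold contributes this much towards recovering the fixed cost.
margin_per_unit = price_per_unit - variable_cost_per_unit

# Units that must be sold so that total contribution covers the fixed cost.
break_even_units = math.ceil(fixed_cost / margin_per_unit)

# Assumption: one unit per consumer who tries the product.
break_even_percentage = 100 * break_even_units / target_consumers

print("Break-even units:", break_even_units)
print("Consumers who must try the product (%):", break_even_percentage)
```

A survey estimate of the share of consumers willing to try the product can then be compared with this break-even percentage.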

Limitations of Statistics:

(1) Statistical laws are true only on average. Statistics are aggregates of facts, so a single
observation is not a statistic. Statistics deal with groups and aggregates only.
(2) Statistical methods are best applicable to quantitative data.
(3) Statistics cannot be applied to heterogeneous data.
(4) If sufficient care is not exercised in collecting, analyzing and interpreting the data,
statistical results might be misleading.
(5) Only a person who has an expert knowledge of statistics can handle statistical data
efficiently.
(6) Some errors are possible in statistical decisions. In particular, inferential statistics
involves certain errors. We do not know whether an error has been committed or not.

Sources of data :

Sources of data can be classified into two types. Statistical sources refer to data that
are gathered for some official purposes and include censuses and officially
administered surveys. Non-statistical sources refer to the collection of data for other
administrative purposes or for the private sector.
Data sources can also be distinguished by where the data originate:
1. Internal Source: When data are collected from reports and records of the organisation
itself, it is known as the internal source.
For example, a company publishes its ‘Annual Report’ on Profit and Loss, Total Sales,
Loans, Wages etc.
2. External Source: When data are collected from outside the organisation, it is known as
the external source. For example, if a Tour and Travels Company obtains information on
‘Karnataka Tourism’ from Karnataka Transport Corporation, it would be known as an external
source of data.

Types of Data
A) Primary Data
• Primary data means ‘first-hand information’ collected by an investigator.
• It is collected for the first time.
• It is original and more reliable.
• For example, the population census conducted by the Government of India every
10 years.
B) Secondary Data
• Secondary data refers to ‘second-hand information’.
• These data are not originally collected by the investigator but are obtained from already
published or unpublished sources.
• For example, the address of a person taken from the telephone directory, or the phone
number of a company taken from ‘Just Dial’.

Methods of Collecting Primary Data (Primary sources)

1. Direct Personal Investigation
2. Indirect Oral Investigation
3. Information Through Correspondents
4. Telephonic Interview
5. Mailed Questionnaire
6. Questionnaire filled in by enumerators

Secondary Sources:

Secondary sources were created by someone who did not experience first-hand or
participate in the events or conditions you’re researching. For a historical research project,
secondary sources are generally scholarly books and articles.

A secondary source interprets and analyzes primary sources. These sources are one or
more steps removed from the event. Secondary sources may contain pictures, quotes or
graphics of primary sources.

Some types of secondary source include:

1. Textbooks
2. Journal articles
3. Histories
4. Criticisms
5. Commentaries
6. Encyclopedias

Universe or Population:

From a statistical point of view, the term ‘universe’ refers to the total of the items or units in
any field of inquiry, whereas the term ‘population’ refers to the total of items about which
information is desired. The attributes that are the object of study are referred to as
characteristics and the units possessing them are called elementary units. The
aggregate of such units is generally described as the population. Thus, all units in any field of
inquiry constitute the universe and all elementary units (on the basis of one characteristic or
more) constitute the population. Quite often, we do not find any difference between population
and universe, and as such the two terms are taken as interchangeable. However, a
researcher must necessarily define these terms precisely.
The population or universe can be finite or infinite. The population is said to be finite if it
consists of a fixed number of elements so that it is possible to enumerate it in its totality.
For instance, the population of a city and the number of workers in a factory are examples of
finite populations. The symbol ‘N’ is generally used to indicate how many elements (or
items) there are in a finite population. An infinite population is one in
which it is theoretically impossible to observe all the elements. Thus, in an infinite
population the number of items is infinite, i.e., we cannot have any idea about the total
number of items. The number of stars in the sky and the possible rolls of a pair of dice are
examples of infinite populations. One should remember that no truly infinite population of
physical objects actually exists, in spite of the fact that many such populations appear to be
very large. For practical purposes, we then use the term infinite population for a
population that cannot be enumerated in a reasonable period of time. In this way we use the
theoretical concept of infinite population as an approximation of a very large finite
population.

Sample:

A finite subset of the population, selected from it with the objective of investigating its
properties, is called a sample, and the number of units in the sample is known as the
sample size.

Concept of Sampling:

Sampling is a tool which enables us to draw conclusions about the characteristics of
the population after studying only those objects or items that are included in the sample.

The main objectives of the sampling theory are:

(i) To obtain the optimum results, i.e., the maximum information about the characteristics
of the population with the available resources at our disposal in terms of time, money and
manpower by studying the sample values only.

(ii) To obtain the best possible estimates of the population parameters.

Although the scientific development of the theory of sampling has taken place
only during the last few decades, the idea of sampling is very old. From time immemorial,
people have been using it without knowing that some scientific procedure has been used
in arriving at the conclusion. On inspecting a sample of a particular commodity, we arrive at a
conclusion about accepting or rejecting it. For example, the consumer examines only a
handful of rice, pulses or any other commodity in a shop to assess its quality and then
decides whether or not to buy it. The housewife usually tastes a spoonful of the cooked product
to ascertain if it is properly cooked and also to see if it contains the proper quantity of salt or
sugar. The consumer ascertains the quality of the grapes by tasting one or two from the
seller’s basket. The intelligence of individuals in a subject is estimated by the university
by giving them a 3-hour test. A businessman orders products after examining only a sample
of them.

The error involved in approximating the population characteristics on the basis of
the sample is known as sampling error and is inherent and unavoidable in any sampling
scheme.
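Sampling error can be made concrete with a small simulation. The sketch below builds an artificial population (the income figures are generated, not real), studies only a sample of it, and reports the gap between the sample mean and the population mean. The population size, sample size and seed are arbitrary choices for this illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so that the sketch is repeatable

# Hypothetical population: incomes of N = 10,000 earners.
population = [rng.gauss(40000, 8000) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Study only a sample of n = 100 units drawn at random.
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)

# Sampling error: the difference between the sample estimate and the
# population value it is meant to approximate.
sampling_error = sample_mean - population_mean

print("Population mean:", round(population_mean, 2))
print("Sample mean:    ", round(sample_mean, 2))
print("Sampling error: ", round(sampling_error, 2))
```

Changing the seed changes the error, which reflects the point that the error comes from the chance selection of the sample. In a real survey the population mean is unknown, so the error can only be estimated.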

Population :

In any statistical investigation the interest usually lies in studying the various
characteristics relating to items or individuals belonging to a particular group. This group of
individuals under study is known as the population or universe. For example, if an enquiry
is intended to determine the average per capita income of the people in a particular city,
the population will comprise all the earning people in the city. On the other hand, if we want
to study the expenditure habits of the families in that city, then the population will consist of
all the households in that city. Further, if we want to study the quality of the
manufactured product in an industrial concern during the day, then the population will
consist of the day’s total production.

Thus, “In Statistics, population is the aggregate of objects, animate or
inanimate, under study in any statistical investigation”. In sampling theory, the population
means the larger group from which the samples are drawn.

A population containing a finite number of objects or items is known as a finite
population, e.g., the students in a college, the day’s production in an industrial concern,
the population of a city or a town, etc.
On the other hand, a population having an infinite number of objects, or with the
number of objects so large as to appear practically infinite, is termed an infinite
population, e.g., the population of temperatures at various points of the thermosphere, the
population of the heights, weights or ages of the people in the country, the population of
stars in the sky, etc. Infinite populations are better for sampling studies.

The population may further be classified as existent or hypothetical. A
population consisting of concrete objects is known as an existent population, e.g., the
population of (i) the books in a library, (ii) the airplanes in the Indian Air Force, (iii) the
scooters in Delhi, etc.

On the other hand, if the population does not consist of concrete objects then it
is called a hypothetical population. For instance, the populations of the throws of a die or a
coin, thrown an infinite number of times, are hypothetical populations.

Types of Sampling :

The choice of an appropriate sampling design is of paramount
importance in the execution of a sample survey and is generally made keeping in view
the objectives and scope of the enquiry and the type of the universe to be sampled.
The sampling techniques may be broadly classified as follows:
(i) Purposive or Subjective or Judgment Sampling.
(ii) Probability Sampling.
(iii) Mixed Sampling.

(i) Purposive or Subjective or Judgment Sampling.


In this method, a desired number of sample units is selected
deliberately or purposely depending upon the object of the enquiry so that only the
important items representing the characteristics of the population are included in the
sample.

An obvious and serious drawback of this sampling scheme is that it is
highly subjective in nature, since the selection of the sample depends entirely on the
personal convenience, beliefs, biases and prejudices of the investigator. For example,
if in a socio-economic survey it is desired to study the standard of living of the people
in New Delhi and the investigator wants to show that the standard has gone down, then
he may include individuals in the sample only from the low income stratum of
society and exclude the people from posh colonies like South Extension, Greater
Kailash, Jor Bagh, Chanakyapuri and so on. This method cannot be worked out for
large samples and is expected to give good results in small samples only, provided the
selection of the sample is representative. This can be achieved if the investigator is
thoroughly skilled and experienced in the field of enquiry and knows the limitations of
such a selection. Further, since this scheme does not involve the principle of
probability, estimation of the sampling error depends upon hypotheses which are
rarely met in practice.

(ii) Probability Sampling:


Probability sampling provides a scientific technique of drawing samples from the
population according to some laws of chance in which each unit in the universe has
some definite pre-assigned probability of being selected in the sample. Different types
of probability sampling are those in which:

(i) Each sample unit has an equal chance of being selected.
(ii) Sampling units have varying probability of being selected.
(iii) Probability of selection of a unit is proportional to the sample size.
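The difference between cases (i) and (ii) can be sketched with Python's standard `random` module. The unit labels and the weights below are hypothetical, and case (iii) is not illustrated here.

```python
import random

rng = random.Random(7)  # fixed seed so that the sketch is repeatable

# Hypothetical sampling units.
units = ["A", "B", "C", "D", "E"]

# (i) Equal chance: every unit is equally likely to enter the sample.
equal_draw = rng.sample(units, 2)

# (ii) Varying probability: each unit has its own pre-assigned weight
# (hypothetical values). Unit "A" is five times as likely to be drawn as "B".
weights = [5, 1, 1, 1, 2]
weighted_draw = rng.choices(units, weights=weights, k=2)  # draws with replacement

print("Equal-probability sample:  ", equal_draw)
print("Varying-probability sample:", weighted_draw)
```

Note that `random.choices` draws with replacement, so the same unit can appear twice in the weighted draw.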

(iii) Mixed sampling:


Sampling design in which the sample units are selected partly according to
some probability laws, and partly according to a fixed sampling rule (no use of
chance), is known as Mixed Sampling.

Some of the important types of sampling schemes are given below:

(i) Simple Random Sampling
(ii) Stratified Random Sampling
(iii) Systematic Sampling
(iv) Multistage Sampling
(v) Quasi Random Sampling
(vi) Area Sampling
(vii) Simple Cluster Sampling
(viii) Multistage Cluster Sampling
(ix) Quota Sampling

Simple Random Sampling :

Simple random sampling is the technique in which the sample is so drawn that
each and every unit in the population has an equal and independent chance of being
included in the sample.

A simple random sample is a subset of a statistical population in which each


member of the subset has an equal probability of being chosen. A simple random sample
is meant to be an unbiased representation of a group.
An example of a simple random sample would be the names of 25 employees
being chosen out of a hat from a company of 250 employees. In this case, the population is
all 250 employees, and the sample is random because each employee has an equal
chance of being chosen. Random sampling is used in science to conduct randomized
controlled trials or blinded experiments.

If the unit selected in any draw is not replaced in the population before making
the next draw, then it is known as simple random sampling without replacement and if it is
replaced back before making the next draw, then the sampling plan is called simple
random sampling with replacement . Thus, simple random sampling with replacement
always amounts to sampling from an infinite population, even though the population is
finite.

Researchers can create a simple random sample using a couple of methods. With
a lottery method, each member of the population is assigned a number, after which
numbers are selected at random.

The example in which the names of 25 employees out of 250 are chosen out of
a hat is an example of the lottery method at work. Each of the 250 employees would be
assigned a number between 1 and 250, after which 25 of those numbers would be chosen
at random.
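The lottery method for the 25-out-of-250 example can be sketched in a few lines of Python. This is only an illustration: the employee numbers 1 to 250 and the fixed seed are assumptions made so that the draw is reproducible, and the second draw shows what sampling with replacement would look like.

```python
import random

# Hypothetical population: employee numbers 1..250 (the lottery "tickets").
population = list(range(1, 251))

# A fixed seed is used only so that the example is reproducible.
rng = random.Random(42)

# Without replacement: a drawn number is not returned to the hat,
# so no employee can be selected twice.
sample_without = rng.sample(population, k=25)

# With replacement: every draw is made from the full population again,
# so the same employee may be selected more than once.
sample_with = rng.choices(population, k=25)

print(sorted(sample_without))
print(sorted(sample_with))
```

In the draw without replacement all 25 numbers are distinct, while the draw with replacement may contain repeats.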

Because individuals who make up the subset of the larger group are chosen at
random, each individual in the large population set has the same probability of being
selected. This creates, in most cases, a balanced subset that carries the greatest potential
for representing the larger group as a whole, free from any bias.

For larger populations, a manual lottery method can be quite onerous. Selecting
a random sample from a large population usually requires a computer-generated process,
by which the same methodology as the lottery method is used, only the number
assignments and subsequent selections are performed by computers, not humans.

Advantages of Simple Random Samples

1. Ease of use represents the biggest advantage of simple random sampling.

2. Unlike more complicated sampling methods, such as stratified random sampling,
no need exists to divide the population into sub-populations or take
any other additional steps before selecting members of the population at random.

3. A simple random sample is meant to be an unbiased representation of a group. It is
considered a fair way to select a sample from a larger population since every member of
the population has an equal chance of getting selected.

Disadvantages of Simple Random Samples

1. A sampling error can occur with a simple random sample if the sample does not end up
accurately reflecting the population it is supposed to represent. For example, in our simple
random sample of 25 employees, it would be possible to draw 25 men even if the
population consisted of 125 women and 125 men. For this reason, simple random
sampling is more commonly used when the researcher knows little about the population. If
the researcher knew more, it would be better to use a different sampling technique, such
as stratified random sampling, which helps to account for the differences within the
population, such as age, race or gender.

2. Other disadvantages include the fact that for sampling from large populations, the
process can be time-consuming and costly compared to other methods.

Stratified Random Sampling:

Stratified random sampling is a method of sampling that involves the division of a
population into smaller sub-groups known as strata. In stratified random sampling or
stratification, the strata are formed based on members' shared attributes or characteristics
such as income or educational attainment.

Stratified random sampling is also called proportional random sampling or quota random
sampling.

When completing analysis or research on a group of entities with similar
characteristics, a researcher may find that the population is too large to study in
full. To save time and money, an analyst may take a more feasible
approach by selecting a small group from the population. The small group is referred to as
a sample, which is a subset of the population that is used to represent the entire
population. A sample may be selected from a population in a number of ways, one of
which is the stratified random sampling method.

Stratified random sampling involves dividing the entire population into
homogeneous groups called strata (singular is stratum). Random samples are then
selected from each stratum. For example, consider an academic researcher who would
like to know the number of MBA students in 2007 who received a job offer within three
months of graduation. He will soon find that there were almost 200,000 MBA graduates for
the year. He might decide to just take a simple random sample of 50,000 graduates and
run a survey. Better still, he could divide the population into strata and take a random
sample from the strata. To do this, he would create population groups based on gender,
age range, race, country of nationality, and career background. A random sample from
each stratum is taken in a number proportional to the stratum's size when compared to the
population. These subsets of the strata are then pooled to form a random sample.

Procedure for stratified random sampling:


Suppose a research team wants to determine the GPA of college students
across the U.S. The research team has difficulty collecting data from all 21 million college
students; it decides to take a random sample of the population by using 4,000 students.

Now assume that the team looks at the different attributes of the sample
participants and wonders if there are any differences in GPAs and students' majors.
Suppose it finds that 560 students are English majors, 1,135 are science majors, 800 are
computer science majors, 1,090 are engineering majors, and 415 are math majors. The
team wants to use a proportional stratified random sample, where each stratum's share of
the sample is proportional to its share of the population.

Assume the team researches the demographics of college students in the U.S.
and finds the percentage of students in each major: 12% major in English, 28% major in
science, 24% major in computer science, 21% major in engineering, and 15% major in
mathematics. Thus, five strata are created from the stratified random sampling process.

The team then needs to confirm that each stratum's share of the population is in
proportion to its share of the sample; however, they find the proportions are not equal.
The team then needs to re-sample 4,000 students from the population and randomly
select 480 English, 1,120 science, 960 computer science, 840 engineering, and 600
mathematics students. With those, it has a proportionate stratified random sample of
college students, which provides a better representation of students' college majors in the
U.S. The researchers can then highlight a specific stratum, observe the varying studies of
U.S. college students and observe the various grade point averages.
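The proportional allocation in this procedure is simple arithmetic: each stratum receives its population percentage of the 4,000 sample places. A minimal sketch, with the dictionary and variable names chosen here only for illustration:

```python
# Population share of each major in percent, as quoted in the example.
share_percent = {
    "English": 12,
    "Science": 28,
    "Computer science": 24,
    "Engineering": 21,
    "Mathematics": 15,
}

sample_size = 4000

# Proportional allocation: stratum sample size = share of population x total sample size.
# Integer arithmetic is used so the results are exact.
allocation = {
    major: percent * sample_size // 100
    for major, percent in share_percent.items()
}

for major, n in allocation.items():
    print(f"{major}: {n}")
```

The allocation reproduces the figures 480, 1,120, 960, 840 and 600, which add up to the full sample of 4,000. Within each stratum the students would then be drawn by simple random sampling.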

Advantages of Stratified Random Sampling

1. The main advantage of stratified random sampling is that it captures key population
characteristics in the sample. Similar to a weighted average, this method of sampling
produces characteristics in the sample that are proportional to the overall population.

2.Stratified random sampling works well for populations with a variety of attributes but is
otherwise ineffective if subgroups cannot be formed.

3. Stratification gives a smaller error in estimation and greater precision than the simple
random sampling method. The greater the differences between the strata, the greater the
gain in precision.

Disadvantages of Stratified Random Sampling

1.Unfortunately, this method of research cannot be used in every study. The method's
disadvantage is that several conditions must be met for it to be used properly.
2.Researchers must identify every member of a population being studied and classify each
of them into one, and only one, subpopulation. As a result, stratified random sampling
is disadvantageous when researchers can't confidently classify every member of the
population into a subgroup. Also, finding an exhaustive and definitive list of an
entire population can be challenging. 

3.Overlapping can be an issue if there are subjects that fall into multiple subgroups. When
simple random sampling is performed, those who are in multiple subgroups are more likely
to be chosen. The result could be a misrepresentation or inaccurate reflection of the
population. 

4. The sorting process becomes more difficult, rendering stratified random sampling an
ineffective and less than ideal method.

Cluster Sampling:

Cluster sampling refers to a type of sampling method. With cluster sampling, the
researcher divides the population into separate groups, called clusters. Then, a simple
random sample of clusters is selected from the population. The researcher conducts his
analysis on data from the sampled clusters.

Compared to simple random sampling and stratified sampling, cluster sampling
has advantages and disadvantages. For example, given equal sample sizes, cluster
sampling usually provides less precision than either simple random sampling or stratified
sampling. On the other hand, if travel costs between clusters are high, cluster sampling
may be more cost-effective than the other methods.

Cluster sampling refers to a sampling method that has the following properties.

• The population is divided into N groups, called clusters.
• The researcher randomly selects n clusters to include in the sample.
• The number of observations within each cluster, $M_i$, is known, and
$M = M_1 + M_2 + M_3 + \ldots + M_{N-1} + M_N$.
• Each element of the population can be assigned to one, and only one, cluster.

Assuming the sample size is constant across sampling methods, cluster sampling
generally provides less precision than either simple random sampling or stratified
sampling. This is the main disadvantage of cluster sampling.
Given this disadvantage, it is natural to ask: Why use cluster sampling?
Sometimes, the cost per sample point is less for cluster sampling than for other sampling
methods. Given a fixed budget, the researcher may be able to use a bigger sample with
cluster sampling than with the other methods. When the increased sample size is sufficient
to offset the loss in precision, cluster sampling may be the best choice.

Conditions to use Cluster Sampling :

Cluster sampling should be used only when it is economically justified, that is, when
reduced costs can be used to overcome losses in precision. This is most likely to occur in
the following situations.

Constructing a complete list of population elements is difficult, costly, or impossible.


For example, it may not be possible to list all of the customers of a chain of hardware
stores. However, it would be possible to randomly select a subset of stores (stage 1 of
cluster sampling) and then interview a random sample of customers who visit those stores
(stage 2 of cluster sampling).

The population is concentrated in "natural" clusters (city blocks, schools, hospitals,
etc.). For example, to conduct personal interviews of operating room nurses, it might make
sense to randomly select a sample of hospitals (stage 1 of cluster sampling) and then
interview all of the operating room nurses at those hospitals. Using cluster sampling, the
interviewer could conduct many interviews in a single day at a single hospital. Simple
random sampling, in contrast, might require the interviewer to spend all day traveling to
conduct a single interview at a single hospital.

Multistage Sampling:

Multistage sampling can be a complex form of cluster sampling, because it is a type of
sampling which involves dividing the population into groups (or clusters). In ordinary
cluster sampling, one or more clusters are chosen at random and everyone within the chosen
clusters is sampled.

Using all the sample elements in all the selected clusters may be prohibitively
expensive or unnecessary. Under these circumstances, multistage cluster sampling
becomes useful. Instead of using all the elements contained in the selected clusters, the
researcher randomly selects elements from each cluster. Constructing the clusters is the
first stage. Deciding what elements within the cluster to use is the second stage. The
technique is used frequently when a complete list of all members of the population does
not exist or is inappropriate.
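The difference between one-stage cluster sampling and two-stage (multistage) sampling can be sketched as follows. Everything in this example is hypothetical: 20 stores act as clusters, each with 50 made-up customer identifiers, and the numbers of stores and customers drawn are arbitrary.

```python
import random

rng = random.Random(7)  # fixed seed only for reproducibility

# Hypothetical population: 20 stores (clusters), each with 50 customers.
clusters = {
    f"store_{s}": [f"store_{s}_customer_{c}" for c in range(50)]
    for s in range(20)
}

# Stage 1: a simple random sample of clusters.
chosen_stores = rng.sample(sorted(clusters), k=4)

# One-stage cluster sample: every element of each chosen cluster is taken.
one_stage = [cust for store in chosen_stores for cust in clusters[store]]

# Stage 2 (multistage): a simple random sample of elements
# is drawn inside each chosen cluster instead of taking all of them.
two_stage = [
    cust
    for store in chosen_stores
    for cust in rng.sample(clusters[store], k=10)
]

print(len(one_stage), len(two_stage))
```

With these assumed sizes the one-stage sample contains 200 customers and the two-stage sample only 40, which is the cost saving the text describes.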

Advantages of multistage sampling:

1. Cost and speed with which the survey can be done
2. Convenience of finding the survey sample
3. Normally more accurate than cluster sampling for the same size sample

Disadvantages of multistage sampling:

1. It is not as accurate as a simple random sample of the same size.
2. It is difficult to go for more testing.

Quota Sampling:

In quota sampling, a population is first segmented into mutually exclusive
sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units
from each segment based on a specified proportion. For example, an interviewer may be
told to sample 300 females and 400 males between the ages of 45 and 60. This means that
individuals can put a demand on who they want to sample (targeting).
This second step makes the technique non-probability sampling. In quota
sampling, there is non-random sample selection and this can be unreliable. For example,
interviewers might be tempted to interview those people in the street who look most
helpful, or may choose to use accidental sampling to question those closest to them, to
save time. The problem is that these samples may be biased because not everyone gets a
chance of selection, whereas in stratified sampling (its probabilistic version), the chance of
any unit of the population is the same as 1/n (n= number of units in the population). This
non-random element is a source of uncertainty about the nature of the actual sample and
quota versus probability has been a matter of controversy for many years.
Quota sampling is useful when time is limited, a sampling frame is not
available, the research budget is very tight or detailed accuracy is not important. Subsets
are chosen and then either convenience or judgment sampling is used to choose people
from each subset. The researcher decides how many of each category are selected.

Classification, Tabulation and Presentation of Data:

Classification refers to a process wherein data is arranged, based on the
characteristic under consideration, into classes or groups, as per resemblance of
observations. Classification puts the data in a condensed form, as it removes unnecessary
details, which helps to easily comprehend the data.

The data collected for the first time is raw data, and so it is arranged in a haphazard
manner, which does not provide a clear picture. The classification of data reduces the
large volume of raw data into homogeneous groups, i.e. data having common
characteristics or nature are placed in one group and thus, the whole data is divided
into a number of groups. There are four types of classification:

• Qualitative Classification or Ordinal Classification
• Quantitative Classification
• Chronological or Temporal Classification
• Geographical or Spatial Classification

Tabulation

Tabulation refers to a logical data presentation, wherein raw data is summarized and
displayed in a compact form, i.e. in statistical tables. In other words, it is a systematic
arrangement of data in columns and rows, that represents data in concise and attractive
way. One should follow the given guidelines for tabulation.

• A serial number should be allotted to the table, in addition to the self-explanatory
title.
• The statistical table is required to be divided into four parts, i.e. Box head, Stub,
Caption and Body. The complete upper part of the table that contains columns and
sub-columns, along with caption, is the Box Head. The left part of the table, giving
description of rows, is called the stub. The part of the table that contains numerical figures
and other content is its body.
• Length and width of the table should be perfectly balanced.
• Presentation of data should be such that it takes less time and labor to make
comparisons between various figures.
• Footnotes, explaining the source of data or any other thing, are to be presented at
the bottom of the table.

Requisites of a good classification:

Following are the essential characteristics of a good classification:-

1. Clarity: There should be no uncertainty or ambiguity as to which class or group the
collected data is to be kept in.

2. Stability: To make the data suitable for comparison and to meaningfully compare the
results, it is necessary that the classification has stability.

3. Extensiveness: The various classes should be created in such an extensive manner
that no data-item from the collected data is left out; each item is necessarily included in
some class. If required, a miscellaneous class can be created; e.g. while creating
classes on the basis of marital status, widower, widow, divorcee, etc. cannot be included in
the married and unmarried classes. Hence, classification should be complete and extensive.

4. Suitability: Classes should be created in accordance with the objectives. For example, in order to
know the financial status or saving tendency of people, it would be suitable to create
classes on the basis of income.

5. Flexibility: The classification should be flexible enough to accommodate change,
amendment and inclusion in various classes in accordance with new situations.

6. Homogeneity: Units of each class should be homogeneous. All the units (data-items)
included in a class or group should be alike with respect to the property on the basis of which
the classification was done.

Types of classification:

Geographical Classification

Under this type of classification, the data are classified on the basis of area or place, and
as such, this type of classification is also known as areal or spatial classification. The
areas may be in terms of countries, states, districts, or zones, according to how the data are
distributed. For the purpose of ready reference and ranking, the different classes formed under the
classification should be arranged in alphabetical order or in order of the size of their frequencies,
respectively. Generally, in the case of reference tables, alphabetical arrangements are made,
while in the case of summary tables, ranking arrangements are made.

This type of classification is suitable for those data which are distributed
geographically relating to a phenomenon, viz. population, mineral resources, production,
sales, students of universities etc.

Chronological Classification

Under this type of classification, the data collected are classified on the basis of the time of
their occurrence. As such, the series obtained under this classification is known as
a time series. This type of classification is suitable for those data which take place in
the course of time, viz. population, production, sales, results etc. The different classes obtained
under this classification are arranged in order of time, which may begin either with the
earliest or the latest period.

Qualitative Classification

Under this type of classification, the data obtained are classified on the basis of a certain
descriptive character or qualitative aspect of a phenomenon, viz. sex, beauty, literacy,
honesty, intelligence, religion, eye-sight etc.
As such, this sort of classification is also known as 'descriptive classification'.
Such classifications are usually dichotomous in nature, in which the whole data are
divided into two groups, viz. a group with the presence of the attribute and a group with
its absence, such as blind and not-blind, or deaf and not-deaf etc.

Quantitative Classification

Under this type of classification, the collected data are classified on the basis of a certain
variable, viz. marks, income, expenditure, profit, loss, height, weight, age, price, production
etc., which is capable of quantitative measurement. This type of classification is also known as
'classification by variables'.

Tabulation - Frequency and Frequency Distribution:

The frequency is the number of times a particular data point occurs in the set of
data. A frequency distribution is a table that lists each data point and its frequency. The
relative frequency is the frequency of a data point expressed as a percentage of the total
number of data points.

Example 1. Find the frequency of data items 1 and 6 in the following list.

1, 3, 6, 4, 5, 6, 3, 4, 6, 3, 6.

Ans: The frequency of the data point 1 is 1 and the frequency of the data point 6 is 4.
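The same count, together with the relative frequency defined above, can be checked with a short Python sketch. The use of `collections.Counter` is a choice made here for illustration; the variable names are not from the text.

```python
from collections import Counter

# The list from Example 1.
data = [1, 3, 6, 4, 5, 6, 3, 4, 6, 3, 6]

# Frequency: number of times each data point occurs.
freq = Counter(data)

# Relative frequency: frequency as a percentage of the total number of data points.
n = len(data)
relative = {value: 100 * count / n for value, count in freq.items()}

print(freq[1], freq[6])        # prints "1 4"
print(round(relative[6], 2))   # 4 out of 11 data points, about 36.36 percent
```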

A frequency distribution shows us a summarized grouping of data divided into
mutually exclusive classes and the number of occurrences in each class. It is a way of
organizing raw data, for example to show the results of an election, the income of people in a
certain region, sales of a product within a certain period, student loan amounts of
graduates, etc. Some of the graphs that can be used with frequency distributions
are histograms, line charts, bar charts and pie charts. Frequency distributions are used for
both qualitative and quantitative data.

There are two types of frequency distributions:

a) Ungrouped frequency distribution


b) Grouped frequency distribution

Ungrouped frequency distribution:

When the data items are few in number, an ungrouped frequency distribution is
prepared.

Example: From the following numbers of family members in 15 families, prepare an
ungrouped frequency distribution.

4 5 4 3 2 8 5 4 6 4 8 7 4 3 4

Ans: Here the number of family members varies from 2 to 8, and hence an ungrouped
frequency distribution can be prepared.

No. of family members    Tally marks    Frequency
2                        I              1
3                        II             2
4                        IIII I         6
5                        II             2
6                        I              1
7                        I              1
8                        II             2

Grouped Frequency Distribution :

When the data items are large in number and the range is high,
a grouped frequency distribution table is prepared. In this case the data items are
grouped into classes, each with a lower limit and an upper limit. Generally the
classes are to be formed in such a way that the number of classes is neither too many
nor too few. Ideally, 5 to 12 classes are used.

Example : Prepare grouped frequency distribution table for the following data showing
marks obtained by the students.

25 45 56 47 22 45 56 78 89 45 46 45 52
12 13 09 15 07 56 58 54 56 57 42 28 56
23 45 51 01 26 55 66 54 77 38

Solution: Here the largest number is 89 and the smallest number is 01. Therefore

Range = 89 - 01 = 88

As the range is large, we should prepare a grouped frequency distribution table. The
classes are to be 00-10, 10-20, 20-30 and so on. This gives 9 classes, which is
good enough.

Marks    Tally Marks     Frequency
00-10    III             3
10-20    III             3
20-30    IIII            5
30-40    I               1
40-50    IIII III        8
50-60    IIII IIII II    12
60-70    I               1
70-80    II              2
80-90    I               1
Total                    36
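The grouped table can be verified with a short sketch. One assumption is made that the text does not state: a mark equal to an upper limit would be counted in the next class (the exclusive method). None of the 36 marks falls exactly on a class limit, so the assumption does not affect this example.

```python
# Marks of the 36 students from the example.
marks = [25, 45, 56, 47, 22, 45, 56, 78, 89, 45, 46, 45, 52,
         12, 13, 9, 15, 7, 56, 58, 54, 56, 57, 42, 28, 56,
         23, 45, 51, 1, 26, 55, 66, 54, 77, 38]

width = 10  # class width, giving the classes 00-10, 10-20, ..., 80-90

# Count the marks in each class; the lower limit is included and
# the upper limit is excluded (assumed convention).
counts = {}
for lower in range(0, 90, width):
    upper = lower + width
    counts[(lower, upper)] = sum(lower <= m < upper for m in marks)

for (lower, upper), f in counts.items():
    print(f"{lower:02d}-{upper:02d}  {f}")
print("Total ", sum(counts.values()))
```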

Diagrammatic and graphic representation of Data :

Data can also be represented graphically. This is useful for drawing quick conclusions
regarding data trends.

There are many methods of data representation with diagrams and graphs.

Bar diagrams

Bar diagrams are the pictorial representation of data (generally grouped), in the
form of vertical or horizontal rectangular bars, where the lengths of the bars are proportional to
the measure of data. They are also known as bar charts. Bar graphs are one of the means
of data handling in statistics.
The collection, presentation, analysis, organization, and interpretation of
observations of data are known as statistics. The statistical data can be represented by
various methods such as tables, bar graphs, pie charts, histograms, frequency polygons.

Types of Bar Charts


The bar graphs can be vertical or horizontal. The primary feature of any bar graph is its
length or height. The longer the bar, the greater the value it
represents.
Bar graphs normally show categorical and numeric variables arranged in class intervals.
They consist of an axis and a series of labeled horizontal or vertical bars. The bars
represent frequencies of distinctive values of a variable or commonly the distinct values
themselves. The number of values on the x-axis of a bar graph or the y-axis of a column
graph is called the scale.
The types of bar charts are as follows:

1. Vertical bar chart
2. Horizontal bar chart

Even though the graph can be plotted either horizontally or vertically, the most usual type
of bar graph is the vertical bar graph. The orientation of the x-axis and y-axis
changes depending on whether the chart is vertical or horizontal. Apart from the vertical
and horizontal bar graphs, two other types of bar charts are:

• Grouped Bar Graph
• Stacked Bar Graph

Vertical Bar Graphs
When the grouped data are represented vertically in a graph or chart with the help of bars,
where the bars denote the measure of data, such graphs are called vertical bar graphs.
The data is represented along the y-axis of the graph, and the height of the bars shows
the values.

Horizontal Bar Graphs


When the grouped data are represented horizontally in a chart with the help of bars, then
such graphs are called horizontal bar graphs, where the bars show the measure of data.
The data is depicted here along the x-axis of the graph, and the length of the bars denote
the values.

Grouped Bar Graph


Grouped bar graph is also called the clustered bar graph, which is used to represent the
discrete value for more than one object that shares the same category. In this type of bar
chart, the total number of instances are combined into a single bar. In other words, a
grouped bar graph is a type of bar graph in which different sets of data items are
compared. Here, a single color is used to represent the specific series across the set. The
grouped bar graph can be represented using both vertical and horizontal bar charts.

Stacked Bar Graph


Stacked bar graph is also called the composite bar chart, which divides the aggregate into
different parts. In this type of bar graph, each part can be represented using different
colors, which helps to easily identify the different categories. The stacked bar chart
requires specific labeling to show the different parts of the bar. In a stacked bar graph,
each bar represents the whole and each segment represents the different parts of the
whole.

Simple Bar Diagram:

Example: In a firm, the percentage of monthly salary saved by each employee is given in
the following table. Represent it through a bar graph.

Savings (in percentage)            10     20     30     40     50     60
Number of Employees (Frequency)    100    110    130    120    140    120
Ans :

[Figure: simple bar diagram. x-axis: Savings (in percentage), marked 10 to 60; y-axis: number of employees, marked 20 to 140.]
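As a rough stand-in for the drawn diagram, the bars can be printed as text. This sketch draws a horizontal bar chart rather than the vertical one in the figure, and the scale of one `#` per 10 employees is an arbitrary choice made here.

```python
# Data from the example table.
savings = [10, 20, 30, 40, 50, 60]            # savings in percent
employees = [100, 110, 130, 120, 140, 120]    # number of employees

unit = 10  # one '#' stands for 10 employees (assumed scale)

# The length of each bar is proportional to the frequency it represents.
rows = [
    f"{s:>3}% | {'#' * (e // unit)} {e}"
    for s, e in zip(savings, employees)
]

print("\n".join(rows))
```

The longest bar belongs to the 50% class, which has the highest frequency (140).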

Pie chart:

A pie chart is a type of graph that represents the data in the circular graph. The
slices of pie show the relative size of the data. It is a type of pictorial representation of
data. A pie chart requires a list of categorical variables and the numerical variables. Here,
the term “pie” represents the whole, and the “slices” represents the parts of the whole. 

The pie chart is also known as a circle chart. It divides a circular statistical
graphic into sectors or slices in order to illustrate numerical proportions. Each sector
denotes a proportionate part of the whole. A pie chart works best for showing the
composition of something. In many such cases, pie charts replace other
graphs like the bar graph, line plots, histograms etc.

The entire circle subtends an angle of 360 degrees. The various data item values are
converted into angles in degrees, and then the pie chart is prepared.

Example: Draw a pie chart from the following information.

Heads of expenditure    Food    Education    Bills    Loan    Rent    Savings
Expenditure in Rs       5400    10800        3600     7200    2700    2700

Ans: First we have to find the total expenditure, which is not given here.

Total expenditure = 5400 + 10800 + 3600 + 7200 + 2700 + 2700 = 32400

Now we find the angle for each head as shown in the following table. The working is
shown for the first two angles.

Heads of expenditure    Food    Education    Bills    Loan    Rent    Savings    Total
Expenditure in Rs       5400    10800        3600     7200    2700    2700       32400
Angle (degrees)         60      120          40       80      30      30         360

Food: $\frac{5400}{32400} \times 360 = 60$

Education: $\frac{10800}{32400} \times 360 = 120$

We take a suitable radius, generally 4 cm, to draw the circle.

[Figure: pie chart of the expenditure heads (Food, Education, Bills, Loan, Rent, Savings).]
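The conversion of each value into a sector angle can be checked with a few lines of Python. The sketch uses the figures from the angle table (Savings = 2700, total = 32400); the variable names are chosen here for illustration.

```python
# Expenditure in Rs for each head, as in the angle table.
expenditure = {
    "Food": 5400,
    "Education": 10800,
    "Bills": 3600,
    "Loan": 7200,
    "Rent": 2700,
    "Savings": 2700,
}

total = sum(expenditure.values())

# Sector angle = (value / total) x 360 degrees.
angles = {head: amount * 360 / total for head, amount in expenditure.items()}

for head, angle in angles.items():
    print(f"{head}: {angle:.0f} degrees")
print("Total:", total, "Rs,", sum(angles.values()), "degrees")
```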

Histogram:

A histogram is an approximate representation of the distribution of numerical data. It
was first introduced by Karl Pearson.

To construct a histogram, the first step is to divide the entire range of values into a series
of intervals and then count how many values fall into each interval. The intervals are usually
specified as consecutive and non-overlapping. The intervals must be adjacent and are often of equal
size. A histogram is similar to a simple bar diagram, except that the bars are joined with no gaps between them.

Example: Draw a histogram from the following data.

Class        10-20    20-30    30-40    40-50    50-60    60-70    70-80    80-90
Frequency    12       26       45       56       65       34       17       09

Ans: Here the classes are continuous, and hence we can construct the histogram directly.

[Figure: histogram of the above data. Scale: 1 cm = 10 units on both the axes. x-axis: Class, marked 0 to 90; y-axis: Frequency, marked 10 to 90.]

Frequency polygon :

A frequency polygon is a graph constructed by using lines to join the midpoints of each interval, or
bin. The heights of the points represent the frequencies. A frequency polygon can be created from
the histogram or by calculating the midpoints of the intervals from the frequency distribution
table.

We can draw a histogram and then join the midpoints of the tops of the bars in consecutive
order to get the frequency polygon.

Example: Draw a frequency polygon from the following data.

Class        10-20    20-30    30-40    40-50    50-60    60-70    70-80    80-90
Frequency    12       26       45       56       65       34       17       09

Ans: Here the classes are continuous, and hence we can construct the histogram as shown in the above
problem. Now we join the midpoints of the tops of the bars of the histogram.

[Figure: frequency polygon drawn over the histogram. Scale: 1 cm = 10 units on both the axes. x-axis: Class, marked 0 to 100; y-axis: Frequency, marked 10 to 90.]
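The points that are joined to form the frequency polygon are the class midpoints paired with the class frequencies. A minimal sketch of that calculation, with names chosen here for illustration:

```python
# Classes and frequencies from the example.
classes = [(10, 20), (20, 30), (30, 40), (40, 50),
           (50, 60), (60, 70), (70, 80), (80, 90)]
frequencies = [12, 26, 45, 56, 65, 34, 17, 9]

# Each plotted point is (midpoint of the class, frequency of the class).
points = [
    ((lower + upper) / 2, f)
    for (lower, upper), f in zip(classes, frequencies)
]

for x, y in points:
    print(x, y)
```

Joining these points in order with straight lines gives the polygon; the highest point lies at the midpoint 55 of the class 50-60.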

Ogive curves:

An ogive graph plots cumulative frequency on the y-axis and class boundaries along the
x-axis. It is very similar to a histogram, only instead of rectangles, an ogive has a single point marking
where the top right of the rectangle would be. It is usually easier to create this kind of graph from
a frequency table.

There are two types of ogive curves that can be drawn.

1.Less than upper class frequency distribution curve.

2.more than lower class frequency distribution curve.

In order to draw ogive curves, the data should form a continuous distribution. We have
to find the cumulative frequencies, either of the less than upper class type or of the more than lower class type.
For a less than ogive, we take the plotting points (x, y), where x stands for the upper boundary of each class and y stands for
the cumulative frequency of that class. For a more than ogive, x stands for the lower boundary of each class.

Example: Draw the less than upper class ogive curve from the following data.

Class        10-20    20-30    30-40    40-50    50-60    60-70    70-80    80-90
Frequency    10       30       60       80       65       45       25       15

Solution: Here the classes are continuous, and hence we can proceed to draw the ogive.

Points table:

Class    Frequency    Cumulative frequency    Upper class boundary    Point to be plotted
10-20    10           10                      20                      (20, 10)
20-30    30           10+30=40                30                      (30, 40)
30-40    60           40+60=100               40                      (40, 100)
40-50    80           100+80=180              50                      (50, 180)
50-60    65           180+65=245              60                      (60, 245)
60-70    45           245+45=290              70                      (70, 290)
70-80    25           290+25=315              80                      (80, 315)
80-90    15           315+15=330              90                      (90, 330)

We take a suitable scale on the Y axis for frequency. Here it should be 1 cm = 30 units.

[Figure: less than ogive curve. x-axis: class boundaries, marked 10 to 100; y-axis: cumulative frequency, marked 0 to 330 in steps of 30.]
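The cumulative frequencies for both types of ogive can be computed as below. The sketch assumes the usual convention that a less than ogive is plotted at the upper class boundaries and a more than ogive at the lower class boundaries; the variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies from the example.
lower = [10, 20, 30, 40, 50, 60, 70, 80]
upper = [20, 30, 40, 50, 60, 70, 80, 90]
freq = [10, 30, 60, 80, 65, 45, 25, 15]

# Less than type: running total of frequencies, plotted at upper boundaries.
less_than_cum = list(accumulate(freq))
less_than_points = list(zip(upper, less_than_cum))

# More than type: number of observations at or above each lower boundary.
total = sum(freq)
more_than_cum = [total - c + f for c, f in zip(less_than_cum, freq)]
more_than_points = list(zip(lower, more_than_cum))

print(less_than_points)
print(more_than_points)
```

The less than curve rises from 10 to the total of 330, while the more than curve falls from 330 to 15.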

Unit 2

Measures of Central Tendency and Dispersion

Objectives of statistical averages:

(i) Representative of the group:

An average represents all the features of a group; hence the results about the whole group
can be deduced from it.

(ii) Brief description:
An average gives us simple and brief description of the main features of the whole data.

(iii) Helpful in comparison:


The measures of central tendency or averages reduce the data to a single value, which is
highly useful for making comparative studies. For example, by comparing the per capita
income of two countries, we can conclude which country is richer.
(iv) Helpful in formulation of policies:
Averages help to develop a business in case of a firm or help the economy of a country to
develop.

(v) Base of other statistical Analysis:


Other statistical devices such as mean deviation, coefficient of variation, correlation,
analysis of time series and index numbers are also based on the averages.

Requisites of a Good Average:

1. It should be simple to compute.


2. It should be easy to understand.
3. It should be rigidly defined.
4. It should be representative of all the items.
5. It should not be unduly affected by extreme values.
6. It should be capable of further algebraic treatment.

Statistical Averages –

1. Arithmetic mean:

It is simple to understand and compute and is rigidly defined, but it is affected by
extreme items. It is a calculated value and is not based on position in the series.

The arithmetic mean is the most commonly used and readily understood measure
of central tendency in a data set. In statistics, the term average refers to any of the
measures of central tendency. The arithmetic mean of a set of observed data is defined as
being equal to the sum of the numerical values of each and every observation, divided by
the total number of observations. Symbolically, if we have a data set consisting of the
values x1, x2, x3, ..., xn, then the arithmetic mean, denoted by $\bar{X}$, is defined by the formula:

$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum X}{N}$

Type I Data: Listed data items:

Q 1. Find mean weight from the following data

Weight in kgs: 67 45 78 89 98 89 77

67 78 45 65 56 68 45 49

Ans: Mean = (67 + 45 + 78 + 89 + 98 + 89 + 77 + 67 + 78 + 45 + 65 + 56 + 68 + 45 + 49) / 15

Mean = 1016 / 15

Mean = 67.73 kgs

Q 2. Find mean grade points from the following grades

9.16 8.89 7.87 9.23 5.89 5.45 4.56 9.25 9.90 5.00

Ans :

Mean = (9.16 + 8.89 + 7.87 + 9.23 + 5.89 + 5.45 + 4.56 + 9.25 + 9.90 + 5.00) / 10

Mean = 75.20 / 10

Mean = 7.52 grade points
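As a quick check of the Type I method, the weights of Q 1 can be averaged in Python. This is only a sketch; the variable names are illustrative.

```python
# Weights in kgs from Q 1 above.
weights = [67, 45, 78, 89, 98, 89, 77, 67, 78, 45, 65, 56, 68, 45, 49]

total = sum(weights)          # sum of all observations
count = len(weights)          # number of observations N
mean_weight = total / count   # arithmetic mean

print(total, count, round(mean_weight, 2))  # 1016 15 67.73
```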

Type II Problems: When the data is given in the form of x and f, i.e. a discrete distribution.

The following formula is used.

Mean = $\frac{\sum f_i X_i}{\sum f_i}$
Q 1. Find the mean from the following data.

X 11 13 15 17 19 20 21 24
f 4 7 10 18 15 11 8 7
Solution :

X f fX
11 4 44
13 7 91
15 10 150
17 18 306
19 15 285
20 11 220
21 8 168
24 7 168
∑ f =80 ∑ f X=1432
By using,
Mean = $\frac{\sum fX}{\sum f}$

Mean = 1432 / 80

Mean = 17.9 units
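The fX column and the weighted mean of this example can be reproduced with a small Python sketch (names are illustrative, not part of the notes):

```python
# X values and their frequencies from Q 1 above.
xs = [11, 13, 15, 17, 19, 20, 21, 24]
fs = [4, 7, 10, 18, 15, 11, 8, 7]

sum_f = sum(fs)                              # sum of f
sum_fx = sum(f * x for x, f in zip(xs, fs))  # sum of f * X
mean = sum_fx / sum_f

print(sum_f, sum_fx, mean)  # 80 1432 17.9
```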

Q 2. Find mean height of the students from the following data.

Height 151 152 154 156 157 158 160 161 163 164
in cms
No of 05 07 08 14 18 22 15 08 02 01
students

Solution :We arrange the data vertically as follows

Height in cms (X) No of students ( f) fX


151 05 755
152 07 1064
154 08 1232
156 14 2184
157 18 2826
158 22 3476
160 15 2400
161 08 1288
163 02 326
164 01 164
∑ f =100 ∑ fX =15715

Mean = $\frac{\sum fX}{\sum f}$

= 15715 / 100

= 157.15 cms

Q 3. Find the mean number of viruses per infected PC from the following data.

No of viruses 1 2 3 4 5 6 7 8 9 10
No of PCs infected 0 12 90 12 23 45 26 86 9 5
Solution :
No of viruses(X) No of PCs infected (f) fX
1 0 0
2 12 24
3 90 270
4 12 48
5 23 115
6 45 270
7 26 182
8 86 688
9 9 81
10 5 50
∑ f =308 ∑ fX =1728

Mean = $\frac{\sum fX}{\sum f}$

= 1728 / 308

= 5.61 viruses

Q 4 Find mean wages from the following data (For practice)

Wages 210 220 230 240 250 260 270 280 290 300
in Rs
No of 5 19 38 57 76 42 36 21 12 04
workers
(Ans : Rs. 251.12)

Type III data: When the data has a continuous distribution (when classes and
frequencies are given).

Steps:
Step 1: Find the midpoint Xm of each class.
Step 2: Find the value of f Xm for each class.
Step 3: Mean = $\frac{\sum f X_m}{\sum f}$
Solved Examples :
Q 1 . Find mean marks from the following data
Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
No of 05 13 19 34 32 28 18 07
students
Solution :

Marks (X)   No of students (f)   Midpoint Xm   f.Xm
0-10 05 5 25
10-20 13 15 195
20-30 19 25 475
30-40 34 35 1190
40-50 32 45 1440
50-60 28 55 1540
60-70 18 65 1170
70-80 07 75 525
∑ f =156 ∑ fXm=6560

Mean = $\frac{\sum f X_m}{\sum f}$

= 6560 / 156

= 42.05 marks
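For grouped data the only extra step is computing the class midpoints. A minimal Python sketch of the three steps, using the marks data above (variable names are illustrative):

```python
# Class limits and frequencies from Q 1 above.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
fs = [5, 13, 19, 34, 32, 28, 18, 7]

# Step 1: midpoint Xm of each class.
mids = [(lo + hi) / 2 for lo, hi in classes]

# Step 2: sum of f * Xm.
sum_fxm = sum(f * m for f, m in zip(fs, mids))

# Step 3: mean = sum(f * Xm) / sum(f).
mean = sum_fxm / sum(fs)

print(sum(fs), sum_fxm, round(mean, 2))  # 156 6560.0 42.05
```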

Q 2. Find mean salary paid to the employees from the following data.

Salary in 2-5 5-10 10-12 12-14 14-18 18-24 24-26 26-30 30-40
Rs. lac
No of 22 36 48 88 102 45 19 6 2
employees

Solution :

Salary in Rs. lac (X)   No of employees (f)   Xm   f Xm
2-5 22 3.5 77
5-10 36 7.5 270
10-12 48 11 528
12-14 88 13 1144
14-18 102 16 1632
18-24 45 21 945
24-26 19 25 475
26-30 6 28 168
30-40 2 35 70
∑ f =368 ∑ fXm=5309

Mean = $\frac{\sum f X_m}{\sum f}$

= 5309 / 368

= Rs. 14.43 lac

Q 3 .Find the mean wages from the following data

Wages in Rs     220-230  230-240  240-250  250-260  260-270  270-280  280-290  290-300
No of workers   05       13       19       28       27       20       13       08

Solution :

Wages (X) No of workers (f) Xm fXm


220-230 05 225 1125
230-240 13 235 3055
240-250 19 245 4655
250-260 28 255 7140
260-270 27 265 7155
270-280 20 275 5500
280-290 13 285 3705
290-300 08 295 2360
∑ f = 133   ∑ fXm = 34695

Mean = $\frac{\sum f X_m}{\sum f}$

= 34695 / 133

= Rs. 260.86

Q 4 Find the mean height from the following.

Height less than   150  154  158  162  166  170  174  178  182
No of students     05   25   38   63   98   130  148  159  160

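Solution : This data is of Type III, because the number of students is given cumulatively
("height less than"). We first convert it into the standard Type III format (classes and
frequencies) and then solve it. Each frequency f is the difference between two successive
cumulative frequencies, and the first class is taken as 146-150 so that all classes have the same width.

Height less than   No of students (cf)   Height class (X)   No of students (f)   Xm   f Xm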
150 05 146-150 05 148 740
154 25 150-154 20 152 3040
158 38 154-158 13 156 2028
162 63 158-162 25 160 4000
166 98 162-166 35 164 5740
170 130 166-170 32 168 5376
174 148 170-174 18 172 3096
178 159 174-178 11 176 1936
182 160 178-182 01 180 180
∑ f = 160   ∑ fXm = 26136

Mean = $\frac{\sum f X_m}{\sum f}$

= 26136 / 160

= 163.35 units

2.Median :
Median is the value of the middlemost data item when the data items are arranged in
ascending or descending order.

Type I Data: Listed data items

Steps in finding median :

1. Arrange the data items in ascending or descending order.

2. Count the number of items as N.

3. a) When N is odd,

Median = the value of the data item at the $\left(\frac{N+1}{2}\right)$th position in the list created in step 1.

b) When N is even,

Median = mean of the data items at the $\left(\frac{N}{2}\right)$th and $\left(\frac{N}{2}+1\right)$th positions in the list formed
in step 1.

Solved Examples :

Q 1 Find median from the following data

233 231 245 321 211 322 268 254 201 204 206 289

Solution :

Step 1: Arrange the data items in ascending order. It comes as under:

201 204 206 211 231 233 245 254 268 289 321 322

Step 2: Here N = 12, which is even.

Step 3: Median = [(N/2)th item + (N/2 + 1)th item] / 2

= [(12/2)th item + (12/2 + 1)th item] / 2

= (6th item + 7th item) / 2

= (233 + 245) / 2

= 478 / 2

= 239
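The odd and even cases of the Type I steps can be written as one small Python function. This is a sketch with a hypothetical function name `median_of_list`; Python's built-in `statistics.median` follows the same rule.

```python
def median_of_list(values):
    """Median of listed data: middle item (odd N) or mean of the two middle items (even N)."""
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)                  # step 2: count the items
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((N+1)/2)th item, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of (N/2)th and (N/2+1)th items

data = [233, 231, 245, 321, 211, 322, 268, 254, 201, 204, 206, 289]
print(median_of_list(data))  # 239.0
```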

Type II : When X and f are given (Discrete distribution)

Step 1: Find the cumulative frequency (cf).

Step 2: Median = the value of X corresponding to the (N/2)th item in the cf column, i.e. the first X whose cumulative frequency reaches or exceeds N/2.

Example: Find median from the following data

X 12 13 14 15 16 17 18 19
f 05 24 45 65 44 33 21 04

Ans :

X f cf
12 05 05
13 24 29
14 45 74
15 65 139
16 44 183
17 33 216
18 21 237
19 04 241
N=∑ f =241

Median = Size of data item at N/2 th position in cf column

= Size of data item at 241/2 th item position in cf column

= size of data item at 120.5th i.e.121st item

= 15 units
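A minimal Python sketch of this lookup, assuming the median is taken as the first X whose cumulative frequency reaches N/2 (the function name is hypothetical):

```python
from itertools import accumulate

def discrete_median(xs, fs):
    """Return the first X whose cumulative frequency is at least N/2."""
    half = sum(fs) / 2
    for x, cf in zip(xs, accumulate(fs)):
        if cf >= half:
            return x

xs = [12, 13, 14, 15, 16, 17, 18, 19]
fs = [5, 24, 45, 65, 44, 33, 21, 4]
print(discrete_median(xs, fs))  # 15
```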

2. Find median height from the following table


Height 151 152 154 156 157 158 160 161 163 164
in cms
No of 05 07 08 14 18 22 15 08 02 01
students
Ans :

Height in cms (X)   No of students (f)   cf
151                 05                   05
152                 07                   12
154                 08                   20
156                 14                   34
157                 18                   52
158                 22                   74
160                 15                   89
161                 08                   97
163                 02                   99
164                 01                   100
N = 100
Median = Size of N/2 th item in cf column

= size of 100/2 th item in cf column

= size of 50 th item in cf column

=157 cms

3.Find median virus from the following data

No of viruses 1 2 3 4 5 6 7 8 9 10
No of PCs infected 0 12 90 12 23 45 26 86 9 5

Type III: Continuous distribution, i.e. when classes and frequencies are given.

Steps:

1. Find cf.
2. Find the value of N/2 and locate the median class, i.e. the class corresponding to the (N/2)th value in the
c.f. column.
3. Median = $L + \frac{\frac{N}{2} - \text{p.c.f.}}{f} \times i$

Where L = Lower limit of median class
N = ∑ f
p.c.f. = Cumulative frequency of the class preceding the median class
f = Frequency of median class
i = Class interval (width) of median class

Example:
1.Find median from the following data.

Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80


Frequency 07 18 29 45 36 24 12 9

Ans:

Class   Frequency   Cumulative Frequency (cf)
0-10 07 07
10-20 18 25
20-30 29 54
30-40 45 99 Median class
40-50 36 135
50-60 24 159
60-70 12 171
70-80 09 180
N=180

Median class = Class corresponding to the (N/2)th value in cf column

= class corresponding to the (180/2)th value in cf column

= class corresponding to the 90th item in cf column

= 30-40

Here L = 30, p.c.f. = 54, f = 45, i = 10.

Median = 30 + ((90 − 54) / 45) × 10

= 30 + 8

= 38
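The three steps for a continuous distribution can be sketched in Python as below. The function name `grouped_median` is hypothetical, and the sketch assumes the classes are listed in ascending order with no gaps.

```python
def grouped_median(classes, freqs):
    """Median = L + ((N/2 - p.c.f.) / f) * i for the class where cf first reaches N/2."""
    half = sum(freqs) / 2
    pcf = 0                                   # cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if pcf + f >= half:                   # this is the median class
            return lower + (half - pcf) / f * (upper - lower)
        pcf += f

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [7, 18, 29, 45, 36, 24, 12, 9]
print(grouped_median(classes, freqs))
```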

3.Mode:

The most frequently repeated data item is known as the mode.

Type I Data: When the data items are given in the form of a list

Steps :

1. Arrange the data in ascending or descending order.

2. Find the item which is repeated the greatest number of times; this item is the mode.

If there is a tie between two or more data items, then

Mode = mean of the tied data items

Example :

1.Find mode from the following data

12 34 23 43 12 32 45 34 56 34 21 12 67
34 45 70 23 42 34

Ans :

Ascending order :

12 12 12 21 23 23 32 34 34 34 34 34 42
43 45 45 56 67 70

34 occurs 5 times, more often than any other item.

Mode =34

2.Find the mode from the following data of wages in rupees.

212.5 213.5 213.0 213.5 200.5 212.5 210.5

213.5 213.0 212.5 210.0 240.5 213.5 212.5

213.0 200.5 212.5 213.5 210.5 240.0 240.5

Ans : 213.5 occurs 5 times and 212.5 also occurs 5 times.

Hence Mode = (212.5 + 213.5) / 2

Mode = 426 / 2

Mode = 213.0

Q 3 Find modal size from the following data.


233 213 243 233 213 203 253 233 213 203 213 233
243 233 213 233 243 243 253 213 203 213 233 243
253 213 203

Ans : 213 occurs 8 times, more often than any other item.

Hence mode = 213
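The counting step can be sketched in Python with `collections.Counter`. The function name `mode_with_ties` is hypothetical, and averaging tied items is the convention of these notes rather than a universal rule (tied data is often reported as bimodal instead).

```python
from collections import Counter

def mode_with_ties(values):
    """Most frequent item; if several items tie, return their mean (convention used here)."""
    counts = Counter(values)
    top = max(counts.values())
    tied = [v for v, c in counts.items() if c == top]
    return sum(tied) / len(tied)

marks = [12, 34, 23, 43, 12, 32, 45, 34, 56, 34, 21, 12, 67, 34, 45, 70, 23, 42, 34]
wages = [212.5, 213.5, 213.0, 213.5, 200.5, 212.5, 210.5,
         213.5, 213.0, 212.5, 210.0, 240.5, 213.5, 212.5,
         213.0, 200.5, 212.5, 213.5, 210.5, 240.0, 240.5]

print(mode_with_ties(marks))  # 34.0
print(mode_with_ties(wages))  # 213.0
```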

Type II: When the data is in the form of a discrete distribution,

i.e. when x and f are given

Mode = the value of X corresponding to the largest frequency f.

Example :

1.Find the mode from the following data.

X 10 11 12 13 14 15 16 17 18
f 05 10 19 25 26 19 10 8 2

Ans : Mode =14

2.Find mode from the following data

X 12 13 14 15 16 17 18 19
f 05 24 45 65 44 33 21 04

Mode = 15

3.Find modal wages from the following data.

Wages in Rs     210  215  220  230  245  250  255  260  270
No of workers   05   29   35   48   48   32   19   12   3

Ans : Here wages 230 and 245 both have the highest frequency (48).

Mode = (230 + 245) / 2

Mode = 475 / 2

Mode = Rs. 237.5

Type III : When the data is in a continuous distribution, i.e. when classes and
frequencies are given.

Steps :

1. Find the largest frequency and note it as f1. Take the class corresponding to it as the modal
class. Find L, the lower limit of the modal class.

2. Find f0, the frequency of the class preceding the modal class, and
f2, the frequency of the class succeeding the modal class.

3. Find i = class interval of the modal class.

4. Mode = $L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$

Example

1.Find modal salary from the following data

Salary in Rs '000   10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
No of employees     03     16     28     39=f0  58=f1  42=f2  30     16     4

Answer: 50-60 is modal class

L=50

i=60-50=10

f0=39 f1=58 f2=42

Mode = $L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$

Mode = 50 + ((58 − 39) / (2 × 58 − 39 − 42)) × 10

Mode = 50 + (19 / 35) × 10

Mode = 50 + 190 / 35

Mode = 50 + 5.43

Mode = 55.43

i.e. Mode = 55.43 × 1000

i.e. Mode = Rs. 55,430

Example : Find modal height from the following

Height           130-135  135-140  140-145  145-150  150-155  155-160  160-165  165-170  170-175
No of students   02       25       46=f0    82=f1    81=f2    70       52       16       5

Ans : Here highest frequency =82

Modal class=145-150

L=145 , i=150-145=5

f0=46 ,f1=82,f2=81,

Mode = $L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$

Mode = 145 + ((82 − 46) / (2 × 82 − 46 − 81)) × 5

Mode = 145 + (36 / 37) × 5

Mode = 145 + 180 / 37

Mode = 145 + 4.86

Mode = 149.86 units
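The modal-class formula can be sketched in Python as follows. The function name `grouped_mode` is hypothetical; the sketch assumes a single largest frequency and treats a missing neighbouring class as frequency 0.

```python
def grouped_mode(classes, freqs):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * i for the class with the largest frequency."""
    k = max(range(len(freqs)), key=lambda j: freqs[j])   # index of the modal class
    lower, upper = classes[k]
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                    # preceding class frequency
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0       # succeeding class frequency
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

# Height example above: classes 130-135, ..., 170-175.
height_classes = [(130 + 5 * j, 135 + 5 * j) for j in range(9)]
height_freqs = [2, 25, 46, 82, 81, 70, 52, 16, 5]
print(round(grouped_mode(height_classes, height_freqs), 2))  # 149.86
```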

Properties of arithmetic mean, median and mode:

Merits and demerits of arithmetic mean:

Merits:

1) Arithmetic mean is rigidly defined by an algebraic formula.
2) It is easy to calculate and simple to understand.
3) It is based on all observations of the given data.
4) It is capable of being treated mathematically; hence it is widely used in statistical
analysis.
5) Arithmetic mean can be computed even if the detailed distribution is not known, provided
the sum of the observations and the number of observations are known.
6) It is least affected by fluctuations of sampling.
7) The mean can be calculated for every kind of quantitative data.

Demerits of mean :

1) It can be determined neither by inspection nor by graphical location.
2) Arithmetic mean cannot be computed for qualitative data such as data on intelligence,
honesty and smoking habits.
3) It is affected too much by extreme observations and hence does not adequately
represent data containing some extreme values.
4) Arithmetic mean cannot be computed when class intervals have open ends.
5) If any one of the data items is missing, then the mean cannot be calculated.

Merits of median :

1) It is easy to compute and understand.

2) It is well defined, as an ideal average should be.

3) It can also be computed in case of frequency distribution with open ended classes.

4) It is not affected by extreme values and is also independent of the range or dispersion of
the data.

5) It can be determined graphically.

6) It is the proper average for qualitative data where items are not measured but are scored.

7) It is the only suitable average when the data are qualitative and it is possible to rank the various
items according to qualitative characteristics.

8) It can be found easily by inspecting the ordered data.

9) In some cases the median gives a better result than the mean.

Demerits of median :

1) For computing median data needs to be arranged in ascending or descending order.

2) It is not based on all the observations of the data.

3) It can not be given further algebraic treatment.

4) It is affected by fluctuation of sampling.

5) It is not accurate when the data is not large.

6) In some cases median is determined approximately as the mid-point of two


observations whereas for mean this does not happen.

Merits of Mode :

1) It is readily comprehensible and easy to compute. In some cases it can be computed
merely by inspection.
2) It is not affected by extreme values. It can be obtained even if the extreme values are
not known.
3) Mode can be determined in distributions with open classes.
4) Mode can be located on the graph also.
5) Even if all the data are not given, the mode can still be calculated.
6) It is easy to understand.
7) It is the most frequently observed data point.
8) Sometimes we get more than one mode and sometimes there is no mode, which itself tells us
something about the characteristics of the data.

Demerits of mode :

1) It is ill defined. It is not always possible to find clearly defined mode. In some cases, we
may come across distributions with two modes. Such distributions are called Bimodal. If a
distribution has more than two modes, it is said to be Multimodal.
2) It is not based upon all the observation.

3) Mode can be calculated by various formulae as such the value may differ from one to
other. Therefore, it is not rigidly defined.

4) It is affected to a greater extent by fluctuations of sampling.

Measures of Dispersion

1.Range:

The difference between the largest value (L) and the smallest value (S) is called the range.

Range = L-S

Example: Find the mean and range from the following data:

25 65 87 49 28 89 90 80 87 54

Solution:

Mean = (25 + 65 + 87 + 49 + 28 + 89 + 90 + 80 + 87 + 54) / 10

= 654 / 10

= 65.4 units

Range=L-S

=90-25

=65

Ex 2: Find the range from the following data. Also find the coefficient of range.
212.2 231.5 203.5 245.5 233.4 289.0

245.2 201.4 234.5 225.5 236.5 278.5

Solution:

L=289.0 S=201.4

Range= L-S

= 289.0 – 201.4

=87.6

Coefficient of Range = $\frac{L - S}{L + S}$

= (289.0 − 201.4) / (289.0 + 201.4)

= 87.6 / 490.4

= 0.1786
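Both quantities depend only on the largest and smallest values, as this short Python sketch shows (the function name is hypothetical):

```python
def range_and_coefficient(values):
    """Return (L - S, (L - S) / (L + S)) for the largest value L and smallest value S."""
    largest, smallest = max(values), min(values)
    return largest - smallest, (largest - smallest) / (largest + smallest)

data = [212.2, 231.5, 203.5, 245.5, 233.4, 289.0,
        245.2, 201.4, 234.5, 225.5, 236.5, 278.5]
rng, coeff = range_and_coefficient(data)
print(round(rng, 1), round(coeff, 4))  # 87.6 0.1786
```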

Type II: When data is in the form of discrete distribution i.e. x and f are given.

Example: Find range and coefficient of range from the following data.

x 11 12 13 14 15 16 17 18 19
f 04 09 20 29 38 45 24 15 7

Ans: L= 19, S=11

Range= L-S

=19-11

=8

Coefficient of range = $\frac{L - S}{L + S}$

= (19 − 11) / (19 + 11)

= 8 / 30

= 0.2667

Type III: When the data is in the form of a continuous distribution, i.e. when the classes and
frequencies are given.

Find the midpoint Xm of each class.

Range = Largest Xm − Smallest Xm

Ex .Find the range and coefficient of range from the following data

Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
Frequency 05 12 19 28 38 42 36 22 10

Solution:

Class X 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
Frequency 05 12 19 28 38 42 36 22 10
Xm 5 15 25 35 45 55 65 75 85

L=85, S=5

Range=L-S

=85-5

=80

Coefficient of range = $\frac{L - S}{L + S}$

= (85 − 5) / (85 + 5)

= 80 / 90

= 0.89

Mean Deviation :

It is the sum of the absolute differences of the actual data items from a central value (mean or median
or mode), divided by the total number of items.

M.D. = $\frac{\sum |X_i - X_c|}{N}$

where Xc is the mean, the median or the mode.

Type I : When the data is in the form of list.

Example 1: Find the mean deviation from the mean of the following data

23 45 76 45 78 45 64 34 67 43

Solution :

Xmean = (23 + 45 + 76 + 45 + 78 + 45 + 64 + 34 + 67 + 43) / 10

= 520 / 10

= 52

X     X − Xmean = X − 52    |X − Xmean|
23    23 − 52 = −29          29
45    45 − 52 = −7           7
76    76 − 52 = 24           24
45    45 − 52 = −7           7
78    78 − 52 = 26           26
45    45 − 52 = −7           7
64    64 − 52 = 12           12
34    34 − 52 = −18          18
67    67 − 52 = 15           15
43    43 − 52 = −9           9

Sum of |X − Xmean| = 154

M.D. from mean = $\frac{\sum |X_i - X_{mean}|}{N}$

= 154 / 10

= 15.4
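The same table can be reproduced in a few lines of Python (a sketch; variable names are illustrative):

```python
data = [23, 45, 76, 45, 78, 45, 64, 34, 67, 43]

mean = sum(data) / len(data)                     # 52.0
abs_devs = [abs(x - mean) for x in data]         # |X - Xmean| column
mean_deviation = sum(abs_devs) / len(data)

print(mean, sum(abs_devs), mean_deviation)  # 52.0 154.0 15.4
```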

Type III: When the data is in the form of continuous data, i.e. when classes and frequencies
are given

Mean = $\frac{\sum f X_m}{\sum f}$

Median = $L + \frac{\frac{N}{2} - \text{p.c.f.}}{f} \times i$

Mode = $L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$

Mean Deviation from mean = $\frac{\sum f\,|X_m - X_{mean}|}{\sum f}$
Example 1: Find mean deviation from mean from the following data

Wages in Rs     210-216  216-220  220-224  224-226  226-230  230-234  234-238  238-242
No of workers   12       24       56       88       102      52       22       04
Solution:

Wages X    No of workers f   Xm    f×Xm    |Xm − Xmean| = |Xm − 226.36|   f×|Xm − Xmean|
210-216    12                213   2556    13.36                          160.32
216-220    24                218   5232    8.36                           200.64
220-224    56                222   12432   4.36                           244.16
224-226    88                225   19800   1.36                           119.68
226-230    102               228   23256   1.64                           167.28
230-234    52                232   12064   5.64                           293.28
234-238    22                236   5192    9.64                           212.08
238-242    04                240   960     13.64                          54.56
Total      360                     81492                                  1452

Xmean = 81492 / 360

= 226.36

Mean Deviation from mean = $\frac{\sum f\,|X_m - X_{mean}|}{\sum f}$

= 1452 / 360

= 4.033
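For grouped data each absolute deviation is weighted by its class frequency. A minimal Python sketch using the wage data above (names are illustrative; the mean is kept at full precision rather than rounded):

```python
classes = [(210, 216), (216, 220), (220, 224), (224, 226),
           (226, 230), (230, 234), (234, 238), (238, 242)]
fs = [12, 24, 56, 88, 102, 52, 22, 4]

mids = [(lo + hi) / 2 for lo, hi in classes]             # Xm column
n = sum(fs)                                              # sum of f
mean = sum(f * m for f, m in zip(fs, mids)) / n          # sum(f*Xm) / sum(f)
md_from_mean = sum(f * abs(m - mean) for f, m in zip(fs, mids)) / n

print(n, round(md_from_mean, 3))  # 360 4.033
```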

Example 2 : Find mean deviation from median and mode from the following data.

Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
No. of 05 13 18 22 24 26 25 16 5
students
Solution :

Median=47.91

Mean Deviation from median=17.13

Mode= 56.66

MD from mode =18.26

Example 3: Find mean deviation from mean and mode from the following data.
Salary in 5-8 8-11 11-14 14-17 17-20 20-23 23-26 26-29 29-32
Rs Lac
No of 05 12 19 24 25 21 18 08 02
employees

Solution:

In the table below, Xmean = 17.85 and Xmode = 17.6, as computed after the table.

Salary X   No. of emp f   Xm     f.Xm    |Xm − Xmean|   f×|Xm − Xmean|   |Xm − Xmode|   f×|Xm − Xmode|
5-8        05             6.5    32.5    11.35          56.75            11.1           55.5
8-11       12             9.5    114     8.35           100.2            8.1            97.2
11-14      19             12.5   237.5   5.35           101.65           5.1            96.9
14-17      24 (f0)        15.5   372     2.35           56.4             2.1            50.4
17-20      25 (f1)        18.5   462.5   0.65           16.25            0.9            22.5
20-23      21 (f2)        21.5   451.5   3.65           76.65            3.9            81.9
23-26      18             24.5   441     6.65           119.7            6.9            124.2
26-29      08             27.5   220     9.65           77.2             9.9            79.2
29-32      02             30.5   61      12.65          25.3             12.9           25.8
Total      ∑ f = 134             ∑ fXm = 2392           630.1                           633.6

Xmean = $\frac{\sum f X_m}{\sum f}$

= 2392 / 134

= 17.85

M.D. from mean = $\frac{\sum f\,|X_m - X_{mean}|}{\sum f}$

= 630.1 / 134

= 4.70

From mode :

Modal class = Class corresponding to highest frequency

=17-20
L=17, i=20-17=3 , f0=24 , f1=25 , f2= 21

Mode = $L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$

= 17 + ((25 − 24) / (2 × 25 − 24 − 21)) × 3

= 17 + 3 / 5

= 17.6

M.D. from mode = $\frac{\sum f\,|X_m - X_{mode}|}{\sum f}$

= 633.6 / 134

= 4.73

Standard Deviation

It is the square root of the sum of the squares of the deviations of the actual values from the
mean, divided by the total number of items.

S.D. (σ) = $\sqrt{\frac{\sum (X - X_{mean})^2}{N}}$

where N is the total number of items (N = ∑ f for a frequency distribution).

Type I : Data is in the form of list.

Ex 1: Find the standard deviation of the following data

23 45 32 45 34 28 76 45 43 39

Solution :

Xmean = (23 + 45 + 32 + 45 + 34 + 28 + 76 + 45 + 43 + 39) / 10

= 410 / 10

= 41

X     X − Xmean = X − 41    (X − Xmean)²
23    −18                    324
45    4                      16
32    −9                     81
45    4                      16
34    −7                     49
28    −13                    169
76    35                     1225
45    4                      16
43    2                      4
39    −2                     4
                             ∑ = 1904

S.D. (σ) = $\sqrt{\frac{\sum (X - X_{mean})^2}{N}}$

= √(1904 / 10)

= √190.4

= 13.80 units.
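The definition above divides by N (population standard deviation). A short Python sketch of the same steps, cross-checked against the standard library's `statistics.pstdev`, which uses the same divisor (variable names are illustrative):

```python
import math
import statistics

data = [23, 45, 32, 45, 34, 28, 76, 45, 43, 39]

mean = sum(data) / len(data)
sum_sq_dev = sum((x - mean) ** 2 for x in data)   # sum of (X - Xmean)^2
sd = math.sqrt(sum_sq_dev / len(data))            # divide by N, then take the square root

print(mean, sum_sq_dev, round(sd, 2))
print(math.isclose(sd, statistics.pstdev(data)))  # True
```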

Ex 2: Find standard deviation for the following wage rate in Rs.

213 213 243 222 240 236

234 242 245 248 219 212

X X-Xmean (X-Xmean)2
=X-230.58
213 -17.58 309.05
213 -17.58 309.05
243 12.42 154.25
222 -8.58 73.61
240 9.42 88.73
236 5.42 29.37
234 3.42 11.69
242 11.42 130.41
245 14.42 207.93
248 17.42 303.45
219 -11.58 134.09
212 -18.58 345.21
∑ X = 2767          ∑ (X − Xmean)² = 2096.83

Mean = $\frac{\sum X}{N}$

= 2767 / 12

= 230.58

S.D. (σ) = $\sqrt{\frac{\sum (X - X_{mean})^2}{N}}$

= √(2096.83 / 12)

= √174.74

= 13.22

Type II: Discrete Distribution i.e. when X and f are given.

S.D. (σ) = $\sqrt{\frac{\sum f (X - X_{mean})^2}{\sum f}}$

where Xmean = $\frac{\sum fX}{\sum f}$

Ex1. Find standard deviation from the following data.

X 1 2 3 4 5 6 7 8
f 2 9 18 16 21 14 7 3
Solution :

X   f   f*X   X − Xmean = X − 4.44   (X − Xmean)²   f*(X − Xmean)²
1 2 2 -3.44 11.83 23.66
2 9 18 -2.44 5.95 53.55
3 18 54 -1.44 2.07 37.26
4 16 64 -0.44 0.19 3.04
5 21 105 0.56 0.31 6.51
6 14 84 1.56 2.43 34.02
7 7 49 2.56 6.55 45.85
8 3 24 3.56 12.67 38.01
∑ f = 90   ∑ fX = 400   ∑ f(X − Xmean)² = 241.90

Mean = $\frac{\sum fX}{\sum f}$

= 400 / 90

= 4.44

S.D. (σ) = $\sqrt{\frac{\sum f (X - X_{mean})^2}{\sum f}}$

= √(241.90 / 90)

= √2.69

= 1.64

Q 2 Find S.D from the following data

Marks            0   1   2   3   4   5   6   7   8   9   10
No of students   02  08  12  20  25  26  29  20  09  04  01
Ans :

X   f   f*X   X − Xmean = X − 4.83   (X − Xmean)²   f*(X − Xmean)²
0 02 0 -4.83 23.3289 46.6578
1 08 8 -3.83 14.6689 117.3512
2 12 24 -2.83 8.0089 96.1068
3 20 60 -1.83 3.3489 66.978
4 25 100 -0.83 0.6889 17.2225
5 26 130 0.17 0.0289 0.7514
6 29 174 1.17 1.3689 39.6981
7 20 140 2.17 4.7089 94.178
8 09 72 3.17 10.0489 90.4401
9 04 36 4.17 17.3889 69.5556
10 01 10 5.17 26.7289 26.7289
∑ f = 156   ∑ fX = 754   ∑ f(X − Xmean)² = 665.6684

Mean = $\frac{\sum fX}{\sum f}$

= 754 / 156 = 4.83

S.D. (σ) = $\sqrt{\frac{\sum f (X - X_{mean})^2}{\sum f}}$

= √(665.6684 / 156)

= √4.2671

= 2.066

Type III: Continuous distribution, i.e. when classes and frequencies are given

S.D. (σ) = $\sqrt{\frac{\sum f (X_m - X_{mean})^2}{\sum f}}$

where Xmean = $\frac{\sum f X_m}{\sum f}$
Ex 1: Find standard deviation from the following data

Sales in Rs. lac   10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
No of companies    05     17     29     45     41     34     19     08     05

Solution :

X        f     Xm   fXm     Xm − 52.1428   (Xm − 52.1428)²   f*(Xm − 52.1428)²
10-20    5     15   75      −37.1428       1379.59           6897.94
20-30    17    25   425     −27.1428       736.73            12524.44
30-40    29    35   1015    −17.1428       293.88            8522.39
40-50    45    45   2025    −7.1428        51.02             2295.88
50-60    41    55   2255    2.8572         8.16              334.71
60-70    34    65   2210    12.8572        165.31            5620.46
70-80    19    75   1425    22.8572        522.45            9926.58
80-90    8     85   680     32.8572        1079.60           8636.76
90-100   5     95   475     42.8572        1836.74           9183.70
Total    203        10585                                    63942.86

Mean = 10585 / 203 = 52.1428

S.D. = √(63942.86 / 203)

= √314.99

= 17.75
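The frequency-weighted formula works the same way for Type II data (pass the X values) and Type III data (pass the class midpoints). A Python sketch with a hypothetical helper `frequency_sd`, applied to the sales data above:

```python
import math

def frequency_sd(values, freqs):
    """Return (mean, S.D.) where S.D. = sqrt(sum(f * (x - mean)^2) / sum(f))."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    return mean, math.sqrt(variance)

# Midpoints of the classes 10-20, 20-30, ..., 90-100 and their frequencies.
mids = [15, 25, 35, 45, 55, 65, 75, 85, 95]
freqs = [5, 17, 29, 45, 41, 34, 19, 8, 5]

mean, sd = frequency_sd(mids, freqs)
print(round(mean, 4), round(sd, 2))
```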

Quartiles :

1. First Quartile Q1
2. Third Quartile Q3

First Quartile Q1:
The first quartile is the value that divides the data items, arranged in ascending order, in
such a way that 25% of the data items lie below it and 75% of the data items lie
above it.

[Figure: ascending data list with Q1, Median and Q3 marked; 25% of the data items lie below Q1 and 75% above it.]

Third Quartile Q3:
The third quartile is the value that divides the data items, arranged in ascending order, in
such a way that 75% of the data items lie below it and 25% of the data items lie
above it.

Type 1 data : When the data is given in list form


Steps:
1.Arrange data in ascending order.
2. Q1 = Size of data item at N/4th position in ascending order list.
Q3= Size of item at 3N/4th position in ascending order list.

Ex 1. Find Q1 and Q 3 from the following data


21 23 32 22 21 34 24 24 38 25 21 20
Ans :

Ascending order list:


20 21 21 21 22 23 24 24 25 32 34 38

Here N=12
Q1 = Size of data item at N/4th position
= 12/4
= 3rd item
=21
Q3= Size of data item at 3N/4th position
= Size of 3*12/4
= Size of 9th item
= 25

Quartile deviation QD = $\frac{Q_3 - Q_1}{2}$

Coefficient of QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$

Ex 2: Find Q1, Q3 and quartile deviation from the following data.


213 234 245 243 231 222 231 213 243 245 211 234 235 239
Ans:

211 213 213 222 231 231 234 234 235 239 243 243 245 245

N=14
Q1 = Size of data item at the (N/4)th position
= size of data item at the (14/4)th position
= size of the 3.5th item
= average of the 3rd and 4th items
= (213 + 222) / 2
= 435 / 2
= 217.5
Q3 = Size of data item at the (3N/4)th position
= size of the (3 × 14 / 4)th item
= size of the 10.5th item
= average of the 10th and 11th items
= (239 + 243) / 2
= 482 / 2
= 241

Q.D. = (Q3 − Q1) / 2
= (241 − 217.5) / 2
= 23.5 / 2
= 11.75
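The positional rule used in these two examples can be sketched in Python as below. The function name `quartile` is hypothetical, and averaging the two neighbouring items for any fractional position is an assumption that generalises the 3.5th and 10.5th cases above; statistical libraries use several other interpolation conventions and may give slightly different values.

```python
import math

def quartile(values, k):
    """k-th quartile (k = 1 or 3): the item at position k*N/4 in the ascending list.

    A fractional position is resolved by averaging the two neighbouring items.
    """
    ordered = sorted(values)
    pos = k * len(ordered) / 4                 # 1-based position N/4 or 3N/4
    lo = max(math.floor(pos), 1)
    hi = max(math.ceil(pos), 1)
    return (ordered[lo - 1] + ordered[hi - 1]) / 2

data = [213, 234, 245, 243, 231, 222, 231, 213, 243, 245, 211, 234, 235, 239]
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3, (q3 - q1) / 2)  # 217.5 241.0 11.75
```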

Type II Data: Discrete distribution i.e. x and f are given.

Steps:
1.Find cf
2. Q1 = Size of data item at N/4th position in ascending order list.
Q3= Size of item at 3N/4th position in ascending order list.

Ex : Find Q1 and Q3 from the following data


X 20 22 24 25 27 29 30 32
f 5 15 25 25 27 20 15 05
Ans :

X f cf
20 5 5
22 15 20
24 25 45
25 25 70
27 27 97
29 20 117
30 15 132
32 05 N=137

Q1 = Size of data item at the (N/4)th position in cf column

= size of data item at the (137/4)th item
= size of data item at the 34.25th item
= size of data item at the 35th item (the first cumulative frequency that reaches 34.25 is 45)
= 24

Q3 = Size of data item at the (3N/4)th position in cf column

= size of data item at the (3 × 137 / 4)th item
= size of data item at the 102.75th item
= size of data item at the 103rd item
= 29

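Type III: Continuous distribution, i.e. when classes and frequencies are given

Q1 = $L + \frac{\frac{N}{4} - \text{p.c.f.}}{f} \times i$

Where L is the lower limit of the Q1 class
p.c.f. is the cumulative frequency of the class preceding the Q1 class
f is the frequency of the Q1 class
i is the class interval of the Q1 class
Q1 class = Class corresponding to the (N/4)th item in c.f. column

Q3 = $L + \frac{\frac{3N}{4} - \text{p.c.f.}}{f} \times i$

where L, p.c.f., f and i now refer to the Q3 class, i.e. the class corresponding to the (3N/4)th item in c.f. column.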
Example : Find Q1,Q3 and coefficient of Q.D from the following data.

Marks            00-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
No of students   05     22     42     65     62     45     26     13     7      3

Ans :
Marks No of students cf
00-10 05 05
10-20 22 27
20-30 42 69
30-40 65 134
40-50 62 196
50-60 45 241
60-70 26 267
70-80 13 280
80-90 7 287
90-100 3 290=N

Q1 class= Class corresponding to N/4th item in c.f. column

= c.c.t.290/4th item in c.f. column

=c.c.t. 72.5th item in c.f. column

= c.c.t 73rd item

= 30-40

L=30, f=65 ,p.c.f=69 ,i=40-30=10

Q1 = $L + \frac{\frac{N}{4} - \text{p.c.f.}}{f} \times i$

= 30 + ((290/4 − 69) / 65) × 10

= 30 + ((72.5 − 69) / 65) × 10

= 30 + 35 / 65

= 30 + 0.54

= 30.54

Q3 class = Class corresponding to the (3N/4)th item in cf column

= c.c.t. (3 × 290 / 4)th item

= c.c.t. 217.5th item

= c.c.t. 218th item

= 50-60

L = 50, f = 45, p.c.f. = 196, i = 10

Q3 = $L + \frac{\frac{3N}{4} - \text{p.c.f.}}{f} \times i$

= 50 + ((3 × 290 / 4 − 196) / 45) × 10

= 50 + ((217.5 − 196) / 45) × 10

= 50 + 4.78

= 54.78

Coefficient of Q.D. = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$

= (54.78 − 30.54) / (54.78 + 30.54)

= 24.24 / 85.32

= 0.28
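Q1, the median and Q3 of a continuous distribution all use the same interpolation, with N/4, N/2 or 3N/4 as the target position. A Python sketch with a hypothetical helper `grouped_quantile`, assuming contiguous classes in ascending order:

```python
def grouped_quantile(classes, freqs, fraction):
    """L + ((fraction*N - p.c.f.) / f) * i for the class where cf first reaches fraction*N."""
    target = fraction * sum(freqs)
    pcf = 0                                    # cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if pcf + f >= target:
            return lower + (target - pcf) / f * (upper - lower)
        pcf += f

classes = [(10 * j, 10 * j + 10) for j in range(10)]   # 0-10, 10-20, ..., 90-100
freqs = [5, 22, 42, 65, 62, 45, 26, 13, 7, 3]

q1 = grouped_quantile(classes, freqs, 0.25)
q3 = grouped_quantile(classes, freqs, 0.75)
coeff_qd = (q3 - q1) / (q3 + q1)
print(round(q1, 2), round(q3, 2), round(coeff_qd, 2))
```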

Merits and demerits of Quartile Deviation

Merits:


1. It can be easily calculated and simply understood.

2. It does not involve much mathematical difficulties.

3. As it takes middle 50% terms hence it is a measure better than Range and Percentile
Range.

4. It is not affected by extreme terms as 25% of upper and 25% of lower terms are left out.

5. Quartile Deviation also provides a short cut method to calculate Standard Deviation
using the formula 6 Q.D. = 5 M.D. = 4 S.D.

6. In case we are to deal with the center half of a series this is the best measure to use.

Demerits :

1. As Q1 and Q3 are both positional measures hence are not capable of further algebraic
treatment.

2. It requires more calculation, yet the result obtained is not of much importance.

3. It is affected too much by fluctuations of sampling.

4. The first 25% and the last 25% of the items are ignored, so 50% of the items play no role and the result may not be reliable.

5. If the values are irregular, the result is badly affected.

6. Strictly, it is not a measure of dispersion, as it does not show the scatter around any
average.

7. The value of the quartile deviation may be the same for two or more series, because Q.D. is not affected by the
distribution of the items between Q1 and Q3 or outside these positions.

Mean deviation:

Merits of Mean Deviation: 

1. It is simple to understand and easy to compute.

2. It is based on each and every item of the data.

3. MD is less affected by the values of extreme items than the Standard deviation.
Demerits of Mean Deviation:

1. The greatest drawback of this method is that algebraic signs are ignored while taking
the deviations of the items.

2. It is not capable of further algebraic treatments.

3. It is much less popular as compared to standard deviation.

Standard Deviation:

Merits:

1.It is rigidly defined.

2. It is based on all the observations of the series and hence it is representative.

3.It is amenable to further algebraic treatment.

4. It is least affected by fluctuations of sampling.

Demerits:

1) It is more affected by extreme items.

2) It cannot be exactly calculated for a distribution with open-ended classes.

3) It is relatively difficult to calculate and understand.

Properties of standard deviation:

1.Standard deviation is only used to measure spread or dispersion around the mean of a
data set.

2.Standard deviation is never negative.

3.Standard deviation is sensitive to outliers. A single outlier can raise the standard
deviation and in turn, distort the picture of spread.

4.For data with approximately the same mean, the greater the spread, the greater the
standard deviation.

5.If all values of a data set are the same, the standard deviation is zero 

Variance:
The term variance refers to a statistical measurement of the spread between
numbers in a data set. More specifically, variance measures how far each number in the
set is from the mean and thus from every other number in the set. Variance is often
depicted by this symbol: σ2. It is used by both analysts and traders to
determine volatility and market security. The square root of the variance is the standard
deviation (σ), which helps determine the consistency of an investment's returns over a
period of time.

Coefficient of Variation:

Coefficient of variation (C.V.) = $\frac{\sigma}{\text{Mean}} \times 100$

The smaller the value of C.V., the more consistent the data items are.

The larger the value of C.V., the less consistent the data items are.

Ex 1: The following data is related with the scores made by Sehwag and Rahul Dravid in
last ten innings.

Inn 1 2 3 4 5 6 7 8 9 10
Runs made by Sehwag 45 98 02 04 57 89 12 18 05 90
Runs made by Rahul 45 56 78 56 48 57 88 40 12 18
Find

1.The highest run getter. 2. average runs made by each of them 3. Which batsman is
more consistent?

Ans:

Inning No   Runs made by Sehwag (X)   Runs made by Rahul (Y)   X − Xm = X − 42   (X − Xm)²   Y − Ym = Y − 49.8   (Y − Ym)²
1           45                        45                       3                 9           −4.8                23.04
2           98                        56                       56                3136        6.2                 38.44
3           02                        78                       −40               1600        28.2                795.24
4           04                        56                       −38               1444        6.2                 38.44
5           57                        48                       15                225         −1.8                3.24
6           89                        57                       47                2209        7.2                 51.84
7           12                        88                       −30               900         38.2                1459.24
8           18                        40                       −24               576         −9.8                96.04
9           05                        12                       −37               1369        −37.8               1428.84
10          90                        18                       48                2304        −31.8               1011.24
Total       ∑ X = 420                 ∑ Y = 498                                  ∑ (X − Xm)² = 13772             ∑ (Y − Ym)² = 4945.6

The highest run getter is Rahul Dravid.

2.Average runs made by each of them

Avg. runs made by Sehwag=420/10=42 runs

Avg. runs made by Rahul Dravid=498/10=49.8

3.Which batsman is more consistent?

Ans:

S.D.(X) = $\sqrt{\frac{\sum (X - X_m)^2}{N}}$

S.D.(X) = √(13772 / 10)

S.D.(X) = √1377.2

S.D.(X) = 37.11

Coefficient of variation (X) = (S.D.(X) / Xm) × 100

= (37.11 / 42) × 100

= 88.36

S.D.(Y) = $\sqrt{\frac{\sum (Y - Y_m)^2}{N}}$

S.D.(Y) = √(4945.6 / 10)

S.D.(Y) = √494.56

S.D.(Y) = 22.24

Coefficient of variation (Y) = (S.D.(Y) / Ym) × 100

= (22.24 / 49.8) × 100

= 44.66

Dravid is more consistent, as his coefficient of variation is smaller.
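The comparison can be reproduced with a short Python sketch. The helper name `coefficient_of_variation` is hypothetical; it uses `statistics.pstdev`, which divides by N as in the formula above.

```python
import statistics

def coefficient_of_variation(values):
    """C.V. = (S.D. / mean) * 100, with S.D. computed using divisor N."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

sehwag = [45, 98, 2, 4, 57, 89, 12, 18, 5, 90]
rahul = [45, 56, 78, 56, 48, 57, 88, 40, 12, 18]

cv_sehwag = coefficient_of_variation(sehwag)
cv_rahul = coefficient_of_variation(rahul)

# The batsman with the smaller C.V. is the more consistent one.
more_consistent = "Rahul Dravid" if cv_rahul < cv_sehwag else "Sehwag"
print(sum(sehwag), sum(rahul), more_consistent)  # 420 498 Rahul Dravid
```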


Unit No :3 Correlation

Definition :

Correlation is the degree to which two variables depend on each other.

Example 1:

1. Age
2. Weight (Up to 18 yrs)

When one variable increases, the second variable also increases.

Example 2:

1. The standard in which we are learning.


2. The percentage that we are getting.
When one variable increases, the second variable decreases.

Example 3:

1.Demand for goods

2.Price for goods

3.Supply

Types of correlation:

1.Positive Correlation:

When the value of one variable increases (or decreases) and the value of the dependent
variable also increases (or decreases), it is called positive correlation.

X 1 2 3 4 5 6 7 8
Y(dependent ) 10 12 18 24 29 32 39 40

2. Negative correlation: When the value of one variable increases (or decreases) and
the value of the dependent variable decreases (or increases), it is called negative
correlation.

X 1 2 3 4 5 6 7 8
Y(dependent ) 25 24 20 18 12 6 2 1
Measuring the degree of correlations:

1.Scatter Diagram:

Scatter diagram is a graphical method of studying correlation between two variables.

It gives us rough idea regarding the type of correlation between the two variables.

In this method, points are plotted on XY plane and the trend of points is noticed.

When the points are moving in upward direction from left to right then we conclude that
there is positive correlation between the two variables.

When the points are moving in downward direction from left to right then we conclude that
there is negative correlation between the two variables.

Example 1 : State the type of correlation between the two variables by using scatter
diagram.

X 2 5 8 19 28 35 45 65 78
Y 10 20 30 40 50 60 70 80 90

Ans :

[Scatter diagram: Y (10 to 90) plotted against X (10 to 100); the points rise from left to right.]

There is positive correlation between X and Y as the points are moving in upward
direction.

Ex 2: State the type of correlation between the two variables by using scatter diagram.

X 12 19 45 59 66 75 88 90 99
Y 98 85 72 65 45 35 26 18 09

Ans:

[Scatter diagram: Y (10 to 100) plotted against X (10 to 100); the points fall from left to right.]

There is negative correlation between X and Y as the points are moving in downward
direction from left to right.

3.No correlation

When the points do not show any trend either in upward or downward direction then we
can say that there is no correlation between the given variables.

Ex: 3

X 10 20 30 40 50 60 70 80 90
y 25 63 45 57 49 88 42 48 12
Ans :

[Scatter diagram: Y (10 to 100) plotted against X (10 to 100); the points are scattered without a pattern.]

The points show no clear upward or downward trend, so there is no evident correlation between X and Y.

2.Analytical Method:
Karl Pearson’s Coefficient of Correlation is a widely used mathematical method that gives a
numerical measure of the strength of the relationship between two linearly related
variables. The coefficient of correlation is denoted by “r”.

Depending upon the value of r ,we can conclude types of correlation in the following
manner.

1.r >0 --- Positive correlation

2.r<0 ----- Negative correlation

3.r=0…… No correlation

The value of r lies between -1 and +1.

More specifically when r=1 , there is a perfect positive correlation between the variables.

When r=-1 , there is a perfect negative correlation.

The value of r can be found by using the following formula

$r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}}$

where x=X-Xm

y= Y-Ym

Ex. Find Karl Pearson’s correlation coefficient from the following data

X 2 4 6 8 10 12 14 16 18
Y 5 9 13 17 21 25 29 33 37

Ans:

X      Y      x = X-Xm    y = Y-Ym    x^2    y^2    xy
              = X-10      = Y-21
2      5      -8          -16         64     256    128
4      9      -6          -12         36     144    72
6      13     -4          -8          16     64     32
8      17     -2          -4          4      16     8
10     21     0           0           0      0      0
12     25     2           4           4      16     8
14     29     4           8           16     64     32
16     33     6           12          36     144    72
18     37     8           16          64     256    128
∑X = 90   ∑Y = 189   ∑x^2 = 240   ∑y^2 = 960   ∑xy = 480

$Xm = \frac{\sum X}{N} = \frac{90}{9} = 10$

$Ym = \frac{\sum Y}{N} = \frac{189}{9} = 21$

Karl Pearson’s correlation coefficient is given by

$r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}} = \frac{480}{\sqrt{240} \cdot \sqrt{960}} = \frac{480}{480} = 1$

There is perfect positive correlation as r=+1
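The working above can be reproduced with a minimal Python sketch of the deviation-from-mean formula. The function name `pearson_r` is an illustrative choice, not something defined in these notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r from deviations x = X - Xm and y = Y - Ym."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    dx = [x - x_mean for x in xs]
    dy = [y - y_mean for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / (sqrt(sum_x2) * sqrt(sum_y2))

X = [2, 4, 6, 8, 10, 12, 14, 16, 18]
Y = [5, 9, 13, 17, 21, 25, 29, 33, 37]
r = pearson_r(X, Y)
print(round(r, 4))
```

Because every Y here equals 2X + 1, the points lie exactly on a rising straight line, which is the situation r = +1 describes.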

Ex 2: Find Karl Pearson’s correlation coefficient.

X 13 25 36 48 56 69 45 58 29 11
Y 26 56 48 45 23 34 45 99 104 200

Ans:

X      Y      x = X-Xm    y = Y-Ym    x^2    y^2      xy
              = X-39      = Y-68
13     26     -26         -42         676    1764     +1092
25     56     -14         -12         196    144      +168
36     48     -3          -20         9      400      +60
48     45     9           -23         81     529      -207
56     23     17          -45         289    2025     -765
69     34     30          -34         900    1156     -1020
45     45     6           -23         36     529      -138
58     99     19          31          361    961      +589
29     104    -10         36          100    1296     -360
11     200    -28         132         784    17424    -3696
∑X = 390   ∑Y = 680   ∑x^2 = 3432   ∑y^2 = 26228   ∑xy = -4277

$Xm = \frac{\sum X}{N} = \frac{390}{10} = 39$

$Ym = \frac{\sum Y}{N} = \frac{680}{10} = 68$

Karl Pearson’s correlation coefficient is given by

$r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}} = \frac{-4277}{\sqrt{3432} \cdot \sqrt{26228}} = \frac{-4277}{9487} = -0.45$

Ex 3. Find Karl Pearson’s correlation coefficient from the following.

Age of X 1 2 3 4 5 6 7 8 9 10
His IQ 2.05 3.42 3.05 2.65 2.05 1.69 1.02 0.99 0.85 0.23
Ans:

Age of X   His IQ   x = X-Xm    y = Y-Ym    x^2      y^2      xy
(X)        (Y)      = X-5.5     = Y-1.8
1          2.05     -4.5        0.25        20.25    0.0625   -1.125
2          3.42     -3.5        1.62        12.25    2.6244   -5.67
3          3.05     -2.5        1.25        6.25     1.5625   -3.125
4          2.65     -1.5        0.85        2.25     0.7225   -1.275
5          2.05     -0.5        0.25        0.25     0.0625   -0.125
6          1.69     0.5         -0.11       0.25     0.0121   -0.055
7          1.02     1.5         -0.78       2.25     0.6084   -1.17
8          0.99     2.5         -0.81       6.25     0.6561   -2.025
9          0.85     3.5         -0.95       12.25    0.9025   -3.325
10         0.23     4.5         -1.57       20.25    2.4649   -7.065
∑X = 55   ∑Y = 18   ∑x^2 = 82.5   ∑y^2 = 9.6784   ∑xy = -24.96

$Xm = \frac{\sum X}{N} = \frac{55}{10} = 5.5$

$Ym = \frac{\sum Y}{N} = \frac{18}{10} = 1.8$

Karl Pearson’s correlation coefficient is given by

$r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}} = \frac{-24.96}{\sqrt{82.5} \cdot \sqrt{9.6784}} = \frac{-24.96}{28.26} = -0.88$

There is negative correlation.

Ex 4. State whether there is any kind of correlation between the cost incurred by the
company on advertisement and the sales that it achieves, from the following data.

Cost of advertisement (Rs. lac)   12 16 28 35 40 38 45 34 24 18
Sales (Rs. lac)                   45 40 35 48 25 30 12 34 15 26

Ans : r = -0.41 (a weak negative correlation)

Spearman’s Rank Correlation Coefficient (R):


The Spearman's rank-order correlation is the nonparametric version of the Pearson
product-moment correlation. Spearman's correlation coefficient, R, measures the strength and
direction of association between two ranked variables. It is calculated by using

$R = 1 - \frac{6 \sum D^2}{N^3 - N}$

where D = R1 - R2 (the difference between the two ranks given to the same item)

N = number of pairs of X and Y values.

When R = 1, there is perfect positive correlation between the two sets of ranks.

Ex: In a competition, two judges J1 and J2 were invited to judge 10 participants. They
gave marks to the participants as shown below.

Find Spearman’s Rank correlation coefficient and interpret it.

Participant No   1   2   3   4   5   6   7   8   9   10
Marks by J1      25  45  89  65  57  48  91  56  95  78
Marks by J2      45  89  57  68  69  50  95  58  80  90

Participant No   Marks by J1   Rank R1   Marks by J2   Rank R2   D = R1-R2   D^2
1                25            10        45            10        0           0
2                45            9         89            3         6           36
3                89            3         57            8         -5          25
4                65            5         68            6         -1          1
5                57            6         69            5         1           1
6                48            8         50            9         -1          1
7                91            2         95            1         1           1
8                56            7         58            7         0           0
9                95            1         80            4         -3          9
10               78            4         90            2         2           4
∑D^2 = 78

N=10

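$R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 78}{10^3 - 10} = 1 - \frac{468}{1000 - 10} = 1 - \frac{468}{990}$

= 1 - 0.47

= 0.53

Interpretation : R = 0.53 indicates only moderate agreement between the rankings of the two
judges, so their evaluations of the participants are not closely consistent.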
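The ranking and the formula can be sketched in Python as follows. This is a minimal illustration that assumes no tied marks and gives rank 1 to the highest mark, as in the table above; the helper names are hypothetical.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """Spearman's R = 1 - 6 * sum(D^2) / (N^3 - N) for untied data."""
    n = len(xs)
    r1 = ranks_descending(xs)
    r2 = ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

j1 = [25, 45, 89, 65, 57, 48, 91, 56, 95, 78]
j2 = [45, 89, 57, 68, 69, 50, 95, 58, 80, 90]
R = spearman_r(j1, j2)
print(round(R, 2))
```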

Ex2 : From the following data, find spearman’s rank correlation coefficient.

X 25 78 89 98 56 54 57 45 59
Y 45 75 89 87 58 62 55 48 65

Ans: R = 0.92

Ex. 3. Find Spearman’s Rank correlation coefficient from the following data.

Roll No          1   2   3   4   5   6   7   8   9   10
Marks in Maths   45  56  78  48  98  49  88  57  79  51
Marks in C       54  59  80  40  85  43  79  60  81  46

Ans :
Marks in Maths   Rank in Maths R1   Marks in C   Rank in C R2   D = R1-R2   D^2
45               10                 54           7              3           9
56               6                  59           6              0           0
78               4                  80           3              1           1
48               9                  40           10             -1          1
98               1                  85           1              0           0
49               8                  43           9              -1          1
88               2                  79           4              -2          4
57               5                  60           5              0           0
79               3                  81           2              1           1
51               7                  46           8              -1          1
∑D^2 = 18

$R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 18}{10^3 - 10} = 1 - \frac{108}{1000 - 10} = 1 - \frac{108}{990}$

= 1 - 0.11

= 0.89

There is a strong positive relation between marks in maths and marks in C. The students'
ranks in the two subjects are almost the same: students who are good in maths are also
good in C and vice versa.

Type II problems: When there is a tie between two or more data items.

In such cases, each tied item is given the mean of the ranks it would otherwise occupy, and
for every group of tied items a correction factor is added to ∑D^2:

(ni^3 - ni)/12, where ni stands for the number of data items sharing the same rank.

$r = 1 - \frac{6 \left[ \sum D^2 + \sum_i \frac{n_i^3 - n_i}{12} \right]}{N^3 - N}$

Ex.: Find Spearman’s rank correlation coefficient from the following data.

X 78 45 89 78 56 55 89 56 56
Y 88 89 45 56 88 96 88 45 45
ANS:

X Rank R1 Y Rank R2 D=R1-R2 D2


78 3.5 88 4 -0.5 0.25
45 9 89 2 7 49
89 1.5 45 8 -6.5 42.25
78 3.5 56 6 -2.5 6.25
56 6 88 4 2 4
55 8 96 1 7 49
89 1.5 88 4 -2.5 6.25
56 6 45 8 -2 4
56 6 45 8 -2 4
∑ D 2=165

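1. For 89 in X, the ranks are 1 and 2.

Mean rank = (1+2)/2 = 1.5

n1 = 2

2. For 78 in X, the ranks are 3 and 4.

Mean rank = (3+4)/2 = 3.5

n2 = 2

3. For 56 in X, the ranks are 5, 6 and 7.

Mean rank = (5+6+7)/3 = 6

n3 = 3

4. For 88 in Y, the ranks are 3, 4 and 5.

Mean rank = (3+4+5)/3 = 4

n4 = 3

5. For 45 in Y, the ranks are 7, 8 and 9.

Mean rank = (7+8+9)/3 = 8

n5 = 3

$r = 1 - \frac{6 \left[ \sum D^2 + \sum_i \frac{n_i^3 - n_i}{12} \right]}{N^3 - N}$

In this case,

$r = 1 - \frac{6 \left[ 165 + \frac{2^3 - 2}{12} + \frac{2^3 - 2}{12} + \frac{3^3 - 3}{12} + \frac{3^3 - 3}{12} + \frac{3^3 - 3}{12} \right]}{9^3 - 9}$

$r = 1 - \frac{6 \left[ 165 + 0.5 + 0.5 + 2 + 2 + 2 \right]}{729 - 9}$

$r = 1 - 6 \times \frac{172}{720}$

r = 1 - 6 × 0.2389

r = 1 - 1.433

r = -0.433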
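A minimal Python sketch of the tie-corrected calculation is given below. It assumes mean ranks for tied values and the (ni^3 - ni)/12 correction described above; the function names are illustrative only.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 is the largest value; tied values share the mean of their ranks."""
    order = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = order.index(v) + 1          # first rank position held by v
        count = order.count(v)              # how many items tie on v
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def tie_correction(values):
    """Sum of (n^3 - n) / 12 over every group of n tied values."""
    return sum((n ** 3 - n) / 12 for n in Counter(values).values() if n > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    r1, r2 = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

X = [78, 45, 89, 78, 56, 55, 89, 56, 56]
Y = [88, 89, 45, 56, 88, 96, 88, 45, 45]
r_ties = spearman_with_ties(X, Y)
print(round(r_ties, 3))
```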

Properties of Karl Pearson’s correlation coefficient:

1.r is unit less.

2. The value of r always lies between +1 and -1

3. The Pearson product-moment correlation does not take into consideration whether a
variable has been classified as a dependent or independent variable. It treats all variables
equally.

4. A change of origin or a change of scale of the variables does not affect the magnitude
of r. Only the sign of r changes, and only when exactly one of the variables is multiplied
by a negative factor.
Association of attributes:

When data is collected on the basis of some attribute or attributes, we have
statistics commonly termed as statistics of attributes. It is not necessary that the objects
possess only one attribute; rather it would be found that the objects possess more
than one attribute. In such a situation our interest may remain in knowing whether the
attributes are associated with each other or not. For example, among a group of people
we may find that some of them are inoculated against small-pox and among the inoculated
we may observe that some of them suffered from small-pox after inoculation.
The important question which may arise for the observation is regarding the
efficiency of inoculation for its popularity will depend upon the immunity which it provides
against small-pox. In other words, we may be interested in knowing whether inoculation
and immunity from small-pox are associated.

Technically, we say that the two attributes are associated if they appear together in a
greater number of cases than is to be expected if they are independent and not simply on
the basis that they are appearing together in a number of cases as is done in ordinary life.
The association may be positive or negative (negative association is also known as
disassociation). If class frequency of AB, symbolically written as (AB), is greater than the
expectation of AB being together if they are independent, then we say the two attributes
are positively associated; but if the class frequency of AB is less than this expectation, the
two attributes are said to be negatively associated. In case the class frequency of AB is
equal to expectation, the two attributes are considered as independent i.e., are said to
have no association. It can be put symbolically as shown hereunder:
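A standard way to write these conditions is sketched below. It assumes N denotes the total number of observations and takes the coefficient discussed next to be Yule's coefficient of association, which these notes do not name explicitly; (Ab), (aB) and (ab) are the class frequencies with a and b denoting the absence of A and B.

$(AB) > \frac{(A)(B)}{N}$ : A and B are positively associated

$(AB) < \frac{(A)(B)}{N}$ : A and B are negatively associated

$(AB) = \frac{(A)(B)}{N}$ : A and B are independent

$Q_{AB} = \frac{(AB)(ab) - (Ab)(aB)}{(AB)(ab) + (Ab)(aB)}$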

The value of this coefficient will be somewhere between +1 and –1. If the attributes
are completely associated (perfect positive association) with each other, the coefficient will
be +1, and if they are completely disassociated (perfect negative association), the
coefficient will be –1. If the attributes are completely independent of each other, the
coefficient of association will be 0. The varying degrees of the coefficients of association
are to be read and understood according to their positive and negative nature between +1
and –1.
Sometimes the association between two attributes, A and B, may be regarded as
unwarranted when we find that the observed association between A and B is due to the
association of both A and B with another attribute C. For example, we may observe
positive association between inoculation and exemption for small-pox, but such
association may be the result of the fact that there is positive association between
inoculation and richer section of society and also that there is positive association between
exemption from small-pox and richer section of society. The sort of association between A
and B in the population of C is described as partial association as distinguished from total
association between A and B in the overall universe. We can workout the coefficient of
partial association between A and B in the population of C by just modifying the above
stated formula for finding association between A and B as shown below:
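Assuming the coefficient of association is Yule's Q (an assumption, since the notes do not show the formula), the partial coefficient is obtained by restricting every class frequency to the cases in which C is present:

$Q_{AB.C} = \frac{(ABC)(abC) - (AbC)(aBC)}{(ABC)(abC) + (AbC)(aBC)}$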

where,
QAB.C = Coefficient of partial association between A and B in the population of C; and all
other values are the class frequencies of the respective classes (A, B, C denotes the
presence of concerning attributes and a, b, c denotes the absence of concerning
attributes)
Unit No : 4

Regression

Regression analysis:

Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables.

It can be utilized to assess the strength of the relationship between variables and for
modeling the future relationship between them.

Regression analysis includes several variations, such as linear, multiple linear, and
nonlinear. The most common models are simple linear and multiple linear. Nonlinear
regression analysis is commonly used for more complicated data sets in which the
dependent and independent variables show a nonlinear relationship.

Linear regression analysis is based on following fundamental assumptions:

1.The dependent and independent variables show a linear relationship, described by a slope
and an intercept.

2.The independent variable is not random.

3.The expected value of the residual (error) is zero.

4.The variance of the residual (error) is constant across all observations.

5.The residual (error) values are not correlated across observations.

6.The residual (error) values follow the normal distribution.

Regression Analysis

It is a method of establishing the relation between two variables in the form of a linear equation.

This relation is used for determining the value of one variable on the basis of the given
value of the second variable.

Regression Equations:

1.Regression equation of X on Y:

X-Xmean = bxy(Y-Ymean)

where bxy is the regression coefficient of X on Y, which is calculated by using

$b_{xy} = \frac{\sum xy}{\sum y^2}$

where x = X - Xmean

y = Y - Ymean

2.Regression equation of Y on X:

Y-Ymean = byx(X-Xmean)

where byx is the regression coefficient of Y on X, which is calculated by using

$b_{yx} = \frac{\sum xy}{\sum x^2}$

where x = X - Xmean

y = Y - Ymean

Also, the correlation coefficient is $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$, where r takes the same sign as the
two regression coefficients.

Ex: Find two regression equations from the following data

X 2 4 6 8 10 12 14 16
Y 7 13 19 25 31 37 43 49

Ans :
X      Y      x = X-Xmean   y = Y-Ymean   xy      x^2    y^2
              = X-9         = Y-28
2      7      -7            -21           147     49     441
4      13     -5            -15           75      25     225
6      19     -3            -9            27      9      81
8      25     -1            -3            3       1      9
10     31     1             3             3       1      9
12     37     3             9             27      9      81
14     43     5             15            75      25     225
16     49     7             21            147     49     441
∑X = 72   ∑Y = 224   ∑xy = 504   ∑x^2 = 168   ∑y^2 = 1512

$Xmean = \frac{\sum X}{n} = \frac{72}{8} = 9$

$Ymean = \frac{\sum Y}{n} = \frac{224}{8} = 28$

1.Regression of X on Y is given by

X-Xmean = bxy(Y-Ymean)

$b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{504}{1512} = 0.3333$

X-Xmean = bxy(Y-Ymean)

X- 9 =0.3333(Y-28)

X-9 =0.3333Y-9.33

X=0.3333Y-9.33+9

X=0.3333Y-0.33

2.Regression of Y on X.
Y-Ymean = byx(X-Xmean)

where byx is the regression coefficient of Y on X, which is calculated by using

$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{504}{168} = 3$

Y-28=3(X-9)

Y-28=3X-27

Y=3X-27+28

Y=3X+1
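The two regression lines found above can be reproduced with a small Python sketch. The function and variable names are illustrative assumptions; the formulas are the deviation-based ones used in this unit.

```python
def regression_coefficients(xs, ys):
    """Return (Xmean, Ymean, bxy, byx) from deviations about the means."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    dx = [x - x_mean for x in xs]
    dy = [y - y_mean for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    bxy = sum_xy / sum(b * b for b in dy)   # regression coefficient of X on Y
    byx = sum_xy / sum(a * a for a in dx)   # regression coefficient of Y on X
    return x_mean, y_mean, bxy, byx

X = [2, 4, 6, 8, 10, 12, 14, 16]
Y = [7, 13, 19, 25, 31, 37, 43, 49]
x_mean, y_mean, bxy, byx = regression_coefficients(X, Y)

def estimate_y(x):
    """Regression of Y on X: Y - Ymean = byx (X - Xmean)."""
    return y_mean + byx * (x - x_mean)

def estimate_x(y):
    """Regression of X on Y: X - Xmean = bxy (Y - Ymean)."""
    return x_mean + bxy * (y - y_mean)

print(bxy, byx, estimate_y(10), estimate_x(31))
```

The line of Y on X is used to estimate Y from a given X, and the line of X on Y to estimate X from a given Y; the two coincide here only because the data lie exactly on a straight line.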

EX 2. From the following data , find

1.regression equation of X on Y

2.regression equation of Y on X

3.Correlation coefficient

X 10 12 14 16 18 20 22 24 26
Y 29 35 41 47 53 59 65 71 77

Ans:

X      Y      x = X-Xmean   y = Y-Ymean   xy      x^2    y^2
              = X-18        = Y-53
10     29     -8            -24           192     64     576
12     35     -6            -18           108     36     324
14     41     -4            -12           48      16     144
16     47     -2            -6            12      4      36
18     53     0             0             0       0      0
20     59     2             6             12      4      36
22     65     4             12            48      16     144
24     71     6             18            108     36     324
26     77     8             24            192     64     576
∑X = 162   ∑Y = 477   ∑xy = 720   ∑x^2 = 240   ∑y^2 = 2160

$Xmean = \frac{\sum X}{n} = \frac{162}{9} = 18$

$Ymean = \frac{\sum Y}{n} = \frac{477}{9} = 53$

1.Regression of X on Y is given by

X-Xmean = bxy(Y-Ymean)

$b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{720}{2160} = 0.3333$

X-Xmean = bxy(Y-Ymean)

X- 18 =0.3333(Y-53)

X-18 =0.3333Y-17.6649

X=0.3333Y-17.6649+18

X=0.3333Y+0.3351

2.Regression of Y on X.

Y-Ymean = byx(X-Xmean)

where byx is the regression coefficient of Y on X, which is calculated by using

$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{720}{240} = 3$

Y-53=3(X-18)

Y-53=3X-54

Y=3X-54+53

Y=3X-1

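3. $r = \sqrt{b_{xy} \cdot b_{yx}}$

$r = \sqrt{0.3333 \times 3}$

r = 1 (the product is 0.9999 only because bxy = 720/2160 = 1/3 was rounded to 0.3333)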

Q3. From the following data, find

a.both regression coefficients

b. regression equations

c.correlation coefficient

d. the value of X when Y=34

X 12 15 19 27 56 62 70 82
Y 06 09 15 21 24 27 29 35

Ans : Students are expected to calculate the data in the blank spaces

X      Y      x = X-Xmean   y = Y-Ymean   xy        x^2       y^2
              = X-42.87     = Y-20.75
12     06     -30.87        -14.75        +455.33   ........  ........
15     09     -27.87        -11.75        +327.47   ........  ........
19     15     -23.87        -5.75         +137.25   ........  ........
27     21     -15.87        0.25          -3.96     ........  ........
56     24     13.13         3.25          42.67     ........  ........
62     27     19.13         6.25          119.56    ........  ........
70     29     27.13         8.25          223.82    ........  ........
82     35     39.13         14.25         557.60    ........  ........
∑X = 343   ∑Y = 166   ∑xy = 1859.74   ∑x^2 = 5356.82   ∑y^2 = 709.48

$Xmean = \frac{\sum X}{n} = \frac{343}{8} = 42.87$

$Ymean = \frac{\sum Y}{n} = \frac{166}{8} = 20.75$

1.Regression of X on Y is given by

X-Xmean = bxy(Y-Ymean)

$b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{1859.74}{709.48} = 2.62$

X-Xmean = bxy(Y-Ymean)

X-42.87 = 2.62(Y-20.75)

X-42.87 = 2.62Y-54.36

X = 2.62Y-54.36+42.87

X = 2.62Y-11.49

2.Regression of Y on X.

Y-Ymean = byx(X-Xmean)

where byx is the regression coefficient of Y on X, which is calculated by using

$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{1859.74}{5356.82} = 0.3472$

Y-20.75 = 0.3472(X-42.87)

Y-20.75 = 0.3472X-14.88

Y = 0.3472X-14.88+20.75

Y = 0.3472X+5.87

3. $r = \sqrt{b_{xy} \cdot b_{yx}}$

$r = \sqrt{2.62 \times 0.3472}$

r = 0.9538

4. the value of X when Y=34

X=2.62Y-11.49

=2.62*34-11.49

X=89.08-11.49

X=77.59

Ex4. From the following sales and advt expenses, find

a.both regression coefficients

b. regression equations

c.correlation coefficient

d. Sales when advt expenses is Rs.55 lacs

e. Advt. expenses in order to achieve sales of Rs.102 lacs

Sales in Rs. lacs           10    20    30    40    60    80    100   120   160   180
Advt expenses in Rs. lacs   0.52  1.53  4.25  6.25  8.21  7.24  7.00  8.00  7.52  7.48

Business application:
1.Regression analysis in finance :

Regression analysis has several applications in finance. For example, the statistical
method is fundamental to the Capital Asset Pricing Model (CAPM). Essentially, the CAPM
equation is a model that determines the relationship between the expected return of an
asset and the market risk premium.

The analysis is also used to forecast the returns of securities, based on different factors, or
to forecast the performance of a business.

2. Forecasting Revenues and Expenses

When forecasting financial statements for a company, it may be useful to do a multiple


regression analysis to determine how changes in certain assumptions or drivers of the
business will impact revenue or expenses in the future. For example, there may be a very
high correlation between the number of salespeople employed by a company, the number
of stores they operate, and the revenue the business generates.
Unit No : 5

Elementary probability

Probability :

The theory of probability grew out of attempts to work out the chances of winning games
of chance in gambling.

For example, when a fair coin is tossed, the chances of a head or a tail appearing can be determined.
There are three approaches to probability:

i) Classical approach
ii) Empirical Approach
iii) Axiomatic Approach

Random Experiment :

An experiment is called a random experiment if, when conducted repeatedly under
homogeneous conditions, the result is not unique but may be any one of the
various possible outcomes.

Event :

An event is a possible outcome, or a set of possible outcomes, of a random experiment.

Mutually Exclusive Events:

Two or more events are said to be mutually exclusive events if occurring of any one
of them excludes the happening of all other events in the same experiment. When a coin
is tossed, appearing of head automatically terminates the appearing of tail and vice versa.

Equally Likely :

The outcomes are said to be equally likely or equally probable if none of them is
expected to appear or occur in preference to other.

Independent Events:

Events are said to be independent of each other if happening of any one of them is
not affected and does not affect the happening of any one of others.

Types of Probability

a) Mathematical or classical or ‘A Priori’ Probability or theoretical probability :


If a random experiment results in N exhaustive, mutually exclusive and equally
likely outcomes, out of which m are favourable to the happening of an event A, then
the probability of happening of event A, denoted as P(A), is given by

$P(A) = \frac{\text{Number of favourable outcomes } (m)}{\text{Exhaustive number of cases } (N)}$

b) Experimental probability: Experimental probability is calculated using data from
experiments. It is the number of trials in which the event occurred divided by the total
number of trials.

Probability = Number of favourable outcomes / Total number of trials

Solved Examples

Q.1) An unbiased die is thrown. What is the probability that on the uppermost face

i) a prime number appears

ii) an even number appears

iii) a perfect square number appears

iv) a score of at least 3 appears

Ans:- Let S be sample space

S={1,2,3,4,5,6}

Here n(s)=6

i) Let A event that prime number appears

A={2,3,5}

∴ n(A)=3

P(A)=n(A)/n(S)

=3/6

=0.50
ii)Let B event that even number appears

B={2,4,6}

∴ n(B)=3

P(B)=n(B)/n(S)

=3/6

=0.50

iii) Let C be the event that a perfect square number appears

C={1,4}

∴ n(C)=2

P(C)=n(C)/n(S)

=2/6

=0.3333

iv) Let D be the event that a score of at least 3 appears

D={3,4,5,6}

∴n(D)=4

P(D)=n(D)/n(S)

=4/6

=0.6667
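The four answers can be verified by counting favourable outcomes directly. The sketch below uses Python's `fractions.Fraction` so the results stay exact; the helper name `probability` is an illustrative choice.

```python
from fractions import Fraction

sample_space = range(1, 7)  # faces of one unbiased die

def probability(event):
    """Classical probability: favourable outcomes / exhaustive outcomes."""
    favourable = [s for s in sample_space if event(s)]
    return Fraction(len(favourable), len(sample_space))

p_prime = probability(lambda s: s in (2, 3, 5))
p_even = probability(lambda s: s % 2 == 0)
p_square = probability(lambda s: s in (1, 4))
p_at_least_3 = probability(lambda s: s >= 3)

print(p_prime, p_even, p_square, p_at_least_3)
```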

Q.2)Two unbiased dice are thrown. What is the probability that the total score on upper
most faces of two dice is

i)multiple of 5 ?

ii)a perfect square?

iii)a prime number?

iv)greater than 5?
v)at least 9?

vi)at most 6

Ans:-Let S be sample space.

S={(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),(3,1),(3,2),(3,3),
(3,4),(3,5),(3,6),(4,1),

(4,2),(4,3), (4,4),(4,5),(4,6),(5,1),(5,2),(5,3),(5,4),(5,5),(5,6),(6,1),(6,2),(6,3),(6,4),(6,5),
(6,6) }

∴ n(S)=36

i) Let A be the event that the total score is a multiple of 5, i.e. 5 or 10

A={(1,4),(2,3),(3,2),(4,1),(4,6),(5,5),(6,4)}

∴ n(A)=7

P(A)=n(A)/n(S)

=7/36

=0.1944

ii) Let B event that the total score is perfect square i.e. 4,9

B={(1,3),(2,2),(3,1),(3,6),(4,5),(5,4),(6,3)}

∴n(B)=7

P(B)=n(B)/n(S)

=7/36

=0.19

iii) Let C be the event that the total score is a prime number, i.e. 2, 3, 5, 7 or 11

C={(1,1),(1,2),(2,1),(1,4),(2,3),(3,2),(4,1),(1,6),(2,5),(3,4),(4,3),(5,2),(6,1),(5,6),(6,5)}

∴ n(C)=15

P(C)=n(C)/n(S)=15/36=0.4167

iv) let D event that the total score is Greater than 5


D={(1,5)(1,6)(2,4)(2,5)(2,6)(3,3)(3,4)(3,5)(3,6)(4,2)(4,3)(4,4)(4,5)(4,6)(5,1)(5,2)(5,3)(5,4)
(5,5)(5,6)(6,1),

(6,2)(6,3)(6,4)(6,5)(6,6)}

∴n(D)=26

P(D)=n(D)/n(s)

=26/36

=0.72

v) Let E be the event that the total score is at least 9

E={(3,6),(4,5),(4,6),(5,4),(5,5),(5,6),(6,3),(6,4),(6,5),(6,6)}

∴n(E)=10

P(E)=n(E)/n(S)

=10/36

=0.2778

vi) Let F be the event that the total score is at most 6

F={(1,1),(1,2),(1,3),(1,4),(1,5),(2,1),(2,2),(2,3),(2,4),(3,1),(3,2),(3,3),(4,1),(4,2),(5,1)}

∴n(F)=15

P(F)=n(F)/n(S)

=15/36

=0.4167
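Listing 36 ordered pairs by hand is error-prone, so the counts can be cross-checked by enumerating the sample space. This is a minimal sketch; `probability` is a hypothetical helper that takes a condition on the total score.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of two unbiased dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(condition):
    """P(event) where the event is a condition on the total score."""
    favourable = sum(1 for a, b in outcomes if condition(a + b))
    return Fraction(favourable, len(outcomes))

p_multiple_of_5 = probability(lambda t: t % 5 == 0)
p_perfect_square = probability(lambda t: t in (4, 9))
p_prime = probability(lambda t: t in (2, 3, 5, 7, 11))
p_greater_than_5 = probability(lambda t: t > 5)
p_at_least_9 = probability(lambda t: t >= 9)
p_at_most_6 = probability(lambda t: t <= 6)

for p in (p_multiple_of_5, p_perfect_square, p_prime,
          p_greater_than_5, p_at_least_9, p_at_most_6):
    print(p, round(float(p), 4))
```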

Problems on Playing cards :

Playing cards (52)

26 Red cards 26 Black Cards

13 Heart 13 Diamond 13 Spade 13 Club


Each suit of 13 cards contains :

3 picture cards : 1 Jack (J), 1 Queen (Q), 1 King (K)

1 Ace (A)

9 number cards : 2 to 10

Hence

Total picture cards =12

Total prime number cards=16

Total Ace cards=4

Total number cards=36

Total non picture cards=40

Q.3) Three cards are drawn from a well shuffled pack of 52 playing cards. Find the
probability that

i) all 3 cards are club cards .

ii) all 3 cards are red picture cards .

iii) all 3 cards show prime digit number.

iv) all 3 cards are black number cards.

v) all 3 cards are spade number cards .

ANS:-Let S be the sample space

n(S)= 52C3

=52 x 51x 50 /3 x 2 x 1

=22100
i)Let A event that all 3 Cards are club cards

n(A) = 13C3

= 13x12x11/3x2x1

= 286

P(A) =n(A)/n(S)

=286/22100

=0.0129

ii) Let B be the event that all 3 cards are red picture cards

n(B) = 6C3

=6 x 5 x 4 /3x 2x 1

=20

P(B) =n(B)/n(S)

=20/22100

=0.0009

iii)Let C event that all 3 cards show prime digit number

n(C)= 16C3

=16 x 15 x 14 /3 x 2 x 1

=560

P(C) =n(C)/n(S)

=560/22100

=0.0253

iv)Let D event that all 3 Cards are black number cards

n(D) = 18C3

=18 x 17 x 16/3x 2 x 1
=816

P(D) =n(D)/n(S)

=816/22100

=0.0369

v)Let E event that all 3 Cards are spade number cards

n(E) = 9C3

n(E) = 9 x 8 x 7/3 x 2x 1

=84

P(E) =n(E)/n(S)

=84/22100

=0.0038
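All five parts follow the same pattern, kC3 / 52C3, where k is the number of cards of the required kind. A short sketch using `math.comb` from the Python standard library reproduces the answers; the dictionary labels are illustrative.

```python
from math import comb

total = comb(52, 3)  # ways to draw 3 cards from 52

# Number of cards of each kind in the pack, as counted in the notes.
cards_of_kind = {
    "club": 13,
    "red picture": 6,
    "prime number": 16,
    "black number": 18,
    "spade number": 9,
}

probabilities = {kind: comb(k, 3) / total for kind, k in cards_of_kind.items()}

for kind, p in probabilities.items():
    print(kind, round(p, 4))
```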

Addition law of probability(Total probability)

Statement : If A & B are any two events then P( A ∪ B)=P ( A )+ P ( B )−P( A ∩ B)

Proof : Let A and B be any two events. Assume that there are m elements in event A,
n elements in event B, and p elements that appear in both A and B.

The corresponding Venn diagram is as follows.

[Venn diagram: two overlapping circles A and B; m-p elements lie only in A, p elements in the overlap, and n-p elements only in B.]

Now, from the figure,

(m-p) + p + (n-p) = n(A ∪ B)

∴ m + n - p = n(A ∪ B)

∴ n(A) + n(B) - n(A ∩ B) = n(A ∪ B)

Dividing both sides by n(S),

$\frac{n(A)}{n(S)} + \frac{n(B)}{n(S)} - \frac{n(A \cap B)}{n(S)} = \frac{n(A \cup B)}{n(S)}$

By definition of probability,

P(A) + P(B) - P(A ∩ B) = P(A ∪ B)

∴ P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

If A and B are mutually exclusive, then P(A ∩ B) = 0 and

P(A ∪ B) = P(A) + P(B)

Solved Examples :

Q1) Two Cards are drawn from a well shuffled pack of 52 playing cards.Find the
probability that

i) both cards are red cards or picture cards

ii) both cards are red cards or number cards

iii) both cards are picture cards or spade cards

iv) both cards are number cards or heart cards

Ans: Let S be the sample space.

n(S) =52C2

∴ n(S)=52 x 51 / 2 x 1

∴n(S) =1326

i)Let A event that both cards are red card

n(A) = 26C2

∴ n(A)=26 x 25/ (2 x 1)=325

∴ P(A)=n(A)/n(S)

∴ P(A) = 325/1326

∴ P(A) =0.2450
Let B event that both cards are picture card

n(B) =12C2

∴n(B)=12 x 11/ (2 x 1)

∴ n(B)=66

P(B) =n(B)/n(S)

= 66/1326

= 0.0497

Let A ∩ B event that both cards are red as well as picture cards

n(A ∩ B) = 6C2

= 6 x 5 / (2 x 1)

=15

P(A∩ B)=n(A∩ B)/n(S)

= 15/1326

= 0.0113

By addition law of the probability ,

P(A∪ B)=P(A)+P(B)- P(A ∩ B)

=0.2450+0.0497-0.0113

=0.2834

ii) Let A event that both cards are red card

n(A) =26C2

∴ n(A) =26*25/2*1

=325

P(A)= n(A)/n(S)

= 325/1326
= 0.2450

Let B be the event that both cards are number cards

n(B) =36C2

∴n(B)=36*35/(2*1)

=630

P(B)=n(B)/n(S)

= 630/1326

=0.4751

Let A ∩ B be the event that both cards are red as well as number cards

n(A ∩ B)= 18C2

= 18*17/(2*1)

=153

P(A ∩ B)=n(A ∩ B)/n(S)

= 153/1326

=0.1153

By addition law of probability, the probability that both cards are red cards or number
cards

P(A∪ B)= P(A)+P(B)-P(A ∩ B)

= 0.2450+0.4751-0.1153

= 0.6048

iii)Let A event that both cards are picture card

n(A)= 12C2

=12*11/2*1
=66

P(A)=n(A)/n(S)

=66/1326

=0.0497

Let B be the event that both cards are spade cards

n(B)= 13C2

=13*12/(2*1)

=78

P(B)=n(B)/n(S)

=78/1326

=0.0588

Let A ∩ B be the event that both cards are picture cards as well as spade cards

n(A ∩ B)= 3C2

= 3*2/2*1

=3

P( A ∩ B) = n(A ∩ B)/n(S)

=3/1326

=0.0022

By addition law of the probability , the probability that both cards are picture cards or
spade cards

P(A ∪ B)=P(A)+P(B)-P(A∩B)

=0.0497+0.0588-0.0022

=0.1063

iv) Let A be the event that both cards are number cards

n(A) = 36C2

= 36*35/(2*1)

= 630

P(A)=n(A)/n(S)

=630/1326

=0.4751

Let B be the event that both cards are heart cards

n(B) = 13C2

∴n(B)=13*12/(2*1)=78

P(B)=n(B)/n(S)

=78/1326

=0.0588

Let A ∩ B be the event that both cards are number cards as well as heart cards.
There are 9 heart number cards, so

n(A∩ B)= 9C2

= 9*8/(2*1)

=36

P(A∩ B)=n(A∩ B)/n(S)

=36/1326

=0.0271

By addition law of probability, the probability that both cards are number cards or heart cards

P(A ∪ B)=P(A)+P(B)-P(A ∩ B)

=0.4751+0.0588-0.0271

=0.5068
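The addition law lends itself to a quick check with exact fractions. The sketch below covers parts i) to iii); the helper names `both_from` and `either` are assumptions made for illustration, and each count (26 red, 12 picture, 6 red picture, and so on) comes from the card breakdown given earlier.

```python
from fractions import Fraction
from math import comb

def both_from(k, deck=52):
    """Probability that both of two drawn cards come from a group of k cards."""
    return Fraction(comb(k, 2), comb(deck, 2))

def either(k_a, k_b, k_both):
    """Addition law: P(A or B) = P(A) + P(B) - P(A and B)."""
    return both_from(k_a) + both_from(k_b) - both_from(k_both)

p_red_or_picture = either(26, 12, 6)     # 6 red picture cards
p_red_or_number = either(26, 36, 18)     # 18 red number cards
p_picture_or_spade = either(12, 13, 3)   # 3 spade picture cards

for p in (p_red_or_picture, p_red_or_number, p_picture_or_spade):
    print(p, round(float(p), 4))
```

Working with exact fractions avoids the small differences in the fourth decimal place that arise when each term is rounded before adding.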

Multiplications law of probability or conditional probability

Statement : If A and B are two events then

P(A ∩ B)= P(A)*P(B/A)

OR

P(A ∩ B)=P(B)*P(A/B)

where P(B/A) is the conditional probability of B given that A has already occurred.

Note : If A & B are independent then

P(A∩ B)=P(A)*P(B)

Solved Examples:

Q1.Two tickets are drawn one after the other from a box containing 20 tickets numbered
from 1 to 20 .

Find the probability that

i)both tickets drawn show even number

ii) both tickets drawn show prime number

Ans:-i) Let A event that first ticket shows an even number

n(A)= 10C1

= 10

Let S1 be the sample space for first ticket

n(S1 )= 20C1

= 20

P(A)=n(A)/n(S1)

=10/20

=0.5

Let B/A event that second ticket shows even number

n(B/A)= 9C1

=9
n(S2) = 19C1

= 19

P(B/A)=n(B/A)/n(S2)

= 9/19

=0.4736

P(A∩ B)= P(A)*P(B/A)

=0.5*0.4736

=0.2368

ii)let A event that first ticket shows prime number

i.e. 2,3,5,7,11,13,17,19

n(A) = 8C1

=8

Let S1 be the sample space for first ticket

n(S1)= 20C1

= 20

P(A)=n(A)/n(S1)

=8/20

Let B/A event that second ticket also shows prime number

n(B/A)= 7C1

=7

n(S2 )= 19C1

=19

P(B/A)=n(B/A)/n(S2)

=7/19

=0.3684

P(A∩ B)= P(A)*P(B/A)

=(8/20)*(7/19)

=0.1474
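Both parts draw two tickets without replacement, so the second probability is conditional on the first. A minimal sketch of this, with the hypothetical helper `both_draws`, is:

```python
from fractions import Fraction

def both_draws(favourable, total):
    """P(A and B) = P(A) * P(B/A) for two draws without replacement."""
    p_first = Fraction(favourable, total)
    # After one favourable ticket is removed, one fewer remains of each count.
    p_second_given_first = Fraction(favourable - 1, total - 1)
    return p_first * p_second_given_first

primes = [n for n in range(1, 21) if n > 1 and all(n % d for d in range(2, n))]
evens = [n for n in range(1, 21) if n % 2 == 0]

p_both_even = both_draws(len(evens), 20)
p_both_prime = both_draws(len(primes), 20)

print(p_both_even, round(float(p_both_even), 4))
print(p_both_prime, round(float(p_both_prime), 4))
```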

Probability Distributions:

Random Variable:

A random variable is a variable whose value is unknown or a function that assigns


values to each of an experiment's outcomes.

In probability and statistics, random variables are used to quantify outcomes of a


random occurrence, and therefore, can take on many values. Random variables are
required to be measurable and are typically real numbers. For example, the letter X may
be designated to represent the sum of the resulting numbers after three dice are rolled. In
this case, X could be 3 (1 + 1+ 1), 18 (6 + 6 + 6), or somewhere between 3 and 18, since
the highest number of a die is 6 and the lowest number is 1.

A random variable is different from an algebraic variable. The variable in an


algebraic equation is an unknown value that can be calculated. The equation 10 + x = 13
shows that we can calculate the specific value for x which is 3. On the other hand, a
random variable has a set of values, and any of those values could be the resulting
outcome as seen in the example of the dice above.

In the corporate world, random variables can be assigned to properties such as


the average price of an asset over a given time period, the return on investment after a
specified number of years, the estimated turnover rate at a company within the following
six months, etc. Risk analysts assign random variables to risk models when they want to
estimate the probability of an adverse event occurring. These variables are presented
using tools such as scenario and sensitivity analysis tables which risk managers use to
make decisions concerning risk mitigation.
Types of Random Variable:

Random variables are of two types: discrete random variables and continuous random variables.

Discrete random variable :

Discrete random variables take on a countable number of distinct values.


Consider an experiment where a coin is tossed three times. If X represents the number of
times that the coin comes up heads, then X is a discrete random variable that can only
have the values 0, 1, 2, 3 (from no heads in three successive coin tosses to all heads). No
other value is possible for X.

Continuous Random Variable :

Continuous random variables can represent any value within a specified range or
interval and can take on an infinite number of possible values. An example of a continuous
random variable would be an experiment that involves measuring the amount of rainfall in
a city over a year, or the average height of a random group of 25 people.

Probability distribution of discrete random variable

A random variable has a  probability distribution that represents the likelihood


that any of the possible values would occur. Let’s say that the random variable, Z, is the
number on the top face of a die when it is rolled once. The possible values for Z will thus
be 1, 2, 3, 4, 5, and 6. The probability of each of these values is 1/6 as they are all equally
likely to be the value of Z.

For instance, the probability of getting a 3, or P (Z=3), when a die is thrown is


1/6, and so is the probability of having a 4 or a 2 or any other number on all six faces of a
die. Note that the sum of all probabilities is 1.

A typical example of a random variable is the outcome of a coin toss. Consider


a probability distribution in which the outcomes of a random event are not equally likely to
happen. If random variable, Y, is the number of heads we get from tossing two coins, then
Y could be 0, 1, or 2. This means that we could have no heads, one head or both heads
on a two-coin toss.

However, the two coins land in four different ways: TT, HT, TH, HH. Therefore,
the P(Y=0) = 1/4 since we have one chance of getting no heads (i.e., two tails [TT] when
the coins are tossed). Similarly, the probability of getting two heads (HH) is also 1/4.
Notice that getting one head has a likelihood of occurring twice: in HT and TH. In this
case, P (Y=1) = 2/4 = 1/2.
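
The two-coin distribution above can be checked by listing the outcomes directly. The following is a minimal Python sketch, assuming two fair coins; the variable names are illustrative and not part of the notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# List the four equally likely outcomes of tossing two fair coins:
# ('H','H'), ('H','T'), ('T','H'), ('T','T').
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give 0, 1 or 2 heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Turn the counts into exact probabilities.
distribution = {
    y: Fraction(count, len(outcomes)) for y, count in sorted(head_counts.items())
}

for y, prob in distribution.items():
    print(f"P(Y={y}) = {prob}")
```

Exact fractions are used so that the probabilities can be compared with 1/4, 1/2 and 1/4 without rounding.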

Mean or Expected value E(X) :

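The expected value of a discrete random variable X taking the values xi with probabilities
pi is given by

$$E(X) = \sum_i x_i\, p_i$$

The variance is given by

$$V(X) = \sum_i x_i^2\, p_i - \Big(\sum_i x_i\, p_i\Big)^2$$

and the standard deviation is $\sqrt{V(X)}$.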

Solved Examples :

Q1. Find mean and variance from the following probability distribution:

xi 0 1 2 3 4 5
pi 1/12 4/12 3/12 1/12 2/12 1/12
Solution :

xi      pi      xi . pi     xi² . pi
0       1/12    0           0
1       4/12    4/12        4/12
2       3/12    6/12        12/12
3       1/12    3/12        9/12
4       2/12    8/12        32/12
5       1/12    5/12        25/12
Total           ∑ xi . pi = 26/12     ∑ xi² . pi = 82/12

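Mean or Expectation E(X) = ∑ xi . pi = 26/12 = 2.1667

$$V(X) = \sum_i x_i^2\, p_i - \Big(\sum_i x_i\, p_i\Big)^2 = \frac{82}{12} - \left(\frac{26}{12}\right)^2$$

= 6.8333 − 4.6944

= 2.1389

Standard deviation = √V(X) = √2.1389 = 1.4625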
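
The arithmetic of Q1 can be reproduced with exact fractions. This is a small Python sketch rather than part of the original solution; it uses the usual definitions, in which the variance is ∑ xi² pi − (∑ xi pi)² and 1.4625 is its square root, the standard deviation.

```python
from fractions import Fraction
from math import sqrt

# Probability distribution from Q1.
xs = [0, 1, 2, 3, 4, 5]
ps = [Fraction(k, 12) for k in (1, 4, 3, 1, 2, 1)]

# Mean: sum of x * p.
mean = sum(x * p for x, p in zip(xs, ps))

# Second moment about zero: sum of x^2 * p.
second_moment = sum(x * x * p for x, p in zip(xs, ps))

# Variance and standard deviation.
variance = second_moment - mean ** 2
std_dev = sqrt(variance)

print(f"mean = {float(mean):.4f}")
print(f"variance = {float(variance):.4f}")
print(f"standard deviation = {std_dev:.4f}")
```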

Probability Distribution of continuous random variable :

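If X is a continuous random variable with probability density function f(x), then the
probability that X lies between a and b is given by

$$P(a \le X \le b) = \int_a^b f(x)\,dx, \qquad \text{where } \int_{-\infty}^{\infty} f(x)\,dx = 1.$$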

Solved Example :

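1. If f(x) = (x² − 1)/18 for 0 ≤ x ≤ 5, find P(0 < X < 3).

Ans :

$$P(0 < X < 3) = \int_0^3 f(x)\,dx = \frac{1}{18}\int_0^3 (x^2 - 1)\,dx$$

$$= \frac{1}{18}\left[\frac{x^3}{3} - x\right]_0^3$$

$$= \frac{1}{18}\left\{\left(\frac{3^3}{3} - 3\right) - \left(\frac{0^3}{3} - 0\right)\right\}$$

$$= \frac{1}{18} \times 6$$

= 0.3333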
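
The value of the integral can be cross-checked numerically. The sketch below uses a simple midpoint rule; the helper name `midpoint_integral` and the number of steps are illustrative choices, not something taken from the notes.

```python
def f(x):
    # Function given in the example.
    return (x ** 2 - 1) / 18


def midpoint_integral(func, a, b, steps=100_000):
    # Approximate the integral of func over [a, b] by summing the areas of
    # thin rectangles whose heights are taken at the interval midpoints.
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


area = midpoint_integral(f, 0, 3)
print(f"integral from 0 to 3 = {area:.4f}")
```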

Distribution function of cumulative probability distribution :

The cumulative distribution function (CDF) of a real-valued random variable X, or just
distribution function of X, evaluated at x, is the probability that X will take a value less
than or equal to x.
In the case of a continuous distribution, it gives the area under the probability
density function from minus infinity to x. Cumulative distribution functions are also used to
specify the distribution of multivariate random variables.
The cumulative distribution function of a real-valued random variable X is the
function given by

$$F(x) = P(X \le x)$$

where the right-hand side represents the probability that the random variable X takes on a
value less than or equal to x. The probability that X lies in the semi-closed interval (a, b],
where a < b, is therefore

$$P(a < X \le b) = F(b) - F(a)$$

The CDF of a continuous random variable X can be expressed as the integral of its
probability density function f(x) as follows:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$

Example : If X is a r.v. with the probability distribution shown below

xi      0       1       2       3       4       5
pi      1/12    4/12    3/12    1/12    2/12    1/12

then the cumulative distribution function is given as

xi            0       1       2       3       4       5
c.d.f. F(x)   1/12    5/12    8/12    9/12    11/12   1
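
The c.d.f. row is a running total of the pi row. A minimal Python sketch of that accumulation follows, together with one use of P(a < X ≤ b) = F(b) − F(a); the interval (1, 4] is chosen only for illustration.

```python
from fractions import Fraction
from itertools import accumulate

xs = [0, 1, 2, 3, 4, 5]
ps = [Fraction(k, 12) for k in (1, 4, 3, 1, 2, 1)]

# F(x) at each listed value is the running sum of the probabilities.
cdf = dict(zip(xs, accumulate(ps)))

# Probability of the semi-closed interval (1, 4], i.e. X = 2, 3 or 4.
prob_1_to_4 = cdf[4] - cdf[1]

for x, value in cdf.items():
    print(f"F({x}) = {value}")
print(f"P(1 < X <= 4) = {prob_1_to_4}")
```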

Moments :
The n-th moment of a real-valued continuous function f(x) of a real variable
about a value c is

$$\mu_n = \int_{-\infty}^{\infty} (x - c)^n f(x)\,dx.$$

It is possible to define moments for random variables in a more general fashion than
moments for real values. The moment of a function, without further explanation, usually
refers to the above expression with c = 0.
For the second and higher moments, the central moments (moments about
the mean, with c being the mean) are usually used rather than the moments about zero,
because they provide clearer information about the distribution's shape.
If f is a probability density function, then the value of the integral above is
called the n-th moment of the probability distribution. More generally, if F is a cumulative
probability distribution function of any probability distribution, which may not have a density
function, then the n-th moment of the probability distribution is given by

$$\mu'_n = E(X^n) = \int_{-\infty}^{\infty} x^n\,dF(x)$$

a. Binomial distribution :-

It arises from repeated Bernoulli trials (the Bernoulli distribution is the special case
n = 1). This distribution is useful under the following conditions:

1) The random experiment is repeated a finite and fixed number of times, i.e. the number
of trials n is finite and fixed.

2) The outcome of each trial falls into only two mutually disjoint categories, i.e.
success and failure.

3) All trials are independent, i.e. the result of any trial is not affected in any way by the
preceding trials and does not affect the results of succeeding trials.

4) The probability of success in any trial is p and is constant for each trial; q = 1 − p is
then the probability of failure and is also constant for each trial.

Statement:-

The probability of getting ‘r’ successes out of ‘n’ trials is given by

$$P(X = r) = {}^{n}C_{r}\; p^{r}\, q^{\,n-r}$$

where p is the probability of success and q is the probability of failure.

Solved Examples

Q1. Ten unbiased coins are tossed. Find the probability of getting

i) exactly 6 heads

ii) at least 8 heads

iii) no head

iv) not more than 3 heads

Ans:-Here n=10

p=prob of getting head

∴p=1/2

q=1-p

∴q=1-1/2

∴q=1/2

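Let X be the random variable which denotes the number of heads.

By using the binomial distribution,

$$P(X = r) = {}^{n}C_{r}\; p^{r}\, q^{\,n-r}$$

i) The probability of getting exactly 6 heads is

$$P(X = 6) = {}^{10}C_{6} \left(\tfrac{1}{2}\right)^{6} \left(\tfrac{1}{2}\right)^{10-6} = \frac{10 \times 9 \times 8 \times 7 \times 6 \times 5}{6 \times 5 \times 4 \times 3 \times 2 \times 1} \times \left(\tfrac{1}{2}\right)^{10}$$

= 210 × 1/1024

= 210/1024

= 0.205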

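ii) The probability of getting at least 8 heads, i.e. 8, 9 or 10 heads, is

$$P(X \ge 8) = P(X = 8) + P(X = 9) + P(X = 10)$$

$$= {}^{10}C_{8} \left(\tfrac{1}{2}\right)^{8} \left(\tfrac{1}{2}\right)^{10-8} + {}^{10}C_{9} \left(\tfrac{1}{2}\right)^{9} \left(\tfrac{1}{2}\right)^{10-9} + {}^{10}C_{10} \left(\tfrac{1}{2}\right)^{10} \left(\tfrac{1}{2}\right)^{10-10}$$

$$= \left({}^{10}C_{8} + {}^{10}C_{9} + {}^{10}C_{10}\right) \times \left(\tfrac{1}{2}\right)^{10}$$

= (45 + 10 + 1) × 1/1024

= 56/1024

= 0.0547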

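iii) The probability of getting no head, i.e. X = 0, is

$$P(X = 0) = {}^{10}C_{0} \left(\tfrac{1}{2}\right)^{0} \left(\tfrac{1}{2}\right)^{10-0}$$

= 1/1024

= 0.00098

iv) The probability of getting not more than 3 heads, i.e. 0, 1, 2 or 3 heads, is

$$P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)$$

$$= \left({}^{10}C_{0} + {}^{10}C_{1} + {}^{10}C_{2} + {}^{10}C_{3}\right) \times \left(\tfrac{1}{2}\right)^{10}$$

= (1 + 10 + 45 + 120) × 1/1024

= 176/1024

= 0.1719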
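
The four parts of Q1 can be verified with a few lines of Python. This is a sketch using the standard library only; `binomial_pmf` is a helper name introduced here for illustration. Note that "not more than 3 heads" covers X = 0, 1, 2 and 3, so the X = 0 term is included in the sum.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(r, n, p):
    # P(X = r) = nCr * p^r * q^(n-r), with q = 1 - p.
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 10, Fraction(1, 2)

exactly_6 = binomial_pmf(6, n, p)
at_least_8 = sum(binomial_pmf(r, n, p) for r in range(8, 11))
no_head = binomial_pmf(0, n, p)
at_most_3 = sum(binomial_pmf(r, n, p) for r in range(0, 4))

for label, value in [
    ("exactly 6 heads", exactly_6),
    ("at least 8 heads", at_least_8),
    ("no head", no_head),
    ("not more than 3 heads", at_most_3),
]:
    print(f"{label}: {value} = {float(value):.4f}")
```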

Fitting up Binomial Distribution :

We often want to compare a set of data from observations with a theoretical
probability distribution. Can the data be represented satisfactorily by a theoretical
distribution? If so, the data can be represented very succinctly by the parameters of the
theoretical distribution. Specifically, let us consider whether a set of data can be
represented by a binomial distribution.

The binomial distribution has two parameters, n and p.

In any practical case we will already know n, the number of trials. How can we
estimate p, the probability of “success” in a single trial? An intuitive answer is that we can
estimate p by the fraction of all the trials which were “successes,” that is, the proportion or
relative frequency of “success.” It is possible to show mathematically that this intuitive
answer is an unbiased estimate of the parameter p.

Example: The following data give the number of seeds germinating (X) out of 10 on damp
filter for 80 sets of seeds. Fit a binomial distribution to the data.

X    0    1    2    3    4    5    6    7    8    9    10
f    6    20   28   12   8    6    0    0    0    0    0

Solution: Here the random variable X denotes the number of seeds germinating out of a set
of 10 seeds.
The total number of trials n = 10.
The mean of the given data is

X̄ = (0×6 + 1×20 + 2×28 + 3×12 + 4×8 + 5×6)/80 = 174/80 = 2.175

Since the mean of a binomial distribution is np,

∴ np = 2.175.
Thus, we get

p = 2.175/10 = 0.2175 ≈ 0.22 and q = 1 − 0.22 = 0.78.

Using these values, we can compute

$$P(X = x) = {}^{10}C_{x}\, (0.22)^{x}\, (0.78)^{10-x}$$

and then the expected frequency N × P(X) for X = 0, 1, 2, ..., 10, where

N = ∑f = 80

The calculated probabilities and the respective expected frequencies are shown in the
following table:

X      P(X)      N × P(X)    Approximated frequency
0      0.0834    6.67        6
1      0.2351    18.81       19
2      0.2984    23.87       24
3      0.2244    17.96       18
4      0.1108    8.86        9
5      0.0375    3.00        3
6      0.0088    0.71        1
7      0.0014    0.11        0
8      0.0001    0.01        0
9      0.0000    0.00        0
10     0.0000    0.00        0
Total                        80
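
The fitting procedure can be sketched in Python as follows. The estimate of p is rounded to two decimals, as in the worked solution; without that rounding the expected frequencies would differ slightly. Variable names are illustrative.

```python
from math import comb

# Observed frequencies f for X = 0, 1, ..., 10 germinating seeds.
observed = [6, 20, 28, 12, 8, 6, 0, 0, 0, 0, 0]
n = 10                      # trials per set
N = sum(observed)           # number of sets

# Sample mean, then estimate p from mean = n * p.
mean = sum(x * f for x, f in enumerate(observed)) / N
p = round(mean / n, 2)      # rounded to 0.22 as in the worked solution
q = 1 - p

# Expected frequency for each x is N * P(X = x).
expected = [N * comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]

for x, (obs, exp_freq) in enumerate(zip(observed, expected)):
    print(f"x={x:2d}  observed={obs:2d}  expected={exp_freq:6.2f}")
```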

b. Poisson Distribution:-

The Poisson distribution was derived in 1837 by the French mathematician Siméon Denis
Poisson. The Poisson distribution may be obtained as a limiting case of the binomial
distribution under the following conditions:

1) n, the number of trials, is indefinitely large, i.e. n → ∞

2) p, the constant probability of success for each trial, is indefinitely small, i.e. p → 0

3) np = m is finite

Under the above three conditions the probability function of the Poisson distribution is
given as

$$P(X = r) = \frac{e^{-m}\, m^{r}}{r!}, \qquad r = 0, 1, 2, \ldots$$

where r is the number of successes.

Application Areas of Poisson distribution :

Following are some of the practical situations in which the Poisson distribution can be
used:

1) the number of telephone calls arriving at a telephone switchboard in unit time

2) the number of customers arriving at D-mart

3) the number of defects in a manufactured product

4) the number of bacteria counted per unit time

5) the number of accidents taking place on a busy road

For the Poisson distribution, the mean and the variance are equal: Mean = Variance = m.

Fitting of Poisson Distribution :

Solved Example: The following table gives the number of days in a 50-day period during
which automobile accidents occurred in a city:

No. of accidents    0     1     2     3     4
No. of days         21    18    7     3     1

Fit a Poisson distribution to the data.

Solution :

No. of accidents (x)    No. of days (f)    fx
0                       21                 0
1                       18                 18
2                       7                  14
3                       3                  9
4                       1                  4
Total                   n = 50             ∑fx = 45

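$$m = \frac{\sum fx}{n} = \frac{45}{50} = 0.9$$

and

$$P(X = 0) = \frac{e^{-m}\, m^{r}}{r!} = \frac{e^{-0.9}\,(0.9)^{0}}{0!} = 0.4066$$

$$P(X = 1) = \frac{e^{-m}\, m^{r}}{r!} = \frac{e^{-0.9}\,(0.9)^{1}}{1!} = 0.3659$$

$$P(X = 2) = \frac{e^{-m}\, m^{r}}{r!} = \frac{e^{-0.9}\,(0.9)^{2}}{2!} = 0.1647$$

$$P(X = 3) = \frac{e^{-m}\, m^{r}}{r!} = \frac{e^{-0.9}\,(0.9)^{3}}{3!} = 0.0494$$

$$P(X = 4) = \frac{e^{-m}\, m^{r}}{r!} = \frac{e^{-0.9}\,(0.9)^{4}}{4!} = 0.0111$$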

In order to fit a Poisson distribution, we have to multiply each of these values by n, i.e.
by 50. Hence the required Poisson distribution is:

x    0                      1        2       3       4
f    0.4066 × 50 = 20.33    18.30    8.23    2.47    0.56

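
The same fit can be reproduced in Python. This is a minimal sketch under the assumptions of the example (m is estimated by the sample mean, and expected frequencies are n × P(X = x)); `poisson_pmf` is a helper name introduced here.

```python
from math import exp, factorial

accidents = [0, 1, 2, 3, 4]   # x
days = [21, 18, 7, 3, 1]      # f

n = sum(days)
m = sum(x * f for x, f in zip(accidents, days)) / n   # sample mean


def poisson_pmf(r, m):
    # P(X = r) = e^(-m) * m^r / r!
    return exp(-m) * m ** r / factorial(r)


# Expected number of days for each accident count.
expected = [n * poisson_pmf(x, m) for x in accidents]

for x, f, e in zip(accidents, days, expected):
    print(f"x={x}  observed={f:2d}  expected={e:5.2f}")
```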
Mode of Poisson Distribution :

The mode of a Poisson-distributed random variable with non-integer m is equal to ⌊m⌋,
which is the largest integer less than or equal to m. This is also written as floor(m).
When m is a positive integer, the modes are m and m − 1.

Solved Example:

Between the hours of 2 pm and 4 pm, the average number of phone calls per minute coming
into the switchboard of a company is 2.35. Find the probability that during one particular
minute there will be at most 2 phone calls.

Solution : Here m = 2.35

P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2)

We know that, by the Poisson distribution,

$$P(X = r) = \frac{e^{-m}\, m^{r}}{r!}$$

$$P(X \le 2) = \frac{e^{-2.35}\,(2.35)^{0}}{0!} + \frac{e^{-2.35}\,(2.35)^{1}}{1!} + \frac{e^{-2.35}\,(2.35)^{2}}{2!}$$

$$= e^{-2.35}\,(1 + 2.35 + 2.76125)$$

= 0.095369 × 6.11125

= 0.5828
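
A quick numerical check of this result, as a sketch in Python using only the standard library:

```python
from math import exp, factorial

m = 2.35  # average number of calls per minute

# P(X <= 2) is the sum of the Poisson probabilities for r = 0, 1, 2.
p_at_most_2 = sum(exp(-m) * m ** r / factorial(r) for r in range(3))

print(f"P(X <= 2) = {p_at_most_2:.4f}")
```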

c. Normal Distribution:-

The normal probability distribution, commonly called the normal distribution, is one of
the most important continuous theoretical distributions in statistics. Most of the data
relating to economic and business statistics, and even to the social and physical sciences,
conform to this distribution. The normal distribution was first discovered by the
mathematician Abraham de Moivre, who obtained a mathematical equation for this
distribution while dealing with problems arising in games of chance. It is also known as the
Gaussian distribution after Gauss, who used this distribution to describe the theory of
accidental errors of measurement involved in the calculation of the orbits of heavenly
bodies. This distribution is widely used in statistical analysis.
Definition:-

If x is a continuous random variable following the normal probability distribution
with mean µ and standard deviation σ, then its probability density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty$$

where e = 2.7183

Normal Curve :

The graph of the Normal distribution is as shown below:

[Figure: bell-shaped normal curve, symmetric about the line x = µ]

Properties of normal probability distribution

1) The graph of f(x) is a bell-shaped curve.

2) The normal curve is symmetrical about the line x = µ, i.e. it has the same shape on
either side of this line.

3) The mean deviation from the mean in a normal distribution is approximately equal to
4/5 of its standard deviation.

4) Mean = Median = Mode

Importance of Normal Distribution:

1. The normal distribution is the most important probability distribution in statistics
because it fits many natural phenomena. For example, heights, blood pressure,
measurement error, and IQ scores approximately follow the normal distribution. It is also
known as the Gaussian distribution and the bell curve.

2. The normal distribution is a probability function that describes how the values of a
variable are distributed. It is a symmetric distribution where most of the observations
cluster around the central peak and the probabilities for values further away from
the mean taper off equally in both directions. Extreme values in both tails of the distribution
are similarly unlikely.

3. Some statistical hypothesis tests assume that the data follow a normal distribution.

4. Linear and nonlinear regression both assume that the residuals follow a normal
distribution.

5. The central limit theorem states that as the sample size increases, the sampling
distribution of the mean approaches a normal distribution even when the underlying
distribution of the original variable is non-normal.

Solved Example :

1. In a normal distribution whose mean is 2 and standard deviation is 3, find the value of
the variate such that the probability of the interval from the mean to that value is
0.4115.

Solution : Here mean µ = 2 and σ = 3.

Hence the standard normal variate is

$$z = \frac{x - \mu}{\sigma} = \frac{x - 2}{3}$$

Also, from the tables of the areas under the standard normal curve, the value of z for
which the area between the mean and z equals 0.4115 is found to be 1.35.

$$1.35 = \frac{x - 2}{3}$$

x = 3 × 1.35 + 2

x = 6.05
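
The table lookup can be cross-checked with Python's `statistics.NormalDist`. This sketch assumes the required value lies above the mean, so the cumulative area up to it is 0.5 + 0.4115.

```python
from statistics import NormalDist

mean, sd = 2, 3
area_from_mean = 0.4115

# The standard normal z whose cumulative area is 0.5 + 0.4115.
z = NormalDist().inv_cdf(0.5 + area_from_mean)

# Convert back from the standard normal variate to x.
x = mean + sd * z

print(f"z = {z:.2f}")
print(f"x = {x:.2f}")
```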

Unit No : 6 Introduction to Testing of Hypothesis

Hypothesis
Hypothesis testing is a procedure in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology employed by the analyst depends on
the nature of the data used and the reason for the analysis.

• Hypothesis testing uses sample data to assess the plausibility of a hypothesis about a
larger population.
• The test provides evidence on whether or not the analyst's primary hypothesis is
plausible, given the data.
• Statistical analysts test a hypothesis by measuring and examining a random sample
of the population being analyzed.

Null and alternate hypothesis :

Statistical analysts test a hypothesis by measuring and examining a random
sample of the population being analyzed. Analysts use a random population sample to
test two different hypotheses: the null hypothesis and the alternative hypothesis.

The null hypothesis is usually a statement of no effect or no difference, for example
that a population parameter equals a stated value. The alternative hypothesis is effectively
the opposite of the null hypothesis. Thus, they are mutually exclusive, and only one can be
true. However, one of the two hypotheses will always be true.

If, for example, a person wants to test that a penny has exactly a 50% chance of
landing on heads, the null hypothesis would be that the chance is 50%, and the alternative
hypothesis would be that it is not 50%. Mathematically, the null hypothesis would be
represented as H0: P = 0.5. The alternative hypothesis would be denoted as Ha: P ≠ 0.5.

A random sample of 100 coin flips is taken, and the null hypothesis is then tested. If it
is found that the 100 coin flips were distributed as 40 heads and 60 tails, and this departure
from 50 heads is judged statistically significant, the analyst would reject the null
hypothesis in favour of the alternative hypothesis. Afterward, a new hypothesis could be
tested, this time that a penny has a 40% chance of landing on heads.
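
The notes do not say which test is applied to the coin example. As one possible illustration, the sketch below computes a two-sided p-value for 40 heads in 100 flips in two ways: with a normal-approximation z-test (no continuity correction) and with an exact binomial calculation that doubles the lower tail. Both choices are assumptions made here, and the two p-values fall on either side of 0.05, which shows how borderline this sample is.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, p0 = 100, 40, 0.5   # sample size, observed heads, H0 proportion

# Normal-approximation z-test for a proportion (no continuity correction).
p_hat = heads / n
standard_error = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / standard_error
p_value_normal = 2 * NormalDist().cdf(-abs(z))

# Exact binomial: P(X <= 40) under H0, doubled for a two-sided test.
lower_tail = sum(comb(n, k) for k in range(heads + 1)) / 2 ** n
p_value_exact = 2 * lower_tail

print(f"z = {z:.2f}")
print(f"two-sided p-value (normal approximation) = {p_value_normal:.4f}")
print(f"two-sided p-value (exact binomial)       = {p_value_exact:.4f}")
```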

Four Steps of Hypothesis Testing :

All hypotheses are tested using a four-step process.

(i) State the two hypotheses so that only one can be right, i.e. the null hypothesis (H0)
and the alternative hypothesis (Ha).

(ii) Formulate an analysis plan, which outlines how the data will be evaluated.

(iii) Carry out the plan and analyze the sample data.

(iv) Analyze the results and either reject the null hypothesis or fail to reject it.

Significance Level:

The level of statistical significance is often expressed through a p-value
between 0 and 1. The p-value is the probability, computed assuming the null hypothesis is
true, of obtaining results at least as extreme as those observed. The smaller the p-value,
the stronger the evidence that you should reject the null hypothesis.

• A p-value less than or equal to 0.05 is conventionally regarded as statistically
significant. It indicates strong evidence against the null hypothesis, as results this
extreme would occur less than 5% of the time if the null hypothesis were true.
Therefore, we reject the null hypothesis in favour of the alternative hypothesis.

However, this does not mean that there is a 95% probability that the research
hypothesis is true. The p-value is conditional upon the null hypothesis being true and is
not the probability that either hypothesis is true.

• A p-value higher than 0.05 (> 0.05) is not statistically significant and indicates
that the evidence against the null hypothesis is insufficient. This means we retain
the null hypothesis. Note that we cannot accept the null hypothesis; we can only
reject the null or fail to reject it.

A statistically significant result cannot prove that a research hypothesis is correct
(as this implies 100% certainty).
Instead, we may state that our results “provide support for” or “give evidence for” our
research hypothesis (as results like these could still occur by chance when the null
hypothesis is correct).

Type I and Type II error:

In statistical hypothesis testing, a type I error is the rejection of a true null
hypothesis (also known as a "false positive" finding or conclusion; example: "an innocent
person is convicted"), while a type II error is the non-rejection of a false null hypothesis
(also known as a "false negative" finding or conclusion; example: "a guilty person is not
convicted"). Much of statistical theory revolves around the minimization of one or both of
these errors, though the complete elimination of either is a statistical impossibility for
non-deterministic algorithms. By selecting a low threshold (cut-off) value and modifying the
alpha (α) level, the quality of the hypothesis test can be increased. The knowledge of
Type I errors and Type II errors is widely used in medical
science, biometrics and computer science.
Intuitively, type I errors can be thought of as errors of commission, i.e. the researcher
unluckily concludes that something is the fact. For instance, consider a study where
researchers compare a drug with a placebo. If the patients who are given the drug get
better than the patients given the placebo by chance, it may appear that the drug is
effective, but in fact the conclusion is incorrect. Conversely, type II errors can be thought
of as errors of omission. In the example above, if the patients who got the drug did not get
better at a higher rate than the ones who got the placebo, but this was a random fluke, that
would be a type II error. The consequence of a type II error depends on the size and
direction of the missed determination and the circumstances. An expensive cure for one in
a million patients may be inconsequential even if true, while demonstrated astrological
predictions on the motion of one in a billion flies through telepathy from an alien species
from the Andromeda nebula would be revolutionary.
The relations between the truth or falseness of the null hypothesis and the outcomes of the
test are tabulated below.

Table of error types

Decision about null      Null hypothesis (H0) is
hypothesis (H0)          True                       False

Don't reject             Correct inference          Type II error
                         (true negative)            (false negative)
                         (probability = 1−α)        (probability = β)

Reject                   Type I error               Correct inference
                         (false positive)           (true positive)
                         (probability = α)          (probability = 1−β)

Chi square test :

The chi square statistic is used by the researcher for determining whether or
not a relationship exists between two categorical variables. The statistic is calculated by
summing, over all cells, the squared deviation between the observed and the expected
frequency divided by the expected frequency.

In the chi square test, the null hypothesis is that there is no association between the
two variables that are observed in the study. The expected frequencies are the cell
frequencies that would arise if there were no association between the variables, and the
test compares them with the actual observed frequencies. The expected frequency of a cell
is calculated as the product of the total number of observations in its row and in its column,
divided by the total size of the sample.
Used as a test for goodness of fit, the chi square test helps the researcher to understand
whether or not the sample drawn from a certain population follows a specified distribution.
This type of test is applicable only to discrete distributions. One of the important points to
be noted by the researcher is that the expected frequency in each cell should be at least
five. This means that the chi square test will not be valid for cells whose expected frequency
is less than five.

Assumptions in the chi square test:

1.The random sampling of data is assumed in the chi square test.

2.In the chi square test, a sample with a sufficiently large size is assumed. If the chi
square test is conducted on a sample with a smaller size, then the chi square test will yield
inaccurate inferences. The researcher, by using the chi square test on small samples,
might end up committing a Type II error.

3.In the chi square test, the observations are always assumed to be independent of each
other.

4. In the chi square test, the observations must have the same fundamental distribution.

A chi-squared test, also written as χ² test, is any statistical hypothesis
test where the sampling distribution of the test statistic is a chi-squared distribution when
the null hypothesis is true. Without other qualification, 'chi-squared test' often is used as
short for Pearson's chi-squared test. The chi-squared test is used to determine whether
there is a significant difference between the expected frequencies and the observed
frequencies in one or more categories.
In the standard applications of this test, the observations are classified into
mutually exclusive classes, and there is some theory, or say null hypothesis, which gives
the probability that any observation falls into the corresponding class. The purpose of the
test is to evaluate how likely the observations that are made would be, assuming the null
hypothesis is true.
Chi-squared tests are often constructed from a  sum of squared errors, or
through the sample variance. Test statistics that follow a chi-squared distribution arise
from an assumption of independent normally distributed data, which is valid in many cases
due to the central limit theorem. A chi-squared test can be used to attempt rejection of the
null hypothesis that the data are independent.
Also considered a chi-squared test is a test in which this is asymptotically true,
meaning that the sampling distribution (if the null hypothesis is true) can be made to
approximate a chi-squared distribution as closely as desired by making the sample size
large enough.
The formula for the chi-square statistic used in the chi square test is:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where Oi is the observed frequency and Ei is the expected frequency.

Solved Example :

A random sample of 395 people was surveyed and each person was asked to report
the highest education level they obtained. The data that resulted from the survey are
summarized in the following table:

          High School   Bachelors   Masters   Ph.D.   Total
Female    60            54          46        41      201
Male      40            44          53        57      194
Total     100           98          99        98      395

Are gender and education level dependent at the 5% level of significance? In other
words, given the data collected above, is there a relationship between the gender of an
individual and the level of education that they have obtained?

Solution :

The expected value for each cell is calculated as (row total × column total) / grand total:

For observed frequency 60 : 100 × 201 / 395 = 50.886

For observed frequency 54 : 98 × 201 / 395 = 49.868

Similarly, after calculating for all observed frequencies, we get the expected frequencies
shown in the following table:

          High School   Bachelors   Masters   Ph.D.    Total
Female    50.886        49.868      50.377    49.868   201
Male      49.114        48.132      48.623    48.132   194
Total     100           98          99        98       395

Now, by using the formula of chi square, we get

$$\chi^2 = \frac{(60 - 50.886)^2}{50.886} + \cdots + \frac{(57 - 48.132)^2}{48.132} = 8.006.$$

The number of degrees of freedom is (2 − 1)(4 − 1) = 3, and the critical value of χ² with
3 degrees of freedom at the 5% level of significance is 7.815.

Since 8.006 > 7.815, we reject the null hypothesis and conclude that the
education level depends on gender at a 5% level of significance.
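
The whole calculation can be sketched in plain Python. The critical value 7.815 is taken from the text rather than computed, and the variable names are illustrative.

```python
# Observed frequencies: rows are Female, Male; columns are
# High School, Bachelors, Masters, Ph.D.
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of a cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom = (rows - 1) * (columns - 1).
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

critical_value = 7.815  # 5% critical value for 3 degrees of freedom, from the text
reject_null = chi_square > critical_value

print(f"chi-square = {chi_square:.3f}, degrees of freedom = {dof}")
print("reject H0" if reject_null else "fail to reject H0")
```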
