# BHARTHIDASAN UNIVERSITY

UNIT-IV THEORY OF SAMPLING AND TESTING OF HYPOTHESIS

4.0 OBJECTIVES

4.1 NEED FOR SAMPLING

4.2 ELEMENTS OF SAMPLING PLAN

4.3 TYPES OF SAMPLING

4.3.1 Random or Probability Sampling: Simple Random Sampling, Stratified Random Sampling, Systematic Random Sampling, Cluster Sampling

4.3.2 Non-Random or Non-Probability Sampling: Convenience Sampling, Judgmental Sampling, Quota Sampling

4.4 SAMPLING AND NON-SAMPLING ERRORS

4.4.1 Reasons for sampling errors

4.4.2 Reasons for non-sampling errors

4.5 TESTING OF HYPOTHESIS

4.5.1 Sampling Distribution

4.5.2 Standard Error

4.5.3 Null & Alternative Hypothesis

4.5.4 Errors in testing of hypothesis

4.5.5 Critical Region

4.5.6 Two tailed and One tailed test

4.5.7 Large and Small sample test

4.6 PROCEDURE FOR TESTING OF HYPOTHESIS

4.7 TESTS OF SIGNIFICANCE

4.7.1 Test for single mean

4.7.2 Test for difference of two means

4.7.3 Test for two standard deviations

4.7.4 Test for Single Proportion

4.7.5 Test for difference of two proportions

4.8 Analysis of Variance

4.8.1 Assumptions

4.8.2 One way ANOVA

4.8.3 Applications

4.0 Objectives
MATHEMATICS AND STATISTICS Page 1

Sampling is used in everyday life, often without our being aware of it. For example, a cook tests a small quantity of rice to see whether it has been well cooked, and a grain merchant does not examine each grain of what he intends to purchase but inspects only a small quantity of grains. Most of our decisions are based on the examination of a few items only.

In a statistical investigation, the interest usually lies in the assessment of general magnitude and the study of variation with respect to one or more characteristics relating to individuals belonging to a group. This group of individuals or units under study is called the population or universe. Thus, in statistics, a population is an aggregate of objects or units under study. The population may be finite or infinite.

Sampling and Sample

Sampling is a method of selecting units for analysis, such as households, consumers or companies, from the respective population under statistical investigation. The theory of sampling is based on the principle of statistical regularity. According to this principle, a moderately large number of items chosen at random from a large group are almost sure, on average, to possess the characteristics of the larger group.

The smallest indivisible part of the population is called a unit. A unit should be well defined and unambiguous. For example, if the unit is a household, the definition should ensure that no person belongs to two households and that no person in the population is left out. A finite subset of a population is called a sample, and the number of units in a sample is called the sample size. By analyzing the data collected from the sample, one can draw inferences about the population under study.

Parameter and Statistic

The statistical constants of a population, such as the mean ($\mu$), variance ($\sigma^2$) and proportion ($P$), are termed parameters. Statistical measures such as the mean ($\bar{x}$), variance ($s^2$) and proportion ($p$) computed from the sampled observations are known as statistics. Sampling is employed to throw light on the population parameter. A statistic is an estimate based on sample data, used to draw inferences about the population parameter.


4.1 NEED FOR SAMPLING

Suppose that the raw materials department in a company receives items in lots and issues them to the production department as and when required. Before accepting these items, the inspection department inspects or tests them to make sure that they meet the required specifications. Thus

(i) it could inspect all items in the lot, or

(ii) it could take a sample, inspect the sample for defectives, and then estimate the total number of defectives for the population as a whole.

The first approach is called complete enumeration (census). It has two major disadvantages, namely the time consumed and the cost involved. The second approach, which uses sampling, has two major advantages:

(i) It is significantly less expensive.

(ii) It takes much less time while still giving reliable results.

There are situations involving destructive testing where sampling is the only answer. A well-designed statistical sampling methodology gives accurate results while reducing both cost and time. Thus sampling is the best available tool for decision makers.

4.2 ELEMENTS OF SAMPLING PLAN

The main steps involved in the planning and execution of a sample survey are:

i) Objectives: The first task is to lay down in concrete terms the basic objectives of the survey. Failure to define the objective(s) will clearly undermine the purpose of carrying out the survey itself. For example, if a nationalized bank wants to study its savings bank account holders' perception of the service quality rendered over a period of one year, the objective of the sampling is to analyze the perception of the account holders in the bank.


ii) Population to be covered: Based on the objectives of the survey, the population should be well defined. The characteristics concerning the population under study should also be clearly defined. For example, to analyze the perception of the savings bank account holders about the service rendered by the bank, all the account holders in the bank constitute the population to be investigated.

iii) Sampling frame: In order to cover the population decided upon, there should be some list, map or other acceptable material (called the frame) which serves as a guide to the population to be covered. The list or map must be examined to be sure that it is reasonably free from defects. The sampling frame helps in the selection of the sample. In the bank example, all the account numbers of the savings bank account holders form the sampling frame.

iv) Sampling unit: For the purpose of sample selection, the population should be capable of being divided into sampling units. The division of the population into sampling units should be unambiguous. Every element of the population should belong to just one sampling unit. In the bank example, each savings bank account holder forms a sampling unit, since all the savings bank account holders in the bank constitute the population.

v) Sample selection: The size of the sample and the manner of selecting the sample should be defined based on the objectives of the statistical investigation. The estimation of the population parameters, along with their margins of uncertainty, is among the important aspects to be considered in sample selection.

vi) Collection of data: The method of collecting the information has to be decided, keeping in view the costs involved and the accuracy aimed at. Physical observation, interviewing respondents and collecting data through mail are some of the methods that can be followed in the collection of data.

vii) Analysis of data: The collected data should be properly classified and subjected to an appropriate analysis. The conclusions are drawn based on the results of the analysis.

4.3 Types of Sampling


The technique of selecting a sample from a population usually depends on the nature of the data and the type of enquiry. The procedure of sampling may be broadly classified under the following heads: 1) Probability sampling or random sampling and 2) Non-probability sampling or non-random sampling.

4.3.1 Probability sampling

Probability sampling is a method of sampling that ensures that every unit in the population has a known non-zero chance of being included in the sample. The different methods of random sampling are:

(a) Simple Random Sampling (SRS)

Simple random sampling is the foundation of probability sampling. It is a special case of probability sampling in which every unit in the population has an equal chance of being included in a sample. Simple random sampling also makes the selection of every possible combination of the desired number of units equally likely.

Sampling may be done with or without replacement. When the sampling is with replacement, each unit drawn is returned to the population before the next selection is made, so the population size remains constant. If one wants to select n units from a population of size N without replacement, then every possible selection of n units must have the same probability. There are $\binom{N}{n}$ (also written $^{N}C_{n}$) possible ways to pick n units from a population of size N, so simple random sampling guarantees that each sample of n units has the same probability $1/\binom{N}{n}$ of being selected.
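The definition above can be illustrated with a minimal Python sketch. The frame of 500 numbered units, the function names and the seed are assumptions made for illustration only; `random.sample` draws without replacement, and `math.comb` counts the possible samples.

```python
import math
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed is used only to make the illustration repeatable
    return rng.sample(frame, n)


def number_of_possible_samples(N, n):
    """Number of distinct samples of size n from a population of size N (N choose n)."""
    return math.comb(N, n)


# Hypothetical sampling frame: units numbered 1..500
frame = list(range(1, 501))
sample = simple_random_sample(frame, 50, seed=1)

print(len(sample), len(set(sample)))          # sample size, number of distinct units
print(number_of_possible_samples(5, 2))       # small case that is easy to verify by hand
print(1 / number_of_possible_samples(5, 2))   # probability of any one particular sample
```

For realistic sizes the number of possible samples is astronomically large, which is why a random mechanism rather than enumeration is used to pick one of them.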

Example: A bank wants to study its savings bank account holders' perception of the service quality rendered over a period of one year. The bank has to prepare a complete list of savings bank account holders, called the sampling frame, containing say 500 accounts. The process then involves selecting a sample of 50 out of 500
and interviewing them. This could be achieved in many ways. Two common ways are:

(1) Lottery method: Select 50 slips without replacement from a box containing 500 well-shuffled slips of account numbers. This method can be applied when the population is small enough to handle.

(2) Random numbers method: When the population size is very large, the most practical and inexpensive method of selecting a simple random sample is to use random number tables.

(b) Stratified Random Sampling

Stratified sampling is a two-step process in which the population is first partitioned into sub-populations, or strata. The strata should be mutually exclusive and collectively exhaustive, in that every population element should be assigned to one and only one stratum and no population elements should be omitted. Next, elements are selected from each stratum by a random procedure, usually SRS. Technically, only SRS should be employed in selecting the elements from each stratum. In practice, systematic sampling and other probability sampling procedures are sometimes employed. Stratified sampling differs from quota sampling in that the sample elements are selected probabilistically rather than based on convenience or judgment. A major objective of stratified sampling is to increase precision without increasing cost.

The variables used to partition the population into strata are referred to as stratification variables. The criteria for the selection of these variables consist of homogeneity, heterogeneity, relatedness, and cost. The elements within a stratum should be as homogeneous as possible, but the elements in different strata should be as heterogeneous as possible. The stratification variables should also be closely related to the characteristic of interest. The more closely these criteria are met, the greater the effectiveness in controlling extraneous sampling variation. Finally, the variables should decrease the cost of the stratification process by being easy to measure and apply.

(c) Systematic Random Sampling

In systematic random sampling, the sample is chosen by selecting a random starting point and then picking every i-th element in succession from the sampling frame. The sampling interval, i, is determined by dividing the population size N by
the sample size n and rounding to the nearest integer. For example, suppose there are 100,000 elements in the population and a sample of 1,000 is desired. In this case, the sampling interval, i, is 100. A random number between 1 and 100 is selected. If, for example, this number is 23, the sample consists of elements 23, 123, 223, 323, 423, 523, and so on.

Systematic sampling is similar to SRS in that each population element has a known and equal probability of selection. However, it differs from SRS in that only the permissible samples of size n that can be drawn have a known and equal probability of selection; the remaining samples of size n have zero probability of being selected. For systematic sampling, we assume that the population elements are ordered in some respect. In some cases, the ordering is unrelated to the characteristic of interest. Systematic sampling is a convenient way of selecting a sample. It requires less time and cost when compared to simple random sampling.

(d) Cluster Random Sampling

In cluster sampling, the target population is first divided into mutually exclusive and collectively exhaustive subpopulations, or clusters. Then a random sample of clusters is selected, based on a probability sampling technique such as SRS. For each selected cluster, either all the elements are included in the sample or a sample of elements is drawn probabilistically. If all the elements in each selected cluster are included in the sample, the procedure is called one-stage cluster sampling. If a sample of elements is drawn probabilistically from each selected cluster, the procedure is two-stage cluster sampling. Furthermore, a cluster sample can have multiple (more than two) stages, as in multistage cluster sampling.
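The systematic sampling rule described above (a random start followed by every i-th element) can be sketched in a few lines of Python. The function name and its parameters are hypothetical and chosen only for illustration. The start is fixed at 23 in the usage lines to reproduce the numbers in the text.

```python
import random


def systematic_positions(N, n, start=None, seed=None):
    """Return the 1-based positions selected by systematic sampling.

    N is the population size and n the desired sample size.
    """
    # Sampling interval i = N / n rounded to the nearest integer.
    # Note: Python's round() sends exact halves to the even neighbour.
    interval = max(1, round(N / n))
    if start is None:
        # Random starting point between 1 and i, both inclusive.
        start = random.Random(seed).randint(1, interval)
    # Positions start, start + i, start + 2i, ... up to N.
    return list(range(start, N + 1, interval))


positions = systematic_positions(100_000, 1_000, start=23)
print(positions[:6])    # first few selected positions
print(len(positions))   # resulting sample size
```

When N is not an exact multiple of n, the realised sample size from this sketch can differ slightly from n, depending on the start.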

The distinction between cluster sampling and stratified sampling is that in cluster sampling only a sample of the subpopulations (clusters) is chosen, whereas in stratified sampling all the subpopulations (strata) are selected for further sampling.

4.3.2 Non-Probability Sampling

The fundamental difference between probability sampling and non-probability sampling is that in a non-probability sampling procedure, the selection of the sample units does not ensure a known chance of selection for each unit. In other words, the units are selected without using the principle of probability. Even though non-probability sampling has advantages such as reduced cost, speed and convenience in implementation, it lacks accuracy because of selection bias. Non-probability sampling is suitable for pilot studies and exploratory research. The methods of non-random sampling are:

(a) Convenience sampling: Convenience sampling attempts to obtain a sample of convenient elements. The selection of sampling units is left primarily to the interviewer. Often, respondents are selected because they happen to be in the right place at the right time. Examples of convenience sampling include (1) use of students, church groups, and members of social organizations, (2) mall-intercept interviews without qualifying the respondents, and (3) department stores using charge account lists. Convenience sampling is the least expensive and least time consuming of all sampling techniques. The sampling units are accessible, easy to measure, and cooperative.

(b) Judgmental sampling: Judgmental sampling is a form of convenience sampling in which the population elements are selected based on the judgment of the researcher. The researcher, exercising judgment or expertise, chooses the elements to
be included in the sample because he or she believes that they are representative of the population of interest or are otherwise appropriate. Common examples of judgmental sampling include (1) test markets selected to determine the potential of a new product, (2) purchase engineers selected in industrial marketing research because they are considered to be representative of the company, and (3) expert witnesses used in court.

(c) Quota sampling: This is a restricted type of judgmental sampling. It consists of specifying quotas of the samples to be drawn from different groups and then drawing the required samples from these groups by judgmental sampling. Quota sampling is widely used in opinion and market research surveys.

4.4 SAMPLING AND NON-SAMPLING ERRORS

A sample is a part of the whole population. A sample drawn from the population depends on chance, and as such all the characteristics of the population may not be present in the sample drawn from it. Any statistical measure, say the mean of the sample, may not be equal to the corresponding statistical measure of the population from which the sample has been drawn. Thus there can be discrepancies between the statistical measures of the population, i.e., parameters, and the statistical measures of a sample drawn from the same population, i.e., statistics. These discrepancies are known as errors in sampling. Errors in sampling are of two types: (i) sampling errors and (ii) non-sampling errors.

4.4.1 Sampling Errors

Sampling error is inherent in the method of sampling. Sampling depends on
chance, and because of this element of chance, sampling errors occur. Errors in sampling arise primarily due to the following reasons:

1. Faulty selection of the sample. This may be due to the selection of defective sampling techniques which may introduce an element of bias, e.g., purposive or judgmental sampling, in which the investigator deliberately selects a non-representative sample.

2. Substitution. Sometimes an investigator, while collecting information from a particular sampling unit included in the random selection, substitutes a convenient member of the population. This may lead to some bias, as the characteristics possessed by the substituted unit may be different from those possessed by the original unit included in the sample.

3. Faulty demarcation of sampling units.

4. Variability of the population. Sampling error may also depend on the variability or heterogeneity of the population from which the samples are to be drawn.

4.4.2 Non-Sampling Errors

Non-sampling errors, or bias, creep in due to human factors, which always vary from one investigator to another. Bias may arise in the following ways:

(i) Due to negligence and carelessness on the part of the investigator

(ii) Due to faulty planning of sampling

(iii) Due to the faulty selection of sample units

(iv) Due to incomplete investigation and sample survey

(v) Due to framing of a wrong questionnaire

(vi) Due to negligence and non-response on the part of the respondents

(vii) Due to substitution of a selected unit by another

(viii) Due to error in compilation

(ix) Due to applying a wrong statistical measure.

4.5 TESTING OF HYPOTHESIS

The testing of hypothesis is a procedure that helps us to ascertain the likelihood of a hypothesized population parameter being correct by making use of the sample statistic. In testing of hypothesis, a statistic is computed from a sample drawn from the parent population and, on the basis of this statistic, it is decided whether the sample so drawn has come from a population with certain specified characteristics.

4.5.1 Sampling Distribution

Consider all possible samples of size n which can be drawn from a given population. For each sample we can compute a statistic such as the mean, standard deviation, etc., which will vary from sample to sample. The aggregate of the various values of the statistic under consideration may be grouped into a frequency distribution. This distribution is known as the sampling distribution of the statistic. Thus the probability distribution of all the possible values that a sample statistic can take is called the sampling distribution of the statistic. The sample mean and the sample proportion based on a random sample are examples of sample statistics.

Sampling distribution of the mean from a normal population

If $x_1, x_2, x_3, \dots, x_n$ are n independent random observations drawn from a normal population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of $\bar{x}$ (the sample mean) is a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. It may be noted that:
(i) The sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$. Thus $\bar{x}$ is a random variable and will be different every time a new sample of n observations is taken.

(ii) $\bar{x}$ is an unbiased estimator of the population mean $\mu$, i.e., $E(\bar{x}) = \mu$, denoted by $\mu_{\bar{x}} = \mu$.

(iii) The standard deviation of the sample mean $\bar{x}$ is given by $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
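The properties of the sample mean listed above can be checked empirically with a small simulation. This is only a sketch: the population values (mean 50, standard deviation 10), the sample size 25, the number of repetitions and the function name are all assumptions made for illustration.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n from N(mu, sigma) and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


mu, sigma, n = 50.0, 10.0, 25   # hypothetical population mean, s.d. and sample size
means = simulate_sample_means(mu, sigma, n, repetitions=4000)

print(statistics.fmean(means))   # should lie close to mu
print(statistics.stdev(means))   # should lie close to sigma / sqrt(n) = 2.0
```

The average of the simulated sample means approximates $\mu$, and their standard deviation approximates $\sigma/\sqrt{n}$, in line with properties (ii) and (iii).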

Sampling distribution of proportions

Suppose that a population is infinite and that the probability of occurrence of an event, say success, is P. Let Q = 1 - P denote the probability of failure. Consider all possible samples of size n drawn from this population. For each sample, determine the proportion p of successes. By the central limit theorem, if the sample size n is large, the sample proportion p approximately follows a normal distribution with mean $\mu_p = P$ and standard deviation $\sigma_p = \sqrt{PQ/n}$.

4.5.2 Standard Error

The standard deviation of the sampling distribution of a statistic is called the standard error of the statistic. The standard deviation of the distribution of the sample mean is called the standard error of the mean. Likewise, the standard deviation of the distribution of the sample proportion is called the standard error of the proportion.

The standard error is popularly known as sampling error. It throws light on the precision and accuracy of the estimate. The standard error is inversely proportional to the square root of the sample size, i.e., the larger the sample size, the smaller the standard error. The standard error measures the dispersion of all possible values of the statistic in repeated samples of a fixed size from a given population. It is used to set up confidence limits for population parameters and in tests of significance. Thus the standard errors of the sample mean $\bar{x}$ and the sample proportion p are used to find confidence limits for the population mean $\mu$ and the population proportion P respectively.

| Statistic | Standard Error | Remark |
|---|---|---|
| Sample mean $\bar{x}$ | $\dfrac{\sigma}{\sqrt{n}}$ | Population size is infinite, or sampling is with replacement |
| Sample mean $\bar{x}$ | $\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$ | Population of finite size N, sampled without replacement |
| Sample proportion $p$ | $\sqrt{\dfrac{PQ}{n}}$ | Population size is infinite, or sampling is with replacement |
| Sample proportion $p$ | $\sqrt{\dfrac{PQ}{n}}\sqrt{\dfrac{N-n}{N-1}}$ | Population of finite size N, sampled without replacement |
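The standard error formulas tabulated above translate directly into code. The following is a minimal sketch; the function names and the optional argument `N` (which switches on the finite population factor) are hypothetical conveniences, not part of the original notes.

```python
import math


def se_mean(sigma, n, N=None):
    """Standard error of the sample mean.

    If a finite population size N is supplied (sampling without replacement),
    the factor sqrt((N - n) / (N - 1)) is applied.
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se


def se_proportion(P, n, N=None):
    """Standard error of the sample proportion, with Q = 1 - P."""
    se = math.sqrt(P * (1 - P) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se


# Illustrative values only
print(se_mean(10.0, 100))           # infinite population
print(se_mean(10.0, 100, N=500))    # finite population, smaller standard error
print(se_proportion(0.5, 100))
```

The finite population factor is always at most 1, so sampling without replacement from a finite population never increases the standard error, and it drops to zero when the whole population is sampled (n = N).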

4.5.3 Null and Alternative Hypothesis

Null Hypothesis: The statistical hypothesis that is set up for testing is known as the null hypothesis. It is set up solely to decide whether it should be accepted or rejected. It asserts that there is no difference between the sample statistic and the population parameter, and that whatever difference is observed is attributable to sampling error. The null hypothesis is usually denoted by $H_0$.

Alternative Hypothesis: The negation of the null hypothesis is called the alternative hypothesis. In other words, any hypothesis which is not a null hypothesis is called an alternative hypothesis. It is usually denoted by $H_1$ or $H_a$. It is set in such a way that rejection of the null hypothesis implies acceptance of the alternative hypothesis.

4.5.4 Error in testing of hypothesis

For testing the hypothesis, we take a sample from the population and, on the basis of the sample result obtained, we decide whether to accept or reject the hypothesis. Here, two types of error are possible. A null hypothesis could be rejected when it is true. This is called Type I error, and the probability of committing a Type I error is denoted by $\alpha$. Alternatively, an error could result from accepting a null hypothesis when it is false. This is known as Type II error, and the probability of committing a Type II error is denoted by $\beta$. This is illustrated in the following table:

| True situation | Test decision: accept $H_0$ | Test decision: reject $H_0$ |
|---|---|---|
| $H_0$ is true | Correct decision | Type I error |
| $H_0$ is false | Type II error | Correct decision |

4.5.5 Critical Region

A region in the sample space which amounts to rejection of the null hypothesis ($H_0$) is called the critical region. After formulating the null and alternative hypotheses about a population parameter, we take a sample from the population, calculate the value of the relevant statistic, and compare it with the hypothesized population parameter. We then have to decide the criteria for accepting or rejecting the null hypothesis. These criteria are given as a range of values in the form of an interval, say (a, b), so that if the statistic value falls outside the range, we reject the null hypothesis. If the statistic value falls within the interval (a, b), we accept $H_0$.

This criterion has to be decided on the basis of the level of significance. A 5% level of significance means that, when $H_0$ is true, 5% of the statistic values computed from samples will fall outside the range (a, b) and 95% of the values will fall within it. Thus the level of significance is the probability of Type I error. The levels of significance usually employed in testing of hypothesis are 5% and 1%. Choosing a higher significance level implies a higher probability of rejecting a null hypothesis when it is true.

Table of critical values $Z_\alpha$ of Z (absolute values):

| Level of significance | 1% | 5% | 10% |
|---|---|---|---|
| Two tailed test | 2.58 | 1.96 | 1.645 |
| One tailed test | 2.33 | 1.645 | 1.28 |
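The tabulated critical values can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `critical_value` is a hypothetical choice for illustration.

```python
from statistics import NormalDist


def critical_value(alpha, two_tailed=True):
    """Absolute critical value of Z for a given level of significance alpha.

    In a two tailed test the rejection area alpha is split equally between the tails.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


for alpha in (0.01, 0.05, 0.10):
    two = critical_value(alpha, two_tailed=True)
    one = critical_value(alpha, two_tailed=False)
    print(f"alpha={alpha:.2f}  two tailed={two:.3f}  one tailed={one:.3f}")
```

The computed values agree with the table up to rounding. Note that the one tailed value at level $\alpha$ equals the two tailed value at level $2\alpha$, which is why 1.645 appears in both rows.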

4.5.6 Two tailed and one tailed test:


The probability curve of the sampling distribution of the test statistic is a normal curve. In any test, the critical region is represented by a portion of the area under this normal curve. The curve has two sides (or ends) known as its two tails. The rejection region may be represented by a portion of the area on each of the two sides, or on only one side of the normal curve, and correspondingly the test is known as a two tailed test (or two sided test) or a one tailed test (or one sided test).

When the test of hypothesis is made on the basis of a rejection region represented by both sides of the standard normal curve, it is called a two tailed test or two sided test. When the test of hypothesis is made on the basis of a rejection region represented by only one of the sides of the standard normal curve, it is called a one tailed test or one sided test.

4.5.7 Large and small sample test

Tests of significance are classified as (a) tests of significance for large samples and (b) tests of significance for small samples. For large samples (n > 30), distributions such as the Binomial and Poisson are well approximated by the normal distribution. Thus the normal probability curve can be used for testing of hypothesis.

4.6 PROCEDURE FOR TESTING OF HYPOTHESIS

The steps for testing a hypothesis are given below (for both large sample and small sample tests).

Step 1: Null hypothesis: Set up the null hypothesis $H_0$.

Step 2: Alternative hypothesis: Set up the alternative hypothesis $H_1$, which is complementary to $H_0$. It indicates whether a one tailed (right or left tailed) or a two tailed test is to be applied.

Step 3: Level of significance: Choose an appropriate level of significance ($\alpha$).

Step 4: Test statistic (or test criterion): Calculate the value of the test statistic $Z = \dfrac{t - E(t)}{S.E.(t)}$ under the null hypothesis, where t is the sample statistic.


Step 5: Critical value: Find the critical value $Z_\alpha$ of Z at the chosen level of significance from the table of areas under the normal curve in the case of large samples, or from the t-table, F-table or Chi-square table in the case of small samples.

Step 6: Inference: Compare the computed value of Z (in absolute value) with the critical value $Z_{\alpha/2}$ (or $Z_\alpha$). If $|Z| > Z_\alpha$, we reject the null hypothesis $H_0$ at the $\alpha$% level of significance, and if $|Z| \le Z_\alpha$, we accept $H_0$ at the $\alpha$% level of significance.
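For the large sample case, the six steps above can be condensed into one small function. This is a sketch under stated assumptions: the function name, its arguments and the `tail` labels ("two", "right", "left") are hypothetical, and the critical value is taken from the standard normal distribution via `statistics.NormalDist` rather than read from a printed table.

```python
from statistics import NormalDist


def z_test_decision(statistic, expected, standard_error, alpha=0.05, tail="two"):
    """Return (z, critical value, reject H0?) for a large sample test.

    statistic      : observed sample statistic t
    expected       : E(t) under the null hypothesis
    standard_error : S.E.(t) under the null hypothesis
    """
    z = (statistic - expected) / standard_error          # Step 4
    if tail == "two":
        critical = NormalDist().inv_cdf(1 - alpha / 2)   # Step 5, area split over two tails
        reject = abs(z) > critical                       # Step 6
    elif tail == "right":
        critical = NormalDist().inv_cdf(1 - alpha)
        reject = z > critical
    elif tail == "left":
        critical = NormalDist().inv_cdf(1 - alpha)
        reject = z < -critical
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, critical, reject


# Illustrative numbers only: observed statistic 52, hypothesized value 50, S.E. 1
z, critical, reject = z_test_decision(52.0, 50.0, 1.0, alpha=0.05, tail="two")
print(round(z, 3), round(critical, 3), reject)
```

The alternative hypothesis chosen in Step 2 determines the `tail` argument, which in turn determines how the level of significance is distributed over the tails.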

4.7 LARGE SAMPLE TESTS

4.7.1 Test for single mean

Step 1: Setting up of a null hypothesis: There is no significant difference between the sample mean and the population mean, i.e., the sample has been drawn from the parent population. $H_0: \bar{x} = \mu$

Step 2: Setting up of an alternative hypothesis: There is a significant difference between the sample mean and the population mean. $H_1: \bar{x} \ne \mu$

Step 3: Fixing the level of significance: $\alpha$ (normally it is 5%).

Step 4: Computation of the test statistic:

$$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

Step 5: Critical value: Find the critical value $Z_\alpha$ at the $\alpha$% level of significance from the table of areas under the normal curve.

Step 6: Inference: If $|Z_{cal}| \le Z_\alpha$, where $Z_\alpha$ is the critical value obtained in Step 5, we accept the null hypothesis at the $\alpha$% level of significance. Otherwise we reject the null hypothesis at the $\alpha$% level of significance.

We now illustrate the above with an example.

Example 4.1 A sample of 400 male students of a college is found to have a mean height of 171.38 cm. Can it be regarded as a sample from a large population with mean height 171.17 cm and standard deviation 4.40 cm?

Solution:

Given n = 400 (large sample), $\mu$ = 171.17 cm, $\bar{x}$ = 171.38 cm, $\sigma$ = 4.40 cm.

Null Hypothesis ($H_0$): The sample has been drawn from a large population with mean height 171.17 cm, i.e., $H_0: \mu = 171.17$ cm.

Alternative Hypothesis ($H_1$): The sample has not been drawn from a large population with mean height 171.17 cm, i.e., $H_1: \mu \ne 171.17$ cm.

Level of significance ($\alpha$): 5%

Test statistic:

$$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{171.38 - 171.17}{4.40/\sqrt{400}} = \frac{0.21}{0.22} = 0.9545$$

Critical value: At the 5% level, $Z_{0.05} = 1.96$.

Inference: Since the calculated value of Z is less than the critical value of Z at the 5% level, we accept the null hypothesis and conclude that the sample has been drawn from a large population with mean height 171.17 cm.
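The arithmetic of Example 4.1 can be checked with a short Python sketch. The function name is hypothetical. The inputs are the figures given in the problem statement (n = 400, sample mean 171.38 cm, population mean 171.17 cm, standard deviation 4.40 cm), and 1.96 is the two tailed 5% critical value used in the solution.

```python
import math


def z_single_mean(sample_mean, mu, sigma, n):
    """Test statistic for a single mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))


z = z_single_mean(171.38, 171.17, 4.40, 400)
accept_h0 = abs(z) <= 1.96   # two tailed test at the 5% level

print(round(z, 4))
print(accept_h0)
```

The statistic comes out at roughly 0.95, well inside the acceptance region, which matches the conclusion of the worked solution.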

Example 4.2 The mean lifetime of 100 fluorescent light bulbs produced by a company is computed to be 1570 hours with a standard deviation of 120 hours. If $\mu$ is the mean lifetime of all the bulbs produced by the company, test the hypothesis $\mu$ = 1600 hours against the alternative hypothesis $\mu \ne$ 1600 hours using a 5% level of significance.

Solution: We are given $\bar{x}$ = 1570 hrs, $\mu$ = 1600 hrs, $\sigma$ = s = 120 hrs, n = 100.

Null Hypothesis ($H_0$): $\mu$ = 1600, i.e., there is no significant difference between the sample mean and the population mean.

Alternative Hypothesis ($H_1$): $\mu \ne 1600$ (two tailed alternative), i.e., there is a significant difference between the sample mean and the population mean.

Level of significance ($\alpha$): 5%

Test statistic:

$$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{1570 - 1600}{120/\sqrt{100}} = \frac{-30}{12} = -2.5$$

$|Z_{cal}| = 2.5$

Critical value: At the 5% level, $Z_{0.05} = 1.96$.

Inference: Since the calculated value is greater than the critical value of Z at the 5% level, we reject the null hypothesis and conclude that there is a significant difference between the sample mean and the population mean.

Self-Assessment Questions

1. A random sample of 900 members has a mean of 3.4 cm and S.D. 2.61 cm. Is the sample from a large population with mean 3.25 cm and S.D. 2.61 cm?

n = 900, $\bar{x}$ = 3.4, $\mu$ = 3.25, $\sigma$ = 2.61, $|Z_{cal}|$ = 1.724 [Hint: $H_0$ is accepted at the 5% level]

2. A random sample of size 400 is drawn and the sample mean is found to be 99. Test whether the sample could have come from a normal population with mean 100 and standard deviation 8 at the 5% level.

n = 400, $\bar{x}$ = 99, $\mu$ = 100, $\sigma$ = 8, $|Z_{cal}|$ = 2.5 [Hint: $H_0$ is rejected at the 5% level]

4.7.2 Test for difference of means

Working Rule:


Step 1: Setting up of a null hypothesis: The two samples have been drawn from populations having the same mean and equal standard deviations. $H_0: \mu_1 = \mu_2$

Step 2: Setting up of an alternative hypothesis: The two samples have been drawn from populations with different means. $H_1: \mu_1 \ne \mu_2$ (two tailed test), or $H_1: \mu_1 < \mu_2$ (one tailed test), or $H_1: \mu_1 > \mu_2$ (one tailed test).

Step 3: Fixing the level of significance: $\alpha$ (normally it is 5%).

Step 4: Computation of the test statistic:

$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \quad \text{if the population s.d.'s are known,}$$

$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \quad \text{if the population s.d.'s are not known.}$$

Step 5: Critical Value: Find the critical value Zα at α% level of significance from the table "areas under the normal curve Zα-values".
Step 6: Inference: If the modulus of the calculated value satisfies |Zcal| ≤ Zα, where Zα is the critical value obtained in Step 5, we accept the null hypothesis at α% level of significance. Otherwise we reject the null hypothesis at α% level of significance.

Example 4.3: A college conducts both day and evening classes intended to be identical. A sample of 100 day students yields examination results x̄1 = 72.4 and σ1 = 14.8. A sample of 200 evening students yields examination results x̄2 = 73.9 and σ2 = 17.9. Are the two means statistically equal at 5% level?
Solution: We are given n1 = 100, x̄1 = 72.4 and σ1 = 14.8; n2 = 200, x̄2 = 73.9 and σ2 = 17.9.
Null Hypothesis (H0): μ1 = μ2, i.e., the two means are statistically equal.
Alternative Hypothesis (H1): μ1 ≠ μ2 (two tailed test), i.e., the two means are not statistically equal.
Level of Significance (α): 5%
Test Statistic:

$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{72.4 - 73.9}{\sqrt{\dfrac{(14.8)^2}{100} + \dfrac{(17.9)^2}{200}}} = \frac{-1.5}{\sqrt{3.7925}} = \frac{-1.5}{1.947} = -0.77$$

|Zcal| = 0.77

Critical value: At 5% level, Z0.05 = 1.96
Inference: Since the calculated value |Zcal| is less than the critical value of Z at 5% level, we accept the null hypothesis and conclude that the two means are statistically equal.

Example 4.4 A random sample of 1000 workers from South India shows that their mean wages are Rs. 47 per week with a standard deviation of Rs. 28. A random sample of 1500 workers from North India gives a mean wage of Rs. 49 per week with a standard deviation of Rs. 40. Is there any significant difference between their mean levels of wages?
Solution: We are given n1 = 1000, x̄1 = 47 and s1 = 28; n2 = 1500, x̄2 = 49 and s2 = 40.
Null Hypothesis (H0): μ1 = μ2, i.e., there is no significant difference between their mean levels of wages.
Alternative Hypothesis (H1): μ1 ≠ μ2 (two tailed test), i.e., there is a significant difference between their mean levels of wages.
Level of significance (α): 5%
Test Statistic:

$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{47 - 49}{\sqrt{\dfrac{(28)^2}{1000} + \dfrac{(40)^2}{1500}}} = \frac{-2}{\sqrt{1.8507}} = \frac{-2}{1.3604} = -1.47$$

|Zcal| = 1.47
Critical value: At 5% level, Z0.05 = 1.96
Inference: Since the calculated value |Zcal| is less than the critical value of Z at 5% level, we accept the null hypothesis and conclude that there is no significant difference between their mean levels of wages.

Self-Assessment Question
1. In a survey of buying habits, 400 women shoppers are chosen at random in super market 'A' located in a certain section of the city. Their average weekly food expenditure is Rs. 250 with a standard deviation of Rs. 40. For 400 women shoppers chosen at random in super market 'B' in another section of the city, the average weekly food expenditure is Rs. 220 with a standard deviation of Rs. 55. Test at 5% level of significance whether the average weekly food expenditures of the two populations of shoppers are equal.
[Hint: n1 = 400, x̄1 = 250, s1 = 40, n2 = 400, x̄2 = 220 and s2 = 55, |Zcal| = 8.82]
2. Random samples drawn from two countries gave the following data relating to the heights of adult males:

                                  Country A    Country B
Mean height (in inches)             67.42        67.25
Standard deviation (in inches)       2.58         2.50
Number in sample                     1000         1200

Is the difference between the means significant?
[Hint: n1 = 1000, x̄1 = 67.42, s1 = 2.58, n2 = 1200, x̄2 = 67.25 and s2 = 2.50, |Zcal| = 1.56]
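As a cross-check on the working rule for two means, the sketch below recomputes the test statistic for Examples 4.3 and 4.4. The function name `z_two_means` is a hypothetical helper, the critical value 1.96 is simply the table value quoted above, and Example 4.3 is entered with the day-class mean 72.4 that appears in its worked computation.

```python
import math


def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample Z statistic for the difference of two means (hypothetical helper)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


Z_CRIT_5_PERCENT = 1.96  # two-tailed table value at the 5% level

# Example 4.3: day students vs evening students
z_43 = z_two_means(72.4, 14.8, 100, 73.9, 17.9, 200)
# Example 4.4: weekly wages, South India vs North India
z_44 = z_two_means(47, 28, 1000, 49, 40, 1500)

for label, z in (("Example 4.3", z_43), ("Example 4.4", z_44)):
    decision = "reject H0" if abs(z) > Z_CRIT_5_PERCENT else "accept H0"
    print(label, round(z, 2), decision)
```

Both statistics fall inside the acceptance region (|Zcal| below 1.96), in line with the inferences drawn in the two examples.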

4.7.3 Test for single proportion
Working Rule:
Step 1: Setting up of Null Hypothesis: The sample has been drawn from a population with proportion P0, i.e., H0: P = P0.
Step 2: Setting up of Alternative Hypothesis: The sample has not been drawn from a population with proportion P0, i.e., H1: P ≠ P0.
Step 3: Fixing of level of significance: α (normally it is 5%)
Step 4: Computation of test statistic:

$$Z_{cal} = \frac{p - P}{\sqrt{\dfrac{PQ}{n}}}, \quad \text{where } Q = 1 - P$$

Step 5: Critical Value: Find the critical value Zα at α% level of significance from the table "areas under the normal curve Zα-values".
Step 6: Inference: If the modulus of the calculated value satisfies |Zcal| ≤ Zα, where Zα is the critical value obtained in Step 5, we accept the null hypothesis at α% level of significance. Otherwise we reject the null hypothesis at α% level of significance.
Now, we discuss the above with an example.

Example 4.5 In a sample of 1000 people in Karnataka, 540 are rice eaters and the rest are wheat eaters. Can we assume that both rice and wheat are equally popular in this state at 1% level of significance?
Solution: Given n = 1000; p = 540/1000 = 0.54; P = 0.5; Q = 1 - P = 1 - 0.5 = 0.5
Null Hypothesis (H0): The sample has been drawn from a population with proportion P = 0.5, i.e., H0: P = 0.5
Alternative Hypothesis (H1): The sample has not been drawn from a population with proportion P = 0.5, i.e., H1: P ≠ 0.5
Level of significance (α): 1%
Test Statistic:

$$Z_{cal} = \frac{p - P}{\sqrt{\dfrac{PQ}{n}}} = \frac{0.54 - 0.5}{\sqrt{\dfrac{0.5(0.5)}{1000}}} = \frac{0.04}{0.0158} = 2.53$$

|Zcal| = 2.53
Critical Value: At 1% level, Z0.01 = 2.58


Inference: Since the calculated value |Zcal| is less than the critical value of Z at 1% level, we accept the null hypothesis and conclude that the sample has been drawn from a population with proportion P = 0.5, i.e., rice and wheat are equally popular.

Self – assessment Question
1. In a random sample of 400 persons from a large population, 120 are females. Can it be said that males and females are in the ratio 5:3 in the population? Use 5% level of significance.
[Hint: n = 400, p = 120/400 = 0.3; P = 3/8 = 0.375; Q = 1 - P = 1 - 0.375 = 0.625; |Zcal| ≈ 3.10]

4.7.4 Test for two proportions
Working Rule:
Step 1: Setting up of a Null Hypothesis: The two samples have been drawn from the same population, i.e., H0: P1 = P2.
Step 2: Setting up of an Alternative Hypothesis: The two samples have not been drawn from the same population, i.e., H1: P1 ≠ P2.
Step 3: Fixing of level of significance: α (normally it is 5%)
Step 4: Computation of test statistic:

$$Z_{cal} = \frac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \quad \text{where } P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} \text{ and } Q = 1 - P$$

Step 5: Critical value: Find the critical value Zα at α% level of significance from the table "areas under the normal curve Zα-values".
Step 6: Inference: If |Zcal| ≤ Zα, we accept the null hypothesis at α% level of significance. Otherwise we reject the null hypothesis at α% level of significance.

Example 4.6 In a sample of 600 men from a certain city, 450 are found to be smokers. In a sample of 900 men from another city, 450 are found to be smokers. Do the data indicate that the two cities are significantly different with respect to prevalence of smoking habits among men?
Solution: Given n1 = 600; n2 = 900; p1 = 450/600 = 0.75; p2 = 450/900 = 0.5;

$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{600(0.75) + 900(0.5)}{600 + 900} = 0.6$$

Q = 1 - P = 1 - 0.6 = 0.4
Null Hypothesis (H0): The two samples have been drawn from the same population, i.e., H0: P1 = P2
Alternative Hypothesis (H1): The two samples have not been drawn from the same population, i.e., H1: P1 ≠ P2
Level of significance (α): 5%
Test Statistic:

$$Z_{cal} = \frac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.75 - 0.5}{\sqrt{0.6(0.4)\left(\dfrac{1}{600} + \dfrac{1}{900}\right)}} = \frac{0.25}{0.0258} = 9.7$$

|Zcal| = 9.7
Critical Value: At 5% level, Z0.05 = 1.96
Inference: Since the calculated value |Zcal| is greater than the critical value of Z at 5% level, we reject the null hypothesis and conclude that the two samples have not been drawn from the same population.

Example 4.7 A machine puts out 16 imperfect articles in a sample of 500. After the machine is overhauled, it puts out 3 imperfect articles in a batch of 100. Has the machine improved?
Solution: Given n1 = 500; n2 = 100; p1 = 16/500 = 0.032; p2 = 3/100 = 0.03;

$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{500(0.032) + 100(0.03)}{500 + 100} = \frac{19}{600} = 0.0317$$

Q = 1 - P = 1 - 0.0317 = 0.9683
Null Hypothesis (H0): P1 = P2
Alternative Hypothesis (H1): P1 > P2 (one tailed test)
Level of significance (α): 5%
Test Statistic:

$$Z_{cal} = \frac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.032 - 0.03}{\sqrt{0.0317(0.9683)\left(\dfrac{1}{500} + \dfrac{1}{100}\right)}} = \frac{0.002}{0.0192} = 0.10$$

|Zcal| = 0.10
Critical Value: At 5% level (one tailed), Z0.05 = 1.645
Inference: Since the calculated value |Zcal| is less than the critical value of Z at 5% level, we accept the null hypothesis and conclude that there is no improvement after overhauling.

Self Assessment Question
1. In random samples of 600 and 1000 men from two cities, 400 and 600 men respectively are found to be literate. Do the data indicate at 5% level of significance that the populations are significantly different in the percentage of literacy?
[Hint: n1 = 600, n2 = 1000, p1 = 400/600 = 0.67, p2 = 600/1000 = 0.6; P = 0.625; Q = 0.375; |Zcal| = 2.67]
2. Before an increase in excise duty on tea, 400 people out of a sample of 500 persons were found to be tea drinkers. After the increase in the duty, 400 persons were found to be tea drinkers in a sample of 600 people. Do you think that there has been a significant decrease in the consumption of tea after the increase in the excise duty?
[Hint: n1 = 500, n2 = 600, p1 = 400/500 = 0.80, p2 = 400/600 = 0.67; P = 0.73; Q = 0.27; |Zcal| ≈ 4.8 using the rounded values above; H0: P1 = P2; H1: P1 > P2 (one tailed test)]
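The two proportion tests can be checked numerically in the same way. The sketch below is illustrative only: `z_single_proportion` and `z_two_proportions` are hypothetical helper names, and the pooled proportion P is computed from the raw counts as (x1 + x2)/(n1 + n2), which is algebraically the same as (n1 p1 + n2 p2)/(n1 + n2).

```python
import math


def z_single_proportion(p, p0, n):
    """Z statistic for a single proportion; p0 is the hypothesised proportion."""
    q0 = 1 - p0
    return (p - p0) / math.sqrt(p0 * q0 / n)


def z_two_proportions(x1, n1, x2, n2):
    """Z statistic for two proportions, using the pooled estimate P."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # P = (n1*p1 + n2*p2) / (n1 + n2)
    q = 1 - pooled
    return (p1 - p2) / math.sqrt(pooled * q * (1 / n1 + 1 / n2))


z_rice = z_single_proportion(540 / 1000, 0.5, 1000)   # rice eaters in Karnataka
z_smokers = z_two_proportions(450, 600, 450, 900)      # smokers in two cities
z_machine = z_two_proportions(16, 500, 3, 100)         # machine before/after overhaul

print(round(z_rice, 2), round(z_smokers, 2), round(z_machine, 2))
```

Carried to full precision, the smokers statistic is about 9.68 and the machine statistic is about 0.10; the latter is well below the one-tailed critical value 1.645, so the conclusion of no improvement stands.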

4.8 ANALYSIS OF VARIANCE
In many statistical studies a variable of interest, called the response variable (or dependent variable), is identified. Data are then collected that tell us how one or more factors (or independent variables) influence the variable of interest. If we cannot control the factor(s) being studied, we say that the data obtained are observational. For example, suppose that in order to study how the size of a home relates to the sale price of the home, a real estate agent randomly selects 50 recently sold homes and records the square footages and sale prices of these homes. Because the real estate agent cannot control the sizes of the randomly selected homes, we say that the data are observational.


If we can control the factors being studied, we say that the data are experimental. Furthermore, in this case the values, or levels, of the factor (or combination of factors) are called treatments. The purpose of most experiments is to compare and estimate the effects of the different treatments on the response variable. For example, suppose that an oil company wishes to study how three different gasoline types (A, B and C) affect the mileage obtained by a popular midsized automobile model. Here the response variable is gasoline mileage and the company will study a single factor, gasoline type. Since the oil company can control which gasoline type is used in the midsized automobile, the data that the oil company will collect are experimental. Furthermore, the treatments (the levels of the factor gasoline type) are gasoline types A, B and C.

In order to collect data in an experiment, the different treatments are assigned to objects (people, cars, animals or the like) that are called experimental units. For example, in the gasoline mileage situation, gasoline types A, B and C will be compared by conducting mileage tests using a midsized automobile. The automobiles used in the tests are the experimental units.

Definition: According to R.A. Fisher, Analysis of Variance (ANOVA) is the "Separation of Variance ascribable to one group of causes from the variance ascribable to other group". By this technique the total variation in the sample data is expressed as the sum of its nonnegative components, where each of these components is a measure of the variation due to some specific independent source or factor or cause.

4.8.1 Assumptions
For the validity of the F-test in ANOVA the following assumptions are made:
(i) The observations are independent.
(ii) The parent population from which the observations are taken is normal.
(iii) The various treatment and environmental effects are additive in nature.

4.8.2 One Way Classification


Let us suppose that N observations xij (i = 1, 2, ..., k; j = 1, 2, ..., ni) of a random variable X are grouped, on some basis, into k classes of sizes n1, n2, ..., nk respectively (N = n1 + n2 + ... + nk), as exhibited below:

Class    Observations                Mean    Total
1        x11  x12  ...  x1n1         x̄1      T1
2        x21  x22  ...  x2n2         x̄2      T2
...      ...                         ...     ...
i        xi1  xi2  ...  xini         x̄i      Ti
...      ...                         ...     ...
k        xk1  xk2  ...  xknk         x̄k      Tk
                               Grand Total   G

The total variation in the observations xij can be split into the following two components:
(i) The variation between the classes, or the variation due to different bases of classification, commonly known as treatments.
(ii) The variation within the classes, i.e., the inherent variation of the random variable within the observations of a class.

The first type of variation is due to assignable causes which can be detected and controlled by human endeavor, and the second type of variation is due to chance causes which are beyond the control of human hand. In particular, let us consider the effect of k different rations on the yield in milk of N cows (of the same breed and stock) divided into k classes of sizes n1, n2, ..., nk respectively (N = n1 + n2 + ... + nk). Here the sources of variation are:
(i) Effect of rations.
(ii) Error due to chance, i.e., variation produced by numerous causes that cannot be detected and identified.

Test Procedure:
The steps involved in carrying out the analysis are:
1) Null Hypothesis (H0): The first step is to set up a null hypothesis H0: μ1 = μ2 = ... = μk
2) Alternative Hypothesis (H1): Not all μi's are equal (i = 1, 2, ..., k)
3) Level of significance: Let α = 0.05
4) Test statistic: Various sums of squares are obtained as follows:
a) Find the sum of values of all the (N) items of the given data. Let this grand total be represented by 'G'. Then the Correction Factor (C.F.) = G²/N
b) Find the sum of squares of all the individual items (xij); then the Total sum of squares is

$$TSS = \sum_{i}\sum_{j} x_{ij}^2 - C.F.$$

c) Find the sum of squares of all the class totals (or each treatment total) Ti (i = 1, 2, ..., k); then the sum of squares between the classes or between the treatments is

$$SST = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - C.F.$$

where ni (i = 1, 2, ..., k) is the number of observations in the ith class, or the number of observations received by the ith treatment.
d) Find the sum of squares within the classes, or sum of squares due to error (SSE), by subtraction: SSE = TSS - SST
5) Degrees of freedom (d.f.): The degrees of freedom for the total sum of squares (TSS) is (N-1), the degrees of freedom for SST is (k-1) and the degrees of freedom for SSE is (N-k).
6) Mean sum of squares: The mean sum of squares for treatments is MST = SST/(k-1) and the mean sum of squares for error is MSE = SSE/(N-k).
7) ANOVA Table: The above sums of squares, together with their respective degrees of freedom and mean sums of squares, are summarized in the following table.

ANOVA Table for one-way classification

Sources of Variation    d.f.    S.S.    M.S.S.              F-ratio
Between Treatments      k-1     SST     MST = SST/(k-1)     F = MST/MSE
Error                   N-k     SSE     MSE = SSE/(N-k)
Total                   N-1     TSS

Calculation of variance ratio: The variance ratio F is the ratio between the greater variance and the smaller variance, thus

$$F = \frac{\text{Variation between the treatments}}{\text{Variation within the treatments}} = \frac{MST}{MSE}$$

If the variance within the treatments is more than the variance between the treatments, then the numerator and denominator should be interchanged and the degrees of freedom adjusted accordingly.
8) Critical Value of F or Table value of F: The critical value of F or table value of F is obtained from the F table for (k-1, N-k) d.f. at 5% level of significance.
9) Inference: If the calculated F value is less than the table value of F, we may accept our null hypothesis H0 and say that there is no significant difference between treatments. If the calculated F value is greater than the table value of F, we reject our H0 and say that the difference between treatments is significant.

Example 4.8 The following table gives the yields on 15 sample plots under three varieties of seed:

A:   20   21   23   16   20
B:   18   20   17   15   25
C:   25   28   22   28   32

Prepare an analysis of variance table.
Solution:
Null Hypothesis (H0): μ1 = μ2 = μ3 (i.e., the various varieties of seeds are homogeneous)
Alternative Hypothesis (H1): Not all of μ1, μ2, μ3 are equal (i.e., the various varieties of seeds are not homogeneous)
Level of Significance (α): 0.05
Test Statistic:

Variety    Observations               Total (Ti)    Ti²
A          20   21   23   16   20     100           10000
B          18   20   17   15   25      95            9025
C          25   28   22   28   32     135           18225
                        Grand Total   330

Squares:

Variety    Squared observations            Total
A          400   441   529   256   400      2026
B          324   400   289   225   625      1863
C          625   784   484   784   1024     3701
                                   Total    7590

Correction Factor (C.F.) = G²/N = 330²/15 = 7260

Total sum of squares (TSS) is
TSS = ∑∑xij² - C.F. = 7590 - 7260 = 330

Sum of squares between the classes or between the treatments (SST) is

$$SST = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - C.F. = \left(\frac{100^2}{5} + \frac{95^2}{5} + \frac{135^2}{5}\right) - 7260 = 7450 - 7260 = 190$$

Sum of squares due to error (SSE) = TSS - SST = 330 - 190 = 140

ANOVA Table for one-way classification

Source of Variation    d.f.         S.S.    M.S.S.              F-ratio
Between treatments     3-1 = 2      190     190/2 = 95          95/11.667 = 8.142
Error                  15-3 = 12    140     140/12 = 11.667
Total                  15-1 = 14    330

Table Value: The table value of F for (2, 12) d.f. at 5% level of significance is 3.89 (from the F-table).
Inference: Since the calculated F is greater than the table value of F, we reject our H0 and say that the various varieties of seeds are not homogeneous.

Self – Assessment Question
1. Three processes A, B and C are tested to see whether their outputs are equivalent. The following observations of output are made:

A:   10   12   13   11   10   14   15   13
B:    9   11   10   12   13
C:   11   10   15   14   12   13

Carry out the analysis of variance and state your conclusion.
Hint:

Sources of Variation    d.f.         S.S.    M.S.S.           F-ratio
Between treatments      3-1 = 2      7       7/2 = 3.5        3.5/3.19 = 1.097
Error                   19-3 = 16    51      51/16 = 3.19
Total                   19-1 = 18

Table value of F for (2,16) d.f. at 5% level of significance is 3.63
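The sums of squares in the one-way classification follow a fixed recipe (correction factor, TSS, SST, SSE, mean squares, F), so they are easy to verify by program. The following is a minimal sketch under the assumption that each treatment is supplied as a plain list of observations; `one_way_anova` is a hypothetical helper name. The table value of F is not computed here (the Python standard library has no F distribution) and must still be read from the F-table.

```python
def one_way_anova(groups):
    """Compute the one-way ANOVA quantities for a list of treatment samples."""
    values = [x for group in groups for x in group]
    n_total = len(values)                       # N
    k = len(groups)                             # number of treatments
    grand_total = sum(values)                   # G
    correction_factor = grand_total ** 2 / n_total
    tss = sum(x * x for x in values) - correction_factor
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - correction_factor
    sse = tss - sst                             # obtained by subtraction
    mst = sst / (k - 1)
    mse = sse / (n_total - k)
    return {"CF": correction_factor, "TSS": tss, "SST": sst, "SSE": sse,
            "MST": mst, "MSE": mse, "F": mst / mse}


# Yields of the three seed varieties A, B and C from the worked example
yields = [
    [20, 21, 23, 16, 20],
    [18, 20, 17, 15, 25],
    [25, 28, 22, 28, 32],
]
result = one_way_anova(yields)
for name, value in result.items():
    print(name, round(value, 3))
```

For the seed data this gives C.F. = 7260, TSS = 330, SST = 190, SSE = 140 and F of about 8.14, which is then compared with the table value 3.89 for (2, 12) d.f.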

Chapter Summary
This chapter has explained the different types of sampling. First we discussed the need for sampling and the elements of a sampling plan. We continued by discussing sampling and non-sampling errors and the testing of hypothesis. We saw that inferences can be made using both large sample and small sample tests. We learned the procedure for testing of hypothesis for a single mean, the difference of two means, a single proportion and the difference of two proportions under large sample tests. To conclude this chapter, we explained how to test homogeneity using one way ANOVA.


UNIT – V CORRELATION AND REGRESSION ANALYSIS
5.0 OBJECTIVES
5.1 MEANING OF CORRELATION
5.2 TYPES OF CORRELATION
5.2.1 Positive and Negative Correlation
5.2.2 Linear and Non-linear Correlation
5.3 MEASUREMENT TECHNIQUES OF CORRELATION COEFFICIENT
5.3.1 Scatter Diagram
5.3.2 Karl Pearson's Coefficient of Correlation
5.3.3 Spearman's Rank Correlation
Ranks are given directly
Non-repeated ranks
Repeated ranks
5.4 PROPERTIES OF CORRELATION COEFFICIENT
5.5 MEANING OF REGRESSION
5.6 TYPES OF REGRESSION LINES
5.6.1 Regression line of X on Y
5.6.2 Regression line of Y on X
5.7 CONSTRUCTION OF REGRESSION EQUATIONS
5.8 PROPERTIES OF REGRESSION COEFFICIENTS
5.9 DIFFERENCES BETWEEN CORRELATION AND REGRESSION
5.10 APPLICATIONS OF REGRESSION ANALYSIS


INTRODUCTION
In this unit you will learn the concepts of correlation and regression. You will also learn the various methods of obtaining the correlation coefficient, the rank correlation coefficient, regression equations etc. This unit explains the differences between correlation and regression. The techniques discussed in this unit are easy to understand with the help of a calculator. Try out the example problems with a calculator.

5.1 MEANING OF CORRELATION
We are familiar with the fact that a change in one factor, say the amount of rainfall, affects the change in another factor, say the yield of rice. This means that there exists some kind of relationship between the two factors. Thus correlation is the relationship between two factors. In simple words, correlation means "the degree of relationship between two or more factors". Another example is the relationship that exists between price and demand.

5.2 TYPES OF CORRELATION
There are different types of correlation. They can be classified into the following categories.
a) Positive and Negative correlation
b) Linear and Non-linear correlation
First we will discuss positive and negative correlation.

5.2.1 Positive and Negative Correlation
If the changes in the factors are in the same direction, then the correlation is said to be "positive correlation". The relationship between the amount of rainfall and the yield of rice is an example of positive correlation: if the rainfall level increases then the yield of rice also increases, and vice-versa. If the changes in the factors are in opposite directions, then the correlation is said to be "negative correlation". The relationship between price and demand is an example: as the price increases, the demand generally decreases.
Now, we will discuss linear and non-linear correlation.

5.2.2 Linear and Non-linear Correlation
If the changes in the factors are in a constant ratio, then the correlation is said to be "linear correlation". For example:

Amount of rainfall (in mm)     40    60    80    100    120
Yield of rice (in tonnes)      100   150   200   250    300

From the above example, it can be observed that the amount of rainfall increases by 20 mm at each level and the yield of rice increases by 50 tonnes at each level.
If the changes in the factors are not in a constant ratio, then the correlation is said to be "non-linear correlation". For example:

Factor 1     40    60    80    100    120
Factor 2     100   150   200   250    300

From the above example, it can be observed that the changes at various levels are different.

Self Assessment Question

State the different types of correlation with example in the space given below. Limit your answer in about 80 words.


_____________________________________________________________________________ _____________________________________________________________________________ _____________________________________________________________________________ _____________________________________________________________________________ ________________________________

5.3.1 Scatter Diagram
If the values of the variables or factors, say X and Y, are plotted in the XY-plane, the diagram of the data obtained is called a scatter diagram. The greater the scatter of the plotted points on the diagram, the weaker is the relationship between the two variables or factors.

1. If all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner, the correlation is said to be perfectly positive, i.e., the correlation coefficient r = +1 (Fig 5.1).
Figure 5.1: r = +1

2. If all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the correlation is said to be perfectly negative, i.e., the correlation coefficient r = -1 (Fig 5.2).
Figure 5.2: r = -1

3. If the points fall in a narrow band and show a rising tendency from the lower left-hand corner to the upper right-hand corner, there is a high degree of positive correlation (Fig 5.3).
Figure 5.3: r close to +1

4. If the points fall in a narrow band and show a declining tendency from the upper left-hand corner to the lower right-hand corner, there is a high degree of negative correlation (Fig 5.4).
Figure 5.4: r close to -1

5. If the points fall in a wide band and show a rising tendency from the lower left-hand corner to the upper right-hand corner, there is a low degree of positive correlation (Fig 5.5).
Figure 5.5: r > 0

6. If the points fall in a wide band and show a declining tendency from the upper left-hand corner to the lower right-hand corner, there is a low degree of negative correlation (Fig 5.6).
Figure 5.6: r < 0

7. If the plotted points lie on a straight line parallel to the x-axis or are scattered in a haphazard manner, this shows the absence of correlation between the two factors (Fig 5.7).
Figure 5.7: r = 0


5.3.2 KARL PEARSON'S COEFFICIENT OF CORRELATION
As a measure of the degree of linear relationship between two variables, Karl Pearson developed a formula called the correlation coefficient. The correlation coefficient between two variables X and Y, usually denoted by rxy, is a measure of the linear relationship between them and is defined as

$$r_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$

where x = X - X̄ and y = Y - Ȳ.

Working Procedure
Step 1: Denote one series by X and the other by Y.
Step 2: Calculate the means X̄ and Ȳ of the X and Y series respectively, using the formulae X̄ = ∑X/n and Ȳ = ∑Y/n.
Step 3: Take the deviations of the observations in the X series from X̄ and write them under the column headed by x = X - X̄. Take the deviations of the observations in the Y series from Ȳ and write them under the column headed by y = Y - Ȳ.
Step 4: Multiply the respective deviations and write the products under the column headed by xy.
Step 5: Square the deviations obtained in Step 3 for the X and Y series and write them under the columns headed by x² and y².
Step 6: Apply the following formula to calculate the correlation coefficient (r):

$$r_{xy} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$
Example 5.1 Find the coefficient of correlation between the heights of brothers and sisters from the following data:

Height of Brothers (in cm)    65   66   67   68   69   70   71
Height of Sisters (in cm)     67   68   66   69   72   72   69

Solution: Let the heights of brothers be denoted by X and those of sisters by Y. Let us prepare the following table:

X      x = X - X̄    Y      y = Y - Ȳ    xy     x²    y²
65     -3            67     -2            6      9     4
66     -2            68     -1            2      4     1
67     -1            66     -3            3      1     9
68      0            69      0            0      0     0
69      1            72      3            3      1     9
70      2            72      3            6      4     9
71      3            69      0            0      9     0
476     -            483     -            20     28    32

From the above table, n = 7; ∑X = 476; ∑Y = 483; ∑xy = 20; ∑x² = 28; ∑y² = 32
X̄ = ∑X/n = 476/7 = 68
Ȳ = ∑Y/n = 483/7 = 69

Karl Pearson's Coefficient of Correlation is now calculated as follows:

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{20}{\sqrt{28 \times 32}} = \frac{20}{5.2915 \times 5.6569} = \frac{20}{29.9333} = 0.6681$$
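The working procedure for Karl Pearson's coefficient translates directly into code. The sketch below follows the deviation-from-mean steps of the direct method; `pearson_r` is a hypothetical helper name, and the data are the brother and sister heights of Example 5.1.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r by the direct method (deviations taken from the means)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]             # x = X - X-bar
    dev_y = [y - mean_y for y in ys]             # y = Y - Y-bar
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_x2 = sum(a * a for a in dev_x)
    sum_y2 = sum(b * b for b in dev_y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


brothers = [65, 66, 67, 68, 69, 70, 71]
sisters = [67, 68, 66, 69, 72, 72, 69]
r = pearson_r(brothers, sisters)
print(round(r, 3))
```

With ∑xy = 20, ∑x² = 28 and ∑y² = 32 the quotient 20/29.9333 is about 0.668, i.e., a moderately high positive correlation.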

Self – Assessment Question
Calculate the correlation coefficient between the heights of sisters and the heights of brothers from the given data:

Height of Sisters (in cm)     64   65   66   67   68   69   70
Height of Brothers (in cm)    66   67   65   68   70   68   72

[Hint: X̄ = 67, Ȳ = 68, ∑x² = 28, ∑y² = 34, ∑xy = 25, r = 0.81]

Short Cut Method
The above direct method for calculating 'r' is not convenient when (i) the terms of the series X and Y are large and the calculation of X̄ and Ȳ becomes difficult, or (ii) the means of X or Y are not integers. In these cases we apply the following formula based on assumed means:

$$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$$

where
dx = X - A, A is the assumed mean of the X series,
dy = Y - B, B is the assumed mean of the Y series,
n is the number of observations of X and Y.

Working Procedure
Step 1: Denote one series by X and the other by Y.
Step 2: Take any term 'A' as the assumed mean of the X series and 'B' as the assumed mean of the Y series (preferably the middle one).
Step 3: Take the deviations of the observations in the X series from A and write them under the column headed by dx = X - A. Take the deviations of the observations in the Y series from B and write them under the column headed by dy = Y - B.
Step 4: Multiply the respective deviations and write the products under the column headed by dx dy.
Step 5: Square the deviations obtained in Step 3 for the X and Y series and write them under the columns headed by dx² and dy².
Step 6: Apply the following formula to calculate the correlation coefficient (r):

$$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}}$$

Example 5.2: Calculate the coefficient of correlation for the following pairs of values of X and Y.

X    17   19   21   26   20   28   26   29
Y    23   27   25   26   27   25   30   33

Solution: Let the assumed means for X and Y be 23 and 27 respectively, so that dx = X - 23 and dy = Y - 27. We have the following table:

X      Y      dx = X - 23    dy = Y - 27    dx dy    dx²    dy²
17     23     -6             -4             24       36     16
19     27     -4              0              0       16      0
21     25     -2             -2              4        4      4
26     26      3             -1             -3        9      1
20     27     -3              0              0        9      0
28     25      5             -2            -10       25      4
26     30      3              3              9        9      9
29     33      6              6             36       36     36
186    216     2              0             60      144     70

Note that here X̄ = ∑X/n = 186/8 = 23.25, which is not an integer, so we use the short-cut method.
Here n = 8, ∑dx = 2, ∑dy = 0, ∑dx dy = 60, ∑dx² = 144, ∑dy² = 70.

$$r_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{n\sum d_x^2 - (\sum d_x)^2}\,\sqrt{n\sum d_y^2 - (\sum d_y)^2}} = \frac{8(60) - 2(0)}{\sqrt{8(144) - (2)^2}\,\sqrt{8(70) - (0)^2}}$$

$$r_{xy} = \frac{480}{\sqrt{1148}\,\sqrt{560}} = \frac{480}{33.8821 \times 23.6643} = \frac{480}{801.7962} = 0.5987$$

Self – assessment Question
Compute the coefficient of correlation for the following data
X Y 10 12 25 22 13 16 25 15 22 18 11 18 12 17 25 23 21 24 20 17

[Ans: rxy= 0.53]
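The short-cut formula gives the same value of r whatever assumed means A and B are chosen, because shifting the origin does not change the correlation. The sketch below illustrates this on the data of Example 5.2; `pearson_r_assumed_mean` is a hypothetical helper name, and the second call with A = B = 0 is an arbitrary choice made only to show the invariance.

```python
import math


def pearson_r_assumed_mean(xs, ys, a, b):
    """Pearson's r by the short-cut method with assumed means a (for X) and b (for Y)."""
    n = len(xs)
    dx = [x - a for x in xs]
    dy = [y - b for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(p * q for p, q in zip(dx, dy))
    sum_dx2 = sum(p * p for p in dx)
    sum_dy2 = sum(q * q for q in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = (math.sqrt(n * sum_dx2 - sum_dx ** 2)
                   * math.sqrt(n * sum_dy2 - sum_dy ** 2))
    return numerator / denominator


x_values = [17, 19, 21, 26, 20, 28, 26, 29]
y_values = [23, 27, 25, 26, 27, 25, 30, 33]

r_book = pearson_r_assumed_mean(x_values, y_values, 23, 27)   # A = 23, B = 27 as in Example 5.2
r_zero = pearson_r_assumed_mean(x_values, y_values, 0, 0)     # a different origin, same r
print(round(r_book, 4), round(r_zero, 4))
```

Both calls reduce to 480/(√1148 × √560), about 0.5987, so the choice of assumed mean affects only the size of the intermediate numbers, not the result.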

5.3.3 Spearman's Rank Correlation Coefficient
The coefficient of rank correlation is based on the ranks of the values of the variates and is denoted by ρ:

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$$

where D is the difference of corresponding ranks and n is the number of pairs of observations.

TYPE I: RANKS ARE GIVEN DIRECTLY
Working Procedure
Step 1: Denote the rank of the X series by R1 and the rank of the Y series by R2.
Step 2: Calculate the difference of R1 and R2 and write it under the column headed by D.
Step 3: Square the difference D and write it under the column headed by D².
Step 4: Apply the formula:

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$$

This method is described with the following example.
Example 5.3: Two judges in a beauty contest rank the 12 entries as follows.

Judge X     1    2    3    4    5    6    7    8    9    10   11   12
Judge Y     12   9    6    10   3    5    4    7    8    2    11   1

Calculate the rank correlation coefficient between the two judges X and Y.

Judge X (R1)    Judge Y (R2)    D = R1 - R2    D²
1               12              -11            121
2               9               -7             49
3               6               -3             9
4               10              -6             36
5               3               2              4
6               5               1              1
7               4               3              9
8               7               1              1
9               8               1              1
10              2               8              64
11              11              0              0
12              1               11             121
                                Total          416

Here n = 12; ∑D² = 416
Now,

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6(416)}{12^3 - 12} = 1 - \frac{2496}{1728 - 12} = 1 - \frac{2496}{1716} = 1 - 1.4545 = -0.4545$$

Example 5.4: Ten competitors in a beauty contest were ranked by three judges in the following order:

Judge 1    1    6    5    10   3    2    4    9    7    8
Judge 2    3    5    8    4    7    10   2    1    6    9
Judge 3    6    4    9    8    1    2    3    10   5    7

Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common taste in beauty.
Solution: Let R1, R2, R3 respectively be the ranks given by the first, second and third judge. Let ρij be the rank correlation coefficient between the ranks given by the ith and jth judges, i = 1, 2, 3; j = 1, 2, 3. Let Dij = Ri - Rj be the difference of ranks of an individual given by the ith and jth judges.

R1     R2     R3     D12 = R1 - R2    D12²    D23 = R2 - R3    D23²    D13 = R1 - R3    D13²
1      3      6      -2               4       -3               9       -5               25
6      5      4      1                1       1                1       2                4
5      8      9      -3               9       -1               1       -4               16
10     4      8      6                36      -4               16      2                4
3      7      1      -4               16      6                36      2                4
2      10     2      -8               64      8                64      0                0
4      2      3      2                4       -1               1       1                1
9      1      10     8                64      -9               81      -1               1
7      6      5      1                1       1                1       2                4
8      9      7      -1               1       2                4       1                1
Total                                 200                      214                      60

Here n = 10; ∑D12² = 200; ∑D23² = 214; ∑D13² = 60

First and Second Judges

$$\rho_{12} = 1 - \frac{6\sum D_{12}^2}{n^3 - n} = 1 - \frac{6(200)}{10^3 - 10} = 1 - \frac{1200}{990} = 1 - 1.2121 = -0.2121$$

Second and Third Judges

$$\rho_{23} = 1 - \frac{6\sum D_{23}^2}{n^3 - n} = 1 - \frac{6(214)}{10^3 - 10} = 1 - \frac{1284}{990} = 1 - 1.2970 = -0.2970$$

First and Third Judges

$$\rho_{13} = 1 - \frac{6\sum D_{13}^2}{n^3 - n} = 1 - \frac{6(60)}{10^3 - 10} = 1 - \frac{360}{990} = 1 - 0.3636 = 0.6364$$

Since ρ13 is the maximum, the pair of the first and third judges has the nearest approach to common taste in beauty.
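When the ranks are given directly, the formula ρ = 1 - 6∑D²/(n³ - n) needs only the squared rank differences. The sketch below recomputes the three coefficients of Example 5.4; `spearman_rho` is a hypothetical helper name and the formula used is the no-ties version stated above.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two lists of ranks without ties."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


judge1 = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
judge2 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
judge3 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]

rho_12 = spearman_rho(judge1, judge2)
rho_23 = spearman_rho(judge2, judge3)
rho_13 = spearman_rho(judge1, judge3)
print(round(rho_12, 4), round(rho_23, 4), round(rho_13, 4))
```

Note the signs: 1 - 1.2121 and 1 - 1.2970 are both negative, so the first two pairs of judges actually disagree slightly, while ρ13 of about 0.636 is the only positive value and therefore the largest.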

Self – assessment Question
1. Two judges in a musical contest rank the 10 entries as follows:

Judge X    3    5    8    4    7    10   2    1    6    9
Judge Y    6    4    9    8    1    2    3    10   5    7

[Hint: n = 10; ∑D² = 214; ρ = -0.297]
2. Ten competitors in a beauty contest were ranked by three judges in the following order:

1st Judge    1    5    4    8    9    6    10   7    3    2
2nd Judge    4    8    7    6    5    9    10   3    2    1
3rd Judge    6    7    8    1    5    10   9    2    3    4

Use Spearman's coefficient of rank correlation to determine which pair of judges has the nearest approach to common taste in beauty.
[Hint: n = 10; ∑D12² = 74, ∑D23² = 44, ∑D13² = 156, ρ12 = 0.5515, ρ23 = 0.7333, ρ13 = 0.0545]

TYPE II: RANKS ARE NOT GIVEN – NON-REPEATED RANKS
In this case we are given only the data. We assign ranks to both the X and Y series, ranking both series in ascending order (or both in descending order).
Working Rule


Step 1: Assign ranks to each item of both series in ascending or descending order.
Step 2: Calculate the difference of ranks and write it under the column headed D.
Step 3: Square the difference D and write it under the heading D².
Step 4: Apply the formula

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n}$$

This method is explained with the help of the following example.

Example 5.5
For the following data calculate the coefficient of rank correlation.

Series X    80    91    99    71    61    81    70    59
Series Y   123   135   154   110   105   134   121   106

Solution:

Series X   Series Y   Rank X (R1)   Rank Y (R2)   D = R1 - R2   D²
   80        123           5             5              0        0
   91        135           7             7              0        0
   99        154           8             8              0        0
   71        110           4             3              1        1
   61        105           2             1              1        1
   81        134           6             6              0        0
   70        121           3             4             -1        1
   59        106           1             2             -1        1
Total                                                            4


Here, n = 8; ∑D² = 4. Now,

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6(4)}{8^3 - 8} = 1 - \frac{24}{504} = 1 - 0.0476 = 0.9524$$
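The working rule for Type II problems can be sketched in Python as follows. This is only an illustration under the assumption that no value is repeated within a series; the helper name `ascending_ranks` is hypothetical and not taken from the notes. It assigns ascending ranks to both series of Example 5.5, forms D and D², and applies the formula of Step 4.

```python
def ascending_ranks(values):
    """Rank distinct values from 1 (smallest) to n (largest)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


series_x = [80, 91, 99, 71, 61, 81, 70, 59]
series_y = [123, 135, 154, 110, 105, 134, 121, 106]

r1 = ascending_ranks(series_x)           # Step 1: ranks of X
r2 = ascending_ranks(series_y)           # Step 1: ranks of Y
d = [a - b for a, b in zip(r1, r2)]      # Step 2: rank differences D
sum_d2 = sum(value ** 2 for value in d)  # Step 3: sum of D squared

n = len(series_x)
rho = 1 - 6 * sum_d2 / (n ** 3 - n)      # Step 4: rank correlation coefficient

print("R1 =", r1)
print("R2 =", r2)
print("sum of D^2 =", sum_d2)
print("rho =", round(rho, 4))
```

Ranking both series in descending order instead would give the same D² values and hence the same coefficient.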

Self-assessment Question
Calculate the rank correlation coefficient for the following data of two series.

Series X   92   89   87   86   83   77   71   63   53   50
Series Y   86   83   91   77   68   85   52   82   37   57

[Hint: n = 10; ∑D² = 44; ρ = 0.733]

TYPE III: RANKS ARE NOT GIVEN - REPEATED RANKS
If two or more individuals are placed together in any classification with respect to an attribute, so that more than one item has the same rank in either or both of the series, the problem is solved by assigning the average rank to each of the tied individuals. For example, if the 5th and 6th items have the same value, the common rank assigned to both of them is (5+6)/2 = 5.5. Similarly, if a value occurs three times, the common rank assigned to it is the sum of the three ranks divided by 3. In order to find the rank correlation coefficient, an adjustment factor is added to ∑D² in the formula. It is given by

$$\text{Adjustment Factor (A.F.)} = \frac{1}{12}(m^3 - m)$$

where ‘m’ is the number of times an item is repeated. This adjustment factor is to be added for each repeated value in both the series. The modified formula for the rank correlation coefficient is

$$\rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}\sum (m^3 - m)\right]}{n^3 - n}$$

This method is explained with the following example.

Example 5.6
From the following data related to the series X and Y, calculate the coefficient of rank correlation.

Series X   48   33   40    9   16   16   65   24   16   57
Series Y   13   13   24    6   15    4   20    9    6   19

Solution

Series X   Series Y   Rank X (R1)   Rank Y (R2)   D = R1 - R2     D²
   48         13           8            5.5           2.5         6.25
   33         13           6            5.5           0.5         0.25
   40         24           7           10            -3           9.00
    9          6           1            2.5          -1.5         2.25
   16         15           3            7            -4          16.00
   16          4           3            1             2           4.00
   65         20          10            9             1           1.00
   24          9           5            4             1           1.00
   16          6           3            2.5           0.5         0.25
   57         19           9            8             1           1.00
Total                                                            41.00

Here n = 10, ∑D² = 41.

[Remark: In the X series the value 16 is repeated thrice, so the common rank given to it is 3, which is the average of 2, 3 and 4, i.e., (2+3+4)/3 = 3. In the Y series the values 13 and 6 are each repeated twice.]

Now, the adjustment factors are:

For the X series (16 repeated thrice), $AF_1 = \frac{1}{12}(3^3 - 3) = 2$
For the Y series (13 repeated twice), $AF_2 = \frac{1}{12}(2^3 - 2) = 0.5$
For the Y series (6 repeated twice), $AF_3 = \frac{1}{12}(2^3 - 2) = 0.5$

The coefficient of rank correlation is

$$\rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}\sum (m^3 - m)\right]}{n^3 - n} = 1 - \frac{6[41 + 2 + 0.5 + 0.5]}{10^3 - 10} = 1 - \frac{6(44)}{990} = 1 - \frac{264}{990} = 1 - 0.2667$$

ρ = 0.7333
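The tie-handling procedure of Example 5.6 can be sketched in Python as below. This is an illustrative sketch rather than a definitive implementation; the function names `average_ranks`, `tie_correction` and `spearman_with_ties` are assumptions introduced here. Tied values share the mean of the rank positions they occupy, and (m³ - m)/12 is added to ∑D² for every value repeated m times.

```python
from collections import Counter


def average_ranks(values):
    """Ascending ranks; tied values share the mean of the positions they occupy."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m)/12 over every value occurring m times (zero when m = 1)."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values())


def spearman_with_ties(x, y):
    """Rank correlation using the adjusted formula for repeated ranks."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * adjusted / (n ** 3 - n)


x = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]

print("Ranks of X:", average_ranks(x))
print("Ranks of Y:", average_ranks(y))
print("Adjustment:", tie_correction(x) + tie_correction(y))  # 2 + 0.5 + 0.5 = 3.0
print("rho =", round(spearman_with_ties(x, y), 4))           # 0.7333
```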

Self-assessment Question
Obtain the rank correlation coefficient for the following data.

Series X   68   64   75   50   64   80   75   40   55   64
Series Y   62   58   68   45   81   60   68   48   50   70

[Hint: n = 10; ∑D² = 72; ρ = 0.545]

5.4 Properties of Correlation Coefficient
➢ The value of ‘r’ does not depend on which of the two variables under study is labeled X and which is labeled Y.
➢ The correlation coefficient lies between -1 and +1, i.e., -1 ≤ r ≤ +1.
➢ The correlation coefficient is independent of change of origin and scale.
➢ r = +1 if all (Xi, Yi) pairs lie on a straight line with positive slope, and r = -1 if all (Xi, Yi) pairs lie on a straight line with negative slope.

5.5 REGRESSION ANALYSIS
Managers often make decisions by studying the relationship between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict and control the dependent variable on the basis of the independent variables.

For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product's price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising expenditure. Predictions of demand for various price-advertising expenditure combinations can then be used to evaluate potential changes in the company's marketing strategies.

In the words of M.M. Blair, regression analysis is a “mathematical measure of average relationship between two or more variables in terms of the original unit of the data”.

5.6 Types of Regression Lines
A line of regression is the line which gives the best estimate of one variable for any given value of the other variable. We have two types of regression lines, namely,
○ Regression line of X on Y
○ Regression line of Y on X.
First we give the regression line of X on Y. It is the line which gives the best estimate of the value of X for a specified value of Y.

It is given by

$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$

where $b_{xy}$ is the regression coefficient of X on Y, which can be calculated using any one of the following formulae, depending upon the nature of the data:

$$b_{xy} = \frac{\sum xy}{\sum y^2}, \quad \text{where } x = X - \bar{X} \text{ and } y = Y - \bar{Y}$$

or

$$b_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_y^2 - (\sum d_y)^2}, \quad \text{where } d_x = X - A,\ d_y = Y - B \text{ and } A, B \text{ are the assumed means}$$

or

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$

where ‘r’ is the correlation coefficient, and σx and σy are the standard deviations of the X and Y series.

Now we give the regression line of Y on X. It is the line which gives the best estimate of the value of Y for a specified value of X. It is given by

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

where $b_{yx}$ is the regression coefficient of Y on X, which can be calculated using any one of the following formulae, depending upon the nature of the data:

$$b_{yx} = \frac{\sum xy}{\sum x^2}, \quad \text{where } x = X - \bar{X} \text{ and } y = Y - \bar{Y}$$

or

$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2}, \quad \text{where } d_x = X - A,\ d_y = Y - B \text{ and } A, B \text{ are the assumed means}$$

or

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$$

where ‘r’ is the correlation coefficient, and σx and σy are the standard deviations of the X and Y series.
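When only summary measures are available, the third form of each coefficient is the one to use. The short Python sketch below illustrates it with hypothetical values of r, σx and σy (they are chosen for illustration only and do not come from the notes). It also shows a consequence of the two formulae: the product of the two regression coefficients equals r².

```python
import math

# Hypothetical summary statistics, chosen only for illustration.
r = 0.6        # correlation coefficient
sigma_x = 4.0  # standard deviation of the X series
sigma_y = 5.0  # standard deviation of the Y series

b_xy = r * sigma_x / sigma_y  # regression coefficient of X on Y
b_yx = r * sigma_y / sigma_x  # regression coefficient of Y on X

print("b_xy =", round(b_xy, 4))
print("b_yx =", round(b_yx, 4))

# The standard deviations cancel in the product, leaving r squared.
print("b_xy * b_yx equals r^2:", math.isclose(b_xy * b_yx, r ** 2))
```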

5.7 CONSTRUCTION OF REGRESSION EQUATION
Example 5.7
The heights of a sample of 10 fathers and their eldest sons are given below (to the nearest cm).

Height of Father (X)   170   167   162   163   167   166   169   171   166   169
Height of Son (Y)      166   167   164   166   166   164   168   170   163   166

(i) Obtain the two regression equations.
(ii) Estimate the likely height of Father when the height of Son is 190 cm.
(iii) Estimate the likely height of Son when the height of Father is 160 cm.

Solution Height of Father (X) 170 167 162 163 167 166 169 171 166 167 164 166 166 164 168 170 Height of Son (Y) 3 0 -5 -4 0 -1 2 4 0 1 -2 0 0 -2 2 4 0 0 10 0 0 2 4 16 9 0 2 5 1 6 0 1 4 1 0 1 4 0 0 4 4 1
Page 53

X = X - y = Y - Xy x2 Y2
X Y

MATHEMATICS AND STATISTICS

BHARTHIDASAN UNIVERSITY

6 166 169 1670 163 166 1660 -1 2 0 -3 0 0 3 0 1 4

6 9 0 3 8

35 7 6

Here, n=10, ∑X=1670, ∑Y=1660
X = ∑Xn = 167010 = 167 Y = ∑Yn = 166010 = 166

From the table, ∑xy = 35, ∑x² = 76, ∑y² = 38.

$$b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{35}{38} = 0.9211, \qquad b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{35}{76} = 0.4605$$

(i) Regression line of X on Y

$$\begin{aligned}
X - \bar{X} &= b_{xy}(Y - \bar{Y}) \\
X - 167 &= 0.9211(Y - 166) \\
X - 167 &= 0.9211Y - 152.9026 \\
X &= 0.9211Y - 152.9026 + 167 \\
X &= 0.9211Y + 14.0974
\end{aligned}$$

Regression line of Y on X

$$\begin{aligned}
Y - \bar{Y} &= b_{yx}(X - \bar{X}) \\
Y - 166 &= 0.4605(X - 167) \\
Y - 166 &= 0.4605X - 76.9035 \\
Y &= 0.4605X - 76.9035 + 166 \\
Y &= 0.4605X + 89.0965
\end{aligned}$$

(ii) Given, Height of Son (Y) = 190 cm. To estimate the height of Father (X) we use the X on Y equation.

$$\begin{aligned}
X &= 0.9211Y + 14.0974 \\
X &= 0.9211(190) + 14.0974 \\
X &= 175.009 + 14.0974 \\
X &= 189.1064 \text{ cm}
\end{aligned}$$

(iii) Given, Height of Father (X) = 160 cm. To estimate the height of Son (Y) we use the Y on X equation.

$$\begin{aligned}
Y &= 0.4605X + 89.0965 \\
Y &= 0.4605(160) + 89.0965 \\
Y &= 73.68 + 89.0965 \\
Y &= 162.78 \text{ cm}
\end{aligned}$$
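The computations of Example 5.7 can be reproduced with a short program. The following Python sketch is one possible way to do it, with the data read from the solution table; the function names `father_from_son` and `son_from_father` are illustrative assumptions. It uses deviations from the actual means, so the coefficients are ∑xy/∑y² and ∑xy/∑x². Because the program does not round the coefficients to four decimals, the later decimals of the estimates can differ slightly from a hand calculation.

```python
fathers = [170, 167, 162, 163, 167, 166, 169, 171, 166, 169]  # X
sons = [166, 167, 164, 166, 166, 164, 168, 170, 163, 166]     # Y

n = len(fathers)
mean_x = sum(fathers) / n
mean_y = sum(sons) / n

# Deviations from the actual means.
dev_x = [value - mean_x for value in fathers]
dev_y = [value - mean_y for value in sons]

sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
sum_x2 = sum(a * a for a in dev_x)
sum_y2 = sum(b * b for b in dev_y)

b_xy = sum_xy / sum_y2  # regression coefficient of X on Y
b_yx = sum_xy / sum_x2  # regression coefficient of Y on X


def father_from_son(y):
    """Estimate X from Y with the regression line of X on Y."""
    return mean_x + b_xy * (y - mean_y)


def son_from_father(x):
    """Estimate Y from X with the regression line of Y on X."""
    return mean_y + b_yx * (x - mean_x)


print("b_xy =", round(b_xy, 4), " b_yx =", round(b_yx, 4))
print("Father's height when son is 190 cm:", round(father_from_son(190), 2))
print("Son's height when father is 160 cm:", round(son_from_father(160), 2))
```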

Self-Assessment Question
From the following data, obtain the two regression equations.

Sales (X)      91   97   108   121   67   124   51   73   111   57
Purchase (Y)   71   75    69    97   70    91   39   61    80   47

Also estimate the sales when the purchase is 90.
[Hint: n = 10; X̄ = 90, Ȳ = 70, ∑xy = 3900, ∑x² = 6360, ∑y² = 2868; bxy = 1.36; byx = 0.6132; Line of X on Y: X = 1.36Y - 5.2; Line of Y on X: Y = 0.6132X + 14.812; Estimated sales when the purchase is 90 = 117.2]

Example 5.8
Find the two lines of regression from the following data.

Price at Mumbai (in Rs.)    36   42   55   61   76   26
Price at Chennai (in Rs.)   15   36   24   26   15   14


Also estimate the likely price at Mumbai when the price at Chennai is Rs 60.

Solution

Price at      Price at       dx = X - A   dy = Y - B
Mumbai (X)    Chennai (Y)    (A = 55)     (B = 26)     dxdy    dx²    dy²
    36            15           -19          -11         209    361    121
    42            36           -13           10        -130    169    100
    55            24             0           -2           0      0      4
    61            26             6            0           0     36      0
    76            15            21          -11        -231    441    121
    26            14           -29          -12         348    841    144
Total 296        130           -34          -26         196   1848    490

Here, n = 6, ∑X = 296, ∑Y = 130, ∑dx = -34, ∑dy = -26, ∑dxdy = 196, ∑dx² = 1848, ∑dy² = 490.

$$\bar{X} = \frac{\sum X}{n} = \frac{296}{6} = 49.33, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{130}{6} = 21.67$$

$$b_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_y^2 - (\sum d_y)^2} = \frac{6(196) - (-34)(-26)}{6(490) - (-26)^2} = \frac{1176 - 884}{2940 - 676} = \frac{292}{2264} = 0.1290$$

$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{6(196) - (-34)(-26)}{6(1848) - (-34)^2} = \frac{292}{11088 - 1156} = \frac{292}{9932} = 0.0294$$

Regression line of X on Y

$$\begin{aligned}
X - \bar{X} &= b_{xy}(Y - \bar{Y}) \\
X - 49.33 &= 0.1290(Y - 21.67) \\
X - 49.33 &= 0.1290Y - 2.7954 \\
X &= 0.1290Y - 2.7954 + 49.33 \\
X &= 0.1290Y + 46.5346
\end{aligned}$$

Regression line of Y on X

$$\begin{aligned}
Y - \bar{Y} &= b_{yx}(X - \bar{X}) \\
Y - 21.67 &= 0.0294(X - 49.33) \\
Y - 21.67 &= 0.0294X - 1.4503 \\
Y &= 0.0294X - 1.4503 + 21.67 \\
Y &= 0.0294X + 20.2197
\end{aligned}$$

To estimate the likely price at Mumbai, we use the line of X on Y.

$$X = 0.1290Y + 46.5346 = 0.1290(60) + 46.5346 = 7.74 + 46.5346 = 54.27$$

Hence the estimated price at Mumbai is Rs 54.27.
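The assumed-mean method used in Example 5.8 lends itself to a mechanical check. The Python sketch below is an illustration only; the function name `regression_coefficients` and its parameters `a` and `b` (the assumed means A and B) are introduced here and are not part of the notes. It computes both regression coefficients from the deviations dx = X - A and dy = Y - B, and repeats the calculation with a different pair of assumed means to show that the choice of A and B does not affect the result. Running it is one way to verify every column total of the hand computation.

```python
import math


def regression_coefficients(xs, ys, a, b):
    """Return (b_xy, b_yx) computed from deviations about assumed means a and b."""
    n = len(xs)
    dx = [x - a for x in xs]
    dy = [y - b for y in ys]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(p * q for p, q in zip(dx, dy))
    s_dx2 = sum(p * p for p in dx)
    s_dy2 = sum(q * q for q in dy)
    numerator = n * s_dxdy - s_dx * s_dy
    b_xy = numerator / (n * s_dy2 - s_dy ** 2)  # X on Y: dy terms in the denominator
    b_yx = numerator / (n * s_dx2 - s_dx ** 2)  # Y on X: dx terms in the denominator
    return b_xy, b_yx


mumbai = [36, 42, 55, 61, 76, 26]   # X
chennai = [15, 36, 24, 26, 15, 14]  # Y

b_xy, b_yx = regression_coefficients(mumbai, chennai, a=55, b=26)

# A different choice of assumed means leaves the coefficients unchanged.
alt = regression_coefficients(mumbai, chennai, a=0, b=0)
print("Same coefficients:", math.isclose(alt[0], b_xy) and math.isclose(alt[1], b_yx))

# Estimate the Mumbai price for a Chennai price of Rs 60 from the line of X on Y.
mean_x = sum(mumbai) / len(mumbai)
mean_y = sum(chennai) / len(chennai)
estimate = mean_x + b_xy * (60 - mean_y)

print("b_xy =", round(b_xy, 4), " b_yx =", round(b_yx, 4))
print("Estimated price at Mumbai:", round(estimate, 2))
```

Since the program keeps full precision for the means and coefficients, its estimate can differ in the second decimal place from a hand calculation that rounds intermediate values.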

Self-assessment Question
From the following data on the ages of husbands and wives, obtain the two regression equations.

Age of Husband (X)   23   22   28   26   35   20   22   40   20   18
Age of Wife (Y)      18   15   20   17   22   14   16   21   15   14

Hence estimate the age of husband when the age of wife is 19.
[Hint: n = 10; X̄ = 25.6, Ȳ = 17.2; bxy = 2.23; byx = 0.385; X = 2.23Y - 12.76; Y = 0.385X + 7.34; Age of Husband (X) = 29.61]

Example 5.9
Find out the likely production corresponding to a rainfall of 40 cm from the following data.

                      Rainfall (in cm)   Output (in quintals)
Average                     30                   50
Standard Deviation           5                   10

Coefficient of correlation, r = 0.8

Let X and Y denote the rainfall and the output respectively.
Given: X̄ = 30, Ȳ = 50, σx = 5, σy = 10, r = 0.8.

Regression line of Y on X

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 0.8\left(\frac{10}{5}\right) = 1.6$$

$$\begin{aligned}
Y - \bar{Y} &= b_{yx}(X - \bar{X}) \\
Y - 50 &= 1.6(X - 30) \\
Y - 50 &= 1.6X - 48 \\
Y &= 1.6X - 48 + 50 \\
Y &= 1.6X + 2
\end{aligned}$$

When the rainfall X = 40 cm,
Y = 1.6(40) + 2
Y = 66 quintals

Self-Assessment Question
Estimate the most likely yield of paddy when the annual rainfall is 22 cm, other factors being assumed to remain the same.

                      Yield per hectare (in kg)   Annual Rainfall (in cm)
Mean                          973.5                        18.3
Standard Deviation             38.4                         2.0

Coefficient of Correlation = 0.58
[Hint: Regression line of Y on X: Y = 11.136X + 769.71; for X = 22, Yield (Y) = 1014.7 kg]

5.8 Properties of regression coefficients
1. There are two regression lines, namely, X on Y and Y on X, and they always intersect at the point of means (X̄, Ȳ).
2. If one regression coefficient is greater than unity, then the other one has to be less than unity.
3. The geometric mean of the regression coefficients is the correlation coefficient, i.e., $r = \pm\sqrt{b_{xy}\,b_{yx}}$, where r takes the common sign of the two regression coefficients.
4. Although the two regression equations are usually different, they become identical if r = ±1.
5. If r = 0 then the regression lines are perpendicular to each other.

5.9 Difference between correlation and regression

1. Correlation: It is the degree of relationship between two or more variables or factors.
   Regression: It is the average relationship between two or more variables or factors.
2. Correlation: It is symmetric in x and y, i.e., rxy = ryx.
   Regression: The regression coefficients are not symmetric.
3. Correlation: The correlation coefficient does not reflect upon the nature of the variables (independent or dependent).
   Regression: Regression coefficients reflect the nature of the variables, i.e., which is dependent and which is independent.
4. Correlation: It does not imply a cause and effect relationship between the variables under study.
   Regression: It indicates the cause and effect relationship between the variables. The variable corresponding to the cause is taken as the independent variable, whereas the one corresponding to the effect is taken as the dependent variable.
5. Correlation: It is a relative measure and is independent of the units of measurement.
   Regression: Regression coefficients are absolute measures of the relationship between two or more variables.
6. Correlation: It indicates the degree of association.
   Regression: It is used to forecast the value of the dependent variable when the value of the independent variable is known.

5.10 Application of Regression
✔ The cause and effect relations are indicated from the study of regression analysis.
✔ It establishes the rate of change in one variable in terms of the changes in another variable.
✔ It is useful in economic analysis, as a regression equation can determine the increase in the cost of living index for a particular increase in the general price level.
✔ It helps in prediction and thus can estimate the values of unknown quantities.
✔ It helps in determining the coefficient of correlation.
✔ It enables us to study the nature of the relationship between the variables.
✔ It can be useful in all natural, social and physical sciences, where the data are in functional relationship.

Chapter Summary
This chapter has discussed the simple correlation coefficient, the rank correlation coefficient and simple linear regression analysis, which relates a dependent variable to a single independent variable. We began by considering the simple linear regression model, which employs two parameters: the slope and the y intercept. We next discussed how to compute the least squares point estimates of the parameters and how to construct the regression equations by using various methods. We also learned the difference between correlation and regression and the applications of regression analysis.
