# REGRESSION AND CORRELATION

This chapter considers the problems of analysing the relationships between variables. Different types of scatter diagrams are depicted. Straight line equations are described, together with the method of calculating the least squares regression line. The uses and method of calculating the coefficient of determination and the coefficient of correlation are described, and the development of confidence limits for the regression line is explained in detail. The chapter concludes with an explanation of the rank correlation coefficients as non-parametric measures of statistical association.

## 7.1 Introduction

Quite often in business, changes in one or more variables appear to be related in some way to movements in one or several other variables. For example, a sales manager may observe that sales value changes when there has been a change in advertising expenditure, or a logistics manager may notice that as cars and trucks are used more and the number of clients increases, maintenance expenses become larger.

Certain questions may occur to the manager or analyst, such as the following:

1. Are the changes of the variables in the same or in opposite directions?
2. Could changes in one variable be influencing, or be influenced by, movements in the other variable?
3. Is this an important relationship, or could apparently related movements come about purely by chance?
4. Could movements in two variables be related, not directly, but through movements in a third variable?
5. What is the importance of this knowledge for the business decision system?

On many occasions the manager or analyst is interested in predicting the value of one variable from other variables that are considered to influence it. For example, the quality control manager may want to know what the effect on the number of failures might be if the amount of expenditure on inspection were increased. The marketing manager may wish to predict market share if advertising costs were cut by 20%.

Suppose that a manager has sensed that two variables are behaving in some way 'related'. How will the manager proceed to investigate the relation? A possible methodology might be as follows:

a) Observe and note what is happening in a systematic way.
b) Draw a scatter diagram of the data being observed.
c) Measure statistically the intensity of the relation and its significance, and describe the relation.
d) Use the result to improve your decisions.

In the managerial process it is necessary to have statistical information that is as varied and complex as possible, which will be known and used to measure the relations of independence or dependence between the variables. Relationships between statistical variables or between indicators can be observed in all economic activity: in production, between the production indicators and those of efficiency and productivity, between resources and the results of their use, and between the obtained results and the investment plan.


Bivariate data implies two distinct categories of variables: independent and dependent variables. The independent variable is that variable occurring randomly or chosen freely and it is usually denoted by x. The dependent variable occurs as a result of the variation of the independent variable and it is usually denoted by y.

## 7.2 Categories of Relations between Variables

The relations that can be found between the x and y variables, modelled as y = f(x) + ε, allow us to characterize the direction of change, the intensity of change and the shape of the relation. The relations are classified as follows:

a. according to the direction of change:
- direct relations, also called positive relations, meaning that a change in the independent variable induces a change of the dependent variable in the same direction: if x increases then y also increases, and if x decreases then y decreases
- opposite relations, also called negative relations, meaning that a change in the independent variable induces a change of the dependent variable in the opposite direction: if x increases then y decreases, and if x decreases then y increases

b. according to the intensity of the relation:
- high intensity, strong, or tight relations, expressed by a high level of correlation between the variables
- medium intensity relations
- low intensity relations

c. according to the shape of the relation:
- linear relations
- non-linear relations, such as exponential growth, logarithmic decrease, etc.

d. according to the randomness involved: deterministic and probabilistic models.

A deterministic model allows us to determine the value of a dependent variable from the values of the independent variables. Such models represent relationships in the natural sciences. Example of a deterministic model: $E = mc^2$, where E is energy, m is mass and c is the speed of light.

For practical models we have to represent the randomness that is part of a real-life process. Such models are called probabilistic models. For a probabilistic model we add a random term (also called the error variable). The random term accounts for all the variables, measurable and immeasurable, that are not part of the model. The probabilistic first-order model is:

$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{7.1}$$

where:
- Y: dependent or explained variable
- X: independent or explanatory variable
- ε: random variable
- β0, β1: parameters

Example 1: For the situation in Table 7-1 we are asked to characterize the relation between expenditure on inspection and defective parts delivered to the customer, for a company with ten operating plants of similar size producing small components:
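Table 7-1 Costs and defective products recording

| Observation number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Control costs per batch | 25 | 30 | 15 | 75 | 40 | 65 | 45 | 24 | 35 | 70 |
| Defective parts per batch of units | 50 | 35 | 60 | 15 | 46 | 20 | 28 | 45 | 42 | 22 |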

We can deduce that there is likely to be an opposite relationship between the control cost and the number of defective parts delivered to the customer: the higher the cost, the fewer defective units are delivered. Based on this assumption, which is a form of hypothesis, the data can be graphed. The graph is constructed as follows: the horizontal or x axis is used for the values or classes of the independent variable, in this case expenditure; the vertical or y axis is used for the values or classes of the dependent variable, in this case defective parts delivered. This type of diagram is known as a scatter diagram.

[Figure 7.1 Scatter diagram based on the data in Example 1 (x axis: cost per controlled batch; y axis: defective parts per delivered batch)]

Figure 7.1 shows a clear drift downwards in defectives delivered as costs per batch increase. The scatter diagram shows:
- an opposite or negative relation, due to the negative slope
- a medium intensity of the relation, because the points are not tightly clustered
- a linear relation, due to the linear shape of the scatter of points (a linear equation can correctly model the relation because the points are close to a line)
- an inelastic relation, due to the fact that the slope is almost 45°

Sometimes other possibilities exist, ranging from a perfect negative or perfect positive relationship to no discernible relationship. A perfect relationship is one where a single straight line can be drawn through all the points, as for example in Figure 7.2 and Figure 7.3.

[Figure 7.2 Perfect positive relationship]

[Figure 7.3 Perfect negative relationship]

Statistics for Business Administration

## 7.3 Simple Regression Model

Regression is a statistical method providing a mathematical description of the statistical relations between variables. The purpose of the regression techniques is:
- to describe in mathematical terms a statistical relation
- to estimate the value of the dependent variable given the value of the independent variable
- to compare statistical relations between variables for two companies, two regions or two countries

Regression is concerned with obtaining a mathematical function describing the statistical relation between variables. If the relation is between one dependent and one independent variable then we have simple regression; if the statistical relation is between one dependent and two or more independent variables then we have multiple regression. According to the mathematical function modelling the relation between the variables we can identify linear and non-linear equations. This section deals only with the simple regression techniques, modelling the relation between two variables using a first degree equation.

### 7.3.1 Simple Linear Regression Model

The simple linear regression model is:

$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{7.2}$$

The main attributes of linear regression are:

a. It is a useful means of forecasting when the data has a generally linear relationship. Over operational ranges, linearity (or near linearity) is often assumed for such items as costs, contributions and sales.

b. A measure of the accuracy of fit ($R^2$, the ratio of correlation, or r, the coefficient of correlation) can easily be calculated for any linear regression line.

c. To have confidence in the calculated regression relationship it is preferable to have a large number of observations.

d. With further analysis, confidence limits can be calculated for forecasts produced by the regression formula.

e. Regression is not an adaptive forecasting system, i.e. it is not suitable for incorporation in, say, a stock control system where the requirement would be for a forecasting system that automatically produces forecasts which adapt to current market conditions.

f. Any form of extrapolation, including that based on regression analysis, must be done with great caution, taking into account other forecasting techniques as well. Once outside the observed values, relationships and conditions may change drastically.

g. In many circumstances it is not sufficiently accurate to assume that y depends on only one independent variable, as in simple linear regression. Frequently a particular value depends on two or more factors, in which case multiple regression analysis is employed. For example, an analysis of a firm might produce the following multiple regression equation:

Overheads (EUROS) = 10800 + 6.9x + 7.2y + 3.7z

where:
- x: labour hours worked
- y: machine hours
- z: production volume (tonnage)

### 7.3.2 Least Squares Method

To define the relationship between Y and X we need to know the values of the coefficients of the linear model, β0 and β1 (the population parameters). We have to estimate the parameters by using a sample of observations of size n. The estimated linear regression is:

$$y_i = b_0 + b_1 x_i, \quad i = 1, \ldots, n \tag{7.3}$$

Usually the estimators for the parameters of the regression line are obtained by using the least squares method.

To find the line of best fit mathematically it is necessary to calculate the line that minimizes the total of the squared deviations of the actual observations from the calculated line. This is known as the method of least squares, or the least squares method of linear regression:

$$s = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \to \min$$

$$\begin{cases} \dfrac{\partial s}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \\[2ex] \dfrac{\partial s}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \end{cases}$$

so,

$$\begin{cases} n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\[2ex] b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \end{cases} \tag{7.4}$$

By solving this system of equations (the Normal equations) we obtain the values for b0 and b1, and we calculate the value of the regression equation for each value of the x variable. These values of the regression equation are also called the theoretical values of the y variable depending on x, and the operation of replacing the real terms with the values of the regression equation (theoretical values) is called adjustment computation.

The parameter b0 represents the fixed element and b1 is the slope of the line, i.e. the change in the mean value of y per unit change in x. The two parameters b0 and b1 have the character of means, and they have to be representative of most of the values from which they were calculated.

The b0 parameter, called the intercept, has the character of a mean to the extent that its value shows the level that the y characteristic would reach if all the factors exerted a constant action on its formation. In this case, the individual values of the resulting variable would all be equal, and so equal to their mean.

The b1 parameter, called the regression coefficient, expresses geometrically the slope of the straight line. The regression coefficient measures the average variation of the y variable when the x variable increases by one unit. Moreover, the regression coefficient shows the direction of the relation:

- When b1 > 0, the relationship is positive.
- When b1 < 0, the relationship is negative.
- When b1 = 0, the two variables are unrelated and $\bar{y}_x = b_0$, so the mean value of the regression equation equals the mean value of the dependent variable ($\bar{y}_x = \bar{y}$).

The use of these equations will be demonstrated using the Example 1 data contained in Table 7-1. The equations become:

10 b0 + 424 b1 = 363
424 b0 + 21,926 b1 = 12,815

Solving gives b0 = 63.97 and b1 = -0.65 to 2 decimal places. Therefore, the regression line for Example 1 is:

y = 63.97 – 0.65x

Note: the Normal equations automatically produce the sign (+ or -) of the regression coefficient b1, in this case minus.

The calculated values can be used to draw the mathematically correct line of best fit on a graph. This is usually done by plotting the line at three values of x: the lowest, the highest and the mean. Based on Example 1 the three values of x are 15, 42.4 and 75. Each of these values is substituted into the calculated regression line and the resulting values are plotted on the graph.

Note: The values of b0 and b1 have been calculated in the example above by substituting in the Normal equations. An alternative is to transpose the Normal equations so as to be able to find b0 and b1 directly. The formulae are as follows:

$$b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n} = \bar{y} - b_1 \bar{x} \tag{7.5}$$

$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \tag{7.6}$$
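As an illustration, the two routes to b0 and b1 can be sketched in Python using the Table 7-1 data: solving the Normal equations (7.4) as a 2×2 system, and applying the transposed formulae (7.5) and (7.6). The variable names are illustrative choices, and Cramer's rule is just one convenient way to solve the system. With unrounded arithmetic the intercept comes out near 63.96, marginally below the 63.97 quoted in the text.

```python
# Table 7-1 data: control cost per batch (x), defective parts per batch (y)
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]

n = len(x)
sum_x = sum(x)                                   # 424
sum_y = sum(y)                                   # 363
sum_xx = sum(xi * xi for xi in x)                # 21926
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 12815

# Route 1: solve the Normal equations (7.4) with Cramer's rule
#   n*b0     + sum_x*b1  = sum_y
#   sum_x*b0 + sum_xx*b1 = sum_xy
det = n * sum_xx - sum_x ** 2
b0_normal = (sum_y * sum_xx - sum_x * sum_xy) / det
b1_normal = (n * sum_xy - sum_x * sum_y) / det

# Route 2: transposed formulae, (7.6) first and then (7.5)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

# The fitted line passes through the mean point (x_mean, y_mean)
x_mean, y_mean = sum_x / n, sum_y / n
fitted_at_mean = b0 + b1 * x_mean

print(f"Normal equations: b0 = {b0_normal:.4f}, b1 = {b1_normal:.6f}")
print(f"Direct formulae:  b0 = {b0:.4f}, b1 = {b1:.6f}")
print(f"Fitted value at x_mean = {fitted_at_mean:.2f}, y_mean = {y_mean:.2f}")
```

Both routes give the same coefficients, since (7.5) and (7.6) are only a rearrangement of (7.4).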

It is often more convenient to use this alternative form, especially when using a calculator. Values for b0 and b1 are re-calculated using the transposed formulae and the Table 7-1 data:

$$b_1 = \frac{10 \times 12{,}815 - 424 \times 363}{10 \times 21{,}926 - (424)^2} = -0.652467 = -0.65$$

$$b_0 = \frac{363}{10} - (-0.652467) \times \frac{424}{10} = 63.97$$

[Figure 7.4 Calculated line of best fit (x axis: inspection costs per batch; y axis: defective parts per delivered batch)]

For any set of bivariate data a least squares regression line always passes through the mean point $(\bar{x}, \bar{y})$ of the data.

### 7.3.3 Using the Results of the Simple Regression Analysis

When the values have been calculated for b0 and b1, predictions or forecasts can be made for values of x that have not yet occurred. The predictions can be read from the graph on which the line of best fit has been plotted, or the values can be inserted into the straight-line formula. Reverting to Example 1, suppose the manager wishes to know the likely number of defects if 50 parts per 1000 was spent on inspection.

From Figure 7.4 it will be seen that the number of defects would be 31 per 1000. The formula can also be used. The regression line is y = 63.97 – 0.65x, so when x is 50:

y = 63.97 – 0.65 (50) = 31.47

Thus the manager would conclude that, on average, 31.47 defects per 1000 would be found if 50 parts per 1000 was spent on inspection.

The fact that an x value can be used to make a prediction from the regression line does not necessarily mean that we have obtained a practical forecast value. Predictions should be given only if the result has an economic meaning. The predicted value is just a single point that needs to be qualified by the use of confidence intervals.

### 7.3.4 Quality of the Regression Line. Regression Line Standard Error

The accuracy of the regression line can be measured with the standard error of the regression, which is also called the residual standard deviation. It will be denoted by Se:

$$S_e = \sqrt{\frac{\sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i}{n - 2}} \tag{7.7}$$

The above formula provides an estimate of the standard error, because it uses the regression line values b0 and b1, which are themselves estimates. For the example concerning the defective parts we have computed the standard error as follows:

$$S_e = \sqrt{\frac{15{,}123 - 63.97 \cdot 363 - (-0.65) \cdot 12{,}815}{10 - 2}} = 5.76 \text{ defective parts}$$

The result 5.76 is obtained when the unrounded coefficients (b0 = 63.9646, b1 = -0.652467) are used in the calculation.

This value is used to set the confidence limits for an individual value prediction or for the whole regression line. It is also used to assess the estimated regression parameters b0 and b1, by constructing their confidence intervals. The inference concerning these estimates can be made using the t significance test and using confidence intervals. In both cases we need to calculate the standard error.
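The standard error calculation can be checked with a short Python sketch on the Table 7-1 data. It computes the sum of squared residuals directly and through the shortcut in formula (7.7); the variable names are illustrative assumptions. Because the numerator of (7.7) is a small difference between large numbers, the coefficients are kept unrounded here.

```python
import math

# Table 7-1 data: control cost per batch (x), defective parts per batch (y)
x = [25, 30, 15, 75, 40, 65, 45, 24, 35, 70]
y = [50, 35, 60, 15, 46, 20, 28, 45, 42, 22]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_yy = sum(yi * yi for yi in y)                # 15123
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Least squares coefficients, formulae (7.6) and (7.5), kept unrounded
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

# Sum of squared residuals, computed directly from the fitted values
sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# The same quantity through the shortcut in the numerator of (7.7)
sse_shortcut = sum_yy - b0 * sum_y - b1 * sum_xy

# Standard error of the regression with n - 2 degrees of freedom
se = math.sqrt(sse_direct / (n - 2))
print(f"Se = {se:.2f}")
```

Substituting the rounded values 63.97 and -0.65 into (7.7) gives a noticeably smaller figure, which is why the unrounded coefficients matter in this formula.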

The line of best fit y = b0 + b1x is an average line which passes through $\bar{x}$ and $\bar{y}$, and any estimate based on it is a mean value, that is, a point estimate. The confidence limits for the whole of the regression line are calculated by using a quantity known as the standard error of the average forecast, which is given by:

$$S_{ef} = S_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}} \tag{7.8}$$

### 7.3.5 Constructing the Confidence Interval

The actual confidence interval is constructed in exactly the same way as that for a mean or for a proportion. The interval then takes the form:

$$y \pm S_{ef} \times t \tag{7.9}$$

In this case, since the number of observations is 10, the t distribution is used with 10 - 2 = 8 degrees of freedom. The interval is calculated by estimating the fitted value of y for each value of x in the original data using the equation y = b0 + b1x.

Given that (based on Example 1):
- b0 = 63.97
- b1 = -0.65
- Se = 5.76
- $S_{ef} = 5.76 \times \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \frac{(\sum x)^2}{n}}}$, with the square-root factor evaluated for each value of x
- t = 2.306 for 8 degrees of freedom and a 95% confidence interval

the confidence interval can now be calculated as follows.

When x = 15, y = 63.97 – 0.65 (15) = 54.2, and the limits round this estimate are 54.2 ± 7.15. This gives an upper limit of 61.35 and a lower limit of 47.05.

When x = 24, y = 48.37 ± 5.72, giving a 54.09 upper limit and a 42.65 lower limit.
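The interval calculation above can be sketched in Python as follows. The coefficients, Se and the t value are the rounded figures quoted in the text for Example 1; the function name `mean_forecast_interval` is a hypothetical helper introduced only for this illustration.

```python
import math

# Rounded values quoted in the text for Example 1
b0, b1 = 63.97, -0.65
se = 5.76
t_value = 2.306            # t for 8 degrees of freedom, 95% level (from the text)
n = 10
sum_x, sum_xx = 424, 21926

x_mean = sum_x / n
sxx = sum_xx - sum_x ** 2 / n      # denominator of formula (7.8)

def mean_forecast_interval(x0):
    """Return (fitted value, lower limit, upper limit) using (7.8) and (7.9)."""
    y_hat = b0 + b1 * x0
    s_ef = se * math.sqrt(1 / n + (x0 - x_mean) ** 2 / sxx)
    half_width = t_value * s_ef
    return y_hat, y_hat - half_width, y_hat + half_width

for x0 in (15, 24):
    y_hat, lower, upper = mean_forecast_interval(x0)
    print(f"x = {x0}: y = {y_hat:.2f}, limits {lower:.2f} to {upper:.2f}")
```

The half-width grows as x moves away from the mean of 42.4, which is why the interval at x = 15 is wider than the one at x = 24.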

When making an individual value prediction for y it is necessary to amend the previous formula of the standard error, obtaining the standard error of the individual forecast:

$$S_{ef} = S_e \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}} \tag{7.10}$$

Using the data in our example and an x value of, for instance, 45 we obtain:

$$S_{ef} = 5.76 \sqrt{1 + \frac{1}{10} + \frac{(45 - 42.4)^2}{21{,}926 - \frac{424^2}{10}}} = 6.04$$

When x = 45 and y = 34.72, the individual confidence interval is 34.72 ± 2.306 × 6.04, with a lower limit of 20.79 and an upper limit of 48.65. These limits are wider than those computed previously for the regression line, because when an individual prediction of y is made the confidence intervals are much wider.

### 7.3.6 Standard Errors for the Parameters b0 and b1

If b0 and b1 are computed from sample data they can be considered as estimates: sample statistics for the population intercept, denoted by β0, and the population slope, denoted by β1. The mean value of the b0 values coming from repeated sampling is β0, the population intercept, and the mean value of the b1 values is β1, the population slope.

The standard deviation of the intercept is:

$$S_a = S_e \cdot \sqrt{\frac{\sum x_i^2}{n \sum x_i^2 - \left(\sum x_i\right)^2}} \tag{7.11}$$

where Se is the standard error of the regression.

The confidence intervals for β0 and β1 are obtained as follows:

- for the intercept: $b_0 \pm t \times S_a$
- for the slope: $b_1 \pm t \times S_b$, where

$$S_b = \frac{S_e}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}}$$

and t is the value of the t statistic corresponding to n - 2 degrees of freedom at the chosen probability (the confidence level).

In addition we construct significance tests for β0 and β1.

For the intercept:

H0: β0 = chosen value
H1: β0 ≠ chosen value

The test statistic is the t test:

$$t = \frac{b_0 - \beta_0}{S_a} \tag{7.12}$$

For the slope:

H0: β1 = 0
H1: β1 ≠ 0

The test statistic is the t test:

$$t = \frac{b_1 - \beta_1}{S_b} \tag{7.13}$$

For Example 1:

$$t = \frac{-0.65 - 0}{0.092} = -7.07$$

Since |t| = 7.07 > 2.306, H0 can be rejected. On the basis of this evidence the regression equation y = 63.97 – 0.65x can be used as a basis of prediction for Example 1.

## 7.4 Non-linear Regression Models

There are many occasions when the relationship between variables cannot be adequately described by linear functions, whether they use a single independent variable or several. In such circumstances some form of non-linear or curvilinear model is likely to be more suitable, and the following paragraphs describe some commonly encountered non-linear models.

### The exponential function

The exponential function takes the form:

$$y = ab^x$$

where y is the dependent variable, a and b are constants and x denotes the independent variable.

Linear form of the exponential function: the exponential function can be reduced to linear form by taking the logarithm of the function, thus:

log y = log a + x log b

or

log y = A + Bx

where A = log a and B = log b.

The similarity of this expression to the linear regression line previously discussed will be apparent. An interesting feature of the log form of the exponential function is that it is equivalent to fitting a straight line to a graph drawn on semi-logarithmic graph paper (i.e. a logarithmic scale on the vertical axis and an ordinary arithmetic scale on the horizontal axis).

### Logarithmic functions

An alternative non-linear function is known as a logarithmic function, which has the form:

$$y = ax^b$$

where y denotes the variable to be predicted, a and b are constants and x denotes the time periods. As with the exponential function, this function can be expressed in a linear form using logarithms, thus:

log y = log a + b log x

In this function y is said to be a logarithmic function of x. Using this form is equivalent to fitting a straight line to a graph drawn on log-log paper (i.e. both horizontal and vertical scales being logarithmic).
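The two logarithmic transformations can be illustrated with a minimal Python sketch. The data below are hypothetical, generated exactly from known curves (they do not come from the text), so that the fitted constants can be compared with the ones used to build the data. The helper `fit_line` is an assumption of this sketch and simply applies the least squares formulae (7.5) and (7.6).

```python
import math

def fit_line(u, v):
    """Least squares intercept and slope of v on u, formulae (7.5) and (7.6)."""
    n = len(u)
    sum_u, sum_v = sum(u), sum(v)
    sum_uv = sum(ui * vi for ui, vi in zip(u, v))
    sum_uu = sum(ui * ui for ui in u)
    slope = (n * sum_uv - sum_u * sum_v) / (n * sum_uu - sum_u ** 2)
    intercept = sum_v / n - slope * sum_u / n
    return intercept, slope

x = [1, 2, 3, 4, 5, 6]

# Hypothetical exponential data: y = 3 * 1.5**x
y_exp = [3 * 1.5 ** xi for xi in x]
# Regress log y on x: log y = A + B x, with A = log a and B = log b
A, B = fit_line(x, [math.log10(v) for v in y_exp])
a_exp, b_exp = 10 ** A, 10 ** B

# Hypothetical logarithmic (power) data: y = 2 * x**0.5
y_pow = [2 * xi ** 0.5 for xi in x]
# Regress log y on log x: log y = log a + b log x
log_a, b_pow = fit_line([math.log10(xi) for xi in x],
                        [math.log10(v) for v in y_pow])
a_pow = 10 ** log_a

print(f"exponential: a = {a_exp:.4f}, b = {b_exp:.4f}")
print(f"logarithmic: a = {a_pow:.4f}, b = {b_pow:.4f}")
```

With real, noisy data the recovered constants minimize squared errors in log y rather than in y itself, which is a consequence of the transformation and worth keeping in mind when interpreting the fit.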

### The hyperbolic curve

This is another type of non-linear curve and takes the form:

$$y = a + \frac{b}{x}$$

The values of a and b are calculated by reference to amended formulas:

$$b = \frac{n \sum \left(\frac{1}{x}\right) y - \sum \left(\frac{1}{x}\right) \sum y}{n \sum \left(\frac{1}{x}\right)^2 - \left(\sum \frac{1}{x}\right)^2}$$

$$a = \frac{\sum y}{n} - \frac{b \sum \left(\frac{1}{x}\right)}{n}$$

Data have been kept for 10 orders showing the variation in unit cost against order volume for 10 clients, as follows in Table 7-2:

Table 7-2 Relation between order size and unit costs

| Client number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Order volume x | 10 | 11 | 12 | 13 | 14 | 15 | 17 | 18 | 19 | 20 |
| Unit cost y | 150 | 127 | 123 | 117 | 110 | 107 | 104 | 101 | 97 | 95 |

These data have been graphed in Figure 7.5, and the graph suggests that the hyperbolic curve might be appropriate for predicting the unit cost of an order of 22 units. What is the predicted cost?

[Figure 7.5 Unit cost and order size relation (x axis: order volume; y axis: unit cost)]

Solution: The calculations for the least squares line of best fit are shown in Table 7-3.

Table 7-3

| Observation | 1/x | y | (1/x)² | (1/x)y |
|---|---|---|---|---|
| 1 | 0.100 | 150 | 0.0100 | 15.000 |
| 2 | 0.090 | 127 | 0.0083 | 11.545 |
| 3 | 0.083 | 123 | 0.0069 | 10.250 |
| 4 | 0.077 | 117 | 0.0059 | 9.000 |
| 5 | 0.071 | 110 | 0.0051 | 7.857 |
| 6 | 0.067 | 107 | 0.0044 | 7.133 |
| 7 | 0.059 | 104 | 0.0035 | 6.118 |
| 8 | 0.056 | 101 | 0.0031 | 5.611 |
| 9 | 0.053 | 97 | 0.0027 | 5.105 |
| 10 | 0.050 | 95 | 0.0025 | 4.750 |
| Total | 0.706 | 1131 | 0.0524 | 82.369 |

$$b = \frac{10 \times 82.369 - 0.706 \times 1131}{10 \times 0.0524 - (0.706)^2} = 985.92$$

$$a = \frac{1131}{10} - \frac{985.92 \times 0.706}{10} = 43.49$$
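The same calculation can be sketched in Python on the Table 7-2 data, keeping full floating-point precision instead of the rounded column totals. The variable names are illustrative assumptions. One caution: the denominator of b is a difference of two nearly equal numbers, so the result is sensitive to rounding. With unrounded sums the formulas give roughly b ≈ 955 and a ≈ 45.6, rather than the 985.92 and 43.49 obtained from the three- and four-decimal totals in Table 7-3.

```python
# Table 7-2 data: order volume (x) and unit cost (y)
order_volume = [10, 11, 12, 13, 14, 15, 17, 18, 19, 20]
unit_cost = [150, 127, 123, 117, 110, 107, 104, 101, 97, 95]
n = len(order_volume)

# Transformed variable 1/x and the sums used in the amended formulas
inv_x = [1 / v for v in order_volume]
sum_inv = sum(inv_x)
sum_inv_sq = sum(v * v for v in inv_x)
sum_y = sum(unit_cost)
sum_inv_y = sum(v * c for v, c in zip(inv_x, unit_cost))

# Least squares fit of y = a + b * (1/x)
b = (n * sum_inv_y - sum_inv * sum_y) / (n * sum_inv_sq - sum_inv ** 2)
a = sum_y / n - b * sum_inv / n

print(f"sum(1/x) = {sum_inv:.4f}, sum((1/x)^2) = {sum_inv_sq:.5f}")
print(f"a = {a:.2f}, b = {b:.2f}")
```

Both sets of coefficients describe very similar curves over the observed range of 10 to 20 units; the comparison mainly shows how much intermediate rounding can move the individual constants.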

Thus the hyperbolic function is:

$$y = 43.49 + \frac{985.92}{x}$$

The calculated least squares line can now be fitted on the graph using the values calculated from the hyperbolic function, shown in Table 7-4.

Table 7-4

| x | a + b/x | Value of y |
|---|---|---|
| 10 | 43.49 + 985.92/10 | 142.08 |
| 11 | 43.49 + 985.92/11 | 133.12 |
| 12 | 43.49 + 985.92/12 | 125.65 |
| 13 | 43.49 + 985.92/13 | 119.33 |
| 14 | 43.49 + 985.92/14 | 113.91 |
| 15 | 43.49 + 985.92/15 | 109.22 |
| 17 | 43.49 + 985.92/17 | 101.49 |
| 18 | 43.49 + 985.92/18 | 98.26 |
| 19 | 43.49 + 985.92/19 | 95.38 |
| 20 | 43.49 + 985.92/20 | 92.79 |

These values are plotted on Figure 7.6.

[Figure 7.6 Scatter diagram with fitted hyperbolic curve (x axis: order volume; y axis: unit cost)]

The same information can be reproduced in a linear form where the x axis is defined as 1/x. The question posed in the problem, namely what is the unit cost for an order size of 22 units, can be answered from one of the graphs or by direct calculation. On the assumption that the known relationship between x and y continues beyond the observed range, the unit cost for an order size of 22 is:

$$y = 43.49 + \frac{985.92}{22} = 88.30$$

### Learning curves

Forecasting is concerned with what we anticipate will happen in the future. Unthinking extrapolation of past conditions is unlikely to produce good forecasts. If we are aware of an expected change in conditions in the future, this must be taken into account when preparing the finalised forecast. A particular example of this relates to what are known as learning curves, which are a practical application of a non-linear function.

The learning curve depicts the way people learn by doing a task and are therefore able to complete the task more quickly the next time they attempt it. During the early stages of producing a new part or carrying out a new process, experience and skill are gained, productivity increases and there is a reduction in the time taken per unit. Learning is rapid in the early stages and the rate gradually declines until a sufficient number of units or tasks have been completed, when the time taken will become constant.

The main practical application is concerned with direct labour times and costs. Cost predictions, especially those relating to direct labour costs, should allow for the effects of the learning process. Studies have shown that there is a tendency for the time per unit to reduce at some constant rate as production mounts. For example, an 80% learning curve means that as cumulative production quantities double, the average time per unit falls by 20%.

This is shown in Table 7-5:

Table 7-5 Illustration of an 80% Learning Curve

| Cumulative number of clients | Cumulative time taken (mins) | Average time per client |
|---|---|---|
| 20 | 400 | 20 |
| 40 | 640 | 16 (20 × 80%) |
| 80 | 1,024 | 12.8 (20 × 80% × 80%) |
| 160 | 1,638.4 | 10.24 (20 × 80% × 80% × 80%) |

The learning curve is a non-linear function with the general form:

$$y = ax^b$$

where:
- y: average labour hours for a client
- a: number of labour hours for the first client
- x: cumulative number of clients
- b: the learning coefficient

The learning coefficient is calculated as follows:

$$b = \frac{\log(1 - \text{proportionate decrease})}{\log 2}$$

thus for a 20% decrease (i.e. an 80% learning curve):

$$b = \frac{\log(1 - 0.2)}{\log 2} = \frac{\log 0.8}{\log 2} = \frac{-0.09691}{0.30103} = -0.322$$

Note: It will be remembered from mathematics that the log of 0.8 is conventionally written as $\bar{1}.90309$, which is actually -1 + 0.90309, i.e. -0.09691. Divided by 0.30103, this gives -0.322.

Having established the values for the function, it can be used to find the expected labour time per unit. For example, with an 80% learning curve and a time of 10 minutes for the first client, what is the expected time per client when the cumulative number of clients is 20? Using the function we obtain:

$$y = ax^b = 10 \times 20^{-0.322} = 3.812 \text{ mins}$$
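The learning coefficient and the worked prediction can be reproduced with a short Python sketch. The function names `learning_coefficient` and `average_time` are hypothetical helpers introduced for illustration; the figures (a 20% decrease, 10 minutes for the first client, 20 clients) are those of the example above.

```python
import math

def learning_coefficient(proportionate_decrease):
    """b = log(1 - proportionate decrease) / log 2."""
    return math.log10(1 - proportionate_decrease) / math.log10(2)

def average_time(first_unit_time, cumulative_units, b):
    """Average time per unit from the learning curve y = a * x**b."""
    return first_unit_time * cumulative_units ** b

b = learning_coefficient(0.20)          # 80% learning curve
y_20 = average_time(10, 20, b)          # 10 minutes for the first client
y_40 = average_time(10, 40, b)

print(f"b = {b:.3f}")
print(f"average time at 20 clients = {y_20:.3f} mins")
print(f"ratio when output doubles  = {y_40 / y_20:.2f}")
```

The last line checks the defining property of the curve: each doubling of cumulative output multiplies the average time by 0.8.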

Note: Whilst it is clear that learning does take place and that average times are likely to reduce, in practice it is highly unlikely that there will be a regular, consistent rate of decrease as exemplified above. Accordingly, any cost predictions based on conventional learning curves should be used cautiously when forecasting, as with any other kind of forecasting method.

### Linear transformation of learning curve

An alternative method of calculating the learning curve coefficient uses the linear transformation formed by taking the logarithm of the function, thus:

$$\log y = \log (ax^b)$$

$$\log y = \log a + b \log x$$

This will be recognised as a transformation to the general linear form y = a + bx. If X stands for log x and Y stands for log y then the standard formulae for a and b become:

$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}$$

$$\log a = \frac{\sum Y}{n} - b \frac{\sum X}{n}$$

The above formulae are illustrated using the data in the previous paragraph, carried from Table 7-5 into Table 7-6.

Table 7-6

| Cumulative number of clients x | Cumulative time | Average time per client y |
|---|---|---|
| 20 | 400 | 20 |
| 40 | 640 | 16 |
| 80 | 1,024 | 12.8 |
| 160 | 1,638.4 | 10.24 |

The logarithms of the cumulative number of clients, x, and of the average serving time, y, are used to find the values for the formulae above and are shown in Table 7-7.

Table 7-7

| | X (log x) | Y (log y) | XY (log x · log y) | X² (log x)² |
|---|---|---|---|---|
| | 1.30103 | 1.30103 | 1.69268 | 1.69268 |
| | 1.60206 | 1.20412 | 1.92907 | 2.56659 |
| | 1.90309 | 1.10721 | 2.10712 | 3.62175 |
| | 2.20412 | 1.01030 | 2.22682 | 4.85815 |
| Total | ΣX = 7.01030 | ΣY = 4.62266 | ΣXY = 7.95569 | ΣX² = 12.73917 |

These values are inserted into the formula:

$$b = \frac{4 \times 7.95569 - 7.01030 \times 4.62266}{4 \times 12.73917 - 7.01030^2} = -0.3223$$

This will be seen to be the same value as calculated above. The learning curve thus has the form:

$$y = ax^{-0.3223}$$

For completeness the value of a is calculated. Using the formula given above:

$$\log a = \frac{4.62266}{4} - (-0.3223) \times \frac{7.01030}{4}$$

$$\log a = 1.72052$$

and finding the antilog gives a = 52.546. This represents the number of labour hours for the first client. This was not one of the observed values, which started at 20 clients, so the value represents the theoretical time for the first client, given the relationships found for the observed range of 20 to 160 clients. The full learning curve formula is thus:

$$y = 52.546x^{-0.3223}$$

The value of 52.546 hours for the first unit can be checked by inserting one of the observed values, say 20 units, and verifying that the calculated time agrees with the observed time of 20 minutes. To find the value of y = 52.546 × 20^(-0.3223) we can compute:

| Number | log |
|---|---|
| 52.546 | 1.72052 |
| 20^(-0.3223) | -0.3223 × 1.30103 = -0.41932, written conventionally as $\bar{1}.58068$ |
| Sum | 1.72052 + $\bar{1}.58068$ = 1.30120 |

The anti-log of 1.30120 is almost 20.
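As a cross-check, the log-log regression can be run in Python on the Table 7-6 data without rounding the logarithms. Variable names are illustrative assumptions. At full precision the slope comes out at about -0.3219 (the exact value of log 0.8 / log 2) and a at about 52.5, so the small differences from the -0.3223 and 52.546 above appear to come from rounding in the hand calculation.

```python
import math

# Table 7-6 data: cumulative number of clients (x), average time per client (y)
clients = [20, 40, 80, 160]
avg_time = [20, 16, 12.8, 10.24]
n = len(clients)

# Logarithmic transformation: X = log x, Y = log y
X = [math.log10(v) for v in clients]
Y = [math.log10(v) for v in avg_time]

sum_X, sum_Y = sum(X), sum(Y)
sum_XY = sum(xi * yi for xi, yi in zip(X, Y))
sum_XX = sum(xi * xi for xi in X)

# Standard least squares formulae applied to the transformed data
b = (n * sum_XY - sum_X * sum_Y) / (n * sum_XX - sum_X ** 2)
log_a = sum_Y / n - b * sum_X / n
a = 10 ** log_a

# Check against an observed point: 20 clients should give about 20 minutes
check_20 = a * 20 ** b

print(f"b = {b:.4f}, log a = {log_a:.5f}, a = {a:.2f}")
print(f"fitted average time at 20 clients = {check_20:.2f}")
```

Because the table follows an exact 80% curve, the fitted line reproduces every observed point, which makes this data set a convenient check on the formulae.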

however. y = b0 + b 1 x. To investigate the possibility that movements in y. Because of the lengthy nature of the calculations it would be unlikely that a detailed question on multiple regression would appear in the examinations for which this manual is intended. the dependent variable. logarithmic and hyperbolic functions are explained and exemplified and the chapter concludes with an analysis of learning curves. Various non-linear models such as the exponential. These models are dealt with in the first part of the chapter. A model which incorporates several independent variables is known as a multiple regression model.5 Multiple Regression Models The section shows the development of a multiple regression model and how the closeness of fit is measured by the coefficient of multiple determinations. The development of this model is shown below. There will be occasions when the simple model. For example. Familiarity with the processes involved and the structure of the model is. b. In such circumstances there are two possible courses of action: a.Statistics for Business Administration 7. Alternatively a non-linear model may be considered more appropriate and several of the more important non-linear functions are dealt with later in the chapter. . changes in demand for a product may depend on: the price of a product the price of substitutes the level of incomes consumer tastes and so on If linearity can be assumed then a linear multiple regression models can be used. necessary. will not be considered satisfactory. This means that the simple linear model will not be a good enough predictor. depends on several independent variables and not just one as in the basic model.

The basic two variable model (one dependent and one independent variable) is:

$$y = b_0 + b_1 x$$

which can be solved using the Normal equations thus:

$$\sum y = b_0 n + b_1 \sum x$$
$$\sum xy = b_0 \sum x + b_1 \sum x^2$$

From this can be developed models with more than 2 variables and this is illustrated below using a 3 variable model (one dependent and two independent variables, x1 and x2):

$$y = b_0 + b_1 x_1 + b_2 x_2 \tag{7.14}$$

In the case of mass phenomena the resulting variable is considered a function of many variables:

$$y = f(x_1, x_2, \ldots, x_n) + e$$

where the variables x1, x2, …, xn are the factorial variables, which determine in a certain measure the variation of the resulting variable (y). If the relation between every factor and the resulting variable is linear, then the estimation equation will be:

$$Y(x_1, x_2, \ldots, x_n) = b_0 + b_1 x_1 + \ldots + b_n x_n + e \tag{7.15}$$

where:
b0: the parameter that expresses the unregistered factors, considered as having constant action, that is, all the other factors except for those considered factorial variables;
b1, …, bn: coefficients of regression that show by how much the resulting variable is modified, on average, if the corresponding factorial variable is modified by one unit;
x1, x2, …, xn: independent variables included in the relation of interdependence.

The determination of the parameters is made by the application of the method of least squares, on the condition that the sum of the squared errors of the empiric terms from the line of regression is a minimum:

$$\sum \left(y - Y_{x_1, x_2, \ldots, x_n}\right)^2 = \min$$

In order to find the value of these parameters it is necessary to establish the system of the Normal equations from:

$$\sum \left[y - (b_0 + b_1 x_1 + \ldots + b_n x_n)\right]^2 = \min$$

At the end of solving the system we have the parameters of estimation of the regression function. As in the case of multiple correlation, in order to measure the degree of intensity of the correlation we use the ratio of correlation. The multiple linear model can be solved by the Normal equations for a three variable model, as follows:

$$\sum y = b_0 n + b_1 \sum x_1 + b_2 \sum x_2$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \tag{7.16}$$
$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2$$

The line of best fit gives way to a plane of best fit. The parameter b1 is the slope of the plane along the x1 axis, b2 is the slope along the x2 axis, and the plane cuts the y axis at b0 (written 'a' in the example that follows). The aim of adding to the simple two variable model is to improve the fit of the data. The above models are illustrated by the following examples.

Example of multiple regression

The X consultancy company is investigating the relationship between performance in Statistics Methods and hours studied per week and the general level of intelligence of candidates. The company has data on ten students as follows:

| Student | Hours studied | I.Q. | Examination level (%) |
|---|---|---|---|
| 1 | 6 | 100 | 45 |
| 2 | 6 | 117 | 55 |
| 3 | 12 | 119 | 80 |
| 4 | 14 | 95 | 73 |
| 5 | 11 | 110 | 71 |
| 6 | 9 | 99 | 56 |
| 7 | 19 | 98 | 95 |
| 8 | 16 | 101 | 86 |
| 9 | 3 | 100 | 34 |
| 10 | 9 | 115 | 66 |

It is required: to calculate the simple separate regressions, the multiple regression and the coefficients of determination.

Solution Part A - Calculation of separate regressions

Table 7-8

| No. | y | x1 | x2 | y² | x1² | x2² | x1y | x2y | x1x2 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 56 | 9 | 99 | 3,136 | 81 | 9,801 | 504 | 5,544 | 891 |
| 2 | 45 | 6 | 100 | 2,025 | 36 | 10,000 | 270 | 4,500 | 600 |
| 3 | 80 | 12 | 119 | 6,400 | 144 | 14,161 | 960 | 9,520 | 1,428 |
| 4 | 73 | 14 | 95 | 5,329 | 196 | 9,025 | 1,022 | 6,935 | 1,330 |
| 5 | 71 | 11 | 110 | 5,041 | 121 | 12,100 | 781 | 7,810 | 1,210 |
| 6 | 55 | 6 | 117 | 3,025 | 36 | 13,689 | 330 | 6,435 | 702 |
| 7 | 95 | 19 | 98 | 9,025 | 361 | 9,604 | 1,805 | 9,310 | 1,862 |
| 8 | 86 | 16 | 101 | 7,396 | 256 | 10,201 | 1,376 | 8,686 | 1,616 |
| 9 | 34 | 3 | 100 | 1,156 | 9 | 10,000 | 102 | 3,400 | 300 |
| 10 | 66 | 9 | 115 | 4,356 | 81 | 13,225 | 594 | 7,590 | 1,035 |
| Total | 661 | 105 | 1,054 | 46,889 | 1,321 | 111,806 | 7,744 | 69,730 | 10,974 |

For Regression y on x1 (Exam. Scores: hours studied)

The parameters are:

$$b_{x_1} = \frac{n \sum x_1 y - \sum x_1 \sum y}{n \sum x_1^2 - \left(\sum x_1\right)^2} = \frac{10 \times 7{,}744 - 105 \times 661}{10 \times 1{,}321 - 105^2} = 3.68$$

$$a_{x_1} = \frac{\sum y}{n} - b_{x_1} \frac{\sum x_1}{n} = \frac{661}{10} - 3.67734 \times \frac{105}{10} = 27.59$$

The regression equation for the relationship of hours studied and examination result is:

$$y_{x_1} = a_{x_1} + b_{x_1} x_1 = 27.59 + 3.68 x_1$$

The coefficient of correlation for this relationship is:

$$r_{x_1} = \frac{n \sum x_1 y - \sum x_1 \sum y}{\sqrt{n \sum x_1^2 - \left(\sum x_1\right)^2} \times \sqrt{n \sum y^2 - \left(\sum y\right)^2}}$$

Note: This formula is a direct equivalent of that given previously but is easier to work with since all except $n \sum y^2 - (\sum y)^2$ is already known.

$$n \sum y^2 - \left(\sum y\right)^2 = 10 \times 46{,}889 - 661^2 = 468{,}890 - 436{,}921 = 31{,}969$$

$$r_{x_1} = \frac{8{,}035}{\sqrt{2{,}185 \times 31{,}969}} = 0.9613$$

$$r_{x_1}^2 = 0.9243$$

i.e. the coefficient of determination for y: x1.

In a similar manner the regression y on x2 (exam. scores: IQ scores) is calculated, resulting in:

$$y_{x_2} = a_{x_2} + b_{x_2} x_2 = 57.16 + 0.085 x_2$$

$$r_{x_2}^2 = 0.001608$$

[Figure: scatter plot of Examination score against Hours studied per week, with fitted line y = 3.69x + 27.59]

Figure 7-6 Scatter diagram of examination scores and hours studied (y: x1).

[Figure: scatter plot of Examination score against IQ Score, with fitted line y = 0.085x + 57.16]

Figure 7-7 Scatter diagram of examination scores and I.Q. scores (y: x2).

Solution: Part B - The multiple regression (y : x1 and x2)

The multiple regression calculations are carried out using the three variable

Normal Equations (7.16) and the results in Table 7-8 above, thus:

$$661 = 10a + 105 b_1 + 1{,}054 b_2$$
$$7{,}744 = 105a + 1{,}321 b_1 + 10{,}974 b_2$$
$$69{,}730 = 1{,}054a + 10{,}974 b_1 + 111{,}806 b_2$$

Using standard simultaneous equation procedures results in the following values for the coefficients in the equation $y = a + b_1 x_1 + b_2 x_2$:

$$y = -38.06 + 3.93 x_1 + 0.6 x_2$$

This result could be used to predict the examination score for a candidate, given the number of hours worked and IQ. For example, what is the expected score of a candidate who has worked for 13 hours per week and who has an IQ of 102?

$$y = -38.06 + 3.93 \times 13 + 0.6 \times 102 = 74.23\% \text{ expected examination score}$$

Solution: Part C - Coefficient of multiple determination, R²

Using the computational formula given and the values calculated above, R² can be calculated thus:

$$R^2 = \frac{(-38.06 \times 661) + (3.93 \times 7{,}744) + (0.6 \times 69{,}730) - \dfrac{661^2}{10}}{46{,}889 - \dfrac{661^2}{10}} = 0.9995$$

(The quoted value requires the coefficients to be carried to more decimal places than the rounded figures shown here.)

The various coefficients of determination can now be summarised and interpreted:

$$r_{x_1}^2 = 0.9243 \qquad r_{x_2}^2 = 0.0016 \qquad R^2 = 0.9995$$

$r_{x_1}^2$: This indicates that about 92% of the variation in examination scores is explained by variation in hours of study, which is obviously a major influence.
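Parts B and C can be reproduced with a few lines of code. The following is a minimal sketch, assuming the data of Table 7-8; the helper names `dot` and `solve3` are illustrative, and a small Gauss-Jordan routine stands in for the "standard simultaneous equation procedures" mentioned in the text. Because the script keeps unrounded coefficients, its prediction for 13 hours and an IQ of 102 comes out slightly below the 74.23 obtained above with coefficients rounded to two decimals.

```python
# Data from Table 7-8: examination score (y), hours studied (x1), IQ (x2).
y = [56, 45, 80, 73, 71, 55, 95, 86, 34, 66]
x1 = [9, 6, 12, 14, 11, 6, 19, 16, 3, 9]
x2 = [99, 100, 119, 95, 110, 117, 98, 101, 100, 115]


def dot(u, v):
    """Sum of element-wise products, e.g. dot(x1, y) is the sum of x1*y."""
    return sum(p * q for p, q in zip(u, v))


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    size = 3
    for col in range(size):
        # Bring the row with the largest entry in this column to the pivot.
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(size):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][size] / a[i][i] for i in range(size)]


n = len(y)
# Three-variable Normal equations in the unknowns (a, b1, b2).
matrix = [
    [n, sum(x1), sum(x2)],
    [sum(x1), dot(x1, x1), dot(x1, x2)],
    [sum(x2), dot(x1, x2), dot(x2, x2)],
]
rhs = [sum(y), dot(x1, y), dot(x2, y)]
a, b1, b2 = solve3(matrix, rhs)

# Coefficient of multiple determination via the computational formula.
correction = sum(y) ** 2 / n
r2 = (a * sum(y) + b1 * dot(x1, y) + b2 * dot(x2, y) - correction) / (
    dot(y, y) - correction
)

predicted = a + b1 * 13 + b2 * 102  # candidate with 13 hours and IQ 102
print(f"y = {a:.2f} + {b1:.2f} x1 + {b2:.2f} x2")
print(f"R^2 = {r2:.4f}, predicted score = {predicted:.2f}")
```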

$r_{x_2}^2$: This indicates that only 0.16% of any variation in examination score is explained by variation in IQ score, which is a very small influence indeed.

R²: This shows the combined effect of two independent variables and indicates that 99.95% of the movement in examination score is accounted for by movements in hours studied and IQ score. This, however, assumes that it is a reasonable hypothesis that examination results are influenced by the intelligence of candidates and how hard they work!

7.6 Correlation between Variables

The degree of correlation between two variables can be measured by using the following indicators:

a) Covariance represents an absolute measure of the relation intensity and it is computed as the arithmetic mean of the products $(x_i - \bar{x})(y_i - \bar{y})$. It can be also computed as:

$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} \tag{7.17}$$

If the result tends to zero then there is no relation between the variables. If the result is positive, then we have a positive correlation and if the result is negative, we have a negative correlation. The maximum value of the covariance equals the product of the standard deviations of the variables, in the case of perfect correlation.

b) Coefficient of Correlation, denoted by r, used only for linear relations. This provides a measure of the strength of association between two variables. The formula for the coefficient of correlation is:

$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \cdot \sigma_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \cdot \sigma_x \cdot \sigma_y} \tag{7.18}$$

r can range from -1, i.e. perfect negative correlation, to +1, i.e. perfect positive correlation.

c) Ratio of determination, denoted by R²: expresses how much of the total variation of the Y variable is explained by the independent variable.

d) The Rank Correlation Coefficients. This provides a measure of the association between two sets of ranked or ordered data.

Whichever type of coefficient is being used it follows that a coefficient of zero or near zero generally indicates no correlation.

7.6.1 Parametric Measures of Simple Correlation. Coefficient and Ratio of Correlation

Coefficient of correlation

This coefficient represents a measure computed differently for different ways of presenting the data as related pairs or classified pairs of figures:
a. simple bivariate numerical data which were not grouped
b. bivariate numerical data grouped by classes or variants with common frequencies for x and y variation
c. bivariate numerical data grouped by variants or classes into a cross table

This coefficient gives an indication of the strength of the linear relationship between two variables.

a. In the case of simple bivariate numerical data which are not grouped and are presented as related pairs of figures:

| X values | Y values |
|---|---|
| X1 | Y1 |
| … | … |
| Xi | Yi |
| … | … |
| Xn | Yn |
| ΣXi | ΣYi |

The general formula is:

$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \cdot \sigma_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \cdot \sigma_x \cdot \sigma_y} \tag{7.19}$$

There are several possible formulae but a practical one is the "reduced computation formula" (7.20):

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2} \times \sqrt{n \sum y^2 - \left(\sum y\right)^2}} \tag{7.20}$$

This formula is used to find r from the data in Example 1 from Table 7-9.

Table 7-9

| X | Y | X² | Y² | XY |
|---|---|---|---|---|
| 15 | 60 | 225 | 3,600 | 900 |
| 24 | 45 | 576 | 2,025 | 1,080 |
| 25 | 50 | 625 | 2,500 | 1,250 |
| 30 | 35 | 900 | 1,225 | 1,050 |
| 35 | 42 | 1,225 | 1,764 | 1,470 |
| 40 | 46 | 1,600 | 2,116 | 1,840 |
| 45 | 28 | 2,025 | 784 | 1,260 |
| 65 | 20 | 4,225 | 400 | 1,300 |
| 70 | 22 | 4,900 | 484 | 1,540 |
| 75 | 15 | 5,625 | 225 | 1,125 |
| ∑X = 424 | ∑Y = 363 | ∑X² = 21,926 | ∑Y² = 15,123 | ∑XY = 12,815 |

Using the formula above:

$$r = \frac{10 \times 12{,}815 - 424 \times 363}{\sqrt{\left(10 \times 21{,}926 - 424^2\right) \times \left(10 \times 15{,}123 - 363^2\right)}} = \frac{128{,}150 - 153{,}912}{\sqrt{(219{,}260 - 179{,}776) \times (151{,}230 - 131{,}769)}} = \frac{-25{,}762}{\sqrt{39{,}484 \times 19{,}461}} = -0.93$$

Thus the correlation coefficient is -0.93, which indicates a strong negative linear association between expenditure on inspection and defective parts delivered. It will be seen that the formula automatically produces the correct sign for the coefficient.
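As an illustration, the sketch below computes r for the Table 7-9 data in two ways: with the reduced computation formula (7.20) and directly from the covariance definition (7.19). The function names are illustrative only. Both routes should give the same value, which is a useful check on hand calculations.

```python
import math

# Table 7-9 data as given in the text.
x = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]


def r_reduced(x, y):
    """Reduced computation formula (7.20), based on raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_xx = sum(p * p for p in x)
    sum_yy = sum(q * q for q in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(
        n * sum_yy - sum_y ** 2
    )
    return numerator / denominator


def r_from_covariance(x, y):
    """Definition (7.19): covariance over the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / n
    sd_x = math.sqrt(sum((p - mean_x) ** 2 for p in x) / n)
    sd_y = math.sqrt(sum((q - mean_y) ** 2 for q in y) / n)
    return cov / (sd_x * sd_y)


r1 = r_reduced(x, y)
r2 = r_from_covariance(x, y)
print(f"reduced formula: {r1:.4f}, covariance definition: {r2:.4f}")
```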

b. In the case of bivariate numerical data grouped by classes or variants with common frequencies for x and y variation

In this case the input data are arranged as follows:

| X values | Y values | Frequencies fi |
|---|---|---|
| X1 | Y1 | f1 |
| … | … | … |
| Xi | Yi | fi |
| … | … | … |
| Xn | Yn | fn |
| ΣXi | ΣYi | Σfi |

For the above table the practical formula for the coefficient of correlation is:

$$r = \frac{\sum f_i \cdot \sum x_i y_i f_i - \sum x_i f_i \sum y_i f_i}{\sqrt{\sum f_i \cdot \sum x_i^2 f_i - \left(\sum x_i f_i\right)^2} \times \sqrt{\sum f_i \cdot \sum y_i^2 f_i - \left(\sum y_i f_i\right)^2}}$$

c. In the case of bivariate numerical data grouped into a cross table, as follows in Table 7-10:

Table 7-10

| Class middles or variants of the independent variable (xi) | Y1 | … | Yj | … | Ym | Total (fi.) |
|---|---|---|---|---|---|---|
| X1 | f11 | … | f1j | … | f1m | f1. |
| X2 | f21 | … | f2j | … | f2m | f2. |
| … | … | … | … | … | … | … |
| Xn | fn1 | … | fnj | … | fnm | fn. |
| Total (f.j) | f.1 | … | f.j | … | f.m | Σfi. = Σf.j |

The columns Y1, …, Ym are the class middles or variants of the dependent variable (yj).

For a cross table the practical formula of r is:

$$r = \frac{\sum\sum f_{ij} \cdot \sum\sum x_i y_j f_{ij} - \sum x_i f_{i\cdot} \sum y_j f_{\cdot j}}{\sqrt{\sum\sum f_{ij} \cdot \sum x_i^2 f_{i\cdot} - \left(\sum x_i f_{i\cdot}\right)^2} \times \sqrt{\sum\sum f_{ij} \cdot \sum y_j^2 f_{\cdot j} - \left(\sum y_j f_{\cdot j}\right)^2}} \tag{7.21}$$

No matter the data presentation and classification, the coefficient of correlation is interpreted by comparison with zero and with its limits, -1 and +1.

Interpretation of the value of r

Caution is needed in the interpretation of the coefficient of correlation. The coefficient of correlation can take values between -1 and +1, and its size (in absolute value) is read as follows:

• r = 0: no relationship, independent variables
• r ∈ (0, 0.2): there is a low intensity relation between the variables
• r ∈ (0.2, 0.5): there is a weak correlation, a case needing a significance test to be applied, as for instance the Student test
• r ∈ (0.5, 0.75): medium intensity relation
• r ∈ (0.75, 0.95): tight, high intensity relation
• r ∈ (0.95, 1.00): we have an extremely strong relationship between the variables, almost a deterministic relation (functional relation)

If we are comparing r with zero then:

• r > 0: shows a positive relationship and should correspond to a positive slope of the regression line
• r < 0: shows a negative relationship and should correspond to a negative slope

A high value (above +0.9 or -0.9) only shows a strong association between the two variables and does not show a causal relationship. It is possible to find two variables which produce a high calculated r value yet which have no causal relationship. An example might be the wheat harvest in America and the number of deaths by drowning in Britain. There might be a high apparent correlation between these two variables but there clearly is no causal relationship. This is known as spurious or nonsense correlation.

A low correlation coefficient, r, somewhere near zero, does not always mean that there is no relationship between the variables. All it says is that there is no linear relationship between the variables; there may be a strong relationship, but a non-linear one.

A further problem in interpretation arises from the fact that the coefficient of correlation measures the relationship between a single independent variable and the dependent variable, whereas a particular variable may be dependent on several independent variables, in which case multiple correlation should have been calculated rather than the simple two-variable coefficient.

The significance of r

Frequently the set of X and Y observations is based upon a sample. Had a different sample been drawn then the value of r would be different, although the degree of correlation in the reference population would remain the same. In the same way that the knowledge of $\bar{x}$ enables an estimate to be made of the population mean, the knowledge of r enables the analyst to make an estimate of ρ, the population coefficient of correlation.

Generally in examination questions the sample size is limited to some figure that can be dealt with in the time allowed. It is questionable whether the sample size given in examinations gives enough data for a credible judgment to be formed about a possible relationship between the X and Y values, or is it just that the particular sample gives this impression? Conversely, if r is low does it really imply a lack of a relationship? There may indeed be a close relationship but the data has not revealed it. Further, the relationship may exist, but it may not be linear or it may not be direct.

It is possible to test whether the value of r is sufficiently different from zero for the analyst to decide whether the X and Y values are correlated. The test may be stated by the null hypothesis and its alternative:

H0: ρ = 0
H1: ρ ≠ 0

It is a t test for which the test statistic is given by:

$$t = \frac{r - \rho}{\sqrt{1 - r^2}} \times \sqrt{n - 2} \tag{7.22}$$
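The test statistic (7.22) is simple to evaluate in code. The sketch below uses r = -0.93 and n = 10 from the Table 7-9 example; the function name is illustrative, and the critical value 2.306 is the two-tailed 5% point of the t distribution with 8 degrees of freedom, taken from standard t tables rather than computed.

```python
import math


def correlation_t_statistic(r, n, rho=0.0):
    """Test statistic (7.22) for H0: population correlation equals rho."""
    return (r - rho) / math.sqrt(1 - r ** 2) * math.sqrt(n - 2)


r, n = -0.93, 10  # values from the Table 7-9 example
t = correlation_t_statistic(r, n)

# Two-tailed 5% critical value for n - 2 = 8 degrees of freedom,
# read from standard t tables (an input, not computed here).
T_CRITICAL_5PCT_DF8 = 2.306
reject_h0 = abs(t) > T_CRITICAL_5PCT_DF8

print(f"t = {t:.2f}, reject H0 at 5%: {reject_h0}")
```

The sign of t follows the sign of r; for a two-tailed test only its absolute value is compared with the critical value.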

Using the values from example 1, i.e. r = -0.93 and n = 10, we obtain:

$$t = \frac{-0.93 - 0}{\sqrt{1 - 0.93^2}} \times \sqrt{10 - 2} = -2.53 \times 2.83 = -7.16$$

With n - 2 = 8 degrees of freedom the two-tailed 5% critical value of t is 2.306, so |t| = 7.16 lies well beyond it: H0 is rejected and the correlation is significant.

Ratio of correlation

The ratio of correlation can be used to characterise any category of relation, linear or non-linear. It can also be used to measure the intensity of the relation no matter how many independent variables we take into account. It is computed with the formula:

$$R = \sqrt{1 - \frac{\sum (y_i - Y_x)^2}{\sum (y_i - \bar{y})^2}} \tag{7.23}$$

where:
yi: array of dependent data
Yx: array of adjusted values, calculated according to the regression function
$\bar{y}$: the arithmetic mean of the dependent values

The ratio of correlation is interpreted similarly to the coefficient of correlation and it can take values between 0 and +1. The ratio of correlation shows only the intensity of the relation and it does not show the direction.

7.6.2 Parametric Measures of Multiple Correlation

In the case of multiple correlation the closeness of fit is measured by the coefficient of multiple determination, R², for which the general formula and the useful computational formula are given below:

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum (Y_{\text{estimate}} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} \tag{7.24}$$

Where Y estimate now equals the estimate of Y for each value of x1 and x2.

$$R^2 = \frac{a \sum y + b_1 \sum x_1 y + b_2 \sum x_2 y - \dfrac{\left(\sum y\right)^2}{n}}{\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}} \tag{7.25}$$

It is not necessarily the case that the value of the coefficient of determination will improve appreciably with the addition of extra variables.

The ratio of multiple correlation is calculated as in the case of the simple correlation. So, the ratio of correlation is computed with the formula:

$$R_{y/x_1, x_2, \ldots, x_n} = \sqrt{1 - \frac{\sum \left(y_i - Y_{x_1, x_2, \ldots, x_n}\right)^2}{\sum (y_i - \bar{y})^2}} \tag{7.26}$$

The ratio of multiple correlation can take values between 0 and +1, depending on the specific weight of the dispersion produced by the registered factors ($\sigma^2_{y/x_1, x_2, \ldots, x_n}$) over the total dispersion of the resulting variable ($\sigma^2_y$). This ratio has the highest value in comparison with the simple correlation indicators, because it combines the influence of each factor and of the interaction between them. If we use the relation between the three dispersions ($\sigma^2_y = \sigma^2_{y/x_1, x_2, \ldots, x_n} + \sigma^2_{y/r}$), the more factors are considered, the higher is the ratio's value.

Theoretically, it can be admitted that under the conditions in which all the factors could be expressed numerically, the ratio of multiple correlation should be 1, showing the functional dependence between the level of the resulting variable and all its determinative factors. Therefore, the equation of regression will be equal to the empiric value of the resulting variable calculated from the size of all the determinative factors, and the free term would be 0:

$$Y(x_1, x_2, \ldots, x_n) = a + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n \tag{7.27}$$

But, actually, not all the factors of influence can be identified and some of them cannot be quantified. For this reason, the values of the multiple regression line will have errors, more or less close to the real values of the series' terms, because of the influence of those unregistered factors included in the value of the free term a0.

In the case of multiple correlations, the ratio of linear correlation synthesizes all the simple linear relations. In the case of a linear relation verified with every one of the considered factors, the ratio of multiple correlation transforms into a coefficient of multiple correlation. The coefficient of multiple correlation equals the ratio of multiple correlation.

If the factors are independent of each other, then the ratio of multiple determination equals the sum of the ratios of simple determination. For instance, for two factors:

$$R^2_{y/x_1, x_2} = R^2_{y/x_1} + R^2_{y/x_2} \tag{7.28}$$

If the relation is linear, then R is substituted by r:

$$R^2_{y/x_1, x_2} = r^2_{y/x_1} + r^2_{y/x_2} \tag{7.29}$$

From this, the ratio of correlation is:

$$R_{y/x_1, x_2} = \sqrt{r^2_{y/x_1} + r^2_{y/x_2}} \tag{7.30}$$

If the factors are interdependent, $r_{x_1, x_2} \neq 0$. Usually, among the socio-economic phenomena the factors of influence are interdependent and therefore it becomes necessary to consider the reciprocal influence of the factors.

This inter-influence has to be eliminated because it can be found in the value of multiple correlation coefficients. The ratio of multiple linear correlation is calculated using the coefficients of simple correlation:

$$R_{y/x_1, x_2} = \sqrt{\frac{r^2_{y/x_1} + r^2_{y/x_2} - 2\, r_{y/x_1}\, r_{y/x_2}\, r_{x_1 x_2}}{1 - r^2_{x_1 x_2}}} \tag{7.31}$$

7.6.3 Nonparametric Measures of Correlation

Sometimes in practice we cannot use any of the known functions for the interpretation of the relation, because we do not have enough elements to identify the rule of distribution of the errors for the series used. In this case nonparametric methods are used, like the coefficient of association proposed by Yule and the coefficients of rank correlation proposed by Kendall and Spearman. These coefficients have the advantage that they can be used in the case of a skewed distribution or a small number of units. This is possible because in this type of situation the terms' distribution is made in connection with the rank of each independent variable.

Yule coefficient of association

This coefficient is used when the statistical units can be separated into two groups according to the x and y variation, or when they have the form of binary variables:

Table 7-11

| X groups or variants | Y1 | Y2 | Total |
|---|---|---|---|
| X1 | A | B | A+B |
| X2 | C | D | C+D |
| Total | A+C | B+D | A+B+C+D |

The columns Y1 and Y2 are the Y groups or variants.

In order to express the intensity of the relation we use the formula:

$$K_{Yule} = \frac{A \cdot D - B \cdot C}{A \cdot D + B \cdot C} \tag{7.32}$$

taking values between -1 and +1, with the same interpretation as for the coefficient of correlation.

Ranks coefficients

This nonparametric method also has the advantage of including in the analysis the relation of dependence between phenomena and qualitative variables that cannot be expressed numerically, but can be classified by rank. Therefore, the data are arranged according to the variation of the independent variable and each variant is replaced with its order number, called rank. When a relation exists between the two variables of the same unit, there has to correspond the same number of units with a higher or smaller rank than them.

The ranks can be distributed either increasingly, when the best value of the indicator is the one with the minimum value, or decreasingly, when the maximum value has the rank one. From the point of view of the value of the coefficients of correlation, the direction in which the ranks are distributed does not have a great importance, provided we maintain the same direction for all the variables. The direction is important only if the analysis of correlation is combined with the establishment of a hierarchic typology.

The most frequently used formulas for the coefficient of correlation of the ranks are those of Spearman and Kendall. Starting from the hypothesis that there is concordance between the two series of ranks, the rank coefficient proposed by Spearman is:

$$r_s = 1 - \frac{6 \sum d_i^2}{n^3 - n} \tag{7.33}$$

where:
di: the rank difference between correlated variables and
n: the number of correlated units.

It is deduced from the coefficient of simple linear correlation, where the mean and the dispersion are based on the properties of the arithmetic progression.

The coefficient of correlation of the ranks proposed by Kendall has the formula:

$$r_K = \frac{2S}{n(n - 1)} \tag{7.34}$$

where:
S = P + Q: the score of the two different positions of the ranks of the correlated variables;
P: the number of superior ranks that succeed the rank of the effect variable for which the calculation is made;
Q: the number of inferior ranks of the effect variable that succeed the same rank.

P always has a positive value and Q is negative, and so S can be positive or negative. So, the coefficients of correlation of the ranks can take values between -1 and +1. Their interpretation is made as for the parametric correlation. It has been concluded that Kendall's coefficient is smaller than Spearman's. Because it is easier to calculate, the most frequently used is Spearman's coefficient.

The advantages and the ease of calculation make these coefficients very applicable for studying the relation between specific phenomena, including qualitative variables measured on the ordinal scale.

Tied rankings. Adjusted rankings

A slight adjustment to the formula is necessary if, in a research recording the students' marks, some students obtained the same marks in a test and thus are given the same ranking. The adjustment is:

$$\frac{t^3 - t}{12} \tag{7.35}$$

where t is the number of tied rankings.
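As a sketch of the Spearman coefficient (7.33) together with the tie adjustment (7.35), the function below adds (t³ - t)/12 to the sum of squared rank differences for each group of t tied ranks. The function and argument names are illustrative. The rankings are those of the tied example in Table 7-12, where students E and F share third place in QT and so each receive rank 3.5.

```python
def spearman(rank_x, rank_y, tie_sizes=()):
    """Spearman rank coefficient, optionally adjusted for tied rankings.

    tie_sizes lists the number of items in each group of tied ranks;
    each group adds (t**3 - t) / 12 to the sum of squared differences.
    """
    n = len(rank_x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    adjustment = sum((t ** 3 - t) / 12 for t in tie_sizes)
    return 1 - 6 * (sum_d2 + adjustment) / (n * (n ** 2 - 1))


# Students A..H, rankings from Table 7-12 (E and F tied in QT).
qt_rank = [2, 7, 6, 1, 3.5, 3.5, 5, 8]
ma_rank = [3, 6, 4, 2, 5, 1, 8, 7]

r_tied = spearman(qt_rank, ma_rank, tie_sizes=[2])
print(f"adjusted Spearman coefficient: {r_tied:+.2f}")
```

With no ties, calling `spearman(rank_x, rank_y)` reduces to formula (7.33), since n(n² - 1) = n³ - n.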

The adjusted formula for the Spearman coefficient is:

$$R = 1 - \frac{6\left(\sum d^2 + \dfrac{t^3 - t}{12}\right)}{n\left(n^2 - 1\right)} \tag{7.35'}$$

For example assume that students E and F achieved equal marks in QT and were given joint third place. The revised data are given by Table 7-12:

Table 7-12

| Student | Q.T. Ranking | M.A. Ranking | d | d² |
|---|---|---|---|---|
| A | 2 | 3 | -1 | 1 |
| B | 7 | 6 | +1 | 1 |
| C | 6 | 4 | +2 | 4 |
| D | 1 | 2 | -1 | 1 |
| E | 3½ | 5 | -1½ | 2¼ |
| F | 3½ | 1 | +2½ | 6¼ |
| G | 5 | 8 | -3 | 9 |
| H | 8 | 7 | +1 | 1 |
| | | | | ∑d² = 25½ |

$$R = 1 - \frac{6\left(25\tfrac{1}{2} + \dfrac{2^3 - 2}{12}\right)}{8\left(8^2 - 1\right)} = +0.69$$

As will be seen, the Spearman value has moved also, from +0.74 to +0.69.

7.7 Exercises

Multiple choice exercises with answers

1. Which of the following techniques is used to predict the value of one variable on the basis of other variables?
a. Correlation analysis
b. Coefficient of correlation
c. Covariance
d. Regression analysis
ANSWER: d

2. If cov(X,Y) = 1260, $s_x^2 = 1600$ and $s_y^2 = 1225$, then the coefficient of determination is:
a. 0.7875
b. 1.0286
c. 0.8100
d. 0.7656
ANSWER: c

3. The coefficient of determination R² measures the amount of:
a. variation in y that is explained by variation in x
b. variation in x that is explained by variation in y
c. variation in y that is unexplained by variation in x
d. variation in x that is unexplained by variation in y
ANSWER: a

4. In the simple linear regression model, the y-intercept represents the:
a. change in y per unit change in x
b. change in x per unit change in y
c. value of y when x = 0
d. value of x when y = 0
ANSWER: c

Multiple choice exercises without answers

5. In a regression problem, if the coefficient of determination is 0.95, this means that:
a. 95% of the y values are positive
b. 95% of the variation in y can be explained by the variation in x
c. 95% of the x values are equal
d. 95% of the variation in x can be explained by the variation in y

6. In a regression problem, if all the values of the independent variable are equal, then the coefficient of determination must be:
a. 1
b. 0.5
c. 0
d. -1

7. The following sums of squares are produced:

$$\sum (y_i - \bar{y})^2 = 200, \qquad \sum (y_i - \hat{y}_i)^2 = 50, \qquad \sum (\hat{y}_i - \bar{y})^2 = 150$$

The proportion of the variation in y that is explained by the variation in x is:
a. 25%
b. 75%
c. 33%
d. 50%

Open ended exercises with answers

8. Consider the following data values of variables x and y.

| x | 2 | 4 | 6 | 8 | 10 | 13 |
|---|---|---|---|---|---|---|
| y | 7 | 11 | 17 | 21 | 27 | 36 |

a. Determine the least squares regression line.
b. Find the predicted value of y for x = 9.
c. What does the value of the slope of the regression line tell you?
d. Calculate the coefficient of determination, and describe what this statistic tells you about the relationship between the two variables.
e. Calculate the Pearson coefficient of correlation. What sign does it have? Why?
f. What does the coefficient of correlation calculated in part (e) tell you about the direction and strength of the relationship between the two variables?

ANSWERS:
a. $\hat{y} = 0.934 + 2.637x$
b. 24.667
c. If x increases by one unit, y on average will increase by 2.637.
d. R² = 0.995. This means that 99.5% of the variation in the dependent variable y is explained by the variation in the independent variable x.
e. r = 0.9975. It is positive since the slope of the regression line is positive.
f. There is a very strong (almost perfect) positive linear relationship between the two variables.
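The answers to Exercise 8 can be checked with a short script. This is a sketch only, applying the same least squares and reduced correlation formulas used earlier in the chapter; the function name `simple_regression` is illustrative.

```python
import math

# Data of Exercise 8.
x = [2, 4, 6, 8, 10, 13]
y = [7, 11, 17, 21, 27, 36]


def simple_regression(x, y):
    """Return (intercept, slope, r) for the least squares line of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_xx = sum(p * p for p in x)
    sum_yy = sum(q * q for q in y)
    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_xx - sum_x ** 2
    syy = n * sum_yy - sum_y ** 2
    slope = sxy / sxx
    intercept = sum_y / n - slope * sum_x / n
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


intercept, slope, r = simple_regression(x, y)
prediction = intercept + slope * 9  # part b: predicted y for x = 9
print(f"y = {intercept:.3f} + {slope:.3f} x")
print(f"y(9) = {prediction:.3f}, r = {r:.4f}, R^2 = {r ** 2:.3f}")
```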

909.6165 + 2. A random sample eight individuals is taken and the results are shown below. For each additional year of education.2 + 2. the model’s fit is good.5 + 4.9098x b.436. d.0x Using the least squares method. Determine the standard error of estimate and describe what this statistic tells you about the regression line. . c. 10. y = 10. Interpret the value of the slope of the regression line. c. A professor of economics wants to study the relationship between income (y in \$1000s) and education (x in years). Education Income 16 58 11 40 15 55 8 35 12 43 10 41 13 52 14 49 a.5x ˆ Model 2: y = 5. A scatter diagram includes the following data points: x y 3 8 2 6 5 12 4 10 5 14 Two regression models are proposed: ˆ Model 1: y = 1. Draw a scatter diagram of the data to determine whether a linear model appears to be appropriate. ˆ a.Statistics for Business Administration 9. 70 60 50 Income 40 30 20 10 0 0 2 4 6 8 10 12 14 16 18 Y ears of Education It appears that a linear model is appropriate.80. Determine the least squares regression line. b. the income on average increases by \$2. sε = 2. which of these regression models provide the better fit to the data? Why? ANSWERS: Scatter Diagram a.

Sum of squared errors (SSE) = 4.95 and 593.25 for models 1 and 2, respectively. Therefore, model 1 is better than model 2.

11. Consider the following data values of variables x and y:

| x | 2 | 4 | 6 | 8 | 10 | 13 |
|---|---|---|---|---|---|---|
| y | 7 | 11 | 17 | 21 | 27 | 36 |

a. Determine the least squares regression line.
b. Find the predicted value of y for x = 9.
c. What does the value of the slope of the regression line tell you?
d. Calculate the coefficient of determination, and describe what this statistic tells you about the relationship between the two variables.
e. Calculate the Pearson coefficient of correlation. What sign does it have? Why?
f. What does the coefficient of correlation calculated in part (e) tell you about the direction and strength of the relationship between the two variables?

ANSWERS:
a. $\hat{y} = 0.934 + 2.637x$
b. 24.667
c. If x increases by one unit, y on average will increase by 2.637.
d. R² = 0.995. This means that 99.5% of the variation in the dependent variable y is explained by the variation in the independent variable x.
e. r = 0.9975. It is positive since the slope of the regression line is positive.
f. There is a very strong (almost perfect) positive linear relationship between the two variables.

Open ended exercises without answers

12. Refer to Exercise 10.
a. Determine the coefficient of determination and discuss what its value tells you about the two variables.
b. Calculate the Pearson correlation coefficient. What sign does it have? Why?

c. Conduct a test of the population coefficient of correlation to determine at the 5% significance level whether a linear relationship exists between years of education and income.

13. Refer to Exercise 10.
a. Use the regression equation to determine the predicted values of y.
b. Use the predicted and actual values of y to calculate the residuals.
c. Plot the residuals against the predicted values of y.
d. Does the variance appear to be constant? Identify possible outliers.

14. Given the least squares regression line $\hat{y} = -2.48 + 1.63x$, and a coefficient of determination of 0.81, compute and interpret the coefficient of correlation.

15. In a simple linear regression problem, the following statistics are calculated from a sample of 10 observations:

$$\sum (x - \bar{x})(y - \bar{y}) = 2250, \qquad s_x = 10, \qquad \sum x = 50, \qquad \sum y = 75$$

Compute the regression equation.

16. For a company we know the information regarding the turnover and the profit evolution:

| Year | Turnover: mobile relative change (%) | Profit: chain base absolute change (mill. m.u.) |
|---|---|---|
| 1991 | +3% | 6 |
| 1992 | +4% | 4 |
| 1993 | +2% | -1 |
| 1994 | +4% | 9 |
| 1995 | -2% | 2 |
| 1996 | +8% | 9 |

The turnover in 1990 was 80 m.u. and the average rate of profit per year was +6.4%. Considering that the profit evolution is influenced by the turnover evolution, you are asked to:
a. reconstruct and graph the historical evolutions
b. forecast the turnover evolution for the next year, choosing the most appropriate method between the simple and the analytic methods
c. forecast the profit in 1997 taking into account its dependency upon the turnover (according to the regression line)
d. measure the intensity of the relation using parametric and nonparametric measures.