CHAPTER 9 Regression and Correlation Analysis

Three groups of variables are normally recorded in crop experiments. These are:

1. Treatments, such as fertilizer rates, varieties, and weed control methods, which are generated from one or more management practices and are the primary focus of the experiment.

2. Environmental factors, such as rainfall and solar radiation, which represent the portion of the environment that is not within the researcher's control.

3. Responses, which represent the biological and physical features of the experimental units that are expected to be affected by the treatments being tested. Response to treatments can be exhibited either by the crop, in terms of changes in such biological features as grain yield and plant height (to be called crop response), or by the surrounding environment, in terms of changes in such features as insect incidence in an entomological trial and soil nutrients in a fertility trial (to be called noncrop response).

Because agricultural research focuses primarily on the behavior of biological organisms in a specified environment, the associations among treatments, environmental factors, and responses that are usually evaluated in crop research are:

1. Association between Response Variables. Crop performance is a product of several crop and noncrop characters. Each, in turn, is affected by the treatments. All these characters are usually measured simultaneously, and their association with each other can provide useful information about how the treatments influenced crop response. For example, in a trial to determine the effect of plant density on rice yield, the association between yield and its components, such as number of tillers or panicle weight, is a good indicator of the indirect effect of treatments: grain yield is increased as a result of increased tiller number, or larger panicle size, or a combination of the two.
Another example is in a varietal improvement program designed to produce rice varieties with both high yield and high protein content. A positive association between the two characters would indicate that varieties with both high yield and high protein content are easy to find, whereas a negative association would indicate the low frequency of desirable varieties.

2. Association between Response and Treatment. When the treatments are quantitative, such as kilograms of nitrogen applied per hectare and numbers of plants per m², it is possible to describe the association between treatment and response. By characterizing such an association, the relationship between treatment and response is specified not only for the treatment levels actually tested but for all other intermediate points within the range of the treatments tested. For example, in a fertilizer trial designed to evaluate crop yield at 0, 30, 60, and 90 kg N/ha, the relationship between yield and nitrogen rate specifies the yields that can be obtained not only for the four nitrogen rates actually tested but also for all other rates of application between 0 and 90 kg N/ha.

3. Association between Response and Environment. For a new crop management practice to be acceptable, its superiority must hold over diverse environments. Thus, agricultural experiments are usually repeated in different areas or in different crop seasons and years. In such experiments, the association between the environmental factors (sunshine, rainfall, temperature, soil nutrients) and the crop response is important.

In characterizing the association between characters, there is a need for statistical procedures that can simultaneously handle several variables.
If two plant characters are measured to represent crop response, the analysis of variance and mean comparison procedures (Chapters 2 to 5) can evaluate only one character at a time, even though response in one character may affect the other, or treatment effects may simultaneously influence both characters. Regression and correlation analysis allows a researcher to examine any one, or a combination, of the three types of association described earlier, provided that the variables concerned are expressed quantitatively.

Regression analysis describes the effect of one or more variables (designated as independent variables) on a single variable (designated as the dependent variable) by expressing the latter as a function of the former. For this analysis, it is important to clearly distinguish between the dependent and independent variables, a distinction that is not always obvious. For instance, in experiments on yield response to nitrogen, yield is obviously the dependent variable and nitrogen rate is the independent variable. On the other hand, in the example on grain yield and protein content, identification of variables is not obvious. Generally, however, the character of major importance, say grain yield, becomes the dependent variable, and the factors or characters that influence grain yield become the independent variables.

Correlation analysis, on the other hand, provides a measure of the degree of association between the variables, or the goodness of fit of a prescribed relationship to the data at hand.

Regression and correlation procedures can be classified according to the number of variables involved and the form of the functional relationship between the dependent variable and the independent variables. The procedure is termed simple if only two variables (one dependent and one independent variable) are involved, and multiple otherwise.
The procedure is termed linear if the form of the underlying relationship is linear, and nonlinear otherwise. Thus, regression and correlation analysis can be classified into four types:

1. Simple linear regression and correlation
2. Multiple linear regression and correlation
3. Simple nonlinear regression and correlation
4. Multiple nonlinear regression and correlation

We describe:

• The statistical procedure for applying each of the four types of regression and correlation analysis, with emphasis on simple linear regression and correlation because of its simplicity and wide usage in agricultural research.
• The statistical procedures for selecting the best functional form to describe the relationship between the dependent variable and the independent variables of interest.
• The common misuses of regression and correlation analysis in agricultural research, and the guidelines for avoiding them.

9.1 LINEAR RELATIONSHIP

The relationship between any two variables is linear if the rate of change is constant throughout the whole range under consideration. The graphical representation of a linear relationship is a straight line, as illustrated in Figure 9.1a. Here, Y increases by two units for each unit change in X throughout the whole range of X values from 0 to 5: Y increases from 1 to 3 as X changes from 0 to 1, Y increases from 3 to 5 as X changes from 1 to 2, and so on.

The functional form of the linear relationship between a dependent variable Y and an independent variable X is represented by the equation:

Y = α + βX

where α is the intercept of the line on the Y axis and β, the linear regression coefficient, is the slope of the line, or the amount of change in Y for each unit change in X.

[Figure 9.1 Illustration of a linear (a) and a nonlinear (b) relationship between the dependent variable Y and the independent variable X.]
For example, for the linear relationship of Figure 9.1a, with an intercept α of 1 and a linear regression coefficient β of 2, the relationship is expressed as:

Y = 1 + 2X    for 0 ≤ X ≤ 5

9.1.1.1 Simple Linear Regression Analysis. Estimation of a simple linear regression is based on n pairs (n > 2) of Y and X values. For example, in the study of nitrogen response using data from a fertilizer trial involving t nitrogen rates, the n pairs of Y and X values would be the t pairs of mean yield (Y) and nitrogen rate (X).

We illustrate the procedure for the simple linear regression analysis with the rice yield data from a trial with four levels of nitrogen, as shown in Table 9.1. The primary objective of the analysis is to estimate a linear response in rice yield to the rate of nitrogen applied, and to test whether this linear response is significant. The step-by-step procedures are:

□ Step 1. Compute the means X̄ and Ȳ, the corrected sums of squares Σx² and Σy², and the corrected sum of cross products Σxy, of variables X and Y as:

X̄ = ΣXᵢ/n
Ȳ = ΣYᵢ/n
Σx² = Σ(Xᵢ − X̄)²
Σy² = Σ(Yᵢ − Ȳ)²
Σxy = Σ(Xᵢ − X̄)(Yᵢ − Ȳ)

where (Xᵢ, Yᵢ) represents the ith pair of the X and Y values.

For our example, n = 4 pairs of values of rice yield (Y) and nitrogen rate (X). Their means, corrected sums of squares, and corrected sum of cross products are computed as shown in Table 9.1.

Table 9.1 Computation of a Simple Linear Regression Equation between Grain Yield and Nitrogen Rate, Using Data from a Fertilizer Experiment in Rice

Nitrogen Rate,  Grain Yield,  Deviation from Mean     Square of Deviate        Product of
X, kg/ha        Y, kg/ha      x         y             x²        y²             Deviates, xy
0               4,230         −75       −1,640.75     5,625     2,692,061      123,056
50              5,442         −25       −428.75       625       183,827        10,719
100             6,661         25        790.25        625       624,495        19,756
150             7,150         75        1,279.25      5,625     1,636,481      95,944
Sum   300       23,483        0         0.00          12,500    5,136,864      249,475
Mean  75        5,870.75

□ Step 2. Compute the estimates of the regression parameters α and β as:

b = Σxy / Σx²
a = Ȳ − bX̄

where a is the estimate of α, and b is the estimate of β.
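As a check on the arithmetic of step 1, the corrected sums of squares and cross products of Table 9.1 can be reproduced with a short script. This is only a sketch (the variable names are ours); the yield at 50 kg N/ha, 5,442 kg/ha, is recovered from the table's totals:

```python
# Nitrogen rates X (kg N/ha) and mean grain yields Y (kg/ha) from Table 9.1
X = [0, 50, 100, 150]
Y = [4230, 5442, 6661, 7150]
n = len(X)

x_bar = sum(X) / n                  # 75.0
y_bar = sum(Y) / n                  # 5870.75

# Corrected sums of squares and corrected sum of cross products
sum_x2 = sum((x - x_bar) ** 2 for x in X)
sum_y2 = sum((y - y_bar) ** 2 for y in Y)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

print(sum_x2, sum_xy)   # 12500.0 249475.0
print(sum_y2)           # 5136862.75
```

Note that Σy² computed from unrounded deviates is 5,136,862.75; the table total of 5,136,864 comes from summing row entries that were rounded to whole numbers.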
For our example, the estimates of the two regression parameters are:

b = 249,475 / 12,500 = 19.96
a = 5,870.75 − (19.96)(75) = 4,374

Thus, the estimated linear regression is:

Ŷ = 4,374 + 19.96X    for 0 ≤ X ≤ 150

□ Step 3. Plot the observed points and draw a graphical representation of the estimated regression equation of step 2:

• Plot the n observed points. For our example, the four observed points (i.e., the X and Y values in Table 9.1) are plotted in Figure 9.2.

[Figure 9.2 The estimated linear regression between grain yield (Y) and nitrogen rate (X), computed from data in Table 9.1.]

• Using the estimated linear regression of step 2, compute two Ŷ values, one corresponding to the smallest X value (i.e., X_min) and the other corresponding to the largest X value (i.e., X_max):

Ŷ_min = a + b(X_min)
Ŷ_max = a + b(X_max)

For our example, with X_min = 0 kg N/ha and X_max = 150 kg N/ha, the corresponding Ŷ_min and Ŷ_max values are computed as:

Ŷ_min = 4,374 + 19.96(0) = 4,374 kg/ha
Ŷ_max = 4,374 + 19.96(150) = 7,368 kg/ha

• Plot the two points (X_min, Ŷ_min) and (X_max, Ŷ_max) on the (X, Y) plane and draw the line between the two points, as shown in Figure 9.3.

The following features of a graphical representation of a linear regression, such as that of Figure 9.3, should be noted:

(i) The line must be drawn within the range of values of X_min and X_max. It is not valid to extrapolate the line outside this range.
(ii) The line must pass through the point (X̄, Ȳ), where X̄ and Ȳ are the means of variables X and Y, respectively.
(iii) The slope of the line is b.
(iv) The line, if extended, must intersect the Y axis at the Y value of a.

For our example, the two points (0, 4,374) and (150, 7,368) are plotted and the line drawn between them, as shown in Figure 9.2.

[Figure 9.3 Graphical representation of an estimated regression line: Ŷ = a + bX.]
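The remaining computations of the example reduce to a few arithmetic operations. The sketch below reproduces the estimates of step 2 and the predicted values of step 3, and also carries out the residual mean square, t test, and confidence interval of steps 4 and 5 presented next; the function name and the hard-coded tabular t value (4.303, for 2 d.f. at the 5% level) are ours:

```python
import math

# Sums from Table 9.1
n = 4
x_bar, y_bar = 75.0, 5870.75
sum_x2, sum_y2, sum_xy = 12500.0, 5136864.0, 249475.0

# Step 2: estimates of the regression parameters
b = sum_xy / sum_x2          # 19.96 (rounded)
a = y_bar - b * x_bar        # 4,374 (rounded)

def y_hat(x):
    """Estimated yield (kg/ha); valid only within the tested range of 0-150 kg N/ha."""
    if not 0 <= x <= 150:
        raise ValueError("extrapolation outside the tested range is not valid")
    return a + b * x

# Step 4: residual mean square and t test for beta = 0
s2 = (sum_y2 - sum_xy ** 2 / sum_x2) / (n - 2)    # 78,921
se_b = math.sqrt(s2 / sum_x2)                     # standard error of b
t_b = b / se_b                                    # 7.94, significant at the 5% level

# Step 5: 95% confidence interval for beta (tabular t = 4.303, 2 d.f.)
ci = (b - 4.303 * se_b, b + 4.303 * se_b)         # (9.15, 30.77)

print(round(y_hat(0)), round(y_hat(150)))   # 4374 7368
```

The guard in `y_hat` encodes feature (i) above: the fitted line says nothing about nitrogen rates outside 0 to 150 kg N/ha.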
□ Step 4. Test the significance of β:

• Compute the residual mean square as:

s² = [Σy² − (Σxy)²/Σx²] / (n − 2)

where the values of Σy², Σxy, and Σx² are those computed in step 1, as recorded in Table 9.1.

• Compute the t_b value as:

t_b = b / √(s²/Σx²)

• Compare the computed t_b value to the tabular t values of Appendix C, with (n − 2) degrees of freedom. β is judged to be significantly different from zero if the absolute value of the computed t_b value is greater than the tabular t value at the prescribed level of significance.

For our example, the residual mean square and the t_b value are computed as:

s² = [5,136,864 − (249,475)²/12,500] / (4 − 2) = 78,921

t_b = 19.96 / √(78,921/12,500) = 7.94

The tabular t values at the 5% and 1% levels of significance, with (n − 2) = 2 degrees of freedom, are 4.303 and 9.925, respectively. Because the computed t_b value is greater than the tabular t value at the 5% level of significance but smaller than the tabular t value at the 1% level, the linear response of rice yield to changes in the rate of nitrogen application, within the range of 0 to 150 kg N/ha, is significant at the 5% level of significance.

□ Step 5. Construct the (100 − α)% confidence interval for β as:

C.I. = b ± t_α √(s²/Σx²)

where t_α is the tabular t value, from Appendix C, with (n − 2) degrees of freedom and at the α level of significance.

For our example, the 95% confidence interval for β is computed as:

C.I.(95%) = b ± t.05 √(s²/Σx²)
          = 19.96 ± 4.303 √(78,921/12,500)
          = 19.96 ± 10.81
          = (9.15, 30.77)

Thus, the increase in grain yield for every 1 kg/ha increase in the rate of nitrogen applied, within the range of 0 to 150 kg N/ha, is expected to fall between 9.15 kg/ha and 30.77 kg/ha, 95% of the time.

□ Step 6. Test the hypothesis that α = α₀:

• Compute the t_a value as:

t_a = (a − α₀) / √[s²(1/n + X̄²/Σx²)]

• Compare the computed t_a value to the tabular t value, from Appendix C, with (n − 2) degrees of freedom and at a prescribed level of significance.
Reject the hypothesis that α = α₀ if the absolute value of the computed t_a value is greater than the corresponding tabular t value.

For our example, although there is probably no need to make the test of significance on α, we illustrate the test procedure by testing whether α (i.e., yield at 0 kg N/ha) is significantly different from 4,000 kg/ha. The t_a value is computed as:

t_a = (4,374 − 4,000) / √{78,921[1/4 + (75)²/12,500]} = 1.59

Because the t_a value is smaller than the tabular t value with (n − 2) = 2 degrees of freedom at the 5% level of significance of 4.303, the α value is not significantly different from 4,000 kg/ha.

9.1.1.2 Simple Linear Correlation Analysis. The simple linear correlation analysis deals with the estimation and test of significance of the simple linear correlation coefficient r, which is a measure of the degree of linear association between two variables X and Y.

Computation of the simple linear correlation coefficient is based on the amount of variability in one variable that can be explained by a linear function of the other variable. The result is the same whether Y is expressed as a linear function of X, or X is expressed as a linear function of Y. Thus, in the computation of the simple linear correlation coefficient, there is no need to specify which variable is the cause and which is the consequence, or to distinctly differentiate between the dependent and the independent variable, as is required in the regression analysis.

The value of r lies within the range of −1 and +1, with the extreme values indicating perfect linear association and the midvalue of zero indicating no linear association between the two variables. An intermediate value of r indicates the portion of variation in one variable that can be accounted for by the linear function of the other variable. For example, with an r value of .8, the implication is that 64% [100(r²) = 100(.8)² = 64] of the variation in the variable Y can be explained by the linear function of the variable X.
The minus or plus sign attached to the r value indicates the direction of change in one variable relative to the change in the other. That is, the value of r is negative when a positive change in one variable is associated with a negative change in the other, and positive when the two variables change in the same direction. Figure 9.4 illustrates graphically the various degrees of association between two variables as reflected in the r values.

[Figure 9.4 Graphical representations of various values of the simple correlation coefficient r.]

Even though a zero r value indicates the absence of a linear relationship between two variables, it does not indicate the absence of any relationship between them. It is possible for the two variables to have a nonlinear relationship, such as the quadratic form of Figure 9.5, with an r value of zero. This is why we prefer to use the word linear, as in simple linear correlation coefficient, instead of the more conventional names of simple correlation coefficient or merely correlation coefficient. The word linear emphasizes the underlying assumption of linearity in the computation of r.

[Figure 9.5 Illustration of a quadratic relationship between two variables Y and X that results in the simple linear correlation coefficient r being 0.]

The procedures for the estimation and test of significance of a simple linear correlation coefficient between two variables X and Y are:

□ Step 1. Compute the means X̄ and Ȳ, the corrected sums of squares Σx² and Σy², and the corrected sum of cross products Σxy, of the two variables, following the procedure in step 1 of Section 9.1.1.1.

□ Step 2. Compute the simple linear correlation coefficient as:

r = Σxy / √[(Σx²)(Σy²)]
□ Step 3. Test the significance of the simple linear correlation coefficient by comparing the computed r value of step 2 to the tabular r value of Appendix H, with (n − 2) degrees of freedom. The simple linear correlation coefficient is declared significant at the α level of significance if the absolute value of the computed r value is greater than the corresponding tabular r value at the α level of significance.

In agricultural research, there are two common applications of the simple linear correlation analysis:

• It is used to measure the degree of association between two variables with a well-defined cause and effect relationship that can be defined by the linear regression equation Y = α + βX.
• It is used to measure the degree of linear association between two variables in which there is no clear-cut cause and effect relationship.

We illustrate the linear correlation procedure with two examples, each representing one of the two types of application.

Example 1. We illustrate the association between response and treatment with the data used to illustrate the simple linear regression analysis in Section 9.1.1.1. Because the data was obtained from an experiment in which all other environmental factors except the treatments were kept constant, it is logical to assume that the treatments are the primary cause of variation in the crop response. Thus, we apply the simple linear correlation analysis to determine the strength of the linear relationship between crop response (represented by grain yield) as the dependent variable and treatment as the independent variable. The step-by-step procedures are:

□ Step 1. Compute the means, corrected sums of squares, and corrected sum of cross products of the two variables (nitrogen rate and yield), as shown in Table 9.1.

□ Step 2. Compute the simple linear correlation coefficient r as:

r = Σxy / √[(Σx²)(Σy²)]
  = 249,475 / √[(12,500)(5,136,864)]
  = .985

□ Step 3.
Compare the absolute value of the computed r value to the tabular r values with (n − 2) = 2 degrees of freedom, which are .950 at the 5% level of significance and .990 at the 1% level. Because the computed r value is greater than the tabular r value at the 5% level but smaller than the tabular r value at the 1% level, the simple linear correlation coefficient is declared significant at the 5% level of significance.

The computed r value of .985 indicates that 100(.985)² = 97% of the variation in the mean yield is accounted for by the linear function of the rate of nitrogen applied. The relatively high r value obtained is also indicative of the closeness between the estimated regression line and the observed points, as shown in Figure 9.2. Within the range of 0 to 150 kg N/ha, the linear relationship between mean yield and rate of nitrogen applied seems to fit the data adequately.

We add a note of caution here concerning the magnitude of the computed r value and its corresponding degrees of freedom. It is clear that the tabular r values in Appendix H decrease sharply with the increase in the degrees of freedom, which is a function of n (i.e., the number of pairs of observations used in the computation of the r value). Thus, the smaller n is, the larger the computed r value must be to be declared significant. In our example with n = 4, the seemingly high value of the computed r of .985 is still not significant at the 1% level. On the other hand, with n = 9, a computed r value of .8 would have been declared significant at the 1% level. Thus, the practical importance of the significance and the size of the r value must be judged in relation to the sample size n. It is, therefore, a good practice to always specify n in the presentation of regression and correlation results (for more discussion, see Section 9.4).

Example 2.
To illustrate the association between two responses, we use data on soluble protein nitrogen (variable X₁) and total chlorophyll (variable X₂) in the leaves, obtained from seven samples of the rice variety IR8 (Table 9.2). In this case, it is not clear whether there is a cause and effect relationship between the two variables and, even if there were one, it would be difficult to specify which is the cause and which is the effect. Hence, the simple linear correlation analysis is applied to measure the degree of linear association between the two variables without specifying the causal relationship.

Table 9.2 Computation of a Simple Linear Correlation Coefficient between Soluble Protein Nitrogen (X₁) and Total Chlorophyll (X₂) in the Leaves of Rice Variety IR8

Sample   Soluble Protein N,   Total Chlorophyll,   Deviate          Square of Deviate     Product of
Number   X₁, mg/leaf          X₂, mg/leaf          x₁       x₂      x₁²       x₂²         Deviates, x₁x₂
1        0.60                 0.44                 −0.37    −0.38   0.1369    0.1444      0.1406
2        1.12                 0.96                 0.15     0.14    0.0225    0.0196      0.0210
3        2.10                 1.90                 1.13     1.08    1.2769    1.1664      1.2204
4        1.16                 1.51                 0.19     0.69    0.0361    0.4761      0.1311
5        0.70                 0.46                 −0.27    −0.36   0.0729    0.1296      0.0972
6        0.80                 0.44                 −0.17    −0.38   0.0289    0.1444      0.0646
7        0.32                 0.04                 −0.65    −0.78   0.4225    0.6084      0.5070
Total    6.80                 5.75                 0.01ᵃ    0.01ᵃ   1.9967    2.6889      2.1819
Mean     0.97                 0.82

ᵃThe nonzero values are due only to rounding.

The step-by-step procedures are:

□ Step 1. Compute the means, corrected sums of squares, and corrected sum of cross products, following the procedure in step 1 of Section 9.1.1.1. Results are shown in Table 9.2.

□ Step 2. Compute the simple linear correlation coefficient r:

r = 2.1819 / √[(1.9967)(2.6889)] = .94

□ Step 3. Compare the absolute value of the computed r value to the tabular r values from Appendix H, with (n − 2) = 5 degrees of freedom, which are .754 at the 5% level of significance and .874 at the 1% level. Because the computed r value exceeds both tabular r values, we conclude that the simple linear correlation coefficient is significantly different from zero at the 1% probability level.
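Both examples can be verified with the same few lines of code. The sketch below defines a small helper (ours, not the book's) and applies it to the data of Tables 9.1 and 9.2; note that the result is unchanged if the two variables are interchanged:

```python
import math

def pearson_r(X, Y):
    """Simple linear correlation coefficient between paired lists X and Y."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # corrected SCP
    sxx = sum((x - mx) ** 2 for x in X)                   # corrected SS of X
    syy = sum((y - my) ** 2 for y in Y)                   # corrected SS of Y
    return sxy / math.sqrt(sxx * syy)

# Example 1: grain yield vs. nitrogen rate (Table 9.1)
r1 = pearson_r([0, 50, 100, 150], [4230, 5442, 6661, 7150])

# Example 2: soluble protein N vs. total chlorophyll (Table 9.2);
# pearson_r(X2, X1) would give exactly the same value
X1 = [0.60, 1.12, 2.10, 1.16, 0.70, 0.80, 0.32]
X2 = [0.44, 0.96, 1.90, 1.51, 0.46, 0.44, 0.04]
r2 = pearson_r(X1, X2)

print(round(r1, 3), round(r2, 2))   # 0.985 0.94
```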
This significant, high r value indicates that there is strong evidence that the soluble protein nitrogen and the total chlorophyll in the leaves of IR8 are highly associated with one another in a linear way: leaves with high soluble protein nitrogen also have high total chlorophyll, and vice versa.

9.1.1.3 Homogeneity of Regression Coefficients. In a single-factor experiment where only one factor is allowed to vary, the association between response and treatment is clearly defined. On the other hand, in a factorial experiment, where more than one factor varies, the linear relationship between response and a given factor may have to be examined over different levels of the other factors. For example, with data from a two-factor experiment involving four rice varieties and five levels of nitrogen fertilization, the linear regression of yield on nitrogen level may have to be examined separately for each variety. Or, with data from a three-factor experiment involving three varieties, four plant densities, and five levels of nitrogen fertilization, separate regressions between yield and nitrogen level may need to be estimated, one for each of the 3 × 4 = 12 treatment combinations of variety and plant density. Similarly, if the researcher is interested in examining the relationship between yield and plant density, he can estimate a regression equation for each of the 3 × 5 = 15 treatment combinations of variety and nitrogen. In the same manner, for experiments in a series of environments (i.e., at different sites, or in different seasons or years), the regression analysis may need to be applied separately for each experiment.

When several linear regressions are estimated, it is usually important to determine whether the various regression coefficients, or the slopes of the various regression lines, differ from each other.
For example, in a two-factor experiment involving variety and rate of nitrogen, it would be important to know whether the rate of change in yield for every incremental change in nitrogen fertilization varies from one variety to another. Such a question is answered by comparing the regression coefficients of the different varieties. This is referred to as testing the homogeneity of regression coefficients.

The concept of homogeneity of regression coefficients is closely related to the concept of interaction between factors, which is discussed in Chapter 3, Section 3.1, and Chapter 4, Section 4.1. Regression lines with equal slopes are parallel to one another, which also means that there is no interaction between the factors involved. In other words, the response to the levels of factor A remains the same over the levels of factor B.

Note also that homogeneity of regression coefficients does not imply equivalence of the regression lines. For two or more regression lines to coincide (one on top of another), both the regression coefficients β and the intercepts α must be homogeneous. In agricultural research, where regression analysis is usually applied to data from controlled experiments, researchers are generally more interested in comparing the rates of change (β) than the intercepts (α). However, if a researcher wishes to determine whether a single regression line can be used to represent several regression lines with homogeneous regression coefficients, the appropriate comparison of treatment means (at the X level of zero) can be made following the procedures outlined in Chapter 5. If the difference between these means is not significant, then a single regression line can be used.

We present procedures for testing the homogeneity of regression coefficients for two cases: one where only two regression coefficients are involved, and another where there are three or more regression coefficients. In doing this,
we concentrate on the procedure for two regression coefficients, which is most commonly used in agricultural research.

9.1.1.3.1 Two Regression Coefficients. The procedure for testing the hypothesis that β₁ = β₂ in two regression lines, represented by Y₁ = α₁ + β₁X and Y₂ = α₂ + β₂X, is illustrated using data of grain yield (Y) and tiller number (X) shown in Table 9.3. The objective is to determine whether the regression coefficients in the linear relationships between grain yield and tiller number are the same for the two varieties. The step-by-step procedures are:

□ Step 1. Apply the simple linear regression procedure of Section 9.1.1.1 to each of the two sets of data, one for each variety, to obtain the estimates of

Table 9.3 Computation of Two Simple Linear Regression Coefficients between Grain Yield (Y) and Tiller Number (X), One for Each of the Two Rice Varieties Milfor 6(2) and Taichung Native 1

Milfor 6(2)                        Taichung Native 1
Grain Yield,    Tillers,           Grain Yield,    Tillers,
kg/ha           no./m²             kg/ha           no./m²
4,862           160                5,380           293
5,244           175                5,510           325
5,128           192                6,000           332
5,052           195                5,840           342
5,298           238                6,416           342
5,410           240                6,666           378
5,234           252                7,016           380
5,608           282                6,994           410

X̄₁ = 217                           X̄₂ = 350
Ȳ₁ = 5,230                          Ȳ₂ = 6,228
Σx₁² = 12,542                       Σx₂² = 9,610
Σy₁² = 357,630                      Σy₂² = 2,872,046
Σx₁y₁ = 57,131                      Σx₂y₂ = 153,854
b₁ = 57,131/12,542 = 4.56           b₂ = 153,854/9,610 = 16.01
a₁ = 5,230 − (4.56)(217) = 4,240    a₂ = 6,228 − (16.01)(350) = 624
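Step 1 amounts to fitting a separate regression to each variety's data. A sketch, using the raw values of Table 9.3 (the helper function is ours):

```python
def fit_line(X, Y):
    """Least-squares estimates b (slope) and a (intercept) of Y on X."""
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    sxx = sum((x - x_bar) ** 2 for x in X)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
    b = sxy / sxx
    return b, y_bar - b * x_bar

# Table 9.3: grain yield (kg/ha) and tiller number (no./m^2)
milfor_x = [160, 175, 192, 195, 238, 240, 252, 282]
milfor_y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
tn1_x = [293, 325, 332, 342, 342, 378, 380, 410]
tn1_y = [5380, 5510, 6000, 5840, 6416, 6666, 7016, 6994]

b1, a1 = fit_line(milfor_x, milfor_y)
b2, a2 = fit_line(tn1_x, tn1_y)
print(round(b1, 2), round(b2, 2))   # 4.56 16.01
```

The intercepts computed from unrounded means (about 4,242 and 620) differ slightly from the table's 4,240 and 624, which use means rounded to 217 and 350.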
