The full utilization of quantitative data typically involves fitting the data to a mathematical model. Fitting is usually performed for one of two purposes. On the one hand, the scientist wishes to determine a physical quantity and a measure of its uncertainty from experimental data when there is a well-established relationship between variables. For example, Ohm's Law (V = IR) states that the voltage V across a resistor is directly proportional to the current I through it. Consequently, one can use measured values of V and I to determine the best value of the resistance R, a parameter. On the other hand, the goal might be to establish a mathematical relationship between a dependent variable, such as the toxicity of chlorinated hydrocarbons, and independent variables, such as their chemical and physical properties. A wide range of tools has been developed to solve these problems. The most frequently employed technique is the method of least squares, which will be discussed in this section and used in General Chemistry. In General Chemistry we shall discuss only the method of linear least squares, sometimes referred to as linear regression or simply least squares, because the applications in the lecture and laboratory will be limited to those cases in which there is a linear relationship, i.e. y = mx + b, between a single dependent variable and a single independent variable. Additional fitting procedures for more complicated problems are addressed in Chemistry 160.

Rationale of the method of linear least squares. Suppose that a student measures V in volts and I in amperes and wishes to determine the resistance in ohms from the data (cf. Table 1). The traditional approach would involve constructing a graph of V versus I, drawing the "best" line through the points, and calculating the slope of the graph, ∆V/∆I, which is the required value of R. This time-honored procedure suffers from a number of deficiencies.
First, what is meant by "best" is not clearly defined and the act of drawing the line through the points is subject to bias, particularly if the experimental data are noisy. Furthermore, the freehand procedure does not yield an estimate of the uncertainty of the slope. In careful work we require an objective procedure that reduces bias on the part of the researcher and yields a quantitative measure of the uncertainty of the parameters and the validity of the model. Statisticians have shown that the elementary method of least squares yields the best results if two conditions often satisfied in chemical experiments are met:
1) The principal source of error in the experiment is in the values of the dependent variable.
2) The errors in the dependent variable are random and fit a normal or Gaussian distribution.
Even if these conditions are not met exactly, a careful comparison with other techniques has shown that the method of least squares usually yields acceptable results when applied to a carefully prepared set of chemical data.

Table 1. V versus I for a Hypothetical Resistor

V (volt)    I (ampere)
 5.2        0.10
 9.8        0.20
15.4        0.30
20.1        0.40
24.6        0.50

Two attempts at drawing a freehand graph (Figures 1 and 2) demonstrate the basis for the method. In Figure 1, the line has clearly been poorly drawn with the result that most of the experimental points are far from the line. In contrast, the superior graph in Figure 2 is drawn so that the points are close to the line. Exact agreement should not be expected.
METHOD OF LEAST SQUARES PAGE 1
The demon noise is always with us and a perfect fit is either evidence for fraudulent data or egregious abuse of fitting procedures. Note, however, in Figure 2 that the deviations of the points from the line, sometimes called the residuals, are small and that the signs of the deviations are randomly distributed so that there is no pattern to which points are above the line and which are below. Clearly the well-trained draftsman draws a line which minimizes these deviations or some function of the deviations. Theorems from statistics show that the optimum result, i.e. the best-fit line, is obtained when the sum of the squares of the deviations, S, is minimized, provided that the assumptions outlined above are satisfied. Hence, the technique is referred to as the method of least squares. Furthermore, there is a bonus in the approach. Statisticians have also derived equations for the standard deviations of the parameters and other useful quantities that will be discussed in the following treatment of an illustrative example.

Figure 1. "Poor" Fit (plot of V (Volt) versus I (Ampere))
Figure 2. "Good" Fit (plot of V (Volt) versus I (Ampere))

Illustrative Example and Use of Excel. In General Chemistry we shall use the Analysis ToolPak add-in to Microsoft Excel to perform the linear regression calculations. In this section we shall use an illustrative example which will include a discussion of the results produced by Excel. This example will be demonstrated in class.
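Returning to the resistor data of Table 1, the least-squares criterion described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the course (which relies on Excel); the function names sum_of_squares and least_squares_line are made up for this sketch. It uses the standard closed-form expressions for the slope and intercept that minimize S, and then checks that a line with a slightly different slope gives a larger S.

```python
# Table 1 data: current I (ampere) is the independent variable,
# voltage V (volt) is the dependent variable.
current = [0.10, 0.20, 0.30, 0.40, 0.50]
voltage = [5.2, 9.8, 15.4, 20.1, 24.6]


def sum_of_squares(slope, intercept, x, y):
    """Return S, the sum of squared deviations of y from the line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


def least_squares_line(x, y):
    """Closed-form slope and intercept that minimize S for y = m*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# The slope of V versus I is the resistance R.
resistance, offset = least_squares_line(current, voltage)
s_best = sum_of_squares(resistance, offset, current, voltage)
# A line with a perturbed slope (hypothetical choice: +1 ohm) for comparison.
s_other = sum_of_squares(resistance + 1.0, offset, current, voltage)

print(f"R = {resistance:.1f} ohm, intercept = {offset:.2f} V")
print(f"S at best fit = {s_best:.3f}, S with perturbed slope = {s_other:.3f}")
```

Unlike the freehand line, the result does not depend on who draws it: the same data always give the same slope and intercept.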
Following this example is a second one that will be assigned as homework and will be due at the beginning of the following lab period.

Our first example is a set of student measurements in the gas phase of the temperature, pressure, volume, and mass of an unknown gas that displays small deviations from ideality. The goal is to obtain the molecular weight of the gas and a measure of its non-ideality. Before jumping in, some algebra is required. It is well known that an ideal gas obeys the equation pV = nRT, from which one can readily show that pV/nRT = 1. Nonideality can be modeled by adding a term linear in the pressure p, yielding equation (1)

    pV/nRT = 1 + Bp        (1)

where the coefficient B is called the virial coefficient. There is nothing mysterious about this equation. It is a simple example of the Maclaurin theorem in mathematics, which states that all well-behaved functions appear linear over a small range of the independent variable, p in our case. One more step is required, as we do not know the molecular weight. From stoichiometry we know that n = m/M, so with a bit of manipulation one can quickly derive equation (2):

    pV/mRT = (B/M)p + 1/M        (2)

Equation (2) states that for real gases a plot of pV/mRT, the dependent variable, versus p, the independent variable, is linear, i.e. an equation of the form y = mx + b, with a slope, B/M, and an intercept, 1/M. Note that a bit of mathematics must be performed to put the data into a linear format. In a new situation, the researcher may have to derive such a relationship. In our case, we shall provide the identity of the dependent and independent variables that are linearly related.

In the laboratory, the student collected samples of the gas in a 1.1153 liter bulb and obtained the data tabulated in the first three columns of Table 2. As a first step the pressures are converted from torr to atm and the dependent variable pV/mRT is calculated, yielding columns 4 and 5 in the table and the Excel spreadsheet.

Table 2. Gas Density Data for an Unknown Gas

p (torr)    T (K)    m (g)     p (atm)    pV/mRT (mole/g)
 46.6       296.2    0.5504    0.06118    0.005101
 76.3       296.2    0.8906    0.10040    0.005173
101.6       296.1    1.1780    0.13368    0.005209
144.1       297.0    1.6511    0.18961    0.005255
155.8       297.2    1.7871    0.20500    0.005246
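The conversion of the raw measurements into the variables of equation (2) can be sketched as follows. This is an illustration under stated assumptions: the handout does not say which value of the gas constant was used, so R = 0.082057 L atm/(mol K) is assumed here, and the function name reduced_density is invented for the sketch. Small differences from the tabulated values can arise from rounding and from the assumed value of R.

```python
R_GAS = 0.082057       # L atm / (mol K); assumed value, not stated in the handout
VOLUME_L = 1.1153      # bulb volume in liters, from the text
TORR_PER_ATM = 760.0

# (p in torr, T in K, m in g), the first three columns of Table 2
measurements = [
    (46.6, 296.2, 0.5504),
    (76.3, 296.2, 0.8906),
    (101.6, 296.1, 1.1780),
    (144.1, 297.0, 1.6511),
    (155.8, 297.2, 1.7871),
]


def reduced_density(p_torr, t_kelvin, mass_g):
    """Return (p in atm, pV/mRT in mole/g) for one gas sample."""
    p_atm = p_torr / TORR_PER_ATM
    y = p_atm * VOLUME_L / (mass_g * R_GAS * t_kelvin)
    return p_atm, y


rows = [reduced_density(*sample) for sample in measurements]
for (p_torr, t_k, m_g), (p_atm, y) in zip(measurements, rows):
    print(f"{p_torr:6.1f} torr -> p = {p_atm:.5f} atm, pV/mRT = {y:.6f} mole/g")
```

The two computed columns are the x and y values that go into the linear fit of equation (2).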
Spreadsheet data

p_torr    T_K      m_g       p_atm     pV/mRT
 46.6     296.2    0.5504    0.0613    0.00510925
 76.3     296.2    0.8906    0.1004    0.00517001
101.6     296.1    1.178     0.1337    0.00520649
144.1     297      1.6511    0.1896    0.00525254
155.8     297.2    1.7871    0.2050    0.00524330

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.969570181
R Square             0.940066336
Adjusted R Square    0.920088448
Standard Error       1.65701E-05
Observations         5

ANOVA
              df    SS             MS           F            Significance F
Regression    1     1.29199E-08    1.292E-08    47.055341    0.00634294
Residual      3     8.23704E-10    2.746E-10
Total         4     1.37436E-08

             Coefficients    Standard Error    t Stat       P-value       Lower 95%      Upper 95%     Lower 95.0%    Upper 95.0%
Intercept    0.005065899     2.04056E-05       248.25997    1.4412E-07    0.005000959    0.00513084    0.005000959    0.005130839
p_atm        0.000945072     0.000137772       6.8596896    0.00634294    0.00050662     0.00138352    0.00050662     0.001383524

RESIDUAL OUTPUT
Observation    Predicted pV/mRT    Residuals
1              0.005123847         -1.45938E-05
2              0.005160779          9.2353E-06
3              0.00519224           1.42491E-05
4              0.005245089          7.44744E-06
5              0.005259639         -1.63381E-05

p_atm Line Fit Plot (pV/mRT and Predicted pV/mRT versus p_atm, with a linear trendline through the predicted values)
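For readers who want to see where the numbers in the Excel printout come from, the following is a minimal sketch that recomputes them from the p_torr and pV/mRT columns of the spreadsheet. It is not Excel's own code; it uses the standard textbook formulas for simple linear regression, and all variable names are chosen for this illustration. Because the input columns are rounded, the results agree with the printout only to several significant digits.

```python
import math

p_torr = [46.6, 76.3, 101.6, 144.1, 155.8]
x = [p / 760.0 for p in p_torr]   # p_atm column (independent variable)
y = [0.00510925, 0.00517001, 0.00520649, 0.00525254, 0.00524330]  # pV/mRT

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# Least-squares parameters
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Predicted values and residuals (RESIDUAL OUTPUT)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Sums of squares (ANOVA)
ss_residual = sum(r ** 2 for r in residuals)
ss_total = sum((yi - y_mean) ** 2 for yi in y)
ss_regression = ss_total - ss_residual
dof = n - 2                                   # two parameters were determined

# Regression statistics (SUMMARY OUTPUT)
r_squared = ss_regression / ss_total
std_error = math.sqrt(ss_residual / dof)      # standard deviation of the residuals
f_value = ss_regression / (ss_residual / dof)

# Standard deviations of the parameters
se_slope = std_error / math.sqrt(s_xx)
se_intercept = std_error * math.sqrt(1.0 / n + x_mean ** 2 / s_xx)

print(f"intercept = {intercept:.9f} +/- {se_intercept:.3e}")
print(f"slope     = {slope:.9f} +/- {se_slope:.3e}")
print(f"R^2 = {r_squared:.4f}, standard error = {std_error:.3e}, F = {f_value:.1f}")
```

The P-values and the Significance F entry require the t and F distributions and are not reproduced in this sketch.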
When the Excel regression (linear least squares) routine is run with pV/mRT as the dependent (y) variable and p as the independent variable, the information presented above is obtained. (We will demonstrate how to run this analysis in class.) Detailed written instructions for the homework example are also provided in the following pages. The format of the presentation has been edited somewhat so that the data would fit on one page. The significance of the numbers in the report is given below.

1) SUMMARY OUTPUT. The information in this section, which is appended to the right of the spreadsheet, indicates how well the data fit the linear model. "Multiple R" provides the linear correlation coefficient, R. If there is no relationship at all, R is zero. If there is a statistically significant relationship, R is close in magnitude to 1. The sign of R is given by the sign of the slope. In our case, R is 0.962 and the hypothesis that the gas deviates measurably from ideality is supported. The square of R, which is 0.925 in this example, has a straightforward interpretation. R^2 gives that fraction of the variation in the dependent variable that is explained by the model. If R^2 were a small number, we would have to reject the results of the entire calculation and consider another model. Some statisticians adjust R in the case of a small sample size and calculate the adjusted R (third entry in the output). We shall not use this statistic. The fourth entry, the standard error or the standard deviation of the residuals, measures the goodness of fit. This quantity, which is 1.99 x 10^-5, is based on the deviations of the data points from the line. It is calculated by dividing the sum of the squares of the deviations by the number of degrees of freedom and by taking the square root of this quotient. The number of degrees of freedom equals the number of data points (the last entry in the section) minus the number of parameters. We determined two parameters, a slope and an intercept, in this case so the number of degrees of freedom is 5 - 2 = 3. A small standard error, like all standard deviations, indicates that the model is good and the errors in the values of the dependent variable are small. The standard error is also used to determine whether the rejection of a suspect data point is justified. If a datum is contaminated by systematic error, the standard error will decrease significantly when the suspected datum is removed from the dataset. In close cases where bias might influence the decision to retain or reject a datum, one uses a quantitative procedure called the F test, which is based on the standard error.

2) ANOVA. ANOVA stands for analysis of variance. This statistic is used to determine if the proposed relationship is statistically significant. The most important statistic in this section is the F value, 38.8359, which tests whether the data support the hypothesis that there is a relationship between the dependent variable and the independent variable. The column labeled "significance of F" is the probability that the observed fit could be generated by random means alone. Note that this probability, 0.009878, is a small number. Any probability larger than 0.05 is considered to be unacceptably large. A new model for the relationship or better data should be sought in such cases.

3) Coefficients. Assuming that we have a statistically significant fit, the numbers that we shall normally use will be found here. The row labeled "Intercept" contains statistics for the intercept unless the line is forced through the origin, an option which you use in the colorimetric analysis experiment. In this case, N/A (not applicable) will appear in the columns of this row.
The first numerical column in the "Intercept" row gives the value of the intercept. The second number yields the standard deviation or error of the intercept. Recall that we use the standard deviation of a parameter to determine the number of significant digits in the value to be reported. If one compares the two numbers, one notes that the 10^-5 place is the first digit with significant error. This is the least significant digit (marked with an * below) and all digits to the right are insignificant.

                                             *
intercept                              0.005059
standard deviation of the intercept    0.0000245

Hence the value is rounded off to 0.00506. Note that the standard deviation of the intercept is a small fraction of the intercept so the intercept is a well-defined statistic. The column labeled P-value is the probability that a non-zero value of the intercept could be generated by random means alone. That is, if P were 0.05 or greater, there would be a 5% or greater chance that the results are spurious and we would have to reject the analysis. Note that P is very small. To obtain the 95% confidence interval of the intercept, we multiply the standard deviation of the intercept by the Student's t, which is 3.182 for 5 data, 2 parameters, and 3 degrees of freedom, and obtain (3.182)(2.45 x 10^-5) = 7.8 x 10^-5. The final result is reported as (5.06 ± 0.078) x 10^-3 mole/g. The final data in the row yield the actual interval defined by the value of the intercept and the 95% confidence interval, i.e. the intercept minus the 95% CI and the intercept plus the 95% CI.

The next row, labeled by the independent variable name, provides the same information for the slope, m = 0.00010 mole/g-atm. The 95% CI of the slope is calculated by multiplying the Student's t by the standard deviation of the slope, (3.182)(0.000165) = 0.00053, so the result is reported as 0.00010 ± 0.00056 mole/g-atm. Note that the relative standard deviation of the slope is much larger than that of the intercept and the slope can only be reported to 2 significant digits. The slope is not as well determined. This should come as no surprise. Gases deviate very slightly from ideal-gas behavior and measuring small quantities is a difficult task.

Recall that our goal was the determination of the gas' molecular weight, which we obtain by taking the reciprocal of the intercept: M = 1/(0.005059 mole/g), or about 198 g/mole. With the use of the calculus, one can calculate a standard deviation for the molecular weight from the standard deviation of the intercept. This derivation, which is called a propagation of errors analysis, is covered in Physics 51 and Chemistry 160.

4) RESIDUAL OUTPUT. This section contains a table of the predicted values of the dependent variable and the residuals. The predicted values are calculated using the equations, the least-squares parameters, and the value of the independent variable. The residuals or deviations give the differences between the observed and calculated values of the dependent variable. If all goes well, the signs of the residuals should be randomly distributed. A pattern to these signs is an indication of systematic error and the need to modify the equation. A large residual marks a suspect datum.
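The confidence-interval arithmetic and the conversion of the intercept into a molecular weight can be sketched as below, using the intercept and standard deviation quoted in the text. The propagation-of-errors step is an assumption on our part: it uses the usual first-order formula for M = 1/b, namely sd(M) = sd(b)/b^2, which the handout mentions but does not write out. The variable names are illustrative only.

```python
intercept = 0.005059        # mole/g, value quoted in the text
sd_intercept = 0.0000245    # mole/g, standard deviation of the intercept
student_t = 3.182           # 95% level, 3 degrees of freedom (from the text)

# 95% confidence interval of the intercept
ci_intercept = student_t * sd_intercept
lower = intercept - ci_intercept
upper = intercept + ci_intercept

# Molecular weight is the reciprocal of the intercept (equation 2: b = 1/M)
molar_mass = 1.0 / intercept

# First-order propagation of errors for M = 1/b (assumed formula):
# dM/db = -1/b**2, so sd_M = sd_b / b**2
sd_molar_mass = sd_intercept / intercept ** 2
ci_molar_mass = student_t * sd_molar_mass

print(f"95% CI of intercept: +/- {ci_intercept:.2e} mole/g "
      f"({lower:.6f} to {upper:.6f})")
print(f"M = {molar_mass:.1f} g/mole, sd = {sd_molar_mass:.2f}, "
      f"95% CI = +/- {ci_molar_mass:.1f} g/mole")
```

As with the intercept itself, the uncertainty obtained this way determines how many digits of the molecular weight are worth reporting.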
Variations on the Theme and Extensions. Our main focus in General Chemistry is on fitting data to a known equation and determining parameters such as the absorption coefficient in the colorimetric manganese experiment. In fields such as drug design, the focus shifts to developing models with predictive value and a different approach called cross-validated statistics is used. More advanced packages such as NCSS, which is used in Chemistry 160 and Mathematics 57, permit handling more than one independent variable as well as weighting the data points. Weighting is important if the data are of uneven quality. The NCSS package can also solve problems in which the relationship between the dependent and independent variable is intrinsically non-linear. All of these advanced topics are considered in Chemistry 160 with the aid of the NCSS package.
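To give a rough idea of what weighting the data points means, here is a minimal sketch of a weighted straight-line fit. It is not how NCSS implements weighting; it assumes the common convention that each point is weighted by 1/sigma^2, where sigma is the estimated uncertainty of that point. The function name weighted_line_fit and the uncertainties assigned to the Table 1 data are hypothetical.

```python
def weighted_line_fit(x, y, sigma):
    """Fit y = m*x + b, weighting each point by 1/sigma**2 (assumed convention)."""
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Table 1 data (I in ampere, V in volt)
current = [0.10, 0.20, 0.30, 0.40, 0.50]
voltage = [5.2, 9.8, 15.4, 20.1, 24.6]

# Equal uncertainties reproduce the ordinary (unweighted) least-squares line.
equal = weighted_line_fit(current, voltage, [0.2] * 5)

# Hypothetical case: the last reading is five times less reliable than the rest,
# so it pulls on the line much less strongly.
uneven = weighted_line_fit(current, voltage, [0.2, 0.2, 0.2, 0.2, 1.0])

print(f"equal weights:  slope = {equal[0]:.2f}, intercept = {equal[1]:.2f}")
print(f"uneven weights: slope = {uneven[0]:.2f}, intercept = {uneven[1]:.2f}")
```

When all the uncertainties are equal the weights cancel, which is why the unweighted method is adequate for data of even quality.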
Least Squares Exercise Using Microsoft Excel. The following exercise illustrates the method of least squares and regression calculations with Excel. Complete the following exercises on your own and turn in the printouts and answers to the questions at the beginning of your next lab period.

1) Log on to the campus network. Start the Excel software. It will automatically open with a new spreadsheet.

2) In the first column, enter the label "T_C" in the first row and in rows 2-12 the following values of the temperature in degrees Centigrade: 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100. This will be the independent variable in the first model. Label the second column "p_torr" and use Appendix F in your laboratory manual to enter in rows 2-12 the vapor pressure in torr of water at the 11 temperatures defined above.

3) Label columns 3-6 "T_K", "1/T_K", "p_atm", and "ln(p_atm)", respectively. Use the spreadsheet functions (cell definition, copy, and paste) to load columns 3-6 with the values of the absolute temperature, the inverse of the absolute temperature, the vapor pressure in atmosphere, and the natural log of the vapor pressure in atmosphere.

4) Consider as the first model a linear dependence of vapor pressure on Centigrade temperature. Use the following set of instructions to perform a least squares fit with vapor pressure in atm as the dependent variable and Centigrade temperature as the independent variable.

a) To access the Regression window, click on Tools in the spreadsheet command line to access the Tools menu and then on Data Analysis. Select Regression from the Data Analysis menu and click on OK. (In the event that the regression features have not been installed, further steps are required. To install the features, click on Tools, then Add-Ins, and Analysis ToolPak and Analysis ToolPakVBA, and OK. You may wish to seek assistance from a consultant at OIT in performing this step.)

b) When the regression window appears, select the following options by clicking on the appropriate boxes: labels, confidence level, residuals, and line fit plot. Do not select the "Constant Is Zero" option. This option is used in cases where the intercept is known to be zero, e.g. the Colorimetric Manganese experiment. Select the values of the dependent or Y variable by first clicking in the field for Input Y Range and then selecting the spreadsheet column that contains the values of the dependent variable and its label. Similarly select the values of the independent or X variable by clicking in the field for Input X Range and then selecting the spreadsheet column that contains the values of the independent variable and its label. In general, you don't have to use all the data in a column. You can use spreadsheet techniques to choose the data selectively.
c) You would like to have the regression results on the same page as the input data. To achieve this result, select the option Output Range, click in the field for Output Range and then in the leftmost, top spreadsheet cell which is empty. This will be the G1 cell in your case. This step informs Excel of the location of the regression results.

d) Initiate the calculations by clicking on OK. In a few seconds the regression results and the graph will appear on the screen.

e) You will probably want to edit the graph produced by Excel as it is not quite in the format required in your lab reports. The following steps will achieve this goal.

i. You want the graph to fill or nearly fill a full page. Force Excel to show the page boundaries by clicking on Page Setup in the File menu and then OK. Then move the graph into the first empty page and change its size to fill the page. To this end, click on the graph and a frame will appear around its perimeter. Grab the framed graph with the mouse and move the graph to its new location. When the mouse is at one of the four corners, the sizing tool will appear which can be used to change the size of the graph.

ii. Add the best-fit line. Excel calls this the trendline. Activate the graph (chart in Excelese) for editing by clicking on it. Click on any one of the symbols marking a predicted value of the dependent variable. From the Chart menu, choose Add Trendline. A Trendline menu will now appear. Select the Linear trendline on the Type tab. The best-fit line is drawn between the first and last points. To extend the line so that it covers the entire graph, access the Options tab and adjust the forward and backward values of the Forecast parameter. For example, if the X value of the leftmost point on the graph is 0.2 and you want the line to extend to X = 0, the backward forecast value should be 0.2. Click on OK when you have set all the options and the best-fit line will be drawn.

iii. In some cases, you may wish to customize your graph further. Double click on the portion of the graph you'd like to edit and an options menu should appear. You can now edit the caption for the graph and the axes. By clicking on the symbols for the data points in the legend box on the right, you can change the symbol type and color. I usually choose the largest possible symbol for the measured values of the dependent variable and change the size and color of the symbol for the calculated values so that the latter merges with the best-fit line. You can always exit the editing mode by clicking on an empty cell outside of the graph.

f) Once the graph has been edited, print the spreadsheet by clicking the Print item under the File menu.
g) Save the spreadsheet on your diskette or user space. Insert a formatted diskette in the a: drive. Click on the Save As item under the File menu. The hard drive always comes up as the default drive so you have to select the a: drive if you're saving to diskette. Also provide a file name and then execute the save by clicking on OK. Once you have saved a spreadsheet, you can always read it later by clicking on the Open item under the File menu.

5) Now that the calculations and graphics have been done, you can analyze the results. Do the statistics support the conjecture that vapor pressure depends on temperature? Is the linear model a good one? How do you know? What are the least-squares values of the slope and intercept and the associated 95% confidence intervals to the appropriate number of significant digits? Answer these questions directly on your printout or on a separate page and turn in with the printout of your results. Don't forget units and significant digits.

6) Refinement of the model. We deliberately started with a poor model so that you could recognize the characteristics of a "dog". One can show from thermodynamics that the natural logarithm of vapor pressure depends linearly on the inverse of the absolute temperature, so repeat the analysis using the better model.

a) Re-open the Regression window (recall: click on Tools, Data Analysis, and Regression) and select the natural log of the vapor pressure in atm as the dependent variable and the inverse of the absolute temperature as the independent variable. All of the options in the Regression window have been preset in the previous calculation. Choose a new output destination cell to see the new analysis without overwriting the previous results.

b) Repeat the regression calculations. After you click OK, you will be asked if you want to erase the old output.

c) Save and print out your results for the second analysis.

d) Is the second model better? How do you know? As above, report the values of the slope and intercept with the associated 95% confidence intervals along with the answers to the above questions and turn in with the printout of your results.

7) When you are done, exit Excel by clicking on the Exit item under the File menu. Log off of the network unless you want the next user to have access to your user space.

least_sq.doc WES, 3 July 1997; updated January 2, 2002, JMT
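As a cross-check on the spreadsheet formulas of step 3 of the exercise, the four derived columns can be sketched in Python as follows. This is only an illustration: the function name derived_columns is invented here, the Appendix F vapor pressures are not reproduced, and the single row shown uses the normal boiling point of water (100 degrees Centigrade, 760 torr), for which the answers are known in advance.

```python
import math


def derived_columns(t_celsius, p_torr):
    """Return (T_K, 1/T_K, p_atm, ln(p_atm)) for one row of the spreadsheet."""
    t_kelvin = t_celsius + 273.15      # absolute temperature
    inv_t = 1.0 / t_kelvin             # inverse of the absolute temperature
    p_atm = p_torr / 760.0             # vapor pressure in atmosphere
    ln_p = math.log(p_atm)             # natural log of the vapor pressure
    return t_kelvin, inv_t, p_atm, ln_p


# Check row: at 100 C the vapor pressure of water is 760 torr, i.e. 1 atm,
# so ln(p_atm) must come out as zero.
t_k, inv_t, p_atm, ln_p = derived_columns(100.0, 760.0)
print(f"T_K = {t_k:.2f}, 1/T_K = {inv_t:.6f}, p_atm = {p_atm:.4f}, ln(p_atm) = {ln_p:.4f}")
```

A row with a known answer like this one is a convenient way to confirm that the copied cell formulas refer to the intended columns.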