# Nonlinear Curve Fitting

Earl F. Glynn
Scientific Programmer Bioinformatics
11 Oct 2006


Nonlinear Curve Fitting
• Mathematical Models
• Nonlinear Curve Fitting Problems
  – Mixture of Distributions
  – Quantitative Analysis of Electrophoresis Gels
  – Fluorescence Correlation Spectroscopy (FCS)
  – Fluorescence Recovery After Photobleaching (FRAP)
• Linear Curve Fitting
• Nonlinear Curve Fitting
  – Gaussian Case Study
  – Math
  – Algorithms
  – Software
• Analysis of Results
  – Goodness of Fit: R²
  – Residuals
• Summary

Mathematical Models
• Want a mathematical model to describe observations based on the independent variable(s) under experimental control
• Need a good understanding of underlying biology, physics, chemistry of the problem to choose the right model
• Use Curve Fitting to “connect” observed data to a mathematical model

Nonlinear Curve Fitting Problems
Mixture Distribution Problem
[Figure: Heming Lake Pike: Length Distribution. Probability Density versus Length [cm].]

Nonlinear Curve Fitting Problems
Mixture Distribution Problem
[Figure: Heming Lake Pike: Distribution by Age Groups. Probability Density versus Length [cm].]
Normal Probability Density Function
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Coefficient of Variation
$$cv = \frac{\sigma}{\mu}$$
Data are fitted by five normal distributions with constant coefficient of variation.
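As a rough illustration of the model on this slide, the sketch below evaluates a mixture of normal densities whose standard deviations are tied to the means through a constant coefficient of variation. The component means, mixing weights, and cv value are made-up placeholders, not the fitted Heming Lake values.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mixture_pdf(x, means, weights, cv):
    """Weighted sum of normal densities with sigma_k = cv * mu_k (constant cv)."""
    return sum(w * normal_pdf(x, mu, cv * mu) for mu, w in zip(means, weights))

# Hypothetical values for illustration only (five age groups, lengths in cm).
means = [23.0, 33.0, 41.0, 50.0, 60.0]
weights = [0.10, 0.45, 0.30, 0.10, 0.05]   # mixing proportions, sum to 1
cv = 0.08

density_at_35 = mixture_pdf(35.0, means, weights, cv)

# Crude check that the mixture integrates to about 1 (rectangle rule, 0 to 100 cm).
step = 0.01
area = sum(mixture_pdf(i * step, means, weights, cv) * step for i in range(10000))
```

Tying each sigma to its mean leaves one shared cv to estimate instead of five separate standard deviations, which is the point of the constant-cv assumption.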

Nonlinear Curve Fitting Problems
Quantitative Analysis of Electrophoresis Gels
Deconvolve a pixel profile of a banding pattern into a family of Gaussian or Lorentzian curves
Das, et al. RNA (2005). 11:348
Takamato, et al. Nucleic Acids Research. 32(15). 2004. p. 2
http://papakilo.icmb.utexas.edu/cshl-2005/lectures/CSHL_Lecture05_khodursky.ppt#23

Nonlinear Curve Fitting Problems
Quantitative Analysis of Electrophoresis Gels
Many proposed functional forms besides Gaussian or Lorentzian curves
DiMarco and Bombi. Mathematical functions for the representation of chromatographic peaks. Journal of Chromatography A. 931(2001). 1-30.

Nonlinear Curve Fitting Problems
Fluorescence Correlation Spectroscopy (FCS)
Bacia, Kim & Schwille. “Fluorescence cross-correlation spectroscopy in living cells.” Nature Methods. Vol 3. No 2. Feb. 2006. p. 86.

Nonlinear Curve Fitting Problems
Fluorescence Correlation Spectroscopy (FCS)
Note likely heteroscedasticity in data
From discussion by Joe Huff at Winfried Wiegraebe’s Lab Meeting, 11 Aug 2006

Nonlinear Curve Fitting Problems
Fluorescence Recovery After Photobleaching (FRAP)
From discussion by Juntao Gao at Rong Li’s Lab Meeting, 25 Sept 2006

Linear Curve Fitting
• Linear regression
• Polynomial regression
• Multiple regression
• Stepwise regression
• Logarithm transformation

Linear Curve Fitting
Linear Regression: Least Squares
Given data points $(x_i, y_i)$, we want the “best” straight line, $(x_i, \hat{y}_i)$, through these points, where $\hat{y}_i$ is the “fitted” value at point $x_i$:
$$\hat{y}_i = a + b\,x_i$$

Linear Curve Fitting
Linear Regression: Least Squares
[Figure: Data $(x_i, y_i)$ and Linear Fit $\hat{y}_i = a + b\,x_i$, plotted as y versus x.]
Error Function
$$\chi^2(a, b) = \sum_{i=1}^{N} \left[ y_i - (a + b \cdot x_i) \right]^2$$
Assume homoscedasticity (same variance)

Linear Curve Fitting
Linear Regression: Least Squares
Search (a,b) parameter space to minimize error function, χ².
[Figure: Error Function χ² plotted as a surface over a and b.]
$$\chi^2(a, b) = \sum_{i=1}^{N} \left[ y_i - (a + b \cdot x_i) \right]^2$$
$$\chi^2(1.2, 0.9) = 1.9$$
Linear Fit
$$\hat{y}_i = 1.2 + 0.9\,x_i$$
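The search over (a,b) can be sketched as a brute-force grid scan. This is only a minimal illustration using the five data points from the worked example in these slides; the grid limits and the 0.1 spacing are arbitrary assumptions.

```python
# Data points from the worked example in these slides.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 4, 5]

def chi_square(a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Scan a coarse grid and remember the smallest chi-square seen.
best = None
for i in range(-10, 31):          # a from -1.0 to 3.0 in steps of 0.1
    for j in range(-10, 31):      # b from -1.0 to 3.0 in steps of 0.1
        a, b = i / 10, j / 10
        value = chi_square(a, b)
        if best is None or value < best[0]:
            best = (value, a, b)

best_chi2, best_a, best_b = best
print(best_chi2, best_a, best_b)
```

A grid scan only finds the minimum to within the grid spacing and its cost grows quickly with the number of parameters, which is why a direct solution is preferable for linear problems.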

Linear Curve Fitting
Linear Regression: Least Squares
[Figure: Least Squares Line y = 1.2 + 0.9x, plotted as y versus x.]
How can (a,b) parameters be found directly without a search?

Linear Curve Fitting
Linear Regression: Least Squares
How can (a,b) parameters be found directly without a search?
• Differentiate χ² with respect to parameters a and b
• Set derivatives to 0.
$$\frac{\partial \chi^2}{\partial a} = -2 \sum_{i=1}^{N} \left( y_i - a - b \cdot x_i \right) = 0$$
$$\frac{\partial \chi^2}{\partial b} = -2 \sum_{i=1}^{N} x_i \left( y_i - a - b \cdot x_i \right) = 0$$

Linear Curve Fitting
Linear Regression: Least Squares
How can (a,b) parameters be found directly without a search?
Simultaneous Linear Equations
$$\begin{bmatrix} N & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
Linear Fit
$$\hat{y}_i = a + b\,x_i$$

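Linear Curve Fitting
Linear Regression: Least Squares
How can (a,b) parameters be found directly without a search?

| i | x | y | x² | xy |
|---|---|---|----|----|
| 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 3 | 1 | 3 |
| 3 | 2 | 2 | 4 | 4 |
| 4 | 3 | 4 | 9 | 12 |
| 5 | 4 | 5 | 16 | 20 |
| Sum | 10 | 15 | 30 | 39 |

Simultaneous Linear Equations
$$\begin{bmatrix} N & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
$$\begin{bmatrix} 5 & 10 \\ 10 & 30 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 15 \\ 39 \end{bmatrix}$$
$$a = \frac{\begin{vmatrix} 15 & 10 \\ 39 & 30 \end{vmatrix}}{\begin{vmatrix} 5 & 10 \\ 10 & 30 \end{vmatrix}} = \frac{15 \cdot 30 - 39 \cdot 10}{5 \cdot 30 - 10 \cdot 10} = \frac{60}{50} = 1.2$$
$$b = \frac{\begin{vmatrix} 5 & 15 \\ 10 & 39 \end{vmatrix}}{\begin{vmatrix} 5 & 10 \\ 10 & 30 \end{vmatrix}} = \frac{5 \cdot 39 - 10 \cdot 15}{5 \cdot 30 - 10 \cdot 10} = \frac{45}{50} = 0.9$$
Linear Fit
$$\hat{y}_i = a + b\,x_i$$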
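The same arithmetic can be checked in a few lines. The sketch below builds the sums from the table and applies Cramer's rule to the 2×2 system; the variable names are illustrative only.

```python
# Data points from the table in the worked example.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 4, 5]

n = len(xs)
sum_x = sum(xs)                                  # 10
sum_y = sum(ys)                                  # 15
sum_xx = sum(x * x for x in xs)                  # 30
sum_xy = sum(x * y for x, y in zip(xs, ys))      # 39

# Cramer's rule on the 2x2 simultaneous linear equations.
det = n * sum_xx - sum_x * sum_x                 # 5*30 - 10*10 = 50
a = (sum_y * sum_xx - sum_xy * sum_x) / det      # intercept
b = (n * sum_xy - sum_x * sum_y) / det           # slope
```

No starting guess or iteration is needed here, which is what makes the linear case deterministic.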

Linear Curve Fitting
Linear Regression: Least Squares
R solution using lm (linear model)

    > x <- 0:4
    > y <- c(1,3,2,4,5)
    > summary( lm(y ~ x) )

    Call:
    lm(formula = y ~ x)

    Residuals:
       1    2    3    4    5
    -0.2  0.9 -1.0  0.1  0.2

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.2000     0.6164   1.947   0.1468
    x             0.9000     0.2517   3.576   0.0374 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.7958 on 3 degrees of freedom
    Multiple R-Squared: 0.81,      Adjusted R-squared: 0.7467
    F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739

Linear Curve Fitting
Linear Regression: Least Squares
Assume homoscedasticity ($\sigma_i$ = constant = 1)
Assume heteroscedasticity
$$\chi^2(a, b) = \sum_{i=1}^{N} \left[ \frac{y_i - (a + b \cdot x_i)}{\sigma_i} \right]^2$$
Often weights $\sigma_i$ are assumed to be 1.
Experimental measurement errors can be used if known.
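A minimal sketch of the weighted fit follows, assuming each point is weighted by 1/σ_i² in the normal equations. The function name and the σ values in the second call are hypothetical; only the five data points come from the slides.

```python
def weighted_line_fit(xs, ys, sigmas):
    """Fit y = a + b*x by minimizing sum(((y - a - b*x) / sigma)**2)."""
    ws = [1.0 / s ** 2 for s in sigmas]          # weights 1/sigma_i^2
    s_w = sum(ws)
    s_x = sum(w * x for w, x in zip(ws, xs))
    s_y = sum(w * y for w, y in zip(ws, ys))
    s_xx = sum(w * x * x for w, x in zip(ws, xs))
    s_xy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = s_w * s_xx - s_x * s_x
    a = (s_xx * s_y - s_x * s_xy) / det
    b = (s_w * s_xy - s_x * s_y) / det
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 4, 5]

# All sigmas equal to 1 reproduces the ordinary least squares line.
a_equal, b_equal = weighted_line_fit(xs, ys, [1.0] * 5)

# Hypothetical case: the last measurement is three times noisier than the rest.
a_weighted, b_weighted = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 1.0, 3.0])
```

Down-weighting the noisy point shifts the fitted line, so the choice of σ_i matters whenever the measurement errors are known to differ.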

Nonlinear Curve Fitting
Gaussian Case Study

| x | y |
|------|---------|
| -2.0 | 0.00004 |
| -1.5 | 0.00055 |
| -1.0 | 0.00472 |
| -0.5 | 0.02739 |
| 0.0 | 0.10748 |
| 0.5 | 0.28539 |
| 1.0 | 0.51275 |
| 1.5 | 0.62335 |
| 2.0 | 0.51275 |
| 2.5 | 0.28539 |
| 3.0 | 0.10748 |
| 3.5 | 0.02739 |
| 4.0 | 0.00472 |
| 4.5 | 0.00055 |
| 5.0 | 0.00004 |

[Figure: Gaussian Data, y versus x.]
Normal Probability Density Function
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Nonlinear Curve Fitting
Gaussian Case Study
[Figure: χ² surface over mu and sigma. Minimum at μ = 1.5, σ = 0.8.]
Gradient descent works well only inside “valley” here
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
$$\chi^2(\mu, \sigma) = \sum_{i=1}^{N} \left[ y_i - f(x_i) \right]^2$$
Assume homoscedasticity

Nonlinear Curve Fitting
Gaussian Case Study
Derivatives may be useful for estimating parameters
[Figure: Single Gaussian. Three panels versus x: y, 1st Derivative y', 2nd Derivative y''.]
U:/efg/lab/R/MixturesOfDistributions/SingleGaussian.R
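One way to read parameters off the derivatives, sketched here under stated assumptions: for a single Gaussian, y' crosses zero at μ and y'' crosses zero at μ − σ and μ + σ. The code uses noise-free synthetic data with made-up parameters and central finite differences; real data would usually need smoothing first.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical "true" parameters used to generate a clean curve.
true_mu, true_sigma = 0.5, 1.2
h = 0.01
xs = [-4 + i * h for i in range(801)]            # x from -4 to 4
ys = [normal_pdf(x, true_mu, true_sigma) for x in xs]

# Central finite differences for the first and second derivatives.
d1 = [(ys[i + 1] - ys[i - 1]) / (2 * h) for i in range(1, len(ys) - 1)]
d2 = [(ys[i + 1] - 2 * ys[i] + ys[i - 1]) / h ** 2 for i in range(1, len(ys) - 1)]
inner_x = xs[1:-1]

def zero_crossings(values, positions):
    """Positions where consecutive values change sign (linear interpolation)."""
    found = []
    for i in range(len(values) - 1):
        v0, v1 = values[i], values[i + 1]
        if v0 == 0.0:
            found.append(positions[i])
        elif v0 * v1 < 0:
            t = v0 / (v0 - v1)
            found.append(positions[i] + t * (positions[i + 1] - positions[i]))
    return found

mu_est = zero_crossings(d1, inner_x)[0]          # y' = 0 at the peak
inflections = zero_crossings(d2, inner_x)        # y'' = 0 at mu - sigma and mu + sigma
sigma_est = (inflections[-1] - inflections[0]) / 2
```

Estimates obtained this way are rough, but they can serve as the initial guess that an iterative fit requires.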

Nonlinear Curve Fitting
Gaussian Case Study
Derivatives may be useful for determining number of terms
[Figure: Heming Lake Pike. Three panels versus x: y, 1st Derivative y', 2nd Derivative y''.]

Nonlinear Curve Fitting
Math
Given data points $(x_i, y_i)$.
Given desired model to fit (not always known): $y = y(x; \mathbf{a})$, where there are M unknown parameters: $a_k$, $k = 1, 2, \ldots, M$.
The error function (“merit function”) is
$$\chi^2(\mathbf{a}) = \sum_{i=1}^{N} \left[ \frac{y_i - y(x_i; \mathbf{a})}{\sigma_i} \right]^2$$
From Press, et al., Numerical Recipes in C (2nd Ed), 1992, p. 682

Nonlinear Curve Fitting
Math
Need to search multidimensional parameter space to minimize error function, χ²

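Nonlinear Curve Fitting
Math
Gradient of χ² with respect to parameters $\mathbf{a}$ will be zero at the minimum:
$$\frac{\partial \chi^2}{\partial a_k} = -2 \sum_{i=1}^{N} \frac{\left[ y_i - y(x_i; \mathbf{a}) \right]}{\sigma_i^2} \frac{\partial y(x_i; \mathbf{a})}{\partial a_k}, \qquad k = 1, 2, \ldots, M$$
The sum is $\beta_k$ (after dropping “−2”).
Taking the second derivative of χ²:
$$\frac{\partial^2 \chi^2}{\partial a_k \partial a_l} = 2 \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ \frac{\partial y(x_i; \mathbf{a})}{\partial a_k} \frac{\partial y(x_i; \mathbf{a})}{\partial a_l} - \left[ y_i - y(x_i; \mathbf{a}) \right] \frac{\partial^2 y(x_i; \mathbf{a})}{\partial a_l \partial a_k} \right]$$
The term containing the second derivative of y is often small and ignored.
$\alpha_{kl}$ = Hessian or “curvature” matrix (after dropping “2”)
From Press, et al., Numerical Recipes in C (2nd Ed), 1992, p. 682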

Nonlinear Curve Fitting
Algorithms
• Levenberg-Marquardt is most widely used algorithm:
  – When “far” from minimum, use gradient descent: $\Delta a_l = \text{constant} \cdot \beta_l$
  – When “close” to minimum, switch to inverse Hessian: $\sum_{l=1}^{M} \alpha_{kl}\, \Delta a_l = \beta_k$
• “Full Newton-type” methods keep dropped term in second derivative – considered more robust but more complicated
• Simplex is an alternative algorithm
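The two regimes can be combined by damping the diagonal of the curvature matrix, as in the following minimal sketch of a Levenberg-Marquardt style loop for the two-parameter normal density. It is an illustration under stated assumptions, not the implementation used by any of the packages named in these slides: the data are synthetic and noise-free (generated with μ = 1.5, σ = 0.8), all σ_i are taken as 1, and the starting guess, damping factor, and tolerances are arbitrary.

```python
import math

def gaussian(x, mu, sigma):
    """Normal probability density, the model y(x; a) with a = (mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def chi_square(xs, ys, mu, sigma):
    """Unweighted merit function (all sigma_i taken as 1)."""
    return sum((y - gaussian(x, mu, sigma)) ** 2 for x, y in zip(xs, ys))

def levenberg_marquardt(xs, ys, mu, sigma, lam=1e-3, max_iter=200, tol=1e-10):
    chi2 = chi_square(xs, ys, mu, sigma)
    for _ in range(max_iter):
        # Accumulate beta_k and alpha_kl (second-derivative term dropped).
        b_mu = b_sig = a_mm = a_ms = a_ss = 0.0
        for x, y in zip(xs, ys):
            f = gaussian(x, mu, sigma)
            df_dmu = f * (x - mu) / sigma ** 2
            df_dsig = f * ((x - mu) ** 2 / sigma ** 3 - 1.0 / sigma)
            r = y - f
            b_mu += r * df_dmu
            b_sig += r * df_dsig
            a_mm += df_dmu * df_dmu
            a_ms += df_dmu * df_dsig
            a_ss += df_dsig * df_dsig
        # Damp the diagonal: large lam behaves like gradient descent,
        # small lam like the inverse-Hessian step.
        m11, m22 = a_mm * (1 + lam), a_ss * (1 + lam)
        det = m11 * m22 - a_ms * a_ms
        if det == 0.0:
            break
        step_mu = (b_mu * m22 - a_ms * b_sig) / det
        step_sig = (m11 * b_sig - a_ms * b_mu) / det
        if max(abs(step_mu), abs(step_sig)) < tol:
            break
        trial_mu, trial_sig = mu + step_mu, sigma + step_sig
        trial_chi2 = chi_square(xs, ys, trial_mu, trial_sig) if trial_sig > 0 else float("inf")
        if trial_chi2 < chi2:      # accept the step and reduce the damping
            mu, sigma, chi2 = trial_mu, trial_sig, trial_chi2
            lam /= 10
        else:                      # reject the step and increase the damping
            lam *= 10
    return mu, sigma, chi2

# Synthetic, noise-free data (an assumption for illustration, not the slide's table).
xs = [-2.0 + 0.5 * i for i in range(15)]
ys = [gaussian(x, 1.5, 0.8) for x in xs]
mu_fit, sigma_fit, chi2_fit = levenberg_marquardt(xs, ys, mu=1.0, sigma=1.0)
```

Starting from a guess far outside the “valley” of the χ² surface, the same loop may stall or converge elsewhere, which is why a reasonable initial guess matters.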

Nonlinear Curve Fitting
Algorithms
• Fitting procedure is iterative
• Usually need “good” initial guess, based on understanding of selected model
• No guarantee of convergence
• No guarantee of optimal answer
• Solution requires derivatives: numeric or analytic can be used by some packages
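To make the last bullet concrete, the sketch below compares an analytic derivative of the normal density with respect to μ against a central-difference approximation. The evaluation point and the step size h are arbitrary choices for illustration.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def analytic_dmu(x, mu, sigma):
    """Exact partial derivative of the density with respect to mu."""
    return gaussian(x, mu, sigma) * (x - mu) / sigma ** 2

def numeric_dmu(x, mu, sigma, h=1e-5):
    """Central-difference approximation of the same derivative."""
    return (gaussian(x, mu + h, sigma) - gaussian(x, mu - h, sigma)) / (2 * h)

x, mu, sigma = 1.0, 1.5, 0.8
difference = abs(analytic_dmu(x, mu, sigma) - numeric_dmu(x, mu, sigma))
```

Numeric derivatives avoid hand derivation but cost extra function evaluations and depend on a sensible step size; analytic derivatives are exact but must be worked out for each model.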

Nonlinear Curve Fitting
Software
IDL: curvefit function; MPFIT: Robust non-linear least square curve fitting (3 limited licenses)
• Joe Huff in Advanced Instrumentation is quite well versed in using MPFIT and applying it in IDL
Mathematica (1 limited license)
MatLab: Curve Fitting Toolbox (1 limited license)
OriginPro: Peak Fitting Module (10 limited licenses)
PeakFit: Nonlinear curve fitting for spectroscopy, chromatography and electrophoresis (1 limited license)
R: nls function
• many statistics
• symbolic derivatives (if desired)
• flawed implementation: exact “toy” problems fail unless “noise” added

Nonlinear Curve Fitting
Software
NIST reference datasets with certified computational results
http://www.itl.nist.gov/div898/strd/general/dataarchive.html

Analysis of Results
• Goodness of Fit: R²
• Residuals

Goodness of Fit: R²
Coefficient of Determination
Percentage of Variance Explained
$$R^2 = 1 - \frac{\text{Residual Sum of Squares (RSS)}}{\text{Total Sum of Squares (SS) [Corrected for Mean]}}$$
$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
$$0 \le R^2 \le 1$$
• “Adjusted” R² compensates for higher R² as terms added.
• A “good” value of R² depends on the application.
• In biological and social sciences with weakly correlated variables, and considerable noise, R² ~ 0.6 might be considered good.
• In physical sciences in controlled experiments, R² ~ 0.6 might be considered low.
Faraway, Linear Models with R, 2005, p.16-18
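As a small check of these definitions, the sketch below computes R² and adjusted R² for the straight-line example from earlier in the slides (fit y = 1.2 + 0.9x). The adjusted form used here, 1 − (1 − R²)(n − 1)/(n − p) with p fitted parameters including the intercept, is the usual textbook definition and is stated as an assumption since the slide does not spell it out.

```python
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 4, 5]
a, b = 1.2, 0.9                                   # least squares line from the worked example

fitted = [a + b * x for x in xs]
y_mean = sum(ys) / len(ys)

rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # residual sum of squares
tss = sum((y - y_mean) ** 2 for y in ys)              # total sum of squares, corrected for mean
r2 = 1 - rss / tss

n, p = len(ys), 2                                 # p counts the intercept and the slope
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
```

These values agree with the Multiple R-Squared and Adjusted R-squared lines reported by lm for the same data.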

Residuals
• Residuals are estimates of the true and unobservable errors.
• Residuals are not independent (they sum to 0).
“Curve fitting made easy,” Marko Ledvij, The Industrial Physicist, April/May 2003.
http://www.aip.org/tip/INPHFA/vol-9/iss-2/p24.html

Analysis of Residuals
• Are residuals random?
• Is mathematical model appropriate?
• Is mathematical model sufficient to characterize the experimental data?
• Subtle behavior in residuals may suggest significant overlooked property
Good Reference: “Analysis of Residuals: Criteria for Determining Goodness-of-Fit,” Straume and Johnson, Methods in Enzymology, Vol. 210, 87-105, 1992.

Analysis of Residuals
Synthetic FRAP Data: Fit with 1 term when 2 terms are better
Near “perfect” fit, but why is there a pattern in the residuals?

Analysis of Residuals
Lomb-Scargle periodogram can indicate “periodicity” in the residuals
Flat line with all “bad” p-values would indicate “random” residuals

Analysis of Residuals
Synthetic FRAP Data: Fit with 2 terms

Analysis of Residuals
FCS Data and Heteroscedasticity
$$\chi^2(\mathbf{a}) = \sum_{i=1}^{N} \left[ \frac{y_i - y(x_i; \mathbf{a})}{\sigma_i} \right]^2$$
Scaling Factor: $\sigma_i$
[Figure panels: Heteroscedasticity in Residuals; Scaled Residuals.]
Use F Test to test for unequal variances
FCS Residual Plots Courtesy of Joseph Huff, Advanced Instrumentation & Physics

Analysis of Residuals
Heteroscedasticity and Studentized Residuals
• Studentized residual is a residual divided by an estimate of its standard deviation
• The “leverage” $h_{ii}$ is the ith diagonal entry of a “hat matrix.”
$$\text{Studentized Residual} = \frac{\hat{\varepsilon}_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}$$
• Externally Studentized Residuals follow Student’s t-distribution.
• Can be used to statistically reject “outliers”
See http://en.wikipedia.org/wiki/Studentized_residual
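For a straight-line fit the leverages have a simple closed form, h_ii = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², so the formula above can be sketched directly. The example reuses the five-point linear fit from earlier in the slides and computes internally studentized residuals, using the overall residual standard error for σ̂; the externally studentized variant would re-estimate σ̂ with each point left out.

```python
import math

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 4, 5]
a, b = 1.2, 0.9                                   # least squares line from the worked example

n, p = len(xs), 2
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Leverages for simple linear regression (diagonal of the hat matrix).
x_mean = sum(xs) / n
s_xx = sum((x - x_mean) ** 2 for x in xs)
leverages = [1.0 / n + (x - x_mean) ** 2 / s_xx for x in xs]

# Residual standard error with n - p degrees of freedom.
sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / (n - p))

studentized = [r / (sigma_hat * math.sqrt(1 - h)) for r, h in zip(residuals, leverages)]
```

Points at the ends of the x range have the highest leverage, so the same raw residual counts for more there than in the middle of the data.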

Summary
• A mathematical model may or may not be appropriate for any given dataset.
• Linear curve fitting is deterministic.
• Nonlinear curve fitting is non-deterministic, involves searching a huge parameter space, and may not converge.
• Nonlinear curve fitting is powerful (when the technique works).
• The R² and adjusted R² statistics provide easy to understand dimensionless values to assess goodness of fit.
• Always study residuals to see if there may be unexplained patterns and missing terms in a model.
• Beware of heteroscedasticity in your data. Make sure the analysis doesn’t assume homoscedasticity if your data are not homoscedastic.
• Use F Test to compare the fits of two equations.

Acknowledgements
Advanced Instrumentation & Physics
• Joseph Huff
• Winfried Wiegraebe