Intro:
Method of least squares:
In statistical data analysis there are many powerful tools and methods that allow us to fit a
curve to a set of data, making it possible to predict how a system will behave for inputs
outside the limited range of the data.
One such method is the method of least squares. Suppose we have some independent variable
x whose value is known to a very high degree of accuracy, or rather, it can be assumed that
there is no uncertainty in the measurement of x. This input is related to a dependent variable y,
which is subject to random fluctuations of typical size σ, and a total of N measurements is taken.
Because of this uncertainty, we do not know how far each measured value lies from its "true"
value, only that it typically lies within ±σ of it.
With the set of data points (x_i, y_i, σ_i), where i = 1, 2, …, N, we can hypothesise the
shape of a best-fit curve f(x). As an example, let us assume a linear curve of the form
f(x; θ) = θ_0 + θ_1 x, where the θ are adjustable parameters. In general the curve will not pass
exactly through the data, so each point has a deviation or residual y_i − f(x_i; θ).
Some points naturally have a large standard deviation and can therefore sit further from the
curve, while others have a small standard deviation and sit closer to it. The absolute size of a
residual is therefore not a good indicator of whether the fit is poor. To account for this, we
define a normalised or standardised residual (y_i − f(x_i; θ))/σ_i. Since a standardised residual
can be either positive or negative, the quantity is squared so that deviations in either direction
contribute equally.
In the method of least squares, a quantity χ² is defined as
χ²(θ) = Σ_{i=1}^{N} [(y_i − f(x_i; θ)) / σ_i]².
Although the method of least squares is a powerful tool, it has some limitations. One major
issue is that it assumes the fluctuations in the data follow a Gaussian distribution, which may
not always be the case. It also uses every data point, so outliers can skew the result.
The parameters θ that minimise χ² are called estimators and are denoted θ̂. In general it is
very difficult to minimise χ² algebraically, so numerical minimisation techniques must be
employed. This is the general theory behind the method of least squares.
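As an illustration, a minimal sketch of such a numerical minimisation for a straight-line model might look like the following (assuming numpy and scipy are available; the data values here are purely illustrative):

import numpy as np
from scipy.optimize import minimize

def f(x, theta):
    # hypothesised linear model f(x; theta) = theta0 + theta1*x
    return theta[0] + theta[1] * x

def chi_squared(theta, x, y, sig):
    # sum of squared standardised residuals
    return np.sum(((y - f(x, theta)) / sig) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
sig = np.array([0.2, 0.3, 0.2])

result = minimize(chi_squared, x0=[0.0, 1.0], args=(x, y, sig))
theta_hat = result.x  # the estimators that minimise chi-squared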
Statistical Error in Fitted Parameters:
The estimators found via the method of least squares are themselves subject to statistical
errors, because the measurements y_i are independent random variables: if y were measured
again at the same points x_i, different values would be obtained due to the random fluctuations.
This results in an uncertainty in the value of the estimators found from any given set of
(x, y, σ).
To quantify these fluctuations, we define a quantity called the covariance. For two random
variables a and b it is defined as
cov(a, b) = ⟨ab⟩ − ⟨a⟩⟨b⟩,
where ⟨·⟩ denotes the expectation value.
In the case of more than two variables, in our case a set of m estimators, a covariance matrix
can be defined with elements V_jk = cov(θ̂_j, θ̂_k). The diagonal elements of this matrix are
simply the variances of the estimators, and it is these variances that allow us to quote an error
on each fitted value.
The problem with the definition above is that it requires many repeated measurements in order
to calculate the expectation values. A better way of estimating the matrix, requiring only one
set of measurements, is to use the curvature of χ² at its minimum,
(V^-1)_jk = (1/2) ∂²χ² / ∂θ_j ∂θ_k, evaluated at θ = θ̂.
Despite this being only an estimate and not entirely accurate, for most practical purposes it is
good enough.
There may or may not be a relationship between two estimators. The correlation coefficient
allows us to see how strongly correlated two estimators are, using their covariance and
individual variances:
ρ_jk = V_jk / sqrt(V_jj V_kk),
which takes values between −1 and +1.
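In practice such a covariance matrix is often obtained directly from the fitting routine. As a minimal sketch (assuming scipy's curve_fit, which returns an estimated covariance matrix; the data values are illustrative only):

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a + b * x

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 3.1, 3.9, 5.2])
sig = np.array([0.2, 0.2, 0.3, 0.3])

popt, pcov = curve_fit(model, x, y, sigma=sig, absolute_sigma=True)
errors = np.sqrt(np.diag(pcov))             # standard errors from the diagonal variances
rho = pcov[0, 1] / (errors[0] * errors[1])  # correlation coefficient of the two estimators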
Error Propagation of f(x; θ):
Now that values for the estimators have been found, the fitted function f(x; θ̂) is complete.
However, because the estimators themselves carry an uncertainty, any function built from them
will also be uncertain. The simplest way to estimate the uncertainty of the fitted function is
linear error propagation, which uses the covariance matrix to obtain the variance of f(x; θ̂):
σ_f²(x) ≈ Σ_{j,k} (∂f/∂θ_j)(∂f/∂θ_k) V_jk,
with the derivatives evaluated at θ = θ̂.
It should be pointed out that even if the propagated uncertainty of a fit is large, that does not
necessarily mean the fit is incorrect; the best way to judge the goodness of a fit is its χ².
Linear error propagation is also slightly problematic in that it treats f as approximately linear
in the parameters, which is not true for every functional form. This can be worked around by
staying close to the range of the data, where the linear approximation holds reasonably well.
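A minimal sketch of this propagation for a straight-line model f = a + bx, assuming a 2x2 covariance matrix V for (a, b) has already been obtained from a fit (the matrix below is illustrative only):

import numpy as np

V = np.array([[0.09, -0.01],
              [-0.01, 0.004]])   # illustrative covariance matrix for (a, b)

def sigma_f(x, V):
    jac = np.array([np.ones_like(x), x])           # df/da = 1, df/db = x
    var_f = np.einsum('jn,jk,kn->n', jac, V, jac)  # sum_jk (df/dtheta_j)(df/dtheta_k) V_jk
    return np.sqrt(var_f)

x_plot = np.linspace(0.0, 10.0, 101)
band = sigma_f(x_plot, V)   # one-sigma uncertainty on the fitted curve at each x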
Goodness of fit from the minimum of χ²:
If f(x; θ̂) were a good fit, it would seem reasonable to expect y_i − f(x_i; θ̂) to be of the order
of σ_i, so each normalised residual would be of order unity. From this one might expect χ² to be
roughly equal to the number of measurements N. This is not quite the case, however, because
there are adjustable parameters in the normalised residuals.
If, for example, there were N measurements and also N adjustable parameters, it would be
possible to find a functional form that passes exactly through every point, giving χ²_min = 0.
In reality, since χ²_min is a function of random variables, it too is subject to random
fluctuations. Assuming the correct form of f has been found, and that the y_i follow a Gaussian
distribution, χ²_min follows the chi-squared distribution P(χ²; n), where n is a quantity known
as the number of degrees of freedom. For a data set of N measurements and m adjustable
parameters θ, n = N − m.
Plotting this distribution shows the probability of obtaining any given value of χ²_min. It is also
useful to define a p-value,
p = ∫_{χ²_min}^{∞} P(χ²; n) dχ²,
which gives the probability of obtaining the observed χ²_min or higher.
For a low probability (p < 0.05) it would be reasonable to reject the hypothesis, since there is a
1-in-20 chance or lower of obtaining such a χ²_min if the hypothesis were correct. For a high
probability (p > 0.05) it would be reasonable to retain the hypothesised function.
As a general rule, a hypothesis with χ²_min within a few multiples of the degrees of freedom n
and a high p-value is likely to be acceptable. These specific p-value thresholds are, however,
entirely arbitrary, and it is entirely possible to obtain a small p-value even when the hypothesis
is correct.
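A short sketch of how such a p-value can be evaluated numerically (assuming scipy; the χ²_min value here is illustrative):

from scipy.stats import chi2

chi2_min = 14.0   # illustrative minimised chi-squared
N, m = 9, 4       # number of measurements and adjustable parameters
n = N - m         # degrees of freedom

p_value = chi2.sf(chi2_min, n)  # survival function: integral of P(chi^2; n) from chi2_min to infinity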
Simple Fits:
To start off, we are given sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([2.7, 3.9, 5.5, 5.8, 6.5, 6.3, 7.7, 8.5, 8.7])
sig = np.array([0.3, 0.5, 0.7, 0.6, 0.4, 0.3, 0.7, 0.8, 0.5])
In this case, we will use an Mth-order polynomial, i.e. one with M + 1 adjustable parameters, for
M = 1, 2, 3:
y = α + βx
y = α + βx + γx²
y = α + βx + γx² + δx³
Using Python, numerical minimisation can be used to find the coefficients for each of these
hypotheses. Furthermore, using the error propagation formula above, it becomes possible to
examine the uncertainty in the fitted curves. Lastly, χ²_min is calculated to see how well each
fit describes the data.
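A minimal sketch of how these fits might be carried out with scipy's curve_fit (the results quoted below came from an equivalent numerical minimisation):

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import chi2 as chi2_dist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([2.7, 3.9, 5.5, 5.8, 6.5, 6.3, 7.7, 8.5, 8.7])
sig = np.array([0.3, 0.5, 0.7, 0.6, 0.4, 0.3, 0.7, 0.8, 0.5])

def poly(x, *coeffs):
    # polynomial with coefficients (alpha, beta, gamma, ...)
    return sum(c * x**k for k, c in enumerate(coeffs))

for M in (1, 2, 3):
    popt, pcov = curve_fit(poly, x, y, p0=np.ones(M + 1), sigma=sig, absolute_sigma=True)
    chi2_min = np.sum(((y - poly(x, *popt)) / sig) ** 2)
    n = len(x) - (M + 1)
    print(M, popt, np.sqrt(np.diag(pcov)), chi2_min, chi2_dist.sf(chi2_min, n))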
Linear Hypothesis:
The fit parameters of the linear fit are α = 2.58 ± 0.29 and β = 0.741 ± 0.057. The propagated
error in the fit appears to be relatively small; this is to be expected, since the hypothesised
function is linear and so shows only a small growth in error outside the data range under linear
error propagation. Using the values above, the fitted function is
y = 2.58 + 0.741x.
This curve passes through 6 of the 9 points within one standard deviation, which does not
appear to be a great fit. The χ²_min for this hypothesis is 72, much higher than its number of
degrees of freedom (7). To make this conclusion more robust, we calculate the p-value of this
χ²_min, which turns out to be 5×10^-13. This is incredibly small, making it very unlikely that the
data have a linear relationship.
Quadratic Hypothesis:
The fit parameters of the quadratic fit are α = 1.88 ± 0.43, β = 0.99 ± 0.22 and γ = −0.028
± 0.024. The propagated error in the fit is small in the region of the data points, but grows
quickly the further from the data the prediction is taken. This is understandable: the fitted
curve is quadratic, so linear error propagation becomes increasingly inaccurate far from the
data. Using the values above, the fitted function is
y = 1.88 + 0.99x − 0.028x².
The curve passes through 7 of the 9 points within one standard deviation. This is a better fit
than the previous one, but is it likely to be the correct one? The χ²_min for this fit is 35, which
is closer to its number of degrees of freedom (6) than in the linear case. However, the evaluated
p-value turns out to be 1×10^-13, effectively ruling out a quadratic relationship as the form this
data takes.
Cubic Hypothesis:
The fit parameters of the cubic fit are α = 0.60 ± 0.85, β = 2.43 ± 0.85, γ = −0.38 ± 0.20 and
δ = 0.0233 ± 0.013. The propagated error in the fit is very small in the region of the data
points, but grows quickly the further from the data the prediction is taken; as with the
quadratic fit, this is understandable given that the fitted curve is cubic. Using the values
above, the fitted function is
y = 0.60 + 2.43x − 0.38x² + 0.0233x³.
This curve passes through all of the points within one standard deviation, much better than the
previous hypotheses. Here χ²_min = 14 and the p-value is 0.018. χ²_min is less than three times
the number of degrees of freedom (5), and the p-value is high enough to be within the realm of
possibility. Because the p-value is still below 0.05, however, it may be worth trying a quartic
fit for comparison.
Galileo’s Ball and Ramp:
[Figure: diagram of the ball-and-ramp setup]
In 1608, Galileo released a ball from a height h on a ramp and measured the horizontal distance
d it travelled before hitting the ground, in units of punto. Given the accuracy of the available
measuring equipment, an error of 1-2% can be expected; here an uncertainty of σ = 15 punto
on d was used, whilst the height is assumed to have negligible uncertainty.
As in the previous section, a hypothesised functional form is required for the data. Again we
choose three (a fitting sketch follows the list):
d = αh
d = αh + βh²
d = αh^β
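As a sketch, the power-law form could be fitted as follows (the h and d arrays are placeholders standing in for the actual measurements, with the assumed σ = 15 punto uncertainty):

import numpy as np
from scipy.optimize import curve_fit

h = np.array([300.0, 600.0, 800.0, 828.0, 1000.0])     # placeholder heights in punto
d = np.array([800.0, 1172.0, 1328.0, 1340.0, 1500.0])  # placeholder distances in punto
sig = np.full_like(d, 15.0)                            # assumed uncertainty on d

def power_law(h, alpha, beta):
    return alpha * h**beta

popt, pcov = curve_fit(power_law, h, d, p0=[1.0, 0.5], sigma=sig, absolute_sigma=True)
alpha_hat, beta_hat = popt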
Linear Hypothesis:
For the linear fit, the estimator is α = 1.66 ± 0.12. Because there is only one adjustable
parameter, the covariance matrix is simply 1×1; in this instance its single entry is 0.013. The
curve passes through only a single point, indicating that the fit is very poor. This is backed up
by χ²_min = 84 and a p-value of 3.2×10^-17: the number of degrees of freedom is 4, and this
χ²_min is much greater than that, while the p-value is so small that a linear relationship is
realistically implausible.
Quadratic Hypothesis:
For the quadratic fit, the estimators are α = 2.79 ± 0.22 and β = 1.35×10^-3 ± 2.6×10^-4, with
the resulting covariance matrix [matrix]. This curve also passes through only one point, but it
lies far closer to the data and better matches its shape. Here χ²_min = 8 and the p-value = 0.04;
given that the number of degrees of freedom is 3, this seems much more plausible. The p-value
is still quite small, but it does not rule out this hypothesis.
The estimators have a correlation coefficient of −0.98, implying that they are strongly
anticorrelated.
Power-Law Hypothesis:
For the power-law fit, the estimators are α = 43.8 ± 5.7 and β = 0.511 ± 0.018, with the resulting
covariance matrix [matrix]. This curve appears to pass through every point, which could indicate
that this is the functional form of the data. However, χ²_min = 11 and the p-value = 0.01, with
3 degrees of freedom. Comparing this χ²_min and p-value with the quadratic fit, it appears that
although the curve passes through more points, it is less likely to be the correct functional form
of the data.
The estimators have a correlation coefficient of −0.999, i.e. they are almost completely
anticorrelated.
Ptolemy Experiment:
In 140 AD, Ptolemy carried out experiments on refraction using a circular copper disc half
submerged in water, logging the incident and refracted angles: [table]
At the time Snell's law was not known, so one estimate of the functional form of refraction was
θr = αθi, although Ptolemy himself preferred θr = αθi − βθi².
To find the coefficients, we again use the method of least squares.
Linear Hypothesis:
For the linear fit, the estimator is α = 0.667 ± 0.016. The curve passes through 3 points but is
reasonably close to the rest; this approximation makes sense since, for small angles, sin x ≈ x.
Here χ² is 145, far too high for the number of degrees of freedom (7), and the p-value is
4×10^-28. Although this hypothesis somewhat fits the data, it is not the functional form.
Quadratic Hypothesis:
For the quadratic fit, the estimators are α = 0.8315 ± 0.0077 and β = 0.00258 ± 7.7×10^-5. The
curve passes through every point and is an excellent fit: χ² = 0.78 and the p-value = 0.99.
According to the method of least squares, it would appear that the true form of this data is in
fact quadratic. However, we know this not to be true, as the true relationship is given by
Snell's law below.
Snell’s Law Hypothesis:
Snell's law dictates that the functional relationship between θi and θr is
θr = sin⁻¹(sin θi / r),
where r = nr/ni is the ratio of refractive indices.
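A minimal sketch of fitting this form (the angle arrays are placeholders standing in for Ptolemy's table, converted to radians; the assumed angular uncertainty is illustrative):

import numpy as np
from scipy.optimize import curve_fit

theta_i = np.radians(np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]))  # placeholder incident angles
theta_r = np.radians(np.array([8.0, 15.5, 22.5, 29.0, 35.0, 40.5, 45.5, 50.0]))   # placeholder refracted angles
sig = np.radians(np.full_like(theta_r, 0.5))                                      # assumed angular uncertainty

def snell(theta_i, r):
    return np.arcsin(np.sin(theta_i) / r)

popt, pcov = curve_fit(snell, theta_i, theta_r, p0=[1.3], sigma=sig, absolute_sigma=True)
r_hat = popt[0]   # estimate of the ratio of refractive indices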
Fitting this form by the method of least squares, the curve, like the quadratic fit, appears to
pass through almost every point and looks to be a good fit for the data. The estimator is
r = 1.310 ± 0.002, which agrees with the known refractive index of water (1.33) to two
significant figures, suggesting that the data come from a real experiment. Calculating χ² and
the p-value, they are found to be 13 and 0.05 respectively.
This is strange: judging purely by this method, it would appear that the true law of refraction
has a quadratic form. The reason for this odd discrepancy is that both χ² and the p-value assume
the fluctuations in the data follow a Gaussian distribution, which is clearly not the case here.
This is an example of how the method of least squares can lead to an incorrect conclusion.