Intro:
Method of least squares:
In statistical data analysis there are many powerful tools and methods that allow us to fit a
curve to a set of data, making it possible to predict how a system will behave for inputs
outside the limited range of the data.
One such method is the method of least squares. Suppose we have some independent variable
x whose value is known to a very high degree of accuracy, or rather, it can be assumed that
there is no uncertainty in the measurement of x. This input is related to a dependent variable y,
which is subject to random fluctuations of typical size σ, and a total of N measurements is taken.
Because of this uncertainty, we do not know how far each measured value lies from its "true"
value, only that it typically lies within ±σ of it.
With the set of data points (x_i, y_i, σ_i), where i = 1, 2, …, N, we can hypothesise the
shape of a best-fit curve f(x). As an example, let us assume a linear curve of the form
f(x; θ) = θ_0 + θ_1 x, where the θ are adjustable parameters. In general the curve will not pass
exactly through the data, so each point has a deviation or residual y_i − f(x_i; θ).
Some points naturally have a large standard deviation and can therefore sit further from the
curve, while others have a small standard deviation and sit closer to it. The absolute size of a
residual is therefore not a good indicator of whether the fit is poor. To account for this, we
define a normalised or standardised residual (y_i − f(x_i; θ))/σ_i. Since a standardised residual
can be either positive or negative, the quantity is squared so that deviations in either direction
contribute equally.
In the method of least squares, a quantity χ² is defined as
χ²(θ) = Σ_{i=1}^{N} [(y_i − f(x_i; θ)) / σ_i]².
Although the method of least squares is a powerful tool, it has some limitations. One major
issue is that it assumes the fluctuations in the data follow a Gaussian distribution, which may
not always be the case. It also uses every data point, so outliers can skew the result.
The parameters θ that minimise χ² are called estimators and are denoted θ̂. In general it is
very difficult to minimise χ² algebraically, so numerical minimisation techniques must be
employed. This is the general theory behind the method of least squares.
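As an illustration, a minimal sketch of such a numerical minimisation for a straight-line model might look like the following (assuming numpy and scipy are available; the data values here are purely illustrative):

import numpy as np
from scipy.optimize import minimize

def f(x, theta):
    # hypothesised linear model f(x; theta) = theta0 + theta1*x
    return theta[0] + theta[1] * x

def chi_squared(theta, x, y, sig):
    # sum of squared standardised residuals
    return np.sum(((y - f(x, theta)) / sig) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
sig = np.array([0.2, 0.3, 0.2])

result = minimize(chi_squared, x0=[0.0, 1.0], args=(x, y, sig))
theta_hat = result.x  # the estimators that minimise chi-squared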
Statistical Error in Fitted Parameters:
The estimators found via the method of least squares are themselves subject to statistical
errors, because the measurements y_i are independent random variables: if y were measured
again at the same points x_i, different values would be obtained due to the random fluctuations.
This results in an uncertainty in the value of the estimators found from any given set of
(x, y, σ).
To quantify these fluctuations, we define a quantity called the covariance. For two random
variables a and b it is defined as
cov(a, b) = ⟨ab⟩ − ⟨a⟩⟨b⟩,
where ⟨·⟩ denotes the expectation value.
In the case of more than two variables, in our case a set of m estimators, a covariance matrix
can be defined with elements V_jk = cov(θ̂_j, θ̂_k). The diagonal elements of this matrix are
simply the variances of the estimators, and it is these variances that allow us to quote an error
on each fitted value.
The problem with the definition above is that it requires many repeated measurements in order
to calculate the expectation values. A better way of estimating the matrix, requiring only one
set of measurements, is to use the curvature of χ² at its minimum,
(V^-1)_jk = (1/2) ∂²χ² / ∂θ_j ∂θ_k, evaluated at θ = θ̂.
Despite this being only an estimate and not entirely accurate, for most practical purposes it is
good enough.
There may or may not be a relationship between two estimators. The correlation coefficient
allows us to see how strongly correlated two estimators are, using their covariance and
individual variances:
ρ_jk = V_jk / sqrt(V_jj V_kk),
which takes values between −1 and +1.
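In practice such a covariance matrix is often obtained directly from the fitting routine. As a minimal sketch (assuming scipy's curve_fit, which returns an estimated covariance matrix; the data values are illustrative only):

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a + b * x

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 3.1, 3.9, 5.2])
sig = np.array([0.2, 0.2, 0.3, 0.3])

popt, pcov = curve_fit(model, x, y, sigma=sig, absolute_sigma=True)
errors = np.sqrt(np.diag(pcov))             # standard errors from the diagonal variances
rho = pcov[0, 1] / (errors[0] * errors[1])  # correlation coefficient of the two estimators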
Error Propagation of f(x; θ):
Now that values for the estimators have been found, the fitted function f(x; θ̂) is complete.
However, because the estimators themselves carry an uncertainty, any function built from them
will also be uncertain. The simplest way to estimate the uncertainty of the fitted function is
linear error propagation, which uses the covariance matrix to obtain the variance of f(x; θ̂):
σ_f²(x) ≈ Σ_{j,k} (∂f/∂θ_j)(∂f/∂θ_k) V_jk,
with the derivatives evaluated at θ = θ̂.
It should be pointed out that even if the propagated uncertainty of a fit is large, that does not
necessarily mean the fit is incorrect; the best way to judge the goodness of a fit is its χ².
Linear error propagation is also slightly problematic in that it treats f as approximately linear
in the parameters, which is not true for every functional form. This can be worked around by
staying close to the range of the data, where the linear approximation holds reasonably well.
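A minimal sketch of this propagation for a straight-line model f = a + bx, assuming a 2x2 covariance matrix V for (a, b) has already been obtained from a fit (the matrix below is illustrative only):

import numpy as np

V = np.array([[0.09, -0.01],
              [-0.01, 0.004]])   # illustrative covariance matrix for (a, b)

def sigma_f(x, V):
    jac = np.array([np.ones_like(x), x])           # df/da = 1, df/db = x
    var_f = np.einsum('jn,jk,kn->n', jac, V, jac)  # sum_jk (df/dtheta_j)(df/dtheta_k) V_jk
    return np.sqrt(var_f)

x_plot = np.linspace(0.0, 10.0, 101)
band = sigma_f(x_plot, V)   # one-sigma uncertainty on the fitted curve at each x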
Goodness of fit from the minimum of χ²:
If f(x; θ̂) were a good fit, it would seem reasonable to expect y_i − f(x_i; θ̂) to be of the order
of σ_i, so each normalised residual would be of order unity. From this one might expect χ² to be
roughly equal to the number of measurements N. This is not quite the case, however, because
there are adjustable parameters in the normalised residuals.
If, for example, there were N measurements and also N adjustable parameters, it would be
possible to find a functional form that passes exactly through every point, giving χ²_min = 0.
In reality, since χ²_min is a function of random variables, it too is subject to random
fluctuations. Assuming the correct form of f has been found, and that the y_i follow a Gaussian
distribution, χ²_min follows the chi-squared distribution P(χ²; n), where n is a quantity known
as the number of degrees of freedom. For a data set of N measurements and m adjustable
parameters θ, n = N − m.
Plotting this distribution shows the probability of obtaining any given value of χ²_min. It is also
useful to define a p-value,
p = ∫_{χ²_min}^{∞} P(χ²; n) dχ²,
which gives the probability of obtaining the observed χ²_min or higher.
For a low probability (p < 0.05) it would be reasonable to reject the hypothesis, since there is a
1-in-20 chance or lower of obtaining such a χ²_min if the hypothesis were correct. For a high
probability (p > 0.05) it would be reasonable to retain the hypothesised function.
As a general rule, a hypothesis with χ²_min within a few multiples of the degrees of freedom n
and a high p-value is likely to be acceptable. These specific p-value thresholds are, however,
entirely arbitrary, and it is entirely possible to obtain a small p-value even when the hypothesis
is correct.
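A short sketch of how such a p-value can be evaluated numerically (assuming scipy; the χ²_min value here is illustrative):

from scipy.stats import chi2

chi2_min = 14.0   # illustrative minimised chi-squared
N, m = 9, 4       # number of measurements and adjustable parameters
n = N - m         # degrees of freedom

p_value = chi2.sf(chi2_min, n)  # survival function: integral of P(chi^2; n) from chi2_min to infinity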
Simple Fits:
To start off, we are given sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([2.7, 3.9, 5.5, 5.8, 6.5, 6.3, 7.7, 8.5, 8.7])
sig = np.array([0.3, 0.5, 0.7, 0.6, 0.4, 0.3, 0.7, 0.8, 0.5])
In this case, we will use an Mth-order polynomial, i.e. one with M + 1 adjustable parameters, for
M = 1, 2, 3:
y = α + βx
y = α + βx + γx²
y = α + βx + γx² + δx³
Using Python, numerical minimisation can be used to find the coefficients for each of these
hypotheses. Furthermore, using the error propagation formula above, it becomes possible to
examine the uncertainty in the fitted curves. Lastly, χ²_min is calculated to see how well each
fit describes the data.
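A minimal sketch of how these fits might be carried out with scipy's curve_fit (the results quoted below came from an equivalent numerical minimisation):

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import chi2 as chi2_dist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([2.7, 3.9, 5.5, 5.8, 6.5, 6.3, 7.7, 8.5, 8.7])
sig = np.array([0.3, 0.5, 0.7, 0.6, 0.4, 0.3, 0.7, 0.8, 0.5])

def poly(x, *coeffs):
    # polynomial with coefficients (alpha, beta, gamma, ...)
    return sum(c * x**k for k, c in enumerate(coeffs))

for M in (1, 2, 3):
    popt, pcov = curve_fit(poly, x, y, p0=np.ones(M + 1), sigma=sig, absolute_sigma=True)
    chi2_min = np.sum(((y - poly(x, *popt)) / sig) ** 2)
    n = len(x) - (M + 1)
    print(M, popt, np.sqrt(np.diag(pcov)), chi2_min, chi2_dist.sf(chi2_min, n))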
Linear Hypothesis:
The fit parameters of the linear fit are α = 2.58 ± 0.29 and β = 0.741 ± 0.057. The propagated
error in the fit appears to be relatively small; this is to be expected, since the hypothesised
function is linear and so shows only a small growth in error outside the data range under linear
error propagation. Using the values above, the fitted function is
y = 2.58 + 0.741x.
This curve passes through 6 of the 9 points within one standard deviation, which does not
appear to be a great fit. The χ²_min for this hypothesis is 72, much higher than its number of
degrees of freedom (7). To make this conclusion more robust, we calculate the p-value of this
χ²_min, which turns out to be 5×10^-13. This is incredibly small, making it very unlikely that the
data have a linear relationship.
Quadratic Hypothesis:
The fit parameters of the quadratic fit are α = 1.88 ± 0.43, β = 0.99 ± 0.22 and γ = −0.028
± 0.024. The propagated error in the fit is small in the region of the data points, but grows
quickly the further from the data the prediction is taken. This is understandable: the fitted
curve is quadratic, so linear error propagation becomes increasingly inaccurate far from the
data. Using the values above, the fitted function is
y = 1.88 + 0.99x − 0.028x².
The curve passes through 7 of the 9 points within one standard deviation. This is a better fit
than the previous one, but is it likely to be the correct one? The χ²_min for this fit is 35, which
is closer to its number of degrees of freedom (6) than in the linear case. However, the evaluated
p-value turns out to be 1×10^-13, effectively ruling out a quadratic relationship as the form this
data takes.
Cubic Hypothesis:
The fit parameters of the cubic fit are α = 0.60 ± 0.85, β = 2.43 ± 0.85, γ = −0.38 ± 0.20 and
δ = 0.0233 ± 0.013. The propagated error in the fit is very small in the region of the data
points, but grows quickly the further from the data the prediction is taken; as with the
quadratic fit, this is understandable given that the fitted curve is cubic. Using the values
above, the fitted function is
y = 0.60 + 2.43x − 0.38x² + 0.0233x³.
This curve passes through all of the points within one standard deviation, much better than the
previous hypotheses. Here χ²_min = 14 and the p-value is 0.018. χ²_min is less than three times
the number of degrees of freedom (5), and the p-value is high enough to be within the realm of
possibility. Because the p-value is still below 0.05, however, it may be worth trying a quartic
fit for comparison.
Galileo’s Ball and Ramp:
[Figure: diagram of the ball-and-ramp setup]
In 1608, Galileo released a ball from a height h on a ramp and measured the horizontal distance
d it travelled before hitting the ground, in units of punto. Given the accuracy of the available
measuring equipment, an error of 1-2% can be expected; here an uncertainty of σ = 15 punto
on d was used, whilst the height is assumed to have negligible uncertainty.
As in the previous section, a hypothesised functional form is required for the data. Again we
choose three (a fitting sketch follows the list):
d = αh
d = αh + βh²
d = αh^β
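As a sketch, the power-law form could be fitted as follows (the h and d arrays are placeholders standing in for the actual measurements, with the assumed σ = 15 punto uncertainty):

import numpy as np
from scipy.optimize import curve_fit

h = np.array([300.0, 600.0, 800.0, 828.0, 1000.0])     # placeholder heights in punto
d = np.array([800.0, 1172.0, 1328.0, 1340.0, 1500.0])  # placeholder distances in punto
sig = np.full_like(d, 15.0)                            # assumed uncertainty on d

def power_law(h, alpha, beta):
    return alpha * h**beta

popt, pcov = curve_fit(power_law, h, d, p0=[1.0, 0.5], sigma=sig, absolute_sigma=True)
alpha_hat, beta_hat = popt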
Linear Hypothesis:
For the linear fit, the estimator is α = 1.66 ± 0.12. Because there is only one adjustable
parameter, the covariance matrix is simply 1×1; in this instance its single entry is 0.013. The
curve passes through only a single point, indicating that the fit is very poor. This is backed up
by χ²_min = 84 and a p-value of 3.2×10^-17: the number of degrees of freedom is 4, and this
χ²_min is much greater than that, while the p-value is so small that a linear relationship is
realistically implausible.
Quadratic Hypothesis:
For the quadratic fit, the estimators are α = 2.79 ± 0.22 and β = 1.35×10^-3 ± 2.6×10^-4, with
the resulting covariance matrix [matrix]. This curve also passes through only one point, but it
lies far closer to the data and better matches its shape. Here χ²_min = 8 and the p-value = 0.04;
given that the number of degrees of freedom is 3, this seems much more plausible. The p-value
is still quite small, but it does not rule out this hypothesis.
The estimators have a correlation coefficient of −0.98, implying that they are strongly
anticorrelated.
Power-Law Hypothesis:
For the power-law fit, the estimators are α = 43.8 ± 5.7 and β = 0.511 ± 0.018, with the resulting
covariance matrix [matrix]. This curve appears to pass through every point, which could indicate
that this is the functional form of the data. However, χ²_min = 11 and the p-value = 0.01, with
3 degrees of freedom. Comparing this χ²_min and p-value with the quadratic fit, it appears that
although the curve passes through more points, it is less likely to be the correct functional form
of the data.
The estimators have a correlation coefficient of −0.999, i.e. they are almost completely
anticorrelated.
Ptolemy Experiment:
In 140 AD, Ptolemy carried out experiments on refraction using a circular copper disc half
submerged in water, logging the incident and refracted angles: [table]
At the time Snell's law was not known, so one estimate of the functional form of refraction was
θr = αθi, although Ptolemy himself preferred θr = αθi − βθi².
To find the coefficients, we again use the method of least squares.
Linear Hypothesis:
For the linear fit, the estimator is α = 0.667 ± 0.016. The curve passes through 3 points but is
reasonably close to the rest; this approximation makes sense since, for small angles, sin x ≈ x.
Here χ² is 145, far too high for the number of degrees of freedom (7), and the p-value is
4×10^-28. Although this hypothesis somewhat fits the data, it is not the functional form.
Quadratic Hypothesis:
For the quadratic fit, the estimators are α = 0.8315 ± 0.0077 and β = 0.00258 ± 7.7×10^-5. The
curve passes through every point and is an excellent fit: χ² = 0.78 and the p-value = 0.99.
According to the method of least squares, it would appear that the true form of this data is in
fact quadratic. However, we know this not to be true, as the true relationship is given by
Snell's law below.
Snell’s Law Hypothesis:
Snell's law dictates that the functional relationship between θi and θr is
θr = sin⁻¹(sin θi / r),
where r = nr/ni is the ratio of refractive indices.
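A minimal sketch of fitting this form (the angle arrays are placeholders standing in for Ptolemy's table, converted to radians; the assumed angular uncertainty is illustrative):

import numpy as np
from scipy.optimize import curve_fit

theta_i = np.radians(np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]))  # placeholder incident angles
theta_r = np.radians(np.array([8.0, 15.5, 22.5, 29.0, 35.0, 40.5, 45.5, 50.0]))   # placeholder refracted angles
sig = np.radians(np.full_like(theta_r, 0.5))                                      # assumed angular uncertainty

def snell(theta_i, r):
    return np.arcsin(np.sin(theta_i) / r)

popt, pcov = curve_fit(snell, theta_i, theta_r, p0=[1.3], sigma=sig, absolute_sigma=True)
r_hat = popt[0]   # estimate of the ratio of refractive indices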
Fitting this form by the method of least squares, the curve, like the quadratic fit, appears to
pass through almost every point and looks to be a good fit for the data. The estimator is
r = 1.310 ± 0.002, which agrees with the known refractive index of water (1.33) to two
significant figures, suggesting that the data come from a real experiment. Calculating χ² and
the p-value, they are found to be 13 and 0.05 respectively.
This is strange: judging purely by this method, it would appear that the true law of refraction
has a quadratic form. The reason for this odd discrepancy is that both χ² and the p-value assume
the fluctuations in the data follow a Gaussian distribution, which is clearly not the case here.
This is an example of how the method of least squares can lead to an incorrect conclusion.