
Week 4 - checklist items

“Probability” Video
- One way of explaining probability in our context stems from having a complete data set
and describes the relative frequency of any observation.
- In our beer measurement data, 19 out of 145 students reported an alcohol content of
5.5%. Based on this, we could say that the probability of finding a measurement of 5.5%
in our data set is 19/145 = 0.13, or 13%.
- For probability, values closer to 1, or 100%, imply higher certainty that an event will
happen or that a measurement will be found in the data set.
- Here, 13% of our measurements were of our value of interest (5.5%).
- Remember that predictions based on a sample set are always approximations. We
would have to measure all the beer in the world an infinite number of times to be
certain of the value.

- The other way we can use probability is in a more predictive way. If someone is about to
make a measurement or other observation, what is the likelihood it will be a particular
value or fall in a particular range?
- The same math still applies: if I am looking for 5.5%, I still have 19/145, but here I am
assuming that my current data set represents the infinite population of all
measurements, so for every measurement I make there will be a 13% chance that
I observe 5.5%. For every 100 measurements made, about 13 will be 5.5%.
- Again, here a number closer to 1 or 100% represents higher certainty. Suppose that instead of
the alcohol content of beer, your data set was answers to the question "How
many lab sections are in Chem 311 this year?"
- You should then have a probability very near 100% for 14, and very near 0% for any
other value.
- When we have a discrete data set like the one shown, with definite, separate
measurements all along the x-axis, we can calculate the probability of finding a specific
measurement by taking the height of the bar on the histogram (the
number of measurements at that value) and dividing by the total number of measurements we have.
- So we could describe the probability of measuring 9.99 mL with our pipet as being about
19% (since 5/26 = 0.1923).
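
A minimal sketch in Python of this count-based probability, using the two worked examples from these notes (only the counts for the values of interest are included):

```python
# Count-based probability: (measurements at the value) / (total measurements).
examples = {
    "5.5% alcohol": (19, 145),  # 19 of 145 beer measurements
    "9.99 mL":      (5, 26),    # 5 of 26 pipet measurements
}

for value, (hits, total) in examples.items():
    p = hits / total
    print(f"P({value}) = {hits}/{total} = {p:.4f} ({p:.0%})")
# P(5.5% alcohol) = 19/145 = 0.1310 (13%)
# P(9.99 mL) = 5/26 = 0.1923 (19%)
```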
- When we have an ideal data set, one composed of infinite measurements with
infinite precision, we no longer have a count of the measurements at a particular value.
So we will use the area under the curve describing that data set as our
probability.
- To calculate the probability of measuring 9.99 mL with my pipet, I will take the area
under the curve for all values I would count as 9.99. If my 26-sample data set was
representative, this would come out close to 19%.
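
Here is a sketch of the area-under-the-curve version using scipy's normal CDF. The notes do not give the pipet data's mean and standard deviation, so the values below are assumptions chosen to land near the 19% figure:

```python
from scipy.stats import norm

mean, sd = 10.000, 0.006   # mL; assumed values, not from the actual data set

# Readings that would be counted as 9.99 mL fall between 9.985 and 9.995 mL,
# so the probability is the area under the Gaussian between those limits.
p = norm.cdf(9.995, mean, sd) - norm.cdf(9.985, mean, sd)
print(f"P(reads as 9.99 mL) = {p:.3f}")   # about 0.196, close to the 19% count
```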

- Since the Gaussian curve follows a known equation with only a few variables, finding the
area under the curve is conveniently done from tables of standardized values. These are
usually based on a normalized curve, given in units of z, where z = (x - mean)/(standard
deviation): the distance from the mean, with one unit of z being one standard deviation.
- These tables give the area under the curve from 0 to the desired z value. For example,
we can see from this section of the table that between z = 0 and z = 1, the area under the
curve is 0.3413 units.
- We are not always interested in probabilities that start at our mean value (where z=0)
though. Luckily we can find them by just adding or subtracting values that we can easily
find in the data tables.
- Say I want to find the area between z = 0.5 and z = 1.5. I can find the area from 0 to 1.5
and from 0 to 0.5 in the table. To find the area between 0.5 and 1.5, I can subtract the two
areas, leaving only the portion wanted.
- 0.4332 - 0.1915 = 0.2417.
- The other thing we are often interested in is the probability of being outside of a given
range.
- For a range of z = -2 to z = 2, the area included in these boundaries will be double the
table value between 0 and 2, or 2 × 0.4773 = 95.46%.
- What if we wanted to know the probability of being outside this range, that is, any z value
greater than 2 or less than -2? Since the probability of anything happening is 1, or 100%,
we can subtract the inside area from the total.
- So the probability of finding values outside is 1 - 0.9546 = 0.0454.
- Finally, if we wanted to know the probability outside of the blue region but only on the
side of larger values: the curve is symmetric, so we divide the outside area in half to get
the area for just one of those tails, 0.0454/2 = 0.0227.
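
All of the z-table arithmetic above can be reproduced from the standard normal CDF; a short sketch (scipy carries more digits than the four-decimal table, so the last digit can differ):

```python
from scipy.stats import norm

area_0_to_1 = norm.cdf(1) - 0.5           # table: 0.3413
between = norm.cdf(1.5) - norm.cdf(0.5)   # 0.4332 - 0.1915 = 0.2417
inside = norm.cdf(2) - norm.cdf(-2)       # 2 * 0.4773 = 0.9546
outside = 1 - inside                      # 0.0454
one_tail = outside / 2                    # 0.0227
print(f"{area_0_to_1:.4f} {between:.4f} {inside:.4f} {outside:.4f} {one_tail:.4f}")
```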

“Error Limits” video


- If you were wondering how those error limits we worked with last week can be
derived for our data set, you will find out here.
- One way we can generate values for our error limits is to use standard deviation.
- Since we should be pretty sure that the true value is somewhere in the range of
observations that we measured and probably close to our mean, we may choose to
report our value as our mean plus or minus twice the standard deviation, or plus or
minus three times the standard deviation.
- Three times the standard deviation is more common, as it encompasses more of the data
set, so you can be more confident that even if your mean was a little bit off, the true value is
still somewhere in the range that you are reporting.
- You’ll also often see percent relative standard deviation listed with data as a reflection
of the uncertainty, but it is not usually directly used as a plus or minus kind of error
limit.
- Normally when you are reporting a standard deviation as a standard deviation, you will
follow your sig fig rules through the calculation to report your final answer.
- But if you are using standard deviation as an error limit, switch over and follow the error
limit rules and round it off to one sig fig, as you should for an error limit.
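
A sketch of reporting mean plus or minus 3s with the error limit rounded to one sig fig; the data values here are made up for illustration:

```python
import math
import statistics

data = [12.47, 12.53, 12.49, 12.61, 12.40]   # hypothetical measurements

mean = statistics.mean(data)
limit = 3 * statistics.stdev(data)           # 3 x sample standard deviation

# Round the error limit to one significant figure, then round the mean to
# the same decimal place.
place = math.floor(math.log10(abs(limit)))
limit_1sf = round(limit, -place)
mean_rounded = round(mean, -place)
print(f"{mean_rounded} +/- {limit_1sf}")     # 12.5 +/- 0.2
```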

- The other usual way to generate error limits is by using a confidence interval.
- Here we attach a specific probability or certainty to a range that we are writing with our
error limits.
- For example, with a 95% confidence interval, 12.5 plus or minus 0.4 could mean I am
95% confident that my value is between 12.1 and 12.9.
- More correctly, when we say this, we are saying there is less than a 5% chance
that my value is not between 12.1 and 12.9.
- I can define a narrower range for my confidence interval, but if I do, I become less
certain that my true value is inside of that range.
- Think of it this way, the only thing I can be 100% sure of is that my true value lies
between positive and negative infinity.
- Every time I narrow my range down from there, I become a little less certain that I have
not accidentally cut out my true value.
- Another way to visualize this idea of our certainty and our confidence interval is to
look at some actual data sets. The experiment done here was to start with four
copies of our true value of 10,000 and then add random errors to those four
values.
- Each square represents the average value of these four numbers. So, this experiment
was repeated 100 times.
- Now let’s add a confidence interval to these measurements. The size of the confidence
intervals will vary depending on the four values that went into each average. So here
we have a 50% confidence interval added. All the data sets where the confidence
interval is correct (that is, it includes our true answer somewhere inside the range) are
colored white.
- Notice that only about half of them correctly include the true value. These are 50%
confidence intervals, so we can only be 50% sure that any given data set is correct. Now
let’s take the exact same data sets but apply 90% confidence intervals.
- If we do the same thing and color all of the data sets that do not include our true value,
only a handful of them (about 10%) fail to include it.
- So 90% of them do.
- For any of these datasets represented by squares, we can be more certain that the
interval includes the correct value using our bigger, 90% confidence intervals.
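
The coverage experiment above is easy to reproduce in code. A sketch, with the size of the random errors assumed (the video does not state it):

```python
import random
import statistics
from scipy.stats import t

TRUE, SIGMA, N, TRIALS, LEVEL = 10_000, 50, 4, 100, 0.90   # SIGMA is assumed

t_val = t.ppf((1 + LEVEL) / 2, df=N - 1)   # two-sided t for 3 degrees of freedom

hits = 0
for _ in range(TRIALS):
    data = [random.gauss(TRUE, SIGMA) for _ in range(N)]
    mean = statistics.mean(data)
    half_width = t_val * statistics.stdev(data) / N ** 0.5
    hits += (mean - half_width) <= TRUE <= (mean + half_width)

print(f"{hits}/{TRIALS} intervals contain the true value")   # about 90 of 100
```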

- To calculate our confidence intervals, we report them as our mean plus or minus our
confidence interval.
o So we start with our mean and our standard deviation.
o The size of our confidence interval also depends on the number of
measurements in the data set, n, and on a new value, t, which comes from the
Student's t distribution (one of the ones we saw in the previous video that was
kind of similar in shape to our normal distribution). Altogether, the confidence
interval is (t × s)/√n, so we report mean ± (t × s)/√n.
o The value for t depends on our number of measurements as well as our
desired confidence level.
- To find the t value: you can calculate it from a formula, just like you can calculate your
Gaussian curves, but we have precalculated values of t since we use t so often.
- We can look up these precalculated values in data tables in our textbook. They are
arranged by the desired confidence level and by the degrees of freedom (which is just
one less than the number of measurements in our data set).
- Normally, you can then choose the value of t that corresponds to your number of
measurements and your desired confidence level.
- So as an example, to find the 90% confidence interval for one of our 4-point data sets (so
degrees of freedom would equal three), we would use t = 2.353.
- That is the value at the intersection of 90% confidence and 3 degrees of freedom.
- While we are here, take note that for any one confidence level, as your number of
measurements and degrees of freedom go up, your t value gets smaller.
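
If the table is not handy, the same t values can be pulled from scipy; a sketch (for a two-sided confidence level, half of the leftover probability goes in each tail):

```python
from scipy.stats import t

confidence = 0.90
for df in (3, 9, 29):
    t_val = t.ppf(1 - (1 - confidence) / 2, df)
    print(f"df = {df:2d}: t = {t_val:.3f}")
# df =  3: t = 2.353   (matches the table entry used above)
# df =  9: t = 1.833
# df = 29: t = 1.699   (t shrinks as degrees of freedom go up)
```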

- That t value getting smaller will affect the size of our confidence interval.
- For larger data sets, my t value is going to be smaller, so we will have a smaller,
narrower interval for the same confidence level.
- Increasing the number of measurements also directly decreases the size of my
confidence interval for the same confidence level, by decreasing the 1/√n term.
- So a larger data set will usually have a smaller confidence interval for both of these
reasons: our t value and our 1/√n term both get smaller.
- Example: let’s calculate the confidence interval for this data set. With 10 measurements,
we will have nine degrees of freedom.
- So for 95% confidence, we should use t = 2.262.
- If we use this value along with the standard deviation and the number of data points in
our formula, we get a confidence interval of just over 16 ppm.
- Now our confidence interval is an error limit.
- So before we are done, we need to round everything off correctly, with 1 sig fig in our
error limit.
- Since our error limit ends up in the tens place, it is nicer to write our values in scientific
notation to avoid any questionable zeros.
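
A sketch of the whole confidence interval calculation. The notes give n = 10 and a result of just over 16 ppm but not the raw data, so the standard deviation below is an assumed value chosen to reproduce that result:

```python
from scipy.stats import t

n, s, level = 10, 22.6, 0.95               # s in ppm (assumed)

t_val = t.ppf((1 + level) / 2, df=n - 1)   # 2.262 for 9 degrees of freedom
ci = t_val * s / n ** 0.5
print(f"confidence interval = +/- {ci:.1f} ppm")   # +/- 16.2 ppm
# Rounded to one sig fig the error limit is 20 ppm, i.e. +/- 2 x 10^1 ppm,
# which is why scientific notation is the cleaner way to report it.
```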

So to recap, now we know several ways that we can define our error limits from a data set.
- From this one data set we can create error limits.
- Ones based on standard deviation and ones based on confidence intervals.
“Case 0 Comparison” video
- Now we can use our knowledge of statistics to start answering questions.
- Does this measurement belong in my data set?
- When we want to answer a question like "are these values the same or are they
different?", we must qualify it a bit. First, we have our comparison.
- Even though our numbers might not exactly agree, is the difference between them
significant, or do they agree?
- What is our certainty with this statement?
- Do we know it to 90%? 95%?
- How confident are we?
- In order to be clear about what tests we have done, we need to word our
statements carefully.

- Our first comparison will be the case 0 test.


- Here we will compare one measurement to the entire data set to determine whether it
belongs or whether it is what we call an outlier (the result of a blunder or an unusual random
error, such as a cosmic ray hitting your photodetector).
- First, when you suspect a measurement is an outlier, consider whether there is an
external reason to believe that the data point is wrong or whether it just looks
suspicious. If you know a data point is, say, an overshot endpoint, you can discard it right
away and explain why you discarded it.
- If you do not have a good experimental reason to discard that point, you need to
prove that it really does not belong in the data set and is not just a data point with a somewhat
large random error that still fits.
- You cannot get rid of data just because you do not like it as that creates bias in your
measurements.
- So to test whether a single data point really is part of the data set you are comparing it
to, the recommended test is the Grubbs test for an outlier.
- The formula is simple: we take the absolute difference between our questionable value
and our mean, and then divide by the standard deviation: G = |questionable value - mean| / s.
- Here, both the mean and the standard deviation should be of the whole dataset,
including your questionable value.
- Then we compare our calculated G value to a table of critical G values.
- Tables are available at other confidence limits, but 95% is the most used and the one in
the text.
- If the G value we calculated is bigger than the G value in the table, then we will reject
that data point as an outlier. If that is the case, drop it, make sure to note in your
notebook why, and then recalculate your mean and standard deviation.
- If our calculated G is less than or equal to the table value, then we cannot be at least 95%
certain that the measurement does not belong, so we must keep it in our data set.
- We will try this out with actual numbers.
- To find out if they all belong, we first need the mean and standard deviation of our data
set, which are already calculated.
- Then we will pick a value to test if it is an outlier.
- The lowest value appears to be furthest from the average, so we test this one.
- Tabulated G values are listed by the actual number of observations, so my critical G value for
this data set is 1.463.
- This is smaller than our calculated G value, so this data point can be rejected with 95%
confidence.
- At this point, use the Grubbs test to eliminate ONE POINT ONLY.
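
A sketch of the Grubbs test in code. The data set here is hypothetical; only the critical value of 1.463 comes from the notes:

```python
import statistics

data = [10.07, 10.12, 10.09, 9.81]   # hypothetical; 9.81 looks suspicious
g_crit = 1.463                        # 95% critical G from the table

mean = statistics.mean(data)          # include the questionable point
s = statistics.stdev(data)
suspect = max(data, key=lambda x: abs(x - mean))   # furthest from the mean
g_calc = abs(suspect - mean) / s

if g_calc > g_crit:
    print(f"G = {g_calc:.3f} > {g_crit}: reject {suspect} as an outlier")
else:
    print(f"G = {g_calc:.3f} <= {g_crit}: keep {suspect}")
# G = 1.484 > 1.463: reject 9.81 as an outlier
```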

- Our first test that we know how to do: ask whether you have a questionable value, and
use the Grubbs test to verify.

“Case 1 Comparison” video


- Our next comparison will help us answer the question: "Is my data set the same as or
different from this reference value?"
- COMPARING A DATA SET TO A TRUE VALUE

- Grubbs test: determining whether a single value is an outlier.
- Case 1 test: comparing your data set to a true value.
- Case 2 test: comparing two data sets.
- Case 3 test: comparing paired values from two data sets.
- F-test: comparing standard deviations.
- t-test: comparing means.
