
Probability and statistics crash course

Probability 1 (for dummies :-)
Stats 1 (averages and deviations)
Probability 2 (trials and distributions)
Stats 2 (significance)
Stats 3 (errors)


So what is statistics?

Applied branch of mathematics
Concerning data and its representation
Descriptive Statistics (today) are concerned with representing and summarising data
Analytical Statistics (in a few weeks) are concerned with drawing conclusions from data

"... probability theory enables us to find the consequences of a given ideal world, while statistical theory enables us to measure the extent to which our world is ideal" (Skiena, 2001).


Descriptive statistics: Why?

Summarising data. 32 33 22 21 23 7 10 11 13 16 16 13 15 17 15 33 35 34 32 24

Max, Min, Mean(s), Median, Mode, Variance, Standard Deviation, Interquartile range, ... All ways of presenting numerical data in such a way that we learn something of its spread and tendency and deviation.


What is an average?
Average originally meant "financial loss incurred through damage to goods in transit", from the Italian avaria, a word from 12c. Mediterranean maritime trade. Sometimes traced to Arabic arwariya, "damaged merchandise", but this is less certain. Later, the meaning of the word shifted to "equal sharing of such loss by the interested parties".


Measures of central tendency

Arithmetic Mean (often what we think of when we say the word "Average"). Add 'em all up and divide by the number there are.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
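As a quick sanity check, here is a minimal Python sketch of "add 'em all up and divide", using the 20 values from the "Summarising data" slide. The variable names (`data`, `x_bar`) are just illustrative choices, not anything from the course.

```python
from statistics import mean

# The 20 values from the "Summarising data" slide.
data = [32, 33, 22, 21, 23, 7, 10, 11, 13, 16,
        16, 13, 15, 17, 15, 33, 35, 34, 32, 24]

# Add them all up and divide by the number there are.
x_bar = sum(data) / len(data)

# The standard library offers the same calculation as statistics.mean.
library_mean = mean(data)

print(x_bar, library_mean)
```

The two results should agree, since `statistics.mean` performs the same sum-and-divide calculation.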



An aside about samples and populations

Often we can't measure an entire population, and instead have to measure a subset (a sample). The mean on the previous slide, $\bar{x}$, is, strictly speaking, a sample mean. The population mean is usually referred to as $\mu$, and the size of the whole population as $N$.

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$



The other two

Median = put them all in order, and choose the middle one. If there is an even number of values, then there are two middle ones, so use the number halfway between these.
Mode = choose the most frequent one.
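The two recipes can be sketched in Python as follows. `median_by_hand` is a hypothetical helper written for this illustration; `multimode` comes from the standard `statistics` module (Python 3.8+) and returns every value tied for most frequent, which is a reminder that the mode need not be unique.

```python
from statistics import multimode

def median_by_hand(values):
    """Sort, then take the middle value, or halfway between the two middle ones."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: one middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: halfway between two

# Five values (odd count) and all twenty values (even count) from the slide data.
five = [16, 13, 15, 17, 15]
all_twenty = [32, 33, 22, 21, 23, 7, 10, 11, 13, 16,
              16, 13, 15, 17, 15, 33, 35, 34, 32, 24]

print(median_by_hand(five))
print(median_by_hand(all_twenty))
print(multimode(five))        # the most frequent value(s)
print(multimode(all_twenty))  # several values tie here
```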


I am just going to mention this in passing today, but...
Figure 1: A skewed dataset (plot title: "A fictitious but nastily skewed dataset"; x-axis: Number; y-axis: Count)

This dataset has a mean of 21.8, a median of 12 and a mode of 12.


An aside about types of data

There are various types of data we can consider within statistics. Not all measures of central tendency apply to all of these.

Data type            Description                                  Average
Nominal              Categories or names                          Mode
Ordinal              Orderings (e.g., First, Second, Third ...)   Median
Interval and Ratio   Proper numbers                               Mean (symmetrical), Median (skewed)


And now over to my sequinned assistant...


To conclude the average bit

Arithmetic Mean; Median; Geometric median; Mode; Geometric Mean; Harmonic Mean; Quadratic Mean (or RMS); Generalised Mean (like quadratic mean but with different powers); Weighted Mean (some matter more than others); Truncated Mean (leave out the tricky outliers); Interquartile Mean (uses the interquartile range, of which more later); Midrange ((max + min)/2); Winsorized mean (like truncated but not quite); Annualization (to do with finance stuff). All of these have their own Wikipedia page, so you know where to start!
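To give a feel for a few of these, here is a small Python sketch. The choice of values (five numbers from the slide data) and the "drop one value from each end" truncation are assumptions made purely for illustration; `geometric_mean` and `harmonic_mean` are standard-library functions (Python 3.8+).

```python
from math import sqrt
from statistics import mean, geometric_mean, harmonic_mean

values = [7, 10, 11, 13, 16]  # five values taken from the slide data

arithmetic = mean(values)
geometric = geometric_mean(values)
harmonic = harmonic_mean(values)
quadratic = sqrt(mean([v * v for v in values]))  # root mean square (RMS)
midrange = (max(values) + min(values)) / 2       # note the brackets
truncated = mean(sorted(values)[1:-1])           # assumption: drop one value from each end

for name, value in [("arithmetic", arithmetic), ("geometric", geometric),
                    ("harmonic", harmonic), ("quadratic", quadratic),
                    ("midrange", midrange), ("truncated", truncated)]:
    print(f"{name:>10}: {value:.3f}")
```

For positive data the harmonic, geometric, arithmetic and quadratic means always come out in that order (smallest to largest), which is a handy check on the output.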


Boring practical bit

32 33 22 21 23 7 10 11 13 16 16 13 15 17 15 33 35 34 32 24


Boring practical bit: answers

         32     7      16     33
         33     10     13     35
         22     11     15     34
         21     13     17     32
         23     16     15     24
Mean     26.2   11.4   15.2   31.6
Median   23     11     15     33
Mode     ?      ?      15     ?
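The answers above can be checked with a few lines of Python. This sketch assumes each column is one consecutive group of five values from the original list; the names `groups`, `means`, `medians` and `modes` are illustrative.

```python
from statistics import mean, median, multimode

# Each inner list is one column of the answers table.
groups = [
    [32, 33, 22, 21, 23],
    [7, 10, 11, 13, 16],
    [16, 13, 15, 17, 15],
    [33, 35, 34, 32, 24],
]

means = [round(mean(g), 1) for g in groups]
medians = [median(g) for g in groups]
# multimode returns every value tied for most frequent; when all values
# are distinct it returns them all, which matches the "?" entries above.
modes = [multimode(g) for g in groups]

print(means)
print(medians)
print(modes)
```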


As well as knowing some kind of average of a particular sample, you might want to know something of its spread.
Figure 2: Three datasets with the same mean but different spreads (plot title: "More fictitious data"; x-axis: Number).


The really simple one

The range is the simplest way of describing the spread of data: find the max, find the min, subtract the min from the max, and there you go.
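In Python this is a one-liner; the sketch below applies it to the 20 values from the "Summarising data" slide (the name `data_range` is just an illustrative choice).

```python
# The 20 values from the "Summarising data" slide.
data = [32, 33, 22, 21, 23, 7, 10, 11, 13, 16,
        16, 13, 15, 17, 15, 33, 35, 34, 32, 24]

# Find the max, find the min, subtract the min from the max.
data_range = max(data) - min(data)

print(max(data), min(data), data_range)
```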


The deviation of a sample is measured with reference to some measure of central tendency: you want to know how much the sample deviates from something. With average deviation, variance, and standard deviation, this is the population mean $\mu$ or the sample mean $\bar{x}$.


Measures of deviation
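Average deviation:

$$\frac{1}{N} \sum_{i=1}^{N} |x_i - \mu|$$

Variance:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Standard deviation:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$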

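For reasons you will now be familiar with, when considering samples, $\sigma$ becomes $s$, and $\mu$ becomes $\bar{x}$. To account for bias, the sum of squared deviations in the sample variance (and hence in the sample standard deviation) is divided by $n - 1$ rather than $n$.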
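The difference between dividing by n and by n - 1 can be seen in a short Python sketch. The five values are borrowed from the slide data and treated as a sample purely for illustration; `pstdev` and `stdev` are the standard library's population and sample versions.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

sample = [7, 10, 11, 13, 16]  # five values from the slide data, treated as a sample
n = len(sample)
x_bar = mean(sample)

# Sum of squared deviations from the mean.
squared = sum((x - x_bar) ** 2 for x in sample)

divide_by_n = sqrt(squared / n)                  # population-style formula
divide_by_n_minus_1 = sqrt(squared / (n - 1))    # sample standard deviation s

print(divide_by_n, pstdev(sample))
print(divide_by_n_minus_1, stdev(sample))
```

Dividing by the smaller number n - 1 always gives the slightly larger answer, and the gap shrinks as n grows.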


Worked example
This example [a] involves the rainfall in Liberia [b].

Month               J   F   M   A   M   J   J   A   S   O   N   D
Rainfall (inches)   1   2   4   6   18  37  31  16  28  24  9   4

The mean of this data is

$$\frac{1 + 2 + 4 + 6 + 18 + 37 + 31 + 16 + 28 + 24 + 9 + 4}{12} = 15$$

The range of this data is 36 (max - min, or 37 - 1).

[a] Taken from Sternstein's Statistics.
[b] No, I've never been there either.


Average deviation
The average deviation

$$\frac{|1 - 15| + |2 - 15| + |4 - 15| + |6 - 15| + |18 - 15| + \dots}{12} = \frac{14 + 13 + 11 + 9 + 3 + 22 + 16 + 1 + 13 + 9 + 6 + 11}{12} = \frac{128}{12} \approx 10.7$$

(10.7 inches)
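The average deviation above can be reproduced with a minimal Python sketch. The variable names are illustrative, and the twelve months are treated as the whole population (dividing by 12), as in the worked example.

```python
# Monthly rainfall in inches, January to December, from the worked example.
rainfall = [1, 2, 4, 6, 18, 37, 31, 16, 28, 24, 9, 4]

mu = sum(rainfall) / len(rainfall)

# Absolute distance of each month from the mean.
abs_deviations = [abs(x - mu) for x in rainfall]

average_deviation = sum(abs_deviations) / len(rainfall)

print(mu)
print(abs_deviations)
print(round(average_deviation, 1))
```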


Variance and standard deviation

The variance

$$\sigma^2 = \frac{14^2 + 13^2 + 11^2 + 9^2 + 3^2 + 22^2 + 16^2 + 1^2 + 13^2 + 9^2 + 6^2 + 11^2}{12} = \frac{1724}{12} \approx 143.7$$

(143.7 inches squared) and the standard deviation is the square root of the variance, so...

$$\sigma = \sqrt{143.7} \approx 12.0$$

and the units of the standard deviation are... the same as the units of measurement.
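For anyone who would rather not square twelve numbers by hand, here is a small Python sketch of the same calculation. It assumes, as the worked example does, that the twelve months are the whole population, so it uses the standard library's population functions `pvariance` and `pstdev` (which divide by N).

```python
from statistics import pvariance, pstdev

# Monthly rainfall in inches, January to December, from the worked example.
rainfall = [1, 2, 4, 6, 18, 37, 31, 16, 28, 24, 9, 4]

variance = pvariance(rainfall)   # in inches squared
std_dev = pstdev(rainfall)       # in inches, same units as the data

print(round(variance, 1))
print(round(std_dev, 1))
```

If the twelve months were instead treated as a sample, `statistics.variance` and `statistics.stdev` would divide by n - 1 and give slightly larger values.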


Interquartile range
One final measure of deviation is the interquartile range. This is related to the median, and the first thing you do is place your data in order. Discard the lowest 1/4 and the highest 1/4 of your data, and use the range of what remains. This is much more robust to outliers.
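A literal reading of that recipe can be sketched in Python as below, using the Liberia rainfall figures from the worked example. `simple_iqr` is a hypothetical helper, and rounding the quarter down with integer division is an assumption for data whose length is not a multiple of four. Library routines usually interpolate between data points, so they may give somewhat different numbers.

```python
def simple_iqr(values):
    """Sort, drop the lowest and highest quarter, return the range of the rest."""
    ordered = sorted(values)
    quarter = len(ordered) // 4   # assumption: round down when n is not a multiple of 4
    middle = ordered[quarter:len(ordered) - quarter]
    return max(middle) - min(middle)

# Monthly rainfall in inches from the worked example.
rainfall = [1, 2, 4, 6, 18, 37, 31, 16, 28, 24, 9, 4]

iqr = simple_iqr(rainfall)
full_range = max(rainfall) - min(rainfall)

print(iqr, full_range)
```

Replacing the wettest month with some enormous value would change the full range but leave this interquartile range untouched, which is what "robust to outliers" means here.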


And to finish
If your data is normally distributed (of which more next week), knowing the standard deviation tells you all sorts of useful stuff.

Figure 3: Another graph stolen from Wikipedia
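One well-known piece of that "useful stuff" is the fraction of normally distributed data lying within 1, 2 and 3 standard deviations of the mean. The sketch below computes these fractions with the standard library's `NormalDist` (Python 3.8+); the dictionary name `within` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Fraction of the distribution within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, fraction in within.items():
    print(f"within {k} standard deviation(s): {fraction:.1%}")
```

These fractions are the same for any normal distribution, whatever its mean and standard deviation, which is why the standard deviation is so informative for normally distributed data.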

