
Statistics I

Chapter 2: Univariate data analysis


Contents

- Graphical displays for categorical data (barchart, piechart)
- Graphical displays for numerical data (histogram, polygon, boxplot)
- Numerical measures to describe:
  - central tendency (mean, median, mode)
  - variation (variance, standard deviation, quasi-variance and quasi-standard-deviation, range, IQR, coefficient of variation)
  - others (quartiles, percentiles)


Recommended reading

- Peña, D., Romo, J., Introducción a la Estadística para las Ciencias Sociales
  - Chapters 4, 5
- Newbold, P., Estadística para los Negocios y la Economía (2009)
  - Chapter 2

Graphical presentation of data

Once we have a frequency distribution of the data, the following graphical displays can be obtained:

- Categorical: piechart, barchart
- Numerical: histogram, polygon, boxplot

Graphs for qualitative data: piechart

Example 1: The frequency table below corresponds to the data representing blood types reported for a sample of 40 individuals.

Class   Absolute frequency   Relative frequency
A       12                   0.300
B       11                   0.275
AB       8                   0.200
O        9                   0.225
Total   40                   1
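The relative frequencies in the table are the absolute frequencies divided by the sample size. A minimal Python sketch of that step, using the counts from Example 1 (the variable names are illustrative only):

```python
# Absolute frequencies from Example 1 (blood types of 40 individuals).
counts = {"A": 12, "B": 11, "AB": 8, "O": 9}

n = sum(counts.values())  # sample size
relative = {blood_type: k / n for blood_type, k in counts.items()}

for blood_type, k in counts.items():
    print(f"{blood_type:<5} {k:>3}   {relative[blood_type]:.3f}")
print(f"{'Total':<5} {n:>3}   {sum(relative.values()):.0f}")
```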

Piechart
Example 1 cont.:
- Each slice is a fraction of the total size of the pie
- Many software packages order the slices alphabetically
- Pretty, although harder to read than barcharts
- Avoid 3D piecharts: in those, a slice in the background appears smaller than a slice of the same size in the foreground

[Figure: piechart of the blood types with slices A 30%, B 27.5%, O 22.5%, AB 20%]

Graphs for qualitative data: barchart

Example 2: The frequency table below corresponds to levels of satisfaction for 901 employees.

Class   Absolute    Relative    Cumulative absolute   Cumulative relative
        frequency   frequency   frequency             frequency
VU       62         0.07         62                   0.07
U       108         0.12        170                   0.19
S       319         0.35        489                   0.54
VS      412         0.46        901                   1
Total   901         1
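The cumulative columns are running totals of the absolute frequencies, divided by the sample size for the relative version. A short Python sketch using the counts from Example 2 (names are illustrative):

```python
from itertools import accumulate

# Absolute frequencies from Example 2, in the order of the table.
classes = ["VU", "U", "S", "VS"]
absolute = [62, 108, 319, 412]

n = sum(absolute)                                   # 901 employees
relative = [k / n for k in absolute]
cumulative_abs = list(accumulate(absolute))         # running totals
cumulative_rel = [c / n for c in cumulative_abs]

for row in zip(classes, absolute, relative, cumulative_abs, cumulative_rel):
    print("{:<3} {:>4} {:.2f} {:>4} {:.2f}".format(*row))
```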

Barchart

Example 2 cont.:
- Bars are of the same width and equally spaced, with the heights corresponding to the frequencies
- There are gaps between the bars
- Bars are labeled with class names
- Many software packages order the bars alphabetically

[Figure: barchart for Example 2; vertical axis FREQUENCY, class labels on the horizontal axis]

Barchart

Barcharts can also be constructed for discrete data if there are not too many values.
This is a barchart for Example 3 of Ch. 1, where we looked at the number of leaves attacked by a pest for a sample of 50 plants.

[Figure: barchart for Example 3 of Ch. 1; vertical axis FREQUENCY]

Graphs for quantitative data: histogram and polygon

Example 4: The frequency distribution of the daily high temperature (in Fahrenheit) reported on 20 winter days is as follows:

Class interval   Midpoint   n_i   f_i    N_i   F_i
[10, 20)         15          3    0.15    3    0.15
[20, 30)         25          6    0.30    9    0.45
[30, 40)         35          5    0.25   14    0.70
[40, 50)         45          4    0.20   18    0.90
[50, 60)         55          2    0.10   20    1
Total                       20    1

Histogram and polygon

- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies

[Figure: histogram with the frequency polygon drawn over it; horizontal axis TEMP (F), vertical axis FREQUENCIES]

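Histogram with area of 1 (on a density scale)

- Bin widths = widths of class intervals (not necessarily identical)
- Bin heights = $f_i / (l_i - l_{i-1})$, where $l_{i-1}$ and $l_i$ are the boundaries of class $i$
- Bin areas = $f_i$
- TOTAL AREA = 1

[Figure: density-scale histogram; horizontal axis TEMP (F), vertical axis from 0.000 to 0.030]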
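On the density scale each bin height is the relative frequency divided by the class width, so the bin areas add up to 1. A small Python sketch for the temperature data of Example 4 (the variable names are assumptions made for illustration):

```python
# Class boundaries and relative frequencies from Example 4.
boundaries = [10, 20, 30, 40, 50, 60]
rel_freq = [0.15, 0.30, 0.25, 0.20, 0.10]

# Width of each class interval [lower, upper).
widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]

# Density-scale height: relative frequency per unit of width.
heights = [f / w for f, w in zip(rel_freq, widths)]

# Each bin area equals its relative frequency, so the total is 1
# (up to floating-point rounding).
total_area = sum(h * w for h, w in zip(heights, widths))

print([round(h, 3) for h in heights])
print(round(total_area, 10))
```

The same code works unchanged for classes of unequal width, which is the case where the density scale matters most.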

Describing data numerically

- Center: mean, median, mode
- Variation: range, interquartile range, variance, standard deviation, coeff. of variation
- Others: quartiles, percentiles

New notation:

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n$$

($\sum$: sum, $i = 1$: the lower limit, $n$: the upper limit, $x_i$: example of a formula depending on $i$)

Example:

$$\sum_{i=-1}^{3} i^2 = (-1)^2 + 0^2 + 1^2 + 2^2 + 3^2 = 15$$

Central tendency: (arithmetic) mean

- The most common measure of central tendency

Population mean:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \ldots + x_N}{N}$$

Sample mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \ldots + x_n}{n}$$

- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $\bar{y} = a + b\bar{x}$
- Affected by extreme values (outliers)

Example: $X$: 3, 1, 5, 4, 2 and $Y$: 3, 1, 5, 4, 200

$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3, \qquad \bar{y} = \frac{3 + 1 + 5 + 4 + 200}{5} = 42.6$$
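The effect of a single outlier on the mean, and the linear-transformation property, can be checked numerically. The sketch below uses the standard-library function `statistics.fmean`; the constants `a` and `b` are hypothetical values chosen only for illustration:

```python
from statistics import fmean

x = [3, 1, 5, 4, 2]
y = [3, 1, 5, 4, 200]   # same data with one extreme value

print(fmean(x))  # 3.0
print(fmean(y))  # 42.6

# Linear transformation t = a + b*x (a and b are hypothetical constants).
a, b = 10, 2
t = [a + b * xi for xi in x]
print(fmean(t), a + b * fmean(x))  # both equal 16.0
```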

Central tendency: median

- In the ordered list, the median $M$ is the middle number

$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ odd (the middle number)} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ even (the average of the two middle numbers)} \end{cases}$$

($x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ means that the observations are ranked in increasing order, e.g. $x_{(1)} = x_{\min}$, $x_{(n)} = x_{\max}$)

- Not affected by outliers

Example: Given observations 3, 1, 5, 4, 2 ($n = 5$), first rank the data 1, 2, 3, 4, 5, then identify the middle number(s):

$$M = x_{((5+1)/2)} = \overbrace{x_{(3)}}^{\text{3rd smallest}} = 3$$

Example: Given observations 3, 1, 5, 4, 2, 0 ($n = 6$), first rank the data 0, 1, 2, 3, 4, 5, then identify the middle number(s):

$$M = \frac{x_{(6/2)} + x_{(6/2+1)}}{2} = \overbrace{\frac{x_{(3)} + x_{(4)}}{2}}^{\text{the average of 3rd and 4th}} = \frac{2 + 3}{2} = 2.5$$

Central tendency: mode

The value that occurs most often

Not affected by outliers

Used for either numerical or categorical data

There may be no mode, there may be several modes

Example: Given observations 3, 1, 5, 4, 2, there is no mode


Example: Given observations 3, 1, 5, 4, 2, 1, the mode is 1
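A sketch of how the mode could be found by counting occurrences. It assumes, following the first example above, that a sample in which no value repeats has no mode; the function name `modes` is illustrative:

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Assumption: every value occurs once, so there is no mode.
        return []
    return [value for value, count in counts.items() if count == top]


print(modes([3, 1, 5, 4, 2]))               # []
print(modes([3, 1, 5, 4, 2, 1]))            # [1]
print(modes(["A", "B", "A", "B", "O"]))     # ['A', 'B'] (categorical, two modes)
```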

Shape: comparing mean and median


Three types of distributions:

- Skewed to the left: Mean < Median
- Symmetric: Mean = Median
- Skewed to the right: Median < Mean

[Figure: three density curves labeled LEFT-SKEWED ($\bar{x} < M$), SYMMETRIC ($\bar{x} = M$) and RIGHT-SKEWED ($M < \bar{x}$)]

Note: The distribution in the middle is known as bell-shaped or normal
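The comparison of mean and median can be turned into a rough classifier. This is only a rule-of-thumb sketch: the function name is hypothetical, the third data set is invented for illustration, and real data will rarely give an exactly equal mean and median:

```python
import math
from statistics import fmean, median


def skew_direction(values):
    """Classify shape by comparing the mean with the median (rule of thumb)."""
    mean_value = fmean(values)
    median_value = median(values)
    if math.isclose(mean_value, median_value):
        return "symmetric"
    return "left-skewed" if mean_value < median_value else "right-skewed"


print(skew_direction([3, 1, 5, 4, 2]))       # symmetric
print(skew_direction([3, 1, 5, 4, 200]))     # right-skewed
print(skew_direction([-200, 1, 5, 4, 3]))    # left-skewed
```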

Variation: range and interquartile range (IQR)

- Range is the simplest measure of variation: $R = x_{\max} - x_{\min}$
- Ignores the way the data is distributed
- Sensitive to outliers

Example: Given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: Given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$

- Interquartile range (IQR) can eliminate some outlier problems. Eliminate high and low observations and calculate the range of the middle 50% of the data:

$$\text{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$

Variation: Interquartile range and boxplot


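- Outliers are observations that fall
  - below the value of $Q_1 - 1.5 \times \text{IQR}$
  - above the value of $Q_3 + 1.5 \times \text{IQR}$
- For extreme outliers, replace 1.5 by 3 in the above definition

[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, median ($Q_2$) $= 31$, $Q_3 = 42$, $x_{\max} = 58$; each of the four segments contains 25% of the data; IQR = 18]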
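The outlier limits follow from the quartiles alone. A brief sketch, applied to the quartiles shown in the boxplot ($Q_1 = 24$, $Q_3 = 42$); the function name `outlier_fences` is an assumption made for this example:

```python
def outlier_fences(q1, q3, factor=1.5):
    """Return (lower, upper) limits beyond which observations are outliers."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


lower, upper = outlier_fences(24, 42)              # IQR = 18
print(lower, upper)                                 # -3.0 69.0

# Extreme outliers: replace 1.5 by 3.
extreme_lower, extreme_upper = outlier_fences(24, 42, factor=3)
print(extreme_lower, extreme_upper)                 # -30 96

# All boxplot values between 12 and 58 lie inside the fences,
# so this sample has no outliers.
print(lower <= 12 and 58 <= upper)                  # True
```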

Quartiles and percentiles

Quartiles split the ranked data into four segments with an equal number of values per segment

- The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$
- The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$
- The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$

Example: Given observations 22, 18, 17, 16, 16, 13, 12, 21, 11 ($n = 9$), first rank the data 11, 12, 13, 16, 16, 17, 18, 21, 22, then identify the positions (a fractional position is rounded up to the next rank):

$$Q_1 = x_{(2.5)} = x_{(3)} = 13, \qquad Q_2 = x_{(5)} = 16, \qquad Q_3 = x_{(7.5)} = x_{(8)} = 21$$

$k$th percentile, $k = 1, 2, \ldots, 99$: $P_k = x_{(k(n+1)/100)}$

Example cont.: 60th percentile $= x_{(60(9+1)/100)} = x_{(6)} = 17$
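The position rule $k(n+1)/100$ can be sketched in a few lines. Two choices below are assumptions, not part of the slides: a fractional position ending in .5 is rounded up (as the $Q_3$ computation $x_{(7.5)} = x_{(8)}$ does), and positions outside $1, \ldots, n$ are clamped to the nearest valid rank. Other textbooks and software interpolate instead, so results may differ.

```python
import math


def percentile_by_position(values, k):
    """k-th percentile as the order statistic at position k(n+1)/100."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 100
    rank = math.floor(position + 0.5)   # assumption: round half up (2.5 -> 3)
    rank = min(max(rank, 1), n)         # assumption: clamp to a valid rank
    return ordered[rank - 1]            # ranks are 1-based, lists 0-based


data = [22, 18, 17, 16, 16, 13, 12, 21, 11]

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = (percentile_by_position(data, k) for k in (25, 50, 75))
p60 = percentile_by_position(data, 60)

print(q2, q3, p60)   # 16 21 17
print(q1)            # value at rank 3 of the ranked data
```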

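Measure of variation: variance

- Average of squared deviations of values from the mean

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample variance (divided by $n$):

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \overbrace{\frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n}}^{\text{faster to calculate}}$$

Sample quasi-variance (corrected sample variance, divided by $n - 1$):

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1}$$

They are related via

$$\hat{\sigma}^2 = \frac{n - 1}{n} s^2$$

- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$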
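The definitions, the shortcut formula and the relation between the two sample versions can be checked on a small data set. In this sketch the function names and the constants `a`, `b` are illustrative assumptions; the standard-library functions `statistics.pvariance` (divides by the number of values) and `statistics.variance` (divides by one less) serve as a cross-check:

```python
import math
from statistics import fmean, pvariance, variance


def variance_n(values):
    """Sample variance: squared deviations from the mean, divided by n."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def quasi_variance(values):
    """Sample quasi-variance via the shortcut formula, divided by n - 1."""
    n = len(values)
    return (sum(v * v for v in values) - n * fmean(values) ** 2) / (n - 1)


x = [3, 1, 5, 4, 2]
n = len(x)

print(variance_n(x), quasi_variance(x))   # 2.0 2.5

# Relation between the two: variance_n = (n - 1) / n * quasi_variance.
print(math.isclose(variance_n(x), (n - 1) / n * quasi_variance(x)))   # True

# Cross-check against the standard library.
print(math.isclose(variance_n(x), pvariance(x)))      # True
print(math.isclose(quasi_variance(x), variance(x)))   # True

# Linear transformation y = a + b*x (a, b are hypothetical constants).
a, b = 10, 2
y = [a + b * xi for xi in x]
print(math.isclose(quasi_variance(y), b ** 2 * quasi_variance(x)))    # True
```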

Measure of variation: standard deviation (SD)

- The most commonly used measure of spread
- Population standard deviation, sample standard deviation and sample quasi-standard deviation are respectively

$$\sigma = \sqrt{\sigma^2}, \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2}, \qquad s = \sqrt{s^2}$$

- Shows variation about the mean
- Has the same units as the original data, whilst variance is in units squared
- Variance and SD are both affected by outliers

Calculating variance and standard deviation


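Example: $X$: 11, 12, 13, 16, 16, 17, 18, 21; $Y$: 14, 15, 15, 15, 16, 16, 16, 17; $Z$: 11, 11, 11, 12, 19, 20, 20, 20

$$\bar{x} = \frac{124}{8} = 15.5, \qquad \bar{y} = \frac{124}{8} = 15.5, \qquad \bar{z} = \frac{124}{8} = 15.5$$

$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \ldots + 21^2 = 2000$$

$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \ldots + 17^2 = 1928$$

$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \ldots + 20^2 = 2068$$

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \quad \Rightarrow \quad s_x = 3.3381$$

$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \quad \Rightarrow \quad s_y = 0.9258$$

$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \quad \Rightarrow \quad s_z = 4.5670$$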
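The three hand calculations can be reproduced with Python's standard library, where `statistics.variance` and `statistics.stdev` divide by $n - 1$ and therefore correspond to the quasi-variance and quasi-standard deviation. A short check:

```python
from statistics import fmean, stdev, variance

samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, values in samples.items():
    sum_of_squares = sum(v * v for v in values)
    print(
        name,
        fmean(values),                 # 15.5 for all three samples
        sum_of_squares,                # 2000, 1928, 2068
        f"{variance(values):.4f}",     # quasi-variance s^2
        f"{stdev(values):.4f}",        # quasi-standard deviation s
    )
```

All three samples share the same mean, so only the spread distinguishes them.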

Comparing standard deviations


Example cont.: $X$: 11, 12, 13, 16, 16, 17, 18, 21; $Y$: 14, 15, 15, 15, 16, 16, 16, 17; $Z$: 11, 11, 11, 12, 19, 20, 20, 20

- $\bar{x} = 15.5$, $s_x = 3.3$
- $\bar{y} = 15.5$, $s_y = 0.9$
- $\bar{z} = 15.5$, $s_z = 4.6$

[Figure: three dot plots, one each for X, Y and Z, on a common axis from 11 to 21]

Numerical summaries and frequency tables. Standardization.

- If the data is discrete then

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$

- If the data is continuous, we replace $x_i$ in the above definition by the mid-points of class intervals
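As an illustration of the grouped-data formulas, the sketch below applies them to the temperature classes of Example 4, taking the mid-point of each interval as $x_i$. The numerical results are computed here and are not stated in the slides; variable names are illustrative.

```python
# Class boundaries and absolute frequencies n_i from Example 4.
boundaries = [10, 20, 30, 40, 50, 60]
counts = [3, 6, 5, 4, 2]

# Mid-point of each class interval [lower, upper).
midpoints = [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]

n = sum(counts)

# Grouped mean: sum of x_i * n_i divided by n.
mean = sum(x * k for x, k in zip(midpoints, counts)) / n

# Grouped quasi-variance: (sum of x_i^2 * n_i - n * mean^2) / (n - 1).
quasi_var = (sum(x * x * k for x, k in zip(midpoints, counts)) - n * mean ** 2) / (n - 1)

print(midpoints)             # [15.0, 25.0, 35.0, 45.0, 55.0]
print(mean)                  # 33.0
print(round(quasi_var, 4))   # 153.6842
```

Because every observation in a class is replaced by the class mid-point, these values only approximate the mean and quasi-variance of the raw data.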

- To standardize variable $x$ means to calculate

$$\frac{x - \bar{x}}{s}$$

- If you apply this formula to all observations $x_1, \ldots, x_n$ and call the transformed ones $z_1, \ldots, z_n$, then the mean of the $z$s is zero and their standard deviation is one
- Standardization = finding z-score
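A quick numerical check of the claim that standardized values have mean zero and standard deviation one, using the sample X from the earlier variance example (`statistics.stdev` divides by $n - 1$, matching $s$):

```python
from statistics import fmean, stdev

x = [11, 12, 13, 16, 16, 17, 18, 21]

x_bar = fmean(x)   # 15.5
s = stdev(x)       # quasi-standard deviation

# z-score of each observation.
z = [(xi - x_bar) / s for xi in x]

print([round(zi, 2) for zi in z])
print(round(fmean(z), 10), round(stdev(z), 10))   # mean ~ 0 and SD ~ 1, up to rounding
```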

Empirical rule

If the data is bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:

- 68% of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
- 95% of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
- 99.7% of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$

Note: This rule is also known as the 68-95-99.7 rule

Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data is bell-shaped, give the limits of an interval that captures 95% of the observations.

95% of the $x_i$s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
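The interval from the example, and the percentages behind the rule, can be reproduced with the standard library's `statistics.NormalDist` (Python 3.8+). The helper name `empirical_interval` is an assumption for this sketch, and the coverage figures refer to an exact normal distribution, so real data will match them only approximately:

```python
from statistics import NormalDist


def empirical_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s)."""
    return mean - k * s, mean + k * s


# Example: mean 40, quasi-standard deviation 5, 95% interval (k = 2).
print(empirical_interval(40, 5, 2))   # (30, 50)

# Share of a normal distribution within k standard deviations of its mean.
standard_normal = NormalDist()        # mean 0, standard deviation 1
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(k, f"{share:.4f}")          # 0.6827, 0.9545, 0.9973
```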

Measure of variation: coefficient of variation (CV)

- Measures relative variation and is defined as

$$CV = \frac{s}{|\bar{x}|}$$

- Is a unitless number (sometimes given in %s)
- Shows variation relative to mean

Example: Stock A: Average price last year = 50, Standard deviation = 5
Stock B: Average price last year = 100, Standard deviation = 5

$$CV_A = \frac{5}{50} = 0.10, \qquad CV_B = \frac{5}{100} = 0.05$$

Both stocks have the same SDs, but stock B is less variable relative to its mean price
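The stock comparison as a minimal Python sketch. The function name is illustrative, and the guard against a zero mean is an added assumption, since the ratio is undefined in that case:

```python
def coefficient_of_variation(s, mean):
    """CV = s / |mean|; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return s / abs(mean)


cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B

print(cv_a, cv_b)                  # 0.1 0.05
print(f"{cv_a:.0%}, {cv_b:.0%}")   # 10%, 5%
```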
