
9/6/2016

PART I:
Descriptive Statistics

• A. Role of statistics
• Description of data
• Inference/association
• Explanation
• Decision-making


B. Data Sets, Types of Variables

• Descriptive statistics are used to deal with large sets of data.
• Examples:
• records of temperature and humidity
• annual incomes
• opinion polls
• transportation counts
• quality assessment of buildings
• information about states, counties, cities: population, density, unemployment rates, poverty rates …

• Data typically displayed in tables with:
• the variables in columns
• the observations in rows
• E.g., states data:

State      Region           Total Population   Urban Population   1999 Median Household Income ($)
Alabama    South            4,447,100          2,465,673          34,135
Alaska     Non-contiguous   626,932            411,257            51,571
Arizona    Southwest        5,130,632          4,523,535          40,558
Arkansas   South            2,673,400          1,404,179          32,182


Recording of Weather Information

Hour   Temperature (in degrees Celsius)   Relative Humidity (in %)   Cloudiness      Rain
0      23                                 60                         No Clouds       no
1      22.5                               65                         No Clouds       no
2      22                                 68                         No Clouds       no
3      21.5                               70                         No Clouds       no
4      21                                 73                         No Clouds       no
5      21.5                               69                         No Clouds       no
6      22.5                               68                         No Clouds       no
7      24                                 62                         No Clouds       no
8      25                                 60                         No Clouds       no
9      28                                 58                         No Clouds       no
10     30                                 55                         No Clouds       no
11     32                                 54                         Partly Cloudy   no
12     33                                 50                         Partly Cloudy   no
13     34                                 48                         Partly Cloudy   no
14     34                                 48                         Partly Cloudy   no
15     33                                 48                         Cloudy          no
16     25                                 49                         Cloudy          yes
17     24                                 50                         Cloudy          yes
18     27                                 52                         Cloudy          no
19     26                                 53                         Partly Cloudy   no
20     25.5                               52                         Partly Cloudy   no
21     25                                 56                         Partly Cloudy   no
22     24.5                               58                         No Clouds       no
23     24                                 59                         No Clouds       no

Types of variables
1) Nominal
Describes/names/labels categories

E.g., - ‘color’: ‘green’, ‘yellow’, ‘red’
- ‘attitudes’: ‘prejudiced’ vs. ‘tolerant’
- in weather dataset: ‘rain’

Notes:
- Categories must be mutually exclusive
- A variable with only two outcomes (yes/no, present/absent) is binary


• 2) Ordinal
Outcomes sorted into ordered categories, e.g., poor/medium/good or dark/medium/light
Weather data: ‘degree of cloudiness’.

Note: Nominal and ordinal variables = categorical

• 3) Interval
= continuous; can take any numerical value within a range, e.g., [-∞ ; +∞] or [0 ; 100]
E.g., ‘relative humidity’ = 55.3575% or 55%; dollars, age, minutes, years

Notes: - zero-point can be arbitrary (temperature in Celsius or Fahrenheit)
OR non-arbitrary (a humidity of 0%)


Nominal? Ordinal? Interval?

• Sex (male/female)
• Year when house built
• Housing Quality (very high/ high/ low/ very low)
• Temperature
• Amount of carbon monoxide in air
• Amount of precipitation
• Race
• Age
• Income

• Large datasets
• Seeing patterns or regularities is impossible unless we do “something” with the data
• -> 2 common procedures that help us understand the data and reveal regularities:
• graphical display of data
• statistical summary measures


Why do we need to distinguish between types of variables?

• They are treated differently mathematically and statistically

• E.g., interval data: a relative humidity of 60% is twice as high as 30%, but we cannot make such comparisons for categorical data

A few rules
1) Variable names: usually CAPITALIZED and abbreviated
POP90, POP_URB_90

2) Giving numerical values to categorical variables:
• Codes are arbitrary
• e.g., no clouds = 0, partly cloudy = 1, cloudy = 2
• Standards for assigning numbers to different categories:
• numbers consistent with “natural ordering”
e.g., 1 = poor; 2 = medium; 3 = good.
• Binary variables: named after the category coded as 1
e.g., sex (male/female) recorded as FEMALE: 0 = male; 1 = female

=> Codebook must indicate what each value stands for
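The coding rules above can be illustrated with a minimal sketch. The `CODEBOOK` dictionary and the `encode`/`decode` helpers are hypothetical names chosen for this example; only the codes themselves (CLOUD and FEMALE) come from the slides.

```python
# Hypothetical codebook: for each variable, what every numeric code stands for.
CODEBOOK = {
    "CLOUD": {0: "no clouds", 1: "partly cloudy", 2: "cloudy"},
    "FEMALE": {0: "male", 1: "female"},  # binary variable named after the category coded 1
}

def encode(variable, label):
    """Return the numeric code recorded for a category label."""
    for code, meaning in CODEBOOK[variable].items():
        if meaning == label:
            return code
    raise ValueError(f"{label!r} is not in the codebook for {variable}")

def decode(variable, code):
    """Look up what a numeric code stands for."""
    return CODEBOOK[variable][code]

observations = ["no clouds", "partly cloudy", "cloudy", "no clouds"]
codes = [encode("CLOUD", label) for label in observations]
print(codes)                 # [0, 1, 2, 0]
print(decode("FEMALE", 1))   # female
```

Keeping the mapping in one place means the numbers stay interpretable even though they are arbitrary.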


C. Description of data

Nominal data: frequency distributions

Hair color Frequency

Black 3
Brown 25
Blond 12
Red 4
Total 44

• Interval data: grouped frequency distributions

Age Frequency
15-19 3
20-24 15
25-29 12
30-34 11
Total 41
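A grouped frequency distribution like the age table can be built by assigning each value to a class of fixed width. The sketch below assumes integer ages and 5-year classes starting at 15; the `ages` list is made up for illustration, since the slides give only the grouped counts, not the raw data.

```python
from collections import Counter

def grouped_frequencies(values, start, width):
    """Count how many integer values fall in each class [lower, lower + width - 1]."""
    counts = Counter()
    for v in values:
        lower = start + ((v - start) // width) * width
        counts[(lower, lower + width - 1)] += 1
    return dict(sorted(counts.items()))

# Hypothetical raw ages (not from the slides)
ages = [15, 18, 19, 20, 21, 23, 24, 25, 29, 30, 34]

table = grouped_frequencies(ages, start=15, width=5)
for (low, high), freq in table.items():
    print(f"{low}-{high}  {freq}")
print("Total ", sum(table.values()))
```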


Cumulative distribution
= number of observations at value x or higher

Score   Frequency   Cumulative frequency
81      3           64
82      15          61
83      12          46
84      11          34
85      10          23
86      7           13   (13 observations score 86 or more)
87      4           6
88      2           2
Total   64
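The cumulative column can be reproduced by accumulating the frequencies from the highest score downward. This is a minimal sketch using the frequencies from the table; the function name is an assumption made for the example.

```python
# Frequencies per score, taken from the table
frequencies = {81: 3, 82: 15, 83: 12, 84: 11, 85: 10, 86: 7, 87: 4, 88: 2}

def cumulative_at_or_above(freq):
    """For each score x, count the observations with a score of x or higher."""
    running = 0
    result = {}
    for score in sorted(freq, reverse=True):  # start from the highest score
        running += freq[score]
        result[score] = running
    return dict(sorted(result.items()))

cumulative = cumulative_at_or_above(frequencies)
for score, count in cumulative.items():
    print(score, frequencies[score], count)
```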


Cross-tabulations: 2 variables

Transportation mode   Male   Female   Total
Bike                  5      3        8
Walk                  2      5        7
Drive alone           15     12       27
Carpool               6      10       16
Total                 28     30       58
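A cross-tabulation counts how often each pair of categories occurs together. The sketch below expands the cell counts from the table into one (mode, sex) pair per person, as a stand-in for the raw survey records that the slides do not show, and then recounts them with row and column totals.

```python
from collections import Counter

# Cell counts from the table, expanded to one (mode, sex) record per person
cells = {
    ("Bike", "Male"): 5, ("Bike", "Female"): 3,
    ("Walk", "Male"): 2, ("Walk", "Female"): 5,
    ("Drive alone", "Male"): 15, ("Drive alone", "Female"): 12,
    ("Carpool", "Male"): 6, ("Carpool", "Female"): 10,
}
observations = [pair for pair, n in cells.items() for _ in range(n)]

def crosstab(pairs):
    """Return cell counts plus row and column totals for (row, column) pairs."""
    counts = Counter(pairs)
    row_totals, col_totals = Counter(), Counter()
    for (row, col), n in counts.items():
        row_totals[row] += n
        col_totals[col] += n
    return counts, row_totals, col_totals

counts, row_totals, col_totals = crosstab(observations)
for mode in ["Bike", "Walk", "Drive alone", "Carpool"]:
    print(mode, counts[(mode, "Male")], counts[(mode, "Female")], row_totals[mode])
print("Total", col_totals["Male"], col_totals["Female"], len(observations))
```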

D. Graphical Display of Data

• Extremely useful
• at beginning of every data analysis
• no limits to your creativity

• BUT not every graph is a good graph

• Should be self-explanatory
• Clear title
• Clear labels (axes, title, legend)


Line graph (often used for trend data)

[Figure: line graph titled “Hourly Temperature Variation”; x-axis: Hour (0 to 24); y-axis: Temperature (Celsius, 15 to 35). Annotations point out the title of the graph, the label of the y-axis, and the label of the x-axis.]

[Figure: line graph titled “Hourly Variation of Relative Humidity”; x-axis: Hour (0 to 24); y-axis: Relative Humidity (%, 40 to 75).]

Frequency distributions

1) Kurtosis (peakedness)

[Figure: three distribution curves labeled Flat, Peaked, and Normal]


2) Skewness (degree of symmetry)

[Figure: three distribution curves labeled Positive skew, Negative skew, and Symmetrical]
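Skewness and kurtosis can also be expressed as numbers. The slides describe them only visually, so the sketch below assumes the standard moment-based definitions (third and fourth central moments scaled by the variance); other definitions exist and may give slightly different values. Under this convention, skewness is 0 for a symmetrical distribution and positive when the long tail is on the right, and excess kurtosis is 0 for a normal curve, negative for a flatter one, and positive for a more peaked one.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)**k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness (an assumed definition, not from the slides)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal curve scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hourly temperatures from the weather table
temperature = [23, 22.5, 22, 21.5, 21, 21.5, 22.5, 24, 25, 28, 30, 32,
               33, 34, 34, 33, 25, 24, 27, 26, 25.5, 25, 24.5, 24]

print("skewness:", round(skewness(temperature), 2))
print("excess kurtosis:", round(excess_kurtosis(temperature), 2))
```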

Scatter diagrams
• 2 variables in 1 graph (= x-y graph in Excel)

[Figure: scatter plot titled “Relationship between Temperature and Relative Humidity”; x-axis: Temperature (Celsius, 19 to 35); y-axis: Relative Humidity (%, 40 to 75).]

- Downward slope: negative, inverse relationship
- Upward slope: positive, direct relationship
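The direction of a relationship can also be checked numerically. The slides only describe the visual slope; the sketch below uses the Pearson correlation coefficient as one common measure (an addition, not something the slides introduce), applied to the temperature and humidity columns of the weather table. A negative value corresponds to a downward slope, a positive value to an upward slope.

```python
from math import sqrt

# Hourly values from the weather table (hours 0 to 23)
temperature = [23, 22.5, 22, 21.5, 21, 21.5, 22.5, 24, 25, 28, 30, 32,
               33, 34, 34, 33, 25, 24, 27, 26, 25.5, 25, 24.5, 24]
humidity = [60, 65, 68, 70, 73, 69, 68, 62, 60, 58, 55, 54,
            50, 48, 48, 48, 49, 50, 52, 53, 52, 56, 58, 59]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled to the range [-1, 1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson_r(temperature, humidity)
direction = "negative (inverse)" if r < 0 else "positive (direct)"
print(f"r = {r:.2f}: {direction} relationship")
```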


2) Displaying categorical data

• 1. Create summary frequency tables

Variable CLOUD      Frequency (absolute)   Frequency (in %)
No Clouds (0)       13                     54.17
Partly Cloudy (1)   7                      29.17
Cloudy (2)          4                      16.67

Variable RAIN       Frequency (absolute)   Frequency (in %)
No Rain (0)         22                     91.67
Rain (1)            2                      8.33
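The summary tables can be derived directly from the hourly weather records. The sketch below rebuilds the Cloudiness and Rain columns from the weather table and computes absolute and percentage frequencies; `frequency_table` is a name chosen for this example.

```python
from collections import Counter

# Cloudiness and Rain columns of the weather table, in hour order (0 to 23)
cloud = (["No Clouds"] * 11 + ["Partly Cloudy"] * 4 + ["Cloudy"] * 4
         + ["Partly Cloudy"] * 3 + ["No Clouds"] * 2)
rain = ["no"] * 16 + ["yes"] * 2 + ["no"] * 6

def frequency_table(values):
    """Map each category to (absolute frequency, percentage of all observations)."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 2)) for cat, n in counts.most_common()}

for name, column in [("CLOUD", cloud), ("RAIN", rain)]:
    print(name)
    for category, (absolute, percent) in frequency_table(column).items():
        print(f"  {category}: {absolute} ({percent}%)")
```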

• 2. Usually displayed in bar charts (histograms) or pie charts

- Histograms: in absolute values or in percentages
- Pie charts: in percentages (sum to 100% of the pie)

[Figure: two bar charts. “Absolute Frequency of CLOUD Categories” (bars: no clouds, partly cloudy, cloudy; y-axis: Absolute Frequency, 0 to 14) and “Absolute Frequency of RAIN Categories” (bars: no rain, rain; y-axis: Absolute Frequency, 0 to 25).]

[Figure: two pie charts. “Frequency Distribution of CLOUD Categories” (no clouds 54%, partly cloudy 29%, cloudy 17%) and “Frequency Distribution of RAIN Categories” (no rain 92%, rain 8%).]
