
BC0406 – Introduction to Probability and Statistics

Lecture 3
Types and classification of data
Graphical representation of data

Dr. Richard H.A.H. Jacobs


Universidade Federal do ABC
Agenda
Program for today
• Announcements
• Types and classification of data
• Qualitative / Quantitative
• Nominal, ordinal, interval, ratio
• Graphical representation of data
• Frequency tables
• Graphs for qualitative data
• Bar graphs
• Pie chart or circle chart
• Graphs for quantitative data
• Scatter plot
• Histogram
• Absolute frequency
• Relative frequency
• Polygon
• Tree diagram
Announcements

Questions:
• E-mail: richard.jacobs@ufabc.edu.br

Bibliography for today's lecture


• Larson & Farber: Chapters 1 & 2

Bibliography for next lecture


• Larson & Farber: Chapter 2
Announcement

• The lecture is going to speed up


• Denser content
• Accumulation of concepts
• Review the slides and do the exercises (always read the book chapters for the next lecture)
• Do not wait until two days before the exam to review the material!
Brief review
[Diagram: Population → sampling method (preferably random) → Sample → allocation to control/experimental groups (preferably random) → Treatment A / Treatment B → Data. The allocation and treatment steps are the experimental and control procedures.]

• “Inferential statistics”: inferences about the population & treatment effects
– “Inverse reasoning”
– “Hypothesis testing”
– “Model comparison”
• “Descriptive statistics”: exploratory data analysis; summary graphs and numbers
– “Detective work”
– “Systematic accumulation and exploration of evidence”
Transition to data classification
Population parameters
• Average height of Brazilian men
• Average score of UFABC students at IPE
• Number of drunk drivers at the last carnival

[Diagram: Population → sampling method (preferably random) → Sample]

Sample statistics
• Average height of the 200 Brazilians interviewed
• Average grade of the 90 students from the last IPE class
• Number of drunk drivers caught at the last police checkpoint (carnival?)
Data / variables
Classification (types of data / variables)
Dependent / independent variables
Graphical representation / visualization
Numerical summaries
Data classification

Socioeconomic aspects of the employees of company MB

Type                                Variable              Representation
Qualitative (quality, attribute)    Civil status          X
                                    Degree of education   Y
                                    Region of origin      Z
Quantitative (count)                Salary                S
                                    Age                   U
                                    Number of children    V
Data classification

Qualitative data: attributes, classifications, or non-numeric records.
A quality of the individual studied: married, male, high school, etc.

Quantitative data: numerical measurements or counts.
Numbers resulting from counting or measurement: 46 years old, 3 children, R$ 2,000 monthly, etc.
Data classification
What are the qualitative and quantitative data?

City             Population
Baltimore, MD    651,154
Boston, MA       589,141
Dallas, TX       1,188,580
Las Vegas, NV    478,434
Lincoln, NE      225,581
Seattle, WA      563,374
Source: U.S. Census Bureau

a. Identify the content of each dataset


b. Decide whether each data set consists of numeric or non-numeric records
c. Specify qualitative and quantitative data
Data classification
Levels of measurement:

Qualitative variable
• Nominal: names, brands, qualities; no mathematical calculation is possible
• Ordinal: arranged in order; differences between records are meaningless

Quantitative variable
• Interval: can be ordered; differences are meaningful; the zero point is just one of many values on the scale
• Ratio: similar to the interval scale; the zero value is an inherent zero

Discrete quantitative variable: values form a finite or countable set of numbers; usually the result of counting (e.g. number of children)

Continuous quantitative variable: values belong to a range of real numbers; usually the result of measurement (e.g. height and weight)
Data classification
Nominal qualitative variable
Names, brands, qualities; no mathematical calculation possible

Affiliate networks in Portland, Oregon
• KATU (ABC)
• KGW (NBC)
• KOIN (CBS)
• KPDX (FOX)

Colors of the Brazilian flag
• Green
• Yellow
• Blue
• White

Origin of approved students
• São Paulo, SP
• Campinas, SP
• Porto Alegre, RS
• Brasília, DF
Data classification
Four levels of measurement

Measurement level   Categorized data   Ordered data   Subtractable data values   Ratios between data values meaningful?
Nominal             YES                NO             NO                         NO
Data classification
Ordinal qualitative variable
Arranged in order; differences between records are not meaningful

Five best movies from the USA
1. Citizen Kane
2. Casablanca
3. The Godfather
4. Gone with the Wind
5. Lawrence of Arabia

The five most-watched TV shows in 2001
1. E.R.
2. Friends
3. Law and Order
4. West Wing
5. Will & Grace
Data classification
Four levels of measurement

Measurement level   Categorized data   Ordered data   Subtractable data values   Ratios between data values meaningful?
Nominal             YES                NO             NO                         NO
Ordinal             YES                YES            NO                         NO


Data classification
Quantitative interval variable
Can be ordered; differences are meaningful; zero is an arbitrary position on the scale

Final temperature of the sample (ºC)
27.8, 27.5, 27.8, 27.7, 27.9

Titles of the National Team
1958, 1962, 1970, 1994, 2002
Difference: 1994 − 1970 = 24 years
Data classification
Quantitative interval variable
Can be ordered; differences are meaningful; zero is an arbitrary position on the scale

Final temperature of the sample (ºC)
27.8, 27.5, 27.8, 27.7, 27.9

Titles of the National Team
1958, 1962, 1970, 1994, 2002
Ratio: 1994 / 1970 = 1.012 ?

It makes no sense to divide the data


Data classification
Four levels of measurement

Measurement level   Categorized data   Ordered data   Subtractable data values   Ratios between data values meaningful?
Nominal             YES                NO             NO                         NO
Ordinal             YES                YES            NO                         NO
Interval            YES                YES            YES                        NO


Data classification
Quantitative ratio variable
Similar to the interval variable; zero is a real baseline; it makes sense to determine ratios (one number is a multiple of another)

Final temperature of the sample (K)
300.9, 301.1, 300.8, 301.0, 301.2

Number of goals by the national team
Year   Goals
1958   16
1962   14
1970   19
1994   11
2002   18
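The difference between the interval and ratio levels can be checked numerically. The sketch below is only an illustration: it takes the Celsius values from the slide, converts them with the standard offset of 273.15 (so the results need not match the rounded Kelvin values shown), and compares differences and ratios on both scales. The function name `to_kelvin` is an assumption made for this example.

```python
# Temperatures from the slide, in degrees Celsius (interval scale).
celsius = [27.8, 27.5, 27.8, 27.7, 27.9]


def to_kelvin(c):
    """Convert Celsius to Kelvin, a scale with an inherent zero."""
    return c + 273.15


kelvin = [to_kelvin(c) for c in celsius]

# Differences survive the change of scale ...
diff_c = celsius[4] - celsius[1]
diff_k = kelvin[4] - kelvin[1]

# ... but ratios do not, so a ratio of Celsius values has no physical meaning.
ratio_c = celsius[4] / celsius[1]
ratio_k = kelvin[4] / kelvin[1]

print(f"difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (Kelvin)")
print(f"ratio:      {ratio_c:.4f} (Celsius) vs {ratio_k:.4f} (Kelvin)")
```

The two differences agree, while the two ratios disagree; only the Kelvin ratio can be read as "one temperature is a multiple of another".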
Data classification
Four levels of measurement

Measurement level   Categorized data   Ordered data   Subtractable data values   Ratios between data values meaningful?
Nominal             YES                NO             NO                         NO
Ordinal             YES                YES            NO                         NO
Interval            YES                YES            YES                        NO
Ratio               YES                YES            YES                        YES


Exercise in class
• Exercises
1. The basic prices of several vehicles are in the table below.
A. Which data are qualitative and which are quantitative?
B. Specify the measurement level for each data set and justify your answer.

Model       Basic price
Ranger XL   12,595
ZX2         12,730
Focus LX    13,120
Taurus LX   19,075
Explorer    22,510

2. Identify what each data set represents and specify the measurement level:
A. The body temperature (in ºC) of an exercising athlete
B. The athlete’s heart rate (beats per minute)
C. The final ranking of the SP teams in the São Paulo football
championship
D. The total number of goals scored by each of the 5 top scorers
E. The final list of these 5 top scorers
F. The list of Brazilian states that held state competitions
Graphic representation

“The greatest value of a figure is when it forces us to realize what we would never expect to see … The exploratory analysis of data can never be the whole story, but nothing else can serve as its foundation – as the first step”.
Tukey (1977)

“Drawing graphs, like driving a car and making love, is one of those activities that everyone thinks they can do well without instructions. The results, of course, are generally abominable”.
Margerison (1965)
Frequency table

Variables: Civil status, Degree of education, Number of children, Salary, Age, Region of origin

Discrete quantitative variable

Degree of education   Frequency ni   Proportion fi   Percentage 100 fi
Basic                 12             0.3333          33.33
Medium                18             0.5000          50.00
Higher                6              0.1667          16.67
Total                 36             1.0000          100.00
Frequency table
Company A
Degree of education   Frequency ni
Basic                 12
Medium                18
Higher                6
Total                 36

How do they compare in terms of workers with higher education?

Company B
Degree of education   Frequency ni
Basic                 650
Medium                1,020
Higher                330
Total                 2,000
Frequency table
Company A
Degree of education   Frequency ni   Proportion fi   Percentage 100 fi
Basic                 12             0.3333          33.33
Medium                18             0.5000          50.00
Higher                6              0.1667          16.67
Total                 36             1.0000          100.00

Company B
Degree of education   Frequency ni   Proportion fi   Percentage 100 fi
Basic                 650            0.3250          32.50
Medium                1,020          0.5100          51.00
Higher                330            0.1650          16.50
Total                 2,000          1.0000          100.00
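The proportions and percentages in the two tables follow directly from the absolute frequencies: fi = ni / n. As a minimal sketch (the function name `frequency_table` and the dictionary layout are assumptions made for this example), the calculation could look like this:

```python
def frequency_table(counts):
    """Return {category: (ni, fi, 100 * fi)} for a dict of absolute frequencies."""
    total = sum(counts.values())
    return {level: (n, n / total, 100 * n / total) for level, n in counts.items()}


# Absolute frequencies from the slides.
company_a = {"Basic": 12, "Medium": 18, "Higher": 6}
company_b = {"Basic": 650, "Medium": 1020, "Higher": 330}

table_a = frequency_table(company_a)
table_b = frequency_table(company_b)

for name, table in (("A", table_a), ("B", table_b)):
    for level, (n, f, pct) in table.items():
        print(f"Company {name}  {level:<7} {n:>5}  {f:.4f}  {pct:6.2f}")
```

Although company B has 55 times as many workers with higher education (330 against 6), the percentages (16.50 against 16.67) show that the two companies are almost identical in relative terms.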
Frequency table

Variable: Salary (continuous quantitative variable)

Problem:
• Salaries differ between employees
• No two observations are equal
• How should the frequency table be assembled?
• Is it helpful to count each individual value?
Frequency table

Variable: Salary (continuous quantitative variable)

Salary   Frequency ni   Percentage 100 fi
4.00     1              2.78
4.56     1              2.78
5.25     1              2.78
5.73     1              2.78
...      ...            ...
23.30    1              2.78
Total    36             100.00
Frequency table
When summarizing the data:
• We can obtain a better representation of the distribution of salaries
• We lose some information
• The choice of intervals is arbitrary: the researcher's familiarity with the data dictates how many classes to use and which ones

Variable: Salary

Salary class      Frequency ni   Percentage 100 fi
4.00 ˫ 8.00       10             27.78
8.00 ˫ 12.00      12             33.33
12.00 ˫ 16.00     8              22.22
16.00 ˫ 20.00     5              13.89
20.00 ˫ 24.00     1              2.78
Total             36             100.00

(a ˫ b denotes the class that includes a and excludes b.)
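Grouping a continuous variable into classes amounts to counting how many observations fall into each half-open interval. The sketch below uses the class limits from the slide. The full list of 36 salaries is not given in the slides, so the `salaries` list is hypothetical: it contains the five values shown earlier (4.00, 4.56, 5.25, 5.73, 23.30) plus invented values placed on class limits to show the endpoint rule. The function name `class_frequencies` is also an assumption.

```python
from bisect import bisect_right


def class_frequencies(values, edges):
    """Count values in the half-open classes [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            # bisect_right puts a value equal to a limit into the class it opens.
            counts[bisect_right(edges, v) - 1] += 1
    return counts


# Hypothetical sample: NOT the 36 salaries behind the slide's table.
salaries = [4.00, 4.56, 5.25, 5.73, 8.00, 9.10, 12.00, 15.99, 16.00, 23.30]
edges = [4.0, 8.0, 12.0, 16.0, 20.0, 24.0]  # class limits from the slide

counts = class_frequencies(salaries, edges)
for low, high, n in zip(edges, edges[1:], counts):
    print(f"{low:5.2f} to {high:5.2f} (upper limit excluded): "
          f"{n} ({100 * n / len(salaries):.2f}%)")
```

A salary of exactly 8.00 is counted in the second class, not the first, which is the convention expressed by the ˫ symbol.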
Charts for qualitative variables
Degree of education   Frequency ni   Proportion fi   Percentage 100 fi
Basic                 12             0.3333          33.33
Medium                18             0.5000          50.00
Higher                6              0.1667          16.67
Total                 36             1.0000          100.00

Bar graph

Typical bar graph:
• Qualitative variable on the x-axis (independent variable, IV)
• Dependent variable (DV) on the y-axis (e.g. absolute or relative frequency)

[Figure: bar graph of frequency (y-axis, 0 to 20) by degree of education (x-axis): Basic, Medium, Higher]
Charts for qualitative variables
Bar graph

Horizontal bar graph:
• Like the typical bar graph, but with the dependent variable on the x-axis
• Useful when many groups have to be plotted

[Figure: horizontal bar graph of frequency (x-axis, 0 to 20) by degree of education (y-axis): Basic, Medium, Higher]
Charts for qualitative variables
Other types of bar graphs

Horizontal bar graph (note the change in the origin line of the x-axis)
[Figure: weight change (kg), from -10 to 10, for diets A, B, C and control]

Bar chart for spread
• Spread is plotted on the x-axis
• Often with other values (e.g. the average) marked within the spread
• Useful when the interest is in the variability of the response
• Difficult to create with common tools (Excel)
[Figure: ozone concentration, from 0 to 0.05, at sites A, B, C and D]


Charts for qualitative variables
Other types of bar graphs

Grouped bar graph
• Two or more independent variables, one dependent variable
• Requires a legend to indicate the second independent variable (IV)
• Useful when the response depends on both IVs
[Figure: % of subjects (Ss) jealous (y-axis, 0 to 70) by sexual experience (low, high), with separate bars for women and men]

“Stacked” bar graph
• 2+ IVs, levels, or groups; one DV
• The values are “stacked”
• Most useful when the sum of the IVs is of interest
• Often hard to see the values of the individual IVs
Charts for qualitative variables
Sector diagram (pie chart)

[Figure: pie chart of degree of education: Basic 33%, Medium 50%, Higher 17%]

Sector diagram
• Similar to the stacked bar graph, but with the DV represented as % of area
• Comparisons between groups are difficult
• Therefore, labels or a legend are added
• Use with caution; consider using bar graphs first
Charts for quantitative variables

• Most common representations:

• Bar chart

• Scatter plot

• Histogram

• Stem-and-leaf plot

• Boxplot
Charts for quantitative variables
Bar graphs: discrete quantitative variables

Variable: Number of children

Number of children   Frequency ni
0                    4
1                    5
2                    7
3                    3
5                    1
Total                20

[Figure: bar graph with the number of children (0 to 5) on the x-axis (quantitative independent variable, IV) and the frequency (0 to 10) on the y-axis (quantitative dependent variable, DV)]
Charts for quantitative variables
Scatter plot

Number of children   Frequency ni
0                    4
1                    5
2                    7
3                    3
5                    1
Total                20

[Figure: scatter plot of frequency (y-axis, 0 to 10) against number of children (x-axis, 0 to 6)]

• Two quantitative variables, generally continuous
• Individual cases: pairs of X,Y
• There is often no clear distinction between independent and dependent variable
• Widely used to reveal the association between variables
Charts for quantitative variables
Scatter plot: continuous quantitative variables

Regression /
Correlation
Charts for quantitative variables

Frequency histogram
• Graphic method used for summarizing
frequency distributions

• Useful to reveal
– 1. Center: where most of the
observations are
– 2. Dispersion: how observations
vary
– 3. Shape: symmetry
– 4. “Outliers”: extreme observations
Charts for quantitative variables

Frequency histogram
• Characteristics:
– The horizontal scale (x-axis) is quantitative
• Measures the data values
– The vertical scale (y-axis) is quantitative
• Measures the frequency of the classes
– Consecutive bars should touch each other
Charts for quantitative variables

Frequency histogram
• Characteristics:
– Base widths are equal to the class intervals
– Area of each rectangle (bar) is proportional to its frequency
• Total area under the bars = 1 (when the heights are relative-frequency densities)


Charts for quantitative variables
Area in the bar corresponds to %
Percentage in interval = #/N = 7/30 = 23%
(Equal class intervals) Height = # cases

Frequency distribution of the salaries (x$1k)

Class   # cases
4-5     1
6-7     2
8-9     3
10-11   7
12-13   5
14-15   6
16-17   3
18-19   2
Total   30

[Figure: histogram of # cases (y-axis, 0 to 8) by class interval (x$1k), from 4-5 to 20-21]
Charts for quantitative variables
Frequency histogram
• Intervals in a histogram based on a continuous variable
– Adjacent bars, no separation
• Actual lower boundary
– Lowest value to be classified in that interval
• Actual upper boundary
– Highest value to be classified in that interval
• Concerns about values exactly at the borders
– “Endpoint conventions”
– Ex: “values at the actual upper boundary are included in the lower interval”

[Figure: histogram of # cases by class interval (x$1k), from 4-5 to 20-21, with an actual lower boundary marked at 5.5k and an actual upper boundary marked at 7.5k]
Charts for quantitative variables
Relative frequency histogram: example

Int. (in $1000)   %
0-1               1
1-2               2
2-3               3
3-4               4
4-5               5
5-6               5
6-7               5
7-10              15
10-15             26
15-25             26
25-50             8
51+               1

[Figure: histogram of % of sample (y-axis, 0 to 30) by salary interval (in $1000), with every class drawn at the same width]

Q. Is this correct?
A. No. The class intervals are unequal, so the sizes of the bars are misleading. Ex: 1-2k is 1/25 the size of 25-50k.
Charts for quantitative variables
Conversion to a density scale (density of frequency)

• When class intervals are not equal, beware
– Using “raw frequency” as a height is misleading
• In such cases, convert to a “density” scale
– Set the observation unit
• Ex: $1000 or $1k
– Determine the number of units in each interval
• Ex: interval $25 – 50k has 25 “units” of $1k
– Density = # cases (or %) in the interval / # units in the interval
• Ex: interval has 8%, so Density = 8/25 = 0.32
– In essence: how many cases are there in each “slice” (unit) of a range
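The conversion steps can be sketched in a few lines. The intervals and percentages are those of the salary example; the open-ended class 51+ is left out because it has no upper limit from which to compute a width. The function name `density` and its `unit` parameter are assumptions made for this illustration.

```python
def density(lower, upper, percent, unit=1.0):
    """Percentage of cases per observation unit within one class."""
    units_in_interval = (upper - lower) / unit
    return percent / units_in_interval


# (lower limit, upper limit, % of sample), limits in units of $1k.
intervals = [
    (0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 4, 4), (4, 5, 5), (5, 6, 5),
    (6, 7, 5), (7, 10, 15), (10, 15, 26), (15, 25, 26), (25, 50, 8),
]

densities = [density(lo, hi, pct) for lo, hi, pct in intervals]

# Width times height gives back the percentage: area, not height, carries it.
areas = [d * (hi - lo) for d, (lo, hi, _) in zip(densities, intervals)]

for (lo, hi, pct), d in zip(intervals, densities):
    print(f"{lo:>2}-{hi:<2}  {pct:>2}%  width {hi - lo:>2}  density {d:.2f}")
```

For the class 25-50 this reproduces the value 8/25 = 0.32 from the example above.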
Charts for quantitative variables
Conversion…

Int. (in $1000)   %    L (units)   D (% per unit)
0-1               1    1           1
1-2               2    1           2
2-3               3    1           3
3-4               4    1           4
4-5               5    1           5
5-6               5    1           5
6-7               5    1           5
7-10              15   3           5
10-15             26   5           5.2
15-25             26   10          2.6
25-50             8    25          0.32
51+               1    ?           ?

[Figure: histogram of % per unit ($1k) (y-axis, 0 to 6) by interval (in $1000), with bars labeled D = 5, D = 5.2, D = 2.6 and D = .32]

Note: the area of each bar now corresponds to the percentage of cases in that range!
Charts for quantitative variables
Comparison…

[Figure: “First try” (y-axis 0 to 30) and “Second try” (y-axis 0 to 6) histograms side by side, x-axis from 0 to 51+]

[Figure: “Histogram of Income”, % cases (y-axis, 0 to 50) by income (x-axis, 0 to 80); example of software used for data processing (S+)]
Charts for quantitative variables
From histogram to frequency polygon: just one step
• Histograms give us powerful summaries of distributions
– However, the bars may distract from the primary interest (e.g. shape)
• Frequency polygons are widely used
– 1. Start with a histogram
– 2. Draw lines connecting the midpoints of adjacent intervals
– 3. By convention, connect the lowest and highest intervals to zero at the next midpoint

[Figure: histogram of # cases (y-axis, 0 to 8) for the intervals 2-3 to 20-21, next to the corresponding frequency polygon on the same axes]
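The three steps for building a frequency polygon can be sketched as follows, using the salary classes from the earlier histogram example. The function name `polygon_points` is an assumption, and the sketch assumes equally spaced classes.

```python
# (lower limit, upper limit, # cases) for the salary classes (x$1k).
classes = [(4, 5, 1), (6, 7, 2), (8, 9, 3), (10, 11, 7),
           (12, 13, 5), (14, 15, 6), (16, 17, 3), (18, 19, 2)]


def polygon_points(classes):
    """Return the (x, y) vertices of a frequency polygon.

    Assumes equally spaced classes, so the distance between class starts
    also gives the position of the 'next midpoint' on either side.
    """
    step = classes[1][0] - classes[0][0]
    midpoints = [((low + high) / 2, count) for low, high, count in classes]
    first_x = midpoints[0][0] - step   # midpoint of the empty class before
    last_x = midpoints[-1][0] + step   # midpoint of the empty class after
    return [(first_x, 0)] + midpoints + [(last_x, 0)]


points = polygon_points(classes)
for x, y in points:
    print(f"x = {x:4.1f}  y = {y}")
```

The polygon starts at the midpoint of the interval 2-3 and ends at the midpoint of the interval 20-21, both with zero frequency, which closes the shape on the x-axis.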
Charts for quantitative variables

Comments on intervals in frequency distributions


• Selecting the number of classes is important
– Some books recommend the whole number closest to √N
– What is informative will depend on the context

• The width of the classes (intervals) is important
– Some data may require unequal intervals (our example)
– If this occurs, be very careful in constructing the graph
– Statistical software often has problems with unequal intervals
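As a rough illustration of the √N recommendation, the sketch below suggests a number of classes and a matching class width. Both function names are assumptions made for this example, and the result is only a starting point: the salary table earlier in the lecture uses 5 classes of width 4.00, not the values suggested here.

```python
import math


def suggested_classes(n):
    """Whole number closest to the square root of the sample size."""
    return max(1, round(math.sqrt(n)))


def suggested_width(minimum, maximum, n):
    """Equal class width that covers the range with the suggested classes."""
    return (maximum - minimum) / suggested_classes(n)


k = suggested_classes(36)                 # 36 employees in the salary example
width = suggested_width(4.00, 23.30, 36)  # smallest and largest salary shown
print(f"{k} classes of width {width:.2f}")
```

In practice the width is usually rounded to a convenient value, which is why context and familiarity with the data matter more than the formula.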
Cumulative frequency plots
• Frequency plots “summing up” from left to right
• Advantage: no need for intervals
– Larson & Farber only illustrates with intervals
– Assignment:

[Figure: frequency histogram (y-axis, 1 to 9) for the classes 16-23, 23-30, 30-37, 37-44, 44-51 and 51-58, and the corresponding cumulative frequency plot (y-axis, 0 to 25; x-axis, 10 to 60)]
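"Summing up from left to right" is a running total of the frequencies. The sketch below applies it to the number-of-children table from earlier in the lecture, which needs no class intervals. The variable names are assumptions made for this example.

```python
from itertools import accumulate

# Number of children and absolute frequency ni, from the earlier table.
children = [0, 1, 2, 3, 5]
frequency = [4, 5, 7, 3, 1]

# Running total of the frequencies, from the smallest value to the largest.
cumulative = list(accumulate(frequency))
total = cumulative[-1]

for value, cum in zip(children, cumulative):
    print(f"up to {value} children: {cum} cases ({100 * cum / total:.0f}%)")
```

The last cumulative value always equals the total number of observations, here 20.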
Charts for quantitative variables
Stem-and-leaf tables
• Histograms give us important information about our sample
– Central tendency, variability & shape

• Two disadvantages
– 1. Difficult to construct
– 2. Information is lost

• Stem-and-leaf tables
– Give us all the histogram information
– Easier to construct
– Less information is lost; one can return to the original
values
– Easier to add new data
– However, it is less traditional and sometimes “less
beautiful”
– Less adopted by popular software
Charts for quantitative variables
Stem-and-leaf tables
• Components:
– Stem
• Leading digit(s) of each interval, corresponding to the lower boundary of the interval
• Typically arranged vertically in the table, from the lowest value (top) to the highest (bottom)
– Leaves
• The remaining digits of the data, positioned so that:
– Stem + leaf = original observation
• Typically arranged from the lowest value (closest to the stem) to the highest (farthest from the stem)

Use keys/legends!
Charts for quantitative variables
20 grades of the statistics exam in 2007:
63 71 100 77 74 83 64 94 82 79
95 76 89 77 94 82 63 98 69 76

Stem Leaf

6 3349
7 1466779
8 2239
9 4458
10 0

Key:
1. “6|3” = “63”
The leaves may not always be sorted as in this example
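The table above can be rebuilt from the 20 grades with a short sketch. The function name `stem_and_leaf` and its `divisor` parameter are assumptions made for this example. It handles whole numbers only and, unlike a hand-drawn table, it does not list stems that have no leaves.

```python
from collections import defaultdict


def stem_and_leaf(values, divisor=10):
    """Group whole numbers into stems (value // divisor) and leaves (remainder)."""
    stems = defaultdict(list)
    for value in sorted(values):          # sorting keeps the leaves in order
        stem, leaf = divmod(value, divisor)
        stems[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}


grades = [63, 71, 100, 77, 74, 83, 64, 94, 82, 79,
          95, 76, 89, 77, 94, 82, 63, 98, 69, 76]

plot = stem_and_leaf(grades)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {leaves}")
print("Key: 6|3 = 63")
```

Because every leaf is kept, the original observations can be read back from the table, which is the property that a histogram loses.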
Charts for quantitative variables
Example 2: “back-to-back” stem-and-leaf table for life expectancy

Women                        Stem   Men
Sweden                       78
France, US, Japan, Canada    77
Finland, Austria, UK         76
USSR, Germany                75
                             74
                             73
                             72     Sweden
                             71     Japan
                             70
                             69     Canada, UK, US, France
                             68     Germany, Austria
                             67     Finland
                             66
                             65
                             64
                             63
                             62     USSR
Bibliography for next lecture
• Larson & Farber: Chapter 2
• Measures of central tendency
Charts for quantitative variables
Example 2: Data of Lord Rayleigh
Measured weight of oxygen:
2.2981 2.3101 2.2984 2.3100 2.3014 2.3103
2.2988 2.3098 2.2989 2.3101 2.2994 2.3102
2.2988 2.3102 2.3018
Info: observations are from air (a); [even observations] are from other sources (o).

Stem    Leaf
2.29o   81 a  84 a  88 a  88 a  89 a  94 a
2.30z   14 a  18 a
2.30d
2.30q
2.30s
2.30o   98 o
2.31z   00 o  01 o  01 o  02 o  02 o  03 o

NB: 3 stems would be very coarse, so we created more intervals.
Key: z = 0,1; d = 2,3; q = 4,5; s = 6,7; o = 8,9
