Lecture 3
Types and classification of data
Graphical representation of data
Questions:
• E-mail: richard.jacobs@ufabc.edu.br
Experimental and Control Procedures
Sample → allocation to groups, control/experimental (preferably random) → Treatment A / Treatment B → Data
“Descriptive statistics”: exploratory data analysis; summary graphs and numbers.
“Detective work”: “systematic accumulation and exploration of evidence”.
Transition to data classification
Population → sampling method (preferably random) → Sample
Population parameters
• Average height of Brazilian men
• Average score of UFABC students at IPE
• Number of drunk drivers at the last carnaval
Sample statistics
• Average height of the 200 Brazilians interviewed
• Average grade of the 90 students from the last IPE class
• Number of drunk drivers on the last blitz (carnaval?)
Data / variables
Classification (types of data / variables)
Dependent / independent variables
Graphical representation / visualization
Numerical summaries
Data classification
Variable                            Representation
Qualitative (quality, attribute)
  Civil status                      X
  Degree of education               Y
  Region of origin                  Z
Quantitative (count)
  Salary                            S
  Age                               U
  Number of children                V
Data classification
City             Population
Baltimore, MD       651,154
Boston, MA          589,141
Dallas, TX        1,188,580
Las Vegas, NV       478,434
Lincoln, NE         225,581
Seattle, WA         563,374
Source: U.S. Census Bureau

Affiliate Networks in Portland, Oregon
KATU (ABC)
KGW (NBC)
KOIN (CBS)
KPDX (FOX)

Colors of the Brazilian flag
Green
Yellow
Blue
White

Origin of approved students
São Paulo, SP
Campinas, SP
Porto Alegre, RS
Brasília, DF
Data classification
Four levels of measures
Measurement level | Categorized data | Ordered data | Subtractable data values | Are ratios between the data values meaningful?
Nominal           | YES              | NO           | NO                       | NO
Data classification
Ordinal qualitative variable
arranged in order; differences between records are not meaningful
Final temperature of the sample (ºC): 27.8, 27.5, 27.8, 27.7, 27.9
Titles of the National Team: 1958, 1962, 1970, 1994, 2002
1994 − 1970 = 24 years
Data classification
Interval quantitative variable
Can be ordered; differences are meaningful; zero is an arbitrary position on the scale
Final temperature of the sample (ºC): 27.8, 27.5, 27.8, 27.7, 27.9
Titles of the National Team: 1958, 1962, 1970, 1994, 2002
1994 / 1970 = 1.012 ?
Final temperature of the sample (K): 300.9, 301.1, 300.8, 301.0, 301.2
Number of goals by the national team
Year   Goals
1958   16
1962   14
1970   19
1994   11
2002   18
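The contrast between the Celsius and Kelvin examples can be checked numerically. The sketch below takes the Celsius readings from the slides and assumes the standard conversion K = ºC + 273.15 (the conversion is not stated on the slides). It illustrates that differences keep their value when the zero point moves, while ratios do not.

```python
# Temperatures from the slides, in degrees Celsius (interval scale).
celsius = [27.8, 27.5, 27.8, 27.7, 27.9]

# Assumption: standard conversion to Kelvin (a ratio scale with a true zero).
kelvin = [c + 273.15 for c in celsius]

# Differences are unchanged by the shift of the zero point...
diff_c = celsius[0] - celsius[1]
diff_k = kelvin[0] - kelvin[1]

# ...but ratios change, so a ratio of Celsius values carries no physical meaning.
ratio_c = celsius[0] / celsius[1]
ratio_k = kelvin[0] / kelvin[1]

print(f"difference: {diff_c:.2f} ºC vs {diff_k:.2f} K")
print(f"ratio: {ratio_c:.4f} (ºC) vs {ratio_k:.4f} (K)")
```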
2. Identify what each data set represents and specify the measurement level:
A. The body temperature (in ºC) of an exercising athlete
B. The athlete’s heart rate (beats per minute)
C. The final ranking of the SP teams in the São Paulo football championship
D. The total number of goals scored by each of the 5 top scorers
E. The final list of these 5 top scorers
F. The list of Brazilian states that held state competitions
Graphic representation
How do they compare in terms of workers with higher education?
Company A
Degree of education   Frequency ni
Basic                           12
Medium                          18
Higher                           6
Total                           36
Company B
Degree of education   Frequency ni
Basic                          650
Medium                       1,020
Higher                         330
Total                        2,000
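Because the two companies differ so much in size, the raw counts cannot be compared directly; relative frequencies can. A minimal sketch using the counts from the two tables (the function name `percentages` is only illustrative):

```python
# Absolute frequencies (ni) from the slides.
company_a = {"Basic": 12, "Medium": 18, "Higher": 6}
company_b = {"Basic": 650, "Medium": 1020, "Higher": 330}

def percentages(counts):
    """Convert absolute frequencies ni into percentages 100 * fi."""
    total = sum(counts.values())
    return {level: 100 * n / total for level, n in counts.items()}

pct_a = percentages(company_a)
pct_b = percentages(company_b)

for level in company_a:
    print(f"{level:<7} A: {pct_a[level]:6.2f}%   B: {pct_b[level]:6.2f}%")
```

On the percentage scale the two companies can be read side by side, which is what the frequency tables that follow add with the fi and 100 fi columns.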
Frequency table
Variable: Salary
Salary   Frequency ni   Proportion fi   Percentage 100 fi
4.56     1              0.03
5.25     1              0.03
5.73     1              0.03
...      ...            ...
23.30    1              0.03
Total    36                             100.00
Frequency table
When summarizing the data:
• We can obtain a better representation of the distribution of wages
• We lose some information
• Choice of arbitrary intervals: familiarity of the researcher with the data dictates how many / which classes
Salary class      Frequency ni   Percentage 100 fi
4.00 ˫ 8.00       10             27.78
8.00 ˫ 12.00      12             33.33
12.00 ˫ 16.00     8              22.22
16.00 ˫ 20.00     5              13.89
20.00 ˫ 24.00     1              2.78
Total             36             100.00
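The symbol ˫ marks a class that includes its lower boundary and excludes its upper one. The sketch below shows one way to assign a salary to such a class and to recompute the percentage column from the class counts. The helper name `salary_class` is hypothetical; the class boundaries and counts are those of the table.

```python
from bisect import bisect_right

# Class boundaries from the table: 4 ˫ 8, 8 ˫ 12, ..., 20 ˫ 24.
edges = [4.0, 8.0, 12.0, 16.0, 20.0, 24.0]

def salary_class(value):
    """Return the (lower, upper) class with lower <= value < upper."""
    if not edges[0] <= value < edges[-1]:
        raise ValueError(f"{value} is outside the tabulated range")
    i = bisect_right(edges, value) - 1
    return edges[i], edges[i + 1]

# Class counts (ni) from the table, turned into percentages (100 * fi).
counts = [10, 12, 8, 5, 1]
total = sum(counts)
percent = [round(100 * n / total, 2) for n in counts]

print(salary_class(4.56))   # a salary from the ungrouped table
print(salary_class(8.0))    # a boundary value goes to the class it opens
print(percent)
```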
Charts for qualitative variables
• Qualitative variable on the x-axis (independent variable, IV)
• Dependent variable on the y-axis (e.g., absolute or relative frequency)
[Figure: bar graph; y-axis: Frequency (0 to 16); x-axis: Degree of education (Basic, Medium, Higher)]
Charts for qualitative variables
Bar graph
Horizontal bar graph (note the change in the origin line of the x-axis)
[Figure: horizontal bar graph; x-axis: Weight change (kg), from -10 to 10; bars: Diet A, B, C and Control]
Local / spread (e.g., average)
• Useful when interest is in variability of response
• Difficult to create with common tools
[Figure: horizontal axis from 0 to 0.05]
• Requires legend to indicate the second independent variable (IV)
• Useful when the answer depends on both IVs
[Figure: bar graph; y-axis: dependent variable (0 to 50); x-axis: Sexual Experience (low, high)]
Sector diagram
• Similar to the stacked bar graph, but DV represented as % of area
• Comparisons between groups difficult
• Therefore, labels or legend added
• Use with caution; consider using bar graphs first
[Figure: sector diagram of degree of education: Basic 33%, Medium 50%, Higher 17%]
Charts for quantitative variables
• Bar chart
• Scatter plot
• Histogram
• Stem-and-leaf plot
• Boxplot
Charts for quantitative variables
Bar graphs: discrete quantitative variables
Number of children   Frequency ni
0                    4
1                    5
2                    7
3                    3
5                    1
Total                20
[Figure: bar graph; y-axis: Frequency (0 to 10), the quantitative dependent variable (DV); x-axis: Number of children (0 to 5)]
[Figure: bar graph of the same frequencies of number of children; y-axis: Frequency (0 to 10); x-axis: Number of children (0 to 6)]
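For a discrete quantitative variable the x-axis is a number line, so a value with no observations (here, 4 children) still needs its place. A small text-only sketch with the frequencies from the number-of-children table:

```python
# Frequencies (ni) from the table; no one in the sample has 4 children.
frequency = {0: 4, 1: 5, 2: 7, 3: 3, 5: 1}

# Walk over every integer value so the empty category keeps its slot on the axis.
rows = []
for children in range(min(frequency), max(frequency) + 1):
    n = frequency.get(children, 0)
    rows.append((children, n))
    print(f"{children} | {'#' * n} ({n})")
```

Iterating over the dictionary keys alone would place 3 and 5 next to each other and hide the gap.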
Regression /
Correlation
Charts for quantitative variables
Frequency histogram
• Graphic method used for summarizing
frequency distributions
• Useful to reveal
– 1. Center: where most of the
observations are
– 2. Dispersion: how observations
vary
– 3. Shape: symmetry
– 4. “Outliers”: extreme observations
Charts for quantitative variables
Frequency histogram
• Characteristics:
Frequency distribution of the salaries (x$1k)
Class    # cases
4-5      1
6-7      2
8-9      3
10-11    7
12-13    5
14-15    6
16-17    3
18-19    2
Total    30
[Figure: histogram; y-axis: # cases (0 to 8); x-axis: Class intervals (x$1k), 4-5 to 20-21]
Charts for quantitative variables
Frequency histogram
• Intervals in a histogram based on a continuous variable
– Adjacent bars, no separation
• Actual lower boundary
– Lowest value to be classified in that interval
[Figure: histogram; y-axis: # cases (0 to 8); x-axis: class intervals, 4-5 to 20-21]
Salary interval (in $1000)   Frequency
3-4                          4
4-5                          5
5-6                          5
6-7                          5
7-10                         15
10-15                        26
15-25                        26
25-50                        8
51+                          1
[Figure: frequency by salary interval (in $1000), with intervals from 0 to 51+]
Q. Is this correct?
A. No. The class intervals are unequal, so the size of the categories is misleading. E.g., 1-2k is 1/25 the size of 25-50k.
Charts for quantitative variables
Conversion to a density scale (density of frequency)
Interval ($1k)   Frequency   Width   Density
0-1              1           1       1
1-2              2           1       2
2-3              3           1       3
3-4              4           1       4
4-5              5           1       5
5-6              5           1       5
6-7              5           1       5
7-10             15          3       5
10-15            26          5       5.2
15-25            26          10      2.6
25-50            8           25      0.32
[Figure: histogram on the density scale; e.g., D = 2.6 for 15-25 and D = .32 for 25-50]
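The density values follow from dividing each class frequency by the class width, so that bar area, not bar height, represents frequency. A minimal sketch with the intervals and frequencies of the salary table (the open-ended 51+ class is left out because it has no width):

```python
# (lower, upper, frequency) for each salary interval, in $1000.
classes = [
    (0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 4, 4), (4, 5, 5), (5, 6, 5),
    (6, 7, 5), (7, 10, 15), (10, 15, 26), (15, 25, 26), (25, 50, 8),
]

# Density = frequency / class width.
density = {(lo, hi): n / (hi - lo) for lo, hi, n in classes}

for (lo, hi), d in density.items():
    print(f"{lo:>2}-{hi:<2}  density = {d:.2f}")
```

The 10-15 and 15-25 classes both hold 26 cases, yet the second is twice as wide, so its density is half as large.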
Example of software processing (S+)
[Figure: Histogram of Income; y-axis: % Cases (0 to 50); x-axis: Income (0 to 80)]
Charts for quantitative variables
From histogram to frequency polygon: just one step
• Histograms give us powerful summaries of distributions
– However, the bars may distract from the primary interest (e.g., shape)
• Frequency polygons are widely used
– 1. Start with a histogram
– 2. Lines connect midpoints of adjacent intervals
– 3. By convention, connect the lowest and highest intervals to zero at the next midpoint
[Figure: histogram and the corresponding frequency polygon; y-axis: # cases (0 to 8); x-axis: Interval, 2-3 to 20-21]
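The three steps can be sketched as a small function that returns the vertices of the polygon. The function name is illustrative, and the demonstration uses only the classes 4-5 through 18-19 whose counts are listed in the salary table, taking the class midpoint as the average of the two printed limits (an assumption about how the classes are defined).

```python
def frequency_polygon(midpoints, counts):
    """Return polygon vertices, closed at zero one class beyond each end."""
    step = midpoints[1] - midpoints[0]          # assumes equal-width classes
    xs = [midpoints[0] - step] + list(midpoints) + [midpoints[-1] + step]
    ys = [0] + list(counts) + [0]
    return list(zip(xs, ys))

# Classes and counts as listed in the salary table (x$1k).
classes = [(4, 5), (6, 7), (8, 9), (10, 11), (12, 13), (14, 15), (16, 17), (18, 19)]
counts = [1, 2, 3, 7, 5, 6, 3, 2]

midpoints = [(lo + hi) / 2 for lo, hi in classes]
polygon = frequency_polygon(midpoints, counts)

for x, y in polygon:
    print(f"{x:5.1f}  {y}")
```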
Charts for quantitative variables
Frequency … plots (“summing … right …”)
• Advantage: no … with intervals
Assignment:
[Figure: plot with class intervals 16-23, 23-30, 30-37, 37-44, 44-51, 51-58, and a second plot with x-axis from 0 to 60]
Charts for quantitative variables
Stem-and-leaf tables
• Histograms give us important information about our sample
– Central tendency, variability & shape
• Two disadvantages
– 1. Difficult to construct
– 2. Information is lost
• Stem-and-leaf tables
– Give us all the histogram information
– Easier to construct
– Less information is lost; one can return to the original
values
– Easier to add new data
– However, it is less traditional and sometimes “less
beautiful”
– Less adopted by popular software
Charts for quantitative variables
Stem-and-leaf tables
• Components:
– Stem
• Leading digit(s) of each interval, corresponding to the lower boundary of the interval
• Typically arranged vertically in the table, from lowest (top) to highest (bottom) value
– Leaves
• Remaining digits of the data, positioned in such a way that:
– Stem + leaf = original observation
• Typically arranged from the lowest value (closest to the stem) to the largest (farthest from the stem)
Use keys/legends!
Charts for quantitative variables
20 grades on the statistics exam in 2007:
63 71 100 77 74 83 64 94 82 79
95 76 89 77 94 82 63 98 69 76
Stem | Leaf
   6 | 3349
   7 | 1466779
   8 | 2239
   9 | 4458
  10 | 0
Key: “6|3” = “63”
The leaves may not always be sorted as in this example
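The table for the 20 grades can be rebuilt in a few lines, which also shows why no information is lost: stem × 10 + leaf recovers every observation. A minimal sketch, assuming two- and three-digit integer data so that the stem is the tens part and the leaf is the units digit:

```python
from collections import defaultdict

grades = [63, 71, 100, 77, 74, 83, 64, 94, 82, 79,
          95, 76, 89, 77, 94, 82, 63, 98, 69, 76]

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted list of leaves (units digits)."""
    table = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        table[stem].append(leaf)
    return dict(table)

table = stem_and_leaf(grades)

# Print every stem in the range, including any that have no leaves.
for stem in range(min(table), max(table) + 1):
    leaves = "".join(str(d) for d in table.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```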
Charts for quantitative variables
Example 2: Back-to-back stem-and-leaf table for life expectancy
Women                       | Years | Men
Sweden                      | 78    |
France, US, Japan, Canada   | 77    |
Finland, Austria, UK        | 76    |
USSR, Germany               | 75    |
                            | 74    |
                            | 73    |
                            | 72    | Sweden
                            | 71    | Japan
                            | 70    |
                            | 69    | Canada, UK, US, France
                            | 68    | Germany, Austria
                            | 67    | Finland
                            | 66    |
                            | 65    |
                            | 64    |
                            | 63    |
                            | 62    | USSR
Bibliography for next lecture
• Larson & Farber: Chapter 2
• Measures of central tendency
Charts for quantitative variables
Example 2: Data of Lord Rayleigh
Measured weight of nitrogen:
2.2981 2.3101 2.2984 2.3100 2.3014 2.3103
2.2988 2.3098 2.2989 2.3101 2.2994 2.3102
2.2988 2.3102 2.3018
Info: even-numbered observations are from air (a); the others are from other sources (o).
[Figure: stem-and-leaf table (stem | leaf)]
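Splitting the Rayleigh measurements by source shows the structure that a stem-and-leaf table of these data would reveal. The sketch assumes that "even-numbered" counts positions starting at 1, so the air values are the 2nd, 4th, 6th, ... entries in reading order.

```python
measurements = [2.2981, 2.3101, 2.2984, 2.3100, 2.3014, 2.3103,
                2.2988, 2.3098, 2.2989, 2.3101, 2.2994, 2.3102,
                2.2988, 2.3102, 2.3018]

# Assumption: positions are counted from 1, so even positions are indices 1, 3, 5, ...
air = measurements[1::2]      # (a) from air
other = measurements[0::2]    # (o) from other sources

print("air:  ", sorted(air))
print("other:", sorted(other))
print("gap between groups:", round(min(air) - max(other), 4))
```

Every air value lies above every value from the other sources, so the two groups do not overlap.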