
Summarizing Data

(Ch 1.1, 1.3, 1.10-1.13, 2.4.3, 2.5)


Populations and Samples
A statistical investigation typically focuses on some characteristic of a population of interest.

Example: You want to study the average GPA of juniors who are engineering majors.

Population:
All engineering majors who are juniors.

Characteristic of interest:
Average GPA.

Populations and Samples
What statisticians need to do:

1) Learn about the distribution of the characteristic in the population.
2) Do this by taking a sample from the population. Why?
3) Summarize the sample using sample statistics.

Graphics: Histograms
A histogram is a graphical representation of the distribution of numerical data.
To construct a histogram:
1. “Bin” the range of values. (The bins are usually consecutive, non-overlapping, and of equal width.)
2. Frequency histogram: count how many values fall into each bin/interval and draw each bar with height equal to that count.
3. Density histogram: count how many values fall into each bin, then scale each bar's height to (relative frequency) / (bin width), so that the areas of the bars sum to 1.
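
The binning steps above can be sketched in a few lines of Python (an arbitrary choice of language for illustration). The function name `histogram`, its parameters, and the ten sample values are all hypothetical, and the sketch assumes equal-width bins that are closed on the left and open on the right.

```python
def histogram(values, start, width, n_bins):
    """Return (frequency counts, density heights) for equal-width bins.

    Bin k covers the interval [start + k*width, start + (k+1)*width).
    Values outside the binned range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    n = sum(counts)
    # Density height = relative frequency / bin width, so bar areas sum to 1.
    densities = [c / (n * width) for c in counts]
    return counts, densities


# Hypothetical sample, not taken from the slides.
sample = [1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 5.5, 7.9, 8.2, 9.6]
counts, densities = histogram(sample, start=0, width=2, n_bins=5)
total_area = sum(d * 2 for d in densities)

print(counts)      # bar heights for the frequency histogram
print(densities)   # bar heights for the density histogram
print(total_area)  # should be 1, up to floating-point rounding
```

The two histograms have the same shape whenever the bins are of equal width; only the vertical scale differs.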

Graphics: Histograms
Examples:
- Drawing a frequency histogram by hand.
- Drawing a density histogram by hand.

Example
Charity is a big business in the United States. The Web site charitynavigator.com gives information on roughly 5500 charitable organizations.

Some charities operate very efficiently, with fundraising and administrative expenses that are only a small percentage of total expenses, whereas others spend a high percentage of what they take in on such activities.

Example cont’d

Here are the data on fundraising expenses as a percentage of total expenditures for a random sample of 60 charities:

 6.1  12.6  34.7   1.6  18.8   2.2   3.0   2.2   5.6   3.8
 2.2   3.1   1.3   1.1  14.1   4.0  21.0   6.1   1.3  20.4
 7.5   3.9  10.1   8.1  19.5   5.2  12.0  15.8  10.4   5.2
 6.4  10.8  83.1   3.6   6.2   6.3  16.3  12.7   1.3   0.8
 8.8   5.1   3.7  26.3   6.0  48.0   8.2  11.7   7.2   3.9
15.3  16.6   8.8  12.0   4.7  14.7   6.4  17.0   2.5  16.2

Example cont’d

We can see that a substantial majority of the charities in the sample spend less than 20% on fundraising: 54 of the 60 values are below 20.
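
The claim about the 20% mark can be checked directly from the 60 sample values listed above. The sketch below tallies the data into bins of width 10; the bin width and the variable names are assumptions made for illustration.

```python
# Fundraising expenses as a percentage of total expenditures (the 60 sample values).
charity = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

n = len(charity)
below_20 = sum(1 for x in charity if x < 20)

# Frequency table with bins [0, 10), [10, 20), ... (bin width is an assumed choice).
width = 10
counts = {}
for x in charity:
    lower = int(x // width) * width
    counts[lower] = counts.get(lower, 0) + 1

for lower in sorted(counts):
    print(f"[{lower}, {lower + width}): {counts[lower]}")
print(f"{below_20} of {n} charities ({below_20 / n:.0%}) spend less than 20%")
```

The long, thin tail of large values (up to 83.1) is what makes this sample positively skewed.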

Graphics: Histograms
Histograms come in a variety of shapes.
• Unimodal histogram: single peak
• Bimodal histogram: two different peaks
• Multimodal histogram: more than two peaks

Bimodality (and, similarly, multimodality) can occur when the data set consists of observations on two (or more) quite different kinds of individuals or objects.

• Symmetric histograms: the left half is a mirror image of the right half
• Positively skewed histograms: the right (upper) tail is stretched out compared with the left
• Negatively skewed histograms: the left (lower) tail is stretched out compared with the right

Sample Statistics
• Histograms and other visual summaries of samples are excellent tools for informal learning about population characteristics.

• The calculation and interpretation of certain summarizing numbers are required for a deeper understanding of the data.

• These sample numerical summaries are called “sample statistics.”

Sample Statistics: Measures of Centrality
Summarizing the center of the sample data is a common and important first step in describing a set of numbers.

Three popular measures of center:

1. Mean
2. Median
3. Mode

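The Sample Mean
For a given set of numbers $x_1, x_2, \ldots, x_n$, the most familiar measure of the center is the mean (arithmetic average).

Sample mean $\bar{x}$ of observations $x_1, x_2, \ldots, x_n$:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$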

Disadvantage?
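
One commonly cited answer is that the mean is sensitive to extreme values. The sketch below, which uses hypothetical data and a hypothetical helper `sample_mean`, shows a single unusually large observation pulling the mean far from the bulk of the sample.

```python
def sample_mean(xs):
    """Arithmetic average: the sum of the observations divided by their count."""
    return sum(xs) / len(xs)


# Hypothetical sample, not taken from the slides.
sample = [2.0, 3.0, 4.0, 5.0, 6.0]
with_outlier = sample + [100.0]  # the same sample plus one extreme value

mean_plain = sample_mean(sample)
mean_outlier = sample_mean(with_outlier)

print(mean_plain)    # 4.0
print(mean_outlier)  # 20.0, larger than five of the six observations
```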
The Sample Median
Median: Middle value when observations are ordered smallest to largest.
To calculate: Order the n observations from smallest to largest (repeated values included) and find the middle one. If n is odd, the median is the single middle value; if n is even, it is the average of the two middle values.
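
The calculation can be sketched as follows, assuming the usual convention that an even-sized sample takes the average of its two middle values. The helper name `sample_median` and the data are hypothetical; the result is cross-checked against `statistics.median` from the Python standard library.

```python
import statistics


def sample_median(xs):
    """Middle value of the ordered sample (average of the two middle values if n is even)."""
    ordered = sorted(xs)  # repeated values are kept
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 1, 3, 3, 9]   # ordered: 1, 3, 3, 7, 9
even_sample = [7, 1, 3, 9]     # ordered: 1, 3, 7, 9

median_odd = sample_median(odd_sample)
median_even = sample_median(even_sample)

print(median_odd)   # 3
print(median_even)  # 5.0
print(median_odd == statistics.median(odd_sample))
```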

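The Mean vs. the Median
The population mean $\mu$ and median $\tilde{\mu}$ will not generally be identical. If the population distribution is positively or negatively skewed, as pictured below, then $\mu \neq \tilde{\mu}$. Typically $\mu < \tilde{\mu}$ under negative skew, $\mu = \tilde{\mu}$ for a symmetric distribution, and $\mu > \tilde{\mu}$ under positive skew.

[Figure: Three different shapes for a population distribution: (a) negative skew, (b) symmetric, (c) positive skew]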

Which  population  characteristic  is  most  important?
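
The same ordering of mean and median usually shows up in samples. The sketch below uses three small hypothetical samples, one for each shape, and compares `statistics.mean` with `statistics.median`; the data were chosen only to make the pattern visible.

```python
from statistics import mean, median

# Hypothetical samples, one per shape; not taken from the slides.
negative_skew = [1.0, 17.0, 18.0, 18.0, 19.0, 19.0, 20.0]  # long left tail
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
positive_skew = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 20.0]       # long right tail

for name, xs in [
    ("negative skew", negative_skew),
    ("symmetric", symmetric),
    ("positive skew", positive_skew),
]:
    print(f"{name}: mean = {mean(xs):.2f}, median = {median(xs):.2f}")
```

The mean is pulled toward the long tail, while the median stays near the bulk of the data.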


Other Sample Measures
• Quartiles: divide the data set into four equal parts. (How is this calculated?)
• Percentiles: A data set can be even more finely divided. What does “percentile” mean?

Example calculations of the median and quartiles.
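
Several conventions exist for computing quartiles, and software packages do not all agree. The sketch below assumes one simple convention: the first and third quartiles are the medians of the lower and upper halves of the ordered data, leaving out the middle value when n is odd. The helper names and data are hypothetical. The second helper illustrates one informal reading of "percentile": the percentage of observations at or below a given value.

```python
from statistics import median


def quartiles(xs):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # smallest half
    upper = ordered[half + n % 2:]  # largest half (skips the middle value if n is odd)
    return median(lower), median(ordered), median(upper)


def percent_at_or_below(xs, value):
    """Percentage of observations that are less than or equal to value."""
    return 100 * sum(1 for x in xs if x <= value) / len(xs)


# Hypothetical sample, not taken from the slides.
sample = [2, 4, 4, 5, 7, 8, 10, 12]
q1, q2, q3 = quartiles(sample)

print(q1, q2, q3)                      # 4.0 6.0 9.0
print(percent_at_or_below(sample, 5))  # 50.0
```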

Graphics: Boxplots
A boxplot is a convenient way of graphically depicting groups of numerical data through the five-number summary: minimum, first quartile, median, third quartile, and maximum.

Example: Drawing a boxplot by hand.
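
Before drawing a boxplot by hand, the five numbers have to be computed. The sketch below is one way to do that, assuming the median-of-halves convention for the quartiles (other conventions give slightly different values); the function name and data are hypothetical.

```python
from statistics import median


def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum) for a sample."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])            # median of the smallest half
    q3 = median(ordered[half + n % 2:])    # median of the largest half
    return ordered[0], q1, median(ordered), q3, ordered[-1]


# Hypothetical sample, not taken from the slides.
sample = [3, 7, 8, 5, 12, 14, 21, 13, 18]
summary = five_number_summary(sample)

print(summary)  # (3, 6.0, 12, 16.0, 21)
```

In the boxplot, the box spans the first to third quartile, a line inside the box marks the median, and the whiskers in the simplest version extend to the minimum and maximum.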

Variability
So far, we’ve learned techniques for visualizing our data and measures of center. What about how spread out the data are?

[Figure: Samples with identical measures of center but different amounts of variability]

Variability
Simplest measure of variability: the range, the difference between the largest and smallest sample values.


What  are  the  disadvantages  of  the  range?
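
One commonly cited drawback is that the range depends only on the two most extreme observations, so it ignores how the remaining values are arranged and is easily distorted by a single outlier. The sketch below illustrates both points with hypothetical data and a hypothetical helper `sample_range`.

```python
def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)


# Hypothetical samples with the same extremes but different spread in between.
clustered = [10, 48, 49, 50, 51, 52, 90]
spread_out = [10, 20, 35, 50, 65, 80, 90]

# A tight sample, then the same sample with one extreme value added.
core = [48, 49, 50, 51, 52]
core_with_outlier = core + [90]

print(sample_range(clustered), sample_range(spread_out))    # 80 80
print(sample_range(core), sample_range(core_with_outlier))  # 4 42
```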

Variability
A more informative measure of variation takes into account the deviations from the mean: $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$.

Can we combine the deviations into a single quantity by finding the average deviation?
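
On the question of averaging the deviations: the deviations from the sample mean always sum to zero, because the positive and negative deviations cancel, so their plain average carries no information about spread. A minimal sketch with hypothetical data:

```python
# Hypothetical sample, not taken from the slides.
sample = [2.0, 4.0, 4.0, 5.0, 10.0]

n = len(sample)
xbar = sum(sample) / n                    # sample mean: 5.0
deviations = [x - xbar for x in sample]   # -3, -1, -1, 0, 5

sum_of_deviations = sum(deviations)
average_deviation = sum_of_deviations / n

# Squaring removes the signs, so the squared deviations no longer cancel.
sum_of_squares = sum(d ** 2 for d in deviations)

print(deviations)
print(average_deviation)  # 0.0
print(sum_of_squares)     # 36.0
```

This cancellation is the usual motivation for working with squared deviations instead.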

Variability
The sample variance, denoted by $s^2$, is given by

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The sample standard deviation, denoted by $s$, is the (positive) square root of the variance:

$$s = \sqrt{s^2}$$

Note that $s^2$ and $s$ are both nonnegative. The unit for $s$ is the same as the unit for each of the $x_i$.

Example: Calculation of the SD.
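
A calculation of this kind can be sketched as follows, using the standard definition of the sample variance with divisor n - 1. The data and helper names are hypothetical; the results are cross-checked against `statistics.variance` and `statistics.stdev` from the Python standard library, which use the same divisor.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical sample, not taken from the slides.
sample = [2.0, 4.0, 4.0, 5.0, 10.0]

s2 = sample_variance(sample)  # 36 / 4 = 9.0
s = sample_sd(sample)         # 3.0

print(s2, s)
print(math.isclose(s2, statistics.variance(sample)))
print(math.isclose(s, statistics.stdev(sample)))
```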

Summarizing Data in R
- Summary statistics
- Graphics (boxplots, histograms)
