
Data description - distribution parameters
Dr Nguyen Thi Van Anh
Department of Biotechnology-Pharmacology
University of Science and Technology of Hanoi

Data don’t make any sense.


We will have to resort to statistics.
Objectives of the lesson
•  Understand how data can be appropriately organized
and displayed

•  Be able to calculate and interpret measures of central
tendency (median, mean, mode)

•  Be able to calculate and interpret measures of
dispersion (range, variance, standard deviation)
What is descriptive statistics?

Techniques for summarizing and organizing the data so that we may more easily determine what information they contain.

Descriptive statistics
Content:
•  The ordered array
•  Grouped data
•  Frequency tables
•  The central tendency
•  Distribution and skewness
•  The dispersion

1. The ordered array (sorting data)

List the data in order from the smallest to the largest value.


2. Grouped data
•  To group a set of observations we select a set of contiguous,
nonoverlapping intervals such that each value in the set of
observations can be placed in one, and only one, of the
intervals

•  5 ≤ number of class intervals ≤ 15


A formula given by Sturges can be used as a guide:

k = 1 + 3.322 log10(n)
w = R / k

k: number of class intervals
n: number of values
w: class interval width
R: range (difference between the smallest and largest values)
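
A minimal Python sketch of Sturges' rule, assuming its usual statement k = 1 + 3.322 log10(n) with class width w = R / k. The sample size and range below are hypothetical, and in practice k and w are rounded to convenient values:

```python
import math

def sturges(n, data_range):
    """Suggest the number of class intervals k and the class width w.

    Assumes the usual statement of Sturges' rule:
    k = 1 + 3.322 * log10(n) and w = R / k.
    """
    k = 1 + 3.322 * math.log10(n)
    w = data_range / k
    return k, w

# Hypothetical sample: 100 values spanning a range of 40 units.
k, w = sturges(100, 40)
print(f"k = {k:.2f}, w = {w:.2f}")  # k = 7.64, w = 5.23
```

Here one would probably settle on 8 classes of width 5.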
Sturges' formula - example
3. Frequency tables
Frequency: number of occurrences
Types of Rice

Rice         Frequency
Old          17
Traditional  15
New          4

A table like this is called a frequency distribution.
Proportion and percentages
•  Proportion (relative frequency): the frequency divided by the total
number of values

•  Percentage: the proportion multiplied by 100

•  Percentages are easier to interpret than raw frequencies, so
frequency tables are often augmented with an extra column of
percentages

Rice         Frequency  Relative frequency  Percentage
Old          17         0.4722              47.22%
Traditional  15         0.4167              41.67%
New          4          0.1111              11.11%
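
The proportion and percentage columns can be derived from the raw frequencies. A short Python sketch using the rice counts from the table (the function name is only illustrative):

```python
# Frequencies copied from the rice table.
frequencies = {"Old": 17, "Traditional": 15, "New": 4}

def frequency_table(freqs):
    """Return {category: (frequency, proportion, percentage)}."""
    total = sum(freqs.values())
    return {
        name: (count, count / total, 100 * count / total)
        for name, count in freqs.items()
    }

table = frequency_table(frequencies)
for name, (count, proportion, percentage) in table.items():
    print(f"{name:<12}{count:>4}{proportion:>8.4f}{percentage:>8.2f}%")
```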
Sturges' formula and frequency table - example
Cumulative (relative) frequency

We may sum, or cumulate, frequencies and relative frequencies to obtain information about 2 or more contiguous groups. This gives cumulative frequencies and cumulative relative frequencies.
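
As a sketch, cumulating is a running sum over the ordered classes. The class frequencies below are hypothetical and only illustrate the mechanics:

```python
from itertools import accumulate

# Hypothetical frequencies for five contiguous class intervals,
# listed from the lowest class to the highest.
frequencies = [3, 7, 12, 6, 2]
total = sum(frequencies)

# Running sum of the frequencies: how many values fall in this
# class or any class below it.
cumulative = list(accumulate(frequencies))

# The same running sum expressed as a share of all values.
cumulative_relative = [c / total for c in cumulative]

print(cumulative)           # [3, 10, 22, 28, 30]
print(cumulative_relative)
```

The last cumulative frequency always equals the total number of values, and the last cumulative relative frequency equals 1.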


Bar chart (Bar graph)
•  The main graphical display of categorical data is a bar chart

•  The height of each bar is equal to the frequency (or relative
frequency) of that category
Histogram (a special type of bar chart)
Graphically displays a frequency distribution or
relative frequency distribution.
QUIZ

Construct: a) A frequency table
b) The bar chart
QUIZ
The heights of students (in cm) in a class are as follows:

a)  Build a frequency table, grouping data into classes
b)  Plot the data on a graph
Sturges' formula

b) A bar chart / Histogram



4. Measures of central tendency
•  Mean
•  Median
•  Mode
Mean (average)?
Obtained by adding up all the values of a variable and dividing by
the number of values

Ex: 15, 20, 21, 20, 36, 15, 25, 15

Mean = Sum/8 = 167/8 = 20.875
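
The same calculation as a short Python sketch, using the eight values from the example:

```python
values = [15, 20, 21, 20, 36, 15, 25, 15]

total = sum(values)          # 167
mean = total / len(values)   # sum divided by the number of values

print(total, mean)  # 167 20.875
```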
Properties of Mean
-  Uniqueness (Only one mean for a data set)
-  Simplicity (easily understood and easy to compute)
-  Affected by each value of the data set (Extreme values can
distort the mean)

Ex: 15, 20, 21, 19, 80

Mean = Sum/5 = 155/5 = 31 → not representative of the data set
Median
-  The value that divides the data set into 2 equal parts
-  The number of values ≤ median = the number of values ≥ median

How to compute the median?

Arrange the values in order of magnitude (n = number of values)
-  If n is odd, the median is the middle value, the ((n+1)/2)th
-  If n is even, the median is the mean of the 2 middle values,
the (n/2)th and (n/2+1)th
Median
Example:
1 2 4 4 5 6 6
Median is 4 (the 4th value)

1 2 4 4 5 6 6 7
2 middle values: 4 and 5 (4th and 5th values); 50% of the values lie below the median and 50% above
Median = (4+5)/2 = 4.5
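
The odd/even rule can be sketched in Python as follows. The function is an illustration written for this example and assumes a non-empty list; Python's own `statistics.median` does the same job:

```python
def median(values):
    """Median by the odd/even rule for a non-empty list of numbers."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value, index mid when counting from 0.
        return ordered[mid]
    # Even n: mean of the (n/2)-th and (n/2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 4, 4, 5, 6, 6]))     # 4
print(median([1, 2, 4, 4, 5, 6, 6, 7]))  # 4.5
```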
Properties of Median
-  Uniqueness (Only one median for a data set)
-  Simplicity (easily understood and easy to compute)
-  Not as drastically affected by extreme values as the mean
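
A short Python sketch of that last property, using the data set 15, 20, 21, 19, 80 from the mean example: removing the extreme value 80 moves the mean a long way but barely changes the median.

```python
import statistics

with_extreme = [15, 20, 21, 19, 80]
without_extreme = [15, 20, 21, 19]

mean_with = statistics.mean(with_extreme)
median_with = statistics.median(with_extreme)
mean_without = statistics.mean(without_extreme)
median_without = statistics.median(without_extreme)

print(mean_with, median_with)        # 31 20
print(mean_without, median_without)  # 18.75 19.5
```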
Mode
-  The value that occurs most frequently in the data set
-  If all values are different, there is no mode
-  A set of data may have more than 1 mode
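
The three cases (one mode, several modes, no mode) can be sketched in Python. The helper `modes` is written for this illustration and assumes a non-empty list:

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([15, 20, 21, 20, 36, 15, 25, 15]))  # [15]
print(modes([1, 2, 4, 4, 5, 6, 6, 7]))          # [4, 6]
print(modes([1, 2, 3]))                         # []
```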
5. Distribution and skewness

The distribution gives information about

•  a typical value (a center) around which the data are spread
•  the variability of the values (the spread of the distribution)
•  the shape of the distribution (whether a distribution is symmetric or
skewed,…)
Skewness

Symmetric distribution: the right half is a mirror image of the left half

Asymmetric distribution = skewed distribution

Negative skew: mean < mode
Positive skew: mean > mode


6. Measures of dispersion

•  Dispersion (variation/spread): the variability of the data


•  No variability (all values are the same) means no dispersion
•  When values are close together → small dispersion
Range
Difference between the largest and smallest value

Example: calculate the mean, median and range

Both A and B: Mean = Median = 7


Range of set A: 13 – 1 = 12
Range of set B: 9 – 5 = 4
Variance

A measure of how far the values in the data set lie from the mean (the average squared deviation from the mean)

Population variance: σ² = Σ(xᵢ − μ)² / N

Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)

Degrees of freedom = n - 1
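
A Python sketch of both variances, assuming the standard definitions: the sum of squared deviations from the mean is divided by the number of values for a population and by n - 1 for a sample. The data are the eight values from the mean example:

```python
import statistics

values = [15, 20, 21, 20, 36, 15, 25, 15]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in values)

population_variance = squared_deviations / n    # divide by N
sample_variance = squared_deviations / (n - 1)  # divide by n - 1

print(population_variance)  # 43.859375
print(sample_variance)      # 50.125

# The standard library implements the same two definitions.
print(statistics.pvariance(values), statistics.variance(values))
```

Dividing by n - 1 always gives the larger of the two, and the difference shrinks as n grows.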
Variance - Example

Compute the mean and variance of a data set:


Standard deviation

The square root of the variance

A measure of absolute variation within one data set, expressed in the same units as the data


Coefficient of Variation

Expresses the standard deviation as a percentage of the mean: CV = (s / x̄) × 100%

•  A measure of relative variation rather than absolute variation
•  Can be used to compare the variability of 2 or more data sets
measured in different units
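
A Python sketch of why the coefficient of variation suits comparisons across units, assuming CV = (standard deviation / mean) × 100 with the sample standard deviation. The weights are hypothetical; the same measurements in kilograms and in grams give very different standard deviations but the same CV:

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements, recorded once in kg and once in g.
weights_kg = [60.0, 65.0, 70.0, 75.0, 80.0]
weights_g = [w * 1000 for w in weights_kg]

sd_kg = statistics.stdev(weights_kg)
sd_g = statistics.stdev(weights_g)
cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)

print(f"SD: {sd_kg:.2f} kg vs {sd_g:.2f} g")
print(f"CV: {cv_kg:.2f}% vs {cv_g:.2f}%")
```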
Coefficient of Variation

Variation is much higher in sample 2 than in sample 1


Percentiles and quartiles

The pth percentile P is the value such that p% of the observations are less than P and (100-p)% are greater than P

•  Quartiles

Lower quartile: 25th percentile
Median: 50th percentile
Upper quartile: 75th percentile

The interquartile range (IQR): difference between the third and first quartiles

Small IQR: small variability
Example:
Calculate quartiles: 5, 7, 4, 4, 6, 2, 8
Example:
Calculate quartiles: 8, 7, 1, 3, 6, 3, 4, 5, 6, 8
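
Several conventions exist for locating quartiles, and they can give slightly different answers on small data sets. The Python sketch below assumes the default "exclusive" method of `statistics.quantiles` (Python 3.8+), which places the quartiles at positions (n+1)/4, 2(n+1)/4 and 3(n+1)/4 in the sorted data and interpolates between neighbours when needed:

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) with the default 'exclusive' method."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q1, q2, q3

first = quartiles([5, 7, 4, 4, 6, 2, 8])
second = quartiles([8, 7, 1, 3, 6, 3, 4, 5, 6, 8])

print(first)   # (4.0, 5.0, 7.0)
print(second)  # (3.0, 5.5, 7.25)

# Interquartile range of each data set.
print(first[2] - first[0], second[2] - second[0])  # 3.0 4.25
```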
Boxplot

•  Represents 5 values graphically (minimum, lower quartile, median, upper quartile, maximum)

•  Splits the data set into 4 quarters with an equal number of values
•  What does the boxplot tell about the distribution?

✓  Center (median): the vertical line inside the box indicates the
center of the distribution

✓  Spread: the width of the box (interquartile range, IQR)

✓  Shape: median in the middle of the box → symmetric distribution
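
The five values behind a boxplot can be computed directly. A Python sketch, assuming the default quartile method of `statistics.quantiles` and reusing the first data set from the quartile example:

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)

summary = five_number_summary([5, 7, 4, 4, 6, 2, 8])
print(summary)  # (2, 4.0, 5.0, 7.0, 8)
```

Here the median (5) sits one unit above Q1 and two units below Q3, so it is not centered in the box and the distribution is not perfectly symmetric.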
•  Construction of a boxplot? (5 steps)
Outlier:
- A value more than 1.5 times the IQR away from the box, that is,
outside the range [Q1 - 1.5(IQR), Q3 + 1.5(IQR)] → outlier

✓  Detection of outliers is important (they may be incorrectly recorded values)

✓  If anything atypical is found, outliers should be deleted from the data

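
The 1.5 × IQR rule can be sketched in Python. The data are hypothetical, and the quartiles again assume the default method of `statistics.quantiles`, so other quartile conventions may shift the limits slightly:

```python
import statistics

def outlier_limits(values):
    """Return the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values lying outside the outlier limits."""
    lower, upper = outlier_limits(values)
    return [x for x in values if x < lower or x > upper]

# Hypothetical data with one suspicious value.
data = [4, 5, 5, 6, 6, 7, 7, 8, 30]
limits = outlier_limits(data)
outliers = find_outliers(data)

print(limits)    # (1.25, 11.25)
print(outliers)  # [30]
```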


•  Example:

Kriesel et al. examined glomerular filtration rate (GFR) in 19 pediatric patients (some measured more than once). Compute:

a)  Mean, median, variance, standard deviation, coefficient of variation

b)  Construct a boxplot



QUIZ

•  What are the objectives of descriptive statistics?


QUIZ
Michelson in 1882 determined the speed of light:

Calculate: a) Mean
b) Variance
c) Standard deviation
d) Coefficient of variation
QUIZ

Answer:
QUIZ
Suppose that the range of a sample is 105.4 and the calculated SD = 260.6.
What do you conclude?

Answer:

There was an error in the calculation of the SD.

The SD measures the typical distance of the sample values from the mean.
No value can lie farther from the mean than the range, so the SD cannot exceed the range.
