
Describing Data By Graphs

Data Distribution

Truong Phuoc Long, Ph.D.

3/1/2023 Describing Data 1


Content
• Review on descriptive statistics
• Describe data using graphs
• Normal distribution
• Sampling distributions of the sample means
• Central Limit Theorem
• Standard error
• Confidence interval



Descriptive statistics
What are descriptive statistics?

• Descriptive statistics are ways of organizing and summarizing observations.

• They provide us with an overview of the general features of a set of data.

• Descriptive statistics can assume a number of different forms; among these are tables, graphs, and numerical summary measures.

• The mean, median, mode, percentiles, range, variance, and standard deviation are the most commonly used numerical measures for quantitative data.



Descriptive statistics

Example:

• We have data on the weights of 30 students. How do we describe them?

• What measures do we use to tell other people about the weights of the 30 students?



Descriptive statistics
• Measures of the center of data
- Mean: the average value of a data set.
- Median: the value separating the higher half from the lower half of a data sample (the middle value of a set of ordered data).
- Mode: the value that appears most often in a set of data values.
• Measures of the dispersion (or variability) of data
- Range: Max − Min
- Interquartile range (IQR)
- Variance
- Standard deviation
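
As a minimal sketch of the measures listed above, the snippet below uses Python's standard `statistics` module. The list of weights is hypothetical and chosen only for illustration; it is not data from the lecture.

```python
import statistics

# Hypothetical weights (kg) of ten students; not data from the lecture.
weights = [48, 52, 52, 55, 57, 60, 61, 63, 66, 86]

mean = statistics.mean(weights)           # arithmetic average
median = statistics.median(weights)       # middle value of the ordered data
mode = statistics.mode(weights)           # most frequent value
data_range = max(weights) - min(weights)  # largest minus smallest value
variance = statistics.variance(weights)   # sample variance (divides by n - 1)
sd = statistics.stdev(weights)            # sample standard deviation

print(mean, median, mode, data_range, round(variance, 2), round(sd, 2))
```

Note that the single large value (86) pulls the mean above the median, which is one reason to report both.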



Descriptive statistics

Types of data


- Nominal Data
- Ordinal Data / Ranked Data
- Discrete Data
- Continuous Data



Percentiles
• Percentiles
- Are positional measures that indicate what percentage of the data set has a value less than a specified value.
- Require the data to be in order.
- Divide a set of data into 100 equal parts.
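
The idea of "what percentage of the data lies below a value" can be sketched as follows. The function name `percent_below` and the exam scores are illustrative assumptions, not part of the lecture.

```python
def percent_below(data, value):
    """Return the percentage of observations strictly less than `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Hypothetical exam scores, used only to illustrate the idea.
scores = [35, 42, 50, 58, 61, 67, 70, 74, 88, 95]
print(percent_below(scores, 70))  # 6 of the 10 scores are below 70
```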



Quartiles
• Quartiles are used to express the statistical dispersion of a data set.
• They are values that split the ordered data set into quarters based on percentiles.
- Q1: 25th percentile (the middle value between the smallest value and the median).
- Q2: 50th percentile (the median of the data set).
- Q3: 75th percentile (the middle value between the median and the largest value).



Interquartile range (IQR)
• The IQR, also called the midspread, the middle 50%, or technically the H-spread, is a measure of the statistical dispersion of a data set.
• The IQR is the difference between the upper and lower quartiles: IQR = Q3 − Q1.



Interquartile range (IQR)
• Example 1:
Data set: 1, 2, 3, 4, 5, 6, 7, 8
What are Q1, Q2, Q3, IQR?

Q1 = (2 + 3)/2 = 2.5
Q2 = (4 + 5)/2 = 4.5
Q3 = (6 + 7)/2 = 6.5
IQR = Q3 − Q1 = 6.5 − 2.5 = 4



Interquartile range (IQR)
• Example 2: 4, 5, 5, 6, 7, 8, 9, 10, 10
What are Q1, Q2, Q3, IQR?



Interquartile range (IQR)

Example 3: Find the interquartile range for the data set 87, 71, 72, 73, 84, 92, 73.

- Order the data from least to greatest: 71, 72, 73, 73, 84, 87, 92.
- Find the median (Q2) first, then the other quartiles (Q1, Q3): Q2 = 73, Q1 = 72, Q3 = 87.

IQR = Q3 − Q1 = 87 − 72 = 15, so the interquartile range is 15.
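
The procedure used in the three examples (order the data, find the median, then take the median of each half) can be sketched as below. The function name `quartiles` is an illustrative choice. Statistical packages use several different quartile conventions, so their results may differ slightly from this median-of-halves method.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of observations the median itself is left out of
    both halves, which matches the worked examples in these slides.
    Assumes at least two observations.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

for data in ([1, 2, 3, 4, 5, 6, 7, 8],
             [4, 5, 5, 6, 7, 8, 9, 10, 10],
             [87, 71, 72, 73, 84, 92, 73]):
    q1, q2, q3 = quartiles(data)
    print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={q3 - q1}")
```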



Measures of dispersion



Notes on the sample standard deviation (s)



Review on frequency distributions
• A frequency distribution is a representation, either in a graphical or tabular
format, that displays the number of observations within a given interval.
• Types of frequency distributions: ungrouped frequency distributions,
grouped frequency distributions, cumulative frequency distributions,
and relative frequency distributions.
• Grouped distributions are built by grouping the data into a number of categories.



Describe data using graphs
• Data can be represented in many ways.
• The main types of graphs include the bar graph (bar chart), boxplot, line graph, pie chart, and other diagrams.



Bar chart
• Displays a frequency distribution for nominal or ordinal data.
• The height of the bar indicates the measured value or frequency.



Boxplot
• Gives good insight into the shape of a distribution (skewness and outliers).
• Makes it easy to compare the distributions of multiple groups.

Skewness is a measure of the asymmetry of the distribution of a variable in a data set.



Boxplot – Components



Boxplot – Distribution

In statistics, an outlier is a data point that differs significantly from other observations.


Boxplot – Compare groups



Stem and leaf
A stem-and-leaf plot is a special table where each data value is split into
a "stem" (the first digit or digits) and a "leaf" (usually the last digit).



Stem and leaf
Example: long jump
Sam got his friends to do a long jump and got these results:
2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0
And here is the stem-and-leaf plot:
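
The plot itself did not survive in this text version. A minimal sketch that rebuilds it from the long jump results is shown below; the function name `stem_and_leaf` is an illustrative choice, and the code assumes values with exactly one decimal place.

```python
from collections import defaultdict

jumps = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]

def stem_and_leaf(values):
    """Group one-decimal values by stem (whole part) and leaf (first decimal)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(round(v * 10), 10)  # e.g. 2.3 -> stem 2, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(jumps)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# Key: 2 | 3 means 2.3
```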



Histogram

• A histogram is a chart that plots the distribution of a numerical variable’s values as a series of bars.
• It displays a frequency distribution for continuous data by charting the number or percentage of observations whose values fall within pre-defined numerical ranges (bins).
• The choice of bin width affects the apparent shape of the distribution.
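
To illustrate how bin width changes the picture, the sketch below counts observations per bin for two widths and prints a text histogram. The ages are hypothetical, the helper `bin_counts` is an illustrative name, and the bin labels assume whole-number data.

```python
from collections import Counter

def bin_counts(data, bin_width, start=0):
    """Count observations per bin [start + k*width, start + (k+1)*width).

    Bins that contain no observations are not listed.
    """
    counts = Counter(int((x - start) // bin_width) for x in data)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical ages, used only to illustrate the effect of bin width.
ages = [21, 23, 24, 25, 27, 28, 29, 31, 32, 34, 35, 38, 41, 45, 52]

for width in (5, 10):
    print(f"bin width = {width}")
    for left, n in bin_counts(ages, width).items():
        print(f"  {left:>2}-{left + width - 1:<2} {'#' * n}")
```

With a width of 10 the data look smoothly right skewed; with a width of 5 the same data show more bumps.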



Histogram



Common shapes of data distribution



Shape of distributions

• Symmetric distribution: right and left sides are mirror images.
- Left tail looks like right tail.
- Mean = Median = Mode (for a distribution with a single peak)



Shape of distributions
• Right skewed (positively skewed)
- Long right tail
- Mean > Median



Shape of distributions

• Left skewed (negatively skewed)
- Long left tail
- Mean < Median
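
The relationship between skewness and the mean and median can be checked on a small example. The two data sets below are hypothetical: the first has one large value (a long right tail) and the second is its mirror image.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # one large value: long right tail
left_skewed = [21 - x for x in right_skewed]  # mirror image: long left tail

for name, data in (("right", right_skewed), ("left", left_skewed)):
    print(name, statistics.mean(data), statistics.median(data))
```

The mean is dragged toward the long tail, while the median stays near the bulk of the data.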



Normal distribution (Gaussian distribution)
• A theoretical probability distribution that is perfectly symmetric about its mean (which is also its median and mode), i.e. bell-shaped.
• It is the most important probability distribution because it fits many natural phenomena.
• Examples: heights, blood pressure, measurement error, and IQ scores approximately follow the normal distribution.



The normal distribution
• Defined by the population mean (μ) and population standard deviation (σ).
• Denoted by N(μ, σ). The equation for a normalized Gaussian curve has the form:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$



The normal distribution
• Defined by the population mean (μ) and standard deviation (σ)
→ There is an infinite number of normal curves, one for every combination of μ and σ.



The standard normal distribution
• A special case of the normal distribution with μ = 0, σ = 1.
• Sometimes referred to as the z-distribution, with z-scores on the horizontal axis.

The equation for the Gaussian error curve is:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$$



Areas under the normal curve

• Areas under a normal curve represent the proportion of all values that fall in that range.
• Z-scores from −1 to 1: about 68% of the observations fall within one SD of the mean.



Areas under the normal curve

• Z-scores from −2 to 2 (more precisely, −1.96 to 1.96): about 95% of the observations fall within two SDs of the mean.



Areas under the normal curve
• Z-scores from −3 to 3: about 99.7% of the observations fall within three SDs of the mean.
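
These areas can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is an illustrative choice.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def area_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")
```

The output shows why 1.96 rather than 2 is quoted for exactly 95%: two full SDs cover slightly more than 95% of the area.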



Proportion of observations under
standard normal distribution



The z-table



Transforming to standard normal distribution

• Any normal distribution can be transformed to the standard normal distribution.
• Example: N(−2, 2) and N(0, 1)



Transform N(-2,2) to N(0,1)
• To center at zero, subtract the mean of −2 from each observation under the red curve (that is, add 2 to each observation).



Transform N(-2,2) to N(0,1)
• To change the shape (spread/SD), divide each new observation by the SD of 2.



Computing z-score

• The process of transforming N(−2, 2) to N(0, 1) is called standardizing, or computing z-scores.
• We can compute a z-score for any observation from any normal curve to assess where the observation falls relative to the rest of the observations in the distribution.
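
The two steps (subtract the mean, then divide by the SD) combine into z = (x − μ)/σ. A minimal sketch, with an illustrative function name `z_score` and a few hand-picked observations from N(−2, 2):

```python
def z_score(x, mean, sd):
    """Standardize x: subtract the mean, then divide by the SD."""
    return (x - mean) / sd

# Observations from N(-2, 2), mapped onto the N(0, 1) scale.
for x in (-6, -4, -2, 0, 2):
    print(x, "->", z_score(x, mean=-2, sd=2))
```

An observation equal to the mean (−2) maps to z = 0, and an observation one SD above the mean (0) maps to z = 1.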



Example:
• Let's say the distribution of systolic blood pressure (SBP) in males (in the population) is normal: N(123.6, 12.9).
• A man has SBP = 130 mmHg. What percentage of men have an SBP greater than his?
→ Compute the z-score: z = (130 − 123.6)/12.9 ≈ 0.5

i.e. what percentage of observations under a standard normal curve are 0.5 SD or more above the mean?



The z-table



Example

• As a result, about 30.85%, or approximately 31%, of men in the population have SBP > 130 mmHg.
• Note: this result is only correct if we know or can assume that the distribution of SBP is normal (not necessarily standard normal).
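
The table lookup can be reproduced in code. The sketch below assumes the N(123.6, 12.9) model from the example and compares the result for z rounded to 0.5 (as in a z-table) with the result for the unrounded z-score.

```python
from statistics import NormalDist

sbp = NormalDist(mu=123.6, sigma=12.9)  # population model from the example

z = (130 - sbp.mean) / sbp.stdev        # z-score of SBP = 130 mmHg
p_exact = 1 - sbp.cdf(130)              # P(SBP > 130) without rounding z
p_table = 1 - NormalDist().cdf(0.5)     # same lookup with z rounded to 0.5

print(f"z = {z:.3f}")
print(f"P(SBP > 130) = {p_exact:.4f} (unrounded z), {p_table:.4f} (z = 0.5)")
```

Both versions round to about 31%; the small difference comes only from rounding z to one decimal place before the lookup.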



How do we know that the distribution of data in the population is normal?
• Remember: we only know the measures of the sample and have to infer those of the population.
• If the distribution of data in our sample is not normal, does it mean the distribution of data in the population is not normal?



Sampling distribution

What is a sampling distribution?


• A sampling distribution is the probability distribution of a statistic obtained by selecting all possible samples of a specific size (n) from the population.
• If we repeatedly choose samples from the same population, what happens to the statistics (x̄, SD) and their distributions?
• Let's look at an example.



Example: Blood pressure of males
• Assume the population distribution is:

The population distribution is normal



Distribution of the sample means



Example: Blood pressure of males

• Now, we do another experiment: take 500 random samples from this population, each sample with 50 men, and get the sample means and SDs.



Distribution of the sample means
• Here is the histogram of sample means (n = 50)



Example: Blood pressure of males

• Now, we do one more experiment: take 500 random samples from this population, each sample with 100 men, and get the sample means and SDs.



Distribution of the sample means
• Here is the histogram of 500 sample means (n=100)



Example: Blood pressure of males

What are your comments about this?

→ The sample means are close to the population mean.
→ The SD of the sample means decreases as the sample size increases.
→ The distribution of the sample means is approximately normal.



Example 2: Hospital length of stay

The population distribution is right skewed



Distribution of the sample means
• Here is the histogram of 500 sample means (n=20)



Distribution of the sample means
• Here is the histogram of 500 sample means (n=50)



Distribution of the sample means
• Here is the histogram of 500 sample means (n=100)



Example 2: Hospital length of stay

What are your comments about this?

→ The sample means are close to the population mean.
→ The SD of the sample means decreases as the sample size increases.
→ The sampling distribution of the sample mean is approximately normal.


From the two examples

Any observation/conclusion?


From the two examples
• The sampling distribution of the sample means tended to be approximately normal, even when the original individual-level data were not.
• Variability in the sample means decreased as the sample size increased (the SDs got smaller).
• The distributions of sample means were centered at the true population mean.
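
These observations can be reproduced with a small simulation. The sketch below is an approximation of the two experiments, not the lecture's own data: the normal population uses N(123.6, 12.9) from the blood pressure example, while the right-skewed population is an assumed exponential distribution with mean 5 standing in for length of stay. The names `sample_means` and `populations` are illustrative.

```python
import random
import statistics

random.seed(1)  # fixed seed so repeated runs give the same numbers

def sample_means(draw, n, n_samples=500):
    """Means of `n_samples` random samples, each of size n, drawn via draw()."""
    return [statistics.mean(draw() for _ in range(n)) for _ in range(n_samples)]

populations = {
    # Normal population, as in the blood pressure example.
    "normal": lambda: random.gauss(123.6, 12.9),
    # Right-skewed population (exponential with mean 5); an assumed
    # stand-in for length of stay, not the lecture's data.
    "right-skewed": lambda: random.expovariate(1 / 5),
}

results = {}
for name, draw in populations.items():
    for n in (20, 50, 100):
        means = sample_means(draw, n)
        mean_of_means = statistics.mean(means)
        sd_of_means = statistics.stdev(means)
        results[(name, n)] = (mean_of_means, sd_of_means)
        print(f"{name:>12} n={n:<3} mean of means={mean_of_means:7.2f} "
              f"SD of means={sd_of_means:5.2f}")
```

For both populations the mean of the sample means stays near the population mean, and the SD of the sample means shrinks as n grows.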



Sampling distribution of the sample mean
• It is a theoretical probability distribution.
• It describes the distribution of all sample means from all possible random
samples of the same size taken from a population.
• The mean of the distribution of sample means will be exactly equal to
the population mean if we are able to select all possible samples of the
same size from a given population.
• There will be less dispersion in the sampling distribution of the sample
mean than in the population.
• As the sample size n increases, the standard error of the mean (the standard
deviation of the distribution of sample means) decreases.
• This distribution has well-defined (and predictable) characteristics that
are specified in the Central Limit Theorem.



The Central Limit Theorem (CLT)
1) The sampling distribution of sample means based on all samples of the same size n (n ≥ 30, as a rule of thumb) is approximately normal, regardless of the distribution of the original data in the population.
2) The mean of all sample means in the sampling distribution is equal to the true mean of the population from which the samples were taken.
3) The standard deviation of the sample means is called the standard error (SE) of the sample mean.



Standard Error

• The standard error, i.e. the standard deviation of the distribution of sample means, provides a measure of the average distance between the sample mean (x̄) and the population mean (μ).
• The standard error describes the variability of the distribution of sample means.
• Law of large numbers:
- The larger the sample size (n), the more probable it is that the sample mean is close to μ.
- Inverse relationship: the larger the sample size, the smaller the standard error.
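
The inverse relationship follows from the usual formula for the standard error of the mean, SE = SD/√n, which is not spelled out in the text above. A minimal sketch, assuming the SD of 12.9 mmHg from the blood pressure example and an illustrative function name `standard_error`:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# SD = 12.9 mmHg is taken from the blood pressure example.
for n in (20, 50, 100):
    print(f"n = {n:>3}: SE = {standard_error(12.9, n):.2f}")
```

Because of the square root, halving the standard error requires a sample four times as large.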



Example: Blood pressure of males



CLT: so what?
• For 95% of the random samples we take, the sample
mean will fall within 2 SEs of the true mean µ



Confidence Interval
• Such an interval is called a 95% confidence interval for the population mean μ.
• The interval is given by:

$$\bar{x} \pm 2\,SE \quad (\text{more precisely, } \bar{x} \pm 1.96\,SE)$$

• Interpretation:
- If 100 random samples of size n were taken from the same population, and a 95% confidence interval were computed from each of these 100 samples, about 95 of the 100 intervals would contain the value of the true mean μ.
- In other words, the procedure yields an interval that contains the true population mean for 95% of the random samples of size n taken from the same population.
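
A 95% confidence interval of this kind can be computed from a single sample as the sample mean plus or minus about two standard errors. The sketch below uses the multiplier 1.96 and estimates the SE from the sample SD; the function name `ci95` and the SBP readings are hypothetical. For small samples a t-based multiplier is normally used instead of 1.96, which is outside this sketch.

```python
import math
import statistics

def ci95(sample, z=1.96):
    """Approximate 95% CI for the mean: sample mean ± z × SE."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * se, mean + z * se

# Hypothetical SBP readings (mmHg); not data from the lecture.
sample = [118, 125, 131, 109, 122, 140, 127, 115, 133, 120,
          126, 119, 138, 112, 124, 129, 121, 135, 117, 128]
low, high = ci95(sample)
print(f"95% CI for the mean: {low:.1f} to {high:.1f} mmHg")
```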



Next week
Describing data in SPSS
• Create variables in SPSS by:
- Compute variable
- Recode variable
• Obtain descriptive statistics in SPSS
• Make graphs in SPSS to describe data
