Describe Data II

Describing Data By Graphs
Data Distribution
Truong Phuoc Long, Ph.D.
3/1/2023 Describing Data 1

Content
• Review on descriptive statistics
• Describe data using graphs
• Normal distribution
• Sampling distributions of the sample means
• Central Limit Theorem
• Standard error
• Confidence interval

Descriptive statistics
What are descriptive statistics?
• Descriptive statistics are ways of organizing and summarizing observations.
• They provide us with an overview of the general features of a set of data.
• Descriptive statistics can take a number of different forms; among these are tables, graphs, and numerical summary measures.
• The mean, median, mode, percentiles, range, variance, and standard deviation are the most commonly used numerical measures for quantitative data.

Example:
• We have data on the weights of 30 students. How do we describe them?
• What measures do we use to tell other people about the weights of the 30 students?

• Measures of the center of data
- Mean: the average value of a data set.
- Median: the value separating the higher half from the lower half of a data sample (the middle value of a set of ordered data).
- Mode: the value that appears most often in a set of data values.
• Measures of the dispersion (or variability) of data
- Range: the difference between the maximum and minimum values (Max − Min)
- Interquartile range (IQR)
- Variance
- Standard deviation
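
As a minimal sketch of these measures, the Python snippet below summarizes a small set of weights with the standard library. The values in `weights_kg` are made up for illustration (they are not the 30 students' data), and `statistics.quantiles` uses its default interpolation method, which can give slightly different quartiles from a hand calculation.

```python
import statistics

# Hypothetical weights in kg (illustrative values only, not real class data)
weights_kg = [48, 52, 52, 55, 57, 60, 61, 63, 66, 70]

# Measures of the center
mean = statistics.mean(weights_kg)          # average value
median = statistics.median(weights_kg)      # middle value of the ordered data
mode = statistics.mode(weights_kg)          # most frequent value

# Measures of dispersion
data_range = max(weights_kg) - min(weights_kg)
q1, q2, q3 = statistics.quantiles(weights_kg, n=4)   # quartile cut points
iqr = q3 - q1
variance = statistics.variance(weights_kg)  # sample variance (divides by n - 1)
sd = statistics.stdev(weights_kg)           # sample standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, IQR={iqr}, variance={variance:.2f}, SD={sd:.2f}")
```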

Types of numerical data

- Nominal Data
- Ordinal Data / Ranked Data
- Discrete Data
- Continuous Data

Percentiles
• Percentiles
- Are positional measures that indicate what percentage of the data set has a value less than a specified value.
- Require the data to be in order.
- Divide an ordered data set into 100 equal parts.

Quartiles
• Quartiles are used to express the statistical dispersion of a data set.
• They are the values that split the data set into quarters, based on percentiles.
- Q1: 25th percentile (the middle value between the smallest value and the median).
- Q2: 50th percentile (the median of the data set).
- Q3: 75th percentile (the middle value between the median and the highest value).

Interquartile range (IQR)
• The IQR, also called the midspread, the middle 50%, or technically the H-spread, is a measure of the statistical dispersion of a data set.
• The IQR is the distance between the upper and lower quartiles: IQR = Q3 − Q1.

• Example 1:
Data set: 1, 2, 3, 4, 5, 6, 7, 8
What are Q1, Q2, Q3, IQR?
Q1 = (2+3)/2 = 2.5.
Q2 = (4+5)/2 = 4.5.
Q3 = (6+7)/2 = 6.5.
IQR = Q3 − Q1 = 6.5 − 2.5 = 4.

• Example 2: 4, 5, 5, 6, 7, 8, 9, 10, 10
What are Q1, Q2, Q3, IQR?

• Example 3: Find the interquartile range for the data set 87, 71, 72, 73, 84, 92, 73.
- Order the data from least to greatest: 71, 72, 73, 73, 84, 87, 92.
- Find the median (Q2) first and then the other quartiles (Q1, Q3): Q2 = 73, Q1 = 72, Q3 = 87.
- IQR = 87 − 72 = 15, so the interquartile range is 15.
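
The hand method used in Examples 1 and 3 (order the data, find the median, then take the median of each half) can be sketched in Python as follows. The function name `quartiles` is a hypothetical helper, and the choice to leave the middle value out of both halves when n is odd is the convention these examples follow; other software may use a different rule and give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)          # step 1: order the data
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # skips the middle value when n is odd
    return median(lower), median(data), median(upper)

# Example 1
q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
print(q1, q2, q3, q3 - q1)         # 2.5 4.5 6.5 4.0

# Example 3
q1, q2, q3 = quartiles([87, 71, 72, 73, 84, 92, 73])
print(q3 - q1)                     # 15
```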

Measures of dispersion

Notes on the sample standard deviation (s)

Review on frequency distributions
• A frequency distribution is a representation, either in a graphical or tabular
format, that displays the number of observations within a given interval.
• Types of frequency distributions: ungrouped frequency distributions,
grouped frequency distributions, cumulative frequency distributions,
and relative frequency distributions.
• Building a grouped frequency distribution involves grouping the data into a number of categories (class intervals).
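
As an illustration of the ungrouped, relative, and cumulative forms listed above, here is a minimal Python sketch. The `visits` data are hypothetical values chosen only to show the layout of the table.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical observations (illustrative values only)
visits = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

counts = Counter(visits)
values = sorted(counts)
freq = [counts[v] for v in values]             # ungrouped frequencies
rel_freq = [f / len(visits) for f in freq]     # relative frequencies
cum_freq = list(accumulate(freq))              # cumulative frequencies

print("value  freq  rel.freq  cum.freq")
for v, f, r, c in zip(values, freq, rel_freq, cum_freq):
    print(f"{v:>5}  {f:>4}  {r:>8.2f}  {c:>8}")
```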

Describe data using graphs
• Data can be represented in many ways.
• The main types of graphs include the bar chart (bar graph), boxplot, line graph, pie chart, and other diagrams.

Bar chart
• Displays a frequency distribution for nominal or ordinal data.
• The height of the bar indicates the measured value or frequency.

Boxplot
• Gives good insight into the shape of a distribution (skewness and outliers).
• Makes it easy to compare the distributions of multiple groups.
• Skewness is a measure of the asymmetry of the distribution of a variable in a data set.
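
The slides do not state a numerical rule for outliers. A widely used boxplot convention, assumed here, is Tukey's rule: flag any point more than 1.5 × IQR below Q1 or above Q3. The sketch below applies that rule to hypothetical data; the quartiles come from `statistics.quantiles` with its default method, so the fences may differ slightly from a hand calculation.

```python
import statistics

# Hypothetical measurements (illustrative values only)
values = [10, 12, 13, 14, 15, 16, 18, 40]

q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr       # assumed rule: 1.5 x IQR below Q1
upper_fence = q3 + 1.5 * iqr       # assumed rule: 1.5 x IQR above Q3

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(outliers)                    # [40]
```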

Boxplot-Components

Boxplot -Distribution
In statistics, an outlier is a data point that differs significantly from other observations.

Boxplot – Compare groups

Stem and leaf
Stem and Leaf Plot is a special table where each data value is split into
a "stem" (the first digit or digits) and a "leaf" (usually the last digit).

Stem and leaf
Example: long jump
Sam got his friends to do a long jump and got these results:
2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0
And here is the stem-and-leaf plot:
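
The plot for Sam's results can be rebuilt with a few lines of Python. This is a minimal sketch, assuming every value has one decimal place so that the stem is the units digit and the leaf is the tenths digit:

```python
from collections import defaultdict

jumps = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]

plot = defaultdict(list)
for x in sorted(jumps):
    stem, leaf = divmod(round(x * 10), 10)   # e.g. 2.3 -> stem 2, leaf 3
    plot[stem].append(leaf)

print("Stem | Leaf")
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>4} | {leaves}")

# Stem | Leaf
#    2 | 3 5 5 7 8
#    3 | 2 6 6
#    4 | 5
#    5 | 0
```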

Histogram
• A histogram is a chart that plots the distribution of a numerical variable's values as a series of bars.
• It displays a frequency distribution for continuous data by charting the number or percentage of observations whose values fall within pre-defined numerical ranges (bins).
• The choice of bin width affects the apparent shape of the distribution.
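
To see how bin width changes the picture, the sketch below counts the same data with two different widths and prints a text histogram. The `ages` values and the helper `histogram` are hypothetical, and bins are taken as half-open intervals [left, left + width), which is one common convention.

```python
def histogram(values, bin_width, start):
    """Count values in half-open bins [left, left + bin_width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)     # index of the bin containing v
        left = start + k * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical ages in years (illustrative values only)
ages = [21, 22, 23, 25, 27, 28, 31, 33, 34, 36, 38, 41, 45, 52]

for width in (5, 10):
    print(f"bin width = {width}")
    for left, n in histogram(ages, width, start=20).items():
        print(f"  [{left}, {left + width}): {'#' * n}")
```

With a width of 5 the bars look fairly flat; with a width of 10 the same data show a clear decline from left to right.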

Common shapes of data distribution

Shape of distributions
• Symmetric distribution: right and left sides are mirror images.
- Left tail looks like right tail.
- Mean = Median = Mode (for a single-peaked distribution)

• Right skewed (positively skewed)
- Long right tail
- Mean > Median

• Left skewed (negatively skewed)
- Long left tail
- Mean < Median
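
A quick numerical check of these relationships, using three small hypothetical data sets (one symmetric, one with a long right tail, and its mirror image):

```python
from statistics import mean, median

# Hypothetical data sets (illustrative values only)
symmetric = [2, 3, 4, 5, 6, 7, 8]
right_skewed = [1, 2, 2, 3, 3, 4, 20]         # one value far out on the right
left_skewed = [-20, -4, -3, -3, -2, -2, -1]   # mirror image of right_skewed

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

The extreme value pulls the mean toward the long tail, while the median barely moves.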

Normal distribution(Gaussian distribution)
• A theoretical probability distribution that is perfectly symmetric
about its mean (median and mode), i.e. bell-shaped.
• It is the most important probability distribution because it fits many
natural phenomena.
• Examples: heights, blood pressure, measurement error, and IQ scores approximately follow the normal distribution.

The normal distribution
• Defined by the population mean (μ) and population standard deviation (σ).
• Denoted by N(μ, σ). The equation for a normalized Gaussian curve has the form:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The normal distribution
• Defined by the population mean (μ) and standard deviation (σ).
→ There is an infinite number of normal curves, one for every combination of μ and σ.

The standard normal distribution
• A special case of the normal distribution with μ = 0, σ = 1.
• Sometimes referred to as the z-distribution, with z-scores on the horizontal axis.
• The equation for the Gaussian error curve is:
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
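
The normal density can be evaluated directly from its standard formula, f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)). The sketch below is a minimal implementation; `normal_pdf` is a hypothetical helper name, and with the defaults μ = 0 and σ = 1 it gives the standard normal curve.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the N(mu, sigma) curve at x (sigma is the standard deviation)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(0))                      # peak of the standard normal, about 0.3989
print(normal_pdf(-1) == normal_pdf(1))    # True: the curve is symmetric about the mean
print(normal_pdf(-2, mu=-2, sigma=2))     # peak of N(-2, 2), half as tall as N(0, 1)
```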

Areas under the normal curve
• Areas under a normal curve represent the proportion of all values that fall in that range.
• Z-scores from (-1, 1): about 68% of the observations fall within one SD of the mean.

• Z-scores from (-2, 2) (more precisely, -1.96 to 1.96): about 95% of the observations fall within two SDs of the mean.

• Z-scores from (-3, 3): about 99.7% of the observations fall within three SDs of the mean.
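
These areas can be checked numerically. For a standard normal variable Z, P(−k < Z < k) = erf(k/√2), where erf is the error function available in Python's `math` module. The helper name `area_within` is hypothetical.

```python
import math

def area_within(k):
    """P(-k < Z < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.4f}")

# within 1 SD of the mean: 0.6827
# within 1.96 SD of the mean: 0.9500
# within 2 SD of the mean: 0.9545
# within 3 SD of the mean: 0.9973
```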

Proportion of observations under
standard normal distribution

The z-table

Transforming to standard normal distribution
• Any normal distribution can be transformed to the standard normal distribution.
• Example: N(-2, 2) and N(0, 1)

Transform N(-2,2) to N(0,1)
• To center at zero, subtract the mean of -2 from each observation under the red curve (that is, add 2).

Transform N(-2,2) to N(0,1)
• To change the shape (spread/SD), divide each new observation by the SD of 2.

Computing z-score
• The process of transforming N(-2, 2) to N(0, 1) is called standardizing or computing z-scores.
• We can compute a z-score for any observation from any normal curve to assess where the observation falls relative to the rest of the observations in the distribution.
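
The two steps (subtract the mean, then divide by the SD) combine into z = (x − μ) / σ. A minimal sketch for the N(-2, 2) example, with `z_score` as a hypothetical helper name and a few illustrative observations:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Observations from N(-2, 2): mean -2, SD 2
for x in (-6, -4, -2, 0, 2):
    print(f"x = {x:>2}  ->  z = {z_score(x, mu=-2, sigma=2)}")

# x = -6  ->  z = -2.0
# x = -4  ->  z = -1.0
# x = -2  ->  z = 0.0
# x =  0  ->  z = 1.0
# x =  2  ->  z = 2.0
```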

Example:
• Let's say the distribution of systolic blood pressure (SBP) in males (in the population) is normal: N(123.6, 12.9).
• A man has SBP = 130 mmHg. What percentage of men have SBP greater than his?
→ Compute the z-score: z = (130 − 123.6) / 12.9 ≈ 0.5
i.e. what percentage of observations under a standard normal curve are 0.5 SD or more above the mean?

The z-table

Example
• As a result, 30.853%, or approximately 31%, of men in the population have SBP > 130 mmHg.
• Note: this result is only correct if we know or can assume that the distribution of SBP is normal (not necessarily standard normal).
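
The table lookup can be reproduced numerically. The sketch below computes the z-score for SBP = 130 mmHg under N(123.6, 12.9) and the area to its right, using the identity P(Z > z) = 0.5 · erfc(z/√2). The helper name `upper_tail` is hypothetical. The z-table value of 30.853% corresponds to the rounded z = 0.5; the unrounded z gives a very slightly larger area.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mu, sigma, x = 123.6, 12.9, 130
z = (x - mu) / sigma

print(f"z = {z:.3f}")                                  # about 0.496, rounded to 0.5
print(f"P(Z > 0.5) = {upper_tail(0.5):.5f}")           # about 0.30854
print(f"P(Z > z), unrounded z = {upper_tail(z):.4f}")  # about 0.31
```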

How do we know that the distribution of data in the population is normal?
• Remember: we only know the measures of the sample and have to infer those of the population.
• If the distribution of data in our sample is not normal, does it mean the distribution of data in the population is not normal?

Sampling distribution
What is a sampling distribution?

• A sampling distribution is the probability distribution of a statistic obtained by selecting all of the possible samples of a specific size (n) from the population.
• If we repeatedly choose samples from the same population, what happens to the statistics (x̄, SD) and their distributions?
• Let's look at an example.

Example: Blood pressure of males
• Assume the population distribution is normal.

Distribution of the sample means

• Now, we do another experiment: take 500 random samples from this population, each sample with 50 men, and get the sample means and SDs.

• Here is the histogram of the 500 sample means (n = 50)

• Now, we do one more experiment: take 500 random samples from this population, each sample with 100 men, and get the sample means and SDs.

• Here is the histogram of the 500 sample means (n = 100)


What are your comments about this?

- The sample means are close to the population mean.
- The SD of the sample means decreases as the sample size increases.
- The distribution of the sample means is approximately normal.
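
These experiments can be imitated with a small simulation. The sketch below draws 500 random samples of 50 and of 100 men and summarizes the sample means. The population parameters N(123.6, 12.9) are an assumption borrowed from the earlier SBP example, and the fixed seed is only there to make the run repeatable.

```python
import random
import statistics

random.seed(0)                # fixed seed so the sketch is reproducible
MU, SIGMA = 123.6, 12.9       # assumed population mean and SD (mmHg)

def sample_means(n, n_samples=500):
    """Means of n_samples random samples, each of size n."""
    return [
        statistics.mean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(n_samples)
    ]

means_50 = sample_means(50)
means_100 = sample_means(100)

for n, means in ((50, means_50), (100, means_100)):
    print(f"n = {n}: mean of sample means = {statistics.mean(means):.1f}, "
          f"SD of sample means = {statistics.stdev(means):.2f}")
```

The mean of the sample means stays near 123.6 in both runs, while their SD is smaller for n = 100 than for n = 50.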

Example 2: Hospital length of stay
The population distribution is right skewed

What are your comments about this?

- The sample means are close to the population mean.
- The SD of the sample means decreases as the sample size increases.
- The sampling distribution of the sample mean is approximately normal.

From the two examples
Any observations or conclusions?

From the two examples
• The sampling distribution of sample means tended to be approximately normal, even when the original individual-level data were not.
• Variability in the sample mean values decreased as the sample size increased (SDs getting smaller).
• The distributions of sample means were centered at the true population mean.
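
The first point can be checked with a simulation on a right-skewed population. The sketch below uses an exponential distribution with a mean of 5 days as a stand-in for hospital length of stay; that choice, the names `MEAN_STAY` and `skewness`, and the seed are all assumptions for illustration, not the data behind the slides. Skewness near 0 indicates a roughly symmetric distribution.

```python
import random
import statistics

random.seed(1)                # fixed seed so the sketch is reproducible
MEAN_STAY = 5.0               # assumed population mean length of stay (days)

def skewness(values):
    """Average cubed z-score: about 0 if symmetric, positive if right skewed."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(n, n_samples=500):
    """Means of n_samples random samples of size n from the skewed population."""
    return [
        statistics.mean([random.expovariate(1 / MEAN_STAY) for _ in range(n)])
        for _ in range(n_samples)
    ]

individual_stays = [random.expovariate(1 / MEAN_STAY) for _ in range(10_000)]
means_100 = sample_means(100)

print(f"individual stays: skewness = {skewness(individual_stays):.2f}")
print(f"sample means (n = 100): skewness = {skewness(means_100):.2f}, "
      f"mean = {statistics.mean(means_100):.2f}, "
      f"SD = {statistics.stdev(means_100):.2f}")
```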

Sampling distribution of the sample mean
• It is a theoretical probability distribution.
• It describes the distribution of all sample means from all possible random
samples of the same size taken from a population.
• The mean of the distribution of sample means will be exactly equal to
the population mean if we are able to select all possible samples of the
same size from a given population.
• There will be less dispersion in the sampling distribution of the sample
mean than in the population.
• As the sample size n increases, the standard error of the mean or the standard
deviation of the distribution of sample means decreases.
• This distribution has well-defined (and predictable) characteristics that
are specified in the Central Limit Theorem.

The Central Limit Theorem (CLT)
1) The sampling distribution of sample means based on all samples of the same size n is approximately normal when n is large enough (a common rule of thumb is n ≥ 30), regardless of the distribution of the original data in the population.
2) The mean of all sample means in the sampling distribution is equal to the true mean of the population from which the samples were taken.
3) The standard deviation of the sample means is called the standard error (SE) of the sample mean.
Standard Error
• The standard error, i.e. the standard deviation of the distribution of sample means, provides a measure of the average distance between a sample mean (x̄) and the population mean (μ).
• The standard error describes the variability of the distribution of sample means.
• Law of large numbers:
- The larger the sample size (n), the more probable it is that the sample mean is close to μ.
- Inverse relationship: the larger the sample size, the smaller the standard error.
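
The inverse relationship follows from the standard formula SE = σ/√n (in practice σ is usually replaced by the sample SD, s). A minimal sketch, assuming σ = 12.9 mmHg from the earlier SBP example and a few illustrative sample sizes:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

SIGMA = 12.9   # assumed population SD (mmHg), borrowed from the SBP example

for n in (25, 100, 400):
    print(f"n = {n:>3}: SE = {standard_error(SIGMA, n):.3f}")
```

Because n sits under a square root, the sample size has to be quadrupled to halve the standard error.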


CLT: so what?
• For about 95% of the random samples we take, the sample mean will fall within 2 SEs (more precisely, 1.96 SEs) of the true mean μ.

Confidence Interval
• Such an interval is called a 95% confidence interval for the population mean μ.
• The interval is given by:
$$\bar{x} \pm 1.96 \times SE \approx \bar{x} \pm 2 \times SE$$
• Interpretation:
- If 100 random samples of size n were taken from the same population, and a 95% confidence interval computed from each of these 100 samples, about 95 of the 100 intervals would be expected to contain the value of the true mean μ.
- In other words, the procedure gives a range of values that contains the true population mean for 95% of the random samples of size n taken from the same population.
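
This interpretation can be checked by simulation. The sketch below computes x̄ ± 1.96 × SE, with SE estimated as s/√n, for many random samples and counts how often the interval contains the true mean. The population N(123.6, 12.9), the sample size of 50, the 1,000 repetitions, and the helper name `ci95` are assumptions for illustration.

```python
import math
import random
import statistics

def ci95(sample):
    """Approximate 95% confidence interval for the mean: mean ± 1.96 × SE."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))   # SE = s / sqrt(n)
    return mean - 1.96 * se, mean + 1.96 * se

random.seed(42)                   # fixed seed so the sketch is reproducible
MU, SIGMA, N = 123.6, 12.9, 50    # assumed population parameters and sample size

n_intervals = 1000
covered = 0
for _ in range(n_intervals):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    low, high = ci95(sample)
    covered += low <= MU <= high  # True counts as 1

coverage = covered / n_intervals
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The printed proportion should land close to 95%. Any single interval either contains μ or it does not; the 95% describes the long-run behavior of the procedure.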

Next week
Describing data in SPSS
• Create variables in SPSS by:
- Compute variable
- Recode variable
• Obtain descriptive statistics in SPSS
• Make graphs in SPSS to describe data
