
STATISTICS FOR LIFE AND SOCIAL SCIENCES

Thach Thanh Tien

Ton Duc Thang University

Thach Thanh Tien (Ton Duc Thang University) STATISTICS FOR LIFE AND SOCIAL SCIENCES 1 / 76
Sir Ronald Fisher (1890-1962)

“A genius who almost single-handedly created the foundations for modern
statistical science”

C. R. Rao

“Statistics is not a discipline like physics, chemistry or biology where we
study a subject to solve problems in the same subject. We study statistics
with the main aim of solving problems in other disciplines.”

C. R. Rao

“Unlike other sciences, statistical science does not develop from statistics
itself. It needs the stimulus of new problems arising in all areas of human
activity. The future of statistics lies in communication and exchange between
statisticians and researchers in other fields.”

S. R. Srinivasa Varadhan (Jan 2, 1940)

“If you have a problem in biology, you have to understand chemistry.
And if you want to understand why things happen in chemistry, you have
to know physics.
And if you want to understand why things happen in physics, you have to
know mathematics.”

STATISTICS FOR LIFE AND SOCIAL SCIENCES
CHAPTER 1: DESCRIPTIVE STATISTICS

Thach Thanh Tien

Ton Duc Thang University

Content
1 Introduction
Variables and Types of Data (Biến và các dạng dữ liệu)
Descriptive and Inferential Statistics
2 Displaying distributions with graphs
Graphs for categorical variables
Stemplots
Histogram
Time plots
3 Describing distributions with numbers
Measuring center: the mean, the median
Measuring spread: the quartiles, the standard deviation
4 Density curves and normal distribution
Density curves
Measuring center and spread for density curves
Normal distribution
5 Relationships between variables
Scatterplots
Correlation

Introduction

Statistics
Statistics is the science of conducting studies to collect, organize, summarize,
analyze, and draw conclusions from data.
Statistics is the science of learning from data.

Why statistics?

1 Numerical information is everywhere.
2 No matter what line of work you select, you will find yourself faced with
decisions where an understanding of data analysis is helpful.

“In God we trust. All others bring data.”
-W. E. Deming

Introduction

Population (Tổng thể)


A population consists of all subjects (human or otherwise) that are being studied.

Sample (Mẫu)
A sample is a group of subjects selected from a population.


Cases, labels, variables, and values


Cases are the objects described in a set of data. Cases may be customers,
companies, subjects in a study, units in an experiment, or other objects.
A label is a special variable used in some data sets to distinguish the
different cases.
A variable is any characteristic of a case.
Different cases can have different values of the variables.


Data (Dữ liệu)


Data are the values (measurements or observations) that the variables can
assume.
A collection of data values forms a data set.
Each value in the data set is called a data value or a datum.

Variables and Types of Data (Biến và các dạng dữ liệu)

Categorical and quantitative variables


A categorical variable (biến phân loại hay biến định tính) places an
individual into one of two or more groups or categories.
A quantitative variable (biến định lượng) takes numerical values for which
arithmetic operations such as adding and averaging make sense.
Discrete variables (biến rời rạc) assume values that can be counted.
Continuous variables (biến liên tục) can assume an infinite number of values
between any two specific values. They are obtained by measuring. They often
include fractions and decimals.
The distribution (phân phối) of a variable tells us what values it takes and
how often it takes these values.

Measurement scales (Thang đo)


Nominal level of measurement (Thang đo định danh)


The nominal level of measurement classifies data into mutually exclusive
(nonoverlapping) categories in which no order or ranking can be imposed on the
data.

Ordinal level of measurement (Thang đo thứ bậc)


The ordinal level of measurement classifies data into categories that can be
ranked; however, precise differences between the ranks do not exist.


Interval level of measurement (Thang đo khoảng)


The interval level of measurement ranks data, and precise differences between
units of measure do exist; however, there is no meaningful zero.

Ratio level of measurement (Thang đo tỷ lệ)


The ratio level of measurement possesses all the characteristics of interval
measurement, and there exists a true zero. In addition, true ratios exist when the
same variable is measured on two different members of the population.

Introduction

How many variables have you measured?


Univariate data: One variable is measured on a single experimental unit.
Bivariate data: Two variables are measured on a single experimental unit.
Multivariate data: More than two variables are measured on a single
experimental unit.


Look at the data


1 Who? What cases do the data describe? How many cases does the data set
contain?
2 What? How many variables do the data contain? What are the exact
definitions of these variables? What are the units of measurement for each
quantitative variable?
3 Why? What purpose do the data have? Do we hope to answer some
specific questions? Do we want to draw conclusions about cases other than
the ones we actually have data for? Are the variables that are recorded
suitable for the intended purpose?

Descriptive and Inferential Statistics

Descriptive statistics
Descriptive statistics consists of the collection, organization, summarization, and
presentation of data.


Inferential statistics
Inferential statistics consists of generalizing from samples to populations,
performing estimations and hypothesis tests, determining relationships among
variables, and making predictions.

Displaying distributions with graphs

Exploratory data analysis (Phân tích khám phá dữ liệu)


Like an explorer crossing unknown lands, we want first to simply describe what we
see.
Begin by examining each variable by itself. Then move on to study the
relationships among the variables.
Begin with a graph or graphs. Then add numerical summaries of specific
aspects of the data.

Graphs for categorical variables

Example
The distribution of the highest level of education for people aged 25 to 34 years.

What do you see?

Pie charts require that you include all the categories that make up a whole. Use
them only when you want to emphasize each category’s relation to the whole.
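
The distribution of a categorical variable can be tabulated as counts and percents before drawing a bar graph or pie chart. The following is a minimal sketch; the function name `categorical_distribution` and the education data are hypothetical and serve only as an illustration. Because every category is included, the percents add up to 100, which is the condition a pie chart requires.

```python
from collections import Counter


def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


# Hypothetical education levels for ten people (illustrative data only).
education = [
    "High school", "Bachelor", "High school", "Some college", "Bachelor",
    "Advanced", "High school", "Some college", "Bachelor", "High school",
]

dist = categorical_distribution(education)
for category, (count, percent) in dist.items():
    # One row per category: label, count, percent of the whole.
    print(f"{category:<14}{count:>3}{percent:>7.1f}%")
```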

Displaying distributions with graphs

Stemplots
1 Separate each observation into a stem consisting of all but the final
(rightmost) digit and a leaf, the final digit. Stems may have as many digits as
needed, but each leaf contains only a single digit.
2 Write the stems in a vertical column with the smallest at the top, and draw a
vertical line at the right of this column.
3 Write each leaf in the row to the right of its stem, in increasing order out
from the stem.
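
The three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `stemplot` and the data are hypothetical, the input is assumed to be non-negative integers, and printing stems that have no leaves is an added convention that keeps gaps in the data visible.

```python
from collections import defaultdict


def stemplot(data):
    """Return the rows of a stemplot for non-negative integers."""
    leaves = defaultdict(list)
    for value in sorted(data):
        # Step 1: the stem is all but the final digit, the leaf is the final digit.
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)

    rows = []
    # Step 2: stems in a vertical column, smallest at the top
    # (assumption: stems without leaves are still listed).
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: leaves in increasing order out from the stem
        # (already sorted because the data were sorted first).
        row = f"{stem:>2} | " + "".join(str(leaf) for leaf in leaves[stem])
        rows.append(row.rstrip())
    return rows


# Hypothetical observations (illustrative data only).
for row in stemplot([27, 31, 35, 35, 42, 48, 49, 63]):
    print(row)
```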


What do you see?

A stemplot of the percents of females

Histogram

What do you see?

Time plots

Time plot (Biểu đồ dữ liệu theo thời gian)


A time plot of a variable plots each observation against the time at which it was
measured. Always put time on the horizontal scale of your plot and the variable
you are measuring on the vertical scale. Connecting the data points by lines helps
emphasize any change over time.

What do you see?

Measuring center: the mean, the median

The mean (Trung bình)


To find the mean $\bar{x}$ of a set of observations, add their values and divide by the
number of observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \tag{1} \]
or, in more compact notation,
\[ \bar{x} = \frac{1}{n} \sum x_i \tag{2} \]
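
As a small illustration of equation (1), the sketch below adds the values and divides by the number of observations. The function name `mean` and the sample values are hypothetical.

```python
def mean(observations):
    """Sum of the values divided by the number of observations."""
    n = len(observations)
    if n == 0:
        raise ValueError("mean requires at least one observation")
    return sum(observations) / n


scores = [4, 8, 15, 16, 23, 42]  # illustrative values only
x_bar = mean(scores)
print(x_bar)
```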


The median (Trung vị)


The median M is the midpoint of a distribution. Half the observations are smaller
than the median and the other half are larger than the median. Here is a rule for
finding the median:
1 Arrange all observations in order of size, from smallest to largest.
2 If the number of observations n is odd, the median M is the center
observation in the ordered list. Find the location of the median by counting
(n + 1)/2 observations up from the bottom of the list.
3 If the number of observations n is even, the median M is the mean of the two
center observations in the ordered list. The location of the median is again
(n + 1)/2 from the bottom of the list.
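
The rule above can be sketched as follows. The function name `median` is chosen for illustration. Python lists are indexed from 0, so the center observation at position (n + 1)/2 sits at index n // 2 when n is odd.

```python
def median(observations):
    """Median M by the three-step rule: sort, then take the center value(s)."""
    ordered = sorted(observations)          # step 1: smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one observation")
    mid = n // 2
    if n % 2 == 1:
        # Step 2: n is odd, so M is the single center observation.
        return ordered[mid]
    # Step 3: n is even, so M is the mean of the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 9]))      # odd number of observations
print(median([5, 1, 9, 3]))   # even number of observations
```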

Measuring spread: the quartiles, the standard deviation

The quartiles Q1 and Q3 (Tứ phân vị)


To calculate the quartiles:
1 Arrange the observations in increasing order and locate the median M in the
ordered list of observations.
2 The first quartile Q1 is the median of the observations whose position in the
ordered list is to the left of the location of the overall median.
3 The third quartile Q3 is the median of the observations whose position in the
ordered list is to the right of the location of the overall median.
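
A minimal sketch of this quartile rule is given below; the helper names are hypothetical. When n is odd, the overall median is excluded from both halves, as the rule describes. Statistical software often uses slightly different quartile conventions, so its results may differ from this hand rule.

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(observations):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(observations)          # step 1
    n = len(ordered)
    if n < 2:
        raise ValueError("quartiles require at least two observations")
    half = n // 2
    lower = ordered[:half]                  # observations left of the median's location
    # For odd n the median itself (index half) is left out of the upper half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


print(quartiles([1, 3, 5, 7, 9, 11, 13]))
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```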


The five-number summary (5-giá trị tóm tắt dữ liệu)


The five-number summary of a set of observations consists of the smallest
observation, the first quartile, the median, the third quartile, and the largest
observation, written in order from smallest to largest. In symbols, the five-number
summary is

Minimum Q1 M Q3 Maximum


The interquartile range (IQR)


The interquartile range IQR is the distance between the first and third quartiles,

IQR = Q3 − Q1 (3)

The 1.5 × IQR rule for outliers


Call an observation a suspected outlier if it falls more than 1.5 × IQR above the
third quartile or below the first quartile.
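
The five-number summary, the IQR, and the 1.5 × IQR rule can be combined in one short sketch. The function names and the data are hypothetical; quartiles are computed as medians of the lower and upper halves of the ordered list, using `statistics.median` from the Python standard library.

```python
from statistics import median


def five_number_summary(observations):
    """Return (Minimum, Q1, M, Q3, Maximum)."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # For odd n the overall median is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def suspected_outliers(observations):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(observations)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in observations if x < low_fence or x > high_fence]


data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # illustrative values only
print(five_number_summary(data))
print(suspected_outliers(data))
```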


The standard deviation (Độ lệch chuẩn)


The variance $s^2$ of a set of observations is the average of the squares of the
deviations of the observations from their mean. In symbols, the variance of n
observations $x_1, x_2, \ldots, x_n$ is
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \tag{4} \]
or, in more compact notation,
\[ s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2 \tag{5} \]
The standard deviation $s$ is the square root of the variance $s^2$:
\[ s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2} \tag{6} \]
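
The variance and standard deviation formulas above translate directly into code. The sketch below uses hypothetical function names and data, and compares the results with `statistics.variance` and `statistics.stdev` from the Python standard library, which also divide by n − 1.

```python
import math
import statistics


def variance(observations):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("variance requires at least two observations")
    x_bar = sum(observations) / n
    return sum((x - x_bar) ** 2 for x in observations) / (n - 1)


def standard_deviation(observations):
    """Square root of the sample variance."""
    return math.sqrt(variance(observations))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only; mean is 5
print(variance(data), statistics.variance(data))
print(standard_deviation(data), statistics.stdev(data))
```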

Density curves

Density curves (Đường cong mật độ)


A density curve is a curve that
is always on or above the horizontal axis and
has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the
curve and above any range of values is the proportion of all observations that fall
in that range.
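
To illustrate the two defining properties, the sketch below uses a hypothetical triangular density on the interval from 0 to 2 (an assumption made for this example, not a curve from the slides) and approximates areas with the trapezoidal rule. The total area should be 1, and the area above a range gives the proportion of observations in that range.

```python
def triangular_density(x):
    """Hypothetical density: rises from 0 at x=0 to 1 at x=1, falls back to 0 at x=2."""
    if 0 <= x <= 1:
        return float(x)
    if 1 < x <= 2:
        return 2.0 - x
    return 0.0


def area_under(density, a, b, steps=1000):
    """Trapezoidal-rule approximation of the area under `density` between a and b."""
    width = (b - a) / steps
    total = 0.5 * (density(a) + density(b))
    for i in range(1, steps):
        total += density(a + i * width)
    return total * width


total_area = area_under(triangular_density, 0, 2)    # should be close to 1
proportion = area_under(triangular_density, 0, 0.5)  # proportion of observations in [0, 0.5]
print(total_area, proportion)
```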

Measuring center and spread for density curves

MEDIAN AND MEAN OF A DENSITY CURVE


The median of a density curve is the equal-areas point, the point that divides
the area under the curve in half.
The mean of a density curve is the balance point, at which the curve would
balance if made of solid material.
The median and mean are the same for a symmetric density curve. They both lie
at the center of the curve. The mean of a skewed curve is pulled away from the
median in the direction of the long tail.

Normal distribution

The 68-95-99.7 rule


In the Normal distribution with mean µ and standard deviation σ:
Approximately 68% of the observations fall within σ of the mean µ.
Approximately 95% of the observations fall within 2σ of µ.
Approximately 99.7% of the observations fall within 3σ of µ.
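
The three percentages can be checked numerically. For any Normal distribution, the proportion of observations within k standard deviations of the mean equals erf(k / √2), where erf is the error function available as `math.erf` in the Python standard library. The function name `proportion_within` is chosen for illustration.

```python
import math


def proportion_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * proportion_within(k):.2f}%")
```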

Standardizing observations

Standardizing and z-scores


If x is an observation from a distribution that has mean µ and standard deviation
σ, the standardized value of x is
\[ z = \frac{x - \mu}{\sigma} \tag{7} \]
A standardized value is often called a z-score.
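
A z-score tells how many standard deviations an observation lies from the mean, and in which direction. The sketch below uses a hypothetical function name and hypothetical values for µ and σ.

```python
def z_score(x, mu, sigma):
    """Standardized value of x for a distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical distribution with mean 64.5 and standard deviation 2.5.
z_above = z_score(68.0, mu=64.5, sigma=2.5)   # positive: above the mean
z_below = z_score(62.0, mu=64.5, sigma=2.5)   # negative: below the mean
print(z_above, z_below)
```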


The standard normal distribution


The standard Normal distribution is the Normal distribution N(0, 1) with
mean 0 and standard deviation 1.
If a variable X has any Normal distribution N(µ, σ) with mean µ and
standard deviation σ, then the standardized variable
\[ Z = \frac{X - \mu}{\sigma} \tag{8} \]
has the standard Normal distribution.
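
One consequence of standardizing is that any Normal proportion can be computed from the standard Normal distribution alone. The sketch below checks this with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the values of µ, σ, and x are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0   # hypothetical mean and standard deviation
x = 130.0

# Proportion of observations below x, computed directly from N(mu, sigma).
direct = NormalDist(mu, sigma).cdf(x)

# The same proportion via standardization and the standard Normal N(0, 1).
z = (x - mu) / sigma
via_standard = NormalDist().cdf(z)   # NormalDist() defaults to mean 0, standard deviation 1

print(z, direct, via_standard)
```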

Normal distribution calculations

Normal quantile plot

Acidity of rainwater (measured by pH)


[Figure: Normal Q-Q plot. Horizontal axis: Theoretical Quantiles; vertical axis: Sample Quantiles.]

Relationships between variables

Association between variables


Two variables measured on the same cases are associated if knowing the value of
one of the variables tells you something about the values of the other variable that
you would not know without this information.

Response variable, explanatory variable


A response variable measures an outcome of a study. An explanatory variable
explains or causes changes in the response variables.

Scatterplots

Scatterplot
A scatterplot shows the relationship between two quantitative variables measured
on the same individuals. The values of one variable appear on the horizontal axis,
and the values of the other variable appear on the vertical axis.

Example
State mean SAT scores plotted against the percent of high school seniors in each
state who take the SAT exams, for Example 2.6. The point for West Virginia
(20% take the SAT, mean score 1032) is highlighted.


Examining a scatterplot
In any graph of data, look for the overall pattern and for striking deviations
from that pattern.
You can describe the overall pattern of a scatterplot by the form, direction,
and strength of the relationship.
An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern of the relationship.

Correlation

Correlation
The correlation measures the direction and strength of the linear relationship
between two quantitative variables. Correlation is usually written as r.
Suppose that we have data on variables x and y for n individuals. The means and
standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and
$s_y$ for the y-values. The correlation r between x and y is
\[ r = \frac{1}{n - 1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \tag{9} \]
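
The correlation formula can be read as the average product of standardized x-values and standardized y-values, with divisor n − 1. A minimal sketch follows; the function name `correlation` and the data are hypothetical, and the code assumes that neither variable is constant (otherwise a standard deviation is 0 and the division fails).

```python
import math


def correlation(xs, ys):
    """Correlation r between paired observations xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations with divisor n - 1.
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of standardized values, divided by n - 1.
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


xs = [1, 2, 3, 4, 5]                          # illustrative values only
print(correlation(xs, [2, 4, 6, 8, 10]))      # perfect positive linear relationship
print(correlation(xs, [10, 8, 6, 4, 2]))      # perfect negative linear relationship
print(correlation(xs, [2, 1, 4, 3, 5]))       # positive but not perfect
```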
