
MA5701: Statistical Methods
Chapter 1 : Data and Statistics
Kui Zhang, Mathematical Sciences

Chapter 1, MA5701 Statistical Methods, Fall 2016

Introduction
Statistics deals largely with principles and procedures for collecting,
describing, and drawing conclusions from data.
The purpose of this chapter is to:
1. Provide the definition of data
2. Define the components of a data set
3. Present some tools that are used to describe a data set
4. Discuss methods of data collection

Definition 1.1 - A set of data is a collection of observed values
representing one or more characteristics of some objects or units.
Definition 1.2 - A population is a data set representing the entire
entity of interest.

Example 1.1 - A Typical Data Set

Respondent   AGE   SEX   HAPPY   TVHOURS
         1    41     1       2         0
         9    31     2       1         0
        22    64     2       3         0
        36    26     1       2         2

Data Sources and Format

Definition 1.3 - A sample is a data set consisting of a portion of a
population, obtained in a way that represents the population.
Where we can obtain the data:
Primary data - collected as part of the study
Secondary data - obtained from other sources

Two Types of Study
Observational studies - can infer association but not causality
Designed experiments - can help establish causality

Data format
Observation(s) - a row in the data file
Variable(s) - a column in the data file
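As a rough illustration of this row-and-column layout, the sketch below stores the four respondents of Example 1.1 as rows (observations) and reads one column (variable) across them. The Python representation and the names `rows` and `column` are assumptions made for illustration; the course itself works with SAS data files.

```python
# Each dict is one observation (a row); each key is one variable (a column).
# Values are copied from Example 1.1.
rows = [
    {"Respondent": 1, "AGE": 41, "SEX": 1, "HAPPY": 2, "TVHOURS": 0},
    {"Respondent": 9, "AGE": 31, "SEX": 2, "HAPPY": 1, "TVHOURS": 0},
    {"Respondent": 22, "AGE": 64, "SEX": 2, "HAPPY": 3, "TVHOURS": 0},
    {"Respondent": 36, "AGE": 26, "SEX": 1, "HAPPY": 2, "TVHOURS": 2},
]


def column(data, name):
    """Return the values of one variable across all observations."""
    return [row[name] for row in data]


ages = column(rows, "AGE")
print(len(rows), "observations,", len(rows[0]), "columns")
print("AGE:", ages)
```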

Observations and Variables


Variables - Qualitative (Categorical) and Quantitative
Definition 1.4 - A discrete variable can assume only a countable number of
values.
Definition 1.5 - A continuous variable is one that can take any one of an
uncountable number of values in an interval.
Definition 1.6 - The ratio scale of measurement uses the concept of a unit
of distance or measurement and requires a unique definition of a zero value.
Definition 1.7 - The interval scale of measurement also uses the concept
of distance or measurement and requires a zero point, but the definition
of zero is arbitrary.

Variables (Cont.)
Definition 1.8 - The ordinal scale distinguishes between
measurements on the basis of the relative amounts of some
characteristic they possess.
You can convert a ratio or interval scale to an ordinal scale, but the criteria are not
always clear, and the conversion loses information.

Definition 1.9 - The nominal scale identifies observed values by
name or classification.
Generally used for categorical or qualitative variables
Weakest scale
Ratio, interval, or ordinal scales can be converted to the nominal scale
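The loss of information when a ratio-scale variable is converted to an ordinal scale can be sketched as follows. The ages come from the Age column of the Variables Example; the cut points (10 and 30) and the labels `new`, `middle`, and `old` are hypothetical and chosen only to show that the criteria are a judgment call.

```python
# Ages (ratio scale) from the Age column of the Variables Example.
ages = [21, 7, 51, 8, 51]


def age_group(age):
    """Map an age to an ordered category (the cut points are arbitrary)."""
    if age < 10:
        return "new"
    if age < 30:
        return "middle"
    return "old"


groups = [age_group(a) for a in ages]
print(groups)
```

Ages 7 and 8 fall in the same category, so the original values can no longer be recovered from the ordinal version.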

Variables Example

Obs  Zip  Age  Bed  Bath  Size    Lot  Exter  garage  fp  Price
  1    3   21    3     2   951  64904  Other       0   0  30000
  3    4    7    1     1   676  54450  Other       2   0  46500
  5    1   51    3     1  1186  10857  Other       1   0  51500
  7    3    8    3     2  1368      .  Frame       0   0  56990
  9    1   51    2     1  1176   6259  Frame       1   1  65500


Distributions
Definition 1.10 - A frequency distribution is a listing of the frequencies
of all categories of the observed values of a variable.
Definition 1.11 - A relative frequency distribution consists of the
relative frequencies, or proportions (percentages), of observations
belonging to each category.
Definition - A cumulative frequency distribution gives, for each class
interval, the frequency of observed values less than or equal to the upper
limit of that class interval.
Definition - A cumulative percent gives, for each class interval, the relative
frequency of observed values less than or equal to the upper limit of that
class interval.

Distributions - Discrete Variable

                          Cumulative  Cumulative
bed  Frequency  Percent    Frequency     Percent
  1          1     1.45            1        1.45
  2          3     4.35            4        5.80
  3         46    66.67           50       72.46
  4         16    23.19           66       95.65
  5          3     4.35           69      100.00
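The percent and cumulative columns of the bedroom table can be reproduced from the frequencies alone. The following is a minimal Python sketch; the variable names are illustrative, and the course output itself comes from SAS.

```python
from itertools import accumulate

# Frequencies of the number of bedrooms, copied from the bed table.
freq = {1: 1, 2: 3, 3: 46, 4: 16, 5: 3}

n = sum(freq.values())                       # total number of observations
cum_freq = list(accumulate(freq.values()))   # running total of frequencies
percent = [round(100 * f / n, 2) for f in freq.values()]
cum_percent = [round(100 * c / n, 2) for c in cum_freq]

for bed, f, p, cf, cp in zip(freq, freq.values(), percent, cum_freq, cum_percent):
    print(bed, f, p, cf, cp)
```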


Distributions - Continuous Variables

                                   Cumulative  Cumulative
price         Frequency  Percent    Frequency     Percent
[  0,  50k)           4     5.80            4        5.80
[ 50k, 100k)         22    31.88           26       37.68
[100k, 150k)         23    33.33           49       71.01
[150k, 200k)         10    14.49           59       85.51
[200k, 250k)          2     2.90           61       88.41
[250k, 300k)          1     1.45           62       89.86
[300k, 350k)          4     5.80           66       95.65
[350k, 400k)          3     4.35           69      100.00
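For a continuous variable, each observation must first be assigned to a class interval before frequencies can be counted. The sketch below assumes half-open intervals of width 50,000, as in the price table, and uses only the five prices listed in the Variables Example, so its counts do not match the full table of 69 houses. The function name `class_interval` is hypothetical.

```python
from collections import Counter

WIDTH = 50_000  # class width used in the price table

# Only the five prices listed in the Variables Example (not all 69 houses).
prices = [30000, 46500, 51500, 56990, 65500]


def class_interval(price, width=WIDTH):
    """Return the half-open interval [lower, upper) that contains price."""
    lower = (price // width) * width
    return (lower, lower + width)


counts = Counter(class_interval(p) for p in prices)
for (lower, upper), count in sorted(counts.items()):
    print(f"[{lower // 1000}k, {upper // 1000}k): {count}")
```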


Distributions - Bar Chart


Distributions - Pie Chart


Distributions - Histogram


Distributions - Graphical Representation

Components of a correctly constructed chart or graph:
Axes labeled correctly, with clearly identifiable scales
Captioned correctly
Bars of equal width
Sizes of figures properly proportioned
Only relevant information included

Commonly used statistical software can produce a correctly
constructed chart or graph with some adjustments.
Some graphical and tabular representations of distributions contain too
much detail and are useful for univariate analysis.

Graphical Representation


Descriptive Statistics

[Figure: two panels, Exterior = Brick and Exterior = Frame]

Descriptive Statistics - Location


Definition 1.12 - The mean is the sum of all the observed values divided by
the number of values: $\bar{y} = \left(\sum_{i=1}^{n} y_i\right) / n$.
Definition 1.13 - The median of a set of observed values is defined to be
the middle value when the measurements are arranged from lowest to
highest: $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$.
Definition 1.18 - The pth percentile is defined to be that value for which at
most p% of the measurements are less and at most (100-p)% of the
measurements are greater.
Definition - Quartiles are the 25th, 50th, and 75th percentiles (Location).
Definition - Range (Dispersion) and Midrange (Location) are the difference
and mean of the smallest and largest observed values, respectively.
Definition - Mode is the most frequently observed value.
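These location measures can be computed directly from their definitions. The sketch below applies them to the five values in the Age column of the Variables Example, using Python's standard `statistics` module for the median and mode; percentiles are left out because packages use different conventions for them.

```python
import statistics

ages = [21, 7, 51, 8, 51]  # Age column of the Variables Example

mean = sum(ages) / len(ages)               # sum of values divided by n
ordered = sorted(ages)                     # y_(1) <= y_(2) <= ... <= y_(n)
median = statistics.median(ages)           # middle value of the ordered data
mode = statistics.mode(ages)               # most frequently observed value
value_range = ordered[-1] - ordered[0]     # largest minus smallest
midrange = (ordered[0] + ordered[-1]) / 2  # mean of smallest and largest

print(mean, median, mode, value_range, midrange)
```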


Descriptive Statistics - Dispersion


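Definition - Range and Midrange are the difference and mean of the
smallest and largest observed values, respectively.
Definition 1.14 - The sample variance, denoted by $s^2$, is defined by
$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$
Sum of absolute deviations: $\sum_{i=1}^{n} |y_i - \bar{y}|$
New distance: $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 / n$
$s^2 = \text{mean square} = \dfrac{SS}{df}$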

Definition 1.15 - The standard deviation of a set of observed values is
defined to be the positive square root of the variance.
Definition 1.17 - The coefficient of variation (CV) is the ratio of the
standard deviation to the mean, expressed in percentage terms.
Definition 1.19 - The interquartile range is the length of the interval
between the 25th and 75th percentiles (Dispersion).
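The variance, standard deviation, and CV can be worked through on a small example. The sketch below uses the Age column of the Variables Example, computes the sum of squares both from deviations and from the computational shortcut, and cross-checks the result against `statistics.variance`. The variable names are illustrative only.

```python
import math
import statistics

ages = [21, 7, 51, 8, 51]  # Age column of the Variables Example
n = len(ages)
mean = sum(ages) / n

# Sum of squares: definitional form and computational shortcut.
ss_definition = sum((y - mean) ** 2 for y in ages)
ss_shortcut = sum(y * y for y in ages) - sum(ages) ** 2 / n

variance = ss_definition / (n - 1)   # s^2 = SS / df, with df = n - 1
std_dev = math.sqrt(variance)        # positive square root of the variance
cv_percent = 100 * std_dev / mean    # CV expressed in percentage terms

# Cross-check against the standard library's sample variance.
library_variance = statistics.variance(ages)

print(round(variance, 2), round(std_dev, 2), round(cv_percent, 1))
```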

Mean and Standard Deviation


Usefulness of the Mean and Standard Deviation (for roughly bell-shaped distributions)
Interval (mean ± 1 SD) contains approximately 68% of observations
Interval (mean ± 2 SD) contains approximately 95% of observations
Interval (mean ± 3 SD) contains virtually all of the observations
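These percentages can be checked on simulated data. The sketch below assumes bell-shaped (normal) data generated with an arbitrary mean of 100 and standard deviation of 15; it does not use the housing data, and the rule may fit poorly for strongly skewed variables.

```python
import random
import statistics

random.seed(5701)  # arbitrary seed so the run is repeatable

# Simulated bell-shaped data (an assumption; not the housing data).
values = [random.gauss(100, 15) for _ in range(20_000)]
mean = statistics.fmean(values)
sd = statistics.stdev(values)


def share_within(k):
    """Proportion of values inside (mean - k*sd, mean + k*sd)."""
    inside = sum(1 for v in values if abs(v - mean) < k * sd)
    return inside / len(values)


shares = {k: share_within(k) for k in (1, 2, 3)}
print(shares)
```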

Change of Scale
Linear transformation (from a change of unit)
Nonlinear transformation (squared transformation, log transformation)
What will change?
Mean
Variance and standard deviation
CV
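For a pure change of unit, y' = b·y with b > 0, the mean and standard deviation are both multiplied by b, the variance by b², and the CV is unchanged; adding a constant would shift the mean but not the standard deviation, so the CV would change. The sketch below checks the rescaling case on the five prices from the Variables Example, converting dollars to thousands of dollars. The helper `summary` is a hypothetical name.

```python
import statistics

prices = [30000, 46500, 51500, 56990, 65500]  # dollars, Variables Example


def summary(values):
    """Return (mean, standard deviation, CV in percent)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return mean, sd, 100 * sd / mean


# Change of unit: dollars -> thousands of dollars (multiply by b = 0.001).
b = 0.001
in_thousands = [b * p for p in prices]

mean_usd, sd_usd, cv_usd = summary(prices)
mean_k, sd_k, cv_k = summary(in_thousands)

print(mean_k / mean_usd, sd_k / sd_usd)  # both ratios are close to b
print(cv_usd, cv_k)                      # CV is unchanged by rescaling
```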


Exploratory Analysis - Box Plot

Box plots show distribution shapes and help detect unusual observations
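The numbers behind a box plot can be sketched as follows. This assumes the common Tukey convention of flagging values more than 1.5 × IQR beyond the quartiles, which the slides do not state explicitly, and it uses the `inclusive` quartile method of Python's `statistics.quantiles`; other packages, including SAS, may compute quartiles slightly differently.

```python
import statistics

sizes = [951, 676, 1186, 1368, 1176]  # Size column of the Variables Example

# Quartile conventions differ between packages; "inclusive" is one choice.
q1, median, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
iqr = q3 - q1  # length of the box

# Tukey-style fences: values beyond 1.5 * IQR from the box are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
unusual = [y for y in sizes if y < lower_fence or y > upper_fence]

print(q1, median, q3, iqr, unusual)
```

With these five sizes no value falls outside the fences, so nothing would be drawn as an unusual point.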


Exploratory Analysis - Box Plot


Schematic Box-and-Whiskers Plot


Exploratory Analysis - Box Plot

Box plots show distribution shapes and help detect unusual observations


Bivariate Data - Categorical Variables

Contingency Table (Frequency Table from PROC FREQ)
Table of exter by zip
Each cell shows, top to bottom: Frequency, Percent, Row Pct, Col Pct

exter      zip 1    zip 2    zip 3    zip 4    Total
brick          4       10        4       30       48
            5.80    14.49     5.80    43.48    69.57
            8.33    20.83     8.33    62.50
           66.67    76.92    25.00    88.24
frame          1        1        5        1        8
            1.45     1.45     7.25     1.45    11.59
           12.50    12.50    62.50    12.50
           16.67     7.69    31.25     2.94
other          1        2        7        3       13
            1.45     2.90    10.14     4.35    18.84
            7.69    15.38    53.85    23.08
           16.67    15.38    43.75     8.82
Total          6       13       16       34       69
            8.70    18.84    23.19    49.28   100.00
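The four numbers in each cell of the contingency table follow from the cell frequencies and the row, column, and grand totals. In the sketch below the frequencies are recovered from the table's percentages (percent × 69 / 100); the row order brick, frame, other and the column order zip 1 to 4 are assumptions based on the order in which the labels are listed. The function name `cell_summary` is hypothetical.

```python
# Cell frequencies recovered from the table's percentages (percent * 69 / 100).
# Rows are assumed to follow the listed order brick, frame, other;
# columns are assumed to be zip codes 1 to 4.
counts = {
    "brick": [4, 10, 4, 30],
    "frame": [1, 1, 5, 1],
    "other": [1, 2, 7, 3],
}

total = sum(sum(row) for row in counts.values())
col_totals = [sum(col) for col in zip(*counts.values())]


def cell_summary(exter, zip_index):
    """Return (frequency, percent, row percent, column percent) for one cell."""
    freq = counts[exter][zip_index]
    row_total = sum(counts[exter])
    return (
        freq,
        round(100 * freq / total, 2),
        round(100 * freq / row_total, 2),
        round(100 * freq / col_totals[zip_index], 2),
    )


print(cell_summary("brick", 3))
```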


Bivariate Data - Categorical Variables


Block chart


Bivariate Data - Categorical and Interval

Calculate numerical descriptions for each group
Box plots can be used


Bivariate Data - Interval Variables


Preview - Populations, Samples, Statistical Inference

Definition - The population is the set of values of one or more variables for
the entire collection of units relevant to a particular study.
Population parameters - for example the mean and variance; they are unknown and
must be estimated from samples.
Estimates - the descriptive measures computed from samples; they reflect the
population parameters and differ from one sample to another. How
good an estimate is, is measured by its sampling error.
Statistics - in this book, a statistic is considered the same as an estimate.
In statistical theory, it refers to a function of a sample, which is
either a random variable or a random vector.
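The idea that estimates differ from sample to sample can be illustrated with a small simulation. The sketch below assumes a hypothetical population of the integers 1 to 1,000, draws 200 samples of size 50, and records how far each sample mean lands from the population mean; none of these numbers come from the course data.

```python
import random
import statistics

random.seed(2016)  # arbitrary seed so the run is repeatable

# Hypothetical population of 1,000 values; its mean is the parameter.
population = list(range(1, 1001))
parameter = statistics.fmean(population)

# Each simple random sample of size 50 gives its own estimate of the mean.
estimates = [
    statistics.fmean(random.sample(population, 50)) for _ in range(200)
]
sampling_errors = [estimate - parameter for estimate in estimates]

print(parameter, min(estimates), max(estimates))
```

In practice only one sample is available, so the sampling error of a single estimate has to be assessed by statistical theory rather than by repeated sampling.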

Data Collection
Goal - make statements about the population based on samples
Random sampling, or some more advanced form of probability sampling, is the
appropriate way to collect data. In this book, we assume all samples
come from simple random sampling.
Definition - Simple random sampling is a sampling scheme in which
each possible sample of the specified size has an equal chance of
occurring.
Random sampling can be difficult to implement in practice
Convenience samples are dangerous (be careful)
Sample size and power calculation
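Given a complete list of units, a simple random sample can be drawn as in the sketch below. It assumes a sampling frame of 69 houses numbered 1 to 69 (the numbering is hypothetical) and a sample size of 10 chosen only for illustration.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

# Sampling frame: 69 houses, identified by hypothetical numbers 1 to 69.
frame = list(range(1, 70))

# random.sample draws without replacement, so every subset of size 10
# is equally likely to be selected.
sample_ids = random.sample(frame, 10)

print(sorted(sample_ids))
```

The hard part in practice is usually building a complete frame, which is one reason random sampling can be difficult to implement.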

Chapter Summary
Variable - nominal, ordinal, interval, ratio, discrete,
continuous
Table - frequency table, contingency table
Numerical measurement - mean, variance, standard
deviation, largest, smallest, median, range, midrange,
percentile, quartile
Graphic - histogram, bar chart, pie chart, block chart,
scatter plot

Writing Report
Use appropriate tables and figures to summarize the
data set you have
Do not copy tables or output directly from SAS;
edit them first (e.g., add descriptions, appropriate row
and/or column names, an appropriate number of significant digits)
Produce appropriate figures (see previous slides)
Discrete variables - report frequency and percentage
Continuous variables - report mean and standard deviation

Writing Report (Cont.)


Summarize data - the purpose of the study, how the data were
collected, sample size, the number of variables,
missing data
Detailed description - summary statistics, generally in tables
rather than figures; univariate rather than
multivariate; tables may be organized according to one
important group variable
Multivariate analysis - tables or figures