M130 Tutorial-1 Introduction To Statistics and Data Analysis-V.4

Course Contents
1. Introduction to Statistics and Data Analysis
2. Probability
3. Random Variables and Probability Distributions
4. Mathematical Expectation
5. Some Discrete Probability Distributions
6. Some Continuous Probability Distributions
Copyright © 2010 Pearson Addison-Wesley. All rights reserved.

Chapter 1
Introduction to Statistics and Data Analysis
Chapter Outline
1.1 Overview: Statistical Inference, Samples, Populations, and the Role of Probability
1.2 Sampling Procedures; Collection of Data
1.3 Measures of Location: The Sample Mean and Median
1.4 Measures of Variability
1.5 Discrete and Continuous Data
1.6 Statistical Modeling, Scientific Inspection, and Graphical Diagnostics
Example: Data Samples in Tabular Form
Two samples of 10 northern red oak seedlings were planted in a greenhouse, one containing seedlings treated with nitrogen and the other containing seedlings with no nitrogen. The stem weights in grams were recorded at the end of 140 days. The data are given as follows:
[Table: stem weights in grams for the no-nitrogen and nitrogen samples]
The Dot Plot: Another Representation of the tabulated data
Fundamental Relationship between Probability and Inferential Statistics
The sample, along with inferential statistics, allows us to draw conclusions about the population.
Based on known features of the population, elements of probability allow us to draw conclusions about characteristics of hypothetical data taken from the population.
Data Classification
Data
  Qualitative (categorical)
  Quantitative (numerical; can be ranked)
    Discrete (countable), e.g. the number of children in a family
    Continuous (non-countable, measurable), e.g. the height of a student, from 175 to 180
Qualitative (Categorical) Frequency Distribution
Example: Twenty-five army inductees were given a blood test to determine their blood type.
Raw Data:
A, B, B, AB, O
O, O, B, AB, B
B, B, O, A, O
A, O, O, O, AB
AB, A, O, B, A

Class   Tally         Frequency   Percent
A       IIIII         5           20
B       IIIII II      7           28
O       IIIII IIII    9           36
AB      IIII          4           16
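The tally above can be reproduced mechanically. The following is a minimal sketch, assuming the 25 raw observations are typed in as a list of strings; the helper name `frequency_table` is illustrative and not part of the tutorial.

```python
from collections import Counter

# Raw blood-type data from the example (25 observations).
raw = ("A B B AB O  O O B AB B  B B O A O  "
       "A O O O AB  AB A O B A").split()

def frequency_table(values):
    """Return {class: (frequency, percent)} for categorical data."""
    counts = Counter(values)
    n = len(values)
    return {k: (c, 100 * c / n) for k, c in counts.items()}

table = frequency_table(raw)
for blood_type in ("A", "B", "O", "AB"):
    freq, pct = table[blood_type]
    print(f"{blood_type:<4}{freq:>3}{pct:>6.0f}%")
```

The percent column is simply the frequency divided by the sample size, times 100, so the percentages always add up to 100.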
Quantitative: Discrete or Continuous
Data Sorting and Labeling:
Before you work on any numerical data set (discrete or continuous variables), do the following:
– Sort the data from smallest to largest (increasing order).
– Label each value of the sorted data set: X(1) for the first term (smallest value), X(2) for the second term, and so on.
Quantitative: Discrete or Continuous (Continued)
Example:
Given the following: 4, 7, 9, 3, 0
First step (sorting the data):
0, 3, 4, 7, 9
Second step (labeling the data):
X(1), X(2), X(3), X(4), X(5)
The data is rewritten as follows:
X(1)  X(2)  X(3)  X(4)  X(5)
0     3     4     7     9
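The two steps (sort, then label) can be sketched in a few lines of Python. The function name `order_statistics` is an illustrative choice; the only point is that the labels X(1), X(2), ... are 1-based positions in the sorted data, whereas Python lists are 0-based.

```python
def order_statistics(data):
    """Sort the data and label it X(1), X(2), ..., X(n) (1-based labels)."""
    ordered = sorted(data)
    return {i: x for i, x in enumerate(ordered, start=1)}

X = order_statistics([4, 7, 9, 3, 0])
for i, value in X.items():
    print(f"X({i}) = {value}")
```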
Quantitative: Grouped Frequency Distribution (continuous data)
The following data represent the record high temperatures for each of the 50 states.

Class Boundaries    Frequency
 99.5 - 104.5       10
104.5 - 109.5       12
109.5 - 114.5        8
114.5 - 119.5        6
119.5 - 124.5        4
124.5 - 129.5        7
129.5 - 134.5        3
Measures of Location (Central Tendency)
• The data (observations) often tend to be concentrated around the center of the data.
• Some measures of location are the mean, the median, and the mode.
• These measures are considered representative (or typical) values of the data. They are designed to give a quantitative measure of where the center of the data is in the sample.
Notation:
1) The mean (denoted by $\bar{x}$)
2) The median (denoted by $\tilde{x}$)
3) The mode (no notation)
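Sample Mean
The sample mean of the observations $x_1, x_2, \ldots, x_n$ is
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Example:
Given the following data: 4, 9, 6, 12, 19, 16
The sample size n is 6 (since we have 6 observations), so
\[ \bar{x} = \frac{4 + 9 + 6 + 12 + 19 + 16}{6} = \frac{66}{6} = 11 \]
Note: There is no need to sort the data, since we are adding all the observations.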
Example
Suppose that the following sample represents the ages (in years) of a sample of 3 men:
$x_1 = 30$, $x_2 = 35$, $x_3 = 27$.
Then the sample mean is:
\[ \bar{x} = \frac{30 + 35 + 27}{3} = \frac{92}{3} \approx 30.67 \]
Note: the deviations from the mean sum to zero (up to the rounding of $\bar{x}$):
\[ \sum_{i=1}^{3} (x_i - \bar{x}) = (30 - 30.67) + (35 - 30.67) + (27 - 30.67) \approx 0 \]
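As a quick check of the sample mean and of the note that the deviations from the mean sum to zero, here is a small sketch using Python's standard `statistics` module. The helper `residuals` is an illustrative name, not something defined in the tutorial.

```python
from statistics import mean

def residuals(data):
    """Deviations of each observation from the sample mean."""
    x_bar = mean(data)
    return [x - x_bar for x in data]

ages = [30, 35, 27]
print(round(mean(ages), 2))               # sample mean of the three ages
print(abs(sum(residuals(ages))) < 1e-9)   # deviations cancel, up to float error
print(mean([4, 9, 6, 12, 19, 16]))        # mean of the six-observation example
```

Because the code keeps the unrounded mean, the deviations cancel almost exactly; with the rounded value 30.67 a small remainder is left over.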
Sample Mean as a Centroid of the With-Nitrogen Stem Weight
Median
The median is defined to be the middle value of the sorted data set. Therefore, the calculation of the median depends on the sample size n.
Do not forget to sort and label the given data in increasing order (from lowest to highest).
Notes: Let n be the sample size:
n is even if the number of observations is 2, 4, 6, etc.
n is odd if the number of observations is 1, 3, 5, etc.
• If n is odd, then the median is X((n+1)/2), the value in position (n+1)/2 of the sorted data.
Example: Given the following data: 4, 9, 6, 12, 16
(n = 5 is odd) Sorting and labeling the data:
X(1)  X(2)  X(3)  X(4)  X(5)
4     6     9     12    16
Therefore, the median is X(3) = 9.
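Median
– If n is even, then the median is the average of the two middle values of the sorted data:
\[ \text{Median} = \frac{X(n/2) + X(n/2 + 1)}{2} \]
Example: Given the following data: 4, 9, 6, 12, 19, 16
(n = 6 is even) Sorting and labeling the data:
X(1)  X(2)  X(3)  X(4)  X(5)  X(6)
4     6     9     12    16    19
Therefore, the median is:
\[ \text{Median} = \frac{X(3) + X(4)}{2} = \frac{9 + 12}{2} = 10.5 \]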
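The odd and even cases can be written as one short function. This is a sketch rather than a reference implementation; the comments map Python's 0-based indices back to the 1-based labels X(1), ..., X(n) used in the slides.

```python
def median(data):
    """Median of a data set, using the sort-and-label rule."""
    x = sorted(data)
    n = len(x)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                   # X((n+1)/2) in 1-based labels
    return (x[mid - 1] + x[mid]) / 2    # (X(n/2) + X(n/2 + 1)) / 2

print(median([4, 9, 6, 12, 16]))        # odd sample size
print(median([4, 9, 6, 12, 19, 16]))    # even sample size
```

The standard library's `statistics.median` follows the same rule and can be used instead.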
Mode
• The mode of a set of quantitative data is the most frequently occurring measurement in the data set.
• If no measurement occurs more than once, then there is no mode.
• There may be several modes if more than one value shares the highest frequency.
e.g. 2, 4, 5, 1, 7, 9, 0 : No mode
     2, 4, 2, 5, 4, 2 : Mode is 2
     2, 4, 2, 5, 4, 2, 4, 7 : Modes are 2 and 4
The data is said to be:
Unimodal: if it has one mode
Bimodal: if it has two modes
Trimodal: if it has three modes
Multimodal: if it has more than three modes
More Examples
Examples (discrete cases):
1) Given the following data: 2, 2, 9, 6, 12, 8
The mode is 2, since it occurred two times, while the other observations occurred once.
2) Given the following data: 2, 2, 4, 6, 7, 8, 4
The modes are 2 and 4, since each occurred two times, while the other observations occurred once.
3) Given the following data: 2, 6, 9, 16, 12, 8, -2, 0.4
There is no mode, since all the observations have the same frequency.
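The mode rules can be sketched with a frequency count. The function `modes` below is an illustrative helper that follows the convention stated in these slides (no mode when no value repeats); note that Python's built-in `statistics.multimode` uses a different convention and returns every value in that case.

```python
from collections import Counter

def modes(data):
    """Return the sorted list of modes; an empty list means 'no mode'."""
    if not data:
        return []
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []   # no measurement occurs more than once
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 2, 9, 6, 12, 8]))
print(modes([2, 2, 4, 6, 7, 8, 4]))
print(modes([2, 6, 9, 16, 12, 8, -2, 0.4]))
```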
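Measures of Variability or Dispersion
Sample Variance
The standard deviation measures how close the observations are to the mean.
The differences between the observations and the mean, which may be positive or negative, are called residuals and are denoted by $r_i$, where $r_i = x_i - \bar{x}$.
The sample variance is
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
and the sample standard deviation is $s = \sqrt{s^2}$.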
Example 1
Compute the sample variance and standard deviation of the following observations (ages in years): 10, 21, 33, 53, 54.
Solution: n = 5
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{10 + 21 + 33 + 53 + 54}{5} = \frac{171}{5} = 34.2 \text{ years} \]
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{5} (x_i - 34.2)^2}{5 - 1} \]
\[ = \frac{(10 - 34.2)^2 + (21 - 34.2)^2 + (33 - 34.2)^2 + (53 - 34.2)^2 + (54 - 34.2)^2}{4} = \frac{1506.8}{4} = 376.7 \text{ years}^2 \]
\[ s = \sqrt{s^2} = \sqrt{376.7} \approx 19.41 \text{ years} \]

Example 2
A sample of 10 students scored the following grades: 40, 42, 35, 54, 57, 54, 46, 42, 54, 57.
(i) Find the sample mean, mode and median.
(ii) Compute the standard deviation.
Solution:
(i) Listing the scores in order: 35, 40, 42, 42, 46, 54, 54, 54, 57, 57
\[ \text{Mean} = \bar{x} = \frac{35 + 40 + 42 + 42 + 46 + 54 + 54 + 54 + 57 + 57}{10} = \frac{481}{10} = 48.1 \]
Mode = 54
\[ \text{Median} = \frac{46 + 54}{2} = 50 \]
(ii)
\[ s = \sqrt{\frac{1}{9}\left[ (35 - 48.1)^2 + (40 - 48.1)^2 + \cdots + (57 - 48.1)^2 \right]} = \sqrt{\frac{578.9}{9}} \approx 8.02 \]
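The hand calculations in Examples 1 and 2 can be re-checked with a short sketch of the sample variance (dividing by n - 1) and its square root. The function names `sample_variance` and `sample_std` are illustrative; the standard library's `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
import math

def sample_variance(data):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def sample_std(data):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(data))

ages = [10, 21, 33, 53, 54]                          # Example 1
grades = [40, 42, 35, 54, 57, 54, 46, 42, 54, 57]    # Example 2
print(round(sample_variance(ages), 1), round(sample_std(ages), 2))
print(round(sample_std(grades), 2))
```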
Example 3: Using a Table (Alternative Method)
Note: (No need to sort and label the data)
Given the following data: 5, 6, 2, 8, 9 (n = 5)
In order to calculate the standard deviation of this data, we should first calculate the mean: $\bar{x} = (5 + 6 + 2 + 8 + 9)/5 = 30/5 = 6$.

x_i                   2    5    6    8    9    Total
x_i - \bar{x}        -4   -1    0    2    3    0 (sum of residuals)
(x_i - \bar{x})^2    16    1    0    4    9    30

Now,
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{30}{4}} = \sqrt{7.5} \approx 2.74 \]
Example 3 (Continued): The Sample Variance
The sample variance is given by:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Example:
Above, we found that the standard deviation is $s = \sqrt{7.5} \approx 2.74$.
Therefore, the variance is $s^2 = 7.5$.
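The table method amounts to building one row per observation and then summing the columns. The sketch below assumes the five values of Example 3; `deviation_table` is a hypothetical helper name chosen for illustration.

```python
def deviation_table(data):
    """Rows of (x, x - mean, (x - mean)^2) and the totals of the last two columns."""
    n = len(data)
    x_bar = sum(data) / n
    rows = [(x, x - x_bar, (x - x_bar) ** 2) for x in data]
    totals = (sum(r[1] for r in rows), sum(r[2] for r in rows))
    return rows, totals

data = [5, 6, 2, 8, 9]
rows, (sum_resid, sum_sq) = deviation_table(data)
variance = sum_sq / (len(data) - 1)

for x, r, r2 in rows:
    print(f"{x:>3} {r:>6.1f} {r2:>6.1f}")
print("sum of residuals:", sum_resid)
print("sum of squares:", sum_sq, "variance:", variance)
```

The residual column is a useful self-check: if its total is not zero (up to rounding), the mean was computed incorrectly.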
More Measures of Dispersion
• The range is the numerical difference between the largest and the smallest value in a batch of data:
range = max – min
The range is a measure of the dispersion of the data. It is equal to the distance between the first and the last observation of the sorted data. Therefore, one should sort the data and then calculate the range.
Example:
Given the following data: 2, 6, 8, 9, 12
The data is already sorted, so the range = 12 – 2 = 10
More Measures of Dispersion (Continued)
• The lower quartile, denoted by Q1, is the median of the lower half of the batch of data (the median of the values below the median of the data set).
• The upper quartile, denoted by Q3, is the median of the upper half of the batch of data (the median of the values above the median of the data set).
• The inter-quartile range is defined by Q3 – Q1.
• A box-plot is a diagram consisting of a box and whiskers. It displays the median, the quartiles, and the maximum and minimum values in a batch of data.
• Box-plots are useful for comparing two sets of data. In this case two box-plots are drawn on an appropriate common scale.
[Figure: box-plot with min, Q1, median, Q3 and max marked]
Example 1
For the batch of data below (sorted in ascending order):
4, 5, 6, 6, 7, 11, 12, 14, 16, 20, 22, 29
• Sample size n = 12 (even)
• Mode = 6
• Min = 4
• Max = 29
• Median:
\[ \text{Median} = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2} = \frac{x_{(12/2)} + x_{(12/2 + 1)}}{2} = \frac{x_{(6)} + x_{(7)}}{2} = \frac{11 + 12}{2} = 11.5 \]
• Q1 is the median of the values below the median 11.5, that is, the median of the values 4, 5, 6, 6, 7, 11; n = 6 (even). Therefore,
\[ Q_1 = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2} = \frac{x_{(6/2)} + x_{(6/2 + 1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{6 + 6}{2} = 6 \]
Example 1 (Continued)
• Q3 is the median of the values above the median 11.5, that is, the median of the values 12, 14, 16, 20, 22, 29; n = 6 (even). Therefore, labeling these six values from 1 to 6,
\[ Q_3 = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2} = \frac{x_{(6/2)} + x_{(6/2 + 1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{16 + 20}{2} = 18 \]
Below is the representation of the box-plot diagram for the batch of data:
[Figure: box-plot with min = 4, Q1 = 6, median = 11.5, Q3 = 18, max = 29]
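The five values drawn on a box-plot, together with the range and the inter-quartile range, can be computed with the "median of each half" rule used in this example. This is a sketch under that rule only; other software packages may use different quartile conventions and give slightly different Q1 and Q3. The names `median_sorted` and `five_number_summary` are illustrative.

```python
def median_sorted(x):
    """Median of an already sorted, non-empty list."""
    n = len(x)
    mid = n // 2
    return x[mid] if n % 2 == 1 else (x[mid - 1] + x[mid]) / 2

def five_number_summary(data):
    """Min, Q1, median, Q3, max, plus range and IQR (needs at least 2 values)."""
    x = sorted(data)
    n = len(x)
    half = n // 2
    lower = x[:half]              # values below the median position
    upper = x[half + (n % 2):]    # values above it; skips the middle value if n is odd
    q1, q3 = median_sorted(lower), median_sorted(upper)
    return {
        "min": x[0], "q1": q1, "median": median_sorted(x), "q3": q3, "max": x[-1],
        "range": x[-1] - x[0], "iqr": q3 - q1,
    }

batch = [4, 5, 6, 6, 7, 11, 12, 14, 16, 20, 22, 29]
for name, value in five_number_summary(batch).items():
    print(f"{name:>6}: {value}")
```

When n is odd, the middle observation is the median itself and is left out of both halves, which matches the way the quartiles are taken from the values strictly below and above the median.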
Example 2
The table below gives the gross weekly earnings, including overtime, in pounds of 20 actors working in a theatre (9 women and 11 men):

Women   221  272  334  361  372  399  415  456  510
Men     258  315  333  353  398  420  435  462  495  523  587

(a) Draw an accurate diagram of the box-plots.
(b) What do the box-plots tell you about the relative earnings of male and female actors?
Example 2 (Continued)
          For women                  For men
Min       221                        258
Max       510                        587
Q1        (272 + 334)/2 = 303        333
Q3        (415 + 456)/2 = 435.5      495
Median    372                        420
[Figure: box-plots for women and men on a common scale from 160 to 640]
Example 2 (Continued)
From the box-plots it is clear that the men's earnings are higher than the women's: all five values marked on the box-plots are higher for the men than for the women.
Example 3: Nicotine Data – Outliers or Extreme Values
The box-plot below is a representation of the data in the table above. Note the dots on the two sides of the plot, to the left of the minimum value and to the right of the maximum value. These dots are called "outliers" or "extreme values".
Statistical Modeling: The Histogram
A histogram is used with grouped frequency data.
– It is similar to a bar chart; however, the data we are working with is continuous, so the bars are drawn adjacent to one another with no gaps.
– Cutpoints: These are the borderlines (boundaries) between groups (classes) (see Table 1.7, page 23: Battery Life in Years). Values at a borderline are usually allocated to the higher group; however, the choice does not matter as long as we are consistent with all borderline values.
– The starting point is the value at which we start to draw the histogram.
– All classes (groups) are supposed to have the same width.
Statistical Modeling: The Histogram – Example: Battery Life (in Years)

Class Interval   Class Midpoint   Frequency, f   Relative Frequency
1.5 – 2.0        1.75              2             0.050
2.0 – 2.5        2.25              1             0.025
2.5 – 3.0        2.75              4             0.100
3.0 – 3.5        3.25             15             0.375
3.5 – 4.0        3.75             10             0.250
4.0 – 4.5        4.25              5             0.125
4.5 – 5.0        4.75              3             0.075

Table 1.7: Relative Frequency Distribution of Battery Life
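The midpoint and relative-frequency columns of Table 1.7 follow directly from the class intervals and the frequencies. The sketch below recomputes them and also shows one way to apply the "borderline values go to the higher group" rule; `class_index` is a hypothetical helper, and the starting point 1.5 and class width 0.5 are read off the table.

```python
intervals = [(1.5, 2.0), (2.0, 2.5), (2.5, 3.0), (3.0, 3.5),
             (3.5, 4.0), (4.0, 4.5), (4.5, 5.0)]
freqs = [2, 1, 4, 15, 10, 5, 3]
total = sum(freqs)   # sample size

# One row per class: (midpoint, frequency, relative frequency).
rows = [((lo + hi) / 2, f, f / total) for (lo, hi), f in zip(intervals, freqs)]
for (lo, hi), (mid, f, rel) in zip(intervals, rows):
    print(f"{lo:.1f} - {hi:.1f}  {mid:.2f}  {f:>3}  {rel:.3f}")

def class_index(value, start=1.5, width=0.5):
    """0-based class number; a value on a boundary falls in the higher class."""
    return int((value - start) // width)

print(class_index(2.0))   # boundary value, allocated to the 2.0 - 2.5 class
```

The relative frequencies are the heights of the histogram bars when the vertical axis shows proportions rather than counts, and they sum to 1.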
Probability Distribution Function Corresponding to the Histogram of the Battery Life Example
Skewness of Data Distribution
• The distribution in (a) is said to be "right skewed", as it has a longer tail on the right side.
• The distribution in (b) is symmetric.
• The distribution in (c) is said to be "left skewed", as it has a longer tail on the left side.
Statistical Modeling: Scatter-Plot
– A scatter-plot is used to study the relationship between two variables (paired observations). Each variable is presented on a separate axis: one on the horizontal axis and the other on the vertical axis.
– A positive relation between two variables exists if, as X increases, Y increases. A relation is called negative if, as X increases, Y decreases.
– A linear relationship means that the plotted points have a linear trend (they fall along a line; the relationship can range from strong to weak).
– Extreme observations in sets of data are sometimes called outliers.
Examples of Scatter Plots: Death Rates vs. Alcohol Consumption
