
Course Contents

1. Introduction to Statistics and Data Analysis
2. Probability
3. Random Variables and Probability Distributions
4. Mathematical Expectation
5. Some Discrete Probability Distributions
6. Some Continuous Probability Distributions

Copyright © 2010 Pearson Addison-Wesley. All rights reserved.


Chapter 1

Introduction to Statistics and Data Analysis

Chapter Outline

1.1 Overview: Statistical Inference, Samples, Populations, and the Role of Probability
1.2 Sampling Procedures; Collection of Data
1.3 Measures of Location: The Sample Mean and Median
1.4 Measures of Variability
1.5 Discrete and Continuous Data
1.6 Statistical Modeling, Scientific Inspection, and Graphical Diagnostics

Example: Data Samples in Tabular Form

Two samples of 10 northern red oak seedlings were planted in a greenhouse, one containing seedlings treated with nitrogen and the other containing seedlings with no nitrogen. The stem weights in grams were recorded after the end of 140 days. The data are given as follows:

The Dot Plot: Another Representation of the tabulated data

Fundamental Relationship between
Probability and Inferential Statistics

The sample, along with inferential statistics, allows us to draw conclusions about the population.

Based on known features of the population, elements of probability allow us to draw conclusions about characteristics of hypothetical data taken from the population.

Data Classification

Data
– Qualitative: categorical
– Quantitative: numerical, can be ranked
    • Discrete: countable (e.g., number of children in a family)
    • Continuous: non-countable, measurable (e.g., height of a student, from 175 to 180)

Qualitative Data: Categorical Frequency Distribution

Example: Twenty-five army inductees were given a blood test to determine their blood type.

Raw data:
A, B, B, AB, O
O, O, B, AB, B
B, B, O, A, O
A, O, O, O, AB
AB, A, O, B, A

Class   Tally         Frequency   Percent
A       IIIII         5           20
B       IIIII II      7           28
O       IIIII IIII    9           36
AB      IIII          4           16

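The tally above can be reproduced programmatically. The following is a minimal Python sketch, assuming the raw data are typed in as a list of strings; the helper name `frequency_table` is illustrative and not part of the slides.

```python
from collections import Counter

# Blood types of the 25 inductees, in the order listed in the raw data.
raw = ("A B B AB O O O B AB B B B O A O "
       "A O O O AB AB A O B A").split()

def frequency_table(values):
    """Return {class: (frequency, percent)} for categorical data."""
    counts = Counter(values)
    n = len(values)
    return {cls: (f, 100 * f / n) for cls, f in counts.items()}

table = frequency_table(raw)
for cls in ("A", "B", "O", "AB"):
    f, pct = table[cls]
    print(f"{cls:<3}{f:>3}{pct:>6.0f}%")
```

The percent column is simply the frequency divided by the sample size (25) and multiplied by 100.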
Quantitative: Discrete or Continuous

Data Sorting and Labeling:

Before you work on any numerical data set (discrete or continuous variables), do the following:
– Sort the data from smallest to largest (increasing order).
– Label each value of the sorted data set: X(1) for the first term (1st value), X(2) for the second term (2nd value), and so on.

Quantitative: Discrete or Continuous
(Continued)

Example:
Given the following: 4, 7, 9, 3, 0

First step (sorting the data):
0, 3, 4, 7, 9

Second step (labeling the data):
X(1), X(2), X(3), X(4), X(5)

The data is rewritten as follows:
X(1)  X(2)  X(3)  X(4)  X(5)
 0     3     4     7     9

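The two steps (sort, then label) can be sketched in a few lines of Python. The function name `sort_and_label` is an illustrative choice; note that Python lists are indexed from 0, so the label X(k) corresponds to position k - 1.

```python
def sort_and_label(data):
    """Sort ascending and pair each value with its label X(1), X(2), ..."""
    ordered = sorted(data)
    return [(f"X({k})", value) for k, value in enumerate(ordered, start=1)]

labeled = sort_and_label([4, 7, 9, 3, 0])
for label, value in labeled:
    print(label, "=", value)
```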
Quantitative-Grouped Frequency Distribution
(continuous data)

The following data represent the record high temperatures for each of the 50 states.

Class Boundaries    Frequency
 99.5 - 104.5       10
104.5 - 109.5       12
109.5 - 114.5        8
114.5 - 119.5        6
119.5 - 124.5        4
124.5 - 129.5        7
129.5 - 134.5        3

Measures of Location (Central Tendency)
• The data (observations) often tend to be concentrated around the center of the data.
• Some measures of location are the mean, the mode, and the median.
• These measures are considered representatives (or typical values) of the data. They are designed to give a quantitative measure of where the center of the data is in the sample.
• Notation:
1) The mean (denoted by $\bar{x}$)
2) The median (denoted by $\tilde{x}$)
3) The mode (no notation)

Sample Mean

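For observations $x_1, x_2, \ldots, x_n$, the sample mean is
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Example:
Given the following data: 4, 9, 6, 12, 19, 16
The sample size n is 6 (since we have 6 observations), so
$\bar{x} = \frac{4 + 9 + 6 + 12 + 19 + 16}{6} = \frac{66}{6} = 11$

Note: There is no need to sort the data, since we are adding all the observations.
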
Example

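Suppose that the following sample represents the ages (in years) of a sample of 3 men:

$x_1 = 30, \quad x_2 = 35, \quad x_3 = 27$

Then the sample mean is:

$\bar{x} = \frac{30 + 35 + 27}{3} = \frac{92}{3} \approx 30.67$

Note: the deviations from the mean sum to zero (exactly so with the unrounded mean 92/3):

$\sum_{i=1}^{3} (x_i - \bar{x}) = (30 - 30.67) + (35 - 30.67) + (27 - 30.67) \approx 0$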
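As a quick numerical check of the mean and of the zero-sum property of the deviations, here is a small Python sketch. The function name `sample_mean` is illustrative; because of floating-point rounding, the sum of deviations is compared with zero up to a small tolerance rather than exactly.

```python
def sample_mean(xs):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

ages = [30, 35, 27]
x_bar = sample_mean(ages)
deviations = [x - x_bar for x in ages]

print(round(x_bar, 2))                 # mean rounded to two decimals
print(abs(sum(deviations)) < 1e-9)     # deviations sum to (numerically) zero
```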

Sample Mean as a Centroid of
the with-nitrogen stem weight

Median

The median is defined to be the middle value of the data set. Therefore, the calculation of the median depends on the sample size n.
Do not forget to sort and label the given data in increasing order (from lowest to highest).
Notes: Let n be the sample size:
n is even if the number of observations is 2, 4, 6, etc.
n is odd if the number of observations is 1, 3, 5, etc.
• If n is odd, then the median is $\tilde{x} = X\left(\frac{n+1}{2}\right)$.
Example: Given the following data: 4, 9, 6, 12, 16
(n = 5 is odd) Sorting and labeling the data:
X(1)  X(2)  X(3)  X(4)  X(5)
 4     6     9     12    16
Therefore, the median is $\tilde{x} = X(3) = 9$.

Median

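• If n is even, then the median is $\tilde{x} = \frac{X\left(\frac{n}{2}\right) + X\left(\frac{n}{2} + 1\right)}{2}$.

Example: Given the following data: 4, 9, 6, 12, 19, 16
(n = 6 is even) Sorting and labeling the data:
X(1)  X(2)  X(3)  X(4)  X(5)  X(6)
 4     6     9     12    16    19
Therefore, the median is:
$\tilde{x} = \frac{X(3) + X(4)}{2} = \frac{9 + 12}{2} = 10.5$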
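The odd and even cases can be combined into one short function. This is a sketch under the sort-and-label convention used in these slides; the name `median` is illustrative, and the zero-based list index k - 1 stands in for the label X(k).

```python
def median(data):
    """Middle value for odd n; average of the two middle values for even n."""
    x = sorted(data)              # x[k - 1] plays the role of X(k)
    n = len(x)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]            # X((n + 1) / 2)
    return (x[n // 2 - 1] + x[n // 2]) / 2    # (X(n/2) + X(n/2 + 1)) / 2

print(median([4, 9, 6, 12, 16]))       # odd-n example from the slides
print(median([4, 9, 6, 12, 19, 16]))   # even-n example from the slides
```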

Mode

• The mode of a set of quantitative data is the most frequently occurring measurement in the data set.
• If no measurement occurs more than once, then there is no mode.
• There may be several modes if more than one value occurs with the same highest frequency.
e.g. 2, 4, 5, 1, 7, 9, 0 : No mode
     2, 4, 2, 5, 4, 2 : Mode is 2
     2, 4, 2, 5, 4, 2, 4, 7 : Modes are 2 and 4

The data is said to be:
Unimodal: if it has one mode
Bimodal: if it has two modes
Trimodal: if it has three modes
Multimodal: if it has more than three modes

More Examples

Examples (discrete cases):

1) Given the following data: 2, 2, 9, 6, 12, 8
The mode is 2, since it occurred two times, while the other observations occurred once.
2) Given the following data: 2, 2, 4, 6, 7, 8, 4
The modes are 2 and 4, since 2 and 4 each occurred two times, while the other observations occurred once.
3) Given the following data: 2, 6, 9, 16, 12, 8, -2, 0.4
There is no mode, since the frequency of all the observations is the same.
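The mode rules can be expressed as a small function. This is a sketch, not a definitive implementation: the name `modes` is illustrative, and it assumes the slides' convention that data in which no value repeats has no mode (returned here as an empty list).

```python
from collections import Counter

def modes(data):
    """Return the sorted list of most frequent values, or [] if nothing repeats."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:                  # no measurement occurs more than once
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 2, 9, 6, 12, 8]))
print(modes([2, 2, 4, 6, 7, 8, 4]))
print(modes([2, 6, 9, 16, 12, 8, -2, 0.4]))
```

The length of the returned list tells whether the data is unimodal (1), bimodal (2), trimodal (3), or multimodal (more than 3).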

Measures of Variability OR Dispersion

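Sample Variance

The sample variance is $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, and the sample standard deviation is $s = \sqrt{s^2}$.
The standard deviation measures how close the observations are to the mean.
The differences $x_i - \bar{x}$, which may be positive or negative, are called residuals and are denoted by $r_i$, where $r_i = x_i - \bar{x}$.
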
Example 1

Compute the sample variance and standard deviation of the following observations (ages in years): 10, 21, 33, 53, 54.

Solution: n = 5

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{10 + 21 + 33 + 53 + 54}{5} = \frac{171}{5} = 34.2$ years

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{5} (x_i - 34.2)^2}{5 - 1}$

$= \frac{(10 - 34.2)^2 + (21 - 34.2)^2 + (33 - 34.2)^2 + (53 - 34.2)^2 + (54 - 34.2)^2}{4}$

$= \frac{1506.8}{4} = 376.7$ (years)$^2$

$s = \sqrt{s^2} = \sqrt{376.7} \approx 19.41$ years
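The calculation in Example 1 follows directly from the definition of the sample variance (divide the sum of squared deviations by n - 1). Below is a minimal Python sketch; the function names `sample_variance` and `sample_std` are illustrative.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

ages = [10, 21, 33, 53, 54]
print(round(sample_variance(ages), 1))   # variance in (years)^2
print(round(sample_std(ages), 2))        # standard deviation in years
```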


Example 2

A sample of 10 students scored the following grades: 40, 42, 35, 54, 57, 54, 46, 42, 54, 57.
(i) Find the sample mean, mode and median.
(ii) Compute the standard deviation.

Solution:
(i) Listing the scores in order: 35, 40, 42, 42, 46, 54, 54, 54, 57, 57

Mean: $\bar{x} = \frac{35 + 40 + 42 + 42 + 46 + 54 + 54 + 54 + 57 + 57}{10} = \frac{481}{10} = 48.1$

Mode = 54

Median: $\tilde{x} = \frac{46 + 54}{2} = 50$

(ii) $s = \sqrt{\frac{1}{9}\left[(35 - 48.1)^2 + (40 - 48.1)^2 + \cdots + (57 - 48.1)^2\right]} = \sqrt{\frac{578.9}{9}} \approx 8.02$

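Example 2 can be cross-checked with Python's standard `statistics` module, which provides `mean`, `median`, `mode`, and `stdev` (the sample standard deviation, with n - 1 in the denominator). The variable names below are illustrative; for these grades `stdev` returns about 8.02.

```python
import statistics

grades = [40, 42, 35, 54, 57, 54, 46, 42, 54, 57]

mean = statistics.mean(grades)       # arithmetic mean
median = statistics.median(grades)   # average of the two middle values (n is even)
mode = statistics.mode(grades)       # most frequent grade
s = statistics.stdev(grades)         # sample standard deviation (n - 1 denominator)

print(mean, median, mode, round(s, 2))
```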
Example 3: Using a Table (Alternative
Method)

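Note: (No need to sort and label the data)

Given the following data: 5, 6, 2, 8, 9 (n = 5)
In order to calculate the standard deviation of this data, we should first calculate the mean: $\bar{x} = \frac{5 + 6 + 2 + 8 + 9}{5} = \frac{30}{5} = 6$

$x_i$                  2    5    6    8    9    Sum
$x_i - \bar{x}$       -4   -1    0    2    3    0 (= sum of residuals)
$(x_i - \bar{x})^2$   16    1    0    4    9    30

Now, $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{30}{4}} = \sqrt{7.5} \approx 2.74$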

Example 3: (Continued)

– The sample variance

The sample variance is given by: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Example:
On the previous page, we found that the standard deviation is $s = \sqrt{7.5} \approx 2.74$.
Therefore, the variance is $s^2 = 7.5$.

More Measures of Dispersion.
• The range is the numerical difference between the largest and the smallest value of a batch of data:
range = max – min

The range is a measure of the dispersion of the data. It is equal to the distance between the first and the last observation of the sorted data. Therefore, one should sort the data and then calculate the range. That is, the range is given by:
“Maximum value – Minimum value”
Example:
Given the following data: 2, 6, 8, 9, 12
The data is sorted, so the range = 12 – 2 = 10

More Measures of Dispersion

• The lower quartile, denoted by Q1, is the median of the lower half of the batch of data (median of the values below the median of the data set).
• The upper quartile, denoted by Q3, is the median of the upper half of the batch of data (median of the values above the median of the data set).
• The inter-quartile range is defined by Q3 – Q1.
• A box-plot is a diagram consisting of a box and whiskers. It displays the median, the quartiles, and the maximum and minimum values in a batch of data, as shown below.
• Box-plots can be used to compare two sets of data. In this case two box-plots are needed, drawn on an appropriate common scale.

[Figure: box-plot marking min, Q1, median, Q3, and max]

Example 1

For the batch of data below (sorted in ascending order):

4, 5, 6, 6, 7, 11, 12, 14, 16, 20, 22, 29

• Sample size n = 12 (even)
• Mode = 6
• Min = 4
• Max = 29
• Median = $\frac{X\left(\frac{n}{2}\right) + X\left(\frac{n}{2} + 1\right)}{2} = \frac{X\left(\frac{12}{2}\right) + X\left(\frac{12}{2} + 1\right)}{2} = \frac{X(6) + X(7)}{2} = \frac{11 + 12}{2} = 11.5$

• Q1 is the median of the values below the median 11.5, that is, the median of the values 4, 5, 6, 6, 7, 11; n = 6 (even). Therefore,

Q1 = $\frac{X\left(\frac{n}{2}\right) + X\left(\frac{n}{2} + 1\right)}{2} = \frac{X\left(\frac{6}{2}\right) + X\left(\frac{6}{2} + 1\right)}{2} = \frac{X(3) + X(4)}{2} = \frac{6 + 6}{2} = 6$

Example 1 (Continued)

• Q3 is the median of the values above the median 11.5, that is, the median of the values 12, 14, 16, 20, 22, 29; n = 6 (even). Therefore,

Q3 = $\frac{X\left(\frac{n}{2}\right) + X\left(\frac{n}{2} + 1\right)}{2} = \frac{X\left(\frac{6}{2}\right) + X\left(\frac{6}{2} + 1\right)}{2} = \frac{X(3) + X(4)}{2} = \frac{16 + 20}{2} = 18$

Below is the representation of the box-plot diagram for the batch of data:

[Box-plot: min = 4, Q1 = 6, median = 11.5, Q3 = 18, max = 29]
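The five values drawn on a box-plot can be computed with the "median of each half" rule used in this example. The sketch below is one possible implementation under that rule; the function names are illustrative, and it assumes that for an odd sample size the middle observation is left out of both halves. Other quartile conventions exist and can give slightly different values.

```python
def median_sorted(x):
    """Median of an already sorted, non-empty list."""
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]
    return (x[mid - 1] + x[mid]) / 2

def five_number_summary(data):
    """Return min, Q1, median, Q3, max using the median-of-halves rule."""
    x = sorted(data)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = x[:half]              # values below the median
    upper = x[n - half:]          # values above the median
    return {"min": x[0], "Q1": median_sorted(lower),
            "median": median_sorted(x), "Q3": median_sorted(upper),
            "max": x[-1]}

batch = [4, 5, 6, 6, 7, 11, 12, 14, 16, 20, 22, 29]
summary = five_number_summary(batch)
iqr = summary["Q3"] - summary["Q1"]   # inter-quartile range
print(summary, iqr)
```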

Example 2

The table below gives the gross weekly earnings, including overtime, in pounds of 20 actors working in a theatre (9 women and 11 men):

Women   221  272  334  361  372  399  415  456  510
Men     258  315  333  353  398  420  435  462  495  523  587

(a) Draw an accurate diagram of the box-plots.
(b) What do the box-plots tell you about the relative earnings of male and female actors?

Example 2

          For women                   For men
Min       221                         258
Q1        (272 + 334)/2 = 303         333
Median    372                         420
Q3        (415 + 456)/2 = 435.5       495
Max       510                         587

[Figure: box-plots for women and men drawn on a common scale from 160 to 640]

Example 2 (Continued)

From the box-plots it is clear that the men’s earnings are higher than the women’s: all five values marked on the box-plots are higher for the men than for the women.
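The comparison can be checked numerically. The following sketch uses `statistics.median` from the Python standard library and the same median-of-halves rule for the quartiles (middle observation excluded when the sample size is odd); the helper name `five_numbers` is an illustrative choice.

```python
from statistics import median

def five_numbers(data):
    """Return (min, Q1, median, Q3, max) with quartiles as medians of the halves."""
    x = sorted(data)
    half = len(x) // 2
    lower, upper = x[:half], x[len(x) - half:]
    return (x[0], median(lower), median(x), median(upper), x[-1])

women = [221, 272, 334, 361, 372, 399, 415, 456, 510]
men = [258, 315, 333, 353, 398, 420, 435, 462, 495, 523, 587]

w, m = five_numbers(women), five_numbers(men)
men_higher = all(mv > wv for mv, wv in zip(m, w))
print(w)
print(m)
print(men_higher)   # True when every one of the five values is higher for men
```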

Example 3: Nicotine Data – Outliers Or
Extreme Values

The box-plot below is a representation of the data in the table above. Note the
dots on the two sides of the plot, to the left of the minimum value and to the
right of the maximum value. These dots are called “Outliers” or “Extreme
Values”.

Statistical Modeling: The Histogram

A histogram is used with grouped frequency data.

– It is similar to a bar chart; however, the data we are working with is continuous, so the bars are adjacent to one another with no gaps.
– Cutpoints: these are the borderlines (boundaries) between groups (classes) (see Table 1.7, page 23, Battery Life in Years). Values at a borderline are usually allocated to the higher group; however, the choice does not matter as long as we are consistent with all borderline values.
– The starting point is the value at which we start to draw the histogram.
– All classes (groups) are supposed to have the same width.
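The cutpoint rule (a value on a boundary goes to the higher class) can be sketched as follows. The function name `grouped_frequencies` and the sample values are hypothetical and are not the battery data from the textbook; the sketch also assumes that the class width divides cleanly in floating point, as 0.5 does here.

```python
def grouped_frequencies(values, start, width, n_classes):
    """Count values in equal-width classes beginning at `start`.

    Class k covers [start + k*width, start + (k+1)*width), so a value
    that falls exactly on a cutpoint is allocated to the higher class.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} lies outside the classes")
        counts[k] += 1
    return counts

# Hypothetical measurements, for illustration only.
sample = [1.6, 2.0, 2.4, 2.5, 3.1, 3.0, 4.9]
counts = grouped_frequencies(sample, start=1.5, width=0.5, n_classes=7)
print(counts)
```

Here 2.0, 2.5 and 3.0 sit on cutpoints and are counted in the classes that start at those values.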

Statistical Modeling: The Histogram –
Example: Battery Life (in Years).

Class Interval   Class Midpoint   Frequency, f   Relative Frequency
1.5 – 2.0        1.75              2             0.050
2.0 – 2.5        2.25              1             0.025
2.5 – 3.0        2.75              4             0.100
3.0 – 3.5        3.25             15             0.375
3.5 – 4.0        3.75             10             0.250
4.0 – 4.5        4.25              5             0.125
4.5 – 5.0        4.75              3             0.075

Table 1.7: Relative Distribution of Battery Life
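The midpoint and relative-frequency columns of Table 1.7 follow from the class intervals and frequencies: the midpoint is the average of the two class limits, and the relative frequency is the class frequency divided by the total number of observations. A short Python sketch, with illustrative variable names:

```python
# (lower limit, upper limit, frequency) for each class in Table 1.7
classes = [(1.5, 2.0, 2), (2.0, 2.5, 1), (2.5, 3.0, 4), (3.0, 3.5, 15),
           (3.5, 4.0, 10), (4.0, 4.5, 5), (4.5, 5.0, 3)]

total = sum(f for _, _, f in classes)   # total number of observations

# One row per class: (midpoint, frequency, relative frequency)
rows = [((lo + hi) / 2, f, f / total) for lo, hi, f in classes]

for mid, f, rel in rows:
    print(f"{mid:.2f}  {f:>2}  {rel:.3f}")
```

The relative frequencies sum to 1, which is a useful check on the table.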

Probability Distribution Function
Corresponding to Histogram of Example of
Battery Life

Skewness of Data Distribution

• The distribution in (a) is said to be “right skewed”, as it has a longer tail on the right side.
• The distribution in (b) is symmetric.
• The distribution in (c) is said to be “left skewed”, as it has a longer tail on the left side.

Statistical Modeling: Scatter-Plot

– A scatter-plot is used to study the relationship between two variables (paired observations). Each variable is presented on a separate axis: one on the horizontal axis and the other on the vertical axis.
– A positive relation between two variables exists if, as X increases, Y increases. A relation is called negative if, as X increases, Y decreases.
– A linear relationship means that the plotted points have a linear trend (they follow the form of a line, ranging from a strong to a weak linear relationship).
– Extreme observations in sets of data are sometimes called outliers.

Examples of Scatter Plots: Death
rates vs. Alcohol Consumption.
