Session 1 PDF

Session 1
Descriptive Statistics and Probability Distributions
• Categorical Data
• Numerical Data
• Bar Charts and Histograms
• Probability Distribution
• Discrete Random Variables
• Expected Value
• Standard Deviation
• Probability Density Function
• Normal Random Variable

Data
Data come in broadly two types: Categorical and Numerical.
Categorical variables give qualitative responses, such as the responses to:
“Did you watch TV in the last 24 hours?”
“What is your favorite candy bar?”
Numerical variables give quantitative responses, such as the responses to:
“How many brothers & sisters do you have?”
“How tall are you?”

Qualitative Data: Categories
1. Nominal Data
Each value falls into one and only one of a set of categories.
• Hair color
• State of birth
• College major
• Political affiliation
There is no meaningful ordering of the categories of Political Affiliation, other than alphabetical: Democrat, Independent, Other, Republican.
2. Ordinal Data
The possible values have a meaningful ordering:
• Evaluation rating for this stat course: (1 = awful, ... , 5 = awesome)
• Letter grades for a report in English class: A+, A, A-, B+, B, B-, ..., F+, F, F-.
Note: Arbitrary numbers, like the 1 to 5 or 1 to 10 scales used in questionnaires, are considered categories, not numbers.

Quantitative Data: Numbers
Discrete (from counting):
• Number of customers at Burger King today.
• Number of students majoring in Business at each of 20 different colleges.
• Number of students attending this class today.
Discrete (from rounding a continuous variable):
• Person's height to the nearest 1/2”.
Continuous (from measurement):
• Person's height (exact).
• Daily high temperature (°F) during the last year.
• Hours per week students spend studying.
• Batting averages of baseball players.
• Monthly unemployment %.
When presenting discrete data, we make a table or bar chart showing frequencies. With a continuous variable there are too many possible values, so we use a histogram (to group the data).
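The difference between a frequency table and a histogram can be sketched in Python. This is only an illustration: the `siblings` and `heights` lists and the bin width of 5 are made-up assumptions, not course data.

```python
from collections import Counter

def frequency_table(values):
    """Count how often each distinct value occurs (suits discrete data)."""
    return dict(sorted(Counter(values).items()))

def histogram_counts(values, start, width):
    """Group values into bins [start + k*width, start + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = int((v - start) // width)    # index of the bin containing v
        counts[start + k * width] += 1   # key = left edge of that bin
    return dict(sorted(counts.items()))

# Hypothetical data, for illustration only.
siblings = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]                   # discrete counts
heights = [61.2, 64.8, 66.1, 67.5, 68.3, 69.9, 70.4, 72.6]  # inches, continuous

print(frequency_table(siblings))
print(histogram_counts(heights, start=60, width=5))
```

Every height here is distinct, so a frequency table of heights would have one row per value; grouping into bins is what makes the shape visible.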

Outliers
A data value may be quite unlike the other data values. In this case, it is permitted to remove it (and a few others) from the data set if doing so shows the pattern in the remaining data more clearly.
For example, suppose all the data values are around 100 except for a couple near 500. Drawing a histogram may give a misleading picture of the data. Computing an average may give a misleading result.
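A small Python sketch of the effect on the average. The data values and the cutoff of 300 are hypothetical, chosen only to mimic "around 100 except for a couple near 500".

```python
from statistics import mean, median

# Hypothetical data: most values near 100, a couple near 500.
values = [98, 101, 99, 102, 100, 97, 103, 495, 505]

# Set aside values above an illustrative cutoff (300 is an assumption).
trimmed = [v for v in values if v < 300]

print(mean(values))    # pulled far above 100 by the two large values
print(mean(trimmed))   # 100 once they are set aside
print(median(values))  # the median is much less affected
```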

Bar Chart and Pie Chart: Categorical or Discrete Variables
[Figure: pie chart with slices for ACCT, MKTG, FINC, MGMT and INFO. FINC 20%, MGMT 15%, INFO 15%; the 28% and 22% slices belong to ACCT and MKTG.]

GMAT Scores of a Business School Class
610 730 590 610 . . . 680 630
640 680 540 660 . . . 610 540
690 610 520 640 . . . 720 680
610 650 660 580 . . . 600 730
710 600 760 690 . . . 500 720
610 650 660 710 . . . 480 600
630 610 680 780 . . . 700 690
530 550 730 690 . . . 670 540
630 720 610 710 . . . 600 600
690 600 730 540 . . . 560 770
Data File: GMAT.jmp

Descriptive Statistics (GMAT.jmp)
[Figure: descriptive statistics output for GMAT.jmp.]

Boxplot (using GMAT.jmp)
[Figure: boxplot of the GMAT scores, annotated with the lower quartile (25%) = 600, the median = 640, the upper quartile (75%) = 680, the Inter Quartile Range (IQR), the 1.5 IQR whisker length, and the outliers.]
A boxplot displays the prominent quartiles of the data along with outliers.
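The 1.5 IQR annotation can be turned into numbers with a short Python sketch. It assumes the common convention that points more than 1.5 IQR beyond a quartile are drawn as outliers; the exact rule JMP applies is not stated in these slides.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles read off the GMAT boxplot.
lower, upper = iqr_fences(600, 680)
print(lower, upper)  # 480.0 800.0

# A few of the scores visible in the table; of these, none lies outside the fences.
sample = [480, 500, 610, 730, 780]
print([s for s in sample if s < lower or s > upper])  # []
```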

Random Variable
A Random Variable is a variable that takes on numerical values, each associated with a probability.
EXAMPLES:
X = Number of sales made by a salesperson in a given week.
Y = Number of miles you drive your car in a given month.
Types of Random Variables:
1. Discrete -- Possible to list all values.
2. Continuous -- All values in a range are possible.

Discrete Random Variables
The Probability Distribution of a discrete random variable X is a table of all its possible values x, with their corresponding probabilities p(x).
Example: Throw a die once.
x    p(x)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6
[Figure: bar chart "Throw of one die"; horizontal axis: Result (1 to 6), vertical axis: Probability (0.00 to 0.18).]
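A probability table like the one for the die can be checked and queried in a few lines of Python. This is a minimal sketch; the helper name `is_probability_distribution` is made up for illustration.

```python
from fractions import Fraction

# Distribution of one fair die: each face has probability 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

def is_probability_distribution(p):
    """Check that every p(x) is in [0, 1] and that the probabilities sum to 1."""
    return all(0 <= px <= 1 for px in p.values()) and sum(p.values()) == 1

print(is_probability_distribution(die))             # True
print(sum(px for x, px in die.items() if x >= 5))   # P(X >= 5) = 1/3
```

Exact fractions are used so that the sum is exactly 1 rather than 0.9999...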

Example:
From the coin-operated machine in front of the House of Pancakes you tabulate the % of days that various numbers of newspapers were sold:
x (Newspapers Sold)    p(x) (Probability)
0    0.1
1    0.1
2    0.1
3    0.2
4    0.2
5    0.1
6    0.1
7    0.1
[Figure: bar chart of this distribution; horizontal axis: Newspapers Sold (0 to 7), vertical axis: Probability (0.00 to 0.20).]

Computing μ and σ for Discrete Random Variables
X is a discrete random variable. Assume we know the probabilities P(X = x) for all values x. Call this p(x).
Mean (Expected Value):
$$\mu = E(X) = \sum x \, p(x)$$
Standard Deviation:
First compute the Variance:
$$\sigma^2 = \sum (x - \mu)^2 \, p(x) = \left[ \sum x^2 \, p(x) \right] - \mu^2$$
and then take the square root:
$$\sigma = \sqrt{\left[ \sum x^2 \, p(x) \right] - \mu^2}$$

Example:
For the Newspaper example, find μ:
μ is the “long-run” average number of newspapers sold per day.
For the Newspaper example, find σ:
If σ = 0, then the number of newspapers sold is the same every day, but as σ gets larger, the number of newspapers sold per day becomes more unpredictable.
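One way to check a hand calculation for the Newspaper example is a short Python sketch that applies the mean and variance formulas directly. The function name `mean_and_sd` is made up for illustration.

```python
from math import sqrt

# Newspaper distribution from the example: number sold -> probability.
p = {0: 0.1, 1: 0.1, 2: 0.1, 3: 0.2, 4: 0.2, 5: 0.1, 6: 0.1, 7: 0.1}

def mean_and_sd(p):
    """Return (mu, sigma) using mu = sum x p(x) and sigma^2 = sum x^2 p(x) - mu^2."""
    mu = sum(x * px for x, px in p.items())
    variance = sum(x * x * px for x, px in p.items()) - mu ** 2
    return mu, sqrt(variance)

mu, sigma = mean_and_sd(p)
print(round(mu, 2), round(sigma, 2))  # 3.5 2.06
```

Here the sum of x p(x) is 3.5 and the sum of x² p(x) is 16.5, so the variance is 16.5 − 3.5² = 4.25.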

Continuous Random Variables
Continuous random variables can be described by a cumulative probability P(X ≤ a). But, instead of P(X = a), we must use the...
Probability Density Function
Properties of the Density Function f(x):
(A) f(x) ≥ 0 for all x.
(B) Area under the curve f(x) is 1.
(C) P(a ≤ X ≤ b) = area under f(x) between a and b.
[Figure: density curve with the points a and b marked on the horizontal axis.]
For continuous variables, P(a ≤ X ≤ b) is the same as P(a < X < b) because the area of a line is equal to _______.

EXAMPLE:
If a person's height, weight, or IQ is measured to an arbitrary number of decimal places, then we get a continuous random variable. Its density function would be bell-shaped:
[Figure: bell-shaped density curve.]

Normal Distribution
The most frequently used distribution in Statistics.
A normal distribution is completely determined by its mean and standard deviation.
μ = Mean
σ = Standard Deviation (σ must be > 0)
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty$$
To calculate probabilities, we use the Standard Normal Table (μ = 0, σ = 1).
The standard normal density:
[Figure: standard normal density curve, with –1, 0 and 1 marked on the horizontal axis.]

Graph of Normal Density Functions
Here is a graph of two normal densities on the same scale. One has μ = 0, σ = 1 and the other has μ = 0, σ = 0.5. Which is which?
[Figure: two normal density curves; horizontal axis: x from -3.00 to 3.00, vertical axis: f(x) from 0 to 0.8.]
What would the curve look like for μ = 0, σ = 2?
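One way to explore these questions is to evaluate the normal density at its peak for each σ. The Python sketch below codes the density formula directly; the function name `normal_pdf` is made up for illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Normal density f(x) for mean mu and standard deviation sigma > 0."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Peak height f(mu) = 1 / (sigma * sqrt(2 * pi)) for each sigma mentioned above.
for sigma in (0.5, 1, 2):
    print(sigma, round(normal_pdf(0, 0, sigma), 3))
```

Because the total area must stay equal to 1, a smaller σ gives a taller, narrower curve and a larger σ gives a lower, wider one.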

The Empirical Rule
What the Empirical Rule says, for data that follow a normal distribution:
• About 68% of the data is within 1 σ of the mean
• About 95% of the data is within 2 σ of the mean
• Essentially all (99.7%) of the data is within 3 σ of the mean
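These percentages can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `within_k_sd` is made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal

def within_k_sd(k):
    """P(-k <= Z <= k): share of a normal population within k sigma of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```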

Normal Distribution Functions in Excel
NORMDIST(X, μ, σ, TRUE)
Returns the cumulative probability under the normal curve up to point X.
NORMINV(Probability, μ, σ)
Returns X for a given cumulative probability under the normal curve.
NORMSDIST(Z)
Returns the cumulative probability under the standard normal curve up to Z.
NORMSINV(Probability)
Returns Z for a given cumulative probability under the standard normal curve.
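For readers working outside Excel, rough Python analogues of these four functions can be sketched with `statistics.NormalDist` (standard library, Python 3.8 or later). The lowercase wrapper names are made up here to mirror the Excel names; they are not part of any library.

```python
from statistics import NormalDist

def normdist(x, mu, sigma):
    """Analogue of NORMDIST(X, mu, sigma, TRUE): cumulative probability up to x."""
    return NormalDist(mu, sigma).cdf(x)

def norminv(probability, mu, sigma):
    """Analogue of NORMINV: the x whose cumulative probability is `probability`."""
    return NormalDist(mu, sigma).inv_cdf(probability)

def normsdist(z):
    """Analogue of NORMSDIST: cumulative probability up to z, standard normal."""
    return NormalDist().cdf(z)

def normsinv(probability):
    """Analogue of NORMSINV: the z whose cumulative probability is `probability`."""
    return NormalDist().inv_cdf(probability)

print(round(normsdist(0), 4))  # 0.5
# Round trip with illustrative inputs: norminv undoes normdist.
print(round(norminv(normdist(110, 100, 15), 100, 15), 4))
```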

Using the Standard Normal Table
Z is the symbol used for the standard normal random variable.
P(Z ≤ −2.17) = Area to the left of −2.17 =
P(Z ≤ 1.03) = Area to the left of 1.03 =
P(0 ≤ Z ≤ 1.52) = Area between 0 and 1.52 =
P(−1.42 ≤ Z ≤ 0.24) = Area between −1.42 and 0.24 =
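Table lookups like these can be cross-checked in Python. The sketch below assumes `statistics.NormalDist` (Python 3.8 or later) as a stand-in for the printed table; the helper `area_between` is a made-up name.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

def area_between(a, b):
    """P(a <= Z <= b) as a difference of cumulative probabilities."""
    return Z.cdf(b) - Z.cdf(a)

print(round(Z.cdf(-2.17), 4))               # area to the left of -2.17
print(round(Z.cdf(1.03), 4))                # area to the left of 1.03
print(round(area_between(0, 1.52), 4))      # area between 0 and 1.52
print(round(area_between(-1.42, 0.24), 4))  # area between -1.42 and 0.24
```

Every answer is built from "area to the left" values, which is also how a cumulative table is used by hand.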

What proportion of the area under a standard normal density is within 1 (or 2 or 3) standard deviations of the mean?
P(−1 ≤ Z ≤ 1) =
P(−2 ≤ Z ≤ 2) =
P(−3 ≤ Z ≤ 3) =

Finding the tabled value z corresponding to an area
P(Z ≤ z) = 0.0075. Find z from the table.
z =
P(0 ≤ Z ≤ z) = 0.2611. Find z from the table.
z =
P(−z ≤ Z ≤ z) = 0.9500. Find z from the table.
z =
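Reading the table "backwards" corresponds to the inverse cumulative function. A Python sketch, assuming `statistics.NormalDist` (Python 3.8 or later) in place of the printed table; each problem is first rewritten as an area to the left of z.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

z1 = Z.inv_cdf(0.0075)          # P(Z <= z) = 0.0075
z2 = Z.inv_cdf(0.5 + 0.2611)    # P(0 <= Z <= z) = 0.2611, so P(Z <= z) = 0.7611
z3 = Z.inv_cdf(0.5 + 0.95 / 2)  # P(-z <= Z <= z) = 0.95, so P(Z <= z) = 0.975

print(round(z1, 2), round(z2, 2), round(z3, 2))
```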

Normal Transformation
A normal random variable with mean μ and standard deviation σ can be transformed to a standard normal random variable using:
Standardized number = (Number − Mean) / Standard Deviation
or, symbolically:
$$z = \frac{X - \mu}{\sigma}$$
These "z scores" change the units of a problem so we can use the Standard Normal tables.
If X is greater than μ, then the z score is ______.
If X is less than μ, then the z score is ______.
If X = μ, then the z score is ______.
Also:
$$X = \mu + z\sigma$$

Example:
I.Q.'s are normally distributed with μ = 100 and σ = 15.
What is the probability that a randomly selected person has an I.Q. over 130?
What proportion of the people have I.Q.'s between 76 and 124?
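A Python sketch for checking the I.Q. example, assuming `statistics.NormalDist` (Python 3.8 or later) in place of the Standard Normal Table. The z scores are computed first to mirror the table-based method.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Standardize first: z = (X - mu) / sigma.
z_130 = (130 - 100) / 15   # 2.0
z_76 = (76 - 100) / 15     # -1.6
z_124 = (124 - 100) / 15   # 1.6

p_over_130 = 1 - iq.cdf(130)            # P(X > 130) = P(Z > 2)
p_76_to_124 = iq.cdf(124) - iq.cdf(76)  # P(76 <= X <= 124) = P(-1.6 <= Z <= 1.6)

print(round(p_over_130, 4), round(p_76_to_124, 4))  # 0.0228 0.8904
```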

Skewed
Some histograms look less symmetrical and more skewed:
[Figure: histogram with a long tail on the right side.]
This histogram is said to be skewed to the right (long tail on the right side).

Evaluating the Normal Approximation
• How can we say that the normal distribution is a reasonable approximation of the data?
• How can data look different from a normal distribution?
– More than one mode, suggesting the data come from distinct groups
– Lack of symmetry
– Unusual extreme values
• We can identify these differences using:
– Visual inspection of the histogram (not very accurate)
– Numerical summaries like Skewness and Kurtosis
– Graphical summaries (Normal Quantile plot)

Normal Quantile Plot (NQP)
• A custom graphic for Normality.
• A plot of what we see against what we would expect to have seen, had the data come from a Normal Distribution. Particularly useful when we check the regression assumptions later.
OBSERVED vs. EXPECTED
• When the assumptions are met, observed = expected and we are on the 45 degree line.
• Departures from the 45 degree line are indicative of various departures from Normality. These include:
▪ Right skew.
▪ Left skew.
▪ Heavy tails.
The plot is created by plotting empirical quantiles against normal distribution quantiles.
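The construction of the plotted points can be sketched in Python. This is an illustration under stated assumptions, not JMP's exact procedure: the plotting position (i − 0.5)/n is one common convention, and the function name `normal_quantile_points` is made up.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Pair each sorted observation with the standard normal quantile expected at its rank."""
    n = len(data)
    z = NormalDist()
    # Plotting position (i - 0.5) / n is an assumed convention.
    expected = [z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, sorted(data)))

# The six scores visible in the first row of the GMAT table, as a tiny illustration.
scores = [610, 730, 590, 610, 680, 630]
for expected_z, observed in normal_quantile_points(scores):
    print(round(expected_z, 2), observed)
```

Plotting the observed values against the expected quantiles gives the NQP; points that bend away from a straight line signal skew or heavy tails.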

Interpretation of the NQP
The theoretical distribution (normal) is on the horizontal axis and the observed distribution is on the vertical axis. Fill the shapes up with the same amount of water at the same rate, and join up the heights of the water at given points in time.
If the observed distribution is not normal, then something other than a straight line will appear.

Linear Combination of Random Variables
Let X1 and X2 be any two independent random variables with means μ1 and μ2 and variances σ1² and σ2².
Suppose Y = aX1 + bX2. Then,
(1) The mean of Y is aμ1 + bμ2.
(2) The standard deviation of Y is √(a²σ1² + b²σ2²).
• Independent: When the value taken by one random variable does not affect the value taken by the other random variable
– e.g. Roll of two dice
• Dependent: When the value of one random variable gives us more information about the other random variable
– e.g. Height and weight of students

Linear Combination of Independent Random Variables
• Suppose Y = aX1 + bX2
• Then, the mean and variance of Y are given by:
– E[Y] = aμ1 + bμ2
– Var[Y] = a²σ1² + b²σ2²
• Suppose X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²)
• Then, the above results hold and in addition:
– Y ~ N(aμ1 + bμ2, a²σ1² + b²σ2²)
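A Python sketch of these formulas. The numerical inputs are hypothetical and the function name `linear_combination` is made up; the calculation itself assumes X1 and X2 are independent, as on the slide.

```python
from math import sqrt

def linear_combination(a, mu1, sigma1, b, mu2, sigma2):
    """Mean and standard deviation of Y = a*X1 + b*X2 for independent X1, X2."""
    mean = a * mu1 + b * mu2
    variance = a ** 2 * sigma1 ** 2 + b ** 2 * sigma2 ** 2
    return mean, sqrt(variance)

# Hypothetical inputs: X1 ~ N(10, 3^2), X2 ~ N(20, 4^2), Y = 2*X1 - X2.
mean, sd = linear_combination(2, 10, 3, -1, 20, 4)
print(mean, round(sd, 3))  # 0 7.211
```

Note that with b = −1 the means subtract, but the variances still add because b is squared: 4·9 + 1·16 = 52.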

Summary of Session 1
• What is a Random Variable?
• How to summarize a random variable?
• How to pictorially represent a random variable?
• What is the Normal Distribution and what are its properties?

Software Notes
• Open dataset > Analyze > Distribution > Select variable and click on “Y,Columns” > OK
– Get a histogram + a box plot + a numerical summary of data
– Click on the red triangle near the variable name
• Choose “Normal Quantile Plot” (for visually checking normality of the data)
• Display Options > Customize Summary Statistics (for Skewness and Kurtosis)
• Capability Analysis > Continuous Fit > Normal (to see the normal distribution fitted to the data)
• Creating a data column
– Open dataset > click on red button near “Columns” > New Column > Column Properties > Formula (now choose formula from different formula groups and other variable names)
