
Session 1

Descriptive Statistics and Probability


Distributions

• Categorical Data

• Numerical Data

• Bar Charts and Histograms

• Probability Distribution

• Discrete Random Variables

• Expected Value

• Standard Deviation

• Probability Density Function

• Normal Random Variable



Data

Data come in broadly two types: Categorical & Numerical.

Categorical variables give qualitative responses, such as the responses to:
“Did you watch TV in the last 24 hours?”
“What is your favorite candy bar?”

Numerical variables give quantitative responses, such as the responses to:
“How many brothers & sisters do you have?”
“How tall are you?”

Qualitative Data: Categories


1. Nominal Data...

Each value falls into one and only one of a set of categories.

• Hair color
• State of birth
• College major
• Political affiliation

There's no meaningful ordering of the categories of Political Affiliation, other than alphabetical:
Democrat, Independent, Other, Republican.

2. Ordinal Data...

The possible values have a meaningful ordering:

• Evaluation rating for this stat course:
(1 = awful, ... , 5 = awesome)

• Letter grades for a report in English class:
A+, A, A-, B+, B, B-, ..., F+, F, F-.

Note: Arbitrary numbers, like the 1 to 5 or 1 to 10 scales used in questionnaires, are considered categories, not numbers.

Quantitative Data: Numbers


Discrete (from counting):
• Number of customers at Burger King today.
• Number of students majoring in Business at each
of 20 different colleges.
• Number of students attending this class today.

Discrete (from rounding a continuous variable):
• Person's height to the nearest 1/2”.

Continuous (from measurement):
• Person's height (exact).
• Daily high temperature (°F) during the last year.
• Hours per week students spend studying.
• Batting averages of baseball players.
• Monthly unemployment %.

When presenting discrete data, we make a table or bar chart showing frequencies. A continuous variable has too many possible values for that, so we use a histogram, which groups the data into intervals.
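The difference between the two presentations can be sketched in a few lines of Python. The data values, the bin start of 60 and the bin width of 5 below are made up for illustration and are not from the slides.

```python
from collections import Counter

# Hypothetical discrete data: number of siblings reported by 10 students.
siblings = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]
frequency = Counter(siblings)  # one count per distinct value
for value in sorted(frequency):
    print(value, frequency[value])

# Hypothetical continuous data: heights in inches.
heights = [61.2, 64.8, 66.1, 67.5, 68.0, 69.3, 70.4, 71.9, 72.6, 74.1]
bin_start = 60.0  # assumed left edge of the first bin
bin_width = 5.0   # assumed bin width

# Assign each height to the bin [left, left + width) that contains it.
histogram = Counter(
    bin_start + bin_width * int((h - bin_start) // bin_width) for h in heights
)
for left in sorted(histogram):
    print(f"[{left}, {left + bin_width})", histogram[left])
```

The frequency table keeps every distinct value, while the histogram only keeps the count per interval, which is why it suits variables with many possible values.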

Outliers
A data value may be quite unlike the other data
values. In this case, it is permitted to remove it
(and a few others) from the data set if it helps to
show more clearly the pattern in the remaining
data.

For example, suppose all the data values are
around 100 except for a couple near 500.
Drawing a histogram may give a misleading
picture of the data. Computing an average may
give a misleading result.
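A small sketch of how a couple of extreme values can distort an average. The ten values and the cutoff of 400 are hypothetical, chosen only to match the "around 100 except for a couple near 500" description.

```python
from statistics import mean, median

# Hypothetical data: eight values near 100 plus two near 500.
values = [98, 101, 99, 102, 100, 97, 103, 100, 495, 505]

# Assumed cutoff for this illustration: treat anything above 400 as an outlier.
typical = [v for v in values if v < 400]

print("all data:        mean", mean(values), "median", median(values))
print("outliers removed: mean", mean(typical), "median", median(typical))
```

With all ten values the mean is pulled far above where most of the data sit, while the median barely moves.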

Bar Chart and Pie Chart: Categorical or Discrete Variables

[Pie chart with slices ACCT 22%, MKTG 28%, FINC 20%, MGMT 15%, INFO 15%.]

GMAT Scores of a Business School Class

610 730 590 610 . . . 680 630
640 680 540 660 . . . 610 540
690 610 520 640 . . . 720 680
610 650 660 580 . . . 600 730
710 600 760 690 . . . 500 720
610 650 660 710 . . . 480 600
630 610 680 780 . . . 700 690
530 550 730 690 . . . 670 540
630 720 610 710 . . . 600 600
690 600 730 540 . . . 560 770

Data File: GMAT.jmp



Descriptive Statistics (GMAT.jmp)


Boxplot (using GMAT.jmp)

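[Figure: boxplot of the GMAT scores, annotated with the lower quartile (25%) = 600, the median = 640, the upper quartile (75%) = 680, the interquartile range (IQR), a length of 1.5 IQR, and the outliers.]

A boxplot displays the prominent quartiles of the data along with outliers.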
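The arithmetic behind the boxplot can be sketched from the three quartiles shown on the slide. The rule used here, flagging points more than 1.5 IQR beyond the quartiles, is the common boxplot convention and is assumed rather than confirmed for this particular plot.

```python
# Quartiles read from the boxplot slide.
q1, median_score, q3 = 600, 640, 680

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # assumed outlier cutoff on the low side
upper_fence = q3 + 1.5 * iqr   # assumed outlier cutoff on the high side

print("IQR:", iqr)
print("Points below", lower_fence, "or above", upper_fence, "would be flagged")
```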

Random Variable
A Random Variable is a variable that takes on
numerical values, each associated with a probability.

EXAMPLES:

X = Number of sales made by a salesperson in a given week.

Y = Number of miles you drive your car in a given month.

Types of Random Variables:

1. Discrete
-- Possible to list all values.

2. Continuous
-- All values in a range are possible.

Discrete Random Variables


The Probability Distribution of a discrete random
variable X is a table of all its possible values x, with
their corresponding probability p(x).

Example:
Throw a die once.

x p(x)
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6

[Bar chart: Throw of one die. Vertical axis: Probability (0.00 to 0.18); horizontal axis: Result (1 to 6).]

Example:
From the coin-operated machine in front of the House
of Pancakes you tabulate the % of days that various
numbers of newspapers were sold:

Newspapers Sold (x)    Probability p(x)
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
5 0.1
6 0.1
7 0.1

[Bar chart: Probability (vertical axis, 0.00 to 0.20) against Newspapers Sold (horizontal axis, 0 to 7).]

Computing μ and σ
For Discrete Random Variables

X is a discrete random variable. Assume we know the probabilities P(X = x) for all values x. Call this p(x).

Mean (Expected Value):

$\mu = E(X) = \sum_x x \, p(x)$

Standard Deviation:

First compute the Variance:

$\sigma^2 = \sum_x (x - \mu)^2 \, p(x) = \left[ \sum_x x^2 \, p(x) \right] - \mu^2$

and then take the square root:

$\sigma = \sqrt{\left[ \sum_x x^2 \, p(x) \right] - \mu^2}$
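As a worked illustration of the mean and variance formulas (added here, not part of the original slides), take the throw of one die, where p(x) = 1/6 for x = 1, ..., 6:

```latex
\begin{align*}
\mu &= \sum_x x\,p(x) = \tfrac{1}{6}(1+2+3+4+5+6) = 3.5 \\
\sum_x x^2\,p(x) &= \tfrac{1}{6}(1+4+9+16+25+36) = \tfrac{91}{6} \\
\sigma^2 &= \tfrac{91}{6} - 3.5^2 = \tfrac{35}{12} \approx 2.92 \\
\sigma &= \sqrt{35/12} \approx 1.71
\end{align*}
```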

Example:

For the Newspaper example, find μ:

μ is the “long-run” average number of newspapers sold per day.

For the Newspaper example, find σ:

If σ = 0, then the number of newspapers sold is the same every day, but as σ gets larger, the number of newspapers sold per day becomes more unpredictable.
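One way to check a hand calculation for the Newspaper example is a short Python sketch that applies the mean and variance formulas directly to the tabulated probabilities. The variable names are chosen for this illustration.

```python
from math import sqrt

# Distribution from the newspaper example: number sold -> probability.
p = {0: 0.1, 1: 0.1, 2: 0.1, 3: 0.2, 4: 0.2, 5: 0.1, 6: 0.1, 7: 0.1}

# Mean: sum of x * p(x).
mu = sum(x * px for x, px in p.items())

# Variance: sum of x^2 * p(x), minus mu^2.
variance = sum(x**2 * px for x, px in p.items()) - mu**2

# Standard deviation: square root of the variance.
sigma = sqrt(variance)

print("mean:", round(mu, 4))
print("variance:", round(variance, 4))
print("standard deviation:", round(sigma, 4))
```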

Continuous Random Variables


Continuous random variables can be described by a
cumulative probability P(X ≤ a). But, instead of
P(X = a), we must use the...

Probability Density Function

Properties of the Density Function f(x):

(A) f(x) ≥ 0 for all x.

(B) Area under the curve f(x) is 1.

(C) P(a ≤ X ≤ b) = area under f(x) between a and b.

For continuous variables, P(a ≤ X ≤ b) is the same as P(a < X < b) because the area
of a line is equal to _______.
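The area properties can be explored numerically. The sketch below uses the standard normal density as an example f(x) and a simple trapezoidal rule; the helper name `area`, the integration limits and the step count are choices made for this illustration.

```python
from statistics import NormalDist

density = NormalDist(mu=0, sigma=1).pdf  # example density f(x)

def area(f, a, b, steps=10_000):
    """Approximate the area under f between a and b (trapezoidal rule)."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

total_area = area(density, -10, 10)  # property (B): should be close to 1
p_a_to_b = area(density, -1, 2)      # property (C): P(-1 <= X <= 2)
p_point = area(density, 1, 1)        # zero-width interval, i.e. P(X = 1)

print(total_area, p_a_to_b, p_point)
```

Because an interval of zero width contributes no area, including or excluding the endpoints a and b does not change the probability.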

EXAMPLE:
If a person's height, weight, or IQ is measured to an
arbitrary number of decimal places then we get a
continuous random variable. Its density function
would be bell-shaped:

[Figure: bell-shaped density curve.]

Normal Distribution

The most frequently used distribution in Statistics.

A normal distribution is completely determined by its mean and standard deviation.

μ = Mean

σ = Standard Deviation (σ must be > 0)

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$

To calculate probabilities, we use the...

Standard Normal Table (μ = 0, σ = 1).

The standard normal density:

[Figure: standard normal density curve, with the horizontal axis marked at –1, 0 and 1.]

Graph of Normal Density Functions


Here is a graph of two normal densities on the same
scale. One has μ = 0, σ = 1 and the other has μ = 0, σ = 0.5.
Which is which?

[Figure: two normal density curves. Vertical axis: f(x), from 0 to 0.8; horizontal axis: x, from -3.00 to 3.00.]

What would the curve look like for μ = 0, σ = 2?
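One way to reason about these questions is to compare the height of each density at its mean. The sketch below evaluates the normal density at x = 0 for the three values of σ mentioned on this slide, using Python's standard library.

```python
from statistics import NormalDist

# Height of each density at its mean (x = 0) for the sigmas on this slide.
peaks = {s: NormalDist(mu=0, sigma=s).pdf(0) for s in (0.5, 1, 2)}

for s, height in peaks.items():
    print(f"sigma = {s}: peak height {height:.3f}")
```

Halving σ doubles the peak height and doubling σ halves it, since the total area under each curve must stay equal to 1.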



The Empirical Rule

What the Empirical Rule says for approximately normal data:

• About 68% of the data is within 1σ of the mean
• About 95% of the data is within 2σ of the mean
• Essentially all (99.7%) of the data is within 3σ of the mean
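The three percentages can be checked against the areas under a normal curve. A minimal sketch using the standard normal distribution from Python's standard library:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} sigma: {area:.4f}")
```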

Normal Distribution Functions in Excel

NORMDIST(X, μ, σ, TRUE)
Returns cumulative probability under the normal
curve up to point X.

NORMINV(Probability, μ, σ)
Returns X for a given cumulative probability under the normal curve.

NORMSDIST(Z)
Returns cumulative probability under the
standard normal curve up to Z.

NORMSINV(Probability)
Returns Z for a given cumulative probability
under the standard normal curve.
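For readers working outside Excel, the same four calculations can be sketched with `statistics.NormalDist` from Python's standard library. The parameters 100 and 15 and the point 115 are hypothetical inputs chosen for illustration.

```python
from statistics import NormalDist

mu, sigma = 100, 15        # hypothetical mean and standard deviation
X = NormalDist(mu, sigma)  # general normal distribution
Z = NormalDist()           # standard normal distribution

p = X.cdf(115)      # corresponds to NORMDIST(115, 100, 15, TRUE)
x = X.inv_cdf(p)    # corresponds to NORMINV(p, 100, 15)
pz = Z.cdf(1.0)     # corresponds to NORMSDIST(1.0)
z = Z.inv_cdf(pz)   # corresponds to NORMSINV(pz)

print(p, x, pz, z)
```

Here 115 is exactly one standard deviation above the mean, so the general and standard normal calculations give the same cumulative probability.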

Using the Standard Normal Table


Z is the symbol used for the standard normal random variable

P(Z ≤ −2.17) = Area to the left of −2.17 =

P(Z ≤ 1.03) = Area to the left of 1.03 =

P(0 ≤ Z ≤ 1.52) = Area between 0 and 1.52 =

P(−1.42 ≤ Z ≤ 0.24) = Area between −1.42 and 0.24 =
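Table lookups like these can be cross-checked in software. The sketch below computes the same four areas with Python's standard library instead of the printed table; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable

left_of_neg_2_17 = Z.cdf(-2.17)                         # P(Z <= -2.17)
left_of_1_03 = Z.cdf(1.03)                              # P(Z <= 1.03)
between_0_and_1_52 = Z.cdf(1.52) - Z.cdf(0)             # P(0 <= Z <= 1.52)
between_neg_1_42_and_0_24 = Z.cdf(0.24) - Z.cdf(-1.42)  # P(-1.42 <= Z <= 0.24)

for area in (left_of_neg_2_17, left_of_1_03,
             between_0_and_1_52, between_neg_1_42_and_0_24):
    print(round(area, 4))
```

An area between two points is always the cumulative area up to the right point minus the cumulative area up to the left point.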

What proportion of the area under a standard normal
density is within 1 (or 2 or 3) standard deviations of
the mean?

P(−1 ≤ Z ≤ 1) =

P(−2 ≤ Z ≤ 2) =

P(−3 ≤ Z ≤ 3) =

Finding the tabled value z corresponding to an area

P(Z ≤ z) = 0.0075. Find z from the table.

z =

P(0 ≤ Z ≤ z) = 0.2611. Find z from the table.

z =

P(−z ≤ Z ≤ z) = 0.9500. Find z from the table.

z =
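Reading the table backwards corresponds to an inverse cumulative probability. A sketch of the three lookups using Python's standard library, where each area is first converted to a cumulative (left-tail) area:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable

# P(Z <= z) = 0.0075 is already a left-tail area.
z_left = Z.inv_cdf(0.0075)

# P(0 <= Z <= z) = 0.2611: add the 0.5 of area to the left of 0.
z_center = Z.inv_cdf(0.5 + 0.2611)

# P(-z <= Z <= z) = 0.9500: half of the area lies on each side of 0.
z_two_sided = Z.inv_cdf(0.5 + 0.9500 / 2)

print(round(z_left, 2), round(z_center, 2), round(z_two_sided, 2))
```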

Normal Transformation
A normal random variable with mean μ and standard
deviation σ can be transformed to a standard normal
random variable using:

Standardized number = (Number − Mean) / Standard Deviation

or, symbolically:

$z = \frac{X - \mu}{\sigma}$

These "z scores" change the units of a problem so we can use the Standard Normal tables.

If X is greater than μ, then the z score is ______.

If X is less than μ, then the z score is ______.

If X = μ, then the z score is ______.

Also:

$X = \mu + z\sigma$

Example:
I.Q.'s are normally distributed with μ = 100 and σ = 15.

What is the probability that a randomly selected person has an I.Q. over 130?

What proportion of the people have I.Q.'s between 76 and 124?
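A sketch of how the two I.Q. questions could be checked in Python, first standardizing to z scores and then computing the areas directly from the normal distribution with mean 100 and standard deviation 15:

```python
from statistics import NormalDist

mu, sigma = 100, 15
iq = NormalDist(mu, sigma)

# z scores for the cutoffs in the two questions.
z_130 = (130 - mu) / sigma
z_76 = (76 - mu) / sigma
z_124 = (124 - mu) / sigma

# P(I.Q. > 130) is the area to the right of 130.
p_over_130 = 1 - iq.cdf(130)

# P(76 <= I.Q. <= 124) is the area between the two cutoffs.
p_76_to_124 = iq.cdf(124) - iq.cdf(76)

print(z_130, z_76, z_124)
print(round(p_over_130, 4), round(p_76_to_124, 4))
```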

Skewed

Some histograms look less symmetrical and more skewed:

[Figure: histogram with a long right tail.]

This histogram is said to be skewed to the right (long tail on the right side).

Evaluating the Normal Approximation

• How can we say that the normal distribution is a reasonable approximation
of the data?

• How can data look different from a normal distribution?
– More than one mode, suggesting the data come from distinct groups
– Lack of symmetry
– Unusual extreme values

• We can identify these differences using:
– Visual inspection of the histogram (not very accurate)
– Numerical summaries like Skewness and Kurtosis
– Graphical summaries (Normal Quantile plot)
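The numerical summaries can be sketched as follows. This uses the simple moment-based definitions (dividing by n) and reports excess kurtosis, so a normal distribution scores 0 on both; statistical packages may apply small-sample adjustments, so their output can differ slightly. The function name and both data sets are hypothetical.

```python
from statistics import fmean

def skewness_and_excess_kurtosis(data):
    """Moment-based skewness and excess kurtosis (both 0 for a normal curve)."""
    n = len(data)
    m = fmean(data)
    m2 = sum((x - m) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - m) ** 4 for x in data) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]         # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical data with a long right tail

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```

Positive skewness indicates a longer right tail, negative skewness a longer left tail.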

Normal Quantile Plot (NQP)

• A custom graphic for Normality.


• A plot of what we see against what we would
expect to have seen, had the data come from a
Normal Distribution. Particularly useful when
we check the regression assumptions later.

OBSERVED vs. EXPECTED

• When the assumptions are met, observed = expected and we are on the 45 degree line.
• Departures from the 45 degree line are indicative
of various departures from Normality. These
include:
▪ Right skew.
▪ Left skew.
▪ Heavy tails.

The plot is created by plotting empirical quantiles
against normal distribution quantiles.
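The construction of the plot can be sketched without any plotting library by computing the pairs of points. The sample below is hypothetical, and the plotting position (i - 0.5) / n is one common convention assumed here; software packages may use a slightly different one.

```python
from statistics import NormalDist, fmean, stdev

# Hypothetical sample (not taken from GMAT.jmp).
data = [540, 600, 610, 610, 630, 640, 660, 680, 690, 730]

observed = sorted(data)  # empirical quantiles
n = len(observed)

# Normal distribution with the sample's own mean and standard deviation.
fitted = NormalDist(fmean(data), stdev(data))

# Expected i-th smallest value if the data were normal, using the
# assumed plotting position (i - 0.5) / n.
expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for e, o in zip(expected, observed):
    print(f"expected {e:7.1f}   observed {o}")
```

Plotting each observed value against its expected value gives the normal quantile plot; points close to the 45 degree line suggest the normal approximation is reasonable.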

Interpretation of the NQP


The theoretical distribution (normal) is on the horizontal axis and the observed
on the vertical axis. Fill the shapes up with the same amount of water at the
same rate, and join up the heights of the water at given points in time.

If the observed distribution is not normal, then something other than a straight line will appear.

Linear Combination of Random Variables

Let $X_1$ and $X_2$ be any two independent random variables with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$.
Suppose $Y = aX_1 + bX_2$. Then,

(1) The mean of Y is $a\mu_1 + b\mu_2$.

(2) The standard deviation of Y is $\sqrt{a^2\sigma_1^2 + b^2\sigma_2^2}$.

• Independent: When the value taken by one random variable does not affect the value taken by the other random variable
– e.g. Roll of two dice

• Dependent: When the value of one random variable gives us more information about the other random variable
– e.g. Height and weight of students

Linear Combination of Independent Random Variables

• Suppose $Y = aX_1 + bX_2$

• Then, mean and variance of Y are given by:

– $E[Y] = a\mu_1 + b\mu_2$

– $\mathrm{Var}[Y] = a^2\sigma_1^2 + b^2\sigma_2^2$

• Suppose $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$

• Then, above results hold and in addition:

– $Y \sim N(a\mu_1 + b\mu_2,\; a^2\sigma_1^2 + b^2\sigma_2^2)$
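The rules for a linear combination of independent random variables can be sketched and checked by simulation. The coefficients and parameters below are hypothetical, the function name is chosen for this illustration, and the simulation draws independent normal values with a fixed seed.

```python
from math import sqrt
import random

def linear_combination(a, mu1, sigma1, b, mu2, sigma2):
    """Mean and standard deviation of Y = a*X1 + b*X2 for independent X1, X2."""
    mean_y = a * mu1 + b * mu2
    sd_y = sqrt(a**2 * sigma1**2 + b**2 * sigma2**2)
    return mean_y, sd_y

# Hypothetical coefficients and parameters.
a, b = 2, -1
mu1, sigma1 = 10, 3
mu2, sigma2 = 4, 4
mean_y, sd_y = linear_combination(a, mu1, sigma1, b, mu2, sigma2)

# Simulation check: draw X1 and X2 independently and combine them.
rng = random.Random(0)
draws = [a * rng.gauss(mu1, sigma1) + b * rng.gauss(mu2, sigma2)
         for _ in range(100_000)]
sim_mean = sum(draws) / len(draws)
sim_sd = sqrt(sum((y - sim_mean) ** 2 for y in draws) / (len(draws) - 1))

print("formula:   ", mean_y, round(sd_y, 3))
print("simulation:", round(sim_mean, 3), round(sim_sd, 3))
```

Note that a negative coefficient lowers the mean of Y but still adds to its variance, because the coefficient is squared.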

Summary of Session 1

• What is a Random Variable?

• How to summarize a random variable?

• How to pictorially represent a random variable?

• What is the Normal Distribution and what are its properties?



Software Notes

• Open dataset > Analyze > Distribution > Select variable and click on “Y,Columns” > OK

– Get a histogram + a box plot + a numerical summary of data

– Click on the red triangle near the variable name
• Choose “Normal Quantile Plot” (for visually checking normality of the data)
• Display Options > Customize Summary Statistics (for Skewness and Kurtosis)
• Capability Analysis > Continuous Fit > Normal (to see the normal distribution fitted to the data)

• Creating a data column

– Open dataset > click on red button near “Columns” > New Column > Column Properties > Formula (now choose formula from different formula groups and other variable names)
