
STAT 505

PROBABILITY & STATISTICS FOR ENGINEERING AND SCIENCE
PROBABILITY
 Informally, probable is one of several words applied to
uncertain events or knowledge, being closely related in
meaning to likely, risky, hazardous, and doubtful.
 Chance, odds, and bet are other words expressing similar
notions.

 The theory of probability attempts to quantify the notion of probable.

 The probability of an event always lies between 0 and 1.

 If the probability is equal to 1, the event is certain to happen; if the probability is 0, the event will never occur.
STATISTICS
 Statistics is a mathematical science pertaining to collection,
analysis, interpretation and presentation of data. It is
applicable to a wide variety of academic disciplines from the
physical and social sciences to the humanities, as well as to
business, government, medicine and industry.

 Given a collection of data, statistics may be employed to summarize or describe the data; this use is called descriptive statistics.

 EXAMPLES: optical character recognition, speech recognition, genomics, computational biology, survival analysis, statistical genetics, portfolio optimization and management, financial risk management, credit rating/scoring.
Statistical Packages
 R
 http://cran.r-project.org/bin/windows/base/
 http://stat.ethz.ch/R-manual/R-patched/doc/html/
 http://www.omegahat.org/REventLoop/man.pdf
 http://www.r-project.org/other-docs.html

 SAS
 http://v8doc.sas.com/sashtml/
Descriptive Statistics

 Can be used to summarize the data, either numerically or graphically, to describe the sample.

 Basic examples of numerical descriptions include the mean and standard deviation.

 Graphical summaries include various kinds of charts and graphs.
Pareto Charts
 A Pareto chart is a bar graph for qualitative data, with the
bars arranged in order according to frequencies.
How To Construct A Pareto Chart

 A Pareto chart can be constructed by segmenting the range of the data into groups (also called segments, bins or categories).

 For example, if your business was investigating the delay associated with processing credit card applications, you could group the data into the following categories:
 No signature
 Residential address not valid
 Non-legible handwriting
 Already a customer
 Other

 The left-side vertical axis of the Pareto chart is labeled Frequency (the number of counts for each category), the right-side vertical axis is the cumulative percentage, and the horizontal axis is labeled with the group names of your response variables.
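
The quantities on the two vertical axes can be sketched in a few lines of Python. This is only an illustration: the counts are the accidental-death frequencies from the pie chart table later in these notes, and the helper name `pareto_table` is hypothetical.

```python
# Frequencies copied from the accidental-deaths table in these notes.
counts = {
    "Motor Vehicle": 43500,
    "Falls": 12200,
    "Poison": 6400,
    "Drowning": 4600,
    "Fire": 4200,
    "Ingestion of Food/Object": 2900,
    "Firearms": 1400,
}


def pareto_table(counts):
    """Return (category, frequency, cumulative percent) rows, largest first."""
    total = sum(counts.values())
    rows = []
    running = 0
    # Sort categories by descending frequency, as the bars of a Pareto chart are.
    for name, freq in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += freq
        rows.append((name, freq, 100.0 * running / total))
    return rows


pareto_rows = pareto_table(counts)
for name, freq, cum_pct in pareto_rows:
    print(f"{name:<26}{freq:>8}{cum_pct:>8.1f}%")
```

The second column is what the left axis shows and the third is what the right axis shows; the cumulative percentage always ends at 100%.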
What Questions The Pareto Chart Answers

 What are the largest issues facing our team or business?

 What 20% of sources are causing 80% of the problems (80/20 Rule)?

 Where should we focus our efforts to achieve the greatest improvements?
Pareto Chart
[Figure: Pareto chart titled "Accidental Deaths". Vertical axis: frequency, 5,000 to 45,000. Bars, in order: Motor Vehicles, Falls, Poison, Drowning, Fire, Ingestion of Food/Object, Firearms.]
Stem-and-Leaf Plots

 A Stem-and-Leaf Plot is very useful. It can show the distribution of the data, yet not lose the actual data points.

 The Stem-and-Leaf plot is most easily explained using an example.

 Consider the following data, which represents the daily high temperature for a city over a 40-day span:

 78 76 82 75 85 82 78 74 83 90
 70 76 85 92 87 67 65 68 73 74
 83 88 86 85 92 90 82 75 69 80
 85 77 86 85 90 85 80 70 65 60
Stem-and-Leaf Plots
 The data range from 60 to 92.

 We sort the data from lowest to highest:
 60 65 65 67 68 69 70 70 73 74
 74 75 75 76 76 77 78 78 80 80
 82 82 82 83 83 85 85 85 85 85
 85 86 86 87 88 90 90 90 92 92

 Now we can create the stem and leaf graph as follows:


 Stem Leaves
 6 055789
 7 003445566788
 8 00222335555556678
 9 00022 
Stem-and-Leaf Plots
 Stem Leaves
 6 055789
 7 003445566788
 8 00222335555556678
 9 00022 

 If you look at the page sideways you can see the distribution of the data. The same rule that says you should use 5 to 20 classes in a histogram applies to a stem-and-leaf diagram. The stem-and-leaf diagram could be expanded to include more rows or condensed to include fewer rows.
HISTOGRAM

 A histogram is like a bar chart: it consists of a horizontal scale for values of the data being represented, a vertical scale for frequencies, and bars representing the frequency of each class of values.

 A relative frequency histogram has the same shape and horizontal scale as the histogram, but the vertical scale is marked with relative frequencies.

 Theoretically, each bar should be marked with the lower class boundary at the left and the upper class boundary at the right.
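
As a rough sketch of the numbers behind a histogram, the Python below tabulates frequencies and relative frequencies for the daily high temperature data from the stem-and-leaf example. The class choice (four classes of width 10 starting at 60, each including its lower boundary and excluding its upper one) is an assumption made for illustration, as is the function name `frequency_table`.

```python
# Daily high temperatures from the stem-and-leaf example.
temps = [78, 76, 82, 75, 85, 82, 78, 74, 83, 90,
         70, 76, 85, 92, 87, 67, 65, 68, 73, 74,
         83, 88, 86, 85, 92, 90, 82, 75, 69, 80,
         85, 77, 86, 85, 90, 85, 80, 70, 65, 60]


def frequency_table(values, start, width, n_classes):
    """Count values in classes [lower, upper).

    Returns rows of (lower boundary, upper boundary, frequency,
    relative frequency). Values outside all classes are not counted.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)  # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    n = len(values)
    return [(start + k * width, start + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]


hist_rows = frequency_table(temps, start=60, width=10, n_classes=4)
for lower, upper, freq, rel in hist_rows:
    # A row of '#' characters stands in for the bar.
    print(f"{lower}-{upper}: {'#' * freq} ({freq}, {rel:.3f})")
```

The frequency and relative frequency columns differ only by the factor 1/n, which is why the two kinds of histogram have the same shape.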
HISTOGRAM

 Histograms are more commonly used. They preserve some information about the shape of the data distribution and are not limited by the size of the data set.

 The purpose of a histogram is to take the data collected from a process and display it graphically, to show how the distribution of the data centers itself around the mean or main specification. From the data, the histogram will graphically show:

 The center of the data.
 The spread of the data.
 Any skewness in the data.
 The presence of outliers (product outside the specification range).
 The presence of multiple modes (or peaks) within the data.
HISTOGRAM
BOXPLOT
 In 1977, John Tukey published an efficient method for
displaying a five-number data summary. The graph is called a
boxplot and summarizes the following statistical measures:

 -median
 -upper and lower quartile
 -minimum and maximum value

 Histograms are excellent for focusing attention on key aspects of the shape of a distribution (symmetry, skewness), but they are not good tools for making comparisons among datasets. Boxplots are ideal for making comparisons.
John Wilder Tukey
 John Wilder Tukey (June 16, 1915 - July 26, 2000) was a statistician born in New Bedford, Massachusetts.
 Tukey obtained an A.B. in 1936 and an Sc.M. in 1937, both in chemistry, from Brown University, before moving to Princeton University, where he received his Ph.D. in mathematics. During World War II, Tukey worked at the Fire Control Research Office and collaborated with Samuel Wilks and William Cochran. After the war, he returned to Princeton, dividing his time between the university and AT&T Bell Laboratories.
Lottery payoffs for winning numbers for three time periods
(May 1975-March 1976, November 1976-September 1977,
and December 1980-September 1981).
Boxplots

 The median for each dataset is indicated by the black center line, and the first and third quartiles are the edges of the red area; the distance between them is the interquartile range (IQR).
 The ends of the lines extending from the box are the most extreme values that lie within 1.5 times the IQR of the upper or lower quartile. Points farther than 1.5 times the IQR from the nearest quartile are plotted individually as asterisks. These points represent potential outliers.

 In this example, the three boxplots have nearly identical median values. The IQR decreases from one time period to the next, indicating reduced variability of payoffs in the second and third periods. In addition, the extreme values are closer to the median in the later time periods.
Dot Plot
In a dot plot, each data entry is plotted, using a point, above
a horizontal axis.

Use a dot plot to display the ages of the 30 students in the statistics class.
Ages of Students
18 20 21 27 29 20
19 30 32 19 34 19
24 29 18 37 38 22
30 39 32 44 33 46
54 49 18 51 21 21
Dot Plot
[Figure: dot plot titled "Ages of Students". Horizontal axis labeled from 15 to 57 in steps of 3.]

From this graph, we can conclude that most of the values lie between 18 and 32.
Pie Chart
A pie chart is a circle that is divided into sectors that represent
categories. The area of each sector is proportional to the
frequency of each category.

Accidental Deaths in the USA in 2002

Type                      Frequency
Motor Vehicle             43,500
Falls                     12,200
Poison                    6,400
Drowning                  4,600
Fire                      4,200
Ingestion of Food/Object  2,900
Firearms                  1,400
(Source: US Dept. of Transportation)
Pie Chart
 To create a pie chart for the data, find the relative frequency (proportion) of each category.

Type                      Frequency  Relative Frequency
Motor Vehicle             43,500     0.578
Falls                     12,200     0.162
Poison                    6,400      0.085
Drowning                  4,600      0.061
Fire                      4,200      0.056
Ingestion of Food/Object  2,900      0.039
Firearms                  1,400      0.019
n = 75,200
Pie Chart
Next, find the central angle. To find the central angle, multiply the relative frequency by 360°.

Type                      Frequency  Relative Frequency  Angle
Motor Vehicle             43,500     0.578               208.2°
Falls                     12,200     0.162               58.4°
Poison                    6,400      0.085               30.6°
Drowning                  4,600      0.061               22.0°
Fire                      4,200      0.056               20.1°
Ingestion of Food/Object  2,900      0.039               13.9°
Firearms                  1,400      0.019               6.7°
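
The two columns computed above can be reproduced with a short Python sketch. The variable names are illustrative only; the frequencies are those in the table.

```python
# Frequencies from the accidental-deaths table.
counts = {
    "Motor Vehicle": 43500,
    "Falls": 12200,
    "Poison": 6400,
    "Drowning": 4600,
    "Fire": 4200,
    "Ingestion of Food/Object": 2900,
    "Firearms": 1400,
}

total = sum(counts.values())  # n, the total frequency

pie_rows = []
for name, freq in counts.items():
    rel = freq / total   # relative frequency of the category
    angle = rel * 360    # central angle of its sector, in degrees
    pie_rows.append((name, freq, rel, angle))
    print(f"{name:<26}{freq:>8}{rel:>8.3f}{angle:>8.1f}")
```

The angles are computed from the unrounded relative frequencies, which is how the table arrives at 208.2° for motor vehicles; multiplying the rounded value 0.578 by 360° would give a slightly smaller figure.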
Pie Chart
[Figure: pie chart of accidental deaths. Motor vehicles 57.8%, Falls 16.2%, Poison 8.5%, Drowning 6.1%, Fire 5.6%, Ingestion 3.9%, Firearms 1.9%.]
Time Series Chart
 A data set that is composed of quantitative data entries taken at
regular intervals over a period of time is a time series. A time
series chart is used to graph a time series.

Example:
The following table lists the number of minutes Robert used on his cell phone for the last six months. Construct a time series chart for the number of minutes used.

Month     Minutes
January   236
February  242
March     188
April     175
May       199
June      135
Time Series Chart
[Figure: time series chart titled "Robert's Cell Phone Usage". Vertical axis: Minutes, 0 to 250. Horizontal axis: Month, Jan to June.]
Quartiles and Percentiles

 Percentile: A percentile is a measure that tells us what percent of the total frequency scored at or below that measure.

 Quartiles: Quartiles are particular percentiles. They break the total of 100% into four equal parts, with cut points at 25%, 50%, and 75%.

 The median is the value in the middle of the ordered array, the lower quartile is the middle value of the half of the data below the median, and the upper quartile is the middle value of the half of the data above the median.
Quartiles and Percentiles
 i x[i]
 1 102
 2 104
 3 105 ---- the first quartile, Q1 = 105
 4 107
 5 108
 6 109 ---- the second quartile, Q2 or median = 109
 7 110
 8 112
 9 115 ---- the third quartile, Q3 = 115
 10 115
 11 118
Quartiles and Percentiles

 For this data set:
 smallest non-outlier observation = 5 (left "whisker")
 lower (first) quartile (Q1, x_0.25) = 7
 median (second quartile) (Med, x_0.5) = 8.5
 upper (third) quartile (Q3, x_0.75) = 9
 largest non-outlier observation = 10 (right "whisker")
 interquartile range, IQR = Q3 − Q1 = 2
 the value 3.5 is a "mild" outlier, between 1.5*(IQR) and 3*(IQR) below Q1
 the value 0.5 is an "extreme" outlier, more than 3*(IQR) below Q1
 the data are skewed to the left (negatively skewed)
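
The mild/extreme distinction in this summary can be sketched in Python. Only the quartiles Q1 = 7 and Q3 = 9 and the listed observations are taken from the summary; the function name `classify` and the use of strict inequalities at the fences are assumptions made for illustration.

```python
def classify(value, q1, q3):
    """Label a value using fences at 1.5*IQR and 3*IQR beyond the quartiles."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


q1, q3 = 7, 9  # quartiles from the summary above, so IQR = 2
labels = {v: classify(v, q1, q3) for v in (0.5, 3.5, 5, 10)}
for value, label in labels.items():
    print(value, label)
```

With IQR = 2 the lower fences sit at 7 − 3 = 4 and 7 − 6 = 1, so 3.5 falls between them and 0.5 falls below both.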
Quartiles and Percentiles

 +------+-+
 o * |---------| + | | -- |
 +-----+-+
 +---+---+---+---+---+---+---+---+---+---+ number line
 0 1 2 3 4 5 6 7 8 9 10
Measures of the center

 Now that we have seen how to picture data, we will explore methods of measuring characteristics of data.

 The measure we first look at is a measure of central tendency. This is a value at the center or middle of a data set.

 Consider the following example where we introduce the mean, median, mode and midrange. Here is some data:

 10 11 12 12 15 17 21 22 23 27
Measures of the center
 The mean (or arithmetic mean) is the average of these data points. To calculate the mean, add the data points and divide by the number of data points. The mean is denoted by $\bar{x}$. In our example above:
 Sum of data points: 10+11+12+12+15+17+21+22+23+27 = 170
 Number of data points = 10
 Average = 170/10 = 17

 The median is the middle value when the scores are arranged in order of increasing (or decreasing) magnitude. To calculate the median, follow this rule:
 If the number of scores is odd, the median is the number located in the exact middle of the list. If the number of scores is even, the median is the mean of the two middle numbers.
 NOTE: TO APPLY THE RULES ABOVE, THE LIST MUST BE SORTED!
Measures of the center

 In our example above, 15 and 17 are the middle numbers, so the median is (15+17)/2 = 16.

 The mode of the data set is the score that occurs most frequently. When two scores occur with the same greatest frequency, each one is a mode and the data is bimodal. If more than two scores occur with the same greatest frequency, each is a mode and the data is multimodal. When all scores occur just once there is no mode. The mode is denoted by M.
 The value 12 in the above dataset occurs most frequently and is therefore the mode.

 The midrange is simply (low value + high value)/2. In our example above this is (10+27)/2 = 37/2 = 18.5.
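
The four measures of center for this example can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not part of the original notes; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# The example data set used for the measures of center.
data = [10, 11, 12, 12, 15, 17, 21, 22, 23, 27]

data_mean = mean(data)        # sum of the values divided by their count
data_median = median(data)    # mean of the two middle values when n is even
# multimode returns every value tied for the highest frequency. Note that,
# unlike the "no mode" convention in the notes, it returns all values when
# each value occurs exactly once.
data_modes = multimode(data)
data_midrange = (min(data) + max(data)) / 2

print(data_mean, data_median, data_modes, data_midrange)
```

`median` sorts the data internally, so the reminder above about sorting applies to hand calculation rather than to this function.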
SOME MATHEMATICAL
NOTATION
Mathematicians like to have symbols to represent
complicated calculations. Here are some we will use
throughout the course:
 ∑ denotes the summation of a group of values (this
means add them all up)
 x denotes the variable, usually used to represent the
individual data values
 n represents the number of values in a sample
 N represents the number of values in a population
 $\bar{x} = \frac{\sum x}{n}$ is the mean of a sample
 $\mu = \frac{\sum x}{N}$ is the mean of a population
Measures of Variation
 Measures of central tendency give us measures of where the
middle of a set of data occurs, but this is not enough to
characterize a set of data.

 Consider the following 2 data sets:
 50 60 70 80 90
 and
 69 69 70 71 71

 Both of these data sets have a mean of 70, yet the first data set is more widely dispersed than the second. So a measure of variation is clearly needed.

 Consider the following data, which represent the actual weights (in oz) of 20 oz steaks at a restaurant. We will use this throughout this section:

 17 20 21 18 20 20 20 18 19 19
 20 19 22 20 18 20 18 19 20 19
Measures of Variation
 The range is the difference between the highest value and the lowest value in a dataset.

 To compute it, subtract the lowest value from the highest value. In the example above the range is 22 - 17 = 5.

 The range can be misleading since it does not take every value into consideration. Consider each of the following data sets:
 1 10 10 10 10
 and
 1 2 5 8 10

 Both have a range of 9, yet the values are distributed very differently: in the first data set four of the five values are identical, while in the second the values are spread across the whole range.
Measures of Variation
 A more informative measure of variation is given by the standard deviation of the data.

 The standard deviation of a set of sample scores is a measure of variation of scores about the mean. It is calculated by

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Measures of Variation

 The procedure for finding the standard deviation is as follows:

 1. Find the mean of the scores.
 2. Subtract the mean from each individual score.
 3. Square each of the values in step 2.
 4. Add up all the squares obtained in step 3.
 5. Divide the total in step 4 by n-1.
 6. Find the square root of the result of step 5.
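
The six steps can be followed literally in Python. The sketch below applies them, in order, to the steak weights introduced earlier in this section; the function name `sample_std` is hypothetical.

```python
import math

# Actual weights of the 20 oz steaks from the example data.
weights = [17, 20, 21, 18, 20, 20, 20, 18, 19, 19,
           20, 19, 22, 20, 18, 20, 18, 19, 20, 19]


def sample_std(values):
    """Sample standard deviation, computed one step at a time."""
    n = len(values)
    mean = sum(values) / n                   # step 1: mean of the scores
    deviations = [v - mean for v in values]  # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]   # step 3: square each deviation
    total = sum(squares)                     # step 4: add up the squares
    variance = total / (n - 1)               # step 5: divide by n - 1
    return math.sqrt(variance)               # step 6: take the square root


s = sample_std(weights)
print(round(s, 2))  # about 1.18 oz
```

For these data the mean is 19.35 oz and the squared deviations add up to 26.55, so s is the square root of 26.55/19.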
Measures of Variation
 The sample variance is the standard deviation squared. To calculate it, follow all the steps for the standard deviation except taking the final square root. Here is the formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Interpretation of standard deviation
 A small standard deviation means the data values are close together; a large standard deviation means the data are widely spread.

 The range rule of thumb states that for typical data sets the range is about 4 standard deviations wide, so the standard deviation is about the range divided by 4. This is a very rough estimate.

 The 68-95-99.7 rule states that about 68% of all scores fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean.

 This only works for data that is approximately bell shaped.

 The above rule tells us that data more than 2 standard deviations from the mean is unusual, while data within 2 standard deviations is ordinary.

 Chebyshev's Theorem states that at least 75% of all scores fall within 2 standard deviations of the mean and at least 89% fall within 3 standard deviations of the mean. This works for ANY distribution (not just bell shaped).
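
The 75% and 89% figures come from the general Chebyshev bound 1 − 1/k² for k standard deviations. The Python sketch below evaluates that bound and, purely as an illustration, compares it with what is observed in the steak weights from the previous section; the function names are hypothetical.

```python
import statistics

# Steak weights from the measures-of-variation example.
weights = [17, 20, 21, 18, 20, 20, 20, 18, 19, 19,
           20, 19, 22, 20, 18, 20, 18, 19, 20, 19]


def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values within k sample standard deviations of the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(abs(v - m) <= k * s for v in values) / len(values)


for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3), fraction_within(weights, k))

# Range rule of thumb: standard deviation is roughly the range divided by 4.
range_estimate = (max(weights) - min(weights)) / 4
print(range_estimate, round(statistics.stdev(weights), 2))
```

The observed fractions are allowed to exceed the Chebyshev bound, since the theorem only guarantees a minimum.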
Z-Scores
 How do we compare two different sets of data?

 Suppose you are comparing gas mileage on two separate kinds of automobiles, say light trucks and compact cars. Assume the mean for light trucks is 23.6 miles per gallon with a standard deviation of 3.6 miles per gallon, and the mean for compact cars is 28.7 miles per gallon with a standard deviation of 5.7 miles per gallon.

 You want to compare a light truck with a rating of 27.5 miles per gallon and a compact car with a rating of 31.2 miles per gallon.

 Which one is more "unusual"? To answer this we need some way to standardize the scores, so that we do not have to know what scale was being used. The way to get a standard score is the z score.
Z-Scores
 The standard score, or z-score, is the number of standard deviations that a given value x is above or below the mean. You calculate the z score using:

$$z = \frac{x - \bar{x}}{s}$$

So for the light truck described above, the z score is
z = (27.5 - 23.6)/3.6 = 1.08 standard deviations above the mean.

The z score for the compact car described above is
z = (31.2 - 28.7)/5.7 = 0.44 standard deviations above the mean.

The light truck is therefore the more unusual of the two.
Z-Scores

 Example: According to The American Freshman, the number of hours per week that college freshmen spend studying has a mean of 7.06 hours with a standard deviation of 2.32 hours. Suppose Sally Simplestudent spends 2 hours per week studying. Does Sally spend an unusually small amount of time studying?

 According to the z score, z = (2-7.06)/2.32 = -2.18, Sally is more than 2 standard deviations below the mean, so her low amount of study time is unusual.
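
The z-score calculation is easy to check in Python. In this sketch the inputs are taken from the problem statements above (each vehicle's own rating, group mean and group standard deviation, and Sally's study hours); the function names and the cutoff argument are illustrative.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


def is_unusual(z, cutoff=2):
    """Apply the rule that values more than 2 standard deviations away are unusual."""
    return abs(z) > cutoff


truck_z = z_score(27.5, mean=23.6, std=3.6)  # light truck rated 27.5 mpg
car_z = z_score(31.2, mean=28.7, std=5.7)    # compact car rated 31.2 mpg
sally_z = z_score(2, mean=7.06, std=2.32)    # Sally's weekly study hours

for label, z in [("truck", truck_z), ("car", car_z), ("Sally", sally_z)]:
    print(label, round(z, 2), is_unusual(z))
```

Because z-scores are unit-free, the truck and the car can be compared directly even though their groups have different means and spreads.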
Z-Scores
 Intuition: a z-score measures how far an individual score is from the mean, compared to the typical distance of scores in the entire distribution from the mean.
 Intuition: you can think of z-scores as simply indicating the number of standard deviations a certain data point is away from the mean.

 The Empirical Rule: for any bell-shaped, nearly symmetric distribution of data, the interval ($\bar{x}-s$, $\bar{x}+s$) contains approximately 68% of the data points, the interval ($\bar{x}-2s$, $\bar{x}+2s$) contains approximately 95% of the data points, and the interval ($\bar{x}-3s$, $\bar{x}+3s$) contains nearly all (about 99.7%) of the data points.