
INSE 6220 -- Week 2


Advanced Statistical Approaches to Quality

Overview of Course Contents


Statistical Methods using MATLAB
Statistical Process Control using MATLAB

Dr. A. Ben Hamza Concordia University


Contents

Probability
Distributions
Descriptive Statistics
Estimation theory
Hypothesis testing
Linear Model
Design of experiments

Contents: Probability

Example 1: Probability of success in test
Example 2: Probability of success in test 2 given that test 1 < 5.5?

Contents: Distributions

[Figure: normal density with mu=5.72, sigma=1.55. Horizontal axis: Score (0 to 10); vertical axis: Density (0 to 0.35).]

Contents: Descriptive Statistics

Test 1   Test 2
5.6      6.1
5.1      7.5
6.8      6.6
3.4      3.1
6.8      8.4
4.6      6.4
5.6      4.9
6.3      10.0
5.0      4.0
7.6
5.6
8.2
5.8

[Figure: histogram of the scores. Horizontal axis: Score (0 to 10); vertical axis: Frequency (0 to 25).]

Contents: Estimation theory

Example: What is mu and sigma?
Bias
Robustness
Confidence Interval

Contents: Hypothesis testing

Example 1: When you have less than 4.5 on test 1, you will not pass
Example 2: Average Test 1 = Average Test 2

Contents: Linear Model

[Figure: scatter plot of the test scores. Horizontal axis: Score Test 1 (0 to 10); vertical axis: Score Test 2 (1 to 10).]

Contents: Design of experiments

To improve estimate
...
To improve prediction of model


Why Study Statistics?


Decision Makers Use Statistics To:
Present and describe data and information properly
Draw conclusions about large groups of individuals or items, using information
collected from subsets of the individuals or items.
Make reliable forecasts about a computer software company
Predict the number of software defects and improve software processes

What is Data?
Data: Information coming from observations, counts,
measurements, or responses.
People who eat three daily servings of whole grains have been shown to
reduce their risk of stroke by 37%.
70% of the 1500 U.S. spinal cord injuries to minors result from vehicle
accidents, and 68% were not wearing a seatbelt.

What is Statistics?

Statistics is a way to get information from data

Data -> Statistics -> Information

Data: Facts, especially numerical facts, collected together for reference or
information.
Information: Knowledge communicated concerning some particular fact.

Statistics is a tool for creating new understanding from a set of numbers.



Example: Stats Anxiety


A Computer Science student is anxious about her/his statistics course, since s/he
heard the course is difficult. The professor provides last term's final exam marks to
the student. What can be discerned from this list of numbers?

Data -> Statistics -> Information

Data: List of last term's marks.
95
89
70
65
78
57
:

Information: New information about the statistics class, e.g. class average,
proportion of class receiving As, most frequent mark, marks distribution, etc.
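
The kind of information listed above can be computed directly from the marks. A minimal sketch, using only the six marks visible on the slide (the real list continues) and assuming, purely for illustration, that an A means a mark of 80 or higher:

```matlab
% Only the six marks shown on the slide; the full list is longer.
marks = [95 89 70 65 78 57];

classAverage = mean(marks);             % class average
middleMark   = median(marks);           % middle of the sorted marks
aCutoff      = 80;                      % assumed cutoff for an A (not from the slide)
propA        = mean(marks >= aCutoff);  % proportion of marks at or above the cutoff

fprintf('Average %.2f, median %.1f, proportion of As %.2f\n', ...
        classAverage, middleMark, propA);
```

The "most frequent mark" is left out here because no mark repeats among these six values.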

Example: Classifying Data by Type


The base prices of several vehicles are shown in the table. Which data are
qualitative data and which are quantitative data?

Solution: Classifying Data by Type

Qualitative Data: Names of vehicle models are non-numerical entries.
Quantitative Data: Base prices of vehicle models are numerical entries.

Data Sets

Population
The collection of all outcomes,
responses, measurements, or
counts that are of interest.

Sample
A subset of the population.

Branches of Statistics

Descriptive Statistics
Involves organizing, summarizing, and displaying data.
e.g. Tables, charts, averages

Inferential Statistics
Involves using sample data to draw conclusions about a population.

Descriptive Statistics

Collect data
e.g., Survey

Present data
e.g., Tables and graphs

Characterize data
e.g., Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

Inferential Statistics

Estimation
e.g., Estimate the population mean
weight using the sample mean weight
Hypothesis testing
e.g., Test the claim that the population
mean weight is 120 pounds

Drawing conclusions about a large group of individuals based on a subset of
the large group.
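
As a rough illustration of both ideas, the sketch below estimates a population mean from a sample and computes the one-sample t statistic for the claim that the mean weight is 120 pounds. The eight weights are hypothetical values invented for this example, and 2.365 is the two-sided 5% critical value of the t distribution with 7 degrees of freedom.

```matlab
weights = [118 122 125 119 121 117 124 120];  % hypothetical sample (pounds)
mu0 = 120;                                    % claimed population mean

n    = numel(weights);
xbar = mean(weights);                         % estimate of the population mean
s    = std(weights);                          % sample standard deviation (n-1)
t    = (xbar - mu0) / (s / sqrt(n));          % one-sample t statistic

tCrit  = 2.365;                               % t quantile 0.975 with 7 degrees of freedom
reject = abs(t) > tCrit;                      % true would mean rejecting the claim
fprintf('xbar = %.2f, t = %.3f, reject = %d\n', xbar, t, reject);
```

With these made-up numbers the statistic is well inside the critical value, so the claim would not be rejected at the 5% level.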

Basic Vocabulary of Statistics


VARIABLE
A variable is a characteristic of an item or individual.

DATA
Data are the different values associated with a variable.

POPULATION
A population consists of all the items or individuals about
which you want to draw a conclusion.

SAMPLE
A sample is the portion of a population selected for analysis.

PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.

STATISTIC
A statistic is a numerical measure that describes a
characteristic of a sample.
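
The difference between a parameter and a statistic can be seen in a few lines. This is only a sketch: the population of the integers 1 to 1000 and the sample size of 50 are arbitrary choices made for the illustration.

```matlab
rng(1);                                  % fix the seed so the run is repeatable
population = 1:1000;                     % hypothetical population of 1000 values
parameter  = mean(population);           % population mean (a parameter)

idx       = randperm(numel(population), 50);  % 50 distinct items chosen at random
sample    = population(idx);             % the sample
statistic = mean(sample);                % sample mean (a statistic)

fprintf('Parameter %.1f, statistic %.1f\n', parameter, statistic);
```

The parameter is fixed, while the statistic changes from sample to sample (try a different seed).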

Basic 2D Plotting in MATLAB


The simplest kind of plot is a Cartesian plot of (x,y) pairs, drawn as
symbols or connected with lines
>> x=0:0.05:10*pi;
>> y=exp(-0.1*x).*sin(x);
>> plot(x,y)
>> xlabel('X axis description')
>> ylabel('Y axis description')
>> title('Title for plot goes here')
>> legend('Legend for graph')
>> grid on

[Figure: the resulting damped sine curve, showing the title, legend, axis labels, and grid set above, plus a manually inserted text annotation.]

NOTE #1:
Reversing the x,y order, plot(y,x), swaps the two axes: the curve is
mirrored about the line y = x rather than rotated.
NOTE #2:
line(x,y) is similar to plot(x,y)
but does not have additional options

Some basic plot commands you may need:

Kinds of plots:
bar(x) creates a bar graph of the vector x. (Note also the command stairs(x))
bar(x,y) creates a bar-graph of the elements of the vector y, locating the bars
according to the vector elements of 'x'

>> x = -2.9:0.2:2.9; y = x.^2; bar(x,y)



m-function Structure
Function definition (returned variable, function name, arguments):
function volume=cylinder(radius, length)

Help comments:
% CYLINDER computes volume of circular cylinder
% given radius and length
% Use:
% vol=cylinder(radius, length)
%

Statements (no end required):
volume=pi.*radius.^2.*length;

NOTE: older MATLAB releases on Windows did not treat function names as case
sensitive; current releases are case sensitive on every platform.
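
A quick way to try the same computation without creating an m-file is an anonymous function. In this sketch the names cylinderVolume and len are made up for the illustration (len avoids shadowing the built-in length function), and the elementwise power .^ is used so that a vector of radii works.

```matlab
% Same formula as the m-function, written as an anonymous function.
cylinderVolume = @(radius, len) pi .* radius.^2 .* len;

v1 = cylinderVolume(1, 1);          % a single cylinder
v  = cylinderVolume([1 2 3], 2);    % three radii, same length
disp(v1);
disp(v);
```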



Statistics with MATLAB


Online help for the Statistics Toolbox is available from the MATLAB prompt (>>, a
double arrow), both generally (a listing of all available commands):

>> help stats


[a long list of help topics follows]

and for specific commands:

>> help disttool

[a help message on the disttool function follows]


DISTTOOL Demonstration of many probability distributions.
DISTTOOL creates interactive plots of probability distributions.
This is a demo that displays a plot of the cumulative distribution
function (cdf) or probability density function (pdf) of the distributions
in the Statistics Toolbox.

Plotting Probability Distributions


>> disttool

Probability density functions (pdf)


binopdf - Binomial density.
chi2pdf - Chi square density.
exppdf - Exponential density.
fpdf - F density.
gampdf - Gamma density.
geopdf - Geometric density.
hygepdf - Hypergeometric density.
lognpdf - Lognormal density.
mvnpdf - Multivariate normal density.
normpdf - Normal (Gaussian) density.
pdf - Density function for a specified distribution.
poisspdf - Poisson density.
tpdf - T density.
unifpdf - Uniform density.
wblpdf - Weibull density.

Example: Binomial density function


For discrete distributions, the pdf assigns a probability to each outcome.
In this context, the pdf is often called a probability mass function (pmf).
For example, the discrete binomial pdf

$$f(x) = P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n$$

assigns probability to the event of x successes in n trials of a Bernoulli
process (such as coin flipping) with probability p of success at each trial.

p = 0.2; % Probability of success for each trial
n = 10; % Number of trials
x = 0:n; % Outcomes
fx = pdf('bino',x,n,p); % Probability mass vector
bar(x,fx); % Visualize the probability distribution
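
The same probabilities can be obtained straight from the binomial formula, which is a useful check on the toolbox call. A minimal sketch in base MATLAB (no Statistics Toolbox needed), using the same p and n:

```matlab
p = 0.2;                      % probability of success for each trial
n = 10;                       % number of trials
x = 0:n;                      % outcomes

% Evaluate the binomial formula term by term.
fx = zeros(size(x));
for k = x
    fx(k+1) = nchoosek(n, k) * p^k * (1-p)^(n-k);
end

fprintf('P(X = 2) = %.4f, total probability = %.4f\n', fx(3), sum(fx));
```

For example, P(X = 2) = 45 x 0.2^2 x 0.8^8, which is about 0.3020, and the probabilities sum to 1.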

Descriptive Statistics
corrcoef - Linear correlation coefficient with confidence intervals.
cov - Covariance.
mean - Sample average (in MATLAB toolbox).
median - 50th percentile of a sample.
range - Range.
std - Standard deviation (in MATLAB toolbox).
var - Variance (in MATLAB toolbox).

Example:
>> X = [ 1 2 3 5 6 7 23 45 33 46 22]
X =
1 2 3 5 6 7 23 45 33 46 22
>> mean(X)
ans =
17.5455
>> std(X)
ans =
17.2648

Mean and Median


Examples:

A = [ 0 2 5 7 20]
B = [1 2 3
     3 3 6
     4 6 8
     4 7 7];

Mean:
mean(A) = 6.8
mean(B) = 3.0 4.5 6.0 (column-wise mean)
mean(B,2) = 2.0
4.0
6.0
6.0 (row-wise mean)

Median:
median(A) = 5
median(B) = 3.5 4.5 6.5 (column-wise median)
median(B,2) = 2.0
3.0
6.0
7.0 (row-wise median)
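
These values can be reproduced with a short script. The sketch assumes B is the 4-by-3 matrix with rows [1 2 3], [3 3 6], [4 6 8], and [4 7 7]; the variable names for the results are chosen for the illustration.

```matlab
A = [0 2 5 7 20];
B = [1 2 3; 3 3 6; 4 6 8; 4 7 7];

colMean = mean(B);       % 1-by-3 row: one mean per column
rowMean = mean(B, 2);    % 4-by-1 column: one mean per row
colMed  = median(B);     % column-wise median
rowMed  = median(B, 2);  % row-wise median

disp(mean(A)); disp(median(A));
disp(colMean); disp(rowMean');
disp(colMed);  disp(rowMed');
```

Note that the second argument selects the dimension to operate along: 1 (the default for matrices) works down each column, 2 works across each row.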

Standard Deviation and Variance

Standard deviation is calculated using the std() function
std(X) : Calculate the standard deviation of vector X
If X is a matrix, std() will return the standard deviation of each column
Variance (defined as the square of the standard deviation) is calculated using the var() function
var(X) : Calculate the variance of vector X
If X is a matrix, var() will return the variance of each column
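
A short check of these statements, reusing the vector from the earlier descriptive statistics example. The matrix M and the variable names are made up for the illustration; by default both functions normalize by n-1.

```matlab
X = [1 2 3 5 6 7 23 45 33 46 22];   % vector from the earlier example

s = std(X);                          % sample standard deviation (divides by n-1)
v = var(X);                          % sample variance, equal to s^2
sByHand = sqrt(sum((X - mean(X)).^2) / (numel(X) - 1));

M = [X' 2*X'];                       % two columns; the second is the first doubled
colStd = std(M);                     % one standard deviation per column
colVar = var(M);                     % one variance per column
fprintf('std = %.4f, var = %.4f\n', s, v);
```

Doubling the data doubles the standard deviation and quadruples the variance, which is what the second column of M shows.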

Descriptive Statistics
Example: The function displaytable.m is posted on the course website
>> X = rand(9,9); %generates 9x9 random matrix
>> displaytable(cov(X)); % plots the covariance matrix of X
>> displaytable(corrcoef(X)); % plots the correlation matrix of X

Data Correlations

% Compute sample correlation
r = corrcoef([var1,var2])

r = 1.0000 0.7051
    0.7051 1.0000

[Figure: scatter plot of the two variables. Horizontal axis: Variable 2 (-2 to 2); vertical axis: Variable 1 (-3 to 2).]
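
The data behind the slide's figure are not given, so the sketch below uses two short hypothetical vectors to show what corrcoef returns and how the off-diagonal entry relates to the covariance and standard deviations.

```matlab
var1 = [1 2 3 4 5]';     % hypothetical data
var2 = [2 4 5 4 5]';

r = corrcoef([var1, var2]);    % 2-by-2 matrix; r(1,2) is the correlation

% Same quantity from the definition: covariance over the product of std devs.
c = cov(var1, var2);
rByHand = c(1,2) / (std(var1) * std(var2));
fprintf('r = %.4f\n', r(1,2));
```

The diagonal entries are always 1 (each variable is perfectly correlated with itself), and the matrix is symmetric.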

Statistical Plotting
andrewsplot - Andrews plot for multivariate data.
biplot - Biplot of variable/factor coefficients and scores.
boxplot - Boxplots of a data matrix (one per column).
cdfplot - Plot of empirical cumulative distribution function (cdf).
fsurfht - Interactive contour plot of a function.
glyphplot - Plot stars or Chernoff faces for multivariate data.
gplotmatrix - Matrix of scatter plots grouped by a common variable.
gscatter - Scatter plot of two variables grouped by a third.
hist - Histogram (in MATLAB toolbox).
hist3 - Three-dimensional histogram of bivariate data.
normplot - Normal probability plot.
parallelcoords - Parallel coordinates plot for multivariate data.
probplot - Probability plot.
surfht - Interactive contour plot of a data grid.
wblplot - Weibull probability plot.

Statistical Plotting using MATLAB


Create a Pareto chart from data measuring the
number of manufactured parts rejected for
various types of defects.

>> defects = {'pits';'cracks';'holes';'dents'};


>> quantity = [5 3 19 25];
>> pareto(quantity,defects);

boxplot(X) produces a box and whisker plot for each column of the matrix X.
The box has lines at the lower quartile, median, and upper quartile values.
The whiskers are lines extending from each end of the box to show the extent
of the rest of the data. Outliers are data with values beyond the ends of the
whiskers.

>> load parts


>> boxplot(runout);

Statistical Plotting using MATLAB


Pareto charts display the values in the vector Y as bars drawn in descending order.
Values in Y must be nonnegative and not include NaNs. Only the first 95% of the
cumulative distribution is displayed.
Examine the cumulative productivity of a group of programmers to see how normal its
distribution is:

>> codelines = [200 120 555 608 1024 101 57 687];


>> coders = {'Travis','Arash','Emad','Waleed','Farshad','Khaled','Mohamed','Maggie'};
>> pareto(codelines, coders)
>> title('Lines of Code by Student')
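
The numbers behind a Pareto chart are a descending sort and a running cumulative percentage. A minimal sketch of that computation for the same data (this is not how pareto is implemented internally, only the arithmetic it displays):

```matlab
codelines = [200 120 555 608 1024 101 57 687];
coders = {'Travis','Arash','Emad','Waleed','Farshad','Khaled','Mohamed','Maggie'};

[sortedLines, order] = sort(codelines, 'descend');        % bars in descending order
cumPercent = 100 * cumsum(sortedLines) / sum(codelines);  % cumulative line of the chart

for k = 1:numel(sortedLines)
    fprintf('%-8s %5d %6.1f%%\n', coders{order(k)}, sortedLines(k), cumPercent(k));
end
```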

Multivariate Statistical Plotting using MATLAB


Scatter plots in 2D and 3D

>> load carsmall


>> X = [Acceleration Displacement Horsepower MPG Weight];
>> scatter(X(:,2),X(:,3),'.');
>> scatter3(X(:,1),X(:,2),X(:,3),'.');

3D histogram

>> hist3([X(:,1),X(:,2)]);

Multivariate Statistical Plotting using MATLAB


>> load carbig
>> X = [MPG,Acceleration,Displacement,Weight,Horsepower];
>> varNames = {'MPG'; 'Acceleration'; 'Displacement'; 'Weight'; 'Horsepower'};
>> gplotmatrix(X,[],Cylinders,['c' 'b' 'm' 'g' 'r'],[],[],false);
>> text([.08 .24 .43 .66 .83], repmat(-.1,1,5), varNames, 'FontSize',8);
>> text(repmat(-.12,1,5), [.86 .62 .41 .25 .02], varNames, 'FontSize',8, 'Rotation',90);

The points in each scatterplot are color-coded by the number of cylinders:
blue for 4 cylinders, green for 6, and red for 8. There is also a handful of
5 cylinder cars, and rotary-engined cars are listed as having 3 cylinders.
This array of plots makes it easy to pick out patterns in the relationships
between pairs of variables. However, there may be important patterns in
higher dimensions, and those are not easy to recognize in this plot.

Statistical Plotting
normplot: Normal probability plot for graphical normality test.

Generate a normal sample and a normal probability plot of the data.


>> x = normrnd(0,1,50,1);
>> h = normplot(x);

[Figure: Normal Probability Plot. Horizontal axis: Data (-1.5 to 1.5); vertical axis: Probability (0.01 to 0.99, on a normal probability scale).]

The plot is linear, indicating that you can model the sample by a
normal distribution.

Multivariate Gaussian distributions

A multivariate Gaussian (or normal) distribution is an n-dimensional
extension of a univariate Gaussian. In a single dimension a normal
distribution is the familiar bell-shaped curve. In two dimensions each
variable is itself normally distributed. If the two variables are
independent, the points tend to cluster as a circular cloud; if they are
correlated, the cloud forms an ellipse. This can be extended to any number
of dimensions.
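
The circular and elliptical clouds can be generated in base MATLAB. This is a sketch under stated assumptions: unit variances, a correlation of 0.8, and 5000 points are arbitrary choices, and the Cholesky factor is one common way (not the only one) to impose a covariance on independent normals.

```matlab
rng(0);                          % repeatable random numbers
n     = 5000;                    % number of points (arbitrary choice)
rho   = 0.8;                     % assumed correlation between the two variables
Sigma = [1 rho; rho 1];          % covariance matrix with unit variances

R = chol(Sigma);                 % upper triangular factor, R'*R equals Sigma
Z = randn(n, 2);                 % independent standard normals: circular cloud
X = Z * R;                       % correlated normals: elliptical cloud

rz = corrcoef(Z);
rx = corrcoef(X);
fprintf('Independent r = %.3f, correlated r = %.3f\n', rz(1,2), rx(1,2));
% scatter(Z(:,1), Z(:,2), '.') and scatter(X(:,1), X(:,2), '.') show the two shapes.
```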

Statistical Process Control (SPC)


Statistical process control (SPC) refers to a number of different methods for monitoring
and assessing the quality of manufactured goods. Combined with methods from the
Design of Experiments, SPC is used in programs that define, measure, analyze,
improve, and control development and production processes. These programs are
often implemented using "Design for Six Sigma" methodologies.

capable - Capability indices.


capaplot - Capability plot.
ewmaplot - Exponentially weighted moving average plot.
histfit - Histogram with superimposed normal density.
normspec - Plot normal density between specification limits.
schart - S chart for monitoring variability.
xbarplot - Xbar chart for monitoring the mean.

Plot normal density between specification limits


normspec(specs,mu,sigma) plots the normal density between a lower and upper limit defined
by the two elements of the vector specs, where mu and sigma are the parameters of the
plotted normal distribution.
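Example:
Suppose a cereal manufacturer produces 10-ounce boxes of corn flakes. Variability in the
process of filling each box with flakes causes a 1.25 ounce standard deviation in the true
weight of the cereal in each box. The average box of cereal has 11.5 ounces of flakes.
What percentage of boxes will have less than 10 ounces?

>> normspec([10 20],11.5,1.25)

[Figure: normal density with the area between the limits shaded, titled "Probability Between Limits is 0.88493". Horizontal axis: Critical Value (6 to 20); vertical axis: Density (0 to 0.35).]

The upper limit of 20 ounces lies 6.8 standard deviations above the mean, so essentially
all of the remaining probability is below 10 ounces: 1 - 0.88493 = 0.11507, or about
11.5% of boxes.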
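
The probabilities in this example can be checked without the toolbox by writing the standard normal cdf in terms of erfc. A minimal sketch; the helper name Phi is chosen for the illustration.

```matlab
mu = 11.5; sigma = 1.25;                      % process mean and standard deviation
Phi = @(z) 0.5 * erfc(-z / sqrt(2));          % standard normal cdf via erfc

pBelow   = Phi((10 - mu) / sigma);            % P(weight < 10)
pBetween = Phi((20 - mu) / sigma) - pBelow;   % P(10 < weight < 20)
fprintf('P(<10) = %.5f, P(10..20) = %.5f\n', pBelow, pBetween);
```

The second value matches the 0.88493 reported by normspec, and the first is the fraction of underfilled boxes, about 0.115.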

Control Charts
A control chart displays measurements of process samples over time. The measurements
are plotted together with user-defined specification limits and process-defined control
limits. The process can then be compared with its specifications to see if it is in control or
out of control.
The chart is just a monitoring tool. Control activity might occur if the chart indicates an
undesirable, systematic change in the process. The control chart is used to discover the
variation, so that the process can be adjusted to reduce it.

Xbar or mean
Standard deviation
Range
Exponentially weighted moving average
Individual observation
Moving range of individual observations
Moving average of individual observations
Proportion defective
Number of defectives
Defects per unit
Count of defects
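
To make the idea of process-defined control limits concrete, the sketch below computes 3-sigma limits for an Xbar chart by hand. All measurements are hypothetical, and the pooled within-subgroup standard deviation is only one of several estimators in use; the toolbox charting functions may estimate sigma differently.

```matlab
% Hypothetical measurements: each row is one subgroup of 4 parts.
reference = [10.1  9.9 10.0 10.2;
              9.8 10.0 10.1  9.9;
             10.0 10.3  9.9 10.0];
newGroup  = [12.0 12.1 11.9 12.2];     % a later subgroup to be checked

n        = size(reference, 2);          % subgroup size
xbar     = mean(reference, 2);          % subgroup means
center   = mean(xbar);                  % center line
sigmaHat = sqrt(mean(var(reference, 0, 2)));  % pooled within-subgroup std (assumed estimator)

ucl = center + 3 * sigmaHat / sqrt(n);  % upper control limit
lcl = center - 3 * sigmaHat / sqrt(n);  % lower control limit

outRef = xbar < lcl | xbar > ucl;       % reference subgroups outside the limits
outNew = mean(newGroup) < lcl || mean(newGroup) > ucl;
fprintf('LCL %.3f, center %.3f, UCL %.3f, new subgroup out of control: %d\n', ...
        lcl, center, ucl, outNew);
```

The limits are estimated from the reference subgroups only, so a later shift in the process mean shows up as a point outside the limits instead of widening them.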