
Ten Data Analysis Tools You Can’t Afford to be Without

Neil W. Polhemus, CTO, StatPoint Technologies, Inc.
George H. Dyson, Director of Six Sigma Services
Copyright © 2010 by StatPoint Technologies, Inc.

Business Improvement Objectives
  

Businesses must create value. This output must be greater than the inputs needed to produce it. If the output meets the customers’ needs, the business is effective. If the business creates added value with minimum resources, the business is efficient.

The Role of Six Sigma is to Help a Business Produce the Maximum Value While Using Minimum Resources
(Pyzdek 2003)


Six Sigma Business Successes
• Cost reductions
• Productivity improvements
• Market-share growth
• Customer relations improvements
• Defect reductions
• Culture changes
• Product and service improvements
• Cycle-time reductions
(CSSBB Primer, 2001) / (Pande, 2000)

All these successes have a common thread…

DATA!!!
• Data drives analysis.
• Analysis uses statistical models & tools of all kinds:
   • To extract meaningful information
   • To uncover signals in the presence of noise
   • To understand the past
   • To monitor the present
   • To forecast the future
• This understanding results in reduced cost and saving money.
• Deming: “Doesn’t anyone care about Profit?”

Examples
• Product comparisons
• Survey analysis
• Distribution fitting
• Comparison of multiple samples
• Outlier detection
• Curve fitting
• Response surface modeling
• Time series forecasting
• Event rate modeling
• Interactive maps

Problem #1 – Product Comparisons
• Consumer Reports 2010: Sedans (Family, Upscale, Luxury)
• 7 variables
   • Price (dollars)
   • Road-test score (0 – 100)
   • Predicted reliability (1 – 5)
   • Owner satisfaction (1 – 5)
   • Owner cost (1 – 5)
   • Safety (1 – 5)
   • Fuel economy (overall mpg)

Data file: cars.sgd (n=30)

Multivariate Visualization
The data consist of 7 quantitative variables plus one categorical factor. Each car is thus a point in 7-dimensional space with one other feature. How can we visualize what’s going on?
1. 2-D scatterplot
2. 3-D scatterplot
3. Scatterplot matrix
4. Bubble chart
5. Parallel coordinates plot
6. Star glyphs
7. Chernoff faces
8. Radar/spider plot

2-D Scatterplot
Useful for plotting 2 dimensions.
[Figure: Plot of MPG vs Price (X 1000), points coded by Class: Family, Luxury, Upscale]

3-D Scatterplot
Useful for plotting 3 dimensions.
[Figure: Plot of MPG vs Price (X 1000) and Road Test, points coded by Class: Family, Luxury, Upscale; labeled points: Toyota Camry Hybrid, Nissan Altima Hybrid]

Bubble Chart
Uses size of bubble to illustrate a third dimension.
[Figure: Bubble Chart for Reliability; MPG vs Price (X 1000), points coded by Class: Family, Luxury, Upscale]

Scatterplot Matrix
Plots all pairs of variables.
[Figure: scatterplot matrix of Price, Road Test, Reliability, Owner Satisfaction, Owner Cost, Safety, MPG, points coded by Class: Family, Luxury, Upscale]

Parallel Coordinates Plot
Each case is shown as a line connecting the values of the variables.
[Figure: Parallel Coordinates Plot; vertical axis from 0 to 1; variables: Price, Road Test, Reliability, Owner Satisfaction, Owner Cost, Safety, MPG; lines coded by Class: Family, Luxury, Upscale]
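A parallel coordinates plot needs all variables on a common vertical scale. The sketch below assumes a simple min-max rescaling to the 0 to 1 range, which matches the axis on the slide but may not be the exact scaling the software applies. The car values are hypothetical and are not taken from cars.sgd.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; place it at 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values for three cars (illustration only).
cars = {
    "price": [24000, 34000, 64000],
    "mpg": [35, 26, 17],
}

# Scale each variable separately so that every axis runs from 0 to 1.
scaled = {name: min_max_scale(column) for name, column in cars.items()}

# Each car becomes one line: its scaled value on each axis, in order.
lines = list(zip(*scaled.values()))
for line in lines:
    print(line)
```

Each tuple in `lines` holds the heights at which one car's line crosses the successive axes.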

Star Glyphs
Each case is shown as a polygon with vertices scaled by the variables.
[Figure: star glyph legend with axes 1/Price, Owner Cost, MPG, Safety, Road Test, Owner Satisfaction, Reliability]

Star Glyphs
Variables are scaled so that larger area is better.
[Figure: one star glyph per car, coded by Class: Family, Luxury, Upscale. Cars shown: Nissan Altima, Honda Accord, Toyota Camry XLE, Volkswagen Passat, Toyota Camry Hybrid, Hyundai Sonata, Chevrolet Malibu, Ford Fusion, Mercury Milan, Nissan Altima Hybrid, Ford Taurus, Kia Optima, Chevrolet Impala, Lexus ES, Toyota Avalon, Acura TL, Hyundai Azera, Lincoln MKZ, Buick Lucerne, Volvo S60, Infiniti M35 AWD, Infiniti M35, Audi A6, Acura RL, BMW 535i, Cadillac STS, Cadillac DTS, Lexus GS 300, Lexus GS 450 Hybrid, Volvo S80]

Chernoff Faces
Each variable is assigned to a different feature of the faces.
[Figure: one Chernoff face per car for the 30 sedans, coded by Class: Family, Luxury, Upscale]

Feature Assignments
Unfortunately, some features have a greater impact than others.

Radar/Spider Plot
Good for comparing a small number of cases.
[Figure: Radar/Spider Plot for four models: Volkswagen Passat, Chevrolet Impala, Lincoln MKZ, Cadillac STS. Axes: MPG (15.0-25.0), Owner Satisfaction (0.0-5.0), Road Test (50.0-100.0), Price (25000.0-55000.0), Reliability (0.0-5.0), Owner Cost (0.0-5.0), Safety (0.0-5.0)]

Problem #2: Survey Analysis
• The most commonly used statistical procedure is the calculation of a two-way table (a tabulation of responses that can be classified in 2 ways).
• Such crosstabulations result in contingency tables that provide much useful information.
• Example: On January 18-19, 2010, Rasmussen Reports asked 1000 likely voters: “How would you rate the U.S. healthcare system?”

Data file: healthcare.sgd

Mosaic Plot
Scales the area of each bar according to the counts in the table.

Chi-square Test
Tests for lack of independence between row and column classification.
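As a rough illustration of what the chi-square test computes, the sketch below derives the test statistic and its degrees of freedom from a two-way table of counts. The counts are hypothetical and are not the Rasmussen survey results. Converting the statistic to a P-value requires the chi-square distribution, which is left out here.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table of survey counts (illustration only).
counts = [
    [30, 20],
    [20, 30],
]
stat, dof = chi_square_statistic(counts)
print(stat, dof)
```

A large statistic relative to its degrees of freedom points to dependence between the row and column classifications.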

Correspondence Analysis
Used to help visualize the important information in two-way tables.

Problem #3: Distribution Fitting
• In many studies, determining the distribution of a quantitative variable is critical.
• Common examples covered in Six Sigma include capability studies.
• Distribution fitting is also quite critical in many design problems.

Data file: waves.sgd (n=26,304)
Source: www.iahr.net

Frequency Histogram
Shows the number of observations in non-overlapping intervals.
[Figure: Histogram; x-axis: Height, y-axis: frequency]

Normal Distribution
The normal distribution is a poor model for this data.
[Figure: Histogram for Height with fitted Normal distribution; x-axis: Height, y-axis: frequency (X 1000)]

Comparison of Distributions
Fits many distributions and sorts them by goodness of fit.


Lognormal Distribution
The 3-parameter lognormal distribution is much better.
[Figure: Histogram for Height with fitted distributions: Largest Extreme Value, Lognormal (3-Parameter), Normal; x-axis: Height, y-axis: frequency]

Problem #4: Multiple Samples
• Data are frequently obtained from more than one sample.
• Asserting a significant difference between the samples (or lack thereof) is an important application of data analysis.

Data file: thickness.sgd (n=480)

Box-and-Whisker Plots
A very useful plot for comparing samples (from John Tukey).
[Figure: annotated box-and-whisker plot of Length, labeled from top to bottom:
• Maximum Observation (within 1.5 x IQR above 75th)
• 75th Percentile
• Median (50th Percentile)
• 25th Percentile
• Minimum Observation (within 1.5 x IQR below 25th)
• Outlier (>1.5 x IQR below 25th Percentile, or >1.5 x IQR above 75th Percentile)
The IQR is the distance between the 25th and 75th percentiles.]
* Plus sign may be added to show sample mean.
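The quantities in a box-and-whisker plot can be computed directly. The sketch below uses Python's `statistics.quantiles` with the `inclusive` method. Statistical packages differ in their quartile conventions, so the exact values may not match what the plotting software reports. The sample is hypothetical and is not taken from thickness.sgd.

```python
import statistics


def box_plot_summary(sample):
    """Return the quantities drawn in a Tukey box-and-whisker plot."""
    q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers extend to the most extreme observations inside the fences.
    inside = [x for x in sample if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in sample if x < low_fence or x > high_fence],
    }


# Hypothetical length measurements (illustration only).
lengths = [55, 56, 57, 58, 59, 60, 61, 70]
summary = box_plot_summary(lengths)
print(summary)
```

In this example the value 70 lies more than 1.5 x IQR above the 75th percentile, so it is flagged as an outlier and the upper whisker stops at 61.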

Notched Box-and-Whisker Plots
Non-overlapping notches indicate significantly different medians.

HSD Intervals
Allow pairwise comparison of all level means.

Problem #5: Outlier Detection
• Many data sets contain aberrant observations that don’t come from the same distribution as the others.
• Identifying outliers and treating them separately often results in better models.

Data file: bodytemp.sgd (n=130)

Outlier Plot
Shows each data value with lines at 1, 2, 3 and 4-sigma.

Grubbs’ Test
Small P-value indicates that the extreme Studentized deviate (ESD) is highly unusual.
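The extreme Studentized deviate is the largest absolute deviation from the sample mean, divided by the sample standard deviation. The sketch below computes only that statistic. The P-value, which depends on the t distribution, is omitted. The temperatures are hypothetical and are not taken from bodytemp.sgd.

```python
import statistics


def extreme_studentized_deviate(sample):
    """Return (most extreme value, its Studentized deviate)."""
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    suspect = max(sample, key=lambda x: abs(x - mean))
    return suspect, abs(suspect - mean) / sd


# Hypothetical body temperatures in degrees Fahrenheit (illustration only).
temps = [98.0, 98.2, 98.4, 98.6, 98.8, 100.8]
suspect, esd = extreme_studentized_deviate(temps)
print(suspect, round(esd, 3))
```

Grubbs' test then asks how unusual a deviate of this size would be in a normal sample of the same size.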

Problem #6: Curve Fitting
• A common data analysis problem involves determining the relationship between a response variable Y and a predictor variable X.
• If we can estimate a model where Y = f(X), then we can use that model to make predictions.

Data file: chlorine.sgd (n=44)

Simple Linear Regression
Fits a linear model of the form Y = mX + b.
[Figure: Plot of Fitted Model, chlorine = 0.48551 - 0.00271679*weeks; x-axis: weeks, y-axis: chlorine; with 95% confidence limits for the mean and 95% prediction limits for new observations]
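The slope m and intercept b of a simple linear regression come from the usual least squares formulas. The sketch below is a minimal version using hypothetical data, not the observations in chlorine.sgd, and it does not compute confidence or prediction limits.

```python
def fit_line(xs, ys):
    """Least squares fit of Y = m*X + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical chlorine readings by week (illustration only).
weeks = [10, 20, 30, 40]
chlorine = [0.46, 0.43, 0.40, 0.37]

m, b = fit_line(weeks, chlorine)
print(f"chlorine = {b:.4f} + ({m:.5f})*weeks")

# Use the fitted model to predict the response at a new X.
predicted_at_25 = m * 25 + b
```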

Comparison of Alternative Models
Fits many transformable nonlinear models and sorts by R-Squared.

Reciprocal X Model
A nonlinear model of the form Y = m/X + b is much better.
[Figure: Plot of Fitted Model, chlorine = 0.368053 + 1.02553/weeks; x-axis: weeks, y-axis: chlorine]
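A model of the form Y = m/X + b is "transformable": replacing X with 1/X turns it into a straight line, so ordinary least squares still applies. The sketch below assumes that approach and uses hypothetical data generated from Y = 0.4 + 1/X, not the observations in chlorine.sgd.

```python
def fit_line(xs, ys):
    """Least squares fit of Y = m*X + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def fit_reciprocal_x(xs, ys):
    """Fit Y = m/X + b by regressing Y on the transformed predictor 1/X."""
    return fit_line([1.0 / x for x in xs], ys)


# Hypothetical data lying exactly on Y = 0.4 + 1/X (illustration only).
xs = [2, 4, 5, 10]
ys = [0.9, 0.65, 0.6, 0.5]

m, b = fit_reciprocal_x(xs, ys)
print(f"Y = {b:.4f} + {m:.4f}/X")
```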

Problem #7: Response Surfaces
• The MISSION: air dominance at the lowest possible price.
• Use optimization models, built from performance data, to design the best aircraft engine.
• Note: the data have been altered and are for demonstration purposes only.

Problem Statement
• Optimize 3 response variables:
   • Minimize total fleet acquisition cost (Y1)
   • Maximize climb rate (Y2)
   • Maximize launch rate (Y3)
• Input factors:
   • X1: Fan Pressure Ratio: 3.9 to 4.7
   • X2: Overall Pressure Ratio: 34 to 40
   • X3: Inlet airflow: 240 to 270 pps

Data file: engines.sgx (n=584)

Design Plot
Shows the location of the historical data within the factor space.
[Figure: Experimental Region; axes: FPR (3.9 to 4.7), OPR (34 to 40), Air Flow (240 to 270)]

Standardized Pareto Charts
Show the significant factors affecting each response.
[Figure: Standardized Pareto Charts for Y1 Cost, Y2 Ps, and Y3 Launch; effects shown for A:FPR, B:OPR, C:Air Flow, their interactions (AB, AC, BC) and quadratic terms (AA, BB, CC); x-axis: Standardized effect]

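Desirability Function
Quantifies the desirability of a joint response (Y1, Y2, Y3).
For responses to be minimized: [Figure: desirability function for a minimized response]
For responses to be maximized: [Figure: desirability function for a maximized response]
Combined desirability: D=d(Y1)*d(Y2)*d(Y3)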
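The individual desirability functions are not legible in this copy of the slides, so the sketch below assumes simple linear ramps between a low and a high value for each response, which is one common choice and not necessarily the one used in the original. Only the combined form D=d(Y1)*d(Y2)*d(Y3) comes from the slide. All response values and limits are hypothetical.

```python
import math


def desirability_maximize(y, low, high):
    """Assumed linear ramp: 0 at or below `low`, 1 at or above `high`."""
    if y <= low:
        return 0.0
    if y >= high:
        return 1.0
    return (y - low) / (high - low)


def desirability_minimize(y, low, high):
    """Assumed linear ramp: 1 at or below `low`, 0 at or above `high`."""
    if y <= low:
        return 1.0
    if y >= high:
        return 0.0
    return (high - y) / (high - low)


def combined_desirability(d_values):
    """Product of the individual desirabilities, as on the slide."""
    return math.prod(d_values)


# Hypothetical responses and limits (illustration only).
d1 = desirability_minimize(60.0, low=50.0, high=100.0)   # cost, Y1
d2 = desirability_maximize(75.0, low=50.0, high=100.0)   # climb rate, Y2
d3 = desirability_maximize(100.0, low=0.0, high=200.0)   # launch rate, Y3
D = combined_desirability([d1, d2, d3])
print(d1, d2, d3, D)
```

Because D is a product, any single response with zero desirability drives the combined desirability to zero.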

Optimal Conditions
Found at the levels shown below:
[Figure: optimization output]

Response Surface
Shows the estimated desirability throughout the experimental region.
[Figure: Desirability Plot; estimated Desirability (0 to 1) as a function of FPR and OPR at a fixed Air Flow]

Problem #8: Time Series Data
• Data recorded at equally spaced points in time is called a time series.
• Time series models are used for various purposes:
   • Analysis of trends and seasonal effects
   • Forecasting
   • Control
• Autocorrelation between adjacent observations requires special models.

Data file: customers.sgd (n=168)

Time Sequence Plot
Plots the data versus time.
[Figure: Time Series Plot for Customers (X 1000); x-axis: Month, 1/96 through 1/11]

Autocorrelation Function
Estimates the correlation between observations at different lags.
[Figure: Estimated Autocorrelations for Customers; x-axis: lag (0 to 36), y-axis: Autocorrelations (-1 to 1)]
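The lag-k autocorrelation compares each observation with the one k periods later. The sketch below uses the standard sample estimator, in which the lagged cross-products are divided by the total sum of squares about the mean. The series is hypothetical, with a built-in period of 4, and is not the data in customers.sgd.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical series that repeats every 4 observations (illustration only).
series = [1, 2, 3, 4] * 3

r2 = autocorrelation(series, 2)  # half a cycle apart: negative correlation
r4 = autocorrelation(series, 4)  # one full cycle apart: positive correlation
print(r2, r4)
```

For seasonal data, peaks in the autocorrelation function at multiples of the seasonal period (12 for monthly data) are the signature to look for.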

Seasonal Decomposition
Shows the average value during each season (scaled to 100).
[Figure: Seasonal Index Plot for Customers; x-axis: season (1 to 12), y-axis: seasonal index]

Seasonal Subseries Plot
Shows the seasonal averages and trend within each season.
[Figure: Seasonal Subseries Plot for Customers (X 1000); x-axis: Season]

Annual Subseries Plot
Shows the seasonal effect separately for each cycle.
[Figure: Annual Subseries Plot for Customers (X 1000); x-axis: Season; one line per Cycle, 1996 through 2009]

Seasonally Adjusted Data
Removes the seasonal effects from the data.
[Figure: Seasonally Adjusted Data Plot for Customers (X 1000); x-axis: Month, 1/96 through 1/11]

Automatic Forecasting
Fits many models and automatically selects the best.

ARIMA Model
Selected model is a rather complicated seasonal ARIMA model.
[Figure: Time Sequence Plot for Customers (X 1000), ARIMA(2,1,1)x(1,1,2)12; shows actual values, forecast, and 95.0% limits; x-axis: Month, 1/96 through 1/12]

Problem #9: Event Rate Modeling
• When the data to be analyzed consist of the time at which events occur, the process by which those events are generated is called a point process.
• The most common type of point process is a Poisson process, in which the times between events are independent and follow a negative exponential distribution.
• If the event rate is constant, the process is called homogeneous. If the event rate changes over time, then the process is called nonhomogeneous.

Data file: earthquakes.sgd (n=52)

Plot of Magnitude Versus Date
Shows an apparent increase in magnitude over the sampling period.
[Figure: Plot of Magnitude vs Date; x-axis: Date, 1/1/08 through 12/31/10]

Point Process Plot
Shows the dates of occurrence only.
[Figure: Events Plot for Date (Earthquake); x-axis: Time or distance, 1/1/08 through 12/31/10]

Event Rate
• The critical parameter in a point process is the rate of events per unit time (such as earthquakes per year). This parameter is usually called λ.
• The rate parameter is also related to the mean time between events, with MTBE = 1 / λ.
• λ may be constant (a homogeneous process) or vary over time (a nonhomogeneous process).
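For a homogeneous process, the natural estimate of the event rate is the number of events divided by the length of the observation window, and the mean time between events is its reciprocal. The sketch below assumes a constant rate. The event times are hypothetical and are not the dates in earthquakes.sgd.

```python
def event_rate(event_times, start, end):
    """Return (events per unit time, mean time between events).

    Assumes a homogeneous process observed over the window [start, end].
    """
    rate = len(event_times) / (end - start)
    mtbe = 1.0 / rate
    return rate, mtbe


# Hypothetical event times in years since the start of sampling.
times = [0.3, 0.9, 1.2, 1.8, 2.1, 2.7, 3.4, 3.8]

rate, mtbe = event_rate(times, start=0.0, end=4.0)
print(f"rate = {rate} events per year, MTBE = {mtbe} years")
```

If the rate changes over time, a single estimate like this only describes the average over the window, which is why a trend test is worth running first.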

Cumulative Events Plot
The slope of the line is related to the event rate.
[Figure: Cumulative Events Plot for Date; x-axis: Time or distance, 1/1/08 through 12/31/10; y-axis: Number of events; mean line shown]

Trend Test
A small P-value would indicate a significant trend.

Other Uses
Point process models are also very useful for estimating failure rates.
[Figure: Events Plot for Aircraft 1 through Aircraft 13; x-axis: Time or distance, 0 to 2500]

Between Group Comparisons
Tests can be made to determine whether there are significant differences.
[Figure: Cumulative Events Plot for Aircraft 1 through Aircraft 13; x-axis: Time or distance, 0 to 2500; y-axis: Number of events; mean line shown]

Problem #10: Interactive Maps
• Graphics that allow the user to interact with the data are extremely useful.
• Maps are one important example.

Data file: census2000.sgd (n=51)

Map Statlet
The slider changes the cutoff between the red and blue states.
[Figure: U.S. map]