
Math 737/837 Six Sigma

Exploratory Data Analysis


• Introduction
• Distribution
• Tabulate
• Graph Builder
• Data Filter
• Scatterplot Matrices
• Scatterplot 3D
• Bubble Plots
• Column Switcher
• Tree Maps
• Introduction to Partition
Introduction
Data exploration is sometimes known as exploratory data analysis
(EDA) and is an important part of modern statistical analyses.
With the proliferation of large databases and automated data collection
systems, EDA is now an important part of the Define and Measure
phases of the DMAIC cycle, depending on the scenario.
A good deal of EDA, but not all, relies on visualization of data to
understand patterns of variation observed in the variables of interest
and to explore potential hidden relationships among variables.
The JMP software has very powerful data visualization capabilities.
The Analyze and Improve phases of the DMAIC cycle often employ
what are called confirmatory data analysis (CDA) methods; these
include more traditional statistical methods that we cover later.

© 2015 Philip J. Ramsey, Ph.D.


Introduction
An important step prior to EDA and CDA is the preprocessing of the
data, which includes “cleaning up” the data and getting it into a form
that is useful for analysis.
EDA often requires initial data preprocessing before one attempts to
draw conclusions or formulate hypotheses of cause and effect
relationships from data.
A general heuristic for data exploration is to begin by examining the
variables one at a time (use the Distribution platform).
Next, examine the variables two at a time (Fit Y by X, Graph Builder,
Scatterplot Matrices).
Finally, examine three or more variables at a time (Scatterplot 3D,
Bubble Plot, Tabulate, Tree Map, and Graph Builder).



Introduction
In this section we introduce some tools for working with and
exploring data:
• Distribution – Always a good starting point!
• Tabulate – For summarizing data
• Graph Builder – Drag and drop, GUI graphing interface
• Scatterplot Matrices, Scatterplot 3D, Bubble Plots
• Data Filter – Dynamically stratify data
• Tree Map – An alternative to Mosaic Plots
• Partition – Used for exploring potentially important variables.
Particularly valuable when there are many variables, nominal
variables, missing values, and messy data.
Distribution
All graphical displays and the data table are dynamically linked.
Points highlighted in one display are also highlighted in other displays
and the associated rows are selected in the data table.
One can often discover interesting relationships among variables using
the Distribution platform and the interactivity among open reports.
Use the Distribution platform to interactively explore relationships
among variables and assess the distributions.
We use the dataset Car Physical Data.jmp found in JMP Help (Help
> Sample Data).



Distribution
Select Analyze > Distribution.
Enter all of the variables except for Model into the Y, Columns list.
In the lower left corner of the launch window, check the Histograms
Only option.
The report window is shown below. All of the cars manufactured in
the USA have been selected by clicking on the USA bar in the
Country report.
Do you see any potential relationships between USA and the other
variables? Try clicking on other bars and look for relationships.
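The linked highlighting described above can be imitated outside JMP. The following is only a minimal sketch in plain Python: the rows, the column name Type, and the helper linked_counts are invented for illustration and are not taken from Car Physical Data.jmp. For each bar of a second variable, it counts how many rows would be highlighted after clicking one bar of the first.

```python
from collections import Counter

# Hypothetical rows (values and the "Type" column are invented for illustration).
rows = [
    {"Country": "USA", "Type": "Large"},
    {"Country": "USA", "Type": "Large"},
    {"Country": "USA", "Type": "Medium"},
    {"Country": "Japan", "Type": "Small"},
    {"Country": "Japan", "Type": "Medium"},
    {"Country": "Germany", "Type": "Small"},
]

def linked_counts(rows, select_col, select_value, other_col):
    """For each level of other_col, return (selected count, total count)."""
    total = Counter(row[other_col] for row in rows)
    selected = Counter(
        row[other_col] for row in rows if row[select_col] == select_value
    )
    # Counter returns 0 for levels that have no selected rows.
    return {level: (selected[level], total[level]) for level in total}

# "Click" the USA bar and see how the selection spreads across Type.
highlight = linked_counts(rows, "Country", "USA", "Type")
for level, (sel, tot) in highlight.items():
    print(f"{level}: {sel} of {tot} rows highlighted")
```

A level whose rows are all (or none) highlighted hints at a relationship between the two variables, which is what the interactive histograms let you spot by eye.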



Tabulate
Use Tabulate to interactively summarize data and construct tables of
descriptive statistics.

From an open JMP data table select Analyze > Tabulate.
Example: Car Physical Data.jmp in JMP Sample Data.
Click and drag a variable from the column list in the control panel to
the drop zone for rows or columns.
Tabulate
In the figure on the left, Country is in the rows drop zone.
The number of observations per country is displayed.
In the figure on the right, Horsepower is in the columns drop zone as
an analysis column.
The sum for horsepower is displayed for each country.



Tabulate
Click, drag and drop one or more summary statistics from the middle
panel of the control panel into the results area. In the capture
below, Mean and Std Dev are displayed for each country.
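For readers who want to check a Tabulate result by hand, the same kind of summary can be sketched in plain Python. This is an illustration only: the rows are invented (not from Car Physical Data.jmp), the helper tabulate is hypothetical, and Std Dev is assumed to be the sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical rows (invented values, not from Car Physical Data.jmp).
rows = [
    {"Country": "USA", "Horsepower": 110},
    {"Country": "USA", "Horsepower": 150},
    {"Country": "Japan", "Horsepower": 90},
    {"Country": "Japan", "Horsepower": 130},
    {"Country": "Japan", "Horsepower": 140},
]

def tabulate(rows, group_col, analysis_col):
    """Summarize analysis_col for each level of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[analysis_col])
    table = {}
    for level, values in groups.items():
        table[level] = {
            "N": len(values),
            "Sum": sum(values),
            "Mean": mean(values),
            # Sample standard deviation needs at least two values.
            "Std Dev": stdev(values) if len(values) > 1 else float("nan"),
        }
    return table

summary = tabulate(rows, "Country", "Horsepower")
for level, stats in summary.items():
    print(level, stats)
```

The grouping column plays the part of the rows drop zone, the analysis column the part of the columns drop zone, and each dictionary key corresponds to one summary statistic dragged into the table.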



Tabulate
To change a numeric format select Change Format at the bottom of
the Tabulate window and make desired changes.

Click the Undo button to reverse the last change or click Start Over
to clear the display.
To create a data table, select Make Into Data Table from the red
triangle menu.



Tabulate
Tips:
• To add an additional row or column variable, drag and drop a new
variable to either side of the current variable in the table.
• To add new summary panels to the table, drag and drop new variables
to the bottom or left of the table.
• Click and drag variables in the table to rearrange, or right click on
a variable to delete or change the format.



Graph Builder
Use Graph Builder to interactively create graphs for one or more
variables, including line plots, splines, box plots, bar charts,
histograms, mosaic plots and maps.

Example: Big Class.jmp in Help > Sample Data.



Graph Builder
From an open JMP data table select Graph > Graph Builder.
Click and drag a variable from Select Columns and drop it in the
desired drop zone.



Graph Builder
By default, Graph Builder displays data points.
If the X and Y zone variables are both continuous, a smooth spline is
displayed (to give an idea of the nature of the relationship).
Tips:
• To change the type of graphical display, click on an icon at the top
of the template.
• Or, right-click in the graph area, select Add, and select the display
type.
• To replace a variable with a new variable, drag the new variable
and drop it in the center of the drop zone.
• Multiple grouping variables can be assigned in the Group Zone to
create a graph for each combination of the grouping variables.
Graph Builder
Tips:
• More than one variable can be assigned to an X or Y zone.
• Drag a variable to either side of the current variable in the zone.
A blue ribbon will indicate where the new variable will be placed
when dropped.



Graph Builder
Other Drop Zones:
• Drop a variable in Wrap to trellis the graph horizontally and
vertically. Note: Group X will be unavailable.
• Drop a variable in Color to create a legend and color by values of
the variable.
• Drop a variable in Overlay to color and overlay graphs for each
value of the variable on one graph.
• If data has been summarized and a frequency variable exists, drag
the frequency variable to the Freq zone.
• If a column defines a physical shape, drag the column to Shape to
create a map (shape files must exist).



Graph Builder: Mapping
Mapping Example: SATByYear.jmp in Help > Sample Data
• Click and drag State to the Shape field.
• Then, drag and drop a continuous variable on the map.



Data Filter
The data filter (Rows > Data Filter) allows you to dynamically
stratify data by values of one or more filtering variables.
This is a global filter since it operates on the entire data table.
From an open data table select the data filter. Then, select a variable
of interest, and click Add.



Data Filter
A Local Data Filter is available under the Script submenu in the
main report menu for any JMP Report window.
The Local Data Filter does not change the original data table and
only applies the filtering to the current report.



Data Filter
Select a value of the filtering variable (or a range of values if the
variable is continuous). The corresponding values will be selected
in the data table and every open graph.



Data Filter
Click Show in the Data Filter. All other observations will be hidden
in the data table.
In Graph Builder, only the selected value(s) of the filter variable(s)
will display.



Data Filter
Click Include in the Data Filter. All other observations will be
excluded in the data table.
Graph Builder will automatically update and only non-excluded values
will be included in the graph and calculations.
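The difference between selecting, showing, and including can be pictured as three flags on every row. The sketch below is a conceptual model only, not JMP's implementation: the rows are invented, apply_filter is a hypothetical helper, and it assumes that hidden rows are left off plots while excluded rows are left out of calculations.

```python
# Hypothetical rows (invented values); row states are modeled as simple flags.
rows = [
    {"Country": "USA", "Horsepower": 150},
    {"Country": "USA", "Horsepower": 110},
    {"Country": "Japan", "Horsepower": 90},
    {"Country": "Germany", "Horsepower": 130},
]

def apply_filter(rows, column, keep_values, show=False, include=False):
    """Return one state dict per row: selected, hidden, excluded."""
    states = []
    for row in rows:
        match = row[column] in keep_values
        states.append({
            "selected": match,
            "hidden": show and not match,       # Show hides non-matching rows
            "excluded": include and not match,  # Include excludes non-matching rows
        })
    return states

# Filter on Country = USA with both Show and Include checked.
states = apply_filter(rows, "Country", {"USA"}, show=True, include=True)

# Assumption: hidden rows are not plotted; excluded rows are not used in statistics.
plotted = [r for r, s in zip(rows, states) if not s["hidden"]]
used = [r["Horsepower"] for r, s in zip(rows, states) if not s["excluded"]]
mean_hp = sum(used) / len(used)
print(len(plotted), "rows plotted; mean horsepower =", mean_hp)
```

Note that the original rows are never deleted; only their states change, which is why clearing the filter restores every observation.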



Scatterplot Matrices
If you have several continuous variables in your dataset, then
Scatterplot Matrix is an ideal way to quickly examine all possible
bivariate relationships among the variables.
Fit Y by X can also be used to examine bivariate relationships
between a single pair of variables.
Scatterplot Matrix can be accessed in two different ways, either from
the Analyze menu or the Graph menu.
From the Graph menu select Scatterplot Matrix.
From the Analyze menu, select Multivariate (Analyze >
Multivariate Methods > Multivariate).
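To see what "all possible bivariate relationships" means in practice, the sketch below enumerates every pair of continuous columns and computes a Pearson correlation for each pair. It is an illustration in plain Python with invented values; the column names are used only as labels and the pearson helper is written out by hand rather than taken from any JMP output.

```python
from itertools import combinations
from math import sqrt

# Hypothetical continuous columns (invented values).
columns = {
    "Weight": [1.0, 2.0, 3.0, 4.0],
    "Horsepower": [2.0, 4.0, 6.0, 8.0],
    "Turning Circle": [4.0, 3.0, 2.0, 1.0],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# One scatterplot panel per unordered pair: k variables give k*(k-1)/2 panels.
pairs = list(combinations(columns, 2))
correlations = {(a, b): pearson(columns[a], columns[b]) for a, b in pairs}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

A correlation summarizes only the linear part of a relationship, so the plots themselves remain the better tool for spotting curvature, clusters, and outliers.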



Scatterplot Matrices
Consider Car Physical Data.jmp.
Select Graph > Scatterplot Matrix.
Are there bivariate relationships between any pairs of variables?
Do you see any unusual patterns or potential outliers in the data?



Scatterplot Matrices
If there are nominal variables in the dataset, then it is often useful to
construct plots of the continuous variables stratified by the
nominal variables.
You can stratify in several ways:
• Use Data Filter or Local Data Filter and select one or more
nominal variables as the filtering variables.
• Right-click in the body of a plot and select Row Legend from
the drop-down menu. Select the stratifying nominal variable
and set options for the coloring and marking of observations in
the plot.
• Select Rows > Color and Mark by Column to obtain the Row
Legend options.
Scatterplot Matrices
Make the Scatterplot Matrix for Car
Physical Data.jmp your active plot.
Right-click in any cell.
Select Row Legend from the menu.



Scatterplot Matrices
The colors and markers stratify the continuous variables by Country.
A Legend window has been created for reference.



Scatterplot 3D
You can construct three dimensional scatterplots of continuous
variables using Graph > Scatterplot 3D.
Enter all of the continuous variables into the Y, Columns list.
You will be able to select subsets of variables in the report window.



Scatterplot 3D
By clicking and dragging within the plot, you can rotate the position
of the display.
The plot is stratified using the colors and markers imposed earlier by
Row Legend.
Right click within the plot to open a menu with additional plotting
options.
Bubble Plots
Bubble plots (Graph > Bubble Plot) provide a way to plot multiple
variables in the form of a two-dimensional scatterplot.
It is also possible to create animations if your data includes
longitudinal variables such as time.
Open the JMP sample data table SATByYear.jmp and populate the
launch dialog as shown.

Y = SAT Verbal
X = SAT Math
ID = State
Time = Year
Sizes = % Taking 1997
Coloring = Region
Bubble Plots
To see an animation of scores over time, use the animation controls.

What do we learn from these plots?
• Are Verbal and Math SAT scores related?
• What about the distribution of scores?
• Are there differences by region?
• Are there changes over time?



Bubble Plots
To display all the labels (ID = State) for the points, select All Labels
from the red triangle menu.
To label only a few points, hold the shift key and click on those
points. They remain labeled until you deselect them or select other
points.
Are the students in ND and IA simply smarter than those in other
states?
Column Switcher
Column Switcher enables you to selectively substitute variables in a
plot for other variables.
It is useful when you have several variables that you want to view in
the same way in a plot that you have constructed for a single
variable.
• As an example, you could look at many different pairs of X-Y
scatterplots using a single report.
• You can also animate your display.
Column Switcher is an option in the Script submenu of the red
triangle menu for many reports.



Column Switcher
Open the JMP sample data table Car Physical Data.jmp.
In Graph Builder, place Gas Tank Size in the X drop zone and Weight
in the Y drop zone.
Select Column Switcher from the Script submenu.



Column Switcher
In the Choose column to switch menu, select Weight. Click OK.



Column Switcher
In the Choose set of columns to switch to menu, select all columns.
Click OK and click OK again to dismiss the warning.



Column Switcher
From the Graph Builder red triangle menu, de-select Show Control
Panel.
In the Column Switcher panel, scroll through the columns.
Run the animation.



Tree Maps
Tree maps are an alternative to Mosaic Plots for displaying nominal
data.
Open ParetoDefect.jmp.
Select Graph > Tree Map.
Populate the launch window as shown and click OK.



Tree Maps
Add a second Categories variable, and add a Coloring variable:



Introduction to Partition
Partition (Analyze > Modeling > Partition) is a data mining tool,
used for data exploration and for building predictive models.
We introduce the tool here, and revisit it later.
Example: Auto Raw Data.jmp in JMP Sample Data.

• The response is Claim (Y/N) – Was there an auto insurance claim for
the policy holder?
• Factors are demographic variables, such as age class, gender,
region,…



Introduction to Partition
JMP displays a graph with a line drawn at the overall response rate.

From the Partition red triangle menu, select Display Options > Show
Split Prob.
This displays response rates.



Introduction to Partition
Click the Split button.
The original observations are split into two nodes.
The Partition plot updates as well.
• In the left node, corresponding to AgeClass = Young, the probability
that there is a claim is 0.3604.
• In the right node, corresponding to AgeClass = Elder, the probability
that there is a claim is 0.0970.
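The node probabilities reported after a split are simply response rates within each node. The sketch below illustrates that idea in plain Python under loud assumptions: the records are invented (not from Auto Raw Data.jmp), the column name Gender is hypothetical, and the rule used to pick a splitting factor (largest gap between node rates) is a stand-in chosen for simplicity, not the criterion JMP uses.

```python
# Hypothetical policy records (invented; not from Auto Raw Data.jmp).
records = [
    {"AgeClass": "Young", "Gender": "M", "Claim": "Y"},
    {"AgeClass": "Young", "Gender": "F", "Claim": "Y"},
    {"AgeClass": "Young", "Gender": "M", "Claim": "N"},
    {"AgeClass": "Young", "Gender": "F", "Claim": "N"},
    {"AgeClass": "Elder", "Gender": "M", "Claim": "Y"},
    {"AgeClass": "Elder", "Gender": "M", "Claim": "N"},
    {"AgeClass": "Elder", "Gender": "M", "Claim": "N"},
    {"AgeClass": "Elder", "Gender": "F", "Claim": "N"},
    {"AgeClass": "Elder", "Gender": "F", "Claim": "N"},
    {"AgeClass": "Elder", "Gender": "F", "Claim": "N"},
]

def claim_rate(rows):
    """Proportion of rows with Claim == 'Y' (the response rate)."""
    return sum(r["Claim"] == "Y" for r in rows) / len(rows)

def node_rates(rows, factor):
    """Split rows by the levels of factor and return the claim rate per node."""
    nodes = {}
    for r in rows:
        nodes.setdefault(r[factor], []).append(r)
    return {level: claim_rate(members) for level, members in nodes.items()}

overall = claim_rate(records)  # the line drawn at the overall response rate
rates = {f: node_rates(records, f) for f in ("AgeClass", "Gender")}

# Stand-in criterion (an assumption, not JMP's): largest gap between node rates.
best = max(rates, key=lambda f: max(rates[f].values()) - min(rates[f].values()))

print("overall rate:", overall)
print("node rates:", rates)
print("factor chosen for the first split:", best)
```

Splitting again repeats the same calculation inside each node, which is how the tree grows one split at a time.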
Introduction to Partition
Click the Split button twice more. Each AgeClass node is split
into two nodes.

What do you learn from this exploratory analysis?



Summary
We introduced a number of valuable tools for exploring data:
• Distribution – For a preliminary look at all variables
• Tabulate – A drag and drop pivot table for summarizing data
• Graph Builder – Dynamically graph data, explore many variables at
a time
• Data Filter – Dynamically stratify data
• Scatterplot Matrices, Bubble Plots, and Column Switcher – Explore
many variables at once
• Tree Maps – Explore categorical variables
• Partition – Use to identify potentially important variables when
there are many variables, nominal variables, missing values, and
messy data
Skills Practice
Let's play! If you have project data, then try using various methods
covered in this section to explore the data.
If you have no data, then use as many methods as you can to explore
the data in these files:
1. The file Personal Injury Data.jmp in the 6Sigma Class Data
folder.
2. The file Consumer Preferences.jmp in JMP Sample Data. This
data was obtained in an effort to understand consumer tooth
brushing and flossing behavior. Formulate some research questions
or hypotheses, and try to answer them graphically.
3. The file Air Traffic.jmp in JMP Sample Data. Try constructing
a map and seeing which airports are serviced by the various
carriers.