
Sutee Sujitparapitaya, Ph.D.

Associate Vice President for Institutional Effectiveness and Analytics
San José State University
Email: Sutee.Sujitparapitaya@sjsu.edu

Copyright © Sutee Sujitparapitaya, 2011‐2015


Data mining techniques are widely used for data analysis. While data mining may be viewed as expensive, time-consuming, and too technical to understand and apply, it is an institutional research tool used for efficiently managing and extracting data from large databases and for expediting reporting through the use of statistical algorithms.

This workshop will introduce the basic foundations of data mining and identify types of data typically found in large institutional databases, research questions to consider before mining data, and issues of data quality.
– It will also address how to mix traditional institutional research tools with data mining, and field additional questions typically posed by novices.
– The emphasis is on a beginner's (novice) perspective and on institutional research data applications.
• Describe the basic foundations of data mining from an institutional research (IR) perspective.
• Explain the principal components of IR data and research questions.
• Describe why the data mining process (CRISP-DM methodology) and primary techniques are valuable for IR.
• Describe how data quality and data selection work.
• Explain the primary features of data mining tools.
• Describe the relevant resources that are available to help with data mining projects.
• Strategic Decision Making
• Wealth Generation
• Analyzing Trends
• Security
Data Mining is a process of finding hidden trends, patterns, and relationships in data that are not immediately apparent from summarizing the data. It examines data in large databases and infers rules to
a) obtain an insight;
b) predict future behavior

For example: finding patterns in student data that explain student attrition, or identifying students who are at risk of dropping out of school.

Motivation of Data Mining:

1. There is an important need to turn data into useful information.
2. The fast-growing amount of data, collected and stored in large and numerous databases, exceeds the human ability to comprehend it without powerful tools.
3. We are drowning in data, but starving for knowledge!
Traditional Statistics (distributions, mathematics, etc.)

Machine Learning: the discipline concerned with the design and development of algorithms that give computers the ability to learn without being explicitly programmed (computer science, heuristics, and induction algorithms).

Artificial Intelligence: the study and design of intelligent agents to emulate human intelligence.

Neural Networks: a mathematical model that uses an interconnected group of artificial neurons to process information between inputs and outputs or to find patterns in data. It is an adaptive model that changes its structure during a learning phase (biological models, psychology, and engineering).
Evolutionary Step | Business Question | Enabling Technologies | Characteristics
Data Collection (1960s) | What was the number of new applications for the last five years? | Computers, tapes, disks | Retrospective, static data delivery
Data Access (1980s) | What was the number of new applications for the College of Business last March? | Relational databases, SQL, ODBC | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | What was the number of new applications for the College of Business last March? Drill down to Accounting majors. | OLAP, multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels
Data Mining (at present time) | What's likely to happen to the number of new Accounting applications next month? Why? | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery
Statistics:
Conceptual Model (Hypothesis) + Statistical Reasoning = "Proof" (Validation of Hypothesis)

Data Mining:
Data + Data Mining Algorithm based on Interestingness = Pattern Discovery (Model, Rule)
Association rule mining is a method for discovering interesting relations between variables in large databases. It produces dependency rules that predict the occurrence of an item based on occurrences of other items.
Example 1: Which products are frequently bought together by customers? (Basket Analysis)
• DataTable = Receipts x Products
• Results could be used to change the placement of products

Example 2: Which courses tend to be attended together?
• DataTable = Students x Courses
• Results could be used to avoid scheduling conflicts
Market basket analysis identifies customers' purchasing habits. It provides insight into the combination of products within a customer's "basket".

Ultimately, the purchasing insights provide the potential to create cross-sell propositions:
• Which product combinations are bought;
• When they are purchased; and
• In what sequence

Observation | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diapers, Milk
4 | Beer, Bread, Diapers, Milk
5 | Coke, Diapers, Milk

Rules Discovered:
{Milk} -> {Coke}
{Diapers, Milk} -> {Beer}
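To make the two discovered rules concrete, the sketch below scores them with support and confidence, the standard measures for association rules (the slide itself does not name them, so treat them as an added assumption). It uses the five observations above, with the first item of observation 1 taken to be Bread. The helper names are illustrative only.

```python
# The five baskets from the observation table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Coke", "Diapers", "Milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rules = [({"Milk"}, {"Coke"}), ({"Diapers", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(
        sorted(lhs), "->", sorted(rhs),
        "support=%.2f" % support(lhs | rhs, transactions),
        "confidence=%.2f" % confidence(lhs, rhs, transactions),
    )
```

Under these definitions, {Milk} -> {Coke} holds in 3 of the 4 baskets that contain Milk, and {Diapers, Milk} -> {Beer} holds in 2 of the 3 baskets that contain both Diapers and Milk.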
The government's data mining projects fall into two broad categories:
1. Subject-based data mining, which retrieves data that could help an analyst follow a lead, and
2. Pattern-based data mining, which looks for suspicious behaviors across a spread of activities.
Most data mining experts consider the former a version of traditional police work (chasing down leads), except that instead of a police officer examining a list of the phone numbers a suspect calls, a computer does it.

One subject-based data mining technique gaining traction among government practitioners and academics is called link analysis. Link analysis uses data to make connections between seemingly unconnected people or events.
Data Visualization is the study of the visual representation of data, meaning "information that has been abstracted in some schematic form."
– It refers to techniques for communicating information clearly and effectively through graphical means (e.g., creating images, diagrams, or animations).

Source: Bradbury Science Museum, Los Alamos, NM
Data + Interestingness (or Criteria) = Hidden Patterns
Interaction data
- Offers
- Results
- Context
- Click streams
- Notes

Attitudinal data
- Opinions
- Preferences
- Needs
- Desires

Descriptive data
- Attributes
- Characteristics
- Self-declared info
- (Geo)demographics

Behavioral data
- Orders
- Transactions
- Payment history
- Usage history

Source: SPSS BI
• Too many records
• Too many variables
• Complex non‐linear relationships
• Multi‐variable combination
• Proactive and prospective approach

Source: Abbot, Data Mining: Level II


Traditional IR Work:
Data file => Descriptive/Regression Analysis => Tabulations/Reports
(axis: Historical to Predictive)

Data Mining Driven IR Work:
Database => Data Mining (Visualization, Association, Clustering, Predictive Modeling) => Immediate Actions
(axis: Historical to Predictive)
Types of Interestingness
• Frequency
• Correlation
• Length of Occurrence (for sequence)
• Consistency
• Repeating/Periodicity
• Abnormal Behaviors
• Other patterns of Interestingness

Typical DBMS Approach | Data Mining Approach
What are the total applications during the last 3 years? | Which inquiries are most likely to turn into actual applications?
What is the first-year retention of the fall 2006 first-time freshmen from underrepresented minority groups? | What are the most important parameters to predict the first-year attrition for next year's entering freshmen?
How many freshmen attended the freshman orientation in November for the last 5 years? | Who is likely to enroll in the freshman orientation during the month of November?
What were the total pledges for California alumni donations last year? | Who is likely to make pledges for alumni donations?
How many "agree" and "strongly agree" responses did we receive from the 2008 student/faculty satisfaction surveys? | What are the main clusters found in student/faculty satisfaction surveys?
What do we know about our students?
DBMS Approach:
• List of students who passed the English Proficiency Exam in the spring
• Summary of student profiles for those who failed and dropped out last semester
• How many students enrolled in the Business Policy course last fall semester?

Data Mining Approach:
• What factors contribute to learning?
• Who is likely to fail or drop out by the end of their 6th year?
• What courses provide high FTES and better use of space?
• What are the course-taking patterns?
DBMS Approach:
• List all items that were sold in the last month.
• List all the items purchased by Sandy Smith.
• What were the total sales of the last month, grouped by branch?
• How many sales transactions occurred in the month of December?

Data Mining Approach:
• Which items are sold together? What items to stock?
• How to place items? What discounts to offer?
• How best to target customers to increase sales?
• Which clients are most likely to respond to my next promotional mailing, and why?
Supervised Data Mining is used when there is prior knowledge of what outcomes exist in the data.
• Classification and Prediction: describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

Unsupervised Data Mining is used when the researcher has no idea what hidden patterns there are in the vast database.
• Clustering: identifies group membership by maximizing the intraclass similarity and minimizing the interclass similarity.
• Associations and Sequences: identify relationships between events that occur at one time, and determine which things go together or which sequential patterns appear in the data.
Categorize your students (Clustering)
• Cafeteria meal planning
• Student housing planning

Predict student retention/alumni donations (Neural Nets/Regression)
• Identify high-risk students
• Estimate/predict alumni contributions
• Predict new student application rate

Group similar students (Segmentation)
• Course planning
• Academic scheduling
• Identify student preferences for clubs and social organizations

Identify courses that are taken together (Association)
• Faculty teaching load estimation
• Course planning
• Academic scheduling

Find patterns and trends over time (Sequence)
• Predict alumni donations
• Predict potential demand for library resources

Source: Thulasi Kumar, 2004

Classification and Prediction
• Decision Trees (C&RT, C5.0, CHAID, and QUEST)
• Neural Networks
• Regressions (Linear and Logistic)

Clustering
• K‐Means, TwoStep, and Kohonen SOM

Association Rule/Affinity Analysis


• Generalized Rule Induction (GRI)
• CARMA (Continuous Association Rule Mining Algorithm)
• APRIORI

• A decision tree is a tree-shaped structure that represents sets of decisions. These decisions generate rules for the classification of a dataset.
• The model predicts the value of a target variable based on several input variables.

Two primary types of decision trees:

1. Classification tree analysis is used when the predicted outcome is the class to which the data belongs.
2. Regression tree analysis is used when the predicted outcome can be considered a real number (e.g., the price of a house, or a patient's length of stay in a hospital).

Advantages:
• Fast
• Simple to understand and interpret
• Validation using statistical tests

Disadvantages:
• Inherently unstable
• Can become large and complex
Dependent Variable:
• Target classification is "should we play baseball?" which can be yes or no.

Input Variables:
• Weather attributes are outlook, temperature, humidity, and wind speed. They can have the following values:
  o outlook = { sunny, overcast, rain }
  o temperature = { hot, mild, cool }
  o humidity = { high, normal }
  o wind = { weak, strong }

Day  Outlook   Temperature  Humidity  Wind    Play ball
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
C5.0 (multiple split, no continuous targets) uses the C5.0 algorithm to build either a decision tree or a rule set. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain.
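As a rough illustration of "maximum information gain", the sketch below computes the entropy-based gain of each input variable on the 14-day play-ball data and picks the best first split. It is not the C5.0 implementation itself, only a minimal version of the splitting criterion the paragraph describes; the function and variable names are assumptions made for this example.

```python
from collections import Counter
from math import log2

# The 14 observations from the play-ball table:
# (outlook, temperature, humidity, wind, play ball)
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["outlook", "temperature", "humidity", "wind"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the target minus the weighted entropy after splitting on `column`."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(rows, i) for i, name in enumerate(attributes)}
best = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print("%-12s %.3f" % (name, gain))
print("first split on:", best)
```

On this data, outlook gives the largest gain, so a gain-based tree would split on outlook first and then repeat the same calculation inside each branch.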

The Classification and Regression (C&R) Tree node is a tree-based classification and prediction method. Similar to C5.0, this method uses recursive partitioning to split the training records into segments with similar output field values. (Binary split, continuous target)

QUEST, or Quick, Unbiased, Efficient Statistical Tree, is a binary classification method for building decision trees. A major motivation in its development was to reduce the processing time required for large C&RT analyses with either many variables or many cases.

CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID first examines the cross tabulations between each of the predictor variables and the outcome and tests for significance using a chi-square independence test.
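The cross-tabulation step that CHAID starts from can be sketched as follows: a Pearson chi-square statistic for one predictor (outlook) against the outcome (play ball), using the 14-day table from the decision tree example. This shows only the independence test, not CHAID's category merging or tree growing, and the function name is an assumption made for this example.

```python
from collections import Counter

# (outlook, play ball) pairs for days D1..D14 of the play-ball table.
pairs = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No"),
]


def chi_square(pairs):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    observed = Counter(pairs)                    # cell counts
    row_totals = Counter(p for p, _ in pairs)    # counts per predictor value
    col_totals = Counter(o for _, o in pairs)    # counts per outcome value
    n = len(pairs)
    statistic = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


statistic, dof = chi_square(pairs)
print("chi-square = %.3f with %d degrees of freedom" % (statistic, dof))
```

The statistic would then be compared against a chi-square distribution with the reported degrees of freedom; with only 14 records, some expected cell counts are small, so the result is illustrative rather than a reliable significance test.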
A neural network is a model that emulates the human biological neural system to solve prediction and classification problems.
– Provides solutions for linear and non-linear relationships between input and output variables.
– Does not assume any particular data distribution.
Advantages
• Has a mathematical foundation
• Robust with noisy data
• Detects relationships and trends in data that traditional methods overlook
• Can fit complex non-linear models
• Ability to detect all possible interactions between predictor variables

Disadvantages
• "Black box" nature that is not easy to analyze and interpret
• Greater computational burden
• Virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena
Linear regression is an approach to modeling the relationship between a scalar dependent variable (y) and one or more predictor variables (X).
• The case of one predictor variable is called simple regression.
• More than one predictor variable is multiple regression.

The regression equation represents a straight line or plane that minimizes the squared differences between predicted and actual output values. This is a very common statistical technique for summarizing data and making predictions: y = f(x).

Advantages:
• Available in most software
• Widely accepted statistical technique
Disadvantages:
• Not appropriate for many non-linear problems
• Must meet underlying assumptions
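For the simple (one-predictor) case, the line that minimizes the squared differences has a closed-form solution. The sketch below fits it in plain Python; the data (high school GPA versus first-year GPA) are hypothetical values invented for this example, and the function name is likewise an assumption.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept makes the line
    # pass through the point of means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: high school GPA (x) and first-year college GPA (y).
hs_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
first_year_gpa = [1.8, 2.4, 2.7, 3.3, 3.6]

intercept, slope = fit_simple_regression(hs_gpa, first_year_gpa)
predicted = intercept + slope * 3.2  # prediction for a new student with a 3.2 GPA
print("y = %.2f + %.2f * x; prediction at x=3.2: %.2f" % (intercept, slope, predicted))
```

Multiple regression follows the same least-squares idea with more than one predictor, which is usually handled by a statistics package rather than by hand.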
Logistic regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable based on one or more predictor variables that may be either continuous or categorical.
1. Binomial or binary logistic regression refers to the instance in which the observed outcome can have only two possible types (e.g., "dead" vs. "alive", "success" vs. "failure", or "yes" vs. "no").
2. Multinomial logistic regression refers to cases where the outcome can have three or more possible types (e.g., "better" vs. "no change" vs. "worse").

For example, logistic regression might be used to predict whether a new student will graduate within 6 years, based on observed characteristics of the student (test score, age, gender, pre-school preparation, etc.).

Advantages:
• Well established statistical procedure
• Simple and easy to interpret
• Very fast to train and build
• Can be used with small sample sizes

Disadvantages:
• Strong sensitivity to outliers
• Multicollinearity
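A minimal sketch of the binary case, assuming a single standardized test score as the only predictor and a made-up graduation outcome: the model estimates the probability of graduating as a logistic (sigmoid) function of the score, fitted here by plain gradient descent. The data, learning rate, and function names are all illustrative assumptions, and statistical packages typically use a different fitting procedure.

```python
from math import exp


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) by gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus outcome
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# Hypothetical data: standardized test score and graduated-within-6-years flag.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
graduated = [0, 0, 1, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(scores, graduated)
p_low = sigmoid(b0 + b1 * -1.5)   # estimated probability for a low score
p_high = sigmoid(b0 + b1 * 1.5)   # estimated probability for a high score
print("P(graduate | score=-1.5) = %.2f" % p_low)
print("P(graduate | score=+1.5) = %.2f" % p_high)
```

The fitted coefficient on the score is positive for this data, so the estimated probability of graduating rises with the score.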
Cluster analysis is an exploratory data analysis tool (unsupervised) for solving classification problems.
• Its object is to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters.
• It is not an automatic task, but an iterative process of knowledge discovery (interactive multi-objective optimization) that involves trial and failure until the result achieves the desired properties.

Figure: The result of a cluster analysis shown as the coloring of the squares into three clusters.

Types of Clustering
• K-Means
• Two-Step
• Kohonen

Advantages: Identifies the makeup of groups in attitudinal or behavioral tests

Disadvantages: Individual group members may still differ
K-Means clustering is an algorithm that groups objects, based on their attributes/features, into K groups, where K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroid.
• Thus the purpose is to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
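The two bullets above can be sketched as a short loop: assign each observation to its nearest centroid, recompute each centroid as the mean of its members, and stop when the assignments no longer change. The version below is a minimal illustration with a naive initialization (the first k points) and hypothetical two-dimensional data; production tools use more careful initialization and multiple restarts.

```python
def k_means(points, k, iterations=100):
    """Partition `points` (tuples of numbers) into k clusters by nearest mean."""
    centroids = [points[i] for i in range(k)]  # naive start: the first k points
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no observation changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical data: two visibly separate groups of (x, y) measurements.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(points, k=2)
print("assignment:", assignment)
print("centroids:", centroids)
```

Because the result depends on the starting centroids, the analyst normally tries several values of K and several starts, which fits the iterative, trial-and-failure character of cluster analysis.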
Two-step cluster analysis is a technique that groups cases into pre-clusters that are treated as single cases. Standard hierarchical clustering is then applied to the pre-clusters in the second step.
• It is appropriate for large datasets or datasets that have a mixture of continuous and categorical variables (not interval or dichotomous).
• It processes data with a one-pass-through-the-dataset method. Therefore, it does not require a proximity table (like hierarchical classification) or an iterative process (like K-means clustering).
http://www.clustan.com
Kohonen networks are a type of neural network that performs clustering, also known as a knet or a self-organizing map.
• It seeks to describe a dataset in terms of natural clusters of cases. This type of network can be used to cluster the data set into distinct groups when you don't know what those groups are at the beginning.
• You don't even need to know the number of groups to look for. Kohonen networks start with a large number of units, and as training progresses, the units gravitate toward the natural clusters in the data.

Source: SPSS BI
Association or affinity analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals or groups. These relationships are then expressed as a collection of association rules.
• Association rules are statements in the form: if antecedent(s) then consequent(s)
• Used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers.

Types of Association
• GRI
• Apriori
• CARMA
Customer | Purchase
1 | jam
2 | milk
3 | jam
3 | bread
4 | jam
4 | bread
4 | milk

Customer | Jam | Bread | Milk
1 | T | F | F
2 | F | F | T
3 | T | T | F
4 | T | T | T
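The two layouts above hold the same purchases: one row per customer-item pair, and one row per customer with a true/false flag per item. A minimal sketch of converting the first into the second is shown below; the function name and the fixed item order are assumptions made for this example.

```python
# One (customer, purchase) pair per row, as in the first layout.
purchases = [
    (1, "jam"), (2, "milk"), (3, "jam"), (3, "bread"),
    (4, "jam"), (4, "bread"), (4, "milk"),
]
items = ["jam", "bread", "milk"]  # column order of the second layout


def to_tabular(purchases, items):
    """Return {customer: ["T"/"F" per item]} from (customer, item) pairs."""
    baskets = {}
    for customer, item in purchases:
        baskets.setdefault(customer, set()).add(item)
    return {
        customer: ["T" if item in basket else "F" for item in items]
        for customer, basket in sorted(baskets.items())
    }


table = to_tabular(purchases, items)
print("Customer", *[item.capitalize() for item in items])
for customer, flags in table.items():
    print(customer, *flags)
```

Which layout an association algorithm expects depends on the tool, so a conversion step like this is often part of data preparation.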

• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment

Source: www.crisp-dm.org
Business Understanding
• Determine Business Objectives
  – Background
  – Business Objectives
  – Business Success Criteria
• Situation Assessment
  – Inventory of Resources
  – Requirements, Assumptions, and Constraints
  – Risks and Contingencies
  – Terminology
  – Costs and Benefits
• Determine Data Mining Goal
  – Data Mining Goals
  – Data Mining Success Criteria
• Produce Project Plan
  – Project Plan
  – Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data
  – Initial Data Collection Report
• Describe Data
  – Data Description Report
• Explore Data
  – Data Exploration Report
• Verify Data Quality
  – Data Quality Report

Data Preparation
• Data Set
  – Data Set Description
• Select Data
  – Rationale for Inclusion / Exclusion
• Clean Data
  – Data Cleaning Report
• Construct Data
  – Derived Attributes
  – Generated Records
• Integrate Data
  – Merged Data
• Format Data
  – Reformatted Data

Modeling
• Select Modeling Technique
  – Modeling Technique
  – Modeling Assumptions
• Generate Test Design
  – Test Design
• Build Model
  – Parameter Settings
  – Models
  – Model Description
• Assess Model
  – Model Assessment
  – Revised Parameter Settings

Evaluation
• Evaluate Results
  – Assessment of Data Mining Results w.r.t. Business Success Criteria
  – Approved Models
• Review Process
  – Review of Process
• Determine Next Steps
  – List of Possible Actions
  – Decision

Deployment
• Plan Deployment
  – Deployment Plan
• Plan Monitoring and Maintenance
  – Monitoring and Maintenance Plan
• Produce Final Report
  – Final Report
  – Final Presentation
• Review Project
  – Experience Documentation

Source: SPSS BI
• Good data = better decisions = more profit
• Bad data = risky decisions = potential disaster
• Bad data = errors = losses
– "We cannot offer enough courses" = angry students, drop-out or transfer-out to another institution
– "You're not admitted to your intended major" = angry students and parents, lost revenue
– "We have more rooms in the dorm for new students" = bad decisions if the number of students is inflated by bad data.
Scalar refers to a quantity consisting of a single real number used to measure magnitude (size).
• Interval = Scale with a fixed and defined interval, e.g., temperature or time.
• Ordinal = Scale for ordering observations from low to high with any ties attributed to lack of measurement sensitivity, e.g., score from a questionnaire.
• Nominal with order = Scale for grouping into categories with order, e.g., mild, moderate, or severe. This can be difficult to separate from ordinal.
• Nominal without order = Scale for grouping into unique categories, e.g., eye color.
• Dichotomous = As for nominal but two categories only, e.g., male/female.

Non-Scalar contains more than one value (e.g., lists, arrays, records).
• Casewise or listwise deletion
• Pairwise deletion
• Single value substitution (by mean, median, or mode of the variable)
• Regression substitution (using values of other variables in the same row or taking the overall relationships of variables into account)
• Marking with a dummy variable
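Three of the options listed above can be sketched in a few lines: listwise deletion, single value (mean) substitution, and marking with a dummy variable. The student records and field names below are hypothetical, with None standing in for a missing value.

```python
from statistics import mean

# Hypothetical student records; None marks a missing value.
records = [
    {"id": 1, "gpa": 3.2, "units": 15},
    {"id": 2, "gpa": None, "units": 12},
    {"id": 3, "gpa": 2.8, "units": None},
    {"id": 4, "gpa": 3.6, "units": 9},
]


def listwise_delete(records, fields):
    """Drop every record that is missing any of the given fields."""
    return [r for r in records if all(r[f] is not None for f in fields)]


def mean_substitute(records, field):
    """Replace missing values of `field` with the mean of the observed values."""
    fill = mean(r[field] for r in records if r[field] is not None)
    return [dict(r, **{field: fill}) if r[field] is None else dict(r) for r in records]


def add_missing_flag(records, field):
    """Add a 0/1 dummy variable recording whether `field` was missing."""
    return [dict(r, **{field + "_missing": int(r[field] is None)}) for r in records]


complete_cases = listwise_delete(records, ["gpa", "units"])
gpa_filled = mean_substitute(records, "gpa")
flagged = add_missing_flag(records, "units")
print(len(complete_cases), "complete cases")
print("imputed gpa for id 2: %.2f" % gpa_filled[1]["gpa"])
print("units_missing flags:", [r["units_missing"] for r in flagged])
```

Listwise deletion here discards half of the records, which shows why substitution or flagging is often preferred when many variables have scattered gaps.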
• Identify outliers (Anomaly Detection Node)
• Verify distributions (Data Audit Node)
• Relationship of variables
• Predictive power of variables (Auto Data Prep Node)
• Data reduction

• Data Audit/Data Distribution Charts
• Number of variables
• Number of records
• Information content/Predictive power

A successful data mining strategy involves:
1. Make data mining models comprehensible to business users
2. Translate users' questions into a data mining problem
– Well-defined goals, project objectives, and questions
3. Ensure the use of sufficient and relevant data
4. Close the loop: identify causality, suggest actions, and measure their effect.
– Need domain expertise in institutional research to build, test, validate, and deploy models.
5. Careful consideration and selection of software and analysts (technical and domain experts)
6. Support from senior administrators (VPs and the President)
7. Cope with privacy and security issues
8. Guard against misuse of information/inaccurate information
Free Open‐source Data Mining Software and Applications:
• R
• RapidMiner
• WEKA

Commercial Data Mining Software and Applications:


• PASW Modeler (IBM)
• STATISTICA Data Miner (StatSoft)
• Enterprise Miner (SAS)
• Oracle Data Mining
• CART/MARS (Salford Systems) ‐ Low Price
• XLMiner ($199)

Information
• www.kdnuggets.com/
• www-01.ibm.com/software/analytics/spss/products/modeler
• www.educationaldatamining.org/index.html
• www.sigkdd.org/
• www.thearling.com/

Training
• www.the-modeling-agency.com
• http://web.ccsu.edu/datamining/
• www.kdnuggets.com/education/usa-canada.html
http://kdd.ics.uci.edu/
http://archive.ics.uci.edu/ml/
http://www.fedstats.gov/
http://www.census.gov/
http://nces.ed.gov/surveys/SurveyGroups.asp?group=2