Fundamental Data Mining in Institutional Research Workshop
This workshop will introduce the basic foundations of data mining and identify
types of data typically found in large institutional databases, research
questions to consider before mining data, and issues of data quality.
– It will also address how to mix traditional institutional research tools
with data mining, and field additional questions typically posed by
novices.
– The workshop takes a beginner's (novice) perspective, with an emphasis
on institutional research data applications.
• Describe the basic foundations of data mining from an
institutional research (IR) perspective.
• Explain the principal components of IR data and research
questions
• Describe why the data mining process (CRISP‐DM methodology)
and primary techniques are valuable for IR
• Describe how data quality and data selection work
• Explain the primary features of data mining tools
• Describe the relevant resources that are available to help with
data mining projects
Strategic Decision Making
Wealth Generation
Traditional Statistics (distributions, mathematics, etc.)
Machine Learning: the discipline concerned with the design and
development of algorithms that give computers the ability to learn
without being explicitly programmed. (Computer science, heuristics
and induction algorithms)
Artificial Intelligence: the study and design of intelligent agents to emulate
human intelligence.
Neural Networks: a mathematical model that uses an interconnected group of
artificial neurons to process information between inputs and outputs or to
find patterns in data. It is an adaptive model that changes its structure
during a learning phase. (Biological models, psychology and engineering)
Evolutionary Step | Business Question | Enabling Technologies | Characteristics
Statistics
Conceptual Model (Hypothesis) + Statistical Reasoning = “Proof” (Validation of Hypothesis)
Data Mining
Data + Data Mining Algorithm based on Interestingness = Pattern Discovery (Model, Rule)
Association rules are a method for discovering interesting relations
between variables in large databases. The method produces dependency rules that
predict the occurrence of an item based on occurrences of other items.
Example 1: Which products are frequently bought together by customers?
(Basket Analysis)
• DataTable = Receipts x Products
• Results could be used to change the placements of products
Ultimately, the purchasing insights provide the potential to create cross‐sell
propositions:
• Which product combinations are bought
• When they are purchased; and
• In what sequence
Observation  Items
1            Bread, Coke, Milk
2            Beer, Bread
3            Beer, Coke, Diapers, Milk
4            Beer, Bread, Diapers, Milk
5            Coke, Diapers, Milk
Rules Discovered:
{Milk} → {Coke}
{Diapers, Milk} → {Beer}
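How strongly the five observations back each discovered rule can be checked with two standard measures, support and confidence. The following is a minimal sketch in Python, not the workshop's own tooling; the function names are illustrative, and the first item of observation 1 is assumed to be "Bread".

```python
# The five observations from the basket example.
# Assumption: the first item of observation 1 is "Bread".
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Coke", "Diapers", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diapers", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "->", sorted(rhs),
          "support=%.2f" % support(lhs | rhs, transactions),
          "confidence=%.2f" % confidence(lhs, rhs, transactions))
```

Under these definitions, {Milk} → {Coke} holds in 3 of the 4 observations that contain Milk, and {Diapers, Milk} → {Beer} holds in 2 of the 3 observations that contain both Diapers and Milk.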
The government's data mining projects fall into two broad categories:
1. Subject‐based data mining, which retrieves data that could help an analyst
follow a lead, and
2. Pattern‐based data mining, which looks for suspicious behaviors across a
spread of activities.
Most data mining experts consider the former a version of traditional police
work (chasing down leads), except that instead of a police officer examining a
list of phone numbers a suspect calls, a computer does it.
One subject‐based data mining technique gaining traction among
government practitioners and academics is called link analysis. Link
analysis uses data to make connections between seemingly
unconnected people or events.
Data Visualization is the study of the visual representation of data, meaning
"information that has been abstracted in some schematic form."
– It refers to techniques used to communicate information clearly and
effectively through graphical means (e.g., creating images,
diagrams, or animations).
Interaction data     Attitudinal data
- Offers             - Opinions
- Results            - Preferences
- Context            - Needs
- Click streams      - Desires
- Notes
Source: SPSS BI
• Too many records
• Too many variables
• Complex non‐linear relationships
• Multi‐variable combination
• Proactive and prospective approach
Historical Predictive
Types of Interestingness
• Frequency
• Correlation
• Length of Occurrence (for sequence)
• Consistency
• Repeating/Periodicity
• Abnormal Behaviors
• Other patterns of Interestingness
Typical DBMS Approach Data Mining Approach
What do we know about our students?
DBMS Approach:
• List of students who passed the English Proficiency Exam in the spring
• Summary of student profiles for those who failed and dropped
out last semester
• How many students enrolled in the Business Policy course last fall
semester?
DBMS Approach:
• List all items that were sold in the last month
• List all the items purchased by Sandy Smith
• The total sales of the last month, grouped by branch
• How many sales transactions occurred in the month of December?
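Questions of this kind map directly onto database queries, which is what makes them a DBMS task rather than a data mining task. The following is a small sketch using Python's built-in sqlite3 module; the sales table, its columns, and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical sales table; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(customer TEXT, item TEXT, branch TEXT, sale_date TEXT, amount REAL)"
)
rows = [
    ("Sandy Smith", "notebook", "North", "2012-12-03", 4.50),
    ("Sandy Smith", "pen", "North", "2012-12-03", 1.25),
    ("Lee Wong", "backpack", "South", "2012-12-15", 32.00),
    ("Lee Wong", "pen", "South", "2012-11-20", 1.25),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# "List all the items purchased by Sandy Smith"
items = [r[0] for r in conn.execute(
    "SELECT item FROM sales WHERE customer = ? ORDER BY item",
    ("Sandy Smith",))]

# "Total sales grouped by branch" (over all rows in this toy table)
totals = dict(conn.execute(
    "SELECT branch, SUM(amount) FROM sales GROUP BY branch"))

# "How many sales transactions occurred in the month of December?"
december = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE strftime('%m', sale_date) = '12'"
).fetchone()[0]

print(items, totals, december)
```

Each query answers a question whose form is known in advance. None of them searches for patterns the analyst did not already think to ask about.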
Supervised Data Mining applies when the outcomes that exist in the data are
known in advance.
• Classification and Prediction describe and distinguish data classes or
concepts, so that the model can be used to predict the
class of objects whose class label is unknown.
Unsupervised Data Mining is used when the researcher has no idea what
hidden patterns there are in the vast database.
• Clustering identifies group membership
by maximizing the intraclass similarity and minimizing the
interclass similarity.
• Associations and Sequences identify relationships between events
that occur at one time, determining which things go together, or
sequential patterns in data.
Categorize your students: Clustering
Identify courses that are taken together: Association
• Cafeteria meal planning
• Student housing planning
• Faculty teaching load estimation
• Course planning
• Academic scheduling
Clustering
• K‐Means, TwoStep, and Kohonen SOM
A decision tree is a tree‐shaped structure that represents sets of decisions. These
decisions generate rules for the classification of a dataset.
The model predicts the value of a target variable based on several input
variables.
Advantages:
• Fast
• Simple to understand and interpret
• Validation using statistical tests
Disadvantages:
• Inherently unstable
• Can become large and complex
Dependent Variable:
• Target classification is "should we play baseball?" which can be yes or no.
Input Variables:
• Weather attributes are outlook, temperature, humidity, and wind speed.
They can have the following values:
o outlook = { sunny, overcast, rain }
o temperature = { hot, mild, cool }
o humidity = { high, normal }
o wind = { weak, strong }

Day  Outlook   Temperature  Humidity  Wind    Play ball
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
C5.0 (Multiple split, no continuous targets) uses the C5.0 algorithm to build either
a decision tree or a rule set. A C5.0 model works by splitting the sample
based on the field that provides the maximum information gain.
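The idea of choosing the split field by information gain can be illustrated on the 14 play-ball observations. The sketch below computes plain entropy-based information gain for each input variable. It is not the C5.0 implementation, which includes refinements not shown here, and the function names are illustrative.

```python
from collections import Counter
from math import log2

# The 14 play-ball observations: (outlook, temperature, humidity, wind, play).
rows = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
attributes = ["outlook", "temperature", "humidity", "wind"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy of the target minus the weighted entropy after splitting on one column."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[index], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(rows, i) for i, name in enumerate(attributes)}
best = max(gains, key=gains.get)
for name in attributes:
    print("%-12s %.3f" % (name, gains[name]))
print("first split:", best)
```

On this table the outlook field yields the largest gain, so a gain-based tree would split on it first and then repeat the calculation within each branch.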
CHAID, or Chi-squared Automatic Interaction Detection, is a classification
method for building decision trees by using chi-square statistics to identify
optimal splits. CHAID first examines the cross tabulations between each of
the predictor variables and the outcome and tests for significance using a
chi-square independence test.
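The cross-tabulation step can be sketched for one predictor. The example below computes the Pearson chi-square statistic and its degrees of freedom for outlook versus the play-ball outcome, using the 14 observations from the decision tree example. It covers only this one step. A full CHAID procedure also merges categories and adjusts significance levels, which is not shown, and the function name is illustrative.

```python
from collections import Counter

# (outlook, play ball) for days D1..D14 of the play-ball table.
pairs = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rain", "yes"),
    ("rain", "yes"), ("rain", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rain", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rain", "no"),
]


def chi_square(pairs):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    observed = Counter(pairs)                      # cell counts
    row_totals = Counter(p for p, _ in pairs)      # predictor categories
    col_totals = Counter(o for _, o in pairs)      # outcome categories
    n = len(pairs)
    stat = 0.0
    for r, r_n in row_totals.items():
        for c, c_n in col_totals.items():
            expected = r_n * c_n / n               # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


stat, dof = chi_square(pairs)
print("chi-square = %.3f, degrees of freedom = %d" % (stat, dof))
```

The statistic would then be compared against the chi-square distribution with the given degrees of freedom to judge significance. With only 14 rows the expected counts are small, so the result here is illustrative rather than reliable.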
A neural network is a model that emulates the human biological neural system to
solve prediction and classification problems.
– Provides solutions for linear and non‐linear relationships between input and
output variables.
– Does not assume any particular data distribution.
Advantages
• Has a mathematical foundation
• Robust with noisy data
• Detects relationships and trends in data
that traditional methods overlook
• Can fit complex non‐linear models
• Ability to detect all possible
interactions between predictor
variables
Disadvantages
• "Black box" nature that is not easy to analyze and interpret
• Greater computational burden
• Virtually impossible to "interpret" the solution in traditional, analytic
terms, such as those used to build theories that explain phenomena
Linear regression is an approach to modeling the relationship between a scalar
dependent variable (y) and one or more predictor variables (X).
• The case of one predictor variable is called simple regression.
• More than one predictor variable is multiple regression.
The regression equation represents a straight line or plane that minimizes the
squared differences between predicted and actual output values. This is a
very common statistical technique for summarizing data and making
predictions: y = f(x)
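For the simple (one-predictor) case, the line that minimizes the squared differences has a closed-form solution. The sketch below fits it in plain Python; the data (high school GPA versus first-year GPA) and the function name are hypothetical and chosen only to resemble an IR question.

```python
def fit_simple_regression(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: high school GPA (x) and first-year college GPA (y).
hs_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
fy_gpa = [1.8, 2.3, 2.7, 3.1, 3.6]

intercept, slope = fit_simple_regression(hs_gpa, fy_gpa)
predicted = intercept + slope * 3.2   # prediction for a new student with x = 3.2
print("intercept=%.2f slope=%.2f predicted=%.2f" % (intercept, slope, predicted))
```

Multiple regression follows the same principle with more than one predictor, but is normally fitted with a statistics package rather than by hand.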
Advantages:
• Available in most software
• Widely accepted statistical technique
Disadvantages:
• Not appropriate for many non‐linear
problems
• Must meet underlying assumptions
Logistic regression is a type of regression analysis used for predicting the
outcome of a categorical dependent variable based on one or more predictor
variables that may be either continuous or categorical data.
1. Binomial or binary logistic regression refers to the instance in which the
observed outcome can have only two possible types (e.g., "dead" vs.
"alive", "success" vs. "failure", or "yes" vs. "no").
2. Multinomial logistic regression refers to cases where the outcome can have
three or more possible types (e.g., "better" vs. "no change" vs. "worse").
For example, logistic regression might be used to predict whether a new student
will graduate within 6 years, based on observed characteristics of the
student (test score, age, gender, pre‐school preparation, etc.).
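In the binary case, a fitted logistic regression turns a weighted sum of the predictors into a probability through the logistic (sigmoid) function. The sketch below shows only that scoring step for the graduation example. The coefficients and the two predictors are hypothetical values chosen for illustration and were not estimated from any data.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients, NOT estimated from real data.
INTERCEPT = -4.0
COEF_TEST_SCORE = 0.004   # per point of an admission test score
COEF_PREP = 0.8           # 1 if the student completed a preparation program, else 0


def graduation_probability(test_score, prep):
    """Predicted probability of graduating within 6 years under the assumed model."""
    z = INTERCEPT + COEF_TEST_SCORE * test_score + COEF_PREP * prep
    return sigmoid(z)


for score, prep in [(900, 0), (1000, 0), (1200, 1)]:
    print(score, prep, "%.2f" % graduation_probability(score, prep))
```

In practice the coefficients come from fitting the model to historical student records, and a cutoff (often 0.5) converts the probability into a yes/no prediction.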
Advantages:
• Well established statistical procedure
• Simple and easy to interpret
• Very fast to train and build
• Can be used with small sample sizes
Disadvantages:
• Strong sensitivity to outliers
• Multicollinearity
Cluster analysis is an exploratory data analysis
tool (unsupervised) for solving classification
problems.
• Its object is to sort cases (people, things,
events, etc.) into groups, or clusters, so that
the degree of association is strong between
members of the same cluster and weak
between members of different clusters.
• It is not an automatic task, but an iterative
process of knowledge discovery (interactive
multi‐objective optimization) that involves
trial and failure until the result achieves the
desired properties.
Figure: The result of a cluster analysis shown as the coloring of the squares into
three clusters.
Types of Clustering
• K‐Means
• Two‐Step
• Kohonen
K‐Means clustering is an algorithm to classify or group your objects, based
on their attributes/features, into K groups. K is a positive integer.
• The grouping is done by minimizing the sum of squares of distances
between the data points and the corresponding cluster centroid.
• Thus the purpose is to classify the data by partitioning n observations into
k clusters in which each observation belongs to the cluster with the
nearest mean.
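The assign-then-update loop behind K‐Means can be sketched in a few lines. The version below is a minimal illustration of the standard iterative procedure (often called Lloyd's algorithm), not the implementation used by any particular tool. Seeding the centroids with the first k points is a simplifying assumption made for reproducibility, and the six data points are hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Minimal K-Means sketch; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break                      # no point changed cluster: converged
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Hypothetical two-feature data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

Real tools usually pick the starting centroids more carefully and rerun from several starts, because the final clusters can depend on the initial seeds.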
Two‐step cluster analysis is a technique that groups cases into pre‐clusters that
are treated as single cases. Standard hierarchical clustering is then applied to
the pre‐clusters in the second step.
• It is appropriate for large datasets or datasets that have a mixture of
continuous and categorical variables (not interval or dichotomous).
• It processes data with a one‐pass‐through‐the‐dataset method. Therefore,
it does not require a proximity table (like hierarchical classification) or an
iterative process (like K‐means clustering).
http://www.clustan.com
Kohonen networks are a type of neural network that performs clustering,
also known as a knet or a self‐organizing map.
• It seeks to describe a dataset in terms of natural clusters of cases. This
type of network can be used to cluster the data set into distinct groups
when you don't know what those groups are at the beginning.
• You don't even need to know the number of groups to look for. Kohonen
networks start with a large number of units, and as training progresses,
the units gravitate toward the natural clusters in the data.
Source: SPSS BI
Association or affinity analysis is a data mining
technique that discovers co‐occurrence
relationships among activities performed by
specific individuals or groups. These relationships
are then expressed as a collection of association
rules.
• Association rules are statements in the form
if antecedent(s) then consequent(s)
• Used to perform market basket analysis,
in which retailers seek to understand the
purchase behavior of customers.
Types of Association
• GRI
• Apriori
• CARMA
Customer  Purchase
1         jam
2         milk
3         jam
3         bread
4         jam
4         bread
4         milk

Customer  Jam  Bread  Milk
1         T    F      F
2         F    F      T
3         T    T      F
4         T    T      T
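The two layouts above hold the same information: one row per purchase versus one row per customer with a true/false flag per product. The conversion between them can be sketched as follows; the function name is illustrative and the data are the seven purchases from the example.

```python
# One row per purchase, as in the Customer / Purchase listing.
purchases = [
    (1, "jam"), (2, "milk"), (3, "jam"), (3, "bread"),
    (4, "jam"), (4, "bread"), (4, "milk"),
]
items = ["jam", "bread", "milk"]   # column order of the tabular layout


def to_basket_table(purchases, items):
    """Pivot purchase rows into one True/False row per customer."""
    baskets = {}
    for customer, item in purchases:
        baskets.setdefault(customer, set()).add(item)
    return {c: [item in bought for item in items]
            for c, bought in sorted(baskets.items())}


table = to_basket_table(purchases, items)
print("Customer", *[i.capitalize() for i in items])
for customer, flags in table.items():
    print(customer, " ".join("T" if f else "F" for f in flags))
```

Association algorithms differ in which of the two layouts they accept, so this reshaping is a common data preparation step.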
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Source: www.crisp‐dm.org
Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision
Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
Scalar refers to a quantity consisting of a single real number used to
measure magnitude (size).
• Interval = Scale with a fixed and defined interval, e.g. temperature or time.
• Ordinal = Scale for ordering observations from low to high, with any ties
attributed to lack of measurement sensitivity, e.g. score from a
questionnaire.
• Nominal with order = Scale for grouping into categories with order, e.g.
mild, moderate or severe. This can be difficult to separate from ordinal.
• Nominal without order = Scale for grouping into unique categories, e.g.
eye color.
• Dichotomous = As for nominal but two categories only, e.g. male/female.
Non‐Scalar contains more than one value (e.g., lists, arrays, records)
• Case‐wise or listwise deletion
• Pairwise deletion
• Single value substitution (by mean, median or mode of
variable)
• Regression substitution (using values of other variables in
the same row or taking the overall relationships of
variables into account)
• Marking with a dummy variable
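Three of the options listed above can be sketched on a handful of records: listwise deletion, single value substitution, and marking with a dummy variable. The records, field names, and function names below are hypothetical, and None stands for a missing value.

```python
from statistics import mean, median

# Hypothetical student records; None marks a missing value.
records = [
    {"gpa": 3.2, "credits": 15},
    {"gpa": None, "credits": 12},
    {"gpa": 2.8, "credits": None},
    {"gpa": 3.6, "credits": 18},
]


def listwise_delete(records):
    """Drop every record that has at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def substitute(records, field, statistic=mean):
    """Replace missing values of one field by a statistic of its observed values."""
    fill = statistic([r[field] for r in records if r[field] is not None])
    return [dict(r, **{field: fill if r[field] is None else r[field]})
            for r in records]


def flag_missing(records, field):
    """Add a 0/1 dummy variable recording whether the field was missing."""
    return [dict(r, **{field + "_missing": int(r[field] is None)})
            for r in records]


print(listwise_delete(records))
print(substitute(records, "gpa"))               # mean substitution
print(substitute(records, "credits", median))   # median substitution
print(flag_missing(records, "credits"))
```

Deletion is the simplest option but discards half of this small sample, which is why substitution or flagging is often preferred when many records have gaps.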
• Identify outliers (Anomaly Detection Node)
• Verify distributions (Data Audit Node)
• Relationship of variables
• Predictive power of variables (Auto Data Prep Node)
• Data reduction
• Data Audit/Data Distribution Charts
• Number of variables
• Number of records
• Information content/Predictive power
A successful data mining strategy involves:
1. Make data mining models comprehensible to business users
2. Translate users' questions into a data mining problem
– Well defined goals, project objectives, and questions
3. Ensure the use of sufficient and relevant data
4. Close the loop: identify causality, suggest actions, and measure their
effect.
– Need domain expertise in institutional research to build, test,
validate, and deploy models.
5. Careful consideration and selection of software and analysts (tech and
domain expert)
6. Support from senior administrators (VPs and the President)
7. Cope with privacy and security issues
8. Misuse of information/inaccurate information
Free Open‐source Data Mining Software and Applications:
• R
• RapidMiner
• WEKA
Information
www.kdnuggets.com/
www‐01.ibm.com/software/analytics/spss/products/modeler
www.educationaldatamining.org/index.html
www.sigkdd.org/
www.thearling.com/
Training
www.the‐modeling‐agency.com
http://web.ccsu.edu/datamining/
www.kdnuggets.com/education/usa‐canada.html
http://kdd.ics.uci.edu/
http://archive.ics.uci.edu/ml/
http://www.fedstats.gov/
http://www.census.gov/
http://nces.ed.gov/surveys/SurveyGroups.asp?group=2