LEARNING MODULE
1 INTRODUCTION TO DATA MINING 3
2 DATA EXPLORATION 29
3 MODELING 70
4 MODEL EVALUATION 178
5 MODEL DEPLOYMENT 190
Introduction

Our ability to generate and collect data has been increasing rapidly. The
widespread use of information technology in our lives has flooded us with a
tremendous amount of data. This explosive growth of stored and transient data
has generated an urgent need for new techniques and automated tools that can
assist in transforming this data into useful information and knowledge. Data
mining has emerged as a multidisciplinary field that addresses this need. This
is an introductory course on data mining. It introduces the basic concepts
and techniques of data mining, and students will learn how to apply data mining
principles to the analysis of large, complex data sets.
Learning Activities (to include Content/Discussion of the Topic): Activity #1, Activity #2
Statistics
The science of collecting, classifying, summarizing, organizing, analyzing, and
interpreting data.
Artificial Intelligence
The study of computer algorithms that simulate intelligent behavior in order
to perform activities normally thought to require intelligence.
Machine Learning
The study of computer algorithms that learn and improve automatically through
experience.
Database
The science and technology of collecting, storing and managing data so users can
retrieve, add, update or remove such data.
Data Warehousing
The science and technology of collecting, storing, and managing data with
advanced multi-dimensional reporting services in support of decision-making
processes.
The current evolution of data mining functions and products is the result of years
of influence from many disciplines, including databases, information retrieval,
statistics, algorithms, and machine learning.
Evolution of Sciences
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for
Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information
systems
Data mining involves many different algorithms to accomplish different tasks. All
of these algorithms attempt to fit a model to the data. The algorithms examine
the data and determine the model that is closest to the characteristics of the data
being examined. Data mining algorithms can be characterized as consisting of three
parts:
Model: the purpose of the algorithm is to fit a model to the data.
Preference: some criteria must be used to prefer one model over another.
Search: a technique is needed to search the data for the model that best satisfies
the preference criteria.
Example 1.
Credit card companies must determine whether to authorize credit card
purchases. Suppose that based on past historical information about
purchases, each purchase is placed into one of four classes: (1) authorize,
(2) ask for further identification before authorization, (3) do not authorize,
and (4) do not authorize but contact police. The data mining functions here
are twofold. First, the historical data must be examined to determine how
the data fit into the four classes. Then the problem is to apply this model to
each new purchase. Although the second part indeed may be stated as a
simple database query, the first part cannot be.
In Example 1 the data are modeled as divided into four classes. The search
requires examining past data about credit card purchases and their outcome to
determine what criteria should be used to define the class structure. The
preference will be given to criteria that seem to fit the data set. For example, we
probably would want to authorize a credit card purchase for a small amount of
money with a credit card belonging to a long-standing customer. Conversely, we
would not want to authorize the use of a credit card to purchase anything if the
card has been reported as stolen. The search process requires that the criteria
needed to fit the data to the classes be properly defined.
A predictive model makes a prediction about values of data using known results
found from different data. Predictive modeling may be based on the use of other
historical data.
For example, a credit card use might be refused not because of the user's own
credit history, but because the current purchase is similar to earlier purchases that
were subsequently found to be made with stolen cards. Example 1 uses predictive
modeling to predict the credit risk. Predictive model data mining tasks include
classification, regression, time series analysis, and prediction. Prediction may also
be used to indicate a specific type of data mining function.
EXAMPLE 2
2. Regression
Regression is used to map a data item to a real valued prediction variable. In
actuality, regression involves the learning of the function that does this mapping.
Regression assumes that the target data fit into some known type of function
(e.g., linear, logistic, etc.) and then determines the best function of this type that
models the given data. Some type of error analysis is used to determine which
function is "best." Standard linear regression, as illustrated in Example 3, is a
simple example of regression.
EXAMPLE 3
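The original Example 3 is not reproduced in this excerpt. As a stand-in, here is a minimal sketch of standard linear regression in Python, fitting a line by least squares; the data points are invented purely for illustration.

```python
# A minimal sketch of standard linear regression: fit y = a*x + b by
# least squares, then use the fitted line to predict a new value.
# The data points here are made up purely for illustration.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares estimates.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Perfectly linear toy data: y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(a, b)            # fitted slope and intercept
print(a * 6 + b)       # predicted y for x = 6
```

The "error analysis" mentioned above corresponds to the squared-error criterion that these closed-form estimates minimize.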
With time series analysis, the value of an attribute is examined as it varies over
time. The values usually are obtained at evenly spaced time points (daily, weekly,
hourly, etc.). A time series plot (Figure 4) is used to visualize the time series. In
this figure you can easily see that the plots for Y and Z have similar behavior, while
X appears to have less volatility. There are three basic functions performed in time
series analysis. First, distance measures are used to determine the similarity
between different time series. Second, the structure of the line is examined to
determine (and perhaps classify) its behavior. Third, the historical time series
plot may be used to predict future values. A time series example is given in
Example 4.
EXAMPLE 4
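The original Example 4 is not reproduced here. As an illustrative sketch of the first time-series task above (measuring similarity with a distance measure), the short series X, Y, and Z below are invented and are not the data of Figure 4.

```python
# A minimal sketch of comparing time series with a distance measure.
# Two series that move together have a small Euclidean distance.
import math

def euclidean_distance(s1, s2):
    """Distance between two equal-length, evenly spaced time series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))

Y = [10, 12, 15, 13, 16]
Z = [11, 13, 16, 14, 17]   # tracks Y closely
X = [10, 10, 11, 10, 11]   # flatter, less volatile

print(euclidean_distance(Y, Z))  # small: similar behavior
print(euclidean_distance(Y, X))  # larger: different behavior
```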
4. Prediction
Many real-world data mining applications can be seen as predicting future data
states based on past and current data. Prediction can be viewed as a type of
classification. (Note: This is a data mining task that is different from the prediction
model, although the prediction task is a type of prediction model.) The difference
is that prediction is predicting a future state rather than a current state. Here we
are referring to a type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flood forecasting,
speech recognition, machine learning, and pattern recognition. Although future values
may be predicted using time series analysis or regression techniques, other
approaches may be used as well. Example 5 illustrates the process.
EXAMPLE 5
5. Clustering
Clustering is similar to classification except that the groups are not predefined,
but rather defined by the data alone. Clustering is alternatively referred to as
unsupervised learning or segmentation. It can be thought of as partitioning or
segmenting the data into groups that might or might not be disjoint. The
clustering is usually accomplished by determining the similarity among the data
on predefined attributes. The most similar data are grouped into clusters.
Example 6 provides a simple clustering example. Since the clusters are not
predefined, a domain expert is often required to interpret the meaning of the
created clusters.
EXAMPLE 6
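The original Example 6 is not reproduced here. As an illustrative sketch of the clustering idea, the following tiny k-means on a single numeric attribute groups invented purchase amounts into two clusters; real tools cluster on many attributes at once.

```python
# A minimal sketch of clustering: a tiny k-means on one attribute.
# The purchase amounts and the choice of two starting centers are
# invented for illustration.

def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        # New center = mean of each (non-empty) cluster.
        centers = [sum(vs) / len(vs) for vs in clusters.values() if vs]
    return sorted(centers)

amounts = [5, 7, 6, 95, 102, 98]          # two natural groups
print(kmeans_1d(amounts, centers=[0, 50]))
```

As the text notes, the clusters are not predefined: a domain expert would still have to interpret what the "small purchase" and "large purchase" groups mean.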
6. Summarization
EXAMPLE 7
One of the many criteria used to compare universities by the U.S. News &
World Report is the average SAT or ACT score [GM99]. This is a summarization
used to estimate the type and intellectual level of the student body.
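Summarization reduces a data set to a small set of characterizing statistics, as with the average score above. A minimal sketch, with fabricated scores standing in for real SAT data:

```python
# A minimal sketch of summarization: reducing a set of values to a
# single statistic, as with the average SAT score described above.
# The scores below are fabricated for illustration.
import statistics

sat_scores = [1180, 1250, 1320, 1210, 1290]
print(statistics.mean(sat_scores))   # one number summarizing the student body
```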
7. Association Rules
EXAMPLE 8
Users of association rules must be cautioned that these are not causal
relationships. They do not represent any relationship inherent in the actual data
(as is true with functional dependencies) or in the real world. There probably is
no relationship between bread and pretzels that causes them to be purchased
together. And there is no guarantee that this association will apply in the future.
However, association rules can be used to assist retail store management in
effective advertising, marketing, and inventory control.
8. Sequence Discovery
EXAMPLE 9
The Webmaster at the XYZ Corp. periodically analyzes the Web log data to
determine how users of the XYZ's Web pages access them. He is interested in
determining what sequences of pages are frequently accessed. He determines
that 70 percent of the users of page A follow one of the following patterns of
behavior: (A, B, C), (A, D, B, C), or (A, E, B, C). He then decides to add a link
directly from page A to page C.
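The Webmaster's analysis above can be sketched as counting how many sessions that visit page A follow one of the frequent page sequences. The session log below is invented for illustration.

```python
# A minimal sketch of sequence discovery over invented Web sessions:
# what fraction of sessions match a frequent page sequence?

frequent = [("A", "B", "C"), ("A", "D", "B", "C"), ("A", "E", "B", "C")]

def is_subsequence(pattern, session):
    """True if pattern's pages occur in session in the same order."""
    it = iter(session)
    return all(page in it for page in pattern)

sessions = [
    ("A", "B", "C"),
    ("A", "D", "B", "C"),
    ("A", "F"),
    ("A", "E", "B", "C"),
]
hits = sum(any(is_subsequence(p, s) for p in frequent) for s in sessions)
print(hits / len(sessions))   # fraction of sessions matching a pattern
```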
Figure 5 illustrates the phases, and the iterative nature, of a data mining project.
The process flow shows that a data mining project does not stop when a particular
solution is deployed. The results of data mining trigger new business questions,
which in turn can be used to develop more focused models.
Problem Definition
This initial phase of a data mining project focuses on understanding the project
objectives and requirements. Once you have specified the project from a business
perspective, you can formulate it as a data mining problem and develop a
preliminary implementation plan.
For example, your business problem might be: "How can I sell more of my
product to customers?" You might translate this into a data mining problem
such as: "Which customers are most likely to purchase the product?" A model
that predicts who is most likely to purchase the product must be built on data
that describes the customers who have purchased the product in the past.
Before building the model, you must assemble the data that is likely to contain
relationships between customers who have purchased the product and
customers who have not purchased the product. Customer attributes might
include age, number of children, years of residence, owners/renters, and so on.
Data Preparation
The data preparation phase covers all the tasks involved in creating the case table
you will use to build the model. Data preparation tasks are likely to be performed
multiple times, and not in any prescribed order. Tasks include table, case, and
attribute selection as well as data cleansing and transformation.
For example, you might transform a DATE_OF_BIRTH column to AGE; you might
insert the average income in cases where the INCOME column is null.
For example, rather than using the raw purchase amount, you might create a new
attribute: "Number of Times Purchase Amount Exceeds $500 in a 12-Month
Period." Customers who frequently make large purchases may also differ
systematically in how they respond to an offer.
Thoughtful data preparation can significantly improve the information that can be
discovered through data mining.
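The two transformations mentioned above (deriving AGE from DATE_OF_BIRTH, and filling a null INCOME with the average) can be sketched as follows; the records, column names, and reference date are invented for illustration.

```python
# A minimal sketch of two common data preparation transformations:
# derive AGE from DATE_OF_BIRTH, and impute a missing INCOME with
# the average of the known incomes.
from datetime import date

def prepare(rows, today):
    incomes = [r["INCOME"] for r in rows if r["INCOME"] is not None]
    avg_income = sum(incomes) / len(incomes)
    prepared = []
    for r in rows:
        dob = r["DATE_OF_BIRTH"]
        # Subtract one year if the birthday has not yet occurred this year.
        age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
        income = r["INCOME"] if r["INCOME"] is not None else avg_income
        prepared.append({"AGE": age, "INCOME": income})
    return prepared

rows = [
    {"DATE_OF_BIRTH": date(1980, 6, 1), "INCOME": 40000},
    {"DATE_OF_BIRTH": date(1990, 1, 15), "INCOME": None},   # to be imputed
    {"DATE_OF_BIRTH": date(1975, 12, 31), "INCOME": 60000},
]
print(prepare(rows, today=date(2016, 1, 1)))
```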
In preliminary model building, it often makes sense to work with a reduced set of
data (fewer rows in the case table), since the final case table might contain
thousands or millions of cases.
Model Evaluation
At this stage of the project, it is time to evaluate how well the model satisfies the
originally stated business goal (phase 1). If the model is supposed to predict
customers who are likely to purchase a product, does it sufficiently differentiate
between the two classes? Is there sufficient lift? Are the trade-offs shown in the
confusion matrix acceptable? Would the model be improved by adding text data?
Should transactional data such as purchases (market-basket data) be included?
Should costs associated with false positives or false negatives be incorporated
into the model?
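The confusion-matrix and lift checks mentioned above can be sketched for the two-class purchase problem as follows; the actual and predicted labels are invented for illustration.

```python
# A minimal sketch of a two-class confusion matrix and lift.
# 1 = purchased, 0 = did not purchase; labels are invented.

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
print(tp, fp, fn, tn)

# Lift: how much more common purchasers are among the "predicted yes"
# group than in the population as a whole.
precision = tp / (tp + fp)
base_rate = sum(actual) / len(actual)
print(precision / base_rate)
```

A lift above 1 means the model differentiates the two classes better than random targeting; the false-positive and false-negative counts are where costs would be attached.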
Knowledge Deployment
Knowledge deployment is the use of data mining within a target environment. In
the deployment phase, insight and actionable information can be derived from
data.
Deployment can involve scoring (the application of models to new data), the
extraction of model details (for example the rules of a decision tree), or the
integration of data mining models within applications, data warehouse
infrastructure, or query and reporting tools. Data can be mined and the results
returned within a single database transaction.
For example, a sales representative could run a model that predicts the
likelihood of fraud within the context of an online sales transaction.
There are many important implementation issues associated with data mining:
1. Human interaction: Since data mining problems are often not precisely stated,
interfaces may be needed with both domain and technical experts. Technical
experts are used to formulate the queries and assist in interpreting the results.
Users are needed to identify training data and desired results.
3. Outliers: There are often many data entries that do not fit nicely into the
derived model. This becomes even more of an issue with very large databases. If
a model is developed that includes these outliers, then the model may not behave
well for data that are not outliers.
6. Large datasets: The massive datasets associated with data mining create
problems when applying algorithms designed for small datasets. Many modeling
applications grow exponentially with the dataset size and thus are too inefficient
for larger datasets. Sampling and parallelization are effective tools to attack this
scalability problem.
9. Missing data: During the preprocessing phase of KDD, missing data may be
replaced with estimates. This and other approaches to handling missing data can
lead to invalid results in the data mining step.
10. Irrelevant data: Some attributes in the database might not be of interest to
the data mining task being developed.
11. Noisy data: Some attribute values might be invalid or incorrect. These values
are often corrected before running data mining applications.
13. Integration: The KDD process is not currently integrated into normal data
processing activities. KDD requests may be treated as special, unusual, or one-
time needs. This makes them inefficient, ineffective, and not general enough to
be used on an ongoing basis. Integration of data mining functions into traditional
DBMS systems is certainly a desirable goal.
14. Application: Determining the intended use for the information obtained from
the data mining function is a challenge. Indeed, how business executives can
effectively use the output is sometimes considered the more difficult part, not the
running of the algorithms themselves. Because the data are of a type that has not
previously been known, business practices may have to be modified to determine
how to effectively use the information uncovered.
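Several of the data issues above (outliers, noisy values) are often screened with simple statistical tests before mining. A minimal sketch, assuming an invented list of purchase amounts and a two-standard-deviation cutoff:

```python
# A minimal sketch of outlier screening: flag values more than
# `threshold` standard deviations from the mean. The data and the
# cutoff of 2 standard deviations are invented for illustration.
import statistics

def flag_outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    return [v for v in values if abs(v - mean) > threshold * stdev]

purchases = [20, 22, 19, 21, 23, 20, 500]   # 500 looks suspicious
print(flag_outliers(purchases))
```

Whether a flagged value is an error to correct or a genuine (and interesting) outlier is exactly the judgment call the issues above describe.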
Table 1. Industries/Fields where you applied Analytics, Data Mining, Data Science in 2016

Industry/Field (2016 votes)           2016 % of voters   2015 % of voters   2014 % of voters
CRM/Consumer analytics (90)           16.3%              18.6%              22.2%
Finance (83)                          15.0%              15.4%              10.9%
Banking (74)                          13.4%              14.3%              16.7%
Advertising (66)                      12.0%               8.9%              10.4%
Science (66)                          12.0%              11.7%              13.6%
Health care (66)                      12.0%              13.4%              16.3%
Fraud Detection (61)                  11.1%              10.0%              13.6%
Retail (57)                           10.3%               9.1%              13.6%
Insurance (51)                         9.2%               7.4%               8.6%
E-commerce (49)                        8.9%              10.3%               9.5%
Automotive/Self-Driving Cars (25)      4.5%               4.3%               5.9%
Direct Marketing/Fundraising (24)      4.3%               5.1%               7.2%
Mining (23)                            4.2%               3.7%               na
Travel/Hospitality (22)                4.0%               2.6%               3.2%
Entertainment/Music/TV/Movies (22)     4.0%               3.1%               1.8%
HR/workforce analytics (20)            3.6%               6.3%               5.9%
Mobile apps (18)                       3.3%               1.4%               2.3%
Agriculture (18)                       3.3%               2.9%               na
Games (16)                             2.9%               4.0%               1.8%
Security/Anti-terrorism (15)           2.7%               2.3%               2.3%
Social Good/Non-profit (11)            2.0%               2.3%               1.4%
Social Policy/Survey analysis (10)     1.8%               1.7%               1.8%
Source: http://www.saedsayad.com/problem_definition.htm
Assignment

Activity #2: Identify a dataset of an industry sector for market analysis on
solving a real-world data mining problem.

CS 325 – Data Mining (compiled by: DR. MONALEE A. DELA CERNA)

LEARNING MODULE – SURIGAO STATE COLLEGE OF TECHNOLOGY
Data
Data is information, typically the result of measurement (numerical) or counting
(categorical).
Figure 6.
Structure of Data
Dataset
Database
A database collects, stores, and manages information so users can retrieve, add,
update, or remove such information. It presents information in tables with rows
and columns. A table is referred to as a relation in the sense that it is a collection
of objects of the same type (rows). Data in a table can be related according to
common keys or concepts, and the ability to retrieve related data from related
tables is the basis for the term relational database. A Database Management
System (DBMS) handles the way data is stored, maintained, and retrieved. Most
data science toolboxes connect to databases through ODBC (Open Database
Connectivity) or JDBC (Java Database Connectivity) interfaces.
Figure 7. Database
ETL extracts data from data sources and loads it into data destinations using a
set of transformation functions.
Data extraction provides the ability to extract data from a variety of data
sources, such as flat files, relational databases, streaming data, XML files,
and ODBC/JDBC data sources.
Data transformation provides the ability to cleanse, convert, aggregate,
merge, and split data.
Data loading provides the ability to load data into destination databases
via update, insert or delete statements, or in bulk.
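The three ETL steps above can be sketched end to end as follows; the CSV source text, the field names, and the Fahrenheit-to-Celsius cleansing step are all invented for illustration, with a plain list standing in for the destination database table.

```python
# A minimal ETL sketch: extract rows from a flat-file-style CSV source,
# transform them (drop missing readings, convert units), and load them
# into a destination list standing in for a database table.
import csv, io

source = "name,temp_f\nalice, 98.6\nbob,\ncarol,101.3\n"

# Extract: read records from the CSV source.
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: cleanse (drop rows with a missing reading) and convert
# Fahrenheit to Celsius.
transformed = [
    {"name": r["name"], "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)}
    for r in rows
    if r["temp_f"].strip()
]

# Load: append the transformed rows into the destination "table".
destination = []
destination.extend(transformed)
print(destination)
```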
Activity #3: Implement basic pre-processing of a dataset, applying a structured
query method using statistical software.
1. Gather temperature data at one location every hour starting at 8:00 a.m. for
12 straight hours on 3 different days.
Requirement:
a. Plot the three sets of time series data on the same graph.
b. Analyze the three curves. Do they behave in the same manner? Does
there appear to be a trend in the temperature during the day?
c. Are the three plots similar?
d. Predict what the next temperature value would have been for the next
hour in each of the 3 days.
e. Compare your prediction with the actual value that occurred.
2. Find at least three examples of data mining applications that have appeared in
the business section of your local newspaper or other news publication.
Describe the data mining applications involved.
Review of Concepts

Data mining is the task of discovering interesting patterns from large amounts of
data, where the data can be stored in databases, data warehouses, or other
information repositories. It is a young interdisciplinary field, drawing from areas
such as database systems, data warehousing, statistics, machine learning, data
visualization, information retrieval, and high-performance computing. Other
contributing areas include neural networks, pattern recognition, spatial data
analysis, image databases, signal processing, and many application fields, such as
business, economics, and bioinformatics.
Data mining techniques are the result of a long process of research and product
development, an evolution that began when business data was first stored on
computers. Data mining also allows users to navigate through their data in real
time. Data mining is used in the business community because it is supported by
three technologies that are now mature: massive data collection, powerful
multiprocessor computers, and data mining algorithms.
These advanced techniques must be applied in the best way, and they must be
fully integrated with business data analysis tools; otherwise, operating data
mining tools requires extra steps for extracting and importing the data.
Furthermore, the data mining approaches applied have issues and limitations,
such as the versatility of the mining approaches, that can dictate the choice of
mining methodology.
References

Han, J., Kamber, M., and Pei, J. (2011). Data Mining: Concepts and Techniques,
3rd edition. Morgan Kaufmann.
Dunham, M.H. (2003). Data Mining Introductory and Advanced Topics. Pearson
Education Inc. Upper Saddle River, New Jersey.