
LEARNING MODULE
SURIGAO STATE COLLEGE OF TECHNOLOGY

CS 325 – Data Mining (compiled by: DR. MONALEE A. DELA CERNA)



LEARNING MODULE
1 INTRODUCTION TO DATA MINING 3
2 DATA EXPLORATION 29
3 MODELING 70
4 MODEL EVALUATION 178
5 MODEL DEPLOYMENT 190




LEARNING MODULE NO. 1


Title INTRODUCTION TO DATA MINING

Topic 1.1 Data Mining Overview

1.2 Problem Definition

1.3 Data Preparation

Time Frame 15 hrs.

Introduction Our ability to generate and collect data has been increasing rapidly. The
widespread use of information technology in our lives has flooded us with a
tremendous amount of data. This explosive growth of stored and transient data
has generated an urgent need for new techniques and automated tools that can
assist in transforming the data into useful information and knowledge. Data
mining has emerged as a multidisciplinary field that addresses this need. This
is an introductory course on data mining: it presents the basic concepts
and techniques of the field, and students will learn how to apply data mining
principles to the analysis of large, complex data sets.

Objectives In this module, learners will be able to:


1. Create a concept map of the applications, techniques, algorithms, and
software used in data mining.
2. Identify a dataset from an industry sector for market analysis, to solve a
real-world data mining problem.
3. Implement basic pre-processing of a dataset, applying structured query
methods using statistical software.




Learning Activities (to include Content/Discussion of the Topic)

Activity # 1

Draw a concept map on the applications, techniques, algorithms and
software used in data mining.

[Concept map template with branches: Applications, Techniques, Software Used, Issues]



Figure 1. Data Mining Map


Source: http://www.saedsayad.com/data_mining_map.htm




1.1 OVERVIEW OF DATA MINING


WHAT IS DATA MINING?
Data Mining (a.k.a. Data Science) is about explaining the past and predicting the
future by means of data analysis. Data mining is a multi-disciplinary field which
combines statistics, machine learning, artificial intelligence and database
technology. The value of data mining applications is often estimated to be very
high. Many businesses have stored large amounts of data over years of operation,
and data mining is able to extract very valuable knowledge from this data. The
businesses are then able to leverage the extracted knowledge into more clients,
more sales, and greater profits. This is also true in the engineering and medical
fields.

Figure 2. Historical perspective of Data Mining


Source: http://www.saedsayad.com/data_mining.htm

Statistics
The science of collecting, classifying, summarizing, organizing, analyzing, and
interpreting data.

Artificial Intelligence
The study of computer algorithms dealing with the simulation of intelligent
behaviors in order to perform those activities that are normally thought to
require intelligence.

Machine Learning
The study of computer algorithms to learn in order to improve automatically
through experience.

Database
The science and technology of collecting, storing and managing data so users can
retrieve, add, update or remove such data.




Data warehousing
The science and technology of collecting, storing and managing data with
advanced multi-dimensional reporting services in support of the decision making
processes.

HISTORY OF DATA MINING


In the 1960s, statisticians used the terms "data fishing" or "data dredging"
to refer to what they considered the bad practice of analyzing data without an
a-priori hypothesis. The term "data mining" appeared around 1990 in the
database community.

The current evolution of data mining functions and products is the result of years
of influence from many disciplines, including databases, information retrieval,
statistics, algorithms, and machine learning.

Evolution of Sciences

 Before 1600, empirical science

 1600-1950s, theoretical science


 Each discipline has grown a theoretical component. Theoretical
models often motivate experiments and generalize our
understanding.

 1950s-1990s, computational science


 Over the last 50 years, most disciplines have grown a third,
computational branch (e.g. empirical, theoretical, and computational
ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of
our inability to find closed-form solutions for complex mathematical
models.

 1990-now, data science


 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data
online
 The Internet and computing Grid that makes all these archives
universally accessible
 Scientific information management, acquisition, organization, query,
and visualization tasks scale almost linearly with data volumes. Data
mining is a major new challenge!

 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for
Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002




Evolution of Database Technology

 1960s:
 Data collection, database creation, IMS and network DBMS

 1970s:
 Relational data model, relational DBMS implementation

 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)

 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases

 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information
systems


Data mining involves many different algorithms to accomplish different tasks. All
of these algorithms attempt to fit a model to the data. The algorithms examine
the data and determine the model that is closest to the characteristics of the data
being examined. Data mining algorithms can be characterized as consisting of
three parts:

 Model: The purpose of the algorithm is to fit a model to the data.


 Preference: Some criteria must be used to prefer one model over another.
 Search: All algorithms require some technique to search the data.

Example 1.
Credit card companies must determine whether to authorize credit card
purchases. Suppose that based on past historical information about
purchases, each purchase is placed into one of four classes: (1) authorize,
(2) ask for further identification before authorization, (3) do not authorize,
and (4) do not authorize but contact police. The data mining functions here
are twofold. First, the historical data must be examined to determine how
the data fit into the four classes. Then the problem is to apply this model to
each new purchase. Although the second part indeed may be stated as a
simple database query, the first part cannot be.




In Example 1 the data are modeled as divided into four classes. The search
requires examining past data about credit card purchases and their outcome to
determine what criteria should be used to define the class structure. The
preference will be given to criteria that seem to fit the data set. For example, we
probably would want to authorize a credit card purchase for a small amount of
money with a credit card belonging to a long-standing customer. Conversely, we
would not want to authorize the use of a credit card to purchase anything if the
card has been reported as stolen. The search process requires that the criteria
needed to fit the data to the classes be properly defined.

As seen in Figure 3, the model that is created can be either predictive or
descriptive in nature. The figure shows, under each model type, some of the
most common data mining tasks that use that type of model.

Figure 3. Data mining models and tasks

A predictive model makes a prediction about values of data using known results
found from different data. Predictive modeling may be made based on the use of
other historical data.

For example, a credit card use might be refused not because of the user's own
credit history, but because the current purchase is similar to earlier purchases that
were subsequently found to be made with stolen cards. Example 1 uses predictive
modeling to predict the credit risk. Predictive model data mining tasks include
classification, regression, time series analysis, and prediction. Prediction may also
be used to indicate a specific type of data mining function.

A descriptive model identifies patterns or relationships in data. Unlike the


predictive model, a descriptive model serves as a way to explore the properties
of the data examined, not to predict new properties. Clustering, summarization,
association rules, and sequence discovery are usually viewed as descriptive in
nature.




BASIC DATA MINING TASKS


1. Classification

Classification maps data into predefined groups or classes. It is often referred to


as supervised learning because the classes are determined before examining the
data. Two examples of classification applications are determining whether to
make a bank loan and identifying credit risks. Classification algorithms require
that the classes be defined based on data attribute values. They often describe
these classes by looking at the characteristics of data already known to belong to
the classes. Pattern recognition is a type of classification where an input pattern
is classified into one of several classes based on its similarity to these predefined
classes. Example 1 illustrates a general classification problem. Example 2 shows a
simple example of pattern recognition.

EXAMPLE 2

An airport security screening station is used to determine if passengers are
potential terrorists or criminals. To do this, the face of each passenger is
scanned and its basic pattern (distance between eyes, size and shape of mouth,
shape of head, etc.) is identified. This pattern is compared to entries in a
database to see if it matches any patterns that are associated with known
offenders.
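The pattern-matching flavor of classification in Example 2 can be sketched in a few lines of Python. This is an illustrative sketch, not part of the module: the two class names, the feature values, and the prototype patterns are all invented.

```python
import math

# Hypothetical prototype patterns for two predefined classes.
# Features (invented): (distance between eyes, mouth width).
PROTOTYPES = {
    "known_offender": (6.2, 4.8),
    "no_match": (5.0, 4.0),
}

def classify(pattern):
    """Assign a pattern to the predefined class whose prototype is nearest
    (Euclidean distance), i.e. supervised classification by similarity."""
    return min(PROTOTYPES, key=lambda c: math.dist(pattern, PROTOTYPES[c]))

print(classify((6.0, 4.7)))  # nearest prototype is "known_offender"
```

The classes exist before any data are examined, which is exactly what makes this supervised learning rather than clustering.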

2. Regression
Regression is used to map a data item to a real valued prediction variable. In
actuality, regression involves the learning of the function that does this mapping.
Regression assumes that the target data fit into some known type of function
(e.g., linear, logistic, etc.) and then determines the best function of this type that
models the given data. Some type of error analysis is used to determine which
function is "best." Standard linear regression, as illustrated in Example 3, is a
simple example of regression.

EXAMPLE 3

A college professor wishes to reach a certain level of savings before her


retirement. Periodically, she predicts what her retirement savings will be based
on its current value and several past values. She uses a simple linear regression
formula to predict this value by fitting past behavior to a linear function and
then using this function to predict the values at points in the future. Based on
these values, she then alters her investment portfolio.
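Example 3's fit-then-extrapolate procedure can be sketched with ordinary least squares. The yearly balances below are invented for illustration; the module does not supply any data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical savings balances (in $1000s) observed at years 1..5.
years = [1, 2, 3, 4, 5]
savings = [52, 60, 68, 76, 84]

a, b = fit_line(years, savings)
# Use the fitted linear function to predict a future value (year 10).
predicted_year_10 = a + b * 10
print(predicted_year_10)  # 124.0
```

Error analysis (e.g. residuals) would decide whether a linear function is really the "best" choice of function type, as the text notes.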




3. Time Series Analysis

With time series analysis, the value of an attribute is examined as it varies over
time. The values usually are obtained as evenly spaced time points (daily, weekly,
hourly, etc.). A time series plot (Figure 4) is used to visualize the time series. In
this figure you can easily see that the plots for Y and Z have similar behavior, while
X appears to have less volatility. There are three basic functions performed in time
series analysis: In one case, distance measures are used to determine the
similarity between different time series. In the second case, the structure of the
line is examined to determine (and perhaps classify) its behavior. A third
application would be to use the historical time series plot to predict future values.
A time series example is given in Example 4.

EXAMPLE 4

Mr. Smith is trying to determine whether to purchase stock from Companies X,
Y, or Z. For a period of one month he charts the daily stock price for each
company. Figure 4 shows the time series plot that Mr. Smith has generated
using this and similar information available from his stockbroker. Mr. Smith
decides to purchase stock X because it is less volatile while overall showing a
slightly larger relative amount of growth than either of the other stocks. As a
matter of fact, the stocks for Y and Z have a similar behavior. The behavior of
Y between days 6 and 20 is identical to that for Z between days 13 and 27.

Figure 4. Time Series Plots
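The first of the three time series functions above (measuring similarity with a distance measure) can be sketched as follows. The daily prices are invented; the two-day shift mimics the shifted-but-identical behavior of Y and Z described in Example 4.

```python
import math

def series_distance(s, t):
    """Euclidean distance between two equal-length time series:
    a simple similarity measure (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

# Hypothetical daily prices: z repeats y's pattern two days later.
y = [10, 12, 15, 14, 16, 18, 17]
z = [9, 11, 10, 12, 15, 14, 16]

# Compare y's first 5 days with z shifted by 2 days.
print(series_distance(y[:5], z[2:7]))  # 0.0 -> identical behavior
```

Real similarity measures for time series (e.g. ones tolerant of stretching as well as shifting) are more elaborate, but they follow the same idea of reducing two series to a single distance.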




4. Prediction

Many real-world data mining applications can be seen as predicting future data
states based on past and current data. Prediction can be viewed as a type of
classification. (Note: This is a data mining task that is different from the prediction
model, although the prediction task is a type of prediction model.) The difference
is that prediction is predicting a future state rather than a current state. Here we
are referring to a type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flooding, speech
recognition, machine learning, and pattern recognition. Although future values
may be predicted using time series analysis or regression techniques, other
approaches may be used as well. Example 5 illustrates the process.

EXAMPLE 5

Predicting flooding is a difficult problem. One approach uses monitors placed
at various points in the river. These monitors collect data relevant to flood
prediction: water level, rain amount, time, humidity, and so on. Then the water
level at a potential flooding point in the river can be predicted based on the
data collected by the sensors upriver from this point. The prediction must be
made with respect to the time the data were collected.

5. Clustering

Clustering is similar to classification except that the groups are not predefined,
but rather defined by the data alone. Clustering is alternatively referred to as
unsupervised learning or segmentation. It can be thought of as partitioning or
segmenting the data into groups that might or might not be disjoint. The
clustering is usually accomplished by determining the similarity among the data
on predefined attributes. The most similar data are grouped into clusters.
Example 6 provides a simple clustering example. Since the clusters are not
predefined, a domain expert is often required to interpret the meaning of the
created clusters.

EXAMPLE 6

A certain national department store chain creates special catalogs targeted to


various demographic groups based on attributes such as income, location, and
physical characteristics of potential customers (age, height, weight, etc.). To
determine the target mailings of the various catalogs and to assist in the
creation of new, more specific catalogs, the company performs a clustering of
potential customers based on the determined attribute values. The results of
the clustering exercise are then used by management to create special catalogs
and distribute them to the correct target population based on the cluster for
that catalog.
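The grouping in Example 6 can be sketched with a tiny one-dimensional k-means, clustering customers on a single attribute. Everything here is invented for illustration: the incomes, the number of clusters, and the starting centers.

```python
def kmeans_1d(values, centers, iters=10):
    """Tiny 1-D k-means sketch: repeatedly assign each value to its
    nearest center, then move each center to its cluster's mean."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer incomes (in $1000s): two natural groups.
incomes = [28, 30, 32, 85, 90, 95]
centers, clusters = kmeans_1d(incomes, centers=[0, 100])
print(clusters)  # [[28, 30, 32], [85, 90, 95]]
```

Note that the two groups were never predefined; they emerge from the data, and a domain expert would still have to interpret them (e.g. "budget" vs. "premium" customers).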




A special type of clustering is called segmentation. With segmentation a


database is partitioned into disjoint groupings of similar tuples called segments.
Segmentation is often viewed as being identical to clustering. In other circles
segmentation is viewed as a specific type of clustering applied to a database itself.
In this text we use the two terms, clustering and segmentation, interchangeably.

6. Summarization

Summarization maps data into subsets with associated simple descriptions.


Summarization is also called characterization or generalization. It extracts or
derives representative information about the database. This may be
accomplished by actually retrieving portions of the data. Alternatively, summary
type information (such as the mean of some numeric attribute) can be derived
from the data. The summarization succinctly characterizes the contents of the
database. Example 7 illustrates this process.

EXAMPLE 7

One of the many criteria used to compare universities by the U.S. News &
World Report is the average SAT or ACT score [GM99]. This is a summarization
used to estimate the type and intellectual level of the student body.
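The kind of summarization in Example 7 amounts to reducing each group of values to one representative number. A minimal sketch, with invented universities and scores:

```python
from statistics import mean

# Hypothetical SAT scores for admitted students at two universities.
scores = {
    "University A": [1310, 1290, 1350, 1330],
    "University B": [1180, 1150, 1210, 1140],
}

# Summarization: derive one representative value (the mean) per group.
summary = {u: mean(s) for u, s in scores.items()}
print(summary)  # {'University A': 1320, 'University B': 1170}
```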

7. Association Rules

Link analysis, alternatively referred to as affinity analysis or association, refers to


the data mining task of uncovering relationships among data. The best example
of this type of application is to determine association rules. An association rule is
a model that identifies specific types of data associations. These associations are
often used in the retail sales community to identify items that are frequently
purchased together. Example 8 illustrates the use of association rules in market
basket analysis. Here the data analyzed consist of information about what items
a customer purchases. Associations are also used in many other applications such
as predicting the failure of telecommunication switches.

EXAMPLE 8

A grocery store retailer is trying to decide whether to put bread on sale. To


help determine the impact of this decision, the retailer generates association
rules that show what other products are frequently purchased with bread. He
finds that 60% of the time that bread is sold so are pretzels and that 70% of the
time jelly is also sold. Based on these facts, he tries to capitalize on the
association between bread, pretzels, and jelly by placing some pretzels and
jelly at the end of the aisle where the bread is placed. In addition, he decides
not to place either of these items on sale at the same time.
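The percentages in Example 8 are rule confidences. A minimal sketch of computing the confidence of an association rule from transactions; the baskets below are invented and chosen so that bread -> jelly comes out at a similar level to the example.

```python
# Hypothetical market-basket transactions (sets of purchased items).
baskets = [
    {"bread", "jelly", "pretzels"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"bread", "jelly", "milk"},
    {"milk", "pretzels"},
    {"bread", "jelly"},
]

def confidence(antecedent, consequent):
    """confidence(X -> Y) = count(baskets with X and Y) / count(baskets with X)."""
    has_x = [b for b in baskets if antecedent <= b]
    return sum(1 for b in has_x if consequent <= b) / len(has_x)

print(confidence({"bread"}, {"jelly"}))  # 4 of 5 bread baskets -> 0.8
```

As the next paragraph stresses, a high confidence says nothing about causation; it only records how often the items co-occurred in these baskets.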




Users of association rules must be cautioned that these are not causal
relationships. They do not represent any relationship inherent in the actual data
(as is true with functional dependencies) or in the real world. There probably is
no relationship between bread and pretzels that causes them to be purchased
together. And there is no guarantee that this association will apply in the future.
However, association rules can be used to assist retail store management in
effective advertising, marketing, and inventory control.

8. Sequence Discovery

Sequential analysis or sequence discovery is used to determine sequential


patterns in data. These patterns are based on a time sequence of actions. These
patterns are similar to associations in that data (or events) are found to be related,
but the relationship is based on time. Unlike a market basket analysis, which
requires the items to be purchased at the same time, in sequence discovery the
items are purchased over time in some order. Example 9 illustrates the discovery
of some simple patterns. A similar type of discovery can be seen in the sequence
within which data are purchased. For example, most people who purchase CD
players may be found to purchase CDs within one week. As we will see, temporal
association rules really fall into this category.

EXAMPLE 9

The Webmaster at the XYZ Corp. periodically analyzes the Web log data to
determine how users of the XYZ's Web pages access them. He is interested in
determining what sequences of pages are frequently accessed. He determines
that 70 percent of the users of page A follow one of the following patterns of
behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C). He then determines to add a link
directly from page A to page C.




THE DATA MINING PROCESS

Figure 5 illustrates the phases, and the iterative nature, of a data mining project.
The process flow shows that a data mining project does not stop when a particular
solution is deployed. The results of data mining trigger new business questions,
which in turn can be used to develop more focused models.

Figure 5. The Data Mining Process

Problem Definition
This initial phase of a data mining project focuses on understanding the project
objectives and requirements. Once you have specified the project from a business
perspective, you can formulate it as a data mining problem and develop a
preliminary implementation plan.

For example, your business problem might be: "How can I sell more of my
product to customers?" You might translate this into a data mining problem
such as: "Which customers are most likely to purchase the product?" A model
that predicts who is most likely to purchase the product must be built on data
that describes the customers who have purchased the product in the past.
Before building the model, you must assemble the data that is likely to contain
relationships between customers who have purchased the product and
customers who have not purchased the product. Customer attributes might
include age, number of children, years of residence, owners/renters, and so on.




Data Gathering and Preparation


The data understanding phase involves data collection and exploration. As you
take a closer look at the data, you can determine how well it addresses the
business problem. You might decide to remove some of the data or add additional
data. This is also the time to identify data quality problems and to scan for
patterns in the data.

The data preparation phase covers all the tasks involved in creating the case table
you will use to build the model. Data preparation tasks are likely to be performed
multiple times, and not in any prescribed order. Tasks include table, case, and
attribute selection as well as data cleansing and transformation.

For example, you might transform a DATE_OF_BIRTH column to AGE; you might
insert the average income in cases where the INCOME column is null.

Additionally you might add new computed attributes in an effort to tease


information closer to the surface of the data.

For example, rather than using the purchase amount, you might create a new
attribute: "Number of Times Purchase Amount Exceeded $500 in a 12-Month
Period." Customers who frequently make large purchases may also be related
to customers who respond or don't respond to an offer.
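The two transformations mentioned above (DATE_OF_BIRTH to AGE, and imputing a null INCOME with the average) can be sketched as follows. The rows, column names, and reference date are invented; the age calculation is a crude year difference, ignoring birthdays.

```python
from datetime import date
from statistics import mean

# Hypothetical raw customer rows; None marks a missing INCOME value.
rows = [
    {"date_of_birth": date(1990, 6, 1), "income": 48000},
    {"date_of_birth": date(1975, 2, 15), "income": None},
    {"date_of_birth": date(1982, 11, 30), "income": 62000},
]

def prepare(rows, today):
    known = [r["income"] for r in rows if r["income"] is not None]
    avg_income = mean(known)
    out = []
    for r in rows:
        out.append({
            # DATE_OF_BIRTH -> AGE transformation (crude year difference).
            "age": today.year - r["date_of_birth"].year,
            # Null INCOME -> average-income imputation.
            "income": r["income"] if r["income"] is not None else avg_income,
        })
    return out

prepared = prepare(rows, today=date(2020, 1, 1))
print(prepared)
```

Imputing with the mean is only one option; as the later section on missing data warns, any such estimate can bias the mining step and should be chosen deliberately.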

Thoughtful data preparation can significantly improve the information that can be
discovered through data mining.

Model Building and Evaluation


In this phase, you select and apply various modeling techniques and calibrate the
parameters to optimal values. If the algorithm requires data transformations, you
will need to step back to the previous phase to implement them.

In preliminary model building, it often makes sense to work with a reduced set of
data (fewer rows in the case table), since the final case table might contain
thousands or millions of cases.

At this stage of the project, it is time to evaluate how well the model satisfies the
originally-stated business goal (phase 1). If the model is supposed to predict
customers who are likely to purchase a product, does it sufficiently differentiate
between the two classes? Is there sufficient lift? Are the trade-offs shown in the
confusion matrix acceptable? Would the model be improved by adding text data?
Should transactional data such as purchases (market-basket data) be included?
Should costs associated with false positives or false negatives be incorporated
into the model?




Knowledge Deployment
Knowledge deployment is the use of data mining within a target environment. In
the deployment phase, insight and actionable information can be derived from
data.

Deployment can involve scoring (the application of models to new data), the
extraction of model details (for example the rules of a decision tree), or the
integration of data mining models within applications, data warehouse
infrastructure, or query and reporting tools. Data can be mined and the results
returned within a single database transaction.

For example, a sales representative could run a model that predicts the
likelihood of fraud within the context of an online sales transaction.

DATA MINING ISSUES

There are many important implementation issues associated with data mining:

1. Human interaction: Since data mining problems are often not precisely stated,
interfaces may be needed with both domain and technical experts. Technical
experts are used to formulate the queries and assist in interpreting the results.
Users are needed to identify training data and desired results.

2. Overfitting: When a model is generated that is associated with a given database


state it is desirable that the model also fit future database states. Overfitting
occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small
size of the training database. For example, a classification model for an employee
database may be developed to classify employees as short, medium, or tall. If the
training database is quite small, the model might erroneously indicate that a short
person is anyone under five feet eight inches because there is only one entry in
the training database under five feet eight. In this case, many future employees
would be erroneously classified as short. Overfitting can arise under other
circumstances as well, even though the data are not changing.

3. Outliers: There are often many data entries that do not fit nicely into the
derived model. This becomes even more of an issue with very large databases. If
a model is developed that includes these outliers, then the model may not behave
well for data that are not outliers.

4. Interpretation of results: Currently, data mining output may require experts to


correctly interpret the results, which might otherwise be meaningless to the
average database user.




5. Visualization of results: To easily view and understand the output of data


mining algorithms, visualization of the results is helpful.

6. Large datasets: The massive datasets associated with data mining create
problems when applying algorithms designed for small datasets. The running time
of many modeling applications grows exponentially with dataset size, making them
too inefficient for larger datasets. Sampling and parallelization are effective
tools to attack this scalability problem.

7. High dimensionality: A conventional database schema may be composed of


many different attributes. The problem here is that not all attributes may be
needed to solve a given data mining problem. In fact, the use of some attributes
may interfere with the correct completion of a data mining task. The use of other
attributes may simply increase the overall complexity and decrease the efficiency
of an algorithm. This problem is sometimes referred to as the dimensionality
curse, meaning that there are many attributes (dimensions) involved and it is
difficult to determine which ones should be used. One solution to this high
dimensionality problem is to reduce the number of attributes, which is known as
dimensionality reduction. However, determining which attributes are not needed is
not always easy to do.
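One of the simplest dimensionality-reduction ideas is a variance filter: an attribute that barely varies cannot help distinguish records. This is an illustrative sketch with invented attributes and values, not a method prescribed by the module (real projects typically use stronger techniques such as feature selection against the target, or PCA).

```python
from statistics import pvariance

# Hypothetical dataset: each key is an attribute, each list a column.
data = {
    "income": [20, 80, 55, 95, 40],
    "region": [1, 1, 1, 1, 1],   # constant -> carries no information
    "age":    [25, 60, 41, 58, 33],
}

def select_attributes(data, min_variance=0.0):
    """Naive dimensionality reduction: keep only attributes whose
    variance exceeds a threshold (a simple filter method)."""
    return [name for name, col in data.items()
            if pvariance(col) > min_variance]

print(select_attributes(data))  # ['income', 'age']
```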

8. Multimedia data: Most previous data mining algorithms are targeted to


traditional data types (numeric, character, text, etc.). The use of multimedia data
such as is found in GIS databases complicates or invalidates many proposed
algorithms.

9. Missing data: During the preprocessing phase of KDD, missing data may be
replaced with estimates. This and other approaches to handling missing data can
lead to invalid results in the data mining step.

10. Irrelevant data: Some attributes in the database might not be of interest to
the data mining task being developed.

11. Noisy data: Some attribute values might be invalid or incorrect. These values
are often corrected before running data mining applications.

12. Changing data: Databases cannot be assumed to be static. However, most


data mining algorithms do assume a static database. This requires that the
algorithm be completely rerun anytime the database changes.

13. Integration: The KDD process is not currently integrated into normal data
processing activities. KDD requests may be treated as special, unusual, or one-time
needs. This makes them inefficient, ineffective, and not general enough to
be used on an ongoing basis. Integration of data mining functions into traditional
DBMS systems is certainly a desirable goal.

14. Application: Determining the intended use for the information obtained from
the data mining function is a challenge. Indeed, how business executives can
effectively use the output is sometimes considered the more difficult part, not the




running of the algorithms themselves. Because the data are of a type that has not
previously been known, business practices may have to be modified to determine
how to effectively use the information uncovered.

These issues should be addressed by data mining algorithms and products.




1.2 PROBLEM DEFINITION


Problem definition means understanding the project objectives and requirements
from a domain perspective and then converting this knowledge into a data science
problem definition, with a preliminary plan designed to achieve the objectives.
Data science projects are often structured around the specific needs of an
industry sector (as shown below) or even tailored and built for a single
organization. A successful data science project starts from a well-defined
question or need. (Source: KDnuggets)

Table 1. Industries/Fields where you applied Analytics, Data Mining, Data
Science in 2016 (percent of voters per year; vote counts in parentheses)

Industry/Field (2016 votes)            2016     2015     2014
CRM/Consumer analytics (90)            16.3%    18.6%    22.2%
Finance (83)                           15.0%    15.4%    10.9%
Banking (74)                           13.4%    14.3%    16.7%
Advertising (66)                       12.0%     8.9%    10.4%
Science (66)                           12.0%    11.7%    13.6%
Health care (66)                       12.0%    13.4%    16.3%
Fraud Detection (61)                   11.1%    10.0%    13.6%
Retail (57)                            10.3%     9.1%    13.6%
Insurance (51)                          9.2%     7.4%     8.6%
E-commerce (49)                         8.9%    10.3%     9.5%
Telecom / Cable (46)                    8.3%     7.7%     9.0%
Social Media / Social Networks (46)     8.3%    10.3%     8.6%
Software (40)                           7.2%     6.0%     7.2%
IT / Network Infrastructure (40)        7.2%     6.6%     na
Oil / Gas / Energy (39)                 7.1%     8.9%     9.5%
Education (39)                          7.1%    10.0%     7.7%
Credit Scoring (38)                     6.9%     7.1%     8.1%
Supply Chain (36)                       6.5%     na       na
Medical / Pharma (36)                   6.5%     6.0%     7.2%
Other (35)                              6.3%     8.9%    13.6%
Investment / Stocks (34)                6.2%     4.3%     5.0%
Biotech/Genomics (32)                   5.8%     4.9%     6.8%
Manufacturing (31)                      5.6%     6.9%     9.0%
Government/Military (31)                5.6%     7.1%     6.3%
Search / Web content mining (30)        5.4%     6.0%     6.3%
Automotive/Self-Driving Cars (25)       4.5%     4.3%     5.9%
Direct Marketing/Fundraising (24)       4.3%     5.1%     7.2%
Mining (23)                             4.2%     3.7%     na
Travel / Hospitality (22)               4.0%     2.6%     3.2%
Entertainment/Music/TV/Movies (22)      4.0%     3.1%     1.8%
HR/workforce analytics (20)             3.6%     6.3%     5.9%
Mobile apps (18)                        3.3%     1.4%     2.3%
Agriculture (18)                        3.3%     2.9%     na
Games (16)                              2.9%     4.0%     1.8%
Security / Anti-terrorism (15)          2.7%     2.3%     2.3%
Social Good/Non-profit (11)             2.0%     2.3%     1.4%
Social Policy/Survey analysis (10)      1.8%     1.7%     1.8%
Junk email / Anti-spam (6)              1.1%     0.3%     1.8%

Source: http://www.saedsayad.com/problem_definition.htm

Assignment

Activity # 2: Identify a dataset of an industry sector for market analysis on
solving a real-world data mining problem.

1.3 DATA PREPARATION


Data preparation is about constructing a dataset from one or more data sources
to be used for exploration and modeling. It is a solid practice to start with an
initial dataset, to get familiar with the data, discover first insights, and
gain a good understanding of any possible data quality issues. Data preparation
is often a time-consuming process and is heavily prone to errors. The old saying
"garbage in, garbage out" is particularly applicable to data science projects
where the gathered data contain many invalid, out-of-range and missing values.
Analyzing data that has not been carefully screened for such problems can
produce highly misleading results. Thus, the success of a data science project
heavily depends on the quality of the prepared data.
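As a simple illustration of this screening step, the following sketch in plain Python flags missing and out-of-range values in a hypothetical toy dataset (the column names and the rule that blood glucose must be positive are assumptions for illustration):

```python
import csv
import io

# Toy patient data with typical quality problems (hypothetical values):
# one missing glucose reading and one impossible negative reading.
raw = io.StringIO(
    "age,glucose\n"
    "34,5.4\n"
    "41,\n"        # missing glucose value
    "29,-2.0\n"    # out-of-range (negative) glucose value
)

valid, issues = [], []
for row in csv.DictReader(raw):
    if row["glucose"] == "":
        issues.append((row, "missing glucose"))
    elif float(row["glucose"]) <= 0:
        issues.append((row, "out-of-range glucose"))
    else:
        valid.append(row)

print(len(valid), "clean rows;", len(issues), "rows flagged for review")
```

Rows that fail the checks are set aside for review rather than silently analyzed, which is the point of screening before modeling.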

Data
Data are pieces of information, typically the results of measurement (numerical)
or counting (categorical).

Variables serve as placeholders for data. There are two types of
variables, numerical and categorical.

A numerical or continuous variable is one that can accept any value
within a finite or infinite interval (e.g., height, weight, temperature, blood
glucose, ...). There are two types of numerical data, interval and ratio. Data on
an interval scale can be added and subtracted but cannot be meaningfully
multiplied or divided because there is no true zero. For example, we cannot say
that one day is twice as hot as another day. On the other hand, data on a ratio
scale has a true zero and can be added, subtracted, multiplied or divided (e.g.,
weight).

A categorical or discrete variable is one that can accept two or more
values (categories). There are two types of categorical
data, nominal and ordinal. Nominal data does not have an intrinsic ordering in
the categories. For example, "gender" with two categories, male and female. In
contrast, ordinal data does have an intrinsic ordering in the categories. For
example, "level of energy" with three orderly categories (low, medium and
high).

Figure 6.
Structure of Data


Dataset

Dataset is a collection of data, usually presented in tabular form. Each column
represents a particular variable, and each row corresponds to a given member of
the data.

There are some alternative terms for columns, rows and values:

• Columns, Fields, Attributes, Variables
• Rows, Records, Objects, Cases, Instances, Examples, Vectors
• Values, Data

In predictive modeling, the predictors (attributes) are the input variables,
and the target (class attribute) is the output variable whose value is
determined by the values of the predictors through the function of the
predictive model.
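As a sketch of this terminology, the hypothetical toy dataset below separates the predictor columns from the target attribute (the attribute names and values are invented for illustration):

```python
# A toy dataset: each row (record/instance) maps attribute names to values.
# "outlook" and "temp" act as predictors; "play" is the target/class attribute.
rows = [
    {"outlook": "sunny", "temp": 30, "play": "no"},
    {"outlook": "rainy", "temp": 18, "play": "no"},
    {"outlook": "overcast", "temp": 22, "play": "yes"},
]

predictors = [k for k in rows[0] if k != "play"]  # input variables
target = "play"                                   # output variable

X = [[r[k] for k in predictors] for r in rows]  # predictor values per row
y = [r[target] for r in rows]                   # class value per row
print(predictors, y)
```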

Database
A database collects, stores and manages information so users can retrieve, add,
update or remove such information. It presents information in tables with rows
and columns. A table is referred to as a relation in the sense that it is a
collection of objects of the same type (rows). Data in a table can be related
according to common keys or concepts, and the ability to retrieve related data
from related tables is the basis for the term relational database. A Database
Management System (DBMS) handles the way data is stored, maintained, and
retrieved. Most data science toolboxes connect to databases through ODBC (Open
Database Connectivity) or JDBC (Java Database Connectivity) interfaces.


Figure 7. Database

SQL (Structured Query Language) is a database computer language for managing
and manipulating data in relational database management systems (RDBMS).

SQL Data Definition Language (DDL) permits database tables to be created,
altered or deleted. We can also define indexes (keys), specify links between
tables, and impose constraints between database tables.

• CREATE TABLE: creates a new table
• ALTER TABLE: alters a table
• DROP TABLE: deletes a table
• CREATE INDEX: creates an index
• DROP INDEX: deletes an index
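These DDL statements can be tried out with any RDBMS; the sketch below uses Python's built-in sqlite3 module with an in-memory database, and the table, column, and index names are illustrative only:

```python
import sqlite3

# In-memory database, so the DDL sketch leaves no files behind.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# CREATE TABLE: create a new table with typed columns and a primary key.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# CREATE INDEX: create an index on a frequently queried column.
cur.execute("CREATE INDEX idx_customer_city ON customer (city)")

# ALTER TABLE: alter the table by adding a new column.
cur.execute("ALTER TABLE customer ADD COLUMN balance REAL")

# The system catalog confirms the table now exists.
exists = cur.execute(
    "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='customer'"
).fetchone()[0]

# DROP INDEX / DROP TABLE: delete the index and the table again.
cur.execute("DROP INDEX idx_customer_city")
cur.execute("DROP TABLE customer")
gone = cur.execute(
    "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='customer'"
).fetchone()[0]
con.close()
print(exists, gone)
```

Note that DDL syntax varies slightly across database systems; for example, the forms of ALTER TABLE that SQLite supports are more limited than those of larger RDBMS products.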

SQL Data Manipulation Language (DML) is a language which enables users to
access and manipulate data.

• SELECT: retrieval of data from the database
• INSERT INTO: insertion of new data into the database
• UPDATE: modification of data in the database
• DELETE: deletion of data in the database
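A minimal DML round trip, again sketched with Python's sqlite3 module on an in-memory database (the table and the customer data are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")

# INSERT INTO: insert new rows (parameter placeholders avoid SQL injection).
cur.executemany(
    "INSERT INTO customer (id, name, balance) VALUES (?, ?, ?)",
    [(1, "Ana", 100.0), (2, "Ben", 250.0), (3, "Cara", 75.0)],
)

# UPDATE: modify the rows that match a condition.
cur.execute("UPDATE customer SET balance = balance + 50 WHERE name = ?", ("Ben",))

# DELETE: remove the rows that match a condition.
cur.execute("DELETE FROM customer WHERE balance < 100")

# SELECT: retrieve the remaining rows.
rows = cur.execute("SELECT name, balance FROM customer ORDER BY id").fetchall()
print(rows)  # [('Ana', 100.0), ('Ben', 300.0)]
con.close()
```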

ETL (Extraction, Transformation and Loading)

ETL extracts data from data sources and loads it into data destinations using a
set of transformation functions.

• Data extraction provides the ability to extract data from a variety of data
  sources, such as flat files, relational databases, streaming data, XML files,
  and ODBC/JDBC data sources.
• Data transformation provides the ability to cleanse, convert, aggregate,
  merge, and split data.
• Data loading provides the ability to load data into destination databases
  via update, insert or delete statements, or in bulk.
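The three ETL steps above can be sketched end to end in Python with only the standard library; the source file, table name, and toy figures below are assumptions for illustration:

```python
import csv
import os
import sqlite3
import tempfile

# Setup: write a small flat file to act as the data source (toy data).
src = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(src, "w", newline="") as f:
    f.write("region,amount\n north , 120\nsouth,80\nnorth,40\n")

# Extract: read raw records from the flat-file source.
with open(src, newline="") as f:
    raw = list(csv.DictReader(f))

# Transform: cleanse (trim/normalize text), convert types, aggregate by region.
totals = {}
for r in raw:
    region = r["region"].strip().lower()
    totals[region] = totals.get(region, 0.0) + float(r["amount"])

# Load: bulk-insert the aggregated rows into the destination database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)")
con.executemany("INSERT INTO sales_by_region VALUES (?, ?)", sorted(totals.items()))
loaded = con.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(loaded)  # [('north', 160.0), ('south', 80.0)]
```

Real ETL tools apply the same pattern at scale, with many more source types and transformation functions than this sketch shows.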


Figure 8. ETL Process

Activity # 3: Implement basic pre-processing of a dataset by applying
structured query methods using statistical software.

Use the credit default datasets.

Credit Default Datasets


credit_test.csv
credit_train.csv


Self-Evaluation Name: ______________________________________ Date: __________


Program/Yr/Section: __________________________ Score: _________
Try to answer the questions below to test your knowledge of this lesson.

1. Gather temperature data at one location every hour starting at 8:00 a.m. for
12 straight hours on 3 different days.
Requirement:
a. Plot the three sets of time series data on the same graph.
b. Analyze the three curves. Do they behave in the same manner? Does
there appear to be a trend in the temperature during the day?
c. Are the three plots similar?
d. Predict what the next temperature value would have been for the next
hour in each of the 3 days.
e. Compare your prediction with the actual value that occurred.

2. Find at least three examples of data mining applications that have appeared in
the business section of your local newspaper or other news publication.
Describe the data mining applications involved.

Review of Concepts Data mining is the task of discovering interesting patterns from large amounts of
data, where the data can be stored in databases, data warehouses, or other
information repositories. It is a young interdisciplinary field, drawing from areas
such as database systems, data warehousing, statistics, machine learning, data
visualization, information retrieval, and high-performance computing. Other
contributing areas include neural networks, pattern recognition, spatial data
analysis, image databases, signal processing, and many application fields, such as
business, economics, and bioinformatics.

Data mining techniques are the result of a long process of research and product
development, an evolution that began when business data was first stored on
computers. Data mining also allows users to navigate through their data in real
time. The business community uses data mining because it is supported by three
technologies that are now mature: massive data collection, powerful
multiprocessor computers, and data mining algorithms.

These advanced techniques must be applied well and fully integrated with
business data analysis tools; without such integration, operating data mining
tools requires extra steps for extracting and importing the data.

Furthermore, the data mining approaches applied have issues and limitations,
such as the varying versatility of the approaches, that can dictate the choice
of mining methodology.

References Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques,
3rd edition. Morgan Kaufman.


Sayad, S. (2010-2021). An Introduction to Data Mining.
http://www.saedsayad.com/data_mining

Dunham, M.H. (2003). Data Mining Introductory and Advanced Topics. Pearson
Education Inc. Upper Saddle River, New Jersey.

Data Mining Concepts. Oracle Database Online Documentation Library, 11g
Release 2 (11.2).
https://docs.oracle.com/cd/E11882_01/datamine.112/e16808/process.htm#DMCON002
