
Romblon State University

San Fernando, Romblon


Technology Education Department

RSU-SFC TechEd Form No. 005: Course Module Format

Contents

Topic: Introduction
CONTENT
1. WHAT IS DATA MINING?
1.1. DEFINITION OF DATA MINING
1.1.1. Major Sources of Abundant data
1.1.2. Need for turning data into knowledge – Drowning in data, but starving for knowledge
1.1.3. Applications that use data mining
1.2. DIFFERENT KINDS OF DATA MINING
1.3. DATA MINING TECHNIQUES
1.4. DATA MINING TOOLS
1.5. MAJOR ISSUES IN DATA MINING
1.6. DATA MINING TECHNOLOGIES
ASSESSMENT TASK
General Instructions
References

IM 3
“Shine and Serve with Honor and Excellence.”

Module 1
Course: IM 3 Fundamentals of Data Warehousing and Data Mining
Unit No. 1
Topic: Introduction
Score:
Name:
Year & Section:
Date:

This chapter gives a brief introduction to data mining. The discussion covers the definitions of
data mining, the stages of the data mining process, and data mining models, followed by a brief
description of data mining methods and some applications and examples of data mining.

Learning Objectives:
At the end of the lesson, you should be able to:
1. Explain the fundamental principles of data mining
2. Discuss the evolving role of data mining in several application areas and industries
3. Justify the potential use and/or application of data mining in unexplored areas

CONTENT

1. WHAT IS DATA MINING?


The past two decades have seen a dramatic increase in the amount of information or data being
stored in electronic format. This accumulation of data has taken place at an explosive rate. It has
been estimated that the amount of information in the world doubles every 20 months and the size
and number of databases are increasing even faster. The increase in use of electronic data
gathering devices such as point-of-sale or remote sensing devices has contributed to this explosion
of available data. Figure 1, from the Red Brick company, illustrates the data explosion.

1.1. DEFINITION OF DATA MINING

Data mining is defined as extracting information from huge sets of data. In other words, data
mining is the procedure of mining knowledge from data. The information or knowledge so
extracted can be used for any of the following applications:
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration


1.1.1. Major Sources of Abundant data:

• Business – Web, E-commerce, Transactions, Stocks
• Science – Remote Sensing, Bioinformatics, Scientific Simulation
• Society and Everyone – News, Digital Cameras, YouTube
1.1.2. Need for turning data into knowledge – Drowning in data, but starving for knowledge

1.1.3. Applications that use data mining:

• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Scientific Exploration

Definition of Data Mining:

• Extracting or “mining” knowledge from large amounts of data.
• “Knowledge mining from data” is analogous to “gold mining from rock or sand”.

Other terms for Data Mining:

• Knowledge Mining
• Knowledge Extraction
• Pattern Analysis
• Data Archeology
• Data Dredging

Data mining is not the same as KDD (Knowledge Discovery from Data); data mining is one step in
the KDD process. The steps of KDD are:

1. Data Cleaning – Remove noisy and inconsistent data
2. Data Integration – Combine multiple data sources
3. Data Selection – Retrieve the data relevant to the analysis
4. Data Transformation – Transform data into a form suitable for data mining (summarized /
aggregated)
5. Data Mining – Extract data patterns using intelligent methods
6. Pattern Evaluation – Identify interesting patterns
7. Knowledge Presentation – Present the mined knowledge to the user through visualization and
knowledge representation
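
The KDD steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two record lists, the function names, and the thresholds are all hypothetical, and the "mining" step is a simple stand-in (flagging high-spending customers) rather than a real mining algorithm.

```python
# Hypothetical toy records from two sources; None marks a missing value.
store_a = [{"customer": "ana", "amount": 120.0}, {"customer": "ben", "amount": None}]
store_b = [{"customer": "ana", "amount": 80.0}, {"customer": "cara", "amount": 40.0}]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, minimum):
    """Data selection: keep only the records relevant to the analysis."""
    return [r for r in records if r["amount"] >= minimum]

def transform(records):
    """Data transformation: aggregate total spending per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals, threshold):
    """Data mining (stand-in): keep customers whose total exceeds a threshold."""
    return {c: t for c, t in totals.items() if t > threshold}

def run_kdd():
    records = integrate(clean(store_a), clean(store_b))
    records = select(records, minimum=50.0)
    totals = transform(records)
    patterns = mine(totals, threshold=100.0)
    # Pattern evaluation and knowledge presentation: sort and format for the user.
    return [f"{c}: {t:.2f}" for c, t in sorted(patterns.items())]

print(run_kdd())
```

Each function corresponds to one KDD step, which makes it easy to see that data mining itself is only one stage between preparation and presentation.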


1.2. DIFFERENT KINDS OF DATA MINING:

Several major data mining techniques have been developed and used in recent data mining
projects, including:
• association;
• classification;
• clustering;
• prediction;
• sequential patterns; and
• decision trees.

We will briefly examine those data mining techniques in the following sections.

Association:
Association is one of the best-known data mining techniques. In association, a pattern is
discovered based on a relationship between items in the same transaction. That is why the
association technique is also known as the relation technique. It is used in market basket
analysis to identify sets of products that customers frequently purchase together.

Retailers use the association technique to research customers’ buying habits. Based on historical
sales data, retailers might find that customers always buy crisps when they buy beer; they can
therefore put beer and crisps next to each other to save customers time and increase sales.
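
As a rough illustration of market basket analysis, the sketch below measures how strongly "beer" is associated with "crisps" using two common measures, support and confidence. The transactions and function names are hypothetical and chosen only for this example.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"beer", "crisps", "milk"},
    {"bread", "milk"},
    {"beer", "bread"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

rule_support = support({"beer", "crisps"}, transactions)
rule_confidence = confidence({"beer"}, {"crisps"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, beer and crisps appear together in 3 of 5 baskets, and 3 of the 4 baskets with beer also contain crisps, which is the kind of evidence a retailer would use before placing the two products side by side.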

Classification

Classification is a classic data mining technique based on machine learning. Basically,
classification is used to assign each item in a set of data to one of a predefined set of classes
or groups. Classification methods make use of mathematical techniques such as decision trees,
linear programming, neural networks, and statistics. In classification, we develop software that
can learn how to classify data items into groups. For example, we can apply classification to
the problem: “given all records of employees who left the company, predict who will
probably leave the company in a future period.” In this case, we divide the employee records
into two groups named “leave” and “stay”, and then ask our data mining software to classify
the employees into these groups.
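
The employee example above can be sketched with a very simple classifier. Note the assumptions: the module does not prescribe an algorithm here, so this sketch swaps in 1-nearest-neighbor for brevity, and the features (years at the company, satisfaction score) and training records are hypothetical.

```python
import math

# Hypothetical labeled records: (years at company, satisfaction 1-10) -> class
training = [
    ((1, 3), "leave"),
    ((2, 4), "leave"),
    ((1, 2), "leave"),
    ((6, 8), "stay"),
    ((8, 7), "stay"),
    ((5, 9), "stay"),
]

def classify(record, training):
    """Assign the class of the closest training record (1-nearest neighbor)."""
    nearest = min(training, key=lambda item: math.dist(record, item[0]))
    return nearest[1]

print(classify((2, 3), training))
print(classify((7, 8), training))
```

The important point is that the classes "leave" and "stay" are fixed in advance; the software only learns how to assign new records to them.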

Clustering

Clustering is a data mining technique that automatically forms meaningful or useful clusters of
objects that have similar characteristics. The clustering technique defines the classes and puts
objects in each class, whereas in classification, objects are assigned to predefined classes. To
make the concept clearer, take book management in a library as an example. A library holds a
wide range of books on various topics. The challenge is how to arrange those books so that
readers can find several books on a particular topic without hassle. By using the clustering
technique, we can keep books that have some kind of similarity in one cluster, or on one shelf,
and label it with a meaningful name. Readers who want books on that topic only have to go to
that shelf instead of searching the entire library.
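
To make the contrast with classification concrete, here is a minimal sketch of k-means clustering on one-dimensional data. It is an assumption-laden toy: the module does not name a clustering algorithm, and the data (book page counts), the starting centroids, and the function name are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Repeatedly assign each value to its nearest centroid, then move each
    centroid to the mean of the values assigned to it."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

page_counts = [90, 110, 100, 480, 520, 500]  # hypothetical book lengths
clusters, centroids = kmeans_1d(page_counts, centroids=[0.0, 1000.0])
print(clusters, centroids)
```

No class labels are supplied; the two groups (short books and long books) emerge from the data alone, which is what distinguishes clustering from classification.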

Prediction

Prediction, as its name implies, is a data mining technique that discovers the relationships
among independent variables and the relationship between dependent and independent
variables. For instance, prediction analysis can be used in sales to predict future profit: if we
treat sales as the independent variable, profit is the dependent variable. Then, based on
historical sales and profit data, we can draw a fitted regression curve that is used for profit
prediction.
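
The sales and profit example can be sketched with a straight-line fit by ordinary least squares. The figures are hypothetical and deliberately lie on a perfect line; real data would be noisier, and the fitted curve need not be a straight line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

sales = [10.0, 20.0, 30.0, 40.0]   # independent variable (hypothetical)
profit = [3.0, 5.0, 7.0, 9.0]      # dependent variable (hypothetical)

slope, intercept = fit_line(sales, profit)
predicted_profit = slope * 50.0 + intercept  # forecast for sales of 50
print(slope, intercept, predicted_profit)
```

With these numbers the fitted line is approximately profit = 0.2 × sales + 1, so the forecast for sales of 50 is about 11.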


Sequential Patterns

Sequential pattern analysis is a data mining technique that seeks to discover or identify
similar patterns, regular events, or trends in transaction data over a business period.

In sales, with historical transaction data, businesses can identify sets of items that customers
buy together at different times of the year. Businesses can then use this information to
recommend those items to customers with better deals, based on their past purchasing frequency.
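
One simple way to look for sequential patterns is to count how many customers bought one item and later bought another. The sketch below does this for ordered pairs only; the purchase histories, the function name, and the frequency threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each listed in time order.
histories = {
    "ana": ["phone", "case", "charger"],
    "ben": ["phone", "charger"],
    "cara": ["case", "phone", "charger"],
}

def sequential_pairs(histories):
    """Count the customers who bought each ordered pair (earlier item, later item)."""
    counts = Counter()
    for items in histories.values():
        # combinations() keeps list order, so every pair is (earlier, later);
        # set() makes sure a customer is counted once per pair.
        counts.update(set(combinations(items, 2)))
    return counts

counts = sequential_pairs(histories)
frequent = [pair for pair, n in counts.items() if n >= 3]
print(frequent)
```

Here every customer bought a charger some time after buying a phone, so a store might offer chargers at a discount to recent phone buyers.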

Decision trees

A decision tree is one of the most commonly used data mining techniques because its model
is easy for users to understand. In the decision tree technique, the root of the tree is a simple
question or condition that has multiple answers. Each answer then leads to a further set of
questions or conditions that help us narrow down the data so that we can make a final decision
based on it. For example, we use the following decision tree to determine whether or not to play
tennis:

Starting at the root node, if the outlook is overcast, then we should definitely play tennis. If it
is rainy, we should play tennis only if the wind is weak. And if it is sunny, we should play tennis
only if the humidity is normal.
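
The tennis rules described above can be written directly as nested conditions, which is essentially what a decision tree is. The function name and the string values used for each attribute are assumptions made for this sketch.

```python
def play_tennis(outlook, humidity, wind):
    """Walk the decision tree from the root (outlook) down to a yes/no leaf."""
    if outlook == "overcast":
        return True                   # overcast: always play
    if outlook == "rainy":
        return wind == "weak"         # rainy: play only if the wind is weak
    if outlook == "sunny":
        return humidity == "normal"   # sunny: play only if humidity is normal
    raise ValueError(f"unknown outlook: {outlook!r}")

print(play_tennis("overcast", "high", "strong"))
print(play_tennis("rainy", "normal", "strong"))
print(play_tennis("sunny", "normal", "weak"))
```

In practice the tree is learned from historical data rather than written by hand, but the resulting model can be read the same way.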

We often combine two or more of those data mining techniques together to form an appropriate
process that meets the business needs.

1. Classification analysis

This analysis is used to retrieve important and relevant information about data and metadata. It
is used to classify data into different classes. Classification is similar to clustering in that
it also segments data records into different segments, called classes. But unlike clustering, here
the data analysts know the classes in advance. So, in classification analysis you apply
algorithms to decide how new data should be classified. A classic example of classification
analysis is Outlook email, which uses certain algorithms to characterize an email as legitimate
or spam.

2. Association rule learning

This refers to a method that can help you identify interesting relations (dependency
modeling) between different variables in large databases. This technique can help you uncover
hidden patterns in the data that can be used to identify variables within the data and the
co-occurrence of different variables that appear very frequently in the dataset. Association rules
are useful for examining and forecasting customer behavior, and are highly recommended in
retail industry analysis. The technique is used in shopping basket data analysis, product
clustering, catalog design, and store layout. In IT, programmers use association rules to build
programs capable of machine learning.

3. Anomaly or outlier detection


This refers to the identification of data items in a dataset that do not match an expected pattern
or expected behavior. Anomalies are also known as outliers, novelties, noise, deviations, and
exceptions. Often they provide critical and actionable information. An anomaly is an item that
deviates considerably from the common average within a dataset or a combination of data.
Such items are statistically distant from the rest of the data, which indicates that something out
of the ordinary has happened and requires additional attention. This technique can be used in a
variety of domains, such as intrusion detection, system health monitoring, fraud detection, fault
detection, event detection in sensor networks, and detecting ecosystem disturbances. Analysts
often remove the anomalous data from the dataset to discover results with increased accuracy.
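
A common first approach to spotting items that deviate considerably from the average is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes a threshold of 2 standard deviations and uses hypothetical daily login counts; both are illustrative choices, not recommendations from this module.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean of the data."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

daily_logins = [50, 52, 49, 51, 48, 50, 53, 47, 50, 120]  # hypothetical
outliers = find_outliers(daily_logins)
print(outliers)
```

The value 120 is flagged because it lies about three standard deviations from the mean, while the ordinary days stay well within the threshold.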

4. Clustering analysis

A cluster is a collection of data objects that are similar to one another within the same cluster
and dissimilar or unrelated to the objects in other clusters. Clustering analysis is the process of
discovering groups and clusters in the data in such a way that the degree of association between
two objects is highest if they belong to the same group and lowest otherwise. The results of this
analysis can be used to create customer profiles.

5. Regression analysis

In statistical terms, regression analysis is the process of identifying and analyzing the
relationship among variables. It can help you understand how the characteristic value of the
dependent variable changes if any one of the independent variables is varied. This means one
variable is dependent on another, but not vice versa. It is generally used for prediction and
forecasting.

All of these techniques can help analyze different data from different perspectives. Now you have
the knowledge to decide the best technique to summarize data into useful information –
information that can be used to solve a variety of business problems to increase revenue and
customer satisfaction, or to decrease unwanted costs.

1.3. DATA MINING TECHNIQUES:

1. Classification: This analysis is used to retrieve important and relevant information about
data and metadata. This data mining method helps to classify data into different classes.
2. Clustering: Clustering analysis is a data mining technique for identifying data that are
similar to each other. This process helps to understand the differences and similarities
between the data.
3. Regression: Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to identify the likelihood of a specific variable,
given the presence of other variables.
4. Association Rules: This data mining technique helps to find the association between two
or more items. It discovers hidden patterns in the data set.
5. Outlier detection: This type of data mining technique refers to the observation of data
items in the dataset that do not match an expected pattern or expected behavior. This
technique can be used in a variety of domains, such as intrusion detection and fraud or
fault detection. Outlier detection is also called outlier analysis or outlier mining.
6. Sequential Patterns: This data mining technique helps to discover or identify similar patterns
or trends in transaction data over a certain period.


7. Prediction: Prediction uses a combination of other data mining techniques, such as
trends, sequential patterns, clustering, and classification. It analyzes past events or
instances in the right sequence to predict a future event.

DATA MINING TECHNIQUES (IN DETAIL):

1. Tracking patterns. One of the most basic techniques in data mining is learning to recognize
patterns in your data sets. This is usually a recognition of some aberration in your data
happening at regular intervals, or an ebb and flow of a certain variable over time. For
example, you might see that your sales of a certain product seem to spike just before the
holidays, or notice that warmer weather drives more people to your website.
2. Classification. Classification is a more complex data mining technique that forces you to
collect various attributes together into discernable categories, which you can then use to
draw further conclusions, or serve some function. For example, if you’re evaluating data
on individual customers’ financial backgrounds and purchase histories, you might be able
to classify them as “low,” “medium,” or “high” credit risks. You could then use these
classifications to learn even more about those customers.
3. Association. Association is related to tracking patterns, but is more specific to dependently
linked variables. In this case, you’ll look for specific events or attributes that are highly
correlated with another event or attribute; for example, you might notice that when your
customers buy a specific item, they also often buy a second, related item. This is usually
what’s used to populate “people also bought” sections of online stores.
4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give
you a clear understanding of your data set. You also need to be able to identify anomalies,
or outliers in your data. For example, if your purchasers are almost exclusively male, but
during one strange week in July, there’s a huge spike in female purchasers, you’ll want to
investigate the spike and see what drove it, so you can either replicate it or better
understand your audience in the process.
5. Clustering. Clustering is very similar to classification, but involves grouping chunks of data
together based on their similarities. For example, you might choose to cluster different
demographics of your audience into different packets based on how much disposable
income they have, or how often they tend to shop at your store.
6. Regression. Regression, used primarily as a form of planning and modeling, is used to
identify the likelihood of a certain variable, given the presence of other variables. For
example, you could use it to project a certain price, based on other factors like availability,
consumer demand, and competition. More specifically, regression’s main focus is to help
you uncover the exact relationship between two (or more) variables in a given data set.
7. Prediction. Prediction is one of the most valuable data mining techniques, since it’s used
to project the types of data you’ll see in the future. In many cases, just recognizing and
understanding historical trends is enough to chart a somewhat accurate prediction of
what will happen in the future. For example, you might review consumers’ credit histories
and past purchases to predict whether they’ll be a credit risk in the future.

1.4. DATA MINING TOOLS:


So do you need the latest and greatest machine learning technology to be able to apply these
techniques? Not necessarily. In fact, you can probably accomplish some cutting-edge data
mining with relatively modest database systems, and simple tools that almost any company will
have. And if you don’t have the right tools for the job, you can always create your own. However
you approach it, data mining is the best collection of techniques you have for making the most
out of the data you’ve already gathered. As long as you apply the correct logic, and ask the right
questions, you can walk away with conclusions that have the potential to revolutionize your
enterprise.

Challenges of Implementing Data Mining:

• Skilled experts are needed to formulate the data mining queries.
• Overfitting: Due to a small training database, a model may not fit future states.
• Data mining needs large databases, which are sometimes difficult to manage.
• Business practices may need to be modified in order to use the information uncovered.


• If the data set is not diverse, data mining results may not be accurate.
• Integrating the information needed from heterogeneous databases and global information
systems can be complex.

1.5. MAJOR ISSUES IN DATA MINING

Mining Methodology Issues:

• Mining different kinds of knowledge in databases.
• Incorporation of background knowledge.
• Handling noisy or incomplete data.
• Pattern Evaluation – Interestingness Problem.

User Interaction Issues:

• Interactive mining of knowledge at multiple levels of abstraction.
• Data mining query languages and ad-hoc data mining.

Performance Issues:

• Efficiency and scalability of data mining algorithms.
• Parallel, distributed and incremental mining algorithms.

Issues related to diversity of data types:

• Handling of relational and complex types of data.
• Mining information from heterogeneous databases and global information systems.

1.6. DATA MINING TECHNOLOGIES:


As a highly application-driven domain, data mining has incorporated many techniques from other
domains such as statistics, machine learning, pattern recognition, database and data warehouse
systems, information retrieval, visualization, algorithms, high performance computing, and many
application domains (Figure ). The interdisciplinary nature of data mining research and
development contributes significantly to the success of data mining and its extensive applications.
In this section, we give examples of several disciplines that strongly influence the development of
data mining methods.

Supplemental Material:

https://youtube.be/grRwJ5jZBog


ASSESSMENT TASK
A. Create a Timeline
1. Read the article on "Weather Prediction Problem".
2. Create a timeline / evolution of data mining methods used in weather
prediction problem.
3. Use Google Scholar to search for articles on weather prediction data mining
techniques.
4. Review the developments in weather prediction.
5. Present the historical developments in class.
B. Essay
1. Explain the potential application of data mining in unexplored areas such as:
• Natural resource mining
• Cultural or historical artifacts
• Marine biodiversity
2. If these areas can be explored using data mining techniques, what are the
potential outputs and outcomes from using Data Mining Techniques? Justify
your answer.
General Instructions:
1. Accomplish the quiz individually.
2. Submit your answers by taking clear pictures of them and sending them to your
teacher through Facebook Messenger and Gmail.
3. Submit your answers on or before February 12, 2021 (11:55 pm, Philippine Standard Time).

Prepared by: Ella B. Paloma

References:
Anderberg, M. R., Cluster Analysis for Applications, New York:
Academic Press, 1973.
Chambers, J. M., Computational Methods for Data Analysis, New York:
John Wiley & Sons, 1977.
Cleveland, W. S., Dynamic Graphics for Statistics,
Wadsworth and Brooks/Cole, 1988.

Proofread by:

Lect. Jessebelle Garcia

Validated by:

Laarni R. Hellwig, MSCS


Head, Technology Education Department

Recommended for Approval by:

Carmen J. Riva, Ph.D.


Dean of Instruction

Approved by:

Emelia B. Ramos, Ph.D.


Campus Director