
DATA MINING

Learning Outcomes:
• Discuss motivation and challenges of data mining
• Perform data mining tasks
• Identify data set types
• Distinguish types of variables
• Assess data quality

What is Data Mining?

Data mining refers to the extraction of information from large amounts of data. In computer science, it is sometimes also called knowledge discovery in databases (KDD). Data mining is about finding new information in a lot of data, and the information obtained is ideally both new and useful.
In many cases, data is stored so it can be used later; that is, the data is saved with a goal. For example, a store wants to record what has been bought so that it knows how much stock to buy in order to have enough to sell later. Saving this information produces a lot of data, which is usually stored in a database. The purpose for which the data is originally saved is called its first use.
Later, the same data can also be used to obtain other information that was not needed for the first use. The store might now want to know what kinds of things people buy together when they shop at the store. That kind of information is in the data and is useful, but it was not the reason the data was saved. This new, potentially useful information is a second use for the same data.

Motivation and Challenges of Data Mining

The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis to engineering design and scientific exploration.
Data mining can be viewed as a result of the natural evolution of information technology.
An evolutionary path has been witnessed in the database industry in the development of the
following functionalities: data collection and database creation, data management (including data
storage and retrieval, and database transaction processing), and data analysis and understanding
(involving data warehousing and data mining). For instance, the early development of data
collection and database creation mechanisms served as a prerequisite for later development of
effective mechanisms for data storage and retrieval, and query and transaction processing. With
numerous database systems offering query and transaction processing as common practice, data
analysis and understanding has naturally become the next target.
Data can now be stored in many different types of databases. One database architecture
that has recently emerged is the data warehouse, a repository of multiple heterogeneous data
sources, organized under a unified schema at a single site in order to facilitate management
decision making. Data warehouse technology includes data cleansing, data integration, and On-
Line Analytical Processing (OLAP), that is, analysis techniques with functionalities such as
summarization, consolidation, and aggregation, as well as the ability to view information from
different angles. Although OLAP tools support multidimensional analysis and decision making,
additional data analysis tools are required for in-depth analysis, such as data classification,
clustering, and the characterization of data changes over time.
The abundance of data, coupled with the need for powerful data analysis tools, has been
described as a "data rich but information poor" situation. The fast-growing, tremendous amount of
data, collected and stored in large and numerous databases, has far exceeded our human ability for
comprehension without powerful tools. As a result, data collected in large databases become "data
tombs" -- data archives that are seldom revisited. Consequently, important decisions are often
made based not on the information-rich data stored in databases but rather on a decision maker's
intuition, simply because the decision maker does not have the tools to extract the valuable
knowledge embedded in the vast amounts of data. In addition, consider current expert system
technologies, which typically rely on users or domain experts to manually input knowledge into
knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely
time-consuming and costly. Data mining tools which perform data analysis may uncover important
data patterns, contributing greatly to business strategies, knowledge bases, and scientific and
medical research. The widening gap between data and information calls for a systematic
development of data mining tools which will turn data tombs into "golden nuggets" of knowledge.
Data Mining Process
The whole process of data mining cannot be completed in a single step; in other words, you cannot get the required information from large volumes of data just like that. It is a more complex process than it might seem, involving a number of steps. These steps, namely data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation, are to be completed in the given order.
Types of Data Mining Processes
Different data mining processes can be classified into two types: data preparation (or data preprocessing) and data mining. In fact, the first four steps, namely data cleaning, data integration, data selection, and data transformation, are considered data preparation processes. The last three steps, data mining, pattern evaluation, and knowledge representation, are integrated into one process called data mining.
a. Data Cleaning
is the process in which the data gets cleaned. Data in the real world is normally incomplete, noisy, and inconsistent. The data available in data sources might lack attribute values, attributes of interest, etc. For example, suppose you want the demographic data of customers, but the available data does not include attributes for the gender or age of the customers; then the data is incomplete. Sometimes the data might contain errors or outliers; an example is an age attribute with the value 200, which is obviously wrong. The data could also be inconsistent; for example, the name of an employee might be stored differently in different data tables or documents. If the data is not clean, the data mining results will be neither reliable nor accurate.
Data cleaning involves a number of techniques, including filling in missing values manually, combined computer and human inspection, etc. The output of the data cleaning process is adequately cleaned data, as in the sketch below.
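A minimal sketch of typical cleaning steps using the pandas library; the customer table, column names, and fill rules are illustrative assumptions, not part of the text:

import pandas as pd

# Hypothetical customer records with the problems described above:
# a missing gender, a missing age, and an impossible age of 200.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, None, 200, 41],
    "gender": ["F", "M", None, "F"],
})

# Treat impossible ages as missing (an age of 200 is clearly an error).
df.loc[~df["age"].between(0, 120), "age"] = None

# Fill missing ages with the median and missing genders with a flag value.
df["age"] = df["age"].fillna(df["age"].median())
df["gender"] = df["gender"].fillna("unknown")

print(df)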
b. Data Integration
is the process in which data from different data sources is combined into one. Data lies in different formats and in different locations: it could be stored in databases, text files, spreadsheets, documents, data cubes, the Internet, and so on.
Data integration is a complex and tricky task because data from different sources normally does not match. Suppose a table A contains an attribute named customer_id whereas another table B contains an attribute named number; it is difficult to be sure whether both of these refer to the same thing. Metadata can be used effectively to reduce errors in the data integration process. Another issue is data redundancy: the same data might be available in different tables in the same database or even in different data sources. Data integration tries to reduce redundancy to the maximum possible level without affecting the reliability of the data, as in the sketch below.
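A minimal sketch of integration with pandas, assuming (as in the example above) that table A's customer_id and table B's number identify the same customers; all table contents are made up:

import pandas as pd

# Two tables from different sources describing the same customers.
table_a = pd.DataFrame({"customer_id": [1, 2, 3],
                        "city": ["Manila", "Cebu", "Davao"]})
table_b = pd.DataFrame({"number": [1, 2, 4],
                        "total_spent": [120.0, 80.5, 50.0]})

# Rename so both tables share one schema, then merge into a single view.
table_b = table_b.rename(columns={"number": "customer_id"})
integrated = table_a.merge(table_b, on="customer_id", how="outer")

# Drop exact duplicate rows, a simple form of redundancy reduction.
integrated = integrated.drop_duplicates()
print(integrated)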
c. Data Selection
The data mining process requires large volumes of historical data for analysis, so the data repository with integrated data usually contains much more data than is actually required. From the available data, the data of interest needs to be selected and stored. Data selection is the process in which the data relevant to the analysis is retrieved from the database, as in the sketch below.
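A minimal sketch of selection with pandas; the repository contents and the relevance criterion are illustrative assumptions:

import pandas as pd

# Hypothetical integrated repository with more data than the analysis needs.
repo = pd.DataFrame({"customer_id": [1, 2, 3],
                     "city": ["Manila", "Cebu", "Davao"],
                     "total_spent": [120.0, 0.0, 50.0],
                     "notes": ["", "vip", ""]})

# Keep only the rows and attributes of interest for the analysis.
selected = repo.loc[repo["total_spent"] > 0, ["customer_id", "total_spent"]]
print(selected)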
d. Data Transformation
is the process of transforming and consolidating the data into forms that are suitable for mining. It normally involves normalization, aggregation, generalization, etc. For example, a data set available as "-5, 37, 100, 89, 78" can be transformed into "-0.05, 0.37, 1.00, 0.89, 0.78" by dividing each value by the maximum absolute value; here the data becomes more suitable for data mining. After data transformation, the available data is ready for data mining.
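A minimal sketch reproducing the normalization example above in plain Python, assuming the transformation divides each value by the maximum absolute value (here 100):

# Scale each value by the maximum absolute value in the data set.
values = [-5, 37, 100, 89, 78]
max_abs = max(abs(v) for v in values)
normalized = [v / max_abs for v in values]
print(normalized)  # [-0.05, 0.37, 1.0, 0.89, 0.78]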
e. Data Mining
is the core process where a number of complex and intelligent methods are applied to extract patterns from the data. This process includes a number of tasks such as association, classification, prediction, clustering, time series analysis, and so on.
f. Pattern Evaluation
Pattern evaluation identifies the truly interesting patterns representing knowledge, based on different types of interestingness measures. A pattern is considered interesting if it is potentially useful, easily understandable by humans, validates some hypothesis that someone wants to confirm, or holds on new data with some degree of certainty.
g. Knowledge Representation
The information mined from the data needs to be presented to the user in an appealing
way. Different knowledge representation and visualization techniques are applied to
provide the output of data mining to the users.
Data Mining Tasks

Data mining tasks can be classified generally into two types based on what a specific task tries to achieve. Those two categories are descriptive tasks and predictive tasks.

Descriptive data mining tasks characterize the general properties of data, whereas predictive data mining tasks perform inference on the available data set to predict how a new data set will behave.
There are a number of data mining tasks such as classification, prediction, time-series analysis, association, clustering, summarization, etc. All of these tasks are either predictive or descriptive data mining tasks. A data mining system can execute one or more of these tasks as part of data mining.

Figure 4: Data Mining Tasks

Predictive Data Mining

Predictive data mining tasks build a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. A medical practitioner trying to diagnose a disease based on the medical test results of a patient is an example of a predictive data mining task.

a. Classification

Classification derives a model to determine the class of an object based on its attributes. A collection of records is available, each record with a set of attributes. One of the attributes is the class attribute, and the goal of the classification task is to assign a class to new records as accurately as possible.
Classification can be used in direct marketing, that is, to reduce marketing costs by targeting the set of customers who are likely to buy a new product. Using the available data, it is possible to know which customers purchased similar products in the past and which did not. Hence, {purchase, don't purchase} forms the class attribute in this case. Once the class attribute is assigned, demographic and lifestyle information of customers who purchased similar products can be collected and promotional mail can be sent to them directly. A minimal sketch follows.
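A minimal sketch of classification using a decision tree from scikit-learn; the records, attributes, and labels are made-up stand-ins for the {purchase, don't purchase} example:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical records: [age, income]; class: 1 = purchase, 0 = don't purchase.
X = [[25, 30000], [40, 80000], [35, 60000], [22, 20000], [50, 90000]]
y = [0, 1, 1, 0, 1]

# Derive a model from the labeled records ...
model = DecisionTreeClassifier().fit(X, y)

# ... and assign a class to a new, unseen record.
print(model.predict([[30, 55000]]))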

b. Prediction
The prediction task predicts the possible values of missing or future data. Prediction involves developing a model based on the available data, and this model is used to predict the future values of a new data set of interest. For example, a model can predict the income of an employee based on education, experience, and other demographic factors like place of stay, gender, etc. Prediction analysis is also used in different areas including medical diagnosis, fraud detection, etc. A minimal sketch follows.
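A minimal sketch of the income-prediction example using linear regression from scikit-learn; the education and experience figures are invented for illustration:

from sklearn.linear_model import LinearRegression

# Hypothetical records: [years of education, years of experience] -> income.
X = [[12, 2], [16, 5], [16, 10], [18, 7], [12, 20]]
y = [25000, 48000, 60000, 65000, 52000]

# Develop a model from the available data, then predict for a new employee.
model = LinearRegression().fit(X, y)
print(model.predict([[16, 8]]))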
c. Time-Series Analysis
A time series is a sequence of events where the next event is determined by one or more of the preceding events. A time series reflects the process being measured, and there are certain components that affect the behavior of the process. Time-series analysis includes methods to analyze time-series data in order to extract useful patterns, trends, rules, and statistics. Stock market prediction is an important application of time-series analysis. A minimal sketch follows.
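A minimal sketch of one simple time-series technique: a moving average that smooths a made-up daily price series to expose its trend (the text does not prescribe a specific method):

# Smooth a short price series with a 3-day simple moving average.
prices = [100, 102, 101, 105, 107, 106, 110, 112, 111, 115]
window = 3
trend = [sum(prices[i:i + window]) / window
         for i in range(len(prices) - window + 1)]
print(trend)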

Descriptive Data Mining

Descriptive data mining tasks find patterns describing the data and come up with new, significant information from the available data set. A retailer trying to identify products that are purchased together is an example of a descriptive data mining task.

d. Association
Association discovers connections among a set of items; it identifies relationships between objects. Association analysis is used for commodity management, advertising, catalog design, direct marketing, etc. A retailer can identify the products that customers normally purchase together, or even find the customers who respond to the promotion of the same kind of products. If a retailer finds that beer and nappies are mostly bought together, he can put nappies on sale to promote the sale of beer. A minimal sketch follows.
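A minimal sketch of association discovery by counting item pairs that occur together in made-up market-basket transactions (real systems use algorithms such as Apriori; this shows only the counting idea):

from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (sets of items bought together).
transactions = [{"beer", "nappies", "chips"}, {"beer", "nappies"},
                {"milk", "bread"}, {"beer", "nappies", "milk"}]

# Count how often each pair of items co-occurs across transactions.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests an association (here beer and nappies).
print(pair_counts.most_common(1))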
e. Clustering
Clustering is used to identify data objects that are similar to one another. The similarity can be decided based on a number of factors like purchase behavior, responsiveness to certain actions, geographical location, and so on. For example, an insurance company can cluster its customers based on age, residence, income, etc. This group information helps the company understand its customers better and hence provide better customized services. A minimal sketch follows.
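A minimal sketch of clustering customers with k-means from scikit-learn; the ages and incomes are invented, and in practice the attributes would be scaled first:

from sklearn.cluster import KMeans

# Hypothetical customers described by [age, annual income].
X = [[25, 30000], [27, 32000], [45, 80000], [48, 82000], [60, 40000]]

# Group similar customers into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each customer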
f. Summarization
Summarization is the generalization of data: a set of relevant data is summarized, resulting in a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high-level summarized information is useful to sales or customer relationship teams for detailed customer and purchase behavior analysis. Data can be summarized at different abstraction levels and from different angles. A minimal sketch follows.
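A minimal sketch of summarization with pandas, aggregating made-up purchase records into per-customer totals as in the example above:

import pandas as pd

# Hypothetical purchase records, one row per item bought.
purchases = pd.DataFrame({
    "customer": ["ana", "ana", "ben", "ben", "ben"],
    "amount": [12.5, 30.0, 8.0, 15.0, 22.0],
})

# Summarize each customer's shopping into a smaller, aggregated set.
summary = purchases.groupby("customer")["amount"].agg(["count", "sum"])
summary.columns = ["total_products", "total_spending"]
print(summary)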

TYPES OF DATA SETS


1. Record - Data that consists of a collection of records, each of which
consists of a fixed set of attributes.
1.1 Data Matrix - If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute. A small sketch follows the figure.

Figure 5: Data Matrix
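A minimal sketch of a data matrix with NumPy, where each row is a data object and each column a numeric attribute (the numbers are arbitrary):

import numpy as np

# Three data objects, each a point in a three-dimensional attribute space.
X = np.array([[1.5, 2.0, 0.3],
              [0.9, 1.1, 2.2],
              [3.0, 0.5, 1.8]])
print(X.shape)  # (3 objects, 3 attributes)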

1.2 Document Data - Each document becomes a 'term' vector:
– Each term is a component (attribute) of the vector.
– The value of each component is the number of times the corresponding term occurs in the document.
A small sketch follows the figure.

Figure 6: Document Data
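A minimal sketch of building such a term vector in plain Python; the sample document text is invented:

from collections import Counter

# Each distinct term is a component; its value is how often it occurs.
document = "data mining finds patterns in data and more data"
term_vector = Counter(document.split())
print(term_vector)  # e.g. Counter({'data': 3, 'mining': 1, ...})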

1.3 Transaction Data - A special type of data, where:
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
– Transaction data can be represented as record data (see the sketch after the figure).

Figure 7: Transaction Data
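A minimal sketch of representing transaction data as record data with pandas, using one binary attribute per item (the baskets are made up):

import pandas as pd

# One row per transaction, one column per item (1 = item was in the basket).
transactions = [{"bread", "milk"}, {"beer", "nappies"}, {"bread", "beer"}]
items = sorted(set().union(*transactions))
records = pd.DataFrame([[int(i in t) for i in items] for t in transactions],
                       columns=items)
print(records)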

2. Graph Data - an abstract data type that is meant to implement the undirected
graph and directed graph concepts from the field of graph theory within mathematics.
– World Wide Web
– Molecular Structures

Figure 8: Graph data example – a) molecular graph b) linked web pages

3. Ordered Data - records are kept in a physical sequence based on a user-specified key, without the necessity of utilizing a set. Ordered data sets can be either disjoint or embedded, but are normally embedded.
3.1 Sequential Data - also referred to as temporal data, can be thought of
as an extension of record data, where each record has a time associated with
it. Consider a retail transaction data set that also stores the time at which
the transaction took place.

Figure 9: Sequential transaction data

3.2 Genetic Sequence Data - consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence. For example, the genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.

Figure 10: Genomic sequence data


3.3 Time Series - a special type of sequential data in which each record
is a time series, i.e., a series of measurements taken over time. For
example, a financial data set might contain objects that are time series of
the daily prices of various stocks.

Figure 11: Temperature time series

3.4 Spatial Data - Some objects have spatial attributes, such as positions
or areas, as well as other types of attributes. An example of spatial data is
weather data (precipitation, temperature, pressure) that is collected for a
variety of geographical locations.

Figure 12: Spatial temperature data


DATA QUALITY

Data quality is defined in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability. These qualities are assessed based on the intended use of the data.
Data quality also refers to the overall utility of a data set as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system.
What do I need to know about data quality?

Figure 19: Data Quality

Quality data is useful data. To be of high quality, data must be consistent and unambiguous.
Data quality issues are often the result of database merges or systems/cloud integration processes
in which data fields that should be compatible are not, due to schema or format inconsistencies.
Data that is not high quality can undergo data cleansing to raise its quality.

What activities are involved in data quality?


Data quality activities involve data rationalization and validation.
Data quality efforts are often needed when integrating the disparate applications that come together during merger and acquisition activities, but also when siloed data systems within a single organization are brought together for the first time in a data warehouse or big data lake. Data quality is also critical to the efficiency of horizontal business applications such as enterprise resource planning (ERP) and customer relationship management (CRM).
What are the benefits of data quality?
When data is of excellent quality, it can be easily processed and analyzed, leading to insights that
help the organization make better decisions. High-quality data is essential to business intelligence efforts
and other types of data analytics, as well as better operational efficiency.
Addressing quality issues
Data mining applications are often applied to data that was collected for another purpose, or for future but unspecified applications. For that reason, data mining cannot usually take advantage of the significant benefits of "addressing quality issues at the source."
Since preventing data quality problems is not an option in such cases, data mining mainly focuses on:
a. the detection and correction of data quality problems (often called data cleaning);
b. the use of algorithms that can tolerate poor data quality.
What is Measurement Error?
It refers to any problem resulting from the measurement process; in other words, the recorded data values differ from the true values to some extent. The difference between the measured value and the true value is called the error.

Figure 20: Data Cleaning Cycle

What is Data Collection Error?

It refers to errors such as omitting data objects or attribute values, or including an unnecessary data object.
What is Noise?
Noise is the random component of a measurement error. It involves either the distortion of a value or the addition of spurious objects. The following figure shows a time series before and after disruption by some random noise; a small sketch of the idea follows the figure.
Figure 21: Noise comparison
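A minimal sketch in plain Python: a clean series (a sine wave) disrupted by random Gaussian noise, as in the figure (the wave and noise level are arbitrary choices):

import math
import random

# A clean time series and a noisy version of it.
random.seed(0)
clean = [math.sin(0.1 * t) for t in range(100)]
noisy = [x + random.gauss(0, 0.2) for x in clean]
print(noisy[:5])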

The term noise is often associated with data that has a spatial (space-related) or temporal (time-related) component. In such cases, techniques from signal and image processing are used to reduce noise. However, the removal of noise is a difficult task, so much of the work in data mining involves the use of robust algorithms that can produce acceptable results even in the presence of noise.
What are Outliers?
Outliers are either:
a. Data objects that, in some sense, have characteristics that are different from most
of the other data objects in the data set, or
b. Values of an attribute that are unusual with respect to the most common (typical)
values for that attribute.
Additionally, it is important to distinguish between noise and outliers. Outliers can be
legitimate data objects or values. Thus, unlike noise, outliers may sometimes be of interest.

What if there are missing values in the data set?

It is not unusual to have data objects that have missing values for some of the attributes. The reasons can be:
a. The information was not collected.
b. Some attributes are not applicable to all objects.
Regardless of the reason, missing values should be handled during data analysis.

Assessment Tasks. In no less than 3 but not more than 5 sentences, answer the following:
1. Describe some of the issues in data mining.
2. How can data mining be applied in the COVID-19 pandemic?
3. Why is it important to perform data cleaning?
4. Name some areas of application of data mining.
