Learning Outcomes:
• Discuss motivation and challenges of data mining
• Perform data mining tasks
• Identify data set types
• Distinguish types of variables
• Assess data quality
Data mining has attracted a great deal of attention in the information industry in
recent years because of the wide availability of huge amounts of data and the imminent
need to turn such data into useful information and knowledge. The information and knowledge
gained can be used for applications ranging from business management, production control, and
market analysis, to engineering design and science exploration.
Data mining can be viewed as a result of the natural evolution of information technology.
An evolutionary path has been witnessed in the database industry in the development of the
following functionalities: data collection and database creation, data management (including data
storage and retrieval, and database transaction processing), and data analysis and understanding
(involving data warehousing and data mining). For instance, the early development of data
collection and database creation mechanisms served as a prerequisite for later development of
effective mechanisms for data storage and retrieval, and query and transaction processing. With
numerous database systems offering query and transaction processing as common practice, data
analysis and understanding has naturally become the next target.
Data can now be stored in many different types of databases. One database architecture
that has recently emerged is the data warehouse, a repository of multiple heterogeneous data
sources, organized under a unified schema at a single site in order to facilitate management
decision making. Data warehouse technology includes data cleansing, data integration, and On-
Line Analytical Processing (OLAP), that is, analysis techniques with functionalities such as
summarization, consolidation, and aggregation, as well as the ability to view information from
different angles. Although OLAP tools support multidimensional analysis and decision making,
additional data analysis tools are required for in-depth analysis, such as data classification,
clustering, and the characterization of data changes over time.
The abundance of data, coupled with the need for powerful data analysis tools, has been
described as a "data rich but information poor" situation. The fast-growing, tremendous amount of
data, collected and stored in large and numerous databases, has far exceeded our human ability for
comprehension without powerful tools. As a result, data collected in large databases become "data
tombs" -- data archives that are seldom revisited. Consequently, important decisions are often
made based not on the information-rich data stored in databases but rather on a decision maker's
intuition, simply because the decision maker does not have the tools to extract the valuable
knowledge embedded in the vast amounts of data. In addition, consider current expert system
technologies, which typically rely on users or domain experts to manually input knowledge into
knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely
time-consuming and costly. Data mining tools which perform data analysis may uncover important
data patterns, contributing greatly to business strategies, knowledge bases, and scientific and
medical research. The widening gap between data and information calls for a systematic
development of data mining tools which will turn data tombs into "golden nuggets" of knowledge.
Data Mining Process
The whole process of data mining cannot be completed in a single step. In other words,
you cannot simply extract the required information from large volumes of data in one go. It is a
more complex process than it may appear, involving a number of sub-processes. These processes,
namely data cleaning, data integration, data selection, data transformation, data mining,
pattern evaluation, and knowledge representation, are to be completed in the given order.
Types of Data Mining Processes
Different data mining processes can be classified into two types: data preparation (also
called data preprocessing) and data mining. The first four processes, namely data cleaning,
data integration, data selection, and data transformation, are considered data preparation
processes. The last three processes, data mining, pattern evaluation, and knowledge
representation, are integrated into one process called data mining.
a. Data Cleaning
is the process where the data gets cleaned. Data in the real world is normally incomplete,
noisy, and inconsistent. The data available in data sources might be lacking attribute values,
data of interest, and so on. For example, suppose you want demographic data about customers,
but the available data does not include attributes for the gender or age of the customers.
The data is then incomplete. Sometimes the data might contain errors or outliers. An
example is an age attribute with a value of 200; it is obvious that the age value is wrong in
this case. The data could also be inconsistent. For example, the name of an employee might be
stored differently in different data tables or documents. If the data is not clean, the data
mining results will be neither reliable nor accurate.
Data cleaning involves a number of techniques, including filling in missing values
manually, combined computer and human inspection, and so on. The output of the data
cleaning process is adequately cleaned data.
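The two cleaning steps mentioned above, filling in missing values and handling implausible ones, can be sketched in Python. The customer records, the age threshold of 120, and the mean-fill strategy below are all invented for illustration; they are one possible approach, not the only one.

```python
# A minimal data-cleaning sketch: fill missing ages with the mean of the
# valid ages, and discard records whose age is an obvious entry error
# (like the age of 200 in the text). Records and threshold are made up.

AGE_LIMIT = 120  # ages above this are treated as data-entry errors

def clean_records(records):
    """Fill missing ages with the rounded mean of valid ages; drop errors."""
    valid_ages = [r["age"] for r in records
                  if r["age"] is not None and r["age"] <= AGE_LIMIT]
    mean_age = round(sum(valid_ages) / len(valid_ages))
    cleaned = []
    for r in records:
        age = r["age"]
        if age is not None and age > AGE_LIMIT:
            continue                      # discard the erroneous record
        cleaned.append({**r, "age": age if age is not None else mean_age})
    return cleaned

customers = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},   # missing value
    {"name": "Cy",  "age": 200},    # obvious error, as in the text
    {"name": "Dee", "age": 28},
]
cleaned = clean_records(customers)
```

Whether to drop, correct, or keep an erroneous record depends on the application; dropping is used here only to keep the sketch short.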
b. Data Integration
is the process where data from different data sources are integrated into one. Data lies in
different formats in different locations: it could be stored in databases, text files,
spreadsheets, documents, data cubes, on the Internet, and so on.
Data integration is a complex and tricky task because data from different sources does not
normally match. Suppose a table A contains an attribute named customer_id whereas another
table B contains an attribute named number. It is difficult to ensure whether both
these attributes refer to the same thing or not. Metadata can be used effectively to reduce
errors in the data integration process. Another issue is data redundancy: the same
data might be available in different tables in the same database or even in different data
sources. Data integration tries to reduce redundancy as far as possible without affecting
the reliability of the data.
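The customer_id versus number mismatch above is exactly the kind of problem metadata can resolve. The sketch below assumes a small hand-written mapping from each source's column names onto a unified schema; the table contents and field names are hypothetical.

```python
# A hedged sketch of data integration: table A uses "customer_id" while
# table B calls the same key "number". A metadata mapping resolves the
# mismatch, and redundant records are dropped after merging.

table_a = [{"customer_id": 1, "city": "Cebu"},
           {"customer_id": 2, "city": "Manila"}]
table_b = [{"number": 2, "city": "Manila"},   # redundant copy of id 2
           {"number": 3, "city": "Davao"}]

# metadata: maps each source's column name onto the unified schema
schema_map = {"customer_id": "id", "number": "id", "city": "city"}

def integrate(*tables):
    """Merge tables under the unified schema, keeping each id once."""
    seen, merged = set(), []
    for table in tables:
        for row in table:
            unified = {schema_map[k]: v for k, v in row.items()}
            if unified["id"] not in seen:   # drop redundant records
                seen.add(unified["id"])
                merged.append(unified)
    return merged

merged = integrate(table_a, table_b)
```

In practice the schema mapping is rarely this clean; deciding that two attributes mean the same thing is the hard part, which is why metadata matters.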
c. Data Selection
is the process where the data relevant to the analysis is retrieved from the database.
The data mining process requires large volumes of historical data for analysis, so the
data repository with integrated data usually contains much more data than is actually
required. From the available data, the data of interest needs to be selected and stored.
d. Data Transformation
is the process of transforming and consolidating the data into different forms that are
suitable for mining.
normally involves normalization, aggregation, generalization, and so on. For example, a data set
available as "-5, 37, 100, 89, 78" can be transformed into "-0.05, 0.37, 1.00, 0.89, 0.78" by
scaling each value by the largest absolute value. Here the data becomes more suitable for
mining. After data transformation, the available data is ready for data mining.
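The worked example above divides every value by the largest absolute value in the set (100 here), a normalization sometimes called max-abs scaling. A short sketch:

```python
def max_abs_scale(values):
    """Scale each value by the largest absolute value in the list,
    reproducing the transformation shown in the example above."""
    peak = max(abs(v) for v in values)
    return [round(v / peak, 2) for v in values]

scaled = max_abs_scale([-5, 37, 100, 89, 78])
# scaled is [-0.05, 0.37, 1.0, 0.89, 0.78]
```

Other common choices include min-max scaling to [0, 1] and z-score standardization; which one is suitable depends on the mining method that follows.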
e. Data Mining
is the core process where a number of complex and intelligent methods are applied to extract
patterns from data.
can be classified generally into two types based on what a specific task tries to achieve:
descriptive tasks and predictive tasks.
Descriptive data mining tasks characterize the general properties of the data, whereas
predictive data mining tasks perform inference on the available data set to predict how a
new data set will behave.
There are a number of data mining tasks such as classification, prediction, time-series
analysis, association, clustering, summarization, and so on. All these tasks are either
predictive or descriptive data mining tasks. A data mining system can execute one or more
of the above tasks as part of data mining.
Predictive data mining tasks build a model from the available data set that is helpful in
predicting unknown or future values of another data set of interest. A medical practitioner
trying to diagnose a disease based on the medical test results of a patient is performing a
predictive data mining task.
a. Classification
Classification derives a model to determine the class of an object based on its attributes. A
collection of records will be available, each record with a set of attributes. One of the
attributes will be the class attribute, and the goal of the classification task is to assign a
class to new records as accurately as possible.
Classification can be used in direct marketing, that is, to reduce marketing
costs by targeting a set of customers who are likely to buy a new product. Using the
available data, it is possible to know which customers purchased similar products in the
past and which did not. Hence, {purchase, don't purchase} forms
the class attribute in this case. Once the class attribute is assigned, demographic and
lifestyle information of customers who purchased similar products can be collected
and promotional mails can be sent to them directly.
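The {purchase, don't purchase} task above can be sketched with a one-nearest-neighbour rule, one of the simplest classification methods. The training records, the two attributes (age and monthly income), and the test customers below are all fabricated for illustration.

```python
# Toy classification sketch: assign a new customer the class of the
# most similar past customer (1-nearest-neighbour). All data is made up.

training = [
    ((25, 3000), "purchase"),
    ((30, 3500), "purchase"),
    ((50, 1200), "don't purchase"),
    ((45, 1000), "don't purchase"),
]

def classify(record):
    """Return the class of the closest training record."""
    def dist(a, b):
        # squared Euclidean distance over the attribute tuples
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda t: dist(t[0], record))
    return label

label = classify((28, 3200))   # a new customer to classify
```

A real direct-marketing model would use many more attributes and would scale them first (income dominates the distance here), but the structure, learn from labelled records and assign a class to new ones, is the same.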
b.1) Prediction
Prediction task predicts the possible values of missing or future data.
Prediction involves developing a model based on the available data and this model is
used in predicting future values of a new data set of interest. For example, a model
can predict the income of an employee based on education, experience and other
demographic factors like place of residence, gender, and so on. Prediction analysis is also
used in different areas including medical diagnosis, fraud detection, and so on.
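The income example above can be sketched as a simple predictive model: an ordinary least-squares line relating years of experience to income. The data points are fabricated (and deliberately perfectly linear) so the fit is easy to check by hand.

```python
# Minimal predictive model: fit income = a + b * experience by ordinary
# least squares, then predict the income for an unseen experience value.
# The training points are invented for illustration.

def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

experience = [1, 2, 3, 4, 5]          # years
income     = [30, 35, 40, 45, 50]     # thousands; perfectly linear here
a, b = fit_line(experience, income)
predicted = a + b * 6                 # predicted income at 6 years
```

Real income data is of course noisy and multivariate; this only shows the shape of a predictive task: build a model from known pairs, then apply it to a new value.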
b.2) Time-Series Analysis
Time series is a sequence of events where the next event is determined by one
or more of the preceding events. Time series reflects the process being measured and
there are certain components that affect the behavior of a process. Time series
analysis includes methods to analyze time-series data in order to extract useful
patterns, trends, rules and statistics. Stock market prediction is an important
application of time-series analysis.
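One of the most basic tools for exposing a trend in a time series is a simple moving average, which smooths out short-term fluctuations. The price series below is invented for illustration.

```python
def moving_average(series, window):
    """Smooth a time series with a simple moving average: each output
    value is the mean of `window` consecutive input values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

prices = [10, 12, 11, 13, 15, 14]      # a made-up daily price series
smoothed = moving_average(prices, 3)   # 3-day moving average
```

Serious time-series analysis goes far beyond this (seasonality, autocorrelation, forecasting models), but smoothing is often the first step in extracting the patterns and trends the text mentions.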
Descriptive data mining tasks usually find patterns describing the data and come up with new,
significant information from the available data set. A retailer trying to identify products
that are purchased together is performing a descriptive data mining task.
b.3) Association
Association discovers the association or connection among a set of items.
Association identifies the relationships between objects. Association analysis is used
for commodity management, advertising, catalog design, direct marketing, and so on. A
retailer can identify the products that customers normally purchase together, or even
find the customers who respond to the promotion of the same kind of products. If a retailer
finds that beer and nappies are frequently bought together, he can put nappies on sale to
promote the sale of beer.
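The beer-and-nappies pattern is usually quantified with two measures: support (the fraction of baskets containing an itemset) and confidence (how often the consequent appears given the antecedent). The transactions below are invented to make the numbers easy to check.

```python
# Association-analysis sketch: support of {beer, nappies} and the
# confidence of the rule beer -> nappies. Baskets are made up.

transactions = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"bread", "milk"},
    {"beer", "crisps"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

sup = support({"beer", "nappies"})        # 2 of 4 baskets -> 0.5
conf = confidence({"beer"}, {"nappies"})  # 2 of the 3 beer baskets
```

Algorithms like Apriori scale this idea to large catalogues by only counting itemsets whose subsets are already frequent.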
b.4) Clustering
Clustering is used to identify data objects that are similar to one another. The
similarity can be decided based on a number of factors like purchase behavior,
responsiveness to certain actions, geographical locations and so on. For example, an
insurance company can cluster its customers based on age, residence, income etc. This
group information will be helpful to understand the customers better and hence
provide better customized services.
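The insurance example above, grouping customers by age, can be sketched with a bare-bones one-dimensional k-means: repeatedly assign each value to its nearest centre, then move each centre to the mean of its group. The ages and starting centres are invented for illustration.

```python
# Minimal clustering sketch: 1-D k-means over customer ages.

def kmeans_1d(values, centers, iters=10):
    """Alternate assignment and centre-update steps for `iters` rounds."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # move each centre to the mean of its cluster (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

ages = [22, 25, 27, 61, 64, 66]          # two obvious age groups
centers, clusters = kmeans_1d(ages, [20, 70])
```

Real customer clustering would use several attributes at once (age, income, residence), but the alternating assign-and-update structure is the same.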
b.5) Summarization
Summarization is the generalization of data. A set of relevant data is
summarized, which results in a smaller set that gives aggregated information about the
data. For example, the shopping done by a customer can be summarized into total
products, total spending, offers used, etc. Such high level summarized information
can be useful for sales or customer relationship team for detailed customer and
purchase behavior analysis. Data can be summarized in different abstraction levels and
from different angles.
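The shopping summary described above, total products, total spending, offers used per customer, is a straightforward aggregation. The purchase log below is fabricated for illustration.

```python
# Summarization sketch: collapse a purchase log into per-customer totals.

purchases = [
    {"customer": "Ana", "amount": 120.0, "offer_used": True},
    {"customer": "Ana", "amount": 80.0,  "offer_used": False},
    {"customer": "Ben", "amount": 50.0,  "offer_used": False},
]

def summarize(log):
    """Aggregate each customer's purchases into a small summary record."""
    summary = {}
    for p in log:
        s = summary.setdefault(p["customer"],
                               {"total_products": 0,
                                "total_spending": 0.0,
                                "offers_used": 0})
        s["total_products"] += 1
        s["total_spending"] += p["amount"]
        s["offers_used"] += p["offer_used"]   # True counts as 1
    return summary

summary = summarize(purchases)
```

Summarizing at different abstraction levels (per customer, per region, per month) is what gives the "different angles" the text refers to.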
2. Graph Data - an abstract data type that is meant to implement the undirected
graph and directed graph concepts from the field of graph theory within mathematics.
– World Wide Web
– Molecular Structures
3.1 Spatial Data - Some objects have spatial attributes, such as positions
or areas, as well as other types of attributes. An example of spatial data is
weather data (precipitation, temperature, pressure) that is collected for a
variety of geographical locations.
Quality data is useful data. To be of high quality, data must be consistent and unambiguous.
Data quality issues are often the result of database merges or systems/cloud integration processes
in which data fields that should be compatible are not due to schema or format inconsistencies.
Data that is not high quality can undergo data cleansing to raise its quality.
The term data collection error refers to errors such as omitting data objects or attribute
values, or including an unnecessary data object.
What is Noise?
Noise is the random component of a measurement error. It involves either the distortion of a
value or the addition of spurious objects. The following figure shows a time series before and
after disruption by some random noise.
Figure 21: Noise comparison
The term noise is often connected with data that has a spatial (space related) or temporal (time
related) component. In these cases, techniques from signal and image processing are used in order to
reduce noise. But, the removal of noise is a difficult task, hence much of the data mining work
involves use of Robust Algorithms that can produce acceptable results even in the presence of noise.
What are Outliers?
Outliers are either:
a. Data objects that, in some sense, have characteristics that are different from most
of the other data objects in the data set, or
b. Values of an attribute that are unusual with respect to the most common (typical)
values for that attribute.
Additionally, it is important to distinguish between noise and outliers. Outliers can be
legitimate data objects or values. Thus, unlike noise, outliers may sometimes be of interest.
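A common way to flag the unusual attribute values described above is the z-score rule: mark any value that lies more than some number of standard deviations from the mean. The sample ages and the threshold of 2 below are illustrative choices, not fixed rules.

```python
# Outlier-flagging sketch: z-score rule over a list of attribute values.

import statistics

def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

ages = [25, 27, 30, 31, 29, 28, 200]   # 200 is the suspect value
outliers = flag_outliers(ages)
```

Note the caveat from the text: a flagged value is not automatically noise. Here 200 is almost certainly an error, but in a fraud-detection setting an equally extreme value might be exactly the interesting case.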
Assessment Tasks. In no less than 3 but not more than 5 sentences, answer the following:
1. Describe some of the issues in data mining.
2. How can data mining be applied in the Covid-19 pandemic?
3. Why is it important to perform data cleaning?
4. Name some areas of application of data mining.