
Module 2 Introduction to Data Mining, Data Exploration and Data Pre-processing

2.1.1 Data Mining Task Primitives


What are the data mining task primitives?
Sol.
Each user will have a data mining task in mind, that is, some form of data analysis that he or
she would like to have performed.
• A data mining task can be specified in the form of a data mining query, which is input to the
data mining system.
• A data mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during discovery
in order to direct the mining process, or examine the findings from different angles or depths.
• The data mining primitives specify the following, as illustrated in Figure 1.1.

1. The set of task-relevant data to be mined: This specifies the portions of the database
or the set of data in which the user is interested. This includes the database attributes
or data warehouse dimensions of interest (referred to as the relevant attributes or
dimensions).
2. The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation
analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. The background knowledge to be used in the discovery process: This knowledge
about the domain to be mined is useful for guiding the knowledge discovery process
and for evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction. An
example of a concept hierarchy for the attribute (or dimension) age is shown in Figure
1.2. User beliefs regarding relationships in the data are another form of background
knowledge.
4. The interestingness measures and thresholds for pattern evaluation: They may be used
to guide the mining process or, after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures. For
example, interestingness measures for association rules include support and
confidence. Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting (a small worked example follows this list).
5. The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.
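As a minimal illustration of the support and confidence measures mentioned in item 4, the following Python sketch computes both for a hypothetical rule over a small, made-up set of transactions; the item names and threshold values are assumptions for illustration only.

```python
# Support and confidence for the rule {bread} -> {butter} over toy transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "jam"},
]

antecedent, consequent = {"bread"}, {"butter"}
n = len(transactions)

# support(X -> Y) = fraction of transactions containing X union Y
support = sum(1 for t in transactions if antecedent | consequent <= t) / n
# confidence(X -> Y) = support(X union Y) / support(X)
confidence = support / (sum(1 for t in transactions if antecedent <= t) / n)

# Rules below user-specified thresholds are treated as uninteresting.
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6   # assumed thresholds
interesting = support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE
print(f"support={support:.2f}, confidence={confidence:.2f}, interesting={interesting}")
```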
2.1.2 Data Mining Architecture

What is Data Mining Architecture?
• Data mining is the process of selecting, exploring, and modelling large amounts of data
to discover previously unknown regularities or relationships and to generate clear and
valuable findings for the database owner. It uses automated or semi-automated
processes to identify meaningful patterns and rules.
• Data mining architecture describes how the components of a data mining system are
organized to support this process.
• The primary components of any data mining system are the Data source, data
warehouse server, data mining engine, pattern assessment module, graphical user
interface, and knowledge base.
Basic Working:
• When a user submits a data mining query, the request is sent to the data mining
engine for pattern analysis.
• These software applications use the existing database to try to discover a solution to
the query.
• The retrieved metadata is then transmitted to the data mining engine for suitable
processing, which may interact with pattern assessment modules to decide the
outcome.
• The result is finally delivered to the front end in a user-friendly format via an
appropriate interface.
Components Of Data Mining Architecture
• Data Sources
• Database Server
• Data Mining Engine
• Pattern Evaluation Modules
• Graphic User Interface
• Knowledge Base

Data Sources
• These sources provide the data in plain text, spreadsheets, or other media such as
images or videos. Data sources include databases, the World Wide Web (WWW), and
data warehouses.
Database Server
• The real data is stored on the database server and is ready to be processed. Its job is to
handle data retrieval in response to the user's request.
Data Mining Engine:
• It is one of the most important parts of the data mining architecture since it conducts
many data mining techniques such as association, classification, characterisation,
clustering, prediction, and so on.
Pattern Evaluation Modules:
• They are responsible for identifying intriguing patterns in data and, on occasion,
interacting with database servers to provide the results of user queries.
Graphic User Interface:
• Because the user cannot completely comprehend the complexities of the data mining
process, a graphical user interface assists the user in efficiently communicating with
the data mining system.
Knowledge Base:
• The Knowledge Base is an essential component of the data mining engine that aids in
the search for result patterns. Occasionally, the knowledge base may also provide
input to the data mining engine. This knowledge base might include information
gleaned from user interactions. The knowledge base's goal is to improve the accuracy
and reliability of the outcome.

Or

Data Mining Architecture

The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge base.

Data Source:

The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text
files, and other documents. A large amount of historical data is needed for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data
warehouses may comprise one or more databases, text files, spreadsheets, or other repositories
of data. Sometimes, even plain text files or spreadsheets may contain information. Another
primary source of data is the World Wide Web or the internet.

Different processes:

Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different
formats, it can't be used directly for the data mining procedure because the data may not be
complete and accurate. So, the data first needs to be cleaned and unified. Since more
information than needed is collected from various data sources, only the data of interest is
selected and passed to the server. These procedures are not as simple as they may seem. Several
methods may be performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:

The database or data warehouse server contains the original data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data based on the user's data mining
request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.

In other words, the data mining engine is the core of the data mining architecture. It comprises
the tools and software used to obtain insights and knowledge from data collected from
various data sources and stored within the data warehouse.
Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring how interesting a
discovered pattern is, usually by comparing it against a threshold value. It collaborates with
the data mining engine to focus the search on interesting patterns.

This module commonly employs interestingness measures that cooperate with the data mining
modules to focus the search towards interesting patterns. It might utilize an interestingness
threshold to filter out discovered patterns. Alternatively, the pattern evaluation module might
be integrated with the mining module, depending on the implementation of the data mining
technique used. For efficient data mining, it is highly recommended to push the evaluation of
pattern interestingness as deep as possible into the mining procedure to confine the search to
only interesting patterns.

Graphical User Interface:

The graphical user interface (GUI) module communicates between the data mining system and
the user. This module helps the user to easily and efficiently use the system without knowing
the complexity of the process. This module cooperates with the data mining system when the
user specifies a query or a task and displays the results.

Knowledge Base:

The knowledge base is helpful in the entire data mining process. It may be used to guide the
search or to evaluate the interestingness of the result patterns. The knowledge base may even
contain user views and data from user experiences that can be helpful in the data mining process.
The data mining engine may receive inputs from the knowledge base to make the results more
accurate and reliable. The pattern evaluation module regularly interacts with the knowledge
base to get inputs and to update it.
2.1.3 KDD process

KDD (Knowledge Discovery in Databases) is the process of discovering useful knowledge and
insights from large and complex datasets. The KDD process involves a range of techniques
and methodologies, including data preprocessing, data transformation, data mining, pattern
evaluation, and knowledge representation. KDD and data mining are closely related
processes, with data mining being a key component and subset of the KDD process.

The KDD process in data mining is a multi-step process that involves various stages to
extract useful knowledge from large datasets. The following are the main steps involved in
the KDD process -

• Data Selection - The first step in the KDD process is identifying and selecting the
relevant data for analysis. This involves choosing the relevant data sources, such as
databases, data warehouses, and data streams, and determining which data is required
for the analysis.
• Data Preprocessing - After selecting the data, the next step is data preprocessing.
This step involves cleaning the data, removing outliers, and removing missing,
inconsistent, or irrelevant data. This step is critical, as the data quality can
significantly impact the accuracy and effectiveness of the analysis.
• Data Transformation - Once the data is preprocessed, the next step is to transform it
into a format that data mining techniques can analyze. This step involves reducing the
data dimensionality, aggregating the data, normalizing it, and discretizing it to prepare
it for further analysis.
• Data Mining - This is the heart of the KDD process and involves applying various
data mining techniques to the transformed data to discover hidden patterns, trends,
relationships, and insights. A few of the most common data mining techniques include
clustering, classification, association rule mining, and anomaly detection.
• Pattern Evaluation - After the data mining, the next step is to evaluate the
discovered patterns to determine their usefulness and relevance. This involves
assessing the quality of the patterns, evaluating their significance, and selecting the
most promising patterns for further analysis.
• Knowledge Representation - This step involves representing the knowledge
extracted from the data in a way humans can easily understand and use. This can be
done through visualizations, reports, or other forms of communication that provide
meaningful insights into the data.
• Deployment - The final step in the KDD process is to deploy the knowledge and
insights gained from the data mining process to practical applications. This involves
integrating the knowledge into decision-making processes or other applications to
improve organizational efficiency and effectiveness.

In summary, the KDD process in data mining involves several steps to extract useful
knowledge from large datasets. It is a comprehensive and iterative process that requires
careful consideration of each step to ensure the accuracy and effectiveness of the analysis.
The various steps involved in the KDD process in data mining are shown in the diagram below.
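To make these stages concrete, here is a minimal, hypothetical sketch of a KDD-style pipeline in Python using pandas and scikit-learn. The file name, column names, and parameter choices are assumptions for illustration, not part of any specific system.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# 1. Selection: load only task-relevant columns (hypothetical file and columns).
df = pd.read_csv("sales.csv", usecols=["customer_id", "age", "annual_spend"])

# 2. Preprocessing: drop rows with missing values and obvious outliers.
df = df.dropna()
df = df[df["annual_spend"] < df["annual_spend"].quantile(0.99)]

# 3. Transformation: normalize the numeric features to [0, 1].
features = ["age", "annual_spend"]
X = MinMaxScaler().fit_transform(df[features])

# 4. Data mining: cluster customers into groups.
df["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 5. Pattern evaluation / knowledge representation: summarize each segment.
print(df.groupby("segment")[features].mean())
```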
2.1.4 Issues in Data Mining

Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available in one place; it needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here we will discuss the major
issues regarding −

• Mining Methodology and User Interaction


• Performance Issues
• Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −

• Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore, it is necessary for data mining to
cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used to
guide the discovery process and to express the discovered patterns, not only in concise
terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
• Presentation and visualization of data mining results − Once patterns are
discovered, they need to be expressed in high-level languages and visual
representations that are easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. Without such
cleaning methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting; patterns that
merely represent common knowledge or lack novelty are of little value.

Performance Issues

There can be performance-related issues such as follows −

• Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amounts of data in databases, data mining algorithms
must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − The factors such as
huge size of databases, wide distribution of data, and complexity of data mining
methods motivate the development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are processed in parallel,
and the results from the partitions are then merged. Incremental
algorithms incorporate database updates without mining the entire data again from scratch.

Diverse Data Types Issues

• Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data etc. It is not
possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These data
sources may be structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.
Short answer below

1. Mining Methodology Issues

Methodology-related data mining issues encompass challenges related to the choice and
application of mining algorithms and techniques. Selecting the right method for a specific
dataset and problem can be daunting. Moreover, issues like overfitting, bias, and the need for
interpretability often arise, making it crucial to strike a balance between model complexity
and accuracy.

2. Performance Issues

Performance-related data mining issues revolve around scalability, efficiency, and handling
large datasets. As data volumes continue to grow exponentially, it becomes essential to
develop algorithms and infrastructure capable of processing and analyzing data promptly.
Performance bottlenecks can hinder the practical application of data mining techniques.

3. Diverse Data Types Issue

The diverse data types data mining issues highlight the complexity of dealing with
heterogeneous data sources. Data mining often involves integrating data from various
formats, such as text, images, and structured databases. Each data type presents unique
challenges in terms of preprocessing, feature extraction, and modelling, requiring specialized
approaches and tools to tackle these complexities effectively.

2.1.5 Challenges of Data Mining


In this section, we will explore various challenges of data mining, as listed below -
• Data Quality - Data mining heavily relies on the quality of the input data. Inaccurate,
incomplete, or noisy data can lead to misleading results and hinder the discovery of
meaningful patterns.
• Data Complexity - Complex datasets with diverse structures, including unstructured
data like text and images, pose significant challenges in terms of preprocessing,
integration, and analysis.
• Data Privacy and Security - Safeguarding sensitive information is paramount. Data
mining can potentially compromise privacy if not conducted with stringent privacy-
preserving techniques and compliance with data protection regulations.
• Scalability - As data volumes continue to grow, ensuring that data mining algorithms
and infrastructure can handle large-scale datasets efficiently becomes a pressing issue.
• Interpretability - Understanding and explaining the outcomes of data mining models
is crucial for informed decision-making. Black-box models can raise concerns when
interpretability is required.
• Ethics - Ethical considerations in data mining, such as fairness, bias, and the
responsible use of data, are gaining prominence. Ensuring ethical practices throughout
the data mining process is a critical challenge.

2.1.6 Applications of Data Mining

Data Mining Applications


Here is the list of areas where data mining is widely used −

• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
Financial Data Analysis
The financial data in the banking and finance industry is generally reliable and of high quality,
which facilitates systematic data analysis and data mining. Some of the typical cases are as
follows −
• Design and construction of data warehouses for multidimensional data analysis
and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because retail collects large amounts of
data on sales, customer purchasing history, goods transportation, consumption, and services. It
is natural that the quantity of data collected will continue to expand rapidly because of the
increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps identify customer buying patterns and trends, which lead
to improved quality of customer service and better customer retention and satisfaction. Here is
the list of examples of data mining in the retail industry −
• Design and construction of data warehouses based on the benefits of data
mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.
Telecommunication Industry
Today, the telecommunication industry is one of the most rapidly growing industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication technologies,
the telecommunication industry is rapidly expanding. This is why data mining has
become very important in helping to understand the business.
Data mining in the telecommunication industry helps identify telecommunication
patterns, catch fraudulent activities, make better use of resources, and improve quality of
service. Here is the list of examples for which data mining improves telecommunication
services −
• Multidimensional Analysis of Telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics,
proteomics, functional genomics, and biomedical research. Biological data mining is a very
important part of bioinformatics. Following are the aspects in which data mining contributes
to biological data analysis −
• Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein
pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets
for which statistical techniques are appropriate. In contrast, huge amounts of data have been
collected from scientific domains such as geosciences, astronomy, etc. Large data sets are also
being generated by fast numerical simulations in fields such as climate and ecosystem modeling,
chemical engineering, fluid dynamics, etc. Following are the applications of data mining in the
field of scientific applications −

• Data warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain-specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability
of network resources. In this world of connectivity, security has become a major issue. The
increased usage of the internet and the availability of tools and tricks for intruding into and
attacking networks have prompted intrusion detection to become a critical component of
network administration. Here is the list of areas in which data mining technology may be
applied for intrusion detection −
• Development of data mining algorithms for intrusion detection.
• Association and correlation analysis, aggregation to help select and build
discriminating attributes.
• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.

2.1.7 Data Exploration:

Types of attributes

Attribute:
An attribute is a data field that represents a characteristic or feature of a data object. For
a customer object, attributes can be customer ID, address, etc. A set of attributes
used to describe a given object is known as an attribute vector or feature vector.

Type of attributes:
This is the first step of data preprocessing. We differentiate between the different types of
attributes and then preprocess the data. The attribute types are described below.
Qualitative (Nominal (N), Ordinal (O), Binary (B))
Quantitative (Numeric, Discrete, Continuous)

Qualitative Attributes:

1. Nominal Attributes – related to names: The values of a nominal attribute are names of
things or symbols. The values represent some category or state, which is why nominal
attributes are also referred to as categorical attributes; there is no order (rank, position)
among the values of a nominal attribute.
Example: hair colour (black, brown, blond) or marital status (single, married, divorced).
2. Binary Attributes: Binary data has only 2 values/states, for example yes or no, affected
or unaffected, true or false.
Symmetric: both values are equally important (e.g., gender).
Asymmetric: both values are not equally important (e.g., a medical test result, where the
positive outcome matters more).

3. Ordinal Attributes: An ordinal attribute contains values that have a meaningful
sequence or ranking (order) between them, but the magnitude between successive values is
not known. The order of the values shows which is ranked higher, but not by how much
(for example, grade A, B, C or size small, medium, large).

Quantitative Attributes:

1. Numeric: A numeric attribute is quantitative because it is a measurable quantity,
represented in integer or real values. Numeric attributes are of two types: interval and ratio.
An interval-scaled attribute has values whose differences are interpretable, but it does not
have a true reference (zero) point. Values on an interval scale can be added and subtracted
but cannot be meaningfully multiplied or divided. Consider temperature in degrees
Centigrade: if the temperature on one day is twice the value on another day, we cannot say
that one day is twice as hot as the other.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is
ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The
values are ordered, we can compute the differences between values, and the mean, median,
mode, quantile range, and five-number summary can be given.

2. Discrete: Discrete data take values from a finite or countably infinite set; they can be
numerical or categorical.
Example: number of children, zip codes, or course grades.

3. Continuous: Continuous data have an infinite number of possible values and are typically
represented as floating-point numbers; there can be infinitely many values between 2 and 3.
Example: height, weight, or temperature.
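As an illustrative sketch (not from the original notes), the following Python snippet shows how these attribute types might be represented in pandas; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "hair_colour": ["black", "brown", "blond"],        # nominal (no order)
    "smoker":      [True, False, False],               # binary
    "size":        ["small", "large", "medium"],       # ordinal (ordered categories)
    "children":    [0, 2, 1],                          # discrete numeric
    "height_cm":   [172.4, 180.1, 165.0],              # continuous (ratio-scaled)
})

# Mark the ordinal attribute with an explicit category order.
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)
print(df.dtypes)
```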

2.1.8 Data visualization

Data visualizations are common in everyday life, and they often appear in the form of
graphs and charts. A combination of multiple visualizations and bits of information is
referred to as an infographic.

Data visualizations are used to discover unknown facts and trends. You can see visualizations
in the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts of
a whole. And maps are the best way to share geographical data visually.

Today's data visualization tools go beyond the charts and graphs used in a Microsoft Excel
spreadsheet, displaying data in more sophisticated ways such as dials and gauges,
geographic maps, heat maps, pie charts, and fever charts.
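As a small, hypothetical illustration of these chart types, the following Python sketch uses matplotlib to draw a line chart (change over time), a bar chart (comparison), and a pie chart (parts of a whole); the data values are made up.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]            # made-up values

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

ax1.plot(months, sales, marker="o")     # line chart: change over time
ax1.set_title("Monthly sales")

ax2.bar(["North", "South", "East"], [300, 220, 180])   # bar chart: comparison
ax2.set_title("Sales by region")

ax3.pie([55, 30, 15], labels=["Product A", "Product B", "Product C"],
        autopct="%1.0f%%")              # pie chart: parts of a whole
ax3.set_title("Revenue share")

plt.tight_layout()
plt.show()
```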
2.1.9 Data Preprocessing:

Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, we do not always come across clean and
formatted data. Before doing any operation on the data, it is necessary to clean it and
put it in a formatted way. This is what data preprocessing tasks are for.

Data Cleaning
Data Cleaning uses methods to handle incorrect, incomplete, inconsistent, or missing values.
Some of the techniques for Data Cleaning include -
• Handling Missing Values
o Input data can contain missing or NULL values, which must be handled
before applying any Machine Learning or Data Mining techniques.
o Missing values can be handled by many techniques, such as removing
rows/columns containing NULL values and imputing NULL values using
mean, mode, regression, etc.
• De-noising
o De-noising is a process of removing noise from the data. Noisy data is
meaningless data that is not interpretable or understandable by machines or
humans. It can occur due to data entry errors, faulty data collection, etc.
o De-noising can be performed by applying many techniques, such as binning
the features, using regression to smooth the features and reduce noise,
clustering to detect outliers, etc. (a small sketch of missing-value handling
and binning follows this list).
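Below is a minimal, hypothetical pandas sketch of the two cleaning techniques above: mean/mode imputation of missing values and equal-width binning to smooth a noisy feature. Column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 31, 42, None, 38],
    "city":   ["Pune", "Mumbai", None, "Pune", "Delhi", "Pune"],
    "income": [30_000, 32_000, 31_000, 900_000, 33_000, 29_000],  # one noisy value
})

# Handling missing values: impute numeric columns with the mean, categorical with the mode.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# De-noising by binning: replace each income by the mean of its equal-width bin.
df["income_bin"] = pd.cut(df["income"], bins=3)
df["income_smoothed"] = df.groupby("income_bin", observed=True)["income"].transform("mean")

print(df)
```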
Data Integration
Data Integration can be defined as combining data from multiple sources. A few of the issues
to be considered during Data Integration include the following -
• Entity Identification Problem - It can be defined as identifying objects/features from
multiple databases that correspond to the same entity. For example, customer_id in
database A and customer_number in database B may refer to the same entity.
• Schema Integration - It is used to merge two or more database schema/metadata into
a single schema. It essentially takes two or more schema as input and determines a
mapping between them. For example, the entity type CUSTOMER in one schema may
be called CLIENT in another schema.
• Detecting and Resolving Data Value Conflicts - The same data can be stored in different
ways in different databases, and this needs to be taken care of while integrating them
into a single dataset. For example, dates can be stored in various formats such
as DD/MM/YYYY, YYYY/MM/DD, or MM/DD/YYYY (a small sketch follows this list).
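As a hypothetical sketch of resolving an entity identification problem and a data value conflict during integration, the following pandas snippet merges two toy tables that name the customer key differently and store dates in different formats; all names and values are assumptions.

```python
import pandas as pd

# Database A: key called customer_id, dates as DD/MM/YYYY.
a = pd.DataFrame({"customer_id": [1, 2],
                  "signup_date": ["05/01/2023", "17/03/2023"]})

# Database B: the same entity keyed as customer_number, dates as YYYY/MM/DD.
b = pd.DataFrame({"customer_number": [1, 2],
                  "last_purchase": ["2023/04/02", "2023/05/19"]})

# Entity identification: map both keys onto one common name before merging.
b = b.rename(columns={"customer_number": "customer_id"})
merged = a.merge(b, on="customer_id")

# Data value conflict resolution: normalize both date columns to one format.
merged["signup_date"] = pd.to_datetime(merged["signup_date"], format="%d/%m/%Y")
merged["last_purchase"] = pd.to_datetime(merged["last_purchase"], format="%Y/%m/%d")
print(merged)
```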
Data Reduction
Data Reduction is used to reduce the volume or size of the input data. Its main objective is to
reduce storage and analysis costs and improve storage efficiency. A few of the popular
techniques to perform Data Reduction include -
• Dimensionality Reduction - It is the process of reducing the number of features in
the input dataset. It can be performed in various ways, such as selecting the features
with the highest importance, Principal Component Analysis (PCA), etc. (a short PCA
sketch follows this list).
• Numerosity Reduction - In this method, various techniques can be applied to reduce
the volume of data by choosing alternative smaller representations of the data. For
example, a variable can be approximated by a regression model, and instead of storing
the entire variable, we can store the regression model to approximate it.
• Data Compression - In this method, data is compressed. Data Compression can be
lossless or lossy depending on whether the information is lost or not during
compression.
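As a minimal sketch of dimensionality reduction with PCA using scikit-learn (the data here is randomly generated purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples with 10 features

pca = PCA(n_components=3)               # keep only 3 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # variance retained by each component
```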
Data Transformation
Data Transformation is a process of converting data into a format that helps in building
efficient ML models and deriving better insights. A few of the most common methods for
Data Transformation include -
• Smoothing - Data Smoothing is used to remove noise in the dataset, and it helps
identify important features and detect patterns. Therefore, it can help in predicting
trends or future events.
• Aggregation - Data Aggregation is the process of transforming large volumes of
data into an organized and summarized format that is more understandable and
comprehensive. For example, a company may look at monthly sales data of a product
instead of raw sales data to understand its performance better and forecast future
sales.
• Discretization - Data Discretization is a process of converting numerical or
continuous variables into a set of intervals/bins. This makes data easier to analyze.
For example, the age features can be converted into various intervals such as (0-
10, 11-20, ..) or (child, young, …).
• Normalization - Data Normalization is a process of converting a numeric variable
into a specified range such as [-1, 1], [0, 1], etc. A few of the most common approaches
to performing normalization are min-max normalization, data standardization, data
scaling, etc. A short sketch of discretization and normalization follows this list.
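The following is a minimal, hypothetical pandas/scikit-learn sketch of two of the transformations above: discretizing an age column into labelled intervals and min-max normalizing an income column; the column names, values, and bin edges are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [4, 15, 27, 38, 62],
                   "income": [0, 1_200, 3_500, 4_800, 2_100]})

# Discretization: convert the continuous age into labelled intervals.
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 10, 20, 40, 100],
                         labels=["child", "young", "adult", "senior"])

# Normalization: scale income into the [0, 1] range (min-max normalization).
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

print(df)
```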

Data reduction

Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount
of irrelevant or redundant information.

There are several different data reduction techniques that can be used in data mining,
including:

1. Data Sampling: This technique involves selecting a subset of the data to work with,
rather than using the entire dataset. This can be useful for reducing the size of a
dataset while still preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features
in the dataset, either by removing features that are not relevant or by combining
multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or
lossless compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into
discrete data by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the
dataset that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between accuracy and the
size of the data: the more aggressively the data is reduced, the more information may be
lost, making the resulting model less accurate and less generalizable. A small sampling
sketch follows.
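As a minimal sketch of data sampling (technique 1 above) with pandas, drawing a random 10% subset of a hypothetical dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"value": rng.normal(loc=50, scale=10, size=10_000)})

# Simple random sampling: keep 10% of the rows, reproducibly.
sample = df.sample(frac=0.10, random_state=42)

# The sample should preserve the overall distribution reasonably well.
print(df["value"].mean(), sample["value"].mean())
```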

Data Discretization

• Dividing the range of a continuous attribute into intervals.


• Interval labels can then be used to replace actual data values.
• Reduce the number of values for a given continuous attribute.
• Some classification algorithms only accept categorical attributes.
• This leads to a concise, easy-to-use, knowledge-level representation of mining results.
• Discretization techniques can be categorized based on whether it uses class
information or not such as follows:
o Supervised Discretization - This discretization process uses class information.
o Unsupervised Discretization - This discretization process does not use class
information.
• Discretization techniques can be categorized based on which direction it proceeds as
follows:

Top-down Discretization -

• The process starts by first finding one or a few points, called split points or cut points,
to split the entire attribute range, and then repeats this recursively on the resulting
intervals.

Bottom-up Discretization -

• Starts by considering all of the continuous values as potential split points.
• Removes some by merging neighboring values to form intervals, and then
recursively applies this process to the resulting intervals.

Concept Hierarchies

• Discretization can be performed recursively on an attribute to provide a hierarchical
partitioning of the attribute values, known as a concept hierarchy.
• Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
• In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies.
• This organization provides users with the flexibility to view data from different
perspectives.
• Data mining on a reduced data set means fewer input and output operations and is
more efficient than mining on a larger data set.
• Because of these benefits, discretization techniques and concept hierarchies are
typically applied before data mining, rather than during mining.
Typical Methods of Discretization and Concept Hierarchy Generation for Numerical Data

1] Binning

• Binning is a top-down splitting technique based on a specified number of bins.


• Binning is an unsupervised discretization technique because it does not use class
information.
• In this, the sorted values are distributed into several buckets or bins, and each bin
value is then replaced by the bin mean or median (a short sketch of equal-width and
equal-depth binning follows this list).
• It is further classified into
o Equal-width (distance) partitioning
o Equal-depth (frequency) partitioning
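Below is a minimal, hypothetical pandas sketch contrasting equal-width and equal-depth partitioning, followed by smoothing each bin by its mean; the values are made up.

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])   # sorted toy data

# Equal-width (distance) partitioning: each bin spans the same value range.
equal_width = pd.cut(values, bins=3)

# Equal-depth (frequency) partitioning: each bin holds roughly the same number of values.
equal_depth = pd.qcut(values, q=3)

# Smoothing by bin means: replace each value by the mean of its (equal-depth) bin.
smoothed = values.groupby(equal_depth, observed=True).transform("mean")

print(pd.DataFrame({"value": values, "equal_width": equal_width,
                    "equal_depth": equal_depth, "smoothed": smoothed}))
```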

2] Histogram Analysis

• It is an unsupervised discretization technique because histogram analysis does not use
class information.
• Histograms partition the values for an attribute into disjoint ranges called buckets.
• It is also further classified into
o Equal-width histogram
o Equal frequency histogram
• The histogram analysis algorithm can be applied recursively to each partition to
automatically generate a multilevel concept hierarchy, with the procedure terminating
once a pre-specified number of concept levels has been reached.

3] Cluster Analysis

• Cluster analysis is a popular data discretization method.


• A clustering algorithm can be applied to discretize a numerical attribute A by
partitioning the values of A into clusters or groups.
• Clustering considers the distribution of A, as well as the closeness of data points, and
therefore can produce high-quality discretization results.
• Each initial cluster or partition may be further decomposed into several subclusters,
forming a lower level of the hierarchy (a small sketch of clustering-based
discretization follows).
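As a minimal, hypothetical sketch of clustering-based discretization, the following snippet uses scikit-learn's KMeans to split a 1-D numeric attribute into three groups; the data and the choice of three clusters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([3, 7, 9, 22, 25, 27, 31, 58, 63, 70]).reshape(-1, 1)

# Cluster the 1-D attribute into 3 groups; each cluster label becomes an interval id.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ages)

for cluster in np.unique(labels):
    members = ages[labels == cluster].ravel()
    print(f"cluster {cluster}: min={members.min()}, max={members.max()}, "
          f"members={members.tolist()}")
```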

4] Entropy-Based Discretization

• Entropy-based discretization is a supervised, top-down splitting technique.


• It explores class distribution information in its calculation and determination of split
points.
• Let D consist of data instances defined by a set of attributes and a class-label attribute.
• The class-label attribute provides the class information per instance.
• In this, the interval boundaries or split points defined may help to improve
classification accuracy.
• The entropy and information gain measures are used for decision tree induction;
a small sketch of choosing a split point by information gain follows.
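The sketch below (hypothetical data, plain Python) evaluates candidate split points on a numeric attribute by the information gain of the induced binary partition, choosing the boundary with the highest gain:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Toy instances: (attribute value, class label).
data = [(23, "no"), (25, "no"), (31, "no"), (35, "yes"),
        (42, "yes"), (46, "yes"), (51, "no"), (60, "yes")]
values = sorted(v for v, _ in data)
labels = [c for _, c in data]
base = entropy(labels)

best_gain, best_split = -1.0, None
# Candidate split points are midpoints between consecutive attribute values.
for lo, hi in zip(values, values[1:]):
    split = (lo + hi) / 2
    left = [c for v, c in data if v <= split]
    right = [c for v, c in data if v > split]
    expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
    gain = base - expected
    if gain > best_gain:
        best_gain, best_split = gain, split

print(f"best split point: {best_split}, information gain: {best_gain:.3f}")
```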

5] Interval Merge by χ2 Analysis


• It is a bottom-up method.
• Find the best neighboring intervals and merge them to form larger intervals
recursively.
• The method is supervised in that it uses class information.
• ChiMerge treats intervals as discrete categories.
• The basic notion is that for accurate discretization, the relative class frequencies
should be fairly consistent within an interval.
• Therefore, if two adjacent intervals have a very similar distribution of classes, then
the intervals can be merged.
• Otherwise, they should remain separate (a small χ² sketch for two adjacent intervals
follows).
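As a hypothetical sketch of the ChiMerge idea, the snippet below computes the χ² statistic for the class counts of two adjacent intervals using scipy; a low χ² value (similar class distributions) suggests the intervals can be merged. The counts and threshold are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Class counts per adjacent interval (rows: intervals, columns: classes) - made-up numbers.
interval_a = [12, 3]    # e.g. 12 instances of class "yes", 3 of class "no"
interval_b = [10, 4]
table = np.array([interval_a, interval_b])

chi2, p_value, dof, expected = chi2_contingency(table)

# ChiMerge merges the pair of adjacent intervals with the lowest chi-square value,
# i.e. the pair whose class distributions are the most similar.
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
if p_value > 0.05:          # assumed significance threshold
    print("Class distributions are similar -> merge the two intervals.")
else:
    print("Class distributions differ -> keep the intervals separate.")
```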
