
Data mining is a technique for extracting useful information from huge data sets.

Its main objective is to identify patterns, trends, or rules that explain data behavior in
context. Data mining relies on mathematical analysis to uncover patterns and trends that
traditional data exploration methods could not reveal, making it a convenient approach for
dealing with vast volumes of data. In this article, we explore some common data mining
functionalities that are used to specify the kinds of patterns to look for in data sets.

To learn more about data mining, read – What is Data Mining

Data Mining Functionalities


We have listed the most popular data mining functionalities below:

 Classification
 Association Analysis
 Cluster Analysis
 Data Characterization
 Data Discrimination
 Prediction
 Outlier Analysis
 Evolution Analysis

Classification

As the name suggests, classification is the technique of categorizing the elements of a
collection based on their predefined classes and properties. A classification model is built
from instances whose class labels are already known; these instances are called the training
data. The model can then classify new instances whose class is unknown. Such a classification
mechanism may use if-then rules, decision trees, neural networks, or a set of classification
rules, and the resulting model can be applied to classify future data.

Association Analysis

Association analysis is also called market basket analysis. It is a prevalent data mining
methodology widely used in sales. Association analysis helps to find relations between elements
that frequently occur together: the result is a collection of itemsets and rules that describe
how those items are grouped across transactions. Association rules are used to predict the
presence of an element in the database based on the occurrence of another element identified as
important. An association rule has two parts –
 antecedent (if)
 consequent (then)

The antecedent (if) indicates that, whenever it appears in the data set, there is some likelihood
of finding the consequent (then) as well, which suggests that the two are associated.

For example, if a person buys popcorn at the theatre, there may be a 60% chance that they will
also buy a cold drink. In this way, predictions can be made about consumers’ shopping behavior.
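
To make the rule concrete, here is a small self-contained sketch that computes the support and
confidence of the rule popcorn → cold drink from a handful of invented transactions:

```python
# Toy market-basket data; each set is one customer's purchase (illustrative only).
baskets = [
    {"popcorn", "cold drink"},
    {"popcorn", "cold drink", "nachos"},
    {"popcorn"},
    {"cold drink"},
    {"popcorn", "cold drink"},
]

antecedent, consequent = {"popcorn"}, {"cold drink"}

both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
antecedent_only = sum(1 for b in baskets if antecedent <= b)

support = both / len(baskets)        # how often the full rule holds overall
confidence = both / antecedent_only  # P(cold drink | popcorn)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```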

Cluster Analysis

The cluster analysis process is similar to classification, except that the class labels are
unknown. Clustering algorithms divide the data based on similarities, so that the objects within a
group are more similar to each other than to objects in other groups. Cluster analysis is used in
machine learning, deep learning, image processing, pattern recognition, NLP, and more.
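
A minimal clustering sketch, assuming scikit-learn is installed; the points and the number of
clusters are arbitrary:

```python
# A minimal clustering sketch: group 2-D points into two clusters.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster label assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers
```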

Data Characterization

The process of data characterization involves summarizing the general features of the data for a
target class, which can result in specific characteristic rules that describe that class. An
attribute-oriented induction technique can be used to characterize the data without much user
intervention or interaction. The resulting characterization can be visualized in the form of
different types of graphs, charts, or tables.
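
As a loose illustration of summarizing the general features of a target class (not the
attribute-oriented induction algorithm itself), pandas can be used to aggregate the attributes of
the records belonging to that class; the customer data below is invented:

```python
# Characterize the "premium" customer class by summarizing its attributes (toy data).
import pandas as pd

customers = pd.DataFrame({
    "segment": ["premium", "premium", "basic", "premium", "basic"],
    "age":     [42, 51, 23, 38, 29],
    "spend":   [900, 1200, 150, 800, 200],
})

# Keep only the target class, then summarize its general features.
target = customers[customers["segment"] == "premium"]
print(target[["age", "spend"]].describe())
```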

Data Discrimination

Data discrimination compares the general features of a target class of data with those of one or
more contrasting classes. This data mining functionality helps to separate distinct data sets
based on the differences in their attribute values.

Prediction

Prediction is among the most popular data mining functionalities; it determines missing or
unknown values in a data set. Regression models built on historical data are used to make numeric
predictions, which help businesses forecast the outcome of a given event, whether positive or
negative. There are two types of prediction –

 Numeric prediction – predicts a missing or unknown numeric value in a data set
 Class prediction – predicts the class label using a previously built classification model
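
A minimal numeric-prediction sketch, assuming scikit-learn is installed; the advertising-spend and
sales figures are invented:

```python
# Fit a linear regression on historical data, then predict an unknown value.
from sklearn.linear_model import LinearRegression

# Historical data: advertising spend (in thousands) vs. resulting sales.
spend = [[10], [20], [30], [40], [50]]
sales = [25, 45, 62, 85, 105]

model = LinearRegression()
model.fit(spend, sales)

# Predict sales for a spend value that is not in the historical data.
print(model.predict([[35]]))
```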

Outlier Analysis
Outlier analysis is used for data objects that do not fit into any class or cluster. It also tells
us something about data quality: an outlier usually represents an abnormality in the data, and the
more outliers a data set contains, the lower its quality. It is hard to identify patterns or draw
reliable conclusions from a data set with many outliers, so outlier analysis helps determine
whether the data can still be analyzed after some clean-up. At the same time, tracking unusual
data and activities remains essential in its own right, so that anomalies and their potential
business impact can be detected in advance.
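
One common way to flag outliers is the interquartile-range (IQR) rule; the sketch below is a
simple illustration of that heuristic on invented values, not the only possible approach:

```python
# Flag values that fall outside 1.5 * IQR of the middle 50% of the data (toy values).
import numpy as np

values = np.array([12, 13, 12, 14, 13, 95, 12, 11, 13, -40])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # the abnormal values, e.g. 95 and -40
```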

Evolution Analysis

Evolution analysis refers to the study of data sets whose objects change or evolve over time.
Evolution analysis models capture these evolutionary trends in the data, which further contributes
to data characterization, classification, discrimination, and clustering of time-series data.

Data Processing in Data Mining


Data processing is the collection of raw data and its translation into usable information. The
raw data is collected, filtered, sorted, processed, analyzed, stored, and then
presented in a readable format. It is usually performed step by step by a
team of data scientists and data engineers in an organization.

Data processing can be carried out automatically or manually. Nowadays, most data is
processed automatically with the help of computers, which is faster and more accurate.
The processed data can be delivered in different forms, such as graphics or audio,
depending on the software and the data processing methods used.

Stages of Data Processing


Data processing consists of the following six stages.
1. Data Collection

The collection of raw data is the first step of the data processing cycle. The raw data
collected has a huge impact on the output produced. Hence, raw data should be
gathered from defined and accurate sources so that the subsequent findings are
valid and usable. Raw data can include monetary figures, website cookies, profit/loss
statements of a company, user behavior, etc.

2. Data Preparation

Data preparation or data cleaning is the process of sorting and filtering the raw data
to remove unnecessary and inaccurate data. Raw data is checked for errors,
duplication, miscalculations, or missing data and transformed into a suitable form for
further analysis and processing. This ensures that only the highest quality data is fed
into the processing unit.

3. Data Input
In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner, or
any other input source.

4. Data Processing

In this step, the raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate the desired
output. This step may vary slightly from process to process depending on the source
of data being processed (data lakes, online databases, connected devices, etc.) and
the intended use of the output.

5. Data Interpretation or Output

The data is finally transmitted and displayed to the user in a readable form like
graphs, tables, vector files, audio, video, documents, etc. This output can be stored
and further processed in the next data processing cycle.

6. Data Storage

The last step of the data processing cycle is storage, where data and metadata are
stored for further use. This allows quick access and retrieval of information whenever
needed. Effective data storage is also necessary for compliance with data protection
legislation such as the GDPR.
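
As a highly simplified illustration of this cycle, the sketch below (with an invented data set and
output file name) prepares, processes, outputs, and stores a small table using pandas:

```python
# A toy walk-through of the data processing cycle using pandas (values invented).
import pandas as pd

# 1-3. Collection and input: raw data gathered into a machine-readable form.
raw = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "revenue": [100, 100, 80, None, 120],
})

# 2. Preparation: remove duplicate and incomplete records.
clean = raw.drop_duplicates().dropna()

# 4. Processing: aggregate revenue per region.
summary = clean.groupby("region")["revenue"].sum()

# 5. Output: present the result in a readable form.
print(summary)

# 6. Storage: persist the processed result for later use.
summary.to_csv("revenue_by_region.csv")
```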

Data Preprocessing in Data Mining


Data preprocessing is an important step in the data mining process. It refers
to the cleaning, transforming, and integrating of data in order to make it
ready for analysis. The goal of data preprocessing is to improve the quality of
the data and to make it more suitable for the specific data mining task.

Some common steps in data preprocessing include:

 Data cleaning: this step involves identifying and removing missing,
inconsistent, or irrelevant data. This can include removing duplicate
records, filling in missing values, and handling outliers.
 Data integration: this step involves combining data from multiple
sources, such as databases, spreadsheets, and text files. The goal
of integration is to create a single, consistent view of the data.
 Data transformation: this step involves converting the data into a
format that is more suitable for the data mining task. This can
include normalizing numerical data, creating dummy variables, and
encoding categorical data.
 Data reduction: this step is used to select a subset of the data that
is relevant to the data mining task. This can include feature
selection (selecting a subset of the variables) or feature extraction
(extracting new variables from the data).
 Data discretization: this step is used to convert continuous
numerical data into categorical data, which can be used by decision
trees and other techniques that work on categorical data.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
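
A brief sketch of a few of the steps above, assuming pandas and scikit-learn are available:
integrating two sources, filling a missing value, dummy-encoding a categorical column, and
scaling a numeric one (all values invented):

```python
# Toy preprocessing sketch: integration, cleaning, encoding, and scaling.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

customers = pd.DataFrame({"id": [1, 2, 3], "city": ["Pune", "Delhi", "Pune"]})
orders = pd.DataFrame({"id": [1, 2, 3], "amount": [250.0, None, 900.0]})

# Data integration: combine the two sources into a single view.
data = customers.merge(orders, on="id")

# Data cleaning: fill the missing amount with the column mean.
data["amount"] = data["amount"].fillna(data["amount"].mean())

# Data transformation: dummy-encode the categorical column and scale the numeric one.
data = pd.get_dummies(data, columns=["city"])
data["amount"] = MinMaxScaler().fit_transform(data[["amount"]]).ravel()

print(data)
```
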
Preprocessing in Data Mining: 
Data preprocessing is a data mining technique used to transform raw data
into a useful and efficient format. 
 

Steps Involved in Data Preprocessing: 


1. Data Cleaning: 
Raw data can have many irrelevant and missing parts. Data cleaning is done to handle
them; it involves dealing with missing data, noisy data, and so on. 
 
 (a). Missing Data: 
This situation arises when some values are missing from the data. It can
be handled in various ways. 
Some of them are: 
1. Ignore the tuples: 
This approach is suitable only when the data set we have
is quite large and multiple values are missing within a
tuple. 
 
2. Fill the Missing values: 
There are various ways to do this. You can choose to
fill the missing values manually, with the attribute mean, or with the
most probable value. 
 
 (b). Noisy Data: 
Noisy data is meaningless data that cannot be interpreted by
machines. It can be generated by faulty data collection, data
entry errors, and so on. It can be handled in the following ways: 
1. Binning Method: 
This method works on sorted data in order to smooth it.
The data is divided into segments (bins) of equal size, and
each segment is handled separately: all values in a bin can
be replaced by the bin mean, or the bin boundary values can
be used instead. 
 
2. Regression: 
Here the data is smoothed by fitting it to a regression
function. The regression may be linear (having one
independent variable) or multiple (having several
independent variables). Linear regression finds the best
line to fit between two variables so that one can be used
to predict the other, while multiple linear regression
involves more than two variables. Fitting a mathematical
equation to the data in this way helps to smooth out the
noise. 
 
3. Clustering: 
This approach groups similar data into clusters. Values
that fall outside the clusters, or remain undetected by
them, can be treated as outliers. 
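
A short sketch of two of these cleaning steps, assuming pandas is available: filling missing
values with the attribute mean, and smoothing a numeric column by replacing each value with the
mean of its equal-width bin (the prices are invented):

```python
# Toy data-cleaning sketch: fill missing values, then smooth by bin means.
import pandas as pd

prices = pd.Series([4.0, 8.0, None, 15.0, 21.0, 21.0, 24.0, 25.0, None])

# Missing data: fill with the attribute mean.
prices = prices.fillna(prices.mean())

# Binning: split into 3 equal-width bins and replace each value by its bin mean.
bins = pd.cut(prices, bins=3)
smoothed = prices.groupby(bins).transform("mean")

print(smoothed)
```
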
2. Data Transformation: 
This step is taken in order to transform the data into forms suitable
for the mining process. It involves the following ways: 
1. Normalization: 
It is done in order to scale the data values into a specified range, such
as -1.0 to 1.0 or 0.0 to 1.0. 
 
2. Attribute Selection: 
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process. 
 
3. Discretization: 
This is done to replace the raw values of a numeric attribute with
interval levels or conceptual levels. 
 
4. Concept Hierarchy Generation: 
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be generalized to
“country”. 
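
A brief sketch of two of these transformations, assuming pandas and scikit-learn are available:
min-max normalization into the 0.0 to 1.0 range, and discretization of ages into conceptual
levels (the values and level names are invented):

```python
# Toy transformation sketch: normalization and discretization.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [15, 22, 37, 45, 63, 78]})

# Normalization: scale values into the range 0.0 to 1.0.
df["age_scaled"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

# Discretization: replace raw ages with conceptual levels.
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 40, 65, 120],
                         labels=["youth", "young adult", "middle-aged", "senior"])

print(df)
```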
 
3. Data Reduction: 
Data mining is used to handle huge amounts of data, and analysis becomes harder as the
volume of data grows. To deal with this, we use data reduction techniques, which aim to
increase storage efficiency and reduce data storage and analysis costs. 
The various steps to data reduction are: 
1. Data Cube Aggregation: 
Aggregation operations are applied to the data to construct the
data cube. 

A data cube (also called a business intelligence cube or OLAP


cube) is a data structure optimized for fast and efficient analysis. It
enables consolidating or aggregating relevant data into the cube
and then drilling down, slicing and dicing, or pivoting data to view it
from different angles. Essentially, a cube is a section of data built
from tables in a database that contains calculations. OLAP cubes
are typically grouped according to business function, containing
data relevant to each function.

Data cube classification:


The data cube can be classified into two categories:
 Multidimensional data cube: It helps in storing large
amounts of data by making use of a multi-dimensional array. It
increases its efficiency by keeping an index of each dimension and is
therefore able to retrieve data quickly.
 Relational data cube: It basically helps in storing large amounts of
data by making use of relational tables. Each relational table
displays the dimensions of the data cube. It is slower compared to a
Multidimensional Data Cube.
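
As an informal illustration of aggregation (not a full OLAP cube implementation), a pandas pivot
table can roll detailed sales records up into a coarser year-by-region summary; the records below
are invented:

```python
# Toy aggregation sketch: roll quarterly sales up into a year-by-region summary.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2023, 2023, 2023],
    "region":  ["North", "South", "North", "North", "South", "South"],
    "revenue": [100, 80, 120, 150, 90, 110],
})

cube = sales.pivot_table(index="year", columns="region",
                         values="revenue", aggfunc="sum")
print(cube)
```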

 
2. Attribute Subset Selection: 
Only the highly relevant attributes should be used; the rest can be
discarded. For attribute selection, one can use the significance level and the
p-value of each attribute: an attribute whose p-value is greater than the
significance level can be discarded. 

Methods of Attribute Subset Selection-


1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.
1. Stepwise Forward Selection: This procedure starts with an empty
set of attributes as the minimal set. The most relevant attribute
(having the minimum p-value) is chosen and added to the minimal set;
in each iteration, one more attribute is added to the reduced set
(see the sketch after this list).
2. Stepwise Backward Elimination: Here all the attributes are
considered in the initial set of attributes. In each iteration, the
attribute whose p-value is higher than the significance level is
eliminated from the set of attributes.
3. Combination of Forward Selection and Backward
Elimination: Stepwise forward selection and backward
elimination are combined so as to select the relevant attributes
most efficiently. This is the most common technique and is
generally used for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for
attribute selection. It constructs a flow-chart-like structure in which
each internal node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each leaf node holds a
class prediction. Attributes that do not appear in the tree are
considered irrelevant and hence discarded.
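
One possible realization of stepwise forward selection, assuming scikit-learn is installed, is its
SequentialFeatureSelector with direction='forward'; note that it greedily adds attributes by model
score rather than by p-value, so it is an analogue of the procedure above rather than the exact
algorithm:

```python
# Greedy forward feature selection on a toy dataset using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",  # start from an empty set and add one attribute per step
)
selector.fit(X, y)

print(selector.get_support())  # boolean mask of the selected attributes
```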

 
3. Numerosity Reduction: 
This enables storing a model of the data instead of the whole data, for
example, regression models. 
 
4. Dimensionality Reduction: 
This reduces the size of the data through encoding mechanisms. The
reduction can be lossy or lossless: if the original data can be retrieved
after reconstruction from the compressed data, the reduction is called
lossless; otherwise, it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal
Component Analysis). 
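
A minimal PCA sketch, assuming scikit-learn is installed, that projects a four-attribute data set
down to two principal components:

```python
# Reduce the iris data from 4 attributes to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance kept by each component
```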
 
