
(R18CSE4102) - Data Mining

UNIT II Data Mining – Introduction: Introduction to Data Mining Systems – Knowledge


Discovery Process – Data Mining Techniques – Issues – applications- Data Objects and
attribute types, Statistical description of data, Data Preprocessing – Cleaning, Integration,
Reduction, Transformation and discretization, Data Visualization, Data similarity and
dissimilarity measures.

Data Mining:
Data mining refers to extracting or mining knowledge from large amounts of data.
The term is actually a misnomer. Thus, data mining would be more appropriately
named knowledge mining, which emphasizes mining knowledge from large amounts of data.

It is the computational process of discovering patterns in large data sets involving


methods at the intersection of artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract information from a data set
and transform it into an understandable structure for further use.

The key properties of data mining are


 Automatic discovery of patterns

 Prediction of likely outcomes

 Creation of actionable information

 Focus on large datasets and databases

The Scope of Data Mining


Data mining derives its name from the similarities between searching for valuable
business information in a large database — for example, finding linked products in gigabytes
of store scanner data — and mining a mountain for a vein of valuable ore. Both processes
require either sifting through an immense amount of material, or intelligently probing it to
find exactly where the value resides. Given databases of sufficient size and quality, data
mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors


Data mining automates the process of finding predictive information in large databases.
Questions that traditionally required extensive hands-on analysis can now be answered
directly from the data — quickly. A typical example of a predictive problem is targeted
marketing. Data mining uses data on past promotional mailings to identify the targets most
likely to maximize return on investment in future mailings. Other predictive problems include
forecasting bankruptcy and other forms of default, and identifying segments of a population
likely to respond similarly to given events.
Automated discovery of previously unknown patterns.
Data mining tools sweep through databases and identify previously hidden patterns in one
step. An example of pattern discovery is the analysis of retail sales data to identify seemingly
unrelated products that are often purchased together. Other pattern discovery problems
include detecting fraudulent credit card transactions and identifying anomalous data that
could represent data entry keying errors.

Tasks of Data Mining

Data mining involves six common classes of tasks:


Anomaly detection (Outlier/change/deviation detection) – The identification of unusual
data records, that might be interesting or data errors that require further investigation.

Association rule learning (Dependency modelling) – Searches for relationships between


variables. For example a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are frequently
bought together and use this information for marketing purposes. This is sometimes referred
to as market basket analysis.

Clustering – is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.

Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – attempts to find a function which models the data with the least error.

Summarization – providing a more compact representation of the data set, including


visualization and report generation.

KDD ( Knowledge Discovery in Databases)


 KDD is the overall process of converting raw data into useful information.
 This process consists of a series of transformation steps, from data preprocessing to
post processing of data mining results.
Input data
 stored in a variety of formats (flat files, spreadsheets, or relational tables)
 reside in a centralized data repository or be distributed across multiple sites.
Preprocessing
 transform the raw input data into an appropriate format for subsequent analysis.
 most laborious and time-consuming step in the knowledge discovery process
Steps involved in preprocessing
 combine data from multiple sources,
 cleaning data to remove noise
 duplicate observations
 selecting records and features that are relevant to the data mining task.
Data mining
 an integral part of knowledge discovery in databases (KDD).
Postprocessing
 a step that ensures that only valid and useful results are incorporated into the decision
support system.
 Statistical measures or hypothesis testing methods are applied to eliminate false data
mining results.
 Example for post processing:
o visualization, which allows analysts to explore the data and the data mining
results from a variety of viewpoints.
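
To make these steps concrete, the following is a minimal sketch of a KDD-style pipeline in Python with pandas and scikit-learn. The tiny customer table, its column names, and the use of clustering as the mining step are illustrative assumptions only, not part of the syllabus.

import pandas as pd
from sklearn.cluster import KMeans

# Input data (in practice loaded from flat files, spreadsheets, or relational
# tables); the values below are invented for illustration.
df = pd.DataFrame({
    "age":           [23, 23, 35, 47, 51, 62, None, 44],
    "annual_income": [28, 28, 40, 65, 70, 80, 55, 60],
})

# Preprocessing: remove duplicate observations and rows with missing values,
# then keep the features relevant to the mining task.
clean = df.drop_duplicates().dropna()

# Data mining: here, a simple clustering step.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(clean)

# Postprocessing: inspect and validate the result before it is fed into a
# decision support system (e.g., check the cluster sizes).
print(pd.Series(labels).value_counts())
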

Data Mining Techniques


1. Association
Association analysis is the finding of association rules showing attribute-value conditions
that occur frequently together in a given set of data. Association analysis is widely used for
a market basket or transaction data analysis. Association rule mining is a significant and
exceptionally dynamic area of data mining research. One method of association-based
classification, called associative classification, consists of two steps. In the first step,
association rules are generated using a modified version of the standard association
rule mining algorithm known as Apriori. The second step constructs a classifier based on
the association rules discovered.
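
As a toy illustration of the counting idea behind the frequent itemset step, the pure-Python sketch below finds frequent item pairs in a handful of invented market baskets; it is not the full Apriori algorithm, and the data and support threshold are assumptions for illustration.

from itertools import combinations
from collections import Counter

# Invented market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_support = 0.4  # keep pairs appearing in at least 40% of the baskets

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
print(frequent_pairs)  # e.g. ('beer', 'diapers') occurs in 3/5 = 60% of baskets
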
2. Classification
Classification is the processing of finding a set of models (or functions) that describe and
distinguish data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The determined model depends
on the investigation of a set of training data information (i.e. data objects whose class label
is known). The derived model may be represented in various forms, such as classification
(if – then) rules, decision trees, and neural networks. Data Mining has a different type of
classifier:
 Decision Tree
 SVM(Support Vector Machine)
 Generalized Linear Models
 Bayesian classification:
 Classification by Backpropagation
 K-NN Classifier
 Rule-Based Classification
 Frequent-Pattern Based Classification
 Rough set theory
 Fuzzy Logic
Decision Trees: A decision tree is a flow-chart-like tree structure, where each node
represents a test on an attribute value, each branch denotes an outcome of a test, and tree
leaves represent classes or class distributions. Decision trees can be easily transformed into
classification rules. Decision tree induction is a nonparametric approach for building
classification models. In other words, it does not require any prior assumptions regarding
the type of probability distribution satisfied by the class and the other attributes. Decision
trees, especially smaller trees, are relatively easy to interpret, and their accuracy is
comparable to that of other classification techniques for many simple data sets. They
provide an expressive representation for learning discrete-valued functions. However, they
do not generalize well to certain types of Boolean problems.

A typical example is a decision tree built on the Iris data set from the UCI machine learning
repository, which has three class labels: Setosa, Versicolor, and Virginica.
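
A minimal scikit-learn sketch of decision tree induction on the Iris data set is shown below; the depth limit and other settings are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# The learned tree can be printed as readable if-then splits, which is one
# reason decision trees are considered easy to interpret.
print(export_text(tree, feature_names=list(iris.feature_names)))
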
Support Vector Machine (SVM) Classifier Method: Support vector machines are a
supervised learning strategy used for classification and also for regression. When the
output of the support vector machine is a continuous value, the method performs
regression; when it predicts a category label for the input object, it performs classification.
The independent variables may or may not be quantitative. Kernel functions transform data
that are not linearly separable in one domain into another domain where the instances
become linearly separable. Kernel functions may be linear, quadratic, Gaussian, or anything
else that achieves this purpose. A linear classifier bases its decision on a linear function of
its inputs. Applying the kernel functions arranges the data instances within the
multidimensional space in such a way that there is a hyperplane separating instances of one
class from those of another. The advantage of support vector machines is that they can use
such kernels to transform the problem, so that linear classification techniques can be
applied to nonlinear data. Once the data are divided into two classes, the aim is to find the
best hyperplane separating the two kinds of instances.
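
The following is a small hedged sketch of a kernel SVM with scikit-learn; the synthetic two-moons data set and the RBF kernel settings are illustrative choices.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# The RBF kernel implicitly maps the data into a space where a separating
# hyperplane can be found.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
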
Generalized Linear Models:  Generalized Linear Models(GLM) is a statistical technique,
for linear modeling.GLM provides extensive coefficient statistics and model statistics, as
well as row diagnostics. It also supports confidence bounds.
Bayesian Classification: Bayesian classifier is a statistical classifier. They can predict
class membership probabilities, for instance, the probability that a given sample belongs to
a particular class. Bayesian classification is based on Bayes' theorem. Studies
comparing the classification algorithms have found a simple Bayesian classifier known as
the naive Bayesian classifier to be comparable in performance with decision tree and neural
network classifiers. Bayesian classifiers have also displayed high accuracy and speed when
applied to large databases. Naive Bayesian classifiers assume that the effect of an attribute
value on a given class is independent of the values of the other attributes. This assumption
is termed class conditional independence. It is made to simplify the calculations involved,
and in this sense it is considered “naive”. Bayesian belief networks are graphical models
that, unlike naive Bayesian classifiers, allow the representation of dependencies among
subsets of attributes. Bayesian belief networks can also be used for classification.
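
A minimal naive Bayesian classifier sketch with scikit-learn follows; GaussianNB is one concrete implementation of the class conditional independence assumption described above, and the train/test split is illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
# Class membership probabilities for one test sample, and overall accuracy.
print(nb.predict_proba(X_test[:1]))
print("test accuracy:", nb.score(X_test, y_test))
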
Classification by Backpropagation: A backpropagation network learns by iteratively processing a
set of training samples, comparing the network’s estimate for each sample with the actual
known class label. For each training sample, weights are modified to minimize the mean
squared error between the network’s prediction and the actual class. These changes are
made in the “backward” direction, i.e., from the output layer, through each hidden layer,
down to the first hidden layer (hence the name backpropagation). Although convergence is
not guaranteed, in general the weights eventually converge and the learning process stops.

K-Nearest Neighbor (K-NN) Classifier Method: The k-nearest neighbor (K-NN)


classifier is an example-based classifier, which means that the training
documents are used for comparison rather than an explicit class representation, such as the
class profiles used by other classifiers. As such, there is no real training phase. When a new
document has to be classified, the k most similar documents (neighbors) are found, and if a
large enough proportion of them belong to a particular class, the new document is also
assigned to that class; otherwise it is not. In addition, finding the nearest neighbors can be
sped up by using traditional indexing strategies.
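
A short k-nearest-neighbor sketch with scikit-learn is given below; k = 5 and the Iris data are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)  # k = 5, Euclidean distance by default
knn.fit(X, y)                              # "training" essentially stores the examples
print(knn.predict(X[:3]))                  # classify a few objects by majority vote of neighbors
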
Rule-Based Classification: Rule-based classification represents the knowledge in the form
of if-then rules. A rule is evaluated according to its accuracy and coverage. If more than one
rule is triggered, then conflict resolution is needed. Conflict resolution can be performed
using three different schemes: size ordering, class-based ordering, and rule ordering. Some
advantages of rule-based classifiers are:
 Rules are easier to understand than a large tree.

 Rules are mutually exclusive and exhaustive.


 Each attribute-value pair along a path forms a conjunction; each leaf holds the class
prediction.
Frequent-Pattern Based Classification: Frequent pattern discovery (or FP discovery, FP
mining, or Frequent itemset mining) is part of data mining. It describes the task of finding
the most frequent and relevant patterns in large datasets. The idea was first presented for
mining transaction databases. Frequent patterns are defined as subsets (item sets,
subsequences, or substructures) that appear in a data set with a frequency no less than a
user-specified or auto-determined threshold.
Rough Set Theory: Rough set theory can be used for classification to discover structural
relationships within imprecise or noisy data. It applies to discrete-valued features;
continuous-valued attributes must therefore be discretized prior to their use. Rough set
theory is based on the establishment of equivalence classes within the given training data.
All the data samples forming an equivalence class are indiscernible, that is, identical with
respect to the attributes describing the data. Rough sets can also be used for feature
reduction (where attributes that do not contribute towards the classification of the given
training data can be identified and removed), and relevance analysis (where the contribution
or significance of each attribute is assessed with respect to the classification task). The
problem of finding the minimal subsets (reducts) of attributes that can describe all the
concepts in the given data set is NP-hard. However, algorithms to reduce the computational
intensity have been proposed. In one method, for example, a discernibility matrix is used
which stores the differences between attribute values for each pair of data samples. Rather
than searching the entire training set, the matrix is searched to detect redundant
attributes.
Fuzzy-Logic: Rule-based systems for classification have the disadvantage that they involve
sharp cut-offs for continuous attributes. Fuzzy Logic is valuable for data mining
frameworks performing grouping /classification. It provides the benefit of working at a
high level of abstraction. In general, the usage of fuzzy logic in rule-based systems involves
the following:
 Attribute values are changed to fuzzy values.
 For a given new example, more than one fuzzy rule may apply. Every
applicable rule contributes a vote for membership in the categories. Typically, the truth
values for each projected category are summed.
3. Prediction 
Data Prediction is a two-step process, similar to that of data classification. However, for
prediction, we do not use the term “class label attribute” because the attribute for which
values are being predicted is continuous-valued (ordered) rather than categorical
(discrete-valued and unordered). The attribute can be referred to simply as the predicted
attribute. Prediction can be viewed as the construction and use of a model to assess the
class of an unlabeled object, or to assess the value or value ranges of an attribute that a
given object is likely to have.
4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes,
clustering analyzes data objects without consulting an identified class label. In general, the
class labels do not exist in the training data simply because they are not known to begin
with. Clustering can be used to generate these labels. The objects are clustered based on the
principle of maximizing the intra-class similarity and minimizing the interclass similarity.
That is, clusters of objects are created so that objects inside a cluster have high similarity in
contrast with each other, but are dissimilar to objects in other clusters. Each cluster that is
generated can be seen as a class of objects, from which rules can be inferred. Clustering can
also facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together.
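
A minimal clustering sketch with scikit-learn follows; k-means and the synthetic blob data are illustrative assumptions.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data: the class labels returned by make_blobs are ignored.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Each discovered cluster can be treated as a class of objects.
print(kmeans.cluster_centers_)
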

5. Regression
Regression can be defined as a statistical modeling method in which previously obtained
data are used to predict a continuous quantity for new observations. This classifier is also
known as the Continuous Value Classifier. There are two types of regression models:
Linear regression and multiple linear regression models.
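
A small linear-regression sketch with scikit-learn follows; the synthetic data and coefficients are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # e.g. advertising spend
y = 3.5 * X.ravel() + 20 + rng.normal(0, 2, 100)   # e.g. sales with noise

model = LinearRegression().fit(X, y)
print("predicted value for x = 7:", model.predict([[7.0]])[0])
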

6. Artificial Neural network (ANN) Classifier Method


An artificial neural network (ANN), also referred to simply as a “neural network” (NN),
is a computational model inspired by biological neural networks. It consists of an
interconnected collection of artificial neurons. A neural network is a set of connected
input/output units where each connection has a weight associated with it. During the
learning phase, the network learns by adjusting the weights so as to be able to predict the
correct class label of the input samples. Neural network learning is also referred to as
connectionist learning due to the connections between units. Neural networks involve long
training times and are therefore more appropriate for applications where this is feasible.
They require a number of parameters that are typically best determined empirically, such as
the network topology or “structure”. Neural networks have been criticized for their poor
interpretability, since it is difficult for humans to interpret the symbolic meaning behind the
learned weights. These features initially made neural networks less desirable for data mining.
The advantages of neural networks, however, include their high tolerance to noisy data as
well as their ability to classify patterns on which they have not been trained. In addition,
several algorithms have recently been developed for the extraction of rules from trained
neural networks. These factors contribute to the usefulness of neural networks for
classification in data mining.
An artificial neural network is an adaptive system that changes its structure based on
information that flows through the network during a learning phase. The ANN relies on the
principle of learning by example. There are two classical types of neural networks, the
perceptron and the multilayer perceptron.
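
A minimal multilayer-perceptron sketch with scikit-learn is shown below; the hidden-layer size and other settings are illustrative assumptions. The network is trained by backpropagation, iteratively adjusting the connection weights to reduce prediction error.

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scaling helps the weights converge

mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print("training accuracy:", mlp.score(X, y))
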
7. Outlier Detection
A database may contain data objects that do not comply with the general behavior or model
of the data. These data objects are Outliers. The investigation of OUTLIER data is known
as OUTLIER MINING. An outlier may be detected using statistical tests which assume a
distribution or probability model for the data, or using distance measures where objects
having a small fraction of “close” neighbors in space are considered outliers. Rather than
using statistical or distance measures, deviation-based techniques identify outliers by
examining differences in the principal characteristics of objects in a group.
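
A simple statistical outlier sketch with NumPy follows; the three-standard-deviation rule and the planted value are illustrative assumptions (a rough normality assumption is implied).

import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc=50, scale=5, size=200), [180.0]])  # 180 is planted

# Flag values more than three standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
print(values[np.abs(z_scores) > 3])   # reports the planted outlier
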
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of
evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and
genetics. These are intelligent exploitation of random search provided with historical data
to direct the search into the region of better performance in solution space. They are
commonly used to generate high-quality solutions for optimization problems and search
problems. Genetic algorithms simulate the process of natural selection which means those
species who can adapt to changes in their environment are able to survive and reproduce
and go to the next generation. In simple words, they simulate “survival of the fittest”
among individuals of consecutive generations for solving a problem. Each generation
consists of a population of individuals, and each individual represents a point in the search
space and a possible solution. Each individual is represented as a string of
characters, integers, floats, or bits. This string is analogous to a chromosome.
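
The toy genetic algorithm below illustrates these ideas on the simple "OneMax" problem (evolve a bit string with as many 1s as possible); the population size, mutation rate, and other settings are arbitrary illustrative choices.

import random

random.seed(0)
LENGTH, POP, GENERATIONS, MUTATION = 20, 30, 40, 0.02

def fitness(chrom):               # number of 1-bits in the chromosome
    return sum(chrom)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # Selection: keep the fitter half ("survival of the fittest").
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    # Crossover and mutation refill the population for the next generation.
    children = []
    while len(parents) + len(children) < POP:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, LENGTH)
        child = a[:cut] + b[cut:]
        child = [bit ^ 1 if random.random() < MUTATION else bit for bit in child]
        children.append(child)
    population = parents + children

print("best fitness:", fitness(max(population, key=fitness)), "out of", LENGTH)
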
Data Mining - Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here in this tutorial, we will discuss the major
issues regarding −

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data mining to
cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide discovery process and to
express the discovered patterns, the background knowledge can be used. Background
knowledge may be used to express the discovered patterns not only in concise terms
but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
 Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable.
 Handling noisy or incomplete data − The data cleaning methods are required to
handle the noise and incomplete objects while mining the data regularities. If the data
cleaning methods are not there then the accuracy of the discovered patterns will be
poor.
 Pattern evaluation − The patterns discovered should be interesting; patterns that merely
represent common knowledge or lack novelty are of little value.
Performance Issues
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to effectively
extract information from huge amounts of data in databases, data mining algorithms
must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors such as
huge size of databases, wide distribution of data, and complexity of data mining
methods motivate the development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are then processed in parallel, and
the results from the partitions are merged. Incremental algorithms update existing mining
results when the database is updated, without mining the entire data again from scratch.
Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not
possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These data
sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge
from them adds challenges to data mining.

Data Mining Applications


1. Financial Analysis
The banking and finance industry relies on high-quality, reliable data. In loan markets,
financial and user data can be used for a variety of purposes, like predicting loan payments
and determining credit ratings. And data mining methods make such tasks more manageable. 
Classification techniques facilitate the separation of crucial factors that influence customers’
banking decisions from the irrelevant ones. Further, multidimensional clustering techniques
allow the identification of customers with similar loan payment behaviours. Data analysis and
mining can also help detect money laundering and other financial crimes.
2. Telecommunication Industry
The telecommunication industry is expanding and growing at a fast pace, especially with the
advent of the internet. Data mining can enable key industry players to improve their service
quality and stay ahead in the game.
Pattern analysis of spatiotemporal databases can play a huge role in mobile
telecommunication, mobile computing, and also web and information services. And
techniques like outlier analysis can detect fraudulent users. Also, OLAP and visualization
tools can help compare information, such as user group behaviour, profit, data traffic, system
overloads, etc. 
3. Intrusion Detection
Global connectivity in today’s technology-driven economy has presented security challenges
for network administration. Network resources can face threats and actions that intrude on
their confidentiality or integrity. Therefore, detection of intrusion has emerged as a crucial
data mining practice.
It encompasses association and correlation analysis, aggregation techniques, visualization,
and query tools, which can effectively detect any anomalies or deviations from normal
behaviour. 

4. Retail Industry
The organized retail sector holds sizable quantities of data points covering sales, purchasing
history, delivery of goods, consumption, and customer service. The databases have become
even larger with the arrival of e-commerce marketplaces. 
In modern-day retail, data warehouses are being designed and constructed to get the full
benefits of data mining. Multidimensional data analysis helps deal with data related to
different types of customers, products, regions, and time zones. Online retailers can also
recommend products to drive more sales revenue and analyze the effectiveness of their
promotional campaigns. So, from noticing buying patterns to improving customer service and
satisfaction, data mining opens many doors in this sector. 
5. Higher Education
As the demand for higher education goes up worldwide, institutions are looking for
innovative solutions to cater to the rising needs. Institutions can use data mining to predict
which students will enrol in a particular program and who will require additional assistance
to graduate, refining enrollment management overall.
Moreover, predicting students’ career paths and presenting the data becomes easier with
effective analytics. In this manner, data mining techniques can help
uncover the hidden patterns in massive databases in the field of higher education.
6. Energy Industry
Big Data is available even in the energy sector nowadays, which points to the need for
appropriate data mining techniques. Decision tree models and support vector machine
learning are among the most popular approaches in the industry, providing feasible solutions
for decision-making and management. Additionally, data mining can also achieve productive
gains by predicting power outputs and the clearing price of electricity.
7. Spatial Data Mining
Geographic Information Systems (GIS) and several other navigation applications make use of
data mining to secure vital information and understand its implications. This new trend
includes extraction of geographical, environment, and astronomical data, including images
from outer space. Typically, spatial data mining can reveal aspects like topology and
distance. 
 8. Biological Data Analysis
Biological data mining practices are common in genomics, proteomics, and biomedical
research. From characterizing patients’ behaviour and predicting office visits to identifying
medical therapies for their illnesses, data science techniques provide multiple advantages. 
Some of the data mining applications in the Bioinformatics field are:
 Semantic integration of heterogeneous and distributed databases
 Association and path analysis
 Use of visualization tools
 Structural pattern discovery
 Analysis of genetic networks and protein pathways

9. Other Scientific Applications


Fast numerical simulations in scientific fields like chemical engineering, fluid dynamics,
climate, and ecosystem modeling generate vast datasets. Data mining brings capabilities like
data warehouses, data preprocessing, visualization, graph-based mining, etc. 
10. Manufacturing Engineering
System-level designing makes use of data mining to extract relationships between portfolios
and product architectures. Moreover, the methods also come in handy for predicting product
costs and span time for development. 
11. Criminal Investigation
Data mining activities are also used in Criminology, which is a study of crime characteristics.
First, text-based crime reports need to be converted into word processing files. Then, the
identification and crime-matching process takes place by discovering patterns in
massive stores of data. 
12. Counter-Terrorism
Sophisticated mathematical algorithms can indicate which intelligence unit should play the
headliner in counter-terrorism activities. Data mining can even help with police
administration tasks, like determining where to deploy the workforce and denoting the
searches at border crossings. 

Data Preprocessing
Steps that should be applied to make the data more suitable for data mining.
Consists of a number of different strategies and techniques that are interrelated in
complex ways.
Strategies and techniques
(1) Aggregation
(2) Sampling
(3) Dimensionality reduction
(4) Feature subset selection
(5) Feature creation
(6) Discretization and binarization
(7) Variable transformation
Two categories of Strategies and techniques
Selecting data objects and attributes for the analysis.
Creating/changing the attributes.

Goal:
To improve the data mining analysis with respect to time, cost, and quality.
Aggregation
Quantitative attributes are typically aggregated by taking a sum or an average.
A qualitative attribute can either be omitted or summarized.

Motivations for aggregation


smaller data sets resulting from aggregation require less memory and
processing time; hence, aggregation may permit the use of more expensive
data mining algorithms.
aggregation can act as a change of scope or scale by providing a high-level view
of the data instead of a low-level view.

Disadvantage of aggregation
potential loss of interesting details.
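
A small pandas sketch of aggregation follows; the quarterly revenue figures are invented for illustration.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": [1, 2, 3, 4, 1, 2, 3, 4],
    "revenue": [120, 135, 150, 160, 140, 155, 165, 170],
})

# Quarterly rows are rolled up to yearly totals: a smaller, higher-level view.
yearly = sales.groupby("year", as_index=False)["revenue"].sum()
print(yearly)
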
Sampling Approaches
Random sampling.
Progressive or Adaptive Sampling
Random sampling
Sampling without replacement: as each item is selected, it is removed from the set
of all objects that together constitute the population.
Sampling with replacement: objects are not removed from the population as they
are selected for the sample, so the same object can be picked more than once.
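
A short pandas sketch of the two random sampling schemes follows; the toy data frame and sample size are illustrative.

import pandas as pd

df = pd.DataFrame({"x": range(10)})
without = df.sample(n=5, replace=False, random_state=0)   # each row at most once
with_repl = df.sample(n=5, replace=True, random_state=0)  # a row may repeat
print(without.index.tolist(), with_repl.index.tolist())
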
Progressive or Adaptive Sampling
Difficult to determine proper sample size.
So adaptive or progressive sampling schemes are used.
These approaches start with a small sample, and then increase the sample size
until a sample of sufficient size has been obtained.

Dimensionality reduction
Data mining algorithms work better if the dimensionality - the number of
attributes in the data - is lower.
Eliminate irrelevant features and reduce noise.
Lead to a more understandable model due to fewer attributes.
Allow the data to be more easily visualized.
Amount of time and memory required by the data mining algorithm is reduced.
Feature subset selection
The reduction of dimensionality by selecting new attributes that are a subset of the
old.

Discretization and Binarization


Discretization
Transform a continuous attribute into a categorical attribute.
Binarization
Transform both continuous and discrete attributes into one or more binary
attributes.
Variable transformation
A transformation that is applied to all the values of a variable.

DATA CLEANING TECHNIQUES


Real-world data tend to be
incomplete
noisy
inconsistent
The basic methods for data cleaning are
handling missing values,
data smoothing techniques,
approaches to data cleaning as a process.
Missing Values:
Assume that you need to analyze the sales and customer data of a company, and that
many tuples have no recorded value for several attributes, such as customer
income.
The methods for filling in the missing values for this attribute:
1. Ignore the tuple:
This is usually done when the class label is missing.
This method is not very effective, unless the tuple contains several attributes
with missing values.
It is especially poor when the percentage of missing values per attribute varies
considerably.
2. Fill in the missing value manually:
This approach is time-consuming
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant, such as a label like
“Unknown”.
If missing values are replaced by, say, “Unknown,” then the mining program
may mistakenly think that they form an interesting concept, since they all have
a value in common—that of “Unknown.”
Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value:
For example, suppose that the average income of AllElectronics customers is
$56,000.
Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class:
For example, if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same credit risk category.

6. Use the most probable value to fill in the missing value:


This may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
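
The pandas sketch below illustrates two of the filling strategies above (a global constant and the attribute mean); the customer table is invented for illustration.

import numpy as np
import pandas as pd

customers = pd.DataFrame({
    "income":     [48000, np.nan, 56000, 61000, np.nan],
    "occupation": ["clerk", None, "engineer", None, "teacher"],
})

# Strategy 3: a global constant for a categorical attribute.
customers["occupation"] = customers["occupation"].fillna("Unknown")
# Strategy 4: the attribute mean for a numeric attribute.
customers["income"] = customers["income"].fillna(customers["income"].mean())
print(customers)
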

Noisy Data
Noise is a random error or variance in a measured variable. Given a numerical attribute,
the following data smoothing techniques are used:
1. Binning:
Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it.
The sorted values are distributed into a number of “buckets,” or bins. Binning is
also used as a discretization technique.
2. Regression:
Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the “best” line to fit two attributes (or variables),
so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
3. Clustering:
Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may
be considered outliers.
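
The pandas sketch below illustrates smoothing by bin means, one standard variant of the binning technique described above; the price list and the choice of three bins are illustrative assumptions.

import pandas as pd

prices = pd.Series(sorted([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = pd.qcut(prices, q=3)                        # three equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")  # replace each value by its bin mean
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
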

Data Cleaning as a Process


Missing values, noise, and inconsistencies contribute to inaccurate data.
The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors, including poorly designed data entry
forms that have many optional fields, human error in data entry, and deliberate errors.
Discrepancies may also arise from inconsistent data representations and the
inconsistent use of codes.
Errors in instrumentation devices that record data, and system errors, are another
source of discrepancies.
Errors can also occur when the data are used for purposes other than originally
intended. There may also be inconsistencies due to data integration.
There are a number of different commercial tools that can aid in the step of discrepancy
detection.
Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal
addresses, and spell-checking) to detect errors and make corrections in the data.
Data auditing tools find discrepancies by analyzing the data to discover rules and
relationships, and detecting data that violate such conditions. They are variants of data
mining tools.

Dimensionality Reduction
Data sets can have a large number of features.
Example:
 Consider a set of documents, where each document is represented by a vector whose
components are the frequencies with which each word occurs in the document.
 In such cases, there are typically thousands or tens of thousands of attributes
(components), one for each word in the vocabulary.
 As another example, consider a set of time series consisting of the daily closing price
of various stocks over a period of 30 years.
 In this case, the attributes, which are the prices on specific days, again number in the
thousands.

Benefits to dimensionality reduction:


 Data mining algorithms work better if the dimensionality - the number of attributes in
the data - is lower.
 can eliminate irrelevant features and reduce noise.
 Can lead to a more understandable model because the model may involve fewer
attributes.
 allow the data to be more easily visualized.
 the amount of time and memory required by the data mining algorithm is reduced.

Feature subset selection


 The reduction of dimensionality by selecting new attributes that are a subset of the
old.

The Curse of Dimensionality


 refers to the phenomenon that many types of data analysis become significantly
harder as the dimensionality of the data increases.
 as dimensionality increases, the data becomes increasingly sparse in the space that it
occupies.
 For classification: there are not enough data objects to allow the creation of a model
that reliably assigns a class to all possible objects.
 For clustering: the definitions of density and the distance between points, which are
critical for clustering, become less meaningful.
 As a result, many clustering and classification algorithms have trouble with high-
dimensional data, leading to reduced classification accuracy and poor-quality clusters.

Linear Algebra Techniques for Dimensionality Reduction


 Some of the most common approaches for dimensionality reduction, particularly for
continuous data, use techniques from linear algebra to project the data from a high-
dimensional space into a lower dimensional space.

Principal Components Analysis (PCA)


 is a linear algebra technique for continuous attributes that finds new attributes
(principal components) that
(1) are linear combinations of the original attributes
(2) are orthogonal (perpendicular) to each other
(3) capture the maximum amount of variation in the data.

Singular Value Decomposition (SVD)


 is a linear algebra technique that is related to PCA
 and is also commonly used for dimensionality reduction.
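
A minimal PCA sketch with scikit-learn follows; the Iris data and the choice of two components are illustrative.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)             # two orthogonal linear combinations
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2): 4 attributes reduced to 2
print(pca.explained_variance_ratio_)  # fraction of the variation captured
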
Measures of Similarity and Dissimilarity
Similarity and dissimilarity are used by a number of data mining techniques, such as
clustering, nearest neighbor classification, and anomaly detection.
The initial data set is not needed once these similarities or dissimilarities have been
computed.
Such approaches can be viewed as transforming the data to a similarity
(dissimilarity) space and then performing the analysis.

Proximity
o refers to either similarity or dissimilarity.
o proximity between two objects is a function of the proximity between the
corresponding attributes of the two objects.
o This includes measures such as
Correlation and Euclidean distance, which are useful for dense data
such as time series or two-dimensional points
Jaccard and cosine similarity measures, which are useful for sparse
data like documents.

Similarity
o Similarity between two objects is a numerical measure of the degree to which
the two objects are alike.
o are higher for pairs of objects that are more alike.
o usually non-negative and are often between 0 (no similarity) and 1 (complete
similarity).

Dissimilarity
o Dissimilarity between two objects is a numerical measure of the degree to
which the two objects are different.
o are lower for more similar pairs of objects.
o term distance is used as a synonym for dissimilarity.
o fall in the interval [0,1], but it is also common for them to range from 0 to ∞.

Transformations
o are often applied to convert a similarity to a dissimilarity, or vice versa, or to
transform a proximity measure to fall within a particular range, such as [0,1].
o proximity measures, especially similarities, are defined or transformed to have
values in the interval [0,1].
o motivation for this is to use a scale in which a proximity value indicates the
fraction of similarity (or dissimilarity) between two objects.
o transformation of similarities to the interval [0,1] is given by the expression
s' = (s - min_s) / (max_s - min_s), where max_s and min_s are the
maximum and minimum similarity values, respectively.
o dissimilarity measures with a finite range can be mapped to the interval [0,1]
by using the formula d' = (d - min_d) / (max_d - min_d).
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the data.
These sources may include multiple data cubes, databases, or flat files. 

The data integration approaches are formally defined as triple <G, S, M> where, 
G stand for the global schema, 
S stands for the heterogeneous source of schema, 
M stands for mapping between the queries of source and global schema. 

There are mainly 2 major approaches for data integration – one is the “tight coupling
approach” and another is the “loose coupling approach”. 

Tight Coupling: 

Here, a data warehouse is treated as an information retrieval component.

In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.

Loose Coupling:  

Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.

And the data only remains in the actual source databases.


Issues in Data Integration: 
There are three issues to consider during data integration: Schema Integration, Redundancy
Detection, and resolution of data value conflicts. These are explained in brief below. 
1. Schema Integration: 

 Integrate metadata from different sources.


 Matching the real-world entities from multiple sources is referred to as the entity
identification problem.

2. Redundancy: 

 An attribute may be redundant if it can be derived or obtained from another attribute


or set of attributes.
 Inconsistencies in attributes can also cause redundancies in the resulting data set.
 Some redundancies can be detected by correlation analysis.

3. Detection and resolution of data value conflicts: 

 This is the third critical issue in data integration.


 Attribute values from different sources may differ for the same real-world entity.
 An attribute in one system may be recorded at a lower level of abstraction than the
“same” attribute in another.

Data Reduction in Data Mining:

The method of data reduction may achieve a condensed description of the original data which
is much smaller in quantity but keeps the quality of the original data. 

Methods of data reduction: 


These are explained as following below. 

1.Data Cube Aggregation: 


This technique is used to aggregate data in a simpler form. For example, imagine that the
information you gathered for your analysis covers the years 2012 to 2014 and includes the
revenue of your company every three months. If you are interested in annual sales rather
than the quarterly figures, you can summarize the data so that the resulting data set reports
the total sales per year instead of per quarter.
2. Dimension reduction: 
Whenever we come across attributes that are only weakly relevant, we keep just the attributes
required for our analysis. This reduces the data size by eliminating outdated or redundant features.

Step-wise Forward Selection – 


The selection begins with an empty set of attributes; at each step, the best of the remaining
original attributes is added to the set based on its relevance, which can be assessed with a
statistical significance test (a p-value).
Suppose there are the following attributes in the data set in which few attributes are
redundant. 

Initial attribute Set: {X1, X2, X3, X4, X5, X6}

Initial reduced attribute set: { }

Step-1: {X1}

Step-2: {X1, X2}

Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

Step-wise Backward Selection – 


This selection starts with the complete set of attributes in the original data and, at each step,
eliminates the worst remaining attribute from the set.

Suppose there are the following attributes in the data set in which few attributes are
redundant. 

Initial attribute Set: {X1, X2, X3, X4, X5, X6}

Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}

Step-2: {X1, X2, X3, X5}

Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

Combination of forwarding and Backward Selection – 


It allows us to remove the worst and select the best attributes at each step, saving time and
making the process faster.
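
A hedged sketch of step-wise selection using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions) is shown below; the estimator and the target number of attributes are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",   # or "backward" for step-wise backward elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained attributes
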

3. Data Compression: 
The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques. 

Lossless Compression – 
Encoding techniques (such as Run Length Encoding) allow a simple and modest data size
reduction. Lossless data compression uses algorithms to restore the precise original data from
the compressed data.

Lossy Compression – 
Methods such as Discrete Wavelet transform technique, PCA (principal component analysis)
are examples of this compression. For example, the JPEG image format is a lossy
compression, but we can still recover a meaningful equivalent of the original image. In lossy
data compression, the decompressed data may differ from the original data but are useful
enough to retrieve information from them.

4. Numerosity Reduction: 
In this reduction technique, the actual data are replaced with a mathematical model or a
smaller representation of the data, so that only the model parameters need to be stored
(parametric methods), or with non-parametric methods such as clustering, histograms, and
sampling.

5. Discretization & Concept Hierarchy Operation: 


Techniques of data discretization are used to divide continuous attributes into intervals. Many
constant values of the attributes are replaced by labels of small intervals, so that mining
results can be presented in a concise and easily understandable way.

Top-down discretization – 
If you first consider one or a couple of points (so-called breakpoints or split points) to divide
the whole range of values, and then repeat this method recursively on the resulting intervals,
the process is known as top-down discretization, also known as splitting.

Bottom-up discretization – 
If you first consider all the constant values as split points, and then discard some of them by
merging neighbouring values into intervals, the process is called bottom-up discretization,
also known as merging.

Concept Hierarchies: 
It reduces the data size by collecting and then replacing low-level concepts (such as the numeric
value 43 for age) with high-level concepts (categorical values such as middle-aged or senior).

For numeric data following techniques can be followed: 

Binning – 
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user. 

Histogram analysis – 
Like binning, histogram analysis partitions the values of an attribute X
into disjoint ranges called buckets. There are several partitioning rules:

 Equal Frequency partitioning: Partitioning the values so that each bucket holds roughly
the same number of values from the data set.
 Equal Width Partitioning: Partitioning the values into intervals of equal width, determined
by the number of bins, e.g. a set of values ranging from 0-20.
 Clustering: Grouping the similar data together.
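
The pandas sketch below contrasts equal-width and equal-frequency partitioning; the age values and the number of bins are illustrative.

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 52, 70])
equal_width = pd.cut(ages, bins=4)   # four intervals of equal width
equal_freq = pd.qcut(ages, q=4)      # four intervals holding ~4 values each
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
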
Data Transformation in Data Mining

The data are transformed into forms that are suitable for mining. Data transformation
involves the following steps:

1.Smoothing:
It is a process used to remove noise from the dataset using smoothing algorithms. It allows
important features of the dataset to stand out and helps in predicting patterns. When data are
collected, they can be manipulated to eliminate or reduce any variance or any other form of
noise.
The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns. This serves as a help to analysts or traders who need to
look at a lot of data which can often be difficult to digest for finding patterns that they
wouldn’t see otherwise.

2.Aggregation:
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used. Gathering accurate
data of high quality and a large enough quantity is necessary to produce relevant results.

The collection of data is useful for everything from decisions concerning financing or
business strategy of the product, pricing, operations, and marketing strategies.

For example, daily sales data may be aggregated to compute monthly and annual totals.

3.Discretization:
It is a process of transforming continuous data into a set of small intervals. Most real-world
data mining tasks involve continuous attributes, yet many existing data mining frameworks
are unable to handle these attributes directly.

Also, even if a data mining task can manage a continuous attribute, it can significantly
improve its efficiency by replacing the continuous attribute with its discretized values.

For example, (1-10, 11-20) (age:- young, middle age, senior).

4. Attribute Construction:
New attributes are constructed from the given set of attributes and added to assist the mining
process. This simplifies the original data and makes the mining more efficient.

5.Generalization:
It converts low-level data attributes to high-level data attributes using concept hierarchy. For
Example Age initially in Numerical form (22, 25) is converted into categorical value (young,
old).
For example, Categorical attributes, such as house addresses, may be generalized to higher-
level definitions, such as town or country.

6. Normalization: Data normalization involves converting all data variables into a given
range.
Techniques that are used for normalization are:

Min-Max Normalization:

This transforms the original data linearly.

Suppose that min_A is the minimum and max_A is the maximum value of an attribute A, and
that [new_min_A, new_max_A] is the new range.

The formula is:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

where v is the value you want to map into the new range, and v' is the new value you get
after normalizing the old value.

Solved example:
Suppose the minimum and maximum values for an attribute profit (P) are Rs. 10,000 and Rs.
100,000, and we want to map the profit into the range [0, 1]. Using min-max normalization,
the value of Rs. 20,000 for attribute profit maps to:

v' = (20,000 - 10,000) / (100,000 - 10,000) * (1 - 0) + 0 = 10,000 / 90,000 ≈ 0.11

And hence, we get the value of v' as approximately 0.11.

Z-Score Normalization:

In z-score normalization (or zero-mean normalization) the values of an attribute A are
normalized based on the mean of A and its standard deviation.

A value, v, of attribute A is normalized to v' by computing:

v' = (v - mean_A) / std_A

For example:
Let the mean of an attribute P be 60,000 and its standard deviation be 10,000. Using
z-score normalization, a value of 85,000 for P is transformed to:

v' = (85,000 - 60,000) / 10,000 = 2.5

And hence we get the value of v' to be 2.5.

Decimal Scaling:
It normalizes the values of an attribute by changing the position of their decimal points

The number of points by which the decimal point is moved can be determined by the absolute
maximum value of attribute A.

A value, v, of attribute A is normalized to v' by computing:

v' = v / 10^j

where j is the smallest integer such that Max(|v'|) < 1.

For example:

Suppose: Values of an attribute P varies from -99 to 99.

The maximum absolute value of P is 99.

For normalizing the values, we divide each number by 100 (i.e., j = 2, the number of digits in
the largest absolute value), so that the values come out as 0.99, 0.98, 0.97 and so on.
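
The small Python sketch below reproduces the three normalizations with the worked numbers from the text; the helper function names are our own.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization to the range [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # Z-score (zero-mean) normalization.
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # Decimal scaling: move the decimal point j places.
    return v / (10 ** j)

print(round(min_max(20000, 10000, 100000), 2))          # 0.11
print(z_score(85000, 60000, 10000))                      # 2.5
print(decimal_scaling(-99, 2), decimal_scaling(99, 2))   # -0.99 0.99
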

What is Data Visualization?


Data visualization is a graphical representation of quantitative information and data by using
visual elements like graphs, charts, and maps.

Data visualization converts large and small data sets into visuals, which are easier for
humans to understand and process.

Data visualization tools provide accessible ways to understand outliers, patterns, and trends in
the data.

In the world of Big Data, the data visualization tools and technologies are required to analyze
vast amounts of information.

Data visualizations are common in everyday life, and they usually appear in the form of
graphs and charts. A combination of multiple visualizations and bits of information is
referred to as an infographic.

Data visualizations are used to discover unknown facts and trends. You can see visualizations
in the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts-of-
a-whole. And maps are the best way to share geographical data visually.
Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel
spreadsheets, displaying data in more sophisticated ways such as dials and gauges,
geographic maps, heat maps, pie charts, and fever charts.
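
A minimal matplotlib sketch of three of the chart types mentioned above follows; the numbers are invented for illustration.

import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.plot([2019, 2020, 2021, 2022], [10, 14, 13, 18])   # line chart: change over time
ax1.set_title("Line: change over time")
ax2.bar(["A", "B", "C"], [30, 45, 25])                 # bar chart: comparison
ax2.set_title("Bar: comparison")
ax3.pie([30, 45, 25], labels=["A", "B", "C"])          # pie chart: parts of a whole
ax3.set_title("Pie: parts of a whole")
plt.tight_layout()
plt.show()
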

What makes Data Visualization Effective?

Effective data visualizations are created when communication, data science, and design
collide. Data visualizations done right turn key insights from complicated data sets into
something meaningful and natural.

American statistician and Yale professor Edward Tufte believes useful data visualizations
consist of complex ideas communicated with clarity, precision, and efficiency.

To craft an effective data visualization, you need to start with clean data that is well-sourced
and complete. After the data is ready to visualize, you need to pick the right chart.

After you have decided the chart type, you need to design and customize your visualization to
your liking. Simplicity is essential - you don't want to add any elements that distract from the
data.

Similarity Measures
Similarity and dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbour classification, and anomaly detection.
The term proximity is used to refer to either similarity or dissimilarity
Definitions

The similarity between two objects is a numerical measure of the degree to which the two
objects are alike. Consequently, similarities are higher for pairs of objects that are more alike.
Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete
similarity).

The dissimilarity between two objects is the numerical measure of the degree to which the
two objects are different. Dissimilarity is lower for more similar pairs of objects.

Frequently, the term distance is used as a synonym for dissimilarity. Dissimilarities


sometimes fall in the interval [0,1], but it is also common for them to range from 0 to ∞

Proximity Measures

Proximity measures, especially similarities, are defined to have values in the interval [0,1]. If
the similarity between objects can range from 1 (not at all similar) to 10 (completely similar),
we can make them fall into the range [0,1] by using the formula: s'=(s-1)/9, where s and s' are
the original and the new similarity values, respectively.

In the more general case, s' is calculated as s' = (s - min_s)/(max_s - min_s), where min_s and
max_s are the minimum and maximum similarity values, respectively.

Likewise, dissimilarity measures with a finite range can be mapped to the interval [0,1] by
using the formula d'=(d-min_d)/(max_d- min_d)

If the proximity measure originally takes values in the interval [0, ∞), then we usually use the
formula d' = d/(1 + d) to bring the dissimilarity measure into [0, 1].

Similarity and dissimilarity between simple attributes

The proximity of objects with a number of attributes is defined by combining the proximities
of individual attributes.

Attribute Types and Similarity Measures:

1) For interval or ratio attributes, the natural measure of dissimilarity between two attributes
is the absolute difference of their values. For example, we might compare our current weight
to our weight one year ago. In such cases the dissimilarities range from 0 to ∞.

2) For objects described with one nominal attribute, the attribute value describes whether the
attribute is present in the object or not. Comparing two objects with one nominal attribute
means comparing the values of this attribute. In that case, similarity is traditionally defined as
1 if attribute values match and as 0 otherwise. A dissimilarity would be defined in the
opposite way: 0 if the attribute values match, 1 if they do not.

3) For objects with a single ordinal attribute, information about order should be taken into
account. Consider an attribute that measures the quality of a product on the scale {poor, fair,
OK, good, wonderful}. It would be reasonable that a product P1 which was rated wonderful
would be closer to a product P2 rated good rather than a product P3 rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped to successive
integers, beginning at 0 or 1, e.g. {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then
d(P1, P2) = 4 - 3 = 1.

Dissimilarities Between Data Objects

Distances:

Distances are dissimilarities with certain properties. The Euclidean distance, d, between two
points x and y in one-, two-, or higher-dimensional space is given by the formula:

d(x, y) = sqrt( Σ_{k=1..n} (x_k - y_k)^2 )

where n is the number of dimensions and x_k and y_k are, respectively, the kth attributes
(components) of x and y.

The Euclidean distance is generalized by the Minkowski distance metric, given by:

d(x, y) = ( Σ_{k=1..n} |x_k - y_k|^r )^(1/r)

The following are the 3 most common examples of Minkowski distances:

r = 1 also known as City block (Manhattan or L1 norm) distance. A common example is the
Hamming distance, which is the number of bits that are different between two objects that
only have binary attributes (i.e., binary vectors)

r = 2. Euclidean distance (L2 norm).

r = ∞. Supremum (L_max or L_∞ norm) distance. This is the maximum difference between
any attributes of the objects. The L_∞ distance is defined more formally by:

d(x, y) = lim_{r→∞} ( Σ_{k=1..n} |x_k - y_k|^r )^(1/r)
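
The NumPy sketch below evaluates the three Minkowski cases for a pair of illustrative points.

import numpy as np

def minkowski(x, y, r):
    # Minkowski distance of order r between two vectors.
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))     # 5.0          (r = 1, city block / L1)
print(minkowski(x, y, 2))     # about 3.606  (r = 2, Euclidean / L2)
print(np.max(np.abs(x - y)))  # 3.0          (r -> infinity, supremum / L-infinity)
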

Similarities Between Data Objects

If s(x, y) is the similarity between points x and y, then typically we will have:

1. s(x, y) = 1 only if x = y   (0 ≤ s ≤ 1)

2. s(x, y) = s(y, x) for all x and y   (symmetry)

Non-symmetric Similarity

Measures – confusion matrix

Consider an experiment in which people are asked to classify a small set of characters as they
flash on the screen. The confusion matrix for this experiment records how often each
character is classified as itself, and how often it is classified as another character. For
example, suppose "0" appeared 200 times and was classified as "0" 160 times but as "o" 40
times. Likewise, suppose that "o" appeared 200 times and was classified as "o" 170 times and
as "0" 30 times


If we take these counts as a measure of similarity between the two characters, then we have a
similarity measure, but not a symmetric one.

s(“0", "o") =40/2 = 20%

s("o", "0") = 30/2 = 15%
