
UNIT – 1

Data Mining
Definition
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).
• Data mining is the process of extracting useful information stored in large databases.
• It is a powerful tool that organizations can use to retrieve useful information from available data warehouses.
• Data mining can be applied to relational databases, object-oriented databases, data warehouses, structured and unstructured databases, etc.
• Data mining is used in numerous areas such as banking, insurance, pharmaceuticals, etc.

KDD and Data mining

The main goal of KDD is to extract knowledge from large databases with the help of data mining methods.
The different steps of KDD are given below (a minimal code sketch of steps 1-4 follows the list):
1. Data cleaning:
In this step, noise and irrelevant data are removed from the database.
2. Data integration:
In this step, the heterogeneous data sources are merged into a single data source.
3. Data selection:
In this step, the data which is relevant to the analysis process gets retrieved from the database.
4. Data transformation:
In this step, the selected data is transformed into forms suitable for data mining.
5. Data mining:
In this step, the various techniques are applied to extract the data patterns.
6. Pattern evaluation:
In this step, the different data patterns are evaluated.
7. Knowledge representation:
This is the final step of KDD, in which the discovered knowledge is presented to the user.
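As a rough illustration, steps 1-4 can be sketched in Python with pandas. This is a minimal sketch, not a full KDD pipeline; the file names and column names (sales.csv, customers.csv, customer_id, region, amount) are hypothetical.

import pandas as pd

df = pd.read_csv("sales.csv")                        # load the raw data
df = df.dropna().drop_duplicates()                   # 1. data cleaning
df = df.merge(pd.read_csv("customers.csv"),
              on="customer_id")                      # 2. data integration
df = df.loc[df["region"] == "EU", ["region", "amount"]]  # 3. data selection
df["amount"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min())         # 4. transformation (min-max)
# Steps 5-7 (mining, pattern evaluation, presentation) would follow.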
WHAT TYPES OF DATA CAN BE MINED
1. Flat Files
• Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
• Data stored in flat files has no relationships or paths among the records; if a relational database is exported to flat files, the relations between the tables are lost.
• Flat files are described by a data dictionary. Eg: a CSV file.
• Application: used in data warehousing to store data, used for carrying data to and from a server, etc.
2. Relational Databases
• A relational database is a collection of data organized in tables with rows and columns.
• The physical schema of a relational database defines the structure of its tables.
• The logical schema of a relational database defines the relationships among its tables.
• The standard API of a relational database is SQL.
• Application: data mining, the ROLAP model, etc.
3. Data Warehouses
• A data warehouse is a collection of data integrated from multiple sources that supports queries and decision making.
• There are three types of data warehouse: the enterprise data warehouse, the data mart, and the virtual warehouse.
• Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
• Application: business decision making, data mining, etc.
4. Transactional Databases
• A transactional database is a collection of data organized by time stamps, dates, etc., to represent transactions in databases.
• This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed (see the sketch after this list).
• It is a highly flexible system where users can modify information without changing any sensitive information.
• It follows the ACID properties of a DBMS.
• Application: banking, distributed systems, object databases, etc.
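The rollback behavior mentioned above can be demonstrated with Python's built-in sqlite3 module; the account table and the simulated failure are made up for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")
con.commit()

try:
    with con:  # opens a transaction: commits on success, rolls back on error
        con.execute("UPDATE account SET balance = balance - 70 WHERE id = 1")
        raise RuntimeError("transfer interrupted")  # simulate a failure mid-transaction
except RuntimeError:
    pass

# The uncommitted update was rolled back, so the balance is unchanged.
print(con.execute("SELECT balance FROM account WHERE id = 1").fetchone())  # (100.0,)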
5. Multimedia Databases
• Multimedia databases consist of audio, video, image, and text media.
• They can be stored in object-oriented databases.
• They are used to store complex information in pre-specified formats.
• Application: digital libraries, video-on-demand, news-on-demand, musical databases, etc.
6. Spatial Database
• Spatial databases store geographical information.
• They store data in the form of coordinates, topology, lines, polygons, etc.
• Application: maps, global positioning, etc.
7. Time-series Databases
• Time-series databases contain data such as stock exchange data and user-logged activities.
• They handle arrays of numbers indexed by time, date, etc.
• They often require real-time analysis.
• Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
• WWW refers to the World Wide Web, a collection of documents and resources (audio, video, text, etc.) that are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessible through web browsers via the Internet.
• It is the most heterogeneous repository, as it collects data from multiple sources.
• It is dynamic in nature, as the volume of data is continuously increasing and changing.
• Application: online shopping, job search, research, studying, etc.

TECHNOLOGIES USED IN DATA MINING

Several techniques are used in the development of data mining methods. Some of them are mentioned below:

1. Statistics:

• Statistics uses mathematical analysis to express representations, models, and summaries of empirical data or real-world observations.
• Statistical analysis involves a collection of methods, applicable to large amounts of data, used to draw conclusions and report trends.
2. Machine learning

• Arthur Samuel defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed.
• When new data is entered into the computer, machine learning algorithms allow the learned model to grow or change.
• In machine learning, an algorithm is constructed to make predictions from the available database (predictive analysis).
• It is closely related to computational statistics.
The four types of machine learning are (a brief sketch of the first two follows this list):

1. Supervised learning

• It is based on classification.
• It is also called inductive learning. In this method, the desired outputs are included in the training dataset.
2. Unsupervised learning

Unsupervised learning is based on clustering. Clusters are formed on the basis of similarity measures, and desired outputs are not included in the training dataset.

3. Semi-supervised learning

Semi-supervised learning adds some desired outputs to the training dataset to generate the appropriate functions. This method avoids the need for a large number of labeled examples (i.e., desired outputs).

4. Active learning

• Active learning is a powerful approach for analyzing data efficiently.
• The algorithm is designed so that it chooses the data points whose desired outputs it needs and queries the user for them (so the user plays an important role in this type).
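The contrast between the first two types can be made concrete with scikit-learn; the tiny one-feature data set and the labels below are invented for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0], [1.2], [7.8], [8.1]])

# Supervised: the desired outputs (labels) are part of the training data.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, ["low", "low", "high", "high"])
print(clf.predict([[7.5]]))              # -> ['high']

# Unsupervised: no labels; clusters form from similarity measures alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                        # two clusters of two points each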
3. Information retrieval

Information retrieval deals with uncertain representations of the semantics of objects (text, images).
For example: finding relevant information in a large collection of documents.

4. Database systems and data warehouse

• Databases are used for recording data as well as for data warehousing.
• Online transaction processing (OLTP) uses databases for day-to-day transaction processing.
• To remove redundant data and save storage space, data is normalized and stored in the form of tables.
• Entity-relationship modeling techniques are used for relational database management system design.
• Data warehouses are used to store historical data, which helps in taking strategic decisions for the business.
• They are used for online analytical processing (OLAP), which helps to analyze the data.
5. Decision support system

• A decision support system is a category of information system. It is very useful in decision making for organizations.
• It is an interactive software-based system that helps decision makers extract useful information from data and documents in order to make decisions.

MAJOR ISSUES IN DATA MINING


Data mining is not simple to understand and implement. As already noted, data mining is a process that is crucial for many researchers and businesses, but its algorithms are complex and the data is rarely available in one place. Every technology has flaws or issues, and one should always know the issues this technology has.

Mining Methodology and User Interaction Issues:


I. Mining different kinds of knowledge in databases: This issue concerns covering a wide range of knowledge discovery tasks in order to meet the needs of the client or the customer. Because different users are interested in different kinds of knowledge, it is difficult for a single system to cover the whole range of knowledge discovery tasks.

II. Interactive mining of knowledge at multiple levels of abstraction: Interactive mining is very important because it permits the user to focus the search for patterns, providing and refining data mining requests based on the returned results. In simpler words, it allows the user to approach the search for patterns from various different angles.

III. Incorporation of background knowledge: The main role of background knowledge is to guide the discovery process and to indicate the patterns or trends seen during it. Background knowledge can also be used to express the observed patterns or trends in brief and precise terms, and it can be represented at different levels of abstraction.

IV. Data mining query languages and ad hoc data mining: A data mining query language should give the user access to describe ad hoc mining tasks, and it needs to be integrated with a data warehouse query language.

V. Presentation and visualization of data mining results: The patterns or trends that are discovered have to be rendered in high-level languages and visual representations. The representation has to be written so that it is easily understood by everyone.

VI. Handling noisy or incomplete data: For this, data cleaning methods are used. They are a convenient way of handling noise and incomplete objects in data mining. Without data cleaning methods, the discovered patterns will lack accuracy and be poor in quality.

Performance Issues:
There are performance-related issues in data mining as well. These issues are listed as follows:
Efficiency and scalability of data mining algorithms: Efficiency and scalability are very important in the data mining process; they allow the user to extract information from large amounts of data in various databases in an effective and productive manner.
Parallel, distributed, and incremental mining algorithms: Several factors motivate the development of parallel and distributed algorithms in data mining: the large size of databases, the wide distribution of data, and the complexity of data mining methods.
In this process, the algorithm first divides the data from the database into partitions. Next, the partitions are processed in parallel. In the last step, the results from the partitions are merged (a minimal sketch of this partition-process-merge pattern follows).
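The sketch below uses Python's multiprocessing, assuming for illustration that the "mining" step is simply counting item occurrences per partition (the transactions are made up):

from collections import Counter
from multiprocessing import Pool

def mine_partition(transactions):
    """Count item occurrences within one partition of the database."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts

if __name__ == "__main__":
    data = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
    partitions = [data[:2], data[2:]]                   # step 1: partition
    with Pool(2) as pool:
        partial = pool.map(mine_partition, partitions)  # step 2: process in parallel
    merged = sum(partial, Counter())                    # step 3: merge the results
    print(merged)  # Counter({'milk': 3, 'bread': 2, 'eggs': 2})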
Diverse Data Types Issues:
The issues of this type are given below:
Handling of relational and complex types of data: The database may contain various kinds of data objects, for example complex, multimedia, temporal, or spatial data objects. It is very difficult to mine all these kinds of data with a single system.
Mining information from heterogeneous databases and global information systems: The problem here is to mine knowledge from multiple data sources. The data is not available at a single source; instead it resides at different data sources on a LAN or WAN, and the structures of these sources differ as well.

Data Mining Functionalities—What Kinds of Patterns Can Be Mined?

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive.
• Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of
items for sale include computers and printers, and concepts of customers
include bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts
in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms; (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes); or (3) both data characterization and discrimination.
Data characterization is a summarization of the general characteristics or features of a target class of
data. The data corresponding to the user-specified class are typically collected by a database query. The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. The target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved through database queries.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many
kinds of frequent patterns, including itemsets, subsequences, and substructures.
A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as computer and software. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
Example: Association analysis. Suppose, as a marketing manager of AllElectronics, you would like to
determine which items are frequently purchased together within the same transactions. An example of
such a rule, mined from the AllElectronics transactional database, is 
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a
customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support
means that 1% of all of the transactions under analysis showed that computer and software were
purchased together. This association rule involves a single attribute or predicate (i.e., buys) that repeats.
Association rules that contain a single predicate are referred to as single-dimensional association rules.
Dropping the predicate notation, the above rule can be written simply as "computer ⇒ software [1%, 50%]".
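Support and confidence can be recomputed on a toy transaction list. The four transactions below are made up, so the rule comes out as [50%, 66.7%] rather than the textbook's [1%, 50%]:

transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)   # fraction of all transactions with both items
confidence = both / computer         # fraction of computer-buyers who also buy software
print(support, confidence)           # 0.5 0.666...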
Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes
or concepts, for the purpose of being able to use the model to predict the class of objects whose class
label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects
whose class label is known).
“How is the derived model presented?” The derived model may be represented in various forms, such
as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value,
each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
Decision trees can easily be converted to classification rules, as the sketch below illustrates.
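A minimal scikit-learn sketch of fitting a decision tree and reading it back as IF-THEN-style rules; the (age, income) training data is invented.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30], [35, 60], [45, 80], [20, 20]]   # (age, income) per customer
y = ["no", "yes", "yes", "no"]                 # class label: buys_computer

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # rule-like view of the tree
print(tree.predict([[40, 70]]))                # -> ['yes']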
A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest-neighbor classification. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions; that is, it is used to predict missing or unavailable numerical data values rather than class labels. Although the term prediction may refer to both numeric prediction and class label prediction, it usually refers to numeric prediction.
Cluster Analysis
Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data objects without consulting a known class label.
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can be more interesting than the
more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes
over time. Although this may include characterization, discrimination, association and correlation
analysis, classification, prediction, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data
analysis.

DATA OBJECTS

• Data sets are made up of data objects.
• A data object represents an entity.
• Examples: sales database – customers, store items, sales; medical database – patients, treatments; university database – students, professors, courses.
• Data objects are also called samples, examples, instances, data points, objects, or tuples.
• Data objects are described by attributes.

[Figure: A Data Object]

What is an Attribute?
The attribute can be defined as a field for storing the data that represents the characteristics of a data
object. The attribute is the property of the object. The attribute represents different features of the object. 
For example, hair color is an attribute of a person; similarly, RollNo and Marks are attributes of a student. An attribute vector is the set of attributes used to describe a given object.
Types of attributes
1. Qualitative Attributes such as Nominal, Ordinal, and Binary Attributes.
2. Quantitative Attributes such as Discrete and Continuous Attributes.
The different types of attributes are described below.
Example of attributes
In this example, RollNo, Name, and Result are attributes of the student object.

RollNo  Name   Result
1       Ali    Pass
2       Akram  Fail

Types of Attributes
• Binary
• Nominal
• Ordinal
• Numeric
  o Interval-scaled
  o Ratio-scaled

Nominal Attributes
• Nominal data is in the form of names or labels rather than integers. Nominal attributes are qualitative attributes.
Examples of Nominal Attributes
In this example, states and colors are the attributes, and New, Pending, Working, Complete, Finish and Black, Brown, White, Red are their values.
Attribute         Value
Categorical data  Lecturer, Assistant Professor, Professor
States            New, Pending, Working, Complete, Finish
Colors            Black, Brown, White, Red

Binary Attributes
Binary data has only two values/states; for example, HIV detected can only be Yes or No. Binary attributes are qualitative attributes.
Examples of Binary Attributes
Attribute     Value
HIV detected  Yes, No
Result        Pass, Fail
The binary attribute is of two types:
1. Symmetric binary
2. Asymmetric binary

Examples of Symmetric Data

In symmetric binary attributes, both values are equally important. For example, if a university has open admission, it does not matter whether an applicant is male or female.
Example:
Attribute  Value
Gender     Male, Female

Examples of Asymmetric Data

In asymmetric binary attributes, the two values are not equally important. For example, "HIV detected" is more important than "HIV not detected": if a patient has HIV and we ignore it, it can lead to death, but if a person does not have HIV and we ignore it, there is no special issue or risk.
Example:
Attribute     Value
HIV detected  Yes, No
Result        Pass, Fail

Ordinal Attributes
All values have a meaningful order. For example, grade A means the highest marks, B means marks lower than A, C means marks lower than both A and B, and so on. Ordinal attributes are qualitative attributes.
Examples of Ordinal Attributes
Attribute              Value
Grade                  A, B, C, D, F
BPS (Basic Pay Scale)  16, 17, 18
Discrete Attributes
Discrete data takes a finite (countable) set of values. It can be in numerical form or in categorical form. Discrete attributes are quantitative attributes.
Examples of Discrete Data
Attribute    Value
Profession   Teacher, Businessman, Peon, etc.
Postal Code  42200, 42300, etc.
Continuous Attributes
Continuous data technically has an infinite number of possible values and is stored as a floating-point type; there can be infinitely many numbers between 1 and 2. Continuous attributes are quantitative attributes.
Examples of Continuous Attributes
Attribute  Value
Height     5.4…, 6.5…, etc.
Weight     50.09…, etc.

DATA VISUALISATION

Data visualization refers to the visual representation of data with the help of comprehensive charts, images, lists, and other visual objects. It enables users to understand the information easily within a fraction of the time and to extract useful information, patterns, and trends.

Pixel-oriented
Pixel-oriented (or pixel-based) techniques use a single pixel to represent a data value. Depending on the value, it is mapped through a color map to choose the color of the pixel. Pixel-oriented techniques can be either query-dependent or query-independent. In the independent case, all attributes of the data are drawn on screen independently of each other. With a dependent technique, you specify a query that decides how the pixels should be placed, enabling the possibility of finding patterns. With a pixel-oriented method it is possible to visualize very large data sets, since each instance needs only a single pixel. However, without a good method of arranging the pixels, the result is a visualization that cannot be interpreted.

Recursive pattern: Recursive patterns make it possible to group data during visualization so that it is easier to interpret. They work by using several different levels of patterns, specifying a height and width for each level. A first-level pattern might organize the pixels so that the data can be interpreted over a year, while the second level makes it possible to interpret it over a month.

Circle segments: The circle segment technique divides a circle into k segments, one per attribute for the k attributes in the data set. Within each segment, each value is then visualized by coloring a single pixel. Close to the center all attributes are near one another, making it easier to compare their values.

ICON-BASED

Icon-based techniques visualize data by changing the properties of an icon or glyph according to the data. An early version was Chernoff faces, where data is mapped to different parts of a face such as the nose, mouth, and eyes. For example, how rich people are can be mapped to the mouth of the Chernoff face: rich people are represented by a happy mouth and poor people by a sad mouth. Other icon-based methods exist as well.

Icon-based methods have the strength of being easily interpreted if the icons or glyphs are chosen wisely. A poor choice can make it difficult to distinguish differences.

Geometric Projection

Geometric projection techniques are a good choice for finding outliers and correlations between attributes in multivariate data. A geometric projection technique does this by using transformations and projections of the data. With large data sets, it is usually necessary to apply a clustering algorithm before the visualization technique, to avoid a cluttered and unclear picture caused by too much information. Some widely used geometric projection techniques are:

Scatter plots: A scatter plot is one of the most common visualization techniques and can be drawn in both 2D and 3D. The scatter plot visualizes different attributes of the data on the x and y axes for 2D visualizations, and also along the z-axis in 3D. Scatter plots are useful for finding correlations between attributes in reasonably small data sets. If the data set gets too big or contains too many attributes, the scatter plot becomes cluttered and hard to interpret.
Parallel coordinates

If the data contains many attributes, you will need several scatter plots to visualize all of them, and it may become hard to see patterns. A technique often used for multivariate data is parallel coordinates. It works by visualizing each attribute on a vertical axis and connecting each individual data object with lines between the axes. The strength of parallel coordinates is the possibility of finding correlations between a large number of attributes. Their weakness is the same as with scatter plots: if the data set is too large, the plot easily gets cluttered.
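A minimal parallel-coordinates sketch using pandas' built-in plotting helper; the data frame and its column names are invented.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "height": [5.4, 6.1, 5.9, 5.2],
    "weight": [60, 85, 80, 55],
    "age":    [22, 34, 30, 21],
    "class":  ["A", "B", "B", "A"],
})
parallel_coordinates(df, "class")  # one vertical axis per attribute, one line per object
plt.show()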

RadViz

RadViz is another technique for visualizing multivariate data. It maps the data onto a 2D plane using Hooke's law and a set of anchor points, usually derived from the attributes.

Hierarchical

Hierarchical techniques visualize data using subspaces created from the data's attributes. They are useful when some attributes of the data may be more relevant than others.

Treemap

Treemaps display hierarchical data using rectangles. Each branch of the tree is assigned a rectangle; each sub-branch is then assigned a rectangle inside it, and this continues recursively until a leaf node is reached. Depending on the design, the rectangle representing the leaf node is colored, sized, or both according to chosen attributes. This is a good way to spot relevant data and works well for categorized data, although the result can be cluttered for large data sets.

SIMILARITY AND DISSIMILARITY MEASURE

In specific data-mining applications such as clustering, it is essential to find how similar or dissimilar
objects are to each other.

A similarity measure for two objects (i, j) will return 1 if the objects are similar and 0 if they are dissimilar.

A dissimilarity measure works in the opposite way: it returns 1 if the objects are dissimilar and 0 if they are similar.

Similarity and dissimilarity measures help in handling outliers and redundant data: potential outliers show up as objects that are highly dissimilar from all others, while highly similar objects point to redundancy.

The measure of similarity and dissimilarity is referred to as proximity.

A measure of similarity can often be expressed as a function of a measure of dissimilarity.


For nominal attributes, similarity and dissimilarity can be calculated as:

d(i, j) = (p − m) / p,   sim(i, j) = 1 − d(i, j) = m / p

where:
• i, j are the row and column indices of the dissimilarity matrix (the two objects being compared),
• m is the number of matches (attributes for which i and j are in the same state), and
• p is the total number of attributes.

Dissimilarity matrix

A dissimilarity matrix stores a collection of proximities that are available for all pairs of n-objects.

In a dissimilarity matrix, d(i, j) is the measured dissimilarity or difference between objects i and j.

Let’s look at an example and try to find similarity and dissimilarity measures.

While constructing a dissimilarity matrix, we give the value 1 to dissimilar objects and 0 to similar objects. For a similarity matrix, it is vice versa.

The proximity measure is calculated below for a single nominal attribute, grade, whose values for objects 1 to 4 are A, B, C, and A respectively (these values can be read off the calculations that follow).

The dissimilarity matrix values are calculated as shown below (with p = 1 attribute, d(i, j) is 0 for a match and 1 for a mismatch):

dis(2,1) = (B, A) = 1
dis(3,1) = (C, A) = 1
dis(3,2) = (C, B) = 1
dis(4,1) = (A, A) = 0
dis(4,2) = (A, B) = 1
dis(4,3) = (A, C) = 1

The similarity matrix values, using sim(i, j) = 1 − dis(i, j), are shown below:

sim(2,1) = 1 − dis(2,1) = 0
sim(3,1) = 1 − dis(3,1) = 0
sim(3,2) = 1 − dis(3,2) = 0
sim(4,1) = 1 − dis(4,1) = 1
sim(4,2) = 1 − dis(4,2) = 0
sim(4,3) = 1 − dis(4,3) = 0

The full (lower-triangular) matrices for the example are:

Dissimilarity matrix          Similarity matrix
    1  2  3  4                    1  2  3  4
1   0                         1   1
2   1  0                      2   0  1
3   1  1  0                   3   0  0  1
4   0  1  1  0                4   1  0  0  1
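The same matrices can be computed in a few lines of Python (a sketch for this single nominal attribute):

import numpy as np

grades = ["A", "B", "C", "A"]          # objects 1-4 from the example
n = len(grades)
dis = np.array([[0.0 if grades[i] == grades[j] else 1.0
                 for j in range(n)] for i in range(n)])
sim = 1 - dis                          # sim(i, j) = 1 - dis(i, j)
print(dis)
print(sim)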

Example:

Let’s take an example where each data point contains only one input feature. This can be considered the simplest example to show dissimilarity between three data points A, B, and C. Each data sample has a single value on one axis (because we only have one input feature); let’s denote that as the x-axis. Take three points, A(0.5), B(1), and C(30). As you can tell, A and B are close to each other, in contrast to C. Thus, the similarity between A and B is higher than between A and C or B and C. In other words, A and B are strongly related. The smaller the distance, the larger the similarity.
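In one dimension the distance is just the absolute difference, which makes the point easy to verify:

A, B, C = 0.5, 1.0, 30.0
print(abs(A - B))   # 0.5  -> A and B are very similar
print(abs(A - C))   # 29.5 -> A and C are very dissimilar
print(abs(B - C))   # 29.0 -> B and C are very dissimilar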

PREPROCESSING IN DATA MINING

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
 
Steps Involved in Data Preprocessing: 
1. Data Cleaning: 
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves the handling of missing data, noisy data, etc.
 
 (a). Missing Data: 
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are: 
1. Ignore the tuples: 
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple. 
 
2. Fill the missing values: 
There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value, as in the sketch below.
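A minimal pandas sketch of filling missing values with the attribute mean; the marks column is invented.

import pandas as pd

df = pd.DataFrame({"marks": [40, None, 80, 60]})
df["marks"] = df["marks"].fillna(df["marks"].mean())  # fill with the attribute mean
print(df["marks"].tolist())  # [40.0, 60.0, 80.0, 60.0]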
 
(b). Noisy Data: 
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning method: 
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size, and each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used, as in the sketch after this item.
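A sketch of both variants on a small sorted list (the numbers are a common textbook-style example, not from this document):

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]   # equal-size segments

# Smoothing by bin means: every value becomes its bin's mean.
means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearest boundary.
bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(bins)    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]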
 
2. Regression: 
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
 
3. Clustering: 
This approach groups similar data into clusters. Outliers may remain undetected, or they fall outside the clusters.
2. Data Transformation: 
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
1. Normalization: 
It is done in order to scale the data values into a specified range, such as −1.0 to 1.0 or 0.0 to 1.0, as in the sketch below.
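A sketch of min-max normalization into [0.0, 1.0]; the values are made up.

def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max([200, 300, 400, 600, 1000]))  # [0.0, 0.125, 0.25, 0.5, 1.0]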
 
2. Attribute Selection: 
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process. 
 
3. Discretization: 
This is done to replace the raw values of a numeric attribute by interval labels or conceptual labels, as in the sketch below.
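A sketch using pandas' cut to replace raw numeric ages with interval labels; the bin edges and labels are chosen arbitrarily.

import pandas as pd

ages = pd.Series([5, 13, 22, 37, 45, 70])
labels = pd.cut(ages, bins=[0, 18, 40, 100],
                labels=["young", "adult", "senior"])
print(labels.tolist())  # ['young', 'young', 'adult', 'adult', 'senior', 'senior']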
 
4. Concept Hierarchy Generation: 
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
3. Data Reduction: 
Data mining is used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation: 
Aggregation operations are applied to the data to construct a data cube, as in the roll-up sketch below.
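A sketch of the roll-up idea with pandas, aggregating quarterly sales into yearly totals; the figures are invented.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [10, 12, 9, 14, 11, 13, 12, 15],
})
print(sales.groupby("year")["amount"].sum())  # one aggregated row per year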
 
2. Attribute Subset Selection: 
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the significance level and the p-value of each attribute: attributes with a p-value greater than the significance level can be discarded.
 
3. Numerosity Reduction: 
This enables storing a model of the data instead of the whole data, for example regression models.
 
4. Dimensionality Reduction: 
This reduces the size of the data by using encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. A minimal PCA sketch is given below.
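A sketch of lossy dimensionality reduction with PCA from scikit-learn; the five 2-D points are made up.

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
pca = PCA(n_components=1)                 # encode each 2-D point as a 1-D score
scores = pca.fit_transform(X)
X_back = pca.inverse_transform(scores)    # approximate (lossy) reconstruction
print(np.round(X_back, 2))                # close to X, but not identical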
