
DATA MINING

UNIT-I
Introduction to Data Mining: Data mining is the process of discovering patterns
in large data sets involving methods at the intersection of machine
learning, statistics, and database systems. The information or knowledge extracted
in this way can be used in applications such as:
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration
Data Mining Applications:
Data mining is highly useful in the following domains:
 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection
Apart from these, data mining can also be used in the areas of production control,
customer retention, science exploration, sports, astrology, and Internet Web Surf-Aid.
Knowledge discovery in databases (KDD):
Knowledge discovery in databases (KDD) is the process of discovering useful
knowledge from a collection of data.
Data Cleaning: Noise and inconsistent data are removed.
Data Integration: Multiple data sources are combined.
Data Selection: Data relevant to the analysis task are retrieved from the database.
Data Transformation: Data is transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations.
Data Mining: Intelligent methods are applied in order to extract data patterns.
Pattern Evaluation: Data patterns are evaluated (to identify the truly interesting
patterns representing knowledge based on interestingness measures).

Faculty: Mr. D. Krishna, Associate Professor CSE Dept


Fig: Data mining as a step in the process of Knowledge discovery.

Knowledge Presentation: Visualization and knowledge representation techniques are
used to present the mined knowledge to users.
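The KDD steps above can be sketched end-to-end as a tiny pipeline. This is a minimal sketch: all data, thresholds, and names below are made up for illustration.

```python
# A minimal sketch of the KDD chain on a toy transaction table.
from collections import Counter

raw = [
    {"item": "milk", "amount": 2},
    {"item": "milk", "amount": 1},
    {"item": "milk", "amount": -1},   # noisy record
    {"item": "bread", "amount": 1},
    {"item": None, "amount": 3},      # incomplete record
]

# Data cleaning: drop noisy and incomplete records.
cleaned = [r for r in raw if r["item"] is not None and r["amount"] > 0]

# Data selection/transformation: keep only the attribute relevant to mining.
items = [r["item"] for r in cleaned]

# Data mining: count item occurrences; pattern evaluation: keep only
# the items whose count meets an (arbitrary) interestingness threshold.
counts = Counter(items)
frequent = {item: n for item, n in counts.items() if n >= 2}

# Knowledge presentation: report the surviving patterns to the user.
print(frequent)   # {'milk': 2}
```

Each comment marks the KDD step it corresponds to; in a real system every step would be far more elaborate.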
Types of Data:
There are four types of data available in data mining:
Qualitative Data Type
 Nominal
 Ordinal
Quantitative Data Type
 Discrete
 Continuous
Because data is so important in our lives, it must be stored and processed correctly.
When dealing with datasets, the category of data plays an important role in
determining which preprocessing strategy will work for a particular set, or which
type of statistical analysis should be applied for the best results. Let's look at
some of the commonly used categories of data.
Qualitative Data Type
Qualitative or categorical data describes the object under consideration using a
finite set of discrete classes. This type of data can't easily be counted or
measured using numbers and is therefore divided into categories. The gender of
a person (male, female, or other) is a good example of this data type.
Such data is usually extracted from audio, images, or text. Another example is a
smartphone listing that provides information about the current rating, the color of
the phone, the category of the phone, and so on. All of this information can be
categorized as qualitative data. There are two subcategories:
Nominal
These are values that don't possess a natural ordering. The color of a smartphone
can be considered a nominal data type, since we can't compare one color with
another: it is not possible to state that 'Red' is greater than 'Blue'. The gender
of a person is another example, as is the category of a mobile phone (midrange,
budget segment, or premium).
Nominal data types are not quantifiable and cannot be measured in numerical units.
They are valuable in qualitative research because they extend freedom of opinion
to subjects.
Ordinal
These types of values have a natural ordering while maintaining their class of
values. If we consider the sizes of a clothing brand, we can easily sort them
by their name tags in the order small < medium < large. The grading system used
to mark candidates in a test is also an ordinal data type, where an A+ is
definitely better than a B grade.
These categories help us decide which encoding strategy can be applied to which
type of data. Encoding qualitative data is important because machine learning
models are mathematical in nature and can't handle such values directly; they
need to be converted to numerical types.
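The encoding choice follows from the subtype: ordinal values can be mapped to ordered integers, while nominal values are usually one-hot encoded. A minimal sketch (the category names are illustrative):

```python
# Ordinal data: the order is meaningful, so map labels to ordered integers.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)          # [1, 0, 2]

# Nominal data: no order exists, so one-hot encode instead of
# inventing an arbitrary ordering.
colors = ["red", "blue", "red"]
categories = sorted(set(colors))              # ['blue', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colors]
print(one_hot)                # [[0, 1], [1, 0], [0, 1]]
```

Using plain integers for nominal data would wrongly imply that one color is "greater" than another, which is exactly the distinction drawn above.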
Quantitative Data Type
This data type quantifies things by using numerical values, which makes it
countable in nature. The price of a smartphone, the discount offered, the number
of ratings on a product, the processor frequency, or the RAM of a particular
phone all fall under the category of quantitative data.
Discrete
Numerical values that are integers or whole numbers are placed under this
category.
Discrete data cannot be measured, only counted, since the objects included in
discrete data have fixed values. A value may be written with a decimal point,
but it must still be whole.
Continuous
Fractional numbers are considered continuous values. These can take the form of
the operating frequency of a processor, Wi-Fi frequency, the temperature of the
cores, and so on.
Unlike discrete data, which takes whole, fixed values, continuous data can be
broken down into smaller pieces and can take any value in a range.

Data mining functionalities:


Data mining functionalities are used to represent the types of patterns to be
discovered in data mining tasks. In general, data mining tasks can be classified
into two types: descriptive and predictive. Descriptive mining tasks characterize
the general properties of the data in the database, while predictive mining tasks
perform inference on the current data in order to make predictions.
There are various data mining functionalities which are as follows:
 Data characterization: It is a summarization of the general characteristics of
an object class of data. The data corresponding to the user-specified class is
generally collected by a database query. The output of data characterization
can be presented in multiple forms.
 Data discrimination: It is a comparison of the general characteristics of target
class data objects with the general characteristics of objects from one or a set
of contrasting classes. The target and contrasting classes can be specified by
the user, and the corresponding data objects are fetched through database queries.
 Association Analysis: It analyses the set of items that generally occur
together in a transactional dataset. Two parameters are used for determining
association rules:
o Support, which identifies the common item sets in the database.
o Confidence, which is the conditional probability that an item occurs in a
transaction when another item occurs.
 Classification: Classification is the procedure of discovering a model that
represents and distinguishes data classes or concepts, with the objective of
using the model to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class labels are known).
 Prediction: It predicts some unavailable data values or upcoming trends.
An object can be anticipated based on the attribute values of the object and
the attribute values of the classes. It can be a prediction of missing numerical
values or of increase/decrease trends in time-related information.
 Clustering: It is similar to classification, but the classes are not predefined;
they are derived from the data attributes. It is unsupervised learning. The
objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
 Outlier analysis: When data appears that cannot be grouped into any class,
we use outlier analysis. Such data has attributes different from those of any
other class or general model. These outstanding data objects are called outliers.
They are usually considered noise or exceptions, and the analysis of these
outliers is called outlier mining. Analyzing this type of data can be essential
for mining knowledge.



 Evolution analysis: It describes trends for objects whose behaviour changes
over time.
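The support and confidence measures from the association analysis functionality above can be computed directly. A minimal sketch over made-up transactions, for a hypothetical rule {bread} → {butter}:

```python
# Support and confidence for a candidate rule {bread} -> {butter}.
# The transactions are invented for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

n = len(transactions)
both  = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support    = both / n       # fraction of all transactions containing both items
confidence = both / bread   # conditional probability P(butter | bread)
print(support, confidence)  # 0.5 0.6666666666666666
```

Support measures how common the item set is overall, while confidence measures how reliably the second item follows the first, matching the two parameters described above.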
Are All Patterns Interesting?
A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
You may ask, "Are all of the patterns interesting?" Typically, the answer is no:
only a small fraction of the patterns potentially generated would actually be of
interest to a given user.
Interesting Patterns:
A pattern is interesting if it is
(1) Easily understood by humans,
(2) Valid on new or test data with some degree of certainty,
(3) Potentially useful, and
(4) Novel.
A pattern is also interesting if it validates a hypothesis that the user sought to
confirm. An interesting pattern represents knowledge.
Classification of Data Mining systems:
a data mining system can also be classified based on the kind of (a) databases
mined, (b) knowledge mined, (c) techniques utilized, and (d) applications
adapted.
What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of data as long as
the data are meaningful for a target application. The most basic forms of data for
mining applications are
 Database data (A database system, also called a database
management system (DBMS), consists of a collection of interrelated
data, known as a database, and a set of software programs to manage
and access the data)
 Data warehouse data (A data warehouse is usually modeled by a
multidimensional data structure, called a data cube, in which each
dimension corresponds to an attribute or a set of attributes in the
schema)
 Transactional data (a transactional database captures a transaction,
such as a customer's purchase, a flight booking, or a user's clicks on a
web page. A transaction typically includes a unique transaction
identity number (trans ID) and a list of the items making up the
transaction, such as the items purchased in the transaction).
Data mining can also be applied to other forms of data:
 Data streams (e.g., video surveillance and sensor data, which are
continuously transmitted)
 Time related/sequence data (e.g., historical records, stock exchange
data, and time-series and biological sequence data)
 Graph or networked data (e.g., social and information networks)
 Spatial data (e.g., maps)
 Multimedia data (including text, image, video, and audio data)
 WWW (a huge, widely distributed information repository made available
by the Internet)
Data mining will certainly continue to embrace new data types as they emerge.
What Kinds of Patterns Can Be Mined?
Let us now examine the kinds of patterns that can be mined.
There are a number of data mining functionalities. These include
 Characterization and discrimination
 Mining of frequent patterns, associations, and correlations
 Classification and regression
 Clustering analysis
 Outlier analysis
Data mining functionalities are used to specify the kinds of patterns to be found in
data mining tasks. In general, such tasks can be classified into two categories:
 Descriptive mining tasks
 Predictive mining tasks
Descriptive mining tasks characterize properties of the data in a target data set.
Predictive mining tasks perform induction on the current data in order to make
predictions.



As a highly application-driven domain, data mining has incorporated many
techniques from other domains such as statistics, machine learning, pattern
recognition, database and data warehouse systems, information retrieval,
visualization, algorithms, high performance computing, and many application
domains. The interdisciplinary nature of data mining research and development
contributes significantly to the success of data mining and its extensive
applications.

Fig: Data mining adopts techniques from many domains.

Kind of knowledge to be mined:


It refers to the kind of functions to be performed. These functions are −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis



Class/Concept Description:

Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computers and
printers, and concepts of customers include big spenders and budget spenders.
Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived in the following two ways:

 Data Characterization: This refers to summarizing the data of the class under
study, which is called the Target Class. The data corresponding to the
user-specified class are typically collected by a query. The output of data
characterization can be presented in various forms. Examples include pie charts,
bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as
generalized relations or in rule form (called characteristic rules).

 Data Discrimination: It refers to comparing a class against some predefined
group or class. Data discrimination is a comparison of the general features of
the target class data objects against the general features of objects from one
or multiple contrasting classes.

Background knowledge:

Background knowledge allows data to be mined at multiple levels of abstraction.
Concept hierarchies, for example, are one form of background knowledge that
permits such multilevel mining.

Data Mining Task Primitives:

Each user will have a data mining task in mind, that is, some form of data analysis
that he or she would like to have performed. A data mining task can be specified in
the form of a data mining query, which is input to the data mining system. A data
mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during
discovery in order to direct the mining process, or to examine the findings from
different angles or depths. The data mining primitives specify the following.

 The set of task-relevant data to be mined: This specifies the portions of
the database or the set of data in which the user is interested. This includes
the database attributes or data warehouse dimensions of interest (referred to
as the relevant attributes or dimensions).
 The kind of knowledge to be mined: This specifies the data mining
functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
 The background knowledge to be used in the discovery process: This
knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and for evaluating the patterns found. Concept hierarchies
are a popular form of background knowledge, which allow data to be mined
at multiple levels of abstraction.
 The interestingness measures and thresholds for pattern evaluation:
They may be used to guide the mining process or, after discovery, to evaluate
the discovered patterns. Different kinds of knowledge may have different
interestingness measures. For example, interestingness measures for
association rules include support and confidence. Rules whose support and
confidence values are below user-specified thresholds are considered
uninteresting.
 The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed,
which may include rules, tables, charts, graphs, decision trees, and cubes.
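Threshold-based pattern evaluation, as described for the interestingness-measure primitive, can be sketched as a simple filter over discovered rules; the rules and threshold values below are hypothetical:

```python
# Keep only rules whose support and confidence meet user-specified thresholds.
rules = [
    {"rule": "bread -> butter", "support": 0.50, "confidence": 0.67},
    {"rule": "milk -> eggs",    "support": 0.05, "confidence": 0.90},
    {"rule": "tea -> sugar",    "support": 0.30, "confidence": 0.40},
]

# User-specified thresholds (a primitive of the mining query).
min_support, min_confidence = 0.10, 0.60

# Rules below either threshold are considered uninteresting and discarded.
interesting = [r for r in rules
               if r["support"] >= min_support and r["confidence"] >= min_confidence]
print([r["rule"] for r in interesting])   # ['bread -> butter']
```

Note that a rule can fail on either measure alone: the second rule is confident but too rare, and the third is common but unreliable.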

A data mining query language can be designed to incorporate these primitives,
allowing users to flexibly interact with data mining systems. Having a data mining
query language provides a foundation on which user-friendly graphical interfaces
can be built.



Integration of Data mining system with a Data warehouse:

A data mining system is integrated with a data warehouse or database system so
that it can perform its tasks effectively. A data mining system operates in an
environment that requires it to communicate with other data systems, such as a
database system.



Depending on how the DM system is coupled with DB and DW systems, possible
integration schemes include no coupling, loose coupling, semitight coupling, and
tight coupling. We examine each of these schemes as follows:
1. No coupling: No coupling means that a DM system will not utilize any function
of a DB or DW system. It may fetch data from a particular source (such as a file
system), process data using some data mining algorithms, and then store the
mining results in another file.
2. Loose coupling: Loose coupling means that a DM system will use some facilities
of a DB or DW system, fetching data from a data repository managed by these
systems, performing data mining, and then storing the mining results either in a
file or in a designated place in a database or data Warehouse. Loose coupling is
better than no coupling because it can fetch any portion of data stored in
databases or data warehouses by using query processing, indexing, and other
system facilities.
However, many loosely coupled mining systems are main memory-based. Because
mining does not explore data structures and query optimization methods provided
by DB or DW systems, it is difficult for loose coupling to achieve high scalability
and good performance with large data sets.
3. Semitight coupling: Semitight coupling means that, besides linking a DM
system to a DB/DW system, efficient implementations of a few essential data
mining primitives (identified by analyzing frequently encountered data mining
functions) can be provided in the DB/DW system. These primitives can include
sorting, indexing, aggregation, histogram analysis, multiway join, and
precomputation of some essential statistical measures, such as sum, count, max,
min, and standard deviation.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated
into the DB/DW system. The data mining subsystem is treated as one functional
component of the information system. Data mining queries and functions are
optimized based on mining query analysis, data structures, indexing schemes, and
query processing methods of the DB or DW system.



Major Issues in Data Mining:
Data mining is a dynamic and fast-expanding field with great strengths. The major
issues in data mining can be partitioned into five groups:
 Mining methodology,
 User interaction,
 Efficiency and scalability,
 Diversity of data types,
 Data mining and society.
Mining Methodology: Mining methodology involves the investigation of new kinds
of knowledge, mining in multidimensional space, integrating methods from other
disciplines, and the consideration of semantic ties among data objects. In addition,
mining methodologies should consider issues such as data uncertainty, noise, and
incompleteness. Some mining methods explore how user-specified measures can be
used to assess the interestingness of discovered patterns as well as to guide the
discovery process. Let's look at the various aspects of mining methodology.
 Mining various and new kinds of knowledge
 Mining knowledge in multidimensional space
 Data mining—an interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling uncertainty, noise, or incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
User Interaction:
The user plays an important role in the data mining process. Interesting areas of
research include how to interact with a data mining system, how to incorporate a
user's background knowledge in mining, and how to visualize and comprehend
data mining results.
 Interactive mining
 Incorporation of background knowledge
 Ad hoc data mining and data mining query languages
 Presentation and visualization of data mining results



Efficiency and Scalability:
Efficiency and scalability are always considered when comparing data mining
algorithms. As data amounts continue to multiply, these two factors are especially
critical.
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, and incremental mining algorithms
Diversity of Database Types:
The wide diversity of database types brings about challenges to data mining. These
include
 Handling complex types of data
 Mining dynamic, networked, and global data repositories.
Data Preprocessing:
Today‘s real-world databases are highly susceptible to noisy, missing, and
inconsistent data due to their typically huge size (often several gigabytes or more)
and their likely origin from multiple, heterogeneous sources. Low-quality data will
lead to low-quality mining results.
There are several data preprocessing techniques. Data cleaning can be applied to
remove noise and correct inconsistencies in data. Data integration merges data
from multiple sources into a coherent data store such as a data warehouse. Data
reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering. Data transformations (e.g., normalization) may
be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0.
Multidimensional measures of Data Quality:

Data have quality if they satisfy the requirements of the intended use. There are
many factors comprising data quality, including:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Interpretability
Major Tasks in Data Preprocessing:

Major steps involved in data preprocessing, namely, data cleaning, data integration,
data reduction, and data transformation.

Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies.

Data integration involves integrating multiple databases, data cubes, or files. Yet
some attributes representing a given concept may have different names in different
databases, causing inconsistencies and redundancies.

"The data set I have selected for analysis is huge, which is sure to slow down the
mining process. Is there a way I can reduce the size of my data set without
jeopardizing the data mining results?" Data reduction obtains a reduced
representation of the data set that is much smaller in volume, yet produces the
same (or almost the same) analytical results. Data reduction strategies include
dimensionality reduction and numerosity reduction.

Fig: Forms of Data Preprocessing


Why Data preprocessing:
Data in the real world is dirty.
Incomplete data: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data.
Example: occupation = " "
Incomplete data may come from:
 "Not applicable" data values at collection time
 Different considerations between the time the data was collected and the
time it is analyzed
 Human, hardware, or software problems

Noisy data: containing errors or outliers.
Example: Salary = "-10"
Noisy data may come from:
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission

Inconsistent data: containing discrepancies in codes or names.
Example: age = "42" but birthday = "03/07/1997"; rating was "1,2,3" and is now
"A,B,C"
Inconsistent data may come from:
 Different data sources
 Functional dependency violations (e.g., modifying some linked data)
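The three kinds of dirty data above can be flagged with simple validity checks. A sketch over made-up records; the field names and valid ranges are assumptions:

```python
# Flag incomplete, noisy, and inconsistent records before mining.
records = [
    {"occupation": "", "salary": 45000, "age": 25},        # incomplete
    {"occupation": "clerk", "salary": -10, "age": 30},     # noisy
    {"occupation": "nurse", "salary": 52000, "age": 200},  # inconsistent
    {"occupation": "teacher", "salary": 40000, "age": 41}, # clean
]

def problems(record):
    """Return the list of data-quality problems found in one record."""
    found = []
    if not record["occupation"]:        # missing value, like occupation = " "
        found.append("incomplete")
    if record["salary"] < 0:            # impossible value, like salary = "-10"
        found.append("noisy")
    if not 0 <= record["age"] <= 120:   # violates a sanity constraint
        found.append("inconsistent")
    return found

for record in records:
    print(record["occupation"] or "?", problems(record))
```

Checks like these are the detection half of data cleaning; the four preprocessing methods below describe what to do with the flagged records.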

There are four data preprocessing methods:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction



1. Data Cleaning:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or
data cleansing) routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
1. Fill in missing values
2. Smooth noisy data
3. Identify or remove outliers
4. Resolve inconsistency
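Step 1 (filling in missing values) is commonly done with the attribute mean. A minimal sketch on toy values:

```python
# Fill missing numeric values with the mean of the known values.
values = [12.0, None, 15.0, 13.0, None, 400.0]

known = [v for v in values if v is not None]
mean = sum(known) / len(known)            # (12 + 15 + 13 + 400) / 4 = 110.0

filled = [v if v is not None else mean for v in values]
print(filled)   # [12.0, 110.0, 15.0, 13.0, 110.0, 400.0]
```

Note how the outlier 400.0 drags the mean upward; this is why outlier identification (step 3) is often done before, or alongside, filling in missing values.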
2. Data Integration:
Data integration is the process of combining data from different sources into a
single, unified view.
Data mining often requires data integration—the merging of data from multiple data
stores. Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting dataset. This can help improve the accuracy and
speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great challenges in data
integration.
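Resolving an attribute-name inconsistency during integration can be sketched with an explicit mapping between schemas; the field names below are hypothetical:

```python
# Merge two customer tables whose key attribute has different names:
# source A calls it "cust_id", source B calls it "customer_id".
source_a = [{"cust_id": 1, "city": "Delhi"}]
source_b = [{"customer_id": 1, "spend": 900}, {"customer_id": 2, "spend": 150}]

# Build a unified view keyed by the shared concept (the customer id).
by_id = {r["cust_id"]: dict(r) for r in source_a}
for r in source_b:
    by_id.setdefault(r["customer_id"], {}).update(spend=r["spend"])

print(by_id)  # {1: {'cust_id': 1, 'city': 'Delhi', 'spend': 900}, 2: {'spend': 150}}
```

Recognizing that `cust_id` and `customer_id` denote the same concept is exactly the semantic-heterogeneity problem mentioned above; here it is resolved by hand, while real integration systems use schema matching.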
3. Data Reduction:
Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information. This can be beneficial in
situations where the dataset is too large to be processed efficiently, or where the
dataset contains a large amount of irrelevant or redundant information.
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the
original data. That is, mining on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results.
Data reduction strategies include dimensionality reduction, numerosity
reduction, and data compression. Dimensionality reduction is the process of
reducing the number of random variables or attributes under consideration.



Numerosity reduction techniques replace the original data volume by alternative,
smaller forms of data representation. These techniques may be parametric or
nonparametric. For parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead of the actual data.
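A parametric numerosity reduction can be illustrated by fitting a least-squares line and keeping only its two parameters instead of the data points; the toy data below is made up:

```python
# Replace n data points with the 2 parameters of a least-squares line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Only (slope, intercept) need to be stored, not the original points.
print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
```

Storing two numbers in place of n points is the essence of the parametric approach; nonparametric methods such as histograms or sampling keep reduced data instead of model parameters.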
4. Data Transformation:
In this preprocessing step, the data are transformed or consolidated so that the
resulting mining process may be more efficient, and the patterns found may be
easier to understand.
Data Transformation Strategies Overview:
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include
binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data.
For example, the daily sales data may be aggregated so as to compute monthly and
annual total amounts. This step is typically used in constructing a data cube for
data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller
range, such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g.,
youth, adult, senior). The labels, in turn, can be recursively organized into
higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
The figure shows a concept hierarchy for the attribute price. More than one concept
hierarchy can be defined for the same attribute to accommodate the needs of
various users.
6. Concept hierarchy generation for nominal data, where attributes such as
street can be generalized to higher-level concepts, like city or country. Many
hierarchies for nominal attributes are implicit within the database schema and can
be automatically defined at the schema definition level.
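Strategies 4 and 5 (normalization and discretization) can be sketched together; the ranges and conceptual labels below are illustrative:

```python
# Min-max normalization to [0.0, 1.0], then discretization of age
# into conceptual labels.
ages = [4, 15, 38, 67]

# Normalization: scale each value into the range [0.0, 1.0].
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]
print([round(v, 2) for v in normalized])   # [0.0, 0.17, 0.54, 1.0]

# Discretization: replace raw ages with conceptual labels.
def label(age):
    if age <= 20:
        return "youth"
    elif age <= 60:
        return "adult"
    return "senior"

print([label(a) for a in ages])   # ['youth', 'youth', 'adult', 'senior']
```

The cut-off ages chosen here are arbitrary; in practice the intervals come from domain knowledge or from a concept hierarchy like the one described above.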

