INTRODUCTION: Data mining is the discovery of useful knowledge from large
databases. Generally, the term mining refers to extracting something valuable from a mass of raw material; for example, extracting gold from rocks or sand is called gold mining.
1.1. Why Data Mining?
• The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is due to the wide availability of huge amounts of data and need for
turning such data into useful information and knowledge.
• The information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.
• Data mining can be viewed as a result of the natural evolution of information technology: it
provides a path for extracting the information an industry needs from its data warehouses, and
reflects the industry's growing reliance on knowledge derived from data.
• This evolution includes data collection, database creation, data management (i.e., data storage and
retrieval, and database transaction processing), and data analysis and understanding (involving data
warehousing and data mining).
1.1.1. Evolution of data mining and data warehousing: To understand the development of data
mining, we should know the evolution of database technology. This includes:
Data collection and Database creation: In the 1960s, database and information technology began
with file processing systems. A file processing system was powerful for its time, but it led to data
inconsistency: a user had to maintain duplicate copies of an industry's data.
Database Management Systems: Between 1970 and 1980, database technology progressed as follows:
→ Hierarchical and network database systems were developed.
→ Relational database systems were developed
→ Data modeling tools were developed in the early 1980s (such as the E-R model, etc.).
→ Indexing and data organization techniques were developed. ( such as B+ tree, hashing etc).
→ Query languages were developed. (such as SQL, PL/SQL)
→ User interfaces, forms and reports, query processing.
→ On-line transaction processing (OLTP)
Advanced Database Systems: From the mid-1980s to the present,
→ Advanced data models were developed (such as extended relational, object-oriented,
object-relational, spatial, temporal, multimedia, and scientific databases).
Data Warehousing and Data Mining: From the late 1980s to the present,
→ Developed Data warehouse and OLAP technology
→ Data mining and knowledge discovery were introduced.
Web-based Database Systems: From the 1990s to the present,
→ XML based database systems and web mining were developed.
New Generation of Integrated Information Systems: From 2000 onwards, integrated
information systems have been developed.
1.1.2. Data mining steps in the knowledge discovery process (KDD): Data
mining is a step in the Knowledge Discovery in Databases (KDD) process. KDD has different stages, such as
→Data Cleaning: It is the process of removing noise and inconsistent data.
→Data Integration: It is the process of combining data from multiple sources.
→Data Selection: It is the process of retrieving relevant data from database.
→Data Transformation: In this process, data are transformed or consolidated into forms or reports
by performing summary or aggregation operation.
→Data Mining: It is an essential process in which intelligent methods are applied to extract patterns
from the data.
→Pattern Evaluation: to identify the truly interesting patterns representing knowledge, based on some
interestingness measures (i.e., to check whether the mined patterns are in the required form or not).
→Knowledge presentation: Visualization and knowledge representation techniques are used to
present the mined data to the user.
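As a minimal illustration, the cleaning, selection/transformation, and mining-style aggregation steps above can be sketched on a few toy records (all field names and values here are illustrative, not from the text):

```python
# Minimal sketch of part of the KDD chain on toy records.
raw = [
    {"cust": "A", "amount": "120", "city": " Delhi "},
    {"cust": "B", "amount": None,  "city": "Delhi"},   # missing value
    {"cust": "A", "amount": "80",  "city": "Delhi"},
]

# Data cleaning: drop records with missing amounts, trim stray whitespace.
clean = [{**r, "city": r["city"].strip()} for r in raw if r["amount"] is not None]

# Data selection/transformation: keep the relevant fields, aggregate per customer.
totals = {}
for r in clean:
    totals[r["cust"]] = totals.get(r["cust"], 0) + int(r["amount"])

print(totals)   # {'A': 200}
```

A real KDD pipeline performs each of these stages with far richer methods, but the order of the stages is the same.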
1.2.2 Architecture of Data Mining System:
The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge
base.
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files,
and other documents. You need a huge amount of historical data for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may comprise
one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain
text files or spreadsheets may contain information. Another primary source of data is the World
Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it
can't be used directly for the data mining procedure because the data may not be complete and
accurate. So, first the data needs to be cleaned and unified. More information than needed will
be collected from various data sources, and only the data of interest will have to be selected and
passed to the server. These procedures are not as easy as we think. Several methods may be
performed on the data as part of selection, integration, and cleaning.
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels
of abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints
or thresholds, and metadata (e.g., describing data from multiple heterogeneous
sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.
4. User interface:
This module communicates between users and the data mining system, allowing
the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory
data mining based on the intermediate data mining results. In addition, this
component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.
1.2. What Kind of Data Can Be Mined?
Data mining can be applied to any kind of information repository, such as database data, data
warehouses, transactional databases, advanced database systems, flat files, and the World Wide Web.
Advanced database systems include object-oriented and object-relational databases, time-series
databases, text databases, and multimedia databases.
→1.3.1. Databases Data: A database system is also called a database management system (DBMS).
It consists of a collection of interrelated data, known as a database, and a set of software programs to
manage and access the data. The software programs provide mechanisms for defining database
structures and data storage. These also provide data consistency and security, concurrency, shared or
distributed data access etc.
A relational database is a collection of tables, each of which is assigned a unique name. Each table
consists of a set of attributes (columns or fields) and a set of tuples (records or rows). Each tuple is
identified by a unique key and is described by a set of attribute values. ER models are often
constructed for relational databases. For example, the AllElectronics company can be illustrated
with the following relations: customer, item, employee, and branch.
customer table
item table: item_id | item_name | price | manufacturing
The AllElectronics company sells products (such as computers and printers) to its customers. A
relation exists between the customer table and the item table. From this relation, we can identify
what types of products each customer has bought.
→1.3.2. Data Warehouses: A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site. Data warehouses are
constructed through a process of data cleaning, data transformation, data integration, data loading,
and periodic data refreshing.
A data cube for summarized sales data of All Electronics is presented in Figure
The cube has three dimensions: address (with city values Chicago, New York, Toronto, Vancouver), time
(with quarter values Q1, Q2, Q3, Q4), and item(with item type values home entertainment, computer,
phone, security).
The aggregate value stored in each cell of the cube is sales amount (in thousands).
For example, the total sales for the first quarter, Q1, for the items related to security systems in Vancouver
is $400,000, as stored in cell Vancouver, Q1, security. Additional cubes may be used to store aggregate
sums over each dimension, corresponding to the aggregate values obtained using different SQL
group-bys (e.g., the total sales amount per city and quarter, or per city and item, or per quarter and
item, or per each individual dimension).
Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at
differing degrees of summarization. For instance, we can drill down on sales data summarized by quarter
to see data summarized by month. Similarly, we can roll up on sales data summarized by city to view data
summarized by country.
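A roll-up can be sketched as a simple aggregation that summarizes away one dimension. The Vancouver Q1 figure below is from the cube description above; the remaining values are illustrative:

```python
# Toy fact table: (city, quarter, sales in thousands).
facts = [
    ("Vancouver", "Q1", 400), ("Vancouver", "Q2", 350),
    ("Chicago",   "Q1", 300), ("Chicago",   "Q2", 250),
]

# Roll-up: summarize away the quarter dimension (total sales per city).
by_city = {}
for city, quarter, sales in facts:
    by_city[city] = by_city.get(city, 0) + sales

print(by_city)  # {'Vancouver': 750, 'Chicago': 550}
```

Drilling down is the inverse: it requires the finer-grained facts (e.g., per month) rather than a computation on the summary.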
→1.3.3. Transactional Databases: A transactional database consists of a file where each record
represents a transaction. A transaction typically includes a unique transaction identity number, the
date of the transaction, the customer ID number, the ID number of the salesperson, and so on.
AllElectronics transactions can be stored in a table with one record per transaction, as shown
below.
Transaction_id | List of items | Transaction date
T100 | I1, I3, I8, I16 | 18-12-2018
T200 | I2, I8 | 18-12-2018
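Counting which items recur across such transaction records is the starting point of frequent-itemset mining. A minimal sketch using the two transactions above (the minimum support count of 2 is an illustrative choice):

```python
from collections import Counter

# Transactions T100 and T200 from the table above.
transactions = [
    ["I1", "I3", "I8", "I16"],   # T100
    ["I2", "I8"],                # T200
]

# Count how many transactions contain each item.
counts = Counter(item for t in transactions for item in t)

# Items meeting a minimum support count of 2 are "frequent" here.
frequent = [item for item, c in counts.items() if c >= 2]
print(frequent)  # ['I8']
```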
What Kinds of Patterns Can Be Mined? (or) Data Mining Functionalities: Data
mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data
mining tasks are classified into two categories: descriptive and predictive.
→ Descriptive mining tasks characterize the general properties of the data in the database.
→Predictive mining tasks perform inference on the current data in order to make predictions.
1.4.1. Concept/Class Description: Descriptions of individual classes or concepts in
summarized, concise, and precise terms are called class or concept descriptions. These descriptions
can be divided into 1. Data Characterization and 2. Data Discrimination.
Data Characterization:
• It is a summarization of the general characteristics of a target class of data.
• The data corresponding to the user specified class are collected by a database query.
The output of data characterization can be presented in various forms, such as pie charts, bar charts,
curves, multidimensional cubes, multidimensional tables, etc. The resulting descriptions, when
presented as generalized relations, are called characteristic rules.
Data Discrimination: It is a comparison of the target class of data objects against one or a set of
contrasting (distinct) classes. The target and contrasting classes can be specified by the user, and the
corresponding data objects are retrieved through database queries.
For example, comparison of products whose sales increased by 10% in the last year with
those whose sales decreased by 30% during the same period. This is called data discrimination.
1.4.2. Mining Frequent Patterns, Associations and Correlations:
1.4.2.1. Frequent Patterns: A frequent itemset typically refers to a set of items that often appear
together in transactional data. For example, milk and bread are frequently purchased together by
many customers. In the AllElectronics database, such patterns reveal which products customers
frequently purchase together; in general, household necessities are bought most frequently.
1.4.2.2. Association Analysis: “What is association analysis ?”
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. It is used for transaction data analysis.
An association rule has the form X ⇒ Y.
For example, in the AllElectronics relational database, a data mining system may find association rules
like buys(X, “computer”) ⇒ buys(X, “software”),
meaning that customers who buy a computer also tend to buy software.
age(X, “20..29”) ∧ income(X, “20K..29K”) ⇒ buys(X, “laptop”)
This association rule indicates that AllElectronics customers who are between 20 and 29 years
of age and earn between $20,000 and $29,000 tend to purchase a laptop at AllElectronics.
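The strength of such a rule is usually measured by its support and confidence. A minimal sketch hand-computing both for the rule computer ⇒ software on a toy transaction list (the transaction contents are illustrative):

```python
# Hand-computing support and confidence for the rule computer => software.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
x_only = sum(1 for t in transactions if "computer" in t)

support = both / n          # fraction of transactions containing X and Y
confidence = both / x_only  # P(Y | X): of those buying a computer, how many buy software
print(support, round(confidence, 2))
```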
1.4.2.3. Classification and Regression for Prediction:
Classification is the process of finding a set of models that describes and distinguishes data classes
or concepts.
• The derived model may be represented in various forms such as classification (IF-THEN)
rules, decision trees, mathematical formulae or neural networks.
11
• A decision tree is a flow-chart-like tree structure. Decision trees can easily be converted to
classification rules. A neural network, when used for classification, is typically a collection of
neuron-like processing units with weighted connections between the units.
Regression for Prediction is used to predict missing or unavailable numerical data values rather than
class labels. Prediction refers to both numeric value prediction and class label prediction; the
predicted values are numerical data, and the task is often simply referred to as prediction.
1.4.2.4. Cluster Analysis: (“What is cluster analysis?”)
Clustering is a method of grouping data into different groups, so that the data in each group share
similar trends and patterns. The objectives of clustering are
• To uncover natural groupings
• To initiate hypothesis about the data
• To find consistent and valid organization of data.
For example, cluster analysis can be performed on AllElectronics customers to identify
homogeneous (similar) groups of customers. These clusters may represent target groups for
marketing, which can be used to increase sales.
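A minimal sketch of how clustering groups customers without predefined labels: a 1-D k-means over hypothetical annual-spend values (the data, starting centers, and iteration count are illustrative assumptions, not from the text):

```python
# A minimal pure-Python 1-D k-means sketch.
def kmeans_1d(values, centers, iters=10):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for v in values:                          # assign each value to nearest center
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # move each center to the mean of its assigned values
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return centers

spend = [10, 12, 11, 90, 95, 88]                  # two natural groups of customers
print(sorted(kmeans_1d(spend, [0, 100])))         # [11.0, 91.0]
```

Real cluster analysis works over many attributes at once and uses more robust algorithms, but the assign-then-update loop is the same idea.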
1.4.2.5. Outlier Analysis: A database may contain data objects that do not comply with the general
behavior or model of the data; these are outliers. Most data mining methods discard outliers as noise
or exceptions. However, in some applications, such as fraud detection, the rare events are the
interesting ones; the analysis of outlier data is referred to as outlier mining.
For example, outlier analysis may uncover fraudulent usage of credit cards by detecting
purchases of unusually large amounts compared with a customer's regular purchases.
Outliers may be detected using statistical tests that assume a distribution or probability model
for the data, or using distance measures where objects that are remote from any other cluster are
considered outliers.
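A statistical test of the kind mentioned above can be sketched as a simple z-score rule. The threshold and purchase amounts below are illustrative assumptions (a 3-sigma cutoff is the common rule of thumb; a looser 2-sigma cutoff is used here for the small sample):

```python
# Flag values lying more than `threshold` standard deviations from the mean.
def zscore_outliers(xs, threshold=3.0):
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [x for x in xs if abs(x - mean) > threshold * std]

purchases = [120, 130, 110, 125, 115, 5000]   # one suspicious amount
print(zscore_outliers(purchases, threshold=2.0))  # [5000]
```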
Interestingness Of Patterns
A data mining system has the potential to generate thousands or even millions of patterns,
or rules. Then, “are all of the patterns interesting?” Typically not; only a small fraction of
the patterns potentially generated would actually be of interest to any given user.
“What makes a pattern interesting?
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?”
To answer the first question,
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data
with some degree of certainty, (3) potentially useful, and (4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An
interesting pattern represents knowledge.
Several objective measures of pattern interestingness exist. These are based on the structure
of discovered patterns and the statistics underlying them. An objective measure for
association rules of the form X ⇒ Y is rule support, representing the percentage of transactions
from a transaction database that the given rule satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains
both X and Y, that is, the union of itemsets X and Y. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability
that a transaction containing X also contains Y. More formally, support and confidence are
defined as
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X)
1.5. Which Technologies Are Used? (or) Classification of Data Mining Systems:
Data mining draws on many techniques, such as statistics, machine learning, pattern recognition,
database and data warehouse systems, information retrieval, visualization, algorithms,
high-performance computing, and many application domains (shown in Figure). Data mining
systems can be categorized according to various criteria.
A data mining system categorized based on the kind of knowledge mined may have the following
functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
For example, if we want to classify a database based on the data model, we need to select a
relational, transactional, object-relational, or data warehouse mining system. A data mining system
can also be classified according to the application domain involved, for example:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail
Statistics: A statistical model is a set of mathematical functions that describe the behavior of the objects in
a target class in terms of random variables and their associated probability distributions. Statistical models
are widely used to model data and data classes.
For example, in data mining tasks like data characterization and classification, statistical models of
target classes can be built.
Alternatively, data mining tasks can be built on top of statistical models. For example, we can use
statistics to model noise and missing data values.
Inferential statistics (or predictive statistics) models data in a way that accounts for randomness
and uncertainty in the observations and is used to draw inferences about the process or population
under investigation.
A statistical hypothesis test (sometimes called confirmatory data analysis) makes statistical
decisions using experimental data. A result is called statistically significant if it is unlikely to have
occurred by chance. If the classification or prediction model holds true, then the descriptive statistics
of the model increase the soundness of the model.
Machine Learning: Machine learning investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data. Machine learning is a
fast-growing discipline.
• Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding
machine-readable translations are used as the training examples, which supervise the learning
of the classification model.
• Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. For example, an unsupervised
learning method can take, as input, a set of images of handwritten digits. Suppose that it finds
10 clusters of data. These clusters may correspond to the 10 distinct digits of 0 to 9,
respectively.
• Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. For a two-class problem, one class is
treated as the positive examples and the other class as the negative examples.
• Active learning is a machine learning approach that lets users play an active role in the
learning process.
Database Systems and Data Warehouses:
• Database system research focuses on the creation, maintenance, and use of databases for
organizations and end-users. In particular, it has established principles in data models, query
languages, query processing and optimization methods, data storage, and indexing and
access methods. Many data mining tasks need to handle large data sets or even real-time,
fast streaming data. Recent database systems have built systematic data analysis capabilities
on database data using data warehousing and data mining facilities.
• A data warehouse integrates data from multiple sources and various timeframes. It provides
OLAP facilities in multidimensional databases to promote multidimensional data mining. It
maintains current data, previous data, and historical data in the database.
Information Retrieval:
• Information retrieval (IR) is the science of searching for documents or information in
documents. The typical approaches in information retrieval adopt probabilistic models. For
example, a text document can be viewed as a bag of words, that is, a multiset of words
appearing in the document.
Pattern recognition is the process of recognizing patterns by using machine learning algorithms.
Pattern recognition can be defined as the classification of data based on knowledge already gained
or on statistical information extracted from patterns and/or their representation. One of the important
aspects of the pattern recognition is its application potential. Examples: Speech
recognition, speaker identification, multimedia document recognition (MDR), automatic medical
diagnosis.
Data visualization is a general term that describes any effort to help people understand the
significance of data by placing it in a visual context. Patterns, trends and correlations that might go
undetected in text-based data can be exposed and recognized more easily with data visualization software.
An algorithm in data mining (or machine learning) is a set of heuristics and calculations that creates
a model from data. To create a model, the algorithm first analyzes the data you provide, looking for
specific types of patterns or trends.
High-Performance Computing (HPC) provides a framework that can abstract the increased
complexity of current computing systems and at the same time provide performance benefits by
exploiting multiple forms of parallelism in data mining algorithms.
Data Mining Applications: The list of areas where data mining is widely used − Financial Data
Analysis, Retail Industry, Telecommunication Industry, Biological Data Analysis, Other Scientific
Applications, Intrusion Detection.
Set of task-relevant data to be mined
This specifies the portions of the data in which the user is interested. This includes −
• Database Attributes
• Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For
example, concept hierarchies are one form of background knowledge that allows data to be
mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns that are discovered by the process of knowledge discovery.
There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These representations
may include the following −
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
A data mining (DM) system may be integrated with a database (DB) or data warehouse (DW)
system using one of the following coupling schemes.
1.No coupling: No coupling means that a DM system will not utilize any function of a DB or
DW system. It may fetch data from a particular source (such as a file system), process the
data using some data mining algorithms, and then store the mining results in another file.
2.Loose coupling: Loose coupling means that a DM system will use some facilities of a DB
or DW system, fetching data from a data repository managed by these systems, performing
data mining, and then storing the mining results either in a file or in a designated place in a
database or data Warehouse. Loose coupling is better than no coupling because it can fetch
any portion of data stored in databases or data warehouses by using query processing,
indexing, and other system facilities.
However, many loosely coupled mining systems are main memory-based. Because
mining does not explore data structures and query optimization methods provided by DB or
DW systems, it is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
3.Semitight coupling: Semitight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data mining primitives
(identified by the analysis of frequently encountered data mining functions) can be provided
in the DB/DW system. These primitives can include sorting, indexing, aggregation,
histogram analysis, multi way join, and precomputation of some essential statistical
measures, such as sum, count, max, min, and standard deviation.
4.Tight coupling: Tight coupling means that a DM system is smoothly integrated into the
DB/DW system. The data mining subsystem is treated as one functional component of the
information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW
system.
Issues in Data Integration:
There are three issues to consider during data integration: Schema Integration, Redundancy
Detection, and resolution of data value conflicts. These are explained in brief below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching equivalent real-world entities from multiple sources is referred to as the
entity identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
3. Detection and resolution of data value conflicts:
• This is the third critical issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than
the “same” attribute in another.
1.6. Which Kinds of Applications Are Targeted?
Data mining has seen great successes in many applications, including knowledge-intensive
application domains such as bioinformatics and software engineering. Two popular examples are
business intelligence and web search engines.
• Business intelligence (BI) technologies provide historical, current, and predictive views of
business operations. Examples include reporting, online analytical processing, business
performance management, competitive intelligence, benchmarking, and predictive analytics.
o Data mining is the core of business intelligence. Online analytical processing tools in
business intelligence depend on data warehousing and multidimensional data mining.
Classification and prediction techniques are the core of predictive analytics in business
intelligence, for which there are many applications in analyzing markets, supplies, and
sales.
• A Web search engine is a specialized computer server that searches for information on the
Web. The search results of a user query are often returned as a list (sometimes called hits).
The hits may consist of web pages, images, and other types of files.
o Web search engines are essentially very large data mining applications. Various data
mining techniques are used in all aspects of search engines, ranging from crawling
(e.g., deciding which pages should be crawled and the crawling frequencies), indexing
(e.g., selecting pages to be indexed and deciding to which extent the index should be
constructed), and searching (e.g., deciding how pages should be ranked, which
advertisements should be added, and how the search results can be personalized or
made “context aware”).
1.7. Major issues in Data Mining: Data mining is a dynamic and fast-expanding field with
great strengths. Major issues in data mining research can be partitioned into five groups: mining
methodology, user interaction, efficiency and scalability, diversity of data types, and data mining
and society.
Performance Issues
• There can be performance-related issues, such as the following −
• Efficiency and scalability of data mining algorithms − In order to effectively
extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − The factors such
as huge size of databases, wide distribution of data, and complexity of data
mining methods motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into partitions, which are
further processed in a parallel fashion. Then the results from the partitions are
merged. Incremental algorithms update the mining results without mining the
whole data again from scratch.
Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for
one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds
challenges to data mining.
Data processing
Data processing is collecting raw data and translating it into usable information. The raw data
is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable
format. It is usually performed in a step-by-step process by a team of data scientists and data
engineers in an organization.
Data processing may be carried out automatically or manually. Nowadays, most data is processed
automatically with the help of computers, which is faster and gives accurate results. Data can thus
be converted into different forms, graphic as well as audio, depending on the software and the data
processing methods used.
After that, the data collected is processed and then translated into a desirable form as per
requirements, useful for performing tasks. The data is acquired from Excel files, databases, text
file data, and unstructured data such as audio clips, images, GPS data, and video clips.
Data processing is crucial for organizations to create better business strategies and increase their
competitive edge. By converting the data into a readable format like graphs, charts,
and documents, employees throughout the organization can understand and use the data.
The major steps involved in data preprocessing are:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation.
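As a small example of the normalization step listed above, min-max normalization rescales an attribute to a new range such as [0, 1] (the salary values below are illustrative):

```python
# Min-max normalization: map values linearly from [min, max] to [new_min, new_max].
def min_max(xs, new_min=0.0, new_max=1.0):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]

salaries = [30000, 45000, 60000]
print(min_max(salaries))  # [0.0, 0.5, 1.0]
```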
Drawbacks of the mean: The mean is sensitive to extreme (outlier) values; even a small number of
extreme values can corrupt it.
Trimmed mean
• we can instead use the trimmed mean, which is the mean obtained
after chopping off values at the high and low extremes.
• For example, we can sort the values observed for salary and remove
the top and bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both
ends as this can result in the loss of valuable information.
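The trimming procedure can be sketched directly. The data values and the 15% trim used here are illustrative (the text recommends a small trim such as 2% on large data sets):

```python
# Trimmed mean: sort, chop a fraction p off each end, average the rest.
def trimmed_mean(xs, p=0.02):
    xs = sorted(xs)
    k = int(len(xs) * p)           # number of values to drop at each end
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

salaries = [20, 30, 30, 40, 50, 50, 60, 1000]   # one extreme value
print(trimmed_mean(salaries, p=0.15))           # the 1000 no longer dominates
```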
Median
• For skewed (asymmetric) data, the median is a better measure of the center of the data.
• Suppose that a given data set of N distinct values is sorted in
numerical order.
• If N is odd, then the median is the middle value of the ordered set;
otherwise (i.e., if N is even), the median is the average of the middle
two values.
• Assume that data are grouped in intervals according to their xi data
values and that the frequency (i.e., number of data values) of each
interval is known.
• For example, people may be grouped according to their annual
salary in intervals such as 10–20K, 20–30K, and so on.
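For such grouped data, the median can only be approximated by interpolation. The standard formula for grouped frequency data (not stated in the text, but the usual one) is:

```latex
\text{median} \approx L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_{l}}{\text{freq}_{\text{median}}} \right) \times \text{width}
```

where $L_1$ is the lower boundary of the median interval (the interval containing the middle value), $N$ is the total number of values, $(\sum \text{freq})_{l}$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and $\text{width}$ is the width of the median interval.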
Mode
• The mode for a set of data is the value that occurs most
frequently in the set.
• Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
• For example, the mode of the data set 2, 4, 5, 5, 6, 7 is 5, because it appears twice while every other value appears only once.
• In general, a data set with two or more modes is multimodal.
• At the other extreme, if each data value occurs only once, then
there is no mode.
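A short sketch of mode-finding that covers the unimodal, multimodal, and no-mode cases, using the data set from the text:

```python
from collections import Counter

def modes(values):
    """All values tied for the highest frequency; empty list if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # no mode: each data value occurs only once
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 5, 5, 6, 7]))   # unimodal -> [5]
print(modes([1, 1, 2, 2, 3]))      # bimodal  -> [1, 2]
print(modes([1, 2, 3]))            # no mode  -> []
```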
• The distance between the first and third quartiles is the interquartile range (IQR):
IQR = Q3 − Q1.
The five-number summary of a distribution consists of the median, the
quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order Minimum, Q1, Median, Q3, Maximum.
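A minimal sketch of the five-number summary and IQR. Quartile conventions vary; this version takes the medians of the lower and upper halves of the sorted data (the sample values are hypothetical):

```python
def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)

    def med(d):
        m = len(d) // 2
        return d[m] if len(d) % 2 else (d[m - 1] + d[m]) / 2

    lower = data[: n // 2]          # lower half (excludes the median when n is odd)
    upper = data[(n + 1) // 2 :]    # upper half
    return data[0], med(lower), med(data), med(upper), data[-1]

vals = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
mn, q1, m, q3, mx = five_number_summary(vals)
print(mn, q1, m, q3, mx)    # 6 15 40 43 49
print("IQR =", q3 - q1)     # IQR = 28
```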
Boxplots
• Boxplots are a popular way of visualizing a distribution. The ends of the box are at the quartiles (so the box length is the IQR), the median is marked by a line within the box, and whiskers extend to the smallest and largest observations.
Data Cleaning
Missing Values
• Imagine that you need to analyze All Electronics sales and customer data.
You note that many tuples have no recorded value for several attributes,
such as customer income.
• How can you go about filling in the missing values for this attribute?
• Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective unless the tuple contains several attributes with
missing values.
• Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
• Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”
or −∞. If missing values are replaced by, say, “Unknown,” the mining
program may mistakenly think that they form an interesting concept,
since they all share the same value.
• Use the attribute mean to fill in the missing value: For example,
suppose that the average income of All Electronics customers is
$56,000. Use this value to replace any missing value for income.
• Use the attribute mean for all samples belonging to the same class
as the given tuple:
• For example, if classifying customers according to credit risk, replace
the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
• Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
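The attribute-mean and class-conditional-mean strategies above can be sketched as follows; the customer records and the income/risk attributes are hypothetical:

```python
# Replace missing income values (None) with the mean income of customers
# in the same credit-risk class (hypothetical data).
customers = [
    {"income": 40000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 80000, "risk": "high"},
    {"income": None,  "risk": "high"},
    {"income": 60000, "risk": "low"},
]

def fill_with_class_mean(rows, attr, cls):
    """Fill missing attr values with the mean of attr within each cls group."""
    groups = {}
    for row in rows:
        if row[attr] is not None:
            groups.setdefault(row[cls], []).append(row[attr])
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    return [dict(row, **{attr: row[attr] if row[attr] is not None
                         else means[row[cls]]}) for row in rows]

filled = fill_with_class_mean(customers, "income", "risk")
print([row["income"] for row in filled])
# [40000, 50000.0, 80000, 80000.0, 60000]
```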
Noisy Data
• “What is noise?”
• Noise is a random error or variance in a measured variable.
• Given a numerical attribute such as, say, price, how can we “smooth”
out the data to remove the noise?
• Let’s look at the following data smoothing techniques:
1. Binning:
• In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
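Smoothing by bin means can be sketched as below. Bin 1 (4, 8, 15) matches the text; the remaining price values are illustrative:

```python
def smooth_by_bin_means(data, bin_size):
    """Partition sorted data into equal-frequency bins and replace each value by its bin mean."""
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i : i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```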
2. Regression:
• Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used to predict the
other.
• Multiple linear regression is an extension of linear regression, where more
than two attributes are involved and the data are fit to a multidimensional
surface.
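A minimal least-squares fit of the “best” line for two attributes, so that one can be used to predict the other (the x/y values are made up):

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x for two attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)   # slope
    a = mean_y - b * mean_x                    # intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))   # roughly y = 0.05 + 1.99*x
```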
3. Clustering:
• Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Values that fall outside of the set of clusters may be considered outliers.
Data Integration
• Data mining often requires data integration—the merging of data from
multiple data stores.
• It is likely that your data analysis task will involve data integration, which
combines data from multiple sources into a coherent data store, as in data
warehousing.
• These sources may include multiple databases, data cubes, or flat files.
Issues during Data Integration
1. Entity identification problem
• How can equivalent real-world entities from multiple data sources be matched up? For example, customer_id in one database and cust_number in another may refer to the same attribute.
2. Redundancy
• Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
• For numerical attributes, we can evaluate the correlation between two
attributes, A and B, by computing the correlation coefficient (also known as
Pearson's product-moment coefficient, named after its inventor, Karl
Pearson).
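Pearson's coefficient is the covariance of A and B divided by the product of their standard deviations; a from-scratch sketch (the sample vectors are illustrative):

```python
def correlation(a, b):
    """Pearson's product-moment correlation coefficient r_{A,B}."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = (sum((x - mean_a) ** 2 for x in a) / n) ** 0.5
    std_b = (sum((y - mean_b) ** 2 for y in b) / n) ** 0.5
    return cov / (std_a * std_b)

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 2))   # 1.0 (perfect positive correlation)
```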
• For nominal attributes, a correlation relationship can be discovered by a chi-square test, where o_ij is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), which can be computed as e_ij = (count(A = a_i) × count(B = b_j)) / n, where n is the number of data tuples.
• The gender of each person was noted. Each person was polled as to
whether their preferred type of reading material was fiction or
nonfiction.
• Thus, we have two attributes, gender and preferred reading.
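The chi-square statistic for two nominal attributes such as gender and preferred reading can be sketched as follows (the contingency counts are illustrative):

```python
def chi_square(observed):
    """Pearson chi-square statistic from a 2-D contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected frequency
            stat += (o_ij - e_ij) ** 2 / e_ij
    return stat

# Rows: fiction, nonfiction; columns: male, female (illustrative counts).
observed = [[250, 200],
            [50, 1000]]
print(round(chi_square(observed), 2))   # 507.94 -> strongly correlated attributes
```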
Data Transformation
Min-max normalization
• This method performs a linear transformation on the original data, mapping a value v of attribute A to v′ in a new range [new_min_A, new_max_A]:
v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Z-score normalization:
• This method normalizes the value for attribute A using the mean
and standard deviation. The following formula is used for z-score
normalization:
v′ = (v − Ā) / σ_A
where Ā and σ_A are the mean and standard deviation of attribute A.
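Both normalizations can be sketched in a few lines (the income values are hypothetical; this z-score version uses the population standard deviation):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [20, 30, 40, 50, 60]
print(min_max(incomes))                          # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 2) for z in z_score(incomes)])   # [-1.41, -0.71, 0.0, 0.71, 1.41]
```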
Data Reduction
Data aggregation:
Dimensionality Reduction
• In dimensionality reduction, data encoding or transformations are applied
so as to obtain a reduced or “compressed” representation of the original
data.
• If the original data can be reconstructed from the compressed data without
any loss of information, the data reduction is called lossless.
• If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy.
Numerosity Reduction
Wavelet transforms
What’s a Wavelet?
• A wavelet is a wave-like oscillation that is localized in time. Wavelets have two basic properties: scale and location.
• Scale (or dilation) defines how “stretched” or “squished” a wavelet is. This
property is related to frequency as defined for waves.
• Location defines where the wavelet is positioned in time (or space)
Histograms
Sampling