INTRODUCTION: Data mining is the discovery of useful knowledge from large
databases. Generally, the term mining refers to extracting something valuable from a mass of raw material; for example, extracting gold from rocks or sand is called gold mining.
1.1. Why Data Mining?
• The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is due to the wide availability of huge amounts of data and need for
turning such data into useful information and knowledge.
• The information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.
• Data mining can be viewed as a result of the natural evolution of information technology: it
provides a path for extracting the information an industry needs from its data warehouses, and
reflects the industry's growing reliance on knowledge derived from data.
• This evolution includes data collection, database creation, data management (i.e., data storage and
retrieval, and database transaction processing), and data analysis and understanding (involving data
warehousing and data mining).
1.1.1. Evolution of data mining and data warehousing: To understand the development of data
mining, we should know the evolution of database technology. This includes:
Data collection and Database creation: In the 1960s, database and information technology began
with file processing systems. A file processing system was powerful for its time, but it led to data
inconsistency: a user had to maintain duplicate copies of an industry's data.
Database Management Systems: Between 1970 and 1980, database technology progressed as follows:
→ Hierarchical and network database systems were developed.
→ Relational database systems were developed
→ Data modeling tools were developed in the early 1980s (such as the E-R model, etc.).
→ Indexing and data organization techniques were developed. ( such as B+ tree, hashing etc).
→ Query languages were developed. (such as SQL, PL/SQL)
→ User interfaces, forms and reports, query processing.
→ On-line transaction processing (OLTP)
Advanced Database Systems: From the mid-1980s to the present,
→ Advanced data models were developed (such as extended relational, object-oriented,
object-relational, spatial, temporal, multimedia, and scientific databases).
Data Warehousing and Data Mining: From the late 1980s to the present,
→ Developed Data warehouse and OLAP technology
→ Data mining and knowledge discovery were introduced.
Web-based Database Systems: From the 1990s to the present,
→ XML based database systems and web mining were developed.
New Generation of Integrated Information Systems: From 2000 onwards, integrated
information systems have been developed.
1.1.2. Data mining steps in the knowledge discovery process (KDD): Data
mining is a step in the Knowledge Discovery in Databases (KDD) process. KDD has different stages, such as
→Data Cleaning: It is the process of removing noise and inconsistent data.
→Data Integration: It is the process of combining data from multiple sources.
→Data Selection: It is the process of retrieving relevant data from database.
→Data Transformation: In this process, data are transformed or consolidated into forms or reports
by performing summary or aggregation operation.
→Data Mining: It is an essential process in which intelligent methods are applied to extract patterns
from the data.
→Pattern Evaluation: to identify the truly interesting patterns representing knowledge, based on some
interestingness measures (i.e., to check whether the mined patterns are in the required form or not).
→Knowledge presentation: Visualization and knowledge representation techniques are used to
present the mined data to the user.
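As a minimal illustration, the cleaning, selection/transformation, and mining-style aggregation steps above can be sketched on a few toy records (all field names and values here are illustrative, not from the text):

```python
# Minimal sketch of part of the KDD chain on toy records.
raw = [
    {"cust": "A", "amount": "120", "city": " Delhi "},
    {"cust": "B", "amount": None,  "city": "Delhi"},   # missing value
    {"cust": "A", "amount": "80",  "city": "Delhi"},
]

# Data cleaning: drop records with missing amounts, trim stray whitespace.
clean = [{**r, "city": r["city"].strip()} for r in raw if r["amount"] is not None]

# Data selection/transformation: keep the relevant fields, aggregate per customer.
totals = {}
for r in clean:
    totals[r["cust"]] = totals.get(r["cust"], 0) + int(r["amount"])

print(totals)   # {'A': 200}
```

A real KDD pipeline performs each of these stages with far richer methods, but the order of the stages is the same.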
1.2.2 Architecture of Data Mining System:
The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge
base.
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files,
and other documents. You need a huge amount of historical data for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may comprise
one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain
text files or spreadsheets may contain information. Another primary source of data is the World
Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it
can't be used directly for the data mining procedure because the data may not be complete and
accurate. So, first the data needs to be cleaned and unified. More information than needed will
be collected from various data sources, and only the data of interest will have to be selected and
passed to the server. These procedures are not as easy as we think. Several methods may be
performed on the data as part of selection, integration, and cleaning.
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels
of abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints
or thresholds, and metadata (e.g., describing data from multiple heterogeneous
sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.
4. User interface:
This module communicates between users and the data mining system, allowing
the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory
data mining based on the intermediate data mining results. In addition, this
component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.
1.2. What Kind of Data Can Be Mined?
Data mining can be applied to any kind of information repository, such as database data, data
warehouses, transactional databases, advanced database systems, flat files, and the World Wide Web.
Advanced database systems include object-oriented and object-relational databases, time-series
databases, text databases, and multimedia databases.
→1.3.1. Databases Data: A database system is also called a database management system (DBMS).
It consists of a collection of interrelated data, known as a database, and a set of software programs to
manage and access the data. The software programs provide mechanisms for defining database
structures and data storage. These also provide data consistency and security, concurrency, shared or
distributed data access etc.
A relational database is a collection of tables, each of which is assigned a unique name. Each table
consists of a set of attributes (columns or fields) and a set of tuples (records or rows). Each tuple is
identified by a unique key and is described by a set of attribute values. ER models are often
constructed for relational databases. For example, the AllElectronics company can be illustrated
with the following relations: customer, item, employee, and branch.
customer table
item table: item_id | item_name | price | manufacturing
The AllElectronics company sells products (such as computers and printers) to its customers. A
relation exists between the customer table and the item table. From this relation, we can identify
what types of products each customer has bought.
→1.3.2. Data Warehouses: A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site. Data warehouses are
constructed through a process of data cleaning, data transformation, data integration, data loading,
and periodic data refreshing.
A data cube for summarized sales data of All Electronics is presented in Figure
The cube has three dimensions: address (with city values Chicago, New York, Toronto, Vancouver), time
(with quarter values Q1, Q2, Q3, Q4), and item(with item type values home entertainment, computer,
phone, security).
The aggregate value stored in each cell of the cube is sales amount (in thousands).
For example, the total sales for the first quarter, Q1, for the items related to security systems in Vancouver
is $400,000, as stored in cell Vancouver, Q1, security. Additional cubes may be used to store aggregate
sums over each dimension, corresponding to the aggregate values obtained using different SQL
group-bys (e.g., the total sales amount per city and quarter, or per city and item, or per quarter and
item, or per each individual dimension).
Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at
differing degrees of summarization. For instance, we can drill down on sales data summarized by quarter
to see data summarized by month. Similarly, we can roll up on sales data summarized by city to view data
summarized by country.
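A roll-up can be sketched as a simple aggregation that summarizes away one dimension. The Vancouver Q1 figure below is from the cube description above; the remaining values are illustrative:

```python
# Toy fact table: (city, quarter, sales in thousands).
facts = [
    ("Vancouver", "Q1", 400), ("Vancouver", "Q2", 350),
    ("Chicago",   "Q1", 300), ("Chicago",   "Q2", 250),
]

# Roll-up: summarize away the quarter dimension (total sales per city).
by_city = {}
for city, quarter, sales in facts:
    by_city[city] = by_city.get(city, 0) + sales

print(by_city)  # {'Vancouver': 750, 'Chicago': 550}
```

Drilling down is the inverse: it requires the finer-grained facts (e.g., per month) rather than a computation on the summary.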
→1.3.3. Transactional Databases: A transactional database consists of a file where each record
represents a transaction. A transaction typically includes a unique transaction identity number, the
date of the transaction, the customer ID number, the ID number of the salesperson, and so on.
AllElectronics transactions can be stored in a table with one record per transaction, as shown
below.
Transaction_id | List of items | Transaction date
T100 | I1, I3, I8, I16 | 18-12-2018
T200 | I2, I8 | 18-12-2018
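Counting which items recur across such transaction records is the starting point of frequent-itemset mining. A minimal sketch using the two transactions above (the minimum support count of 2 is an illustrative choice):

```python
from collections import Counter

# Transactions T100 and T200 from the table above.
transactions = [
    ["I1", "I3", "I8", "I16"],   # T100
    ["I2", "I8"],                # T200
]

# Count how many transactions contain each item.
counts = Counter(item for t in transactions for item in t)

# Items meeting a minimum support count of 2 are "frequent" here.
frequent = [item for item, c in counts.items() if c >= 2]
print(frequent)  # ['I8']
```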
What Kinds of Patterns Can Be Mined? (or) Data Mining Functionalities: Data
mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data
mining tasks are classified into two categories: descriptive and predictive.
→ Descriptive mining tasks characterize the general properties of the data in the database.
→Predictive mining tasks perform inference on the current data in order to make predictions.
1.4.1. Concept/Class Description: Descriptions of individual classes or concepts in
summarized, concise, and precise terms are called class or concept descriptions. These descriptions
can be divided into 1. Data Characterization and 2. Data Discrimination.
Data Characterization:
• It is a summarization of the general characteristics of a target class of data.
• The data corresponding to the user specified class are collected by a database query.
The output of data characterization can be presented in various forms, such as pie charts, bar charts,
curves, multidimensional cubes, multidimensional tables, etc. The resulting descriptions, when
presented as generalized relations, are called characteristic rules.
Data Discrimination: It is a comparison of the target class of data objects against one or a set of
contrasting (distinct) classes. The target and contrasting classes can be specified by the user, and the
corresponding data objects are retrieved through database queries.
For example, comparison of products whose sales increased by 10% in the last year with
those whose sales decreased by 30% during the same period. This is called data discrimination.
1.4.2. Mining Frequent Patterns, Associations and Correlations:
1.4.2.1. Frequent Patterns: A frequent itemset typically refers to a set of items that often appear
together in transactional data. For example, milk and bread are frequently purchased together by
many customers. In the AllElectronics database, such patterns reveal which products customers
frequently purchase together; in general, household necessities are bought most frequently.
1.4.2.2. Association Analysis: “What is association analysis ?”
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. It is used for transaction data analysis.
An association rule has the form X ⇒ Y.
For example, in the AllElectronics relational database, a data mining system may find association rules
like buys(X, “computer”) ⇒ buys(X, “software”),
meaning that customers who buy a computer also tend to buy software.
age(X, “20..29”) ∧ income(X, “20K..29K”) ⇒ buys(X, “laptop”)
This association rule indicates that AllElectronics customers who are between 20 and 29 years
of age and earn between $20,000 and $29,000 tend to purchase a laptop at AllElectronics.
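The strength of such a rule is usually measured by its support and confidence. A minimal sketch hand-computing both for the rule computer ⇒ software on a toy transaction list (the transaction contents are illustrative):

```python
# Hand-computing support and confidence for the rule computer => software.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
x_only = sum(1 for t in transactions if "computer" in t)

support = both / n          # fraction of transactions containing X and Y
confidence = both / x_only  # P(Y | X): of those buying a computer, how many buy software
print(support, round(confidence, 2))
```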
1.4.2.3. Classification and Regression for Prediction:
Classification is the process of finding a set of models that describes and distinguishes data classes
or concepts.
• The derived model may be represented in various forms such as classification (IF-THEN)
rules, decision trees, mathematical formulae or neural networks.
11
• A decision tree is a flow-chart-like tree structure. Decision trees can easily be converted to
classification rules. A neural network, when used for classification, is typically a collection of
neuron-like processing units with weighted connections between the units.
Regression for Prediction is used to predict missing or unavailable numerical data values rather than
class labels. Prediction refers to both numeric value prediction and class label prediction; the
predicted values are numerical data, and the task is often simply referred to as prediction.
1.4.2.4. Cluster Analysis: (“What is cluster analysis?”)
Clustering is a method of grouping data into different groups, so that the data in each group share
similar trends and patterns. The objectives of clustering are
• To uncover natural groupings
• To initiate hypothesis about the data
• To find consistent and valid organization of data.
For example, cluster analysis can be performed on AllElectronics customers to identify
homogeneous (similar) groups of customers. These clusters may represent target groups for
marketing, which can be used to increase sales.
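A minimal sketch of how clustering groups customers without predefined labels: a 1-D k-means over hypothetical annual-spend values (the data, starting centers, and iteration count are illustrative assumptions, not from the text):

```python
# A minimal pure-Python 1-D k-means sketch.
def kmeans_1d(values, centers, iters=10):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for v in values:                          # assign each value to nearest center
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # move each center to the mean of its assigned values
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return centers

spend = [10, 12, 11, 90, 95, 88]                  # two natural groups of customers
print(sorted(kmeans_1d(spend, [0, 100])))         # [11.0, 91.0]
```

Real cluster analysis works over many attributes at once and uses more robust algorithms, but the assign-then-update loop is the same idea.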
1.4.2.5. Outlier Analysis: A database may contain data objects that do not comply with the general
behavior or model of the data; these are outliers. Most data mining methods discard outliers as noise
or exceptions. However, in some applications, such as fraud detection, the rare events are the
interesting ones; the analysis of outlier data is referred to as outlier mining.
For example, outlier analysis may uncover fraudulent usage of credit cards by detecting
purchases of unusually large amounts compared with a customer's regular purchases.
Outliers may be detected using statistical tests that assume a distribution or probability model
for the data, or using distance measures where objects that are remote from any other cluster are
considered outliers.
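A statistical test of the kind mentioned above can be sketched as a simple z-score rule. The threshold and purchase amounts below are illustrative assumptions (a 3-sigma cutoff is the common rule of thumb; a looser 2-sigma cutoff is used here for the small sample):

```python
# Flag values lying more than `threshold` standard deviations from the mean.
def zscore_outliers(xs, threshold=3.0):
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [x for x in xs if abs(x - mean) > threshold * std]

purchases = [120, 130, 110, 125, 115, 5000]   # one suspicious amount
print(zscore_outliers(purchases, threshold=2.0))  # [5000]
```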
Interestingness Of Patterns
A data mining system has the potential to generate thousands or even millions of patterns,
or rules. Then, “are all of the patterns interesting?” Typically not; only a small fraction of
the patterns potentially generated would actually be of interest to any given user.
“What makes a pattern interesting?
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?”
To answer the first question,
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data
with some degree of certainty, (3) potentially useful, and (4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An
interesting pattern represents knowledge.
Several objective measures of pattern interestingness exist. These are based on the structure
of discovered patterns and the statistics underlying them. An objective measure for
association rules of the form X ⇒ Y is rule support, representing the percentage of transactions
from a transaction database that the given rule satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains
both X and Y, that is, the union of itemsets X and Y. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability
that a transaction containing X also contains Y. More formally, support and confidence are
defined as
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X)
1.5. Which Technologies Are Used? (or) Classification of Data Mining Systems:
Data mining draws on many techniques, such as statistics, machine learning, pattern recognition,
database and data warehouse systems, information retrieval, visualization, algorithms,
high-performance computing, and many application domains (shown in Figure). Data mining
systems can be categorized according to various criteria.
A data mining system categorized based on the kind of knowledge mined may have the following
functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
For example, if we want to classify a database based on the data model, we need to select a
relational, transactional, object-relational, or data warehouse mining system. A data mining system
can also be classified according to the application domain involved, for example:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail
Statistics: A statistical model is a set of mathematical functions that describe the behavior of the objects in
a target class in terms of random variables and their associated probability distributions. Statistical models
are widely used to model data and data classes.
For example, in data mining tasks like data characterization and classification, statistical models of
target classes can be built.
Alternatively, data mining tasks can be built on top of statistical models. For example, we can use
statistics to model noise and missing data values.
Inferential statistics (or predictive statistics) models data in a way that accounts for randomness
and uncertainty in the observations and is used to draw inferences about the process or population
under investigation.
A statistical hypothesis test (sometimes called confirmatory data analysis) makes statistical
decisions using experimental data. A result is called statistically significant if it is unlikely to have
occurred by chance. If the classification or prediction model holds true, then the descriptive statistics
of the model increase the soundness of the model.
Machine Learning: Machine learning investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data. Machine learning is a
fast-growing discipline.
• Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding
machine-readable translations are used as the training examples, which supervise the learning
of the classification model.
• Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. For example, an unsupervised
learning method can take, as input, a set of images of handwritten digits. Suppose that it finds
10 clusters of data. These clusters may correspond to the 10 distinct digits of 0 to 9,
respectively.
• Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. For a two-class problem, one class is
treated as the positive examples and the other class as the negative examples.
• Active learning is a machine learning approach that lets users play an active role in the
learning process.
Database Systems and Data Warehouses:
• Database system research focuses on the creation, maintenance, and use of databases for
organizations and end-users. In particular, it has established principles in data models, query
languages, query processing and optimization methods, data storage, and indexing and
access methods. Many data mining tasks need to handle large data sets or even real-time,
fast streaming data. Recent database systems have built systematic data analysis capabilities
on database data using data warehousing and data mining facilities.
• A data warehouse integrates data from multiple sources and various timeframes. It provides
OLAP facilities in multidimensional databases to promote multidimensional data mining. It
maintains current data, previous data, and historical data in the database.
Information Retrieval:
• Information retrieval (IR) is the science of searching for documents or information in
documents. The typical approaches in information retrieval adopt probabilistic models. For
example, a text document can be viewed as a bag of words, that is, a multiset of words
appearing in the document.
Pattern recognition is the process of recognizing patterns by using machine learning algorithms.
Pattern recognition can be defined as the classification of data based on knowledge already gained
or on statistical information extracted from patterns and/or their representation. One of the important
aspects of the pattern recognition is its application potential. Examples: Speech
recognition, speaker identification, multimedia document recognition (MDR), automatic medical
diagnosis.
Data visualization is a general term that describes any effort to help people understand the
significance of data by placing it in a visual context. Patterns, trends and correlations that might go
undetected in text-based data can be exposed and recognized more easily with data visualization software.
An algorithm in data mining (or machine learning) is a set of heuristics and calculations that creates
a model from data. To create a model, the algorithm first analyzes the data you provide, looking for
specific types of patterns or trends.
High-Performance Computing (HPC) provides a framework that can abstract the increased
complexity of current computing systems and at the same time provide performance benefits by
exploiting multiple forms of parallelism in data mining algorithms.
Data Mining Applications: The list of areas where data mining is widely used − Financial Data
Analysis, Retail Industry, Telecommunication Industry, Biological Data Analysis, Other Scientific
Applications, Intrusion Detection.
Set of task-relevant data to be mined
This specifies the portions of the data in which the user is interested. This includes −
• Database Attributes
• Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For
example, concept hierarchies are one form of background knowledge that allows data to be
mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns that are discovered by the process of knowledge discovery.
There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These representations
may include the following −
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
A data mining (DM) system may be integrated with a database (DB) or data warehouse (DW)
system using one of the following coupling schemes.
1.No coupling: No coupling means that a DM system will not utilize any function of a DB or
DW system. It may fetch data from a particular source (such as a file system), process the
data using some data mining algorithms, and then store the mining results in another file.
2.Loose coupling: Loose coupling means that a DM system will use some facilities of a DB
or DW system, fetching data from a data repository managed by these systems, performing
data mining, and then storing the mining results either in a file or in a designated place in a
database or data Warehouse. Loose coupling is better than no coupling because it can fetch
any portion of data stored in databases or data warehouses by using query processing,
indexing, and other system facilities.
However, many loosely coupled mining systems are main memory-based. Because
mining does not explore data structures and query optimization methods provided by DB or
DW systems, it is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
3.Semitight coupling: Semitight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data mining primitives
(identified by the analysis of frequently encountered data mining functions) can be provided
in the DB/DW system. These primitives can include sorting, indexing, aggregation,
histogram analysis, multi way join, and precomputation of some essential statistical
measures, such as sum, count, max, min, and standard deviation.
4.Tight coupling: Tight coupling means that a DM system is smoothly integrated into the
DB/DW system. The data mining subsystem is treated as one functional component of the
information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW
system.
Issues in Data Integration:
There are three issues to consider during data integration: Schema Integration, Redundancy
Detection, and resolution of data value conflicts. These are explained in brief below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching equivalent real-world entities from multiple sources is referred to as the
entity identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
3. Detection and resolution of data value conflicts:
• This is the third critical issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than
the “same” attribute in another.
1.6. Which Kinds of Applications Are Targeted?
Data mining has seen great successes in many applications, including knowledge-intensive
application domains such as bioinformatics and software engineering. Two popular examples are
business intelligence and web search engines.
• Business intelligence (BI) technologies provide historical, current, and predictive views of
business operations. Examples include reporting, online analytical processing, business
performance management, competitive intelligence, benchmarking, and predictive analytics.
o Data mining is the core of business intelligence. Online analytical processing tools in
business intelligence depend on data warehousing and multidimensional data mining.
Classification and prediction techniques are the core of predictive analytics in business
intelligence, for which there are many applications in analyzing markets, supplies, and
sales.
• A Web search engine is a specialized computer server that searches for information on the
Web. The search results of a user query are often returned as a list (sometimes called hits).
The hits may consist of web pages, images, and other types of files.
o Web search engines are essentially very large data mining applications. Various data
mining techniques are used in all aspects of search engines, ranging from crawling
(e.g., deciding which pages should be crawled and the crawling frequencies), indexing
(e.g., selecting pages to be indexed and deciding to which extent the index should be
constructed), and searching (e.g., deciding how pages should be ranked, which
advertisements should be added, and how the search results can be personalized or
made “context aware”).
1.7. Major issues in Data Mining: Data mining is a dynamic and fast-expanding field with
great strengths. Major issues in data mining research can be partitioned into five groups: mining
methodology, user interaction, efficiency and scalability, diversity of data types, and data mining
and society.
Performance Issues
• There can be performance-related issues, such as the following −
• Efficiency and scalability of data mining algorithms − In order to effectively
extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − The factors such
as huge size of databases, wide distribution of data, and complexity of data
mining methods motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into partitions, which are
further processed in a parallel fashion. Then the results from the partitions are
merged. Incremental algorithms update the mining results without mining the
whole data again from scratch.
Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for
one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds
challenges to data mining.
Data processing
Data processing is collecting raw data and translating it into usable information. The raw data
is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable
format. It is usually performed in a step-by-step process by a team of data scientists and data
engineers in an organization.
Data processing may be carried out automatically or manually. Nowadays, most data is processed
automatically with the help of computers, which is faster and gives accurate results. Data can thus
be converted into different forms, graphic as well as audio, depending on the software and the data
processing methods used.
After that, the data collected is processed and then translated into a desirable form as per
requirements, useful for performing tasks. The data is acquired from Excel files, databases, text
file data, and unstructured data such as audio clips, images, GPS data, and video clips.
Data processing is crucial for organizations to create better business strategies and increase their
competitive edge. By converting the data into a readable format like graphs, charts,
and documents, employees throughout the organization can understand and use the data.
The major steps involved in data preprocessing are:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation.
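As a small example of the normalization step listed above, min-max normalization rescales an attribute to a new range such as [0, 1] (the salary values below are illustrative):

```python
# Min-max normalization: map values linearly from [min, max] to [new_min, new_max].
def min_max(xs, new_min=0.0, new_max=1.0):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]

salaries = [30000, 45000, 60000]
print(min_max(salaries))  # [0.0, 0.5, 1.0]
```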
Drawbacks of the mean: The mean is sensitive to extreme (outlier) values; even a small number of
extreme values can corrupt it.
Trimmed mean
• we can instead use the trimmed mean, which is the mean obtained
after chopping off values at the high and low extremes.
• For example, we can sort the values observed for salary and remove
the top and bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both
ends as this can result in the loss of valuable information.
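The trimming procedure can be sketched directly. The data values and the 15% trim used here are illustrative (the text recommends a small trim such as 2% on large data sets):

```python
# Trimmed mean: sort, chop a fraction p off each end, average the rest.
def trimmed_mean(xs, p=0.02):
    xs = sorted(xs)
    k = int(len(xs) * p)           # number of values to drop at each end
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

salaries = [20, 30, 30, 40, 50, 50, 60, 1000]   # one extreme value
print(trimmed_mean(salaries, p=0.15))           # the 1000 no longer dominates
```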
Median
• For skewed (asymmetric) data, the median is a better measure of the center of the data.
• Suppose that a given data set of N distinct values is sorted in
numerical order.
• If N is odd, then the median is the middle value of the ordered set;
otherwise (i.e., if N is even), the median is the average of the middle
two values.
• Assume that data are grouped in intervals according to their xi data
values and that the frequency (i.e., number of data values) of each
interval is known.
• For example, people may be grouped according to their annual
salary in intervals such as 10–20K, 20–30K, and so on.
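For such grouped data, the median can only be approximated by interpolation. The standard formula for grouped frequency data (not stated in the text, but the usual one) is:

```latex
\text{median} \approx L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_{l}}{\text{freq}_{\text{median}}} \right) \times \text{width}
```

where $L_1$ is the lower boundary of the median interval (the interval containing the middle value), $N$ is the total number of values, $(\sum \text{freq})_{l}$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and $\text{width}$ is the width of the median interval.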
Mode
• The mode for a set of data is the value that occurs most
frequently in the set.
• Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
• For example, the mode of the data set 2, 4, 5, 5, 6, 7 is 5, because it appears twice while every other value appears only once.
• In general, a data set with two or more modes is multimodal.
• At the other extreme, if each data value occurs only once, then
there is no mode.
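A short sketch of mode-finding that covers the unimodal, multimodal, and no-mode cases, using the data set from the text:

```python
from collections import Counter

def modes(values):
    """All values tied for the highest frequency; empty list if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # no mode: each data value occurs only once
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 5, 5, 6, 7]))   # unimodal -> [5]
print(modes([1, 1, 2, 2, 3]))      # bimodal  -> [1, 2]
print(modes([1, 2, 3]))            # no mode  -> []
```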
• The distance between the first and third quartiles is the interquartile range (IQR):
IQR = Q3 − Q1.
The five-number summary of a distribution consists of the median, the
quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order Minimum, Q1, Median, Q3, Maximum.
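A minimal sketch of the five-number summary and IQR. Quartile conventions vary; this version takes the medians of the lower and upper halves of the sorted data (the sample values are hypothetical):

```python
def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)

    def med(d):
        m = len(d) // 2
        return d[m] if len(d) % 2 else (d[m - 1] + d[m]) / 2

    lower = data[: n // 2]          # lower half (excludes the median when n is odd)
    upper = data[(n + 1) // 2 :]    # upper half
    return data[0], med(lower), med(data), med(upper), data[-1]

vals = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
mn, q1, m, q3, mx = five_number_summary(vals)
print(mn, q1, m, q3, mx)    # 6 15 40 43 49
print("IQR =", q3 - q1)     # IQR = 28
```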
Boxplots
• Boxplots are a popular way of visualizing a distribution. The ends of the box are at the quartiles (so the box length is the IQR), the median is marked by a line within the box, and whiskers extend to the smallest and largest observations.
Data Cleaning
Missing Values
• Imagine that you need to analyze All Electronics sales and customer data.
You note that many tuples have no recorded value for several attributes,
such as customer income.
• How can you go about filling in the missing values for this attribute?
• Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective unless the tuple contains several attributes with
missing values.
• Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
• Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”
or −∞. If missing values are replaced by, say, “Unknown,” the mining
program may mistakenly think that they form an interesting concept,
since they all share the same value.
• Use the attribute mean to fill in the missing value: For example,
suppose that the average income of All Electronics customers is
$56,000. Use this value to replace any missing value for income.
• Use the attribute mean for all samples belonging to the same class
as the given tuple:
• For example, if classifying customers according to credit risk, replace
the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
• Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
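The attribute-mean and class-conditional-mean strategies above can be sketched as follows; the customer records and the income/risk attributes are hypothetical:

```python
# Replace missing income values (None) with the mean income of customers
# in the same credit-risk class (hypothetical data).
customers = [
    {"income": 40000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 80000, "risk": "high"},
    {"income": None,  "risk": "high"},
    {"income": 60000, "risk": "low"},
]

def fill_with_class_mean(rows, attr, cls):
    """Fill missing attr values with the mean of attr within each cls group."""
    groups = {}
    for row in rows:
        if row[attr] is not None:
            groups.setdefault(row[cls], []).append(row[attr])
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    return [dict(row, **{attr: row[attr] if row[attr] is not None
                         else means[row[cls]]}) for row in rows]

filled = fill_with_class_mean(customers, "income", "risk")
print([row["income"] for row in filled])
# [40000, 50000.0, 80000, 80000.0, 60000]
```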
Noisy Data
• “What is noise?”
• Noise is a random error or variance in a measured variable.
• Given a numerical attribute such as, say, price, how can we “smooth”
out the data to remove the noise?
• Let’s look at the following data smoothing techniques:
1. Binning:
• In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
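Smoothing by bin means can be sketched as below. Bin 1 (4, 8, 15) matches the text; the remaining price values are illustrative:

```python
def smooth_by_bin_means(data, bin_size):
    """Partition sorted data into equal-frequency bins and replace each value by its bin mean."""
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i : i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```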
2. Regression:
• Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used to predict the
other.
• Multiple linear regression is an extension of linear regression, where more
than two attributes are involved and the data are fit to a multidimensional
surface.
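A minimal least-squares fit of the “best” line for two attributes, so that one can be used to predict the other (the x/y values are made up):

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x for two attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)   # slope
    a = mean_y - b * mean_x                    # intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))   # roughly y = 0.05 + 1.99*x
```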
3. Clustering:
• Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Values that fall outside of the set of clusters may be considered outliers.
Data Integration
• Data mining often requires data integration—the merging of data from
multiple data stores.
• It is likely that your data analysis task will involve data integration, which
combines data from multiple sources into a coherent data store, as in data
warehousing.
• These sources may include multiple databases, data cubes, or flat files.
Issues during Data Integration
1. Entity identification problem
• How can equivalent real-world entities from multiple data sources be matched up? For example, customer_id in one database and cust_number in another may refer to the same attribute.
2. Redundancy
• Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
• For numerical attributes, we can evaluate the correlation between two
attributes, A and B, by computing the correlation coefficient (also known as
Pearson's product-moment coefficient, named after its inventor, Karl
Pearson).
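Pearson's coefficient is the covariance of A and B divided by the product of their standard deviations; a from-scratch sketch (the sample vectors are illustrative):

```python
def correlation(a, b):
    """Pearson's product-moment correlation coefficient r_{A,B}."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = (sum((x - mean_a) ** 2 for x in a) / n) ** 0.5
    std_b = (sum((y - mean_b) ** 2 for y in b) / n) ** 0.5
    return cov / (std_a * std_b)

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 2))   # 1.0 (perfect positive correlation)
```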
• For nominal attributes, a correlation relationship can be discovered by a chi-square test, where o_ij is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), which can be computed as e_ij = (count(A = a_i) × count(B = b_j)) / n, where n is the number of data tuples.
• The gender of each person was noted. Each person was polled as to
whether their preferred type of reading material was fiction or
nonfiction.
• Thus, we have two attributes, gender and preferred reading.
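The chi-square statistic for two nominal attributes such as gender and preferred reading can be sketched as follows (the contingency counts are illustrative):

```python
def chi_square(observed):
    """Pearson chi-square statistic from a 2-D contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected frequency
            stat += (o_ij - e_ij) ** 2 / e_ij
    return stat

# Rows: fiction, nonfiction; columns: male, female (illustrative counts).
observed = [[250, 200],
            [50, 1000]]
print(round(chi_square(observed), 2))   # 507.94 -> strongly correlated attributes
```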
Data Transformation
Min-max normalization
• This method performs a linear transformation on the original data, mapping a value v of attribute A to v′ in a new range [new_min_A, new_max_A]:
v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Z-score normalization:
• This method normalizes the value for attribute A using the mean
and standard deviation. The following formula is used for z-score
normalization:
v′ = (v − Ā) / σ_A
where Ā and σ_A are the mean and standard deviation of attribute A.
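Both normalizations can be sketched in a few lines (the income values are hypothetical; this z-score version uses the population standard deviation):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [20, 30, 40, 50, 60]
print(min_max(incomes))                          # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 2) for z in z_score(incomes)])   # [-1.41, -0.71, 0.0, 0.71, 1.41]
```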
Data Reduction
Data aggregation:
Dimensionality Reduction
• In dimensionality reduction, data encoding or transformations are applied
so as to obtain a reduced or “compressed” representation of the original
data.
• If the original data can be reconstructed from the compressed data without
any loss of information, the data reduction is called lossless.
• If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy.
Numerosity Reduction
Wavelet transforms
What’s a Wavelet?
• A wavelet is a wave-like oscillation that is localized in time. Wavelets have two basic properties: scale and location.
• Scale (or dilation) defines how “stretched” or “squished” a wavelet is. This
property is related to frequency as defined for waves.
• Location defines where the wavelet is positioned in time (or space)
Histograms
Sampling