UNIT-I
Introduction to Data Mining: Data mining is the process of discovering patterns
in large data sets using methods at the intersection of machine
learning, statistics, and database systems. The information or knowledge extracted
in this way can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Applications:
Data mining is highly useful in the following domains:
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Apart from these, data mining can also be used in the areas of production control,
customer retention, science exploration, sports, astrology, and Internet Web
Surf-Aid.
Knowledge discovery in databases (KDD):
Knowledge discovery in databases (KDD) is the process of discovering useful
knowledge from a collection of data.
Data Cleaning: Noise and inconsistent data are removed.
Data Integration: Multiple data sources are combined.
Data Selection: Data relevant to the analysis task are retrieved from the database.
Data Transformation: Data is transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations.
Data Mining: Intelligent methods are applied in order to extract data patterns.
Pattern Evaluation: Data patterns are evaluated, based on interestingness
measures, to identify the truly interesting patterns that represent knowledge.
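The first four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two sources (store_a, store_b), the record fields, and the function names are hypothetical, and cleaning is reduced to dropping records with a missing amount.

```python
from statistics import mean

# Two hypothetical sources holding the same kind of sales records.
store_a = [{"item": "computer", "amount": 900.0}, {"item": "printer", "amount": None}]
store_b = [{"item": "computer", "amount": 1100.0}, {"item": "printer", "amount": 200.0}]

def clean(records):
    # Data cleaning (simplified): drop records whose amount is missing.
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    # Data integration: combine several sources into one list.
    return [r for source in sources for r in source]

def select(records, item):
    # Data selection: keep only the records relevant to the analysis task.
    return [r for r in records if r["item"] == item]

def transform(records):
    # Data transformation: consolidate records into a summary form.
    amounts = [r["amount"] for r in records]
    return {"count": len(amounts), "mean_amount": mean(amounts)}

def kdd(item):
    records = integrate(clean(store_a), clean(store_b))
    return transform(select(records, item))

summary = kdd("computer")
print(summary)
```

The summary produced here would then be the input to the data mining and pattern evaluation steps, which are not shown.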
Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and
printers, and the concepts of customers include big spenders and budget spenders.
Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived in two ways: data characterization, which
summarizes the data of the class under study, and data discrimination, which
compares that class with one or more contrasting classes.
Background knowledge:
Background knowledge allows data to be mined at multiple levels of abstraction.
For example, concept hierarchies are one form of background knowledge that
allows data to be mined at multiple levels of abstraction.
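As an illustration of mining at a higher level of abstraction, the sketch below rolls city-level sales up to country level. The hierarchy (CITY_TO_COUNTRY), the sales figures, and the function name roll_up are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Chicago": "USA", "New York": "USA", "Vancouver": "Canada"}

# Hypothetical low-level data: (city, units sold).
sales = [("Chicago", 3), ("New York", 5), ("Vancouver", 2)]

def roll_up(rows, hierarchy):
    # Replace each city by its parent concept and sum the units per parent.
    totals = Counter()
    for city, units in rows:
        totals[hierarchy[city]] += units
    return dict(totals)

by_country = roll_up(sales, CITY_TO_COUNTRY)
print(by_country)
```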
Each user will have a data mining task in mind, that is, some form of data analysis
that he or she would like to have performed. A data mining task can be specified in
the form of a data mining query, which is input to the data mining system. A data
mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during
discovery.
The data mining system is integrated with a data warehouse or database system so
that it can perform its tasks effectively. A data mining system operates in
an environment that requires it to communicate with other data systems, such as a
database system.
Data have quality if they satisfy the requirements of the intended use. There are
many factors comprising data quality, including
Accuracy,
Completeness,
Consistency,
Timeliness,
Believability, and
Interpretability.
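Some of these factors can be checked mechanically. The sketch below measures completeness only, and it assumes one possible definition: the fraction of expected field values that are not missing. The records, field names, and function name are hypothetical.

```python
def completeness(records, fields):
    # Fraction of expected values that are present (not None).
    total = len(records) * len(fields)
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total if total else 1.0

# Hypothetical customer records with some missing values.
records = [
    {"name": "Asha", "age": 34, "city": "Pune"},
    {"name": "Ravi", "age": None, "city": "Delhi"},
    {"name": None, "age": 41, "city": None},
    {"name": "Mina", "age": 29, "city": "Agra"},
]

score = completeness(records, ["name", "age", "city"])
print(score)
```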
Faculty: Mr. D. Krishna, Associate Professor CSE Dept
Major Tasks in Data Preprocessing:
The major steps involved in data preprocessing are data cleaning, data integration,
data reduction, and data transformation.
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies.
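Two of these routines can be sketched briefly. The choices below are assumptions made for illustration, not methods prescribed by these notes: missing values are filled with the mean of the known values, and noise is smoothed by replacing each value with the mean of its equal-size bin. The sample values are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    # Replace each missing value (None) with the mean of the known values.
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    # Sort the values, split them into bins of bin_size, and replace
    # every value in a bin with that bin's mean.
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        smoothed.extend([mean(current_bin)] * len(current_bin))
    return smoothed

prices = [4.0, 8.0, None, 21.0, 15.0, 12.0]
filled = fill_missing(prices)
smoothed = smooth_by_bin_means(filled, bin_size=3)
print(filled)
print(smoothed)
```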
Data integration involves integrating multiple databases, data cubes, or files.
Some attributes representing a given concept may have different names in different
databases, causing inconsistencies and redundancies.
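The naming problem can be illustrated with two small sources in which the customer identifier is called customer_id in one and cust_number in the other. The sources, attribute names, and the RENAME mapping are hypothetical. The sketch renames the attribute first and then merges records that share the same identifier.

```python
# Hypothetical sources: the same concept has a different attribute name in each.
crm_rows = [{"customer_id": 1, "name": "Asha"}, {"customer_id": 2, "name": "Ravi"}]
billing_rows = [{"cust_number": 1, "balance": 250.0}, {"cust_number": 3, "balance": 40.0}]

# Assumed mapping from source-specific names to one agreed name.
RENAME = {"cust_number": "customer_id"}

def normalize(row, rename):
    # Rename attributes so both sources use the same name for the same concept.
    return {rename.get(key, key): value for key, value in row.items()}

def integrate(left, right, key):
    # Merge rows from both sources that share the same key value.
    merged = {}
    for row in left + right:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]

billing_norm = [normalize(r, RENAME) for r in billing_rows]
customers = integrate(crm_rows, billing_norm, "customer_id")
print(customers)
```

Without the renaming step, the two identifiers would be treated as unrelated attributes and the records for customer 1 would not be combined.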
"The data set I have selected for analysis is HUGE, which is sure to slow down the
mining process. Is there a way I can reduce the size of my data set without
jeopardizing the data mining results?" Data reduction obtains a reduced
representation of the data set that is much smaller in volume, yet produces the
same (or almost the same) analytical results. Data reduction strategies include
dimensionality reduction and numerosity reduction.
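One simple form of numerosity reduction is random sampling without replacement. The sketch below is an illustration under assumptions: the data set is synthetic, the sample size and seed are arbitrary, and the function name is hypothetical. It compares the mean of the full data with the mean of the sample to show that a much smaller representation can give a similar analytical result.

```python
import random
from statistics import mean

def sample_without_replacement(rows, n, seed=0):
    # Draw n distinct rows; a fixed seed makes the sample repeatable.
    rng = random.Random(seed)
    return rng.sample(rows, n)

# Synthetic data set of 10,000 values.
data = [float(x) for x in range(1, 10001)]

reduced = sample_without_replacement(data, 500)

full_mean = mean(data)
reduced_mean = mean(reduced)
print(len(data), full_mean)
print(len(reduced), reduced_mean)
```

How close the two means are depends on the sample size and on how the data are distributed, so the sample size should be chosen with the analysis task in mind.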