Data Mining

Piyush Verma
Introduction
Data mining, the extraction of hidden predictive information from large databases, is a
powerful new technology with great potential to help companies focus on the most important
information in their data warehouses.
Data mining tools predict future trends and behaviours, allowing businesses to make proactive,
knowledge-driven decisions. Data mining tools can answer business questions that were
traditionally too time-consuming to resolve. They search databases for hidden patterns, finding
predictive information that experts may miss because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining techniques
can be implemented rapidly on existing software and hardware platforms to enhance the value
of existing information resources, and can be integrated with new products and systems as they
are brought online. When implemented on high-performance client/server or parallel
processing computers, data mining tools can analyze massive databases to deliver answers to
questions such as, “Which clients are most likely to respond to my next promotional mailing,
and why?”
What is data mining?
It is a knowledge discovery process. Data mining centers around the automated discovery of
new facts and relationships in data. With traditional query tools, you search for known
information. Data mining tools enable you to uncover hidden information.
We live in the age of information. The importance of collecting data that reflect your
business or scientific activities to achieve competitive advantage is now widely recognized.
Powerful systems for collecting and managing data are in use in large and midrange companies.
However, the bottleneck in turning this data into success is the difficulty of extracting
knowledge about the system that you study from the collected data. Consider the following
questions:
 What goods should be promoted to this customer?
 What is the probability that a certain customer will respond to a planned promotion?
 Can one predict the most profitable securities to buy/sell during the next trading
session?
 Will this customer default on a loan or pay back on schedule?
 What medical diagnosis should be assigned to this patient?
 How large are the peak loads of a telephone or energy network going to be?
 Why does the facility suddenly start producing defective goods?
These are all questions that can probably be answered if the information hidden among
megabytes of data in your databases can be found explicitly and utilized. Modeling the
investigated system and discovering relations that connect variables in a database are the
objectives of data mining.
Evolution of data mining
In the evolution from business data to business information, each new step has built
upon the previous one. For example, dynamic data access is critical for drill-through in
data navigation applications, and the ability to store large databases is critical to data
mining. From the user’s point of view, the four steps listed below were revolutionary
because they allowed new business questions to be answered accurately and quickly.
 Data collection (1960s) – answered questions like “What was my total revenue
in the last five years?”
 Data access (1980s) – answered business questions like “What were unit sales
in New England last March?” Relational databases (RDBMS), Structured Query
Language (SQL), ODBC, etc. were used for querying and reporting.
 Data warehousing & decision support (1990s) – these technologies were
capable of answering business questions like “What were unit sales in New
England last March?” and of drilling down into the detail behind the answer.
The technologies used are on-line analytical processing (OLAP),
multidimensional databases, data warehouses, etc.
 Data mining (2002) – capable of answering questions like “What’s likely to
happen to southern region sales next month? Why?” It uses advanced algorithms,
multiprocessor computers, massive databases, etc. Its characteristics include
a prospective nature and proactive information delivery.
Tasks solved by data mining

The main tasks that are solved by a data mining system are the following:
 Predicting – a task of learning a pattern from examples and using the
developed model to predict future values of the target variable.
 Classification – a task of finding a function that maps an example
into one of several discrete classes.
 Detection of relations – a task of searching for the most influential
independent variables for a selected target variable.
 Explicit modeling – a task of finding explicit formulas describing
dependencies between various variables.
 Clustering – a task of identifying a finite set of categories or clusters that
describe data.
 Deviation detection – a task of determining the most significant changes
in some key measures of data from previous or expected values.
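As a minimal sketch of the deviation detection task, the following Python snippet flags key measures whose latest value differs from the mean of its previous values by more than a chosen number of standard deviations. The function name, the threshold and the sample figures are illustrative assumptions, not part of any particular data mining system.

```python
from statistics import mean, stdev

def find_deviations(history, current, threshold=2.0):
    """Return the measures in `current` that deviate from their history.

    history: dict mapping a measure name to a list of previous values.
    current: dict mapping a measure name to its latest value.
    threshold: hypothetical cut-off, expressed in standard deviations.
    """
    flagged = []
    for name, value in current.items():
        past = history[name]
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # A measure that never varied deviates if it changes at all.
            if value != mu:
                flagged.append(name)
        elif abs(value - mu) / sigma > threshold:
            flagged.append(name)
    return flagged

# Illustrative data, made up for this sketch.
history = {
    "weekly_sales": [100, 102, 98, 101, 99],
    "defect_rate": [2, 3, 2, 3, 2],
}
current = {"weekly_sales": 100, "defect_rate": 9}
deviations = find_deviations(history, current)
```

Here only the defect rate is flagged, because its latest value lies far outside the range seen before. A real system would likely compare against a model of expected values rather than a simple mean.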
Advantages of data mining
Given databases of sufficient size and quality, data mining technology can generate new
business opportunities by providing these capabilities:
 Automated prediction of trends and behaviours
 Automated discovery of previously unknown patterns
 Ability to work with larger databases
Automated prediction of trends and behaviours

Data mining automates the process of finding predictive information in large databases.
Questions that traditionally required extensive hands-on analysis can now be answered directly
from the data quickly. A typical example of a predictive problem is targeted marketing. Data
mining uses data on past promotional mailings to identify the targets most likely to maximize
return on investment in future mailings. Other predictive problems include forecasting
bankruptcy and other forms of default, and identifying segments of a population likely to
respond similarly to given events.
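The targeted marketing example can be sketched in a few lines of Python. The snippet below ranks customer segments by how often they responded to past mailings, so that the next mailing can go to the most responsive segments first. The segment names, the data and the function name are hypothetical, and a real tool would typically fit a predictive model over many variables rather than use a single segment label.

```python
from collections import defaultdict

def rank_segments(past_mailings):
    """Rank customer segments by response rate in past mailings.

    past_mailings: list of (segment, responded) pairs, where `responded`
    is True if the recipient answered the promotion.
    Returns a list of (segment, response_rate), best segment first.
    """
    sent = defaultdict(int)
    responses = defaultdict(int)
    for segment, responded in past_mailings:
        sent[segment] += 1
        if responded:
            responses[segment] += 1
    rates = {seg: responses[seg] / sent[seg] for seg in sent}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Illustrative mailing history, made up for this sketch.
past_mailings = [
    ("students", True), ("students", False),
    ("students", False), ("students", False),
    ("retirees", True), ("retirees", True),
    ("retirees", False), ("retirees", False),
    ("families", True), ("families", True),
    ("families", True), ("families", False),
]
ranking = rank_segments(past_mailings)
```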
Automated discovery of previously unknown patterns

Data mining tools sweep through databases and identify previously hidden patterns in
one step. An example of pattern discovery is the analysis of retail sales data to identify
seemingly unrelated products that are often purchased together. Other pattern discovery
problems include detecting fraudulent credit card transactions and identifying data that
could represent data entry keying errors.
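The retail example can be illustrated with a simple co-occurrence count: for every pair of products, count the number of shopping baskets that contain both, and keep the pairs that appear together often enough. This is only a sketch of the idea; the basket contents, the `min_support` parameter and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count how many baskets contain each pair of products.

    baskets: list of collections of product names, one per transaction.
    min_support: hypothetical minimum number of baskets a pair must appear in.
    Returns a dict mapping a sorted product pair to its basket count.
    """
    counts = Counter()
    for basket in baskets:
        # A sorted set counts each pair once per basket, in a fixed order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Illustrative transactions, made up for this sketch.
baskets = [
    ["diapers", "beer", "bread"],
    ["diapers", "beer"],
    ["beer", "bread", "diapers"],
    ["milk", "bread"],
]
pairs = frequent_pairs(baskets, min_support=3)
```

With this data, the only pair bought together in at least three baskets is beer and diapers.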
Databases can be larger

Databases can have more columns and more rows. Analysts must often limit the number of
variables they examine when doing hands-on analysis because of time constraints. Yet variables
that are discarded because they seem unimportant may carry information about unknown
patterns. High-performance data mining allows users to explore the full depth of a database
without preselecting a subset of variables. Larger samples (more rows) yield lower
estimation errors and variance, and allow users to make inferences about small but
important segments of a population.
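The link between more rows and lower estimation error can be illustrated with the textbook standard error of a sample mean, which equals the standard deviation divided by the square root of the number of observations. The sketch below assumes independent observations, and the numbers are made up for illustration.

```python
from math import sqrt

def standard_error(std_dev, n_rows):
    """Standard error of a sample mean for n_rows independent observations."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return std_dev / sqrt(n_rows)

# Hypothetical spread of a measured variable.
std_dev = 20.0
errors = {n: standard_error(std_dev, n) for n in (100, 10_000)}
```

Under these assumptions, a hundredfold increase in rows cuts the standard error by a factor of ten.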
Data mining techniques can yield the benefits of automation on existing software and
hardware platforms, and can be implemented on new systems as existing platforms are
upgraded and new products developed. When data mining tools are implemented on
high-performance parallel processing systems, they can analyze massive databases in minutes.
Faster processing means that users can automatically experiment with more models to
understand complex data. High speed makes it practical for users to analyze huge quantities of
data. Larger databases, in turn, yield improved predictions.
Technologies used in data mining

The most commonly used techniques in data mining are:
 Neural networks – nonlinear predictive models that learn through training and
resemble biological neural networks in structure.
 Rule induction – the extraction of useful if-then rules from data based on statistical
significance.
 Evolutionary programming – at present this is the youngest and evidently the most
promising branch of data mining. The underlying idea of the method is that the
system automatically formulates hypotheses about the dependence of the target
variable on other variables in the form of programs expressed in an internal
programming language.
 Case based reasoning (CBR) – the main idea underlying this method is very simple.
To forecast a future situation, or to make a correct decision, such a system finds the
closest past analogs of the present situation and chooses the same solution that was
the right one in those past situations. That is why this method is also called the
nearest neighbour method.
 Decision trees – tree shaped structures that represent sets of decisions. These
decisions generate rules for the classification of a data set.
 Genetic algorithms – optimization techniques that use processes such as genetic
combination, mutation, and natural selection.
 Nonlinear regression methods
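Of the techniques listed, the nearest neighbour idea behind case based reasoning is the easiest to sketch: find the past case closest to the present situation and reuse its solution. The Python snippet below is a minimal illustration using Euclidean distance; the loan cases, the feature choice and the function name are hypothetical.

```python
from math import dist

def nearest_neighbour_decision(past_cases, situation):
    """Return the solution of the past case closest to `situation`.

    past_cases: list of (features, solution) pairs, where features is a
    tuple of numbers describing the past situation.
    situation: tuple of numbers describing the present situation.
    """
    _, solution = min(
        past_cases, key=lambda case: dist(case[0], situation)
    )
    return solution

# Hypothetical loan cases: (income, debt) in thousands -> outcome.
past_cases = [
    ((30, 20), "default"),
    ((80, 10), "repaid"),
    ((55, 5), "repaid"),
    ((25, 30), "default"),
]
decision = nearest_neighbour_decision(past_cases, (28, 22))
```

In practice the features would usually be rescaled first so that no single variable dominates the distance, and several neighbours would often be consulted instead of one.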
