
Chapter 1

Data Mining-An Introduction
1.1 Why Data Mining?
We live in a world where vast amounts of data are collected daily. Analysing such data
is an important need. Data mining can meet this need by providing tools to discover
knowledge from data.

The explosive growth of available data volume is a result of the computerization
of our society and the fast development of powerful data collection and storage
tools. Powerful and versatile tools are badly needed to automatically uncover
valuable information from the tremendous amounts of data and to transform
such data into organized knowledge. This necessity has led to the birth of data
mining.

Data mining can be viewed as a result of the natural evolution of information
technology. The database and data management industry evolved in the
development of several critical functionalities: data collection and database
creation, data management (including data storage and retrieval and database
transaction processing), and advanced data analysis (involving data
warehousing and data mining). The widening gap between data and information
calls for the systematic development of data mining tools that can turn data
tombs into “golden nuggets” of knowledge.

1.2 What Is Data Mining?
It is no surprise that data mining can be defined in many different ways. Even the term
data mining does not really present all the major components in the picture. To refer to
the mining of gold from rocks or sand, we say gold mining instead of rock or sand
mining. Analogously, data mining should have been more appropriately named
“knowledge mining from data,” which is unfortunately somewhat long. However, the
shorter term, knowledge mining, may not reflect the emphasis on mining from large
amounts of data. Nevertheless, mining is a vivid term characterizing the process that
finds a small set of precious nuggets from a great deal of raw material. Thus, such a
misnomer carrying both “data” and “mining” became a popular choice. In addition,
many other terms have a similar meaning to data mining, for example, knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology,
and data dredging.

Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely
an essential step in the process of knowledge discovery.

Figure 1.1 Data mining as a step in the process of knowledge discovery

Poornima University, Jaipur, B.Tech, Computer Engineering

The knowledge discovery process is an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)

Steps 1 through 4 are different forms of data pre-processing, where data are
prepared for mining. The data mining step may interact with the user or a
knowledge base. The interesting patterns are presented to the user and may be
stored as new knowledge in the knowledge base.

The preceding view (Figure 1.1) shows data mining as one step in the knowledge
discovery process, though an essential one because it uncovers hidden patterns
for evaluation. However, in industry, in media, and in the research environment,
the term data mining is often used to refer to the entire knowledge discovery
process (perhaps because the term is shorter than knowledge discovery from
data). Therefore, we adopt a broad view of data mining functionality: Data
mining is the process of discovering interesting patterns and knowledge from
large amounts of data. The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are streamed into the
system dynamically.

1.3 Which Technologies Are Used?
As a highly application-driven domain, data mining has incorporated many
techniques from other domains such as statistics, machine learning, pattern
recognition, database and data warehouse systems, information retrieval,
visualization, algorithms, high performance computing, and many application
domains (Figure 1.2).
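To make the seven knowledge discovery steps of Section 1.2 concrete, the following is a minimal sketch in Python, not a reference implementation. The toy purchase records, the function names, and the 0.5 support threshold are all assumptions made for illustration. Counting item pairs that occur together merely stands in for the "intelligent methods" of step 5, and support (the fraction of baskets containing a pair) is one possible interestingness measure for step 6.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: two sources of (customer, item) purchase records.
STORE_A = [("c1", "bread"), ("c1", "milk"), ("c2", "bread"), ("c2", None)]
STORE_B = [("c2", "milk"), ("c3", "bread"), ("c3", "milk"), ("c3", "milk")]


def clean(records):
    """Step 1: data cleaning - drop records with missing values."""
    return [r for r in records if all(v is not None for v in r)]


def integrate(*sources):
    """Step 2: data integration - combine sources, removing duplicate records."""
    merged = []
    for source in sources:
        for record in source:
            if record not in merged:
                merged.append(record)
    return merged


def select(records, items_of_interest):
    """Step 3: data selection - keep only records relevant to the task."""
    return [r for r in records if r[1] in items_of_interest]


def transform(records):
    """Step 4: data transformation - aggregate into one basket per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):
    """Step 5: data mining - count how often each item pair occurs together."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """Step 6: pattern evaluation - keep pairs whose support meets a threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


def present(patterns):
    """Step 7: knowledge presentation - render patterns as readable text."""
    return [
        f"{a} and {b} are bought together (support {s:.2f})"
        for (a, b), s in sorted(patterns.items())
    ]


def run_pipeline():
    data = integrate(clean(STORE_A), clean(STORE_B))
    data = select(data, {"bread", "milk"})
    baskets = transform(data)
    patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
    return present(patterns)


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

Each function maps to one numbered step, so steps 1 through 4 are visibly the pre-processing stage and the mining step only ever sees prepared data. In practice the process is iterative: a poor result at step 6 would typically send the analyst back to an earlier step.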

Figure 1.2 Data mining adopts techniques from many domains

1.3.1 Statistics
Statistics studies the collection, analysis, interpretation or explanation, and
presentation of data. Data mining has an inherent connection with statistics. A
statistical model is a set of mathematical functions that describe the
behaviour of the objects in a target class in terms of random variables and
their associated probability distributions.

1.3.2 Machine Learning
Machine learning investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer programs to
automatically learn to recognize complex patterns and make intelligent
decisions based on data.

1.3.3 Database Systems and Data Warehouses
Database systems research focuses on the creation, maintenance, and use of
databases for organizations and end-users. Particularly, database systems
researchers have established highly recognized principles in data models, query
languages, query processing and optimisation methods, data storage, and
indexing and accessing methods. A data warehouse integrates data originating
from multiple sources and various timeframes.

1.3.4 Information Retrieval
Information retrieval (IR) is the science of searching for documents or
information in documents. Documents can be text or multimedia, and may reside
on the Web. The differences between traditional information retrieval and
database systems are twofold: information retrieval assumes that (1) the data
under search are unstructured, and (2) the queries are formed mainly by
keywords, which do not have complex structures (unlike SQL queries in database
systems).

1.4 Data Mining Applications
Data mining is widely used in the following areas:
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection

Financial Data Analysis
Financial data in the banking and financial industry is generally reliable and
of high quality, which facilitates systematic data analysis and data mining.
Some typical cases are as follows:
• Design and construction of data warehouses for multidimensional data analysis and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.

Retail Industry
Data mining has great application in the retail industry because the industry
collects large amounts of data on sales, customer purchasing history, goods
transportation, consumption, and services. It is natural that the quantity of
data collected will continue to expand rapidly because of the increasing ease,
availability, and popularity of the web.

Data mining in the retail industry helps in identifying customer buying
patterns and trends that lead to improved quality of customer service and good
customer retention and satisfaction. Examples of data mining in the retail
industry include:
• Design and construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time, and region.
• Analysis of effectiveness of sales campaigns.
• Customer retention.
• Product recommendation and cross-referencing of items.

Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging
industries, providing various services such as fax, pager, cellular phone,
internet messenger, images, e-mail, web data transmission, etc. Due to the
development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has
become very important in helping to understand the business. Data mining in the
telecommunication industry helps in identifying telecommunication patterns,
catching fraudulent activities, making better use of resources, and improving
quality of service. Examples for which data mining improves telecommunication
services include:
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.

Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics, and biomedical research. Biological
data mining is a very important part of bioinformatics. Data mining contributes
to biological data analysis in the following ways:
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.

Other Scientific Applications
The applications discussed above tend to handle relatively small and
homogeneous data sets for which the statistical techniques are appropriate.
Huge amounts of data have been collected from scientific domains such as
geosciences, astronomy, etc. Large numbers of data sets are being generated
because of fast numerical simulations in various fields such as climate and
ecosystem modeling, chemical engineering, fluid dynamics, etc. Applications of
data mining in scientific fields include:
• Data warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain-specific knowledge.

Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity,
confidentiality, or availability of network resources. In this world of
connectivity, security has become the major issue. The increased usage of the
internet and the availability of tools and tricks for intruding on and
attacking networks have prompted intrusion detection to become a critical
component of network administration. Data mining technology may be applied for
intrusion detection in the following areas:
• Development of data mining algorithms for intrusion detection.
• Association and correlation analysis, and aggregation to help select and build discriminating attributes.
• Analysis of stream data.
• Distributed data mining.
• Visualization and query tools.
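As a small illustration of how an aggregated attribute might be used to spot unusual behaviour, the sketch below flags hosts whose connection count lies far from the mean. The text does not name a specific algorithm, so the z-score rule used here is an assumed stand-in, and the host names, counts, and the threshold of 2.0 standard deviations are hypothetical. A real intrusion detection system would be considerably more involved.

```python
from statistics import mean, pstdev


def flag_unusual(counts, threshold=2.0):
    """Return the keys whose value lies more than `threshold` population
    standard deviations away from the mean of all values."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return sorted(k for k, v in counts.items() if abs(v - mu) / sigma > threshold)


# Hypothetical aggregated attribute: connection attempts per source host
# within one time window.
CONNECTIONS = {
    "host-a": 12,
    "host-b": 9,
    "host-c": 11,
    "host-d": 10,
    "host-e": 13,
    "host-f": 10,
    "host-g": 240,
}

if __name__ == "__main__":
    print(flag_unusual(CONNECTIONS))
```

With very few observations a single extreme value inflates the standard deviation, which limits how large a z-score can get, so the threshold has to be chosen with the sample size in mind.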