8530521.doc Created by Chethan.M
Data Mining
Goal of Data Mining
Simplification and automation of the overall statistical process, from data source(s) to model application
Changed over the years
Replace statistician → Better models, less grunge work
Many different data mining algorithms / tools available
Statistical expertise required to compare different techniques
Build intelligence into the software
Data Mining Is…
Decision Trees
Nearest Neighbor Classification
Neural Networks
Rule Induction
K-means Clustering
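As an illustration of one technique from the list above, here is a minimal sketch of k-means clustering on one-dimensional data. The function name `k_means_1d`, the sample values, and the starting centers are assumptions made for this example; they do not come from the notes, and a real analysis would normally use a library implementation.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers around k centers (illustrative sketch, not tuned)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical data: two obvious groups of values.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

The algorithm is not told which values belong together; the grouping emerges from repeating the assignment and update steps.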
Data Mining is Not...
Data warehousing
SQL / Ad Hoc Queries / Reporting
Software Agents
Online Analytical Processing (OLAP)
Data Visualization
Why data-mining now? 
Data mining is an increasingly popular topic (if the number of new textbooks is anything to go by).
Two main reasons:
With computers now mediating most aspects of our lives, there has been a large increase in the accumulation of electronic data.
With computers being increasingly up to the demands of complex modeling, it is getting easier to process larger datasets.
Why Mine Data? Commercial Viewpoint
Data volumes are too large for classical analysis approaches:
Large number of records
High dimensional data
Leverage organization’s data assets
Only a small portion of the collected data is ever analyzed
Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future is missing.
Lots of data is being collected and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
In classifying and segmenting data
In Hypothesis Formation
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Mining Large Data Sets - Motivation
There is often information “hidden” in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
What is Data Mining? ----- Many Definitions
Data processing using sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in large preexisting databases; a way to discover new meaning in data.
Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools.
The automated extraction of predictive information from (large) databases.
A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.
What is (not) Data Mining?
What is Data Mining?
Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
Look up a phone number in a phone directory
Query a Web search engine for information about “Amazon”
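The contrast above can be sketched in Python. The directory entries, the sample search results, the helper names, and the 0.4 similarity threshold are all made up for illustration. The grouping uses a simple word-overlap (Jaccard) measure, which is one possible stand-in for "context" and not a method prescribed by these notes.

```python
# Not data mining: retrieving a value that is already stored under a known key.
phone_directory = {"O'Brien": "555-0101", "O'Reilly": "555-0102"}  # made-up entries


def look_up(name):
    return phone_directory.get(name)


# Closer to data mining: discovering groups that nobody labelled in advance.
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_documents(docs, threshold=0.4):
    """Greedy grouping: join the first group whose first document is similar enough."""
    groups = []  # each group is a list of indices into docs
    for i, doc in enumerate(docs):
        words = tokens(doc)
        for group in groups:
            if jaccard(words, tokens(docs[group[0]])) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found, start a new one
    return groups


# Hypothetical search results for the query "Amazon".
results = [
    "amazon rainforest trees and rivers",
    "amazon online store orders and shipping",
    "rainforest rivers and trees of the amazon basin",
    "track shipping for amazon store orders online",
]

print(look_up("O'Brien"))        # 555-0101
print(group_documents(results))  # [[0, 2], [1, 3]]
```

The lookup returns exactly what was stored, while the grouping separates the rainforest results from the retailer results without being told that those two contexts exist.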

