
Chapter 7

Data Mining Concepts and Applications


Definition, characteristics & benefits
A process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large databases
A term used to describe knowledge discovery in databases
Identifies previously undiscovered patterns in data
Data mining applications are used by medical and pharmaceutical researchers to identify successful therapies for illnesses and to discover new and improved drugs
To identify customers' buying patterns, to better



How data mining works
Data mining tools find patterns in data
Three methods are used to identify patterns in data:
1. Simple models (e.g. SQL queries, OLAP)
2. Intermediate models (e.g. regression, decision trees, clustering)
3. Complex models (e.g. neural networks, other rule induction)
The patterns/rules are used to guide decision making and forecast the effects of decisions



Data mining algorithms fall into 4 broad categories:
1. Classification
2. Clustering
3. Association
4. Sequence discovery

Other data analysis tools:
Regression
Time series analysis
Visualization


Classification/supervised induction
Main objective: to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior
The model is used to predict the classes of other, unclassified records

Common tools used for classification are:
Neural networks (used when the number of variables is very large and the relationships are complex; need considerable training)
Decision trees
If-then-else rules



The goal of data classification is to organize and categorize data in distinct classes.
A model is first created based on the data distribution.
The model is then used to classify new data.
Given the model, a class can be predicted for new data.
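The two phases above (build a model from labeled data, then predict a class for new data) can be sketched with a nearest-centroid classifier. This is only one simple choice of model, not a method named in the slides, and the records and class names are hypothetical.

```python
from collections import defaultdict
from math import dist

def build_model(records):
    """Create the model: one centroid (mean point) per known class label."""
    grouped = defaultdict(list)
    for features, label in records:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Assign the class whose centroid is closest to the new record."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical historical data: (age, income in thousands) -> credit class.
history = [
    ((25, 20), "high risk"),
    ((30, 25), "high risk"),
    ((45, 80), "low risk"),
    ((50, 90), "low risk"),
]
model = build_model(history)          # phase 1: create the model
new_class = predict(model, (48, 85))  # phase 2: classify a new record
```

Here the class labels are known in advance, which is what makes this supervised.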

Classification
Supervised classification = classification: the class labels and the number of classes are known

Unsupervised classification = clustering: the class labels and the number of classes are not known



Clustering
Partitioning a database into segments in which the members of a segment share similar qualities
Unlike in classification, the clusters are unknown when the algorithm starts
The goal is to create groups so that members within each group have maximum similarity and members across groups have minimum similarity
A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters
No predefined classes
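A minimal sketch of the idea, using k-means as one possible clustering algorithm (the slides do not name a specific one). The points, the choice of k = 2, and the simple "first k points" initialization are assumptions made for illustration.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Partition points into k clusters; no class labels are given up front."""
    centroids = list(points[:k])  # simple, deterministic start: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical 2-D records with two visually obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
```

Members of each resulting group are close to one another and far from the other group, which is the similarity goal described above.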


Association
A category of data mining algorithm that establishes relationships among items that occur together in a given record
E.g. associations among items that sell together (market basket analysis) can benefit retailers
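Market basket analysis is usually expressed through two measures, support and confidence. The sketch below computes both for a rule such as "bread implies butter"; the baskets are made up, and the function names are illustrative rather than taken from any library.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the baskets holding the antecedent, the share that also hold the consequent.

    Assumes the antecedent occurs in at least one basket.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical sales records: one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

A retailer would typically keep only rules whose support and confidence exceed chosen thresholds.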


Sequence discovery
The identification of associations over time
Visualization can be used in conjunction with data mining to gain a clearer understanding of many underlying relationships



Regression is a well-known statistical technique that is used to map data to a prediction value
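As a small illustration, simple linear regression fits a line y = slope * x + intercept by ordinary least squares and then maps a new x to a predicted y. The figures below (advertising spend versus sales) are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (x) and resulting sales (y).
ad_spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5 + intercept  # prediction for an unseen spend of 5
```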


Forecasting estimates future values based on patterns within large sets of data, e.g. to predict future sales/demand.



Data mining applications (read from book, page 311)
Marketing
Banking
Retailing and sales
Manufacturing and production
Brokerage and securities trading
Insurance
Computer hardware and software
Government and defense
Airlines
Health care
Broadcasting
Police
Homeland security

Data Mining Techniques and Tools


Data mining tools and techniques can be classified based on the structure of the data and the algorithms used:
Statistical methods: include linear and nonlinear regression, Bayes' theorem (i.e. probability distribution), correlations, and cluster analysis
Decision trees: defined as a root followed by internal nodes

Case-based reasoning
Neural computing: e.g. identify potential customers for a new product
Intelligent agents
Other tools:
Rule induction
Data visualization



Gini index
Used to determine the split point that best divides the data into classes
Can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable
Used to evaluate the goodness of a split
So the goal when building a decision tree is to select the variable whose split has the lowest Gini index
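A short sketch of the calculation, assuming the usual definition Gini = 1 - sum of squared class proportions, with a split scored by the size-weighted average over its branches. The two candidate splits are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 0.0 means perfectly pure."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_split(groups):
    """Weighted average Gini index of the groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups if g)

# Two hypothetical ways of splitting the same six records.
split_a = [["yes", "yes", "yes"], ["no", "no", "no"]]   # each branch is pure
split_b = [["yes", "no", "yes"], ["no", "yes", "no"]]   # each branch is mixed
score_a = gini_of_split(split_a)
score_b = gini_of_split(split_b)
```

Split A scores lower than split B, so a tree builder following the rule above would prefer the attribute that produces split A.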



A general algorithm for building a decision tree:
1. Create a root node and select a splitting attribute.
2. Add a branch to the root node for each split candidate value and label it.
3. Take the following iterative steps:
a. Classify data by applying the split value.
b. If a stopping point is reached, then create a leaf node and label it. Otherwise, build another subtree.
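The steps above can be sketched as a short recursive function. This is a simplified illustration, assuming categorical attributes, the weighted Gini index as the selection criterion, and "pure node or no attributes left" as the stopping point; the attribute names and rows are hypothetical.

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_score(rows, attr, target):
    """Weighted Gini index after branching on every value of `attr`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def build_tree(rows, attrs, target):
    labels = [row[target] for row in rows]
    # Stopping point: node is pure or no attributes remain, so label a leaf.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select the splitting attribute (lowest weighted Gini index).
    best = min(attrs, key=lambda a: split_score(rows, a, target))
    remaining = [a for a in attrs if a != best]
    # Step 2: one labeled branch per value. Step 3: recurse into each subset.
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "branches": branches}

def classify(tree, row):
    """Follow branches until a leaf label is reached (values must have been seen in training)."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Hypothetical training records.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
prediction = classify(tree, {"outlook": "rainy", "windy": "no"})
```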


Clustering Example
From book page 322 Table 7.3

Data Mining Project Processes


Data mining projects have to follow a project management process
CRISP-DM (Cross-Industry Standard Process for Data Mining)
60% of the estimated time is spent in developing data and business understanding

Six Sigma methodology
Well-structured, data-driven methodology for eliminating defects, waste, and quality control problems
DMAIC model (Define, Measure, Analyze, Improve, Control)


DMAIC model


Knowledge discovery in databases (KDD)
A comprehensive process of using data mining methods to find useful information and patterns in data, as opposed to data mining, which involves using algorithms to identify patterns in data derived through the KDD process
Encompasses data mining
Input to the KDD process: organizational data


The KDD process involves:
1. Selection: identification of the data that will be considered within the data mining process
2. Preprocessing: dealing with erroneous and missing data, including correction
3. Transformation: data is converted into a single common format for processing; may involve encoding or reducing the number of variables
4. Data mining: applying algorithms to the transformed data in order to produce output
5. Interpretation/evaluation: results must be presented in a manner that is meaningful to the user
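The five stages can be pictured as a chain of small functions, each feeding the next. The sketch below is a toy under stated assumptions: the records are invented, missing values are simply dropped rather than corrected, and the "data mining" stage is a stand-in that just totals amounts per region.

```python
from collections import Counter

raw_records = [  # hypothetical organizational data
    {"id": 1, "region": "North", "amount": "120", "note": "ok"},
    {"id": 2, "region": "north", "amount": None, "note": "missing amount"},
    {"id": 3, "region": "South", "amount": "80", "note": "ok"},
    {"id": 4, "region": "NORTH", "amount": "100", "note": "ok"},
]

def select(records):
    """1. Selection: keep only the fields considered in the analysis."""
    return [{"region": r["region"], "amount": r["amount"]} for r in records]

def preprocess(records):
    """2. Preprocessing: here, drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def transform(records):
    """3. Transformation: one common format (lowercase text, numeric amounts)."""
    return [{"region": r["region"].lower(), "amount": float(r["amount"])}
            for r in records]

def mine(records):
    """4. Data mining: stand-in algorithm that totals the amount per region."""
    totals = Counter()
    for r in records:
        totals[r["region"]] += r["amount"]
    return totals

def interpret(totals):
    """5. Interpretation/evaluation: present results in a readable form."""
    return [f"{region}: {amount:.0f}" for region, amount in sorted(totals.items())]

report = interpret(mine(transform(preprocess(select(raw_records)))))
```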

Text Mining
Text mining
Application of data mining to nonstructured or less structured text files
Text mining uses natural language processing techniques to 'understand' the data
Tries to understand the semantics of the text (information)
Clusters documents on the basis of similarity, visualizes relationships between documents, etc.

Text mining helps organizations:
1. Find the hidden content of documents, including additional useful relationships
2. Relate documents across previously unnoticed divisions (e.g. discover that customers in 2 different product divisions have the same characteristics)
3. Group documents by common themes (e.g. all the customers of an insurance firm who have similar complaints and cancel their policies)

Text mining implies 3 types of text processing:
1. Information retrieval: refers to querying text, finding text, and presenting textual information
2. Information extraction: natural language processing is used to analyze and process text, e.g. reading thousands of resumes and extracting key information
3. Information summarization: e.g. the NewsinEssence system collects documents from a number of news sites, creates clusters based on topics, and summarizes each cluster

Typical Applications for Text Mining


Analyzing open-ended survey responses: for example, you may discover a certain set of words or terms that are commonly used by respondents to describe the pros and cons of a product or service
Automatic processing of messages, emails, etc.:
To automatically "filter" out most undesirable "junk email" based on certain terms or words that are not likely to appear in legitimate messages
The automatic systems for classifying electronic messages can also be useful in applications where messages need to be routed (automatically) to the most appropriate department or agency

Applications of text mining
Automatic detection of e-mail spam or phishing through analysis of the document content
Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses

Analysis of related scientific publications in journals to create an automated summary view of a particular discipline

Basic form of text mining: term extraction
The simplest data structure in text mining is a weighted list of words
Text is reduced to a list of terms and weights

How to mine text


Text mining involves the following steps:
1. Eliminate commonly used words (stop-words), e.g. the, and
2. Replace words with their stems or roots (stemming algorithms), e.g. replace the terms phoning, phoned, and phones with phone
3. Consider synonyms and phrases, e.g. student and pupil may need to be grouped together
4. Calculate the weights of the remaining terms
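A rough sketch of the four steps follows. The stop-word list, the synonym table, and especially the suffix-stripping stemmer are deliberately tiny stand-ins (a real stemming algorithm is more careful and would not leave the truncated stem `phon`), and the weight used in step 4 is a plain count.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "a", "to", "of", "is"}  # illustrative subset only
SYNONYMS = {"pupil": "student"}                     # illustrative mapping only
SUFFIXES = ("ing", "ed", "s")                       # crude stand-in for a stemmer

def stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_weights(text):
    words = re.findall(r"[a-z]+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]  # step 1: stop-words
    words = [stem(w) for w in words]                   # step 2: stemming
    words = [SYNONYMS.get(w, w) for w in words]        # step 3: synonyms
    return Counter(words)                              # step 4: weights (counts)

weights = term_weights("The student phoned and the pupils are phoning")
```

The result is the weighted list of terms that serves as the basic data structure for later analysis.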

2 common measures are used to calculate the weight of a term:
tf (term frequency) factor: measures the actual number of times a word appears in a document
idf (inverse document frequency) factor: reflects how rare the word is across all documents in a set; it is large for words that appear in few documents and small for words that appear in many
A large tf factor increases the weight
A large idf factor also increases the weight; words that occur in most documents (common words, not considered important) have a small idf factor, which decreases their weight
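One common way to combine the two factors is weight = tf * idf with idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. This particular formula is an assumption for the sketch (several variants exist), and the documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (a document is a list of terms)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical, already preprocessed documents.
docs = [
    ["phone", "bill", "phone"],
    ["phone", "plan"],
    ["phone", "refund", "refund"],
]
weights = tf_idf(docs)
```

In this example "phone" occurs in every document, so its weight is zero everywhere, while "refund" (frequent in one document, absent from the others) gets the highest weight.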

Data mining vs Text mining


In text mining the patterns are extracted from natural language text rather than from structured databases of facts.
Data mining: discover hidden models to describe the data.
Text mining: discover hidden facts within bodies of text.
Completely different approaches:
DM tries to generalise all of the data into a single model.
TM tries to understand the details, and cross-reference between individual bodies of text.

Web Mining
Information on the web includes:
Which other pages a home page is linked to
Each visitor to a website
Each search on a search engine
Each click on a link
Each transaction on an e-commerce site

Analysis of such information can help:
Make better use of websites
Provide a better relationship and value to the website visitors


Defined as the discovery and analysis of interesting and useful information from the web, about the web, and usually through web-based tools
Three areas of web mining:
1. Web content mining: the extraction of useful information from Web pages
2. Web structure mining: the development of useful information from the links included in Web documents
3. Web usage mining: the extraction of useful information from the data being generated through webpage visits, transactions, etc.


Web content mining
o Refers to mining, extraction, and integration of useful data, information, and knowledge from Web page contents
o Web crawlers are used to read through the contents of a website automatically
o Used to enhance the results produced by search engines

Web structure mining
o The development of useful information from the links included in Web documents
o Links going to a document: useful in determining the popularity of a document
o Links within the document indicate the depth of coverage of a topic
o Hubs: pages that point to many authorities
o Authority pages: those that are linked to by many hubs
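Hub and authority scores are mutually defined, and they can be computed iteratively in the style of the HITS algorithm. The sketch below is a minimal version under assumed inputs: the page names and link graph are hypothetical, and a fixed number of iterations stands in for a proper convergence check.

```python
def hits(links, iterations=20):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        authority = {
            p: sum(hub[q] for q in pages if p in links.get(q, ()))
            for p in pages
        }
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {p: sum(authority[t] for t in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in authority.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        authority = {p: v / a_norm for p, v in authority.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, authority

# Hypothetical link structure.
links = {
    "portal": ["docs", "blog"],
    "directory": ["docs", "blog"],
    "blog": ["docs"],
}
hub, authority = hits(links)
top_authority = max(authority, key=authority.get)
```

Here "docs" receives the most links from good hubs and so ends up as the top authority, while "portal" and "directory" tie as the strongest hubs.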


Web usage mining
o The extraction of useful information from the data being generated through webpage visits, transactions, etc.
o Extracting useful information from server logs, i.e. users' history
o Finding out what users are looking for on the internet
o Clickstream analysis: analyzing the information collected by web servers to understand user behavior
o E.g. 60% of the visitors who searched for hotels in Goa had earlier searched for air fares to Goa. This
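A statistic of the kind quoted in the Goa example could be derived from clickstream data roughly as follows. The session data, visitor ids, and function name are all hypothetical, and real server logs would first need parsing into this per-visitor, time-ordered form.

```python
def share_with_earlier_search(sessions, later, earlier):
    """Among visitors who searched `later`, the fraction who searched `earlier` before it."""
    matched = 0
    total = 0
    for searches in sessions.values():
        if later not in searches:
            continue
        total += 1
        # Only searches made before the first occurrence of `later` count.
        if earlier in searches[: searches.index(later)]:
            matched += 1
    return matched / total if total else 0.0

# Hypothetical server-log extract: visitor id -> searches in time order.
sessions = {
    "v1": ["air fares to Goa", "hotels in Goa"],
    "v2": ["hotels in Goa"],
    "v3": ["air fares to Goa", "car rental", "hotels in Goa"],
    "v4": ["hotels in Goa", "air fares to Goa"],
    "v5": ["air fares to Goa"],
}
share = share_with_earlier_search(sessions, "hotels in Goa", "air fares to Goa")
```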

Uses for Web mining:
Determine the lifetime value of clients
Design cross-marketing strategies across products
Evaluate promotional campaigns
Target electronic ads and coupons at user groups based on their access patterns
Predict user behavior
Present dynamic information to users based on their interests and profiles

E.g. Amazon.com
A registered user who revisits amazon.com is greeted by name (the site recognizes the user by reading a cookie). The site also presents the user with a choice of products in a personalized store based on previous purchases
