
Data Mining

In general terms, "mining" is the process of extracting some valuable material from the
earth, e.g. coal mining, diamond mining etc. In the context of computer science, "Data
Mining" refers to the extraction of useful information from large volumes of data or data
warehouses. One can see that the term itself is a little confusing. In the case of coal or
diamond mining, the result of the extraction process is coal or diamond. But in the case of Data
Mining, the result of the extraction process is not data! Instead, the result of data mining is the
patterns and knowledge gained at the end of the extraction process. In that sense,
Data Mining is also known as Knowledge Discovery or Knowledge Extraction.

Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989.


However, the term 'data mining' became more popular in the business and press
communities. Currently, the terms Data Mining and Knowledge Discovery are used interchangeably.

Nowadays, data mining is used almost everywhere that large amounts of data are
stored and processed. For example, banks typically use data mining to identify
prospective customers who might be interested in credit cards, personal loans or insurance.
Since banks have the transaction details and detailed profiles of their customers,
they analyze all this data and try to find patterns that help them predict which
customers are likely to be interested in personal loans and similar products.
Main Purpose of Data Mining
Basically, the information gathered through data mining reveals hidden patterns and predicts future
trends and behaviors, allowing businesses to make informed decisions.
Technically, data mining is the computational process of analyzing data from different
perspectives and dimensions and categorizing/summarizing it into meaningful
information.
Data Mining can be applied to any type of data, e.g. data warehouses, transactional
databases, relational databases, multimedia databases, spatial databases, time-series
databases and the World Wide Web.
Data Mining as a whole process
The whole process of Data Mining comprises three main phases:
1. Data Pre-processing – Data cleaning, integration, selection and transformation take
place
2. Data Extraction – The actual mining of patterns takes place
3. Data Evaluation and Presentation – The mined results are analyzed and presented
In future articles, we will cover the details of each of these phases.
Applications of Data Mining
1. Financial Analysis
2. Biological Analysis
3. Scientific Analysis
4. Intrusion Detection
5. Fraud Detection
6. Research Analysis
Real life example of Data Mining – Market Basket Analysis
Market Basket Analysis is a technique that carefully studies the purchases made by a
customer in a supermarket. The concept is applied to identify the items that are
bought together by a customer. Say, if a person buys bread, what are the chances that
he/she will also purchase butter? This analysis helps companies design offers and
deals. All of this is done with the help of data mining.
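To make this concrete, the chance of butter given bread can be estimated by simple counting over transactions. Below is a minimal Python sketch; the items and numbers are made up purely for illustration.

# Estimating P(butter | bread) from a toy set of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

with_bread = [t for t in transactions if "bread" in t]
with_both = [t for t in with_bread if "butter" in t]

# The fraction of bread-containing baskets that also contain butter.
print(len(with_both) / len(with_bread))  # 3/4 = 0.75

In association-rule terms, this ratio is called the confidence of the rule bread -> butter.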
Basic Concept of Classification (Data Mining)
Data Mining: Data mining, in general terms, means digging deep into data in its various
forms to uncover patterns, and to gain knowledge from those patterns. In the process of
data mining, large data sets are first sorted, then patterns are identified and relationships are
established to perform data analysis and solve problems.
Classification: It is a data analysis task, i.e. the process of finding a model that describes
and distinguishes data classes and concepts. Classification is the problem of identifying to
which of a set of categories (sub-populations) a new observation belongs, on the basis of
a training set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a
classifier is required to predict class labels such as 'Safe' and 'Risky' for adopting the project
and further approving it. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of a classification model.
Different algorithms are used to build the classifier by making the model learn from the
available training set. The model has to be trained to predict accurate results.
2. Classification Step: The model is used to predict class labels; the constructed model is
tested on test data to estimate the accuracy of the classification rules (see the sketch below).
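As a rough sketch of these two steps, the snippet below uses scikit-learn's decision tree on a hypothetical 'Safe'/'Risky' project dataset; the feature columns (budget, team size, duration) are invented for illustration, not taken from any real feasibility model.

# Learning step: build a classifier from a labelled training set.
from sklearn.tree import DecisionTreeClassifier

X_train = [[10, 5, 6],   # [budget, team size, duration in months] - hypothetical features
           [50, 3, 18],
           [20, 8, 4],
           [80, 2, 24],
           [15, 6, 5]]
y_train = ["Safe", "Risky", "Safe", "Risky", "Safe"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Classification step: predict the class label of a new, unseen project.
print(model.predict([[30, 4, 12]]))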
Training and Testing:
Suppose a person is sitting under a fan and the fan starts falling on him; he
should move aside in order not to get hurt. Learning to move away is the training part. In
testing, if the person sees any heavy object coming towards him or falling on him and moves
aside, then the system is tested positively; if the person does not move aside, then the system
is tested negatively.
The same holds for data: a model must be trained on data in order to produce accurate and reliable
results.
There are certain data types associated with data mining that tell us the format of
the data (whether it is in text format or in numerical format).
Attributes – Represent different features of an object. The different types of attributes are:
1. Binary: Possesses only two values, i.e. True or False.
Example: Suppose there is a survey evaluating some product. We need to check
whether it is useful or not, so the customer has to answer Yes or No.
Product usefulness: Yes / No
- Symmetric: Both values are equally important in all aspects.
- Asymmetric: The two values may not be equally important.
2. Nominal: When more than two outcomes are possible. The values are names rather
than integers.
Example: One needs to choose some material, but in different colors. So the color
might be Yellow, Green, Black or Red.
Different colors: Red, Green, Black, Yellow
3. Ordinal: Values that have a meaningful order.
Example: Suppose there are grade sheets of a few students, which might contain different
grades as per their performance, such as A, B, C, D.
Grades: A, B, C, D
4. Continuous: May have an infinite number of values; typically of float type.
Example: Measuring the weight of a few students in an orderly manner, i.e. 50,
51, 52, 53.
Weight: 50, 51, 52, 53
5. Discrete: Has a finite number of values.
Example: Marks of a student in a few subjects: 65, 70, 75, 80, 90.
Marks: 65, 70, 75, 80, 90
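These attribute types map naturally onto column types in a pandas DataFrame. The sketch below is one possible encoding, reusing the examples above; the column names are illustrative.

import pandas as pd

df = pd.DataFrame({
    "useful": [True, False, True],                 # binary attribute
    "color": ["Red", "Green", "Black"],            # nominal attribute
    "grade": pd.Categorical(["A", "C", "B"],
                            categories=["D", "C", "B", "A"],
                            ordered=True),         # ordinal attribute
    "weight": [50.0, 51.5, 52.3],                  # continuous attribute (float)
    "marks": [65, 70, 75],                         # discrete attribute (int)
})

print(df.dtypes)          # the inferred format of each column
print(df["grade"].min())  # ordered categories support comparisons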
Syntax:
- Mathematical Notation: Classification is based on building a function that takes an input
feature vector X and predicts its outcome Y, a qualitative response taking values in a
set of classes C.
- Here a classifier (or model) is used, which is a supervised function and can also be designed
manually based on an expert's knowledge. It is constructed to predict class labels
(example: the label 'Yes' or 'No' for the approval of some event).
Classifiers can be categorized into two major types:

1. Discriminative: A very basic classifier, it determines just one class for each
row of data. It tries to model the class boundary directly from the observed data, and depends
heavily on the quality of the data rather than on its distribution.
Example: Logistic Regression
Acceptance of a student at a university (test score and grades need to be considered).
Suppose there are a few students, and their results are as follows:
Student 1: Test Score: 9/10, Grades: 8/10, Result: Accepted
Student 2: Test Score: 3/10, Grades: 4/10, Result: Rejected
Student 3: Test Score: 7/10, Grades: 6/10, Result: to be predicted (see the sketch below)
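A minimal sketch of this discriminative example with scikit-learn's LogisticRegression follows; only the two labelled students are available for training, so the prediction is merely indicative.

from sklearn.linear_model import LogisticRegression

X_train = [[9, 8],  # Student 1: test score, grades -> Accepted
           [3, 4]]  # Student 2: test score, grades -> Rejected
y_train = ["Accepted", "Rejected"]

clf = LogisticRegression().fit(X_train, y_train)

# Student 3: Test Score 7/10, Grades 6/10
print(clf.predict([[7, 6]]))  # closer to Student 1, so likely ['Accepted']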

2. Generative: It models the distribution of the individual classes and tries to learn the
model that generates the data behind the scenes by estimating the assumptions and
distributions of the model. It can then be used to predict unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data. Suppose there are 100 emails,
split 1:4 into Class A (25 spam emails) and Class B (75 non-spam emails). A user now
wants to know whether an email containing the word 'cheap' should be classified as spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word 'cheap' and the rest do not.
In Class B (the 75 non-spam emails), 70 out of 75 do not contain the word 'cheap' and the
remaining 5 do.
So, if an email contains the word 'cheap', what is the probability of it being spam? Of the
20 + 5 = 25 emails containing 'cheap', 20 are spam, so the probability is 20/25 = 80%.
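The 80% figure also follows from Bayes' rule; a few lines of Python reproduce the arithmetic from the counts above.

# P(spam | "cheap") by Bayes' rule, using the counts from the example.
p_spam = 25 / 100                 # prior: Class A
p_not_spam = 75 / 100             # prior: Class B
p_cheap_given_spam = 20 / 25      # 20 of the 25 spam emails contain "cheap"
p_cheap_given_not_spam = 5 / 75   # 5 of the 75 non-spam emails contain "cheap"

numerator = p_cheap_given_spam * p_spam
evidence = numerator + p_cheap_given_not_spam * p_not_spam
print(numerator / evidence)       # 0.2 / 0.25 = 0.8, i.e. 80%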
Classifiers Of Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
Associated Tools and Languages: Used to mine/extract useful information from raw data.
- Main languages used: R, SAS, Python, SQL
- Major tools used: RapidMiner, Orange, KNIME, Spark, Weka
- Libraries used: NumPy, Matplotlib, Pandas, scikit-learn, NLTK,
TensorFlow, Seaborn, Basemap, etc., often used inside Jupyter notebooks
Real Life Examples:
- Market Basket Analysis:
It is a modelling technique based on combinations of items that frequently occur
together in purchase transactions.
Example: Amazon and many other retailers use this technique. While viewing some
product, suggestions are shown for commodities that other people have
bought along with it in the past.
- Weather Forecasting:
Changing patterns in weather conditions need to be observed based on
parameters such as temperature, humidity and wind direction. Accurate prediction
also requires the use of previous records.
Advantages:
- Mining-based methods are cost-effective and efficient
- Helps in identifying criminal suspects
- Helps in predicting the risk of diseases
- Helps banks and financial institutions identify likely defaulters before they approve
cards, loans, etc.
Disadvantages:
Privacy: There are chances that a company may pass information about its
customers to other vendors or use this information for its own profit.
Accuracy Problem: An accurate model must be selected in order to obtain the best
results.
APPLICATIONS:
 Marketing and Retailing
 Manufacturing
 Telecommunication Industry
 Intrusion Detection
 Education System
 Fraud Detection
GIST OF DATA MINING:
1. Choose the correct classification method, such as decision trees, Bayesian networks, or
neural networks.
2. Take a sample of data in which all class values are known. The data is then divided
into two parts, a training set and a test set.
Now, the training set is given to a learning algorithm, which derives a classifier. Then the
classifier is tested with the test set, where all class values are hidden.
If the classifier classifies most cases in the test set correctly, it can be assumed that it will
also work accurately on future data; otherwise, the wrong model may have been chosen.
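A minimal sketch of this whole gist with scikit-learn follows; it uses the bundled Iris dataset purely as a stand-in for "a sample of data where all class values are known".

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                # sample where all class values are known

# Divide the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # learning algorithm derives a classifier
y_pred = clf.predict(X_test)                          # class values are hidden during testing
print(accuracy_score(y_test, y_pred))                 # high accuracy suggests it generalizes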
Data mining deals with the kinds of patterns that can be mined. On the
basis of the kind of patterns to be mined, there are two categories of
functions involved in Data Mining:
- Descriptive
- Classification and Prediction

Descriptive Function
The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions:
- Class/Concept Description
- Mining of Frequent Patterns
- Mining of Associations
- Mining of Correlations
- Mining of Clusters

Class/Concept Description
Class/Concept refers to the data to be associated with classes or
concepts. For example, in a company, the classes of items for sale
include computers and printers, and concepts of customers include big
spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived
in the following two ways:
- Data Characterization − This refers to summarizing the data of the class under study. This
class under study is called the Target Class.
- Data Discrimination − This refers to comparing the target class with one or more
predefined contrasting groups or classes.

Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in
transactional data. Here is the list of kinds of frequent patterns:
- Frequent Item Set − A set of items that frequently appear together, for
example, milk and bread.
- Frequent Subsequence − A sequence of patterns that occur frequently, such as
purchasing a camera being followed by purchasing a memory card.
- Frequent Sub-Structure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item sets or subsequences.

Mining of Associations
Associations are used in retail sales to identify items that are
frequently purchased together. This refers to the process of
uncovering relationships among data and determining association
rules.
For example, a retailer generates an association rule showing that
70% of the time milk is sold with bread and only 30% of the time biscuits are
sold with bread.

Mining of Correlations
It is a kind of additional analysis performed to uncover interesting
statistical correlations between associated attribute-value pairs or
between two item sets, to determine whether they have a positive, negative or
no effect on each other.

Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers
to forming groups of objects that are very similar to each other but are
highly different from the objects in other clusters.
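As a rough sketch, k-means (one common clustering algorithm, not the only one) groups nearby points together; the 2-D points below are made up for illustration.

from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0],      # one group of mutually similar objects
          [10, 2], [10, 4], [10, 0]]   # another group, far from the first

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # objects in the same cluster share a label, e.g. [1 1 1 0 0 0]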

Classification and Prediction
Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to
predict the class of objects whose class label is unknown. This derived
model is based on the analysis of sets of training data. The derived
model can be presented in the following forms:
- Classification (IF-THEN) Rules
- Decision Trees
- Mathematical Formulae
- Neural Networks

The list of functions involved in these processes is as follows:
- Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e.
data objects whose class labels are well known.
- Prediction − It is used to predict missing or unavailable numerical data values
rather than class labels. Regression analysis is generally used for prediction.
Prediction can also be used to identify distribution trends based on available
data.
- Outlier Analysis − Outliers may be defined as the data objects that do not comply
with the general behavior or model of the available data.
- Evolution Analysis − Evolution analysis refers to the description and modeling of
regularities or trends for objects whose behavior changes over time.

Data Mining Task Primitives
- We can specify a data mining task in the form of a data mining query.
- This query is input to the system.
- A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate with the data mining system in an
interactive manner. Here is the list of Data Mining Task Primitives:
- Set of task-relevant data to be mined
- Kind of knowledge to be mined
- Background knowledge to be used in the discovery process
- Interestingness measures and thresholds for pattern evaluation
- Representation for visualizing the discovered patterns

Set of task-relevant data to be mined
This is the portion of the database in which the user is interested. This
portion includes the following:
- Database attributes
- Data warehouse dimensions of interest

Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are:
- Characterization
- Discrimination
- Association and Correlation Analysis
- Classification
- Prediction
- Clustering
- Outlier Analysis
- Evolution Analysis

Background knowledge
Background knowledge allows data to be mined at multiple levels of
abstraction. For example, concept hierarchies are one kind of
background knowledge that allows data to be mined at multiple levels of
abstraction.

Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns discovered by the knowledge
discovery process. There are different interestingness measures for
different kinds of knowledge.

Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed.
These representations may include the following:
- Rules
- Tables
- Charts
- Graphs
- Decision Trees
- Cubes

Regression in Data Mining


Regression involves a predictor variable (whose values are known)
and a response variable (whose values are to be predicted).

The two basic types of regression are:

1. Linear regression
- It is the simplest form of regression. Linear regression attempts to model the
relationship between two variables by fitting a linear equation to the observed data.
- Linear regression attempts to find the mathematical relationship between the
variables.
- If the outcome is a straight line, then it is considered a linear model; if it is a
curved line, then it is a non-linear model.
- The relationship between the dependent variable and a single independent variable
is given by a straight line:
Y = α + βX
- The model says 'Y' is a linear function of 'X'.
- The value of 'Y' increases or decreases linearly as the value of 'X' changes, as the
short sketch below illustrates.
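The coefficients α and β can be estimated by least squares; this sketch uses NumPy on made-up points that lie roughly on the line Y = 2X.

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # approximately Y = 2X

beta, alpha = np.polyfit(X, Y, deg=1)     # slope and intercept of the fitted line
print(f"Y = {alpha:.2f} + {beta:.2f}X")   # slope close to 2, intercept close to 0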
2. Multiple regression model
- Multiple linear regression is an extension of linear regression analysis.
- It uses two or more independent variables to predict an outcome: a single
continuous dependent variable.
Y = a0 + a1X1 + a2X2 + ... + akXk + e
where,
'Y' is the response variable,
X1, X2, ..., Xk are the independent predictors,
'e' is the random error,
a0, a1, a2, ..., ak are the regression coefficients.
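A hedged sketch of fitting this model with scikit-learn follows; the two-predictor data is invented so that Y is approximately 1 + 2*X1 + 3*X2.

from sklearn.linear_model import LinearRegression

X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]   # columns: X1, X2
Y = [8.9, 8.1, 19.2, 17.8, 26.0]               # roughly 1 + 2*X1 + 3*X2

model = LinearRegression().fit(X, Y)
print(model.intercept_, model.coef_)  # a0 and [a1, a2], close to 1, 2 and 3
print(model.predict([[6, 6]]))        # predicted response for a new observation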

Naive Bayes Classification Solved Example

Bike damage example: In the following table, the attributes color, type and origin are
given, and the target attribute 'Damaged?' can be Yes or No.

Bike No   Color   Type     Origin     Damaged?
10        Blue    Moped    Indian     Yes
20        Blue    Moped    Indian     No
30        Blue    Moped    Indian     Yes
40        Red     Moped    Indian     No
50        Red     Moped    Japanese   Yes
60        Red     Sports   Japanese   No
70        Red     Sports   Japanese   Yes
80        Red     Sports   Indian     No
90        Blue    Sports   Japanese   No
100       Blue    Moped    Japanese   Yes

Solution:
Required formula (Bayes' theorem):

P(c | x) = P(x | c) P(c) / P(x)

Where,
P(c | x) is the posterior probability of class c given predictor x.
P(c) is the prior probability of the class.
P(x | c) is the likelihood: the probability of the predictor given the class.
P(x) is the prior probability of the predictor.

We need to classify the unseen sample <Blue, Indian, Sports>, which is not
given in the data set.
The prior probabilities can be computed as:

P(Yes) = 5/10
P(No) = 5/10

Color:
P(Blue|Yes) = 3/5    P(Blue|No) = 2/5
P(Red|Yes) = 2/5     P(Red|No) = 3/5

Type:
P(Sports|Yes) = 1/5    P(Sports|No) = 3/5
P(Moped|Yes) = 4/5     P(Moped|No) = 2/5

Origin:
P(Indian|Yes) = 2/5      P(Indian|No) = 3/5
P(Japanese|Yes) = 3/5    P(Japanese|No) = 2/5

So, unseen example X = <Blue, Indian, Sports>

P(X|Yes) . P(Yes) = P(Blue|Yes) . P(Indian|Yes) . P(Sports|Yes) . P(Yes)
                  = 3/5 * 2/5 * 1/5 * 5/10 = 0.024

P(X|No) . P(No) = P(Blue|No) . P(Indian|No) . P(Sports|No) . P(No)
                = 2/5 * 3/5 * 3/5 * 5/10 = 0.072

Since 0.072 > 0.024, the example is classified as No.
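The same result can be checked programmatically by counting directly from the table; the short sketch below recomputes both scores.

# Naive Bayes scores for the unseen sample, computed from the table above.
data = [
    ("Blue", "Moped", "Indian", "Yes"), ("Blue", "Moped", "Indian", "No"),
    ("Blue", "Moped", "Indian", "Yes"), ("Red", "Moped", "Indian", "No"),
    ("Red", "Moped", "Japanese", "Yes"), ("Red", "Sports", "Japanese", "No"),
    ("Red", "Sports", "Japanese", "Yes"), ("Red", "Sports", "Indian", "No"),
    ("Blue", "Sports", "Japanese", "No"), ("Blue", "Moped", "Japanese", "Yes"),
]
sample = ("Blue", "Sports", "Indian")  # color, type, origin

for label in ("Yes", "No"):
    rows = [r for r in data if r[3] == label]
    score = len(rows) / len(data)               # prior P(label)
    for i, value in enumerate(sample):
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    print(label, score)                         # Yes: 0.024, No: 0.072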
