
DATA MINING ALGORITHMS

An algorithm in data mining (or machine learning) is a set of heuristics and calculations that
creates a model from data.
To create a model, the algorithm first analyzes the data you provide, looking for specific types of
patterns or trends. The algorithm uses the results of this analysis over many iterations to find the
optimal parameters for creating the mining model. These parameters are then applied across the
entire data set to extract actionable patterns and detailed statistics.
The mining model that an algorithm creates from your data can take various forms, including:
 A set of clusters that describe how the cases in a dataset are related.
 A decision tree that predicts an outcome, and describes how different criteria affect that
outcome.
 A mathematical model that forecasts sales.
 A set of rules that describe how products are grouped together in a transaction, and the
probabilities that products are purchased together.

Choosing the Right Algorithm


Choosing the best algorithm to use for a specific analytical task can be a challenge. While you
can use different algorithms to perform the same business task, each algorithm produces a
different result, and some algorithms can produce more than one type of result.

Choosing an Algorithm by Type


Data Mining includes the following algorithm types:
 Classification algorithms predict one or more discrete variables, based on the other
attributes in the dataset.
They sort data into a number of labeled classes. They do this by learning from examples whose
classes are already known and then classifying new inputs based on what was learned.
 Regression algorithms predict continuous numeric values. They attempt to establish a
relationship between a dependent variable and one or more independent variables and then
make forecasts based on that relationship.
 Segmentation algorithms divide data into groups, or clusters, of items that have
similar properties. They are useful for unlabeled data that has no predefined categories.
Segmentation is closely related to clustering, and some use the two terms
interchangeably.
 Association algorithms find correlations between different attributes in a dataset; that
is, they attempt to discover how items are related to one another. They do this by looking
for rules that govern the relationships between variables in databases.
 Sequence analysis algorithms summarize frequent sequences or episodes in data, such as
a series of clicks on a website, or a series of log events preceding machine maintenance.
Sequence analysis imposes an order on observations that must be preserved when training
models. In many of the other algorithms the sequence is not important, but in
sequence analysis the order matters.

However, there is no reason that you should be limited to one algorithm in your solutions.
Experienced analysts will sometimes use one algorithm to determine the most effective inputs
(that is, variables), and then apply a different algorithm to predict a specific outcome based on
that data.

Purpose of Data Mining Techniques


With huge amounts of data being stored each day, businesses are now interested in finding
trends in that data. Data extraction techniques help convert the raw data into useful
knowledge. Mining huge amounts of data requires software, as it is impossible for a human to
go through such a large volume of data manually.
Data mining software analyzes the relationships between different items in large databases,
which can support decision-making and help a business learn more about its customers, craft
marketing strategies, increase sales, and reduce costs.

List of Data Extraction Techniques


The data mining technique to apply depends on the goal of the data analysis.
1. Frequent Pattern Mining/Association Analysis
This type of data mining technique looks for recurring relationships in the given dataset. It
looks for interesting associations and correlations between the different items in the database
and identifies patterns.
An example is “Shopping Basket Analysis”: finding out which products customers are likely to
purchase together in the store, such as bread and butter.

Application: Designing the placement of products on store shelves, marketing, and cross-selling
of products.
The patterns can be represented in the form of association rules.
Support and confidence are the two parameters used to measure the usefulness of an association
rule. Support is the percentage of all transactions in which both items were purchased
together.
Confidence is the percentage of the transactions containing the first item that also contain the
second item.
A mined pattern is considered interesting if it meets both a minimum support threshold and a
minimum confidence threshold. The threshold values are decided by the domain experts.
Bread => butter [support=2%, confidence=60%]
The above statement is an example of an association rule. It means that 2% of all transactions
contained bread and butter together, and that 60% of the customers who bought bread also
bought butter.
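
As a minimal sketch, support and confidence for a rule such as bread => butter can be computed
directly from a list of transactions. The transactions and the helper names `support` and
`confidence` below are hypothetical and are used only for illustration.

```python
# Hypothetical toy transactions (not data from the text above).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"butter", "eggs"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing antecedent that also contain consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"bread => butter [support={rule_support:.0%}, confidence={rule_confidence:.0%}]")
```

In this toy data, bread and butter appear together in 2 of 5 transactions (support 40%), and 2 of
the 3 transactions that contain bread also contain butter (confidence about 67%).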

Steps To Implement Association Analysis:


1. Finding frequent itemsets. An itemset is a set of items, and an itemset containing k items
is a k-itemset. The frequency of an itemset is the number of transactions that contain the
itemset.
2. Generating strong association rules from the frequent itemsets. Strong association
rules are those that meet both the minimum support and the minimum confidence thresholds.
There are various frequent itemset mining methods, such as the Apriori algorithm, the pattern
growth approach, and mining using the vertical data format. This technique is commonly known as
Market Basket Analysis.
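
The two steps can be sketched as follows. This is a brute-force enumeration for a tiny,
hypothetical dataset, not an implementation of Apriori or of the other methods named above; the
thresholds `MIN_SUPPORT` and `MIN_CONFIDENCE` are also assumed values.

```python
from itertools import combinations

# Hypothetical toy transactions and thresholds, chosen for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.7


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Step 1: enumerate every candidate itemset and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(frozenset(combo))
        if s >= MIN_SUPPORT:
            frequent[frozenset(combo)] = s

# Step 2: split each frequent itemset into antecedent => consequent and keep
# the rules whose confidence meets the threshold. Every subset of a frequent
# itemset is itself frequent, so its support is already in the dictionary.
strong_rules = []
for itemset, s in frequent.items():
    for r in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), r):
            antecedent = frozenset(combo)
            conf = s / frequent[antecedent]
            if conf >= MIN_CONFIDENCE:
                strong_rules.append((set(antecedent), set(itemset - antecedent), s, conf))

for antecedent, consequent, s, conf in strong_rules:
    print(f"{sorted(antecedent)} => {sorted(consequent)} [support={s:.0%}, confidence={conf:.0%}]")
```

Enumerating all itemsets grows exponentially with the number of items, which is why the methods
listed above avoid generating every candidate.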
2. Correlation Analysis
Correlation Analysis is an extension of Association Rules. Sometimes the support and
confidence parameters may still yield patterns that are uninteresting to the users.
For example, out of 1000 transactions analyzed, 600 contained bread, 750 contained butter, and
400 contained both bread and butter.
Suppose the minimum support for the association rule run is 30% and the minimum confidence is
60%. The rule bread => butter has a support of 400/1000 = 40% and a confidence of
400/600 = 66.7%, so it meets both thresholds.
However, the overall probability of purchasing butter is 75%, which is higher than 66.7%. This
means that bread and butter are negatively correlated: the purchase of one makes the purchase
of the other less likely. The rule is therefore misleading.

For this reason, support and confidence are supplemented with another interestingness
measure, correlation, which helps in mining interesting patterns.
A => B [support, confidence, correlation].
A correlation rule is measured by support, confidence, and the correlation between itemsets A
and B. Correlation can be measured by Lift and Chi-Square.
(i) Lift: As the word itself says, Lift represents the degree to which the presence of one itemset
lifts the occurrence of the other itemset.
The lift between the occurrence of A and B can be measured by:
Lift(A, B) = P(A ∪ B) / (P(A) × P(B))
Here P(A ∪ B) is the probability that a transaction contains all the items of A and of B.
If it is < 1, then A and B are negatively correlated.
If it is > 1, then A and B are positively correlated, which means that the occurrence of one
makes the occurrence of the other more likely.
If it is = 1, then A and B are independent and there is no correlation between them.
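
As a quick check, the lift for the bread and butter example can be worked out from the counts
given earlier (1000 transactions, 600 with bread, 750 with butter, 400 with both). The variable
names are chosen here for illustration.

```python
# Counts taken from the bread and butter example in the text.
total = 1000
bread = 600
butter = 750
both = 400

p_bread = bread / total
p_butter = butter / total
p_both = both / total

# Lift compares the observed joint probability with the joint probability
# that would be expected if the two items were independent.
lift = p_both / (p_bread * p_butter)

if lift < 1:
    verdict = "negatively correlated"
elif lift > 1:
    verdict = "positively correlated"
else:
    verdict = "independent"

print(f"lift = {lift:.2f} ({verdict})")
```

The lift is 0.40 / (0.60 × 0.75), about 0.89, which is below 1 and so agrees with the
negative correlation noted in the example.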
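(ii) Chi-Square: This is another correlation measure. For each slot of the contingency table
(each combination of A or not A with B or not B), it takes the squared difference between the
observed and expected value divided by the expected value, and sums these over all slots.
A large chi-square value indicates that A and B are not independent. The direction of the
correlation comes from comparing observed and expected counts: if the observed count for the
(A and B) slot is lower than expected, A and B are negatively correlated.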
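
A minimal sketch of the chi-square calculation for the bread and butter example (1000
transactions, 600 with bread, 750 with butter, 400 with both) is shown below. The 2x2
contingency table layout and the variable names are assumptions made for illustration.

```python
# Counts taken from the bread and butter example in the text.
total = 1000
bread = 600
butter = 750
both = 400

# Observed counts for the four slots of the 2x2 contingency table.
observed = {
    ("bread", "butter"): both,
    ("bread", "no butter"): bread - both,
    ("no bread", "butter"): butter - both,
    ("no bread", "no butter"): total - bread - butter + both,
}

row_totals = {"bread": bread, "no bread": total - bread}
col_totals = {"butter": butter, "no butter": total - butter}

# Expected count of a slot under independence: row total * column total / total.
chi_square = 0.0
for (row, col), obs in observed.items():
    expected = row_totals[row] * col_totals[col] / total
    chi_square += (obs - expected) ** 2 / expected

expected_both = row_totals["bread"] * col_totals["butter"] / total
print(f"chi-square = {chi_square:.2f}")
print(f"observed both = {both}, expected both = {expected_both:.0f}")
```

Here the chi-square value is about 55.56, and the observed count for bread and butter together
(400) is below the expected count (450), which points to a negative correlation.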


3. Classification
Classification helps in building models of important data classes. A model, or classifier, is
constructed to predict class labels. Labels are predefined classes with discrete values such as
“yes” or “no”, “safe” or “risky”. It is a type of supervised learning, as the class labels of the
training data are already known.
Data Classification is a two-step process:
1. Learning step: The model is constructed here. A pre-defined algorithm is applied to
training data whose class labels are provided, and the classification rules are constructed.
2. Classification step: The model is used to predict class labels for given data. The accuracy
of the classification rules is first estimated on test data; if the accuracy is acceptable,
the rules are used to classify new data tuples.
Each data tuple is thereby assigned to one of the target categories, that is, to a class
label.
Application: Banks identifying loan applicants as low, medium, or high risk; businesses designing
marketing campaigns based on age group classification.
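
The two-step process can be sketched with a very simple classifier. The example below uses a
nearest-centroid rule, chosen here only because it is short; the text does not prescribe a
particular algorithm, and the loan data, feature meanings, and function names are hypothetical.

```python
# Hypothetical labeled tuples: (income, debt) in thousands -> class label.
train = [((60, 5), "safe"), ((75, 10), "safe"), ((80, 8), "safe"),
         ((25, 20), "risky"), ((30, 30), "risky"), ((20, 15), "risky")]
test = [((70, 6), "safe"), ((28, 25), "risky")]


def learn(rows):
    """Learning step: compute one centroid (mean point) per class label."""
    sums = {}
    for features, label in rows:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([a + b for a, b in zip(total, features)], count + 1)
    return {label: [v / n for v in total] for label, (total, n) in sums.items()}


def classify(model, features):
    """Classification step: pick the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


model = learn(train)

# Estimate accuracy on held-out test data before trusting the model.
accuracy = sum(classify(model, x) == y for x, y in test) / len(test)

# Only then classify a new, unlabeled tuple.
new_label = classify(model, (35, 22))
print(f"accuracy = {accuracy:.0%}, new applicant classified as {new_label}")
```

Keeping the test tuples out of the learning step is what makes the accuracy estimate meaningful.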

4. Decision Tree Induction


Decision tree induction is a method of Classification Analysis. A decision tree is a
tree-like structure that is easy to understand, simple, and fast. In it, each non-leaf node
represents a test on an attribute, each branch represents an outcome of the test, and each leaf
node represents a class label.
To classify a tuple, its attribute values are tested against the decision tree along a path from
the root to a leaf node.
Decision trees are popular because they do not require any domain knowledge. They can represent
multidimensional data, and they can be easily converted to classification rules.
Application: Decision trees are used in medicine, manufacturing, production,
astronomy, etc.
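
To illustrate the structure, the sketch below represents a small hand-built tree as nested
dictionaries, classifies a tuple by walking from the root to a leaf, and converts the tree to
classification rules. It does not perform induction (learning the tree from data); the tree,
attribute names, and functions are hypothetical.

```python
# Hypothetical hand-built tree. Non-leaf nodes test an attribute, branches are
# the outcomes of the test, and leaf nodes hold a class label.
tree = {
    "attribute": "income",
    "branches": {
        "high": {"label": "safe"},
        "low": {
            "attribute": "has_collateral",
            "branches": {
                "yes": {"label": "safe"},
                "no": {"label": "risky"},
            },
        },
    },
}


def predict(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while "label" not in node:
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF ... THEN classification rule."""
    if "label" in node:
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], value),)))
    return rules


label = predict(tree, {"income": "low", "has_collateral": "no"})
rules = to_rules(tree)
print(label)
for rule in rules:
    print(rule)
```

Each root-to-leaf path yields exactly one rule, so this tree produces three rules.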

5. Bayes Classification
Bayesian Classification is another method of Classification Analysis. Bayes classifiers predict
the probability that a given tuple belongs to a particular class. They are based on Bayes'
theorem, which comes from probability and decision theory.
Bayes Classification uses posterior probability and prior probability in the decision-making
process. The posterior probability is the probability of a hypothesis given the observed
information, i.e. when the attribute values are known, while the prior probability is the
probability of the hypothesis regardless of the attribute values.
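
As a small worked example, Bayes' theorem, P(H | X) = P(X | H) × P(H) / P(X), can be applied to
counts from a toy dataset. The records below (an age group and whether the customer buys) are
hypothetical, with H standing for "buys = yes" and X for "age group = youth".

```python
# Hypothetical records: (age_group, buys).
records = (
    [("youth", "yes")] * 2 + [("youth", "no")] * 3
    + [("senior", "yes")] * 4 + [("senior", "no")] * 1
)

total = len(records)
n_yes = sum(1 for _, label in records if label == "yes")
n_youth = sum(1 for age, _ in records if age == "youth")
n_youth_and_yes = sum(
    1 for age, label in records if age == "youth" and label == "yes"
)

prior = n_yes / total                       # P(H): known before seeing attributes
likelihood = n_youth_and_yes / n_yes        # P(X | H)
evidence = n_youth / total                  # P(X)
posterior = likelihood * prior / evidence   # P(H | X): after seeing the attribute

print(f"prior = {prior:.2f}, posterior = {posterior:.2f}")
```

The prior probability of buying is 0.60, but once the customer is known to be in the youth group
the posterior drops to 0.40, which matches the direct count of 2 buyers among 5 youths.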

6. Clustering Analysis
Clustering is a technique for partitioning a set of data into clusters, or groups of objects. It
is a type of unsupervised learning, as the label information is not known. Clustering methods
group data objects so that objects within a cluster are similar to each other and different from
objects in other clusters, and the characteristics of each cluster can then be analyzed.
Cluster analysis can be used as a pre-step for other algorithms such as
characterization, attribute subset selection, etc. It can also be used for outlier
detection, such as spotting unusually high purchases in credit card transactions.
Applications: Image recognition, web search, and security.
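
One way to make this concrete is k-means, a common partitioning method that the text does not
name; it is used here only as an example. The sketch below clusters hypothetical one-dimensional
values, starting from fixed initial centroids so that the result is reproducible.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled values forming two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, centroids=[1.0, 2.0])
print(centroids)
print(clusters)
```

No labels are supplied at any point; the two groups emerge only from the distances between the
values.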

7. Outlier Detection
The process of finding data objects whose behavior differs markedly from that of the other
objects is called outlier detection. Outlier detection and cluster analysis are related to each
other. Outlier detection methods are categorized into statistical, proximity-based,
clustering-based, and classification-based methods.
There are different types of outliers, some of them are:
 Global Outlier: A data object that deviates significantly from the rest of the data set.
 Contextual Outlier: A data object that deviates significantly with reference to a specific
context, such as day, time, or location.
 Collective Outlier: A group of data objects that, taken together, behaves differently from
the entire data set.
Application: Detection of credit card fraud risks, novelty detection, etc.
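
A statistical method for finding global outliers can be sketched with z-scores: a value is
flagged when it lies more than a chosen number of standard deviations from the mean. The
purchase amounts and the threshold of 2.0 below are assumptions for illustration.

```python
import statistics

# Hypothetical credit card purchase amounts.
purchases = [20, 22, 19, 21, 23, 20, 500]

mean = statistics.mean(purchases)
std = statistics.pstdev(purchases)  # population standard deviation

# Assumed threshold; in practice it depends on the data and the domain.
THRESHOLD = 2.0

# A z-score measures how many standard deviations a value is from the mean.
outliers = [x for x in purchases if abs(x - mean) / std > THRESHOLD]
print(outliers)
```

Because the extreme value itself inflates the mean and the standard deviation, very small samples
can never reach a high z-score, which is why a threshold of 2.0 rather than 3.0 is used in this
tiny example.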
8. Sequential Patterns
This type of data mining recognizes trends or consistent patterns in the order in which events
occur. Stores use sequential patterns to understand customer purchase behavior and to decide
how to display their products on shelves.
Application: In e-commerce, when you buy Item A, the site shows that Item B is
often bought with Item A, based on past purchasing history.
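
A minimal sketch of the idea is to count ordered pairs of items, where the first item is bought
before the second in the same customer's history, and keep the pairs that occur often enough.
The purchase histories and the `MIN_COUNT` threshold are hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories, one ordered list per customer.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
    ["phone", "case"],
]
MIN_COUNT = 2  # assumed minimum number of customers

pair_counts = Counter()
for seq in sequences:
    # Collect each ordered pair once per customer: (earlier item, later item).
    seen_pairs = set()
    for i, first in enumerate(seq):
        for later in seq[i + 1:]:
            seen_pairs.add((first, later))
    pair_counts.update(seen_pairs)

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_COUNT}
print(frequent_pairs)
```

Order matters here: ("phone", "case") is counted separately from ("case", "phone"), which is the
difference from the unordered itemsets used in association analysis.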

9. Regression Analysis
This type of analysis is supervised and identifies which variables are related to, or
independent of, each other. It can predict sales, profit, temperature, human behavior, etc. It
is trained on a data set whose values are already known.
When an input is provided, the regression algorithm compares its predicted value with the
expected value, and the error is calculated and used to arrive at a more accurate result.
Application: Comparing marketing and product development efforts.
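
As an illustration, the sketch below fits a straight line by ordinary least squares, one common
form of regression, and then computes the errors between known and predicted values. The data
points (for example, advertising spend against sales) are hypothetical.

```python
# Hypothetical known data: input x and expected value y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares estimates for the line y = intercept + slope * x.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x


def predict(x):
    """Predicted value for input x using the fitted line."""
    return intercept + slope * x


# Error between each expected value and the model's prediction.
errors = [y - predict(x) for x, y in zip(xs, ys)]
mse = sum(e * e for e in errors) / n

print(f"y = {intercept:.2f} + {slope:.2f} * x, mean squared error = {mse:.4f}")
print(f"forecast for x = 6: {predict(6.0):.2f}")
```

The fitted line has a slope of about 1.97 and an intercept of about 0.09, so the forecast for an
input of 6 is about 11.91.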
Assignment

Describe examples of algorithms used in data mining
