
What is Data Mining?

 Data mining is the process of sorting through large data sets to identify
patterns and relationships that can help solve business problems through
data analysis.
 Data mining techniques and tools enable enterprises to predict future
trends and make more-informed business decisions.
 Data mining is a key part of data analytics overall and one of the core
disciplines in data science, which uses advanced analytics techniques to
find useful information in data sets.
 At a more granular level, data mining is a step in the knowledge discovery
in databases (KDD) process, a data science methodology for gathering,
processing and analyzing data.
 The terms data mining and KDD are sometimes used interchangeably, but
they are more commonly treated as distinct.

Data Mining Techniques

Various techniques can be used to mine data for different data science
applications. Pattern recognition is a common data mining use case that's
enabled by multiple techniques, as is anomaly detection, which aims to
identify outlier values in data sets. Popular data mining techniques include the
following types:

 Association rule mining. In data mining, association rules are if-then
statements that identify relationships between data elements. Support and
confidence criteria are used to assess the relationships -- support
measures how frequently the related elements appear in a data set, while
confidence reflects how often an if-then statement proves true. (A worked
sketch of these two measures follows this list.)
 Classification. This approach assigns the elements in data sets to
different categories defined as part of the data mining process. Decision
trees, Naive Bayes classifiers, k-nearest neighbor and logistic
regression are some examples of classification methods.

 Clustering. In this case, data elements that share particular characteristics
are grouped together into clusters as part of data mining applications.
Examples include k-means clustering, hierarchical clustering and Gaussian
mixture models.

 Regression. This is another way to find relationships in data sets, by
calculating predicted data values based on a set of variables. Linear
regression and multivariate regression are examples. Decision trees and
some other classification methods can also be used for regression.

 Sequence and path analysis. Data can also be mined to look for patterns
in which a particular set of events or values leads to later ones.

 Neural networks. A neural network is a set of algorithms that simulates
the activity of the human brain. Neural networks are particularly useful in
complex pattern recognition applications involving deep learning, a more
advanced offshoot of machine learning.
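
To make the support and confidence criteria concrete, here is a minimal Python sketch, with an invented set of transactions, that computes both measures for the rule {bread} → {butter}:

```python
# Minimal sketch: support and confidence for the rule {bread} -> {butter}
# over a toy list of transactions (invented data).

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

antecedent = {"bread"}
consequent = {"butter"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n        # fraction of transactions containing both item sets
confidence = both / ante  # fraction of antecedent transactions that also
                          # contain the consequent

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.50, confidence = 0.67
```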

Data Mining Issues

Data mining is not an easy task: the algorithms used can become very complex,
and the data is not always available in one place, so it often needs to be
integrated from various heterogeneous data sources. These factors give rise to
several issues.

Mining Methodology and User Interaction Issues
 This category covers the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge. Therefore, data mining
should cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive, allowing users to focus the
search for patterns and to provide and refine data mining requests based
on the returned results.
 Incorporation of background knowledge − Background knowledge can be
used to guide the discovery process and to express the discovered
patterns, not only in concise terms but at multiple levels of abstraction.

 Data mining query languages and ad hoc data mining − A data mining
query language that allows the user to describe ad hoc mining tasks
should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns
are discovered, they need to be expressed in high-level languages and
visual representations. These representations should be easily
understandable.
 Handling noisy or incomplete data − Data cleaning methods are required
to handle noise and incomplete objects while mining data regularities.
Without data cleaning, the accuracy of the discovered patterns will be
poor.
 Pattern evaluation − Discovered patterns may be uninteresting because
they represent common knowledge or lack novelty, so interestingness
measures are needed to evaluate them.

Performance Issues
 There can be performance-related issues such as the following −
 Efficiency and scalability of data mining algorithms − To effectively
extract information from the huge amounts of data in databases, data
mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such
as the huge size of databases, the wide distribution of data, and the
complexity of data mining methods motivate the development of parallel
and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel, and the results from the
partitions are then merged. Incremental algorithms incorporate database
updates without mining the data again from scratch.

Diverse Data Types Issues
 Handling of relational and complex types of data − A database may
contain complex data objects, multimedia data objects, spatial data,
temporal data, etc. It is not possible for one system to mine all these
kinds of data.
 Mining information from heterogeneous databases and global
information systems − Data is available from different sources on a LAN
or WAN, and these sources may be structured, semi-structured or
unstructured. Mining knowledge from them therefore adds challenges to
data mining.

Data Mining Applications

Here is the list of areas where data mining is widely used:


 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection

Financial Data Analysis


Financial data in the banking and financial industry is generally reliable and of
high quality, which facilitates systematic data analysis and data mining. Some
typical cases are as follows −

 Design and construction of data warehouses for multidimensional data
analysis and data mining.
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
 Detection of money laundering and other financial crimes.

Retail Industry

Data mining has a great application in the retail industry because it collects
large amounts of data on sales, customer purchasing history, goods
transportation, consumption and services. It is natural that the quantity of
data collected will continue to expand rapidly because of the increasing
ease, availability and popularity of the web.
Data mining in the retail industry helps identify customer buying patterns
and trends, leading to improved quality of customer service and better
customer retention and satisfaction. Here is a list of examples of data
mining in the retail industry −
 Design and Construction of data warehouses based on the benefits of
data mining.
 Multidimensional analysis of sales, customers, products, time and region.
 Analysis of effectiveness of sales campaigns.
 Customer Retention.
 Product recommendation and cross-referencing of items.

Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries,
providing various services such as fax, pager, cellular phone, internet
messenger, images, e-mail, web data transmission, etc. Due to the
development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has
become very important in helping to understand the business.
Data mining in the telecommunication industry helps identify telecommunication
patterns, catch fraudulent activities, make better use of resources and improve
quality of service. Here is a list of examples where data mining improves
telecommunication services −
 Multidimensional Analysis of Telecommunication data.
 Fraudulent pattern analysis.
 Identification of unusual patterns.
 Multidimensional association and sequential patterns analysis.
 Mobile Telecommunication services.
 Use of visualization tools in telecommunication data analysis.

Biological Data Analysis


In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics and biomedical research. Biological data
mining is a very important part of bioinformatics. The following are aspects in which
data mining contributes to biological data analysis −
 Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
 Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
 Discovery of structural patterns and analysis of genetic networks and protein
pathways.
 Association and path analysis.
 Visualization tools in genetic data analysis.

Other Scientific Applications


The applications discussed above tend to handle relatively small and homogeneous
data sets for which statistical techniques are appropriate. Huge amounts of data have
also been collected from scientific domains such as the geosciences and astronomy,
and large data sets are being generated by fast numerical simulations in fields such
as climate and ecosystem modeling, chemical engineering and fluid dynamics. The
following are applications of data mining in the field of scientific applications −

 Data warehouses and data preprocessing.
 Graph-based mining.
 Visualization and domain-specific knowledge.

Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity, confidentiality
or availability of network resources. In this world of connectivity, security has
become a major issue. The increased usage of the internet and the availability
of tools and tricks for intruding on and attacking networks have made intrusion
detection a critical component of network administration. Here is a list of
areas in which data mining technology may be applied for intrusion detection −
 Development of data mining algorithms for intrusion detection.
 Association and correlation analysis, aggregation to help select and build
discriminating attributes.
 Analysis of Stream data.
 Distributed data mining.
 Visualization and query tools.

Data Objects and Attribute Types

In data mining, data objects are the entities that are being analyzed or
studied. These objects can be anything: customers, products, transactions,
or events. Each data object can be described by a set of attributes or
characteristics that define it.
Attribute types in data mining can be categorized into the following main types:

 Nominal Attribute: 
Nominal attributes provide only enough information to distinguish one
object from another, such as a student roll number or the sex of a
person. 

 Ordinal Attribute: 
An ordinal attribute's values provide sufficient information to order the
objects, such as rankings or grades.
 Binary Attribute: 
A binary attribute has only two states, 0 and 1, where 0 indicates the
absence of a feature and 1 indicates its presence.
 Numeric Attribute: 
A numeric attribute is quantitative; the quantity can be measured and
represented in integer or real values. Numeric attributes are of two types:
Interval-Scaled Attribute: 
Interval-scaled attributes are measured on a scale of equal-size units, so
their values can be ordered and compared, such as temperature in °C or °F.
Ratio-Scaled Attribute: 
For ratio-scaled attributes, both differences and ratios are meaningful,
e.g., age, length and weight.
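
As a small illustrative sketch (the record and attribute names are invented), each attribute of a toy record can be tagged with one of the types above:

```python
# Illustrative sketch: tagging the attributes of an invented record
# with the attribute types described above.

record = {"roll_no": "S-104", "grade": "B", "is_member": 1,
          "temperature_c": 21.5, "age": 34}

attribute_types = {
    "roll_no": "nominal",         # only distinguishes one object from another
    "grade": "ordinal",           # values can be meaningfully ordered
    "is_member": "binary",        # 0/1: absence or presence of a feature
    "temperature_c": "interval",  # equal-size units, ordered values
    "age": "ratio",               # differences and ratios both meaningful
}

for name, value in record.items():
    print(f"{name} = {value!r} ({attribute_types[name]})")
```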

Knowledge Discovery Process

The knowledge discovery process in data mining, also known as the KDD
process, is a systematic approach to extracting useful and meaningful
information from large data sets. The KDD process consists of several
stages:
 Selection: In this stage, the data to be analyzed is identified and selected.
This data can come from various sources such as databases, web logs, or
social media platforms.
 Preprocessing: Once the data has been selected, it is preprocessed to
clean and transform it into a suitable format for analysis. This includes
tasks such as removing missing values, handling outliers, and
transforming data into a numerical format.

 Transformation: In this stage, the preprocessed data is transformed into
a format that can be used for analysis. This includes tasks such as feature
extraction, dimensionality reduction, and normalization.
 Data Mining: In this stage, the transformed data is analyzed using various
data mining techniques such as classification, clustering, association rule
mining, and outlier detection.
 Evaluation: Once the data mining process is complete, the results are
evaluated to determine their usefulness and validity. This includes tasks
such as visualizing the results, evaluating the accuracy of the models,
and assessing the quality of the patterns and insights discovered.
 Interpretation: Finally, the insights and knowledge gained from the data
mining process are interpreted and communicated to the relevant
stakeholders. This includes tasks such as explaining the findings,
identifying areas for improvement, and making decisions based on the
insights gained.
Overall, the KDD process is an iterative and cyclical process, where the
results and insights gained from one stage are used to refine and improve
subsequent stages.
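
As a hedged illustration, the stages above can be compressed into a minimal Python sketch; the customer table, the column names, and the choice of k-means clustering are invented for this example, not a prescribed pipeline:

```python
# Minimal KDD-style sketch: selection, preprocessing, transformation,
# mining (clustering), evaluation, and interpretation on invented data.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Selection: in practice this comes from databases or logs; a toy table here.
df = pd.DataFrame({
    "age":    [23, 25, 31, 35, 52, 58, 60, None],
    "income": [28, 30, 45, 50, 80, 85, 90, 40],
    "spend":  [5, 6, 12, 14, 30, 32, 35, 10],
})

# Preprocessing: drop rows with missing values (one simple strategy).
df = df.dropna().reset_index(drop=True)

# Transformation: standardize features onto a comparable scale.
X = StandardScaler().fit_transform(df)

# Data mining: cluster the records (clustering is one of many techniques).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Evaluation: check how well separated the discovered clusters are.
print("silhouette:", round(silhouette_score(X, model.labels_), 3))

# Interpretation: attach cluster labels back for stakeholders to examine.
df["segment"] = model.labels_
print(df.groupby("segment").mean())
```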

Data Mining Functionalities

Data mining functionalities are used to represent the types of patterns that
can be discovered in data mining tasks. In general, data mining tasks can be
classified into two types: descriptive and predictive. Descriptive mining
tasks characterize the general properties of the data in the database, while
predictive mining tasks perform inference on the current data to make
predictions.
There are various data mining functionalities which are as follows −
 Data characterization − It is a summarization of the general
characteristics of an object class of data. The data corresponding to the
user-specified class is generally collected by a database query. The
output of data characterization can be presented in multiple forms.
 Data discrimination − It is a comparison of the general characteristics of
target class data objects with the general characteristics of objects from
one or a set of contrasting classes. The target and contrasting classes can
be specified by the user, and the corresponding data objects are fetched
through database queries.
 Association Analysis − It analyzes the set of items that generally occur
together in a transactional dataset. Two parameters are used for
determining association rules −
 Support, which identifies how frequently an itemset appears in the
database.
 Confidence, which is the conditional probability that an item occurs in a
transaction given that another item occurs.
 Classification − Classification is the procedure of discovering a model
that represents and distinguishes data classes or concepts, with the
objective of using the model to predict the class of objects whose class
label is unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).
 Prediction − It predicts unavailable data values or pending trends. An
object can be predicted based on its attribute values and the attribute
values of the classes. It can be a prediction of missing numerical values
or of increase/decrease trends in time-related information.
 Clustering − It is similar to classification, but the classes are not
predefined; the classes are derived from the data attributes. It is
unsupervised learning. Objects are clustered or grouped based on the
principle of maximizing the intraclass similarity and minimizing the
interclass similarity.
 Outlier analysis − Outliers are data elements that cannot be grouped into
a given class or cluster. These are data objects whose behaviour differs
from the general behaviour of the other data objects. The analysis of this
type of data can be essential for mining knowledge.
 Evolution analysis − It describes the trends for objects whose behaviour
changes over time.

Data Mining Systems Classification

A data mining system can be classified according to the following criteria −


 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Other Disciplines

Apart from these, a data mining system can also be classified based on the
kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized,
and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases
mined. Database systems can be classified according to different criteria such
as data models or types of data, and the data mining system can be classified
accordingly.
For example, if we classify a database according to the data model, then we
may have a relational, transactional, object-relational, or data warehouse
mining system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge
mined. It means the data mining system is classified on the basis of
functionalities such as −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Outlier Analysis
 Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques
used. We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted.
These applications are as follows −

 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail
 AD

Statistical Description of Data

In data mining, statistical description of data is the process of summarizing
and describing the characteristics and patterns of the data using statistical
measures. Statistical description is an important step in data mining, as it
provides a basis for understanding the data and developing insights into its
behavior.
The following are some commonly used statistical measures for describing
data:
 Central tendency: Central tendency measures describe the typical or
central value of the data. The most common measures of central
tendency are the mean, median, and mode.
 Dispersion: Dispersion measures describe how spread out or variable the
data is. The most common measures of dispersion are the range,
variance, and standard deviation.
 Skewness: Skewness measures describe the degree of asymmetry in the
distribution of the data. A positive skewness value indicates that the
data is skewed to the right, while a negative skewness value indicates
that the data is skewed to the left.
 Kurtosis: Kurtosis measures describe the degree of peakedness or
flatness in the distribution of the data. A high kurtosis value indicates a
sharp peak, while a low kurtosis value indicates a flat or dispersed
distribution.
 Correlation: Correlation measures describe the strength and direction of
the linear relationship between two variables. The most common
measure of correlation is the Pearson correlation coefficient.
 Covariance: Covariance measures describe the direction of the linear
relationship between two variables; unlike correlation, covariance
depends on the variables' scale of measurement.
Statistical description of data is an important tool for understanding the
underlying patterns and trends in the data, and can be used to guide the
selection of appropriate data mining techniques and algorithms.
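
As a minimal sketch, these measures can be computed with pandas and SciPy on an invented sample:

```python
# Minimal sketch: the descriptive measures above, computed on toy data.

import pandas as pd
from scipy import stats

x = pd.Series([2, 4, 4, 5, 7, 9, 30])   # toy data with a long right tail
y = pd.Series([1, 3, 5, 6, 8, 9, 28])

print("mean:", x.mean(), "median:", x.median(), "mode:", x.mode().tolist())
print("range:", x.max() - x.min(), "variance:", x.var(), "std:", x.std())
print("skewness:", stats.skew(x))        # > 0 here: skewed to the right
print("kurtosis:", stats.kurtosis(x))    # excess kurtosis of the sample
print("Pearson r:", x.corr(y))           # correlation between x and y
print("covariance:", x.cov(y))
```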

Data Preprocessing

1. Data Cleaning: 
Real-world data can have many irrelevant and missing parts. Data cleaning is
done to handle this; it involves handling missing data, noisy data, etc. 
 
(a). Missing Data: 
This situation arises when some values are missing from the data. It can be
handled in various ways. 
Some of them are: 
 Ignore the tuples: 
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple. 
 
 Fill the Missing values: 
There are various ways to do this task. You can choose to fill the missing
values manually, with the attribute mean, or with the most probable value. 
 
(b). Noisy Data: 
Noisy data is meaningless data that can't be interpreted by machines. It
can be generated due to faulty data collection, data entry errors, etc. It
can be handled in the following ways: 
 Binning Method: 
This method works on sorted data in order to smooth it. The whole data
is divided into segments of equal size, and then various methods are
performed to complete the task. Each segment is handled separately: all
data in a segment can be replaced by its mean, or boundary values can
be used to complete the task. (A sketch of binning and normalization
appears at the end of this section.) 
 
 Regression: 
Here data can be smoothed by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables). 
 
 Clustering: 
This approach groups similar data into clusters. Outliers may then be
detected as values that fall outside the clusters. 
2. Data Transformation: 
This step is taken in order to transform the data into forms appropriate
for the mining process. It involves the following ways: 
 Normalization: 
It is done in order to scale the data values into a specified range (e.g.,
-1.0 to 1.0 or 0.0 to 1.0). 
 
 Attribute Selection: 
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process. 
 
 Discretization: 
This is done to replace the raw values of a numeric attribute with
interval levels or conceptual levels. 
 
 Concept Hierarchy Generation: 
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute "city" can be converted to
"country". 
 
3. Data Reduction: 
Data mining handles huge amounts of data, and analysis becomes harder
as the volume grows. Data reduction techniques address this: they aim
to increase storage efficiency and reduce data storage and analysis
costs. 
The various steps of data reduction are: 
 Data Cube Aggregation: 
Aggregation operation is applied to data for the construction of the data
cube. 
 
 Attribute Subset Selection: 
Only the highly relevant attributes should be used; the rest can be
discarded. For performing attribute selection, one can use the level of
significance and the p-value of the attribute: an attribute with a p-value
greater than the significance level can be discarded. 
 
 Numerosity Reduction: 
This enables storing a model of the data instead of the whole data, for
example regression models. 
 
 Dimensionality Reduction: 
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless: if the original data can be retrieved after reconstruction from
the compressed data, the reduction is called lossless; otherwise it is
called lossy. Two effective methods of dimensionality reduction are
wavelet transforms and PCA (Principal Component Analysis). 
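
As a minimal sketch of two of the steps above (with invented values), the following snippet performs min-max normalization and one common variant of binning, equal-width bins smoothed by their means:

```python
# Minimal sketch: min-max normalization into [0, 1] and equal-width
# binning with mean smoothing, on an invented list of values.

values = [12, 15, 18, 22, 25, 31, 40, 44, 52, 60]

# Min-max normalization: scale each value into the range [0.0, 1.0].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Equal-width binning: split the range into 3 bins of equal width,
# then smooth each bin by replacing its members with the bin mean.
n_bins = 3
width = (hi - lo) / n_bins
bins = [[] for _ in range(n_bins)]
for v in sorted(values):
    idx = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
    bins[idx].append(v)

smoothed = [sum(b) / len(b) for b in bins if b]    # one mean per bin
print(normalized)
print(smoothed)
```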

Data Visualization

Data visualization is a key aspect of data mining, as it allows us to explore,
analyze, and communicate patterns and insights in the data. In data mining,
data visualization is used to represent the data in a visual form, such as
charts, graphs, and maps, to aid in understanding and interpretation.
The following are some commonly used data visualization techniques in data
mining:
 Scatter plots: Scatter plots are used to visualize the relationship between
two variables. Each point on the plot represents a data point, and the
position of the point indicates the values of the two variables.
 Bar charts: Bar charts are used to compare the frequency or distribution
of categorical variables. Each bar represents a category, and the height
of the bar represents the frequency or proportion of that category.
 Histograms: Histograms are used to visualize the distribution of a
numerical variable. The x-axis represents the values of the variable, and
the y-axis represents the frequency or proportion of each value.
 Heat maps: Heat maps are used to visualize patterns and relationships in
large datasets. Each cell of the heat map represents a combination of
two variables, and the color of the cell indicates the frequency or value
of that combination.
 Network diagrams: Network diagrams are used to visualize relationships
between entities. Each node in the network represents an entity, and the
links between nodes represent the relationships between them.
 Geographic maps: Geographic maps are used to visualize spatial patterns
in the data. Each point on the map represents a location, and the color
or size of the point represents the value of the variable at that location.
Data visualization is a powerful tool for exploring and understanding
complex data, and can help us to identify patterns, trends, and relationships
that may not be apparent from the raw data alone.
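
As a minimal sketch with invented data, the following matplotlib snippet draws two of the visualizations described above, a scatter plot and a histogram:

```python
# Minimal sketch: a scatter plot and a histogram on invented data.

import random
import matplotlib.pyplot as plt

random.seed(0)
x = [random.gauss(50, 10) for _ in range(200)]
y = [xi * 0.8 + random.gauss(0, 5) for xi in x]   # roughly linear relation

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(x, y, s=10)            # each point is one data object
ax1.set(xlabel="variable x", ylabel="variable y", title="Scatter plot")

ax2.hist(x, bins=20)               # distribution of one numeric variable
ax2.set(xlabel="value", ylabel="frequency", title="Histogram")

plt.tight_layout()
plt.show()
```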

Data Similarity and Dissimilarity

Data similarity and dissimilarity measures are used in data mining to
quantify the degree of similarity or dissimilarity between two or more data
points. These measures are used in a variety of applications, including
clustering, classification, and outlier detection.
The following are some commonly used similarity and dissimilarity measures
in data mining:
 Euclidean distance: Euclidean distance is a measure of the straight-line
distance between two data points in a multidimensional space. It is
defined as the square root of the sum of the squared differences
between the corresponding attributes of the two points.
 Cosine similarity: Cosine similarity is a measure of the similarity between
two data points based on the angle between their feature vectors. It is
calculated as the dot product of the two vectors divided by the product
of their magnitudes.

 Jaccard similarity: Jaccard similarity is a measure of the similarity
between two sets of data points. It is calculated as the ratio of the size of
the intersection of the two sets to the size of their union.
 Pearson correlation coefficient: The Pearson correlation coefficient is a
measure of the linear correlation between two variables. It is calculated
as the covariance of the two variables divided by the product of their
standard deviations.
 Hamming distance: Hamming distance is a measure of the number of
attributes that differ between two data points. It is commonly used for
binary data, where the attributes can take on only two values.
 Manhattan distance: Manhattan distance is a measure of the distance
between two data points in a grid-like space. It is defined as the sum of
the absolute differences between the corresponding attributes of the
two points.
These similarity and dissimilarity measures are used in a variety of data
mining algorithms, such as k-means clustering, hierarchical clustering, and
nearest-neighbor classification. The choice of the appropriate measure
depends on the type of data and the specific application.
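
A minimal Python sketch, using invented vectors and sets, shows how several of these measures are computed:

```python
# Minimal sketch: common similarity/dissimilarity measures on toy data.

import math

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 2.0, 1.0, 6.0]

# Euclidean distance: straight-line distance in attribute space.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Manhattan distance: sum of absolute coordinate differences.
manhattan = sum(abs(x - y) for x, y in zip(a, b))

# Cosine similarity: dot product divided by the product of magnitudes.
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.hypot(*a) * math.hypot(*b))

# Jaccard similarity between two sets: |intersection| / |union|.
s, t = {"milk", "bread", "eggs"}, {"milk", "eggs", "butter"}
jaccard = len(s & t) / len(s | t)

# Hamming distance between binary tuples: count of differing positions.
u, v = (1, 0, 1, 1), (1, 1, 0, 1)
hamming = sum(x != y for x, y in zip(u, v))

print(euclidean, manhattan, cosine, jaccard, hamming)
```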

Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery to
direct the mining process or examine the findings from different angles or
depths. The data mining primitives specify the following:
 Task-relevant data: This is the portion of the database to be investigated.
For example, suppose that you are a manager of AllElectronics in charge
of sales in the United States and Canada and that, in particular, you
would like to study the buying trends of customers in Canada rather than
mining the entire database. The attributes involved are referred to as
relevant attributes.

 The kinds of knowledge to be mined: This specifies the data mining
functions to be performed, such as characterization, discrimination,
association, classification, clustering, or evolution analysis. For instance,
if studying the buying habits of customers in Canada, you may choose to
mine associations between customer profiles and the items that these
customers like to buy.
  
 Background knowledge: Users can specify background knowledge, or
knowledge about the domain to be mined. This knowledge is useful for
guiding the knowledge discovery process, and for evaluating the patterns
found. There are several kinds of background knowledge.

 Interestingness measures: These functions are used to separate
uninteresting patterns from knowledge. They may be used to guide the
mining process or, after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness
measures.
 
 Presentation and visualization of discovered patterns: This refers to the
form in which discovered patterns are to be displayed. Users can choose
from different forms for knowledge presentation, such as rules, tables,
charts, graphs, decision trees, and cubes.

Mining Frequent Patterns

Mining frequent patterns is a common task in data mining that involves
identifying sets of items that frequently occur together in a dataset. These
patterns can be used to gain insights into the data, such as identifying
associations between items or identifying common behavior among groups
of users.
The following are the steps involved in mining frequent patterns:
 Data preprocessing: The first step in mining frequent patterns is to
preprocess the data. This involves cleaning and transforming the data
into a suitable format for analysis.
 Itemset generation: In this step, all possible itemsets are generated from
the data. An itemset is a set of items that occur together in a transaction.
For example, if the data consists of transactions in a grocery store, an
itemset could be a set of items that a customer bought together, such as
milk, bread, and eggs.
 Support calculation: The support of an itemset is the proportion of
transactions in which the itemset appears. In this step, the support of
each itemset is calculated.
 Pruning: In this step, itemsets with a support less than a predefined
threshold are removed. This helps to reduce the number of itemsets that
need to be considered in the next step.
 Rule generation: In this step, association rules are generated from the
frequent itemsets. An association rule is a statement of the form X → Y,
where X and Y are itemsets. The rule indicates that if a transaction
contains X, it is likely to also contain Y.
 Evaluation: In this step, the generated rules are evaluated using various
measures such as confidence, lift, and support. This helps to identify the
most interesting and useful rules.
Frequent pattern mining algorithms, such as Apriori and FP-growth, are
commonly used to mine frequent patterns. These algorithms are designed to
efficiently generate frequent itemsets and association rules from large
datasets. The choice of algorithm depends on the size of the dataset and the
specific requirements of the analysis.
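
As a hedged sketch of this pipeline, the snippet below uses the third-party mlxtend library (assumed to be installed, e.g. via pip; the transactions are toy data) to run the itemset generation, support-based pruning, and rule generation steps:

```python
# Sketch of the frequent-pattern pipeline with mlxtend (assumed installed).

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs"],
]

# Preprocessing: one-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Itemset generation, support calculation, and pruning (min support 0.5).
frequent = apriori(df, min_support=0.5, use_colnames=True)

# Rule generation and evaluation by confidence (lift is also reported).
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```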

Associations

Association rule learning is a type of unsupervised learning technique that
checks for the dependency of one data item on another data item and maps
them accordingly so that the result can be more profitable. It tries to find
interesting relations or associations among the variables of a dataset, based
on different rules for discovering interesting relations between variables in
the database.
Association rule learning is one of the very important concepts of machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc. Market basket analysis is a technique used by
various big retailers to discover associations between items. We can
understand it by taking the example of a supermarket, where all products
that are purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter,
eggs, or milk, so these products are stored on the same shelf or mostly
nearby.

Market Basket Analysis

Market basket analysis is a data mining technique used by retailers to
increase sales by better understanding customer purchasing patterns. It
involves analyzing large data sets, such as purchase history, to reveal
product groupings and products that are likely to be purchased together.

One example is the Shopping Basket Analysis tool in Microsoft Excel, which
analyzes transaction data contained in a spreadsheet and performs market
basket analysis. A transaction ID must relate to the items to be analyzed.
The Shopping Basket Analysis tool then creates two worksheets:
 The Shopping Basket Item Groups worksheet, which lists items that are
frequently purchased together.
 The Shopping Basket Rules worksheet, which shows how items are related
(for example, purchasers of Product A are likely to buy Product B).

How does Market Basket Analysis Work?


Market Basket Analysis is modelled on Association rule mining, i.e., the IF {},
THEN {} construct. For example, IF a customer buys bread, THEN he is likely
to buy butter as well.
Association rules are usually represented as: {Bread} -> {Butter}
Some terminologies to familiarize yourself with Market Basket Analysis are:
 Antecedent: Items or 'itemsets' found within the data are antecedents.
In simpler words, it's the IF component, written on the left-hand side. In
the above example, bread is the antecedent.
 Consequent: A consequent is an item or set of items found in
combination with the antecedent. It's the THEN component, written on
the right-hand side. In the above example, butter is the consequent.

Apriori Algorithm

 The Apriori algorithm is a classic algorithm used in data mining for
discovering frequent itemsets and association rules from transactional
databases.
 It was proposed by R. Agrawal and R. Srikant in 1994 and has since
become one of the most popular and widely used algorithms for
association rule mining.
 The algorithm works by generating a set of candidate itemsets from the
items in the transactional database. The support of an itemset is the
fraction of transactions in the database that contain the itemset.

 The algorithm then prunes the set of candidate itemsets that do not
meet the minimum support threshold, and generates new candidate
itemsets by combining the remaining itemsets in a pairwise manner.
 The Apriori algorithm iterates this process until no new frequent
itemsets can be generated (a compact sketch follows this list). Once all
frequent itemsets have been identified, association rules are generated
from them by specifying a minimum confidence threshold.
 The confidence of a rule is the fraction of transactions that contain both
the antecedent and the consequent of the rule, out of the transactions
that contain the antecedent.
 The Apriori algorithm is known for its efficiency and scalability, and is
widely used in applications such as market basket analysis, web log
mining, and bioinformatics.
 However, the algorithm can suffer from performance degradation when
dealing with large databases or datasets with high dimensionality.
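
The candidate-generation-and-pruning loop can be made concrete with a compact, unoptimized Python sketch (toy transactions; a full implementation would also prune candidates whose subsets are infrequent):

```python
# Compact Apriori sketch: count support level by level, prune candidates
# below the threshold, and join survivors pairwise into larger candidates.

from itertools import combinations

def apriori_frequent(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]   # level-1 candidates
    frequent = {}
    k = 1
    while candidates:
        # Support calculation for this level's candidates.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        survivors = {c: cnt / n for c, cnt in counts.items()
                     if cnt / n >= min_support}
        frequent.update(survivors)
        # Candidate generation: pairwise unions yielding (k+1)-itemsets.
        keys = list(survivors)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

transactions = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "bread", "eggs"},
    {"bread", "eggs"}, {"milk", "eggs"})]
for itemset, sup in apriori_frequent(transactions, min_support=0.5).items():
    print(sorted(itemset), sup)
```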

Association Rules from Frequent Itemsets

Association rules are a common output of frequent itemset mining in data
mining. Association rules describe the relationships or patterns that exist
between different items in a dataset.
The process of generating association rules typically involves two steps:
1. Finding frequent itemsets: First, the frequent itemsets are identified
using an algorithm such as Apriori or FP-Growth. A frequent itemset is a
set of items that occurs frequently in the dataset, above a certain
minimum support threshold.
2. Generating association rules: Once the frequent itemsets have been
identified, association rules can be generated from them. An association
rule is a statement of the form "If A, then B", where A and B are sets of
items. The rule is said to hold if the occurrence of A is associated with
the occurrence of B in the dataset, above a certain minimum confidence
threshold.
For example, consider a dataset of transactions at a grocery store. One
frequent itemset might be {milk, bread, eggs}, indicating that these three
items are often purchased together. From this frequent itemset, we can
generate association rules such as "If a customer buys milk and bread, then
they are likely to buy eggs as well" or "If a customer buys milk and eggs,
then they are likely to buy bread as well". The rules can be used to make
predictions or recommendations about future purchases.
It's important to note that frequent itemsets and association rules are not
necessarily meaningful or useful on their own. Domain knowledge and
context are necessary to interpret the results and understand their
implications.
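
To illustrate the second step, here is a minimal Python sketch that enumerates candidate rules from one frequent itemset and keeps those above a confidence threshold; the support values are invented for illustration:

```python
# Minimal sketch: rules from one frequent itemset, filtered by confidence.
# The support values below are invented for illustration.

from itertools import combinations

support = {
    frozenset({"milk"}): 0.60,
    frozenset({"bread"}): 0.70,
    frozenset({"eggs"}): 0.55,
    frozenset({"milk", "bread"}): 0.50,
    frozenset({"milk", "eggs"}): 0.45,
    frozenset({"bread", "eggs"}): 0.45,
    frozenset({"milk", "bread", "eggs"}): 0.40,
}

itemset = frozenset({"milk", "bread", "eggs"})
min_confidence = 0.7

# For each non-empty proper subset A of the itemset, form A -> (itemset - A)
# and keep the rule if confidence = support(itemset) / support(A) is high.
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(itemset, r)):
        consequent = itemset - antecedent
        confidence = support[itemset] / support[antecedent]
        if confidence >= min_confidence:
            print(f"{set(antecedent)} -> {set(consequent)} "
                  f"(confidence = {confidence:.2f})")
```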

Text Mining and Web Mining

Here's a table summarizing the differences between Text Mining and Web
Mining in data mining:

 Focus − Text Mining: extracting useful information from text. Web Mining:
extracting useful information from the web.
 Input − Text Mining: unstructured data (text). Web Mining: semi-structured
and unstructured data.
 Data − Text Mining: documents, emails, social media posts. Web Mining:
web pages, search engine results, log data.
 Methods − Text Mining: text classification, clustering, NLP. Web Mining:
web usage mining, content mining, link analysis.
 Goals − Text Mining: sentiment analysis, topic modeling, named entity
recognition. Web Mining: web personalization, web content optimization,
search engine optimization.
 Applications − Text Mining: customer feedback analysis, fraud detection,
spam filtering. Web Mining: website improvement, e-commerce analysis,
market research.

In summary, Text Mining focuses on extracting useful information from
unstructured textual data, while Web Mining focuses on extracting useful
information from semi-structured and unstructured web data. Text Mining
methods typically involve natural language processing (NLP) techniques such
as text classification, clustering, sentiment analysis, and named entity
recognition.
Web Mining methods include web usage mining, content mining, and link
analysis. The goals of Text Mining often include sentiment analysis, topic
modeling, and entity recognition, while the goals of Web Mining include
web personalization, web content optimization, and search engine
optimization. Applications of Text Mining include customer feedback
analysis, fraud detection, and spam filtering, while applications of Web
Mining include website improvement, e-commerce analysis, and market
research.

