Data mining is the process of sorting through large data sets to identify
patterns and relationships that can help solve business problems through
data analysis.
Data mining techniques and tools enable enterprises to predict future
trends and make more-informed business decisions.
Data mining is a key part of data analytics overall and one of the core
disciplines in data science, which uses advanced analytics techniques to
find useful information in data sets.
At a more granular level, data mining is a step in the knowledge discovery
in databases (KDD) process, a data science methodology for gathering,
processing and analyzing data.
Data mining and KDD are sometimes used interchangeably, but they are
more commonly seen as distinct things.
Sequence and path analysis. Data can also be mined to look for patterns
in which a particular set of events or values leads to later ones.
Data mining is not an easy task, as the algorithms used can get very complex
and the data is not always available in one place; it needs to be integrated
from various heterogeneous data sources. These factors also create some issues.
Mining Methodology and User Interaction Issues
This category refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge. Therefore it is necessary
for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The
data mining process should be interactive, allowing users to focus the
search for patterns and to provide and refine data mining requests based
on the returned results.
Incorporation of background knowledge − Background knowledge can be used
to guide the discovery process and to express the discovered patterns,
not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − A data mining
query language that allows the user to describe ad hoc mining tasks
should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once patterns
are discovered, they need to be expressed in high-level languages and
visual representations. These representations should be easily
understandable.
Handling noisy or incomplete data − Data cleaning methods are required
to handle noise and incomplete objects while mining data regularities.
If data cleaning methods are not available, the accuracy of the
discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting if
they represent common knowledge or lack novelty, so measures are needed
to evaluate the interestingness of discovered patterns.
Performance Issues
Performance-related issues include the following −
Efficiency and scalability of data mining algorithms − To effectively
extract information from the huge amounts of data in databases, data
mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms divide
the data into partitions that are processed in parallel, and the results
from the partitions are then merged. Incremental algorithms update the
mined results when the database changes, without mining the entire data
again from scratch.
Diverse Data Types Issues
Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data,
temporal data, etc. It is not possible for one system to mine all these
kinds of data.
Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on
a LAN or WAN. These data sources may be structured, semi-structured, or
unstructured. Mining knowledge from them therefore adds challenges to
data mining.
Design and construction of data warehouses for multidimensional data
analysis and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because retailers
collect large amounts of data on sales, customer purchasing history, goods
transportation, consumption, and services. The quantity of data collected
will naturally continue to expand rapidly because of the increasing ease,
availability, and popularity of the web.
Data mining in the retail industry helps identify customer buying patterns
and trends, which leads to improved quality of customer service and better
customer retention and satisfaction. Here is a list of examples of data
mining in the retail industry −
Design and Construction of data warehouses based on the benefits of
data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging
industries, providing various services such as fax, pager, cellular phone,
internet messenger, images, e-mail, and web data transmission. Due to the
development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining
has become very important in helping to understand the business.
Data mining in the telecommunication industry helps identify
telecommunication patterns, catch fraudulent activities, make better use of
resources, and improve quality of service. Here is a list of examples where
data mining improves telecommunication services −
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
In data mining, data objects are the entities that are being analyzed or
studied. These objects can be anything: customers, products,
transactions, or events. Each data object is described by a set of
attributes or characteristics that define it.
Attribute types in data mining can be categorized into the following main types:
Nominal Attribute:
Nominal attribute values provide only enough information to differentiate
one object from another; there is no meaningful order among them.
Examples: student roll number, sex of a person.
Ordinal Attribute:
The ordinal attribute value provides sufficient information to order the
objects. Examples: rankings, grades, height.
Binary Attribute:
A binary attribute takes only the values 0 and 1, where 0 indicates the
absence of a feature and 1 indicates its presence.
Numeric Attribute:
A numeric attribute is quantitative: the quantity can be measured and
represented in integer or real values. Numeric attributes are of two types:
Interval Scaled attribute:
An interval-scaled attribute is measured on a scale of equal-size units.
Its values are ordered and can be compared, such as temperature in °C or
°F, but the scale has no true zero point.
Ratio Scaled attribute:
Both differences and ratios are meaningful for ratio-scaled attributes,
because the scale has a true zero point. Examples: age, length, and weight.
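A small, hypothetical record can illustrate all of the attribute types above; the field names and values here are invented for illustration.

```python
# A hypothetical record illustrating the attribute types described above.
record = {
    "roll_no": "S1042",      # nominal: only distinguishes one object from another
    "grade": "B",            # ordinal: ordered (A > B > C), differences undefined
    "is_member": 1,          # binary: 1 = characteristic present, 0 = absent
    "temperature_c": 21.5,   # interval-scaled: equal units, no true zero
    "age": 34,               # ratio-scaled: differences and ratios both meaningful
}

# Ordinal values can be compared once an order is imposed.
grade_order = {"C": 0, "B": 1, "A": 2}
print(grade_order["A"] > grade_order[record["grade"]])  # True

# Ratio-scaled values support meaningful ratios: 68 is twice 34.
print(record["age"] * 2)  # 68
```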
The knowledge discovery process in data mining, also known as the KDD
process, is a systematic approach to extracting useful and meaningful
information from large data sets. The KDD process consists of several
stages:
Selection: In this stage, the data to be analyzed is identified and selected.
This data can come from various sources such as databases, web logs, or
social media platforms.
Preprocessing: Once the data has been selected, it is preprocessed to
clean and transform it into a suitable format for analysis. This includes
tasks such as removing missing values, handling outliers, and
transforming data into a numerical format.
Transformation: In this stage, the preprocessed data is transformed into
a format that can be used for analysis. This includes tasks such as feature
extraction, dimensionality reduction, and normalization.
Data Mining: In this stage, the transformed data is analyzed using various
data mining techniques such as classification, clustering, association rule
mining, and outlier detection.
Evaluation: Once the data mining process is complete, the results are
evaluated to determine their usefulness and validity. This includes tasks
such as visualizing the results, evaluating the accuracy of the models,
and assessing the quality of the patterns and insights discovered.
Interpretation: Finally, the insights and knowledge gained from the data
mining process are interpreted and communicated to the relevant
stakeholders. This includes tasks such as explaining the findings,
identifying areas for improvement, and making decisions based on the
insights gained.
Overall, the KDD process is an iterative and cyclical process, where the
results and insights gained from one stage are used to refine and improve
subsequent stages.
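As a rough sketch, the stages above can be walked through on a toy dataset; the data, the missing-value handling, and the trivial "mining" step (splitting normalized values at 0.5) are all invented for illustration.

```python
# Toy walk-through of the KDD stages; data and thresholds are invented.
raw = [("a", "12"), ("b", None), ("c", "7"), ("d", "45"), ("e", "9")]

# Selection: pick out the field to be analyzed.
selected = [value for _, value in raw]

# Preprocessing: drop missing values and convert to numbers.
clean = [float(v) for v in selected if v is not None]

# Transformation: min-max normalize into [0, 1].
lo, hi = min(clean), max(clean)
normalized = [(v - lo) / (hi - lo) for v in clean]

# Data mining: a trivial "pattern" -- split values at the midpoint.
pattern = {"low": [v for v in normalized if v < 0.5],
           "high": [v for v in normalized if v >= 0.5]}

# Evaluation / interpretation: summarize what was found.
print(len(pattern["low"]), "low values,", len(pattern["high"]), "high value")
```

In a real project each stage would be far richer, but the flow of data from selection through interpretation follows the same shape.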
Data mining functionalities are used to represent the types of patterns to
be discovered in data mining tasks. In general, data mining tasks can be
classified into two types: descriptive and predictive. Descriptive mining
tasks characterize the general properties of the data in the database,
while predictive mining tasks perform inference on the current data in
order to make predictions.
There are various data mining functionalities which are as follows −
Data characterization − It is a summarization of the general
characteristics of an object class of data. The data corresponding to the
user-specified class is generally collected by a database query. The
output of data characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of
target-class data objects with the general characteristics of objects from
one or a set of contrasting classes. The target and contrasting classes are
specified by the user, and the corresponding data objects are fetched
through database queries.
Association Analysis − It analyzes the set of items that generally occur
together in a transactional dataset. Two parameters are used for
determining the association rules −
Support, which identifies how frequently the itemset appears in the database.
Confidence, which is the conditional probability that an item occurs in a
transaction when another item occurs.
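The two parameters can be computed directly on a handful of toy transactions; the transaction data below is made up for illustration.

```python
# Support and confidence on five invented transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "bread"}))       # 3 of 5 transactions -> 0.6
print(confidence({"milk"}, {"bread"}))  # 3 of the 4 milk transactions, about 0.75
```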
Classification − Classification is the procedure of discovering a model
that represents and distinguishes data classes or concepts, with the
objective of using the model to predict the class of objects whose class
label is unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).
Prediction − It is used to predict missing or unavailable data values, or
pending trends. A value can be predicted based on the attribute values of
the object and the attribute values of similar objects. It can be a
prediction of missing numerical values or of increase/decrease trends in
time-related information.
Clustering − It is similar to classification, but the classes are not
predefined; they are derived from the data. It is a form of unsupervised
learning. The objects are clustered or grouped on the principle of
maximizing the intraclass similarity and minimizing the interclass
similarity.
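As a sketch of the idea, a bare-bones one-dimensional k-means (Lloyd's algorithm) groups values around two centers; the data points and initial centers are invented, and real work would use a library such as scikit-learn.

```python
# Minimal 1-D k-means: assign points to nearest center, recompute centers.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Each center moves to the mean of its cluster (kept if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

groups = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.7], centers=[0.0, 10.0])
print(groups)  # one group near 1, one group near 9
```

Note that no class labels are supplied anywhere: the two groups emerge purely from the similarity of the values.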
Outlier analysis − Outliers are data elements that cannot be grouped into a
given class or cluster. These data objects show behaviour that deviates
from the general behaviour of other data objects. The analysis of this
type of data can be essential for mining knowledge.
Evolution analysis − It describes and models trends for objects whose
behaviour changes over time.
Apart from these, a data mining system can also be classified based on the
kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized,
and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases
mined. Database systems can be classified according to different criteria,
such as data models or types of data, and the data mining system can be
classified accordingly.
For example, if we classify a database according to the data model, then we
may have a relational, transactional, object-relational, or data warehouse
mining system.
Classification Based on the Kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge
mined. It means the data mining system is classified on the basis of
functionalities such as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques
used. We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted.
These applications are as follows −
Finance
Telecommunications
DNA
Stock Markets
E-mail
Kurtosis: Kurtosis measures the peakedness of a distribution. A high
kurtosis value indicates a sharp peak, while a low kurtosis value
indicates a flat or dispersed distribution.
Correlation: Correlation measures describe the strength and direction of
the linear relationship between two variables. The most common
measure of correlation is the Pearson correlation coefficient.
Covariance: Covariance measures describe the direction and strength of
the linear relationship between two variables, but unlike correlation,
their magnitude depends on the variables' scales of measurement.
Statistical description of data is an important tool for understanding the
underlying patterns and trends in the data, and can be used to guide the
selection of appropriate data mining techniques and algorithms.
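Both measures can be computed by hand on two short series; the numbers below are arbitrary.

```python
import statistics as st

# Two short, arbitrary series.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = st.mean(x), st.mean(y)

# Sample covariance: average co-deviation from the two means.
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Pearson correlation: covariance scaled by both standard deviations,
# which makes it unit-free and bounded in [-1, 1].
r = cov / (st.stdev(x) * st.stdev(y))
print(round(cov, 3), round(r, 3))  # 1.5 0.775
```

Rescaling either series changes the covariance but leaves the correlation untouched, which is why correlation is the more comparable of the two.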
Data Preprocessing
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is
done to handle these problems. It involves the handling of missing data,
noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing from the data. It can
be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing
values manually, by the attribute mean, or by the most probable value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated due to faulty data collection, data entry errors, etc. It
can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data
is divided into segments of equal size, and each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
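Smoothing by bin means can be shown on nine sorted values; the data and the bin size of 3 are arbitrary choices for illustration.

```python
# Smoothing by bin means: sort, split into equal-size bins, replace
# each value by the mean of its bin.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

The noisy values inside each bin are flattened to a common value, which is exactly the smoothing effect described above.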
Regression:
Here data can be smoothed by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Outliers either go
undetected or fall outside the clusters.
2. Data Transformation:
This step transforms the data into forms appropriate for the mining
process. It involves the following techniques:
Normalization:
It is done in order to scale the data values into a specified range, such
as -1.0 to 1.0 or 0.0 to 1.0.
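A quick min-max sketch on arbitrary values shows the rescaling into the 0.0 to 1.0 range:

```python
# Min-max normalization: rescale values into the range [0.0, 1.0].
values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```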
Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval
levels or conceptual levels.
Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in
hierarchy. For Example-The attribute “city” can be converted to
“country”.
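For instance, the low-level attribute "city" can be rolled up to "country" with a simple mapping; the cities and sales figures below are invented.

```python
# Concept hierarchy: climb "city" values up to the "country" level.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

sales = [("Toronto", 120), ("Chicago", 95), ("Vancouver", 80)]
rolled_up = {}
for city, amount in sales:
    country = city_to_country[city]
    rolled_up[country] = rolled_up.get(country, 0) + amount
print(rolled_up)  # {'Canada': 200, 'USA': 95}
```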
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and
while working with large volumes of data, analysis becomes harder. To
address this, data reduction techniques are used. They aim to increase
storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data
cube.
Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be
discarded. For performing attribute selection, one can use the
significance level and the p-value of each attribute: an attribute whose
p-value is greater than the significance level can be discarded.
Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example regression models.
Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy
or lossless: if the original data can be retrieved after reconstruction
from the compressed data, the reduction is called lossless; otherwise it
is called lossy. Two effective methods of dimensionality reduction are
wavelet transforms and PCA (Principal Component Analysis).
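As a sketch of the PCA half, a two-dimensional dataset can be reduced to one dimension using only the standard library: center the data, form the 2x2 covariance matrix, and project onto its principal eigenvector. The points below are an arbitrary example, and real code would use numpy or scikit-learn.

```python
import math

# Arbitrary 2-D points to be reduced to 1-D.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(points)

# Center the data on its mean.
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Largest eigenvalue of a symmetric 2x2 matrix (closed form).
tr, det = sxx + syy, sxx * syy - sxy * sxy
lam = tr / 2 + math.sqrt(tr * tr / 4 - det)

# Corresponding unit eigenvector; projecting onto it is the 1-D reduction.
vx, vy = sxy, lam - sxx
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm
projected = [x * vx + y * vy for x, y in centered]
print(round(projected[0], 3))
```

Each 2-D point is replaced by a single coordinate along the direction of greatest variance, which is lossy in general but preserves as much spread as one dimension can.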
Data Visualization
Heat maps: Heat maps are used to visualize the relationship between two
variables. Each cell represents a combination of the two variables, and
the color of the cell indicates the frequency or value of that
combination.
Network diagrams: Network diagrams are used to visualize relationships
between entities. Each node in the network represents an entity, and the
links between nodes represent the relationships between them.
Geographic maps: Geographic maps are used to visualize spatial patterns
in the data. Each point on the map represents a location, and the color
or size of the point represents the value of the variable at that location.
Data visualization is a powerful tool for exploring and understanding
complex data, and can help us to identify patterns, trends, and relationships
that may not be apparent from the raw data alone.
Jaccard similarity: Jaccard similarity is a measure of the similarity
between two sets of data points. It is calculated as the ratio of the size of
the intersection of the two sets to the size of their union.
Pearson correlation coefficient: Pearson correlation coefficient is a
measure of the linear correlation between two vectors of attribute
values. It is calculated as their covariance divided by the product of
their standard deviations.
Hamming distance: Hamming distance is a measure of the number of
attributes that differ between two data points. It is commonly used for
binary data, where the attributes can take on only two values.
Manhattan distance: Manhattan distance is a measure of the distance
between two data points in a grid-like space. It is defined as the sum of
the absolute differences between the corresponding attributes of the
two points.
These similarity and dissimilarity measures are used in a variety of data
mining algorithms, such as k-means clustering, hierarchical clustering, and
nearest-neighbor classification. The choice of the appropriate measure
depends on the type of data and the specific application.
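Three of the measures above are simple enough to compute in a few lines; the sets and vectors below are arbitrary examples.

```python
# Jaccard similarity on two sets: |intersection| / |union|.
a, b = {1, 2, 3, 4}, {3, 4, 5}
jaccard = len(a & b) / len(a | b)  # 2 / 5 = 0.4

# Hamming distance on two binary vectors: count of differing positions.
u, v = [1, 0, 1, 1, 0], [1, 1, 0, 1, 0]
hamming = sum(x != y for x, y in zip(u, v))  # differs in 2 positions

# Manhattan distance: sum of absolute coordinate differences.
p, q = (1, 2, 3), (4, 0, 3)
manhattan = sum(abs(x - y) for x, y in zip(p, q))  # 3 + 2 + 0 = 5

print(jaccard, hamming, manhattan)  # 0.4 2 5
```

Jaccard works on sets, Hamming on equal-length binary vectors, and Manhattan on numeric coordinates, which illustrates why the choice of measure follows the type of data.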
A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery to
direct the mining process or examine the findings from different angles or
depths. The data mining primitives specify the following,
Task-relevant data: This is the portion of the database to be
investigated. For example, suppose that you are a manager of All
Electronics in charge of sales in the United States and Canada, and you
would like to study the buying trends of customers in Canada. Rather
than mining the entire database, you can specify only the task-relevant
portion of it, including the attributes of interest, which are referred
to as the relevant attributes.
Frequent patterns can be used to gain insights into the data, such as
identifying associations between items or identifying common behavior
among groups of users.
The following are the steps involved in mining frequent patterns:
Data preprocessing: The first step in mining frequent patterns is to
preprocess the data. This involves cleaning and transforming the data
into a suitable format for analysis.
Itemset generation: In this step, all possible itemsets are generated from
the data. An itemset is a set of items that occur together in a transaction.
For example, if the data consists of transactions in a grocery store, an
itemset could be a set of items that a customer bought together, such as
milk, bread, and eggs.
Support calculation: The support of an itemset is the proportion of
transactions in which the itemset appears. In this step, the support of
each itemset is calculated.
Pruning: In this step, itemsets with a support less than a predefined
threshold are removed. This helps to reduce the number of itemsets that
need to be considered in the next step.
Rule generation: In this step, association rules are generated from the
frequent itemsets. An association rule is a statement of the form X → Y,
where X and Y are itemsets. The rule indicates that if a transaction
contains X, it is likely to also contain Y.
Evaluation: In this step, the generated rules are evaluated using various
measures such as confidence, lift, and support. This helps to identify the
most interesting and useful rules.
Frequent pattern mining algorithms, such as Apriori and FP-growth, are
commonly used to mine frequent patterns. These algorithms are designed to
efficiently generate frequent itemsets and association rules from large
datasets. The choice of algorithm depends on the size of the dataset and the
specific requirements of the analysis.
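The support-calculation and pruning steps above can be sketched with a brute-force enumeration; real Apriori avoids enumerating every itemset by pruning between levels, and the transactions and threshold here are invented.

```python
from itertools import combinations

# Brute-force frequent-itemset mining over invented transactions.
transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"},
                {"bread", "eggs"}, {"milk", "eggs"},
                {"milk", "bread", "eggs"}]
min_support = 0.6
items = sorted(set().union(*transactions))

frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        # Support: fraction of transactions containing the whole itemset.
        sup = sum(set(combo) <= t for t in transactions) / len(transactions)
        if sup >= min_support:  # pruning step
            frequent[combo] = sup
print(len(frequent), "frequent itemsets")
```

On this data every single item and every pair is frequent, but the triple {bread, eggs, milk} falls below the threshold and is pruned.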
Associations
Market Basket Analysis
One example is the Shopping Basket Analysis tool in Microsoft Excel, which
analyzes transaction data contained in a spreadsheet and performs market
basket analysis. A transaction ID must relate to the items to be analyzed.
The Shopping Basket Analysis tool then creates two worksheets:
The Shopping Basket Item Groups worksheet, which lists items that are
frequently purchased together.
The Shopping Basket Rules worksheet, which shows how items are related
(for example, purchasers of Product A are likely to buy Product B).
Apriori Algorithm
The Apriori algorithm first scans the data to find the frequent
individual items. It then prunes the candidate itemsets that do not
meet the minimum support threshold, and generates new candidate
itemsets by combining the remaining itemsets in a pairwise manner.
The Apriori algorithm is an iterative process that repeats this process
until no new frequent itemsets can be generated. Once all frequent
itemsets have been identified, association rules are generated from
these itemsets by specifying a minimum confidence threshold.
The confidence of a rule is the fraction of transactions that contain both
the antecedent and the consequent of the rule, out of the transactions
that contain the antecedent.
The Apriori algorithm is known for its efficiency and scalability, and is
widely used in applications such as market basket analysis, web log
mining, and bioinformatics.
However, the algorithm can suffer from performance degradation when
dealing with large databases or datasets with high dimensionality.
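One level of the candidate-generation-and-pruning loop described above can be sketched in plain Python; the transactions and the minimum count are invented for the example.

```python
from itertools import combinations

# One Apriori level: join frequent 1-itemsets pairwise, then prune.
transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"},
                {"bread", "eggs"}, {"milk", "eggs"}]
min_count = 2

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

# Frequent 1-itemsets (the starting point of the iteration).
items = set().union(*transactions)
l1 = [frozenset([i]) for i in sorted(items) if count({i}) >= min_count]

# Candidate 2-itemsets by pairwise combination, then the prune step.
candidates = {x | y for x, y in combinations(l1, 2)}
l2 = sorted(tuple(sorted(c)) for c in candidates if count(c) >= min_count)
print(l2)
```

A full implementation repeats this join-and-prune step for 3-itemsets, 4-itemsets, and so on until no new frequent itemsets appear.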
An association rule A → B states that the occurrence of A in a
transaction implies the occurrence of B in the dataset, above a certain
minimum confidence threshold.
For example, consider a dataset of transactions at a grocery store. One
frequent itemset might be {milk, bread, eggs}, indicating that these three
items are often purchased together. From this frequent itemset, we can
generate association rules such as "If a customer buys milk and bread, then
they are likely to buy eggs as well" or "If a customer buys milk and eggs,
then they are likely to buy bread as well". The rules can be used to make
predictions or recommendations about future purchases.
It's important to note that frequent itemsets and association rules are not
necessarily meaningful or useful on their own. Domain knowledge and
context are necessary to interpret the results and understand their
implications.
Here's a table summarizing the differences between Text Mining and Web
Mining in data mining:

           Text Mining                               Web Mining
Focus      Extracting useful information from text   Extracting useful information from the web
Data       Documents, emails, social media posts     Web pages, search engine results, log data
Methods    Text classification, clustering, NLP      Web usage mining, content mining, link analysis