
Contents

1. INTRODUCTION
1.1 WHY DATA MINING
2 DATA MINING PROCESS
2.1 DATA COLLECTION
2.1.1 BASIC DATA TYPES
2.1.1.1 NON-DEPENDENCY-ORIENTED DATA TYPES
2.1.1.2 DEPENDENCY-ORIENTED DATA
2.2 DATA PROCESSING
2.2.1 FEATURE EXTRACTION
2.2.2 DATA CLEANSING
2.2.3 DATA TRANSFORMATION AND REDUCTION
2.3 ANALYTICAL PHASE
2.3.1 ASSOCIATION PATTERN MINING
2.3.1.1 Apriori Algorithm
2.3.2 Further Improvement of the Apriori Method
2.3.3 Mining Frequent Patterns Without Candidate Generation
2.3.4 FP-Growth Method: Construction of FP-Tree
3 References
1. INTRODUCTION
Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful insights
from data. A wide variation exists in terms of the problem domains, applications, formulations, and
data representations that are encountered in real applications.

1.1 WHY DATA MINING


In the modern age, virtually all automated systems generate some form of data either for diagnostic
or analysis purposes. This has resulted in a deluge of data, which has been reaching the order of
petabytes or exabytes. Some examples of different kinds of data are as follows:

World Wide Web: The number of documents on the indexed Web is now on the order of billions. User accesses to such documents create Web access logs at servers and customer behavior profiles at commercial sites. These access logs can be mined to determine frequent patterns of access or unusual patterns of possibly unwarranted behavior.

Financial interactions: Most common transactions of everyday life, such as using an automated teller machine (ATM) card or a credit card, create data in an automated way. Such transactions can be mined for many useful insights, such as detecting fraud or other unusual activity.

User interactions: Many forms of user interactions create large volumes of data. For example, the
use of a telephone typically creates a record at the telecommunication company with details about the
duration and destination of the call. Many phone companies routinely analyze such data to determine
relevant patterns of behavior that can be used to make decisions about network capacity, promotions,
pricing, or customer targeting.

Sensor technologies and the Internet of Things: A recent trend is the development of low-cost
wearable sensors, smartphones, and other smart devices that can communicate with one another. By
one estimate, the number of such devices exceeded the number of people on the planet in 2008 [1].

2 DATA MINING PROCESS


The data mining process is a pipeline containing many phases, such as data cleaning, feature extraction, and algorithmic design.
2.1 DATA COLLECTION
Data collection may require the use of specialized hardware such as a sensor network, manual labor
such as the collection of user surveys, or software tools such as a Web document crawling engine to
collect documents.

2.1.1 BASIC DATA TYPES


Data mining deals with a wide variety of data types that are available for analysis. They can be classified into two main categories: non-dependency-oriented data and dependency-oriented data.

2.1.1.1 NON-DEPENDENCY-ORIENTED DATA TYPES

This typically refers to simple data types such as multidimensional data or text data. These data types are the simplest and most commonly encountered. Such data typically contain a set of records. A record is also referred to as a data point, instance, example, transaction, entity, tuple, object, or feature vector, depending on the application at hand. Each record contains a set of fields, which are also referred to as attributes, dimensions, or features.

Relational database systems were traditionally designed to handle this kind of data, even in their earliest forms. For example, consider a demographic data set in which the demographic properties of an individual, such as age, gender, and ZIP code, are recorded. Data types under this category include:

• Quantitative multidimensional data: A data set in which all fields are quantitative is also referred to as quantitative data or numeric data.
• Categorical and mixed-attribute data: A data set that consists of both numerical and categorical attributes.
• Binary and set data: These are a special case of multidimensional categorical or multidimensional quantitative data in which each attribute takes one of two discrete values; that is, 1 indicates that the element is included in the set and 0 otherwise.
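To make these categories concrete, the following is a minimal Python sketch (with hypothetical field names and values) of how records of each kind might be represented:

# Hypothetical illustrations of the three non-dependency-oriented record types.

# Quantitative multidimensional record: every field is numeric.
quantitative_record = {"age": 34, "height_cm": 172.5, "salary": 56000.0}

# Mixed-attribute record: numeric and categorical fields together.
mixed_record = {"age": 34, "gender": "F", "zip_code": "10005", "salary": 56000.0}

# Binary / set data: each attribute is 0 or 1, i.e. membership in a set.
universe = ["bread", "milk", "cereals", "eggs", "butter"]
transaction = {"bread", "milk"}                    # set form
binary_vector = [1 if item in transaction else 0   # equivalent 0/1 form
                 for item in universe]
print(binary_vector)  # [1, 1, 0, 0, 0]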

2.1.1.2 DEPENDENCY-ORIENTED DATA


Data mining is all about finding relationships between data items. The presence of preexisting
dependencies therefore changes the expected relationships in the data, and what may be considered
interesting from the perspective of these expected relationships. Several types of dependencies may
exist that may either be explicit or implicit.

• Implicit dependencies: In this case, the dependencies between data items are not explicitly specified but are known to “typically” exist in that domain. For example, consecutive temperature values collected by a sensor are likely to be extremely similar to one another.
• Explicit dependencies: This typically refers to graph or network data in which edges are used to specify explicit relationships.

Different dependency-oriented data types include:

• Text data: In its raw form, a text document corresponds to a string, i.e., a sequence of characters. Such data can be used, for example, to analyze the frequency of words in a document.
• Time-series data: This contains values that are generated by continuous measurement over time. For example, an environmental sensor will measure the temperature continuously, whereas an electrocardiogram (ECG) will measure the parameters of a subject’s heart rhythm. Such data are often used in forecasting and financial market analysis.
• Discrete sequences and strings: These can be considered a categorical analog of time-series data. For example, in a sequence of Web accesses, the web pages and the originating IP addresses of the requests are collected; such data are referred to as strings. Other examples include event logs on web servers and biological data, such as information about the characteristics and functions of proteins.
• Spatial data: These are data measured over specific locations. For example, sea-surface temperatures are often collected by meteorologists to forecast the occurrence of hurricanes. They are closely related to time-series data.
• Network and Graph data: Data values may correspond to nodes in the network, whereas the relationships among the data values may correspond to the edges in the network. Some examples are:
o Web graph: Here nodes correspond to the Web pages, and the edges correspond to
hyperlinks. The nodes have text attributes corresponding to the content in the page.
o Social networks: In this case, the nodes correspond to social network actors, whereas the
edges correspond to friendship links. The nodes may have attributes corresponding to
social page content.
o Chemical compound databases: In this case, the nodes correspond to the elements and the
edges correspond to the chemical bonds between the elements. The structures in these
chemical compounds are very useful for identifying important reactive and
pharmacological properties of these compounds.
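As a rough illustration of such explicit dependencies, the following minimal Python sketch (with hypothetical page names and attributes) represents a small Web graph as nodes with attributes and an adjacency list of edges:

# Nodes carry attributes (e.g., page text); edges are explicit relationships
# (hyperlinks). All names and attributes here are hypothetical.
nodes = {
    "pageA": {"text": "data mining tutorial"},
    "pageB": {"text": "frequent pattern mining"},
    "pageC": {"text": "graph algorithms"},
}
edges = [("pageA", "pageB"), ("pageB", "pageC"), ("pageA", "pageC")]

# Adjacency-list view of the same graph.
adjacency = {node: [] for node in nodes}
for src, dst in edges:
    adjacency[src].append(dst)
print(adjacency)  # {'pageA': ['pageB', 'pageC'], 'pageB': ['pageC'], 'pageC': []}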

2.2 DATA PROCESSING


The data preprocessing phase is perhaps the most crucial one in the data mining process. This phase begins after the collection of the data and consists of the following steps: feature extraction, data cleansing, and feature selection and transformation.

2.2.1 FEATURE EXTRACTION

In this phase, raw data are transformed into meaningful features for processing. The features that are most relevant are extracted based on the type of application. For example, in a credit-card fraud detection application, the amount of a charge, the repeat frequency, and the location are often good indicators of fraud. Different types of data require different types of features.

Sensor data: Sensor data is often collected as large volumes of low-level signals, which are massive.
The low-level signals are sometimes converted to higher-level features using wavelet or Fourier
transforms.

Image data: In its most primitive form, image data are represented as pixels. At a slightly higher
level, color histograms can be used to represent the features in different segments of an image.

Web logs: Web logs are typically represented as text strings in a prespecified format. It is relatively easy to convert Web access logs into a multidimensional representation of (the relevant) categorical and numeric attributes.
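As an illustration of this conversion, the following is a minimal Python sketch that parses a hypothetical log line (in the common Apache format) into categorical and numeric attributes; the field names chosen here are illustrative, not prescribed:

import re

# Hypothetical log line in the common Apache log format.
line = '192.168.1.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Regular expression capturing the fields of interest.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

match = pattern.match(line)
if match:
    # Multidimensional representation: categorical and numeric attributes.
    record = {
        "ip": match.group("ip"),               # categorical
        "method": match.group("method"),       # categorical
        "path": match.group("path"),           # categorical
        "status": int(match.group("status")),  # numeric
        "bytes": int(match.group("bytes")),    # numeric
    }
    print(record)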

Network traffic: In many intrusion-detection applications, the characteristics of the network packets
are used to analyze intrusions or other interesting activity. A variety of features may be extracted
from these packets, such as the number of bytes transferred and the network protocol used.

Document data: Document data is often available in raw and unstructured form, and the data may
contain rich linguistic relations between different entities. Named-entity recognition is an important
subtask of information extraction. This approach locates and classifies atomic elements in text into
predefined expressions of names of persons, organizations, locations, actions, and numeric quantities.
2.2.2 DATA CLEANSING

This is the process of detecting and correcting inaccurate or corrupt data in a data set. This process is important because several sources of errors and missing entries arise during the data collection process.

Some data collection technologies, such as sensors, are inherently inaccurate because of the hardware
limitations associated with collection and transmission. Sometimes sensors may drop readings
because of hardware failure or battery exhaustion.

Users may not want to specify their information for privacy reasons, or they may specify incorrect
values intentionally. For example, it has often been observed that users sometimes specify their
birthday incorrectly on automated registration sites such as those of social networks. In some cases,
users may choose to leave several fields empty.

Methods are needed to remove or correct missing and erroneous entries from the data. There are
several important aspects of data cleaning:

Handling missing entries: Many entries in the data may remain unspecified because of weaknesses in data collection or the inherent nature of the data. Such missing entries may need to be estimated. The process of estimating missing entries is also referred to as imputation. Imputation may not be practical when most records contain missing entries. Also, errors created by the imputation algorithm may affect the results of the data mining algorithm. In such cases it is often preferable to work with the missing data directly, because this avoids the additional biases inherent in the imputation process.

In the classification problem, a single attribute is treated specially, and the other features are used to estimate its value. In this case, many classification methods can also be used for missing-value estimation (as can matrix completion methods).

In a time-series data set, the average of the values at the time stamps just before and after the missing entry may be used for estimation.

For the case of spatial data, the estimation process is quite similar, where the average of values at
neighboring spatial locations may be used.
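The following minimal Python sketch (with hypothetical sensor readings) illustrates this kind of neighbor-based imputation for a time series:

# Each gap (None) is filled, left to right, with the average of the nearest
# available reading before it and the nearest raw reading after it.
def impute_time_series(values):
    imputed = list(values)
    for i, v in enumerate(imputed):
        if v is None:
            before = next((imputed[j] for j in range(i - 1, -1, -1)
                           if imputed[j] is not None), None)
            after = next((values[j] for j in range(i + 1, len(values))
                          if values[j] is not None), None)
            neighbors = [x for x in (before, after) if x is not None]
            imputed[i] = sum(neighbors) / len(neighbors) if neighbors else None
    return imputed

readings = [21.0, 21.4, None, 22.1, None, None, 23.0]
print(impute_time_series(readings))  # gaps replaced by neighbor averages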

Different features may represent different scales of reference and may therefore not be comparable to one another. For example, an attribute such as age is drawn on a very different scale than an attribute such as salary. The latter attribute is typically orders of magnitude larger than the former. As a result, any aggregate function computed on the different features (e.g., Euclidean distances) will be dominated by the attribute of larger magnitude. To address this problem, it is common to use standardization.
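Standardization typically replaces each value by its z-score, i.e., (value − attribute mean) / attribute standard deviation. A minimal Python sketch, using hypothetical age and salary values:

import numpy as np

# Each attribute is rescaled to zero mean and unit standard deviation so that
# no single attribute dominates aggregate functions such as Euclidean distance.
data = np.array([[25, 30000.0],
                 [40, 90000.0],
                 [58, 52000.0]])   # columns: age, salary (hypothetical)

means = data.mean(axis=0)
stds = data.std(axis=0)
standardized = (data - means) / stds
print(standardized)  # both columns are now on a comparable scale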

2.2.3 DATA TRANSFORMATION AND REDUCTION

The goal of data reduction is to represent the data more compactly. When the data size is smaller, it is much easier to apply sophisticated and computationally expensive algorithms. Data reduction does result in some loss of information, but the use of a more sophisticated algorithm may sometimes compensate for this loss. Different types of data reduction are used in various applications: data sampling, feature subset selection, dimensionality reduction with axis rotation, and dimensionality reduction with type transformation.

• Data sampling: In data sampling, the underlying data is sampled to create a much smaller database. The type of sampling used may vary with the application.
• Sampling for static data: In the unbiased sampling approach, a predefined fraction f of the data points is selected and retained for analysis. This is extremely simple to implement and can be achieved in two different ways, depending upon whether or not replacement is used. In biased sampling, some parts of the data are intentionally emphasized because of their greater importance to the analysis. A stratified sample first partitions the data into a set of desired strata and then independently samples from each of these strata based on predefined proportions in an application-specific way.
• Reservoir sampling for data streams: In this case, the data set is not static and cannot be stored on disk. For each incoming data point in the stream, one must use a set of efficiently implementable operations to maintain the sample. Admission control decisions are made dynamically: whether to include a new data point in the sample, and which existing point to eject from the sample to make room for it (see the sketch after this list).
• Feature subset selection: Some features are discarded when they are known to be irrelevant. Which features are relevant depends on the application. There are two types of feature selection: unsupervised feature selection and supervised feature selection. Feature selection is an important part of data mining because it defines the quality of the input data.
• Dimensionality reduction with axis rotation: In real data sets, a significant number of correlations exist among different attributes. In some cases, hard constraints or rules between attributes may uniquely define some attributes in terms of others. For example, the date of birth of an individual (represented quantitatively) is perfectly correlated with his or her age. These correlations and constraints correspond to implicit redundancies because they imply that knowledge of some subsets of the dimensions can be used to predict the values of the other dimensions. The two main methods to achieve this goal are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). Although PCA and SVD are primarily used for data reduction and compression, they have many other applications in data mining.
• Dimensionality reduction with type transformation: Data is transformed from a more complex type to a less complex type, such as multidimensional data. Thus, these methods serve the dual purpose of data reduction and type portability.
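The following is a minimal Python sketch of reservoir sampling for a stream: the classic one-pass scheme that maintains a uniform random sample of size k without storing the stream. The stream and sample size here are hypothetical.

import random

def reservoir_sample(stream, k, seed=42):
    rng = random.Random(seed)
    reservoir = []
    for i, point in enumerate(stream):
        if i < k:
            reservoir.append(point)          # fill the reservoir first
        else:
            j = rng.randint(0, i)            # admission decision for point i
            if j < k:
                reservoir[j] = point         # eject a randomly chosen old point
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))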

2.3 ANALYTICAL PHASE

The final part of the mining process is to apply effective analytical methods to the processed data. Four problems in data mining are considered fundamental to the mining process: clustering, classification, association pattern mining, and outlier detection. They are encountered repeatedly in the context of many data mining applications.

2.3.1 ASSOCIATION PATTERN MINING

In its most primitive form, the association pattern mining problem is defined in the context of sparse binary databases, where the data matrix contains only 0 or 1 entries, and most entries take on the value of 0. Most customer transaction databases are of this type. For example, if each column in the data matrix corresponds to an item and each customer transaction corresponds to a row, then the (i, j)th entry is 1 if customer transaction i contains item j among the items that were bought.

The association pattern mining problem has a wide variety of applications, such as:
• Supermarket data: The determination of frequent itemsets provides useful insight for targeted marketing and the shelf placement of items.
• Text mining: Frequent pattern mining helps to identify co-occurring terms and keywords.
• Generalization to dependency-oriented data types.

The following are some definitions and properties used in the frequent pattern mining model.

o Definition (Support) The support of an itemset I is defined as the fraction of the transactions
in the database T = {T1 . . . Tn} that contain I as a subset.
o Definition (Frequent Itemset Mining) Given a set of transactions T = {T1 . . . Tn}, where
each transaction Ti is a subset of items from U, determine all item sets I that occur as a
subset of at least a predefined fraction minsup of the transactions in T .
o Definition (Frequent Itemset Mining: Set-wise Definition) Given a set of sets T = {T1 . . .
Tn}, where each element of the set Ti is drawn on the universe of elements U, determine all
sets I that occur as a subset of at least a predefined fraction minsup of the sets in T .
o Property (Support Monotonicity Property) The support of every subset J of I is at least
equal to that of the support of itemset I. sup(J) ≥ sup(I) ∀ J ⊆ I.
o Property (Downward Closure Property) Every subset of a frequent itemset is also frequent.
o Definition (Maximal Frequent Item sets) A frequent itemset is maximal at a given minimum
support level minsup, if it is frequent, and no superset of it is frequent.
o Definition (Confidence) Let X and Y be two sets of items. The confidence conf(X ⇒ Y) of the rule X ⇒ Y is the conditional probability of Y occurring in a transaction, given that the transaction contains X. Therefore, the confidence conf(X ⇒ Y) is defined as follows:
conf(X ⇒ Y) = sup(X ∪ Y) / sup(X).
o Definition (Association Rules) Let X and Y be two sets of items. Then, the rule X ⇒ Y is said
to be an association rule at a minimum support of minsup and minimum confidence of
minconf, if it satisfies both the following criteria:
1. The support of the itemset X ∪ Y is at least minsup.
2. The confidence of the rule X ⇒ Y is at least minconf.
o Property (Confidence Monotonicity) Let X1, X2, and I be item sets such that X1 ⊂ X2 ⊂ I.
Then the confidence of X2 ⇒ I - X2 is at least that of X1 ⇒ I - X1.
conf(X2 ⇒ I - X2) ≥ conf(X1 ⇒ I - X1)
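To make the definitions of support and confidence concrete, here is a minimal Python sketch over a small hypothetical transaction database:

# Small hypothetical transaction database T.
T = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    # Fraction of transactions containing itemset as a subset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # conf(X => Y) = sup(X union Y) / sup(X).
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"bread", "milk"}, T))       # 0.5
print(confidence({"bread"}, {"milk"}, T))  # 2/3, about 0.667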

2.3.1.1 Apriori Algorithm


The Apriori algorithm uses the downward closure property in order to prune the candidate search space. The downward closure property imposes a clear structure on the set of frequent patterns. This is useful for avoiding wasteful counting of the support levels of itemsets that are known not to be frequent.
Let Fk denote the set of frequent k-itemsets, and Ck denote the set of candidate k-itemsets.
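The following is a minimal Python sketch of the level-wise Apriori loop described above (join, downward-closure pruning, and support counting); it is an illustrative implementation rather than the textbook pseudocode, run here on the nine-transaction database used in the example below.

from itertools import combinations

def apriori(transactions, minsup_count):
    # C_k are candidate k-itemsets, F_k (also written L_k) the frequent ones.
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    Fk = [frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup_count]
    k = 1
    while Fk:
        for s in Fk:
            frequent[s] = sum(s <= t for t in transactions)
        # Join step: generate candidate (k+1)-itemsets from pairs of F_k.
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # Prune step (downward closure): every k-subset must itself be frequent.
        Fk_set = set(Fk)
        Ck = [c for c in Ck
              if all(frozenset(sub) in Fk_set for sub in combinations(c, k))]
        # Count supports and keep only the frequent candidates.
        Fk = [c for c in Ck
              if sum(c <= t for t in transactions) >= minsup_count]
        k += 1
    return frequent

# The nine-transaction database from the example below.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
result = apriori(D, minsup_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(set(itemset), result[itemset])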

Example: Market basket analysis


This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space. In order to achieve this, the point-of-sale data collected with barcode scanners is processed to find dependencies among items.
Consider a set of items such as I = {bread, milk, cereals, eggs, butter, sugar}; in the transaction database below, the items are simply labeled I1, I2, and so on.

Binary representation of item set

TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

There are nine transactions in this database, that is, |D| = 9.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm joins L1 with L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. For the generation of the set of candidate 3-itemsets, C3, the join step first gives C3 = L2 × L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 × L3 to generate a candidate set of 4-itemsets, C4. The only candidate, {I1, I2, I3, I5}, is pruned because its subsets {I1, I3, I5} and {I2, I3, I5} are not frequent, so C4 is empty and the algorithm terminates.
The rule for joining is that, when generating candidate k-itemsets (here k = 3) from Lk-1, two (k-1)-itemsets are joined only if their first k-2 elements (i.e., 1 element in this case) are in common.
Back to our example above, we had
L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let us take l = {I1,I2,I5}.
– All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below, each listed with its confidence.
– R1: I1 ∧ I2 ⇒ I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is rejected.
– R2: I1 ∧ I5 ⇒ I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is selected.
– R3: I2 ∧ I5 ⇒ I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
• R3 is selected.
– R4: I1 ⇒ I2 ∧ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is rejected.
– R5: I2 ⇒ I1 ∧ I5
• Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%
• R5 is rejected.
– R6: I5 ⇒ I1 ∧ I2
• Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%
• R6 is selected. In this way, we have found three strong association rules.
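The rule-generation procedure illustrated above can be sketched in a few lines of Python: enumerate every non-empty proper subset X of the frequent itemset l and keep the rule X ⇒ (l − X) whenever its confidence reaches the threshold. The support counts below are taken from the worked example.

from itertools import combinations

def rules_from_itemset(l, support_counts, minconf):
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for antecedent in combinations(sorted(l), r):
            X = frozenset(antecedent)
            conf = support_counts[l] / support_counts[X]
            if conf >= minconf:
                rules.append((set(X), set(l - X), conf))
    return rules

# Support counts from the worked example above.
sc = {
    frozenset(s): c for s, c in [
        (("I1",), 6), (("I2",), 7), (("I5",), 2),
        (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
        (("I1", "I2", "I5"), 2),
    ]
}
for X, Y, conf in rules_from_itemset({"I1", "I2", "I5"}, sc, minconf=0.7):
    print(X, "=>", Y, f"confidence = {conf:.0%}")  # prints the 3 strong rules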
2.3.2 Further Improvement of the Apriori Method
- Major computational challenges:
• Multiple scans of the transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
- Improving Apriori: general ideas
• Reduce the number of passes over the transaction database
• Shrink the number of candidates
• Facilitate the support counting of candidates
- Completeness: any improved association rule mining algorithm should still find the same set of frequent itemsets.

2.3.3 Mining Frequent Patterns Without Candidate Generation


• Compress a large database into a compact, Frequent- Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones

– Avoid candidate generation: sub-database test only!

• Consider the same previous example of a database, D, consisting of 9 transactions (the transaction table shown earlier, T100 through T900).
• Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).
• The first scan of the database is the same as in Apriori; it derives the set of 1-itemsets and their support counts.
• The set of frequent items is sorted in the order of descending support count.
• The resulting set is denoted as L = {I2:7, I1:6, I3:6, I4:2, I5:2}.


2.3.4 FP-Growth Method: Construction of FP-Tree

• First, create the root of the tree, labeled with “null”.


• Scan the database D a second time. (The first scan was used to derive the 1-itemsets and their support counts, i.e., L.)
• The items in each transaction are processed in L order (i.e., sorted by descending support count).
• A branch is created for each transaction, with each item node labeled by its support count, separated by a colon.
• Whenever the same node is encountered in another transaction, we simply increment the support count of the common node or prefix.
• To facilitate tree traversal, an item header table is built so that each item points to its
occurrences in the tree via a chain of node-links.
Now, the problem of mining frequent patterns in the database is transformed into that of mining the FP-tree.
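The construction described above can be sketched in Python as follows; this is an illustrative implementation (the node and variable names are my own), which also prints the conditional pattern base of I5 used in the next subsection.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # First scan: support counts of single items.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i for i, c in support.items() if c >= min_support_count}

    root = FPNode(None, None)           # the "null" root
    header_table = defaultdict(list)    # item -> node-links into the tree

    # Second scan: insert each transaction in descending support (L) order.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header_table[item].append(child)
            node = node.children[item]
            node.count += 1              # shared prefixes just increment counts
    return root, header_table, support

# The nine transactions from the example above.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
root, header, support = build_fp_tree(D, min_support_count=2)

# Conditional pattern base of I5: prefix paths ending just above each I5 node.
for node in header["I5"]:
    path, p = [], node.parent
    while p.item is not None:
        path.append(p.item)
        p = p.parent
    print(list(reversed(path)), ":", node.count)  # {I2 I1: 1}, {I2 I1 I3: 1}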
2.3.4.1 Mining the FP-Tree by Creating Conditional (sub) pattern bases
Steps:

1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.
3. Then, construct its conditional FP-tree and perform mining on this tree.
4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.
5. The union of all frequent patterns (generated by step 4) gives the required frequent itemsets.

Item | Conditional pattern base         | Conditional FP-tree      | Frequent patterns generated
I5   | {(I2 I1: 1), (I2 I1 I3: 1)}      | <I2: 2, I1: 2>           | {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4   | {(I2 I1: 1), (I2: 1)}            | <I2: 2>                  | {I2, I4: 2}
I3   | {(I2 I1: 2), (I2: 2), (I1: 2)}   | <I2: 4, I1: 2>, <I1: 2>  | {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1   | {(I2: 4)}                        | <I2: 4>                  | {I2, I1: 4}

Mining the FP-Tree by creating conditional (sub) pattern bases

Now, following the above-mentioned steps:

• Let us start with I5. I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
• Therefore, considering I5 as the suffix, its 2 corresponding prefix paths are {I2 I1: 1} and {I2 I1 I3: 1}, which form its conditional pattern base.

So, considering the table's last column of frequent pattern generation, we have generated the 3-item frequent sets {I2, I1, I5: 2} and {I2, I1, I3: 2}. Similarly, the 2-item frequent sets are {I2, I5: 2}, {I1, I5: 2}, {I2, I4: 2}, {I2, I3: 4}, {I1, I3: 4}, and {I2, I1: 4}. Also, we can see that we have arrived at distinct sets; none of the sets are duplicated.

In comparison to the Apriori algorithm, we have generated only the frequent patterns rather than all the combinations of different items. For example, here we have not generated {I3, I4} or {I3, I5}, since these items are not frequently bought together; this is the main idea behind the association rule criteria and the FP-growth algorithm.
3 References
[1] C. Aggarwal. Managing and mining sensor data. Springer, 2013.
