
Discovery of Sequential Patterns with Quantity Factors

Karim Guevara Puente de la Vega
Universidad Católica de Santa María / Arequipa
Universidad Nacional de San Agustín / Arequipa
kguevara@ucsm.edu.pe

Cesar Beltrán Castañón
Departamento de Ingeniería
Pontificia Universidad Católica del Perú / Lima
cbeltran@pucp.pe

Abstract

Sequential pattern mining stems from the need to obtain patterns that are repeated across multiple transactions in a database of sequences, which are related to time or to another type of criterion. This work proposes a new technique for discovering sequential patterns from a database of sequences, in which the patterns not only show how the items relate to one another in time, but the mining process itself also includes the quantity factors associated with each of the items that are part of a sequence. As a result, the process yields information on how these items relate with regard to their associated amounts. The proposed algorithm uses divide-and-conquer techniques, as well as indexing and partitioning of the database.

1 Credits

This document was written as part of the development of the 1st Symposium on Information Management and Big Data, SIMBig 2014. It has been adapted from the instructions for earlier ACL.

2 Introduction

Sequential pattern mining is the process of obtaining the relationships between occurrences of sequential events, in order to find whether there is a specific order in which these events occur. There are many studies in this area. All of them use the minimum support restriction, and some include other restrictions: the time interval within which the events are required to happen, taxonomies defined by the user, and the relaxation by which the items of a sequence element need not have occurred in a single transaction but may be spread over two or more, provided that the times of these transactions fall within a small time window determined by the user.

In addition, the algorithms for mining sequential patterns treat the resulting sequential patterns in a uniform manner, even though the individual patterns in a sequence can differ in important ways, such as the amount associated with each item that makes up each pattern.

For these reasons, this paper proposes a technique intended to exploit these intrinsic relationships of sequential patterns, specifically the relationship to the amount of each of the items. Including this aspect in sequential pattern mining makes it possible to obtain a set of sequential patterns that are not only frequent but also show how the amounts associated with each item in a frequent sequential pattern are related. Including the quantity restriction in the extraction of frequent sequential patterns can therefore provide much more meaningful information.

The article is organized as follows: Section 3 covers previous work. Section 4 describes the problem. Section 5 introduces the proposed technique. Section 6 presents the experiments and results. The conclusions and future work are given in Section 7, followed by the references.

3 Previous works

Techniques for discovering association rules are essentially boolean: they discard the quantities of the items purchased and only consider whether or not something was purchased. An important area of study is sequential pattern mining, which involves extracting patterns that are repeated across multiple transactions in a transactional database and that are related to time or to another type of sequence.

The sequential pattern mining problem was introduced by Agrawal and Srikant (1995), who give the example of typical customer rentals in a video rental shop. Customers often rent "Star Wars", then "Empire Strikes Back" and then "Return of the Jedi". These rentals need not have been made consecutively; that is to say, there could be customers that
rented some other video in the middle of the previous sequence, so that these sequences of transactions also fall into the same pattern.

Research on mining sequential patterns is based on events that took place in order over time.

Most of the algorithms implemented for extracting frequent sequences use three different types of approaches, according to how they evaluate the support of the candidate sequential patterns. The first group of algorithms is based on the Apriori property, introduced by Agrawal and Srikant (1994) for mining association rules. This property states that any sub-pattern of a frequent pattern is also frequent, which allows candidate sequences to be pruned during candidate generation. Based on this heuristic, Agrawal and Srikant (1995) proposed algorithms such as AprioriAll and AprioriSome. The substantial difference between these two algorithms is that AprioriAll generates the candidates from all the large sequences found, including those that might not be maximal, whereas AprioriSome only counts those sequences that are large and maximal, thus reducing the search space of the patterns.

In subsequent work, Srikant and Agrawal (1996) propose the GSP algorithm (Generalized Sequential Patterns), also based on the Apriori technique, which outperforms the previous algorithms by a factor of up to 20 in execution time. Until then, the algorithms proposed for mining sequential patterns focused on obtaining patterns taking into account only the minimum support given by the user. Such patterns could, however, fit transactions that occurred at very distant points in time, which was not convenient for the purposes of mining. In that paper the authors therefore propose that, in addition to the minimum support, the user should be able to specify an interest in patterns that fit transactions occurring within certain periods of time. This is achieved by including restrictions on the maximum and minimum distance, the size of the window within which the sequences occur, and inheritance relationships (taxonomies), which are cross-relations through a hierarchy.

In these algorithms based on the Apriori principle, most of the effort went into developing specific structures to represent the candidate sequential patterns and in this way make the support counting operations faster.

The second group consists of algorithms that seek to reduce the size of the set of scanned data by projecting the initial database and obtaining patterns from the projections, without involving a candidate generation process. Using this technique and the divide-and-conquer approach, Han et al. (1996) proposed the algorithm FreeSpan (Frequent Pattern-Projected Sequential Pattern mining), and Pei et al. (2001) propose PrefixSpan (Prefix-projected Sequential Pattern mining). In these algorithms the database of sequences is projected recursively into a set of smaller databases, in which the sub-sequence fragments grow based on the current set of frequent sequences and from which the patterns are extracted.

Han et al. (1996) show that FreeSpan extracts the full set of patterns and is more efficient and considerably faster than the GSP algorithm. However, a sub-sequence can be generated by combinations of sub-strings in a sequence, while the projection in FreeSpan must follow the sequence in the initial database without reducing its length. In addition, it is very expensive that the growth of a sub-sequence is explored at any split point within a candidate sequence. As an alternative, Pei et al. (2001) propose PrefixSpan. The general idea is to examine only the prefix sub-sequences and to project only their corresponding postfix sub-sequences into projected databases. In each of these projected databases, the sequential patterns are grown by exploring only local frequent patterns. PrefixSpan extracts the full set of patterns, and its efficiency and performance are considerably better than those of both GSP and FreeSpan.

The third group is formed by algorithms that keep in memory only the information necessary for evaluating the support. These algorithms are based on so-called occurrence lists, which describe the locations where the patterns occur in the database. Under this approach, Zaki (2001) proposes the SPADE algorithm (Sequential Pattern Discovery using Equivalence classes), in which he introduces the technique of transforming the database to a vertical format. Unlike the Apriori-based algorithms, it does not perform multiple passes over the database and can extract all the frequent sequences in only three passes. This is due to the incorporation of new techniques and concepts such as the list of identifiers (id-list) in vertical format that is associated with the sequences. Frequent sequences can be generated from these lists by means of temporal joins. The algorithm also uses a lattice-based approach to
break down the search space into small classes that can be processed independently in main memory. It also uses both breadth-first and depth-first search to find the frequent sequences within each class.

In addition to the techniques mentioned earlier, Lin and Lee (2005) propose MEMISP (Memory Indexing for Sequential Pattern mining), the first algorithm to implement the idea of memory indexing. The central idea of MEMISP is to use memory for both the data sequences and the indexes in the mining process, and to apply an indexing and search strategy to find all frequent sequences from the data sequences in memory, which were read from the database in a first pass. It requires only one pass over the database, or at most two for databases that are too large. It also avoids candidate generation and database projection, but has the disadvantage of high CPU and memory utilization.

The fourth group of algorithms is composed of all those that use fuzzy techniques. One of the first works is that of Wang et al. (1999), who propose a new data-mining algorithm, which takes the advantages of fuzzy set theory to enhance the capability of exploring interesting sequential patterns from databases with quantitative values. The proposed algorithm integrates concepts of fuzzy sets and the AprioriAll algorithm to find interesting sequential patterns and fuzzy association rules from transaction data. The rules can thus predict what products and quantities will be bought next by a customer and can be used to provide suggestions to appropriate supervisors.

Wang et al. (1999) propose the fuzzy quantitative sequential patterns (FQSP) algorithm, where an item's quantity in the pattern is represented by a fuzzy term rather than a quantity interval. In their work an Apriori-like algorithm was developed to mine all FQSP, so it suffers from the same weaknesses, including: (1) it may generate a huge set of candidate sequences and (2) it may require multiple scans of the database. Therefore, an Apriori-like algorithm often does not perform well when a sequence database is large and/or when the number of sequential patterns to be mined is large.

Chen et al. (2006) propose the divide-and-conquer fuzzy sequential mining (DFSM) algorithm to solve the same problem presented by Hong, using the divide-and-conquer strategy, which possesses the same merits as the PrefixSpan algorithm; consequently, its performance is better than that of Wang et al.

Fiot (2008) in her work suggests that a quantitative item is partitioned into several fuzzy sets. In the context of fuzzy logic, a fuzzy item is the association of a fuzzy set b with its corresponding item x, i.e. [x,b]. In the DB each record is associated with a fuzzy item [x,b] according to its degree of membership. A fuzzy itemset is denoted by the pair (X,B), where X is the set of items and B is a set of fuzzy sets.

In addition, she states that a g-k-sequence (s1, s2,…, sp) is formed by g fuzzy itemsets s=(X,B) grouping k fuzzy items [x,b]; fuzzy sequential pattern mining therefore consists in finding the maximal frequent fuzzy g-k-sequences.

Fiot (2008) provides a general definition of the frequency of a sequence, and presents three algorithms to find the fuzzy sequential patterns: SpeedyFuzzy, which counts all the objects or items of a fuzzy set regardless of their degree, so that all objects with a degree greater than 0 have the same weight; MiniFuzzy, which counts the objects or items of a fuzzy set but only supports those items of the candidate sequence whose degree of membership is greater than a specified threshold; and TotallyFuzzy, which counts each object and each sequence. This last algorithm takes into account the importance of the set or sequence of data, and considers the best degree of membership.

4 Description of the Problem

A sequence s, denoted by <e1 e2 … en>, is an ordered set of n elements, where each element ei is a set of objects called an itemset. An itemset, which is denoted by (x1[c1], x2[c2], …, xq[cq]), is a non-empty set of q elements, where each element xj is an item represented by a literal, and cj is the amount associated with the item xj, represented by a number in square brackets. Without loss of generality, the objects of an element are assumed to be in lexicographical order by the literal. The size of the sequence s, denoted by |s|, is the total number of objects of all elements of s, so a sequence s is a k-sequence if |s|=k.

For example, <(a[5])(c[2])(a[1])>, <(a[2],c[4])(a[3])> and <(b[2])(a[2],e[3])> are all 3-sequences. A sequence s = <e1 e2 … en> is a sub-sequence of another sequence s' = <e1' e2' … em'> if there are integers 1 ≤ i1 < i2 < … < in ≤ m such that e1 ⊆ e'_i1, e2 ⊆ e'_i2, ..., and en ⊆ e'_in. The sequence s'
contains the sequence s if s is a sub-sequence of s'.

Similarly, <(b,c)(c)(a,c,e)> contains <(b)(a,e)>, where the quantities may be different.

The support (sup) of a sequential pattern X is defined as the fraction (percentage) of records that contain X out of the total number of records in the database. The counter for each item is increased by one each time the item is found in a different transaction in the database during the scanning process. This means that the support counter does not take into account the quantity of the item. For example, if in a transaction a customer buys three bottles of beer, the support counter of {beer} only increases by one; in other words, if a transaction contains an item, the support counter of that item is incremented only by one.

Each sequence in the database is known as a data sequence. The support of the sequence s is denoted as s.sup, and represents the number of data sequences that contain s divided by the total number of sequences in the database. minSup is the minimum threshold specified by the user. A sequence s is frequent if s.sup ≥ minSup, in which case it is a frequent sequential pattern.

Then, given the value of minSup and a database of sequences, the problem of sequential pattern mining is to discover the set of all sequential patterns whose supports are greater than or equal to the value of the minimum support (s.sup ≥ minSup).

Definition: given a pattern ρ and a frequent item x in the database of sequences, ρ' is a:
- Pattern Type-1: if ρ' can be formed by adding to ρ the itemset that contains the item x, as a new element of ρ.
- Pattern Type-2: if ρ' can be formed by the extension of the last element of ρ with x.

The item x is called the stem of the sequential pattern ρ', and ρ is the prefix pattern of ρ'.

As an example, consider the database of sequences in Figure 1, which includes the amounts for the items and has six data sequences.

Sequences
C1 = <(a[1],d[2]) (b[3],c[4]) (a[3],e[2])>
C2 = <(d[2],g[1]) (c[5],f[3]) (b[2],d[1])>
C3 = <(a[5],c[3]) (d[2]) (f[2]) (b[3])>
C4 = <(a[4],b[2],c[3],d[1]) (a[3]) (b[4])>
C5 = <(b[3],c[2],d[1]) (a[3],c[2],e[2]) (a[4])>
C6 = <(b[4],c[3]) (c[2]) (a[1],c[2],e[3])>

Figure 1. Database of sequences

Consider the sequence C6, which consists of three elements: the first has the objects b and c, the second has the object c, and the third has the objects a, c and e. The support of <(b)(a)> is 4/6, since all the data sequences with the exception of C2 and C3 contain <(b)(a)>. The sequence <(a,d)(a)> is a sub-sequence of both C1 and C4; therefore, <(a,d)(a)>.sup = 2/6.

Given the pattern <(a)> and the frequent item b, the type-1 pattern <(a)(b)> is obtained by adding (b) to <(a)>, and the type-2 pattern <(a,b)> by extending <(a)> with b.

Similarly, <(a)> is the prefix pattern (ρ-pat), which in turn is a frequent sequence, and b is the stem of both <(a)(b)> and <(a,b)>.

Note that the null sequence, denoted by <>, is the ρ-pat of any frequent 1-sequence. Therefore, a frequent k-sequence is either a type-1 or a type-2 pattern of a frequent (k-1)-sequence.

5 Algorithm for the discovery of sequential patterns with quantity factors - MSP-QF

The algorithm for mining sequential patterns with quantity factors arises from the need to discover, from a database of sequences, the set of sequential patterns that include the amounts associated with each of the items that are part of the transactions in the database. With this additional information it is possible to know with greater precision not only the temporal relationship between the various items involved in the transactions of a sequence, but also the relationship between the amounts of these items.

The MSP-QF algorithm is based on the use of prefixes and on the creation of indexes, either from the database of sequences or from other indexes generated during the mining process, which are searched recursively for frequent patterns. As a result of exploring a particular index, fewer and shorter data sequences need to be processed, while the patterns found become longer.

In addition, if the database of sequences is very large, partitioning techniques are used so that the algorithm is applied to each of the partitions as if it were a smaller database.
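
The containment and support definitions of Section 4 can be checked against the database of Figure 1 with a short Python sketch. The dictionary-based representation and the names DB, contains and support are assumptions made for illustration; they are not part of the MSP-QF specification.

```python
from fractions import Fraction

# Database of Figure 1: each data sequence is a list of elements,
# and each element maps an item to its quantity.
DB = {
    "C1": [{"a": 1, "d": 2}, {"b": 3, "c": 4}, {"a": 3, "e": 2}],
    "C2": [{"d": 2, "g": 1}, {"c": 5, "f": 3}, {"b": 2, "d": 1}],
    "C3": [{"a": 5, "c": 3}, {"d": 2}, {"f": 2}, {"b": 3}],
    "C4": [{"a": 4, "b": 2, "c": 3, "d": 1}, {"a": 3}, {"b": 4}],
    "C5": [{"b": 3, "c": 2, "d": 1}, {"a": 3, "c": 2, "e": 2}, {"a": 4}],
    "C6": [{"b": 4, "c": 3}, {"c": 2}, {"a": 1, "c": 2, "e": 3}],
}


def contains(data_seq, pattern):
    """Return True if `pattern` (a list of item sets) is a sub-sequence of
    `data_seq`. Quantities are ignored, as in the definition of support."""
    i = 0  # index of the next pattern element still to be matched
    for element in data_seq:
        # Matching each pattern element at the earliest possible position
        # is sufficient to decide containment.
        if i < len(pattern) and set(pattern[i]).issubset(element):
            i += 1
    return i == len(pattern)


def support(pattern):
    """Fraction of data sequences in DB that contain `pattern`."""
    hits = sum(contains(ds, pattern) for ds in DB.values())
    return Fraction(hits, len(DB))


print(support([{"b"}, {"a"}]))        # 4 of the 6 data sequences
print(support([{"a", "d"}, {"a"}]))   # 2 of the 6 data sequences
```

Note that the quantities stored in each element play no role in `contains`, which mirrors the remark that the support counter does not take the quantity of an item into account.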

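The type-1 and type-2 pattern definitions can be illustrated in the same way. The sketch below forms both extensions of the prefix pattern <(a)> with the stem b and counts their support over the item sets of Figure 1 by brute force. The function names type1, type2 and support_count are hypothetical, and the brute-force scan is used only for clarity; it is not the index-based search of MSP-QF.

```python
# Item sets of the six data sequences of Figure 1 (quantities omitted,
# since they do not affect support counting).
DB = [
    [{"a", "d"}, {"b", "c"}, {"a", "e"}],        # C1
    [{"d", "g"}, {"c", "f"}, {"b", "d"}],        # C2
    [{"a", "c"}, {"d"}, {"f"}, {"b"}],           # C3
    [{"a", "b", "c", "d"}, {"a"}, {"b"}],        # C4
    [{"b", "c", "d"}, {"a", "c", "e"}, {"a"}],   # C5
    [{"b", "c"}, {"c"}, {"a", "c", "e"}],        # C6
]


def type1(pattern, x):
    """Type-1 pattern: append the itemset (x) as a new last element."""
    return pattern + [{x}]


def type2(pattern, x):
    """Type-2 pattern: extend the last element of a non-empty pattern with x."""
    return pattern[:-1] + [pattern[-1] | {x}]


def support_count(pattern):
    """Number of data sequences in DB that contain `pattern`."""
    count = 0
    for ds in DB:
        i = 0
        for element in ds:
            if i < len(pattern) and pattern[i] <= element:
                i += 1
        count += (i == len(pattern))
    return count


prefix = [{"a"}]            # the prefix pattern <(a)>
p1 = type1(prefix, "b")     # <(a)(b)>
p2 = type2(prefix, "b")     # <(a,b)>
print(p1, support_count(p1))
print(p2, support_count(p2))
```

Both helpers return new lists, so the prefix pattern can be reused to try every candidate stem.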
5.1 Procedure of the algorithm MSP-QF

Listed below are the steps of the proposed algorithm.

Step 1: Partitioning and scanning of the database of sequences. Depending on the size of the database, it can be partitioned and formatted, and each of the partitions is then scanned independently. For each partition, the sequences are constructed and stored in the structure DBSeq. At the same time, the index of items is generated, in which the support of each item found during the scanning process is stored.

Step 2: The index of items is filtered to keep those items that are frequent, i.e., whose support is greater than or equal to the minSup determined by the user. All these items form sequences of size |s| = 1 and therefore form the set of frequent 1-sequences. For each of these frequent sequences, the amounts associated with the item are discretized at the time the sequence is saved in the set of frequent patterns.

Step 3: For each of the frequent patterns ρ found in step 2, or as a result of step 4, the index ρ-idx is constructed, with entries (ptr_ds, pos), where ptr_ds refers to a sequence of the DB in which the pattern ρ appears, and pos is the pair (posItemSet, posItem), where posItemSet is the position of the itemset in the sequence and posItem is the position of the item in the itemset of the sequence where the pattern appears. The values of pos allow the subsequent scans to be performed only from these positions in a given sequence.

Step 4: Find the stems of type-1 and/or type-2 for each pattern ρ and its corresponding index ρ-idx generated in the previous step, considering only those items of the sequences referred to in ρ-idx and the respective values of pos. As the stems are found, their supports are calculated; in addition, the amount associated with the item in the DB sequence being examined is added to the list of quantities of the item that is part of the stem. The information on the stems and their quantities is stored in another index of stems. This step is repeated for the same pattern until no more stems are found for it.

Step 5: When there are no more stems, the index of stems is filtered to keep those that are frequent. For all frequent stems (sequences), the quantities associated with each item are discretized and the result is stored in the set of frequent patterns. Before a pattern is added to the set of frequent patterns, it is verified that the newly found frequent pattern has not already been added to this set as a result of applying the algorithm to a previously processed partition of the database. If the frequent pattern already exists in the set of frequent patterns, the discretization process is applied again to the set of quantities associated with the stored frequent pattern together with the set of quantities of the newly discovered frequent pattern; otherwise, the frequent pattern found in the current partition is added directly to the set of frequent patterns.

Steps 3, 4 and 5 are then performed recursively with each of the frequent patterns found in the process.

Discretization Function: This function is responsible for converting the set of quantities associated with an item into the range of values given by the mean and standard deviation of this set. For example, given the sequences of Figure 1, the set of quantities associated with the item of <(a)> is 1, 5, 4, 3, 1 (the quantity of the first occurrence of a in each of C1, C3, C4, C5 and C6), which after being discretized becomes the interval [ 2.8±1.6 ].

To summarize the steps carried out in the proposed algorithm, Figure 3 shows a schematic of the entire procedure.

5.2 Algorithm specification MSP-QF

Here we show the specification of the proposed algorithm MSP-QF.

Algorithm MSP-QF
In: DB = database of sequences
    minSup = minimum support
    partition = number of sequences included in each of the partitions
Out: set of all sequential patterns with quantity factors.
Procedure:
1. Partition the DB.
2. Scan each partition into main memory and:
   (i) build the sequences and store them in the DBSeq structure.
   (ii) index the items and determine the support of each item.
   (iii) add the quantity of each item in a sequence to the list of item quantities in the index.
3. Find the set of frequent items.
4. For each frequent item x,
   (i) form the sequential pattern ρ = <(x)>.
   (ii) call Discretize(ρ) to discretize the set of quantities associated with each item x.
   (iii) store ρ in the set of frequent patterns.
   (iv) call Indexing(x, <>, DBSeq) to build the ρ-idx index.
   (v) call Mining(ρ, ρ-idx) to obtain patterns from the index ρ-idx.

Subroutine Indexing(x, ρ, set_Seq)
Parameters:
    x = a stem of type-1 or type-2;
    ρ = prefix pattern (ρ-pat);
    set_Seq = set of data sequences
    /* If set_Seq is an index, then each data sequence in the index is referenced by the element ptr_ds, which is part of the entry (ptr_ds, pos) of the index */
Out: index ρ'-idx, where ρ' represents the pattern formed by the stem x and the prefix pattern ρ-pat.
Procedure:
1. For each data sequence ds of set_Seq
   (i) If set_Seq = DBSeq then pos_initial = 0, else pos_initial = pos.
   (ii) Find the stem in each sequence ds from the position (pos_initial + 1),
        1. If the stem x is in position pos in ds, then insert a pair (ptr_ds, pos) in the ρ'-idx index, where ptr_ds references ds.
        2. If the stem x is equal to the item x' of the sequence ds, add the quantity q associated with the item x' to the list of quantities related to x.
2. Return the ρ'-idx index.

Subroutine Mining(ρ, ρ-idx)
Parameters:
    ρ = a pattern;
    ρ-idx = an index.
Procedure:
1. For each data sequence ds referenced by ptr_ds of an entry (ptr_ds, pos) in ρ-idx,
   (i) Starting from position (pos + 1) until |ds|, determine potential stems and increase the support of each of these stems by one.
2. Filter those stems that have a large enough support.
3. For each stem x found in the previous step,
   (i) form a sequential pattern ρ' from the prefix pattern ρ-pat and the stem x.
   (ii) call Discretize(ρ') to discretize the amounts associated with the items of ρ'.
   (iii) call Indexing(x, ρ, ρ-idx) to build the index ρ'-idx.
   (iv) call Mining(ρ', ρ'-idx) to discover sequential patterns from the index ρ'-idx.

Subroutine Discretize(ρ)
Parameters:
    ρ = a pattern that is a sequence;
Output: the arithmetic mean and standard deviation of the amounts associated with each item of the pattern ρ.
Procedure:
1. For each itemset ∈ ρ do
   a) For each item x ∈ itemset do
      (i) Calculate the arithmetic mean and standard deviation of the set of quantities associated with the item x.
      (ii) Store the arithmetic mean and standard deviation in the pattern ρ.

Figure 2. Specification of the algorithm MSP-QF

Figure 3. Schema of the procedures of the algorithm MSP-QF
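
As a minimal sketch of Step 3 and of the Discretization Function for a single-item pattern, the following Python code builds the index of entries (ptr_ds, (posItemSet, posItem)) for the item a over the database of Figure 1, collects the associated quantities, and discretizes them. The names DBSEQ, index_item and discretize, the 0-based positions, and the use of the population standard deviation are assumptions; the population form is chosen only because it reproduces the interval [ 2.8±1.6 ] of the worked example.

```python
from statistics import mean, pstdev

# Database of Figure 1. Each data sequence is a list of itemsets, and each
# itemset is a list of (item, quantity) pairs in lexicographical order.
DBSEQ = [
    [[("a", 1), ("d", 2)], [("b", 3), ("c", 4)], [("a", 3), ("e", 2)]],            # C1
    [[("d", 2), ("g", 1)], [("c", 5), ("f", 3)], [("b", 2), ("d", 1)]],            # C2
    [[("a", 5), ("c", 3)], [("d", 2)], [("f", 2)], [("b", 3)]],                    # C3
    [[("a", 4), ("b", 2), ("c", 3), ("d", 1)], [("a", 3)], [("b", 4)]],            # C4
    [[("b", 3), ("c", 2), ("d", 1)], [("a", 3), ("c", 2), ("e", 2)], [("a", 4)]],  # C5
    [[("b", 4), ("c", 3)], [("c", 2)], [("a", 1), ("c", 2), ("e", 3)]],            # C6
]


def index_item(db_seq, x):
    """Build the index of the 1-sequence <(x)> and its list of quantities.

    Only the first occurrence of x in each data sequence is recorded, as an
    entry (ptr_ds, (posItemSet, posItem)) with 0-based positions."""
    idx, quantities = [], []
    for ptr_ds, ds in enumerate(db_seq):
        for pos_itemset, itemset in enumerate(ds):
            items = [item for item, _ in itemset]
            if x in items:
                pos_item = items.index(x)
                idx.append((ptr_ds, (pos_itemset, pos_item)))
                quantities.append(itemset[pos_item][1])
                break  # later occurrences are reached by scanning from pos
    return idx, quantities


def discretize(quantities):
    """Summarize a list of quantities as (mean, population std deviation)."""
    return round(mean(quantities), 2), round(pstdev(quantities), 2)


a_idx, a_quantities = index_item(DBSEQ, "a")
print(a_idx)
print(a_quantities, discretize(a_quantities))
```

Keeping the raw quantity list next to the index, rather than only the summary, is a design choice of this sketch: it leaves room to discretize again if more quantities for the same pattern arrive later.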

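Partitioning (step 1 of the specification) and the handling of a pattern that was already found in an earlier partition (Step 5 of the procedure) can be sketched as follows. This is one possible reading, under the assumption that the raw quantity list of each stored pattern is retained so that discretization can be applied again to the pooled quantities; the names partition and merge_pattern and the string keys used for patterns are hypothetical.

```python
from statistics import mean, pstdev


def partition(db, size):
    """Split a list of data sequences into consecutive partitions holding
    at most `size` sequences each."""
    return [db[i:i + size] for i in range(0, len(db), size)]


def merge_pattern(frequent, pattern, quantities):
    """Add `pattern` to `frequent`, or pool its quantities with those of an
    identical pattern found in an earlier partition, then discretize again.

    `frequent` maps a pattern key to its pooled list of quantities.
    Returns the (mean, population standard deviation) of the pooled list."""
    pooled = frequent.setdefault(pattern, [])
    pooled.extend(quantities)
    return round(mean(pooled), 2), round(pstdev(pooled), 2)


frequent = {}
# Quantities of item a as they might be found in two partitions.
first = merge_pattern(frequent, "<(a)>", [1, 5])
second = merge_pattern(frequent, "<(a)>", [4, 3, 1])
print(partition(["C1", "C2", "C3", "C4", "C5", "C6"], 4))
print(first, second)
```

Pooling the raw quantities gives the same summary as a single pass over the whole database, which would not hold in general if only the per-partition means and deviations were averaged.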
6 Experiments and Results

The experiments to test the proposed technique were carried out in two different scenarios, which are described below.

6.1 Scenario 1: Real data

The technique was applied to the market basket analysis of a supermarket. These tests consisted of obtaining the set of frequent sequential patterns from the database collected over three non-consecutive periods. The first period goes from mid-December 1999 until mid-January 2000. The second period goes from early 2000 until the beginning of June of the same year. The third period goes from late August 2000 until the end of November 2000. This database consists of 88163 transactions, approximately 3000 unique items and 5133 customers.

The purpose of the tests is to discover patterns of customer behavior in the supermarket and, in addition, to obtain the amount of each of the items that will be purchased by these customers as a result of applying the proposed technique, which provides more accurate and meaningful information about the quantity purchased of each of the items.

Seven tests were carried out with minimum supports of 10%, 2%, 1.5%, 1.0%, 0.75%, 0.50% and 0.25%; the results are shown in Figure 4. These results were compared with the results of the MEMISP technique.

              MEMISP                        MSP-QF
minSup (%)    Exe. time (s)  No. patterns   Exe. time (s)  No. patterns
10.00         4              50             5              50
2.00          12             824            15             824
1.50          16             1371           19             1371
1.00          22             2773           27             2773
0.75          28             4582           35             4582
0.50          39             9286           50             9286
0.25          72             30831          89             30831

[Chart: Analysis of the market basket. Execution time (s) versus minimal support (%) for MEMISP and MSP-QF.]

Figure 4. Results of the tests for scenario 1

In the test with minSup=2%, 824 sequential patterns with quantity factors were obtained, some of which are:

Olive[ 1.06±0.51 ]$
Olive[ 1.03±0.5 ]$ Paprika[ 0.37±0.23 ]$
Olive[ 1.01±0.5 ]$ Porro[ 0.57±0.27 ]$
Celery[ 0.56±0.25 ]$ Lettuce[ 0.53±0.24 ], Lentils[ 0.54±0.23 ]$
Celery[ 0.56±0.26 ]$ Lettuce[ 0.55±0.24 ], Paprika[ 0.34±0.17 ]$
Lettuce[ 0.61±0.25 ], Lentils[ 0.56±0.26 ]$ Porro[ 0.59±0.24 ], Paprika[ 0.33±0.15 ]$
Porro[ 0.54±0.27 ], Lentils[ 0.62±0.25 ]$ Lentils[ 0.58±0.25 ]$ Paprika[ 0.35±0.17 ]$

From these sequential patterns we can state the following with regard to the purchases made by customers:
• Customers buy only olives in a quantity of 1.06±0.51.
• Customers who first bought only olives return a next time for paprika or for porro (leek), in quantities of 0.37±0.23 and 0.57±0.27 respectively. Those who later buy paprika previously bought olives in a quantity equal to 1.03±0.5, while those who later buy porro did so in a quantity equal to 1.01±0.5.
• Those who buy lettuce buy lentils at the same time, in amounts equal to 0.61±0.25 and 0.56±0.26 respectively. Later, these same customers buy porro and paprika in amounts equal to 0.59±0.24 and 0.33±0.15.
• Those who buy porro also buy lentils in the same transaction. Later they return to buy only lentils, and a next time they buy only paprika, in the amounts listed in the pattern.

6.2 Scenario 2: Synthetic data

In this second scenario, multiple synthetic databases (datasets) were generated by means of the Synthetic Data Generator Tool.

The process followed for the synthetic generation of the datasets is the one described by Agrawal and Srikant (1995), with the parameters referred to in the work of Lin and Lee (2005).

In this scenario, tests of effectiveness, efficiency and scalability were carried out. The effectiveness and efficiency tests were made with datasets generated with the following parameters: NI = 25000, NS = 5000, N = 10000, |S| = 4, |I| = 1.25, corrS = 0.25, crupS = 0.75, corrI = 0.25 and crupI = 0.75.

The results of these tests were compared with the results obtained for the algorithms PrefixSpan-1, PrefixSpan-2 and MEMISP.

Efficiency tests: A first subset of tests was run for |C|=10 and a database of 200,000 sequences, with different values for minSup. The results are shown in Figure 5.

[Chart: execution time (s) versus minimal support (%), from 0.25 to 2, for PrefixSpan-1, PrefixSpan-2, MEMISP and MSP-QF.]

Figure 5. Test results for |C|=10 and |DB|=200K

The second subgroup of tests was conducted with a dataset with values for |C| and |T| of 20 and 5 respectively. This value of |T| implies that the number of items per transaction increases, which means that the database is also larger and denser with respect to the number of frequent sequential patterns that may be obtained. The results of these tests are shown in Figure 6.

[Chart: execution time (s) versus minimal support (%), from 0.25 to 2, for PrefixSpan-1, PrefixSpan-2, MEMISP and MSP-QF.]

Figure 6. Test results for |C|=20, |T|=5 and |DB|=200K

A last subset of efficiency tests was carried out under the same parameters as the previous subset, with the exception of |T|, which was increased to 7.5. The results are shown in Figure 7.

[Chart: execution time (s) versus minimal support (%), from 0.5 to 2, for PrefixSpan-1, PrefixSpan-2, MEMISP and MSP-QF.]

Figure 7. Test results for |C|=20, |T|=7.5 and |DB|=200K

Efficacy tests: 92 efficacy trials were carried out with the same datasets as the efficiency tests. In four of the tests, carried out with |C|=20 and |T| equal to 2.5, 5 and 7.5, the algorithm did not obtain the same number of sequential patterns as found with the algorithms PrefixSpan-1 and PrefixSpan-2. These 4 tests represent about 4% of the total.

Scalability tests: The scalability tests used synthetically generated datasets with the same parameter values as the first subset of efficiency tests, and with a minimal support equal to 0.75%. The number of sequences in the dataset for these tests ranged from |DB|=1000K to 10000K, i.e. from one million to 10 million sequences. Figure 8 shows the results of these tests.

[Chart: execution time relative to |DB|=1000K versus millions of sequences, from 1 to 10, for PrefixSpan-1, PrefixSpan-2, MEMISP and MSP-QF.]

Figure 8. Results of scalability tests for minSup=0.75% and different sizes of datasets

7 Conclusion

We have proposed an algorithm for the discovery of sequential patterns that makes it possible, as part of the mining process, to infer from the amounts associated with each of the items of the transactions that make up a sequence the quantity factors linked to the frequent sequential patterns.

The proposed technique has been designed to use a compact set of indexes, in which the search for sequential patterns is focused, starting from frequent patterns that have already been found and that represent the prefixes of the patterns still to be found. That is why the size of the indexes decreases as the mining process progresses.

In addition, it has been observed that the information provided by the frequent patterns with quantity factors is much more accurate, since it not only gives us information on the temporal relationship of the items in the various transactions,
40
but also how the quantities of some items relate to those of others, which enriches the semantics provided by the set of sequential patterns.

Finally, from the results obtained in section 5, we can conclude that the proposed technique meets the objectives of the mining process: it is effective, efficient and scalable, since its execution time grows linearly as the sequence database grows, and when applied to large databases its performance turned out to be better than that of the other techniques discussed in this work.

References

Agrawal Rakesh and Srikant Ramakrishnan. 1994. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB.

Agrawal Rakesh and Srikant Ramakrishnan. 1995. Mining Sequential Patterns. 11th International Conference on Data Engineering (ICDE’95), Taipei, Taiwan.

Agrawal Rakesh and Srikant Ramakrishnan. 1996. Mining Quantitative Association Rules in Large Relational Tables. In Proceedings of the 1996 ACM SIGMOD Conference, Montreal, Québec, Canada.

Alatas Bilal, Akin Erhan and Karci Ali. 2008. MODENAR: Multi-objective differential evolution algorithm for mining numeric association rules. Applied Soft Computing.

Chen Yen-Liang and Huang T. Cheng-Kui. 2006. A new approach for discovering fuzzy quantitative sequential patterns in sequence databases. Fuzzy Sets and Systems 157(12):1641–1661.

Fiot Céline. 2008. Fuzzy Sequential Patterns for Quantitative Data Mining. In Galindo, J. (Ed.), Handbook of Research on Fuzzy Information Processing in Databases.

Han Jiawei, Pei Jian, Mortazavi-Asl Behzad, Chen Qiming, Dayal Umeshwar and Hsu Mei-Chun. 1996. Freespan: Frequent pattern-projected sequential pattern mining. In Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Karel Filip. 2006. Quantitative and Ordinal Association Rules Mining (QAR Mining). 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems (KES 2006). South Coast, UK: Springer, Heidelberg.

Lin Ming-Yen and Lee Suh-Yin. 2005. Fast Discovery of Sequential Patterns through Memory Indexing and Database Partitioning. Journal of Information Science and Engineering.

Market-Basket Synthetic Data Generator, http://synthdatagen.codeplex.com/.

Molina L. Carlos. 2001. Torturando los Datos hasta que Confiesen. Departamento de Lenguajes y Sistemas Informáticos, Universidad Politécnica de Cataluña. Barcelona, Spain.

Papadimitriou Stergios and Mavroudi Seferina. 2005. The fuzzy frequent pattern Tree. In 9th WSEAS International Conference on Computers. Athens, Greece: World Scientific and Engineering Academy and Society.

Pei Jian, Han Jiawei, Mortazavi-Asl Behzad and Pinto Helen. 2001. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In ICDE '01 Proceedings of the 17th International Conference on Data Engineering.

Srikant Ramakrishnan and Agrawal Rakesh. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), pages 3–17, Avignon, France.

Takashi Washio, Yuki Mitsunaga and Hiroshi Motoda. 2005. Mining Quantitative Frequent Itemsets Using Adaptive Density-Based Subspace Clustering. In Fifth IEEE International Conference on Data Mining (ICDM'05). Houston, Texas, USA: IEEE Computer Society.

Wang Shyue, Kuo Chun-Yn and Hong Tzung-Pei. 1999. Mining fuzzy sequential patterns from quantitative data. In 1999 IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, pages 962–966.

Zaki Mohammed J. 2001. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2):31–60.
