Nimertis Varsamis (Mech) PDF

.
, 2013
Contents

1.
2.
    2.1
3.
    1994 Mannila
    1998 CAP
    1999 CLOSE
    1999 UWEP
    2000 FP-Growth
    2000 Closet
    2002 SmartMiner
    2002 TFP
    2003 AFOPT
    2003 CLOSET+
    2004 Patricia
    2005 CFP-Tree
    2005 FP-Array, FP-Close
    2001 MAFIA
    2002 GenMax
    2005
    2005 Charm
    2006 DCI-CLOSED
    2011 Memory-based online pattern
    2012 DBV-Miner
    1997 Eclat
    2004 kDCI
4.
    4.1
    4.2
        Invert Index File (IIF)
        (CFI)
        (MFI)
    4.3
5.
    5.1 IIF
    5.2 CFI
    5.3 MFI
    5.4 (CFI Vs. MFI)
6.
7.
Appendix A
Appendix B


,
,
.

,

.

.
,
(Frequent Itemsets).

,
(Closed Frequent Itemsets)
(Maximum Frequent Itemsets). ,
MFI
, MFI-drive
,

.
1.

,
,
.
- , ,
1 . ,
, ,
, .
, ,
, .

,
,
. ,
, ,
, ,
,
,
.
.

(Data Mining)
(Knowledge Discovery from Data, KDD)
( 1).

. .
, ,
, ,
.
,
1.
1: [40]

, , .
(Data Cleaning)
, ,
,

.
,
(Data Integration),
.
(schema), -
- ,
, .

,
.
(Data Selection) (Data
Reduction). ,
.
,
,
:
1. ,
.
2. , -
.
3. ,

.

1.
Wavelet
Wavelet

(Data Cubes)
, .
1:
, -
(Data Transformation and Discretization) .
,
.

(smoothing)
(aggregation), (attribute/feature construction)

(normalization)
, (discretization)
,
(concept hierarchy generation)
[1].
-,
,
.
(market basket
analysis),
,

(Frequent Itemsets) .
2.
(Frequent Pattern)
(Association Rules)
.

(real time)
.
Rakesh Agrawal [2],

[3] super
market (market basket analysis),

. ,
super market 80%

, 70% .
, ,
(marketing)
, ..
,

a priori. ,
, ,
, ,

.
2: .
. ,
, ,
,
.

.

, (confidence)
(support) , .

.
2.1
},
{
},
-, TID (Transaction IDentifier).

A,

D (support) s, s
D
, A B A B.
(confidence) c D, c
D B, .

. , , :
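\[ \mathrm{support}(A \Rightarrow B) = P(A \cup B) \tag{2.1-1} \]
\[ \mathrm{confidence}(A \Rightarrow B) = P(B \mid A) \tag{2.1-2} \]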

(min_sup) (min_conf) .

0% - 100%, 0 1.
(itemset) k
k- (k-itemset), .. {, } 2-
(2-itemset).

(frequency), (support count)
(count) . ,
2.1-1,
, .
I
, I
(frequent itemsets). k-
Lk2.
2.1-1 2.1-2 :
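\[ \mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)} \tag{2.1-3} \]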
2.1-3
A
A, B

,
:
(Large).
,
k- Lk.
1. .

(min_support).
2.
.
(min_support)
(min_confidence).

,
.

.

.
. ,
100 {
1-:
{1},
{2},
{100}
} (
2-: {
} .
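\[ \binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30} \tag{2.1-4} \]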

,
(closed frequent itemsets)
(maximal frequent itemsets).
X D
Y
X D. X
D .
X D X
Y, .
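The distinction between closed and maximal frequent itemsets can be illustrated with a small C++ sketch. It assumes the frequent itemsets and their support counts are already available in a table; the type and function names (`SupportTable`, `closed_itemsets`, `maximal_itemsets`) are hypothetical and are not taken from the thesis implementation. The quadratic scan is chosen for clarity, not speed.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Itemset = std::set<char>;
using SupportTable = std::map<Itemset, int>;  // frequent itemset -> support count

// True when `sub` is a proper subset of `super`.
bool is_proper_subset(const Itemset& sub, const Itemset& super) {
    return sub.size() < super.size() &&
           std::includes(super.begin(), super.end(), sub.begin(), sub.end());
}

// Closed: no proper superset has the same support count.
std::vector<Itemset> closed_itemsets(const SupportTable& frequent) {
    std::vector<Itemset> closed;
    for (const auto& x : frequent) {
        assert(x.second > 0);  // a frequent itemset occurs at least once
        bool has_equal_superset = false;
        for (const auto& y : frequent) {
            if (x.second == y.second && is_proper_subset(x.first, y.first)) {
                has_equal_superset = true;
                break;
            }
        }
        if (!has_equal_superset) closed.push_back(x.first);
    }
    return closed;
}

// Maximal: no proper superset is frequent at all.
std::vector<Itemset> maximal_itemsets(const SupportTable& frequent) {
    std::vector<Itemset> maximal;
    for (const auto& x : frequent) {
        bool has_frequent_superset = false;
        for (const auto& y : frequent) {
            if (is_proper_subset(x.first, y.first)) {
                has_frequent_superset = true;
                break;
            }
        }
        if (!has_frequent_superset) maximal.push_back(x.first);
    }
    return maximal;
}
```

Every maximal itemset is also closed, so the maximal set is never larger than the closed set; the support counts of non-maximal itemsets, however, can no longer be read off the maximal set alone.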
3: .
, ,
.
,
.
D 10 ,
4
5 .
TID
1 {a, d, e}
2 {b, c, d}
3 {a, c, e}
4 {a, c, d, e}
5 {a, e}
6 {a, c, d}
7 {b, c}
8 {a, c, d, e}
9 {b, c, e}
10 {a, d, e}
4: .
,
, .
/
1-itemsets   2-itemsets    3-itemsets      4-itemsets         5-itemsets
{a}: 7       {a, b}: 0     {a, b, c}: 0    {a, b, c, d}: 0    {a, b, c, d, e}: 0
{b}: 3       {a, c}: 4     {a, b, d}: 0    {a, b, c, e}: 0
{c}: 7       {a, d}: 5     {a, b, e}: 0    {a, b, d, e}: 0
{d}: 6       {a, e}: 6     {a, c, d}: 3    {a, c, d, e}: 2
{e}: 7       {b, c}: 3     {a, c, e}: 3    {b, c, d, e}: 0
             {b, d}: 1     {a, d, e}: 4
             {b, e}: 1     {b, c, d}: 1
             {c, d}: 4     {b, c, e}: 1
             {c, e}: 4     {b, d, e}: 0
             {d, e}: 4     {c, d, e}: 2
5: .
(5)
2.1-4, 31 .
30%,
. ,
.
,

2.1-1, 2.1-2 2.1-3,
:
{
{ }
{ }, 30% 75%
{ },
40% 57%.
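The support and confidence figures of this example can be checked mechanically. The following C++ sketch hard-codes the ten example transactions and evaluates the definitions of support and confidence directly; the function names are illustrative assumptions, and the rules queried in the usage note are picked only as examples.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

using Itemset = std::set<char>;
using Database = std::vector<Itemset>;

// The ten example transactions, in TID order 1..10.
Database example_database() {
    return Database{
        Itemset{'a','d','e'}, Itemset{'b','c','d'}, Itemset{'a','c','e'},
        Itemset{'a','c','d','e'}, Itemset{'a','e'}, Itemset{'a','c','d'},
        Itemset{'b','c'}, Itemset{'a','c','d','e'}, Itemset{'b','c','e'},
        Itemset{'a','d','e'}};
}

// Number of transactions that contain every item of `items`.
int support_count(const Database& db, const Itemset& items) {
    int count = 0;
    for (const Itemset& t : db) {
        if (std::includes(t.begin(), t.end(), items.begin(), items.end())) ++count;
    }
    return count;
}

// support(A => B): fraction of transactions containing A and B together.
double support(const Database& db, const Itemset& a, const Itemset& b) {
    assert(!db.empty());
    Itemset both = a;
    both.insert(b.begin(), b.end());
    return static_cast<double>(support_count(db, both)) / db.size();
}

// confidence(A => B) = support_count(A u B) / support_count(A).
double confidence(const Database& db, const Itemset& a, const Itemset& b) {
    const int count_a = support_count(db, a);
    assert(count_a > 0);  // the rule is undefined when A never occurs
    Itemset both = a;
    both.insert(b.begin(), b.end());
    return static_cast<double>(support_count(db, both)) / count_a;
}
```

For instance, `confidence(db, {c, d}, {a})` divides the count of {a, c, d} (3) by the count of {c, d} (4), and `confidence(db, {a}, {c})` divides 4 by 7.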
3.
(Frequent Itemsets)

.

, ,

, 6,
7. , ,
( 20%)
.
TID   Items
1     {a, c, d}
2     {b, c, e}
3     {a, b, c, e}
4     {b, e}
5     {a, b, c, e}
6: D.
[Figure: lattice of the itemsets over {a, b, c, d, e}: abcde at the top, followed by the five 4-itemsets, the ten 3-itemsets and the ten 2-itemsets]
( 2)

7: D.

m
,

,
.
,

.
Apriori
Apriori
. 1994 R.
Agrawal R. Srikant
Boolean [3].
,
, . Apriori
, -
(+1)-.

,
.
L1. L1
L2, 2-,
L3
- . Li
.
Li ,
Apriori
.
Apriori:
.
, ,
, min_support,

. A
. ,
.

(antimonotonicity property) ,
,
. Apriori
.
Apriori
Lk Lk-1
.
1. : Lk,
- , Ck, (join)
Lk-1 .
Lk-1
[ ] j-

. (-1)-
[ ]
[ ]
].
(k-2) .
[ ]
] .
]
[ ]
[ ]
[ ]
] .
{ [ ]
[ ]
]
[
] }.
2. : Ck Lk
, - Ck.
Lk ,
- -
Lk. Ck, ,
.
Apriori,
(-1)- - Lk-1,
- Lk Ck.
hash-trees
.
8 Apriori.
apriori_gen (join) (prune)
Apriori
has_infrequent_subset.
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
    D, a database of transactions;
    min_sup, the minimum support count threshold.
Output:
    L, frequent itemsets in D.
Method:
    L1 = find_frequent_1-itemsets(D);
    for (k = 2; Lk-1 ≠ ∅; k++) {
        Ck = apriori_gen(Lk-1);
        for each transaction t ∈ D {          // scan D for counts
            Ct = subset(Ck, t);               // get the subsets of t that are candidates
            for each candidate c ∈ Ct
                c.count++;
        }
        Lk = {c ∈ Ck | c.count ≥ min_sup}
    }
    return L = ∪k Lk;

procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
    for each itemset l1 ∈ Lk-1
        for each itemset l2 ∈ Lk-1
            if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
                c = l1 ⋈ l2;                  // join step: generate candidates
                if has_infrequent_subset(c, Lk-1) then
                    delete c;                 // prune step: remove unfruitful candidate
                else add c to Ck;
            }
    return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)   // use prior knowledge
    for each (k-1)-subset s of c
        if s ∉ Lk-1 then
            return TRUE;
    return FALSE;
8: Apriori .
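As a rough illustration of the pseudocode, the following C++ sketch implements the same level-wise loop with the join and prune steps. It is a simplified reading, not the thesis code: candidates are counted by rescanning the database once per candidate instead of through a hash tree, and all type and function names other than `apriori_gen` and `has_infrequent_subset` are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <utility>
#include <vector>

using Itemset = std::set<char>;
using Database = std::vector<Itemset>;
using SupportTable = std::map<Itemset, int>;  // itemset -> support count

// Number of transactions that contain every item of `items`.
int support_count(const Database& db, const Itemset& items) {
    int count = 0;
    for (const Itemset& t : db)
        if (std::includes(t.begin(), t.end(), items.begin(), items.end())) ++count;
    return count;
}

// Prune test: a candidate is discarded if any (k-1)-subset is not frequent.
bool has_infrequent_subset(const Itemset& candidate, const SupportTable& previous) {
    for (char item : candidate) {
        Itemset subset = candidate;
        subset.erase(item);
        if (previous.find(subset) == previous.end()) return true;
    }
    return false;
}

// Join step (equal on the first k-2 items, ordered on the last) plus prune step.
std::set<Itemset> apriori_gen(const SupportTable& previous) {
    std::set<Itemset> candidates;
    for (const auto& a : previous) {
        for (const auto& b : previous) {
            const Itemset& l1 = a.first;
            const Itemset& l2 = b.first;
            if (std::equal(l1.begin(), std::prev(l1.end()), l2.begin()) &&
                *l1.rbegin() < *l2.rbegin()) {
                Itemset c = l1;
                c.insert(*l2.rbegin());
                if (!has_infrequent_subset(c, previous)) candidates.insert(c);
            }
        }
    }
    return candidates;
}

// Returns every frequent itemset with its support count.
SupportTable apriori(const Database& db, int min_sup) {
    assert(min_sup >= 1);
    SupportTable all, level;
    for (const Itemset& t : db)
        for (char item : t) ++level[Itemset{item}];
    for (auto it = level.begin(); it != level.end();) {
        if (it->second < min_sup) it = level.erase(it); else ++it;
    }
    while (!level.empty()) {
        all.insert(level.begin(), level.end());
        SupportTable next;
        for (const Itemset& c : apriori_gen(level)) {
            const int count = support_count(db, c);
            if (count >= min_sup) next[c] = count;
        }
        level = std::move(next);
    }
    return all;
}
```

Calling `apriori(db, 2)` on a five-transaction database such as {a, c, d}, {b, c, e}, {a, b, c, e}, {b, e}, {a, b, c, e} returns the frequent itemsets level by level; the minimum support count of 2 is chosen here only for illustration.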
Apriori,
,
TID-itemset, TID (Transaction IDentifier)
itemset .
. ,
item-TID_set,
.
.
, ,
.
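The difference between the horizontal TID-itemset layout and the vertical item-TID_set layout can be sketched in C++ as follows. The sketch assumes TIDs are numbered from 1 in database order; `to_vertical` and `tids_of` are hypothetical helper names. In the vertical layout the support of an itemset is the size of the intersection of its items' TID sets, so no rescan of the transactions is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <vector>

using Itemset = std::set<char>;
using TidSet = std::set<int>;
using VerticalDb = std::map<char, TidSet>;  // item -> TIDs of the transactions containing it

// Convert a horizontal database (TID -> itemset) into the vertical layout.
VerticalDb to_vertical(const std::vector<Itemset>& db) {
    VerticalDb vertical;
    for (std::size_t i = 0; i < db.size(); ++i)
        for (char item : db[i]) vertical[item].insert(static_cast<int>(i) + 1);
    return vertical;
}

// TIDs of the transactions that contain every item of `items`.
TidSet tids_of(const VerticalDb& vertical, const Itemset& items) {
    assert(!items.empty());
    auto it = items.begin();
    auto found = vertical.find(*it);
    if (found == vertical.end()) return TidSet();
    TidSet result = found->second;
    for (++it; it != items.end(); ++it) {
        found = vertical.find(*it);
        if (found == vertical.end()) return TidSet();
        TidSet common;
        std::set_intersection(result.begin(), result.end(),
                              found->second.begin(), found->second.end(),
                              std::inserter(common, common.begin()));
        result.swap(common);
    }
    return result;
}
```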
Apriori
.

Apriori .
1994 Mannila
Mannila [4] Apriori
.

.
,
,
itemset
. , ,
L2={AB, BC, AC, AE, BE, AF, CG}, Mannila ABC

ABE L3,
L2, L4 . ,
X s+1

. ,
:
{
|| |
C1 := {{A} | A ∈ R};
s := 1;
while Cs ≠ ∅ do
    database pass: let Ls be the elements of Cs that are covering;
    candidate generation: compute Cs+1 from Ls;
    s := s + 1;
od;
9: Mannila
h,
, (h,).
Chernoff [5], [6] :
[

:
[
, ,
3.000
.

Apriori,
,
.
1998 CAP
[7] Raymond, Laks Jiawei
, , : 1.
, 2.
3.
.
10: CAP.
,
(breakpoints),

. ,
,

(constrained association queries CAQ),

. ,
, .
, , ,
- (black-box)
, ,
,
.
, (antimonotonicity)
(succinctness),

. CAP

.
1999 CLOSE
[8] Nicolas Pasquier, Yves Bastide, Rafik Taouil Lotfi Lakhal
, CLOSE.
11: CLOSE .

,
:
1.

.
2.
,
.
3.
.
1999 UWEP
[9] Necip Fazıl Ayan, Abdullah Uz Tansel Erol Arkun
,
.
FUP2 [10] Partition-Update [11],
UWEP,
,
.
, UWEP

,
, .
,
,
. (lookahead) :
1: (small) ,
.
2: X , DB , db

D.
3: X .
.
, ,
.
12: UWEP.
2000 FP-Growth
Jian Pei, Jiawei Han, Yiwen Yin Runying Mao [12]
, , , FP-tree, ,
FP-Growth,
Apriori [3]
Tree-projection [13].
13: FP-tree.
, FP-tree (frequent pattern tree),

(root) null,
-
.
: ,

.
,

.
FP-Growth
:
14: FP-tree.
FP-tree,
,
.

1-
-,
.
.
Apriori,
.
,
-

.
, FP-Growth ,

,
.
15: FP-Growth.
2000 Closet
[14] Jian Pei, Jiawei Han Runying Mao
, CLOSET, [12]

. :
1. (prefix) (FPTree)
,
.
2.
.
3.
16: .
-.
17: CLOSET.

, Pasquier et al. [15]

,
. , FP-Tree,
,

.

, FPTree,
, CHARM [16] A-Close
[15].
2002 SmartMiner
Qinghua Zou, Wesley W. Chu, Baojing Lu, [17]

(MFI), ,
Mafia [18] GenMax [19]. , SmartMiner,

.
18: SmartMiner
MFI.

.
,
.
19: SmartMiner

.
20: infMfi .
2002 TFP
[20] Jiawei Han, Jianyong Wang, Ying Lu Petre Tzvetkov

TFP -
, close_node_count
descendant_sum, FP-tree,

.
21:
(descendant_sum) FPtree.

,
(hash table)
.
CLOSET CHAR
.
22: .
2003 AFOPT
Guimei Liu, Hongjun Lu, Jeffrey Xu Yu,Wei Wang Xiangye Xiao
[21]
,
AFOPT,
:
-
,
.

-.
23: AFOPT.
24: AFOPT
.
AFOPT
-
.

.
25: ,
(MFI).
2003 CLOSET+
[22] Jianyong Wang, Jiawei Han Jian Pei

, CLOSET+,
, CLOSET [14],
CHARM [16] OP [23].
26: CLOSET+.
CLOSET+ :

,

.
CLOSET+
:

(CFI),
.

, FPtree.
27: .
2004 Patricia
Mining Frequent Itemsets using Patricia Tries [24] Andrea
Pietracaprina Dario Zandolin PatriciaMine,
.
28: Patricia.
, Eclat, FPtree DCI, :

-
, Patricia tree,

, .

,

.
29: patricia: a. , b.
, c. Patricia.
2005 CFP-Tree

Guimei Liu, Hongjun Lu
Jeffrey Xu Yu [25]
.
CFP-tree
,
.
CFP-tree

:
30: , CFPtree.

( minimum
support constraints query).

(superset query).

(subset query).
m
(similarity query).

- (multiple constraints query).
31: CFP-tree.
2005 FP-Array FP-Close

Goesta Grahne Jianfei Zhu [26] ,
FP-array,
FP-tree

.
32: ,
FP
,

FP-array.
FP-
tree, FP-growth*, FP-tree (MFI-tree, CFI-tree)

(FPmax, FPclose)
.
2001 MAFIA
Doug Burdick, Manuel Calimlim Johannes Gehrke [18]
MAFIA,
.

.
33: MAFIA.
, ,
:
-
Simple
.
PEP
.
-
FHUT

.
HUTMFI .
Project (bitmaps)
.
2002 GenMax
[19] Karam Gouda Mohammed J Zaki GenMax,
.
, (progressive
focusing) ,
(diffset propagation)
.
34:
(progressive focusing).
GenMax
(backtracking search) .

},
.
,

},
.
{
, -
} ,
(pruning),
.
35: GenMax.
36:
(diffset propagation).
2005
[27] ,
on-line
.
.
37:
( ).

,
,
on-line .
2005 Charm
Mohammed J Zaki Ching-Jui Hsiao,

, [28] CHARM.
diffset
[29]
-
.
(hash
table).
38: CHARM.

, CHARM-L.
2006 DCI-CLOSED
Claudio Lucchese, Salvatore Orlando Raffaele Perego [30]
,
DCI_CLOSED.
39: DCI_CLOSED .
DCI_CLOSED
,
.
,

.
2011 - Memory-based online pattern

[31] Mei Qiao Degan Zhang

(bitmap)
, - -
. ,

,
.
40: ( ),
( ) ().

,
1
.
,

.
41: .
, ,

.
2012 DBV-Miner

[32] Bay Vo, Tzung-Pei Hong Bac Le,
DBV-Miner.
42: .

.
DBV-tree

AND, OR.
,

. DBV-Miner

.
43: DBV-Miner .
1997 Eclat

Eclat, Mohammed Javeed
Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara Wei Li [33].
, ,
,
.
, .

:
-
ClusterApr
.
Eclat
.
MaxEclat
.
.
Clique -
.
MaxClique, , -
. , , .
TopDown -
.
2004 kDCI
Claudio Lucchese, Salvatore Orlando, Paolo Palmerini, Raffaele Perego
Fabrizio Silvestri [34] DCI [35].
, kDCI,

.
44: kDCI.
kDCI
.
45: kDCI.
, kDCI
, .
,

.
,
PASCAL
Mining frequent patterns with counting
inference Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, L. Lakhal [36].
,
,

.
46:
(key_elements) kDCI (PASCAL).
4.
, MFI-drive.

. ,
.

.
. ,
,
.
,
.
4.1

(CFI)
(MFI),
.

MFI CFI.
MFI-Drive, ,
(MFI)
, ,
..
.
MFI CFI,
CFI
. MFI
, .
MFI CFI

40%.
, ,
,

.
4.2
.
C++, ,
standard libraries containers,
.
MFI ,

(Invert Index File).

,
. IIF C++ map
container

.
.
47: IIF.
IIF,

CFI,
map container.
, ,
3.
CFI,
MFI,
.

(map container) ,
.

CFI MFI.
o (subset query),
(superset query) (similarity query).
48: main.
Invert Index File (IIF)

(IIF)

.
,
, ,
(), . ,
-,
, ,
,
.
C++ map container, 47.

, (amortized)
, ,
[37].

std::string,
. unsigned integer,
,
std::string , bit vector
bitset, ,
.
49: IIF.
IIF,
. ,
4.3
,
.
IIF ,
(3):
50: IIF 3.
(3), 32
(8) , IIF, ,
IIF .
IIF

. , 4 3
,
00100101.
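A minimal C++ sketch of such an inverted index, built on a `std::map` whose values are bitstrings held in `std::string`, might look as follows. The exact record layout of the thesis is not recoverable from the text, so the layout here (one character per transaction, '1' when the item occurs) and the names `build_iif` and `count_ones` are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Transaction = std::vector<int>;              // item identifiers
using InvertedIndex = std::map<int, std::string>;  // item -> bitstring over the transactions

// Build the index: position t of an item's bitstring is '1'
// exactly when transaction t contains the item.
InvertedIndex build_iif(const std::vector<Transaction>& db) {
    InvertedIndex iif;
    for (std::size_t tid = 0; tid < db.size(); ++tid) {
        for (int item : db[tid]) {
            assert(item >= 0);  // assumption: items are non-negative identifiers
            std::string& bits = iif[item];
            if (bits.empty()) bits.assign(db.size(), '0');
            bits[tid] = '1';
        }
    }
    return iif;
}

// Support count of an entry = number of '1' characters in its bitstring.
std::size_t count_ones(const std::string& bits) {
    return static_cast<std::size_t>(std::count(bits.begin(), bits.end(), '1'));
}
```

Keeping the keys in a `std::map` keeps the items ordered, which makes a sequential pass over the index deterministic; a bit-packed container would use less memory than one character per transaction.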
(CFI)
(CFI)
IIF
,
.
,
+1, ,
, Apriori
.
, 50 154
10 (2-itemset) 154-10
(3), . ,
49, 3-itemset , 154-10-49, ,

3-itemset 154-10-49.

,

( 8 48).
2-itemsets.
pivot itemsets,
.
51: pivot itemsets (2-itemsets).
, pivot
itemsets,

. N-itemsets sub pivot itemsets
,
.
52: sub pivot itemsets (N-itemsets).

, .
,
. ,
CFI
CFI .
.
53: CFI.
CFI,
CFI

.
-
,
.
54: CFI .
CFI 504.
pivot itemsets (2-itemsets) .
. ,
pivot itemsets,
sub pivot IIF.
,
.
CFI
, 5-10 5-10-12.

, .
4

. .
5, 10 12
,
3.
55: pivot itemsets, sub pivot itemsets, CFI

CFI.
(MFI)
MFI CFI.
.

.
CFI,

:
1.
. 10-12
10, 10-12-3-5, 10-154, 10-23, 10-3-5, 10-49. 10
,
. 10-154 ,

.
10-12-3-5, ,
10-12 MFI.
56:
MFI.
2.
,
.

. , ,
.
57: MFI.
MFI
CFI .
MFI .
1-itemsets
. 21%:
58: MFI .

MFI CFI

.
MFI
. MFI
CFI
CFI,
5.

.
. 7.
.
CFI, MFI
. ,
,

.

.
, .
I.
(Subset Query)
.
, (subset query)
,
.
59: (subset
query).

, CFI MFI
:
60:
.
10-4-5-99-12,

. ,
, ,
CFI, ,
MFI.

,

CFI MFI, .
II.
(Superset Query)
(superset
query).
.
61 -
.
, MFI
.
61:
(superset query).
62:
.

CFI MFI,

(12). ,
, 56%.
III.
(Similarity Query)
(similarity
query).
, m
. m

.

63.
.
m-itemsets

.
63: (similarity query).

m-itemsets
superset
. .
64:
.

CFI MFI.
CFI
, MFI
.
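The subset and superset queries can be pictured with a small C++ sketch over a plain list of stored itemsets. The definitions used here are an assumption based on the usual meaning of the terms: a subset query returns the stored itemsets contained in the query itemset, and a superset query returns the stored itemsets that contain it. The function names are hypothetical, and the linear scan stands in for whatever index structure the implementation uses.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

using Itemset = std::set<int>;

// True when every item of `inner` also occurs in `outer`.
bool contains_all(const Itemset& outer, const Itemset& inner) {
    return std::includes(outer.begin(), outer.end(), inner.begin(), inner.end());
}

// Subset query: stored itemsets that are subsets of `query`.
std::vector<Itemset> subset_query(const std::vector<Itemset>& stored, const Itemset& query) {
    assert(!query.empty());
    std::vector<Itemset> hits;
    for (const Itemset& s : stored)
        if (contains_all(query, s)) hits.push_back(s);
    return hits;
}

// Superset query: stored itemsets that are supersets of `query`.
std::vector<Itemset> superset_query(const std::vector<Itemset>& stored, const Itemset& query) {
    assert(!query.empty());
    std::vector<Itemset> hits;
    for (const Itemset& s : stored)
        if (contains_all(s, query)) hits.push_back(s);
    return hits;
}
```

The cost of either query grows with the number of stored itemsets, which is why answering from the smaller MFI collection instead of the CFI collection can reduce query time.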
4.3

.
,

.
thn
C++,

.

.
, CFI

,
,
.
(n2)

. CFI
. ,
map container C++, ,

.
, CFI
/,

. ,
,
.
, ,

.
65: pivot-IIF, sub-pivot-IIF CFI ( 3).
pivot-IIF (2itemsets) 1-itemsets, sub-pivot-IIF pivot-IIF

CFI. , IIF
2-itemsets (pivot-IIF).
, ,
154, 4 ...,
, 10. , 12,
.
pivot-IIF
( 65 ),
(3),
( 3-10-12
3-10-5 3-10),
.
, 3-10-12 3-12-5
3-10 3-12 ,
.

map container, sub-pivot
,
sub-pivot ,
. ,
45-12, 45-23 ,
99-12 .
.
CFI ,
CFI.

,
(string) (0) (1),
.
bitstring .
, IIF 10-5-3:4:001000111
10-5-3 , , (4)
, ,
.

IIF,
.
.
,
bitstring .
(1)
. ,
,
AND (bitwise
AND) .
66: .
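The bitwise AND over two bitstrings can be sketched in C++ as below, assuming each bitstring is a `std::string` of '0' and '1' characters with one position per transaction. The helper names are illustrative. The number of '1' characters in the result is the support count of the combined itemset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Position-wise AND of two bitstrings of equal length.
std::string bit_and(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());  // both cover the same transactions
    std::string out(a.size(), '0');
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == '1' && b[i] == '1') out[i] = '1';
    }
    return out;
}

// Support count = number of transactions whose bit is set.
std::size_t count_ones(const std::string& bits) {
    return static_cast<std::size_t>(std::count(bits.begin(), bits.end(), '1'));
}
```

For example, `bit_and("10110", "11010")` yields `"10010"`, so the two items occur together in two transactions.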
66 1itemset IIF (7) (10)

, .
,
.
.

.
(Run-length encoding [38]).

,
. MFI-drive,

.

, .
, ,
trade-off
.
.
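Run-length encoding of a bitstring can be sketched in C++ as follows. The encoding chosen here is an assumption, not the format used by MFI-drive: the bitstring is stored as a list of run lengths that always starts with a run of '0' (of length zero if the string begins with '1') and then alternates between '1' and '0'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encode a '0'/'1' string as alternating run lengths, starting with a run of '0'.
std::vector<std::size_t> rle_encode(const std::string& bits) {
    std::vector<std::size_t> runs;
    char current = '0';
    std::size_t length = 0;
    for (char bit : bits) {
        assert(bit == '0' || bit == '1');
        if (bit == current) {
            ++length;
        } else {
            runs.push_back(length);  // close the finished run
            current = bit;
            length = 1;
        }
    }
    runs.push_back(length);  // close the last run
    return runs;
}

// Rebuild the bitstring from the alternating run lengths.
std::string rle_decode(const std::vector<std::size_t>& runs) {
    std::string bits;
    char current = '0';
    for (std::size_t length : runs) {
        bits.append(length, current);
        current = (current == '0') ? '1' : '0';
    }
    return bits;
}
```

Long runs of identical bits shrink to a single number, while a bitstring that alternates frequently can grow; this is the time and space trade-off that compression introduces.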
67: ( 4).
67

154, 49 99.
.

50%.
5.
MFI-drive
6.
Intel(R) Core(TM) i5 CPU M560
@ 2.67GHz-2660 MHz, 4GB, Windows 7 Professional.
NetBeans IDE 7.2 GNU Compiler
Collection Cygwin.
,
, .

.
3000 1% 68.
68:
.
Kosarak T40I10D100K
(76% 42%) ( 0
1000), . chess
connect (24% 34%)
( 2500 3000) .
6
Frequent Itemset Mining

Dataset Repository (http://fimi.ua.ac.be/data/).
, mushroom, pumsb pumsb_star

, . ,

, .

3000 . T40I10D100K
, (chess, connect)
.
69: 3000 .

MFI-drive .
.
5.1 IIF

(RLE),
IIF. T40I10D100K, pumsb
connect, . , chess
connect, ,
.
.
A ( ).
70: IIF
.
IIF,
( 71).
71: IIF
.
5.2 CFI

(CFI).
, T40I10D100K.dat pumsb.dat,
.

CFI,
.
A ( ).
.
[Chart: CFI Creation Time - T40I10D100K.dat (5%). Y-axis: Seconds; X-axis: Input size; series: Compression Off, Compression On.]
72: CFI
, T40I10D100K.dat, .
[Chart: CFI Creation Time - pumsb.dat (85%). Y-axis: Seconds; X-axis: Input size; series: Compression Off, Compression On.]
73: CFI ,
pumsb.dat, .
[Chart: CFI Size - T40I10D100K.dat (5%). Y-axis: Kilobytes; X-axis: Input size; series: Compression Off, Compression On.]
74: CFI
T40I10D100K.dat, .
[Chart: CFI Size - pumsb.dat (85%). Y-axis: Kilobytes; X-axis: Input size; series: Compression Off, Compression On.]
75: CFI
pumsb.dat, .
5.3 MFI
MFI-drive

MFI. ,

CFI.
, MFI

CFI.
A ( ).
[Chart: MFI Creation Time - T40I10D100K.dat (5%). Y-axis: Seconds; X-axis: Input size.]
76: MFI
T40I10D100K.dat.
[Chart: MFI Creation Time - pumsb.dat (85%). Y-axis: Seconds; X-axis: Input size.]
77: MFI
pumsb.dat.
[Chart: MFI Vs. CFI records reduce - T40I10D100K.dat (5%). Y-axis: Reduce (%); X-axis: Input size.]
78:
MFI CFI,
T40I10D100K.dat.
[Chart: MFI Vs. CFI records reduce - pumsb.dat (85%). Y-axis: Reduce (%); X-axis: Input size.]
79:
MFI CFI,
pumsb.dat.
5.4 (CFI Vs. MFI)

, .
T40I10D100K.dat pumsb.dat.
1000 2%
CFI 3133 MFI 2612 (17% ).
100
85%.

.
A ( ).

0,001 1
.
IIF.
0,1 IIF 300 ,

30.
[Chart: Subset Query Timings - pumsb.dat (1000, 85%, 0.4). Left axis: Seconds; right axis: % Time Reduce; series: CFI time, MFI time, % Time Reduce.]
80:
CFI MFI pumsb.dat.
.
[Chart: Superset Query Timings - pumsb.dat (1000, 85%, 0.1). Left axis: Seconds; right axis: % Time Reduce; series: CFI time, MFI time, % Time Reduce.]
81:
CFI MFI pumsb.dat.
.
[Chart: Similarity Query Timings - pumsb.dat (1000, 85%, 0.1). Left axis: Seconds; right axis: % Time Reduce; series: CFI time, MFI time, % Time Reduce.]
82:
CFI MFI pumsb.dat.
.
[Chart: Subset Query Timings - T40I10D100K.dat (1000, 2%, 0.4). Left axis: Seconds; right axis: % Time Reduce; series: CFI time, MFI time, % Time Reduce.]
83:
.
[Chart: Superset Query Timings - T40I10D100K.dat (1000, 2%, 0.1). Left axis: Seconds; right axis: % Time Reduce; series: CFI time, MFI time, % Time Reduce.]
84:
.
[Chart: Similarity Query Timings - T40I10D100K.dat (1000, 2%, 0.4). Left axis: Seconds; right axis: % Time Reduce; series: CFI time, MFI time, % Time Reduce.]
85:
.
6.
, ,

MFI-drive:
-
RLE ,
, trade off
. IIF, CFI MFI

53%, 67% 67%,
,
23%, 4% 1% .
IIF
CFI
16% 45% 22% 16%
. MFI

.

.

,
, .
IIF

.
CFI
.
,
,
,
(nk) [39] ( )
.

, , .
MFI
CFI .
CFI.
MFI CFI
, ,
, , 86%
4% .

. 4%
MFI CFI,
,
18%, 26% 17% . 86%
,
90%, 78% 78%. ,
MFI,

.
7.

, MFI-drive,

,

MFI-drive.

MFI-drive RLE
.

,
,
pivot, CFI MFI. ,

.
MFI-drive
CFI sub
pivot .

, ,
.
,
(IIF),
CFI. ,

. CFI
.

sub pivot
IIF,
.

.

,
CFP-trees,
3.
,
, MFI-drive
CFP-tree.
MFI, MFI-trees, CFI,

CFI-trees. MFI-drive
,
, ,
.
4.2
MFI-drive, ,
. , ,

.
,
,
. MFI-drive

,
,
.
MFI
, CFI
, .
, , MFI ,

, .

,
,
,
CFI.
, MFI-drive
,
MFI
,
, expanded
MFI.
pivot CFI,
CFI MFI.
MFI ,

,
. ,
MFI.
MFI-drive. ,

.

MFI-drive,

.

,
.
, ,
,
. ,
. ,
MFI

expanded MFI

. MFI
.

.

MFI-drive ,
. ,
.
A ( )

IIF (5.1 IIF) .
T40I10D100K.dat
Input Size   Time with Compression (Sec)   Time with no Compression (Sec)   Time Relative Difference   Memory Difference (KB)
100          0,057                         0,046                            18%                        49,5%
1000         0,395                         0,333                            16%                        52,7%
5000         1,950                         1,507                            23%                        53,0%
10000        3,733                         3,125                            16%                        53,1%
15000        5,980                         4,571                            24%                        53,3%
20000        8,028                         6,151                            23%                        53,3%
25000        9,812                         7,618                            22%                        53,1%
30000        11,762                        11,975                           -2%                        54,5%
35000        13,576                        14,357                           -6%                        53,2%
48000        19,505                        19,812                           -2%                        53,3%
2: IIF
T40I10D100K.dat.
pumsb.dat
Input Size   Time with Compression (Sec)   Time with no Compression (Sec)   Time Relative Difference   Memory Difference (KB)
100          0,067                         0,062                            8%                         20%
1000         0,696                         0,535                            23%                        21%
5000         3,390                         2,636                            22%                        23%
10000        6,890                         5,324                            23%                        22%
15000        10,488                        7,623                            27%                        22%
20000        14,045                        10,664                           24%                        22%
25000        18,017                        12,620                           30%                        22%
30000        22,604                        15,043                           33%                        22%
35000        26,379                        26,837                           -2%                        22%
48000        36,556                        36,005                           2%                         22%
3: IIF
pumsb.dat.
connect.dat
Input Size   Time with Compression (Sec)   Time with no Compression (Sec)   Time Relative Difference   Memory Difference (KB)
100          0,041                         0,036                            12%                        0%
1000         0,327                         0,343                            -5%                        19%
5000         1,658                         1,861                            -12%                       27%
10000        3,452                         3,624                            -5%                        30%
15000        5,246                         5,756                            -10%                       32%
20000        6,905                         7,565                            -10%                       29%
25000        8,990                         9,380                            -4%                        29%
30000        10,883                        11,429                           -5%                        30%
35000        13,150                        13,478                           -2%                        31%
48000        14,877                        15,241                           -2%                        30%
4: IIF
connect.dat.

CFI, (5.2 CFI).
T40I10D100K.dat
Input Size (Support 5%)   Time with Compression (sec)   CFI Size (KB)   Time with no Compression (sec)   CFI Size (KB)   CFI Entries
100                       12,589                        26              2,792                            68              684
1000                      44,179                        109             19,936                           322             330
10000                     389,721                       1003            174,3                            3008            308
20000                     797,883                       2029            362,562                          6133            314
40000                     1744,232                      4053            967,893                          12266           314
60000                     2978,402                      6075            1491,692                         18398           314
80000                     4306,466                      8134            2026,547                         24687           316
100000                    7104,553                      10165           2502,422                         30860           316
5: CFI
T40I10D100K.dat.
pumsb.dat
Input Size (Support 85%)   Time with Compression (sec)   CFI Size (KB)   Time with no Compression (sec)   CFI Size (KB)   CFI Entries
100                        173,27                        480             138,7                            485             2773
1000                       1484,806                      23576           1629,293                         23714           19885
10000                      404,667                       80830           339,91                           81541           6940
20000                      776,027                       187239          492,48                           188907          8129
30000                      1165,328                      257557          544,256                          259910          7413
40000                      1406                          224217          747,354                          233473          7475
49000                      1969,923                      463687          1046                             467493          8506
6: CFI
pumsb.dat.

MFI,

CFI (5.3 MFI).
T40I10D100K.dat
Input Size (Support 5%)   Time with Compression (sec)   MFI_size (KB)   Time with No Compression (sec)   MFI_size (KB)
100                       0,406                         20              0,281                            55
1000                      0                             102             0,016                            310
10000                     0,016                         978             0,048                            558
20000                     0,015                         1978            0,015                            6055
40000                     0,015                         3953            0,046                            12110
60000                     0,016                         5924            0,047                            18164
80000                     0,031                         7933            0,047                            24375
100000                    0,031                         9914            0,047                            30469
86:
MFI
T40I10D100K.dat.
T40I10D100K.dat
Input Size (Support 5%)   Entries After Prefix Purge   Entries After Scattered Purge   Difference with CFI
100                       584                          568                             17%
1000                      318                          318                             4%
10000                     304                          304                             1%
20000                     310                          310                             1%
40000                     310                          310                             1%
60000                     310                          310                             1%
80000                     312                          312                             1%
100000                    312                          312                             1%
87:
MFI CFI,
T40I10D100K.dat.
pumsb.dat
Input Size (Support 85%)   Time with Compression (sec)   MFI_size (KB)   Time with No Compression (sec)   MFI_size (KB)
100                        43,134                        92              41,575                           93
1000                       485,911                       1450            610,135                          1463
10000                      128,497                       9972            178,544                          10086
20000                      154,706                       20668           152,803                          20920
30000                      152,429                       30050           141,852                          30391
40000                      152,974                       38312           149,387                          38710
49000                      193                           51287           197,341                          51818
88:
MFI
pumsb.dat.
pumsb.dat
Input Size (Support 85%)   Entries After Prefix Purge   Entries After Scattered Purge   Difference with CFI
100                        1917                         630                             77%
1000                       12750                        1438                            93%
10000                      4434                         1029                            85%
20000                      5166                         1069                            87%
30000                      4638                         1036                            86%
40000                      4646                         990                             87%
49000                      5248                         1082                            87%
89:
MFI CFI,
pumsb.dat.
5.4 (CFI Vs. MFI).
pumsb.dat (1000, 85%)
          Subset (factor 0,4)           Superset (factor 0,1)         Similarity (factor 0,1)
Case ID   CFI time  MFI time  Diff      CFI time  MFI time  Diff      CFI time  MFI time  Diff
1         16,646    1,56      91%       1,451     0,109     92%       122,102   24,134    80%
2         41,263    4,009     90%       148,154   29,515    80%       48,688    10,764    78%
3         21,715    2,246     90%       117,109   21,232    82%       1,092     0,125     89%
4         9,594     0,967     90%       164,956   30,311    82%       138,497   26,255    81%
5         8,705     1,232     86%       6,599     2,964     55%       0,982     0,094     90%
6         11,17     1,295     88%       122,898   22,371    82%       164,55    31,824    81%
7         19,937    2,372     88%       104,068   18,564    82%       5,553     2,902     48%
8         4,992     0,468     91%       45,818    10,921    76%       1,451     0,359     75%
9         9,235     0,78      92%       7,238     1,545     79%       50,014    7,91      84%
10        18,205    1,919     89%       113,382   22,823    80%       151,056   30,591    80%
11        15,989    1,638     90%       78,391    15,256    81%       45,724    7,332     84%
12        15,429    1,888     88%       4,836     1,341     72%       80,466    15,382    81%
13        9,001     0,873     90%       0,951     0,141     85%       7,91      1,404     82%
14        16,255    1,357     92%       139,043   28,517    79%       70,886    14,196    80%
15        10,499    1,061     90%       1,701     0,327     81%       37,487    9,36      75%
16        12,808    1,388     89%       72,353    14,445    80%       77,829    11,497    85%
17        7,581     0,78      90%       6,552     1,623     75%       1,045     0,109     90%
18        13,791    1,529     89%       162,1     31,559    81%       5,226     1,295     75%
19        5,679     0,468     92%       4,4       1,622     63%       4,961     1,623     67%
20        8,939     0,78      91%       108,591   21,544    80%       132,242   33,82     74%
21        11,888    1,154     90%       115,535   22,682    80%       77,236    15,163    80%
22        10,967    1,185     89%       71,995    15,319    79%       79,592    16,349    79%
23        11,997    1,638     86%       82,696    14,305    83%       159,714   34,897    78%
24        7,956     0,702     91%       126,408   24,867    80%       6,505     2,886     56%
25        7,847     0,561     93%       71,339    14,087    80%       47,409    11,934    75%
26        8,596     0,796     91%       109,559   20,405    81%       150,947   32,324    79%
27        13,197    1,497     89%       82,743    17,082    79%       87,579    18,143    79%
28        8,018     0,702     91%       72,93     15,023    79%       87,174    17,831    80%
29        12,418    1,545     88%       139,31    21,809    84%       168,45    37,659    78%
30        9,859     0,686     93%       52,4      10,967    79%       114,084   13,338    88%
31        10,14     1,076     89%       47,128    11,56     75%       7,66      1,701     78%
32        7,956     0,765     90%       47,69     11,404    76%       157,841   33,805    79%
33        9,126     0,921     90%       48,142    11,341    76%       142,695   26,551    81%
34        9,454     0,889     91%       6,864     2,824     59%       5,351     2,294     57%
35        9,626     0,795     92%       1,529     0,343     78%       4,009     1,263     68%
36        12,886    1,529     88%       1,638     0,265     84%       32,573    6,1       81%
37        16,661    1,7       90%       123,115   23,759    81%       4,056     1,295     68%
38        12,511    1,341     89%       155,408   33,103    79%       33,743    6,022     82%
39        7,551     0,795     89%       46,55     9,298     80%       124,769   28,814    77%
40        7,425     0,983     87%       6,833     2,683     61%       37,846    9,079     76%
41        16,037    1,638     90%       4,82      1,17      76%       97,375    17,706    82%
42        25,74     3,307     87%       134,536   25,927    81%       59,468    12,324    79%
43        10,515    1,029     90%       7,847     1,763     78%       147,484   29,749    80%
44        8,767     0,795     91%       1,42      0,296     79%       6,708     2,48      63%
45        4,415     0,483     89%       4,899     1,326     73%       90,48     18,299    80%
46        8,05      0,921     89%       120,635   23,135    81%       116,938   20,514    82%
47        8,268     0,998     88%       104,661   19,438    81%       82,634    17,675    79%
48        6,162     0,436     93%       7,457     1,669     78%       160,619   36,566    77%
49        12,184    1,123     91%       4,946     1,669     66%       126,672   23,026    82%
50        4,664     0,483     90%       5,054     1,373     73%       46,55     9,298     80%

T40I10D100K.dat (1000, 2%)
          Subset (factor 0,4)           Superset (factor 0,1)         Similarity (factor 0,1)
Case ID   CFI time  MFI time  Diff      CFI time  MFI time  Diff      CFI time  MFI time  Diff
1         0,327     0,187     43%       0,063     0,047     25%       0,063     0,047     25%
2         0,234     0,234     0%        0,062     0,047     24%       0,062     0,031     50%
3         0,265     0,218     18%       0,078     0,047     40%       0,405     0,328     19%
4         0,281     0,25      11%       0,047     0,046     2%        0,078     0,047     40%
5         0,312     0,312     0%        0,047     0,047     0%        1,138     1,077     5%
6         0,281     0,249     11%       0,062     0,047     24%       0,889     0,749     16%
7         0,327     0,297     9%        0,062     0,047     24%       0,343     0,343     0%
8         0,296     0,219     26%       0,062     0,047     24%       0,874     0,733     16%
9         0,281     0,249     11%       0,063     0,062     2%        0,78      0,764     2%
10        0,312     0,266     15%       0,062     0,047     24%       1,217     0,92      24%
11        0,327     0,203     38%       0,094     0,062     34%       0,343     0,328     4%
12        0,265     0,234     12%       0,063     0,062     2%        0,374     0,312     17%
13        0,266     0,218     18%       0,078     0,047     40%       0,344     0,312     9%
14        0,281     0,234     17%       0,078     0,032     59%       0,873     0,609     30%
15        0,265     0,187     29%       0,062     0,062     0%        0,046     0,016     65%
16        0,327     0,203     38%       0,078     0,063     19%       0,827     0,749     9%
17        0,281     0,203     28%       0,062     0,031     50%       0,936     0,874     7%
18        0,359     0,281     22%       0,078     0,062     21%       0,92      0,843     8%
19        0,218     0,202     7%        0,063     0,046     27%       0,327     0,312     5%
20        0,25      0,203     19%       0,078     0,062     21%       0,297     0,218     27%
21        0,281     0,249     11%       0,078     0,047     40%       0,312     0,234     25%
22        0,25      0,25      0%        0,062     0,047     24%       0,796     0,764     4%
23        0,25      0,172     31%       0,062     0,032     48%       0,889     0,749     16%
24        0,265     0,187     29%       0,078     0,062     21%       0,998     0,764     23%
25        0,265     0,203     23%       0,063     0,031     51%       0,297     0,296     0%
26        0,312     0,265     15%       0,063     0,031     51%       0,375     0,249     34%
27        0,25      0,249     0%        0,047     0,031     34%       0,952     0,686     28%
28        0,312     0,266     15%       0,047     0,031     34%       0,843     0,702     17%
29        0,266     0,187     30%       0,078     0,047     40%       0,842     0,609     28%
30        0,218     0,172     21%       0,063     0,031     51%       0,811     0,717     12%
31        0,249     0,203     18%       0,062     0,047     24%       0,375     0,343     9%
32        0,266     0,249     6%        0,063     0,046     27%       0,92      0,687     25%
33        0,218     0,188     14%       0,063     0,062     2%        0,328     0,312     5%
34        0,234     0,187     20%       0,046     0,032     30%       0,873     0,702     20%
35        0,234     0,234     0%        0,046     0,032     30%       0,811     0,718     11%
36        0,234     0,172     26%       0,063     0,046     27%       0,406     0,28      31%
37        0,281     0,203     28%       0,063     0,062     2%        0,858     0,656     24%
38        0,265     0,25      6%        0,063     0,062     2%        0,842     0,827     2%
39        0,297     0,25      16%       0,046     0,032     30%       0,936     0,827     12%
40        0,328     0,265     19%       0,062     0,062     0%        0,873     0,811     7%
41        0,281     0,249     11%       0,094     0,031     67%       0,359     0,343     4%
42        0,296     0,203     31%       0,078     0,047     40%       0,842     0,78      7%
43        0,234     0,156     33%       0,047     0,046     2%        0,312     0,297     5%
44        0,219     0,203     7%        0,063     0,031     51%       0,39      0,327     16%
45        0,265     0,202     24%       0,078     0,062     21%       0,39      0,328     16%
46        0,297     0,187     37%       0,078     0,063     19%       0,92      0,734     20%
47        0,312     0,218     30%       0,078     0,046     41%       0,873     0,78      11%
48        0,234     0,187     20%       0,047     0,047     0%        0,843     0,639     24%
49        0,203     0,172     15%       0,078     0,047     40%       0,905     0,609     33%
50        0,296     0,281     5%        0,047     0,046     2%        0,312     0,296     5%
Table 7: pumsb.dat, T40I10D100K.dat, CFI, MFI.
B ( )
/*
* File: MFIdrive.h
* Author: Theodore Varsamis
*
* Created on 3 October 2012, 10:53 am
*/
#ifndef MFIDRIVE_H
#define MFIDRIVE_H
#include <vector>
#include <map>
#include <string>
#include <iostream>
#include <fstream>
#include <sstream>
#include <iterator>
#include <ctime>
#include <set>
#include <cmath>
#include <cstdlib>
/* This defines the number of the input database entries that are going to be
* read during the main IIF creation.
*/
#define DB_ENTRIES 1000
/* This is the predefined itemset support/frequency threshold, which is used
* to prune the weak (infrequent) itemsets from the Invert Index File.
*/
#define SUPPORT_THRESH (0.02 * DB_ENTRIES)
/* These are the relative sizes of the random itemsets which are constructed
* to serve as input for the three types of queries (subset, superset and
* similarity query) while in automatic mode (options 4, 5, 6). The sizes are
* given in floating point form, taking values between 0 and 1. For example, a
* relative size equal to 0.4 means that, for the test, a random itemset is
* constructed whose number of items is 40% of the IIF size. So for an IIF with
* 100 items the random itemset will include 40 items.
*/
#define SUB_REL_SIZE 0.4
#define SUP_REL_SIZE 0.01
#define SIM_REL_SIZE 0.01
/* This constant controls the zero_count threshold of the compression algorithm
* in the Invert_Index_File::compress_transaction_bitmap function. Runs of fewer
* than COMPRESS_ZERO_THRESH sequential zeros are kept as they are, while a run
* of N >= COMPRESS_ZERO_THRESH zeros is represented as #N#. The string only
* gets shorter from four zeros and up, since for N equal to three the size is
* the same, i.e. length("000") == length("#3#"). Large values can be used to
* avoid the compress and uncompress cycles altogether.
*/
#define COMPRESS_ZERO_THRESH 4
/* Type definition of the string Invert Index File structure as
* map<string, pair<uint, string> >, where the first string is the
* N-itemset, the uint is the itemset's frequency/support and the second string
* is the bit vector of the ids of the transactions where the itemset was found
* (e.g. for <345-12<4,00101101>>, the itemset 345-12 has been found 4 times, in
* the third, fifth, sixth and eighth transactions).
*/
typedef std::pair<uint, std::string> fi_pair;
typedef std::map<std::string, fi_pair> strMap;
typedef std::pair<std::string, fi_pair> itemset_rec;
#endif
/* MFIDRIVE_H */
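To make the record layout above concrete, the following sketch stores the example record `<345-12<4,00101101>>` and reads it back. It repeats the two typedefs (with `unsigned int` in place of `uint`) so that it compiles on its own; the helpers `transaction_ids` and `is_frequent` are hypothetical names introduced only for this illustration, and the frequency test assumes that an itemset is kept when its support is not below the threshold.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<unsigned int, std::string> fi_pair;  // <support, transaction bitmap>
typedef std::map<std::string, fi_pair> strMap;          // itemset -> fi_pair

// Hypothetical helper: 1-based ids of the transactions whose bit is set.
std::vector<unsigned int> transaction_ids(const std::string& bitmap) {
    std::vector<unsigned int> ids;
    for (std::string::size_type i = 0; i < bitmap.size(); ++i) {
        if (bitmap[i] == '1') {
            ids.push_back(static_cast<unsigned int>(i + 1));
        }
    }
    return ids;
}

// Hypothetical helper: an itemset survives pruning when its support is not
// less than the threshold (e.g. 0.02 * DB_ENTRIES).
bool is_frequent(const fi_pair& record, double support_thresh) {
    return record.first >= support_thresh;
}
```

With `iif["345-12"] = fi_pair(4, "00101101")`, `transaction_ids` returns 3, 5, 6 and 8, in agreement with the comment in MFIdrive.h.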
/*
* File: main.cpp
*
* Created on 1 October 2012, 11:03 pm
*/
#include "Invert_Index_File.h"
#include "Closed_Frequent_Itemset.h"
#include "Maximum_Frequent_Itemset.h"
#include "Query.h"
#define PIVOT_IIF_PROGRESS
#define DEBUG_OUTPUT_TO_FILE
#define TIMINGS
/* Function that returns the current time for logging.
*/
std::string get_time() {
time_t now;
struct tm* timeinfo;
std::stringstream sstrtime;
time(&now);
timeinfo = localtime(&now);
sstrtime.clear();
sstrtime.str("");
sstrtime << timeinfo->tm_hour << ":" << timeinfo->tm_min << ":" <<
timeinfo->tm_sec;
return sstrtime.str();
}
/* This is the main function of the application that takes as input a text
* transaction database with one or more item ids in every line and calculates
* the Closed Frequent Itemsets, the Maximum Frequent Itemsets and implements
* functions for various queries like superset and subset pattern existence.
*/
int main(int argc, char** argv) {
const std::string db_filename = "T40I10D100K.dat";
// const std::string db_filename = "chess.dat";
// const std::string db_filename = "kosarak.dat";
// const std::string db_filename = "connect.dat";
// const std::string db_filename = "mushroom.dat";
// const std::string db_filename = "pumsb.dat";
// const std::string db_filename = "pumsb_star.dat";
// const std::string db_filename = "Example Database.dat";
const std::string IIF_output_file = "IIF_table.dat";
const std::string CFI_output_file = "CFI_table.dat";
const std::string sorted_CFI_output_file = "sorted_CFI_table.dat";
const std::string pivot_IIF_output_file = "pivotIIFs.dat";

const std::string sub_pivot_IIF_output_file = "sub_pivotIIFs.dat";
const std::string MFI_output_file = "MFI_table.dat";
const std::string qCFI_output_file = "qCFI_table.dat";
const std::string qMFI_output_file = "qMFI_table.dat";
std::vector<std::string> support_ordered_itemsets;
bool query_exit = false;
std::string input, query_args, itemset, sim_factor;
char query_choise;
uint number_of_runs;
std::stringstream buffer;
#ifdef TIMINGS
clock_t begin, IIF_creation, CFI_creation, CFI_sorting, CFI_cleanup,
MFI_creation, q_start, q_finish;
double t_CFI, t_MFI;
begin = clock();
#endif
std::cout << get_time() << " | Starting MFI-drive execution for database "
<< db_filename << ", input size " << DB_ENTRIES <<
" and support threshold " << SUPPORT_THRESH << "." << std::endl
<< std::endl;
/* First create the Invert Index File from the input transactions database.
* The first IIF contains all the 1-itemsets, their support and the
* transaction bit string mapping with the form <itemset_id<support,
* transactions_bit_string>> (e.g. <154<4, 0100010011>>).
*/
Invert_Index_File IIF(db_filename);
#ifdef TIMINGS
IIF_creation = clock();
std::cerr << get_time() << " | Invert Index File structure created after "
<< double(IIF_creation - begin) / CLOCKS_PER_SEC << " sec." <<
std::endl;
#endif
/* After creating the first IIF remove the weak 1-itemsets.

*/
IIF.eliminate_weak_itemsets();
if (IIF.get_main_IIF_structure().size() == 0) {
std::cerr << "There are no itemsets in the Invert Index File." <<
" Lower the support threshold (current threshold " <<
SUPPORT_THRESH << ") or increase the database."
<< std::endl;
exit(0);
}
#ifdef DEBUG_OUTPUT_TO_FILE
// Output the Invert Index File to a file.
IIF.print_main_IIF_to_file(IIF_output_file);
#endif
/* Create the CFI structure and store (all) the candidates closed
* 1-itemsets.
*/
Closed_Frequent_Itemset CFI(IIF.get_main_IIF_structure());
/* Get the 1-itemset ids sorted based on their support, so the CFI
* generation can begin starting from the itemset with the lowest support
* (last in the vector). This order is used to avoid checking large numbers
* of subsets later during the sub pivot IIF creation.
*/
support_ordered_itemsets = IIF.get_descend_ordered_itemsets();
#ifdef PIVOT_IIF_PROGRESS
clock_t itemset_finish;
int count = 1;
std::vector<std::string>::size_type total =
support_ordered_itemsets.size();
itemset_finish = clock();
#endif
/* Loop through the 1-itemsets and create a pivot IIF and then all its sub
* pivot IIFs in order to find all the Closed Frequent Itemsets.
*/
while (support_ordered_itemsets.empty() == false) {
#ifdef PIVOT_IIF_PROGRESS
std::cerr << get_time() << " | Processed item " << count++ << "/" <<
total << " (" << double((clock() - itemset_finish))
/ CLOCKS_PER_SEC << " sec)." << std::endl;
itemset_finish = clock();
#endif
/* The pivot IIF holds all the 2-itemsets for a given 1-itemset. While
* creating the pivot IIF, the weak 2-itemsets are removed.
*/
IIF.create_pivot_IIF(support_ordered_itemsets.back());
#ifdef DEBUG_OUTPUT_TO_FILE
// Output the pivot IIFs one after another to a file.
IIF.print_pivots_IIF_to_file(pivot_IIF_output_file);
#endif
/* Update the Closed Frequent Itemsets with the 2-itemsets. The weak
* 1-itemsets are removed.
*/
CFI.update_CFI(IIF.get_pivot_IIF_structure());
if (IIF.get_pivot_IIF_structure().size() <= 1) {
// Delete the examined itemset from the vector to free up some
// space.
support_ordered_itemsets.pop_back();
/* No sub pivot IIFs can be created based on a pivot IIF with one or
* zero itemsets, so continue with the next pivot IIF.
*/
continue;
} else IIF.initialize_sub_pivot_IIF();
/* While the sub pivot IIF contains more than one n-itemset,
* candidate CFI (n+1)-itemsets can still be created.
*/
do {
/* By refreshing a sub pivot IIF containing n-itemsets, the possible
* (n+1)-itemsets are created while applying elimination for the
* ones with low support.
*/
IIF.refresh_sub_pivot_IIF();
//#ifdef DEBUG_OUTPUT_TO_FILE
// // Output the sub pivot IIFs one after another to a file. For large
// // inputs this file gets a few GB in size!
// IIF.print_sub_pivots_IIF_to_file(sub_pivot_IIF_output_file);
//#endif
/* With the CFI update the new (n+1)-itemsets are inserted in the
* CFI structure and older n-itemsets are removed if their support
* is equal to the support of their descendants.
*/
CFI.update_CFI(IIF.get_sub_pivot_IIF_structure());
} while (IIF.get_sub_pivot_IIF_structure().size() > 1);
// Delete the examined itemset from the vector to free up some space.
support_ordered_itemsets.pop_back();
}
#ifdef TIMINGS
CFI_creation = clock();
std::cerr << get_time() << " | Closed Frequent Itemsets found after " <<
double(CFI_creation - IIF_creation) / CLOCKS_PER_SEC << " sec." <<
std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
// Output the Closed Frequent Itemsets to a file.
CFI.print_CFI_to_file(CFI_output_file);
#endif
CFI.sort_CFI();
#ifdef TIMINGS
CFI_sorting = clock();
std::cerr << get_time() << " | Closed Frequent Itemsets sorted after " <<
double(CFI_sorting - CFI_creation) / CLOCKS_PER_SEC << " sec." <<
std::endl;
#endif
CFI.clean_up_CFI();
#ifdef TIMINGS
CFI_cleanup = clock();
std::cerr << get_time() << " | Closed Frequent Itemsets cleaned up after "
<< double(CFI_cleanup - CFI_sorting) / CLOCKS_PER_SEC << " sec." <<
std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
// Output the sorted and cleaned up Closed Frequent Itemsets to a file.
CFI.print_CFI_to_file(sorted_CFI_output_file);
#endif
/* Generate the Maximum Frequent Itemsets from the found Closed Frequent
* Itemsets.
*/
Maximum_Frequent_Itemset MFI(CFI.get_CFI_structure());
#ifdef TIMINGS
MFI_creation = clock();
std::cerr << get_time() << " | Maximum Frequent Itemsets found after " <<
double(MFI_creation - CFI_cleanup) / CLOCKS_PER_SEC << " sec."
<< std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
// Output the Maximum Frequent Itemsets to a file.
MFI.print_MFI_to_file(MFI_output_file);
#endif
Query q_CFI, q_MFI;
while (!query_exit) {
std::cout << std::endl << get_time() << std::endl <<
"Please choose one of the below:" << std::endl <<
"1. Subset Query (User Feedback)." << std::endl <<
"2. Superset Query (User Feedback)." << std::endl <<
"3. Similarity Query (User Feedback)." << std::endl <<
"4. Subset Query (Auto)." << std::endl <<
"5. Superset Query (Auto)." << std::endl <<
"6. Similarity Query (Auto)." << std::endl <<
"7. Exit" << std::endl << std::endl << "Choice: ";
std::cin >> input;
if (input.length() > 1) {
// Ignore any (accidentally) large inputs.
continue;
} else {
query_choise = input.c_str()[0];
}
switch (query_choise) {
case '1':
std::cout << "Please provide the itemset for which all the" <<
" subsets will be found: ";
std::cin.sync();
std::cin >> query_args;
#ifdef TIMINGS
q_start = clock();
#endif
q_CFI = Query(query_choise, query_args,
CFI.get_CFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_CFI = q_finish - q_start;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_CFI.print_Qresults_to_file(qCFI_output_file);
#endif
#ifdef TIMINGS
q_start = clock();
#endif
q_MFI = Query(query_choise, query_args,
MFI.get_MFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_MFI = q_finish - q_start;
std::cerr << "Time results for subsets search [ CFI: " <<
double(t_CFI) / CLOCKS_PER_SEC << " sec, MFI: " <<
double(t_MFI) / CLOCKS_PER_SEC << " sec";
if (t_CFI != 0 && t_MFI != 0)
std::cerr << " ( " <<
double(t_CFI - t_MFI)*100 / t_CFI << "% less time)";
std::cerr << "]." << std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_MFI.print_Qresults_to_file(qMFI_output_file);
#endif
break;
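case '2':
std::cout << "Please provide the itemset for which all the" <<
" supersets will be found: ";
std::cin.sync();
std::cin >> query_args;
// The Query calls, the output guards and the timing output below are
// restored by analogy with case '1'.
#ifdef TIMINGS
q_start = clock();
#endif
q_CFI = Query(query_choise, query_args,
CFI.get_CFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_CFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_CFI.print_Qresults_to_file(qCFI_output_file);
#endif
#ifdef TIMINGS
q_start = clock();
#endif
q_MFI = Query(query_choise, query_args,
MFI.get_MFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_MFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
std::cerr << "Time results for supersets search [ CFI: " <<
t_CFI << " sec, MFI: " << t_MFI << " sec";
if (t_CFI != 0 && t_MFI != 0)
std::cerr << " ( " <<
(t_CFI - t_MFI) * 100 / t_CFI << "% less time)";
std::cerr << "]." << std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_MFI.print_Qresults_to_file(qMFI_output_file);
#endif
break;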
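case '3':
std::cout << "Please provide the itemset:" << std::endl;
std::cin.sync();
std::cin >> itemset;
std::cout << "Please provide the number of items you want to" <<
" be common with the returned itemsets" << std::endl;
std::cin.sync();
std::cin >> sim_factor;
query_args = itemset + " " + sim_factor;
// The Query calls, the output guards and the timing output below are
// restored by analogy with case '1'.
#ifdef TIMINGS
q_start = clock();
#endif
q_CFI = Query(query_choise, query_args,
CFI.get_CFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_CFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_CFI.print_Qresults_to_file(qCFI_output_file);
#endif
#ifdef TIMINGS
q_start = clock();
#endif
q_MFI = Query(query_choise, query_args,
MFI.get_MFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_MFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
std::cerr << "Time results for similarity search [ CFI: " <<
t_CFI << " sec, MFI: " << t_MFI << " sec";
if (t_CFI != 0 && t_MFI != 0)
std::cerr << " ( " <<
(t_CFI - t_MFI) * 100 / t_CFI << "% less time)";
std::cerr << "]." << std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_MFI.print_Qresults_to_file(qMFI_output_file);
#endif
break;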
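case '4':
std::cout << "Please provide the number of runs for subset "
<< "queries: " << std::endl;
std::cin.sync();
std::cin >> number_of_runs;
std::cout << "Number of automatic tests: " << number_of_runs <<
std::endl;
for (uint i = 0; i < number_of_runs; i++) {
query_args = IIF.get_random_itemset(SUB_REL_SIZE);
// std::cout << "Query argument(s): " << query_args <<
// std::endl;
// The Query calls, the output guards and the timing output below are
// restored by analogy with case '1'.
#ifdef TIMINGS
q_start = clock();
#endif
q_CFI = Query(query_choise, query_args,
CFI.get_CFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_CFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_CFI.print_Qresults_to_file(qCFI_output_file);
#endif
#ifdef TIMINGS
q_start = clock();
#endif
q_MFI = Query(query_choise, query_args,
MFI.get_MFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_MFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
std::cerr << "Time results for subsets search [ CFI: " <<
t_CFI << " sec, MFI: " << t_MFI << " sec";
if (t_CFI != 0 && t_MFI != 0)
std::cerr << " ( " <<
(t_CFI - t_MFI) * 100 / t_CFI << "% less time)";
std::cerr << "]." << std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_MFI.print_Qresults_to_file(qMFI_output_file);
#endif
}
break;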
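case '5':
std::cout << "Please provide the number of runs for superset "
<< "queries: " << std::endl;
std::cin.sync();
std::cin >> number_of_runs;
std::cout << "Number of automatic tests: " << number_of_runs <<
std::endl;
for (uint i = 0; i < number_of_runs; i++) {
query_args = IIF.get_random_itemset(SUP_REL_SIZE);
// std::cout << "Query argument(s): " << query_args <<
// std::endl;
// The loop header, the Query calls, the output guards and the timing
// output are restored by analogy with cases '1' and '4'.
#ifdef TIMINGS
q_start = clock();
#endif
q_CFI = Query(query_choise, query_args,
CFI.get_CFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_CFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_CFI.print_Qresults_to_file(qCFI_output_file);
#endif
#ifdef TIMINGS
q_start = clock();
#endif
q_MFI = Query(query_choise, query_args,
MFI.get_MFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_MFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
std::cerr << "Time results for supersets search [ CFI: " <<
t_CFI << " sec, MFI: " << t_MFI << " sec";
if (t_CFI != 0 && t_MFI != 0)
std::cerr << " ( " <<
(t_CFI - t_MFI) * 100 / t_CFI << "% less time)";
std::cerr << "]." << std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_MFI.print_Qresults_to_file(qMFI_output_file);
#endif
}
break;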
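case '6':
std::cout << "Please provide the number of runs for similarity "
<< "queries: " << std::endl;
std::cin.sync();
std::cin >> number_of_runs;
std::cout << "Number of automatic tests: " << number_of_runs <<
std::endl;
for (uint i = 0; i < number_of_runs; i++) {
itemset = IIF.get_random_itemset(SIM_REL_SIZE);
// Add a random offset to the time seed for extra randomness
// (it's needed!?).
srand(time(NULL) + rand() % 100);
buffer.clear();
buffer.str("");
buffer << (rand() %
Maximum_Frequent_Itemset::get_items_count(itemset))
+ 1;
sim_factor = buffer.str();
query_args = itemset + " " + sim_factor;
//std::cout << "Query argument(s): " << query_args << "." <<
// std::endl;
// The loop header, the Query calls, the output guards and the timing
// output are restored by analogy with cases '1' and '4'.
#ifdef TIMINGS
q_start = clock();
#endif
q_CFI = Query(query_choise, query_args,
CFI.get_CFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_CFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_CFI.print_Qresults_to_file(qCFI_output_file);
#endif
#ifdef TIMINGS
q_start = clock();
#endif
q_MFI = Query(query_choise, query_args,
MFI.get_MFI_structure());
#ifdef TIMINGS
q_finish = clock();
t_MFI = double(q_finish - q_start) / CLOCKS_PER_SEC;
std::cerr << "Time results for similarity search [ CFI: " <<
t_CFI << " sec, MFI: " << t_MFI << " sec";
if (t_CFI != 0 && t_MFI != 0)
std::cerr << " ( " <<
(t_CFI - t_MFI) * 100 / t_CFI << "% less time)";
std::cerr << "]." << std::endl;
#endif
#ifdef DEBUG_OUTPUT_TO_FILE
q_MFI.print_Qresults_to_file(qMFI_output_file);
#endif
}
break;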
case '7':
std::cout << "Exiting ..." << std::endl;
query_exit = true;
break;
default:
std::cout << "Please choose values between 1 and 7." <<
std::endl;
}
}
return 0;
}
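The automatic similarity test in main.cpp reseeds the generator on every iteration with `srand(time(NULL) + rand() % 100)` and questions in its own comment whether that is needed. A common alternative is to seed once per process and only draw inside the loop. The sketch below illustrates that idea; `seed_once` and `random_sim_factor` are hypothetical names that do not appear in the original sources, and the returned range mirrors the `(rand() % count) + 1` expression used there.

```cpp
#include <cstdlib>
#include <ctime>

// Hypothetical helper: seed rand() a single time for the whole run.
void seed_once() {
    static bool seeded = false;
    if (!seeded) {
        std::srand(static_cast<unsigned int>(std::time(NULL)));
        seeded = true;
    }
}

// Hypothetical helper: pick a similarity factor in [1, items_count].
// Returns 0 for an empty itemset, because rand() % 0 is undefined.
unsigned int random_sim_factor(unsigned int items_count) {
    seed_once();
    if (items_count == 0) {
        return 0;
    }
    return static_cast<unsigned int>(std::rand()) % items_count + 1;
}
```

Seeding once avoids restarting the sequence from nearly identical seeds when many iterations run within the same second.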
/*
* File: Invert_Index_File.h
*
*/
#ifndef INVERT_INDEX_FILE_H
#define INVERT_INDEX_FILE_H
#include "MFIdrive.h"
// Multimap needed for the 1-itemsets sorting procedure.

typedef std::multimap<int, std::string> isMMap;
class Invert_Index_File {
public:
Invert_Index_File();
// Custom constructor from an input text transaction DB file.
Invert_Index_File(const std::string transactions_DB_file);
Invert_Index_File(const Invert_Index_File& orig);
virtual ~Invert_Index_File();
/* Function that removes from the Invert Index File structure the 1-itemsets
* whose frequency/support is less than the predefined SUPPORT_THRESH.
*/
void eliminate_weak_itemsets();
// Function that returns the main Invert Index File structure.
strMap& get_main_IIF_structure();
// Function that returns the pivot Invert Index File structure.
strMap& get_pivot_IIF_structure();
// Function that returns the sub pivot Invert Index File structure.
strMap& get_sub_pivot_IIF_structure();
/* Function that returns a string vector with the itemset ids in
* descending order based on their corresponding support.
*/
std::vector<std::string> get_descend_ordered_itemsets();
/* Function that populates the my_pivotIIF structure with all the possible
* 2-itemsets that share the same, given, 1-itemset. It also applies itemset
* elimination.
*/
void create_pivot_IIF(const std::string current_itemset);
/* Initialize the first sub pivot IIF using the pivot IIF entries.
*/
void initialize_sub_pivot_IIF();
/* Function that populates the my_sub_pivotIIF structure with all the
* possible itemsets combinations of the elements it had before and applies
* support elimination.
*/
void refresh_sub_pivot_IIF();
// Public function to print the main Invert Index File to a file.
void print_main_IIF_to_file(const std::string output_file);
// Public function to print the pivot Invert Index File to a file.
void print_pivots_IIF_to_file(const std::string output_file);
// Public function to print the sub pivot Invert Index File to a file.
void print_sub_pivots_IIF_to_file(const std::string output_file);
// Function to return a random itemset build up from IIF items.
std::string get_random_itemset(double relative_itemset_size);
private:
/* This is the map<string, <uint, string> > Invert Index File structure,
* where the first string holds the itemsets and the second string holds
* the transactions ids bitmap.

*/
strMap my_IIF;
// Container to hold only the items from the IIF, to use later in the
// get_random_itemset function.
std::set<std::string> my_IIF_items;
/* This structure stores, each time, one pivot IIF that contains the
* data for all the possible 2-itemsets based on a given itemset.
*/
strMap my_pivotIIF;
/* This structure stores every time the possible n-itemset combinations
* for a specific pivot IIF.
*/
strMap my_sub_pivotIIF;
/* This function compresses the transaction bitmaps of every record in the
* IIF.
*/
void compress_IIF();
/* Function to insert a new itemset entry in the main IIF structure.
*/
void insert_entry(const std::string item_id, const uint transaction_id);
/* This function compresses the transaction bitmap by grouping the large
* sequences of zeros.
*/
void compress_transaction_bitmap(std::string& current_bitmap);
/* This function uncompresses the transaction bitmaps that were compressed
* by the compress_transaction_bitmap function.
*/
void uncompress_transaction_bitmap(std::string& current_bitmap);
/* Function that takes as arguments two bit-strings (for example 1010
* and 10010111) and sets the third argument to the result bitstring (here
* 1000) that is generated by applying the bitwise binary AND operation on
* the first two arguments. It returns the number of 1s in the result
* bitstring (here 1).
*/
uint add_bitstrings(const std::string& first_bitstring,
const std::string& sec_bitstring, std::string& result);
// Function to update the bit string of an existing itemset in the main IIF.
void update_entry(const strMap::iterator& pos, const std::string item_id,
const uint transaction_id);
/* Function to create the 1-itemset Invert Index File structure from the
* input file.
*/
void create_invert_index_file(const std::string input_DB_file);
};
#endif
/* INVERT_INDEX_FILE_H */
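The body of add_bitstrings is not part of this header, so the following is only a sketch of the behaviour its comment describes: a character-wise AND of two bit strings that also counts the resulting 1s. It assumes uncompressed inputs and assumes that the result is as long as the shorter input, which is what the header's example (1010 and 10010111 giving 1000) suggests. The name `and_bitstrings` is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Sketch only: bitwise AND over the common prefix length of two bit strings.
// Writes the AND-ed bit string to `result` and returns the number of 1s,
// i.e. the support of the combined itemset.
unsigned int and_bitstrings(const std::string& first,
                            const std::string& second,
                            std::string& result) {
    const std::string::size_type n = std::min(first.size(), second.size());
    unsigned int ones = 0;
    result.assign(n, '0');
    for (std::string::size_type i = 0; i < n; ++i) {
        if (first[i] == '1' && second[i] == '1') {
            result[i] = '1';
            ++ones;
        }
    }
    return ones;
}
```

Because each bit marks one transaction, the returned count is the number of transactions that contain both itemsets.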
/*
* File: Invert_Index_File.cpp
*
*/
#include <cstdlib>
#include "Invert_Index_File.h"
//#define IIF_CREATION_PROGRESS
//#define IIF_CREATION_VERBOSE
//#define ELIMIN_PROGRESS
//#define COMPRESS_VERBOSE
//#define ADD_BITSTRINGS_VERBOSE
//#define SUB_IIF_ELIMINATION_VERBOSE
//#define SUB_IIF_ADDITION_VERBOSE
//#define SUB_PIVOT_DOUBLE_LOOP_VERBOSE
//#define SUB_PIVOT_ADDITION_VERBOSE
//#define SUB_PIVOT_ELIMINATION_VERBOSE
//#define SUB_PIVOT_JUMP_FWD
#define COMPRESS_ON
#define DB_PARTIAL_INPUT
#ifdef DB_PARTIAL_INPUT
#define DB_PARTIAL_INPUT_LIMIT DB_ENTRIES
#endif
//#define IIF_UPDATE_ENTRY_VERBOSE
//#define IIF_INSERT_ENTRY_VERBOSE
//#define GET_RANDOM_VERBOSE
Invert_Index_File::Invert_Index_File() {
}
Invert_Index_File::Invert_Index_File(std::string transactions_DB_file) {
create_invert_index_file(transactions_DB_file);
}
Invert_Index_File::Invert_Index_File(const Invert_Index_File& orig) {
}
Invert_Index_File::~Invert_Index_File() {
}
/* Function that returns the main Invert Index File structure.
*/
strMap& Invert_Index_File::get_main_IIF_structure() {
return this->my_IIF;
}
/* Function that returns the pivot Invert Index File structure.
*/
strMap& Invert_Index_File::get_pivot_IIF_structure() {
return this->my_pivotIIF;
}
/* Function that returns the sub pivot Invert Index File structure.
*/
strMap& Invert_Index_File::get_sub_pivot_IIF_structure() {
return this->my_sub_pivotIIF;
}
/* Function that returns a string vector with the itemsets in descending
* order based on their corresponding support.
*/
std::vector<std::string> Invert_Index_File::get_descend_ordered_itemsets() {
std::vector<std::string> ordered_itemsets;
ordered_itemsets.reserve(this->my_IIF.size());
std::multimap<int, std::string> supp_itemset_mmap;
strMap::iterator mapIt;
isMMap::reverse_iterator mmapRevIt;
for (mapIt = this->my_IIF.begin(); mapIt != this->my_IIF.end(); ++mapIt) {
supp_itemset_mmap.insert(fi_pair(mapIt->second.first, mapIt->first));
}
/* Using reverse iterator to facilitate the proper vector's size reduction
* over time in main().
*/
for (mmapRevIt = supp_itemset_mmap.rbegin();
mmapRevIt != supp_itemset_mmap.rend(); ++mmapRevIt) {
ordered_itemsets.push_back(mmapRevIt->second);
}
return ordered_itemsets;
}
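The ordering step above can be illustrated in isolation. The sketch below follows the same idea (insert into a multimap keyed by support, then walk it in reverse) on a stand-alone copy of the typedefs; the free function `descending_by_support` is a hypothetical name, and the relative order of itemsets with equal support is left unspecified here.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<unsigned int, std::string> fi_pair;
typedef std::map<std::string, fi_pair> strMap;

// Sketch only: itemset ids ordered from the highest to the lowest support.
std::vector<std::string> descending_by_support(const strMap& iif) {
    std::multimap<unsigned int, std::string> by_support;
    for (strMap::const_iterator it = iif.begin(); it != iif.end(); ++it) {
        by_support.insert(std::make_pair(it->second.first, it->first));
    }
    std::vector<std::string> ordered;
    ordered.reserve(by_support.size());
    // The multimap is sorted by ascending key, so walk it backwards.
    for (std::multimap<unsigned int, std::string>::const_reverse_iterator
             rit = by_support.rbegin(); rit != by_support.rend(); ++rit) {
        ordered.push_back(rit->second);
    }
    return ordered;
}
```

With this order, `back()` is the itemset with the lowest support, which is the one main() processes first before calling `pop_back()`.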
/* Public function to print the main Invert Index File to a file.
*/
void Invert_Index_File::print_main_IIF_to_file(const std::string output_file) {
strMap::iterator mapIt;
std::ofstream out_file(output_file.c_str());
if (out_file.is_open()) {
for (mapIt = this->my_IIF.begin(); mapIt != this->my_IIF.end();
++mapIt) {
out_file << mapIt->first << ":" << mapIt->second.first << ":"
<< mapIt->second.second << std::endl;
}
out_file.close();
}
}
/* Public function to print all the pivot Invert Index File to a file.
*/
void Invert_Index_File::print_pivots_IIF_to_file(const std::string output_file) {
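strMap::iterator mapIt;
std::ofstream out_file(output_file.c_str(), std::fstream::out |
std::fstream::app);
if (this->my_pivotIIF.size() > 0 && out_file.is_open()) {
for (mapIt = this->my_pivotIIF.begin();
mapIt != this->my_pivotIIF.end(); ++mapIt) {
out_file << mapIt->first << ":" << mapIt->second.first << ":"
<< mapIt->second.second << std::endl;
}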
out_file.width(40);
out_file.fill('-');
out_file << '-' << std::endl;
out_file.close();
}
}
/* Public function to print the sub pivot Invert Index File to a file.
*/
void Invert_Index_File::print_sub_pivots_IIF_to_file(
const std::string output_file) {
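strMap::iterator mapIt;
std::ofstream out_file(output_file.c_str(), std::fstream::out |
std::fstream::app);
if (this->my_sub_pivotIIF.size() > 0 && out_file.is_open()) {
for (mapIt = this->my_sub_pivotIIF.begin();
mapIt != this->my_sub_pivotIIF.end(); ++mapIt) {
out_file << mapIt->first << ":" << mapIt->second.first << ":"
<< mapIt->second.second << std::endl;
}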
out_file.width(40);
out_file.fill('-');
out_file << '-' << std::endl;
out_file.close();
}
}
/* Function to insert a new itemset entry in the main IIF structure.
*/
void Invert_Index_File::insert_entry(const std::string item_id,
const uint transaction_id) {
std::string bit_string(transaction_id - 1, '0');
/* Since the item appears in the last (current) transaction, set the last
* bit (char) to 1.
*/
bit_string += "1";
/* Set the new pair to have frequency 1 for the new itemset since this is
* the first transaction which includes it.
*/
fi_pair freq_bit_string(1, bit_string);
this->my_IIF.insert(itemset_rec(item_id, freq_bit_string));
#ifdef IIF_INSERT_ENTRY_VERBOSE
std::cout << "Vector to insert: " << bit_string << std::endl;
#endif
}
/* Function to update the bit string of an existing 1-itemset in the main IIF.
*/
void Invert_Index_File::update_entry(const strMap::iterator& pos,
const std::string item_id, const uint transaction_id) {
/* Find the size of the new bit string which must be appended after the
* current one and create it with default (N-1) zeroes values. Then append
* at the end a 1 to point that the last (current) transaction includes the
* itemset.
*/
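int new_bit_array_size = transaction_id - pos->second.second.size();
// An item listed twice in the same transaction is already recorded; a
// non-positive size would make the string constructor below throw.
if (new_bit_array_size <= 0) return;
std::string new_bit_string(new_bit_array_size - 1, '0');
new_bit_string += "1";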
#ifdef IIF_UPDATE_ENTRY_VERBOSE
std::cerr << "Initial vector: " << pos->second.second << std::endl;
#endif
/* Update the existing bit string entry by appending the new bit string for
* the itemset found again in the last (current) transaction.
*/
pos->second.second += new_bit_string;
// Increase the itemset's frequency/support by one since it was found again.
pos->second.first++;
#ifdef IIF_UPDATE_ENTRY_VERBOSE
std::cerr << "New vector size: " << new_bit_array_size << std::endl;
std::cerr << "String to insert: " << new_bit_string << std::endl;
std::cerr << "Final string: " << pos->second.second << std::endl;
#endif
}
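The way insert_entry and update_entry grow a transaction bit string can be shown with a small stand-alone sketch. It merges both cases into one hypothetical function, `record_occurrence`, which pads the bit string with zeros up to the current transaction and then appends a 1. The duplicate guard is an assumption of this sketch, not something taken from the original code.

```cpp
#include <string>
#include <utility>

typedef std::pair<unsigned int, std::string> fi_pair;  // <support, bitmap>

// Sketch only: mark that the itemset occurs in transaction `transaction_id`
// (1-based). Transactions skipped since the last occurrence become zeros.
void record_occurrence(fi_pair& record, unsigned int transaction_id) {
    if (transaction_id <= record.second.size()) {
        return;  // already recorded for this transaction (assumed guard)
    }
    record.second.append(transaction_id - record.second.size() - 1, '0');
    record.second += '1';
    ++record.first;
}
```

Starting from an empty record, occurrences in transactions 2 and 5 produce the bit string 01001 with support 2.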
/* This function compresses the transaction bitmaps of every record in the
* IIF.
*/
void Invert_Index_File::compress_IIF() {
strMap::iterator record;
for (record = this->my_IIF.begin(); record != this->my_IIF.end();
++record) {
compress_transaction_bitmap(record->second.second);
}
}
/*Function to create the Invert Index File structure from the input text file
*/
void Invert_Index_File::create_invert_index_file(
const std::string input_DB_file) {
uint transaction_id, line_counter;
std::ifstream db_file;
std::string line, item_id;
std::stringstream line_buffer;
strMap::iterator record;
db_file.open(input_DB_file.c_str());
if (db_file.fail()) {
std::cerr << "The file " << input_DB_file
<< " couldn't open! Exiting ..." << std::endl;
exit(1);
}
/* Open input database file and start populating the Invert Index File
* Map structure with the item and the transaction IDs.
*/
line_counter = 0;
while (getline(db_file, line)) {
transaction_id = ++line_counter;
/* Use clear to reset the flags and str member function to initiate the
* string-stream. Note that if the flags are not cleared, the while loop
* below will set the string-stream empty flag the first time and no
* data will be read to item_id subsequently. Also if the << operator is
* used for initialization the string-stream will not be updated in the
* next loops.
*/
line_buffer.clear();
line_buffer.str(line);
/* Use the >> string-stream operator to parse each line using space as
* the delimiter and get the item IDs.
*/
while (line_buffer >> item_id) {
if ((record = this->my_IIF.find(item_id))
== this->my_IIF.end()) {
// If the itemset is new then make a new entry in the IIF.
insert_entry(item_id, transaction_id);
} else {
#ifdef IIF_CREATION_VERBOSE
this->print_main_IIF_to_file("IIF_table_25-part.dat");
std::cerr << "-------- Found same item in transaction: " <<
transaction_id << std::endl;
#endif
update_entry(record, item_id, transaction_id);
}
}
#ifdef IIF_CREATION_PROGRESS
std::cerr << "Line number: " << line_counter << std::endl;
#endif
#ifdef DB_PARTIAL_INPUT
if (line_counter == DB_PARTIAL_INPUT_LIMIT) {
return;
}
#endif
}
db_file.close();
}
/* This function compresses the transaction bitmap by encoding each long
* run of zeros (at least COMPRESS_ZERO_THRESH of them) as "#<count>#".
*/
void Invert_Index_File::compress_transaction_bitmap(
std::string & current_bitmap) {
uint zero_count, zero_thresh;
std::stringstream code;
std::string result;
zero_thresh = COMPRESS_ZERO_THRESH;
result.clear();
zero_count = 0;
for (std::string::size_type i = 0; i < current_bitmap.length(); i++) {
if (current_bitmap.compare(i, 1, "1") == 0) {
if (zero_count < zero_thresh) {
for (uint c = 0; c < zero_count; c++) {
result.append("0");
}
result.append("1");
zero_count = 0;
} else {
code.clear();
code.str("");
code << "#" << zero_count << "#" << "1";
result.append(code.str());
zero_count = 0;
}
} else {
zero_count++;
}
}
// Take care of any remaining zeros.
if (zero_count > 0) {
if (zero_count < zero_thresh) {
for (uint i = 0; i < zero_count; i++) {
result.append("0");
}
zero_count = 0;
} else {
code.clear();
code.str("");
code << "#" << zero_count << "#";
result.append(code.str());
zero_count = 0;
}
}
current_bitmap = result;
}
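/* Illustration (not part of the original listing): a minimal, self-contained
* sketch of the zero-run encoding used above, together with its inverse. The
* names encode_zero_runs, decode_zero_runs, flush_zeros and kZeroThresh are
* hypothetical; kZeroThresh only stands in for COMPRESS_ZERO_THRESH, whose
* real value is defined elsewhere. The decoder assumes well-formed input.
*/

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

// Assumed threshold; stands in for COMPRESS_ZERO_THRESH.
static const unsigned kZeroThresh = 4;

// Append a run of zeros either literally or as "#<count>#".
static void flush_zeros(std::string& out, unsigned zeros) {
    if (zeros < kZeroThresh) {
        out.append(zeros, '0');
    } else {
        std::ostringstream code;
        code << '#' << zeros << '#';
        out += code.str();
    }
}

// Encode every run of at least kZeroThresh zeros as "#<count>#".
std::string encode_zero_runs(const std::string& bitmap) {
    std::string out;
    unsigned zeros = 0;
    for (std::string::size_type i = 0; i < bitmap.size(); ++i) {
        if (bitmap[i] == '1') {
            flush_zeros(out, zeros);
            out += '1';
            zeros = 0;
        } else {
            ++zeros;
        }
    }
    flush_zeros(out, zeros);  // Trailing zeros, if any.
    return out;
}

// Expand every "#<count>#" code back into <count> zeros.
std::string decode_zero_runs(const std::string& encoded) {
    std::string out;
    std::string::size_type i = 0;
    while (i < encoded.size()) {
        if (encoded[i] == '#') {
            std::string::size_type close = encoded.find('#', i + 1);
            if (close == std::string::npos) break;  // Malformed code: stop.
            unsigned zeros = static_cast<unsigned>(
                std::atoi(encoded.substr(i + 1, close - i - 1).c_str()));
            out.append(zeros, '0');
            i = close + 1;
        } else {
            out += encoded[i++];
        }
    }
    return out;
}
```

/* With the assumed threshold of 4, a run of five zeros is replaced by a code
* while a run of two zeros is copied literally, and decoding restores the
* original bitmap.
*/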
/* This function decompresses the transaction bitmaps that were compressed
* by the compress_transaction_bitmap function.
*/
void Invert_Index_File::uncompress_transaction_bitmap(
std::string & current_bitmap) {
std::string::size_type foundL, foundR;
uint zero_count;
foundL = current_bitmap.find_first_of("#");
while (foundL != std::string::npos) {
foundR = current_bitmap.find_first_of("#", foundL + 1);
zero_count = atoi(current_bitmap.substr(
foundL + 1, foundR - foundL - 1).c_str());
#ifdef COMPRESS_VERBOSE
std::cerr << "----------------------------------------" << std::endl;
std::cerr << "Bitstring before erasing: " << current_bitmap <<
" (" << foundL << "," << foundR << ")" << std::endl;
#endif
current_bitmap.erase(foundL, foundR - foundL + 1);
#ifdef COMPRESS_VERBOSE
std::cerr << "Bitstring after erasing: " << current_bitmap <<
std::endl;
#endif
current_bitmap.insert(foundL, zero_count, '0');
#ifdef COMPRESS_VERBOSE
std::cerr << "Bitstring after inserting: " << current_bitmap <<
std::endl;
std::cerr << "----------------------------------------" << std::endl;
#endif
foundL = current_bitmap.find_first_of("#",
foundL + zero_count + 1);
}
}
/* Function that takes as arguments two bit-strings (for example 1010 and
* 10010111) and sets the third argument to the result bit-string (here
* 1000) generated by applying the bitwise AND operation to the first two
* arguments, up to the length of the shorter one. It returns the number of
* 1s in the result bit-string (here 1).
*/
uint Invert_Index_File::add_bitstrings(const std::string& first_bitstring,
const std::string& sec_bitstring, std::string & result) {
uint ace_count;
std::string::size_type first_size, sec_size, min_size;
std::string first_bitstring_uncomp, sec_bitstring_uncomp;
ace_count = 0;
first_bitstring_uncomp = first_bitstring;
sec_bitstring_uncomp = sec_bitstring;
// Uncompressing inputs
#ifdef COMPRESS_ON
uncompress_transaction_bitmap(first_bitstring_uncomp);
uncompress_transaction_bitmap(sec_bitstring_uncomp);
#endif
first_size = first_bitstring_uncomp.length();
sec_size = sec_bitstring_uncomp.length();
min_size = first_size >= sec_size ? sec_size : first_size;
// The result bit-string must start always empty!
result.clear();
for (std::string::size_type i = 0; i < min_size; i++) {
if (first_bitstring_uncomp.compare(i, 1, "1") == 0 &&
sec_bitstring_uncomp.compare(i, 1, "1") == 0) {
result.append("1");
ace_count++;
} else {
result.append("0");
}
}
#ifdef COMPRESS_ON
compress_transaction_bitmap(result);
#endif
#ifdef ADD_BITSTRINGS_VERBOSE
std::cout << "-----------------------------" << std::endl;
std::cout.width(30);
std::cout << "First (compressed):" << std::left << first_bitstring << std::endl;
std::cout << "Second (compressed):" << std::left << sec_bitstring << std::endl;
std::cout << "Result (compressed):" << std::left << result << std::endl;
std::cout << "First: " << std::left << first_bitstring_uncomp <<
std::endl;
std::cout << "Second: " << std::left << sec_bitstring_uncomp <<
std::endl;
std::cout << "Result: " << std::left << result << std::endl;
#endif
return ace_count;
}
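/* Illustration (not part of the original listing): a minimal sketch of the
* core of add_bitstrings without the optional compression step. The name
* and_bitstrings is hypothetical. It intersects two uncompressed bit-strings
* up to the length of the shorter one and returns the number of 1s, which is
* the support of the combined itemset.
*/

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Bitwise AND of two '0'/'1' strings, truncated to the shorter length.
// Returns the number of '1' characters written to result.
unsigned and_bitstrings(const std::string& first, const std::string& second,
                        std::string& result) {
    const std::string::size_type n = std::min(first.size(), second.size());
    unsigned ones = 0;
    result.assign(n, '0');  // The result always starts from all zeros.
    for (std::string::size_type i = 0; i < n; ++i) {
        if (first[i] == '1' && second[i] == '1') {
            result[i] = '1';
            ++ones;
        }
    }
    return ones;
}
```

/* For the example in the comment above, 1010 and 10010111 give the result
* 1000 with a count of 1.
*/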
/* Function that removes from the Invert Index File structure the itemsets
* whose frequency/support is less than the predefined SUPPORT_THRESH.
*/
void Invert_Index_File::eliminate_weak_itemsets() {
strMap::iterator record;
for (record = this->my_IIF.begin(); record != this->my_IIF.end();) {
if (record->second.first < SUPPORT_THRESH) {
#ifdef ELIMIN_PROGRESS
std::cerr << "Removing Itemset " << record->first <<
" with support " << record->second.first << std::endl;
#endif
this->my_IIF.erase(record++);
} else record++;
}
#ifdef COMPRESS_ON
compress_IIF();
#endif
// Populating the my_IIF_items for the queries later.
for (record = this->my_IIF.begin(); record != this->my_IIF.end();
++record) {
this->my_IIF_items.insert(record->first);
}
}
/* Function that populates the pivot IIF structure with all the possible
* 2-itemsets that share the same, given, itemset. It also applies 2-itemset
* elimination.
*/
void Invert_Index_File::create_pivot_IIF(const std::string current_itemset) {
std::string two_itemset_id, two_itemset_bitstring;

uint new_support;
itemset_rec current_itemset_record;
fi_pair freq_bitstring;
strMap::iterator item;
// Populate the current itemset record before deleting it from the main IIF.
current_itemset_record.first = current_itemset;
current_itemset_record.second.first = this->my_IIF[current_itemset].first;
current_itemset_record.second.second = this->my_IIF[current_itemset].second;
this->my_IIF.erase(current_itemset);
this->my_pivotIIF.clear();
for (item = this->my_IIF.begin(); item != this->my_IIF.end(); item++) {
two_itemset_id = current_itemset_record.first + "-" + item->first;
new_support = add_bitstrings(current_itemset_record.second.second,
item->second.second, two_itemset_bitstring);
if (new_support >= SUPPORT_THRESH) {
#ifdef SUB_IIF_ADDITION_VERBOSE
std::cerr << "Adding itemset (" << two_itemset_id
<< ") with support " << new_support << std::endl;
#endif
freq_bitstring = std::make_pair(new_support, two_itemset_bitstring);

this->my_pivotIIF.insert(
itemset_rec(two_itemset_id, freq_bitstring));
two_itemset_bitstring.clear();
}
#ifdef SUB_IIF_ELIMINATION_VERBOSE
else {
std::cout << "Eliminating 2-itemset " << two_itemset_id <<
" with support " << new_support << std::endl;
}
#endif
}
}
/* Initialize the first sub pivot IIF with the pivot IIF entries.
*/
void Invert_Index_File::initialize_sub_pivot_IIF() {
this->my_sub_pivotIIF = this->my_pivotIIF;
}
/* Function that populates the my_sub_pivotIIF structure with all the
* possible itemsets combinations of the elements it had before and applies
* itemsets elimination based on the support value.
*/
void Invert_Index_File::refresh_sub_pivot_IIF() {
std::string new_itemsets_id, new_bit_string, first_prefix, second_prefix;
uint new_support;
strMap new_sub_pivot_IIF;
strMap::iterator mapItFirst, mapItSec;
fi_pair new_supp_trans_pair;
/* We set as limit for the first iterator, mapItFirst, the last element
* of the map because the second iterator, mapItSec, starts every time at
* the position after mapItFirst. Thus, we avoid references out of the map
* bounds. An empty map has no last element, so there is nothing to combine.
*/
if (this->my_sub_pivotIIF.empty()) {
return;
}
strMap::iterator last_itemset = this->my_sub_pivotIIF.end();
std::advance(last_itemset, -1);
/* With these loops we combine all the n-itemsets that share the same
* prefix. The prefix we compare consists of all the item ids except the
* last one. Only pairs of such n-itemsets can generate valid (n+1)-itemsets.
*/
*/
for (mapItFirst = this->my_sub_pivotIIF.begin();
mapItFirst != last_itemset; ++mapItFirst) {
for (mapItSec = mapItFirst, std::advance(mapItSec, 1);
mapItSec != this->my_sub_pivotIIF.end(); ++mapItSec) {
first_prefix = mapItFirst->first.substr(0,
mapItFirst->first.rfind("-"));
second_prefix = mapItSec->first.substr(0,
mapItSec->first.rfind("-"));
#ifdef SUB_PIVOT_DOUBLE_LOOP_VERBOSE
std::cerr << "To compare " << mapItFirst->first <<
" and " << mapItSec->first << std::endl;
#endif
/* Here we have found two n-itemsets that can be combined and give
* a new (n+1)-itemset.
*/
if (first_prefix == second_prefix) {
new_itemsets_id = mapItFirst->first + mapItSec->first.substr(
mapItSec->first.find_last_of("-"));
new_support = add_bitstrings(mapItFirst->second.second,
mapItSec->second.second, new_bit_string);
if (new_support >= SUPPORT_THRESH) {
#ifdef SUB_PIVOT_ADDITION_VERBOSE
std::cerr << "Adding itemset " << new_itemsets_id
<< " with support " << new_support << std::endl;
#endif
new_supp_trans_pair = std::make_pair(new_support,
new_bit_string);
new_sub_pivot_IIF.insert(itemset_rec(new_itemsets_id,
new_supp_trans_pair));
}
#ifdef SUB_PIVOT_ELIMINATION_VERBOSE
else {
std::cerr << "Eliminating itemset " << new_itemsets_id <<
" with support " << new_support << std::endl;
}
#endif
} else {
#ifdef SUB_PIVOT_JUMP_FWD
std::cerr << "The prefixes don't match (" << first_prefix <<
", " << second_prefix << "). Jump fwd." << std::endl;
#endif
break;
}
}
}
this->my_sub_pivotIIF = new_sub_pivot_IIF;
}
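/* Illustration (not part of the original listing): a minimal sketch of the
* prefix join performed by refresh_sub_pivot_IIF, leaving out the bit-string
* intersection and the support test. The name join_same_prefix is
* hypothetical. It assumes that every id contains at least one "-" and that
* the ids are kept in sorted order, as the keys of the map are.
*/

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Combine every pair of n-itemset ids that share all items except the last
// into an (n+1)-itemset id.
std::vector<std::string> join_same_prefix(const std::set<std::string>& ids) {
    std::vector<std::string> candidates;
    std::set<std::string>::const_iterator first, second;
    for (first = ids.begin(); first != ids.end(); ++first) {
        const std::string first_prefix = first->substr(0, first->rfind('-'));
        second = first;
        for (++second; second != ids.end(); ++second) {
            const std::string second_prefix =
                second->substr(0, second->rfind('-'));
            // Sorted ids: once the prefix changes, no later id can match.
            if (first_prefix != second_prefix) break;
            // Append the last item of the second id, separator included.
            candidates.push_back(*first + second->substr(second->rfind('-')));
        }
    }
    return candidates;
}
```

/* For the ids a-b, a-c, a-d and b-c this yields a-b-c, a-b-d and a-c-d; the
* id b-c shares its prefix with no other id and generates nothing.
*/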
/* This function gets as an argument a relative itemset size, expressed as a
* fraction of the IIF's size, and returns a random itemset built from
* the corresponding number of IIF items.
*/
std::string Invert_Index_File::get_random_itemset(
double relative_itemset_size) {
std::set<std::string>::size_type IIF_size;
std::set<std::string>::const_iterator item;
std::string random_itemset;
uint number_of_items, count, offset;
std::multiset<std::string> selected_items;
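// Without at least one selectable entry there is nothing to pick from.
if (this->my_IIF_items.size() < 2) {
return std::string();
}
// Subtracting one to ignore the last empty entry.
IIF_size = this->my_IIF_items.size() - 1;
number_of_items = (int) floor(relative_itemset_size * IIF_size + 0.5);
// Set minimum items to be one.
if (number_of_items == 0) {
number_of_items = 1;
}
// More distinct items than are selectable can never be found.
if (number_of_items > IIF_size) {
number_of_items = IIF_size;
}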
#ifdef GET_RANDOM_VERBOSE
std::cerr << "Relative size: " << relative_itemset_size << ", IIF size: " <<
IIF_size << ", Number of items for random itemset: " <<
number_of_items << std::endl;

#endif
random_itemset = "";
for (count = 0; count < number_of_items; count++) {
// Add a random offset to the time seed for extra randomness (it is unclear
// whether this is needed).
srand(time(NULL) + rand() % 100);
// Make sure that the chosen item has not already been selected before.
do {
item = this->my_IIF_items.begin();
offset = rand() % IIF_size;
std::advance(item, offset);
#ifdef GET_RANDOM_VERBOSE
std::cerr << "For offset " << offset << " checking item " <<
*item << "." << std::endl;
#endif
} while (selected_items.find(*item) != selected_items.end());
#ifdef GET_RANDOM_VERBOSE
std::cerr << "Adding " << *item << " in the random items bucket."
<< std::endl;
#endif
selected_items.insert(*item);
random_itemset += "-" + *item;
}
// Delete the leading "-".
random_itemset.erase(random_itemset.begin());
return random_itemset;
}
/*
* File: Closed_Frequent_Itemset.h
*
*/
#ifndef CLOSED_FREQUENT_ITEMSET_H
#define CLOSED_FREQUENT_ITEMSET_H
class Closed_Frequent_Itemset {
public:
Closed_Frequent_Itemset();
Closed_Frequent_Itemset(const Closed_Frequent_Itemset& orig);
/* Constructor of CFI from an IIF structure.
*/
Closed_Frequent_Itemset(const strMap& current_IIF_structure);
virtual ~Closed_Frequent_Itemset();
/* Function to return the CFI structure.
*/
strMap& get_CFI_structure();
/* Public function to print the Closed Frequent Itemsets to a file.
*/
void print_CFI_to_file(const std::string output_file);

/* This function gets as an argument an IIF structure and updates
* the set of Closed Frequent Itemsets.
*/
void update_CFI(const strMap& current_IIF);
/* This function sorts the items in every record of the CFI structure.
*/
void sort_CFI();
/* This function removes itemsets that are subsets of others and have the
* same support. These are left over because some frequent items appear in
* several pivot and sub-pivot groups.
*/
void clean_up_CFI();
private:
/* This map<string, <uint, string> > structure holds the candidate Closed
* Frequent Itemsets along with their support and transaction bit string
* map.
*/
strMap my_CFI;
/* This function breaks an itemset into the items it contains and returns
* those.
*/
std::multiset<std::string> get_items(std::string current_itemset);
/* This function returns the number of items that the provided itemset
* contains.
*/
uint get_items_count(const std::string& current_itemset);
};
#endif /* CLOSED_FREQUENT_ITEMSET_H */
/*
* File: Closed_Frequent_Itemset.cpp
*
*/
#include "Closed_Frequent_Itemset.h"
//#define UPDATE_CFI_PRUNING_VERBOSE
//#define UPDATE_CFI_ADD_VERBOSE
//#define SORT_VERBOSE
//#define CLEAN_UP_VERBOSE
//#define CLEAN_UP_SUBSET_VERBOSE
Closed_Frequent_Itemset::Closed_Frequent_Itemset() {
}
Closed_Frequent_Itemset::Closed_Frequent_Itemset(const Closed_Frequent_Itemset&
orig) : my_CFI(orig.my_CFI) {
}
/* Constructor of CFI from an IIF structure.
*/
Closed_Frequent_Itemset::Closed_Frequent_Itemset(const strMap&
current_IIF_structure) {
this->my_CFI = current_IIF_structure;
}
Closed_Frequent_Itemset::~Closed_Frequent_Itemset() {
}
/* Function to return the CFI structure.
*/
strMap& Closed_Frequent_Itemset::get_CFI_structure() {
return this->my_CFI;
}
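/* Public function to print the Closed Frequent Itemsets to a file.
*/
void Closed_Frequent_Itemset::print_CFI_to_file(const std::string output_file) {
strMap::const_iterator mapIt;
std::ofstream out_file(output_file.c_str());
// The statements that write each record are missing from this listing;
// one itemset with its support per line is an assumed output format.
if (out_file.is_open()) {
for (mapIt = this->my_CFI.begin(); mapIt != this->my_CFI.end();
++mapIt) {
out_file << mapIt->first << " " << mapIt->second.first << std::endl;
}
}
}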
/* This function gets as an argument an IIF structure and updates
* the set of Closed Frequent Itemsets.
*/
void Closed_Frequent_Itemset::update_CFI(const strMap& current_IIF) {
std::string subset;
std::string::size_type found;
std::vector<std::string::size_type> separator_pos;
uint candidate_cfi_support;
strMap::const_iterator mapIt;
/* For every n-itemset candidate in the provided IIF, find its (n-1)-itemset
* subsets (always keeping the first item) and compare the candidate's
* support against each subset's support. The candidate's support can be
* less or equal, never greater. If the candidate's support is equal to the
* support of one or more subsets, then remove those subsets from the
* current CFI set (if they are still there). Finally, add the candidate to
* the CFI, since its support is at least the threshold (this condition was
* satisfied earlier, during the IIF creation).
*/
for (mapIt = current_IIF.begin(); mapIt != current_IIF.end(); ++mapIt) {
candidate_cfi_support = mapIt->second.first;
separator_pos.clear();
found = mapIt->first.find("-");
while (found != std::string::npos) {
separator_pos.push_back(found);
found = mapIt->first.find("-", found + 1);
}
separator_pos.push_back(mapIt->first.npos);
for (uint i = 0; i < separator_pos.size() - 1; i++) {
subset = mapIt->first;
// Erase the i-th "-item" part: from one separator up to the next.
subset.erase(separator_pos[i], separator_pos[i + 1] - separator_pos[i]);
#ifdef UPDATE_CFI_PRUNING_VERBOSE
std::cerr << "The subset is " << subset << "(" << mapIt->first
<< ")." << std::endl;
#endif
if (this->my_CFI.find(subset) != this->my_CFI.end() &&
candidate_cfi_support == this->my_CFI[subset].first) {
#ifdef UPDATE_CFI_PRUNING_VERBOSE
std::cerr << "The new itemset (" << mapIt->first
<< ") has the same support (" << candidate_cfi_support
<< ") like the current subset " << subset
<< ". Removing subset ..." << std::endl;
#endif
this->my_CFI.erase(subset);
}
}
this->my_CFI.insert(itemset_rec(mapIt->first, mapIt->second));
#ifdef UPDATE_CFI_ADD_VERBOSE
std::cerr << "Added new CFI (" << mapIt->first << ") with support "
<< candidate_cfi_support << "." << std::endl;
#endif
}
}
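/* Illustration (not part of the original listing): a minimal sketch of how
* update_CFI derives the (n-1)-item subsets of a candidate by erasing one
* "-item" part at a time, so that the first item is always kept. The name
* drop_one_item is hypothetical.
*/

```cpp
#include <cassert>
#include <string>
#include <vector>

// Return the subsets obtained by removing each item except the first.
std::vector<std::string> drop_one_item(const std::string& itemset) {
    std::vector<std::string::size_type> separator_pos;
    std::string::size_type found = itemset.find('-');
    while (found != std::string::npos) {
        separator_pos.push_back(found);
        found = itemset.find('-', found + 1);
    }
    // Sentinel: lets the last part be erased up to the end of the string.
    separator_pos.push_back(std::string::npos);

    std::vector<std::string> subsets;
    for (std::vector<std::string::size_type>::size_type i = 0;
         i + 1 < separator_pos.size(); ++i) {
        std::string subset = itemset;
        // erase clamps the length, so the npos sentinel is safe here.
        subset.erase(separator_pos[i], separator_pos[i + 1] - separator_pos[i]);
        subsets.push_back(subset);
    }
    return subsets;
}
```

/* For the candidate a-b-c the subsets are a-c and a-b; the subset b-c is
* never produced because the first item is never removed.
*/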
/* This function sorts the items in every record of the CFI structure.
*/
void Closed_Frequent_Itemset::sort_CFI() {
std::multiset<std::string> itemsets_buffer;
std::multiset<std::string>::iterator msetIt;
std::string item_id_found, sorted_itemset;
strMap sorted_CFI;
strMap::iterator mapIt;
/* Loop through the CFI structure and, for each record, parse the itemset
* string, sort the item ids and then combine them into a new itemset string.
*/
for (mapIt = this->my_CFI.begin(); mapIt != this->my_CFI.end();) {
// support = mapIt->second.first;
// bitmap = mapIt->second.second;
// Clear the buffer to prepare it for the new record's items' ids.
itemsets_buffer.clear();
#ifdef SORT_VERBOSE
std::cerr << "Itemset: " << mapIt->first << std::endl;
#endif
for (uint i = 0; i < mapIt->first.length(); i++) {
if (mapIt->first[i] != '-') {
item_id_found += mapIt->first[i];
} else {
#ifdef SORT_VERBOSE
std::cerr << "Adding item: " << item_id_found << std::endl;
#endif
itemsets_buffer.insert(item_id_found);
// Clear the item id buffer to prepare for the next.
item_id_found.clear();
}
}
#ifdef SORT_VERBOSE
std::cerr << "Adding item: " << item_id_found << std::endl;
#endif
/* Add the last item id after we reached the itemset string's end
* and clear the item id buffer for the next loop.
*/
itemsets_buffer.insert(item_id_found);
item_id_found.clear();
// Clear the sorted_itemset string to accept the new ids.
sorted_itemset.clear();
/* Combine the multiset native sorted elements (item ids) into a new
* itemset string to replace the old.
*/
for (msetIt = itemsets_buffer.begin(); msetIt != itemsets_buffer.end();
msetIt++) {
sorted_itemset += '-' + *msetIt;
}
// Erase the first '-' character.
sorted_itemset.erase(sorted_itemset.begin());
#ifdef SORT_VERBOSE
std::cerr << "Sorted Itemset: " << sorted_itemset << std::endl;
#endif
sorted_CFI.insert(std::pair<std::string, fi_pair >
(sorted_itemset, mapIt->second));
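/* Advance the iterator here, with a post-increment inside the erase call,
* instead of in the for statement: erasing invalidates the iterator to
* the erased element, so it must move on before the erase takes effect.
*/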
this->my_CFI.erase(mapIt++);
}
this->my_CFI = sorted_CFI;
}
/* This function returns the number of items that the provided itemset contains.
*/
uint Closed_Frequent_Itemset::get_items_count(
const std::string& current_itemset) {
uint count = 0;
std::string items_seperator = "-";
for (uint i = 0; i < current_itemset.length(); ++i) {
if (current_itemset.compare(i, 1, items_seperator) == 0)
count++;
}
// The number of items is the separator count plus one.
count++;
return count;
}
/* This function breaks an itemset into the items it contains and returns those.
*/
std::multiset<std::string> Closed_Frequent_Itemset::get_items(
std::string current_itemset) {
std::multiset<std::string> items;
std::string::size_type item_end;
do {
// Locate the end of the current first item on every pass.
item_end = current_itemset.find("-");
// When no "-" is left, substr returns the whole remainder: the last item.
items.insert(current_itemset.substr(0, item_end));
if (item_end != std::string::npos) {
// Delete the item together with its trailing "-".
current_itemset.erase(0, item_end + 1);
}
} while (item_end != std::string::npos);
return items;
}
/* This function removes itemsets that are subsets of others and have the
* same support. These are left over because some frequent items appear in
* several pivot and sub-pivot groups.
*/
void Closed_Frequent_Itemset::clean_up_CFI() {
strMap::iterator itemset;
std::multiset<std::string> pivot_itemset_items, next_itemset_items;
std::multiset<std::string>::iterator pos;
std::multimap<std::string::size_type, std::string> length_itemset_map;
std::multimap<std::string::size_type, std::string>::reverse_iterator
pivot_itemset, next_itemset, backup;
uint pivot_items_count, next_items_count;
bool subset_flag;
/* Sort the CFI candidates by their string length in ascending order (the
* longest is at the end). Ignore any 1-itemsets.
*/
for (itemset = this->my_CFI.begin(); itemset != this->my_CFI.end();
++itemset) {
// If the itemset string has no "-" separator it consists of one item.
if (itemset->first.find("-") != std::string::npos) {
length_itemset_map.insert(std::pair<std::string::size_type,
std::string > (itemset->first.length(), itemset->first));
}
}
/* For each itemset (pivot), starting from the longest, check whether any of
* the remaining CFI candidates is its superset with the same support. If
* so, remove the pivot.
*/
for (pivot_itemset = length_itemset_map.rbegin();
pivot_itemset != length_itemset_map.rend(); ++pivot_itemset) {
for (next_itemset = length_itemset_map.rbegin();
next_itemset != length_itemset_map.rend(); ++next_itemset) {
// We do not compare the same itemsets.
if (next_itemset == pivot_itemset) continue;
// If the pivot itemset has different support than the next_itemset
// we continue since even if it is a subset it is still valid.
if (this->my_CFI[pivot_itemset->second].first !=
this->my_CFI[next_itemset->second].first) continue;
#ifdef CLEAN_UP_VERBOSE
std::cerr << "Checking the pivot itemset " << pivot_itemset->second
<< " (" << pivot_itemset->first << ") and next itemset " <<
next_itemset->second << " (" << next_itemset->first << ")."
<< std::endl;
#endif
/* If an itemset A is larger than another B, then by definition it
* cannot be a subset of B. So proceed with the next pivot.
*/
if (pivot_itemset->first >= next_itemset->first) {
#ifdef CLEAN_UP_VERBOSE
std::cerr << "The pivot " << pivot_itemset->second << " (" <<
pivot_itemset->first << ") is same or larger" <<
" in length than the next itemset " <<
next_itemset->second << " (" << next_itemset->first <<
"). Getting new pivot ..." << std::endl;
#endif
/* Break since the candidates are sorted by size and all the
* rest itemsets have the same or smaller size than the current
* pivot.
*/
break;
}
/* Calculate the number of items the pivot and the next_itemset
* contains and if the pivot has more or the same number of items,
* then continue with the next itemset.
*/
pivot_items_count = get_items_count(pivot_itemset->second);
next_items_count = get_items_count(next_itemset->second);
if (pivot_items_count >= next_items_count) {
#ifdef CLEAN_UP_VERBOSE
std::cerr << "The pivot " << pivot_itemset->second <<
" has the same or more items than the next itemset (" <<
next_itemset->second << "). Getting new pivot ..."
<< std::endl;
#endif
continue;
}
/* Find all the items the pivot itemset includes.
*/
pivot_itemset_items.clear();
pivot_itemset_items = get_items(pivot_itemset->second);
/* Find all the items the next itemset includes.
*/
next_itemset_items.clear();
next_itemset_items = get_items(next_itemset->second);
/* Check for every item in the pivot itemset if it is contained in
* the next_itemset. If not, continue with the next next_itemset. If
* yes, remove the pivot from the CFI candidates.
*/
subset_flag = true;
for (pos = pivot_itemset_items.begin();
pos != pivot_itemset_items.end(); ++pos) {
if (next_itemset_items.find(*pos) ==
next_itemset_items.end()) {
#ifdef CLEAN_UP_VERBOSE
std::cerr << "The item " << *pos <<
" is not contained in the next_itemset. " <<
"Proceeding with the next next_itemset ... " <<
std::endl;
#endif
subset_flag = false;
break;
}
}
if (subset_flag) {
#ifdef CLEAN_UP_SUBSET_VERBOSE
std::cerr << "The pivot itemset " << pivot_itemset->second
<< "(" <<
this->my_CFI[pivot_itemset->second].first <<
") is subset of the next itemset " <<
next_itemset->second << "(" <<
this->my_CFI[next_itemset->second].first <<
"). Deleting the pivot ..." << std::endl;
#endif
this->my_CFI.erase(pivot_itemset->second);
// Turn to base so we can delete (note that the base iterator is
// always one position past the element the reverse iterator refers to).
length_itemset_map.erase(--pivot_itemset.base());
/* The reverse iterator now refers to the candidate that followed the
* erased pivot. Step back once so that the increment of the outer
* loop lands on that candidate instead of skipping it.
*/
--pivot_itemset;
break;
}
}
}
}
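/* Illustration (not part of the original listing): a minimal sketch of the
* item-level subset test that clean_up_CFI applies to two itemset strings.
* The names split_items and is_item_subset are hypothetical, and
* std::includes is used here in place of the explicit find loop above.
* Comparing whole items avoids false matches that a plain substring search
* would give, such as the item 2 inside the item 12.
*/

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Split a "-"-separated itemset string into its (sorted) items.
std::multiset<std::string> split_items(const std::string& itemset) {
    std::multiset<std::string> items;
    std::string::size_type start = 0, end;
    while ((end = itemset.find('-', start)) != std::string::npos) {
        items.insert(itemset.substr(start, end - start));
        start = end + 1;
    }
    items.insert(itemset.substr(start));  // The last (or only) item.
    return items;
}

// True when every item of small_set also appears in large_set.
bool is_item_subset(const std::string& small_set,
                    const std::string& large_set) {
    const std::multiset<std::string> small_items = split_items(small_set);
    const std::multiset<std::string> large_items = split_items(large_set);
    // Both ranges are sorted, as std::includes requires.
    return std::includes(large_items.begin(), large_items.end(),
                         small_items.begin(), small_items.end());
}
```

/* Here b-d is a subset of a-b-c-d regardless of item order, while 2 is not a
* subset of 12-3.
*/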
/*
* File: Maximum_Frequent_Itemset.h
*
*/
#ifndef MAXIMUM_FREQUENT_ITEMSET_H
#define MAXIMUM_FREQUENT_ITEMSET_H
class Maximum_Frequent_Itemset {
public:
Maximum_Frequent_Itemset();
Maximum_Frequent_Itemset(const Maximum_Frequent_Itemset& orig);
Maximum_Frequent_Itemset(const strMap& current_CFI_structure);
virtual ~Maximum_Frequent_Itemset();
/* Public function to print the Maximum Frequent Itemsets to a file.
*/
void print_MFI_to_file(const std::string output_file);
/* Function to return the main MFI structure.
*/
strMap& get_MFI_structure();
/* This function breaks an itemset into the items it contains and returns
* those.
*/
static std::multiset<std::string> get_items(std::string current_itemset);
/* This function returns the number of items that the provided itemset
* contains.
*/
static uint get_items_count(const std::string& current_itemset);
private:
strMap my_MFI;
/* This function takes advantage of the MFI candidates' locality, since
* those that have the same prefix appear one after another, and removes any
* itemsets which are subsets appearing in the beginning of the group's
* other itemsets.
*/
void purge_continuous_prefix_subsets();
/* This function checks for every MFI candidate if all its items appear in
* any of the rest candidates in any possible order. If so, then it is
* identified as a subset and gets removed.
*/
void purge_scattered_subsets();
};
#endif /* MAXIMUM_FREQUENT_ITEMSET_H */
/*
* File: Maximum_Frequent_Itemset.cpp
*
*/
#include "Maximum_Frequent_Itemset.h"
#define PURGE_RESULTS
//#define PREFIX_PURGE_VERBOSE
//#define PREFIX_PURGE_SUBSET_VERBOSE
//#define SCATTERED_PURGE_VERBOSE
//#define SCATTERED_PURGE_SUBSET_VERBOSE
Maximum_Frequent_Itemset::Maximum_Frequent_Itemset() {
}
Maximum_Frequent_Itemset::Maximum_Frequent_Itemset(
const Maximum_Frequent_Itemset& orig) : my_MFI(orig.my_MFI) {
}
Maximum_Frequent_Itemset::Maximum_Frequent_Itemset(
const strMap& current_CFI_structure) {
this->my_MFI = current_CFI_structure;
#ifdef PURGE_RESULTS
strMap::size_type initial, afterPrP, afterScatP;
initial = this->my_MFI.size();
std::cout << "Initial MFI candidates size (equals to CFI number of " <<
"records): " << initial << std::endl;
#endif
purge_continuous_prefix_subsets();
#ifdef PURGE_RESULTS
afterPrP = this->my_MFI.size();
std::cout << "MFI candidates size after continuous prefix subsets removal: "
<< afterPrP << " (-" << float(initial - afterPrP) / initial * 100 <<
"%)." << std::endl;
#endif
purge_scattered_subsets();
#ifdef PURGE_RESULTS
afterScatP = this->my_MFI.size();
std::cout << "MFI candidates size after scattered subsets removal: "
<< afterScatP << " (-" << float(initial - afterScatP) /
initial * 100 << "%)." << std::endl;
#endif
}
Maximum_Frequent_Itemset::~Maximum_Frequent_Itemset() {
}
/* Function to return the main MFI structure.
*/
strMap& Maximum_Frequent_Itemset::get_MFI_structure() {
return this->my_MFI;
}
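/* Public function to print the Maximum Frequent Itemsets to a file.
*/
void Maximum_Frequent_Itemset::print_MFI_to_file(const std::string output_file) {
strMap::const_iterator mapIt;
std::ofstream out_file(output_file.c_str());
// The statements that write each record are missing from this listing;
// one itemset with its support per line is an assumed output format.
if (out_file.is_open()) {
for (mapIt = this->my_MFI.begin(); mapIt != this->my_MFI.end();
++mapIt) {
out_file << mapIt->first << " " << mapIt->second.first << std::endl;
}
}
}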
/* This function takes advantage of the MFI candidates' locality, since those
* that have the same prefix appear one after another, and removes any itemsets
* which are subsets appearing in the beginning of the group's other itemsets.
*/
void Maximum_Frequent_Itemset::purge_continuous_prefix_subsets() {
std::string pivot_prefix, next_itemset_prefix;
std::string::size_type pivot_itemset_length;
strMap::iterator pivot_itemset, next_itemset, last_itemset;
bool subset_flag;
// With fewer than two candidates there is nothing to compare.
if (this->my_MFI.size() < 2) {
return;
}
pivot_itemset = this->my_MFI.begin();
while (pivot_itemset != this->my_MFI.end()) {
pivot_itemset_length = pivot_itemset->first.length();
pivot_prefix = pivot_itemset->first.substr(0,
pivot_itemset->first.find('-'));
subset_flag = false;
next_itemset = pivot_itemset;
for (++next_itemset; next_itemset != this->my_MFI.end();
++next_itemset) {
#ifdef PREFIX_PURGE_VERBOSE
std::cout << "Comparing " << pivot_itemset->first << " and " <<
next_itemset->first << "." << std::endl;
#endif
/* The itemsets are sorted, so once the first item differs no later
* itemset can begin with the pivot.
*/
next_itemset_prefix = next_itemset->first.substr(0,
next_itemset->first.find('-'));
if (pivot_prefix != next_itemset_prefix) {
#ifdef PREFIX_PURGE_VERBOSE
std::cout << "The pivot closed itemset (" <<
pivot_itemset->first << ") has a different prefix" <<
" than the next one (" << next_itemset->first <<
")." << std::endl;
#endif
break;
}
/* The pivot is a subset when the next itemset is longer and begins
* with the whole pivot followed by a separator. The separator
* check keeps, for example, 1-2 from matching 1-21-3.
*/
if (next_itemset->first.length() > pivot_itemset_length &&
next_itemset->first.compare(0, pivot_itemset_length,
pivot_itemset->first) == 0 &&
next_itemset->first[pivot_itemset_length] == '-') {
#ifdef PREFIX_PURGE_SUBSET_VERBOSE
std::cout << "The pivot closed itemset " <<
pivot_itemset->first << "(" <<
this->my_MFI[pivot_itemset->first].first <<
") is a subset of the next_itemset " <<
next_itemset->first << "(" <<
this->my_MFI[next_itemset->first].first <<
"). Removing it ..." << std::endl;
#endif
subset_flag = true;
break;
}
#ifdef PREFIX_PURGE_VERBOSE
std::cout << "The pivot closed itemset (" <<
pivot_itemset->first << ") is not a subset of the " <<
"next one (" << next_itemset->first << ")." << std::endl;
#endif
}
/* Post-increment inside the erase call so that the iterator moves on
* before the erased element is invalidated.
*/
if (subset_flag) {
this->my_MFI.erase(pivot_itemset++);
} else {
++pivot_itemset;
}
}
/* The above check cannot remove the last element, so if it is a
* 1-itemset we need to check it separately. Both strings are wrapped in
* separators so that only whole item ids match.
*/
last_itemset = this->my_MFI.end();
std::advance(last_itemset, -1);
if (last_itemset->first.find("-") == std::string::npos) {
for (next_itemset = this->my_MFI.begin();
next_itemset != last_itemset; ++next_itemset) {
if (("-" + next_itemset->first + "-").find(
"-" + last_itemset->first + "-") != std::string::npos) {
this->my_MFI.erase(last_itemset);
break;
}
}
}
}
/* This function returns the number of items that the provided itemset contains.
*/
uint Maximum_Frequent_Itemset::get_items_count(
const std::string& current_itemset) {
uint count = 0;
std::string items_seperator = "-";
for (uint i = 0; i < current_itemset.length(); ++i) {
if (current_itemset.compare(i, 1, items_seperator) == 0)
count++;
}
// The number of items is the separator count plus one.
count++;
return count;
}
/* This function breaks an itemset into its parts and returns those.
*/
std::multiset<std::string> Maximum_Frequent_Itemset::get_items(
std::string current_itemset) {
std::multiset<std::string> items;
std::string::size_type item_end;
do {
// Locate the end of the current first item on every pass.
item_end = current_itemset.find("-");
// When no "-" is left, substr returns the whole remainder: the last item.
items.insert(current_itemset.substr(0, item_end));
if (item_end != std::string::npos) {
// Delete the item together with its trailing "-".
current_itemset.erase(0, item_end + 1);
}
} while (item_end != std::string::npos);
return items;
}
/* This function checks for every MFI candidate if all its items appear in any
* of the rest candidates in any possible order. If so, then it is identified
* as a subset and gets removed.
*/
void Maximum_Frequent_Itemset::purge_scattered_subsets() {
strMap::iterator itemset;
std::multiset<std::string> pivot_itemset_items, next_itemset_items;
std::multiset<std::string>::iterator pos;
std::multimap<std::string::size_type, std::string> length_itemset_map;
std::multimap<std::string::size_type, std::string>::reverse_iterator
pivot_itemset, next_itemset, backup;
uint pivot_items_count, next_items_count;
bool subset_flag;
/* Sort the MFI candidates by their string length in ascending order (the
* longest is at the end). Ignore any 1-itemsets since those were checked in
* the purge_continuous_prefix_subsets() function.
*/
for (itemset = this->my_MFI.begin(); itemset !=
this->my_MFI.end(); ++itemset) {
// If the itemset string has no "-" separator it consists of one item.
if (itemset->first.find("-") != std::string::npos) {
length_itemset_map.insert(std::pair<std::string::size_type,
std::string > (itemset->first.length(), itemset->first));
}
}
/* For each itemset (pivot), starting from the longest, check whether any of
* the remaining MFI candidates is its superset. If so, remove the pivot.
*/
for (pivot_itemset = length_itemset_map.rbegin();
pivot_itemset != length_itemset_map.rend(); ++pivot_itemset) {
for (next_itemset = length_itemset_map.rbegin();
next_itemset != length_itemset_map.rend(); ++next_itemset) {
// We do not compare the same itemsets.
if (next_itemset == pivot_itemset) continue;
#ifdef SCATTERED_PURGE_VERBOSE
std::cerr << "Checking the pivot itemset " << pivot_itemset->second
<< " (" << pivot_itemset->first << ") and next itemset " <<
next_itemset->second << " (" << next_itemset->first << ")."
<< std::endl;
#endif
/* If an itemset A is larger than another B, then by definition it
* cannot be a subset of B. So proceed with the next pivot.
*/
if (pivot_itemset->first >= next_itemset->first) {
#ifdef SCATTERED_PURGE_VERBOSE
std::cerr << "The pivot " << pivot_itemset->second << " (" <<
pivot_itemset->first << ") is same or larger" <<
" in length than the next itemset " <<
next_itemset->second << " (" << next_itemset->first <<
"). Getting new pivot ..." << std::endl;
#endif
/* Break since the candidates are sorted by size and all the
* rest itemsets have the same or smaller size than the current
* pivot.
*/
break;
}
/* Calculate the number of items the pivot and the next_itemset
* contains and if the pivot has more or the same number of items,
* then continue with the next itemset.
*/
pivot_items_count = get_items_count(pivot_itemset->second);
next_items_count = get_items_count(next_itemset->second);
if (pivot_items_count >= next_items_count) {
#ifdef SCATTERED_PURGE_VERBOSE
std::cerr << "The pivot " << pivot_itemset->second <<
" has the same or more items than the next itemset (" <<
next_itemset->second << "). Checking the next itemset ..."
<< std::endl;
#endif
continue;
}
/* Find all the items the pivot itemset includes.
*/
pivot_itemset_items.clear();
pivot_itemset_items = get_items(pivot_itemset->second);
/* Find all the items the next itemset includes.
*/
next_itemset_items.clear();
next_itemset_items = get_items(next_itemset->second);
/* Check for every item in the pivot itemset if it is contained in
* the next_itemset. If not then continue with the next pivot. If
* yes remove from the MFI candidates the pivot.
*/
subset_flag = true;
for (pos = pivot_itemset_items.begin();
pos != pivot_itemset_items.end(); ++pos) {
if (next_itemset_items.find(*pos) ==
next_itemset_items.end()) {
#ifdef SCATTERED_PURGE_VERBOSE
std::cerr << "The item " << *pos <<
" is not contained in the next_itemset. " <<
"Proceeding with the next next_itemset ... " <<
std::endl;
#endif
subset_flag = false;
break;
}
}
if (subset_flag) {
#ifdef SCATTERED_PURGE_SUBSET_VERBOSE
std::cerr << "The pivot itemset " << pivot_itemset->second
<< "(" <<
this->my_MFI[pivot_itemset->second].first <<
") is subset of the next itemset " <<
next_itemset->second << "(" <<
this->my_MFI[next_itemset->second].first <<
"). Deleting the pivot ..." << std::endl;
#endif
this->my_MFI.erase(pivot_itemset->second);
// Remember the previous reverse position (a larger itemset). It stays
// valid after the erase, and the outer loop's ++ continues from it.
backup = pivot_itemset;
std::advance(backup, -1);
// Turn to base so we can delete ... (Note that always the base
// iterator is +1 the reversed).
length_itemset_map.erase(--pivot_itemset.base());
pivot_itemset = backup;
break;
}
}
}
}
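The scattered-subset test used by purge_scattered_subsets() can be illustrated in isolation. The following is only a minimal sketch under stated assumptions, not the thesis implementation: `split_items` and `is_scattered_subset` are hypothetical helper names, itemsets are assumed to be "-"-separated strings without repeated items, and `std::includes` replaces the per-item `find` loop of the listing.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: split a "-"-separated itemset string into its items.
std::multiset<std::string> split_items(const std::string& itemset) {
    std::multiset<std::string> items;
    std::string::size_type start = 0;
    std::string::size_type dash = itemset.find('-');
    while (dash != std::string::npos) {
        items.insert(itemset.substr(start, dash - start));
        start = dash + 1;
        dash = itemset.find('-', start);
    }
    // Add the last (or only) item.
    items.insert(itemset.substr(start));
    return items;
}

// Hypothetical helper: true if every item of `candidate` also appears in
// `other`, regardless of the order in which the items were written.
bool is_scattered_subset(const std::string& candidate,
                         const std::string& other) {
    const std::multiset<std::string> a = split_items(candidate);
    const std::multiset<std::string> b = split_items(other);
    // Both multisets iterate in sorted order, as std::includes requires.
    return std::includes(b.begin(), b.end(), a.begin(), a.end());
}
```

With these helpers, `is_scattered_subset("c-a", "a-b-c")` holds while `is_scattered_subset("a-d", "a-b-c")` does not, which is the condition under which the listing removes a pivot candidate.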
/*
* File: Query.h
*
* Created on 24 November 2012, 1:25 pm
*/
#ifndef QUERY_H
#define QUERY_H
#include <set>
#include <string>
class Query {
public:
Query();
Query(const Query& orig);
Query(char query_type, std::string query_args, const strMap& search_space);
virtual ~Query();
/* Function that prints the query results into a file.
*/
void print_Qresults_to_file(const std::string output_file);
private:
strMap itemset_pool;
std::set<std::string> results;
/* Function that answers the subset queries. Finds the itemsets in the
* search space which contain subsets of the provided input itemset.
*/
void subset_query(std::string itemset, const strMap& search_space);
/* Function that answers the superset queries. Finds the itemsets in the
* search space which are supersets of the provided input itemset.
*/
void superset_query(std::string itemset, const strMap& search_space);
/* Function that answers the similarity queries. Finds the itemsets in the
* search space which have m common items compared to the provided input
* itemset.
*/
void similarity_query(std::string args, const strMap& search_space);
/* Function that answers the minimum support queries. Finds the itemsets in
* the search space which have support higher than the provided input support
* threshold.
*/
void min_support_query(std::string min_support, const strMap& search_space);
/* Function that generates all the possible subset itemsets from the
* selected MFI or CFI records that are stored in the itemset_pool.
*/
void generate_subset_itemsets(const std::multiset<std::string> pivot_items);
/* Function that generates all the possible superset itemsets from the
* selected MFI or CFI records that are stored in the itemset_pool.
*/
void generate_superset_itemsets(
const std::multiset<std::string> pivot_items);
/* Function that accepts as input a set of items and a group_factor and
* generates and returns all the possible itemset combinations of size
* group_factor.
*/
std::multiset<std::string> get_combination_by_m(
std::multiset<std::string> items, uint group_factor);
};
#endif /* QUERY_H */
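Query.h uses `strMap`, and Query.cpp uses `itemset_rec`, without defining them; both come from another header of the project that is not part of this listing. Judging only from how records are accessed (`record->first`, `record->second.first`, `record->second.second`), one plausible shape is a map from the itemset string to a pair of values. The sketch below is an assumption made for illustration: the payload types (a support count and a transaction bitstring) are guesses, and `freq_bitstring_t` and `make_record` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Assumed record payload: (support, transaction-id bitstring).
typedef std::pair<unsigned int, std::string> freq_bitstring_t;
// Assumed shape of strMap: itemset string -> payload.
typedef std::map<std::string, freq_bitstring_t> strMap;
// Assumed shape of itemset_rec: one (itemset, payload) entry of the map.
typedef std::pair<std::string, freq_bitstring_t> itemset_rec;

// Hypothetical helper that builds one record for insertion into a strMap.
itemset_rec make_record(const std::string& itemset, unsigned int support,
                        const std::string& bitstring) {
    return itemset_rec(itemset, freq_bitstring_t(support, bitstring));
}
```

Under this assumption, `pool.insert(make_record("a-b", 3, "1011"))` stores a record whose support is read back as `pool["a-b"].first`.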
/*
* File: Query.cpp
*
* Created on 24 November 2012, 1:25 pm
*/
#include "Query.h"
#include <cstdlib>
#include <fstream>
#include <iostream>
//#define GENERATE_SUB_VERBOSE
//#define GENERATE_SUP_VERBOSE
//#define COMBI_VERBOSE
//#define GENERATE_SIM_VERBOSE
Query::Query() {
}
Query::Query(const Query& orig) {
}
Query::Query(char query_type, std::string query_args,
const strMap& search_space) {
switch (query_type) {
case '1':
subset_query(query_args, search_space);
break;
case '2':
superset_query(query_args, search_space);
break;
case '3':
similarity_query(query_args, search_space);
break;
case '4':
subset_query(query_args, search_space);
break;
case '5':
superset_query(query_args, search_space);
break;
case '6':
similarity_query(query_args, search_space);
break;
default:
std::cerr << "Wrong input! Exiting ..." << std::endl;
exit(1);
}
}
Query::~Query() {
}
/* Function that prints the query results into a file.
*/
void Query::print_Qresults_to_file(const std::string output_file) {
strMap::iterator record;
std::set<std::string>::iterator itemset;
std::ofstream out_file(output_file.c_str());
if (out_file.is_open()) {
// First print the selected records, then the generated itemsets.
for (record = this->itemset_pool.begin();
record != this->itemset_pool.end(); ++record) {
out_file << record->first << ":" << record->second.first << ":"
<< record->second.second << std::endl;
}
out_file << "------------------------------------------" << std::endl;
for (itemset = this->results.begin();
itemset != this->results.end(); ++itemset) {
out_file << *itemset << std::endl;
}
out_file.close();
} else {
std::cerr << "Unable to open the output file " << output_file << "."
<< std::endl;
}
}
/* Function that generates all the possible itemsets from the selected MFI
* or CFI records that are stored in the itemset_pool.
*/
void Query::generate_subset_itemsets(
const std::multiset<std::string> pivot_items) {
strMap::iterator record;
std::multiset<std::string> record_items, new_items;
std::multiset<std::string>::const_iterator item;
std::set<std::string> seeds, cur_pool, next_pool;
std::set<std::string>::iterator s, last, pitem, nitem;
std::string pivot_base, next_base, prefix, new_itemset;
// Process every record that the query selected into the itemset_pool.
for (record = this->itemset_pool.begin();
record != this->itemset_pool.end(); ++record) {
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "------------------------------- " << std::endl;
#endif
record_items = Maximum_Frequent_Itemset::get_items(record->first);
seeds.clear();
for (item = pivot_items.begin(); item != pivot_items.end();
++item) {
if (record_items.find(*item) != record_items.end()) {
seeds.insert(*item);
}
}
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "For pivot items ";
for (std::multiset<std::string>::const_iterator k = pivot_items.begin();
k != pivot_items.end(); ++k) {
std::cerr << *k << " ";
}
std::cerr << "and record " << record->first << ", seeds found:";
for (std::set<std::string>::const_iterator k = seeds.begin();
k != seeds.end(); ++k) {
std::cerr << " " << *k;
}
std::cerr << "." << std::endl;
#endif
if (!seeds.empty()) {
for (s = seeds.begin(); s != seeds.end(); ++s) {
if (this->results.find(*s) == this->results.end()) {
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "Adding in results " << *s << std::endl;
#endif
this->results.insert(*s);
}
}
} else // If there are no seeds to generate itemsets, then continue.
continue;
// If there is only one seed there can be no combination, so continue
// with the next itemset.
if (seeds.size() == 1) continue;
// The below block generates all the possible combinations of the items
// in seed container.
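cur_pool = seeds;
do {
// Recompute the last element on every pass, since cur_pool is replaced
// at the end of each iteration.
last = cur_pool.end();
std::advance(last, -1);
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "Current pool contains:";
for (std::set<std::string>::const_iterator k = cur_pool.begin();
k != cur_pool.end(); ++k) {
std::cerr << " " << *k;
}
std::cerr << std::endl;
#endif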
for (pitem = cur_pool.begin(); pitem != last; ++pitem) {
prefix = *pitem;
if (pitem->find("-") != std::string::npos) {
pivot_base = pitem->substr(0, pitem->find_last_of("-"));
} else {
pivot_base = "";
}
for (nitem = pitem, std::advance(nitem, 1);
nitem != cur_pool.end(); ++nitem) {
if (nitem->find("-") != std::string::npos) {
next_base = nitem->substr(0,
nitem->find_last_of("-"));
} else {
next_base = "";
}
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "Comparing pivot base (" << pivot_base <<
") from pivot " << *pitem << ", with next base (" <<
next_base << ") from next item " << *nitem << "." <<
std::endl;
#endif
if (pivot_base == next_base) {
if (next_base == "") {
new_itemset = prefix + "-" + *nitem;
} else {
new_itemset = prefix + "-" +
nitem->substr(nitem->find_last_of("-") + 1,
nitem->length() - nitem->find_last_of("-"));
}
new_items = Maximum_Frequent_Itemset::get_items(
new_itemset);
// Rebuild the itemset string from the sorted items.
new_itemset.clear();
for (item = new_items.begin(); item != new_items.end();
++item) {
new_itemset += "-" + *item;
}
// Delete the leading "-".
new_itemset.erase(new_itemset.begin());
if (this->results.find(new_itemset) ==
this->results.end()) {
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "Adding new itemset " << new_itemset <<
" from prefix " << prefix << "." << std::endl;
#endif
this->results.insert(new_itemset);
}
next_pool.insert(new_itemset);
} else {
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "Pivot base " << pivot_base <<
" differs from next base " << next_base <<
". Continue with next." << std::endl;
#endif
break;
}
}
}
cur_pool = next_pool;
next_pool.clear();
} while (cur_pool.size() > 1);
}
}
/* Function that answers the subset queries. Finds the itemsets in the
* search space which contain subsets of the provided input itemset.
*/
void Query::subset_query(std::string itemset, const strMap& search_space) {
strMap::const_iterator record;
std::multiset<std::string> items, record_items;
std::multiset<std::string>::iterator rec_item;
items = Maximum_Frequent_Itemset::get_items(itemset);
for (record = search_space.begin(); record != search_space.end();
++record) {
record_items = Maximum_Frequent_Itemset::get_items(record->first);
// If at least one record's item is matched then this record contains
// subsets for the itemset.
for (rec_item = record_items.begin(); rec_item != record_items.end();
++rec_item) {
if (items.find(*rec_item) != items.end()) {
this->itemset_pool.insert(itemset_rec(record->first,
record->second));
break;
}
}
}
#ifdef GENERATE_SUB_VERBOSE
std::cerr << "The items are";
for (std::multiset<std::string>::const_iterator k = items.begin();
k != items.end(); ++k) {
std::cerr << " " << *k;
}
std::cerr << std::endl;
#endif
generate_subset_itemsets(items);
}
/* Function that generates all the possible superset itemsets from the
* selected MFI or CFI records that are stored in the itemset_pool.
*/
void Query::generate_superset_itemsets(
const std::multiset<std::string> pivot_items) {
strMap::iterator record;
std::multiset<std::string> record_items, new_items;
std::multiset<std::string>::const_iterator item;
std::set<std::string> seeds, cur_pool, next_pool;
std::set<std::string>::iterator s, last, pitem, nitem;
std::string pivot_base, next_base, prefix, new_itemset;
// Process every record that the query selected into the itemset_pool.
for (record = this->itemset_pool.begin();
record != this->itemset_pool.end(); ++record) {
#ifdef GENERATE_SUP_VERBOSE
std::cerr << "------------------------------- " << std::endl;
#endif
record_items = Maximum_Frequent_Itemset::get_items(record->first);
// Add the pivot items in the results since the same itemset is
// considered as a superset of itself.
new_itemset.clear();
for (item = pivot_items.begin(); item != pivot_items.end();
++item) {
new_itemset += "-" + *item;
}
if (!new_itemset.empty()) {
// Delete the leading "-".
new_itemset.erase(new_itemset.begin());
this->results.insert(new_itemset);
}
seeds.clear();
// The pivot items are all included in every record, so we need the
// remaining items to generate all the possible combinations and at the
// end attach the pivot items.
for (item = record_items.begin(); item != record_items.end();
++item) {
if (pivot_items.find(*item) == pivot_items.end()) {
seeds.insert(*item);
}
}
#ifdef GENERATE_SUP_VERBOSE
std::cerr << "For pivot items ";
for (std::multiset<std::string>::const_iterator k = pivot_items.begin();
k != pivot_items.end(); ++k) {
std::cerr << *k << " ";
}
std::cerr << "and record " << record->first << ", seeds found:";
for (std::set<std::string>::const_iterator k = seeds.begin();
k != seeds.end(); ++k) {
std::cerr << " " << *k;
}
std::cerr << "." << std::endl;
#endif
// Add the 1-itemset seeds in the results along with the pivot items
// since those are the first proper supersets.
if (!seeds.empty()) {
for (s = seeds.begin(); s != seeds.end(); ++s) {
new_items = pivot_items;
new_items.insert(*s);
new_itemset.clear();
for (item = new_items.begin(); item != new_items.end();
++item) {
new_itemset += "-" + *item;
}
// Delete the leading "-".
new_itemset.erase(new_itemset.begin());
if (this->results.find(new_itemset) == this->results.end()) {
#ifdef GENERATE_SUP_VERBOSE
std::cerr << "Adding in results " << new_itemset << std::endl;
#endif
this->results.insert(new_itemset);
}
}
} else // If there are no seeds to generate itemsets, then continue.
continue;
// If there is only one seed there can be no combination, so continue
// with the next itemset.
if (seeds.size() == 1) continue;
// The below block generates all the possible combinations of the items
// in seed container.
cur_pool = seeds;
do {
// Recompute the last element on every pass, since cur_pool is replaced
// at the end of each iteration.
last = cur_pool.end();
std::advance(last, -1);
#ifdef GENERATE_SUP_VERBOSE
std::cerr << "Current pool contains:";
for (std::set<std::string>::const_iterator k = cur_pool.begin();
k != cur_pool.end(); ++k) {
std::cerr << " " << *k;
}
std::cerr << std::endl;
#endif
for (pitem = cur_pool.begin(); pitem != last; ++pitem) {
prefix = *pitem;
if (pitem->find("-") != std::string::npos) {
pivot_base = pitem->substr(0, pitem->find_last_of("-"));
} else {
// Handle the 2-itemsets cases.
pivot_base = "";
}
for (nitem = pitem, std::advance(nitem, 1);
nitem != cur_pool.end(); ++nitem) {
if (nitem->find("-") != std::string::npos) {
next_base = nitem->substr(0,
nitem->find_last_of("-"));
} else {
// Handle the 2-itemsets cases.
next_base = "";
}
#ifdef GENERATE_SUP_VERBOSE
std::cerr << "Comparing pivot base (" << pivot_base <<
") from pivot " << *pitem << ", with next base (" <<
next_base << ") from next item " << *nitem << "." <<
std::endl;
#endif
if (pivot_base == next_base) {
if (next_base == "") {
new_itemset = prefix + "-" + *nitem;
} else {
new_itemset = prefix + "-" +
nitem->substr(nitem->find_last_of("-") + 1,
nitem->length() - nitem->find_last_of("-"));
}
new_items = Maximum_Frequent_Itemset::get_items(
new_itemset);
// Create the new itemset.
new_itemset.clear();
for (item = new_items.begin(); item != new_items.end();
++item) {
new_itemset += "-" + *item;
}
// Delete the leading "-".
new_itemset.erase(new_itemset.begin());
// Add the new_itemset in the next pool to generate the
// next combinations!
next_pool.insert(new_itemset);
// Add the pivot items to create the final superset.
for (item = pivot_items.begin();
item != pivot_items.end(); ++item) {
new_items.insert(*item);
}
// Create the new itemset.
new_itemset.clear();
for (item = new_items.begin(); item != new_items.end();
++item) {
new_itemset += "-" + *item;
}
// Delete the leading "-".
new_itemset.erase(new_itemset.begin());
if (this->results.find(new_itemset) ==
this->results.end()) {
#ifdef GENERATE_SUP_VERBOSE
std::cerr << "Adding new itemset " << new_itemset <<
" from prefix " << prefix << "." << std::endl;
#endif
this->results.insert(new_itemset);
}
} else {
#ifdef GENERATE_SUP_VERBOSE
std::cerr << "Pivot base " << pivot_base <<
" differs from next base " << next_base <<
". Continue with next." << std::endl;
#endif
break;
}
}
}
cur_pool = next_pool;
next_pool.clear();
} while (cur_pool.size() > 1);
}
}
/* Function that answers the superset queries. Finds the itemsets in the
* search space which are supersets of the provided input itemset.
*/
void Query::superset_query(std::string itemset, const strMap& search_space) {
strMap::const_iterator record;
std::multiset<std::string> items, record_items;
std::multiset<std::string>::iterator pivot_item;
bool superset_flag;
items = Maximum_Frequent_Itemset::get_items(itemset);
for (record = search_space.begin(); record != search_space.end();
++record) {
record_items = Maximum_Frequent_Itemset::get_items(record->first);
superset_flag = true;
// All the input itemset's items must be found in the record, so it can
// be characterized as a valid superset of the input itemset.
for (pivot_item = items.begin(); pivot_item != items.end();
++pivot_item) {
if (record_items.find(*pivot_item) == record_items.end()) {
superset_flag = false;
break;
}
}
if (superset_flag) {
this->itemset_pool.insert(itemset_rec(record->first,
record->second));
}
}
generate_superset_itemsets(items);
}
/* Function that accepts as input a set of items and a group_factor and
* generates and returns all the possible itemset combinations of size
* group_factor.
*/
std::multiset<std::string> Query::get_combination_by_m(
std::multiset<std::string> items, uint group_factor) {
std::multiset<std::string> combinations, old_pool, new_pool;
std::multiset<std::string>::iterator item, item_a, item_b;
std::string new_itemset, prefix_a, prefix_b;
uint current_group_factor;
if (group_factor > items.size()) {
std::cerr << "The group factor is larger than the number of items." <<
" Exiting ..." << std::endl;
exit(1);
}
#ifdef COMBI_VERBOSE
std::cerr << "The items to play with are ";
for (item = items.begin(); item != items.end(); ++item) {
std::cerr << *item << " ";
}
#endif
combinations.clear();
// In the case we need the 1-itemsets, put those in the combination
// variable and return.
if (group_factor == 1) {
for (item = items.begin(); item != items.end(); ++item) {
combinations.insert(*item);
}
#ifdef COMBI_VERBOSE
std::cerr << "Somebody asked for 1-itemsets. Let return ";
for (item = combinations.begin(); item != combinations.end(); ++item) {
std::cerr << *item << " ";
}
std::cerr << std::endl;
#endif
return combinations;
}
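// In the case we need all of the items combined, build a string with all
// of them and return.
if (group_factor == items.size()) {
new_itemset.clear();
for (item = items.begin(); item != items.end(); ++item) {
new_itemset = new_itemset + "-" + *item;
}
// Delete the leading "-".
if (!new_itemset.empty()) new_itemset.erase(new_itemset.begin());
combinations.insert(new_itemset);
#ifdef COMBI_VERBOSE
std::cerr << "Somebody asked for max itemset. Let return ";
for (item = combinations.begin(); item != combinations.end(); ++item) {
std::cerr << *item;
}
std::cerr << std::endl;
#endif
return combinations;
}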
current_group_factor = 1;
new_pool.clear();
old_pool.clear();
// Produce the desired itemsets by building those from scratch, starting
// from the 1-itemsets, then build the 2-itemsets, then the 3-itemsets and
// so on, until we reach the goal group_factor.
while (current_group_factor < group_factor) {
// First combine the itemsets to build the possible 2-itemsets.
if (current_group_factor == 1) {
for (item_a = items.begin(); item_a != items.end(); ++item_a) {
for (item_b = item_a, std::advance(item_b, 1);
item_b != items.end(); ++item_b) {
new_itemset = *item_a + "-" + *item_b;
old_pool.insert(new_itemset);
}
}
current_group_factor++;
} else {
for (item_a = old_pool.begin(); item_a != old_pool.end();
++item_a) {
for (item_b = item_a, std::advance(item_b, 1);
item_b != old_pool.end(); ++item_b) {
// Combine itemsets only if they have the same max prefix.
prefix_a = (*item_a).substr(0, (*item_a).find_last_of("-"));
prefix_b = (*item_b).substr(0, (*item_b).find_last_of("-"));
if (prefix_a == prefix_b) {
// Add to item_a the last item from item_b.
new_itemset = *item_a + "-" + (*item_b).substr(
(*item_b).find_last_of("-") + 1);
new_pool.insert(new_itemset);
} else {
break;
}
}
}
old_pool = new_pool;
new_pool.clear();
current_group_factor++;
}
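#ifdef COMBI_VERBOSE
std::cerr << "The old pool contains the ";
for (item = old_pool.begin(); item != old_pool.end(); ++item) {
std::cerr << *item << " ";
}
std::cerr << std::endl;
#endif
}
combinations = old_pool;
#ifdef COMBI_VERBOSE
std::cerr << "We return the ";
for (item = combinations.begin(); item != combinations.end(); ++item) {
std::cerr << *item << " ";
}
std::cerr << std::endl;
#endif
return combinations;
}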
/* Function that answers the similarity queries. Finds the itemsets in the
* search space which have m common items compared to the provided input
* itemset.
*/
void Query::similarity_query(std::string args, const strMap& search_space) {
strMap::const_iterator record;
uint similarity_factor;
std::string itemset, new_record_itemset;
std::multiset<std::string> all_items, combi_items, excluded_items,
record_items, combi_itemsets;
std::multiset<std::string>::iterator pivot_item, combi, item;
bool superset_flag, contains_excluded_item;
// The arguments have the form "<itemset> <similarity factor>".
itemset = args.substr(0, args.find(" "));
similarity_factor = atoi(args.substr(args.find(" ") + 1).c_str());
all_items = Maximum_Frequent_Itemset::get_items(itemset);
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Itemset received: " << itemset << ", similarity factor: " <<
similarity_factor << std::endl;
#endif
combi_itemsets = get_combination_by_m(all_items, similarity_factor);
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Itemsets to find similar with: ";
for (item = combi_itemsets.begin(); item != combi_itemsets.end(); ++item) {
std::cerr << *item << " ";
}
std::cerr << std::endl;
#endif
// For every combination itemset returned, use the superset algorithm for
// all the possible itemsets that share the same common items.
for (combi = combi_itemsets.begin(); combi != combi_itemsets.end();
++combi) {
this->itemset_pool.clear();
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Combi searched: " << *combi << "." << std::endl;
#endif
combi_items.clear();
combi_items = Maximum_Frequent_Itemset::get_items(*combi);
excluded_items.clear();
// Up to here we know that similarity_factor items will be the common
// ones, so we need to verify that no other common item will appear. So
// we mark all the items we do NOT want to appear at the end.
for (item = all_items.begin(); item != all_items.end(); ++item) {
if (combi_items.find(*item) == combi_items.end()) {
excluded_items.insert(*item);
}
}
for (record = search_space.begin(); record != search_space.end();
++record) {
record_items = Maximum_Frequent_Itemset::get_items(record->first);
superset_flag = true;
// All the combination's items must be found in the record, so it
// can be characterized as a valid superset of the combination.
for (pivot_item = combi_items.begin(); pivot_item !=
combi_items.end(); ++pivot_item) {
if (record_items.find(*pivot_item) == record_items.end()) {
superset_flag = false;
break;
}
}
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Record: " << record->first << ", superset flag: " <<
superset_flag << "." << std::endl;
#endif
if (superset_flag) {
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Superset found: " << record->first << std::endl;
#endif
// The record is a superset. If it has more common items then
// remove the additional ones to avoid higher similarity than
// the one we need.
contains_excluded_item = false;
for (item = excluded_items.begin();
item != excluded_items.end(); ++item) {
if (record_items.find(*item) != record_items.end()) {
contains_excluded_item = true;
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Removing item " << *item <<
" from the record " << record->first <<
std::endl;
#endif
record_items.erase(*item);
}
}
if (contains_excluded_item) {
// Now construct the proper record without the extra similar
// items.
new_record_itemset.clear();
for (item = record_items.begin(); item !=
record_items.end(); ++item) {
new_record_itemset += "-" + *item;
}
new_record_itemset.erase(new_record_itemset.begin());
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Candidate itemset found (trimmed): " <<
new_record_itemset << std::endl;
#endif
this->itemset_pool.insert(itemset_rec(new_record_itemset,
record->second));
} else {
#ifdef GENERATE_SIM_VERBOSE
std::cerr << "Candidate itemset found (original): " <<
record->first << std::endl;
#endif
this->itemset_pool.insert(itemset_rec(record->first,
record->second));
}
}
}
generate_superset_itemsets(combi_items);
}
}
/* Function that answers the minimum support queries. Finds the itemsets in
* the search space which have support higher than the provided input support
* threshold.
*/
void Query::min_support_query(std::string min_support,
const strMap& search_space) {
//UNDER CONSTRUCTION
}
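The level-wise construction used by get_combination_by_m() (join two k-itemsets that share the same prefix to obtain a (k+1)-itemset) can be sketched compactly as follows. This is an illustration under assumptions, not the thesis code: `combinations_of_size`, `prefix_of` and `last_item_of` are hypothetical names, the items are assumed to be distinct and free of "-" characters, a `std::set` replaces the multiset, and the sketch needs C++11 (`auto`, `std::next`).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>

// Part of an itemset string before its last "-" ("" for a single item).
static std::string prefix_of(const std::string& itemset) {
    const std::string::size_type dash = itemset.find_last_of('-');
    return dash == std::string::npos ? std::string() : itemset.substr(0, dash);
}

// Last item of an itemset string.
static std::string last_item_of(const std::string& itemset) {
    const std::string::size_type dash = itemset.find_last_of('-');
    return dash == std::string::npos ? itemset : itemset.substr(dash + 1);
}

// Hypothetical function: all combinations of exactly m items, each written
// as a "-"-separated string with its items in sorted order.
std::set<std::string> combinations_of_size(const std::set<std::string>& items,
                                           std::size_t m) {
    if (m == 0 || m > items.size()) return std::set<std::string>();
    std::set<std::string> pool = items;  // Level 1: the single items.
    for (std::size_t level = 1; level < m; ++level) {
        std::set<std::string> next;
        for (auto a = pool.begin(); a != pool.end(); ++a) {
            for (auto b = std::next(a); b != pool.end(); ++b) {
                // Join only itemsets that share the same prefix.
                if (prefix_of(*a) != prefix_of(*b)) continue;
                next.insert(*a + "-" + last_item_of(*b));
            }
        }
        pool.swap(next);
    }
    return pool;
}
```

The sketch skips non-matching pairs with `continue` instead of the `break` used in the listing, so it does not rely on itemsets with a common prefix being adjacent in the container; the listing's `break` is the faster choice when that ordering holds.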

[1] C. Makris, Data Mining and Learning Algorithms, 2011. [Online]. Available:
http://mmlab.ceid.upatras.gr/courses/data_mining/.
[2] T. I. A. S. R. Agrawal, Mining Association rules between sets of items in large
databases, Proceedings of the 1993 ACM SIGMOD Conference, pp. 207-216, 1993.
[3] R. S. R. Agrawal, Fast Algorithms for mining association rules, Proceedings of the 20th
Int'l Conference on Very Large Data Bases, pp. 478-499, 1994.
[4] H. T. A. I. V. Heikki Mannila, Efficient Algorithms Association for Discovering Rules,
1994.
[5] N. A. a. J. H. Spencer, The Probabilistic Method, New York: John Wiley Inc., 1992.
[6] B. Bollobas, Combinatorics, Cambridge : Cambridge University Press, 1986.
[7] L. V. L. J. H. A. P. Raymond T. Ng, Exploratory Mining and Pruning Optimizations of
Constrained Associations Rules, 1998.
[8] Y. B. R. T. L. L. Nicolas Pasquier, Efficient Mining of Association rules using Closed
Itemset Lattices, 1999.
[9] A. U. T. E. A. Necip Fazil Ayan, An Efficient Algorithm To Update Large Itemsets With
Early Pruning, 1999.
[10] S. D. L. a. B. K. D. W. Cheung, A general incremental technique for maintaining
discovered association rules., DASFAA 97, pp. 185-194, April 1997.
[11] E. O. a. A. Savasere., Efficient mining of association rules in large dynamic databases.,
BNCOD98, pp. 49-63, 1998.
[12] J. H. Y. Y. R. M. Jian Pei, Mining Frequent Patterns without Candidate Generation: A
Frequent-Pattern Tree Approach, 2000.
[13] R. A. C. a. P. V. Agarwal, A tree projection algorithm for generation of frequent,
Journal of Parallel and Distributed Computing, 61:350-371, 2001.
[14] J. H. a. R. M. Jian Pei, CLOSET: An Efficient Algorithm for Mining Frequent Closed
Itemsets, 2000.
[15] Y. B. R. T. L. L. N. Pasquier, Discovering frequent closed itemsets for association rules.,
In Proc. 7th Int. Conf. Database Theory (ICDT'99), pp. 398-416, Jan 1999.
[16] M. J. Z. a. C. Hsiao, Charm: An efficient algorithm for closed association rule mining.,
In Technical Report 99-10, Computer Science, Rensselaer Polytechnic Institute, 1999.
[17] W. W. C. B. L. Qinghua Zou, SmartMiner: A Depth First Algorithm Guided by Tail
Information for Mining Maximal Frequent Itemsets, 2002.
[18] M. C. J. G. Doug Burdick, MAFIA: A Maximal Frequent Itemset Algorithm for
Transactional Databases, 2001.
[19] M. J. Z. Karam Gouda, Efficiently Mining Maximal Frequent Itemsets, 2001.
[20] J. W. Y. L. P. T. Jiawei Han, Mining TopK Frequent Closed Patterns without Minimum
Support, 2002.
[21] J. X. Y. W. X. X. Guimei Liu Hongjun Lu, 2003.
[22] J. H. J. P. Jianyong Wang, CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets, 2003.
[23] Y. P. K. W. a. J. H. J. Liu, Mining frequent item sets by opportunistic projection.,
SIGKDD'02, 2002.
[24] A. P. a. D. Zandolin, Mining Frequent Itemsets using Patricia Tries, 2004.
[25] H. L. J. X. Y. Guimei Liua, CFP-tree: A compact disk-based structure for storing and
querying frequent itemsets, 2005.
[26] J. Z. Goesta Grahne, Fast Algorithms for Frequent Itemset Mining Using FP-Trees,
2005.
[27] C. H. M. D. o. C. E. a. I. A. K. T. Ioannis N. Kouris, On-line generation of association rules
using inverted files indexing and compression, 2002.
[28] M. I. a. C.-J. H. Mohammed J. Zaki, Efficient Algorithms for Mining Closed Itemsets and
Their Lattice Structure, 2005.
[29] K. G. M.J. Zaki, Fast Vertical Mining Using Diffsets, Proc. Ninth ACM SIGKDD Intl Conf.
Knowledge Discovery and Data, 2003.
[30] S. O. R. P. Claudio Lucchese, Fast and Memory Efficient Mining of Frequent Closed
Itemsets, 2006.
[31] D. Z. Mei Qiao, Efficiently Matching Frequent Patterns Based on Bitmap Inverted
Files Built from Closed Itemsets, 2011.
[32] T.-P. H. B. L. Bay Vo, DBV-Miner: A Dynamic Bit-Vector approach for fast mining
frequent closed itemsets, 2012.
[33] S. P. M. O. a. W. L. Mohammed Javeed Zaki, New Algorithms for Fast Discovery of
Association Rules, 1997.
[34] S. O. P. P. R. P. F. S. Claudio Lucchese, kDCI: a Multi-Strategy Algorithm for Mining
Frequent Sets, 2004.
[35] S. O. P. P. R. P. F. S. Claudio Lucchese, Adaptive and Resource-Aware Mining of
Frequent Sets, Data Mining (ICDM02), pp. 338-345, 2002.
[36] R. T. N. P. G. S. L. L. Y. Bastide, Mining frequent patterns with counting inference,
ACM SIGKDD Explorations Newsletter, 2(2):66-75, 2000.
[37] C++ enthusiasts, cplusplus.com. [Online]. Available:
http://www.cplusplus.com/reference/map/map/.
[38] Wikipedia, the free encyclopedia, Run-length encoding, November 2012. [Online].
Available: http://en.wikipedia.org/wiki/Run-length_encoding.
[39] J. D. Cook, Binomial coefficients. [Online]. Available:
http://www.johndcook.com/binomial_coefficients.html.
[40] M. K. J. P. Jiawei Han, Data Mining Concepts and Techniques, USA: Morgan Kaufmann,
2012.
[41] C. Borgelt, Christian Borgelt's Web Pages. [Online]. Available:
http://www.borgelt.net/slides/fpm4.pdf.
