
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 2, FEBRUARY 2012, ISSN 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG


Combined use of ARM and graph clustering methods to find association in urban routes
(Case study: database of Tehran traffics status)
V. Dehghan, Sh. Khadivi, A. Farahi
Abstract: Finding meaningful associations in basket data is one of the oldest problems in data mining. The solution is to analyze and mine relational rules. In this paper we propose an approach that expresses the influence of traffic between routes as association rules; the approach combines data mining techniques with clustering methods. First, routes are grouped into homogeneous clusters by different clustering methods, and then rules are extracted from each detected cluster. This approach reduces the need to search massive data lists and produces interesting rules at low time cost. Finally, the best clustering method is proposed with respect to the association rules produced.

Index Terms: Association Rules, Community Detection, Data Mining, Graph Clustering, Transaction Data, Traffic.

1 INTRODUCTION
Urban traffic is one of the major concerns of many governments in their metropolises, and developing civil infrastructure such as highways and tunnels has not yet solved it properly. The causes of heavy traffic appear to be a lack of attention to traffic guidance, the lack of an evacuation plan for closed roads at critical times, and a lack of attention to measuring traffic on routes and predicting urban traffic. One helpful way to address this issue is to use data mining methods. Finding hot route segments is one of the important subjects and is of benefit to urban designers, police stations, and many other organizations involved in urban traffic. Detecting routes whose traffic is similar and that influence one another leads to smoother traffic and better traffic steering. This problem has been addressed with domain knowledge of the city [1].

In this research we have a large database of the statuses recorded on Tehran road segments, and we plan to analyze this data by association rule mining. Association rule mining in large databases has been studied for the past few decades, and a variety of algorithms and methods have been introduced for it. Our aim in this paper is to find these rules in the traffic data of the city of Tehran. Given the massive volume of transactional records, current algorithms cannot find these rules directly, so, focusing on this application, in this paper we

Vahid Dehghan is a graduate student at Payame Noor University and works in Tehran Municipality, +9809144001756. Shahram Khadivi was with Amirkabir University, Tehran; he is now with the Department of Computer . Author is with the Computer Engineering Department, University of Payame Noor, Tehran.

are going to extract these rules while overcoming the input/output problem and producing interesting rules, which is why we use clustering methods. Many clustering approaches rely on similarity measures to divide objects into homogeneous groups. A variety of algorithms have been introduced for this purpose, but all of them require parameters that must be set by the user. Determining these values needs specific knowledge of the field, changing them usually leads to different results, and the cost of such changes on huge data is very high. Examples of similarity measures are the Minkowski, Chebyshev, Manhattan, and Euclidean distances. However, the case study database of this research has two variables, road segment¹ and status. The second variable takes three values: heavy, very heavy, and smooth. The similarity considered here is the relation between routes, not the similarity of variables. For this reason we must move the problem into a network representation: to find these relations, we need to construct the graph of connected routes. By creating the graph and detecting the groups whose internal density is higher than that of other groups, and then applying the solution explained in the rest of the paper, we extract from these groups the records that are appropriate for association rule mining.

Many complex systems can be represented as networks, where the objects and the relationships between them are illustrated by nodes and edges [11]. Complex systems are usually organized in partitions, each of which has its own function. In the network representation, such partitions appear as sets of nodes with a high density of internal links, while the density of links between partitions is low; these partitions are called communities or modules and occur in a variety of large networked systems. Finding and studying these
¹ Routes.


partitions may clarify the organization of systems and their functions [4, 5]. The network of urban streets is a complex system. In 2008, Blondel and colleagues presented a new algorithm, with which they identified language communities in a Belgian mobile phone network of 2 million customers and analyzed a web graph of 118 million nodes and more than one billion links [6]; the algorithm performed very well on these huge networks. Consider each route as a node and each link between routes as an edge. Our constructed graph has 7975 nodes and 16976 edges, and each node of the Tehran graph has a list of transactions recording its events during the year. By mining these transactional lists we produce association rules that express route influences.
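The graph described here, with routes as nodes, links as edges, and a transaction list attached to every node, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_route_graph`, the record layout, and the two links are assumptions, while the route ids and statuses are those shown in Table I.

```python
from collections import defaultdict


def build_route_graph(links, events):
    """Build an undirected route graph plus a per-route transaction list.

    links  : iterable of (route_a, route_b) pairs, one per connection.
    events : iterable of (route, time_slot, status) records.
    """
    neighbors = defaultdict(set)
    for a, b in links:
        if a == b:
            continue  # ignore self-loops in this sketch
        neighbors[a].add(b)
        neighbors[b].add(a)

    transactions = defaultdict(list)
    for route, time_slot, status in events:
        transactions[route].append((time_slot, status))
        neighbors.setdefault(route, set())  # keep isolated routes as nodes
    return dict(neighbors), dict(transactions)


# Hypothetical links; route ids and statuses are taken from Table I.
links = [(2526, 5689), (2656, 5689)]
events = [
    (2526, "2010/19/12 13:15", "Very heavy"),
    (5689, "2010/19/12 13:15", "Smooth"),
    (2656, "2010/20/12 15:10", "Heavy"),
    (5689, "2010/20/12 15:10", "Heavy"),
]
neighbors, transactions = build_route_graph(links, events)
n_edges = sum(len(v) for v in neighbors.values()) // 2
print(len(neighbors), "nodes,", n_edges, "edges")
```

Keeping the transaction list on the node means that, once communities are detected, the records to be mined can be collected by walking only the nodes of one community.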

2 GRAPH CLUSTERING ALGORITHMS

As mentioned in the introduction, the aim in networks is optimal community detection. Many methods have been developed, using tools and techniques from disciplines such as physics, biology, applied mathematics, and computer and social science. However, it is still not clear which algorithms are reliable and can be used in applications. The question of reliability is itself tricky, because the concept requires a shared definition of community and partition, which, despite much research in this field, does not yet exist; nor do researchers agree on how to define indexes such as optimality and reliability. There has, however, been tacit acceptance of a simple network model that serves as a basis for comparing and developing clustering algorithms, namely the planted l-partition model [5]. In this model the purpose is to construct a graph with communities of predefined size and number, called a benchmark graph, against which the efficiency and speed of detection of clustering algorithms can be compared. Each partition consists of a certain number of nodes. Each node is connected to the nodes within its own group with probability p_in and to the nodes of other groups with probability p_out. As long as p_in > p_out, the groups can be called communities; if this condition does not hold, the network is a random graph without any structure.

Different benchmark graphs have been proposed. One, by Girvan and Newman, is known as the GN benchmark: it has 128 nodes in 4 groups of 32 members, and the degree of each node is 16. It has two fundamental problems: first, all nodes have the same degree, and second, all communities have the same size. These features are not realistic, because complex networks have heterogeneous degree distributions [6]. The other benchmark, called LFR, is the basis for comparing clustering algorithms. In this benchmark graph the node degree distribution is heterogeneous; however, it is emphasized that even if communities are present, different algorithms may not be able to detect them [6]. Most existing algorithms work well on GN because of its simplicity, but LFR tests algorithms heavily and exposes their limitations. The results of this study, in terms of both time complexity and agreement between the detected communities and the structures present in the graph, show that two algorithms, Infomap and Blondel, had the best performance [5].

Now the question is whether an algorithm may fail to return any output; for example, how do these algorithms behave on random graphs? The experiments performed show that the algorithms currently available are, in many cases, the fastest and most reliable. In the following, the measures used by two algorithms are briefly discussed.

2.1 Fast Greedy Modularity

This is a heuristic method based on optimization: it maximizes an objective function. The modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities with respect to the links between communities. The Blondel algorithm is the fastest developed on this basis; identifying communities in a network of 118 million nodes took only 152 minutes [8].
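The modularity objective used by Fast Greedy Modularity (Section 2.1) is not written out in the text. The sketch below assumes the standard Newman-Girvan definition for an unweighted undirected graph, Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2, where m is the number of edges. The function name and the toy graph are illustrative only; the code scores a given partition and is not the Blondel optimization itself.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition of an undirected graph.

    edges        : list of (u, v) pairs, each undirected edge listed once.
    community_of : dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single edge (a made-up toy graph).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
merged = {node: "all" for node in split}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(round(q_split, 3), round(q_merged, 3))
```

Splitting the toy graph into its two triangles scores higher than putting every node in one community, which is the behavior a modularity-maximizing method exploits.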

2.2 Random walks


A random walk is a mathematical formalization of a path made up of consecutive random steps, for example the path of a molecule in a gas or the path of an animal searching for food. Urban routes today are generally not a perfect square grid. Suppose a person reaches a junction and chooses among the available routes at random for the rest of the trip: if the junction has seven exits, each of them is taken with probability one in seven. This is called a random walk on a graph. In an undirected weighted graph, a random walk is a process that starts from some node and at each time step moves to a neighboring node. When the graph is not weighted, the next node is selected uniformly at random from among the neighbors. When the graph is weighted, the probability of following an edge is proportional to its weight. The Infomap algorithm selects communities based on this method [6].
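The stepping rule described above can be sketched in a few lines. This is only an illustration of a weighted random walk, not of Infomap itself; the adjacency dictionary, its weights, and the function name are made up for the example.

```python
import random


def random_walk(adjacency, start, steps, rng):
    """Walk up to `steps` steps; each edge is chosen with probability
    proportional to its weight (uniformly when all weights are equal)."""
    path = [start]
    node = start
    for _ in range(steps):
        neighbors = list(adjacency[node])
        if not neighbors:
            break  # dead end: no outgoing route
        weights = [adjacency[node][n] for n in neighbors]
        node = rng.choices(neighbors, weights=weights, k=1)[0]
        path.append(node)
    return path


# Hypothetical weighted graph: adjacency[u][v] is the weight of edge u-v.
adjacency = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.9, "C": 0.5},
    "C": {"A": 0.1, "B": 0.5},
}
rng = random.Random(0)  # fixed seed so the walk is reproducible
path = random_walk(adjacency, "A", 1000, rng)
```

Passing the random generator in as an argument keeps the walk reproducible, which helps when comparing clustering runs.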

3 GRAPH WEIGHTING

Edge weight is one of the parameters used by most graph clustering methods, and it is effective in measuring the homogeneity of clusters. In this section, using the information recorded in the database of traffic events, we apply a simple solution: we study the behavior of adjacent nodes at the same event times and thereby determine the similarity of traffic between nodes. For example, consider two adjacent nodes and their recorded events in Table I. Comparing nodes 5689 and 2656 over the n recorded transactions gives a value between 0 and 1; when the behavior of the two nodes is the same the value is close to 1, and vice versa. With this solution all routes in the Tehran graph are weighted.
TABLE I. EXAMPLE RECORDS

Routes   Time Stamp       Event Date   Status
2526     13:15 to 13:30   2010/19/12   Very heavy
5689     13:15 to 13:30   2010/19/12   Smooth
2656     15:10 to 15:30   2010/20/12   Heavy
5689     15:10 to 15:30   2010/20/12   Heavy
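The text does not give the exact weighting formula. One plausible reading, assumed here, is the share of common time slots in which two adjacent routes were recorded with the same status, which yields a value between 0 and 1 as described. The function name and the records marked as made up are hypothetical.

```python
def edge_weight(events_a, events_b):
    """Share of common time slots in which two routes had the same status.

    events_a, events_b : dicts mapping a time slot to the recorded status.
    Returns a value in [0, 1], or None when the routes share no time slot.
    """
    shared = events_a.keys() & events_b.keys()
    if not shared:
        return None
    same = sum(1 for slot in shared if events_a[slot] == events_b[slot])
    return same / len(shared)


route_2656 = {
    ("2010/20/12", "15:10"): "Heavy",
    ("2010/21/12", "08:00"): "Smooth",      # made-up record
}
route_5689 = {
    ("2010/19/12", "13:15"): "Smooth",
    ("2010/20/12", "15:10"): "Heavy",
    ("2010/21/12", "08:00"): "Very heavy",  # made-up record
}
w = edge_weight(route_2656, route_5689)
print(w)  # 0.5: the statuses agree in one of the two shared slots
```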


4 ASSOCIATION RULES

Each rule has the form A ⇒ B. The left part of the rule is called the antecedent and the right part the consequent, and the two sets have no member in common. Two parameters, called minimum support (s) and minimum confidence (c), are defined by Eq. 1 and Eq. 2, where T is the set of all transactions and |X| is the number of transactions that contain every item of X:

$$ \frac{|A \cup B|}{|T|} \geq s \qquad (1) $$

$$ \frac{|A \cup B|}{|A|} \geq c \qquad (2) $$

That is, A and B appear together in at least s% of the transactions, and B occurs in at least c% of the transactions in which A occurs. A set of {road segment, status} pairs is called an itemset, and itemsets that satisfy the minimum support are frequent itemsets. The first efficient algorithm for mining association rules was Apriori [2]; other algorithms, such as Partition [3], [4], reduce the number of database reads and improve computational efficiency. We use FP-Growth² because it performs fastest on our data. Each of these algorithms has limitations when faced with massive databases. One of our goals in this study is to provide a guideline for exploring association rules in urban routes, where the usual methods for discovering these rules are not feasible because of the limitations mentioned.
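Support and confidence, as defined in Eq. 1 and Eq. 2, can be computed directly by counting transactions. The sketch below is a brute-force illustration on a handful of itemsets, not FP-Growth; the function names and the transactions marked as made up are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, over that of the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("confidence is undefined when the antecedent never occurs")
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


transactions = [
    frozenset({"2526_Veryheavy", "5689_Smooth"}),
    frozenset({"2656_Heavy", "5689_Heavy"}),
    frozenset({"2656_Heavy", "5689_Heavy", "2526_Veryheavy"}),  # made up
    frozenset({"2656_Heavy", "5689_Smooth"}),                   # made up
]
A, B = {"2656_Heavy"}, {"5689_Heavy"}
s = support(transactions, A | B)
c = confidence(transactions, A, B)
print(s, round(c, 3))
```

A rule A ⇒ B is kept only when `s` reaches the minimum support and `c` reaches the minimum confidence.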

5 GATHERING TRAFFIC PROPAGATION


As Table I shows, the database records are in transactional form, not in itemset form, so we need to combine these transactions. We use the concept of propagation: based on event time, and at specified intervals such as t = 15 min, the statuses of all nodes in each community are formed into itemsets, for example:
1) 2526_Veryheavy,5689_Smooth
2) 2656_Heavy,5689_Heavy

Fig. I: The top illustration is the result of Infomap with 485 communities, and the bottom is the Blondel result with 41 communities.
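The composition of transactional records into itemsets can be sketched as a group-by on the event window. This is a minimal illustration under the assumption that records sharing the same recorded date and time window belong to the same itemset; the function name and the tuple layout are hypothetical, and the records are the rows of Table I.

```python
from collections import defaultdict


def compose_itemsets(records, community):
    """Group the records of one community into one itemset per time window.

    records   : iterable of (route, date, window, status) tuples.
    community : set of route ids belonging to the same detected cluster.
    """
    buckets = defaultdict(set)
    for route, date, window, status in records:
        if route in community:
            item = f"{route}_{status.replace(' ', '').capitalize()}"
            buckets[(date, window)].add(item)
    # Sort windows, and items inside each itemset, for a stable result.
    return [sorted(items) for _, items in sorted(buckets.items())]


records = [
    (2526, "2010/19/12", "13:15 to 13:30", "Very heavy"),
    (5689, "2010/19/12", "13:15 to 13:30", "Smooth"),
    (2656, "2010/20/12", "15:10 to 15:30", "Heavy"),
    (5689, "2010/20/12", "15:10 to 15:30", "Heavy"),
]
itemsets = compose_itemsets(records, community={2526, 2656, 5689})
for number, items in enumerate(itemsets, start=1):
    print(f"{number}) {','.join(items)}")
```

Restricting the grouping to one community at a time is what keeps each itemset list small enough to mine.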

6 APPLYING CLUSTERING METHODS TO GENERATE HOMOGENEOUS COMMUNITIES


The chosen clustering methods were applied to the case study graph, and each method identified its communities; the results of these algorithms are illustrated in Fig. I.

Fig. II: Example of a generated rule, presented in the Google Maps service.

² Association rule mining algorithm that decreases database reads by constructing a frequent-pattern tree.


7 ANALYZING RESULTS ON REAL DATA


Our purpose in this section is to answer the question: which approach, in this application, produces interesting and beneficial rules in the best time? As Table II shows, the Blondel approach divided the graph, with high modularity, into 41 subgraphs. With minimum support and confidence of 5% and 30% respectively, it produced 3656 association rules, of which 682 are comprehensive³, equivalent to 18% of all rules. Because this data differs in nature from basket data, the minimum support and confidence here are low values. The Infomap approach divided the graph into 485 subgraphs according to the random walk measure and, with minimum support and confidence of 5% and 30%, produced 3082 rules, of which 418 are comprehensive, equivalent to 21% of all rules. Comparing the averages of the comprehensibility values produced over the range of support and confidence settings, we conclude that the Infomap approach gives good results on this measure. Rule comprehensibility is important here because it shows that the condition of one route affects many routes, which identifies traffic bottlenecks.

Many measures exist to complement the support-confidence approach in selecting interesting rules [12]. One way to assess the importance of rules is to use a variety of indexes; however, there is no agreement on specific indexes, and the choice depends on the data and the application [11]. Here we use the lift and confidence indexes to assess the rules. Lift, shown in Eq. 3, is one of the best-known statistical measures of rule interestingness [10]. Comparing Eq. 3 and Eq. 4 in light of the type of data in this application, we can say that the confidence index is good, because it considers the relation between A and B without involving all transactions; it also accounts for the transactions that include A but not B [11].

Suppose two road segments have a direct traffic relation in a specific month of the year, as in Fig. III. Even if A and B occur together in 100% of the cases, equivalent to 10 records, indexes such as lift may remove the rule from the rule list, because they depend on the proportion of the total of 1000 transactions, which include A, B, C, D, E, and F, and by Eq. 1 A and B rarely appear together. Comparing the average values of the confidence index in Table III, the Infomap method gives better results than the Blondel method.

$$ \mathrm{Lift}(A \Rightarrow B) = \frac{p(A \cup B)}{p(A)\,p(B)} \qquad (3) $$

Fig. III: A and B have direct traffic in 100% of the inserted transactions.
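Lift, as given in Eq. 3, can be computed by the same kind of counting used for support. The sketch below is a brute-force illustration; the function name is hypothetical, and the transactions are invented (the letters only echo Fig. III, they are not the paper's data).

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent => consequent, following Eq. 3."""
    n = len(transactions)
    a, b = frozenset(antecedent), frozenset(consequent)
    p_a = sum(1 for t in transactions if a <= t) / n
    p_b = sum(1 for t in transactions if b <= t) / n
    p_ab = sum(1 for t in transactions if (a | b) <= t) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("lift is undefined when either side never occurs")
    return p_ab / (p_a * p_b)


# Invented transactions over the items A, B, C.
transactions = [
    frozenset({"A", "B"}),
    frozenset({"A", "B"}),
    frozenset({"A"}),
    frozenset({"B"}),
    frozenset({"C"}),
    frozenset({"C"}),
]
lift_ab = lift(transactions, {"A"}, {"B"})
lift_ba = lift(transactions, {"B"}, {"A"})
lift_ac = lift(transactions, {"A"}, {"C"})
print(round(lift_ab, 3), round(lift_ba, 3), lift_ac)
```

Unlike confidence, lift is symmetric in A and B, and every term in it is a proportion of all transactions.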

8 CONCLUSION

In this paper, by converting the problem space to a graph network, we overcame the complexity issue, and by defining a required application in this complex space, namely the effect of routes on each other, we were able to produce interesting rules; Fig. II shows one of them. With this approach and in this application, the I/O problems that ARM⁴ algorithms face on large databases were resolved, and interesting rules were generated that could not be extracted with ordinary approaches. By clustering highly connected routes and mining each cluster, we are sure that the generated rules are meaningful. The FP-Growth algorithm was tested and, as expected, is extremely efficient when facing a high attribute count. The results of this application are very useful for all organizations involved in the traffic problems of their cities.

$$ \mathrm{Confidence}(A \Rightarrow B) = \frac{|A \cup B|}{|A|} \qquad (4) $$

³ A rule is comprehensive when its antecedent has fewer items than its consequent.

⁴ Association rule mining.


TABLE II. RESULT OF ARM WITH FPGROWTH

Clustering algorithm            blondel     infomap
Average time                    00:31:32    00:03:44
All itemsets count              34926       36770
Detected clusters               41          485
conf-min=80%, supp-min=50%
  Rules count                   4           177
  Comprehensibility             0           3
conf-min=50%, supp-min=30%
  Rules count                   15          503
  Comprehensibility             0           38
conf-min=50%, supp-min=10%
  Rules count                   428         1346
  Comprehensibility             44          116
conf-min=30%, supp-min=5%
  Rules count                   3656        3082
  Comprehensibility             682         418

TABLE III. RESULT OF ARM WITH FPGROWTH AND CALCULATION OF LIFT AND CONFIDENCE MEASUREMENTS

Clustering algorithm            blondel     infomap
Detected clusters               41          485
conf-min=80%, supp-min=50%
  Rules                         4           177
  Confidence                    0.94        0.93
  Lift                          1.37        1.12
conf-min=50%, supp-min=30%
  Rules                         15          503
  Confidence                    0.84        0.80
  Lift                          1.57        1.18
conf-min=50%, supp-min=10%
  Rules                         428         1346
  Confidence                    0.73        0.77
  Lift                          2.89        1.66
conf-min=30%, supp-min=5%
  Rules                         3656        3082
  Confidence                    0.64        0.66
  Lift                          3.83        2.24

REFERENCES
[1] X. Li, J. Han, J. Lee, Hector, 2007, Traffic Density-Based Discovery of Hot Routes in Road Networks, Advances in Spatial and Temporal Databases, pp. 441-459.
[2] R. Agrawal, R. Srikant, 1994, Fast Algorithms for Mining Association Rules, In VLDB Conference, pp. 487-499.
[3] A. Savasere, E. Omiecinski, S. Navathe, 1995, An Efficient Algorithm for Mining Association Rules in Large Databases, Proceedings of the 21st International Conference on VLDB, pp. 432-444.
[4] J. Han, J. Pei, Y. Yin, 2000, Mining Frequent Patterns without Candidate Generation, ACM-SIGMOD Int, Vol. 29 Issue. 2.
[5] A. Lancichinetti, S. Fortunato, 2010, Community Detection Algorithms: a Comparative Analysis, Physical Review E, Vol. 80 Issue. 5.
[6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, 2008, Fast Unfolding of Communities in Large Networks, stackes.iop.org, P1008.
[7] S. Schaeffer, 2007, Graph Clustering, journal of Elsevier, Vol. 1 Issue. 1, pp. 27-64.
[8] S. Fortunato, 2010, Community Detection in Graph, Journal of Elsevier, Vol. 486 Issue. 3-5, pp. 75-174.
[9] D. A. Spielman, 2008, Spectral Graph Theory, Random Walks On Graph, Vol. 1, pp. 1-75.
[10] P. P. Wakabi-Waiswa, V. Baryamureeba, 2008, Extraction Of Interesting Association Rules Using Genetic Algorithms, International Journal of Computing and ICT, Vol. 2, No. 1, pp. 26-33.
[11] T. Raeder, N. V. Chawla, 2010, Market Basket Analysis with Networks, Journal of Targeting Measurement and Analysis for Marketing, Vol. 11 Issue. 4, pp. 373-386.
[12] M. Plasse, N. Niang, G. Saporta, A. Villeminot, L. Leblond, 2007, Combined Use Of Association Rules Mining And Clustering Methods To Find Relevant Links Between Binary Rare Attributes In Large DataSet, Comput. Statist. Data Anal, Vol. 52 Issue. 1, pp. 596-613.

First: V. Dehghan is a graduate student at Payame Noor University, Faculty of Engineering, Tehran center, and an employee in information and communication technology at Tehran Municipality.
Second: Sh. Khadivi, Dr. rer. nat. (Ph.D.) in Computer Science, July 2008. Research area: Statistical Machine Translation.
Third: A. Faraahi, Payame Noor University, Computer Engineering and Information Technology Department.
