Combined use of ARM and graph clustering methods to find association in urban routes
(Case study: database of Tehran traffic status)
V. Dehghan, Sh. Khadivi, A. Farahi
Abstract: Finding meaningful associations in basket data is one of the oldest problems in data mining, and it is usually solved by mining association rules. In this paper we propose an approach that expresses the influence of traffic between routes as association rules by combining data mining techniques with clustering methods. First, routes are grouped into homogeneous clusters using different clustering methods; then rules are extracted from each detected cluster. This approach reduces the need to search a massive data list and produces interesting rules at a low time cost. Finally, the best clustering method is proposed on the basis of the association rules produced.
Index Terms: Association Rules, Community Detection, Data Mining, Graph Clustering, Transaction Data, Traffic.
1 INTRODUCTION
Urban traffic is a major concern for the governments of many metropolises, and developing urban infrastructure such as highways and tunnels has not yet solved it. The causes of heavy traffic appear to include a lack of attention to traffic guidance, the lack of evacuation plans for roads closed at critical times, and a lack of attention to measuring traffic on routes and predicting urban traffic. Data mining methods can help with this issue. Finding hot route segments is an important topic that benefits urban designers, police stations and the many other organizations involved in urban traffic. Detecting routes whose traffic is similar and that influence one another leads to smoother traffic and better traffic steering. This problem has been addressed using domain knowledge of the city [1].
In this research we have a large database of the statuses that occurred on Tehran road segments, and we plan to analyze these data by association rule mining. Association rule mining in large databases has been studied for the past few decades, and a variety of algorithms and methods have been introduced for it. Our aim in this paper is to find such rules in the traffic data of the city of Tehran. Because the data are massive and transactional, current algorithms cannot find these rules directly. Focusing on this application, we therefore extract the rules in a way that overcomes the input/output problem and produces interesting rules, which is why we use clustering methods.
V. Dehghan is a graduate student at Payame Noor University and works at Tehran Municipality (+9809144001756). Sh. Khadivi was with Amirkabir University, Tehran; he is now with the Department of Computer ... A. Farahi is with the Computer Engineering Department, University of Payame Noor, Tehran.
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 2, FEBRUARY 2012, ISSN 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG
Many clustering approaches rely on a similarity measure to divide objects into homogeneous groups. A variety of algorithms has been introduced for this purpose, but all of them require parameters that the user must set. Determining these values needs specific knowledge of the field, changing them usually produces different results, and repeating the computation on huge data is very expensive. Similarity measures include the Minkowski, Chebyshev, Manhattan and Euclidean distances, among others. The case-study database of this research, however, has two variables: road segment (route) and status. The status variable takes three values: heavy, very heavy and smooth. The similarity we are interested in is the relation between routes, not the similarity of variables. For this reason we must move the problem into a network setting: to find these relations we need to construct the graph of connecting routes. By creating this graph and detecting the groups whose internal density is higher than that of other groups, and then applying the solution explained in the rest of the paper, we extract from these groups the records that are suitable for association rule mining.
Many complex systems can be represented as networks, where the objects and the relationships between them are represented by nodes and edges [11]. Complex systems are usually organized in partitions, each of which has its own function. In the network representation, such partitions appear as sets of nodes with a high density of internal links, while the links between partitions are sparse. These partitions are called communities or modules, and they occur in a variety of large networked systems. Finding and studying these partitions may clarify the organization of systems and their functions [4, 5]. The network of urban streets is a complex system. In 2008 Blondel and colleagues presented a new algorithm; they identified language communities in a Belgian mobile phone network of 2 million customers and analyzed a web graph of 118 million nodes and more than one billion links [6]. The results of applying this algorithm to very large numbers of nodes were very good. We treat each route as a node and each link between routes as an edge. Our constructed graph has 7975 nodes and 16976 edges, and each node of the Tehran graph has a list of transactions related to its events during the year. By mining these transactional lists we produce association rules that express the influence of routes on one another.
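The graph construction described above can be sketched as follows. This is only an illustrative sketch: the sample links, the event log, the cluster labels and the choice of one transaction per time slot are all assumptions made for the example, not details taken from the Tehran database.

```python
from collections import defaultdict

# Hypothetical sample data: pairs of directly connected routes.
links = [(2656, 5689), (5689, 2526), (2526, 7001)]

# Hypothetical event log: route id -> list of (time slot, status) records.
events = {
    2656: [("t1", "Heavy"), ("t2", "Smooth")],
    5689: [("t1", "Heavy")],
    2526: [("t1", "Very heavy")],
    7001: [],
}


def build_graph(links):
    """Return an undirected adjacency map: route id -> set of neighbouring routes."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def transactions_by_cluster(events, cluster_of):
    """Build one transaction of "route_status" items per (cluster, time slot).

    Assumption: events that fall in the same time slot form one transaction.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for route, records in events.items():
        for slot, status in records:
            grouped[cluster_of[route]][slot].add(f"{route}_{status}")
    return {
        cluster: [sorted(items) for _, items in sorted(slots.items())]
        for cluster, slots in grouped.items()
    }


graph = build_graph(links)
# Hypothetical cluster labels, as a community detection step might return them.
cluster_of = {2656: 0, 5689: 0, 2526: 1, 7001: 1}
per_cluster = transactions_by_cluster(events, cluster_of)
print(per_cluster[0])
```

Each cluster then carries its own, much shorter transaction list, which is what makes mining each cluster separately cheaper than mining the whole database at once.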
A natural question is whether such an algorithm can fail to produce meaningful output, for example how it behaves on random graphs. The experiments performed show that the currently available algorithms are, in many cases, the fastest and most reliable. In the following we briefly discuss how the two algorithms were measured.
As mentioned in the introduction, the aim in networks is optimal community detection. Many methods have been developed using tools and techniques from disciplines such as physics, biology, applied mathematics, and computer and social science. However, it is still not clear which algorithms are reliable and can be used in applications. The question of reliability is itself tricky, because it requires a shared definition of community and partition, which does not yet exist despite the large amount of research in this field, and researchers have not agreed on how to define measures such as optimality and reliability. There has nevertheless been tacit acceptance of a simple network model that serves as a basis for comparing and developing clustering algorithms, namely the planted l-partition model [5]. The purpose of this model is to construct a graph with communities of predefined size and number, called a benchmark graph, against which the accuracy and speed of clustering algorithms can be compared. In this model each partition consists of a certain number of nodes. Each node is linked to nodes of its own group with probability p_in and to nodes of other groups with probability p_out. As long as p_in > p_out, the groups can be called communities; if this condition does not hold, the network is a random graph without any structure. Several benchmark graphs have been proposed. The GN benchmark, named after Girvan and Newman, has 128 nodes in 4 groups of 32 members, and each node has degree 16. It has two fundamental problems: first, all nodes have the same degree; second, all communities have the same size. These features are not realistic, because complex networks have heterogeneous degree distributions [6]. The other benchmark, called LFR, is the basis for comparing the clustering algorithms.
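A planted partition graph of the kind described above can be generated with a few lines of code. The sketch below uses only the Python standard library; the function name, the seed handling and the probability values in the usage example are illustrative assumptions and are not taken from the benchmarks cited in the text.

```python
import random
from itertools import combinations


def planted_partition(groups, group_size, p_in, p_out, seed=0):
    """Return (edges, group_of) for a planted partition graph.

    Every pair of nodes in the same group is linked with probability p_in,
    every pair in different groups with probability p_out.
    """
    rng = random.Random(seed)
    n = groups * group_size
    group_of = {node: node // group_size for node in range(n)}
    edges = []
    for u, v in combinations(range(n), 2):
        p = p_in if group_of[u] == group_of[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, group_of


# Sizes as in the GN benchmark (4 groups of 32 nodes); probabilities are illustrative.
edges, group_of = planted_partition(groups=4, group_size=32, p_in=0.4, p_out=0.03)
internal = sum(group_of[u] == group_of[v] for u, v in edges)
print(len(edges), "edges,", internal, "inside groups")
```

With p_in well above p_out, most edges fall inside groups; as the two probabilities approach each other the planted structure fades into a random graph, which is the regime in which detection algorithms are expected to fail.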
In the LFR benchmark graph the node degree distribution is heterogeneous. It is emphasized, however, that even when communities are present, different algorithms may not be able to detect them [6]. Most existing algorithms work well on GN because of its simplicity, but LFR tests algorithms more severely and exposes their limitations. The results of that study, in terms of both time complexity and agreement between the detected communities and the structure present in the graph, show that two algorithms, Infomap and Blondel, had the best performance [5].
GRAPH WEIGHTING
Edge weight is one of the parameters used by most graph clustering methods, and it affects how the homogeneity of a cluster is measured. In this section we use the information recorded in the database of traffic events and a simple approach: we compare the behavior of adjacent nodes at the same event times, which determines the similarity of traffic between the nodes. For example, consider two adjacent nodes and their recorded events in Table I. Comparing nodes 5689 and 2656 over the n recorded transactions gives a value between 0 and 1: the value is close to 1 when the two nodes behave alike and close to 0 otherwise. All routes in the Tehran graph are weighted in this way.
TABLE I. EXAMPLE RECORDS
Routes    Status
2526      Very heavy
5689      Smooth
2656      Heavy
5689      Heavy
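The text does not give the exact weighting formula, so the following is only one plausible reading, stated as an assumption: the weight of an edge is the fraction of shared time slots in which the two adjacent routes report the same status. The function name and the sample records are hypothetical.

```python
def edge_weight(events_a, events_b):
    """Fraction of shared time slots in which two routes report the same status.

    events_a, events_b: dict mapping a time slot to a status string.
    Returns a value in [0, 1]; 0.0 when the routes share no time slot.
    """
    shared = events_a.keys() & events_b.keys()
    if not shared:
        return 0.0
    same = sum(1 for slot in shared if events_a[slot] == events_b[slot])
    return same / len(shared)


# Hypothetical records for two adjacent routes.
route_2656 = {"t1": "Heavy", "t2": "Heavy", "t3": "Smooth", "t4": "Very heavy"}
route_5689 = {"t1": "Heavy", "t2": "Smooth", "t3": "Smooth", "t4": "Very heavy"}

# The routes agree in 3 of their 4 shared time slots.
print(edge_weight(route_2656, route_5689))  # 0.75
```

A weight defined this way is symmetric and needs no user-set parameter, which fits the stated goal of avoiding parameters that require domain knowledge.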
ASSOCIATION RULES
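Each rule has the form A ⇒ B. The left part of the rule is called the antecedent and the right part the consequent, and the two sets have no member in common. A rule is described by its support s and its confidence c, for which the thresholds minimum support and minimum confidence are set. They are defined in Eq. 1 and 2, where T is the set of transactions, |A| is the number of transactions that contain A, and |A ∪ B| is the number of transactions that contain both A and B:
$$s = \frac{|A \cup B|}{|T|} \qquad (1)$$
$$c = \frac{|A \cup B|}{|A|} \qquad (2)$$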
A and B appear together in at least s% of the transactions, and B occurs in at least c% of the transactions in which A occurs. A set of {road segment, status} pairs is called an itemset, and itemsets that satisfy the minimum support are frequent itemsets. The first efficient algorithm for mining association rules was Apriori [2]; other algorithms, such as Partition [3],[4], reduce the number of database reads and improve computational efficiency. We use FpGrowth (see footnote 2) because it performed fastest on our data. Each of these algorithms has limitations when faced with massive databases. One goal of this study is to provide a guideline for exploring association rules in urban routes in cases where the usual methods cannot discover them because of these limitations.
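The two measures can be computed directly from a list of transactions. The sketch below also computes lift, using its standard definition (confidence divided by the relative frequency of the consequent), since lift is reported alongside confidence in the results. The function names and the toy transactions are illustrative assumptions, and the code is a plain counting example, not an implementation of FpGrowth.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent => consequent.

    Assumes the antecedent and the consequent each occur in at least
    one transaction; otherwise a division by zero would occur.
    """
    n = len(transactions)
    both = count(transactions, antecedent | consequent)
    support = both / n
    confidence = both / count(transactions, antecedent)
    lift = confidence / (count(transactions, consequent) / n)
    return support, confidence, lift


# Hypothetical transactions of "route_status" items.
transactions = [
    {"2656_Heavy", "5689_Heavy"},
    {"2656_Heavy", "5689_Heavy"},
    {"2656_Heavy", "5689_Smooth"},
    {"2656_Smooth", "5689_Smooth"},
]

support, confidence, lift = rule_metrics(
    transactions, {"2656_Heavy"}, {"5689_Heavy"}
)
print(support, confidence, lift)
```

In this toy data the rule holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions that contain the antecedent; a lift above 1 indicates that the two items co-occur more often than they would if they were independent.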
Fig. I. Top: result of Infomap with 485 communities. Bottom: result of Blondel with 41 communities.
Fig. II. Example of a generated rule (rule 2: 2656_Heavy, 5689_Heavy), presented in the Google Maps service.
2 An association rule mining algorithm that reduces database reads by constructing a frequent pattern tree.
CONCLUSION
In this paper, by moving the problem into a graph (network) space, we overcame the complexity issue, and by defining a specific application in this complex space, namely the effect of routes on one another, we were able to produce interesting rules; Fig. II shows one of them. For this application, the approach resolves the I/O problems that ARM algorithms face with large databases, and it generates interesting rules that could not be extracted with ordinary approaches. Because we cluster highly connected routes and mine each cluster separately, we are confident that the generated rules are meaningful. The FpGrowth algorithm was tested and, as expected, was extremely efficient when faced with a high attribute count. The results of this application are useful for all organizations involved in traffic problems in their cities.
$$\mathrm{Confidence}(A \Rightarrow B) = \frac{|A \cup B|}{|A|}$$
TABLE III. RESULT OF ARM WITH FPGROWTH AND CALCULATION OF LIFT AND CONFIDENCE MEASUREMENTS
Each cell gives blondel / infomap (41 / 485 detected clusters); groups are listed in source order.
Group   Rules         Confidence      Lift
1       4 / 177       0.94 / 0.93     1.37 / 1.12
2       15 / 503      0.84 / 0.80     1.57 / 1.18
3       428 / 1346    0.73 / 0.77     2.89 / 1.66
4       ...           0.64 / ...      3.83 / 2.24
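ClusteringAlgorithms          blondel     infomap
AverageTime                   00:31:32    00:03:44
DetectedClusters              41          485
...                           34926       36770
Comprehensibility (group 1)   0           3
RulesCount (group 1)          4           177
Comprehensibility (group 2)   0           38
RulesCount (group 2)          15          503
Comprehensibility (group 3)   44          116
RulesCount (group 3)          428         1346
Comprehensibility (group 4)   682         418
RulesCount (group 4)          3656        3082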
REFERENCES
[1] X. Li, J. Han, J. Lee, H. Gonzalez, 2007, Traffic Density-Based Discovery of Hot Routes in Road Networks, Advances in Spatial and Temporal Databases, pp. 441-459.
[2] R. Agrawal, R. Srikant, 1994, Fast Algorithms for Mining Association Rules, VLDB Conference, pp. 487-499.
[3] A. Savasere, E. Omiecinski, S. Navathe, 1995, An Efficient Algorithm for Mining Association Rules in Large Databases, Proceedings of the 21st International Conference on VLDB, pp. 432-444.
[4] J. Han, J. Pei, Y. Yin, 2000, Mining Frequent Patterns without Candidate Generation, ACM-SIGMOD Int, Vol. 29 Issue 2.
[5] A. Lancichinetti, S. Fortunato, 2010, Community Detection Algorithms: a Comparative Analysis, Physical Review E, Vol. 80 Issue 5.
[6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, 2008, Fast Unfolding of Communities in Large Networks, Journal of Statistical Mechanics: Theory and Experiment (stacks.iop.org), P10008.
[7] S. Schaeffer, 2007, Graph Clustering, Computer Science Review (Elsevier), Vol. 1 Issue 1, pp. 27-64.
[8] S. Fortunato, 2010, Community Detection in Graphs, Physics Reports (Elsevier), Vol. 486 Issue 3-5, pp. 75-174.
[9] D. A. Spielman, 2008, Spectral Graph Theory, Random Walks on Graphs, Vol. 1, pp. 1-75.
[10] P. P. Wakabi-Waiswa, V. Baryamureeba, 2008, Extraction of Interesting Association Rules Using Genetic Algorithms, International Journal of Computing and ICT, Vol. 2, No. 1, pp. 26-33.
[11] T. Raeder, N. V. Chawla, 2010, Market Basket Analysis with Networks, Journal of Targeting, Measurement and Analysis for Marketing, Vol. 11 Issue 4, pp. 373-386.
[12] M. Plasse, N. Niang, G. Saporta, A. Villeminot, L. Leblond, 2007, Combined Use of Association Rules Mining and Clustering Methods to Find Relevant Links Between Binary Rare Attributes in a Large Data Set, Comput. Statist. Data Anal., Vol. 52 Issue 1, pp. 596-613.
V. Dehghan is a graduate student at Payame Noor University, Faculty of Engineering, Tehran center, and an employee of Tehran Municipality, information and communication technology.
Sh. Khadivi holds a Dr. rer. nat. (Ph.D.) in Computer Science, July 2008. Research area: statistical machine translation.
A. Farahi is with Payame Noor University, Computer Engineering and Information Technology Department.