Christos Doulkeridis · George A. Vouros
Qiang Qu · Shuhui Wang (Eds.)

Mobility Analytics
for Spatio-Temporal
and Social Data
LNCS 10731
First International Workshop, MATES 2017
Munich, Germany, September 1, 2017
Revised Selected Papers
Lecture Notes in Computer Science 10731
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zurich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7409
Editors
Christos Doulkeridis, University of Piraeus, Piraeus, Greece
George A. Vouros, University of Piraeus, Piraeus, Greece
Qiang Qu, Shenzhen Institutes of Advanced Technology, Shenzhen, China
Shuhui Wang, Institute of Computing Technology, Beijing, China

ISSN 0302-9743 ISSN 1611-3349 (electronic)


Lecture Notes in Computer Science
ISBN 978-3-319-73520-7 ISBN 978-3-319-73521-4 (eBook)
https://doi.org/10.1007/978-3-319-73521-4

Library of Congress Control Number: 2017962899

LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© Springer International Publishing AG 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

This proceedings volume contains revised versions of the papers presented at the First
International Workshop on Mobility Analytics for Spatiotemporal and Social Data
(MATES 2017), held in conjunction with the 43rd International Conference on Very
Large Data Bases (VLDB 2017), in Munich, Germany, on September 1, 2017.
Mobility analytics is a timely topic owing to the ever-increasing number of diverse,
real-life applications, ranging from social media to land, sea, and air surveillance
systems, which produce massive amounts of streaming spatiotemporal data, whose
acquisition, cleaning, representation, aggregation, processing, and analysis pose new
challenges for the data management community. The aim of MATES is to bring
together researchers and practitioners interested in developing data-intensive applica-
tions that analyze big spatiotemporal/societal data, in order to foster the exchange of
new ideas on multidisciplinary real-world problems, propose innovative solutions, and
stimulate further research in the area of big spatiotemporal/societal data management
and analysis. The workshop intends to bridge the gap between researchers and domain
experts, most importantly to raise awareness of real-world problems in critical domains
which require novel data management solutions, tailored to addressing the specific
needs of each domain.
The peer-review process put great emphasis on ensuring a high quality of accepted
contributions. Every paper was reviewed by at least three Program Committee
(PC) members. The MATES PC accepted six submissions (46%) as full papers and
another two submissions (15%) as short papers out of a total of 13 submissions. After
careful revision of accepted papers, based both on comments of reviewers and dis-
cussions during the workshop, the chairs decided to allocate the same maximum page
length for all papers included in this volume.
Apart from the peer-reviewed papers that were presented at the workshop, the
program included two keynote speeches, one from academia and one from the
industrial sector. The first keynote — “Effective and Efficient Community Search” —
was given by Dr. Reynold Cheng, Associate Professor of the Department of Computer
Science at the University of Hong Kong. The second keynote — “How Data Analytics
Enables Advanced AIS Applications” — was given by Ernest Batty, Technical
Director of IMIS Global Limited. After the workshop, both keynote speakers were
invited to submit a paper describing the research objectives and future challenges
related to their talks, and these papers are included in this volume.
This volume is structured as follows: In the first part, we include the invited papers
from the keynote speakers. Then, the research papers are grouped in thematic areas:
The second part concerns “Social Networks Analytics and Applications,” while the
third part addresses “Spatiotemporal Mobility Analytics.” In this way, the grouping of
research papers reflects the two major foci of the workshop, namely, mobility analytics
for social data and mobility analytics for spatiotemporal data.
The editors wish to thank the PC members for helping MATES put together a
program of high-quality papers that provides an up-to-date overview of the area of
mobility analytics for spatiotemporal and social data. In addition, the editors would like
to thank all authors for submitting their work to MATES.
On a final note, we wish to mention that this workshop was partially supported by
the European Union’s Horizon 2020 research and innovation programme datAcron:
Big Data Analytics for Time Critical Mobility Forecasting, under grant agreement
number 687591, the CAS Pioneer Hundred Talents Program, and the MOE Key
Laboratory of Machine Perception at Peking University under grant number
K-2017-02.

November 2017
Christos Doulkeridis
George Vouros
Qiang Qu
Shuhui Wang
Organization

Program Committee
Natalia Andrienko Fraunhofer Institute IAIS, Germany
Alexander Artikis NCSR Demokritos, Greece
Elena Camossi NATO Centre for Maritime Research and Experimentation (CMRE), Italy
Christophe Claramunt Naval Academy Research Institute, France
Jose Manuel Cordero Garcia CRIDA, Spain
Christos Doulkeridis University of Piraeus, Greece
Georg Fuchs Fraunhofer Institute IAIS, Germany
Maria Halkidi University of Piraeus, Greece
Anne-Laure Jousselme NATO Centre for Maritime Research and Experimentation (CMRE), Italy
Sofia Karagiorgou University of Piraeus, Greece
Jooyoung Lee Syracuse University, USA
Jiehuan Luo Jinan University, China
Michael Mock Fraunhofer Institute IAIS, Germany
Mohamed Mokbel University of Minnesota, USA
Kjetil Noervaag Norwegian University of Science and Technology, Norway
Kostas Patroumpas University of Piraeus, Greece
Nikos Pelekis University of Piraeus, Greece
Jiang Qingshan Chinese Academy of Sciences, China
Qiang Qu Shenzhen Institutes of Advanced Technology, China
Cyril Ray Naval Academy Research Institute, France
Giorgos Santipantakis University of Piraeus, Greece
David Scarlatti Boeing Research and Technology Europe, Spain
Liu Siyuan The Pennsylvania State University, USA
Yannis Theodoridis University of Piraeus, Greece
Goce Trajcevski Northwestern University, USA
Akrivi Vlachou University of Piraeus, Greece
George Vouros University of Piraeus, Greece
Shuhui Wang Chinese Academy of Sciences, China
Raymond Wong The Hong Kong University of Science and Technology, SAR China
Contents

On Attributed Community Search
Yixiang Fang and Reynold Cheng

Data Analytics Enables Advanced AIS Applications
Ernest Batty

What do Geotagged Tweets Reveal About Mobility Behavior?
Pavlos Paraskevopoulos and Themis Palpanas

Edge Representation Learning for Community Detection in Large Scale Information Networks
Suxue Li, Haixia Zhang, Dalei Wu, Chuanting Zhang, and Dongfeng Yuan

Introducing ADegree: Anonymisation of Social Networks Through Constraint Programming
Sergei Solonets, Victor Drobny, Victor Rivera, and JooYoung Lee

JEREMIE: Joint Semantic Feature Learning via Multi-relational Matrix Completion
Jiaming Zhang, Shuhui Wang, Qiang Qu, and Qingming Huang

A Big Data Driven Approach to Extracting Global Trade Patterns
Giannis Spiliopoulos, Dimitrios Zissis, and Konstantinos Chatzikokolakis

Efficient Processing of Spatiotemporal Pattern Queries on Historical Frequent Co-Movement Pattern Datasets
Shahab Helmi and Farnoush Banaei-Kashani

Exploratory Spatio-Temporal Queries in Evolving Information
Chiara Francalanci, Barbara Pernici, and Gabriele Scalia

Efficient Cross-Modal Retrieval Using Social Tag Information Towards Mobile Applications
Jianfeng He, Shuhui Wang, Qiang Qu, Weigang Zhang, and Qingming Huang

Author Index


On Attributed Community Search

Yixiang Fang (✉) and Reynold Cheng

Department of Computer Science, The University of Hong Kong, Hong Kong, China
{yxfang,ckcheng}@cs.hku.hk

Abstract. Communities, which are prevalent in attributed graphs (e.g.,
social networks and knowledge bases), can be used in emerging applications
such as product advertisement and the setting up of social events.
Given a graph G and a vertex q ∈ G, the community search (CS) query
returns a subgraph of G that contains vertices related to q. In this article,
we study CS over two common attributed graphs, where (1) vertices are
associated with keywords; and (2) vertices are augmented with locations.
For keyword-based attributed graphs, we investigate the keyword-based
attributed community (or KAC) query, which returns a KAC for a query
vertex. A KAC satisfies both structure cohesiveness (i.e., its vertices are
tightly connected) and keyword cohesiveness (i.e., its vertices share com-
mon keywords). For spatial-based attributed graphs, we aim to find the
spatial-aware community (or SAC), whose vertices are close structurally
and spatially, for a query vertex in an online manner. To enable effi-
cient KAC search and SAC search, we propose efficient query algorithms.
We also perform experimental evaluation on large real datasets, and the
results show that our methods achieve higher effectiveness than the state-
of-the-art community retrieval algorithms. Moreover, our solutions are
faster than baseline approaches. In addition, we develop the C-Explorer
system to assist users in extracting, visualizing, and analyzing KACs.

1 Introduction

Owing to the growth of gigantic social networks (e.g., Flickr and Facebook),
the topic of attributed graphs has attracted attention from both industry and
academia [18–20,29,37,49,53]. Essentially, an attributed graph is a graph
in which vertices or edges are associated with attributes. The attributes of a
vertex often describe its features, such as interests, hobbies, and locations,
while the attributes of an edge often indicate the relationship between its two
endpoints. In this article, we consider two typical kinds of attributed graphs,
namely keyword-based attributed graphs and spatial-based attributed graphs, and
study the problem of community search on them.
Let us see some examples of attributed graphs. Figure 1 illustrates a
keyword-based attributed graph, where each vertex represents a social network
user, and each edge represents the friendship between two users. The keywords
of each user describe the interests of that user. Figure 2 depicts a
spatial-based attributed graph with nine users in three cities, and each user
has a location. The solid lines represent their social relationships, and the
dashed lines denote their locations.

[Figure 1, vertex labels:
John: {research, sports, web}
Alex: {chess, tour, research}
Ada: {art, cook, music}
Mike: {film, research, sports}
Alice: {art, music, yoga}
Bob: {research, sports, yoga}
Anna: {art, cook, music}
Jack: {research, sports, tour}
Tom: {fiction, film, game}]

Fig. 1. A keyword-based attributed graph and an attributed community.

[Figure 2, users: Tom, Jim, Bob, Jack, Jason, Eric, Jeff, John, Leo; cities: Berlin, Paris, London]

Fig. 2. A spatial-based attributed graph.

© Springer International Publishing AG 2018
C. Doulkeridis et al. (Eds.): MATES 2017, LNCS 10731, pp. 1–21, 2018.
https://doi.org/10.1007/978-3-319-73521-4_1
The problems related to retrieving communities from a graph can generally be
classified into community detection (CD) and community search (CS). In general,
CD methods aim to discover all communities of a graph [22,38,41,42,44,49,50,
54]. These solutions are not “query-based”, i.e., they are not customized for a
query request (e.g., a user-specified query vertex). Besides, it is not clear how
these algorithms can efficiently return a community that contains a given vertex
q. Moreover, they can take a long time to find all the communities of a large
graph, and so they are not suitable for quick or online retrieval of communities.
To solve the issues above, CS methods [1,8,9,32,33,48] have recently been
developed. CS solutions aim to search for the community of a specific query
vertex in an “online” manner, which implies that these approaches are query-based.
However, existing CS algorithms assume non-attributed graphs, and only use
the graph structure information to find communities. Thus, it is desirable to
develop methods for searching communities that take attributes into account.
As we will show later, the use of attribute information can significantly improve
the effectiveness of the communities retrieved.
In this article, we systematically study the motivation, applications, features,
technical challenges, algorithms, and experimental evaluation of searching com-
munities over these two kinds of attributed graphs. In the following, we will
detail how to tackle these issues.

1.1 CS on Keyword-Based Attributed Graphs

Given an attributed graph G and a vertex q ∈ G, the keyword-based attributed
community (or KAC) query returns one or more subgraphs of G known as
keyword-based attributed communities (or KACs). A KAC is a kind of community
that consists of closely related vertices. In particular, a KAC
satisfies structure cohesiveness (i.e., its vertices are closely linked to each other)
and keyword cohesiveness (i.e., its vertices have keywords in common). Figure 1
illustrates a KAC (circled), which is a connected subgraph with minimum vertex
degree 3; its vertices {Jack, Bob, John, Mike} have two keywords (i.e., “research”
and “sports”) in common.
The main features of KAC search are: (1) Ease of interpretation. As shown
in Fig. 1, a KAC contains tightly connected vertices with similar contexts or
backgrounds. Thus, a query user can focus on the common keywords or features
of these vertices (e.g., the vertices of the KAC in this example contain “research”
and “sports”, reflecting that all members of this KAC like research and sports).
(2) Personalization. The user of a KAC query can control the semantics of the
KAC by specifying a set S of keywords. Intuitively, S decides the meaning of the
KAC based on the user’s need. If we let q = Jack, k = 2 and S = {“research”}, the
KAC is formed by {Jack, Bob, John, Mike, Alex}, who are all interested in research.
Thus, with the use of different keyword sets S, different “personalized”
communities can be obtained. (3) Online evaluation. Similar to other CS solutions,
we have developed efficient query algorithms for large graphs, allowing KACs to be
generated quickly upon a query request.
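To make keyword cohesiveness concrete, the short sketch below recomputes the two examples from the text using the keyword sets shown in Fig. 1. It covers only the keyword side of a KAC: the edges of Fig. 1 are not listed in the text, so the structure (minimum degree) condition is not checked here. The helper names `shared_keywords` and `candidates` are hypothetical and introduced only for illustration.

```python
# Keyword sets copied from the vertex labels of Fig. 1.
keywords = {
    "John": {"research", "sports", "web"},
    "Alex": {"chess", "tour", "research"},
    "Ada": {"art", "cook", "music"},
    "Mike": {"film", "research", "sports"},
    "Alice": {"art", "music", "yoga"},
    "Bob": {"research", "sports", "yoga"},
    "Anna": {"art", "cook", "music"},
    "Jack": {"research", "sports", "tour"},
    "Tom": {"fiction", "film", "game"},
}


def shared_keywords(members):
    """Return the keywords common to every listed vertex (hypothetical helper)."""
    return set.intersection(*(keywords[v] for v in members))


def candidates(S):
    """Return the vertices whose keyword sets contain every keyword in S."""
    return {v for v, w in keywords.items() if S <= w}


# The circled community of Fig. 1 shares exactly "research" and "sports".
print(sorted(shared_keywords(["Jack", "Bob", "John", "Mike"])))
# With S = {"research"}, five users remain as candidate members.
print(sorted(candidates({"research"})))
```

Changing S changes the candidate set, which is the personalization effect described in feature (2); whether the candidates actually form a community still depends on the degree constraint.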
We define a KAC based on the minimum degree. We formulate keyword
cohesiveness as maximizing the number of shared keywords in the keyword set S.
The shared keywords naturally reveal the common features among vertices (e.g.,
common interests of social network users). A simple way of answering a KAC
query is to consider all the possible keyword combinations, and then return
the subgraphs that satisfy the minimum degree constraint and have the most
shared keywords. This solution has a complexity exponential in the size of q’s
keyword set, so it is impractical when q’s keyword set is large.
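The simple enumeration strategy described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' baseline: the keyword sets come from Fig. 1, but the edge list is ASSUMED (the text does not give the edges of Fig. 1) and was chosen so that the two examples quoted in the text are reproduced. The function names are hypothetical.

```python
from itertools import combinations

# Keyword sets from Fig. 1, restricted to the users who list "research".
keywords = {
    "John": {"research", "sports", "web"},
    "Alex": {"chess", "tour", "research"},
    "Mike": {"film", "research", "sports"},
    "Bob": {"research", "sports", "yoga"},
    "Jack": {"research", "sports", "tour"},
}
# ASSUMED edges (not given in the text): a 4-clique plus Alex with two links.
edges = [("Jack", "Bob"), ("Jack", "John"), ("Jack", "Mike"), ("Bob", "John"),
         ("Bob", "Mike"), ("John", "Mike"), ("Alex", "John"), ("Alex", "Jack")]


def k_core_component(vertices, k, q):
    """Component containing q after peeling the subgraph induced by `vertices`."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    changed = True
    while changed:  # repeatedly drop vertices whose degree is below k
        changed = False
        for v in [v for v in adj if len(adj[v]) < k]:
            for u in adj.pop(v):
                adj[u].discard(v)
            changed = True
    if q not in adj:
        return set()
    seen, stack = {q}, [q]
    while stack:  # collect the connected component of q
        for u in adj[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def naive_kac(q, k, S=None):
    """Try keyword subsets from largest to smallest: up to 2^|S| - 1 subsets."""
    S = keywords[q] if S is None else set(S)
    for size in range(len(S), 0, -1):
        found = []
        for subset in combinations(sorted(S), size):
            sub = set(subset)
            cand = {v for v, w in keywords.items() if sub <= w}
            community = k_core_component(cand, k, q)
            if community:
                found.append((sub, community))
        if found:  # communities sharing the most keywords
            return found
    return []


for shared, members in naive_kac("Jack", 3):
    print(sorted(shared), sorted(members))
```

The outer loops visit every non-empty subset of S in the worst case, which is the exponential cost noted in the text.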
We first propose two baseline solutions. We further develop the CL-tree index,
which organizes the vertex keyword data in a hierarchical structure. Based on the
CL-tree index, we have developed three different KAC algorithms, which
achieve superior performance. We have performed extensive experiments
on large real graph datasets, and the results show that the KAC query
achieves higher effectiveness than existing CD and CS algorithms. Moreover,
our proposed algorithms are much faster than the baseline solutions. Finally,
we propose Community-Explorer (or C-Explorer), a web-based system that can
assist users in extracting, visualizing, and analyzing communities.
Figure 3 shows the user interface of C-Explorer configured to run on the
DBLP bibliographical network. On the left panel, a user inputs the name of an
author (e.g., “jim gray”) and the minimum degree of each vertex in the
community she wants to have. The user can also indicate the labels or keywords
related to her community. Once she clicks the “Search” button, the right panel
will quickly display a community of Jim Gray, which contains researchers working
on database system transactions, as indicated by their shared keyword set
{transaction, data, management, system, research}. C-Explorer implements several
state-of-the-art community retrieval (CR) algorithms, including Global [48],
Local [9], and CODICIL [44], and provides functions for analyzing the
communities. A user can also plug a new CR solution into C-Explorer through an
application programming interface (API).

Fig. 3. Interface of C-Explorer.

1.2 CS on Spatial-Based Attributed Graphs


Given a spatial graph G and a vertex q ∈ G, our goal is to find a subgraph of
G, called a spatial-aware community (or SAC). An SAC is a community with
high structure cohesiveness and spatial cohesiveness. The structure cohesiveness
mainly measures the social connections within the community, while the spatial
cohesiveness focuses on the closeness among the members’ geo-locations. Figure 2
illustrates an SAC with three users {Tom, Jeff, Jim}, who are all linked
to one another and are all in Paris.
We adopt the minimum degree [9,36,48] to measure the structure cohesiveness.
To measure the spatial cohesiveness, we consider the spatial circle that
contains all the community members. Specifically, given a query vertex q ∈ G,
our goal is to find an SAC that contains q, whose vertices all satisfy the
minimum degree constraint, and whose minimum covering circle (or MCC) is the
smallest among all such communities.
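To illustrate the spatial measure only, the sketch below computes the minimum covering circle of a small set of member locations by brute force, using the fact that the smallest enclosing circle is determined by two or three of the points. It is not one of the SAC search algorithms of this article; the function names and the sample coordinates are hypothetical, and the input is assumed to be a non-empty list of planar (x, y) points.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def circle_two(a, b):
    """Circle whose diameter is the segment ab."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2), dist(a, b) / 2


def circle_three(a, b, c):
    """Circumcircle of a, b, c, or None if the points are collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    uy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return (ux, uy), dist((ux, uy), a)


def covers(circle, points, eps=1e-9):
    """True if every point lies inside or on the circle (with tolerance eps)."""
    center, radius = circle
    return all(dist(center, p) <= radius + eps for p in points)


def min_covering_circle(points):
    """Brute-force MCC: try all pair and triple circles, keep the smallest cover."""
    points = list(points)
    if len(points) == 1:
        return points[0], 0.0
    circles = [circle_two(a, b) for a, b in combinations(points, 2)]
    for triple in combinations(points, 3):
        circle = circle_three(*triple)
        if circle is not None:
            circles.append(circle)
    return min((c for c in circles if covers(c, points)), key=lambda c: c[1])


# Hypothetical member locations of one candidate community.
center, radius = min_covering_circle([(0, 0), (4, 0), (2, 3)])
print(center, radius)
```

A candidate community with a smaller MCC radius is spatially more cohesive. This brute-force routine is quartic in the number of points, so it is suitable only for small illustrations.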
The main features of SAC search are: (1) Adaptability to location
changes. As the locations of users often change over time and their link
relationships also evolve, SAC search can adapt to such dynamics easily, as
it can answer queries in an “online” manner. Figure 4(b) shows another user’s
two SACs in three days, when she moves from place “C” to place “D”. These
real examples clearly show that a user’s communities could evolve over time.
(2) Personalization. SAC search is customized for finding communities for a
particular query user, and the link cohesiveness of the community can also be
controlled. (3) Online evaluation. SAC search is able to find an SAC from
a large spatial graph quickly once a query request arrives.

Fig. 4. SACs in Brightkite dataset.
Since SACs achieve both high structure and spatial cohesiveness, they can be
applied to many interesting applications, including event recommendation (e.g.,
Meetup¹), social marketing, and geo-social data analysis. For example, Meetup
tracks its users’ mobile phone locations, and suggests interesting location-based
events to them [53]. Suppose that Meetup wishes to recommend an event to a
user u. Then we can first find u’s SAC, whose members are physically close
to u. Events proposed by a member v of u’s SAC can then be introduced to u, so
that u can meet v if she is interested in v’s activity. Since u’s location changes
constantly, u’s recommendation needs to be updated accordingly.
The SAC search problem is very challenging, because the center and radius
of the smallest MCC containing q are unknown. A basic exact approach takes
O(m × n³) time to answer a query, where n and m denote the numbers of vertices
and edges in G. To alleviate this issue, we develop three efficient approximation
algorithms with arbitrary approximation ratios, and an advanced exact
algorithm, which is much faster than the basic exact algorithm. We have performed
experiments on real datasets, and the results show that our solutions yield better
communities than those produced by existing CS and CD algorithms. Moreover,
the approximation algorithms are much faster than the exact algorithms.
We organize the rest of the article as follows. We review the related works in
Sect. 2. In Sect. 3, we investigate the problem of CS on keyword-based attributed
graphs. In Sect. 4, we examine the problem of CS on spatial-based attributed
graphs. We conclude and discuss the future work in Sect. 5.

¹ https://www.meetup.com/.

2 Related Work
The related works about community retrieval can generally be classified into
community detection (CD) and community search (CS). Table 1 summarizes the
works related to community retrieval. We review them in detail as follows.

Table 1. Classification of works in community retrieval (CR).

Graph type     | Community detection (CD)                            | Community search (CS)
Non-attributed | [22, 25, 42]                                        | [1, 8, 9, 32, 33, 36, 48]
Attributed     | [6, 12, 26, 34, 35, 38, 39, 41, 44, 49, 50, 52, 54] | KAC [13, 14, 16, 17], SAC [15]

2.1 Community Detection (CD)

Detecting communities from a network is a fundamental research problem in
network science, and it has been widely studied during the past several decades [43].
In the following, we mainly review studies about CD on attributed graphs.

CD on Keyword-Based Attributed Graphs. Clustering techniques are
often used to detect communities from keyword-based attributed graphs. Zhou
et al. [54] considered both links and keywords of vertices to compute the vertices’
pairwise similarities, and then clustered the graph based on the similarities. Ruan
et al. [44] proposed a method called CODICIL. This solution augments the original
graph by creating new edges among vertices based on their content similarity,
and then uses effective graph sampling to boost the efficiency of clustering.
Another common approach is based on the generative models. The LDA
model [4] is a classical generative statistical model, which is able to explain the
observations based on some unobserved variables. In [38,41], the Link-PLSA-LDA
and Topic-Link LDA models jointly model vertices’ content and links based
on the LDA model. In [49], Xu et al. developed a Bayesian probabilistic model
which can capture both structures and attributes of vertices. In [45], Sachan
et al. proposed to discover communities based on the topics, interaction types
and the social connections among the vertices. CESNA [50] detects overlapping
communities by assuming communities “generate” both the link and content. A
discriminative approach [51], which combines the link and content analysis, has
also been considered for CD on attributed graphs. However, these CD solutions
are generally slow, as they often consider the pairwise distance/similarity among
vertices in an entire graph. Also, they partition graphs with no reference to
query vertices, and it is not clear how they can answer online queries.

CD on Spatial-Based Attributed Graphs. Many recent works identify
communities from spatially constrained graphs, whose vertices not only have
links but are also associated with spatial coordinates [2]. For example, Girvan
et al. [25] studied the geo-community, which is a graph of intensely connected
vertices that are loosely connected with others, but which is more compact in space.
Guo et al. [26] proposed the average linkage (ALK) measure for clustering objects
in spatially constrained graphs. In [12], Expert et al. adapted the modularity
function for spatial networks and proposed a method to uncover
communities from spatial graphs. In [47], Shakarian et al. modified the well-known
Louvain algorithm and used a variant of Newman-Girvan modularity to mine
geographically dispersed communities from location-based social networks.
In [6], Chen et al. proposed a geo-distance-based method using fast modularity
maximization for identifying communities that are both highly topologically
connected and spatially clustered in spatially constrained networks. We will
compare our proposed methods with it in the experiments.
However, these CD algorithms are generally costly and time-consuming, as
they often detect all the communities of an entire network. None of these CD
methods has been shown to be able to quickly detect communities from
spatial-based attributed graphs with millions or billions of vertices. Also, it is
not clear how they can be adapted for online retrieval of communities from large
spatial-based attributed graphs. This calls for the development of faster
algorithms for performing CS on spatial-based attributed graphs.

2.2 Community Search (CS)

To perform CS queries, people often define some measures of structure
cohesiveness for a community. We classify the existing CS solutions using these
measures.
Minimum degree. The minimum degree is one of the most fundamental
characteristics of a graph [5,23]. In [48], Sozio et al. proposed the first CS
solution, called Global. Given a graph G, a query vertex q ∈ G, and a degree
threshold k, Global returns, in an online manner, the largest connected subgraph
that contains q and whose vertices all have degree at least k. It finds the
community by iteratively removing vertices whose degrees are less than k.
Cui et al. [9] also used the minimum degree measure and developed
another solution, Local, which uses local expansion techniques to improve on
the performance of Global. In [1], Barbieri et al. further improved Global and
Local and generalized the query such that it can find a community of multiple
query vertices. In [36], Li et al. assumed each graph vertex has an influence value
and proposed to find the top-r k-influential communities.
k-truss. The k-truss of a graph is the largest subgraph, in which each edge is
contained in at least (k − 2) triangles in the subgraph [7]. In [32], Huang et al.
proposed to search overlapping communities based on k-truss. In [33], Huang
et al. proposed to find the closest truss communities.
Other measures. Some other classical measures (e.g., k-clique and connectiv-
ity) have also been applied to CS. In [8], Cui et al. proposed to find overlapping
communities based on k-cliques. The connectivity of a graph is the minimum
number of edges whose removal disconnects it [24]. In [30,31], Hu et al. proposed
a community model based on the measure of connectivity.
8 Y. Fang and R. Cheng

However, these existing CS solutions generally assume non-attributed graphs,
and overlook the rich information of vertices and edges that come with attributed
graphs. Therefore, it is desirable to design CS algorithms for attributed graphs.

3 CS on Keyword-Based Attributed Graphs


In this section, we first formally introduce the KAC query, then present the
query algorithms, and finally discuss the experimental results.

3.1 The KAC Query


We now discuss the keyword-based attributed graph model, the k-core, and the
KAC. We consider a keyword-based attributed graph² G(V, E), which is
undirected with vertex set V and edge set E. Each vertex v ∈ V is associated with
a set of keywords, W (v). Let n and m be the corresponding sizes of V and E.
The degree of a vertex v of G is denoted by degG (v).
A community is often a subgraph that satisfies structure cohesiveness. In
the KAC query, we use the minimum degree, which is also used in the k-core.
Definition 1 (k-core [3,46]). Given an integer k (k ≥ 0), the k-core of G,
denoted by Hk , is the largest subgraph of G, such that ∀v ∈ Hk , degHk (v) ≥ k.
We say that Hk has an order of k. Notice that Hk may not be a connected
graph [3], and its connected components, denoted by k-ĉores, are usually the
“communities” returned by k-ĉore search algorithms.
Example 1. In Fig. 5(a), {A, B, C, D} is both a 3-core and a 3-ĉore. The 1-core
has vertices {A, B, C, D, E, F, G, H, I}, and is composed of two 1-ĉore
components: {A, B, C, D, E, F, G} and {H, I}. The number k in each circle
represents the k-ĉore contained in that ellipse.

[Figure 5: (a) graph, with vertex keyword sets A:{w, x, y}, B:{x}, C:{x, y},
D:{x, y, z}, E:{y, z}, F:{y}, G:{x, y}, H:{y, z}, I:{x}, J:{x};
(b) core number]

Core number   Vertices
0             J
1             F, G, H, I
2             E
3             A, B, C, D

Fig. 5. Illustrating the k-core and the KAC.

Observe that k-cores are “nested” [3]: given two positive integers i and j, if
i < j, then Hj ⊆ Hi . In Fig. 5(a), H3 is contained in H2 , which is nested in H1 .
² Without ambiguity, all the attributed graphs mentioned in this section refer to
keyword-based attributed graphs.
On Attributed Community Search 9

Definition 2 (Core number). Given a vertex v ∈ V , its core number, denoted
by coreG [v], is the highest order of a k-core that contains v.
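
Definition 2 can be illustrated by repeatedly peeling a vertex of minimum remaining degree. The sketch below is only a minimal illustration and not the O(m) algorithm of [3]: the names `core_numbers` and `build_graph` are hypothetical, and the edge list is an assumption chosen so that the result agrees with the core numbers of Fig. 5(b), since the text does not list the edges of Fig. 5(a).

```python
def core_numbers(adj):
    """Return {vertex: core number} for a graph given as {vertex: set(neighbors)}."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a vertex of minimum remaining degree; the running maximum k
        # of these minimum degrees is the core number of the peeled vertex.
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def build_graph(edges, isolated=()):
    """Build an undirected adjacency map from an edge list (hypothetical helper)."""
    adj = {v: set() for v in isolated}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Assumed edge set, consistent with Fig. 5(b) but not taken from the figure.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("E", "A"), ("E", "B"), ("F", "E"), ("G", "E"), ("H", "I")]
graph = build_graph(EDGES, isolated=["J"])
print(core_numbers(graph))
```

The `min` scan makes this version quadratic in the number of vertices; it trades the bucket structure of [3] for brevity.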

A list of core numbers and their respective vertices for Example 1 is shown
in Fig. 5(b). We now formally define the KAC query problem as follows.

Problem 1 (KAC query). Given a graph G(V, E), a positive integer k, a vertex
q ∈ V and a set of keywords S ⊆ W (q), return a set G of graphs, such that
∀Gq ∈ G, the following properties hold:

• Connectivity. Gq ⊆ G is connected and q ∈ Gq ;
• Structure cohesiveness. ∀v ∈ Gq , degGq (v) ≥ k;
• Keyword cohesiveness. The size of L(Gq , S) is maximal, where L(Gq , S) =
∩v∈Gq (W (v) ∩ S) is the set of keywords shared in S by all vertices of Gq .

We call Gq the keyword-based attributed community (or KAC) of q, and
L(Gq , S) the KAC-label of Gq . In Problem 1, the first two properties are also
specified by the k-ĉore of a given vertex q [48]. The keyword cohesiveness
(Property 3), which is unique to Problem 1, enables the retrieval of communities whose
vertices have common keywords in S. We use S to impose semantics on the KAC
produced by Problem 1. By default, S = W (q), which means that the KAC gen-
erated should have keywords common to those associated with q. If S ⊂ W (q), it
means that the query user is interested in forming communities that are related
to some (but not all) of the keywords of q. For example, in Fig. 5(a), if q = A,
k = 2 and S = {w, x, y}, the output of Problem 1 is {A, C, D}, with KAC-label
{x, y}, meaning that these vertices share the keywords x and y.
We require L(Gq , S) to be maximal in Property 3, because we wish the
KAC(s) returned only contain(s) the most related vertices, in terms of the num-
ber of common keywords. Let us use Fig. 5(a) to explain why this is impor-
tant. Using the same query (q = A, k = 2, S = {w, x, y}), without the “maximal”
requirement, we can obtain communities such as {A, B, E} (which do not share
any keywords), {A, B, D}, or {A, B, C} (which share 1 keyword).
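
The KAC-labels quoted in this example can be checked directly from the definition of L(Gq , S). The snippet below is a small sketch: the keyword sets are those shown in Fig. 5(a), while the function name `kac_label` is a hypothetical choice.

```python
# Keyword sets W(v) as shown in Fig. 5(a).
W = {"A": {"w", "x", "y"}, "B": {"x"}, "C": {"x", "y"}, "D": {"x", "y", "z"},
     "E": {"y", "z"}, "F": {"y"}, "G": {"x", "y"}, "H": {"y", "z"},
     "I": {"x"}, "J": {"x"}}


def kac_label(keywords, community, S):
    """L(Gq, S): the keywords of S shared by every vertex of the community."""
    shared = set(S)
    for v in community:
        shared &= keywords[v]
    return shared


S = {"w", "x", "y"}
for community in (["A", "C", "D"], ["A", "B", "E"], ["A", "B", "D"], ["A", "B", "C"]):
    print(community, sorted(kac_label(W, community, S)))
```

Only {A, C, D} shares two keywords of S, which is why the "maximal" requirement selects it.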

3.2 Basic Solutions

For simplicity, we say that v contains a set S′ of keywords, if S′ ⊆ W (v). We use
G[S′] to denote the largest connected subgraph of G, where each vertex contains
S′ and q ∈ G[S′]. We use Gk [S′] to denote the largest connected subgraph of
G[S′], in which every vertex has degree at least k in Gk [S′]. We call S′ a
qualified keyword set for the query vertex q on the graph G, if Gk [S′] exists.
A simple way of answering a KAC query is to consider all the possible key-
word combinations, and then return the subgraphs, which satisfy the minimum
degree constraint and have the most shared keywords. This solution has a
complexity exponential in the size of q’s keyword set, so it is impractical when q’s
keyword set is large. To alleviate this issue, we propose the following two-step
framework, which is mainly based on the following anti-monotonicity property.

Lemma 1 (Anti-monotonicity)³. Given a graph G, a vertex q ∈ G and a
set S of keywords, if there exists a subgraph Gk [S], then there exists a subgraph
Gk [S′] for any subset S′ ⊆ S.

The anti-monotonicity property allows us to stop examining all the supersets
of S′ (S′ ⊆ S), once we have verified that Gk [S′] does not exist. The basic
solution begins with examining the set, Ψ1 , of size-1 candidate keyword sets,
i.e., each candidate contains a single keyword of S. It then repeatedly executes
the following two key steps, to retrieve the size-2 (size-3, . . . ) qualified keyword
subsets until no qualified keyword sets are found.

• Verification. For each candidate S′ in Ψc (initially c = 1), mark S′ as a
qualified set if Gk [S′] exists.
• Candidate generation. For any two current size-c qualified keyword sets
which only differ in one keyword, union them as a new expanded candidate
with size-(c + 1), and put it into set Ψc+1 , if all its subsets are qualified, by
Lemma 1.

Among the above steps, the key issue is how to compute Gk [S′]. Since Gk [S′]
should satisfy both the structure cohesiveness and the keyword cohesiveness,
we have two intuitive approaches to compute it: either searching for the subgraph
satisfying the degree constraint first, followed by further refining with keyword
constraints (called basic-g); or vice versa (called basic-w).⁴
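
The verification and candidate-generation steps above can be sketched as follows, using the basic-w order (keyword filter first, degree constraint second). This is an illustrative sketch under stated assumptions rather than the authors' pseudocode, which is in [13]: the function names `gk` and `basic_kac_query` are hypothetical, the keyword sets come from Fig. 5(a), and the edge list is an assumed one that is consistent with Fig. 5(b).

```python
from itertools import combinations


def gk(adj, keywords, q, k, s):
    """Vertex set of Gk[S']: keep vertices containing s, peel degree < k, take q's component."""
    alive = {v for v in adj if s <= keywords[v]}   # keyword constraint first (basic-w)
    while True:
        low = {v for v in alive if len(adj[v] & alive) < k}
        if not low:
            break
        alive -= low
    if q not in alive:
        return set()
    component, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in adj[v] & alive:
            if u not in component:
                component.add(u)
                stack.append(u)
    return component


def basic_kac_query(adj, keywords, q, k, S):
    """Return {largest qualified keyword set: community vertex set}."""
    candidates = [frozenset([w]) for w in S]
    best = {}
    while candidates:
        # Verification: a candidate is qualified if Gk[S'] exists.
        qualified = {}
        for cand in candidates:
            community = gk(adj, keywords, q, k, cand)
            if community:
                qualified[cand] = community
        if not qualified:
            break
        best = qualified
        # Candidate generation: join qualified sets that differ in one keyword
        # and keep the union only if all its one-smaller subsets are qualified.
        size = len(candidates[0]) + 1
        next_candidates = set()
        for a, b in combinations(qualified, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in qualified for sub in combinations(union, size - 1)
            ):
                next_candidates.add(union)
        candidates = list(next_candidates)
    return best


# Assumed edges (not listed in the text) and the keyword sets of Fig. 5(a).
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("E", "A"), ("E", "B"), ("F", "E"), ("G", "E"), ("H", "I")]
ADJ = {v: set() for v in "ABCDEFGHIJ"}
for u, v in EDGES:
    ADJ[u].add(v)
    ADJ[v].add(u)
W = {"A": {"w", "x", "y"}, "B": {"x"}, "C": {"x", "y"}, "D": {"x", "y", "z"},
     "E": {"y", "z"}, "F": {"y"}, "G": {"x", "y"}, "H": {"y", "z"},
     "I": {"x"}, "J": {"x"}}
print(basic_kac_query(ADJ, W, "A", 2, {"w", "x", "y"}))
```

On this assumed graph the query q = A, k = 2, S = {w, x, y} yields the community {A, C, D} with label {x, y}, matching the example given for Problem 1.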

3.3 The CL-tree Index

In this section, we propose a novel index, called CL-tree (Core Label tree),
which organizes both the k-ĉores and keywords into a tree structure. Based on
the index, the efficiency of answering KAC query can be improved significantly.
The CL-tree index is built based on the key observation that cores are nested.
Specifically, a (k + 1)-ĉore must be contained in a k-ĉore. All k-ĉores can be
organized into a tree structure⁵.

Example 2. Consider the graph in Fig. 5(a). All the k-cores can be organized
into a tree as shown in Fig. 6(a). The height of the tree is 4. For each tree node,
we attach the core number and vertex set of its corresponding k-core.

The tree structure in Fig. 6(a) can be stored compactly, as shown in Fig. 6(b).
The key observation is that, for any internal node p in the tree, the vertex sets
of its child nodes are the subsets of p’s vertex set, because of the inclusion
relationship. To save space cost, we can remove the redundant vertices that are
shared by p’s child nodes from p’s vertex set. After such removal, we obtain
a compressed tree, where each graph vertex appears only once. This structure

³ All the proofs of lemmas in this article can be found in [13].
⁴ All the pseudocodes of algorithms in this article can be found in [13].
⁵ We use “node” to mean “CL-tree node” in Sect. 3.

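[Figure 6: (a) tree structure; (b) CL-tree index]

(a) tree structure (core number: vertex set)
0: {A, B, C, D, E, F, G, H, I, J}
  1: {A, B, C, D, E, F, G}
    2: {A, B, C, D, E}
      3: {A, B, C, D}
  1: {H, I}

(b) CL-tree index (core number: vertex set; inverted list)
0: {J}; x: J
  r1, 1: {F, G}; x: G; y: F, G
    r2, 2: {E}; y: E; z: E
      r3, 3: {A, B, C, D}; w: A; x: A, B, C, D; y: A, C, D; z: D
  1: {H, I}; x: I; y: H; z: H

Fig. 6. An example CL-tree index.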

constitutes the CL-tree index, the nodes of which are further augmented by
inverted lists (Fig. 6(b)). The space cost of the CL-tree is linear in the size of
G. To summarize, each CL-tree node p has four elements: (1) coreNum: the core
number of the k-ĉore; (2) vertexSet: a set of graph vertices; (3) invertedList: a
list of <key, value> pairs, where the key is a keyword contained by vertices in
vertexSet and the value is the list of vertices in vertexSet containing key; and
(4) childList: a list of child nodes.
Using the CL-tree, the following two key operations used by our query algo-
rithms (Sect. 3.5), can be performed efficiently.

• Core-locating. Given a vertex q and a core number c, find the k-ĉore with
core number c containing q, by traversing the CL-tree.
• Keyword-checking. Given a k-ĉore, find vertices which contain a given
query keyword set, by intersecting the inverted lists of query keywords.
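
The node layout and the two operations above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names mirror coreNum, vertexSet, invertedList and childList, the tree is hand-built from Fig. 6(b), and `core_locate` assumes that the requested k-ĉore is rooted at the shallowest node on the root-to-q path whose core number is at least c.

```python
from dataclasses import dataclass, field


@dataclass
class CLNode:
    core_num: int
    vertex_set: set
    inverted_list: dict                  # keyword -> set of vertices in vertex_set
    child_list: list = field(default_factory=list)


def core_locate(root, q, c):
    """Return the node rooting the k-ĉore of order c that contains q, or None."""
    def path_to(node):
        if q in node.vertex_set:
            return [node]
        for child in node.child_list:
            tail = path_to(child)
            if tail:
                return [node] + tail
        return []

    for node in path_to(root):
        if node.core_num >= c:
            return node
    return None


def keyword_check(node, query_keywords):
    """Vertices in the subtree rooted at node that contain every query keyword."""
    found = set()
    stack = [node]
    while stack:
        n = stack.pop()
        lists = [n.inverted_list.get(w, set()) for w in query_keywords]
        # Intersect the inverted lists of the query keywords within this node.
        found |= set.intersection(*lists) if lists else n.vertex_set
        stack.extend(n.child_list)
    return found


# The CL-tree of Fig. 6(b), built by hand.
r3 = CLNode(3, {"A", "B", "C", "D"},
            {"w": {"A"}, "x": {"A", "B", "C", "D"}, "y": {"A", "C", "D"}, "z": {"D"}})
r2 = CLNode(2, {"E"}, {"y": {"E"}, "z": {"E"}}, [r3])
r1 = CLNode(1, {"F", "G"}, {"x": {"G"}, "y": {"F", "G"}}, [r2])
hi = CLNode(1, {"H", "I"}, {"x": {"I"}, "y": {"H"}, "z": {"H"}})
root = CLNode(0, {"J"}, {"x": {"J"}}, [r1, hi])

node = core_locate(root, "A", 2)
print(node.core_num, sorted(keyword_check(node, {"x", "y"})))
```

Because each graph vertex is stored in exactly one node, the inverted lists of different nodes never overlap, so the per-node intersections can simply be unioned.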

3.4 Index Construction


A simple method to build the CL-tree is to build nodes recursively in a top-down
manner. Specifically, we first generate the root node for the 0-core, which is
exactly the entire graph. Then, for each k-ĉore of the 1-core, we generate a child
node for the root node. After that, we retain in the root node only the vertices
whose core number is 0. Then, for each child node, we can generate its child
nodes in a similar way. This procedure is executed recursively until all the nodes
are built. We denote this index construction method by basic.
Clearly, the time cost of the basic method is O(m · kmax + l · n), because: (1) the
k-core decomposition can be done in O(m) [3]; (2) the inverted lists of each node
can be built in O(l · n); and (3) in function buildNode, we need to compute the
connected components with a given vertex set, which costs O(m) in the worst
case. This may lead to low efficiency for large-scale graphs. To achieve higher
efficiency, we propose the advanced method, whose time and space complexities are
almost linear in the size of G. The advanced method builds the CL-tree level by level
in a bottom-up manner. Specifically, the tree nodes corresponding to larger core
numbers are created prior to those with smaller core numbers. More detailed
steps and analysis of basic and advanced are described in [16].

3.5 Query Algorithms


Based on the CL-tree, we propose a query algorithm, denoted by Dec. It first
generates the candidate keyword sets, and then verifies whether they could be
the shared keywords of the KACs. We illustrate the main steps as follows.
1. Generation of candidate keyword sets. Dec exploits the key observation
that, if S′ (S′ ⊆ S) is a qualified keyword set, then there are at least k of q’s
neighbors containing set S′, since every vertex in Gk [S′] must have degree at
least k. Specifically, we consider q and q’s neighbor vertices. For each vertex v,
we only select the keywords that are contained by S and by at least k of its
neighbors. Then we use these selected keywords to form an itemset, in which
each item is a keyword. After this step, we obtain a list of itemsets. Then we
apply well-studied frequent pattern mining algorithms to find the frequent
keyword combinations, each of which is a candidate keyword set. Since our goal
is to generate keyword combinations shared by at least k neighbors, we set the
minimum support as k, and use the well-known FP-Growth algorithm [28].
2. Verification of candidate keyword sets. As candidates can be obtained
using S and q’s neighbors directly, we can verify them either incrementally, or in
a decremental manner (larger candidate keyword sets first and smaller candidate
keyword sets later). We choose the latter manner. The rationale is that,
for any two keyword sets S1 ⊆ S2 , the number of vertices containing S2 is usually
smaller than that of S1 , so S2 can be verified more efficiently.
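
The candidate-generation step of Dec can be sketched as below. Two points are assumptions rather than the authors' method: a brute-force enumeration stands in for FP-Growth [28] (it returns the same frequent itemsets on small inputs but does not scale), and the edge list is an assumed one consistent with Fig. 5(b). The function names are hypothetical; the keyword sets are those of Fig. 5(a).

```python
from itertools import combinations


def neighbor_itemsets(adj, keywords, q, k, S):
    """One itemset per vertex in {q} and nb(q): keywords of S held by >= k of its neighbors."""
    itemsets = []
    for v in [q, *sorted(adj[q])]:
        items = frozenset(
            w for w in keywords[v] & S
            if sum(w in keywords[u] for u in adj[v]) >= k
        )
        if items:
            itemsets.append(items)
    return itemsets


def frequent_keyword_sets(itemsets, min_support):
    """Brute-force stand-in for FP-Growth: keyword sets contained in >= min_support itemsets."""
    universe = sorted(set().union(*itemsets))
    frequent = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            if sum(set(combo) <= items for items in itemsets) >= min_support:
                frequent.append(frozenset(combo))
    return frequent


# Assumed edges (not listed in the text) and the keyword sets of Fig. 5(a).
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("E", "A"), ("E", "B"), ("F", "E"), ("G", "E"), ("H", "I")]
ADJ = {v: set() for v in "ABCDEFGHIJ"}
for u, v in EDGES:
    ADJ[u].add(v)
    ADJ[v].add(u)
W = {"A": {"w", "x", "y"}, "B": {"x"}, "C": {"x", "y"}, "D": {"x", "y", "z"},
     "E": {"y", "z"}, "F": {"y"}, "G": {"x", "y"}, "H": {"y", "z"},
     "I": {"x"}, "J": {"x"}}

itemsets = neighbor_itemsets(ADJ, W, "A", 2, {"w", "x", "y"})
candidates = frequent_keyword_sets(itemsets, min_support=2)
# Decremental verification order: larger candidate keyword sets first.
print(sorted(candidates, key=len, reverse=True))
```

The keyword w is dropped before mining because none of A's neighbors contains it, so no candidate containing w is ever verified.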

3.6 Experiments
We consider four real keyword-based attributed graphs. For each of them, each
vertex has a list of neighbors as well as a set of keywords. More details of these
graphs are described in [16]. To evaluate KAC queries, we set the default value
of k to 6. The input keyword set S is set to the whole set of keywords contained
by the query vertex. For each dataset, we randomly select 300 query vertices
with core numbers of 6 or more, which ensures that there is a k-core containing
each query vertex.
We have performed extensive experiments on these datasets. The detailed
experimental results can be found in [16]. The general conclusions observed from
the experiments are that: (1) The communities returned by KAC queries achieve
higher keyword cohesiveness than the state-of-the-art CD and CS methods. For
example, the Jaccard similarities of members in the KACs are higher than those
of Global [48], Local [9], and CODICIL [44]. (2) The case studies on the DBLP
network show that, using keywords, KAC query can find more meaningful
communities than Global and Local. (3) The index-based query algorithms are
1 to 3 orders of magnitude faster than the basic methods. For example, on the
largest dataset DBpedia, a single KAC query takes less than 1 s. (4) For the
index construction methods, advanced is much faster than basic.

4 CS on Spatial-Based Attributed Graphs


In this section, we first formally introduce the SAC search, then present the
query algorithms, and finally discuss the experimental results.

4.1 Problem Definition


Data model. We consider a spatial-based attributed graph⁶ G(V, E), which is
an undirected graph with vertex set V and edge set E, where vertices represent
entities and edges denote their relationships. Each vertex v ∈ V has a
tuple (id, loc), where id is its ID and loc = (x, y) is its spatial position along the
x- and y-axes in a two-dimensional space. Let n and m be the corresponding sizes
of V and E. We denote a circle with center o and radius r by O(o, r), and the
Euclidean distance from vertex u to vertex v by |u, v|. The degree of a vertex v in a
graph G is denoted by degG (v).

Example 3. Figure 7(a) depicts a geo-social network containing 10 vertices. The
solid lines linking the vertices are the edges, denoting their social relationships.

[Figure 7: (a) spatial graph, with vertices A to I and Q plotted in the x-y plane;
(b) k-core decomposition]

Fig. 7. An example of geo-social network.

Spatial-aware community (SAC). Conceptually, an SAC is a subgraph, G′,
of the graph G satisfying: (1) Connectivity: G′ is connected; (2) Structure
cohesiveness: all the vertices in G′ are linked intensively; and (3) Spatial
cohesiveness: all the vertices in G′ are spatially close to each other.
Structure cohesiveness. We adopt the minimum degree, a well-accepted struc-
ture cohesiveness criterion, for measuring the structure cohesiveness of the ver-
tices that appear in the community. Note that other criteria including k-truss [33]
and k-clique [8] can also be used for SACs.
Spatial cohesiveness. To ensure high spatial cohesiveness, we require all the
vertices of an SAC to lie in a minimum covering circle (MCC) with the smallest radius.
In the literature [10,11,27,40], the notion of MCC has been widely adopted to
achieve high spatial compactness for a set of spatial objects.

⁶ For simplicity, in this section we call spatial-based attributed graphs spatial graphs.

Definition 3 (MCC). Given a set of vertices S, the MCC of S is the spatial
circle, which contains all the vertices in S with the smallest radius.

Problem 2 (SAC search). Given a graph G, a positive integer k and a vertex
q ∈ V , return a subgraph Gq ⊆ G, such that the following properties hold:

• Connectivity. Gq is connected and contains q;
• Structure cohesiveness. ∀v ∈ Gq , degGq (v) ≥ k;
• Spatial cohesiveness. The MCC of vertices in Gq satisfying Properties 1
and 2 has the minimum radius.

We call a subgraph satisfying Properties 1 and 2 a feasible solution, and the
subgraph satisfying all the three properties the optimal solution (denoted by
Ψ ). We denote the radius of the MCC containing Ψ by ropt . Essentially, SAC
search finds the SAC in an MCC with the smallest radius among all the feasible
solutions. In Example 3, let C1 = {Q, C, D} and C2 = {Q, A, B}. The two circles
in Fig. 7(a) denote the MCCs of C1 and C2 respectively. Let q = Q and k = 2.
The optimal solution of this query is G[C1 ], and ropt = 1.5. Note that G[C2 ] and
G[C1 ∪ C2 ] are feasible solutions.

4.2 SAC Search Algorithms

We first present a basic exact algorithm Exact, which takes O(m × n³) time to answer
a single query. This is very time-consuming for large graphs. So we turn to designing
more efficient approximation algorithms. Here, the approximation ratio is defined
as the ratio of the radius of the MCC returned over that of the optimal solution.
Inspired by the approximation algorithms, we also design a fast exact algorithm
Exact+. Their approximation ratios and time complexities are summarized in
Table 2, where εF and εA are parameters specified by the query user. The value
|F1| is the number of “fixed vertices”, which will be defined later (|F1| ≪ n).
AppInc is a 2-approximation algorithm, and it is much faster than Exact.
Inspired by AppInc, we design another (2 + εF)-approximation algorithm
AppFast, where εF ≥ 0, which is faster than AppInc. The limitation of AppInc

Table 2. Overview of algorithms for SAC search.

Algorithm   Approximation ratio    Time complexity
Exact       1                      O(m × n³)
AppInc      2                      O(mn)
AppFast     2 + εF (εF ≥ 0)        O(m · min{n, log(1/εF)}) if εF > 0; O(mn) if εF = 0
AppAcc      1 + εA (0 < εA < 1)    O((m/εA²) × min{n, log(1/εA)})
Exact+      1                      O((m/εA²) · min{n, log(1/εA)} + m|F1|³)

and AppFast is that their theoretical approximation ratios are at least 2.
To achieve an even lower approximation ratio, we further design another algorithm
AppAcc, whose approximation ratio is (1 + εA), where 0 < εA < 1 is a value
specified by the query user. Overall, these approximation algorithms guarantee that
the radius of the MCC of the returned community stays within an approximation
ratio chosen by the query user.
There is a trade-off between the quality of results and efficiency, i.e.,
algorithms with lower approximation ratios tend to have higher complexities. The
pseudocodes of these algorithms can be found in [15].

The Basic Exact Algorithm. We first describe a useful lemma [11].

Lemma 2 [11]. Given a set S (|S| ≥ 2) of vertices, its MCC can be determined
by at most three vertices in S which lie on the boundary of the circle. If it
is determined by only two vertices, then the line segment connecting those two
vertices must be a diameter of the circle. If it is determined by three vertices,
then the triangle consisting of those three vertices is not obtuse.
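
Lemma 2 suggests a direct, if slow, way to compute an MCC: try the circle having each pair of points as a diameter and the circumcircle of each triple, and keep the smallest one that covers every point. The sketch below illustrates this idea only; it is not the linear-time algorithm cited as [40], and the function names are hypothetical.

```python
from itertools import combinations
from math import dist


def circle_from_two(p, q):
    """Circle with segment pq as a diameter."""
    center = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return center, dist(p, q) / 2


def circle_from_three(a, b, c):
    """Circumcircle of triangle abc, or None if the points are collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (a[0] ** 2 + a[1] ** 2, b[0] ** 2 + b[1] ** 2, c[0] ** 2 + c[1] ** 2)
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy), dist((ux, uy), a)


def mcc(points, eps=1e-9):
    """Return (center, radius) of the minimum covering circle by brute force."""
    points = list(points)
    if len(points) == 1:
        return points[0], 0.0
    candidates = [circle_from_two(p, q) for p, q in combinations(points, 2)]
    for triple in combinations(points, 3):
        circle = circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    # Keep only circles that cover all points, then take the smallest radius.
    covering = [(r, c) for c, r in candidates
                if all(dist(c, p) <= r + eps for p in points)]
    radius, center = min(covering)
    return center, radius


print(mcc([(0, 0), (2, 0), (1, 1)]))   # determined by two points (a diameter)
print(mcc([(0, 0), (4, 0), (2, 3)]))   # determined by three points (acute triangle)
```

Enumerating all triples and checking coverage costs O(n⁴) here, which is acceptable only for illustration.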

By Lemma 2, the MCC of the target SAC is determined by two or three vertices
lying on its boundary. We call vertices lying on the boundary of an MCC
fixed vertices. So a straightforward method of SAC search can follow a two-step
framework directly. It first finds the k-ĉore containing q, in the same way
as Global does, and then returns the subgraph achieving both the structure and
spatial cohesiveness by enumerating all the combinations of three vertices in the
k-ĉore. We call this method Exact. It completes in O(m × n³) time.

A 2-Approximation Algorithm. In this section we present AppInc, which
has an approximation ratio of 2. Our key observation is that the optimal solution
Ψ is very close to q. So we consider the smallest circle, denoted by O(q, δ), which
is centered at q and contains a feasible solution, denoted by Φ. Let the radius of
the MCC covering Φ be γ (γ ≤ δ). Note that γ can be obtained by computing
the MCC containing Φ by a linear-time algorithm [40]. Next, we give two lemmas:

Lemma 3. (1/2)δ ≤ ropt ≤ γ.

Lemma 4. The radius of the MCC covering the feasible solution Φ has an
approximation ratio of 2.

AppInc finds Φ in an incremental manner. Specifically, it considers vertices
close to q one by one incrementally, and checks whether there exists a feasible
solution when a new vertex is considered. It stops once a feasible solution has
been found. Clearly, AppInc takes O(mn) time, so it is much faster than Exact.
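
The incremental search of AppInc can be sketched as follows; the authors' pseudocode is in [15], and this is only an illustration under stated assumptions. The names `feasible` and `app_inc` are hypothetical, the toy graph and coordinates are invented for the example, and for brevity the sketch scans all vertices instead of first restricting to the k-ĉore containing q.

```python
from math import dist


def feasible(adj, allowed, q, k):
    """Connected subgraph containing q with minimum degree >= k inside `allowed`, or set()."""
    alive = set(allowed)
    while True:
        low = {v for v in alive if len(adj[v] & alive) < k}
        if not low:
            break
        alive -= low
    if q not in alive:
        return set()
    component, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in adj[v] & alive:
            if u not in component:
                component.add(u)
                stack.append(u)
    return component


def app_inc(adj, loc, q, k):
    """Return (Phi, delta): the first feasible solution found and the radius of O(q, delta)."""
    by_distance = sorted(adj, key=lambda v: dist(loc[q], loc[v]))
    considered = set()
    for v in by_distance:
        considered.add(v)                      # grow the circle centered at q
        phi = feasible(adj, considered, q, k)  # O(m) check per added vertex
        if phi:
            return phi, dist(loc[q], loc[v])
    return set(), None


# Invented toy spatial graph: a nearby triangle {q, a, b} and a distant one {q, c, d}.
LOC = {"q": (0, 0), "a": (1, 0), "b": (0, 1), "c": (5, 0), "d": (5, 1)}
ADJ = {"q": {"a", "b", "c", "d"}, "a": {"q", "b"}, "b": {"q", "a"},
       "c": {"q", "d"}, "d": {"q", "c"}}
phi, delta = app_inc(ADJ, LOC, "q", 2)
print(sorted(phi), delta)
```

Each of the at most n feasibility checks costs O(m), which matches the O(mn) bound stated above.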

A (2 + εF)-Approximation Algorithm. In this section, we propose another
fast approximation algorithm, called AppFast, which has a more flexible
approximation ratio, i.e., 2 + εF , where εF is an arbitrary non-negative value. Instead
of finding the circle O(q, δ) in an incremental manner, AppFast approximates
the radius δ by performing binary search. We observe that the lower and upper
bounds of δ, denoted by l and u, are stated by Eq. (1):

l = max_{v ∈ KNN(q)} |q, v|,    u = max_{v ∈ X} |q, v|,    (1)

where X is the list of vertices of the k-ĉore containing q, and KNN(q) contains
the k nearest vertices in X ∩ nb(q) to q. Hence, we can approximate the radius
of the circle O(q, δ) by performing binary search within [l, u] until |u − l| is less
than a predefined small threshold α, and return an SAC denoted by Λ.
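
The binary search over [l, u] from Eq. (1) can be sketched as below. This is a hedged illustration, not the pseudocode of [15]: the names `feasible` and `app_fast` are hypothetical, nb(q) is taken to be the set of q's graph neighbors, and the toy graph and coordinates are invented.

```python
from math import dist


def feasible(adj, allowed, q, k):
    """Connected subgraph containing q with minimum degree >= k inside `allowed`, or set()."""
    alive = set(allowed)
    while True:
        low = {v for v in alive if len(adj[v] & alive) < k}
        if not low:
            break
        alive -= low
    if q not in alive:
        return set()
    component, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in adj[v] & alive:
            if u not in component:
                component.add(u)
                stack.append(u)
    return component


def app_fast(adj, loc, q, k, alpha):
    """Binary-search the radius of the circle centered at q; return (community, radius)."""
    X = feasible(adj, set(adj), q, k)            # vertices of the k-ĉore containing q
    if not X:
        return set(), None
    nearest = sorted(dist(loc[q], loc[v]) for v in adj[q] & X)[:k]
    lower = max(nearest)                                   # l in Eq. (1)
    upper = max(dist(loc[q], loc[v]) for v in X)           # u in Eq. (1)
    best = X                                     # X itself is feasible at radius u
    while upper - lower > alpha:
        mid = (lower + upper) / 2
        inside = {v for v in X if dist(loc[q], loc[v]) <= mid}
        solution = feasible(adj, inside, q, k)
        if solution:
            best, upper = solution, mid          # a feasible solution fits in O(q, mid)
        else:
            lower = mid
    return best, upper


# Invented toy spatial graph: a nearby triangle {q, a, b} and a distant one {q, c, d}.
LOC = {"q": (0, 0), "a": (1, 0), "b": (0, 1), "c": (5, 0), "d": (5, 1)}
ADJ = {"q": {"a", "b", "c", "d"}, "a": {"q", "b"}, "b": {"q", "a"},
       "c": {"q", "d"}, "d": {"q", "c"}}
community, radius = app_fast(ADJ, LOC, "q", 2, alpha=0.01)
print(sorted(community), radius)
```

The loop halves the interval each round, so it runs about log((u − l)/α) feasibility checks instead of one check per vertex.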

Lemma 5. In AppFast, the radius of the MCC covering Λ has an approximation
ratio of (2 + εF), if α ≤ (r × εF)/(2 + εF), where εF ≥ 0.

A (1 + εA)-Approximation Algorithm. We first present a corollary:

Corollary 1. The center point, o, of the MCC O(o, ropt ) covering Ψ is in the
circle O(q, γ).

Although point o is in O(q, γ) by Corollary 1, it is still not easy to locate it
exactly, since the number of its possible positions to be explored can be infinite.
Instead of locating it exactly, we try to find an approximated “center”, which is
very close to o. Specifically, we split the square containing the circle O(q, γ) into
equal-sized cells, and the size of each cell is β × β (we will explain how to set a
proper value of β later). We call the center point of each cell an anchor point.
By Corollary 1, we can conclude that o must be in one specific cell. Then we
can approximate o using the anchor point of this cell, denoted by c. We consider
the circle O(c, rmin ), where rmin is the minimum radius such that it contains a
feasible solution, which is denoted by Γ . We bound the value of rmin by Lemma 6.

Lemma 6. rmin ≤ ropt + (√2/2)β.

By Lemma 6, we have rmin/ropt ≤ 1 + √2β/(2ropt) ≤ 1 + √2β/δ, where the last
step uses ropt ≥ δ/2 from Lemma 3. Thus, we can approximate Ψ using Γ , and the
approximation ratio is (1 + εA), if we let √2β/δ ≤ εA (0 < εA < 1).
To find O(c, rmin ), the basic method is that, for each anchor point p, we
use AppFast to find the circle, which is centered at p and contains a feasible
solution, and then return the minimum circle. To further improve the efficiency,
we develop some optimization techniques. Specifically, we assume that all the
anchor points are organized into a region quadtree [21], where the root node⁷ is a
square, centered at q with width 2γ. By decomposing this square into four equal-
sized quadrants, we obtain its four child nodes. The child nodes of them are built
in the same manner recursively, until the width of the leaf node is in (β/2, β].
Note that the center of each leaf node corresponds to an anchor point. To find
O(c, rmin ), we traverse the quadtree level by level in a top-down manner. Let
⁷ To avoid ambiguity, we use the word “node” for tree nodes in Sect. 4.

[Figure 8: (a) Splitting O(q, γ); (b) rmin, showing q, o, the anchor point c,
ropt and rmin]

Fig. 8. Illustrating AppAcc.

rcur , initialized as γ, record the smallest radius of an MCC containing a feasible
solution. For each node, we first obtain the center p of its square, and then use
the binary search of AppFast to approximate the smallest radius rp , such that
O(p, rp ) contains a feasible solution. Based on these ideas, we develop AppAcc
(Fig. 8).
Lemma 7. If we set α ≤ (1/4)δεA and β = δεA/(√2(2 + εA)) in AppAcc, where
0 < εA < 1 and α is the gap between the upper bound and lower bound when stopping
the binary search, the radius of the MCC covering Γ has an approximation ratio of
(1 + εA).

The Advanced Exact Algorithm. Recall that AppAcc approximates the
center, o, of the MCC covering Ψ by its nearest anchor point c, and |o, c| ≤ (√2/2)β.
Also, ropt is well approximated, i.e., rΓ/ropt ≤ 1 + εA, where rΓ is the radius of the
MCC covering Γ . This implies that rΓ/(1 + εA) ≤ ropt ≤ rΓ , where 0 < εA < 1. So the
value of ropt is in a small interval, if εA is small. Besides, for any fixed vertex,
f , of the MCC of Ψ , its distance to o (i.e., |f, o|) is exactly ropt . By the triangle
inequality, we have

|f, c| ≤ |f, o| + |o, c| ≤ rΓ + (√2/2)β,    (2)

|f, c| ≥ |f, o| − |o, c| ≥ rΓ/(1 + εA) − (√2/2)β.    (3)

Let us denote the rightmost items of the above two inequalities by r+ and r−
respectively. Then, we conclude that, for any fixed vertex f , its distance to
c is in the range [r− , r+ ]. If εA is very small, the gap between r+ and r− , i.e.,
r+ − r− = rΓ (1 − 1/(1 + εA)) + √2β, is also very small, which implies that the
locations of the fixed vertices are in a very narrow annular region. Hence, a large
number of vertices out of this annular region, which are not fixed vertices, can be
pruned safely. Based on the above analysis, we design Exact+.
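
The annulus pruning implied by Eqs. (2) and (3) can be sketched as follows. The function name `annulus_prune` and the sample coordinates are hypothetical; the sketch only computes r− and r+ and filters candidate fixed vertices, and does not reproduce the rest of Exact+.

```python
from math import dist, sqrt


def annulus_prune(loc, c, r_gamma, eps_a, beta):
    """Keep vertices whose distance to the anchor point c lies in [r_minus, r_plus]."""
    r_plus = r_gamma + sqrt(2) / 2 * beta                    # right-hand side of Eq. (2)
    r_minus = r_gamma / (1 + eps_a) - sqrt(2) / 2 * beta     # right-hand side of Eq. (3)
    kept = {v for v, p in loc.items() if r_minus <= dist(p, c) <= r_plus}
    return r_minus, r_plus, kept


# Invented sample: two vertices near the circle of radius 2 around c, two far from it.
SAMPLE = {"f1": (2.0, 0.0), "f2": (0.0, 1.9), "near": (1.0, 0.0), "far": (3.0, 0.0)}
r_minus, r_plus, kept = annulus_prune(SAMPLE, c=(0.0, 0.0), r_gamma=2.0,
                                      eps_a=0.1, beta=0.1)
print(round(r_minus, 4), round(r_plus, 4), sorted(kept))
```

Only the vertices kept by this filter need to be enumerated as fixed vertices, which is where the m|F1|³ term in Table 2 comes from.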

4.3 Experimental Results


We consider four real spatial graphs. For each of them, each vertex has a list of
graph neighbors and a 2-dimensional location. To evaluate SAC search, we set the
default value of k to 4. The default values of εF and εA are set to 0.5, since these
values practically result in good approximation ratios with reasonable efficiency.
For each dataset, we randomly select 200 query vertices with core numbers of 4
or more, which ensures that there is a k-core containing each query vertex. More
details of these graphs and experimental results are reported in [15].
The general conclusions observed from the experiments are that: (1) The
communities returned by SAC search achieve higher spatial cohesiveness than
the state-of-the-art CD and CS methods. For example, the radius of the MCC
covering the communities returned by SAC search is much smaller than that
of Global [48], Local [9], and GeoModu [6]. (2) For exact algorithms, Exact+ is
over four orders of magnitude faster than Exact. For approximation algorithms,
AppFast is the fastest one while AppInc is the slowest one, and AppAcc is the most
accurate one. (3) For moderate-size graphs, Exact+ achieves not only the highest
quality results, but also reasonable efficiency. For large graphs with millions
of vertices, AppFast and AppAcc should be better choices as they are much faster.

5 Conclusions and Future Work


In this article, we investigate the problem of community search (CS) over two
common attributed graphs, where (1) vertices are associated with keywords;
and (2) vertices are augmented with location information. For keyword-based
attributed graphs, we study the problem of keyword-based attributed commu-
nity (KAC) query and find the KACs of a query vertex. Essentially, a KAC is
a community that exhibits structure and keyword cohesiveness. To answer the
KAC query, we develop the CL-tree index and query algorithms. Our experimen-
tal results on real datasets show that KAC queries are more effective than exist-
ing CS and CD algorithms. In addition, our solutions are faster than existing CS
algorithms. For spatial-based attributed graphs, we study the spatial-aware com-
munity (SAC) search problem, which finds the community containing q within
the smallest minimum covering circle (MCC). Essentially, an SAC is a com-
munity that exhibits both structure and spatial cohesiveness. We propose two
exact algorithms, and three efficient approximation algorithms. The experimen-
tal results on real datasets show that, SAC search achieves better effectiveness
than the existing CD and CS algorithms. Also, our algorithms are very fast.
Furthermore, we develop C-Explorer, a system for online, interactive extraction,
visualization, and analysis of the communities of a query vertex.
This article opens up a number of promising directions for future work:
(1) It would be interesting to adopt other classical metrics (e.g., k-truss and k-
clique) for finding communities from attributed graphs. (2) For keyword-based
attributed graphs, it would be interesting to consider other keyword cohesiveness
(e.g., Jaccard similarity and string edit distance) for formulating the community
models. (3) For spatial-based attributed graphs, it is of interest to examine other

kinds of spatial cohesiveness measures by considering more spatial regions (e.g.,
squares) and the pair-wise distances of vertices. (4) For C-Explorer, it is worth
considering more attributes of vertices and edges for searching the communities.

References
1. Barbieri, N., Bonchi, F., Galimberti, E., Gullo, F.: Efficient and effective commu-
nity search. DMKD 29(5), 1406–1433 (2015)
2. Barthélemy, M.: Spatial networks. Phys. Rep. 499(1), 1–101 (2011)
3. Batagelj, V., Zaversnik, M.: An O(m) algorithm for cores decomposition of
networks. arXiv (2003)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
5. Bollobás, B.: The evolution of random graphs. Trans. Am. Math. Soc. 286(1),
257–274 (1984)
6. Chen, Y., Jun, X., Minzheng, X.: Finding community structure in spatially con-
strained complex networks. IJGIS 29(6), 889–911 (2015)
7. Cohen, J.: Trusses: cohesive subgraphs for social network analysis. National Secu-
rity Agency Technical Report, p. 16 (2008)
8. Cui, W., Xiao, Y., Wang, H., Lu, Y., Wang, W.: Online search of overlapping
communities. In: SIGMOD, pp. 277–288 (2013)
9. Cui, W., Xiao, Y., Wang, H., Wang, W.: Local search of communities in large
graphs. In: SIGMOD, pp. 991–1002 (2014)
10. Elzinga, D.J., Hearn, D.W.: The minimum covering sphere problem. Manage. Sci.
19(1), 96–104 (1972)
11. Elzinga, D.J., Hearn, D.W.: Geometrical solutions for some minimax location prob-
lems. Transp. Sci. 6(4), 379–394 (1972)
12. Expert, P., et al.: Uncovering space-independent communities in spatial networks.
PNAS 108(19), 7663–7668 (2011)
13. Fang, Y.: Effective and efficient community search over large attributed graphs.
HKU Ph.D. thesis, September 2017
14. Fang, Y., Cheng, R., Chen, Y., Luo, S., Hu, J.: Effective and efficient attributed
community search. VLDB J. 26(6), 803–828 (2017)
15. Fang, Y., Cheng, R., Li, X., Luo, S., Hu, J.: Effective community search over large
spatial graphs. PVLDB 10(6), 709–720 (2017)
16. Fang, Y., Cheng, R., Luo, S., Hu, J.: Effective community search for large
attributed graphs. PVLDB 9(12), 1233–1244 (2016)
17. Fang, Y., Cheng, R., Luo, S., Hu, J., Huang, K.: C-explorer: browsing communities
in large graphs. PVLDB 10(12), 1885–1888 (2017)
18. Fang, Y., Cheng, R., Tang, W., Maniu, S., Yang, X.: Scalable algorithms for
nearest-neighbor joins on big trajectory data. TKDE 28(3), 785–800 (2016)
19. Fang, Y., Cheng, R., Tang, W., Maniu, S., Yang, X.S.: Scalable algorithms for
nearest-neighbor joins on big trajectory data. In: ICDE, pp. 1528–1529 (2016)
20. Fang, Y., Zhang, H., Ye, Y., Li, X.: Detecting hot topics from Twitter: a multiview
approach. J. Inf. Sci. 40(5), 578–593 (2014)
21. Finkel, R.A., Bentley, J.L.: Quad trees: a data structure for retrieval on composite
keys. Acta Informatica 4(1), 1–9 (1974)
22. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
20 Y. Fang and R. Cheng

23. Gaertler, M., Patrignani, M.: Dynamic analysis of the autonomous system graph.
In: IPS, pp. 13–24 (2004)
24. Gibbons, A.: Algorithmic Graph Theory. Cambridge University Press, Cambridge
(1985)
25. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. PNAS 99(12), 7821–7826 (2002)
26. Guo, D.: Regionalization with dynamically constrained agglomerative clustering
and partitioning (redcap). IJGIS 22(7), 801–823 (2008)
27. Guo, T., Cao, X., Cong, G.: Efficient algorithms for answering the m-closest key-
words query. In: SIGMOD, pp. 405–418. ACM (2015)
28. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation.
In: SIGMOD (2000)
29. Hu, J., Cheng, R., Huang, Z., Fang, Y., Luo, S.: On embedding uncertain graphs.
In: CIKM. ACM (2017)
30. Hu, J., Wu, X., Cheng, R., Luo, S., Fang, Y.: Querying minimal Steiner maximum-
connected subgraphs in large graphs. In: CIKM, pp. 1241–1250 (2016)
31. Hu, J., Xiaowei, W., Cheng, R., Luo, S., Fang, Y.: On minimal steiner maximum-
connected subgraph queries. TKDE 29(11), 2455–2469 (2017)
32. Huang, X., Cheng, H., Qin, L., Tian, W., Yu, J.X.: Querying k-truss community
in large and dynamic graphs. In: SIGMOD (2014)
33. Huang, X., Lakshmanan, L.V.S., Yu, J.X., Cheng, H.: Approximate closest com-
munity search in networks. PVLDB 9(4), 276–287 (2015)
34. Kim, Y., Son, S.-W., Jeong, H.: Finding communities in directed networks. Phys.
Rev. E 81(1), 016103 (2010)
35. Leicht, E.A., Newman, M.E.J.: Community structure in directed networks. Phys.
Rev. Lett. 100(11), 118703 (2008)
36. Li, R.-H., Qin, L., Yu, J.X., Mao, R.: Influential community search in large net-
works. In: PVLDB (2015)
37. Li, Z., Fang, Y., Liu, Q., Cheng, J., Cheng, R., Lui, J.: Walking in the cloud:
parallel simrank at scale. PVLDB 9(1), 24–35 (2015)
38. Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and
author community. In: ICML (2009)
39. Malliaros, F.D., Vazirgiannis, M.: Clustering and community detection in directed
networks: a survey. Phys. Rep. 533(4), 95–142 (2013)
40. Megiddo, N.: Linear-time algorithms for linear programming in r3 and related
problems. In: FOCS, pp. 329–338. IEEE (1982)
41. Nallapati, R.M., Ahmed, A., Xing, E.P., Cohen, W.W.: Joint latent topic models
for text and citations. In: KDD (2008)
42. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in net-
works. Phys. Rev. E 69(2), 026113 (2004)
43. Plantié, M., Crampes, M.: Survey on social community detection. In: Ramzan,
N., van Zwol, R., Lee, J.S., Clüver, K., Hua, X.S. (eds.) Social Media Retrieval.
Computer Communications and Networks, pp. 65–85. Springer, London (2013).
https://doi.org/10.1007/978-1-4471-4555-4 4
44. Ruan, Y., Fuhry, D., Parthasarathy, S.: Efficient community detection in large
networks using content and links. In: WWW (2013)
45. Sachan, M., et al.: Using content and interactions for discovering communities in
social networks. In: WWW, pp. 331–340 (2012)
46. Seidman, S.B.: Network structure and minimum degree. Soc. Netw. 5(3), 269–287
(1983)
quite impervious, but becoming—at any rate in the case of the larger
and more important pair—open previous to the final ecdysis. We
have mentioned the contradictory opinions of Réaumur and Dufour,
and will now add the views of some modern investigators. Oustalet
says[341] that there are two pairs of spiracles in the nymphs; the first
pair is quite visible to the naked eye, and is situate between pro- and
meso-notum; it is in the nymph closed by a membrane. The other
pair of spiracles is placed above the posterior pair of legs, is small
and completely closed. He does not state what stage of growth was
attained by the nymphs he examined. Palmén was of opinion that
not only thoracic but abdominal spiracles exist in the nymph,[342] and
that they are completely closed so that no air enters them; he says
that the spiracles have tracheae connected with them, that at each
moult the part closing the spiracles is shed with some of the tracheal
exuviae attached to it. The breathing orifices are therefore for a short
time at each ecdysis open, being subsequently again closed by
some exudation or secretion. This view of Palmén's has been
thought improbable by Hagen and Dewitz, who operated by placing
nymphs in alcohol or warm water and observing the escape of
bubbles from the spots where the supposed breathing orifices are
situate. Both these observers found much difference in the results
obtained in the cases of young and of old nymphs. Hagen concludes
that the first pair of thoracic spiracles are functionally active, and that
abdominal stigmata exist though functionless; he appears to be of
opinion that when the first thoracic stigma is closed this is the result
of the abutting against it of a closed trachea. Dewitz found[343] that
in the adult nymph of Aeschna the thoracic stigma is well developed,
while the other stigmata—to what number and in what position is not
stated—are very small. In a half-grown Aeschnid nymph he found
the thoracic stigma to be present in an undeveloped form. On
placing a full-grown nymph in alcohol, gas escaped from the stigma
in question, but in immature nymphs no escape of gas occurred
although they were subjected to a severe test. A specimen that,
when submitted to the above-mentioned immersion, emitted gas,
subsequently moulted, and thereafter air escaped from the spiracle
previously impervious. The observations of Hagen and Dewitz are
perhaps not so adverse to the views of Palmén as has been
supposed, so that it would not be a matter for surprise if Palmén's
views on this point should be shown to be quite correct.

The number of species of Odonata or Libellulidae that have been
described is somewhat less than two thousand, but constant
additions are made to the number, and when the smaller and more
fragile forms from the tropics are collected and worked out it will
probably be found that the number of existing species is somewhere
between five and ten thousand. They are distributed all over the
world, but are most numerous in species in the warmer regions, and
their predominance in any one locality is very much regulated by the
existence of waters suitable for the early stages of their lives.

A good work on the British Odonata is still a desideratum.[344] In
Britain about forty-six species are believed to be native. They are
said to be of late years less numerous than they used to be.
Notwithstanding their great powers of flight, dragon-flies are
destroyed by birds of various kinds; several hawks are said to be
very fond of them, and Merops persicus to line its nest with their
wings. The number of Insects killed by dragon-flies in places where
they are abundant must be enormous; the nymphs, too, are very
destructive in the waters they inhabit, so that dragon-flies have no
doubt been no mean factor in maintaining that important and delicate
balance of life which it is so difficult for us to appreciate. The nymphs
are no doubt cannibals, and this may perhaps be an advantage to
the species, as the eggs are sometimes deposited in large numbers
in a limited body of water, where all must perish if the nymphs did
not, after exhausting other food, attack one another. Martin, speaking
of the Odonata of the Département de l'Indre in France, says:[345]
"The eggs, larvae, and nymphs are the prey of several fishes,
snakes, newts, Coleoptera, aquatic Hemiptera, and of some diving
birds. Sometimes the destruction is on a considerable scale, and one
may notice the dragon-flies of some piece of water to diminish
gradually in numbers, while the animals that prey on them increase,
so that a species may for a time entirely disappear in a particular
spot, owing to the attacks of some enemy that has been specially
prosperous, and also eager in their pursuit. De Selys found that from
a pond filled with carp, roach, perch, and eels, several of the dragon-
fly denizens disappeared directly the bream was introduced." On the
other hand, there can be little doubt that the nymphs are sometimes
injurious to fish; it has been recorded that in a piscicultural
establishment in Hungary 50,000 young fishes were put into a pond
in spring; in the following autumn only fifty-four fish could be found,
but there were present an enormous quantity of dragon-fly nymphs.

Odonata are among the few kinds of Insects that are known to form
swarms and migrate. Swarms of this kind have been frequently
observed in Europe and in North America; they usually consist of
species of the genus Libellula, but species of various other genera
also swarm, and sometimes a swarm may consist of more than one
species. L. quadrimaculata is the species that perhaps most
frequently forms these swarms in Europe; a large migration of this
species is said to occur every year in the Charente inférieure from
north to south.[346] It is needless to say that the instincts and stimuli
connected with these migrations are not understood.

The nymphs are capable, under certain circumstances, of
accommodating themselves to very peculiar conditions of life. The
Sandwich Islands are extremely poor in stagnant waters, and yet
there exist in this remote archipelago several highly peculiar species
of Agrioninae. Mr. R. C. L. Perkins has recently discovered that the
nymphs of some of these are capable of maintaining their existence
and completing their development in the small collections of water
that accumulate in the leaves of some lilies growing on dry land.
These nymphs (Fig. 271) have a shorter mask than occurs, we
believe, in any other Odonata, and one would suppose that they
must frequently wait long for a meal, as they must be dependent on
stray Insects becoming immersed in these tiny reservoirs. The
cannibal habits of the Odonata probably stand these lily-dwellers in
good stead; Mr. Perkins found that there were sometimes two or
three nymphs of different sizes together, and we may suspect that it
sometimes goes hard with the smaller fry. The extension in the
length of the body of one of these lily-frequenting Agrions when it
leaves the water for its aerial existence is truly extraordinary.

Fig. 271.—Under side of Agrionid nymph, with short mask, living in
water in lilies. Hawaiian Islands. × 3.

The Odonata have no close relations with any other group of Insects.
They were associated by Latreille with the Ephemeridae, in a family
called Subulicornia. The members of the two groups have, in fact, a
certain resemblance in some of the features of their lives, especially
in the sudden change, without intermediate condition, from aquatic to
aerial life; but in all important points of structure, and in their
dispositions, dragon-flies and may-flies are totally dissimilar, and
there is no intermediate group to connect them. We have already
said that the Odonata consist of two very distinct divisions—
Anisopterides and Zygopterides. The former group comprises the
subfamilies Gomphinae, Cordulegasterinae, Aeschninae,
Corduliinae, and Libellulinae,—Insects having the hinder wings
slightly larger than the anterior pair; while the Zygopterides consist of
only two subfamilies—Calepteryginae and Agrioninae; they have the
wings of the two pairs equal in size, or the hinder a little the smaller.
The two groups Gomphinae and Calepteryginae are each, in several
respects, of lower development than the others, and authorities are
divided in opinion as to which of the two should be considered the
more primitive. It is therefore of much interest to find that there exists
an Insect that shares the characters of the two primitive subfamilies
in a striking manner. This Insect, Palaeophlebia superstes (Fig. 272),
has recently been discovered in Japan, and is perhaps the most
interesting dragon-fly yet obtained. De Selys Longchamps refers it to
the subfamily Calepteryginae, on account of the nature of its wings;
were the Insect, however, deprived of these organs, no one would
think of referring Palaeophlebia to the group in question, for it has
the form, colour, and appearance of a Gomphine Odonate.
Moreover, the two sexes differ in an important character,—the form
of the head and eyes. In this respect the female resembles a
Gomphine of inferior development; while the male, by the shape and
large size of the ocular organs, may be considered to combine the
characters of Gomphinae and Calepteryginae. The Insect is very
remarkable in colour, the large eyes being red in the dead examples.
We do not, however, know what may be their colour during life, as
only one pair of the species is known, and there is no record as to
the life-history and habits. De Selys considers the nearest ally of this
Insect to be Heterophlebia dislocata, a fossil dragon-fly found in the
Lower Lias of England.

Fig. 272.—Palaeophlebia superstes. A, The Insect with wings of one
side and with two legs removed; B, front view of head of female;
C, of male. (After De Selys.)

Numerous fossil dragon-flies are known; the group is well
represented in the Tertiary strata, and specimens have been found in
amber. In strata of the Secondary age these Insects have been
found as far back as the Lower Lias; their remains are said to exist in
considerable variety in the strata of that epoch, and some of them to
testify to the existence at that period of dragon-flies as highly
specialised as those now living. According to Hagen[347]
Platephemera antiqua and Gerephemera simplex, two Devonian
fossils, may be considered as dragon-flies; the evidence as to this
appears inadequate, and Brongniart refers the latter Insect to the
family Platypterides, and considers Platephemera to be more allied
to the may-flies.

One of the most remarkable of the numerous discoveries lately
made in fossil entomology is the finding of remains of huge Insects,
evidently allied to dragon-flies, in the Carboniferous strata at
Commentry. Brongniart calls these Insects Protodonates,[348] and
looks on them as the precursors of our Odonata. Meganeura monyi
was the largest of these Insects, and measured over two feet across
the expanded wings. If M. Brongniart be correct in his restoration of
this giant of the Insect world, it much resembled our existing dragon-
flies, but had a simple structure of the thoracic segments, and a
simpler system of wing-nervures. On p. 276 we figured
Titanophasma fayoli, considered by Scudder and Brongniart as allied
to the family Phasmidae, and we pointed out that this supposed
alliance must at best have been very remote. This view is now taken
by M. Brongniart himself,[349] he having removed the Insect from the
Protophasmides to locate it in the Protodonates near Meganeura.
There appears to be some doubt whether the wings supposed to
belong to this specimen were really such, or belonged rather to
some other species.

CHAPTER XIX

AMPHIBIOUS NEUROPTERA CONTINUED—EPHEMERIDAE, MAY-FLIES

Fam. VII. Ephemeridae—May-flies.

Delicate Insects with atrophied mouth and small, short antennae;
with four membranous wings having much minute cross-veining;
the hinder pair very much smaller than the other pair, sometimes
entirely absent: the body terminated by three or two very
elongate slender tails. The earlier stages are passed through in
water, and the individual then differs greatly in appearance from
the winged Insect; the passage between the two forms is
sudden; the creature in its first winged state is a subimago,
which by shedding a delicate skin reveals the final form of the
individual.

Fig. 273.—Ephemera danica, male, Britain.

The may-flies are well known—in literature—as the types of a brief
and ineffective life. This supposed brevity relates solely to their
existence in the winged form. In the earlier stages the may-fly is so
unlike its subsequent self that it is not recognised as a may-fly by the
uninitiated. The total life of the individual is really quite as long as
that of most other Insects. The earlier stages and life-histories of
these Insects are of great importance. The perfect Insects are so
delicate and fragile that they shrivel much in drying, and are very
difficult to preserve in a condition suitable for study.

The mouth of the imago is atrophied, the trophi scarcely existing as
separate parts. Packard says that in Palingenia bilineata he could
discover no certain traces of any of the mouth-parts, but in
Leptophlebia cupida he found, as he thought, the rudiments of the
maxillae and labium, though not of the mandibles. The antennae are
always short, and consist of one or two thick basal joints succeeded
by a delicate needle-like segment, which, though comparatively long,
is not divided. The ocular organs are remarkable for their large size
and complex development; they are always larger in the male than
they are in the female. The compound eyes of the former sex are in
certain species, e.g. Cloëon (Fig. 274), quite divided, so that each
eye becomes a pair of organs of a different character; one part forms
a pillar facetted at its summit, while the other part remains as a true
eye placed on the side of the head; in front of these compound eyes
there are three ocelli. Thus the Insect comes to have three different
kinds of eyes, together seven in number: two pillared eyes, two sessile
eyes, and three ocelli.

Fig. 274.—Front of head of Cloëon, male. a, Pillared eye; b, sessile
eye; c, ocellus.

The prothorax is small, the pronotum being, however, quite distinct.
The mesothorax is very large; its notum forms by far the larger part
of the upper surface of the thoracic region, the metathorax being
small and different in structure, resembling in appearance a part of
the abdomen, so that the hind wings look as if they were attached to
a first abdominal segment. The mesosternum is also
disproportionately large in comparison with the homologous piece
preceding it, and with that following it. The pleural pieces are large,
but their structure and disposition are only very imperfectly
understood. The coxae are small and are widely separated, the
anterior being, however, more elongate and approximate than the
others. The other parts of the legs are slender; the number of joints
in the tarsi varies from five to one. The legs throughout the family
exhibit a considerable variety of structure, and the front pair in the
males of some species are remarkably long. The abdomen is usually
slender, and consists of ten segments; the terminal one bears three,
or two, very long flexible appendages. The first dorsal plate of the
abdomen is either wanting or is concealed to a considerable extent
by the metanotum. The wings are peculiar; the anterior pair vary a
great deal in their width, but are never very long in proportion to the
width; the hind pair are always disproportionately small, and
sometimes are quite wanting. The venation consists of a few, or of a
moderate number, of delicate longitudinal veins that do not pursue a
tortuous course, but frequently are gracefully curved, and form a
system of approximately similar curves, most of the veins being of
considerable length; close to the anterior margin of the wing there
are two or three sub-parallel veins. Frequently there are very
numerous fine, short cross-veinlets, but these vary greatly and may
be entirely wanting.

Fig. 275.—Wings of Ephemera danica. (After Eaton.)

The earlier stages of the life of Ephemeridae are, it is believed, in the
case of all the species, aquatic. May-flies, indeed, during the period
of their post-embryonic development are more modified for an
aquatic life than any other Insects, and are provided with a complex
apparatus of tracheal gills. The eggs are committed to the waters
without any care or foresight on the part of the parent flies, thus the
embryonic development is also aquatic; little, however, is known of it.
According to Joly[350] the process in Palingenia virgo is slow. The
larva on emerging from the egg has no respiratory system, neither
could Joly detect any circulation or any nervous system. The
creature on emergence is very like Campodea in form, possessing
long antennae and tails—caudal setae. Owing to the organisation
being inferior, the creature in its earlier stages is called a larvule; in
its later stages it is usually spoken of as a nymph, but the term larva
is also frequently applied to it. Soon the gills begin to appear in the
form of small tubular caeca placed in the posterior and upper angles
of the abdominal rings; in fifteen days the gills begin to assume their
characteristic form, are penetrated by tracheae, and the circulation
can be seen. The amount of growth accomplished after hatching
between March and September is but small.

Fig. 276.—Nymph of Cloëon dipterum.[351] Wing-sheath of left side,
gills of right side, removed; g, tracheal gills. (After Vayssière.)

Fig. 277.—Larvule of Cloëon dimidiatum. (After Lubbock.)

The metamorphosis of Cloëon has been described by Sir John
Lubbock; he informs us that the young creature undergoes a
constant and progressive development, going through a series of
more than twenty moults, each accompanied by a slight change of
form or structure. His observations were made on captured
specimens, so that it is not certain that what he calls[352] the first
stage is really such. He found no tracheae in the earliest stages; the
small first rudiments of the gills became visible in the third stage,
when there were no tracheae; the fourth instar possessed tracheae,
and they could be seen in the gills. The wing rudiments could first be
detected in the ninth and tenth stages. The changes of skin during
the winter months are separated by longer intervals than those
occurring at other periods of the year.

Fig. 278.—Adult nymph of Ephemera vulgata. (After Eaton.) Britain.

The nymphs differ greatly in the structure and arrangement of their
tracheal gills, and display much variety in their general form and
habits; some of them are very curious creatures. Pictet[353] divides
them in accordance with their habits into four groups: (1) Fossorial
larvae: these live in the banks of streams and excavate burrows for
shelter; they are of cylindrical form, possess robust legs, abundant
gills at the sides of the body, and frequently processes projecting
forwards from the head: examples, Ephemera (Fig. 278) and
Palingenia. (2) Flat larvae: these live attached to rocks, but run with
rapidity when disturbed; they prefer rapid streams, have the
breathing organs attached to the sides of the body and not reposing
on the back; they are exclusively carnivorous, while the fossorial
forms are believed to obtain their nutriment by eating mud: example,
Baëtis. (3) Swimming larvae: elongate delicate creatures, with feeble
legs, and with strongly ciliated caudal setae: example, Cloëon (Fig.
276). (4) Climbing larvae: these live in slowly-moving waters,
especially such as have much slimy mud in suspension, and they
have a habit of covering themselves with this mud sometimes to
such an extent as to become concealed by it: example,
Potamanthus.

Fig. 279.—Nymph of Oligoneuria garumnica, France. g2 and g7, two of
the dorsal tracheal gills. (After Vayssière.)

The anatomy of the nymphs has been treated by Vayssière,[354] who
arranges them in five groups in accordance with the conditions of the
tracheal gills: (1) The gills are of large size, are exposed and
furnished at the sides with respiratory fringes: example, Ephemera
(Fig. 278). (2) The branchiae are blade-like, not fringed, and are
exposed at the sides of the body: example, Cloëon (Fig. 276). (3)
The respiratory tubes are placed on the under surface of plates
whose upper surface is not respiratory: example, Oligoneuria
garumnica (Fig. 279). (4) The anterior gill is modified to form a plate
that covers the others: example, Tricorythus (Fig. 282, B). (5) The
gills are concealed in a respiratory chamber: example, Prosopistoma
(Fig. 280). The last of these nymphs is more completely adapted for
an aquatic life than any other Insect at present known; it was for long
supposed to be a Crustacean, but it has now been shown to be the
early stage of a may-fly, the sub-imago having been reared from the
nymph. The carapace by which the larger part of the body is covered
is formed by the union of the pro- and meso-thorax with the sheaths
of the anterior wings, which have an unusually extensive
development; under the carapace there is a respiratory chamber, the
floor and sides of which are formed by the posterior wing-sheaths,
and by a large plate composed of the united nota of the metathorax
and the first six abdominal segments. In this chamber there are
placed five pairs of tracheal gills; entrance of water to the chamber is
effected by two laterally-placed orifices, and exit by a single dorsal
aperture. These nymphs use the body as a sucker, and so adhere
strongly to stones under water. When detached they swim rapidly by
means of their caudal setae; the form of these latter organs is
different from that of other Ephemerid nymphs. This point and other
details of the anatomy of this creature have been described in detail
by Vayssière.[355] These nymphs have a very highly developed
tracheal system; they live in rapid watercourses attached to stones
at a depth of three to six inches or more under the water. Species of
Prosopistoma occur in Europe, Madagascar, and West Africa.

Fig. 280.—Prosopistoma punctifrons, nymph. France. (After Vayssière.)
o, Orifice of exit from respiratory chamber.

According to Eaton,[356] in the nymphs of some Ephemeridae the
rectum serves, to a certain extent, as a respiratory agent; he
considers that water is admitted to it and expelled after the manner
we have described in Odonata, p. 421.

Fig. 281.—A, Last three abdominal segments and bases of the three
caudal processes of Cloëon dipterum: r, dorsal vessel; kl, ostia
thereof; k, special terminal chamber of the dorsal vessel with its
entrance a; b, blood-vessel of the left caudal process; B, twenty-
sixth joint of the left caudal process from below; b, a portion of the
blood-vessel; o, orifice in the latter. (After Zimmermann.)

The internal anatomy of the nymphs of Ephemeridae shows some
points of extreme interest. The long caudal setae are respiratory
organs of a kind that is almost if not quite without parallel in the other
divisions of Insecta. The dorsal vessel for the circulation of the blood
is elongate, and its chambers are arranged one to each segment of
the body. It drives the blood forwards in the usual manner, but the
posterior chamber possesses three blood-vessels, one of which is
prolonged into each caudal seta. This terminal chamber is so
arranged as to drive the blood backwards into the vessels of the
setae; on the under surface of the vessels there are oval orifices by
which the blood escapes into the cavity of the seta so as to be
submitted to the action of the surrounding medium for some of the
purposes of respiration. This structure has been described by
Zimmermann,[357] who agrees with Creutzberg[358] that the organ by
which the blood is propelled into the setae is a terminal chamber of
the dorsal vessel; Verlooren,[359] who first observed this accessory
system of circulation, thought the contractile chamber was quite
separate from the heart. The nature of the connexion between this
terminal chamber that drives the blood backwards and the other
chambers that propel the fluid forwards appears still to want
elucidation.

Fig. 282.—A, Nymph of Ephemerella ignita with gills of left side
removed; g, gills: B, nymph of Tricorythus sp. with gill cover of
right side removed; g.c, gill cover; g, g′, gills. (After Vayssière.)

The nymphs of the Ephemeridae being creatures adapted for
existence in water, the details of their transformation into creatures
having an entirely aerial existence cannot but be of much interest. In
the nymphs the tracheal system is well developed, but differs from
that of air-breathing Insects in the total absence of any spiracles.
Palmén has investigated this subject,[360] and finds that the main
longitudinal tracheal trunks of the body of the nymph are not
connected with the skin of the body by tracheae, but are attached
thereto by ten pairs of slender strings extending between the
chitinous integument and the tracheal trunks. When the skin is shed
these strings—or rather a chitinous axis in each one—are drawn out
of the body, and bring with them the chitinous linings of the tracheae.
Thus notwithstanding the absence of spiracles, the body wall is at
each moult pierced by openings that extend to the tracheae. After
the ordinary moults these orifices close immediately, but at the
change to the winged state they remain open and form the spiracles.
At the same time the tracheal gills are completely shed, and the
creature is thus transformed from a water-breather to an Insect
breathing air as usual. In addition to this change there are others of
great importance, such as the development of the great eyes and the
complete atrophy of the mouth-parts. The precise manner of these
changes is not known; they occur, however, within the nymph skin.
The sudden emergence of the winged Insect from the nymph is one
of the most remarkable facts in the life-history of the may-fly; it has
been observed by Sir John Lubbock,[361] who describes it as almost
instantaneous. The nymph floats on the water, the skin of the back
opens, and the winged Insect flies out, upwards and away; "from the
moment when the skin first cracks not ten seconds are over before
the Insect has flown away." The creature that thus escapes has not,
however, quite completed its transformation. It is still enveloped in a
skin that compresses and embarrasses it; this it therefore rapidly
gets rid of, and thus becomes the imago, or final instar of the life-
cycle. The instar in which the creature exists winged and active,
though covered with a skin, is called the sub-imago. The parts of the
body in the sub-imago are as a whole smaller than they are in the
imago, and the colour is more dingy; the appendages—wings, legs,
and caudal setae—are generally considerably shorter than they are
in the imago, but attain their full length during the process of
extraction. The creatures being, according to Riley, very impatient
and eager to take to the wing, the completion of the shedding of the
skin of the sub-imago is sometimes performed while the Insect is
flying in the air.

Fig. 283.—Lingua of Heptagenia longicauda, × 16. m, Central; l, lateral
pieces. (After Vayssière.)

The food of young Ephemeridae is apparently of a varied and mixed
nature. Eaton says[362] that though sometimes the stronger larvae
devour the weaker, yet the diet is even in these cases partly
vegetable. The alimentary canal frequently contains much mud; very
small organisms, such as diatoms and confervae, are thought to
form a large part of the bill of fare of Ephemerid nymphs. Although
the mouth is atrophied in the imago, yet it is highly developed in the
nymphs. This is especially notable in the case of the lingua or
hypopharynx (Fig. 283); indeed Vayssière[363] seems to incline to the
opinion that this part of the mouth may be looked on in these Insects
as a pair of appendages of a head-segment (see p. 96 ante), like the
labium or maxillae.

The life-history has not been fully ascertained in the case of any
species of may-fly; it is known, however, that the development of the
nymph sometimes occupies a considerable period, and it is thought
that in the case of some species this extends to as much as three
years. It is rare to find the post-embryonic development of an Insect
occupying so long a period, so that we are justified in saying that
brief as may be the life of the may-fly itself, the period of preparation
for it is longer than usual. Réaumur says, speaking of the winged fly,
that its life is so short that some species never see the sun. Their
emergence from the nymph-skin taking place at sunset, the duties of
the generation have been, so far as these individuals are concerned,
completed before the morning, and they die before sunrise. He
thinks, indeed, that individuals living thus long are to be looked on as
Methuselahs among their fellows, most of whom, he says, live only
an hour or half an hour.[364] It is by no means clear to which species
these remarks of Réaumur refer; they are doubtless correct in
certain cases, but in others the life of the adult is not so very short,
and in some species may, in all probability, extend over three or four
days; indeed, if the weather undergo an unfavourable change so as
to keep them motionless, the life of the flies may be prolonged for a
fortnight.

The life of the imago of the may-fly is as remarkable as it is brief; in
order to comprehend it we must refer to certain peculiarities of the
anatomy with which the vital phenomena are connected. The more
important of these are the large eyes of the males, the structure of
the alimentary canal, and that of the reproductive organs. We have
already remarked that the parts of the mouth in the imago are
atrophied, yet the canal itself not only exists but is even of greater
capacity than usual; it appears to have much the same general
arrangement of parts as it had in the nymph. Its coats are, however,
of great tenuity, and according to Palmén[365] the divisions of the
canal are separated by changes in the direction of certain portions
anterior to, and of others posterior to, its central and greater part—
the stomach—in such a manner that the portions with diverted
positions act as valves. The stomach, in fact, forms in the interior of
the body a delicate capacious sac; when movement tends to
increase the capacity of the body cavity then air enters into the
stomachic sac by the mouth orifice, but when muscular contractions
result in pressure on the sac they close the orifices of its extremities
by the valve-like structures we have mentioned above; the result is,
that as complex movements of the body are made the stomach
becomes more and more distended by air. It was known even to the
old naturalists that the dancing may-fly is a sort of balloon, but they
were not acquainted with the exact mode of inflation. Palmén says
that in addition to the valve-like arrangements we have described,
the entry to the canal is controlled by a circular muscle, with which
are connected radiating muscles attached to the walls of the head.
Palmén's views are adopted, and to a certain extent confirmed, by
Fritze,[366] who has examined the alimentary canal of the may-fly,
and considers that though the normal parts of the canal exist, the
function is changed in the imago, in which the canal serves as a sort
of balloon, and aids the function of the reproductive organs. The
change in the canal takes place in an anticipatory manner during the
nymph and sub-imago stages.

The sexual organs of Ephemeridae are remarkable for their
simplicity; they are destitute of the accessory glands and diverticula
that, in some form or other, are present in most other Insects. Still
more remarkable is the fact that the ducts by which they
communicate with the exterior continue as a pair to the extremity of
the body, and do not, as in other Insects, unite into a common duct.
Thus in the female there is neither bursa copulatrix, receptaculum
seminis, nor uterine portion of oviduct, and there is no trace of an
ovipositor; the terminations of the ducts are placed at the hind
margin of the seventh ventral plate, just in front of which they are
connected by a fold of the integument. The ovary consists of a very
large number of small egg-tubes seated on one side of a sac, which
forms their calyx, and one of whose extremities is continued
backwards as one of the pair of oviducts. The male has neither
vesiculae seminales, accessory glands, nor ductus ejaculatorius.
The testes are elongate sacs, whose extremities are prolonged
backwards forming the vasa deferentia; these open separately at the
extremity of the body, each on a separate intromittent projection of
more or less complex character, the two organs being, however,
connected by means of the ninth ventral plate, of which they are,
according to Palmén, appendages. We should remark that this
authority considers Heptagenia to form, to some extent, an exception
as regards the structures of the female; while Polymitarcys is in the
male sex strongly aberrant, as the two vasa deferentia, instead of
being approximately straight, are bent inwards at right angles near
their extremities so as to meet, and form in the middle a common
cavity, which then again becomes double to pass into the pair of
intromittent organs.

According to the views of Exner and others, the compound eyes of
Insects are chiefly organs for the perception of movement; if this
view be correct, movements such as those made during the dances
of may-flies may, by the number of the separate eyes, by their
curved surfaces and innumerable facets, be multiplied and
correlated in a manner of which our own sense of sight allows us to
form no conception. We can see on a summer's evening how
beautifully and gracefully a crowd of may-flies dance, and we may
well believe that to the marvellous ocular organs of the flies
themselves (Fig. 274) these movements form a veritable ballet. We
have pointed out that by this dancing the peculiarly formed
alimentary canal becomes distended, and may now add that Palmén
and Fritze believe that the unique structure of the reproductive
organs is also correlated with the other anatomical peculiarities, the
contents of the sexual glands being driven along the simple and
direct ducts by the expansion of the balloon-like stomach. During
these dances the momentary conjugation of the sexes occurs, and
immediately thereafter the female, according to Eaton, resorts to the
waters appropriate for the deposition of her eggs. As regards this,
Eaton says:[367] "Some short-lived species discharge the contents of
their ovaries completely en masse, and the pair of fusiform or
subcylindrical egg-clusters laid upon the water rapidly disintegrate,
so as to let the eggs sink broadcast upon the river-bed. The less
perishable species extrude their eggs gradually, part at a time, and
deposit them in one or other of the following manners: either the
mother alights upon the water at intervals to wash off the eggs that
have issued from the mouths of the oviducts during her flight, or else
she creeps down into the water to lay her eggs upon the under-side
of stones, disposing them in rounded patches, in a single layer
evenly spread, and in mutual contiguity." The eggs are very
numerous, and it is thought may sometimes remain in the water as
much as six or seven months before they hatch.

The number of individuals produced by some kinds of may-flies is
remarkable. Swarms consisting of millions of individuals are
occasionally witnessed. D'Albertis observed Palingenia papuana in
countless myriads on the Fly River in New Guinea: "For miles the
surface of the river, from side to side, was white with them as they
hung over it on gauzy wings; at certain moments, obeying some
mysterious signal, they would rise in the air, and then sink down
anew like a fall of snow." He further states that the two sexes were in
very disproportionate numbers, and estimates that there was but a
single female to every five or six thousand males.

Ephemeridae in the perfect state are a favourite food of fishes, and it
is said that on some waters it is useless for the fly-fisher to try any
other lure when these flies are swarming. Most of the "duns" and
"spinners" of the angler are Ephemeridae; so are several of the
"drakes," our large E. danica and E. vulgata being known as the
green drake and the gray drake. Ronalds says[368] that the term
"dun" refers to the pseud-imago condition, "spinner" to the perfect
Insect. E. danica and E. vulgata are perhaps not distinguished by
fishers; Eaton says that the former is abundant in rapid, cool
streams, while E. vulgata prefers warmer and more tranquil rivers.

These sensitive creatures are unable to resist the attractions of
artificial lights. Réaumur noticed this fact many years ago, and since
the introduction of the electric light, notes may frequently be seen in
journals recording that myriads of these Insects have been lured by it
to destruction. Their dances may frequently be observed to take
place in peculiar states of light and shade, in twilight, or where the
sinking sun has its light rendered broken by bushes or trees;
possibly the broken lights are enhanced in effect by the ocular
structures of the Insects. It has recently been ascertained that a
species of Teleganodes is itself luminous. Mr. Lewis,[369] who
observed this Insect in Ceylon, states that in life the whole of the
abdomen was luminous, not brightly so, but sufficient to serve as a
guide for capturing the Insect on a dark night. It has also been
recorded that the male of Caenis dimidiata gives a faint blue light at
night.

Nearly 300 species of Ephemeridae are known, but this may be only
a fragment of what actually exist, very little being known of may-flies
