Springer Series in the Data Sciences

Lijun Chang · Lu Qin

Cohesive Subgraph
Computation
over Large Sparse
Graphs
Algorithms, Data Structures,
and Programming Techniques
Springer Series in the Data Sciences

Series Editors:
David Banks, Duke University, Durham
Jianqing Fan, Princeton University, Princeton
Michael Jordan, University of California, Berkeley
Ravi Kannan, Microsoft Research Labs, Bangalore
Yurii Nesterov, Universite Catholique de Louvain, Louvain-la-Neuve
Christopher Ré, Stanford University, Stanford
Ryan Tibshirani, Carnegie Mellon University, Pittsburgh
Larry Wasserman, Carnegie Mellon University, Pittsburgh

Springer Series in the Data Sciences focuses primarily on monographs and graduate
level textbooks. The target audience includes students and researchers working in
and across the fields of mathematics, theoretical computer science, and statistics.
Data Analysis and Interpretation is a broad field encompassing some of the
fastest-growing subjects in interdisciplinary statistics, mathematics and computer
science. It encompasses a process of inspecting, cleaning, transforming, and mod-
eling data with the goal of discovering useful information, suggesting conclusions,
and supporting decision making. Data analysis has multiple facets and approaches,
including diverse techniques under a variety of names, in different business, science,
and social science domains. Springer Series in the Data Sciences addresses the needs
of a broad spectrum of scientists and students who are utilizing quantitative methods
in their daily research.
The series is broad but structured, including topics within all core areas of the
data sciences. The breadth of the series reflects the variation of scholarly projects
currently underway in the field of machine learning.

More information about this series at http://www.springer.com/series/13852


Lijun Chang • Lu Qin

Cohesive Subgraph
Computation over
Large Sparse Graphs
Algorithms, Data Structures,
and Programming Techniques

Lijun Chang
School of Computer Science
The University of Sydney
Sydney, NSW, Australia

Lu Qin
Centre for Artificial Intelligence
University of Technology Sydney
Sydney, NSW, Australia

ISSN 2365-5674     ISSN 2365-5682 (electronic)
Springer Series in the Data Sciences
ISBN 978-3-030-03598-3     ISBN 978-3-030-03599-0 (eBook)
https://doi.org/10.1007/978-3-030-03599-0

Library of Congress Control Number: 2018962869

Mathematics Subject Classification: 05C85, 05C82, 91D30

© Springer Nature Switzerland AG 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife, Xi
my parents, Qiyuan and Yumei
Lijun Chang

To my wife, Michelle
my parents, Hanmin and Yaping
Lu Qin
Preface

The graph model has been widely used to represent the relationships among entities in
a wide spectrum of applications such as social networks, communication networks,
collaboration networks, information networks, and biological networks. As a result,
we are nowadays facing a tremendous number of large real-world graphs. For exam-
ple, SNAP [49] and Network Repository [71] are two representative graph reposi-
tories hosting thousands of real graphs. The availability of rich graph data not only
brings great opportunities for realizing the big value of data to serve key applications
but also brings great challenges in computation.
The main purpose of this book is to survey the recent technical developments
on efficiently processing large sparse graphs, in view of the fact that real graphs
are usually sparse. Algorithms designed for large sparse graphs should be
analyzed with respect to the number of edges in a graph and, ideally, should run
in time linear or near-linear in the number of edges. In this book, we illustrate
the general techniques and principles, toward efficiently processing large sparse
graphs with millions of vertices and billions of edges, through the problems of co-
hesive subgraph computation. Although real graphs are sparsely connected from a
global point of view, they usually contain subgraphs that are locally densely con-
nected [11]. Computing cohesive/dense subgraphs can either be the main goal of a
graph analysis task or act as a preprocessing step aiming to reduce/trim the graph
by removing sparse/unimportant parts such that more complex and time-consuming
analysis can be conducted. In the literature, the cohesiveness of a subgraph is usu-
ally measured by the minimum degree, the average degree, or their higher-order
variants, or edge connectivity. Cohesive subgraph computation based on different
cohesiveness measures extracts cohesive subgraphs with different properties and
also requires different levels of computational efforts.
The book can be used as an extended survey for people who are interested
in cohesive subgraph computation, as a reference book for a postgraduate
course on the related topics, or as a guide to writing effective C/C++
programs for efficiently processing real graphs with billions of edges. In this book, we
will introduce algorithms in the form of pseudocode, analyze their time and space
complexities, and also discuss their implementations. C/C++ code for all the data
structures and some of the presented algorithms is available at the author's GitHub
website.1

Organization. The book is organized as follows.


In Chapter 1, we present the preliminaries of large sparse graph processing, in-
cluding characteristics of real-world graphs and the representation of large sparse
graphs in main memory. In this chapter, we also briefly introduce the problems of
cohesive subgraph computation over large sparse graphs and their applications.
In Chapter 2, we illustrate three data structures (specifically, linked list-based
linear heap, array-based linear heap, and lazy-update linear heap) that are useful for
algorithm design in the remaining chapters of the book.
In Chapter 3, we focus on minimum degree-based graph decomposition (aka core
decomposition); that is, compute the maximal subgraphs with minimum degree at
least k (called k-core), for all different k values. We present an algorithm to conduct
core decomposition in O(m) time, where m is the number of edges in a graph, and
also discuss h-index-based local algorithms that have higher time complexities but
can be naturally parallelized.
In Chapter 4, we study the problem of computing the subgraph with the max-
imum average degree (aka, densest subgraph). We present a 2-approximation al-
gorithm that has O(m) time complexity, a 2(1 + ε )-approximation streaming algo-
rithm, and also an exact algorithm based on minimum cut.
In Chapter 5, we investigate higher-order variants of the problems studied in
Chapters 3 and 4. As the building blocks of higher-order analysis of graphs are k-
cliques, we first present triangle enumeration algorithms that run in O(α (G) × m)
time and k-clique enumeration algorithms that run in O(k × (α(G))^(k−2) × m) time,
where α(G) is the arboricity of a graph G and satisfies α(G) ≤ √m [20]. Then, we
discuss how to extend the algorithms presented in Chapters 3 and 4 for higher-order
core decomposition (specifically, truss decomposition and nucleus decomposition)
and higher-order densest subgraph computation (specifically, k-clique densest sub-
graph computation), respectively.
In Chapter 6, we discuss edge connectivity-based graph decomposition. Firstly,
given an integer k, we study the problem of computing all maximal k-edge con-
nected subgraphs in a given input graph. We present a graph partition-based ap-
proach to conduct this in O(h × l × m) time, where h and l are usually bounded
by small constants for real-world graphs. Then, we present a divide-and-conquer
approach, which invokes the graph partition-based approach as a building block,
for computing the maximal k-edge connected subgraphs for all different k values in
O((log α (G)) × h × l × m) time.

1 https://github.com/LijunChang/Cohesive_subgraph_book.



Acknowledgments. This book is partially supported by the Australian Research
Council Discovery Early Career Researcher Award (DE150100563).

Sydney, NSW, Australia     Lijun Chang
Sydney, NSW, Australia     Lu Qin
September 2018
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Graph Terminologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Real Graph Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Representation of Large Sparse Graphs . . . . . . . . . . . . . . . . . . 4
1.1.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Cohesive Subgraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Cohesive Subgraph Computation . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Linear Heap Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.1 Linked List-Based Linear Heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Interface of a Linked List-Based Linear Heap . . . . . . . . . . . . . 10
2.1.2 Time Complexity of ListLinearHeap . . . . . . . . . . . . . . . . . 13
2.2 Array-Based Linear Heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Interface of an Array-Based Linear Heap . . . . . . . . . . . . . . . . 15
2.2.2 Time Complexity of ArrayLinearHeap . . . . . . . . . . . . . . . . 18
2.3 Lazy-Update Linear Heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Minimum Degree-Based Core Decomposition . . . . . . . . . . . . . . . . . . . . . 21


3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Degeneracy and Arboricity of a Graph . . . . . . . . . . . . . . . . . . 22
3.2 Linear-Time Core Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 The Peeling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Compute k-Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 Construct Core Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Core Decomposition in Other Environments . . . . . . . . . . . . . . . . . . . . 32
3.3.1 h-index-Based Core Decomposition . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Parallel/Distributed Core Decomposition . . . . . . . . . . . . . . . . 36
3.3.3 I/O-Efficient Core Decomposition . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4 Average Degree-Based Densest Subgraph Computation . . . . . . . . . . . . . 41


4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.1 Properties of Densest Subgraph . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Approximation Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 A 2-Approximation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.2 A Streaming 2(1 + ε )-Approximation Algorithm . . . . . . . . . . 45
4.3 An Exact Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Density Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 The Densest-Exact Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Pruning for Densest Subgraph Computation . . . . . . . . . . . . . . 52
4.4 Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5 Higher-Order Structure-Based Graph Decomposition . . . . . . . . . . . . . . 55


5.1 k-Clique Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 Triangle Enumeration Algorithms . . . . . . . . . . . . . . . . . . . . . . 55
5.1.2 k-Clique Enumeration Algorithms . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Higher-Order Core Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.1 Truss Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Nucleus Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Higher-Order Densest Subgraph Computation . . . . . . . . . . . . . . . . . . . 71
5.3.1 Approximation Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.2 Exact Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6 Edge Connectivity-Based Graph Decomposition . . . . . . . . . . . . . . . . . . . 77


6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Deterministic k-Edge Connected Components Computation . . . . . . . 79
6.2.1 A Graph Partition-Based Framework . . . . . . . . . . . . . . . . . . . . 79
6.2.2 Connectivity-Aware Two-Way Partition . . . . . . . . . . . . . . . . . 80
6.2.3 Connectivity-Aware Multiway Partition . . . . . . . . . . . . . . . . . 85
6.2.4 The KECC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3 Randomized k-Edge Connected Components Computation . . . . . . . . 90
6.4 Edge Connectivity-Based Decomposition . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1 A Bottom-Up Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4.2 A Top-Down Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4.3 A Divide-and-Conquer Approach . . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chapter 1
Introduction

With the rapid development of information technology such as social media, on-
line communities, and mobile communications, huge volumes of digital data are
accumulated with data entities involving complex relationships. These data are usu-
ally modelled as graphs in view of the simple yet strong expressive power of graph
model; that is, entities are represented by vertices and relationships are represented
by edges. Managing and extracting knowledge and insights from large graphs are
highly demanded by many key applications [93], including public health, science,
engineering, business, environment, and more. An availability of rich graph data not
only brings great opportunities for realizing big values of data to serve key appli-
cations but also brings great challenges in computation. This book surveys recent
technical developments on efficiently processing large sparse graphs, where real
graphs are usually sparse graphs.
In this chapter, we firstly present in Section 1.1 the background information in-
cluding graph terminologies, some example real graphs that serve the purpose of
illustrating properties of real graphs as well as the purpose of empirically evaluating
algorithms, and space-effective representation of large sparse graphs in main mem-
ory. Then, in Section 1.2 we briefly introduce the problem of cohesive subgraph
computation and also discuss its applications.

1.1 Background

1.1.1 Graph Terminologies

In this book, we focus on unweighted and undirected graphs and consider only
the interconnection structure (i.e., edges) among vertices of a graph, while ignoring
possible attributes of vertices and edges. That is, we consider the simplest form of a
graph that consists of a set of vertices and a set of edges.


We denote a graph by g or G. For a graph g, we let V (g) and E(g) denote the
set of vertices and the set of edges of g, respectively, and we also represent g by
(V (g), E(g)). We denote the edge between u and v by (u, v), the set of neighbors of
a vertex u in g by:
Ng (u) = {v ∈ V (g) | (u, v) ∈ E(g)},
and the degree of u in g by:
dg (u) = |Ng (u)|.
We denote the minimum vertex degree, the average vertex degree, and the maximum
vertex degree of g by dmin (g), davg (g), and dmax (g), respectively. Given a subset Vs
of vertices of g (i.e., Vs ⊆ V (g)), we use g[Vs ] to denote the subgraph of g induced
by Vs ; that is:
g[Vs ] = (Vs , {(u, v) ∈ E(g) | u ∈ Vs , v ∈ Vs }).
Given a subset of edges of g, Es ⊆ E(g), we use g[Es] to denote the subgraph of g
induced by Es; that is:
g[Es] = (⋃_{(u,v)∈Es} {u, v}, Es).

g[Vs ] is referred to as a vertex-induced subgraph of g, while g[Es ] is referred to as an


edge-induced subgraph of g.
Across the book, we use the notation G either in definitions or to specifically
denote the input graph that we are going to process, while using g to denote a gen-
eral (sub)graph. For the input graph G, we abbreviate V (G) and E(G) as V and E,
respectively; that is, G = (V, E). We also omit the subscript G in other notations,
e.g., d(u) and N(u). We denote the number of vertices and the number of undirected
edges in G by n and m, respectively, which will be used for analyzing the time and
space complexity of algorithms when taking G as the input graph. Without loss of
generality, we assume that G is connected; that is, there is a path between every pair
of vertices. We also assume that m ≥ n for presentation simplicity; note that, for a
connected graph G, it satisfies that m ≥ n − 1.

Fig. 1.1: An example unweighted undirected graph with 11 vertices v1, ..., v11 and 13 edges

Example 1.1. Figure 1.1 shows an example graph G consisting of 11 vertices and
13 undirected edges; that is, n = 11 and m = 13. The set of neighbors of v1 is
N(v1 ) = {v2 , v3 , v4 , v5 }, and the degree of v1 is d(v1 ) = |N(v1 )| = 4. The vertex-
induced subgraph G[{v1 , v2 , v3 , v4 }] is a clique consisting of 4 vertices and 6 undi-
rected edges.

1.1.2 Real Graph Datasets

In this book, we focus on techniques for efficiently processing real graphs that
are obtained from real-life applications. In the following, we introduce several real
graph data repositories as well as present some example real graphs that serve the
purpose of illustrating properties of real graphs and the purpose of empirically eval-
uating algorithms in the remainder of the book.

Real Graph Data Repositories. Several real graph data repositories have been ac-
tively maintained by different research groups, which in total cover thousands of
real graphs. A few example repositories are as follows:
• Stanford Network Analysis Project (SNAP) [49] maintains a collection of more
than 50 large network datasets from tens of thousands of vertices and edges to
tens of millions of vertices and billions of edges. It includes social networks,
web graphs, road networks, internet networks, citation networks, collaboration
networks, and communication networks.
• Laboratory for Web Algorithmics (LAW) [12] hosts a set of large networks with
size up-to 1 billion vertices and tens of billions of edges. The networks of LAW
are mainly web graphs and social networks.
• Network Repository [71] is a large network repository archiving thousands of
graphs with up-to billions of vertices and tens of billions of edges.
• The Koblenz Network Collection (KONECT)1 contains several hundred network
datasets with up-to tens of millions of vertices and billions of edges. The net-
works of KONECT cover many diverse areas such as social networks, hyper-
link networks, authorship networks, physical networks, interaction networks, and
communication networks.

Five Real-World Graphs. We choose five real-world graphs from different do-
mains to show the characteristic of real-world graphs; these graphs will also be used
to demonstrate the performance of algorithms and data structures in the remain-
der of the book. The graphs are as-Skitter, soc-LiveJournal1, twitter-2010,
uk-2005, and it-2004; the first two are downloaded from SNAP, while the remain-
ing three are downloaded from LAW. as-Skitter is an internet topology graph.
soc-LiveJournal1 is an online community, where members maintain journals and
make friends. twitter-2010 is a social network. uk-2005 and it-2004 are two
web graphs crawled within the .uk and .it domains, respectively.
For each graph, we make its edges undirected, remove duplicate edges, and then
choose the largest connected component (i.e., giant component) as the correspond-
ing graph. Statistics of the five graphs are given in Table 1.1, where the last column
shows the degeneracy (see Chapter 3) of G. We can see that davg(G) ≪ n holds for
all these graphs; that is, real-world graphs are usually sparse graphs.

1 http://konect.uni-koblenz.de/.

Graphs            n           m              davg(G)  dmax(G)    δ(G)
as-Skitter        1,694,616   11,094,209     13.09    35,455     111
soc-LiveJournal1  4,843,953   42,845,684     17.69    20,333     372
uk-2005           39,252,879  781,439,892    39.82    1,776,858  588
it-2004           41,290,577  1,027,474,895  49.77    1,326,744  3,224
twitter-2010      41,652,230  1,202,513,046  57.74    2,997,487  2,488
Table 1.1: Statistics of five real graphs (δ (G) is the degeneracy of G)


Fig. 1.2: Degree distributions of (a) as-Skitter, (b) soc-LiveJournal1, (c) it-2004,
and (d) twitter-2010 (x-axis: degree; y-axis: number of vertices; both axes in log scale)

Figure 1.2 shows the degree distributions of four of the graphs. Note that both the
x-axis and the y-axis are in log scale, and the distributions appear as roughly
straight lines; thus, the degree distributions follow power-law distributions. This
demonstrates that real-world graphs are usually power-law graphs.

1.1.3 Representation of Large Sparse Graphs

There are two standard ways to represent a graph in main memory [24]: adjacency
matrix and adjacency list. For a graph with n vertices and m edges, the adjacency
matrix representation consumes θ(n^2) space, while the adjacency list representation
consumes θ(n + m) space. As we are dealing with large sparse graphs containing
tens of millions (or even hundreds of millions) of vertices in this book (see
Table 1.1), the adjacency matrix representation is not feasible for such a large
number of vertices. On the other hand, an adjacency list consumes more space than an
array due to explicitly storing pointers in the adjacency list, and moreover, access-
ing linked lists has the pointer-chasing issue that usually results in random memory
access. Thus, we use a variant of the adjacency list representation, called adjacency
array representation, which is also known as the Compressed Sparse Row (CSR)
representation in the literature [73]. Note that, as we focus on static graphs in this
book, the adjacency array representation is sufficient; however, if the input graph
dynamically grows (i.e., new edges are continuously added), then the adjacency list
representation or other representations may be required.
In the adjacency array representation, an unweighted graph is represented by two
arrays, denoted pstart and edges. It assumes that each of the n vertices of G takes
a distinct id from {0, . . . , n − 1}; note that if this assumption does not hold, then
a mapping from V to {0, . . . , n − 1} can be explicitly constructed. Thus, in the re-
mainder of the book, we also occasionally use i (0 ≤ i ≤ n − 1) to denote a vertex.
The adjacency array representation is to store the set of neighbors of each vertex
consecutively by an array (rather than a linked list as done in the adjacency list rep-
resentation) and then concatenate all such arrays into a single large array edges,
by putting the neighbors of vertex 0 first, followed by the neighbors of vertex 1,
and so on so forth. The start position (i.e., index) of the set of neighbors of vertex i
in the array edges is stored in pstart[i], while pstart[n] stores the length of the
array edges. In this way, the degree of vertex i can be obtained in constant time
as pstart[i + 1] − pstart[i], and the set of neighbors of vertex i is stored con-
secutively in the subarray edges[pstart[i], . . . , pstart[i + 1] − 1]. As a result, the
neighbors of each vertex occupy consecutive space in the main memory, which can
improve the cache hit-rate. Note that this representation also supports the removal
of edges from the graph. That is, we move all the remaining neighbors of vertex i
to be consecutive in edges starting at position pstart[i], and we use another array
pend to explicitly store in pend[i] the last position of the neighbors of i.

pstart: 0 4 8 12 16 19 20 21 23 24 25 26   (offsets for v1, v2, ..., v11; last entry is 2m = 26)
edges:  v2 v3 v4 v5 | v1 v3 v4 v8 | v1 v2 v4 v10 | v1 v2 v3 v11 | v1 v6 v7 | v5 | v5 | v2 v9 | v8 | v3 | v4

Fig. 1.3: Adjacency array graph representation

Figure 1.3 demonstrates the adjacency array representation for the graph in Fig-
ure 1.1. It is easy to see that, by the adjacency array representation, an unweighted
undirected graph with n vertices and m undirected edges can be stored in main
memory by n + 2m + 1 machine words; note that here each undirected edge is stored
twice in the graph representation, once for each direction.
An example C++ code for allocating memory to store the graph G is shown in
Listing 1.1. Here, pstart and edges are two arrays of the data type unsigned int.
For presentation simplicity, we define unsigned int as uint in Listing 1.1, which
will also be used in the data structures presented in Chapter 2. Note that, as the range
of unsigned int in a typical machine nowadays is from 0 to 4,294,967,295, the

Listing 1.1: Graph memory allocation


typedef unsigned int uint;
uint *pstart = new uint[n+1]; // n+1 offsets; pstart[n] holds the length 2*m
uint *edges = new uint[2*m];  // each undirected edge is stored twice

example C++ code in Listing 1.1 can be used to store a graph containing up-to 2
billion undirected edges; for storing larger graphs, the data type of pstart, or even
edges, needs to be changed to long.

1.1.4 Complexity Analysis

In this book, we will provide time complexity analysis for all the presented al-
gorithms by using the big-O notation O(·). Specifically, for two given functions
f(n) and f′(n), f′(n) ∈ O(f(n)) if there exist positive constants c and n0 such that
f′(n) ≤ c × f(n) for all n ≥ n0; note that O(·) denotes a set of functions. Occa-
sionally, we will also use the θ-notation. Specifically, for two given functions f(n)
and f′(n), f′(n) ∈ θ(f(n)) if there exist positive constants c1, c2, and n0 such that
c1 × f(n) ≤ f′(n) ≤ c2 × f(n) for all n ≥ n0.
As we aim to process large graphs with billions of edges in main memory, it is
also important to keep the memory consumption of an algorithm small such that
larger graphs can be processed with the available main memory. Thus, we also an-
alyze the space complexities of algorithms in this book. As the number m of edges
usually is much larger than the number n of vertices for large real-world graphs
(see Table 1.1), we analyze the space complexity in the form of c × m + O(n) by
explicitly specifying the constant c, since c × m usually is the dominating factor.
Recall that the adjacency array representation of a graph in Section 1.1.3 consumes 2m + O(n) memory space. Thus, if an algorithm takes only O(n) extra memory space besides the graph representation, then a graph with 1 billion undirected
edges can be processed on a machine with 16GB of main memory, which is common
for today's commodity machines. Note that a graph with 1 billion undirected edges
takes slightly more than 8GB of main memory to store in the adjacency array
representation.
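The 8GB figure can be reproduced with a back-of-the-envelope computation (an illustrative helper of our own, assuming 4-byte uints as in Listing 1.1): edges stores 2m entries and pstart stores n + 1 entries.

```cpp
#include <cstdint>

// Illustrative estimate (not from the book) of the adjacency array
// representation's footprint: 2m uints for edges plus n+1 uints for
// pstart, at 4 bytes per uint.
std::uint64_t adjacency_array_bytes(std::uint64_t n, std::uint64_t m) {
    return 4 * (2 * m) + 4 * (n + 1);
}
```

For m = 10⁹ and a hypothetical n of 50 million vertices, the result is a little over 8GB, matching the statement above.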

1.2 Cohesive Subgraphs

In this book, we illustrate the general techniques and principles towards efficiently
processing large real-world graphs with millions of vertices and billions of edges,
through the problems of cohesive subgraph computation.

1.2.1 Cohesive Subgraph Computation

A common phenomenon of real-world graphs is that they are usually globally sparse
but locally dense [11]. That is, the entire graph is sparse in terms of having a small
average degree (e.g., on the order of tens), but it contains subgraphs that are cohesive/dense (e.g., containing a large clique of up to thousands of vertices). Thus, it is of
great importance to extract cohesive subgraphs from large sparse graphs.
Given an input graph G, cohesive subgraph computation is either to find all max-
imal subgraphs of G whose cohesiveness values are at least k for all possible k
values, or to find the subgraph of G with the largest cohesiveness value. Here, the
cohesiveness value of a subgraph g is solely determined by the structure of g while
being independent to other parts of G that are not in g, and sometimes is also re-
ferred to as the density of the subgraph; thus, cohesive subgraph sometimes is also
called dense subgraph. In this book, we focus on cohesive subgraph computation
based on the following commonly used cohesiveness measures:
1. Minimum degree (aka, k-core, see Chapter 3); that is, the maximal subgraph
whose minimum degree is at least k, which is called k-core. The problem is either
to compute the k-core for a user-given k or to compute k-cores for all possible k
values [59, 76, 81].
2. Average degree (aka, dense subgraph, see Chapter 4); that is, a subgraph with
average degree at least k. The problem studied usually is to compute the subgraph
with the largest average degree (i.e., densest subgraph) [18, 35].
3. Higher-order Variants of k-core and Densest Subgraph (see Chapter 5); for ex-
ample, the maximal subgraph in which each edge participates in at least k trian-
gles within the subgraph (i.e., k-truss) [22, 92], the subgraph where the average
number of triangles each vertex participates is the largest (i.e., triangle-dense
subgraph) [89].
4. Edge connectivity (aka, k-edge connected components, see Chapter 6); that is,
the maximal subgraphs each of which is k-edge connected. The problem studied
is either to compute the k-edge connected components for a user-given k [3, 17,
102] or to compute k-edge connected components for all possible k values [16,
99].
Besides the above commonly used ones, other cohesiveness measures have also
been defined in the literature [48]. For example, a graph g is a clique if each vertex
is connected to all other vertices [80] (i.e., |E(g)| = |V(g)|(|V(g)| − 1)/2); a graph g is a
γ-quasi clique if at least a γ portion of its vertex pairs are connected by edges (i.e.,
|E(g)| ≥ γ × |V(g)|(|V(g)| − 1)/2) [1]; a graph g is a k-plex if every vertex of g is connected to all but no more than (k − 1) other vertices (i.e., d_g(u) ≥ |V(g)| − k for each
u ∈ V(g)) [82]. Nevertheless, cohesive subgraph computation based on these definitions usually leads to NP-hard problems, which are generally computationally
too expensive to be solved over large graphs [23]. Consequently, we do not consider
these alternative cohesiveness measures in this book.
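For concreteness, the three conditions can be written as small predicates over vertex/edge counts and a degree list (illustrative helpers with hypothetical names, not from the book):

```cpp
#include <vector>
#include <cstdint>

typedef unsigned int uint;

// Clique condition: |E(g)| = |V(g)|(|V(g)|-1)/2, given counts nv and ne.
bool is_clique_count(std::uint64_t nv, std::uint64_t ne) {
    return 2 * ne == nv * (nv - 1);
}

// Gamma-quasi-clique condition: |E(g)| >= gamma * |V(g)|(|V(g)|-1)/2.
bool is_quasi_clique_count(std::uint64_t nv, std::uint64_t ne, double gamma) {
    return 2.0 * static_cast<double>(ne)
           >= gamma * static_cast<double>(nv * (nv - 1));
}

// k-plex condition: every vertex has degree at least |V(g)| - k.
bool is_k_plex_degrees(const std::vector<uint> &deg, uint k) {
    for (uint d : deg)
        if (d + k < deg.size()) return false;
    return true;
}
```

For example, a triangle (3 vertices, 3 edges) satisfies the clique condition, and any complete graph is a 1-plex.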

1.2.2 Applications

Cohesive subgraph computation can either be the main goal of a graph analysis
task or act as a preprocessing step that reduces/trims the graph by removing
sparse/unimportant parts so that more complex and time-consuming analysis can be
conducted. For example, some of the applications of cohesive subgraph computation
are illustrated as follows.

Community Search. Cohesive subgraphs can be naturally regarded as communities [42]. In the literature on community search, which computes the communities for
a given set of query users, there are many models based on cohesive subgraphs, e.g.,
k-core-based community search [10, 84] and k-truss-based community search [41].
Recently, a tutorial was given at ICDE 2017 [42] on cohesive subgraph-based
community search.

Locating Influential Nodes. Cohesive subgraph detection has been used to identify
the entities in a network that act as influential spreaders for propagating information
to a large portion of the network [46, 55]. Analysis on real networks shows that
vertices belonging to the maximal k-truss subgraph for the largest k show good
spreading behavior [55], leading to fast and wide epidemic spreading. Moreover,
vertices belonging to such dense subgraphs dominate the small set of vertices that
achieve the optimal spreading in the network [55].

Keyword Extraction from Text. Recently, the graph-of-words representation has been
shown to be very promising for text mining tasks [91]. That is, each distinct word is represented by a vertex, and there is an edge between two words if they co-occur within
a sliding window. It has been shown in [72, 88] that keywords in the main core (i.e.,
the k-core with the largest k) or main truss (i.e., the k-truss with the largest k) are more
likely to form higher-order n-grams. A tutorial was given at EMNLP 2017 [57] on
cohesive subgraph-based keyword extraction from text.

Link Spam Detection. Dense subgraph detection is a useful primitive for spam
detection [33]. A study in [33] shows that many of the dense bipartite subgraphs in
a web graph are link spam, i.e., websites that attempt to manipulate search engine
rankings through aggressive interlinking to simulate popular content.

Real-Time Story Identification. A graph can be used to represent entities and their
relationships that are mentioned in the texts of an online social medium such as Twitter, where edge weights correspond to the pairwise association strengths of entities.
It is shown in [4] that, given such a graph, a cohesive group of strongly associated
entity pairs usually indicates an important story.
Chapter 2
Linear Heap Data Structures

In this chapter, we present linear heap data structures that will be useful in the re-
mainder of the book for designing algorithms to efficiently process large sparse
graphs. In essence, the linear heap data structures can be used to replace Fibonacci
Heap and Binary Heap [24] to achieve better time complexity as well as practical
performance, when some assumptions are satisfied.
In general, the linear heap data structures store (element, key) pairs, and have
two assumptions regarding element and key, respectively. Firstly, the total number
of distinct elements is denoted by n, and the elements are 0, 1, . . . , n − 1; note that, in
this book, each element usually corresponds to a vertex of G. Secondly, let key_cap
be the upper bound of the values of key; then the possible values of key are integers
in the range [0, key_cap]; note that, in this book, key_cap is usually bounded by n.
By utilizing the above two assumptions, the linear heap data structures can support
updating the key of an element in constant time and also support retrieving/remov-
ing the element with the minimum (or maximum) key in amortized constant time.
Recall that the famous Fibonacci Heap is constant-time updatable but has a loga-
rithmic popping cost [24].

2.1 Linked List-Based Linear Heap

The linked list-based linear heap organizes all elements with the same key value by
a doubly linked list [17]; that is, there is one doubly linked list for each distinct key
value. Moreover, the heads (i.e., the first elements) of all such doubly linked lists
are stored in an array heads, such that the doubly linked list for a specific key value
key can be retrieved in constant time from heads[key]. For memory efficiency, the
information of doubly linked lists are stored in two arrays, pres and nexts, such
that the elements that precede and follow element i in the doubly linked list of i are
pres[i] and nexts[i], respectively. In addition, there is an array keys for storing
the key values of elements; that is, the key value of element i is keys[i].

© Springer Nature Switzerland AG 2018
L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0_2

(a) Conceptual view (one doubly linked list per key value):
    key 4: v4 ↔ v3 ↔ v2 ↔ v1
    key 3: v5
    key 2: v8
    key 1: v11 ↔ v10 ↔ v9 ↔ v7 ↔ v6

(b) Actual storage:
    key      0   1    2   3   4   5   6   7   8   9   10
    heads    -   v11  v8  v5  v4  -   -   -   -   -   -

    element  v1  v2  v3  v4  v5  v6  v7  v8  v9  v10  v11
    keys     4   4   4   4   3   1   1   2   1   1    1
    pres     v2  v3  v4  -   -   v7  v9  -   v10 v11  -
    nexts    -   v1  v2  v3  -   -   v6  -   v7  v9   v10

Fig. 2.1: An example of linked list-based linear heap

For example, Figure 2.1 illustrates the linked list-based linear heap constructed
for the vertices in Figure 1.1, where the key value of a vertex is its degree. Figure 2.1(a)
shows the conceptual view in the form of adjacency lists, while Figure 2.1(b) shows
the actual data stored for the data structure; here, n = 11 and key_cap = 10. Note
that dashed arrows in Figure 2.1 are just for illustration purposes and are not actually
stored. Each index in the array heads corresponds to a key value, while each index
in the arrays pres, nexts, and keys corresponds to an element; note that the same
index in the arrays pres, nexts, and keys corresponds to the same element.

2.1.1 Interface of a Linked List-Based Linear Heap

The interface of the linked list-based linear heap, denoted ListLinearHeap, is
given in Listing 2.1, while the full C++ code is available online as a header file.1
ListLinearHeap has member variables n, storing the total number of possible distinct elements, key_cap, storing the maximum allowed key value, and four arrays
keys, pres, nexts, and heads for representing the doubly linked lists. In addition,
ListLinearHeap also stores max_key — an upper bound of the current maximum
key value — and min_key — a lower bound of the current minimum key value
— that will be used to retrieve the element with the maximum key value and the
element with the minimum key value.
In ListLinearHeap, the elements range over {0, . . . , n − 1}, while the key values range over {0, . . . , key_cap}. The arrays keys, pres, and nexts are of size
n, and the array heads is of size key_cap + 1. Thus, the space complexity of
ListLinearHeap is 3n + key_cap + O(1).
Note that, to use ListLinearHeap after constructing it, the member function
init needs to be invoked first to properly initialize the member variables before
invoking any other member functions.
1 https://github.com/LijunChang/Cohesive_subgraph_book/blob/master/data_structures/ListLinearHeap.h

Listing 2.1: Interface of a linked list-based linear heap


class ListLinearHeap {
private:
    uint n;        // total number of possible distinct elements
    uint key_cap;  // maximum allowed key value
    uint max_key;  // upper bound of the current maximum key value
    uint min_key;  // lower bound of the current minimum key value
    uint *keys;    // key values of elements
    uint *heads;   // the first element in a doubly linked list
    uint *pres;    // previous element in a doubly linked list
    uint *nexts;   // next element in a doubly linked list

public:
    ListLinearHeap(uint _n, uint _key_cap);
    ~ListLinearHeap();
    void init(uint _n, uint _key_cap, uint *_elems, uint *_keys);
    void insert(uint element, uint key);
    uint remove(uint element);
    uint get_n() { return n; }
    uint get_key_cap() { return key_cap; }
    uint get_key(uint element) { return keys[element]; }
    uint increment(uint element, uint inc);
    uint decrement(uint element, uint dec);
    bool get_max(uint &element, uint &key);
    bool pop_max(uint &element, uint &key);
    bool get_min(uint &element, uint &key);
    bool pop_min(uint &element, uint &key);
};

Algorithm 1: init(_n, _key_cap, _elems, _keys)

/* Initialize max_key, min_key and heads */
1 max_key ← 0; min_key ← _key_cap;
2 for key ← 0 to _key_cap do
3     heads[key] ← null;
/* Insert (element, key) pairs into the data structure */
4 for i ← 0 to _n − 1 do
5     insert(_elems[i], _keys[i]);

Initialize a Linked List-Based Linear Heap (init). The main task of init is to:
(1) allocate proper memory space for the data structure, (2) assign proper initial values to max_key, min_key, and heads, and (3) insert the (element, key) pairs, supplied in the input parameters of init, into the data structure. The pseudocode of init
is shown in Algorithm 1, where memory allocation is omitted. Specifically, max_key
is initialized as 0, min_key is initialized as _key_cap, and heads[key] is initialized

as non-exist (denoted by null) for each distinct key value. Each (element,key) pair
is inserted into the data structure by invoking the member function insert.

Algorithm 2: insert(element, key)

/* Update doubly linked list */
1 keys[element] ← key; pres[element] ← null; nexts[element] ← heads[key];
2 if heads[key] ≠ null then pres[heads[key]] ← element;
3 heads[key] ← element;
/* Update min_key and max_key */
4 if key < min_key then min_key ← key;
5 if key > max_key then max_key ← key;

Insert/Remove an Element into/from a Linked List-Based Linear Heap. The
pseudocode of inserting an (element, key) pair into the data structure is shown in
Algorithm 2, which puts element at the beginning of the doubly linked list pointed
to by heads[key]. Note that, after inserting (element, key) into the data structure, the
values of max_key and min_key may also be updated.
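As a concrete illustration of Algorithm 2, the following is a minimal, self-contained C++ sketch of the insertion step; the struct MiniListHeap and the sentinel NIL (standing in for null) are our own illustrative names, not the book's actual ListLinearHeap code.

```cpp
#include <vector>

typedef unsigned int uint;
const uint NIL = 0xFFFFFFFFu;  // stands in for "null"

// Illustrative subset of ListLinearHeap covering Algorithm 2 only.
struct MiniListHeap {
    uint key_cap, max_key, min_key;
    std::vector<uint> keys, heads, pres, nexts;
    MiniListHeap(uint n, uint kc)
        : key_cap(kc), max_key(0), min_key(kc),
          keys(n), heads(kc + 1, NIL), pres(n), nexts(n) {}
    void insert(uint element, uint key) {
        // Put element at the front of the list for `key` (lines 1-3).
        keys[element] = key; pres[element] = NIL; nexts[element] = heads[key];
        if (heads[key] != NIL) pres[heads[key]] = element;
        heads[key] = element;
        // Maintain the min/max bounds (lines 4-5).
        if (key < min_key) min_key = key;
        if (key > max_key) max_key = key;
    }
};
```

Inserting two elements with the same key chains them in reverse order of insertion, exactly as in Figure 2.1.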

Algorithm 3: remove(element)
1 if pres[element] = null then
    /* element is at the beginning of a doubly linked list */
2     heads[keys[element]] ← nexts[element];
3     if nexts[element] ≠ null then pres[nexts[element]] ← null;
4 else
5     nexts[pres[element]] ← nexts[element];
6     if nexts[element] ≠ null then pres[nexts[element]] ← pres[element];
7 return keys[element];

To remove an element from the data structure, the doubly linked list containing element is updated by adding a direct link between the immediate preceding
element pres[element] and the immediate succeeding element nexts[element] of
element. The pseudocode is given in Algorithm 3, which returns the key value of
the removed element. Note that if element is at the beginning of a doubly linked list,
then heads[keys[element]] also needs to be updated.
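The linked-list surgery of Algorithm 3 can be sketched over plain arrays as follows; remove_elem and NIL are hypothetical names for illustration, assuming the four arrays of Listing 2.1.

```cpp
#include <vector>

typedef unsigned int uint;
const uint NIL = 0xFFFFFFFFu;  // stands in for "null"

// Illustrative sketch of Algorithm 3 (not the book's code): unlink
// `element` from its doubly linked list and return its key value.
uint remove_elem(std::vector<uint> &heads, std::vector<uint> &pres,
                 std::vector<uint> &nexts, std::vector<uint> &keys,
                 uint element) {
    if (pres[element] == NIL) {
        // element heads its list: advance the head pointer.
        heads[keys[element]] = nexts[element];
        if (nexts[element] != NIL) pres[nexts[element]] = NIL;
    } else {
        // bridge the predecessor and successor of element.
        nexts[pres[element]] = nexts[element];
        if (nexts[element] != NIL) pres[nexts[element]] = pres[element];
    }
    return keys[element];
}
```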

Algorithm 4: decrement(element, dec)


1 key ← remove(element);
2 key ← key − dec;
3 insert(element, key);
4 return key;

Update the Key Value of an Element. To update the key value of an element,
the element is first removed from the doubly linked list corresponding to the key
value keys[element] and is then inserted into the doubly linked list corresponding
to the updated key value. As a result, the key value of element in the data structure is
updated; moreover, min_key and max_key may also be updated by the new key
value of element. The pseudocode of decrement is shown in Algorithm 4, which
returns the updated key value of element. The pseudocode of increment is similar
and is omitted.

Algorithm 5: pop_min(element, key)

1 while min_key ≤ max_key and heads[min_key] = null do
2     min_key ← min_key + 1;
3 if min_key > max_key then
4     return false;
5 else
6     element ← heads[min_key]; key ← min_key;
7     remove(element);
8     return true;

Pop/Get Min/Max from a Linked List-Based Linear Heap. To pop the element
with the minimum key value from the data structure, the value of min_key is first
updated to be the smallest value such that heads[min_key] ≠ null. If such a min_key
exists, then all elements in the doubly linked list pointed to by heads[min_key] have
the same minimum key value, and the first element is removed from the data structure and is returned. Otherwise, min_key is updated to be larger than max_key,
which means that the data structure currently contains no element. The pseudocode
is shown in Algorithm 5. The pseudocodes of get_min, pop_max, and get_max are
similar and are omitted. Note that, during the execution of the data structure, the
value of min_key is guaranteed to be a lower bound of the key values of all elements in the data structure. Thus, Algorithm 5 correctly obtains the element with
the minimum key value.

2.1.2 Time Complexity of ListLinearHeap

The time complexities of the member functions of ListLinearHeap are as follows.
Firstly, the initialization in Algorithm 1 takes O(n + key_cap) time if memory allocation is invoked and takes O(_n + _key_cap) time otherwise. Secondly, each of the
remaining member functions of ListLinearHeap, other than get_min, pop_min,
get_max, and pop_max, runs in constant time. The difficult part is the four member
functions for popping/getting the element with the minimum/maximum key value.
Let _key_cap, which is given as an input to init, be the maximum key value that is
allowed during the execution of the member functions of ListLinearHeap. In the
worst case, an invocation of one of these four member functions takes O(_key_cap)
time. Nevertheless, a better time complexity is possible, as proved in the theorem
below.
Theorem 2.1. After the initialization by init, a sequence of x decrement(id, 1),
increment(id, 1), get_min, pop_min, get_max, pop_max, and remove operations
takes O(_key_cap + x) time. Note that: (1) decrement and increment are only
allowed to change the key value of an element by 1, and (2) insert is not allowed.

Proof. Firstly, as discussed above, each of the member functions decrement,
increment, and remove takes constant time. Secondly, pop_min and pop_max
are simply get_min and get_max, respectively, followed by remove. We prove in
the following that a sequence of x decrement(id, 1), increment, get_min, and
remove operations takes O(_key_cap + x) time. It is worth mentioning that here
the restriction of incrementing only by 1 for increment is removed.
The most time-consuming part of get_min (similar to Algorithm 5) is updating
min_key, while the other parts can be conducted in constant time. Let t⁻ denote the
number of times that min_key is decreased; note that min_key is only decreased
in decrement, and by 1 each time. Thus, t⁻ equals the number of invocations of
decrement(id, 1), and t⁻ ≤ x. Similarly, let t⁺ denote the number of times that
min_key is increased; note that min_key is increased only in get_min, but not in
decrement, increment, or remove. Then, the time complexity of a sequence of x
decrement(id, 1), increment, get_min, and remove operations is O(x + t⁻ + t⁺).
It can be verified that t⁺ ≤ t⁻ + _key_cap. Therefore, the above time complexity is
O(_key_cap + x).
The general statement in this theorem can be proved similarly. □
Following Theorem 2.1 and assuming _key_cap = O(_n), the amortized cost
of one invocation of init followed by a sequence of (≥ n) decrement(id, 1),
increment(id, 1), get_min, pop_min, get_max, pop_max, and remove operations
is constant per operation. Thus, a set of n elements that are given as input to init
can be sorted in O(_key_cap + n) time, as shown in the following lemma, which
actually is similar to the idea of counting sort [24].

Lemma 2.1. A set of n elements that are given as input to init can be
sorted in non-decreasing key value order (or non-increasing key value order) in
O(_key_cap + n) time.

Proof. To sort in non-decreasing key value order, we invoke pop_min n times after
initializing by init. The time complexity follows from Theorem 2.1. □
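The sorting idea behind Lemma 2.1 can also be sketched directly with buckets (an illustrative helper of our own, not the book's code): each element is dropped into the bucket of its key, and the buckets are scanned in increasing key order, for O(key_cap + n) total time.

```cpp
#include <vector>

typedef unsigned int uint;

// Illustrative bucket sort mirroring Lemma 2.1: sort elements 0..n-1
// in non-decreasing key order, assuming keys lie in [0, key_cap].
std::vector<uint> sort_by_key(const std::vector<uint> &keys, uint key_cap) {
    std::vector<std::vector<uint>> bucket(key_cap + 1);
    for (uint e = 0; e < keys.size(); ++e)
        bucket[keys[e]].push_back(e);            // O(n) distribution
    std::vector<uint> order;
    for (uint k = 0; k <= key_cap; ++k)          // O(key_cap) bucket scans
        for (uint e : bucket[k]) order.push_back(e);
    return order;
}
```

This is, for instance, how vertices can be ordered by degree in linear time, a step used repeatedly in later chapters.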

2.2 Array-Based Linear Heap

In ListLinearHeap, the doubly linked lists are represented by explicitly storing
the immediate preceding and the immediate succeeding elements of an element i
in pres[i] and nexts[i], respectively. Alternatively, all elements in the same doubly
linked list can be stored consecutively in an array, in the same way as the adjacency
array graph representation discussed in Section 1.1.3. This strategy is used in the
array-based linear heap in [7]. Specifically, all elements with the same key value
are stored consecutively in an array ids, such that elements with key value 0 are
put at the beginning of ids and are followed by elements with key value 1 and so
forth. The start position of the elements for each distinct key value is stored in an
array heads. Thus, heads and ids resemble pstart and edges of the adjacency
array graph representation, respectively. In addition, the key values of elements are
stored in an array keys in the same way as the linked list-based linear heap, and the
positions of elements in ids are stored in an array rids (i.e., rids[ids[i]] = i).

    key      0   1   2   3   4   5   6   7   8   9   10
    heads    0   0   5   6   7   11  11  11  11  11  11

    position 0   1   2   3   4   5   6   7   8   9   10
    ids      v6  v7  v9  v10 v11 v8  v5  v1  v2  v3  v4

    element  v1  v2  v3  v4  v5  v6  v7  v8  v9  v10 v11
    rids     7   8   9   10  6   0   1   5   2   3   4
    keys     4   4   4   4   3   1   1   2   1   1   1

Fig. 2.2: An example of array-based linear heap

For example, Figure 2.2 demonstrates the array-based linear heap constructed for
the vertices in Figure 1.1, where the key value of a vertex is its degree. That is, Figure 2.2
shows the actual data of the adjacency array-based representation for the doubly
linked lists in Figure 2.1. Note that dashed arrows in Figure 2.2 are for illustration
purposes and are not actually stored. Each index in the array heads corresponds to a
key value, and each index in the arrays rids and keys corresponds to an element,
while the indexes of the array ids correspond to neither.

2.2.1 Interface of an Array-Based Linear Heap

The interface of the array-based linear heap, denoted ArrayLinearHeap, is given
in Listing 2.2, while the full C++ code is available online as a header file.2 Similar to ListLinearHeap, ArrayLinearHeap has member variables n, key_cap,
max_key, min_key, heads, keys, ids, and rids. ArrayLinearHeap has similar
but more restricted member functions compared to ListLinearHeap; for example, insert

2 https://github.com/LijunChang/Cohesive_subgraph_book/blob/master/data_structures/ArrayLinearHeap.h

Listing 2.2: Interface of an array-based linear heap


class ArrayLinearHeap {
private:
    uint n;        // total number of possible distinct elements
    uint key_cap;  // maximum allowed key value
    uint max_key;  // upper bound of the current maximum key value
    uint min_key;  // lower bound of the current minimum key value
    uint *keys;    // key values of elements
    uint *heads;   // start position of elements with a specific key
    uint *ids;     // element ids
    uint *rids;    // reverse of ids, i.e., rids[ids[i]] = i

public:
    ArrayLinearHeap(uint _n, uint _key_cap);
    ~ArrayLinearHeap();
    void init(uint _n, uint _key_cap, uint *_ids, uint *_keys);
    uint get_n() { return n; }
    uint get_key_cap() { return key_cap; }
    uint get_key(uint element) { return keys[element]; }
    void increment(uint element);
    void decrement(uint element);
    bool get_max(uint &element, uint &key);
    bool pop_max(uint &element, uint &key);
    bool get_min(uint &element, uint &key);
    bool pop_min(uint &element, uint &key);
};

and remove are disabled, and increment and decrement are only allowed to update the key value of an element by 1.
The space complexity of ArrayLinearHeap is 3n + key_cap + O(1), the same
as that of ListLinearHeap. Also, to use ArrayLinearHeap after constructing it,
the member function init needs to be invoked first to properly initialize the member
variables before invoking any other member functions.
Initialize an Array-Based Linear Heap (init). The pseudocode of init is shown
in Algorithm 6, where memory allocation is omitted. It needs to sort all elements
in ids in non-decreasing key value order, which is conducted by the counting
sort [24]. After initializing by init, the set of elements with key value key is stored
consecutively in the subarray ids[heads[key], . . . , heads[key + 1] − 1].
Update the Key Value of an Element. In ArrayLinearHeap, the key value of an
element is only allowed to be updated by 1. The general idea of decrement is as follows. Let key be the key value of element before updating. Then, the goal is to move
element from the subarray ids[heads[key], . . . , heads[key + 1] − 1] to the subarray
ids[heads[key − 1], . . . , heads[key] − 1]. To do so, element is first moved to (by
swapping with the element at) position heads[key], where the original position of element in
ids is located by rids[element]. Then, the start position of elements in ids with

Algorithm 6: init(_n, _key_cap, _ids, _keys)

/* Initialize max_key, min_key, and keys */
1 max_key ← 0; min_key ← _key_cap;
2 for i ← 0 to _n − 1 do
3     keys[_ids[i]] ← _keys[i];
4     if _keys[i] > max_key then max_key ← _keys[i];
5     if _keys[i] < min_key then min_key ← _keys[i];
/* Initialize ids, rids */
6 Create an array cnt of size max_key + 1, with all entries 0;
7 for i ← 0 to _n − 1 do cnt[_keys[i]] ← cnt[_keys[i]] + 1;
8 for i ← 1 to max_key do cnt[i] ← cnt[i] + cnt[i − 1];
9 for i ← 0 to _n − 1 do
10     cnt[_keys[i]] ← cnt[_keys[i]] − 1;
11     rids[_ids[i]] ← cnt[_keys[i]];
12 for i ← 0 to _n − 1 do ids[rids[_ids[i]]] ← _ids[i];
/* Initialize heads */
13 heads[min_key] ← 0;
14 for key ← min_key + 1 to max_key + 1 do
15     heads[key] ← heads[key − 1];
16     while heads[key] < _n and keys[ids[heads[key]]] < key do
17         heads[key] ← heads[key] + 1;

Algorithm 7: decrement(element)
1 key ← keys[element];
2 if heads[key] ≠ rids[element] then
3     Swap the content of ids at positions heads[key] and rids[element];
4     rids[ids[rids[element]]] ← rids[element];
5     rids[ids[heads[key]]] ← heads[key];
6 if min_key = key then
7     min_key ← min_key − 1;
8     heads[min_key] ← heads[min_key + 1];
9 heads[key] ← heads[key] + 1; keys[element] ← keys[element] − 1;
10 return keys[element];

key value key (i.e., heads[key]) is increased by one; consequently, element is now at
the end of the subarray ids[heads[key − 1], . . . , heads[key] − 1]. Note that min_key
may also be updated in decrement. The pseudocode of decrement is shown in
Algorithm 7, which returns the updated key value of element. The pseudocode of
increment is similar and is omitted.
For example, to decrement the key value of v4 by 1 in the data structure of
Figure 2.2, v4 is first swapped with v1 in the array ids, and then heads[4] is
increased by 1 to become 8; note that rids and keys are updated accordingly.
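The swap-and-shrink step of Algorithm 7 can be sketched as a free function over the four arrays of Listing 2.2 (our own illustrative sketch; the min_key bookkeeping of lines 6-8 is omitted here).

```cpp
#include <vector>
#include <utility>

typedef unsigned int uint;

// Illustrative sketch (not the book's code) of the core of Algorithm 7:
// move `element` to the front of its bucket, then shrink the bucket by
// advancing heads[key], so element ends bucket key-1.
uint decrement(std::vector<uint> &ids, std::vector<uint> &rids,
               std::vector<uint> &heads, std::vector<uint> &keys,
               uint element) {
    uint key = keys[element];
    if (heads[key] != rids[element]) {
        // Swap element with whatever sits at the bucket's start.
        std::swap(ids[rids[element]], ids[heads[key]]);
        rids[ids[rids[element]]] = rids[element];  // fix the displaced element
        rids[ids[heads[key]]] = heads[key];        // fix element itself
    }
    heads[key] += 1;          // bucket `key` loses its first slot
    keys[element] -= 1;       // element's key drops by one
    return keys[element];
}
```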

Pop/Get Min/Max from an Array-Based Linear Heap. This is similar to that of
ListLinearHeap, and we omit the details.

2.2.2 Time Complexity of ArrayLinearHeap

The time complexities of the member functions of ArrayLinearHeap are the
same as those of ListLinearHeap. Note that the counting sort in init runs in
O(_n + _key_cap) time. Moreover, similar to the proof of Theorem 2.1, the following theorem can be proved for ArrayLinearHeap.

Theorem 2.2. After the initialization by init, a sequence of x decrement,
increment, get_min, pop_min, get_max, and pop_max operations takes
O(_key_cap + x) time.

Note that ArrayLinearHeap is more restricted than ListLinearHeap, and it has
performance similar to ListLinearHeap in practice (see Chapter 3).

2.3 Lazy-Update Linear Heap

The practical performance of ListLinearHeap can be improved by using a lazy
updating strategy, if only get_max, pop_max, and decrement are allowed (or, similarly, only get_min, pop_min, and increment are allowed) [15]. For presentation
simplicity, we present the data structure for the former case in this subsection. The
interface of the lazy-update linear heap, denoted LazyLinearHeap, is given in Listing 2.3, while the full C++ code is available online as a header file.3
Most parts of LazyLinearHeap are similar to those of ListLinearHeap. There
are two major differences [15]. Firstly, LazyLinearHeap is stored in three rather
than four arrays. That is, each adjacency list in LazyLinearHeap can be represented by a singly linked list rather than a doubly linked list, since no arbitrary deletion of an element from LazyLinearHeap is allowed. Thus, pres is not
needed in LazyLinearHeap, and the space complexity of LazyLinearHeap becomes 2n + key_cap + O(1). Secondly, LazyLinearHeap does not eagerly move
an element into the proper adjacency list, as done in ListLinearHeap, when its
key value is decremented. Specifically, decrement merely does the job of updating
keys[element] (see Listing 2.3). Thus, for elements in the singly linked list pointed
to by heads[key], their key values can be exactly key but can also be smaller than key.
To obtain the element with the maximum key value from LazyLinearHeap, each
element is checked, and moved to its proper list if necessary, when it is about to be
chosen as the element with the maximum key value. The pseudocode of pop_max is
shown in Algorithm 8. It iteratively retrieves and removes the first element in the
singly linked list corresponding to the maximum key value max_key. If the key value
of element equals max_key, then it indeed has the maximum key value and is returned.
Otherwise, the key value of element must be smaller than max_key, and thus element
is inserted into the proper singly linked list.
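Putting the lazy strategy together, a minimal illustrative subset of LazyLinearHeap (our own sketch, with MiniLazyHeap and NIL as hypothetical names standing in for the book's class and null) might look like:

```cpp
#include <vector>

typedef unsigned int uint;
const uint NIL = 0xFFFFFFFFu;  // stands in for "null"

// Illustrative subset of LazyLinearHeap: singly linked lists only,
// lazy decrement, and the pop_max of Algorithm 8.
struct MiniLazyHeap {
    uint max_key;
    std::vector<uint> keys, heads, nexts;
    MiniLazyHeap(uint n, uint key_cap)
        : max_key(0), keys(n), heads(key_cap + 1, NIL), nexts(n) {}
    void insert(uint e, uint key) {          // used during initialization
        keys[e] = key;
        nexts[e] = heads[key]; heads[key] = e;
        if (key > max_key) max_key = key;
    }
    uint decrement(uint e, uint dec) {       // lazy: lists are not touched
        return keys[e] -= dec;
    }
    bool pop_max(uint &e, uint &key) {       // Algorithm 8
        while (true) {
            while (max_key > 0 && heads[max_key] == NIL) --max_key;
            if (heads[max_key] == NIL) return false;
            e = heads[max_key];
            heads[max_key] = nexts[e];       // remove e from the list
            if (keys[e] == max_key) { key = max_key; return true; }
            // Stale entry: reinsert e into its proper (lower) list.
            nexts[e] = heads[keys[e]]; heads[keys[e]] = e;
        }
    }
};
```

Note how a stale element discovered at the top is simply pushed down to its current key's list rather than being repaired at decrement time.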
3 https://github.com/LijunChang/Cohesive_subgraph_book/blob/master/data_structures/LazyLinearHeap.h

Listing 2.3: Interface of a lazy-update linear heap


class LazyLinearHeap {
private:
    uint n;        // total number of possible distinct elements
    uint key_cap;  // maximum allowed key value
    uint max_key;  // upper bound of the current maximum key value
    uint *keys;    // key values of elements
    uint *heads;   // the first element in a singly linked list
    uint *nexts;   // next element in a singly linked list

public:
    LazyLinearHeap(uint _n, uint _key_cap);
    ~LazyLinearHeap();
    void init(uint _n, uint _key_cap, uint *_elems, uint *_keys);
    uint get_n() { return n; }
    uint get_key_cap() { return key_cap; }
    uint get_key(uint element) { return keys[element]; }
    uint decrement(uint element, uint dec) {
        return keys[element] -= dec;
    }
    bool get_max(uint &element, uint &key);
    bool pop_max(uint &element, uint &key);
};

Algorithm 8: pop_max(element, key)

1 while true do
2     while max_key > 0 and heads[max_key] = null do
3         max_key ← max_key − 1;
4     if heads[max_key] = null then
5         return false;
6     element ← heads[max_key];
7     heads[max_key] ← nexts[element]; /* Remove element */
8     if keys[element] = max_key then
        /* element has the maximum key value */
9         key ← max_key;
10        return true;
11    else
        /* Insert element into the proper singly linked list */
12        nexts[element] ← heads[keys[element]];
13        heads[keys[element]] ← element;

Analysis. The efficiency of LazyLinearHeap compared with ListLinearHeap is
proved in the following lemma.
MP25470.
What happens when you go to the hospital.
MP25331.
What’s new in gonorrhea.
MP24796.
What’s your I. Q.
R567067.
R570213.
Wheel.
LP43534.
When the bough breaks.
LF149.
When the wind blows.
LP43015.
Where is my wandering mother tonight.
LP43509.
Where’s my little lame stray.
LP43315.
Where’s Tommy.
MP24905.
Where the lilies bloom.
LP43371.
Where the wild things are.
LP42984.
Where today’s cats came from.
MP25178.
Where we stand in Cambodia.
MP25090.
White, Carol Elizabeth Brand Gwynn.
MU8991.
Whitefield.
MP25065.
White House family in the United States of America.
MU8991.
White knight.
LP43043.
Whittemore, L. H.
LP43267.
Who are you, Arthur Kolinski.
LP43374.
Who’ll cry for my baby.
LP43439.
Who saw him die.
LP43156.
Who says I can’t ride a rainbow.
LP43350.
Who stole the quiet day.
MP25422.
Why is a crooked letter.
LP43141.
Wicked wolf.
R570609.
Wide open spaces.
R577488.
Wife killer.
LP42995.
Wife wanted.
R577411.
Wilderness: a way of life.
MU8937.
Wild heritage.
MU8950.
Wild kingdom.
MP24855 - MP24859.
MP25437 - MP25440.
Wild West.
R569745.
Wild West chimp.
R572018.
Willard, Emmet E.
MP25273.
William: from Georgia to Harlem.
LP43093.
Williams, Bruce Bayne.
MU9005.
Willie Dynamite.
LP43623.
Wilson, Daniel.
MP24733.
Windjam.
MP24989.
Wind raiders of the Sahara.
MP24831.
Wine is a traitor.
LP43020.
Winger Enterprises, Inc.
LP43267.
Wings of an angel.
LP43003.
Winkler, Irwin.
LP43134.
Winn, William M.
MU8903.
Winning the West.
R572106.
Winter fun.
LP43535.
Winter holiday.
R568019.
Winter Kill.
LP43320.
Wisconsin Regional Primate Research Center.
MP24800.
Witches of Salem: the horror and the hope.
LP43089.
Witch hunt.
LP43595.
With a shout, not a whimper.
LP43464.
Withers, Brian Gary.
LU3664.
Without reservations.
R570310.
With strings attached.
LP43042.
Wizan, Joe.
LP43209.
Wolf, Sidney.
MP25425.
Wolf adaptations for defense.
MP24763.
Wolf and the badger.
MP24764.
Wolfe, Tom.
LP43209.
Wolf hunting pronghorn antelope.
MP24791.
Wolper, David L.
MP25445.
Wolper Productions.
MP24831.
MP25445.
Wolper Productions, Inc.
MP25482.
Woman alive.
MP25413.
Women.
LP43167.
Women, women, women.
MU9011.
Wonderful world of Disney, 1972 - 73.
LP43191 - LP43199.
LP43612.
MP25387.
Wonrac Productions.
LP42983.
Wood (Francis Carter) Inc.
MP24860.
Words of summer.
LP43157.
Wordworks.
MP25058 - MP25061.
Workers depend on each other.
MP25017.
Working heart.
LP43422.
Working set and locality.
MP25163.
World Book Encyclopedia.
LP43189.
World Film Services, Ltd.
LP42962.
World food problem.
MP25409.
World of Charlie Company.
MP25095.
World of concern.
MP24917.
World of darkness.
MP25214.
World of sports.
R567593.
R570079.
R572343.
R577572.
R578420.
World of the black maned lion.
MP25437.
World of work.
MP24832.
MP24833.
World premiere.
LP43101.
Worldwide Church of God, Pasadena.
MP25285 - MP25289.
MP25497 - MP25500.
Woroner Films, Inc.
MP24931.
MP24932.
MP24933.
MP25068.
MP25419.
Writing better business letters.
MP24888.
Writing workshop — secondary.
MP25370.
X
Xerox Corporation.
LP42942 - LP42943.
LP43312 - LP43317.
Xerox Films.
LP42942 - LP42943.
LP43312 - LP43317.
Y
Yachting Magazine.
MP25040.
Yacht Racing Magazine.
MP25040.
Yearling.
R566404.
Year 1200.
MP25088.
Yorkin, Bud.
LP43610.
LP43611.
You and your eyes.
MP24747.
You and your food.
MP24759.
You and your sense of smell and taste.
MP24754.
You and your senses.
MP24758.
You are there.
LP43357 - LP43369.
You can’t just hope they’ll make it.
MP25338.
Young mother.
MP25417.
Young widow.
R572754.
You’re too fat.
MP25483.
You see, I’ve had a life.
MP25418.
Youth and church need each other.
MU8970.
Yugoslavian coastline.
MP25473.
Z
Zaiontz, Michael G.
MP24976.
Zanuck, Richard D.
LP43102.
Zanuck / Brown.
LP43623.
Zardoz.
LP43258.
Ziff Davis Publishing Company.
MP25463 - MP25470.
Ziff Davis Publishing Company, CRM Productions.
MP25358.
Zlateh the goat.
LP43121.
Zoos of Geographic Society.
MP24741.
Zoos of the world.
MP24741.
Zweig, Stefan.
LF127.
Zwer, Joachim D.
MU8903.
MOTION PICTURES
CURRENT REGISTRATIONS

A list of domestic and foreign motion pictures registered during


the period covered by this issue, arranged by registration number.
LF
REGISTRATIONS

LF124.
Bedelia. England. 90 min., sd., b&w, 35 mm. Based on the book by
Vera Caspary. Appl. au.; Isadore Goldsmith. © John Corfield
Productions, Ltd.; 24May46; LF124.

LF125.
Don’t look now. An Anglo-Italian coproduction by Casey
Productions, Ltd. & Eldorado Films, S. R. L. England. 110 min., sd.,
color, 35 mm. From a story by Daphne DuMaurier. © D. L. N.
Ventures Partnership; 12Oct73; LF125.

LF126.
Men of two worlds. John Sutro. England. 107 min., sd., b&w, 16
mm. © Two Cities Films, Ltd.; 9Sep46; LF126.

LF127.
Beware of pity. A Pentagon production. England. 103 min., sd.,
b&w, 16 mm. From the novel by Stefan Zweig. © Two Cities Films,
Ltd.; 22Jul46; LF127.

LF128.
Theirs is the glory. England. 82 min., sd., b&w, 16 mm. ©
Gaumont British Distributors, Ltd. & General Film Distributors,
Ltd.; 14Oct46; LF128.
LF129.
Carnival. A Two Cities film. England. 93 min., sd., b&w, 16 mm. ©
Two Cities Films, Ltd.; 2Dec46; LF129.

LF130.
The History of Mister Polly. England. 96 min., sd., b&w, 35 mm.
The History of Mister Polly, by H. G. Wells. © Two Cities Films, Ltd.;
28Mar49 (in notice: 1948); LF130.

LF131.
The Reluctant widow. A Two Cities film. England. 86 min., sd.,
b&w, 16 mm. From the novel by Georgette Heyer. © Two Cities
Films, Ltd.; 1May50; LF131.

LF132.
Flood tide. A Pentagon production. England. 90 min., sd., b&w, 16
mm. © Aquila Film Productions, Ltd.; 2May49; LF132.

LF133.
Golden Salamander. A Ronald Neame production. England. 97
min., sd., b&w, 16 mm. © Pinewood Films, Ltd.; 3Feb50 (in notice:
1949); LF133.

LF134.
Fools rush in. A Pinewood Films production. England. 82 min.,
sd., b&w, 16 mm. From the play by Kenneth Horne. © Pinewood
Films, Ltd.; 23May49; LF134.

LF135.
Dear Mister Prohack. A Pentagon production. England. 89 min.,
sd., b&w, 35 mm. Adapted from the novel, Mister Prohack, by Arnold

You might also like