
Graph Analytics for Python Developers

Lecture 4

Version 1.0

Table of Contents

Community Detection
Girvan-Newman algorithm
Louvain modularity

Connectivity
Weakly and Strongly Connected Components

Flows
Ford-Fulkerson algorithm

Clustering
K-means clustering
Spectral clustering
Hierarchical clustering

Graph Coloring

Last time, we focused on graph traversals, pathfinding algorithms, and centrality algorithms. We also took a look at a graph structure called a tree and one of its variants, the minimum spanning tree. Of course, many other algorithms and structures fall into the pathfinding category, but there is simply no time to cover them all. In this lesson, we will talk more about graph algorithms and structures connected to community detection, connectivity, flows, clustering, and graph coloring.

Community Detection

The word “community” has entered everyday conversations around the world. But what does community mean? Think about social media such as Facebook or Twitter. As we connect with other people, we end up linked to people who share an interest, a culture, or a social status. All of these “categories” are nothing else but communities!

In graph theory, a community is a subset of nodes that are densely connected to each other and only loosely connected to nodes in other communities of the same graph. Densely connected means that the nodes have many edges between them, while loosely connected means exactly the opposite - the nodes share only a minimal number of edges.

Community detection is important because it lets us partition the network into multiple parts, which we can then analyze separately. There are two types of methods used in community detection: agglomerative methods and divisive methods.

1) Agglomerative methods
Agglomerative methods start with an empty graph that contains only the nodes of the original graph. Edges are then added one by one, from “stronger” to “weaker” edges. Different methods determine which edge is “stronger” and which one is “weaker.”

2) Divisive methods
Divisive methods work the opposite way: they remove edges from the graph one by one, from the “stronger” to the “weaker.”

In this lesson, we will cover the Girvan-Newman algorithm and the Louvain algorithm.

Girvan-Newman algorithm
The Girvan-Newman algorithm discovers communities by iteratively removing edges from the graph based on their edge betweenness centrality. It is a divisive method of community detection: in each iteration, the edge with the highest betweenness centrality is removed first.

The algorithm’s steps are as follows:

1. Calculate the betweenness centrality of all edges in the graph.
2. Remove the edge with the highest betweenness centrality.
3. Recalculate the betweenness centrality and repeat the process until no edges remain.

In this example, you can see how a typical graph looks when edges are assigned weights
based on the number of shortest paths passing through them. We only calculated the
number of undirected shortest paths that pass through an edge to keep things simple. The
edge between nodes A and B has a strength/weight of 1 because we don’t count A → B and B → A as two different paths.

The Girvan-Newman algorithm would remove the edge between nodes C and D because it
has the highest strength. This means that the edge is located between communities.

After removing an edge, recalculate the betweenness centrality for every remaining edge.
In this example, we have come to the point where every edge has the same betweenness
centrality.
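
As a quick, hedged sketch, this is how the algorithm can be run with networkx (here on the library’s built-in karate club graph rather than the small example above):

import networkx as nx
from networkx.algorithms.community import girvan_newman

# A small built-in social network used purely for illustration.
G = nx.karate_club_graph()

# girvan_newman returns an iterator over successively finer partitions;
# each item is a tuple of node sets, one set per detected community.
partitions = girvan_newman(G)
first_partition = next(partitions)  # the partition after the first split
print([sorted(community) for community in first_partition])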

Louvain modularity
Another popular and useful algorithm for community detection is the Louvain algorithm. The Louvain algorithm is based on the modularity measure of a graph, and the idea behind the method is to optimize modularity as the algorithm progresses. Modularity is a scalar value that measures the density of edges inside a community relative to the density of edges outside the community. In other words, it evaluates how much more densely connected the nodes within a community are compared to how connected they would be in a random graph.
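
For reference, the standard (Newman) formulation of modularity is

Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)

where A_{ij} is the adjacency matrix, k_i is the degree of node i, m is the total number of edges, c_i is the community of node i, and \delta(c_i, c_j) equals 1 when the two nodes share a community and 0 otherwise. Values closer to 1 indicate a stronger community structure.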

In the Louvain algorithm, we first find small communities within the graph by optimizing modularity locally over all nodes. After that, each small community is aggregated into a single “node”, and step one is repeated.

The Louvain algorithm goes as follows:

1. Assign each node to its own community.
2. Move nodes between neighboring communities whenever the move increases modularity.
3. Create a new weighted graph with the communities from the previous step as its nodes.
4. Repeat the process until the maximum possible modularity is achieved and there are no further changes in the graph.
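
As a minimal sketch, assuming networkx 2.8 or newer (which ships a Louvain implementation), the algorithm can be run as follows; older setups typically use the separate python-louvain package instead:

import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()

# Detect communities with the Louvain method; the seed makes the run reproducible.
communities = louvain_communities(G, seed=42)
print(communities)

# Modularity of the resulting partition (higher is better).
print(modularity(G, communities))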

Connectivity
Connectivity is one of the basic concepts of graph theory - it asks for the minimum number of elements that need to be removed to disconnect the graph into two or more isolated subgraphs. In a graph, two nodes are connected if there is a path between them. Therefore, a graph is connected if every pair of nodes in the graph is joined by a path.

A directed graph can be connected in two manners - weakly and strongly - which we will talk more about next.

Weakly and Strongly Connected Components


A weakly connected component is a subgraph of a directed graph in which every node is reachable from every other node when the direction of the edges is ignored.

A strongly connected component is a subgraph of a directed graph in which there is a directed path from every node to every other node.

In the picture below, following the definitions we just introduced, the red graph represents a weakly connected component, and the yellow graph represents a strongly connected component.

The Weakly Connected Components (WCC) algorithm, also known as Union Find, is an algorithm that finds sets of connected nodes where each node is reachable from any other node in the same set once edge directions are ignored.

Its counterpart, the Strongly Connected Components (SCC) algorithm, requires such a path to exist in both directions.

Both algorithms use graph traversal algorithms as their basis: WCC can be built on a simple Breadth-First or Depth-First Search, while the classic SCC algorithms (such as Tarjan’s and Kosaraju’s) are based on Depth-First Search.
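
A minimal sketch with networkx, using a small hypothetical directed graph:

import networkx as nx

G = nx.DiGraph()
# A directed cycle A -> B -> C -> A, plus an edge C -> D;
# D can be reached from the cycle but cannot reach back.
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])

# Ignoring edge directions, everything belongs to one weak component.
print(list(nx.weakly_connected_components(G)))    # [{'A', 'B', 'C', 'D'}]

# Respecting directions, D forms its own strong component.
print(list(nx.strongly_connected_components(G)))  # e.g. [{'D'}, {'A', 'B', 'C'}]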

Flows
A flow network is a directed graph in which each edge has a capacity and each edge receives a flow. The amount of flow on an edge cannot be greater than the capacity of that edge. The starting node is usually called a source, and the destination node is called a sink.

Flows and flow analysis are useful to model traffic networks, fluids in pipes, or any capacity-restricted flows.

Edge weights on the graph have a special annotation, x / y, which indicates the flow (x) and the capacity (y).

The most important theorem for flow analysis is the Max-Flow Min-Cut theorem. The theorem says that the maximum flow passing from the source to the sink equals the total capacity of the edges crossing the minimum cut. Cuts have already come up, even though we never used the exact term: a cut is a partition of the nodes into two disjoint subsets. So, we are making cuts when we remove edges to find weakly connected components or perform the Girvan-Newman algorithm.

To find the Maximum Flow, we use the Ford-Fulkerson algorithm.

Ford-Fulkerson algorithm
The Ford-Fulkerson algorithm is very straightforward.
1) Start with an initial flow of 0.
2) While there is a path from the source to the sink with remaining (residual) capacity, find the minimum residual capacity on that path and send that much flow along it.
3) For each edge on the path, add that value to the edge’s flow (and subtract it along the reverse edge in the residual graph), then repeat step 2.
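
A minimal sketch with networkx on a made-up flow network; networkx’s maximum_flow lets us plug in the Edmonds-Karp algorithm, the BFS-based variant of Ford-Fulkerson:

import networkx as nx
from networkx.algorithms.flow import edmonds_karp

# A small hypothetical flow network; the capacities are arbitrary.
G = nx.DiGraph()
G.add_edge("source", "a", capacity=4)
G.add_edge("source", "b", capacity=3)
G.add_edge("a", "b", capacity=2)
G.add_edge("a", "sink", capacity=2)
G.add_edge("b", "sink", capacity=5)

flow_value, flow_dict = nx.maximum_flow(G, "source", "sink", flow_func=edmonds_karp)
print(flow_value)  # the maximum flow from source to sink
print(flow_dict)   # how much flow each edge carries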

Clustering
Clustering is a machine learning technique that involves the grouping of data points. Data
points in the same group should share similar properties and features. In Data Science, we
can use clustering algorithms to gain valuable insights from data by seeing what groups
the data points fall into.

The concept sounds similar to community detection. That’s because it is: different fields of science have given the same concept different names. In the literature, you will find community detection associated more with social networks and clustering with machine learning. This distinction can also be attributed to the data used by each technique: community detection algorithms usually work on graph structures, while clustering algorithms are more commonly applied to vector representations.

In graph theory, the clustering coefficient measures the degree to which nodes in a graph tend to cluster together. In social networks in particular, nodes tend to form tightly connected groups. This measurement is essential because it tells us whether clustering is even possible for the graph.

K-means clustering
K-means is probably the most well-known clustering algorithm. K-means is a distance-based algorithm: we calculate distances to decide which cluster a point is assigned to. The main objective of K-means is to minimize the sum of distances between the points and their respective cluster centers.

The k-means algorithm works as follows:

1. Choose the number of clusters k.
2. Select k random points from the data as the initial centers.
3. Assign all the points to the closest cluster center.
4. Recompute the centers of the newly formed clusters.
5. Repeat steps 3 and 4 until the centers of the newly formed clusters do not change, points remain in the same clusters, or a maximum number of iterations has been reached.
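
A minimal sketch with scikit-learn on a handful of made-up 2-D points:

import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of points, invented for illustration.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final cluster centers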

Spectral clustering
Spectral clustering is a more general technique that can be applied not only to graphs but also to images or any other sort of data. Spectral clustering uses information from the eigenvalues (the spectrum) of matrices derived from the data. In contrast to the k-means approach, two points that are close in distance are not clustered together if they are not connected.

Spectral clustering works as follows:

1. Compute the graph Laplacian (which is just another matrix representation of a graph).
2. For k clusters, compute the first k eigenvectors (those corresponding to the smallest eigenvalues).
3. Stack these eigenvectors as columns to form a matrix.
4. Represent every node by the corresponding row of this new matrix.
5. Use K-means clustering to group these points into k clusters.

Because the final step uses K-means clustering, the resulting clusters are not always the same; they may vary depending on the choice of the initial points.
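
A minimal sketch that applies spectral clustering directly to a graph, using its adjacency matrix as a precomputed affinity matrix for scikit-learn:

import networkx as nx
from sklearn.cluster import SpectralClustering

G = nx.karate_club_graph()
adjacency = nx.to_numpy_array(G)  # dense adjacency matrix of the graph

# affinity="precomputed" tells scikit-learn to treat the matrix as a similarity graph.
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(adjacency)
print(dict(zip(G.nodes(), labels)))  # node -> cluster index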

Hierarchical clustering

Hierarchical cluster analysis (HCA) is a clustering algorithm that involves creating clusters
ordered from top to bottom. For example, all files and folders on our hard disk are
organized hierarchically.

This clustering technique is divided into two types:


1. Agglomerative hierarchical clustering,
2. Divisive hierarchical clustering.

In this lesson, we will take a closer look at agglomerative hierarchical clustering.

Agglomerative hierarchical clustering is the most common type of hierarchical clustering. It uses a “bottom-up” approach: each object starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Agglomerative hierarchical clustering works as follows:

1. Treat each point as a single-point cluster.
2. Take the two closest points and group them into one cluster.
3. Take the two closest clusters and merge them into one cluster.
4. Repeat step 3 until you are left with one single cluster.

The result of this algorithm is a dendrogram - a type of tree that shows the hierarchical relationships between different sets of data. Just by looking at the dendrogram, you can tell how the clusters have been formed.
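
A minimal sketch with SciPy on made-up 2-D points; drawing the dendrogram needs matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Five invented points: two close pairs plus one outlier.
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0]])

Z = linkage(X, method="ward")                    # bottom-up merges with Ward linkage
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)

dendrogram(Z)  # visualize the merge hierarchy
plt.show()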

We won’t go over how divisive hierarchical clustering works in detail; we will just mention the differences between the two approaches. Divisive hierarchical clustering works exactly opposite to the agglomerative approach: it is a “top-down” approach in which we keep partitioning the clusters until every point in the starting cluster has been singled out.

Hierarchical clustering is a useful way to segment data. Not having to pre-define the number of clusters gives it an edge over K-means, but it is more effective on smaller data sets.

Graph Coloring
One of the most popular problems in graph theory is graph coloring. Graph coloring is a special variant of graph labeling. In its simplest form, it is a way of coloring the nodes of a graph so that no two adjacent nodes share the same color. However, graph coloring problems are NP-hard, meaning that no known time-efficient algorithm solves them exactly; it is practically impossible to find the minimal number of colors for graphs with lots of nodes.

The greedy graph coloring algorithm works as follows:

1. Arrange the nodes in some order.
2. Take the first node and color it with the first color.
3. Take the next node and color it with the lowest-numbered color that has not been assigned to any node adjacent to it. If every existing color already appears on an adjacent node, assign a new color.
4. Repeat step 3 until all of the nodes are colored.
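
A minimal sketch of this greedy strategy using networkx’s built-in implementation (which, like any greedy approach, does not guarantee the minimal number of colors):

import networkx as nx

G = nx.cycle_graph(5)  # an odd cycle; its chromatic number is 3

# strategy="largest_first" orders nodes by decreasing degree before coloring.
coloring = nx.coloring.greedy_color(G, strategy="largest_first")
print(coloring)                     # node -> color index
print(len(set(coloring.values())))  # number of colors the greedy run used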
