
Module 6

Analyzing Co-Occurrence
Networks with GraphX
• Network science applies tools from graph theory, the mathematical discipline that studies
the properties of pairwise relationships (called edges) between a set of entities (called
vertices).
• Graph theory is also widely used in computer science to study everything from data
structures to computer architecture to the design of networks like the Internet.
• Graph theory and network science have had a significant impact on business.
• Almost every major internet company derives a significant fraction of its value from its
ability to build and analyze an important network of relationships better than any of its
competitors: the recommendation algorithms used at Amazon and Netflix rely on the
networks of consumer-item purchases (Amazon) and user movie ratings (Netflix) that
each company creates and controls.
• Facebook and LinkedIn have built graphs of relationships between people that they
analyze to organize content feeds, promote advertisements, and broker new connections.
• Google built its search engine around the PageRank algorithm, which ranks pages by analyzing the network of hyperlinks between them.
• The computational and analytical needs of these network-centric companies
helped drive the creation of distributed processing frameworks like
MapReduce as well as the hiring of data scientists who were capable of using
these new tools to analyze and create value from the ever-expanding volume of
data.
• One of the earliest use cases for MapReduce was to create a scalable and
reliable way to solve the equation at the heart of PageRank.
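For reference, the equation at issue is the PageRank update, in which a page's rank is derived from the ranks of the pages that link to it; a standard formulation (with damping factor $d$, typically 0.85, over $N$ pages) is

$$PR(v) = \frac{1 - d}{N} + d \sum_{u \in In(v)} \frac{PR(u)}{|Out(u)|}$$

where $In(v)$ is the set of pages linking to $v$ and $|Out(u)|$ is the number of outbound links on page $u$. Because the ranks must be computed iteratively, pass after pass over the whole link graph, the problem mapped naturally onto MapReduce.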
• Over time, as the graphs became larger and data scientists needed to analyze
them faster, new graph-parallel processing frameworks—like Pregel at
Google, Giraph at Yahoo!, and GraphLab at Carnegie Mellon—were
developed.
• These frameworks supported fault-tolerant, in-memory, iterative, and graph-
centric processing, and were capable of performing certain types of graph
computations orders of magnitude faster than the equivalent data-parallel
MapReduce jobs.
• This module introduces a Spark library called GraphX, which extends Spark to support many of the graph-parallel processing tasks that Pregel, Giraph, and GraphLab support.
• Although it cannot handle every graph computation as quickly as the
custom graph frameworks do, the fact that it is a Spark library means that it
is relatively easy to bring GraphX into your normal data analysis workflow
whenever you want to analyze a network-centric data set.
• With it, you can combine graph-parallel programming with the familiar Spark abstractions you are used to working with.
The MEDLINE Citation Index: A Network
Analysis
• MEDLINE (Medical Literature Analysis and Retrieval System Online) is a database of
academic papers that have been published in journals covering the life sciences and medicine.
• It is managed and released by the US National Library of Medicine (NLM), a division of
the National Institutes of Health (NIH).
• The main database contains more than 20 million articles going back to the early 1950s and is
updated 5 days a week.
• Due to the volume of citations and the frequency of updates, the research community
developed an extensive set of semantic tags, called MeSH (Medical Subject Headings), that
are applied to all of the citations in the index.
• These tags provide a meaningful framework that can be used to explore relationships
between documents to facilitate literature reviews, and they have also been used as the basis
for building data products: in 2001, PubGene demonstrated one of the first production
applications of biomedical text mining by launching a search engine that allowed users to
explore the graph of MeSH terms that connect related documents together.
Parsing XML Documents with Scala’s
XML Library
➢ DescriptorName entries have an attribute called MajorTopicYN that indicates whether or not the MeSH tag was a major topic of the cited article.
We can look up the value of attributes of XML tags using the \ and \\ operators if we preface
the attribute name with an @ symbol. We can use this to create a filter that only returns the
names of the major MeSH tags for each article:
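A minimal sketch of that filter, assuming each citation has already been parsed into a scala.xml.Elem called elem (the helper name majorTopics is ours):

```scala
import scala.xml.Elem

// Return the names of the MeSH tags marked as major topics for one citation.
def majorTopics(elem: Elem): Seq[String] = {
  val dn = elem \\ "DescriptorName"                          // all descriptor entries
  val mt = dn.filter(n => (n \ "@MajorTopicYN").text == "Y") // keep major topics only
  mt.map(n => n.text)                                        // extract the tag names
}
```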
Analyzing the MeSH Major Topics and Their Co-
Occurrences
• Having extracted the MeSH tags we want from the MEDLINE citation records, let's get a feel for the overall distribution of tags in our data set by calculating some basic summary statistics using Spark SQL, such as the number of records and a histogram of the frequencies of the various major MeSH topics:
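A minimal sketch of those summary statistics, assuming the per-citation topic lists live in a Dataset[Seq[String]] named medline (an assumed name) inside a SparkSession named spark:

```scala
import spark.implicits._

// number of citation records
println(medline.count())

// one row per (citation, topic) pair
val topics = medline.flatMap(mesh => mesh).toDF("topic")
topics.createOrReplaceTempView("topics")

// frequency histogram of the major MeSH topics
spark.sql("""
  SELECT topic, COUNT(*) AS cnt
  FROM topics
  GROUP BY topic
  ORDER BY cnt DESC
""").show(10)
```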
Constructing a Co-Occurrence Network with GraphX
• When studying co-occurrence networks, our standard tools for summarizing data don't provide much insight.
• What we really want to do is analyze the co-occurrence network: by thinking of the topics as
vertices in a graph, and the existence of a citation record that features both topics as an
edge between those two vertices.
• Then, we could compute network-centric statistics that would help us understand the overall
structure of the network and identify interesting local outlier vertices that are worthy of further
investigation.
• We can also use co-occurrence networks to identify meaningful interactions between entities
that are worthy of further investigation.
• Figure 7-1 shows part of a co-occurrence graph for combinations of cancer drugs that were
associated with adverse events in the patients who were taking them. We can use the
information in these graphs to help us design clinical trials to study these interactions.
• GraphX is a Spark library that is designed to help us analyze various kinds of networks using
the language and tools of graph theory.
• Because GraphX builds on top of Spark, it inherits all of Spark’s scalability properties, which
means that it is capable of carrying out analyses on extremely large graphs that are distributed
across multiple machines. GraphX is built on top of Spark’s fundamental data primitive, the
RDD.
• Specifically, GraphX is based on two custom RDD implementations that are optimized for
working with graphs.
• The VertexRDD[VD] is a specialized implementation of RDD[(VertexId, VD)], in which the VertexId type is an alias for Long and is required for every vertex, while VD can be any other type of data associated with the vertex, and is called the vertex attribute.
• The EdgeRDD[ED] is a specialized implementation of RDD[Edge[ED]], where Edge is a case class that contains two VertexId values and an edge attribute of type ED.
• Both the VertexRDD and the EdgeRDD have internal indices within each partition of the data that are designed to facilitate fast joins and attribute updates.
• Given both a VertexRDD and an associated EdgeRDD, we can create an instance of the Graph class, which contains a number of methods for efficiently performing graph computations.
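A toy sketch of wiring those pieces together (the vertices, edges, and counts here are made up; sc is a SparkContext):

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// vertex attribute: the topic name; VertexId is an alias for Long
val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Sepsis"), (2L, "Fever"), (3L, "Insulin")))

// edge attribute: a co-occurrence count
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 5),   // "Sepsis" and "Fever" co-occur in 5 citations
  Edge(2L, 3L, 1)))

val graph: Graph[String, Int] = Graph(vertices, edges)
println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
```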
• The first requirement in creating a graph is to have a Long value that can be
used as an identifier for each vertex in the graph.
• One option would be to use the built-in hashCode method, which will generate a 32-bit integer for any given Scala object.
• For our problem, which only has 13,000 vertices in the graph, the hash code
trick will probably work.
• But for graphs that have millions or tens of millions of vertices, the
probability of a hash code collision might be unacceptably high.
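The standard birthday-problem approximation makes the risk concrete: hashing $n$ items into $2^{32}$ buckets gives

$$P(\text{collision}) \approx 1 - \exp\!\left(-\frac{n^2}{2 \cdot 2^{32}}\right)$$

which is only about 2% for $n = 13{,}000$ but is essentially 1 for $n = 10^7$.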
• For this reason, we’re going to copy a hashing implementation from
Google’s Guava Library to create a unique 64-bit identifier for each topic
using the MD5 hashing algorithm:
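A sketch of that helper using Guava's Hashing utilities (Guava must be on the classpath; the function name hashId is ours):

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.Hashing

// Derive a stable 64-bit vertex ID from a topic string: MD5 the UTF-8
// bytes and keep the first 8 bytes of the digest as a Long.
def hashId(str: String): Long = {
  Hashing.md5().hashString(str, StandardCharsets.UTF_8).asLong()
}
```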
• If there are duplicate entries in the EdgeRDD for a given pair of vertices, the Graph API will not deduplicate them (see the groupEdges sketch after this list).
• GraphX allows us to create multigraphs, which can have multiple edges with
different values between the same pair of vertices.
• This can be useful in applications where the vertices in the graph represent
rich objects, like people or businesses, that may have many different kinds of
relationships among them (e.g., friends, family members, customers,
partners, etc.).
• It also allows us to treat the edges as either directed or undirected, depending
on the context.
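When we do want to collapse duplicate edges (for example, by summing their co-occurrence counts), a sketch using groupEdges, which requires co-locating identical edges with partitionBy first:

```scala
import org.apache.spark.graphx.{Graph, PartitionStrategy}

// merge parallel edges between the same pair of vertices by summing
// their Int attributes; partitionBy puts duplicates in the same partition
val deduped: Graph[String, Int] = graph
  .partitionBy(PartitionStrategy.CanonicalRandomVertexCut)
  .groupEdges((a, b) => a + b)
```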
Understanding the Structure of Networks
Degree Distribution
• A connected graph can be structured in many different ways.
• For example, there might be a single vertex that is connected to all of the other vertices, but
none of those other vertices connect to each other.
• If we eliminated that single central vertex, the graph would shatter into individual vertices.
• We might also have a situation in which every vertex in the graph was connected to exactly
two other vertices, so that the entire connected component formed a giant loop.
• To gain additional insight into how the graph is structured, we can look at the degree of each vertex, which is simply the number of edges that a particular vertex belongs to.
• In a graph without loops (i.e., an edge that connects a vertex to itself), the sum of the degrees of the vertices will be equal to twice the number of edges, because each edge will contain two distinct vertices.
• Some topics will be missing from the degree distribution: these correspond to citations in the MEDLINE data that only had a single major topic, which means that they would not have had any other topics to co-occur with in our data.
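Computing degrees is a one-liner in GraphX; a sketch, assuming the co-occurrence graph built earlier is named topicGraph (note that degrees only contains vertices with at least one edge):

```scala
import org.apache.spark.graphx.VertexRDD

// one (vertexId, degree) entry per vertex that has at least one edge
val degrees: VertexRDD[Int] = topicGraph.degrees

// count, mean, stdev, min, and max of the degree distribution
println(degrees.map(_._2).stats())
```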
Filtering Out Noisy Edges
• In the current co-occurrence graph, the edges are weighted based on the count
of how often a pair of concepts appears in the same paper.
• The problem with this simple weighting scheme is that it doesn’t distinguish
concept pairs that occur together because they have a meaningful semantic
relationship from concept pairs that occur together because they happen to both
occur frequently for any type of document.
• We need to use a new edge-weighting scheme that takes into account how
“interesting” or “surprising” a particular pair of concepts is for a document
given the overall prevalence of those concepts in the data.
• We will use Pearson’s chi-squared test to calculate this “interestingness” in a
principled way—that is, to test whether the occurrence of a particular concept
is independent from the occurrence of another concept.
• A large chi-squared statistic indicates that the variables are less likely to be independent, and thus we find the pair of concepts more interesting.
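Concretely, for a pair of concepts A and B we can tabulate the documents in a 2×2 contingency table: $YY$ documents mention both, $YN$ and $NY$ mention exactly one, $NN$ mention neither, and $T$ is the total. Pearson's chi-squared statistic for such a table is

$$\chi^2 = \frac{T\,(YY \cdot NN - YN \cdot NY)^2}{(YY + YN)\,(NY + NN)\,(YY + NY)\,(YN + NN)}$$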
Small-World Networks
• The connectedness and degree distribution of a graph can give us a basic idea of its overall
structure, and GraphX makes it easy to calculate and analyze these properties.
• With the rise of computer networks like the web and social networks like Facebook and
Twitter, data scientists now have rich data sets that describe the structure and formation of
real-world networks versus the idealized networks that mathematicians and graph theorists
have traditionally studied.
• One of the first papers to describe the properties of these real-world networks and how they
differed from the idealized models was published in 1998 by Duncan Watts and Steven
Strogatz and was titled “Collective Dynamics of ‘Small-World’ Networks”.
• It was a seminal paper that outlined the first mathematical model for how to generate graphs
that exhibited the two “small-world” properties that we see in real-world graphs:
➢Most of the nodes in the network have a small degree and belong to a relatively dense cluster
of other nodes; that is, a high fraction of a node’s neighbors are also connected to each other.
➢ Despite the small degree and dense clustering of most nodes in the graph, it is possible to reach any node in the network from any other node relatively quickly by traversing a small number of edges.
❑For each of these properties, Watts and Strogatz defined a metric that could
be used to rank graphs based on how strongly they expressed these properties.
❑We will use GraphX to compute these metrics for our concept network and compare the values we get to the values we would get for an idealized random graph, in order to test whether our concept network exhibits the small-world property.

➢ Cliques and Clustering Coefficients
➢ Computing Average Path Length with Pregel
Cliques and Clustering Coefficients

• A graph is complete if every vertex is connected to every other vertex by an edge.
• In a given graph, there may be many subsets of vertices that are complete, and
we call these complete subgraphs cliques.
• The presence of many large cliques in a graph indicates that the graph has the
kind of locally dense structure that we see in real small-world networks.
• Finding cliques in a given graph turns out to be very difficult to do. The
problem of detecting whether or not a given graph has a clique of a given size
is NP-complete, which means that finding cliques in even small graphs can be
computationally intensive.
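Because exact clique detection is intractable, a common proxy is the local clustering coefficient, which only requires counting triangles. A sketch using GraphX's built-in triangleCount, assuming the topicGraph from earlier (triangleCount expects canonically oriented edges, i.e., srcId < dstId):

```scala
// number of triangles each vertex participates in
val triangles = topicGraph.triangleCount()

// local clustering coefficient = triangles / possible pairs of neighbors
val clusteringCoef = triangles.vertices.join(topicGraph.degrees).map {
  case (_, (triCount, degree)) =>
    val possiblePairs = degree * (degree - 1) / 2.0
    if (possiblePairs == 0) 0.0 else triCount / possiblePairs
}

// network average, for comparison against an idealized random graph
println(clusteringCoef.mean())
```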
Computing Average Path Length with Pregel
• The second property of small-world networks is that the length of the shortest path between
any two randomly chosen nodes tends to be small.
• Computing the path length between vertices in a graph is an iterative process that is similar to
the iterative process we use to find the connected components.
• At each phase of the process, each vertex will maintain a collection of the vertices that it
knows about and how far away each vertex is.
• Each vertex will then query its neighbors about the contents of their lists, and it will update its
own list with any new vertices that are contained in its neighbors’ lists that were not contained
in its own list.
• This process of querying neighbors and updating lists will continue across the entire graph
until none of the vertices are able to add any new information to their lists.
• This iterative, vertex-centric method of parallel programming on large, distributed graphs is based on a paper that Google published in 2009 called “Pregel: a system for large-scale graph processing”.
• Pregel is based on a model of distributed computation that predates MapReduce called bulk-synchronous parallel (BSP). BSP programs divide parallel processing stages into two phases: computation and communication.
To use the Pregel operator to implement the iterative, graph-parallel computations we need to compute the average path length for a graph, we must:
1. Figure out what state we need to keep track of at each vertex.
2. Write a function that takes the current state into account and evaluates
each pair of linked vertices to determine which messages to send at the
next phase.
3. Write a function that merges the messages from all of the different
vertices before we pass the output of the function to the vertex for
updating.
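These three steps map directly onto the arguments of GraphX's pregel operator, whose (slightly simplified) signature looks like this:

```scala
// vprog updates a vertex's state from an incoming message (step 1),
// sendMsg decides what each edge transmits (step 2), and mergeMsg
// combines messages bound for the same vertex (step 3)
def pregel[A](
    initialMsg: A,                  // delivered to every vertex before superstep 0
    maxIterations: Int,             // cap on the number of supersteps
    activeDirection: EdgeDirection  // which edges may send messages
  )(
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A
  ): Graph[VD, ED]
```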
• The following code constructs the message that will be sent to each vertex based on the information it receives from its neighbors at each iteration.
• The basic idea is that each vertex should increment the value of each key in its current Map[VertexId, Int] by one, combine the incremented map values with the values from its neighbor using the mergeMaps method, and send the result of the mergeMaps function to the neighboring vertex if it differs from the neighbor's internal Map[VertexId, Int].
• The code for performing this sequence of operations looks like this:
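A sketch of those functions, assuming each vertex's state is the Map[VertexId, Int] of known distances described above and that mergeMaps keeps the minimum distance per vertex, followed by the Pregel call that ties everything together:

```scala
import org.apache.spark.graphx._

// combine two distance maps, keeping the smaller distance for each key
def mergeMaps(m1: Map[VertexId, Int], m2: Map[VertexId, Int]): Map[VertexId, Int] = {
  def minThatExists(k: VertexId): Int =
    math.min(m1.getOrElse(k, Int.MaxValue), m2.getOrElse(k, Int.MaxValue))
  (m1.keySet ++ m2.keySet).map(k => k -> minThatExists(k)).toMap
}

// vertex program: fold the incoming message into the vertex's own map
def update(id: VertexId, state: Map[VertexId, Int], msg: Map[VertexId, Int]) =
  mergeMaps(state, msg)

// increment a's distances by one hop and send them to vertex bid, but
// only if doing so would actually change b's map
def checkIncrement(
    a: Map[VertexId, Int],
    b: Map[VertexId, Int],
    bid: VertexId): Iterator[(VertexId, Map[VertexId, Int])] = {
  val aplus = a.map { case (v, d) => v -> (d + 1) }
  if (b != mergeMaps(aplus, b)) Iterator((bid, aplus)) else Iterator.empty
}

// messages flow in both directions across each (undirected) edge
def iterate(e: EdgeTriplet[Map[VertexId, Int], _]) =
  checkIncrement(e.srcAttr, e.dstAttr, e.dstId) ++
  checkIncrement(e.dstAttr, e.srcAttr, e.srcId)

// assuming mapGraph: Graph[Map[VertexId, Int], Int], where each vertex
// starts out knowing only itself at distance 0, i.e., Map(id -> 0)
val paths = mapGraph.pregel(Map[VertexId, Int]())(update, iterate, mergeMaps)
```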
