Analyzing Co-Occurrence Networks with GraphX
• Network science applies tools from graph theory, the mathematical discipline that studies
the properties of pairwise relationships (called edges) between a set of entities (called
vertices).
• Graph theory is also widely used in computer science to study everything from data
structures to computer architecture to the design of networks like the Internet.
• Graph theory and network science have had a significant impact on business.
• Almost every major internet company derives a significant fraction of its value from its
ability to build and analyze a valuable network of relationships better than any of its
competitors: the recommendation algorithms used at Amazon and Netflix rely on the
networks of consumer-item purchases (Amazon) and user-movie ratings (Netflix) that
each company creates and controls.
• Facebook and LinkedIn have built graphs of relationships between people that they
analyze to organize content feeds, promote advertisements, and broker new connections.
• Google built its search engine around the PageRank algorithm, which ranks web pages by analyzing the link graph of the web.
• The computational and analytical needs of these network-centric companies
helped drive the creation of distributed processing frameworks like
MapReduce as well as the hiring of data scientists who were capable of using
these new tools to analyze and create value from the ever-expanding volume of
data.
• One of the earliest use cases for MapReduce was to create a scalable and
reliable way to solve the equation at the heart of PageRank.
• Over time, as the graphs became larger and data scientists needed to analyze
them faster, new graph-parallel processing frameworks—like Pregel at
Google, Giraph at Yahoo!, and GraphLab at Carnegie Mellon—were
developed.
• These frameworks supported fault-tolerant, in-memory, iterative, and graph-
centric processing, and were capable of performing certain types of graph
computations orders of magnitude faster than the equivalent data-parallel
MapReduce jobs.
• Here we introduce a Spark library called GraphX, which extends Spark to support
many of the graph-parallel processing tasks that Pregel, Giraph, and
GraphLab support.
• Although it cannot handle every graph computation as quickly as the
custom graph frameworks do, the fact that it is a Spark library means that it
is relatively easy to bring GraphX into your normal data analysis workflow
whenever you want to analyze a network-centric data set.
• With it, you can combine graph parallel programming with the familiar
Spark abstractions you are used to working with.
The MEDLINE Citation Index: A Network Analysis
• MEDLINE (Medical Literature Analysis and Retrieval System Online) is a database of
academic papers that have been published in journals covering the life sciences and medicine.
• It is managed and released by the US National Library of Medicine (NLM), a division of
the National Institutes of Health (NIH).
• The main database contains more than 20 million articles going back to the early 1950s and is
updated 5 days a week.
• Due to the volume of citations and the frequency of updates, the research community
developed an extensive set of semantic tags, called MeSH (Medical Subject Headings), that
are applied to all of the citations in the index.
• These tags provide a meaningful framework that can be used to explore relationships
between documents to facilitate literature reviews, and they have also been used as the basis
for building data products: in 2001, PubGene demonstrated one of the first production
applications of biomedical text mining by launching a search engine that allowed users to
explore the graph of MeSH terms that connect related documents together.
Parsing XML Documents with Scala’s XML Library
➢ DescriptorName entries have an attribute called MajorTopicYN that indicates whether or
not this MeSH tag was a major topic of the cited article.
We can look up the value of attributes of XML tags using the \ and \\ operators if we preface
the attribute name with an @ symbol. We can use this to create a filter that only returns the
names of the major MeSH tags for each article:
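A minimal sketch of this filter using Scala's scala-xml library. The citation fragment below is a toy stand-in for a MEDLINE record; the element and attribute names (DescriptorName, MajorTopicYN) follow the structure described above:

```scala
import scala.xml.XML

// Toy stand-in for one MEDLINE citation's MeSH heading list.
val citation = XML.loadString("""
  <MeshHeadingList>
    <MeshHeading>
      <DescriptorName MajorTopicYN="N">Antibiotics</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName MajorTopicYN="Y">Drug Resistance, Microbial</DescriptorName>
    </MeshHeading>
  </MeshHeadingList>""")

// "\\" searches all descendant elements; prefixing the name with "@"
// (here via "\ \"@MajorTopicYN\"") looks up an attribute's value.
val majorTopics = (citation \\ "DescriptorName")
  .filter(n => (n \ "@MajorTopicYN").text == "Y")
  .map(_.text.trim)
// majorTopics == Seq("Drug Resistance, Microbial")
```

The same `filter` body, wrapped in a function, can be mapped over every citation record in the data set.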
Analyzing the MeSH Major Topics and Their Co-Occurrences
• Having extracted the MeSH tags we want from the MEDLINE citation records, let's get a feel for the overall
distribution of tags in our data set by calculating some basic summary statistics using Spark SQL,
such as the number of records and a histogram of the frequencies of the various major MeSH topics:
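A sketch of these summary statistics. The `medline` Dataset and its contents are toy stand-ins for the per-citation topic lists produced by the XML parsing step; the view and column names are ours:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("mesh-stats").getOrCreate()
import spark.implicits._

// Toy stand-in: one Seq of major MeSH topics per citation record.
val medline = Seq(
  Seq("Neoplasms", "Drug Therapy"),
  Seq("Neoplasms"),
  Seq("Drug Therapy", "Clinical Trials as Topic")
).toDS()
medline.createOrReplaceTempView("medline")

// Number of citation records.
spark.sql("SELECT COUNT(*) FROM medline").show()

// Histogram of major-topic frequencies, most common first.
val topics = medline.flatMap(ts => ts).toDF("topic")
topics.createOrReplaceTempView("topics")
spark.sql("""
  SELECT topic, COUNT(*) AS cnt
  FROM topics
  GROUP BY topic
  ORDER BY cnt DESC
""").show()
```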
Constructing a Co-Occurrence Network with GraphX
• When studying co-occurrence networks, our standard tools for summarizing data don't provide us with
much insight.
• What we really want to do is analyze the co-occurrence network: we treat the topics as
vertices in a graph, and the existence of a citation record that features both topics as an
edge between those two vertices.
• Then, we could compute network-centric statistics that would help us understand the overall
structure of the network and identify interesting local outlier vertices that are worthy of further
investigation.
• We can also use co-occurrence networks to identify meaningful interactions between entities
that are worthy of further investigation.
• Figure 7-1 shows part of a co-occurrence graph for combinations of cancer drugs that were
associated with adverse events in the patients who were taking them. We can use the
information in these graphs to help us design clinical trials to study these interactions.
• GraphX is a Spark library that is designed to help us analyze various kinds of networks using
the language and tools of graph theory.
• Because GraphX builds on top of Spark, it inherits all of Spark’s scalability properties, which
means that it is capable of carrying out analyses on extremely large graphs that are distributed
across multiple machines. GraphX is built on top of Spark’s fundamental data primitive, the
RDD.
• Specifically, GraphX is based on two custom RDD implementations that are optimized for
working with graphs.
• The VertexRDD[VD] is a specialized implementation of RDD[(VertexId, VD)], in which
VertexId is an alias for Long and is required for every vertex, while VD can be
any other type of data associated with the vertex, and is called the vertex attribute.
• The EdgeRDD[ED] is a specialized implementation of RDD[Edge[ED]], where Edge is a
case class that contains two VertexId values and an edge attribute of type ED. Both the
VertexRDD and the EdgeRDD have internal indices within each partition of the data that are
designed to facilitate fast joins and attribute updates. Given both a VertexRDD and an
associated EdgeRDD, we can create an instance of the Graph class, which contains a number
of methods for efficiently performing graph computations.
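The construction described above can be sketched with a tiny toy graph. All the vertex names, IDs, and edge counts below are illustrative; the vertex attribute (VD) is a topic name and the edge attribute (ED) is a co-occurrence count:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("graph-sketch"))

// Vertices: (VertexId, vertex attribute) pairs.
val vertices = sc.parallelize(Seq[(VertexId, String)](
  (1L, "Neoplasms"), (2L, "Drug Therapy"), (3L, "Clinical Trials as Topic")))

// Edges: Edge(srcId, dstId, edge attribute).
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 5),  // topics 1 and 2 co-occur in 5 citations
  Edge(2L, 3L, 2)))

// Graph.apply builds the Graph from the two RDDs.
val graph = Graph(vertices, edges)
println(graph.numVertices)  // 3
println(graph.numEdges)     // 2
```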
• The first requirement in creating a graph is to have a Long value that can be
used as an identifier for each vertex in the graph.
• One option would be to use the built-in hashCode method, which will
generate a 32-bit integer for any given Scala object.
• For our problem, which only has 13,000 vertices in the graph, the hash code
trick will probably work.
• But for graphs that have millions or tens of millions of vertices, the
probability of a hash code collision might be unacceptably high.
• For this reason, we’re going to copy a hashing implementation from
Google’s Guava Library to create a unique 64-bit identifier for each topic
using the MD5 hashing algorithm:
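The book copies a hashing implementation from Guava; as a JDK-only stand-in, the same idea, deriving a stable 64-bit ID from the first 8 bytes of a string's MD5 digest, can be sketched like this (the function name `hashId` is ours):

```scala
import java.nio.ByteBuffer
import java.security.MessageDigest

// Stand-in for the Guava-based helper: map a topic string to a stable
// 64-bit identifier by taking the first 8 bytes of its MD5 digest.
def hashId(str: String): Long = {
  val bytes = MessageDigest.getInstance("MD5").digest(str.getBytes("UTF-8"))
  ByteBuffer.wrap(bytes, 0, 8).getLong
}

// The same input always yields the same Long, so the ID is reproducible
// across runs and machines.
val id = hashId("Drug Therapy")
```

With 64-bit IDs, the probability of a collision across even tens of millions of topics is far lower than with 32-bit hash codes.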
• If there are duplicate entries in the EdgeRDD for a given pair of vertices, the Graph API
will not deduplicate them.
• GraphX allows us to create multigraphs, which can have multiple edges with
different values between the same pair of vertices.
• This can be useful in applications where the vertices in the graph represent
rich objects, like people or businesses, that may have many different kinds of
relationships among them (e.g., friends, family members, customers,
partners, etc.).
• It also allows us to treat the edges as either directed or undirected, depending
on the context.
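A sketch of this multigraph behavior, and of collapsing parallel edges when we do want simple-graph semantics. The vertex names and attributes are illustrative; note that GraphX's `groupEdges` requires a `partitionBy` first so that parallel edges are co-located:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("multigraph-sketch"))

val vertices = sc.parallelize(Seq[(VertexId, String)]((1L, "A"), (2L, "B")))

// Two parallel edges between the same vertex pair: a multigraph.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 2L, 1)))

val multigraph = Graph(vertices, edges)
println(multigraph.numEdges)  // 2: parallel edges are kept, not deduplicated

// partitionBy co-locates parallel edges; groupEdges then merges their
// attributes with a user-supplied combiner (here, summing the counts).
val merged = multigraph
  .partitionBy(PartitionStrategy.CanonicalRandomVertexCut)
  .groupEdges((a, b) => a + b)
println(merged.numEdges)      // 1, with edge attribute 2
```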
Understanding the Structure of Networks
Degree Distribution
• A connected graph can be structured in many different ways.
• For example, there might be a single vertex that is connected to all of the other vertices, but
none of those other vertices connect to each other.
• If we eliminated that single central vertex, the graph would shatter into individual vertices.
• We might also have a situation in which every vertex in the graph was connected to exactly
two other vertices, so that the entire connected component formed a giant loop.
• To gain additional insight into how the graph is structured, we can compute the degree of each vertex,
which is simply the number of edges that the vertex belongs to.
• In a graph without loops (i.e., edges that connect a vertex to itself), the sum of the
degrees of the vertices will be equal to twice the number of edges, because each edge
contributes to the degree of both of the vertices it connects.
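This degree-sum identity can be checked on a small toy graph using GraphX's `degrees` member, which is a VertexRDD[Int] of (VertexId, degree) pairs. The hub-and-spoke graph below is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("degrees-sketch"))

// A small loop-free graph: a hub vertex 1 connected to vertices 2, 3, and 4.
val vertices = sc.parallelize(Seq[(VertexId, String)](
  (1L, "hub"), (2L, "a"), (3L, "b"), (4L, "c")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 0), Edge(1L, 3L, 0), Edge(1L, 4L, 0)))

val graph = Graph(vertices, edges)

// graph.degrees counts edges per vertex regardless of direction.
val degreeSum = graph.degrees.map(_._2).sum()
println(degreeSum)           // 6.0
println(2 * graph.numEdges)  // 6: sum of degrees = 2 * |E|
```

Here the hub has degree 3 and each spoke has degree 1, so the sum is 6, exactly twice the 3 edges.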