
ICWSM11 Tutorial: Exploratory Network Analysis with Gephi

Instructors: Sébastien Heymann (seb@gephi.org), Julian Bilcke (julian.bilcke@gephi.org)

July 17, 2011 | 1 PM - 4 PM

Exploratory Network Analysis with Gephi


This tutorial is an introduction to Gephi, the open source network visualization and manipulation software. Gephi aims to cover the complete chain from data import to aesthetic refinement and interaction. Users interact with the visualization and manipulate structures, shapes and colors to reveal hidden properties. The goal is to help data analysts form hypotheses and intuitively discover patterns or errors in large data collections. By the end, participants will have the practical knowledge they need to use Gephi for their own projects.



The tutorial starts with a brief introduction to the network exploration process and a hands-on demonstration of Gephi's essential functionalities. Participants are guided step by step through the complete chain of representation, manipulation, layout, analysis and aesthetic refinement. Next, teams work on real datasets and present their preliminary results. The tutorial concludes with a general question and answer session.


Requirements
Bring your own laptop with Java and Gephi installed.
Gephi should be up to date (menu Help > Check for Updates).
Bring a mouse with a wheel.
Optionally, bring a dataset of your own and verify that it loads well in Gephi.[1]

[1] http://gephi.org/users/supported-graph-formats/

Workshop Schedule - Part I


Exploratory Network Analysis
- Exploratory Data Analysis
- Exploratory Network Analysis
- Looking for Orderness in Data
- Examples
- Guideline
Introduction to Gephi
- Approach and Community
- Networked Data
- Quick Start Demo
* 30 min break *

Workshop Schedule - Part II


Hands-On!
- Team Work on a Dataset
Presentation of Preliminary Results
Q&A

Exploratory Data Analysis

Confirmatory: results
Exploratory: intuition
Serendipity: surprise

The greatest value of a picture is when it forces us to notice what we never expected to see

started with John Tukey (1962)

Exploratory Data Analysis

Non-linear processing chain of Ben Fry in Computational Information Design (2004)

Dummy Example

Observation: visual saliences on specific file sizes
External knowledge: these sizes correspond to films
New hypothesis on data: films are highly exchanged, so the study might dig in this direction
Figure: P2P file size distribution (Latapy et al., 2008)

Exploratory Network Analysis

see the network

interact in real time

1st graph viz tool: Pajek (1996) Vladimir Batagelj, Andrej Mrvar

Gephi prototype (2008) group, filter, compute metrics...

build a visual language

size by rank, color by partition, label, curved edges, thickness...

Looking for a Simple Small Truth?

Drew Conway, What Data Visualization Should Do:

1. Make complex things simple
2. Extract small information from large data
3. Present truth, do not deceive

http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/

Looking for Orderness in Data


Vary 3 cursors simultaneously to extract meaningful patterns:
at different levels: from MICRO level to MACRO level
on multiple dimensions: from 1 dimension to N dimensions
at time scale: from T+0 to T+N

Zoom cursor on Quantitative Data


Cursor range: MICRO level to MACRO level

Global: connectivity, density, centralization
Local: communities, bridges between communities, local centers vs periphery
Individual: centrality, distances, neighborhood, location, local authority vs hub
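Two of the quantities listed above, density (global level) and centrality (individual level), can be illustrated with a minimal Python sketch. The five-node toy graph and the function names are assumptions made for illustration, and the formulas are the common textbook definitions rather than a description of how Gephi computes its metrics.

```python
# Hypothetical toy graph: five nodes, five directed edges.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]


def density(nodes, edges):
    """Global level: share of possible directed edges that are present."""
    n = len(nodes)
    return len(edges) / (n * (n - 1))


def degree_centrality(nodes, edges):
    """Individual level: (in-degree + out-degree) divided by n - 1."""
    degree = {node: 0 for node in nodes}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    n = len(nodes)
    return {node: d / (n - 1) for node, d in degree.items()}


graph_density = density(nodes, edges)
centrality = degree_centrality(nodes, edges)
most_central = max(centrality, key=centrality.get)
```

In this toy graph, node a touches three of the five edges, so it comes out as the local center.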

Crossing cursor on Qualitative Data


Cursor range: 1 dimension to N dimensions

Social: who with whom, communities, brokerage, influence and power, homophily
Semantic: topics, thematic clusters
Geographic: spatial phenomena

Timeline cursor on Temporal Data


Cursor range: T+0 to T+N

Evolution of social ties
Evolution of communities
Evolution of topics

Mapping an Innovation Center

Collaborations on projects at Images et Réseaux

Themes and content

Actors

Territory

Franck Ghitalla & Ecole de Design de Nantes

Mapping Scientific Cooperations

Network Map: a Series of Choices

corpus, data, graphical operations, algorithms, thresholds, communication goals

Guideline
# nodes

1 - 100
lists + edges as a bonus, focus on qualitative data

100 - 1,000: How do attributes explain the structure?
- easy to read, obvious patterns
- focus on entities (in context)
- metrics are tools to describe the graph (centrality, bridging...)
- links help to build and interpret categories of entities
- challenge: mix attribute crossing and connectivity

1,000 - 50,000: How does the structure explain attributes?
- hard to read, problem of hidden signals: track patterns with various layouts and filtering
- focus on structures
- metrics are tools to build the graph (cosine similarity...)
- categories help to understand the structure
- challenge: pattern recognition

> 50,000
requires high computational power
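The idea that a metric such as cosine similarity can be used to build the graph can be sketched as follows. The keyword profiles, the threshold value and the function names are hypothetical and chosen only for illustration; the point is that every pair of entities gets a similarity score, and only pairs at or above a chosen threshold become edges.

```python
import math
from itertools import combinations

# Hypothetical entities described by keyword counts (illustrative data only).
profiles = {
    "doc1": {"graph": 3, "layout": 1},
    "doc2": {"graph": 2, "layout": 2},
    "doc3": {"music": 4},
}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_edges(profiles, threshold):
    """Keep an undirected weighted edge when similarity reaches the threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        weight = cosine_similarity(profiles[a], profiles[b])
        if weight >= threshold:
            edges.append((a, b, weight))
    return edges


edges = build_edges(profiles, threshold=0.5)
```

The threshold is one of the choices that shape the resulting map: a lower value yields a denser graph, a higher value keeps only the strongest similarities.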

Gephi now!

Gephi in a Nutshell
Like Photoshop for graphs. Helps data analysts reveal patterns and trends, highlight outliers, and tell stories with their data.

Network visualization platform
Open source, supported by a community
Built for performance and usability
Extensible by plug-ins
Windows, MacOS X, Linux

Gephi Community

Nonprofit organization

Communities

Contributors
Mathieu Bastian, Mathieu Jacomy, Eduardo Ramos Ibáñez, Sébastien Heymann, Guillaume Ceccarelli, André Panisson, Antonio Patriarca, Cezary Bartosiak, Martin Škurla, Patrick McSweeney, Yi Du, Hélder Suzuki, Daniel Bernardes, Ernesto Aneiro, Keheliya Gallaba, Luiz Ribeiro, Urban Škudnik, Vojtech Bardiovsky, Yudi Xue

Community Mission
Provide sustainable software
Maintain the technical ecosystem
Build a business ecosystem
Face cutting-edge technological challenges with a long-term vision
Distribute the software as Open Source

Community Values
Open innovation: ideas and features come from the entire community. Decisions are taken with transparency. We consider this technology as a public good, and will keep it in open source.

Diversity of Usages
business

leisure :-)

communication

academic

art

Diversity of Network Encoding


Textual

V = { a, b, c, d, e }
E = { (a,b), (a,d), (b,c), (e,a), (c,e) }

XML

<graph>
  <nodes>
    <node id="a" />
    <node id="b" />
    <node id="c" />
    <node id="d" />
    <node id="e" />
  </nodes>
  <edges>
    <edge source="a" target="b" />
    <edge source="a" target="d" />
    <edge source="b" target="c" />
    <edge source="e" target="a" />
    <edge source="c" target="e" />
  </edges>
</graph>

Tabular (rows: source, columns: target)

    a  b  c  d  e
a   -  1  0  1  0
b   0  -  1  0  0
c   0  0  -  0  1
d   0  0  0  -  0
e   1  0  0  0  -

Graphical

(drawing of the same five-node graph)

and many others...
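The encodings above describe the same graph, so one can be derived from another. The following minimal Python sketch, using only the standard library, turns the textual edge list into the tabular and XML forms. The function names are illustrative assumptions, and the XML mirrors the simplified structure shown on the slide rather than a complete file in any particular format.

```python
import xml.etree.ElementTree as ET

# Textual encoding from the slide: node set V and directed edge set E.
V = ["a", "b", "c", "d", "e"]
E = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]


def to_adjacency_matrix(nodes, edges):
    """Tabular encoding: rows are sources, columns are targets."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


def to_xml(nodes, edges):
    """XML encoding mirroring the simplified structure on the slide."""
    graph = ET.Element("graph")
    nodes_el = ET.SubElement(graph, "nodes")
    for node in nodes:
        ET.SubElement(nodes_el, "node", id=node)
    edges_el = ET.SubElement(graph, "edges")
    for source, target in edges:
        ET.SubElement(edges_el, "edge", source=source, target=target)
    return ET.tostring(graph, encoding="unicode")


matrix = to_adjacency_matrix(V, E)
xml_text = to_xml(V, E)
```

The matrix needs one cell per pair of nodes, while the edge list and the XML grow only with the number of edges, which is one reason list-like encodings are often preferred for large sparse networks.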

Software I/O
Input
databases: MySQL, PostgreSQL, SQL Server, Neo4j
file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Graphviz DOT, UCInet DL, Netdraw VNA, Tulip TLP, Excel Spreadsheet
graph streaming
user input

Output
file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Excel Spreadsheet, SVG, PDF, PNG

Choosing a File Format


[Table: features supported by Gephi]
Rows (formats): CSV, DL Ucinet, DOT Graphviz, GDF, GEXF, GML, GraphML, NET Pajek, TLP Tulip, VNA Netdraw, Spreadsheet*
Columns (features): Edge List/Matrix Structure, XML Structure, Edge Weight, Attributes, Visualization Attributes, Attribute Default Value, Hierarchy Graphs, Dynamics

* spreadsheets can be loaded in the Data Laboratory

Do you need...

[Figure: formats placed on an axis from many features to few features]
Formats: GEXF, Spreadsheet, GraphML, Guess GDF, GML, UCINet DL, Netdraw VNA, Graphviz DOT, Pajek NET, CSV, Tulip TLP
Legend (file type): XML, Tabular, Text

Using Gephi


Team work

1. Create a team of 2 to 3 people.
2. Choose a dataset.
3. Explore it for 1 hour.
4. Two teams present their preliminary findings.

Dataset #1: GitHub Software Repository

GitHub is an application used by nearly a million people to store over two million code repositories, making GitHub the largest code host in the world.

Started in 2008, it provides the features of an online social network and a software repository to lower the barriers to collaboration and make it easier to contribute code. https://github.com



Data extracted by Franck Cuny* at Linkfluence SAS
1st release in March 2010 -> this poster
2nd release in June 2011 -> your data

Network of user profiles
Nodes: people with at least one repository who are followed by at least two other people
Edges: A follows B

Network of repositories
Nodes: repositories
Edges: A shares a developer with B

Very few research publications on this OSN!
* franck.cuny@linkfluence.net



Data extracted by a crawl using the GitHub API
Seed: 10 well-known contributors in the Perl community
Networks by country: Japan, France, United States
Networks by language: Perl, PHP, Python, Ruby
Node attributes: user country, number of followers, main programming language
Edges: directed, weight = number of projects A has forked from B
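The edge definition above (directed, weighted by the number of projects A has forked from B) can be illustrated with a small Python sketch that aggregates fork records into a weighted edge list and serialises it as CSV. The records and user names are hypothetical, each record is assumed to stand for one distinct forked project, and the Source, Target and Weight column names are an assumption about what a spreadsheet import expects; check the Gephi documentation before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical fork records: (user who forked, owner of the forked project).
forks = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
]


def weighted_edges(fork_records):
    """Directed edge A -> B weighted by the number of projects A forked from B."""
    counts = Counter(fork_records)
    return sorted((a, b, w) for (a, b), w in counts.items())


def to_edge_csv(edges):
    """Serialise edges with Source, Target, Weight columns (assumed names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


edges = weighted_edges(forks)
edge_csv = to_edge_csv(edges)
```

Because the edges are directed, alice -> bob and bob -> alice are kept as two separate rows with their own weights.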


Your mission (should you decide to accept it): find research hypotheses based on your exploration
Example question: are the Perl communities based on geography?

Dataset #2: The Irish Blogosphere


Identifying Representative Textual Sources in Blog Networks. K. Wade, D. Greene, C. Lee, D. Archambault, P. Cunningham (2011) http://mlg.ucd.ie/blogs

Blogroll Network
Nodes: blogs with more than two blogroll links
Edges: blogroll link (in-link)

Post-link Network
Nodes: blogs with more than two blogroll links
Edges: hyperlink inside post from a blog to another (post-link)



Data extracted by a crawl at distance 2 from the seed for the in-links, and by Google Blog Search for the post-links.
Seed: 21 popular blogs, winners of the 2010 Irish Blog Awards
Node attributes:
- post count = total number of posts by blog
- category = from the Irish blog index at www.irishblogdirectory.com, where available
- infomap_comm = community to which a node belongs (Infomap algorithm)
- gce_comms = overlapping communities (GCE algorithm)
- moses_comms = overlapping communities (MOSES algorithm)
Edges: directed, weight = number of hyperlinks in the Post-link network
Figure: crawl at distance 2 from the seed


Your mission: explore and try to confirm the official results

Hands-On!
Start:
- Load a graph
- Apply a layout
- Color the nodes by a qualitative variable in Partition Panel
- Size the nodes by a quantitative variable in Ranking Panel
- Start to explore... compute metrics, filter the network
End:
- Export maps to PDF in Preview Tab
- Save
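The two visual mappings in the steps above, color by a qualitative variable and size by a quantitative variable, can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the node table, the palette, the size range and the function names are hypothetical, and the code does not reproduce how Gephi's Partition and Ranking panels are implemented.

```python
# Hypothetical node table: a quantitative and a qualitative attribute per node.
nodes = {
    "n1": {"followers": 10, "language": "Perl"},
    "n2": {"followers": 60, "language": "Ruby"},
    "n3": {"followers": 110, "language": "Perl"},
}
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c"]  # arbitrary colors


def size_by_rank(nodes, attribute, min_size=10.0, max_size=50.0):
    """Linearly rescale a quantitative attribute to a node size range."""
    values = [attrs[attribute] for attrs in nodes.values()]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid division by zero when all values match
    return {
        name: min_size + (attrs[attribute] - low) / span * (max_size - min_size)
        for name, attrs in nodes.items()
    }


def color_by_partition(nodes, attribute, palette=PALETTE):
    """Assign one color per distinct value of a qualitative attribute."""
    categories = sorted({attrs[attribute] for attrs in nodes.values()})
    color_of = {cat: palette[i % len(palette)] for i, cat in enumerate(categories)}
    return {name: color_of[attrs[attribute]] for name, attrs in nodes.items()}


sizes = size_by_rank(nodes, "followers")
colors = color_by_partition(nodes, "language")
```

With this mapping, the node with the fewest followers gets the minimum size, the one with the most gets the maximum, and nodes sharing a language share a color.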

Presentations

GitHub Repository

Irish Blogosphere

Gephi Documentation
Web Site: http://gephi.org
Support: http://forum.gephi.org
Wiki: http://wiki.gephi.org
Source code: https://launchpad.net/gephi

Online Tutorials

http://gephi.org/users/quick-start/
http://gephi.org/users/tutorial-visualization/
http://gephi.org/users/tutorial-layouts/
http://wiki.gephi.org/index.php/Import_CSV_Data
http://wiki.gephi.org/index.php/Import_Dynamic_Data

Tutorial in Spanish
https://code.google.com/p/camon/wiki/Taller_Gephi

Supported Graph Formats


http://gephi.org/users/supported-graph-formats/

Thank You!

Caspar David Friedrich, Wanderer Above the Sea of Fog

Credits
[slide 11] images from Drew Conway
http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/

[slide 22 top left] Benoît Vidal at MFG Labs
[slide 22 bottom center] Franck Ghitalla at UTC
[slide 22 right] Studies in MA Digital Fashion at LCF by Peter Jeun Ho Tsang
http://jeunhotsang.com/blog/2010/12/07/prototype/

[slide 27] sketches from Ben Fry, Computational Information Design

Special Thanks to Franck Ghitalla and Mathieu Jacomy for their insightful discussions.
