
1 Data and R code

This handout shows and discusses several pieces of R code. The script R_intro.R includes all the code shown
in these pages. You can access and run the code by opening R_intro.R in your R GUI (e.g. RStudio).
This code uses real-world data collected in 2012 with a personal network survey among 107 Sri Lankan
immigrants in Milan, Italy. Out of the 107 respondents, 102 reported their personal network. The data files
include ego-level data (sex, age, educational level, etc. for each Sri Lankan respondent), alter attributes
(alter’s nationality, country of residence, emotional closeness to ego etc.), and information on alter-alter ties
in the form of adjacency matrices. Each personal network has a fixed size of 45 alters. The relevant data
files are all located in the Data/ folder that you downloaded. The data files used in this document (and in
R_intro.R) are the following:



ego_data.csv: This is a single csv file with ego attributes for all the egos.
adj_28.csv: This is the adjacency matrix for ego ID 28’s personal network.
elist_28.csv: The edge list for ego ID 28’s personal network.
alter.data_28.csv: The alter attributes in ego ID 28’s personal network.

Information about data variables and categories is available in Data/Codebook.xlsx.
NOTE: Before running the R code in R_intro.R, you should make sure that the Data/ folder with all the
data files listed above is in your current working directory (use getwd() and setwd() to check/set your
working directory — see examples below). Also make sure to run the code in R_intro.R line by line: if you
skip one or more lines, the following lines may return errors.

2 Starting R

Upon opening R, you normally want to do two things:
• Check and/or set your current working directory. By default, R will look for files, and save new files,
in this directory.

– To set your working directory to “/Users/John/Documents/Rworkshop”, just run setwd("/Users/John/Documents/Rworkshop").
– Windows users: Note that you have to input the directory path with forward slashes
(or double backslashes), not with single backslashes as in a typical Windows path. That is,
setwd("C:/Users/John/Documents/Rworkshop") or setwd("C:\\Users\\John\\Documents\\Rworkshop")
will work; setwd("C:\Users\John\Documents\Rworkshop") won’t work.
• Load all the packages you need to use in the current session. There are two steps to using a package
in R:
1. Installing the package. You do this just once. Use install.packages("package_name") or the
appropriate menu in your R GUI (e.g. in RStudio: Tools > Install Packages). Once
you install a package, the package files are in your system R folder and R will always be able to
find the package there.
2. Loading the package in your current session. Use library(package_name); quotation marks
around the package name are optional here. You do this in each R session in which you need
the package, i.e., every time you start R and you need the package.
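The two start-up steps above can be sketched as follows. This is a minimal illustration, not code from R_intro.R: the helper is_installed() is a hypothetical name introduced here, and the check for a "Data" folder simply assumes the folder layout described in section 1.

```r
# Where is R currently looking for files?
wd <- getwd()
print(wd)

# Build the path to the "Data" folder portably (file.path() uses forward slashes).
data_dir <- file.path(wd, "Data")

# TRUE only if a "Data" folder sits in the current working directory.
data_found <- dir.exists(data_dir)
print(data_found)

# is_installed() is a hypothetical helper, not part of R_intro.R.
# It returns TRUE if the package can be found, without attaching it.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check succeeds.
print(is_installed("stats"))

# Typical pattern (commented out so that nothing is installed by accident):
# if (!is_installed("igraph")) install.packages("igraph")
# library(igraph)
```

If data_found is FALSE, set the working directory with setwd() before running the rest of the script.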

2.1 Console vs scripts

• When you open your R GUI, you typically see two separate windows: the script editor and the console.
You can write R code in either of them.

Vector and matrix objects

• Vectors are the most basic objects you use in R. Vectors can be numeric (numerical data), logical (TRUE/FALSE data), character (string data).
• The basic function to create a vector is c() (concatenate).
• Other useful functions to create vectors: rep() and seq(). Also keep in mind the : shortcut: c(1, 2, 3, 4) is the same as 1:4.
• The length (number of elements) is a basic property of vectors (length()).
• When we print() vectors, the numbers in square brackets indicate the positions of vector elements.
• To create a matrix: matrix(). Its main arguments are: the cell values (within c()), number of rows (nrow) and number of columns (ncol). Values are arranged in a nrow x ncol matrix by column. See ?matrix.
• When we print() matrices, the numbers in square brackets indicate the row and column numbers.

# Let's create a simple vector.
x <- c(1, 2, 3, 4)
# Display it.
x
## [1] 1 2 3 4
# Shortcut for the same thing.
y <- 1:4
y
## [1] 1 2 3 4
# What's the length of x?
length(x)
## [1] 4
# The function rep() replicates values into a vector.
rep(1, times= 10)
## [1] 1 1 1 1 1 1 1 1 1 1
# (NOTE that we didn't assign the vector above, it was just printed and lost).
# Also vectors themselves can be repeated.
x
## [1] 1 2 3 4
rep(x, times= 10)
## [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
## [36] 4 1 2 3 4
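The bullets above mention seq() and matrix(), which the code on this page does not demonstrate. The following is a small sketch with made-up values (none of them come from the workshop data), showing in particular that matrix() fills cells by column unless byrow=TRUE is set.

```r
# seq() creates a regular sequence: from 1 to 10 in steps of 3.
s <- seq(from = 1, to = 10, by = 3)
print(s)

# matrix() arranges the cell values by column by default.
m <- matrix(1:12, nrow = 3, ncol = 4)
print(m)

# Row 2, column 3 of the column-filled matrix.
print(m[2, 3])

# The same values arranged by row instead.
m.byrow <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
print(m.byrow)

# The same cell now holds a different value.
print(m.byrow[2, 3])

# dim() returns the number of rows and columns.
print(dim(m))
```

Comparing m[2, 3] with m.byrow[2, 3] is a quick way to check which filling order a matrix was created with.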

– [1 2 3 4] + [1 2] = [1+1 2+2 3+1 4+2] ([1 2] is recycled once.)
– [1 2 3 4] + [1 2 3] = [1+1 2+2 3+3 4+1] ([1 2 3] is recycled one third of a time: R will warn that the length of longer vector is not a multiple of the length of shorter vector.)

# Just a few arithmetic operations between vectors to demonstrate element-wise
# calculations and the recycling rule.
# [1 2 3 4] + [1 2 3 4]
v1 <- 1:4
v2 <- 1:4
v1 + v2
## [1] 2 4 6 8
# [1 2 3 4] + 1
v1 <- 1:4
v2 <- 1
v1 + v2
## [1] 2 3 4 5
# [1 2 3 4] + [1 2]
v1 <- 1:4
v2 <- 1:2
v1 + v2
## [1] 2 4 4 6
# [1 2 3 4] + [1 2 3]
v1 <- 1:4
v2 <- 1:3
v1 + v2
## Warning in v1 + v2: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5

4.2 Comparisons and logical operations

• Comparisons: >, <, >=, <=. Equal is == (NOT =). Not equal is !=.
• Note: equal is ==, whereas = has a different meaning. = is used to assign function arguments (e.g. matrix(x, nrow = 3, ncol = 4)), or to assign objects (x <- 2 is the same as x = 2).
• Comparisons result in logical vectors: TRUE/FALSE.
• Like arithmetic operations, comparisons are performed element-wise on vectors, and recycling applies.
• Logical operators: & for AND, | for OR.
• Negation (i.e. opposite) of a logical vector: !.
• Is value x in vector y? x %in% y.
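The comparison and logical operators listed above can be tried on a small vector. This is an illustrative sketch with made-up values, not code from the workshop script.

```r
x <- 1:6

# Comparisons are element-wise and return a logical vector.
gt3 <- x > 3
print(gt3)

# AND: elements greater than 2 and smaller than 5.
middle <- x > 2 & x < 5
print(middle)

# OR: elements smaller than 2 or greater than 5.
extremes <- x < 2 | x > 5
print(extremes)

# Negation flips every TRUE/FALSE.
not.gt3 <- !gt3
print(not.gt3)

# Recycling also applies to comparisons: x is compared with 1, 0, 1, 0, 1, 0.
recycled <- x == c(1, 0)
print(recycled)

# Is a value in a vector?
print(3 %in% x)
print(7 %in% x)

# TRUE counts as 1, so sum() counts the elements that meet a condition.
n.gt3 <- sum(gt3)
print(n.gt3)
```

Counting with sum() on a logical vector is the same idea used later when adding up cells of an adjacency matrix that meet a condition.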

# What kind of variables are those?
str(ego.df)
## 'data.frame': 107 obs. of 6 variables:
##  $ ego_ID: int 28 29 32 33 35 36 39 40 45 46 ...
##  $ sex   : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ age   : int 61 38 43 30 25 63 29 56 52 35 ...
##  $ arr_it: int 2008 2000 2002 2010 2009 1990 2007 2008 1975 2002 ...
##  $ educ  : int 2 1 1 2 3 2 2 3 1 3 ...
##  $ income: int 350 900 800 200 1000 1100 0 950 1600 1200 ...

7 Indexing

• Indexing is crucial in R. Indexing means appending an index to an object to extract one (or more) of its elements. Indexing is also called subscripting.
• Indexing can be used to extract (view, query) the element of an object, or to replace it (assign a different value to that element).
• The basic notation for indexing in R is [ ]: x[i] gives you the i-th element of object x.
• Numeric indexing uses integers in square brackets [ ]: e.g. x[3]. Note that you can use negative integers to index (select) everything but that element: e.g. x[-3], x[c(-2,-4)].
• Logical indexing uses logical vectors in square brackets [ ]. It’s used to index objects according to a condition, e.g. to index all values in x that are greater than 3 (see example code below).
• Name indexing uses element names. Elements in a vector, and rows or columns in a matrix can have names.
– Names can be displayed and assigned using the names() function.
• When indexing you must take into account the number of dimensions of an object. E.g. vectors have 1 dimension, matrices have 2. Arrays can be defined with 3 dimensions or more (e.g. threeway tables).
– Square brackets typically contain a slot for each dimension of the object, separated by a comma: x[i] indexes the i-th element of the one-dimensional object x; x[i,j] indexes the i,j-th element of the two-dimensional object x (e.g. x is a matrix, i refers to a row and j refers to a column); x[i,j,k] indexes the i,j,k-th element of the three-dimensional object x.
– Notice that a dimension’s slot may be empty: if x is a matrix, x[3,] will index the whole 3rd row of the matrix (i.e. [row 3, all columns]).
– If x has more than one dimension (e.g. it’s a matrix), then x[3] (no comma, just one slot) is still valid, but it might give you unexpected results.
• Matrices have special functions that can be used for indexing, e.g. diag(), upper.tri(), lower.tri(). These can be useful for manipulating adjacency matrices.
• Particular indexing rules may apply to particular classes of objects (e.g. see special indexing rules for lists and data frames in next section).

# Numeric indexing
# ...
# Vector of 10 elements.
x <- 31:40
x
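The indexing rules above can be sketched on a short vector and a small matrix. The named vector and the 3 x 3 matrix below are made-up toy objects, used only to illustrate numeric, logical and name indexing.

```r
x <- 31:40

# Numeric indexing: the 3rd element.
third <- x[3]
print(third)

# Negative indexing: everything but the 2nd and 4th elements.
print(x[c(-2, -4)])

# Logical indexing: all values greater than 37.
big <- x[x > 37]
print(big)

# Indexing can also replace values: set the selected elements to 0.
y <- x
y[y > 37] <- 0
print(y)

# Name indexing on a named vector (toy names and values).
ages <- c(mark = 30, anna = 25)
print(names(ages))
print(ages["anna"])

# Matrix indexing: an empty slot means "all rows" or "all columns".
m <- matrix(1:9, nrow = 3)
row3 <- m[3, ]
print(row3)

# One slot on a matrix is valid but counts cells column by column.
print(m[3])

# Special functions for matrices.
print(diag(m))
upper <- m[upper.tri(m)]
print(upper)
```

Note how m[3] returns a single cell rather than the third row, which is the kind of unexpected result the text warns about.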

# ***** EXERCISE 2:
# Using name indexing on matrix adj, and the sum() function, calculate Mario's
# outdegree and indegree. HINT: In the sum() function, you need to set
# na.rm=TRUE. See ?sum for more details.
# *****

# ***** EXERCISE 3:
# Use is.na() and logical indexing to recode all NA values in adj to 99.
# *****

7.1 Indexing and subsetting data frames

• List notations. Data frames are a special class of lists. Just like any list, data frames can be indexed in the following 3 ways:
1. [ ] notation, e.g. df[3] or df["variable.name"]. This returns another data frame that only includes the indexed element(s). Note:
– This notation preserves the data.frame class: the result is still a data frame.
– This notation can be used to index multiple elements of a data frame into a new data frame, e.g. df[c(1,3,5)] or df[c("sex", "age")].
2. [[ ]] notation, e.g. df[[3]] or df[["variable.name"]]. This returns the specific element (column) selected, not as a data frame but as a vector with its own type and class, e.g. the numeric vector within the 3rd element of df. Note two differences from the [ ] notation:
– [[ ]] does not preserve the data.frame class. The result is not a data frame.
– Consistently with this, [[ ]] can only be used to index a single element (column) of the data frame, not multiple elements.
3. The $ notation. This is the same as the [[ ]] notation: df$variable.name is the same as df[["variable.name"]]. If variable.name is the name of a specific variable (column) in df, then df$variable.name indexes that variable, and it’s also the same as df[[i]] (where i is the position of the variable called variable.name in the data frame).
• Matrix notation. Data frames can also be indexed like a matrix, with the [ , ] notation:
– df[2,3], df[2, ], df[ ,3].
– df[,"age"], df[,c("sex", "age")], df[5,"age"].
• Keep in mind the difference between the following:
– Extracting a data frame’s variable (column) in itself, as a vector (numeric, character, etc.): the single pepper packet by itself in the figure below (panel C). This is given by df[[i]], df[["variable.name"]], df$variable.name, df[,i], df[,"variable.name"] (with the comma).
– Extracting another data frame of just one variable (column): the single pepper packet within the pepper shaker in the figure below (panel B). This is given by df[i], df["variable.name"].
• You can use logical indexing and the [ , ] notation to extract all the rows (cases) of a data frame that meet a given condition (e.g. all women, all individuals older than 40 years old, all individuals with an income higher than 1000, etc.). This is called subsetting a data frame.
– You can obtain the same results with the subset() function, which is sometimes more intuitive to use.
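To make the difference between the notations concrete, here is a minimal sketch on a toy data frame. The data frame df and its three rows are invented for illustration; they are not the workshop's ego data.

```r
# Toy data frame (made-up values).
df <- data.frame(sex = c(1, 2, 2),
                 age = c(61, 38, 43),
                 income = c(350, 900, 800))

# [ ] keeps the data.frame class.
class.single <- class(df["age"])
print(class.single)

# [[ ]] and $ return the column itself, as a vector.
class.double <- class(df[["age"]])
print(class.double)
same.dollar <- identical(df$age, df[["age"]])
print(same.dollar)

# Matrix notation with the comma also returns the vector.
same.matrix <- identical(df[, "age"], df[[2]])
print(same.matrix)

# Subsetting: all rows (cases) with age greater than 40.
older <- df[df$age > 40, ]
print(older)

# The same subset with subset().
older2 <- subset(df, age > 40)
print(older2)
```

The two subsetting calls select the same cases; subset() lets you write the condition without repeating df$.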

# Load igraph
library(igraph)
# Manually type a simple edge list as a matrix.
elist <- matrix(c("mark","paul", "mark","anna", "theo","kelsie", "mario","anna", "kelsie","mario", "kelsie","anna"), nrow= 6, byrow=TRUE)
colnames(elist) <- c("FROM", "TO")
elist
##      FROM     TO
## [1,] "mark"   "paul"
## [2,] "mark"   "anna"
## [3,] "theo"   "kelsie"
## [4,] "mario"  "anna"
## [5,] "kelsie" "mario"
## [6,] "kelsie" "anna"
# Use that edge list to create a network as an igraph object.
gr <- graph_from_edgelist(elist)
# In igraph, networks are objects of class "igraph".
class(gr)
## [1] "igraph"
# Show the graph.
## Set seed to always get the same vertex layout.
set.seed(609)
## Plot
plot(gr)

[Figure: plot of the graph gr, with vertices paul, theo, mark, kelsie, anna and mario.]
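As a complement to the igraph code above, the same six edges can be turned into an adjacency matrix with base R only. This is a sketch that swaps in plain matrix indexing instead of igraph, and it assumes the edge list is directed (FROM to TO); the object names v, adj, outdeg and indeg are introduced here for illustration.

```r
# The same edge list as above, one edge per row (FROM, TO).
elist <- matrix(c("mark","paul", "mark","anna", "theo","kelsie",
                  "mario","anna", "kelsie","mario", "kelsie","anna"),
                ncol = 2, byrow = TRUE)

# All vertex names, sorted alphabetically.
v <- sort(unique(as.vector(elist)))

# Empty adjacency matrix with named rows and columns.
adj <- matrix(0, nrow = length(v), ncol = length(v), dimnames = list(v, v))

# Two-column numeric index: (row position, column position) of each edge.
idx <- cbind(match(elist[, 1], v), match(elist[, 2], v))
adj[idx] <- 1
print(adj)

# Outdegree = row sums, indegree = column sums.
outdeg <- rowSums(adj)
indeg <- colSums(adj)
print(outdeg["kelsie"])
print(indeg["anna"])
```

Name indexing on the result (e.g. adj["mario", ]) then works as described in the indexing section.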

[Figure: plot of a personal network with 45 vertices, labeled by alter ID (2801 to 2845).]

# Print the graph.
gr
## IGRAPH UNW- 45 259 --
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
##  [1] 2801--2802 2801--2803 2801--2804 2801--2805 2801--2806 2801--2807
##  [7] 2801--2808 2801--2809 2801--2810 2801--2811 2801--2812 2801--2813
## [13] 2801--2814 2801--2815 2801--2818 2801--2820 2801--2823 2801--2825
## [19] 2801--2827 2801--2828 2801--2829 2801--2831 2801--2840 2801--2841
## [25] 2802--2803 2802--2804 2802--2805 2802--2806 2802--2807 2802--2808
## [31] 2802--2809 2802--2810 2802--2811 2802--2812 2802--2813 2802--2815
## [37] 2802--2823 2802--2831 2802--2840 2802--2841 2803--2804 2803--2805
## [43] 2803--2806 2803--2807 2803--2808 2803--2809 2803--2810 2803--2811
## + ... omitted several edges
# The graph is Undirected, Named, Weighted. It has 45 vertices and 259 edges. It
# has a vertex attribute called "name", and an edge attribute called "weight".

# Get the graph from an external edge list, plus a data set with vertex
# attributes.
## Read in the edge list. This is a personal network edge list.
elist <- read.csv("./Data/elist_28.csv")
head(elist)
##   from   to weight
## 1 2801 2802      1
## 2 2801 2803      1
## 3 2801 2804      1
## 4 2801 2805      1
## 5 2801 2806      1
## 6 2801 2807      1
## Read in the vertex attribute data set. This is an alter attribute data set.
vert.attr <- read.csv("./Data/Alter_attributes/alter.data_28.csv")
head(vert.attr)

# Clear the workspace from existing objects.
rm(list=ls())
# What's the current working directory?
# getwd()
# (Delete the leading "#" to actually check your current directory.)
# Change the working directory.
# setwd("my_directory")
# (Delete the leading "#" and type in your actual working directory instead of
# "my_directory": this should be the directory where you saved the "Data" folder
# you downloaded for this workshop. E.g. setwd("/Users/John/Documents/Rworkshop")).

# In a for loop, the index i is assigned the 1st, 2nd, ..., nth value in a
# vector, then after each assignment the loop code is run.
for (i in 1:5) {
  print(i + 1)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
# If we wanted to see what a single iteration does before running the whole
# loop...
i <- 1
print(i + 1)
## [1] 2
# Note that the index can have any name, and the vector can be any kind of
# vector.
for (letter in c("a", "b", "c")) {
  print(letter)
}
## [1] "a"
## [1] "b"
## [1] "c"

# Importing alter attributes
# ...
# Let's use a for loop to import all the alter attribute data.
# To clarify what we are going to do, we first import alter attributes for just
# one ego, e.g. ego ID 28.
alter.attr <- read.csv("./Data/Alter_attributes/alter.data_28.csv")
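The idea of importing many files in a loop can be sketched as follows. This is an assumption-laden illustration, not the workshop's actual import code: it writes two tiny toy csv files to a temporary folder (with made-up columns alter_ID and alter.sex), then reads them back into a list whose names are the ego IDs taken from the file names.

```r
# Temporary folder with two toy files, so the example is self-contained.
tmp <- tempfile("alter_demo")
dir.create(tmp)
for (id in c(28, 29)) {
  toy <- data.frame(alter_ID = id * 100 + 1:3, alter.sex = c(1, 2, 1))
  write.csv(toy, file.path(tmp, paste0("alter.data_", id, ".csv")),
            row.names = FALSE)
}

# All csv files in the folder whose name follows the alter.data_<ID>.csv pattern.
files <- list.files(tmp, pattern = "^alter\\.data_.*\\.csv$", full.names = TRUE)

# Empty list to be filled by the loop.
alter.attr.list <- list()

for (f in files) {
  # Extract the ego ID from the file name.
  ego_ID <- sub("^alter\\.data_(.*)\\.csv$", "\\1", basename(f))
  # Read the file and store it under the ego ID.
  alter.attr.list[[ego_ID]] <- read.csv(f)
}

# One data frame per ego, indexed by ego ID.
print(names(alter.attr.list))
print(alter.attr.list[["28"]])
```

Storing the data frames in a named list makes it possible to retrieve any ego's alters with name indexing, e.g. alter.attr.list[["28"]].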

Writing functions in R

• One of the most powerful tools in R is the ability to write your own functions.
• A function is a piece of code that operates on one or multiple arguments (the input), and returns an output (the function value in R terminology). Everything that happens in R is done by a function.
• Many R functions have default values for their arguments: if you don’t specify the argument’s value, the function will use the default.
• Once you write a function and define its arguments, you can run that function on any argument values you want, provided that the function actually works on those argument values. For example, if a function takes an igraph object as an argument, you’ll be able to run that function on any network you like, provided that the network is an igraph object. If your network is a network object (created by a statnet function), the function will likely return an error.
• Functions, combined with loops or with other methods (more on this in the following sections), are the best way to run exactly the same code on many different objects (e.g. many different ego networks). Functions are crucial for code reproducibility in R. If you write functions, you won’t need to re-write (copy and paste) the same code over and over again: you just write it once in the function, then run the function any time and on any arguments you need. This yields clearer, shorter and more readable code.
• New functions are also commonly used to redefine existing functions by pre-setting the value of specific arguments. For example, if you want all your plots to have red as color, you can take R’s existing plotting function plot, and wrap it in a new function that always executes plot with the argument col="red". Your function would be something like my.plot <- function(...) plot(..., col="red"). (Examples in the following sections.)
• Tips and tricks with functions:
– stopifnot() is useful to check that function arguments are of the type that was intended by the function author. It stops the function if a certain condition is not met by a function argument (e.g. argument is not an igraph object, if the function was written for igraph objects).
– return() allows you to explicitly set the output that the function will return (clearer code). It is also used to stop function execution earlier under certain conditions. Note: If you don’t use return(), the function value (output) is the value of the last expression evaluated in the function code.
– if is a flow control tool that is frequently used within functions: it specifies what the function should do if a certain condition is met at one point.
– First think particular, then generalize. When you want to write a function, it’s a good idea to first try the code on a “real”, specific existing object in your workspace. If the code does what you want on that object, you can then wrap it into a general function to be run on any similar object (see examples in the code below).
• What the following code does.
– We write a function that performs specific calculations on the alter-attribute data frame of a given ego. We then run that function on the alter-attribute data frames of different egos.

# Any piece of code you can write and run in R, you can also put in a function.
# Let's write a trivial function that takes its argument and multiplies it by 2.
times2 <- function(x) {
  x*2
}
# Now we can run the function on any argument.
times2(x= 3)
## [1] 6
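The tips above (default argument values, stopifnot(), if, return(), and wrapping an existing function with a pre-set argument) can be combined in one short sketch. The function names scale_by() and sort_desc() are hypothetical, introduced here only for illustration.

```r
# scale_by() is a hypothetical function: it multiplies x by a factor.
# The argument "factor" has a default value of 2.
scale_by <- function(x, factor = 2) {
  # Stop with an error if the arguments are not of the intended type.
  stopifnot(is.numeric(x), is.numeric(factor))
  # Return early for an empty input.
  if (length(x) == 0) {
    return(numeric(0))
  }
  # Without return(), the value of the last expression is the output.
  x * factor
}

# Default factor.
print(scale_by(3))

# Explicit factor.
print(scale_by(1:3, factor = 10))

# A wrong argument type stops the function; tryCatch() captures the error.
res <- tryCatch(scale_by("a"), error = function(e) "stopped: x is not numeric")
print(res)

# Wrapping an existing function with a pre-set argument:
# sort_desc() always calls sort() with decreasing = TRUE.
sort_desc <- function(...) sort(..., decreasing = TRUE)
print(sort_desc(c(2, 5, 1)))
```

The wrapper follows the same pattern as the my.plot example in the text: all arguments are passed through the dots, and one argument is fixed.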

  # Return the data frame
  return(freq)
}
# Now we can run exactly that code on the alter attribute data frame of any ego.
# Ego 60
table.nat(alter.attr.list[["60"]])
##   Nationality Freq
## 1  Sri Lankan   33
## 2     Italian    1
## 3       Other   11
# Ego 85
table.nat(alter.attr.list[["85"]])
##   Nationality Freq
## 1  Sri Lankan   44
## 2     Italian    1
# Ego 102
table.nat(alter.attr.list[["102"]])
##   Nationality Freq
## 1  Sri Lankan   30
## 2     Italian   15

# ***** EXERCISE:
# Write a function that takes an adjacency matrix as argument. The function
# returns the number of "probable" edges in the corresponding personal network.
# An edge is "probable" if the corresponding cell value in the adjacency matrix
# is 2. These personal networks are undirected, so you should only consider the
# upper triangle of the matrix, see ?upper.tri. Run this function on the
# adjacency matrices of ego ID 47, 53 and 162.
# HINT: You should first try the code on one adjacency matrix from the list
# adjacency.list. Remember that the ego IDs are in names(adjacency.list).
# *****

4 The apply family of functions

• The apply family of functions is a collection of R functions designed to execute the same other function on multiple elements of a vector, matrix, or list.
• This is the same idea as a for loop. In fact, the apply-like functions are the “R way” to loops. In R, apply functions are often more efficient and need shorter code (which means higher readability and reproducibility), although they do essentially the same thing as a for loop. So, in R you should prefer apply-like functions over for loops whenever that’s possible.
• We will consider 3 of these functions:
– lapply: Takes an object x (vector or list) and a function FUN as arguments. Applies FUN to every element of x. FUN can be an already existing function, or a new function that you created. Returns the results in a new list (lapply stands for list-apply).
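To see how lapply relates to a for loop, the sketch below runs the same calculation three ways on a toy list. The list closeness.list, its values, and the function prop.high() are all made up for illustration; sapply is used here as the variant of lapply that simplifies the result to a vector.

```r
# Toy list: one numeric vector per (made-up) ego ID.
closeness.list <- list("28" = c(1, 3, 5), "29" = c(2, 4), "32" = c(5, 5, 5, 1))

# 1. With a for loop: fill an empty list element by element.
res.loop <- list()
for (id in names(closeness.list)) {
  res.loop[[id]] <- mean(closeness.list[[id]])
}

# 2. With lapply: same result, one line, returned as a list.
res.lapply <- lapply(closeness.list, mean)
print(res.lapply)

# 3. With sapply: same values, simplified to a named vector.
res.sapply <- sapply(closeness.list, mean)
print(res.sapply)

# FUN can also be a function you wrote yourself.
# prop.high() is hypothetical: proportion of values above a cutoff.
prop.high <- function(x, cutoff = 3) {
  mean(x > cutoff)
}
res.prop <- sapply(closeness.list, prop.high)
print(res.prop)
```

The names of the input list are carried over to the output, which is convenient when the names are ego IDs.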

# In fact, whatever we can do on an element of a list, we can put in a function
# and do it simultaneously on all the elements of that list via lapply/sapply.
# Put the code above into a function.
family.deg <- function(gr) {
  deg <- degree(gr, V(gr)[relation==1])
  mean(deg)
}
## Run the function on all the 102 personal networks.
family.degrees <- sapply(graph.list, family.deg)
# This generated a vector with average degree of close family members for all
# egos. Vector names are ego IDs.
head(family.degrees)
##       28       29       33       35       39       40
## 14.50000 21.80000 14.80000 19.25000 14.33333 24.00000

# ***** EXERCISE 1:
# Write a function to calculate the max betweenness of an alter in a personal
# network. Use a graph in graph.list to try the function. Then lapply the
# function to graph.list. Sapply the function to graph.list. Which is better in
# this case, lapply or sapply?
# HINTS: ?betweenness
# *****

# ***** EXERCISE 2:
# Write a function that returns the nationality of the alter with maximum
# betweenness in a personal network. Use a graph in graph.list to try the
# function. Finally, sapply the function to graph.list.
# HINT: If multiple alters have betweenness equal to the maximum betweenness,
# just pick the first alter using indexing, i.e. [1].
# *****

5 Advanced tools for split-apply-combining

• In many different scenarios we use the so-called “Split-Apply-Combine” strategy. Split-apply-combining is what we do whenever we have a single file or object with all our data and:
1. We split the object into pieces according to one or multiple (combinations of) categorical variables (split).
2. We apply exactly the same kind of calculation on each piece, identically and independently (apply).
3. We pick the results and re-combine them together, e.g. into a new dataset (combine).
• The split-apply-combine strategy is essential in ego-network analysis. With ego-networks, we are repeatedly (1) splitting the data into pieces, each piece typically corresponding to one ego; (2) performing identical and independent analyses on each piece (each ego-network); (3) recombining the results together, typically into a single dataset, to then associate them with other ego-level variables.
• for loops (see previous section) and apply-like functions (see previous section) are one way to perform split-apply-combining in R. In this section we’ll look at more advanced ways, which are often quicker and more efficient:
– The aggregate function.
– The plyr package.
– The dplyr and purrr packages.
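The three steps can be sketched in base R on a toy alter-level data frame. The closeness values below are made up for illustration; only the split/sapply and aggregate() calls are the point, and the plyr, dplyr and purrr alternatives are not shown here.

```r
# Toy alter-level data: one row per alter, with the ID of its ego.
alters <- data.frame(ego_ID = c(28, 28, 29, 29, 29),
                     closeness = c(1, 3, 2, 4, 6))

# Split: one vector of closeness values per ego.
pieces <- split(alters$closeness, alters$ego_ID)
print(pieces)

# Apply and combine: mean closeness per ego, as a named vector.
means <- sapply(pieces, mean)
print(means)

# The same three steps in one call with aggregate().
by.ego <- aggregate(closeness ~ ego_ID, data = alters, FUN = mean)
print(by.ego)

# The combined result can then be joined with ego-level variables.
ego <- data.frame(ego_ID = c(28, 29), age = c(61, 38))
ego.merged <- merge(ego, by.ego, by = "ego_ID")
print(ego.merged)
```

aggregate() returns a data frame with one row per ego, which is why it merges directly with ego-level data.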

  # Set "white" as background color for the plot.
  par(bg="white")
  # Plot the graph
  plot(gr)
}
# Let's print an example of my.plot() to the R GUI by plotting the personal
# network of ego ID 45.
## Set seed for reproducibility (so we always get the same network layout).
set.seed(613)
## Plot
my.plot(graph.list[["45"]])

# Let's create a subfolder called "Figures" in our working directory, where all
# the figures created in this section will be saved. (Only if the directory
# does not already exist).
if (!dir.exists("Figures")) dir.create("Figures")

# Let's now plot the first 10 personal networks. We won't print the plots in the
# GUI, but we'll export each of them to a separate png file.
for (i in 1:10) {
  # Get the graph
  gr <- graph.list[[i]]
  # Get the graph's ego ID
  ego_ID <- names(graph.list)[i]
  # Set seed for reproducibility (so we always get the same network layout).
  set.seed(613)
  # Open png device to print the plot to an external png file (note that the ego
  # ID is written to the file name).
  png(file= paste("./Figures/plot.", ego_ID, ".png", sep=""), width= 800, height= 800)

[Figure: scatterplot of average degree of Italians (y axis) against N Italians (x axis), with points labeled by ego ID.]

## Print plot to external pdf file
pdf("./Figures/N.ita.deg.ita.pdf", width= 8, height= 6)
print(p)
dev.off()
## pdf
##   2

# Same as above, with color representing ego's educational level.
## Set seed for reproducibility (jittering is random.)
set.seed(613)
## Get and save plot
p <- ggplot(data= data, aes(x= N.ita, y= avg.deg.ita)) +
  geom_point(size=1.5, aes(color= as.factor(educ)), position=position_jitter(height=0.5, width=0.5)) +
  theme(legend.position="bottom") +
  labs(x= "N Italians", y= "Avg degree of Italians", color= "Education")
## Note that we slightly jitter the points to avoid overplotting.
## Print plot in R GUI
print(p)

[Figure: scatterplot of the proportion of Italians in the personal network (y axis) against year of arrival in Italy (x axis).]

## Print plot to external pdf file
pdf("./Figures/prop.ita.arr.pdf", width= 8, height= 6)
print(p)
dev.off()
## pdf
##   2

# Do Sri Lankans with more Italians in their personal network have a higher
# income?
## To avoid warnings from ggplot(), let's remove cases with NA values on the
## relevant variables.
index <- complete.cases(ego.data[,c("prop.ita", "income")])
data <- ego.data[index,]
## Set seed for reproducibility (jittering is random.)
set.seed(613)
## Get and save plot
p <- ggplot(data= data, aes(x= prop.ita, y= income)) +
  geom_jitter(width= 0.01, height= 50, shape= 1, si
## Print plot in R GUI
print(p)
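The step that removes cases with NA values can be checked on a toy data frame. The sketch below uses invented values in a data frame called ego.toy (a hypothetical stand-in for ego.data) to show what complete.cases() returns and how the logical index selects rows.

```r
# Toy ego-level data with some missing values (made-up numbers).
ego.toy <- data.frame(ego_ID = c(28, 29, 32, 33),
                      prop.ita = c(0.1, NA, 0.3, 0.2),
                      income = c(350, 900, NA, 200))

# TRUE for rows with no NA on the two relevant variables.
index <- complete.cases(ego.toy[, c("prop.ita", "income")])
print(index)

# Logical indexing on rows keeps only the complete cases.
data.complete <- ego.toy[index, ]
print(data.complete)

# Number of cases kept and dropped.
n.kept <- sum(index)
n.dropped <- sum(!index)
print(c(kept = n.kept, dropped = n.dropped))
```

Restricting complete.cases() to the plotted variables keeps rows that have NA values only on other, irrelevant variables.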