At the recent Third Global Symposium for Health Systems Research, Jeff Knezovich and I asked participants to complete an online network survey. Our aim was to map the networks (social, collaboration, information seeking) of conference participants. We ran into technical glitches with the online tool, slow internet access, and the usual apathy towards completing a survey. The experience confirmed:
- Network mapping, especially in internet-limited settings, should be done off-line;
- Network mapping (probably anywhere) should be done face-to-face. Otherwise respondents are unlikely to respond;
- One should always pilot their data collection tools!
The idea of mapping the network seemed to resonate, but in total we had only 71 responses (give or take; cleaning the data exported from the app was more grueling than a hike up Table Mountain!). Nevertheless, let’s see what these data look like. I will be running the analysis in RStudio, using the statnet suite of packages. Feel free to download the .csv files and work along with me.
Step 1: Make sure the packages are installed
Install the latest version of ergm and sna:
install.packages('ergm'); install.packages('sna')
library(ergm)
library(sna)
I also changed the color palette because I don’t like missing node attributes to be colored as black. It just doesn’t look nice. As a note, these colors are also good for color-blindness, and for printing in grey-scale.
col.list <-c("white", "darkblue", "cornflowerblue", "darkorange1", "darkred")
palette(col.list)
Step 2: Import data and convert to network
ONASurveys.com exported the data as an edge-list, and I had to do some extensive work to delete duplicate names, ensure IDs matched, etc. But you can use the cleaned files. There is one for each network, as well as an attributes file that is used for all the networks.
Save the files somewhere and set that folder as your working directory in R.
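setwd("~/Dropbox/Dissertation_jan4/Conferences/Cape Town 2014")  # change to your own folder
# Node attributes (shared by all networks) and the social edge list
attr <- read.csv(file="attr.csv", header=TRUE, stringsAsFactors=FALSE)
social <- read.csv(file="social.csv", header=TRUE, stringsAsFactors=TRUE)
# Edge list -> network -> adjacency matrix -> network with attributes attached
spre <- network(social, matrix.type="edgelist", directed=FALSE)
smat <- as.matrix(spre)
snet <- network(smat, matrix.type="adjacency", directed=FALSE, vertex.attr=attr)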
Note that I had to coerce the edgelist into a network object, then into a matrix (to match up with the attributes), then back into a new adjacency-type network object with attributes attached.
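The round trip described above can be sketched on a toy example. The four-node edge list and the region values below are made up for illustration (they are not from the conference data), and attaching the attribute with `%v%` is just one option, not necessarily the call used on the real files.

```r
library(network)

# Hypothetical toy edge list: three undirected ties among four people
toy_edges <- rbind(c(1, 2), c(2, 3), c(3, 4))

# Hypothetical attribute table, one row per vertex ID
toy_attr <- data.frame(id = 1:4,
                       region = c("Africa", "Europe", "Europe", "Americas"),
                       stringsAsFactors = FALSE)

# 1. Edge list -> network object
toy_pre <- network(toy_edges, matrix.type = "edgelist", directed = FALSE)

# 2. Network -> adjacency matrix (rows and columns follow vertex order)
toy_mat <- as.matrix(toy_pre)

# 3. Adjacency matrix -> new network, then attach the attribute
toy_net <- network(toy_mat, matrix.type = "adjacency", directed = FALSE)
toy_net %v% "region" <- toy_attr$region

network.size(toy_net)        # number of vertices
network.edgecount(toy_net)   # number of ties
toy_net %v% "region"         # attribute values in vertex order
```

The key point is that the attribute rows must be in the same order as the vertices, which is what the detour through the adjacency matrix makes easy to check.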
summary(snet)
The summary() command shows us the network vertices, edges and density. Note that this is the entire network of all 1515 participants, 90% of whom did not complete the survey. Let’s plot that network to see what it looks like (and then we’ll get rid of the isolates).
Step 3: Plot the network
plot.network(snet, edge.col="darkgrey", vertex.border="black")
You can see a cluster of activity in the center, with edges in grey. But otherwise this is not a helpful graph. Let’s delete isolates and check out the summary stats.
sno_iso <- delete.vertices(snet, which(degree(snet)<1))
Run the summary(sno_iso) command to see ALL the details, or simply:
centralization(sno_iso, degree, mode="graph")
## [1] 0.1035052
network.density(sno_iso)
## [1] 0.007237999
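As a rough sanity check on that density figure: for an undirected network, density is the number of observed ties divided by the number of possible pairs, n(n-1)/2. The sketch below works backwards from two numbers reported in this post (the density above, and the 59685 dyads that the ergm output further down lists as degrees of freedom). The implied node and tie counts are a back-calculation, not values printed by summary().

```r
dens    <- 0.007237999   # network.density(sno_iso), from above
n_dyads <- 59685         # null-deviance degrees of freedom in the ergm output

# Solve n * (n - 1) / 2 = n_dyads for n
n_nodes <- (1 + sqrt(1 + 8 * n_dyads)) / 2
n_nodes                      # implied number of non-isolated nodes

# Density = ties / possible pairs, so ties = density * pairs
n_ties <- round(dens * n_dyads)
n_ties                       # implied number of ties

n_ties / choose(n_nodes, 2)  # reproduces the reported density up to rounding
```

Under these assumptions the network has 346 nodes and 432 ties.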
Still a pretty sparse network! (Although these are incomplete data, so we can’t say much about the actual network density or centralization). The exact question was: “who did you, or do you plan to have lunch or dinner with during the conference.” One can imagine that there are likely clusters of friends/colleagues who are likely to socialize, and fewer connections between these clusters. And what drives the propensity to form social ties? Let’s look at a few graphs before we test hypotheses in ergm models.
Do people socialize with others from their region?
s2coord <- plot.network(sno_iso, edge.col="darkgrey", vertex.border="black")
plot.network(sno_iso, coord=s2coord, vertex.col="region", edge.col="darkgrey", vertex.border="black")
legend("bottomleft", legend=c("Africa", "Americas", "Europe", "South-East Asia", "Unknown"), pch=21,
cex=1, pt.bg=c("darkblue", "cornflowerblue", "darkorange1", "darkred", "white"))
Again, it is somewhat difficult to tell with the missing attribute data, but it doesn’t seem as though there is clustering by region. (On a side note, if these data were very important, I could look up all the alters’ regions. For smaller networks this would certainly be worth it).
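One quick way to put a number on this is to count how many ties connect two people from the same region, after setting aside ties that involve an unknown region. The edge list and region codes below are invented for illustration. The result is only a raw proportion, not a test against chance.

```r
# Hypothetical toy data: five people, NA marks an unknown region
region <- c("Africa", "Africa", "Europe", NA, "Europe")
edges  <- data.frame(from = c(1, 1, 2, 3, 4),
                     to   = c(2, 3, 4, 5, 5))

# Look up the region at each end of every tie
r_from <- region[edges$from]
r_to   <- region[edges$to]

# Keep only ties where both regions are known
known <- !is.na(r_from) & !is.na(r_to)

# Share of fully observed ties that stay within one region
same_region <- r_from[known] == r_to[known]
sum(known)          # ties with both regions known
mean(same_region)   # proportion of those that are within-region
```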
Is sociality based on similar organization?
Hmm… first of all, we see that most of our respondents are from research organizations. Second, they seem to be more central in the network. Are they more likely than chance to eat lunch with other researchers? We will find out soon. But first, let’s examine by age.
Finally, which nodes are in the most strategic position to broker other nodes? This is measured by betweenness centrality, and can be applied to understand who the brokers are, and how to most efficiently disseminate ideas or information. We will calculate the betweenness centrality scores for all nodes, and then size our graphed nodes according to their betweenness.
s2between<-betweenness(sno_iso, g=1, gmode="graph", cmode="undirected")
plot.network(sno_iso, coord=s2coord, vertex.col="region", vertex.cex=s2between/1500, edge.col="darkgrey", vertex.border="black")
legend("bottomleft", legend=c("Africa", "Americas", "Europe", "South-East Asia", "Unknown"), pch=21,
cex=1, pt.bg=c("darkblue", "cornflowerblue", "darkorange1", "darkred", "white"))
The most strategically located brokers are from South-East Asia.
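To see what betweenness picks up, here is a minimal sketch on a made-up network in which one node bridges two otherwise separate triangles. The adjacency matrix is hypothetical. The rescaling at the end is just one alternative to dividing by a fixed constant such as 1500.

```r
library(sna)

# Hypothetical toy network: triangles 1-2-3 and 5-6-7 joined through node 4
adj <- matrix(0, nrow = 7, ncol = 7)
ties <- rbind(c(1, 2), c(1, 3), c(2, 3),
              c(3, 4), c(4, 5),
              c(5, 6), c(5, 7), c(6, 7))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1   # mirror the ties so the matrix is symmetric

# Betweenness for an undirected graph, with the same arguments as above
btw <- betweenness(adj, gmode = "graph", cmode = "undirected")
btw

# Rescale to a plotting size between 0.5 and 3
node_size <- 0.5 + 2.5 * btw / max(btw)
which(btw == max(btw))   # the bridging node
```

Node 4 scores highest because every shortest path between the two triangles runs through it, while the nodes that only sit inside a triangle score zero.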
Step 4: Construct ergm models to test hypotheses about why conference participants socialize with each other
Ok, now let’s examine these in ergm models. Exponential random graph models (ergm) are a class of models, closely related to logistic regression, that allow us to test hypotheses related to dyads, i.e., network ties/edges. See the statnet website for a list of ergm resources. I highly recommend “Birds of a Feather, or Friend of a Friend?” by Goodreau, Kitts and Morris (2009), both as a master class in ergm modeling and as a wonderful application to adolescent friendship networks.
Unlike traditional statistical models, where the covariates are some function of the units of analysis, ergm models allow us to also represent covariates that are functions of the network itself. I usually build my ergms in two waves: 1. A set of attribute-only models where covariates are tested separately and then added to the final model stepwise if they improve model fit; 2. A set of structural-only models (following the same process as above).
Starting with the attributes, let’s test each in a model with an edges term, which is like an intercept in a traditional regression model.
smodel.02 <- ergm(sno_iso ~ edges+nodematch("region"))
summary(smodel.02)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: sno_iso ~ edges + nodematch("region")
##
## Iterations: 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges -3.66743 0.05216 NA <1e-04 ***
## nodematch.region -3.02832 0.14465 NA <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 82741 on 59685 degrees of freedom
## Residual Deviance: 4375 on 59683 degrees of freedom
##
## AIC: 4379 BIC: 4397 (Smaller is better.)
The nodematch.region coefficient reports the change in log odds of a tie existing between any two given nodes if they share the same region (as compared to not sharing a region). Exponentiating -3.02832 gives an odds ratio of 0.0484: two actors are far less likely to socialize if they are from the same region. BUT! Big caveat: so much of the alter data is missing that the majority of ties are between known regions and unknown regions (i.e., different regions).
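The arithmetic behind that interpretation can be sketched in a few lines of base R. The coefficients are copied from the output above. Turning them into probabilities with plogis() relies on this particular model containing no structural terms, so each tie can be treated like an observation in an ordinary logistic regression.

```r
# Coefficients copied from the summary(smodel.02) output above
b_edges <- -3.66743
b_match <- -3.02832

# Odds ratio for a same-region pair relative to a different-region pair
exp(b_match)

# With no structural terms in the model, each tie's probability is the
# logistic function of the summed coefficients
p_diff <- plogis(b_edges)            # pair coded as different regions
p_same <- plogis(b_edges + b_match)  # pair coded as the same region
c(different = p_diff, same = p_same)
```

In probability terms that is roughly 0.025 for a different-region pair against roughly 0.001 for a same-region pair, subject to the same missing-data caveat.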
Let’s look at some models with structural covariates. What are these magical structural covariates? They are underlying social processes that have been documented empirically to occur more often than chance alone would predict. Today we will examine transitivity, or triangle formation, which describes the propensity for people to form relationships with ‘friends of friends.’ Transitivity has many implications. Think of triangles, literally, as cliques. Cliques might be fun for lunch, but they are not conducive to exposure to new ideas, innovation, behavior or policy change, etc. In the first model I test whether social ties are more likely to exist if they close a triangle. We expect that a person is more likely to socialize with their friends’ friends.
*A note about missing edge data: While our structural models will not be affected by missing attribute data, they will be affected by missing edge data. We only know the edges of respondents, not the edges of the alters.
smodel.06 <- ergm(sno_iso ~ edges+gwesp)
summary(smodel.06)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: sno_iso ~ edges + gwesp
##
## Iterations: 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges -5.19997 0.05653 0 <1e-04 ***
## gwesp 0.99650 0.17815 0 <1e-04 ***
## gwesp.alpha 1.18096 0.10367 0 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 82741 on 59685 degrees of freedom
## Residual Deviance: 5011 on 59682 degrees of freedom
##
## AIC: 5017 BIC: 5044 (Smaller is better.)
The gwesp (geometrically weighted edgewise shared partners) term measures the change in log odds of a tie forming between two nodes given that this tie will close a triangle. Yes, even with missing edges, ties are more likely to exist if they close a triangle between three nodes. This is the result we expected for sociality, but let’s check to see whether this happens with collaboration ties (cno_iso is the collaboration counterpart of sno_iso). We would expect that people are more likely to collaborate with their collaborators’ collaborators.
cmodel.06 <- ergm(cno_iso ~ edges+gwesp)
summary(cmodel.06)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: cno_iso ~ edges + gwesp
##
## Iterations: 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges -4.96450 0.06448 0 < 1e-04 ***
## gwesp 0.93190 0.25254 0 0.000225 ***
## gwesp.alpha 1.27732 0.12876 0 < 1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 50343 on 36315 degrees of freedom
## Residual Deviance: 3729 on 36312 degrees of freedom
##
## AIC: 3735 BIC: 3761 (Smaller is better.)
Indeed, people are much more likely to collaborate with their collaborators’ collaborators (based on our incomplete data).
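To get a feel for what the gwesp.alpha estimates imply, recall that the geometrically weighted statistic gives each additional shared partner a smaller weight than the one before: the k-th shared partner on a tie adds (1 - exp(-alpha))^(k - 1) to the statistic. The sketch below simply evaluates that expression for the two alpha estimates reported above. The helper name gwesp_increment is made up for this illustration and is not part of ergm.

```r
# Decay estimates copied from the two model summaries above
alpha_social <- 1.18096
alpha_collab <- 1.27732

# Contribution of the k-th shared partner to the gwesp statistic
# (hypothetical helper, written out for illustration)
gwesp_increment <- function(k, alpha) (1 - exp(-alpha))^(k - 1)

k <- 1:5
round(rbind(social = gwesp_increment(k, alpha_social),
            collab = gwesp_increment(k, alpha_collab)), 3)
```

The first shared partner always counts fully, and each further one counts for less, so the first closed triangle matters most.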
There are a few ways to deal with missing edge data:
- We could have asked the respondents to report their alters’ edges (this is typical in ego-network sampling, but its accuracy depends on the relationship being measured)
- We can remove the nodes we didn’t interview (and thus their edges). This will leave us with a network of complete edges, but not a complete network. I.e., the network we will be left with is not a real network. But neither is the missing edges network…
- We could try to impute edges based on attribute data. Wait! This is what we’d do if we didn’t know that edges are predicted not just on attributes, but also on network structure! Network dependencies make it difficult to impute edges. Man, these networks! Next time
Finally, we could mine existing data (i.e., citation data, Twitter data) to construct relevant networks. Maybe next week…
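The second option in the list above (dropping the people we did not interview) can be sketched as follows. The toy network and the respondent flag are hypothetical. In the real data the flag would have to come from the survey records.

```r
library(network)

# Hypothetical toy network: nodes 1-3 answered the survey, 4-5 were only named
adj <- matrix(0, 5, 5)
ties <- rbind(c(1, 2), c(1, 4), c(2, 3), c(3, 5))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1   # mirror the ties so the matrix is symmetric
net <- network(adj, matrix.type = "adjacency", directed = FALSE)
net %v% "respondent" <- c(TRUE, TRUE, TRUE, FALSE, FALSE)

# Keep only respondents; ties to non-respondents are dropped with them
resp_net <- get.inducedSubgraph(net, v = which(net %v% "respondent"))

network.size(resp_net)        # respondents retained
network.edgecount(resp_net)   # ties among respondents only
```

In this toy case two of the four ties disappear along with the two non-respondents, which shows how much of the network this option can discard.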