Social Network Analysis for Social Scientists (Spring 2017)

This proseminar, co-led by ISS Fellows Chris Smith and Robert Faris, will provide a broad overview of social network analysis, beginning with basic network visualization and customization and concluding with advanced modeling and the visualization of complex data.

ISS Fellows: Chris Smith (Sociology) and Robert Faris (Sociology).

Social network analysis (SNA) is a method for investigating social structures through the use of network and graph theories. It is used across a wide range of disciplines, from biology to sociology. The techniques covered in this proseminar will be applicable to any number of data types and disciplines. In addition to the primer on social network data, visualization, and analysis, this proseminar will rely heavily on the use of the free statistical and graphical platform R.

SOC 298 | Wednesdays, 3:40 - 5:40 p.m. | Andrews Room, 2203 SS&H | Spring 2017 | CRN 89045

 

Topics will include:

-       Introduction to SNA and its application across the social sciences and beyond

-       R: installation, syntax, etc.

-       Relational Data (common data structures, including sociomatrices, edgelists, and affiliation data)

-       Basic Visualization (including techniques for customization)

-       Graph-Level Indices (components, density, centralization measures)

-       Node-Level Indices (degree, k-core, distance, betweenness)

-       Two-Mode Networks

-       Exponential Random Graph Models (ERGM) (parameters, convergence, goodness of fit)

-       Advanced Modeling, including Stochastic Actor-Oriented Models (R-Siena)

-       Big Data and Advanced Visualization

_____________________________________________________________________________

 

Proseminar blog

 

1. Introduction

April 5, 2017

The students in our group represent departments and graduate groups including: Agriculture and Resource Economics (ARE), Anthropology, Communication, Geography, Linguistics, Political Science, Psychology, and the Graduate School of Management.

After introducing ourselves, our research, and our experience with R coding (ranging from never having downloaded the software to having used it for years), our fearless leaders taught us the nuts and bolts of social network analysis. Using example networks of friendships, criminology, and sexually transmitted diseases, we learned SNA keywords including: nodes, edges, components, and isolates (Figure 1).

We then played Six Degrees of Kevin Bacon. Try it out: go to Google and type “Bacon Number” followed by a name. The challenge is to find someone who has a Bacon Number of 4 or more. We succeeded in finding two such names in our class: Adolf Hitler and Ivanka Trump. In the case of Ivanka, for example, she and Jamie Johnson appeared in Born Rich together; Jamie Johnson and Paul Weaver appeared together in Arbitrage; Paul Weaver and Sarah Jessica Parker appeared together in New Year’s Eve; and Sarah Jessica Parker and Kevin Bacon appeared together in Footloose.
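A Bacon Number is the length of a shortest path in a co-appearance network. As a minimal sketch (written in plain Python rather than the R used in class, and using only the four links named in this post, so the network is an illustration and not real filmography data), a breadth-first search counts the steps:

```python
from collections import deque

# Co-appearance ties taken only from the chain described in this post.
co_appearances = [
    ("Ivanka Trump", "Jamie Johnson"),         # Born Rich
    ("Jamie Johnson", "Paul Weaver"),          # Arbitrage
    ("Paul Weaver", "Sarah Jessica Parker"),   # New Year's Eve
    ("Sarah Jessica Parker", "Kevin Bacon"),   # Footloose
]

# Build an undirected adjacency list: appearing together has no direction.
neighbors = {}
for a, b in co_appearances:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)


def bacon_number(person, target="Kevin Bacon"):
    """Number of ties on a shortest path to the target, or None if unreachable."""
    if person not in neighbors:
        return None
    seen = {person}
    queue = deque([(person, 0)])
    while queue:
        current, steps = queue.popleft()
        if current == target:
            return steps
        for nxt in neighbors[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


print(bacon_number("Ivanka Trump"))  # 4
```

With a fuller network the search would work the same way; it would simply have more ties to explore and might find a shorter chain.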

Following a review of the seminar’s schedule, we had lab time to download and/or update RStudio, work through some syntax and basic R coding, and install and/or update the statnet package, which we’ll be using later in the course. RStudio pro-tip from class: you know how lines in a script can run on forever? You can change your settings so that long lines always wrap to fit the script window (the wrapping adjusts when you resize the window). To make this change, open the Tools drop-down menu and select “Global Options”. Then click on “Code” and check the box next to “Soft-wrap R source files”. Boom! See you next week!

References

Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris (2003). statnet: Software tools for the Statistical Modeling of Network Data. URL http://statnetproject.org

RStudio Team (2016). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA. URL http://www.rstudio.com/

 

2. Relational Data

April 12, 2017

This week’s seminar introduced students to relational data. We started by reviewing nodes and the difference between undirected and directed ties. Undirected ties are represented by lines between nodes, with no distinction between sender and receiver (a tie is simply present or absent). Directed ties are represented by arrows and indicate which node is the sender and which node is the receiver. For example, if the arrow points from node A to node B (Figure 2a), node A is the sender; if the arrow points from node B to node A (Figure 2b), node A is the receiver; and if the arrow is double-headed and points to both node A and node B (Figure 2c), both nodes are senders and receivers. Between any pair of nodes, a directed tie can be asymmetric (one sender, one receiver), mutual (both nodes send and receive), or absent.

We were then introduced to two data structures used to record relational information for social network analysis: the sociomatrix and the edgelist. The sociomatrix is square, with the nodes listed as both row names and column names. The nodes must be in the same order in the rows and the columns. Each cell in the sociomatrix indicates the presence (1) or absence (0) of a tie. Because the rows and columns list the same nodes, each cell on the diagonal pairs a node with itself; the diagonal consists of 0s, showing that a node is not connected to itself. Also important to note in the sociomatrix (for directed ties) is that rows represent the senders and columns represent the receivers. So, for example, if node A and node B share a mutual directed tie, the cells (A,B) and (B,A) will both contain a 1. The other data structure covered was the edgelist, which is a series of rows, each one representing one tie in the network. The edgelist has two columns, Sender and Receiver. As in the previous example, if node A and node B share a mutual directed tie, the edgelist will contain two rows, the first ordered (A,B) and the second ordered (B,A). In cases of undirected ties, the order of nodes within a row of the edgelist does not matter.
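The two structures hold the same information, which a short round trip can show. This is a sketch in plain Python (not the R lab script); the three-node network and the function names are made up for illustration:

```python
nodes = ["A", "B", "C"]

# Directed edgelist as (sender, receiver) rows.
# A and B share a mutual tie; B also sends a tie to C.
edgelist = [("A", "B"), ("B", "A"), ("B", "C")]


def edgelist_to_sociomatrix(nodes, edgelist):
    """Rows are senders, columns are receivers; 1 marks a tie, 0 its absence."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for sender, receiver in edgelist:
        matrix[index[sender]][index[receiver]] = 1
    return matrix


def sociomatrix_to_edgelist(nodes, matrix):
    """One (sender, receiver) row for every cell that contains a 1."""
    return [
        (nodes[i], nodes[j])
        for i in range(len(nodes))
        for j in range(len(nodes))
        if matrix[i][j] == 1
    ]


def dyad_type(nodes, matrix, a, b):
    """Classify the pair (a, b) as mutual, asymmetric, or absent."""
    i, j = nodes.index(a), nodes.index(b)
    ties = matrix[i][j] + matrix[j][i]
    return {0: "absent", 1: "asymmetric", 2: "mutual"}[ties]


sociomatrix = edgelist_to_sociomatrix(nodes, edgelist)
for name, row in zip(nodes, sociomatrix):
    print(name, row)
```

The row for B reads `[1, 0, 1]`: B sends to A and to C, and the 0 in the middle is the diagonal cell pairing B with itself.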

We also touched briefly on attributes that can be assigned to the nodes in a social network, noting that the most important piece of the attribute table is a unique ID for each node that corresponds to the IDs used in the edgelist ties. Attributes must be in the same order as the edgelist information. Finally, we talked about where to find social network data, which is basically everywhere! Examples include: the internet, observations, surveys, archival records, firm rosters, official police records, etc. After our lecture, we worked through a lab, practicing building networks in R using both sociomatrices and edgelists. RStudio pro-tip of the week: You know how you make comments in R scripts using “#”, and if you put four #, followed by your comment, followed by four #, it creates a little drop-down arrow so you can collapse or expand code? Well, did you know there’s an outline feature in RStudio?! In your RStudio window, at the top of your script, click on the little outline icon and an outline with your nested headings/notes will pop out (Figure 3). Click on a heading to automatically jump to that section of your script. AHH-mazing!

Student Spotlight

This week we also started student and faculty spotlights. Matt Thompson, a graduate student in Sociology, presented briefly on work relating to institutional nomination data. He created networks of colleges and universities using publicly available information and data that included attributes such as geography, size, race, gender, ranking, etc. I don’t want to give away all of Matt’s secrets, so if you’re interested in the awesome stuff he’s doing, you can contact him here: mthomp@ucdavis.edu

 

3. Two-Mode Networks

April 19, 2017

This week, we continued exploring and practicing network analysis, moving from one-mode networks, such as connections between people, to two-mode networks, such as connections between people (mode 1) and the events they attend (mode 2). Examples of two-mode networks are kids at a birthday party, states in an international crisis, or people arrested together in the same police operation. In a two-mode network, people are connected to events and events are connected to people, and we assume that co-presence at the same event means that the actors are connected. In other words, an edge, or connection, between two actors exists only if they were present at the same event. Be aware, however, that there is a potential weakness in this approach: if, for example, you and I were in the same class this week, this does not necessarily mean that we are acquainted; and if you and I were not in the same class this week, we may still be friends. Despite this potential weakness of assuming that co-presence at an event equals connection, in many cases there is simply no better, practically feasible way to identify connections between actors of interest.

To make the two-mode data useful, we also learned how to project two-mode data onto one-mode data, and back. As mentioned before, in a two-mode network we have people connected to events, and events connected to people. When we project to a one-mode network, we can choose to portray a network of people connected to people, a network of events connected to events, or both. Our professor demonstrated this projection using examples from her own research and data on a network of co-arrests.

We also practiced making the projections in R using the igraph package. Following a very detailed and friendly R script (even for R novices), we converted raw data on actors and events into a two-mode network, projected it into a one-mode network of actors and a one-mode network of events, and visualized each of the networks to see how actors and events are interconnected (the two-mode network), how actors are connected (a one-mode network), and how events are connected (a one-mode network). All this was done using only ten rows and two columns of raw data. The events and actors in that data had no names or other empirical identification, so everyone in our multi-disciplinary class could apply the things we learned to his or her own research. Network analysis might seem complicated, but we are only in the third week and can already make some of the magic ourselves!
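The projection step can be sketched without any network package. The version below is plain Python rather than the igraph script from lab, and the five actor-event rows and the `project` function are hypothetical stand-ins for the class data:

```python
from itertools import combinations

# Hypothetical two-column raw data: one (actor, event) row per attendance.
rows = [
    ("a1", "e1"), ("a2", "e1"),
    ("a2", "e2"), ("a3", "e2"),
    ("a4", "e3"),
]


def project(rows, onto="actors"):
    """One-mode projection as {(node, node): number of shared partners}.

    onto="actors" ties two actors for each event they both attended;
    onto="events" ties two events for each actor they have in common.
    """
    groups = {}
    for actor, event in rows:
        key, member = (event, actor) if onto == "actors" else (actor, event)
        groups.setdefault(key, set()).add(member)
    weights = {}
    for members in groups.values():
        for pair in combinations(sorted(members), 2):
            weights[pair] = weights.get(pair, 0) + 1
    return weights


actor_ties = project(rows, onto="actors")
event_ties = project(rows, onto="events")
print(actor_ties)  # {('a1', 'a2'): 1, ('a2', 'a3'): 1}
print(event_ties)  # {('e1', 'e2'): 1}
```

Actor a4 attended an event alone, so a4 and e3 have no ties in either projection. Note also that a1 and a3 are not tied: they never attended the same event, even though both are tied to a2.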

 

4. Visualization

April 26, 2017

This week’s seminar was all about visualization: colors, shapes, widths, line types, sizes, background colors, legends, etc. There are a ton of customizable features in social networks and this week’s seminar covered a lot of them! Some helpful tips for visualization:

1)    Aim for readability and parsimony

2)    In large networks, minimize details

3)    Minimize edge and node overlaps

4)    Be aware that node placement is not always useful

5)    Be aware that edge lengths are not always useful

In our discussions, we looked at both one-mode and two-mode networks, noting that customization applied to two-mode networks can be carried over into one-mode projections (Figure 1).

In lab, we worked primarily with igraph, but we also tried our hand at customization in statnet. Using a variety of data sources (some supplied by our fearless leaders and some from our own research experiences), we held an ugly and pretty network competition. Interestingly enough, the majority of the class entered the ugly competition but backed away from the pretty competition. What does this mean!? What does it say about our class or networking in general? Are we afraid that our ‘pretty’ networks will be ugly to others? Or is it that we just enjoy making ugly networks more (plus it’s easier)?! That’s a discussion for another time. For now, take a look at some of our ugly networks. Who do you think wins the prize?

This week’s R pro-tips:

Working with visualization and need some color ideas? Check out these sites:

            http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

            http://research.stowers.org/mcm/efg/R/Color/Chart/ColorChart.pdf

            https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf

 

5. Graph-Level Indices

May 3, 2017 

This week’s seminar discussed using graph-level indices as a means for comparing networks either over time or against each other. There are different metrics on which to base comparisons of networks. The three we covered in seminar are:

1)    Size

2)    Components

3)    Density

The easiest metric is size, which requires knowing the number of nodes in the network. This can be easily obtained through summary statistics in R. The ‘components’ metric considers either the number of components in the network or the sizes of those components. Finally, the ‘density’ metric considers the connectedness of the network. It measures how dense or sparse the network is as the ratio of the connections actually present to the total number of connections possible. For an undirected network with n nodes, for example, the total number of possible connections is n × (n - 1) / 2. Because density is a proportion, it lets you better compare networks of different sizes. Figure 1 shows two networks, each with the same density.
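All three metrics can be computed from a node list and an undirected edgelist. The sketch below is plain Python, not the R code from lab; the two small networks are invented, and the code assumes each tie is listed once with no self-ties:

```python
def graph_summary(nodes, edges):
    """Size, density, and component sizes of an undirected network."""
    n = len(nodes)
    possible = n * (n - 1) / 2          # total possible undirected ties
    density = len(edges) / possible if possible else 0.0

    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Find components by repeatedly exploring outward from an unvisited node.
    unvisited = set(nodes)
    component_sizes = []
    while unvisited:
        stack = [unvisited.pop()]
        size = 0
        while stack:
            v = stack.pop()
            size += 1
            for w in neighbors[v]:
                if w in unvisited:
                    unvisited.remove(w)
                    stack.append(w)
        component_sizes.append(size)

    return {
        "size": n,
        "density": density,
        "components": sorted(component_sizes, reverse=True),
    }


# Two invented networks of different sizes with the same density.
small = graph_summary([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
large = graph_summary([1, 2, 3, 4, 5], [(1, 2), (2, 3), (1, 3), (1, 4), (2, 4)])
print(small)
print(large)
```

The small network has 3 of 6 possible ties and the large one has 5 of 10, so both have a density of 0.5, even though the large network also contains an isolate (node 5) and therefore has two components.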

We also talked about centralization of networks. Degree centralization considers how central the most central node is compared to all other nodes in the network. By aggregating the nodes’ degree centrality scores in this way, we can calculate a single statistic for the whole network. Before heading into lab to practice doing some of these calculations in R, we had another guest lecturer!
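One common way to turn this idea into a number is Freeman's degree centralization for undirected networks: sum the gaps between the highest degree and every node's degree, then divide by (n - 1)(n - 2), the largest value that sum can take (reached by a star). Whether the lab software uses exactly this normalization is an assumption here, and the sketch is plain Python rather than R:

```python
def degree_centralization(nodes, edges):
    """Freeman degree centralization of an undirected network (0 to 1)."""
    n = len(nodes)
    if n < 3:
        raise ValueError("centralization needs at least three nodes")
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    top = max(degree.values())
    # Gap between the most central node and each node, summed and normalized.
    return sum(top - d for d in degree.values()) / ((n - 1) * (n - 2))


nodes = ["A", "B", "C", "D", "E"]
star = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E")]
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]

print(degree_centralization(nodes, star))  # 1.0
print(degree_centralization(nodes, ring))  # 0.0
```

The star is as centralized as a network can be, while the ring, where everyone has the same degree, is not centralized at all.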

Research spotlight

Today’s guest was Dr. Cuihua (Cindy) Shen, an associate professor in the Department of Communication at UC Davis. Cindy talked about the use of social network analysis in online worlds. Specifically, she and her colleagues are interested in studying context collapse and its effects on people’s self-presentation strategies. To do this work, they partnered with a Facebook app, myPersonality. This app, created by Stanford students, was set up so that users agreed to donate their Facebook data in exchange for getting a ‘personality reading’. The research is informed by Communication Accommodation Theory (CAT), which postulates that people adjust their language when they interact with different networks or groups of people. Given that people have several networks on Facebook, Dr. Shen and her team delve into how people ‘speak’ on Facebook by looking at the language of their status updates. Want to dive further into the Facebook world and language research?! Contact Dr. Shen at cuishen@ucdavis.edu. For more information on the myPersonality app, see their website.

 

6. Node-Level Indices

May 10, 2017

This week, we dove into node-level indices. Specifically, we discussed five node-level indices:

1)     Degree

2)     K-core

3)     Distance

4)     Betweenness

5)     Neighborhood

First, ‘degree’ considers where the action is in the network: who is the most central actor, and who is popular versus not popular. The degree score in an undirected network is the count of ties attached to each node. As such, every node in a network has a degree score. Histograms are an easy way to visualize the distribution of degree scores found in a network (Figure 1).
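Counting degree takes only a few lines. This is a plain Python sketch on an invented five-node network (the class did this in R), with a text histogram standing in for a plotted one:

```python
from collections import Counter

# Invented undirected edgelist.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

# Each tie adds one to the degree of both of its endpoints.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Tally how many nodes have each degree score.
histogram = Counter(degree.values())
for score in sorted(histogram):
    print(f"degree {score}: {'#' * histogram[score]}")
```

Node A comes out on top with a degree of 3. Because the counts are built from the edgelist alone, an isolate would not show up; a node list would be needed to give isolates their degree of 0.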

In comparison, the k-core is a bit more complicated. It focuses on dense pockets of cohesion, or groups in the network whose members all have at least the same minimum degree score within the group. For example, in Figure 2, the green cluster has a k-core of 3 because the minimum degree score in this cluster is 3. Node degree scores and k-cores relate in that nodes with high degrees can have high k-cores (but they don’t have to). Nodes that have low degree scores will never have high k-cores.
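A k-core can be found by "peeling": repeatedly remove every node with fewer than k remaining ties, and whatever survives is the k-core. The sketch below is plain Python, not the lab's R code; the network is invented and assumed to have no self-ties:

```python
def build_neighbors(edges):
    """Undirected adjacency sets from an edgelist."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def core_numbers(neighbors):
    """Highest k for which each node still belongs to the k-core."""
    remaining = {v: set(ns) for v, ns in neighbors.items()}
    core = {v: 0 for v in neighbors}
    k = 1
    while remaining:
        # Peel until every remaining node has at least k remaining ties.
        while True:
            low = [v for v, ns in remaining.items() if len(ns) < k]
            if not low:
                break
            for v in low:
                for w in remaining[v]:
                    remaining[w].discard(v)
                del remaining[v]
        for v in remaining:
            core[v] = k
        k += 1
    return core


# A, B, C, D all know each other; E knows A plus three otherwise unconnected nodes.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "E"), ("E", "F"), ("E", "G"), ("E", "H"),
]
neighbors = build_neighbors(edges)
core = core_numbers(neighbors)
print(core)
```

A and E both have a degree of 4, but A sits in the 3-core while E only reaches the 1-core: a high degree allows a high k-core without guaranteeing it.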

The third property is distance. This refers to how far apart individuals are in the network or how far something must travel to get somewhere else. Every node has a geodesic distance to every other node it can reach (not itself). Typically, when analyzing networks, we look for the most efficient distance. In other words, we look for the shortest path possible between two nodes. There can be multiple shortest paths between the same two nodes. However, geodesic distances are only defined within a single component. You cannot measure a geodesic distance across network components.
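Both points, multiple shortest paths and no distance across components, show up in a small example. This plain Python sketch (not the R code from lab) uses an invented network made of a four-node square plus a separate pair:

```python
from collections import deque


def geodesics_from(source, neighbors):
    """Geodesic distance and number of shortest paths to each reachable node."""
    distance = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in neighbors[v]:
            if w not in distance:            # first time w is reached
                distance[w] = distance[v] + 1
                paths[w] = 0
                queue.append(w)
            if distance[w] == distance[v] + 1:
                paths[w] += paths[v]         # v lies on shortest paths to w
    return distance, paths


# Square A-B-C-D-A, plus a second component X-Y.
neighbors = {
    "A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"},
    "X": {"Y"}, "Y": {"X"},
}

distance, paths = geodesics_from("A", neighbors)
print(distance["C"], paths["C"])  # 2 2
print(distance.get("X"))          # None
```

C is two steps from A by two equally short routes (through B or through D), while X never appears in the results because it sits in a different component.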

Next, we covered betweenness. Betweenness considers who is on the path between two nodes. It is calculated using the number of geodesic (shortest) paths that a node lies on, and it can help highlight which nodes in the network act as ‘brokers’. Brokers are important because they bridge major pieces of the network across ‘structural holes’ (Figure 3).
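A standard way to compute betweenness is Brandes' algorithm, which runs one breadth-first search per node and then passes credit back along the shortest paths. The sketch below is a plain Python version for undirected, unweighted networks (the class used R), applied to an invented network of two triangles joined through a single broker, X:

```python
from collections import deque


def betweenness(neighbors):
    """Unnormalized betweenness for an undirected, unweighted network."""
    score = {v: 0.0 for v in neighbors}
    for s in neighbors:
        order = []                              # nodes in order of distance from s
        preds = {v: [] for v in neighbors}      # predecessors on shortest paths
        sigma = {v: 0 for v in neighbors}       # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbors[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing credit with predecessors.
        delta = {v: 0.0 for v in neighbors}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair of nodes was counted once from each end.
    return {v: b / 2 for v, b in score.items()}


neighbors = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "X"},
    "X": {"C", "D"},
    "D": {"X", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
scores = betweenness(neighbors)
print(scores)
```

X has only two ties, yet it has the highest betweenness (9), because every shortest path between the two triangles passes through it.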

Finally, we considered neighborhoods. Neighborhoods consider connections to nodes through an increasing number of steps. For instance, who is connected to a node in one step, who is connected to a node in two steps, etc.
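Neighborhoods can be built by widening a frontier one step at a time. This is a plain Python sketch on an invented four-node chain, not the R code from lab:

```python
def neighborhood(node, neighbors, steps):
    """Nodes that can be reached from `node` in at most `steps` ties."""
    reached = {node}
    frontier = {node}
    for _ in range(steps):
        # Everyone tied to the current frontier who has not been reached yet.
        frontier = {w for v in frontier for w in neighbors[v]} - reached
        reached |= frontier
    return reached - {node}


# Chain A - B - C - D.
neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

for steps in (1, 2, 3):
    print(steps, sorted(neighborhood("A", neighbors, steps)))
```

A's one-step neighborhood is just B; two steps add C, and three steps add D.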

Our research spotlight today was our very own Dr. Chris Smith, one of the two fearless leaders of the SNA Pro-Seminar this quarter!

Research Spotlight

Dr. Chris Smith is an Assistant Professor in the Department of Sociology at UC Davis. She is also one of the profs leading the Social Network Analysis Pro-Seminar covered in this blog. Dr. Smith’s research focuses on crime, criminal relationships, and criminal organizations. In class, she presented work on how the structure of relationships contributes to power consolidation during exogenous events (such as Prohibition). Within this work, she highlighted different brokering and distance measures related to gender. She found a higher gender gap ratio before Prohibition than during it. Learn more about the fascinating world of organized crime, gender, and violence by contacting Dr. Smith at chmsmith@ucdavis.edu