Teaching material for Networks and Simulation
Tutorial 2 Notes
Tutorial 2 Exercises
Tutorial 3 Notes
Tutorial 3 Exercises
Tutorial 4 Notes
Tutorial 4 Exercises
Tutorial 5 Notes
Tutorial 5 Exercises
Tutorial 6 Notes
Tutorial 6 Exercises
Tutorial 7 Notes
Tutorial 7 Exercises
Tutorial 8 Notes
Tutorial 8 Exercises
These notes are intended as pre-tutorial reading alongside Chapter 14 of the Atlas. They organise the centrality measures covered in the tutorial into two families, explain the logic connecting them, and introduce the path-based measures used in Exercise 4. The goal is to give you a framework for understanding these measures rather than a list of formulas to memorise.
Every centrality measure is an answer to a specific question about what makes a node important. Before computing anything, it is worth asking: important for what? A node that is important for spreading information quickly may not be important for controlling what information flows where, and neither may be important for withstanding targeted removal. The proliferation of centrality measures in the literature is not redundancy — it reflects the genuine multiplicity of things “importance” can mean.
In this tutorial we cover two families. The first, the spectral family, asks some version of “how well-connected is this node to other well-connected nodes?” The second, the path-based family, asks “where does this node sit relative to the shortest paths in the network?” The two families capture structurally different properties and will often rank the same nodes very differently.
The spectral centrality measures are best understood as a sequence, each member motivated by a limitation of the previous one.
The simplest recursive definition of importance: a node is important if it is connected to other important nodes. Formally, the eigenvector centrality vector x satisfies:
\[\lambda \mathbf{x} = \mathbf{A}\mathbf{x}\]where A is the adjacency matrix and λ is the largest eigenvalue of A. The i-th entry xᵢ is the eigenvector centrality of node i.
The connection to walk counting (Tutorial 1) is direct. You showed that A¹1 counts walks of length 1 (the degree vector), A²1 counts walks of length 2, and so on. Repeatedly multiplying by A and normalising — the power method — amplifies the component of the result in the direction of the leading eigenvector. In the limit, the result is the eigenvector centrality vector. A node scores high because it can be reached by many long walks, which is equivalent to being connected to other well-connected nodes.
Limitation. In directed networks, a node with no incoming edges receives a score of zero, regardless of how many important nodes it connects to. This is a problem: a node that links out to the most important nodes in the network gets no credit for doing so. All subsequent measures in the family address this limitation.
Katz centrality adds a small uniform baseline score β to every node, so that even a node with no incoming edges receives a non-zero score. The Katz centrality vector satisfies:
\[\mathbf{x} = \alpha \mathbf{A}\mathbf{x} + \beta \mathbf{1}\]where α < 1/λ₁ is an attenuation factor that controls how much importance propagates through the network relative to the baseline. The constraint α < 1/λ₁ is necessary for the solution to converge — if α is too large the scores diverge.
Rearranging, the solution is:
\[\mathbf{x} = \beta(\mathbf{I} - \alpha \mathbf{A})^{-1}\mathbf{1}\]which can be expanded as a power series: x = β(1 + αA1 + α²A²1 + …). Each term counts walks of a given length, attenuated by αˡ for walks of length l. Katz centrality is therefore a weighted count of all walks of all lengths, with shorter walks weighted more heavily.
Remaining limitation. Every connection to node i contributes equally to i’s Katz score, regardless of how many other edges the source node has. A link from a node with 1000 outgoing edges is treated the same as a link from a node with 1 outgoing edge. This makes Katz centrality susceptible to manipulation and can give misleading results for networks where some nodes link to very many others.
Alpha centrality generalises Katz by replacing the uniform baseline 1 with an arbitrary exogenous importance vector e:
\[\mathbf{x} = \alpha \mathbf{A}\mathbf{x} + \mathbf{e}\]This allows you to encode prior knowledge about node importance. For an airport network, e might reflect passenger volume or the economic size of the city served. For a citation network, e might reflect the journal impact factor of the source publication. The propagation mechanism is the same as Katz; only the starting distribution of importance differs.
PageRank addresses the remaining limitation of Katz centrality by normalising the importance propagated along each edge by the out-degree of the source node. A link from a node with 1000 outgoing edges contributes 1/1000 of what a link from a node with 1 outgoing edge contributes. This makes PageRank a measure of the probability that a random walker following edges uniformly arrives at a given node.
The random walk interpretation also motivates the teleportation term: with probability (1 − d), the random walker ignores the network and jumps to a uniformly random node. This handles dangling nodes (nodes with no outgoing edges, which would trap a random walker) and ensures the walk mixes across the whole network. The standard value d = 0.85 was used in the original Google PageRank paper.
The PageRank vector satisfies:
\[\mathbf{x} = d \mathbf{M}\mathbf{x} + \frac{1-d}{n}\mathbf{1}\]where M is the column-stochastic transition matrix: Mᵢⱼ = Aᵢⱼ / kⱼᵒᵘᵗ. This can be solved by the power method — the same iterative multiplication you used for eigenvector centrality, but applied to the modified matrix.
A note on sparse matrices. The power method requires repeated matrix-vector multiplication. For a network with n nodes, a dense matrix representation requires storing n² values and performing n² operations per multiplication. For n = 3000 (the size of the OpenFlights network) this is already 9 million values per matrix. Real networks are sparse — most pairs of nodes are not connected — so only the non-zero entries of A need to be stored and operated on. Sparse matrix libraries (e.g., scipy.sparse in Python) store only the non-zero entries, reducing both memory and computation to O(m) where m is the number of edges. For any network with more than a few hundred nodes, you should use sparse representations for iterative computations.
HITS (Hyperlink-Induced Topic Search) takes a different approach to the limitation of eigenvector centrality. Rather than adding a baseline, it recognises that in directed networks there are genuinely two distinct types of importance: being a good source of links (a hub) and being a good destination for links (an authority). A good hub links to many good authorities; a good authority is linked to by many good hubs.
Every node receives two scores. The hub score h and authority score a satisfy:
\[\mathbf{h} = \mathbf{A}\mathbf{a}, \quad \mathbf{a} = \mathbf{A}^\top \mathbf{h}\]Substituting: h = AAᵀh and a = AᵀAa. Hub scores are therefore the leading eigenvector of AAᵀ, and authority scores are the leading eigenvector of AᵀA.
This has a direct connection to Tutorial 1’s bipartite network material. The matrix AAᵀ is the co-citation matrix — (AAᵀ)ᵢⱼ counts the number of articles that both i and j link to, exactly the one-mode projection of the directed graph viewed as a bipartite structure (sources on one side, destinations on the other). Computing eigenvector centrality on this projection gives the hub scores. HITS is therefore eigenvector centrality applied to the implicit bipartite structure of a directed network.
The spectral measures all ask some version of “who are your neighbours, and how important are they?” The path-based measures ask a different question: where do you sit in the network’s global structure of shortest paths?
Closeness centrality measures how quickly a node can reach all other nodes. The standard definition is the inverse of the average shortest path length from node i to all other nodes:
\[C_i = \frac{n-1}{\sum_{j \neq i} d(i,j)}\]where d(i,j) is the shortest path length between i and j. A node with high closeness can reach the whole network in few hops on average — it is geometrically central.
Limitation. Standard closeness is undefined if any node is unreachable (infinite shortest path), which occurs in disconnected or directed networks. The harmonic centrality variant addresses this by summing inverse distances instead of inverting the sum:
\[H_i = \frac{1}{n-1} \sum_{j \neq i} \frac{1}{d(i,j)}\]Unreachable nodes contribute 0 to the sum (since 1/∞ = 0) rather than making the measure undefined. Harmonic centrality is generally preferred for real-world directed or disconnected networks.
Betweenness centrality measures how often a node appears on shortest paths between other pairs of nodes. Formally:
\[B_i = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}\]where σₛₜ is the total number of shortest paths from s to t, and σₛₜ(i) is the number of those paths that pass through i.
A node with high betweenness acts as a bridge or bottleneck: much of the network’s traffic must pass through it. Removing a high-betweenness node tends to increase path lengths between many pairs of nodes more than removing a high-degree node would, because betweenness specifically identifies the bridges rather than the hubs.
Betweenness centrality is computationally more expensive than degree or eigenvector centrality — the standard algorithm runs in O(nm) time for unweighted networks — so for large networks the igraph implementation is substantially faster than networkx.
The following table summarises the implicit question each measure answers. When choosing a centrality measure for a practical question, start with the question and work backwards to the measure.
| Measure | Implicit question |
|---|---|
| Degree | How many direct connections does this node have? |
| Eigenvector | How well-connected are this node’s neighbours (recursively)? |
| Katz | Same as eigenvector, but every node gets a baseline score |
| Alpha | Same as Katz, but baseline scores reflect prior knowledge |
| PageRank | How often does a random walker visit this node? |
| HITS (hub) | Does this node link to many important destinations? |
| HITS (authority) | Is this node linked to by many important sources? |
| Closeness | How quickly can this node reach all other nodes? |
| Betweenness | How often does this node lie on shortest paths between others? |
A useful heuristic: spectral measures tend to identify hubs — nodes that are well-connected to other well-connected nodes. Path-based measures tend to identify bridges — nodes whose position makes them structurally important for network flow, even if they are not highly connected. For many real networks, the two families identify different nodes, and the disagreement is itself informative about the network’s structure.
Computing centrality scores is straightforward; interpreting them correctly is not. Four cautions apply whenever you analyse centrality in practice.
Not every centrality measure is meaningful on every type of network. The most important case is directed networks, where several measures either break down or require modification.
Eigenvector centrality on a directed network without a strongly connected component (one where every node can reach every other node) will produce zero scores for all nodes that cannot be reached from the dominant component — often the majority of the network. The measure is technically defined but practically useless in this case. Katz centrality and PageRank are specifically designed to handle this situation.
Standard closeness centrality is undefined for any node that cannot reach all other nodes, which is common in directed networks where paths may only exist in one direction. Harmonic centrality resolves this by summing inverse distances — unreachable nodes contribute zero rather than making the measure undefined. Always prefer harmonic closeness on directed or disconnected networks.
Betweenness centrality is well-defined on directed networks (simply count directed shortest paths) but has a different interpretation: a node can be a bridge for paths from A to B without being a bridge for paths from B to A.
Before computing any centrality measure, ask: is this measure defined for my network type, and does the definition give me what I actually want to know?
PageRank is designed for directed networks. On an undirected network — where every edge goes both ways — the random walker has no preferred direction to follow, and the stationary distribution of the walk converges to a score proportional to each node’s degree. In the limit, PageRank on an undirected network is equivalent to degree centrality up to normalisation.
This means computing PageRank on an undirected network is rarely informative beyond what you already know from degree. If your network is undirected and you want a measure that captures the quality of a node’s connections recursively, eigenvector centrality is the appropriate choice.
On typical real-world networks — particularly social networks and infrastructure networks — many centrality measures are positively correlated with each other. Degree, eigenvector, Katz, and PageRank all tend to rank the same high-degree hubs at the top. Closeness often correlates with degree in well-connected networks. Computing five measures and obtaining five similar rankings is not strong evidence that a node is important in five different senses; it may simply reflect the fact that the network has a clear hub-and-spoke structure where one family of nodes dominates all rankings simultaneously.
The interesting analytical signal lies in the disagreements between measures, not the agreements. When a node ranks high on betweenness but low on degree, or high on closeness but low on PageRank, that disagreement is diagnostic of the node’s structural role. Exercise 4 is designed around exactly this principle.
This also implies that reporting multiple centrality measures without discussing their correlations or disagreements adds limited value. A good analysis identifies which measures agree, which disagree, and what the disagreements reveal about the network.
A node with harmonic centrality 0.43 is not obviously important or unimportant. Centrality scores only become interpretable when compared to something: the distribution of scores across all nodes in the network, the scores of nodes with similar degrees, or the scores produced by an appropriate null model.
The Medici exercise (Ex 5) makes this concrete: the Medici’s harmonic centrality score is only interpretable when placed against the configuration model baseline — the distribution of scores we would expect for a node with that degree in a random graph preserving the degree sequence. A score above the null model expectation indicates that the node’s position in the network cannot be explained by degree alone.
This principle generalises beyond the Medici. Whenever you report a centrality score in an analysis, ask: compared to what? If you cannot answer that question, the score is not yet an interpretable finding.
Exercise 5 uses the network of marriage and business alliances among prominent Florentine families in the 15th century, first analysed by Padgett and Ansell (1993). The network is small enough to inspect by eye and historically documented enough to verify results against real records — both of which make it an unusually good teaching dataset.
The exercise uses harmonic centrality rather than standard closeness for the reasons described above: the Medici network has a few isolated nodes, making standard closeness ill-defined. The harmonic variant handles this gracefully.
When interpreting results, keep in mind the distinction between two claims that the analysis can and cannot support. The analysis can show that the Medici have high harmonic centrality and that this is not fully explained by their degree. It cannot show that the Medici deliberately engineered their network position, or that network centrality caused their political power rather than the reverse. Network analysis can establish structural facts; the causal story always requires additional evidence.