Networks and Simulation — Tutorial Eight

Supplementary Notes: Making Predictions

These notes are pre-tutorial reading. They cover the conceptual foundations for both link prediction and node classification, with particular attention to the evaluation traps that make these problems harder than they first appear, and to the connection between node classification and the centrality methods of Tutorial 3.

1. Two prediction problems on networks

Networks change over time: new friendships form, new trade routes open, new collaborations begin. Link prediction asks whether we can anticipate which edges will appear next, or detect which edges are missing from an incomplete observation of a static network.

A separate but related problem is node classification: given that some nodes have known labels — a political affiliation, a disease status, a geographic category — can we infer the labels of the unlabelled nodes using the network structure? This is sometimes called semi-supervised learning on graphs, and it arises whenever labels are expensive to collect but network data is abundant.

Both problems share a common challenge: the training data is not independent of the test data. In link prediction, the edges you observe in training influence the neighbourhood features you compute, which in turn predict the test edges. In node classification, the labels of unlabelled nodes may be correlated with each other through the network. This dependence complicates evaluation in ways that are easy to overlook.

2. Link prediction: scoring candidate edges

Given an observed network $G$, a link prediction method assigns a score $s(u, v)$ to each unconnected pair $(u, v)$. Pairs with high scores are predicted to be likely edges. The simplest and most widely used scores are based on shared neighbourhoods.

Common Neighbours. The score is simply the number of common neighbours:

\[s(u, v) = |N(u) \cap N(v)|\]

The intuition is transitivity: if Alice and Bob both know Carol, they are more likely to know each other.

Adamic-Adar. Each common neighbour $w$ is weighted by $1 / \log k_w$, where $k_w$ is the degree of $w$:

\[s(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{\log k_w}\]

The idea: a shared acquaintance who knows everyone (a hub) is less informative about a specific friendship than a shared acquaintance with few contacts. A hub appearing as a common neighbour of two people tells you almost nothing — everyone is connected through hubs.

Resource Allocation. Similar in spirit, but with a sharper penalty for hubs:

\[s(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{k_w}\]

Think of each node as distributing a unit of resource evenly across its neighbours: two nodes receive more total resource from a common neighbour with few other connections than from one with many.

3. Evaluation: where it goes wrong

Class imbalance. In most real networks, the number of edges is far smaller than the number of possible edges. If you predict “no edge” for every pair, you are correct nearly 100% of the time — but you have predicted nothing useful. Accuracy is therefore a misleading metric. AUC-ROC (area under the receiver operating characteristic curve) measures how well your scores rank true edges above non-edges regardless of any threshold, and is robust to class imbalance. An AUC of 0.5 means your ranking is no better than random; 1.0 means perfect separation.

Sampling negatives. Because there are vastly more non-edges than edges, evaluation typically samples a set of non-edges equal in size to the positive test set. This avoids the computational cost of scoring all pairs, but the sampled negatives affect the AUC estimate. The results you obtain are relative to this specific sample, not to the full space of non-edges.

The node holdout problem. In standard supervised learning, you hold out independent examples — a customer record here, an image there — and train on the rest. In link prediction you cannot hold out nodes: a node’s feature in the training graph is its neighbourhood, and removing the node from the graph removes precisely the information that would let you compute those features. You can hold out edges, as the setup code does, but this is weaker: both endpoints remain in the training graph, just with fewer connections than in the full network. The neighbourhood structure used to make predictions is a degraded version of the truth. This is an inherent limitation of link prediction, not a flaw in any particular method, and it affects all neighbourhood-based methods to varying degrees.

4. Node classification as alpha centrality

Label propagation is one of the most widely used methods for semi-supervised node classification. It is also, structurally, exactly the same computation as alpha centrality.

Recall from Tutorial 3 that alpha centrality satisfies:

\[\mathbf{x} = \alpha \mathbf{A} \mathbf{x} + \mathbf{e}\]

where $\alpha$ is an attenuation parameter and $\mathbf{e}$ is an exogenous importance vector — a prior belief about each node’s importance, independent of the network.

Label propagation solves the same equation:

\[F = \alpha S F + (1 - \alpha) Y\]

where $S = D^{-1/2} A D^{-1/2}$ is the normalised adjacency matrix, $F$ is a matrix of label scores (one column per class), and $Y$ contains those scores for the nodes whose labels are known, and zeros elsewhere. The parameter $\alpha$ plays the same role as in alpha centrality: it controls how much influence comes from the network structure versus the fixed prior.

The only difference from alpha centrality is what $\mathbf{e}$ (or $Y$) contains: in alpha centrality, it encodes prior knowledge about node importance; in label propagation, it encodes the known class labels. The mathematical structure — propagate over the network, anchored by an exogenous vector — is identical.

In practice, the equation is solved iteratively: starting from $F = Y$, repeatedly apply $F \leftarrow \alpha S F + (1-\alpha) Y$ until convergence. Each iteration spreads label information one hop further across the network, weighted by $\alpha$.

What the normalised adjacency does. The symmetric normalised adjacency $S = D^{-1/2} A D^{-1/2}$ weights each edge by the inverse square-root of the degrees of both endpoints. This prevents high-degree nodes from dominating the propagation simply because they have many connections. Its eigenvalues lie in $[-1, 1]$.

5. Assortativity

Networks are assortative with respect to a node attribute if nodes with similar attribute values tend to connect to each other; disassortative if nodes with different values tend to connect. This pattern is also sometimes called homophily (birds of a feather flock together) in the assortative case.

The standard quantitative measure is the assortativity coefficient $r$, which is the Pearson correlation of the attribute values at the two ends of each edge:

\[r = \frac{\sum_{ij} (A_{ij} - k_i k_j / 2m) \, x_i x_j}{\sum_{ij} (k_i \delta_{ij} - k_i k_j / 2m) \, x_i x_j}\]

where $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the number of edges, and $x_i$ is the attribute value of node $i$. In practice you do not need to compute this by hand — NetworkX provides nx.attribute_assortativity_coefficient(G, attribute) for categorical labels and nx.numeric_assortativity_coefficient(G, attribute) for continuous values.

The coefficient $r$ lies in $[-1, 1]$:

$r > 0$: assortative — edges tend to connect same-attribute nodes
$r < 0$: disassortative — edges tend to connect different-attribute nodes
$r \approx 0$: no systematic pattern (or competing assortative and disassortative signals that cancel)

Examples. Social networks are typically assortative by age, education level, and political affiliation — people tend to connect with similar others. Biological networks (predator–prey, host–pathogen) are often disassortative — connections are between unlike nodes. Transport networks, as you will see in the exercises, can be complex: the global flight network connects airports across continents, so it is not obviously assortative by continent.

Why it matters for classification. Label propagation assumes that connected nodes share labels — the assortativity assumption. This is exactly the assumption that $r > 0$ for the class labels. When $r < 0$, the assumption is violated, and propagating labels across edges actively misleads rather than helps. The connection between assortativity and the eigenvalue structure of $S$ is what the next section explains.

6. The eigenvalue argument: why one-step can fail

Whether label propagation succeeds depends on a structural property of the network that is easy to state once you think about eigenvalues.

Consider a network with two classes of nodes. If the network is assortative with respect to those labels — same-label nodes tend to connect — then the dominant eigenvectors of $S$ that capture the class structure have positive eigenvalues. Propagating labels over $S$ amplifies these eigenvectors: after many iterations, each node’s predicted label converges toward the majority label in its neighbourhood. This is what we want.

If the network is disassortative — different-label nodes tend to connect — then the class-relevant eigenvectors have negative eigenvalues. Propagating over $S$ does not amplify these eigenvectors; it oscillates their sign at each iteration. After convergence, the predicted labels are systematically inverted: the method does not merely fail to help, it actively misleads.

Real networks can also be mixed: some aspects of the label structure are assortative, others disassortative. In this case the class-relevant eigenvectors have both positive and negative eigenvalues. One-step propagation amplifies the positive components and suppresses the negative ones, giving an incomplete and potentially misleading picture.

The two-step solution. Replacing $S$ with $S^2$ in the propagation equation solves the sign problem. The eigenvectors of $S^2$ are exactly the same as those of $S$ — the same structural patterns are captured — but every eigenvalue is squared, and squaring removes the sign: $(-\lambda)^2 = \lambda^2 \geq 0$. Two-step propagation therefore amplifies all structurally informative directions, whether the underlying structure is assortative, disassortative, or mixed. This is why two-step is a strictly more general method: it works wherever one-step works, and also works in cases where one-step fails or misleads.

The mixed case is the most interesting one. Four blocks with two labels can have a structure where the label-relevant eigenvector of $S$ has a negative eigenvalue — because the dominant cross-label connections are disassortative — even though same-label blocks also connect to each other. Two-step captures both signals simultaneously.

7. Choosing $\alpha$ and avoiding overfitting

The parameter $\alpha$ balances two terms in the propagation equation: the network term $\alpha SF$ (trust your neighbours) and the prior term $(1-\alpha)Y$ (trust the known labels). When $\alpha$ is small, predictions stay close to the known labels and the network’s influence is limited. When $\alpha$ is large, the network structure dominates and labels propagate far from the labelled nodes — which is only helpful if the network structure is genuinely informative about the labels.

In practice, $\alpha = 0.1$ works well and converges quickly: it lets the network exert meaningful influence without becoming dominated by noisy or misleading structural signals. Very large $\alpha$ (close to 1) can cause slow convergence and makes predictions highly sensitive to network structure, which is a liability when that structure is mixed or disassortative.

The natural temptation is to choose $\alpha$ by trying several values and picking the one that gives highest accuracy. But accuracy on what? If you tune $\alpha$ on the same held-out nodes you are trying to classify, you have no independent test of generalisation. The correct procedure is to use a separate validation set — a third partition of nodes, distinct from both training and test — to tune $\alpha$, and report accuracy only on the test set. This is the same logic as regularisation tuning in any supervised learning setting.

8. Practical notes

Loading the airport network. Use the same graph_tool loading approach as in Tutorials 3 and 4. Convert to an undirected NetworkX graph for link prediction, and work on the largest connected component.

stochastic_block_model in NetworkX. The function nx.stochastic_block_model(sizes, p) generates a network where sizes is a list of block sizes and p is a matrix of within- and between-block connection probabilities. Node IDs run from 0 to $n-1$, and each node has a 'block' attribute recording its block membership.

Matrix operations. The label_propagation function provided uses dense NumPy arrays, which is convenient but memory-intensive. If you find it slow on your machine, the matrix operations can be replaced with sparse equivalents from scipy.sparse — the same approach as in Tutorial 3.

Stochasticity. Both the SBM networks and the train/test split are stochastic. If your results look surprising, re-run a few times and check whether the pattern is consistent. The qualitative result — which method works on which network type — should be stable across runs even if the exact accuracy numbers vary.

References

Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019–1031.

Peel, L. (2017). Graph-based semi-supervised learning for relational networks. In Proceedings of the 2017 SIAM International Conference on Data Mining (pp. 435–443). https://arxiv.org/abs/1612.05001

Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in Neural Information Processing Systems, 16.