Tutorial Eight — Exercises

Making Predictions

This tutorial asks you to use network structure to make predictions of two kinds: which edges are missing from an observed network (link prediction), and what labels should be assigned to unlabelled nodes (node classification). The methods themselves are not technically complex. The harder and more important questions are about evaluation: how do you know when a prediction is good, and how do you avoid fooling yourself? These questions run through both halves of the tutorial.

Link prediction

The following code loads the OpenFlights airport network and sets up a train/test evaluation framework. Read it carefully before attempting the exercises — understanding what it does is part of the exercise.

import networkx as nx
import numpy as np
from sklearn.metrics import roc_auc_score

# Load OpenFlights (same source as Tutorial 3)
import graph_tool.all as gt
g = gt.collection.ns["openflights"]
G_directed = nx.DiGraph()
for e in g.edges():
    G_directed.add_edge(int(e.source()), int(e.target()))

# Work with the undirected version, largest connected component
G_full = G_directed.to_undirected()
G_lcc = G_full.subgraph(
    max(nx.connected_components(G_full), key=len)
).copy()

def train_test_split_edges(G, test_fraction=0.1, seed=None):
    """
    Hold out a random fraction of edges as the test set.
    Returns the training graph, positive test edges, and sampled negative test edges.
    """
    rng = np.random.default_rng(seed)
    edges = list(G.edges())
    rng.shuffle(edges)
    n_test = int(test_fraction * len(edges))

    test_edges = edges[:n_test]
    train_edges = edges[n_test:]

    G_train = nx.Graph()
    G_train.add_nodes_from(G.nodes(data=True))
    G_train.add_edges_from(train_edges)

    # Sample the same number of non-edges as negatives
    non_edges = [(u, v) for u, v in nx.non_edges(G)
                 if G_train.has_node(u) and G_train.has_node(v)]
    rng.shuffle(non_edges)
    test_non_edges = non_edges[:n_test]

    return G_train, test_edges, test_non_edges

def evaluate(G_train, pos_edges, neg_edges, score_fn):
    """
    Evaluate a link prediction scoring function using AUC-ROC.
    score_fn(G, u, v) should return a scalar score for a candidate edge (u, v).
    """
    scores = [score_fn(G_train, u, v) for u, v in pos_edges + neg_edges]
    labels = [1] * len(pos_edges) + [0] * len(neg_edges)
    return roc_auc_score(labels, scores)

G_train, test_edges, test_non_edges = train_test_split_edges(G_lcc)

1. Before trying any link prediction method, it is worth understanding exactly what we are measuring and why the evaluation is set up the way it is.

(a) How many edges does G_train have, and how many are in the test set? What fraction of all possible node pairs in the network are edges? Given this fraction, a trivial classifier that predicts “no edge” for every pair would be nearly always correct. Why is accuracy a poor metric here, and why is AUC-ROC a better choice?

(b) The train_test_split_edges function samples the same number of non-edges as test edges. What would happen to the AUC if you included all non-edges instead? Is sampling non-edges a principled solution, or does it introduce its own problems?

(c) When edges are removed to form the test set, both endpoints remain in G_train — just with fewer connections than in the full network. Why does this matter for methods that use neighbourhood structure to score candidate edges? Is there a way to avoid this problem?

Discussion question. In standard supervised learning you hold out examples — individual data points — for testing. Why can you not simply hold out nodes in link prediction the same way? What is the fundamental difference between a link prediction test set and a test set in, say, classification of independent customer records?

Explore further. Try varying test_fraction — say 0.05, 0.1, and 0.2 — and observe how the AUC changes. Does larger test fraction consistently hurt performance, or is the picture more complicated? What does this suggest about how much the degraded training graph affects the neighbourhood-based methods?

2. Three of the most widely used link prediction methods score candidate edges by the overlap between the neighbourhoods of the two candidate nodes. Each encodes a slightly different hypothesis about why missing edges appear.

(a) The code below computes Common Neighbours as a link prediction score and evaluates it. Run it and record the AUC.

def cn_score(G, u, v):
    return len(list(nx.common_neighbors(G, u, v)))

auc_cn = evaluate(G_train, test_edges, test_non_edges, cn_score)
print(f"Common Neighbours AUC: {auc_cn:.3f}")

An AUC of 0.5 corresponds to random prediction. How far above this baseline does Common Neighbours score?

(b) Implement and evaluate Adamic-Adar and Resource Allocation. Both are weighted variants of Common Neighbours: Adamic-Adar weights each common neighbour $w$ by $1/\log k_w$, and Resource Allocation by $1/k_w$, where $k_w$ is the degree of $w$. NetworkX provides nx.adamic_adar_index and nx.resource_allocation_index, but these return generators over node pairs — you will need to adapt them to score a single pair. Compare the three AUC scores.

(c) Discussion question. Adamic-Adar and Resource Allocation both penalise common neighbours that are high-degree hubs. Why might a hub appearing as a common neighbour be less informative about whether a specific pair of airports is likely to be connected? Think about what kind of airport systematically appears as a common neighbour of nearly every pair of airports in the world.

Explore further. Try implementing a degree-product baseline: $s(u,v) = k_u \cdot k_v$. This scores edges purely by how many edges you would expect between two nodes in a random null model. How does it compare to the neighbourhood-based methods? What does the gap (or lack of one) tell you about how much of the predictive power of Common Neighbours comes simply from high-degree nodes being involved?

Node classification

The following code generates three synthetic networks using the stochastic block model. Each network has the same two class labels, but the relationship between labels and network structure differs in a way that will matter for classification.

import numpy as np
import networkx as nx
from sklearn.metrics import accuracy_score

# --- Assortative: same-label nodes are more likely to connect ---
sizes_2 = [250, 250]
p_in, p_out = 0.10, 0.01
G_assort = nx.stochastic_block_model(
    sizes_2, [[p_in, p_out], [p_out, p_in]]
)
y_assort = np.array([0]*250 + [1]*250)

# --- Disassortative: different-label nodes are more likely to connect ---
G_disassort = nx.stochastic_block_model(
    sizes_2, [[p_out, p_in], [p_in, p_out]]
)
y_disassort = y_assort.copy()

# --- Mixed: 4 blocks, 2 labels, with both assortative and disassortative structure ---
# Blocks 0 and 2 have label 0; blocks 1 and 3 have label 1.
# Same-label block pairs (0-2 and 1-3) are connected with probability p_a.
# Different-label block pairs (0-1, 0-3, 1-2, 2-3) are connected with probability p_d.
# p_d > p_a so the cross-label signal is stronger than the within-label signal.
sizes_4 = [150, 150, 150, 150]
p_w, p_a, p_d = 0.08, 0.03, 0.05
omega_mixed = [
    [p_w, p_d, p_a, p_d],
    [p_d, p_w, p_d, p_a],
    [p_a, p_d, p_w, p_d],
    [p_d, p_a, p_d, p_w],
]
G_mixed = nx.stochastic_block_model(sizes_4, omega_mixed)
y_mixed = np.array([0]*150 + [1]*150 + [0]*150 + [1]*150)


def label_propagation(G, y_true, train_mask, alpha=0.1, n_iter=100, two_step=False):
    """
    Semi-supervised label propagation.

    G          : NetworkX graph
    y_true     : integer array of true labels (0 or 1)
    train_mask : boolean array, True for nodes whose labels are known
    alpha      : weight on the network term — higher gives more influence to neighbours,
                 lower keeps predictions closer to the known labels.
                 Must be in (0, 1).
    n_iter     : number of propagation iterations
    two_step   : if True, propagate over S^2 instead of S

    Returns an array of predicted labels for all nodes.
    """
    nodes = list(G.nodes())
    n = len(nodes)

    # Symmetric normalised adjacency: S = D^{-1/2} A D^{-1/2}
    A = nx.to_numpy_array(G, nodelist=nodes)
    deg = A.sum(axis=1)
    with np.errstate(divide='ignore', invalid='ignore'):
        d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    S = (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

    if two_step:
        S = S @ S

    # Initialise label score matrix
    F = np.zeros((n, 2))
    Y = np.zeros((n, 2))
    for i in range(n):
        if train_mask[i]:
            F[i, y_true[i]] = 1.0
            Y[i, y_true[i]] = 1.0

    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y

    return np.argmax(F, axis=1)


def make_train_mask(n, train_fraction=0.2):
    rng = np.random.default_rng()
    mask = np.zeros(n, dtype=bool)
    idx = rng.choice(n, size=int(train_fraction * n), replace=False)
    mask[idx] = True
    return mask

3. Label propagation spreads known class labels across the network using the equation

\[F = \alpha S F + (1-\alpha) Y\]

where $S$ is the normalised adjacency matrix, $Y$ contains the known label scores, $F$ the current score estimates, and $\alpha$ controls how strongly each node’s label is influenced by its neighbours. You may recognise this equation from Tutorial 3: it has the same structure as alpha centrality ($\mathbf{x} = \alpha \mathbf{A}\mathbf{x} + \mathbf{e}$), with the known labels playing the role of the exogenous vector $\mathbf{e}$.

(a) Generate a train mask covering 20% of nodes and run one-step label propagation (two_step=False) on each of the three networks. Report accuracy on the held-out 80%.

for name, G, y in [("Assortative",    G_assort,    y_assort),
                   ("Disassortative", G_disassort, y_disassort),
                   ("Mixed",          G_mixed,     y_mixed)]:
    mask = make_train_mask(len(y), train_fraction=0.2)
    y_pred = label_propagation(G, y, mask, two_step=False)
    acc = accuracy_score(y[~mask], y_pred[~mask])
    print(f"{name}: accuracy = {acc:.3f}")

(b) For the disassortative network, inspect what the predicted labels actually are. Is the method failing at random, or is it making a systematic error? What does this tell you about what the propagation equation is doing when most of a node’s neighbours have the opposite label?

for name, G, y in [("Assortative",    G_assort,    y_assort),
                   ("Disassortative", G_disassort, y_disassort),
                   ("Mixed",          G_mixed,     y_mixed)]:
    nx.set_node_attributes(G, {node: int(y[i]) for i, node in enumerate(G.nodes())}, 'label')
    r = nx.attribute_assortativity_coefficient(G, 'label')
    print(f"{name}: label assortativity = {r:.3f}")

Is the sign of the assortativity consistent with the accuracy you observed? For the mixed network, what does the assortativity value tell you about the balance of assortative and disassortative structure?

Discussion question. Why does a negative label assortativity cause one-step propagation to actively mislead rather than simply fail to help? Think about what happens to the score of a class-A node when $\alpha$ is large and most of its neighbours are class B.

Explore further. Try varying alpha across a range such as 0.01, 0.1, 0.5, 0.9. How does accuracy on each network type change as $\alpha$ increases? At very small $\alpha$, what is the method essentially doing? At $\alpha$ close to 1, what does the network term dominate over? Does the choice of $\alpha$ matter more for some network types than others?

4. Setting two_step=True replaces $S$ with $S^2$ in the propagation equation. The eigenvectors of $S^2$ are the same as those of $S$ — the same structural patterns — but all eigenvalues are now non-negative, because squaring removes the sign. A negative eigenvalue $-\lambda$ becomes $\lambda^2 > 0$. This means two-step propagation does not distinguish between assortative structure (positive eigenvalues) and disassortative structure (negative eigenvalues): it captures and amplifies both simultaneously.

(a) Re-run label propagation on all three networks with two_step=True. How does accuracy change for each case?

(b) The assortative and disassortative cases are straightforward to explain. Focus on the mixed network. The label structure [0, 1, 0, 1] across the four blocks is associated with a negative eigenvalue of $S$ — which is why one-step fails. After squaring, this eigenvalue becomes positive. But the mixed network also contains assortative structure (blocks of the same label do connect, via $p_a$). Why is two-step able to exploit both signals simultaneously, while one-step suppresses one of them?

(c) Looking at the three cases together — assortative (one-step works), disassortative (one-step fails), mixed (one-step fails) — what is the general lesson about when you can trust a method that assumes labels spread smoothly across edges?

Discussion question. The parameter $\alpha$ controls how far label information propagates. A large $\alpha$ gives the network structure more weight; a small $\alpha$ keeps predictions anchored to the known labels. How would you choose $\alpha$ without looking at the test labels? What does this choice have in common with regularisation in any supervised learning setting, and what specific danger arises if you tune $\alpha$ on the same nodes you are trying to classify?

Explore further. Try varying the training fraction — say 0.05, 0.1, 0.2, 0.4. How does accuracy change for one-step and two-step as fewer or more labels are known? Which network type is most sensitive to the amount of labelled data, and why does this make sense given the structure of the problem?