{% extends "layout.html" %} {% block content %}
Story-style intuition: The Social Network Detective
Imagine you're a detective trying to identify the members of two rival clubs, "The Eagles" and "The Sharks," in a large social network. You only know the affiliation of a few people (labeled data), but you have the entire network map showing who is friends with whom (unlabeled data). You make a simple but powerful assumption: "People are probably in the same club as their friends." So, you start with the known members and let their club affiliation "spread" to their direct friends, and then to their friends' friends, like a rumor. Eventually, this process reveals two distinct communities in the network. This is the core idea of Graph-Based Semi-Supervised Learning (SSL). It uses the connections between data points to propagate information from the labeled few to the unlabeled many.
Graph-Based Semi-Supervised Learning (SSL) is a family of algorithms that represents a dataset as a graph, where data points are nodes and the relationships between them are edges. It leverages the structure of this graph to infer the labels of unlabeled data points based on their proximity to labeled ones.
The foundation of this method is the graph representation itself, which is built on a key assumption about the data.
Example: In an image dataset of handwritten digits, each image is a node. An edge between two images of the digit '7' would be very strong (high weight) because their pixel values are similar. An edge between an image of a '7' and an image of a '1' would be very weak (low weight).
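To make the edge-weight idea concrete, here is a minimal sketch using a standard RBF similarity; the vectors below are made-up stand-ins for image features, not real digit data:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
# Made-up 4-dimensional feature vectors standing in for three digit images
seven_a = np.array([[0.90, 0.10, 0.80, 0.20]])   # an image of a '7'
seven_b = np.array([[0.85, 0.15, 0.75, 0.25]])   # another '7' with very similar features
one_a   = np.array([[0.10, 0.90, 0.20, 0.80]])   # an image of a '1' with different features
# RBF similarity turns distance into an edge weight: close to 1 for similar points, close to 0 otherwise
print(rbf_kernel(seven_a, seven_b, gamma=1.0))   # strong edge (weight near 1)
print(rbf_kernel(seven_a, one_a, gamma=1.0))     # weak edge (weight much smaller)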
Example: Imagine we have 10 data points. Two are labeled 'Class A' (blue) and one is labeled 'Class B' (red). The other seven are unlabeled (gray). After building the graph, we see the two blue points are strongly connected to a cluster of four gray points on the left. The red point is strongly connected to a cluster of three gray points on the right. The algorithm will naturally 'propagate' the blue label to the left cluster and the red label to the right one, as this is the smoothest fit for the graph's structure.
The "Social Network Detective" needs a way to measure how "chaotic" a potential labeling of the network is. A chaotic labeling would have lots of friends from rival clubs, while a smooth labeling would have very few. This measure of chaos is captured by the Graph Laplacian.
$$ L = D - W $$
Where W is the weighted adjacency (similarity) matrix of the graph and D is the diagonal degree matrix, whose entry D_ii is the sum of the weights of all edges attached to node i.
Example: A Simple Graph's Laplacian
Imagine a 3-node graph: Node 1 is connected to Node 2 (weight=1) and Node 2 is connected to Node 3 (weight=1).
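Plugging these two edges into L = D - W gives:
$$ W = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad L = D - W = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix} $$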
This L matrix now contains all the information about the graph's connectivity.
A labeling function f is "smooth" on the graph when strongly connected nodes receive similar values, which is measured by the penalty:
$$ \sum_{i,j} W_{ij} (f(x_i) - f(x_j))^2 $$
This sum is large whenever a heavy edge connects two points with different labels, and it can be written compactly as the quadratic form $$ 2 f^{\top} L f $$ which is why the Laplacian is the natural tool for scoring how "chaotic" a labeling is.
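As a quick sanity check, here is a minimal sketch that computes this penalty for the 3-node path graph above, comparing a smooth labeling against a conflicting one:
import numpy as np
# Laplacian of the 3-node path graph from the example above
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W
f_smooth = np.array([1.0, 1.0, 1.0])    # all nodes share one label
f_chaotic = np.array([1.0, -1.0, 1.0])  # the middle node disagrees with both neighbors
# f^T L f is proportional to the smoothness penalty; smaller means smoother
print(f_smooth @ L @ f_smooth)    # 0.0 -> perfectly smooth
print(f_chaotic @ L @ f_chaotic)  # 8.0 -> both edges are in conflict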
The general process has two stages: first build the similarity graph, then iteratively spread label information from the labeled nodes across it.
Example: A Single Step of Label Propagation
Consider an unlabeled node C, which has three neighbors: Node A (labeled Blue), Node B (labeled Blue), and Node D (labeled Red). All edge weights are equal. In the next iteration, node C looks at its neighbors. It sees two "Blue" votes and one "Red" vote. Therefore, Node C will update its own label to "Blue" because that's the majority label among its direct neighbors.
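Here is a minimal sketch of that single update; the node names and unit weights come from the toy example, and a real implementation would repeat this update for every unlabeled node until the labels stop changing:
import numpy as np
# Node C's neighbors and their current label distributions: [P(Blue), P(Red)]
neighbor_labels = {
    "A": np.array([1.0, 0.0]),  # labeled Blue
    "B": np.array([1.0, 0.0]),  # labeled Blue
    "D": np.array([0.0, 1.0]),  # labeled Red
}
edge_weights = {"A": 1.0, "B": 1.0, "D": 1.0}  # all edges equal, as in the example
# One propagation step: C takes the weighted average of its neighbors' label distributions
total_weight = sum(edge_weights.values())
c_distribution = sum(edge_weights[n] * neighbor_labels[n] for n in neighbor_labels) / total_weight
print(c_distribution)  # approximately [0.67, 0.33]
print("C becomes:", "Blue" if c_distribution[0] > c_distribution[1] else "Red")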
These methods work best when the data follows certain geometric patterns.
Example: In a dataset of animal photos, we assume that photos of cats will naturally form a tight cluster based on pixel similarity, separate from the cluster of dog photos.
Example: A set of photos of a person's face taken from different angles exists in a very high-dimensional pixel space (e.g., 10,000+ dimensions). However, the 'true' structure of this data is a simple 2D or 3D manifold representing the rotation of the head. The graph helps capture the "neighbor" relationships along this curved surface.
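A small sketch of this idea on a synthetic manifold (a swiss roll rather than face images, since it is easy to generate): the k-NN graph connects each point to its nearest neighbors, which for densely sampled data tend to lie along the curved surface itself.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
# 3D points that actually live on a curled-up 2D sheet
X, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)
# Sparse k-NN graph: each point is linked only to its 8 nearest neighbors
graph = kneighbors_graph(X, n_neighbors=8, mode='distance')
print(graph.shape)  # (500, 500) sparse adjacency matrix
print(graph.nnz)    # number of stored edges, far fewer than 500 * 500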
| Advantages | Disadvantages |
|---|---|
| ✅ Naturally incorporates the structure of both labeled and unlabeled data. | ❌ Graph construction is expensive: building the similarity matrix for a dataset with N samples requires calculating N² distances. Example: for a dataset of just 50,000 images, this means 50,000 × 50,000 = 2.5 billion distance calculations (see the quick calculation after this table). |
| ✅ Very effective when you have very few labeled samples but a large amount of unlabeled data. | ❌ Highly sensitive to how the graph is built: a poor choice of similarity metric or neighborhood size (k in a k-NN graph) can lead to bad results. Example: with k=3 you might miss a connection to an important fourth neighbor; with k=20 you might introduce noisy connections to dissimilar points. |
| ✅ Works well on complex, high-dimensional data where the underlying structure is important. | ❌ Does not scale well to datasets with millions of nodes due to the computational and memory costs. |
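To put the N² cost in perspective, here is a back-of-the-envelope calculation; the 50,000-image figure comes from the table, and the 8 bytes per entry assumes a dense float64 similarity matrix:
n = 50_000
pairwise_entries = n * n              # 2,500,000,000 similarity values to compute
bytes_dense = pairwise_entries * 8    # float64 -> 8 bytes per entry
print(f"{pairwise_entries:,} entries")
print(f"{bytes_dense / 1e9:.0f} GB for a dense similarity matrix")  # roughly 20 GB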
Example: Given a massive database of web images, you can manually label a few as 'cat', 'dog', and 'car'. The algorithm can then use visual similarity (e.g., comparing color histograms or deep learning features) to propagate these labels to the rest of the database.
Example: Label a few news articles as 'Sports' or 'Politics'. The algorithm can use word similarity (e.g., TF-IDF vectors) to build a graph and categorize millions of other articles on a news feed.
Example: Start with a few known 'pro-A' and 'pro-B' Twitter users. The algorithm can propagate these labels through the follower/friend network to estimate the political stance of millions of other users.
Scikit-learn provides easy-to-use implementations of `LabelPropagation` and `LabelSpreading`. In this example, we'll create a dataset that forms two clear "moon" shapes. This is a classic problem where linear classifiers fail. We'll label only two points (one in each moon) and let the algorithm infer the labels for all the others based on the graph structure.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
# --- 1. Create a Sample Dataset ---
# make_moons is perfect for testing graph methods because its structure is nonlinear
X, y_true = make_moons(n_samples=200, noise=0.08, random_state=42)
# --- 2. Create a small labeled set and a large unlabeled set ---
# We simulate a real-world scenario by providing labels for only 2 points!
y_labeled = np.full_like(y_true, -1) # -1 is the scikit-learn code for "unlabeled"
# Label exactly one point from each moon: make_moons shuffles its samples,
# so we pick the first index of each class rather than just the first and last rows
y_labeled[np.where(y_true == 0)[0][0]] = 0
y_labeled[np.where(y_true == 1)[0][0]] = 1
# --- 3. Train a Label Propagation Model ---
# The model will build a k-NN graph behind the scenes
lp_model = LabelPropagation(kernel='knn', n_neighbors=10)
# The .fit() method does all the work: builds the graph and propagates labels
lp_model.fit(X, y_labeled)
y_pred_lp = lp_model.predict(X)
# --- 4. Train a Label Spreading Model ---
ls_model = LabelSpreading(kernel='knn', n_neighbors=10)
ls_model.fit(X, y_labeled)
y_pred_ls = ls_model.predict(X)
# --- 5. Visualize the Results ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Plot Label Propagation Results
# Small dots are the inferred labels, large circles are the original labeled "seeds"
ax1.scatter(X[y_labeled == -1, 0], X[y_labeled == -1, 1], c=y_pred_lp[y_labeled == -1], cmap='viridis', marker='.')
ax1.scatter(X[y_labeled != -1, 0], X[y_labeled != -1, 1], c=y_labeled[y_labeled != -1], cmap='viridis', marker='o', s=100, edgecolor='k')
ax1.set_title('Label Propagation Results')
# Plot Label Spreading Results
ax2.scatter(X[y_labeled == -1, 0], X[y_labeled == -1, 1], c=y_pred_ls[y_labeled == -1], cmap='viridis', marker='.')
ax2.scatter(X[y_labeled != -1, 0], X[y_labeled != -1, 1], c=y_labeled[y_labeled != -1], cmap='viridis', marker='o', s=100, edgecolor='k')
ax2.set_title('Label Spreading Results')
plt.show()
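Because make_moons also returns the ground-truth labels, we can check how well the two seed labels were propagated. This is only possible here because the data is synthetic; in a real semi-supervised setting y_true would not be available:
from sklearn.metrics import accuracy_score
# Quick sanity check against the ground truth for this synthetic dataset
print("Label Propagation accuracy:", accuracy_score(y_true, y_pred_lp))
print("Label Spreading accuracy:  ", accuracy_score(y_true, y_pred_ls))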
1. The smoothness assumption means that if two data points are very similar (connected by a strong edge), they are very likely to have the same label. The algorithm tries to find a labeling that minimizes conflicts across strong edges.
2. Nodes represent the data points (both labeled and unlabeled). Edge weights represent the similarity between two connected data points; a higher weight means greater similarity.
3. The main bottleneck is the graph construction phase, which can require computing a similarity score (like distance) between all pairs of N nodes, an operation that scales with O(NΒ²).
4. Label Spreading is generally more robust to noisy labels because of its clamping factor (the `alpha` parameter in scikit-learn), which lets every node, including the originally labeled ones, blend its initial label with information from its neighbors. A single mislabeled seed therefore cannot dominate the result the way it can under the hard clamping used by Label Propagation.
The Story: Decoding the Social Network Detective's Dossier