Graph Machine Learning: Node Embeddings, Classification & Link Prediction

A comprehensive implementation of graph machine learning techniques including shallow and deep node embeddings, node classification with Graph Attention Networks (GAT), and link prediction using Graph Autoencoders (GAE).

Overview

This project demonstrates three core graph ML tasks:

  1. Node Embeddings - Comparing shallow (Node2Vec) vs deep (GNN-based) embedding methods
  2. Node Classification - Semi-supervised classification using GAT with hyperparameter optimization
  3. Link Prediction - Reconstructing missing edges using Graph Autoencoders

Results Summary

| Task | Method | Metric | Score |
|------|--------|--------|-------|
| Node Embedding (Deep) | GCN Autoencoder | ROC-AUC | 90.21% |
| Node Classification | GAT + Optuna | Test Accuracy | 84.30% |
| Node Classification | GAT + Optuna | Macro F1 | 83.06% |
| Link Prediction | GAT Autoencoder | ROC-AUC | 93.66% |

Datasets

Hamsterster Social Network (Part A)

A social network dataset representing friendships between users of the Hamsterster website.

| Property | Value |
|----------|-------|
| Nodes | 2,426 |
| Edges | 16,630 |
| Type | Undirected |

Cora Citation Network (Parts B & C)

A benchmark citation network where nodes are scientific papers and edges represent citations.

| Property | Value |
|----------|-------|
| Nodes | 2,708 |
| Edges | 10,556 |
| Features | 1,433 (bag-of-words) |
| Classes | 7 |

Part A: Node Embeddings

Shallow Embedding: Node2Vec

Node2Vec learns node representations by optimizing a skip-gram objective over random walks.

Architecture & Parameters:

Embedding Dimension: 128
Walk Length: 15
Context Size: 10
Walks per Node: 10
Negative Samples: 5
p, q: 1.0, 1.0 (unbiased random walk)
Epochs: 50
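
The snippet below is a minimal sketch of this configuration using torch_geometric.nn.Node2Vec (the notebook may instead use the standalone node2vec package); edge_index is assumed to hold the Hamsterster graph as a [2, num_edges] tensor, and the batch size and learning rate are illustrative:

import torch
from torch_geometric.nn import Node2Vec

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# edge_index: [2, num_edges] tensor for the Hamsterster graph (assumed loaded)
model = Node2Vec(
    edge_index,
    embedding_dim=128, walk_length=15, context_size=10,
    walks_per_node=10, num_negative_samples=5,
    p=1.0, q=1.0,        # p = q = 1.0 gives an unbiased (DeepWalk-style) walk
    sparse=True,
).to(device)

loader = model.loader(batch_size=128, shuffle=True)
optimizer = torch.optim.SparseAdam(list(model.parameters()), lr=0.01)

for epoch in range(50):
    model.train()
    for pos_rw, neg_rw in loader:             # positive and negative random walks
        optimizer.zero_grad()
        model.loss(pos_rw.to(device), neg_rw.to(device)).backward()
        optimizer.step()

embeddings = model().detach().cpu()           # shape (2426, 128)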

Results:

  • Final embedding shape: (2426, 128)
  • Training time: ~24 minutes
  • Loss converged from 8.91 → 0.99

Deep Embedding: GCN-based Graph Autoencoder

A Graph Convolutional Network encoder learns embeddings by reconstructing the adjacency matrix.

Architecture:

Input → GCN (hidden=128) → ReLU → Dropout(0.5) → GCN (out=64) → Embeddings
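
A minimal sketch of this encoder wrapped in PyG's GAE, which supplies the inner-product decoder; since Hamsterster has no node attributes, the input features x are an assumption (e.g., an identity or degree-based matrix):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GAE

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels=128, out_channels=64):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)        # 64-dimensional embeddings

model = GAE(GCNEncoder(in_channels=num_features))  # inner-product decoder by default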

Results:

  • Final embedding shape: (2426, 64)
  • Test ROC-AUC: 90.21%
  • Early stopping at epoch 109 (best at epoch 89)

Embedding Comparison

| Metric | Value |
|--------|-------|
| Mean cosine similarity (per node) | -0.12 |
| Top-10 neighbor overlap | 14.22% |

The near-zero mean cosine similarity and low neighbor overlap indicate that the shallow and deep methods capture substantially different structural properties of the same graph.
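
The exact comparison procedure lives in the notebook; the sketch below shows one reasonable reading of these metrics, where shallow_emb (2426 × 128) and deep_emb (2426 × 64) are the arrays produced above. Top-k neighbor overlap is dimension-agnostic, while the per-node cosine similarity assumes the Node2Vec vectors are first reduced to 64-d (here via PCA) so the two spaces are comparable:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def topk_overlap(emb_a, emb_b, k=10):
    # fraction of shared k-nearest neighbors per node, averaged over the graph
    idx_a = NearestNeighbors(n_neighbors=k + 1).fit(emb_a).kneighbors(emb_a, return_distance=False)[:, 1:]
    idx_b = NearestNeighbors(n_neighbors=k + 1).fit(emb_b).kneighbors(emb_b, return_distance=False)[:, 1:]
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_a, idx_b)]))

shallow_64 = PCA(n_components=64).fit_transform(shallow_emb)   # (2426, 128) -> (2426, 64)
cos = np.sum(shallow_64 * deep_emb, axis=1) / (
    np.linalg.norm(shallow_64, axis=1) * np.linalg.norm(deep_emb, axis=1))
print(cos.mean(), topk_overlap(shallow_emb, deep_emb, k=10))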

Part B: Node Classification (Cora)

Model: Graph Attention Network (GAT)

GAT uses attention mechanisms to weigh neighbor contributions during message passing.

Architecture:

Input (1433) → GAT Layer 1 (heads=10, hidden=16) → ELU → Dropout
            → GAT Layer 2 (heads=1, out=7) → Softmax
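
A sketch of this two-layer GAT in PyG, filled in with the best hyperparameters found below: the first layer concatenates its 10 heads (10 × 16 = 160 features), and the second averages a single head into the 7 class logits.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_channels=1433, hidden=16, heads=10, num_classes=7, dropout=0.588):
        super().__init__()
        self.dropout = dropout
        self.conv1 = GATConv(in_channels, hidden, heads=heads, dropout=dropout)
        self.conv2 = GATConv(hidden * heads, num_classes, heads=1, concat=False, dropout=dropout)

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return F.log_softmax(self.conv2(x, edge_index), dim=-1)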

Hyperparameter Optimization

Hyperparameters were tuned with Optuna (30 trials of Bayesian optimization):

| Parameter | Search Space | Best Value |
|-----------|--------------|------------|
| Learning Rate | [1e-4, 1e-2] | 0.00987 |
| Weight Decay | [1e-6, 1e-3] | 0.000106 |
| Hidden Dim | [8, 16, 32, 64] | 16 |
| Attention Heads | [4, 6, 8, 10, 12] | 10 |
| Dropout | [0.3, 0.7] | 0.588 |
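
A hedged sketch of this search; train_and_eval is a hypothetical helper that trains the GAT with the sampled parameters and returns validation accuracy:

import optuna

def objective(trial):
    params = {
        'lr':           trial.suggest_float('lr', 1e-4, 1e-2, log=True),
        'weight_decay': trial.suggest_float('weight_decay', 1e-6, 1e-3, log=True),
        'hidden':       trial.suggest_categorical('hidden', [8, 16, 32, 64]),
        'heads':        trial.suggest_categorical('heads', [4, 6, 8, 10, 12]),
        'dropout':      trial.suggest_float('dropout', 0.3, 0.7),
    }
    return train_and_eval(params)   # hypothetical: returns best validation accuracy

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)
print(study.best_params)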

Training Strategy

  1. Optuna search on train/val split
  2. Early stopping with patience=40 (sketched after this list)
  3. Retrain on combined train+val with best hyperparameters
  4. Final evaluation on held-out test set
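
A minimal sketch of step 2's validation-checkpointed early stopping, assuming the GAT above and a standard PyG Planetoid data object; the same loop can be reused for the final retrain by swapping in the combined train+val mask:

import copy
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=0.00987, weight_decay=0.000106)
best_val, best_state, patience, wait = 0.0, None, 40, 0

for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    F.nll_loss(out[data.train_mask], data.y[data.train_mask]).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        pred = model(data.x, data.edge_index).argmax(dim=-1)
        val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean().item()
    if val_acc > best_val:                       # checkpoint on improvement
        best_val, best_state, wait = val_acc, copy.deepcopy(model.state_dict()), 0
    else:
        wait += 1
        if wait >= patience:                     # stop after 40 stale epochs
            break

model.load_state_dict(best_state)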

Results

| Metric | Score |
|--------|-------|
| Best Validation Accuracy | 82.00% |
| Final Test Accuracy | 84.30% |
| Macro F1-Score | 83.06% |

Part C: Link Prediction (Cora)

Task Setup

  • Removed 10% of edges (527 edges) for testing (split sketched below)
  • Validation set: 263 edges
  • Training set: 8,976 edges
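
These counts are consistent with PyG's train_test_split_edges at val_ratio=0.05 and test_ratio=0.10 on Cora's 5,278 undirected edges (the 8,976 training edges count both directions). That utility matches the notebook's import, though it is deprecated in recent PyG in favor of RandomLinkSplit; a sketch:

from torch_geometric.datasets import Planetoid
from torch_geometric.utils import train_test_split_edges

dataset = Planetoid(root='data', name='Cora')
data = dataset[0]
data.train_mask = data.val_mask = data.test_mask = None   # masks unused for link prediction
data = train_test_split_edges(data, val_ratio=0.05, test_ratio=0.10)
# -> data.train_pos_edge_index, data.val_pos_edge_index, data.test_pos_edge_index, ...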

Model: GAT-based Graph Autoencoder

Encoder Architecture:

Input (1433) → GAT Layer 1 (hidden=128, heads=4) → ELU → Dropout(0.3)
            → GAT Layer 2 (latent=64, heads=4) → Node Embeddings

Decoder: inner product between node embeddings, passed through a sigmoid to produce an edge probability
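
A sketch of this encoder wrapped in GAE, whose default InnerProductDecoder scores an edge (i, j) as sigmoid(z_i^T z_j). Averaging rather than concatenating the second layer's heads (concat=False) keeps the latent dimension at 64; that choice is an assumption here.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, GAE

class GATEncoder(torch.nn.Module):
    def __init__(self, in_channels=1433, hidden=128, latent=64, heads=4):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden, heads=heads)
        self.conv2 = GATConv(hidden * heads, latent, heads=heads, concat=False)

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.3, training=self.training)
        return self.conv2(x, edge_index)        # 64-d node embeddings

model = GAE(GATEncoder())                       # inner-product decoder by default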

Training Configuration

Epochs: 300 (with early stopping)
Patience: 40
Learning Rate: 1e-3
Weight Decay: 5e-5
Hidden Dim: 128
Latent Dim: 64
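
One epoch of training plus evaluation under this configuration, using GAE's built-in recon_loss and test helpers; data comes from the split sketched above:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-5)

def train_epoch():
    model.train()
    optimizer.zero_grad()
    z = model.encode(data.x, data.train_pos_edge_index)
    loss = model.recon_loss(z, data.train_pos_edge_index)  # negatives sampled internally
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def evaluate(pos_edge_index, neg_edge_index):
    model.eval()
    z = model.encode(data.x, data.train_pos_edge_index)
    return model.test(z, pos_edge_index, neg_edge_index)   # (ROC-AUC, average precision)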

Results

| Metric | Score |
|--------|-------|
| Best Validation AUC | 94.69% |
| Train AUC @ best | 98.48% |
| Final Test ROC-AUC | 93.66% |

Training completed in 146 seconds with early stopping at epoch 237 (best at epoch 197).

Visualizations

The project includes several visualizations:

  • Network visualization with Louvain community detection
  • t-SNE/UMAP projections of node embeddings (see the sketch after this list)
  • Training curves (loss and AUC over epochs)
  • ROC curves for link prediction
  • Precision-Recall curves
  • Confusion matrix for node classification
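
As a taste of the plotting code, here is a sketch of the t-SNE projection colored by Louvain communities; G, embeddings, and the node ordering matching the embedding rows are assumptions:

import matplotlib.pyplot as plt
import community as community_louvain            # python-louvain
from sklearn.manifold import TSNE

partition = community_louvain.best_partition(G)  # node -> community id
colors = [partition[n] for n in G.nodes()]       # assumes embedding rows follow G.nodes()

coords = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=colors, cmap='tab10', s=8)
plt.title('t-SNE of node embeddings, colored by Louvain community')
plt.show()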

Installation

pip install torch torchvision torchaudio
# the PyG extension wheels below may need the PyG wheel index, e.g.
#   pip install ... -f https://data.pyg.org/whl/torch-<torch-version>+<cuda>.html
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric
pip install node2vec umap-learn scikit-learn matplotlib networkx
pip install optuna python-louvain

Usage

# Run in Google Colab or Jupyter Notebook
# The notebook is self-contained and downloads datasets automatically

# Key imports
from torch_geometric.nn import GATConv, GCNConv, GAE
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import train_test_split_edges

Project Structure

├── graph_ml_project.ipynb    # Main notebook with all implementations
├── README.md                  # This file
└── data/
    ├── Cora/                  # Cora dataset (auto-downloaded)
    └── soc-hamsterster.edges  # Hamsterster network

Key Techniques Demonstrated

  • Random Walk Embeddings (Node2Vec with configurable p, q)
  • Graph Convolutional Networks (GCN for encoding)
  • Graph Attention Networks (GAT with multi-head attention)
  • Graph Autoencoders (GAE for unsupervised learning)
  • Hyperparameter Optimization (Optuna with pruning)
  • Early Stopping (validation-based checkpointing)
  • Community Detection (Louvain algorithm)
  • Dimensionality Reduction (t-SNE, UMAP, PCA)

Requirements

  • Python 3.10+
  • PyTorch 2.0+
  • PyTorch Geometric 2.0+
  • CUDA (optional, for GPU acceleration)
