

!pip install torch_geometric

Collecting torch_geometric
  Downloading torch_geometric-2.3.1.tar.gz (661 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: tqdm, numpy, scipy, jinja2, requests, pyparsing, scikit-learn, psutil (and their dependencies)
Building wheels for collected packages: torch_geometric
  Building wheel for torch_geometric (pyproject.toml) ... done
Successfully built torch_geometric
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.3.1

Hands-on Graph Neural Networks with PyTorch Geometric (3): Multi-Layer Perceptron
Machine learning research on graph-structured data, such as social networks, has recently attracted a lot of attention. There are various
machine learning tasks on graph data, such as node classification, link prediction, and graph classification; in this article we tackle the
node classification task. As our model, we use a simple neural network, the Multi-Layer Perceptron (MLP). The MLP is often used as a
baseline against which to compare GNNs, because it ignores the graph topology and is trained using only node features.

In this article we will train MLP on three different datasets and compare the results.

Through this article, we will learn the following:

How to handle PyTorch and PyTorch Geometric
Characteristics of the Multi-Layer Perceptron
How to train a Multi-Layer Perceptron

import os
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from torch import nn
from torch import Tensor
import torch.nn.functional as F
from torch.nn import Linear, ReLU
import torch_geometric
from torch_geometric.datasets import Planetoid, WebKB
from torch_geometric.data import Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

cuda

MLP

The multi-layer perceptron is the most basic type of feedforward neural network. Its units are arranged in layers, with connections only
between adjacent layers, and information propagates in one direction, from the input side to the output side.
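Concretely, with one hidden layer the computation is just two affine maps with a nonlinearity in between:

h = ReLU(x W1 + b1)
y = softmax(h W2 + b2)

(The implementation below returns log-probabilities via log_softmax, which is the numerically stable variant of the same thing.)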

Here we will use PyTorch to create an MLP with one hidden layer.

class MLP(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, dropout):
        super(MLP, self).__init__()
        self.lin1 = Linear(in_channels, hidden_channels)
        self.lin2 = Linear(hidden_channels, out_channels)
        self.dropout = dropout

    def reset_parameters(self):
        self.lin1.reset_parameters()
        self.lin2.reset_parameters()

    def forward(self, data):
        x = data.x
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.lin1(x)
        x = x.relu()
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.lin2(x)
        return F.log_softmax(x, dim=1)

If you want to know the structure of a neural network, look at its forward method. Dropout is a special operation that is only active
during training; you can ignore it for now (a tiny illustration of its behavior follows below).

The input x passes through the first linear layer and is transformed by the ReLU function. It then goes through the second linear layer,
and a (log-)softmax turns the result into label predictions. Very simple.
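If you are curious about the dropout step anyway, here is a tiny sketch of its train/eval behavior (added for illustration; not part of the original notebook):

x = torch.ones(5)
# During training, each element is zeroed with probability p and the
# survivors are scaled by 1/(1-p) to keep the expected value unchanged.
print(F.dropout(x, p=0.5, training=True))   # e.g. tensor([2., 0., 2., 2., 0.])
# During evaluation, dropout is the identity.
print(F.dropout(x, p=0.5, training=False))  # tensor([1., 1., 1., 1., 1.])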

The MLP structure we will study can be schematically depicted as follows.

[Figure: input features → linear layer → ReLU → linear layer → softmax output]

We will perform model training later in the article, so let us create a function for this purpose. GNNs almost always perform full-batch
training using all training data at once. Since the MLP does not use the adjacency matrix, it could just as well be trained with
mini-batches, but here we write the code for full-batch training.
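For reference, here is a hedged sketch of what mini-batch training of this MLP could look like (a hypothetical helper, not used in this article); the full-batch training function we actually use follows.

from torch.utils.data import DataLoader, TensorDataset

def train_minibatch(model, data, batch_size=32, lr=0.01, epochs=200):
    # A loader over the training nodes only; this is possible because the
    # MLP looks at node features alone, never at the adjacency structure.
    loader = DataLoader(
        TensorDataset(data.x[data.train_mask], data.y[data.train_mask]),
        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            # Wrap the batch in a Data object, since MLP.forward reads data.x.
            loss = F.nll_loss(model(Data(x=xb)), yb)
            loss.backward()
            optimizer.step()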

def run_training(model, data, lr=0.01, weight_decay=5e-4, epochs=200):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    def train():
        model.train()
        optimizer.zero_grad()
        out = model(data)
        # The model already outputs log-probabilities; since log_softmax is
        # idempotent, cross_entropy here is equivalent to F.nll_loss.
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
        return float(loss)

    @torch.no_grad()
    def test():
        model.eval()
        pred = model(data).argmax(dim=-1)
        accs = []
        for mask in [data.train_mask, data.val_mask, data.test_mask]:
            accs.append(int((pred[mask] == data.y[mask]).sum()) / int(mask.sum()))
        return accs

    train_acc_list, val_acc_list, test_acc_list = [], [], []
    best_val_acc = test_acc = 0
    for epoch in range(1, epochs + 1):
        loss = train()
        train_acc, val_acc, tmp_test_acc = test()
        train_acc_list.append(train_acc)
        val_acc_list.append(val_acc)
        test_acc_list.append(tmp_test_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            # Remember the test accuracy at the best validation epoch.
            test_acc = tmp_test_acc
    print(f'Epoch: {epoch:03d}, Train: {train_acc:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}')
    return train_acc_list, val_acc_list, test_acc_list

Model Training
We will now train MLP on the following three datasets.

Iris Dataset
Cora Dataset
Texas Dataset

Iris Dataset
Before using a graph dataset, we first train the MLP on the iris dataset, a classic machine learning dataset. Each sample has 4 features,
and the label is the species name.

iris = load_iris(as_frame=True)
df = iris["frame"]
print(df.shape)  # (150, 5)
df.head()

(150, 5)

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0

The data is not graph data, so it would normally be handled as a data frame. However, in order to reuse the same training procedure as
for the graph datasets discussed later, we convert it into the PyTorch Geometric Data format. Note that no edge index is included,
because this is not a graph dataset.

X = torch.Tensor(df.iloc[:, :4].values)
y = torch.LongTensor(df["target"].values)
train, test = train_test_split(df, test_size=0.2, random_state=0)
train, val = train_test_split(train, test_size=0.25, random_state=0)
print(train.shape, val.shape, test.shape)

def get_mask(index):
    mask = np.repeat([False], 150)
    mask[index] = True
    mask = torch.tensor(mask, dtype=torch.bool)
    return mask

train_mask = get_mask(train.index)
val_mask = get_mask(val.index)
test_mask = get_mask(test.index)
iris = Data(x=X, y=y, train_mask=train_mask, val_mask=val_mask, test_mask=test_mask)
iris
# Data(x=[150, 4], y=[150], train_mask=[150], val_mask=[150], test_mask=[150])

(90, 5) (30, 5) (30, 5)


Data(x=[150, 4], y=[150], train_mask=[150], val_mask=[150], test_mask=[150])

Excellent! We have now converted the data into a form that PyTorch Geometric can handle. The x holds the feature values and the y holds
the labels. The last three attributes, train_mask, val_mask, and test_mask, may be unfamiliar: they store the train/validation/test split
as boolean arrays.
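A boolean mask simply selects the rows where it is True; a tiny illustration (added for clarity):

t = torch.tensor([10, 20, 30, 40])
mask = torch.tensor([True, False, True, False])
print(t[mask])          # tensor([10, 30]) -- rows where the mask is True
print(int(mask.sum()))  # 2 -- size of the selected subset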

Let’s see how many data are split for training, validation, and testing, respectively.

print(f'Number of training nodes: {iris.train_mask.sum()}')
print(f'Number of validation nodes: {iris.val_mask.sum()}')
print(f'Number of test nodes: {iris.test_mask.sum()}')

Number of training nodes: 90
Number of validation nodes: 30
Number of test nodes: 30

We can see that the data is split into 90, 30, and 30 nodes for training, validation, and test, respectively.

Now we will use this data to train MLP.

epochs = 200
mlp = MLP(in_channels=4, hidden_channels=16, out_channels=3, dropout=0)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, iris, epochs=epochs)

Epoch: 200, Train: 1.0000, Val: 0.9333, Test: 1.0000

The fitting seems to be working well, since the accuracy on the training data is 1. The test accuracy shown is the one recorded at the
epoch where the validation accuracy peaked: here the validation accuracy peaks at 0.93 and the corresponding test accuracy is 1.0.

To check whether overfitting is occurring, let us plot the change in the accuracy on the training data and the accuracy on the validation data.

plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

[Plot: training and validation accuracy per epoch on the iris dataset]
Model training seems to be progressing well.

Cora Dataset
The Cora dataset is a well-known dataset in the field of graph research. It consists of 2708 scientific publications, each classified into
one of seven classes. The citation network consists of 5429 links. Each publication is described by a 0/1-valued word vector indicating the
absence/presence of the corresponding word from a dictionary of 1433 unique words. See my previous article for more details.

cora_dataset = Planetoid(root=data_dir, name='Cora')
cora = cora_dataset[0]
cora
# Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

The dataset is ready. Compare it to the iris dataset we just used: the Cora dataset additionally contains edge_index, which is specific to graph data.
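Since edge_index is new here, a quick look at its format may help (a small sketch added for illustration). It stores edges in COO format: a [2, num_edges] tensor whose columns are (source, target) node-index pairs.

print(cora.edge_index.shape)   # torch.Size([2, 10556])
print(cora.edge_index[:, :5])  # the first five (source, target) columns
# Each citation appears in both directions, so the 10556 directed edges
# correspond to 5278 undirected links (the Planetoid version differs
# slightly from the 5429 links quoted above).
print(cora.num_nodes, cora.num_edges)  # 2708 10556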

The three masks, train_mask, val_mask, and test_mask, come prepared in advance. Let's see how many nodes are assigned to training,
validation, and testing, respectively.

print(f'Number of training nodes: {cora.train_mask.sum()}')
print(f'Number of validation nodes: {cora.val_mask.sum()}')
print(f'Number of test nodes: {cora.test_mask.sum()}')

Number of training nodes: 140
Number of validation nodes: 500
Number of test nodes: 1000

We can see that the data is split into 140, 500, and 1000 nodes for training, validation, and test, respectively. The proportion of
training data is low, so learning is expected to be difficult.
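As an aside, 140 training nodes correspond to the standard Planetoid split of 20 labelled nodes per class (7 × 20 = 140); a quick check (a small sketch added for illustration):

print(collections.Counter(cora.y[cora.train_mask].tolist()))
# Expect 20 training nodes for each of the 7 classes.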

Now we will train MLP.

mlp = MLP(in_channels=cora_dataset.num_features, hidden_channels=16,
          out_channels=cora_dataset.num_classes, dropout=0.5)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, cora)

Epoch: 200, Train: 1.0000, Val: 0.5260, Test: 0.5190

The fitting seems to be working well, since the accuracy on the training data is 1. However, the accuracies on the validation and test
data are low, at around 0.5.

We will plot the change in the accuracy on the training data and the accuracy on the validation data.

plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Texas Dataset
The Texas dataset is part of the WebKB dataset, a collection of web pages gathered from the computer science departments of various
universities by Carnegie Mellon University. WebKB has three sub-datasets, Cornell, Texas, and Wisconsin, in which nodes represent web
pages and edges are hyperlinks between them; we use Texas. It contains 183 web pages (nodes) and 309 hyperlinks (edges). Node features
are the bag-of-words representation of the web pages, which are manually classified into five categories: student, project, course,
staff, and faculty.

The Texas dataset has different graph characteristics than the Cora dataset, but this does not affect the MLP, since it does not use the
graph information for training.

texas_dataset = WebKB(root=data_dir, name='texas')
texas = texas_dataset[0]
# WebKB ships with ten predefined splits; keep only the first one.
texas.train_mask = texas.train_mask[:, 0]
texas.val_mask = texas.val_mask[:, 0]
texas.test_mask = texas.test_mask[:, 0]
texas
# Data(x=[183, 1703], edge_index=[2, 325], y=[183], train_mask=[183], val_mask=[183], test_mask=[183])

Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/new_data/texas/out1_node_feature_label.txt
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/new_data/texas/out1_graph_edges.txt
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_0.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_1.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_2.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_3.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_4.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_5.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_6.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_7.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_8.npz
Downloading https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master/splits/texas_split_0.6_0.2_9.npz
Processing...
Done!
Data(x=[183, 1703], edge_index=[2, 325], y=[183], train_mask=[183], val_mask=[183], test_mask=[183])

We can see that the number of data is much smaller than the Cora dataset.
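Note that WebKB ships with ten predefined train/validation/test splits (the texas_split_0.6_0.2_*.npz files downloaded above); in the cell above we kept only the first one. For more robust numbers one could average over all ten; a hedged sketch, reusing run_training:

def run_over_splits(dataset, epochs=200):
    # Hedged sketch: dataset[0] should return a fresh copy whose masks
    # still have shape [num_nodes, 10], one column per predefined split.
    test_accs = []
    for i in range(10):
        data = dataset[0]
        data.train_mask = data.train_mask[:, i]
        data.val_mask = data.val_mask[:, i]
        data.test_mask = data.test_mask[:, i]
        model = MLP(in_channels=dataset.num_features, hidden_channels=16,
                    out_channels=dataset.num_classes, dropout=0.5)
        _, _, test_acc_list = run_training(model, data, epochs=epochs)
        test_accs.append(test_acc_list[-1])  # last-epoch test accuracy
    print(f'Mean test accuracy over 10 splits: {np.mean(test_accs):.4f}')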

Again, the three masks, such as train_mask, are prepared in advance. Let's see how many nodes are assigned to training, validation, and
testing, respectively.

print(f'Number of training nodes: {texas.train_mask.sum()}')
print(f'Number of validation nodes: {texas.val_mask.sum()}')
print(f'Number of test nodes: {texas.test_mask.sum()}')

Number of training nodes: 87
Number of validation nodes: 59
Number of test nodes: 37

We can see that the data is split into 87, 59, and 37 nodes for training, validation, and test, respectively.

Now let's go on with the training.

mlp = MLP(in_channels=texas_dataset.num_features, hidden_channels=16,
          out_channels=texas_dataset.num_classes, dropout=0.5)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, texas)

Epoch: 200, Train: 1.0000, Val: 0.7966, Test: 0.7838

On the Texas dataset, the validation accuracy is about 0.80 and the test accuracy about 0.78, which is better than on the Cora dataset.

Below we plot the accuracy on the training data against the accuracy on the validation data. Model training seems to be progressing well
here as well.

plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Since no parameter tuning was performed this time, higher accuracy could likely be achieved. Still, the performance of the model was
clearly different between the Cora and Texas datasets.

MLP, which does not learn the properties of the graph, was likely at a disadvantage on the Cora dataset because of its homophilous nature:
connected nodes tend to share labels, so graph-aware models have an edge there. Conversely, it performed reasonably well on the Texas
dataset, which has heterophilous properties.
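We can quantify this intuition. Here is a minimal sketch (added here; not part of the original notebook) of the edge homophily ratio, i.e. the fraction of edges whose two endpoints share a label:

def edge_homophily(data):
    # Fraction of edges that connect two nodes with the same label.
    src, dst = data.edge_index
    return float((data.y[src] == data.y[dst]).float().mean())

print(f'Cora:  {edge_homophily(cora):.2f}')   # high ratio -> homophilous
print(f'Texas: {edge_homophily(texas):.2f}')  # low ratio  -> heterophilous

PyTorch Geometric also provides torch_geometric.utils.homophily, which computes this kind of measure directly.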

