More formally, a graph convolutional network (GCN) is a neural network that operates on graphs. Given a graph G = (V, E), a GCN takes as input

an N × F⁰ input feature matrix X, where N is the number of nodes and F⁰ is the number of input features for each node, and

an N × N matrix representation of the graph structure, such as the adjacency matrix A of G [1].

A hidden layer in the GCN can thus be written as Hⁱ = f(Hⁱ⁻¹, A), where H⁰ = X and f is a propagation rule [1]. Each layer Hⁱ corresponds to an N × Fⁱ feature matrix where each row is a feature representation of a node. At each layer, these features are aggregated to form the next layer’s features using the propagation rule f. In this way, features become increasingly abstract at each consecutive layer. In this framework, variants of GCN differ only in the choice of propagation rule f [1].
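To make this layer-wise view concrete, here is a minimal sketch of such a forward pass in Python; the name gcn_forward and the idea of passing the propagation rule in as a function f are illustrative choices of mine, not notation from [1].

def gcn_forward(X, A, weights, f):
    # Layer-wise forward pass: H0 = X, Hi = f(Hi-1, A, Wi).
    # X: N x F0 feature matrix, A: N x N adjacency matrix,
    # weights: one weight matrix per layer, f: the propagation rule.
    H = X
    for W in weights:
        H = f(H, A, W)  # each layer re-aggregates features over the graph
    return H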

A Simple Propagation Rule

One of the simplest possible propagation rules is [1]:

f(Hⁱ, A) = σ(AHⁱWⁱ)

where Wⁱ is the weight matrix for layer i and σ is a non-linear activation function such as the ReLU function. The weight matrix has dimensions Fⁱ × Fⁱ⁺¹; in other words, the size of the second dimension of the weight matrix determines the number of features at the next layer. If you are familiar with convolutional neural networks, this operation is similar to a filtering operation since these weights are shared across nodes in the graph.
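As a rough numpy sketch of this rule (the relu helper and the example dimensions are my own assumptions, not code from [1]):

import numpy as np

def relu(x):
    # element-wise ReLU non-linearity, playing the role of sigma
    return np.maximum(x, 0)

def simple_propagation(H, A, W):
    # f(H, A) = sigma(A H W): A sums each node's neighbor features,
    # W projects the result into the next layer's feature space
    return relu(A @ H @ W)

With N = 4 nodes, Fⁱ = 2 and Fⁱ⁺¹ = 3, W would be a 2 × 3 matrix, e.g. W = np.random.randn(2, 3).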

Simplifications

Let’s examine the propagation rule at its simplest level. Let

i = 1, s.t. f is a function of the input feature matrix,

σ be the identity function, and

choose the weights s.t. AH⁰W⁰ = AXW⁰ = AX (i.e., W⁰ is the identity matrix).

In other words, f(X, A) = AX. This propagation rule is perhaps a bit too simple, but we will add in the missing parts later. As a side note, AX is now equivalent to the input layer of a multi-layer perceptron.

A Simple Graph Example

As a simple example, we’ll use the following graph:

A simple directed graph.

And below is its numpy adjacency matrix representation.

import numpy as np

A = np.matrix([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0]],
    dtype=float
)

Next, we need features! We generate 2 integer features for every node based on its index. This makes it easy to confirm the matrix calculations manually later.

In [3]: X = np.matrix([
            [i, -i]
            for i in range(A.shape[0])
        ], dtype=float)
        X

Out[3]: matrix([
            [ 0.,  0.],
            [ 1., -1.],
            [ 2., -2.],
            [ 3., -3.]
        ])

Applying the Propagation Rule

Alright! We now have a graph, its adjacency matrix A, and a set of input features X. Let’s see what happens when we apply the propagation rule:

In [6]: A * X

Out[6]: matrix([
            [ 1., -1.],
            [ 5., -5.],
            [ 1., -1.],
            [ 2., -2.]
        ])

What happened? The representation of each node (each row) is now a sum of its neighbors’ features! In other words, the graph convolutional layer represents each node as an aggregate of its neighborhood. I encourage you to check the calculation for yourself. Note that in this case a node n is a neighbor of node v if there exists an edge from v to n.
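If you want to verify one row in code, the snippet below (reusing the A and X defined above; the variable names are just for illustration) reproduces the second row of A * X:

neighbors = np.nonzero(np.array(A)[1])[0]  # out-neighbors of node 1: nodes 2 and 3
np.array(X)[neighbors].sum(axis=0)         # gives [ 5., -5.], matching row 1 of A * X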

Uh oh! Problems on the Horizon!

You may have already spotted the problems:

The aggregated representation of a node does not include its own features! The representation is an aggregate of the features of neighbor nodes, so only nodes that have a self-loop will include their own features in the aggregate [1].

Nodes with large degrees will have large values in their feature representation while nodes with small degrees will have small values. This can cause vanishing or exploding gradients [1, 2], but it is also problematic for stochastic gradient descent algorithms, which are typically used to train such networks and are sensitive to the scale (or range of values) of each of the input features.

In the following, I discuss each of these problems separately.

Adding Self-Loops

To address the first problem, one can simply add a self-loop to each node [1, 2]. In practice this is done by adding the identity matrix I to the adjacency matrix A before applying the propagation rule.

In [4]: I = np.matrix(np.eye(A.shape[0]))
        I

Out[4]: matrix([
            [1., 0., 0., 0.],
            [0., 1., 0., 0.],
            [0., 0., 1., 0.],
            [0., 0., 0., 1.]
        ])

In [8]: A_hat = A + I
        A_hat * X

Out[8]: matrix([
            [ 1., -1.],
            [ 6., -6.],
            [ 3., -3.],
            [ 5., -5.]
        ])

Since the node is now a neighbor of itself, the node’s own features are included when summing up the features of its neighbors!
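As a quick sanity check on node 1 (whose neighbors are nodes 2 and 3), its own features now appear in the sum:

np.array(X)[[1, 2, 3]].sum(axis=0)  # gives [ 6., -6.], matching row 1 of A_hat * X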

Normalizing the Feature Representations

The feature representations can be normalized by node degree by transforming the adjacency matrix A, multiplying it with the inverse of the degree matrix D [1]. Thus our simplified propagation rule looks like this [1]:

f(X, A) = D⁻¹AX

Let’s see what happens. We first compute the degree matrix.

In [9]: D = np.array(np.sum(A, axis=0))[0]
        D = np.matrix(np.diag(D))
        D

Out[9]: matrix([
            [1., 0., 0., 0.],
            [0., 2., 0., 0.],
            [0., 0., 2., 0.],
            [0., 0., 0., 1.]
        ])

Before applying the rule, let’s see what happens to the adjacency matrix after we transform it.