Network Architecture

The general formula for a Relation Network can be written as:

RN(O) = f_φ( Σ_{i,j} g_θ(o_i, o_j) )

First, we represent the input as a set of objects (o_i and o_j). Then, after applying a function g_θ to every combination of object pairs, we sum the results before passing them through another function f_φ.
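As a minimal illustration of the formula's structure (not the paper's actual g and f, which are MLPs), here g and f are simple stand-in functions:

```python
from itertools import product

def relation_network(objects, g, f):
    # RN(O) = f( sum over all ordered pairs (o_i, o_j) of g(o_i, o_j) )
    return f(sum(g(o_i, o_j) for o_i, o_j in product(objects, repeat=2)))

# Stand-in functions for illustration: g multiplies a pair, f negates the sum.
objects = [1.0, 2.0, 3.0]
result = relation_network(objects, lambda a, b: a * b, lambda s: -s)
print(result)  # -36.0, since the pairwise products sum to (1+2+3)^2 = 36
```

The double loop over i and j is the defining feature: g sees every pair of objects, and only the summed output reaches f.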

The key property of a Relation Network is that it treats objects independently of their domain. As a result, a single network can learn relations over objects from different domains — here, image features and question embeddings.

The process of applying a Relation Network to a VQA problem is as follows:

1. Generate feature maps

Question: embedded with an LSTM

Image: CNN (4 convolutional layers)

2. Apply the Relation Network (RN) steps

Generate object pairs with the question: we form all possible combinations of the objects extracted from the image and the question.

Add coordinate position information to each per-pixel image feature.

Apply the first four-layer MLP (function g).

Apply an element-wise sum over the outputs of function g.

Apply the final two-layer MLP (function f).
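The steps above can be sketched end to end with random, untrained weights. The sizes here (a 3x3 feature map with 24 channels, a question embedding of length 32, hidden width 64, 10 answer classes) are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def mlp(x, layers):
    # Apply a stack of (W, b) layers with ReLU activations.
    for W, b in layers:
        x = relu(x @ W + b)
    return x

n_obj, obj_dim, q_dim, hidden = 9, 24, 32, 64   # illustrative sizes
feats = rng.standard_normal((n_obj, obj_dim))    # 9 objects from a 3x3 map

# Append normalized (row, col) coordinates to each image object.
coords = np.array([[i // 3 - 1.0, i % 3 - 1.0] for i in range(n_obj)])
objs = np.concatenate([feats, coords], axis=1)   # (9, 26)

# Build all 9 * 9 = 81 object pairs, each concatenated with the question.
q = rng.standard_normal(q_dim)
pairs = np.array([np.concatenate([objs[i], objs[j], q])
                  for i in range(n_obj) for j in range(n_obj)])  # (81, 84)

# g: a four-layer MLP applied to every pair.
g_layers, d = [], pairs.shape[1]
for _ in range(4):
    g_layers.append((rng.standard_normal((d, hidden)) * 0.1, np.zeros(hidden)))
    d = hidden
g_out = mlp(pairs, g_layers)                     # (81, 64)

# Element-wise sum over all pairs, then f: a two-layer MLP.
summed = g_out.sum(axis=0)                       # (64,)
f_layers = [(rng.standard_normal((hidden, hidden)) * 0.1, np.zeros(hidden)),
            (rng.standard_normal((hidden, 10)) * 0.1, np.zeros(10))]
answer_scores = mlp(summed, f_layers)            # (10,) — e.g. 10 answer classes
print(answer_scores.shape)
```

Note that g runs once per pair while f runs once per example; the summation in between is what keeps f's input size independent of the number of objects.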

An end-to-end Relation Network¹

For better understanding, let's examine a simple example. Suppose there are three 3x3 CNN feature maps and one 1x4 question feature vector. Each of the 9 spatial positions then yields one image object of length 3 (one value per feature map), so we end up with 9 * 9 = 81 combinations of object pairs. Note that the first dimension is inflated to 81 times the original batch size, since we created 81 new data pairs. After applying the function g, the inflated batch is reduced back to the original batch size by element-wise summation before being passed to the function f. Assuming each image object has length 3 and the question object has length 4, we can depict the architecture as below:
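The shape bookkeeping in this example can be checked directly with NumPy; the batch size of 2 is an arbitrary choice for the demonstration:

```python
import numpy as np

B, n, obj_len, q_len = 2, 9, 3, 4         # batch, objects per image, lengths
objs = np.random.randn(B, n, obj_len)     # 9 image objects per example
q = np.random.randn(B, q_len)             # one question vector per example

# Pair every object with every object and append the question:
# each pair has length 3 + 3 + 4 = 10, and there are 9 * 9 = 81 pairs.
o_i = np.repeat(objs, n, axis=1)                   # (B, 81, 3)
o_j = np.tile(objs, (1, n, 1))                     # (B, 81, 3)
q_rep = np.repeat(q[:, None, :], n * n, axis=1)    # (B, 81, 4)
pairs = np.concatenate([o_i, o_j, q_rep], axis=2)  # (B, 81, 10)

# The batch dimension is inflated 81x before g ...
flat = pairs.reshape(B * n * n, -1)
print(flat.shape)          # (162, 10)

# ... and the element-wise sum restores the original batch size before f.
g_out = flat.reshape(B, n * n, -1)   # (pretend g was applied row-wise here)
summed = g_out.sum(axis=1)
print(summed.shape)        # (2, 10)
```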

An example of data structure for RN

The coordinate position vector provides pixel position information to each image object. Its structure is as follows:

def cvt_coord(i):
    # Map the flattened index i of a 5x5 feature map to
    # normalized (row, col) coordinates in [-1, 1].
    return [(i // 5 - 2) / 2.0, (i % 5 - 2) / 2.0]
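For a 5x5 feature map, the function maps the flattened pixel index to normalized (row, column) coordinates. Here is a self-contained version (with the integer division made explicit for Python 3) and a few sample values:

```python
def cvt_coord(i):
    # Index i of a 5x5 feature map -> (row, col) coordinates in [-1, 1].
    return [(i // 5 - 2) / 2.0, (i % 5 - 2) / 2.0]

print(cvt_coord(0))    # [-1.0, -1.0]  top-left pixel
print(cvt_coord(12))   # [0.0, 0.0]    center pixel
print(cvt_coord(24))   # [1.0, 1.0]    bottom-right pixel
```

These two extra values are concatenated onto each image object before pairing, so g can use spatial relationships between pixels.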