$\begingroup$

I was thinking about RNNs and noticed that the update equations for an RNN could be:

$$ h_t = \tanh( W_{hx} x_t + b_x) + \tanh( W_{hh} h_{t-1} + b_h ) $$ $$ o_t = W_{oh} h_t + b_o $$

but instead are:

$$ h_t = \tanh( W_{hx} x_t + W_{hh} h_{t-1} + b_h ) $$ $$ o_t = W_{oh} h_t + b_o $$
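For concreteness, here is a minimal sketch of one step of the standard update in NumPy (all dimensions and the random weights are arbitrary assumptions, just to show the shapes work out):

```python
import numpy as np

# One step of the standard RNN update: combine both affine maps,
# then apply the non-linearity once.
rng = np.random.default_rng(0)
n_x, n_h, n_o = 3, 4, 2          # input, hidden, output dims (arbitrary)

W_hx = rng.standard_normal((n_h, n_x))
W_hh = rng.standard_normal((n_h, n_h))
W_oh = rng.standard_normal((n_o, n_h))
b_h = np.zeros(n_h)
b_o = np.zeros(n_o)

x_t = rng.standard_normal(n_x)
h_prev = np.zeros(n_h)           # previous hidden state

h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)  # sum first, tanh once
o_t = W_oh @ h_t + b_o

print(h_t.shape, o_t.shape)  # (4,) (2,)
```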

Why are they not written as in the first equation?

The only way I found to justify it to myself is by thinking about the terms one at a time, i.e. suppose I get one input and one previous state. How can I combine them to make the next hidden state? Obviously not by adding them element-wise directly, since they possibly have different dimensions (even if they were passed through a non-linearity). How about projecting them somehow, combining them first, and then deciding where the non-linearity goes:

$$ \langle W_h , h \rangle + \langle W_x , x \rangle $$

we could do:

$$ \theta( \langle W_h , h \rangle + \langle W_x , x \rangle + b_h ) $$

or

$$ \theta( \langle W_h , h \rangle + b_h ) + \theta( \langle W_x , x \rangle + b_x ) $$

If we write out both expressions and think of $\theta$ as a non-linearity, say a cubic or quadratic function (i.e. a crude approximation to a Gaussian or sigmoid), then the separate application of the non-linearity is clearly in trouble: it is missing cross terms between $h$ and $x$, like $hx$ or $hx^2$, etc. In terms of functions that depend on time, it seems we have decreased the expressivity of the network.
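The cross-term point can be checked numerically with a toy quadratic non-linearity $\theta(z) = z^2$ and scalar $h$, $x$ (the weights below are arbitrary assumptions for illustration): the combined form contains an $hx$ term, the separated form does not.

```python
# Toy check of the cross-term argument with theta(z) = z^2, scalar h and x.
w_h, w_x = 2, -3  # arbitrary integer "weights" (exact arithmetic)

def combined(h, x):
    # theta applied after summing the affine maps: (w_h*h + w_x*x)^2
    return (w_h * h + w_x * x) ** 2

def separate(h, x):
    # theta applied to each affine map separately: (w_h*h)^2 + (w_x*x)^2
    return (w_h * h) ** 2 + (w_x * x) ** 2

def interaction(f):
    # Finite-difference mixed term f(1,1) - f(1,0) - f(0,1) + f(0,0);
    # it is non-zero exactly when f contains an h*x cross term.
    return f(1, 1) - f(1, 0) - f(0, 1) + f(0, 0)

print(interaction(combined))  # -12, i.e. 2*w_h*w_x: cross term present
print(interaction(separate))  # 0: no cross term
```

So with the separated form, a single time step cannot represent any interaction between the current input and the previous state.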

Is that true, or is it not correct? After thinking about it a bit more, it seems that after some number of time steps some cross terms might be introduced, but only between very old time steps and much more recent ones; i.e. there seems to be a delay or skipping going on. I'm not sure.

In short, why do we combine the affine transformations first and then apply the non-linearity?