Twitter proof: neural networks and the linear activation function


In this post we will see why it is pointless to have two consecutive layers of neurons with linear activation functions in a neural network. With just a bit of maths we can conclude that $n$ consecutive linear layers compute exactly the same functions as a single linear layer.

Claim: having two fully connected layers of neurons using linear activation functions is the same as having just one layer with a linear activation function.
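Before the maths, here is a quick numerical sanity check of the claim. This is only a minimal sketch assuming numpy, with layer sizes and weight shapes of my own choosing: it stacks two layers of linear neurons and then recovers a single affine map $o \mapsto Mo + c$ that reproduces them.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 5, 3  # hypothetical sizes for the three layers involved

# weights into each layer, plus the per-neuron linear activations f(x) = a*x + b
W1, a1, b1 = rng.normal(size=(n1, n0)), rng.normal(size=n1), rng.normal(size=n1)
W2, a2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2), rng.normal(size=n2)

def two_linear_layers(o):
    """Forward pass through two fully connected layers of linear neurons."""
    h = a1 * (W1 @ o) + b1      # first layer
    return a2 * (W2 @ h) + b2   # second layer

# If the claim holds, the whole network is a single affine map o -> M @ o + c.
# Recover c and M by probing the network at 0 and at the basis vectors...
c = two_linear_layers(np.zeros(n0))
M = np.column_stack([two_linear_layers(e) - c for e in np.eye(n0)])

# ...and check that this single map reproduces the two layers on a random input.
o = rng.normal(size=n0)
print(np.allclose(two_linear_layers(o), M @ o + c))  # prints: True
```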

We just have to lay down some notation in order for the maths to be doable. Assume the two consecutive layers of linear neurons are preceded by a layer with $n_0$ neurons, whose outputs are $o_1, o_2, \cdots, o_{n_0}$.
Let us say that after that layer there is a layer of $n_1$ neurons with linear activation functions $f_i(x) = a_ix + b_i$; the weight from neuron $t$ of the previous layer to neuron $i$ of this layer is $w_{t,i}$.
The second layer has $n_2$ neurons, with linear activation functions $f_j'(x) = a_j'x+b_j'$, and the weight from neuron $i$ of the previous layer to neuron $j$ of this layer is $w'_{i,j}$.

Twitter proof: we will compute the outputs produced by the second layer in terms of the values $o_1, \cdots, o_{n_0}$ of the $0$-th layer, and conclude that we obtain a linear function of those values. The output of the $j$-th neuron will be:
$$\begin{align}
f'_j\left[\sum_{i=1}^{n_1}w'_{i,j}f_i\left(\sum_{t=1}^{n_0} w_{t,i}o_t\right)\right] &= a'_j\left[\sum_{i=1}^{n_1}w'_{i,j}f_i\left(\sum_{t=1}^{n_0} w_{t,i}o_t\right)\right] + b'_j\\
&= a'_j\left[\sum_{i=1}^{n_1}w'_{i,j}\left(a_i\left(\sum_{t=1}^{n_0} w_{t,i}o_t\right) + b_i\right)\right] + b'_j\\
&= a'_j\left[\sum_{i=1}^{n_1}w'_{i,j}a_i\left(\sum_{t=1}^{n_0} w_{t,i}o_t\right) + \sum_{i=1}^{n_1}w'_{i,j}b_i\right] + b'_j\\
&= a'_j\left[\sum_{i=1}^{n_1}w'_{i,j}a_i\left(\sum_{t=1}^{n_0} w_{t,i}o_t\right)\right] + a'_j\sum_{i=1}^{n_1}w'_{i,j}b_i + b'_j\\
&= a'_j\left[\sum_{t=1}^{n_0}\sum_{i=1}^{n_1}w'_{i,j}a_iw_{t,i}o_t\right] + a'_j\sum_{i=1}^{n_1}w'_{i,j}b_i + b'_j.
\end{align}$$

Now, defining $a''_j = a'_j$,

$$b''_j = b'_j + a'_j \sum_{i=1}^{n_1}w'_{i,j}b_i$$

and

$$w''_{t,j} = \sum_{i=1}^{n_1} w'_{i,j} a_i w_{t,i},$$

we get that the output of the $j$-th neuron really is

$$f''_j\left(\sum_{t=1}^{n_0} w''_{t,j} o_t\right)$$

where $f''_j(x) = a''_jx + b''_j$, a linear function. This concludes our proof!
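To see the bookkeeping above in action, here is another small sketch (again assuming numpy, with hypothetical layer sizes, and with the weights stored so that `w[t, i]` matches the $w_{t,i}$ of the notation) that builds $a''_j$, $b''_j$ and $w''_{t,j}$ directly from the formulas of the proof and checks that the collapsed single layer produces the same outputs as the two original ones:

```python
import numpy as np

rng = np.random.default_rng(42)
n0, n1, n2 = 4, 5, 3  # hypothetical numbers of neurons in the 0-th, first and second layers

# w[t, i] is the weight from neuron t of the 0-th layer to neuron i of the first layer,
# and f_i(x) = a[i]*x + b[i] is the activation of neuron i of the first layer.
w = rng.normal(size=(n0, n1))
a, b = rng.normal(size=n1), rng.normal(size=n1)

# w2[i, j] is the weight from neuron i of the first layer to neuron j of the second layer,
# and f'_j(x) = a2[j]*x + b2[j] is the activation of neuron j of the second layer.
w2 = rng.normal(size=(n1, n2))
a2, b2 = rng.normal(size=n2), rng.normal(size=n2)

o = rng.normal(size=n0)  # outputs o_1, ..., o_{n_0} of the 0-th layer

# forward pass through the two layers of linear neurons
first = a * (o @ w) + b           # f_i(sum_t w_{t,i} o_t)
second = a2 * (first @ w2) + b2   # f'_j(sum_i w'_{i,j} ...)

# the single collapsed layer, built from the definitions in the proof
a_pp = a2                          # a''_j = a'_j
b_pp = b2 + a2 * (b @ w2)          # b''_j = b'_j + a'_j * sum_i w'_{i,j} b_i
w_pp = w @ (w2 * a[:, None])       # w''_{t,j} = sum_i w'_{i,j} a_i w_{t,i}

collapsed = a_pp * (o @ w_pp) + b_pp   # f''_j(sum_t w''_{t,j} o_t)
print(np.allclose(second, collapsed))  # prints: True
```

Here `w2 * a[:, None]` multiplies row $i$ of $w'$ by $a_i$, so the matrix product `w @ (w2 * a[:, None])` is exactly the double sum that defines $w''_{t,j}$.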

I wrote this post today because of the #100DaysofMLcode initiative I started (yesterday!).

  - RGS
