Weights initialization#
Numerical problems#
Vanishing gradients
Exploding gradients
Breaking symmetry
For details see [Zhang et al., 2023], section 5.4.
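As a quick numerical illustration of the first two problems (a minimal NumPy sketch with an arbitrary depth and width, not taken from the reference), the snippet below pushes a random vector through a deep stack of linear layers initialized at three different scales. The norm of the signal vanishes, stays stable, or explodes; since the backward pass multiplies by the same matrices (transposed), gradients behave in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x0 = rng.standard_normal(n)

# Push a random vector through `depth` linear layers for three weight scales:
# too small, well-scaled (1/sqrt(n)), and too large.
for scale in [0.5, 1.0, 2.0]:
    x = x0.copy()
    for _ in range(depth):
        W = rng.normal(0.0, scale / np.sqrt(n), size=(n, n))
        x = W @ x
    print(f"std(w) = {scale}/sqrt(n): ||x_{depth}|| = {np.linalg.norm(x):.2e}")
```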
Random initialization#
How to initialize the weights \(\boldsymbol W \in \mathbb R^{n_{\mathrm{in}} \times n_{\mathrm{out}}}\) of a linear layer?
To preserve zero mean:
\[
\mathbb E w_{ij} = 0.
\]
To preserve variance during the forward pass:
\[
\mathbb V w_{ij} = \frac 1{n_{\mathrm{in}}}.
\]
To preserve variance during the backward pass:
\[
\mathbb V w_{ij} = \frac 1{n_{\mathrm{out}}}.
\]
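Here is a minimal NumPy check of the two variance conditions (the layer sizes and batch size below are arbitrary): with \(\mathbb V w_{ij} = 1/n_{\mathrm{in}}\) the forward pass preserves variance, while a backpropagated signal has its variance scaled by roughly \(n_{\mathrm{out}} \mathbb V w_{ij} = n_{\mathrm{out}}/n_{\mathrm{in}}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 256, 10_000

# Unit-variance inputs and a weight matrix with Var(w) = 1 / n_in.
x = rng.standard_normal((batch, n_in))
W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

# Forward pass: output variance stays close to the input variance (≈ 1).
y = x @ W
print("forward: ", x.var(), "->", y.var())

# Backward pass: an upstream gradient is multiplied by W^T, so its variance
# is scaled by n_out * Var(w) = n_out / n_in (here ≈ 0.5), not preserved.
g_out = rng.standard_normal((batch, n_out))
g_in = g_out @ W.T
print("backward:", g_out.var(), "->", g_in.var())
```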
The last two conditions can be satisfied simultaneously only if \(n_{\mathrm{in}} = n_{\mathrm{out}}\). In [Glorot and Bengio, 2010] a compromise was suggested:
\[
\mathbb V w_{ij} = \frac 2{n_{\mathrm{in}} + n_{\mathrm{out}}}.
\]
In particular, they took
(39)#\[ w_{ij} \sim U\bigg[-\sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}, \sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}\bigg]\]
Q. Where do these strange numbers come from?
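A minimal NumPy sketch of (39) (the helper `xavier_uniform` below is our own, not a library function); comparing the empirical variance of the sampled entries with \(2/(n_{\mathrm{in}} + n_{\mathrm{out}})\) gives a hint. PyTorch provides the same scheme as `torch.nn.init.xavier_uniform_`.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Sample W with i.i.d. entries from U[-a, a], a = sqrt(6 / (n_in + n_out))."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(400, 200)
# Empirical variance of the entries vs. the target 2 / (n_in + n_out).
print(W.var(), 2.0 / (400 + 200))
```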
Initialization (39) works well if the activation function has a symmetric range (e.g., \(\psi(x) = \tanh(x)\)). For the ReLU activation, [He et al., 2015] suggest
\[
w_{ij} \sim \mathcal N\Big(0, \frac 2{n_{\mathrm{in}}}\Big).
\]
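The factor \(2\) compensates for ReLU zeroing out roughly half of the pre-activations. A small NumPy experiment (our own illustration, with arbitrary width and depth) shows the effect: with \(\mathbb V w_{ij} = 1/n_{\mathrm{in}}\) the activations of a deep ReLU network shrink towards zero, while with \(2/n_{\mathrm{in}}\) their scale remains stable. In PyTorch this scheme is available as `torch.nn.init.kaiming_normal_`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x = rng.standard_normal((1000, n))

# ReLU zeroes out roughly half of the pre-activations, halving their second
# moment; the factor 2 in the He initialization compensates for that.
for scale in [1.0, 2.0]:               # Var(w) = scale / n_in
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(scale / n), size=(n, n))
        h = np.maximum(h @ W, 0.0)     # ReLU
    print(f"Var(w) = {scale}/n_in: std after {depth} ReLU layers = {h.std():.2e}")
```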