Weights initialization#

Numerical problems#

  • Vanishing gradients

  • Exploding gradients

  • Breaking symmetry

For details see [Zhang et al., 2023], section 5.4.
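
A minimal NumPy sketch (layer sizes and weight scales are illustrative assumptions, not taken from the handbook) of the first two problems: in a deep stack of linear layers, the scale of the weights decides whether the signal, and hence the gradients, vanishes or explodes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 50
x = rng.normal(size=n)

# std of the weights relative to the "neutral" scale 1 / sqrt(n)
for scale in [0.5, 1.0, 2.0]:
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(0, scale / np.sqrt(n), size=(n, n))
        h = W @ h
    print(f"scale={scale}: |h| after {depth} layers = {np.linalg.norm(h):.2e}")

# Typical output: the norm collapses towards 0 for scale=0.5 and blows up
# for scale=2.0; the backward pass multiplies by the same matrices, so the
# gradients vanish or explode in exactly the same way.
```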

Random initialization#

ML Handbook chapter

How should we initialize the weights \(\boldsymbol W \in \mathbb R^{n_{\mathrm{in}} \times n_{\mathrm{out}}}\) of a linear layer?

  • To preserve zero mean:

    \[ \mathbb E w_{ij} = 0 \]
  • To preserve variance during forward pass:

    \[ \mathbb V w_{ij} = \frac 1{n_{\mathrm{in}}} \]
  • To preserve variance during backward pass:

    \[ \mathbb V w_{ij} = \frac 1{n_{\mathrm{out}}} \]

The last two conditions can both be satisfied only if \(n_{\mathrm{in}} = n_{\mathrm{out}}\). In [Glorot and Bengio, 2010] a compromise was suggested:

\[ \mathbb V w_{ij} = \frac 2{n_{\mathrm{in}} + n_{\mathrm{out}}}. \]
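
To see why a compromise is needed, here is a minimal NumPy sketch (sizes are illustrative assumptions) comparing how the three variance choices scale unit-variance inputs on the forward pass when \(n_{\mathrm{in}} \ne n_{\mathrm{out}}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 400, 100, 10_000
x = rng.normal(size=(batch, n_in))   # unit-variance inputs

for name, var in [("1/n_in", 1 / n_in),
                  ("1/n_out", 1 / n_out),
                  ("2/(n_in+n_out)", 2 / (n_in + n_out))]:
    W = rng.normal(0, np.sqrt(var), size=(n_in, n_out))
    y = x @ W                        # pre-activations of the linear layer
    print(f"{name:>15}: output variance ~ {y.var():.2f}")

# 1/n_in keeps the forward variance at 1, 1/n_out would do the same for the
# backward pass, and the compromise 2/(n_in+n_out) lands in between.
```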

In particular, Glorot and Bengio took

(37)#\[ w_{ij} \sim U\bigg[-\sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}, \sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}\bigg]\]

Q. Why such strange numbers?
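
In practice one rarely samples this by hand: `torch.nn.init.xavier_uniform_` implements the uniform scheme (37) (with its default gain) for a linear layer's weight. A minimal sketch, assuming PyTorch is available and using illustrative layer sizes:

```python
import torch

n_in, n_out = 300, 100
layer = torch.nn.Linear(n_in, n_out)

# U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))] with the default gain=1
torch.nn.init.xavier_uniform_(layer.weight)
torch.nn.init.zeros_(layer.bias)

# empirical variance of the weights is close to 2 / (n_in + n_out)
print(layer.weight.var().item(), 2 / (n_in + n_out))
```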

Initialization (37) works well if the activation function has a symmetric range (e.g., \(\psi(x) = \tanh(x)\)). For the ReLU activation, [He et al., 2015] suggest

\[ w_{ij} \sim \mathcal N\Big(0, \frac 2{n_{\mathrm{in}}}\Big). \]
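
A minimal PyTorch sketch of this He (Kaiming) scheme, again with illustrative layer sizes; `torch.nn.init.kaiming_normal_` with `mode="fan_in"` and `nonlinearity="relu"` draws from \(\mathcal N\big(0, \frac{2}{n_{\mathrm{in}}}\big)\):

```python
import torch

n_in, n_out = 300, 100
layer = torch.nn.Linear(n_in, n_out)

# N(0, 2 / n_in): std = sqrt(2 / fan_in), suited for ReLU activations
torch.nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
torch.nn.init.zeros_(layer.bias)

# empirical variance of the weights is close to 2 / n_in
print(layer.weight.var().item(), 2 / n_in)
```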