Weights initialization#
Numerical problems#
Vanishing gradients
Exploding gradients
Breaking symmetry
For details see [Zhang et al., 2023], section 5.4.
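As a quick numerical illustration of the first two problems (a minimal NumPy sketch with an arbitrary depth and width, not taken from the reference), the snippet below pushes a random vector through a deep stack of linear layers initialized at three different scales. The norm of the signal vanishes, stays stable, or explodes; since the backward pass multiplies by the same matrices (transposed), gradients behave in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x0 = rng.standard_normal(n)

# Push a random vector through `depth` linear layers for three weight scales:
# too small, well-scaled (1/sqrt(n)), and too large.
for scale in [0.5, 1.0, 2.0]:
    x = x0.copy()
    for _ in range(depth):
        W = rng.normal(0.0, scale / np.sqrt(n), size=(n, n))
        x = W @ x
    print(f"std(w) = {scale}/sqrt(n): ||x_{depth}|| = {np.linalg.norm(x):.2e}")
```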
Random initialization#
How to initialize the weights \(\boldsymbol W \in \mathbb R^{n_{\mathrm{in}} \times n_{\mathrm{out}}}\) of a linear layer?
To preserve zero mean:
\[
\mathbb E w_{ij} = 0.
\]
To preserve variance during the forward pass:
\[
\mathbb V w_{ij} = \frac 1{n_{\mathrm{in}}}.
\]
To preserve variance during the backward pass:
\[
\mathbb V w_{ij} = \frac 1{n_{\mathrm{out}}}.
\]
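Here is a minimal NumPy check of the two variance conditions (the layer sizes and batch size below are arbitrary): with \(\mathbb V w_{ij} = 1/n_{\mathrm{in}}\) the forward pass preserves variance, while a backpropagated signal has its variance scaled by roughly \(n_{\mathrm{out}} \mathbb V w_{ij} = n_{\mathrm{out}}/n_{\mathrm{in}}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 256, 10_000

# Unit-variance inputs and a weight matrix with Var(w) = 1 / n_in.
x = rng.standard_normal((batch, n_in))
W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

# Forward pass: output variance stays close to the input variance (≈ 1).
y = x @ W
print("forward: ", x.var(), "->", y.var())

# Backward pass: an upstream gradient is multiplied by W^T, so its variance
# is scaled by n_out * Var(w) = n_out / n_in (here ≈ 0.5), not preserved.
g_out = rng.standard_normal((batch, n_out))
g_in = g_out @ W.T
print("backward:", g_out.var(), "->", g_in.var())
```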
The last two conditions can be satisfied simultaneously only if \(n_{\mathrm{in}} = n_{\mathrm{out}}\). In [Glorot and Bengio, 2010] a compromise was suggested:
\[
\mathbb V w_{ij} = \frac 2{n_{\mathrm{in}} + n_{\mathrm{out}}}.
\]
In particular, they took
(39)#\[ w_{ij} \sim U\bigg[-\sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}, \sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}\bigg]\]
Q. Where do these strange numbers come from?
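A minimal NumPy sketch of (39) (the helper `xavier_uniform` below is our own, not a library function); comparing the empirical variance of the sampled entries with \(2/(n_{\mathrm{in}} + n_{\mathrm{out}})\) gives a hint. PyTorch provides the same scheme as `torch.nn.init.xavier_uniform_`.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Sample W with i.i.d. entries from U[-a, a], a = sqrt(6 / (n_in + n_out))."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(400, 200)
# Empirical variance of the entries vs. the target 2 / (n_in + n_out).
print(W.var(), 2.0 / (400 + 200))
```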
Initialization (39) works well if the activation function has a symmetric range (e.g., \(\psi(x) = \tanh(x)\)). For the ReLU activation, [He et al., 2015] suggest
\[
w_{ij} \sim \mathcal N\Big(0, \frac 2{n_{\mathrm{in}}}\Big).
\]
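The factor \(2\) compensates for ReLU zeroing out roughly half of the pre-activations. A small NumPy experiment (our own illustration, with arbitrary width and depth) shows the effect: with \(\mathbb V w_{ij} = 1/n_{\mathrm{in}}\) the activations of a deep ReLU network shrink towards zero, while with \(2/n_{\mathrm{in}}\) their scale remains stable. In PyTorch this scheme is available as `torch.nn.init.kaiming_normal_`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x = rng.standard_normal((1000, n))

# ReLU zeroes out roughly half of the pre-activations, halving their second
# moment; the factor 2 in the He initialization compensates for that.
for scale in [1.0, 2.0]:               # Var(w) = scale / n_in
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(scale / n), size=(n, n))
        h = np.maximum(h @ W, 0.0)     # ReLU
    print(f"Var(w) = {scale}/n_in: std after {depth} ReLU layers = {h.std():.2e}")
```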