Training of MLP#

Learning objectives#

Binary classification#

The output of the neural network is usually a number in \([0, 1]\), interpreted as the probability of the positive class. The sigmoid function is the typical choice for the output layer:

\[ \widehat y = x_L = x_{\mathrm{out}} = \sigma(x_{L-1}) \]

Loss function:

\[ \mathcal L(\widehat y, y) = -y\log(\widehat y) -(1-y) \log(1-\widehat y) \]
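As a sanity check, both formulas can be written out directly in NumPy. The function names and sample values below are illustrative, not part of the text:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y):
    """Binary cross-entropy loss for a single sample."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

z = 0.8                      # pre-activation x_{L-1} of the output neuron (made up)
y_hat = sigmoid(z)           # predicted probability of the positive class
print(bce_loss(y_hat, y=1))  # small loss: prediction agrees with the label
print(bce_loss(y_hat, y=0))  # larger loss: prediction contradicts the label
```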

Multiclass classification#

  • For \(K\) classes the output contains \(K\) numbers \((\widehat y_1, \ldots, \widehat y_K)\)

  • \(\widehat y_k\) is the probability of class \(k\)

  • Now the output of the neural network is

\[ \boldsymbol{\widehat y} = \boldsymbol x_L = \boldsymbol x_{\mathrm{out}} = \mathrm{SoftMax}(\boldsymbol x_{L-1}), \]
\[ \mathrm{SoftMax}(\boldsymbol z)_i = \frac{e^{z_i}}{\sum\limits_{k=1}^K e^{z_k}}\]
  • Finally, plug the predictions into the cross-entropy loss:

\[ \mathcal L(\boldsymbol{\widehat y}, \boldsymbol y) = -\sum\limits_{k=1}^K y_k\log(\widehat y_k) \]
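The same can be done for softmax and cross-entropy. A minimal NumPy sketch (the function names and example scores are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of scores."""
    z = z - z.max()               # shifting by a constant does not change the result
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, y):
    """Cross-entropy between one-hot labels y and predicted probabilities y_hat."""
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 0.5, -1.0])    # pre-activations x_{L-1} for K = 3 classes
y_hat = softmax(z)                # predicted class probabilities, sum to 1
y = np.array([1.0, 0.0, 0.0])     # one-hot encoding of the true class
print(y_hat, cross_entropy(y_hat, y))
```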

Regression#

  • Predict a real number \(\widehat y = x_L = x_{\mathrm{out}}\)

  • The loss function is usually quadratic:

\[ \mathcal L(\widehat y, y) = (\widehat y - y)^2 \]
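A one-line check of the quadratic loss in plain Python (the values are illustrative):

```python
def squared_loss(y_hat, y):
    """Quadratic (squared error) loss for a single prediction."""
    return (y_hat - y) ** 2

print(squared_loss(3.2, 3.0))  # ~0.04
```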

Forward and backward pass#

Figure: forward pass through the network (../_images/forward_pass.png)

The goal is to minimize the loss function with respect to parameters \(\boldsymbol \theta\),

\[ \mathcal L = \mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \to \min\limits_{\boldsymbol \theta} \]

where

\[ \boldsymbol \theta = (\boldsymbol W_1, \boldsymbol b_1, \boldsymbol W_2, \boldsymbol b_2, \ldots, \boldsymbol W_L, \boldsymbol b_L) \]

Let’s use the standard technique: gradient descent!

  1. Start from some random parameters \(\boldsymbol \theta_0\)

  2. Given a training sample \((\boldsymbol x, \boldsymbol y)\), do the forward pass and get the output \(\boldsymbol {\widehat y} = F_{\boldsymbol \theta}(\boldsymbol x)\)

  3. Calculate the loss function \(\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y)\)

  4. Do the backward pass and calculate gradients

    \[ \nabla_{\boldsymbol\theta}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \]

    i.e.,

    \[ \nabla_{\boldsymbol W_\ell}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \text{ and } \nabla_{\boldsymbol b_\ell}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y), \quad 1\leqslant \ell \leqslant L. \]
  5. Update the parameters:

\[ \boldsymbol \theta = \boldsymbol \theta - \eta \nabla_{\boldsymbol\theta}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \]
  6. Go to step 2 with the next training sample. A minimal code sketch of this loop is given below.
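With an automatic differentiation framework the whole loop fits in a few lines. Below is a minimal sketch using PyTorch; the architecture, the learning rate \(\eta = 0.1\), and the data are made up for illustration, and any autograd framework would do:

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small MLP for binary classification: input -> hidden -> sigmoid output
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
eta = 0.1                                # learning rate

x = torch.randn(1, 4)                    # one training sample (made-up data)
y = torch.tensor([[1.0]])

for step in range(3):
    y_hat = model(x)                     # 2. forward pass
    loss = loss_fn(y_hat, y)             # 3. loss value
    model.zero_grad()
    loss.backward()                      # 4. backward pass: gradients w.r.t. all W_l, b_l
    with torch.no_grad():
        for p in model.parameters():     # 5. gradient descent update
            p -= eta * p.grad
    print(step, loss.item())
```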

Batch training#

It is computationally inefficient to update all the parameters after each individual training sample. Instead, take a batch of \(B\) training samples at a time and form the matrix \(\boldsymbol X_{\mathrm{in}}\) of shape \(B\times n_0\). Now each hidden representation is a matrix of shape \(B \times n_i\):

\[ \boldsymbol X_i = \psi_i(\boldsymbol X_{i-1} \boldsymbol W_i +\boldsymbol B_i), \]

where \(\boldsymbol B_i\) denotes the bias \(\boldsymbol b_i\) broadcast to all \(B\) rows.

The output also has \(B\) rows. For instance, in the case of a multiclass classification task we have

\[ \boldsymbol X_L = \boldsymbol {\widehat Y} \in \mathbb R^{B\times K}, \]
\[ \mathcal L(\boldsymbol {\widehat Y}, \boldsymbol Y) = -\frac 1B\sum\limits_{i = 1}^B \sum\limits_{k=1}^K Y_{ik}\log(\widehat Y_{ik}) \]
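A NumPy sketch of a two-layer batched forward pass and the averaged cross-entropy above; all shapes and values are made up for illustration, and ReLU stands in for the activation \(\psi_1\):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n0, n1, K = 5, 4, 8, 3                  # batch size and layer widths (made up)

X_in = rng.normal(size=(B, n0))            # batch of inputs, shape B x n_0
W1, b1 = rng.normal(size=(n0, n1)), np.zeros(n1)
W2, b2 = rng.normal(size=(n1, K)), np.zeros(K)

X1 = np.maximum(X_in @ W1 + b1, 0)         # hidden representation, shape B x n_1 (ReLU)
Z = X1 @ W2 + b2                           # output scores, shape B x K

# Row-wise softmax: one probability vector per sample
E = np.exp(Z - Z.max(axis=1, keepdims=True))
Y_hat = E / E.sum(axis=1, keepdims=True)

Y = np.eye(K)[rng.integers(0, K, size=B)]  # one-hot labels, shape B x K
loss = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))  # cross-entropy averaged over the batch
print(Y_hat.shape, loss)
```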