Training of MLP#

Learning objectives#

Binary classification#

The output of the neural network is usually a number in \([0, 1]\), interpreted as the probability of the positive class. The sigmoid function is the typical choice for the output layer:

\[ \widehat y = x_L = x_{\mathrm{out}} = \sigma(x_{L-1}) \]

Loss function:

\[ \mathcal L(\widehat y, y) = -y\log(\widehat y) -(1-y) \log(1-\widehat y) \]
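As a sanity check, both formulas can be written out directly in NumPy. The function names and sample values below are illustrative, not part of the text:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y):
    """Binary cross-entropy loss for a single sample."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

z = 0.8                      # pre-activation x_{L-1} of the output neuron (made up)
y_hat = sigmoid(z)           # predicted probability of the positive class
print(bce_loss(y_hat, y=1))  # small loss: prediction agrees with the label
print(bce_loss(y_hat, y=0))  # larger loss: prediction contradicts the label
```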

Multiclass classification#

  • For \(K\) classes the output contains \(K\) numbers \((\widehat y_1, \ldots, \widehat y_K)\)

  • \(\widehat y_k\) is the probability of class \(k\)

  • Now the output of the neural network is

\[ \boldsymbol{\widehat y} = \boldsymbol x_L = \boldsymbol x_{\mathrm{out}} = \mathrm{SoftMax}(\boldsymbol x_{L-1}), \]
\[ \mathrm{SoftMax}(\boldsymbol z)_i = \frac{e^{z_i}}{\sum\limits_{k=1}^K e^{z_k}}\]
  • Finally, plug the predictions into the cross-entropy loss:

\[ \mathcal L(\boldsymbol{\widehat y}, \boldsymbol y) = -\sum\limits_{k=1}^K y_k\log(\widehat y_k) \]
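The same can be done for softmax and cross-entropy. A minimal NumPy sketch (the function names and example scores are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of scores."""
    z = z - z.max()               # shifting by a constant does not change the result
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, y):
    """Cross-entropy between one-hot labels y and predicted probabilities y_hat."""
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 0.5, -1.0])    # pre-activations x_{L-1} for K = 3 classes
y_hat = softmax(z)                # predicted class probabilities, sum to 1
y = np.array([1.0, 0.0, 0.0])     # one-hot encoding of the true class
print(y_hat, cross_entropy(y_hat, y))
```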

Regression#

  • Predict a real number \(\widehat y = x_L = x_{\mathrm{out}}\)

  • The loss function is usually quadratic:

\[ \mathcal L(\widehat y, y) = (\widehat y - y)^2 \]
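A one-line check of the quadratic loss in plain Python (the values are illustrative):

```python
def squared_loss(y_hat, y):
    """Quadratic (squared error) loss for a single prediction."""
    return (y_hat - y) ** 2

print(squared_loss(3.2, 3.0))  # ~0.04
```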

Forward and backward pass#

Figure: forward pass through the network (../_images/forward_pass.png)

The goal is to minimize the loss function with respect to parameters \(\boldsymbol \theta\),

\[ \mathcal L = \mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \to \min\limits_{\boldsymbol \theta} \]

where

\[ \boldsymbol \theta = (\boldsymbol W_1, \boldsymbol b_1, \boldsymbol W_2, \boldsymbol b_2, \ldots, \boldsymbol W_L, \boldsymbol b_L) \]

Let’s use the standard technique: gradient descent!

  1. Start from some random parameters \(\boldsymbol \theta_0\)

  2. Given a training sample \((\boldsymbol x, \boldsymbol y)\), do the forward pass and get the output \(\boldsymbol {\widehat y} = F_{\boldsymbol \theta}(\boldsymbol x)\)

  3. Calculate the loss function \(\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y)\)

  4. Do the backward pass and calculate gradients

    \[ \nabla_{\boldsymbol\theta}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \]

    i.e.,

    \[ \nabla_{\boldsymbol W_\ell}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \text{ and } \nabla_{\boldsymbol b_\ell}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y), \quad 1\leqslant \ell \leqslant L. \]
  5. Update the parameters:

\[ \boldsymbol \theta = \boldsymbol \theta - \eta \nabla_{\boldsymbol\theta}\mathcal L_{\boldsymbol\theta}(\boldsymbol {\widehat y}, \boldsymbol y) \]
  6. Go to step 2 with the next training sample. A minimal code sketch of this loop is given below.
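With an automatic differentiation framework the whole loop fits in a few lines. Below is a minimal sketch using PyTorch; the architecture, the learning rate \(\eta = 0.1\), and the data are made up for illustration, and any autograd framework would do:

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small MLP for binary classification: input -> hidden -> sigmoid output
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
eta = 0.1                                # learning rate

x = torch.randn(1, 4)                    # one training sample (made-up data)
y = torch.tensor([[1.0]])

for step in range(3):
    y_hat = model(x)                     # 2. forward pass
    loss = loss_fn(y_hat, y)             # 3. loss value
    model.zero_grad()
    loss.backward()                      # 4. backward pass: gradients w.r.t. all W_l, b_l
    with torch.no_grad():
        for p in model.parameters():     # 5. gradient descent update
            p -= eta * p.grad
    print(step, loss.item())
```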

Batch training#

It is computationally inefficient to update all the parameters after each individual training sample. Instead, take a batch of \(B\) training samples at a time and form the matrix \(\boldsymbol X_{\mathrm{in}}\) of shape \(B\times n_0\). Now each hidden representation is a matrix of shape \(B \times n_i\):

\[ \boldsymbol X_i = \psi_i(\boldsymbol X_{i-1} \boldsymbol W_i +\boldsymbol B_i), \]

where \(\boldsymbol B_i\) denotes the bias \(\boldsymbol b_i\) broadcast to all \(B\) rows.

The output also has \(B\) rows. For instance, in the case of a multiclass classification task we have

\[ \boldsymbol X_L = \boldsymbol {\widehat Y} \in \mathbb R^{B\times K}, \]
\[ \mathcal L(\boldsymbol {\widehat Y}, \boldsymbol Y) = -\frac 1B\sum\limits_{i = 1}^B \sum\limits_{k=1}^K Y_{ik}\log(\widehat Y_{ik}) \]
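A NumPy sketch of a two-layer batched forward pass and the averaged cross-entropy above; all shapes and values are made up for illustration, and ReLU stands in for the activation \(\psi_1\):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n0, n1, K = 5, 4, 8, 3                  # batch size and layer widths (made up)

X_in = rng.normal(size=(B, n0))            # batch of inputs, shape B x n_0
W1, b1 = rng.normal(size=(n0, n1)), np.zeros(n1)
W2, b2 = rng.normal(size=(n1, K)), np.zeros(K)

X1 = np.maximum(X_in @ W1 + b1, 0)         # hidden representation, shape B x n_1 (ReLU)
Z = X1 @ W2 + b2                           # output scores, shape B x K

# Row-wise softmax: one probability vector per sample
E = np.exp(Z - Z.max(axis=1, keepdims=True))
Y_hat = E / E.sum(axis=1, keepdims=True)

Y = np.eye(K)[rng.integers(0, K, size=B)]  # one-hot labels, shape B x K
loss = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))  # cross-entropy averaged over the batch
print(Y_hat.shape, loss)
```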