Layers of MLP#

From a mathematical point of view, an MLP is a smooth function \(F\) constructed as a composition of simpler functions

(36)#\[F(\boldsymbol x) = (f_{L} \circ f_{L-1} \circ\ldots \circ f_2 \circ f_1)(\boldsymbol x),\quad \boldsymbol x \in \mathbb R^{n_0}\]

Each function

\[ f_\ell \colon \mathbb R^{n_{\ell - 1}} \to \mathbb R^{n_\ell} \]

performs a transformation between layers: it converts the representation of the \((\ell-1)\)-th layer

\[ \boldsymbol x_{\ell -1} \in \mathbb R^{n_{\ell - 1}} \]

to the representation of the \(\ell\)-th layer

\[ \boldsymbol x_{\ell} \in \mathbb R^{n_{\ell}}. \]

Thus, the input layer \(\boldsymbol x_0 \in \mathbb R^{n_0}\) is converted to the output layer \(\boldsymbol x_L \in \mathbb R^{n_L}\). All other layers \(\boldsymbol x_\ell\), \(1\leqslant \ell < L\), are called hidden layers.
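
As a toy illustration of the composition (36), here is a minimal sketch with made-up layer maps \(f_1\colon\mathbb R^3\to\mathbb R^4\) and \(f_2\colon\mathbb R^4\to\mathbb R^2\) (the sizes and the choice of tanh are arbitrary, not taken from the text):

import torch

# hypothetical layer maps: f1 goes from R^3 to R^4, f2 from R^4 to R^2
W1, b1 = torch.randn(3, 4), torch.randn(4)
W2, b2 = torch.randn(4, 2), torch.randn(2)

def f1(x):
    return torch.tanh(x @ W1 + b1)   # x_1 = f_1(x_0)

def f2(x):
    return x @ W2 + b2               # x_2 = f_2(x_1)

def F(x):
    return f2(f1(x))                 # F = f_2 ∘ f_1

x0 = torch.ones(3)       # input layer, n_0 = 3
F(x0).shape              # torch.Size([2]): the output layer lives in R^2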

Figure: basic architecture of a multi-layer perceptron (https://www.researchgate.net/publication/354817375/figure/fig2/AS:1071622807097344@1632506195651/Multi-layer-perceptron-MLP-NN-basic-Architecture.jpg)

Warning

The terminology about layers is a bit ambiguous: both the functions \(f_\ell\) and their outputs \(\boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1})\) are called the \(\ell\)-th layer in different sources.

Linear regression#

Linear regression is one of the simplest MLPs. There are several types of linear regression models:

  • simple linear regression (9) \(y = ax +b\)

  • multiple linear regression \(y = \boldsymbol x^\mathsf{T} \boldsymbol w + w_0\),

    \[\begin{split} \boldsymbol x = \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix}, \quad \boldsymbol w = \begin{pmatrix} w_1 \\ \vdots \\ w_d \end{pmatrix} \end{split}\]
  • multivariate linear regression \(\boldsymbol y^\mathsf{T} = \boldsymbol x^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}\), \(\boldsymbol W\in\mathbb R^{d \times m}\), \(\boldsymbol y, \boldsymbol b \in \mathbb R^m\)
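
For instance, multivariate linear regression with \(d\) inputs and \(m\) outputs can be viewed as a single dense layer; a minimal PyTorch sketch with arbitrarily chosen sizes \(d = 3\), \(m = 2\):

import torch

d, m = 3, 2                      # arbitrary input and output dimensions
layer = torch.nn.Linear(d, m)    # implements y^T = x^T W + b^T
                                 # (PyTorch stores the transposed weight of shape (m, d))
x = torch.ones(d)
layer(x).shape                   # torch.Size([2])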

Question

How many layers do these kinds of linear regression have? What are the sizes of the input and output?

Logistic regression#

Logistic regression predicts probabilities of classes:

  • binary logistic regression \(y = \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w + w_0)\)

  • multinomial logistic regression

    \[ \boldsymbol y^\mathsf{T} = \mathrm{Softmax}(\boldsymbol x^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}), \quad \boldsymbol W\in\mathbb R^{d \times K}, \quad \boldsymbol y, \boldsymbol b \in \mathbb R^K \]
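
As a rough sketch (the feature and class counts \(d = 4\), \(K = 3\) are chosen arbitrarily here), multinomial logistic regression can be assembled from a linear map followed by a softmax:

import torch
from torch import nn

d, K = 4, 3
model = nn.Sequential(
    nn.Linear(d, K),          # scores x^T W + b^T
    nn.Softmax(dim=-1),       # turns scores into class probabilities
)
x = torch.ones(d)
model(x).sum()                # probabilities sum to 1 (up to rounding)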

Question

How many layers does logistic regression have? What are the sizes of input and output?

Parameters of MLP#

However, one important element is missing in (36): parameters! Each layer \(f_\ell\) has a vector of parameters \(\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}\) (sometimes empty). Hence, a layer should be defined as

\[ f_\ell \colon \mathbb R^{n_{\ell - 1}} \times \mathbb R^{m_\ell} \to \mathbb R^{n_\ell}. \]

The representation \(\boldsymbol x_\ell\) is calculated from \(\boldsymbol x_{\ell -1}\) by the formula

\[ \boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1},\boldsymbol \theta_\ell) \]

with some fixed \(\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}\). The whole MLP \(F\) depends on the parameters of all layers:

\[ F(\boldsymbol x, \boldsymbol \theta), \quad \boldsymbol \theta = (\boldsymbol \theta_1, \ldots, \boldsymbol \theta_L). \]

All these parameters are trained simultaneously by the backpropagation method.
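
In PyTorch the parameter vectors \(\boldsymbol \theta_\ell\) of all layers are collected automatically and handed to an optimizer; a small sketch with arbitrary layer sizes:

from torch import nn

# a two-layer MLP; theta_1 = (W_1, b_1), theta_2 = (W_2, b_2),
# while the ReLU layer has an empty parameter vector
mlp = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))

for name, p in mlp.named_parameters():
    print(name, tuple(p.shape))   # all of these are trained jointly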

Dense layer#

Edges between two consecutive layers denote a linear (or dense) layer:

\[ \boldsymbol x_\ell^{\mathsf T} = f_\ell(\boldsymbol x_{\ell - 1}; \boldsymbol W, \boldsymbol b) = \boldsymbol x_{\ell - 1}^{\mathsf T} \boldsymbol W + \boldsymbol b^\mathsf{T}. \]

The matrix \(\boldsymbol W \in \mathbb R^{n_{\ell - 1}\times n_\ell}\) and the bias vector \(\boldsymbol b \in \mathbb R^{n_\ell}\) are the parameters of the linear layer; they define the transformation from \(\boldsymbol x_{\ell - 1}\) to \(\boldsymbol x_{\ell}\).

Q. How many numeric parameters does such a linear layer have?

Exercise

Suppose that we apply one more dense layer:

\[ \boldsymbol x_{\ell + 1} = \boldsymbol {W'x}_{\ell} + \boldsymbol{b'} \]

Express \(\boldsymbol x_{\ell + 1}\) as a function of \(\boldsymbol x_{\ell - 1}\).
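
A numerical sanity check for this exercise (a sketch using PyTorch's nn.Linear, which follows the column-vector convention of the exercise): the composition of two dense layers is again an affine map whose matrix and bias can be written explicitly.

import torch
from torch import nn

f1 = nn.Linear(3, 4)      # x_l = W x_{l-1} + b
f2 = nn.Linear(4, 2)      # x_{l+1} = W' x_l + b'

x = torch.ones(3)

W = f2.weight @ f1.weight             # combined matrix, shape (2, 3)
b = f2.weight @ f1.bias + f2.bias     # combined bias, shape (2,)

torch.allclose(f2(f1(x)), x @ W.T + b, atol=1e-6)   # True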

Linear layer in PyTorch#

import torch

x = torch.ones(3)
x
tensor([1., 1., 1.])

Weights:

linear_layer = torch.nn.Linear(3, 4, bias=False)
linear_layer.weight
Parameter containing:
tensor([[ 0.3155, -0.2208,  0.5131],
        [ 0.1268,  0.3283, -0.4448],
        [-0.1038, -0.4623,  0.2740],
        [-0.5087, -0.3319, -0.3864]], requires_grad=True)

Apply the linear transformation:

linear_layer(x)
tensor([ 0.6078,  0.0103, -0.2921, -1.2270], grad_fn=<SqueezeBackward4>)
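
With bias enabled (the default), the layer additionally stores the vector \(\boldsymbol b\); inspecting the shapes makes the parameter count of the question above easy to work out:

linear_layer = torch.nn.Linear(3, 4)            # bias=True by default
linear_layer.weight.shape, linear_layer.bias.shape
(torch.Size([4, 3]), torch.Size([4]))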

Activation layer#

In this layer a nonlinear activation function \(\psi\) is applied element-wise to its input:

\[ \psi(\boldsymbol x^{\mathsf T}) = \psi\big((x_1, \ldots, x_n)\big) = \big(\psi(x_1), \ldots, \psi(x_n)\big) = \boldsymbol z^{\mathsf T} \]

In the original work by Rosenblatt the activation function was \(\psi(t) = \mathbb I[t > 0]\). However, this function is discontinuous, which is why modern neural networks use other, smoother alternatives.

Sometimes linear and activation layers are combined into a single layer. Then each MLP layer looks like

\[ \boldsymbol x_i^{\mathsf T} = \psi_i(\boldsymbol x_{i-1}^{\mathsf T} \boldsymbol W_{i} + \boldsymbol b_{i}^\mathsf{T}) \]

where

  • \(\boldsymbol W_{i}\) is a matrix of the shape \(n_{i-1}\times n_i\)

  • \(\boldsymbol x_i, \boldsymbol b_i \in \mathbb R^{n_i}\) and \(\boldsymbol x_{i-1} \in \mathbb R^{n_{i-1}}\)

  • \(\psi_i(t)\) is an activation function which acts element-wise
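
In PyTorch such a combined layer is typically written as nn.Linear followed by an activation; a minimal sketch (the sizes and the choice of tanh are arbitrary):

import torch
from torch import nn

combined = nn.Sequential(
    nn.Linear(5, 3),    # x_{i-1}^T W_i + b_i^T
    nn.Tanh(),          # psi_i applied element-wise
)
combined(torch.ones(5)).shape    # torch.Size([3])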

Activation functions#

The most popular activation functions (nonlinearities):

  • sigmoid: \(\sigma(x) = \frac 1{1+e^{-x}}\)

  • hyperbolic tangent: \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)

  • Rectified Linear Unit:

\[ \mathrm{ReLU}(x) = x_+ = \max\{x, 0\} \]
plot_activations(-5, 5, -2, 2)

(plot of the sigmoid, tanh and ReLU activation functions on \([-5, 5]\))
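
These three functions are easy to implement directly and compare against the built-in PyTorch versions (a quick sketch):

import torch

t = torch.linspace(-5, 5, 11)

sigmoid = 1 / (1 + torch.exp(-t))
tanh = (torch.exp(t) - torch.exp(-t)) / (torch.exp(t) + torch.exp(-t))
relu = torch.clamp(t, min=0)     # max{x, 0} element-wise

(torch.allclose(sigmoid, torch.sigmoid(t)),
 torch.allclose(tanh, torch.tanh(t)),
 torch.allclose(relu, torch.relu(t)))
(True, True, True)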

You can read about the advantages and disadvantages of different activation functions here.

Activations in PyTorch#

The PyTorch library has a big zoo of activations.

from torch import nn, randn
X = randn(2, 5)
X
tensor([[ 0.5186,  0.2777, -1.6990,  0.3025, -0.5038],
        [ 0.4378,  0.9187, -1.1543,  0.4697,  0.5946]])

ReLU zeroes out all negative inputs, while Leaky ReLU does not:

nn.ReLU()(X), nn.LeakyReLU()(X)
(tensor([[0.5186, 0.2777, 0.0000, 0.3025, 0.0000],
         [0.4378, 0.9187, 0.0000, 0.4697, 0.5946]]),
 tensor([[ 0.5186,  0.2777, -0.0170,  0.3025, -0.0050],
         [ 0.4378,  0.9187, -0.0115,  0.4697,  0.5946]]))

ELU

The ELU activation function

\[\begin{split} \mathrm{ELU}(x) = \begin{cases} x,& x > 0 \\ \alpha (e^x - 1), & x \leqslant 0. \end{cases} \end{split}\]

has a hyperparameter \(\alpha\). What is the main theoretical advantage of the default value \(\alpha =1\)?
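
PyTorch provides this activation as nn.ELU; a small sketch showing the effect of \(\alpha\) on the negative part (the input values are arbitrary):

import torch
from torch import nn

t = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
nn.ELU(alpha=1.0)(t)    # x for x > 0, exp(x) - 1 for x <= 0
nn.ELU(alpha=2.0)(t)    # the negative branch is scaled by alpha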