Layers of MLP#
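From a mathematical point of view, an MLP is a smooth function \(F\) constructed as a composition of other functions:
\[ F(\boldsymbol x) = (f_L \circ f_{L-1} \circ \ldots \circ f_2 \circ f_1)(\boldsymbol x), \quad \boldsymbol x \in \mathbb R^{n_0}. \]
Each function
\[ f_\ell \colon \mathbb R^{n_{\ell - 1}} \to \mathbb R^{n_\ell} \]
performs the transformation between layers: it converts the representation of the \((\ell-1)\)-th layer
\[ \boldsymbol x_{\ell - 1} \in \mathbb R^{n_{\ell - 1}} \]
to the representation of the \(\ell\)-th layer
\[ \boldsymbol x_\ell \in \mathbb R^{n_\ell}. \]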
Thus, the input layer \(\boldsymbol x_0 \in \mathbb R^{n_0}\) is converted to the output layer \(\boldsymbol x_L \in \mathbb R^{n_L}\). All other layers \(\boldsymbol x_\ell\), \(1\leqslant \ell < L\), are called hidden layers.
Warning
The terminology about layers is a bit ambiguous. Both the functions \(f_\ell\) and their outputs \(\boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1})\) are called the \(\ell\)-th layer in different sources.
Linear regression#
Linear regression is one of the simplest MLPs. There are several types of linear regression models:
simple linear regression (7) \(y = ax +b\)
multiple linear regression \(y = \boldsymbol x^\mathsf{T} \boldsymbol w + w_0\),
\[ \boldsymbol x = \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix}, \quad \boldsymbol w = \begin{pmatrix} w_1 \\ \vdots \\ w_d \end{pmatrix} \]
multivariate linear regression \(\boldsymbol y^\mathsf{T} = \boldsymbol x^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}\), \(\boldsymbol W\in\mathbb R^{d \times m}\), \(\boldsymbol y, \boldsymbol b \in \mathbb R^m\)
Question
How many layers do all these kinds of linear regression have? What are the sizes of the input and the output?
Logistic regression#
Logistic regression predicts probabilities of classes:
binary logistic regression \(y = \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w + w_0)\)
multinomial logistic regression
\[ \boldsymbol y^\mathsf{T} = \mathrm{Softmax}(\boldsymbol x^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}), \quad \boldsymbol W\in\mathbb R^{d \times K}, \quad \boldsymbol y, \boldsymbol b \in \mathbb R^K \]
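To make the two output maps concrete, here is a minimal pure-Python sketch of the sigmoid and of Softmax. The scores below are made-up numbers standing in for \(\boldsymbol x^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}\); the function names are illustrative and not part of any library.

```python
import math

def sigmoid(t):
    """Logistic function: maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def softmax(scores):
    """Map K real scores to K probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for K = 3 classes
scores = [2.0, 1.0, -1.0]
probs = softmax(scores)
print(probs, sum(probs))

# For K = 2, softmax over the scores (t, 0) gives sigmoid(t) for the first class
t = 0.7
print(softmax([t, 0.0])[0], sigmoid(t))
```

The last two printed numbers coincide, which shows that binary logistic regression is the two-class special case of the multinomial one.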
Question
How many layers does logistic regression have? What are the sizes of input and output?
Parameters of MLP#
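However, one important element is missing in (34): parameters! Each layer \(f_\ell\) has a vector of parameters \(\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}\) (sometimes empty). Hence, a layer should be defined as
\[ f_\ell \colon \mathbb R^{n_{\ell - 1}} \times \mathbb R^{m_\ell} \to \mathbb R^{n_\ell}. \]
The representation \(\boldsymbol x_\ell\) is calculated from \(\boldsymbol x_{\ell -1}\) by the formula
\[ \boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1}; \boldsymbol \theta_\ell) \]
with some fixed \(\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}\). The whole MLP \(F\) depends on the parameters of all layers:
\[ F(\boldsymbol x; \boldsymbol \theta), \quad \boldsymbol \theta = (\boldsymbol \theta_1, \ldots, \boldsymbol \theta_L). \]
All these parameters are trained jointly by gradient-based optimization, with the gradients computed by backpropagation.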
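The idea that every layer receives both its input and its own parameter vector can be sketched in plain Python. This is a toy illustration with scalar layers (all \(n_\ell = 1\)); the layer functions and parameter values are invented for the example.

```python
def affine(x, theta):
    """Scalar layer with parameters theta = (w, b): x -> w * x + b."""
    w, b = theta
    return w * x + b

def square(x, theta):
    """Layer with an empty parameter vector: theta is ignored."""
    return x * x

def mlp(x, layers, thetas):
    """Apply f_1, ..., f_L in order, each with its own theta_l."""
    for f, theta in zip(layers, thetas):
        x = f(x, theta)
    return x

layers = [affine, square, affine]
thetas = [(2.0, 1.0), (), (0.5, -1.0)]  # theta_1, theta_2 (empty), theta_3

result = mlp(3.0, layers, thetas)
print(result)  # 3 -> 7 -> 49 -> 23.5
```

Changing any entry of `thetas` changes the function computed by `mlp`, which is exactly what training exploits.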
Dense layer#
Edges between two consecutive layers denote a linear (or dense) layer:
\[ \boldsymbol x_\ell^\mathsf{T} = \boldsymbol x_{\ell - 1}^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}. \]
The matrix \(\boldsymbol W \in \mathbb R^{n_{\ell - 1}\times n_\ell}\) and the vector \(\boldsymbol b \in \mathbb R^{n_\ell}\) (bias) are the parameters of the linear layer, which defines the linear transformation from \(\boldsymbol x_{\ell - 1}\) to \(\boldsymbol x_{\ell}\).
Q. How many numeric parameters does such a linear layer have?
Exercise
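Suppose that we apply one more dense layer:
\[ \boldsymbol x_{\ell + 1}^\mathsf{T} = \boldsymbol x_\ell^\mathsf{T} \boldsymbol W' + (\boldsymbol b')^\mathsf{T}, \quad \boldsymbol W' \in \mathbb R^{n_\ell \times n_{\ell + 1}}, \quad \boldsymbol b' \in \mathbb R^{n_{\ell + 1}}. \]
Express \(\boldsymbol x_{\ell + 1}\) as a function of \(\boldsymbol x_{\ell - 1}\).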
Linear layer in PyTorch#
import torch
x = torch.ones(3)
x
tensor([1., 1., 1.])
Weights:
linear_layer = torch.nn.Linear(3, 4, bias=False)
linear_layer.weight
Parameter containing:
tensor([[ 0.3253, 0.5486, 0.3824],
[ 0.0182, 0.4988, 0.3559],
[-0.2256, -0.2225, -0.3284],
[ 0.4426, -0.3302, 0.3479]], requires_grad=True)
Apply the linear transformation:
linear_layer(x)
tensor([ 1.2564, 0.8729, -0.7766, 0.4603], grad_fn=<SqueezeBackward4>)
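One detail is worth checking when comparing this output with the dense-layer formula: `torch.nn.Linear(3, 4)` stores its weight with shape `(out_features, in_features) = (4, 3)`, as the printout above shows, so the matrix \(\boldsymbol W\) of shape \(n_{\ell-1} \times n_\ell\) corresponds to the transpose of `weight`. The sketch below verifies this; the seed is arbitrary and only makes the run repeatable.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

linear_layer = torch.nn.Linear(3, 4, bias=False)
x = torch.ones(3)

# PyTorch keeps the weight as (out_features, in_features) = (4, 3),
# so the matrix W from the text is the transpose of `weight`.
W = linear_layer.weight.T  # shape (3, 4)

manual = x @ W              # x^T W computed by hand
builtin = linear_layer(x)   # what the layer itself returns

print(W.shape)
print(torch.allclose(manual, builtin))
```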
Activation layer#
In this layer a nonlinear activation function \(\psi\) is applied element-wise to its input:
\[ \psi(\boldsymbol x) = \psi(x_1, \ldots, x_n) = (\psi(x_1), \ldots, \psi(x_n)). \]
In the original work by Rosenblatt the activation function was \(\psi(t) = \mathbb I[t > 0]\). However, this function is discontinuous and its derivative is zero wherever it exists, which makes gradient-based training impossible. That is why modern neural networks use continuous alternatives.
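Sometimes the linear and activation layers are combined into a single layer. Then each MLP layer looks like
\[ \boldsymbol x_i^\mathsf{T} = \psi_i\big(\boldsymbol x_{i-1}^\mathsf{T} \boldsymbol W_i + \boldsymbol b_i^\mathsf{T}\big), \]
where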
\(\boldsymbol W_{i}\) is a matrix of the shape \(n_{i-1}\times n_i\)
\(\boldsymbol x_i, \boldsymbol b_i \in \mathbb R^{n_i}\) and \(\boldsymbol x_{i-1} \in \mathbb R^{n_{i-1}}\)
\(\psi_i(t)\) is an activation function which acts element-wise
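A combined layer of this kind can be sketched without any library, using nested lists for \(\boldsymbol W_i\). The helper `dense_activation` and all parameter values below are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def dense_activation(x, W, b, psi):
    """Compute psi(x^T W + b^T) for x of length n_prev and W of shape n_prev x n_out."""
    n_prev, n_out = len(W), len(W[0])
    assert len(x) == n_prev and len(b) == n_out
    pre = [sum(x[k] * W[k][j] for k in range(n_prev)) + b[j] for j in range(n_out)]
    return [psi(t) for t in pre]  # element-wise activation

def relu(t):
    return max(0.0, t)

# Hypothetical parameters of two layers: 3 -> 2 -> 1
W1 = [[1.0, -1.0], [0.5, 0.0], [-2.0, 1.0]]
b1 = [0.0, 1.0]
W2 = [[1.0], [-3.0]]
b2 = [0.5]

x0 = [1.0, 2.0, 1.0]
x1 = dense_activation(x0, W1, b1, relu)
x2 = dense_activation(x1, W2, b2, lambda t: t)  # identity activation on the output
print(x1, x2)  # [0.0, 1.0] [-2.5]
```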
Activation functions#
The most popular activation functions (nonlinearities):
sigmoid: \(\sigma(x) = \frac 1{1+e^{-x}}\)
hyperbolic tangent: \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
Rectified Linear Unit: \(\mathrm{ReLU}(x) = \max\{0, x\}\)
plot_activations(-5, 5, -2, 2)
You can read about advantages and disadvantages of different activation functions here.
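The three activations listed above are easy to evaluate directly. The following sketch uses only the standard `math` module and also checks a known identity linking two of them, \(\tanh(x) = 2\sigma(2x) - 1\).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

# Values at a few sample points: sigmoid lies in (0, 1), tanh in (-1, 1),
# ReLU is zero for negative inputs and the identity for positive ones.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), math.tanh(x), relu(x))

# tanh is a shifted and rescaled sigmoid
x = 0.8
lhs = math.tanh(x)
rhs = 2.0 * sigmoid(2.0 * x) - 1.0
print(lhs, rhs)
```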
Activations in PyTorch#
The PyTorch library provides a large collection of activation functions.
from torch import nn, randn
X = randn(2, 5)
X
tensor([[ 0.8339, 0.7751, -0.3266, 0.2491, 2.1545],
[-1.6237, -0.4284, 0.4013, 0.2186, -0.4625]])
ReLU zeroes all negative inputs, while Leaky ReLU multiplies them by a small slope (0.01 by default):
nn.ReLU()(X), nn.LeakyReLU()(X)
(tensor([[0.8339, 0.7751, 0.0000, 0.2491, 2.1545],
[0.0000, 0.0000, 0.4013, 0.2186, 0.0000]]),
tensor([[ 0.8339, 0.7751, -0.0033, 0.2491, 2.1545],
[-0.0162, -0.0043, 0.4013, 0.2186, -0.0046]]))
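The negative entries in the second tensor are simply the inputs multiplied by 0.01, the default `negative_slope` of `nn.LeakyReLU`. A minimal pure-Python sketch reproduces the second row of the output; the function `leaky_relu` here is a hand-written stand-in, not the PyTorch implementation.

```python
def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: identity for x >= 0, a small linear slope for x < 0."""
    return x if x >= 0 else negative_slope * x

row = [-1.6237, -0.4284, 0.4013, 0.2186, -0.4625]  # second row of X above
out = [round(leaky_relu(v), 4) for v in row]
print(out)  # [-0.0162, -0.0043, 0.4013, 0.2186, -0.0046]
```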
ELU
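The ELU activation function
\[ \mathrm{ELU}_\alpha(x) = \begin{cases} x, & x > 0, \\ \alpha(e^x - 1), & x \leqslant 0 \end{cases} \]
has a hyperparameter \(\alpha\). What is the main theoretical advantage of the default value \(\alpha = 1\)?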