Layers of MLP#
From a mathematical point of view, an MLP is a smooth function \(F\) constructed as a composition of simpler functions:
\[ F = f_L \circ f_{L-1} \circ \dots \circ f_1. \]
Each function \(f_\ell\) performs a transformation between layers: it converts the representation of the \((\ell-1)\)-th layer to the representation of the \(\ell\)-th layer,
\[ \boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1}), \quad \boldsymbol x_{\ell - 1} \in \mathbb R^{n_{\ell - 1}}, \quad \boldsymbol x_\ell \in \mathbb R^{n_\ell}. \]
Thus, the input layer \(\boldsymbol x_0 \in \mathbb R^{n_0}\) is converted to the output layer \(\boldsymbol x_L \in \mathbb R^{n_L}\). All other layers \(\boldsymbol x_\ell\), \(1\leqslant \ell < L\), are called hidden layers.
Warning
The terminology about layers is a bit ambiguous. Both the functions \(f_\ell\) and their outputs \(\boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1})\) are called the \(\ell\)-th layer in different sources.
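As an illustration of this composition view, here is a minimal sketch in PyTorch (the layer sizes and the tanh nonlinearity are chosen arbitrarily for the example):
import torch
from torch import nn

f_1 = nn.Sequential(nn.Linear(4, 8), nn.Tanh())  # f_1: R^4 -> R^8
f_2 = nn.Linear(8, 3)                            # f_2: R^8 -> R^3

x_0 = torch.randn(4)   # input layer x_0
x_1 = f_1(x_0)         # hidden layer x_1 = f_1(x_0)
x_2 = f_2(x_1)         # output layer x_2 = f_2(x_1)
x_2.shape              # torch.Size([3])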
Linear regression#
Linear regression is one of the simplest MLPs. There are several types of linear regression models:
simple linear regression \(y = ax + b\)
multiple linear regression \(y = \boldsymbol x^\mathsf{T} \boldsymbol w + w_0\),
\[\begin{split} \boldsymbol x = \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix}, \quad \boldsymbol w = \begin{pmatrix} w_1 \\ \vdots \\ w_d \end{pmatrix} \end{split}\]
multivariate linear regression \(\boldsymbol y^\mathsf{T} = \boldsymbol x^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}\), \(\boldsymbol W\in\mathbb R^{d \times m}\), \(\boldsymbol y, \boldsymbol b \in \mathbb R^m\)
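For instance, multivariate linear regression is a single dense layer; a minimal sketch in PyTorch, where the sizes \(d\) and \(m\) are chosen only for illustration:
import torch
from torch import nn

d, m = 5, 2                   # input and output dimensions (illustrative)
regression = nn.Linear(d, m)  # one dense layer: the output has m components

x = torch.randn(d)
regression(x).shape           # torch.Size([2])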
Question
How many layers do all these kinds of linear regression have? What are the sizes of input and output?
Logistic regression#
Logistic regression predicts probabilities of classes:
binary logistic regression \(y = \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w + w_0)\)
multinomial logistic regression
\[ \boldsymbol y^\mathsf{T} = \mathrm{Softmax}(\boldsymbol x^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}), \quad \boldsymbol W\in\mathbb R^{d \times K}, \quad \boldsymbol y, \boldsymbol b \in \mathbb R^K \]
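A sketch of multinomial logistic regression in PyTorch (the sizes \(d\) and \(K\) are illustrative):
import torch
from torch import nn

d, K = 5, 3
logreg = nn.Sequential(nn.Linear(d, K),     # scores x^T W + b^T
                       nn.Softmax(dim=-1))  # turns scores into class probabilities

probs = logreg(torch.randn(d))
probs, probs.sum()   # class probabilities summing to 1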
Question
How many layers does logistic regression have? What are the sizes of input and output?
Parameters of MLP#
However, one important element is missing in the composition \(F = f_L \circ \dots \circ f_1\): parameters! Each layer \(f_\ell\) has a vector of parameters \(\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}\) (sometimes empty). Hence, a layer should be defined as a function of both the input and the parameters:
\[ f_\ell \colon \mathbb R^{n_{\ell - 1}} \times \mathbb R^{m_\ell} \to \mathbb R^{n_\ell}. \]
The representation \(\boldsymbol x_\ell\) is calculated from \(\boldsymbol x_{\ell -1}\) by the formula
\[ \boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1}, \boldsymbol \theta_\ell) \]
with some fixed \(\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}\). The whole MLP \(F\) depends on the parameters of all layers:
\[ F(\boldsymbol x, \boldsymbol \theta) = f_L(f_{L-1}(\dots f_1(\boldsymbol x, \boldsymbol \theta_1) \dots, \boldsymbol \theta_{L-1}), \boldsymbol \theta_L), \quad \boldsymbol \theta = (\boldsymbol \theta_1, \dots, \boldsymbol \theta_L). \]
All these parameters are trained simultaneously by the backpropagation method.
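To make the parameters tangible, here is a small illustrative model whose parameters can be listed and counted (the sizes are arbitrary):
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
sum(p.numel() for p in model.parameters())   # 4*8 + 8 + 8*3 + 3 = 67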
Dense layer#
Edges between two consecutive layers denote a linear (or dense) layer:
\[ \boldsymbol x_\ell^\mathsf{T} = \boldsymbol x_{\ell - 1}^\mathsf{T} \boldsymbol W + \boldsymbol b^\mathsf{T}. \]
The matrix \(\boldsymbol W \in \mathbb R^{n_{\ell - 1}\times n_\ell}\) and the vector \(\boldsymbol b \in \mathbb R^{n_\ell}\) (bias) are the parameters of the linear layer; they define the linear transformation from \(\boldsymbol x_{\ell - 1}\) to \(\boldsymbol x_{\ell}\).
Q. How many numeric parameters does such a linear layer have?
Exercise
Suppose that we apply one more dense layer:
\[ \boldsymbol x_{\ell + 1}^\mathsf{T} = \boldsymbol x_{\ell}^\mathsf{T} \boldsymbol W' + \boldsymbol b'^\mathsf{T}, \quad \boldsymbol W' \in \mathbb R^{n_{\ell}\times n_{\ell + 1}}, \quad \boldsymbol b' \in \mathbb R^{n_{\ell + 1}}. \]
Express \(\boldsymbol x_{\ell + 1}\) as a function of \(\boldsymbol x_{\ell - 1}\).
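A numerical sanity check of this exercise, as a sketch (note that nn.Linear stores its weight as a matrix of shape \(n_\ell \times n_{\ell - 1}\)):
import torch
from torch import nn

layer1 = nn.Linear(3, 4)   # x_{l-1} -> x_l
layer2 = nn.Linear(4, 2)   # x_l -> x_{l+1}
x = torch.randn(3)
with torch.no_grad():
    # the composition of two affine maps is again an affine map
    W = layer2.weight @ layer1.weight                 # shape (2, 3)
    b = layer1.bias @ layer2.weight.T + layer2.bias   # shape (2,)
    print(torch.allclose(layer2(layer1(x)), x @ W.T + b, atol=1e-6))   # True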
Linear layer in PyTorch#
import torch
x = torch.ones(3)
x
tensor([1., 1., 1.])
Weights:
linear_layer = torch.nn.Linear(3, 4, bias=False)
linear_layer.weight
Parameter containing:
tensor([[ 0.3155, -0.2208, 0.5131],
[ 0.1268, 0.3283, -0.4448],
[-0.1038, -0.4623, 0.2740],
[-0.5087, -0.3319, -0.3864]], requires_grad=True)
Apply the linear transformation:
linear_layer(x)
tensor([ 0.6078, 0.0103, -0.2921, -1.2270], grad_fn=<SqueezeBackward4>)
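Since bias=False, the layer is just a matrix-vector product, which can be checked directly:
torch.allclose(linear_layer(x), linear_layer.weight @ x)   # True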
Activation layer#
In this layer a nonlinear activation function \(\psi\) is applied element-wise to its input:
\[ \boldsymbol x_\ell = \psi(\boldsymbol x_{\ell - 1}). \]
In the original work by Rosenblatt the activation function was \(\psi(t) = \mathbb I[t > 0]\). However, this function is discontinuous, which is why modern neural networks use smoother alternatives.
Sometimes the linear and activation layers are combined into a single layer. Then each MLP layer looks like
\[ \boldsymbol x_i^\mathsf{T} = \psi_i(\boldsymbol x_{i - 1}^\mathsf{T} \boldsymbol W_i + \boldsymbol b_i^\mathsf{T}), \]
where
\(\boldsymbol W_{i}\) is a matrix of the shape \(n_{i-1}\times n_i\)
\(\boldsymbol x_i, \boldsymbol b_i \in \mathbb R^{n_i}\) and \(\boldsymbol x_{i-1} \in \mathbb R^{n_{i-1}}\)
\(\psi_i(t)\) is an activation function which acts element-wise
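Such a combined layer can be sketched in PyTorch as a dense layer followed by an element-wise nonlinearity (the sizes and the choice of sigmoid are illustrative):
from torch import nn, randn

n_prev, n_cur = 6, 4
layer = nn.Sequential(nn.Linear(n_prev, n_cur),   # x^T W_i + b_i^T
                      nn.Sigmoid())               # psi_i applied element-wise
layer(randn(n_prev)).shape   # torch.Size([4])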
Activation functions#
The most popular activation functions (nonlinearities):
sigmoid: \(\sigma(x) = \frac 1{1+e^{-x}}\)
hyperbolic tangent: \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
Rectified Linear Unit: \(\mathrm{ReLU}(x) = \max(0, x)\)
plot_activations(-5, 5, -2, 2)
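The helper plot_activations is defined outside this snippet; a possible implementation (a sketch assuming matplotlib) could look like this:
import matplotlib.pyplot as plt
import torch

def plot_activations(x_min, x_max, y_min, y_max):
    # plot sigmoid, tanh and ReLU on a common grid
    x = torch.linspace(x_min, x_max, 200)
    plt.plot(x, torch.sigmoid(x), label="sigmoid")
    plt.plot(x, torch.tanh(x), label="tanh")
    plt.plot(x, torch.relu(x), label="ReLU")
    plt.ylim(y_min, y_max)
    plt.legend()
    plt.grid(ls=":")
    plt.show()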
You can read about advantages and disadvantages of different activation functions here.
Activations in PyTorch#
The PyTorch library has a big zoo of activations.
from torch import nn, randn
X = randn(2, 5)
X
tensor([[ 0.5186, 0.2777, -1.6990, 0.3025, -0.5038],
[ 0.4378, 0.9187, -1.1543, 0.4697, 0.5946]])
ReLU zeroes out all negative inputs, while Leaky ReLU does not:
nn.ReLU()(X), nn.LeakyReLU()(X)
(tensor([[0.5186, 0.2777, 0.0000, 0.3025, 0.0000],
[0.4378, 0.9187, 0.0000, 0.4697, 0.5946]]),
tensor([[ 0.5186, 0.2777, -0.0170, 0.3025, -0.0050],
[ 0.4378, 0.9187, -0.0115, 0.4697, 0.5946]]))
ELU
The ELU activation function
\[ \mathrm{ELU}(x) = \begin{cases} x, & x > 0, \\ \alpha(e^x - 1), & x \leqslant 0, \end{cases} \]
has a hyperparameter \(\alpha\). What is the main theoretical advantage of the default value \(\alpha = 1\)?
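In PyTorch this activation is available as nn.ELU; a quick sketch reusing X from above:
nn.ELU()(X), nn.ELU(alpha=0.5)(X)   # default alpha=1.0 vs. a smaller alpha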