Probabilistic models#
In the probabilistic approach, all quantities are treated as random variables. Each training sample \((\boldsymbol x, y)\) is drawn from a joint probability distribution with density \(p(\boldsymbol x, y)\). If we use a machine learning model with parameters \(\boldsymbol w\), then this density is conditioned on \(\boldsymbol w\):

\[
p(\boldsymbol x, y \vert \boldsymbol w).
\]
The parametric family \(p(\boldsymbol x, y\vert \boldsymbol w)\) is called a probabilistic model of a machine learning problem.
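For example (an illustrative model, not taken from the text above), linear regression with Gaussian noise is a probabilistic model: treating the inputs \(\boldsymbol x\) as fixed, one assumes

\[
p(y \vert \boldsymbol x, \boldsymbol w) = \mathcal N\big(y \,\vert\, \boldsymbol x^\mathsf{T} \boldsymbol w, \sigma^2\big),
\]

i.e. the target is the linear prediction \(\boldsymbol x^\mathsf{T} \boldsymbol w\) corrupted by Gaussian noise with variance \(\sigma^2\).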
Maximum likelihood estimation#
The likelihood of the dataset \(\mathcal D = (\boldsymbol X, \boldsymbol y) = \{(\boldsymbol x_i, y_i)\}_{i=1}^n\) is

\[
p(\mathcal D \vert \boldsymbol w) = p(\boldsymbol X, \boldsymbol y \vert \boldsymbol w).
\]
If the samples \((\boldsymbol x_i, y_i)\) are i.i.d., then the likelihood factorizes:

\[
p(\mathcal D \vert \boldsymbol w) = \prod_{i=1}^n p(\boldsymbol x_i, y_i \vert \boldsymbol w).
\]
The optimal weights \(\boldsymbol{\widehat w}\) maximize the likelihood or, equivalently, the log-likelihood:

\[
\boldsymbol{\widehat w} = \arg\max_{\boldsymbol w} \log p(\mathcal D \vert \boldsymbol w) = \arg\max_{\boldsymbol w} \sum_{i=1}^n \log p(\boldsymbol x_i, y_i \vert \boldsymbol w). \tag{33}
\]
Alternatively, one can minimize the negative log-likelihood (NLL):

\[
\mathrm{NLL}(\boldsymbol w) = -\sum_{i=1}^n \log p(\boldsymbol x_i, y_i \vert \boldsymbol w) \to \min_{\boldsymbol w}.
\]
The weight estimate \(\boldsymbol{\widehat w}\) maximizing the log-likelihood (33) is called the maximum likelihood estimate (MLE).
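To make this concrete, here is a minimal sketch (an illustrative example, not part of the text above) of MLE for a univariate Gaussian model \(p(y \vert \boldsymbol w) = \mathcal N(y \vert \mu, \sigma^2)\) with \(\boldsymbol w = (\mu, \log\sigma)\); the toy data and the choice of `scipy.optimize.minimize` are assumptions made for the demo.

```python
import numpy as np
from scipy.optimize import minimize

# Toy i.i.d. sample (illustrative data, not from the text)
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=0.5, size=1000)

def nll(w, y):
    """Negative log-likelihood of y under N(mu, sigma^2),
    with w = (mu, log_sigma) so that sigma stays positive."""
    mu, log_sigma = w
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - mu) ** 2 / (2 * sigma**2))

# MLE: minimize the NLL, i.e. maximize the log-likelihood
res = minimize(nll, x0=np.array([0.0, 0.0]), args=(y,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to the sample mean and (biased) sample std
```

Parameterizing by \(\log\sigma\) keeps \(\sigma > 0\) without explicit constraints; for this model the numerical MLE coincides with the closed-form answer (the sample mean and the biased sample standard deviation).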
Bayesian approach#
From (33) we obtain a point estimate \(\boldsymbol {\widehat w}_{\mathrm{MLE}}\). In the Bayesian framework we estimate not points but distributions!
Assume that the parameters \(\boldsymbol w\) have a prior distribution \(p(\boldsymbol w)\). After observing the dataset \(\mathcal D\), we can apply Bayes' formula and obtain the posterior distribution

\[
p(\boldsymbol w \vert \mathcal D) = \frac{p(\mathcal D \vert \boldsymbol w)\, p(\boldsymbol w)}{p(\mathcal D)} = \frac{p(\mathcal D \vert \boldsymbol w)\, p(\boldsymbol w)}{\int p(\mathcal D \vert \boldsymbol w)\, p(\boldsymbol w)\, d\boldsymbol w}.
\]
Maximum a posteriori (MAP) estimation maximizes the posterior distribution:

\[
\boldsymbol{\widehat w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol w} p(\boldsymbol w \vert \mathcal D) = \arg\max_{\boldsymbol w} \big(\log p(\mathcal D \vert \boldsymbol w) + \log p(\boldsymbol w)\big).
\]
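Continuing the Gaussian sketch above (again an illustrative assumption, not the text's example), a Gaussian prior \(\mathcal N(\mu \vert 0, \tau^2)\) on the mean turns the MAP objective into the NLL plus a quadratic penalty \(\mu^2/(2\tau^2)\); with few observations the MAP estimate is visibly shrunk toward the prior mean.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=0.5, size=5)  # deliberately small sample

def nll(w, y):
    """Gaussian NLL with w = (mu, log_sigma), as in the MLE sketch."""
    mu, log_sigma = w
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - mu) ** 2 / (2 * sigma**2))

def neg_log_posterior(w, y, tau=1.0):
    """-log p(w | D) up to an additive constant:
    NLL - log p(w), with prior N(mu | 0, tau^2) on the mean only."""
    mu, _ = w
    return nll(w, y) + 0.5 * mu**2 / tau**2

w_mle = minimize(nll, x0=np.zeros(2), args=(y,)).x
w_map = minimize(neg_log_posterior, x0=np.zeros(2), args=(y,)).x
print(w_mle[0], w_map[0])  # the MAP mean is shrunk toward the prior mean 0
```

The penalty \(\mu^2/(2\tau^2)\) is exactly \(-\log p(\mu)\) up to a constant, so MAP here behaves like an L2-regularized version of MLE; as the sample size grows, the likelihood dominates and the MAP and MLE estimates converge.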