Probabilistic models

In the probabilistic approach, all quantities are treated as random variables. Each training sample $(\boldsymbol x, y)$ comes from a joint probability distribution with density $p(\boldsymbol x, y)$. If we use a machine learning model with parameters $\boldsymbol w$, this density is conditioned on $\boldsymbol w$:

$$
(\boldsymbol x, y) \sim p(\boldsymbol x, y \vert \boldsymbol w).
$$

The parametric family $p(\boldsymbol x, y \vert \boldsymbol w)$ is called a probabilistic model of a machine learning problem.
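A classical example is linear regression with Gaussian noise: conditioning on $\boldsymbol x$, the model for the target is

$$
p(y \vert \boldsymbol x, \boldsymbol w) = \mathcal N\big(y \,\big\vert\, \boldsymbol x^\mathsf T \boldsymbol w, \sigma^2\big) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(y - \boldsymbol x^\mathsf T \boldsymbol w)^2}{2\sigma^2}\Big),
$$

where the noise variance $\sigma^2$ is either fixed or included among the parameters.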

Maximum likelihood estimation

The likelihood of the dataset $\mathcal D = (\boldsymbol X, \boldsymbol y) = \{(\boldsymbol x_i, y_i)\}_{i=1}^n$ is

$$
p(\boldsymbol y \vert \boldsymbol X, \boldsymbol w).
$$

If the samples $(\boldsymbol x_i, y_i)$ are i.i.d., then

$$
p(\boldsymbol y \vert \boldsymbol X, \boldsymbol w) = \prod_{i=1}^n p(y_i \vert \boldsymbol x_i, \boldsymbol w).
$$

The optimal weights $\widehat{\boldsymbol w}$ maximize the likelihood or, equivalently, the log-likelihood:

$$
\log p(\boldsymbol y \vert \boldsymbol X, \boldsymbol w) = \log \prod_{i=1}^n p(y_i \vert \boldsymbol x_i, \boldsymbol w) = \sum_{i=1}^n \log p(y_i \vert \boldsymbol x_i, \boldsymbol w) \to \max_{\boldsymbol w}. \tag{33}
$$

Alternatively, one can minimize the negative log-likelihood (NLL):

$$
-\log p(\boldsymbol y \vert \boldsymbol X, \boldsymbol w) = -\sum_{i=1}^n \log p(y_i \vert \boldsymbol x_i, \boldsymbol w) \to \min_{\boldsymbol w}.
$$

The estimate $\widehat{\boldsymbol w}$ maximizing the log-likelihood (33) is called the maximum likelihood estimate (MLE).
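As an illustration, here is a minimal sketch of MLE for the Gaussian linear model mentioned above: the Gaussian NLL is minimized numerically with `scipy.optimize.minimize`. The synthetic data, the parametrization of $\sigma$ via its logarithm, and the optimizer choice are illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic 1D data: y = 2*x + 1 + Gaussian noise (toy setup for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=100)

def nll(w):
    """Negative log-likelihood of the Gaussian model y ~ N(w0 + w1*x, sigma^2)."""
    w0, w1, log_sigma = w
    sigma = np.exp(log_sigma)               # keep sigma positive by optimizing log(sigma)
    resid = y - (w0 + w1 * x)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

# MLE = minimizer of the NLL
w_hat = minimize(nll, x0=np.zeros(3)).x
print("w0, w1, sigma:", w_hat[0], w_hat[1], np.exp(w_hat[2]))
```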

Bayesian approach

From (33) we obtain a point estimate $\widehat{\boldsymbol w}_{\mathrm{MLE}}$. In the Bayesian framework we estimate not points but distributions!

Assume that the parameters $\boldsymbol w$ have a prior distribution $p(\boldsymbol w)$. After observing the dataset $\mathcal D$, we can apply Bayes' formula and obtain the posterior distribution

$$
p(\boldsymbol w \vert \mathcal D) = \frac{p(\mathcal D \vert \boldsymbol w)\, p(\boldsymbol w)}{p(\mathcal D)}.
$$
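The denominator $p(\mathcal D) = \int p(\mathcal D \vert \boldsymbol w)\, p(\boldsymbol w)\, d\boldsymbol w$ (the evidence) does not depend on $\boldsymbol w$, so the posterior is proportional to the likelihood times the prior:

$$
p(\boldsymbol w \vert \mathcal D) \propto p(\mathcal D \vert \boldsymbol w)\, p(\boldsymbol w).
$$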

Maximum a posteriori (MAP) estimation maximizes the posterior distribution:

$$
\widehat{\boldsymbol w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol w} p(\boldsymbol w \vert \mathcal D).
$$
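Since $p(\mathcal D)$ does not depend on $\boldsymbol w$, taking the logarithm gives the equivalent form

$$
\widehat{\boldsymbol w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol w}\big(\log p(\mathcal D \vert \boldsymbol w) + \log p(\boldsymbol w)\big),
$$

i.e. log-likelihood maximization with an extra term coming from the prior. For example, under the (illustrative) assumption of a Gaussian prior $p(\boldsymbol w) = \mathcal N(\boldsymbol 0, \tau^2 \boldsymbol I)$ we have $\log p(\boldsymbol w) = -\frac{\Vert \boldsymbol w \Vert^2}{2\tau^2} + \mathrm{const}$, so MAP estimation amounts to minimizing the NLL plus an $L_2$ penalty on the weights.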