Naive Bayes classifier#

A Naive Bayes classifier is a probabilistic machine learning model that is widely used for classification tasks, particularly in natural language processing (NLP): text classification, spam detection, sentiment analysis, and more. It is based on Bayes' theorem and makes a "naive" assumption about the independence of features. Despite this simplifying assumption, Naive Bayes often performs surprisingly well in practice.

Discriminative vs generative models#

A discriminative model learns the posterior distribution \(p(y \vert \boldsymbol x, \boldsymbol w)\). Example: the logistic regression model

\[ \mathbb P(y=1 \vert \boldsymbol x, \boldsymbol w) = \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w), \quad \mathbb P(y=0 \vert \boldsymbol x, \boldsymbol w) = 1 - \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w) \]

A generative model estimates the joint distribution

\[ p(\boldsymbol x, y) = p(\boldsymbol x \vert y) p(y). \]
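
As a minimal illustration (a sketch assuming scikit-learn is available; the toy dataset and the names X_toy, y_toy, disc, gen are made up for this example), both kinds of models can output class probabilities, but the generative one builds them from \(p(\boldsymbol x \vert y)\) and \(p(y)\):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# toy two-feature binary classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=0)

disc = LogisticRegression().fit(X_toy, y_toy)  # discriminative: models p(y | x, w) directly
gen = GaussianNB().fit(X_toy, y_toy)           # generative: models p(x | y) and p(y)

print(disc.predict_proba(X_toy[:3]))
print(gen.predict_proba(X_toy[:3]))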

Bayesian classifier#

For classification, use Bayes' theorem:

\[ \mathbb P(y = k \vert \boldsymbol x) = \frac{p(\boldsymbol x \vert y = k) \mathbb P(y = k)}{\sum\limits_{j=1}^K p(\boldsymbol x \vert y = j) \mathbb P(y = j)} \]

The Bayesian classifier maximizes the posterior, which is equivalent to maximizing the numerator of this expression:

\[ \widehat y = \arg\max\limits_{1\leqslant j \leqslant K} p(\boldsymbol x \vert y = j) \mathbb P(y = j) \]

How to estimate \(\mathbb P(y = k)\) and \(p(\boldsymbol x \vert y = k)\) given a training dataset \((\boldsymbol X, \boldsymbol y)\)?

  • To estimate \(\mathbb P(y = k)\) we can use class frequencies:

    \[ \widehat y_k = \frac 1n \sum\limits_{i=1}^n \mathbb I[y_i = k] \]
  • Estimating the density \(p(\boldsymbol x \vert y = k)\) is a much harder task. It can be approached via

    • parametric density estimation

    • nonparametric density estimation

    • mixture of distributions

    (see Vorontsov’s slides for details)

Special cases of parametric density estimation:

  • Quadratic discriminant analysis (QDA):

    \[ p(\boldsymbol x \vert y = k) = \mathcal N(\boldsymbol x \vert \boldsymbol \mu_k, \boldsymbol \Sigma_k) = \frac {\exp\big(-\frac 12(\boldsymbol x - \boldsymbol \mu_k)^\mathsf{T} \boldsymbol \Sigma_k^{-1} (\boldsymbol x - \boldsymbol \mu_k)\big)}{\sqrt{(2\pi)^d \det \boldsymbol \Sigma_k}} \]
  • Linear discriminant analysis (LDA): the covariance matrix \(\boldsymbol \Sigma\) is assumed to be the same for all classes, and its MLE estimate (see the code sketch after this list) is

    \[ \widehat \Sigma = \frac 1n\sum\limits_{i=1}^n (\boldsymbol x_i - \boldsymbol{\widehat \mu}_{y_i})(\boldsymbol x_i - \boldsymbol{\widehat \mu}_{y_i})^\mathsf{T} \]
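
A minimal NumPy sketch of these estimates (the function name lda_estimates is made up for this example): it computes the class frequencies \(\widehat y_k\), the class means \(\boldsymbol{\widehat \mu}_k\), and the pooled LDA covariance \(\widehat \Sigma\).

import numpy as np

def lda_estimates(X, y):
    n, d = X.shape
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])        # frequencies, estimate P(y = k)
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # class means mu_k
    # pooled MLE covariance: each sample is centered by the mean of its own class
    centered = X - means[np.searchsorted(classes, y)]
    sigma = centered.T @ centered / n
    return priors, means, sigma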

Naive Bayes estimation#

Naive assumption: all features, conditioned on the target, are independent:

\[ p(\boldsymbol x \vert y) = p(x_1, \ldots, x_d \vert y) = \prod\limits_{j=1}^d p(x_j \vert y) \]

Estimating one-dimensional densities \(p(x_j \vert y)\) is much easier than estimating multivariate ones. The output of the Bayesian classifier then becomes

\[ \arg\max\limits_{1\leqslant k \leqslant K}\big(\log \widehat y_k + \sum\limits_{j=1}^d \log \widehat p_j(x_j \vert y = k)\big) \]
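
In log space this rule is straightforward to implement. Below is a minimal sketch; naive_bayes_predict, log_prior and log_cond are hypothetical names, with log_cond[k][j] standing for a function that returns \(\log \widehat p_j(x_j \vert y = k)\).

import numpy as np

def naive_bayes_predict(x, log_prior, log_cond):
    # score of class k = log prior + sum of per-feature log densities
    scores = [log_prior[k] + sum(log_cond[k][j](x[j]) for j in range(len(x)))
              for k in range(len(log_prior))]
    return int(np.argmax(scores))

As a worked example, consider the classic PlayTennis toy dataset:
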
import pandas as pd

tennis = pd.read_csv("PlayTennis.csv")
tennis
Outlook Temperature Humidity Wind Play Tennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
5 Rain Cool Normal Strong No
6 Overcast Cool Normal Strong Yes
7 Sunny Mild High Weak No
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
13 Rain Mild High Strong No
tennis[tennis["Play Tennis"] == "Yes"]
Outlook Temperature Humidity Wind Play Tennis
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
6 Overcast Cool Normal Strong Yes
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
tennis[tennis["Play Tennis"] == "No"]
Outlook Temperature Humidity Wind Play Tennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
5 Rain Cool Normal Strong No
7 Sunny Mild High Weak No
13 Rain Mild High Strong No

Let

\[ \boldsymbol x = (\mathrm{Sunny}, \mathrm{Cool}, \mathrm{High}, \mathrm{Strong}) \]

Calculate \(\mathbb P(\boldsymbol x\vert \mathrm{Yes})\) and \(\mathbb P(\boldsymbol x\vert \mathrm{No})\). What is the prediction \(\widehat y\) for the sample \(\boldsymbol x\)?
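
To check your hand calculation, here is a sketch of the same computation with pandas, reusing the tennis dataframe loaded above; the dictionary below encodes the sample \(\boldsymbol x\), and the class with the larger unnormalized score wins.

x = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}

for label in ["Yes", "No"]:
    subset = tennis[tennis["Play Tennis"] == label]
    prior = len(subset) / len(tennis)                    # estimate of P(y = label)
    likelihood = 1.0
    for feature, value in x.items():
        likelihood *= (subset[feature] == value).mean()  # estimate of p_j(x_j | y = label)
    print(label, prior * likelihood)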

Example: MNIST#

sklearn has several types of Naive Bayes models depending on the assumed form of \(p(x_j \vert y)\), e.g.

  • Gaussian Naive Bayes (continuous features)

  • Multinomial Naive Bayes (count features)

  • Bernoulli Naive Bayes (binary/indicator features)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')

# scale pixel intensities to [0, 1] and convert string labels to integers
X = X.astype(float).values / 255
Y = Y.astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((60000, 784), (10000, 784), (60000,), (10000,))

Try Gaussian Naive Bayes:

from sklearn.naive_bayes import GaussianNB, BernoulliNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_hat = gnb.predict(X_test)
print("Gaussian accuracy:", accuracy_score(y_test, y_hat))
Gaussian accuracy: 0.5627
plt.figure(figsize=(10, 8))
plt.title("Naive Bayes on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);
[Figure: confusion matrix heatmap, Gaussian Naive Bayes on MNIST]

This does not look very good. Now let's try Bernoulli Naive Bayes:

bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_hat = bnb.predict(X_test)
print("Bernoulli accuracy:", accuracy_score(y_test, y_hat))
Bernoulli accuracy: 0.8324
plt.figure(figsize=(10, 8))
plt.title("Naive Bayes on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);
[Figure: confusion matrix heatmap, Bernoulli Naive Bayes on MNIST]

Q. The Bernoulli accuracy is much higher than the Gaussian one. Why do you think that is?
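
One way to investigate (a sketch reusing the arrays from above; the pixel index 300 is chosen arbitrarily): look at how the intensities of a single pixel are actually distributed within a class, and ask how well a Gaussian fits that distribution.

pixel = 300  # an arbitrary pixel index
plt.hist(X_train[y_train == 0, pixel], bins=30, alpha=0.5, label="digit 0")
plt.hist(X_train[y_train == 1, pixel], bins=30, alpha=0.5, label="digit 1")
plt.legend()
plt.title(f"Intensities of pixel {pixel}");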