Naive Bayes classifier#

A Naive Bayes classifier is a probabilistic machine learning model that is widely used for classification tasks, particularly in natural language processing (NLP): text classification, spam detection, sentiment analysis, and more. It is based on Bayes' theorem and makes a "naive" assumption about the independence of features. Despite this simplifying assumption, Naive Bayes often performs surprisingly well in practice.

Discriminative vs generative models#

A discriminative model learns the posterior distribution \(p(y \vert \boldsymbol x, \boldsymbol w)\). Example: the logistic regression model

\[ \mathbb P(y=1 \vert \boldsymbol x, \boldsymbol w) = \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w), \quad \mathbb P(y=0 \vert \boldsymbol x, \boldsymbol w) = 1 - \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w) \]

A generative model estimates the joint distribution

\[ p(\boldsymbol x, y) = p(\boldsymbol x \vert y) p(y). \]
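
As a minimal illustration (a sketch assuming scikit-learn is available; the toy dataset and the names X_toy, y_toy, disc, gen are made up for this example), both kinds of models can output class probabilities, but the generative one builds them from \(p(\boldsymbol x \vert y)\) and \(p(y)\):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# toy two-feature binary classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=0)

disc = LogisticRegression().fit(X_toy, y_toy)  # discriminative: models p(y | x, w) directly
gen = GaussianNB().fit(X_toy, y_toy)           # generative: models p(x | y) and p(y)

print(disc.predict_proba(X_toy[:3]))
print(gen.predict_proba(X_toy[:3]))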

Bayesian classifier#

For classification, use Bayes' theorem:

\[ \mathbb P(y = k \vert \boldsymbol x) = \frac{p(\boldsymbol x \vert y = k) \mathbb P(y = k)}{\sum\limits_{j=1}^K p(\boldsymbol x \vert y = j) \mathbb P(y = j)} \]

The Bayesian classifier maximizes the posterior, which is equivalent to maximizing the numerator of this expression:

\[ \widehat y = \arg\max\limits_{1\leqslant j \leqslant K} p(\boldsymbol x \vert y = j) \mathbb P(y = j) \]

How to estimate \(\mathbb P(y = k)\) and \(p(\boldsymbol x \vert y = k)\) given a training dataset \((\boldsymbol X, \boldsymbol y)\)?

  • To estimate \(\mathbb P(y = k)\) we can use class frequencies:

    \[ \widehat y_k = \frac 1n \sum\limits_{i=1}^n \mathbb I[y_i = k] \]
  • Estimating the density \(p(\boldsymbol x \vert y = k)\) is a much harder task. It can be approached via

    • parametric density estimation

    • nonparametric density estimation

    • mixture of distributions

    (see Vorontsov’s slides for details)

Special cases of parametric density estimation:

  • Quadratic discriminant analysis (QDA):

    \[ p(\boldsymbol x \vert y = k) = \mathcal N(\boldsymbol x \vert \boldsymbol \mu_k, \boldsymbol \Sigma_k) = \frac {\exp\big(-\frac 12(\boldsymbol x - \boldsymbol \mu_k)^\mathsf{T} \boldsymbol \Sigma_k^{-1} (\boldsymbol x - \boldsymbol \mu_k)\big)}{\sqrt{(2\pi)^d \det \boldsymbol \Sigma_k}} \]
  • Linear discriminant analysis (LDA): the covariance matrix \(\boldsymbol \Sigma\) is assumed to be the same for all classes, and its MLE estimate (see the code sketch after this list) is

    \[ \widehat \Sigma = \frac 1n\sum\limits_{i=1}^n (\boldsymbol x_i - \boldsymbol{\widehat \mu}_{y_i})(\boldsymbol x_i - \boldsymbol{\widehat \mu}_{y_i})^\mathsf{T} \]
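
A minimal NumPy sketch of these estimates (the function name lda_estimates is made up for this example): it computes the class frequencies \(\widehat y_k\), the class means \(\boldsymbol{\widehat \mu}_k\), and the pooled LDA covariance \(\widehat \Sigma\).

import numpy as np

def lda_estimates(X, y):
    n, d = X.shape
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])        # frequencies, estimate P(y = k)
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # class means mu_k
    # pooled MLE covariance: each sample is centered by the mean of its own class
    centered = X - means[np.searchsorted(classes, y)]
    sigma = centered.T @ centered / n
    return priors, means, sigma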

Naive Bayes estimation#

Naive assumption: all features, conditioned on the target, are independent:

\[ p(\boldsymbol x \vert y) = p(x_1, \ldots, x_d \vert y) = \prod\limits_{j=1}^d p(x_j \vert y) \]

Estimating one-dimensional densities \(p(x_j \vert y)\) is much easier than estimating multivariate ones. The output of the Bayesian classifier then becomes

\[ \arg\max\limits_{1\leqslant k \leqslant K}\big(\log \widehat y_k + \sum\limits_{j=1}^d \log \widehat p_j(x_j \vert y = k)\big) \]
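
In log space this rule is straightforward to implement. Below is a minimal sketch; naive_bayes_predict, log_prior and log_cond are hypothetical names, with log_cond[k][j] standing for a function that returns \(\log \widehat p_j(x_j \vert y = k)\).

import numpy as np

def naive_bayes_predict(x, log_prior, log_cond):
    # score of class k = log prior + sum of per-feature log densities
    scores = [log_prior[k] + sum(log_cond[k][j](x[j]) for j in range(len(x)))
              for k in range(len(log_prior))]
    return int(np.argmax(scores))

As a worked example, consider the classic PlayTennis toy dataset:
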
import pandas as pd

tennis = pd.read_csv("PlayTennis.csv")
tennis
Outlook Temperature Humidity Wind Play Tennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
5 Rain Cool Normal Strong No
6 Overcast Cool Normal Strong Yes
7 Sunny Mild High Weak No
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
13 Rain Mild High Strong No
tennis[tennis["Play Tennis"] == "Yes"]
Outlook Temperature Humidity Wind Play Tennis
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
6 Overcast Cool Normal Strong Yes
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
tennis[tennis["Play Tennis"] == "No"]
Outlook Temperature Humidity Wind Play Tennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
5 Rain Cool Normal Strong No
7 Sunny Mild High Weak No
13 Rain Mild High Strong No

Let

\[ \boldsymbol x = (\mathrm{Sunny}, \mathrm{Cool}, \mathrm{High}, \mathrm{Strong}) \]

Calculate \(\mathbb P(\boldsymbol x\vert \mathrm{Yes})\) and \(\mathbb P(\boldsymbol x\vert \mathrm{No})\). What is the prediction \(\widehat y\) for the sample \(\boldsymbol x\)?
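
To check your hand calculation, here is a sketch of the same computation with pandas, reusing the tennis dataframe loaded above; the dictionary below encodes the sample \(\boldsymbol x\), and the class with the larger unnormalized score wins.

x = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}

for label in ["Yes", "No"]:
    subset = tennis[tennis["Play Tennis"] == label]
    prior = len(subset) / len(tennis)                    # estimate of P(y = label)
    likelihood = 1.0
    for feature, value in x.items():
        likelihood *= (subset[feature] == value).mean()  # estimate of p_j(x_j | y = label)
    print(label, prior * likelihood)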

Example: MNIST#

sklearn has several types of Naive Bayes models depending on the assumed form of \(p(x_j \vert y)\), e.g.

  • Gaussian Naive Bayes (continuous features)

  • Multinomial Naive Bayes (count features)

  • Bernoulli Naive Bayes (binary/indicator features)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')

# scale pixel intensities to [0, 1] and convert string labels to integers
X = X.astype(float).values / 255
Y = Y.astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((60000, 784), (10000, 784), (60000,), (10000,))

Try Gaussian Naive Bayes:

from sklearn.naive_bayes import GaussianNB, BernoulliNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_hat = gnb.predict(X_test)
print("Gaussian accuracy:", accuracy_score(y_test, y_hat))
Gaussian accuracy: 0.5627
plt.figure(figsize=(10, 8))
plt.title("Naive Bayes on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);
[Figure: confusion matrix heatmap, Gaussian Naive Bayes on MNIST]

This does not look very good. Now let's try Bernoulli Naive Bayes:

bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_hat = bnb.predict(X_test)
print("Bernoulli accuracy:", accuracy_score(y_test, y_hat))
Bernoulli accuracy: 0.8324
plt.figure(figsize=(10, 8))
plt.title("Naive Bayes on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);
[Figure: confusion matrix heatmap, Bernoulli Naive Bayes on MNIST]

Q. The Bernoulli accuracy is much higher than the Gaussian one. Why do you think that is?
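
One way to investigate (a sketch reusing the arrays from above; the pixel index 300 is chosen arbitrarily): look at how the intensities of a single pixel are actually distributed within a class, and ask how well a Gaussian fits that distribution.

pixel = 300  # an arbitrary pixel index
plt.hist(X_train[y_train == 0, pixel], bins=30, alpha=0.5, label="digit 0")
plt.hist(X_train[y_train == 1, pixel], bins=30, alpha=0.5, label="digit 1")
plt.legend()
plt.title(f"Intensities of pixel {pixel}");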