Naive Bayes classifier#
A Naive Bayes classifier is a probabilistic machine learning model that is widely used for classification tasks, particularly in natural language processing (NLP) and text classification, spam detection, sentiment analysis, and more. It is based on Bayes’ theorem and makes a “naive” assumption about the independence of features. Despite its simplifying assumption, Naive Bayes often performs surprisingly well in practice.
Discriminative vs generative models#
A discriminative model learns posterior distribution \(p(y \vert \boldsymbol x, \boldsymbol w)\). Example: logistic regression model
A generative model estimates joint distribution
Bayesian classifier#
For classification use Bayes theorem:
Bayesian classifier maximizes this expression:
How to estimate \(\mathbb P(y = k)\) and \(p(\boldsymbol x \vert y = k)\) given a training dataset \((\boldsymbol X, \boldsymbol y)\)?
To esimatate \(\mathbb P(y = k)\) we can use frequencies:
\[ \widehat y_k = \frac 1n \sum\limits_{i=1}^n \mathbb I[y_i = k] \]Estimation of density \(p(\boldsymbol x \vert y = k)\) is much harder task. It can be solved via
parametric density estimation
nonparametric density estimation
mixture of distributions
(see Vorontsov’s slides for details)
Special cases of parametric density estimation:
Quadratic discriminant analysis (QDA):
\[ p(\boldsymbol x \vert y = k) = \mathcal N(\boldsymbol x \vert \boldsymbol \mu_k, \boldsymbol \Sigma_k) = \frac {\exp\big(-\frac 12(\boldsymbol x - \boldsymbol \mu_k)^\mathsf{T} \boldsymbol \Sigma_k^{-1} (\boldsymbol x - \boldsymbol \mu_k)\big)}{\sqrt{(2\pi)^n \det \boldsymbol \Sigma_k}} \]Linear discriminant analysis (LDA): the covariance matrix \(\boldsymbol \Sigma\) is the same for all classes, and it’s MLE estimation is
\[ \widehat \Sigma = \frac 1n\sum\limits_{i=1}^n (\boldsymbol x_i - \boldsymbol{\widehat \mu}_{y_i})(\boldsymbol x_i - \boldsymbol{\widehat \mu}_{y_i})^\mathsf{T} \]
Naive Bayes estimation#
Naive assumption: all feature, conditioned on target, are independent:
To estimate 1-d densities \(p(x_j \vert y)\) is much easier than multivariate ones. The output of the Bayesian classifier is given by
import pandas as pd
tennis = pd.read_csv("PlayTennis.csv")
tennis
Outlook | Temperature | Humidity | Wind | Play Tennis | |
---|---|---|---|---|---|
0 | Sunny | Hot | High | Weak | No |
1 | Sunny | Hot | High | Strong | No |
2 | Overcast | Hot | High | Weak | Yes |
3 | Rain | Mild | High | Weak | Yes |
4 | Rain | Cool | Normal | Weak | Yes |
5 | Rain | Cool | Normal | Strong | No |
6 | Overcast | Cool | Normal | Strong | Yes |
7 | Sunny | Mild | High | Weak | No |
8 | Sunny | Cool | Normal | Weak | Yes |
9 | Rain | Mild | Normal | Weak | Yes |
10 | Sunny | Mild | Normal | Strong | Yes |
11 | Overcast | Mild | High | Strong | Yes |
12 | Overcast | Hot | Normal | Weak | Yes |
13 | Rain | Mild | High | Strong | No |
tennis[tennis["Play Tennis"] == "Yes"]
Outlook | Temperature | Humidity | Wind | Play Tennis | |
---|---|---|---|---|---|
2 | Overcast | Hot | High | Weak | Yes |
3 | Rain | Mild | High | Weak | Yes |
4 | Rain | Cool | Normal | Weak | Yes |
6 | Overcast | Cool | Normal | Strong | Yes |
8 | Sunny | Cool | Normal | Weak | Yes |
9 | Rain | Mild | Normal | Weak | Yes |
10 | Sunny | Mild | Normal | Strong | Yes |
11 | Overcast | Mild | High | Strong | Yes |
12 | Overcast | Hot | Normal | Weak | Yes |
tennis[tennis["Play Tennis"] == "No"]
Outlook | Temperature | Humidity | Wind | Play Tennis | |
---|---|---|---|---|---|
0 | Sunny | Hot | High | Weak | No |
1 | Sunny | Hot | High | Strong | No |
5 | Rain | Cool | Normal | Strong | No |
7 | Sunny | Mild | High | Weak | No |
13 | Rain | Mild | High | Strong | No |
Let
Calculate \(\mathbb P(\boldsymbol x\vert \mathrm{Yes})\) and \(\mathbb P(\boldsymbol x\vert \mathrm{No})\). What is the prediction \(\widehat y\) for the sample \(\boldsymbol x\)?
Example: MNIST#
sklearn has several types of Naive Bayes models depending on \(p(\boldsymbol x_i \vert y)\), e.g.
Gaussian Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
%config InlineBackend.figure_format = 'svg'
X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
X = X.astype(float).values / 255
Y = Y.astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((60000, 784), (10000, 784), (60000,), (10000,))
Try Gaussian Naive Bayes:
from sklearn.naive_bayes import GaussianNB, BernoulliNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_hat = gnb.predict(X_test)
print("Gaussian accuracy:", accuracy_score(y_test, y_hat))
Gaussian accuracy: 0.5627
plt.figure(figsize=(10, 8))
plt.title("Naive Bayes on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);
Looks not very good. Now try Bernoulli Naive Bayes:
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_hat = bnb.predict(X_test)
print("Bernoulli accuracy:", accuracy_score(y_test, y_hat))
Bernoulli accuracy: 0.8324
plt.figure(figsize=(10, 8))
plt.title("Naive Bayes on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);
Q. Bernoulli accuracy looks much higher than Gaussian. How do you think — why?