HW2#

Deadline: 24.11.2024 23:59 (GMT+5)

import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import imread
from skimage.transform import resize
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from glob import glob
%config InlineBackend.figure_format = 'svg'

def load_notmnist(path='./notMNIST_small',letters='ABCDEFGHIJ',
                  img_shape=(28,28),test_size=0.25,one_hot=False):
    
    # Download the data if it's missing. If you have any problems, open the URL below and download it manually.
    if not os.path.exists(path):
        print("Downloading data...")
        assert os.system('curl http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz > notMNIST_small.tar.gz') == 0
        print("Extracting ...")
        assert os.system('tar -zxvf notMNIST_small.tar.gz > untar_notmnist.log') == 0
    
    data,labels = [],[]
    print("Parsing...")
    for img_path in glob(os.path.join(path,'*/*')):
        class_i = img_path.split(os.sep)[-2]
        if class_i not in letters: 
            continue
        try:
            data.append(resize(imread(img_path), img_shape))
            labels.append(class_i)
        except Exception:  # skip unreadable images without swallowing KeyboardInterrupt
            print("found broken img: %s [it's ok if <10 images are broken]" % img_path)
        
    data = np.stack(data)[:,None].astype('float32')
    data = (data - np.mean(data)) / np.std(data)

    #convert classes to ints
    letter_to_i = {l:i for i,l in enumerate(letters)}
    labels = np.array(list(map(letter_to_i.get, labels)))
    
    if one_hot:
        labels = (np.arange(np.max(labels) + 1)[None,:] == labels[:, None]).astype('float32')
    
    #split into train/test
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=test_size, stratify=labels)
    
    print("Done")
    return X_train, y_train, X_test, y_test
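
The `one_hot` branch of `load_notmnist` builds its encoding by broadcasting a row of class indices against a column of labels. A minimal sketch of that trick on a toy label vector (`toy_labels` is a hypothetical array, not part of the dataset):

```python
import numpy as np

# Hypothetical toy labels standing in for the integer class indices.
toy_labels = np.array([0, 2, 1, 2])

# A (1, C) row of class ids compared with an (N, 1) column of labels
# broadcasts to an (N, C) boolean matrix with exactly one True per row.
n_classes = np.max(toy_labels) + 1
one_hot = (np.arange(n_classes)[None, :] == toy_labels[:, None]).astype('float32')
print(one_hot)

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
```

The default `one_hot=False` keeps integer labels, which is the form the scikit-learn metrics imported above expect.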

Task description#

In this notebook you are asked to train several machine learning models on the notMNIST dataset. To get full points for a model, it must beat the baseline on the test dataset. Models to test:

  • naive Bayes classifier, 80% (0.5 points)

  • decision tree, 83.5% (1.5 points)

  • random forest, 91.5% (2 points)

  • CatBoost, 92% (1.5 points)

  • MLP, 93% (2.5 points)

  • CNN, 95% (3 points)

  • overall comparison of models, including some graphs (e.g., a bar plot of test accuracy) (1 point)

Important notes#

  • All outputs of code cells must be preserved in your submission

  • Broken code in a section automatically means 0 points for that section

  • Do not erase any existing cells

  • Use the %%time cell magic to measure the execution time of computationally heavy cells

  • For each model use the following structure:

    1. Import and build model

    2. Fit model on train dataset

    3. Tune one or more hyperparameters to improve the performance of your model (you may find randomized search useful)

    4. Print train and test accuracy of your best model

    5. Make predictions of your best model on test dataset

    6. Plot confusion matrix

    7. Plot 16 random samples from the test dataset with true labels and predicted classes

  • Budget your time: training ML models and searching for optimal parameters can be very time-consuming
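
The per-model structure above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a small synthetic dataset from `make_classification` instead of notMNIST, a decision tree as the example model, and an arbitrary hyperparameter grid; none of these choices is prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 600 samples, 20 features, 3 classes.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Steps 1-2: build the model and fit it on the train split.
base_model = DecisionTreeClassifier(random_state=0)
base_model.fit(X_tr, y_tr)

# Step 3: randomized search over an illustrative hyperparameter grid.
param_distributions = {
    "max_depth": [3, 5, 8, 12, None],
    "min_samples_leaf": [1, 2, 5, 10],
}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_distributions, n_iter=8, cv=3,
                            scoring="accuracy", random_state=0)
search.fit(X_tr, y_tr)
best_model = search.best_estimator_

# Steps 4-5: predictions and train/test accuracy of the best model.
y_pred = best_model.predict(X_te)
train_acc = accuracy_score(y_tr, best_model.predict(X_tr))
test_acc = accuracy_score(y_te, y_pred)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# Step 6: confusion matrix (rows are true classes, columns are predictions).
cm = confusion_matrix(y_te, y_pred)
print(cm)
```

In the notebook, the matrix could then be drawn with `sns.heatmap(cm, annot=True, fmt="d")`, and step 7 is covered by calling `plot_letters` with both true and predicted labels.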

Load notmnist dataset#

%%time
X_train, y_train, X_test, y_test = load_notmnist(letters='ABCDEFGHIJ')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])
Parsing...
found broken img: ./notMNIST_small/A/RGVtb2NyYXRpY2FCb2xkT2xkc3R5bGUgQm9sZC50dGY=.png [it's ok if <10 images are broken]
found broken img: ./notMNIST_small/F/Q3Jvc3NvdmVyIEJvbGRPYmxpcXVlLnR0Zg==.png [it's ok if <10 images are broken]
Done
CPU times: user 9.5 s, sys: 2.53 s, total: 12 s
Wall time: 17.3 s

Size of train and test datasets:

X_train.shape, X_test.shape
((14043, 784), (4681, 784))

Verify that the classes are balanced:

np.unique(y_train, return_counts=True)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([1404, 1405, 1404, 1405, 1405, 1404, 1404, 1404, 1404, 1404]))

Visualize some data#

def plot_letters(X, y_true, y_pred=None, n=4, random_state=123):
    np.random.seed(random_state)
    indices = np.random.choice(np.arange(X.shape[0]), size=n*n, replace=False)
    plt.figure(figsize=(10, 10))
    for i in range(n*n):
        plt.subplot(n, n, i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(X[indices[i]].reshape(28, 28), cmap='gray')
        # plt.imshow(train_images[i], cmap=plt.cm.binary)
        if y_pred is None:
            title = chr(ord("A") + y_true[indices[i]])
        else:
            title = f"y={chr(ord('A') + y_true[indices[i]])}, ŷ={chr(ord('A') + y_pred[indices[i]])}"
        plt.title(title, size=20)
    plt.show()
plot_letters(X_train, y_train, random_state=912)
[Figure: 4×4 grid of random training images, each titled with its true letter]

Naive Bayes#

Decision tree#

Random Forest#

MLP#

CNN#

Plot results#
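
One possible shape for the comparison plot is a bar chart of test accuracy with each baseline drawn as a marker over its bar. The sketch below is an assumption about layout only: the helper `plot_test_accuracy` is hypothetical, the baselines are copied from the task description, and the accuracies passed in are placeholders rather than measured results.

```python
import matplotlib.pyplot as plt

# Baselines copied from the task description.
BASELINES = {
    "Naive Bayes": 0.80,
    "Decision tree": 0.835,
    "Random forest": 0.915,
    "CatBoost": 0.92,
    "MLP": 0.93,
    "CNN": 0.95,
}

def plot_test_accuracy(test_acc, baselines=BASELINES):
    """Hypothetical helper: bar plot of test accuracy with baseline markers."""
    names = list(test_acc)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(names, [test_acc[n] for n in names], color="steelblue",
           label="test accuracy")
    first = True
    for i, name in enumerate(names):
        if name in baselines:
            # Short horizontal line at the baseline level, spanning the bar width.
            ax.hlines(baselines[name], i - 0.4, i + 0.4, colors="crimson",
                      label="baseline" if first else "_nolegend_")
            first = False
    ax.set_ylabel("accuracy")
    ax.set_ylim(0, 1)
    ax.legend(loc="lower right")
    return fig

# Placeholder values only (NOT real results): replace with measured test accuracies.
placeholder_acc = {name: 0.5 for name in BASELINES}
fig = plot_test_accuracy(placeholder_acc)
```

Replacing `placeholder_acc` with a dictionary of the test accuracies printed in the sections above gives the final comparison.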