HW2
Deadline: 24.11.2024 23:59 (GMT+5)
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import imread
from skimage.transform import resize
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from glob import glob
%config InlineBackend.figure_format = 'svg'
def load_notmnist(path='./notMNIST_small', letters='ABCDEFGHIJ',
                  img_shape=(28, 28), test_size=0.25, one_hot=False):
    # Download the data if it is missing. If you have any problems, go to the URL and download it manually.
    if not os.path.exists(path):
        print("Downloading data...")
        assert os.system('curl http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz > notMNIST_small.tar.gz') == 0
        print("Extracting ...")
        assert os.system('tar -zxvf notMNIST_small.tar.gz > untar_notmnist.log') == 0

    data, labels = [], []
    print("Parsing...")
    for img_path in glob(os.path.join(path, '*/*')):
        # The parent directory name is the class letter
        class_i = img_path.split(os.sep)[-2]
        if class_i not in letters:
            continue
        try:
            data.append(resize(imread(img_path), img_shape))
            labels.append(class_i)
        except Exception:
            print("found broken img: %s [it's ok if <10 images are broken]" % img_path)

    # Add a channel axis: (N, 28, 28) -> (N, 1, 28, 28)
    data = np.stack(data)[:, None].astype('float32')
    # Standardize with the global mean and standard deviation
    data = (data - np.mean(data)) / np.std(data)

    # Convert class letters to integer labels
    letter_to_i = {l: i for i, l in enumerate(letters)}
    labels = np.array(list(map(letter_to_i.get, labels)))

    if one_hot:
        labels = (np.arange(np.max(labels) + 1)[None, :] == labels[:, None]).astype('float32')

    # Split into train/test, keeping class proportions equal in both parts
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=test_size, stratify=labels)

    print("Done")
    return X_train, y_train, X_test, y_test
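The one_hot branch of the loader relies on NumPy broadcasting, which can be hard to read at first. The snippet below is a minimal sketch of the same comparison on a tiny, made-up label vector (the four labels and three classes are assumptions chosen for illustration, not data from notMNIST).

```python
import numpy as np

# Hypothetical integer labels for four samples drawn from three classes.
labels = np.array([0, 2, 1, 2])

# A row of class ids with shape (1, C) compared against a column of labels
# with shape (N, 1) broadcasts to an (N, C) boolean matrix that holds
# exactly one True per row.
class_ids = np.arange(np.max(labels) + 1)[None, :]
one_hot = (class_ids == labels[:, None]).astype('float32')

print(one_hot)
```

Taking the argmax along axis 1 recovers the integer labels, which is handy when a model outputs one score per class.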
Task description
In this notebook you are asked to train several machine learning models on the notMNIST dataset. To get full points for a model, its accuracy on the test dataset must beat the baseline listed next to it. Models to test:
naive Bayes classifier, 80% (0.5 points)
decision tree, 83.5% (1.5 points)
random forest, 91.5% (2 points)
CatBoost, 92% (1.5 points)
MLP, 93% (2.5 points)
CNN, 95% (3 points)
overall comparison of models, including some graphs (e.g., a bar plot of test accuracy) (1 point)
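For the final comparison, one possible starting point is a bar plot with one bar per model. The sketch below plots the baseline accuracies from the list above as placeholder values; plot_accuracy_bars is a hypothetical helper name, and the numbers should be replaced with the test accuracies you actually measure.

```python
import matplotlib.pyplot as plt

# Baseline test accuracies (in %) copied from the task list.
# These are placeholders: substitute your own measured test accuracies.
baselines = {
    "Naive Bayes": 80.0,
    "Decision tree": 83.5,
    "Random forest": 91.5,
    "CatBoost": 92.0,
    "MLP": 93.0,
    "CNN": 95.0,
}


def plot_accuracy_bars(scores, title="Test accuracy by model"):
    """Draw one bar per model and return the Axes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    bars = ax.bar(list(scores.keys()), list(scores.values()))
    # Write the numeric value above each bar.
    for bar in bars:
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                f"{bar.get_height():.1f}", ha="center", va="bottom")
    ax.set_ylabel("Test accuracy, %")
    ax.set_ylim(0, 100)
    ax.set_title(title)
    return ax


ax = plot_accuracy_bars(baselines)
```

Drawing the baselines and your own results side by side (for example, two calls with different titles) makes it easy to see which models clear their threshold.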
Important notes
All outputs of code cells must be preserved in your submission
Broken code in a section automatically means 0 points for that section
Do not erase any existing cells
Use the %%time cell magic to measure the execution time of computation-heavy cells
For each model use the following structure:
Import and build model
Fit model on train dataset
Tune one or several hyperparameters to improve the performance of your model (you may find randomized search useful)
Print train and test accuracy of your best model
Make predictions of your best model on test dataset
Plot confusion matrix
Plot 16 random samples from the test dataset with true labels and predicted classes
Watch your time: training ML models and searching for optimal hyperparameters can be very time-consuming
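The per-model structure above can be sketched roughly as follows. This is an illustration under several assumptions, not a reference solution: it uses scikit-learn's small digits dataset as a stand-in for notMNIST so that it runs in seconds, picks Gaussian naive Bayes as the example model, and searches an arbitrary grid of var_smoothing values with 3-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data: scikit-learn's 8x8 digits, NOT notMNIST (assumption).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Build the model and fit it on the train dataset.
model = GaussianNB().fit(X_tr, y_tr)

# Tune one hyperparameter with randomized search; cross-validation
# uses only the train dataset, so the test dataset stays untouched.
search = RandomizedSearchCV(
    GaussianNB(),
    param_distributions={"var_smoothing": np.logspace(-9, 0, 30).tolist()},
    n_iter=10, cv=3, scoring="accuracy", random_state=0)
search.fit(X_tr, y_tr)
best = search.best_estimator_

# Train and test accuracy of the best model, plus test predictions.
train_acc = accuracy_score(y_tr, best.predict(X_tr))
y_pred = best.predict(X_te)
test_acc = accuracy_score(y_te, y_pred)
print(f"best params: {search.best_params_}")
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=np.arange(10))
```

In the notebook, the matrix can then be drawn with sns.heatmap(cm, annot=True, fmt="d"), and the predictions passed to plot_letters together with the true labels. Keeping n_iter and cv small is one way to keep the search time under control.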
Load the notMNIST dataset
%%time
X_train, y_train, X_test, y_test = load_notmnist(letters='ABCDEFGHIJ')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])
Parsing...
found broken img: ./notMNIST_small/A/RGVtb2NyYXRpY2FCb2xkT2xkc3R5bGUgQm9sZC50dGY=.png [it's ok if <10 images are broken]
found broken img: ./notMNIST_small/F/Q3Jvc3NvdmVyIEJvbGRPYmxpcXVlLnR0Zg==.png [it's ok if <10 images are broken]
Done
CPU times: user 9.5 s, sys: 2.53 s, total: 12 s
Wall time: 17.3 s
Size of train and test datasets:
X_train.shape, X_test.shape
((14043, 784), (4681, 784))
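The shapes above follow from the loader returning arrays of shape (N, 1, 28, 28), which the reshape call flattens to 1 * 28 * 28 = 784 features per image. A minimal sketch of that round trip, on a made-up batch of three images (the batch itself is an assumption for illustration):

```python
import numpy as np

# Hypothetical mini-batch with the loader's layout: (N, channels, height, width).
batch = np.arange(3 * 1 * 28 * 28, dtype='float32').reshape(3, 1, 28, 28)

# Flatten every image to a 784-dimensional vector, as done for X_train and X_test.
flat = batch.reshape([-1, 784])

# The flattening is lossless: the image layout can be restored later,
# for example before feeding the data to a CNN.
restored = flat.reshape([-1, 1, 28, 28])

# The reported sizes also match test_size=0.25.
n_train, n_test = 14043, 4681
test_fraction = n_test / (n_train + n_test)

print(flat.shape, restored.shape, test_fraction)
```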
Verify that the classes are balanced:
np.unique(y_train, return_counts=True)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([1404, 1405, 1404, 1405, 1405, 1404, 1404, 1404, 1404, 1404]))
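To put a number on "balanced", the counts printed above can be turned into class shares. The sketch below hard-codes those counts (copied from the output, so they will differ slightly if the split is rerun) and also computes the accuracy of always predicting the most frequent class, which gives a floor for every model.

```python
import numpy as np

# Per-class training counts copied from the output above.
counts = np.array([1404, 1405, 1404, 1405, 1405, 1404, 1404, 1404, 1404, 1404])

shares = counts / counts.sum()            # fraction of samples per class
imbalance = counts.max() / counts.min()   # 1.0 means perfectly balanced
majority_baseline = shares.max()          # accuracy of a constant classifier

print(f"total: {counts.sum()}, imbalance ratio: {imbalance:.4f}")
print(f"majority-class accuracy: {majority_baseline:.4f}")
```

With ten nearly equal classes, a constant classifier gets roughly 10% accuracy, so plain accuracy is a reasonable metric here.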
Visualize some data
def plot_letters(X, y_true, y_pred=None, n=4, random_state=123):
    # Draw an n x n grid of random samples, titled with the true
    # (and, if y_pred is given, the predicted) letters
    np.random.seed(random_state)
    indices = np.random.choice(np.arange(X.shape[0]), size=n*n, replace=False)
    plt.figure(figsize=(10, 10))
    for i in range(n*n):
        plt.subplot(n, n, i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(X[indices[i]].reshape(28, 28), cmap='gray')
        if y_pred is None:
            title = chr(ord("A") + y_true[indices[i]])
        else:
            title = f"y={chr(ord('A') + y_true[indices[i]])}, ŷ={chr(ord('A') + y_pred[indices[i]])}"
        plt.title(title, size=20)
    plt.show()
plot_letters(X_train, y_train, random_state=912)
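The titles in plot_letters come from chr(ord('A') + label), which matches the loader's labels only when letters is a contiguous run starting at 'A', as with the default 'ABCDEFGHIJ'. A small sketch of the two mappings (both function names are hypothetical helpers, not part of the notebook):

```python
def label_to_letter(label):
    """Map 0 -> 'A', 1 -> 'B', ... (valid for letters starting at 'A')."""
    return chr(ord('A') + int(label))


def label_to_letter_general(label, letters='ABCDEFGHIJ'):
    """Invert the loader's enumerate(letters) mapping for any letter subset."""
    return letters[int(label)]


default_titles = [label_to_letter(i) for i in range(10)]
subset_titles = [label_to_letter_general(i, letters='ACE') for i in range(3)]

print(default_titles)
print(subset_titles)
```

If the dataset is ever loaded with a different letters argument, indexing into the same letters string is the safer choice.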