HW1

Soft deadline: 13.10.2024 23:59 (GMT+5). Penalty for missing the soft deadline: \(10\%\) per day of delay.

Hard deadline: 20.10.2024 23:59 (GMT+5). No submissions are accepted after the hard deadline.

General recommendations

  • Do not erase any existing cells

  • Write solutions to the math problems in markdown cells of the HW notebook using LaTeX. If you are not familiar with LaTeX, see a 2-page cheat sheet for a quick start

  • Provide your solution with understandable comments; do not submit tons of formulas and/or code cells without any text description of what you are doing

  • Readability counts! Poorly written solutions may be penalized by up to one point

Task 1.1 (1 point)

Let \(\boldsymbol A \in\mathbb R^{m\times n}\), \(\boldsymbol B \in\mathbb R^{n\times m}\). Prove that \(\mathrm{tr}(\boldsymbol{AB}) = \mathrm{tr}(\boldsymbol{BA})\). Using this property, calculate \(\mathrm{tr}(\boldsymbol{uv}^\mathsf{T})\) if \(\boldsymbol u, \boldsymbol v \in\mathbb R^n\), \(\boldsymbol u \perp \boldsymbol v\).
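Before writing the proof, it can help to check the identity numerically. The snippet below is only a sanity check on random matrices, not a substitute for the proof; the sizes `m = 3`, `n = 5` and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
m, n = 3, 5                     # arbitrary illustrative sizes

A = rng.normal(size=(m, n))
B = rng.normal(size=(n, m))

# AB is m x m while BA is n x n, yet their traces should coincide
trace_ab = np.trace(A @ B)
trace_ba = np.trace(B @ A)
print(trace_ab, trace_ba)
```

The two printed numbers should agree up to floating-point rounding, even though the two products have different shapes.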

YOUR SOLUTION HERE

notMNIST dataset

A utility function for fetching and splitting the notMNIST dataset:

import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import imread
from skimage.transform import resize
from sklearn.model_selection import train_test_split
from glob import glob
%config InlineBackend.figure_format = 'svg'

def load_notmnist(path='./notMNIST_small',letters='ABCDEFGHIJ',
                  img_shape=(28,28),test_size=0.25,one_hot=False):
    
    # download data if it's missing. If you have any problems, go to the urls and load it manually.
    if not os.path.exists(path):
        print("Downloading data...")
        assert os.system('curl http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz > notMNIST_small.tar.gz') == 0
        print("Extracting ...")
        assert os.system('tar -zxvf notMNIST_small.tar.gz > untar_notmnist.log') == 0
    
    data,labels = [],[]
    print("Parsing...")
    for img_path in glob(os.path.join(path,'*/*')):
        class_i = img_path.split(os.sep)[-2]
        if class_i not in letters: 
            continue
        try:
            data.append(resize(imread(img_path), img_shape))
            labels.append(class_i)
        except Exception:
            print("found broken img: %s [it's ok if <10 images are broken]" % img_path)
        
    data = np.stack(data)[:,None].astype('float32')
    data = (data - np.mean(data)) / np.std(data)

    #convert classes to ints
    letter_to_i = {l:i for i,l in enumerate(letters)}
    labels = np.array(list(map(letter_to_i.get, labels)))
    
    if one_hot:
        labels = (np.arange(np.max(labels) + 1)[None,:] == labels[:, None]).astype('float32')
    
    #split into train/test
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=test_size, stratify=labels)
    
    print("Done")
    return X_train, y_train, X_test, y_test
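The `one_hot` branch of `load_notmnist` relies on NumPy broadcasting, which may be easier to follow on a tiny input. The sketch below reproduces that one line on made-up labels; the array `labels` and the name `n_classes` are illustrative assumptions, not part of the loader.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # toy integer labels (hypothetical)
n_classes = np.max(labels) + 1    # 3 classes in this toy example

# Compare a (1, n_classes) row of class ids with an (n_samples, 1) column of
# labels; broadcasting yields an (n_samples, n_classes) boolean matrix.
one_hot = (np.arange(n_classes)[None, :] == labels[:, None]).astype('float32')
print(one_hot)
```

Each row contains a single 1 in the column given by the label, so `one_hot.argmax(axis=1)` recovers the original integer labels.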

Fetch and split notMNIST dataset:

%%time
X_train, y_train, X_test, y_test = load_notmnist(letters='ABCDEFGHIJ')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])
Parsing...
found broken img: ./notMNIST_small/A/RGVtb2NyYXRpY2FCb2xkT2xkc3R5bGUgQm9sZC50dGY=.png [it's ok if <10 images are broken]
found broken img: ./notMNIST_small/F/Q3Jvc3NvdmVyIEJvbGRPYmxpcXVlLnR0Zg==.png [it's ok if <10 images are broken]
Done
CPU times: user 9.35 s, sys: 3.24 s, total: 12.6 s
Wall time: 17.1 s

Size of train and test datasets:

X_train.shape, X_test.shape
((14043, 784), (4681, 784))

Verify that the classes are balanced:

np.unique(y_train, return_counts=True)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([1404, 1404, 1405, 1405, 1405, 1404, 1404, 1404, 1404, 1404]))

Visualize some data:

def plot_letters(X, y_true, y_pred=None, n=4, random_state=123):
    np.random.seed(random_state)
    indices = np.random.choice(np.arange(X.shape[0]), size=n*n, replace=False)
    plt.figure(figsize=(10, 10))
    for i in range(n*n):
        plt.subplot(n, n, i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(X[indices[i]].reshape(28, 28), cmap='gray')
        # plt.imshow(train_images[i], cmap=plt.cm.binary)
        if y_pred is None:
            title = chr(ord("A") + y_true[indices[i]])
        else:
            title = f"y={chr(ord('A') + y_true[indices[i]])}, ŷ={chr(ord('A') + y_pred[indices[i]])}"
        plt.title(title, size=20)
    plt.show()

plot_letters(X_train, y_train, random_state=911)
[Figure: a 4×4 grid of randomly chosen training images, each titled with its letter label]

Task 1.2 (2 points)

Apply the k-NN algorithm to the notMNIST dataset and measure its performance:

  • train several models with different hyperparameters (take \(1\leqslant k \leqslant 20\) and different distance metrics (\(p=1\), \(p=2\), \(p=+\infty\)))

  • visualize several test samples and their predictions (see code above)

  • show the confusion matrix on the train and test datasets

  • plot train and test accuracies for each model on the same graph

  • find the model with the best test accuracy
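The values of \(p\) in the list above refer to the Minkowski distance \(\Vert \boldsymbol x - \boldsymbol z\Vert_p\). The sketch below computes it directly for the three requested cases; the helper name `minkowski_distance` and the toy vectors are assumptions made for illustration. In scikit-learn's `KNeighborsClassifier`, finite values are passed through the `p` parameter, while the \(p=+\infty\) case is usually expressed with `metric='chebyshev'`.

```python
import numpy as np

def minkowski_distance(x, z, p):
    """Minkowski distance between vectors x and z for p >= 1 or p = np.inf."""
    diff = np.abs(x - z)
    if np.isinf(p):
        return diff.max()                 # Chebyshev distance: largest coordinate gap
    return (diff ** p).sum() ** (1 / p)   # p=1: Manhattan, p=2: Euclidean

x = np.array([0.0, 0.0])
z = np.array([3.0, 4.0])

for p in (1, 2, np.inf):
    print(p, minkowski_distance(x, z, p))
```

For these two points the three distances differ (7, 5 and 4), which illustrates why the choice of metric can change which neighbours are considered nearest.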

# YOUR CODE HERE

Task 1.3 (2 points)

Apply logistic regression to the notMNIST dataset.

  • train several models with different values of \(C\)

  • visualize several test samples and their predictions

  • show the confusion matrix on both the train and test datasets

  • plot train and test accuracies against \(C\) for each model on the same graph

  • find the model with the best test accuracy
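One possible shape for the sweep over \(C\) is sketched below. It is not a reference solution: it runs on a small synthetic dataset from `make_classification` instead of notMNIST, and the grid `C_grid`, `max_iter=1000` and all variable names are assumptions chosen for illustration. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); replace with the notMNIST arrays
X_toy, y_toy = make_classification(n_samples=600, n_features=20, n_informative=10,
                                   n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.25,
                                      stratify=y_toy, random_state=0)

C_grid = np.logspace(-3, 2, 6)  # hypothetical grid of C values
train_acc, test_acc = [], []
for C in C_grid:
    model = LogisticRegression(C=C, max_iter=1000).fit(Xtr, ytr)
    train_acc.append(model.score(Xtr, ytr))  # mean accuracy on the train split
    test_acc.append(model.score(Xte, yte))   # mean accuracy on the test split

best_C = C_grid[int(np.argmax(test_acc))]
print(best_C, max(test_acc))
```

The lists `train_acc` and `test_acc` can then be plotted against `C_grid`, typically with a logarithmic x-axis.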

# YOUR CODE HERE

Task 1.4 (1 point)

Take the two best models from the previous tasks, k-NN and logistic regression, and show several letters that are

  • classified correctly by both models

  • classified correctly by k-NN but misclassified by logistic regression

  • classified correctly by logistic regression but misclassified by k-NN

  • misclassified by both models

Find the most common class in each category.
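The four categories can be expressed as boolean masks over the test set. The sketch below uses made-up labels and predictions; `y_true`, `pred_knn` and `pred_lr` are hypothetical arrays standing in for the real test labels and model outputs, and "most common class" is interpreted here as the most frequent true label, which is an assumption.

```python
import numpy as np

# Toy labels and predictions (hypothetical values for illustration)
y_true   = np.array([0, 1, 2, 2, 1, 0])
pred_knn = np.array([0, 1, 2, 0, 0, 1])
pred_lr  = np.array([0, 1, 0, 2, 0, 2])

knn_ok = pred_knn == y_true
lr_ok = pred_lr == y_true

categories = {
    "both correct": knn_ok & lr_ok,
    "only k-NN correct": knn_ok & ~lr_ok,
    "only logistic regression correct": ~knn_ok & lr_ok,
    "both wrong": ~knn_ok & ~lr_ok,
}

for name, mask in categories.items():
    # Most frequent true label within the category (ties go to the smaller label)
    classes, counts = np.unique(y_true[mask], return_counts=True)
    most_common = classes[np.argmax(counts)] if mask.any() else None
    print(name, np.flatnonzero(mask), most_common)
```

The index arrays returned by `np.flatnonzero(mask)` can be used to select the images to display for each category.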

# YOUR CODE HERE

Task 1.5 (2 points)

Fetch the California Housing dataset:

from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
X.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
y.head()
0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

  • Split the California Housing dataset into train and test sets

  • Train linear regression, Ridge regression, LASSO and Elastic Net

  • For each model, calculate MSE and \(R^2\)-score on both the train and test datasets, and visualize them using bar plots

  • Print out the coefficients of each model and note whether any of them are equal to \(0\)

  • Find the model with the best test metric
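A possible skeleton for comparing the four models is sketched below. It is not a reference solution: it uses synthetic data from `make_regression` rather than the housing data, and the `alpha` and `l1_ratio` values, the `models` and `results` dictionaries, and the split parameters are untuned assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption)
X_toy, y_toy = make_regression(n_samples=500, n_features=8, n_informative=4,
                               noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# alpha and l1_ratio are illustrative values, not tuned
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    results[name] = {
        "train_mse": mean_squared_error(ytr, model.predict(Xtr)),
        "test_mse": mean_squared_error(yte, model.predict(Xte)),
        "train_r2": r2_score(ytr, model.predict(Xtr)),
        "test_r2": r2_score(yte, model.predict(Xte)),
        "n_zero_coef": int((model.coef_ == 0).sum()),  # coefficients exactly equal to 0
    }

for name, row in results.items():
    print(name, row)
```

The `results` dictionary can be turned into a `pandas.DataFrame` for bar plots, and `model.coef_` gives the fitted coefficients of each model.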

# YOUR CODE HERE