
DT4031

Lab - 1
Applied machine learning
Basel Balta

Arya Rabani

Ammar Mesleh
Introduction and Motivation
Lab-Questions
Method
Solution
Discussion and Conclusion
Appendix: The Pipeline Code

Introduction and Motivation

The aim of this lab is to finalize an end-to-end machine learning development pipeline and to use
supervised learning to automatically recognize hand-written digits. The dataset used is the MNIST
handwritten digit database, which contains 60,000 training images of digits ranging from 0 to 9. The lab
also involves completing the missing code and resolving the "TODO" messages found in the Jupyter
Notebook that accompanies this lab document:

Lab-Questions
● TODO 1:

I've been working on this code for digit classification; could you please tell me which of the two
provided classifiers is the best?

● TODO 2:

I'm trying to visualize the input representation in two dimensions. Run the code to see the figure.
How would you interpret the scatter plot? (in relation to the course video lectures)

● TODO 3:

I have another idea for feature extraction: how about we just compute the mean (and standard
deviation) of the intensity in the X and Y directions, so that we have four additional feature
vectors (two for the mean and two for the std), each of length 28? If you write a function able to
compute such features and use them in training together with the HOG features, would it lead to
better generalization (improved accuracy)? Don't forget to normalize these new feature vectors as
well.

● TODO 4:

Which digit(s) seem to be the most difficult to predict using the k-NN classifier? Could you use
the confusion matrix function "plot_confusion_matrix" to visualize the results, draw conclusions,
and present them to me? (you might have to read up on the literature about confusion matrices)

Method

Unfinished Python code is provided, and the lab's objective is to finish the code and understand the
end-to-end machine-learning process. The laboratory exercise was carried out by interpreting the code,
visualizing the outcomes, and comparing the performance of two distinct classification algorithms. The
laboratory activities were performed on a sample of 6,000 images from the dataset of 60,000 images,
since the code could not process all the images without crashing. The code includes HOG (Histogram of
Oriented Gradients) feature extraction, which represents the visual structure of an image by dividing it
into cells and computing histograms of gradient orientations in each cell, describing the edges and
shapes of objects to aid classification. A minimal sketch of such an extraction is shown below.
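
As a rough illustration for a single 28x28 image, this sketch mirrors the hog() parameters used in the
appendix pipeline; the random image is only a placeholder for a real MNIST digit:

# Sketch: HOG features for one 28x28 image, with the same parameters
# as the appendix pipeline (which flattens hog_image into its feature
# vector); the random image stands in for a real MNIST digit.
import numpy as np
from skimage.feature import hog

image = np.random.rand(28, 28)
fd, hog_image = hog(image, orientations=8, pixels_per_cell=(4, 4),
                    cells_per_block=(1, 1), visualize=True)
print(fd.shape, hog_image.shape)  # (392,) descriptor, (28, 28) visualization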

Solution

Here are the answers to the questions listed in the Lab-Questions section of this report:

1. The two methods provided in the lab file are k-NN (k-Nearest Neighbors) and the Linear Support
Vector algorithm. Each of these methods has its own pros and cons depending on the application,
but in the case of this lab the program shows that the k-NN method achieves an accuracy of
93.42% (train) and 88.08% (test), while the Linear Support Vector method achieves 95.75% (train)
and 91.33% (test). In this case, the Linear Support Vector method is therefore the better
classifier for this application; a sketch of the comparison follows below.
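
As a rough sketch of this comparison (assuming the feature matrices X_train_normalized and
X_test_normalized and the label vectors y_train and y_test produced by the appendix pipeline), fitting
and scoring both models comes down to:

# Sketch: fit and score both classifiers on the same prepared features.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

for clf in (KNeighborsClassifier(n_neighbors=7),
            LinearSVC(random_state=0, tol=1e-5, max_iter=3000)):
    clf.fit(X_train_normalized, y_train)
    print(type(clf).__name__,
          "train: %.2f%%" % (100 * clf.score(X_train_normalized, y_train)),
          "test: %.2f%%" % (100 * clf.score(X_test_normalized, y_test)))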

2. A sample of 1,000 images had their HOG features reduced to two dimensions through PCA, which
allowed the data to be visualized on a 2D plot. It can be observed from the plot that the zeroes
are mainly grouped on the top and right side, with sixes and nines nearby; ones and sevens are
primarily clustered in the bottom-left quarter; and fives and threes are frequently located near
each other. This demonstrates that digits with similar structures, such as circles, lines, or
triangles, tend to have similar HOG features and can be difficult for the classification algorithm
to separate. It is essential to keep in mind that the data has been dramatically reduced from 196
dimensions to just two, meaning that more digits might be distinguishable if plotted in three
dimensions. The core of the visualization is sketched below.
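
A minimal sketch of that projection (X_train and y_train as produced by the appendix pipeline, which
additionally annotates each point with its digit label):

# Sketch: project the HOG features of 1,000 examples to 2D with PCA
# and scatter-plot them coloured by digit label.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(X_train[0:1000])
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train[:1000], alpha=0.15)
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.show()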

3. In order to improve the accuracy of the classifiers, the mean and the standard deviation of the
pixel intensities were extracted, normalized, and appended to the feature set. After this
implementation we can clearly see that there is almost no improvement in the k-NN model, while the
accuracy of the Linear Support Vector method was clearly improved, to 98.15% and 93.95% (train and
test, respectively). This suggests that adding more features in this way mainly changes the results
of the Linear Support Vector classifier. The feature computation is sketched below.
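
A minimal sketch of the feature computation (here `images` is a hypothetical (N, 28, 28) array of
digits and `hog_features` stands for the normalized HOG matrix; the appendix uses longer variable names
for the same steps):

# Sketch: per-image mean and std of pixel intensity along each axis,
# four length-28 vectors per image, normalized and stacked onto the
# HOG features.
import numpy as np
from sklearn.preprocessing import normalize

mean_x, mean_y = normalize(images.mean(axis=1)), normalize(images.mean(axis=2))
std_x, std_y = normalize(images.std(axis=1)), normalize(images.std(axis=2))
features = np.column_stack((hog_features, mean_x, std_x, mean_y, std_y))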

4. The confusion matrix allows the observer to see the accuracy for each digit. This matrix was
plotted from the output of the k-NN classifier and indicates that the digits 0 and 1 are the simplest
to categorize, while 2 and 5 are the most difficult. The plotting code is sketched after the figures
below.

Figure: Confusion matrix without normalization (left) and normalized confusion matrix (right).
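
The plots above can be produced with a few lines, sketched here for the fitted k-NN classifier
(clf_neigh, X_train_normalized and y_train as in the appendix):

# Sketch: non-normalized and normalized confusion matrices for k-NN.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

for title, norm in [("Without normalization", None), ("Normalized", "true")]:
    disp = ConfusionMatrixDisplay.from_estimator(
        clf_neigh, X_train_normalized, y_train, normalize=norm)
    disp.ax_.set_title(title)
plt.show()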

Discussion and Conclusion

The lab experiment highlighted the steps in creating a classification algorithm using machine learning
(a pipeline implemented in Python). The results indicated that the Linear Support Vector algorithm was
more effective than the k-NN model, whether or not intensity features were added to the HOG features.
The visualization of the HOG data in 2D gave an in-depth look at the inner workings of the classifiers,
and adding intensity information improved the accuracy of the Linear Support Vector classifier in
particular. The k-NN classifier was simple but still powerful; however, it did not reach an accuracy of
95% or higher on most digits. We should also bear in mind that the sampling used 6,000 of the 60,000
pictures, since the computer hardware was not able to process all 60,000 images. To sum up, the lab
experiment provided a valuable understanding of machine learning and classification algorithms.

Appendix: The Pipeline Code


"# Perform necessary imports for the analysis tasks\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import plot_confusion_matrix\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.svm import LinearSVC\n",
"import matplotlib.pyplot as plt\n",
"from skimage.feature import hog\n",
"from random import randrange\n",
"import scipy.stats as stats\n",
"import os.path as ospath\n",
"import pandas as pd\n",
"import numpy as np\n",
"import requests\n",
"import shutil\n",
"import gzip\n",
"import os"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "narrow-lighting",
"metadata": {},
"outputs": [],
"source": [
"# Use dark mode of the matplotlib\n",
"plt.style.use('dark_background')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lyric-inventory",
"metadata": {},
"outputs": [],
"source": [
"# Download the datasets\n",
"dataset_files = ['http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',\n",
" 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',\n",
" 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',\n",
" 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz']\n",
" \n",

6
"for file_url in dataset_files:\n",
" # Download dataset files one by one\n",
" if not ospath.isfile(file_url.split(\"/\")[-1]):\n",
" print(\"Downloading %s...\" % file_url.split(\"/\")[-1])\n",
" r = requests.get(file_url, allow_redirects=True)\n",
" open(file_url.split(\"/\")[-1], 'wb').write(r.content)\n",
"# close(file_url.split(\"/\")[-1])\n",
" else:\n",
" print(\"Skipping file %s, already downloaded\" % file_url.split(\"/\")[-1])\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "encouraging-suspect",
"metadata": {},
"outputs": [],
"source": [
"# Function for reading input images\n",
"def read_idx3_ubyte_data(filename: str, image_size: int = 28):\n",
" \n",
"\n",
" f = gzip.open(filename,'r')\n",
" \n",
" f.seek(0, os.SEEK_END)\n",
" num_images = int(f.tell()/(image_size*image_size))\n",
" f.seek(0, 0)\n",
" print(\"Reading %d number of images\" % num_images)\n",
" \n",
" f.read(16)\n",
" buf = f.read(image_size * image_size * num_images)\n",
" data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)\n",
" data = data.reshape(num_images, image_size, image_size, 1)\n",
" \n",
" return data\n",
"\n",
"# Function for reading labels\n",
"def read_labels_data(filename: str):\n",
" labels = []\n",
" f = gzip.open(filename,'r')\n",
" \n",
" f.seek(0, os.SEEK_END)\n",
" num_labels = f.tell()-8\n",
" f.seek(0, 0)\n",
" print(\"Reading %d number of labels\" % num_labels)\n",
" f.read(8)\n",
" for i in range(0, num_labels):\n",
" buf = f.read(1)\n",

7
" label = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)\n",
" labels.append(label)\n",
" \n",
" return labels\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "soviet-residence",
"metadata": {},
"outputs": [],
"source": [
"# read data (both images and correspondig labels)\n",
"data_train = read_idx3_ubyte_data('train-images-idx3-ubyte.gz')\n",
"data_test = read_idx3_ubyte_data('t10k-images-idx3-ubyte.gz')\n",
"labels_train = read_labels_data('train-labels-idx1-ubyte.gz')\n",
"labels_test = read_labels_data('t10k-labels-idx1-ubyte.gz')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "solar-academy",
"metadata": {},
"outputs": [],
"source": [
"def plot_single_digit(digit_data): \n",
" image = np.asarray(digit_data).squeeze()\n",
" plt.imshow(image, cmap='gray')\n",
" plt.colorbar()\n",
" plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "palestinian-machinery",
"metadata": {},
"outputs": [],
"source": [
"# plot random digit in the training data set\n",
"plot_single_digit(data_train[randrange(60000)])"
]
},
{
"cell_type": "code",
"execution_count": null,

8
"id": "turned-shannon",
"metadata": {},
"outputs": [],
"source": [
"# Function for extracting HOG features from image data\n",
"def extract_HOG_features_from_image_data(input_data):\n",
" df_features = pd.DataFrame()\n",
" data_to_be_transformed = input_data\n",
" for image_index in range(data_to_be_transformed.shape[0]):\n",
" fd, hog_image = hog(data_to_be_transformed[image_index], orientations=8,
pixels_per_cell=(4, 4),\n",
" cells_per_block=(1, 1), visualize=True, channel_axis=-1) #
channel_axis=-1\n",
"\n",
" if df_features.shape[0] > 0:\n",
" df_features = pd.concat([df_features, pd.DataFrame(hog_image.ravel().T)],
ignore_index=True, axis=1)\n",
" else:\n",
" df_features = pd.DataFrame(hog_image.ravel().T)\n",
" \n",
" if (image_index > 0) and (image_index%20000 < 1):\n",
" print(\"10000 images processed...\")\n",
"\n",
" return pd.DataFrame(df_features.T)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "353a85da-d7db-4fb5-8ada-581f2985baf5",
"metadata": {},
"outputs": [],
"source": [
"data_train.shape, np.shape(labels_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73795732-51d1-423f-996d-645241e7a791",
"metadata": {},
"outputs": [],
"source": [
"print(data_train[:50, :, :, :].shape)\n",
"X_train = extract_HOG_features_from_image_data(data_train[:50, :, :, :])\n",
"print(X_train.shape)"
]
},

9
{
"cell_type": "code",
"execution_count": null,
"id": "unexpected-bridge",
"metadata": {},
"outputs": [],
"source": [
"# Extract (HOG) features from the image data (grab a coffee! This might take a while if
you are not on a powerful machine) TODO: Save these as new datafiles? Yes!\n",
"\n",
"X_train = extract_HOG_features_from_image_data(data_train[:6000, :, :, :])\n",
"X_test = extract_HOG_features_from_image_data(data_test[:6000, :, :, :])\n",
"y_train = np.array(labels_train[:6000], dtype='i').ravel()\n",
"y_test = np.array(labels_test[:6000], dtype='i').ravel()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8fb7e1ca-7606-4790-995a-c2460e43d55f",
"metadata": {},
"outputs": [],
"source": [
"X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89561582-41ab-49ed-968f-4f23bc02c4d7",
"metadata": {},
"outputs": [],
"source": [
"for i in X_train[0].values:\n",
" print(i, end=\", \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1fbc19d-e4f9-4ea5-8b77-52a387e26fa8",
"metadata": {},
"outputs": [],
"source": [
"y_train[0]"
]
},
{

10
"cell_type": "code",
"execution_count": null,
"id": "unexpected-apache",
"metadata": {},
"outputs": [],
"source": [
"# Visualize (1000 examples of) the extracted features in 2D (reduce the dimensionality
of\n",
"# the HOG features to 2 by utlizing a method called PCA, will be explained in Module
II)\n",
"\n",
"X_for_PCA_viz = X_train[0:1000]\n",
"y_for_PCA_viz = y_train[0:1000]\n",
"\n",
"pca = PCA(n_components=2)\n",
"X_r = pca.fit(X_for_PCA_viz).transform(X_for_PCA_viz)\n",
"plt.figure(figsize=(20, 20))\n",
"plt.scatter(X_r[:,0], X_r[:,1], c=y_for_PCA_viz, alpha=0.15)\n",
"ax = plt.gca()\n",
"\n",
"for i, the_label in enumerate(y_for_PCA_viz):\n",
" ax.annotate(str(the_label), (X_r[i, 0], X_r[i, 1]))\n",
"\n",
"plt.xlabel(\"PCA component 1\")\n",
"plt.ylabel(\"PCA component 2\")\n",
"plt.title(\"PCA visualization of MNIST training data set (HOG features)\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "false-pursuit",
"metadata": {},
"outputs": [],
"source": [
"def remove_uninformative_features(df_input):\n",
" # if a feature contains only one unique value (it is not informative) then remove it!\n",
" cols = df_input.select_dtypes([np.number]).columns\n",
" std = df_input[cols].std()\n",
" cols_to_drop = std[std==0].index\n",
" df_input = df_input.drop(cols_to_drop, axis=1)\n",
" return df_input, cols_to_drop\n",
"\n",
"\n",
"def normalize_each_feature(df_input):\n",
" return stats.zscore(df_input)"
]

11
},
{
"cell_type": "code",
"execution_count": null,
"id": "amended-start",
"metadata": {},
"outputs": [],
"source": [
"# check to see if we have uninformative features, if so remove them from the training
data set (and drop them from the test set also later)\n",
"X_train_with_columns_dropped, columns_that_was_dropped =
remove_uninformative_features(X_train)\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"X_train_normalized = normalize_each_feature(X_train_with_columns_dropped)\n",
"\n",
"\n",
"X_test_normalized = normalize_each_feature(X_test.drop(columns_that_was_dropped,
axis=1))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74f2e05a-fce1-4f0c-8838-fd6538915fde",
"metadata": {},
"outputs": [],
"source": [
"X_train_with_columns_dropped.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "described-genetics",
"metadata": {},
"outputs": [],
"source": [
"# Train a model using k-NN (by use of the scikit-learn module)\n",
"\n",
"clf_neigh = KNeighborsClassifier(n_neighbors=7)\n",
"clf_neigh.fit(X_train_normalized, y_train)\n",
"print(\"Accuracy of the k-NN model (train) is %2.2f percent \" %
float(clf_neigh.score(X_train_normalized, y_train)*100.0))\n",
"print(\"Accuracy of the k-NN model (test) is %2.2f percent\" %

12
float(clf_neigh.score(X_test_normalized, y_test)*100.0))\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "entertaining-benchmark",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Train a model using a Linear Support Vector algorithm (by use of the scikit-learn
module)\n",
"\n",
"clf_SVC = LinearSVC(random_state=0, tol=1e-5, max_iter=3000)\n",
"clf_SVC.fit(X_train_normalized, y_train)\n",
"print(\"Accuracy of the SVC model (train) is %2.2f percent \" %
float(clf_SVC.score(X_train_normalized, y_train)*100.0))\n",
"print(\"Accuracy of the SVC model (test) is %2.2f percent\" %
float(clf_SVC.score(X_test_normalized, y_test)*100.0))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca2ba0fb",
"metadata": {},
"outputs": [],
"source": [
"train_intensity = np.asarray(data_train[:6000, :, :, :]).squeeze();\n",
"mean_X_train_intensity_x_axis = train_intensity.mean(axis=1)\n",
"mean_X_train_intensity_y_axis = train_intensity.mean(axis=2)\n",
"std_X_train_intensity_x_axis = train_intensity.std(axis=1)\n",
"std_X_train_intensity_y_axis = train_intensity.std(axis=2)\n",
"mean_X_train_intensity_x_axis.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cda3885",
"metadata": {},
"outputs": [],
"source": [
"#Normalize the new vectors for the training set\n",
"from sklearn.preprocessing import normalize\n",

13
"mean_X_train_intensity_x_axis_normalized =
normalize(mean_X_train_intensity_x_axis)\n",
"mean_X_train_intensity_y_axis_normalized =
normalize(mean_X_train_intensity_y_axis)\n",
"std_X_train_intensity_x_axis_normalized = normalize(std_X_train_intensity_x_axis)\n",
"std_X_train_intensity_y_axis_normalized = normalize(std_X_train_intensity_y_axis)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e80ee53",
"metadata": {},
"outputs": [],
"source": [
"#Extract intensity data along x- and y- axes for test data\n",
"test_intensity = np.asarray(data_test[:6000, :, :, :]).squeeze();\n",
"mean_X_test_intensity_x_axis = test_intensity.mean(axis=1)\n",
"mean_X_test_intensity_y_axis = test_intensity.mean(axis=2)\n",
"std_X_test_intensity_x_axis = test_intensity.std(axis=1)\n",
"std_X_test_intensity_y_axis = test_intensity.std(axis=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3ad5a401",
"metadata": {},
"outputs": [],
"source": [
"#Normalize the new vectors for the test set\n",
"mean_X_test_intensity_x_axis_normalized = normalize(mean_X_test_intensity_x_axis)\n",
"mean_X_test_intensity_y_axis_normalized = normalize(mean_X_test_intensity_y_axis)\n",
"\n",
"std_X_test_intensity_x_axis_normalized = normalize(std_X_test_intensity_x_axis)\n",
"std_X_test_intensity_y_axis_normalized = normalize(std_X_test_intensity_y_axis)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9f1b2fa",
"metadata": {},
"outputs": [],
"source": [
"#Define the features that are to be used in the classifiers\n",
"features_train =
np.column_stack((X_train_normalized,mean_X_train_intensity_x_axis_normalized,std_X_train

14
_intensity_x_axis_normalized,mean_X_train_intensity_y_axis_normalized,std_X_train_intensit
y_y_axis_normalized))\n",
"features_test =
np.column_stack((X_test_normalized,mean_X_test_intensity_x_axis_normalized,std_X_test_in
tensity_x_axis_normalized,mean_X_test_intensity_y_axis_normalized,std_X_test_intensity_y_
axis_normalized))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08b0839c",
"metadata": {},
"outputs": [],
"source": [
"#Use the KNN Classifier and display the accuracy\n",
"new_clf_neigh = KNeighborsClassifier(n_neighbors=7)\n",
"new_clf_neigh.fit(features_train, y_train)\n",
"print(\"Accuracy of the k-NN model (train with intensity data) is %2.2f percent \" %
float(new_clf_neigh.score(features_train, y_train)*100.0))\n",
"print(\"Accuracy of the k-NN model (train with intensity data) is %2.2f percent \" %
float(new_clf_neigh.score(features_test, y_test)*100.0))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81795f75",
"metadata": {},
"outputs": [],
"source": [
"#Use the Linear SVC Classifier and display the accuracy\n",
"new_clf_SVC = LinearSVC(random_state=0, tol=1e-5, max_iter=3000)\n",
"new_clf_SVC.fit(features_train, y_train)\n",
"print(\"Accuracy of the Linear SVC model (train with intensity data) is %2.2f percent \" %
float(new_clf_SVC.score(features_train, y_train)*100.0))\n",
"print(\"Accuracy of the Linear SVC model (test with intensity data) is %2.2f percent\" %
float(new_clf_SVC.score(features_test, y_test)*100.0))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "083083b8",
"metadata": {},
"outputs": [],
"source": [
"#Code Source:\n",

15
"#ttps://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
\n",
"from sklearn.metrics import ConfusionMatrixDisplay\n",
"# Plot non-normalized confusion matrix\n",
"titles_options = [(\"Confusion matrix, without normalization\", None),(\"Normalized
confusion matrix\", \"true\"),]\n",
"for title, normalize in titles_options:\n",
" disp = ConfusionMatrixDisplay.from_estimator(\n",
" clf_neigh,\n",
" X_train_normalized,\n",
" y_train,\n",
" cmap=plt.cm.Reds,\n",
" normalize=normalize,\n",
" )\n",
"\n",
" disp.ax_.set_title(title)\n",
" print(title)\n",
" print(disp.confusion_matrix)\n",
"plt.show()"
