
Report Practical Pattern Recognition

Marin Beijerbacht 9884270


Korijn Moor 0786853
Dimitris Christodoulou 5141761
December 4, 2022

1 Introduction
In this paper the MNIST dataset of handwritten digits is explored and analyzed. Various feature extraction techniques are employed to analyze the dataset and the features it contains, and new features are extracted from the raw data. By testing these self-extracted features, insight is gained into what information a model learns. Subsequently, various machine learning algorithms (Support Vector Machines, multinomial logit models and neural networks) are trained on the original pixel values. By running the different models with different parameters, knowledge is gained about how the models function. All of this is implemented in Python using the machine learning library sklearn; the function names referred to in the rest of the paper come from this library.

The following sections detail the methods used for data exploration, feature extraction, and model training and evaluation. After that, the results of the evaluations are analyzed, followed by a conclusion.

2 Description of the Data


The dataset used is the MNIST (Modified National Institute of Standards and Technology
database) dataset. It consists of 42000 handwritten digits from zero to nine, in the form
of 28 × 28 pixel images. This data is in the form of an array, where each row comprises
one digit. Each array value contains the greyscale value of the corresponding pixel, with
maximum value 255. The array contains an extra column for the digit label, meaning the
correct digit associated with the image data of the corresponding row.
Before proceeding with the experiment set-ups, a few terms need to be defined:

• Row i - The i-th row of the dataset array, beginning from the top of the array.

• Column j - The j-th column of the dataset array, beginning from the left side of the
array.

• Pixel(i, j) - The pixel value in row i and column j of the dataset array.

The number of instances per digit in the dataset and their distribution over the
total can be seen in the table below.

Label    Number of Digits    Percentage of total


0 4132 9.83%
1 4684 11.15%
2 4177 9.94%
3 4351 10.35%
4 4072 9.69%
5 3795 9.03%
6 4137 9.85%
7 4401 10.47%
8 4063 9.67%
9 4188 9.97%

This shows that the majority class is the digit 1. If a classifier predicted this majority
class for every instance, 11.15% of the total dataset would be classified correctly.
Some pixels have a zero value for all digits in the dataset. These pixels are essentially
useless, since they do not convey any information that can be used to distinguish between
the different labels. The pixels in question are shown in Figure 1 below.

Figure 1: The yellow pixels are zero for all 42000 digits in the dataset, and thus convey no
useful information for classification.
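For illustration, a minimal sketch of how these uninformative pixels can be identified; the variable name X is an assumption, standing for the pixel data loaded as a (42000, 784) NumPy array with the label column removed:

import numpy as np

# Assumption: X is the (42000, 784) array of raw pixel values, labels excluded.
# A pixel conveys no information if it is zero for every digit in the dataset.
zero_pixels = np.all(X == 0, axis=0)      # boolean mask, one entry per pixel
print(f"{zero_pixels.sum()} pixels are zero for all instances")

# Reshaping the mask to 28 x 28 shows which image regions are always blank,
# as visualized in Figure 1.
mask_image = zero_pixels.reshape(28, 28)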

3 Experimental Setup
3.1 Feature Extraction Experiments
For these experiments two new features are extracted from the raw pixel data. The first
is a so-called ink feature, which specifies how much "ink" a digit costs. The second feature
is called half ink; it is similar to ink, but instead of using the whole digit it uses only
half of the digit. These features and the corresponding experiments are described in more
detail in the following parts of the report.

3.1.1 Ink Feature


The ink feature is calculated by summing the values of all pixels in a digit image. For each
digit from 0 to 9 the feature was computed for every instance, and the means and standard
deviations of the summed features were calculated per digit. After obtaining the "ink"
values, they are preprocessed by scaling them to zero mean and unit standard deviation.
After scaling, the values are reshaped to an array of single-feature rows. Subsequently a
multinomial logit model was fitted on the instances of this ink feature using the function
LogisticRegression from the linear_model library. The model was trained on the full dataset
with default parameters. Evaluation of this model is also done over the entire dataset and
presented as a confusion matrix.
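A minimal sketch of this procedure; X and y are assumed names for the (42000, 784) pixel array and the label vector, and the report's exact code is not shown:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import scale

# The ink feature: sum of all pixel values per digit image.
ink = X.sum(axis=1)

# Per-class means and standard deviations of the ink feature.
for digit in range(10):
    values = ink[y == digit]
    print(digit, values.mean(), values.std())

# Scale to zero mean and unit standard deviation, then reshape to a
# single-feature column as expected by the sklearn estimator.
ink_scaled = scale(ink).reshape(-1, 1)

# Fit the multinomial logit model with default parameters on the full
# dataset, then evaluate it on that same dataset.
logit = LogisticRegression().fit(ink_scaled, y)
cm = confusion_matrix(y, logit.predict(ink_scaled))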

3.1.2 Half-Ink Feature


The second feature extracted is called half ink. It is quite similar to the ink feature;
however, instead of summing over the entire pixel data, only the top half of the pixel data is
used: starting from the top-left pixel and ending at the rightmost pixel halfway down, the
values are summed. More specifically, for a digit the half-ink feature is the sum of the pixel
values from Pixel(0, 0) up to and including Pixel(13, 27), i.e. the top 14 rows of the image.
This is represented mathematically as follows:

    half ink = \sum_{i=0}^{13} \sum_{j=0}^{27} Pixel(i, j)    (1)

The ink feature is essentially a measure of how 'large' digits are relative to one another;
however, it completely discards the shape of a digit. The half-ink feature may distinguish
better between digits, as it relies on their vertical asymmetry. This way the actual shape
of a digit hypothetically has a bigger impact on the amount of "ink" used. Some digits,
like the 6, have more "ink" in the bottom half of the image, whilst others, like the 7, have
more in the top half, even though overall they use a comparable total amount of "ink". By
dividing the image in half horizontally and taking only the top half, these imbalances can
be captured, thus discriminating between the digits.
To perform the experiment the same approach was taken as with the ink feature. For
each digit group the mean and standard deviation were calculated, and the data was scaled
and reshaped to the shape expected by the model. Then a multinomial logit model was
fitted in the same way as for the ink feature, i.e. using sklearn's LogisticRegression with
default parameters.
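Continuing the sketch above, the half-ink feature can be computed by summing only the top 14 rows of each image, assuming X holds the flattened 28 × 28 images in row-major order:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

# Reshape each flattened instance back to a 28 x 28 image and sum the
# pixel values of the top half (rows 0 through 13, all columns).
images = X.reshape(-1, 28, 28)
half_ink = images[:, :14, :].sum(axis=(1, 2))

# Same preprocessing and model fit as for the ink feature.
half_ink_scaled = scale(half_ink).reshape(-1, 1)
logit_half = LogisticRegression().fit(half_ink_scaled, y)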

3.1.3 Both Features Together


The two features describe related but different properties of handwritten digits. To test
how they work together, the following experiment was performed.
First both the ink and half-ink features were scaled to zero mean and unit standard
deviation. Then both features were stacked together, such that each instance in the training
data consisted of a vector of two features: its "ink" value and its "half ink" value. The
input matrix X hence has the form of a 42000 × 2 matrix. This input is then used to train
a multinomial logit model, again as implemented by LogisticRegression with default
parameters.
The evaluation is done on the same input matrix X; hence for this experiment the train
and test sets are also the same.
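A sketch of the combined experiment, reusing the scaled feature columns from the previous sketches (ink_scaled and half_ink_scaled are assumed to be defined as above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Stack the two scaled single-feature columns into a 42000 x 2 input matrix.
X_both = np.column_stack((ink_scaled, half_ink_scaled))

# Train on the full dataset and evaluate on that same data, so the train
# and test sets coincide.
logit_both = LogisticRegression().fit(X_both, y)
cm_both = confusion_matrix(y, logit_both.predict(X_both))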

3.2 Model Selection Experiments


For the following experiments the pixel values themselves are used as features, so each
instance initially has 784 features. To reduce runtime, feature reduction is applied: each
image is resized from 28 × 28 to 14 × 14 pixels, leaving 196 features per instance.
Subsequently the dataset is scaled with StandardScaler, so that features are standardized by
removing the mean and scaling to unit variance. The dataset is also split into separate train
and test sets. This split is done with the train_test_split wrapper function, with an
integer passed to the random_state parameter for reproducible results across runs. The
training set consists of 5000 instances whilst the remaining 37000 go to the test set.
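A sketch of this preprocessing pipeline; the report does not state how the images were resized, so averaging non-overlapping 2 × 2 pixel blocks is an assumption here:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Resize each 28 x 28 image to 14 x 14 by averaging non-overlapping
# 2 x 2 pixel blocks (the exact resizing method is an assumption).
images = X.reshape(-1, 28, 28)
X_small = images.reshape(-1, 14, 2, 14, 2).mean(axis=(2, 4)).reshape(-1, 196)

# Standardize: remove the mean and scale to unit variance per feature.
X_scaled = StandardScaler().fit_transform(X_small)

# Fixed-size split: 5000 training instances, 37000 test instances.
# An integer random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, train_size=5000, random_state=0)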

3.2.1 Regularized Multinomial Logit Model Classifier


The first classifier tested was the Logit (multinomial logistic regression) model, for which
the sklearn.linear_model.LogisticRegression function was used. This function has
multiple parameters that influence the final predictions. For the penalty parameter a
regularization technique is chosen: the LASSO or l1 penalty is used for this model. The
other parameters taken into account were:

• tol - The stopping tolerance the model takes into account: the model stops searching
for an optimum once the specified tolerance is reached. This hyperparameter is
important to tune, since if it is too large the algorithm will stop before it converges.
This is why this hyperparameter is mainly explored with small values.

• C - Complexity hyperparameter; it is the inverse of the regularization strength. It is an
important hyperparameter since it disincentivizes overfitting by penalizing large
parameter values. Here the finetuning was performed with powers of 10.

• max_iter - The maximum number of iterations allowed for the solvers to converge.
As the number of iterations increases, the precision with which logistic regression
fits the data grows, which in turn can cause overfitting. By experimenting with
different maximum iteration values an optimal value that does not cause overfitting
may be found. The values experimented upon range from 1 to 10000 in powers of 10,
so that a clear separation between the different iteration values may be obtained.

• solver - A solver tries to find the parameter weights that minimize a cost function.
Since the LASSO penalty is used in this experiment, the only compatible solvers are
saga and liblinear.

The values producing the smallest classification error were searched for using
GridSearchCV. Here, for all given parameter values, all possible combinations are tested
to see which combination yields the smallest error. This method uses cross-validation for
training and evaluating. The hyperparameter values fed into the grid search were:

Table 1: The parameters and values searched in the grid search


tol        0.000001  0.00001  0.0001  0.001  0.01  0.1
C          0.001  0.01  0.1  1  10  100  1000
max_iter   1  10  100  1000  10000
solver     saga  liblinear

The best combination of hyperparameters found for the Logit classifier through the
grid search was C = 0.1, max_iter = 1000, solver = 'saga' and tol = 1e-05.
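A sketch of this grid search over the values in Table 1; the cross-validation settings are left at sklearn's defaults here, which is an assumption about the report's actual setup:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "tol": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "max_iter": [1, 10, 100, 1000, 10000],
    "solver": ["saga", "liblinear"],   # the solvers supporting the l1 penalty
}

# Exhaustive search with cross-validation; every combination in the grid
# is fitted and scored, and the best-scoring one is kept.
search = GridSearchCV(LogisticRegression(penalty="l1"), param_grid)
search.fit(X_train, y_train)
print(search.best_params_)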

3.2.2 Support Vector Machines Classifier


To fit a model using a support vector machine classifier the svm.SVC() function was used.
Again, this model has various hyperparameters that were tuned using grid search. The ones
taken into account for this classifier were the same as for the Logit classifier, with one
exception: instead of the solver, the kernel is considered here. A kernel takes the given data
as input and transforms it into a different form, depending on the kernel that is given to
the classifier. In table 2 the set of values that the grid search worked on can be found.

Table 2: The parameters and values searched in the support vector machines classifier
tol        0.000001  0.00001  0.0001  0.001  0.01  0.1
C          0.001  0.01  0.1  1  10  100  1000
max_iter   1  10  100  1000  10000  -1
kernel     linear  poly  rbf  sigmoid

The best set of hyperparameters found for the SVM classifier through the grid search
was C = 100, kernel = 'poly', max_iter = 1000 and tol = 0.1.
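The corresponding sketch for the SVM, swapping the solver parameter for the kernel as described (again with default cross-validation settings as an assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "tol": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "max_iter": [1, 10, 100, 1000, 10000, -1],   # -1 removes the iteration limit
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
}

search = GridSearchCV(SVC(), param_grid)
search.fit(X_train, y_train)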

3.2.3 Feed-Forward Neural Network Classifier


The final model tested is a feed-forward neural network, also called a multi-layer perceptron,
which is implemented using the function neural_network.MLPClassifier. For reproducible
results within a session, random_state was set to 0. The other parameters taken into account
were:

• hidden_layer_sizes - The number of neurons in the hidden layer. More neurons
increase the complexity of the model. This parameter is tuned so that the model is
complex enough to converge, but not overly complex.
• activation - The activation function used decides when a neuron should be activated.
The different activation functions give different thresholds.
• solver - A solver tries to find the parameter weights that minimize a cost function.
• learning_rate - How much the model is changed in response to the estimated error.
The value of this parameter is very important to make sure the model does not overfit.
• tol - The stopping tolerance the model takes into account: the model stops searching
for an optimum once the specified tolerance is reached.
• max_iter - The maximum number of iterations allowed for the solvers to converge.
• alpha - Similar to the complexity hyperparameter C, it dictates how strongly L2
(ridge) regularization is applied.

The values tested for the hyperparameters are in the table below:

Table 3: The parameters and values searched in the feed-forward neural network
tol                 0.000001  0.00001  0.0001  0.001  0.01  0.1
alpha               0.1  0.01  0.001  0.0001  0.00001  0.000001
max_iter            100  500  1000  10000  100000
hidden_layer_sizes  10  100  200  500  1000
activation          identity  logistic  tanh  relu
solver              lbfgs  sgd  adam
learning_rate       constant  invscaling  adaptive

The best set of hyperparameters found in the grid search was random_state = 0,
hidden_layer_sizes = 500, activation = 'relu', learning_rate = 'constant',
solver = 'adam', alpha = 0.1, max_iter = 100 and tol = 0.0001. This combination of
hyperparameter values gives an estimated accuracy of 94%.
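A sketch of this grid search; note that this grid is very large (over 30000 combinations), so in practice it may be reduced or run in parallel, e.g. via the n_jobs parameter:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "tol": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "alpha": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
    "max_iter": [100, 500, 1000, 10000, 100000],
    "hidden_layer_sizes": [(10,), (100,), (200,), (500,), (1000,)],
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd", "adam"],
    "learning_rate": ["constant", "invscaling", "adaptive"],
}

# random_state is fixed for reproducible weight initialization within a session.
search = GridSearchCV(MLPClassifier(random_state=0), param_grid, n_jobs=-1)
search.fit(X_train, y_train)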

3.2.4 Accuracy Analysis
Once the Logit, SVM and feed-forward multi-layer perceptron models are trained, they
will also be compared with each other to determine whether one accuracy is significantly
better or worse than the others. This will be done using a statistical test.
Specifically, the statistical method used is McNemar's test, since it translates well to the
accuracies and error rates of machine learning models. First, the contingency table for each
pair of models is calculated. For each pair of classifiers C1 and C2 the contingency
table indicates:

• The number of instances classified correctly by both C1 and C2

• The number of instances classified wrongly by both C1 and C2

• The number of instances classified correctly by C1 but wrongly by C2

• The number of instances classified correctly by C2 but wrongly by C1

After each contingency table is calculated, McNemar's test is performed for each pair and
the statistic and p-value for each pair are recorded. The null hypothesis H0 under which
the test is performed is that the two models in a pair have the same accuracy, i.e. that
their error rates do not differ significantly.
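A sketch of how such a pairwise comparison might be computed; the use of statsmodels, as well as the names pred_a, pred_b and y_test, are assumptions, since the report does not name its implementation:

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# pred_a and pred_b are two classifiers' predictions on the same test set,
# y_test the true labels.
a_ok = pred_a == y_test
b_ok = pred_b == y_test

# 2 x 2 contingency table: [[both correct, only A correct],
#                           [only B correct, both wrong]]
table = np.array([
    [np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
    [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)],
])

# McNemar's test; H0: both classifiers have the same error rate.
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)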

4 Analysis of Experimental Results


4.1 Feature Extraction Experiment Results
4.1.1 Ink Feature
The means and standard deviations for this feature are displayed in table 4.

Table 4: Ink mean and standard deviations per label


Label Ink Mean Ink Standard Deviation
0 34632.40 8462.91
1 15188.46 4409.93
2 29871.09 7653.92
3 28320.18 7574.97
4 24232.72 6375.41
5 25835.92 7527.59
6 27734.91 7531.41
7 22931.24 6169.04
8 30184.14 7778.35
9 24553.75 6466.00

From this statistical analysis, the following observations about the dataset can be made:

• The class with the least ink used is the class with the digit 1.

• The class with the most ink used is the class with the digit 0.

• Classes with digits 2, 3 and 6 are very close regarding the amount of ink used, as well
as classes 4 and 9.

From the above it can be inferred that the pair of classes easiest to separate is 1 and 0,
since their difference in ink is the greatest. Following the same logic, classes 2, 3 and 6,
and classes 4 and 9, will be difficult to separate since their ink values are so close. This
can also be seen in the results from the classifier itself. The confusion matrix for the Logit
classifier trained on the ink feature can be seen in the table below.

Table 5: The logit classifier confusion matrix for the ink feature
0 1 2 3 4 5 6 7 8 9
0 2420 83 322 805 0 0 0 384 0 118
1 10 3823 5 101 0 0 0 722 0 23
2 1496 280 326 1039 0 0 0 874 0 162
3 1247 408 334 1037 0 0 0 1141 0 184
4 441 829 195 886 0 0 0 1496 0 225
5 728 671 197 846 0 0 0 1190 0 163
6 1057 450 296 982 0 0 0 1145 0 207
7 325 1190 149 819 0 0 0 1700 0 218
8 1431 192 342 1047 0 0 0 879 0 172
9 484 763 196 870 0 0 0 1651 0 224

The table shows that the Logit classifier performs poorly on the dataset: the accuracy
calculated from this matrix is 22.3%. The digits 4, 5, 6 and 8 are never predicted. This is
likely caused by other classes having very similar ink values. For instance, the ink values
of 4 and 9 are very close, but as seen in the data analysis the 9 has more training instances;
the classifier then always picks the more frequent class 9, so the 4 is never predicted.

4.1.2 Half-Ink Feature


The calculated means and standard deviations of the half-ink feature per digit are shown
in table 6 below. From the means it can be seen that the 1 is still easily distinguishable,
and that the values for some digit pairs lie close together; this is the case for the 4 and 9
pair and for the 6 and 2 pair. The mean value for 5 also lies close to those of 2 and 6.

Table 6: Half ink mean and standard deviations per label
Label Half Ink Mean Half Ink Standard Deviation
0 16605.25 4148.89
1 5950.64 2524.94
2 13389.62 3718.41
3 11980.13 3606.81
4 10695.53 3354.00
5 12207.75 3590.74
6 13303.94 3876.17
7 9456.96 3155.64
8 14163.36 3917.75
9 10300.92 3203.87

What is observed in the mean half-ink values is also observed when the classifier was
evaluated. The obtained confusion matrix can be found in table 7.

Table 7: The confusion matrix for the half-ink feature


0 1 2 3 4 5 6 7 8 9
0 2524 22 454 540 0 0 0 200 332 60
1 7 3624 26 134 0 0 0 833 9 51
2 1248 195 516 972 0 0 0 738 337 171
3 782 459 487 1027 0 0 0 1098 280 218
4 390 771 375 886 0 0 0 1300 174 176
5 737 340 435 928 0 0 0 958 238 159
6 1213 235 489 963 0 0 0 772 318 147
7 210 1391 236 759 0 0 0 1491 118 196
8 1460 111 523 881 0 0 0 561 365 162
9 303 859 283 867 0 0 0 1512 165 199

Table 7 above shows that this feature does not provide a substantial difference in results,
with the accuracy improving only slightly over the ink feature, from 22.3% to 23.8%. The
digits 4, 5 and 6 are still never predicted by the model. Most notably, while the classifier
using the ink feature never predicted a digit to be 8, using the half-ink feature it does
predict the digit 8.

4.1.3 Both Features Together


The confusion matrix for the combination of both features reveals a minor improvement
over the single-feature classifiers and can be found in table 8. It can be observed that the
classifier can now correctly classify instances of the digit 6. However, it now misclassifies
multiple instances of the digit 0. Whereas the classifier using only the ink feature or only
the half-ink feature never predicted any digit to be 4, 5 or 6, the classifier using both
features does sometimes predict these digits. The accuracy of the model shows a slight
improvement, which would be larger if the classifier did not misclassify previously correct
digit instances. The model accuracy is 27.32% for both features combined.

Table 8: The confusion matrix for both the ink and the half-ink features
0 1 2 3 4 5 6 7 8 9
0 2207 59 426 160 161 154 757 27 176 5
1 3 3719 12 53 185 196 75 427 9 5
2 765 237 657 974 357 151 412 410 153 61
3 344 399 473 1581 279 97 181 875 65 57
4 224 719 201 665 281 329 677 874 62 40
5 580 504 339 231 398 527 860 216 113 27
6 923 364 335 291 328 378 1130 247 108 33
7 135 1116 172 690 363 299 348 1190 51 37
8 1121 141 490 482 297 252 871 238 132 39
9 164 710 206 913 351 194 297 1245 58 50

4.2 Model Selection Experiment Results


4.2.1 Regularized Multinomial Logit Model Classifier
Using the pixel values themselves as features, a clear and drastic increase in accuracy can
be seen for the Logit classifier, which now jumps to 89.81%. The confusion matrix can be
seen below.

Table 9: The confusion matrix for the Logit classifier


0 1 2 3 4 5 6 7 8 9
0 3520 0 14 11 12 44 40 11 22 5
1 1 4007 12 24 1 24 5 8 38 1
2 32 56 3147 55 71 12 71 82 115 20
3 18 29 108 3276 14 144 25 51 115 64
4 7 26 16 7 3297 7 21 4 14 150
5 55 34 21 101 70 2822 65 22 120 45
6 33 15 25 4 47 59 3415 4 40 0
7 11 56 60 22 62 10 8 3516 8 134
8 17 149 38 114 28 111 23 22 2983 62
9 29 34 6 56 136 25 2 137 43 3247

This increase in accuracy can be attributed to using more and better features: 196 features
now describe each instance, and they convey far more information about the shape of
the digit.

4.2.2 Support Vector Machines Classifier


Here an increase in accuracy is seen compared to the improved Logit classifier, with the
SVM classifier attaining an accuracy of 93.74%. The confusion matrix presented in table 10
corroborates these findings.

Table 10: The confusion matrix for the SVM classifier


0 1 2 3 4 5 6 7 8 9
0 3540 1 9 2 16 41 36 0 29 5
1 0 4056 14 8 7 3 8 2 22 1
2 13 18 3368 36 57 9 10 36 106 8
3 4 11 52 3474 8 97 4 22 144 28
4 6 9 25 2 3412 2 5 10 10 68
5 13 3 10 59 33 3098 26 7 73 33
6 12 8 33 1 45 44 3449 3 47 0
7 3 28 23 17 82 5 1 3568 22 138
8 7 19 19 50 25 59 3 11 3340 14
9 11 6 17 35 137 14 1 79 38 3377

Certain things stand out in confusion table 10. Focusing on the mistakes the classifier
still made, the main misclassification was confusing the digit 8 with 3; in other words, it
predicted 8 while the true label was 3. Other notable errors occurred when the SVM
predicted 9 while the digit was a 7, predicted 4 while the digit was a 9, and again predicted
8 while the true label was 2.

4.2.3 Feed-Forward Neural Network Classifier


The feed-forward neural network performed the best out of the three models, with an
accuracy of 94.13%. The confusion matrix with the evaluation results of the classifier can
be seen below.

Table 11: The confusion matrix for the feed forward neural network
0 1 2 3 4 5 6 7 8 9
0 3583 1 5 7 5 11 39 10 13 5
1 0 4045 18 15 6 7 5 6 15 4
2 29 53 3370 29 41 7 27 53 45 7
3 11 14 68 3501 5 81 9 35 75 45
4 6 12 20 5 3380 2 21 8 10 85
5 23 15 9 55 24 3077 48 11 63 30
6 27 7 17 1 27 21 3511 1 29 1
7 11 38 42 8 41 6 5 3632 9 95
8 14 59 23 53 23 37 19 11 3279 29
9 21 19 3 47 67 16 0 67 27 3448

This model achieved the highest accuracy of all models tested. However, for the digit pairs
that were hard to distinguish using the ink or half-ink features, such as 4 and 9, this model
still makes more mistakes than it does for other digits.

4.3 Accuracy Analysis


As described in section 3.2.4, the contingency tables for each pair of classifiers are shown
below:

                  SVM - Correct   SVM - Wrong
NN - Correct          33753          1084
NN - Wrong              732          1431

Table 12: The contingency table for pair 1: NN and SVM

                  Logit - Correct   Logit - Wrong
NN - Correct           32545             600
NN - Wrong              1940            1915

Table 13: The contingency table for pair 2: NN and Logit

                   SVM - Correct   SVM - Wrong
Logit - Correct        32479            666
Logit - Wrong           2358           1497

Table 14: The contingency table for pair 3: Logit and SVM

After calculating the three tables, McNemar’s test returns the following statistic and
p-value for each pair:

Statistic P-Value
Pair 1 - NN/SVM 732 0.00001
Pair 2 - NN/Logit 600 0.00003
Pair 3 - Logit/SVM 660 0.00002

Table 15: The statistic and p-values for each pair of classifiers

From the above results it is clear that each p-value is lower than α = 0.05, and so the null
hypothesis is rejected for every pair. Therefore the accuracies of the classifiers do differ
significantly from one another.

5 Conclusion
The extracted ink and half-ink features were not very descriptive of the data. Since the
"ink" values for some digits were very similar, the logit model had trouble distinguishing
the digits and did not predict some digits at all. When a model was fitted with both the
ink and half-ink features, a slight improvement in accuracy was observed, and the model
produced predictions for all digits. Whilst it did improve overall, the improvement was
modest, as for some of the digits, like 0, the accuracy went down.
For the experiments using the pixel values as features, three models were tested: a
regularized multinomial logit model, an SVM and a feed-forward neural network. For all
models the parameter values were tuned using grid search. The final accuracy scores were
as follows:

• Logit classifier: 89.81%

• SVM classifier: 93.74%

• Feed-Forward Neural Network: 94.13%

The feed-forward neural network had the highest accuracy when predicting the digits in the
test set, and according to McNemar's test the pairwise differences in accuracy between the
models are statistically significant.

