
Student ID: S3794334

Student Name: Dhachainee Murugayah

I certify that this is all my own original work. If I took any parts from elsewhere, then they were non-essential
parts of the assignment, and they are clearly attributed in my submission. I will show I agree to this honor
code by typing "Yes": Yes.

Comparative Analysis On Mice Protein Expression Data Set Using


Classification Approach
DHACHAINEE MURUGAYAH
School of Computing, RMIT
S3794334@student.rmit.edu.au
10th June 2020

Table of Contents
An abstract/executive summary
Introduction
Goal and Objectives
Task 1: Retrieving and Preparing the Data
1.1: Data Retrieving
1.2: Check data types
1.3: Missing Values
1.4: Redundant Features
Task 2.1: Explore each column
Task 2.2: Explore the relationship between pairs of attributes
Methodology
Feature selection
Feature scaling
Target encoding
Classification Algorithm
Result
KNN before tuning
KNN after tuning
Decision tree before tuning
Decision tree after tuning
Performance comparison
Discussion
Limitations and Solutions
Conclusion
References

An abstract/executive summary
The mice protein expression dataset is used to analyse the impact of learning in normal mice and in mice with Down Syndrome. The purpose of this project is therefore to study, through this dataset, the influence of proteins on the recuperating learning ability of mice with Down Syndrome. A classification task is used in this analysis. In the classification analysis with KNN and Decision Tree, KNN proved effective at discovering critical protein effects, which can assist in finding more effective potential drug targets, while the Decision Tree also classifies protein samples effectively with a high accuracy rate.

Introduction
The dataset is obtained from https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression. It contains the expression levels of 77 proteins or protein modifications, acquired from signals detected in the nuclear fraction of cortex. There are 72 mice, of which 38 are control mice and the remaining 34 are trisomic (Down Syndrome) mice. Fifteen measurements were taken of each protein per mouse, resulting in 1080 measurements per protein in total. Hence, this dataset has 1080 rows and 82 columns. Some mice were infused with a drug and others were not, in order to evaluate the impact of the memantine drug in recuperating the ability to learn in mice with Down Syndrome.

Goal and Objectives


Goal

• The goal is to analyse the influence of proteins on the recuperating learning ability of Down Syndrome mice by conducting a comparative exploration of two different classification models.

Objectives

• To perform data analysis to acquire interesting insights from the dataset
• To build a classification model that can classify protein samples from mice with Down Syndrome effectively and with high accuracy
• To identify relevant features with the help of feature selection techniques to reduce dimensionality

Task 1: Retrieving and Preparing the Data


1.1: Data Retrieving
The dataset is read directly from the CSV file stored alongside the Jupyter notebook. All the columns contain appropriate names.

1.2: Check data types


The shape of the dataset is checked to make sure the data has been downloaded properly. The feature types are also checked to gain a better understanding of the structure and format of the dataset. Since the first column is an ID column, it is dropped.

1.3: Missing Values


The number of missing values in each column is calculated. There are 82 columns in this dataset. Rows with 75 or more null values are dropped, as rows with so few usable values are not sufficient for the analysis. The remaining null values are imputed with the column mean.
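The dropping and imputation steps above can be sketched in pandas as follows; the column names are illustrative, and the null threshold is scaled down from 75 to 2 to suit the small toy frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; column names are illustrative.
df = pd.DataFrame({
    "DYRK1A_N": [0.50, np.nan, 0.70, 0.60],
    "ITSN1_N":  [0.60, np.nan, 0.50, np.nan],
    "BDNF_N":   [0.30, np.nan, 0.40, 0.20],
})

# Drop rows with too many nulls (threshold scaled down from 75 to 2
# for this 3-column toy), then impute the remaining nulls with the mean.
df = df[df.isnull().sum(axis=1) < 2]
df = df.fillna(df.mean())
```

After these two steps the frame contains no missing values, and rows that were mostly empty are gone rather than dominated by imputed values.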

1.4: Redundant Features


A feature is considered redundant or duplicate if it conveys the same information as another feature. The column pS6_N was identified as a duplicate and removed from the dataset.
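One common way to detect such duplicate columns is to transpose the frame and look for duplicated rows; the values below are made up for illustration:

```python
import pandas as pd

# Toy frame; in the report pS6_N was found to duplicate another
# column (the values here are made up).
df = pd.DataFrame({
    "S6_N":  [0.4, 0.5, 0.6],
    "pS6_N": [0.4, 0.5, 0.6],
    "ERK_N": [0.1, 0.2, 0.3],
})

# Transposing turns columns into rows, so duplicated() flags columns
# whose values repeat an earlier column.
dupes = df.columns[df.T.duplicated()].tolist()
df = df.drop(columns=dupes)
```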

Task 2.1: Explore each column

Figure 1: Type of the classes

The classes of Shock Context (SC) and Context Shock (CS) control mice with the Memantine treatment are the highest, with equal counts, whereas the class of Context Shock (CS) trisomic mice with the Saline treatment is the lowest.

Figure 2: Types of behaviour Figure 3: Genotype of the mice Figure 4: Types of the treatment

• The behaviour plot clearly shows that the majority of the mice are Shock Context (SC), meaning not stimulated to learn, and only 49% are Context Shock (CS), meaning stimulated to learn.
• Fewer trisomic (Down Syndrome) mice are used in this experiment compared to the control mice.
• More than half of the treatments used the Memantine drug; Saline was used for the rest.

Figure 5: Histogram of DYRK1A Figure 6: Histogram of ITSN1_N Figure 7: Histogram of pELK_N

The histograms of DYRK1A, ITSN1_N and pELK_N are skewed to the right, as they have a few larger values on the right side. These few large values pull the mean upwards without affecting the median. Hence, the mean is higher than the median in these graphs.
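A small worked example (with made-up values) shows how a few large values on the right raise the mean but leave the median untouched:

```python
import statistics

# Mostly small values plus one large outlier on the right.
values = [1, 1, 2, 2, 3, 3, 4, 20]

mean = statistics.mean(values)      # pulled upward by the outlier
median = statistics.median(values)  # unaffected by the outlier
assert mean > median
```

Here the mean is 4.5 while the median is only 2.5, the same pattern seen in the right-skewed protein histograms.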

Figure 8: Histogram of BDNF_N Figure 9: Histogram of NR1_N Figure 10: Histogram of SYP_N

The histograms of BDNF_N, NR1_N and SYP_N have the same shape on both sides of the middle, denoting that the data are symmetric. In symmetric data, the mean and the median are almost the same.

Task 2.2: Explore the relationship between pairs of attributes

Ten pairs of columns are chosen to explore the relationships between attributes. The first pair is 'Genotype' and 'Treatment'. These two columns are explored to address whether all the Control mice received the Memantine treatment.
plausible hypothesis: Not all the Control mice received Memantine treatment.

Figure 11: Type of the treatment

The graph above shows that the majority of the Control mice received the Memantine drug. Of 570 Control mice, 300 were treated with Memantine and the remaining 270 with Saline. This clearly shows that Control mice are treated with both Memantine and Saline.
The plausible hypothesis is proved.
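Counts like these are typically obtained with a two-way frequency table. The sketch below rebuilds the reported counts in a toy frame and tabulates them with pandas:

```python
import pandas as pd

# Rebuild the counts reported in this section as a toy frame of
# 'Genotype' vs 'Treatment' (Control: 300 Memantine / 270 Saline;
# Ts65Dn: 270 Memantine / 237 Saline).
df = pd.DataFrame({
    "Genotype":  ["Control"] * 570 + ["Ts65Dn"] * 507,
    "Treatment": ["Memantine"] * 300 + ["Saline"] * 270
                 + ["Memantine"] * 270 + ["Saline"] * 237,
})

# A two-way frequency table answers the hypothesis directly.
table = pd.crosstab(df["Genotype"], df["Treatment"])
```

Reading the table row by row confirms the bar-chart counts: both genotypes received both treatments.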

The second pair is 'Genotype' and 'Treatment'. These two columns are explored to address whether most of the Down Syndrome mice received the Memantine treatment.
plausible hypothesis: Most of the Down Syndrome mice received Memantine treatment.

Figure 12: Type of the treatment

The graph above shows that the majority of the trisomic mice received the Memantine drug. Of 507 trisomic mice, 270 were treated with Memantine and the remaining 237 with Saline. This clearly shows that trisomic mice are mostly treated with Memantine.
The plausible hypothesis is proved.

The third pair is 'Treatment' and 'Genotype'. These two columns are explored to address whether Memantine
treatment is mostly given to Down Syndrome (Ts65Dn) mice.
plausible hypothesis: Memantine treatment is mostly given to Down Syndrome (Ts65Dn) mice.

Figure 13: Type of the mice

The graph above shows that most of the Memantine treatment is given to Control mice compared to trisomic mice. Of 570 Memantine treatments given, 300 went to Control mice and the remaining 270 to trisomic mice (Down Syndrome).
The plausible hypothesis is not proved.

The fourth pair is 'Treatment' and 'Genotype'. These two columns are explored to address whether Saline
treatment is mostly given to Control mice.
plausible hypothesis: Saline treatment is mostly given to Control mice.

Figure 14: Type of the mice

The graph above shows that most of the Saline treatment is given to Control mice compared to trisomic mice. Of 507 Saline treatments given, 270 went to Control mice and the remaining 237 to trisomic mice (Down Syndrome).
The plausible hypothesis is proved.

The fifth pair is 'Genotype' and 'Behavior'. These two columns are explored to address whether all the Control mice belong to Context Shock (C/S) behaviour.
plausible hypothesis: All the Control mice belong to Context Shock (C/S) behaviour.

Figure 15: Type of the behaviour of the mice

The graph above shows that Control mice exhibit both Context Shock (C/S) and Shock Context (S/C) behaviour equally, with 285 counts each.
The plausible hypothesis is not proved.

The sixth pair is 'Genotype' and 'Behavior'. These two columns are explored to address whether most of the trisomic mice have Shock Context (S/C) behaviour.
plausible hypothesis: Most of the trisomic mice have Shock Context (S/C) behaviour.

Figure 16: Type of the behaviour of the mice

The graph above shows that trisomic mice mostly have Shock Context (S/C) behaviour. Of 507 trisomic mice, 267 have Shock Context (S/C) behaviour and the remaining 240 have Context Shock (C/S) behaviour.
The plausible hypothesis is proved.

The seventh pair is 'Behavior' and 'Genotype'. These two columns are explored to address whether Shock Context (S/C) behaviour is mostly from the Control mice.
plausible hypothesis: Shock Context (S/C) behaviour is mostly from the Control mice.

Figure 17:Type of the mice

The graph above shows that Shock Context (S/C) behaviour mostly comes from Control mice rather than trisomic mice. Of 552 Shock Context (S/C) records, 285 are from Control mice and the remaining 267 from trisomic mice.
The plausible hypothesis is proved.

The eighth pair is 'Treatment' and 'Behavior'. These two columns are explored to address whether most of the mice that received the Memantine treatment belong to Context Shock (C/S) behaviour.
plausible hypothesis: Most of the Memantine treatment is given to the mice with Context Shock (C/S) behaviour.

Figure 18: Type of the behaviour of the mice

The graph above shows that the Memantine treatment is given equally to mice with Context Shock (C/S) and Shock Context (S/C) behaviour, with 285 counts each.
The plausible hypothesis is not proved.

The ninth pair is 'Treatment' and 'Behavior'. These two columns are explored to address whether most of the mice that received the Saline treatment belong to Shock Context (S/C) behaviour.
plausible hypothesis: Most of the Saline treatment is given to the mice with Shock Context (S/C) behaviour.

Figure 19: Type of the behaviour of the mice

The graph above shows that the Saline treatment is mostly given to mice with Shock Context (S/C) behaviour. Of 507 Saline treatments, 267 went to Shock Context (S/C) behaviour mice and the remaining 240 to Context Shock (C/S) behaviour mice.
The plausible hypothesis is proved.

The last pair is 'Behavior' and 'Treatment'. These two columns are explored to address whether mice with Shock Context (S/C) behaviour mostly received the Memantine treatment.
plausible hypothesis: Mice with Shock Context (S/C) behaviour mostly received the Memantine treatment.

Figure 20: Types of the treatment

The graph above shows that the mice with Shock Context (S/C) behaviour mostly received the Memantine treatment. Of 552 Shock Context (S/C) behaviour mice, 285 received Memantine and the remaining 267 received Saline.
The plausible hypothesis is proved.

Methodology
This project presents a comparative analysis of the Mice Protein Expression dataset using two different machine learning algorithms. The comparative analysis studies the influence of proteins that could have impacted the recuperating learning ability of Down Syndrome mice. Two algorithms, KNN and Decision Tree, are used to build the models. The performance and accuracy of the models are evaluated using hyperparameter tuning and cross-validation, and the model with the highest accuracy is selected as the proposed model for this project. The modelling involves five main stages of scrutiny:
First stage: Feature selection, where the most important features are selected to build the models
Second stage: Pre-processing, where the selected features are scaled and the target feature is encoded to numeric values
Third stage: Classification, where the two different algorithms are used to build the models separately

Fourth stage: Model selection, where grid search (hyperparameter tuning) is used to find the optimal parameters for each model
Fifth stage: Performance comparison, where each model is compared to the other using paired t-tests

Feature selection
Since this study analyses the influence of proteins on the recovery of learning ability among trisomic mice, all 76 protein expression levels are used in feature selection to find the 15 most important features. Feature selection is performed using Random Forest Importance (RFI). The 'class' column, which records the result of the experiment, is selected as the target variable.
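A minimal sketch of RFI-based selection, using synthetic data in place of the 76 protein columns (the real dataset is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 76 numeric features and an 8-class target.
X, y = make_classification(n_samples=300, n_features=76, n_informative=15,
                           n_classes=8, n_clusters_per_class=1,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Indices of the 15 most important features by impurity-based importance.
top15 = np.argsort(rf.feature_importances_)[::-1][:15]
X_sel = X[:, top15]
```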

Feature scaling
Feature scaling is a crucial data pre-processing step, as it normalises the selected features to a particular range, which helps speed up the calculations in the algorithms. In this study, min-max scaling is used to scale the selected features.
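Min-max scaling maps each feature to the [0, 1] range; a small illustrative sketch with scikit-learn (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (values illustrative).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # each column now spans [0, 1]
```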

Target encoding
The target feature is converted to numbers from 0 to 7 using LabelEncoder as below:
0 = c-CS-m
1 = c-CS-s
2 = c-SC-m
3 = c-SC-s
4 = t-CS-m
5 = t-CS-s
6 = t-SC-m
7 = t-SC-s
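The encoding above can be reproduced with scikit-learn's LabelEncoder, which assigns codes in alphabetical order of the class labels:

```python
from sklearn.preprocessing import LabelEncoder

# The eight experimental classes from the dataset.
labels = ["c-CS-m", "c-CS-s", "c-SC-m", "c-SC-s",
          "t-CS-m", "t-CS-s", "t-SC-m", "t-SC-s"]

le = LabelEncoder()
codes = le.fit_transform(labels)  # alphabetical order gives 0..7
```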

Classification Algorithm
The dataset is split into a training set and a test set in a 70%–30% ratio (70% for training and 30% for testing). The training set has 753 rows and the test set has 324 rows. The performance of the models is observed using the confusion matrix, as it provides the counts of correct and incorrect predictions. The classification report is used to measure the quality of each algorithm's predictions.
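The split can be sketched as follows; synthetic data with the same cleaned row count (1077) stands in for the real features, and a 70/30 split reproduces the 753 / 324 row counts stated above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the cleaned dataset.
X, y = make_classification(n_samples=1077, n_features=15, n_informative=10,
                           n_classes=8, n_clusters_per_class=1,
                           random_state=0)

# A 70/30 split reproduces the 753 / 324 row counts stated above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
```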

KNN
KNN is one of the most widely used machine learning algorithms, applied in many areas such as predicting the credit ratings of customers in financial institutions. Hence, KNN is a suitable model for this analysis. A random value of k is first selected to obtain a baseline accuracy score.
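A baseline KNN fit with an arbitrary k, on synthetic stand-in data (the choice k=7 here is illustrative, not the value from the report):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=10,
                           n_classes=8, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# An arbitrary starting k before any tuning (k=7 is illustrative).
knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
acc = accuracy_score(y_te, knn.predict(X_te))
```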

KNN tuning
The minimal error rate indicates the best k value. However, using the test data for hyperparameter tuning results in overfitting. Hence, 10-fold cross-validation on the training set is used to calculate the error rate for each candidate k.
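A sketch of this tuning loop, scoring each odd candidate k with 10-fold cross-validation on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=10,
                           n_classes=8, n_clusters_per_class=1,
                           random_state=0)

# 10-fold cross-validated error rate for each odd candidate k;
# the k with the lowest mean error is taken as optimal.
errors = {}
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    errors[k] = 1.0 - scores.mean()

best_k = min(errors, key=errors.get)
```

Plotting the error rate against k gives the misclassification-error curve shown in the results below.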

Decision Tree
A decision tree makes predictions on the test data by applying the splitting rules it learned from the training data. Decision trees are easy to interpret and visualise. Moreover, since decision trees do not require normalisation or scaling of the data, they take less effort to pre-process.

Decision Tree tuning


GridSearch is used to tune the Decision Tree. Maximum depth (max_depth) and minimum samples per split (min_samples_split) are the hyperparameters tuned, with max_depth values of [2, 5] among the candidates searched to find the optimal value.
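A sketch of the grid search on synthetic stand-in data; the hyperparameter names follow the report, but the candidate values below are illustrative rather than the exact grid used:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=10,
                           n_classes=8, n_clusters_per_class=1,
                           random_state=0)

# Grid over the hyperparameters named in the report; the candidate
# values are illustrative, not the exact grid used.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 5, 10, 20],
                                "min_samples_split": [2, 5],
                                "criterion": ["gini", "entropy"]},
                    cv=10)
grid.fit(X, y)
best = grid.best_params_
```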

Result
KNN before tuning

Figure 21: Classification report of KNN before tuning

KNN after tuning

Figure 22: The graph of misclassification error vs k-values

Figure 23: Classification report of KNN after tuning

There is an increase in the accuracy score after tuning the KNN: from 94%, the accuracy increases to 97% with the optimal k value of 3. Hence, it is vital to find the optimal number of neighbours to obtain a high accuracy score.

Decision tree before tuning

Figure 24: Classification report of Decision Tree before tuning

Decision tree after tuning

Figure 25: Best score and best parameter of Decision Tree

Figure 26: Classification report of Decision Tree after tuning

There is a small rise in the accuracy score after tuning the Decision Tree: from 82%, the accuracy increases to 83% with a maximum depth of 20, a minimum samples split of 2 and entropy as the criterion. Hence, it is vital to find the optimal hyperparameter values, which control the overall behaviour of the algorithm, to obtain a high accuracy score.

Performance comparison
Stratified 10-fold cross-validation with 10 repetitions is used to compare KNN and the Decision Tree with their best estimators.
Firstly, a paired t-test is performed to determine whether the models have a statistically significant difference.

Figure 27: The result of the paired t-test

The mean of the best cross-validation scores after tuning is used to determine the accuracy of each classifier. Both models/classifiers use the same seed, with accuracy as the scoring metric.
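A sketch of this comparison on synthetic stand-in data: both models are scored on the same repeated stratified 10-fold splits (same seed), so the 100 scores are matched pairs suitable for a paired t-test:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=15, n_informative=10,
                           n_classes=8, n_clusters_per_class=1,
                           random_state=0)

# Identical splits (same seed) for both models, so scores are paired.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
knn_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=cv)
dt_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test on the 100 matched fold scores.
t_stat, p_value = ttest_rel(knn_scores, dt_scores)
```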

Figure 28: Mean score of KNN Figure 29: Mean score of Decision Tree

Besides that, the following metrics are considered to evaluate models based on the test set:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix

Figure 30: Classification report of both models

Discussion
In the paired t-test, a p-value less than 0.05 denotes a significant difference between the two models. The output shows that, at the 95% confidence level, KNN is statistically the better model in terms of accuracy on the test data, as it has the higher mean score.
Since this project analyses the influence of proteins, recall, also known as the true positive rate (TPR), is chosen as the performance metric. In this context, KNN is the best performer, since it produces the highest recall scores for the mice classes. The confusion matrices are in accordance with the classification reports, which also supports the finding that KNN is measurably the best performer with regard to the accuracy metric.

Limitations and Solutions


The modelling strategy has some limitations:
Firstly, only 15 features were selected for modelling; the number of features retained in feature selection could be increased. The dataset is also small, with only 753 rows for training. More data could be gathered in future to optimise the classifier even further.

Another limitation is that decision trees can be very unstable: small variations in the dataset can produce very different trees, and if some dominant classes are present, the resulting trees will be biased. Therefore, tuning the depth is crucial to obtain an optimal result.

Conclusion

The KNN model with the 15 best features chosen by Random Forest Importance (RFI) produces the highest accuracy score on the training data. Moreover, KNN outperforms the Decision Tree in the evaluation on the test data, where it also shows the highest recall score. From these observations, we can say that the model is not sensitive to the number of features, since it works well with the 15 selected features rather than requiring the full feature set. Overall, we achieved the best performance for the task of analysing the influence of proteins on the recuperating learning ability of Down Syndrome mice, with an accuracy score of 97% on the Mice Protein Expression Data Set using the proposed model.

References

DataCamp, 2018. Decision Tree Classification in Python. [Online] Available at: https://www.datacamp.com/community/tutorials/decision-tree-classificationpython [Accessed 3 May 2020].

DataCamp, 2018. KNN Classification using Scikit-learn. [Online] Available at: https://www.datacamp.com/community/tutorials/k-nearest-neighborclassification-scikit-learn [Accessed 2 May 2020].

Higuera, C., Gardiner, K.J. and Cios, K.J., 2015. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 10(6).

Ramanathan, S., Sangeetha, M., Talwai, S. and Natarajan, S., 2018. Probabilistic Determination Of Down's Syndrome Using Machine Learning Techniques. In 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 126-132). IEEE.

Saringat, M.Z., Mustapha, A. and Andeswari, R., 2018. Comparative Analysis of Mice Protein Expression: Clustering and Classification Approach. International Journal of Integrated Engineering, 10(6).

www.featureranking.com, 2020. Case Study: Predicting Income Status. [Online] Available at: https://www.featureranking.com/tutorials/machine-learning-tutorials/case-study-predicting-income-status/ [Accessed 7 June 2020].

www.featureranking.com, 2020. SK Part 3: Cross-Validation and Hyperparameter Tuning. [Online] Available at: https://www.featureranking.com/tutorials/machine-learning-tutorials/sk-part-3-cross-validation-and-hyperparameter-tuning/#1.5 [Accessed 2 June 2020].

