
Report on

Classification Model for Mice Using Genotype Dataset

Assignment-01

Submitted by: Daxil Vala
Date: 29/10/2020
Abstract:

In recent years data analysis has received increasing attention, and numerous applications have emerged from this technique. This project builds one such application: a classification model for mice using their genotype information. Eight classes of mice are described based on features such as genotype, behavior and treatment. We used a K-Nearest Neighbors (KNN) model for the task.

Keywords: Classification, K-Nearest Neighbors (KNN)


INTRODUCTION:

In recent times, Machine Learning and Artificial Intelligence have seen rapid adoption in the domains of medical science and life science.
The data set consists of the expression levels of 77 proteins/protein modifications
that produced detectable signals in the nuclear fraction of cortex. There are 38
control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice. In the
experiments, 15 measurements were registered of each protein per sample/mouse.
Therefore, for control mice, there are 38x15, or 570 measurements, and for trisomic
mice, there are 34x15, or 510 measurements. The dataset contains a total of 1080
measurements per protein. Each measurement can be considered as an independent
sample/mouse.

The eight classes of mice are described based on features such as genotype, behavior and treatment. According to genotype, mice can be control or trisomic. According to behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context). To assess the effect of the drug memantine in recovering the ability to learn in trisomic mice, some mice have been injected with the drug and others have not. These three binary factors combine into 2 × 2 × 2 = 8 classes; for example, c-SC-s denotes a control mouse in the shock-context condition injected with saline.
METHODOLOGY:

Model Used:

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. Since the algorithm relies on distance for classification, normalizing the training data can improve its accuracy dramatically.
For both classification and regression, a useful technique is to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is the distance to the neighbor.
The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
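
A minimal sketch of this setup with scikit-learn is shown below; the synthetic toy data and the specific use of StandardScaler are illustrative assumptions, not the exact experimental configuration.

    # Minimal sketch: distance-weighted k-NN with feature normalization.
    # The synthetic toy data below is an illustrative assumption.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))             # 100 samples, 5 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary labels

    # StandardScaler normalizes the features so that no single feature
    # dominates the distance metric; weights="distance" applies the 1/d
    # weighting scheme described above.
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, weights="distance"),
    )
    model.fit(X, y)
    print(model.predict(X[:3]))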
Operational Flow:

The overall flow is: load the dataset, normalize the features, split the data into training and test sets, fit the KNN model, and evaluate its accuracy on the test set.

Results:
System Specification:
• Processor Brand: Intel
• Processor Name: Core i7
• Processor Generation: 7th Gen
• Processor Variant: 7500U
• Chipset: Intel
• Clock Speed: up to 3.5 GHz
• Cache: 4 MB
• RAM: 16 GB
• RAM Type: DDR4
• RAM Frequency: 2133 MHz
• Memory Slots: 2
• Expandable Memory: up to 16 GB
• SSD: 512 GB
• HDD Capacity: N/A
• RPM: 5400
• Graphic Processor: Nvidia 1050
• Dedicated Graphic Memory Type: DDR5
• Dedicated Graphic Memory Capacity: 4 GB
Dataset:
The dataset is the Mice Protein Expression dataset described in the Introduction: expression levels of 77 proteins/protein modifications measured in the nuclear fraction of cortex, giving 1080 measurements per protein across 72 mice (38 control and 34 trisomic, with 15 measurements per protein per mouse), where each measurement can be treated as an independent sample.
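
A minimal loading sketch with pandas is shown below. The file name Data_Cortex_Nuclear.csv and the column layout follow the public UCI repository copy of this dataset, and are assumptions about the exact files used here.

    # Minimal sketch: load the Mice Protein Expression dataset with pandas.
    # The file name follows the UCI repository copy and is an assumption.
    import pandas as pd

    df = pd.read_csv("Data_Cortex_Nuclear.csv")
    print(df.shape)          # expected (1080, 82): MouseID, 77 proteins,
                             # Genotype, Treatment, Behavior, class

    X = df.iloc[:, 1:78]     # the 77 protein expression columns
    y = df["class"]          # one of the eight class labels
    X = X.fillna(X.mean())   # the dataset has missing values; mean-impute
    print(y.value_counts())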

Discussion:
• Variation of DYRK1A_N w.r.t. the entries in the dataset

o For the 1080 entries in the dataset we plotted the expression level of DYRK1A_N (a plotting sketch follows this list).
o From the graph we can observe a trend: for most instances the value stays between 0 and 1, and only occasionally overshoots this range.

• Variation of class w.r.t. DYRK1A_N

o An important conclusion we can draw from this graph is that if the value of DYRK1A_N overshoots 1, the class is most likely c-SC-s.
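
A minimal sketch of how the two plots could be produced with matplotlib is given below; it assumes the DataFrame df from the loading sketch above, and the exact styling of the original figures may differ.

    # Minimal sketch: the two discussion plots, using matplotlib.
    # Assumes the DataFrame `df` from the loading sketch above.
    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Variation of DYRK1A_N across the 1080 entries.
    ax1.plot(df["DYRK1A_N"].values)
    ax1.set_xlabel("sample index")
    ax1.set_ylabel("DYRK1A_N")
    ax1.set_title("DYRK1A_N across samples")

    # Class vs. DYRK1A_N, showing which classes overshoot a value of 1.
    ax2.scatter(df["DYRK1A_N"], df["class"], s=5)
    ax2.set_xlabel("DYRK1A_N")
    ax2.set_ylabel("class")
    ax2.set_title("Class vs. DYRK1A_N")

    plt.tight_layout()
    plt.show()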
Conclusion:

o Using the KNN model on the given dataset, we obtained an accuracy of 95.33%.
o The parameters we chose for the KNN model are as follows (a configuration sketch follows this list):
K = 5 (i.e. the nearest-neighbor count)
Metric = Minkowski
P = 2 (which makes the Minkowski metric the Euclidean distance)
o As part of future work, we would like to deploy different models and compare their performances.
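
A minimal sketch of this configuration in scikit-learn is shown below; X and y are assumed to be prepared as in the loading sketch, and the 80/20 split with a fixed random seed is an illustrative assumption rather than the report's exact protocol.

    # Minimal sketch: the reported KNN configuration (K=5, Minkowski, P=2).
    # The 80/20 split and the random seed are illustrative assumptions.
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    )
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))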
References:

1. Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. doi:10.1371/journal.pone.0129126

2. Ahmed MM, Dhanasekaran AR, Block A, Tong S, Costa ACS, Stasko M, et al. (2015) Protein Dynamics Associated with Failed and Rescued Learning in the Ts65Dn Mouse Model of Down Syndrome. PLoS ONE 10(3): e0119491. doi:10.1371/journal.pone.0119491

3. Matplotlib documentation and scikit-learn documentation.
