
Capstone Build: Project Initiation Request (PIR)

Date: 11/3/2020

Student ID 16UG08002 16UG08003

Title of the Project Predicting the effect of genetic variants to enable personalized medicine

Purpose / Objective / Scope of the Request

The purpose of the project is to classify given genetic variations
(mutations) based on evidence from text-based clinical literature.
Interpreting genetic mutations currently requires a large amount of manual
work: a clinical pathologist must spend a lot of time manually reviewing and
classifying every single genetic mutation based on that literature. To make
the process more efficient and effective, we need to develop an algorithm
that automatically classifies genetic variations using some basic knowledge
of this field. More specifically, we apply text mining and machine learning
algorithms to train our model. For feature extraction, we apply both the
bag-of-words model and the TF-IDF model; for classification, we apply
neural networks, SVM, and XGBoost. The data for experimentation comes from
Kaggle and consists of information about genes, variations, and class
labels. We evaluate the results by calculating the multi-class log loss
between the predicted probabilities and the ground-truth labels.
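
The evaluation metric named above can be sketched as follows. This is a
minimal illustration that assumes the standard definition of multi-class log
loss (the mean negative log of the probability assigned to the true class);
the function name, the clipping constant, and the row rescaling step are
assumptions made for the example, not part of the project specification.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    true_labels: class labels in the range 1..9 (as in the Class field).
    predicted_probs: one row per sample, one probability per class,
        where index 0 corresponds to class 1.
    eps: assumed clipping constant that keeps log() away from zero.
    """
    total = 0.0
    for label, row in zip(true_labels, predicted_probs):
        # Assumption: rows are rescaled so that they sum to 1.
        p = row[label - 1] / sum(row)
        # Clip to avoid log(0) for over-confident wrong predictions.
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Usage: a model that spreads probability evenly over the 9 classes.
uniform_rows = [[1 / 9] * 9 for _ in range(3)]
uniform_loss = multiclass_log_loss([1, 5, 9], uniform_rows)
```

A uniform guess over 9 classes scores log(9), which is a useful baseline: any
trained model should achieve a lower loss than that.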

Constraints
DATASET –

1) Training variants - a comma-separated file containing the description of
the genetic mutations used for training. Fields are ID (the id of the row,
used to link the mutation to the clinical evidence), Gene (the gene where
this genetic mutation is located), Variation (the amino acid change for this
mutation), Class (1-9, the class this genetic mutation has been classified
into).
training_text (ID, Text)
Example –
ID,Gene,Variation,Class
0,FAM58A,Truncating Mutations,1
1,CBL,W802*,2
2,CBL,Q249E,2
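
A file in this layout could be read with Python's standard csv module. The
sketch below parses the example rows from an in-memory string; the function
name read_variants and the choice to convert ID and Class to integers are
assumptions made for illustration.

```python
import csv
import io

# The example rows shown above, held in memory instead of a file.
SAMPLE_VARIANTS = """ID,Gene,Variation,Class
0,FAM58A,Truncating Mutations,1
1,CBL,W802*,2
2,CBL,Q249E,2
"""


def read_variants(handle):
    """Parse the variants file into a list of dictionaries."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "ID": int(record["ID"]),        # links to the text file
            "Gene": record["Gene"],
            "Variation": record["Variation"],
            "Class": int(record["Class"]),  # label in the range 1..9
        })
    return rows


variants = read_variants(io.StringIO(SAMPLE_VARIANTS))
```

For the real file, the same function can be called with an open file handle
in place of the io.StringIO wrapper.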

2) Training text - a double pipe (||) delimited file that contains the
clinical evidence (text) used to classify genetic mutations. Fields are ID
(the id of the row, used to link the clinical evidence to the genetic
mutation), Text (the clinical evidence used to classify the genetic
mutation).

Example -
ID,Text
0||Cyclin-dependent kinases (CDKs) regulate a variety of fundamental
cellular processes. CDK10 stands out as one of the last orphan CDKs for
which no activating cyclin has been identified and no kinase activity
revealed. Previous.......... .
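
Because the delimiter is two characters long, the standard csv module does
not apply directly; splitting each line on the first "||" is one simple
alternative. The sketch below assumes each record occupies a single line in
the file and that the first line is the ID,Text header. The function name
read_evidence is hypothetical, and the second sample row is a placeholder,
not real data.

```python
import io

# First row abbreviated from the example above; second row is a placeholder.
SAMPLE_TEXT = (
    "ID,Text\n"
    "0||Cyclin-dependent kinases (CDKs) regulate a variety of "
    "fundamental cellular processes.\n"
    "1||(placeholder evidence text for row 1)\n"
)


def read_evidence(handle):
    """Map each row ID to its clinical evidence text."""
    next(handle)  # skip the "ID,Text" header line
    evidence = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only on the first "||" so the text itself stays intact.
        row_id, text = line.split("||", 1)
        evidence[int(row_id)] = text
    return evidence


evidence = read_evidence(io.StringIO(SAMPLE_TEXT))
```

The integer keys of the resulting dictionary match the ID field of the
variants file, so the two sources can be joined on that value.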

Project
Deliverables

Signature of Guide
with Date
Signature of Student
With Date
