
Cancer Diagnosis Using an RNN Model and Big Data

Abstract: Cancers are characterized by chromosomal aberrations which, particularly in solid tumors, appear in complex patterns of progression. Characterizing the timing and order of the genetic mutations that drive tumor progression is difficult due to this high level of complexity. For medical treatment purposes it is important to understand how these patterns develop, and several models have been proposed. The first were of a more descriptive nature, but later several attempts were made to obtain mathematical models. In this work, a generative probabilistic model based on Hidden Markov Models is presented that provides a natural framework for the inclusion of unobserved or missing data. The number of parameters is also reduced, and the inference algorithm used to estimate them is an EM algorithm (earlier algorithms have been rather straightforward heuristics). Tests performed with a synthetic data generator, also developed in this work (which tries to recreate real cancers), suggest that the algorithm infers the hidden parameters with high accuracy.

Keywords: HMM, RNN, Diagnosis, Gene Panels, Hidden Markov Models.

INTRODUCTION

Big Data refers to all the data that is being generated across the globe at an unprecedented rate. This data can be either structured or unstructured. Nowadays a huge part of the success of business enterprises depends on an economy that is firmly knowledge-oriented. Data drives the modern organizations of the world, so making sense of this data, unravelling the various patterns, and revealing unseen connections within the vast sea of data becomes a critical and hugely rewarding endeavour. There is a need to convert Big Data into Business Intelligence that enterprises can readily deploy. Better data leads to better decision making and an improved way to strategize for organizations regardless of their size, geography, market share, customer segmentation and other such categorizations. Hadoop is the platform of choice for working with extremely large volumes of data. The most successful enterprises of tomorrow will be the ones that can make sense of all that data at extremely high volumes and speeds in order to capture newer markets and a larger customer base. Once Big Data is converted into nuggets of information, things become fairly straightforward for most business enterprises: they now know what their customers want, which products are fast moving, what users expect from customer service, how to speed up the time to market, ways to reduce costs, and methods to build economies of scale in a highly efficient manner. Big Data thus leads to substantial benefits for organizations, and hence naturally there is a huge amount of interest in it from all around the world.

There is no hard and fast rule about exactly what size a database needs to be for the data inside of it to be considered "big." Instead, what typically defines big data is the need for new techniques and tools to be able to process it. In order to use big data, you need programs that span multiple physical and/or virtual machines working together in concert to process all of the data in a reasonable span of time.

Getting programs on multiple machines to work together in an efficient way, so that each program knows which components of the data to process, and then being able to put the results from all the machines together to make sense of a large pool of data, takes special programming techniques. Since it is typically much faster for programs to access data stored locally instead of over a network, the distribution of data across a cluster and how those machines are networked together are also important considerations when thinking about big data problems.

Machine Learning is the latest buzzword floating around, and it deserves to be, as it is one of the most interesting subfields of Computer Science. Let's try to understand Machine Learning in layman's terms. Consider that you are trying to toss a paper ball into a dustbin. After the first attempt, you realize that you have put too much force into it. After the second attempt, you realize you are closer to the target but you need to increase your throw angle. What is happening here is that after every throw we learn something and improve the end result. We are programmed to learn from our experience.
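The cluster-style processing described above, splitting data across workers and then combining their results, can be sketched on a single machine in a few lines of Python. This is only a stand-in: a thread pool plays the role of the cluster nodes, and the word-count job and all names are illustrative, not part of any real big-data framework.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def count_words(chunk):
    """Map step: each worker counts the words in its own slice of the data."""
    return Counter(chunk.split())

def word_count(documents, workers=4):
    """Split the data, process the pieces in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, documents))
    # Reduce step: combine the per-worker results into one global count.
    total = Counter()
    for part in partial_counts:
        total += part
    return total

docs = ["big data needs big tools",
        "data drives modern organizations",
        "big decisions need good data"]
print(word_count(docs).most_common(2))
```

In a real deployment the map and reduce steps run on separate machines and the framework (e.g. Hadoop MapReduce) handles data placement and network transfer, which is exactly what makes data locality the important design consideration mentioned above.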

This implies that the tasks with which machine learning is concerned offer a fundamentally operational definition, rather than defining the field in cognitive terms; this follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence". Within the field of data analytics, machine learning is used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data set (input).
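As a concrete, if minimal, illustration of "learning from historical relationships and trends", consider fitting a least-squares model on past data and using it to predict an unseen case. This is a generic sketch with hypothetical data (hours of study vs. exam score), not the method used in this paper.

```python
import numpy as np

# Hypothetical historical data: hours of study vs. exam score, with noise.
rng = np.random.default_rng(0)
hours = np.linspace(1, 10, 20)
scores = 8.0 * hours + 20.0 + rng.normal(0, 2.0, size=hours.size)

# "Learning" here means fitting parameters that explain the historical trend.
A = np.vstack([hours, np.ones_like(hours)]).T
(slope, intercept), *_ = np.linalg.lstsq(A, scores, rcond=None)

# Prediction: apply the learned relationship to an unseen input.
predicted = slope * 12.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 1))
```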

PREVIOUS WORK
Classification and clustering of gene expression
in the form of microarray or RNA-seq data are well
studied. There are various approaches for the
classification of cancer cells and healthy cells using gene
expression profiles and supervised learning models. The
self-organizing map (SOM) was used to analyze leukemia
cancer cells. A support vector machine (SVM) with a dot
product kernel has been applied to the diagnosis of
ovarian, leukemia, and colon cancers. SVMs with
nonlinear kernels (polynomial and Gaussian) were also
used for classification of breast cancer tissues from
microarray data. Unsupervised learning techniques are
capable of finding global patterns in gene expression data.
Gene clustering represents various groups of
similar genes based on similar expression patterns.
Hierarchical clustering and maximal margin linear
programming are examples of this learning and they have
been used to classify colon cancer cells. K-nearest
neighbors (KNN) unsupervised learning has also been applied to breast cancer data. Due to the large number of genes, the high amount of noise in gene expression data, and the complexity of biological networks, there is a need to deeply analyze the raw data and exploit the important subsets of genes. Regarding this matter, other techniques such as principal component analysis (PCA) have been proposed for dimensionality reduction of expression profiles to aid clustering of the relevant genes. PCA uses an orthogonal transformation to map high-dimensional data to linearly uncorrelated components. However, PCA reduces the dimensionality of the data linearly, and it may not extract some nonlinear relationships of the data. In contrast, other approaches such as kernel PCA (KPCA) may be capable of uncovering these nonlinear relationships.

ALGORITHM

Recurrent Neural Network (RNN)

A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them. Thus the RNN came into existence, which solves this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about the sequence.
As part of this tutorial we will implement a recurrent neural network based language model. The applications of language models are two-fold: First, a language model allows us to score arbitrary sentences based on how likely they are to occur in the real world. This gives us a measure of grammatical and semantic correctness. Such models are typically used as part of Machine Translation systems. Secondly, a language model allows us to generate new text (I think that's the much cooler application). Training a language model on Shakespeare allows us to generate Shakespeare-like text.

The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that's a very bad idea: if you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences.

The above diagram shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word.

1. x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.
2. s_t is the hidden state at time step t. It is the "memory" of the network. s_t is calculated based on the previous hidden state and the input at the current step: s_t = f(U x_t + W s_{t-1}). The function f usually is a nonlinearity such as tanh or ReLU. s_{-1}, which is required to calculate the first hidden state, is typically initialized to all zeroes.
3. o_t is the output at step t. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary: o_t = softmax(V s_t).
4. You can think of the hidden state s_t as the memory of the network: it captures information about what happened in all the previous time steps. The output at step t is calculated solely based on the memory at time t. As briefly mentioned above, it is a bit more complicated in practice because s_t typically cannot capture information from too many time steps ago. Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters (the U, V, W above) across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs, and it greatly reduces the total number of parameters we need to learn.
5. The above diagram has outputs at each time step, but depending on the task this may not be necessary. For example, when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.

RNNs have shown great success in many NLP tasks. At this point I should mention that the most commonly used type of RNNs are LSTMs, which are much better at capturing long-term dependencies than vanilla RNNs are. But don't worry: LSTMs are essentially the same thing as the RNN we will develop in this tutorial, they just have a different way of computing the hidden state. We'll cover LSTMs in more detail in a later post.

RNN Extensions

Over the years researchers have developed more sophisticated types of RNNs to deal with some of the shortcomings of the vanilla RNN model. We will cover them in more detail in a later post, but I want this section to serve as a brief overview so that you are familiar with the taxonomy of models.

Bidirectional RNNs are based on the idea that the output at time t may not only depend on the previous elements in the sequence, but also on future elements. For example, to predict a missing word in a sequence you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other, and the output is then computed based on the hidden states of both RNNs.
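The recurrence described in the numbered list above can be written directly in NumPy. This is a generic sketch with made-up sizes, using tanh as the nonlinearity f and the same U, V, W at every time step; it is not the paper's trained model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, V, W):
    """Unroll the RNN over the input sequence xs (one one-hot vector per step)."""
    hidden_size = U.shape[0]
    s = np.zeros(hidden_size)       # s_{-1}: the first hidden state starts at zero
    states, outputs = [], []
    for x in xs:                    # the same U, V, W are reused at every step
        s = np.tanh(U @ x + W @ s)  # s_t = f(U x_t + W s_{t-1})
        o = softmax(V @ s)          # o_t = softmax(V s_t)
        states.append(s)
        outputs.append(o)
    return states, outputs

rng = np.random.default_rng(0)
vocab, hidden = 8, 4
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab, hidden))

sentence = [np.eye(vocab)[i] for i in (1, 3, 2, 5, 0)]   # a 5-word sequence
states, outputs = rnn_forward(sentence, U, V, W)
print(len(outputs), outputs[-1].sum())   # one output per word; probabilities sum to 1
```

Unrolling the loop five times is exactly the "5-layer network, one layer per word" picture described above, except that all five layers share the same three weight matrices.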
Deep (Bidirectional) RNNs are similar to Bidirectional RNNs, only that we now have multiple layers per time step. In practice this gives us a higher learning capacity (but we also need a lot of training data).

Training through RNN

1. A single time step of the input is provided to the network.
2. The network then calculates its current state using the current input and the previous state.
3. The current state becomes the previous state for the next time step.
4. One can go through as many time steps as the problem requires, joining the information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual output, i.e. the target output, and the error is generated.
7. The error is then back-propagated to the network to update the weights, and hence the network (RNN) is trained.
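The seven training steps above can be sketched as a toy program. Note one deliberate simplification: for brevity, a numerical finite-difference gradient stands in for true backpropagation-through-time, but the flow (forward through every time step, compare the final output with the target, update the weights from the error) is the same. All sizes and data are made up.

```python
import numpy as np

def forward(xs, U, W, v):
    """Steps 1-5: run through every time step; the final state gives the output."""
    s = np.zeros(U.shape[0])
    for x in xs:                    # step 3: current state becomes the previous one
        s = np.tanh(U @ x + W @ s)
    return v @ s                    # scalar prediction from the final state

def squared_error(xs, target, U, W, v):
    return (forward(xs, U, W, v) - target) ** 2   # step 6: compare with target

def train(xs, target, U, W, v, lr=0.1, eps=1e-5, epochs=200):
    """Step 7, with a numerical gradient standing in for backpropagation."""
    for _ in range(epochs):
        for P in (U, W, v):
            flat = P.ravel()
            grad = np.zeros_like(flat)
            for i in range(flat.size):
                old = flat[i]
                flat[i] = old + eps
                up = squared_error(xs, target, U, W, v)
                flat[i] = old - eps
                down = squared_error(xs, target, U, W, v)
                flat[i] = old
                grad[i] = (up - down) / (2 * eps)
            flat -= lr * grad       # update the weights in place

rng = np.random.default_rng(1)
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
U = rng.normal(scale=0.5, size=(3, 2))
W = rng.normal(scale=0.5, size=(3, 3))
v = rng.normal(scale=0.5, size=3)

train(xs, 0.5, U, W, v)
err = squared_error(xs, 0.5, U, W, v)
print(err)
```

Real frameworks compute the same gradients analytically by backpropagating the error through the unrolled time steps, which is far cheaper than the perturbation loop used here.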
Advantages of Recurrent Neural Networks

1. An RNN remembers information through time. It is useful in time-series prediction precisely because of this ability to take previous inputs into account; the variant built around this idea is called Long Short-Term Memory (LSTM).
2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.

 An RNN converts independent activations into dependent activations by providing the same weights and biases to all the layers, thus reducing the complexity of a growing number of parameters, and it memorizes each previous output by giving each output as input to the next hidden layer.
 Hence these three layers can be joined together, such that the weights and biases of all the hidden layers are the same, into a single recurrent layer.

Agglomerative Hierarchical Clustering

Hierarchical clustering algorithms fall into two categories: top-down and bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering, or HAC. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. Check out the graphic below for an illustration before moving on to the algorithm steps.
1. We begin by treating each data point as a single cluster, i.e. if there are X data points in our dataset then we have X clusters. We then select a distance metric that measures the distance between two clusters. As an example we will use average linkage, which defines the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster.

2. On each iteration we combine two clusters into one. The two clusters to be combined are selected as those with the smallest average linkage, i.e. according to our selected distance metric, these two clusters have the smallest distance between each other and therefore are the most similar and should be combined.

3. Step 2 is repeated until we reach the root of the tree, i.e. until we only have one cluster which contains all data points. In this way we can select how many clusters we want in the end, simply by choosing when to stop combining the clusters, i.e. when we stop building the tree!
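The three steps can be reproduced with SciPy's hierarchical-clustering routines. This is a generic sketch on made-up 2-D points, with `average` as the linkage from step 1 and the "when to stop building the tree" choice expressed as cutting the dendrogram into two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two obvious groups of points (X = 10 data points, so 10 starting clusters).
data = np.vstack([rng.normal(0.0, 0.3, size=(5, 2)),
                  rng.normal(5.0, 0.3, size=(5, 2))])

# Steps 1-3: repeatedly merge the two closest clusters (average linkage)
# until a single cluster remains; `tree` records the full merge history.
tree = linkage(data, method="average")

# Stopping early = cutting the dendrogram, here into k = 2 clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Because the two groups are well separated, the first five points receive one label and the last five the other, matching the intuition that the most similar clusters get merged first.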

Hierarchical clustering does not require us to specify the


number of clusters and we can even select which number
of clusters looks best since we are building a tree.
Additionally, the algorithm is not sensitive to the choice of distance metric: all of them tend to work equally well, whereas with other clustering algorithms the choice of distance metric is critical. A particularly good use case for hierarchical clustering methods is when the underlying data has a hierarchical structure and you want to recover that hierarchy; other clustering algorithms cannot do this.

COMPARISON TABLE

HMM Representation
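Since the generative model in the abstract is built on Hidden Markov Models, the forward algorithm, which computes the likelihood of an observation sequence under an HMM, is worth sketching. The parameters below are illustrative toy numbers, not the ones estimated in this work.

```python
import numpy as np

def hmm_forward(obs, pi, A, B):
    """Forward algorithm: probability of an observation sequence under an HMM.
    pi: initial state distribution, A: state transition matrix,
    B: emission matrix (rows = hidden states, columns = observed symbols)."""
    alpha = pi * B[:, obs[0]]                 # initialize with the first symbol
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]    # propagate and weight by emission
    return alpha.sum()

# Toy 2-state, 2-symbol model (hypothetical numbers for illustration).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

p = hmm_forward([0, 1, 0], pi, A, B)
print(p)   # likelihood of observing the sequence 0, 1, 0
```

The EM procedure mentioned in the abstract repeatedly runs computations of this kind (forward and backward passes) to re-estimate pi, A and B from observed sequences when the true hidden states are unknown.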

PROPOSED METHOD

The proposed method applies RNN (Recurrent Neural Network) clustering to the diagnosis task.

Gene-Panels

For most clinical applications, the use of gene-panels to sequence only a discrete number of genes of interest has been the method of choice, because of its cost-efficiency, and because at the same time it achieves high coverage of the ROIs and offers simplicity in the raw and subsequent data analyses. When the number of genes sequenced is restricted to the few already analysed in previous diagnostic tests using traditional methods, this is normally called targeted re-sequencing.

Different protocols are available to design and capture panels of genes and other ROIs. In most cases, companies providing the library preparation kits offer user-friendly online tools to design the hybridisation probes or the PCR oligos to enrich the desired ROIs.

Whole-Exome-Sequencing

Protocols/kits to enrich the library for all exons are available from several companies and use the same or similar technologies as mentioned for the enrichment of gene-panels. Following sequencing, raw data analysis is relevant in order to determine the quality of the experiments, checking for difficulties that may have occurred at the level of library preparation and/or sequencing. Both steps are crucial to obtain good quality data. A high sequence-on-target yield of more than 90% of the ROIs and coverage higher than 20× per nucleotide is necessary for sufficient specificity and sensitivity in mutation detection. Normally, when less than 90% of the ROIs are sequenced but coverage is high, sample processing was suboptimal; when the ROIs are sufficiently sequenced (>90%) but coverage is low, the sequencing reaction was suboptimal and re-sequencing is required.

Data Analyses and Interpretation

After the raw data are assessed for sufficient quality, data analysis and interpretation continue using different pipelines depending on the approach used (gene-panel, WES, WGS or targeted-RNA-seq) and on the questions that need to be answered.

Base-calling is performed using software like the Casava pipeline, which produces Fast-Q files (raw initial data); these can then be aligned to the human reference genome using the Burrows-Wheeler-Alignment tool (BWA). Single-base variants can be identified using Sequence-Alignment-Map tools (SAM) and annotated. Additional software and scripts (normally developed in-house) match the data from NGS analysis to variants in reference databases.

Gene-Panels in Cancer Syndromes

Initial technical difficulties were related to suboptimal enrichment of GC-rich regions and to problems in the bioinformatics pipeline to correctly call indels, and were solved by subsequent improvements in capture protocols and data-analysis tools. Most cases carried BRCA1 gene mutations (11% of the subjects), followed by BRCA2 (6%) and 10 additional genes (6%). Loss of heterozygosity in the wild-type allele was confirmed in more than 80% of the cases.

Clinical Relevance

Numerous additional publications confirmed this potential utility. In HBOC, all studies consistently indicated that genes besides BRCA1 and BRCA2 are mutated and confer a moderate to high cancer risk. In a study on 708 consecutive patients suspected of HBOC, besides 69 germline deleterious alterations in BRCA1 and BRCA2, additional putative pathogenic mutations were identified in PALB2 (almost 1% of the patients), TP53, CHEK2, ATM, RAD51C, MSH2, PMS2 and MRE11A (between 0.4% and 0.7% of the patients), followed by RAD50, NBS1, CDH1 and BARD1 (about 0.1%).

CONCLUSION

This report presents a solid theoretical construction of how to model cancer progression and how to deal with the high number of parameters to estimate. The algorithm developed in this project seems to perform very well when detecting the hidden parameters of the model. However, there are some problems when the graph that represents cancer progression is too sparse, meaning a lack of information. New methods should be devised in order to solve this problem. Since the algorithm was implemented in MatLab, the performance was significantly slower than expected. In retrospect, it seems that coding in C++ would have been a better choice. This is also the reason why a rigorous estimation of the transition parameters was not possible, a question that should also be addressed in future work.
