Integrated Gene Expression Probabilistic Models for Cancer Staging

Gil Alterovitz1,2, Andrew H. Xia2,3, Jeremy Warner4
1MIT PRIMES, Cambridge, MA; 2Harvard Medical School, Boston, MA; 3The Rivers School, Weston, MA; 4Vanderbilt University, Nashville, TN


The current system for classifying cancer patients' stages was introduced more than one hundred years ago, and many parts of it are outdated. Because the current system emphasizes invasive surgical procedures that can have undesirable outcomes, there has been a movement to develop a new taxonomy that uses molecular signatures to avoid surgical testing. This project explores the issues with the current classification system and potential ways to classify cancer patients' stages more effectively. Computerization has made a vast amount of cancer data available online. However, a significant portion of the data is incomplete and some crucial information is missing, so we explored the possibility of recovering missing cancer data. Using various methods, we have shown that cancer stages cannot simply be extrapolated from incomplete data. We also studied a new approach based on RNA sequencing data. RNA sequencing could potentially become a cost-efficient way to determine a cancer patient's stage. We obtained promising results using RNA sequencing data in breast cancer staging.


Results
With clinical data, there was evidence that the clinical T, N, and M stages given in TCGA may not yield the correct overall TNM cancer stage. Also, the staging data is not random, as methods 3-6 show no Kappa relationship. With missing data, part 2 of the project becomes more necessary. A correlation was found between cancer staging and patients' RNA sequencing data. For example, the most significant gene found, retinoblastoma binding protein 8 (RBBP8), has been shown to affect breast cancer development. Other genes may potentially have a cause-effect relationship, pending further research.
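The Kappa comparison mentioned above can be illustrated with a small sketch of Cohen's kappa, which measures agreement between two stage assignments beyond what chance alone would produce. The function name `cohen_kappa` and the stage lists are hypothetical illustrations, not the project's code or TCGA data. The example imputes every patient's stage as the most common stage, which is one plausible way an imputation method could end up with a kappa of zero.

```python
from collections import Counter


def cohen_kappa(reference, predicted):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    # Observed agreement: fraction of patients with identical labels.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    # Chance agreement: product of the marginal label frequencies.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[k] * pred_counts[k] for k in ref_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists are constant and carry the same label
    return (observed - expected) / (1.0 - expected)


# Hypothetical calculated stages (illustrative values only, not TCGA data).
calculated = ["I", "II", "II", "III", "II", "IV", "I", "II"]

# Impute every stage as the most common one in the sample.
most_common = Counter(calculated).most_common(1)[0][0]
imputed = [most_common] * len(calculated)

print(cohen_kappa(calculated, imputed))
```

Because the imputed list is constant, its observed agreement equals its chance agreement, so the kappa is zero even though half of the labels match.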


There were two parts to this project. The first part involved looking at clinical cancer data from The Cancer Genome Atlas (TCGA) and analyzing it with data tree functions. The second part involved comparing RNA sequencing data to the TCGA clinical data and looking for correlation.

In the first part of the project, the patients' T, N, and M cancer stages, as recorded in TCGA, were entered into a data tree that outputs the TNM stage per AJCC standard staging. These calculated stages were compared against the overall TNM stage as recorded in TCGA. Then, five different methods of imputed stage generation were evaluated against the calculated stages: 1) equal assignment of stages (25% each to I/II/III/IV); 2) assignment to the most common national stage; 3) assignment to the most common TCGA stage; 4) assignment by national distribution of stages; 5) assignment by TCGA distribution of stages.

In the second part of the project, patients with RNA sequencing data were linked to their clinical data from the first part of the project. With clinical stages already determined in the first part, the patients' RNA sequencing data was analyzed to find any correlation between certain genes and cancer staging. A t-test was conducted on each of the available types of RNA sequencing data (raw counts normalized, raw counts scaled estimate, raw counts of genes).
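The per-gene t-test described above might be sketched as follows. The poster does not say which variant of the t-test was used or how patients were grouped, so this sketch assumes Welch's unequal-variance t-test and a hypothetical early-stage versus late-stage split. The gene names, expression values and the `welch_t` helper are all illustrative assumptions.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error contributed by each group.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    se2 = va + vb
    if se2 == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical normalized expression values: gene -> (early stage, late stage).
expression = {
    "GENE_A": ([5.1, 4.9, 5.3, 5.0], [7.9, 8.2, 8.0, 8.4]),
    "GENE_B": ([3.0, 3.4, 2.9, 3.1], [3.1, 3.3, 3.0, 3.2]),
}

# Rank genes by the magnitude of their t statistic, largest first.
ranked = sorted(
    expression,
    key=lambda gene: abs(welch_t(*expression[gene])[0]),
    reverse=True,
)

for gene in ranked:
    t, df = welch_t(*expression[gene])
    print(f"{gene}: t = {t:.2f}, df = {df:.1f}")
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly. With thousands of genes tested at once, a multiple-comparison correction would normally be applied before calling any gene significant.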

Conclusion
- Creating a data tree function to analyze cancer patients' staging information and comparing its output to the TCGA staging information revealed many conflicts, suggesting that the data may not be completely accurate.
- The method of using RNA sequencing data to analyze cancer patients' staging information has proven to be effective.
- Further research in this area may reveal stage-specific patterns of gene expression, which could allow for less invasive cancer staging.

Acknowledgements
Thank you very much to Andrew Xia's MIT PRIMES mentors, Slava Gerovitch and Pavel Etingof.

