Reproducible Research: Concepts and Ideas
Reproducible Research

Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Replication
• The ultimate standard for strengthening scientific evidence is replication of findings and conducting studies with independent
– Investigators
– Data
– Analytical methods
– Laboratories
– Instruments
• Replication is particularly important in studies that can impact broad policy or regulatory decisions
What's Wrong with Replication?
• Some studies cannot be replicated
– No time, opportunistic
– No money
– Unique
• Reproducible Research: Make analytic data and code available so that others may reproduce findings
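How Can We Bridge the Gap?
[Figure: "Replication" at one end, "?" in the middle, "Nothing" at the other end]
How Can We Bridge the Gap?
[Figure: "Replication" at one end, "Reproducibility" in the middle, "Nothing" at the other end]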
Why Do We Need Reproducible Research?
• New technologies increasing data collection throughput; data are more complex and extremely high dimensional
• Existing databases can be merged into new “megadatabases”
• Computing power is greatly increased, allowing more sophisticated analyses
• For every field “X” there is a field “Computational X”
Example: Reproducible Air Pollution and Health Research
• Estimating small (but important) health effects in the presence of much stronger signals
• Results inform substantial policy decisions, affect many stakeholders
– EPA regulations can cost billions of dollars
• Complex statistical methods are needed and subjected to intense scrutiny
Internet-based Health and Air Pollution Surveillance System (iHAPSS)

http://www.ihapss.jhsph.edu
Research Pipeline
[Figure: pipeline diagram showing only "Article" and "Reader"]
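Research Pipeline
[Figure: pipeline diagram. On the author's side: Measured Data → (Processing code) → Analytic Data → (Analytic code) → Computational Results → (Presentation code) → Figures, Tables, Numerical Summaries. These, together with Text, make up the Article, which is what the Reader sees.]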
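As a rough illustration of the pipeline above, the sketch below chains the three kinds of code (processing, analytic, presentation) that sit between measured data and the article. Everything in it is hypothetical: the function names, the toy readings, and the use of a mean as the "analysis" are assumptions made for illustration and do not come from the original slides.

```python
from statistics import mean

# Hypothetical measured data: raw readings with one missing value (None).
measured_data = [12.1, None, 9.8, 11.4, 10.7]

def processing_code(raw):
    """Measured data -> analytic data: drop missing readings."""
    return [x for x in raw if x is not None]

def analytic_code(analytic_data):
    """Analytic data -> computational results."""
    return {"n": len(analytic_data), "mean": mean(analytic_data)}

def presentation_code(results):
    """Computational results -> a numerical summary for the article."""
    return f"n = {results['n']}, mean = {results['mean']:.2f}"

analytic_data = processing_code(measured_data)
results = analytic_code(analytic_data)
summary = presentation_code(results)
print(summary)
```

A reader who receives only `summary` cannot tell which readings were dropped or how; sharing the analytic data and the code for each stage is what lets them rerun the chain.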
Recent Developments in Reproducible Research
The Duke Saga
The IOM Report
In the Discovery/Test Validation stage of omics-based tests:
• Data/metadata used to develop test should be made publicly available
• The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available
• “Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps, that have been described in this chapter. All aspects of the analysis need to be transparently reported.”
What Do We Need?
• Analytic data are available
• Analytic code is available
• Documentation of code and data
• Standard means of distribution
Who Are the Players?
• Authors
– Want to make their research reproducible
– Want tools for RR to make their lives easier (or at least not much harder)
• Readers
– Want to reproduce (and perhaps expand upon) interesting findings
– Want tools for RR to make their lives easier
Challenges
• Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
• Readers must download data/results individually and piece together which data go with which code sections, etc.
• Readers may not have the same resources as authors
• Few tools to help authors/readers (although toolbox is growing!)
In Reality…
• Authors
– Just put stuff on the web
– (Infamous) Journal supplementary materials
– There are some central databases for various fields (e.g. biology, ICPSR)
• Readers
– Just download the data and (try to) figure it out
– Piece together the software and run it
Literate (Statistical) Programming
• An article is a stream of text and code
• Analysis code is divided into text and code “chunks”
• Each code chunk loads data and computes results
• Presentation code formats results (tables, figures, etc.)
• Article text explains what is going on
• Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable documents
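To make weaving and tangling concrete, here is a minimal sketch in Python. It assumes a made-up chunk syntax, loosely modeled on noweb-style markers, in which a line `<<code>>=` opens a code chunk and a line `@` closes it. The `parse`, `tangle`, and `weave` functions are hypothetical names for illustration; this is not how Sweave or knitr is implemented.

```python
import contextlib
import io

# A tiny literate document: prose with one code chunk.
# Hypothetical syntax: "<<code>>=" opens a chunk, "@" closes it.
DOCUMENT = """\
We compute the average of three measurements.
<<code>>=
values = [2, 4, 6]
print(sum(values) / len(values))
@
The average is reported above.
"""

def parse(document):
    """Split the document into ("text", str) and ("code", str) pieces."""
    pieces, buffer, kind = [], [], "text"
    for line in document.splitlines():
        if kind == "text" and line.strip() == "<<code>>=":
            if buffer:
                pieces.append(("text", "\n".join(buffer)))
            buffer, kind = [], "code"
        elif kind == "code" and line.strip() == "@":
            pieces.append(("code", "\n".join(buffer)))
            buffer, kind = [], "text"
        else:
            buffer.append(line)
    if buffer:
        pieces.append((kind, "\n".join(buffer)))
    return pieces

def tangle(document):
    """Machine-readable output: only the code chunks, concatenated."""
    return "\n".join(body for kind, body in parse(document) if kind == "code")

def weave(document):
    """Human-readable output: prose, code, and the code's printed results."""
    namespace, out = {}, []
    for kind, body in parse(document):
        out.append(body)
        if kind == "code":
            captured = io.StringIO()
            with contextlib.redirect_stdout(captured):
                exec(body, namespace)  # run the chunk; state is shared across chunks
            out.append(captured.getvalue().rstrip())
    return "\n".join(out)

print(tangle(DOCUMENT))
print("---")
print(weave(DOCUMENT))
```

Both outputs come from the same source, so the text, the code, and the reported result cannot drift apart: `tangle` gives a script a machine can run, and `weave` gives a document a person can read.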
Literate (Statistical) Programming
• Literate programming is a general concept that requires
1. A documentation language (human readable)
2. A programming language (machine readable)
• Sweave uses LaTeX and R as the documentation and programming languages
• Sweave was developed by Friedrich Leisch (member of the R Core) and is maintained by R Core
• Main web site: http://www.statistik.lmu.de/~leisch/Sweave
Sweave Limitations
• Sweave has many limitations
• Focused primarily on LaTeX, a difficult-to-learn markup language used only by weirdos
• Lacks features like caching, multiple plots per chunk, mixing programming languages and many other technical items
• Not frequently updated or very actively developed
Literate (Statistical) Programming
• knitr is an alternative (more recent) package
• Brings together many features added on to Sweave to address limitations
• knitr uses R as the programming language (although others are allowed) and a variety of documentation languages
– LaTeX, Markdown, HTML
• knitr was developed by Yihui Xie (while a graduate student in statistics at Iowa State)
• See http://yihui.name/knitr/
Summary
• Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate
• Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available
• There is a growing number of tools for creating reproducible documents
