Reproducible Research: Concepts and Ideas

Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Replication

• The ultimate standard for strengthening scientific evidence is replication of findings and conducting studies with independent
  – Investigators
  – Data
  – Analytical methods
  – Laboratories
  – Instruments
• Replication is particularly important in studies that can impact broad policy or regulatory decisions
What’s Wrong with Replication?

• Some studies cannot be replicated
  – No time, opportunistic
  – No money
  – Unique
• Reproducible Research: Make analytic data and code available so that others may reproduce findings
How Can We Bridge the Gap?

[Figure: a spectrum of evidence standards, from Replication at one end, through Reproducibility in the middle, to Nothing at the other end.]
Why Do We Need Reproducible Research?

• New technologies increasing data collection throughput; data are more complex and extremely high dimensional
• Existing databases can be merged into new “megadatabases”
• Computing power is greatly increased, allowing more sophisticated analyses
• For every field “X” there is a field “Computational X”
Example: Reproducible Air Pollution and Health Research

• Estimating small (but important) health effects in the presence of much stronger signals
• Results inform substantial policy decisions, affect many stakeholders
  – EPA regulations can cost billions of dollars
• Complex statistical methods are needed and subjected to intense scrutiny
Internet-based Health and Air Pollution Surveillance System (iHAPSS)
http://www.ihapss.jhsph.edu
Research Pipeline

[Figure: the research pipeline from Author to Reader — presentation code produces numerical summaries, which become the text of the article delivered to the reader.]
Recent Developments in Reproducible Research: The Duke Saga
Recent Developments in Reproducible Research: The IOM Report
In the Discovery/Test Validation stage of omics-based tests:

• Data/metadata used to develop the test should be made publicly available
• The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available
• “Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps, that have been described in this chapter. All aspects of the analysis need to be transparently reported.”
What Do We Need?

• Analytic data are available
• Analytic code is available
• Documentation of code and data
• Standard means of distribution
Who Are the Players?

• Authors
  – Want to make their research reproducible
  – Want tools for RR to make their lives easier (or at least not much harder)
• Readers
  – Want to reproduce (and perhaps expand upon) interesting findings
  – Want tools for RR to make their lives easier
Challenges

• Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
• Readers must download data/results individually and piece together which data go with which code sections, etc.
• Readers may not have the same resources as authors
• Few tools to help authors/readers (although the toolbox is growing!)
In Reality…

• Authors
  – Just put stuff on the web
  – (Infamous) journal supplementary materials
  – There are some central databases for various fields (e.g., biology, ICPSR)
• Readers
  – Just download the data and (try to) figure it out
  – Piece together the software and run it
Literate (Statistical) Programming

• An article is a stream of text and code
• Analysis code is divided into text and code “chunks”
• Each code chunk loads data and computes results
• Presentation code formats results (tables, figures, etc.)
• Article text explains what is going on
• Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable documents
Literate (Statistical) Programming

• Literate programming is a general concept that requires
  1. A documentation language (human readable)
  2. A programming language (machine readable)
• Sweave uses LaTeX and R as the documentation and programming languages
• Sweave was developed by Friedrich Leisch (member of the R Core) and is maintained by R Core
• Main web site: http://www.statistik.lmu.de/~leisch/Sweave
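As a concrete illustration of the LaTeX-plus-R combination described above, here is a minimal sketch of what a Sweave source file (`.Rnw`) might look like; the file name and chunk label are hypothetical, but the `<<>>=`/`@` chunk delimiters and `\Sexpr{}` inline expressions are standard Sweave syntax:

```latex
\documentclass{article}
\begin{document}

We simulate 100 standard normal values and report their mean.

% A code chunk: R code sits between <<>>= and @
<<simulate, echo=TRUE>>=
set.seed(1)
x <- rnorm(100)
mean(x)
@

% Inline presentation code: \Sexpr{} embeds a computed value in the text
The observed mean was \Sexpr{round(mean(x), 2)}.

\end{document}
```

Weaving this file (e.g., with `Sweave("doc.Rnw")` in R) runs the chunks and produces a `.tex` file with results inserted; `Stangle()` tangles it, extracting just the R code.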
Sweave Limitations

• Sweave has many limitations
• Focused primarily on LaTeX, a difficult-to-learn markup language used only by weirdos
• Lacks features like caching, multiple plots per chunk, mixing programming languages, and many other technical items
• Not frequently updated or very actively developed
Literate (Statistical) Programming

• knitr is an alternative (more recent) package
• Brings together many features added on to Sweave to address its limitations
• knitr uses R as the programming language (although others are allowed) and a variety of documentation languages
  – LaTeX, Markdown, HTML
• knitr was developed by Yihui Xie (while a graduate student in statistics at Iowa State)
• See http://yihui.name/knitr/
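To show the Markdown option mentioned above, here is a minimal sketch of a knitr R Markdown document; the title and chunk label are hypothetical, but the ```` ```{r} ```` chunk fences and backtick-`r` inline expressions are standard knitr/R Markdown syntax:

````markdown
---
title: "A Minimal knitr Report"
---

We simulate 100 standard normal values and report their mean.

```{r simulate}
set.seed(1)
x <- rnorm(100)
mean(x)
```

The observed mean was `r round(mean(x), 2)`.
````

Running `knitr::knit("report.Rmd")` in R executes the chunks and writes a plain Markdown file with the results woven in, which can then be converted to HTML or other formats.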
Summary

• Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate
• Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available
• There is a growing number of tools for creating reproducible documents