Reproducible Research: Concepts and Ideas

Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Replication

• The ultimate standard for strengthening scientific evidence is replication of findings and conducting studies with independent
  – Investigators
  – Data
  – Analytical methods
  – Laboratories
  – Instruments
• Replication is particularly important in studies that can impact broad policy or regulatory decisions
What’s Wrong with Replication?

• Some studies cannot be replicated
  – No time, opportunistic
  – No money
  – Unique
• Reproducible Research: Make analytic data and code available so that others may reproduce findings
How Can We Bridge the Gap?

[Figure: a spectrum of evidence standards, from Replication at one end, through Reproducibility in the middle, to Nothing at the other end.]
Why Do We Need Reproducible Research?

• New technologies increasing data collection throughput; data are more complex and extremely high dimensional
• Existing databases can be merged into new “megadatabases”
• Computing power is greatly increased, allowing more sophisticated analyses
• For every field “X” there is a field “Computational X”
Example: Reproducible Air Pollution and Health Research

• Estimating small (but important) health effects in the presence of much stronger signals
• Results inform substantial policy decisions, affect many stakeholders
  – EPA regulations can cost billions of dollars
• Complex statistical methods are needed and subjected to intense scrutiny
Internet-based Health and Air Pollution Surveillance System (iHAPSS)
http://www.ihapss.jhsph.edu
Research Pipeline

[Figure: the research pipeline from Author to Reader — presentation code produces numerical summaries, which become the text of the article delivered to the reader.]
Recent Developments in Reproducible Research: The Duke Saga
Recent Developments in Reproducible Research: The IOM Report
In the Discovery/Test Validation stage of omics-based tests:

• Data/metadata used to develop the test should be made publicly available
• The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available
• “Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps, that have been described in this chapter. All aspects of the analysis need to be transparently reported.”
What Do We Need?

• Analytic data are available
• Analytic code is available
• Documentation of code and data
• Standard means of distribution
Who Are the Players?

• Authors
  – Want to make their research reproducible
  – Want tools for RR to make their lives easier (or at least not much harder)
• Readers
  – Want to reproduce (and perhaps expand upon) interesting findings
  – Want tools for RR to make their lives easier
Challenges

• Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
• Readers must download data/results individually and piece together which data go with which code sections, etc.
• Readers may not have the same resources as authors
• Few tools to help authors/readers (although the toolbox is growing!)
In Reality…

• Authors
  – Just put stuff on the web
  – (Infamous) journal supplementary materials
  – There are some central databases for various fields (e.g., biology, ICPSR)
• Readers
  – Just download the data and (try to) figure it out
  – Piece together the software and run it
Literate (Statistical) Programming

• An article is a stream of text and code
• Analysis code is divided into text and code “chunks”
• Each code chunk loads data and computes results
• Presentation code formats results (tables, figures, etc.)
• Article text explains what is going on
• Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable documents
Literate (Statistical) Programming

• Literate programming is a general concept that requires
  1. A documentation language (human readable)
  2. A programming language (machine readable)
• Sweave uses LaTeX and R as the documentation and programming languages
• Sweave was developed by Friedrich Leisch (member of the R Core) and is maintained by R Core
• Main web site: http://www.statistik.lmu.de/~leisch/Sweave
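As a concrete illustration of the LaTeX-plus-R combination described above, here is a minimal sketch of what a Sweave source file (`.Rnw`) might look like; the file name and chunk label are hypothetical, but the `<<>>=`/`@` chunk delimiters and `\Sexpr{}` inline expressions are standard Sweave syntax:

```latex
\documentclass{article}
\begin{document}

We simulate 100 standard normal values and report their mean.

% A code chunk: R code sits between <<>>= and @
<<simulate, echo=TRUE>>=
set.seed(1)
x <- rnorm(100)
mean(x)
@

% Inline presentation code: \Sexpr{} embeds a computed value in the text
The observed mean was \Sexpr{round(mean(x), 2)}.

\end{document}
```

Weaving this file (e.g., with `Sweave("doc.Rnw")` in R) runs the chunks and produces a `.tex` file with results inserted; `Stangle()` tangles it, extracting just the R code.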
Sweave Limitations

• Sweave has many limitations
• Focused primarily on LaTeX, a difficult-to-learn markup language used only by weirdos
• Lacks features like caching, multiple plots per chunk, mixing programming languages, and many other technical items
• Not frequently updated or very actively developed
Literate (Statistical) Programming

• knitr is an alternative (more recent) package
• Brings together many features added on to Sweave to address its limitations
• knitr uses R as the programming language (although others are allowed) and a variety of documentation languages
  – LaTeX, Markdown, HTML
• knitr was developed by Yihui Xie (while a graduate student in statistics at Iowa State)
• See http://yihui.name/knitr/
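To show the Markdown option mentioned above, here is a minimal sketch of a knitr R Markdown document; the title and chunk label are hypothetical, but the ```` ```{r} ```` chunk fences and backtick-`r` inline expressions are standard knitr/R Markdown syntax:

````markdown
---
title: "A Minimal knitr Report"
---

We simulate 100 standard normal values and report their mean.

```{r simulate}
set.seed(1)
x <- rnorm(100)
mean(x)
```

The observed mean was `r round(mean(x), 2)`.
````

Running `knitr::knit("report.Rmd")` in R executes the chunks and writes a plain Markdown file with the results woven in, which can then be converted to HTML or other formats.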
Summary

• Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate
• Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available
• There is a growing number of tools for creating reproducible documents