
Research data and reproducibility at Nature
Philip Campbell
Publishing Better Science through Better Data meeting
NPG
14 November 2014

Contents
Data: opportunities and costs
Reproducibility: Nature's approaches

Aspiration: all scientific literature online, all data online, and for them to interoperate

Why is open data an urgent issue?

Closing the concept-data gap

Maintaining the credibility of science

Exploiting the data deluge & computational


potential

Combating fraud

Addressing planetary challenges

Supporting citizen science

Responding to citizens' demands for evidence

Restraining the Database State

Intelligent openness
Openness of data per se has no value. Open science is more
than disclosure.
Data must be:
Accessible
Intelligible (metadata)
Assessable
Re-usable
Only when these four criteria are fulfilled are data properly
open

The transition to open data

Pathfinder disciplines where benefit is recognised and habits are changing

Databases as publications
Hosts/suppliers of databases are publishers
They have a responsibility to curate and provide reliable access to content.
They may also deliver other services around their products.
They may provide the data as a public good or charge for access.

Worldwide Protein Data Bank (wwPDB)
The Worldwide Protein Data Bank (wwPDB) archive is the single
worldwide repository of information about the 3D structures of
large biological molecules, including proteins and nucleic acids.
As of January 2012, it held 78,477 structures; 8,120 were added in
2011, a rate of 677 per month. In 2011, an average of 31.6
million data files were downloaded per month. The total storage
requirement for the archive was 135 GB.
The total cost of the project is approximately $11-12 million per
year (including overhead), spread across the four
member sites. It employs 69 FTE staff. wwPDB estimates that $6-7
million of this covers expenses relating to the deposition and
curation of data.

UK Data Archive
The UK Data Archive, founded in 1967, is curator of the largest collection of digital data in the
social sciences in the United Kingdom. The UKDA is funded mainly by the Economic and Social
Research Council, the University of Essex and JISC, and is hosted at the University of Essex.
On average, around 2,600 (new or revised) files are uploaded to the repository monthly. (This
includes file packages, so the absolute number of files is higher.) The baseline size of the
main storage repository is <1 TB, though with multiple versions and files outside this
system, a total capacity of c. 10 TB is required.
The UKDA currently (26/1/2012) employs 64.5 people. The total expenditure of the UK Data
Archive in 2010-11 was approximately £3.43 million; total staff costs across the whole
organisation in 2010-11 were £2.43 million.
Non-staff costs in 2009-10 were approximately £580,000, but will be much higher in 2011-12,
i.e. almost £3 million, due to additional investment.

Institutional Repositories
(Tier 3)
Most university repositories in the UK have small amounts of staff
time. The Repositories Support Project survey in 2011 received
responses from 75 UK universities. It found that the average university
repository employed a total of 1.36 FTE across managerial,
administrative and technical roles. 40% of these repositories accept
research data. In the vast majority of cases (86%), the library has
lead responsibility for the repository.
ePrints Soton
ePrints Soton, founded in 2003, is the institutional repository for the
University of Southampton. It holds publications including journal
articles, books and chapters, reports and working papers, higher
theses, and some art and design items. It is looking to expand its
holdings of datasets.
It has a staff of 3.2 FTE (1 FTE technical, 0.9 senior editor, 1.2 editors,
0.1 senior manager). Total costs of the repository are £116,318,
comprising staff costs of £111,318 and infrastructure costs of
£5,000. (These figures do not include a separate repository for
electronics and computer science, which will be merged into the main
repository later in 2012.) It is funded and hosted by the University of
Southampton, and uses the ePrints server, which was developed by the
University of Southampton School of Electronics and Computer
Science.

Contingency of these databases

PDB and arXiv are dependent on mixes of discretionary decisions by
government bodies and philanthropy
UK Data Archive is unusual in its centrality to the
social sciences funding system
University repositories highly varied in
performance and in support from the top
Funders and universities are under many
pressures
But researchers can do more to promote data
access, as can journals

Approaches to reproducibility

Growth in formal corrections
(Examples from Nature, Nature Biotechnology, Nature Neuroscience, Nature Methods)

Missing controls, results not sufficiently representative of experimental variability, data selection
Investigator bias, e.g., in determining the boundaries of an
area to study (lack of blinding)
Technical replicates wrongly described as biological
replicates
Over-fitting of models for noisy datasets in various
experimental settings: fMRI, x-ray crystallography, machine
learning
Errors and inappropriate manipulation in image
presentation, poor data management
Contamination of primary culture cells
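The over-fitting point above can be made concrete with a minimal sketch (not from the talk; the dataset and models are invented for illustration): a flexible model fitted to a small, noisy dataset reproduces the training points almost perfectly yet generalises worse than a simple one.

```python
import numpy as np

# Illustration only: over-fitting a model to a noisy dataset.
# A degree-9 polynomial passes through all 10 noisy training points,
# but a plain linear fit generalises better to held-out x values.
rng = np.random.default_rng(0)

def signal(x):
    return 2 * x + 1  # underlying linear relationship

x_train = np.linspace(0, 1, 10)
y_train = signal(x_train) + rng.normal(0, 0.3, x_train.size)  # noisy measurements
x_test = np.linspace(0.05, 0.95, 50)   # held-out points in the same range
y_test = signal(x_test)

def errors(degree):
    """Return (training MSE, held-out MSE) for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

train_lo, test_lo = errors(1)  # simple model: honest about the noise
train_hi, test_hi = errors(9)  # interpolates the noise: train error near zero
```

The training error of the flexible model is deceptively small; only the held-out error reveals the problem, which is why validation on independent data (or the "gut scepticism" mentioned later in this talk) matters.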

Mandating reporting standards is not sufficient
MIAME: Minimal Information About a Microarray Experiment

2002: Nature journals mandate deposition of MIAME-compliant microarray data
2006: compliance issues identified
Ioannidis et al., Nat Genet 41, 149 (2009):
of 18 papers containing microarray data published in Nature Genetics in
2005-2006, 10 analyses could not be reproduced, and 6 only partially.

Irreproducibility: NPG actions so far

Awareness-raising meetings 2013/14: NINDS, NCI, Academy of Medical
Sciences, Royal Society, Science Europe
Awareness raising: Editorials, articles by experts
We removed length limits on online methods sections
We substantially increased figure limits in Nature and improved access to
Supplementary Information data in research journals
Statistical advisor (Terry Hyslop) and statistical referees appointed
'Reducing our irreproducibility' Editorial + checklists for authors, editors and
referees (23 April 2013)
Nature + NIH + Science meeting of journal editors in Washington (May 2014)

Raising awareness: our content

Tackling the widespread and critical impact of batch effects in high-throughput data,
Leek et al., NRG, Oct 2010
How much can we rely on published data on potential drug targets? Prinz et al.,
NRDD, Sep 2011
The case for open computer programs, Ince et al., Nature, Feb 2012
Raise standards for preclinical cancer research, Begley & Ellis, Nature, Mar 2012
Must try harder Editorial, Nature, Mar 2012
Face up to false positives, MacArthur, Nature, Jul 2012
Error prone Editorial, Nature, Jul 2012
Next-generation sequencing data interpretation: enhancing reproducibility and
accessibility, Nekrutenko & Taylor, NRG, Sep 2012
A call for transparent reporting to optimize the predictive value of preclinical research.
Landis et al., Nature, Oct 2012
Know when your numbers are significant, Vaux, Nature, Dec 2012
Reuse of public genome-wide gene expression data, Rung & Brazma, NRG, Feb
2013

Raising awareness: our content (2)


Reducing our irreproducibility Editorial, Nature, May 2013
Reproducibility: Six red flags for suspect work, Begley, Nature, May
2013
Reproducibility: The risks of the replication drive, Bissell, Nature,
Nov 2013
Of carrots and sticks: incentives for data sharing, Kattge et al.,
Nature Geoscience, Nov 2014
Open code for open science?, Easterbrook, Nature Geoscience, Nov 2014
Code share, Editorial, Nature, 29 Oct 2014
Journals unite, joint Editorial with Science and NIH, 6 Nov 2014

Implementation of reporting checklist

Onerous! (For authors, referees, editors and copyeditors.)

Referees:
We are not yet sure whether they are paying much attention.

Authors:
Some papers are submitted with the checklist completed, without prompting
Many have embraced source data

The checklist improves reporting (see following slide).
We have commissioned an external assessment of its impact.
The list may be driving changes in experimental design in the longer term.

Reporting animal experiments in Nature Neuroscience

[Bar chart: proportion of reporting items 'done', 'not done' and 'not reported',
comparing Jan 2012 (10 papers) with Oct 2013 - Jan 2014 (41 papers)]
'Not reported' includes cases for which the specific question was not
relevant (e.g., the investigator cannot be blinded to treatment).
Most frequent problems: power analysis calculations, low n (sample-size
justification), proper blinding or randomization, multiple t-tests.
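To give a sense of why low n is flagged so often, here is a sketch of a standard sample-size calculation (not part of the talk; it uses the textbook normal approximation for a two-sided, two-sample comparison of means):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate subjects per group for a two-sided, two-sample
    comparison of means (normal approximation to the t-test).
    effect_size is Cohen's d: mean difference / pooled SD."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = z(power)           # ~0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A "medium" effect (d = 0.5) already needs 63 animals per group
# at 80% power and alpha = 0.05.
print(n_per_group(0.5))  # → 63
```

The exact noncentral-t calculation gives a slightly larger number, so this approximation is, if anything, optimistic; studies with far smaller groups are underpowered for effects of this size.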

Attention needed: cell line identity

Identify the source of cell lines and indicate whether they were
recently authenticated (e.g., by STR profiling) and tested for
mycoplasma contamination.
This checklist question is not yet enforced as a mandate.

Audit of Nature Cell Biology papers (Aug 2013 - Dec 2013):
- Of 21 relevant papers:
- 20 indicate the source of cell lines (*)
- 4 indicate authentication was done (**)
- 5 acknowledge cell lines were not authenticated
- 17 indicate the cells were tested and shown to be mycoplasma-free (**)
(*) quality of information variable
(**) timing of tests not always satisfactory

Question about developing author-contribution transparency

Author contribution statements in Nature journals are informal,
unstructured and non-templated.
Should this change? How? (Possible goals: increased credit,
increased accountability for potential flaws.)
How granular should this information become?

Irreproducibility: underlying issues

Experimental design: randomization, blinding, sample-size determinations,
independent experiments vs technical replicates
Statistics
Big data, overfitting (needs gut scepticism/tacit knowledge)
Gels, microscopy images
Reagent validity: antibodies, cell lines
Animal studies description
Methods description
Data deposition
Publication bias and refutations: where?
IP confidentiality: replication failures unpublishable
Lab supervision
Lab training
Pressure to publish
It pays to be sloppy

Funders: The NIH

Collins and Tabak, Nature 27 January 2014
NIH is developing a training module on enhancing reproducibility and
transparency of research findings, with an emphasis on good experimental design.
This will be incorporated into the mandatory training on responsible conduct of
research for NIH intramural postdoctoral fellows later this year. Informed by this pilot,
final materials will be posted on the NIH website by the end of this year for broad
dissemination, adoption or adaptation, on the basis of local institutional needs.

Funders: The NIH

Collins and Tabak, Nature 27 January 2014
Several of the NIH's institutes and centres are also testing the use of a checklist to ensure a
more systematic evaluation of grant applications. Reviewers are reminded to check, for
example, that appropriate experimental design features have been addressed, such as an
analytical plan, plans for randomization, blinding and so on. A pilot was launched last year
that we plan to complete by the end of this year to assess the value of assigning at least one
reviewer on each panel the specific task of evaluating the 'scientific premise' of the
application: the key publications on which the application is based (which may or may not
come from the applicant's own research efforts). This question will be particularly
important when a potentially costly human clinical trial is proposed, based on animalmodel results. If the antecedent work is questionable and the trial is particularly
important, key preclinical studies may first need to be validated independently.

Informed by feedback from these pilots, the NIH leadership will decide by the fourth quarter
of this year which approaches to adopt agency-wide, which should remain specific to institutes
and centres, and which to abandon.

Universities/institutes:
target issues

Data validation
Lab size and management
Training
Publication bias
Data/notebooks access
Reagent access

Nature and NPG data policies

Enforce community database deposition
Encourage community database
development
Launch Scientific Data
Nature-journal editors encourage
submissions of Data Descriptors to
Scientific Data

Thanks for listening
