
Research data and reproducibility at Nature
Philip Campbell
Publishing Better Science through Better Data meeting
NPG
14 November 2014

Contents
Data: opportunities and costs
Reproducibility: Nature's approaches

Aspiration: all scientific literature online, all data online, and for them to interoperate

Why is open data an urgent issue?

Closing the concept-data gap

Maintaining the credibility of science

Exploiting the data deluge & computational


potential

Combating fraud

Addressing planetary challenges

Supporting citizen science

Responding to citizens' demands for evidence

Restraining the Database State

Intelligent openness
Openness of data per se has no value. Open science is more
than disclosure.
Data must be:
Accessible
Intelligible (metadata)
Assessable
Re-usable
Only when these four criteria are fulfilled are data properly
open

The transition to open data

Pathfinder disciplines where benefit is recognised and habits are changing

Databases as publications
Hosts/suppliers of databases are publishers
They have a responsibility to curate and provide reliable access to content.
They may also deliver other services around their products.
They may provide the data as a public good or charge for access.

Worldwide Protein Data Bank (wwPDB)
The Worldwide Protein Data Bank (wwPDB) archive is the single
worldwide repository of information about the 3D structures of
large biological molecules, including proteins and nucleic acids.
As of January 2012, it held 78,477 structures; 8,120 were added in
2011, a rate of 677 per month. In 2011, an average of 31.6
million data files were downloaded per month. The total storage
requirement for the archive was 135 GB.
The total cost of the project is approximately $11-12 million per
year (including overhead), spread across the four
member sites. It employs 69 FTE staff. wwPDB estimates that $6-7
million of this covers expenses relating to the deposition and
curation of data.

UK Data Archive
The UK Data Archive, founded in 1967, is curator of the largest collection of digital data in the
social sciences in the United Kingdom. The UKDA is funded mainly by the Economic and Social
Research Council, the University of Essex and JISC, and is hosted at the University of Essex.
On average, around 2,600 (new or revised) files are uploaded to the repository monthly. (This
includes file packages, so the absolute number of files is higher.) The baseline size of the
main storage repository is <1 TB, though with multiple versions and files outside this
system, a total capacity of c. 10 TB is required.
The UKDA currently (26/1/2012) employs 64.5 people. The total expenditure of the UK Data
Archive in 2010-11 was approximately £3.43 million; total staff costs across the whole
organisation in 2010-11 were £2.43 million.
Non-staff costs in 2009-10 were approximately £580,000, but will be much higher in 2011-12,
i.e. almost £3 million, due to additional investment.

Institutional Repositories
(Tier 3)
Most university repositories in the UK have small amounts of staff
time. The Repositories Support Project survey in 2011 received
responses from 75 UK universities. It found that the average university
repository employed a total of 1.36 FTE across managerial,
administrative and technical roles. 40% of these repositories accept
research data. In the vast majority of cases (86%), the library has
lead responsibility for the repository.
ePrints Soton
ePrints Soton, founded in 2003, is the institutional repository for the
University of Southampton. It holds publications including journal
articles, books and chapters, reports and working papers, higher
theses, and some art and design items. It is looking to expand its
holdings of datasets.
It has a staff of 3.2 FTE (1 FTE technical, 0.9 senior editor, 1.2 editors,
0.1 senior manager). Total costs of the repository are £116,318,
comprising staff costs of £111,318 and infrastructure costs of
£5,000. (These figures do not include a separate repository for
electronics and computer science, which will be merged into the main
repository later in 2012.) It is funded and hosted by the University of
Southampton, and uses the ePrints server, which was developed by the
University of Southampton School of Electronics and Computer
Science.

Contingency of these databases

PDB and arXiv are dependent on mixes of discretionary decisions by
government bodies and philanthropy
UK Data Archive is unusual in its centrality to the
social sciences funding system
University repositories highly varied in
performance and in support from the top
Funders and universities are under many
pressures
But researchers can do more to promote data
access, as can journals

Approaches to reproducibility

Growth in formal corrections
(Examples from Nature, Nature Biotechnology, Nature Neuroscience, Nature Methods)

Missing controls, results not sufficiently representative of experimental variability, data selection
Investigator bias, e.g., in determining the boundaries of an
area to study (lack of blinding)
Technical replicates wrongly described as biological
replicates
Over-fitting of models for noisy datasets in various
experimental settings: fMRI, x-ray crystallography, machine
learning
Errors and inappropriate manipulation in image
presentation, poor data management
Contamination of primary culture cells
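The over-fitting point above can be made concrete with a minimal sketch (not from the talk; the dataset and models are invented for illustration): a flexible model fitted to a small, noisy dataset reproduces the training points almost perfectly yet generalises worse than a simple one.

```python
import numpy as np

# Illustration only: over-fitting a model to a noisy dataset.
# A degree-9 polynomial passes through all 10 noisy training points,
# but a plain linear fit generalises better to held-out x values.
rng = np.random.default_rng(0)

def signal(x):
    return 2 * x + 1  # underlying linear relationship

x_train = np.linspace(0, 1, 10)
y_train = signal(x_train) + rng.normal(0, 0.3, x_train.size)  # noisy measurements
x_test = np.linspace(0.05, 0.95, 50)   # held-out points in the same range
y_test = signal(x_test)

def errors(degree):
    """Return (training MSE, held-out MSE) for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

train_lo, test_lo = errors(1)  # simple model: honest about the noise
train_hi, test_hi = errors(9)  # interpolates the noise: train error near zero
```

The training error of the flexible model is deceptively small; only the held-out error reveals the problem, which is why validation on independent data (or the "gut scepticism" mentioned later in this talk) matters.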

Mandating reporting standards is not sufficient
MIAME: Minimal Information About a Microarray Experiment

2002: Nature journals mandate deposition of MIAME-compliant microarray data
2006: compliance issues identified
Ioannidis et al., Nat Genet 41, 149 (2009):
of 18 papers containing microarray data published in Nature Genetics in
2005-2006, 10 analyses could not be reproduced, and 6 only partially.

Irreproducibility: NPG actions so far

Awareness-raising meetings 2013/14: NINDS, NCI, Academy of Medical
Sciences, Royal Society, Science Europe
Awareness raising: Editorials, articles by experts
We removed length limits on online methods sections
We substantially increased figure limits in Nature and improved access to
Supplementary Information data in research journals
Statistical advisor (Terry Hyslop) and statistical referees appointed
'Reducing our irreproducibility' Editorial + checklists for authors, editors and
referees (23 April 2013)
Nature + NIH + Science meeting of journal editors in Washington (May 2014)

Raising awareness: our content

Tackling the widespread and critical impact of batch effects in high-throughput data,
Leek et al., NRG, Oct 2010
How much can we rely on published data on potential drug targets? Prinz et al.,
NRDD, Sep 2011
The case for open computer programs, Ince et al., Nature, Feb 2012
Raise standards for preclinical cancer research, Begley & Ellis, Nature, Mar 2012
Must try harder Editorial, Nature, Mar 2012
Face up to false positives, MacArthur, Nature, Jul 2012
Error prone Editorial, Nature, Jul 2012
Next-generation sequencing data interpretation: enhancing reproducibility and
accessibility, Nekrutenko & Taylor, NRG, Sep 2012
A call for transparent reporting to optimize the predictive value of preclinical research.
Landis et al., Nature, Oct 2012
Know when your numbers are significant, Vaux, Nature, Dec 2012
Reuse of public genome-wide gene expression data, Rung & Brazma, NRG, Feb
2013

Raising awareness: our content (2)


Reducing our irreproducibility Editorial, Nature, May 2013
Reproducibility: Six red flags for suspect work, Begley, Nature, May
2013
Reproducibility: The risks of the replication drive, Bissell, Nature,
Nov 2013
Of carrots and sticks: incentives for data sharing, Kattge et al.,
Nature Geoscience, Nov 2014
Open code for open science?, Easterbrook, Nature Geoscience, Nov 2014
Code share, Editorial, Nature, 29 Oct 2014
Journals unite, joint Editorial with Science and NIH, 6 Nov 2014

Implementation of reporting checklist

Onerous! (For authors, referees, editors and copyeditors.)

Referees:
We are not yet sure whether they are paying much attention.

Authors:
Some papers are submitted with the checklist completed, without prompting
Many have embraced source data

The checklist improves reporting (see following slide).
We have commissioned an external assessment of its impact.
The list may be driving changes in experimental design in the longer term.

Reporting animal experiments in Nature Neuroscience

[Bar chart: proportion of reporting items 'done', 'not done' and 'not reported',
comparing Jan 2012 (10 papers) with Oct 2013 - Jan 2014 (41 papers)]
'Not reported' includes cases for which the specific question was not
relevant (e.g., the investigator cannot be blinded to treatment).
Most frequent problems: power analysis calculations, low n (sample-size
justification), proper blinding or randomization, multiple t-tests.
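To give a sense of why low n is flagged so often, here is a sketch of a standard sample-size calculation (not part of the talk; it uses the textbook normal approximation for a two-sided, two-sample comparison of means):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate subjects per group for a two-sided, two-sample
    comparison of means (normal approximation to the t-test).
    effect_size is Cohen's d: mean difference / pooled SD."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = z(power)           # ~0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A "medium" effect (d = 0.5) already needs 63 animals per group
# at 80% power and alpha = 0.05.
print(n_per_group(0.5))  # → 63
```

The exact noncentral-t calculation gives a slightly larger number, so this approximation is, if anything, optimistic; studies with far smaller groups are underpowered for effects of this size.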

Attention needed: cell line identity

Identify the source of cell lines and indicate whether they were
recently authenticated (e.g., by STR profiling) and tested for
mycoplasma contamination.
This checklist question is not yet enforced as a mandate.

Audit of Nature Cell Biology papers (Aug 2013 - Dec 2013):
- Of 21 relevant papers:
- 20 indicate the source of cell lines (*)
- 4 indicate authentication was done (**)
- 5 acknowledge cell lines were not authenticated
- 17 indicate the cells were tested and shown to be mycoplasma-free (**)
(*) quality of information variable
(**) timing of tests not always satisfactory

Question about developing author-contribution transparency

Author contribution statements in Nature journals are informal,
unstructured and non-templated.
Should this change? How? (Possible goals: increased credit,
increased accountability for potential flaws.)
How granular should this information become?

Irreproducibility: underlying issues

Experimental design: randomization, blinding, sample-size determinations,
independent experiments vs technical replicates
Statistics
Big data, overfitting (needs gut scepticism/tacit knowledge)
Gels, microscopy images
Reagent validity: antibodies, cell lines
Animal studies description
Methods description
Data deposition
Publication bias and refutations: where?
IP confidentiality: replication failures unpublishable
Lab supervision
Lab training
Pressure to publish
It pays to be sloppy

Funders: The NIH

Collins and Tabak, Nature 27 January 2014
NIH is developing a training module on enhancing reproducibility and
transparency of research findings, with an emphasis on good experimental design.
This will be incorporated into the mandatory training on responsible conduct of
research for NIH intramural postdoctoral fellows later this year. Informed by this pilot,
final materials will be posted on the NIH website by the end of this year for broad
dissemination, adoption or adaptation, on the basis of local institutional needs.

Funders: The NIH

Collins and Tabak, Nature 27 January 2014
Several of the NIH's institutes and centres are also testing the use of a checklist to ensure a
more systematic evaluation of grant applications. Reviewers are reminded to check, for
example, that appropriate experimental design features have been addressed, such as an
analytical plan, plans for randomization, blinding and so on. A pilot was launched last year
that we plan to complete by the end of this year to assess the value of assigning at least one
reviewer on each panel the specific task of evaluating the 'scientific premise' of the
application: the key publications on which the application is based (which may or may not
come from the applicant's own research efforts). This question will be particularly
important when a potentially costly human clinical trial is proposed, based on animalmodel results. If the antecedent work is questionable and the trial is particularly
important, key preclinical studies may first need to be validated independently.

Informed by feedback from these pilots, the NIH leadership will decide by the fourth quarter
of this year which approaches to adopt agency-wide, which should remain specific to institutes
and centres, and which to abandon.

Universities/institutes:
target issues

Data validation
Lab size and management
Training
Publication bias
Data/notebooks access
Reagent access

Nature and NPG data policies

Enforce community database deposition
Encourage community database
development
Launch Scientific Data
Nature-journal editors encourage
submissions of Data Descriptors to
Scientific Data

Thanks for listening
