001 Phil Campbell Reproducibility and Data at Nature NPG 13-11-14
Reproducibility at Nature
Philip Campbell
Publishing Better Science through Better Data meeting
NPG
14-11-14
Contents
Data: opportunities and costs
Reproducibility: Nature's approaches
Combating fraud
Intelligent openness
Openness of data per se has no value; open science is more than disclosure.
Data must be:
Accessible
Intelligible (requires metadata)
Assessable
Re-usable
Only when these four criteria are fulfilled are data properly open.
Databases as publications
Hosts/suppliers of databases are publishers
They have a responsibility to curate and provide reliable access to content.
They may also deliver other services around their products.
They may provide the data as a public good or charge for access.
UK Data Archive
The UK Data Archive, founded 1967, is curator of the largest collection of digital data in the
social sciences in the United Kingdom. UKDA is funded mainly by Economic and Social
Research Council, University of Essex and JISC, and is hosted at University of Essex.
On average around 2,600 (new or revised) files are uploaded to the repository monthly. (This
includes file packages, so the absolute number of files is higher.) The baseline size of the
main storage repository is under 1 TB, though with multiple versions and files outside this
system, a total capacity of c. 10 TB is required.
The UKDA currently (26/1/2012) employs 64.5 people. The total expenditure of the UK Data
Archive (2010-11) was approximately £3.43 million. Total staff costs (2010-11) across the whole
organisation were £2.43 million.
Non-staff costs in 2009-10 were approximately £580,000, but will be much higher in 2011-12 (almost
£3 million) due to additional investment.
Institutional Repositories
(Tier 3)
Most university repositories in the UK have small amounts of staff
time. The Repositories Support Project survey in 2011 received
responses from 75 UK universities. It found that the average university
repository employed a total of 1.36 FTE across managerial,
administrative and technical roles. 40% of these repositories accept
research data. In the vast majority of cases (86%), the library has
lead responsibility for the repository.
ePrints Soton
ePrints Soton, founded in 2003, is the institutional repository for the
University of Southampton. It holds publications including journal
articles, books and chapters, reports and working papers, higher
theses, and some art and design items. It is looking to expand its
holdings of datasets.
It has a staff of 3.2 FTE (1 FTE technical, 0.9 senior editor, 1.2 editors,
0.1 senior manager). Total costs of the repository are £116,318,
comprising staff costs of £111,318 and infrastructure costs of
£5,000. (These figures do not include a separate repository for
electronics and computer science, which will be merged into the main
repository later in 2012.) It is funded and hosted by the University of
Southampton, and uses the ePrints server, which was developed by the
University of Southampton School of Electronics and Computer
Science.
Approaches to reproducibility
Tackling the widespread and critical impact of batch effects in high-throughput data,
Leek et al., NRG, Oct 2010
How much can we rely on published data on potential drug targets? Prinz et al.,
NRDD, Sep 2011
The case for open computer programs, Ince et al., Nature, Feb 2012
Raise standards for preclinical cancer research, Begley & Ellis, Nature, Mar 2012
Must try harder, Editorial, Nature, Mar 2012
Face up to false positives, MacArthur, Nature, Jul 2012
Error prone, Editorial, Nature, Jul 2012
Next-generation sequencing data interpretation: enhancing reproducibility and
accessibility, Nekrutenko & Taylor, NRG, Sep 2012
A call for transparent reporting to optimize the predictive value of preclinical research.
Landis et al., Nature, Oct 2012
Know when your numbers are significant, Vaux, Nature, Dec 2012
Reuse of public genome-wide gene expression data, Rung & Brazma, NRG, Feb
2013
Referees:
We are not yet sure whether they are paying much attention.
Authors:
Some papers are submitted with the checklist completed without prompting
Many have embraced source data
[Chart: checklist items categorized as done, not done, or not reported]
Not reported includes cases for which the specific question was not
relevant (e.g., investigator cannot be blinded to treatment)
Most frequent problems: power analysis calculations, low n (sample size
justification), proper blinding or randomization, multiple t-tests.
Several of the NIH's institutes and centres are also testing the use of a checklist to ensure a
more systematic evaluation of grant applications. Reviewers are reminded to check, for
example, that appropriate experimental design features have been addressed, such as an
analytical plan, plans for randomization, blinding and so on. A pilot was launched last year
that we plan to complete by the end of this year to assess the value of assigning at least one
reviewer on each panel the specific task of evaluating the 'scientific premise' of the
application: the key publications on which the application is based (which may or may not
come from the applicant's own research efforts). This question will be particularly
important when a potentially costly human clinical trial is proposed, based on animal-model results. If the antecedent work is questionable and the trial is particularly
important, key preclinical studies may first need to be validated independently.
Informed by feedback from these pilots, the NIH leadership will decide by the fourth quarter
of this year which approaches to adopt agency-wide, which should remain specific to institutes
and centres, and which to abandon.
Universities/institutes: target issues
Data validation
Lab size and management
Training
Publication bias
Data/notebooks access
Reagent access