Molecular Ecology Resources (2017) 17, 120–128 doi: 10.1111/1755-0998.12558

SPECIAL ISSUE: POPULATION GENOMICS WITH R


Developing educational resources for population genetics in
R: an open and collaborative approach
ZHIAN N. KAMVAR,* MARGARITA M. LÓPEZ-URIBE,† SIMONE COUGHLAN,‡ NIKLAUS J. GRÜNWALD,*§
HILMAR LAPP¶ and STEPHANIE MANEL**
*Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA, †Department of Entomology and
Plant Pathology, North Carolina State University, Raleigh, NC 27695, USA, ‡School of Mathematics, Statistics and Applied
Mathematics, National University of Ireland Galway, Galway, Ireland, §Horticultural Crops Research Unit, USDA Agricultural
Research Service, Corvallis, OR 97330, USA, ¶Center for Genomic and Computational Biology, Duke University, Durham, NC
27708, USA, **EPHE, PSL Research University, CNRS, UM, SupAgro, IRD, INRA, UMR 5175, CEFE F-34293, Montpellier,
France

Abstract
The R computing and statistical language community has developed a myriad of resources for conducting population
genetic analyses. However, resources for learning how to carry out population genetic analyses in R are scattered and
often incomplete, which can make acquiring this skill unnecessarily difficult and time-consuming. To address this
gap, we developed an online community resource with guidance and working demonstrations for conducting
population genetic analyses in R. The resource is freely available at http://popgen.nescent.org and includes material for
both novices and advanced users of R for population genetics. To facilitate continued maintenance and growth of this
resource, we developed a toolchain, process and conventions designed to (i) minimize financial and labour costs of
upkeep; (ii) provide a low barrier to contribution; and (iii) ensure strong quality assurance. The toolchain
includes automatic integration testing of every change and rebuilding of the website when new vignettes or edits are
accepted. The process and conventions largely follow a common, distributed version control-based contribution
workflow, which is used to provide and manage open peer review by designated website editors. The online
resources include detailed documentation of this process, including video tutorials. We invite the community of
population geneticists working in R to contribute to this resource, whether by adding a new use case of their own, by
writing one of the vignettes from the ‘wish list’ we maintain, or by improving existing vignettes.
Keywords: git, GitHub, online resources, open access, open education, population genetics

Received 25 March 2016; accepted 2 May 2016

Correspondence: Stephanie Manel, Fax: +33 (0) 4 67 61 33 36; E-mail: stephanie.manel@cefe.cnrs.fr

© 2016 John Wiley & Sons Ltd

Introduction
The wealth of data resulting from the broad and affordable availability of modern high throughput sequencing
(HTS) and genotyping technologies has been a boon to investigating population genetics research questions
(Luikart et al. 2003) with applications in a wide range of fields, from medicine to epidemiology, agronomy,
evolutionary biology, ecology and conservation biology. As a consequence, both the community of scientists needing
to use analytical methods in population genetics and that of scientists developing such methods have grown
dramatically. At the same time, the expanding types of questions made addressable by the wealth of data have
led to the rapid development of new statistical methods, which in turn have brought new input and output
formats with them. In this environment, merely knowing which analytical methods are already available, and
which of them can be strung together into a functioning workflow, can be quite challenging not only for novices
but even for more experienced users (Andrews & Luikart 2014; Benestan et al. 2016).

Even before the advent of the HTS data deluge, the field of population genetics relied heavily on computer
software for data analysis. Traditionally, such analysis often had to combine multiple software programs,
each of which used different input formats and operating systems (Excoffier & Heckel 2006), making it a
time-consuming process. Since then, a consolidation on the R platform for statistical computing (R Core Team
2015) has fostered a rich, more interoperable and open source
ecosystem of a myriad of analytical resources for both users and developers across many quantitative sciences
(Boettiger et al. 2015), including in population genetics. Some of the R packages available for population genetics
analysis are featured in this special issue: ape (Paradis et al. 2004), adegenet (Jombart 2008; Jombart & Ahmed
2011), pegas (Paradis 2010), phangorn (Schliep 2011), poppr (Kamvar et al. 2014, 2015a) and OutFLANK
(Whitlock & Lotterhos 2015).

The rich ecosystem of relevant R packages is obviously a blessing, but it can also be a curse. The Genetics
Task View on the Comprehensive R Archive Network (CRAN, https://cran.r-project.org/), the repository of all R
add-on packages, mentions more than 30 packages for different population genetic analyses ranging from
population structure to genome-wide association studies. Minimally, each package comes with documentation for
the methods and data structures it exposes to the user. However, addressing a research question typically
requires combining methods from several packages, and documentation that can guide users on selecting
packages that are suitable for a given question, interoperate well, and are well maintained is scarce. A community
mailing list where one could ask questions in this regard exists (R-SIG-GENETICS,
https://stat.ethz.ch/pipermail/r-sig-genetics/), but it is not widely known. More importantly, although a mailing
list can yield tips and answers to ad hoc problems, it is not a go-to community resource providing narrated
example workflows, replete with working code from loading the data to visualizing the results, vetted by other
subject experts.

To fill this gap, we created an online community website of vignettes for population genetics in R, called
popgenInfo. It is freely available at http://popgen.nescent.org. Recognizing that resources aimed to be
community-maintained often face strong challenges to adoption and sustainability, we designed a process and
chose building block technologies for hosting, contributing and building the site aimed at overcoming these
challenges while still maintaining quality assurance. In this study, we report on the resource as a whole, describe in
detail the contribution and maintenance process and discuss our initial results with using, and in the course of
refining, the process ourselves to populate the site with an initial collection of vignettes covering basic summary
statistics and population differentiation methods, for microsatellite, SNP and sequence-type data. In a
nutshell, the contribution and maintenance process centres on authoring vignettes in the text-based R Markdown
format (Baumer et al. 2014); using continuous integration for automatically testing and building the site (Wilson
et al. 2014); using Docker containers for doing so reproducibly (Boettiger 2015); and using GitHub pull requests
for open peer review of all contributions (see Ram 2013).

The resource we describe is aimed both at novices in R or population genetics, for whom the site can provide
tested workflows adaptable to users’ own needs, and at advanced R users in population genetics, who can
contribute their expertise by authoring, improving and peer-reviewing vignettes. To alleviate potential barriers to
contribution, we include both detailed written documentation and video tutorials.

Community resource design goals and implementation
In designing and developing the online community resource we report here, we sought to achieve the
following aims:

1 Long-term sustainable hosting. This required a hosting platform for which the financial support did not come
with a built-in expiration date, precluding hosting solutions that depend on ongoing grant funding.
2 Low technical barriers to authoring contributions. Specifically, authoring content for the website should require
few or no tools and languages that are likely to be foreign or highly platform specific to users otherwise
familiar with conducting data analysis in R.
3 Provisions for strong quality assurance. Specifically, all code examples provably work, and primers are only
published following open peer review.
4 Sustainable time commitment for website editors. In particular, manual tasks that do not depend on population
genetics or R expertise must require minimal time or be eliminated altogether.
5 All changes and contributions are automatically tracked. This is not only important for provenance and credit,
but also a direct consequence of preceding aims, because it eliminates the need for contributors, editors
and peer reviewers to manually identify changes.
6 Fully reproducible and automatable software environment for rebuilding (rendering) the website. This, too, is a direct
consequence of preceding aims. It eliminates otherwise manual maintenance tasks, and it makes the outcome
of those tasks predictable across the possibly different native environments used by contributors and editors.

To achieve these aims, we devised a build, deployment and contribution process with the following
building blocks:

Hosting
The rendered website is hosted on GitHub Pages. This service is provided by GitHub free of charge and
presumably funded by the revenue GitHub makes from hosting private software source code repositories, a source that,
unlike volatile grant funding, is unlikely to become obsolete or unavailable in the foreseeable future.

Authoring
Content for the website is authored in R Markdown format (Xie 2013; Baumer et al. 2014), which uses simple
text-based markup in which R code fragments can be directly embedded. R Markdown can be authored in any
basic text editor. RStudio, a highly popular tool for coding R scripts, includes interactive features for navigating,
rendering and viewing of R Markdown documents (see Gandrud 2015).

Version control
The source of the website, including all R Markdown documents, is hosted on GitHub (http://github.com)
under Git version control. Git is a distributed version control system, which lends itself particularly well to
community contribution. Prospective contributors can create their own copy of the repository (called a ‘fork’),
create sets of changes against their copy (through branching off and adding commits) and submit these
change sets (referred to as ‘pull requests’) back to the main repository (usually called ‘upstream’), all without
having to be granted any special permissions. Using Git for version control, and hosting repositories on GitHub,
has become widely popular for software projects, including for scientific software, thanks in part to the benefits
that derive from both Git’s and GitHub’s features for collaborative development and open science (Ram 2013).
Blischak et al. (2016) recently published an introduction to Git aimed at scientist developers, and we provide a
glossary of key operations in Table S1 (Supporting information). Because version control in general and Git in
particular often remain lacking from the training quantitative biologists receive, we provide detailed guidelines
and video tutorials for how working with Git can be managed directly from within R, using the git2r package
(Widgren & git2r Contributors 2016).

Peer review
Pull requests, the regular community contribution mechanism in GitHub, lend themselves particularly well to the
open peer review process which we aimed to achieve. The user interface GitHub has built around this feature
visually identifies the change set being submitted for consideration and allows commenting on the change set as a
whole or on individual lines of text for which changes are being proposed. Contributors can reply to comments
or add changes that are responsive to the comments. The full discussion thread remains openly available,
including after the change set has been accepted. Integrating the contribution upon acceptance usually only
takes the click of a button by the designated maintainer (editor). These capabilities allowed us to devise an open
peer review process that is seamless and efficient, yet does not depend on any custom tooling.

Automatic testing and builds
We use a continuous integration testing service, Circle CI (http://circleci.com), to automatically test every change
set submitted for consideration as a pull request. At present, this test consists of successfully rendering the full
website from its source code, which, by virtue of rendering the R Markdown documents, includes executing all
embedded R code fragments. The pull request user interface prominently displays the success or failure of this
test, which thus provides immediate feedback both to the contributor and to a prospective peer reviewer on
whether basic errors are present or absent. A peer reviewer can therefore focus on evaluating the scientific
merits of the content and the included R code. The continuous integration is configured such that upon
acceptance of a pull request, the website is rebuilt from scratch and automatically deployed to GitHub Pages, without
requiring any manual supervision.

Reproducible software environment
Similar to challenges with constantly evolving scientific software in general, R packages, including those for
population genetics, sometimes change their behaviour, data structures or interface from one version to
another, which can make the numeric results, or whether code using them works at all, highly dependent on
the environment of versions and dependencies installed on a system. To isolate the software environment
needed to render the R Markdown source documents and to run the embedded R code, we use
software containerization in the form of Docker containers (Boettiger 2015). We use a Docker image
designed specifically for population genetics in R to run all tests, and to render the website from its
sources. To support this, the Docker image includes the tools needed to render R Markdown documents.
The specification for the Docker image is maintained publicly under version control
(http://github.com/NESCent/popgen-docker) and can be used by anyone, including prospective contributors, to
render the vignettes, or the entire website, under the exact same environment as they would be in our automatic
testing and site rebuild (using, for example, the Docker image continuously updated at Docker Hub at
https://hub.docker.com/r/hlapp/rpopgen/).

Figure 1 illustrates as a flow chart how we combine these components into a community contribution process
that includes automatic testing, open peer review and reproducible rebuilds of the online resource as a whole.

Vignette development
In contrast to the succinct and often terse method-by-method documentation that comes with most R packages,
a vignette is long-form documentation to guide users on how multiple methods, from one or, as will be
more typical in our case, multiple packages can be combined to address an analysis use case from start to end
(see Wickham 2015). Our goal was to develop vignettes that are detailed enough for the reader to learn how to
solve a population genetics analysis problem using the tools suggested by the vignette’s author(s).

As guidance for prospective contributors, we provide a template vignette
(http://popgen.nescent.org/TEMPLATE.html). To allow contributors some creative freedom, vignettes are not
required to strictly follow a certain structure, but they must include several key points: (i) Description of
purpose: an introduction that lets readers know what the vignette is for; (ii) Resources: short descriptions of the
data and R packages utilized in the vignette, to give readers context of the analyses; (iii) Workflow: the bulk of
the vignette, normally an analysis starting from data import to inspecting or visualizing the results;
(iv) References/Acknowledgements: includes all works cited as well as the authors and contributors of
the vignette; (v) R session information: lists the operating system, version of R and all R packages present when
the vignette was rendered, including their versions, installation date and source. The latter is considered a best

Fig. 1 Flow chart of vignette creation and submission. Commits move from top to bottom. The left side of the flow chart (orange) repre-
sents the popgenInfo repository on the NESCent account that exists on GitHub (denoted by the cloud symbol). The right side (blue) rep-
resents an individual user’s account that can exist both on GitHub and on their own computer (denoted by the cloud and computer
exchange symbols). Each of the three shapes (square, diamond and circle) represents a separate branch of the popgenInfo repository as
indicated by the label above each. Colours within each shape represent the state of the repository at a given commit. Dashed arrows rep-
resent actions taken by a human to move information between branches or repositories. The solid arrows represent automated processes
and commits. The process of vignette submission is outlined in seven steps (blue-labelled steps are carried out by the user; black-
labelled steps are automatic). 1) The popgenInfo repository (NESCent/popgenInfo) is forked from NESCent into the user’s GitHub
account. It is at this time that the user may additionally clone the repository to their computer. 2) The user creates and moves to a new
branch from the master branch on their account (user/popgenInfo). 3) New content (e.g. a vignette or correction) is added to the new
branch with git add and changes are stored using git commit. 4) A new pull request is created from the user’s new branch on GitHub to
the NESCent/popgenInfo master branch. Once the pull request is made, all future changes to the user/popgenInfo new branch will be
tracked. The review process begins when the user declares they are ready. 5) Once all requested modifications have been made and two
maintainers find it satisfactory, the new content is pulled into the master branch of the NESCent/popgenInfo repository. 6) The website
is automatically rebuilt with continuous integration in a Docker container and deployed within minutes. 7) The user can update their
master branch by using git fetch upstream and git merge upstream/master. Icon credits: FontAwesome.
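
The steps in Fig. 1 can be sketched as a short shell session. This is a minimal, offline sketch rather than the documented procedure: local bare repositories stand in for the GitHub-hosted NESCent/popgenInfo repository and for the user's fork, the file name my_vignette.Rmd, the branch name and the commit identity are hypothetical, and the pull request (step 4) and the maintainers' merge (step 5) happen on GitHub, so the merge is imitated here by a direct push.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"   # scratch directory so the sketch touches nothing else
cd "$work"

# Identity used for the demo commits only (hypothetical values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

# Stand-in for the upstream repository: one commit on master.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
echo "popgenInfo stand-in" > seed/README.md
git -C seed add README.md
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed upstream.git

# Step 1: fork (imitated by a bare clone), then clone the fork locally.
git clone -q --bare upstream.git fork.git
git clone -q fork.git popgenInfo
cd popgenInfo
git remote add upstream "$work/upstream.git"

# Step 2: create and switch to a new branch named [DATE]-[SUBJECT].
branch="2016-01-26-my-vignette"
git checkout -q -b "$branch"

# Step 3: add content and record it.
echo "A placeholder vignette." > my_vignette.Rmd
git add my_vignette.Rmd
git commit -q -m "Add placeholder vignette"

# Step 4: publish the branch to the fork; the pull request is opened on GitHub.
git push -q origin "$branch"

# Steps 5-6 happen upstream (review, merge, rebuild); imitated by a direct push.
git push -q upstream "$branch:master"

# Step 7: bring the fork's master branch up to date.
git checkout -q master
git fetch -q upstream
git merge -q upstream/master
git push -q origin master
```

Against GitHub, the clone URLs and the pull request step differ, while the branch, add, commit, fetch and merge commands are the ones named in the caption. Because master is only ever changed by merging from upstream, the final merge is a fast-forward.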

practice in the R community (Wickham 2015), because packages in the R ecosystem change often, and therefore,
knowing this information is crucial when investigating differences in behaviour between the online vignette and
running the workflow on a local system.

A simple example of a workflow vignette is the ‘Population Differentiation for DNA sequence data’ vignette
(http://popgen.nescent.org/PopDiffSequenceData.html). It describes a simplified workflow on how to
estimate population differentiation using two genes (one mitochondrial and one nuclear) from the bee species
Eulaema bombiformis. It begins with an auto-generated table of contents (Fig. 2). After a short introduction to the
topic of population differentiation and a description of the data, the necessary packages to run the code are
listed. The first part of the workflow demonstrates importing data. This is important to include because data
import can be challenging for novice users. The following section addresses the specifics of how to calculate
population differentiation from DNA sequence data. This section includes narrated snippets of R code with
rendered output (e.g. data and plots). The vignette ends with a conclusion that summarizes the steps of the
analysis and discusses the biological meaning of this type of analysis, followed by a list of vignette authors and
contributors, and the R session information.

Fig. 2 View of a vignette on how to estimate population differentiation from sequence data.

It is worth noting that vignettes intentionally do not aim to be exhaustive, neither about which packages
could be used, nor about all the alternative ways in which the analysis question could be addressed. Instead,
each aims to provide an example of how to analyse one’s data to answer a certain research question.

Vignette submission
Vignettes are submitted for inclusion in our site through a four-step process, which we adapted from an open
source software contribution workflow known as
‘GitHub flow’ (Fig. 1; Ram 2013): (i) fork the repository from the master repository (currently at
http://github.com/NESCent/popgenInfo), or, if it was forked previously, update the fork’s master branch;
(ii) create a new branch from the master branch, and switch to it; (iii) add or edit content; and (iv) create a
pull request against the master repository (the default for forked repositories). The first step creates an exact
copy of the popgenInfo repository in the prospective vignette author’s personal GitHub account, which is then
cloned to the local computer for content authoring or editing. This process only needs to be performed once.
After that, the user can update their fork’s master branch from the master repository (typically referred to as
‘upstream’) as we have detailed in our Contribution Guidelines (http://popgen.nescent.org/CONTRIBUTING.html).
In the second step, the contributor creates a new branch (Table S1, Supporting information), which we
recommend naming according to the pattern [DATE]-[SUBJECT]. By creating a new branch, the contributor isolates
themselves from upstream changes possibly conflicting with their work in progress (see section ‘Updating one’s
fork’) once they commence step three.

To aid users who are familiar with invoking methods in R but who are novices in version control or Git, we
created a detailed tutorial on operating this Git-based contribution process by interacting only with R, thanks to
the git2r package (Widgren & git2r Contributors 2016), which exposes Git operations as R functions. The tutorial
is available at http://popgen.nescent.org/CONTRIBUTING_WITH_GIT2R.html, which includes links to
videos to walk new users through the process.¹

¹ As GitHub’s API is subject to occasional change, these videos may appear outdated.

Validation and review
As described above, after a vignette (or other contribution) is submitted through a pull request, it is first
automatically tested using a reproducible software environment in the form of a Docker container. Once the test
results show that the site can still be successfully rendered with the submitted content, the vignette then goes
through a process of open peer review in which anyone in the community is invited to comment. Changes added
by the contributor in response to comments undergo the same automatic testing. The open peer review is
completed when a minimum of two maintainers of the upstream repository approve. Once approved, the pull
request may be accepted, which triggers an automatic update of the website with the new vignette in place
(Fig. 1, steps 5, 6 and 7).

Updating one’s fork
Contributors’ forks are in essence copies of the ‘upstream’ master repository (NESCent/popgenInfo)
obtained at some point in time. To minimize the chance that a contributor’s changes conflict with those of other
contributors, all changes, including new vignettes and updates of existing ones, must start from an up-to-date
copy of the master repository. This is achieved by first fetching the changes in the master repository and then
merging them into the master branch of one’s fork. Because in this workflow the master branch of the fork is
only ever modified by the contributor when merging in changes from upstream, this process will not produce
conflicts when updating a fork. This is important because the task of resolving merge conflicts sometimes requires
advanced technical knowledge. Our workflow does not prevent merge conflicts altogether, but it allows for them
to be resolved on the upstream master repository, under the hands of technically more experienced maintainers.

Results and discussion
We sought to create a community website for hosting narrated, peer-vetted workflows of population genetic
analyses with demonstrably working and fully reproducible code for every step.

To accomplish this, one of the main challenges we set out to address was lowering barriers to contribution,
both technical and nontechnical, while not compromising stringent, open and transparent quality control. Only
a minority of R users are also package developers; hence, our process for contributing needed to be intuitive
enough for those who use R for analysis, but are not experts in R package or code development. Moreover,
many packages in the R ecosystem for population genetics are the result of ongoing research and thus evolve
dynamically, which necessitates automatic testing for quality control. The contribution model we adopted to
achieve these goals, known as ‘GitHub flow’, has been widely used in the open source software community for
a number of years (Ram 2013).

This team of authors comprises both experienced software developers and population genetics scientists
who are frequent users of R, but not experienced in collaborative or reusable code development. This provided
us with a unique opportunity to continuously test and refine our contribution process by developing the initial
set of vignettes for our website under the same process that others would use, and in circumstances similar to
those in which a community contributor would likely be. Although the GitHub Flow model proved well suited to
ensuring that contributed workflows were of sufficient
quality, we found that it presented significant challenges for those among us not familiar with open source
development. Many of these challenges were due to a dearth of easily discoverable, well-written examples and
nontechnical documentation of each step in the process. The frequent feedback generated by following our own
process allowed us to adapt the GitHub Flow contribution model to our needs by splitting it into four concrete
steps (Fig. 1), reducing the need for technical knowledge of Git and GitHub, and allowing contributors to focus on
the content of their workflows instead.

Even though our contribution process is currently optimized with the tools available to work on any of the
major operating systems used by population genetics scientists (MS Windows, Mac OSX, Linux), some challenges
with usability by nontechnical users remain. One of these is ensuring that contributors are working with the same
versions of R and packages as those used in the software environment in which the website vignettes are
automatically tested and rendered. This software environment is purposefully codified in the form of a Docker image
(Boettiger 2015), for which the underlying Dockerfile is openly available and version tracked. In theory, this
Docker image can reproducibly provide the same software environment (Boettiger 2015), whether to the
automatic testing procedure, contributors to our resource or population genetics in R users at large. However, setting
up and running Docker on different operating systems is still not a simple task and still differs between systems.
To address this issue, the source repository for our website can itself act as an installable R package, which will
trigger installation and, if necessary, update of all its dependencies. This then allows users to render the
vignettes without having to run Docker containers.

By virtue of how we designed the contribution process, the results of following it remain available,
including its beginnings, and how it fares in the hands of both technically well-versed and nontechnical population
genetics scientists. For example, the submission history of the variant selection signal vignette
(https://github.com/NESCent/popgenInfo/pull/121) shows how a series of review comments led to incremental
improvement of the vignette text, code fragments and the choice of analytical methods, and revealed a component
missing from the Docker software environment. It concludes with acceptance and integration after a second
reviewer approves. The population differentiation based on microsatellite data vignette
(https://github.com/NESCent/popgenInfo/pull/86), submitted by one of the technically most experienced members
of the team, shows how the process lends itself to the collaborative resolution of difficult technical issues, and
how authors can actively recruit and solicit reviews from domain experts to push the process along.

How well the contribution process can scale in the future remains to be seen. We built technical features
into the process that allow reviewers to focus their time on the scientific and methodological substance of
contributed vignettes, rather than syntactic correctness of the code. However, the social challenges of scaling up a
peer review system of course remain, such as how to recruit additional reviewers, and how to manage
expectations for turnaround times. The strides made by other community-sustained online educational resources
facing similar challenges, such as Software Carpentry (Wilson 2014) and Data Carpentry (Teal et al. 2015), suggest
that addressing them is possible.

We developed a number of vignettes for analyses of population genetic data using different types of
molecular markers (DNA sequences, microsatellites and SNPs). In those vignettes, the user will find general information
on how to calculate basic statistics, such as genetic diversity, Hardy–Weinberg equilibrium and F-statistics. These
vignettes provide information about topics that directly target biological questions such as population
differentiation, genetic distances and detecting signals of selection. They serve as examples of how vignettes can be
constructed, but more importantly, we seeded the site with vignettes on basic population genetics summary statistics
because these are often the starting points of more detailed analyses. We hope that the demonstration of
basic analyses will lower the activation energy required for novices to perform population genetic analyses in R.

Table 1 List of existing vignettes (as of 26 January 2016) in the ‘Population Genetics in R’ website

Type of Marker    Main Topic and List of Analyses
Microsatellites   Basic statistics – observed and expected heterozygosity, Hardy–Weinberg equilibrium
                  Population differentiation – Fst, AMOVA, DAPC
SNPs              Basic statistics – observed and expected heterozygosity, Hardy–Weinberg equilibrium
                  Population differentiation – Fst, unsupervised clustering
                  Genetic distance (individual based) – based on Euclidean distance, based on number of loci,
                  based on allele differences
                  Detection of the signal of selection from genome scan (population based analysis)
Sequence data     Population differentiation – pairwise population differentiation, overall F-statistics,
                  Analysis of MOlecular VAriance (AMOVA)

This initial list of vignettes is by no means an exhaustive one (Table 1). There are many other biological
questions that could be fully addressed using analytical tools incorporated in R packages (e.g. parentage analysis,
signals of selection, clustering, effective population size estimation, phylogeography and landscape genetics). We invite other population geneticists to contribute to this effort for the advancement of the community of empirical population geneticists, and we maintain a ‘wish list’ of candidate vignettes in the issue tracker that accompanies our website’s source code repository (https://github.com/NESCent/popgenInfo/issues?q=is%3Aopen+is%3Aissue+label%3AWishlist).

The community website we created joins a few other curated community resources of bioinformatics methods and workflows. The collection of R workflows published by Bioconductor (https://www.bioconductor.org/help/workflows/) is perhaps most similar in purpose, but its scope is expressly for packages in the Bioconductor repository, which does not include many packages frequently used for population genetics. Additionally, some of the technologies and processes around which it is built, in particular an opaque review process, and Subversion, a client–server system for version control, do not lend themselves well to the contribution process we aimed to achieve. As Ram (2013) explains in detail, a distributed version control system offers particular benefits to community contribution, open science and a transparent open review process, for which reason we purposely chose to design our contribution process around Git for version control, and GitHub for pull requests. In part, this was to enable open peer review, and in part to allow a process for contributing and integrating changes that is more approachable by population genetics scientists at large who work in R, rather than primarily R package authors. A less similar but conceptually related community resource is the PLoS Computational Biology Topic Pages (Wodak et al. 2012). These offer in-depth discussions of various analytical subjects in bioinformatics. However, they are much broader in scope than the gap we sought to fill, and focus much on theoretical considerations, as opposed to demonstrating specific worked examples. Our resource could, though, conceivably take on a complementary role to Topic Pages on population genetics subjects. For example, a Topic Page author could complement their Topic Pages article with worked examples contributed to our website as tested and peer-reviewed vignettes.

Conclusions

The statistical programming platform R has become an essential tool for conducting population genetic analyses in the genomics era. In recent years, a rich ecosystem of packages for population genetic analysis has arisen, which has allowed analyses to be carried out entirely in R (Kamvar et al. 2015b; Krueger-Hadfield & Hoban 2016). So that the expanding number of scientists conducting population genetics analyses can more readily come up to speed with how best to utilize the various available resources, we developed a community website of narrated, peer-vetted and tested workflows, organized by research use case. The site blends the rapid publishing and openness of blogging with the peer review of traditional publishing to provide free educational material that can be widely disseminated. Unlike articles in a traditional journal, this resource can continuously evolve with the field. For population geneticists developing new methods, it can be a venue to showcase their methods in a real world, reproducible analysis.

Beyond the value of the website content itself, the community contribution, automatic testing and reproducible rebuild processes we designed may become useful to other communities as templates for creating their own open community-sustained educational resources. Every building block we used is either freely available as open source, or offered at a sufficiently feature-rich level for free by ‘freemium’ platform-as-a-service providers (here, GitHub for hosting, and Circle CI for automatic testing and site rebuilds).

Finally, in the future, the resource we created could be extended to enhance open education, specifically content, tools and software that support free learning and education practices by providing explicit examples of analyses that will not go out of date (Smith 2009). Even though the interactive platform we have developed in this website does not replace theoretical lectures in population genetics, it provides useful materials to demonstrate analytical approaches in population genetics using R. We hope that this central repository of materials will help demystify population genetic analysis in R and help to advance the practice of reproducible research.

Availability

The website can be found at http://popgen.nescent.org and the source repository for contributors is located at https://github.com/NESCent/popgenInfo. A snapshot of the site’s source code, including the data sets used by its vignettes, was archived at the time of submission and is available at http://dx.doi.org/10.5281/zenodo.48274

Acknowledgements

The resource reported in this study started at the Population Genetics in R Hackathon, which was held in March 2015 at the National Evolutionary Synthesis Center (NESCent) in Durham, NC, with the goal of addressing interoperability, scalability and workflow building challenges for the population genetics package ecosystem in R. The authors were participants in the


hackathon and are indebted to NESCent (NSF #EF-0905606) for hosting and supporting the event.

References

Andrews KR, Luikart G (2014) Recent novel approaches for population genomics data analysis. Molecular Ecology, 23, 1661–1667.
Baumer B, Cetinkaya-Rundel M, Bray A, Loi L, Horton NJ (2014) R Markdown: integrating A Reproducible Analysis Tool into Introductory Statistics. Technology Innovations in Statistics Education, 8, uclastat_cts_tise_20118.
Benestan L, Ferchaud A-L, Hohenlohe P et al. (2016) Conservation genomics of natural and managed populations: building a conceptual and practical framework. Molecular Ecology, doi:10.1111/mec.13647.
Blischak JD, Davenport ER, Wilson G (2016) A Quick Introduction to Version Control with Git and GitHub. PLoS Computational Biology, 12, e1004668.
Boettiger C (2015) An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 49, 71–79.
Boettiger C, Chamberlain S, Hart E, Ram K (2015) Building Software, Building Community: lessons from the rOpenSci Project. Journal of Open Research Software, 3, 1.
Excoffier L, Heckel G (2006) Computer programs for population genetics data analysis: a survival guide. Nature Reviews. Genetics, 7, 745–758.
Gandrud C (2015) Reproducible Research with R and R Studio. Chapman and Hall/CRC, Boca Raton, Florida.
Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24, 1403–1405.
Jombart T, Ahmed I (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics, 27, 3070–3071.
Kamvar ZN, Tabima JF, Grünwald NJ (2014) Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ, 2, e281.
Kamvar ZN, Brooks JC, Grünwald NJ (2015a) Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Frontiers in Genetics, 6, 208.
Kamvar ZN, Larsen MM, Kanaskie AM, Hansen EM, Grünwald NJ (2015b) Spatial and temporal analysis of populations of the sudden oak death pathogen in Oregon forests. Phytopathology, 105, 982–989.
Krueger-Hadfield SA, Hoban SM (2016) The importance of effective sampling for exploring the population dynamics of haploid–diploid seaweeds. Journal of Phycology, 52, 1–9.
Luikart G, England PR, Tallmon D, Jordan S, Taberlet P (2003) The power and promise of population genomics: from genotyping to genome typing. Nature Reviews. Genetics, 4, 981–994.
Paradis E (2010) pegas: an R package for population genetics with an integrated-modular approach. Bioinformatics, 26, 419–420.
Paradis E, Claude J, Strimmer K (2004) APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics, 20, 289–290.
R Core Team (2015) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Ram K (2013) Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8, 7.
Schliep KP (2011) phangorn: phylogenetic analysis in R. Bioinformatics, 27, 592–593.
Smith MS (2009) Opening education. Science, 323, 89–93.
Teal TK, Cranston KA, Lapp H et al. (2015) Data carpentry: workshops to increase data literacy for researchers. International Journal of Digital Curation, 10, 135–143.
Whitlock MC, Lotterhos KE (2015) Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of FST. American Naturalist, 186(Suppl 1), S24–S36.
Wickham H (2015) R Packages. O’Reilly Media, Sebastopol, California.
Widgren S et al. (2016) git2r: Provides Access to Git Repositories. R package version 0.14.0. URL https://CRAN.R-project.org/package=git2r.
Wilson G (2014) Software Carpentry: lessons learned. F1000Research, 3, 62.
Wilson G, Aruliah DA, Brown CT et al. (2014) Best practices for scientific computing. PLoS Biology, 12, e1001745.
Wodak SJ, Mietchen D, Collings AM, Russell RB, Bourne PE (2012) Topic pages: PLoS Computational Biology meets Wikipedia. PLoS Computational Biology, 8, e1002446.
Xie Y (2013) Dynamic Documents with R and knitr. Chapman and Hall/CRC, Boca Raton, Florida.

Z.K., M.L., S.C., N.G., H.L., and S.M. conceived and discussed the concept and framework for the project, Z.K., and H.L. implemented the underlying infrastructure, Z.K., M.L., S.C., N.G., H.L., and S.M. wrote, reviewed, and revised the current manuscript and vignettes.

Supporting Information

Additional Supporting Information may be found in the online version of this article:

Table S1. Technical glossary of git and GitHub computational terms.
