
TOOLBOX

PROGRAMMING TOOLS:
ADVENTURES WITH R
A guide to the popular, free statistics and visualization software
that gives scientists control of their own data analysis.
ILLUSTRATION BY THE PROJECT TWINS

BY SYLVIA TIPPMANN

For years, geneticist Helene Royo used commercial software to analyse her work. She would extract DNA from the developing sperm cells of mice, send it for analysis and then fire up a package called GeneSpring to study the results. “As a scientist, I wanted to understand everything I was doing,” she says. “But this kind of analysis didn’t allow that: I just pressed buttons and got answers.” And as Royo’s studies comparing genetic activity on different chromosomes became more involved, she realized that the commercial tool could not keep up with her data-processing demands.

With the results of her first genomic sequencing experiments in hand at the start of a new postdoc, Royo had a choice: pass the sequences over to the experts or learn to analyse the data herself. She took the plunge, and began learning how to parse data in the free, open-source software package R. It helped that the centre she had joined, the Friedrich Miescher Institute for Biomedical Research in Basel, Switzerland, ran regular courses on the software. But she was also following a wider trend: for many academics seeking to wean themselves off commercial software, R is the data-analysis tool of choice.

Besides being free, R is popular partly because it presents different faces to different users. It is, first and foremost, a programming language that requires input through a command line, which may seem forbidding to non-coders. But beginners can surf over the complexities and call up preset software packages, which come ready-made with commands for statistical analysis and data visualization. These packages create a welcoming middle ground between the comfort of commercial ‘black-box’ solutions and the expert world of code. “R made it very easy,” says Royo. “It did everything for me.”

That, indeed, is what R’s developers

1 JANUARY 2015 | VOL 517 | NATURE | 109
© 2015 Macmillan Publishers Limited. All rights reserved

intended when they designed it in the 1990s. Ross Ihaka and Robert Gentleman, statisticians at the University of Auckland in New Zealand, had an interest in computing but lacked practical software for their needs. So they developed a programming language with which they could perform data analysis themselves. R got its name in part from its developers’ initials, although it was also a reference to the most widely used coding language at the time, S.

In the early days of the World Wide Web, R quickly attracted interest from scientists around the globe who needed statistical software and were willing to contribute ideas. Gentleman and Ihaka decided to make their source code accessible to everybody, and coding-literate scientists quickly developed packages of pre-programmed routines and commands for particular fields. “I can write software that would be good for somebody doing astronomy,” says Gentleman, “but it’s a lot better if someone doing astronomy writes software for other people doing astronomy.”

MATHEMATICAL SOLUTIONS

Karline Soetaert, an oceanographer at the Royal Netherlands Institute for Sea Research in Yerseke, took up that idea when, in 2008, she wanted to check the health of zooplankton in the estuary of the river Scheldt. Soetaert wanted to calculate how fast zooplankton were dying, using measurements along the river, but R was not equipped for that. To tackle the problem, she worked with two ecologists to develop deSolve, the first package written in R to solve differential equations. “Other software can do that, but it is expensive and closed source,” she notes. Now deSolve is used by epidemiologists modelling infectious diseases, geneticists working on gene-regulatory networks and drug developers working on pharmacokinetics (how compounds behave in living organisms).

By 2003, 10 years after R’s first release, scientists had developed more than 200 packages, and the first citations of the ‘R Project’ appeared. Today, nearly 6,000 packages exist for all kinds of specialized purposes. They allow scientists to compare a human and a Neanderthal genome (using Bioconductor: go.nature.com/s7mq39); to model population growth (IPMpack: go.nature.com/cyhons); to predict equity prices (quantmod: go.nature.com/jxqasm); and to visualize the results in polished graphics (ggplot2: ggplot2.org) in a few lines of code. Experts can use R to write up manuscripts, embedding raw code in them to be run by the reader (knitr: http://yihui.name/knitr). Nearly 1 in 100 scholarly articles indexed in Elsevier’s Scopus database last year cites R or one of its packages, and in agricultural and environmental sciences, the share is even higher (see ‘A rising tide of R’).

STATISTICAL SUCCESS

For many users, R’s quality as statistics software stands out. The tool is on a par with commercial packages such as SPSS and SAS, says Robert Muenchen, a statistician at the University of Tennessee in Knoxville who analyses the popularity of software used in statistical computing. In the past decade, R has caught up with and overtaken the market leaders. “Most likely, R became the top statistics package used during the summer of this year,” he says.

In genomics and molecular biology, a software project called Bioconductor was developed on the back of R. It helps scientists to process and compare huge numbers of genetic sequences, to query results against databases such as Gene Expression Omnibus and to upload data to the databases. It includes almost 1,000 packages, some of which help to link the millions of DNA snippets from next-generation sequencing experiments to annotated genes.

For her dive into R, Royo had intensive training: under the supervision of Michael Stadler, head of the Friedrich Miescher Institute’s bioinformatics group, she took about half a year to work on R and Bioconductor. But there are plentiful chances to learn, says Karthik Ram, an ecologist at the Berkeley Institute for Data Science in California who founded rOpenSci, an initiative that helps scientists to adopt and develop R (see ‘An R starter kit’). He and his colleagues teach free courses that do not require existing programming skills and are targeted towards scientists’ specific problems.

One researcher who took that training is Megan Jennings, an ecologist at San Diego State University in California. She tracks bobcats, mountain lions and other wild animals to understand their movements. Armed with more than 400,000 time-stamped photos to which she had appended species names (taken from 36 cameras running for almost a year), Jennings wanted to follow particular species at particular times of year. At first, she manually selected the photos she wanted and fed them into a black-box program called PRESENCE. But with Ram’s help, she is creating an R package that reads in the tagged photos, cleans them up and then sends customized subsets of the data to a pre-existing modelling package in R. “What took me one hour to do manually, I will now be able to do in five minutes,” Jennings says.

One of the greatest perks of R is its online support. Discussion forums about R-related topics outstrip online questions about any commercial statistics software, says Muenchen. “It’s common to see someone post a question and the person who developed the package answer within half an hour,” he says. This rapid response is key for scientists in basic research. “I can find an answer to almost any question online,” says Royo. She can confidently do most of her day-to-day data analysis herself, and she helps out less proficient colleagues. Still, “I google things every day”, she adds. Learning R, says Royo, has not only taught her coding skills, but has also made her more critical about other scientists’ analyses.

Not every scientist is enthusiastic about learning the necessary programming, even though, says Ram, R is less intimidating than languages such as Python (let alone Perl or C). “There are going to be far more scientists that will be comfortable with click-and-drop interfaces than will ever learn to program at any time,” Muenchen says. Geneticist Rabih Murr, for example, took the same R course as Royo when he was a postdoc, but he did not invest as much time in practising. To get started and develop research-specific skills in R definitely requires a commitment: “It’s a matter of priorities,” he says. But after becoming a lab head at the University of Geneva in Switzerland this year, he is planning to hire someone with R experience.

Like any other skill, learning R cannot be done overnight. But Jennings says that it is worth it. “Make that time. Set it aside as an investment: for saving time later, and for building skills that can be used across multiple problems we face as scientists.” ■

NATURE.COM
For more on scientific software, apps and online tools, visit: nature.com/toolbox

A RISING TIDE OF R
An increasing proportion of research articles explicitly reference R or an R package.
[Figure: line chart of articles citing R (%), on a scale of 0 to 4, by year from 2000 to 2010, for these fields: Agricultural and biological sciences; Biochemistry, genetics and molecular biology; Earth and planetary sciences; Environmental science; Immunology and microbiology; Mathematics; Neuroscience.]
SYLVIA TIPPMANN/SOURCE: ELSEVIER SCOPUS DATABASE

TUTORIALS
An R starter kit

● Install R at the Comprehensive R Archive Network: http://cran.r-project.org. This also provides an introduction to the system: go.nature.com/jh9jb8.
● Many researchers recommend using a (free) powerful interface called RStudio: www.rstudio.com
● Among many online tutorials are those provided by DataCamp (go.nature.com/qndp6w), rOpenSci (ropensci.org), Software Carpentry (go.nature.com/wg3s9u) and R-bloggers (www.r-bloggers.com).
● For a sample list of R packages in different sciences, see the online version of this article at go.nature.com/zrhdkj.
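
To make the photo workflow that Jennings describes a little more concrete, here is a minimal sketch in base R of the read, clean and subset steps. It is not her package: the data frame, its column names and the helper functions clean_photos() and subset_photos() are all hypothetical, and the six records are invented for illustration.

```r
# Hypothetical records: one row per tagged photo (all values invented).
photos <- data.frame(
  camera  = c("cam01", "cam01", "cam02", "cam03", "cam03", "cam02"),
  species = c("bobcat", "Bobcat ", "mountain lion", "bobcat", NA, "coyote"),
  taken   = as.POSIXct(
    c("2014-01-15 03:12:00", "2014-06-02 22:40:00", "2014-06-20 01:05:00",
      "2014-07-11 04:30:00", "2014-07-12 05:00:00", "2014-11-03 23:15:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Clean up: drop untagged photos and normalise the species labels.
clean_photos <- function(df) {
  df <- df[!is.na(df$species), ]
  df$species <- tolower(trimws(df$species))
  df
}

# Subset: keep one species within a set of calendar months (1 = January).
subset_photos <- function(df, species, months) {
  photo_month <- as.integer(format(df$taken, "%m"))
  df[df$species == species & photo_month %in% months, ]
}

cleaned <- clean_photos(photos)
summer_bobcats <- subset_photos(cleaned, "bobcat", months = 6:8)
print(summer_bobcats)
```

Once the selection rule is written as a function, asking for another species or season means changing two arguments rather than re-sorting photos by hand, which is the kind of time saving Jennings describes.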

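
For readers curious about what solving a differential equation with deSolve looks like, the sketch below uses the package's ode() function on a deliberately simple case. It is not Soetaert's zooplankton model: the constant-mortality equation dN/dt = -m * N, the starting density and the mortality rate are assumptions chosen only for illustration, and the code assumes deSolve is already installed from CRAN.

```r
library(deSolve)  # assumes the deSolve package is installed from CRAN

# Hypothetical model: density N declines at a constant per-day mortality rate m.
mortality_model <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    dN <- -m * N
    list(c(dN))
  })
}

state <- c(N = 1000)          # invented starting density
parms <- c(m = 0.1)           # invented mortality rate per day
times <- seq(0, 30, by = 1)   # days

# Integrate the equation numerically over the requested time points.
out <- ode(y = state, times = times, func = mortality_model, parms = parms)

# Compare the numerical result with the exact solution N0 * exp(-m * t).
exact <- 1000 * exp(-0.1 * times)
max_error <- max(abs(out[, "N"] - exact))
print(max_error)
```

A case this simple has a closed-form answer, so it serves mainly as a check; the same structure carries over to models that can only be solved numerically.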
