4 LISTS

In contrast to a vector, in which all elements must be of the same mode,
R's list structure can combine objects of different types. For those
familiar with Python, an R list is similar to a Python dictionary or, for
that matter, a Perl hash. C programmers may find it similar to a C struct.
The list plays a central role in R, forming the basis for data frames,
object-oriented programming, and so on.
In this chapter, we'll cover how to create lists and how to work with
them. As with vectors and matrices, one common operation with lists is
indexing. List indexing is similar to vector and matrix indexing but with
some major differences. And like matrices, lists have an analog for the
apply() function. We'll discuss these and other list topics, including ways
to take lists apart, which often comes in handy.
4.1 Creating Lists

Technically, a list is a vector. Ordinary vectors (those of the type we've
been using so far in this book) are termed atomic vectors, since their
components cannot be broken down into smaller components. In contrast,
lists are referred to as recursive vectors.
For our first look at lists, let's consider an employee database. For each
employee, we wish to store the name, salary, and a Boolean indicating union
membership. Since we have three different modes here (character, numeric,
and logical), it's a perfect place for using lists. Our entire database might
then be a list of lists, or some other kind of list such as a data frame, though
we won't pursue that here.
We could create a list to represent our employee, Joe, this way:
> j <- list(name="Joe", salary=55000, union=T)
We could print out j, either in full or by component:
> j
$name
[1] "Joe"

$salary
[1] 55000

$union
[1] TRUE
Actually, the component names (called tags in the R literature), such as
salary, are optional. We could alternatively do this:
> jalt <- list("Joe", 55000, T)
> jalt
[[1]]
[1] "Joe"

[[2]]
[1] 55000

[[3]]
[1] TRUE

However, it is generally considered clearer and less error-prone to use
names instead of numeric indices.
Names of list components can be abbreviated to whatever extent is possible
without causing ambiguity:
> j$sal
[1] 55000
Since lists are vectors, they can be created via vector():
> z <- vector(mode="list")
> z[["abc"]] <- 3
> z
$abc
[1] 3
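
As a small sketch of the points in this section (the variable names
sal_abbrev, sal_exact, and sal_partial are illustrative and not from the
text), the following puts tagged and untagged creation side by side and shows
how far tag abbreviation reaches. It assumes a reasonably recent version of R,
where double brackets require an exact tag unless told otherwise.

```r
# A tagged list, as in the employee example
j <- list(name = "Joe", salary = 55000, union = TRUE)

# The same data without tags; components are reachable only by position
jalt <- list("Joe", 55000, TRUE)

# $ matches tags partially, as long as the abbreviation is unambiguous
sal_abbrev <- j$sal

# [[ ]] wants the exact tag unless exact = FALSE is requested
sal_exact <- j[["sal"]]                    # NULL: no tag is exactly "sal"
sal_partial <- j[["sal", exact = FALSE]]   # partial matching switched on

# An empty list built with vector(), then filled by tag
z <- vector(mode = "list")
z[["abc"]] <- 3
```

Relying on abbreviations is convenient at the console, but spelling tags out
in full is probably the safer habit inside functions.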
4.2 General List Operations

Now that you've seen a simple example of creating a list, let's look at how to
access and work with lists.
4.2.1 List Indexing

You can access a list component in several different ways:
> j$salary
[1] 55000
> j[["salary"]]
[1] 55000
> j[[2]]
[1] 55000
We can refer to list components by their numerical indices, treating
the list as a vector. However, note that in this case, we use double brackets
instead of single ones.
So, there are three ways to access an individual component c of a list lst
and return it in the data type of c:
lst$c
lst[["c"]]
lst[[i]], where i is the index of c within lst
Each of these is useful in different contexts, as you will see in subsequent
examples. But note the qualifying phrase, "return it in the data type of c."
An alternative to the second and third techniques listed is to use single
brackets rather than double brackets:
lst["c"]
lst[i], where i is the index of c within lst
Both single-bracket and double-bracket indexing access list elements
in vector-index fashion. But there is an important difference from ordinary
(atomic) vector indexing. If single brackets [ ] are used, the result is
another list, a sublist of the original. For instance, continuing the
preceding example, we have this:
> j[1:2]
$name
[1] "Joe"

$salary
[1] 55000

> j2 <- j[2]
> j2
$salary
[1] 55000

> class(j2)
[1] "list"
> str(j2)
List of 1
 $ salary: num 55000
The subsetting operation returned another list consisting of the first two
components of the original list j. Note that the word returned makes sense
here, since index brackets are functions. This is similar to other cases
you've seen for operators that do not at first appear to be functions,
such as +.
By contrast, you can use double brackets [[ ]] for referencing only a
single component, with the result having the type of that component.
> j[[1:2]]
Error in j[[1:2]] : subscript out of bounds
> j2a <- j[[2]]
> j2a
[1] 55000
> class(j2a)
[1] "numeric"
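
The practical consequence of the single-versus-double bracket distinction
shows up as soon as you compute with a component. The sketch below reuses the
employee list; the names sub1, val1, raise, and first_two are illustrative
only.

```r
j <- list(name = "Joe", salary = 55000, union = TRUE)

sub1 <- j["salary"]     # single brackets: a list of length 1
val1 <- j[["salary"]]   # double brackets: the component itself

# Arithmetic needs the component, not the sublist
raise <- val1 * 1.1
# sub1 * 1.1 would stop with "non-numeric argument to binary operator"

# Single brackets accept several indices at once and keep the tags
first_two <- j[c("name", "salary")]
```

A rough rule of thumb: use single brackets when you want a smaller list, and
double brackets (or $) when you want the thing stored inside.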
4.2.2 Adding and Deleting List Elements

The operations of adding and deleting list elements arise in a surprising
number of contexts. This is especially true for data structures in which
lists form the foundation, such as data frames and R classes.
New components can be added after a list is created.
> z <- list(a="abc", b=12)
> z
$a
[1] "abc"

$b
[1] 12

> z$c <- "sailing"  # add a c component
> # did c really get added?
> z
$a
[1] "abc"

$b
[1] 12

$c
[1] "sailing"

Adding components can also be done via a vector index:
> z[[4]] <- 28
> z[5:7] <- c(FALSE,TRUE,TRUE)
> z
$a
[1] "abc"

$b
[1] 12

$c
[1] "sailing"

[[4]]
[1] 28

[[5]]
[1] FALSE

[[6]]
[1] TRUE

[[7]]
[1] TRUE

You can delete a list component by setting it to NULL.
> z$b <- NULL
> z
$a
[1] "abc"

$c
[1] "sailing"

[[3]]
[1] 28

[[4]]
[1] FALSE

[[5]]
[1] TRUE

[[6]]
[1] TRUE

Note that upon deleting z$b, the indices of the elements after it moved
up by 1. For instance, the former z[[4]] became z[[3]].
You can also concatenate lists.
> c(list("Joe", 55000, T),list(5))
[[1]]
[1] "Joe"

[[2]]
[1] 55000

[[3]]
[1] TRUE

[[4]]
[1] 5
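
The operations in this section can be collected into one short script. This is
only a sketch. The last step (storing an actual NULL as a component) is not
covered in the text above and is included as an aside, since assigning NULL
otherwise deletes.

```r
z <- list(a = "abc", b = 12)
z$c <- "sailing"                   # add by tag
z[[4]] <- 28                       # add by position
z[5:7] <- c(FALSE, TRUE, TRUE)     # add several components at once
n_before <- length(z)              # 7 components

z$b <- NULL                        # delete; later components shift up
n_after <- length(z)               # 6 components
third <- z[[3]]                    # 28, which used to be z[[4]]

# Aside: to keep a NULL as a component, assign list(NULL) via single brackets
z["empty"] <- list(NULL)
n_with_null <- length(z)           # back to 7

# c() concatenates lists component by component
both <- c(list("Joe", 55000, TRUE), list(5))
```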
4.2.3 Getting the Size of a List

Since a list is a vector, you can obtain the number of components in a list
via length().
> length(j)
[1] 3
4.2.4 Extended Example: Text Concordance

Web search and other types of textual data mining are of great interest
today. Let's use this area for an example of R list code.
We'll write a function called findwords() that will determine which words
are in a text file and compile a list of the locations of each word's
occurrences in the text. This would be useful for contextual analysis, for
example.
Suppose our input file, testconcord.txt, has the following contents (taken
from this book!):

The [1] here means that the first item in this line of output is
item 1. In this case, our output consists of only one line (and one
item), so this is redundant, but this notation helps to read
voluminous output that consists of many items spread over many
lines. For example, if there were two rows of output with six items
per row, the second row would be labeled [7].

In order to identify words, we replace all nonletter characters with blanks
and get rid of capitalization. We could use the string functions presented in
Chapter 11 to do this, but to keep matters simple, such code is not shown
here. The new file, testconcorda.txt, looks like this:

the here means that the first item in this line of output is
item in this case our output consists of only one line and one
item so this is redundant but this notation helps to read
voluminous output that consists of many items spread over many
lines for example if there were two rows of output with six items
per row the second row would be labeled

Then, for instance, the word item has locations 7, 14, and 27, which
means that it occupies the seventh, fourteenth, and twenty-seventh word
positions in the file.
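
The cleanup step described above is not shown in the text. One possible way to
do it, offered only as a sketch, uses the base R functions tolower() and
gsub(). The helper name cleantext is hypothetical, and the two input lines are
just the start of the sample file.

```r
# Hypothetical helper (not from the text): lowercase the input, then
# turn every character that is not a lowercase letter or a space into a blank
cleantext <- function(lines) {
   gsub("[^a-z ]", " ", tolower(lines))
}

raw <- c("The [1] here means that the first item in this line of output is",
         "item 1.")
cleaned <- cleantext(raw)

# Splitting on runs of blanks yields one string per word, much as
# scan(tf, "") would after the cleaned text is written to a file
words <- unlist(strsplit(trimws(cleaned), " +"))
item_positions <- which(words == "item")
```

In this two-line fragment, item_positions holds the first two of the three
locations quoted above for the full file.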
Here is an excerpt from the list that is returned when our function
findwords() is called on this file:
> findwords("testconcorda.txt")
Read 68 items
$the
[1]  1  5 63

$here
[1] 2

$means
[1] 3

$that
[1]  4 40

$first
[1] 6

$item
[1]  7 14 27
...
The list consists of one component per word in the file, with a word's
component showing the positions within the file where that word occurs.
Sure enough, the word item is shown as occurring at positions 7, 14, and 27.
Before looking at the code, let's talk a bit about our choice of a list
structure here. One alternative would be to use a matrix, with one row per word
in the text. We could use rownames() to name the rows, with the entries within
a row showing the positions of that word. For instance, row item would
consist of 7, 14, 27, and then 0s in the remainder of the row. But the matrix
approach has a couple of major drawbacks:
There is a problem in terms of the columns to allocate for our matrix.
If the maximum frequency with which a word appears in our text is, say,
10, then we would need 10 columns. But we would not know that ahead
of time. We could add a new column each time we encountered a new
word, using cbind() (in addition to using rbind() to add a row for the
word itself). Or we could write code to do a preliminary run through
the input file to determine the maximum word frequency. Either of
these would come at the expense of increased code complexity and
possibly increased runtime.

Such a storage scheme would be quite wasteful of memory, since most
rows would probably consist of a lot of zeros. In other words, the matrix
would be sparse, a situation that also often occurs in numerical analysis
contexts.

Thus, the list structure really makes sense. Let's see how to code it.
1  findwords <- function(tf) {
2     # read in the words from the file, into a vector of mode character
3     txt <- scan(tf,"")
4     wl <- list()
5     for (i in 1:length(txt)) {
6        wrd <- txt[i]  # ith word in input file
7        wl[[wrd]] <- c(wl[[wrd]],i)
8     }
9     return(wl)
10 }
We read in the words of the file (words simply meaning any groups of
letters separated by spaces) by calling scan(). The details of reading and writing
files are covered in Chapter 10, but the important point here is that txt will
now be a vector of strings: one string per instance of a word in the file. Here
is what txt looks like after the read:
> txt
 [1] "the"        "here"       "means"      "that"       "the"
 [6] "first"      "item"       "in"         "this"       "line"
[11] "of"         "output"     "is"         "item"       "in"
[16] "this"       "case"       "our"        "output"     "consists"
[21] "of"         "only"       "one"        "line"       "and"
[26] "one"        "item"       "so"         "this"       "is"
[31] "redundant"  "but"        "this"       "notation"   "helps"
[36] "to"         "read"       "voluminous" "output"     "that"
[41] "consists"   "of"         "many"       "items"      "spread"
[46] "over"       "many"       "lines"      "for"        "example"
[51] "if"         "there"      "were"       "two"        "rows"
[56] "of"         "output"     "with"       "six"        "items"
[61] "per"        "row"        "the"        "second"     "row"
[66] "would"      "be"         "labeled"
The list operations in lines 4 through 8 build up our main variable, a list
wl (for word list). We loop through all the words from our long line, with wrd
being the current one.
Let's see what happens with the code in line 7 when i = 4, so that wrd =
"that" in our example file testconcorda.txt. At this point, wl[["that"]] will not
yet exist. As mentioned, R is set up so that in such a case, wl[["that"]] = NULL,
which means in line 7, we can concatenate it! Thus wl[["that"]] will become
the one-element vector (4). Later, when i = 40, wl[["that"]] will become
(4,40), representing the fact that words 4 and 40 in the file are both "that".
Note how convenient it is that list indexing can be done through quoted
strings, such as in wl[["that"]].
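
The NULL behavior that the loop depends on can be checked in isolation. The
sketch below first replays the two updates for "that" by hand, then runs the
same loop on a string instead of a file. The function findwords_text is a
hypothetical variant written for this illustration. It relies on the text
argument of scan(), and it uses seq_along() rather than 1:length() so that
empty input is handled cleanly.

```r
# c() drops NULL, so concatenating onto a missing component simply starts it
wl <- list()
missing_at_first <- is.null(wl[["that"]])
wl[["that"]] <- c(wl[["that"]], 4)    # becomes the one-element vector 4
wl[["that"]] <- c(wl[["that"]], 40)   # becomes 4 40

# Hypothetical variant of findwords() that reads from a string
findwords_text <- function(s) {
   txt <- scan(text = s, what = "", quiet = TRUE)
   wl <- list()
   for (i in seq_along(txt)) {
      wrd <- txt[i]                   # ith word in the input
      wl[[wrd]] <- c(wl[[wrd]], i)
   }
   wl
}

res <- findwords_text("the here means that the first item")
```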
An advanced, more elegant version of this code uses R's split() func-
tion, as you'll see in Section 6.2.2.
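
As a preview, and only as a guess at the general shape of that approach rather
than the later section's actual code, split() can group the word positions by
word in a single call:

```r
txt <- c("the", "here", "means", "that", "the", "first", "item")

# Group the positions 1, 2, ..., length(txt) according to the word found there
wl2 <- split(seq_along(txt), txt)
```

One visible difference from the loop version is that split() orders the
components by sorted word rather than by first appearance.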
4.3 Accessing List Components and Values

If the components in a list do have tags, as is the case with name, salary, and
union for j in Section 4.1, you can obtain them via names():
> names(j)
[1] "name"   "salary" "union"
To obtain the values, use unlist():
> ulj <- unlist(j)
> ulj
   name  salary   union
  "Joe" "55000"  "TRUE"
> class(ulj)
[1] "character"
The return value of unlist() is a vector; in this case, a vector of
character strings. Note that the element names in this vector come from the
components in the original list.
On the other hand, if we were to start with numbers, we would get
numbers.
> z <- list(a=5,b=12,c=13)
> y <- unlist(z)
> class(y)
[1] "numeric"
> y
 a  b  c
 5 12 13
So the output of unlist() in this case was a numeric vector. What about a
mixed case?
> w <- list(a=5,b="xyz")
> wu <- unlist(w)
> class(wu)
[1] "character"
> wu
    a     b
  "5" "xyz"
Here, R chose the least common denominator: character strings. This
sounds like some kind of precedence structure, and it is. As R's help for
unlist() states:

    Where possible the list components are coerced to a common
    mode during the unlisting, and so the result often ends up as a
    character vector. Vectors will be coerced to the highest type of the
    components in the hierarchy NULL < raw < logical < integer
    < real < complex < character < list < expression: pairlists are
    treated as lists.

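
A few small cases make the hierarchy concrete. This is a sketch with
illustrative variable names; each call mixes two adjacent levels and lands on
the higher one.

```r
lg_int  <- unlist(list(TRUE, 2L))      # logical and integer give integer
int_dbl <- unlist(list(2L, 3.5))       # integer and double give double
dbl_chr <- unlist(list(3.5, "xyz"))    # double and character give character

# Nested lists are flattened, and the names are joined with dots
nested <- unlist(list(a = 1, b = list(c = 2, d = 3)))
```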
But there is something else to deal with here. Though wu is a vector and
not a list, R did give each of the elements a name. We can remove them by
setting their names to NULL, as you saw in Section 2.11.
> names(wu) <- NULL
> wu
[1] "5"   "xyz"
We can also remove the elements' names directly with unname(), as
follows:
> wun <- unname(wu)
> wun
[1] "5"   "xyz"
This also has the advantage of not destroying the names in wu, in case they
are needed later. If they will not be needed later, we could simply
assign back to wu instead of to wun in the preceding statement.
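
A third option, not mentioned above, is to ask unlist() not to attach names in
the first place, through its use.names argument. A minimal sketch:

```r
w <- list(a = 5, b = "xyz")

# use.names = FALSE skips the naming step altogether
wu_plain <- unlist(w, use.names = FALSE)
```

This leaves the original list w and its tags untouched.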
4.4 Applying Functions to Lists

Two functions are handy for applying functions to lists: lapply and sapply.
4.4.1 Using the lapply() and sapply() Functions

The function lapply() (for list apply) works like the matrix apply() function,
calling the specified function on each component of a list (or vector coerced
to a list) and
returning another list. Here's an example:
> lapply(list(1:3,25:29),median)
[[1]]
[1] 2

[[2]]
[1] 27

R applied median() to 1:3 and to 25:29, returning a list consisting of 2 and 27.
In some cases, such as the example here, the list returned by lapply()
could be simplified to a vector or matrix. This is exactly what sapply() (for
simplified [l]apply) does.
> sapply(list(1:3,25:29),median)
[1]  2 27
You saw an example of matrix output in Section 2.6.2. There, we
applied a vectorized, vector-valued function (a function whose return
value is a vector, each of whose components is vectorized) to a vector
input. Using sapply(), rather than applying the function directly, gave us
the desired matrix form in the output.
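
To see when sapply() simplifies and when it cannot, consider the sketch below.
The function z12 is an illustrative stand-in for a vector-valued function, and
vapply() is a related base R function, not discussed in the text, that lets
you state the expected shape of each result up front.

```r
# A vector-valued function: returns its argument and the square of it
z12 <- function(z) c(z, z^2)

# Every result has length 2, so sapply() binds them into a 2 x 8 matrix
m <- sapply(1:8, z12)

# Results of differing lengths cannot be simplified; a list comes back
ragged <- sapply(1:3, function(n) seq_len(n))

# vapply() insists that each result be a single number
meds <- vapply(list(1:3, 25:29), median, numeric(1))
```

Because sapply() quietly falls back to a list, vapply() is often the safer
choice inside functions, where a surprise change of type is harder to spot.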
4.4.2 Extended Example: Text Concordance, Continued

The text concordance creator, findwords(), which we developed in
Section 4.2.4, returns a list of word locations, indexed by word. It would be
nice to be able to sort this list in various ways.
Recall that for the input file testconcorda.txt, we got this output:
$the
[1]  1  5 63

$here
[1] 2

$means
[1] 3

$that
[1]  4 40
...
Here's code to present the list in alphabetical order by word:
1  # sorts wrdlst, the output of findwords() alphabetically by word
2  alphawl <- function(wrdlst) {
3     nms <- names(wrdlst)  # the words
4     sn <- sort(nms)  # same words in alpha order
5     return(wrdlst[sn])  # return rearranged version
6  }
Since our words are the names of the list components, we can extract
the words by simply calling names(). We sort these alphabetically, and then in
line 5, we use the sorted version as input to list indexing, giving us a sorted
version of the list. Note the use of single brackets, rather than double, as
the former are required for subsetting lists. (You might also consider using
order() instead of sort(), which we'll look at in Section 8.3.)
Let's try it out.
> alphawl(wl)
$and
[1] 25

$be
[1] 67

$but
[1] 32

$case
[1] 17

$consists
[1] 20 41

$example
[1] 50
...
It works fine. The entry for and was displayed first, then be, and so on.
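
The order() alternative mentioned in the parenthetical remark can be sketched
on a four-word fragment of the concordance. The names by_sort and by_order are
illustrative. Indexing by sorted names and indexing by the permutation that
order() returns should give the same list.

```r
wl <- list(the = c(1, 5, 63), here = 2, means = 3, that = c(4, 40))

# Index by the sorted names themselves, as alphawl() does
by_sort <- wl[sort(names(wl))]

# Index by the positions that would put the names in sorted order
by_order <- wl[order(names(wl))]
```

The sort() form depends on the names being unique, which holds here because
findwords() creates exactly one component per distinct word.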
We can sort by word frequency in a similar manner.
1  # orders the output of findwords() by word frequency
2  freqwl <- function(wrdlst) {
3     freqs <- sapply(wrdlst,length)  # get word frequencies
4     return(wrdlst[order(freqs)])
5  }
In line 3, we are using the fact that each element in wrdlst is a vector
of numbers representing the positions in our input file at which the given
word is found. Calling length() on that vector gives us the number of times
the given word appears in that file. The result of calling sapply() will be the
vector of word frequencies.
We could use sort() here again, but order() is more direct. This latter
function returns the indices of a sorted vector with respect to the original
vector. Here's an example:
> x <- c(12,5,13,8)
> order(x)
[1] 2 4 1 3
The output here indicates that x[2] is the smallest element in x, x[4]
is the second smallest, and so on. In our case, we use order() to determine
which word is least frequent, second least frequent, and so on. Plugging
these indices into our word list gives us the same word list but in order of
frequency.
Let's check it.
> freqwl(wl)
$here
[1] 2

$means
[1] 3

$first
[1] 6
...
$that
[1]  4 40

$`in`
[1]  8 15

$line
[1] 10 24
...
$this
[1]  9 16 29 33

$of
[1] 11 21 42 56

$output
[1] 12 19 39 57

Yep, ordered from least to most frequent.
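
If you would rather see the most frequent words first, order() takes a
decreasing argument. The sketch below works on a small fragment of the
concordance. It also uses lengths(), a base R shortcut for
sapply(x, length) that is available in R 3.2.0 and later. The variable names
are illustrative.

```r
wl <- list(the = c(1, 5, 63), here = 2, means = 3, that = c(4, 40))

freqs <- lengths(wl)                  # number of positions stored per word

# Most frequent word first
most_first <- wl[order(freqs, decreasing = TRUE)]

# head() works on lists too: keep just the two most frequent words
top2 <- head(most_first, 2)
```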

We can also do a plot of the most frequent words. I ran the following
code on an article on R in the New York Times, "Data Analysts Captivated by
R's Power," from January 6, 2009.
> nyt <- findwords("nyt.txt")
Read 1011 items
> ssnyt <- freqwl(nyt)
> nwords <- length(ssnyt)
> freqs9 <- sapply(ssnyt[round(0.9*nwords):nwords],length)
> barplot(freqs9)
My goal was to plot the frequencies of the top 10 percent of the words in
the article. The results are shown in Figure 4-1.
[Figure 4-1: Top word frequencies in an article about R]
4.4.3 Extended Example: Back to the Abalone Data

Let's use the lapply() function in our abalone gender example in
Section 2.9.2. Recall that at one point in that example, we wished to know the
indices of the observations that were male, female, and infant. For an easy
demonstration, let's use the same test case: a vector of genders.
> g <- c("M","F","F","I","M","M","F")
A more compact way of accomplishing our goal is as follows:
> lapply(c("M","F","I"),function(gender) which(g==gender))
[[1]]
[1] 1 5 6

[[2]]
[1] 2 3 7

[[3]]
[1] 4

The lapply() function expects its first argument to be a list. Here it was a
vector, but lapply() will coerce that vector to a list form. Also, lapply() expects
its second argument to be a function. This could be the name of a function,
as you saw before, or the actual code, as we have here. This is an anonymous
function, which you'll learn more about in Section 7.13.
Then lapply() calls that anonymous function on "M", then on "F", and
then on "I". In that first case, the function calculates which(g=="M"), giving us
the vector of indices in g of the males. After determining the indices for the
females and infants, lapply() will return the three vectors in a list.
Note that even though the object of our main attention is the vector g of
genders, it is not the first argument in the lapply() call in the example.
Instead, that argument is an innocuous-looking vector of the three possible
gender encodings. By contrast, g is mentioned only briefly in the function,
as the second actual argument. This is a common situation in R. An even
better way to do this will be presented in Section 6.2.2.
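
One small refinement, offered as a sketch and not as the later section's
method, is to tag the components of the result with the gender codes so that
they can be looked up by name. The variables codes, idx, and idx2 are
illustrative.

```r
g <- c("M", "F", "F", "I", "M", "M", "F")
codes <- c("M", "F", "I")

# As before, then attach the codes as tags
idx <- lapply(codes, function(gender) which(g == gender))
names(idx) <- codes

# sapply() names its result after a character input automatically;
# simplify = FALSE keeps the result as a list
idx2 <- sapply(codes, function(gender) which(g == gender), simplify = FALSE)

males <- idx$M
```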
4.5 Recursive Lists

Lists can be recursive, meaning that you can have lists within lists. Here's an
example:
> b <- list(u = 5, v = 12)
> c <- list(w = 13)
> a <- list(b,c)
> a
[[1]]
[[1]]$u
[1] 5

[[1]]$v
[1] 12


[[2]]
[[2]]$w
[1] 13


> length(a)
[1] 2
This code makes a into a two-component list, with each component itself
also being a list.
The concatenate function c() has an optional argument recursive, which
controls whether flattening occurs when recursive lists are combined.
> c(list(a=1,b=2,c=list(d=5,e=9)))
$a
[1] 1

$b
[1] 2

$c
$c$d
[1] 5

$c$e
[1] 9


> c(list(a=1,b=2,c=list(d=5,e=9)),recursive=T)
  a   b c.d c.e
  1   2   5   9
In the first case, we accepted the default value of recursive, which is
FALSE, and obtained a recursive list, with the c component of the main list
itself being another list. In the second call, with recursive set to TRUE, we got
a single flat vector as a result; only the names look recursive. (It's odd that
setting recursive to TRUE gives a nonrecursive result.)
Recall that our first example of lists consisted of an employee database.
I mentioned that since each employee was represented as a list, the entire
database would be a list of lists. That is a concrete example of recursive lists.
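
Here is a sketch of how components of such nested lists can be reached. The
second employee record is made up for the illustration, and the inner list is
named c2 here, not c, simply to avoid masking the c() function.

```r
b <- list(u = 5, v = 12)
c2 <- list(w = 13)
a <- list(b, c2)

v1 <- a[[1]]$v        # step into the first component, then take v
v2 <- a[[c(1, 2)]]    # a vector inside [[ ]] descends level by level

flat <- unlist(a)     # one named numeric vector: u, v, w

# A tiny employee database as a list of lists (the second record is invented)
db <- list(list(name = "Joe", salary = 55000, union = TRUE),
           list(name = "Ann", salary = 61000, union = FALSE))
salaries <- sapply(db, function(emp) emp$salary)
```

The recursive meaning of a vector index inside double brackets also explains
the earlier error from j[[1:2]]: R looked for a second element inside the
one-element component "Joe".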