Professional Documents
Culture Documents
Matter, A: Python
Matter, A: Python
j
Sname
1] "Joe"
$salary
(1] 55000
union
(1] TRUE
I2]]
1) 55000
I131
1) TRUE
jssal
1] 55000
86 Chap
Since lists are vectors, they can be created via vector():
z - vector(mode= "list")
z[l "abe"]] « 3
$abc
(1] 3
jssalary
(1] 55000
jll"salary"]]
(1] 55000
il[2]]
(1] S5000
1stsc
1st[[""]]
1st[[]]. where i is the index of c witthin lst
lst[e"]
lst[i). where i is the index of c within 1st
87
another list-a sublist of the original. For instance, continuing the prececl-
have this:
ing example, we
jl1:2]
sname
(1] "Joe"
$salary
1] 55000o
j2 jl2]
j2
$salary
[1] 55000
class(j2)
(1] "list"
str(j2))
List of 1
$salary: num 55000
By you
contrast,
of that component.
the result having the type
single component, with
jl[1:2]]
Error in j[[1:2]]subscript out of bounds
j2a - jl[2]]
j2a
(1] 55000
class(j2a)
(1] "numeric"
z list(a="abc", b=12)
$a
(1] "abe"
88 Chapter 4
(1] 12
$a
(1] "abc"
[1] 12
(1] "sailing"
> z[[4]] «- 28
z[5:7] <- c(FALSE, TRUE,TRUE)
Z
(1] "abe"
5b
(1] 12
$c
(1] "sailing"
[[4]]
(1] 28
(1] FALSE
16]
(1] TRUE
7))
1) TRUE
z$b NULL
$a
(1] "abc"
89
(1] "sailing"
(1] 28
4]]
(1] FALSE
(1) TRUE
[t6]
[1] TRUE
Note that upon deleting zsb, the indices of the elements after it moved
up by 1. For instance, the former z[[4]] became z[[3]].
You can also concatenate lists.
(1] "Joe"
[[2]]
1) 55000
[[311
[1] TRUE
[[4]
[1] 5
90 Chapter 4
file, testconcord.Ixl, has the following contents (taken
Suppose our
input
from this book!):
The [1) here meansthat the first item in this line of output is
In this case, our output consists of only one line (and
one
item 1.
is
the here that the first item in this line of output
means
line and one
item in this case our output consists of only one
this file:
findwords() is called on
findwords("testconcorda. txt")
Read 68 items
$the
11 15 63
shere
(1] 2
Smeans
[1) 3
$that
(1 4 40
$first
[1] 6
$item
(1) 7 14 27
91
The list consisls of one component per word in the file, with a word's
component showing the positions within the file where that word occurs.
Sure enough, the word item is shown as occurring at positions 7, 14, and 27.
Before looking at the code, let's talk a bit about our choice of a list struc-
ture here. One alternative would be to use a matrix, with one row per word
in the text. We could use rownames () to name the rows, with the entries within
a row showing the positions of that word. For instance, row item would con-
sist of 7, 14, 27, and then 0s in the remainder of the row. But the matrix
approach has a couple of major drawbacks:
word, using cbind() (in addition to using rbind() add a row for the
to
word itself). Or we could write code to do a preliminary run through
Either of
the input file to determine the maximum word frequency.
of increased code complexity and
these would come at the expense
possibly increased runtime.
wasteful of memory, since mosu
Such a storage scheme would be quite
rows would probably consist of a lot of zeros. In other words, the matrix
would be sparse-a situation that also often occurs in numerical aalysis
contexts.
txt - scan(tf,"")
wl- list()
for (i in 1:length(txt)) {
Wrd - txt[i] # ith word in input file
wl[[wrd]] (- c(wl[[wrd]], i)
return(wl)
10
We read in the words of the file (words simply neaning any groups of let-
ters separated by spaces) by calling scan(). The details of reading and writing
files are covered in Chapter 10, but the important point here is that txt will
now be a vector of strings: one string per instance of a word in the file. Here
is what txt looks like after the read:
txt
"means" "that" "the"
(1] "the" "here
16] "first" "item" "in "this "line"
in"
(11] "of" "output" "iS "item"
"our "output" "consists"
(16] "this" "case
92 Chopter 4
(21] "of "only" "one" "line" "and"
[26] "one" "item" "so" "this" "is"
31] "redundant" "but" "this" "notation" "helps"
(36] "to" "read "voluminous" "output" "that"
41 "consists" of"
"many" "items" "spread"
(46] "over" many "lines" "for" "example
[s1) "if" "there" "were "two" "rows
(s6] "of" "output" "with" "six" "items
[61] "per" "row" "the" "second 'rOW"
[66] "would" "be" "labeled"
The list operations in lines 4 through 8 build up our main variable, a list
wl (for word list). We loop through all the words from our long line, with wrd
being the current one.
Let's see what happens with the code in line 7 when i 4, so
thatwrd
"that" in our exámple file testconcorda. txt. At this point, wl[["that"]] will not
yet exist. As mentioned, R is set up so that in such a case, wl[[ "that"]] - NULL,
which means in line 7, we can concatenate it! Thus wl[["that"] will become
the one-element vector (4). Later, when i =
40, wl[["that")]] will become
(4,40), representing the fact that words 4 ancd 40 in the file are both "that".
Note how convenient it is that list indexing can be done through quoted
strings, such as in wl[["that"]].
An advanced, more elegant version of this code uses R's split() func-
tion, as you'll see in Section 6.2.2.
union for j in Section 4.1, you can obtain them via names():
names (j)
[1 "name" "salary" "union"
class(ulj)
[1] "character"
ss 93
On the otlher hand, if we wvere to start witlh numbers, we would get
numbers.
2 - list(a=5,b=12,C=13)
y «- unlist(z)
class(y)
1 "numeric"
y
a b
5 12 13
mixed case?
W - list(a=5,b="xyz")
w u - unlist(w)
class(wu)
[1] "character"
Nu
"5" "xyz"
unlist() states:
coerced to a common
Where possible the list components are
as a
and so the result often ends up
mode during the unlisting,
characler vector. Vectors will be coerced
to the highest type of the
NULI. < raw < logical < inieger
componenis in the hierarchy
< real < complex < characler < list < expression: pairlists are
urealed as lists.
and
here. Though wu is a vector
But there is something else to deal with
not a list, R did give each of the elements a name. We can remove them by
Wu
[1] "s" "xyz"
follows:
wun - unname(wu)
Wun
94 Chopler
This also has the
they
advantage of not destroying the names in wu, in case
are needed later. If
they will not be needed later, we could
assign back to wu istead of to wun in thhe simply
1receding statemert.
4.4 Applying Functions to Lists
Two functions are
handy for
applying functions to lists: lapply and sapply.
[[2]]
(1) 27
R applied median() to 1:3 and to 25:29, returning a list consisting of 2 and 27.
In some cases, such as the example here, the list returned by lapply()
could be simplified to a vector or matrix. This is exactly what sapply () (for
simplifed [lJapply) does
sapply(1list(1:3,25:29), median)
[1] 2 27
$the
(1] 1 5 63
$here
[1] 2
S'S
Smeans
[1] 3
Sthat
(1] 4 40o
word:
Here's code to present the list in alphabetical
order by
by word
# sorts wrdlst, the output of findwords() alphabetically
alphawl «- function (wrdlst) {
nms names (wrdlst) # the words
# words in alpha orderT
sn - sort(nms) same
rearranged version
# return
return(wrdlst [sn])
we can exiract
the names of the list components,
Since o u r words are then in
We these alphabetically, and
by simply calling names ().
sort
the words us a s o r t e d
list indexing. giving
ine 5, w e use the sorted version as input to
rather than double,
as
of single brackes,
version of the list. Note the use
consider using
for subsetting lists. (You might also
the former are required
Section 8.3.)
which we'll look at in
order() instead of sort (),
Let's try it out.
alphawl(w1)
$and
[1] 25
$be
(1] 67
sbut
(1] 32
$case
(1] 17
$consists
[1] 20 41
Sexample
(1] 50
96 Chopter 4
We can sort by word frequency in a similar manner.
In line 3, we are using the fact that each element in wrdlst is a vector
of numbers represening the positions in our input file at which the givern
Word is found. Calling length() on that vector gives us the number of umes
will be the
the given word appears in that file. The result of calling sapply()
vector of word frequencies.
latter
m o r e direct. This
We could use sort() here again, but order() is
to the original
function returns the indices of a sorted vector with respect
vector. Here's an example:
x c(12,5,13,8)
order(x)
(1] 2 4 1 3
element in x, x[4)
indicates hat x[2] is the smallest
Theoutput here w e u s e order()
to determine
o n . In case,
1s the second smallest, and so
o u r
. Plugging
least frequent, and so o n
which word is least frequent, second in order of
us the s a m e word list but
these indices into our word list gives
frequency.
Let's check it.
freqwl (wl)
Shere
(1] 2
Smeans
[1] 3
sfirst
[1] 6
$that
(1) 4 40
in
[1] 8 15
$line
[1] 10 24
$this
97
(1] 9 16 29 33
$of
1] 11 21 42 56
$output
(1] 12 19 39 57
nyt - findwords("nyt.txt"))
Read 1011 items
snyt - freqwl(nyt)
nwords (- length (ssnyt)
barplot(ssnyt[round (0.9*nwords): nwords])
of the words in
the top 10 percent
to the frequencies of
plot
My goal was
41.
are shown in Figure
the article. The results
N rapu: Uvire {ALIVE)
in article about R
word frequencies an
Figure 4-1: Top
98 Chopter 4
4.4.3 Extended Example: Back to the Abalone Data
Let's use the lapply() function in our abalone gender exanple in Sec-
tion 2.9.2. Recall hat at one point in that example, we wislhed to know the
indices of the observations that were male, lemale, and infant. For an easy
demonstration, et's use the same lest case: a vector ol genders.
((2]]
(1] 2 3 7
([3]1
(1] 4
([1]]su
Lists 99
[1] 5
[1]]sv
(1] 12
[[2]1
[[2]]5w
(1] 13
length(a)
[1] 2
This code makes a into a two-component list, with each component itselr
also being a list.
The concatenate function c() has an optional argument recursive, which
$b
[1) 2
sc
$csd
(1)5
scse
[1] 9
c(list(a=1, b=2, c=list (d-5, e-9)), recursive-T)
a b c.d c.e
1 2 5 9
is
In the first case, we accepted the default value of recursive, which
FALSE, and obtained a recursive list, with the c component of the main list
itself being another list. In the second call, with recursive set to TRUE, we got
a single list as a result; only the names look recursive. (lt's odd that seting
recursive to TRUE gives a nonerursive list.))
Recall that our first example of lists consisted of an employee database.
I mentioned that since each employee was represented as a list, the entire
database would be a list of lists. That is a concrete example of recursive lists.
100 Chopter 4