DAL 371 SLID 12 SummarizingData

Summarizing, tabulating and merging data
Summarizing dataframes
• There are a few useful functions to print general summaries of a
dataframe, to see which variables are included and what types of data
they contain.
• The most basic function is summary, which works on many types of
objects.
For each factor variable, the levels are printed (for the
species variable, the levels PIMO, PIPO and PSME).
For all numeric variables, the minimum, first quartile,
median, mean, third quartile, and maximum
values are shown.
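As a minimal sketch of what summary reports, the example below builds a small made-up dataframe. The name trees_df and all values are assumptions for illustration, not the dataset shown on the slide.

```r
# Hypothetical stand-in for a tree dataset; all values are made up.
trees_df <- data.frame(
  species  = factor(c("PIMO", "PIPO", "PSME", "PIPO", "PSME", "PSME")),
  height   = c(12.1, 20.4, 33.0, 18.7, 29.5, 31.2),
  diameter = c(20.3, 35.1, 60.2, 30.8, 52.4, 55.0)
)

# Factor columns are summarized by level, numeric columns by
# minimum, quartiles, median, mean and maximum.
trees_summary <- summary(trees_df)
print(trees_summary)

# summary also works on a single numeric vector.
height_summary <- summary(trees_df$height)
print(height_summary)
```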
Summarizing dataframes
• To simply see what types of variables your dataframe contains, use the
str function (short for 'structure').
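A short sketch of str on a made-up dataframe follows. The dataframe trees_df and its columns are hypothetical; the sapply call is one possible way to get the same type information as a vector.

```r
# Hypothetical dataframe with three different column types.
trees_df <- data.frame(
  species = factor(c("PIMO", "PIPO", "PSME")),
  height  = c(12.1, 20.4, 33.0),
  tagged  = c(TRUE, FALSE, TRUE)
)

# Compact overview: one line per variable with its type and first values.
str(trees_df)

# The class of each column can also be extracted programmatically.
column_classes <- sapply(trees_df, class)
print(column_classes)
```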
Useful functions in the Hmisc package
1. describe is much like summary, but offers slightly more
sophisticated statistics.
2. contents is similar to str, but does a very nice job of summarizing
the factor variables in your dataframe, and prints the number of missing
values, the number of rows, and so on.
In the output of contents, storage refers to the internal storage
type of the variable.
NOTE that the factor variables are
stored as 'integer', and other numbers
as 'double'.
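The storage types mentioned above can be checked in base R with typeof, without needing the Hmisc package. This is only a sketch with made-up vectors (size_class and weights are hypothetical names).

```r
# A factor and a plain numeric vector (made-up values).
size_class <- factor(c("small", "large", "small"))
weights    <- c(1.5, 2, 3.25)

# typeof reports the internal storage type.
storage_types <- c(factor = typeof(size_class), numeric = typeof(weights))
print(storage_types)

# A factor is a vector of integer codes plus a 'levels' attribute.
factor_codes <- as.integer(size_class)
print(factor_codes)
print(levels(size_class))
```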
Making summary tables
• Summarizing vectors with tapply()
• Use tapply if you want to summarize a variable by the levels of another variable.
In a call such as tapply(Rain, Year, sum), the tapply function applies a function (sum) to a vector (Rain)
that is split into chunks depending on another variable (Year).
For the plantdat dataframe, if we execute the command
with(plantdat, tapply(Plantbiomass, Treatment, mean))
the result is a vector (elements of a vector can have names,
like columns of a dataframe).
Summarizing vectors with tapply()
• We can also use the tapply function on more than one variable at a
time.
For plantdat2, the result is a matrix, where A and B, the species
codes, are the row names of this matrix.
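The two uses of tapply described above can be sketched as follows. The dataframe plant_example and its values are made up, and the column name Species is an assumption; only Plantbiomass and Treatment appear in the slides.

```r
# Hypothetical plant data (made-up values).
plant_example <- data.frame(
  Treatment    = c("control", "control", "fertilized",
                   "fertilized", "control", "fertilized"),
  Species      = c("A", "B", "A", "B", "A", "B"),
  Plantbiomass = c(10, 14, 20, 26, 12, 30)
)

# One grouping variable: a named result with one mean per treatment.
mean_by_treatment <- with(plant_example,
                          tapply(Plantbiomass, Treatment, mean))
print(mean_by_treatment)

# Two grouping variables: a matrix with species as rows
# and treatments as columns.
mean_by_species_treatment <- with(plant_example,
  tapply(Plantbiomass, list(Species, Treatment), mean))
print(mean_by_species_treatment)
```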
Summarizing vectors with tapply()
• A main advantage of tapply is that we can use its result directly as input to barplot.
Summarizing dataframes with summaryBy
• To make summary tables of multiple variables at once, and end up with
a dataframe, use summaryBy from the doBy package.
• With summaryBy, we can generate multiple summaries (mean, standard
deviation, etc.) on more than one variable in a dataframe at once.
• You can also use any function that returns a vector of results.
For plantdat2, note that the result is a dataframe.
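The sketch below illustrates the idea of summarizing several variables at once into a dataframe. Because doBy may not be installed, the main part uses base R's aggregate as a stand-in, and the summaryBy call only runs when the package is available. The dataframe plant_example and the column Leafarea are assumptions for illustration.

```r
# Hypothetical plant data (made-up values).
plant_example <- data.frame(
  Treatment    = c("control", "control", "fertilized", "fertilized"),
  Plantbiomass = c(10, 14, 20, 26),
  Leafarea     = c(1.0, 1.4, 2.2, 2.6)
)

# Base R stand-in: group means of two variables at once,
# returned as a dataframe.
means_by_treatment <- aggregate(cbind(Plantbiomass, Leafarea) ~ Treatment,
                                data = plant_example, FUN = mean)
print(means_by_treatment)

# The summaryBy version, run only if the doBy package is installed.
# It computes the mean and standard deviation of both variables.
if (requireNamespace("doBy", quietly = TRUE)) {
  print(doBy::summaryBy(Plantbiomass + Leafarea ~ Treatment,
                        data = plant_example, FUN = c(mean, sd)))
}
```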
Calculate daily means and totals
(Book Example)
• Example using weather data collected at the Hawkesbury Forest
Experiment in 2008. The data are given in half-hourly time steps.
• The goal is to provide the data as daily averages (for temperature) and daily sums (for
precipitation).
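One way to go from half-hourly records to daily values is sketched below in base R. The data are generated on the spot, and the column names DateTime, Tair and Rain are assumptions; this is not the book's actual dataset or code.

```r
# Hypothetical half-hourly records for two days (made-up values).
timestamps <- seq(as.POSIXct("2008-01-01 00:00", tz = "UTC"),
                  by = "30 min", length.out = 96)
weather <- data.frame(
  DateTime = timestamps,
  Tair     = rep(c(15, 25), each = 48),  # air temperature
  Rain     = rep(c(0.1, 0), each = 48)   # rainfall per half hour
)

# Derive the calendar day of each record.
weather$Date <- as.Date(weather$DateTime, tz = "UTC")

# Daily average temperature and daily rainfall total.
daily_tair <- aggregate(Tair ~ Date, data = weather, FUN = mean)
daily_rain <- aggregate(Rain ~ Date, data = weather, FUN = sum)

# Combine both daily summaries into one dataframe.
daily <- merge(daily_tair, daily_rain, by = "Date")
print(daily)
```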
Tables of counts
• It is often useful to count the number of observations by one or more
factors.
• One option is to use tapply or summaryBy in combination with the
length function.
• A much better alternative is to use the xtabs and ftable functions.
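A minimal sketch of counting with xtabs and ftable, compared with the tapply and length approach. The dataframe obs and the site variable are made up for illustration.

```r
# Hypothetical observations (made-up values).
obs <- data.frame(
  species = c("PIMO", "PIMO", "PIPO", "PSME", "PSME", "PSME"),
  site    = c("north", "south", "north", "north", "south", "south")
)

# Counts by one factor.
count_by_species <- xtabs(~ species, data = obs)
print(count_by_species)

# Counts by two factors; ftable prints the result as a flat table.
count_two_way <- xtabs(~ species + site, data = obs)
print(ftable(count_two_way))

# The same one-way counts with tapply and length.
count_tapply <- with(obs, tapply(site, species, length))
print(count_tapply)
```

Unlike tapply with length, xtabs reports a zero for combinations that do not occur in the data.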
Adding simple summary variables to dataframes
• Consider the allometry dataset, which includes tree height for three
species. Suppose you want to add a new variable 'MaxHeight', that is
the maximum tree height observed per species.
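Two possible ways to add such a variable are sketched below, using a small made-up dataframe in place of the allometry dataset. The use of ave here is an assumption about approach, not necessarily the method used on the slide; the second option builds a summary table and merges it back.

```r
# Hypothetical stand-in for the allometry data (made-up values).
allom_example <- data.frame(
  species = c("PIMO", "PIMO", "PIPO", "PIPO", "PSME", "PSME"),
  height  = c(10.2, 14.8, 21.5, 19.0, 30.1, 33.4)
)

# Option 1: ave() returns the group maximum repeated for every row.
allom_example$MaxHeight <- ave(allom_example$height,
                               allom_example$species, FUN = max)

# Option 2: make a summary table, then merge it back by species.
max_table <- aggregate(height ~ species, data = allom_example, FUN = max)
names(max_table)[2] <- "MaxHeight2"
allom_merged <- merge(allom_example, max_table, by = "species")
print(allom_merged)
```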
Reordering factor levels based on a summary variable
• You can reorder the factor levels by some summary variable.
Reordering factor levels based on a summary variable
• If there is no specific order to the factor levels, it is often useful to plot them in ascending order of the summary variable.
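Reordering factor levels by a summary variable can be sketched with base R's reorder function. The dataframe tree_example and its values are made up.

```r
# Hypothetical tree data (made-up values).
tree_example <- data.frame(
  species = factor(c("PIMO", "PIMO", "PIPO", "PIPO", "PSME", "PSME")),
  height  = c(30, 34, 10, 12, 20, 22)
)

# Levels are alphabetical by default.
print(levels(tree_example$species))

# Reorder the levels by mean height, in ascending order.
tree_example$species <- reorder(tree_example$species,
                                tree_example$height, FUN = mean)
print(levels(tree_example$species))
```

Plots that use the factor, such as boxplots or barplots, then show the groups in this new order.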
Combining dataframes
• In many problems, you do not have a single dataset that contains all
the measurements you are interested in.
• The merge function can combine two datasets, even when they have a different number of rows.
After merging plantdat and leafnitrogendata, note the missing value (NA) for the plant for
which no leaf nitrogen data was available.
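A minimal sketch of merging two dataframes with a different number of rows follows. The dataframes and the key column PlantID are made up for illustration and are not the slide's plantdat and leafnitrogendata.

```r
# Hypothetical plant data and leaf nitrogen data (made-up values).
plant_example <- data.frame(PlantID = c(1, 2, 3),
                            Plantbiomass = c(10, 14, 20))
leafn_example <- data.frame(PlantID = c(1, 3),
                            leafnitrogen = c(2.1, 1.8))

# Keep all plants; plants without nitrogen data get NA.
merged_all <- merge(plant_example, leafn_example,
                    by = "PlantID", all.x = TRUE)
print(merged_all)

# Default behaviour: keep only plants present in both dataframes.
merged_inner <- merge(plant_example, leafn_example, by = "PlantID")
print(merged_inner)
```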
• Sometimes, the variable you are merging with has a different name in
either dataframe. In that case, you can either rename the variable
before merging, or use the by.x and by.y options:
merge(data1, data2, by.x="unit", by.y="item")
where data1 has a variable called 'unit', and data2 has a variable
called 'item'.
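Both options can be sketched with two small made-up dataframes. The names data1, data2, unit and item follow the text; the other columns are assumptions.

```r
# Hypothetical dataframes whose key columns have different names.
data1 <- data.frame(unit = c("a", "b", "c"), mass = c(1.2, 3.4, 5.6))
data2 <- data.frame(item = c("a", "b", "c"), length = c(10, 20, 30))

# Option 1: name the key column of each dataframe explicitly.
merged_xy <- merge(data1, data2, by.x = "unit", by.y = "item")
print(merged_xy)

# Option 2: rename the key column first, then merge as usual.
data2_renamed <- data2
names(data2_renamed)[names(data2_renamed) == "item"] <- "unit"
merged_renamed <- merge(data1, data2_renamed, by = "unit")
```

With by.x and by.y, the key column in the result takes its name from the first dataframe.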
• Other times you need to merge two dataframes with multiple key
variables. For example, two dataframes may have measurements on the same
units at some of the same times, but on different variables.
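Merging on more than one key variable can be sketched by passing a vector of column names to by. The dataframes growth and water and their columns are made up for illustration.

```r
# Hypothetical measurements on the same units at some of the same times.
growth <- data.frame(unit = c("a", "a", "b", "b"),
                     time = c(1, 2, 1, 2),
                     height = c(5, 7, 6, 9))
water  <- data.frame(unit = c("a", "b", "b"),
                     time = c(1, 1, 2),
                     wateruse = c(0.5, 0.6, 0.9))

# Match rows on the combination of unit and time; all = TRUE keeps
# rows that appear in only one of the dataframes.
merged_keys <- merge(growth, water, by = c("unit", "time"), all = TRUE)
print(merged_keys)
```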
Merging multiple datasets
• Consider the cereal dataset which gives measurements of contents of
cereals.
• Suppose the measurements for 'protein', 'vitamins' and 'sugars' were
all produced by different labs, and each lab sends you a separate
dataset.
• Some measurements for sugars and vitamins are missing, because
samples were lost in those labs.
Note that the number of rows is different between the datasets,
and even the index name ('Cereal.name') differs between the datasets.
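One way to handle this situation is sketched below: give every dataset the same key name, then merge them one after another. The three small dataframes and the differing key names cerealname and cereal are made up; only 'Cereal.name' comes from the text. Using Reduce is an assumption about approach.

```r
# Hypothetical lab datasets with different key names and row counts.
protein  <- data.frame(Cereal.name = c("Bran", "Oats", "Corn"),
                       protein = c(4, 5, 2))
vitamins <- data.frame(cerealname = c("Bran", "Corn"),
                       vitamins = c(25, 25))
sugars   <- data.frame(cereal = c("Oats", "Corn"),
                       sugars = c(1, 3))

# Give every dataset the same key name.
names(vitamins)[1] <- "Cereal.name"
names(sugars)[1]   <- "Cereal.name"

# Merge the list of dataframes pairwise, keeping all cereals.
cereal_all <- Reduce(
  function(x, y) merge(x, y, by = "Cereal.name", all = TRUE),
  list(protein, vitamins, sugars)
)
print(cereal_all)
```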
Row-binding dataframes
• When you have multiple very similar dataframes with exactly the same
columns (here, plantdatmore), use the rbind function to stack them.
• Where some observations are duplicated between dataframes, use the
union function from the dplyr package, which only returns unique
observations.
• Sometimes, you want to rbind dataframes together but the column
names do not exactly match.
• One option is to first process the dataframes so that they do match
(using subscripting). Or, just use the bind_rows function from the
dplyr package.
• An equivalent function to bind dataframes side-by-side is cbind.
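The row-binding cases above can be sketched as follows. The base R part uses unique as a stand-in for dplyr's union, and the dplyr calls only run when that package is installed. All dataframes and column names are made up for illustration.

```r
# Hypothetical dataframes with the same columns; one row is duplicated.
plant_a <- data.frame(PlantID = c(1, 2), Plantbiomass = c(10, 14))
plant_b <- data.frame(PlantID = c(2, 3), Plantbiomass = c(14, 20))

# Stack the rows; the duplicated observation appears twice.
stacked <- rbind(plant_a, plant_b)

# Base R stand-in for dplyr::union: drop duplicated rows.
stacked_unique <- unique(stacked)

# A dataframe with an extra column: subscript it to the shared
# columns so that rbind works.
plant_c <- data.frame(PlantID = 4, Plantbiomass = 25, Note = "late")
stacked_all <- rbind(stacked_unique, plant_c[, names(plant_a)])
print(stacked_all)

# The dplyr versions, run only if the package is installed.
# bind_rows fills columns missing from one dataframe with NA.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::union(plant_a, plant_b))
  print(dplyr::bind_rows(plant_a, plant_c))
}
```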
Exporting summary tables
• To export summary tables generated with aggregate, tapply or table
to text files, use write.csv or write.table.
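A minimal sketch of exporting a tapply result with write.csv. The data are made up, and the file is written to a temporary location so the example leaves nothing behind in the working directory.

```r
# Hypothetical plant data (made-up values).
plant_example <- data.frame(
  Treatment    = c("control", "control", "fertilized", "fertilized"),
  Plantbiomass = c(10, 14, 20, 26)
)

# Summary table as a named result, one mean per treatment.
mean_by_treatment <- with(plant_example,
                          tapply(Plantbiomass, Treatment, mean))

# Convert to a dataframe so the group names become a proper column.
summary_table <- data.frame(Treatment   = names(mean_by_treatment),
                            MeanBiomass = as.vector(mean_by_treatment))

# Write to a temporary CSV file, without row names.
out_file <- tempfile(fileext = ".csv")
write.csv(summary_table, out_file, row.names = FALSE)

# Read the file back to check the export.
read_back <- read.csv(out_file)
print(read_back)
```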
