
Summarizing, tabulating and merging data
Summarizing dataframes
• There are a few useful functions to print general summaries of a
dataframe, to see which variables are included and what types of data
they contain.
• The most basic function is summary, which works on many types of
objects.

For each factor variable, the levels are printed (for the species
variable, the levels PIMO, PIPO and PSME).
For all numeric variables, the minimum, first quartile,
median, mean, third quartile, and maximum
values are shown.
Summarizing dataframes

• To simply see what types of variables your dataframe contains use the
str function (short for ’structure’).
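
As a minimal sketch, summary and str can be tried on a small made-up dataframe. The name trees and all measurements are hypothetical; only the species codes PIMO, PIPO and PSME come from the slides.

```r
# Hypothetical dataframe: one factor variable and two numeric variables
trees <- data.frame(
  species  = factor(c("PIMO", "PIPO", "PSME", "PIPO", "PSME", "PIMO")),
  diameter = c(12.5, 30.1, 22.4, 18.9, 41.0, 9.7),
  height   = c(5.2, 14.8, 11.3, 9.9, 20.5, 4.1)
)

# Counts per level for the factor; min, quartiles, median, mean, max for numerics
summary(trees)

# Compact overview: number of rows and columns, and the type of each variable
str(trees)
```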
Useful functions in the Hmisc package
1. describe is much like summary, but offers slightly more
sophisticated statistics.
2. contents is similar to str, but does a very nice job of summarizing
the factor variables in your dataframe, and prints the number of missing
values, the number of rows, and so on.
In the output of contents, storage refers to the internal storage
type of the variable.
NOTE that factor variables are stored as ’integer’, and other numbers
as ’double’.
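
The storage types can also be checked in base R with typeof. This is a small sketch with made-up values; the variable names are hypothetical.

```r
# A factor and a numeric vector (hypothetical values)
species <- factor(c("PIMO", "PIPO", "PSME"))
height  <- c(5.2, 14.8, 11.3)

typeof(species)  # factors are stored as integer codes plus level labels
typeof(height)   # ordinary decimal numbers are stored as double
```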
Making summary tables
• Summarizing vectors with tapply()

Given a dataframe plantdat, if we execute the command
with(plantdat, tapply(Plantbiomass, Treatment, mean))
the result is a vector
(elements of a vector can have names,
like columns of a dataframe).
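
A runnable sketch of this command follows. The contents of plantdat are invented for illustration (the treatment labels and biomass values are assumptions); only the column names Plantbiomass and Treatment come from the command above.

```r
# Hypothetical stand-in for plantdat
plantdat <- data.frame(
  PlantID      = 1:6,
  Treatment    = factor(rep(c("Control", "Fertilized"), each = 3)),
  Plantbiomass = c(2, 3, 4, 5, 6, 7)
)

# Mean biomass per treatment level; the result is a named vector
biomass_means <- with(plantdat, tapply(Plantbiomass, Treatment, mean))
biomass_means

# Elements can be extracted by name
biomass_means[["Control"]]
```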
Making summary tables

The result here (computed from the dataframe plantdat2) is a matrix, where A and B, the species
codes, are the row names of this matrix.
Making summary tables

• Use tapply if you want to summarize a variable by the levels of another variable.

The tapply function applies a function (sum) to a vector (Rain)
that is split into chunks depending on another variable (Year).
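
A minimal sketch of this call, assuming a dataframe with the columns Rain and Year. The dataframe name rain and its values are made up.

```r
# Hypothetical rainfall records with the year of each observation
rain <- data.frame(
  Year = c(2007, 2007, 2008, 2008, 2008),
  Rain = c(10, 5, 0, 20, 2)
)

# Split Rain into chunks by Year and apply sum to each chunk
yearly_rain <- with(rain, tapply(Rain, Year, sum))
yearly_rain
```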
Summarizing vectors with tapply()
• We can also use the tapply function on more than one variable at a
time.
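
With two grouping variables, tapply takes them as a list and returns a matrix. The sketch below uses an invented stand-in for plantdat2; the treatment labels and values are assumptions.

```r
# Hypothetical stand-in for plantdat2: two species, two treatments
plantdat2 <- data.frame(
  Species      = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  Treatment    = factor(rep(c("Control", "Fertilized"), times = 4)),
  Plantbiomass = c(2, 4, 4, 6, 1, 3, 3, 5)
)

# Mean biomass for each Species x Treatment combination
biomass_matrix <- with(plantdat2,
                       tapply(Plantbiomass, list(Species, Treatment), mean))
biomass_matrix

# Rows are species, columns are treatments
biomass_matrix["A", "Fertilized"]
```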
Summarizing vectors with tapply()
• A main advantage of tapply is that we can use its result as input to barplot.
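
As a sketch, the named vector returned by tapply can be passed straight to barplot, which uses the names as bar labels. The data and axis labels here are hypothetical.

```r
# Hypothetical biomass measurements for two treatments
biomass   <- c(2, 3, 4, 5, 6, 7)
treatment <- factor(rep(c("Control", "Fertilized"), each = 3))

# Summarize, then plot: one bar per treatment level
means <- tapply(biomass, treatment, mean)
bar_mids <- barplot(means, xlab = "Treatment", ylab = "Mean plant biomass")
```

barplot returns the midpoints of the bars, which is useful when adding error bars or text afterwards.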
Summarizing dataframes with summaryBy

# Note: use summaryBy, from the doBy package

Note that the result here is a dataframe.


Summarizing dataframes with summaryBy

plantdat2
Summarizing dataframes with summaryBy
• With summaryBy we can make summary tables of multiple variables at once, and end up with
a dataframe.
• We can generate multiple summaries (mean, standard
deviation, etc.) on more than one variable in a dataframe at once.
• You can also use any function that returns a vector of results.
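
A sketch of these uses of summaryBy, assuming the doBy package is installed. The dataframe is an invented stand-in for plantdat2, and the Leafarea column and the helper range_fun are hypothetical.

```r
library(doBy)

# Hypothetical stand-in for plantdat2 with two numeric variables
plantdat2 <- data.frame(
  Species      = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  Treatment    = factor(rep(c("Control", "Fertilized"), times = 4)),
  Plantbiomass = c(2, 4, 4, 6, 1, 3, 3, 5),
  Leafarea     = c(10, 14, 12, 16, 8, 11, 9, 13)
)

# Mean and standard deviation of two variables, per species
by_species <- summaryBy(Plantbiomass + Leafarea ~ Species,
                        data = plantdat2, FUN = c(mean, sd))
by_species

# Any function returning a vector can be used (hypothetical helper)
range_fun <- function(x) c(min = min(x), max = max(x))
by_group <- summaryBy(Plantbiomass ~ Species + Treatment,
                      data = plantdat2, FUN = range_fun)
by_group
```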
Calculate daily means and totals
(Book Example)
• Example using weather data collected at the Hawkesbury Forest
Experiment in 2008. The data are given in half-hourly time steps.
• The task is to provide the data as daily averages (for temperature) and daily sums (for
precipitation).
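
The idea can be sketched in base R with made-up half-hourly data. The dataframe weather, its column names and all values are assumptions, not the actual Hawkesbury dataset, and the book example may use different functions.

```r
# Two days of hypothetical half-hourly observations (48 per day)
weather <- data.frame(
  DateTime = seq(as.POSIXct("2008-01-01 00:00", tz = "UTC"),
                 by = "30 min", length.out = 96),
  Tair     = rep(c(15, 25), each = 48),
  Rain     = rep(c(0, 0.5), each = 48)
)

# Derive the calendar day from the timestamp
weather$Date <- as.Date(weather$DateTime)

# Daily average temperature and daily rainfall total
daily_temp <- with(weather, tapply(Tair, Date, mean))
daily_rain <- with(weather, tapply(Rain, Date, sum))

daily_temp
daily_rain
```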
Tables of counts
• It is often useful to count the number of observations by one or more
factors.
• One option is to use tapply or summaryBy in combination with the
length function.
• A much better alternative is to use the xtabs and ftable functions.
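
A short sketch of counting with xtabs and ftable, using an invented dataframe (the name plantcounts and its contents are hypothetical).

```r
# Hypothetical observations classified by two factors
plantcounts <- data.frame(
  Species   = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  Treatment = factor(rep(c("Control", "Fertilized"), times = 4))
)

# Number of observations for each Species x Treatment combination
counts <- xtabs(~ Species + Treatment, data = plantcounts)
counts

# ftable prints a compact 'flat' version, handy with three or more factors
ftable(counts)
```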
Adding simple summary variables to dataframes
• Consider the allometry dataset, which includes tree height for three
species. Suppose you want to add a new variable ’MaxHeight’, that is
the maximum tree height observed per species.
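
One base R option for this is ave, which returns a group summary repeated for every row of the group. This is a sketch with made-up data (allom is a hypothetical stand-in for the allometry dataset), and the slides may use a different approach such as summaryBy followed by merge.

```r
# Hypothetical stand-in for the allometry data
allom <- data.frame(
  species = factor(c("PIMO", "PIMO", "PIPO", "PIPO", "PSME", "PSME")),
  height  = c(5.2, 7.9, 14.8, 21.0, 11.3, 30.4)
)

# For every row, the maximum height observed within its species
allom$MaxHeight <- ave(allom$height, allom$species, FUN = max)
allom
```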
Reordering factor levels based on a summary variable
• You can reorder the factor levels by some summary variable.
Reordering factor levels based on a summary variable
• If there is no specific order to the factor levels, it often makes sense to plot them in ascending order of the summary variable.
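
The base R function reorder does this. In the sketch below (hypothetical data), the species levels are reordered by mean height, so that plots show groups in ascending order.

```r
# Hypothetical heights for three species
allom <- data.frame(
  species = factor(c("PIMO", "PIMO", "PIPO", "PIPO", "PSME", "PSME")),
  height  = c(12, 14, 20, 24, 5, 7)
)

levels(allom$species)  # default: alphabetical order

# Reorder the levels by the mean height of each species (ascending)
allom$species <- with(allom, reorder(species, height, FUN = mean))
levels(allom$species)

# Boxes now appear from the shortest to the tallest species
boxplot(height ~ species, data = allom)
```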
Combining dataframes
Merging the dataframes plantdat and leafnitrogendata:
note the missing value (NA) for the plant for
which no leaf nitrogen data was available.
Combining dataframes
• In many problems, you do not have a single dataset that contains all
the measurements you are interested in.
• With merge you can combine two datasets that have a different number of rows.
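
A minimal sketch of such a merge. The dataframe names come from the slides, but their columns and values are invented; PlantID as the key and leafnitrogen as the measurement are assumptions.

```r
# Hypothetical data: four plants, nitrogen measured on only three of them
plantdat <- data.frame(
  PlantID      = 1:4,
  Plantbiomass = c(2.1, 3.4, 2.8, 4.0)
)
leafnitrogendata <- data.frame(
  PlantID      = c(1, 2, 4),
  leafnitrogen = c(1.9, 2.3, 2.0)
)

# all = TRUE keeps every plant; missing measurements become NA
merged <- merge(plantdat, leafnitrogendata, by = "PlantID", all = TRUE)
merged
```

Without all = TRUE, merge keeps only the rows whose key occurs in both dataframes.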
Combining dataframes
• Sometimes, the variable you are merging with has a different name in
each dataframe. In that case, you can either rename the variable
before merging, or use the by.x and by.y arguments:
merge(data1, data2, by.x="unit", by.y="item")
where data1 has a variable called ’unit’, and data2 has a variable
called ’item’.
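
A runnable version of this call, with made-up contents for data1 and data2 (the columns x and y are hypothetical).

```r
# Same units, but the key column has a different name in each dataframe
data1 <- data.frame(unit = c("a", "b", "c"), x = 1:3)
data2 <- data.frame(item = c("a", "b", "c"), y = c(10, 20, 30))

# by.x names the key in the first dataframe, by.y the key in the second
merged <- merge(data1, data2, by.x = "unit", by.y = "item")
merged

# The key keeps the name it had in the first dataframe
names(merged)
```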
Combining dataframes
• Other times you need to merge two dataframes with multiple key
variables. Here two dataframes have measurements on the same
units at some of the same times, but on different variables:
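
A sketch of a merge on two key variables. All names here (d1, d2, unit, time, height, weight) and the values are hypothetical.

```r
# Two variables measured on the same units, at partly overlapping times
d1 <- data.frame(unit   = c("a", "a", "b", "b"),
                 time   = c(1, 2, 1, 2),
                 height = c(10, 12, 8, 9))
d2 <- data.frame(unit   = c("a", "b", "b"),
                 time   = c(1, 1, 2),
                 weight = c(3.1, 2.5, 2.9))

# Rows are matched only when both unit and time agree
merged <- merge(d1, d2, by = c("unit", "time"), all = TRUE)
merged
```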
Merging multiple datasets
• Consider the cereal dataset which gives measurements of contents of
cereals.
• Suppose the measurements for ’protein’, ’vitamins’ and ’sugars’ were
all produced by different labs, and each lab sends you a separate
dataset.
• Some measurements for sugars and vitamins are missing, because
samples were lost in those labs.
Merging multiple datasets

# Note that the number of rows is different between the datasets,
# and even the index name ('Cereal.name') differs between the datasets.
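
One way to handle this situation is to give the key the same name in every dataset and then merge them one after another. The sketch below uses invented data: the cereal names, the values and the alternative key names (cerealname, Cereal) are all assumptions, and Reduce is just one option for chaining the merges.

```r
# Hypothetical lab results; the key column is named differently in each
protein  <- data.frame(Cereal.name = c("Cereal A", "Cereal B", "Cereal C"),
                       protein     = c(4, 2, 5))
sugars   <- data.frame(cerealname  = c("Cereal A", "Cereal C"),
                       sugars      = c(6, 1))
vitamins <- data.frame(Cereal      = c("Cereal B", "Cereal C"),
                       vitamins    = c(25, 25))

# Give the key the same name everywhere
names(sugars)[1]   <- "Cereal.name"
names(vitamins)[1] <- "Cereal.name"

# Merge the three dataframes in turn, keeping cereals with missing values
cereal <- Reduce(function(x, y) merge(x, y, by = "Cereal.name", all = TRUE),
                 list(protein, sugars, vitamins))
cereal
```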
Merging multiple datasets
Row-binding dataframes

plantdatmore: a second dataframe
with exactly the same columns
Row-binding dataframes
• When you have multiple very similar dataframes, use the rbind
function to stack them.
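
A minimal sketch of rbind with two dataframes that share exactly the same columns. The contents of plantdat and plantdatmore are made up.

```r
# Two hypothetical dataframes with identical column names and types
plantdat <- data.frame(PlantID = 1:3, Plantbiomass = c(2.1, 3.4, 2.8))
plantdatmore <- data.frame(PlantID = 4:5, Plantbiomass = c(4.0, 3.1))

# Stack the rows of the second dataframe below the first
plantdat_all <- rbind(plantdat, plantdatmore)
plantdat_all
```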
Row-binding dataframes
• Where some observations are duplicated between dataframes, use the
union function from the dplyr package, which returns only unique
observations.
Row-binding dataframes
• Sometimes, you want to rbind dataframes together but the column
names do not exactly match.
• One option is to first process the dataframes so that they do match
(using subscripting). Or, just use the bind_rows function from the
dplyr package.
• The analogous function to bind dataframes side by side is cbind.
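
The dplyr functions union and bind_rows, and base R's cbind, can be sketched as follows, assuming dplyr is installed. All dataframes and column names are hypothetical.

```r
library(dplyr)

d1 <- data.frame(PlantID = 1:3, biomass = c(2, 3, 4))
d2 <- data.frame(PlantID = 3:4, biomass = c(4, 5))     # PlantID 3 is duplicated
d3 <- data.frame(PlantID = 5:6, leafarea = c(10, 12))  # different column

# union: stack the rows and keep each distinct observation once
unique_rows <- dplyr::union(d1, d2)
unique_rows

# bind_rows: stack by column name, filling unmatched columns with NA
stacked <- bind_rows(d1, d3)
stacked

# cbind: place dataframes with the same number of rows side by side
side_by_side <- cbind(d1, data.frame(height = c(30, 41, 38)))
side_by_side
```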
Exporting summary tables
• To export summary tables generated with aggregate, tapply or table
to text files, use write.csv or write.table.
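
As a sketch, a tapply result can be converted to a dataframe and written with write.csv. The data, the column names and the output file name are hypothetical; a temporary directory is used so that nothing is written to the working directory.

```r
# Hypothetical summary: mean biomass per treatment
biomass   <- c(2, 3, 4, 5, 6, 7)
treatment <- factor(rep(c("Control", "Fertilized"), each = 3))
means     <- tapply(biomass, treatment, mean)

# Turn the named vector into a two-column dataframe
summary_table <- data.frame(Treatment   = names(means),
                            MeanBiomass = as.vector(means))

# Write to a CSV file without the row numbers
outfile <- file.path(tempdir(), "biomass_means.csv")
write.csv(summary_table, outfile, row.names = FALSE)

# Read it back to check the contents
read.csv(outfile)
```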
