merging data
Summarizing dataframes
• A few useful functions print general summaries of a dataframe,
showing which variables are included and what types of data they
contain.
• The most basic function is summary, which works on many types of
objects
• To simply see what types of variables your dataframe contains use the
str function (short for ’structure’).
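As a minimal sketch of these two functions, the example below builds a small dataframe and prints both summaries. The dataframe `plants` and its values are made up for illustration and are not part of the original material.

```r
# Hypothetical example dataframe (values are made up for illustration)
plants <- data.frame(
  Treatment = factor(c("control", "control", "fertilized", "fertilized")),
  Plantbiomass = c(2.1, 2.5, 3.8, 4.0)
)

# General summary: quartiles and mean for numeric columns, counts for factors
summary(plants)

# Structure: one line per variable with its type and first few values
str(plants)
```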
Useful functions in the Hmisc package
1. describe is much like summary, but offers slightly more
sophisticated statistics.
2. contents is similar to str, but does a very nice job of summarizing
the factor variables in your dataframe, and prints the number of
missing values, the number of rows, and so on.
• storage refers to the internal storage type of the variable. Note
that factor variables are stored as ’integer’, and other numbers as
’double’.
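The storage types can be checked directly in base R with typeof, as in the sketch below. The dataframe `dat` is a made-up example; the Hmisc calls run only if that package happens to be installed.

```r
# Hypothetical example dataframe with one factor and one numeric column
dat <- data.frame(
  Species = factor(c("A", "B", "A", "B")),
  Height = c(1.2, 3.4, 2.2, NA)
)

# Internal storage type of each column:
# the factor is stored as "integer", the numeric column as "double"
storage <- sapply(dat, typeof)
print(storage)

# describe and contents come from Hmisc; skipped if it is not installed
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(dat))
  print(Hmisc::contents(dat))
}
```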
Making summary tables
• Summarizing vectors with tapply()
• With the dataframe plantdat, executing the command
with(plantdat, tapply(Plantbiomass, Treatment, mean)) returns the mean
of Plantbiomass for each level of Treatment.
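A runnable sketch of this command is shown below. The contents of `plantdat` here are invented for illustration, since the original table is not reproduced in the text.

```r
# Stand-in for plantdat (values are made up; the real data is not shown)
plantdat <- data.frame(
  Treatment = factor(c("A", "A", "B", "B", "C", "C")),
  Plantbiomass = c(1.0, 2.0, 3.0, 5.0, 4.0, 6.0)
)

# Mean plant biomass per treatment; the result is a named array
means <- with(plantdat, tapply(Plantbiomass, Treatment, mean))
print(means)
```

With these made-up values the means are 1.5, 4 and 5 for treatments A, B and C.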
Summarizing dataframes with summaryBy
• summaryBy (from the doBy package) makes summary tables of multiple
variables at once, and returns a dataframe.
• With summaryBy, we can generate multiple summaries (mean, standard
deviation, etc.) on more than one variable in a dataframe at once.
• You can also use any function that returns a vector of results
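The sketch below illustrates both points under some assumptions: the dataframe and the column `Leafarea` are made up, and the summaryBy calls run only if the doBy package is installed. A base-R aggregate call is included for comparison; it is not the same function, but it produces a similar table of means.

```r
# Hypothetical data: two numeric variables measured in two treatments
plantdat <- data.frame(
  Treatment = factor(rep(c("A", "B"), each = 3)),
  Plantbiomass = c(1, 2, 3, 4, 6, 8),
  Leafarea = c(10, 12, 14, 20, 22, 27)
)

# Base-R comparison: mean of both variables per treatment
base_means <- aggregate(cbind(Plantbiomass, Leafarea) ~ Treatment,
                        data = plantdat, FUN = mean)
print(base_means)

if (requireNamespace("doBy", quietly = TRUE)) {
  # Several summaries of several variables at once
  tab <- doBy::summaryBy(Plantbiomass + Leafarea ~ Treatment,
                         data = plantdat, FUN = c(mean, sd))
  print(tab)

  # A user-defined function that returns a vector of results
  min_max <- function(x) c(min = min(x), max = max(x))
  print(doBy::summaryBy(Plantbiomass ~ Treatment,
                        data = plantdat, FUN = min_max))
}
```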
Calculate daily means and totals
(Book Example)
• This example uses weather data collected at the Hawkesbury Forest
Experiment in 2008. The data are given in half-hourly time steps.
• The goal is to provide the data as daily averages (for temperature)
and daily sums (for precipitation).
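One way to approach this in base R is sketched below. The data are simulated, and the column names `DateTime`, `Tair` and `Rain` are assumptions made for this sketch rather than the names in the actual dataset.

```r
# Simulated half-hourly data for two days (values are made up)
times <- seq(as.POSIXct("2008-01-01 00:00", tz = "UTC"),
             by = "30 min", length.out = 96)
weather <- data.frame(
  DateTime = times,
  Tair = rep(c(10, 20), each = 48),   # air temperature
  Rain = rep(c(0, 0.5), each = 48)    # precipitation per half hour
)

# Derive the calendar date from the timestamp
weather$Date <- as.Date(weather$DateTime)

# Daily mean temperature and daily total precipitation
daily_temp <- aggregate(Tair ~ Date, data = weather, FUN = mean)
daily_rain <- aggregate(Rain ~ Date, data = weather, FUN = sum)

# Combine both summaries into one dataframe, one row per day
daily <- merge(daily_temp, daily_rain, by = "Date")
print(daily)
```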
Tables of counts
• It is often useful to count the number of observations by one or
more factors.
• One option is to use tapply or summaryBy in combination with the
length function.
• A much better alternative is to use the xtabs and ftable functions.
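A short sketch of both functions follows. The dataframe `obs` and its columns are invented for illustration.

```r
# Hypothetical observations classified by three factors
obs <- data.frame(
  Species = c("A", "A", "B", "B", "B"),
  Site = c("north", "south", "north", "north", "south"),
  Year = c(2008, 2008, 2008, 2009, 2009)
)

# Two-way table of counts
counts <- xtabs(~ Species + Site, data = obs)
print(counts)

# With three or more factors, ftable prints a flat, more readable table
print(ftable(xtabs(~ Species + Site + Year, data = obs)))
```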
Adding simple summary variables to dataframes
• Consider the allometry dataset, which includes tree height for three
species. Suppose you want to add a new variable ’MaxHeight’, that is
the maximum tree height observed per species.
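One possible approach, which is not necessarily the one used in the original example, is the base-R function ave, which returns a group summary repeated for every row. The dataframe `trees` below is a made-up stand-in for the allometry dataset, and the column names are assumptions.

```r
# Made-up stand-in for the allometry data
trees <- data.frame(
  species = c("PIMO", "PIMO", "PIPO", "PIPO", "PSME"),
  height = c(10, 15, 20, 25, 30)
)

# ave applies FUN within each group and returns one value per row
trees$MaxHeight <- ave(trees$height, trees$species, FUN = max)
print(trees)
```

An alternative is to compute the per-species maximum as a separate summary table and merge it back into the original dataframe.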
Reordering factor levels based on a summary variable
• You can reorder the factor levels by some summary variable.
• If there is no specific order to the factor levels, it often makes
sense to plot them in ascending order of the summary variable.
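The base-R function reorder does this, as sketched below with a made-up dataframe.

```r
# Hypothetical data: species levels start out in alphabetical order
trees <- data.frame(
  species = factor(c("A", "A", "B", "B", "C", "C")),
  height = c(30, 32, 10, 12, 20, 22)
)
print(levels(trees$species))

# Reorder the levels by the mean height of each species (ascending)
trees$species <- reorder(trees$species, trees$height, FUN = mean)
print(levels(trees$species))

# A plot such as boxplot(height ~ species, data = trees) would now
# show the species in order of increasing mean height
```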
Combining dataframes
(Example dataframes: plantdat and leafnitrogendata; plantdatmore, with
exactly the same columns.)
Row-binding dataframes
• When you have multiple very similar dataframes, use the rbind
function to stack them on top of each other.
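A minimal sketch of rbind, with made-up contents for the two dataframes:

```r
# Two hypothetical dataframes with exactly the same columns
plantdat <- data.frame(
  Plant = 1:2,
  Treatment = c("A", "B"),
  Plantbiomass = c(1.2, 2.3)
)
plantdatmore <- data.frame(
  Plant = 3:4,
  Treatment = c("A", "B"),
  Plantbiomass = c(1.5, 2.9)
)

# Stack the rows of the second dataframe under the first
allplants <- rbind(plantdat, plantdatmore)
print(allplants)
```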
• Where some observations are duplicated between dataframes, use the
union function from the dplyr package, which returns only the unique
observations.
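The sketch below shows the idea with two made-up dataframes that share one row. The dplyr call runs only if that package is installed; the base-R line using unique and rbind gives a comparable result and is included as an alternative, not as the method described above.

```r
# Two hypothetical dataframes; the row with id 3 appears in both
a <- data.frame(id = 1:3, x = c("a", "b", "c"))
b <- data.frame(id = 3:4, x = c("c", "d"))

# Base-R alternative: stack the rows, then drop exact duplicates
base_union <- unique(rbind(a, b))
print(base_union)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # dplyr::union returns each distinct row once
  print(dplyr::union(a, b))
}
```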
• Sometimes, you want to rbind dataframes together but the column
names do not exactly match.
• One option is to first process the dataframes so that they do match
(using subscripting). Or, just use the bind_rows function from the
dplyr package, which fills columns missing from one dataframe with NA.
• The equivalent function to bind dataframes side-by-side is cbind.
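Both options, plus cbind, are sketched below with made-up dataframes. The bind_rows call runs only if dplyr is installed.

```r
# Hypothetical dataframes; b has an extra column 'width'
a <- data.frame(id = 1:2, height = c(1.1, 2.2))
b <- data.frame(id = 3:4, height = c(3.3, 4.4), width = c(0.5, 0.6))

# Option 1: subscript both dataframes to their shared columns, then rbind
common <- intersect(names(a), names(b))
stacked <- rbind(a[, common], b[, common])
print(stacked)

# Option 2: bind_rows keeps all columns and fills the gaps with NA
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::bind_rows(a, b))
}

# cbind places dataframes side by side; row counts must match
extra <- data.frame(site = c("north", "south"))
wide <- cbind(a, extra)
print(wide)
```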
Exporting summary tables
• To export summary tables generated with aggregate, tapply or table
to text files, use write.csv or write.table
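As a small sketch, the example below converts a tapply result to a dataframe and writes it to a CSV file. The data are made up, and a temporary file is used so that nothing is written to the working directory; in practice you would pass your own file name.

```r
# Hypothetical data and summary table
plantdat <- data.frame(
  Treatment = c("A", "A", "B", "B"),
  Plantbiomass = c(1, 3, 4, 6)
)
means <- with(plantdat, tapply(Plantbiomass, Treatment, mean))

# Convert the named array to a dataframe for a tidy export
means_df <- data.frame(
  Treatment = names(means),
  MeanBiomass = as.vector(means)
)

# Write to a temporary CSV file, without row names
outfile <- tempfile(fileext = ".csv")
write.csv(means_df, outfile, row.names = FALSE)

# Read the file back to confirm what was written
check <- read.csv(outfile)
print(check)
```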