You are on page 1of 4

R Reference Card for Data Mining Performance Evaluation apclusterK() affinity propagation clustering to get K clusters (apcluster)

performance() provide various measures for evaluating performance of pre- cclust() Convex Clustering, incl. k-means and two other clustering algo-
by Yanchang Zhao, yanchang@rdatamining.com, January 3, 2013 diction and classification models (ROCR) rithms (cclust)
The latest version is available at http://www.RDataMining.com. Click the link roc() build a ROC curve (pROC) KMeansSparseCluster() sparse k-means clustering (sparcl)
also for document R and Data Mining: Examples and Case Studies. auc() compute the area under the ROC curve (pROC) tclust(x,k,alpha,...) trimmed k-means with which a proportion
The package names are in parentheses. ROC() draw a ROC curve (DiagnosisMed) alpha of observations may be trimmed (tclust)
PRcurve() precision-recall curves (DMwR)
Association Rules & Frequent Itemsets CRchart() cumulative recall charts (DMwR) Hierarchical Clustering
Packages a hierarchical decomposition of data in either bottom-up (agglomerative) or top-
APRIORI Algorithm rpart recursive partitioning and regression trees down (divisive) way
a level-wise, breadth-first algorithm which counts transactions to find frequent party recursive partitioning hclust(d, method, ...) hierarchical cluster analysis on a set of dissim-
itemsets randomForest classification and regression based on a forest of trees using ran- ilarities d using the method for agglomeration
apriori() mine associations with APRIORI algorithm (arules) dom inputs birch() the BIRCH algorithm that clusters very large data with a CF-tree
rpartOrdinal ordinal classification trees, deriving a classification tree when the (birch)
ECLAT Algorithm response to be predicted is ordinal pvclust() hierarchical clustering with p-values via multi-scale bootstrap re-
rpart.plot plots rpart models with an enhanced version of plot.rpart in the sampling (pvclust)
employs equivalence classes, depth-first search and set intersection instead of
rpart package agnes() agglomerative hierarchical clustering (cluster)
counting
ROCR visualize the performance of scoring classifiers diana() divisive hierarchical clustering (cluster)
eclat() mine frequent itemsets with the Eclat algorithm (arules)
pROC display and analyze ROC curves mona() divisive hierarchical clustering of a dataset with binary variables only
(cluster)
Packages Regression rockCluster() cluster a data matrix using the Rock algorithm (cba)
arules mine frequent itemsets, maximal frequent itemsets, closed frequent item- proximus() cluster the rows of a logical matrix using the Proximus algorithm
sets and association rules. It includes two algorithms, Apriori and Eclat.
Functions (cba)
arulesViz visualizing association rules lm() linear regression isopam() Isopam clustering algorithm (isopam)
glm() generalized linear regression LLAhclust() hierarchical clustering based on likelihood linkage analysis
Sequential Patterns nls() non-linear regression (LLAhclust)
predict() predict with models flashClust() optimal hierarchical clustering (flashClust)
Functions
residuals() residuals, the difference between observed values and fitted val- fastcluster() fast hierarchical clustering (fastcluster)
cspade() mining frequent sequential patterns with the cSPADE algorithm ues
(arulesSequences) cutreeDynamic(), cutreeHybrid() detection of clusters in hierarchi-
gls() fit a linear model using generalized least squares (nlme) cal clustering dendrograms (dynamicTreeCut)
seqefsub() searching for frequent subsequences (TraMineR) gnls() fit a nonlinear model using generalized least squares (nlme) HierarchicalSparseCluster() hierarchical sparse clustering (sparcl)
Packages Packages
arulesSequences add-on for arules to handle and mine frequent sequences nlme linear and nonlinear mixed effects models Model based Clustering
TraMineR mining, describing and visualizing sequences of states or events
Clustering Mclust() model-based clustering (mclust)
Classification & Prediction HDDC() a model-based method for high dimensional data clustering (HDclas-
Partitioning based Clustering sif )
Decision Trees partition the data into k groups first and then try to improve the quality of clus- fixmahal() Mahalanobis Fixed Point Clustering (fpc)
ctree() conditional inference trees, recursive partitioning for continuous, cen- tering by moving objects from one group to another fixreg() Regression Fixed Point Clustering (fpc)
sored, ordered, nominal and multivariate response variables in a condi- kmeans() perform k-means clustering on a data matrix mergenormals() clustering by merging Gaussian mixture components (fpc)
tional inference framework (party) kmeansCBI() interface function for kmeans (fpc)
rpart() recursive partitioning and regression trees (rpart)
Density based Clustering
kmeansruns() call kmeans for the k-means clustering method and includes generate clusters by connecting dense regions
mob() model-based recursive partitioning, yielding a tree with fitted models estimation of the number of clusters and finding an optimal solution from
associated with each terminal node (party) dbscan(data,eps,MinPts,...) generate a density based clustering of
several starting points (fpc) arbitrary shapes, with neighborhood radius set as eps and density thresh-
Random Forest pam() the Partitioning Around Medoids (PAM) clustering method (cluster) old as MinPts (fpc)
cforest() random forest and bagging ensemble (party) pamk() the Partitioning Around Medoids (PAM) clustering method with esti- pdfCluster() clustering via kernel density estimation (pdfCluster)
randomForest() random forest (randomForest) mation of number of clusters (fpc)
varimp() variable importance (party) cluster.optimal() search for the optimal k-clustering of the dataset
Other Clustering Techniques
importance() variable importance (randomForest) (bayesclust)
clara() Clustering Large Applications (cluster) mixer() random graph clustering (mixer)
Neural Networks nncluster() fast clustering with restarted minimum spanning tree (nnclust)
fanny(x,k,...) compute a fuzzy clustering of the data into k clusters (clus-
nnet() fit single-hidden-layer neural network (nnet) orclus() ORCLUS subspace clustering (orclus)
ter)
Support Vector Machine (SVM) kcca() k-centroids clustering (flexclust) Plotting Clustering Solutions
svm() train a support vector machine for regression, classification or density- ccfkms() clustering with Conjugate Convex Functions (cba) plotcluster() visualisation of a clustering or grouping in data (fpc)
estimation (e1071) apcluster() affinity propagation clustering for a given similarity matrix (ap- bannerplot() a horizontal barplot visualizing a hierarchical clustering (clus-
ksvm() support vector machines (kernlab) cluster) ter)
Cluster Validation Packages SnowballStemmer() Snowball word stemmers (Snowball)
silhouette() compute or extract silhouette information (cluster) extremevalues detect extreme values in one-dimensional data LDA() fit a LDA (latent Dirichlet allocation) model (topicmodels)
cluster.stats() compute several cluster validity statistics from a cluster- mvoutlier multivariate outlier detection based on robust methods CTM() fit a CTM (correlated topics model) model (topicmodels)
ing and a dissimilarity matrix (fpc) outliers some tests commonly used for identifying outliers terms() extract the most likely terms for each topic (topicmodels)
clValid() calculate validation measures for a given set of clustering algo- Rlof a parallel implementation of the LOF algorithm topics() extract the most likely topics for each document (topicmodels)
rithms and number of clusters (clValid)
Time Series Analysis Packages
clustIndex() calculate the values of several clustering indexes, which can tm a framework for text mining applications
be independently used to determine the number of clusters existing in a Construction & Plot lda fit topic models with LDA
data set (cclust) ts() create time-series objects (stats) topicmodels fit topic models with LDA and CTM
NbClust() provide 30 indices for cluster validation and determining the num- plot.ts() plot time-series objects (stats) RTextTools automatic text classification via supervised learning
ber of clusters (NbClust) smoothts() time series smoothing (ast) tm.plugin.dc a plug-in for package tm to support distributed text mining
Packages sfilter() remove seasonal fluctuation using moving average (ast) tm.plugin.mail a plug-in for package tm to handle mail
cluster cluster analysis Decomposition RcmdrPlugin.TextMining GUI for demonstration of text mining concepts and
fpc various methods for clustering and cluster validation tm package
decomp() time series decomposition by square-root filter (timsac)
mclust model-based clustering and normal mixture modeling textir a suite of tools for inference about text documents and associated sentiment
decompose() classical seasonal decomposition by moving averages (stats)
birch clustering very large datasets using the BIRCH algorithm tau utilities for text analysis
stl() seasonal decomposition of time series by loess (stats)
pvclust hierarchical clustering with p-values textcat n-gram based text categorization
tsr() time series decomposition (ast)
apcluster Affinity Propagation Clustering YjdnJlp Japanese text analysis by Yahoo! Japan Developer Network
ardec() time series autoregressive decomposition (ArDec)
cclust Convex Clustering methods, including k-means algorithm, On-line Up- Social Network Analysis and Graph Mining
date algorithm and Neural Gas algorithm and calculation of indexes for Forecasting
finding the number of clusters in a data set arima() fit an ARIMA model to a univariate time series (stats) Functions
cba Clustering for Business Analytics, including clustering techniques such as predict.Arima() forecast from models fitted by arima (stats) graph(), graph.edgelist(), graph.adjacency(),
Proximus and Rock auto.arima() fit best ARIMA model to univariate time series (forecast) graph.incidence() create graph objects respectively from edges,
bclust Bayesian clustering using spike-and-slab hierarchical model, suitable for forecast.stl(), forecast.ets(), forecast.Arima() an edge list, an adjacency matrix and an incidence matrix (igraph)
clustering high-dimensional data forecast time series using stl, ets and arima models (forecast) plot(), tkplot() static and interactive plotting of graphs (igraph)
biclust algorithms to find bi-clusters in two-dimensional data Packages gplot(), gplot3d() plot graphs (sna)
clue cluster ensembles forecast displaying and analysing univariate time series forecasts V(), E() vertex/edge sequence of igraph (igraph)
clues clustering method based on local shrinking timsac time series analysis and control program are.connected() check whether two nodes are connected (igraph)
clValid validation of clustering results ast time series analysis degree(), betweenness(), closeness() various centrality scores
clv cluster validation techniques, contains popular internal and external cluster ArDec time series autoregressive-based decomposition (igraph, sna)
validation methods for outputs produced by package cluster ares a toolbox for time series analyses using generalized additive models add.edges(), add.vertices(), delete.edges(),
bayesclust tests/searches for significant clusters in genetic data dse tools for multivariate, linear, time-invariant, time series models delete.vertices() add and delete edges and vertices (igraph)
clustvarsel variable selection for model-based clustering neighborhood() neighborhood of graph vertices (igraph, sna)
clustsig significant cluster analysis, tests to see which (if any) clusters are statis- Text Mining get.adjlist() adjacency lists for edges or vertices (igraph)
tically different Functions nei(), adj(), from(), to() vertex/edge sequence indexing (igraph)
clusterfly explore clustering interactively cliques() find cliques, ie. complete subgraphs (igraph)
Corpus() build a corpus, which is a collection of text documents (tm)
clusterSim search for optimal clustering procedure for a data set clusters() maximal connected components of a graph (igraph)
tm map() transform text documents, e.g., stemming, stopword removal (tm)
clusterGeneration random cluster generation %->%, %<-%, %--% edge sequence indexing (igraph)
tm filter() filtering out documents (tm)
clusterCons calculate the consensus clustering result from re-sampled clustering get.edgelist() return an edge list in a two-column matrix (igraph)
TermDocumentMatrix(), DocumentTermMatrix() construct a
experiments with the option of using multiple algorithms and parameter read.graph(), write.graph() read and writ graphs from and to files
term-document matrix or a document-term matrix (tm)
gcExplorer graphical cluster explorer (igraph)
Dictionary() construct a dictionary from a character vector or a term-
hybridHclust hybrid hierarchical clustering via mutual clusters Packages
document matrix (tm)
Modalclust hierarchical modal Clustering
findAssocs() find associations in a term-document matrix (tm) sna social network analysis
iCluster integrative clustering of multiple genomic data types
findFreqTerms() find frequent terms in a term-document matrix (tm) igraph network analysis and visualization
EMCC evolutionary Monte Carlo (EMC) methods for clustering
stemDocument() stem words in a text document (tm) statnet a set of tools for the representation, visualization, analysis and simulation
rEMM extensible Markov Model (EMM) for data stream clustering
stemCompletion() complete stemmed words (tm) of network data
Outlier Detection termFreq() generate a term frequency vector from a text document (tm) egonet ego-centric measures in social network analysis
stopwords(language) return stopwords in different languages (tm) snort social network-analysis on relational tables
Functions
removeNumbers(), removePunctuation(), removeWords() re- network tools to create and modify network objects
boxplot.stats()$out list data points lying beyond the extremes of the move numbers, punctuation marks, or a set of words from a text docu- bipartite visualising bipartite networks and calculating some (ecological) indices
whiskers ment (tm) blockmodelinggeneralized and classical blockmodeling of valued networks
lofactor() calculate local outlier factors using the LOF algorithm (DMwR removeSparseTerms() remove sparse terms from a term-document matrix diagram visualising simple graphs (networks), plotting flow diagrams
or dprep) (tm) NetCluster clustering for networks
lof() a parallel implementation of the LOF algorithm (Rlof ) textcat() n-gram based text categorization (textcat) NetData network data for McFarland’s SNA R labs
NetIndices estimating network indices, including trophic structure of foodwebs Packages googleVis an interface between R and the Google Visualisation API to create
in R nlme linear and nonlinear mixed effects models interactive charts
NetworkAnalysis statistical inference on populations of weighted or unweighted lattice a powerful high-level data visualization system, with an emphasis on mul-
networks
Graphics tivariate data
tnet analysis of weighted, two-mode, and longitudinal networks Functions vcd visualizing categorical data
triads triad census for networks plot() generic function for plotting (graphics) denpro visualization of multivariate, functions, sets, and data
Spatial Data Analysis barplot(), pie(), hist() bar chart, pie chart and histogram (graph- iplots interactive graphics

Functions
ics) Data Manipulation
boxplot() box-and-whisker plot (graphics)
geocode() geocodes a location using Google Maps (ggmap) stripchart() one dimensional scatter plot (graphics) Functions
qmap() quick map plot (ggmap) dotchart() Cleveland dot plot (graphics) transform() transform a data frame
get map() queries the Google Maps, OpenStreetMap, or Stamen Maps server qqnorm(), qqplot(), qqline() QQ (quantile-quantile) plot (stats) scale() scaling and centering of matrix-like objects
for a map at a certain location (ggmap) coplot() conditioning plot (graphics) t() matrix transpose
gvisGeoChart(), gvisGeoMap(), gvisIntensityMap(), splom() conditional scatter plot matrices (lattice) aperm() array transpose
gvisMap() Google geo charts and maps (googleVis) pairs() a matrix of scatterplots (graphics) sample() sampling
GetMap() download a static map from the Google server (RgoogleMaps) cpairs() enhanced scatterplot matrix (gclus) table(), tabulate(), xtabs() cross tabulation (stats)
ColorMap() plot levels of a variable in a colour-coded map (RgoogleMaps) parcoord() parallel coordinate plot (MASS) stack(), unstack() stacking vectors
PlotOnStaticMap() overlay plot on background image of map tile cparcoord() enhanced parallel coordinate plot (gclus) split(), unsplit() divide data into groups and reassemble
(RgoogleMaps) paracoor() parallel coordinates plot (denpro) reshape() reshape a data frame between “wide” and “long” format (stats)
TextOnStaticMap() plot text on map (RgoogleMaps) parallelplot() parallel coordinates plot (lattice) merge() merge two data frames; similar to database join operations
Packages densityplot() kernel density plot (lattice) aggregate() compute summary statistics of data subsets (stats)
plotGoogleMaps plot spatial data as HTML map mushup over Google Maps contour(), filled.contour() contour plot (graphics) by() apply a function to a data frame split by factors
RgoogleMaps overlay on Google map tiles in R levelplot(), contourplot() level plots and contour plots (lattice) melt(), cast() melt and then cast data into the reshaped or aggregated
plotKML visualization of spatial and spatio-temporal objects in Google Earth smoothScatter() scatterplots with smoothed densities color representation; form you want (reshape)
ggmap Spatial visualization with Google Maps and OpenStreetMap capable of visualizing large datasets (graphics) complete.cases() find complete cases, i.e., cases without missing values
clustTool GUI for clustering data with spatial information sunflowerplot() a sunflower scatter plot (graphics) na.fail, na.omit, na.exclude, na.pass handle missing values
SGCS Spatial Graph based Clustering Summaries for spatial point patterns assocplot() association plot (graphics) Packages
spdep spatial dependence: weighting schemes, statistics and models mosaicplot() mosaic plot (graphics)
reshape flexibly restructure and aggregate data
matplot() plot the columns of one matrix against the columns of another
Statistics data.table extension of data.frame for fast indexing, ordered joins, assignment,
(graphics)
and grouping and list columns
Summarization fourfoldplot() a fourfold display of a 2 × 2 × k contingency table (graph-
gdata various tools for data manipulation
ics)
summary() summarize data
describe() concise statistical description of data (Hmisc)
persp() perspective plots of surfaces over the x?y plane (graphics) Data Access
cloud(), wireframe() 3d scatter plots and surfaces (lattice) Functions
boxplot.stats() box plot statistics
interaction.plot() two-way interaction plot (stats)
Analysis of Variance iplot(), ihist(), ibar(), ipcp() interactive scatter plot, his- save(), load() save and load R data objects
aov() fit an analysis of variance model (stats) togram, bar plot, and parallel coordinates plot (iplots) read.csv(), write.csv() import from and export to .CSV files
anova() compute analysis of variance (or deviance) tables for one or more pdf(), postscript(), win.metafile(), jpeg(), bmp(), read.table(), write.table(), scan(), write() read and
fitted model objects (stats) png(), tiff() save graphs into files of various formats write data
write.matrix() write a matrix or data frame (MASS)
Statistical Test gvisAnnotatedTimeLine(), gvisAreaChart(),
readLines(), writeLines() read/write text lines from/to a connection,
t.test() student’s t-test (stats) gvisBarChart(), gvisBubbleChart(),
gvisCandlestickChart(), gvisColumnChart(), such as a text file
prop.test() test of equal or given proportions (stats) sqlQuery() submit an SQL query to an ODBC database (RODBC)
binom.test() exact binomial test (stats) gvisComboChart(), gvisGauge(), gvisGeoChart(),
gvisGeoMap(), gvisIntensityMap(), sqlFetch() read a table from an ODBC database (RODBC)
Mixed Effects Models gvisLineChart(), gvisMap(), gvisMerge(), sqlSave(), sqlUpdate() write or update a table in an ODBC database
lme() fit a linear mixed-effects model (nlme) gvisMotionChart(), gvisOrgChart(), (RODBC)
nlme() fit a nonlinear mixed-effects model (nlme) gvisPieChart(), gvisScatterChart(), sqlColumns() enquire about the column structure of tables (RODBC)
sqlTables() list tables on an ODBC connection (RODBC)
Principal Components and Factor Analysis gvisSteppedAreaChart(), gvisTable(),
gvisTreeMap() various interactive charts produced with the Google odbcConnect(), odbcClose(), odbcCloseAll() open/close con-
princomp() principal components analysis (stats) nections to ODBC databases (RODBC)
prcomp() principal components analysis (stats) Visualisation API (googleVis)
gvisMerge() merge two googleVis charts into one (googleVis) dbSendQuery execute an SQL statement on a given database connection
Other Functions (DBI)
var(), cov(), cor() variance, covariance, and correlation (stats)
Packages dbConnect(), dbDisconnect() create/close a connection to a DBMS
density() compute kernel density estimates (stats) ggplot2 an implementation of the Grammar of Graphics (DBI)
Packages snowfall usability wrapper around snow for easier development of parallel R gWidgets a toolkit-independent API for building interactive GUIs
RODBC ODBC database access programs Red-R An open source visual programming GUI interface for R
DBI a database interface (DBI) between R and relational DBMS snowFT extension of snow supporting fault tolerant and reproducible applica- R AnalyticFlow a software which enables data analysis by drawing analysis
RMySQL interface to the MySQL database tions, and easy-to-use parallel programming flowcharts
RJDBC access to databases through the JDBC interface Rmpi interface (Wrapper) to MPI (Message-Passing Interface) latticist a graphical user interface for exploratory visualisation
RSQLite SQLite interface for R rpvm R interface to PVM (Parallel Virtual Machine)
nws provide coordination and parallel execution facilities
Other R Reference Cards
ROracle Oracle database interface (DBI) driver
foreach foreach looping construct for R R Reference Card, by Tom Short
RpgSQL DBI/RJDBC interface to PostgreSQL database
doMC foreach parallel adaptor for the multicore package http://rpad.googlecode.com/svn-history/r76/Rpad_homepage/
RODM interface to Oracle Data Mining
doSNOW foreach parallel adaptor for the snow package R-refcard.pdf or
xlsReadWrite read and write Excel files
doMPI foreach parallel adaptor for the Rmpi package http://cran.r-project.org/doc/contrib/Short-refcard.pdf
WriteXLS create Excel 2003 (XLS) files from data frames
doParallel foreach parallel adaptor for the multicore package R Reference Card, by Jonathan Baron
Big Data doRNG generic reproducible parallel backend for foreach Loops http://cran.r-project.org/doc/contrib/refcard.pdf
GridR execute functions on remote hosts, clusters or grids R Functions for Regression Analysis, by Vito Ricci
Functions http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.
as.ffdf() coerce a dataframe to an ffdf (ff ) fork R functions for handling multiple processes
pdf
read.table.ffdf(), read.csv.ffdf() read data from a flat file to Generating Reports R Functions for Time Series Analysis, by Vito Ricci
an ffdf object (ff ) http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf
write.table.ffdf(), write.csv.ffdf() write an ffdf object to a Sweave() mixing text and R/S code for automatic report generation (utils)
flat file (ff ) knitr a general-purpose package for dynamic report generation in R RDataMining Website, Package, Twitter & Groups
ffdfappend() append a dataframe or an ffdf to an existing ffdf (ffdf ) R2HTML making HTML reports
R2PPT generating Microsoft PowerPoint presentations RDataMining Website: http://www.rdatamining.com
big.matrix() create a standard big.matrix, which is constrained to avail- Group on LinkedIn: http://group.rdatamining.com
able RAM (bigmemory) Interface to Weka Group on Google: http://group2.rdatamining.com
read.big.matrix() create a big.matrix by reading from an ASCII file Package RWeka is an R interface to Weka, and enables to use the following Weka Twitter: http://twitter.com/rdatamining
(bigmemory) functions in R. RDataMining Package: http://www.rdatamining.com/package
write.big.matrix() write a big.matrix to a file (bigmemory) Association rules: http://package.rdatamining.com
filebacked.big.matrix() create a file-backed big.matrix, which may Apriori(), Tertius()
exceed available RAM by using hard drive space (bigmemory) Regression and classification:
mwhich() expanded “which”-like functionality (bigmemory) LinearRegression(), Logistic(), SMO()
Packages Lazy classifiers:
ff memory-efficient storage of large data on disk and fast access functions IBk(), LBR()
ffbase basic statistical functions for package ff Meta classifiers:
filehash a simple key-value database for handling large data AdaBoostM1(), Bagging(), LogitBoost(),
g.data create and maintain delayed-data packages MultiBoostAB(), Stacking(),
BufferedMatrix a matrix data storage object held in temporary files CostSensitiveClassifier()
biglm regression for data too large to fit in memory Rule classifiers:
bigmemory manage massive matrices with shared memory and memory-mapped JRip(), M5Rules(), OneR(), PART()
files Regression and classification trees:
biganalytics extend the bigmemory package with various analytics J48(), LMT(), M5P(), DecisionStump()
bigtabulate table-, tapply-, and split-like functionality for matrix and Clustering:
big.matrix objects Cobweb(), FarthestFirst(), SimpleKMeans(),
XMeans(), DBScan()
Parallel Computing Filters:
Functions Normalize(), Discretize()
foreach(...) %dopar% looping in parallel (foreach) Word stemmers:
registerDoSEQ(), registerDoSNOW(), registerDoMC() regis- IteratedLovinsStemmer(), LovinsStemmer()
ter respectively the sequential, SNOW and multicore parallel backend Tokenizers:
with the foreach package (foreach, doSNOW, doMC) AlphabeticTokenizer(), NGramTokenizer(),
sfInit(), sfStop() initialize and stop the cluster (snowfall) WordTokenizer()
sfLapply(), sfSapply(), sfApply() parallel versions of Editors/GUIs
lapply(), sapply(), apply() (snowfall) Tinn-R a free GUI for R language and environment
Packages RStudio a free integrated development environment (IDE) for R
multicore parallel processing of R code on machines with multiple cores or rattle graphical user interface for data mining in R
CPUs Rpad workbook-style, web-based interface to R
snow simple parallel computing in R RPMG graphical user interface (GUI) for interactive R analysis sessions

You might also like