
Computational Statistics (2007) 22:91–108

DOI 10.1007/s00180-007-0023-6

ORIGINAL PAPER

Excel :: COM :: R

Thomas Baier · Erich Neuwirth

Published online: 24 February 2007


© Springer-Verlag 2007

Abstract R is a powerful system for statistical computing. Its great flexibility
makes it the perfect tool for a wide range of applications. Unfortunately this
flexibility also leads to a level of complexity which is hard to handle for the casual
user. On the other hand tools like Microsoft Excel are very easy to handle but
are not well-suited for more complex applications. This article describes how to
make use of the flexibility of R while still providing a familiar and easy to use
GUI in Microsoft Excel. We will provide a description of the design and show
the various ways of installation and user interaction with R using Excel.

1 R

An in-depth discussion of R (R Development Core Team 2005b) is far beyond
the scope of this article. We will provide a short description of R and show some
of the advantages of this system and also some of its disadvantages.
“The R FAQ” (Hornik 2005) provides a short description of R. It starts with
the following paragraph:

R is a system for statistical computation and graphics. It consists of a
language plus a run-time environment with graphics, a debugger, access
to certain system functions, and the ability to run programs stored in
script files.

T. Baier
Department of Statistics, Vienna University of Technology, Vienna, Austria

E. Neuwirth (B)
Department of Scientific Computing,
University of Vienna, Vienna, Austria
e-mail: erich.neuwirth@univie.ac.at

The programming language implemented in R is the S language version 4
(see Chambers 1998). This language is also implemented in a software product
called S-Plus by Insightful Corporation (Insightful Corporation 2005). The
specific variant implemented in R is described in R Development Core Team
(2005e).
R runs on many different platforms, including various flavours of Unix,
Microsoft Windows and MacOS X. More detailed information can be found
in R Development Core Team (2005d) and Hornik (2005). The “look & feel”
on the various platforms may be different, but it is common for all platforms,
that R provides a command prompt where the user can enter textual commands.
The character output of the commands is shown in the same (console) window,
while the graphical output shows up in distinct graphics windows. The command
language used in the command prompt is S version 4. All global symbols or
functions can be used by simply typing their names.
In addition to the built-in functions, R can be extended by means of packages.
Hundreds of packages are available for public download from one of the
CRAN 1 sites. Many of the available packages provide useful functions and
objects for use in computational statistics. At the same time other packages
provide connectivity to a broad range of applications and data formats. Some
examples are
• import and export from/to text files is a built-in functionality (found in the
base package)
• reading and writing XML files using the package XML (Lang 2005c)
• importing data from EpiInfo, Minitab, S-PLUS, SAS, SPSS, Stata and Systat
or exporting data to Stata [package foreign, R-core members et al. (2005)]
• data base connectivity, with e.g., MySQL (DuBois 2000) using RMySQL
(James and DebRoy 2005) or via ODBC using RODBC (Lapsley and
Ripley 2005).
• the packages RDCOMClient (Lang 2005a), RDCOMServer (Lang 2005b)
and rcom (Baier 2005) can be used to connect R to third party applications
using COM on Microsoft Windows operating systems (see Sect. 3 and
Microsoft Corporation & Digital Equipment Corporation 1995).
An introduction into R’s connectivity options, mostly focused on data import
and export, can be found in R Development Core Team (2005c).
R is developed as an open-source project and distributed under the “GNU
GENERAL PUBLIC LICENSE” (Free Software Foundation 1991). Everyone
is allowed to use R free of charge—even in commercial projects. The same
license or a license similar to GPL applies to many of the packages available
for R.
As has been mentioned before, the primary way of user interaction with the
system is via R’s command prompt. While this allows very fast interaction for
the advanced user and a high degree of flexibility using the S language, this
kind of user interface is very hard to use for the casual user. Novice users are

1 Comprehensive R Archive Network, available via http://www.cran.r-project.org/.



presented with a GUI which is not very common for modern applications. User
interaction does not follow the model of menu and dialog driven application
software. In the case of R, the user is interacting with an interpreter for an
object oriented statistical programming language. A command prompt presents
a very steep learning curve for the beginner and requires quite an investment in
time by users new to the system. Packages like Rcommander (Fox et al. 2005)
try to assist in learning R by providing a menu-driven GUI, but from the user’s
point of view, the command prompt is still the main user interface. Many users
familiar with applications like OpenOffice (OpenOffice.org 2006) or Microsoft
Office do not want to invest the time necessary to learn to use R.
So there is a large group of potential users who cannot use R because it
either is too complex or at least seems to be too complex. In addition to that,
there are tasks where the command line is not the ideal way of user interaction.
While R contains an integrated spreadsheet component which can be used to
comfortably enter and edit tabular data, users familiar with Microsoft Excel
or other full featured spreadsheet programs will miss many features they have
grown accustomed to. Therefore we decided to connect Microsoft Excel with
R. The software package implementing (among other things) this connection
is called R (D)COM Server V2.01 (Baier and Neuwirth 2005) and can be
downloaded from http://www.cran.r-project.org/ from section Other.

2 Excel

Microsoft Excel is a current state of the art spreadsheet program. Spreadsheet
programs are very convenient tools for numerical computations and in fact
computationally equivalent to many programming language based software
systems for numerical computation. Spreadsheet programs are distinguished by some
important properties:
• creation of formulas with a point and click user interface,
• relative and absolute cell references instead of named variables,
• iteration and multiple computations by copying formulas,
• automatic recalculation when input values change.
These properties create a unique way of interaction with data. Since changing
cell contents immediately also changes computed results, spreadsheets support
a very explorative way of analyzing data. The fact that the algebraic notation
for formulas is not the primary way of interacting with formulas makes spreadsheets
accessible for a much wider audience than programming languages for
numerical computations. Additionally, it is also not too difficult for the average
user to change formulas and therefore spreadsheet programs are not closed
application programs (like accounting systems) but allow end users a scaled
down version of programming and even software development. These aspects
of modeling allow smooth progress from simple tasks like invoicing, accounting,
and very simplistic statistics to quite complex statistical and mathematical
models. These end user programming ideas are discussed in Nardi (1993), and a
detailed account on the modeling aspects of spreadsheets is given in Neuwirth
and Arganbright (2003).
An additional advantage of Excel is its integration into the Windows desktop.
Transferring data and images between Excel and other applications is very easy,
and it is even possible to embed parts of Excel sheets into text documents in a
way that the text document will be updated automatically when the Excel sheet
contents change.
Excel even has some support for statistics, both in the form of worksheet
functions and in the form of menu based procedures, but these methods are not
well accepted by the statistical community. The numerical precision of some
of these methods (e.g. methods based on matrix inversion) is not very good,
and in many cases, the parametrization of the arguments of the functions is
somewhat strange. Performing complex statistical analyses with Excel without
any extensions is not advisable.
To alleviate the situation, Excel has an Add-In mechanism. All current versions
of Microsoft Office programs have a built-in programming language (VBA)
and an integrated development environment (IDE) for this language. The language
is reasonably complete, and it allows access to other libraries installed
either on the same or even on a different computer accessible by a network
connection. The Add-In mechanism allows programmers to define new work-
sheet functions seamlessly integrated into Excel (they can be used like the
functions built into Excel’s core engine). Add-Ins also can add menus and dia-
log boxes to Excel, making procedures supplied by other libraries accessible
through an extension of Excel’s user interface.
Our Excel extension package uses this Add-In mechanism to allow Excel to
call R directly from formulas in cells, and we also allow users and programmers
to call R methods from their own VBA programs. As a third way of connecting
Excel and R we added operations started from menu items on additional menus
integrated into Excel’s main menu and the cell context menu (available when a
cell is right-clicked).

3 COM

For embedding R into Microsoft Excel, the DCOM technology is used.
COM (Microsoft Corporation & Digital Equipment Corporation 1995) is
shorthand for component object model, a technology widely used on Microsoft
Windows platforms to encapsulate functionality in a common way. This makes
it possible to use such a component exposed by a so-called server application in
a client application. A component is a set of functionality and data (an object).
It can be as simple as, e.g. the encapsulation of a simple integer value and as
complex as a whole application like Microsoft Excel.
The services COM provides and its whole architectural model is very similar
to CORBA (Object Management Group 2002), an object model used mostly
on Unix platforms. COM
• defines a way for server applications to expose objects to their clients,
• defines methods to handle object life-time by enforcing the use of a simple
reference-counting mechanism and
• provides standardized mechanisms for object creation and sharing objects
between different processes.
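
The reference-counting idea in the second item can be illustrated with a small toy model. This is only a sketch in Python, not COM code: the class RefCounted and its attributes are invented for illustration, and only the method names echo the AddRef and Release methods of COM's IUnknown interface.

```python
class RefCounted:
    """Toy stand-in for a COM object (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 1      # the creator holds the first reference
        self.alive = True

    def add_ref(self):
        # A new client starts using the object.
        self.ref_count += 1
        return self.ref_count

    def release(self):
        # A client is done; the object goes away with the last reference.
        self.ref_count -= 1
        if self.ref_count == 0:
            self.alive = False  # a real server would free its resources here
        return self.ref_count


obj = RefCounted("StatConnector")
obj.add_ref()                   # a second client shares the object
obj.release()                   # the first client is done
still_alive = obj.alive         # the second client can still use the object
obj.release()                   # the last reference is released
```

The object stays usable as long as at least one client holds a reference, which is why neither client needs to know about the other.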

COM components are used very similarly to objects from a class library. The
component author can decide whether the component is integrated into the
client application (running in the same process context as the client application)
or if it runs in a separate process. In the latter case, COM transparently
handles sharing components across process boundaries and so allows
components provided by one executable file or program to be integrated into another one.
For a client application (like Microsoft Excel) the component itself is treated
just like an object provided by the application itself. Using this component
then is similar to calling an internal Visual Basic for Applications (VBA,
Microsoft Corporation 2001b) function or object. Therefore, with respect to the
programming language, nothing new has to be learnt; the programmer just has
access to additional objects. Of course, the properties and methods of the object
itself are new. The integration into the VBA IDE, however, is so tight that it is
even possible to use the integrated object browser in the IDE to browse objects
provided via COM.
One of the major advantages of COM itself is its wide support on the
Microsoft Windows platform. Nearly every programming language, e.g. Visual
Basic (Microsoft Corporation 2001c), Delphi (McNab et al. 1996) or C++,
scripting languages like JavaScript (Flanagan 2001), Perl (Wall et al. (1996) or
more specifically ActiveState Tool Corporation (2000) for a Windows version
able to use COM) or Python (Martelli 2003), and even applications providing macro
support, such as the Microsoft Office family of products (VBA), provide support
for using the functionality exposed through COM objects by a server application.
DCOM stands for distributed COM and extends the COM model with a very
important feature. While COM itself provides methods for performing function
or method calls across process boundaries, DCOM goes one step further.
DCOM makes COM objects transparently available in a network of computers.
In our previous example of Microsoft Excel utilizing R as a computational component,
DCOM makes it possible to run the component (R) and the client application
(Microsoft Excel) on different machines. DCOM will take care of the necessary
communications over the network to make the services exposed by the R
component available to Excel.
COM requires developers to separate interface and implementation. Along
these lines we have defined a COM interface called IStatConnector (see
Sect. 4), which formally defines the functionality our COM server provides. Client
applications work with the COM interface and the true binding to the implementation
is done when the COM object is instantiated (while the application
is running). This separation of interface and implementation and the run-time
binding mechanism make it possible to ensure compatibility between different versions
of the COM server. For example, client applications created in 1999 for the
first release of our COM server for R still work without modifications with the
current version released in October 2005.
The COM interface defines the functions (and variables) a COM object
provides. An implementation of a COM interface is called a coclass.
In the last few years, Microsoft has developed a new component technology
as part of the .NET (Microsoft Corporation 2001a) system. Why we chose COM
as our component technology is easy to explain:
• COM and DCOM can be used on all 32-bit Windows platforms (Windows
9x, ME, NT 4, 2000, XP and even Windows CE/Pocket PC)
• mature COM support is found in most programming languages and applications,
while good support for .NET’s component technology is still not
found very often. Even Microsoft Excel does not have native .NET support
at the moment.
• when the concepts for integrating R into Excel were developed back in
1999, .NET was not available at all
• the .NET → COM bridge technology allows our COM components to be used
from .NET applications quite well.
For the near future we want to stay with COM as our base component
technology, but to make integration easier for the new .NET environment, we will develop
native .NET components both for the computational components and for the
controls and applications. Since the .NET technology is fully documented and
also implemented on non-Microsoft platforms (Mono Project 2006), this will allow
our mechanism to be used on a wider range of operating systems.

4 Integrating Excel and R

Our goal was to make R’s computational engine available to third party applications
in general, and to Microsoft Excel in particular. In COM terminology,
this makes Excel a COM client application and R a COM server. When talking
about COM, this also includes the DCOM technology. COM clients do not
distinguish between COM and DCOM when accessing an object’s methods and
properties. Only the client machine’s configuration and the process of object
creation may be different.
In this integrated system with components Excel and R, Excel (the client)
is the controlling part, whereas R (the server) offers its services on request to
Excel. Figure 1 shows the connection between the two applications including
data flow.
The COM server provides R’s functionality through a COM interface called
IStatConnector. Below we show relevant parts of this COM interface.
interface IStatConnector : IDispatch
{
    // starting and stopping the interpreter
    HRESULT Init([in] BSTR bstrConnectorName);
    HRESULT Close();

    // setting and retrieving symbol data
    HRESULT GetSymbol([in] BSTR bstrSymbolName, [out,retval] VARIANT* pvData);
    HRESULT SetSymbol([in] BSTR bstrSymbolName, [in] VARIANT vData);

    // evaluating an expression in the interpreter
    HRESULT Evaluate([in] BSTR bstrExpression, [out,retval] VARIANT* pvData);
    HRESULT EvaluateNoReturn([in] BSTR bstrExpression);

    ...
};

Fig. 1 Microsoft Excel uses R’s computational engine via COM (diagram: Excel sends eval(expr)/set data to R via COM; R returns result data to Excel via COM)

Our work is centered around the coclass StatConnector, which is an
implementation of the IStatConnector COM interface. Excel is connected
to R by an Add-In for Microsoft Excel called RExcel. From StatConnector’s
point of view, the Add-In performs the following tasks to integrate R into the
spreadsheet application:

1. Excel (by means of the Add-In) creates a COM object implementing the
IStatConnector interface,
2. calls Init on the COM object to start up R,
3. normal operation: either interactively or using Excel’s recalculation loop,
transfer data to R, perform computations in R and get result data back to
Excel,
4. shut down R by calling Close,
5. release the COM object.
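
The five steps above can be traced with a toy client. The following Python sketch does not talk to COM or to R: FakeStatConnector is a hypothetical in-memory stand-in whose method names are taken from the interface listing, and Python's own eval plays the role of the R interpreter, so the expression is Python rather than R.

```python
class FakeStatConnector:
    """Hypothetical stand-in; a real client would obtain the object via COM."""

    def __init__(self):
        self.symbols = None              # None means "interpreter not running"

    def Init(self, connector_name):
        self.symbols = {}                # fresh, empty environment

    def SetSymbol(self, name, value):
        self._check_running()
        self.symbols[name] = value

    def GetSymbol(self, name):
        self._check_running()
        return self.symbols[name]

    def Evaluate(self, expression):
        # Python's eval stands in for R; the call returns only when done.
        self._check_running()
        allowed = {"__builtins__": {}, "sum": sum, "len": len}
        return eval(expression, allowed, self.symbols)

    def Close(self):
        self.symbols = None

    def _check_running(self):
        if self.symbols is None:
            raise RuntimeError("Init must be called first")


connector = FakeStatConnector()                      # step 1: create the object
connector.Init("R")                                  # step 2: start the interpreter
try:
    connector.SetSymbol("x", [1.0, 2.0, 3.0, 6.0])   # step 3: send data,
    mean_x = connector.Evaluate("sum(x) / len(x)")   # compute and fetch the result
finally:
    connector.Close()                                # step 4: shut down
connector = None                                     # step 5: release the object
```

Wrapping the working phase in try/finally is a design choice of this sketch: it makes sure the interpreter is shut down even if a computation fails.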

The functions (methods) will return on completion of the requested operation.
E.g., Evaluate takes an expression as its input argument. Control is then
handed over to R. The client application waits until R has finished computing
the expression and the result of the computation is returned to Excel (see
Fig. 1). The COM server (R) only reacts to Excel’s requests.
When initializing the COM object, a fresh R environment is created and
initialized. In our case, this “R instance” only belongs to the Excel Add-In. If
another application (or another instance of Excel) is started and wants to use
R for its own purpose, a different R process is created. All applications are
using separate R processes. This is similar to running R multiple times from the
command line at the same time.
Our COM server is a true back-end application. It does not provide any kind
of user interface. Even the command prompt described in the first section is
invisible to the user. This makes it possible to truly embed R’s computational engine into
another application (Excel in this case) and completely hide R’s own GUI from
users.
The COM interface IStatConnector used by Excel (or more specifically, by RExcel)
is separated from its implementation, the coclass
StatConnector. The package rcom makes use of this concept and provides an
alternative implementation of the IStatConnector interface which displays
R’s “normal” GUI and allows manual interaction with R in parallel to using the
COM client. By simply changing the object creation mechanism in Microsoft
Excel, we can exchange the COM implementation and provide access to an R
process with its own GUI accessible for the user, while R is still integrated into
Excel.
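
The effect of exchanging the implementation behind a fixed interface can be sketched as follows. This is a Python illustration only: the classes HiddenServer and GuiServer, the registry IMPLEMENTATIONS and the function create_connector are invented names that mimic the roles of the two COM servers and of the object creation mechanism, and the interface is reduced to a single method.

```python
from abc import ABC, abstractmethod


class IStatConnector(ABC):
    """Interface only: what clients may call (reduced to one method here)."""

    @abstractmethod
    def Evaluate(self, expression):
        ...


class HiddenServer(IStatConnector):
    """Stands in for a back-end implementation without any user interface."""
    has_gui = False

    def Evaluate(self, expression):
        return f"evaluated {expression}"


class GuiServer(IStatConnector):
    """Stands in for an implementation that also shows a console to the user."""
    has_gui = True

    def __init__(self):
        self.console_log = []

    def Evaluate(self, expression):
        self.console_log.append(expression)   # the user can see what ran
        return f"evaluated {expression}"


# Hypothetical registry; plays the role of the object creation mechanism.
IMPLEMENTATIONS = {"background": HiddenServer, "foreground": GuiServer}


def create_connector(kind):
    return IMPLEMENTATIONS[kind]()


def client_code(connector):
    # Written against the interface only; unaware of the concrete class.
    return connector.Evaluate("1 + 1")


results = [client_code(create_connector(kind))
           for kind in ("background", "foreground")]
```

Only the argument passed to create_connector changes between the two runs; client_code is identical in both cases.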

5 Concepts of the implementation in and around R

The COM server (the coclass StatConnector) mainly consists of two parts.
The first part is tightly coupled with the implementation of R itself; the second is
the “real” implementation of the COM interface IStatConnector. It is not
the goal of this article to give a detailed description of either the R implemen-
tation or of the COM implementation. We will only provide a short description
of the design goals and the advantages (and disadvantages) of our approach.
On Windows platforms, as well as for most other operating systems, GCC
(see Stallman 2005) is used to build the R executables and libraries from source
code. When we started our project, there was not much support for creating
COM servers using GCC. Commercial compilers, on the other hand, provided
good support to create COM server applications and contained class libraries
making creation of COM servers an easy task. Unfortunately, unlike on
most Unix-alike platforms, interoperability between different vendors’ C and
C++ compilers is not easily possible. Only when using the so-called system
calling convention, which is defined for C code only, was it possible to create
implementations with one compiler (e.g., with GCC) which could safely be
called from an implementation compiled with a different compiler (in our case
Microsoft VC++). We therefore introduced an abstraction layer between R and
the COM server. This abstraction layer only uses C functions (no COM or
C++) and utilizes Microsoft Windows’ system calling convention. This guaran-
tees that the functions can safely be called from a C program compiled with
any C compiler for Windows. The abstraction layer (below referenced as the
proxy object SC_Proxy_Object) consists of a set of pointers to C functions
stored in a structure and not only defines functions to access R but also a data
format which maps the R-specific internal storage format (SEXPs, see R Development
Core Team 2005f for more information) to the so-called BDX data
format (Binary Data eXchange format) designed specially for this goal. This
data format has been designed for efficient (structured) data exchange with as
few memory/conversion operations as possible.
The proxy object SC_Proxy_Object is a general interface object and its
definition is independent from R. The same interface object (definition) and
the data format could be used for other systems than R, too (e.g., GNU Octave,
see Eaton 2005). The implementation in rproxy.dll makes extensive use of
the R API as described in R Development Core Team (2005f) and is tightly
coupled to R. Its implementation is part of the R source code and the binary
for rproxy.dll is delivered with the Windows binary distributions of R.
SC_Proxy_Object and BDX are both stable interfaces and make the COM
server itself (the coclass StatConnector) independent from the R version.
This decouples the COM server from R in a way that release cycles for R and
the COM server can be independent from each other.
StatConnector is an out-of-process server written with Microsoft Visual
C++ 6.0 using Microsoft’s Active Template Library (ATL). Its connection to R is
based on the SC_Proxy_Object interface. The interface’s implementation is
loaded dynamically from rproxy.dll. Data transfer is performed using the
BDX format. The COM server’s main goals are
• provide an implementation of IStatConnector
• start and stop R
• convert from BDX to the VARIANT data format used in the COM inter-
face and vice versa
• perform error handling
• allow callback objects to be installed for R for e.g., graphics or text output
(see Sect. 7) and handle the callback functionality
StatConnector is implemented as a single-use out-of-process server. This
means that every application creating a StatConnector object gets its own
server process where the COM object lives. R is running in the process context
of the StatConnector object. Therefore, separate R processes exist for every
client application.
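
The single-use behaviour can be imitated with ordinary operating system processes. The sketch below is an analogy in Python, not COM: each call to the invented helper start_private_server launches a separate Python interpreter that plays the role of a private server process with its own state.

```python
import subprocess
import sys

# Each "server" is a separate Python interpreter that reports its own
# process id and the value of a variable that exists only in that process.
CHILD_CODE = "import os, sys; x = int(sys.argv[1]); print(os.getpid(), x)"


def start_private_server(value):
    """Spawn one new process per client, as a single-use server would."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(value)],
        stdout=subprocess.PIPE,
        text=True,
    )


server_a = start_private_server(1)      # first client application
server_b = start_private_server(2)      # second client application

out_a, _ = server_a.communicate()       # wait for each process and read
out_b, _ = server_b.communicate()       # what it printed

pid_a, x_a = (int(field) for field in out_a.split())
pid_b, x_b = (int(field) for field in out_b.split())
```

The two processes report different process ids, and each one only sees its own value of x, which mirrors the statement that client applications do not share an R process.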
The overhead created by the COM server is that of an out-of-process COM
call (including marshaling and VARIANT data transfer) vs. a direct function
call to achieve the results. Converting the data from VARIANT format to
BDX format and then to R’s SEXP storage format can be neglected for most
applications. The same holds true for forwarding the COM method calls to R.
The COM server itself provides very fast and easy access to R. When creating
interactive applications, the performance bottleneck is mostly found in long-
running scripts or expressions on the R side.
The advantages and disadvantages of this design and implementation approach
are obvious now:
+ COM makes it easy to access R from many different applications and
programming languages
+ StatConnector provides an implementation of the stable interface
IStatConnector. This guarantees compatibility for client applications
using IStatConnector over time while still benefiting from improve-
ments found in new versions of the StatConnector implementation.
− Because of two different interfaces (COM interface IStatConnector
and C interface SC_Proxy_Object) special care must be taken to ensure
compatibility. Since the first release in 1999, full compatibility for
client applications relying on IStatConnector could be guaranteed. In
this time many new versions of both StatConnector and R have been
released.
+ SC_Proxy_Object provides a stable interface for C and C++ program-
mers using any Windows C compiler. For C programmers this may be
easier than going the COM way via IStatConnector but still decouples
the implementations from a specific R version.
+ Changes in R may require code to be changed. As the R interface is
implemented in rproxy.dll and rproxy.dll is part of R (and the R distribution/setup)
this can be done easily. It is very unlikely to have to make
changes to the StatConnector implementation because of changes in
R. This helps keep maintenance cost low.
+ Installing a new R version does not require switching to a new version of
StatConnector in most cases.
+ Multiple versions of R can be installed at the same time and used by the
client applications by simply setting a registry key to point to the version
of R which shall be used.
+ Different applications use different R processes. This makes the client
applications independent from each other and does not require any co-
operation between them.
− A small overhead for data conversion and additional function call overhead
is imposed by this architecture. Practically, this does not have any impact on
a typical application’s performance. Minimizing data transfer and function
calls keeps this overhead low.
+ Problems in R code or maybe some bug in a package (resulting in a crash)
do not affect the client application. The client application always only
gets an error code from the COM implementation and can handle faults
like those gracefully.
+ The same infrastructure which is used by StatConnector can be used by
alternative implementations, too. rcom (see Baier 2005) is another COM
server for R implementing IStatConnector. The implementation reuses
rproxy.dll to provide a different level of integration between the COM
client and R and also makes it possible to implement a different user interface paradigm.
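
The fault isolation mentioned in the list above (a crash on the server side reaches the client only as an error code) follows from running the computation in a separate process. A minimal Python sketch of that general idea, with an invented helper evaluate_in_child and a second Python interpreter standing in for the R process:

```python
import subprocess
import sys


def evaluate_in_child(code):
    """Run code in a separate interpreter and report (ok, detail).

    A failure of the child process is turned into an error code for the
    caller instead of taking the caller down with it.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        return False, completed.returncode
    return True, completed.stdout.strip()


good = evaluate_in_child("print(6 * 7)")
bad = evaluate_in_child("import os; os._exit(3)")   # simulates an abrupt failure
```

The calling process keeps running after the second call and can decide how to react to the reported exit code.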
Next, we will describe how the Add-In for Microsoft Excel uses
StatConnector to integrate both spreadsheet and R functionality.

6 Excel implementation

The COM interface IStatConnector only supplies a very basic mechanism
for communication between Excel and R. Besides the administrative tasks of
initiating the R process and shutting it down, it makes it possible to
• send data to R,
• retrieve data from R and
• send a string containing R commands to be executed.

R has many complex data types implemented in R’s object system. Excel
essentially only has vectors (columns or rows) and matrices. The more complex
data types of R are conceptually incompatible with Excel’s tabular paradigm.
Therefore, our interface between the two applications only handles arrays
(containing only one basic data type like string or real) and dataframes.
Dataframes in Excel are represented as arrays of columns. Each column has a
name (of the variable) in the top row and consists of data of equal type (string,
real, time, …). Different columns may have different types. The interface can
transfer Excel ranges of one underlying data type to R as arrays and
rectangular areas of cells (called “ranges”) following the “dataframe
convention” to R as dataframes. Similarly, scalars, vectors, and arrays in R can
be transferred to Excel ranges. The current version of the interface will even
handle date, time and complex numbers reasonably. These data types are defined
in Excel and in R, but they are not implemented in the same way. Therefore,
great care must be taken when transferring these data types. Incidentally, most
parts of RExcel are implemented in VBA, which is an interpreted language. To
speed up data transfer for large arrays and dataframes, some routines had to be
implemented in compiled Visual Basic.
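
The “dataframe convention” described above (variable names in the top row, one data type per column) can be made concrete with a small sketch. The Python function range_to_dataframe below is a hypothetical illustration of the convention only; it is not the code used by RExcel, and a plain list of rows stands in for an Excel range.

```python
def range_to_dataframe(cells):
    """Turn a rectangular block of cells (a list of rows) into {name: column}.

    The first row holds the variable names; every column must contain
    values of a single type, but different columns may differ in type.
    """
    header, *rows = cells
    if any(len(row) != len(header) for row in rows):
        raise ValueError("range is not rectangular")
    frame = {}
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        kinds = {type(value) for value in column}
        if len(kinds) > 1:
            raise TypeError(f"column {name!r} mixes types")
        frame[name] = column
    return frame


cells = [
    ["y", "x1", "group"],
    [1.5, 0.2, "a"],
    [2.5, 0.4, "b"],
]
frame = range_to_dataframe(cells)
```

A range whose column mixes numbers and text is rejected, which corresponds to the requirement that each column consists of data of equal type.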
An additional problem is the handling of missing values. Excel treats empty cells
differently under different conditions. For arithmetic functions, in many cases
empty cells are treated the same as cells containing the value 0. For statistical
projects, this is a serious issue which has been extensively documented in
McCullough and Wilson (2002). Therefore, our interface makes it possible to specify
different methods of handling empty cells, and furthermore allows for different
conventions of indicating missing data in Excel.
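
Why the treatment of empty cells matters can be seen in a short sketch. The Python function cells_to_numbers, its parameter empty_as and the marker strings in na_markers are assumptions made for this illustration; they are not RExcel's actual options. NaN is used here as a stand-in for a missing value.

```python
import math


def cells_to_numbers(cells, empty_as="na", na_markers=("NA", "#N/A")):
    """Convert one column of cell values to floats.

    empty_as="na"   -> empty cells become missing values (NaN here)
    empty_as="zero" -> empty cells become 0.0
    Cells whose text is listed in na_markers are always treated as missing.
    """
    if empty_as not in ("na", "zero"):
        raise ValueError("empty_as must be 'na' or 'zero'")
    result = []
    for cell in cells:
        if cell is None or cell == "":
            result.append(math.nan if empty_as == "na" else 0.0)
        elif cell in na_markers:
            result.append(math.nan)
        else:
            result.append(float(cell))
    return result


column = [1.0, None, 3.0, "NA"]
as_missing = cells_to_numbers(column)
as_zero = cells_to_numbers(column, empty_as="zero")
```

For the first three cells (1, empty, 3), the mean of the observed values is 2 if the empty cell counts as missing, but 4/3 if it is silently read as 0.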
To allow Excel to connect to R, the interface installs a new menu item RExcel
in Excel’s main menu. This menu item opens a submenu containing, among
other items, commands to transfer the currently selected data to R and to
transfer an array or a dataframe from R to a range in Excel.
This menu also has an item for connecting to R and to select the type of R
server to be used (R(D)COM or rcom).
In addition to transferring data from Excel to R and back, a mechanism for
executing R procedures and functions from Excel is needed. If the rcom
mechanism is used, starting R brings up an R command line, so the user can
run R commands from this interface in the same way he would interact with
a standard R GUI. The advantage is that data can be transferred from Excel
most easily, and results can be transferred directly into Excel. If the underlying
R process is the R(D)COM server, however, no command line interface is
available. Therefore, another way is needed to run R commands. RExcel allows
R commands to be entered as text into Excel cells. Then, a range can be selected
interactively and the text in these cells will be interpreted as a sequence of R
commands and executed. This way, Excel ranges become the R command line.
Additionally, the R code is saved as part of the worksheet and therefore one
Excel file can contain all the data and the R code needed to perform complete
statistical analyses.

When working with R code from within Excel, debugging can become rather
tedious. Therefore, the interface offers tools to help with debugging. There is a
special debug mode where all R commands executed are displayed in a special
popup window. When an error occurs, this window will also display R’s error
messages. So Excel can even be used as a mini development environment.
The interface also has a command for getting the output of the last command
executed by R. This output can be put into a cell range in Excel as text. This
is useful for commands producing output which cannot easily be represented
as arrays or dataframes. This way, when programming R one can inspect the
results of performing operations in R in an informal manner.
Using this mechanism (data transfer in both directions, execution of R commands
initiated by Excel), it is possible to write Excel macros performing statistical
tasks and start them from menus in Excel. This way, complete statistical
applications can be written in Excel and the user only sees some additional
menu items performing these tasks. RExcel enhances Microsoft Excel with statistical
methods not part of Excel itself. These enhancements are integrated
seamlessly into the GUI and Excel’s user interface paradigms, like, say, Excel’s
solver for multivariate equation solving and optimization.
To be able to implement such embedded applications, the implementor has to
know R, the spreadsheet part of Excel, and VBA. The hub for such applications
is VBA. Macros written in this language take care of data transfer. R commands
to be run are constructed as strings in VBA and then executed by calling an
appropriate procedure in VBA.
Here is a typical small example demonstrating the usual pattern of using R
in Excel this way:
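Sub RegreDemo()
    ' Start the R server before any data transfer or command execution
    Call RInterface.StartRServer
    ' Copy the worksheet range to R as dataframe mydf
    Call RInterface.PutDataframe("mydf", _
        Range("Regression!A1:C26"))
    Call RInterface.RRun("attach(mydf)")
    ' Write the regression coefficients into the sheet, starting at F2
    Call RInterface.GetArray("lm(y~x1+x2)$coefficients", _
        Range("Regression!F2"))
    Call RInterface.StopRServer
End Sub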

Excel allows parameterless macros to be started directly from menu items or
toolbar buttons. Therefore, an Excel spreadsheet can have a menu with an
item performing such an operation. For a naive end user, performing such an
operation looks identical to using one of Excel’s menu-based tools (e.g. sorting
data).
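To make the R side of the macro concrete, the following sketch reproduces in plain R what the command strings in RegreDemo ask R to compute. The dataframe is simulated as a stand-in for the worksheet range; the column names y, x1 and x2 are taken from the formula in the macro, while the values and the number of rows are assumptions for illustration.

```r
# Stand-in for the dataframe transferred from the worksheet (simulated values).
set.seed(42)
mydf <- data.frame(x1 = runif(25), x2 = runif(25))
mydf$y <- 3 + 1.5 * mydf$x1 - 0.5 * mydf$x2 + rnorm(25, sd = 0.05)

# Same model as attach(mydf) followed by lm(y ~ x1 + x2); passing the data
# explicitly avoids leaving mydf attached to the search path.
coefs <- lm(y ~ x1 + x2, data = mydf)$coefficients

# A named numeric vector with entries (Intercept), x1 and x2.
print(coefs)
```

The result is a plain numeric vector, which is the kind of object that can be written to a cell range as an array.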
The most important feature of spreadsheets is automatic recalculation. Our
R-Excel integration methods described so far have not linked R with this Excel
feature, but there is a special mechanism integrating R into Excel’s automatic
recalculation loop. There are some Excel functions (defined in VBA) which
perform R computations.

RApply("pchisq",C4,D4,E4)

computes the (noncentral) χ² distribution function for the value given in
cell C4, with degrees of freedom in cell D4 and noncentrality parameter in cell
E4. Whenever the contents of one of these cells are changed, Excel will
immediately call R and update the computed value. Therefore, our interface
integrates R’s computational engine with Excel’s automatic recalculation
features, producing an R spreadsheet program. The mechanism for creating
formulas using R is Excel’s own mechanism for creating formulas: point-and-click
can be used to indicate the position of the parameter values of function calls.
This integration provides a radically different approach to the usual
batch-oriented way of using R.
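The following sketch shows, in plain R, the kind of call such a formula amounts to. It assumes that the cell values are passed to the named function as positional arguments; the internals of RApply are not shown in this article, and the cell contents used here are hypothetical. In R, pchisq is the distribution function and qchisq is its inverse (the quantile function).

```r
# Hypothetical cell contents (not taken from the article).
c4 <- 3.5   # value at which the distribution function is evaluated
d4 <- 2     # degrees of freedom
e4 <- 1     # noncentrality parameter

# Call a function given by name with positional arguments.
p <- do.call("pchisq", list(c4, d4, e4))

# The by-name call gives the same result as the direct call.
stopifnot(isTRUE(all.equal(p, pchisq(c4, df = d4, ncp = e4))))

# qchisq maps the probability back to the original value.
q <- qchisq(p, df = d4, ncp = e4)
print(c(p = p, q = q))
```

Because the function is identified by a string, the same spreadsheet function can in principle serve any R function that takes and returns simple values.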
R not only offers powerful computational features, it also offers a wide range
of statistical graphical representations. At the moment, R graphics are not
integrated into Excel as fully as native Excel graphics, but it is relatively
easy to produce R graphics and get a snapshot of the image into Excel. RExcel
can execute any R command, so graphics can be produced, too. Such graphics are
displayed in a window belonging to the R process; they are not embedded in an
Excel worksheet. R graphics can be copied to the clipboard
either manually with the menu commands available in graphics windows, or with
the savePlot command. After copying the graphics (preferably in a vector
format like WMF), pasting the clipboard contents into an Excel worksheet
will embed the chart in the worksheet. This kind of chart, however, behaves
differently from native Excel charts: changing the data will not automatically
change the chart. The copy-paste cycle needs to be repeated manually to get an
updated chart. This will be changed in future releases of RExcel. This technique
can only be used when the R server is running on the same machine as Excel.
There is another way of combining Excel’s graphics features and R, and using
this mechanism it is possible to produce animated graphics.
In the worksheet in Fig. 2 the numbers partially covered by the graph are
the numerical representations of a kernel density estimator. These numbers are
computed by R. The slider on top of the window controls the window width
for the density estimator. Whenever the slider is moved, thereby changing the
window width, the numbers are recomputed by R and the graph (an x-y-chart
produced by Excel) is updated. In this example, Excel initiates R’s computation
whenever necessary to update the graph.
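A computation of this kind can be sketched in R with the standard density function. The sample, the grid limits and the helper name kde_on_grid are assumptions made for illustration; the worksheet in Fig. 2 may be set up differently. Evaluating the estimate on a fixed grid keeps the x column of the chart unchanged while the window width (bandwidth) varies.

```r
# Simulated bimodal sample (illustrative only).
set.seed(123)
x <- c(rnorm(100, mean = 0), rnorm(50, mean = 4))

# Kernel density estimate on a fixed grid of n points between from and to.
# bw plays the role of the window width controlled by the slider.
kde_on_grid <- function(x, bw, from = -4, to = 8, n = 64) {
  d <- density(x, bw = bw, from = from, to = to, n = n)
  data.frame(x = d$x, y = d$y)
}

narrow <- kde_on_grid(x, bw = 0.2)   # small window width
wide   <- kde_on_grid(x, bw = 1.5)   # large window width

# A wider window smooths the estimate, so its highest peak is lower.
c(max(narrow$y), max(wide$y))
```

Each call returns two numeric columns that an x-y chart can plot directly, so recomputing with a new bandwidth only replaces the y values.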
An important consideration when designing RExcel was that it should sup-
port different user interaction modes:
• scratchpad and data transfer mode: menu-controlled data transfer from R to
  Excel and back, immediate command execution either from Excel cells or
  from the R command line.
• macro mode: macros invisible to the user control data transfer and R command
  execution.
• spreadsheet mode: formulas in Excel cells control data transfer and command
  execution; automatic recalculation is controlled by Excel.
Our current implementation supports all three modes. Future releases will
concentrate on more complete graphics integration.

Fig. 2 Excel graphics using results computed by R

7 Additional tools

So far, this article has focused on the “core components”, which are the coclass
StatConnector (including the COM interface IStatConnector) and the
Microsoft Excel Add-In RExcel. In other words, the missing link between R’s
computational engine and the mathematical (or statistical) part of Microsoft’s
Office suite has been discussed.
Looking at R itself, it is obvious that some important parts of R’s features
have been omitted so far: graphics and text output.
Achieving graphics output seems to be very simple: When calling one of R’s
graphics commands (e.g. plot), R (or more precisely, the R instance running
in the COM server) will open an R graphics window and the graphical output is
shown. Although this approach looks like a suitable solution at first glance,
on second thought it cannot be the right one.
The graphics window is opened by the COM server and also “belongs”
to the COM server process. If the COM server is run on a remote machine,
the graphics window will be shown on the remote machine, too (see footnote 2). The correct
solution for this is to show R’s display window on the local machine, while it
is controlled from the remote machine (the graphics are “drawn” by the COM
server). This is achieved by providing a so-called Active X control (see Cluts
2001). Active X controls are user interface components which can be shown in
a window or form. The “programmable” interface of the control is represented
by a (custom) COM interface.
The implementation uses the same mechanism to communicate between R
and the Active X control as the Excel Add-In does to talk to R. The Active X
control is a COM object, and the R COM server on the remote machine holds
a reference to the control on the local machine. This “callback mechanism” is
implemented in rproxy.dll.

2 This is a very simplified explanation of the mechanism. In reality, the COM server tries
to open the graphics window on the remote machine, but this will only succeed if launch and run
permissions are set appropriately and the login state of the remote machine allows the window
to be shown.

Fig. 3 Graphics output in Microsoft Excel form (via Active X)

To capture R’s text output (e.g., texts appearing in the console window pro-
duced using cat) another Active X control is provided. In addition to the GUI
representation (the Active X control StatConnectorCharacterDevice)
a non-GUI object is also provided. The coclass StringLogDevice stores all
text output in a string variable and provides a way to programmatically access
R’s text output.
By using the core component StatConnector and the output components
StatConnectorGraphicsDevice and StringLogDevice any COM client
application can fully make use of R both as a powerful computational com-
ponent and as a high-quality graphics engine (Fig. 3).

8 Excel/R communication modes

Communication between Excel and R takes place using COM or DCOM. In
both cases, RExcel can use two different mechanisms to invoke functionality
in R.
Method calls (e.g., Evaluate or GetSymbol) can be made using the
so-called custom interface (IStatConnector) or using a dispinterface (which
uses IDispatch to access IStatConnector’s methods).
The method using the custom interface IStatConnector is comparable to
a function call in C or C++. It provides strong typing (checking of arguments
and data types) and is the fastest way to access a COM object. The COM client
must have an intimate knowledge of the COM interface it wants to use (both
at compile time and at run time). Alternatively, the dispinterface can be used.
In this case, access to the methods is made through a generic COM interface,
IDispatch. To issue a call to an IStatConnector method, IDispatch’s
method Invoke is called (using IDispatch’s custom interface) and Invoke is
told to call a method in IStatConnector. This additional level of indirection
makes dispinterfaces a bit slower, but the COM client does not have to know
the exact internals of IStatConnector. For example, when using IDispatch it is
enough to know the method’s name and parameters; the client does not have
to know whether the method is the first function in the interface, the second
function, and so on. The drawback is that, for lack of type safety, using a
dispinterface will reveal errors only at run time.
To install RExcel in such a way that it can use this interface, the user
performing the installation has to have administrator rights on the machine,
even when the R DCOM server is running on a different machine. But there is a
way of installing RExcel for use with a remote server which does not need
administrator rights. This method allows users in an environment with tight
access restrictions to quickly install RExcel on machines where they do not
have administrator privileges.
It is also possible to install RExcel on a client machine without installing
R and still use strongly typed access. In this case, the type libraries containing
information about the signatures of the functions supplied by R DCOM
(i.e., the interface definitions) are required on the client machine, but not the
server binaries (the R(D)COM program) themselves.
Installing the R(D)COM server (just as installing any other COM server)
always requires administrative (or at least “power user”) privileges. It is possible
to install R (and the R(D)COM server) on one central server. Installation of
the client machines then can be done by a “normal” user and does not require
administrative privileges. The client machines then can use the R installation on
the centralized server machine. This is a reasonable context for an environment
with one powerful server and less powerful client machines. In this case, RExcel
serves as the user interface to R.
Let us summarize the differences between the interfaces:

• advantages and disadvantages of typed use
  + type-safe
  + easy to find bugs during development
  + easier to find runtime errors
  + less complex: easier to find setup errors
  + fast
  + more flexible: can support more data types (e.g. structs, unsigned
    integers)
  − requires (registered) type library for local and remote use
  − Excel requires type library even to load Add-In
• advantages and disadvantages of dispinterface
  + can run R remotely without any local components of the COM server (not
    even a type library is required)
  + works with all COM clients (e.g. scripting languages)
  + when no R components are installed, Excel can still load the Add-In
  − cause of errors often hidden (e.g. hard to distinguish between errors
    with setup, programming, and communication (for remote R))
  − requires additional component (DLL) for running without type library
  − more complex way of calling functions (indirectly, via name/id) makes
    it slower and more error-prone

References

ActiveState Tool Corporation (2000) Active Perl, 5.6.0.618 edn, ActiveState Tool Corporation.
http://www.ActiveState.com/ActivePerl/
Baier T (2005) rcom: R COM Client Interface and internal COM Server. R package version 1.2.1.
Baier T, Neuwirth E (2005) R (D)COM Server V2.00. http://www.cran.r-project.org/other/DCOM
Chambers JM (1998) Programming with Data, Springer, New York. ISBN 0-387-98503-4
http://www.cm.bell-labs.com/cm/ms/departments/sia/Sbook/
Cluts N (2001) Microsoft activex controls overview, in ‘MSDN Library’, Vol. Backgrounders,
Microsoft Corporation. http://www.msdn.microsoft.com/
DuBois P (2000) MySQL. New Riders
Eaton JW (2005) Octave: interactive language for numerical computations. University of Wisconsin,
Department of Chemical Engineering. http://www.octave.org/doc/index.html
Flanagan D (2001) JavaScript: the definitive guide, 4th edn. O’Reilly Media, Inc. ISBN 0596000480
Fox J with contributions from Michael Ash, Grosjean P, Maechler M, Putler D, Wolf P (2005) Rcmdr:
R Commander. R package version 1.1-1 http://www.r-project.org, http://www.socserv.socsci.
mcmaster.ca/jfox/Misc/Rcmdr/
Free Software Foundation (1991) GNU GENERAL PUBLIC LICENSE. Version 2
Free Software Foundation (1999) GNU LESSER GENERAL PUBLIC LICENSE. Version 2.1
Hornik K (2005) The R FAQ. ISBN 3-900051-08-9. http://www.CRAN.R-project.org/doc/FAQ/
Insightful Corporation (2005) S-PLUS 7. http://www.insightful.com/products/splus/
James D, DebRoy S (2005) RMySQL
Lang DT (2005a) RDCOMClient: R-DCOM client. R package version 0.91-0.
http://www.omegahat.org/RDCOMClient, http://www.omegahat.org, http://www.omegahat.
org/bugs
Lang DT (2005b) RDCOMServer: R-DCOM object server. R package version 0.6-0.
http://www.omegahat.org/RDCOMServer, http://www.omegahat.org, http://www.omegahat.
org/bugs
Lang DT (2005c) XML: Tools for parsing and generating XML within R and S-Plus. R package
version 0.99-1. http://www.omegahat.org/RSXML
Lapsley M, Ripley BD (2005) RODBC: ODBC database access. R package version 1.1-4
Martelli A (2003) Python in a Nutshell. O’Reilly Media, Inc. ISBN 0596001886
McCullough BD, Wilson B (2002) On the accuracy of statistical procedures in Microsoft Excel 2000
and Excel XP. Comput Stat Data Anal 40:713–721
McNab E, Swart RE, Hinks P, Horn D, Jansen A, Jewell D, Wako W, Winning C (1996) The
Revolutionary Guide to Delphi 2. Peer Information Inc. ISBN 1874416672
Microsoft Corporation (2001a) Common language runtime. In: ‘MSDN Library’, vol. .NET Fra-
mework SDK, Microsoft Corporation. http://www.msdn.microsoft.com/
Microsoft Corporation (2001b) Microsoft office 2000/visual basic programmer’s guide. In:
‘MSDN Library’, vol. Office 2000 Documentation, Microsoft Corporation. http://www.msdn.
microsoft.com/
Microsoft Corporation (2001c) Visual basic. In: ‘MSDN Library’, vol. Visual Studio 6.0 Documen-
tation, Microsoft Corporation. http://msdn.microsoft.com/
Microsoft Corporation & Digital Equipment Corporation (1995) The component object model
specification, Technical Report 0.9, Microsoft Corporation (Draft)
Mono Project (2006) The Mono Project. http://www.mono-project.com/
Nardi BA (1993) A Small Matter of Programming. MIT Press, Boston. ISBN 0-262-14053-5
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=6799
Neuwirth E, Arganbright D (2003) Mathematical Modeling with Microsoft Excel,
Thomson-Brooks/Cole. ISBN 0-534-42085-0. http://www.brookscole.com/cgi-wadsworth/
course_products_wp.pl?fid=M2b&product_isbn_issn=0534420850&discipline_number=1
Object Management Group I (2002) Common object request broker architecture: Core specifica-
tion, Technical report, Object Management Group, Inc. 3.0
OpenOffice.org (2006) OpenOffice. http://www.openoffice.org/
R-core members, DebRoy S, Bivand R, others: see COPYRIGHTS file in the sources (2005) foreign:
Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase. R package version 0.8-10
R Development Core Team (2005a) An introduction to R, R Foundation for Statistical Computing.
Vienna, ISBN 3-900051-12-7
R Development Core Team (2005b) R: a language and environment for statistical computing, R
Foundation for Statistical Computing. Vienna, ISBN 3-900051-07-0 http://www.R-project.org
R Development Core Team (2005c) R Data Import/Export, R Foundation for Statistical Computing.
Vienna, ISBN 3-900051-10-0
R Development Core Team (2005d) R installation and administration, R Foundation for Statistical
Computing. Vienna, ISBN 3-900051-09-7
R Development Core Team (2005e) R language definition, R Foundation for Statistical Computing.
Vienna, ISBN 3-900051-13-5
R Development Core Team (2005f) Writing R extensions, R Foundation for Statistical Computing.
Vienna, ISBN 3-900051-11-9
Stallman RM (2005) Using and porting GCC, 2.95 edn. Free Software Foundation.
http://gcc.gnu.org/
Wall L, Christiansen T, Schwartz R (1996) Programming perl. O’Reilly & Associates.
ISBN 1-56592-149-6
