
Data Services in your Spreadsheet!

Régis Saint-Paul, Boualem Benatallah
School of Computer Science & Engineering
University of New South Wales
Sydney NSW 2052, Australia
{regiss, boualem}@cse.unsw.edu.au

Julien Vayssière
SAP Research, RC Brisbane
Level 12, 133 Mary Street
Brisbane QLD 4000, Australia
julien.vayssiere@sap.com

ABSTRACT
Service-oriented architecture offers high-level and integrated access to data and applications across the company. Using Data Services, together with Service Data Objects, developers can now refer to business entities rather than storage structures. They are relieved from the burden of repetitive and low-level tasks such as joining tables.

Sadly, a vast community of developers does not yet enjoy this relief. They are end-user programmers, and their programming environment of choice is the spreadsheet. In this paper, we identify the criteria that define, from an end-user point of view, a good integration of spreadsheets with a service-oriented architecture. Based on these criteria, we propose SpreadATOR, a generic bridge between data services and spreadsheets.

Categories and Subject Descriptors
H.4.1 [Office Automation]: Spreadsheets; E.2 [Data Storage Representations]: Object representation

General Terms
Languages, Design

Keywords
Spreadsheet, Service-oriented architecture, integration

Copyright is held by the author/owner(s).
WWW2007, May 8–12, 2007, Banff, Canada.

1. INTRODUCTION
End-user programmers (45 million of them, as estimated for 2001 in the US alone [18]) routinely use spreadsheets to visualize, manipulate, and analyze data. Thanks to this environment, they can build more or less complex applications that solve their daily problems. Even building a report can be seen as programming an application that takes corporate data as input and outputs a presentation. To build this application, spreadsheet users have to import data and place them in spreadsheet cells, highlight the important pieces, perhaps compute some aggregates, and add a chart or two. If well done, this application can be used again later on a new set of data to effortlessly produce a new report.

Service oriented architecture (SOA) emerged as a response to the general problems of enterprise application integration (EAI) and enterprise information integration (EII) [11, 20, 5]. This architecture offers a unified and high-level view of the company resources.

In particular, with the advent of data services [8] and standards such as Service Data Objects (SDO) [19], professional developers now benefit from high-level and integrated data access. Integrated, because developers can transparently access, from a single end-point, information that may be managed by various systems and stored in various locations. High-level, because data services rely on a conceptual modeling of the information, namely the Entity-Relationship (ER) model. For example, developers can retrieve from a data service an entity customer. This entity presents information retrieved partly from a relational database and partly from supply chain management software. From this entity, developers can also access related entities such as the purchase orders or the invoices of this particular customer.

Undoubtedly, spreadsheets need to be integrated with SOA. Indeed, there already exist some efforts to this end. For instance, Microsoft Excel Services [1] allows incorporating Excel computations as part of a larger process. Another example is Visual Studio Tools for Office (VSTO) [14], which allows isolating the presentation elements of a spreadsheet from the data it contains. Data are stored as separate XML documents and can be consumed by other applications.

Those initiatives, however, are meant for professional developers. A solid background in object-oriented programming is needed to use VSTO. In this article, we are concerned with the large majority of spreadsheet users: those who are not professional programmers. Manipulating SDO entities can already be done by using the macro language that accompanies spreadsheet environments. But few end-users are ready to invest time in learning a macro language, which amounts to learning object-oriented programming.

For the majority of end-users, spreadsheet programming means formulas and cell manipulations only, sometimes assisted by wizards or visual assistants. An integration solution has to preserve this programming model and has to accommodate existing spreadsheet environments. With this in mind, we want to investigate how the integration of spreadsheets with SOA can positively impact the daily tasks of end-users, and what would characterize, to them, a good integration.

The integration of spreadsheets with data services has several facets. In this article, we focus on the problem of importation and manipulation of data delivered by data services. Other aspects, such as the exportation of data from spreadsheets, are left for future research.
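As a concrete reading of the entity-level access described in this introduction, the following minimal Python sketch shows a single end-point that assembles a customer entity from two back-end sources and exposes its purchase orders as a ready-made relationship. Everything in it is an assumption made for illustration: the class names (Customer, PurchaseOrder, CustomerDataService), the method get_customer and the in-memory rows are hypothetical and belong neither to SDO nor to SpreadATOR. The sample values echo those shown in the paper's Figure 2, and the pairing of order identifiers with dates is illustrative.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class PurchaseOrder:
    po_id: str
    po_date: str
    po_total: float


@dataclass
class Customer:
    customer_id: str
    last_name: str
    first_name: str
    # Relationship to other entities, pre-built and ready to navigate.
    orders: list = field(default_factory=list)


class CustomerDataService:
    """Hypothetical data service: one end-point over two back-end systems."""

    def __init__(self, crm_rows, order_rows):
        self._crm_rows = crm_rows      # stands in for a relational table
        self._order_rows = order_rows  # stands in for a supply chain application

    def get_customer(self, customer_id):
        # The caller names a business entity; the lookup and the "join"
        # with the order source stay hidden behind the service.
        row = next(r for r in self._crm_rows if r["id"] == customer_id)
        orders = [
            PurchaseOrder(o["po"], o["date"], o["total"])
            for o in self._order_rows
            if o["customer"] == customer_id
        ]
        return Customer(row["id"], row["last"], row["first"], orders)


crm_rows = [
    {"id": "001", "last": "Prefect", "first": "Ford"},
    {"id": "002", "last": "Dent", "first": "Arthur"},
    {"id": "003", "last": "Beeblebrox", "first": "Zaphod"},
]
order_rows = [
    {"customer": "001", "po": "001-051001", "date": "05/10/2005", "total": 70.00},
    {"customer": "001", "po": "001-040801", "date": "08/08/2004", "total": 95.00},
]

service = CustomerDataService(crm_rows, order_rows)
customer = service.get_customer("001")

# Navigate the relationship instead of writing a join.
average_total = mean(po.po_total for po in customer.orders)
print(customer.last_name, len(customer.orders), average_total)
```

The point of the sketch is the shape of the interface rather than its implementation: the caller never names a table or a join condition, which is the kind of relief that a spreadsheet integration would ideally pass on to end-users.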
This paper is organized as follows. In Section 2 we define more precisely the model of spreadsheet programming and present the framework of service-oriented architecture and data services.

In Section 3.1, our first contribution is to identify the dimensions of the importation and manipulation of data objects in spreadsheets; that is, we define the characteristics of a good integration solution with data services. In particular, we identify how the richness of the ER model can benefit end-users when they program with the flat tabular representation of data proposed by a spreadsheet.

On this basis, we review in Section 3.2 the existing approaches to data importation. As we will see, those approaches are not fully satisfactory because (i) they do not embrace the spreadsheet programming model and thus have limited programmability, and (ii) while they offer some pre-defined manipulations that take advantage of the underlying structure of imported data, users cannot benefit from this structure in their programming.

[Figure 1: Spreadsheets in Service Oriented Architecture. Three layers: the spreadsheet layer (grid model); the service layer of data services (conceptual model: Entity-Relationship, SDO); the resource layer (logical model, e.g. relational), with DBMS, OLAP, a CRM application and a supply chain application.]
In Section 4 we present our approach, called SpreadATOR, for data importation. Our other contributions are:

• In order to give more programming possibilities to end-users and to leverage their expertise in spreadsheet programming, we propose to import data by using the traditional spreadsheet programming model. To this end, SpreadATOR proposes to specify importation by using formulas. But contrary to other approaches, those formulas are not limited to primitive data types and can be used to access collections of data or composite data structures.

• We propose a method to allow users to exploit relationships that exist between entities to access related information. This method also lets users program computations that apply to a business entity (e.g. a customer) and can be reused when the entity appears in another context (e.g. in a list of customers).

• We propose to enhance the traceability of imported data by offering a generic facility for the display of meta-data and by allowing users to use meta-data in formulas.

In Section 5, we show how our prototype implementation of SpreadATOR, an add-in to MS Excel, can be used with publicly available data sources and how it simplifies the spreadsheet developer's task when working with composite data. We present additional related work in Section 6 and conclude in Section 7.

2. BACKGROUND

2.1 Spreadsheet Programming Model
Spreadsheet applications come in a variety of implementations and bear an even more varied set of features. In addition, several research proposals have been made to extend the spreadsheet model in various directions.

In order to facilitate our exposition, we introduce in this section what we call the traditional model of spreadsheet programming. We intend here to capture the essence of spreadsheets common to the vast majority of commercial implementations. We reserve the discussion of the various extensions to this model, mainly research prototypes, for Section 6.

The traditional spreadsheet model is based on a grid, also called a worksheet, where cells are identified by their coordinates, denoted using letters for the columns and numbers for the rows. For example, C4 is the cell located in the third column of the fourth row. Each cell contains either a single atomic value or no value at all. This value is obtained either from a direct user input (e.g. the user may input C4=6) or from a formula expression (e.g. the user may set C4=SIN(C3)+5). Formulas are expressions that combine (i) functions, such as SIN(x) or '+' in this last example, (ii) constant expressions and (iii) variable references denoted as cell coordinates (C3 in this example). A spreadsheet application may contain several worksheets, together referred to as the workbook.

Spreadsheets are often tagged as the most successful end-user development environment [15, 12]. An explanation for this well-deserved reputation can be sought in the cognitive dimensions (CD) framework [6]. CDs are criteria that help to understand the difficulties faced by developers in using a system. An example of a cognitive dimension is progressive evaluation. It measures how far developers need to go into their programming before being able to check whether what they did is correct. Spreadsheets evaluate continuously and thus do very well along this dimension.

Some tasks, such as adding a chart, cannot be done by formula programming. A wizard dialog can in this case be used to guide users through the sequence of operations needed to accomplish the task. Indeed, formulas themselves can be built with the help of such a visual assistant.

The spreadsheet programming model is thus a combination of functional programming, grid manipulations and visual programming. All these elements concur to deliver an intuitive experience, with little or no learning barrier, and a great payoff. Extensions to spreadsheets have to preserve as much as possible the qualities of this programming model.

We refer the reader to [12] for a more complete exposition of the spreadsheet programming model. In the following, we will introduce when needed the characteristics of this model that we believe are important when importing and manipulating data.

2.2 Service Oriented Architecture
Our framework is the service oriented architecture illustrated in Figure 1. It consists of three layers.

The resource layer is the domain of data management systems, e.g. relational databases or data warehouses, as well as applications, e.g. customer relationship or supply chain management software. At this layer, data are structured according to a logical model, concerned with issues such as retrieval performance or storage costs. For example, data can be organized in relational tables with various degrees of normalization in order to reduce the computation time of queries. For their part, applications are accessed in this layer using specific, often heterogeneous, programming interfaces (e.g. C++ for the CRM and Java for the supply chain software).

Programming at this level often implies tedious and repetitive manipulations. For example, in order to retrieve from a relational database all the desired information regarding a customer, programmers may have to join several tables in a rather complex SQL expression.

The service layer helps in providing a higher-level experience to developers. Application interfaces are made homogeneous thanks to web services, and information can be accessed through data services. For instance, ALDSP from BEA [8] proposes data access through the Service Data Object (SDO) standard [19, 21]. Service data objects, or their .Net cousin [2], rely on the Entity-Relationship model and allow developers to use this model for accessing and updating data.

This approach relieves developers from the low-level manipulations typical of the resource layer. When an entity such as customer is accessed, all the related information can also be accessed easily. The customer entity has explicit relationships with other entities such as purchase orders. Developers can use these relationships to gain access to related information.

The third layer in Figure 1 represents the spreadsheet application. We have seen in Section 2.1 that in the spreadsheet environment, variables are cells and the only available data structures are cell matrices.

When going from the bottom layer of resources to the upper layer of services, developers gain in terms of abstraction. By contrast, programming in a spreadsheet looks like a huge step backward. Here lies the main difficulty. A good integration will allow spreadsheet developers to benefit from the high-level abstractions of data services while letting them work with the simple structures offered by a worksheet. In the next section, we discuss what this implies.

3. DATA IMPORTATION AND MANIPULATION
This section presents the dimensions of the problem of data importation from data services to spreadsheets and their subsequent manipulation. We then review the existing approaches to data importation and identify their strengths and weaknesses.

3.1 Dimensions of the problem

3.1.1 Incremental construction of grid representation
We have seen in Section 2.1 that one of the characteristics of the spreadsheet programming model is the possibility to incrementally build applications, starting from simple ones and gradually adding more and more formulas.

Importation consists in locating a data resource and building its representation on the worksheet grid. This should be accomplished in a way that preserves the incrementality of spreadsheet programming. Users should be able to modify any part of a grid representation, adding or removing imported information, with a minimum of side-effects on the rest of the application. And since the modification unit in a spreadsheet is the cell, building the grid representation ought to be possible cell by cell. In terms of cognitive dimensions, this notion is referred to as viscosity.

3.1.2 Traceability
It is important to know where the information displayed in a spreadsheet comes from and what it means, not only at the time the application is built, but throughout its whole life-cycle. For example, if a cell displays the value 'Prefect', does it stand for the last name of a customer, a misspelling of 'Perfect' or the model name of a car?

Clearly, users should be able at any given time to know which external information is represented in any given cell. This information is twofold. First, the cell value has to be precisely bound to an external resource. This binding is needed to retrieve the value and to update it subsequently. This identification is thus at a system level. Second, the information needs to be identified at a user level with what is called meta-data. Meta-data include for example the semantics of the value, its precision or, if it is a distance, the unit in which it is expressed (e.g. miles, kilometers or light-years). Finally, users may need to refer to meta-data in formula expressions.

3.1.3 Parametric data access
Parametric data access is needed in order to compose importation operations. For example, a user may first import a list of customers from the Customer Relationship Management software. Then, additional data regarding the current processing of customer orders may be retrieved from the supply chain management software. In this scenario, the second importation depends on data that come from the first importation.

We have seen in Section 2.1 that in spreadsheet programming, cell references are used as variables. In a stock portfolio application, for example, it means that the stock name would be stored in a cell, and that the importation of its quote would use that cell reference as a parameter.

3.1.4 Data access efficiency
Importation may impact both the data service used to deliver the information and the spreadsheet application that retrieves it. To illustrate this problem, consider the portfolio application mentioned above. A table consisting of the names and quotations of a list of stocks has to be imported from some relational database. It is more efficient to retrieve this list with a single selection and projection query than to query each individual cell value separately.

3.1.5 Relationship-based manipulations
As mentioned in Section 2.2, data services offer high-level access to corporate information based on the Entity-Relationship model. We argued that a good integration of spreadsheets has to give users access to this model. But what does that mean concretely? To illustrate this, consider a cell that displays, as in Figure 2, the last name of a customer, say 'Prefect'. Accessing the underlying ER model, represented on the left of the figure, means here that we can display additional details about this customer, say his first name and address, or the list of his recent purchases. That is, the value is not seen as isolated, but as an element of a larger composite entity, here the customer, in relationship with other entities, e.g. purchase orders.

[Figure 2: Relationship navigation. Left: the ER model, with Customer entities (lastName, firstName) related through an order relationship to PO entities (poDate, poShipAddress, poTotal). Right: the importation of a list of customers, and the template PODetailsAndAverage importing the list of orders for a given customer.]

The Pivot Table, a feature found in MS Excel, implements such a mechanism and, incidentally, offers an illustration of the shortcomings that an integration solution should avoid. The pivot table allows computing an aggregate, e.g. a sum or an average, of a collection of values grouped along some dimensions. It is the spreadsheet representation of an OLAP cube. It takes the form of a table, i.e. a collection of cells, where horizontal and vertical headers represent the chosen dimensions and where each non-header cell is an aggregate. A right-click on an aggregated value pops up a context menu that offers to display the details of this aggregate, that is, the list of individual values that were summed or averaged to produce this cell content. This is a form of relationship navigation. The aggregated value is in relation with the individual values used to compute the aggregate. However, the navigation experience offered by this method poses two problems: it is fixed and closed.

The navigation is fixed since it is not possible to specify how the details are displayed. In the pivot table case, they are displayed in a new worksheet as a table where each column is a dimension and each row an individual value. But suppose that you modify this new worksheet to compute some custom aggregations, say a sum of all values greater than 100. If you happen to need a similar computation for some other aggregate values, you will unfortunately have to do the work again, as the details of each aggregate are going to be displayed in their own newly created worksheet. There are workarounds. For example, the computation can be programmed in a separate worksheet that refers to the worksheet automatically produced by this tool. But this method is more complex. What would be needed here is the possibility to customize the navigation.

The navigation is also closed since the details of an aggregate are accessible only from within the pivot table, through this particular context menu. It is not possible, for example, to access the details of an aggregate from another tool or from a formula expression, nor is it possible to refer to the origin of a cell value from a formula or another tool. Only the resulting aggregate value is accessible. The structure to which it belongs, a lattice in this case, is known to the pivot table, since it allows users to drill down, roll up or display the details of a given aggregate. But the structure is only accessible through the set of pre-defined manipulations offered by this tool. It is possible to perform a drill-down operation by using the pivot table API from the macro language of Excel, but not to build spreadsheet-based computations that refer to the drilled-down data.

This hinders the interoperability between importation solutions. The spreadsheet becomes integrated with a collection of separate systems, but no interaction is possible between them in spreadsheet programming. However, we observe that this situation is not due to a defect in the pivot table itself. It has to do with the approach that consists in integrating the spreadsheet directly with the resource layer (see Section 2.2 and Figure 1), where resources are heterogeneous. Integrating with the service layer makes it possible to build a single solution to access a variety of systems, and closedness in this case would hopefully become irrelevant.

We also want to emphasize that a relationship navigation is different from a parametric importation. The difference is that relationships are pre-built in the conceptual model and users don't need to express them: they are ready to use. Relationships are precious when the parameters are not trivial (for example when obtaining the purchase orders of a customer involves joining several tables with composite foreign keys). Roughly put, relationship navigations are to parametric data access what SDOs are to SQL.

Now that we have a clearer picture of what to expect from an integration solution with SOA, we propose to examine existing approaches to data importation and see how well they do along these five dimensions.

3.2 Review of existing approaches
We observed that approaches for data importation and manipulation could helpfully be classified into two categories: formula-based importation and external mapping definition. For each, we outline below its main traits, give some examples of commercial products, and discuss its relative merits according to the criteria identified in the previous section.

3.2.1 Formula-based importation
In this model, the grid representation of external data is obtained from formula evaluation, as illustrated in Figure 3. Examples of this method for MS Excel include the built-in database functions (e.g. DGET, DSUM, DAVERAGE, etc.) and the Real Time Data (RTD) provider. The function DGET(x,y,z), when used in a cell formula, retrieves from a database x the value corresponding to attribute y of a tuple identified by z. RTD is an extension mechanism that can be used by a professional developer to provide access to dynamic values. They correspond respectively to a pull and a push model of data importation. In addition to those features, a professional developer can easily extend the library of functions available in the formula language with User Defined Functions (UDF). Figure 3 illustrates this approach with a UDF Customer that takes the customer number and an attribute as parameters.

The main advantage that derives from using formulas for data importation is that it is perfectly in line with the spreadsheet programming model and, thus, shares its good properties:

• The grid representation can be built incrementally and each cell can individually be modified;
• The traceability is immediate since the formula reflects the origin of the information, provided its syntax is not too abstruse. However, only the binding of a value to external data is expressed in a formula. Additional meta-data support is needed;

• Data access is naturally parametric, and only limited by the way functions are defined. For instance, importing data with the function Customer(id, attributeName) gives more flexibility than with CustomerLastName(id) but less than GetData(entityType, id, attributeName). Note that the three versions could be provided.

[Figure 3: Formula-based importation. Each imported cell holds a formula such as =Customer(001, 'lastName').]

The efficiency of the data access is however more problematic. First, a straightforward implementation of the functions used for data retrieval implies an individual query to the data resource. This can have a significant impact on the data provider, the spreadsheet application and even on the network. It can be mitigated by implementing some cache mechanism local to the spreadsheet, but (i) this implementation is not trivial and (ii) even so, the evaluation of separate functions for each of the cells will have an impact.

The major concern is thus that formulas are not suited to import collections of data. They can return only values compatible with the cell content, i.e. a single primitive value such as a text, a date, or a number. In addition to the performance impact, this limitation also makes it impossible to import collections of data with varying or unknown size. This, of course, is not acceptable as most of the data served by data services are in one or both of these categories.

Finally, although relationship navigation can be implemented, very few approaches actually propose this feature. For example, the Excel add-in for MS Analysis Server [3] implements this feature by parsing the formula to identify the component of the structure to which it refers and offers some navigation choices in a context menu. As mentioned in Section 3.1, this navigation is closed and fixed.

3.2.2 External mapping definition
An alternative way to import data in spreadsheets is to specify which data to import and where on the spreadsheet to place them, as shown in Figure 4. We refer to this specification as the mapping definition. It is programmed by spreadsheet users through wizards or visual assistants. This definition does not involve formulas and, in that sense, it is external to the spreadsheet application.

[Figure 4: External mapping definition. The mapping, kept outside the spreadsheet application, associates Customer(001, lastName) with cell B2, Customer(001, firstName) with cell C2, and Customer(>001, {lastName, firstName}) with the range B3:C4.]

The XML mapping tool [17] that ships with MS Excel is an example of this approach. It relies on drag-and-drop operations for the mapping definition. Users can select XML elements from the tree representation of an XML schema and drop them over the cells where they want the information to appear. This results in a mapping between cells and XML schema elements not much different from that represented in Figure 4. It is not a complete mapping definition yet, as the data to import also need to be specified. This is done in a separate operation in which the user actually selects an XML document of that schema.

Another example is given by the pivot table¹ already introduced in Section 3.1. When used with external data, a series of wizard dialogs helps users to formulate the "which" part of their mapping, i.e. to select a data source and build a query. Then, users can specify the "where" part which, in this case, means choosing a single cell: the upper-left corner of the pivot table. The number of cells used by the pivot table will depend on the size of the data that need to be displayed.

¹ Arguably, the pivot table is not a "pure" importation solution in the sense that it can also transform the imported data to compute aggregates; but (i) it is an importation tool when used with an OLAP data source and (ii) being standard in Excel makes it a good example.

This approach, a mix of wizard dialogs to select the data and of visual assistants to build the grid representation, has been adopted by the vast majority of software vendors to propose the integration of spreadsheets with their applications; far too many indeed to start citing them. The advantages of the approach are clear: (i) the data access can be made very efficient. Since users first build a query to retrieve all the related data in one shot, only one query needs to be sent to the data resource. (ii) It is possible to specify a grid representation of collections of data of unknown or varying size.

There are drawbacks however, mainly because the mapping definition is not programmed in a spreadsheet style, with formulas.

• Traceability is problematic. The cell content is usually identified by headers; only the cell location determines its content, meaning that it cannot be moved. Some approaches resort to using the comment zone of cells to store information regarding the origin of the data. Using the comment zone in fact highlights that spreadsheets don't provide any facilities to store, display and use meta-data. This problem is recognized as very significant when spreadsheets are used in contexts such as Business Intelligence or reporting [10]. This has led us to propose in Section 4.3 a systematic support of meta-data information.

• A parametric data access supposes that some spreadsheet cells can be referenced in the external mapping
definition. If we take the example of Figure 4, it means that rather than defining the mapping for customer 001, we would define it for customer A1. This is not a technical difficulty, but it introduces an important problem and, probably for that reason, we haven't reviewed any approach allowing it. The problem is that of hidden dependency, one of the cognitive dimensions we introduced in Section 2.1. It means that if we define the mapping this way, the cells C2 and B2 of our example now become dependent on the value of cell A1. When this happens in spreadsheet programming, the dependency between C2 and A1 would be clear from the fact that the formula in C2 refers to A1. Here however, cell C2 itself doesn't show this dependency: it is hidden unless the appropriate dialog used to define the external mapping is displayed.

• The grid representation is no longer built incrementally: all the data are imported at the same time. More often than not, in order to add an attribute to an imported table, it is necessary to go through all the steps of the importation process again. The previous importation, and possibly all the modifications made by the user, is simply erased and replaced by a new one.

Note that to mitigate this last problem, or the fact, for instance, that the way mappings are defined forces collections of values to be imported as contiguous collections of cells, some products allow transforming an external mapping into a formula-based importation (e.g. SAP BEx Analyzer [4], Oracle BI [13] or Microsoft Analysis Server [3]). A set of cells which obtains its values by an external mapping can be refactored into formula expressions to retrieve cell values. It then becomes possible to relocate any of these cells and, thus, to insert an empty row or column in a table without preventing future data refresh. However, this method cannot be called formula-based. Data are still queried as defined during the mapping definition and the formulas refer to the result of this query. Thus, both traceability (since the formula doesn't refer to the external data) and parametric access (since the query isn't in the formula) remain

Table 1: A comparison of importation models

                Formula-based       External
Efficiency      Problematic         Good
Traceability    Good                Fragile
Incremental     Good                Pre-commitment
Parametric      Good                Inconsistency, hidden dep.
Relationship    Closed and fixed    Closed and fixed

[Figure 5: Formula-based external mapping definition]

of data or composite data: formulas return only primitive types. This weakness has a huge practical impact as most of the data users need to import are indeed collections or composite. As a result, existing systems for data importation almost solely rely on an external definition of the mapping.

It appears then that in order to satisfy all the criteria listed in Section 3.1, we need to blend the qualities of both approaches. This is what we attempt to do with SpreadATOR². In a nutshell, SpreadATOR is essentially an external mapping approach but it also offers a spreadsheet-like programming experience based on formulas. Section 4.1 presents the formula language that we use to construct the external mapping definition.

In Section 3.1.5, we discussed why we believe that both formula and external mapping approaches are not entirely satisfying when it comes to relationship-based manipulations. We present in Section 4.2 how those manipulations are eased in SpreadATOR and what the resulting benefits are for the end-user programmer.

Finally, we saw that conveying meta-data information is necessary to achieve good traceability. We propose in Section 4.3 an innovative method to (i) convey this information and (ii) allow end-users to incorporate it in their computations.

4.1 Formula-based external mapping
We define SpreadATOR as a middleware for spreadsheet integration. It adopts an external mapping approach to import data retrieved from data services. The innovation of SpreadATOR is to make this mapping definition explicit and to blend it with the rest of the formula-based programming of the spreadsheet application.

SpreadATOR mapping definition is thus based on formula
a problem. expressions that are very similar to spreadsheet formula.
Regarding relationship-based manipulation, the situation They are stored in cells and can use other cell references.
is the same as in formula-based approach. They are possible Figure 5 shows such a mapping definition in cells B2 and B3.
and some form of navigation is usually supported by prod- The exact syntax chosen for the language is not important
ucts in this category. But as discussed about the pivot table here. It is merely a matter of preference or implementation.
in section 3.1, they are closed and fixed. To emphasize this idea, we used a formula with an object-
oriented syntax in cell B2 and one in the style of XPath in
4. A NOVEL APPROACH: SPREADATOR B3.
Our implementation of SpreadATOR relies on JScript.Net
A formula-based importation—since it conforms to the for formula evaluation. JScript is an implementation of
spreadsheet programming model—is more satisfying than javascript for the .Net framework. Therefore, the formula
an external mapping definition on the dimensions concerned
with the programming aspects (see Table 1). However, formula- 2
SpreadATOR stands for “Spreadsheets and dATA Objects
based importation is not suited for importing collections Reconciled”.
syntax we adopted corresponds to that of cell B2. Although not strictly speaking an object-oriented language, javascript has the advantage of being easily interfaced with pure object libraries. In particular, it is possible to use any assembly compliant with .Net from JScript expressions. Thus, our implementation can benefit from ADO.Net [2], the .Net equivalent of SDO, and can interact with other data access libraries such as Application Programming Interfaces (API). Section 5 presents such a scenario. To avoid unnecessary confusion, we will adopt for the rest of this paper an object-oriented terminology and, for example, speak of object instances rather than entities.

Figure 5 shows the mapping definition in one spreadsheet grid and its evaluation in another. In reality, the spreadsheet user only sees one grid and can choose to display either the formulas or their evaluations, as is already the case with traditional formulas. Traditional formulas and SpreadATOR formulas are merged in a single interface, making the programming very intuitive to spreadsheet developers. They are oblivious of the fact that the two types of formulas are maintained by different systems.

Formula expressions in (the current implementation of) SpreadATOR are essentially javascript statements. However, we needed to extend the language slightly with a few keywords and some syntactic sugar. First, spreadsheet formulas need to reference cells. Because cell coordinates could collide with the names of object members or variables, we enclose them in angle brackets (e.g. =customers[<A1>].lastName imports the last name of the customer whose number corresponds to cell A1). Second, we needed some mechanism to allow the mapping of collections of values to collections of cells. This is achieved by using the character * instead of the identifier of an element of the list. For example, a column containing the last names of a list of customers is obtained by =Customers[*].lastName³. Finally, three other keywords (obj, template, and metadata) take a special meaning in SpreadATOR formulas and are presented in the following sections.

³ Note that only one * is allowed per formula.

Statements that return an object reference are also valid. For example, =Customers['001'] returns a reference to an instance of customer and =Customers returns a reference to the complete list of customers. The reference returned is managed by SpreadATOR; for Excel, the cell simply contains a string representation of these objects (obtained by the default transtyping given by toString()). The advantages of storing a reference to a composite entity in a cell are:

• It is now possible to refer directly to these composite objects to build their representation on the spreadsheet. For example, if B2=Customers['001'], we can have a formula B3=<B2>.lastName. Thus, it suffices that the content of B2 changes (e.g. if it is replaced by a reference to customer 002) for all related formulas to change accordingly. This makes formulas shorter, easier to read and more efficient to compute;

• The content of cell B2 now has a type. It is possible to display additional information corresponding to that particular entity type and permit navigation to related entities through the template mechanism, described in the next section.

We want to emphasize that, despite the object-like syntax of the formula language used in SpreadATOR, we do not assume any familiarity of end-users with object-oriented programming. First, users only access pre-built objects available from data services; they are not actually “creating” these objects. Second, the use of formulas does not exclude the complementary usage of wizards and visual assistants to generate the formulas. Traditional spreadsheet formulas are themselves often built by using a wizard dialog. The visual assistant we propose in SpreadATOR, called the object explorer and presented in Section 5, makes the construction of the grid representation an experience very close to that of using the XML mapping tool in Excel.

4.2 Relationship-based manipulations

We saw in Section 3.1.5 that existing approaches offer limited support for relationship-based manipulation. In particular, these approaches do not allow spreadsheet programmers to actually benefit from relationships in their programming.

To address this problem, we propose to introduce a template mechanism. The idea of a template is not new to spreadsheets and end-users are already familiar with it. The innovation of SpreadATOR is to associate templates with the type of composite objects (each type may have several templates) and to allow a generic grid representation to be defined for instances of that type. Templates are given names and are proposed in a drop-down menu (see Figure 6(a)); its content depends on the current cell selection.

It is the fact that a SpreadATOR formula can return references to composite objects that makes it possible to associate templates with types. To illustrate this mechanism, suppose a worksheet with a formula A1=Customers[001]; that is, cell A1 contains, from SpreadATOR's point of view, a reference to the instance of type Customer that represents the customer 001. When A1 is selected, users can open a template associated with the type Customer (or create a new template for that type). An internal object named obj is associated with the instance referenced in A1. Users can use this reference to build a representation of this object.

So the difference between a template and a worksheet is that instead of referencing an external object, such as in =Customers[001].lastName, a template references an internal object denoted obj. For example, in a template suited for objects of type Customer, we can have a formula such as =obj.lastName. The approach used to build the template is exactly the same as for a worksheet and relies on the same visual assistant; only the formulas generated by the assistant are different.

The template defines how to represent an object obj. Accessing a template is equivalent to a relationship navigation since the template can display any information related to the instance selected, for example the list of purchase orders of the selected customer. It is indeed more powerful since users are not limited to displaying only the destination entity(ies) of a relationship.

Furthermore, SpreadATOR allows the customized grid representation of an object type to be accessed from a worksheet that contains instances of that type. For example, suppose that a customer template called “PO details” is used to compute some custom aggregate (say, an average of the POs whose total exceeds $100) whose result is in cell G4 of the template. From our worksheet example above, where cell A1 contains a
reference to customer 001, we can access the custom aggregate of the template by using the formula =template(A1,'PO details',G4).

This formula can easily be duplicated for all the customers present on a worksheet, simply changing the reference A1. In object-oriented terminology, it is as if the type Customer was extended with a new method that computes the custom aggregate. When this formula is evaluated, obj is associated with the reference contained in A1. It can be seen as the SpreadATOR equivalent of the keyword this in object-oriented programming. However, obj stands for “current cell composite content” rather than “current instance of that class”.

By comparison, computing this custom aggregate on a list of customers in the traditional spreadsheet programming approach (that is, without resorting to another programming paradigm such as a macro language) is more complex. For example, you could import the list of customers in a worksheet and either (i) import as a large table all the purchase orders of all customers in a second worksheet or (ii) import the list of POs in a separate worksheet for each customer. In either case, you will have to rebuild the join between the list of customers and that of POs; that is, you will have to search for the starting and ending rows corresponding to a customer (case (i)) or for the worksheet corresponding to a customer (case (ii)). The formula language of spreadsheets includes search functions precisely for such situations. But why should users have to build this join when it is readily available in the ER model of the data?

4.3 Meta-data management

Meta-data are not very different in nature from other information related to a given entity. They mainly differ by their semantics and usage. A meta-data item is typically not a value that users want to lay out on the spreadsheet because it is not an essential attribute of the information accessed. Meta-data represent a complement of information, a kind of documentation; they speak about data.

For example, Business Intelligence (BI) software often provides access to Key Performance Indicators (KPI) such as “Order processing delay”, which expresses in days the average time needed to process a customer purchase order. This value is a high-level aggregate and users need to know exactly what it means, how it is evaluated, when it was last computed, what its precision is, etc. We need meta-data (i) to always be accessible whenever we examine a KPI and (ii) not to occupy cells of their own on the worksheet, unless, of course, the specific application we build calls for it.

Thus, we propose in SpreadATOR to display meta-data separately from the worksheet in some reserved space of the user interface (see the bottom-right area in Figure 6(a)). We define meta-data as a collection of ⟨name, value⟩ pairs obtained from a collection of ⟨name, formula⟩ tuples that depend, in the same way as the template mechanism, on the type of the composite object contained in the cell. The former is used for display in a list when a cell containing a composite object is selected, while the latter corresponds to a collection defined by the user where each formula refers to the selected object through the keyword obj introduced in the previous section. For example, the end user can define a meta-data for type Customer with ⟨Last contacted, obj.lastContactDate⟩. When a cell containing a customer
is selected, the evaluation of this formula is displayed in the bottom-right section of the screen, e.g. ⟨Last contacted, 21/06/2006⟩.

Though this value is not needed for display, it could be needed for computation. For example, users may want to highlight customers that have not been contacted for a while. Since meta-data are visible at the same time as the worksheet, they can be used in drag-and-drop operations. The keyword metadata is used to refer to their value. For example, =metadata(A1, 'Last contacted') returns the meta-data named Last contacted for the composite object (here of type Customer) contained in cell A1. Meta-data are hence very similar, yet complementary, to templates. While a template can be used to display a large quantity of related information, the purchase orders of a customer for instance, meta-data offer a simple and intuitive mechanism to display complementary information about an imported entity.

5. CONSUMING REAL-WORLD DATA

To validate our approach, we demonstrate in this section how RSS feeds can be easily accessed and manipulated in Excel using SpreadATOR. We show (i) that access to an RSS feed can be performed using a generic library (that is, a library which has not been developed specifically for use from within a spreadsheet) and (ii) that the composite structure of RSS feeds can be laid out on the worksheet and manipulated in formulas using pure visual programming, i.e. through a combination of point-and-click and drag-and-drop operations.

RSS is a popular XML format for news syndication. RSS documents are XML documents with several levels of nested parent/child relationships. A typical RSS document includes one or more channels where each channel contains a collection of news items. They are published by news-oriented web sites or blogs and are widely available on the Internet.

We use here the library called RSS.Net⁴. This library exposes a class called RssFeed, which is an object-oriented representation of an RSS document. An instance of RssFeed can be built using the static method Read(url) of this class.

To be available in SpreadATOR, a library or a service first has to be referenced. This is done using a dialog where users have to provide the service URL (from which a library is produced using .Net utilities) or, as in our example, select the compiled file of a library (called an assembly in .Net terminology).

Once this is done, all the public resources of the library, including the RssFeed class, can be used in spreadsheet formulas. Figure 6(a) presents the SpreadATOR add-in as it appears in Excel. In the top right corner, SpreadATOR offers a zone where users can input formulas (though formulas can also be input using the Object Explorer, as explained hereafter). In this figure, we can see that cell A8 contains the formula RssFeed.Read(URL). The URL used in this example is that of the IEEE Transactions on Computers⁵. Evaluating this formula returns an object of type RssFeed whose default transtyping as a string is displayed in the cell (in this case, it corresponds to the URL of the feed).

⁴ Available at http://www.rssdotnet.com
⁵ Located at http://csdl.computer.org/rss/tc.xml
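
As a rough illustration of a cell that holds a reference to an RssFeed object while displaying only its string form, the following plain javascript sketch may help. It is not the SpreadATOR implementation: the classes RssItem, RssChannel, RssFeed and Worksheet are hypothetical stand-ins written for this example, the member names are assumptions, and read() returns canned data instead of downloading a feed.

```javascript
// Hypothetical stand-ins for the classes of an RSS library.
// Names and fields are assumptions made for this sketch only.
class RssItem {
  constructor(title) { this.title = title; }
  toString() { return this.title; }
}

class RssChannel {
  constructor(title, items) { this.title = title; this.items = items; }
  toString() { return this.title; }
}

class RssFeed {
  constructor(url, channels) { this.url = url; this.channels = channels; }
  // Stand-in for a static Read(url): builds a small canned feed, no network access.
  static read(url) {
    const items = [new RssItem("First news item"), new RssItem("Second news item")];
    return new RssFeed(url, [new RssChannel("Sample channel", items)]);
  }
  // Default string conversion: the feed is shown as its URL.
  toString() { return this.url; }
}

// A worksheet that stores the object reference for each cell
// but displays only the string form of that object.
class Worksheet {
  constructor() { this.cells = new Map(); }
  set(address, value) { this.cells.set(address, value); }
  get(address) { return this.cells.get(address); }
  display(address) { return String(this.cells.get(address)); }
}

const sheet = new Worksheet();
sheet.set("A8", RssFeed.read("http://example.org/feed.xml"));

// The grid shows the string form of the feed...
console.log(sheet.display("A8"));
// ...while the stored reference still gives access to the nested structure.
console.log(String(sheet.get("A8").channels[0].items[0]));
```

Keeping the reference and the displayed string separate is what lets other cells navigate from the feed to its channels and items, even though the grid itself only ever shows text.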
(a) A worksheet accessing several RSS feeds (b) Template view of one RSS feed

Figure 6: The SpreadATOR add-in user interface in MS Excel

In this example, three different feeds have been accessed in three different rows of the spreadsheet. Details of those feeds are provided in other columns. For row 8, the last refresh date and the most recent news item are accessed using the formulas <A8>.LastModified and <A8>.Channels[0].Items[0] respectively. Again, these formulas can be entered by using the Object Explorer, as described below. Other rows use similar formulas, with only the cell reference changed. This illustrates that any level in the nested structure of the RSS document can be freely laid out on the worksheet. It also shows that the individual components of the composite value contained in a cell can be accessed directly within formula expressions using only cell references, hence avoiding the impedance mismatch problem faced in traditional spreadsheet programming.

The composite value contained in cell A8 can also be displayed in detail using a template corresponding to its type (see Section 4.2). Figure 6(b) shows a detailed display of the RssFeed instance of cell A8. In this template, all formulas use the keyword obj to refer to the current instance to be displayed. For example, the formula used in cell C6 of Figure 6(b) is obj.Channels[0]. This formula returns an object of type RssChannel, whose default string representation is the title of the channel.

The Object Explorer can be seen on the right of Figure 6(b). In this mode, it displays the details of the type contained in the selected cell which, in our case, corresponds to the type RssChannel since cell C6 is selected. The Object Explorer can be used to build a layout by simply selecting one of the properties of the object contained in the selected cell and dragging the corresponding node over to an empty cell. This results in the formula corresponding to that node being copied into the cell. For instance, one can see that the node description is selected, and the corresponding formula is displayed in the bottom part of the panel.

The detailed layout of Figure 6(b) can be reused for any instance of the RssFeed class. Hence, similar details can be obtained for any of the three RssFeed objects in Figure 6(a) through a simple click. Moreover, it is possible to create several templates, corresponding to as many views of complex data of a given type. Templates can also be nested. For instance, it is possible to create a template for the type RssChannel and access it from any cell containing a composite value of that type (e.g. cell C6 in Figure 6(b)).

6. RELATED WORK

We already reviewed in Section 3.2 the existing approaches to data importation in spreadsheets, and our discussion was focused on mainstream spreadsheets. But spreadsheets have also received sustained attention from the research community. Several proposals have been made to extend spreadsheets in order to introduce features found in conventional programming languages and make these features easy to exploit for end-users.

An early work in that area is the Analytic Spreadsheet Package (ASP) [16], where the language Smalltalk 80 is used to build a spreadsheet where cells can contain instances of objects. Object visualization within cells is provided either by the default transtyping mechanism offered by Smalltalk with the printString protocol or by instantiating objects that derive from DisplayObject to build a custom visualization.

A more advanced integration of object-oriented features, as well as functional programming, into the spreadsheet environment was proposed in [9]. This work extends the traditional spreadsheet to support programming abstractions such as encapsulation, reuse, recursive functions, higher-order functions or polymorphism. It defines a full spreadsheet-based language where worksheets are seen as methods and, when grouped in a workbook collection, collectively define a class.

In [12], an extension to Excel is proposed to allow end-users to build custom functions. In this approach, the type system of the spreadsheet is extended so that whole matrices can be stored in a single cell. A cell that contains a matrix is displayed in a different way so that end-users clearly know that its content is composite. Forms/3 [7] is a prototype that implements several extensions to spreadsheet programming. It allows, for instance, recursive computations and exception handling. Cells in Forms/3 can contain any type of data.

All these approaches are very interesting; they explore how to redefine the spreadsheet programming model in order to bring into it the powerful abstractions found in other
languages. The integration of spreadsheets in SOA would be much facilitated if mainstream spreadsheet applications (and their user base) decided to adopt some of the ideas proposed in those works. For example, if Excel actually handled matrix types as cell values, we could benefit from this type system and propose a richer mapping of composite external objects.

Our concern in this article is almost the opposite, since we precisely try to leave the spreadsheet programming model untouched. SpreadATOR acts as a middleware; its formula language could just as well be hidden from users, who could choose to rely solely on the visual assistant (the object explorer). SpreadATOR, for instance, does not provide any mechanism to actually build those object abstractions. Thanks to this, we were able to implement our prototype as an add-in to an existing spreadsheet application. We try to bring to end-users this small part of the benefits of programming at the conceptual level that we think does not imply any major change in the way they already work with spreadsheets.

7. CONCLUSION

In this article, we have defined the problem of spreadsheet integration with data services from the viewpoint of spreadsheet users. We tried to answer how developers could benefit from the higher-level and integrated view of IT resources offered by service-oriented architecture. More specifically, we discussed how to leverage, in the importation and manipulation of data, the conceptual modeling of information as provided by data services and APIs such as SDO or ADO.Net. We identified the shortcomings of existing solutions and proposed a novel approach to spreadsheet integration called SpreadATOR.

An important aspect of SpreadATOR is that it can be integrated with existing spreadsheet applications such as MS Excel. It does not require an extension of the spreadsheet language and can act as a middleware. At the same time, its interface blends with MS Excel. This allows users who have the need to easily introduce programmatic aspects in their importation, for example using a cell reference to make it parametric. An additional benefit is an improved readability of the mapping. Finally, the object explorer offers the necessary support to avoid formula input and is very similar to the schema mapping tool already proposed in Excel.

The advantage of specialized importation tools over the generic approach followed by SpreadATOR is their capacity to provide very specific wizard dialogs or visual metaphors to assist users (e.g. they can refer to dimensions when accessing an OLAP server and to tables when accessing a relational database). However, we believe that SpreadATOR is in fact compatible with these high-level features. We argue that these specific assistants should output a mapping definition in a common formula-based importation language such as the one introduced in this article. This can easily be done since, as demonstrated, SpreadATOR is able to work with any (.Net) API. For end-users, the benefit is a spreadsheet application over which they have complete control as well as the possibility to combine various importation tools in a single application. By using a common mapping definition, importation systems would also leverage the common facilities offered by SpreadATOR, such as the template mechanism or the meta-data management, and save significant development time.

8. REFERENCES

[1] Excel Services Overview. Technical report, Microsoft Corp., 2006.
[2] ADO.Net Tech Preview Entity Data Model. Technical report, Microsoft Corp., June 2006.
[3] Designing Reports with the Microsoft Excel Add-in for SQL Server analysis services. Microsoft Corp., 2004.
[4] SAP NetWeaver: A Complete Platform for Large-Scale Business Intelligence. Technical report, Winter Corp., 2005.
[5] G. Alonso et al. Web Services - Concepts, Architectures and Application. Springer-Verlag, 2004.
[6] A. Blackwell and T. Green. HCI Models, Theories, and Frameworks: Toward an Interdisciplinary Science. J.M. Carroll Editor, chapter Notational systems – the cognitive dimensions of notations framework. Morgan Kaufmann, 2003.
[7] M. Burnett et al. Forms/3: A first-order visual language to explore the boundaries of the spreadsheet paradigm. Journal of Functional Programming, 11(2):155–206, 2001.
[8] M. Carey. Data delivery in a service-oriented world: the BEA AquaLogic data services platform. In SIGMOD'06, pages 695–705, New York, USA, 2006.
[9] C. Clack and L. Braine. Object-oriented functional spreadsheets. In GlaFP'97, September 1997.
[10] K. Gile. Keeping IT sane in a crazy BI world of Excel. Technical Report 36353, Forrester, 2005.
[11] A. Halevy et al. Enterprise information integration: successes, challenges and controversies. In SIGMOD'05, pages 778–787, New York, USA, 2005.
[12] S. P. Jones, A. Blackwell, and M. Burnett. A user-centred approach to functions in excel. In ICFP '03, pages 165–176, New York, NY, USA, 2003.
[13] K. Laker. Exploiting the power of oracle using microsoft excel. Technical report, Oracle Corp., 2004.
[14] E. Lippert and E. Carter. .Net programming for office: using C# with Excel, Word, Outlook and Infopath. Addison Wesley, 2005.
[15] B. A. Nardi and J. R. Miller. The spreadsheet interface: A basis for end user programming. In INTERACT'90, pages 977–983. North-Holland, 1990.
[16] K. W. Piersol. Object-oriented spreadsheets: the analytic spreadsheet package. In OOPSLA'86, pages 385–390, New York, USA, 1986. ACM Press.
[17] F. Rice. Creating XML mappings in excel 2003. Technical report, Microsoft Corp., 2005.
[18] C. Scaffidi, M. Shaw, and B. Myers. Estimating the numbers of end users and end user programmers. In VL/HCC'05, pages 207–214, 2005.
[19] Next-generation data programming: Service data objects. Technical report, IBM, BEA, 2003.
[20] http://www.service-architecture.com.
[21] K. Williams and B. Daniel. An introduction to service data objects. Java Developer's Journal, October 2004.
