Automated Refactoring of Nested-IF Formulae in Spreadsheets: Jie Zhang, Shi Han, Dan Hao, Lu Zhang, Dongmei Zhang

Automated Refactoring of Nested-IF Formulae in Spreadsheets
Jie Zhang¹, Shi Han², Dan Hao¹, Lu Zhang¹, Dongmei Zhang²

¹ Key Laboratory of High Confidence Software Technologies (Peking University), MoE, Beijing, China
² Microsoft Research, Beijing, China
¹ {zhangjie marina,haodan,zhanglucs}@pku.edu.cn, ² {shihan,dongmeiz}@microsoft.com
arXiv:1712.09797v1 [cs.SE] 28 Dec 2017

ABSTRACT
Spreadsheets are the most popular end-user programming software, where formulae act like programs and also have smells. One well-recognized common smell of spreadsheet formulae is nested-IF expressions, which have low readability and high cognitive cost for users, and are error-prone during reuse or maintenance. However, end users usually lack the essential programming language knowledge and skills to tackle or even realize the problem. Previous research has made only very initial attempts in this respect, and no effective and automated approach is currently available.
This paper is the first to propose an AST-based automated approach to systematically refactoring nested-IF formulae. The general idea is two-fold. First, we detect and remove logic redundancy on the AST. Second, we identify higher-level semantics that have been fragmented and scattered, and reassemble the syntax using concise built-in functions. A comprehensive evaluation has been conducted against a real-world spreadsheet corpus, which was collected in a leading IT company for research purposes. The results on over 68,000 spreadsheets with 27 million nested-IF formulae reveal that our approach is able to relieve the smell of over 99% of nested-IF formulae. Over 50% of the refactorings have reduced the nesting levels of the nested-IFs by more than half. In addition, a survey involving 49 participants indicates that in most cases the participants prefer the refactored formulae, and agree that such an automated refactoring approach is necessary and helpful.

1 INTRODUCTION
Spreadsheets are the most popular end-user programming tools [1]. One of the most important enabling factors is that spreadsheets provide immediate feedback, so users can make a change in one place and immediately see the results [2]. Underneath such an advantage, formulae play an important role as end-user friendly programs. However, end users typically lack essential knowledge and skills of programming, and are prone to writing formulae with bad smells [3].
One of the well-recognized spreadsheet smells is nested-IF expressions [3, 4]. IF functions¹ (i.e., the syntax is IF(condition, true branch, false branch)) are widely used spreadsheet functions. Nested-IF expressions arise when end users write an IF function inside another IF or nested-IF function. According to previous research [3–7], nested-IF formulae in spreadsheets are complex, unreadable, and error-prone, as well as hard to debug and maintain. There are also many online discussions about the harm of nested-IF formulae. Some people have expressed their desire to reduce nested-IFs “wherever possible” [8–10].
¹ Functions are predefined built-in formulae already available in spreadsheet systems.
What is worse, the bad practice of using nested-IF expressions among end users is quite common: our study of over 68,000 real-world spreadsheets² reveals that for the worksheets containing IF, 30.04% of them also contain nested-IF. If we denote the maximum nesting level inside a nested-IF as if-depth³, in our corpus each spreadsheet includes on average 9 formulae with if-depth over 10, while the observed maximum if-depth is 48 with multiple instances.
² In this paper, we refer to a spreadsheet as a file consisting of one or multiple worksheets [11].
³ E.g., IF(IF(L1 >= F$5, L1), IF(L1 <= F$6, L1, “”), “”) has an if-depth of 2.
Formula refactoring is a practical solution to tackle this problem, which was first proposed by Badame and Dig [12]: to perform semantic-preserving formula transformations (without changing the behavior) with the purpose of removing formulae smells. Nevertheless, such refactoring requires essential knowledge and skills of programming, which is challenging for end users. To help end users, several previous works [4, 12, 13] have proposed a few simple refactoring patterns trying to decrease the if-depth, but they either have very low coverage (i.e., the ratio of formulae that can be ameliorated) or are non-automatic.
In this paper, we are the first to propose an AST (Abstract Syntax Tree) based approach to systematically tackling this problem via automated refactoring. The general idea is two-fold. First, there often exists logic redundancy across different condition paths within a nested-IF. Reduction of the redundant logic can remove useless parts and simplify the nested-IF formula. Second, some higher-level semantics are often fragmented into hierarchical combinations of IF conditions in a nested-IF. Reassembling the fragmented syntax from corresponding IF-subtrees into built-in functions can shorten the nested-IF formula. To analyze and refactor both redundant logic and fragmented syntax, our approach works on the AST structure as an intermediate representation of nested-IF formulae.
The evaluation is conducted on over 68,000 real-world spreadsheets with over 27 million nested-IF formulae. The experimental results lead to the following three key takeaways. First, our approach is generally applicable: over 99% of the nested-IF formulae can be refactored, and the refactorings have been verified as correct. Second, our approach is effective: over 50% of the refactorings reduce the if-depth by more than half, while the nested-IF functions in most formulae are completely removed or transformed to if-depth 1. Third, end users recognize our approach and its results. A survey of 49 participants indicates that most of them prefer the refactored formulae and believe the automated refactoring is necessary and helpful, while only a few of them are equipped with the knowledge of manual refactoring.
The main contributions of this paper are as follows.
1) An automated and highly-effective approach to identify and refactor nested-IF formulae. The goal is to help end users
reduce the complexity and cognitive cost of nested-IF formulae in spreadsheets.
2) A comprehensive evaluation of the proposed automated approach. We evaluated the correctness, applicability, effectiveness and usefulness of the approach.
3) A statistical study on the current usage of nested-IF formulae in real-world spreadsheets. We present detailed statistics of nested-IF formulae against over 68,000 real-world spreadsheets collected in a leading IT company for research purposes.

2 PRELIMINARY STUDY
Our preliminary study aims to present the usage status of nested-IF formulae and to further motivate our approach.
The investigation is based on a spreadsheet corpus including over 68,000 real-world Excel files (a.k.a. spreadsheets/workbooks), which contain a total of 149,170 worksheets. The source of this corpus is a spreadsheet repository collected in a leading IT company for research purposes. The files in our corpus are extracted by excluding those with technical complications that obstruct interaction-free processing (e.g., password protection, or embedded external references that require trust confirmation). Compared to the corpora that have been widely used in previous research work (the EUSES Spreadsheet Corpus [14], Enron Spreadsheet Corpus [11], and Hawaii Kooker Corpus [15]), our corpus has the following two advantages:
1) Larger scale. The number of spreadsheets (68,075) is much larger than the Enron corpus (15,770), EUSES corpus (4,037), and Hawaii Kooker Corpus (74). Table 1 lists a detailed comparison between our corpus and the previous largest corpus, Enron⁴.
2) Higher diversity regarding domains. The corpus contains diverse spreadsheets for various purposes across multiple domains, while the other corpora either contain large numbers of toy spreadsheets or come from a single company of a specific domain.
⁴ We do not compare with the other two corpora because their file type is too old for the Python Excel library openpyxl [16] to parse.

Table 1: Comparison between Enron and our corpus

                                            Enron corpus    Our corpus
Total number of spreadsheets                15,770          68,075
Total number of worksheets                  79,983          149,170
Average size of spreadsheets                113.4 KB        1,211.3 KB
Number of spreadsheets with formulae        9,120           37,109
Number of spreadsheets with IF functions    2,020           14,425
Number of IF functions                      3,420,790       138,085,568

Based on this very large and diverse corpus, we investigate the usage status of nested-IF formulae. Inside one spreadsheet, many formulae may be created by dragging one formula down or to the right to repeat its calculation. As in previous work [11, 17, 18], we remove these duplicated formulae by clustering the formulae based on their R1C1 notation⁵. We then pick one formula from each cluster to form the new formula set. We call this new set the “Unique Set” and the original set the “Total Set”.
⁵ The R1C1 notation will stay the same even if the formula is dragged down or right.
The results are shown in Table 2. The first two rows show the results of the formula number. The remaining rows show the number of formulae in different depth ranges. Column “Total”/“Unique” shows the results of the Total/Unique Set. From the table, among the 27,689,699 total formulae and 19,260,407 unique formulae, over 12% of the formulae have an if-depth of over 5, and over 75,000 total formulae and 35,000 unique formulae even have an if-depth of over 15, indicating surprisingly heavy usage of nested-IF formulae.

Table 2: Usage status of nested-IF formulae

                                   Total         Unique
Formula number                     27,689,699    19,260,407
Formula number per-spreadsheet     407           283
if-depth in range (1,5]            24,206,022    16,680,744
if-depth in range (5,10]           2,815,521     2,408,549
if-depth in range (10,15]          548,129       118,355
if-depth in range (15,65]          75,455        35,201

This heavy usage of nested-IF formulae would cause much harm, as we mentioned in the introduction. However, end users usually lack the awareness of such harm. They also tend to lack enough spreadsheet function knowledge to manually refactor or avoid using nested-IF formulae. Consequently, it is essential that automatic approaches be constructed to help tackle these problems.
After manually checking a sample of these nested-IF formulae⁶, we realize that many formulae contain unnecessary conditions which would cause dead branches⁷. Additionally, a large part of nested-IF functions actually combine together to fulfill a certain functionality, which can also be fulfilled by other high-level functions already defined in spreadsheets like Excel. For example, the following formula is a real case from our corpus: IF(IF(Q1 = X1, Q1, IF(Q1 = “”, X1, IF(Q1 <> X1, Q1))) = “”, “”, IF(Q1 = X1, Q1, IF(Q1 = “”, X1, IF(Q1 <> X1, Q1)))). If we use M to represent the sub-string IF(Q1 = X1, Q1, IF(Q1 = “”, X1, IF(Q1 <> X1, Q1))), then the formula can be written as IF(M = “”, “”, M). To better illustrate the formula structure, in Figure 1 we present the general AST on the left as well as the AST of M on the right. From Figure 1, it is easy to tell that condition Q1 <> X1 is redundant: this condition is in the false branch of condition Q1 = X1, and thus is certain to be true. Additionally, IF(M = “”, “”, M) actually equals M no matter what value M is. If we remove these unnecessary IF expressions, the formula can be refactored into IF(Q1 = X1, Q1, IF(Q1 = “”, X1, Q1)). Furthermore, we can use the IFS function to transform IF(Q1 = X1, Q1, IF(Q1 = “”, X1, Q1)) into IFS(Q1 = X1, Q1, Q1 = “”, X1, TRUE, Q1), which is much cleaner than the original one. Inspired by this observation, we propose our AST-based approach to automatically accomplish this kind of formula refactoring.

Figure 1: AST example. Left, the general AST: condition M = “” with true branch “” and false branch M. Right, the AST of M: condition Q1 = X1 with true branch Q1 and false branch Q1 = “”; condition Q1 = “” with true branch X1 and false branch Q1 <> X1; condition Q1 <> X1 with true branch Q1 and false branch FALSE.

⁶ The authors manually checked the nested-IF formulae in 100 randomly-sampled spreadsheets.
⁷ A dead branch will never be executed.
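The equivalence claimed in the example above can be spot-checked outside a spreadsheet. The following is a minimal Python sketch and not part of the approach itself: the helper names (IF, m, original, refactored) and the sample values are assumptions made for illustration, blank cells are modelled as empty strings, and Excel's case-insensitive string comparison and type coercion are ignored.

```python
from itertools import product


def IF(cond, when_true, when_false=False):
    """Model of the spreadsheet IF; an omitted false branch yields FALSE."""
    return when_true if cond else when_false


def m(q1, x1):
    # M = IF(Q1 = X1, Q1, IF(Q1 = "", X1, IF(Q1 <> X1, Q1)))
    return IF(q1 == x1, q1, IF(q1 == "", x1, IF(q1 != x1, q1)))


def original(q1, x1):
    # IF(M = "", "", M)
    return IF(m(q1, x1) == "", "", m(q1, x1))


def refactored(q1, x1):
    # IFS(Q1 = X1, Q1, Q1 = "", X1, TRUE, Q1): the first true condition wins.
    for cond, value in ((q1 == x1, q1), (q1 == "", x1), (True, q1)):
        if cond:
            return value


# Hypothetical cell values; "" models a blank cell.
samples = ["", "a", "b", 0, 1]
mismatches = [(q, x) for q, x in product(samples, repeat=2)
              if original(q, x) != refactored(q, x)]
print(mismatches)  # prints []
```

Over these samples the list of mismatches is empty, which is consistent with the refactoring being semantics-preserving. It is a spot check on a simplified model, not a proof.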
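The if-depth used to bucket formulae in Table 2 can be approximated directly from the formula text. The following Python sketch is an illustration under stated assumptions rather than the parser used in this work: the function name if_depth is hypothetical, the scan only tracks parentheses and double-quoted string literals, and it counts the IF calls that are open at the same time, which corresponds to the largest number of IF functions along any path of the AST.

```python
def if_depth(formula: str) -> int:
    """Return the maximum number of simultaneously open IF( calls."""
    stack = []          # one flag per open parenthesis: was it opened by IF?
    depth = best = 0
    token = ""          # identifier characters seen just before a "("
    in_string = False
    for ch in formula:
        if ch == '"':
            in_string = not in_string   # a doubled "" toggles twice, which is harmless
            token = ""
        elif in_string:
            continue                    # ignore everything inside string literals
        elif ch == "(":
            is_if = token.upper() == "IF"
            stack.append(is_if)
            depth += is_if
            best = max(best, depth)
            token = ""
        elif ch == ")":
            if stack:
                depth -= stack.pop()
            token = ""
        elif ch.isalnum() or ch in "._":
            token += ch
        else:
            token = ""
    return best


# The formula from footnote 3, written with plain ASCII quotes.
print(if_depth('IF(IF(L1>=F$5,L1),IF(L1<=F$6,L1,""),"")'))  # prints 2
```

Under this definition a formula counts as a nested-IF when the result is greater than 1; functions such as IFS or SUMIF are not counted because only the exact name IF is matched.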
3 APPROACH

3.1 Overview
By analyzing the AST structure of each formula, our approach identifies optimizable nested-IF expressions and performs refactoring by replacing basic-level and counter-intuitive syntax with non-redundant and high-level syntax. The major rationale behind using AST is the desirable structural mapping between AST and nested-IF, as follows. An IF function typically contains three parts: 1) condition, 2) true-branch expression, and 3) false-branch expression. Therefore, the ASTs of nested-IF expressions are binary trees, with the true- and false-branch expressions being the two child nodes of the condition node. Consequently, with AST, it is easy to detect and locate nested-IF in a formula as well as convenient to conduct further analysis based on the tree structure.
In this section, we first introduce a high-level overview of our 3-step algorithm framework, followed by detailed introductions to the two key algorithms for redundancy removal and syntax reassembling, respectively.
Step 1: AST generation. We parse each formula and generate its AST to support the subsequent analysis. AST is a tree representation of the abstract syntactic structure of source code written in a programming language. In spreadsheet related research, AST is usually adopted to indicate formula complexity [19]. The larger the depth (height) of the AST, the higher the complexity of the formula. Based on the AST of the formula, we then traverse the AST and calculate the if-depth. Specifically, along each path of the AST, we record the number of IF functions, and regard the largest one across all paths as the if-depth of the formula. For example, the if-depth of formula IF(condition1, IF(condition2, value1, value2), IF(condition3, IF(condition4, value3, value4), value5)) is 3.
A nested-IF is identified in a formula when its if-depth is greater than 1, and will be passed to the subsequent steps for refactoring analysis. Otherwise, if the if-depth equals 0 (i.e., no IF in this formula) or 1 (i.e., no nested-IF in this formula), our algorithm will bypass the formula directly.
Step 2: Redundancy removal. An IF expression can essentially be mapped to an if-else branching statement in professional programming. Once the condition on some node becomes deterministic due to its preceding evaluation at some ancestor node on the AST, it becomes a redundant condition and one of its child branches must be dead code. Such redundant conditions are spreadsheet smells that require removal, since they introduce unnecessary complications to the spreadsheet data and thus may confuse end users. We conduct such redundancy removal first, because its existence may also obscure the AST structure from well understood patterns and thus have a negative impact on our pattern matching for syntax reassembling. More details of this step can be found in Section 3.2.
Step 3: Syntax reassembling. We have observed another typical smell in real-world spreadsheets, where single and higher-level semantics are often fragmented by end users into lower-level syntax pieces with nested-IFs. In fact, for such semantics there are concise and easily understood forms in spreadsheet systems with built-in functions. The goal of this step is to conduct reverse inference against such a smell, i.e., to recognize and reassemble such semantic-fragmented AST regions into their more concise forms via pattern matching and replacement. In this paper, we have manually identified 9 patterns for 9 major types of semantics respectively. These patterns are summarized based on our case analysis. First, we sampled around 100 spreadsheets from the large-scale corpus. Second, we manually studied the samples and came up with the patterns by summarization and abstraction, combined with our own knowledge. Each pattern corresponds to a spreadsheet built-in function⁸. We present the name, explanation, and examples of each alternative function in Table 3. As of the writing of this paper, there might be other function candidates that remain outside our knowledge. Nonetheless, our proposed algorithm framework should be extensible for easy incorporation of new patterns, and we do plan to continue related study in the future accordingly. More details of this step can be found in Section 3.3.
⁸ Most mainstream spreadsheet tools such as Excel and Google Sheets support these functions.

3.2 Redundancy Removal
In this section, we introduce how we identify and remove redundant conditions in a formula (Step 2). The procedure is presented in Algorithm 1, with the help of an example flow in Figure 2.
1) Nested-IF expression extraction. First, we extract outmost nested-IF expressions from each formula (see Line 4 in Algorithm 1, function getParentIfList). By outmost we mean the highest hierarchy in a nested branching logic or on an AST. For example, as shown in Figure 2, for formula SUM(IF(C1, V1, IF(!C1, V2, IF(C2, V3, V4))), V5), there is only one target nested-IF expression, the outmost IF expression: IF(C1, V1, IF(!C1, V2, IF(C2, V3, V4))); for formula SUM(IF(C1, V1, V2), IF(C2, V3, IF(!C2, V4, V5)), IF(C3, V5, IF(IF(C4, V6, V7)))), there are two target nested-IF expressions: IF(C2, V3, IF(!C2, V4, V5)) and IF(C3, V5, IF(IF(C4, V6, V7))). Please note that the nested-IF targets to be analyzed may also exist in predicates of a condition node, and we also extract such IF expressions. For example, IF(C1, V1, IF(!C1, V2, V3)) is also extracted from the condition part of formula IF(IF(C1, V1, IF(!C1, V2, V3)) = V1, V3, IF(!C1, V4, V5)). By doing so, we do not miss the chance of optimizing the nested-IF at condition parts. Moreover, in case the nested-IF is reduced to a simple predicate, this would potentially increase the chance of eventually optimizing the outmost nested-IF.
2) Branch collection. Based on the AST of each extracted nested-IF, we create a dictionary dicConBranch as the key structure to help detect and remove redundant logic (function getDicConBranch in Line 7). As shown in Figure 2, for each entry in the dictionary, its key is the condition of an AST node such as C1 or C2; the dBranchList value (Line 10) stores a tuple of two AST sub-trees corresponding to the true and false branches respectively. In addition, each entry also has a nBranchList value (Line 11) for the negation of the key condition such as !C1, which stores the tuple of true and false branches accordingly. The dictionary is constructed by visiting each condition node on the AST. When the same condition (or negation) is hit multiple times, the AST sub-tree tuples at each hitting site are appended to the dBranchList (or nBranchList).
3) Redundancy identification and removal. Intuitively, if any entry stores more than 1 tuple in dBranchList and nBranchList collectively (Line 12, 24), it indicates the existence of redundant branches on the AST about the condition at key. We iterate such inspection
against dicConBranch to detect and remove redundancies. Each detected redundancy site corresponds to one redundant IF expression that can be replaced with either the true branch (the condition is deterministic as true) or the false branch (the condition is deterministic as false). Thus, under each situation, we generate the redundant IF expression according to the condition and its branch list (Lines 15 and 26) and make the replacement. For example, as the example in Figure 2 shows, the nBranchList of key C1 is not null, indicating that IF expression IF(!C1, V2, IF(C2, V3, V4)) is redundant. Since this expression lies in the false branch of condition C1, condition !C1 is deterministic as true. Therefore, we remove condition !C1 and its false branch, and only keep the true branch V2. As a result, the original formula becomes SUM(IF(C1, V1, V2), V5). Such iteration repeats until no redundancy is detected.

Algorithm 1: Function: removeRedun
Input: Fm: the current nested-IF formula
Input: AST: the AST of Fm
Output: Fm: the new formula
 1 containRedun ← TRUE
 2 while containRedun do
 3     containRedun ← FALSE
 4     ifList ← getParentIfList(AST, Fm)
 5     for each ifexp in ifList do
 6         ifTree ← generateAST(ifexp)
 7         dicConBranch ← getDicConBranch(ifTree)
 8         for each condition in dicConBranch do
 9             branchList ← dicConBranch.get(condition)
10             dBranchList ← branchList.directpart
11             nBranchList ← branchList.negativepart
12             if dBranchList.len > 2 then
13                 containRedun ← TRUE // Redundancy exists.
14                 redunIFList ← generateRedunList(condition, dBranchList)
15                 for each redunIF in redunIFList do
16                     if redunIF in condition.truebranch then
17                         Fm ← Fm.replace(redunIF, redunIF.truebranch)
18                     else
19                         Fm ← Fm.replace(redunIF, redunIF.falsebranch)
20                     end
21                 end
22             end
23             if nBranchList.len > 0 then
24                 containRedun ← TRUE // Redundancy exists.
25                 redunIFList ← generateRedunList(condition, nBranchList)
26                 for each redunIF in redunIFList do
27                     if redunIF in condition.falsebranch then
28                         Fm ← Fm.replace(redunIF, redunIF.truebranch)
29                     else
30                         Fm ← Fm.replace(redunIF, redunIF.falsebranch)
31                     end
32                 end
33             end
34             AST ← generateAST(Fm)
35         end
36     end
37 end
38 return Fm

3.3 Syntax Reassembling
After removing redundancies, if the resultant formula still contains nested-IF expressions, in this third step we further analyze the AST to detect and reassemble fragmented semantics into built-in functions as listed in Table 3.

3.3.1 General Procedure. In general, this step follows a paradigm of iterative pattern-matching and replacement. For each remaining nested-IF after step 2, we further construct a threePartList as the key structure to facilitate pattern matching. Each threePartList consists of three lists for condition, true branch, and false branch, respectively. For example, for expression IF(C1, IF(C2, IF(C3, V1, V2), V2), V2), the condition part is [C1, C2, C3], the true branch part is [IF(C2, IF(C3, V1, V2), V2), IF(C3, V1, V2), V1], and the false part is [V2, V2, V2].
Subsequently, based on threePartList, we infer the semantic of the IF expression and check if it matches some spreadsheet functions. If yes, we transform the formula using the matched function, and replace the nested-IF expression with the transformed one. Following the order shown in Table 3, we probe each pattern in sequence. Once a pattern is matched, the probe jumps to the next iteration from the first pattern again. This iteration terminates with zero pattern match. Note that the patterns CHOOSE/MATCH/LOOKUP have higher priority than the pattern IFS during the matching, because they are more comprehensible and enable more concise expressions. In the future, we may consider providing all alternative refactoring recommendations for end users to choose from.

3.3.2 Alternative Functions. In this paper, we have identified 7 categories of patterns corresponding to 7 types of spreadsheet functions. In this sub-section, we explain the patterns in detail by text description, AST, and examples. The basic patterns (with if-depth of 5 in all examples) are illustrated in Figure 3. Based on the specific structures of each pattern, their pattern matching algorithms share the preceding general procedure and differ in minor details.
(1) AND pattern. If a nested-IF expression satisfies the following conditions, we infer it has the semantic of the AND function, as shown in Figure 3: first, the false branches of each condition are all identical; second, the true branches of each condition are all IF expressions, except for the last true value (i.e., V1). Such expressions can be replaced with IF(AND(conditionlist), truebranch, falsebranch). For example, the expression with the first AST in Figure 3 can be replaced with IF(AND(C1, C2, C3, C4), V1, V2).
(2) OR pattern. If a nested-IF expression satisfies the following conditions, we infer that it actually has the semantic of the OR function, as shown in the second AST of Figure 3: first, the true branches of each condition are all identical; second, the false branches of each condition are all IF expressions, except for the last false value (i.e., V2). Such kind of expressions can be replaced with
Figure 2: Redundancy removal process. The figure traces the formula SUM(IF(C1, V1, IF(!C1, V2, IF(C2, V3, V4))), V5) through the extraction of IF(C1, V1, IF(!C1, V2, IF(C2, V3, V4))), its dicConBranch table (columns Key, dBranchList, nBranchList) and its AST, to the result SUM(IF(C1, V1, V2), V5).
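The flow that Figure 2 depicts can be sketched in a few lines. The Python below is a simplified recursive variant written for illustration, not a transcription of Algorithm 1: it assumes a hypothetical tuple encoding of the AST, ("IF", condition, true_branch, false_branch), treats conditions as opaque strings, and recognises a negation only when it is written with a leading "!". Instead of building dicConBranch, it carries the truth values already fixed by ancestor conditions down the tree and replaces any IF whose condition is thereby determined.

```python
def negate(cond):
    """Textual negation for the assumed "!C" convention."""
    return cond[1:] if cond.startswith("!") else "!" + cond


def remove_redundancy(node, known=None):
    """Drop IF nodes whose condition is already decided by an ancestor."""
    known = known or {}
    if not isinstance(node, tuple):
        return node                      # leaf: a value or reference name
    head, *args = node
    if head != "IF":
        # Any other function: simplify its arguments with the same facts.
        return (head, *(remove_redundancy(a, known) for a in args))
    cond, when_true, when_false = args
    if cond in known:
        # The condition is deterministic here, so one branch is dead code.
        branch = when_true if known[cond] else when_false
        return remove_redundancy(branch, known)
    true_facts = {**known, cond: True, negate(cond): False}
    false_facts = {**known, cond: False, negate(cond): True}
    return ("IF", cond,
            remove_redundancy(when_true, true_facts),
            remove_redundancy(when_false, false_facts))


def to_formula(node):
    """Render the tuple encoding back into formula text."""
    if not isinstance(node, tuple):
        return node
    head, *args = node
    return f"{head}({', '.join(to_formula(a) for a in args)})"


# The running example of Figure 2.
formula = ("SUM",
           ("IF", "C1", "V1", ("IF", "!C1", "V2", ("IF", "C2", "V3", "V4"))),
           "V5")
result = remove_redundancy(formula)
print(to_formula(result))  # prints SUM(IF(C1, V1, V2), V5)
```

A single top-down pass suffices in this sketch because each branch is simplified with the facts established above it; nested-IFs that sit inside a condition, which the approach also extracts, are outside what this encoding handles.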
Figure 3: Typical AST of function AND, OR, CHOOSE, MATCH, LOOKUP, MAX, MIN, and IFS. str_i represents a string; n_i represents a number; r_i represents a reference (0 < i < 5).

IF(OR(conditionlist), truevalue, falsevalue). For example, the expression with the second AST in Figure 3 can be replaced with IF(OR(C1, C2, C3, C4), V1, V2).
(3) CHOOSE pattern. An IF expression that matches the CHOOSE pattern should have the following features. First, all the conditions are number equality evaluations, and the corresponding numbers could form an arithmetic progression, which can be translated into a natural sequence. Second, the false branches of each condition are all IF expressions, except for the last false value. Third, the true branch values are all strings.

Table 3: The advanced functions

Name      Explanation, followed by a transformation example
AND       Returns TRUE if all of the arguments evaluate to TRUE.
          IF(C1, IF(C2, IF(C3, V1, V2), V2), V2) → IF(AND(C1, C2, C3), V1, V2)
OR        Returns TRUE if any argument evaluates to TRUE.
          IF(C1, V1, IF(C2, V1, IF(C3, V1, V2))) → IF(OR(C1, C2, C3), V1, V2)
CHOOSE    Returns a value from a list using a given number position as the index.
          IF(A1 = 1, str1, IF(A1 = 2, str2, IF(A1 = 3, str3))) → CHOOSE(A1, str1, str2, str3)
MATCH     Returns a number representing a position in an array.
          IF(A1 = str1, 1, IF(A1 = str2, 2, IF(A1 = str3, 3))) → MATCH(A1, {str1, str2, str3}, 0)
LOOKUP    Perform a vertical/horizontal lookup (corresponding to function VLOOKUP and HLOOKUP).
          IF(A1 = C1, D1, IF(A1 = C2, D2, IF(A1 = C3, D3, IF(A1 = C4, D4)))) → VLOOKUP(A1, C1:D4, 2, FALSE)

For example, as shown in Table 3, expression IF(A1 = C1, D1, IF(A1 = C2, D2, IF(A1 = C3, D3, IF(A1 = C4, D4)))) can be transformed into VLOOKUP(A1, C1:D4, 2, FALSE). The above patterns suit the circumstance that the values looked up can be found directly in other cells. For those that cannot be found directly, in this paper, we propose to create new tables in Excel to ease the use of the lookup function. Consequently, as long as the conditions are evaluating the value of a certain cell (doing look up based on this cell), we can perform the transformation with the LOOKUP function. For example, for expression IF(A1 = V1, V2, IF(A1 = V3, V4, IF(A1 = V5, V6, IF(A1 = V7, V8)))), we create a table ranged (E1:F4), where E1 = V1, F1 = V2, E2 = V3, F2 = V4, E3 = V5, F3 = V6, E4 = V7, F4 = V8.
4,E3 way,
= V the
5, F 3 = V 6,E4 = V 7, F 4 = V 8. In this way, the
1, str 1, I F (A1 = 2, str 2, I F (A1 = 3, str 3, I F (A1 = 4, str 4)))) could P C1 expression can be transformed into V LOOKU P(A1, E1 : F 4,
ALS E) 2).
1, str 1, I F (A1 = 2, str 2, I F (A1 = the3,first
str 3, I F (A1
1, str = 1,4,of 4))))
Fa(A1 = could
2, str expression
2, I CHOOSE(A1,
F (A1 = value
3, str 3, can be transformed
F1,(A1 into P(A1, E1 : can F 4, 2).
str 2,=str 4,3, 4)))) could V LOOKUexpression be transformed
column/row table and returning the in I the pattern.into : F 4, 2).
str
be Itransformed into str strstr 4); expression (6) MAX/Min AnV LOOKU
IF expression P(A1, E1 that matches t
be transformed into CHOOSE(A1, 1, str 2, str
be 3, 4); expression
transformed into An IF expression (6) thatMAX/Min
matches the
2, I1, str 2,=str 6,3, 3,4); expression An IF have expression that matches the �
same strrow/column (6) MAX/Min pattern.
IinF (A1
strthe position.
= 2, str
index 1,CHOOSE(A1,
I F (A1 = 4, strstr F (A1 strstr I F (A1 = 8, str 4)))) MAX or MINpattern. pa�ern should the following features.
F (A1 = 2, str 1, I F (A1 = 4, str 2,Return
MAX/MIN I F (A1 the
= 6,largest/smallest
strI F3,(A1
I F (A1= 2,=be 8,value
strI F4))))
1, (A1from = 4, MAX I F orsetMIN
2,CHOOSE(A1/2,
a supplied (A1 =of6,nu- pa�ern
3,str should
(A1
(A >=2,B, 8, have 4)))) the following MAX features.
or MIN pa�ern �e should have the following features. �e
could str transformed str
into str II FF 1, str str str
A, str 4). AX (A,
3,B)→M B)condition should do the comparison of two parts, e.g., A <
could be transformed into CHOOSE(A1/2, 1, str be
strcould 2, str 3, str 4). into CHOOSE(A1/2,
transformed condition should dostrthe
1, 2, comparison
3, 4). of twoconditionparts, e.g.,
A <= B, A > B, the
shouldA < do comparison of branch
two parts,
meric values. and e.g.,
str str str B, A < bran B,
(4) MATCH pattern. An IF expression that matches the MATCH A >= B. �e true the false
(4) MATCH pattern. IFS An IF expression
Run multiple thattests
matches
(4)pa�ern
MATCH
and the MATCH
return pattern.
should a value have AntheIFfollowing
expression
corresponding
A <= B, Ato>the that
B, Amatches
features. I>=F (C1, B. the
First, V�e 1,
all Itrue
MATCHF (C2,
the
branch
V 2, I FA
condi-
and
(C3, the
VB,3,false
<=should AI F> branch
(C4,
be B, A 4))))
these
V twoB.parts
>= �e respectively.
true branch and For→ the false expressio
example, branch
pa�ern should have the following features. First, all
pa�ern shouldthe condi-
have the should be
following these two
features. I Fparts
First, respectively.
allthe the 1,condi- V 2,For C3,example,
should beVexpressions
these
first TRUE result. tions are string equality evaluations. Second, S (C1, V true C2,branch V 3, C4,I F (A <4) B, A,twoB), Iparts
F (A <= respectively.
B, A, B), I FFor (B >example,
A, A, B), expressions
I F (B >= A, A,
tions are string equality evaluations. Second, the
tionsvalues aretruestring branch I F (A < B, A, Second, B), I F (A the I F (B > A, A, I<F (B
are allequality evaluations. true branch (Acan F (A
<= B, A, B), I FB), >=B),
B,allA, A,IA, B)<= B, A, B), I F (B > A, A, B), I F (B >= A, A, B)
numbers that could form a arithmetic progression, be transformed into MI N (A, B); expressions I F (A > B, A,
values are all numbers that could form a arithmetic values areprogression,
all thatcan all be transformed into MI N (A, B); expressions F (Atransformed
cannumbers could
into aform a arithmetic
sequences.progression, �ird, the false can all
Ibe
>= B, A, B), Iinto (A,B),
B);I expressions
F (B <= A, A,I FB)(Acan
> B, A, B),
which be translated natural I F (A F (B MI < A, N A, > B,
allA,beB),tran
which can be translated into a natural sequences. which �ird,
can bethe false into
translated I Fa(Anatural B), I F (B <
sequences. �ird,
A, A,the F (B
Ifalse can ofB,all beB),trans-
I F (OR(conditionlist), of each For example, the the I F (Aformed I F (B (A, (B expressions,
<= A, A, B) can all be trans-
>= B, A, ex- Third, B),the false <=branches
A, A, B) >= each
A,into condition < A, are B),all
I F IF
truevalue, branchesf alsevalue). condition are all IF expressions, except for MAX B).A,
branches of eachpressionconditionwith are all
theIF second
expressions,
branches except
ofFigurefor the
each condition beformed
are into
allexpression
IF MAX (A, B).except
expressions,
AST lastinfalse value. 3 can For example, replaced with I F (A1except = str 1,for 1, Ithe
for Fthe
(A1 last = formedfalse value. intoIFS
(7) MAX For (A,
pattern.example,
B). �e past expression
pa�ern is Ithe F (A1
IFS = pa�ern, which is t
ast false value. For example,C2, expression
C3, C4), VI F1,(A1 last =false
str 1, 1, I F (A1 =example, expression �e past 1,1,pa�ern is= the IFS2,pa�ern, which= stris3,one.
the
str 2, value.
2, I F (A1For I F (A1 = str str1, I FI(A1 �e Fpast pa�ern 4,is4))))
the IFS pa�ern,
be ofwhicheach is the
I F (OR(C1, V 2). (7) IFS pattern. 1, F (A1 = str (7)most
2, I FIFS
(A1 pattern. 3, I As (A1 = as could
= str 3, 3, I F (A1 = str 4, 4)))) could be transformed into �exible long strthe false branches conditio
tr 2, 2, I F (A1 = str 3, 3, I F (A1 = str 4, 4)))) could strbe transformed
2, I F (A1 = strinto
2,CHOOSE(A1, 3, 3, I F (A1 most
= str �exible
4, 4)))) one. As
could be long as the false
transformed into branches most of�exible
each conditionone. As long as the false branches of each condition
(3) CHOOSE pattern. A nested-IF expression str 1, str 2, str that
3, strmatches
4, 0); expressiontransformed I F (A1 = str 1, 2, into MATCH are all (A1, IF str 1, str 2, str
expressions 3, str 4,for0);the
(except expression
last one), the expression c
CHOOSE(A1, str 1, str 2, str 3, str 4, 0); expression I F (A1 = strstr 1,1,2,str 2, str 3,are all
4, IF expressionsI F(except for1,the last one), arethe expression can (except for the last one), the expression can
the semantic of CHOOSE
CHOOSE(A1,I F (A1 [20]
function = strshould
2, 4, I F (A1 have = strthe
str 3,0);
6, Iexpression
F (A1 = str 4, 8))))
following
(A1 =could
I F (A1
str
=
2, trans-
be
str 1, 2,
all
beIFtransformed
expressions with the IFS function, as shown in Table 3.
F (A1 = str 2, 4, I F (A1 = str 3, 6, I F (A1 = str 4,I F8)))) (A1formedcould be4,trans-
= str 2,into I F2 (A1 = str 3, be6,transformed
(A11,=strstr with the IFS function, as shown in Table 3. with the IFS function, as shown in Table 3.
features. First,
I F str 2,4,str8))))
3, str could
I4, 0). be=trans- be transformed
Except Ifor the=above 4,pa�erns thatbe match the existing sprea
str 1,all
strthe conditions are 2number equality evaluations, F (A1 that2,match4, I F (A1 =existing 3, 6,the F (A1 8))))that
could trans-
⇤ CHOOSE(A1, str str str
ormed into 2 ⇤ CHOOSE(A1, 2, str 3,formed
str 4, 0).into Except 2,forstrthe
1,IFstrexpression 3, strabove pa�erns
4, 0).matches the
Except for spread-
above pa�erns match the that
existing
with the corresponding numbers (5) LOOKUP
forming
⇤ CHOOSE(A1,
an pattern.
arithmetic
str
Anprogression, that formed the
into LOOKUP
2 ∗ MATCH sheet
(A1, functions,
str 1, str 2, we
str found
3, str 4, another
0). pa�ern doesspread-
not mat
(5) LOOKUP pattern. An IF expression that matches (5)(VLOOKUP
LOOKUP the LOOKUPpattern. An sheet
IF functions,
expression that wematches
found the anotherLOOKUP pa�ern sheet that does functions,not match we found another pa�ern that does not match
or HLOOKUP) pa�ern should have the following fea- any function, but can also be transformed accordingly to remo
VLOOKUP or HLOOKUP) which canpa�ern be translated
should have into
(VLOOKUP natural
the orsequences.
following fea-
HLOOKUP) Second,
any function,
pa�ern the
should falsebut
have can the also (5)transformed
be
following LOOKUPfea- pattern. Atonested-IF
accordingly remove expression that matches
tures. First, all the conditions are reference value equality evalu- any function, nested IF. but We can also
all this 9
be transformed
pa�ern the “USELESS” accordinglypa�ern. to For
removeexamp
tures. First, all the branches
conditionsof each condition
are reference are all
value
tures. IF expressions,
equality
First, all evalu-
the conditions except
nestedfor IF.theWelast all this pa�ernthe semantic
the evalu- of VLOOKUP/HLOOKUP
pa�ern. IF.For Weexample, pattern [22, 23] should
arearecellreference value equality nested all Ithis
F (A pa�ern
= B, A,the pa�ern. ForB.example,
“USELESS” “USELESS”
ations. �e references neighbours vertically/horizontally. expression B) actually equals to A or We put t
ations. �e references false value.
are cellThird, the true
neighbours branch
ations. �e
Second, values
vertically/horizontally.
all are
the all
references true strings.
are cell
branches Forare
expression example,
neighbours references
= B, A,have
I F (Avertically/horizontally.
B) actually
that referred the following
equals to features.
to other expression
A or B. We
checking IFirst,
F put
(A
order all
=theB,ofthe B)conditions
A,this actually
pa�er aretheequality
equals
before toIFS or B. We put the
A pa�ern.
Second, all the true branches
I F (A1 are1,references
= 1, str I F (A1 = 2, that
Second, referred
2, I all
strcells. (A1
F �e theto=true
other
3, branches
str 3, I F checking
(A1 are = 4, order
references
str 4)))) ofthat
thisreferred
pa�er
evaluations before
to other theof IFS pa�ern.
checking
reference order
values. of
The this pa�er
references before
are the
cell IFS pa�ern.
neighbors
references are cell neighbours vertically/horizontally, and
cells. �e references couldare cell neighbours vertically/horizontally,
be transformed intocells. �e references andare cell2,neighbours
1, str str 3, str 4);vertically/horizontally,
ex- and
have
CHOOSE(A1, the samestrcolumns/rows as the referencesvertically/horizontally.
in the conditions. Second, all the true branches are refer-
have the same columns/rows
pression I F (A1 as the= 2,references
str 1,haveI F (A1inthethe= conditions.
same
4, columns/rows
2, (A1 = 6, as the
3, references
(A1 = in the
ences conditions.
that referred to other cells. The references are cell neighbors
�ird, the false branches of each condition are all IF expressions,
str I F str I F 4 RESEARCH QUESTIONS
�ird, the false branches of each condition are �ird, all IF theexpressions,
false branches of 4eachRESEARCH condition are allQUESTIONS IF expressions, 4 RESEARCH QUESTIONS
8, str 4)))) could be transformed into CHOOSE(A1/2, str 1, str 2, str 3, vertically/horizontally, and have the same
In this paper, we would like to investigate columns/rows as the
the following four
str 4). In this paper, we wouldreferences like to investigate the
in the conditions. following
In this paper, four
we
Third, re- false
would
the like to investigate
branches of each the following four re-
con-
search questions.
(4) MATCH pattern. A nested-IF expression that search
matchesquestions. the dition are all IF expressions, search questions. except for the last false value. For
5
semantic of MATCH function [21] should have5 the following fea- example, as shown5 in Table 3, expression I F (A1 = C1, D1, I F (A1 =
tures. First, all the conditions are string equality evaluations. Second, C2, D2, I F (A1 = C3, D3, I F (A1 = C4, D4)))) can be transformed into
the true branch values are all numbers that could form an arith- V LOOKU P(A1, C1 : D4, 2, FALSE).
metic progression, which can be translated into a natural sequence. 9 The “V” and “H” refer to “vertical” and “horizontal” respectively.
5
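The structural conditions of the CHOOSE pattern can be sketched in code. The following Python fragment is only an illustration under stated assumptions, not the AST-based implementation of this paper: it assumes the nested IF has already been flattened into a list of (number, result) pairs taken from conditions of the form `cell = number`, that the numbers are integer literals, and the helper name `to_choose` is hypothetical. It checks that the compared numbers form an arithmetic progression and rescales the index so that it runs over 1, 2, 3, and so on.

```python
def to_choose(cell, branches):
    """Return a CHOOSE formula for a flattened IF chain, or None.

    `branches` holds one (number, result) pair per nesting level of
    IF(cell = number, result, ...), in nesting order (hypothetical input format).
    """
    if len(branches) < 2:
        return None  # a single IF is not nested
    numbers = [number for number, _ in branches]
    step = numbers[1] - numbers[0]
    if step <= 0:
        return None  # this sketch only handles increasing progressions
    # Every consecutive difference must equal the common step.
    if any(b - a != step for a, b in zip(numbers, numbers[1:])):
        return None
    first = numbers[0]
    if first == step:
        # step, 2*step, 3*step, ... maps onto 1, 2, 3, ... by a division
        index = cell if step == 1 else f"{cell}/{step}"
    else:
        # general case: shift to zero, scale, then start counting at 1
        index = f"({cell}-{first})/{step}+1"
    results = ", ".join(result for _, result in branches)
    return f"CHOOSE({index}, {results})"


print(to_choose("A1", [(1, "str1"), (2, "str2"), (3, "str3"), (4, "str4")]))
print(to_choose("A1", [(2, "str1"), (4, "str2"), (6, "str3"), (8, "str4")]))
```

The sketch covers only the structural check; how the resulting formula behaves for cell values outside the listed numbers is not examined here.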
The above patterns suit the circumstance that the values looked up can be found directly in other cells. For those that cannot be found directly, in this paper, we propose creating new tables in the worksheets to ease the lookup. Consequently, as long as the conditions evaluate the value of a specific cell (doing the lookup based on this cell), we can perform the transformation with the LOOKUP function. For example, for expression IF(A1 = V1, V2, IF(A1 = V3, V4, IF(A1 = V5, V6, IF(A1 = V7, V8)))), we create a table in the range E1:F4, where E1 = V1, F1 = V2, E2 = V3, F2 = V4, E3 = V5, F3 = V6, E4 = V7, F4 = V8. In this way, the expression can be transformed into VLOOKUP(A1, E1:F4, 2).
(6) MAX/MIN pattern. A nested-IF expression that matches the semantics of the MAX or MIN function should have the following features. The condition should compare two parts, e.g., A < B, A <= B, A > B, A >= B. The true branch and the false branch should be these two parts respectively. For example, expressions IF(A < B, A, B), IF(A <= B, A, B), IF(B > A, A, B), IF(B >= A, A, B) can all be transformed into MIN(A, B); expressions IF(A > B, A, B), IF(A >= B, A, B), IF(B < A, A, B), IF(B <= A, A, B) can all be transformed into MAX(A, B).
(7) IFS pattern. The IFS pattern has the fewest matching conditions. As long as the false branches are IF expressions (except for the leaves), the expression can be transformed with the IFS function, as shown in Table 3. Note that this pattern makes the fewest syntax changes compared to the original syntax, and the number of conditions remains the same. However, the IFS function has the advantage of conciseness and readability, and there is also no need to worry about the IF statements and parentheses [24]. Additionally, there is no need to supply a value if the condition is false (unlike the nested-IF expression, which needs another IF expression to serve as the false branch) [25]. Our survey of end users reflects these advantages as well (see Section 4.5).
(8) USELESS pattern. Besides the above patterns that match existing spreadsheet functions, we find another pattern that does not match any function, but can also be transformed to remove the nested IF. We call this pattern the “USELESS” pattern. For example, expression IF(A = B, A, B) actually equals A or B. We check this pattern just before the IFS pattern.
For ease of presentation, we refer to the condition redundancy, the USELESS pattern, and the 7 types of advanced functions all as “patterns”.
4 EVALUATION
4.1 Research Questions
In this paper, we investigate the following four research questions.
RQ1: Are the refactored formulae functionally equal to the original ones? This question aims to check the correctness of our refactoring approach.
RQ2: What is the refactor coverage of our approach? This question aims to check the applicability of our approach: how many nested-IF formulae can our approach handle.
RQ3: What is the refactor effectiveness of our approach? This question aims to check whether our refactorings relieve the nested-IF smells: how much can our approach decrease the if-depth of nested-IF formulae.
RQ4: Do end users prefer the refactored formulae? This question aims to find out the necessity of refactoring from the perspective of end users, as well as whether end users prefer the refactored formulae our approach provides.
4.2 Refactor Equality
To answer the first research question, we conduct manual inspection as well as a comparison of formula calculation results.
For manual inspection, considering that there are over 10 million formulae and it is impossible to check the refactorings one by one, we randomly select 2000 formula pairs $\langle F_o, F_r \rangle$ ($F_o$ represents the original formula, $F_r$ represents the refactored formula). The first three authors then check each pair and record their judgements.
For formula value comparison, we scan all Excel files and replace the original nested-IF formulae with the refactored ones. For each formula pair $\langle F_o, F_r \rangle$, we get a corresponding value pair $\langle V_o, V_r \rangle$. We then record whether $V_o$ equals $V_r$. To automate the above process, we use ClosedXML, which is a powerful .NET library enabling users to create and modify Excel files.
Our experiment results indicate that both manual inspection and value comparison show 100% correctness of the refactored formulae. This result reveals the reliability of our refactoring results.
4.3 Refactor Coverage
To answer the second research question, we present the total proportion of refactored formulae in Section 4.3.1, the proportion of formulae handled by each pattern in Section 4.3.2, and the proportion of formulae handled by more than one pattern in Section 4.3.3.
4.3.1 Total Coverage. For the original total set of nested-IF formulae $O_{total}$ and the original unique set $O_{unique}$, we conduct automatic refactoring following the refactoring procedure introduced in Section 3. Correspondingly, we get the refactored set $R_{total}$ (out of $O_{total}$) and $R_{unique}$ (out of $O_{unique}$). The refactor coverage can then be calculated as $\frac{\#R_{total}}{\#O_{total}} \times 100\%$ and $\frac{\#R_{unique}}{\#O_{unique}} \times 100\%$ (# represents the number).
Table 4 presents the refactor coverage results. The first row is for the total formula set, the second row is for the unique formula set. From this table, our approach is able to handle almost all the nested-IF formulae, with a refactor coverage of over 99%.
We also observe that there are around 43,000 nested-IF formulae that cannot be automatically refactored. We analyze them and find that they can be categorized into two types. In the first type, the condition part of the outermost IF expression contains another IF expression and does not match our patterns even if treated as a whole, such as IF(AND(IFsubexpression1, IFsubexpression2) = TRUE, value1, value2). In the second type, although the inner IF expression lies in the branches of the outer expression, it is wrapped in other non-IF functions, and thus the AST is quite complex, such as IF(Condition, SUM(IFsubexpression1, IFsubexpression2), value).

Table 4: Refactor coverage

Formula Set   Original     Refactored   Refactor Coverage
Total         27,689,299   27,645,688   99.84%
Unique        19,260,407   19,243,407   99.91%
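As a quick arithmetic check, the coverage formula of Section 4.3.1 can be applied to the counts reported in Table 4. The short Python sketch below is only an illustration; the function name `refactor_coverage` is hypothetical and not part of the tooling described in this paper.

```python
def refactor_coverage(num_original, num_refactored):
    """Refactor coverage in percent: #R / #O * 100."""
    return num_refactored / num_original * 100


# Counts taken from Table 4.
total = refactor_coverage(27_689_299, 27_645_688)
unique = refactor_coverage(19_260_407, 19_243_407)

# Formulae of the total set that were not refactored.
not_refactored = 27_689_299 - 27_645_688

print(f"total: {total:.2f}%  unique: {unique:.2f}%  unrefactored: {not_refactored}")
```

The difference between the original and refactored totals corresponds to the roughly 43,000 formulae that could not be refactored automatically.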
4.3.2 Coverage of Each Single Pattern. We next investigate the coverage of each pattern. To do this, we assign each refactored formula a pattern list pList = [REDUN, AND, OR, CHOOSE, MATCH, LOOKUP, MAXMIN, USELESS, IFS] (REDUN represents redundancy) recording which patterns are adopted during the refactoring process. Because MAX and MIN are quite similar, we merge them into one MAXMIN. The adopted patterns are assigned a value of 1; the others are 0.
For example, for formula IF(condition1, value1, IF(cell1 > cell2, cell1, cell2)), our approach will refactor it into IF(condition1, value1, MAX(cell1, cell2)), and thus pList = [0, 0, 0, 0, 0, 0, 1, 0, 0]. Each formula has such a list. In this way, we get a matrix of all patterns over the total formula set, from which it is easy to calculate the proportion of the formulae that each pattern handles.
We present the final results in Table 5. Column “Total” shows the results of the total formula set. Column “Unique” shows the results of the unique formula set. Column “Coverage” shows the proportion of formulae that each pattern handles against all original formulae. The first row presents the results of redundant conditions, the remaining rows show the results of different alternative functions.

Table 5: Refactor coverage of each pattern

           Total                    Unique
Pattern    Refactored   Coverage    Refactored   Coverage
REDUN       1,766,803    6.39%       1,250,006    6.50%
AND         2,322,346    8.40%       2,206,296   11.47%
OR          4,169,992   15.08%       2,099,450   10.91%
CHOOSE        165,594    0.60%         141,695    0.74%
MATCH          23,254    0.08%           9,331    0.05%
LOOKUP      1,780,419    6.44%       1,637,564    8.51%
MAXMIN        234,960    0.85%         214,452    1.11%
USELESS        83,060    0.30%          69,912    0.36%
IFS        18,239,046   65.97%      12,520,577   65.06%

As shown in Table 5, around 6% of formulae contain redundant conditions. This is rather surprising, because redundant conditions are somewhat low-level spreadsheet smells that users should not introduce if they know the basic structure of IF expressions. The dead branches caused by redundant conditions are like dead code in traditional programming, and should definitely be removed. These results reflect the fact that end users may lack basic knowledge of programming, even in understanding conditional logic.
For the patterns, the AND, OR, and LOOKUP patterns each handle around 6%-15% of formulae, while the IFS pattern handles as much as 65% of formulae. The remaining patterns such as MATCH, CHOOSE, and USELESS each handle less than 2%. The high refactor coverage of the IFS pattern is not surprising, because as we introduced in Figure 3, IFS has the fewest conditions, which most nested-IF expressions match.
4.3.3 Coverage of Multi Patterns. As our approach repeatedly tries different patterns until no IF expressions can be removed, some formulae may be handled by more than one pattern during the repetition. In the pList (introduced in Section 4.3.2) of such a formula, more than one element has value “1”. To investigate the frequency of such circumstances, we record the formulae handled by multiple patterns.
The results are shown in Table 6. In summary, at most 3 patterns are applied to process a formula. 3.91%/4.40% of the total/unique set of refactored formulae are processed with more than one pattern.

Table 6: Number of formulae processed by multi-patterns

Pattern Num   Total               Unique
2             1,019,035 (3.69%)   785,156 (4.1%)
3                60,843 (0.22%)    60,825 (0.32%)
total         1,079,878 (3.91%)   845,981 (4.40%)

We also present the specific overlap of different patterns (of the total set), as shown in Figure 4. The figure is a Circos visualization (footnote 10). The pattern segment size reflects the scale of refactored formulae of each pattern (shown in Table 5). The lines between different segments are called cells. The thickness of these cells indicates the absolute number of overlapped formulae between different patterns. As shown in this figure, the IFS pattern overlaps with almost all other patterns, except AND, MATCH, and CHOOSE. This is because the IFS pattern has the fewest conditions and many formulae satisfy them (as the example in Section 2 shows). MATCH has no overlap with any pattern. The reason may be that formulae that match the MATCH pattern are usually very regular and have simple semantics.

Figure 4: Circos chart of the overlap between patterns.

10 http://mkweb.bcgsc.ca/tableviewer/visualize/
4.4 Refactor Effectiveness
To answer the third research question, we present the absolute if-depth reduction results in Section 4.4.1, the relative if-depth reduction (the depth reduction rate) in Section 4.4.2, and the final if-depth of refactored formulae in Section 4.4.3.
4.4.1 Absolute Depth Reduction. First, we would like to know how much if-depth our approach can remove from the refactored formulae. For each refactored formula pair $\langle F_o, F_r \rangle$, we parse $F_o$ and $F_r$ and calculate their respective if-depths $dep_o$ and $dep_r$. The depth reduction $DepReduce_{num}$ is then calculated by $DepReduce_{num} = dep_o - dep_r$.
Table 7 presents the results for the unique set. Row “Number” presents the number of formulae that were refactored with the corresponding depth reduction. For example, in the first cell, the number 7,885,404 means that among all the refactored formulae, 7,885,404 of them have an if-depth reduction of 1.
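To make the depth computation concrete, the following Python sketch estimates the if-depth of a formula string by tracking open parentheses, and then computes $DepReduce_{num}$ for one formula pair. It is an illustration under stated assumptions, not the AST-based parser used in this paper: it assumes well-formed formulae, and it takes a formula without IF to have depth 0 and a single, non-nested IF to have depth 1 (in line with the final depths of 0 and 1 discussed in Section 4.4.3). The names `if_depth` and `depth_reduction` are hypothetical.

```python
import re


def if_depth(formula):
    """Maximum number of IF calls nested inside one another."""
    # Blank out string literals so parentheses inside them are ignored.
    text = re.sub(r'"[^"]*"', '""', formula)
    stack = []  # one flag per open parenthesis: True if it belongs to an IF call
    depth = best = 0
    for match in re.finditer(r'([A-Za-z_][A-Za-z0-9_.]*)?\s*\(|\)', text):
        if match.group(0) == ")":
            if stack and stack.pop():
                depth -= 1
        else:
            # Only the exact name IF counts; IFS, SUMIF, IFERROR do not.
            is_if = (match.group(1) or "").upper() == "IF"
            stack.append(is_if)
            if is_if:
                depth += 1
                best = max(best, depth)
    return best


def depth_reduction(original, refactored):
    """DepReduce_num = dep_o - dep_r for one formula pair."""
    return if_depth(original) - if_depth(refactored)


f_o = "IF(C1, IF(C2, IF(C3, V1, V2), V2), V2)"
f_r = "IF(AND(C1, C2, C3), V1, V2)"
print(if_depth(f_o), if_depth(f_r), depth_reduction(f_o, f_r))
```

Dividing the result of `depth_reduction` by `if_depth(original)` would give the relative reduction used later in Section 4.4.2.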
Our approach is able to reduce the if-depth to various degrees, as Table 7 shows, indicating the effectiveness of our approach.
Most of the refactorings have a depth reduction of below 5. However, the absolute depth reduction results shown in this table depend on the number of original formulae with different if-depths. A formula with 2 nested-IF functions can have a depth reduction of 2 at most. To relieve this problem, we also present the relative depth reduction results in Section 4.4.2 as well as the final new if-depth of the refactored formulae in Section 4.4.3.

Table 7: Absolute number of if-depth reduction for Unique.

Depth Reduce   Number        Depth Reduce   Number
1              7,885,404     14             6,302
2              5,284,297     15             12,214
3              2,280,668     16             507
4              836,304       19             10,200
5              418,525       20             1
6              96,353        22             12
7              1,790,044     24             10,200
8              454,225       27             1,963
9              38,614        28             490
10             5,772         29             490
11             45,452        36             2
12             47,105        39             795
13             6,927         48             10,541

4.4.2 Relative Depth Reduction. As mentioned above, presenting only the absolute depth reduction may fail to reflect the real effectiveness. In this section, we also present the results of relative depth reduction: $DepReduce_{ratio} = DepReduce_{num} / dep_o$. For ease of presentation, we divide $DepReduce_{ratio}$ into four ranges: (0%, 25%] (footnote 11), (25%, 50%], (50%, 75%], and (75%, 100%]. The distribution of each range is presented in Figure 5.
From the figure, in general over half of the refactorings reduce the if-depth by over 50%. In particular, for the total and unique set of refactored formulae, most refactorings fall into the ranges of (25%, 50%] and (75%, 100%], indicating that our relative depth reduction results are good.

Figure 5: Distribution of different depth reduction ratios. (Two pie charts; the values read from them are listed below.)

Range          Total        Unique
(0%, 25%]      0            0
(25%, 50%]     10,231,301   7,884,846
(50%, 75%]     2,378,167    1,987,628
(75%, 100%]    15,036,220   9,370,933

11 0% < $DepReduce_{ratio}$ <= 25%
4.4.3 Final Depth After Refactoring. Besides the absolute and relative depth reduction results, we check whether the refactored formulae still have a large if-depth. To answer this question, we investigate the new if-depth $dep_r$ of each refactored formula $F_r$. The results are shown in Table 8.
From the table, most of the refactorings yield a new if-depth of 0 or 1 (footnote 12), indicating that our approach is able to completely remove the nested-IF functions in most formulae.

Table 8: Number of formulae with different new depth

Total                               Unique
New Depth   Formula Num             New Depth   Formula Num
0           13,906,460 (50.30%)     0            8,723,082 (45.33%)
1           13,717,158 (49.62%)     1           10,498,258 (54.56%)
2               22,070              2               22,067

12 An if-depth of 0 and an if-depth of 1 are equally effective in relieving nested-IF smells, because either of them avoids the smell completely.
4.5 Preference of End Users
In this section, we explore end users’ attitude towards the nested-IF formulae and the refactored formulae. Our survey includes 49 participants. 26 of them use spreadsheets quite often (over 5 times per week), 18 of them use spreadsheets sometimes (about once per week), 2 occasionally (about once per month), and 3 rarely use them. Additionally, about two thirds of them are employees of banks, law firms, telecommunication companies, and so on. The remaining ones are mainly college students.
Our survey includes seven parts. Each part contains one representative pattern (footnote 13). For each pattern, we present two functionally equivalent spreadsheet formulae F1 and F2 and several questions concerning the participants’ preferences. For example, for the AND pattern, we first describe a function like “If three conditions Condition1, Condition2, Condition3 are all TRUE, return Value1; otherwise return Value2.” Then, F1 and F2 are presented as F1: IF(Condition1, IF(Condition2, IF(Condition3, Value1, Value2), Value2), Value2), F2: IF(AND(Condition1, Condition2, Condition3), Value1, Value2). At the beginning, the participants are invited to have a look at a formula pair. Then, they are supposed to answer four questions as shown in the first column of Figure 6. Question Q1 investigates the participant’s basic knowledge. Q2 investigates the participant’s preference (between $F_o$ and $F_r$). Q3 and Q4 check whether a participant lacks the knowledge of manual refactoring and whether automatic refactoring is needed. As each of the 49 participants is supposed to finish the seven parts one by one, for each question we collect 49 * 7 = 343 answers. We call each answer a “case”.

Figure 6: Survey questions. (Flowchart. Start: “Please have a look at F1 and F2”. Q1: “Before this survey, do you know F1 equals to F2?” [Yes, I know / No, I don’t know]. Q2: “Do you prefer F1 or F2?” If F1, give a reason; if F2, tick your reason(s): A: F2 is shorter; B: F2 is less complex; C: F2 is easier to understand; D: F2 is not easy to make mistakes. Q3: “Before this survey, do you know how to manually refactor F1 to F2?” [Yes, I know / No, I don’t know]. Q4: “Will it be helpful to automatically refactor F1 to F2?” [Yes / No, with reason]. End: “Thanks!”)

13 Patterns AND and OR are combined, as are MAX and MIN.
The survey results are shown in Table 9. Rows “A2.A” to “A2.D” are concerned with the reasons why participants prefer F2 (the refactored formula). We first focus on the total survey results shown in Columns “Total” and “Prop.”. From this table, in only 28.57% of
Table 9: Feedback from end users
Answer Redundancy AND/OR CHOOSE MATCH LOOKUP MAX/MIN IFS Total Prop.
A1: I know F1 equals to F2. 17 18 12 10 8 24 9 98 28.57%
A2: I prefer F2. 45 46 46 48 43 45 45 318 92.71%
A2.A:F2 is shorter. 24 23 26 26 28 27 23 177 51.60%
A2.B:F2 is less complex. 41 41 29 29 34 31 37 242 70.55%
A2.C:F2 is easier to understand. 31 26 29 29 24 27 24 190 55.39%
A2.D:F2 is not easy to make mistakes. 17 18 20 20 33 22 17 147 42.86%
A3: I can refactor manually. 11 11 10 9 7 20 4 72 20.99%
A4: Automatic refactoring is helpful. 48 49 48 48 49 49 48 339 98.83%
cases participants have the knowledge to judge the equivalence between F1 and F2. In 92.71% of cases, participants prefer the refactored formulae. All four reasons we listed receive many votes, with "A2.B: F2 is less complex." receiving the most. In only 20.99% of cases do participants have the ability to refactor manually. In 98.83% of cases, participants believe that our automated refactoring approach is necessary and helpful.

Note that there are still around 7% (24) of the cases where participants do not like the refactored formulae. We look into their comments and find that in 6 cases the participants prefer F1 because they have no idea which one is better and choose an answer at random. In one case the participant said that IF(cell1 > cell2, cell1, cell2) is better than MAX(cell1, cell2) because when cell1 equals cell2, the result is more specific. In all the remaining 17 cases the participants prefer F1 because they do not have knowledge related to F2. Similarly, in around 1% of cases participants regard our approach as unhelpful, and in every such case the reason is that they lack knowledge about the refactored formulae. These negative opinions are quite valuable: they indicate that, in practical application, it is necessary to provide knowledge related to the new functions to help end users understand them better. A possible application scenario is that refactoring is not conducted without the permission of end users: they receive refactoring suggestions together with explanations, and choose whether to adopt each suggestion. More discussion about the application scenario can be found in Section 5.

We then focus on the answers for different patterns. LOOKUP and IFS are known to the fewest participants, while MAXMIN is known to the most. LOOKUP is the pattern that the most participants consider less error-prone. For the other answers, there are no obvious differences among patterns.

5 DISCUSSION
In this section, we discuss the application scenario (in Section 5.1) and the future work of our approach (in Section 5.2).

5.1 Application
Our application scenario is to make our approach a spreadsheet plug-in. When an end user finishes writing a formula with nested IF functions, the plug-in identifies whether the formula can be refactored. If so, it warns that these nested IFs are bad smells and provides refactoring suggestions. Note that in this paper each formula yields one specific refactoring result, but it is also feasible to generate several: for formulae that can be handled by multiple patterns, each pattern corresponds to one result. All the results can be ranked and serve as suggestion candidates, so that end users can choose whichever they like. Note that, according to the survey results introduced in Section 4.5, it may be better to provide an explanation of each suggestion to aid the understanding of end users. We will develop and present such a plug-in in future work.

5.2 Limitations and Future Work
There are several directions in which our approach can be improved.

First, there may be better application scenarios. Besides the application scenario introduced above, a more intelligent one is to provide possible refactoring suggestions before the end user finishes writing the formula. This scenario especially suits (intended) nested-IF formulae with high if-depth. For example, suppose that an end user intends to write a nested-IF formula of depth 30; it would be useful if we could identify the possible semantics when the user starts to write the 10th nested IF expression. In this way, the user may skip the verbose writing of the remaining 20 levels and choose to use the suggested function. Another advantage is that this application scenario can help users avoid errors arising from the use of nested-IF formulae. This application scenario is feasible. In future we plan to use machine learning techniques to infer the semantics of a formula based on a part of its syntax.

Second, the refactoring effectiveness can be further improved. Although our approach is currently effective in relieving the smell of nested-IF formulae, the coverage of some patterns can be enlarged. For example, the formula in Section 2, IF(Q1 = X1, Q1, IF(Q1 = "", X1, Q1)), matches the IFS pattern, but it would also match the OR pattern if we first transformed it into IF(Q1 = X1, Q1, IF(Q1 != "", Q1, X1)), which is refactored as IF(OR(Q1 = X1, Q1 != ""), Q1, X1). In other words, it is interesting to explore how to preprocess a formula to make it better prepared for refactoring.

Third, the refactoring coverage can be further improved. Some nested-IF formulae are beyond our refactoring ability, such as the two types mentioned in Section 4.3. Also, our current patterns are not the universal set. There must be other patterns that can also contribute to the nested-IF problem, yet are outside of our current knowledge. Nevertheless, the main idea of the approach remains effective, and we would complement the pattern set upon finding other good pattern candidates.

6 RELATED WORK
End user programming has been studied since 1993 [26]. Spreadsheets have been recognized as the most successful and most popular form of end user programming [6]. Current research on spreadsheets mainly follows the trend of applying traditional software
engineering methods to deal with spreadsheet problems. For example, most research focuses on smell detection [4, 5, 14, 27–32], fault detection and automatic repair [33–38], clone detection [39–41], refactoring [4, 13, 31], visualisation [42–45], and so on.

The research work most related to ours includes smell detection and refactoring. The former is related to the motivation of this paper: why nested-IF formulae are bad smells. The latter is related to the approach of this paper: how to refactor spreadsheets to reduce smells. We next introduce these two aspects one by one.

6.1 Smell Detection
Like code smells [46], spreadsheet smells refer to characteristics that may cause problems. Smells exist at different levels: formula level, cell level, and structural level. We mainly introduce the formula-level ones.

Abreu et al. [5] combine 15 smells to indicate potential faults. They treat conditional complexity as one of the key smells. The results indicate that this smell can detect only 6 spreadsheet faults.

Hermans et al. [4] regard conditional complexity as one of the five smells, because even in traditional professional programming, conditional complexity is a threat to code readability. However, according to their results derived from EUSES, on average each spreadsheet has only 3 formulae containing at least one condition, while in our corpus we find that on average each spreadsheet has 1,193 formulae containing conditions; in the Enron corpus, the number is 217. The reason for this huge difference may be that EUSES contains a lot of toy spreadsheets created by users who rarely use spreadsheet formulae.

Hermans et al. [4] also mention that end users already know the bad effects of conditional complexity. Our survey results (Table 9) confirm this statement: in 70.55% of cases, participants find the refactored formula less complex, and in roughly half of the cases they find it easier to understand (55.39%) or less error-prone (42.86%).

In another work, Hermans et al. [6] present an overview of software engineering approaches applied to spreadsheets. They claim that most spreadsheets contain formulae with multiple IF conditions, which is an obvious spreadsheet smell.

6.2 Formula Refactoring
Badame and Dig [12] are the first to propose refactoring in the spreadsheet domain. They present a tool, RefBook, together with seven refactoring patterns. These seven patterns target different smells. For example, the pattern MAKE CELL CONSTANT aims to make formulae less error-prone and more readable by adding the $ symbol. However, their approach is dispersed across different smells and can handle only simple formulae. For example, one of their refactoring patterns, called "REPLACE AWKWARD FORMULA", focuses only on the SUM function (e.g., replacing B5 + C5 + D5 + E5 with SUM(B5:E5)). They evaluate their approach on the EUSES corpus and find that their refactoring can be applied to many formulae. However, they present only the number of formulae that are "potential candidates" for each pattern, not the actual number of successfully refactored formulae. Thus, the refactoring coverage and effectiveness are unknown.

Hermans et al. [4] define different refactorings according to their smells. The results indicate that their refactoring approach is able to relieve the smells of 87% of formulae. However, their approach does not support automated refactoring.

Later on, Hermans and Dig [13] combine the two approaches above and present BumbleBee, a refactoring tool that allows a formula to be refactored based on defined transformation rules. Several patterns such as MAXMIN and OR are also mentioned in the paper. However, a formula can be refactored only when a transformation rule is defined, while according to our survey, participants have the ability to refactor manually in only 20.99% of cases, which suggests that few end users have the knowledge to define transformation rules. The work of Hoepelman [47] expands this work and introduces more refactoring support.

To sum up, several works currently aim to tackle the challenges brought by spreadsheet smells, but no automatic and high-coverage refactoring approach is available. We propose to systematically tackle the nested-IF formula refactoring problem with an approach that is able to handle almost all formulae with high depth-reduction effectiveness.

7 CONCLUSION
We propose a spreadsheet formula refactoring approach that aims to automatically relieve the smells of nested IF functions. We first try to identify whether the formula contains redundant conditions. Afterwards, we identify the semantics of the combination of nested IFs and replace them with an alternative spreadsheet function. Evaluation on a very large real-world spreadsheet corpus indicates that the refactoring effectiveness is impressive: most of the nested-IF formulae can be refactored. Our survey of 49 participants reveals that the majority of them like our refactoring approach and think it is helpful.
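As an illustration of what "semantic-preserving" means for the rewrites discussed in this paper, the following minimal sketch checks two of them by brute force over a small input domain: the OR-pattern rewrite from Section 5.2 and the replacement of IF(cell1 > cell2, cell1, cell2) by MAX(cell1, cell2) from the survey discussion. It is not the authors' implementation. Formulas are modelled as plain Python functions, the helper names (`IF`, `nested_if`, `flipped_if`, `or_pattern`, `if_max`, `equivalent`) are hypothetical, and Python comparison semantics are assumed to stand in for spreadsheet comparison semantics, which differ in details such as case sensitivity and type coercion.

```python
from itertools import product

# Hypothetical stand-in for the spreadsheet function IF(cond, a, b).
def IF(cond, when_true, when_false):
    return when_true if cond else when_false

# Formula from Section 2: IF(Q1 = X1, Q1, IF(Q1 = "", X1, Q1))
def nested_if(q1, x1):
    return IF(q1 == x1, q1, IF(q1 == "", x1, q1))

# Preprocessed form: IF(Q1 = X1, Q1, IF(Q1 != "", Q1, X1))
def flipped_if(q1, x1):
    return IF(q1 == x1, q1, IF(q1 != "", q1, x1))

# OR-pattern result: IF(OR(Q1 = X1, Q1 != ""), Q1, X1)
def or_pattern(q1, x1):
    return IF(q1 == x1 or q1 != "", q1, x1)

# MAXMIN-pattern source formula: IF(cell1 > cell2, cell1, cell2)
def if_max(cell1, cell2):
    return IF(cell1 > cell2, cell1, cell2)

def equivalent(f, g, domain):
    """True if f and g agree on every pair of inputs drawn from domain."""
    return all(f(a, b) == g(a, b) for a, b in product(domain, repeat=2))

# Small illustrative domains (assumed, not taken from the paper).
cells = ["", "a", "b"]
numbers = [-1, 0, 1, 2]

or_ok = (equivalent(nested_if, flipped_if, cells)
         and equivalent(nested_if, or_pattern, cells))
max_ok = equivalent(if_max, max, numbers)
print(or_ok, max_ok)
```

An exhaustive check over a tiny domain is only a sanity test, not a proof; it is useful mainly for catching a rewrite that flips a branch or drops a condition.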
REFERENCES
[1] wiki. End user development. https://en.wikipedia.org/wiki/End-user_development, 2015.
[2] Margaret M Burnett and Christopher Scaffidi. 10. end-user development.
[3] Felienne Hermans, Martin Pinzger, and Arie van Deursen. Detecting and refactoring code smells in spreadsheet formulas. Empirical Software Engineering, 20(2):549–575, 2015.
[4] Felienne Hermans, Martin Pinzger, and Arie van Deursen. Detecting and refactoring code smells in spreadsheet formulas. Empirical Software Engineering, 20(2):549–575, Apr 2015.
[5] R. Abreu, J. Cunha, J. P. Fernandes, P. Martins, A. Perez, and J. Saraiva. Smelling faults in spreadsheets. In Proc. ICSME, pages 111–120, Sept 2014.
[6] F. Hermans, B. Jansen, S. Roy, E. Aivaloglou, A. Swidan, and D. Hoepelman. Spreadsheets are code: An overview of software engineering approaches applied to spreadsheets. In Proc. SANER, volume 5, pages 56–65, March 2016.
[7] Michele Tufano, Fabio Palomba, Gabriele Bavota, Rocco Oliveto, Massimiliano Di Penta, Andrea De Lucia, and Denys Poshyvanyk. When and why your code starts to smell bad. In Proc. ICSE, pages 403–414. IEEE Press, 2015.
[8] Dave Bruns. 19 tips for nested IF formulas. https://exceljet.net/nested-ifs, 2016.
[9] reddit. Is it a good or bad practice reducing nested if statements. https://www.reddit.com/r/csharp/comments/33puzj/is_it_a_good_or_bad_practice_reducing_nested_if/, 2015.
[10] reddit. Never use nested IFs again. https://www.reddit.com/r/excel/comments/2slys1/never_use_nested_ifs_again/, 2015.
[11] Felienne Hermans and Emerson Murphy-Hill. Enron's spreadsheets and related emails: A dataset and analysis. In Proc. ICSE, pages 7–16. IEEE Press, 2015.
[12] Sandro Badame and Danny Dig. Refactoring meets spreadsheet formulas. In Proc. ICSM, pages 399–409. IEEE, 2012.
[13] Felienne Hermans and Danny Dig. Bumblebee: a refactoring environment for spreadsheet formulas. In Proc. ICSE, pages 747–750. ACM, 2014.
[14] Marc Fisher and Gregg Rothermel. The euses spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In ACM SIGSOFT Software Engineering Notes, volume 30, pages 1–5. ACM, 2005.
[15] Salvatore Aurigemma and Raymond R Panko. The detection of human spreadsheet errors by humans versus inspection (auditing) software. arXiv preprint arXiv:1009.2785, 2010.
[16] E Gazoni and C Clark. openpyxl-a python library to read/write excel 2010 xlsx/xlsm files, 2016.
[17] T. Schmitz and D. Jannach. Finding errors in the enron spreadsheet corpus. In Proc. VL/HCC, pages 157–161, 2016.
[18] Bas Jansen. Enron versus euses: A comparison of two spreadsheet corpora. arXiv preprint arXiv:1503.04055, 2015.
[19] Thomas Reschenhofer, Bernhard Waltl, Klym Shumaiev, and Florian Matthes. A conceptual model for measuring the complexity of spreadsheets. arXiv preprint arXiv:1704.01147, 2017.
[20] CHOOSE. CHOOSE function. https://support.office.com/en-us/article/CHOOSE-function-fc5c184f-cb62-4ec7-a46e-38653b98f5bc.
[21] MATCH. MATCH function. https://support.office.com/en-us/article/MATCH-function-e8dffd45-c762-47d6-bf89-533f4a37673a.
[22] VLOOKUP. VLOOKUP function. https://support.office.com/en-us/article/VLOOKUP-function-0bbc8083-26fe-4963-8ab8-93a18ad188a1.
[23] HLOOKUP. HLOOKUP function. https://support.office.com/en-us/article/HLOOKUP-function-a3034eec-b719-4ba3-bb65-e1ad662ed95f.
[24] office. IF function – nested formulas and avoiding pitfalls. https://support.office.com/en-us/article/IF-function-%e2%80%93-nested-formulas-and-avoiding-pitfalls-0b22ff44-f149-44ba-aeb5-4ef99da241c8?ui=en-US&rs=en-US&ad=US, 2015.
[25] spreadsheeto. Let's take a look at IF, Nested IF and IFS. http://spreadsheeto.com/if/, 2015.
[26] Bonnie A Nardi. A small matter of programming: perspectives on end user programming, 1993.
[27] Rui Abreu, Jácome Cunha, Joao Paulo Fernandes, Pedro Martins, Alexandre Perez, and João Saraiva. Smelling faults in spreadsheets. In Proc. ICSME, pages 111–120. IEEE, 2014.
[28] F. Hermans, M. Pinzger, and A. van Deursen. Detecting and visualizing inter-worksheet smells in spreadsheets. In Proc. ICSE, pages 441–451, 2012.
[29] F. Hermans, M. Pinzger, and A. van Deursen. Detecting code smells in spreadsheet formulas. In Proc. ICSM, pages 409–418, 2012.
[30] Jácome Cunha, João P Fernandes, Hugo Ribeiro, and João Saraiva. Towards a catalog of spreadsheet smells. In International Conference on Computational Science and Its Applications, pages 202–216. Springer, 2012.
[31] Wensheng Dou, Shing-Chi Cheung, and Jun Wei. Is spreadsheet ambiguity harmful? detecting and repairing spreadsheet smells due to ambiguous computation. In Proc. ICSE, pages 848–858. ACM, 2014.
[32] Shing-Chi Cheung, Wanjun Chen, Yepang Liu, and Chang Xu. Custodes: automatic spreadsheet cell clustering and smell detection using strong and weak features. In Proc. ICSE, pages 464–475. ACM, 2016.
[33] S. Roy, F. Hermans, and A. van Deursen. Spreadsheet testing in practice. In Proc. SANER, pages 338–348, 2017.
[34] Ray Panko. What we don't know about spreadsheet errors today: The facts, why we don't believe them, and what we need to do. arXiv preprint arXiv:1602.02601, 2016.
[35] Kamalasen Rajalingham, David R Chadwick, and Brian Knight. Classification of spreadsheet errors. arXiv preprint arXiv:0805.4224, 2008.
[36] Yirsaw Ayalew and Roland Mittermeir. Spreadsheet debugging. arXiv preprint arXiv:0801.4280, 2008.
[37] Rui Abreu, Jácome Cunha, Joao Paulo Fernandes, Pedro Martins, Alexandre Perez, and Joao Saraiva. Faultysheet detective: When smells meet fault localization. In Proc. ICSME, pages 625–628. IEEE, 2014.
[38] Joseph R Ruthruff, Margaret Burnett, and Gregg Rothermel. Interactive fault localization techniques in a spreadsheet environment. IEEE Transactions on Software Engineering, 32(4):213–239, 2006.
[39] Felienne Hermans, Ben Sedee, Martin Pinzger, and Arie van Deursen. Data clone detection and visualization in spreadsheets. In Proc. ICSE, pages 292–301, 2013.
[40] Felienne Hermans, Ben Sedee, Martin Pinzger, and Arie van Deursen. Data clone detection and visualization in spreadsheets. In Proc. ICSE, pages 292–301. IEEE Press, 2013.
[41] Chanchal K Roy. Detection and analysis of near-miss software clones. In Proc. ICSM, pages 447–450. IEEE, 2009.
[42] Felienne Hermans, Martin Pinzger, and Arie van Deursen. Supporting professional spreadsheet users by generating leveled dataflow diagrams. In Proc. ICSE, pages 451–460, 2011.
[43] Felienne Hermans, Martin Pinzger, and Arie Van Deursen. Automatically extracting class diagrams from spreadsheets. Proc. ECOOP, pages 52–75, 2010.
[44] Takeo Igarashi, Jock D Mackinlay, Bay-Wei Chang, and Polle T Zellweger. Fluid visualization of spreadsheet structures. In Proc. ICIV, pages 118–125. IEEE, 1998.
[45] Hidekazu Shiozawa, Ken-ichi Okada, and Yutaka Matsushita. 3d interactive visualization for inter-cell dependencies of spreadsheets. In Proc. ICIV, pages 79–82. IEEE, 1999.
[46] Martin Fowler and Kent Beck. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.
[47] DJ Hoepelman. Tool-assisted spreadsheet refactoring and parsing spreadsheet formulas. 2015.
