Automated Refactoring of Nested-IF Formulae in Spreadsheets: Jie Zhang, Shi Han, Dan Hao, Lu Zhang, Dongmei Zhang
expressions, which have low readability and high cognitive cost for users, and are error-prone during reuse or maintenance. However, end users usually lack the essential programming language knowledge and skills to tackle or even recognize the problem. Previous research has made only very initial attempts in this direction, and no effective, automated approach is currently available.

This paper proposes the first AST-based automated approach to systematically refactoring nested-IF formulae. The general idea is two-fold. First, we detect and remove logic redundancy on the AST. Second, we identify higher-level semantics that have been fragmented and scattered, and reassemble the syntax using concise built-in functions. A comprehensive evaluation has been conducted against a real-world spreadsheet corpus, collected at a leading IT company for research purposes. The results, covering over 68,000 spreadsheets with 27 million nested-IF formulae, show that our approach relieves the smell of over 99% of nested-IF formulae. Over 50% of the refactorings reduce the nesting level of the nested-IFs by more than half. In addition, a survey involving 49 participants indicates that in most cases the participants prefer the refactored formulae and agree that such an automated refactoring approach is necessary and helpful.

1 INTRODUCTION

Spreadsheets are the most popular end-user programming tools [1]. One of the most important enabling factors is that spreadsheets provide immediate feedback, so users can make a change in one place and immediately see the results [2]. Underneath this advantage, formulae play an important role as end-user friendly programs. However, end users typically lack essential programming knowledge and skills, and are more likely to write formulae with bad smells [3].

One of the well-recognized spreadsheet smells is nested-IF expressions [3, 4]. IF functions¹ (i.e., the syntax is IF(condition, true branch, false branch)) are widely used spreadsheet functions. Nested-IF expressions arise when end users write an IF function inside another IF or nested-IF function. According to previous research [3–7], nested-IF formulae in spreadsheets are complex, unreadable, and error-prone, as well as hard to debug and maintain. There are also many online discussions about the harm of nested-IF formulae. Some people have expressed their desire to reduce nested-IFs “wherever possible” [8–10].

spreadsheet includes on average 9 formulae with if-depth over 10, while the observed maximum if-depth is 48 with multiple instances.

Formula refactoring is a practical solution to this problem. It was first proposed by Badame and Dig [12]: perform semantic-preserving formula transformations (without changing the behavior) with the purpose of removing formula smells. Nevertheless, such refactoring requires essential programming knowledge and skills, which is challenging for end users. To help end users, several previous works [4, 12, 13] have proposed a few simple refactoring patterns that try to decrease the if-depth, but they either have very low coverage (i.e., the ratio of formulae that can be ameliorated) or are non-automatic.

In this paper, we propose the first AST (Abstract Syntax Tree) based approach to systematically tackling this problem via automated refactoring. The general idea is two-fold. First, there often exists logic redundancy across different condition paths within a nested-IF. Reducing the redundant logic removes useless parts and simplifies the nested-IF formula. Second, some higher-level semantics are often fragmented into hierarchical combinations of IF conditions in a nested-IF. Reassembling the fragmented syntax from the corresponding IF-subtrees into built-in functions can shorten the nested-IF formula. To analyze and refactor both redundant logic and fragmented syntax, our approach works on the AST structure as an intermediate representation of nested-IF formulae.

The evaluation is conducted on over 68,000 real-world spreadsheets with over 27 million nested-IF formulae. The experimental results lead to three key takeaways. First, our approach is generally applicable: over 99% of the nested-IF formulae can be refactored, and the refactorings have been verified as correct. Second, our approach is effective: over 50% of the refactorings reduce the if-depth by more than half, and in most formulae the nested-IF functions are either eliminated entirely or reduced to an if-depth of 1. Third, end users endorse our approach and its results. A survey of 49 participants indicates that most of them prefer the refactored formulae and believe the automated refactoring is necessary and helpful, while only a few of them have the knowledge needed for manual refactoring.

The main contributions of this paper are as follows.

1) An automated and highly effective approach to identify and refactor nested-IF formulae. The goal is to help end users reduce the complexity and cognitive cost of nested-IF formulae in spreadsheets.

2) A comprehensive evaluation of the proposed automated approach. We evaluated the correctness, applicability, effectiveness and usefulness of the approach.

3) A statistical study on the current usage of nested-IF formulae

¹ Functions are predefined built-in formulae already available in spreadsheet systems.
² In this paper, we refer to a spreadsheet as a file consisting of one or multiple worksheets [11].
³ E.g., IF(IF(L1 >= F$5, L1), IF(L1 <= F$6, L1, “”), “”) has an if-depth of 2.

[Figure: AST diagrams with condition nodes Q1 = X1, Q1 = “”, M = “”, Q1 <> X1 and T/F branches]
Depth reduction ratio    Total       Unique
(0%,25%]                 0           0
(25%,50%]                10231301    7884846
(50%,75%]                2378167     1987628
(75%,100%]               15036220    9370933

Figure 5: Distribution of different depth reduction ratios.

[Figure: survey questionnaire]
Q1: Before this survey, do you know F1 equals to F2? [ ] Yes, I know [ ] No, I don’t know.
Q2: Do you prefer F1 or F2?
    F1: Reason:_______
    F2: Please tick your reason(s):
        [ ] A: F2 is shorter
        [ ] B: F2 is less complex
        [ ] C: F2 is easier to understand
        [ ] D: F2 is not easy to make mistakes
Q3: Before this survey, do you know how to manually refactor F1 to F2? [ ] Yes, I know [ ] No, I don’t know.
Q4: Will it be helpful to automatically refactor F1 to F2?
    No: Reason:_______

Table 8: Number of formulae with different new depth
Total    Unique
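As a quick arithmetic check, the counts attached to Figure 5 can be related to the claim that over 50% of the refactorings reduce the if-depth by more than half. The short Python sketch below assumes that the numbers recovered from the figure labels are paired with the buckets as listed in the dictionaries; that pairing is a reading of the figure, not something stated in the text, and the function name is hypothetical.

```python
# Counts as read from the Figure 5 labels (bucket -> number of formulae).
# The bucket/value pairing is an assumption based on the figure layout.
total = {
    "(0%,25%]": 0,
    "(25%,50%]": 10_231_301,
    "(50%,75%]": 2_378_167,
    "(75%,100%]": 15_036_220,
}
unique = {
    "(0%,25%]": 0,
    "(25%,50%]": 7_884_846,
    "(50%,75%]": 1_987_628,
    "(75%,100%]": 9_370_933,
}


def share_above_half(counts):
    """Fraction of formulae whose if-depth was reduced by more than 50%."""
    above = counts["(50%,75%]"] + counts["(75%,100%]"]
    return above / sum(counts.values())


print(f"all formulae:    {share_above_half(total):.1%}")
print(f"unique formulae: {share_above_half(unique):.1%}")
```

Under that reading, roughly 63% of all refactored formulae and roughly 59% of the unique ones fall in the two upper buckets, and the total count sums to about 27.6 million, in line with the "over 27 million" figure quoted for the corpus.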
cases participants have the knowledge to judge the equivalence between F1 and F2. In 92.71% of cases, participants prefer the refactored formulae. All four reasons we listed receive many votes, with “A2.B: F2 is less complex.” the highest. In only 20.99% of cases do participants have the ability to refactor manually. 98.83% believe that our automated refactoring approach is necessary and helpful.

Note that there are still around 7% (24) of the cases where participants do not like the refactored formulae. We look into their comments and find that in 6 cases the participants prefer F1 because they have no idea which one is better and choose an answer at random. In one case the participant said that IF(cell1 > cell2, cell1, cell2) is better than MAX(cell1, cell2) because when cell1 equals cell2, the result is more specific. In all the remaining 17 cases the participants prefer F1 because they do not have knowledge related to F2. Similarly, in 1% of cases participants regard our approach as unhelpful, and in every such case the reason is that they lack knowledge about the refactored formulae. These negative opinions are quite valuable: they indicate that in practical application it is necessary to provide knowledge about the new functions to help end users understand them better. A possible application scenario is that refactoring will not be conducted without the permission of end users. They may receive refactoring suggestions together with explanations, and can choose whether to adopt each suggestion. More discussion of the application scenario can be found in Section 5.

We then focus on the answers for different patterns. LOOKUP and IFS are known to the fewest participants, while MAXMIN is known to the most. More people think that LOOKUP is less error-prone compared to other patterns. The other answers show no obvious differences among the patterns.

5 DISCUSSION

In this section, we discuss the application scenario (in Section 5.1) and the future work of our approach (in Section 5.2).

5.1 Application

Our application scenario is to make our approach a spreadsheet plug-in. When an end user finishes writing a formula with nested IF functions, the plug-in identifies whether the formula can be refactored. If so, it alerts the user that these nested IFs are bad smells and provides refactoring suggestions. Note that in this paper each formula yields one specific refactoring result, but it is also feasible to generate several refactoring results: for formulae that can be handled by multiple patterns, each pattern corresponds to one result. All the results can be ranked and serve as suggestion candidates, so that end users can choose whichever they like. Note that, given the survey results introduced in Section 4.5, it may be better to provide an explanation of each suggestion to aid the understanding of end users. We will develop and present such a plug-in in future work.

5.2 Limitations and Future Work

There are several directions in which our approach can be improved.

First, there may be better application scenarios. Besides the application scenario introduced above, a more intelligent one is to provide possible refactoring suggestions before the end users finish writing the formula. This scenario especially suits (intended) nested-IF formulae with high if-depth. For example, suppose that an end user intends to write a 30-depth nested-IF formula. It would be useful to identify the possible semantics when the user starts to write the 10th nested IF expression. In this way, the user may skip the verbose writing of the remaining 20 levels and choose to use the suggested function. Another advantage is that this application scenario can help users avoid errors arising from the use of nested-IF formulae. This application scenario is feasible. In the future we plan to use machine learning techniques to infer the semantics of a formula based on a part of its syntax.

Second, the refactoring effectiveness can be further improved. Although our approach is currently effective in relieving the smell of nested-IF formulae, the coverage of some patterns can be enlarged. For example, the formula in Section 2, IF(Q1 = X1, Q1, IF(Q1 = “”, X1, Q1)), matches the IFS pattern, but it would also match the OR pattern if we transformed it into IF(Q1 = X1, Q1, IF(Q1 != “”, Q1, X1)) (refactored as IF(OR(Q1 = X1, Q1 != “”), Q1, X1)). In other words, it is interesting to explore how to preprocess a formula to make it better prepared for refactoring.

Third, the refactoring coverage can be further improved. Some nested-IF formulae are beyond our refactoring ability, such as the two types mentioned in Section 4.3. Also, our current patterns are not the universal set. There must be other patterns that can also help address the nested-IF problem, yet lie outside our current knowledge. Nevertheless, the main idea of the approach remains effective, and we will extend the pattern set upon finding other good pattern candidates.

6 RELATED WORK

End user programming has been studied since 1993 [26]. Spreadsheets have been recognized as the most successful and most popular form of end user programming [6]. Current research on spreadsheets mainly follows the trend of applying traditional software
engineering methods to deal with spreadsheet problems. For example, most research focuses on smell detection [4, 5, 14, 27–32], fault detection and automatic repair [33–38], clone detection [39–41], refactoring [4, 13, 31], visualisation [42–45], and so on.

The research work most related to ours includes smell detection and refactoring. The former is related to the motivation of this paper: why nested-IF formulae are bad smells. The latter is related to the approach of this paper: how to refactor spreadsheets to reduce smells. We next introduce these two aspects one by one.

6.1 Smell Detection

Like code smells [46], spreadsheet smells refer to characteristics that may cause problems. Smells have different levels: formula level, cell level, and structural level. We mainly introduce the formula-level ones.

Abreu et al. [5] combine 15 smells to indicate potential faults. They treat conditional complexity as one of the key smells. The results indicate that this smell can detect only 6 spreadsheet faults.

Hermans et al. [4] regard conditional complexity as one of the five smells, because even in traditional professional programming, conditional complexity is a threat to code readability. However, according to their results derived from EUSES, on average each spreadsheet has only 3 formulae containing at least one condition, while in our corpus we find that on average each spreadsheet has 1,193 formulae containing conditions; in the Enron corpus, the number is 217. The reason for this huge difference may be that EUSES contains many toy spreadsheets created by users who rarely use spreadsheet formulae.

Hermans et al. [4] also mention that end users already know the bad effects of conditional complexity. Our survey results confirm this statement: around half of the participants think that formulae with high conditional complexity are more complex and error-prone; 70.55% think that they are harder to understand.

Another work by Hermans et al. [6] presents an overview of software engineering approaches applied to spreadsheets. They claim that most spreadsheets contain formulae with multiple IF conditions, which is an obvious spreadsheet smell.

to relieve the smells of 87% formulae. However, their approach does not support automated refactoring.

Later on, Hermans and Dig [13] combine the two approaches above and present BumbleBee, a refactoring tool that allows a formula to be refactored based on defined transformation rules. Several patterns such as MAXMIN and OR are also mentioned in the paper. However, a formula can be refactored only when the transformation rule is defined, while according to our survey, only 20.99% of participants may have the knowledge needed to define transformation rules. The work of Hoepelman [47] expands this work and introduces more refactoring support.

To sum up, several works currently aim to tackle the challenges brought by spreadsheet smells, but no automatic and high-coverage refactoring approach is available. We propose to systematically tackle the nested-IF formula refactoring problem with an approach that is able to handle almost all formulae with high depth-reduction effectiveness.

7 CONCLUSION

We propose a spreadsheet formula refactoring approach that aims to automatically relieve the smells of nested IF functions. We first try to identify whether the formula contains redundant conditions. Afterwards, we identify the semantics of the combination of nested IFs and replace them with an alternative spreadsheet function. Evaluation on a very large real-world spreadsheet corpus indicates that the refactoring effectiveness is impressive: most of the nested-IF formulae can be refactored. Our survey of 49 participants reveals that the majority of them like our refactoring approach and think it is helpful.
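The preprocessing idea raised in Section 5.2 (flipping an inner IF so that the OR pattern becomes applicable) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' algorithm: the tuple encoding of formulae, the helper names `flip`, `or_pattern` and `evaluate`, and the sample cell values are all hypothetical, and Python equality only approximates spreadsheet comparison semantics.

```python
from itertools import product

# Formulae as nested tuples (an illustrative encoding, not the paper's AST):
#   ("IF", cond, then, other), ("OR", c1, c2), ("=", a, b), ("!=", a, b)
# Leaves are cell names such as "Q1" or literal strings such as "".
NEGATE = {"=": "!=", "!=": "="}


def flip(node):
    """Rewrite IF(c, x, y) as IF(not c, y, x) when c is = or !=."""
    if isinstance(node, tuple) and node[0] == "IF":
        _, cond, then, other = node
        if isinstance(cond, tuple) and cond[0] in NEGATE:
            return ("IF", (NEGATE[cond[0]], cond[1], cond[2]), other, then)
    return node


def or_pattern(node):
    """Rewrite IF(c1, a, IF(c2, a, b)) as IF(OR(c1, c2), a, b)."""
    if isinstance(node, tuple) and node[0] == "IF":
        _, c1, a, rest = node
        if isinstance(rest, tuple) and rest[0] == "IF" and rest[2] == a:
            return ("IF", ("OR", c1, rest[1]), a, rest[3])
    return node


def evaluate(node, env):
    """Evaluate a formula; env maps cell names to their values."""
    if not isinstance(node, tuple):
        return env.get(node, node)  # cell reference, else a literal
    op = node[0]
    if op == "IF":
        return evaluate(node[2] if evaluate(node[1], env) else node[3], env)
    if op == "OR":
        return evaluate(node[1], env) or evaluate(node[2], env)
    left, right = evaluate(node[1], env), evaluate(node[2], env)
    return left == right if op == "=" else left != right


# IF(Q1 = X1, Q1, IF(Q1 = "", X1, Q1)), the formula quoted in Section 5.2.
original = ("IF", ("=", "Q1", "X1"), "Q1", ("IF", ("=", "Q1", ""), "X1", "Q1"))
prepared = original[:3] + (flip(original[3]),)  # flip only the inner IF
refactored = or_pattern(prepared)

# Brute-force equivalence check over a few sample cell values.
samples = ["", "a", "b"]
equivalent = all(
    evaluate(original, {"Q1": q, "X1": x}) == evaluate(refactored, {"Q1": q, "X1": x})
    for q, x in product(samples, repeat=2)
)
print(or_pattern(original) == original, equivalent)  # prints: True True
```

Without the flip, the OR rule does not fire on the original formula because the two branches that should coincide are in different positions. After the flip, the result corresponds to IF(OR(Q1 = X1, Q1 != ""), Q1, X1), and the brute-force comparison over the sample values finds no disagreement with the original.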