Professional Documents
Culture Documents
Deborah Ferreira1 , Julia Rozanova1 , Krishna Dubba2 , Dell Zhang2∗ , Andre Freitas1
1
Department of Computer Science, University of Manchester, Kilburn Building, Oxford Road, M13 9PL, UK
2
Blue Prism AI Labs, c/o WeWork, 125 Kingsway, London WC2B 6NH, UK
{deborah.ferreira, julia.rozanova, andre.freitas}@manchester.ac.uk, {krishna.dubba, dell.zhang}@blueprism.com
arXiv:2001.02639v2 [cs.HC] 4 Feb 2020
(1) (1)
In Figure 1, let I1 = {i1 , i2 }, where these identify the submit button submit button
form input box and the submit button respectively. The state
symbol s will represent the computational state under the
hood, while the function set F is determined by the valid
actions relevant to the interfaces: here, for example, one can (1)
Figure 2: Interface element i1 with a positional realisation
imagine these to include a click action f1 (s, i) and a text visualised as a bounding box.
input action f2 (s, i, ‘example text input’).
The set Ω is that of all possible user inputs to the
text box. To simplify, suppose that in this case it is A soft typing system may be implemented if one intro-
Ω = {‘a’, . . . , ‘z’, ‘ ’}∗ (where ∗ is the Kleene star). duces an environmental vocabulary, by which descriptive
types may be assigned to both interface elements and input
Definition 3. A model or instantiation of an environment e values.
is defined with respect to a working computer desktop with
finitely many graphical user interfaces, an interactor (which Definition 6. An environment vocabulary
may either be automatic, such as a Sikuli program instance, T ⊂ {‘a’, . . . , ‘z’, ‘ ’}∗ S
is a set of descriptive strings
or a human end-user) which has its own language and an strings. Define type : {I1 , . . . , In , Ω} → T to be a
descriptor function for interface elements and values which
1
http://sikulix.com/ may serve as action function arguments.
Example 2. In figure 1, one may choose to define an envi- Process
ronment vocabulary such that type(i1 ) = ‘button’.
text2process
the activity. The task is similar to end-user programming and description, such as “Click on the ‘Next’ button.”. The en-
semantic parsing, giving that users can convert from natu- vironment provides a precise evaluation metric, rewarding
ral language to a program that implements a process (Sales, simulated behaviour, with rewards ranging from -1.0 (fail-
Handschuh, and Freitas 2017) and Program Synthesis, since ure) to 1.0 (success), according to the results of each action,
the final goal is generating a program. Given a set of imper- i.e., if the action shifts the current state to a state closer to
ative sentences, we want to generate a program that executes the environment goal or not.
the described actions. The same work (Shi et al. 2017) also proposes FormWoB,
Similarly, the user might also want to generate the NL which consists of four web tasks based on real flight book-
description of a process that is already implemented, espe- ing websites, and QAWoB, which approaches web tasks as
cially in cases where the process becomes too complex to be question answering, soliciting questions from crowd work-
analysed by humans. ers. Even though MiniWoB provides an essential baseline
for IPA, it is still a synthetic dataset, not reflecting real user
Related Work applications. It has a specific screen size, a limited number
Intelligent process automation aims to replicate and im- of applications and user actions, creating a controlled and
prove, iteratively, activities carried out by humans. Com- closed environment.
paring the performance of different IPA techniques requires An example of a real (non-synthetic) dataset for soft-
well-defined tasks and metrics that reflect the relevant as- ware intelligence is the PhotoShop Operation Video (PSOV)
pects when automating a process. In this section, we present Dataset (Cheng et al. 2018), containing videos and dense
the benchmarks present in similar research areas. command annotations for Photoshop Software, with 74
Even though IPA applications, such as software intelli- hours of videos and 29,204 labelled commands. Despite
gence and RPA, have particular research questions (Aalst et PSOV presenting real-world use cases, it is still limited to
al. 2018), there are only a few published datasets designed one specific software, i.e., it is not generalisable to other ap-
to help addressing this demand. plications.
One conspicuously related dataset is the Mini World of Comparable benchmarks can be found in the research
Bits (MiniWoB) (Shi et al. 2017), a benchmark of 100 re- area of video captioning. (Rohrbach et al. 2015) presents the
inforcement learning environments containing many of the MPII Movie Description dataset (MPII-MD) which contains
characteristics of live web tasks, created in a controlled con- transcribed and aligned audio descriptions and script data
text. Each MiniWoB environment is an HTML page with a sentences of a set of 55 movies of diverse genres. (Torabi et
resolution of 210x160 pixels, and a natural language task al. 2015) introduces a dataset including 84.6 hours of paired
video/sentences from 92 DVDs, with high-quality natural Eimage arg if arg is an image
language phrases describing the visual content in a given ?
Earg (arg, arg ) = Esymb arg if arg is a symbolic
segment of time.
value
A process can be seen as a formal representation of a task;
therefore, there is an evident alignment between IPA and For an image argument, the intersection over union (IoU)
semantic parsing (SP). Unlike IPA, several benchmarks are is used to define Eimage arg . Given the bounding box of the
available for SP, from which we can obtain insights. corresponding interface element (bb(i(arg))) and the gold
Most of these benchmarks focus on evaluating the conver- standard bounding box (bb? (i(arg))):
sion from a natural language specification to a program writ-
ten in a specific programming language. WikiSQL (Zhong, area of (bb(i(arg)) ∩ bb? (i(arg)))
IoU =
Xiong, and Socher 2017) is a collection of questions, corre- area of (bb(i(arg)) ∪ bb? (i(arg)))
sponding SQL queries, and SQL tables. NL2Bash (Lin et al. where Eimage arg (arg) = 0 if IoU > 0.5 and
2018) is a corpus with frequently used Bash commands with Eimage arg (arg) = 0 otherwise.
its respective natural language description. Other applica- IoU assumes that Imagesys can be registered within the
tions include converting from a visual object to code (Ling et gs reference Screenshotgs .
al. 2016; Beltramelli 2018) and from source code to Pseudo- An alternative way to define Eimage arg is by directly
code in different languages (Oda et al. 2015). comparing the two argument images using means squared
Process mining is also another closely related research error (MSE) or structural similarity index (SSIM) (Wang et
area; it aims to discover, monitor and improve real processes, al. 2004):
extracting knowledge from event logs (Van Der Aalst et al.
2011). Unlike IPA, it does not have its main focus on au- m−1 n−1
1 XX
tomation, but on finding answers for domain-specific ques- M SE = [I(i, j) − K(i, j)]2
mn i=0 j=0
tions, such as analysing patient treatment procedures (Bose
and van der Aalst 2011), and discovering the roles of the
people involved in the various stages of a specific pro- (2µx µy + c1 )(2σxy + c2 )
cess (van Dongen 2015). SSIM (x, y) =
(µ2x + µ2y + c1 )(σx2 + σy2 + c2 )
Evaluation Metrics
D2P & T2P The previous measures do not take into account the
This section concentrates on a weighted quantification of the order of the program statements. In order to capture
correctness and completeness of the generated IPA program the sequential nature of an IPA program we include a
output. Correctness and completeness are defined against a sequence-based metric which is based on the longest
gold reference IPA program which was produced by one or common subsequence (LCS) function. Given two sequences
more human programmers. X = (x1 , x2 , . . . , xm ) and Y = (y1 , y2 , . . . , yn ), and given
Given a set Π = {p1 , . . . pk } of generated IPA pro- that the prefixes of X are X1,2,...,m and the prefixes of Y
grams where each pj = f0 (arg0 , ..., argn ) ◦ .... ◦ are Y1,2,...,n , LCS is defined as:
fm (arg0 , ..., argn ) and a corresponding gold standard Π? ,
∅
we define a set of metrics of varying granularity that can be
(if i = 0 or j = 0)
seen as approximate measures of program correctness.
We start with the following error functions:
LCS (Xi−1 , Yj−1 )ˆxi
• Strict Error: LCS (Xi , Yj ) =
(if i, j > 0 and xi = yj )
0 if pj = p?j
Estrict (p, p? ) = max{LCS (Xi , Yj−1 ), LCS (Xi−1 , Yj )}
1 otherwise
(if i, j > 0 and xi 6= yj ).
M AEstrict (Π) =
X Estrict ( pj , p?j ) In order to compute the maximal program fragment gen-
erated we encode the programs Π? and Π as a sequence
|Π| of unique symbols si from a hash table derived from the
pj ∈Π
statements of Π? ∪ Π. The maximum program overlap
• Predicate / Argument Sensitive Error: (MPO) is defined as:
?
Esensitive (p, p? ) = M P O(Π, Π? ) = lcs(S,S
|S|
)
Step-by-step description
Open outlook.
Webmail
Write an email to <hidden>requesting to book a flight from Manchester to Tokyo on the 01/08/2019.
Open a spreadsheet containing names of people, places (origin and destination) and dates (in a spreadsheet),
Spreadsheet + Webmail Open outlook.
Send an email to <hidden>requesting a flight booking for each row in the spreadsheet (one e-mail for each).
Open outlook.
Webmail + browser Read an email requesting the booking of a flight.
Use https://www.skyscanner.net/ to verify the price of the flight.
Open outlook.
Read one email requesting the booking of a flight.
Browser + spreadsheet + webmail
Use https://www.skyscanner.net/ to verify the price of the flight.
Add the name of the person and the cheapest flight to a (pre-created) spreadsheet.
Random task - Different OS Any of the above tasks executed using the OS Ubuntu.
4. Annotator C verified and corrected the videos and the an- Conclusion
notations.
In this work, we present a formalisation for IPA using a
5. From the 90 different tasks, we chose ten random tasks. bottom-up approach, defining several composing blocks of
an IPA program. We also define the modalities and tasks that
6. Annotator B recorded the ten tasks using Ubuntu 16.04. are part of IPA, demo2text, demo2process, text2process and
process2text, and we present how these tasks relate to simi-
7. Annotator D wrote a summary of the task being per- lar research areas.
formed, a step by step description (for every video seg- With the formalisation, it was also possible to define ap-
ment) and a program capable of executing the task auto- propriate metrics for each one of the tasks, evaluating the
matically. resulting process or text. We also describe how we built a
dataset that can be used as a benchmark for the IPA tasks.
8. Annotator C verified and corrected the videos and the an- We envision that the research area in IPA will see signifi-
notations. cant progress in the next few years, following the increased
use of RPA tools. The formalisation present in this work is a [Oda et al. 2015] Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.;
starting step towards building complex IPA systems. Sakti, S.; Toda, T.; and Nakamura, S. 2015. Learning
While we have provided the formalisation, evaluation to generate pseudo-code from source code using statisti-
techniques and methodology for designing a benchmark, cal machine translation (t). In 2015 30th IEEE/ACM In-
there is an important next step in this research: joining to- ternational Conference on Automated Software Engineering
gether these contributions and creating a baseline for each (ASE), 574–584. IEEE.
one of the IPA tasks. [Pan et al. 2017] Pan, Y.; Yao, T.; Li, H.; and Mei, T. 2017.
Video captioning with transferred semantic attributes. In
Acknowledgements Proceedings of the IEEE conference on computer vision and
The authors would like to thank Jacques Cali and the Blue pattern recognition, 6504–6512.
Prism team for their support of this project. We also thank
[Papineni et al. 2002] Papineni, K.; Roukos, S.; Ward, T.;
the anonymous reviewers for the valuable feedback.
and Zhu, W.-J. 2002. Bleu: a method for automatic evalua-
tion of machine translation. In Proceedings of the 40th an-
References nual meeting on association for computational linguistics,
[807 2017] 2017. Ieee guide for terms and concepts in intel- 311–318. Association for Computational Linguistics.
ligent process automation. IEEE Std 2755-2017 1–16.
[Reddy, Harichandana, and T Alekhya 2019] Reddy, K.
[Aalst et al. 2018] Aalst, W. M.; Bichler, M.; Heinzl, A.; P. N.; Harichandana, U.; and T Alekhya, R. S. M. 2019. ; A
et al. 2018. Robotic process automation. Business & In- Study of Robotic Process Automation Among Artificial In-
formation Systems Engineering: The International Journal telligence; International Journal of Scientific and Research
of WIRTSCHAFTSINFORMATIK 60(4):269–272. Publications (IJSRP) 9(2) (ISSN: 2250-3153).
[Agostinelli, Marrella, and Mecella 2019] Agostinelli, S.; [Rohrbach et al. 2015] Rohrbach, A.; Rohrbach, M.; Tandon,
Marrella, A.; and Mecella, M. 2019. Research challenges N.; and Schiele, B. 2015. A dataset for movie description.
for intelligent robotic process automation. Workshop on In Proceedings of the IEEE conference on computer vision
Artificial Intelligence for Business Process Management and pattern recognition, 3202–3212.
(AI4BPM19)held in conjuction with the 17th Int. Confer-
ence on Business Process Management (BPM’19). Vienna, [Sales, Handschuh, and Freitas 2017] Sales, J.; Handschuh,
Austria 1-6 September, 2019. S.; and Freitas, A. 2017. Semeval-2017 task 11: End-
user development using natural language. In Proceedings
[Aguirre and Rodriguez 2017] Aguirre, S., and Rodriguez, of the 11th International Workshop on Semantic Evaluation
A. 2017. Automation of a business process using robotic (SemEval-2017), 556–564.
process automation (rpa): A case study. In Figueroa-Garcı́a,
J. C.; López-Santana, E. R.; Villa-Ramı́rez, J. L.; and Ferro- [Shi et al. 2017] Shi, T. T.; Karpathy, A.; Fan, L. J.; Hernan-
Escobar, R., eds., Applied Computer Sciences in Engineer- dez, J.; and Liang, P. 2017. World of bits: An open-domain
ing, 65–71. Cham: Springer International Publishing. platform for web-based agents. In Proceedings of the 34th
International Conference on Machine Learning-Volume 70,
[Beltramelli 2018] Beltramelli, T. 2018. pix2code: Gener- 3135–3144. JMLR. org.
ating code from a graphical user interface screenshot. In
Proceedings of the ACM SIGCHI Symposium on Engineer- [Torabi et al. 2015] Torabi, A.; Pal, C.; Larochelle, H.; and
ing Interactive Computing Systems, 3. ACM. Courville, A. 2015. Using descriptive video services to cre-
ate a large data source for video annotation research. arXiv
[Bose and van der Aalst 2011] Bose, R. J. C., and van der preprint arXiv:1503.01070.
Aalst, W. M. 2011. Analysis of patient treatment proce-
dures. In Business Process Management Workshops (1), vol- [Van Der Aalst et al. 2011] Van Der Aalst, W.; Adriansyah,
ume 99, 165–166. A.; De Medeiros, A. K. A.; Arcieri, F.; Baier, T.; Blickle, T.;
Bose, J. C.; Van Den Brand, P.; Brandtjen, R.; Buijs, J.; et al.
[Cheng et al. 2018] Cheng, J.; Hsu, H.-K.; Fang, C.; Jin, H.;
2011. Process mining manifesto. In International Confer-
Wang, S.; and Yang, M.-H. 2018. Learning from photoshop
ence on Business Process Management, 169–194. Springer.
operation videos: The psov dataset. In Asian Conference on
Computer Vision, 223–239. Springer. [van Dongen 2015] van Dongen, B. F. 2015. Bpi challenge
2015. In 11th International Workshop on Business Process
[Gatt and Krahmer 2018] Gatt, A., and Krahmer, E. 2018.
Intelligence (BPI 2015).
Survey of the state of the art in natural language generation:
Core tasks, applications and evaluation. Journal of Artificial [Wang et al. 2004] Wang, Z.; Bovik, A. C.; Sheikh, H. R.; Si-
Intelligence Research 61:65–170. moncelli, E. P.; et al. 2004. Image quality assessment: from
error visibility to structural similarity. IEEE transactions on
[Lin et al. 2018] Lin, X. V.; Wang, C.; Zettlemoyer, L.; and
image processing 13(4):600–612.
Ernst, M. D. 2018. Nl2bash: A corpus and semantic parser
for natural language interface to the linux operating system. [Zhong, Xiong, and Socher 2017] Zhong, V.; Xiong, C.; and
arXiv preprint arXiv:1802.08979. Socher, R. 2017. Seq2sql: Generating structured queries
[Ling et al. 2016] Ling, W.; Grefenstette, E.; Hermann, from natural language using reinforcement learning. arXiv
K. M.; Kočiskỳ, T.; Senior, A.; Wang, F.; and Blunsom, P. preprint arXiv:1709.00103.
2016. Latent predictor networks for code generation. arXiv
preprint arXiv:1603.06744.