and subsequently extract them. Most of the existing research has focused on extracting flows in given procedures or understanding the procedures in order to answer conceptual questions. Identifying and extracting multiple procedures automatically from documents of diverse formats remains a relatively less addressed problem. In this work, we cover some of this ground by: 1) providing insights on how structural and linguistic properties of documents can be grouped to define types of procedures, 2) analyzing documents to extract the relevant linguistic and structural properties, and 3) formulating procedure identification as a classification problem that leverages the features of the document derived from the above analysis. We first implemented and deployed unsupervised techniques, which were used in different use cases. Based on the evaluation in these use cases, we identified the weaknesses of the unsupervised approach and then designed an improved, supervised version. We demonstrate that our technique is effective in identifying procedures from big and complex documents alike, achieving an accuracy of 89%.

ACM Reference Format:
Shivali Agarwal, Shubham Atreja, and Vikas Agarwal. 2020. Extracting Procedural Knowledge from Technical Documents. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference'17, July 2017, Washington, DC, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

There is a constant endeavour to empower virtual agents, assistants and chatbots with more and more usable and actionable knowledge. This knowledge is typically present in documents like user manuals, troubleshooting documents, guides, books and so on. Traditionally, the ability to search relevant documents, or portions of them, has been the method to use this knowledge. Now, with virtual assistants, there is a need to develop capabilities where these agents can understand documents like a human would and suggest actions. Guided troubleshooting is an example of one such use case where the knowledge of problems and solutions in the form of procedures is central to enabling these cognitive assistants and chatbots.

In this work, we address the problem of procedure identification in documents of varying sizes, types and formats from the IT domain. Most of the existing research has focused on extracting flows from given procedures or understanding the entities and outcomes in the procedures to answer conceptual questions. However, extracting procedures automatically from documents other than webpages remains a relatively un-addressed problem. It is challenging, across domains, to automatically determine which parts of a document describe procedures and consequently extract them.

One of our initial thoughts was to use structural cues to spot procedures. For example, technical documents like manuals and IBM Redbooks (http://www.redbooks.ibm.com/) have lots of information expressed using enumerated lists. We did a small manual study to find how many of these lists actually express a procedure and found that only 30-40% of the lists were procedures; the rest were some other type of enumeration. Moreover, we also found a big set of troubleshooting documents where the procedures were not expressed using enumerated lists, but were nested structures. This is discussed more in Section 2.1. We concluded that relying solely on cues like enumerations is grossly insufficient.

One of the other main challenges is the language used to write procedures. While some documents use an imperative style of writing, other documents use a passive, obligatory or declarative style, as shown in Figure 1. The imperative style of expression is easy to identify as actionable. Since most sentences are of declarative form, a procedure written in non-imperative language is more challenging to identify. The dual challenge arises when imperatives are used to write non-procedures, as we see in many Redbooks.

Since we deal with various styles of documents, there is no standardization of how a procedure is likely to be written. In some documents, the procedure steps are written at the level of a section or sub-section heading, while in others they are part of the section content. Therefore, coming up with a generic solution is very challenging. The main challenge here is to disambiguate between a procedure goal and a procedure step. In some documents, headings are goals, while in others they are steps. There are also cases where a sequence of goals denotes a higher-level procedure, with each goal having its own procedure.
sublists, associated images etc. In the next step, the tree structure is created and traversed to group the text into chunks using the structural annotations in a manner that obeys the hierarchy, maintaining parent-child and sibling relationships. The chunks are formed as per these rules: a) each numbered/bullet list forms a chunk, b) section titles/headings at the same level of depth and having a common parent form a chunk, and c) paragraphs at the same depth level form chunks.

Chunk size is the total number of items in the case of list-type content, and the total number of sentences in the case of paragraph and heading type content. Note that each numbered/bullet item is considered one item in the list even if there are multiple sentences in that bullet. The context of each chunk comes from the parent sentence or the previous sibling. The notion of previous sibling is determined by order of occurrence in the document. These chunks are then analyzed to determine if they are procedures. The analysis is performed on each text item in the chunk as explained next.
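The grouping rules can be sketched as follows. This is a minimal illustration over a hypothetical Node type, not our Java implementation; rule c) is approximated by grouping paragraphs under a common parent, which fixes their depth.

```python
# Sketch of the chunk-formation rules: a) each numbered/bullet list is a
# chunk, b) sibling headings under a common parent form a chunk,
# c) paragraphs under a common parent (hence at the same depth) form a chunk.

class Node:
    def __init__(self, kind, text="", children=None):
        self.kind = kind              # "list", "item", "heading", "paragraph", ...
        self.text = text
        self.children = children or []

def chunks(root):
    """Traverse the document tree and group text into chunks."""
    result = []

    def visit(node):
        for child in node.children:
            if child.kind == "list":  # rule a)
                result.append([item.text for item in child.children])
        headings = [c.text for c in node.children if c.kind == "heading"]
        if headings:                  # rule b)
            result.append(headings)
        paragraphs = [c.text for c in node.children if c.kind == "paragraph"]
        if paragraphs:                # rule c)
            result.append(paragraphs)
        for child in node.children:
            visit(child)

    visit(root)
    return result
```

The chunk size described above is then simply the length of each resulting list.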
3.2 Linguistic Analysis

In this section, we describe the types of linguistic analyses performed on the document text. There are mainly three types of analyses performed: i) whether a sentence/statement denotes an action to be performed, ii) whether a chunk consists of sentences that are cohesive and related, and iii) whether the discourse style of a sentence denotes a goal.

3.2.1 Actionable Statements. Identifying whether a given step or sentence is an actionable statement is an important part of the overall procedure identification problem. Since procedures specify steps to solve a given problem or to accomplish a particular task, a large fraction of statements within a procedure are actionable statements, i.e., they contain a performable action (in addition to other statements that are just informational). The steps in a procedure are generally written in an imperative style, such as "do this" or "take this action", which directly conveys that an action is to be performed. However, they are also expressed in a declarative (or non-imperative) style, like "The user enters the password", or "The user logs into the system".

Imperative sentences are by default assumed to be actionable. We use the work in [6], which in turn is built on [9], for classifying a sentence as an imperative or a conditional. This work also advocates how to analyze a conditional sentence for 'condition' and 'effect' portions. We further classify the 'effect' part of the sentence as imperative or not.

We now describe a technique to identify non-imperative actionable statements, i.e., statements that describe some action to be performed but are written in a non-imperative style. We modeled the problem of actionable statement identification as a classification task. We identified two sets of features, 1) bag-of-words features and 2) linguistic features, that we found were relevant for identifying actionable statements.

One of the main identifiers of actionable statements is the words used in the sentence. We found that actionable statements use specific action words such as install, login etc., that convey that some action needs to be performed. Some of these action words are general English language words, whereas others are domain-specific. Rather than providing a list of these action words, we let the classifier learn them from the training data. For this, we computed the tf-idf value for each word in a sentence and used those as the corresponding feature values. Infrequent words (occurring in fewer than 3 sentences in the training data set) were removed, as they are generally proper nouns or domain-specific keywords that do not contribute to determining actionable statements. Since actionable words can vary from one domain to another, we may need to re-train our model with appropriate training data for a different domain.
To further improve the classification accuracy, we augmented our bag-of-words features with additional features derived from linguistic analysis of sentences. Based on our analysis of procedures found in various technical documents, we identified the following linguistic features that provide cues for identifying an actionable statement. Since the verbs in a sentence denote the action to be performed, all these features are based on the main verb(s) in the sentence.

Tense: The tense of a sentence has distinguishing ability since actionable statements are generally expressed in the present tense. The past tense is usually used to specify actions that have already happened.

Voice: The voice of a sentence tells whether it is expressed actively or passively. Generally, actionable statements are written in the active voice.

Polarity: The polarity of a sentence has distinguishing ability since actionable statements are usually expressed in a positive sense, whereas a negative sense is used to indicate something that should not be done or is not related to the task at hand.

The implementation details and evaluation of non-imperative actionable classification are provided in Section 4.1.
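The three verb-based cues can be illustrated with toy rules. Our implementation derives them from Stanford parser output; the suffix and keyword checks below only sketch the idea and are not reliable linguistic rules.

```python
# Toy stand-ins for the tense, voice and polarity features; an "-ed" suffix
# check approximates a past-participle PoS tag, and a small keyword set
# approximates negation detection.

AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being"}

def tense(tokens):
    # crude: treat any "-ed" form as evidence of past tense
    return "past" if any(t.endswith("ed") for t in tokens) else "present"

def voice(tokens):
    # crude: auxiliary "be" followed by an "-ed" form suggests passive voice
    for i, t in enumerate(tokens[:-1]):
        if t in AUXILIARIES and tokens[i + 1].endswith("ed"):
            return "passive"
    return "active"

def polarity(tokens):
    return "negative" if {"not", "no", "never"} & set(tokens) else "positive"

def verb_features(sentence):
    tokens = sentence.lower().split()
    return {"tense": tense(tokens), "voice": voice(tokens),
            "polarity": polarity(tokens)}
```

Under these rules, "Install the package" comes out present/active/positive (actionable-looking), while "The file was deleted" comes out past/passive.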
3.2.2 Relatedness Score. While all procedures have actionable statements, not every set of actionable statements can be called a procedure. A set of statements that corresponds to a procedure must have some degree of relatedness amongst them. This notion of relatedness is usually captured in the sequence of actions that are performed, in terms of the entities involved in the action and the agent performing the action. As an example, Figure 4 shows a basic example of how the set of entities is distributed across a sequence of statements representing a procedure.

Figure 4: Entities in a procedure

As demonstrated in previous literature [1, 4], entity-based representation can effectively model the local coherence of a discourse. For our task of modeling the relatedness between the statements of a procedure, we utilize an existing approach [4] that computes discourse coherence through an unsupervised graph-based representation. It involves constructing a bipartite graph using sentences and the entities present in them as the two sets of nodes. An edge between a sentence node and an entity node is added to the graph if the entity is present in that sentence. The edge is also assigned a weight, based on the grammatical role of the entity in the sentence (either subject, object or anything else). We then apply weighted one-mode projections to the sentence node set of the bipartite graph. The projection results in a directed weighted graph denoting the connections between different sentences, and the average out-degree of this graph is used to model the relatedness score. For a given pair of sentence nodes, their edge weight depends on their set of common entities and the distance between them.

Our approach differs from this existing approach in two ways. First, instead of using noun phrases as entities, we rely on a keyword extraction tool [11] to extract the set of entities from the sentences. Second, we utilize Semantic Role Labelling (SRL) [13] while assigning weights to the edges. After generating the SRL output for a sentence, we assign weights to the edges based on the argument (A0, A1, A2) to which the target entity belongs. Similar to the original approach, we employ the linear weighting scheme: the highest weight is assigned to the case where the entity belongs to A0 (subject), followed by A1 (object), and then A2 (indirect object). The final relatedness score is a real number, where a higher value indicates a higher degree of entity-based links between the statements.
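A minimal sketch of the projection and scoring follows. The entity/role sets and the role weights are hand-made placeholders for the keyword-extraction and SRL steps; they illustrate the mechanics, not our actual pipeline.

```python
# Sketch of the relatedness score: a sentence-entity bipartite graph
# projected onto sentences, scored by the average out-degree of the
# resulting directed weighted graph.

def relatedness(sentence_entities, role_weights):
    """sentence_entities: one set of (entity, role) pairs per sentence."""
    n = len(sentence_entities)
    out_degree = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):          # edges point forward in the text
            shared = ({e for e, _ in sentence_entities[i]}
                      & {e for e, _ in sentence_entities[j]})
            if shared:
                w = sum(role_weights[r] for e, r in sentence_entities[j]
                        if e in shared)
                out_degree[i] += w / (j - i)   # discount by sentence distance
    return sum(out_degree) / n             # average out-degree

ROLE_WEIGHTS = {"A0": 3, "A1": 2, "other": 1}  # linear weighting scheme
sentences = [
    {("user", "A0"), ("password", "A1")},
    {("user", "A0"), ("system", "A1")},
    {("system", "A0")},
]
score = relatedness(sentences, ROLE_WEIGHTS)   # higher = more related
```

In this toy example, adjacent sentences share a subject entity, so the score is positive; a list of unrelated actionable statements would score near zero.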
3.2.3 Goal Annotations. There are some discourse style cues that are useful in inferring a goal from a sentence. We have used two such predominant styles. The first is when a (sub)heading is stated using the present continuous tense of the verb as an opening word, e.g. "Creating a Service Instance"; then there is a high probability that this is a goal. Annotating such inferred goals acts as supporting evidence for the presence of a procedure in the ensuing text. Such a discourse style also qualifies as an actionable statement. Another key discourse style observed in troubleshooting documents is the use of "Method" as a prefix in (sub)headings.
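These two cues can be expressed as a simple check. Matching any opening word that ends in "-ing" is a toy stand-in for a real PoS-tag test of the present-continuous form.

```python
import re

# Heuristic goal detection on (sub)headings, following the two discourse
# cues above: an opening present-participle verb, or a "Method" prefix.

def is_goal_heading(heading):
    words = heading.split()
    if words and re.fullmatch(r"[A-Za-z]+ing", words[0]):
        return True                       # e.g. "Creating a Service Instance"
    return heading.startswith("Method")   # e.g. "Method 1: Reset the adapter"
```

Headings flagged this way are annotated as inferred goals and feed into the goal-based features described next.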
3.3 Feature Computation and Procedure Classification

The structural and linguistic annotations obtained on sentences and chunks are now used for feature engineering so that the chunks can be classified as procedures or not. The features for the classification model are computed for each chunk using the annotations at the document tree level, sentence level and chunk level. Table 1 provides the list of features used for the classification model. In this table, linguistic features are presented using the categories of actionable, relatedness and goal-based. The structural features are a category by themselves. There is another category of context-based features, which is a mix of structural (refer to Section 3.1) and linguistic analysis and hence kept separate. These features are computed at the chunk level and are thus modeled as aggregators.

It should be noted that not all features are static in nature. Features related to goals can get updated dynamically as the classifier is run from the bottom to the top of the document tree. For example, if a chunk is classified as a procedure, then this information needs to propagate to the parent node in order to refine, in real time, the features that use the knowledge of whether the children are of type procedure. In our model, there are two features, namely nInferredGoal and nNonActionableGoal, that use propagation-based knowledge. We therefore run the predicted-value-propagation-enabled classifier on the tree from the leaf chunks to the root, one level at a time. The goal-related features one level higher are updated dynamically as the classifier is run bottom-up. This primarily enables real-time refinement of available information as new information arrives.
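The bottom-up run with predicted-value propagation can be sketched as follows. Chunks are plain dicts, and "nChildProcedures" is a hypothetical counter standing in for propagation-based features such as nInferredGoal; the toy classifier rule is illustrative only.

```python
# Sketch of bottom-up classification with predicted-value propagation:
# children are classified first, and each procedural child refines the
# parent's features before the parent itself is classified.

def classify_bottom_up(node, classify):
    for child in node.get("children", []):
        classify_bottom_up(child, classify)
        if child["is_procedure"]:
            # a child's predicted label refines the parent's features
            node["features"]["nChildProcedures"] += 1
    node["is_procedure"] = classify(node["features"])

def toy_classify(features):
    # illustrative rule only: actionable content or a procedural child
    return features["nActionable"] > 0 or features["nChildProcedures"] > 0
```

A parent with no actionable content of its own can thus still be labeled a procedure once one of its children has been, which mirrors the sequence-of-goals case discussed in the introduction.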
4 IMPLEMENTATION AND EVALUATION

In this section, we describe the implementation details and evaluation of our procedure identification techniques, including the experiments performed.

4.1 Actionable Statements

For linguistic analysis, we used the Stanford natural language parsers [2, 8] to obtain the PoS tags, constituency parse tree, and dependency parse tree. This information was then used in building the three linguistic feature extractors. For the bag-of-words features, we wrote our own code in Java to compute the tf-idf values.

We manually labeled 394 sentences from different types of technical documents, like web pages, pdf documents, etc., as non-imperative actionable or not. These sentences were selected from procedural and non-procedural blocks within these documents. We randomly divided them into two sets, a training set of 258 sentences and a test set of 86 sentences, and generated feature vectors for these. For building the SVM classifier, we used the LIBSVM toolkit (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). We used the RBF kernel with 5-fold cross validation, parameter tuning, and feature scaling to build our model. We achieved an accuracy of 80.23% on our test set.
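The classifier setup (RBF kernel, feature scaling, cross-validated parameter tuning) can be reproduced in outline as below. scikit-learn's SVC, which wraps LIBSVM, and synthetic data stand in for our actual features and toolkit usage.

```python
# Sketch of the actionable-statement classifier setup: an RBF-kernel SVM
# with feature scaling and 5-fold cross-validated parameter tuning.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # stand-in feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in labels

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
```

Putting the scaler inside the pipeline ensures that scaling parameters are re-fit on each cross-validation fold, avoiding leakage during tuning.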
4.2 Procedure Classification

The framework for Sections 3.1, 3.2.1, 3.2.2 and 3.2.3 was implemented in Java. We used the available implementations for document conversion [15] and imperatives and conditionals [6]. The resulting output was a 'csv' file containing the static feature set shown in Table 1. The propagation and classification described in Section 3.3 were implemented using the scikit-learn toolkit in Python. We used the LinearSVC classifier with default settings.

We evaluated our proposed technique for procedure identification using two data sets: 1) technical support webpages, a.k.a. troubleshooting documents, and 2) product documentation in the form of pdf documents, i.e., IBM Redbooks. While documents in the first set are generally small in size, describing solutions for a particular problem and containing a small number of procedures, the product documentations are generally huge, up to a few hundred pages, and contain a large number of procedural and non-procedural blocks.

We processed 5 redbooks and 12 troubleshooting documents, extracted 5021 structural annotations from document conversion, identified 693 chunks for evaluation, and tagged them as procedures or non-procedures. Out of these, we used 2 redbooks for training the classifier. These 2 redbooks had 259 chunks, out of which we selected 251 for training data. This left us with 434 test candidates: 290 from the 3 remaining redbooks and 144 from the 12 troubleshooting documents. Out of these, 120 were actually procedures. We ensured that there were representatives of all linguistic and structural types L1, L2, S1, S2, S3 (refer to Section 2.1) in the training set. Importantly, we ensured the presence of meaningful negative samples in both train and test data, for example, chunks that had actionable sentences but were still not procedures.

Table 2 gives the accuracy, precision and recall metrics for our procedure classification task. As can be seen, our procedure classification method was able to achieve good accuracy in identifying procedures for varied sets of documents.

Table 2: Procedure classification metrics

  Accuracy  Precision  Recall
  0.889     0.776      0.841

We further performed experiments to gain insights into the role of the different features that were used. This was done by selective feature elimination: we removed the feature(s) whose effect we wanted to study from the training data and the test data, and ran the classifier. The impact on accuracy metrics of removing the features corresponding to the different categories (Table 1) is shown in Table 3.

Table 3: Effect of Removing Features on Classification

  Feature Type   Accuracy  Precision  Recall
  Actionable     0.829     0.722      0.613
  Structural     0.834     0.644      0.882
  Relatedness    0.843     0.654      0.891
  Goal-based     0.822     0.623      0.890
  Context-based  0.882     0.768      0.808

Discussion. Table 3 shows that there is a constant trade-off between precision and recall when different sets of features are selectively removed. From the table, one can observe that both structural and linguistic features are important for identifying the procedures. While structural features are key to precision, the group of linguistic features plays a vital role in both precision and recall. Removing the actionable set of features results in a huge drop in recall, which is as expected, along with a decrease in precision as well. On the other hand, removing either the relatedness or the goal-based set of features results in a sharp decrease in precision accompanied by an increase in recall. The result is intuitive, as relatedness and goals are mainly used to filter out negative examples, i.e., ones that contain actionable items but do not constitute a procedure. Removing the structural group of features causes a sharp drop in precision. We found that the depth level, chunk size and number of associated images play an important role in ruling out false positives. The results also show that removing context-based features, which are derived using basic structural and linguistic properties of the document, also leads to a reduction in recall, although the impact is limited compared to the other features.
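The selective feature-elimination experiment can be outlined as below: drop one feature group from both the train and test data, retrain, and compare metrics. The group names, column indices and data are synthetic placeholders for the feature categories of Table 1.

```python
# Sketch of selective feature elimination over named feature groups.
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = (X[:, 0] - X[:, 3] > 0).astype(int)
groups = {"actionable": [0, 1], "structural": [2, 3], "relatedness": [4, 5]}

X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]
ablation = {}
for name, dropped in groups.items():
    # remove the group's columns from both train and test data, then retrain
    keep = [c for c in range(X.shape[1]) if c not in dropped]
    clf = LinearSVC().fit(X_tr[:, keep], y_tr)
    pred = clf.predict(X_te[:, keep])
    ablation[name] = (precision_score(y_te, pred, zero_division=0),
                      recall_score(y_te, pred, zero_division=0))
```

Comparing each group's precision/recall pair against the full-feature baseline gives the per-category impact reported in Table 3.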
5 RELATED WORK

There have been several works in the past in the area of procedural text, but most of them are focused on knowledge extraction from procedural and instructional text for building a knowledge base for information retrieval tasks. Zhang et al. [17] provide a technique for extracting procedural knowledge such as action, purpose, actee, instrument, etc. from instructional texts and representing it in a machine-understandable format for automation purposes. Jung et al. [7] describe an approach for building a situation ontology by automatically extracting tasks, actions and their related information from how-to instructions available in wikiHow and eHow on the Web. Park and Motahari Nezhad [12] have a similar aim of identifying relationships such as goals, tasks and sub-tasks, but within a procedure itself. In [16] the authors describe an approach
for improving the predicted effects of actions in a procedure by incorporating global commonsense constraints. All these works are mainly focused on a deeper understanding of procedural knowledge, in terms of the actions and entities involved, to build a machine-understandable knowledge base for answering questions related to it. But they do not deal with complete procedural text extraction from general documents, which is the main aim of our work.

Another body of work is in the area of action statement identification from text. Nezhad et al. [10] describe a system called eAssistant which identifies actionable statements such as a request or a promise from email and conversational text, for notifying users of important content that requires their attention. Similarly, Ryu et al. [14] describe a method for automatically detecting actionable clauses from how-to instructions for problem-solving tasks. While these works don't address the problem of complete procedural text extraction from documents, they address a small part of our overall task related to step identification, which requires us to first detect whether a given statement is actionable or not. We build upon and use some of the linguistic features described in these works for actionable statement identification in Section 3.2.1.

There has been very little work on complete procedural text extraction from knowledge documents. The closest that we could find is [5], where the authors describe a method for extracting procedures from web pages and further analyzing them to identify decision points for guided troubleshooting purposes. In [3], the authors provide a linguistic analysis of procedures and describe a technique for extracting procedures and their components for answering how-to questions where the response is a complete procedure. While the overall goal of these works is similar to ours, they only handle simple procedures that are expressed as a list in a webpage. They do not handle more complex procedure representations such as those expressed using section/sub-section headings or a hierarchy of procedures. These are more common in the technical documentation available as MS-Word or PDF documents that we are able to handle.

6 CONCLUSION

Procedures are an important actionable component of knowledge documents and are important for a variety of tasks such as enabling cognitive assistants and chatbots. In this paper, we described a very effective technique for automatically identifying and extracting procedures from documents of different sizes, types and styles. We first presented a structural and linguistic characterization of procedures found in documents, and then described techniques for analyzing documents both structurally and linguistically to extract these features. We then used a classification technique built on these structural and linguistic features to identify procedural chunks in documents. The evaluation showed that we were able to achieve good accuracy on different document types. This paper mainly focused on documents from the IT domain. As part of future work, we want to extend our technique to other domains like finance, compliance etc. and see if we can transcend the boundaries of domain and develop a generalized model.

In this paper we mainly focused on well-formed documents, i.e., those where procedure steps are properly formatted as section/sub-section titles or list items. In reality, we have found many documents that are badly formatted. For example, instead of using proper list-type formatting (bullets or numbers), the procedure steps are mentioned on separate lines using paragraph-style formatting, or numbering is manually added as part of the step text. Moreover, there can be a mix of both formatted and un-formatted steps in a procedure. These issues create additional challenges for the already complex task. We are currently exploring techniques to correctly identify the boundaries of procedures in the presence of formatting issues in the documents.

REFERENCES
[1] Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics 34, 1 (2008), 1-34.
[2] Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 740-750. http://aclweb.org/anthology/D14-1082
[3] Estelle Delpech and Patrick Saint-Dizier. 2008. Investigating the Structure of Procedural Texts for Answering How-to Questions. In LREC 2008. http://www.lrec-conf.org/proceedings/lrec2008/pdf/20_paper.pdf
[4] Camille Guinaudeau and Michael Strube. 2013. Graph-based local coherence modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 93-103.
[5] Abhirut Gupta, Abhay Khosla, Gautam Singh, and Gargi Dasgupta. 2018. Mining Procedures from Technical Support Documents. CoRR arXiv:1805.09780 (2018). http://arxiv.org/abs/1805.09780
[6] Abhirut Gupta and Anupama Ray. 2018. Semantic Parsing for Technical Support Questions. In Proceedings of the 27th COLING 2018.
[7] Yuchul Jung, Jihee Ryu, Kyung-min Kim, and Sung-Hyon Myaeng. 2010. Automatic Construction of a Large-scale Situation Ontology by Mining How-to Instructions from the Web. Web Semant. 8, 2-3 (July 2010), 110-124. https://doi.org/10.1016/j.websem.2010.04.006
[8] Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1 (ACL '03). Association for Computational Linguistics, Stroudsburg, PA, USA, 423-430. https://doi.org/10.3115/1075096.1075150
[9] M. McCord, J. W. Murdock, and B. K. Boguraev. 2012. Deep Parsing in Watson. IBM J. Res. Dev. 56, 3 (May 2012), 264-278.
[10] Hamid R. Motahari Nezhad, Kalpa Gunaratna, and Juan Cappi. 2017. eAssistant: Cognitive Assistance for Identification and Auto-Triage of Actionable Conversations. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 89-98. https://doi.org/10.1145/3041021.3054147
[11] Watson Natural Language Understanding (NLU). 2015. https://www.ibm.com/watson/services/natural-language-understanding/.
[12] Hogun Park and Hamid Reza Motahari Nezhad. 2018. Learning Procedures from Text: Codifying How-to Procedures in Deep Neural Networks. In Companion Proceedings of The Web Conference 2018 (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 351-358. https://doi.org/10.1145/3184558.3186347
[13] Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34, 2 (2008), 257-287.
[14] Jihee Ryu, Yuchul Jung, and Sung-Hyon Myaeng. 2012. Actionable Clause Detection from Non-imperative Sentences in How-to Instructions: A Step for Actionable Information Extraction. In Text, Speech and Dialogue. Springer Berlin Heidelberg, Berlin, Heidelberg, 272-281.
[15] Peter W. J. Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. 2018. Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale. In Proceedings of the 24th ACM SIGKDD (KDD '18). 774-782.
[16] Niket Tandon, Bhavana Dalvi, Joel Grus, Wen-tau Yih, Antoine Bosselut, and Peter Clark. 2018. Reasoning about Actions and State Changes by Injecting Commonsense Knowledge. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 57-66. http://aclweb.org/anthology/D18-1006
[17] Ziqi Zhang, Philip Webster, Victoria S. Uren, Andrea Varga, and Fabio Ciravegna. 2012. Automatically Extracting Procedural Knowledge from Instructional Texts using Natural Language Processing. In LREC.