
Using Text Annotation Tool on Cyber Security News – A Review

Mohamad Syahir Abdullah
School of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
msyahir38@utm.my

Anazida Zainal
School of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
anazida@utm.my

Mohd Aizaini Maarof
School of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
aizaini@utm.my

Mohamad Nizam Kassim
Cyber Security Responsive Division, CyberSecurity Malaysia, Seri Kembangan, Selangor, Malaysia
nizam@cybersecurity.my

Abstract— Cyber-attacks have become one of the main concerns in everyday life and are reported widely on online news websites. With thousands of news articles available, it is difficult to go through them all, which slows down analysis. Hence, a vital text mining component known as Information Extraction (IE) is needed to ease knowledge discovery across the wide collection of cyber security news. Tools such as the General Architecture for Text Engineering (GATE) can make the IE process better and easier, especially when creating an annotated corpus. In this paper, we introduce and review several freely available annotation tools and discuss the steps needed to create an annotated corpus for cyber security text documents.

Keywords— Cyber-attacks, online news, Information Extraction (IE), annotation tool, GATE

I. INTRODUCTION

Cyber-attack incidents have become a normal occurrence in cyberspace. From time to time, more damage and loss are inflicted on our cyber properties and information technologies [1]. As the number of incidents increases, people need to start taking precautionary steps to avoid becoming victims of cyber-attacks. At the same time, the fast growth of the internet has resulted in better dissemination of information through online news platforms. A lot of information, especially on cyber-attack incidents, can spread instantly across the globe. However, with thousands of news items available online, it becomes more difficult for people to find the correct and precise information they need [2].

The difficulty is greater when dealing with the different news formats used by online websites, especially unstructured text such as news articles [18]. News articles are usually written in a lengthy and descriptive manner. Although analyzing and processing a lengthy document takes a lot of time, it gives us the opportunity to highlight or extract much of the valuable information in the text through the annotation process. The annotated text can then be used as training examples in text analytics. However, humans have traditionally worked with the text manually, for example by annotating it with labels, which has proved to be a tedious, error-prone and time-consuming process [3].

Fortunately, advances in IE over the past few years have helped to ease the text mining process. IE is important for extracting information from unstructured data or text. With the support of natural language processing (NLP), IE helps to convey that information in a form more understandable to humans and can locate instances related [4] to the cyber security domain in the pile of available documents. One important IE task is Named Entity Recognition (NER), which is usually used to locate the targeted information to be extracted. NER is also used to classify named entities into categories such as person, location, organization, date and time. However, the performance of NER is easily affected by the absence of good training data.

Creating a good dataset for training purposes starts with data collection. The collected data should then undergo a process called annotation. Text annotation is a way of interacting with the text, usually by marking or highlighting some part of it. It works by underlining a piece of information that falls within our area of focus and is significant for conveying the meaning behind the text. Tedious as manual annotation can be [3], it helps a lot in creating a corpus for training purposes. Thus, to aid this process, annotation tools that are more precise and can be run automatically have been introduced and have helped many researchers complete their work.

Some of the annotation tools are open source. One of the most commonly used is the General Architecture for Text Engineering (GATE). There are also other tools such as the brat rapid annotation tool (BRAT), WARP-Text and so on. These tools are used quite often in text mining tasks, especially during data annotation. GATE, for example, contains different kinds of plugins related to NLP components such as POS taggers, Named Entity

Authorized licensed use limited to: University of Exeter. Downloaded on June 23,2020 at 01:32:47 UTC from IEEE Xplore. Restrictions apply.
Recognition (NER) and so on. GATE also provides a friendly interface that lets us mark the parts of the text that represent our perceptions and ideas. BRAT, meanwhile, is an intuitive web-based tool that is also supported by NLP and offers an interface that helps non-technical users annotate better [3].

This paper is structured as follows. The next section discusses related previous work and several of the tools used in it (GATE, BRAT, WARP-Text, etc.). The subsequent section mainly discusses how to use the GATE tool, covering the process from data collection to annotation. Finally, the last section contains the conclusion of this paper and some future work that can be done.

II. RELATED WORKS

Annotation tools can provide immense help to researchers and save a lot of the time and cost of annotating. Using an annotation tool can help us avoid unintended mistakes compared to traditional human annotation, which is prone to errors and incompetency [5]. In this section, several annotation tools that are popular among researchers, such as GATE, BRAT and other tools, are discussed.

A. GATE
This annotation tool is freely available and regarded as an open-source language engineering infrastructure [6]. One of the many purposes of GATE is to annotate text documents, and it is not limited to the English language. GATE uses Unicode throughout and has been used to annotate many other languages [8] [9], such as Indic languages, where corpora of millions of words have been created for the American National Corpus and EMILLE [7]. GATE also specializes in extracting specific knowledge from unstructured text obtained from the Web. In [5], ontologies of the targeted knowledge were combined with GATE for an extended extraction process on the Semantic Web. In that work, GATE helped to identify the parts of speech (subject, verb, object, etc.) in the documents and to determine named entities such as person, date and location [5]. Furthermore, GATE supports a variety of formats for the input and output of the documents it processes. It can handle multiple formats [10] such as XML [11], HTML, SGML, plain text and so on; when loaded, the input text is converted into a GATE document. GATE also provides process automation and reduces deployment complexity; in [12], collaborative corpus annotation facilities were created in a Web browser using an extension of GATE.

B. BRAT
This tool is an intuitive web-based annotation tool developed for annotating richly structured text, with various NLP tasks built in [3]. It provides a better interface that lets even a non-technical user navigate and complete text mining tasks. Figure 1 shows the annotations that can be made and the relationships formed between the entities.

BRAT has been associated with the biomedical domain and works with various text formats [13] [14]. It is implemented in both Python and JavaScript and is based on a client-server architecture, and it provides services such as named entity annotation, dependency syntactic annotation, and binary and n-ary relation annotation.

Figure 1: Example of BRAT Interface [3]

C. OTHERS
Another open-source annotation tool is WARP-Text, which is also web-based. This tool is designed for annotating multiple layers of text and creating relationships among them at different levels of granularity [15], and it is helpful for the task of annotating atomic paraphrases. One more tool used in the annotation process is WordFreak, created for many kinds of text documents and able to adapt to new tasks, which are defined through new Java classes. In [16], this tool was used to construct an annotated corpus to support biomedical information extraction.

III. ANNOTATION SCHEME
The process starts with data collection from cyber security news websites. Reliable news websites such as Bleeping Computer, The Hacker News, ThreatPost, Security Week and so on can be used for gathering the data [18]. Another method is to subscribe to a news provider. Recorded Future.com, through its Cyber Daily section, provides this kind of service [17] to subscribers: a notification email containing sources of cyber security news is sent daily.

Figure 2: Part of the Email from Recorded Future.com
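
As a minimal sketch of this collection step, the Python snippet below pulls article links out of the plain-text body of a notification email and keeps only those that point to an allow-list of news sites. The domain names, the sample email text and the helper name extract_news_links are assumptions made for illustration; they are not taken from the Cyber Daily service or from any tool reviewed here.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list: domains assumed for the news sites named in the text.
TRUSTED_DOMAINS = {
    "bleepingcomputer.com",
    "thehackernews.com",
    "threatpost.com",
    "securityweek.com",
}

# A deliberately simple URL pattern; real emails may need a sturdier parser.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+")


def extract_news_links(email_body, trusted=TRUSTED_DOMAINS):
    """Return unique links from the email body whose host is on the allow-list."""
    links = []
    for match in URL_PATTERN.findall(email_body):
        url = match.rstrip(".,;")  # drop punctuation that trails a link in prose
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in trusted and url not in links:
            links.append(url)
    return links


# Made-up email text, used only to exercise the function.
sample_email = """Cyber Daily (made-up sample text)
New ransomware campaign: https://www.bleepingcomputer.com/news/security/example-story/
Patch released: https://thehackernews.com/2019/01/example.html.
Unrelated link: https://example.org/newsletter
"""

for link in extract_news_links(sample_email):
    print(link)
```

Filtering by host name rather than by substring keeps look-alike addresses (for example, a trusted name appearing only in the path of another site) out of the collected list.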

The links will redirect us to reliable news sources. This method reduces the time spent manually looking for any particular incident that happened on that day.

The next step is to convert the content of the webpages into an easily readable and processable form. In this case, we convert it into plain text. There are two ways to do so. The first is to use an open-source online tool called Textise {https://www.textise.net/}, which removes everything except the text, including images, videos, HTML tags and so on. The text obtained through this method is then saved into a .txt file. The second is to use a Python application called Goose to extract the main body of the news article, the title of the article, and the metadata of any videos or images contained in the webpage. Goose only needs to read the required news link before extracting the required text, stripping out any HTML tags and unnecessary elements. Note, however, that some webpages have multiple layers of extra protection that prevent both of these methods from scraping the text.

A. Loading the GATE Application
On startup, GATE requires you to load a new resource, as by default it contains none. By clicking the Language/Processing Resource window, you can select either ‘new’ to create a whole new session or ‘open’ to open an existing session. Next, load the text file into GATE by clicking ‘GATE Document’. A new window will appear containing parameters that can be selected. To load a file, click the folder symbol to the right of ‘SourceURL’ (refer to the circled area in Figure 3) to choose the location of the file you need. An ‘Open file’ dialog will then open so that you can choose where your file resides; click ‘OK’ to load the text file into GATE. Then, create a GATE Corpus by following the same steps and select the loaded document.

Figure 3: 'SourceURL' Box for Loading the Text

B. Start Annotating
Select the text file that has been loaded. You can directly annotate any piece of information manually, or use a built-in GATE application to help your annotation process. One of the most commonly used is ANNIE, which consists of several main processing resources such as a tokenizer, gazetteer, sentence splitter, POS tagger, NER and so on. ANNIE can be loaded via the ‘Load ANNIE’ icon on the top bar, beside the folder icon. You can see ANNIE being loaded in the GATE panel under the ‘processing resources’ section. To run ANNIE on our text, simply select ANNIE under the ‘Application’ section and click it so that it is shown in the main editor. Before running the application, make sure to select the Corpus that needs to be annotated in the ‘Corpus’ area. Click ‘Run the Application’ at the bottom of the editor. The application will take some time to run depending on the size of your corpus.

View the documents in the corpus. You will have the option to select the existing annotations loaded from ANNIE. To add a new annotation, click ‘Annotation Sets’ and begin highlighting the word that needs to be annotated. Hold the cursor on that word for a few seconds until a pop-up window appears. Replace ‘_NEW_’ with an appropriate tag to describe the annotation that has been made. Repeat the process for any new annotation label, or load an annotation XML into GATE.

C. Viewing and Saving the Annotation
Annotations made automatically by the ANNIE application can be viewed by accessing ‘Annotation Sets’ at the top bar of the editor, which displays a side frame listing all available annotations such as Person, Date, Location and so on. Clicking on any of these will display the related words, as shown in Figure 4 (‘Date’ is selected and GATE highlights any related words it detected).

Figure 4: Example when selecting one of the Annotation

Lastly, work in GATE can be saved by two methods: through a Data Store (suitable for working with a large corpus) or by saving the annotated text as XML. A Data Store needs to be created first by selecting ‘Data Store’ and clicking ‘Create datastore’ to choose a new empty folder. Any corpus or individual document can be saved in the Data Store, where it is easier to retrieve when we want to continue with our annotation. Next, to save into
XML form, simply select the document or corpus and choose the ‘save as XML’ option; the data is saved in the following format (Figure 5, as viewed using an XML reader), which contains all the tags made before. This data can later be used directly as a training dataset.

Figure 5: Example of XML format generated

IV. CHALLENGES AND LIMITATION
Even though the tool helps people annotate their dataset faster and makes it more manageable to comprehend, it still has some challenges and limitations. First, learning to use the tool itself is quite difficult for those who are new to the area, as few manuals and tutorials are available online. People need to spend some time reading the published papers in order to fully understand how to use the tool effectively. Next, the learning curve becomes steeper when you want to use different modules in your annotation. You will need to experiment by trial and error to see how the process actually works, which costs a lot of time. Finally, when dealing with a large dataset, preparing the annotation still consumes considerable time, as most of the tools are not yet fully automatic. It is advisable to have multiple people work together on the annotation process to achieve the end product in a short period.

V. CONCLUSION
The availability and ease of use of open-source annotation tools can help a lot of work in text mining, especially in the cyber security domain. As thousands of new articles are published every day, we certainly need a faster way to process all the information. By using an annotation tool, Information Extraction tasks such as data annotation and NER can be accomplished in less time and with better, less error-prone annotation. The next step in text mining can then be reinforced by the good training data obtained from the annotation tool.

ACKNOWLEDGMENT
This research is funded by CyberSecurity Malaysia under strategic collaboration with Cyber Threat Intelligence Lab, School of Computing, UTM.

REFERENCES

[1] Zhao, Z., Ahn, G.J., Hu, H. and Mahi, D., 2012, September. SocialImpact: systematic analysis of underground social dynamics. In European Symposium on Research in Computer Security (pp. 877-894). Springer, Berlin, Heidelberg.
[2] Shabat, H., Omar, N. and Rahem, K., 2014, December. Named entity recognition in crime using machine learning approach. In Asia Information Retrieval Symposium (pp. 280-288). Springer, Cham.
[3] Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S. and Tsujii, J.I., 2012, April. BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 102-107). Association for Computational Linguistics.
[4] Thompson, P., Iqbal, S.A., McNaught, J. and Ananiadou, S., 2009. Construction of an annotated corpus to support biomedical information extraction. BMC bioinformatics, 10(1), p.349.
[5] Alani, H., Kim, S., Millard, D.E., Weal, M.J., Hall, W., Lewis, P.H. and Shadbolt, N.R., 2003. Automatic ontology-based knowledge extraction from web documents. IEEE Intelligent Systems, 18(1), pp.14-21.
[6] Cunningham H, Maynard D, Bontcheva K, Tablan V 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02). Philadelphia, US.
[7] Baker P, Hardie A, McEnery A, Cunningham H, Gaizauskas R. 2002. EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), pages 819-825.
[8] Gamback B, Olsson F 2000. Experiences of Language Engineering Algorithm Reuse. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), pages 155-160, Athens, Greece.
[9] Pastra K, Maynard D, Hamza O, Cunningham H, Wilks Y 2002. How feasible is the reuse of grammars for Named Entity Recognition? In Proceedings of 3rd Language Resources and Evaluation Conference (LREC’02), Gran Canaria, Spain.
[10] Bird S, Liberman M 1999. A Formal Framework for Linguistic Annotation. Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania. http://xxx.lanl.gov/abs/cs.CL/9903003.
[11] Ide N, Bonhomme P, Romary L 2000. XCES: An XML-based Standard for Linguistic Corpora. In Proceedings of the Second International Language Resources and Evaluation Conference (LREC), pages 825-830, Athens, Greece.
[12] Cunningham, H., Tablan, V., Bontcheva, K. and Dimitrov, M., 2003. Language engineering tools for collaborative corpus
annotation. In Proceedings of Corpus Linguistics (Vol. 2003).
[13] T. Ohta, S. Pyysalo, J. Tsujii, and S. Ananiadou. 2012. Open-domain anatomical entity mention detection. In Proceedings of DSSD.
[14] M. Neves, A. Damaschun, A. Kurtz, and U. Leser. 2012. Annotating and evaluating text for stem cell research. In Proceedings of BioTxtM.
[15] V. Kovatchev, M. Antònia Martí, and Maria Salamó. 2018. ETPC - a paraphrase identification corpus annotated with extended paraphrase typology and negation. In Proceedings of LREC-2018.
[16] Thompson, P., Iqbal, S.A., McNaught, J. and Ananiadou, S., 2009. Construction of an annotated corpus to support biomedical information extraction. BMC bioinformatics, 10(1), p.349.
[17] Kallus, N., 2014, April. Predicting crowd behavior with big public data. In Proceedings of the 23rd International Conference on World Wide Web (pp. 625-630). ACM.
[18] Abdullah, M.S., Zainal, A., Maarof, M.A. and Kassim, M.N., 2018, November. Cyber-Attack Features for Detecting Cyber Threat Incidents from Online News. In 2018 Cyber Resilience Conference (CRC) (pp. 1-4). IEEE.
