
Available online at www.sciencedirect.com
ScienceDirect
www.elsevier.com/locate/procedia
Procedia Computer Science 117 (2017) 23–29

3rd International Conference on Arabic Computational Linguistics, ACLing 2017, 5-6 November
2017, Dubai, United Arab Emirates

A collocation extraction tool and two language resources for MSA


Abdulmohsen Al-Thubaity*, Ibtehal Baazeem
The National Centre for Computer Technology and Applied Math, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia

Abstract

Collocation extraction from corpora, whether complete or according to specific criteria, plays a significant role in
computational linguistics, corpus linguistics, and natural language processing. In this paper, we present Musaheb, an
Arabic collocation extraction tool that has been designed and implemented to overcome the limitations of existing
collocation extraction tools. Musaheb can extract n-gram collocations (for n ≤ 5) in addition to extracting the collocates
of specific word types within a window size of zero to 15 words. Moreover, it provides eight collocation statistics that
may be used to calculate collocation strength, and permits the input of various constraints during node selection (that
is, the determination of the word or phrase whose collocates we wish to search for) and collocate extraction. Based on
the user preferences for the node, concordance, and collocate selection, Musaheb saves all nodes and their associated
collocates in an XML file, allowing easy conversion to different formats. Two language resources for Modern
Standard Arabic developed via Musaheb are presented in this paper: 1) a 2-gram language model, and 2) a 2-gram
collocation list based on an Arabic newspaper corpus comprising more than 20 million words.

© 2017 The Authors. Published by Elsevier B.V.


Peer-review under responsibility of the scientific committee of the 3rd International Conference on Arabic Computational
Linguistics.

Keywords: Collocation extraction; computational linguistics; collocation measures; Arabic language resources; language modelling

* Corresponding author. Tel.: +966114814148; fax: +966114814453. E-mail address: aalthubaity@kacst.edu.sa
ISSN 1877-0509. DOI: 10.1016/j.procs.2017.10.090

1. Introduction

Collocation is an important linguistic phenomenon, present in all languages, in which the occurrence of a word
prompts the occurrence of other specific words in a way that is context-dependent and non-random. For example,
native English speakers are more likely to say ‘strong tea’ than ‘powerful tea’, whereas native Arabic speakers are
more likely to say ‘شاي ثقيل’ than ‘شاي قوي’.
The phenomenon of collocation garnered the attention of the linguistics community when Firth (1957) defined it
as ‘words that frequently occur together’, or ‘the company that words keep’ [1]. Although technology has significantly
reduced the need to solely rely on intuition, at the time of Firth’s definition, the linguist’s intuition was all that was
used to identify word collocates [2].
Building on Firth’s notion of collocation, corpus linguistics researchers have over time developed a quantitative
definition of collocation. For instance, Hoey defined collocation as the ‘relationship a lexical item has with items that
appear with greater than random probability in [textual] context’ [3]. Textual context can refer to an entire text, a
section of a text, a paragraph, a window of words before and after the node (i.e., the word or phrase whose collocates
are being searched for), or the word immediately following the node.
Several statistical measures have been proposed to extract collocations from corpora; their implementation was
made possible by the availability of large corpora in electronic formats, and the increased computational capabilities
of a new generation of computers. Example measures include mutual information (MI), log-likelihood (LL), t-score,
chi-squared ($\chi^2$), and log-Dice.
Collocation information extracted from corpora has revealed its importance in domains such as dictionary
construction [4], data-driven language learning [5], second language acquisition [6], language for special purposes
[7], computational linguistics [8], computational psycholinguistics [9], automatic term extraction [10], and automatic
question-answering [11].
Several online corpus websites and standalone corpus processing systems provide different sets of collocation
measures for extracting the collocates of a single word. However, this one-word-at-a-time process is impractical for
researchers who need to extract the collocates of hundreds of words, or the collocates of words that meet certain
conditions, as it would be exceedingly time-consuming and tedious.
To overcome this problem, various systems have been introduced to improve the efficiency of collocation
extraction from corpora, such as Collocate, Collocation Extract, and TermeX [12]. However, these systems have
several limitations:
a. They do not support the Arabic language.
b. They focus on extracting n-grams and are unable to extract collocations within a given window.
c. They implement a limited number of collocation measures.
In this paper, we introduce a new collocation extraction system, Musaheb, that has been designed and implemented
to overcome the limitations of existing collocation extraction systems. The following section describes the architecture
and features of the proposed system. Musaheb’s output can be used in a variety of applications, such
as visualising collocation networks, extracting the syntactic patterns of collocations, and extracting multiword terms.
Musaheb was used to develop a 2-gram language model and a 2-gram collocation list for Modern Standard Arabic
(MSA) based on a 20-million-word corpus of Arabic newspapers (see Section 3).

2. System description

Musaheb is a free and open-source Java-based system (available at https://sourceforge.net/projects/musaheb/) that can
run on virtually any machine or operating system with a JVM installed. It features an Arabic/English graphical user
interface (GUI) for ease of use, and supports txt files in UTF-8 encoding. Corpus texts can be loaded from a single
folder with several sub-folders or from several folders. Musaheb provides general statistics regarding the total number
of tokens, types, and files.
Musaheb provides five criteria for choosing the node: (a) all word types in the corpus, (b) all word types in the
corpus except those on an exclusion list, (c) word types provided via an inclusion list, (d) word types that have a
frequency above or equal to a threshold value, and (e) word types that have a frequency between two threshold
values. Further, the system allows the user to choose a single word or an n-gram (n ≤ 5) for the node. It also allows
the user to choose the collocation span, with a window size of up to 15 words before and after the node. This feature
allows the user to extract collocates that form n-grams (i.e., immediately following the node) or that fall within a
window specified by the user (i.e., the collocates can be located at any position before or after the node in the
window). As an example, to extract 2-gram collocations, the user would specify the node as a single word and set the
window size to 0 words before and 1 word after the node.
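The node selection and window mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Musaheb's actual Java implementation: the function names `select_nodes` and `collocates_in_window` are hypothetical, the corpus is assumed to be a flat list of whitespace-separated tokens, and sentence or document boundaries are ignored.

```python
from collections import Counter


def select_nodes(tokens, min_freq=1, max_freq=None, exclude=frozenset()):
    """Pick node word types by frequency range, minus an exclusion list (hypothetical helper)."""
    freq = Counter(tokens)
    return {
        word
        for word, count in freq.items()
        if count >= min_freq
        and (max_freq is None or count <= max_freq)
        and word not in exclude
    }


def collocates_in_window(tokens, node, before=0, after=1):
    """Count the tokens found up to `before` words left and `after` words right of each node hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo = max(0, i - before)
        hi = min(len(tokens), i + after + 1)
        # Visit every position in the span except the node position itself.
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Toy English corpus, made up for illustration only.
tokens = "we drink strong tea and strong coffee with strong tea".split()
nodes = select_nodes(tokens, min_freq=2)
bigrams = collocates_in_window(tokens, "strong", before=0, after=1)
wide = collocates_in_window(tokens, "strong", before=2, after=2)
```

With a span of 0 words before and 1 word after, the function reduces to 2-gram extraction; widening the span counts collocates at any position in the window.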
Musaheb extracts collocations via eight widely used collocation measures: chi-squared ($\chi^2$), the Dice coefficient
(DICE), log-DICE, LL, MI, t-score, Z-score, and the weirdness coefficient (W). Collocation strength is calculated based
on the contingency tables of observed and expected frequencies, shown in Tables 1 and 2, respectively. Equations 1 to 8
give the mathematical form of the collocation extraction measures.

Table 1. Contingency table of observed frequencies.

                 F (collocates)    F (other words)    Total
In window        O11               O12 = R1 - O11     R1 = O11 + O12
Not in window    O21 = C1 - O11    O22 = R2 - O21     R2 = N - R1
Total            C1                C2 = N - C1        N = R1 + R2

R1: sum of word frequencies within the window; C1: frequency of the collocate in the entire corpus; N: corpus size.

Table 2. Contingency table of expected frequencies.

                 F (collocates)       F (other words)      Total
In window        E11 = (R1 × C1)/N    E12 = (R1 × C2)/N    R1 = E11 + E12
Not in window    E21 = (R2 × C1)/N    E22 = (R2 × C2)/N    R2 = E21 + E22
Total            C1 = E11 + E21       C2 = E12 + E22       N = R1 + R2 (corpus size)
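As a rough illustration of how the two contingency tables can be derived from four counts, the sketch below builds both tables from O11, R1, C1, and N. The function name `contingency` and the example counts are hypothetical. It assumes that O21 is the frequency of the collocate outside the window, i.e. C1 - O11, so that the first column sums to C1 as in Table 2.

```python
def contingency(o11, r1, c1, n):
    """Return the observed and expected 2x2 tables as nested tuples (rows: in window, not in window)."""
    r2 = n - r1          # words outside the window
    c2 = n - c1          # occurrences of all other words
    o12 = r1 - o11       # other words inside the window
    o21 = c1 - o11       # collocate occurrences outside the window (assumption, see lead-in)
    o22 = r2 - o21       # other words outside the window
    observed = ((o11, o12), (o21, o22))
    expected = (
        (r1 * c1 / n, r1 * c2 / n),
        (r2 * c1 / n, r2 * c2 / n),
    )
    return observed, expected


# Made-up counts: the collocate occurs 30 times in a window of 1,000 words,
# and 200 times overall in a corpus of 100,000 words.
observed, expected = contingency(o11=30, r1=1000, c1=200, n=100000)
```

Both tables share the same row and column totals, which gives a quick sanity check on the counts.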

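$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{1}$$

$$\mathrm{DICE} = \frac{2\,O_{11}}{R_1 + C_1} \tag{2}$$

$$\mathrm{logDICE} = 14 + \log_2\!\left(\frac{2\,O_{11}}{R_1 + C_1}\right) \tag{3}$$

$$\mathrm{LL} = 2 \sum_{i,j} O_{ij} \ln\!\left(\frac{O_{ij}}{E_{ij}}\right) \tag{4}$$

$$\mathrm{MI} = \log_2\!\left(\frac{O_{11}}{E_{11}}\right) \tag{5}$$

$$t = \frac{O_{11} - E_{11}}{\sqrt{O_{11}}} \tag{6}$$

$$Z = \frac{O_{11} - E_{11}}{\sqrt{E_{11}}} \tag{7}$$

$$W = \frac{O_{11}}{R_1} \bigg/ \frac{O_{21}}{R_2} \tag{8}$$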
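For readers who want to experiment with the eight measures, the following Python sketch computes them from the four basic counts. It is not Musaheb's code. It uses the commonly cited definitions of these measures, which are assumed here to match Equations 1 to 8. In particular, the natural logarithm in LL, the constant 14 in log-DICE, and the form of the weirdness coefficient (relative frequency inside the window divided by relative frequency outside it) are assumptions. The function name and example counts are hypothetical, and all four observed cells except possibly O12 and O22 are assumed to be non-zero.

```python
import math


def collocation_scores(o11, r1, c1, n):
    """Compute eight association scores from O11, R1, C1 and N (assumes o11 > 0 and c1 > o11)."""
    r2, c2 = n - r1, n - c1
    o12, o21 = r1 - o11, c1 - o11
    o22 = r2 - o21
    obs = (o11, o12, o21, o22)
    exp = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    e11 = exp[0]

    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    # Cells with an observed count of zero contribute nothing to the LL sum.
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    dice = 2 * o11 / (r1 + c1)

    return {
        "chi2": chi2,
        "dice": dice,
        "log_dice": 14 + math.log2(dice),
        "ll": ll,
        "mi": math.log2(o11 / e11),
        "t_score": (o11 - e11) / math.sqrt(o11),
        "z_score": (o11 - e11) / math.sqrt(e11),
        "weirdness": (o11 / r1) / (o21 / r2),
    }


# Made-up counts, for illustration only.
scores = collocation_scores(o11=30, r1=1000, c1=200, n=100000)
```

Computing all eight scores from one shared pair of tables keeps the measures directly comparable for the same node and collocate.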

Musaheb allows the user to choose one of the eight collocation measures and filter the results based on four
criteria: (a) the minimum frequency of the collocate within the window, (b) the minimum frequency of the collocate
in the corpus, (c) the minimum document frequency of the node and collocate in the corpus, and (d) the minimum
value of the collocation score. It then saves the output in an XML file that provides (a) general statistics for the
corpus; (b) the user criteria; and (c) the node, node frequency, node collocates, collocate frequency, and the
corresponding collocation score, where nodes are listed by corpus frequency and collocates by collocation score.
Saving the output in XML format maximises flexibility for the user, as it can be easily converted to other formats and
allows the desired information to be extracted for further investigation or for use in applications such as
visualisation and mining.
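The four filtering criteria can be illustrated with a short Python sketch. The `Collocate` record, its field names, and the candidate values below are hypothetical and do not reflect Musaheb's internal data structures or its XML schema. Sorting by descending score is also an assumption about how collocates are ordered.

```python
from dataclasses import dataclass


@dataclass
class Collocate:
    word: str
    window_freq: int   # frequency of the collocate within the window
    corpus_freq: int   # frequency of the collocate in the whole corpus
    doc_freq: int      # number of documents containing node and collocate together
    score: float       # value of the chosen collocation measure


def filter_collocates(candidates, min_window=1, min_corpus=1, min_docs=1, min_score=0.0):
    """Keep candidates that meet all four thresholds, strongest score first."""
    kept = [
        c
        for c in candidates
        if c.window_freq >= min_window
        and c.corpus_freq >= min_corpus
        and c.doc_freq >= min_docs
        and c.score >= min_score
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Made-up candidates, for illustration only.
candidates = [
    Collocate("a", window_freq=12, corpus_freq=40, doc_freq=11, score=30.5),
    Collocate("b", window_freq=9, corpus_freq=40, doc_freq=11, score=90.0),
    Collocate("c", window_freq=15, corpus_freq=25, doc_freq=10, score=24.0),
    Collocate("d", window_freq=20, corpus_freq=100, doc_freq=30, score=23.9),
]
kept = filter_collocates(candidates, min_window=10, min_corpus=20, min_docs=10, min_score=24)
```

The thresholds in the usage example mirror the settings reported in Section 3.2.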

Figure 1 illustrates the GUI for Musaheb.

Fig. 1. Musaheb GUI.



3. Case study

To illustrate a direct application of the proposed system, we used Musaheb to build a collocation list for an Arabic
newspaper corpus as a representative sample of MSA. Such a list is a valuable resource for research in Arabic applied
linguistics, computational linguistics, and natural language processing. In the following sections, we describe the data
and criteria used to extract this collocation list, and then present our results.

3.1 Data

First, we collected a corpus comprising 63,103 texts from various Arabic newspaper websites originating in 11
Arab countries—Algeria, Bahrain, Egypt, Iraq, Jordan, Kuwait, Morocco, Saudi Arabia, Sudan, the United Arab
Emirates, and Yemen—on the topics of policy, economy, society, sport, religion, culture, health and science, and
technology.
To enrich the collocation list with syntactic information, the corpus was first segmented via the Stanford Arabic
segmenter [13] and then tagged using the Stanford Arabic part-of-speech (POS) tagger [14]. Following segmentation
and POS tagging, the corpus comprised 20,225,058 tokens and 63,103 word types. Table 3 lists the Stanford
Arabic POS tagger tag set [15].

Table 3. Stanford Arabic POS tagger tag set.

Tag     Meaning                                           Tag   Meaning
ADJ     Adjective                                         NNS   Noun, plural
CC      Coordinating conjunction                          NOUN  Noun
CD      Cardinal number                                   PRP   Personal pronoun
DT      Determiner                                        PRP$  Possessive pronoun
DTJJ    Adjective with the determiner ‘Al’                PUNC  Punctuation
DTJJR   Adjective, comparative with the determiner ‘Al’   RB    Adverb
DTNN    Noun, singular or mass with the determiner ‘Al’   RP    Particle
DTNNP   Proper noun, singular with the determiner ‘Al’    UH    Interjection
DTNNPS  Proper noun, plural with the determiner ‘Al’      VB    Verb, base form
DTNNS   Noun, plural with the determiner ‘Al’             VBD   Verb, past tense
IN      Preposition or subordinating conjunction          VBG   Verb, gerund or present participle
JJ      Adjective                                         VBN   Verb, past participle
JJR     Adjective, comparative                            VBP   Verb, non-3rd person singular present
NN      Noun, singular or mass                            VN    Verb, past participle
NNP     Proper noun, singular                             WP    Wh-pronoun
NNPS    Proper noun, plural                               WRB   Wh-adverb

3.2 Experimental settings

The following settings were applied to control the output of collocation extraction:
Node: single word
Node frequency: equal to or greater than 100
Collocation window: zero words before the node and one word after the node
Collocation measure: Log-likelihood
Collocation score: equal to or greater than 24 (equivalent to a 99.999999 confidence level)
Collocate frequency within window: equal to or greater than 10
Collocate frequency within corpus: equal to or greater than 20
Document frequency: equal to or greater than 10

3.3 Results
The system extracted 154,367 2-gram collocates for 14,280 nodes. This 2-gram list, along with its POS tagging
information, can be used as a language model for MSA. The 2-gram list was then manually filtered to retain only the
syntactic patterns that provide a complete meaning, and manually reviewed to eliminate POS tagging errors and 2-gram
collocates whose meaning was incomplete. The final result was a collocation list of 35,934 2-grams for MSA. Table 4
provides the ten most frequent syntactic patterns for 2-gram collocations in our corpus, with corresponding examples.
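A tally of syntactic patterns such as the one reported in Table 4 could be produced along the following lines. This is a minimal sketch, assuming the collocation list is available as pairs of (word, POS tag) tuples; the function name `pattern_frequencies` and the three sample pairs are illustrative only and are not the authors' procedure.

```python
from collections import Counter


def pattern_frequencies(tagged_bigrams):
    """Count (POS1, POS2) patterns over an iterable of ((word1, tag1), (word2, tag2)) pairs."""
    return Counter((tag1, tag2) for (_, tag1), (_, tag2) in tagged_bigrams)


# Three sample 2-grams with Stanford-style tags, for illustration only.
sample = [
    (("كرة", "NN"), ("القدم", "DTNN")),
    (("مجلس", "NN"), ("الأمن", "DTNN")),
    (("العام", "DTNN"), ("الماضي", "DTJJ")),
]
patterns = pattern_frequencies(sample)
top_pattern, top_count = patterns.most_common(1)[0]
```

Calling `patterns.most_common(10)` on the full list would give the ten most frequent patterns in descending order of frequency.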

Table 4. The ten most frequent syntactic patterns for 2-gram collocation in MSA.

POS1    POS2    Freq.   Examples
NN      DTNN    12307   عاصفة الحزم، كرة القدم، مجلس الوزراء، صاحب السمو، وزير الخارجية، وزارة الصحة، حقوق الإنسان، رئيس الوزراء، مجلس الأمن، وسائل الإعلام.
DTNN    DTJJ    9367    الشعب اليمني، التواصل الاجتماعي، العام الماضي، النيابة العامة، المدير العام، الرئيس اليمني، الاتحاد الأوربي، المدير الفني، القمة العربية، المنطقة الشرقية.
NN      JJ      4027    نافذة جديدة، جهة أخرى، وقت سابق، مرة أخرى، عدد كبير، مؤتمر صحفي، شكل عام، فترة طويلة، مصدر أمني، حل سياسي.
NN      DTNNS   2138    خادم الحرمين، خلال السنوات، اتخاذ الإجراءات، تقديم الخدمات، تكنولوجيا المعلومات، مختلف المجالات، محكمة الجنايات، تبادل الخبرات، مواجهة التحديات، مكافحة المخدرات.
DTNNS   DTJJ    2016    الحرمين الشريفين، الولايات المتحدة، القوات المسلحة، العمليات العسكرية، العلاقات العامة، الضربات الجوية، الجهات المعنية، السلطات المحلية، العلاقات الثنائية، القوات، الخدمات الصحية.
VBD     DTNN    1410    أضاف المصدر، قال الدكتور، تم الاتفاق، قال المتحدث، أكد الدكتور، تم الانتهاء، أوضح المصدر، أضاف البيان، تم التحفظ، أكد الوزير.
NNS     DTNN    1082    قوات التحالف، قوات الأمن، درجات الحرارة، طائرات التحالف، مؤسسات الدولة، قوات النظام، وجهات النظر، سيارات الإسعاف، منظمات المجتمع، شبكات التواصل.
NNS     JJ      699     تصريحات صحفية، كميات كبيرة، غارات جوية، تصريحات خاصة، اشتباكات عنيفة، هدفين نظيفين، سنوات طويلة، قوات برية، قوات موالية، دورات تدريبية.
DTNN    ADJ     433     اليوم السابع، المنطقة الرابعة، الفريق الأول، العالم الثالث، الجيش الثالث، الفترة الأولى، المنتخب الأولمبي، المؤتمر العاشر، الموسم الأول، القطاع الخاص.
NOUN    DTNN    358     بعض الدول، جميع الأطراف، بعض الأحيان، كل الأطراف، نصف النهائي، معظم الدول، أغلب الأحيان، نصف النهائي، قرابة الساعة، معظم الناس.

4. Conclusion

In this paper, we introduced Musaheb, a tool that performs comprehensive collocation extraction on corpora
according to user criteria, either where the collocate directly follows the node (n-gram) or where the
collocate is located within a predefined context window. Eight popular collocation measures were implemented to
score collocation strength. The proposed tool saves its results in an XML file to permit conversion to other formats,
thereby facilitating its use in various applications such as collocation network mining and visualisation,
multiword expression extraction, domain-specific collocation dictionary construction, and term extraction.
Furthermore, Musaheb was used to create two language resources for MSA: a 2-gram language model comprising
154,367 units and a 2-gram collocation list comprising 35,934 units, based on an Arabic newspaper corpus
comprising more than 20 million words.

References

[1] Firth JR. Papers in Linguistics 1934–1951. Oxford, UK: Oxford University Press; 1957.

[2] Ackermann K, Chen Y-H. Developing the academic collocation list (ACL)–A corpus-driven and expert-judged
approach. J Engl Acad Purp 2013;12(4):235–47.

[3] Hoey M. Patterns of Lexis in Text. Oxford, UK: Oxford University Press; 1991.

[4] Kallas J, Kilgarriff A, Koppel K, Kudritski E, Langemets M, Michelfeit J, Tuulik M, Viks Ü. Automatic generation
of the Estonian Collocations Dictionary database. Proc eLex 2015 Conf Sussex, UK, 2015;11–13.

[5] Rahimi M, Momeni G. The effect of teaching collocations on English language proficiency. Procedia Soc Behav
Sci 2012;31:37–42.

[6] Demir C. Lexical collocations in English: a comparative study of native and non-native scholars of English. J Lang
Linguist Stud 2017;13(1):75–87.

[7] Gulec N, Arif Gulec B. Lexical collocations (verb + noun) across written academic genres in English. Procedia
Soc Behav Sci 2015;182:433–40.

[8] Schulz P, Aziz W. Fast collocation-based Bayesian HMM word alignment. Proc 26th Int Conf Computational
Linguist Osaka, Japan, 2016.

[9] Durrant P, Doherty A. Are high-frequency collocations psychologically real? Investigating the thesis of
collocational priming. Corpus Linguist Linguist Theor 2010;6(2):125–55.

[10] Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M. Biomedical term extraction: overview and a new
methodology. Inf Retr J 2016;19(1–2):59–99.

[11] Barrera A, Verma R, Vincent R. SemQuest: University of Houston's semantics-based question answering system.
Proc Text Anal Conf (TAC) Gaithersburg, Maryland, USA, 2011.

[12] Delač D, Krleža Z, Šnajder J, Dalbelo Bašić B, Šarić F. TermeX: a tool for collocation extraction. Proc Int
Conf Intell Text Process Comput Linguist Mexico City, Mexico, 2009:149–57.

[13] Monroe W, Green S, Manning C. Word segmentation of informal Arabic with domain adaptation. Proc Arab
Comput Linguist Baltimore, Maryland, USA, 2014;2:206–11.

[14] Toutanova K, Klein D, Manning C, Singer Y. Feature-rich part-of-speech tagging with a cyclic dependency
network. Proc 2003 Conf North Am Chapter Assoc Comput Linguist Hum Lang Technol Edmonton, Canada,
2003;1:173–80.

[15] www.sketchengine.co.uk