Professional Documents
Culture Documents
Edited by
Jean-Pierre Colson
University of Louvain
doi 10.1075/ivitra.24
Cataloging-in-Publication Data available from Library of Congress:
lccn 2019057308 (print) / 2019057309 (e-book)
isbn 978 90 272 0535 3 (Hb)
isbn 978 90 272 6139 7 (e-book)
Foreword vii
Aline Villavicencio
Introduction 1
Gloria Corpas Pastor and Jean-Pierre Colson
Monocollocable words: A type of language combinatory periphery 9
František Čermák
Translation asymmetries of multiword expressions in machine translation:
An analysis of the TED-MWE corpus 23
Johanna Monti, Mihael Arcan and Federico Sangati
German constructional phrasemes and their Russian counterparts:
A corpus-based study 43
Dmitrij Dobrovol’skij
Computational phraseology and translation studies:
From theoretical hypotheses to practical tools 65
Jean-Pierre Colson
Computational extraction of formulaic sequences from corpora:
Two case studies of a new extraction algorithm 83
Alexander Wahl and Stefan Th. Gries
Computational phraseology discovery in corpora with the MWETOOLKIT 111
Carlos Ramisch
Multiword expressions in comparable corpora 135
Peter Ďurčo
Collecting collocations from general and specialised corpora:
A comparative analysis 151
Marie-Claude L’Homme and Daphnée Azoulay
What matters more: The size of the corpora or their quality?
The case of automatic translation of multiword expressions
using comparable corpora 177
Ruslan Mitkov and Shiva Taslimipoor
vi Computational Phraseology
Index 325
Empirical variability of Italian
multiword expressions as a useful feature
for their categorisation
Luigi Squillante
Sapienza – Università di Roma
https://doi.org/10.1075/ivitra.24.12squ
© 2020 John Benjamins Publishing Company
226 Luigi Squillante
indicating that the bond between the components for a multiword unit or a
lexical collocation can be formed by activating different kinds of restrictions, de-
pending on the considered grammatical pattern.
1. Introduction
In the linguistic tradition, the definition of word is still problematic because of the
complexity involved in the identification of its defining features. The concept of a
‘word’ is intuitively present in the speaker’s consciousness, yet it seems impossible
to provide a univocal definition because of the different levels of analysis that words
can undergo. In fact, the idea of unity that stands behind this concept is relative,
since the same linguistic material can form a unicum or a set of analysable parts,
depending on the perspective that one considers.1 The difficulty in delimiting the
concept of a word becomes that much more significant when one approaches phra-
seologisms, which show how expressions formed by two or more graphic words2
can operate as a unit or show specific bonds, placing certain constraints onto the
component words so that they are not completely free. In general, one refers to
these kinds of entities as multiword expressions (MWEs). According to the most
comprehensive definition given by Calzolari et al. (2002), a MWE is “a sequence of
words that acts as a single unit at some level of linguistic analysis”.
In Italian, examples of this kind of phenomena include common expressions
with unitary or idiomatic meaning such as luna di miele ‘honeymoon’ or pol-
lice verde ‘green thumb’, technical terms such as Presidente del Consiglio ‘Prime
Minister’ or anidride carbonica ‘carbon dioxide’, syntactic constructions as fintanto
che ‘as long as’, adverbs as alla bell’e meglio, lit. ‘at the beautiful and better’ meaning
‘in a mediocre way’, and also particular associations between lexemes that a native
1. Voghera (1994) states, though, that the debate on the concept of words is based on a number
of shared assumption, according to which some reliable criteria allowing the identification of
words do exist, although some of them are more effective than others: namely, uninterruptibility
(it is not possible to insert linguistic material inside a word), impossibility to move the compo-
nents (it is not possible to change the order of the morphemes), potential isolation (it is possible
to construct a statement by only using one word), potential break (it is always possible to insert
a pause, before or after the word). Among these criteria, only the first two seem to guarantee a
wide reliability in word identification. For a deeper analysis on the definition of word, see Di
Sciullo and Williams (1987), Lepschy (1989), Simone (1990), and Ramat (1990, 2005, 2016).
2. In our written tradition a graphic word is any sequence of characters between two blank spaces.
Empirical variability of Italian multiword expressions 227
speaker would recognise, such as prestare attenzione, lit. ‘to borrow attention’ trans-
lating ‘to pay attention’. Italian linguistic and lexicographic tradition (De Mauro
and Voghera, 1996; Voghera, 2004; De Mauro, 2007; Urzì, 2009; Lo Cascio, 2011;
Tiberii, 2012) has generally divided this indistinct set of expressions into two main
classes of entities that represent two opposite poles of a continuous spectrum.3
On one hand we have the unità polirematiche (multiword units), which include
expressions that need the co-occurrence of the components in order to convey the
expressions’ overall meaning (which is not inferable from the components taken
in isolation), and are often further characterised by opacity of meaning. Examples
of this kind of entity are the already-mentioned luna di miele, but also buco nero
‘black hole’, catena montuosa ‘mountain chain’ or fai da te ‘do it yourself ’. On the
other hand, we have expressions that exhibit some “special bond” in terms of lexical
choice and are characterised only by a preference for the co-occurrence of the com-
ponents. Such expressions are generally compositional and examples are prestare
attenzione ‘pay attention’, compilare un modulo ‘to fill out a form’, and capelli castani
‘brown hair’, where castani is a version of brown only used for hair and eyes in
Italian. The continuum between these two prototypical poles is full of expressions
that show different levels of cohesion or lexical preferences and it is often very dif-
ficult to allocate a given expression to one or the other of the two categories. The
present work aims at outlining a methodology that might be used to discriminate
between these expressions on the basis of their empirical behaviour as attested in
large corpora.
It is empirically evident (Burger, 1998; Wermter and Hahn, 2004) that MWEs
exhibit anomalous behaviours with respect to free combination of words, such
as beautiful house. While the latter are generally grammatically formed and un-
dergo all the transformations that grammar and lexicon allow, MWE anomalies
include several restrictions. In the case of Italian, the main features of anomalies
are listed below.
Non-grammaticality. Some expressions appear not to follow standard gram-
matical rules in their structure, such as the already mentioned alla bell’e meglio,
but also essere in forse, lit. ‘to be in maybe’ meaning ‘to be doubtful’, or prendersela
a male, lit. ‘to take it at bad’ meaning ‘to be offended’.
3. Historically, the terminology that has emerged in relation to the concept of phraseologisms
and MWEs is wide and ambiguous. For an overview of the English tradition see Williams (2003)
and Bartsch (2004). As for the Italian tradition, Masini (2007) provides an excellent survey.
228 Luigi Squillante
discriminate MWEs on the basis of their different nature, since this is not inferable
only from mere co-occurrence information. As an example, Table 1 shows the top
eight lemmatised noun-adjective bigrams extracted from the Italian corpus PAISÀ
(Lyding et al., 2014) using the well known log-likelihood measure (Dunning, 1993).
As one can see, expressions as anno successivo, whose components show a high
grade of freedom and transparency, appear alongside more cohesive and institu-
tionalised expressions such as colonna sonora and sistema operativo (which has a
much lower score). The use of statistics alone, then, does not seem to be sufficient
to account for the categorical continuum of MWEs.
So far, theoretical aspects related to the definition of MWEs, their typology and
behaviours arising from quantitative corpus-based studies have not been widely
explored. In this scenario, it would thus be interesting to identify other possible
procedures, apart from statistics, that could help to shed light on the nature and
categories of MWEs, while mantaining a quantitative approach.
Analyses performed on very few English patterns, such as phrasal verbs or
verb-object constructions, proved the usefulness of linguistic information in au-
tomatic procedures, since it has been shown that using AMs in combination with
semantic information can improve the automatic identification of MWEs (Lin,
1999; Bannard et al., 2003; McCarthy et al., 2003; Baldwin et al., 2003). Moreover, a
study by Wermter and Hahn (2004) showed how the anomalies in the behaviour of
MWEs are empirically relevant with respect to the behaviour of standard phrases:
in a corpus, indeed, the number of modified MWEs is much lower than the number
of modified standard phrases. Finally, Fazly and Stevenson (2007), in their case
study on the verb-object pattern in English, argued that syntactic and semantic
Empirical variability of Italian multiword expressions 231
modifications seem to be the most relevant axes of variation that can empirically
suggest a categorisation in different types of MWEs.
If we focus on linguistic information only, and accept the hypothesis according
to which MWEs have at least one anomalous behaviour in terms of modifications
they can undergo, then it is possible to examine: (1) the kinds of variations they
undergo, and (2) the magnitude of the modification in terms of percentage of mod-
ified expressions.
In order to perform such an analysis, it is necessary to have a large corpus
available, a list of expressions to be analysed and that can be extracted by the cor-
pus itself, and a system which is able to automatically check, for each expression
in the list, the occurrence of the expression in its basic form and that of its several
possible variants.
The corpus needs at least part-of-speech labelling for its tokens and must in-
clude references to lemmas for each of them. In case the corpus has been also syn-
tactically parsed, this annotation level can be very useful since it will allow the user
to search for expressions whose components can freely move within the sentence
but are linked together anyway.
4. Methodology
In order to analyse the modifications that each expression allows, I used an ad hoc
built tool (Squillante, 2015), whose methodology is described below and which
can automatically count the number of modified expressions attested in the corpus
with respect to the number of expressions in their basic form. The study takes into
account three kinds of modifications: syntactic variations (answering the question:
is it possible to move the components?), lexical variations (is it possible to replace one
of the components with a synonym?) and inflectional variations (does the expression
allow for inflection?). For reasons connected to the computational load, the tool is
only able to investigate expressions formed by two words or three words, of which
two are content words.
The tool can interrogate the corpus using queries that count the number of ex-
pressions matching the search criteria. If the corpus is syntactically parsed we can
search for lemmas of the expressions that appear distant in the sentence but are
connected by dependency links. If the corpus has no syntactic parsing we should
only consider surface co-occurrences of the lemmas and then search for any varia-
tion of the expressions by queries based only on sequences of words. The basic idea
232 Luigi Squillante
where ni is the number of syntactically modified expressions and nbf is the number
of expressions in their basic, lemmatised and unmarked form. In this way, the
higher the ni, the higher the value of Isyn. The underlying assumption is that if the
modified expression is highly attested in the corpus, then the expression is syntac-
tically modifiable; if it is not, then the expression is syntactically frozen.
Concerning syntactic variation, I take into consideration the following
modifications.
Interruptibility test. The tool checks whether the expressions are attested in a
form in which the components are separated by one or more words, as in the case
of prestare attenzione / prestare molta attenzione ‘to pay attention / to pay much
attention’.
Fixed order test. The tools checks if it is possible to invert the order of the con-
tent words forming the expression, as in the case of giorno e notte / notte e giorno
‘day and night / night and day’.
Syntactic verbal transformations. For verbal patterns, the tools checks if the
basic expression is attested in four variants: (1) topicalisation, where the object is
in first position (il campionato ho disputato ‘the championship I contested’); (2)
anaphora, where there is a double object construction with the insertion of a pro-
noun (il campionato l’ho disputato ‘the championship, I contested it’); (3) passive
form, where the expression is inflected in its passive form (il campionato è stato
disputato, ‘the championship has been contested’; (4) relativisation, where the object
is linked to a subordinate clause by a relative pronoun (il campionato che disputi
‘the championship that you compete for’).
lemmas whose synonyms are divided by word senses. The choice of this resource
is due to the fact that it is one of the major current synonym databases freely avail-
able for Italian and it is structured in raw text, which makes it easily accessible to
computational tools.
Given an expression, the tool searches for the synonyms of each content word
in the thesaurus and, if present, it generates a lemmatised modified expression,
whose frequency must be checked in the corpus. The greater the presence of mod-
ified expressions, the more the original expression will be considered empirically
modifiable. The index quantifying the level of lexical variability is given by the
following formula
ns
I sub =
ns + nbf
where ns is the number of modified expressions and nbf is the frequency of the
expression in its original form. The number of modified expressions is better ex-
plained using the following formula:
s n = Â nsyn1,i + Â nsyn2 ,i
i i
where it is possible to see that ns is the sum of the number of all the expressions
where only the first content word is replaced with all the expressions where only
the second content word is. Similarly to the index of interruptibility, the higher the
ns, the higher the value of Isub.
As an example, we can consider colonna sonora, for which the synonyms in the
database are shown in Figure 1.
colonna|6
(s.f.)|pilastro|sostegno
(s.f.)|cariatide|cippo|obelisco|stele
(s.f.)|aiuto|appoggio|cardine|fondamento|perno|sostegno
(s.f.)|elenco|fila|serie
(s.f.)|carovana|coda|compagnia|drappello|fila|formazione|schiera
(s.f.)|banda|pista
sonoro|1
(agg.)|acustico|altisonante|enfatico|forte|risonante|roboante|rumoroso|squillante
Figure 1. Example of the lemmas colonna ‘column’ (noun) and sonoro ‘sonorous’
(adjective) in the Italian OpenOffice Thesaurus
234 Luigi Squillante
6. The choice of replacing only one component at a time in the expression, instead of considering
simultaneous substitutions of both the content words, is mainly due to two reasons. First, if n and m
are respectively the number of synonyms for the first and the second content word, our procedure
generates n+m expressions to be checked, while with simultaneous substitution the total number
would increase to n·m and this would generate a huge computational load without any further
optimisation processes. Secondly, tests performed on sample expressions, as reported in Squillante
(2015), showed how simultaneous substitution led to a great dispersion of the expression’s original
meaning, as in the case of via d’uscita ‘way out’, for which expressions as inizio di pubblicazione
‘start of publication’ and itinerario d’apertura ‘opening itinerary’ were created.
7. This is an emblematic case, since braccio destro has the idiomatic meaning of “right-hand
person”, while ala destra has, in turn, a totally different idiomatic meaning, defining the role of a
soccer player.
Empirical variability of Italian multiword expressions 235
Figure 2. Vector structures for the first 15 most frequent words in the case of guerra
mondiale ‘world war’, conflitto mondiale ‘world conflict’ and spedizione mondiale ‘world
expedition’, with frequency information extracted from the corpus PAISÀ
words in the case of guerra mondiale, conflitto mondiale ‘world conflict’ and spedi
zione mondiale ‘world expedition’, with frequency information extracted from the
PAISÀ corpus.
For all the expression vectors to be compared, I order the co-occurring words
alphabetically, such that every word has a component value in each vector, which
is equal to its frequency, as shown by Figure 3. Then the tool is able to evaluate
the geometrical proximity of the vectors using the values of their components by
calculating the cosine distance, whose formula is reported below:
 = 1A B
n
A◊B k k
cos θ = = k
 = 1A  = 1B
n n
A B 2 2
k k k k
Ak and Bk represent the k-th component (which means frequency of the k-th
co-occurring word) for the expressions A and B; n is the frequency rank threshold
for the considered co-occurring words, which was set to 50 in my study: this means
that the vectors consider only the 50 most frequent co-occurring words for each
expression.
236 Luigi Squillante
Figure 3. Vectors of the 15 most frequent content words co-occurring with the
expressions guerra mondiale ‘world war’, conflitto mondiale ‘world conflict’ and
spedizione mondiale ‘world expedition’ extracted from PAISÀ. The number of total
words is more than 15, but less than 45, since many of the most frequent words are
shared between expressions
The results of the cosine distance are values in the range between 0 and 1 and thus
we can interpret them as a percentage of similarity between the expressions. In the
case of Figure 2, the similarity between guerra mondiale and conflitto mondiale is
97%, while the similarity between guerra mondiale and spedizione mondiale is 11%.
Empirical variability of Italian multiword expressions 237
We can use the cosine distance as a weight in the computation of the occur-
rences of the modified expressions, and thus calculate a corrected value for ns, as
shown by the following formula:
In this way, if there is a high number of attested modified expressions, but they are
not synonyms of the original expression, the weight will rule out these occurrences
or at least reduce them proportionally to the level of synonymity of the modified
expression with respect to the original one.
Finally, for each of the given expressions, the tool is able to check if there is a con-
centration of their occurrences in a specific inflected form. The index is given by
the following formula:
nbf - n prev
I infl =
nbf
where nbf is the number of expressions in their basic form and nprev is the number
of attested expressions in the most prevalent form. In this way, the higher the value
of nprev is, the lower the value of Iinfl will be.
In the analysis performed on the Italian language I used the tool to study nine gram-
matical patterns among the most common MWE generators in Italian,8 which are
noun + adjective (NA, casa editrice ‘publishing house’), adjective + noun (AN, libero
arbitrio ‘free will’), noun + preposition + noun (NPN, gioco d’azzardo ‘bet’), noun
+ preposition combined with determinative article + noun (NPdN, vigile del fuoco
‘firefighter’), noun + preposition + infinitive verb (NPVinf, macchina da scrivere
‘typewriter’), noun + noun (NN, sala giochi ‘penny arcade’), noun + conjunction +
noun (NCN, punto e virgola ‘semicolon’), verb + conjunction + verb (VCV, gratta
e vinci ‘scratch card’), verb + determinative article + noun (VDN, dare i natali, lit.
‘to give the Christmases’ meaning ‘to give birth’).
8. The nine patterns include eight patterns that are typical generators of nominal MWEs and
only one sequence (VDN) to test our methodology also on verbal patterns.
238 Luigi Squillante
9. In the case of the pattern NA, for example, 100% of the expressions whose prevalent form is
the plural have values lower than 4% for syntactic variations.
Empirical variability of Italian multiword expressions 239
NA AN
1 1
0.8 0.8
0.6 0.6
Isub
Isub
0.4 0.4
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
Isyn Isyn
NPN NPdN
1 1
0.8 0.8
0.6 0.6
Isub
Isub
0.4 0.4
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
Isyn Isyn
Figure 4. Distribution across the plane defined by the indexes of syntactic (Isyn) and
lexical (Isub) variations of the 500 most frequent expression belonging to the patterns NA,
AN, NPN and NPdN extracted from the PAISÀ corpus
the determinative article.10 As for the NPVinf pattern, the strong presence of low
syntactic variability but high lexical variation depends on the fact that almost the
totality of the considered expressions are fragments of the sequence [in] grado di +
infinitive verb ‘able to + infinitive’, which can be associated with a broad variety of
verbs and shares the same pattern with the well-known crystallised MWEs such as
macchina da scrivere ‘typewriter’, gomma da masticare ‘chewing gum’, associazione
per delinquere ‘criminal conspiracy’, etc. The NN pattern, which is not a sequence
10. As noted by Masini (2008), the presence of the determinative article, which is the natural
syntactic contour to nouns, let them integrate into standard rules of syntax and grammar. His
absence, instead, can relate to phenomena such as composition or incorporation.
240 Luigi Squillante
NPVint NN
1 1
0.8 0.8
0.6 0.6
Isub
Isub
0.4 0.4
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
Isyn Isyn
NCN VCV
1 1
0.8 0.8
0.6 0.6
Isub
Isub
0.4 0.4
0.2 0.2
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
Isyn Isyn
Figure 5. Distribution across the plane defined by the indexes of syntactic (Isyn) and
lexical (Isub) variations of the 500 most frequent expression belonging to the patterns
NPVinf, NN, NCN and NVN extracted from the PAISÀ corpus
11. NN is the only pattern to include entities recognisable as multiword units also when lexical
modification is allowed for one of the components, which can be strictly replaced only by one
Empirical variability of Italian multiword expressions 241
VDN
1
0.8
Isub 0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Isyn
Figure 6. Distribution across the plane defined by the indexes of syntactic (Isyn) and
lexical (Isub) variations of the 500 most frequent expression belonging to the pattern VDN
extracted from the PAISÀ corpus
area where both syntactic and lexical variations are allowed, is the place where free
expressions gather. The fact that the upper-right area is often almost empty is due
to the fact that taking the 500 most frequent expressions for each pattern implies a
high proportion of MWEs. Tables 2 and 3 show examples of the expressions found
in these areas.
Table 2. Examples of expressions extracted from PAISÀ which are in the bottom-left corner
of the plane defined by Isyn and Isub and are thus identified with low values for both indices
Pattern Expression English translation I syn I sub
NA carro armato tank 0.00127 0.00371
AN pronto soccorso first aid 0.00097 0.00003
NPN opera d’arte artwork 0.00355 0.00232
NPdN forze dell’ordine law enforcement agency 0.00273 0.00969
NPVinf ragion d’essere raison d’être 0 0.08080
NN centro benessere spa 0.00551 0.00195
NCN botta e risposta tit for tat 0.00671 0
VCV tira e molla hesitation 0 0
VDN cessate il fuoco cease-fire 0.00189 0.00468
Table 3. Examples of expressions extracted from PAISÀ which are in the upper-right area
of the plane defined by Isyn and Isub and are thus identified with high values for both indices
Pattern Expression English translation I syn I sub
NA famiglia nobile noble family 0.58510 0.39928
AN importante ruolo important role 0.81944 0.53419
NPN serie di eventi series of events 0.29160 0.31903
NPdN zona della città area of the city 0.66253 0.75384
NPVinf scelta di fare choice to do 0.24546 0.74917
NN // // // //
NCN amici e familiari friends and relatives 0.50714 0.78417
VCV saccheggiare e devastare to ransack and devastate 0.48889 0.70458
VDN avere l’effetto to have the effect 0.39463 0.66223
Table 4. Examples of expressions extracted from PAISÀ which are in the upper-left
to bottom-right diagonal of the plane defined by Isyn and Isub and are recognisable
as lexical collocations
Pattern Syntactic Lexical Example En. translation I syn I sub
variation variation
NA – + crescita economica economic growth 0.00583 0.67480
AN // // // // // //
NPN + – olio d’oliva olive oil 0.30246 0
– + via di fuga way out 0.00317 0.48524
NPdN – + scopo del gioco goal of the game 0.05914 0.40996
NPVinf // // // // // //
NN // // // // // //
NCN + – morti e feriti dead and wounded 0.67103 0
VCV + – ridere o piangere to smile or cry 0.35897 0
VDN + – coniare un termine to coin a term 0.46188 0.03227
12. The set of expressions extracted from the PAISÀ corpus inevitably also brings up expressions
that are not included in the three main collocation dictionaries available for Italian language
(Urzì, 2009; Lo Cascio, 2011; Tiberii, 2012) as a consequence of the empirical methodology used
in my analysis. In fact, although the study of the variations is performed through a quantitative
approach, the analysis of the expressions gathering in certain areas of the plane is inevitably
performed with a qualitative approach. The fact that Italian collocation dictionaries are compiled
on the basis of intuition or not fully explained methodologies makes them just a first reference
in outlining the nature of a certain expression found in the corpus, whose status must inevitably
be validated by the author.
Empirical variability of Italian multiword expressions 243
The rest of the expressions that are present in the remaining areas can be consid-
ered free.13
6. Conclusion
13. The fact that even some free expressions exhibit some restriction, as Figures 4, 5 and 6 show,
is mainly due to the nature of the considered pattern. AN, for example, is a marked sequence in
Italian, since the adjective is typically found in the post-nominal position. If the adjective is placed
before the noun, it tends to acquire a metaphorical or subjective meaning that is not present
when it follows the noun. Grande maestro ‘great teacher’ is different from maestro grande ‘adult/
tall teacher’ and while the first is highly attested in PAISÀ (912 occurrences), the latter appears
with only two occurrences since it has a more specific and unusual meaning. As a consequence of
this, the expression results as syntactically frozen not due to its MWE status, but due to intrinsic
standard strategies of the language. For the coordinated patterns NCN and VCV, the impossibility
of changing the order of the sequence is often due to the semantic consequentiality of the two
content words rather than their MWE status, as in the case of infanzia e adolescenza ‘childhood
and adolescence’ or ideare e progettare ‘to conceive and design’. For a more detailed analysis of
the many cases of this kind, see Squillante (2015).
244 Luigi Squillante
References
Baldwin, T., Colin B., Takaaki, T., & Widdows, D. (2003). An Empirical Model of Multiword
Expression Decomposability. In Proceedings of the ACL 2003 Workshop on Multiword
Expressions: Analysis, Acquisition and Treatment (pp. 89–96).
Bannard, C., Timothy, B., & Lascarides, A. (2003). A Statistical Approach to the Semantics
of Verb-Particles. In Proceedings of the ACL 2003 Workshop on Multiword Expressions:
Analysis, Acquisition and Treatment (pp. 65–72).
Bartsch, S. (2004). Structural and Functional Properties of Collocations in English. Tübingen: Narr.
Bosque, I. (2004). REDES, Diccionario Combinatorio del Español Contemporaneo. Milan: Hoepli.
Burger, H. (1998). Phraseologie: Eine Einfūhrung am Beispiel des Deutschen. Berlin: Erich
Schmidt Verlag.
Calzolari, N., Filmore, C., Grisham, R., Ide, N., Lenci, A., MacLeod, C., & Zampolli, A. (2002).
Towards Best Practice for Multiword Expressions in Computational Lexicons. In Proceed
ings of the 3rd International Conference on Language Resources and Evaluation.
Casadei, F. (1996). Metafore ed espressioni idiomatiche. Uno studio semántico sull’italiano. Roma:
Bulzoni.
De Mauro, T., & Voghera, M. (1996). Scala mobile. Un punto di vista sui lessemi complessi. In
P. Benincà, G. Cinque, T. De Mauro, & N. Vincent (Eds.), Italiano e i dialetti nel tempo (pp.
99–131). Roma: Bulzoni.
De Mauro, T. (2007). GRADIT – Grande Dizionario Italiano dell’Uso. Torino: UTET.
Di Sciullo, A. M., & Williams, E. (1987). On the Definition of Word. Cambridge, MA: MIT Press.
Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computa
tional Linguistics, 19(1), 61–74.
Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. (PhD Thesis,
Institut für maschinelle Sprachverarbeitung, Universität Stuttgart).
Evert, S. (2008). Corpora and collocations. In A. Lüdeling, & M. Kyto (Eds.), Corpus Linguistics.
An International Handbook (pp. 1212–1248). Berlin: Mouton De Gruyter.
Fazly, A., & Stevenson, S. (2007). Distinguishing Subtypes of Multiword Expressions Using
Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader
Perspective on Multiword Expressions. Jun 2017. Prague, Czech Republic. ACL (Association
for Computational Linguistics). 9-16. (Retrieved from: https://www.aclweb.org/anthology/
W07-1102).
Katz, J. J., & Fodor, J. A. (1963). The structure of a Semantic Theory. Language, 39(2), 170–210.
https://doi.org/10.2307/411200
Lepschy, G. C. (1989). Sulla linguistica moderna. Bologna: Il Mulino.
Lin, D. (1999). Automatic Identification of non-compositional phrases. In Proceedings of the 37th
Annual Meeting of the Association for Computational Linguistics on Computational Linguistics
(pp. 317–324). https://doi.org/10.3115/1034678.1034730
Lo Cascio, V. (2011). Dizionario Combinatorio Italiano. Amsterdam: John Benjamins.
Lyding, V., Stemle, E., Borghetti, C., Brunello, M., Castagnoli, S., Dell’Orletta, F., Dittmann, H.,
Lenci, A., & Pirrelli, V. (2014). The PAISÀ Corpus of Italian Web Texts. In Proceedings of the
9th Web as a Corpus Workshop of the Association for Computational Linguistics (pp. 36–43).
Masini, F. (2007). Parole sintagmatiche in italiano. (PhD Thesis, Università degli Studi di Roma
Tre, Roma).
Empirical variability of Italian multiword expressions 245
Masini, F. (2008). Binomi coordinati in italiano. In E. Cresti (Ed.), Prospettive nello studio del
lessico italiano (pp. 563–571). Firenze: Firenze University Press.
McCarthy, D., Keller, B., & Carroll, J. (2003). Detecting a Continuum of Compositionality in
Phrasal Verbs. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis,
Acquisition and Treatment (pp. 73–80).
Paolo, R. (2005). Pagine Linguistiche. Roma-Bari: Laterza.
Miller, G. A., & Walter, G. C. (1991). Contextual Correlates of Semantic Similarity. Language and
Cognitive Processes, 6(1), 1–28. https://doi.org/10.1080/01690969108406936
Ramat, P. (1990). Definizione di parola e sua tipologia. In M. Berretta, P. Molinelli & A. Valentini
(Eds.), Parallela 4. Morfologia. Tübingen: Gunter Narr.
Ramat, P. (2016). What’s in a word? In SKASE Journal of Theoretical Linguistics, 13(2), 106–121.
Simone, R. (1990). Fondamenti di linguistica. Roma-Bari: Laterza.
Squillante, L. (2015). Polirematiche e Collocazioni dell’ Italiano. Uno Studio Linguistico e Computa
zionale. (PhD Thesis, Sapienza- Università di Roma).
Tiberii, P. (2012). Dizionario delle Collocazioni. Le combinazioni delle parole in italiano. Bologna:
Zanichelli.
Urzì, F. (2009). Dizionario delle Combinazioni Lessicali. Luxembourg: Convivium.
Voghera, M. (1994). Lessemi complessi: percorsi di lessicalizzazione a confronto. Lingua e Stile,
XXIX(2), 185–214.
Voghera, M. (2004). Polirematiche. In M. Grossmann, & F. Rainer (Eds.), La formazione delle
parole in italiano (pp. 56–69). Tübingen: Niemeyer.
Wermter, J., & Udo, H. (2004). Collocation Extraction Based on Modifiability Statistics. In
Proceedings of the 20th International Conference on Computational Linguistics (pp. 980–986).
Williams, G. (2003). Les collocations et l’école contextualiste britannique. In F. Grossmann, &
A. Tutin (Eds.), Les Collocations: analyse et traitement (pp. 33–44). Amsterdam: De Werelt.