Borders and Boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter Data To The Rescue

Journal of Linguistic Geography, page 1 of 25.
© Cambridge University Press 2019
ORIGINAL RESEARCH

doi:10.1017/jlg.2018.9
Nikola Ljubešić,1,2* Maja Miličević Petrović,3 and Tanja Samardžić4

1 Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
2 Department of Information and Communication Science, Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
3 Department of General Linguistics, Faculty of Philology, University of Belgrade, Belgrade, Serbia
4 Language and Space Lab, University of Zürich, Zürich, Switzerland

In this paper we deal with the spatial distribution of 16 linguistic features known to vary between Bosnian, Croatian, Montenegrin, and Serbian. We perform our analyses on a dataset of geo-encoded Twitter status messages collected in the period from mid-2013 to the end of 2016. We perform two types of analyses. The first one finds boundaries in the spatial distribution of the linguistic variable levels through the kernel density estimation smoothing technique. These boundaries are then plotted over the state borders for a visual comparison. The second analysis deals with linguistic distance between the states. The groupings of linguistic variables and countries are calculated given the state borders and the Jensen-Shannon divergence between distributions of the 16 variables within each state. This analysis is completed with a measure of variable consistency for each country. These analyses are intended to show the extent to which current state borders correspond to linguistic boundaries. They suggest that Croatia and Serbia still represent the two extremes, reflecting a history of normative divergences, while Bosnia-Herzegovina and Montenegro, depending on the variable, lean to one or the other side.
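
To make the distance measure named above concrete, the following minimal Python sketch computes the Jensen-Shannon divergence between the level distributions of one two-level variable in two countries. It is an illustration under stated assumptions, not the implementation used in the study: the function names, the base-2 logarithm, and the counts in the usage example are all hypothetical.

```python
import math


def level_distribution(count_a, count_b):
    """Turn raw counts of the two levels of a variable into a probability distribution."""
    total = count_a + count_b
    return [count_a / total, count_b / total]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 0 and 1) of two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of levels")
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute nothing
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Hypothetical counts of the two levels of one variable in two countries
    country_1 = level_distribution(900, 100)
    country_2 = level_distribution(150, 850)
    print(js_divergence(country_1, country_2))
```

With base-2 logarithms the value is 0 for identical distributions and 1 for distributions with no shared level, which makes the per-variable distances directly comparable across variables.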
1. INTRODUCTION

The story of the language known in former Yugoslavia as Serbo-Croatian is a telling example of the complex interaction between linguistic microvariation and its political environment. This story has been told many times from many different points of view and no version is likely to be accepted by all interested parties. In fact, the protagonists of this story are not only Croats and Serbs living in Croatia and Serbia respectively, but a whole range of ethnic groups living on the territory of four present-day countries and former Yugoslav republics, Bosnia-Herzegovina, Croatia, Montenegro, and Serbia, whose official standard languages now carry the respective country names: Bosnian, Croatian, Montenegrin, and Serbian (BCMS).

The questions of the unity of the language and its name(s) (is it a single language with multiple names, or multiple languages each with a single name?) have spurred long and passionate debates often leading to absurd political decisions and social phenomena. A good example of the current linguistic paradox appears in what is supposed to be a relatively straightforward description of the population of a state: the distribution of mother tongues in Montenegro’s 2011 census (number of speakers in parentheses):1 Montenegrin (229,251), Serbian (265,895), Bosnian (33,077), Albanian (32,671), Croatian (2,791), Montenegrin-Serbian (369), English (185), Croatian-Serbian (224), Bosniak (3,662), Hungarian (225), Macedonian (529), Mother tongue (3,318), German (129), Roma (5,169), Romanian (101), Russian (1,026), Slovenian (107), Serbo-Croat (12,559), Serbo-Montenegrin (618), Other (2,917), Regional languages (458), Does not want to declare (24,748). Taking the census entries literally, one would think that more than 20 distinct languages are spoken in a population of the size of around 620,000. A reader a little more familiar with the linguistic practices in the country could say that 10 of these entries (in bold), covering around 90% of the population, are just different names for the same linguistic entity. The latter interpretation, however, would be a crude simplification of the linguistic reality, neglecting the very fact recorded in the census, namely the need for different names. Linguistic grouping remains poorly understood in all four former Yugoslav republics, playing at the same time an important role in political moves.

The separation of the former Yugoslav republics into independent states during the wars in the 1990s only revived linguistic debates, which had been going on since the first steps towards unifying South Slavic groups in the early 19th century. Throughout this time, the focus has been on prescribing (through grammars, dictionaries, and orthographic manuals) and imposing (through regulatory acts) a particular set of words or writing rules on a particular territory. Linguists, writers, and other scholars have taken part in the debate by publishing opinionated articles, signing written agreements and declarations.2 All these publications aim at defining how the language should be used, with less interest in understanding how it is used in reality.

*Address for correspondence: Nikola Ljubešić, Jožef Stefan Institute, Ljubljana, Slovenia; University of Zagreb, Zagreb, Croatia, nikola.ljubesic@ijs.si

In contrast to its perceived political importance, regional linguistic variation on the territory of BCMS is not systematically monitored. The most recent comprehensive overview of the dialectal variation is Pavle Ivić’s handbook published in 1956. Subsequent dialectological studies have resulted in numerous individual descriptions of rural idioms, but no broad-coverage surveys have been conducted. Beside a small number of “ideal” dialectological informants,3 little is known on how people on the territory of BCMS actually speak.

Until recently, urban varieties were not considered an interesting research topic, under the assumption that they conform to the standard language prescribed in official grammars and dictionaries or otherwise are not valuable. Changes in communication technology leading to a democratization of public communication allowed more variation in language use to be observed in public spaces. This contributed to a raised awareness of regional variation beyond the traditional dialectological framework. The language of blogs, comments, and posts tends to be highly varied, including both regional and social variation. Moreover, a lot of this language use is recorded and accessible for research. This situation creates a new opportunity for an objective, broad-coverage study of the delicate issue of linguistic practices in BCMS.

The goal of our study is to measure empirically the most commonly cited regional differences between Bosnian, Croatian, Montenegrin and Serbian using large data sets available through user-generated content on the Internet. We aim to establish spatial spreads of the categories in question and the degree to which they follow the current borders between the four countries. A potential agreement between linguistic and administrative boundaries can be expected given some traditional differences and considering that the four countries have been conducting independent standardizations since the split of the former Yugoslavia. However, the opposite can be expected based on what is known about the history of language use in this region: while administrative borders changed often, at no point in time did they coincide with linguistic boundaries. Our analysis is intended to provide empirical evidence that can serve in picturing the current state of language use. It is highly automated, which means that the procedure can be repeated at regular time points in the future with relatively low costs. In this way, we can monitor directly historical changes in the spatial distribution of linguistic features to see whether and how importantly changes in political conditions contribute to deepening the divide between successors to a once common standard language.

We limit our study to the language used on the social network Twitter, a typical modern means of communication that allows non-professionals to express publicly their own observations, comments, or opinions. Twitter is especially interesting for studying regional variation because it allows geo-localization of the posts. The language of Twitter is usually considered to be highly non-standard. However, we expect to see an impact, to a significant degree, of language standardization, since recent work (Fišer, Erjavec, Ljubešić & Miličević 2015) has shown that as much as 90% of Twitter content follows the linguistic norms. Speakers who create this content are mostly sensitive to the effects of language standardization: they are typically educated and in constant contact with public communication. With the lack of comprehensive linguistic surveys, the language on Twitter is currently the closest approximation of the real language used in everyday life in the territory of BCMS.

2. RELATED WORK

Most of our study falls within the domain of modern dialectology. The difference between traditional and modern dialectology has been underlined by several researchers who gradually introduced new goals and methods in the study of regional linguistic variation (Chambers & Trudgill, 1998; Britain, 2002).

Traditional dialectology, instantiated in the study by Ivić (1956) mentioned above, tracks phonetic (different pronunciation of the same word known to vary) and lexical (different terms used to express the same content) variation. Its goals changed over time from the 19th-century focus on reconstructing the history of a language to the 20th-century efforts to pin down the borders between dialects and to preserve non-standard varieties that are disappearing under the pressure of standardization.

Modern dialectology, often overlapping with sociolinguistics, is concerned with the full range of regional variation including urban varieties and, especially, social diversification. Introduction of social factors in the study of linguistic variation is often attributed to Labov (1963), who showed that the variation in centralized vs. open pronunciation of an English diphthong has a “social meaning”: it distinguishes the native inhabitants of an island (Martha’s Vineyard) from visitors. Sociolinguistics has since become a wide field, with only part of it focusing on social factors in relation to regional variation, mainly following the goals set by Trudgill (1974). Our study continues this strand, whose central topics are dialect levelling and formation of new varieties (Hornsby, 2009; Trudgill, Gordon, Lewis & MacLagan, 2000). The difference between our and previous studies is that we do not address multiple social factors that might be involved in linguistic variation and potential change in our target region. We focus
instead on one feature: the recent emergence of state borders that cut across the territory of a single dialect.

A setting similar to the one that we study is addressed by Woolhiser (2005), who analyzes the effects of the state border between Poland and Belarus, established for the first time after World War II and dividing a Belarusian dialect into two countries. Woolhiser identifies a number of features that show divergent developments on the two sides of the border. The variant on the Belarusian side tends to converge with the standard Belarusian and Russian, while the variant of the dialect on the Polish side moves away from both the Belarusian version of the dialect and standard Polish. While we ask questions similar to those asked by Woolhiser (2005), the context of the potential linguistic change is quite different. First, the standard “roof” languages in our case are not as easily identifiable as in the case of Polish and Belarusian (as explained in more detail below). Second, the data that we analyze cannot be taken as representing a dialect as opposed to a standard language. As mentioned above, the language of Twitter is more likely to be situated somewhere between these two points of Woolhiser’s “vertical” variation axis. Finally, while Woolhiser (2005) discusses in depth the prevalence of just a few features on a few locations, our analysis involves relatively large datasets from many locations collected and analyzed automatically.

Automatic quantitative analysis is what our study has in common with dialectometry, a line of research that is considered a part of modern dialectology, where methods are proposed to measure linguistic distance between language varieties. The first quantification of linguistic distance was proposed by Séguy (1971), who counts the number of lexical items that are shared by two varieties and compares this measure with geographical distance. Linguistic distance measures are subsequently refined to take into account various facts about the distributions of linguistic features. Goebl (1982, 1984) introduces feature weights as a function of their frequency to take into account the fact that some features are more spread across varieties and therefore not indicative of the similarity of any particular pair. Similarity at the word level is taken into account by Nerbonne, Heeringa, van den Hout, van der Kooi, Otten & van de Viset (1995) and Nerbonne, Heeringa & Kleiweg (1999) using string edit distance. While dialectometry is traditionally focused on spatial variation, Wieling, Nerbonne & Baayen (2011) propose an integrative analysis where social and regional factors of variation are included in a single (mixed-effects) model which predicts the distance of a number of Dutch dialects from the standard language. All these studies rely on data sets collected in traditional dialectological surveys. These sets, as mentioned above, contain lexical and phonetic realizations of selected words known to vary. While we employ some of the techniques elaborated in dialectometry, our data set includes a wider range of features, including morphology and syntax.

An important novelty in modern dialectology is the interest in a wider structural range, including morphosyntax (Bart, Glaser, Sibler & Weibel, 2013; Glaser, 2013; Szmrecsanyi, 2008), discourse, and pragmatics (Pichler & Hessen, 2016). As speakers’ intuitions about formal or high-level phenomena are not easily captured with traditional questionnaires, extending the scope of research was only possible thanks to new methods of data collection.

Language corpora represent a new data source suitable for studying a wider variety of features. Introduction of corpus data into the study of regional variation (Speelman, Grondelaers & Geeraerts, 2003; Kortman & Wagner, 2005; Szmrecsanyi, 2008) allows collecting information about text frequency of the varying forms and constructions as they are spontaneously produced. The main disadvantages of this data source are uneven spatial coverage (naturally occurring texts tend to be more concentrated in particular regions) and sparseness of linguistic phenomena.4 Our collection of Twitter messages can be considered a corpus consisting of micro-texts. It allows studying various phenomena related to language use. Unlike corpora employed in previous studies, which are collected by experts, our texts are automatically harvested from a social network.

Data from social networks have already been used to study linguistic variation in relation to social and geographical factors, mostly in the field of computational linguistics. Here we focus on the work involving Twitter. This network has become a popular data source for computational experiments thanks to its application programming interface (API), which allows automatic collection of many messages and user metadata for research purposes.

Unlike previously reviewed work, which is primarily concerned with the varied forms of semantically equivalent items, the research on Twitter includes content analysis (enabled by the fact that Twitter is essentially text). Eisenstein, O’Connor, Smith & Xing (2010) propose a hierarchical model that learns (with moderate success) to associate a particular topic with a particular geographic region. Most of the subsequent studies look for significant associations between textual features and demographic characteristics of the speakers (Eisenstein, Smith & Xing, 2011; Nguyen, Smith & Rosé, 2011), but several studies address geographical factors. Doyle (2014) shows that the spatial distribution of linguistic features extracted from Twitter corresponds to the distributions previously established with traditional dialectological methods. Eisenstein, O’Connor, Smith & Xing (2014) model the spatial diffusion of new linguistic features in time, showing that it is strongly influenced
by demographic factors. In these studies, uneven spatial distribution and data sparsity are addressed with sophisticated statistical models involving latent variables and various transformations of initial counts. Our analysis is primarily exploratory (and not inferential), but we employ sampling and smoothing techniques to abstract from initial observations and identify patterns in regional variation.

Twitter language studies outside of (American) English are rather rare. Gonçalves & Sánchez (2014) try to cluster worldwide Spanish varieties regionally, but they find a predominant urban vs. rural divide. Scheffler, Gontrum, Wegel & Wendler (2014) attempt to assign German tweets to one of the given regions by calculating regional probability of words, but without taking into account potential topic variation. Our work builds upon previous studies on automatic discrimination between BCMS in newspaper texts (Ljubešić, Mikelić & Boras, 2007), as well as Twitter data (Ljubešić & Kranjčić, 2015). Previous work has shown that good discrimination can be obtained for practical purposes; we apply automatic analysis to address specific questions of interest to the general study of language use and change.

3. LANGUAGE CONVERGENCE AND DIVERGENCE IN BCMS

The current linguistic situation on the territory of BCMS is a result of linguistic, political, and cultural developments that have interacted in complex ways throughout history. A comprehensive account of these developments is Alexander’s (2013) chapter on language and identity in BCMS. Alexander (2013) shows that today’s situation is not substantially different from any other historical period since the beginning of the 19th century and the first attempts at creating strategic language policies in the region. These attempts were well embedded in general tendencies throughout 19th-century Europe, when most of the currently known national states were established. Language standardization was an integral part of creating national identities.

Creation of a standard language usually involves: 1) choosing a single (predominant or prestigious) linguistic variety to be imposed on a clearly delimited territory; 2) codifying the chosen variety with official grammars and dictionaries; 3) imposing the codified variety through the state administration. Political power (and not only in BCMS) is often expressed in terms of language standardization.

Linguistic standardization in BCMS took place in a political context where none of the main cultural centers, Belgrade (Serbia), Zagreb (Croatia), Sarajevo (Bosnia-Herzegovina), or Cetinje (Montenegro),5 had the political power to fully implement it. The territory was split between two big empires, Austro-Hungarian and Ottoman, with different degrees of autonomy exercised by the Slavic population on the territory of today’s BCMS. Montenegro and Serbia were the first to obtain full independence in 1878, but not with the same borders as today. In this context, the choice of the variety to be standardized in the cultural centers was strongly influenced by the romantic vision of a common future of all Slavic groups living in a single independent Slavic state. However, this vision was not strong enough to fully overtake more local, regional traditions, that insisted on cultural and, especially, religious differences. This interplay between two opposite interests of all involved parties, integration and separation, remained constant until the present day.

Regional linguistic varieties in BCMS are best identified by the values of two prominent features: 1) the form of the question word what, and 2) the phonetic reflex of the Proto-Slavic vowel jat’. The value of the first feature, clustering with a number of others, gives the most distinctive varieties, which can be labelled as ‘što’, ‘kaj’, and ‘ča’. The spatial distribution of these varieties, which constitute separate dialects, if not languages, is plotted in Map 1, taken from Alexander (2013:346). Variation with respect to the second feature gives more nuanced, but still prominent varieties ‘e’, ‘je’, ‘i’. Its spatial distribution is also plotted in Maps 2–4, taken from Alexander (2013:350–352).

Map 1. Čakavian, kajkavian and štokavian dialects.
Map 2. Area of ekavian pronunciation.
Map 3. Area of ijekavian pronunciation.
Map 4. Area of ikavian pronunciation.

As can be seen in the maps, spatial distributions of the two features are rather different. The ‘ča’ variety mostly has ‘i’ for the other feature, and ‘kaj’ has ‘e’. The most widely spread ‘što’ variety can have all three values for the other feature. Note also that these different feature values were already characteristic of the vernaculars spoken in the territory of today’s BCMS in the 19th century. Their geographical placement has not changed since, but the domain of their use has shrunk in those varieties that did not become part of the standard in the meantime.

We can see all these varieties as possible choices for the future standard language in the period immediately preceding the standardization efforts. The options to choose from in the four cultural-administrative centers are summarized as follows:

∙ Belgrade: ‘što + e’ or ‘što + je’. The first option was (and still is) predominant on its territory, but the second one was also used, especially in folk tales and poems, highly valued at the time. The second option also allowed a connection with the Serbian-oriented (Orthodox) population outside the territory under the influence of Belgrade.6
∙ Zagreb: ‘kaj + e’ or ‘ča + i’ or ‘što + je’. There was no clear preference for any of the three options. The first one was (and still is) spoken in the city of Zagreb and had a literary tradition. The second one also had a rich literary tradition and prestige, mostly in Dalmatia. The third one was (and still is) most widely spread in the Croatian-oriented (Catholic) population.
∙ Sarajevo: ‘što + je’ was the only option.
∙ Cetinje: ‘što + je’ was the only option.

It is clear from this summary that the best choice for a common language was ‘što + je’. This is precisely the option that was proposed by Vuk Karadžić, the prominent Serbian language reformer supported by the Austrian authorities. His proposal was accepted in Zagreb, the center of the Illyrian movement which had the goal of unifying all South Slavs and countering the dominance of German and Hungarian language in that area. The proposal was accepted just partly in Belgrade, where the Vukovian reform was eventually accepted in everything except that the ‘što + e’ version was kept. The adoption of the ‘Vukovian’ proposal meant that almost the same variety was codified in all four centers. This provided the basis for further unification efforts, especially during the time of Yugoslavia, whose main official language became Serbo-Croatian (or Croato-Serbian).

Unification tendencies are, however, just one half of the story. Throughout this time, almost equally strong were the opposite tendencies that called for keeping local varieties and connections with the literary tradition. This was especially the case in Zagreb, where unification required the biggest effort due to the existence of the Štokavian, but also Kajkavian and Čakavian literary traditions, and where unification was seen as Serbian predominance. Divergences were codified through two “variants” in the 1960s, an “eastern” (Belgrade) and a “western” (Zagreb) one, and by constitutionally allowing separate “standard idioms” in the four Serbo-Croatian speaking republics in 1974. After
the breakup of Yugoslavia, the separationist position became predominant in all four centers.7 Despite the desire on the part of all concerned to separate the four standards as much as possible, all four centers still have “almost the same” variety as a base, the one that was chosen for codification in the 19th century. This variety is now being re-codified in four different directions to mark the political breakup.

Our study addresses the effects which the above processes have had on everyday language use in contemporary BCMS.

4. DATA EXTRACTION

4.1. Dataset

The data for our study were collected with TweetCat (Ljubešić, Fišer & Erjavec, 2014), a tool for harvesting Twitter data in low-density languages, i.e., languages infrequently occurring in the Twitter stream. The collection method uses the Twitter Search API and high-frequency words specific to the language(s) of interest, searching for authors who use these words, and performing language identification on the whole language production of each candidate user. All candidate users who pass the language identification filter are added to the user index and their tweet production is collected. Both the user identification and the user data collection procedures are run iteratively for as long as required. In our collection method we defined a single list of high-frequency words and therefore ran a single process for collecting the data. This process was run from June 2013 up to the end of 2016.

Throughout this collection period we gathered data from 70,107 users who in turn produced 38,726,488 tweets. For the purposes of this study, we only kept the data geo-encoded in the four countries of interest (Bosnia, Croatia, Montenegro, Serbia). This restriction left us with 17,172 users and 1,755,525 tweets, i.e., 4.5% of the initial data points. After extracting the 16 variables of interest, we removed all data points (tweets) that contained no value for any of our variables, and therefore no relevant data. Our final dataset thereby consists of 13,102 users and 693,111 tweets, meaning that our 16 variables are present in 40% of geo-encoded tweets.

There is a well-known phenomenon in social media that a small number of users, often automated processes (also called bots), post the majority of content. Before moving forward, we checked our user distribution for such phenomena and found that they were not present. Our most prominent user has 1,526 published tweets, i.e., 0.22% of the entire dataset, publishing on average one tweet per day. The second and third most prominent users account for 0.18% of tweets, the fourth, fifth and sixth for 0.16% of tweets, etc., showing that our user-dependent distribution does not have a dominating head and that there is no need for discarding or even underrepresenting the most prominent users.

An early analysis of our dataset already confirmed our assumption that the Twitter usage across the four countries of interest varies greatly. In Table 1 we present the distribution of tweets, reporting the number of tweets from each country, as well as the percentage of tweets covered by that country. We also compare the distribution of tweets with the distribution of country areas. The numbers show that three countries are underrepresented given their area, while one (Serbia) is vastly overrepresented, accounting for 81% of Twitter content, but only 39% of territory. For this reason, we used our full dataset while performing country-conditioned calculations, whereas for calculations that are not performed on specific countries, but on the area in general, we worked with a sampled dataset in which the distribution of tweets by country follows the country area distribution. We constructed the sampled dataset by randomly drawing from our initial dataset. In that sampled dataset the percentage of tweets follows the percentage of the area of a country, both percentages for Serbia, for instance, being 39%.

4.2. Variables of Interest

We look at 16 two-level categorical variables, summarized in Table 2. Variable names, levels and examples are provided, as well as their raw and relative frequencies. The frequencies were calculated on the sampled dataset, in order to overcome the danger of over-quantifying variables more frequently occurring in Serbia and vice versa.
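
As an illustration of what a two-level categorical variable looks like in practice, the following Python sketch detects the levels of a single variable (k:h, as in kemija vs. hemija) in tweets and computes the relative frequency of each level per country. It is a simplified sketch and not the extraction procedure used in this study: the regular expressions, the function names, and the decision to skip tweets containing both levels are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical patterns for the k:h variable; the actual word lists are not given here.
PATTERNS = {
    "k": re.compile(r"\bkemij\w*", re.IGNORECASE),
    "h": re.compile(r"\bhemij\w*", re.IGNORECASE),
}


def extract_level(text):
    """Return 'k' or 'h' if exactly one level occurs in the text, otherwise None."""
    found = [level for level, pattern in PATTERNS.items() if pattern.search(text)]
    # Assumption: tweets with no level or with both levels carry no usable value.
    return found[0] if len(found) == 1 else None


def level_frequencies(tweets):
    """Map each country code to the relative frequency of each observed level.

    `tweets` is an iterable of (country_code, text) pairs.
    """
    counts = defaultdict(Counter)
    for country, text in tweets:
        level = extract_level(text)
        if level is not None:
            counts[country][level] += 1
    return {
        country: {level: n / sum(counter.values()) for level, n in counter.items()}
        for country, counter in counts.items()
    }


if __name__ == "__main__":
    sample = [
        ("BA", "Gledati smrtnike kako se pate dok odgovaraju kemiju je jako zanimljivo"),
        ("RS", "Ima vremena do jutra za mene i hemiju"),
        ("RS", "Mnogo ucim, mnogo panicim, mnogo se nerviram."),  # no value for k:h
    ]
    print(level_frequencies(sample))
```

Tweets for which `extract_level` returns `None` for every variable correspond to the data points that were removed from the final dataset as containing no relevant data.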
Table 1. Distribution of tweets by country, compared to the area distribution by country, and the sampled dataset following the country area distribution.

Country                    Tweet #   Tweet %   Country area (km2)   Country area %   Sampled tweet #
Bosnia-Herzegovina (BA)     28,909     4.74%               51,197           25.72%            24,577
Croatia (HR)                27,168     4.45%               56,594           28.43%            27,168
Montenegro (ME)             58,263     9.55%               13,812            6.94%             6,630
Serbia (RS)                495,693    81.26%               77,474           38.91%            37,181
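
The construction of an area-proportional sample can be sketched as follows. The text above states only that the sampled dataset was obtained by randomly drawing from the initial dataset; the quota rule used here (take every tweet from the country with the fewest tweets per km2 and scale the other countries to match) is an assumption, and the function names and the fixed seed are hypothetical.

```python
import random

# Tweet counts and areas (km2) as reported in Table 1
TWEETS = {"BA": 28909, "HR": 27168, "ME": 58263, "RS": 495693}
AREA_KM2 = {"BA": 51197, "HR": 56594, "ME": 13812, "RS": 77474}


def area_quotas(tweet_counts, area):
    """Largest per-country sample sizes that are proportional to country area.

    The country with the fewest tweets per km2 limits the sample size; integer
    arithmetic is used so that this country keeps exactly all of its tweets.
    """
    limiting = min(tweet_counts, key=lambda c: tweet_counts[c] / area[c])
    return {
        c: area[c] * tweet_counts[limiting] // area[limiting]
        for c in tweet_counts
    }


def sample_by_area(tweets_by_country, area, seed=0):
    """Randomly draw, without replacement, an area-proportional sample per country.

    `tweets_by_country` maps a country code to a list of tweets.
    """
    rng = random.Random(seed)
    sizes = {c: len(tweets) for c, tweets in tweets_by_country.items()}
    quotas = area_quotas(sizes, area)
    return {c: rng.sample(tweets, quotas[c]) for c, tweets in tweets_by_country.items()}


if __name__ == "__main__":
    print(area_quotas(TWEETS, AREA_KM2))
```

Under this assumed rule Croatia is the limiting country and contributes all of its tweets, and the resulting quotas are close to the "Sampled tweet #" column of Table 1.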

Table 2. A summary of variables whose spatial distribution was studied.

Variable / Examples of levels

e:je (phonetic; N = 53009, % = 34.76)
  e:
    Kako to misliš devojka si, a nikad nisi zajebala obrve? (RS)
    ‘What do you mean you’re a girl and you’ve never fucked up your eyebrows?’
  je:
    Pobise mi se neke djevojke ispod prozora, sto je ovo majko mila (ME)
    ‘Some girls just got into a physical fight under my window, where is this world going’

rdrop (phonetic; N = 1722, % = 1.13)
  r:
    Uzivam li u tvojem drustvu, odgovor je da. Mogu li zivjeti bez tebe, odgovor je takodjer da. (BA)
    ‘Do I enjoy your company, the answer is yes. Can I live without you, the answer is also yes.’
  rdrop:
    zaspacu ali sad sam narucila hranu takodje poslednja epizoda oitnb oh zivote (RS)
    ‘will fall asleep but just ordered food also last episode oitnb oh life’

k:h (phonetic; N = 378, % = 0.25)
  k:
    Gledati smrtnike kako se pate dok odgovaraju kemiju je jako zanimljivo (BA)
    ‘Watching mortals suffering during an oral chemistry exam is fun’
  h:
    Ima vremena do jutra za mene i hemiju (RS)
    ‘There’s time till morning for me and chemistry’

h:noh (phonetic; N = 2911, % = 1.91)
  h:
    @amaiia_hr Uuu, šta kuhate? (HR)
    ‘@amaiia_hr Uuu, what are you cooking?’
  noh:
    Ljubim bolje nego sto kuvam. (BA)
    ‘I kiss better than I cook.’

sto:sta (lexical; N = 5228, % = 3.43)
  sto:
    Nestala struja baterija prazna, što ću da radim noćas kukala mi majka (ME)
    ‘Power off and empty battery, what will I do tonight poor me’
  sta:
    Sta mi ovo treba, sta ja ovo radim,i zasto se igram sa zivotom mladim? (BA)
    ‘What do I need this for, what am I doing, and why am I playing my young life’

dali:jeli (lexical; N = 1538, % = 1.01)
  dali:
    Da li se ipak udati ili zavrsavati faks? Vecita dilema. (RS)
    ‘To get married or to graduate? The eternal dilemma.’
  jeli:
    Ako se ja najedem prasetine je li to kanibalizam (ME)
    ‘If I eat a lot of pork is that cannibalism’

s:sa (lexical; N = 14271, % = 9.36)
  s:
    Ovo s nobelovcima je demagogija. Pet nobelovaca, pet ekonomakih teorija! #RTLDuel. (HR)
    ‘This thing with Nobel winners is demagogy. Five Nobels, five economic theories! #RTLDuel’
  sa:
    I nije sve tako sivo, kad sa nekim imas poci na pivo… (ME)
    ‘Everything’s not so gloomy if you’ve got someone to go out for beer with…’

mnogo:puno (lexical; N = 1651, % = 1.08)
  mnogo:
    Mnogo ucim, mnogo panicim, mnogo se nerviram. #skrenucu (RS)
    ‘I study a lot, I panic a lot, I worry a lot. #willlosemymind’
  puno:
“Ja nisam ekspert, ali mogu o tome govoriti jer sam gledao puno gangsterskih filmova” Damir Matković
#HRTdnevnik (HR)
‘ “I’m no expert, but I can talk about this because I’ve seen many gangster movies” Damir Matković
#HRTdnevnik’
ko:tko ko:
N = 1078 Hvala ti, SARMO, sto si tu kad niko nije. Ko te izmisli, svaka mu cast. Mmmmmmm. :D #biglove (RS)
% = 0.71 ‘Thank you, SARMA, for being there when nobody else is. Kudos to whoever invented you.
lexical Mmmmmmm. :D #biglove’
Table 2: (Continued )
Variable Examples of levels
tko:
Neka mi jos jednom netko kaze da se ljudi na Balkanu ne vole i da smo divljaci poslat cu ga u tri lijepe :)
(HR)
‘If I ever again hear anyone say that people in the Balkans don’t love each other and that we are savages
I’ll send them all to hell :)’
long:shortinf long:
N = 21670 A badnji rucak cu variti do treceg vaskrsenja (RS)
% = 14.21 ‘And I will take until the third resurrection to digest the Christmas Eve meal’
morphosyntactic short:
malo tmurno, no zasto se ne provozat?;) (HR)
‘a bit cloudy, but why not go for a ride?;)”
da:inf da:
N = 34875 Deo haljine nase predstavnice za evroviziju moze da posluzi kao satorsko krilo (RS)
% = 22.87 ‘One section of the dress of our Eurovision representative can serve as a tent’
morphosyntactic inf:
Ovo odijelo za mature moze posluziti i kad se Lazar bude zenio! (ME)
‘This prom suit can also serve for when Lazar gets married!’
synth:nonsynth synth:
N = 3130 Slavice se dan kao drzavni kad izmisle bateriju koja traje 5 dana (RS)
% = 2.05 ‘A state holiday will be declared when someone invents a battery that lasts 5 days’
morphosyntictic nonsynth:
otvorit cemo kafic DNO DNA (BA)
‘we’ll open a bar called BOTTOM’S BOTTOM’
adjg adjglong:
N = 5236 Nakon prosloga napornoga tjedna, spavanje s kokicama. #odmor (HR)
% = 3.43 ‘After the tiring last week, going to bed with the hens. #rest’
morphosyntactic adjgshort:
Ide radio s uz madonu material girl zasto pustate pesme iz proslog veka? (RS)
‘The radio’s got Madonna’s Material Girl on, why are you playing songs from the last century?’
ira:isaova ira:
N = 1762 Škola mi je tolko organizirana da nisu isprintali svjedodžbe na vrijeme. (HR)
% = 1.16 ‘My school is so organised that they did not print end-of-year reports on time.’
morphosyntactic isaova:
U mom zivotu jedino je organizovan jelovnik (RS)
‘The only organised thing in my life is the menu’
treba trebam:
N = 3829 Prof:Kome nije jasno? -Nije meni. Prof:E pa trebao si slusat. OO ITALIJO (ME)
% = 2.51 ‘Teacher:Who did not understand? -I didn’t. Teacher: Well you should have listened. OO ITALY’
morphosyntactic treba:
divnooo,još jedna stvar koju treba da uradim aaa (RS)
‘Wooonderful, one more thing I need to do aaa’
ica:ka ica:
N = 192 Profesorica matematike vise voli da izbaca sa casa no ‘leba da jede (ME)
% = 0.13 ‘The maths teacher likes asking students to leave the lesson more than anything’
morphosyntactic ka:
Profesorka srpskog je upravo rekla da će verbalno da me zadavi (RS)
‘The Serbian teacher just said she would strangle me verbally’
As can be seen from the table, the variables belong to three levels of linguistic structure: phonetics, lexis, and morphosyntax. They were selected from a larger set of candidates based on the criteria of linguistic relevance, ease of automatic retrieval, and sufficient coverage in the data.

Linguistic relevance was determined through a literature review. Variables mentioned in a number of works were considered, including traditional grammars and orthography manuals (for Serbian: Pešikan, Jerković & Pižurica, 2010; Stanojčić & Popović, 2008; Stevanović, 1989; for Croatian: Barić, Lončarić, Malić, Pavešić, Peti, Zečević & Znika, 1997; for Bosnian: Halilović, 2004; Jahić, Halilović & Palić, 2000; for Montenegrin: Čirgić, Pranjković & Silić, 2010; Perović, Silić & Vasiljeva, 2009),
as well as studies explicitly dealing with differences between the new standard languages (cited below, with reference to individual variables).8 Expectedly, most works deal with Serbian and Croatian, and their mutual differences. Tošović (2008) conducted an extensive overview of reference works for the four languages and found that, out of 289 resources consulted, 57% were descriptions of Serbian, 41% descriptions of Croatian, 2.8% descriptions of Bosnian, and only 0.1% descriptions of Montenegrin. The studies dealing with differences were initially also heavily focused on Serbian and Croatian, but Bosnian has subsequently received a lot of attention, largely due to attempts to disentangle its Croatian-like and Serbian-like features. The youngest of the standard languages, Montenegrin, is, as expected, the least covered one. Note also that the available studies mostly target the standard as described in reference works, with little empirical data about actual language use.

We focus on variables that can be automatically identified based on the surface form of words, and/or the entries in the available morphological lexicons (hrLex and srLex9; Ljubešić, Klubička, Agić & Jazbec, 2016).10 We exclude those variables that are difficult or impossible to retrieve in an automatic manner, something which is often due to homonymy; for instance, the contrast between te (characteristic of Croatian) and pa (more typical of Serbian), both meaning ‘then’, was not studied due to te also being the accusative singular 2nd person personal pronoun (as in vidim te ‘I see you’), as well as the feminine nominative/accusative form of a demonstrative pronoun (as in te kuće ‘those houses’).

The variables also had to be frequent enough to provide a meaningful number of data points for the analysis of spatial distribution. One of the main consequences of this constraint is that under lexical variables we only look at function words, despite the differences in the inventory of lexical words being typically listed as the most prominent ones (see e.g. Brown & Alt, 2004:7; Piper, 2009:549-550; Tošović, 2008:183).11 It should also be mentioned that it was impossible to reach a sufficient number of variables with similar frequencies; the implications of the large differences in frequency will be discussed when necessary in the results section.

In what follows, we provide more detailed descriptions of individual variables and the procedures used in their extraction.

e:je

The e:je variable concerns one of the features central to defining the dialects on the territory of BCMS: the Proto-Slavic vowel jat’ and its different contemporary reflexes (described in more detail in Section 3). We look at e (as in mleko ‘milk’, or pesma ‘song’) and (i)je (mlijeko, pjesma). The e reflex is characteristic of Serbia, while (i)je is found in Croatia, Bosnia-Herzegovina and Montenegro. Based on reference descriptions, this is the variable whose geographical distribution is expected to be most straightforward.12 It is at the same time the most frequent variable we look at.

The e:je variable was extracted using a lexicon file containing the list of target items in one column, and variable values in another (as illustrated by the examples in Table 3; see also Ljubešić, Samardžić & Derungs, 2016). The list was automatically generated from the inflectional morphological lexicons hrLex and srLex (Ljubešić et al., 2016) by searching for pairs of words in which both had the same morphosyntactic description and identical word forms except for the transformations (ije vs. e or je vs. e), and both had just one possible canonical form (lemma). A total of 146,864 word forms were listed.

Table 3. A sample of the feature extraction lexicon used for e:je.

Word       Value
pjesma     je
pesma      e
pjesama    je
djevojke   je
devojka    e

rdrop

In some words in BCMS, e.g., jučer/juče ‘yesterday’, the final r can either occur or be dropped; the former option is more typical of Croatian, and the latter of Serbian (see frequencies in Tošović, 2009). The specific words we look at are juče(r) ‘yesterday’, prekjuče(r) ‘day before yesterday’, veče(r) ‘evening’, naveče(r) ‘in the evening’, uveče(r) ‘in the evening’, predveče(r) ‘in the early evening’, and takođe(r) ‘also’.

This variable was also extracted through a lexicon file. The file was created manually; it only included the words listed above, and for each of them the value with regard to r drop.

k:h

The k:h alternation is another systematic phonetic phenomenon often cited as a differential marker between Croatian and Serbian. It occurs at word beginning in words of Greek origin which started with ch-, so in contemporary BCMS we find word pairs such as kemija/hemija ‘chemistry’, or kirurg/hirurg ‘surgeon’ (more examples in Silić, 2008). K is consistently used in Croatian, and h in Serbian. At the level of the norm, Bosnian and Montenegrin pattern with Serbian and use
h (Halilović 2004:48; Perović et al., 2009); however, k seems to also be possible in Bosnia-Herzegovina, as will be shown in our later analyses.

The k:h variable was extracted using a manually created lexicon file. All inflected forms of each relevant lemma were included, for a total of 587 word forms. The lemmas were identified through the lists reported in the literature and through dictionary searches.

h:noh

The last of our phonetic variables is related to the presence/absence of h, which is sometimes omitted at word beginning, and omitted or replaced with an alternative (typically j or v) within a word. Examples of pairs with(out) an initial h drop are hrđa/rđa ‘rust’ and hrvanje/rvanje ‘wrestling’. In non-initial positions, snaha/snaja ‘sister/daughter-in-law’, čahura/čaura ‘cocoon; capsule’, and gluh/gluv ‘deaf’ exemplify the contrast.

The problem of where h is to be written and pronounced dates back to the 19th century, when a general rule was developed stating that it should be used where it was required by etymological criteria; this rule was kept in the orthographic norm of Serbo-Croatian, but it was differentially adopted in the different variants, with Serbian mostly allowing both forms, and Croatian and Bosnian keeping the h (Čedić, 2001).13 The presence of h is particularly characteristic of Bosnian, where it is added in some words that do not contain it in Croatian and did not necessarily contain it etymologically: kahva ‘coffee’ (Croatian kava, Serbian kafa), lahko ‘easily’ (Croatian and Serbian lako), and similar. These forms were non-standard in Serbo-Croatian, but they entered the norm for Bosnian later on (Halilović, 2004:22-23). The Bosnian norm also banned the possibility of using suv ‘dry’, duvan ‘tobacco’, and other similar Serbo-Croatian forms allowed alongside suh and duhan. Montenegrin seems to pattern with Serbian, but without a clearly formulated rule, and with some inconsistencies: the orthography manual lists only snaha, only gluv, and both čaura and čahura (Perović et al., 2009).

The lexicon of words relevant for the h:noh variable was compiled manually, taking into account all inflected forms; a total of 1088 word forms were included. Note that the forms that are highly specific of Bosnian were omitted, as they do not belong to the etymological pattern the normative rules were based on, and sometimes also have multiple equivalents in the other standards (cf. kahva / kava / kafa).

sto:sta

Our first lexical variable has to do with the feature behind the first major division in BCMS. In the dialects based on što (as opposed to kaj and ča), the standard form of the interrogative pronoun ‘what’ is što in Croatian, Bosnian and Montenegrin, and šta in Serbian (both šta and što are listed in the reference works, but šta is more common). Tošović (2009) reports corpus frequencies that show što (including its other uses, as a relative pronoun and as a short form for zašto ‘why’) to be 10-20 times more frequent than šta in Bosnian and Croatian; in Serbian, šta is about 4 times more frequent than što (which is also used as a relative pronoun and a short form for zašto).

In terms of automatic extraction, this variable is very simple in one sense, as it is based on a very short lexicon file, but not so straightforward in another, due to the presence of a diacritic sign. We included in the lexicon file the forms što and šta. We were not able to follow the general approach of disregarding diacritics in the analysis (that is, taking sto and sta into account) due to homonymy of sto, meaning ‘table’ and ‘one hundred’.

dali:jeli

In BCMS, yes/no questions are asked using interrogative particles je li and da li. Je li is the norm in Croatian, where da li only occurs in the colloquial register (Hudeček & Vukojević, 2007). Serbian uses both forms, but je li is commonly shortened to je l’, jel’ or jel and used colloquially, while the preferred full form is da li. Bosnian seems to be mixed, with a moderate preference for Croatian-type question forms (see Špago-Ćumurija, 2009). The Montenegrin orthography manual lists both je li and da li (Perović et al., 2009).

As a multi-word variable, dali:jeli was extracted using regular expressions, ‘\bda li\b’ and ‘\bje li\b’ respectively. The shorter alternatives (je l’, jel’, jel) were not included, as they could not be treated as a separate variable level (recall that we focus on two-level variables) and merging them with the more formal je li would bias the results.

s:sa

The preposition s(a) ‘with’ is another point of divergence in BCMS. In standard Croatian, the choice between the forms s and sa is based on phonetic factors: sa is to be used before s, š, z, ž (sa šlagom ‘with cream’), before consonant clusters such as ks or ps (sa Ksenijom ‘with Ksenija’), and before the instrumental form of the 1st singular pronoun ja (sa mnom ‘with me’); s should be used in all other cases (s ledom ‘with ice’, s Ivanom ‘with Ivan’). In standard Serbian, there is a rule about using sa before similar-sounding consonants, and about using s in fixed expressions such as s jedne strane ‘on the one hand’, but the choice is explicitly left to the speakers in all other cases.14 Tošović (2009) reports the relevant frequencies in the parallel corpus GRALIS, showing that s is around four times more frequent than sa in
Croatian and around twice as frequent in Bosnian, while in Serbian sa is around 2.5 times more frequent than s.

In terms of extraction, s:sa was one of the simplest variables, obtained using a two-form manual lexicon file.

mnogo:puno

The intensifying adverbs mnogo and puno ‘many, a lot’, are both used in all variants of BCMS, but puno is particularly typical of Croatian, and mnogo of Serbian. The use of puno in Serbian is the subject of numerous discussions, and some normativists have long been trying to ban it, claiming that its only meaning is that of an adverb derived from the adjective pun ‘full’. For this reason, it is often perceived as colloquial.

This variable was also extracted through a two-form manually created lexicon file.

ko:tko

The interrogative pronoun meaning ‘who’ takes the form ko in Serbian, Bosnian, and Montenegrin, and tko in Croatian; the same goes for the derived pronouns neko/netko ‘somebody’, niko/nitko ‘nobody’, svako/svatko ‘everybody’, and iko/itko ‘anybody’. Tko is the older form, and some authors use its survival in Croatian as an argument for its greater conservativeness compared to Serbian (see Pranjković, 1997). We focus on the derived forms and leave out the actual tko and ko, due to ko also being used as a very frequent short form of kao ‘like, as’. Ne(t)ko and sva(t)ko are excluded as well, due to also being neuter singular forms of the demonstratives neki ‘some’ and svaki ‘every’, leaving in the analysis ni(t)ko ‘nobody’ and i(t)ko ‘anybody’.

This variable was also obtained using a manual lexicon. Only nominative forms were listed, as t is absent in the other cases for both tko and ko type pronouns, which would bias the results.

long:shortinf

The full infinitival form of verbs in BCMS ends in either -ti or -ći (pisati ‘write’; ići ‘go’). In Croatian, it is quite common to shorten the infinitives by removing the final i (as in pisat, ić), sometimes because of the rule for future formation (for verbs ending in -ti, e.g., pisat ću ‘I will write’; more detail below, under the synth:nonsynth variable), and sometimes colloquially (see Miličević & Ljubešić, 2016; Miličević, Ljubešić & Fišer, 2017); this phenomenon is virtually non-existent in Serbian.

The extraction of infinitives was based on a lexicon file derived from hrLex and srLex, from which all infinitives were obtained based on the morphosyntactic descriptions. The short forms were defined by taking away the final -i.

da:inf

One of the features most often cited as differentiating between the syntax of Serbian and Croatian is the composition of complex predicates containing modal (moći ‘can’, morati ‘must’, smeti ‘dare, may’, trebati ‘need’) or phasal verbs (početi ‘begin’, završiti ‘end’), which in Serbian tend to take as complement da (‘that’) + present tense form of the verb, a construction typical of the Balkan Sprachbund (as in volim da pišem ‘I like to write’), while in Croatian, infinitives are used when the subject remains the same (volim pisati) (Kovačić, 2005; Piper, 2009; Tošović, 2008). In Bosnian, the two constructions are normatively equal (Čedić, 2001).

We extracted this variable using a list of verb infinitives and present tense forms from the hrLex and srLex morphological lexicons.

synth:nonsynth

The future tense has a synthetic form for most verbs in Serbian, with clitic forms of the auxiliary hteti ‘want’ merged with the verb (as in pisaću ‘I will write’), while the analytic form is used in Croatian, with the infinitive (short form) and the auxiliary as separate words (pisat ću); the analytic form is used in Serbian too when the verb ends in -ći (reći ću ‘I will say’). This variable is very frequently mentioned in discussions of the relationship between Serbian and Croatian (see Bekavac, Seljan & Simeon, 2008; Kovačić, 2005; Piper, 2009; Tošović, 2008). Bosnian uses both kinds of forms, and there does not seem to be a very clear preference for one or the other (for conflicting views in the literature see Bekavac et al., 2008; Silić, 2008; Špago-Ćumurija, 2009). The Montenegrin norm allows both types of future formation, underlining that synthetic forms are more common (Perović et al., 2009).

Again, as with previous morphosyntactic variables, the extraction process was based on the hrLex and srLex lexicons, the latter containing synthetic future forms.

adjg

In adjectival inflection in BCMS it is sometimes possible to append a vowel at the end of a word for easier pronunciation and/or stylistic markedness. The most typical case is -a in genitive singular forms of masculine adjectives; e.g., novoga ‘of the new’ is fairly frequently used in standard Croatian instead of novog, more typical of Serbian.

This variable was obtained again by exploiting the hrLex and srLex morphological lexicons.

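
The lexicon-based extraction described for the variables above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicon entries, the function names, and the tokenization are hypothetical and are not the authors' implementation; the real lexicons were built manually or derived from hrLex and srLex. The sketch maps each word form to a (variable, level) pair, derives short infinitives by taking away the final -i, and folds diacritics so that forms typed without them still match.

```python
import re
import unicodedata
from collections import Counter


def fold(text):
    """Lowercase and drop combining marks, so that noć and noc compare equal.
    Note: đ has no canonical decomposition and is left unchanged here."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def short_infinitive(infinitive):
    """Derive the short infinitive by taking away the final -i (pisati -> pisat)."""
    return infinitive[:-1] if infinitive.endswith("i") else infinitive


# Toy lexicon: folded word form -> (variable, level). Entries are illustrative only.
LEXICON = {
    "mnogo": ("mnogo:puno", "mnogo"),
    "puno": ("mnogo:puno", "puno"),
    "niko": ("ko:tko", "ko"),
    "nitko": ("ko:tko", "tko"),
}
for inf in ["pisati", "govoriti"]:  # stand-ins for infinitives taken from a lexicon
    LEXICON[fold(inf)] = ("long:shortinf", "long")
    LEXICON[fold(short_infinitive(inf))] = ("long:shortinf", "short")


def extract(tweet):
    """Count (variable, level) hits for every token found in the lexicon."""
    tokens = re.findall(r"\w+", fold(tweet))
    return Counter(LEXICON[t] for t in tokens if t in LEXICON)


print(extract("Mnogo ucim, mnogo panicim, mnogo se nerviram."))
```

Folding diacritics before the lookup is what makes a variable such as sto:sta a special case: after folding, što would collide with sto ‘table, one hundred’, so such forms need to be matched with their diacritics intact.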
ira:isa:ova

This variable concerns the morphological composition of verbs. When deriving borrowings from international verbs, Croatian typically uses the verbal suffix -ira (as in promovirati ‘promote’, registrirati ‘register’), while -isa and -ova prevail in Serbian (promovisati, registrovati). As far as Bosnian is concerned, Čedić (2001) mentions that in the past two decades -ira verbs have become more frequent than -isa and -ova verbs, but that it also happens that an -ira infinitive and an inflected form belonging to the -ova paradigm appear in the same text (e.g., organizirati plus organizuju instead of organiziraju).

To extract this variable, a similar procedure was followed as for e:je, with the difference that canonical forms rather than word forms were matched for everything but the ira vs. isa/ova suffix. From the identified canonical forms, lexicons of all word forms were produced.

treba

In standard Serbian, the modal verb trebati ‘need’ is often used impersonally. This is the result of a prescriptive tradition that bans constructions such as trebam da idem ‘I need.1SG to go.PRES.1SG’ and requires treba da idem ‘I need.3SG to go.PRES.1SG’. No such rule is instantiated in the grammar of Croatian, where personal forms are normally accompanied by infinitives, as in trebam ići ‘I need.1SG go.INF’. Interestingly, this difference is not commonly listed in the works dealing with the differences between Croatian and Serbian.

The variable treba was extracted using the regular expressions ‘\btreba(m|s|mo|te|ju)\b|\btreba(?! da)’, covering present tense forms of the verb (without the adjacent da), and ‘\btreba da\b’, for the impersonal form of the verb.

ica:ka

Our last variable concerns the suffixes used for deriving feminine agent nouns, which partly overlap, and partly differ across BCMS. The suffix -ica (as in nastavnica ‘teacher’) is present in all languages, but it is dominant only in Croatian and Bosnian, while in Serbian the suffixes -ka (čitateljka ‘reader’) and -inja (laborantkinja ‘lab technician’) are very frequent as well (Dražić & Vojinović, 2009; Šehović, 2009). The choice of the suffix also depends on the ending of the masculine noun from which the feminine form is derived: inter-varietal differences between -ica and -ka mostly occur after -or and -ar (as in profesor – profesorica/profesorka ‘professor’, or zubar – zubarica/zubarka ‘dentist’). For the purposes of this paper, we thus only looked at -rica and -rka, as both -ica and -ka are too generic as word endings and do not always mark agents.15

The variable was extracted by identifying feminine noun lemmata in the hrLex and srLex lexicons that end in the corresponding suffix pairs. The extracted list was additionally checked by hand.

Given that the variables were extracted automatically, and some of them could only be approximated, some noise in the data was inevitable. This is very often due to diacritic omissions, which are fairly common on Twitter, and which we disregarded (e.g., noc was treated equally as noć ‘night’). This approach led to some atypical cases of homonymy. Such cases were sometimes easily predictable, and we adjusted the procedure to avoid them, as for the sto:sta variable. However, unpredictable overlaps also occurred. The frequent ones (whether related to diacritic omissions or not) were spotted during the analysis; e.g., the form braće, which can be the future tense of the verb brati ‘pick (fruits, flowers, etc.)’, but is much more often the genitive plural of the noun brat ‘brother’. In the final analysis of the future forms we disregarded such cases, keeping only those for which no match with other lemmas in the lexicon was found. Some less frequent overlaps were discovered only later and were deemed infrequent enough not to have a major impact on the results; e.g., the string glasace was classified as a synthetic future form (glasaće ‘he/she/it/they will vote’), even though in some contexts in the data it actually means glasače ‘voters.ACC’.

Note also that for most variables the situation is not expected to be black and white as to the geographical distribution of levels, given that both values are often attested within the same standard languages. What we are more likely to witness is a dominance of one level in some areas, and the other level in another, possibly corresponding to the patterns prescribed or described as characteristic in normative works. This is understandable given the shared recent history, and the literature often says that the current differences are more a matter of frequency of use and/or stylistic value than of complete divergence (see e.g. Piper, 2009:543; Tošović, 2009).

5. ANALYSES

We performed two main types of analyses: a) estimating the spatial distribution of the set of variables described above, and b) computing linguistic distance between the four administrative regions (BCMS) given the described variables and the current state borders. In the first case, we looked for linguistic boundaries irrespective of administrative borders, and once the linguistic boundaries were identified, we compared them to administrative borders. In the second case, we
measured linguistic similarity given the state territory. The second analysis contains a measure of similarity between the states and a measure of internal consistency in the choice of specific variable levels within one state. In this way, we measured both inter- and intrastate similarity. We refer to interstate similarity as distance (directly inverse similarity) and to intrastate similarity as a country’s variable consistency.

We performed all calculations in the statistical software R,16 mostly exploiting existing packages, but defining functions ourselves when necessary.

5.1. Estimating Spatial Distributions

The goal of the spatial analysis is to establish which level of a variable is dominant on which territory, regardless of the known state borders. We smoothed and extrapolated the originally observed counts using kernel density estimation (KDE), a well-established method for representing point observations as density surfaces. We show the areas of dominance on a map and call those visualizations level dominance plots. We performed this calculation for each of our 16 variables and compared its output manually with the BCMS state borders.

The local value of a density surface corresponds to the number of observations of the respective feature level proximate to this location. A kernel function is applied for smoothing the signal, and thus accounts for local noise. After computing density surfaces for each feature level individually, local intensities are compared and only the level with maximum local intensity is preserved and mapped as the dominant level. Hence, the level dominance plot function visually represents linguistic areas dominated by individual feature levels.

It is important to note that KDE distributes the probability of a variable level over a unit area under the curve. The probability of a variable level in one area is thus relative to its probability in other areas. An extremely high probability of a level in one observation region leaves little probability to be assigned to this level in other areas. This is an important shortcoming of KDE in the case of uneven spatial distribution of observations. If there is a high-density area for one level, but no such area exists for the other level, the dominance of the level for which there is a high-density area will be systematically underestimated in all the areas outside of the high-density area. To give one example, the long infinitive form is overall more frequent than the short form and its dominance should spread over most of the territory of BCMS. However, calculating KDE on the initial observations would show a different spread, more concentrated on a smaller region (Serbia in this case). This would happen because the territory of Serbia includes a high-density area (the city of Belgrade), where the longer form is dominant. The extremely high density of the predominant longer form in this area would leave little probability to be assigned to the long form in other regions. Since there is no such high-density point for the short form, its probability will be more evenly distributed across regions and thus estimated by KDE as higher than the probability of the long form in many regions where this is, in fact, not true.

To address this issue, we performed KDE on balanced samples. We randomly selected observations so that the number of observations used for KDE is proportional to the territory of each of the four countries. By doing this, we simulate a more even distribution of observations, making our data set more suitable for KDE.

The same feature of KDE that is inconvenient in dealing with uneven spatial distribution of observations becomes crucial for eliminating the general frequency bias in cases where one variable level is generally much more frequent than the other. Transforming the original counts into a probability distribution with per-level normalization allows us to detect the variation in the differences between levels across regions. Without this transformation the more frequent levels would be perceived as dominant on the whole territory.

5.2. Computing Linguistic Distance

We continue our analyses by focusing on the differences in distributions of our variables in the four countries. We perform these analyses on the full dataset, as the disproportional amounts of data available in different countries do not impact the per-country distributions that we base our analyses on. In these analyses, we primarily exploit the information-theoretic measure of Jensen-Shannon divergence (JSD), which quantifies the information loss occurring if we assume one distribution over another.

We calculated two basic types of distance: distance between variables and distance between countries. The distance between variables tells us how similar the chosen linguistic features are to one another, and also whether some features tend to cluster together. High feature clustering is indicative of distinct varieties. The distance between countries provides an aggregate score of how much the language used in the four countries differs.

When calculating the distance between two countries, we calculate for each variable the JSD between the two country distributions, thereby obtaining 16 distances which we average. To give an example, when calculating the distance between Bosnia-Herzegovina and Croatia, we calculated JSD over their distributions for the e:je variable, as well as for the 15 remaining variables, finally averaging over the 16 obtained distances.

For calculating the distance between two variables, we used the same initial variable distributions as when
calculating the distance between two countries, but grouped them now not by variable, but by country. To obtain a single distance we again averaged the JSDs obtained on each distribution pair, the pairs now coming from different variables in identical countries. To give a similar example to the previous one, for calculating the distance between the e:je and ica:ka variables, we calculated JSD over the Bosnian distributions of these two variables, repeating the procedure on the three remaining countries.17 Finally, we calculated the average of the four obtained distances.

Finally, to quantify the consistency of each country with regard to our 16 variables of interest, we used an index calculated as the average of the lower ratios of all the variables. The two extremes of this metric are 0.0 if in each of the variables one level covers the whole distribution, and 0.5 if each of the variables has an equiprobable distribution, therefore both levels of a variable having the probability of 0.5.

6. RESULTS

6.1. Estimating Spatial Distributions

Given that Twitter is known to be used more in densely populated areas, while analyzing the level dominance plots, we take into account the amount of data available from specific regions. Namely, some of the regions are known to be sparsely populated, therefore a level dominance in these areas can be due to generalization over small amounts or even no data. We represent the amount of data available in Map 5 in the form of a heatmap. As expected, the map shows the largest cities in the area to be the centers of content production. The only area completely lacking Twitter data is the Dinarides area on the border between Croatia and Bosnia-Herzegovina, known to be largely unpopulated. While most of Serbia is well covered with data, Croatia seems to have the least consistent coverage, with large areas in the northeast and the central part showing very scarce data coverage. A similar, but less drastic situation can be observed in southwestern Bosnia-Herzegovina and in border areas between Bosnia-Herzegovina and Montenegro, and Montenegro and Serbia.

Map 5. Heatmap representing the spatial distribution of data points in our sampled dataset.

To simplify the presentation of the level dominance plots of each of the 16 variables, we organize them into four basic groups given the state patterns they follow:

1. Croatia vs. remaining countries
2. Croatia and Bosnia-Herzegovina vs. Montenegro and Serbia
3. Serbia vs. remaining countries
4. No visible state pattern

An overview of the variables in the four state patterns is given in Table 4 (with color-coded variable types). The table shows that for most of our variables of choice, more precisely for three-fourths of them, a strict state pattern can be observed. There does not seem to be any correspondence between variable types and state patterns; no pattern contains only a single variable type. The most productive pattern, covering six of our 16 variables, is the west vs. east, i.e., Croatia, Bosnia-Herzegovina vs. Montenegro, Serbia pattern. The least productive, at least in terms of the number of variables, is the Serbia vs. remaining countries pattern, although it covers the overall most frequent phenomenon, the jat’ reflex. One should notice that all patterns actually follow a relaxed west vs. east pattern, in which Bosnia-Herzegovina and Montenegro incline either to the east or to the west.

Table 4. Overview of variables given the state pattern they are grouped in. Type of variable is encoded with colour.

State pattern        Variables
HR vs. rest          ira:isaova, ko:tko, k:h, rdrop
HR, BA vs. ME, RS    da:inf, mnogo:puno, treba, h:noh, synth:nonsynth, s:sa
RS vs. rest          e:je, ica:ka
no pattern           dali:jeli, long:shortinf, sto:sta, adjg

Map 6 depicts the level dominance plots following the Croatia vs. remaining countries pattern. The ira:isa:ova and ko:tko variables follow the pattern in full, especially if we take into account the complex shape of Croatia, while the variables k:h and rdrop show a deviation in southern Bosnia-Herzegovina, predominantly using the Croatian-preferred level. This does not come as a surprise, given the large Croatian population living in this area.

In Map 7, the dominance plots following the second, Croatia, Bosnia-Herzegovina vs. Montenegro, Serbia, state pattern are given. Similarly to the previous pattern, the variables da:inf, mnogo:puno and treba follow the pattern in full, while h:noh, synth:nonsynth and s:sa show a deviation, this time in the area of central-northern Bosnia-Herzegovina mostly populated by ethnic Serbs, which follows the eastern-preferred levels.

Map 8 depicts the dominance plots that follow the Serbia vs. remaining countries state pattern, namely the e:je variable and the ica:ka variable. The latter variable lacks coverage in southern Croatia, where the level dominant in the remainder of Croatia and the neighboring countries can be expected.
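The level dominance maps discussed here rest on the procedure from Section 5.1: one density surface per level, followed by keeping the level with the highest local intensity. The following is only a minimal sketch of that idea in Python. The Gaussian kernel, the fixed bandwidth, the grid, the function names and the toy coordinates are all illustrative assumptions, not the settings or data used in the study, and the sketch leaves out the transformation that the paper applies to balance level frequencies.

```python
import math

def density(points, x, y, bandwidth):
    """Gaussian kernel density estimate at (x, y).

    Dividing by the number of points makes every level's surface integrate
    to 1, which is the property behind the KDE shortcoming described in
    Section 5.1.
    """
    norm = 2.0 * math.pi * bandwidth ** 2 * len(points)
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return total / norm

def dominant_level(observations, x, y, bandwidth=0.5):
    """Return the level whose density surface is highest at (x, y)."""
    return max(
        observations,
        key=lambda level: density(observations[level], x, y, bandwidth),
    )

def dominance_grid(observations, xs, ys, bandwidth=0.5):
    """Evaluate the dominant level on a regular grid (rows follow ys)."""
    return [[dominant_level(observations, x, y, bandwidth) for x in xs] for y in ys]

# Invented (longitude, latitude) observations for a two-level variable.
toy = {
    "e": [(20.4, 44.8), (20.5, 44.8), (20.5, 44.7), (19.8, 45.3)],
    "je": [(16.0, 45.8), (15.9, 45.8), (18.4, 43.9)],
}
grid = dominance_grid(toy, xs=[16.0, 18.0, 20.0], ys=[44.0, 45.0])
```

Each cell of `grid` holds the label of the locally dominant level; colouring the cells by label would give a rough analogue of a level dominance plot. The bandwidth controls how far a single observation spreads its influence, so sparse regions are coloured by distant data, which is why the data coverage in Map 5 matters when reading the maps.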
Map 6. Level dominance plots grouped in the Croatia vs. remaining countries pattern.
Map 7. Level dominance plots grouped in the Croatia, Bosnia-Herzegovina vs. Montenegro, Serbia pattern.
Finally, Map 9 contains the dominance plots showing no or partial state patterns. While dali:jeli and long:shortinf very roughly follow the Serbia vs. remaining countries pattern, the sto:sta and adjg variables show signs of a pattern not observed in the previous variables: Croatia and Montenegro leaning on one, Bosnia-Herzegovina and Serbia on the other side.

Map 8. Level dominance plots grouped in the Serbia vs. remaining countries pattern.
Map 9. Level dominance plots grouped in no state pattern.

There are three main conclusions we can draw from the presented results. The first one is that most variables follow state borders, reflecting long-standing linguistic and normative differences, as well as the recent separate standardization processes. The second conclusion is that there is an overall east vs. west pattern in which Bosnia-Herzegovina and Montenegro tend to incline either to the east or to the west. The final conclusion relaxes the first one as a significant number of variables, more precisely five, break the state pattern in Bosnia-Herzegovina, with parts heavily populated either with
ethnic Croats or Serbs leaning towards the level dominant in the respective “mother country.”

6.2. Computing Linguistic Distance

In Figure 1 we show the distributions of each variable in each country, grouped by variables. These distributions are the basis for the calculations whose results are given in the remainder of this section.

[Figure 1: one panel per variable, showing the distribution of its levels in BA, HR, ME and RS.]
Figure 1. Per-country distribution plot of the 16 variables taken under consideration for Bosnia (BA), Croatia (HR), Montenegro (ME) and Serbia (RS).

We start from the variable distance matrix, which we represent in the form of a dendrogram (Figure 2). Our goal is to compare this border-obeying clustering of features with the results of dominance plot groupings performed in the previous section. Note that the setting of the two analyses is very different. While the previous analysis used country borders as possible explanations for the obtained dominance plots, which did not have access to border information, in this analysis these borders are our starting point by calculating per-country variable distributions. The primary goal of the comparison of the two analyses is to either challenge or further strengthen our previous conclusions.

[Figure 2: dendrogram with the 16 variables as leaves.]
Figure 2. Dendrogram based on the variable distances calculated via average JSD.

The first cluster in the dendrogram, containing the variables k:h, ko:tko, ira:isa:ova and rdrop, fully corresponds to the pattern Croatia vs. remaining countries from the previous section. The second cluster from the left, comprising the synth:nonsynth, mnogo:puno, s:sa and h:noh variables, covers four out of six variables clustered previously in the Croatia, Bosnia-Herzegovina vs. Montenegro, Serbia pattern. The large cluster present in the right side of the figure corresponds to a smaller extent to the previously observed state patterns. The Serbia vs. remaining countries pattern, comprising the e:je and ica:ka variables, can be identified as a separate cluster, long:shortinf and dali:jeli similarly forming a cluster and being part of the no state pattern. The remaining two clusters (adjg and treba, da:inf and sto:sta) do not correspond to previously identified patterns.

We can conclude that this first analysis strongly backs our previous conclusions that dominance plots follow specific state patterns. Namely, three-fourths of the variables that are clustered together in this analysis were previously grouped into the same state patterns,
and in the case of one-half of the variables, large clusters of four variables fully correspond to previously constructed state patterns.

We next analyze the calculated country distance matrix. The distances between countries are based on our 16 variables and we should stress right here that these distances do not take into account the natural frequency of occurrence of the phenomena operationalized in these variables. The country distance matrix is presented in Table 5.

Table 5. Country distance matrix calculated as average JSD.

      BA      HR      ME      RS
BA    0.0
HR    0.116   0.0
ME    0.016   0.163   0.0
RS    0.048   0.222   0.047   0.0

The distance matrix shows that the most similar country pair is Bosnia-Herzegovina and Montenegro, followed by Montenegro and Serbia (0.047) and Bosnia-Herzegovina and Serbia (0.048), which are virtually tied. The least similar countries are Croatia and Serbia. These distances again follow our observations from the previous section, Croatia and Serbia presenting two extremes, and Bosnia-Herzegovina and Montenegro falling somewhere in between, but being overall closer to Serbia than to Croatia. However, as already stated, these distances are based on 16 variables that have very different frequencies of occurrence. Calculating the distance between the same languages on running text would primarily rely on the four most frequent variables that cover 81% of variable occurrences, namely e:je (Serbia vs. remaining countries), da:inf (Croatia and Bosnia vs. Montenegro and Serbia), long:shortinf (partially Serbia vs. remaining countries) and s:sa (Croatia and Bosnia vs. Montenegro and Serbia). These four variables draw Bosnia-Herzegovina much closer to Croatia, leaving Montenegro still closer to Serbia, which matches the results seen in the automatic classification of Twitter users presented in Ljubešić & Kranjčić (2015), where most of the errors come from confusing Bosnian and Croatian users on one side and Serbian and Montenegrin users on the other.

The previously presented country distance matrix can be transformed into a single-country table by averaging all the distances of a country to the remaining countries, thereby quantifying the overall distance of a country to its neighbors, i.e., its linguistic distinctness. These average distances are the following: for Bosnia it is 0.060, for Croatia 0.167, for Montenegro 0.075, and for Serbia 0.106. The results reveal Croatia to be most distant and therefore most distinct linguistically (at least regarding the 16 chosen variables). Croatia is followed, but not closely, by Serbia, with Montenegro and Bosnia-Herzegovina being the two least distinct countries.

We wrap up this series of analyses by quantifying the variable consistency index of each country. The quantifications are the following: for Bosnia it is 0.23, for Croatia 0.18, for Montenegro 0.19, while for Serbia it is 0.14. These findings show Serbia to be the most consistent country given our variables, which can be explained by the fact that it is more compact dialect-wise than Croatia, and more centralized standard-wise than Bosnia-Herzegovina and Montenegro. Croatia and Montenegro are very close to each other and take the middle ground, while Bosnia-Herzegovina, as expected, is linguistically the most diverse country by far, in all likelihood because of the competing influences of Croatia and Serbia.

This series of analyses has once again shown that Croatia and Serbia represent linguistic extremes among our four countries of interest. Bosnia-Herzegovina and Montenegro seem to be closer to Serbia in our per-variable setting. The most linguistically distinct country is Croatia (most distant from the other countries), and the most consistent country regarding our 16 variables is Serbia.

7. DISCUSSION

The goal of our study was to empirically measure the spread of some of the features considered indicative of regional differences in BCMS, looking in particular at the extent to which this spread corresponds to the current state borders. Historical developments, including the recent separation of former Yugoslavia, give rise to opposing expectations: a match between linguistic and administrative borders can be interpreted as an effect of rather constant divergent norming tendencies, emphasized by the most recent political split; no match can be interpreted as an effect of equally constant unifying trends and a common dialectal basis of the standard languages.

Although our analysis does not provide a simple answer, we can draw several generalizations regarding the regional distribution of a set of features, and we can show how these features constitute differences between the language used in the four countries.

At the most general level, Croatian and Serbian represent two extremes, while Montenegrin and especially Bosnian fall in between them, changing sides depending on the variable; overall, Montenegro leans more frequently to Serbia, and Bosnia-Herzegovina to Croatia. However, when each language is contrasted to the other three languages, Croatian is more distinct from the rest than Serbian; along similar lines, Bosnian and Montenegrin are overall closer to Serbian than to Croatian. The country that most frequently does not
correspond to variable level boundaries is Bosnia-Herzegovina, depicting the ethnic heterogeneity of the population, and the strong role of language as a differentiating factor.

Our findings reflect quite closely the recent history of the languages in question. Since the first 19th century attempts at joint standardization, Croatia and Serbia have always constituted two opposed poles of the shared standard, each with its own historical baggage and its own agenda: Serbia more focused on the unifying potential of a common literary language in the Yugoslav context, and Croatia intent on preserving at least some of its diversity and its distinctive features in a situation that it perceived as Serbian dominance.18 The language spoken in Montenegro was seen as a variant of Serbian and, until Montenegro gained independence, it was largely absent from the disputes. Bosnia-Herzegovina, on the other hand, had the advantage that its majority native speech fully corresponded to the chosen standard, but it also had to keep focus on maintaining a fragile balance among its different ethnic and religious groups (Croats, Serbs and Muslims), siding with Croatian on some language features, with Serbian on others, and enriching its lexical base with words of Turkish origin. As our data show, these tendencies are especially distinguishable today.

As far as different linguistic features are concerned, the feature that certainly carries the most linguistic relevance is our e:je variable, on which Serbia is distinct from the remainder of the countries in prevalently using e forms such as mleko ‘milk’ rather than (i)je forms (mlijeko). This feature reflects a prominent dialectal distinction that played an important role in establishing a standard diasystem instead of a single standard already in the 19th century. Regarding the other features, we observe a considerable degree of bundling, but we do not find plausible linguistic explanations for the clusters of features.

Divergent language norming, from the 1960s “variants” and the 1970s “standard idioms” to more recent separate standardizations, seems to have brought the desired divergent results for some features (e.g., the synthetic and analytic forms of the future tense, a point of much dispute within the wider question of phonetic vs. etymological spelling), but not so clearly for others (note in particular the widespread use of da li in Croatia). Features that are felt as being more related to what sounds natural than to strict normative rules also lead to clear patterns in some cases. For example, the fairly high incidence of short infinitives in Montenegro can be related to the properties of a major dialect; even more prominently, the -ka suffix in Serbian is one of its distinguishing features despite not being emphasized in prescriptive rules. Again, it is difficult to draw a conclusion applying to most features of the same type.

When it comes to the results concerning the internal variable consistency of the four countries, the situation is somewhat surprising at first sight. While the Croatian norm is usually described as being purist and strict, and the Serbian one as allowing multiple options and more free choice based on the speakers’ intuitions, for our sample of variables the country that is most consistent in its choices of variable values is actually Serbia. Croatia and Montenegro are less stable, with Bosnia-Herzegovina, expectedly, being the most varied from this point of view too.

A possible explanation is again essentially a historical-linguistic one, having to do with the fact that standardization required different levels of adaptation in different countries. The native speech of Serbia’s cultural centers was not only closer to the proposed standard compared to the native speech of Croatia’s center, but Serbia was overall more centralized and more unified in terms of the vernacular even before standardization: the speech of Belgrade and Novi Sad had clear prestige, which it still does. The diversity of Croatia’s dialects with their rich literary tradition meant that strict rules were needed if a common base was to be created. However, strict rules did not eliminate the regional variation, which continues to show up in everyday speech.

8. CONCLUDING REMARKS

Overall, we can conclude that in BCMS linguistic boundaries do, to some extent, match administrative boundaries, as well as ethnic divides in Bosnia-Herzegovina. However, the match is never complete, and boundaries differ for different variables. The dominant boundary establishes a west vs. east divide, where Croatia and Serbia are fairly stable on their respective ends, while Bosnia-Herzegovina and Montenegro align sometimes with one, and sometimes with the other.

Of course, the results that we obtained depend heavily on the specific variables we selected and should ideally be expanded by including additional variables. However, given that we focused on some of the core features brought up in almost all works dealing with differences within BCMS, our findings can be seen as empirical evidence that should not be ignored in further linguistic accounts of the linguistic situation in BCMS. In future work, we will study more variables, looking more closely at the distinction between rules grounded in actual language use and the purely normative ones, as well as apply approaches where variables are not defined in advance, but where the full amount of linguistic signal is processed in search for the most distinguishing features of that signal.

Finally, our results seem to lend support to the view of Twitter as a new source of data for deriving spatial
distributions of linguistic features. Given a medium level adoption of Twitter in most of the countries, we can expect other, more popular social media, primarily Facebook, to be an even better source of linguistically relevant spatial signal.

Acknowledgments

The work presented here is partially funded by the SNSF grant 160501 and by a special grant awarded by the UZH URPP ‘Language and Space.’

Notes

1 https://www.monstat.org/userfiles/file/popis2011/Tabela%20CG2.xls.

2 Even the year of this writing (2017) saw one such declaration, “Deklaracija o zajedničkom jeziku” (“Declaration on the common language”), initiated by several linguists and signed by over 8000 respondents. The original text of the declaration and the list of the respondents are available at http://jezicinacionalizmi.com/deklaracija/. The Economist covered the event with a short article: https://www.economist.com/blogs/economist-explains/2017/04/economist-explains-4.

3 The term NORM (non-mobile older rural male) is often used to refer to typical informants in traditional dialectology. In the case of BCMS, the typical profile includes female, rather than male informants (Petrović, 2015).

4 As a matter of fact, the issue of sparseness was encountered in the first dialectological survey famously carried out by Wenker in the 19th century. This survey consisted of a number of standard German sentences translated into local varieties. While its spatial coverage was excellent, it did not provide information about many categories known to vary across regions because most of these categories did not show up in the selected sentences. This is what led, in part, to the development of dialectological questionnaires targeting specific categories of variation.

5 We refer here to today’s countries that did not exist as such at the time.

6 The language of the Serbian literary tradition (first Serbian Church Slavonic, then Slavonic-Serbian, “slavenoserbski”) was an artificial variety stemming from Church Slavonic, with increasingly present elements of the Serbian vernacular, but also other Slavic elements (in the Slavonic-Serbian phase). Vuk’s efforts towards standardizing contemporary vernacular meant breaking up with the literary tradition, which created a strong resistance among Serbian scholars.

7 Although not without an opposition such as the most recent declaration mentioned above.

8 A useful source of the relevant literature consisted of edited volumes published within a project conducted 2006-2010 at the University of Graz, dedicated to the study of the differences between Bosnian, Croatian and Serbian (http://www-gewi.uni-graz.at/gralis/projektarium/BKS-Projekt/index.html); these volumes contain reprints of numerous papers relevant for the discussion of the status of BCMS.

9 hrLex: http://hdl.handle.net/11356/1072; srLex: http://hdl.handle.net/11356/1073.

10 Comparable morphological lexicons for Bosnian and Montenegrin are currently not available. However, all values of the variables we look at are covered by Croatian and Serbian data.

11 Examples include pairs such as voz / vlak ‘train’, hleb / kruh ‘bread’, bešika / mjehur ‘bladder’, and many others.

12 Officially, Serbian uses both e and (i)je, but the overwhelming majority of speakers use e.

13 Pranjković (1997) lists the h rule as an example of the conservativeness of the Croatian norm, compared to the openness of the Serbian norm, which accepts innovations more readily.

14 Pranjković (1997) claims that the dominance of sa in Serbian results from its general tendency to unify competing forms rather than distinguishing their specific contexts of use (a parallel example is provided by another preposition, k(a) ‘towards’).

15 The generic status of word endings was the reason why we had to leave out the most widely discussed suffix pair, -telj/-lac (as in čitatelj/čitalac ‘reader’).

16 https://www.r-project.org.

17 When calculating variable distances we actually calculate JSD over different variables where levels do not correspond. To mitigate this, we perform the calculation over both possible combinations of level pairs and choose the minimum value.

18 The difference comes as no surprise, given that Serbia’s major cultural centers, Belgrade and Novi Sad, did not have to adapt their speech much to conform to the new standard, keeping even the ‘e’ option, while the main Croatian center, Zagreb, had to abandon its native ‘kaj + e’ dialect and switch to the much less familiar ‘što + je’ variety.

References

Alexander, Ronelle. 2013. Language and identity: The fate of Serbo-Croatian. In Roumen Daskalov and Tchavdar Marinov (eds.), Entangled histories of the Balkans. Volume 1: National ideologies and language policies, 341–417. Leiden & Boston: Brill.

Barić, Eugenija, Mijo Lončarić, Dragica Malić, Slavko Pavešić, Mirko Peti, Vesna Zečević & Marija Znika. 1997. Hrvatska gramatika, 2nd edn. Zagreb: Školska knjiga.

Bart, Gabriela, Elvira Glaser, Pius Sibler & Robert Weibel. 2013. Analysis of Swiss German syntactic variants using spatial statistics. In Xosé Afonso Álvarez Pérez, Ernestina Carrilho & Catarina Magro (eds.), Current approaches to limits and areas in dialectology, 143–169. Newcastle upon Tyne: Cambridge Scholars Publishing.

Bekavac, Božo, Sanja Seljan & Ivana Simeon. 2008. Corpus-based comparison of contemporary Croatian, Serbian and Bosnian. In Marko Tadić, Mila Dimitrova-Vulchanova & Svetla Koeva (eds.), Proceedings of the Sixth International Conference “Formal approaches to South Slavic and Balkan languages” (FASSBL 6), 33–39. Zagreb: Croatian Language
Technologies Society & Faculty of Humanities and Social Sciences.

Britain, David. 2002. Dialectology. In David Bickerton (ed.), A web guide to teaching and learning in languages, linguistics and area studies. Southampton: Subject Centre for Languages, Linguistics and Area Studies. http://www.llas.ac.uk/resources/gpg/964 [Updated January 2005].

Browne, Wayles & Theresa Alt. 2004. A handbook of Bosnian, Serbian, and Croatian. http://www.seelrc.org:8080/grammar/mainframe.jsp?nLanguageID=1 (29 October, 2017).

Chambers, J.K. & Peter Trudgill. 1998. Dialectology, 2nd edn. Cambridge: Cambridge University Press.

Čedić, Ibrahim. 2001. Bosanskohercegovački standardnojezički izraz – bosanski jezik. In Svein Mønnesland (ed.), Jezik i demokratizacija, 69–77. Sarajevo: Institut za jezik. Reprinted in Branko Tošović & Arno Wonisch (eds.). 2009. Bošnjački pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, 41–50. Graz & Sarajevo: Institut für Slawistik der Karl-Franzens-Universität Graz & Institut za jezik Sarajevo.

Čirgić, Adnan, Ivo Pranjković & Josip Silić. 2010. Gramatika crnogorskoga jezika. Podgorica: Ministarstvo prosvjete i nauke Crne Gore.

Doyle, Gabriel. 2014. Mapping dialectal variation by querying social media. In Proceedings of the 14th Conference of the European chapter of the Association for Computational Linguistics, 98–106. Gothenburg: Association for Computational Linguistics.

Dražić, Jasmina & Jelena Vojinović. 2009. Imenice tipa nomina agentis u srpskom i hrvatskom jeziku (tvorbeni i semantički aspekt). In Branko Tošović (ed.), Die Unterschiede zwischen dem Bosnischen/Bosniakischen, Kroatischen und Serbischen. Lexik – Wortbildung – Phraseologie, 311–320. Berlin-Münster-Wien-Zürich-London: LIT Verlag. Reprinted in Branko Tošović & Arno Wonisch (eds). 2010. Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, Book I/2, 41–50. Graz & Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz & Beogradska knjiga.

Eisenstein, Jacob, Brendan O’Connor, Noah A. Smith & Eric P.

Glaser, Elvira. 2013. Area formation in morphosyntax. In Peter Auer, Martin Hilpert, Anja Stukenbrock & Benedikt Szmrecsanyi (eds.), Space in language and linguistics: Geographical, interactional and cognitive perspectives (linguae & litterae 24), 195–221. Berlin & Boston: De Gruyter.

Goebl, Hans. 1982. Dialektometrie: Prinzipien und methoden des einsatzes der numerischen taxonomie im bereich der dialektgeographie. Wien: Österreichischen Akademie der Wissenschaften.

Goebl, Hans. 1984. Dialektometrische Studien: Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF. 3 Vol. Tübingen: Max Niemeyer.

Gonçalves, Bruno & David Sánchez. 2014. Crowdsourcing dialect characterization through Twitter. PLoS ONE 9(11): e112074. https://doi.org/10.1371/journal.pone.0112074

Halilović, Senahid. 2004. Pravopis bosanskoga jezika za osnovne i srednje škole. Zenica: Dom štampe.

Hornsby, David. 2009. Dedialectalization in France: Convergence and divergence. International Journal of the Sociology of Language 196(97). 157–180.

Hudeček, Lana & Luka Vukojević. 2007. Da li, je li i li – normativni status i raspodjela. Rasprave 33. 217–234.

Ivić, Pavle. 1956. Dijalektologija srpskohrvatskog jezika. Uvod i štokavsko narečje. Novi Sad: Matica srpska.

Jahić, Dževad, Senahid Halilović & Ismail Palić. 2000. Gramatika bosanskoga jezika. Zenica: Dom štampe.

Kortmann, Bernd & Susanne Wagner. 2005. The Freiburg English dialect project and corpus. In Bernd Kortmann, Tanja Herrmann, Lukas Pietsch & Susanne Wagner (eds.), A Comparative Grammar of British English Dialects: Agreement, Gender, Relative Clauses, 1–20. Berlin & New York: Mouton de Gruyter.

Kovačić, Marko. 2005. Serbian and Croatian: One language or languages? Jezikoslovlje 6. 195–204.

Labov, William. 1963. The social motivation of a sound change. Word 19. 273–309.

Ljubešić, Nikola, Nives Mikelić & Damir Boras. Language identification: How to distinguish similar languages? In Proceedings of the 29th International Conference on Information
1544 Xing. 2010. A latent variable model for geographic lexical Technology Interfaces ITI 2007, 541–546. Cavtat, Croatia. 1603
1545 variation. In Proceedings of the 2010 Conference on Empirical Ljubešić, Nikola, Darja Fišer & Tomaž Erjavec. 2014. TweetCaT: 1604
1546 Methods in Natural Language Processing, 1277–1287. A tool for building Twitter corpora of smaller languages. In 1605
1547 Cambridge, MA: Association for Computational Linguistics. Proceedings of the Ninth International Conference on Language 1606
1548 Eisenstein, Jacob, Noah A. Smith & Eric P. Xing. 2011. Resources and Evaluation (LREC’14), 2279–2283. Reykjavik, 1607
1549 Discovering sociolinguistic associations with structured Iceland. 1608
1550 sparsity. In Proceedings of the 49th Annual Meeting of the Ljubešić, Nikola & Denis Kranjčić. 2015. Discriminating 1609
1551 Association for Computational Linguistics: Human language between closely related languages on Twitter. Informatica 39 1610
1552 technologies, 1365–1374. Portland: Association for (1). 1–8. 1611
1553 Computational Linguistics. Ljubešić, Nikola, Filip Klubička, Željko Agić & Ivo-Pavao 1612
1554 Eisenstein, Jacob, Brendan O’Connor, Noah A. Smith & Eric P. Jazbec. 2016. New inflectional lexicons and training corpora 1613
1555 Xing. 2014. Diffusion of lexical change in social media. PloS for improved morphosyntactic annotation of Croatian and 1614
1556 ONE 9(11). e113114. https://doi.org/10.1371/journal. Serbian. In Nicoletta Calzolari, Khalid Choukri, Thierry 1615
1557 pone.0113114 Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, 1616
1558 Fišer, Darja, Tomaž Erjavec, Nikola Ljubešić & Maja Miličević. Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk 1617
1559 2015. Comparing the nonstandard language of Slovene, & Stelios Piperidis (eds.), Proceedings of the Tenth International 1618
1560 Croatian and Serbian tweets. In Mojca Smolej (ed.), Simpozij Conference on Language Resources and Evaluation (LREC 2016), 1619
1561 Obdobja 34. Slovnica in slovar - aktualni jezikovni opis, Part 1, 23–28. Paris: European Language Resources Association 1620
1562 225–231. Ljubljana: Filozofska fakulteta. (ELRA). 1621
1622 Ljubešić, Nikola, Tanja Samardžić & Curdin Derungs. 2016. Scheffler, Tatjana, Johannes Gontrum, Matthias Wegel & Steve 1681
1623 TweetGeo – A tool for collecting, processing and analyzing Wendler. 2014. Mapping German tweets to geographic 1682
1624 geo-encoded linguistic data. In Yuji Matsumoto & Rashmi regions. In Proceedings of the NLP4CMC Workshop at Konvens, 1683
1625 Prasad (eds.), Proceedings of COLING 2016, the 26th Interna- 26–34. Bochum: Bochumer Linguistische Arbeitsberichte. 1684
1626 tional Conference on Computational Linguistics: Technical Séguy, Jean. 1971. La relation entre la distance spatiale et la 1685
1627 Papers, 3412–3421. Osaka: The COLING 2016 Organizing distance lexicale. Revue de linguistique romane 35. 335–357. 1686
1628 Committee. Silić, Josip. 2008. Fonetsko-fonološke i ortografsko-ortoepske 1687
1629 Miličević, Maja, Nikola Ljubešić & Darja Fišer. 2017. Birds of a razlike između bosanskoga (bošnjačkoga), hrvatskoga i 1688
1630 feather don’t quite tweet together: An analysis of spelling srpskoga jezika. In Branko Tošović (ed.). Die Unterschiede 1689
1631 variation in Slovene, Croatian and Serbian twitterese. In zwischen dem Bosnischen/Bosniakischen, Kroatischen und Ser- 1690
1632 Darja Fišer & Michael Beißwenger (eds.), Investigating bischen, 266–274. Berlin-Münster-Wien-Zürich-London: LIT 1691
1633 computer-mediated communication: Corpus-based approaches to Verlag. Reprinted in Branko Tošović & Arno Wonisch (eds.). 1692
1634 language in the digital world, 14–43. Ljubljana: Scientific 2010. Hrvatski pogledi na odnose između srpskog, hrvatskog i 1693
1635 Publishing House of the Faculty of Arts, University of bošnjačkog jezika, Book I, 87–98. Graz & Zagreb: Institut für 1694
1636 Ljubljana. Slawistik der Karl-Franzens-Universität Graz & Izvori. 1695
1637 Miličević, Maja & Nikola Ljubešić. 2016. Tviterasi, tviteraši or Speelman, Dirk, Stefan Grondelaers & Dirk Geeraerts. 2003. 1696
1638 twitteraši? Producing and analyzing a normalized dataset of Profile-based linguistic uniformity as a generic method for 1697
1639 Croatian and Serbian tweets. Slovenščina 2.0 4. 156–188. comparing language varieties. Computers and the Humanities 1698
1640 Nerbonne, John, Wilbert Heeringa, E Erik van den Hout, Peter 37(3). 317–317. 1699
1641 van der Kooi, Simone Otten & Willem van de Vis. 1995. Stanojčić, Živojin & Ljubomir Popović. 2008. Gramatika srpskog 1700
1642 Phonetic distance between Dutch dialects. In Gert Durieux, jezika za gimnazije i srednje škole. Beograd: Zavod za 1701
1643 Walter Daelemans & Steven Gillis (eds.), CLIN VI: udžbenike. 1702
1644 Proceedings from the Sixth CLIN Meeting, 185–202. Stevanović, Mihailo. 1989. Savremeni srpskohrvatski jezik. 1703
1645 Antwerpen: Center for Dutch Language and Speech, Beograd: Naučna knjiga. 1704
1646 University of Antwerpen (UIA). Szmrecsanyi, Benedikt. 2008. Corpus-based dialectometry: 1705
1647 Nerbonne, John, Wilbert Heeringa & Peter Kleiweg. 1999. Edit aggregate morphosyntactic variability in British English 1706
1648 distance and dialect proximity. In David Sankoff & Joseph dialects. International Journal of Humanities and Arts Comput- 1707
1649 Kruskal (eds.), Time Warps, String Edits and Macromolecules: ing 2(1/2) (special issue; John Nerbonne, Charlotte Goos- 1708
1650 The Theory and Practice of Sequence Comparison, 2nd edn., kens, Sebastian Kürschner & Renée van Bezooijen (eds.) 1709
1651 5–15. Stanford: CSLI. Language Variation). 279–296. 1710
1652 Nguyen, Dong, Noah Smith & Carolyn Rosé. 2011. Author age Šehović, Amela. 2009. Mocioni sufiksi u bosanskom, hrvatskom 1711
1653 prediction from text using linear regression. In Proceedings of i srpskom jeziku (u nomina agentis et professionis). In 1712
1654 the 5th ACL-HLT Workshop on Language Technology for Branko Tošović & Arno Wonisch (eds.), Bošnjački pogledi na 1713
1655 Cultural Heritage, Social Sciences, and Humanities, 115–123. odnose između srpskog, hrvatskog i bošnjačkog jezika, 433–445. 1714
1656 Portland: Association for Computational Linguistics. Graz & Sarajevo: Institut für Slawistik der Karl-Franzens- 1715
1657 Perović, Milenko A., Josip Silić & Ljudmila Vasiljeva. 2009. Universität Graz & Institut za jezik Sarajevo. 1716
1658 Pravopis crnogorskoga jezika i rjecň ik crnogorskoga jezika Š pago-Ć umurija, Edina. 2009. Bosnian or Croatian? Sintaksičke 1717
1659 ̌ ik). Podgorica: Ministarstvo prosvjete i nauke
(pravopisni rjecn razlike u kursevima bosanskog i hrvatskog jezika za strance. 1718
1660 Crne Gore. In Branko Tošović (ed.), Die Unterschiede zwischen dem 1719
1661 Pešikan, Mitar, Jovan Jerković & Mato Pižurica. 2010. Pravopis Bosnischen/Bosniakischen, Kroatischen und Serbischen. 1720
1662 srpskoga jezika. Novi Sad: Matica srpska. Grammatik, 375–387. Berlin-Münster-Wien-Zürich-London: 1721
1663 Petrović, Tanja. 2015. Srbija i njen Jug : “južnjački dijalekti” LIT Verlag. Reprinted in Branko Tošović & Arno Wonisch 1722
1664 između jezika, kulture i politike. Beograd: Fabrika knjiga. (eds.). 2009. Bošnjački pogledi na odnose između srpskog, 1723
1665 Pichler, Heike & Ashley Hesson. 2016. Discourse-pragmatic hrvatskog i bošnjačkog jezika, 273–292. Graz & Sarajevo: 1724
1666 variation across situations, varieties, ages: I DON’T KNOW Institut für Slawistik der Karl-Franzens-Universität Graz & 1725
1667 in sociolinguistic and medical interviews. Language & Institut za jezik Sarajevo. 1726
1668 Communication 49. 1–18. Tošović, Branko. 2008. Gramatičke razlike između srpskog, 1727
1669 Piper, Predrag. 2009. O prirodi gramatičkih razlika između hrvatskog i bošnjačkog jezika (preliminarium). In Tilman 1728
1670 srpskog i hrvatskog jezika. In Predrag Piper (ed.), Berger & Biljana Golubović (eds.), Morphologie – Mü ndlichkeit 1729
1671 Južnoslovenski jezici: gramatičke strukture i funkcije, 537–552. – Medien: Festschrift fü r Jochen Raecke, 311–322. Hamburg: 1730
1672 Beograd: Beogradska knjiga. Verlag Dr. Kovač. Reprinted in Branko Tošović & Arno 1731
1673 Pranjković, Ivo. 1997. Hrvatski standardni jezik i srpski Wonisch (eds.). 2010. Srpski pogledi na odnose između srpskog, 1732
1674 standardni jezik. In Emil Tokarz (ed.), Język wobec hrvatskog i bošnjačkog jezika, Book I/2, 183–200. Graz & 1733
1675 przemian kultury, 50–59. Katowice: Wydawnictwo Belgrade: Institut für Slawistik der Karl-Franzens- 1734
1676 Uniwersytetu Śląskiego. Reprinted in Branko Tošović & Universität Graz & Beogradska knjiga. 1735
1677 Arno Wonisch (eds.). 2012. Hrvatski pogledi na odnose između Tošović, Branko. 2009. Die grammatikalischen Unterschiede 1736
1678 srpskog, hrvatskog i bošnjačkog jezika, Book II, 408–417. Graz & zwischen dem Bosnischen/Bosniakischen, Kroatischen und 1737
1679 Zagreb: Institut für Slawistik der Karl-Franzens-Universität Serbischen. In Branko Tošović (ed.), Die Unterschiede 1738
1680 Graz & Izvori. zwischen dem Bosnischen/Bosniakischen, Kroatischen und 1739
1740 Serbischen. Grammatik, 131–188. Berlin-Münster-Wien- the genesis of New Zealand English. Journal of Linguistics 36 1751
1741 Zürich-London: LIT Verlag. Reprinted in Branko Tošović & (2). 299–318. 1752
1742 Arno Wonisch (eds.). 2010. Srpski pogledi na odnose između Wieling, Martijn, John Nerbonne & Harald Baayen. 2011. 1753
1743 srpskog, hrvatskog i bošnjačkog jezika, Book I/2, 237–292. Graz Quantitative social dialectology: Explaining linguistic 1754
1744 & Belgrade: Institut für Slawistik der Karl-Franzens- variation geographically and socially. PLoS ONE 6(9). 1755
1745 Universität Graz & Beogradska knjiga. e23613. doi:10.1371/journal.pone.0023613 1756
1746 Trudgill, Peter. 1974. Linguistic change and diffusion: Woolhiser, Curt. 2005. Political borders and dialect 1757
1747 description and explanation in sociolinguistic dialect divergence/convergence in Europe. In Peter Auer, Frans 1758
1748 geography. Language in Society 3. 215–246. Hinskens & Paul Kerswill (eds.), Dialect Change. Convergence 1759
1749 Trudgill, Peter, Elizabeth Gordon, Gillian Lewis & Margaret and Divergence in European Languages, 236–262. New York: 1760
1750 MacLagan. 2000. Determinism in new-dialect formation and Cambridge University Press. 1761
1762
