
This article was downloaded by: [Corporacion CINCEL]

On: 03 May 2012, At: 07:08


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Quantitative Linguistics


Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/njql20

Hapax Legomena and Language Typology

Ioan-Iovitz Popescu (a) & Gabriel Altmann (b)

(a) Bucharest University
(b) Lüdenscheid

Available online: 07 Oct 2008

To cite this article: Ioan-Iovitz Popescu & Gabriel Altmann (2008): Hapax Legomena and
Language Typology, Journal of Quantitative Linguistics, 15:4, 370-378
To link to this article: http://dx.doi.org/10.1080/09296170802326699


Journal of Quantitative Linguistics
2008, Volume 15, Number 4, pp. 370–378
DOI: 10.1080/09296170802326699

Hapax Legomena and Language Typology*

Ioan-Iovitz Popescu¹ and Gabriel Altmann²
¹Bucharest University; ²Lüdenscheid


ABSTRACT
Counting word forms one obtains more hapax legomena in highly synthetic languages
than in highly analytic ones. We propose an index of analytism based exclusively on
mechanical counting of word forms in text.

Since the seminal work of V. Skalička (see esp. 2004–2006), language
typology is no longer a classification of languages. Though many
linguists are still engaged in describing, collecting and locating isolated
language phenomena, typology can nowadays be seen as the study of
relationships between language properties. In general, the more
contingent two phenomena are, the stronger the relationship between
them, but it can be expected that even distant phenomena like phonemics
and syntax display some dependences. The consistent continuation of this
kind of typology has become language synergetics (cf. Köhler, 1986, 2005),
which tries to express all relationships formally and derive them from a
common self-regulation scheme. Needless to say, all relationships in
language are stochastic and the resulting equations are either nonlinear
regression functions or probability distributions. Many of them hold
only for averages.
However, up to now, textual phenomena have not been exploited for
finding typological relationships, because studying texts in many
languages at once is not as easy as using ready-made grammar textbooks
from which elaborated phenomena can be selected. Perhaps the first

*Address correspondence to: Gabriel Altmann, Stüttinghauser Ringstr. 44, 58515
Lüdenscheid. E-mail: RAM-Verlag@t-online.de
0929-6174/08/15040370 © 2008 Taylor & Francis


attempt to use textual phenomena is made in Popescu et al. (2008), where
a relationship between a function of the h-point (cf. Popescu, 2007) of the
rank-frequency distribution of word forms and the degree of synthetism
of language has been found. The study was performed on texts of 20
languages and the relationship does not depend on text length.
In the present contribution we shall try to show a relationship
between hapax legomena of texts and the synthetism/analytism of
language. Logically, if a language is highly synthetic, then, whatever the
text length, not all forms of each word will occur several times (i.e. more
than once). A great number of forms will occur only once, thus forming
the stock of hapax legomena occupying the last HL ranks. In highly
analytic languages, by contrast, the number of forms is small, hence all
words have a greater chance of being repeated. The situation would be
quite different if we analysed lemmatized texts, which contain no
morphology.
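The counting described above can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: the function name word_form_stats and the tokenisation (lower-cased runs of letters, optionally joined by apostrophes) are assumptions made here, and real studies would need language-specific rules for what counts as a word form.

```python
import re
from collections import Counter


def word_form_stats(text):
    """Return (V, HL, freqs) for the word forms of a text.

    V     -- number of different word forms (vocabulary size)
    HL    -- number of word forms occurring exactly once (hapax legomena)
    freqs -- frequencies sorted in decreasing order (rank-frequency sequence)
    """
    # Assumed tokenisation: lower-cased runs of letters, apostrophes allowed inside.
    forms = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter(forms)
    freqs = sorted(counts.values(), reverse=True)
    V = len(freqs)
    HL = sum(1 for f in freqs if f == 1)
    return V, HL, freqs


sample = "the cat saw the dog and the dog saw a bird"
V, HL, freqs = word_form_stats(sample)
print(V, HL, freqs)
```

Because the frequencies are sorted in decreasing order, the hapax legomena always occupy the last HL ranks, i.e. the ranks from V − HL + 1 to V.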
This result (showing only the bottom part of the data) simply shows
that in highly synthetic languages with a long hapax legomena sequence
the theoretical function rather underestimates the frequencies trying to
capture the steep decrease in frequencies.
In the case of highly analytic languages (Figure 1c) the theoretical
function overestimates the empirical frequencies (not only of hapax
legomena).
In the present work we try to give these considerations a
quantitative expression, as illustrated in Figures 1a, b, c. For this
purpose, let V denote the size of the vocabulary of word forms and HL the
number of ranks at which hapax legomena of the rank-frequency
distribution occur, and let us suppose that the rank-frequency
distribution follows Zipf's law. The validity of Zipf's law has
been corroborated on innumerable data sets in many languages, and
sporadic exceptions can be captured by one of the many generalizations
of this law. For the sake of simplicity we shall use the Zipf
function $f(r) = c/r^a$, in which a and c are fitting parameters, thus
getting rid of the necessity of normalization and truncation at the
right-hand side.
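As a rough illustration of how the two parameters can be estimated, the sketch below fits $f(r) = c/r^a$ by ordinary least squares on the logarithms of rank and frequency. This is a swapped-in simplification, not the iterative fitting used in the article, and on real texts it will generally give somewhat different values of a and c. The function name fit_zipf_loglog and the synthetic data are assumptions for the example.

```python
import math


def fit_zipf_loglog(freqs):
    """Estimate (a, c) in f(r) = c / r**a from a rank-frequency sequence.

    Uses a straight-line fit of log f against log r, which is a
    simplification of nonlinear fitting on the original scale.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = -slope                               # exponent of the power law
    c = math.exp(mean_y - slope * mean_x)    # value of the curve at rank 1
    return a, c


# Synthetic check: data generated from an exact power law are recovered.
freqs = [100.0 / r ** 0.8 for r in range(1, 51)]
a, c = fit_zipf_loglog(freqs)
print(round(a, 4), round(c, 4))
```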
Now, if we fit this function to rank-frequency data, we may expect that
with a good fit the function reaches $f(r) = 1$ (i.e. the level of hapax
legomena) exactly in the middle of the hapax legomena (i.e. at
$r = V - HL/2$), as shown in Figure 1a for a Bulgarian text. But iterative
fitting of the Zipf function may yield different values. Therefore, in the
following we will

Fig. 1a. Zipfian location of hapax legomena in a balanced language (here Bulgarian).

Fig. 1b. Over-Zipfian location of hapax legomena in a highly synthetic language (here
Hungarian).

Fig. 1c. Under-Zipfian location of hapax legomena in a highly analytic language (here
Hawaiian).

take the above central location as a reference for any other fitting and
introduce the indicator

$$A = \frac{c}{(V - HL/2)^a} \qquad (1)$$

as an expression of analytism of a language. The range of this indicator
is not yet known, but the relationship to analytism/synthetism is evident:
the greater A, the stronger is the analytism. In order to check our
hypothesis we used the texts analysed in Popescu et al. (2008) and
obtained the A-values presented in Table 1 for 100 texts from 20 languages
and in Table 2 for the corresponding language averages. The individual
extreme cases of Table 1 are illustrated in Figure 1b for a highly synthetic
Hungarian text and in Figure 1c for a highly analytic Hawaiian text.
Generally, the fitted Zipf curve is displaced rightwards for analytic
languages and leftwards for synthetic languages with respect to the real
rank-frequency distribution. The displacement depends on the degree of
analytism/synthetism. Clearly, there is a one-to-one
correspondence between the analytism indicator A and its crossing point


Table 1. Zipf's function $f(r) = c/r^a$ fitted to data of 100 texts from 20 languages; A = c/(V − HL/2)^a.

ID      V     a       c          R²      HL    A
B 01    400   0.6850  41.8602    0.9837  298   0.9507
B 02    201   0.5704  17.6950    0.8705  153   1.1292
B 03    285   0.5550  20.9975    0.8790  212   1.1798
B 04    286   0.6169  23.6917    0.9619  222   0.9790
B 05    238   0.6202  22.0499    0.9367  187   1.0090
Cz 01   638   0.7473  54.2844    0.9764  517   0.6416
Cz 02   543   0.7169  51.9648    0.9767  412   0.8013
Cz 03   1274  0.8028  175.4805   0.9832  964   0.8261
Cz 04   323   0.6228  23.3822    0.9537  241   0.8562
Cz 05   556   0.8722  77.1944    0.9715  445   0.4864
E 01    939   0.7657  145.9980   0.9620  662   1.0783
E 02    1017  0.7434  180.1325   0.9661  735   1.4610
E 03    1001  0.8179  254.7482   0.9752  620   1.2123
E 04    1232  0.8712  385.9532   0.9870  693   1.0449
E 05    1495  0.8009  319.1386   0.9822  971   1.2529
E 07    1597  0.7568  300.1258   0.9347  1075  1.5416
E 13    1659  0.8034  811.1689   0.9800  736   2.5688
G 05    332   0.6935  32.8211    0.9646  250   0.8129
G 09    379   0.6523  32.5565    0.9626  302   0.9431
G 10    301   0.6053  21.8114    0.9402  237   0.9331
G 11    297   0.5895  19.9677    0.9593  232   0.9320
G 12    169   0.6062  14.3627    0.9514  141   0.8888
G 14    129   0.5755  10.8110    0.9349  107   0.8977
G 17    124   0.5515  13.1021    0.9349  84    1.1531
H 01    1079  1.2268  214.2708   0.9600  844   0.0749
H 02    789   1.1865  122.0057   0.9365  638   0.0824
H 03    291   1.2114  44.9653    0.8864  259   0.0950
H 04    609   0.9549  74.8581    0.9451  509   0.2753
H 05    290   0.8168  30.9795    0.9093  250   0.4784
Hw 03   521   0.7932  329.6012   0.9489  255   2.8821
Hw 04   744   0.7633  678.1305   0.9154  347   5.3384
Hw 05   680   0.7267  592.6243   0.8742  302   6.2199
Hw 06   1039  0.7816  1081.7823  0.9352  500   5.8855
I 01    3667  0.7266  509.5979   0.9336  2514  1.7784
I 02    2203  0.7488  305.6487   0.9559  1604  1.3468
I 03    483   0.7895  56.8099    0.9523  382   0.6427
I 04    1237  0.7014  153.3448   0.9385  848   1.3948
I 05    512   0.6524  54.5840    0.9293  355   1.2306
In 01   221   0.5809  18.2346    0.9486  166   1.0420
In 02   209   0.5915  19.1717    0.9583  147   1.0509
In 03   194   0.5417  15.6229    0.9565  130   1.1233
In 04   213   0.4877  11.9156    0.9574  145   1.0683
(continued)

Table 1. (Continued).

ID      V     a       c          R²      HL    A
In 05   188   0.5374  19.4218    0.8843  121   1.4347
Kn 003  1833  0.6072  66.4545    0.9775  1373  0.9223
Kn 004  720   0.5237  22.1001    0.9699  564   0.9144
Kn 005  2477  0.6621  124.5588   0.9105  1784  0.9480
Kn 006  2433  0.5809  95.9573    0.9522  1655  1.3181
Kn 011  2516  0.5786  77.0267    0.9666  1873  1.0862
Lk 01   174   0.6416  23.4838    0.9348  127   1.1474
Lk 02   479   0.7731  139.2126   0.9510  302   1.5798
Lk 03   272   0.7512  71.8668    0.9527  174   1.4240
Lk 04   116   0.6792  18.7509    0.9801  80    0.9901
Lt 01   2211  0.7935  109.3668   0.9078  1792  0.3666
Lt 02   2334  0.8047  160.3530   0.9335  1878  0.4729
Lt 03   2703  0.6366  109.5291   0.9832  2049  0.9695
Lt 04   1910  0.6505  129.2023   0.9463  1359  1.2627
Lt 05   909   0.5877  34.1056    0.9713  737   0.8449
Lt 06   609   0.5293  19.3370    0.9325  521   0.8726
M 01    398   0.7680  185.4091   0.9225  202   2.3386
M 02    277   0.8197  123.4636   0.9693  146   1.5787
M 03    277   0.7902  147.8281   0.9557  133   2.1571
M 04    326   0.8353  137.7184   0.9763  192   1.4664
M 05    514   0.7484  297.2460   0.9306  239   3.3897
Mq 01   289   0.8030  240.0615   0.9588  91    2.9102
Mq 02   150   0.7440  46.4870    0.9655  86    1.4370
Mq 03   301   0.9795  225.2046   0.9856  138   1.0853
Mr 001  1555  0.6293  78.3965    0.9815  1128  1.0210
Mr 018  1788  0.6685  128.5531   0.9863  1249  1.1470
Mr 026  2038  0.6224  101.6971   0.9633  1486  1.1758
Mr 027  1400  0.6166  120.0829   0.9456  846   1.7214
Mr 288  2079  0.6304  100.2890   0.9683  1534  1.0857
R 01    843   0.6720  73.6423    0.9571  606   1.0739
R 02    1179  0.7567  115.8007   0.9802  908   0.7930
R 03    719   0.7175  60.8094    0.9778  567   0.7771
R 04    729   0.6673  52.4236    0.9798  573   0.8993
R 05    567   0.6746  48.1009    0.9743  424   0.9157
R 06    432   0.6349  30.3691    0.9350  353   0.8995
Rt 01   223   0.8575  123.9533   0.9645  127   1.6008
Rt 02   214   0.7469  83.2271    0.9316  128   1.9726
Rt 03   207   0.7208  78.6409    0.9465  98    2.0454
Rt 04   181   0.7359  60.2092    0.9358  102   1.6749
Rt 05   197   0.6917  87.0541    0.9469  73    2.5959
Ru 01   422   0.6538  36.1404    0.9604  316   0.9437
Ru 02   1240  0.7713  138.5450   0.9915  946   0.8251
(continued)

Table 1. (Continued).

ID      V     a       c          R²      HL    A
Ru 03   1792  0.7106  158.2659   0.9620  1365  1.0851
Ru 04   2536  0.7181  234.3457   0.9571  1850  1.1661
Ru 05   6073  0.7826  775.3826   0.9807  4395  1.2063
Sl 01   457   0.7467  44.1840    0.9760  364   0.6665
Sl 02   603   0.6846  68.9001    0.9823  423   1.1571
Sl 03   907   0.7685  115.2402   0.9604  651   0.8651
Sl 04   1102  0.9187  334.8100   0.9912  701   0.7633
Sl 05   2223  0.7232  240.2785   0.9490  1593  1.2572
Sm 01   267   0.8285  177.1858   0.9678  119   2.1315
Sm 02   222   0.7752  123.5355   0.9450  96    2.2641
Sm 03   140   0.6858  58.1896    0.8708  75    2.4320
Sm 04   153   0.7925  89.0771    0.9563  76    2.0738
Sm 05   124   0.7161  46.3093    0.9263  66    1.8312
T 01    611   0.7624  120.0367   0.8817  465   1.2995
T 02    720   0.7803  144.5780   0.8685  540   1.2297
T 03    645   0.7652  167.7334   0.8923  447   1.6447

B, Bulgarian; Cz, Czech; E, English; G, German; H, Hungarian; Hw, Hawaiian; I, Italian;
In, Indonesian; Kn, Kannada; Lk, Lakota; Lt, Latin; M, Maori; Mq, Marquesan; Mr,
Marathi; R, Romanian; Rt, Rarotongan; Ru, Russian; Sl, Slovenian; Sm, Samoan; T,
Tagalog.

(real or virtual) between the fitted Zipf curve and the hapax legomena
level $f(r) = 1$.
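The correspondence between A and the crossing point can be made concrete with a small sketch. The curve $c/r^a$ reaches 1 at $r^* = c^{1/a}$, so A exceeds 1 exactly when $r^*$ lies to the right of the reference rank V − HL/2. The function names analytism and crossing_rank are assumptions for this illustration; the parameter values are three rows copied from Table 1.

```python
def analytism(V, HL, a, c):
    """Indicator A = c / (V - HL/2)**a, i.e. equation (1)."""
    return c / (V - HL / 2) ** a


def crossing_rank(a, c):
    """Rank r* at which the fitted curve c / r**a equals 1."""
    return c ** (1 / a)


# (V, HL, a, c) as listed in Table 1
rows = {
    "B 01": (400, 298, 0.6850, 41.8602),
    "H 01": (1079, 844, 1.2268, 214.2708),
    "Hw 04": (744, 347, 0.7633, 678.1305),
}

results = {}
for name, (V, HL, a, c) in rows.items():
    A = analytism(V, HL, a, c)
    r_star = crossing_rank(a, c)
    middle = V - HL / 2          # middle of the hapax legomena ranks
    results[name] = (A, r_star, middle)
    print(name, round(A, 4), round(r_star, 1), middle)
```

Up to rounding of the published parameters, the computed values agree with the A column of Table 1. For the Hawaiian text the crossing rank lies beyond V, which is the "virtual" case mentioned in the text.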
Finally, it is worth mentioning that the data of Table 1 reveal a
good linear dependence between the hapax legomena length HL and the
vocabulary V, namely $HL = 0.7256\,V - 18.6979$, as further
illustrated in Figure 2. Hence, replacing HL in (1), we get a good
approximation for the analytism indicator in the form

$$A = \frac{2^a c}{(1.2744\,V + 18.6979)^a} \qquad (2)$$
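The substitution behind this approximation can be written out step by step, using only the regression line for HL stated in the text and the definition of A in (1):

```latex
\begin{align*}
V - \frac{HL}{2}
  &= V - \frac{0.7256\,V - 18.6979}{2}
   = \frac{1.2744\,V + 18.6979}{2}, \\[4pt]
A = \frac{c}{(V - HL/2)^{a}}
  &= \frac{c}{\left(\dfrac{1.2744\,V + 18.6979}{2}\right)^{a}}
   = \frac{2^{a}\,c}{(1.2744\,V + 18.6979)^{a}}.
\end{align*}
```

The approximate form therefore needs only V, a and c; the hapax legomena count HL enters through the regression coefficients.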

An independent measure of analytism/synthetism using e.g. the
Greenberg–Krupa indices (Greenberg, 1960; Krupa, 1965) could tell us
whether the purely morphological definition of this property corresponds
to our purely textual measure, which can be obtained mechanically without
analysing each word of a text with regard to its morphological structure.

Table 2. Mean analytism indicator A of 20 languages.

Language    Mean A  Number of texts
Hungarian   0.2012  5
Czech       0.7223  5
Latin       0.7982  6
Romanian    0.8931  6
German      0.9372  7
Slovenian   0.9418  5
Kannada     1.0378  5
Russian     1.0453  5
Bulgarian   1.0495  5
Indonesian  1.1438  5
Marathi     1.2302  5
Italian     1.2787  5
Lakota      1.2853  4
Tagalog     1.3913  3
English     1.4514  7
Marquesan   1.8108  3
Rarotongan  1.9779  5
Samoan      2.1465  5
Maori       2.1861  5
Hawaiian    5.0815  4
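The language averages are plain arithmetic means of the individual A-values in Table 1. A minimal sketch for the two extreme languages follows; the variable names are assumptions for the example and the values are copied from Table 1.

```python
from statistics import mean

# A-values of the individual texts, copied from Table 1
a_values = {
    "Hungarian": [0.0749, 0.0824, 0.0950, 0.2753, 0.4784],
    "Hawaiian": [2.8821, 5.3384, 6.2199, 5.8855],
}

# Mean A per language, rounded to four decimals as in Table 2
means = {lang: round(mean(vals), 4) for lang, vals in a_values.items()}

# Languages ordered from most synthetic (small A) to most analytic (large A)
ranking = sorted(means, key=means.get)
print(means, ranking)
```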

Fig. 2. Illustrating the linear dependence between the hapax legomena length HL and the
vocabulary V.

Needless to say, one can perform the same procedure using the
Zipf–Mandelbrot function or a number of other hyperbolic functions.
We adhere to Occam's razor and restrict ourselves to the original Zipf
function, which, by yielding this morphological distinction without
touching morphology, receives an important secondary corroboration.

REFERENCES
Greenberg, J. H. (1960). A quantitative approach to the morphological typology of
languages. International Journal of American Linguistics, 26, 178–194.
Köhler, R. (1986). Zur linguistischen Synergetik. Struktur und Dynamik der Lexik.
Bochum: Brockmeyer.
Köhler, R. (2005). Synergetic linguistics. In R. Köhler, G. Altmann & R. G.
Piotrowski (Eds), Quantitative Linguistics. An International Handbook (pp. 760–775).
Berlin: de Gruyter.
Krupa, V. (1965). On quantification of typology. Linguistics, 12, 31–36.
Popescu, I.-I. (2007). Text ranking by the weight of highly frequent words. In P. Grzybek
& R. Köhler (Eds), Exact Methods in the Study of Language and Text (pp. 555–565).
Berlin/New York: Mouton de Gruyter.
Popescu, I.-I., Vidya, M. N., Uhlířová, L., Pustet, R., Mehler, A., Mačutek, J., Krupa,
V., Köhler, R., Jayaram, B. D., Grzybek, P., & Altmann, G. (2008). Word
Frequency Studies. Berlin: Mouton de Gruyter (in press).
Skalička, V. (2004–2006). Souborné dílo I–III (edited by F. Čermák et al.). Praha:
Nakladatelství Karolinum. [The work contains the Czech translations.]
