Bucharest University
Lüdenscheid
To cite this article: Ioan-Iovitz Popescu & Gabriel Altmann (2008): Hapax Legomena and
Language Typology, Journal of Quantitative Linguistics, 15:4, 370-378
To link to this article: http://dx.doi.org/10.1080/09296170802326699
ABSTRACT
Counting word forms, one obtains more hapax legomena in highly synthetic languages
than in highly analytic ones. We propose an index of analytism based exclusively on
mechanical counting of word forms in text.
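
The "mechanical counting of word forms" can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and case folding, which are simplifications chosen here for illustration and not a procedure described in the article; the function name `vocabulary_and_hapax` is likewise hypothetical. It returns the vocabulary size V (distinct word forms) and the number of hapax legomena HL (forms occurring exactly once).

```python
from collections import Counter


def vocabulary_and_hapax(text):
    """Return (V, HL) for a text.

    V  = number of distinct word forms,
    HL = number of word forms that occur exactly once.
    Assumption: tokens are whitespace-separated and case is ignored.
    """
    freq = Counter(text.lower().split())
    v = len(freq)
    hl = sum(1 for count in freq.values() if count == 1)
    return v, hl


sample = "the cat saw the dog and the dog saw a bird"
V, HL = vocabulary_and_hapax(sample)
print(V, HL)  # 7 distinct forms, 4 of them hapax legomena
```

A real study would need a tokenizer suited to the language and script at hand, since the counts depend directly on what is treated as one word form.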
Fig. 1a. Zipfian location of hapax legomena in a balanced language (here Bulgarian).
Fig. 1b. Over-Zipfian location of hapax legomena in a highly synthetic language (here
Hungarian).
Fig. 1c. Under-Zipfian location of hapax legomena in a highly analytic language (here
Hawaiian).
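take the above central location as a reference for any other fitting and
introduce the indicator

$$A = \frac{c}{(V - HL/2)^{a}} \qquad (1)$$

where V is the vocabulary size, HL the number of hapax legomena, and a and c the parameters of the fitted Zipf function f(r) = c/r^a.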
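
As a hedged illustration, the indicator can be evaluated directly from the four quantities reported per text in Table 1. The sketch below reads the indicator as A = c/(V − HL/2)^a, that is, the value of the fitted Zipf function at the centre of the hapax segment; this reading reproduces the tabulated A values for the rows checked here. The function name `analytism_indicator` is hypothetical.

```python
def analytism_indicator(V, HL, a, c):
    """Value of the fitted Zipf function c / r**a at r = V - HL/2.

    V  : vocabulary size (number of distinct word forms)
    HL : number of hapax legomena
    a, c : parameters of the fitted Zipf function f(r) = c / r**a
    """
    return c / (V - HL / 2) ** a


# Rows taken from Table 1 (ID: V, HL, a, c).
A_b01 = analytism_indicator(V=400, HL=298, a=0.6850, c=41.8602)    # Table 1 lists 0.9507
A_h01 = analytism_indicator(V=1079, HL=844, a=1.2268, c=214.2708)  # Table 1 lists 0.0749

print(f"B 01: A = {A_b01:.4f}")
print(f"H 01: A = {A_h01:.4f}")
```

Values below 1 mean that the fitted curve lies under the hapax level f(r) = 1 at the centre of the hapax segment (as for H 01), and values above 1 mean that it lies over it.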
Table 1. Zipf's function f(r) = c/r^a fitted to data of 100 texts from 20 languages.

ID          V        a           c       R²     HL   A = c/(V - HL/2)^a
B 01      400   0.6850     41.8602   0.9837    298   0.9507
B 02      201   0.5704     17.6950   0.8705    153   1.1292
B 03      285   0.5550     20.9975   0.8790    212   1.1798
B 04      286   0.6169     23.6917   0.9619    222   0.9790
B 05      238   0.6202     22.0499   0.9367    187   1.0090
Cz 01     638   0.7473     54.2844   0.9764    517   0.6416
Cz 02     543   0.7169     51.9648   0.9767    412   0.8013
Cz 03    1274   0.8028    175.4805   0.9832    964   0.8261
Cz 04     323   0.6228     23.3822   0.9537    241   0.8562
Cz 05     556   0.8722     77.1944   0.9715    445   0.4864
E 01      939   0.7657    145.9980   0.9620    662   1.0783
E 02     1017   0.7434    180.1325   0.9661    735   1.4610
E 03     1001   0.8179    254.7482   0.9752    620   1.2123
E 04     1232   0.8712    385.9532   0.9870    693   1.0449
E 05     1495   0.8009    319.1386   0.9822    971   1.2529
E 07     1597   0.7568    300.1258   0.9347   1075   1.5416
E 13     1659   0.8034    811.1689   0.9800    736   2.5688
G 05      332   0.6935     32.8211   0.9646    250   0.8129
G 09      379   0.6523     32.5565   0.9626    302   0.9431
G 10      301   0.6053     21.8114   0.9402    237   0.9331
G 11      297   0.5895     19.9677   0.9593    232   0.9320
G 12      169   0.6062     14.3627   0.9514    141   0.8888
G 14      129   0.5755     10.8110   0.9349    107   0.8977
G 17      124   0.5515     13.1021   0.9349     84   1.1531
H 01     1079   1.2268    214.2708   0.9600    844   0.0749
H 02      789   1.1865    122.0057   0.9365    638   0.0824
H 03      291   1.2114     44.9653   0.8864    259   0.0950
H 04      609   0.9549     74.8581   0.9451    509   0.2753
H 05      290   0.8168     30.9795   0.9093    250   0.4784
Hw 03     521   0.7932    329.6012   0.9489    255   2.8821
Hw 04     744   0.7633    678.1305   0.9154    347   5.3384
Hw 05     680   0.7267    592.6243   0.8742    302   6.2199
Hw 06    1039   0.7816   1081.7823   0.9352    500   5.8855
I 01     3667   0.7266    509.5979   0.9336   2514   1.7784
I 02     2203   0.7488    305.6487   0.9559   1604   1.3468
I 03      483   0.7895     56.8099   0.9523    382   0.6427
I 04     1237   0.7014    153.3448   0.9385    848   1.3948
I 05      512   0.6524     54.5840   0.9293    355   1.2306
In 01     221   0.5809     18.2346   0.9486    166   1.0420
In 02     209   0.5915     19.1717   0.9583    147   1.0509
In 03     194   0.5417     15.6229   0.9565    130   1.1233
In 04     213   0.4877     11.9156   0.9574    145   1.0683
Table 1 (continued).

ID          V        a           c       R²     HL   A = c/(V - HL/2)^a
In 05     188   0.5374     19.4218   0.8843    121   1.4347
Kn 003   1833   0.6072     66.4545   0.9775   1373   0.9223
Kn 004    720   0.5237     22.1001   0.9699    564   0.9144
Kn 005   2477   0.6621    124.5588   0.9105   1784   0.9480
Kn 006   2433   0.5809     95.9573   0.9522   1655   1.3181
Kn 011   2516   0.5786     77.0267   0.9666   1873   1.0862
Lk 01     174   0.6416     23.4838   0.9348    127   1.1474
Lk 02     479   0.7731    139.2126   0.9510    302   1.5798
Lk 03     272   0.7512     71.8668   0.9527    174   1.4240
Lk 04     116   0.6792     18.7509   0.9801     80   0.9901
Lt 01    2211   0.7935    109.3668   0.9078   1792   0.3666
Lt 02    2334   0.8047    160.3530   0.9335   1878   0.4729
Lt 03    2703   0.6366    109.5291   0.9832   2049   0.9695
Lt 04    1910   0.6505    129.2023   0.9463   1359   1.2627
Lt 05     909   0.5877     34.1056   0.9713    737   0.8449
Lt 06     609   0.5293     19.3370   0.9325    521   0.8726
M 01      398   0.7680    185.4091   0.9225    202   2.3386
M 02      277   0.8197    123.4636   0.9693    146   1.5787
M 03      277   0.7902    147.8281   0.9557    133   2.1571
M 04      326   0.8353    137.7184   0.9763    192   1.4664
M 05      514   0.7484    297.2460   0.9306    239   3.3897
Mq 01     289   0.8030    240.0615   0.9588     91   2.9102
Mq 02     150   0.7440     46.4870   0.9655     86   1.4370
Mq 03     301   0.9795    225.2046   0.9856    138   1.0853
Mr 001   1555   0.6293     78.3965   0.9815   1128   1.0210
Mr 018   1788   0.6685    128.5531   0.9863   1249   1.1470
Mr 026   2038   0.6224    101.6971   0.9633   1486   1.1758
Mr 027   1400   0.6166    120.0829   0.9456    846   1.7214
Mr 288   2079   0.6304    100.2890   0.9683   1534   1.0857
R 01      843   0.6720     73.6423   0.9571    606   1.0739
R 02     1179   0.7567    115.8007   0.9802    908   0.7930
R 03      719   0.7175     60.8094   0.9778    567   0.7771
R 04      729   0.6673     52.4236   0.9798    573   0.8993
R 05      567   0.6746     48.1009   0.9743    424   0.9157
R 06      432   0.6349     30.3691   0.9350    353   0.8995
Rt 01     223   0.8575    123.9533   0.9645    127   1.6008
Rt 02     214   0.7469     83.2271   0.9316    128   1.9726
Rt 03     207   0.7208     78.6409   0.9465     98   2.0454
Rt 04     181   0.7359     60.2092   0.9358    102   1.6749
Rt 05     197   0.6917     87.0541   0.9469     73   2.5959
Ru 01     422   0.6538     36.1404   0.9604    316   0.9437
Ru 02    1240   0.7713    138.5450   0.9915    946   0.8251
Table 1 (continued).

ID          V        a           c       R²     HL   A = c/(V - HL/2)^a
Ru 03    1792   0.7106    158.2659   0.9620   1365   1.0851
Ru 04    2536   0.7181    234.3457   0.9571   1850   1.1661
Ru 05    6073   0.7826    775.3826   0.9807   4395   1.2063
Sl 01     457   0.7467     44.1840   0.9760    364   0.6665
Sl 02     603   0.6846     68.9001   0.9823    423   1.1571
Sl 03     907   0.7685    115.2402   0.9604    651   0.8651
Sl 04    1102   0.9187    334.8100   0.9912    701   0.7633
Sl 05    2223   0.7232    240.2785   0.9490   1593   1.2572
Sm 01     267   0.8285    177.1858   0.9678    119   2.1315
Sm 02     222   0.7752    123.5355   0.9450     96   2.2641
Sm 03     140   0.6858     58.1896   0.8708     75   2.4320
Sm 04     153   0.7925     89.0771   0.9563     76   2.0738
Sm 05     124   0.7161     46.3093   0.9263     66   1.8312
T 01      611   0.7624    120.0367   0.8817    465   1.2995
T 02      720   0.7803    144.5780   0.8685    540   1.2297
T 03      645   0.7652    167.7334   0.8923    447   1.6447
(real or virtual) between the fitting Zipf curve and the hapax legomena
level f(r) = 1.
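
The crossing point can be located explicitly: setting f(r) = c/r^a = 1 gives r0 = c^(1/a). The sketch below compares r0 with the ranks occupied by the hapax legomena, assuming that they fill the last HL ranks of the rank-frequency sequence (they have the lowest possible frequency). Calling the crossing "virtual" when r0 exceeds V is an interpretation made for this sketch, and the function names are hypothetical. Parameter values are taken from Table 1.

```python
def crossing_rank(a, c):
    """Rank r0 where the Zipf curve f(r) = c / r**a reaches f = 1."""
    return c ** (1.0 / a)


def locate_crossing(V, HL, a, c):
    """Describe where r0 falls relative to the hapax ranks V-HL+1 .. V."""
    r0 = crossing_rank(a, c)
    first_hapax = V - HL + 1
    if r0 > V:
        return r0, "virtual: beyond the last rank"
    if r0 < first_hapax:
        return r0, "real: before the hapax segment"
    return r0, "real: inside the hapax segment"


# (V, HL, a, c) for three texts of Table 1
texts = {
    "B 01": (400, 298, 0.6850, 41.8602),
    "H 01": (1079, 844, 1.2268, 214.2708),
    "Hw 03": (521, 255, 0.7932, 329.6012),
}
results = {name: locate_crossing(*row) for name, row in texts.items()}
for name, (r0, where) in results.items():
    print(f"{name}: r0 = {r0:.1f} ({where})")
```

Under these assumptions the three texts fall into three different cases, which is the kind of contrast the indicator A condenses into a single number.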
Finally, it is worth mentioning that the data of Table 1 reveal a
good linear dependence between the hapax legomena length HL and the
vocabulary V, namely HL = 0.7256·V − 18.6979, as further
illustrated in Figure 2. Hence, replacing HL in (1), we get a good
approximation for the analytism indicator in the form

$$A \approx \frac{2^{a}\, c}{(1.2744\, V + 18.6979)^{a}}$$
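
A minimal sketch of this substitution, assuming the reading A = c/(V − HL/2)^a of (1): with HL = 0.7256·V − 18.6979 the base of the denominator, V − HL/2, becomes (1.2744·V + 18.6979)/2, so that only V, a and c are needed. The function names are hypothetical; the sample row is B 01 from Table 1.

```python
def analytism_exact(V, HL, a, c):
    """Indicator (1): c / (V - HL/2)**a."""
    return c / (V - HL / 2) ** a


def analytism_approx(V, a, c):
    """Indicator with HL replaced by the regression line 0.7256*V - 18.6979."""
    return (2 ** a) * c / (1.2744 * V + 18.6979) ** a


# Row B 01 of Table 1
V, HL, a, c = 400, 298, 0.6850, 41.8602

exact = analytism_exact(V, HL, a, c)
approx = analytism_approx(V, a, c)

# Consistency check: if HL lay exactly on the regression line,
# both formulas would have to give the same value.
HL_on_line = 0.7256 * V - 18.6979
on_line = analytism_exact(V, HL_on_line, a, c)

print(f"exact  A = {exact:.4f}")
print(f"approx A = {approx:.4f}")
print(f"formulas agree on the line: {abs(approx - on_line) < 1e-9}")
```

The gap between the two values for a single text reflects how far that text's HL lies from the regression line, here 298 observed against about 271.5 predicted.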
Language      Mean A   Number of texts
Hungarian     0.2012   5
Czech         0.7223   5
Latin         0.7982   6
Romanian      0.8931   6
German        0.9372   7
Slovenian     0.9418   5
Kannada       1.0378   5
Russian       1.0453   5
Bulgarian     1.0495   5
Indonesian    1.1438   5
Marathi       1.2302   5
Italian       1.2787   5
Lakota        1.2853   4
Tagalog       1.3913   3
English       1.4514   7
Marquesan     1.8108   3
Rarotongan    1.9779   5
Samoan        2.1465   5
Maori         2.1861   5
Hawaiian      5.0815   4
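
The means can be re-derived from the per-text values of A in Table 1. The sketch below does this for three languages; the mapping of the ID prefixes H, B and Hw to Hungarian, Bulgarian and Hawaiian is an inference from the matching means, not something stated in the tables themselves.

```python
from statistics import mean

# Per-text A values copied from Table 1 (ID prefixes H, B, Hw).
a_values = {
    "Hungarian": [0.0749, 0.0824, 0.0950, 0.2753, 0.4784],
    "Bulgarian": [0.9507, 1.1292, 1.1798, 0.9790, 1.0090],
    "Hawaiian": [2.8821, 5.3384, 6.2199, 5.8855],
}

# Mean A per language, rounded to the four decimals used in the table.
mean_a = {lang: round(mean(vals), 4) for lang, vals in a_values.items()}

# Order languages from most synthetic (small A) to most analytic (large A).
ranking = sorted(mean_a, key=mean_a.get)

for lang in ranking:
    print(f"{lang:<10} {mean_a[lang]:.4f}  (n = {len(a_values[lang])})")
```

Sorting by the mean reproduces the ordering of the table for these three languages, with the balanced case near A = 1 in between.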
Fig. 2. Illustrating the linear dependence between the hapax legomena length HL and the
vocabulary V.
Needless to say, one can perform the same procedure using the Zipf-Mandelbrot function or a number of other hyperbolic functions.
We adhere to Occam's razor and restrict ourselves to the original Zipf
function, which, by yielding this morphological distinction without
touching morphology, gains an important secondary corroboration.