Computational Stylistics (eBook)

.
,
,

:
:
:
ISBN: 978-960-603-122-9
Copyright , 2015
Creative Commons Attribution - NonCommercial - NoDerivatives 3.0.
https://creativecommons.org/licenses/by-nc-nd/3.0/gr/

9, 15780
www.kallipos.gr

1:
1.1. 18
1.2. 19
1.2.1.
1.2.2.
1.2.3.
2: 20
2.1.
2.2. (The Federalist Papers) Mosteller & Wallace
2.2.1.
2.2.2. Mosteller & Wallace
2.2.3. Mosteller & Wallace
2.2.4. Bayes
2.3. Burrows
2.3.1. (Principal Component Analysis PCA)
2.3.2. Burrows
2.3.3. Burrows
2.4.
2.4.1. Cusum (Cusum controversy)
2.4.2. Foster W.S.
2.4.3.
3:
3.1.
3.2.
3.2.1.
3.2.2.
3.2.3.
3.2.4.
3.3.
3.3.1.
3.3.2.
3.3.3.
3.4.
3.4.1. Zipf
3.4.2.
3.4.3. TTR
3.4.4.
3.4.5.
3.4.6.
3.5.
3.5.1.
3.5.2.
3.5.3.
3.6.
3.7.
4:
4.1.
4.2.
4.3.
4.4.
4.4.1.
4.4.2.
5:
5.1.
5.2.
5.2.1.
5.2.2. (Support Vector Machines)
5.2.3.
5.3.
5.3.1.
5.3.2.
5.3.4.
6:
6.1.
6.2.
6.2.1.
6.2.2.
6.3.
6.4.
6.5.
6.5.1.
6.5.2.
6.5.3.
6.5.4.
6.5.5.
6.5.6.
6.6.
6.7.
6.8.
1
2
3
4

.
, ,
, , , , ,
. ,
( )
.
,
,
.
, , , , , , ,
,

. ,
,
.
, , ,

.
,
80 90,
(Information
Retrieval). ,
. ,
.
, Google
2 .
, ,
,
. , ,

. ,
,
.
, , ,
:
:
()
, . , ,
, , .
:
,
, , .

, , : ;
:
,
, .. , ..
, .
:

(..
Wikipedia).

.

,
.
, ,
, (
) (, ).
( ),
( ).

, .
()
( ).

.
,
. ,
.
.
.
,

.

. (
), .
. , ,
( )
.

.
,
.
,
.
,
.
,
.
DNA ,
. , ,
. ,
, , ,
.
, ,
;
, .
, . ,
,
. ,
,
.
, ,
,
.
, ,
,
, DNA.

.
(.. ),
, (..
). , , ,
.

.
, ,
. ,
, ,

,
.
,
. ,

, .

(, ..).

.
PERL,
R ( 2.14 Windows) ,
.

, , .
1:
18
.
(18, 19)
( ,
, ..).

: ,
, , , , .
1.1. 18
18

.

Richard Roderick (1710-1756). Henry VIII, Roderick (1758, pp. 225-228)
(feminine / double endings),
.

Edmond Malone (1741-1812). O Malone ,
.
Henry VI,
.
(enjambment) .
Henry VI,
(Malone, 1787).
. 1743
The Songs of Zion, 90 ,

. Kumblaeus

. Kumblaeus

.
,
.
Dovring (1954-1955, p. 389)
(Content Analysis). , ,
,
(Opinion Mining),
forums, (blogs) ..
1.2. 19
1.2.1.
19
.

.

.
Lewis Campbell (1830-1908)

(, , , , ) .
Campbell (1867) ,
, .

Wilhelm Dittenberger (1840-1906),
(Brandwood, 1990, p. 11).
, Sprachliche Kriterien für die Chronologie der platonischen Dialoge (Dittenberger, 1881)

,
Campbell,
(Brandwood, 1990, p. 10).

(Baron, 1897; Frederking, 1882; Hussey, 1889; Kugler, 1886; Ritter, 1888; Schanz, 1886; Siebeck,
1888; von Arnim, 1896; Walbe, 1888),
2 .
,
(reply formulas)1.

, Wincenty Lutosławski (1863-1954)2,
(Pawłowski, 2008, p. 151;
Pawłowski & Pacewicz, 2004, p. 423) (stylometry). O
Lutosławski 500
( 1 4 ),
.
.
(Lutosławski, 1897):

, [ ]

,
.
(Lutosławski, 1897, p. 65)
Lutosławski

:
[] (Lutosławski, 1897, p. 193).
1

Ritter (1888), 43 ,
3 .
2
Lutosławski Riga () Tartu ().
Kazan () ()
() .
1.2.2.
19
.
James Spedding (1808-1881), B
Francis Bacon, Henry VIII
John Fletcher3 (1579-1625) (Spedding, 1850). Spedding
,
4.
.
16 , 6
(33% ),
10 50% (Spedding, 1850, pp. 121-122),

Fletcher.
Spedding
.
New Shakspere Society 1873 B
Frederick James Furnivall (1825-1910),

(Furnivall, 1874, p. vi).
Spedding Fletcher Henry VIII
,
John K. Ingram. , (Ingram, 1874, p.
453) Spedding, Henry VIII
Fletcher
5.

, Frederick Gard Fleay (1831-1909). ,

,
(Fleay, 1874a, p. 6). , o Fleay
,
. ,
,
.
: ) , ) , ) ) .
Fleay ,
.
Fleay
Fletcher.
3
John Fletcher .
4
Spedding Roderick
Henry VIII,
:
(Spedding, 1850, p. 115).
5
,
. ,
, (Keyser, 1992, p. 72).
Henry VIII
. Ingram (Ingram, 1874, p. 453)
45 (light endings) 35 (weak endings)
Fletcher 7 1 .
( Fisher's Exact Test),
(p = 0,134).
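The significance check mentioned in this footnote can be reproduced with a short sketch. It assumes that the first pair of counts (45 light and 35 weak endings) belongs to the non-Fletcher part of the play and the second pair (7 and 1) to the Fletcher part. The function name `fisher_exact_two_sided` is illustrative, and the two-sided p-value is taken here as the sum of the probabilities of all tables that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        # when all row and column totals are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Assumed layout: rows = (other part, Fletcher part), columns = (light, weak).
p_value = fisher_exact_two_sided(45, 35, 7, 1)
print(round(p_value, 3))
```

Under these assumptions the result agrees with the p-value quoted in the footnote.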
Fletcher (Massinger Beaumont),
Fleay (Fleay, 1874b, p. 53):
: Fletcher
, .
:
Fletcher. , Massinger,
, .
: O Fletcher Massinger Beaumont.
: Fletcher
Massinger.
: Fletcher .
: O Fletcher
.
Fleay
, , ,
, ,
(Sawyer, 2006, p. 8). Grady (1994, p. 47), o Fleay
,
.
Furnivall, ,
(Benzie, 1983, p. 189).
, , Algernon Charles Swinburne (1837-
1909),
. -
.

(Foster, 1989),
Examiner:

, Bedham [
],
, ,
.
("Review of A Study of Shakespeare," 1880, pp. 49-50)
1.2.3.

. O Lord (1958, p. 282)

Augustus De Morgan (1806-1871).
W. Heald, :
[]
[].
[ ]
.
.
(de Morgan, 1851/1882)

Thomas Corwin Mendenhall (1841-1924). (Mendenhall, 1887,
1901) .
Mendenhall :
1 , 2
... ( 1.1) :
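Word length (letters):   1    2    3    4    5    6    7    8    9   10   11   12   13   14
Number of words:        25  169  232  187  109   78   79   48   28   20   10   10    2    3
Table 1.1. Distribution of word lengths (number of words per length in letters).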
( 1.1) Mendenhall (1901, p. 98)

1.000 Vanity Fair William
Thackeray.
25 , 169 , 232 ...
(1 , 2
...) , x
y .
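A word-length distribution of the kind shown in Table 1.1 can be computed with a few lines of code. The sketch below is only an illustration: the tokenization rule (a word is an unbroken run of letters) and the function name `word_length_distribution` are assumptions, not Mendenhall's counting procedure. The second part reuses the counts of Table 1.1 to check their total and to derive the mean word length.

```python
import re
from collections import Counter


def word_length_distribution(text):
    """Return {word length in letters: number of words of that length}."""
    # Assumption: a word is an unbroken run of letters.
    words = re.findall(r"[^\W\d_]+", text)
    return dict(sorted(Counter(len(w) for w in words).items()))


# Counts from Table 1.1 for word lengths 1 to 14.
table_1_1 = [25, 169, 232, 187, 109, 78, 79, 48, 28, 20, 10, 10, 2, 3]
total_words = sum(table_1_1)
mean_length = sum(n * c for n, c in enumerate(table_1_1, start=1)) / total_words

print(word_length_distribution("It was the best of times"))
print(total_words, round(mean_length, 3))
```

Plotting length on the x axis against the counts on the y axis gives the curve described in the text.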
Mendenhall (1887, p. 237) 1.000

( 1.1):
1.1 ( ).

. Mendenhall
,
. O Mendenhall :

(characteristic curve of his composition).
(Mendenhall, 1901, p. 99)
(word length frequency distributions),

Mendenhall
(word-spectrum).
1887, .
10.000
Vanity Fair William Thackeray.
, .
Mendenhall John Stuart Mill,

.
( ),
. ,
(
Thackeray).
Edward Atkinson
,
, .
( ).

,
.
Mendenhall 14
1901 6 (Mendenhall, 1901),
, William Shakespeare Francis
Bacon.
. ,
400.000 , Bacon , ,
200.000 . , ,
Mendenhall.
,
3-5 . Mendenhall
Bacon .
Grieve (2005, pp. 12-13) Mendenhall
Science 1887, ,
. . . Eddy (1887) Mendenhall
,

, ...
, (1890), A. B. M.7

.
6
Augustus Hemingway, .
7
(Bailey, 1969, p. 218) [ Grieve (2005, p. 13)]
Mendenhall,
.
(Thomas Carlyle, Thomas De
Quincey, Samuel Johnson).
,

.
, William Benjamin Smith
(1850-1934) 8 Conrad Mascol (Mascol, 1888a, 1888b) Unitarian Review.

. , Smith
,
: ,
.
,
( )
( ),
Eddy (1887). Smith ,
.
(
) ,
.
Smith , ,
(.
3, 3.1). ,

. , Smith

(Mascol, 1888a, p. 454).
Smith 40
.
( , , ).
,
(curves of style).
.
( , , ).
( 1.2)
, Smith,
. , .
8
William B. Smith 19
. , ,
.

Conrad Mascol (Severance, 1936, p. 8).
1.2 . . [: (Mascol,
1888b, p. 546)].
, 1.2,
.
Smith .
, 19 Friedrich Schleiermacher
(1768-1834) . ' ,
(hapax legomena),
( 1) 9 (Schleiermacher, 1807, pp. 22-
76).
,
(Computational Linguistics)
(Machine Learning) (Alviar, 2008).
9
.
Harrison (1921, pp. 18-28) [ (Mills, 2003, p. 10, . 2)]
. , Harrison
,

.

Alviar, J. J. (2008). Recent advances in computational linguistics and their application to Biblical studies. New
Testament Studies, 54(01), 139-159. doi: 10.1017/S0028688508000088
Bailey, R. W. (1969). Statistics and style: a historical survey. In L. Doležel & R. W. Bailey (Eds.), Statistics
and Style (pp. 217-236). New York: American Elsevier.
Baron, C. (1897). Contributions à la chronologie des dialogues de Platon. Revue des Études Grecques, 10,
264-278.
Benzie, W. (1983). Dr. F. J. Furnivall: A victorian scholar adventurer. Norman, Okla.: Pilgrim Books.
Brandwood, L. (1990). The chronology of Plato's dialogues. Cambridge: Cambridge University Press.
Campbell, L. (1867). General introduction The Sophistes and Politicus of Plato with a revised text and
English notes. Oxford: Clarendon Press.
Dittenberger, W. (1881). Sprachliche Kriterien für die Chronologie der platonischen Dialoge. Hermes, 16(3),
321-345.
Eddy, H. T. (1887). The characteristic curves of composition. Science, 9(216), 297. doi:
10.1126/science.ns-9.216.297
Fleay, F. G. (1874a). On metrical tests as applied to dramatic poetry The New Shakspere Society's
Transactions (pp. 1-16). London: Trubner & Co.
Fleay, F. G. (1874b). On metrical tests as applied to dramatic poetry. Part II. Fletcher, Beaumont, Massinger
The New Shakspere Society's Transactions (pp. 51-72). London: Trubner & Co.
Foster, D. W. (1989). 'Elegy' by W.S.: A study in attribution. Cranbury, NJ: Associated University Presses.
Frederking, A. (1882). Sprachliche Kriterien für die Chronologie der platonischen Dialoge. Neue Jahrbücher
für Philologie und Paedagogik, 125, 534-541.
Furnivall, F. J. (1874). Notices on meetings The New Shakspere Society's Transactions (pp. v-xii). London:
Trubner & Co.
Grady, H. G. (1994). The modernist Shakespeare: Critical texts in a material world. Oxford: Clarendon Press.
Grieve, J. W. (2005). Quantitative authorship attribution: a history and an evaluation of techniques. (Master
of Arts), Simon Fraser University, Burnaby, B.C. Retrieved from http://ir.lib.sfu.ca/handle/1892/2055
Harrison, P. N. (1921). The problem of the Pastoral Epistles. London: Oxford University Press.
Hussey, G. B. (1889). On the use of certain verbs of saying in Plato. The American Journal of Philology,
10(4), 437-444.
Ingram, J. K. (1874). On the "weak endings" of Shakspere, with some account of the history of the verse-tests
in general. The New Shakspere Society's Transactions (pp. 442-464). London: Trubner & Co.
Keyser, P. (1992). Review Article: Stylometric method and the chronology of Plato's works. Bryn Mawr
Classical Review, 3(1), 58-74.
Kugler, F. (1886). De particulae toi eiusque compositorum apud Platonem usu. (Ph.D.), University of Basel,
Basel.
Lutosławski, W. (1897). The origin and growth of Plato's logic. London, New York, Bombay: Longmans,
Green and Co.
Mascol, C. (1888a). Curves of Pauline and Pseudo-Pauline Style I. Unitarian Review, 30, 452-460.
Mascol, C. (1888b). Curves of Pauline and Pseudo-Pauline Style II. Unitarian Review, 30, 539-546.
Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 11, 237-249.
Mendenhall, T. C. (1901). A mechanical solution to a literary problem. Popular Science Monthly, 60, 97-105.
Mills, D. E. (2003). Authorship attribution applied to the Bible. (MSc), Texas Tech University. Retrieved
from http://dspace.lib.ttu.edu/handle/2346/19010
Pawłowski, A. (2008). Les aspects linguistiques dans l'œuvre scientifique de Wincenty Lutosławski.
Organon, 37(40), 149-176.
Pawłowski, A., & Pacewicz, A. (2004). Wincenty Lutosławski (1863-1954): Philosophe, helléniste ou
fondateur sous-estimé de la stylométrie? Historiographia Linguistica, 31(2-3), 423-447. doi:
10.1075/hl.31.2.10paw
Review of A Study of Shakespeare. (1880, 10 January). The Examiner, 49-50.
Ritter, C. (1888). Untersuchungen über Plato. Stuttgart: Kohlhammer.
Sawyer, R. (2006). The New Shakspere Society, 1873-1894. Borrowers and Lenders. The Journal of
Shakespeare and Appropriation, 2(2), 1-11.
Schanz, M. (1886). Zur Entwicklung des platonischen Stils. Hermes, 21, 439-459.
Schleiermacher, F. (1807). Über den sogenannten ersten Brief des Paulus an den Timotheos. Berlin:
Sendschreiben and J. C. Gass.
Severance, H. O. (1936). William Benjamin Smith, Ph.D., LL.D. A Friend of the University of Missouri
Library. The University of Missouri Bulletin, 37(3), 3-23.
Siebeck, H. (1888). Untersuchungen zur Philosophie der Griechen. Freiburg.
Spedding, J. (1850). Who wrote Shakspere's Henry VIII? Gentleman's Magazine, 178 / new series 34, 115-
123.
von Arnim, H. F. A. (1896). De Platonis Dialogis Quaestiones Chronologicae. Rostock.
Walbe, E. (1888). Syntaxis Platonicae Specimen. (Ph.D.), University of Bonn, Bonn.
2: 20
20
,

. Mosteller & Wallace (1984)
(The Federalist Papers) ,

. , Burrows
(
) (PCA).
.

, ..., Principal Component Analysis (PCA).
2.1.
1900 ,
. ,
. ( 2.1)
stylometry 1900 2000,
Google10. ,
1940 (
) .
2.1 (20 ) stylometry

Google.
10
Google Books Ngram Viewer
(http://ngrams.googlelabs.com/) 16/04/2015, Corpus
(smoothing) 10.
2.2. (The Federalist Papers)
Mosteller & Wallace
2.2.1.
, 20 ,
11.
Alexander Hamilton, John Jay James Madison 1787-1788,
. 85
, 77 (The Independent Journal,
The New York Packet Daily Advertiser) Publius. Hamilton
8
The Federalists: A collection of essays, written in favour of the New Constitution, as agreed upon by
the Federal Convention, September 17, 1787, J. & A. McLean
1788. 85 , 51
Hamilton12, 14 Madison13, 5 John Jay14 3 Hamilton
Madison15. 12 16
Hamilton Madison, .

. Hamilton Madison 4
. ,

. Hamilton,
, .. 28

.
2.2.2. Mosteller & Wallace

Mosteller Wallace
1941, ,
. ,
(. 5.1 3). Hamilton
34,55 , 19,2 , Madison
34,59 , 20,3 (Mosteller & Wallace, 1984, p. 7).
, ,
Hamilton Madison
, while ~ whilst, Hamilton while Madison whilst.
,
(marker words) Hamilton: enough upon.
2.1
, :
11
Project Gutenberg :
http://www.gutenberg.org/ebooks/1404 :
http://thomas.loc.gov/home/histdox/fedpapers.html
12
.. 1, 6-9, 11 - 13, 15 17, 21 36, 59 61, 65 85.
13
.. 10, 14, 37 48.
14
.. 2 5, 64
15
.. 18 20.
16
.. 49 58, 62, 63.
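            enough   while   whilst   upon
Hamilton    0,59     0,26    0        2,93
Madison     0        0       0,47     0,16
Disputed    0        0       0,34     0,08
Joint       0,18     0       0,36     0,36
Table 2.1. Frequency (per 1000 words) of four marker words in the texts of Hamilton and Madison and in the disputed and joint texts [Source: (Mosteller & Wallace, 1984, p. 11)].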
while Hamilton,
whilst Madison. , enough upon
Hamilton, Madison.

Madison, whilst Madison.
Hamilton, upon
(0,08 2 23.881 ).
Mosteller & Wallace
:
1.
.
, Hamilton Madison.
2.
. Mosteller & Wallace (1984, p. 17)
:

William Benjamin Smith,
(. 0).
2.2.3. Mosteller & Wallace

, Mosteller & Wallace ,

, .
165 17. ,
:
1. : 363 ,

. 70 (.
1: .1.1). 20
(. 1: .1.2). , 10
11 ~ 21 10
22 ~ 44.
2. : , ,
, .
4 (waves) A, B, C
( )
17
Mosteller & Wallace 1.
. A 6 Hamilton 5
Madison .
, (3,2), 3
Hamilton 2 Madison. ,
0
, .. (5,0) (0,5)
Hamilton Madison .
Mosteller & Wallace
3.000 (types)
305. 305 10
.
.. argument, (5,1),
(1,3), (6,4).
(6 Hamilton
4 Madison). 305 (x,y)
x 0 17 ( Hamilton A,
) y 0 14 ( Madison
).
305 z Mosteller & Wallace
:
\[
z = \frac{(x - y)^{2}}{x + y} \tag{1}
\]
, , whilst (0,13)
(1) z = 13.
\[
z = \frac{(x - y)^{2}}{x + y} = \frac{(0 - 13)^{2}}{0 + 13} = \frac{(-13)^{2}}{13} = \frac{169}{13} = 13
\]
z
z 3,6,
305 46.
C (6
Hamilton 5 Madison). (28 )
z 4.5
1: .1.3).
3. (index)
Hamilton Madison: Mosteller
& Wallace (index) 18
Hamilton 19 Madison ( 70.000
6.700 ).
,
, .

(, , )
48 , 1:
.1.4.
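The index of equation (1) is easy to compute for any pair of counts (x, y). The following sketch is illustrative only: the function names `wave_index` and `select_words` are hypothetical, and the selection rule (keep a word when z reaches a threshold) is an assumption about how the cut-off value is applied.

```python
def wave_index(x, y):
    """Equation (1): z = (x - y)^2 / (x + y).

    x = number of Hamilton texts containing the word,
    y = number of Madison texts containing the word.
    """
    if x + y == 0:
        raise ValueError("the word occurs in neither author's texts")
    return (x - y) ** 2 / (x + y)


def select_words(counts, threshold):
    """Keep the words whose index is at least `threshold` (assumed rule)."""
    return [word for word, (x, y) in counts.items()
            if wave_index(x, y) >= threshold]


# The two (x, y) pairs quoted in the text.
counts = {"whilst": (0, 13), "argument": (6, 4)}
print({word: wave_index(x, y) for word, (x, y) in counts.items()})
print(select_words(counts, threshold=3.6))
```

A word used almost exclusively by one author, such as whilst, receives a high index, whereas a word spread evenly over both authors, such as argument, stays close to zero.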
2.2.4. Bayes
Mosteller & Wallace Bayes
. Bayes
:
\[
P(A \mid X) = \frac{P(X \mid A)\,P(A)}{P(X \mid A)\,P(A) + P(X \mid {\sim}A)\,P({\sim}A)} \tag{2}
\]
:
P(A|X): ,
X. .
(posterior probability)
.
P(X|A): ,
.
P(A): .
(prior probability)
.
P(X|~A): X, .
~ .
P(~A): 1-P(A).
Bayes
.
Hamilton Madison, .

, : )
) (probability distribution),
,
.
also Hamilton
1000 0,31 Madison
0,67. , ,
. ,
.

.
Poisson,
, .
(Quantitative Linguistics) (Baayen, 2001; Popescu et al., 2009; Zipf, 1935, 1949)
18 (. 4.5 3) ,
Poisson
.
Poisson :
\[
P(x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots, \quad \lambda > 0 \tag{3}
\]
18
. Zipf, Zipf-Mandelbrot (. 4.1 3), LNRE (. 4.5
3).
e ( 2,718) Euler, .
x! x
x. .. 5!=5*4*3*2*1=120.
.
,
,
.

, Mosteller & Wallace (1984, p.
53), also. 19
H(amilton)= 0,31 Hamilton M(adison)= 0,67 Madison.
Poisson
,
also , Hamilton Madison.
(3) :
1. Poisson
() .
2. x . , x=0,
Poisson ,
Hamilton Madison. x=1 Poisson
(also )
Hamilton Madison.
also 0, 1 2
Hamilton, Madison:
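\begin{align*}
P_{Ham}(0) &= \frac{2{,}718^{-0{,}31} \cdot 0{,}31^{0}}{0!} = \frac{0{,}733 \cdot 1}{1} = 0{,}733 \\
P_{Ham}(1) &= \frac{2{,}718^{-0{,}31} \cdot 0{,}31^{1}}{1!} = \frac{0{,}733 \cdot 0{,}31}{1} = 0{,}227 \\
P_{Ham}(2) &= \frac{2{,}718^{-0{,}31} \cdot 0{,}31^{2}}{2!} = \frac{0{,}733 \cdot 0{,}096}{2} = 0{,}035 \\
P_{Mad}(0) &= \frac{2{,}718^{-0{,}67} \cdot 0{,}67^{0}}{0!} = \frac{0{,}512 \cdot 1}{1} = 0{,}512 \\
P_{Mad}(1) &= \frac{2{,}718^{-0{,}67} \cdot 0{,}67^{1}}{1!} = \frac{0{,}512 \cdot 0{,}67}{1} = 0{,}343 \\
P_{Mad}(2) &= \frac{2{,}718^{-0{,}67} \cdot 0{,}67^{2}}{2!} = \frac{0{,}512 \cdot 0{,}449}{2} = 0{,}115
\end{align*}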
1.000
also 2 , Madison (PMad(2)= 0,115)
Hamilton (PHam(2)= 0,035).
, (odds)20
Madison ( Hamilton) .
19
Mosteller & Wallace
1.000 . , H 0,31
Hamilton 1 also,
3.000 (1*1.000/0,31).
20
. 3 4.
.
Madison
PMad(2) :
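\[
\text{Odds of Madison} = \frac{P_{Mad}(2)}{P_{Ham}(2)} = \frac{0{,}115}{0{,}035} = 3{,}26
\]
(The value 3,26 is obtained from the unrounded probabilities; the rounded values 0,115 and 0,035 give 3,29.)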
Madison 1 3,26
3,26 Madison Hamilton.
,
.
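The Poisson probabilities and the resulting odds can be reproduced with the sketch below. The rates 0,31 and 0,67 are the ones quoted in the text for also; the names `poisson`, `RATE_HAMILTON` and `RATE_MADISON` are illustrative. The last line additionally assumes equal prior probabilities for the two authors, which is an assumption made only for this example, in order to show how the odds translate into a posterior probability.

```python
from math import exp, factorial


def poisson(x, lam):
    """Equation (3): probability of exactly x occurrences when the mean rate is lam."""
    return exp(-lam) * lam ** x / factorial(x)


RATE_HAMILTON = 0.31  # occurrences of "also" per 1,000 words
RATE_MADISON = 0.67

p_ham = [poisson(x, RATE_HAMILTON) for x in range(3)]
p_mad = [poisson(x, RATE_MADISON) for x in range(3)]

# Odds in favour of Madison when a 1,000-word text contains "also" twice.
odds_madison = p_mad[2] / p_ham[2]

# Assumption for illustration: both authors are equally likely a priori,
# so the posterior probability is odds / (1 + odds).
posterior_madison = odds_madison / (1 + odds_madison)

print([round(p, 3) for p in p_ham])
print([round(p, 3) for p in p_mad])
print(round(odds_madison, 2), round(posterior_madison, 3))
```

Because the ratio is formed before rounding, the sketch yields the same odds as the text.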
Mosteller & Wallace Bayes ( (2))
:
21 . = . +
(4)
.

. :
. hR\_gS 1 #)
eTfXY[ gPR9STUQVWXY[ RX\V]WSW_[ = (5)
. hR\_gS 2 #)
Mosteller & Wallace

(Mosteller &
Wallace, 1984, p. 183) . ,
1, 0,
(4) . ,

,
. 12
Madison .. 55.
Mosteller & Wallace .
,
Poisson (Negative
Binomial), .
Bayes, (Discriminant Function
Analysis) .

, .
, ,
Holmes and Forsyth (1995),
, . Bayes,
Mosteller & Wallace,
30 .
Holmes and Forsyth (1995, p. 114) ,
,
. ,
,
.
21

, .
2.3. .
Burrows
2.3.1. (Principal Component Analysis PCA)

,
Mosteller & Wallace, John F. Burrows .

(multivariate statistical methods) . O
Burrows - (Principal Component Analysis
PCA) 90.
Burrows ,
.

.
(principal components). ,

. Pearson (1901),
Hotelling (1933). 80 /

.
,
.
100 (Corpus)
. (. 2.2)
()
.

1.txt mail 18 17 13 7 3
2.txt mail 27 21 13 9 2

3.txt 19 16 15 6 4

4.txt 28 20 17 9 1

2.2 .
( 2.2) ( )
(4.txt) . , 1.txt 2.txt
Email, 3.txt 4.txt .

, .
(variance)22 .
,
.
22
(variance) .
(standard deviation)
.
(weighted linear combination),
.

,
, , ,
, 0 1.
z
.
, ,
.
, :
-1 1 68% .
-2 2 95% .
-3 3 99,7% .
1 -1 ( ) 84%
.
2 -2 ( ) 97,7%
.
3 -3 ( ) 99,9%
.
\[
z_i = \frac{x_i - \bar{x}}{s} \tag{6}
\]
zi i , i
, s . ,
,
(= 23) (s= 5,23).

.
, :
\[
s = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^{2}} \tag{7}
\]
s , , i
i .
k
ilG ,
(), (i=
1). .
:
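1. Mean: \( \bar{x} = \frac{18 + 27 + 19 + 28}{4} = 23 \)
2. Sum of squared deviations: \( (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + (x_4 - \bar{x})^2 = (18 - 23)^2 + (27 - 23)^2 + (19 - 23)^2 + (28 - 23)^2 = (-5)^2 + 4^2 + (-4)^2 + 5^2 = 25 + 16 + 16 + 25 = 82 \)
3. Division by N - 1: \( \frac{1}{N - 1} \cdot 82 = \frac{82}{4 - 1} = \frac{82}{3} = 27{,}33 \). This value is the variance.
4. Square root (standard deviation): \( s = \sqrt{27{,}33} = 5{,}23 \)
\[
z_1 = \frac{x_1 - \bar{x}}{s} = \frac{18 - 23}{5.23} = \frac{-5}{5.23} = -0.956
\]
\[
z_2 = \frac{x_2 - \bar{x}}{s} = \frac{27 - 23}{5.23} = \frac{4}{5.23} = 0.765
\]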
, ,
( 2.3):

1.txt -0.956 -0.630 -0.783 -0.500 0.387
2.txt 0.765 1.050 -0.783 0.833 -0.387
3.txt -0.765 -1.050 0.261 -1.167 1.162
4.txt 0.956 0.630 1.306 0.833 -1.162
2.3 .
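The standardization of equations (6) and (7) can be applied to all five columns at once. The sketch below uses the counts of Table 2.2 and the sample standard deviation (division by N - 1), as in the worked steps; the variable names are illustrative.

```python
from statistics import mean, stdev

# Counts from Table 2.2: one row per text, five frequency columns.
counts = {
    "1.txt": [18, 17, 13, 7, 3],
    "2.txt": [27, 21, 13, 9, 2],
    "3.txt": [19, 16, 15, 6, 4],
    "4.txt": [28, 20, 17, 9, 1],
}

# Work column by column: z = (x - mean) / s, with s the sample
# standard deviation (statistics.stdev divides by N - 1).
columns = list(zip(*counts.values()))
z_columns = [[(value - mean(col)) / stdev(col) for value in col]
             for col in columns]

# Reassemble the standardized values per text, rounded as in Table 2.3.
z_scores = {name: [round(col[i], 3) for col in z_columns]
            for i, name in enumerate(counts)}

for name, row in z_scores.items():
    print(name, row)
```

The printed rows reproduce the standardized values of Table 2.3.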

.
. , ,
,
. , Meyers, Gamst, and Guarino (2006, pp. 480-488).
( )
() ( 2.2):
2.2 .
1. ,
.
2. .
3. ,
, , 90.
4. ,
, ,
(best fit criterion).
5. ,
, , .
,
() .
(extraction)
( ) .
,
, , .

, .

( least
squares method). (
2.3):
2.3
.

. ,
.
, ( 2.4),
90,
( 3).
,
. (
2) ,

.
2.4 .
, ,
( 2.5).
( ). ,

. , ,
. , ,
,
, , .
2.5 .

.
(correlation) .
.
( 2.6):
2.6 .
(mean length
utterance MLU) 20 52 .
.

. ,
, .
.
.
.
.
Pearson r
0 1 ( ) 0 -1 .
(r2) (coefficient of determination), ,
, 0-1
.
.
,
.
, (r) (r2)
(.. y ~ , y~o ..). ,
(. 2.4),
. r r2
, r r2
. (r)
(factor loadings).
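The relation between r and r2 can be illustrated with a small computation. The sketch below calculates the Pearson coefficient for the first two frequency columns of Table 2.2; the choice of these two columns and the function name `pearson_r` are assumptions made only for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(ss_x * ss_y)


# First two frequency columns of Table 2.2 (texts 1.txt to 4.txt).
column_1 = [18, 27, 19, 28]
column_2 = [17, 21, 16, 20]

r = pearson_r(column_1, column_2)
r_squared = r ** 2  # coefficient of determination

print(round(r, 3), round(r_squared, 3))
```

The coefficient r lies between -1 and 1, whereas r2 lies between 0 and 1 and expresses the proportion of shared variance.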

                    Component 1 [r]   Component 2 [r]   Component 1 [r2]   Component 2 [r2]
                    0,9765            0,0503            0,953552           0,0025
                    0,9523            -0,2866           0,906875           0,0821
                    0,3547            0,9345            0,125812           0,8733
                    0,9808            -0,1841           0,961969           0,0339
                    -0,9623           -0,0757           0,926021           0,0057
Sum (eigenvalue)                                        3,874              0,9975
2.4 .
( 2.4)
(>0,95)
(r=0,3547)23. ,
.
.
(r=0,9345),
(<0,3). ( 2.7)
:
2.7 .
23

.
FactoMineR (Husson, Josse, Le, & Mazet, 2012).
, , ,
0. .

.
r2
(
), (eigenvalue),
.
. 10 ,
100% , 1024.

. 5 ( 5 )
3,874, 0,9975.
,
, ,
.
:
\[
\%\ \text{of variance explained by the first principal component} = \frac{\text{eigenvalue of the first principal component} \times 100}{\text{number of variables in the analysis}} = \frac{3{,}874 \times 100}{5} = 77{,}48
\]
19,95%
. 97,43%
,
.
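The eigenvalue, the loadings and the percentage of explained variance of the first component can be recomputed from the counts of Table 2.2. The sketch below is not the procedure of the FactoMineR package mentioned in the footnote. It is a minimal pure-Python illustration that assumes a principal component analysis of the correlation matrix and extracts the first component with power iteration; all function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Counts from Table 2.2 (rows = texts, columns = variables).
data = [
    [18, 17, 13, 7, 3],
    [27, 21, 13, 9, 2],
    [19, 16, 15, 6, 4],
    [28, 20, 17, 9, 1],
]
n_texts, n_vars = len(data), len(data[0])

# Standardize each column, then build the correlation matrix.
cols = list(zip(*data))
z = [[(v - mean(c)) / stdev(c) for v in c] for c in cols]
corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n_texts - 1)
         for j in range(n_vars)] for i in range(n_vars)]


def first_component(matrix, iterations=200):
    """Largest eigenvalue and unit eigenvector by power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        new = [sum(r * v for r, v in zip(row, vec)) for row in matrix]
        norm = sqrt(sum(c * c for c in new))
        vec = [c / norm for c in new]
    # Rayleigh quotient of the converged unit vector.
    eig = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(len(vec)))
              for i in range(len(vec)))
    return eig, vec


eigenvalue, vector = first_component(corr)
if vector[0] < 0:  # the sign of a component is arbitrary; fix it for display
    vector = [-v for v in vector]

loadings = [v * sqrt(eigenvalue) for v in vector]  # correlations r with the component
percent_explained = eigenvalue * 100 / n_vars

print(round(eigenvalue, 3), round(percent_explained, 2))
print([round(l, 4) for l in loadings])
```

The eigenvalue equals the sum of the squared loadings, so dividing it by the number of variables gives the share of the total variance carried by the component.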
,
. ,
,
.
, , (linear regression)
:
\[
y_{ik} = \sum_{j=1} w_{kj}\, x_{ij} \tag{8}
\]
yik i k , wik
i k xi
i . ~
j .
2.4 .

. o wkj, j k
(Sharma, 1996, p. 71):
\[
w_{kj} = \frac{l_{kj}}{\sqrt{\lambda_k}} \tag{9}
\]
24
r2 1,
,
(10).
lkj j k k k
. , ,
( (8))
( (9)). :
\[
w_{1} = \frac{l_{1}}{\sqrt{\lambda}} = \frac{0{,}9765}{\sqrt{3{,}874}} = \frac{0{,}9765}{1{,}9682} = 0{,}4961
\]
, ( 2.5):
1 2
0,4961 0,0504
0,4838 -0,2869
0,1802 0,9356
0,4983 -0,1843
-0,4889 -0,0757
2.5 2 .
(1.txt)
:
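\begin{align*}
y_{\text{1.txt},\,1} &= 0{,}4961 \cdot (-0{,}956) + 0{,}4838 \cdot (-0{,}630) + 0{,}1802 \cdot (-0{,}783) + 0{,}4983 \cdot (-0{,}5) + (-0{,}4889) \cdot 0{,}387 \\
&= -0{,}4743 - 0{,}3048 - 0{,}1411 - 0{,}2492 - 0{,}1892 \approx -1{,}359 \\
y_{\text{1.txt},\,2} &= 0{,}0504 \cdot (-0{,}956) + (-0{,}2869) \cdot (-0{,}630) + 0{,}9356 \cdot (-0{,}783) + (-0{,}1843) \cdot (-0{,}5) + (-0{,}0757) \cdot 0{,}387 \\
&= -0{,}0482 + 0{,}1808 - 0{,}7329 + 0{,}0922 - 0{,}0293 \approx -0{,}5375
\end{align*}
The first-component score is tabulated with the opposite sign (1,359); because the sign of a principal component is arbitrary, this does not change the relative positions of the texts.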

. ( 2.6):

1 2
1.txt 1,359 -0,5375
2.txt -1,3511 -1,12
3.txt 1,99 0,6341
4.txt -1,9979 1,0234
2.6 .
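The step from loadings to weights to component scores can be checked numerically. The sketch below takes the loadings and eigenvalues of Table 2.4 and the standardized values of 1.txt from Table 2.3; the dictionary names are illustrative. With these inputs the first-component score comes out negative, that is, with the opposite sign to the tabulated value, which is harmless because the orientation of a principal component is arbitrary.

```python
from math import sqrt

# Loadings (r) and eigenvalues of the two components, from Table 2.4.
loadings = {
    1: [0.9765, 0.9523, 0.3547, 0.9808, -0.9623],
    2: [0.0503, -0.2866, 0.9345, -0.1841, -0.0757],
}
eigenvalues = {1: 3.874, 2: 0.9975}

# Standardized values of 1.txt, from Table 2.3.
z_1txt = [-0.956, -0.630, -0.783, -0.500, 0.387]

# Equation (9): weight = loading / sqrt(eigenvalue of the component).
weights = {k: [l / sqrt(eigenvalues[k]) for l in loadings[k]]
           for k in loadings}

# Score of the text on each component: weighted sum of its standardized values.
scores = {k: sum(w * z for w, z in zip(weights[k], z_1txt))
          for k in weights}

print([round(w, 4) for w in weights[1]])
print({k: round(s, 4) for k, s in scores.items()})
```

Repeating the last step with the standardized rows of the other texts gives the remaining entries of the score table, up to the same sign convention.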

.
( 2.8):
2.8 .

. ,
77,49% . , 1.txt 3.txt,
, (
). , o 2.txt
( ).
4.txt . (
),
. , .

(19,95%).
. , 1.txt 2.txt
3.txt 4.txt
.
( 3.txt 4.txt),
emails.
( 2.8) ,
( 2.7),
.
( 2.9):
2.9 .

.
. ,
, ,
. , , , ,
,
.

, :
, : , , .

, , , .

.
,

.

.
( 2.10)
:
2.10 () ().

.
.
-425, 4
. (
2.10) ,
2.9. ,
( ) , ,
. 2.9
( ). ,
.
2.9, (
).

. ( 2.10),
emails , ,
.
.
. , ,
( 2.9)
, , ,
emails.
,
.
Burrows
.
25

.
2.3.2. Burrows

Burrows (Burrows, 1987) 26
Jane Austen (1775-1817).

.
( )
.
(Burrows & Hassal, 1988) A Journey from
this World to the Next, 1749 Henry Fielding
(1707-1754). , Fielding (Anna
Boleyn) .
(Digeon, 1931; Rawson, 1973)
Fielding Sarah Fielding (1710-
1768), , , .
Anna Boleyn Fielding ( )
. Burrows &
Hassal 10 Henry Fielding ( 87.390 ) 10
Sarah Fielding ( 90.786 ). 33
.

Henry Sarah Fielding. ,
: the, a, an Henry Fielding
. ,
(but, which, to, for, at) .
, ,
( Pearson r):
1. 33 10 Henry Fielding,
2. 33 10 Sarah Fielding,
3. 33 .
1 2
Henry Sarah Fielding (
r= 0,973 r= 0,961 ),
. , 3
(r=0,930),
. ,
Burrows & Hassal
.
, 3 ,
94,99% , 2,32% 0,71%. , ,
, 97,31%
. ,
3.1 2. Burrows & Hassal
. ,

, (. 2.5)
26
Burrows 30 : 8 , 6
, 5 , 3 , 2 , to, that, for,
all (Burrows, 1987, pp. 4-6).

(Burrows & Hassal, 1988, p. 437).

,
27.
, ,
. ,
.
Sarah Fielding, Henry
Fielding.
28 Burrows &
Hassal Anna Boleyn.
Sarah Fielding.
, , , Burrows & Hassal
.
1.000 .
.
Henry Sarah Fielding.
, 3 ,
.
Henry Fielding,
Sarah Fielding
Sarah Fielding.
Anna Boleyn Sarah Fielding.
Burrows & Hassal Anna
Boleyn Henry Fielding,
, . ,
,
,
(overcorrection). , (.. which) Sarah
Henry Fielding,

.
Henry Sarah Fielding Burrows
(Burrows, 1989), Anna Boleyn, A Vision
Familiar letters between the principal characters in David
Simple (1747) The Opposition. A Vision (1742). Burrows
, ,
(. 2.3) ,
.
8 ,
(indirect reported speech),
.
49
.
27

.
, ,
. Smith (1991, p.
246) (semantic factor)
.
90 (Binongo, 1994; Smith,
1991, 1992, 1994) .
28
Burrows & Hassal, , 2
3 .
.

(. 2.7):
-2 -1,5 -1 -0,5 0 0,5 1 1,5 2 2,5 3

-Sarah 3 8 8 5 7 3 1 2
and
+Henry 3 3 9 5 8 3 2 1 1
D C B
2.7: and Sarah Henry Fielding.
24 Sarah Fielding ( 47)

0 .
, Henry Fielding 15 ( 35),
and .
+ Henry Fielding.
and
Anna Boleyn (D), A Vision (B C) The Opposition (A).
49 ,
Burrows .
A Vision, Henry Fielding, B C
Sarah Fielding. 49
. 15

. Burrows 34
(, , ) .

15 .
Sarah Fielding ,
Henry Fielding .
, Opposition Anna Boleyn Henry
Fielding, 2 3 Anna Boleyn Vision Sarah Fielding. ,
,
,
. Anna Boleyn
Sarah Fielding Henry Fielding. , Vision,
34 Sarah Fielding, ,
Henry Fielding .
Burrows
34 .
( ). Burrows,
,
. ,

. , Anna Boleyn Henry Fielding,
Sarah
Fielding. Vision Henry
Fielding, .

, ,
Burrows. ,

(Digital Humanities). ,
Smith (1990, p. 250) (
Burrows)
.
2.3.3. Burrows

. Holmes & Forsyth (1995)
(. 2
2), 49 .
Mosteller & Wallace
( Madison).
(Hamilton Madison) ,
30 Mosteller & Wallace
.
(>>50),
(Holmes & Forsyth, 1995, p. 126). Craig & Kinney (2009)
,
, ,
.

. Binongo (1994) Nick
Joaquin, ,
,
. 50
Joaquin .
,
. pastiche

. ,
. Somers &
Tweedie (2003) Lewis Carroll
(1865) Gilbert Adair, Alice Through the Needle's Eye (1984)
Carroll Adair,
40 .
,
.

, , (.. ),
(.. - ).
(Baayen, van Halteren, & Tweedie, 1996)
. 50
,
. 6
, .
3 , 50
(syntactic rewrite rules)
. ,

. ,
,
. , Binongo & Smith (1999)
25 Oscar
Wilde.
Forstall & Scheirer (2010),

(Support Vector Machines - SVM)29.
(robust) ,

.
- , -30
(functional ngram).

. -
,
. -
.
.
5.000 10.000 -, - -
.

. .
( ) .

.

. ,
.
Writeprints ( ) Abbasi & Chen (2006; 2008).
(,
, 31, ),
(forum posts).
.
, Abbasi & Chen ,
.
, (Dynamic Sliding Window)
Kjell, Woods & Frieder (1993).
(Abbasi & Chen, 2008, p. 15):
1.
.
2.
.
3. .
4.
(
29
2.2 5.
30
- 2.3 3.3 3.
(, , ,
. ..) .
31
(structural features)
forum, ,
(. 6 3).
, , emails ,
.
)
.
5. 32.
1-4 .
, ,
, David Holmes
/ (Holmes, 1998, p.
114).
.
, , .
,
.
:
.
,
4-5 .
(scalability issues),

.

,
(Zhao, Zobel, & Vines, 2006, p. 94).
,
.
2.4.
2.4.1. Cusum (Cusum controversy)

, ,
(2.1 2.2),
, ,
.

Cusum ( QSUM), 80
90
.
Cusum Cumulative Sum ( )

.
(Chang &
McLean, 2006). 60, Andrew Q. Morton,

.
70, Morton, Ronald Bee cusum
32
Kjell et al. (1993, p. 144) 2.048
, Abbasi and Chen (2006)
.
.
(Bee, 1971, 1972) Morton

( ) (Morton, Michaelson, & Wake, 1978).
Morton (Morton, 1991; Morton & Michaelson, 1990),
1996 (Farringdon,
Morton, Farringdon, & Baker, 1996). cusum
.
, .
, ,
. Morton

25 50 .
33, :
( ).
2 3 + ( 2-3+).
cusum :
1. .
2. .
3. .
4. (cumulative sum) .
5. .
6. .
(. 2)
. :
27 2-3
18.
:
,
,
.
31 2-3
20.

Cusum 2, 3 4 :
29 ((27+31)/2)
2 3 19
((18+20)/2).
33
Farringdon et al. (1996, pp. 20, 25) : 1. , 2.
, 3. , 4.
, , 5. , , 6.
, 7. , 8.
.

. :
= 29 -27 = 2
= 29-31= -2
2-3+ = 19-18=1
2-3+ = 19-20= -1

cusum.

cusum cusum.
:
1 cusum ( )= 2, 2 cusum ( )= 0 (2+(-2))

1 cusum ( 2-3+)= 1, 2 cusum ( 2-3+)= 0 (1+(-1))
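The running totals in the example above can be checked with a few lines of code. The following is a minimal Python sketch, not Morton's own procedure: the function name `cusum` is hypothetical, and it keeps the sign convention of the worked example (mean minus observed value) rather than the more common observed value minus mean.

```python
from itertools import accumulate

def cusum(values):
    """Cumulative sum of deviations from the mean (mean - value),
    following the sign convention of the worked example."""
    mean = sum(values) / len(values)
    deviations = [mean - v for v in values]
    return list(accumulate(deviations))

# Values taken from the two example sentences in the text.
sentence_lengths = [27, 31]   # words per sentence
habit_counts = [18, 20]       # occurrences of the 2-3+ habit per sentence

print(cusum(sentence_lengths))  # [2.0, 0.0]
print(cusum(habit_counts))      # [1.0, 0.0]
```

Plotting the two resulting series on the same chart (one y-axis per series) gives the kind of cusum diagram discussed in this section; the last value of a cusum series is always 0 by construction.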
,
cusum
.
cusum
. To cusum ( 2.11):
2.11 cusum 2.
cusum
2-3+
(23).
. Morton
, , ,
,
.
, 2.11
.
.. 3, 4 5
cusum ( 2.12):
2.12 Cusum 3, 4 5.
,
. 3
.
Morton
(Farringdon et al.,
1996, p. 241).
cusum ,
, ( )

(Barr & Ariba, 1997, p. 106).
, cusum
(Canter, 1992, p. 93).
(Sunday Telegraph) (Mathews, 1993a), (The
Guardian) (Campbell, 1992) (New Scientist) (Mathews, 1993b),
(BBC Tomorrow's World, Channel 4 Street-legal) (Sams, 1994, p. 472).

,
(Canter, 1992; de Haan & Schils, 1994; Hardcastle, 1993, 1997; Holmes, 1998; Holmes
& Tweedie, 1995; Karian, 2006; Sanford, Aked, Moxey, & Mullin, 1994; Schils & de Haan, 1993)
cusum
. .

. Michael Farringdon,
cusum
.
(Farringdon et al., 1996, p. 241).
cusum,
, ,
.
,
.
, , . Sanford et al. (1994)

cusum (. . 33).
,
Morton .
Holmes and Tweedie (1995) (Asimov)
26 . cusum 25
Asimov Foundation (1960) 25
Foundation and Earth (1986).
.
, , . de Haan and Schils (1994)
cusum
.
. 100 ,
.

15%, . Karian (2006)
4 cusum ,
( ).
,
, ,
. ,
. Hardcastle (1993, p. 98)
,
, . ,
,
Holmes and Tweedie (1995, p. 24).
cusum 34,
,
. Canter (1992) 100 cusum,
Spearman's rank correlation coefficient (ρ)35. cusum (
2-3+) ( ),
1. , (
34
Michael Farringdon (Farringdon et al., 1996, p. 261)
[] [
].
35
Spearman 0 1 0 -1 .
Pearson r
(ordinal), (.. , , ,
, ).
) 036. Canter
0,9, . 0,9
,
0,9 . cusums
, 48% 0,9.

,
. ,
65% 0,9.

.
Canter (1992), ,
Morton, cusum
.
62% 68%,
. Morton
.

. ( 2.12)
-30 30 -200 200 ( 2.13):
2.13 ( 2.12) .

2.12 .
,
. Morton
36
, 2.11 (ρ = 0,97, p<0,001),
( 2.12),
(ρ = 0,33, p= 0,11).

(Farringdon et al.,
1996, p. 271).
. Karian (2006, p. 271) Morton
cusum
(. 0). ,
( ).
, ,
,
(Holmes & Tweedie, 1995, pp. 29-30). ,
.

(Karian, 2006, p. 269).
cusum
Morton 1993,
. ,

(Sams, 1994, p. 472).
2.4.2. O Foster W.S.

A Funeral Elegy Foster (1989). Donald Foster,
, ,
(. 0), A Funeral Elegy .
,
, Richard Abrams (Abrams, 1996),

(New York Times) (Pristin, 1997) Literary Detective.
, Foster
best seller Primary Colors Joe Klein.
(
Bill Clinton 1992), . Klein
, Foster Klein
. Klein
, 17 1996
Washington Post
.
Foster. FBI

. , ,
Ted Kaczynski37 2001
38.
1996 Elliot & Valenza (1996)
Foster
37
Ted Kaczynski, Unabomber, 1978 1985 16 -
3 23. ,
FBI ,
.
38
, Foster Vanity Fair (Foster,
2003) Steven Hatfill,
. , Hatfill
Foster. Foster
.

. , Elliot & Valenza
, A Funeral Elegy.
Foster Elliot & Valenza
Computers and the
Humanities. Grieve (2005, pp. 7-8),
Foster 2002
Elliot & Valenza.
2.4.3.
Morton Foster,

. ,

.
,
.

. Smith (1990, pp. 249-250)

:
1.
.
2. ,
.
3. ,
.
4.
.
5.
.
6. .
, Foster
,
. 4,
,
, 4.1 4,

.
Rudman, ,
7
(Rudman, 1997):
1. :
.
, , ,

,
.
2. : Rudman

.
3. :

.
Rudman,

. Efron-Thisted (Efron &
Thisted, 1976)
, ,
cusum
.
4. :
,
( , ..).
5. :

, .
6. : Rudman
.
, ,
.
7. :
, ,

(, ..).
,
.

, , .
Rudman
:

.
( ,
, ..) (workshops)
.
( ,
, ,
)
.
.

.
,
.
Rudman 15 ,
.
, .

,
. ,
,
. .
Shlomo Argamon (Argamon, 2011):
( )
[is a mess]. ,
,
.
(Argamon, 2011, p. 1)
.
,
,
.

, .
,
.

.
Abbasi, A., & Chen, H. (2006). Visualizing authorship for identification. In S. Mehrotra, D. D. Zeng, H.
Chen, B. Thuraisingham & F.-Y. Wang (Eds.), Intelligence and Security Informatics. Proceedings
IEEE International Conference on Intelligence and Security Informatics (ISI-2006), 23-24 May 2006,
San Diego, U.S.A (pp. 60-71). Berlin: Springer.
Abbasi, A., & Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and
similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS), 26(2), 1-29.
doi: 10.1145/1344411.1344413
Abrams, R. (1996). W[illiam] S[hakespeare]'s "Funeral Elegy" and the turn from the theatrical. Studies in
English Literature, 1500-1900, 36(2), 435-460.
Argamon, S. (2011). Scalability Issues in Authorship Attribution. Kim Luyckx. Literary and Linguistic
Computing. doi: 10.1093/llc/fqr048
Baayen, H. R. (2001). Word frequency distributions. Dordrecht: Kluwer Academic Publishers.
Baayen, H. R., van Halteren, H., & Tweedie, F. J. (1996). Outside the cave of shadows: using syntactic
annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121-132. doi:
10.1093/llc/11.3.121
Barr, G. K., & Ariba, B. D. (1997). The search for verbal fingerprints. Expert Evidence, 5(3), 106-109. doi:
10.1023/a:1008822306661
Bee, R. E. (1971). Statistical methods in the study of the Masoretic Text of the Old Testament. Journal of the
Royal Statistical Society, Series A, 134(4), 611-622.
Bee, R. E. (1972). A statistical study of the Sinai Pericope. Journal of the Royal Statistical Society, Series A,
135(3), 406-421.
Binongo, J. N. G. (1994). Joaquin's Joaquinesquerie, Joaquinesquerie's Joaquin: A statistical expression of a
Filipino writer's style. Literary and Linguistic Computing, 9(4), 267-279. doi: 10.1093/llc/9.4.267
Binongo, J. N. G., & Smith, M. W. A. (1999). A bridge between statistics and literature: The graphs of Oscar
Wilde's literary genres. Journal of Applied Statistics, 26(7), 781-787. doi: 10.1080/02664769922025
Burrows, J. F. (1987). Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in
Method. Oxford: Clarendon Press.
Burrows, J. F. (1989). 'A vision' as a revision. Eighteenth Century Studies, 22(4), 551-565.
Burrows, J. F., & Hassal, A. J. (1988). Anna Boleyn and the authenticity of Fielding's feminine narratives.
Eighteenth Century Studies, 21(4), 427-453.
Campbell, D. (1992, 7 October 1992). Writings on the Wall. False Confessions, The Guardian, p. 25.
Canter, D. V. (1992). An evaluation of the "Cusum" stylistic analysis of confessions. Expert Evidence, 1(3),
93-99.
Chang, W., & McLean, I. (2006). CUSUM: A tool for early feedback about performance? BMC Medical
Research Methodology, 6(1), 8.
Craig, H. D., & Kinney, A. F. (2009). Shakespeare, computers and the mystery of authorship. Cambridge:
Cambridge University Press.
de Haan, P., & Schils, E. (1994). The Qsum plot exposed. In U. Fries, G. Tottie & P. Schneider (Eds.),
Creating and using English language corpora (pp. 93-105). Amsterdam: Rodopi.
Digeon, A. (1931). Fielding a-t-il écrit le dernier chapitre de 'A Journey from This World to the Next'. Revue
Anglo-Américaine, 8, 428-430.
Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare
know? Biometrika, 63(3), 435-447.
Elliot, W. E., & Valenza, R. J. (1996). And Then There Were None: Winnowing the Shakespeare claimants.
Computers and the Humanities, 30, 191-245.
Farringdon, J. M., Morton, A. Q., Farringdon, M., & Baker, D. M. (1996). Analysing for authorship: A
guide to the Cusum technique. Cardiff: University of Wales Press.
Forstall, C., & Scheirer, W. (2010). Features from frequency: Authorship and stylistic analysis using repetitive
sound. Journal of the Chicago Colloquium on Digital Humanities and Computer Science, 1(2), 1-23.
https://letterpress.uchicago.edu/index.php/jdhcs/article/view/56/67
Foster, D. W. (1989). 'Elegy' by W.S.: A study in attribution. Cranbury, NJ: Associated University Presses.
Foster, D. W. (2003, 15 September 2003). The message in Anthrax. Vanity Fair, 180-200.
Hardcastle, R. A. (1993). Forensic linguistics: an assessment of the CUSUM method for the determination of
authorship. Journal of the Forensic Science Society, 33(2), 95-106. doi: 10.1016/s0015-7368(93)72987-4
Hardcastle, R. A. (1997). CUSUM: a credible method for the determination of authorship? Science & Justice,
37(2), 129-138. doi: 10.1016/s1355-0306(97)72158-0
Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic
Computing, 13(3), 111-117. doi: 10.1093/llc/13.3.111
Holmes, D. I., & Forsyth, R. S. (1995). The Federalist revisited: New directions in authorship attribution.
Literary and Linguistic Computing, 10(2), 111-127. doi: 10.1093/llc/10.2.111
Holmes, D. I., & Tweedie, F. J. (1995). Forensic stylometry: A review of the Cusum controversy. Revue
Informatique et Statistique dans les Sciences humaines, 23, 20-47.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of
Educational Psychology, 24, 417-441, 498-520.
Husson, F., Josse, J., Le, S., & Mazet, J. (2012). FactoMineR: Multivariate Exploratory Data Analysis and
Data Mining with R: R package version 1.18.
Karian, S. (2006). Authors of the mind: Some notes on the QSUM attribution theory. Studies in Bibliography,
57, 263-286. doi: 10.1353/sib.0.0011
Kjell, B., Woods, W. A., & Frieder, O. (1993). Discrimination of authorship using visualization. Information
Processing & Management, 30(1), 141-150. doi: 10.1016/0306-4573(94)90029-9
Mathews, R. (1993a, 4 July 1993). Harsh words for verbal fingerprints, Sunday Telegraph.
Mathews, R. (1993b, 21 August 1993). Linguistics on trial. New Scientist, 1887.
Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research. Design and interpretation.
Thousand Oaks, CA: Sage.
Morton, A. Q. (1991). Proper words in proper places: Computing Science Department, University of
Glasgow.
Morton, A. Q., & Michaelson, S. (1990). The Qsum plot: University of Edinburgh.
Morton, A. Q., Michaelson, S., & Wake, W. C. (1978). Sentence length in Homer and hexameter verse.
Association for Literary and Linguistic Computing Bulletin, 6(3), 254-267.
Mosteller, F., & Wallace, D. L. (1984). Applied bayesian and classical inference. The case of The Federalist
Papers (2nd ed.). New York: Springer-Verlag.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine,
2, 559-572.
Popescu, I.-I., Altmann, G., Grzybek, P., Jayaram, B. D., Köhler, R., Krupa, V., . . . Vidya, M. N. (2009).
Word frequency studies. Berlin: Mouton de Gruyter.
Pristin, T. (1997, 19 November 1997). From sonnets to ransom notes; Shakespeare sleuth helps police in
literary detection, New York Times. Retrieved from
http://www.nytimes.com/books/00/11/26/specials/foster97.html
Rawson, C. (1973). Introduction A Journey from This World to the Next (pp. vii-xxvii). London: J. M. Dent.
Rudman, J. (1997). The state of authorship attribution studies: some problems and solutions. Computers and
the Humanities, 31(4), 351-365.
Sams, E. (1994). Edmund Ironside and stylometry. Notes and Queries, 41(4), 469-472.
Sanford, A. J., Aked, J. P., Moxey, L. M., & Mullin, J. (1994). A critical examination of assumptions
underlying the Cusum technique of forensic linguistics. Forensic Linguistics, 1(2), 151-168.
Schils, E., & de Haan, P. (1993). Characteristics of sentence length in running text. Literary and Linguistic
Computing, 8(1), 20-26. doi: 10.1093/llc/8.1.20
Sharma, S. (1996). Applied Multivariate Techniques. New York: John Wiley & Sons, Inc.
Smith, M. W. A. (1990). Attribution by statistics: a critique of four recent studies. Revue Informatique et
Statistique dans les Sciences humaines, 26, 233-251.
Smith, M. W. A. (1991). The authorship of The Revenger's Tragedy. Notes and Queries, 236, 508-511.
Smith, M. W. A. (1992). The problem of Acts 1-11 of Pericles. Notes and Queries, 237, 346-355.
Smith, M. W. A. (1994). Edmund Ironside. Notes and Queries, 238, 202-205.
Somers, H. H., & Tweedie, F. J. (2003). Authorship Attribution and Pastiche. Computers and the Humanities,
37(4), 407-429. doi: 10.1023/a:1025786724466
Zhao, Y., Zobel, J., & Vines, P. (2006). Using relative entropy for authorship attribution. In H. Ng, M.-K.
Leong, M.-Y. Kan & D. Ji (Eds.), Information Retrieval Technology (Vol. 4182, pp. 92-105).
Berlin: Springer.
Zipf, G. K. (1935). The psychobiology of language. Cambridge, Mass.: M.I.T. Press.
Zipf, G. K. (1949). Human behavior and the principle of least effort: An introduction to human ecology.
Cambridge Mass.: Addison-Wesley Press.
3:

.

, .

, Zipf, -, .
3.1.

. ,
, 39
. Bailey (1979)
, ,

. ,

.
,

3.1.
39
Rudman (1997, p. 360) 1.000
.
,
.
3.1 .
,
. ,
, : ) /
, ) / , ) , ) . ,
, ,
(.. ,
..).
3.2. /
3.2.1.

. Augustus De Morgan
Mendenhall (. 0 )
, ,
.
,
40
40

.
, , , .
(Grzybek, 2007, p. 18). , ,
,
.
.

(first central moment), (second
central moment), (skewness)
(fourth central moment) (kurtosis) .

:
1. :

(Antić, Stadlober, Grzybek, & Kelih, 2006; Kelih, Antić, Grzybek, & Stadlober, 2005).
.
i. :
( )
( , ,
). 41:
$$m_1 = \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad (10)$$
m1 , X
i ( ).
, .
ii. :
.
(. (7)
:
$$m_2 = s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2 \qquad (11)$$
iii. :
.

( 3.2):
..: ) - ( 20 )
) -- ( ).
. ,
.
41

, ( ,
..) (moments of distribution) m.
3.2 . ,
.

. .
,

. 0,
0
0.
,
:
$$m_3 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^3 \qquad (12)$$
(12)
(13) :
$$\text{skewness} = \frac{m_3}{m_2^{3/2}} \qquad (13)$$
iv. :

, .
.
( 3.3):
3.3 . , .

,
. 0,
0 0
.
:
$$m_4 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^4 \qquad (14)$$
(14)
(15):
$$\text{kurtosis} = \frac{m_4}{m_2^{2}} - 3 \qquad (15)$$
2. (standard deviation): (. (7))

.
(. (Mikros (2007); Park,
Jiexun, Zhao, and Chau (2009))),
.
3. (coefficient of variation):
(16) (Yule, 1944, p. 13):
$$v = \frac{s}{m_1} \qquad (16)$$

.
,

.
4. (quotient of dispersion):

1:
$$d = \frac{m_2}{m_1 - 1} \qquad (17)$$
Grotjahn (1982)

(Grzybek, 2007, pp. 47-48).
5. Ord: O Ord (1972) I S , ,
(18):
$$I = \frac{m_2}{m_1}, \qquad S = \frac{m_3}{m_2} \qquad (18)$$
6. I ()
( ). S ()
().

,
(Kelih et al., 2005; Stadlober & Djuzelic, 2007). , ,

(Popescu et al., 2009, p. 154),
(Oakes, 2007, pp. 512-513).
7. (word length spectrum): , ,
42.
. ,
1, 2, 3, 7 ,

1, 2, 3, 7 .
(. (Forsyth, Holmes, and Tse (1999); Grieve (2007)))
(. (Markopoulos, Mikros, and
Broussalis (2011); (2009)).
8. : ,
, Forsyth et al. (1999)
,
.
Consolatio ,
. 1583 Carlo Sigonio
Consolatio, .
Forsyth et al. (1999), (
), . ,
(1, 2, 3, 6 ) .
:
42
. Mendenhall.
i. 1, 2, 3,
4 .
ii. 1, 2, 3, 4
.
iii. 1, 2, 3, 4
.
iv. 1, 2,
3, 4 .
,
Consolatio
Sigonio.
,
. (. (Grieve
(2007, p. 259); Stamatatos, Fakotakis, and Kokkinakis (2001, p. 195))),
(low-level phenomenon)
. ,
,
(Fucks & Lauter, 1965) .
Williams (1976), , Bacon
Sidney, , :
, ,
Sidney Bacon, [
Sidney], Sidney
[ Sidney]. ,
Bacon
Sidney.
(Williams, 1976, p. 211)

. Kelih et al. (2005) 190
.
, ,
, Ord ..
.
(Discriminant Function Analysis) 38,4%,
93%,
.
, Stadlober and Djuzelic (2007)
153 , (,
).
.
( , )
98% .
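The moment-based word-length measures of this section (Equations 10 to 18) are straightforward to compute once each word has been reduced to its length. The following Python sketch is illustrative only: the function name `word_length_stats` and the dictionary keys are hypothetical, word length is assumed to be measured in letters (syllables would work the same way), and the N-1 denominator for the higher central moments is an assumption read off the equations above.

```python
from math import sqrt

def word_length_stats(lengths):
    """Moment-based descriptors of a word-length distribution.
    `lengths` is a list with one length value per word token."""
    n = len(lengths)
    m1 = sum(lengths) / n                      # mean, Eq. (10)

    def central(k):
        # k-th central moment with an N-1 denominator (assumption).
        return sum((x - m1) ** k for x in lengths) / (n - 1)

    m2, m3, m4 = central(2), central(3), central(4)
    return {
        "m1": m1,
        "m2": m2,                              # variance, Eq. (11)
        "skewness": m3 / m2 ** 1.5,            # Eq. (13)
        "kurtosis": m4 / m2 ** 2 - 3,          # Eq. (15)
        "sd": sqrt(m2),                        # standard deviation
        "v": sqrt(m2) / m1,                    # coefficient of variation, Eq. (16)
        "d": m2 / (m1 - 1),                    # quotient of dispersion, Eq. (17)
        "ord_I": m2 / m1,                      # Ord's I, Eq. (18)
        "ord_S": m3 / m2,                      # Ord's S, Eq. (18)
    }

text = "the quick brown fox jumps over the lazy dog"
stats = word_length_stats([len(w) for w in text.split()])
print(round(stats["m1"], 3), round(stats["sd"], 3))
```

The sketch assumes a text with some variation in word length and a mean length above 1; otherwise the variance or the denominator of Equation (17) would be zero.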
3.2.2.
3.2.2.1. /
, ,

. Stamatatos (2009, p. 541)
, , ,
.

.

. 9 ..
.
.

, , .

(Beker & Piper, 1982).

Cambridge Udny Yule (Yule, 1944). Yule
(Bunyan
Macaulay). Yule:

(. )

Bunyan,
Bunyan Macaulay.

Macaulay
Bunyan
(Yule, 1944, p. 184)

Yule .
,
.
, Yule
,
, (). B, F, H, W
, A, I E, , .
,
Gustav Herdan, Yule
,
:

(Old English)
.

. [ ]

(Herdan, 1966, p. 68)

Thomas Merriam 90.
(Merriam, 1988) [ Grieve (2005, p. 26)] o Merriam
Thomas Heywood (1586-1640;)
The Book of Sir Thomas More. Merriam (1988, p. 455)
, ,
,
.
, ,
.
, Merriam
Heywood , Anthony
Munday (1560-1633).
, .
Merriam ,
(Pearson r) 26 .
Heywood r
0,990 The Book of Sir Thomas
More, Heywood. Merriam
Heywood .
Merriam (1994) [ Grieve (2005, p. 27)]
Marlowe
O . 36
0,078, 6 7 Marlowe
O 0,078. Merriam
:

no not,
.
(Merriam, 1994, p. 469)

Ledger and Merriam (1994) The Two Noble
Kinsmen (1634), ,
John Fletcher . ,

(Chettle, Dekker, Fletcher, Heywood, Munday)
25.000 ( 5.000 ).
(Cluster Analysis),
.
,
.

Fletcher,

.
The Two Noble Kinsmen .

. (Canonical Discriminant
Analysis) .

Fletcher. Ledger and Merriam (1994, pp. 246-
247)

, .

Forsyth and Holmes (1996) Grieve (2007). Forsyth and Holmes (1996)
,
,
.
, ,
( 26 96
),
,
.
Grieve (2007)
,
(
).
( 83%).
, , ,
. ,
,
.
Grieve
:
[ ]
, , , ,
. ,

.
(Grieve, 2007, p. 261)
, ,
.
Dell Hymes (Hymes, 1960),
William Wordsworth
(1770-1850) John Keats (1795-1821). Hymes -
, .
,
, .
:
, still, [ ]
, /l, s, t/ /i/, .
Wordsworth still .
(Hymes, 1960, p. 119)
, ,
70 80 (. Tuldava (2005, p. 378) )
, , .

. Forstall and Scheirer (2010),
2.3 2.
Van Bael and van Halteren (2007),
(supervised learning classification
algorithm) linguistic profiling, ,
, , .
(Spoken Dutch Corpus)
340 .

.
( , - ),
, /
.

.
, ,
.
3.2.2.2.
, ,
.
(.), (,), () (?), (;),
(()), (), (!), ([]), (),
(@/), , ($,),
(+-*/=^<>), (`~#\|).

Conrad Mascol (Mascol, 1888a, 1888b), , 2.3
1, .
O'Donnell (1966) Stephen Crane
(1871-1900) O Ruddy (1903) 18
Crane. O'Donnell
(, )
(O'Donnell, 1966, p. 110).

Robert Barr (1849-1912).
(Chaski, 2001), Carole Chaski,
(forensic linguistics),
48 44
, 10
. 2

92,8%.
,
(Chaski, 2001, p. 12).
Baayen, van Halteren, Neijt, and Tweedie (2002)

8%.
, , ,
(, ...).
2010
.

(Chen, Hao, Chandramouli, & Subbalakshmi, 2011; Ding, Meng,
Chai, & Tang, 2011; Mohtasseb & Ahmed, 2010; Orebaugh & Allnutt, 2010; Reicher, Krišto, Belša, & Šilić,
2010; Sousa-Silva, Sarmento, Grant, Oliveira, & Maia, 2010; Sousa Silva et al., 2011).

.
(emails, , , tweets ..)
(smileys, ,
..)
. ,
.
Twitter, (tweet) 140 (
SMS ).
,
.
.

,
Orebaugh and Allnutt (2010) Sousa Silva et al. (2011). Orebaugh and Allnutt (2010)
(Instant Messaging )

.
. ,
. .
9

(U.S. CyberWatch) F.B.I. 100
. 356
.
( , ,
) .
, , ,
(spaces) ,
(emoticons)43.

- (Support Vector Machines - SVM) 88,42% 84,44%
. ,
( ).
, Sousa Silva et al. (2011)
2010 Twitter
,
tweets. 2.000 tweets 120
40 3 ,
,
.
,
Twitter. ,
(hashtags) (.. #music),
(.. @george), URL
44. ,
, 45, (.. LOL
Laughing Out Loud), (.. haaahahahahah)
43

Mors,
email. :-) :-(
.
44
(URL shortening services) ,
,
(
Twitter).
45
, (Sousa Silva et al., 2011, p. 163)
, .. :-) , :)
.. :-) , (-:
, ,
, (.. !!!, ???? ..).
, Sousa Silva et al. (2011, p. 167)

( ,
, ).
3.2.3. -
- (character n-grams)

. -
: -
("N-gram,").
- . -
, , , .
- . =1, -
- (unigrams) . -
=1 . -
=2 (character bigrams)
. - =3
(character trigrams) ...
- : .
16:
[_], [_] [], [], [], [], [_], [_], [], [], [], [], [_], [_], [], [], [], [], [], [.]

, (
[.]),
_ . , ,
. , ,
[] 2 ,
, , , .
-
. -
(.. , )
.
, ,
.
, (Natural Language Processing
NLP), (Information Retrieval), (Machine Learning) ..
Stamatatos (2009, pp. 541-542) -:
1. .
2. ,
, , ,
..
3. .
,

.
4.
, ,
.
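The extraction of character n-grams described in this section can be sketched as follows. This is a minimal Python illustration, not a reference implementation: the function names are hypothetical, and replacing spaces with an underscore simply mirrors the notation used in the bigram example above.

```python
from collections import Counter

def char_ngrams(text, n=2, space_marker="_"):
    """Return all overlapping character n-grams of `text`,
    with spaces shown as `space_marker`."""
    s = text.replace(" ", space_marker)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_profile(text, n=2, top=5):
    """Relative frequencies of the `top` most common n-grams."""
    grams = char_ngrams(text, n)
    counts = Counter(grams)
    return [(g, c / len(grams)) for g, c in counts.most_common(top)]

print(char_ngrams("to be.", 2))   # ['to', 'o_', '_b', 'be', 'e.']
print(ngram_profile("to be or not to be.", n=3, top=3))
```

Because the n-grams are taken from the raw character stream, punctuation, capitalization and spacing habits all end up in the profile without any tokenizer or other language-specific tool, which is the main practical advantage listed above.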

Andrey Markov (1856-1922),
20.000 Evgenii Onegin (Markov, 1913).
-
70. Nawrot (2011, p. 471) Bennett (1976)
.
Kjell (Kjell, 1994; Kjell, Woods, & Frieder, 1993)
,
.
.
, -
(Forsyth & Holmes, 1996; Grieve, 2007; Luyckx, 2010; Luyckx & Daelemans, 2011)

2011 (Argamon & Juola, 2011; Juola, 2004; Juola, Sofko, & Brennan, 2006).
-
, ,
. , , -
(= 3 4)
,
. .. 30
. 30
. , - ( = 4)
[] ,
.
- =2 3
.
-
-. -
(= 2, 3, 4 ...), -
. Forsyth and
Holmes (1996). -
(Progressive Pairwise Chunking).
ASCII
12846. ,
,
. -
.
Houvardas and Stamatatos (2006) -
(multiword terms).
- -
. -
(Symmetrical Conditional Probability),
-.
- ( -
= 3 5),
, -
46
American Standard Code for Information Interchange (ASCII) ,
/. 128
, (..
delete, enter ...).
.
(feature selection)
, .
Zhang and Lee (2006),
, - .
, , ..
-
Jianwen, Zongkai, Pei, and Sanya (2010).
- (.. -
1 4). , -
.

(.. (Information Gain))
, .
3.2.4.

.
(.. - ),
, .
, .
,
, ..
(.. , ..), (.. ),
(.. ) ..
20
Thorndike (1901) Henry VIII . them
em Fletcher,
.
Farnham (1916),
Beaumont, Fletcher, Massinger .
,
.

.
.
The Captain (1612-13)
Fletcher . Farnham,
,
Fletcher Robert Davenport (1623-1639),
. Henry VIII Two Noble Kinsmen
Fletcher, Massinger.

Massinger.
Massinger
.

, . Tanguy,
Urieli, Calderone, Hathout, and Sajous (2011) Enron email dataset47
47
Enron Email Corpus (Klimt & Yang, 2004) 619.446 emails, 158 ,
Enron. email
, Enron,
2001. : http://www.cs.cmu.edu/~enron/
PAN 1148 200 ,
(were),
(isn't), (cannot, gimme) (gonna,
kinda, nope, ya).
,
.
(Hilton, 1982; Olsson, 2008; Osborn, 1929) ,
.
Chaski (2001), ,
,
.
,

, , .
,
. Koppel and Schler (2003)
99
. ,
Microsoft Word
.
( )
.
(Mohtasseb and Ahmed (2009), 2011)) 80.000 (blogs) 565

, ,
.
ASPELL49 ,
. 7
.
,
,
.
3.3. /
3.3.1.
.
, , 60 (Mosteller &
Wallace, 1984) Burrows (. 3.2
2). Burrows ,
(function grammar words).

. ,
. , ,
, , .
Paisley
(1964, p. 227) (minor encoding habits)
:
48
PAN 2011 Lab, Uncovering Plagiarism, Authorship, and Social
Software Misuse CLEF 2011 Conference on Multilingual and Multimodal Information Access
Evaluation, 19-22 2011 (. http://www.webis.de/research/events/pan-11).
49
ASPELL : http://aspell.net/
,
.

, .

.

.
.
,
.

.
50.
, , .

. (Mikros & Argiri, 2007),
, ,
.

, ,
. , ,
(Zhao & Zobel,
2005). Argamon and Levitan (2005)
.
(collocations),
,
. ,
.
,
. .
51
. Paisley
(1969), Nixon Kennedy, I we
,
(context-free). ,
(Baayen, van Halteren, & Tweedie, 1996;
Binongo, 1994) . , ,
, . Damerau (1975)

Poisson (. 1.4 2).
50

() (Hatzigeorgiu et al., 2000).
(Hatzigeorgiu, Mikros, &
Carayannis, 2001; Mikros, Hatzigeorgiu, & Carayannis, 2005)
20% 1000 . , 47%
, 33 . 15,5 .
.
51
O Stamatatos (2009, pp. 540-541)
: Abbasi and Chen (2005): 150 , Argamon,
Šarić, and Stein (2003): 303 , Zhao and Zobel (2005): 365 Koppel and Schler (2003): 480, Argamon et al.
(2007): 675 .

. :

Paisley
. []

,
.
.
(Damerau, 1975, p. 274)
Damerau
,

.
,
.
3.3.2.
, (content words)
. ,

.
() (Automatic Text Categorization),
.
Bag of Words (BoW) (Sebastiani,
2002),
. ,
, ( )
, ,
.
,
. Kaster, Siersdorfer, and Weikum (2005)
W Project Gutenberg.
, ,
.
, Ellegård
(1962) Junius Letters. Junius
, Public Advertiser (
), 21 1769 21 1772.

.

. ,
Sir Philip Francis (1740-1818).
Ellegård
- (distinctiveness ratio).

+ (plus words) - (minus words).
Junius, Ellegård
Junius 18 ,
. , 1,
Junius (+ ), 1, Junius (
). Ellegård Junius

.
Junius.

Junius. , Ellegård 51
Junius
. ( ) Ellegård
Junius Sir Philip Francis.
Luyckx and Daelemans (2005) ,
, 400
3 .
(Mutual Information),
.

.

. Schler, Koppel, Argamon, and Pennebaker (2006)
37.478 1.405.209 295.526.889 .
,
.
(Linguistic Inquiry and Word Count - LIWC)52
(Pennebaker & Francis,
2001).
,
,
, (Koppel, Schler, & Messeri,
2008, p. 116). Chen et al. (2011) 24
LIWC, .

(.. , ,
, ..)
.

. Zheng, Li, Chen, and
Huang (2006) 11 -
(
). Iqbal, Khan, Fung, and Debbabi (2010)
- ( content based
clustering) 13 ,
. , Pearl and Steyvers (forthcoming)
(topic models).
, ,
52
LIWC
.
. ..
.
.
4.500 (stems) 68 4
: ) ( ) )
( , ,
.. ) ( ) )
( , .. , ..).
: http://www.liwc.net/index.php)
.
(
, , ..)
28.500 2.194 10-20 .
, (sparse
logistic regression), (89%).

.
88% ,
83%,
.
,
(
/, .. ..),
.
3.3.3.
, -
, , . O Hoover (2002)
, ,
.
,
, .
Peng, Schuurmans, and Wang (2004)
, Stamatatos, Fakotakis,
and Kokkinakis (2000),
(smoothing) .
- 96%
10 . Guzmán-Cabrera, Montes-y-Gómez, Rosso, and
Villaseñor-Pineda (2008)
353 5 .
, ,
.. .
Gehrke (2008) ,
Schler et al. (2006) (. 3.2 3).
100, 250, 500, 750
1000
. 1.000
Bayes, 53
20 91%.
Coyotl-Morales, Villaseñor-Pineda, Montes-y-Gómez, and Rosso (2006)
,
.
(Maximal Frequent Word Sequences).

. ,
53
,
( )
.
,
.
(search space)
.
, Guzmán-Cabrera et al. (2008) (353 5
)
.

(collocates) .
(.. , ..).
Morton (1978) [
Grieve (2005, pp. 45-46)].
.
O Brien Darnell Morton (O'Brien & Darnell, 1982)
17 ,
the be. Hoover
(2003b) , 2 6
14
6 . 54
,
.
3.4.
3.4.1. Zipf
,
, .

,
(lexical richness indices)
(lexical diversity indices).

.
,

, .
.
, ... ( 3.1) 10
55.
%
1 1.100.288 3,249
2 812.275 2,398
3 741.501 2,190
4 663.393 1,959
5 655.688 1,936
6 562.969 1,662
7 544.160 1,607
8 488.331 1,442
9 457.177 1,350
10 456.818 1,349
54
. Argamon and Levitan (2005) 3.1 3
, de Vel, Anderson, Corney, and Mohay (2001)
Grieve (2007, p. 263).
55
100 3.
3.1 10 .

,
.
100 ( 3.4):
3.4 .
,
.
30,
,
.
. George Kingsley Zipf,
, (Zipf, 1935, 1949) .
Zipf, ,

.
,
, ...
, Zipf
(19):

$$f_r = \frac{C}{r^{a}} \qquad (19)$$

r ,
, fr C
56.
Zipf
, 57.
, (power laws)
. Zipf
,
( ), ,
. , ( 3.5)
10.000
(Mikros et al., 2005):
3.5
10.000 (33 . ).
(
) , .
(19) :
$$\log f_r = \log\frac{C}{r^{a}} = \log C - a \log r \qquad (20)$$

56
(19)
, .
57
Zipf (Zanette
& Manrubia, 1997), (Silagadze, 1997),
(Huberman, Pirolli, Pitkow, & Lukose, 1998) .. .
http://linkage.rockefeller.edu/wli/zipf/index_ru.html
58
Zipf 59.
Zipf
,
. o Mandelbrot (Mandelbrot, 1953) ,

Zipf-Mandelbrot:

$$f_r = \frac{C}{(r + b)^{a}} \qquad (21)$$
(21)
b ,
, (Baayen, 2001, pp. 101-102).

. Zipf (Zipf, 1949, p. 21)
, .
Zipf ,
.
,
(principle of least effort).
,
, .
,

.
Zipf ,
.
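The log-log form of Zipf's law (Equation 20) suggests a simple way to estimate its parameters from a text: regress log frequency on log rank. The following Python sketch assumes ordinary least squares on all ranks and natural logarithms; the function name `zipf_fit` is hypothetical, and more careful estimators exist for real corpora, where the highest and lowest ranks usually deviate from the straight line.

```python
from collections import Counter
from math import exp, log

def zipf_fit(tokens):
    """Estimate (a, C) in f_r = C / r**a by least squares on
    log f_r = log C - a * log r. Needs at least two word types."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [log(r) for r in range(1, len(freqs) + 1)]
    ys = [log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, exp(intercept)

# Artificial text whose frequencies follow f_r = 120 / r exactly.
tokens = []
for rank, freq in enumerate([120, 60, 40, 30, 24], start=1):
    tokens += [f"w{rank}"] * freq

a, c = zipf_fit(tokens)
print(round(a, 3), round(c, 1))  # 1.0 120.0
```

An exponent close to 1 and an intercept close to the frequency of the most frequent word are what the classical form of the law predicts.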
3.4.2.

(types) ,
. : ,
. 12,
8.
(. 3.2):

1 3
2 3
58
.
. ,
(r²) ,
.
r² 0,98 p<0,001,
98% .
,
, .
-1, 02.
59
Zipf . :
(Zipf, 1935), (Tuzzi, Popescu, & Altmann, 2009), , (Le Quan Ha,
Stewart, Hanna, & Smith, 2006), (Choi, 2000), (Jayaram & Vidya, 2008), (Dalkılıç & Çebi,
2005), (Suen, 1986), (Manaris, Pellicoro, Pothering, & Hodges, 2006) ..
3 1
4 1
5 1
6 1
7 1
8 1
3.2 ,
.
3
1 , .
,

. N,
V(N). V(i, N)
i N .
N=12, V(12)= 8, V(1, 12)= 6 6, V(3, 12)= 2
2.
, V(N)
,
. , ,
,
. (Templin, 1957) ,
,
() (V(N)).
(Type-Token Ratio TTR) (22):
$$TTR = \frac{V(N)}{N} \qquad (22)$$
TTR 0,66 (8/12). TTR

(McEvoy & Dodd, 1992) (Avent &
Austermann, 2003; J. F. Miller, 1982).
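The quantities N, V(N), V(i, N) and the TTR defined above can all be read off a frequency count. The Python sketch below is a minimal illustration; the function name `vocabulary_stats` is hypothetical, and the token list is an artificial stand-in that reproduces the counts of the example (12 tokens, 8 types, 6 hapax legomena, 2 words occurring 3 times).

```python
from collections import Counter

def vocabulary_stats(tokens):
    """Return N, V(N), the frequency spectrum {i: V(i, N)} and the TTR."""
    freqs = Counter(tokens)                 # word type -> frequency
    n = len(tokens)                         # N: number of tokens
    v = len(freqs)                          # V(N): number of types
    spectrum = dict(Counter(freqs.values()))  # i -> V(i, N)
    return n, v, spectrum, v / n

tokens = "a a a b b b c d e f g h".split()
n, v, spectrum, ttr = vocabulary_stats(tokens)
print(n, v, spectrum, round(ttr, 2))  # 12 8 {3: 2, 1: 6} 0.67
```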
TTR
.
.
.
60 .
.
.. 700 V(N)
(1400 ) V(N). , ,
700 700. 2.100
... V() .
28.538 40 700
61.
(vocabulary growth rate)
( 3.6):
60
, (1993). . .
61
4.
3.6 .
(V(N)), (V(1, N)).
,
. ,
.
,
1. ,

.
TTR
. TTR (
3.7):
3.7 TTR
.
TTR (
). TTR
.
. 700
5.000 , TTR
, . ,
. (N)
, .
. , ,
,
. TTR
,
.

TTR . Youmans (1991)
TTR. TTR ,

. ,
TTR
-62. TTR ( 3.8):
62
-, (1991) , -, (1988)
, .
3.8 TTR ()
().
TTR
.
TTR
, 10.000 .
(), ,
TTR ,
. TTR
TTR
63.
3.4.3. TTR
TTR
,
.
TTR . O Biber (1988)
400 , o Besnier (1988) 500 Bucks, Singh,
Cuerden, and Wilcock (2000) 1.000 .
, . ..
3 1.000
, 5.000 1.250 . TTR
1.000 (
).
(80%). ,
63
TTR
(Harris Wright, Silverman, & Newhoff, 2003; Malvern, Richards, Chipere, & Durán, 2004; Silverman & Ratner, 2002).
DM&R
3.4.5 .
20% , 80%
TTR. , , ,
.
TTR
TTR (Mean Segmental TTR MSTTR)64
. (Manschreck, Maher, & Ader, 1981; Meara, 1978). MSTTR,
(.. 200 ) TTR.
TTR MSTTR
. McCarthy (2005)
, . ,
(.. 20 )
MSTTR
. ,
(.. 1.000 )
, MSTTR .
TTR Covington and McFall
(2010), TTR (Moving Average TTR
MATTR). To MATTR
. .. 500
1.000 , TTR 500 (1 500)
2-502, 3-
503 ... . TTR
MATTR. ,
,
.
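The two segment-based variants described above differ only in how the windows are laid out: MSTTR uses consecutive, non-overlapping segments, while MATTR slides a window forward one token at a time. The following Python sketch is a straightforward, unoptimized illustration; the function names are hypothetical, and dropping a final incomplete segment in MSTTR is an assumption.

```python
def ttr(tokens):
    """Plain type-token ratio of a token list."""
    return len(set(tokens)) / len(tokens)

def msttr(tokens, segment=200):
    """Mean TTR over consecutive non-overlapping segments.
    A trailing segment shorter than `segment` is ignored (assumption)."""
    starts = range(0, len(tokens) - segment + 1, segment)
    values = [ttr(tokens[s:s + segment]) for s in starts]
    if not values:
        raise ValueError("text is shorter than one segment")
    return sum(values) / len(values)

def mattr(tokens, window=500):
    """Mean TTR over every window of `window` consecutive tokens
    (positions 1-500, 2-501, 3-502, ... for window=500)."""
    n_windows = len(tokens) - window + 1
    if n_windows < 1:
        raise ValueError("text is shorter than the window")
    return sum(ttr(tokens[i:i + window]) for i in range(n_windows)) / n_windows

toy = list("abcdabca")
print(msttr(toy, segment=4))  # 0.875
print(mattr(toy, window=4))   # 0.95
```

For long texts the MATTR loop can be made much faster by updating a running frequency count as the window moves instead of rebuilding a set each time; the version above favours readability.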
TTR ,
N V(N)
TTR. Guiraud (1954) TTR (root TTR RTTR) ( (23))
Carroll (1964) TTR (corrected TTR CTTR) ( (24)):
$$RTTR = \frac{V(N)}{\sqrt{N}} \qquad (23)$$
$$CTTR = \frac{V(N)}{\sqrt{2N}} \qquad (24)$$
,
. (Herdan (1960), 1964))
( (25)), Dugast (1979) ( (26)), Maas (1972) ( (27)), Dugast (1979)
( (28)), Tuldava (1977) ( (29)) Tuldava (1993) ( (30)) :
$$C = \frac{\log V(N)}{\log N} \qquad (25)$$
$$k = \frac{\log V(N)}{\log(\log N)} \qquad (26)$$
64
MSTTR . WordSmith Tools (Scott,
2008) Standardized TTR, Jarvis (2002) split TTR.
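$$a^{2} = \frac{\log N - \log V(N)}{\log^{2} N} \qquad (27)$$
$$U = \frac{\log^{2} N}{\log N - \log V(N)} \qquad (28)$$
$$LN = \frac{1 - V(N)^{2}}{V(N)^{2}\,\log N} \qquad (29)$$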
~c (~cj)
= (30)
j
~c (~c + 065 )
j
Brunet (1978) (31)

-0,172,
(Tweedie & Baayen, 1998, p. 328).

$$W = N^{V(N)^{-a}}, \quad a = 0{,}172 \qquad (31)$$
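The length-corrected indices of Equations (23) to (29) and (31) need only N and V(N), so they can be computed together. The Python sketch below is illustrative: the function name and dictionary keys are hypothetical, and the use of natural logarithms is an assumption (the base cancels out in C, but it changes the scale of k, a², U and LN).

```python
from math import log, sqrt

def length_corrected_indices(n, v, a=0.172):
    """Lexical richness indices computed from N (`n`) and V(N) (`v`).
    Requires 1 < v < n. Natural logarithms are assumed throughout."""
    log_n, log_v = log(n), log(v)
    return {
        "RTTR": v / sqrt(n),                   # Guiraud, Eq. (23)
        "CTTR": v / sqrt(2 * n),               # Carroll, Eq. (24)
        "C": log_v / log_n,                    # Herdan, Eq. (25)
        "k": log_v / log(log_n),               # Dugast, Eq. (26)
        "a2": (log_n - log_v) / log_n ** 2,    # Maas, Eq. (27)
        "U": log_n ** 2 / (log_n - log_v),     # Dugast, Eq. (28)
        "LN": (1 - v ** 2) / (v ** 2 * log_n), # Tuldava, Eq. (29)
        "W": n ** (v ** -a),                   # Brunet, Eq. (31)
    }

indices = length_corrected_indices(n=100, v=50)
print({name: round(value, 4) for name, value in indices.items()})
```

Note that U is simply the reciprocal of a², so the two indices carry the same information on different scales.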
McCarthy (2005)

, ,
.
, Measure
of Textual Lexical Diversity (MTLD).

TTR ( 0,72 ). TTR
- TTR 0,72,
( ).
TTR 0,72 (factor).
MTLD
: ,
. MTLD TTR
: (1,00) (1,00), (0,66). MTLD
TTR 0,72, (, )
TTR 2/3=0,66.
3 3 . TTR MTLD
: (1), (1), (1), (1),
(1), (0,83) (0,86) (0,75) (0,78).
0,72. McCarthy
(partial factor).
0,7266 78,6% 0,786.
65
.
66
: 1 - 0,72 = 0,28.
0,78, : 1 - 0,78 = 0,22. (0,22)
78,6% 0,72 (0,22*100/0,28).
0,786.
, 1,786 . MTLD

. , MTLD 6,72 (12/1,786).
, McCarthy
, MTLD . ,
MTLD .
: . MTLD
MTLD
MTLD (
).
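The factor counting behind MTLD can be sketched in a few lines. The Python code below is an illustration under stated assumptions, not McCarthy's reference implementation: the function names are hypothetical, a factor is closed as soon as the running TTR drops below 0,72, and the final score is taken as the mean of a forward and a backward pass over the text.

```python
def mtld_one_direction(tokens, threshold=0.72):
    """Number of tokens divided by the number of factors, where a
    factor ends when the running TTR falls below `threshold`."""
    factors = 0.0
    types, count = set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:
            factors += 1          # full factor completed
            types, count = set(), 0
    if count > 0:
        # Partial factor for the leftover tokens: how far the TTR
        # has travelled from 1 towards the threshold.
        remaining_ttr = len(types) / count
        factors += (1 - remaining_ttr) / (1 - threshold)
    if factors == 0:
        raise ValueError("no factor could be computed for this text")
    return len(tokens) / factors

def mtld(tokens, threshold=0.72):
    """Mean of the forward and backward MTLD values."""
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2

# Artificial 12-token text with the same TTR trajectory as the example:
# 1, 1, 0.67 (factor), then 1, 1, 1, 1, 1, 0.83, 0.86, 0.75, 0.78.
tokens = ["a", "b", "a", "c", "d", "e", "f", "g", "c", "h", "d", "i"]
print(round(mtld_one_direction(tokens), 2))  # 6.69
```

The small difference from the value 6,72 in the text comes from rounding: the worked example rounds the final TTR to 0,78, whereas the code uses the exact value 7/9.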
3.4.4.
,
,
() V(N).
(lexical
frequency spectrum).
.
: .
: V(1, 9)=7, V(2, 9)=2, 9 7
2 .
,
.
Honoré (1979) ,
( (32)):
$$H_r = 100\,\frac{\log N}{1 - \dfrac{V(1, N)}{V(N)}} \qquad (32)$$
Sichel (1975)
( (33)):
$$S = \frac{V(2, N)}{V(N)} \qquad (33)$$
Yule (1944, p. 57) (characteristic constant ),

( (34)):
$$K = 10.000\footnotemark[67]\,\frac{S_2 - S_1}{S_1^{2}} \qquad (34)$$
S2=Σ(fx·x²) S1= Σ(fx·x). (34) (35):

$$K = 10.000\,\frac{\sum f_x x^{2} - \sum f_x x}{\left(\sum f_x x\right)^{2}} \qquad (35)$$
(35) Jarvis (2002, p. 59) (36):
67
10.000 Yule,
.

$$K = 10.000\,\frac{\sum f_x x^{2} - N}{N^{2}} \qquad (36)$$
fx , X
.
,
(37)68:
$$K = 10.000\,\frac{\sum_{i=1}^{N} V(i, N)\, i^{2} - N}{N^{2}} \qquad (37)$$
V(i, N) i .
K :
. N=15.
:
V(1, 15)= 9 [, , , , , , , , ], 9
1 .
V(2, 15)= 3 [, , ], 3 2 .
(36) :
$$K = 10.000\,\frac{9 \cdot 1^{2} + 3 \cdot 2^{2} - 15}{15^{2}} = 10.000\,\frac{9 + 12 - 15}{15^{2}} = 10.000\,\frac{21 - 15}{225} = 10.000\,\frac{6}{225} = 10.000 \cdot 0{,}0266 = 266{,}66$$
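The spectrum-based measures of Equations (32), (33) and (37) can be computed directly from the frequency spectrum V(i, N). The following Python sketch is illustrative only: the function name and keys are hypothetical, the natural logarithm in Honoré's measure is an assumption, and the token list is an artificial stand-in with the same spectrum as the worked example (N = 15, nine words occurring once and three occurring twice).

```python
from collections import Counter
from math import log

def spectrum_measures(tokens):
    """Honoré's H_r, Sichel's S and Yule's K from a token list."""
    n = len(tokens)
    freqs = Counter(tokens)
    v = len(freqs)
    spectrum = Counter(freqs.values())        # i -> V(i, N)
    v1, v2 = spectrum.get(1, 0), spectrum.get(2, 0)
    # H_r is undefined when every word is a hapax legomenon.
    honore = 100 * log(n) / (1 - v1 / v) if v1 < v else float("inf")
    sichel = v2 / v
    yule_k = 10_000 * (sum(vi * i ** 2 for i, vi in spectrum.items()) - n) / n ** 2
    return {"Hr": honore, "S": sichel, "K": yule_k}

tokens = [f"hapax{i}" for i in range(9)] + ["x", "x", "y", "y", "z", "z"]
result = spectrum_measures(tokens)
print(round(result["K"], 2), result["S"])  # 266.67 0.25
```

The value 266.67 matches the hand calculation above up to rounding (the text truncates 0,02666... to 0,0266).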
Poisson.
.
Herdan (1955)

. (38):
$$V_m = \sqrt{\sum_{i=1}^{V(N)} V(i, N)\left(\frac{i}{N}\right)^{2} - \frac{1}{V(N)}} \qquad (38)$$
Tweedie and Baayen (1998, p. 330) (39) Vm:
1 1
& = + (39)
j j
(Thoiron,
1986), , D Simpson
(1949). D .
0 1 0 1 .
68
3 K
(Miranda-García & Calle-Martín, 2005, p. 288).
D ,
. D (40)69:
$$D = \frac{\sum_{i=1}^{V(N)} n_i (n_i - 1)}{N (N - 1)} \qquad (40)$$
n .
:
(n) n-1 n(n-1)

1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
o 2 1 2
2 1 2
2 1 2
6
()
( )
$$D = \frac{\sum n(n-1)}{N(N-1)} = \frac{6}{15\,(15 - 1)} = \frac{6}{15 \cdot 14} = \frac{6}{210} = 0{,}028$$
D
. ,
, 0.
, D Vm
.
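Simpson's D (Equation 40) and Herdan's Vm (Equation 38) both measure how strongly the text concentrates on a few repeated words, and both follow from the list of word frequencies. The Python sketch below is a minimal illustration with hypothetical function names; the frequency list reproduces the worked example above (nine words with frequency 1, three with frequency 2, N = 15).

```python
from collections import Counter
from math import sqrt

def simpson_d(frequencies):
    """Probability that two tokens drawn at random without
    replacement belong to the same word type (Eq. 40)."""
    n = sum(frequencies)
    return sum(f * (f - 1) for f in frequencies) / (n * (n - 1))

def herdan_vm(frequencies):
    """Herdan's Vm computed from the frequency spectrum (Eq. 38)."""
    n = sum(frequencies)
    v = len(frequencies)
    spectrum = Counter(frequencies)           # i -> V(i, N)
    total = sum(vi * (i / n) ** 2 for i, vi in spectrum.items())
    return sqrt(total - 1 / v)

frequencies = [1] * 9 + [2] * 3
print(round(simpson_d(frequencies), 4))  # 0.0286
print(round(herdan_vm(frequencies), 4))  # 0.1
```

A text in which every word occurs exactly once gives D = 0, as stated above, because no pair of tokens can share a type.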
García and
Martín (2007), (functional density),
(Cw) (Fw) ( (41)):
|
= (41)
|

.
, ,
(information theory)
(entropy). () ( )
.
(42):
69
(40) Simpson (1949, p. 688) Thoiron (1986, p. 198).
Tweedie and Baayen (1998, p. 330) , , , .
$$H = -\sum_{i=1}^{V(N)} p_i \log_2 p_i \qquad (42)$$
pi i 70, log2
(
).
: .
() 9 (V(N)) 7.
, ( 3.3):
Word  Frequency  pi  log2(pi)  pi*log2(pi)
o 2 2/9=0,222 -2.1699 -0.48221
1 1/9=0,111 -3.1699 -0.35221
2 2/9=0,222 -2.1699 -0.48221
1 1/9=0,111 -3.1699 -0.35221
1 1/9=0,111 -3.1699 -0.35221
1 1/9=0,111 -3.1699 -0.35221
1 1/9=0,111 -3.1699 -0.35221
Total 9   H = -Σ pi*log2(pi) = 2,72548
3.3 .
2,725 bits.
, bits
.
, .
: Hmax = log2(n) n
. Hmax : max = log2(7) = 2,81.
, (Hrel),
max . : Hrel = H / Hmax = 2,725/2,807 = 0,971.
. 1,
, .
(redundancy - R), : R= 1 - Hrel
R = 1 - 0,971 = 0,029. O
(2,9%) ,
. ,
2.
,
.
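The entropy-based quantities of this section (H, Hmax, Hrel and the redundancy R) can be verified with a short script. The following Python sketch uses hypothetical function and key names and takes the word frequencies of Table 3.3 as input (two words occurring twice and five occurring once, 9 tokens in total).

```python
from math import log2

def entropy_measures(frequencies):
    """Entropy (bits), maximum entropy, relative entropy and
    redundancy of a word-frequency list."""
    n = sum(frequencies)
    probabilities = [f / n for f in frequencies]
    h = -sum(p * log2(p) for p in probabilities)   # Eq. (42)
    h_max = log2(len(frequencies))                 # all types equally likely
    h_rel = h / h_max
    return {"H": h, "Hmax": h_max, "Hrel": h_rel, "R": 1 - h_rel}

measures = entropy_measures([2, 1, 2, 1, 1, 1, 1])
print({name: round(value, 3) for name, value in measures.items()})
# {'H': 2.725, 'Hmax': 2.807, 'Hrel': 0.971, 'R': 0.029}
```

The function assumes at least two word types; with a single type the maximum entropy is 0 and the relative entropy is undefined.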
3.4.5.

,
.
70

().
(Large Number of Rare Events LNRE)
Zipf,
. Orlov (1982) Zipf
(generalized Zipf model) (43):
j j
(j) = ~c (43)
~c ( ) j
p (maximum sample probability)

. Z
Zipf . Z71
Zipf
.
,
, Gauss-Poisson ( (44)).
Sichel (1975)
(b, c, ). ()
Tweedie and Baayen (1998, p. 331) -0,5:
2
(j) = 1 6 G7 Gok
(44)

2/bc
V(N) . e
Euler 2,718. b c
, c ,

(Bell, Berridge, & Rayson, 2009, p. 5).

(optimization algorithms), ,
. ,

.
, 72 .
: =-0,992, b=0,0059, c=0,2211.
,
.
() , ( 3.4):
           V         V1        V2        V3      V4      V5
Observed   7.531     5.116     1.075     460     206     135
Expected   7.533,33  5.108,62  1.116,74  422,77  217,37  131,91
3.4
.
, Gauss-Poisson
. 7.531
71
. , , and
(2011, pp. 39-50).
72
zipfR (Evert & Baroni, 2007).
7.533. , 5.116 (V1),
5.108, .
2,
73,
. ,
( 3.9)
:
3.9 ,
Gauss-Poisson.
Gauss-Poisson

.

,
.
, 28.000 , 50.000 .
( 3.10):
73
χ² = 15,8, df=13, p=0,26.
3.10 (V(N))
50.000 .

. 28.000
50.000 (
10.737 ).
,
.

.
Gauss-Poisson
. ( 3.11)
,
:
3.11
100.000 (1e+05).
.
.
,
,
74.
Malvern et al. (2004), Gauss-Poisson,
Sichel,
, , .
, (45), (DM&R)75
.
$$TTR = \frac{D}{N}\left[\left(1 + 2\,\frac{N}{D}\right)^{1/2} - 1\right] \qquad (45)$$
DM&R (vocd76),
CHILDES (MacWhinney, 2000a, 2000b).
:
74
, 70.000 , 13.366
, 70.000 14.628
.
75
DM&R Malvern & Richards D Simpson (.
(40)). , (Malvern et al., 2004)
(Malvern & Richards, 1997) Malvern & Richards .
76
vocd CHILDES, CLAN
: http://childes.psy.cmu.edu/
1.
.
,
.
2. 35-50 . 35
( ). 36 ...
3. 100
TTR . 35 100
35 100 TTR
.
4. DM&R TTR .
5. DM&R
Doptimum TTR
. Doptimum DM&R
.
6. ( 1 5) 3
DM&R, Doptimum
.
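The six steps above can be approximated with a short script. The Python sketch below is only an illustration and not the vocd program itself: the function names are hypothetical, samples are assumed to be drawn without replacement, and the curve-fitting stage is replaced by a plain grid search over D values from 1 to 200 in steps of 0.1, which is a simplification chosen for readability.

```python
import random
from math import sqrt

def ttr_model(n, d):
    """Theoretical TTR for a sample of n tokens, Eq. (45)."""
    return (d / n) * (sqrt(1 + 2 * n / d) - 1)

def estimate_d_once(tokens, rng, sizes=range(35, 51), trials=100):
    """One pass of steps 1-5: empirical TTR curve, then best-fitting D."""
    curve = []
    for n in sizes:
        # Mean TTR of `trials` random samples of n tokens each.
        mean_ttr = sum(len(set(rng.sample(tokens, n))) / n
                       for _ in range(trials)) / trials
        curve.append((n, mean_ttr))
    # Grid search (assumption): D minimizing the squared distance
    # between the theoretical and the empirical curve.
    candidates = [step / 10 for step in range(10, 2001)]
    return min(candidates,
               key=lambda d: sum((ttr_model(n, d) - t) ** 2 for n, t in curve))

def vocd_d(tokens, repeats=3, seed=0):
    """Step 6: average the optimum D over several passes."""
    if len(tokens) < 50:
        raise ValueError("at least 50 tokens are needed")
    rng = random.Random(seed)
    return sum(estimate_d_once(tokens, rng) for _ in range(repeats)) / repeats

repetitive = ["a", "b", "c", "d", "e"] * 20
varied = [f"w{i}" for i in range(80)] + ["a"] * 20
print(vocd_d(repetitive), vocd_d(varied))
```

A lexically varied text should receive a clearly higher D than a repetitive one; values at either end of the grid (1.0 or 200.0) indicate that the true optimum lies outside the searched range.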
3.4.6.
,

(
). , ,
.
Thoiron (1986) D ( (40))
( (42)) .

. (V(N))
, D
.
, . ,
, ..
, D .
,
,
(Thoiron, 1986, p. 199).
, ,
, . ,

,
.

,
(Thoiron, 1986, p. 200).

Tweedie and Baayen (1998),
77
. Hr ( (32)), S
77
: RTTR, C, k, 2, U, LN, W, Hr, S, K, D, Vm, , Z, b c
Gauss-Poisson.
( (33)), Z ( (43)) b c Gauss-Poisson
( (44)), .

. ,
78. ,
20
( 4.4.2 3)
. 5.000 (randomizations)
, 5.000 ,
. 5000

.
(
)
K Z.
,
. 20
Hr ( (32)), LN ( (29)), S ( (33)) b
Gauss-Poisson ( (44))
.

. ,

.
, 16 8 . -
S, ( )
,
.
,
(cluster analysis) 4 : K, D, Vm,
LN c Gauss-Poisson,
S b .

,
.
K
.
()
.

8 100 .
,
.

Z
. ,
, ,
.
Tweedie and Baayen (1998), o Hoover (2003a)

78
,
, .
(urn model)
.
,
. Thoiron (1986), o
Hoover 50.000 5
(Vm, K,
). ,
.
Hoover ,
,
.
( )
( ).
.
, ,
, . Hoover
Tweedie
and Baayen (1998).
, Vm
. ,
, .
55 36 188
24.000 . ,
, . , Hoover (2003a,
p. 168),
.
,
, ,
.
O Jarvis (2002) DM&R
TTR,
, .
DM&R,
4 (C ( (25)), RTTR
( (23)), U ( (28)), Z ( (43))).
Tweedie and Baayen (1998), Jarvis 20
,
.
TTR . 79
DM&R U
, ,
.
McCarthy (2005),
,
,
. 207 ,
414.000 ,
23 .

Hess, Sefton, and Landry (1986). , 200 1
200 , 2 100 4 50 .

79
Jarvis (
500 ) .
, 80
. Pearson r
( ) ,
.
K 2 ,
U DM&R.
, 100 500 .
, MTLD, McCarthy,
, . MTLD
(K, 2, U, DM&R)
(
MSTTR 2000 ). ,
100 2.000 .
4 ,
. ( 3.5)
:

        1          2          3            4
K       100-500    154-666    250-1.000    400-2.000
2       100-154    154-333    200-666      250-2.000
DM&R    100-400    200-500    250-666      400-1.000
U       154-250    200-500    254-1.000    286-2.000
3.5 , .

. K
100 500 , 154
600 , 250 1.000 400 2.000 . , ,
300 1000 2, o
. , 4
250 2.000 400 2.000 .
McCarthy (2005) McCarthy and Jarvis (2007),
DM&R

(hypergeometric distribution). DM&R
(probability mass function)
.
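The link between lexical diversity and the hypergeometric distribution can be illustrated with a short Python sketch of an HD-D style index. It assumes the usual formulation, in which each word type contributes the hypergeometric probability of occurring at least once in a random sample of tokens, divided by the sample size. The sample size of 42 and the function name `hdd` are assumptions made for illustration and are not taken from the text above.

```python
from collections import Counter
from math import comb


def hdd(tokens, sample_size=42):
    """Sum over types of P(type occurs at least once in a sample) / sample_size."""
    n = len(tokens)
    if n < sample_size:
        raise ValueError("text is shorter than the sample size")
    denom = comb(n, sample_size)  # number of possible samples
    total = 0.0
    for freq in Counter(tokens).values():
        # Hypergeometric probability of drawing zero tokens of this type.
        p_absent = comb(n - freq, sample_size) / denom
        total += (1.0 - p_absent) / sample_size
    return total
```

Under these assumptions the result equals the expected TTR of a random sample of `sample_size` tokens, so it is obtained in closed form rather than by repeated sampling.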
(McCarthy & Jarvis, 2010)
MTLD,
, (DM&R, HD-D, K, 2).
HD-D
TTR . ,
.
, (convergent validity),
(divergent validity), (internal validity) (incremental
validity). H
, .
80
: MTLD, DM&R, , U, CTTR, RTTR, C, 2, W.
,
. ( ) TTR,
.
MTLD
. MTLD r=0,795,
MTLD .
,

. MTLD
TTR , ,
MTLD, .

, .
MTLD
.
, 81.
,
,
.
16
.
(stepwise),
MTLD DM&R. ,
,
.
MTLD, o DM&R 2
.

, ,
. MTLD,
, ,
. 0,72
.
McCarthy and Jarvis (2010, p. 386) TTR ,
. , ,
; 0,72
( .. );

.
,
, , ,
.
3.5.
3.5.1.
,
(.
2.3 1 Mascol). ,
.
20 Udny
Yule, .
(Yule, 1939)

81
: K (r=0,112, p<0,001), 2 (r=0,125, p<0,001), DM&R (r=0,190, p<0,001),
HD-D (r=0,282, p<0,001), TTR (r=0,811, p<0,001).
, .

Bacon, Coleridge, Lamb Macaulay.
( )
(, (quantile),
(interquantile range)),
(Yule, 1939, p. 370).
Yule
, De
Imitatione Christi, 15 . Yule
Thomas à Kempis,
,
. Yule Thomas à Kempis,
Jean Charlier de Gerson, , ,
.

. Thomas à Kempis
De Imitatione Christi. , Gerson

.
,
.
Yule, o C. B. Williams (Williams, 1940),
.
Yule,
( )
.
(.. ),

. Williams
( )

. log-normal
.
, ()
600 . ,
. , ,
, . Williams

:

,
, [] ,
.
(Williams, 1940, p. 361)
Wake
(1957), , ,
( ,
).
,
.
(10 ) (7
). , , .
3 ,
) )
)
.
, Wake . 3

,
.

(Wake, 1957, p. 345). ,
. Grieve
(2005):
O Wake , [ ].

, .
(Grieve, 2005, p. 16)
70 Sichel (1974), Yule, Williams Wake,

log-normal,
, .
,
Poisson (mixing Poisson distributions)
log-normal (. (44)). Sichel (1974, p.
29) Williams (1940) Yule (1939)
,
.

Gustav Herdan,
(Herdan,
1960), Ellegård (1962, p. 10).
Mosteller and Wallace (1984, p. 7),
Hamilton Madison (. 2.1
2).
(normalization)

(Wake, 1957, p. 334). ,
(Holmes, 1985, pp. 331-332),
,
.
,
, ,
, , , .. Kelih, Grzybek,
Antić, and Stadlober (2006) ,
( 4 , , Ord S I)
333
. 94%,

.
,
, ,
. 80 ,
,
(. (Mikros (2007), 2009); Patton and Can (2004); Sousa-Silva et
al. (2010); Zheng et al. (2006))).
3.5.2.
- (Parts of Speech PoS)
82.
Yule (1944) De Imitatione Christi
Thomas à Kempis. Somers (1966)
:

,
.
, , .
(Somers, 1966, p. 129)

. , O'Donnell
(1966), , , ,
(O'Donnell,
1966, p. 109).
,

(Part of Speech Taggers).
100%83.

84 : _art _nou _vrb _art _nou.

( )
.
8
(. (Mealand (1995); Mikros (2006); Patton
and Can (2004))).
Frantzi (2006)
17 .
,
, ,
, .

.
82
Nirukta, 5-6 . .
Yāska ( Pāṇini),
: ) , ) ) ) ,
. ,
. (, , ,
, , , ),
, 2 ..
.
83
.
http://aclweb.org/aclwiki/index.php?title=POS_Tagging_%28State_of_the_art%29
84
.
, ,
. ,
, XML (XML tags).
XML ,
. , , .

.

.
- , -
.
Kukushkina, Polikarpov, and Khmelev (2001, p. 176)
,
.
-

-. .. ,
, . adv adv cnj adv prt
vrb adv. - = 2
: [adv adv], [adv cnj], [cnj adv], [adv prt], [prt vrb], [vrb adv].
(. (Diederich, Kindermann,
Leopold, and Paass (2003); Koppel and Schler (2003); Koppel, Schler, and Zigdon (2005); Madigan et al.
(2005); Zhao and Zobel (2007)) (. (Gamon (2004); Koppel, Akiva, and Dagan (2003);
Koppel, Argamon, and Shimoni (2002); Luyckx and Daelemans (2011))).
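The bigram example above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `pos_ngrams` is hypothetical, and the tag sequence is the one quoted in the example (adv adv cnj adv prt vrb adv).

```python
from collections import Counter


def pos_ngrams(tags, n=2):
    """Return all contiguous n-grams of part-of-speech tags, in order."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


tags = "adv adv cnj adv prt vrb adv".split()
bigrams = pos_ngrams(tags, n=2)
# bigrams == [('adv', 'adv'), ('adv', 'cnj'), ('cnj', 'adv'),
#             ('adv', 'prt'), ('prt', 'vrb'), ('vrb', 'adv')]

# Relative frequencies of the n-grams can serve as stylometric features.
counts = Counter(bigrams)
profile = {gram: count / len(bigrams) for gram, count in counts.items()}
```

A sequence of k tags yields k - n + 1 n-grams, so the seven tags above produce six bigrams. In a real application the counts would be accumulated over a whole text and normalised by the total number of n-grams, as `profile` does here for a single sentence.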
-,
. ,
Zhao and Zobel (2007, p. 67),
,
. , ,
.
(Koppel et al., 2002), (Nowson, Oberlander, & Gill, 2005),
, (McCarthy, Myers, Briner, Graesser, & McNamara, 2009)
. ,
,

.

(confounding variables).
3.5.3.
Syntactic Structures Noam Chomsky 1957
.
,
. , ,
.
(parsing) 90
.
,
, ,
.
,

(Glover & Hirst, 1996). , Milic (1967)
(.. ),
.
Baayen et
al. (1996). 40.000
(parser),

.
(rewrite rules) .
, : ->
+ , -> + 85. ,
, .
,
, .. -> +
W0001, , -> + W0002 ...
.
,
(, ). ,

. K Yule, D
Simpson, R Honoré, S Sichel W Brunet,
.
, ,
,
.

Stamatatos et al. (2000)
(chunk boundaries).

( -NP, -PP, -VP, -AP) .
(Stamatatos et al., 2000, p. 477):
VP[ ] NP[ ] PP[ ] CON[] VP[] CON[] NP[
] PP[ ] PP[ ] VP[ ] PP[
5 . . ] NP[ ] VP[] NP[
]

. ,
,
.
,
. ,
...
.
(10 , 30 )

.
Hirst and Feiguina (2007) Baayen et al. (1996)
Stamatatos et al. (2000)
.
(partial parsing).
(chunking) ,
(recursive grammar).
(islands of certainty),
(Hirst & Feiguina, 2007, p.
408).

.
, , , (noun
group) .. , Baayen et al. (1996),
85
= , = , = , = .
o .
( ),
( )
Baayen et al. (1996), .
(Dickens ~ Austen Brontë),
,
, ,
1.000 .
Kaster et al. (2005)
.
, (syntax tree depth),
.
, ,
...
, ( 3.12):
3.12 .
,
, .
( 4).

.
(. 2.1 3).
.
(bins),
(classes).
. 6
.. 1-5, 6-10, 11-15 ...

.
Doyle Burton
.
Doyle (6-10), Burton
(11-15 16-20).
,
.
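Grouping a numeric measurement into bins (classes) of width 5, such as 1-5, 6-10, 11-15, can be sketched as follows. This is an illustrative Python sketch: the function names `bin_label` and `binned_profile` are hypothetical, and the input values in the usage note are invented for the example rather than taken from the texts discussed above.

```python
from collections import Counter


def bin_label(value, width=5):
    """Map a positive integer to its class label, e.g. 7 -> '6-10' for width 5."""
    if value < 1:
        raise ValueError("value must be a positive integer")
    low = ((value - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def binned_profile(values, width=5):
    """Relative frequency of each class among the given values."""
    counts = Counter(bin_label(v, width) for v in values)
    total = len(values)
    return {label: count / total for label, count in counts.items()}
```

For instance, `binned_profile([3, 7, 8, 12])` assigns one value to 1-5, two to 6-10 and one to 11-15. Comparing such profiles between authors shows which classes each author prefers.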
,
,
(. (Kim, Kim, Weninger, Han, and Hyun (2011);
Raghavan, Kovashka, and Mooney (2010); Swanson (2011))).
3.6.
.
(, , )
,
, .

.
, O'Donnell (1966), O Ruddy
Stephen Crane :
Stephen Crane ,
[ ]
.
(O'Donnell, 1966, p. 110)
,
(shallow analysis)86,

. Gamon (2004)
,
,
(context free grammars)
(semantic graphs). ,
(binary semantic
features) (semantic modification relations).
, ,
(subcategorization features).
, ..
, , .

.

Argamon et al. (2007). -
- (Systemic-Functional Grammar) (Halliday, 1994)

(functional lexical features). ,
,
,
.
( ),
.
(cohesion),
, (Assessment),
86
,
,
. ,

.

(Appraisal),
(Argamon et
al., 2007, p. 804).
CONJUNCTION (),
(.. , , ..).

.
CONJUNCTION (Argamon et al., 2007, p. 805):
(elaboration),
.. .
(extension),
.
(enhancement),
.

, ..
, ...

(.. ).

. ..
CONJUNCTION ,
. ,
( ),
.
WordNet87 (G. A. Miller, 1995)
. WordNet ,
, , ,
(synsets), .
WordNet ,
,
(.. , ..). Tanguy et al. (2011) WordNet
email Enron
(. . 47). Clark and Hannon (2007) (
Dickens) , ,
,
, WordNet.

.
Hedegaard and Simonsen (2011)
, FrameNet88 (Baker, Fillmore, & Lowe, 1998).
,
. (frame
semantics), (semantic frames),
, ,
. , ,
, (), (),
()
87
WordNet : http://wordnet.princeton.edu/wordnet/
88
FrameNet : https://framenet.icsi.berkeley.edu/fndrupal/
(_). FrameNet , (frame)
_ (frame elements - FE) ,
, , _. ,
, , (Lexical Units - LU)
_.

.
, FrameNet
89. Hedegaard and Simonsen (2011)
( ) .
2 .
,
19 .
400 , 400 - =3, 4, 5, 400
200
200 . ,
,
. ,
,
400 .

.
, ,
(distributional semantics).
.
, ,
, ,
..
(Latent Semantic Analysis LSA) (Landauer, McNamara, Dennis, & Kintsch, 2007),
, .
,
. (
) . ,
~ (term ~ document matrix)
(Singular Value Decomposition),
(),

.
, (..
Cosine, ..), ,
. Soboroff, Nicholas, Kukla, and Ebert (1997)
- ,
. (
),
. ( )
( ) .

. McCarthy, Lewis, Dufty, and McNamara (2006)
(Coh-Metrix)90, 200
, . ,
(Kipling, Wodehouse, Dickens)
89
FrameNet Gotsoulia et al. (2007).
90
:
http://cohmetrix.memphis.edu/cohmetrixpr/index.html
. ,
, , ,
.
3.7.

. ,

(.. 140
Twitter SMS). ,

( , email ..).

. (structural features)
. de Vel et al. (2001)
emails HTML
email. Corney, de Vel, Anderson, and Mohay (2002) Teng, Lai, Ma, and Li (2004)
email, , email
, ,
, HTML,
Zheng et al. (2006)
(indentation). , Iqbal et al. (2010) email :
) email , ) tab
) ) /
) ) URL
.

(web forums). Abbasi and Chen (2005)
, ,
( Ku Klux Klan)
(, ).
,
.
91.
15 ,
120. ,
,
.

.
Abbasi, A., & Chen, H. (2005). Applying Authorship Analysis to Extremist-Group Web Forum Messages.
IEEE Intelligent Systems, 20(5), 67-75. doi: 10.1109/mis.2005.81
91

. Zheng, Qin, Huang, and Chen (2003)
,
.
Antić, G., Stadlober, E., Grzybek, P., & Kelih, E. (2006). Word length and frequency distributions in different
text genres. In M. Spiliopoulou, R. Kruse, A. Nürnberger, C. Borgelt & W. Gaul (Eds.), From Data
and Information Analysis to Knowledge Engineering. Proceedings of the 29th Annual Conference of
the Gesellschaft für Klassifikation e.V., 9-11 March 2005, University of Magdeburg (pp. 310-317).
Heidelberg/Berlin: Springer.
Argamon, S., & Juola, P. (2011). Overview of the International Authorship Identification Competition at
PAN-2011 Proceedings of PAN 2011 Lab, Uncovering Plagiarism, Authorship, and Social Software
Misuse held in conjunction with the CLEF 2011 Conference on Multilingual and Multimodal
Information Access Evaluation, 19-22 September 2011, Amsterdam.
Argamon, S., & Levitan, S. (2005). Measuring the usefulness of function words for authorship attribution
Proceedings of the 2005 ACH/ALLC conference, 15-18 June 2005, Victoria, BC, Canada.
Argamon, S., Šarić, M., & Stein, S. S. (2003). Style mining of electronic messages for multiple authorship
discrimination: first results. Proceedings of the 9th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (pp. 475-480). Washington, DC: Association for Computing
Machinery.
Argamon, S., Whitelaw, C., Chase, P., Hota, S. R., Garg, N., & Levitan, S. (2007). Stylistic text classification
using functional lexical features. Journal of American Society for Information Science and
Technology, 58(6), 802-822. doi: 10.1002/asi.v58:6
Avent, J., & Austermann, S. (2003). Reciprocal scaffolding: A context for communication treatment in
aphasia. Aphasiology, 17(4), 397-404. doi: 10.1080/02687030244000743
Baayen, H. R. (2001). Word frequency distributions. Dordrecht: Kluwer Academic Publishers.
Baayen, H. R., van Halteren, H., Neijt, A., & Tweedie, F. J. (2002). An experiment in authorship attribution.
Paper presented at the Actes des 6 Journées internationales d'analyse statistiques des données textuelles
(JADT 02).
Baayen, H. R., van Halteren, H., & Tweedie, F. J. (1996). Outside the cave of shadows: using syntactic
annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121-132. doi:
10.1093/llc/11.3.121
Bailey, R. W. (1979). Authorship attribution in a forensic context. In D. E. Alger, F. E. Knowles & J. Smith
(Eds.), Advances in Computer-aided Literary and Linguistic Research. Proceedings of the Fifth
International Symposium on Computers in Literary and Linguistic Research, 3-7 April 1978,
Birmingham, UK (pp. 1-15). Birmingham: AMLC University of Aston.
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet Project Proceedings of the 17th
International Conference on Computational Linguistics (ACL '98), 10-14 August 1998, Montreal,
Quebec, Canada (Vol. 1, pp. 86-90). Stroudsburg, PA: Association for Computational Linguistics.
Beker, H., & Piper, F. (1982). Cipher Systems: The Protection of Communications. New York: John Wiley &
Sons.
Bell, E. J. L., Berridge, D., & Rayson, P. (2009). Measuring style with the authorship ratio. An invariant
metric of lexical similarity. In M. Mahlberg, V. González-Díaz & C. Smith (Eds.), Proceedings of the
Corpus Linguistics Conference (CL2009), 20-23 July 2009, Liverpool, UK.
Bennett, W. R. (1976). Scientific and engineering problem-solving with the computer. Englewood Cliffs, N.J.:
Prentice Hall.
Besnier, N. (1988). The Linguistic Relationships of Spoken and Written Nukulaelae Registers. Language,
64(4), 707-736.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Binongo, J. N. G. (1994). Joaquin's Joaquinesquerie, Joaquinesquerie's Joaquin: A statistical expression of a
Filipino writer's style. Literary and Linguistic Computing, 9(4), 267-279. doi: 10.1093/llc/9.4.267
Brunet, É. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Genève: Slatkine.
Bucks, R. S., Singh, S., Cuerden, J. M., & Wilcock, G. K. (2000). Analysis of spontaneous, conversational
speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical
performance. Aphasiology, 14(1), 71-91. doi: 10.1080/026870300401603
Carroll, J. B. (1964). Language and thought. Englewood Cliffs, NJ: Prentice Hall.
Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. Forensic
Linguistics, 8(1), 1-65.
Chen, X., Hao, P., Chandramouli, R., & Subbalakshmi, K. (2011). Authorship Similarity Detection from
Email Messages. In P. Perner (Ed.), Machine Learning and Data Mining in Pattern Recognition (Vol.
6871, pp. 375-386). Berlin / Heidelberg: Springer.
Choi, S. W. (2000). Some Statistical Properties and Zipf's Law in Korean Text Corpus. Journal of
Quantitative Linguistics, 7(1), 19-30. doi: 10.1076/0929-6174(200004)07:01;1-3;ft019
Clark, J. H., & Hannon, C. J. (2007). An Algorithm for Identifying Authors Using Synonyms. In M. Farach-
Colton, J. Favela, G. Vargas, V. Mittal & G. Navarro (Eds.), Proceedings of the Eighth Mexican
International Conference on Computer Science, 24-28 September 2007, Morelia, Michoacan, Mexico
(pp. 99-104). Los Alamos, CA: IEEE Computer Society.
Corney, M. W., de Vel, O., Anderson, A., & Mohay, G. (2002). Gender-preferential text mining of e-mail
discourse Proceedings of the 18th Annual Computer Security Applications Conference, 9-13
December 2002, Las Vegas, Nevada (pp. 282-289).
Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token
Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94-100. doi:
10.1080/09296171003643098
Coyotl-Morales, R., Villaseñor-Pineda, L., Montes-y-Gómez, M., & Rosso, P. (2006). Authorship Attribution
Using Word Sequences. In J. Martínez-Trinidad, J. Carrasco Ochoa & J. Kittler (Eds.), Progress in
Pattern Recognition, Image Analysis and Applications (Vol. 4225, pp. 844-853): Springer Berlin /
Heidelberg.
Dalkılıç, G., & Çebi, Y. (2005). Zipf's Law and Mandelbrot's Constants for Turkish Language Using Turkish
Corpus (TurCo). In T. Yakhno (Ed.), Advances in Information Systems (Vol. 3261, pp. 273-282).
Berlin / Heidelberg: Springer.
Damerau, F. J. (1975). The use of function word frequencies as indicators of style. Computers and the
Humanities, 9(6), 271-280. doi: 10.1007/bf02396290
de Vel, O., Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for author identification
forensics. SIGMOD Record, 30(4), 55-64. doi: 10.1145/604264.604272
Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship Attribution with Support Vector
Machines. Applied Intelligence, 19(1), 109-123.
Ding, Y., Meng, X., Chai, G., & Tang, Y. (2011). User Identification for Instant Messages. In B.-L. Lu, L.
Zhang & J. Kwok (Eds.), Neural Information Processing (Vol. 7064, pp. 113-120). Berlin /
Heidelberg: Springer.
Dugast, D. (1979). Vocabulaire et Stylistique. I: Théâtre et Dialogue. Slatkine-Champion, Geneva: Travaux
de Linguistique Quantitative.
Ellegård, A. (1962). A Statistical Method for Determining Authorship: The Junius Letters, 1769-1772: Acta
Universitatis Gothoburgensis.
Evert, S., & Baroni, M. (2007). zipfR: Word frequency distributions in R Proceedings of the 45th Annual
Meeting of the Association for Computational Linguistics, Posters and Demonstrations Session, 23-30
June 2007, Prague, Czech Republic (pp. 29-32): ACL.
Farnham, W. E. (1916). Colloquial Contractions in Beaumont, Fletcher, Massinger, and Shakespeare as a Test
of Authorship. PMLA, 31(2), 326-358.
Forstall, C., & Scheirer, W. (2010). Features from frequency: Authorship and stylistic analysis using repetitive
sound. Journal of the Chicago Colloquium on Digital Humanities and Computer Science, 1(2), 1-23.
https://letterpress.uchicago.edu/index.php/jdhcs/article/view/56/67
Forsyth, R. S., & Holmes, D. I. (1996). Feature-finding for text classification. Literary and Linguistic
Computing, 11(4), 163-174.
Forsyth, R. S., Holmes, D. I., & Tse, E. K. (1999). Cicero, Sigonio and Burrows: Investigating the
authenticity of the "Consolatio". Literary and Linguistic Computing, 14(3), 1-26.
Frantzi, K. T. (2006). Computing stylistics and corpus linguistics for author identification. The International
Journal of the Interdisciplinary Social Sciences, 1(3), 69-78.
Fucks, W., & Lauter, J. (1965). Exaktwissenschaftliche Musikanalyse. Cologne: Westdeutscher Verlag.
Gamon, M. (2004). Linguistic correlates of style: authorship classification with deep linguistic analysis
features Proceedings of the 20th International Conference on Computational Linguistics (COLING
'04), 23-27 August 2004, Geneva, Switzerland (pp. 611-617). Stroudsburg, PA, USA: Association for
Computational Linguistics.
García, A. M., & Martín, J. C. (2007). Function Words in Authorship Attribution Studies. Literary and
Linguistic Computing, 22(1), 49-66. doi: 10.1093/llc/fql048
Gehrke, G. T. (2008). Authorship discovery in blogs using Bayesian classification with corrective scaling.
(Master's), Naval Postgraduate School, Monterey, CA.
Glover, A., & Hirst, G. (1996). Detecting stylistic inconsistencies in collaborative writing. In M. Sharples &
T. van der Geest (Eds.), The new writing environment: Writers at work in a world of technology (pp.
147-168). London: Springer-Verlag.
Gotsoulia, V., Desipri, E., Koutsombogera, M., Prokopidis, P., Papageorgiou, H., & Markopoulos, G. (2007).
Towards a frame semantics lexical resource for Greek. In K. De Smedt, J. Hajič & S. Kübler (Eds.),
Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (Vol. 1, pp.
55-59): Northern European Association for Language Technology (NEALT).
Grieve, J. W. (2007). Quantitative authorship attribution: an evaluation of techniques. Literary and Linguistic
Computing, 22(3), 251-270.
Grotjahn, R. (1982). Ein statistisches Modell für die Verteilung der Wortlänge. Zeitschrift für
Sprachwissenschaft, 1, 44-75.
Grzybek, P. (2007). History and methodology of word length studies: The state of the art. In P. Grzybek (Ed.),
Contributions to the science of text and language: Word length studies and related issues (pp. 15-90).
Dordrecht: Springer.
Guiraud, P. (1954). Les caractères statistiques du vocabulaire. Paris: Presses Universitaires de France.
Guzmán-Cabrera, R., Montes-y-Gómez, M., Rosso, P., & Villaseñor-Pineda, L. (2008). A Web-Based
Self-training Approach for Authorship Attribution. Proceedings of the 6th international conference on
Advances in Natural Language Processing (GoTAL '08), Gothenburg, Sweden (pp. 160-168). Berlin,
Heidelberg: Springer-Verlag.
Halliday, M. A. K. (1994). Introduction to functional grammar (2nd ed.). London: Longman.
Harris Wright, H., Silverman, S., & Newhoff, M. (2003). Measures of lexical diversity in aphasia.
Aphasiology, 17(5), 443-452. doi: 10.1080/02687030344000166
Hatzigeorgiu, N., Gavrilidou, M., Piperdis, S., Carayannis, G., Papakostopoulou, A., Athanasia, S., . . . Iason,
D. (2000). Design and implementation of the online ILSP Greek Corpus. In M. Gavrilidou, G.
Carayannis, S. Markantonatou, S. Piperdis & G. Stainhaouer (Eds.), Proceedings of the second
International Conference on Language Resources and Evaluation (LREC 2000) (Vol. III, pp. 1737-
1742). Athens, Greece: ELRA.
Hatzigeorgiu, N., Mikros, G. K., & Carayannis, G. (2001). Word length, word frequencies and Zipf's law in
the Greek language Journal of Quantitative Linguistics, 8(3), 175-185. doi: 10.1076/jqul.8.3.175.4096
Hedegaard, S., & Simonsen, J. G. (2011). Lost in translation: authorship attribution using frame semantics
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies: short papers, 19-24 June 2011, Portland, Oregon, USA (Vol. 2, pp. 65-70).
Stroudsburg, PA, USA: Association for Computational Linguistics.
Herdan, G. (1955). A new derivation and interpretation of Yule's Characteristic K. Zeitschrift für
Angewandte Mathematik und Physik (ZAMP), 6(4), 332-339. doi: 10.1007/bf01587632
Herdan, G. (1960). Type-Token mathematics. A textbook of mathematical linguistics. The Hague: Mouton.
Herdan, G. (1964). Quantitative Linguistics. London: Butterworth.
Herdan, G. (1966). The advanced theory of language as choice and chance. Berlin / Heidelberg: Springer -
Verlag.
Hess, C. W., Sefton, K. M., & Landry, R. G. (1986). Sample Size and Type-Token Ratios for Oral Language
of Preschool Children. J Speech Hear Res, 29(1), 129-134.
Hilton, O. (1982). Scientific examination of questioned documents. New York: Elsevier.
Hirst, G., & Feiguina, O. g. (2007). Bigrams of Syntactic Labels for Authorship Discrimination of Short
Texts. Literary and Linguistic Computing, 22(4), 405-417. doi: 10.1093/llc/fqm023
Holmes, D. I. (1985). The analysis of literary style--A review. Journal of the Royal Statistical Society. Series
A (General), 148(4), 328-341.
Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic
Computing Bulletin, 7(2), 172-177.
Hoover, D. L. (2002). Frequent Word Sequences and Statistical Stylistics. Literary and Linguistic Computing,
17(2), 157-180. doi: 10.1093/llc/17.2.157
Hoover, D. L. (2003a). Another Perspective on Vocabulary Richness. Computers and the Humanities, 37(2),
151-178. doi: 10.1023/a:1022673822140
Hoover, D. L. (2003b). Frequent Collocations and Authorial Style. Literary and Linguistic Computing, 18(3),
261-286. doi: 10.1093/llc/18.3.261
Houvardas, J., & Stamatatos, E. (2006). N-Gram Feature Selection for Authorship Identification. In J. Euzenat
& J. Domingue (Eds.), Artificial Intelligence: Methodology, Systems, and Applications (Vol. 4183,
pp. 77-86). Heidelberg: Springer
Huberman, B. A., Pirolli, P. L. T., Pitkow, J. E., & Lukose, R. M. (1998). Strong Regularities in World Wide
Web Surfing. Science, 280(5360), 95-97. doi: 10.1126/science.280.5360.95
Hymes, D. H. (1960). Phonological aspect of style: Some English sonnets. In T. A. Sebeok (Ed.), Style in
Language (pp. 109-132). Cambridge, MA: MIT Press.
Iqbal, F., Khan, L. A., Fung, B. C. M., & Debbabi, M. (2010). E-mail Authorship Verification for Forensic
Investigation Proceedings of the 2010 ACM Symposium on Applied Computing (SAC '10), March 22-
26, 2010, Sierre, Switzerland (pp. 1591-1598). New York: ACM.
Jarvis, S. (2002). Short texts, best-fitting curves and new measures of lexical diversity. Language Testing,
19(1), 57-84. doi: 10.1191/0265532202lt220oa
Jayaram, B. D., & Vidya, M. N. (2008). Zipf's Law for Indian Languages. Journal of Quantitative Linguistics,
15(4), 293-317. doi: 10.1080/09296170802326640
Jianwen, S., Zongkai, Y., Pei, W., & Sanya, L. (2010). Variable Length Character N-Gram Approach for
Online Writeprint Identification Proceedings of the 2010 International Conference on Multimedia
Information Networking and Security (MINES 2010), 4-6 November 2010, Nanjing, China (pp. 486-
490).
Juola, P. (2004). Ad-hoc authorship attribution competition. Proceedings 2004 Joint International Conference
of the Association for Literary and Linguistic Computing and the Association for Computers and the
Humanities (ALLC/ACH 2004), 11-16 June 2004, Gothenburg, Sweden.
Juola, P., Sofko, J., & Brennan, P. (2006). A prototype for authorship attribution studies. Literary and
Linguistic Computing, 21(2), 169-178.
Kaster, A., Siersdorfer, S., & Weikum, G. (2005). Combining text and linguistic document representations for
authorship attribution Proceedings of the SIGIR 2005 Workshop "Stylistic Analysis Of Text For
Information Access", 19 August 2005, Salvador, Bahia, Brazil (pp. 27-35).
Kelih, E., Antić, G., Grzybek, P., & Stadlober, E. (2005). Classification of author and/or genre? The impact of
word length. In C. Weihs & W. Gaul (Eds.), Classification - The Ubiquitous Challenge. Proceedings
of the 28th Annual Conference of the Gesellschaft für Klassifikation e. V., 9-11 March 2004,
University of Dortmund (pp. 498-505). Berlin: Springer.
Kelih, E., Grzybek, P., Antić, G., & Stadlober, E. (2006). Quantitative text typology: The impact of sentence
length. In M. Spiliopoulou, R. Kruse, A. Nürnberger, C. Borgelt & W. Gaul (Eds.), From Data and
Information Analysis to Knowledge Engineering. Proceedings of the 29th Annual Conference of the
Gesellschaft für Klassifikation e.V., 9-11 March 2005, University of Magdeburg (pp. 382-389).
Heidelberg/Berlin: Springer.
Kim, S., Kim, H., Weninger, T., Han, J., & Hyun, K. D. (2011). Authorship classification: a discriminative
syntactic tree mining approach Proceedings of the 34th International ACM SIGIR Conference on
Research and Development in Information Retrieval (pp. 455-464). Beijing, China: ACM.
Kjell, B. (1994). Authorship determination using letter pair frequency features with neural network classifiers.
Literary and Linguistic Computing, 9(2), 119-124. doi: 10.1093/llc/9.2.119
Kjell, B., Woods, W. A., & Frieder, O. (1993). Discrimination of authorship using visualization. Information
Processing & Management, 30(1), 141-150. doi: 10.1016/0306-4573(94)90029-9
Klimt, B., & Yang, Y. (2004). Introducing the Enron Corpus Proceedings of the First Conference on Email
and Anti-Spam (CEAS), 30-31 July 2004, Mountain View, CA
Koppel, M., Akiva, N., & Dagan, I. (2003). A corpus-independent feature set for style-based text
categorization Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis
and Synthesis, 10 August 2003, Acapulco, Mexico (pp. 1263-1276).
Koppel, M., Argamon, S., & Shimoni, A. R. (2002). Automatically categorizing written texts by author
gender. Literary and Linguistic Computing, 17(4), 401-412.
Koppel, M., & Schler, J. (2003). Exploiting stylistic idiosyncrasies for authorship attribution Proceedings of
IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, 10 August 2003,
Acapulco, Mexico (pp. 69-72).
Koppel, M., Schler, J., & Messeri, E. (2008). Authorship attribution in law enforcement scenarios. In C. S.
Gal, P. B. Kantor & B. Shapira (Eds.), Security Informatics and Terrorism: Patrolling the Web -
Social and Technical Problems of Detecting and Controlling Terrorists' Use of the World Wide Web
(pp. 111-119). Amsterdam: IOS Press.
Koppel, M., Schler, J., & Zigdon, K. (2005). Determining an author's native language by mining a text for
errors Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in
Data Mining (KDD '05), Chicago, Illinois, USA (pp. 624-628). New York: ACM.
Kukushkina, O. V., Polikarpov, A. A., & Khmelev, D. V. (2001). Using Literal and Grammatical Statistics for
Authorship Attribution. Problems of Information Transmission, 37(2), 172-184. doi:
10.1023/a:1010478226705
Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2007). Handbook of Latent Semantic
Analysis. Mahwah, NJ: Lawrence Erlbaum Associates.
Le Quan Ha, Stewart, D. W., Hanna, P., & Smith, F. J. (2006). Zipf and Type-Token rules for the English,
Spanish, Irish and Latin languages. Web Journal of Formal, Computational and Cognitive Linguistics,
8. http://fccl.ksu.ru/issue8/ha_fccl_zipf.pdf
Ledger, G., & Merriam, T. (1994). Shakespeare, Fletcher, and the Two Noble Kinsmen. Literary and
Linguistic Computing, 9, 235-248.
Luyckx, K. (2010). Scalability issues in authorship attribution Brussels: Uitgeverij UPA University Press
Antwerp.
Luyckx, K., & Daelemans, W. (2005). Shallow Text Analysis and Machine Learning for Authorship
Attribution. Paper presented at the Proceedings of the Fifteenth Meeting of Computational Linguistics
in the Netherlands (CLIN 2004).
Luyckx, K., & Daelemans, W. (2011). The effect of author set size and data size in authorship attribution.
Literary and Linguistic Computing, 26(1), 35-55. doi: 10.1093/llc/fqq013
Maas, H.-D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes.
Zeitschrift für Literaturwissenschaft und Linguistik, 8, 73-79.
MacWhinney, B. (2000a). The CHILDES Project: Tools for analyzing talk. The database (3rd ed. Vol. 2).
Mahwah, NJ: Erlbaum.
MacWhinney, B. (2000b). The CHILDES Project: Tools for analyzing talk. Transcription format and
programs (3rd ed. Vol. 1). Mahwah, NJ: Erlbaum.
Madigan, D., Genkin, A., Lewis, D., Argamon, S., Fradkin, D., & Ye, L. (2005). Author identification on the
large scale Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of
North America: Theme: Clustering and Classification. Washington University School of Medicine,
St. Louis, Missouri: Classification Society of North America.
Malvern, D., & Richards, B. J. (1997). A new measure of lexical diversity. In A. Ryan & A. Wray (Eds.),
Evolving models of language (pp. 58-71). Clevedon: Multilingual Matters.
Malvern, D., Richards, B. J., Chipere, N., & Durán, P. (2004). Lexical Diversity and Language Development:
Quantification and Assessment. Houndmills, Basingstoke, Hampshire: Palgrave Macmillan.
Manaris, B. Z., Pellicoro, L., Pothering, G., & Hodges, H. (2006). Investigating Esperanto's statistical
proportions relative to other languages using neural networks and Zipf's law. In V. Devedzic (Ed.),
Proceedings of the 24th IASTED International Conference on Artificial Intelligence and Applications,
13-16 February 2006, Innsbruck, Austria (pp. 102-108): ACTA Press.
Mandelbrot, B. B. (1953). An informational theory of the statistical structure of language. In W. Jackson
(Ed.), Communication Theory, the Second Language Symposium (pp. 486-502). London: Butterworth.
Manschreck, T. C., Maher, B. A., & Ader, D. N. (1981). Formal thought disorder, the type-token ratio and
disturbed voluntary motor movement in schizophrenia. The British Journal of Psychiatry, 139(1), 7-
15. doi: 10.1192/bjp.139.1.7
Markopoulos, G., Mikros, G. K., & Broussalis, G. (2011). Stylometric profiling of the Greek legal corpus
Proceedings of the 10th International Conference of Greek Linguistics, 1-4 September, Komotini,
Greece: Democritus University of Thrace.
Markov, A. A. (1913). An Example of Statistical Analysis of the Text of "Evgenii Onegin" Illustrating the
Linking of Events into a Chain. Bulletin de l'Académie Impériale des Sciences de St. Petersburg,
6(7), 153-162.
Mascol, C. (1888a). Curves of Pauline and Pseudo-Pauline Style I. Unitarian Review, 30, 452-460.
Mascol, C. (1888b). Curves of Pauline and Pseudo-Pauline Style II. Unitarian Review, 30, 539-546.
McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the
potential of the measure of textual, lexical diversity (MTLD). The University of Memphis.
McCarthy, P. M., & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4),
459-488. doi: 10.1177/0265532207080767
McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated
approaches to lexical diversity assessment. Behavior research methods, 42(2), 381-392. doi:
10.3758/brm.42.2.381
McCarthy, P. M., Lewis, G. A., Dufty, D. F., & McNamara, D. S. (2006). Analyzing writing styles with Coh-
Metrix. In G. Sutcliffe & R. Goebel (Eds.), Proceedings of the Nineteenth International Florida
Artificial Intelligence Research Society Conference (FLAIRS), 11-13 May 2006, Melbourne Beach,
Florida, USA (pp. 764-769).
McCarthy, P. M., Myers, J. C., Briner, S. W., Graesser, A. C., & McNamara, D. S. (2009). A psychological
and computational study of sub-sentential genre recognition. Journal for Language Technology and
Computational Linguistics, 24(1), 23-55.
McEvoy, S., & Dodd, B. (1992). The communication abilities of 2- to 4-year-old twins. International Journal
of Language & Communication Disorders, 27(1), 73-87. doi: 10.3109/13682829209012030
Mealand, D. L. (1995). Correspondence Analysis of Luke. Literary and Linguistic Computing, 10(3), 171-
182. doi: 10.1093/llc/10.3.171
Meara, P. M. (1978). Schizophrenic symptoms in foreign language learners. UEA Papers in Linguistics, 7, 22-
49.
Merriam, T. (1988). Was hand B in Sir Thomas More Heywood's autograph? Notes and Queries, 35(4), 455-
458. doi: 10.1093/nq/35.4.455
Merriam, T. (1994). Letter frequency as a discriminator of authors. Notes and Queries, 41(4), 467-469. doi:
10.1093/nq/41.4.467
Mikros, G. K. (2006). Authorship attribution in Modern Greek newswire corpora. In O. Uzuner, S. Argamon
& J. Karlgren (Eds.), Proceedings of the SIGIR 2006 International Workshop on Directions in
Computational Analysis of Stylistics in Text Retrieval (pp. 43-47). Seattle, Washington, USA: ACM.
Mikros, G. K. (2007). Stylometric experiments in Modern Greek: Investigating authorship in homogeneous
newswire texts. In R. Köhler, G. Altmann & P. Grzybek (Eds.), Exact methods in the study of
language and text (pp. 445-456). Berlin / New York: Mouton de Gruyter.
Mikros, G. K. (2009). Content words in authorship attribution: An evaluation of stylometric features in a
literary corpus. In R. Köhler (Ed.), Issues in Quantitative Linguistics (pp. 61-75). Lüdenscheid: RAM-
Verlag.
Mikros, G. K., & Argiri, E. K. (2007). Investigating topic influence in authorship attribution. In B. Stein, M.
Koppel & E. Stamatatos (Eds.), Proceedings of the SIGIR 2007 International Workshop on
Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (Vol. 276, pp. 29-35).
Amsterdam, Netherlands: CEUR.
Mikros, G. K., Hatzigeorgiu, N., & Carayannis, G. (2005). Basic quantitative characteristics of the Modern
Greek Language using the Hellenic National Corpus. Journal of Quantitative Linguistics, 12(2-3),
167-184.
Milic, L. T. (1967). A quantitative approach to the style of Jonathan Swift. The Hague: Mouton & Co.,
Publishers.
Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39-41.
Miller, J. F. (1982). Quantifying productive language disorders. In J. F. Miller (Ed.), Research on child
language disorders: A decade of progress (pp. 211-220). Austin TX: Pro-Ed.
Miranda-García, A., & Calle-Martín, J. (2005). Yule's Characteristic K Revisited. Language Resources and
Evaluation, 39(4), 287-294. doi: 10.1007/s10579-005-8622-8
Mohtasseb, H., & Ahmed, A. (2009). More blogging features for author identification. Proceedings of the
International Conference on Knowledge Discovery (ICKD'09), 6-8 June 2009, Manila, Philippines
(pp. 534-539).
Mohtasseb, H., & Ahmed, A. (2010). The Affects of Demographics Differentiations on Authorship
Identification. In S.-I. Ao & L. Gelman (Eds.), Electronic Engineering and Computing Technology
(Vol. 60, pp. 409-417). Heidelberg: Springer.
Mohtasseb, H., & Ahmed, A. (2011). Two-layered Blogger identification model integrating profile and
instance-based methods. Knowledge and Information Systems, 1-21. doi: 10.1007/s10115-011-0398-0
Morton, A. Q. (1978). Literary detection: How to prove authorship and fraud in literature and documents.
New York: Scribners.
N-gram. (31 December 2011). Wikipedia, The Free Encyclopedia. Retrieved 20 January 2012, from
http://en.wikipedia.org/w/index.php?title=N-gram&oldid=468841176
Nawrot, M. (2011). Automatic Author Attribution for Short Text Documents. In Z. Vetulani (Ed.), Human
Language Technology. Challenges for Computer Science and Linguistics (Vol. 6562, pp. 468-477).
Nowson, S., Oberlander, J., & Gill, A. J. (2005). Weblogs, genres, and individual differences. In B. G. Bara,
L. W. Barsalou & M. Bucciarelli (Eds.), Proceedings of The 27th Annual Conference of the Cognitive
Science Society. Stresa, Italy (pp. 1666-1671). Hillsdale, NJ: Erlbaum.
O'Brien, D. P., & Darnell, A. C. (1982). Authorship puzzles in the history of economics. A statistical
approach. London: Macmillan.
O'Donnell, B. (1966). Stephen Crane's The O'Ruddy: a problem in authorship discrimination. In J. Leed (Ed.),
The computer and literary style (pp. 107-115). Kent, Ohio: Kent State University Press.
Oakes, M. P. (2007). Ord's criterion with word length spectra. In P. Grzybek & R. Köhler (Eds.), Exact
methods in the study of language and text (pp. 509-519). Berlin: Mouton de Gruyter.
Olsson, J. (2008). Forensic Linguistics (2nd ed.). London: Continuum International Publishing Group.
Ord, J. K. (1972). Families of frequency distributions. London: Griffin.
Orebaugh, A., & Allnutt, D. J. (2010). Data Mining Instant Messaging Communications to Perform Author
Identification for Cybercrime Investigations. In S. Goel, O. Akan, P. Bellavista, J. Cao, F. Dressler,
D. Ferrari, M. Gerla, H. Kobayashi, S. Palazzo, S. Sahni, X. Shen, M. Stan, J. Xiaohua, A. Zomaya &
G. Coulson (Eds.), Digital Forensics and Cyber Crime (Vol. 31, pp. 99-110). Berlin / Heidelberg:
Springer.
Orlov, J. K. (1982). Ein Modell der Häufigkeitsstruktur des Vokabulars. In H. Guiter & M. V. Arapov (Eds.),
Studies on Zipf's Law (pp. 154-233). Bochum: Brockmeyer.
Osborn, A. S. (1929). Questioned documents (2nd ed.). Albany, NY: Boyd Printing Co.
Paisley, W. J. (1964). Identifying the unknown communicator in painting, literature and music: The
significance of minor encoding habits. Journal of Communication, 14(4), 219-237. doi:
10.1111/j.1460-2466.1964.tb02925.x
Paisley, W. J. (1969). Studying 'style' as deviations from the encoding norms. In G. Gerbner, O. R. Holsti, K.
Krippendorff, W. J. Paisley & P. J. Stone (Eds.), The analysis of communication content:
Developments in scientific theories and computer techniques (pp. 134-135). New York: John Wiley &
Sons.
Park, T. H., Jiexun, L., Zhao, H., & Chau, M. (2009). Analyzing writing styles of bloggers with different
opinions Proceedings of the 19th Annual Workshop on Information Technologies and Systems
(WITS'09), 14-15 December 2009, Phoenix, Arizona, USA (pp. 151-156).
Patton, J., & Can, F. (2004). A Stylometric Analysis of Yaşar Kemal's İnce Memed Tetralogy.
Computers and the Humanities, 38(4), 457-467. doi: 10.1007/s10579-004-1906-6
Pearl, L. S., & Steyvers, M. (forthcoming). Detecting authorship deception: A supervised machine learning
approach using author writeprints. Literary and Linguistic Computing.
Peng, F., Schuurmans, D., & Wang, S. (2004). Augmenting Naive Bayes Classifiers with Statistical Language
Models. Journal of Information Retrieval, 7(3-4), 317-345. doi:
10.1023/B:INRT.0000011209.19643.e2
Pennebaker, J. W., & Francis, M. E. (2001). Linguistic Inquiry and Word Count: LIWC2001. Mahwah, NJ:
Erlbaum Publishers.
Popescu, I.-I., Altmann, G., Grzybek, P., Jayaram, B. D., Köhler, R., Krupa, V., . . . Vidya, M. N. (2009).
Word frequency studies. Berlin: Mouton de Gruyter.
Raghavan, S., Kovashka, A., & Mooney, R. (2010). Authorship Attribution Using Probabilistic Context-Free
Grammars Proceedings of the ACL 2010 Conference Short Papers, 11-16 July 2010, Uppsala,
Sweden (pp. 38-42): Association for Computational Linguistics.
Reicher, T., Krišto, I., Belša, I., & Šilić, A. (2010). Automatic Authorship Attribution for Texts in Croatian
Language Using Combinations of Features. In R. Setchi, I. Jordanov, R. Howlett & L. Jain (Eds.),
Knowledge-Based and Intelligent Information and Engineering Systems (Vol. 6277, pp. 21-30).
Schler, J., Koppel, M., Argamon, S., & Pennebaker, J. (2006). Effects of age and gender on blogging
Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing
Weblogs, 27-29 March 2006, Stanford, California (pp. 199-205).
Scott, M. (2008). WordSmith Tools (Version 5). Liverpool: Lexical Analysis Software.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR),
34(1), 1-47. doi: 10.1145/505282.505283
Sichel, H. S. (1974). On a Distribution Representing Sentence-Length in Written Prose. Journal of the Royal
Statistical Society. Series A (General), 137(1), 25-34.
Sichel, H. S. (1975). On a Distribution Law for Word Frequencies. Journal of the American Statistical
Association, 70(351), 542-547.
Silagadze, Z. K. (1997). Citations and the Zipf-Mandelbrot Law. Complex Systems, 11, 487-499.
Silverman, S., & Ratner, N. B. (2002). Measuring lexical diversity in children who stutter: application of
vocd. Journal of Fluency Disorders, 27(4), 289-304. doi: 10.1016/s0094-730x(02)00162-6
Simpson, E. H. (1949). Measurement of Diversity. Nature, 163, 688-688. doi: 10.1038/163688a0
Soboroff, I. M., Nicholas, C. K., Kukla, J. M., & Ebert, D. S. (1997). Visualizing document authorship using
n-grams and latent semantic indexing Proceedings of the 1997 Workshop on New paradigms in
Information Visualization and Manipulation (NPIV '97), 10-14 November 1997, Las Vegas, Nevada
(pp. 43-48). New York: ACM.
Somers, H. H. (1966). Statistical methods in literary analysis. In J. Leed (Ed.), The Computer and Literary
Style (pp. 128-140). Kent, Ohio: Kent State University Press.
Sousa-Silva, R., Sarmento, L., Grant, T., Oliveira, E., & Maia, B. (2010). Comparing Sentence-Level Features
for Authorship Analysis in Portuguese. In T. Pardo, A. Branco, A. Klautau, R. Vieira & V. de Lima
(Eds.), Computational Processing of the Portuguese Language (Vol. 6001, pp. 51-54). Berlin /
Sousa Silva, R., Laboreiro, G., Sarmento, L., Grant, T., Oliveira, E., & Maia, B. (2011). twazn me!!! ;(
Automatic Authorship Analysis of Micro-Blogging Messages. In R. Muñoz, A. Montoyo & E. Métais
(Eds.), Natural Language Processing and Information Systems (Vol. 6716, pp. 161-168). Berlin /
Stadlober, E., & Djuzelic, M. (2007). Multivariate statistical methods in quantitative text analyses. In P.
Grzybek (Ed.), Contributions to the science of text and language: Word length studies and related
issues (pp. 259-275). Dordrecht: Springer.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society
for Information Science and Technology, 60(3), 538-556. doi: 10.1002/asi.21001
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2000). Automatic text categorization in terms of genre and
author. Computational Linguistics, 26(4), 471-495.
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2001). Computer-based authorship attribution without
lexical measures. Computers and the Humanities, 35, 193-214.
Suen, C. Y. (1986). Computational studies of the most frequent Chinese words and sounds. London: World
Scientific.
Swanson, B. (2011). Using probabilistic tree substitution grammars. (Masters), Brown University.
Tanguy, L., Urieli, A., Calderone, B., Hathout, N., & Sajous, F. (2011). A multitude of linguistically-rich
features for authorship attribution Proceedings of PAN 2011 Lab, Uncovering Plagiarism,
Authorship, and Social Software Misuse held in conjunction with the CLEF 2011 Conference on
Multilingual and Multimodal Information Access Evaluation, 19-22 September 2011, Amsterdam.
Templin, M. C. (1957). Certain Language Skills in Children. Minneapolis: University of Minnesota Press.
Teng, G.-F., Lai, M.-S., Ma, J.-B., & Li, Y. (2004). E-mail authorship mining based on SVM for computer
forensic Proceedings of the 2004 International Conference on Machine Learning and Cybernetics,
26-29 August 2004, Shanghai, China (Vol. 2, pp. 1204-1207).
Thoiron, P. (1986). Diversity index and entropy as measures of lexical richness. Computers and the
Humanities, 20(3), 197-202. doi: 10.1007/bf02404461
Thorndike, A. H. (1901). The influence of Beaumont and Fletcher on Shakespeare. New York: AMS Press.
Tuldava, J. (1977). Quantitative Relations between the Size of the Text and the Size of Vocabulary. SMIL
Quarterly, Journal of Linguistic Calculus, 4, 28-35.
Tuldava, J. (1993). The statistical structure of a text and its readability. In L. Hřebíček & G. Altmann (Eds.),
Quantitative text analysis (pp. 215-227). Trier: Wissenschaftlicher Verlag Trier.
Tuldava, J. (2005). Stylistics, author identification. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds.),
Quantitative Linguistics. An International Handbook (pp. 368-387). Berlin, New York: Walter de
Gruyter.
Tuzzi, A., Popescu, I.-I., & Altmann, G. (2009). Zipf's Laws in Italian Texts. Journal of Quantitative
Linguistics, 16(4), 354-367. doi: 10.1080/09296170903211519
Tweedie, F. J., & Baayen, H. R. (1998). How variable may a constant be? Measures of lexical richness in
perspective. Computers and the Humanities, 32(5), 323-352.
Van Bael, C., & van Halteren, H. (2007). Speaker Classification by Means of Orthographic and Broad
Phonetic Transcriptions of Speech. In C. Müller (Ed.), Speaker Classification II (Vol. 4441, pp. 293-307).
Berlin / Heidelberg: Springer.
Wake, W. C. (1957). Sentence-Length Distributions of Greek Authors. Journal of the Royal Statistical
Society. Series A (General), 120(3), 331-346.
Williams, C. B. (1940). A Note on the Statistical Analysis of Sentence-Length as a Criterion of Literary Style.
Biometrika, 31(3/4), 356-361.
Williams, C. B. (1976). Mendenhall's studies of word-length distribution in the works of Shakespeare and
Bacon. Biometrika, 62, 207-212.
Youmans, G. (1991). A New Tool for Discourse Analysis: The Vocabulary-Management Profile. Language,
67(4), 763-789.
Yule, U. G. (1939). On Sentence-Length as a Statistical Characteristic of Style in Prose: With Application to
Two Cases of Disputed Authorship. Biometrika, 30(3/4), 363-390.
Yule, U. G. (1944). The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
Zanette, D. H., & Manrubia, S. C. (1997). Role of Intermittency in Urban Development: A Model of Large-
Scale City Formation. Physical Review Letters, 79(3), 523-526.
Zhang, D., & Lee, W. S. (2006). Extracting key-substring-group features for text classification Proceedings of
the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD
2006), 20-23 August 2006, Philadelphia, USA (pp. 474-483): ACM.
Zhao, Y., & Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. In G.
Lee, A. Yamada, H. Meng & S. Myaeng (Eds.), Information Retrieval Technology (Vol. 3689, pp.
174-189). Berlin / Heidelberg: Springer.
Zhao, Y., & Zobel, J. (2007). Searching with style: authorship attribution in classic literature. In D. Gillian
(Ed.), Proceedings of the 30th Australasian Computer Science Conference (ACSC2007), 30 January -
2 February 2007, Ballarat, Victoria, Australia (Vol. 62, pp. 59-68). New York: ACM Press.
Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online
messages: Writing-style features and classification techniques. Journal of the American Society for
Information Science and Technology, 57(3), 378-393. doi: 10.1002/asi.20316
Zheng, R., Qin, Y., Huang, Z., & Chen, H. (2003). Authorship Analysis in Cybercrime Investigation. In H.
Chen, R. Miranda, D. Zeng, C. Demchak, J. Schroeder & T. Madhusudan (Eds.), Intelligence and
Security Informatics (Vol. 2665, pp. 959-959). Berlin / Heidelberg: Springer.
Zipf, G. K. (1935). The psychobiology of language. Cambridge, Mass.: M.I.T. Press.
Zipf, G. K. (1949). Human behavior and the principle of least effort: An introduction to human ecology.
Cambridge Mass.: Addison-Wesley Press.
, ., , ., & , . (2011).
. (Master),
& , .
, . . (2009). .
. (pp. 36-58).
: , .
4:

.
corpus
,
( ).

, , ,
.
4.1.
,
. ,
,

.
,
(
) 92
.
,

.
;

:
1. :

.
, , ,
( ,
, , ..).
,
, , .
2. : ,

. ,
.
, 19
,

, .
.
,
92

5, 3.

.
3. :

, .

. (classification algorithms)

,
.

,
,

. ,
,
,
. , ,
,
,

.
4. :
,
. ,

. ,
, (test data)
. , ,

,
,
. ,
,
,
,
. ,
,

.
5. :

,
. Smith (1990)
(. 4 2),
.
,
.
,
.
4.2.

.

( ) ( )
(). 600 (300 ).
( 4.1):

Texts   Words     Mean words per text   Std. dev.   Max     Min
300     209.848   699,5                 343,2       2.308   169
300     151.206   504,02                71,8        1.354   86
600     361.054   601,8                 266,4       2.308   86
4.1 .

.
( 71,8 343,2 ).
( 4.1),
:
4.1 .
,
, .

, .

.
,
, (.
2.2 2).
PERL, .
:
(Tokenizer):
, (tokens)
. (regular
expressions), ,
.
.
.
:
.

(tokenization) .
(tab delimited text)

. :
:
1 (. 3, 4.4).
:
2 (. 3, 4.4).
:
(. 3, 2.1).
:
(. 3, 5.1)
: .

( ).
MSTTR: TTR (.
3, 4.3).
Yule's K: K (. 3, 4.4).
(): (. 3,
4.4).
(Redundancy): ,
(. 3, 4.4).
(Word Length Frequency Spectrum):
1, 2, 3 14
(. 3, 2.1).
:
(. 3, 2.1).
:
(. 3, 2.2).
:
.
-:
- .

.
(Part of Speech Tagger):
TreeTagger (Schmid, 1995),
. TreeTagger
,
.
.
1. / > (7
):
1. K,
2. MSTTR,
3. ,
4. ,
5. ,
6. ,
7. .
2. / > :
o 80 .
3. / (46 ):
1. ,
2. .,()%!;,
3. ,
4. .
4. (15 ):
1. ,
2. ,
3. .
,
(language
independent). ,
(Tambouratzis et al., 2004a,
2004b)
, (Mikros, 2006).
, ,
,
.
4.3. :
(logistic regression) ,

( ) .
(..
):
1. (robust).

.
2.
. .
3. .
(Meyers et al., 2006, . 222):
(collinearity)
,
.
(specification errors),

.
(..
). Pedhazur (1997)
30 .

. ,
( ),
.
0 1. ( 4.2):

1.txt 1 6,2
2.txt 1 5,8
3.txt 0 4,9
4.txt 1 5,2
5.txt 1 7,3
6.txt 0 4,8
7.txt 0 4,1
8.txt 1 4,6
9.txt 1 4,9
10.txt 1 5,3
11.txt 0 5,1
12.txt 0 4,1
13.txt 1 5,8
14.txt 1 5,7
15.txt 0 4,4
4.2 .
,
,
(b0) (b1)
(X).
(odds) :
$$\ln\left(\frac{P(A)}{1 - P(A)}\right) = b_0 + b_1 X \tag{46}$$
:
ln(P(A) / (1 - P(A))) (.
) (
1).
b0 .
b1
.
.
, (46) :
$$\frac{P(A)}{1 - P(A)} = e^{b_0 + b_1 X} \tag{47}$$
P(A), (47) :
$$P(A) = \frac{1}{1 + e^{-(b_0 + b_1 X)}} \tag{48}$$
:
e Euler 2,718
( 1)
.

( 4.2):
4.2 .
, (
) ( ) . (Y)
(0-1) ,
.
, .
( -4 -3)
(0 0,05). ,
(.. 0 1) (0,5 0,7).

, ,
(odds). 2 ,
(P(A))
(P(B)) 1-P(A).

(probability), .
0,5 0,5,
.
P(A) / P(B) = P(A) / (1 - P(A)) = 0,5 / (1 - 0,5) = 0,5 / 0,5 = 1.
, 6 1/6 = 0,16.
6 P(6) / P(1,2,3,4,5) = P(6) / (1 - P(6)) = 0,16 / (1 - 0,16) = 0,16 / 0,84 = 0,19.
, 1 ( 4.2) 0,6 (9/15).
0,6 / (1 - 0,6) = 0,6 / 0,4 = 1,5.
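The three odds calculations above (an event with probability 0,5, one face of a die, and category 1 in the 15-text sample of Table 4.2) can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name `odds` is an assumption and does not come from the original text. Without intermediate rounding the die example gives 0,2; the value 0,19 in the text arises because 1/6 was first rounded down to 0,16.

```python
def odds(p: float) -> float:
    """Return the odds p / (1 - p) of an event that has probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in the interval [0, 1)")
    return p / (1.0 - p)


# Probabilities taken from the three worked examples in the text.
examples = {
    "event with probability 0.5": 0.5,
    "one face of a fair die": 1 / 6,
    "category 1 in Table 4.2 (9 of 15 texts)": 9 / 15,
}

for label, p in examples.items():
    print(f"{label}: p = {p:.3f}, odds = {odds(p):.3f}")
```

Odds above 1 mean the event is more likely than its complement, odds below 1 mean the opposite, and odds equal to 1 correspond to a probability of 0,5.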

, 1 .
, 1 ,
, 1
.
.
0 ( ) 1 ( ). ,
0
.
, (b0) --4,7
(b1) 1,02.
,
(maximum likelihood). , ,

,

(maximum likelihood parameter estimates).
, ,
(iterations) .
(Log Likelihood LL)
(deviance), (-2LL).

-2LL .

, -2LL
.

, -2LL 2
. -2LL ,

.
,
,
. (b1).
. ,

.
,
, 93.
(b1)
(. (46)),
.
, - (odds ratio).
,
. (b1)
(e).

, e (b1)
1,02. $e^{1,02} = 2,765$.
1 3
1 .
,
e . , ,
93

1 . ..
( )
( ) 3, :
3 .
,
$e^{1,02 \times 2} = e^{2,04} = 7,7$.
1 8 2 .
, ,

.
,

. , ,
7 1,
() 7 (48):
$$P(A) = \frac{1}{1 + e^{-(b_0 + b_1 X)}} = \frac{1}{1 + e^{-(-4{,}7 + 1{,}02 \times 7)}} = \frac{1}{1 + e^{-2{,}44}} = \frac{1}{1 + 0{,}087} = \frac{1}{1{,}087} = 0{,}92$$
, 7 1
0,92 , , 92%.
,
.
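The worked example above can be checked numerically. The sketch below plugs the coefficients reported in the text (b0 = -4,7 and b1 = 1,02) into equations (46) to (48); the function names are hypothetical and chosen only for illustration.

```python
import math

B0 = -4.7   # intercept b0 as reported in the text
B1 = 1.02   # coefficient b1 of the single predictor X, as reported in the text


def log_odds(x: float) -> float:
    """Equation (46): the logit is a linear function of the predictor."""
    return B0 + B1 * x


def probability(x: float) -> float:
    """Equation (48): convert the logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds(x)))


def odds_multiplier(delta: float) -> float:
    """Factor by which the odds change when X increases by `delta` units."""
    return math.exp(B1 * delta)


print(f"logit at X = 7:        {log_odds(7):.2f}")
print(f"probability at X = 7:  {probability(7):.2f}")
print(f"odds multiplier for a 2-unit increase: {odds_multiplier(2):.1f}")
```

Because the logit is linear in X, every one-unit increase multiplies the odds by the same factor, whereas the change in probability depends on where on the S-shaped curve the text lies.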
4.4.
4.4.1.

. 600
, 20 (30 ).

(feature selection)
,

. , ,
, .
.
(classification table) ( confusion
matrix) .
(binary classification) :

() ()
() ()
4.2 .

, () (True Positives TP)
. ,
()
(False Positives FP) . ,
(True
Negatives TN) .
.
.

. , ,
, () (False Negatives
FN) .
:
(Accuracy):
.
( (49):
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{49}$$
(Precision):
( )
(+
+ ).
( (50)):
$$\text{Precision}(A) = \frac{TP}{TP + FP} \quad \text{and} \quad \text{Precision}(B) = \frac{TN}{TN + FN} \tag{50}$$
(Recall): (
, )
(+ +
). (
(51)):
$$\text{Recall}(A) = \frac{TP}{TP + FN} \quad \text{and} \quad \text{Recall}(B) = \frac{TN}{TN + FP} \tag{51}$$
F: F (van Rijsbergen, 1979)

. , ,
, ,
. F , F1
,
( (52)):
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{52}$$

, ( 4.3):

5 60
30 15
4.3 .
65 (60+5)
45 (30+15). 60
(), , 15 ().
30 (), 5
(). 0,82 82%
((60+30)/(60+15+30+5)).
0,8 80% (60/(60+15)).
0,86 86% (30/(30+5)). 0,92 92%
(60/(60+5)). 0,67 67% (30/(30+15)). F1
0,86 (((0,8*0,92)/(0,8+0,92))*2) 0,75
(((0,86*0,67)/(0,86+0,67))*2).
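The figures of this worked example (60 true positives, 15 false positives, 30 true negatives, 5 false negatives) can be verified with a short script that applies equations (49) to (52) to both categories. It is a minimal sketch; the function and key names are assumptions introduced here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Equation (52): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Evaluate a two-class confusion table for category A (positive) and B (negative)."""
    precision_a, recall_a = tp / (tp + fp), tp / (tp + fn)
    precision_b, recall_b = tn / (tn + fn), tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision_a": precision_a,
        "recall_a": recall_a,
        "f1_a": f1_score(precision_a, recall_a),
        "precision_b": precision_b,
        "recall_b": recall_b,
        "f1_b": f1_score(precision_b, recall_b),
    }


# Counts from the worked example in the text.
metrics = binary_metrics(tp=60, fp=15, tn=30, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Reporting the measures per category, as the text does, makes visible that a single accuracy value can hide a large difference between the two classes (here recall is 0,92 for one category and 0,67 for the other).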

.
. ,
,

, . ,
, ,
. ,
.
. , 100%,
. ,
.
, . ,
, .
100%,
. ,
.

. ,
.
,
,

.
, .
(cross-validation) k
k .
k k .
k .
k 10, 10
(Witten & Frank, 2005, p. 150). 10-
(10-fold cross-validation). 10-
:
90% 10%
. 90%
(10%).

9 , 9
9 ().
, 10
.
10 10 .
10- (
4.3):
4.3 10- .

90%, 10%. ,
, .
,
.

10- ,
.
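The 10-fold procedure described in this section can be sketched with the standard library alone. The code only produces the index splits (90% training, 10% testing in each round); the classifier itself is left out, and the helper name `k_fold_indices` as well as the fixed random seed are assumptions made for this illustration.

```python
import random


def k_fold_indices(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every item lands in the test set exactly once; the remaining
    k - 1 folds form the training set of that round.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)           # randomise the order once
    folds = [indices[i::k] for i in range(k)]      # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example with the 600 texts of the corpus used in this chapter.
for round_no, (train, test) in enumerate(k_fold_indices(600, k=10), start=1):
    print(f"round {round_no}: {len(train)} training texts, {len(test)} test texts")
```

The evaluation measures are computed on the test fold of each round and then averaged over the ten rounds, so no text is ever evaluated by a model that saw it during training.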
4.4.2.

1
3 .

.
.
, ,
,
.
( 4.4):
-> F1
280 20 0,936 0,933 0,935
19 281 0,934 0,937 0,935
4.4
.

0,935 93,5%
.
,
.
( 4.5):
z p

-0,565 0,304 -1,86 0,063
MSTTR -0,327 0,34 -0,961 0,337
6,542 1,742 3,756 0,000
-2,341 1,859 -1,26 0,208
2,912 2,68 1,087 0,277
-4,265 0,499 -8,543 0,000
-2,314 0,437 -5,296 0,000
-1,971 1,033 -1,909 0,056
4.5 .
,
,
.

( 4.4):
4.4
.
,
,
,

.
(McCarthy & Jarvis, 2010; Tweedie & Baayen, 1998)
.
.
80 , (600 )
20 .
, 20 ,
( 4.6):
-> F1
267 33 0,878 0,890 0,884
37 263 0,889 0,877 0,883
4.6 20
.
0,883 88,33%,
.

,
.
( 4.7):
z p

0.247 0.163 1.521 0.128
0.221 0.169 1.306 0.192
-0.699 0.175 -4.003 0.000
0.253 0.167 1.517 0.129
-0.380 0.175 -2.172 0.030
1.146 0.195 5.876 0.000
-0.330 0.177 -1.868 0.062
1.344 0.228 5.885 0.000
-0.610 0.198 -3.078 0.002
-0.936 0.191 -4.894 0.000
-0.517 0.170 -3.050 0.002
-0.278 0.160 -1.734 0.083
-0.239 0.185 -1.293 0.196
1.112 0.202 5.511 0.000
0.145 0.178 0.815 0.415
-0.949 0.201 -4.735 0.000
0.251 0.168 1.492 0.136
1.128 0.208 5.424 0.000
0.568 0.184 3.094 0.002
-0.119 0.162 -0.730 0.465
-0.630 0.218 -2.882 0.004
4.7 .
20 , 12
.
, 4 , ,
: , , , .
4 (, , , ) ( 4.5):
4.5 4
.
,
4
, .
/
.
:
,
,
.
/
, .
.
.
( 4.8, 4.9, 4.10):
-> F1
290 10 0,980 0,967 0,973
6 294 0,967 0,980 0,974
4.8
.
-> F1
246 54 0,857 0,820 0,838
41 259 0,827 0,863 0,845
4.9
.
-> F1
273 27 0,904 0,910 0,907
29 271 0,909 0,903 0,906
4.10
.
: 0,973,
: 0,842,
: 0,907.
,
,
. ,

, .

( 4.11):
z p

-0,263 0,281 -0,937 0,349
[!] -0,319 0,194 -1,647 0,100
[,] 0,574 0,182 3,147 0,002
[%] 0,292 0,340 0,860 0,390
[()] -0,114 0,269 -0,423 0,672
[;] -0,264 0,229 -1,155 0,248
[.] 4,496 0,428 10,502 0,000
4.11 .

.
97%.
( 4.6):
4.6
.

.
. .

. ,

. , ,

.

.
. ,
,
( 4.12):
-> F1
275 25 0,902 0,917 0,909
30 270 0,915 0,900 0,908
4.12
.
0,908 90,8%.

( 4.13):
z p

-0.036 0.189 -0.191 0.848
-0.534 0.255 -2.095 0.036
. . -1.561 0.371 -4.203 0.000
0.622 0.260 2.390 0.017
-0.904 0.258 -3.500 0.000
0.311 0.215 1.447 0.148
-0.313 0.217 -1.443 0.149
0.567 0.226 2.503 0.012
-3.160 0.385 -8.198 0.000
-0.742 0.240 -3.090 0.002
-0.101 0.175 -0.578 0.563
-0.089 0.217 -0.409 0.682
0.100 0.189 0.528 0.598
-0.569 0.247 -2.307 0.021
4.13
.

. ,
,

.

, . ,
3,
. , ,

(150) .

.

. ,
25
94
. ( 4.14):
-> F1
292 8 0,977 0,973 0,975
7 293 0,973 0,977 0,975
4.14 25 (
)
.
94
: , , ,
, , , , , , , , , , , , , , ,
. . , , , , , , .
0,975 97,5%
.
,

.
,
(feature selection
methods), ,

.
(stepwise) (Venables & Ripley, 2002, p. 175),
Akaike (Akaike Information Criterion AIC). To AIC
(Akaike, 1974)
. ,
AIC,
, .
:
Forward: AIC
. AIC
,
, .
,

.
Backward: AIC
.
AIC, ,
,
.

.
,

( 4.15):
-> F1
293 7 0,977 0,977 0,977
7 293 0,977 0,977 0,977
4.15 .
0,977 97,7%
.
, :
, , , , , , , 1 , 4
, , , , .

(11) (25)
. , ,

.
AIC.
( 4.7)
:
4.7
.

, (Mikros &
Perifanos, 2011).
,
.

. ,

.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic
Control, 19(6), 716-723. doi: 10.1109/TAC.1974.1100705
McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated
approaches to lexical diversity assessment. Behavior research methods, 42(2), 381-392. doi:
10.3758/brm.42.2.381
Mikros, G. K., & Perifanos, K. (2011). Authorship identification in large email collections: Experiments using
features that belong to different linguistic levels Proceedings of PAN 2011 Lab, Uncovering
Plagiarism, Authorship, and Social Software Misuse held in conjunction with the CLEF 2011
Conference on Multilingual and Multimodal Information Access Evaluation, 19-22 September 2011,
Amsterdam.
Tweedie, F. J., & Baayen, H. R. (1998). How variable may a constant be? Measures of lexical richness in
perspective. Computers and the Humanities, 32(5), 323-352.
van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London: Butterworths.
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York: Springer.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. San
Francisco, CA: Elsevier.
5:

. corpus

(blogs).
(Support Vector Machines SVM),
.
20 corpus
.

, , ,
, , .
5.1.

,
. ,
.

(Luyckx & Daelemans, 2011, p. 52).
, ,
, ,
.
,
. ,
, ,
( ) .
. ,
( 10.000
) . ,

.
, .
,
, .
5.2.

5.2.1.
,
.
.
.

.
. , ,
133 . 900.000
77% (McLean, 2009).

,
(Mishne, 2007). ,
.
(Nilsson, 2003). .
(Chafe &
Danielewicz, 1987). ,
.
,
.

.

.

,

(Argamon, Koppel, Pennebaker, & Schler, 2007; Mohtasseb & Ahmed, 2009, 2010, 2011; Schler, Koppel,
Argamon, & Pennebaker, 2006).
(balanced)
,

. 1.000 20 (10 10 ) 50
. 406.460
(
3652 ). ( 5.1):

()

A 50 27.224 544,48 362,8 105 2.731

B 50 17.526 350,52 125,2 183 961
C 50 17.407 348,14 177,0 70 879
D 50 23.462 469,24 265,5 142 1.646
E 50 20.174 403,48 266,3 43 1.255

F 50 24.053 481,06 191,3 152 935
G 50 28.804 576,08 311,5 90 1.263
H 50 21.928 438,56 57,6 128 487
I 50 8.988 179,76 96,7 70 560
J 50 22.141 442,82 144,8 103 768
500 211.707 423,41 243,6 43 2.731
K 50 24.378 487,56 318,7 60 1.659
L 50 6.185 123,7 51,1 64 374
M 50 22.567 451,34 243,7 130 1.111

N 50 20.048 400,96 265,4 121 1.380
O 50 38.147 762,94 717,6 71 3.652
P 50 31.271 625,42 179,9 314 1.124
Q 50 15.614 312,28 181,6 63 777
R 50 8.103 162,06 197,0 31 1.035
S 50 15.866 317,32 138,1 81 660
T 50 12.574 251,48 121,2 52 631
500 194.753 389,5 351,1 31 3.652
1.000 406.460 406,46 302,5 31 3.652

5.1 .
20

.

( 4).
:
1. / > (7
):
1. K,
2. ,
3. ,
4. ,
5.
6. ,
7. .
2. / > (900
):
1. 300 ,
2. 300 ,
3. 300 .
3. / (646 ):
1. ,
2. ,
3. ,
4. ( 1, 2, 14
),
5. 300 ,
6. 300 .

4 1.553 .
( , ,
), ,
.
- . , 300
.
5.2.2. : (Support Vector
Machines)
1.000 1.553
. ,
. ,
, , ..
. ,
(robust algorithms),
(dimensionality) .
, (text
categorization), (Support Vector Machines SVM)
(Vapnik, 1995).

. ,
.

(supervised learning), ,
, . ,

,
.
:
1. (separating hyperplane),
2. (maximum-margin hyperplane),
3. (soft margin),
4. (kernel function).
, ,
,
. 5.1 (scatter plot),
:
5.1
.

(
) 4
17 , ( )
5 17 .
,
. , ,

.

, .
, 3 ,
(. 5.2). ,
, .

.
5.2 ,
.
(
) .
, , ,
(. 5.3).
5.3 .

.
,
, .
(hyperplane margin)
,
(maximum-margin hyperplane).
(support vectors) (. 5.4).
5.4 . .
, . ,
,
. ( 5.5) ,
:
5.5 .
,
(soft margin).

.
(cost parameter C)
.
,
.
(kernel function).
. , ,
, ..
. ,
() ,
. ( 5.6):
5.6 .
-5 5,
5 -5.
, ,
.
.
.
( 5.7):
5.7
.

.
,
. ,
.
,
,
(over-fitting).
,
. , ,
,
. ,
,
.
(one-versus-all) (one-against-one).
,
, ~ , ~ ,
~ .

, .

, .. ~ , ~
, ~ .
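The two strategies for extending a binary SVM to several classes differ in how many binary classifiers have to be trained. The sketch below only enumerates the binary problems; it does not train any model, and the function names and the three-author example are assumptions made for illustration.

```python
from itertools import combinations


def one_versus_all(classes):
    """One binary problem per class: that class against all the others together."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_against_one(classes):
    """One binary problem for every unordered pair of classes."""
    return list(combinations(classes, 2))


authors = ["A", "B", "C"]
print(one_versus_all(authors))    # A ~ rest, B ~ rest, C ~ rest
print(one_against_one(authors))   # A ~ B, A ~ C, B ~ C

# Number of binary classifiers needed for the 20 bloggers of this chapter.
twenty = [f"author_{i}" for i in range(1, 21)]
print(len(one_versus_all(twenty)), len(one_against_one(twenty)))
```

For k classes the first strategy needs k classifiers and the second k(k-1)/2, so with three classes both need three, while with 20 authors the pairwise strategy needs 190 smaller classifiers instead of 20 larger ones.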
,
(Joachims, 1998), (Dewdney, VanEss-
Dykema, & MacMillan, 2001), (Drucker, Donghui, & Vapnik,
1999) .. , ,
(Diederich, Kindermann, Leopold, & Paass, 2003; Manevitz & Yousef, 2001; Plakias & Stamatatos, 2008;
Sousa Silva et al., 2011; Teng, Lai, Ma, & Li, 2004; Zheng, Li, Chen, & Huang, 2006)
.
5.2.3.
:
95
20 10- .
2, 4, 8 ... 20
.
.
2, 4 ... ,
10 , ...,
.

.
( 5.2)
:
-> A B
A 47 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
B 0 39 0 0 0 0 1 3 0 0 2 0 0 1 0 0 3 0 1 0
0 0 46 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1
0 0 0 48 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
0 0 1 0 40 0 1 0 0 0 1 0 2 0 0 1 3 0 0 1
0 1 0 0 0 48 0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 2 0 1 0 39 0 0 0 0 1 0 2 1 1 1 0 0 1
0 0 0 0 0 0 1 49 0 0 0 0 0 0 0 0 0 0 0 0
0 0 2 1 0 0 0 0 45 0 0 0 0 0 0 0 0 1 0 1
0 0 1 1 0 0 0 0 0 42 0 0 0 0 1 1 2 0 2 0
0 0 0 0 1 1 0 0 0 0 48 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 0 0 0 40 0 2 0 0 3 0 3 0
0 0 0 0 0 0 0 0 0 0 0 0 50 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0 1 0 42 0 0 1 1 3 0
0 0 0 0 0 0 0 2 0 1 0 1 0 0 41 1 2 0 2 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 49 0 0 0 0
1 1 0 0 1 0 2 0 0 2 0 1 0 1 1 0 35 3 0 2
0 0 1 0 1 0 2 0 0 0 0 1 0 6 2 0 3 31 2 1
0 1 1 0 0 0 0 0 0 2 0 0 0 1 1 0 1 1 42 0
0 2 2 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 43
5.2 (=20).
0,864 86,4%.
( 5.3):
95
kernlab (Karatzoglou, Smola, Hornik, & Zeileis, 2004).
F1
0,959 0,94 0,949 A
0,848 0,78 0,813 B
0,807 0,92 0,86
0,96 0,96 0,96
0,889 0,8 0,842
0,96 0,96 0,96
0,83 0,78 0,804
0,86 0,98 0,916
0,978 0,9 0,938
0,84 0,84 0,84
0,941 0,96 0,95
0,889 0,8 0,842
0,962 1 0,98
0,724 0,84 0,778
0,854 0,82 0,837
0,925 0,98 0,951
0,648 0,7 0,673
0,838 0,62 0,713
0,764 0,84 0,8
0,86 0,86 0,86
5.3 .
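The per-author values of Table 5.3 follow directly from the confusion matrix of Table 5.2. The sketch below recomputes them, assuming that rows hold the actual author and columns the predicted author (each row sums to the 50 posts per blogger); authors are referred to by their position in the table, and all names in the code are illustrative.

```python
# Confusion matrix copied from Table 5.2: rows = actual author, columns = predicted author.
CONFUSION = [
    [47, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 39, 0, 0, 0, 0, 1, 3, 0, 0, 2, 0, 0, 1, 0, 0, 3, 0, 1, 0],
    [0, 0, 46, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 48, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 40, 0, 1, 0, 0, 0, 1, 0, 2, 0, 0, 1, 3, 0, 0, 1],
    [0, 1, 0, 0, 0, 48, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 2, 0, 1, 0, 39, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 2, 1, 0, 0, 0, 0, 45, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 42, 0, 0, 0, 0, 1, 1, 2, 0, 2, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 48, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 40, 0, 2, 0, 0, 3, 0, 3, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 50, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 42, 0, 0, 1, 1, 3, 0],
    [0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1, 0, 0, 41, 1, 2, 0, 2, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 49, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 2, 0, 0, 2, 0, 1, 0, 1, 1, 0, 35, 3, 0, 2],
    [0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 1, 0, 6, 2, 0, 3, 31, 2, 1],
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 1, 42, 0],
    [0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 43],
]


def overall_accuracy(matrix):
    """Share of all texts that lie on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_metrics(matrix):
    """Return a list of (precision, recall, F1) tuples, one per class."""
    results = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column sum
        actual = sum(matrix[i])                     # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        results.append((precision, recall, f1))
    return results


print(f"accuracy: {overall_accuracy(CONFUSION):.3f}")
for position, (p, r, f) in enumerate(per_class_metrics(CONFUSION), start=1):
    print(f"author {position:2d}: precision {p:.3f}  recall {r:.3f}  F1 {f:.3f}")
```

Off-diagonal cells also show which pairs of authors the classifier confuses most often, information that the summary measures alone do not reveal.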
,
, 0, 864 86,4%.
, F1
0,67 ( ) 0,98 ( ). , ,
,
10- .
, ,
.
(polynomial kernel).
. ( 5.8)

( ):
5.8 .

(, , ANOVA,
RBF) (Laplace, Bessel).

. ,

. ,
.

96
. ( 5.9)
C, o
(scale parameter) :
96
caret (Kuhn et al., 2012).
5.9 .
,
(scale) 0,002 0,01 1.
20 0,88 88%,
1,6%.

.
:
4 (2, 4, 8, 16 ),

.
, 10 (
),

.
( 5.10)
2 4, 6, 8, 16:
5.10
.
10
, ...
2 4
0,95 ( 95%)
.

(ANOVA)
.
ANOVA
(F=53,46, df=3, 36,
p<0,000).
, post-hoc
Honest Significant Difference (HSD) Tukey. ( 5.11)
:
5.11 .
, 0,
2-4 . 2 4
,
, .
4 ,
, 8 16
.
5.3.

5.3.1.
,
, .
,
(.. , ) ,
(..
, , ,
...).

.
.
:

:

. , ,
.

,
-
( 0% 80%).

:

.

.
( 50
500 ).
, 4

( ).
:
, 1990, [47.929 ].
, 1991, [72.831 ].
, 1988, [78.077 ].
, 1992, [28.602 ].

( 5.4):

(

)
20% 40% 60% 80% 100%
50 979 1.901 2.788 3.773 4.736
100 498 929 1.453 1.896 2.367
200 209 491 718 955 1.181
500 87 191 309 380 471
5.4 - .
,
( 20%
) 87 500 4 ,
4.736 50 . ,
1.000 97, 50
100 ,

.
:
1. / > (6
):
1. K,
2. ,
3. ,
4. ,
5. ,
6. .
2. / > (80 )
1. 80 .
3. / (45 )
1. ,
2. ,
3. ,
4. ( 1, 2, 14
).

,
-

.
10- ,
.

500 ( 5.5):
. . F1
->
131 14 0 0 0,879 0,903 0,891
15 158 0 1 0,919 0,908 0,913
0 0 57 0 1 1 1
3 0 0 92 0,989 0,968 0,979
5.5
500 .
4 93%
(),
( ~ ). ,
97

1.000 (Juola, Sofko, & Brennan, 2006, p. 172; Stamatatos, Fakotakis, &
Kokkinakis, 1999, p. 208; Zhao & Zobel, 2007).

,
(<1.000 ).
-

.
, (ANOVA),

.
(F= 36,48, df=3, 36, p<0,000),
(F= 0,43, p= 0,78).
( 5.12)

:
5.12
.

.
post hoc
( ),
. Tukey HSD
post hoc ,
( 5.13):
5.13 post hoc
.
0
500-200 100-50.
500 200 ,
100 50 . 100
200 .
. 100
, 200
. 200
.

. ,
,
.
5.3.2.

.
80 ,
,
.
, ,
,

, .

,
, .

.
. ,

,
.
,
, (
) ( )
.
(Mikros, 2006, 2007a, 2007b)
(key-words) (Bondi & Scott, 2010)

, ( ).
:
1. ,
, (.. ,
...).
2.
(..
, ...).
3. n
.
(Log Likelihood LL).

.
( , , ) ( 5.6):
1.061 2.372
% (
0,2 0,02
)
408.188 9.864.424
% (
99,8 99,98
)
() 409.249 9.866.796
() 100 100
5.6 .

-
.
5.6 :

() +
(-) - - +--
+
5.7 .

- (lexical association
measures).
~ .
,
( ~ )
, .
.
Log Likelihood (LL)
:
1.
, . LL
.
2. Dunning (1993) 2
98,
,
2

K. , (<5),
, 2
. LL
,
.
3. LL 0

(Daille, 1995).
, LL,

.
LL ( 5.7) : LL=
$$LL = 2\,\bigl[\,a\log a + b\log b + c\log c + d\log d - (a+b)\log(a+b) - (a+c)\log(a+c) - (b+d)\log(b+d) - (c+d)\log(c+d) + (a+b+c+d)\log(a+b+c+d)\,\bigr]$$ 99.
98
2 Hofland and Johansson (1982) Leech and Fallon (1992)
, Rayson, Leech,
and Hodges (1997) British National Corpus BNC.
99
LL :
http://ucrel.lancs.ac.uk/llwizard.html
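As an illustration of the Log Likelihood keyness measure, the sketch below evaluates the statistic for a 2×2 contingency table, using the frequencies of the example in Table 5.6 (1.061 occurrences among 409.249 words in one corpus, 2.372 among 9.866.796 in the other). The cell names a, b, c, d and the function names are assumptions of this sketch: a and b are the frequencies of the word in the two corpora, c and d the frequencies of all other words. Natural logarithms are used.

```python
import math


def _xlogx(x: float) -> float:
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log Likelihood statistic of a 2x2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    return 2.0 * (
        _xlogx(a) + _xlogx(b) + _xlogx(c) + _xlogx(d)
        - _xlogx(a + b) - _xlogx(a + c) - _xlogx(b + d) - _xlogx(c + d)
        + _xlogx(n)
    )


# Frequencies from Table 5.6.
word_target, size_target = 1061, 409249
word_reference, size_reference = 2372, 9866796

ll = log_likelihood(
    a=word_target,
    b=word_reference,
    c=size_target - word_target,          # all other words in the first corpus
    d=size_reference - word_reference,    # all other words in the second corpus
)
print(f"LL = {ll:.1f}")
```

The statistic is 0 when the word has the same relative frequency in both corpora and grows as the two relative frequencies diverge, which is why candidate key-words can simply be ranked by their LL value.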
5.3.4.

80 4 (20
) .
,

3.2
5.
:
1. / > (6
):
1. K
2. ,
3. ,
4. ,
5. ,
6. .
2. / > (80 )
1. 80 (20 ).
3. / (45 ):
1. ,
2. ,
3. ,
4. ( 1, 2, 14
).
10-
. ( 5.14)
:
5.14
.

. , 10%
(. = 0,85, = 0,75). ,
(two-way ANOVA),

.
(main effects), (interaction effects).

(
F= 84,1, p<0,000, F=83.1, p<0,000),
(F=1,61, p=0,21).
HSD post hoc ( 5.15):
5.15 post hoc
.

,
, .
post hoc (. 5.13),

.

5.16:
5.16
.
(
)
.

(F= 9,44, p<0,01).
(F= 0,67, p= 0,62) (. 5.12)

.

.

. , 500
0,97 20% ,
(0,94). , 227.439
45.487 , 3% ( ).
, , , ,
100%,
5.000 .
. , 5.000

, 4.

( , ,
)

.

, .
Argamon, S., Koppel, M., Pennebaker, J. W., & Schler, J. (2007). Mining the blogosphere: Age, gender and
the varieties of self-expression. First Monday, 12(9).
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2003/1878
Bondi, M., & Scott, M. (Eds.). (2010). Keyness in Texts. Amsterdam: John Benjamins.
Chafe, W., & Danielewicz, J. (1987). Properties of spoken and written language. In R. Horowitz & J. S.
Samuels (Eds.), Comprehending oral and written language (pp. 83-113). New York: Academic Press.
Daille, B. (1995). Combined Approach for Terminology Extraction: Lexical Statistics and Linguistic Filtering.
Unit for Computer Research on the English Language (UCREL) Technical Papers, 5, 1-67.
Dewdney, N., VanEss-Dykema, C., & MacMillan, R. (2001). The form is the substance: classification of
genres in text Proceedings of the Workshop on Human Language Technology and Knowledge
Management - Volume 2001, Toulouse, France (pp. 1-8): Association for Computational Linguistics.
Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship Attribution with Support Vector
Machines. Applied Intelligence, 19(1), 109-123.
Drucker, H., Donghui, W., & Vapnik, V. N. (1999). Support vector machines for spam categorization. IEEE
Transactions on Neural Networks, 10(5), 1048-1054.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational
Linguistics, 19(1), 61-74.
Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. London: Longman.
Joachims, T. (1998). Text categorization with Support Vector Machines: Learning with many relevant
features. In C. Nédellec & C. Rouveirol (Eds.), Proceedings of the 10th European Conference on
Machine Learning, 21-24 April 1998, Dorint-Parkhotel, Chemnitz, Germany (pp. 137-142). Berlin:
Springer.
Juola, P., Sofko, J., & Brennan, P. (2006). A prototype for authorship attribution studies. Literary and
Linguistic Computing, 21(2), 169-178.
Karatzoglou, A., Smola, A. J., Hornik, K., & Zeileis, A. (2004). kernlab - An S4 package for kernel methods
in R. Journal of Statistical Software, 11(9), 1-20.
Kuhn, M., Wing, J., Weston, S., Williams, A., Keefer, C., & Engelhardt, A. (2012). caret: Classification and
Regression Training: R package version 5.15-023.
Leech, G., & Fallon, R. (1992). Computer corpora: what do they tell us about culture? ICAME Journal, 16,
29-50.
Luyckx, K., & Daelemans, W. (2011). The effect of author set size and data size in authorship attribution.
Literary and Linguistic Computing, 26(1), 35-55. doi: 10.1093/llc/fqq013
Manevitz, L. M., & Yousef, M. (2001). One-Class SVMs for document classification. Journal of Machine
Learning Research, 2, 139-154.
McLean, J. (2009). State of the Blogosphere 2009. Retrieved 9/2, 2011, from
http://technorati.com/blogging/feature/state-of-the-blogosphere-2009/
Mikros, G. K. (2006). Authorship attribution in Modern Greek newswire corpora. In O. Uzuner, S. Argamon
& J. Karlgren (Eds.), Proceedings of the SIGIR 2006 International Workshop on Directions in
Computational Analysis of Stylistics in Text Retrieval (pp. 43-47). Seattle, Washington, USA: ACM.
Mikros, G. K. (2007a). Authorship Attribution Using Discriminant Function Analysis: Exploring Literary
Style Variation in Five Modern Greek Novels. Paper presented at the 5th Trier Symposium on
Quantitative Linguistics, Trier, Germany.
Mikros, G. K. (2007b). Stylometric experiments in Modern Greek: Investigating authorship in homogeneous
newswire texts. In R. Köhler, G. Altmann & P. Grzybek (Eds.), Exact methods in the study of
language and text (pp. 445-456). Berlin / New York: Mouton de Gruyter.
Mishne, G. (2007). Applied Text Analytics for Blogs. University of Amsterdam, Amsterdam.
Mohtasseb, H., & Ahmed, A. (2009). More blogging features for author identification. Proceedings of the
International Conference on Knowledge Discovery (ICKD'09), 6-8 June 2009, Manila, Philippines
(pp. 534-539).
Mohtasseb, H., & Ahmed, A. (2010). The Affects of Demographics Differentiations on Authorship
Identification. In S.-I. Ao & L. Gelman (Eds.), Electronic Engineering and Computing Technology
(Vol. 60, pp. 409-417). Heidelberg: Springer.
Mohtasseb, H., & Ahmed, A. (2011). Two-layered Blogger identification model integrating profile and
instance-based methods. Knowledge and Information Systems, 1-21. doi: 10.1007/s10115-011-0398-0
Nilsson, S. (2003). The function of language to facilitate and maintain social networks in research weblogs.
Umeå Universitet.
Plakias, S., & Stamatatos, E. (2008). Tensor Space Models for Authorship Identification. In J. Darzentas, G.
Vouros, S. Vosinakis & A. Arnellos (Eds.), Artificial Intelligence: Theories, Models and Applications
(Vol. 5138, pp. 239-249): Springer Berlin / Heidelberg.
Rayson, P., Leech, G., & Hodges, M. (1997). Social differentiation in the use of English vocabulary: some
analyses of the conversational component of the British National Corpus. International Journal of
Corpus Linguistics, 2(1), 133-152.
Sousa Silva, R., Laboreiro, G., Sarmento, L., Grant, T., Oliveira, E., & Maia, B. (2011). twazn me!!! ;(
Automatic Authorship Analysis of Micro-Blogging Messages. In R. Muñoz, A. Montoyo & E. Métais
(Eds.), Natural Language Processing and Information Systems (Vol. 6716, pp. 161-168). Berlin /
Heidelberg: Springer.
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (1999). Automatic authorship attribution. Paper presented at
the Proceedings of the ninth conference on European chapter of the Association for Computational
Linguistics, Bergen, Norway.
Teng, G.-F., Lai, M.-S., Ma, J.-B., & Li, Y. (2004). E-mail authorship mining based on SVM for computer
forensic. Proceedings of the 2004 International Conference on Machine Learning and Cybernetics,
26-29 August 2004, Shanghai, China (Vol. 2, pp. 1204-1207).
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Zhao, Y., & Zobel, J. (2007). Searching with style: authorship attribution in classic literature. In D. Gillian
(Ed.), Proceedings of the 30th Australasian Computer Science Conference (ACSC2007), 30 January -
2 February 2007, Ballarat, Victoria, Australia (Vol. 62, pp. 59-68). New York: ACM Press.
Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online
messages: Writing-style features and classification techniques. Journal of the American Society for
Information Science and Technology, 57(3), 378-393. doi: 10.1002/asi.20316
6:
,
(author
profiling).

.

() .
( )
.

, , .
6.1.

,
(author profiling). , ..
.
,
,
,
.
.

.

. 20
,
(gender) .
,
:

.

() .
()
.

.
,
,
.

.

. ,

. ,
,
( - ) ,
,
.
6.2.

6.2.1.
6.2.1.1.

, .
,
,
, .
,
. Moir and Jessel (1992)
:

. [
]
(spatial ability)
.
(Moir & Jessel, 1992, p. 10)

(corpus callosum).
.
(Gorman & Nash, 1992)
(Levitin, 2006). (Clarke et al., 2007) ,
.
, ,
(inferior-parietal lobule) (Frederikse, Lu, Aylward, Barta, & Pearlson, 1999).

. ,
, .

. , . ,

.
6.2.1.2. (Functional Magnetic Resonance Imaging fMRI)

- (Functional Magnetic Resonance Imaging fMRI).
, ,

.
.

(functional lateralization) (Shaywitz et al., 1995),
.
,
.
(.
2.2.1 6).

.
(Kimura, 2000; Linn & Petersen, 1985).
, ,
(Flannery, Liederman, Daly, & Schultz, 2000)
. (Kimura, 1993; Kimura & Hampson, 1994)
,
(48,5%) (30%) .

, ,
,
.

. -
(Sommer, Aleman, Bouma, & Kahn, 2004)
.

(Harrington & Farias, 2008)
.
6.2.2.

.
,

(, 2009).

.

,
.
Fasold (1990) (sociolinguistic gender
pattern).
, .
,
(prestige pronunciation).

(Fischer, 1958)
(Milroy, 1980; Wolfram, 1969). O Labov (1982)
1982 ,
:
, ,

.
(Labov, 1982, p. 78)

, (
). (Abd-el-Jawad, 1987; Haeri, 1987)

.
,

. Key (1975)

.
.
Trudgill (1978)
.
Trudgill
. ,
.
, ,
.

.
Trudgill

.

,
.

.
. Trudgill (1972):
. . . [ ], ,
, ,
. . .
. . . . ,
[] .
(Trudgill, 1972, p. 183)
,
, ,

. ,
(covert
prestige).

. ,
,

.
.
,
,
.

Deuchar (1988). Brown and Levinson (1978)
(face) ,
.
,
.
,
. ,
.
, Deuchar

. ,

, .
,
.

. ,

, (Brouwer,
Gerritsen, & De Haan, 1979), .
6.3.

Koppel, Argamon, and Shimoni (2002). British
National Corpus BNC (566 )
, .
,
405 - .
1.081 Balanced Winnow.
79,5% 82,6%
.

. ,
() (
), .
(blogs) (37,478 300
. ) (Schler, Koppel,
Argamon, & Pennebaker, 2006). ,
,
, . (hyperlinks), (emoticons),
, .. lol, haha . (. 2.2.2
3). , .
1.502
Multi-Class Real Winnow
80,1%.
,

( ).
Corney (2003) ,

, , ..
70,1%
, .
, Hota, Argamon, Koppel, and Zigdon
(2006), 34 .
,

, .
, , ,
, ( 10).
60% 75%
.
,
, .
6.4.

(training corpus)
(validation corpus) . Rudman (1997),
,
:
, ,
,

,
(cross-validation)
,
.
,
.

.
,
. (Mikros & Argiri, 2007),

, ,
.
, ,
. , ,
/
.
, ,
.
,
,
, . ,
,
.

, , . ,
:
()
(1 ).
(50)
.

.
,

.
, 700 7 7
.
, (balanced)
.
( 6.1):

7 11 8 26
3.169 6.308 4.999 14.476
84 111 33 228
60.489 77.811 23.071 161.371

14 31 2 47
5.748 16.982 1.646 24.376
17 15 17 49
14.568 10.865 21.710 47.143
N 122 168 60 350
83.974 111.966 51.426 247.366
8 17 25
3.847 8.353 12.200
88 117 22 227
60.283 68.793 20.595 149.671

14 32 2 48
8.453 20.471 1.736 30.660
17 16 17 50
11.301 11.023 17.218 39.542
N 127 182 41 350
83.884 108.640 39.549 232.073
249 350 101 700
167.858 220.606 90.975 479.439
6.1 .
100
.
,
. 84% 1.000 .
,
.
:
1. / > (6
):
1. K
2.
3.
4.
5.
6.
7.
8. :
5000 10000
(Hatzigeorgiu et al.,
2000).
2. / > (80 ):
1. 50 .
3. / (45 ):
1. ,
2. ,
3. ,
4. ( 1, 2, 14
).
4. (15 ):
1. ,
2. % : (>18
),
3. .
6.5.

(. 4 6)
.
(Multivariate Analysis of
Variance, MANOVA),
.
100

(Koutsis, Kouklakis, Mikros, & Markopoulos, 2005).
(Analysis of Variance ANOVA),

.
, - (a-level error) .
,
.
,
(Weinfurth, 1995).
(Huberty & Morris, 1989)
,
.
(
) (
).
(linear composite) ,
.

.
,
. (Hair
Jr, Anderson, Tatham, & Black, 1995; Stevens, 2002)
(t )
- ( Bonferroni),
/ / /
.
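The argument for a single multivariate test over many separate univariate tests, and the Bonferroni adjustment mentioned above, can be illustrated numerically. The following is a minimal sketch that assumes independent tests and a nominal significance level of 0.05; the function names and the example numbers of tests are illustrative assumptions and do not come from the original analysis.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one Type I error across k independent tests."""
    if not 0.0 < alpha < 1.0 or k < 1:
        raise ValueError("alpha must lie in (0, 1) and k must be >= 1")
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-test level that keeps the familywise error at or below alpha."""
    if not 0.0 < alpha < 1.0 or k < 1:
        raise ValueError("alpha must lie in (0, 1) and k must be >= 1")
    return alpha / k


if __name__ == "__main__":
    # Hypothetical numbers of separate tests, chosen only for illustration.
    for k in (1, 4, 10):
        uncorrected = familywise_error(0.05, k)
        corrected = familywise_error(bonferroni_alpha(0.05, k), k)
        print(f"k={k:2d}  uncorrected={uncorrected:.3f}  with Bonferroni={corrected:.3f}")
```

Under these assumptions the uncorrected familywise error grows quickly with the number of tests, while the Bonferroni-adjusted level keeps it just below the nominal value.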
, , (Bray & Maxwell, 1982; Huberty & Morris, 1989)
,
.

,
,
.
- (Discriminant Function Analysis - DFA)
.

(weights) (Meyers, Gamst, & Guarino,
2006). , ,
, .

.

( 6.1):
6.1 .
,
.
,

. ,
,
.

.
6.5.1.

.
Hotelling T², t. ,
η² (partial η²),
.
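As a rough illustration of how a multivariate effect size relates to Wilks' Λ, the sketch below uses the common convention partial η² = 1 − Λ^(1/s), where s is the smaller of the number of dependent variables and the degrees of freedom of the effect. This convention, the function name, and the example values are assumptions made for illustration; the original analysis may have computed the statistic differently.

```python
def partial_eta_squared(wilks_lambda: float, s: int = 1) -> float:
    """Multivariate partial eta squared derived from Wilks' lambda.

    s = min(number of dependent variables, degrees of freedom of the effect).
    For a two-level factor s equals 1, so the result is simply 1 - lambda.
    """
    if not 0.0 < wilks_lambda <= 1.0:
        raise ValueError("Wilks' lambda must lie in (0, 1]")
    if s < 1:
        raise ValueError("s must be a positive integer")
    return 1.0 - wilks_lambda ** (1.0 / s)


if __name__ == "__main__":
    # Hypothetical lambda values; a value close to 1 means a small effect.
    for lam in (0.63, 0.87, 0.987):
        print(f"lambda={lam:.3f}  partial eta squared={partial_eta_squared(lam):.3f}")
```

A Λ close to 1 therefore corresponds to a very small share of explained multivariate variance, even when the associated p-value is below the significance threshold.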
( 6.2)
:
6.2
.

4 6,
. ,
, (37%)
(27%) (13%). (<10%),
.
Hotelling T² ,

.

6.5.2.
50
28 ,
.
(standardized coefficient) .
( 6.3)
.
,
.
6.3 .
(
) , , , , , , , , , , , ,
, , , , (. .), ,
, , , , , , , , , .
,
: ) (, ) )
(, ).
(Mulac,
Bradac, & Mann, 1985; Mulac, Studley, & Blau, 1990).
, , .
,
(, , ).
(Argamon, Koppel, Pennebaker, & Schler, 2007;
Holmes, 1990; Preisler, 1986; Rayson, Leech, & Hodges, 1997)
/ /.
,
.
6.5.3.

12
.
(standardized coefficient) . ( 6.4)

.
6.4 .
(
) , , ,
, , , . ,
, , , , , .
6.5.4.

. ( 6.5)

.
6.5 .

2, 3, 13 14 ,
4, 8, 9 10 .

2 3 13 14 .
,
8 4 9 10
.

.
. (2-3 )
. ,
(13 14 )
.
(2 3 ),
,
(2 3 ).

2, 3, 13 14 .
2 3
(r2= -0,354, r3= -0,385), (.
) (2
3 ). , (13 14 )
, (r13=
0,134, r14= 0,144), ,
, , .
6.5.5.

(Wilks' Λ = 0,987, p=0,003)
(Wilks' Λ = 0,995, p=0,049)
. ,
(M= 8,2, SD= 1,9) (M= 7,8, SD= 1,7)
(M= 8,2, SD= 2,2) (M= 7,9, SD= 1,9).

,
( : 0,256, : 0,464),
( : -0,403,
: -0,435).
6.5.6.

5.000
(Wilks' Λ = 0,980, p=0,001) (Wilks' Λ = 0,991, p=0,01)
.
5.000
(M= 0,25, SD= 0,05) (M= 0,26, SD= 0,04).
(M= 82,95, SD= 2,98)
(M= 83,54, SD= 3,13).

.
5.000 ,
, ,
.
,
, .
6.6. :
(Artificial Neural Network ANN)
.

(Lowe & Matthews, 1995; Matthews & Merriam, 1993;
Merriam & Matthews, 1994; Singh & Tweedie, 1995; Tearle, Taylor, & Demuth, 2008; Tweedie, Singh, &
Holmes, 1996).
60,
. , ( )
( ).
,
.
() .
(inputs) (outputs)
: x = [x1, x2, ..., xm].
n x{ {lG y{ {lG .
(layers),
(weights) (transfer
functions) (Hagan, Demuth, & Beale, 1996).
.
, m (input layer) ( m
x), nh (hidden layer)
(output layer), : ,
.
, nh(m+1) .

.
f(x) .
( ), .
(
) .
,
,
.
,
,

.
,
.
, (backpropagation).

,
.
6.7.

( 54)
(. 5.1 6).

(weighted sum) .
(activation function)
, .
:
.
,
softmax, 0-1,
(Duda,
Hart, & Stork, 2000).
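The weighted sum, the activation function and the softmax output described above can be sketched as a single forward pass through a network with one hidden layer. This is a minimal illustration in plain Python, not the implementation used in the study: the logistic hidden activation, the bias-first weight layout and all names are assumptions. The helper `count_weights` follows the count of nh(m+1) input-to-hidden weights given earlier and adds the hidden-to-output weights.

```python
import math


def logistic(z: float) -> float:
    """Numerically stable logistic activation."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    """Map raw output scores to probabilities in (0, 1) that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden_w, output_w):
    """One forward pass.

    hidden_w: nh rows of m + 1 weights (bias first, then one weight per input).
    output_w: one row per class, each with nh + 1 weights (bias first).
    """
    hidden = [logistic(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in hidden_w]
    scores = [row[0] + sum(w * h for w, h in zip(row[1:], hidden))
              for row in output_w]
    return softmax(scores)


def count_weights(m: int, nh: int, n_out: int) -> int:
    """Weights of a fully connected m-nh-n_out network, biases included."""
    return nh * (m + 1) + n_out * (nh + 1)


if __name__ == "__main__":
    # Toy weights chosen by hand; a trained network would learn these values.
    hidden_w = [[0.0, 1.0, -1.0], [0.5, 0.0, 0.0]]
    output_w = [[0.0, 1.0, 1.0], [0.0, -1.0, -1.0]]
    print(forward([1.0, 1.0], hidden_w, output_w))
    print(count_weights(m=2, nh=2, n_out=2))
```

Training would adjust `hidden_w` and `output_w` by backpropagation; the sketch covers only how a trained network turns a feature vector into class probabilities.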

. (overtraining)
,
.
,
. ,
.
- -,
, .

, , 10-
, (weight
decay). ,
,
. ,
.
.
,
101 10
.
( 6.6):
6.6
.
17
0.0074.
0,81 81%.
( ):
-> F1
278 72 0,806 0,794 0,8

67 283 0,797 0,809 0,803
6.2 .
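The precision, recall and F1 values in Table 6.2 can be recomputed from the four cell counts. The sketch below assumes that rows are the actual classes and columns the predicted classes, which is the orientation that reproduces the reported figures; the generic class labels and the function names are illustrative.

```python
def per_class_metrics(matrix):
    """Return (precision, recall, F1) per class.

    matrix[i][j] counts texts of actual class i that were predicted as class j.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[r][k] for r in range(n))  # column total
        actual_k = sum(matrix[k])                          # row total
        precision = true_pos / predicted_k
        recall = true_pos / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results


def accuracy(matrix):
    """Share of all texts that lie on the diagonal of the matrix."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    table_6_2 = [[278, 72],
                 [67, 283]]
    for label, (p, r, f1) in zip(("class A", "class B"), per_class_metrics(table_6_2)):
        print(f"{label}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
    print(f"accuracy={accuracy(table_6_2):.3f}")
```

With this orientation the rounded per-class values match the table (0.806, 0.794, 0.800 and 0.797, 0.809, 0.803), and the overall accuracy is 561/700, about 0.801.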
.
80% ,
101
nnet MASS (Venables & Ripley, 2002).
caret (Kuhn et al., 2012).
8 10
(.
0 6).
,
. ,
,
, .

:

,
.

.

,
,
.

,

.

(dimensionality).

.

, .
,
..
6.8. :

.
,
, ,
.
.
100% ,
.
, .
,
.
.
,
.


:
:

,
.

.
: 2
5,
. ,
(. .
Koppel, Schler, and Argamon (2011)
10.000 ).
,
.
, ..
.
:
(. 3
5). ,
, tweets, sms ..,
,
.

,
,
(sparse
data).
:
, ,
, . ,
.
,
, cusum
4.1 2. , ,
,
,
, .

:
,
.
(authorship
verification). ..
,
.
.

, .

,

.
.

102.
,
.
:
,

(.. , ..).

(
, , ,
..).
:
, (text categorization)

.

,
.
,
(benchmark corpora),
.
2004 Patrick Juola (Juola, 2004)
Association for Literary and Linguistic Computing
Association for Computers and the Humanities (ALLC/ACH 2004). ,
(2011,
2012) PAN Lab (Uncovering Plagiarism,
Authorship, and Social Software Misuse) (Argamon & Juola, 2011).
,

.
:
,
.

.

.
102
19
. -, , and (2011).
:
http://www.blod.gr/lectures/Pages/viewlecture.aspx?LectureID=185
JGAAP103 (Java Graphical Authorship Attribution Program),
JAVA
(Graphical User Interface GUI),
.

104
.
.

. , ,

. , : ,
, , ,
, ( , . 88).
Abd-el-Jawad, H. R. (1987). Cross-dialectal variation in Arabic: competing prestigious forms. Language in
Society, 16, 359-368.
Argamon, S., & Juola, P. (2011). Overview of the International Authorship Identification Competition at
PAN-2011. Proceedings of PAN 2011 Lab, Uncovering Plagiarism, Authorship, and Social Software
Misuse held in conjunction with the CLEF 2011 Conference on Multilingual and Multimodal
Information Access Evaluation, 19-22 September 2011, Amsterdam.
Argamon, S., Koppel, M., Pennebaker, J. W., & Schler, J. (2007). Mining the blogosphere: Age, gender and
the varieties of self-expression. First Monday, 12(9).
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2003/1878
Bray, J. H., & Maxwell, S. E. (1982). Analyzing and Interpreting Significant MANOVAs. Review of
Educational Research, 52(3), 340-367. doi: 10.3102/00346543052003340
Brouwer, D., Gerritsen, M., & De Haan, D. (1979). Speech differences between women and men: on the
wrong track? Language in Society, 8, 33-50.
Brown, P., & Levinson, S. (1978). Universals in language usage: politeness phenomena. In E. N. Goody (Ed.),
Questions and politeness: strategies in social interaction (pp. 56-311). Cambridge: Cambridge
University Press.
Clarke, D., Wheless, J., Chacon, M., Breier, J., Koenig, M.-K., McManis, M., . . . Baumgartner, J. (2007).
Corpus callosotomy: A palliative therapeutic technique may help identify resectable epileptogenic
foci. Seizure, 16(6), 545-553.
103
JGAAP AGPL v3.0 :
http://evllabs.com/jgaap/w/index.php/Main_Page
104
1997 Waikato ( )
WEKA
100 . WEKA :
http://www.cs.waikato.ac.nz/~ml/weka/. R
1993
. R
: http://cran.cc.uoc.gr/
Corney, M. W. (2003). Analysing e-mail text authorship for forensic purposes. (Master), Queensland
University of Technology, Queensland.
Deuchar, M. (1988). A pragmatic account of women's use of standard speech. In D. Cameron & J. Coates
(Eds.), Women in their speech communities: new perspectives on language and sex (pp. 27-32).
London: Longman.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). New York: John Wiley &
Sons.
Fasold, R. W. (1990). The sociolinguistics of language. Oxford: Blackwell.
Fischer, J. L. (1958). Social influences on the choice of a linguistic variant. Word, 14, 47-56.
Flannery, K. A., Liederman, J., Daly, L., & Schultz, J. K. (2000). Male prevalence for reading disability is
found in a large sample of black and white children free from ascertainment bias. Journal of the
International Neuropsychological Society, 6(4), 433-442.
Frederikse, M. E., Lu, A., Aylward, E., Barta, P., & Pearlson, G. (1999). Sex Differences in the Inferior
Parietal Lobule. Cerebral Cortex, 9(8), 896-901. doi: 10.1093/cercor/9.8.896
Gorman, C., & Nash, M. (1992, 20 January). Sizing up the sexes. TIME, 36-43.
Haeri, N. (1987). Male/female differences in speech: an alternative interpretation. In K. M. Denning, S.
Inkelas, F. C. McNair-Knox & J. R. Rickford (Eds.), Variation in language: NWAV-XV at Stanford.
Proceedings of the fifteenth annual conference on New Ways of Analyzing Variation (pp. 173-196).
Stanford: Department of Linguistics, Stanford University.
Hagan, M. T., Demuth, H. B., & Beale, M. H. (1996). Neural network design. Boston: PWS Publishing
Company.
Hair Jr, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1995). Multivariate data analysis (5th ed.).
Upper Saddle River, NJ, USA: Prentice-Hall.
Harrington, G. S., & Farias, S. T. (2008). Sex differences in language processing: Functional MRI
methodological considerations. Journal of Magnetic Resonance Imaging, 27, 1221-1228.
Hatzigeorgiu, N., Gavrilidou, M., Piperidis, S., Carayannis, G., Papakostopoulou, A., Athanasia, S., . . . Iason,
D. (2000). Design and implementation of the online ILSP Greek Corpus. In M. Gavrilidou, G.
Carayannis, S. Markantonatou, S. Piperidis & G. Stainhaouer (Eds.), Proceedings of the second
International Conference on Language Resources and Evaluation (LREC 2000) (Vol. III, pp. 1737-
1742). Athens, Greece: ELRA.
Holmes, J. (1990). Hedges and boosters in women's and men's speech. Language and Communication, 10(3),
185-205.
Hota, S. R., Argamon, S., Koppel, M., & Zigdon, I. (2006). Performing gender: Automatic stylistic analysis of
Shakespeare's characters. Paper presented at the Proceedings of Digital Humanities 2006, Paris.
Huberty, C. J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analyses.
Psychological Bulletin, 105(2), 302-308.
Juola, P. (2004). Ad-hoc authorship attribution competition. Proceedings 2004 Joint International Conference
of the Association for Literary and Linguistic Computing and the Association for Computers and the
Humanities (ALLC/ACH 2004), 11-16 June 2004, Gothenburg, Sweden.
Key, M. R. (1975). Male/female language. Metuchen, NJ: Scarecrow Press.
Kimura, D. (1993). Neuromotor mechanisms in human communication. Oxford: Oxford University Press.
Kimura, D. (2000). Sex and cognition. Cambridge, MA: MIT Press.
Kimura, D., & Hampson, E. (1994). Cognitive pattern in men and women is influenced by fluctuations in sex
hormones. Current Directions in Psychological Science, 3(2), 57-61.
Koppel, M., Argamon, S., & Shimoni, A. R. (2002). Automatically categorizing written texts by author
gender. Literary and Linguistic Computing, 17(4), 401-412.
Koppel, M., Schler, J., & Argamon, S. (2011). Authorship attribution in the wild. Language Resources and
Evaluation, 45(1), 83-94. doi: 10.1007/s10579-009-9111-2
Koutsis, I., Kouklakis, G., Mikros, G. K., & Markopoulos, G. (2005). Minotavros: A tool for the semi-
automated creation of large corpora from the web. Paper presented at the Proceedings from the
Corpus Linguistics Conference Series, Birmingham.
Kuhn, M., Wing, J., Weston, S., Williams, A., Keefer, C., & Engelhardt, A. (2012). caret: Classification and
Regression Training: R package version 5.15-023.
Labov, W. (1982). Building on empirical foundations. In W. P. Lehmann & Y. Malkiel (Eds.), Perspectives
on historical linguistics (pp. 79-92). Amsterdam & Philadelphia: John Benjamins.
Levitin, D. (2006). This is your brain on music: The science of a human obsession. New York: Dutton Adult.
Linn, M. C., & Petersen, A. C. (1985). Emergence and characterization of sex differences in spatial ability: a
meta-analysis. Child Development, 56, 1479-1498.
Lowe, D., & Matthews, R. (1995). Shakespeare vs. Fletcher: A stylometric analysis by radial basis functions.
Computers and the Humanities, 29, 449-461.
Matthews, R., & Merriam, T. (1993). Neural computation in stylometry I: An application to the works of
Shakespeare and Fletcher. Literary and Linguistic Computing, 8(4), 203-209.
Merriam, T., & Matthews, R. (1994). Neural Computation in stylometry II: An application to the works of
Shakespeare and Marlowe. Literary and Linguistic Computing, 9(1), 1-6.
Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research. Design and interpretation.
Thousand Oaks, CA: Sage.
Mikros, G. K., & Argiri, E. K. (2007). Investigating topic influence in authorship attribution. In B. Stein, M.
Koppel & E. Stamatatos (Eds.), Proceedings of the SIGIR 2007 International Workshop on
Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (Vol. 276, pp. 29-35).
Amsterdam, Netherlands: CEUR.
Milroy, L. (1980). Language and social networks. Oxford: Blackwell.
Moir, A., & Jessel, D. (1992). Brain sex: The real difference between men and women. New York: Delta.
Mulac, A., Bradac, J. J., & Mann, S. K. (1985). Male/female language differences and attributional
consequences in children's television. Human Communication Research, 11(4), 481-506.
Mulac, A., Studley, L. B., & Blau, S. (1990). The gender-linked language effect in primary and secondary
students' impromptu essays. Sex roles, 23(9-10), 439-470.
Preisler, B. (1986). Linguistic sex roles in conversation: Social variation in the expression of tentativeness in
English. Berlin: Mouton de Gruyter.
Rayson, P., Leech, G., & Hodges, M. (1997). Social differentiation in the use of English vocabulary: some
analyses of the conversational component of the British National Corpus. International Journal of
Corpus Linguistics, 2(1), 133-152.
Shaywitz, B. A., Shaywitz, S. E., Pugh, K. R., Constable, T. R., Skudlarski, P., Fulbright, R. K., . . . Gore, J.
C. (1995). Sex differences in the functional organization of the brain for language. Nature, 373, 607-
609. doi: 10.1038/373607a0
Singh, S., & Tweedie, F. J. (1995). Neural networks and disputed authorship: New challenges. Artificial
Neural Networks (Vol. Conference Publication No. 409, pp. 24-28): IEEE.
Sommer, I., Aleman, A., Bouma, A., & Kahn, R. (2004). Do women really have more bilateral language
representation than men? A meta-analysis of functional imaging studies. Brain, 127(8), 1845-1852.
doi: 10.1093/brain/awh207
Stevens, J. P. (2002). Applied multivariate statistics for the social sciences (4th ed.). Hillsdale, NJ: Erlbaum.
Tearle, M., Taylor, K., & Demuth, H. B. (2008). An algorithm for automated authorship attribution using
neural networks. Literary and Linguistic Computing, 23(4), 425-442. doi: 10.1093/llc/fqn022
Trudgill, P. (1972). Sex, covert prestige and linguistic change in the urban British English of Norwich.
Language in Society, 1, 179-195.
Trudgill, P. (1978). On dialect: social and geographic factors. Oxford: Blackwell.
Tweedie, F. J., Singh, S., & Holmes, D. I. (1996). Neural network applications in stylometry: The Federalist
Papers. Computers and the Humanities, 30(1), 1-10. doi: 10.1007/BF00054024
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York: Springer.
Weinfurth, K. P. (1995). Multivariate Analysis of Variance. In L. G. Grim & P. R. Yarnold (Eds.), Reading
and understanding multivariate statistics (pp. 245-276). Washington, DC: American Psychological
Association.
Wolfram, W. (1969). A sociolinguistic description of Detroit negro speech. Washington, DC: Center for
Applied Linguistics.
, . . (2009). .
. : .
-, ., , . ., & , . (2011).
:
' , "
", ' , 7-8 2011, (pp. 341-371). : .

1

Mosteller & Wallace (1984, pp. 38-43)
1. a 15. do 29. is 43. or 57. this

2. all 16. down 30. it 44. our 58. to
3. also 17. even 31. its 45. shall 59. up
4. an 18. every 32. may 46. should 60. upon
5. and 19. for 33. more 47. so 61. was
6. any 20. from 34. must 48. some 62. were
7. are 21. had 35. my 49. such 63. what
8. as 22. has 36. no 50. than 64. when
9. at 23. have 37. not 51. that 65. which
10. be 24. her 38. now 52. the 66. who
11. been 25. his 39. of 53. their 67. will
12. but 26. if 40. on 54. then 68. with
13. by 27. in 41. one 55. there 69. would
14. can 28. into 42. only 56. things 70. your
.1.1 70 .
74. among 78. both 93. himself 102. perhaps 107. those
75. another 86. did 97. most 104. same 109. under
76. because 89. either 98. nor 105. second 114. where
77. between 90. enough 100. often 106. still 115. whether
.1.2 20 .
71. affect + ed 82. considerable + ly 90. enough105 99. offensive 112. violence
72. again 83. contribute 91. fortune + s 101. pass + es + ed + ing 113. voice
73. although 84. defensive 92. function + s 103. rapid 116. while
79. city + cities 85. destruction 94. innovation + s 108. throughout 117. whilst
80. commonly 87. direction 95. join + ed 110. vigour + ous
81. consequently 88. disgracing 96. language 111. violate + s + d + ing
.1.3 28
.
118. about 128. better 138. extent 148. necessary 158. substance
119. according 129. care 139. follow + s + d + ing 149. necessity + ies 159. they
120. adversaries 130. choice 140. I 150. others 160. though
121. after 131. common 141. imagine + s + d + ing 151. particularly 161. truth + s
105
20 .
122. aid 132. danger 142. intrust + s + ed + ing 152. principle 162. us
123. always 133. decide + s + d + ing 143. kind 153. probability 163. usage + s
124. apt 134. degree 144. large 154. proper 164. we
125. asserted 135. during 145. likely 155. propriety 165. work + s
126. before 136. expence + s 146. matter + s 156. provision + s
127. being 137. expense + s 147. moreover 157. requisite
.1.4 48 (index)
Hamilton Madison.
2
Cusum:
1. .
2. [3], [4] [5]
[3], [4] [5].
[1]

.
[2]
,
,
.
[3]
,
,

.
[3]

,
, ..
[4]
, ,
, ..
.
[4]
,
.
[5]
, ,
.
[5]

.
[6]
. . .. . . ,
,

.
[7]
, ,
, .
[8]

, , ,
, .
[9]
, ,
,
.
[10]
,
.
[11]
48 . . ,
.
[12]
, ,
.
[13]
, ,
.
[14]

.
[15]

.
[16]
.
[17]
,
.. . . . .
[18]
. .
, . .
.
[19]
13
.
[20]
. . ,
,
()
.
[21]
.
.
.
[22]
, . , ,
, .
[23]

3
100 (33 . ) [: Mikros, Hatzigeorgiu, and Carayannis (2005)]
%
1 1.100.288 3,249
2 812.275 2,398
3 741.501 2,190
4 663.393 1,959
5 655.688 1,936
6 562.969 1,662
7 544.160 1,607
8 488.331 1,442
9 457.177 1,350
10 456.818 1,349
11 434.826 1,284
12 413.229 1,220
13 360.554 1,065
14 323.736 0,956
15 298.051 0,880
16 291.686 0,861
17 283.774 0,838
18 277.486 0,819
19 273.556 0,808
20 261.261 0,771
21 252.017 0,744
22 249.837 0,738
23 248.474 0,734
24 247.900 0,732
25 237.779 0,702
26 197.211 0,582
27 133.188 0,393
28 107.023 0,316
29 100.820 0,298
30 99.225 0,293
31 98.644 0,291
32 91.626 0,271
33 90.189 0,266
34 87.751 0,259
35 84.221 0,249
36 81.290 0,240
37 79.403 0,234
38 75.751 0,224
39 73.629 0,217
40 70.962 0,210
41 70.940 0,209
42 67.505 0,199
43 63.578 0,188
44 63.199 0,187
45 61.433 0,181
46 58.342 0,172
47 56.963 0,168
48 56.427 0,167
49 56.323 0,166
50 49.533 0,146
51 46.433 0,137
52 44.286 0,131
53 43.897 0,130
54 43.726 0,129
55 43.390 0,128
56 42.663 0,126
57 42.369 0,125
58 42.285 0,125
59 42.105 0,124
60 39.411 0,116
61 35.628 0,105
62 35.544 0,105
63 34.989 0,103
64 34.511 0,102
65 34.094 0,101
66 33.210 0,098
67 31.847 0,094
68 31.826 0,094
69 30.219 0,089
70 30.160 0,089
71 29.967 0,088
72 29.956 0,088
73 29.582 0,087
74 29.532 0,087
75 28.625 0,085
76 27.896 0,082
77 27.521 0,081
78 27.501 0,081
79 27.442 0,081
80 27.353 0,081
81 26.889 0,079
82 26.656 0,079
83 26.611 0,079
84 25.348 0,075
85 25.291 0,075
86 25.233 0,075
87 25.154 0,074
88 24.997 0,074
89 24.961 0,074
90 23.530 0,069
91 23.502 0,069
92 23.215 0,069
93 22.838 0,067
94 22.788 0,067
95 22.697 0,067
96 22.523 0,067
97 22.154 0,065
98 22.141 0,065
99 21.552 0,064
100 21.434 0,063
4

1 700 386 293 45
2 1400 697 539 78
3 2100 979 745 114
4 2800 1222 922 148
5 3500 1476 1116 175
6 4200 1692 1256 226
7 4900 1926 1434 252
8 5600 2159 1611 272
9 6300 2391 1792 294
10 7000 2593 1936 312
11 7700 2805 2092 339
12 8400 3012 2245 367
13 9100 3210 2380 394
14 9800 3376 2493 421
15 10500 3581 2635 458
16 11200 3773 2777 479
17 11900 3952 2891 516
18 12600 4147 3037 534
19 13300 4337 3181 554
20 14000 4516 3297 590
21 14700 4712 3432 616
22 15400 4880 3540 643
23 16100 5056 3652 674
24 16800 5221 3765 693
25 17500 5386 3888 714
26 18200 5546 3977 754
27 18900 5679 4056 769
28 19600 5823 4147 790
29 20300 5975 4243 811
30 21000 6140 4356 816
31 21700 6335 4499 844
32 22400 6468 4566 886
33 23100 6587 4625 902
34 23800 6709 4683 932
35 24500 6833 4742 963
36 25200 6959 4804 987
37 25900 7093 4872 1015
38 26600 7207 4926 1043
39 27300 7321 4982 1061
40 28000 7436 5058 1064
Mikros, G. K.,
Hatzigeorgiu, N., & Carayannis, G. (2005). Basic quantitative characteristics of the Modern Greek Language
using the Hellenic National Corpus. Journal of Quantitative Linguistics, 12(2-3), 167-184.
