Text Categorization

Moshe Koppel

Lecture 5: Author Profiling
With Shlomo Argamon, Jonathan Schler, James Pennebaker, Kfir Zigdon and others

Profiling
In real life:
1. We don’t have a closed set of candidate authors
2. We don’t have writing samples from each of them
We can still try to say something about the author:
    Gender
    Age group
    Linguistic background
    …

Which is Male/Female?
• My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device. Nor does he specify what kind of effects might be achieved by a reformulation or explain how it achieves those effects. In this paper I follow Sperber and Wilson's (1986) suggestion that rhetorical devices like metaphor, irony and repetition are particular means of achieving relevance. As I have suggested, the corrections that are made in unplanned discourse are also made in the pursuit of optimal relevance. However, these are made because the speaker recognises that the original formulation did not achieve optimal relevance.
• The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis. The text chosen for analysis is a short story by Ernest Hemingway comprising only 11 sentences. A jumbled version of this story is presented to students who are asked to assemble a cohesive and well formed version of the story. Their re-constructions are then compared with the original Hemingway version.

British National Corpus
• 920 documents labelled for
– author gender
– document genre
• Used 566 controlled for genre

Genre / gender                # of docs
Fiction / Female                    132
Fiction / Male                      132
Non-fiction / Female                151
Non-fiction / Male                  151

Non-fiction genre             # of docs
Arts (Non-academic)                  16
Arts (Academic)                      24
Belief & Thought                     24
Biography                            54
Commerce                             10
Leisure                              16
Science                              26
Soc. Sci. (Non-ac.)                  52
Soc. Sci. (Ac.)                      38
World Affairs                        42

Experiment
Features: 400+ function words (FW); 600+ part-of-speech (POS) n-grams
Learner: exponential gradient / linear SVM
Test: 10-fold cross-validation
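
The feature and test setup above can be sketched roughly as follows. This is a minimal illustration, not the lecture's actual pipeline: the short `FUNCTION_WORDS` list, the tokenizer, and the helper names are assumptions made for the example (the real experiment uses 400+ function words plus POS n-grams, and the learner itself is not shown here).

```python
import re
from collections import Counter

# Hypothetical, tiny function-word list; the lecture uses 400+ words.
FUNCTION_WORDS = ["the", "a", "of", "for", "with", "and", "she", "he", "not", "that"]


def function_word_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]


def k_fold_indices(n_docs, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test
```

Each document becomes one such vector; a classifier is then trained on nine folds and tested on the tenth, ten times over.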

Results per Feature Set
[Figure: bar chart of accuracy (%), axis from 50 to 85, for the feature sets FW, POS and FW+POS, on all documents, fiction, and non-fiction]
• Handle fiction and non-fiction separately
• Use full feature set

Results per Genre
[Table: accuracy (%) on each test genre when training on all documents, on fiction only, and on non-fiction only; the accuracy cells cannot be assigned to rows and columns from the extracted text]

Testing on genre                      # of docs
Fiction                                     264
  Fiction / Female                          132
  Fiction / Male                            132
Non-fiction                                 302
  Non-fiction / Female                      151
  Non-fiction / Male                        151
  Arts (Non-academic)                        16
  Arts (Academic)                            24
  Belief & Thought                           24
  Biography                                  54
  Commerce                                   10
  Leisure                                    16
  Science                                    26
  Social Science (Non-academic)              52
  Social Science (Academic)                  38
  World Affairs                              42

Learning-Based Feature Reduction
• Apply learning algorithm
• Eliminate features with low weights
• Learn again
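
The loop above might look like the following sketch. A plain perceptron stands in for the lecture's learners (exponential gradient / linear SVM) purely to keep the example self-contained; the function names and the halving schedule passed in `sizes` are assumptions for illustration.

```python
def train_perceptron(X, y, features, epochs=20):
    """Train a perceptron on the given feature indices; labels are +1 / -1.

    Returns a dict mapping feature index -> learned weight.
    """
    w = {f: 0.0 for f in features}
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = b + sum(w[f] * x[f] for f in features)
            if label * score <= 0:  # misclassified: update weights
                for f in features:
                    w[f] += label * x[f]
                b += label
    return w


def reduce_features(X, y, sizes=(128, 64, 32, 16, 8)):
    """Repeatedly learn, keep the highest-|weight| features, and learn again.

    Returns a dict mapping each target size to the surviving feature indices.
    """
    features = list(range(len(X[0])))
    history = {}
    for size in sizes:
        weights = train_perceptron(X, y, features)
        if size < len(features):
            ranked = sorted(features, key=lambda f: abs(weights[f]), reverse=True)
            features = ranked[:size]
        history[size] = sorted(features)
    return history
```

Accuracy can then be measured at each size by cross-validating a model restricted to `history[size]`.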

Results: Feature Reduction (Fiction)
[Figure: accuracy (axis from 0.6 to 0.9) against number of features (all, 128, 64, 32, 16, 8) for the FW+POS, FW and POS feature sets]

Results: Feature Reduction (Non-fiction)
[Figure: accuracy (axis from 0.6 to 0.9) against number of features (all, 128, 64, 32, 16, 8) for the FW+POS, POS and FW feature sets]

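What are the Distinguishing Features?
• Fiction
– Male: a, ...
– Female: she, ..., not
• Non-Fiction
– Male: that, ...
– Female: she, ...
Other listed features, whose group is not recoverable from the extracted text: the, as, one, of, and, in, for (twice), with (twice), PNP, PRP, AT0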

Feature Frequencies
[Table: mean frequency (μ ± stderr) for male and female authors, in fiction and in non-fiction. Features (rows): PNP, he, she, AT0, DT0, the, XX0, PRP, PRF, for, with, and]

Summary: Male vs. Female Style
Males use more (informational features):
• Determiners
• Adjectives
• of modifiers (e.g. pot of gold)
Females use more (involvedness features):
• Pronouns
• for and with
• Negation
• Present tense

Blog Corpus
• 85,000 blogs
• blogger-provided profiles (gender, age, occupation, astrological sign)
• harvested August 2004
• non-text ignored (formatting, quoting)

Example 1
Yesterday we had our second jazz competition. Thank God we weren't competing. We were sooo bad. Like, I was so ashamed, I didn't even want to talk to anyone after. I felt so rotton, and I wanted to cry, but... it's ok.

Example 2
My gracious boss had agreed to let me have one week off of "work." He did finally give me my report back after eight freakin' days! Now I only have the rest of this week and then one full week after my vacation to finish this damned thing.

Example 3
So about a month or two ago, I met Katy N. at a party in New York. Katy's friend, Kevin M., whom she met while living in Barcelona last year, lives in Miami and is working on getting a TV series produced. Kevin is friends with a guy named Charlie P.

Blog Corpus

age          female     male    Total
unknown       12287    12259    24546
13-17          6949     4120    11069
18-22          7393     7690    15083
23-27          4043     6062    10105
28-32          1686     3057     4743
33-37           860     1827     2687
38-42           374      819     1193
43-48           263      584      847
>48             314      906     1220
Total         34169    37324    71493

Final balanced corpus:
• 19,320 total blogs
– 8240 in “10s”
– 8086 in “20s”
– 2994 in “30s”
• 681,288 total posts
• 141,106,859 total words

Experimental Setup
Feature sets:
• Content: words (filtered by infogain on train set)
• Style: parts-of-speech, function words, blog slang
Learning algorithms:
• Real-valued balanced winnow (RBW)
• Bayesian Multinomial Regression (BMR)
Evaluation: 10-fold cross-validation
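
Filtering content words by information gain could be sketched as below. The slide does not say how gain is computed, so this assumes the common binary formulation (word present or absent in a document), and the function names are illustrative only. In the setup above the ranking would be computed on the training folds only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in document'.

    docs is a list of sets of words, aligned with labels.
    """
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without_word = [lab for doc, lab in zip(docs, labels) if word not in doc]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_word, without_word) if part
    )
    return entropy(labels) - conditional


def top_words_by_infogain(docs, labels, k):
    """Return the k words with highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda w: (-information_gain(docs, labels, w), w))
    return ranked[:k]
```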

Age: Classification

        Style & Content    Function Words    Content Words
RBW          75.0%              67.7%             75.9%
BMR          77.4%              69.4%             76.2%

The lifecycle of the common blogger...
[Tables: frequency of each word among bloggers in their 10s, 20s and 30s, shown for three groups of ten words]
• bored, boring, awesome, mad, homework, mum, maths, dumb, sis, crappy
• college, bar, apartment, beer, student, drunk, album, dating, semester, someday
• son, local, marriage, development, tax, campaign, provide, democratic, systems, workers

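Gender: Classification
[Table: accuracy for the feature sets Style & Content, Style Words and Content Words, with rows RBW and BMR. RBW reaches 80.0% with Style & Content; the other two RBW values are 77.0% and 73.0%. No BMR values appear.]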

Men are from Mars, Women are from Venus...
[Table: frequency (value ± error) for male and female bloggers of the LIWC categories job, money, sports, tv, sex, family, eating, friends, sleep, pos-emotions and neg-emotions]

Relating Age & Gender
• Let's examine the connection between age and gender a little more generally.
• Consider the most distinctive words for both Age and Gender:
– Intersection of the 1000 words with highest Age information gain and the 1000 words with highest Gender information gain
– Total of 316 words
– Consider log(30s/10s) vs. log(male/female)
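
A rough sketch of how the two log ratios could be computed for the shared words. The natural logarithm, the small smoothing constant, and the layout of the `freq` dictionary are all assumptions made for this example; the slide specifies none of them.

```python
import math


def log_ratio(freq_a, freq_b, smoothing=1e-6):
    """Log of the ratio of two frequencies, smoothed to avoid division by zero."""
    return math.log((freq_a + smoothing) / (freq_b + smoothing))


def age_gender_points(age_top_words, gender_top_words, freq):
    """Return {word: (log(male/female), log(30s/10s))} for words in both lists.

    freq[word] is assumed to be a dict with the keys
    'male', 'female', '10s' and '30s' holding relative frequencies.
    """
    shared = sorted(set(age_top_words) & set(gender_top_words))
    return {
        word: (
            log_ratio(freq[word]["male"], freq[word]["female"]),
            log_ratio(freq[word]["30s"], freq[word]["10s"]),
        )
        for word in shared
    }
```

Each resulting pair is one point in a scatter plot with log(male/female) on the horizontal axis and log(30s/10s) on the vertical axis.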

Relating Age & Gender
[Figure: scatter plot of log(30s/10s) (vertical axis, -8 to 8) against log(male/female) (horizontal axis, -2 to 2) for the shared words; the point for “husband” is labelled]

Native Language
Given English text, can we determine the author’s native language?

Try it yourself.
These were written by Russian, French and Spanish speakers. Can you tell which is which?
• In the second part of this outhor’s novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.
• There is one more kind of films irritating many television viewers - "soap" serials. «Santa Barbara» has even won "Oskar" prize.
• There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.

Possible Clues
Patterns of native language are typically reflected in how other languages are spoken (Rado61, Corder81):
• Word selection
• Syntax
• Spelling

Measurable Features for Automated Native Language Detection
• Frequency of function words
• Frequency of letter sequences (adapted from Peng+ 04)
• Idiosyncrasies
We will gather idiosyncrasies data automatically.
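
Letter-sequence frequencies can be illustrated with a small character n-gram counter. The slide does not give the sequence length, so `n=2` here is only an assumption, and the function name is hypothetical.

```python
from collections import Counter


def letter_ngram_frequencies(text, n=2):
    """Return the relative frequency of each character n-gram in the text."""
    chars = text.lower()
    grams = [chars[i:i + n] for i in range(len(chars) - n + 1)]
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero on very short text
    return {gram: count / total for gram, count in counts.items()}
```

A fixed list of the most frequent sequences in the corpus would then define the feature positions, with each document contributing its own frequencies.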

Orthographic Idiosyncrasies
• Repeated letter (e.g. remmit instead of remit)
• Double letter appears once (e.g. comit instead of commit)
• Letter α instead of β (e.g. firsd instead of first)
• Letter inversion (e.g. fisrt instead of first)
• Inserted letter (e.g. friegnd instead of friend)
• Missing letter (e.g. frend instead of friend)
• Conflated words (e.g. stucktogether)

Syntactic Idiosyncrasies
• Sentence Fragment
• Run-on Sentence
• Repeated Word
• Missing Word
• Mismatched Singular/Plural
• Mismatched Tense
• that/which confusion
• Rare POS pairs (Chodorow-Leacock 00)

Automatically Finding Idiosyncrasies
1. Run text through automated spell/grammar checker
2. Compare flagged word to best suggestion
3. Mark error accordingly
e.g. text=remmit suggestion=remit mark as “repeated letter”
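
Steps 2 and 3 can be sketched as a comparison between the flagged word and the checker's best suggestion. The spell checker itself is not shown: the (word, suggestion) pair is assumed to come from whatever tool is used, and the rules below are one plausible reading of the orthographic categories, not the study's actual implementation.

```python
def classify_spelling_error(word, suggestion):
    """Label the difference between a flagged word and its suggested correction."""
    if " " in suggestion and suggestion.replace(" ", "") == word:
        return "conflated words"

    # Flagged word has one extra letter.
    if len(word) == len(suggestion) + 1:
        for i in range(len(word)):
            if word[:i] + word[i + 1:] == suggestion:
                doubled = (i > 0 and word[i] == word[i - 1]) or (
                    i + 1 < len(word) and word[i] == word[i + 1]
                )
                return "repeated letter" if doubled else "inserted letter"

    # Flagged word lacks one letter.
    if len(word) + 1 == len(suggestion):
        for i in range(len(suggestion)):
            if suggestion[:i] + suggestion[i + 1:] == word:
                doubled = (i > 0 and suggestion[i] == suggestion[i - 1]) or (
                    i + 1 < len(suggestion) and suggestion[i] == suggestion[i + 1]
                )
                return "double letter appears once" if doubled else "missing letter"

    # Same length: one substituted letter or two adjacent letters swapped.
    if len(word) == len(suggestion):
        diffs = [i for i in range(len(word)) if word[i] != suggestion[i]]
        if len(diffs) == 1:
            return "letter substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and word[diffs[0]] == suggestion[diffs[1]]
            and word[diffs[1]] == suggestion[diffs[0]]
        ):
            return "letter inversion"

    return "other"
```

Counting how often each label occurs in a document gives the error-type features.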

Summary: Features Used
• 400 function words
• 200 letter sequences
• 185 error types
• 250 rare POS pairs
Each document is represented as a numerical vector of length 1035

Test Corpus
International Corpus of Learner English (Granger98)
• 11 countries
• Subjects same age, proficiency level
• Samples same genre, length
• Actually used in study: 258 docs from each of
– France
– Spain
– Bulgaria
– Czech Rep.
– Russia

SVM Classification Accuracy (10-fold CV)
[Figure: bar chart of accuracy (%), axis from 30 to 90, for the feature sets Errors, Letter n-grams, Function words, and Function words + Letter n-grams. Shaded bars: without error features; white bars: with error features. Baseline = 20%]

Confusion Matrix

                          Classified As
Actual       Czech   French   Bulgarian   Russian   Spanish
Czech          209        1          18        20        10
French           9      219          13        12         5
Bulgarian       14        8         211        18         7
Russian         24        8          24       194         8
Spanish         16       10          10         7       215
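
Overall accuracy and per-language recall can be read off the matrix with a few lines of code. The sketch assumes rows are actual languages and columns are predicted languages, in the order Czech, French, Bulgarian, Russian, Spanish; that reading of the flattened slide is an assumption, chosen because it makes every row sum to the 258 documents per language.

```python
LANGUAGES = ["Czech", "French", "Bulgarian", "Russian", "Spanish"]

# Rows: actual language; columns: classified as (same order as LANGUAGES).
# Assumed reading of the slide's matrix; each row sums to 258.
CONFUSION = [
    [209, 1, 18, 20, 10],
    [9, 219, 13, 12, 5],
    [14, 8, 211, 18, 7],
    [24, 8, 24, 194, 8],
    [16, 10, 10, 7, 215],
]


def accuracy(matrix):
    """Fraction of all documents that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def per_class_recall(matrix, labels):
    """Fraction of each actual class that was classified correctly."""
    return {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}
```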

What Gives It Away?
• Russian – over, the (infrequent), number_reladverb
• French – indeed, Mr (no period), misused o (e.g. outhor)
• Spanish – c-q confusion (e.g. cuality), m-n confusion (e.g. confortable), undoubled consonant (e.g. comit)
• Bulgarian – most_ADVERB, cannot (uncontracted)
• Czech – doubled consonant (e.g. remmit)

Let’s look back at our examples. Now it’s pretty obvious.
• French: In the second part of this outhor’s novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.
• Russian: There is one more kind of films irritating many television viewers - "soap" serials. «Santa Barbara» has even won "Oskar" prize.
• Spanish: There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.

Real-Life Issues
• Many candidate languages
• Very short texts
• Unpredictable English proficiency

Personality
• Pennebaker data:
– Students wrote essays
– Same students took personality assessment tests
• Experiment: Given text, determine if author is
– Open
– Conscientious
– Neurotic
– Extroverted
– Agreeable

Accuracy Results
– Open 66%
– Conscientious 65%
– Neurotic 63%
– Extroverted 62%
– Agreeable 60%

Key Features
• Openness
– consciousness, ...
– ..., team
• Conscientiousness
– school, ...
– ...
Other listed words, whose list is not recoverable from the extracted text: always, bad, damn, feel, football, friends, grades, hate, high, home, hope, maybe, more, strange, thoughts, you (twice)
