
Frontiers of Computational Journalism
Columbia Journalism School
Week 3: Text Analysis

September 19, 2014



Basic idea: quantitative information can tell stories

When Hu Jintao came to power in 2002, China was already experiencing a worsening social crisis. In 2004, President Hu offered a rhetorical response to growing internal instability, trumpeting what he called a "harmonious society." For some time, this new watchword burgeoned, becoming visible everywhere in the Party's propaganda.

But by 2007 it was already on the decline, as "stability preservation" made its rapid ascent. ... Together, these contrasting pictures of the "harmonious society" and "stability preservation" form a portrait of the real predicament facing President Hu Jintao. A "harmonious society" may be a pleasing idea, but it's the iron will behind "stability preservation" that packs the real punch.

- Qian Gang, Watchwords: Reading China through its Party Vocabulary
Google Ngrams Viewer - 12% of all books ever published

Data can give a wider view
Let me talk about Downton Abbey for a minute. The show's popularity has led many nitpickers to draft up lists of mistakes. ... But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off, and checking the online sources for earliest use.

I lack such social graces. So I thought: why not just check every single line in the show for historical accuracy? ... So I found some copies of the Downton Abbey scripts online, and fed every single two-word phrase through the Google Ngram database to see how characteristic of the English language, c. 1917, Downton Abbey really is.

- Ben Schmidt, Making Downton more traditional
Bigrams that do not appear in English books between 1912 and 1921.

Bigrams that are at least 100 times more common today than they were in 1912-1921.
Documents, not words

We can use clustering techniques if we can convert documents into vectors.

As before, we want to find numerical "features" that describe the document.

How do we capture the meaning of a document in numbers?
What is this document "about"?

The most commonly occurring words are a pretty good indicator:

30 the
23 to
19 and
19 a
18 animal
17 cruelty
15 of
15 crimes
14 in
14 for
11 that
8 crime
7 we
Turns out features = words works fine

Encode each document as the list of words it contains.

Dimensions = vocabulary of the document set.

Value on each dimension = number of times the word appears in the document.
Example

D1 = "I like databases"
D2 = "I hate hate databases"

      I   like   hate   databases
D1    1    1      0        1
D2    1    0      2        1

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = "term frequency"
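A minimal sketch of this encoding in Python (variable names are my own):

```python
from collections import Counter

docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# The dimensions of the vector space: every word in the document set.
vocab = sorted({w.lower() for text in docs.values() for w in text.split()})
print(vocab)  # ['databases', 'hate', 'i', 'like']

# Each row of the term-document matrix is one document vector;
# each entry is tf(t, d), the count of that word in that document.
for name, text in docs.items():
    counts = Counter(w.lower() for w in text.split())
    print(name, [counts[t] for t in vocab])
# D1 [1, 0, 1, 1]
# D2 [1, 2, 1, 0]
```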


Aka "bag of words" model

Throws out word order.

E.g. "soldiers shot civilians" and "civilians shot soldiers" encoded identically.
Tokenization

The documents come to us as long strings, not individual words. Tokenization is the process of converting the string into individual words, or "tokens."

For this course, we will assume a very simple strategy:
- convert all letters to lowercase
- remove all punctuation characters
- separate words based on spaces

Note that this won't work at all for Chinese. It will fail in some ways even for English. How?
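A minimal sketch of this three-step strategy in Python (function name is my own):

```python
import string

def tokenize(text):
    """Lowercase, strip punctuation, then split on whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

print(tokenize("Soldiers shot; civilians fled."))
# ['soldiers', 'shot', 'civilians', 'fled']
```

One way it fails even for English: stripping punctuation turns "don't" into "dont" and "U.S." into "us", and hyphenated words like "self-evident" collapse into a single token.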
Distance function

Useful for:
- clustering documents
- finding docs similar to an example
- matching a search query

Basic idea: look for overlapping terms.
Cosine similarity

Given document vectors a and b, define:

$$ \text{similarity}(a, b) \equiv a \cdot b $$

If each word occurs exactly once in each document, this is equivalent to counting overlapping words.

Note: this is not a distance function, as similarity increases when documents are similar. (What part of the definition of a distance function is violated here?)
Problem: long documents always win

Let a = "This car runs fast."
Let b = "My car is old. I want a new car, a shiny car"
Let query q = "fast car"

     this  car  runs  fast  my  is  old  I  want  a  new  shiny
a      1    1    1     1    0   0    0   0    0   0   0     0
b      0    3    0     0    1   1    1   1    1   1   1     1
q      0    1    0     1    0   0    0   0    0   0   0     0
Problem: long documents always win

similarity(a, q) = 1*1 [car] + 1*1 [fast] = 2
similarity(b, q) = 3*1 [car] + 0*1 [fast] = 3

The longer document is "more similar", by virtue of repeating words.
Normalize document vectors

$$ \text{similarity}(a, b) \equiv \frac{a \cdot b}{\lvert a \rvert \, \lvert b \rvert} = \cos\theta $$

Returns a result in [0, 1].
Normalized query example

     this  car  runs  fast  my  is  old  I  want  a  new  shiny
a      1    1    1     1    0   0    0   0    0   0   0     0
b      0    3    0     0    1   1    1   1    1   1   1     1
q      0    1    0     1    0   0    0   0    0   0   0     0

$$ \text{similarity}(a, q) = \frac{2}{\sqrt{4}\,\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707 $$

$$ \text{similarity}(b, q) = \frac{3}{\sqrt{17}\,\sqrt{2}} \approx 0.514 $$
Cosine similarity

$$ \cos\theta = \text{similarity}(a, b) \equiv \frac{a \cdot b}{\lvert a \rvert \, \lvert b \rvert} $$

Cosine distance (finally)

$$ \text{dist}(a, b) \equiv 1 - \frac{a \cdot b}{\lvert a \rvert \, \lvert b \rvert} $$
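A sketch of this distance function in plain Python, checked against the query example above (vector layout follows the table):

```python
import math

def cosine_distance(a, b):
    """dist(a, b) = 1 - (a . b) / (|a| |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms

# this car runs fast my is old I want a new shiny
a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
q = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(cosine_distance(a, q))  # ~0.293: the short relevant doc is now closer
print(cosine_distance(b, q))  # ~0.486: ...than the long repetitive one
```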
Problem: common words

We want to look at words that "discriminate" among documents.

Stopwords: if all documents contain "the," are all documents similar?

Common words: if most documents contain "car" then "car" doesn't tell us much about (contextual) similarity.
Context matters

[Diagram: two document sets, "Car Reviews" and "General News"; filled marker = contains "car", empty marker = does not contain "car"]
Document Frequency

Idea: de-weight common words.
Common = appears in many documents.

"Document frequency" = fraction of docs containing the term:

$$ df(t, D) = \frac{\lvert \{\, d \in D : t \in d \,\} \rvert}{\lvert D \rvert} $$
Inverse Document Frequency

Invert (so more common = smaller weight) and take the log:

$$ idf(t, D) = \log \frac{\lvert D \rvert}{\lvert \{\, d \in D : t \in d \,\} \rvert} $$
TF-IDF

Multiply term frequency by inverse document frequency:

$$ tfidf(t, d, D) = tf(t, d) \cdot idf(t, D) = n(t, d) \cdot \log \frac{\lvert D \rvert}{n(t, D)} $$

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
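A direct transcription of this formula into Python (toy corpus is my own; it assumes the term occurs in at least one document, since otherwise n(t, D) = 0):

```python
import math

def tf_idf(term, doc, corpus):
    """tfidf(t, d, D) = n(t, d) * log(|D| / n(t, D))"""
    n_td = doc.count(term)                      # times t appears in d
    n_tD = sum(1 for d in corpus if term in d)  # docs in D containing t
    return n_td * math.log(len(corpus) / n_tD)

corpus = [
    ["the", "car", "runs", "fast"],
    ["the", "car", "is", "old"],
    ["the", "dog", "runs"],
]
print(tf_idf("the", corpus[0], corpus))   # 0.0   -- in every doc, no signal
print(tf_idf("fast", corpus[0], corpus))  # ~1.10 -- rare, discriminating
```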
TF-IDF depends on the entire corpus

The TF-IDF vector for a document changes if we add another document to the corpus:

$$ tfidf(t, d, D) = tf(t, d) \cdot idf(t, D) $$

If we add a document, D changes!

TF-IDF is sensitive to context. The context is all other documents.
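Continuing the toy corpus above, we can watch a document's score for "car" drop when a new document mentioning "car" arrives, even though the document itself never changed:

```python
print(tf_idf("car", corpus[0], corpus))  # log(3/2) ~ 0.405

corpus.append(["my", "new", "car"])      # D changes...
print(tf_idf("car", corpus[0], corpus))  # ...so the score drops: log(4/3) ~ 0.288
```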
What is this document "about"?

Each document is now a vector of TF-IDF scores for every word in the document. We can look at which words have the top scores.


crimes 0.0675591652263963
cruelty 0.0585772393867342
crime 0.0257614113616027
reporting 0.0208838148975406
animals 0.0179258756717422
michael 0.0156575858658684
category 0.0154564813388897
commit 0.0137447439653709
criminal 0.0134312894429112
societal 0.0124164973052386
trends 0.0119505837811614
conviction 0.0115699047136248
patterns 0.011248045148093

Salton's description of tf-idf

- from Salton, Wong & Yang, A Vector Space Model for Automatic Indexing, 1975
D0).>/01? 1973
[Figures: the same NJ-senator-menendez corpus (Overview sample files) visualized with TF and with TF-IDF; color = human tags generated from TF-IDF clusters]
Cluster Hypothesis

"documents in the same cluster behave similarly with respect to relevance to information needs"

- Manning, Raghavan & Schütze, Introduction to Information Retrieval

Not really a precise statement, but it is the crucial link between human semantics and mathematical properties.

Articulated as early as 1971; it has been shown to hold at web scale and is widely assumed.
Bag of words + TF-IDF is hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme in use today (Lucene, FAST, Google). Many variants.

There is some, but not much, theory to explain why this works. (E.g., why that particular idf formula? Why doesn't indexing bigrams improve performance?)
Collectively: the vector space document model
Problem Statement

Can the computer tell us the "topics" in a document set? Can the computer organize the documents by "topic"?

Note: TF-IDF tells us the topics of a single document, but here we want the topics of an entire document set.
Simplest possible technique

Sum TF-IDF scores for each word across the entire document set, and choose the top-ranking words.

This is how Overview generates cluster descriptions. It will also be your first homework assignment.
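A sketch of this technique with scikit-learn's TfidfVectorizer (toy documents are my own, and sklearn's smoothed idf differs slightly from the formula above, so this shows the idea rather than Overview's exact implementation):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "animal cruelty crimes are rising",
    "reporting on animal cruelty",
    "crime trends and conviction patterns",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # docs x terms matrix of TF-IDF scores

# Sum each term's score across the entire document set,
# then print the top-ranking words as a description of the set.
totals = np.asarray(X.sum(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
for term, score in sorted(zip(terms, totals), key=lambda p: -p[1])[:5]:
    print(term, round(score, 3))
```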
Topic Modeling Algorithms

Basic idea: reduce the dimensionality of the document vector space, so that each dimension is a topic.

Each document is then a vector of topic weights. We want to figure out what dimensions and weights give a good approximation of the full set of words in each document.

Many variants: LSI, pLSI, LDA, NMF

Matrix Factorization

Approximate the term-document matrix V as the product of two lower-rank matrices:

$$ V \approx W H $$

V: m docs x n terms
W: m docs x r "topics"
H: r "topics" x n terms

A "topic" is a group of words that occur together: each row of H is the pattern of words in one topic.
Non-negative Matrix Factorization

All elements of the document coordinate matrix W and the topic matrix H must be >= 0.

There is a simple iterative algorithm to compute the factorization.

We still have to choose the number of topics r.
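A sketch with scikit-learn's NMF implementation (toy corpus and r = 2 are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car runs fast", "a new shiny car", "my car is old",
    "animal cruelty crimes", "crimes against animals", "cruelty conviction",
]

vectorizer = TfidfVectorizer()
V = vectorizer.fit_transform(docs)  # V: m docs x n terms

r = 2                               # number of topics, chosen by hand
model = NMF(n_components=r, init="nndsvd", random_state=0)
W = model.fit_transform(V)          # m docs x r topics: document coordinates
H = model.components_               # r topics x n terms: word pattern per topic

terms = vectorizer.get_feature_names_out()
for k in range(r):
    top = np.argsort(H[k])[::-1][:4]
    print("topic", k, ":", [terms[i] for i in top])
```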


Latent Dirichlet Allocation

Imagine that each document is written by someone going through the following process:

1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

A document has a distribution of topics.
Each topic is a distribution of words.
LDA tries to find these two sets of distributions.
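A sketch of fitting LDA with scikit-learn's variational implementation (one of several ways to fit the model; toy corpus as before):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the car runs fast", "a new shiny car", "my car is old",
    "animal cruelty crimes", "crimes against animals", "cruelty conviction",
]

# LDA works on raw term counts, not TF-IDF.
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a document's topic mixture p(z|d)
print(doc_topics[0])               # e.g. mostly one topic, a little of the other

# lda.components_ holds per-topic word weights, i.e. (unnormalized) p(w|z).
```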

"uocumenLs"
LuA models each documenL as a dlsLrlbuuon over Loplcs. Lach
word belongs Lo a slngle Loplc.
"1oplcs"
LuA models a Loplc as a dlsLrlbuuon over all Lhe words ln Lhe
corpus. ln each Loplc, some words are more llkely, some are less
llkely.
Dimensionality reduction

The output of NMF and LDA is a vector of much lower dimension for each document ("document coordinates in topic space").

Dimensions are "concepts" or "topics" instead of words.

We can measure cosine distance, cluster, etc. in this new space.
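For example, a sketch of clustering documents in topic space, reusing the W matrix (docs x topics) from the NMF sketch above:

```python
from sklearn.cluster import KMeans

# W comes from the earlier NMF sketch: one low-dimensional
# topic-space coordinate vector per document.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(W))  # cluster id per document
```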
