
Deceiving Authorship Detection
Tools to Write Anonymously & Current Trends in Adversarial Stylometry.

Michael Brennan, Sadia Afroz and Rachel Greenstadt. Drexel University.
Privacy, Security and Automation Lab
Faculty
Dr. Rachel Greenstadt
Graduate Students
Sadia Afroz (Deception Detection Lead)
Diamond Bishop
Michael Brennan
Aylin Caliskan
Ariel Stolerman (JStylo Lead Developer)
Undergraduate Students
Pavan Kantharaju
Andrew McDonald (Anonymouth Lead Developer)
26C3/28C3 Diff
Review & Updated Analysis of 26C3 Material
New Corpus (45 authors)
New Method (Writeprints)
Much more robust results.
The tools we discussed are now built!
JStylo
Anonymouth
Detecting Deception in Adversarial Writing

An Overview
What is Authorship Recognition and Adversarial Stylometry?
What is the anonymity threat?
Analyzing & Deceiving Authorship Recognition
Two Tools
JStylo
Anonymouth
Detecting Deception
What is Authorship Recognition?
The basic question: who wrote this document?
Stylometry: The study of attributing authorship to documents based only on the linguistic style they exhibit.
Linguistic Style Features: sentence length, word choices, syntactic structure, etc.
Handwriting, content-based features, and contextual features are not considered.
Individuals have unique writing styles because language is learned on an individual basis.
In this presentation, stylometry and authorship recognition are used interchangeably.
What is Adversarial Stylometry?
Adversarial Stylometry: Applying deception to writing style in order to affect the outcome of stylometric analysis.
But, is writing style modifiable? (Yes!)
Is it possible to deceive stylometry through altered writing style? (Yes!)
What are the implications of looking at stylometry in an adversarial context?
How Can Stylometry be a Threat?
Supervised Stylometry
Given a set of documents of known authorship, classify a document of unknown authorship.
Hypothetical Scenario: Alice the Anonymous Blogger vs. Bob the Abusive Employer.
Unsupervised Stylometry
Given a set of documents of unknown authorship, cluster them into author groups.
Hypothetical Scenario: Anonymous Forum vs. Oppressive Government.
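The supervised case can be illustrated with a small sketch. This is not one of the methods evaluated in this talk: it assumes a tiny, hand-picked function-word list and a nearest-centroid rule, chosen only to show the shape of the problem (known documents in, one author label out).

```python
from collections import Counter
import math

# Assumed, tiny function-word list; real systems use far richer feature sets.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "i"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def attribute(known_docs, unknown_text):
    """known_docs maps author -> list of texts; returns the closest author."""
    target = profile(unknown_text)
    best_author, best_distance = None, math.inf
    for author, texts in known_docs.items():
        center = centroid([profile(t) for t in texts])
        distance = math.dist(center, target)  # Euclidean distance
        if distance < best_distance:
            best_author, best_distance = author, distance
    return best_author
```

In the Alice and Bob scenario, `known_docs` would hold each employee's known writing and `unknown_text` the anonymous blog post.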
Purely Hypothetical?
Previous examples are purely hypothetical. What about a real example?
From Inside WikiLeaks by Daniel Domscheit-Berg:
"I nudged Julian with my foot. We exchanged glances and started giggling. If someone had run WikiLeaks documents through such a program, he would have discovered that the same two people were behind all the various press releases, document summaries, and correspondence issued by the project. The official number of volunteers we had was also, to put it mildly, grotesquely exaggerated."
Adversarial Stylometry: A Review
Understand the threat model.
Build a corpus.
Evaluate current methods of stylometry against adversarial text.
Analyze results and develop tools.
Threat Model
Threat: Authorship recognition can identify you if there are sufficient writing samples and a set of suspects.
6500+ words of training data per author
500+ words of testing data
50 or fewer suspects
These may be different:
Tweets (short messages)
Large numbers of authors (Writeprints)
Old assumption: Writing style is invariant.
It's like a fingerprint, you can't really change it.
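The numeric conditions in the threat model can be written down as a simple check. The sketch below is only an illustration: the function name and parameters are hypothetical, the defaults are the figures quoted on the slide, and they are kept as parameters because the slide notes they may differ for short messages or large author sets.

```python
def word_count(text):
    """Whitespace-delimited word count."""
    return len(text.split())

def within_threat_model(training_texts, test_text, num_suspects,
                        min_train_words=6500, min_test_words=500,
                        max_suspects=50):
    """True if one author's data meets the thresholds quoted in the threat model.

    training_texts: known writing samples for a single suspect author.
    test_text: the document of unknown authorship.
    """
    train_words = sum(word_count(t) for t in training_texts)
    return (train_words >= min_train_words
            and word_count(test_text) >= min_test_words
            and num_suspects <= max_suspects)
```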
Circumvention Methods
Challenge: conceive methods of circumventing writing style analysis.
Obfuscation
An author attempts to write a document in such a way that their personal writing style will not be recognized.
Imitation
An author attempts to write a document such that the writing style will be recognized as that of another specific author.
Translation*:
Machine translation is used to translate a document to one or more languages and then back to the original language.
Building a Corpus
Corpus = Dataset of documents.
Datasets for adversarial stylometry do not exist. Participants are required to craft intentionally adversarial passages.
Participation had three parts:
Submit 6500 words of pre-existing writing from a formal source.
Write a new 500 word obfuscation passage.
Task: Describe your neighborhood.
Write a new 500 word imitation passage.
Task: Imitate Cormac McCarthy, describe your day.
Authors had no formal training or knowledge in linguistics or stylometry.
Brennan-Greenstadt Corpus
12 Individual Authors.
Participants contacted through classes, colleagues, friends at Drexel University.
Motive for proper participation.
One-on-one interaction with participants.
Corpus is publicly available at https://psal.cs.drexel.edu
Good for preliminary results, but we need something better.
Too small.
Too homogenous.
Building a Better Corpus with Amazon Mechanical Turk
Drexel AMT Corpus
AMT = Amazon Mechanical Turk
Same tasks as previous corpus.
Only 45 of 101 submissions are usable!
45 Accepted Submissions.
Guidelines without spoiling dataset. Must follow directions and:
Pre-existing writing must be formal in nature
Remove non-content
Minimal dialogue/quotations
Refrain from submitting: small samples, lab reports, Q&As, etc.
Released today. Publicly available at https://psal.cs.drexel.edu
This corpus is large, diverse, and unique.
Original vs. AMT Corpus
AMT Corpus evaluated just as strongly as Drexel.
9-Features does worse, Synonym does the same, Writeprints does better.
[Figure: three panels (9-Feature, Synonym, Writeprints), each plotting a 0 to 1 score against Number of Authors (2, 3, 4, 8) for the Brennan-Greenstadt corpus, the AMT corpus, and Random.]
Evaluate Stylometry Methods Against the Corpus
Three methods of Stylometry
9-Feature/Neural Network
Synonym-Based Approach
Writeprints/SVM
Method 1: 9-Feature Set Neural Network
Simple stylometric approach. Demonstrates potential effectiveness with a small number of obscure metrics.
9-Feature Set
Unique words, Complexity, Sentence Count, Average Sentence Length, Average Syllable Count, Character Count, Letter Count, Gunning-Fog Readability Index, Flesch Reading Ease Score.
Neural Network Classifier.
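The nine features listed above can be approximated with the standard library alone. This is a minimal sketch, not the extractor used for the results in this talk. It makes several assumptions: syllables are estimated by counting vowel runs, "complexity" is taken to be the ratio of unique words to total words, and character count excludes whitespace. The Gunning-Fog and Flesch formulas are the standard published ones. The neural network classifier is omitted.

```python
import re

def count_syllables(word):
    """Rough heuristic (assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def nine_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    syllables = [count_syllables(w) for w in words]
    # Gunning-Fog treats words of three or more syllables as complex.
    complex_words = sum(1 for s in syllables if s >= 3)
    unique = len({w.lower() for w in words})
    avg_sentence_length = len(words) / n_sents
    avg_syllables = sum(syllables) / n_words
    return {
        "unique_words": unique,
        "complexity": unique / n_words,  # assumed definition
        "sentence_count": len(sentences),
        "avg_sentence_length": avg_sentence_length,
        "avg_syllable_count": avg_syllables,
        "character_count": len(re.sub(r"\s", "", text)),  # excludes whitespace
        "letter_count": sum(c.isalpha() for c in text),
        "gunning_fog": 0.4 * (avg_sentence_length + 100 * complex_words / n_words),
        "flesch_reading_ease": 206.835 - 1.015 * avg_sentence_length
                               - 84.6 * avg_syllables,
    }
```

Each document then becomes a nine-element vector that any classifier can consume.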
Method 2: Synonym-Based Approach
Examines word choices when compared to available synonyms and frequency of use.
Clark & Hannon, 2007.
Good demonstration of single feature type stylometry.
Method 3: Writeprints (SVM)
Based on the Writeprints approach by Abbasi & Chen, 2008.
Writeprints Baseline Feature Set.
Contains hundreds of features including character and word n-grams, function words, parts-of-speech tags, punctuation, and character level metrics.
Support Vector Machine Classifier
Standard for multi-class classification in stylometry.
Implementation of the full Writeprints approach uses a more extensive feature set and unique classification approach.
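To give a feel for the kinds of features named above, here is a small sketch that extracts three of the feature families: character n-grams, function words, and punctuation. It is not the Writeprints baseline feature set itself. The function-word list, the feature naming, and the normalization are assumptions made for illustration, and the SVM step is left out.

```python
from collections import Counter
import string

# Assumed short list; the real feature set contains hundreds of features.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "it", "is")

def char_ngrams(text, n):
    """Counts of every overlapping character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def feature_vector(text, top_ngrams):
    """Relative frequencies for chosen character trigrams, function words,
    and punctuation marks. top_ngrams is a caller-supplied list of trigrams."""
    lowered = text.lower()
    words = lowered.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    grams = char_ngrams(lowered, 3)
    total_grams = max(sum(grams.values()), 1)
    features = {}
    for gram in top_ngrams:
        features["char3:" + gram] = grams[gram] / total_grams
    # Strip punctuation attached to words before counting function words.
    word_counts = Counter(w.strip(string.punctuation) for w in words)
    for word in FUNCTION_WORDS:
        features["fw:" + word] = word_counts[word] / n_words
    for mark in ".,;:!?":
        features["punct:" + mark] = text.count(mark) / n_chars
    return features
```

The resulting dictionaries can be turned into fixed-order vectors and passed to a multi-class classifier such as an SVM.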
Detecting Authorship
Four experiments.
Baseline: Evaluate the non-adversarial passages to establish a baseline accuracy for each method and feature set.
Obfuscation: Attempt classification of the obfuscation passages when classifiers are trained on the baseline data.
Imitation: Attempt classification of the imitation passages.
Imitation Success: Analyze the number of successful imitations.
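The quantities behind these four experiments reduce to simple ratios once a classifier has produced its predictions. The sketch below assumes predictions are already available as lists of author names. The function names are hypothetical, and the score is simplified to the fraction of passages attributed to their true author, which may differ from the exact precision computation used for the charts.

```python
def fraction_correct(true_authors, predicted_authors):
    """Share of passages attributed to their true author."""
    pairs = list(zip(true_authors, predicted_authors))
    return sum(t == p for t, p in pairs) / len(pairs)

def random_chance(num_authors):
    """Expected score of guessing uniformly among the suspects."""
    return 1 / num_authors

def imitation_success(predicted_authors, target="Cormac McCarthy"):
    """Share of imitation passages attributed to the imitated author."""
    return sum(p == target for p in predicted_authors) / len(predicted_authors)
```

An obfuscation or imitation attack is working when `fraction_correct` falls toward `random_chance(n)` for the number of suspect authors `n`.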
[Figure: four charts (Baseline Precision, Obfuscation Precision, Imitation Precision, and Imitation Success (Framing Cormac McCarthy)), each plotting a 0 to 1 score against Number of Authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random.]
Two Tools
JStylo: Authorship Recognition Analysis Tool.
Anonymouth: Authorship Recognition Evasion Tool.
Free, Open Source. (GNU GPL)
Alpha releases available today at https://psal.cs.drexel.edu
Migrating to GitHub soon.
JStylo: The Problem
Stylometry-based research is difficult.
Existing tools are good but limited.
Weka provides a suite of machine-learning classification tools.
Not tailored for text analysis, no feature extraction ability.
Functions better as an API for software development.
JGAAP has a strong basic toolset for stylometry.
Limited in running multiple feature sets.
Strong API.
Extendable. Intended to be used this way.
Nuances of stylometry are not easy to grasp.
Many open research questions related to authorship. We need an easy-to-use tool that both researchers and non-technical users can understand.
JStylo
JStylo is an authorship recognition analysis tool. It is built upon a framework of:
JGAAP (Java Graphical Authorship Attribution Program)
Weka 3 Data Mining Software
Features
Two existing adversarial corpora, featured in this presentation, and new corpus building functionality.
Wide selection of writing feature extractors and ability to add new extractors.
Wide selection of machine learning based classifiers.
Intuitive GUI.
Alpha Release Available Now: https://psal.cs.drexel.edu
JStylo Demo
(send bug reports, suggestions, questions to ariels@drexel.edu)
JStylo Dev Goals
Wider selection of classification methods and features.
Writeprints, Synonym-based, more Weka methods.
Ensemble classifiers, weighted averaging.
Greater pre- and post-processing options.
Easier to use and understand for non-technical users.
Adding an online tutorial.
GUI installs of new feature extractors and classifiers.
Logging and graphing results over multiple experiments.
Visualization of documents, authors, and classifications.
Anonymouth: The Problem
Authorship recognition can be a legitimate threat to privacy and anonymity.
Intuition in changing writing style goes a long way, but may not be enough and may not be sustainable over multiple documents.
We already see methods that offer some resistance to adversarial passages.
Fully automated text anonymization is an intractable problem.
We need a solution that explains authorship recognition nuances as needed and assists the author in making the most useful changes towards anonymity.
Anonymouth
Anonymouth is an authorship recognition circumvention tool. It is built upon a framework of:
JStylo (JGAAP & Weka)
WordNet
Features
Corpora, feature extractor, and classifier functionality from JStylo.
Suggestion system for modifying documents to evade authorship detection. The ideal value for each feature is calculated, the existence of the features is highlighted, and the user is assisted in changing them.
Iterative approach to anonymizing writing style.
Dictionary / Synonyms / Interactive Editing Console
Alpha Release Available Now: https://psal.cs.drexel.edu
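The idea of a per-feature target plus ranked suggestions can be sketched as follows. This is not Anonymouth's algorithm: it assumes, purely for illustration, that the "ideal" value of a feature is the mean over other authors' documents, and that features are ranked by relative distance from that target. All function names are hypothetical.

```python
def target_values(other_authors_features):
    """Mean of each feature over other authors' documents.

    Assumption: this mean stands in for the 'ideal value' of a feature.
    """
    names = other_authors_features[0].keys()
    n = len(other_authors_features)
    return {name: sum(f[name] for f in other_authors_features) / n
            for name in names}

def suggestions(doc_features, targets):
    """Features that differ from their target, largest relative gap first.

    Returns (feature name, 'increase' or 'decrease', target value) tuples.
    """
    ranked = []
    for name, target in targets.items():
        current = doc_features[name]
        gap = abs(current - target) / (abs(target) or 1.0)
        if gap > 0:
            direction = "increase" if current < target else "decrease"
            ranked.append((gap, name, direction, target))
    ranked.sort(reverse=True)
    return [(name, direction, target) for _, name, direction, target in ranked]
```

After the author edits the document, the features are re-extracted and the list recomputed, which matches the iterative approach described on the slide.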
Anonymouth Demo
(send bug reports, suggestions, questions to awm32@drexel.edu)
Anonymouth Challenges
Features are often not independent.
Increasing the number of complex words will also increase average syllable count.
Reducing the number of times a specific word occurs will also affect the lexical density.
How can we create an algorithm for anonymity that generates an obfuscated document with minimal effort and without circular feature modification?
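The first dependency above is easy to demonstrate. The sketch below assumes a vowel-run syllable heuristic and treats words of three or more syllables as complex. Swapping two short words for longer synonyms raises the complex-word count and, as a side effect, the average syllable count.

```python
import re

def syllables(word):
    """Rough heuristic (assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def complex_word_count(words):
    """Number of words with three or more estimated syllables."""
    return sum(1 for w in words if syllables(w) >= 3)

def avg_syllables(words):
    """Mean estimated syllables per word."""
    return sum(syllables(w) for w in words) / len(words)

before = "we use big words to hide".split()
after = "we utilize enormous words to hide".split()

# The edit targets one feature (complex words) but moves a second one too.
print(complex_word_count(before), complex_word_count(after))
print(avg_syllables(before), avg_syllables(after))
```

A suggestion system that adjusts features one at a time therefore has to re-measure every feature after each edit, or it risks undoing earlier changes.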
Anonymouth Dev Goals
Streamlined suggestion system.
Improved automation on applicable features.
Improved clustering algorithm to provide optimal path to anonymity.
Improved editing interface.
Increased phrase and word synonym set support.
Edit by blocks of text, not simply feature-by-feature.
Wider set of features and classification methods.
Multi-method and feature collection analysis.
Usability and anonymity user studies.
Opening Development
Project will continue to be developed by PSAL at Drexel, but we welcome collaboration and participation.
We are interested in
Linguistic Experts
Security Advisors
UI Experts
Can we detect stylistic deception?
Regular
Obfuscated
Imitated
Detecting stylistic deception is possible
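[Figure: bar chart (axis 0 to 100) of deception detection results for Regular, Imitation, and Obfuscation passages, comparing Writeprint, SVM; Lying-detection, J48; and 9-feature set, J48. Data labels, in extraction order: 98, 85, 89.5, 95.7, 75.3, 59.9, 94.5, 48, 43.]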
Feature Changes in Obfuscated Passages
[Figure: horizontal bar chart (axis -600 to 800) of the change in each feature for obfuscated passages. Features: Unique Words, Complexity, Readability (index), Avg. Syllables, Sentence Count, Avg. Sentence Length, Avg. Word Length, Short Words, Personal Pronoun, Adverb, Preposition, Adjective, Article, Conjunction, Cardinal Number, Existential There.]
Feature Changes in Imitated Passages
[Figure: the same chart, with the same axis and feature list, for imitated passages.]
Problem with the dataset: Topic Similarity
All the deceptive documents were of the same topic.
Non-content-specific features have the same effect as content-specific features.
[Figure: Effect of different feature set in detecting adversarial authorship. F-measure (0 to 1) across Different Writing Samples (imitation, regular, obfuscation) for Syntactic, Lexical, and Content features.]
Hemingway-Faulkner Imitation Corpus
Articles from the International Imitation Hemingway Contest (2000-2005)
Articles from the Faux Faulkner Contest (2001-2005)
Original excerpts of Ernest Hemingway and William Faulkner
Deception detection is possible even when the topic is not similar
81.2% accurate in detecting imitated documents.

Long term deception
A Gay Girl In Damascus blog:
Original author was a 40-year old American citizen, Thomas MacMaster.
Pretended to be a Syrian gay woman, Amina Arraf.
The author worked for at least 5 years to create a new style.
Long term deception is hard to detect
None of the blog posts were found to be deceptive.
But regular authorship recognition can help.
We tried to attribute authorship of the blog posts using Thomas (as himself), Thomas (as Amina), Britta (Thomas's wife).
54.3% of the blog posts were attributed to Thomas (as himself)
Recap
Available Now:
Brennan-Greenstadt Adversarial Stylometry Corpus (12 Authors)
Drexel AMT Adversarial Stylometry Corpus (45 Authors)
JStylo Alpha Release
Anonymouth Alpha Release
Future Work:
Beta releases of JStylo and Anonymouth
Academic publication of new results
Continued analysis of deception detection and short message classification
Continued research on improving partially automated anonymization
Thanks.
We want to hear from you.
Mike Brennan (mb553@drexel.edu)
Rachel Greenstadt (greenie@cs.drexel.edu)
Ariel Stolerman, JStylo Lead (ariels@drexel.edu)
Andrew McDonald, Anonymouth Lead (ams23@drexel.edu)
Sadia Afroz, Deception Detection Lead (sa499@drexel.edu)
Aylin Caliskan, Translation & Stylometry (ac993@drexel.edu)
PSAL: https://psal.cs.drexel.edu
We are looking for interested grad students and post-docs!
Addendum Slides
Research Questions, Practical Implications.
Our upcoming research questions have substantial practical implications.
How do you anonymize a document sufficiently in a reasonable period of time?
What is sufficient? What is reasonable?
Can Anonymouth be used to successfully imitate other authors?
Can Anonymouth maintain long-term deception? Can its usage be detected?
JStylo vs. Anonymouth: who wins?
Based on JStylo, Anonymouth will have everything it needs to help evade detection by the methods it contains.
Two Tools?
Aren't we creating a tool that enables surveillance and de-anonymization?
Anonymouth can't exist without JStylo. But it also shows that you can't necessarily depend on stylometry to assign authorship.
JStylo allows for easier use of authorship recognition tools, but is extensible and open-source. Implementing a method in JStylo will enable counter-attacks in Anonymouth.
JStylo vs. Anonymouth: who wins? (continued)
Note that nothing prevents others from plugging proprietary stylometric methods into their version of JStylo.
