2000s
Moscow State Linguistic University
Postgraduate research in NLP (authorship detection)
2012-2016
Yandex.Money
Fraud analyst, data scientist, head of antifraud analytics
2016-now
AdCash
Head of antifraud team
Episode I
We are here
Cards issue
Illegal merchant activities (totally prohibited, a.k.a. Deadly Sins): Drugs, Violence, Porno
High risk merchants (require limitations / additional checks): Supplements, Spyware, Brand replicas
Business model viability and risks
SPSS
SAS
MATLAB
Weka
RapidMiner
RapidMiner Studio environment
Tens of standard ML algorithms with fully customizable parameters
Visual drag-and-drop approach
Easy to build processes and evaluate models
No need to write code (R and Python extensions are also available)
Ready-to-use tool for an analyst familiar with ML algorithms
RapidMiner ecosystem
[Figure: confusion matrix of predicted vs. true site class; classes include adult, drugs, replica, weapons, normal, spyware, supplements, torrent]
accuracy: 88.46%
class recall: 82.35%, 76.92%, 92.31%, 100.00%, 88.89%, 69.23%, 100.00%, 88.89%, 100.00%, 100.00%, 80.00%
Finally, what's in and out
Referrers database -> Prediction model -> Merchants screening
Episode II
Traffic
Publishers
Advertisers
Advertisement
The initial task
We need to keep the brand safe by monitoring publishers for restricted topics
Is the traffic provided by publishers free from restricted topics?
1,000,000+ ad impressions
All of these metrics won't help you in real life. You should simply
ignore them.
Every data science project should optimize on the value it targets for
in terms of business model need. In the best case a data science
performance measure can only be a proxy of your real objective.
(Martin Schmitz, data scientist @ RapidMiner)
What kind of error is more expensive?
How much does losing a customer cost?
How much did it cost us to keep him?
Was the customer we kept profitable?
Does prediction ever pay off for low-profile customers?

How much does undetected fraud cost us?
What price do we pay for blocking a good customer?
What are the brand risks due to fraud?
How do we measure profit we'd never receive?
How much did we spend on building the model?
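The question of which error is more expensive can be made concrete with a small cost comparison. This is a minimal sketch: the confusion counts and the per-error prices are hypothetical and chosen only to show that the model with the higher accuracy is not necessarily the cheaper one.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all decisions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def error_cost(fp, fn, cost_fp, cost_fn):
    """Total price of mistakes: blocked good customers plus missed fraud."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical prices: a blocked good customer costs 2, a missed fraud costs 50.
COST_FP, COST_FN = 2.0, 50.0

# Hypothetical confusion counts for two candidate models on 1000 cases.
model_a = dict(tp=90, fp=30, fn=10, tn=870)
model_b = dict(tp=80, fp=5, fn=20, tn=895)

acc_a, acc_b = accuracy(**model_a), accuracy(**model_b)
cost_a = error_cost(model_a["fp"], model_a["fn"], COST_FP, COST_FN)
cost_b = error_cost(model_b["fp"], model_b["fn"], COST_FP, COST_FN)

print(f"A: accuracy={acc_a:.3f}, cost={cost_a}")
print(f"B: accuracy={acc_b:.3f}, cost={cost_b}")
```

With these made-up numbers model B is more accurate, yet model A is cheaper because it misses less fraud.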
From URL to features
Raw URL
https://report.adultwebmasternet.com/?page_type=indexes&sid=430&iid=1&gid=81462211&tid=0&kwid=0&uid=0
N-grams (3-6)
rep epo por ort repo epor port repor eport report adu dul ult adul
dult adult web ebm bma mas ast ste ter webm ebma bmas mast aste ster
webma ebmas bmast maste aster webmas ebmast bmaste master net pag
age page typ ype type ind nde dex inde ndex index
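The n-gram step can be sketched as follows. This is an illustrative helper, not the actual pipeline: the function name `char_ngrams` is hypothetical, and the sketch assumes the URL has already been split into word tokens such as "report" and "adult" (how the slide segments "adultwebmasternet" into words is not shown, so it is not attempted here).

```python
def char_ngrams(token, n_min=3, n_max=6):
    """Return all character n-grams of a token, shortest first, left to right."""
    grams = []
    for n in range(n_min, n_max + 1):
        # A token shorter than n simply yields no n-grams of that length.
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


# Tokens taken from the example URL above; the tokenization itself is assumed.
for token in ["report", "adult"]:
    print(token, "->", " ".join(char_ngrams(token)))
```

For "report" and "adult" this produces the same sequences as the first entries in the list above.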
So far, the recall metric (true positive rate) is chosen because it answers the question:
"Given a positive example (adult URL), will the classifier detect it?"
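Recall for the positive class is the number of detected positives divided by all actual positives. A minimal sketch, with hypothetical labels and predictions:

```python
def recall(y_true, y_pred, positive="adult"):
    """True positive rate: detected positives / all actual positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical ground truth and classifier output for five URLs.
y_true = ["adult", "adult", "adult", "normal", "adult"]
y_pred = ["adult", "normal", "adult", "adult", "adult"]

print(recall(y_true, y_pred))
```

Note that the false alarm on the "normal" URL does not affect recall; it would lower precision instead.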
Formal definition
Applied to text analysis, the multinomial Naive Bayes classifier
estimates the conditional probability of a particular word /
term / token given a class as the relative frequency of the term
in documents belonging to class C
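Written out, the relative-frequency estimate looks as follows. The notation is an assumption introduced here for illustration: $T_{Ct}$ is the number of occurrences of term $t$ in training documents of class $C$, and $V$ is the vocabulary. The second line shows the commonly used add-one (Laplace) smoothing, which the slides do not mention but which avoids zero probabilities for unseen terms.

```latex
\hat{P}(t \mid C) = \frac{T_{Ct}}{\sum_{t' \in V} T_{Ct'}}
\qquad
\hat{P}_{\text{smoothed}}(t \mid C) = \frac{T_{Ct} + 1}{\sum_{t' \in V} T_{Ct'} + |V|}
```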
Putting it simpler
If we have previously seen a particular term
in many adult site URLs, another URL
containing it probably also refers to an adult
website
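The intuition above can be sketched as a tiny multinomial Naive Bayes classifier. This is not the model from the talk (which was built in RapidMiner); the training examples are hypothetical, add-one smoothing is an assumption, and terms never seen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (tokens, label). Collect per-class term counts."""
    term_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return term_counts, class_counts, vocab


def predict(tokens, term_counts, class_counts, vocab):
    """Pick the class with the highest log prior + summed log term likelihoods."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total_terms = sum(term_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            if t in vocab:  # unseen terms carry no evidence in this sketch
                score += math.log(
                    (term_counts[label][t] + 1) / (total_terms + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: tokens extracted from URLs, with a class label.
docs = [
    (["adult", "web", "master"], "adult"),
    (["adult", "video"], "adult"),
    (["hotel", "booking"], "normal"),
    (["news", "web"], "normal"),
]
model = train(docs)
print(predict(["adult", "page"], *model))
print(predict(["hotel"], *model))
```

In practice the tokens would be the character n-grams of the URL rather than whole words.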
Data sources
Recall 0.95
API integration
The RESTful API exposes the model as a web service that returns
predictions for generating custom traffic reports
PREDICT
LEARN
VALIDATE
Prediction model -> RESTful web service -> Reporting engine
API functions
API call example: PREDICT
Input: one URL / multiple URLs / text file (newline- or space-separated)
Example request URIs:
http://88.196.223.122:5000/predict
http://88.196.223.122:5000/learn
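A client call to the PREDICT function might look like the sketch below. Only the request URI comes from the slide; the HTTP method (POST), the plain-text newline-separated body, and the helper names `build_predict_request` and `predict` are assumptions made for illustration, since the request and response formats are not documented here.

```python
import urllib.request

PREDICT_URI = "http://88.196.223.122:5000/predict"  # request URI from the slide


def build_predict_request(urls, uri=PREDICT_URI):
    """Build (but do not send) a request carrying one or more URLs.

    ASSUMPTION: POST with a newline-separated plain-text body.
    """
    body = "\n".join(urls).encode("utf-8")
    return urllib.request.Request(
        uri,
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def predict(urls, uri=PREDICT_URI, timeout=10):
    """Send the request and return the raw response text (format not assumed)."""
    request = build_predict_request(urls, uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network.
req = build_predict_request(["https://example.com/", "https://example.org/"])
print(req.get_method(), req.full_url)
```

Calling `predict([...])` would perform the actual network request; the response is returned as raw text because its structure is not described in the slides.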
Whitepapers
Vladimir Mikhnovich
v.mikhnovich@adcash.com
https://ee.linkedin.com/in/kypexin