
Websites categorization for e-commerce and compliance

Vladimir Mikhnovich, head of antifraud team @ AdCash
Machine Learning Estonia meetup, December 13th 2016
What brought me here today

1998: Moscow State Technical University
Master's degree in computer science (expert systems, NLP)

2000s: Moscow State Linguistic University
Postgraduate research in NLP (authorship detection)

2012-2016: Yandex.Money
Fraud analyst, data scientist, head of antifraud analytics

2016-now: AdCash
Head of antifraud team
Episode I: k-NN strikes back

Automatic detection of restricted categories across 100,000 websites' content
E-commerce acquiring pipeline: where to look

Online customer -> Merchant -> PSP (we are here)
Issuer bank (cards issue) -> Visa / MasterCard -> Acquiring bank
What to comply with

Business Risk Assessment and Mitigation (BRAM)
Global Brand Protection Program (GBPP)
The problem of merchant monitoring

Payment systems and aggregators must check merchants to avoid high-risk / prohibited categories and fraud.

Illegal merchant activities (totally prohibited, a.k.a. Deadly Sins): drugs, violence, porno
High-risk merchants (require limitations / additional checks): supplements, spyware, brand replicas
Business model viability and risks

In a normal business, high-risk merchants might make up 3-5% of the total, and legitimate business models are not optimized for selling illegal goods. But those 3% still have to be monitored and detected.

What happens on the Dark Side of the Web? You never know for sure.
The initial task

Regular automated scanning of big batches (tens of thousands) of merchant websites' content, determining high-risk categories for further manual screening.

Total number of active merchants: 100,000+
Problems to solve

Picking and labelling sites for the training dataset
Automatic downloading of thousands of websites
Processing content: from websites to text documents
Automatic classification of thousands of documents
Uncertainty of some categories
Defining thresholds for classification results
Bad guys we had

Adult: you know it
Drugs: includes laughing gas, smoking kits, hookah, alcohol, cigarettes
Replica: any brand copies / replicas (mostly watches, jewelry, bags)
Weapons: includes traumatic weapons, pepper sprays, shockers, etc.
Betting: online casinos, exchangers, bookmakers
Hour hotel: hotels / rooms with hourly rates
Magic: voodoo doctors, psychic activities, love potions
Spyware: hidden cams, listening devices, etc.
Supplements: sport and other nutrition supplements
Torrent: torrents, software downloads, restricted copyright material
Industry-scale machine learning systems

Cloud and distributed solutions (just to name a few):
Microsoft Azure Machine Learning
Amazon Machine Learning
IBM Watson Machine Learning

Some standalone products:
SPSS
SAS
MATLAB
Weka
RapidMiner
RapidMiner Studio environment
Tens of standard ML algorithms with fully customizable parameters
Visual drag and drop approach
Easy to build processes and evaluate models
No need to write code (R and Python extensions are also available)
Ready to use tool for an analyst familiar with ML algorithms
RapidMiner ecosystem

RapidMiner Studio: analytics, modelling and evaluation
RapidMiner Server: model deployment and web services
Standard modelling pipeline

Build dataset -> Train and test classification model -> Download new websites, apply model and get predictions
Training process

Training dataset: 11 categories, 300 labeled sites, 30,000 words
Text processing: extract text; tokenize, stem; build TF-IDF matrix
Model evaluation: k-NN classifier with cross-validation
TF-IDF metric

TF-IDF (term frequency - inverse document frequency): a numerical statistic intended to reflect how important a word is to a single document in a collection of documents.
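The text-processing step described above (tokenize, then build the TF-IDF matrix) can be sketched with scikit-learn; this is not the RapidMiner process actually used in the talk, and the toy documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the extracted website texts (invented examples;
# the real training matrix was 300 sites x 30,000 words).
docs = [
    "buy cheap watches replica watches best price",
    "hotel rooms with hourly rate booking",
    "watch repair service original parts",
]

# Tokenization and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# "watches" occurs twice in doc 0 and nowhere else, so its TF-IDF weight is
# high for that document and zero for the others.
print(tfidf.shape)
```

Each row of the resulting sparse matrix is one document's weighted word vector, which is exactly the representation the k-NN classifier consumes.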
k-NN and thresholds

Applied to text analysis, a k-NN classifier provides a measure of similarity between a given text document and the known categories.
Data structures and sizes

Text data size:
1 site converts to a plain text file of 0.3 to 1+ MB
The corpus makes 150-300 MB of text files in total

TF-IDF matrix examples:
Training data: 300 sites x 30,000 words (80 MB)
Test data: 800 sites x 60,000 words (400 MB)
Batching approach

Many attempts to classify thousands sites at once were actually


unsuccessful. Reason? Memory problems.
So far, another approach to overcome physical memory limitations was chosen:
batching. First we download websites and divide them into batches of
reasonable size (empirically, 200-300 sites is enough to fit all matrices in
memory), every batch is downloaded separately, and then all batches are
analyzed in a loop.

Thousands of websites -> Download and save all -> Loop over batches -> Classify every batch
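The batching loop described above might look like this in outline; `download_text` and `classify_batch` are hypothetical stand-ins for the real download and RapidMiner classification steps:

```python
# Sketch of the batching approach: classify sites in memory-sized chunks.
BATCH_SIZE = 250  # empirically, 200-300 sites fit all TF-IDF matrices in memory

def download_text(url):
    # Placeholder: the real pipeline crawls the site and converts it
    # to a plain text document (0.3 to 1+ MB each).
    return "plain text of " + url

def classify_batch(texts):
    # Placeholder: build the TF-IDF matrix for this batch only
    # and apply the trained k-NN model.
    return ["normal guys"] * len(texts)

def classify_all(urls):
    predictions = {}
    for start in range(0, len(urls), BATCH_SIZE):
        batch = urls[start:start + BATCH_SIZE]
        texts = [download_text(u) for u in batch]
        for url, label in zip(batch, classify_batch(texts)):
            predictions[url] = label
    return predictions

result = classify_all(["site%d.example" % i for i in range(600)])
print(len(result))  # 600 sites, processed in 3 batches
```

Only one batch's TF-IDF matrix is ever held in memory at a time, which is the whole point of the approach.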
k-NN: tuning k and defining thresholds

k=5 allows assigning meaningful confidence values to categories. Only high confidences (over the 80% threshold) are taken into account.

The original table scored each site against categories (adult, drugs, replica, weapons, normal, spyware, supplements, torrent); the per-column placement was lost in extraction, so only the top confidence and the resulting prediction are shown here:

site | prediction | top confidence
eroshop.ru | adult | 100%
putana78.com | adult | 100%
IntimCity.nl | adult | 81%
kupialco.ru | drugs | 100%
mari-juana.net | drugs | 100%
kyritelnie-smesi.nl | drugs | 81%
03market.ru | normal guys | 81%
1-ocenka.ru | normal guys | 40%
1c-interes.ru | normal guys | 60%
1gb.ru | normal guys | 100%
100-z.ru | spyware | 60%
100mile.ru | spyware | 80%
1belka.ru | spyware | 40%
1chef.ru | weapons | 100%
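The 5-NN confidence behaviour above can be reproduced in spirit with scikit-learn; the 2-D feature vectors below are invented stand-ins for TF-IDF rows, not the talk's actual data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D feature vectors standing in for 30,000-word TF-IDF rows.
X_train = np.array([[0.90, 0.10], [0.85, 0.15], [0.95, 0.05],
                    [0.80, 0.20], [0.88, 0.12],
                    [0.10, 0.90], [0.15, 0.85], [0.05, 0.95],
                    [0.20, 0.80], [0.12, 0.88]])
y_train = ["adult"] * 5 + ["normal guys"] * 5

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# With k=5 and uniform weights, confidences come in 20% steps (0/5 ... 5/5).
proba = knn.predict_proba([[0.87, 0.13]])[0]
best = proba.max()
label = knn.classes_[proba.argmax()]

# Only predictions over the 80% threshold go to manual screening.
THRESHOLD = 0.8
print(label if best >= THRESHOLD else "below threshold")  # prints "adult"
```

The 20% confidence steps explain why most values in the table are multiples of 20%; distance weighting would produce intermediate values like 81%/19%.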
Example of confusion matrix

accuracy: 88.46% (true classes in columns, predictions in rows)

| | true adult | true drugs | true replica | true weapons | true normal guys | true betting | true hour hotel | true magic | true spyware | true supplements | true torrent | class precision |
| pred. adult | 14 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 73.68% |
| pred. drugs | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.91% |
| pred. replica | 1 | 0 | 12 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 85.71% |
| pred. weapons | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100.00% |
| pred. normal guys | 1 | 2 | 1 | 0 | 88 | 4 | 0 | 1 | 0 | 0 | 1 | 89.80% |
| pred. betting | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 100.00% |
| pred. hour hotel | 0 | 0 | 0 | 0 | 2 | 0 | 8 | 0 | 0 | 0 | 0 | 80.00% |
| pred. magic | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 88.89% |
| pred. spyware | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9 | 0 | 0 | 90.00% |
| pred. supplements | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 100.00% |
| pred. torrent | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 4 | 66.67% |
| class recall | 82.35% | 76.92% | 92.31% | 100.00% | 88.89% | 69.23% | 100.00% | 88.89% | 100.00% | 100.00% | 80.00% | |
Finally, what's in and out

On average, around 3-5% of websites were assigned high confidence (>80%) during classification and needed to be screened manually.

Very big list of domains -> Classifier ->

site | prediction | confidence
zdoroviak.ru | supplements | 100%
zscom.ru | spyware | 100%
zwuk.ru | spyware | 100%
zarekoy.ru | hour hotel | 100%
zastava-izhevsk.ru | weapons | 100%
zutera.ru | replica | 80%
zoombao.com | replica | 80%
zita-gita.ru | adult | 80%
zishop.ru | replica | 80%
zdorovoetelo100.ru | supplements | 80%
zen-shop.ru | drugs | 80%
Performance and accuracy

Test run on 100 randomly picked websites:

Downloading time: 7 minutes
Processing & classification time: 30 seconds (0.3 sec per site)
Cross-validation model accuracy: 89% (without applying the threshold)
High-risk sites classified as normal (false negatives): 1
Correctly classified high-risk sites: 12 out of 13 (92%)
Normal sites classified as high risk (false positives): 0
Production environment architecture

Automatic scanning and handling of referrer URLs
Scheduled prediction web service runs automatically on new data
Predictions are manually reviewed thereafter
Model is updated incrementally after results review

Online customer -> Merchant -> PSP
Referrers database -> Prediction model -> Merchants screening
Two years later...

Episode II: Return of Naive Bayes

Automated detection of adult web referrers based on URL features
(in collaboration with Laure Daumal, fraud developer intern @ AdCash)
Online ad network business model and risks

Providing a connection between advertisers and publisher media
Taking care of the quality of the traffic we get from publishers

Publishers -> traffic -> Advertisers
Advertisers -> advertisement -> Publishers
The initial task

We need to keep the brand safe by monitoring restricted publisher thematics.

Is the traffic provided by publishers free from restricted thematics?

Let's start with detecting and predicting adult referrers, based ONLY on the referrer URL itself.
Why URLs and not content analysis?

Content analysis is still very time- and resource-consuming.

We do not need precise, real-time classification of web pages across many thematics, but rather detection of suspicious URLs within regular traffic analytics ("it looks like an adult page").

We need a stable and reliable method which can efficiently handle the amount of referrer data we have.
Amount of data we have

One big web publisher is capable of generating daily:

1,000,000+ ad impressions
500,000 unique referrers
100,000 unique domains

Be realistic: what to expect

What is our future classifier intended for? Given a certain URL, it should predict the likelihood that it belongs to an adult or other restricted topic.

What should we NOT expect from it? Given a page address, it is not expected to correctly guess its true content category.
Tools we can trust
Philosophical question (before choosing an algorithm)
What metric should we optimize the prediction model against?
Philosophical answer

Real-life problems are not a Kaggle competition!

"There are a bunch of data science performance measures around: AUC, AUPRC, F-measure, R-squared, RMSE, accuracy, and so on. All of these metrics won't help you in real life. You should simply ignore them. Every data science project should optimize on the value it targets in terms of business model needs. In the best case a data science performance measure can only be a proxy of your real objective."
(Martin Schmitz, data scientist @ RapidMiner)
What kind of error is more expensive?

We predicted when we shouldn't have, versus we didn't predict when we should have.

Churn prediction example:
How much does losing a customer cost?
How much did it cost us to keep them?
Was the customer we kept profitable?
Does prediction ever pay off for low-profile customers?

Fraud detection example:
How much does undetected fraud cost us?
What price do we pay for a blocked good customer?
What are the brand risks due to fraud?
How do we measure profit we'd never receive?
How much did we spend on building the model?
From URL to features

Raw URL:
https://report.adultwebmasternet.com/?page_type=indexes&sid=430&iid=1&gid=81462211&tid=0&kwid=0&uid=0

Word breaker (from SphinxSearch.com):
report adult webmaster net page type index es si ii gi ti kw id id

N-grams (3-6):
rep epo por ort repo epor port repor eport report adu dul ult adul dult adult web ebm bma mas ast ste ter webm ebma bmas mast aste ster webma ebmas bmast maste aster webmas ebmast bmaste master net pag age page typ ype type ind nde dex inde ndex index

Assign IDs + count frequencies:
rep:1, epo:2, por:3, ort:4, (...) plus a frequency parameter for each
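The URL-to-features pipeline above can be approximated in a few lines. This is a sketch: the talk used the SphinxSearch word breaker, while here a simple regex split stands in for it (so compound words are not split apart):

```python
import re
from collections import Counter

def url_ngrams(url, n_min=3, n_max=6):
    """Break a URL into words, then expand each word into character n-grams."""
    # Crude stand-in for the SphinxSearch word breaker used in the talk:
    # the real one would also split "adultwebmasternet" into "adult webmaster net".
    words = re.findall(r"[a-z]+", url.lower())
    grams = Counter()
    for word in words:
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                grams[word[i:i + n]] += 1
    return grams  # n-gram -> frequency, ready for ID assignment

features = url_ngrams("https://report.adultwebmasternet.com/?page_type=indexes")
print(features["adu"], features["adult"], features["report"])  # prints: 1 1 1
```

The resulting frequency counts are exactly the representation a multinomial model can consume directly.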
Choosing the metric and comparing classifiers

We can afford false positives, but we'd rather be sure that we detect most of the adult referrers out there; otherwise we might lose an unhappy advertiser.

So the recall metric (true positive rate) was chosen to answer the question:
"Given a positive example (adult URL), will the classifier detect it?"

Initial testing with a dictionary of 9,000 words and 1,500 URLs:

Classifier | Recall
Multinomial Naive Bayes | 0.98
MLP Classifier | 0.94
Decision Tree | 0.93
Logistic Regression | 0.91
Random Forest | 0.83
k-NN | 0.76
SVC | 0.68
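A comparison like the table above can be run with scikit-learn, scoring each classifier on recall. This is a toy sketch: the URL sample below is invented (the real test used a 9,000-word dictionary and 1,500 URLs), and only two of the listed classifiers are shown:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Tiny invented URL sample; 1 = adult, 0 = normal.
urls = ["adultwebmaster.example", "hotvideos-xxx.example", "sexcams.example",
        "adultdating.example", "xxxmovies.example", "pornsite.example",
        "news.example", "weather.example", "cooking.example",
        "sports.example", "music.example", "books.example"] * 5
labels = ([1] * 6 + [0] * 6) * 5

# Character n-grams (3-6), mirroring the word-breaker + n-gram pipeline.
vec = CountVectorizer(analyzer="char", ngram_range=(3, 6))
X = vec.fit_transform(urls)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels)

results = {}
for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_tr, y_tr)
    results[type(clf).__name__] = recall_score(y_te, clf.predict(X_te))

print(results)
```

On a realistic dataset, the recall of each candidate would be compared exactly as in the table before settling on Multinomial Naive Bayes.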
Multinomial Naive Bayes made simple

A multinomial Naive Bayes classifier is used when words can be represented in terms of their occurrences (frequency counts), not just the presence or absence of a particular word in the document.

Formal definition:
Applied to text analysis, a multinomial Naive Bayes classifier estimates the conditional probability of a particular word / term / token given a class as the relative frequency of that term in documents belonging to class C.

Putting it simpler:
If we have seen a particular term previously in many adult sites' URLs, probably another URL containing it also refers to an adult website.
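In symbols, the estimate described above is usually written with add-one (Laplace) smoothing to handle unseen terms; the notation here follows standard text-classification references, not the slides:

```latex
% Conditional probability of term t given class c, estimated as the
% smoothed relative frequency of t in the training documents of class c:
\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + |V|}

% Classification picks the class maximizing prior times term likelihoods:
\hat{c} = \arg\max_{c} \; P(c) \prod_{i} \hat{P}(t_i \mid c)
```

Here \(T_{ct}\) is the number of occurrences of term \(t\) in training documents of class \(c\), \(V\) is the vocabulary, and \(t_i\) ranges over the terms (n-grams) of the URL being classified.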
Data sources

Directory Mozilla, SquidGuard blacklists

Confusion matrix
Dataset: 20,000+ URLs
Dictionary size: 60,000 words
Validation: 5-fold cross-validation

| | True adult | True normal |
| Predicted adult | 10,189 | 500 |
| Predicted normal | 578 | 8,772 |

Recall: 0.95
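The recall figure follows directly from the confusion matrix above; a quick arithmetic check:

```python
# Recall = TP / (TP + FN), with the counts from the confusion matrix above.
tp = 10_189  # predicted adult, truly adult
fn = 578     # predicted normal, truly adult (missed)
recall = tp / (tp + fn)
print(round(recall, 2))  # prints 0.95
```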
API integration

A RESTful API allows using the model as a web service and getting predictions for generating custom traffic reports.

Prediction model <-> RESTful web service (PREDICT / LEARN / VALIDATE) <-> Reporting engine

API functions

API call example: PREDICT
Input: one URL / multiple URLs / text file (newline- or space-separated)
Example request URIs:
http://88.196.223.122:5000/predict
http://88.196.223.122:5000/learn
Example API answer with prediction result (JSON format):


{
"avg_score": 0.9999999800429088,
"scores": [{
"url": "http://www.smokingwithstyle.com",
"confidence": 1.0
}, {
"url": "http://www.potseeds.co.uk/magic-mushroom-spore-syringes-10ml",
"confidence": 1.0
}, (...)
]
}
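A client consuming this web service might parse the response like so. The field names (`avg_score`, `scores`, `url`, `confidence`) are taken from the slide's example payload; reusing the 80% threshold from Episode I is an assumption for illustration:

```python
import json

# Example response body copied from the slide above.
payload = """
{
  "avg_score": 0.9999999800429088,
  "scores": [
    {"url": "http://www.smokingwithstyle.com", "confidence": 1.0},
    {"url": "http://www.potseeds.co.uk/magic-mushroom-spore-syringes-10ml",
     "confidence": 1.0}
  ]
}
"""

result = json.loads(payload)

# Keep only URLs at or above an assumed 80% confidence threshold.
flagged = [s["url"] for s in result["scores"] if s["confidence"] >= 0.8]
print(len(flagged), round(result["avg_score"], 2))  # prints: 2 1.0
```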
Summing it all up

We learned how to classify web resources in different ways and detect restricted categories.

We do not always need to analyze the content itself: the URL alone may be sufficient to determine a category.

Even the simplest classification algorithms are still capable of doing some magic.
Further reading

Books:
Foster Provost, Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
Markus Hofmann, Andrew Chisholm. Text Mining and Visualization: Case Studies Using Open-Source Tools

Whitepapers:
Eda Baykan et al. Purely URL-based Topic Classification
Myriam Abramson et al. What's in a URL? Genre Classification from URLs
Min-Yen Kan. Fast webpage classification using URL features


Thank you!

Vladimir Mikhnovich
v.mikhnovich@adcash.com
https://ee.linkedin.com/in/kypexin