
Website categorization

for e-commerce and online advertising
Vladimir Mikhnovich, head of antifraud team @ AdCash
Machine Learning Estonia meetup, December 13th 2016
What brought me here today
Moscow State Technical University
Master's degree in computer science (expert systems, NLP)

Moscow State Linguistic University
Postgraduate research in NLP (authorship detection)

2012-2016
Fraud analyst, data scientist, head of antifraud analytics

2016-now
Head of antifraud team
Episode I

k-NN strikes back

Automatic detection of restricted categories
across the content of 100,000+ merchant websites
E-commerce acquiring pipeline: where to look

Online customer → Merchant → PSP → Acquiring bank → Visa / MasterCard → Issuer bank → Cards issue

We are here: the PSP / acquiring side
What to comply with

Business Risk Assessment and Mitigation (BRAM) program (MasterCard)
Global Brand Protection Program (GBPP) (Visa)
The problem of merchant monitoring

Payment systems and aggregators must check merchants
to avoid high risk / prohibited categories / fraud

Illegal merchant activities (totally prohibited, a.k.a. Deadly Sins): Drugs, Violence, Porno
High risk merchants (require limitations / additional checks): Supplements, Spyware, Brand replicas
Business model viability and risks

The share of high risk merchants (in a normal business) might be up to 3-5% of the total, while legitimate business models are also not optimized for selling illegal stuff. But these 3-5% still have to be monitored.
What happens on the Dark Side of the Web?

You never know for sure.
The initial task

Regular automated scanning of big batches (tens of thousands) of merchant websites' content, determining high risk categories for further manual screening.

Total number of active merchants: 100,000+

Problems to solve

Picking and labelling sites for training dataset

Automatic downloading of thousands of websites
Processing content: from websites to text documents
Automatic classification of thousands of documents
Uncertainty of some categories
Defining thresholds for classification results
Bad guys we had

Adult: you know it
Drugs: includes laughing gas, smoking kits, hookah, alcohol, cigarettes
Replica: any brand copies / replicas (mostly watches, jewelry, bags)
Weapons: includes traumatic weapons, pepper sprays, shockers etc.
Betting: online casinos, exchangers, bookmakers
Hour hotel: hotels / rooms with hourly rate
Magic: voodoo doctors, psychic activities, love potions
Spyware: hidden cams, listening devices etc.
Supplements: sport and other nutrition supplements
Torrent: torrents, software downloads, restricted copyright material
Industry scale machine learning systems

Cloud and distributed solutions (just to name a few)

Microsoft Azure Machine Learning

Amazon Machine Learning
IBM Watson Machine Learning

Some standalone products

RapidMiner Studio environment
Dozens of standard ML algorithms with fully customizable parameters
Visual drag-and-drop approach
Easy to build processes and evaluate models
No need to write code (R and Python extensions are also available)
Ready-to-use tool for an analyst familiar with ML algorithms
RapidMiner ecosystem

RapidMiner Studio: analytics, modelling and evaluation
RapidMiner Server: models deployment and web services
Standard modelling pipeline

Build dataset → Train & test classification model → Download new websites → Apply model and get predictions
Training process

Training dataset: 11 categories, 300 labeled sites, 30,000 words
Text processing: extract text; tokenize, stem; build TF-IDF matrix
Model evaluation: k-NN classifier with cross-validation
TF-IDF metric

TF-IDF (term frequency / inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a single document in a collection of documents.
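The TF-IDF step can be illustrated in Python with scikit-learn (a hypothetical sketch; the original pipeline was built in RapidMiner, and the toy documents below are made up):

```python
# Hypothetical sketch of the TF-IDF step using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "replica watches bags jewelry replica",   # replica-like site text
    "hotel room hourly rate booking",         # hour-hotel-like site text
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # rows = documents, columns = words

vocab = vectorizer.vocabulary_
# "replica" occurs only in the first document, so its TF-IDF weight
# is positive there and zero in the second document.
col = vocab["replica"]
print(tfidf[0, col] > 0, tfidf[1, col] == 0)
```

In the real pipeline each row is one downloaded site converted to text, and the matrix grows to the 300 x 30,000 size mentioned above.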
k-NN and thresholds
The k-NN classifier, when applied to text analysis, provides a measure
of similarity of a certain text document to the known categories.
Data structures and sizes

Text data size

1 site converts to a plain text file of 0.3 to 1+ MB
The corpus makes 150-300 MB of text files in total

TF-IDF matrix examples

Training data: 300 sites x 30,000 words (80 MB)
Test data: 800 sites x 60,000 words (400 MB)
Batching approach

Many attempts to classify thousands of sites at once were actually unsuccessful. Reason? Memory problems.
So another approach was chosen to overcome physical memory limitations: batching. First we download websites and divide them into batches of reasonable size (empirically, 200-300 sites is enough to fit all matrices in memory); every batch is downloaded separately, and then all batches are analyzed in a loop.

Thousands of websites → Download and save all → Classify every batch in a loop
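The batching loop can be sketched in plain Python (a hypothetical helper; the batch size of 250 follows the empirical 200-300 range above):

```python
# Minimal sketch of the batching approach: split a big list of sites into
# memory-friendly batches and process each batch separately in a loop.
def make_batches(items, batch_size=250):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sites = [f"site{i}.example" for i in range(1000)]  # placeholder domain list

batches = list(make_batches(sites))
print(len(batches), [len(b) for b in batches])  # 4 batches of 250 sites each
```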
k-NN: tuning k and defining thresholds
k=5 allows assigning significant confidence values to categories.
Only high confidences are taken into account (threshold over 80%).

(Table: example per-site confidence vectors over the categories adult, drugs, replica, weapons, normal, spyware, supplements, torrent, with the final prediction; most sites receive 80-100% confidence for a single category, e.g. "100% adult", "81% drugs / 19% other", "100% weapons".)
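With k=5, each confidence is the share of the 5 nearest neighbors voting for a category, so the possible values are multiples of 20%, which matches the 80%/100% confidences shown. A hypothetical scikit-learn sketch of this scoring (toy 2-D features stand in for TF-IDF vectors):

```python
# Sketch of k-NN confidence scoring: with k=5, predict_proba returns the
# fraction of the 5 nearest neighbors in each class (0%, 20%, ..., 100%).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Toy 2-D "TF-IDF-like" features: two well-separated clusters.
X_adult = rng.normal(loc=[5, 5], scale=0.5, size=(20, 2))
X_normal = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
X = np.vstack([X_adult, X_normal])
y = ["adult"] * 20 + ["normal"] * 20

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

proba = knn.predict_proba([[5, 5]])[0]   # confidence per class
best = proba.max()
label = knn.classes_[proba.argmax()]
# Only confidences above the 80% threshold go to manual screening.
flagged = best >= 0.8
print(label, best, flagged)
```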
Example of confusion matrix

accuracy: 88.46%

pred. \ true    adult  drugs  repl.  weap.  normal  bett.  hotel  magic  spyw.  suppl.  torr.  class precision
adult             14      1      0      0       4      0      0      0      0       0      0       73.68%
drugs              1     10      0      0       0      0      0      0      0       0      0       90.91%
replica            1      0     12      0       1      0      0      0      0       0      0       85.71%
weapons            0      0      0     10       0      0      0      0      0       0      0      100.00%
normal guys        1      2      1      0      88      4      0      1      0       0      1       89.80%
betting            0      0      0      0       0      9      0      0      0       0      0      100.00%
hour hotel         0      0      0      0       2      0      8      0      0       0      0       80.00%
magic              0      0      0      0       1      0      0      8      0       0      0       88.89%
spyware            0      0      0      0       1      0      0      0      9       0      0       90.00%
supplements        0      0      0      0       0      0      0      0      0      12      0      100.00%
torrent            0      0      0      0       2      0      0      0      0       0      4       66.67%
class recall  82.35% 76.92% 92.31% 100.00% 88.89% 69.23% 100.00% 88.89% 100.00% 100.00% 80.00%
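The class precision and recall values can be reproduced directly from the counts in the matrix; a short sketch checking the adult class and the overall accuracy:

```python
# Recompute class precision, class recall and accuracy from the
# confusion matrix counts (rows = predicted class, columns = true class).
labels = ["adult", "drugs", "replica", "weapons", "normal guys", "betting",
          "hour hotel", "magic", "spyware", "supplements", "torrent"]
matrix = [
    [14, 1, 0, 0, 4, 0, 0, 0, 0, 0, 0],
    [1, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 12, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0],
    [1, 2, 1, 0, 88, 4, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 8, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 4],
]

def precision(i):
    """True positives of class i divided by everything predicted as class i."""
    return matrix[i][i] / sum(matrix[i])

def recall(i):
    """True positives of class i divided by all true members of class i."""
    return matrix[i][i] / sum(row[i] for row in matrix)

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
print(round(precision(0) * 100, 2), round(recall(0) * 100, 2),
      round(accuracy * 100, 2))  # 73.68 82.35 88.46
```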
Finally, what's in and out

On average, around 3-5% of websites were assigned a high confidence (>80%) during classification and needed to be screened manually.

Very big list of domains → Classifier → sample output:
supplements 100%, spyware 100%, spyware 100%, hour hotel 100%, weapons 100%, replica 80%, replica 80%, adult 80%, replica 80%, supplements 80%, drugs 80%
Performance and accuracy

Test run on 100 randomly picked websites:

Downloading time: 7 minutes
Processing & classification time: 30 seconds (0.3 sec per site)
Cross-validation model accuracy: 89% (without applying the threshold)

High risk sites classified as normal (False Negatives): 1
Correctly classified high risk sites: 12 out of 13 (92%)
Normal sites classified as high risk (False Positives): 0
Production environment architecture
Automatic scanning and handling of referrer URLs
A scheduled prediction web service runs automatically on new data
Predictions are manually reviewed thereafter
The model is updated incrementally after results review

Online customer → Merchant → PSP → Referrers database → Prediction model
Two years later

Episode II

Return of Naïve Bayes

Automated detection of adult web referrers based on URL features
(in collaboration with Laure Daumal, fraud developer intern @ AdCash)
Online ad network business model and risks

Providing the connection between advertisers and publisher media

Taking care of the quality of the traffic we get from publishers



The initial task
We need to keep the brand safe by monitoring restricted publisher thematics.

Is the traffic provided by publishers free from restricted thematics?

Let's start with detecting and predicting adult referrers,
based ONLY on the referrer URL itself.
Why URLs and not content analysis?

Content analysis is still very time- and resource-consuming.

We do not need a precise, real-time classification of web pages across many thematics, but rather detection of suspicious URLs within regular traffic analytics ("it looks like an adult page").

We need a stable and reliable method which is able to efficiently handle the amounts of referrer data we have.
Amount of data we have

One big web publisher is capable of generating daily:

1,000,000+ ad impressions

500,000 unique referrers

100,000 unique domains

Be realistic: what to expect

What is our future classifier intended for? Given a certain URL, it should predict its likelihood of belonging to an adult or other restricted topic.

What should we NOT expect from it? Given a page address, it is not expected to correctly guess its true content category.
Tools we can trust
Philosophical question (before choosing an algorithm)
What metric should we optimize the prediction model against?
Philosophical answer
Real life problems are not a Kaggle competition!

There are a bunch of data science performance measures around:

AUC, AUPRC, F-Measure, R-Squared, RMSE, accuracy and so on
and so on.

None of these metrics will help you in real life on their own. You should
not optimize for them blindly.

Every data science project should optimize for the value it targets
in terms of business model needs. In the best case a data science
performance measure can only be a proxy of your real objective.
(Martin Schmitz, data scientist @ RapidMiner)
What kind of error is more expensive?

We predicted when we shouldn't have (churn prediction example):
How much does losing a customer cost?
How much did it cost us to keep them?
Was the customer we kept profitable?
Does prediction ever pay off for low profile customers?

We didn't predict when we should have (fraud detection example):
How much does undetected fraud cost us?
What price do we pay for a blocked good customer?
What are the brand risks due to fraud?
How do we measure profit we'd never receive?
How much did we spend on building the model?
From URL to features

Word breaker:
report adult webmaster net page type index es si ii gi ti kw id id

N-grams (3-6):
rep epo por ort repo epor port repor eport report adu dul ult adul
dult adult web ebm bma mas ast ste ter webm ebma bmas mast aste ster
webma ebmas bmast maste aster webmas ebmast bmaste master net pag
age page typ ype type ind nde dex inde ndex index

Assign IDs + count frequencies:
rep:1, epo:2, por:3, ort:4, (...) + add a frequency parameter
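The n-gram step above can be sketched in plain Python (hypothetical helper names; the slide's "report" example is reproduced exactly):

```python
# Sketch of URL feature extraction: character n-grams (lengths 3 to 6)
# per word, then frequency counting as feature values.
from collections import Counter

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of `word` for lengths n_min..n_max."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

grams = char_ngrams("report")
print(grams)  # rep epo por ort repo epor port repor eport report

# Frequencies over all words become the classifier's feature vector.
freqs = Counter(char_ngrams("report") + char_ngrams("adult"))
```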
Choosing the metric and comparing classifiers
We can afford false positives, but we'd rather be sure that we detect most of the
adult referrers out there; otherwise we might potentially lose an unhappy advertiser.

So the recall metric (true positive rate) is chosen to answer the question:
"Given a positive example (adult URL), will the classifier detect it?"

Initial testing with a dictionary of 9,000 words and 1,500 URLs:

Classifier               Recall
Multinomial Naïve Bayes    0.98
MLP Classifier             0.94
Decision Tree              0.93
Logistic Regression        0.91
Random Forest              0.83
k-NN                       0.76
SVC                        0.68
Multinomial Naïve Bayes made simple
The multinomial Naïve Bayes classifier is used when words can be represented in terms of their occurrences (frequency counts), not just the presence or absence of a particular word in the document.

Formal definition
Applied to text analysis, the multinomial Naïve Bayes classifier
estimates the conditional probability of a particular word /
term / token given a class as the relative frequency of the term
in documents belonging to class C.

Putting it simpler

If we have seen a particular term previously
in many adult sites' URLs, probably another
URL containing it also refers to an adult site.
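A minimal sketch of this idea with scikit-learn's MultinomialNB over character n-gram counts (toy hand-made URLs, not from the talk; the real model used a 60,000-word dictionary and 20,000+ URLs):

```python
# Toy multinomial Naive Bayes over character n-gram counts of URLs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set of labeled referrer URLs.
urls = [
    "hotporn.example", "xxxadult.example", "sexy-cams.example",
    "dailynews.example", "bookshop.example", "cooking-blog.example",
]
labels = ["adult", "adult", "adult", "normal", "normal", "normal"]

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 6)),  # n-gram counts
    MultinomialNB(),  # frequency-based Naive Bayes
)
model.fit(urls, labels)

# An unseen URL sharing "porn" n-grams with the adult class.
print(model.predict(["freeporn.example"])[0])
```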
Data sources

Directory Mozilla (DMOZ), SquidGuard blacklists

Confusion matrix
Dataset: 20,000+ URLs
Dictionary size: 60,000 words
Validation: 5-fold cross-validation

                    True adult   True normal
Predicted adult        10,189           500
Predicted normal          578         8,772

Recall: 0.95
API integration
A RESTful API allows using the model as a web service and getting
predictions for generating custom traffic reports.

Prediction model → RESTful web service → Reporting engine
API functions
API call example: PREDICT
Input: one URL / multiple URLs / text file (new line or space separated)
Example request URIs:

Example API answer with prediction result (JSON format):

{
  "avg_score": 0.9999999800429088,
  "scores": [{
    "url": "",
    "confidence": 1.0
  }, {
    "url": "",
    "confidence": 1.0
  }, (...)
}
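A hypothetical client-side sketch of consuming such an answer (field names follow the JSON shown; the sample URLs and scores below are placeholders, since the slides elide the real ones):

```python
# Parse a PREDICT-style JSON answer and keep high-confidence referrers.
import json

answer = json.loads("""
{
  "avg_score": 0.7,
  "scores": [
    {"url": "a.example", "confidence": 1.0},
    {"url": "b.example", "confidence": 0.9},
    {"url": "c.example", "confidence": 0.2}
  ]
}
""")

threshold = 0.8  # same 80% idea as in Episode I's screening threshold
flagged = [s["url"] for s in answer["scores"] if s["confidence"] >= threshold]
avg = sum(s["confidence"] for s in answer["scores"]) / len(answer["scores"])
print(flagged, round(avg, 2))  # ['a.example', 'b.example'] 0.7
```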
Summing it all up

We learned how to classify web resources in different ways and detect restricted categories.

We do not always need to analyze the content itself: the URL alone may be sufficient to determine a category.

Even the simplest classification algorithms are still capable of doing some magic.
Further reading

Foster Provost, Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking

Markus Hofmann, Andrew Chisholm. Text Mining and Visualization: Case Studies Using Open-Source Tools

Eda Baykan et al. Purely URL-based Topic Classification

Myriam Abramson et al. What's in a URL? Genre Classification from URLs

Min-Yen Kan. Fast Webpage Classification Using URL Features

Thank you!

Vladimir Mikhnovich