
Websites categorization for e-commerce and compliance

Vladimir Mikhnovich, head of antifraud team @ AdCash
Machine Learning Estonia meetup, December 13th 2016
What brought me here today

1998: Moscow State Technical University
Master's degree in computer science (expert systems, NLP)

2000s: Moscow State Linguistic University
Postgraduate research in NLP (authorship detection)

2012-2016: Yandex.Money
Fraud analyst, data scientist, head of antifraud analytics

2016-now: AdCash
Head of antifraud team
Episode I: k-NN strikes back

Automatic detection of restricted categories across 100,000 websites' content
E-commerce acquiring pipeline: where to look

Online customer -> Merchant -> PSP (we are here)
Issuer bank (cards issue) -> Visa / MasterCard -> Acquiring bank
What to comply with

Business Risk Assessment and Mitigation (BRAM)
Global Brand Protection Program (GBPP)
The problem of merchant monitoring

Payment systems and aggregators must check merchants to avoid high-risk / prohibited categories and fraud.

Illegal merchant activities (totally prohibited, a.k.a. Deadly Sins): drugs, violence, porno
High-risk merchants (require limitations / additional checks): supplements, spyware, brand replicas
Business model viability and risks

In a normal business, high-risk merchants might make up 3-5% of the total, and legitimate business models are not optimized for selling illegal goods. But those 3% still have to be monitored and detected.

What happens on the Dark Side of the Web? You never know for sure.
The initial task

Regular automated scanning of big batches (tens of thousands) of merchant websites' content, determining high-risk categories for further manual screening.

Total number of active merchants: 100,000+
Problems to solve

Picking and labelling sites for the training dataset
Automatic downloading of thousands of websites
Processing content: from websites to text documents
Automatic classification of thousands of documents
Uncertainty of some categories
Defining thresholds for classification results
Bad guys we had

Adult: you know it
Drugs: includes laughing gas, smoking kits, hookah, alcohol, cigarettes
Replica: any brand copies / replicas (mostly watches, jewelry, bags)
Weapons: includes traumatic weapons, pepper sprays, shockers, etc.
Betting: online casinos, exchangers, bookmakers
Hour hotel: hotels / rooms with hourly rates
Magic: voodoo doctors, psychic activities, love potions
Spyware: hidden cams, listening devices, etc.
Supplements: sport and other nutrition supplements
Torrent: torrents, software downloads, restricted copyright material
Industry-scale machine learning systems

Cloud and distributed solutions (just to name a few):
Microsoft Azure Machine Learning
Amazon Machine Learning
IBM Watson Machine Learning

Some standalone products:
SPSS
SAS
MATLAB
Weka
RapidMiner
RapidMiner Studio environment
Tens of standard ML algorithms with fully customizable parameters
Visual drag and drop approach
Easy to build processes and evaluate models
No need to write code (R and Python extensions are also available)
Ready to use tool for an analyst familiar with ML algorithms
RapidMiner ecosystem

RapidMiner Studio: analytics, modelling and evaluation
RapidMiner Server: model deployment and web services
Standard modelling pipeline

Build dataset -> Train and test classification model -> Download new websites, apply model and get predictions
Training process

Training dataset: 11 categories, 300 labeled sites, 30,000 words
Text processing: extract text; tokenize, stem; build TF-IDF matrix
Model evaluation: k-NN classifier with cross-validation
TF-IDF metric

TF-IDF (term frequency - inverse document frequency): a numerical statistic intended to reflect how important a word is to a single document in a collection of documents.
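The text-processing step described above (tokenize, then build the TF-IDF matrix) can be sketched with scikit-learn; this is not the RapidMiner process actually used in the talk, and the toy documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the extracted website texts (invented examples;
# the real training matrix was 300 sites x 30,000 words).
docs = [
    "buy cheap watches replica watches best price",
    "hotel rooms with hourly rate booking",
    "watch repair service original parts",
]

# Tokenization and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# "watches" occurs twice in doc 0 and nowhere else, so its TF-IDF weight is
# high for that document and zero for the others.
print(tfidf.shape)
```

Each row of the resulting sparse matrix is one document's weighted word vector, which is exactly the representation the k-NN classifier consumes.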
k-NN and thresholds

Applied to text analysis, a k-NN classifier provides a measure of similarity between a given text document and the known categories.
Data structures and sizes

Text data size:
1 site converts to a plain text file of 0.3 to 1+ MB
The corpus makes 150-300 MB of text files in total

TF-IDF matrix examples:
Training data: 300 sites x 30,000 words (80 MB)
Test data: 800 sites x 60,000 words (400 MB)
Batching approach

Many attempts to classify thousands sites at once were actually


unsuccessful. Reason? Memory problems.
So far, another approach to overcome physical memory limitations was chosen:
batching. First we download websites and divide them into batches of
reasonable size (empirically, 200-300 sites is enough to fit all matrices in
memory), every batch is downloaded separately, and then all batches are
analyzed in a loop.

Thousands of websites -> Download and save all -> Loop over batches -> Classify every batch
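The batching loop described above might look like this in outline; `download_text` and `classify_batch` are hypothetical stand-ins for the real download and RapidMiner classification steps:

```python
# Sketch of the batching approach: classify sites in memory-sized chunks.
BATCH_SIZE = 250  # empirically, 200-300 sites fit all TF-IDF matrices in memory

def download_text(url):
    # Placeholder: the real pipeline crawls the site and converts it
    # to a plain text document (0.3 to 1+ MB each).
    return "plain text of " + url

def classify_batch(texts):
    # Placeholder: build the TF-IDF matrix for this batch only
    # and apply the trained k-NN model.
    return ["normal guys"] * len(texts)

def classify_all(urls):
    predictions = {}
    for start in range(0, len(urls), BATCH_SIZE):
        batch = urls[start:start + BATCH_SIZE]
        texts = [download_text(u) for u in batch]
        for url, label in zip(batch, classify_batch(texts)):
            predictions[url] = label
    return predictions

result = classify_all(["site%d.example" % i for i in range(600)])
print(len(result))  # 600 sites, processed in 3 batches
```

Only one batch's TF-IDF matrix is ever held in memory at a time, which is the whole point of the approach.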
k-NN: tuning k and defining thresholds

k=5 allows assigning meaningful confidence values to categories. Only high confidences (over the 80% threshold) are taken into account.

The original table scored each site against categories (adult, drugs, replica, weapons, normal, spyware, supplements, torrent); the per-column placement was lost in extraction, so only the top confidence and the resulting prediction are shown here:

site | prediction | top confidence
eroshop.ru | adult | 100%
putana78.com | adult | 100%
IntimCity.nl | adult | 81%
kupialco.ru | drugs | 100%
mari-juana.net | drugs | 100%
kyritelnie-smesi.nl | drugs | 81%
03market.ru | normal guys | 81%
1-ocenka.ru | normal guys | 40%
1c-interes.ru | normal guys | 60%
1gb.ru | normal guys | 100%
100-z.ru | spyware | 60%
100mile.ru | spyware | 80%
1belka.ru | spyware | 40%
1chef.ru | weapons | 100%
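The 5-NN confidence behaviour above can be reproduced in spirit with scikit-learn; the 2-D feature vectors below are invented stand-ins for TF-IDF rows, not the talk's actual data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D feature vectors standing in for 30,000-word TF-IDF rows.
X_train = np.array([[0.90, 0.10], [0.85, 0.15], [0.95, 0.05],
                    [0.80, 0.20], [0.88, 0.12],
                    [0.10, 0.90], [0.15, 0.85], [0.05, 0.95],
                    [0.20, 0.80], [0.12, 0.88]])
y_train = ["adult"] * 5 + ["normal guys"] * 5

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# With k=5 and uniform weights, confidences come in 20% steps (0/5 ... 5/5).
proba = knn.predict_proba([[0.87, 0.13]])[0]
best = proba.max()
label = knn.classes_[proba.argmax()]

# Only predictions over the 80% threshold go to manual screening.
THRESHOLD = 0.8
print(label if best >= THRESHOLD else "below threshold")  # prints "adult"
```

The 20% confidence steps explain why most values in the table are multiples of 20%; distance weighting would produce intermediate values like 81%/19%.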
Example of confusion matrix

accuracy: 88.46% (true classes in columns, predictions in rows)

| | true adult | true drugs | true replica | true weapons | true normal guys | true betting | true hour hotel | true magic | true spyware | true supplements | true torrent | class precision |
| pred. adult | 14 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 73.68% |
| pred. drugs | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.91% |
| pred. replica | 1 | 0 | 12 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 85.71% |
| pred. weapons | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100.00% |
| pred. normal guys | 1 | 2 | 1 | 0 | 88 | 4 | 0 | 1 | 0 | 0 | 1 | 89.80% |
| pred. betting | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 100.00% |
| pred. hour hotel | 0 | 0 | 0 | 0 | 2 | 0 | 8 | 0 | 0 | 0 | 0 | 80.00% |
| pred. magic | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 88.89% |
| pred. spyware | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9 | 0 | 0 | 90.00% |
| pred. supplements | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 100.00% |
| pred. torrent | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 4 | 66.67% |
| class recall | 82.35% | 76.92% | 92.31% | 100.00% | 88.89% | 69.23% | 100.00% | 88.89% | 100.00% | 100.00% | 80.00% | |
Finally, what's in and out

On average, around 3-5% of websites were assigned high confidence (>80%) during classification and needed to be screened manually.

Very big list of domains -> Classifier ->

site | prediction | confidence
zdoroviak.ru | supplements | 100%
zscom.ru | spyware | 100%
zwuk.ru | spyware | 100%
zarekoy.ru | hour hotel | 100%
zastava-izhevsk.ru | weapons | 100%
zutera.ru | replica | 80%
zoombao.com | replica | 80%
zita-gita.ru | adult | 80%
zishop.ru | replica | 80%
zdorovoetelo100.ru | supplements | 80%
zen-shop.ru | drugs | 80%
Performance and accuracy

Test run on 100 randomly picked websites:

Downloading time: 7 minutes
Processing & classification time: 30 seconds (0.3 sec per site)
Cross-validation model accuracy: 89% (without applying the threshold)
High-risk sites classified as normal (false negatives): 1
Correctly classified high-risk sites: 12 out of 13 (92%)
Normal sites classified as high risk (false positives): 0
Production environment architecture

Automatic scanning and handling of referrer URLs
Scheduled prediction web service runs automatically on new data
Predictions are manually reviewed thereafter
Model is updated incrementally after results review

Online customer -> Merchant -> PSP
Referrers database -> Prediction model -> Merchants screening
Two years later...

Episode II: Return of Naive Bayes

Automated detection of adult web referrers based on URL features
(in collaboration with Laure Daumal, fraud developer intern @ AdCash)
Online ad network business model and risks

Providing a connection between advertisers and publisher media
Taking care of the quality of the traffic we get from publishers

Publishers -> traffic -> Advertisers
Advertisers -> advertisement -> Publishers
The initial task

We need to keep the brand safe by monitoring restricted publisher thematics.

Is the traffic provided by publishers free from restricted thematics?

Let's start with detecting and predicting adult referrers, based ONLY on the referrer URL itself.
Why URLs and not content analysis?

Content analysis is still very time- and resource-consuming.

We do not need precise, real-time classification of web pages across many thematics, but rather detection of suspicious URLs within regular traffic analytics ("it looks like an adult page").

We need a stable and reliable method which can efficiently handle the amount of referrer data we have.
Amount of data we have

One big web publisher is capable of generating daily:

1,000,000+ ad impressions
500,000 unique referrers
100,000 unique domains

Be realistic: what to expect

What is our future classifier intended for? Given a certain URL, it should predict the likelihood that it belongs to an adult or other restricted topic.

What should we NOT expect from it? Given a page address, it is not expected to correctly guess its true content category.
Tools we can trust
Philosophical question (before choosing an algorithm)
What metric should we optimize the prediction model against?
Philosophical answer

Real-life problems are not a Kaggle competition!

"There are a bunch of data science performance measures around: AUC, AUPRC, F-measure, R-squared, RMSE, accuracy, and so on. All of these metrics won't help you in real life. You should simply ignore them. Every data science project should optimize on the value it targets in terms of business model needs. In the best case a data science performance measure can only be a proxy of your real objective."
(Martin Schmitz, data scientist @ RapidMiner)
What kind of error is more expensive?

We predicted when we shouldn't have, versus we didn't predict when we should have.

Churn prediction example:
How much does losing a customer cost?
How much did it cost us to keep them?
Was the customer we kept profitable?
Does prediction ever pay off for low-profile customers?

Fraud detection example:
How much does undetected fraud cost us?
What price do we pay for a blocked good customer?
What are the brand risks due to fraud?
How do we measure profit we'd never receive?
How much did we spend on building the model?
From URL to features

Raw URL:
https://report.adultwebmasternet.com/?page_type=indexes&sid=430&iid=1&gid=81462211&tid=0&kwid=0&uid=0

Word breaker (from SphinxSearch.com):
report adult webmaster net page type index es si ii gi ti kw id id

N-grams (3-6):
rep epo por ort repo epor port repor eport report adu dul ult adul dult adult web ebm bma mas ast ste ter webm ebma bmas mast aste ster webma ebmas bmast maste aster webmas ebmast bmaste master net pag age page typ ype type ind nde dex inde ndex index

Assign IDs + count frequencies:
rep:1, epo:2, por:3, ort:4, (...) plus a frequency parameter for each
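The URL-to-features pipeline above can be approximated in a few lines. This is a sketch: the talk used the SphinxSearch word breaker, while here a simple regex split stands in for it (so compound words are not split apart):

```python
import re
from collections import Counter

def url_ngrams(url, n_min=3, n_max=6):
    """Break a URL into words, then expand each word into character n-grams."""
    # Crude stand-in for the SphinxSearch word breaker used in the talk:
    # the real one would also split "adultwebmasternet" into "adult webmaster net".
    words = re.findall(r"[a-z]+", url.lower())
    grams = Counter()
    for word in words:
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                grams[word[i:i + n]] += 1
    return grams  # n-gram -> frequency, ready for ID assignment

features = url_ngrams("https://report.adultwebmasternet.com/?page_type=indexes")
print(features["adu"], features["adult"], features["report"])  # prints: 1 1 1
```

The resulting frequency counts are exactly the representation a multinomial model can consume directly.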
Choosing the metric and comparing classifiers

We can afford false positives, but we'd rather be sure that we detect most of the adult referrers out there; otherwise we might lose an unhappy advertiser.

So the recall metric (true positive rate) was chosen to answer the question:
"Given a positive example (adult URL), will the classifier detect it?"

Initial testing with a dictionary of 9,000 words and 1,500 URLs:

Classifier | Recall
Multinomial Naive Bayes | 0.98
MLP Classifier | 0.94
Decision Tree | 0.93
Logistic Regression | 0.91
Random Forest | 0.83
k-NN | 0.76
SVC | 0.68
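A comparison like the table above can be run with scikit-learn, scoring each classifier on recall. This is a toy sketch: the URL sample below is invented (the real test used a 9,000-word dictionary and 1,500 URLs), and only two of the listed classifiers are shown:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Tiny invented URL sample; 1 = adult, 0 = normal.
urls = ["adultwebmaster.example", "hotvideos-xxx.example", "sexcams.example",
        "adultdating.example", "xxxmovies.example", "pornsite.example",
        "news.example", "weather.example", "cooking.example",
        "sports.example", "music.example", "books.example"] * 5
labels = ([1] * 6 + [0] * 6) * 5

# Character n-grams (3-6), mirroring the word-breaker + n-gram pipeline.
vec = CountVectorizer(analyzer="char", ngram_range=(3, 6))
X = vec.fit_transform(urls)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels)

results = {}
for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_tr, y_tr)
    results[type(clf).__name__] = recall_score(y_te, clf.predict(X_te))

print(results)
```

On a realistic dataset, the recall of each candidate would be compared exactly as in the table before settling on Multinomial Naive Bayes.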
Multinomial Naive Bayes made simple

A multinomial Naive Bayes classifier is used when words can be represented in terms of their occurrences (frequency counts), not just the presence or absence of a particular word in the document.

Formal definition:
Applied to text analysis, a multinomial Naive Bayes classifier estimates the conditional probability of a particular word / term / token given a class as the relative frequency of that term in documents belonging to class C.

Putting it simpler:
If we have seen a particular term previously in many adult sites' URLs, probably another URL containing it also refers to an adult website.
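In symbols, the estimate described above is usually written with add-one (Laplace) smoothing to handle unseen terms; the notation here follows standard text-classification references, not the slides:

```latex
% Conditional probability of term t given class c, estimated as the
% smoothed relative frequency of t in the training documents of class c:
\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + |V|}

% Classification picks the class maximizing prior times term likelihoods:
\hat{c} = \arg\max_{c} \; P(c) \prod_{i} \hat{P}(t_i \mid c)
```

Here \(T_{ct}\) is the number of occurrences of term \(t\) in training documents of class \(c\), \(V\) is the vocabulary, and \(t_i\) ranges over the terms (n-grams) of the URL being classified.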
Data sources

Directory Mozilla, SquidGuard blacklists

Confusion matrix
Dataset: 20,000+ URLs
Dictionary size: 60,000 words
Validation: 5-fold cross-validation

| | True adult | True normal |
| Predicted adult | 10,189 | 500 |
| Predicted normal | 578 | 8,772 |

Recall: 0.95
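The recall figure follows directly from the confusion matrix above; a quick arithmetic check:

```python
# Recall = TP / (TP + FN), with the counts from the confusion matrix above.
tp = 10_189  # predicted adult, truly adult
fn = 578     # predicted normal, truly adult (missed)
recall = tp / (tp + fn)
print(round(recall, 2))  # prints 0.95
```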
API integration

A RESTful API allows using the model as a web service and getting predictions for generating custom traffic reports.

Prediction model <-> RESTful web service (PREDICT / LEARN / VALIDATE) <-> Reporting engine

API functions

API call example: PREDICT
Input: one URL / multiple URLs / text file (newline- or space-separated)
Example request URIs:
http://88.196.223.122:5000/predict
http://88.196.223.122:5000/learn
Example API answer with prediction result (JSON format):


{
"avg_score": 0.9999999800429088,
"scores": [{
"url": "http://www.smokingwithstyle.com",
"confidence": 1.0
}, {
"url": "http://www.potseeds.co.uk/magic-mushroom-spore-syringes-10ml",
"confidence": 1.0
}, (...)
]
}
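A client consuming this web service might parse the response like so. The field names (`avg_score`, `scores`, `url`, `confidence`) are taken from the slide's example payload; reusing the 80% threshold from Episode I is an assumption for illustration:

```python
import json

# Example response body copied from the slide above.
payload = """
{
  "avg_score": 0.9999999800429088,
  "scores": [
    {"url": "http://www.smokingwithstyle.com", "confidence": 1.0},
    {"url": "http://www.potseeds.co.uk/magic-mushroom-spore-syringes-10ml",
     "confidence": 1.0}
  ]
}
"""

result = json.loads(payload)

# Keep only URLs at or above an assumed 80% confidence threshold.
flagged = [s["url"] for s in result["scores"] if s["confidence"] >= 0.8]
print(len(flagged), round(result["avg_score"], 2))  # prints: 2 1.0
```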
Summing it all up

We learned how to classify web resources in different ways and detect restricted categories.

We do not always need to analyze the content itself: the URL alone may be sufficient to determine a category.

Even the simplest classification algorithms are still capable of doing some magic.
Further reading

Books:
Foster Provost, Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
Markus Hofmann, Andrew Chisholm. Text Mining and Visualization: Case Studies Using Open-Source Tools

Whitepapers:
Eda Baykan et al. Purely URL-based Topic Classification
Myriam Abramson et al. What's in a URL? Genre Classification from URLs
Min-Yen Kan. Fast webpage classification using URL features


Thank you!

Vladimir Mikhnovich
v.mikhnovich@adcash.com
https://ee.linkedin.com/in/kypexin