• Important issues
What personal information is transmitted?
To whom does it go?
What can average users do about it?
Initial Analysis of PII Leaks
• What do our apps share about us?
• Controlled experiments
Seed devices with conspicuous PII
Manual tests of top 100 apps for each OS
iOS, Android, Windows Phone
(Note: manual test results have significantly better coverage than automated tests.)
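The slides do not say how flows from the controlled experiments are labeled. Since the seeded PII is known ahead of time, one plausible approach, shown here purely as an assumption, is a case-insensitive substring search over each captured flow. The seeded values and the function name below are hypothetical.

```python
# Hypothetical seeded values; the slides do not list the actual PII used.
SEEDED_PII = {
    "name": "Jren",
    "password": "somepassword",
}


def leaked_pii_types(flow, seeded=SEEDED_PII):
    """Return the sorted PII types whose seeded value appears in the flow text."""
    lowered = flow.lower()
    return sorted(
        kind for kind, value in seeded.items() if value.lower() in lowered
    )


flow = "/index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword"
print(leaked_pii_types(flow))
```

A flow with a non-empty result would be labeled as leaking PII. Plain matching like this would miss encoded or hashed values, so it is only a starting point.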
How Frequently Is PII Leaked?
[Figure: architecture diagram with labels: VPN, Meddle Server, ReCon]
Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
• Q4: Effectiveness in the wild
Machine Learning Approach
• Training dataset: controlled experiments
• Text classification approaches
Problem: Given a network flow, does it contain PII?
GET /product?pid=245678901&color=blanco&device=iPhone&browser=safari
GET /index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword (contains PII)
Machine Learning Approach
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Feature Extraction: Bag-of-words model
Flow: /product?pid=24567890&color=blanco&device=iPhone&browser=safari
Words: product, pid, 234567890, color, blanco, device, iPhone, browser, safari
Flow: /Index.html?id=1234567890;foo=bar;name=Jren;pass=somepassword
Words: Index.html, id, 1234567890, foo, bar, name, JRen, pass, somepassword
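A minimal sketch of the bag-of-words step: split the flow text on delimiter characters and keep the distinct words. The delimiter set here is an assumption (characters that commonly separate tokens in URLs); the slides do not list the exact delimiters used.

```python
import re

# Assumed delimiter set; "." is deliberately excluded so "index.html" stays one word.
DELIMITERS = re.compile(r"[/?&=;,: ]+")


def bag_of_words(flow):
    """Split a flow's text on the delimiters and return the set of distinct words."""
    return {token for token in DELIMITERS.split(flow) if token}


flow = "/index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword"
print(sorted(bag_of_words(flow)))
```

Each distinct word then becomes one feature of the flow, regardless of where it appeared in the URL.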
Machine Learning Approach
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Feature Extraction: Bag-of-words model
Feature Selection: Threshold-based filter
word            frequency
Index.html      52
id              33
1234567890      2
foo             33
bar             2
name            33
JRen            2
pass            33
somepassword    2
product         43
pid             44
2345678976      3
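A threshold-based filter over the word frequencies could look like the sketch below. Two things are assumptions, since the slides give neither: the direction of the filter (rare words are dropped) and the cutoff value (`min_count=10` is hypothetical). The counts are the ones from the frequency table.

```python
# Word frequencies copied from the table above.
WORD_COUNTS = {
    "Index.html": 52, "id": 33, "1234567890": 2,
    "foo": 33, "bar": 2, "name": 33, "JRen": 2,
    "pass": 33, "somepassword": 2,
    "product": 43, "pid": 44, "2345678976": 3,
}


def select_features(counts, min_count):
    """Keep words whose frequency is at least min_count (assumed filter direction)."""
    return {word for word, count in counts.items() if count >= min_count}


selected = select_features(WORD_COUNTS, min_count=10)  # hypothetical cutoff
print(sorted(selected))
```

With this cutoff the keys (`id`, `name`, `pass`, ...) survive while the rarely seen values (`JRen`, `somepassword`, ...) are dropped, so the classifier learns from the structure around PII rather than from one user's values.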
Machine Learning Approach
• Text classification approaches
Per-domain-and-OS classifiers (e.g., Google-Analytics)
More accurate
Faster (compared to the one-size-fits-all general classifier)
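One way to organize per-domain-and-OS classifiers is a lookup keyed by (domain, OS). The sketch below assumes a fallback to the general classifier when no specialized one exists; that fallback, the class name, and the toy classifiers are all hypothetical and not taken from the slides.

```python
from typing import Callable, Dict, Set, Tuple

# A classifier maps a flow's bag of words to True (contains PII) or False.
Classifier = Callable[[Set[str]], bool]


class ClassifierRegistry:
    """Dispatch a flow to its (domain, OS) classifier, else to the general one."""

    def __init__(self, general: Classifier) -> None:
        self.general = general
        self.specialized: Dict[Tuple[str, str], Classifier] = {}

    def register(self, domain: str, os_name: str, clf: Classifier) -> None:
        self.specialized[(domain.lower(), os_name.lower())] = clf

    def predict(self, domain: str, os_name: str, words: Set[str]) -> bool:
        key = (domain.lower(), os_name.lower())
        clf = self.specialized.get(key, self.general)  # assumed fallback
        return clf(words)


# Toy stand-ins for trained models (hypothetical rules, for illustration only).
def general_clf(words: Set[str]) -> bool:
    return bool(words & {"name", "pass", "email"})


def analytics_android_clf(words: Set[str]) -> bool:
    return "uid" in words


registry = ClassifierRegistry(general_clf)
registry.register("google-analytics.com", "Android", analytics_android_clf)
print(registry.predict("google-analytics.com", "android", {"uid", "v"}))
print(registry.predict("example.com", "iOS", {"name", "foo"}))
```

Each specialized classifier only sees the vocabulary of one destination on one OS, which is one intuition for why it can be both smaller and more accurate than a single general model.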
Evaluation: Accuracy
• Metrics:
Correctly Classified Rate (CCR)
(True Negative + True Positive) / All
False Positive Rate
False Negative Rate
Area under the curve (AUC)
Range [0, 1]; measures predictive power
AUC = 0.5: random
AUC ≈ 1: good
• Manual test dataset
10-fold cross-validation
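The metrics above can be sketched in a few lines. CCR follows the formula on the slide; the false positive and false negative rates use their standard definitions, FP / (FP + TN) and FN / (FN + TP), which the slide does not spell out. AUC is computed here as the probability that a random positive flow scores higher than a random negative one, and the fold assignment (every k-th sample) is an assumed scheme. All function names and the sample data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels (truthy = contains PII)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn


def ccr(tp, tn, fp, fn):
    """Correctly Classified Rate: (TN + TP) / all samples."""
    return (tn + tp) / (tp + tn + fp + fn)


def false_positive_rate(fp, tn):
    return fp / (fp + tn) if (fp + tn) else 0.0


def false_negative_rate(fn, tp):
    return fn / (fn + tp) if (fn + tp) else 0.0


def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; fold i tests every k-th sample starting at i."""
    for i in range(k):
        test = list(range(i, n, k))
        train = [j for j in range(n) if j % k != i]
        yield train, test


# Made-up labels, predictions, and scores for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.3, 0.35, 0.7]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("CCR:", ccr(tp, tn, fp, fn))
print("FPR:", false_positive_rate(fp, tn))
print("FNR:", false_negative_rate(fn, tp))
print("AUC:", auc(y_true, scores))
```

In 10-fold cross-validation, each of the ten (train, test) splits trains a fresh classifier on nine tenths of the manual test dataset and scores it on the held-out tenth; the reported metric is then typically the average over folds.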
Evaluation: Accuracy (CCR)
Most per-domain-and-OS decision tree (DT) classifiers have high accuracy