
ReCon: Revealing and Controlling PII Leaks in Mobile Network Traffic


Privacy in the Mobile Internet
• Mobile devices
Rich sensors
Ubiquitous connectivity

• Important issues
What personal information is transmitted?
To whom does it go?
What can average users do about it?
Initial Analysis of PII Leaks
• What do our apps share about us?

• Controlled experiments
Seed devices with conspicuous PII
Manual tests of top 100 apps for each OS
iOS, Android, Windows Phone
(Note: results have significantly better coverage than automated tests.)
How Frequently Is PII Leaked?

• Basic tracking is common
• A significant fraction of very personal information is leaked across all platforms
Why Do These Issues Persist?
• Devices are locked down by OSes and carriers
• OS-level solutions
OS vendors do not want to add new features
Researchers have limited ability to modify OS
Users are hesitant to root their devices
• In-network solutions
Who can get access to a mobile carrier?
What happens when users change networks?
Using VPNs to Meddle with Mobile
• Opportunity: (almost) all devices support VPNs
Tunnel traffic to a server we control
Measure, modify, shape, or block traffic with user opt-in
User incentives (e.g., privacy filtering, content blocking, …)
Detecting PII Leaks in the Network
• Software middleboxes expose network traffic
Independent of OS, app store
Easy to detect PII if you know what to search for

What if you don’t know the PII a priori?


Automatically Identifying PII Leaks
• Hypothesis: PII leaks have distinguishing characteristics
Is it just simple key/value pairs (e.g., "userId=R3CON")?
Nope, this leads to high FPR (5.1%) and FNR (18.8%) (see the sketch below)
Need to learn the structure of PII leaks

Approach: Build ML classifiers to reliably detect leaks


Does not require knowing PII in advance
Resilient to changes in PII leak formats over time
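To make the rejected baseline concrete, here is a minimal Python sketch of a simple key/value heuristic of the kind alluded to above; the key list, regex, and function name are illustrative assumptions, not ReCon's actual rule.

import re

# Illustrative list of PII-sounding parameter names (assumption, not ReCon's).
PII_LIKE_KEYS = {"userid", "username", "email", "password", "imei", "deviceid"}

def looks_like_pii_leak(flow_text: str) -> bool:
    """Heuristic: does any key=value pair use a PII-sounding key name?"""
    for key, _value in re.findall(r"(\w+)=([^&;\s]+)", flow_text):
        if key.lower() in PII_LIKE_KEYS:
            return True
    return False

print(looks_like_pii_leak("GET /track?userId=R3CON&os=ios"))            # True
print(looks_like_pii_leak("GET /product?pid=245678901&color=blanco"))   # False

A rule like this misses leaks that use non-obvious key names and flags benign parameters that happen to use them, which is consistent with the high FPR/FNR noted above and motivates learning the structure of leaks instead.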
ReCon Components
• Machine learning to reveal PII leaks from mobile devices

[Diagram: device traffic is tunneled through a VPN to the Meddle server, where ReCon runs]
Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
• Q4: Effectiveness in the wild
Machine Learning Approach
• Training dataset: controlled experiments
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Example flows:
GET /product?pid=245678901&color=blanco&device=iPhone&browser=safari
GET /index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword
Machine Learning Approach
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Feature extraction: bag-of-words model

Flow: /product?pid=245678901&color=blanco&device=iPhone&browser=safari
Words: product, pid, 245678901, color, blanco, device, iPhone, browser, safari

Flow: /index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword
Words: index.html, id, 1234567890, foo, bar, name, Jren, pass, somepassword
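A minimal sketch of the bag-of-words step, assuming flows are tokenized on common URI/query delimiters; the delimiter set is an assumption and ReCon's actual tokenizer may differ.

import re
from collections import Counter

def bag_of_words(flow_text: str) -> Counter:
    """Split a flow on structural delimiters and count the resulting words."""
    words = [w for w in re.split(r"[/?&;=\s]+", flow_text) if w]
    return Counter(words)

flow = "GET /index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword"
print(bag_of_words(flow))
# Counter({'GET': 1, 'index.html': 1, 'id': 1, '1234567890': 1, 'foo': 1, ...})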
Machine Learning Approach
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Feature extraction: bag-of-words model
Feature selection: threshold-based filter

word          frequency
Index.html    52
id            33
1234567890     2
foo           33
bar            2
name          33
JRen           2
pass          33
somepassword   2
product       43
pid           44
2345678976     3
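A minimal sketch of the threshold-based filter, assuming the intent is to keep words that recur across many flows (structural keys such as id, name, pass) and drop rare words that are likely per-user values; the frequencies come from the table above, and the threshold of 10 is an illustrative assumption rather than ReCon's actual setting.

# Word frequencies observed across training flows (from the table above).
word_freq = {
    "Index.html": 52, "id": 33, "1234567890": 2, "foo": 33, "bar": 2,
    "name": 33, "JRen": 2, "pass": 33, "somepassword": 2,
    "product": 43, "pid": 44, "2345678976": 3,
}

THRESHOLD = 10  # illustrative cutoff (assumption)
selected_features = [w for w, freq in word_freq.items() if freq >= THRESHOLD]
print(selected_features)
# ['Index.html', 'id', 'foo', 'name', 'pass', 'product', 'pid']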
Machine Learning Approach
• Text classification approaches
Per-domain-and-OS classifiers (e.g., Google Analytics), sketched below
More accurate
Faster (compared to the one-size-fits-all general classifier)
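A minimal sketch of per-domain-and-OS classifiers using scikit-learn decision trees: one tree per (destination domain, OS) pair, trained on bag-of-words vectors. The toy flows, labels, and (domain, OS) key are assumptions for illustration, not ReCon's training data or exact pipeline.

from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# (destination domain, OS) -> list of (flow text, leaks PII?) training samples.
samples = defaultdict(list)
samples[("google-analytics.com", "android")] += [
    ("GET /collect?uid=1234567890&an=MyApp", 1),
    ("GET /collect?an=MyApp&sr=1080x1920", 0),
    ("GET /collect?uid=987654321&an=OtherApp", 1),
    ("GET /collect?an=OtherApp&sr=720x1280", 0),
]

classifiers = {}
for key, flows in samples.items():
    texts, labels = zip(*flows)
    vectorizer = CountVectorizer(token_pattern=r"[^/?&;=\s]+")  # split on URI delimiters
    X = vectorizer.fit_transform(texts)
    classifiers[key] = (vectorizer, DecisionTreeClassifier().fit(X, labels))

# Classify a new flow with the classifier trained for its domain and OS.
vec, clf = classifiers[("google-analytics.com", "android")]
print(clf.predict(vec.transform(["GET /collect?uid=555555&an=MyApp"])))  # -> [1]

Training one small tree per (domain, OS) keeps the vocabulary per classifier small, which is one way to see why this setup can be both more accurate and faster than a single general classifier.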
Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
• Q4: Effectiveness in the wild
Evaluation – Accuracy
• Metrics:
Correctly classified rate (CCR)
(True Negatives + True Positives) / All
False positive rate (FPR)
False negative rate (FNR)
Area under the curve (AUC)
Range [0, 1]; measures predictive power
AUC = 0.5: random guessing
AUC ≈ 1: strong predictive power
• Manual test dataset
10-fold cross-validation
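For concreteness, a minimal sketch of how these metrics can be computed with scikit-learn; the labels and scores below are made-up placeholders, not ReCon's results.

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0]               # ground truth: 1 = flow leaks PII
y_pred  = [0, 1, 1, 1, 0, 0]               # classifier decisions
y_score = [0.1, 0.7, 0.9, 0.8, 0.4, 0.2]   # classifier confidence, used for AUC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ccr = (tn + tp) / (tn + fp + fn + tp)      # correctly classified rate
fpr = fp / (fp + tn)                       # false positive rate
fnr = fn / (fn + tp)                       # false negative rate
auc = roc_auc_score(y_true, y_score)       # 0.5 = random, ~1 = strong

print(f"CCR={ccr:.2f} FPR={fpr:.2f} FNR={fnr:.2f} AUC={auc:.2f}")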
Evaluation – Accuracy (CCR)

[Figure: CDF of per-domain-and-OS (PDAO) classifier accuracy]

• Most per-domain-and-OS DT classifiers have high accuracy
• Decision tree (DT) > Naïve Bayes
• Training time: DT-based ensembles > a simple DT
• Per-domain-and-OS (>95%) > general classifier
• 60% of DTs have zero error
Evaluation – Accuracy (AUC)

[Figure: CDF of per-domain-and-OS (PDAO) classifier AUC]

• Area under the curve (AUC) ranges over [0, 1]
AUC = 0.5: random guessing
AUC ≈ 1: strong predictive power
• Most (67%) DT-based classifiers have AUC = 1
Evaluation – Accuracy (FNR & FPR)

[Figure: CDF of per-domain-and-OS (PDAO) classifier accuracy]

• Most DT-based classifiers have zero FPs (71.4%) and zero FNs (76.2%)
• DT classifiers are NOT black boxes (see the sketch below)
Align with domain intuition
Non-trivial ML problem
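A minimal sketch of the "not a black box" point: a trained decision tree can be printed and its rules checked against domain intuition. The toy training flows and labels below are assumptions for illustration only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

flows = [
    "GET /collect?uid=1234567890&an=MyApp",    # leaks a user ID
    "GET /collect?an=MyApp&sr=1080x1920",      # no PII
    "GET /collect?uid=987654321&an=OtherApp",  # leaks a user ID
    "GET /collect?an=OtherApp&cd=home",        # no PII
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer(token_pattern=r"[^/?&;=\s]+")
X = vectorizer.fit_transform(flows)
clf = DecisionTreeClassifier().fit(X, labels)

# On this toy data the tree splits on the presence of the "uid" key,
# matching the intuition that the key (not its ever-changing value)
# signals the leak.
print(export_text(clf, feature_names=list(vectorizer.get_feature_names_out())))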
Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
Information flow analysis
• Q4: Effectiveness in the wild
Evaluation – Comparison with IFA
• Information flow analysis (IFA)
Resilient to encrypted/obfuscated flows
Dynamic IFA: Andrubis
Static IFA: FlowDroid
Hybrid IFA: AppAudit
Susceptible to false positives, but not false negatives
• Dataset
750 Android apps producing network traffic
Compare the number of apps that potentially leak PII
ReCon Outperforms Dynamic IFA
• How does ReCon compare to Andrubis (TaintDroid)?
ReCon Has Better Overall Coverage
• ReCon even outperforms static/hybrid IFA

ReCon finds significantly more PII than IFA solutions


Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
Information flow analysis
• Q4: Effectiveness in the wild
Overview of Ongoing User Study
• IRB-approved user study (302 users) as of June 2016
162 iOS devices, 157 Android devices
20 of 26 survey respondents found the system useful and reported changing their behavior
PII found: 25,787 cases (10,137 confirmed)
• Surprising cases and impact
195 cases of credential leaks, 117 verified
Identified 22 apps exposing passwords in plaintext
Used by millions (Match, Epocrates)
Responsibly disclosed, gave 3 months to remediate
14 have fixed the problem
Match, Epocrates, Musical.ly, Flixster, etc.
• Anonymized report of PII leaks
Apps: http://recon.meddle.mobi/app-report.html
Mobile browsing: http://recon.meddle.mobi/web-report.html
Summary
• Need for improved transparency/control over PII
• ReCon approach addresses this
Learn what information is being leaked
Crowdsourcing to determine correctness/importance
Allow users to block/change what is leaked
• Code and data: http://recon.meddle.mobi
