
ReCon: Revealing and Controlling PII Leaks in Mobile Network Traffic


Privacy in the Mobile Internet
• Mobile devices
Rich sensors
Ubiquitous connectivity

• Important issues
What personal information is transmitted?
To whom does it go?
What can average users do about it?
Initial Analysis of PII Leaks
• What do our apps share about us?

• Controlled experiments
Seed devices with conspicuous PII
Manual tests of top 100 apps for each OS
iOS, Android, Windows Phone
(Note: results have significantly better coverage than automated tests.)
How Frequently Is PII Leaked?

• Basic tracking is common
• A significant fraction of very personal information is leaked across all platforms
Why Do These Issues Persist?
• Devices are locked down by OSes and carriers
• OS-level solutions
OS vendors do not want to add new features
Researchers have limited ability to modify OS
Users are hesitant to root their devices
• In-network solutions
Who can get access to a mobile carrier?
What happens when users change networks?
Using VPNs to Meddle with Mobile
• Opportunity: (almost) all devices support VPNs
Tunnel traffic to a server we control
Measure, modify, shape, or block traffic with user opt-in
User incentives (e.g., privacy filtering, content blocking, …)
Detecting PII Leaks in the Network
• Software middleboxes expose network traffic
Independent of OS, app store
Easy to detect PII if you know what to search for

What if you don’t know the PII a priori?


Automatically Identifying PII Leaks
• Hypothesis: PII leaks have distinguishing characteristics
Is it just simple key/value pairs (e.g., "userId=R3CON")?
Nope, this leads to high FPR (5.1%) and FNR (18.8%) (see the sketch below)
Need to learn the structure of PII leaks

Approach: Build ML classifiers to reliably detect leaks


Does not require knowing PII in advance
Resilient to changes in PII leak formats over time
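To make the rejected baseline concrete, here is a minimal Python sketch of a simple key/value heuristic of the kind alluded to above; the key list, regex, and function name are illustrative assumptions, not ReCon's actual rule.

import re

# Illustrative list of PII-sounding parameter names (assumption, not ReCon's).
PII_LIKE_KEYS = {"userid", "username", "email", "password", "imei", "deviceid"}

def looks_like_pii_leak(flow_text: str) -> bool:
    """Heuristic: does any key=value pair use a PII-sounding key name?"""
    for key, _value in re.findall(r"(\w+)=([^&;\s]+)", flow_text):
        if key.lower() in PII_LIKE_KEYS:
            return True
    return False

print(looks_like_pii_leak("GET /track?userId=R3CON&os=ios"))            # True
print(looks_like_pii_leak("GET /product?pid=245678901&color=blanco"))   # False

A rule like this misses leaks that use non-obvious key names and flags benign parameters that happen to use them, which is consistent with the high FPR/FNR noted above and motivates learning the structure of leaks instead.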
ReCon Components
• Machine learning to reveal PII leaks from mobile devices

[Diagram: device traffic is tunneled through a VPN to the Meddle server, where ReCon runs]
Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
• Q4: Effectiveness in the wild
Machine Learning Approach
• Training dataset: controlled experiments
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Example flows:
GET /product?pid=245678901&color=blanco&device=iPhone&browser=safari
GET /index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword
Machine Learning Approach
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Feature extraction: bag-of-words model

Flow: /product?pid=245678901&color=blanco&device=iPhone&browser=safari
Words: product, pid, 245678901, color, blanco, device, iPhone, browser, safari

Flow: /index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword
Words: index.html, id, 1234567890, foo, bar, name, Jren, pass, somepassword
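A minimal sketch of the bag-of-words step, assuming flows are tokenized on common URI/query delimiters; the delimiter set is an assumption and ReCon's actual tokenizer may differ.

import re
from collections import Counter

def bag_of_words(flow_text: str) -> Counter:
    """Split a flow on structural delimiters and count the resulting words."""
    words = [w for w in re.split(r"[/?&;=\s]+", flow_text) if w]
    return Counter(words)

flow = "GET /index.html?id=1234567890&foo=bar&name=Jren&pass=somepassword"
print(bag_of_words(flow))
# Counter({'GET': 1, 'index.html': 1, 'id': 1, '1234567890': 1, 'foo': 1, ...})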
Machine Learning Approach
• Text classification approaches
Problem: Given a network flow, does it contain PII?
Feature extraction: bag-of-words model
Feature selection: threshold-based filter

word          frequency
Index.html    52
id            33
1234567890     2
foo           33
bar            2
name          33
JRen           2
pass          33
somepassword   2
product       43
pid           44
2345678976     3
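A minimal sketch of the threshold-based filter, assuming the intent is to keep words that recur across many flows (structural keys such as id, name, pass) and drop rare words that are likely per-user values; the frequencies come from the table above, and the threshold of 10 is an illustrative assumption rather than ReCon's actual setting.

# Word frequencies observed across training flows (from the table above).
word_freq = {
    "Index.html": 52, "id": 33, "1234567890": 2, "foo": 33, "bar": 2,
    "name": 33, "JRen": 2, "pass": 33, "somepassword": 2,
    "product": 43, "pid": 44, "2345678976": 3,
}

THRESHOLD = 10  # illustrative cutoff (assumption)
selected_features = [w for w, freq in word_freq.items() if freq >= THRESHOLD]
print(selected_features)
# ['Index.html', 'id', 'foo', 'name', 'pass', 'product', 'pid']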
Machine Learning Approach
• Text classification approaches
Per-domain-and-OS classifiers (e.g., Google Analytics), sketched below
More accurate
Faster (compared to the one-size-fits-all general classifier)
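A minimal sketch of per-domain-and-OS classifiers using scikit-learn decision trees: one tree per (destination domain, OS) pair, trained on bag-of-words vectors. The toy flows, labels, and (domain, OS) key are assumptions for illustration, not ReCon's training data or exact pipeline.

from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# (destination domain, OS) -> list of (flow text, leaks PII?) training samples.
samples = defaultdict(list)
samples[("google-analytics.com", "android")] += [
    ("GET /collect?uid=1234567890&an=MyApp", 1),
    ("GET /collect?an=MyApp&sr=1080x1920", 0),
    ("GET /collect?uid=987654321&an=OtherApp", 1),
    ("GET /collect?an=OtherApp&sr=720x1280", 0),
]

classifiers = {}
for key, flows in samples.items():
    texts, labels = zip(*flows)
    vectorizer = CountVectorizer(token_pattern=r"[^/?&;=\s]+")  # split on URI delimiters
    X = vectorizer.fit_transform(texts)
    classifiers[key] = (vectorizer, DecisionTreeClassifier().fit(X, labels))

# Classify a new flow with the classifier trained for its domain and OS.
vec, clf = classifiers[("google-analytics.com", "android")]
print(clf.predict(vec.transform(["GET /collect?uid=555555&an=MyApp"])))  # -> [1]

Training one small tree per (domain, OS) keeps the vocabulary per classifier small, which is one way to see why this setup can be both more accurate and faster than a single general classifier.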
Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
• Q4: Effectiveness in the wild
Evaluation – Accuracy
• Metrics:
Correctly classified rate (CCR)
(True Negatives + True Positives) / All
False positive rate (FPR)
False negative rate (FNR)
Area under the curve (AUC)
Range [0, 1]; measures predictive power
AUC = 0.5: random guessing
AUC ≈ 1: strong predictive power
• Manual test dataset
10-fold cross-validation
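For concreteness, a minimal sketch of how these metrics can be computed with scikit-learn; the labels and scores below are made-up placeholders, not ReCon's results.

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0]               # ground truth: 1 = flow leaks PII
y_pred  = [0, 1, 1, 1, 0, 0]               # classifier decisions
y_score = [0.1, 0.7, 0.9, 0.8, 0.4, 0.2]   # classifier confidence, used for AUC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ccr = (tn + tp) / (tn + fp + fn + tp)      # correctly classified rate
fpr = fp / (fp + tn)                       # false positive rate
fnr = fn / (fn + tp)                       # false negative rate
auc = roc_auc_score(y_true, y_score)       # 0.5 = random, ~1 = strong

print(f"CCR={ccr:.2f} FPR={fpr:.2f} FNR={fnr:.2f} AUC={auc:.2f}")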
Evaluation – Accuracy (CCR)

[Figure: CDF of per-domain-and-OS (PDAO) classifier accuracy]

• Most per-domain-and-OS DT classifiers have high accuracy
• Decision tree (DT) > Naïve Bayes
• Training time: DT-based ensembles > a simple DT
• Per-domain-and-OS (>95%) > general classifier
• 60% of DTs have zero error
Evaluation – Accuracy (AUC)

[Figure: CDF of per-domain-and-OS (PDAO) classifier AUC]

• Area under the curve (AUC) ranges over [0, 1]
AUC = 0.5: random guessing
AUC ≈ 1: strong predictive power
• Most (67%) DT-based classifiers have AUC = 1
Evaluation – Accuracy (FNR & FPR)

[Figure: CDF of per-domain-and-OS (PDAO) classifier accuracy]

• Most DT-based classifiers have zero FPs (71.4%) and zero FNs (76.2%)
• DT classifiers are NOT black boxes (see the sketch below)
Align with domain intuition
Non-trivial ML problem
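A minimal sketch of the "not a black box" point: a trained decision tree can be printed and its rules checked against domain intuition. The toy training flows and labels below are assumptions for illustration only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

flows = [
    "GET /collect?uid=1234567890&an=MyApp",    # leaks a user ID
    "GET /collect?an=MyApp&sr=1080x1920",      # no PII
    "GET /collect?uid=987654321&an=OtherApp",  # leaks a user ID
    "GET /collect?an=OtherApp&cd=home",        # no PII
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer(token_pattern=r"[^/?&;=\s]+")
X = vectorizer.fit_transform(flows)
clf = DecisionTreeClassifier().fit(X, labels)

# On this toy data the tree splits on the presence of the "uid" key,
# matching the intuition that the key (not its ever-changing value)
# signals the leak.
print(export_text(clf, feature_names=list(vectorizer.get_feature_names_out())))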
Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
Information flow analysis
• Q4: Effectiveness in the wild
Evaluation – Comparison with IFA
• Information flow analysis (IFA)
Resilient to encrypted/obfuscated flows
Dynamic IFA: Andrubis
Static IFA: FlowDroid
Hybrid IFA: AppAudit
Susceptible to false positives, but not false negatives
• Dataset
750 Android apps producing network traffic
Compare the number of apps that potentially leak PII
ReCon Outperforms Dynamic IFA
• How does ReCon compare to Andrubis (TaintDroid)?
ReCon Has Better Overall Coverage
• ReCon even outperforms static/hybrid IFA

ReCon finds significantly more PII than IFA solutions


Learning When PII Is Leaked
• Q1: How do we train a classifier?
• Q2: How well do different ML approaches perform?
Accuracy/effectiveness of classifiers
General vs. per-domain/OS classifiers
Feature selection
Training time
PII extraction strategies
Impact of retraining
• Q3: How does it compare with alternative approaches?
Information flow analysis
• Q4: Effectiveness in the wild
Overview of Ongoing User Study
• IRB-approved user study (302 users) as of June 2016
162 iOS devices, 157 Android devices
20 of 26 survey respondents found the system useful and reported changing their behavior
PII found: 25,787 cases (10,137 confirmed)
• Surprising cases and impact
195 cases of credential leaks, 117 verified
Identified 22 apps exposing passwords in plaintext
Used by millions (Match, Epocrates)
Responsibly disclosed, gave 3 months to remediate
14 have fixed the problem
Match, Epocrates, Musical.ly, Flixster, etc.
• Anonymized report of PII leaks
Apps: http://recon.meddle.mobi/app-report.html
Mobile browsing: http://recon.meddle.mobi/web-report.html
Summary
• Need for improved transparency/control over PII
• ReCon approach addresses this
Learn what information is being leaked
Crowdsourcing to determine correctness/importance
Allow users to block/change what is leaked
• Code and data: http://recon.meddle.mobi
