
Taming Fragmentation-Induced Compatibility Issues in Android Applications

by

Lili Wei

A Thesis Submitted to
The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
in the Department of Computer Science and Engineering

April 2020, Hong Kong

Acknowledgments

The pursuit of a Ph.D. degree is a long journey that I could never have completed
alone. I would like to express my gratitude to all the lovely people who have
generously offered me help and support during my Ph.D. study.

First, I would like to thank my supervisor, Prof. Shing-Chi Cheung. Prof.
Cheung has offered me tremendous support in both research and life. I am deeply
impressed by Prof. Cheung’s love and excitement about research. I can always feel
his excitement about research from his eyes when we discuss interesting research
problems during our meetings. His attitude toward research has also had a significant
influence on me. He has offered me the freedom to explore research areas of my own
interest while giving me many detailed suggestions. He has guided me to identify
interesting research problems, to dig into the details needed to resolve them, and
to write up my research contributions in papers. Prof. Cheung is a very patient, kind,
and supportive supervisor. I am very grateful to be a student of Prof. Cheung. I
have spent five fruitful years in the CASTLE group.

Second, I would like to thank Dr. Yepang Liu. I have learned a great deal from
Dr. Yepang Liu, especially when I was a junior student. He has given me many practical
suggestions and helped me get through many obstacles in my research projects.
Yepang is a very careful, helpful, and responsible co-author. I will always remember
the time we were chasing a deadline right after his baby was born. A new father can
be very busy, but Yepang still spent a lot of time discussing the work with me and
helping me revise our paper. Yepang is also a good mentor and friend. I really
appreciate the experience of working with him.

Third, I would like to thank Prof. Yuanyuan Zhou, with whom I spent a
memorable half year. I have learned a great deal from Prof. Zhou's philosophy of
doing research. Prof. Zhou has always taught us to be good researchers and has been
willing to share her rich experience. I experienced a different research environment at
UCSD, from which I benefited a lot. In addition, Prof. Zhou also gave me many
suggestions from the perspective of a female researcher. These are also valuable
experiences for me.

Fourth, I would like to thank my thesis defense committee, Prof. Mary Lou Soffa,
Prof. Rhea Liem, Prof. Raymond Wong, Prof. Shuai Wang, and Prof. Kai Liu. They
carefully reviewed my thesis and spent their valuable time attending my defense
(at the very special moment when COVID-19 broke out). They have given

me many valuable suggestions to further improve the manuscript of this thesis. In
particular, Prof. Soffa joined the defense remotely with a twelve-hour time difference.
I especially appreciate that she could attend the defense and give me suggestions.

There are so many people that I would like to express my gratitude to. I would
like to thank Prof. Chang Xu. Without his recommendation, I would never have had
the chance to pursue my Ph.D. in the CASTLE group. I would also like to thank many
of my friends who made my Ph.D. life very enjoyable. I would like to thank Dr.
Rongxin Wu, Dr. Valerio Terragni, Dr. Ming Wen, as well as my other group members.
I would like to thank them for all the joy they brought to me as friends and
colleagues. I will always remember the time we spent together catching deadlines,
discussing research problems, as well as chatting or making jokes. I would like to
thank Xiaonan Huang (Doctor to be). We were roommates for over three years. We
have shared all of our secrets and enjoyable experiences, as well as all kinds of
troubles and obstacles. Her support and encouragement have helped
me through many worrying and upsetting moments. I would also like to thank my friends
with whom I played Werewolf. The joy you brought me brightened my daily life during
the last year of my Ph.D. study. Lastly, I would like to thank my family and my
partner for their support, as always.

This thesis is supported by RGC/GRF research grant 16202917.

Contents

Title Page i

Authorization Page ii

Signature Page iii

Acknowledgments iv

Table of Contents vi

List of Figures x

List of Tables xii

List of Algorithms xiii

List of Abbreviations xiv

Abstract xv

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Background & Related Work 9

2.1 Android Fragmentation and FIC Issues . . . . . . . . . . . . . . . . . 9

2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 Android Fragmentation Issues . . . . . . . . . . . . . . . . . . 11

2.2.2 Mining API Usage Patterns . . . . . . . . . . . . . . . . . . . 14

2.2.3 Differential Testing and Test Selection . . . . . . . . . . . . . 15

3 Understanding & Characterizing FIC Issues 17

3.1 Study Design and Setup . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1.1 Empirical Study of Real-World FIC Issues . . . . . . . . . . . 17

3.1.2 Interviews with Practitioners . . . . . . . . . . . . . . . . . . 20

3.1.3 Online Practitioner Survey . . . . . . . . . . . . . . . . . . . . 21

3.2 Study Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2.1 RQ1: Issue Type and Root Cause . . . . . . . . . . . . . . . . 24

3.2.2 RQ2: Issue-Triggering Context . . . . . . . . . . . . . . . . . . 31

3.2.3 RQ3: Issue Consequence . . . . . . . . . . . . . . . . . . . . . 33

3.2.4 RQ4: Issue Testing . . . . . . . . . . . . . . . . . . . . . . . . 35

3.2.5 RQ5: Issue Diagnosis and Fixing . . . . . . . . . . . . . . . . 38

3.2.6 Implications of Our Findings . . . . . . . . . . . . . . . . . . . 41

4 FicFinder: Modeling & Detecting FIC Issues 43

4.1 Issue Modeling and Detection . . . . . . . . . . . . . . . . . . . . . . 43

4.1.1 API-Context Pair Model . . . . . . . . . . . . . . . . . . . . . 43

4.1.2 Compatibility Issue Detection . . . . . . . . . . . . . . . . . . 44

4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.1 RQ6: Issue Detection Effectiveness . . . . . . . . . . . . . . . 50

4.2.2 RQ7: Usefulness of FicFinder . . . . . . . . . . . . . . . . . 54

4.3 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.3.1 Threats To Validity . . . . . . . . . . . . . . . . . . . . . . . . 56

4.3.2 Comparison with CiD & Automated API-Context Pair Extraction 58

4.3.3 Device Models Triggering FIC issues . . . . . . . . . . . . . . 60

5 Pivot: Learning Knowledge of FIC Issues 61

5.1 Problem Formulation & Motivation . . . . . . . . . . . . . . . . . . . 61

5.1.1 FIC Issue Pattern & API-Device Correlations . . . . . . . . . 61

5.1.2 Application Scenarios of API-Device Correlations . . . . . . . 63

5.1.3 Motivating Example & Overview . . . . . . . . . . . . . . . . 64

5.2 Pivot Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.2.1 Extracting API-Device Correlations . . . . . . . . . . . . . . . 67

5.2.2 Prioritizing API-Device Correlations . . . . . . . . . . . . . . 68

5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.3.1 RQ8: Effectiveness of Pivot . . . . . . . . . . . . . . . . . . 72

5.3.2 RQ9: Usefulness of API-Device Correlations . . . . . . . . . . 76

5.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.4.1 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . 78

5.4.2 Comparison with FicFinder . . . . . . . . . . . . . . . . . . 79

6 Pixor: Proactively Exposing Inconsistent Android System Behaviors 80

6.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.1.2 Subroutine Extractor: Breaking Down Apps into Subroutines 82

6.1.3 API Graph Constructor: Estimating Exercisable System Functionalities . . . . . . . . . . . . 83

6.1.4 Subroutine Selector: Guiding Exploration of System Functionalities . . . . . . . . . . . . . 86

6.1.5 Test App Generator & Runner . . . . . . . . . . . . . . . . . . 88

6.1.6 Identifying Triggered Platform-Specific Behaviors . . . . . . . 88

6.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.2.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.2.2 Baseline Method . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.2.3 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . 92

6.3 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.3.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

7 Future Work 95

7.1 Future Work of Pivot . . . . . . . . . . . . . . . . . . . . . . . . . 95

7.1.1 API-Device Correlation Clustering . . . . . . . . . . . . . . . 95

7.1.2 Automated API-Device Correlation Validation . . . . . . . . . 96

7.2 Future Work of Pixor . . . . . . . . . . . . . . . . . . . . . . . . . 96

8 Conclusion 98

PhD’s Publications 100

References 102

List of Figures

2.1 Android System Architecture [13] . . . . . . . . . . . . . . . . . . . . 10

2.2 Save For Offline Issue #49: App crashes on API 16 while working well
on API 19. (a) MainActivity of Save For Offline, (b) ViewActivity
works well on an emulator with API level 19, (c) App crashes when
starting ViewActivity on an emulator with API level 16. . . . . . . . 11

3.1 Region of Our Survey Respondents' Working Place (Marked in Dark Red Color) . . . . . . . . . . . . . . . 23

3.2 Working Experience of Our Survey Respondents . . . . . . . . . . . . 24

3.3 CSipSimple Issue 353 (Simplified) . . . . . . . . . . . . . . . . . . . . 26

3.4 AnySoftKeyboard Issue 289 (Simplified) . . . . . . . . . . . . . . . . 27

3.5 AnkiDroid Pull Request 130 (Simplified) . . . . . . . . . . . . . . . . 29

3.6 Number of Respondents Selecting Two Different Types of Options: (1) Root Causes Identified by Our Empirical Study, (2) Only the Other Option. . . . . . . . . . . . 30

3.7 Root Causes Specified by Survey Respondents . . . . . . . . . . . . . 30

3.8 Common Consequences of FIC Issues . . . . . . . . . . . . . . . . . . 33

3.9 Importance of FIC Issue Testing before App Releases . . . . . . . . . 36

3.10 FIC Issue Test Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.11 Android App Test Frameworks . . . . . . . . . . . . . . . . . . . . . 38

3.12 Challenges for FIC Issue Diagnosis and Fixing . . . . . . . . . . . . . 40

4.1 Patch for Twidere #974 . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.1 Patch for Camera Preview Frame Rate Issue on Nexus 4 . . . . . . . 63

5.2 Patch for A Crashing FIC Issue on Some LG devices . . . . . . . . . 65

5.3 Overview of Pivot . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.4 Inter-Procedural Control Flow Graph of The Code in Figure 5.2 . . . 67

5.5 Precision@N of Pivot and Baselines . . . . . . . . . . . . . . . . . . 74

6.1 Overview of Pixor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.2 An example of API Graph, its corresponding control flow graph and
code snippet. For presentation simplicity, we simplified the example
and omitted the parameters of the APIs in API Graph. . . . . . . . . 84

6.3 Overlapping Contents on Android Q Beta 4 . . . . . . . . . . . . . . 91

6.4 Code Example That Exhibit Inconsistent Behavior on Android Q . . 91

List of Tables

2.1 Android OS Version Market Share . . . . . . . . . . . . . . . . . . . . 12

3.1 Android Apps Used in Our Empirical Study (1M = 1,000,000) . . . . 17

3.2 Keywords Used in Issue Identification . . . . . . . . . . . . . . . . . . 18

3.3 Basic Information of Our Interviewees . . . . . . . . . . . . . . . . . . 20

3.4 Issue types and root causes for their occurrences . . . . . . . . . . . . 25

3.5 Common Patching Patterns . . . . . . . . . . . . . . . . . . . . . . . 40

4.1 Experimental Subjects and Checking Results . . . . . . . . . . . . . . 48

5.1 Statistics of Our App Corpora . . . . . . . . . . . . . . . . . . . . . . 73

5.2 Case Study Subjects and Reported Issues . . . . . . . . . . . . . . . . 76

6.1 Identified Bugs of Different Subroutine Selection Strategies. (The results of random are decimals since the average results of three random runs are reported) . . . . . . . . 93

List of Algorithms

4.1 Detecting FIC Issues in an Android App . . . . . . . . . . . . . . . . 46


6.1 Greedy Algorithm for Set Cover Problem . . . . . . . . . . . . . . . . 86

List of Abbreviations

FIC issue Fragmentation-Induced Compatibility issue

API Application Programming Interface

Taming Fragmentation-Induced Compatibility Issues in
Android Applications

by Lili Wei
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

Abstract

The Android ecosystem is heavily fragmented. The numerous combinations of different
device models and operating system versions make it impossible for Android app
developers to exhaustively test their apps. As a result, various compatibility issues
arise, causing poor user experience.

Such fragmentation-induced compatibility issues (FIC issues) have been well-recognized
as a prominent problem in Android app development. However, little
is known about the characteristics of these FIC issues and no mature tools exist
to help developers quickly diagnose and fix these issues. To bridge the gap, I
conducted an empirical study on 220 real-world compatibility issues collected from
five popular open-source Android apps. I further interviewed Android practitioners
and conducted an online survey to gain insights from their real practices. The study
characterized the symptoms, root causes, and triggering contexts of the FIC issues,
investigated common practices to handle the FIC issues, and disclosed that these
issues and their patches exhibit common patterns. With these findings, I proposed a
technique, FicFinder, to automatically detect compatibility issues in Android apps.
FicFinder has been evaluated to be effective in detecting fragmentation-induced
compatibility issues with high precision and satisfactory recall.

An important input required by FicFinder is the FIC issue patterns that capture
specific Android APIs as well as their associated context by which compatibility
issues can be triggered. I denote such FIC issue patterns as API-context pairs. In
the initial version of FicFinder, the API-context pairs were manually extracted
from the empirical study dataset. Manually extracting FIC issue patterns can be
expensive. In addition, API-context pairs can eventually become outdated since FIC
issues evolve as new Android versions and devices are released. To address this
problem, I developed a novel framework, Pivot, that combines program analysis and
data mining techniques to automatically learn API-context pairs from large corpora
of popular Android apps. Specifically, Pivot takes an Android app corpus as input
and outputs a list of API-context pairs ranked by their likelihood of capturing real
FIC issues. With the learned API-context pairs, we can further transfer knowledge
learned from existing Android apps to automatically detect potential FIC issues using
FicFinder. This can significantly reduce the search space for FIC issues and benefit
the Android development community. To evaluate Pivot, I measured the precision
of the learned API-context pairs and leveraged them to detect previously-unknown
compatibility issues in open-source Android apps.

One fundamental limitation of Pivot is that it can only learn FIC issue patterns
that have already been identified and fixed by some existing apps. Pivot can
never identify a FIC issue if it has not been fixed by any app in the input app corpora.
To address this problem, I proposed Pixor to leverage existing Android apps to
proactively expose inconsistent behaviors of different Android versions. Pixor selects
and exercises subroutines of Android apps on different Android versions to expose
inconsistent Android system behaviors. Such inconsistent behaviors can be useful to
both Android system and app developers. On the one hand, some of the inconsistent
behaviors may indicate bugs in the Android systems. Reporting them to the Android
system developers may help improve the reliability of Android systems. On the other
hand, some of the inconsistent behaviors can be intended customization of specific
Android versions. Identifying such inconsistent behaviors can help app developers
avoid FIC issues. In the evaluation, Pixor has identified both bugs and intended
system behavior changes in Android Q (Android 10, the latest version of the Android
system). Three of the identified bugs have been confirmed and fixed by the Android
development team.

Chapter 1

Introduction

1.1 Motivation

Android is the most popular mobile operating system with over 80% market
share [74]. Due to its open nature, many manufacturers (e.g., Samsung and LG)
develop their mobile devices by adapting the original Android systems. While this
has led to the wide adoption of Android smartphones and tablets, it has also induced
the heavy fragmentation of the Android ecosystem. The fragmentation causes
unprecedented challenges to app developers: there are more than 10 major versions
of Android OS running on 24,000+ distinct device models [12]. It is impractical
for developers to fully test the compatibility of their apps on such a large number
of devices. In practice, developers often receive user complaints on their apps’
poor compatibility on various device models [14, 122]. However, resolving these
fragmentation-induced compatibility issues is difficult.

The search space for fragmentation-induced compatibility issues is huge and evolving.
First, the search space is large, consisting of three dimensions: device models,
API levels, and Android APIs. There are 24,000+ distinct device models running
10+ different API levels in the market [12, 24]. Each API level supports thousands
of APIs. Detecting such compatibility issues by checking if each combination of
device model, API level and invoked API can induce inconsistent app behaviors is
practically infeasible. Second, the search space is dynamic. It continually evolves
with the release of new device models and API levels. For example, the number of
distinct device models in 2015 was six times as many as in 2012 [82]. In addition,
there have been two or more API level upgrades each year, which are commonly
customized by Android device manufacturers and shipped with new device models.


Detecting fragmentation-induced compatibility issues in such a huge and evolving
search space is challenging. Therefore, developers often realize the existence of these
issues only after receiving users’ complaints.

Various studies have been conducted to investigate the Android fragmentation
problem and the compatibility issues it induces. For example, Han et al. [119] were
among the first to study compatibility issues in the Android ecosystem and
provided evidence of hardware fragmentation by analyzing bug reports of HTC
and Motorola devices. Linares-Vásquez et al. [139] and McDonnell et al. [147]
studied how Android API evolutions can affect the quality (e.g., portability and
compatibility) and development efforts of Android apps. Recently, researchers also
proposed techniques to help developers prioritize Android devices for development
and testing by mining user reviews and usage data [125, 146]. Fazzini et al. [112]
proposed DiffDroid to identify GUI inconsistencies of Android apps when running
on different platforms. DiffDroid focuses on the oracle problem, i.e., how to
determine the existence of compatibility issues in app testing. Although such pioneer
work helped understand Android fragmentation, little is known about the root cause
of fragmentation-induced compatibility issues and how developers diagnose and fix
such issues in reality. In addition, existing studies have not fully investigated these
issues down to the app source code level and hence cannot provide deeper insights
(e.g., common issue patterns and fixing strategies) to ease debugging and bug fixing
tasks. As a result, there are no mature tools to help developers combat Android
fragmentation and improve their apps’ compatibility.

This thesis aims to understand fragmentation-induced compatibility issues down to
the source code level and provide automated tool support to facilitate issue detection.
To better understand fragmentation-induced compatibility issues in Android apps, I
conducted an empirical study on 220 real issues collected from popular open-source
Android apps, aiming to understand the issues’ characteristics. In addition to the
empirical study, I also conducted in-depth developer interviews and field surveys to
provide insights into fragmentation-induced compatibility issues from the perspective
of Android practitioners. Overall, this process aims to explore the following five
research questions. For ease of presentation, we will refer to Fragmentation-Induced
Compatibility issues as FIC issues.

• RQ1: (Issue type and root cause): What are the common types of FIC
issues in Android apps? What are their root causes?


• RQ2: (Issue-triggering context): What are the common triggering contexts
of FIC issues? Are there any relations among these issue-triggering contexts?
How do Android developers use the context information to diagnose and fix
FIC issues?

• RQ3: (Issue consequence): What are the common perceived consequences
of FIC issues in Android apps?

• RQ4: (Issue testing): Is FIC issue testing an important step during Android
app development process? What are the common practices for FIC issue testing?
Are there any effective tools or frameworks to facilitate the testing tasks?

• RQ5: (Issue diagnosis and fixing): What are the common challenges when
diagnosing FIC issues? How do Android developers fix FIC issues in practice?
Are there any common patterns?

By investigating these research questions, I made interesting findings. For example,
I found that FIC issues in Android apps can cause both functional and non-functional
consequences and observed five major root causes of these issues. Among these root
causes, frequent Android platform API evolution and problematic hardware driver
implementation are two dominant ones. Such findings can provide developers with
guidance to help avoid, expose, and diagnose FIC issues. Besides, I observed from
the patches of the 220 studied FIC issues that they tend to demonstrate common
patterns. To fix FIC issues, developers often adopted similar workarounds under
certain problematic software and hardware environments. Such patterns can be
learned and leveraged to help automatically detect and fix FIC issues.

Based on the findings, I designed a static analysis technique, FicFinder, to
automatically detect FIC issues in Android apps. FicFinder is driven by an API-
context pair model proposed by us. The model captures Android APIs that suffer
from FIC issues and the issue-triggering contexts. The model can be learned from
common issue fixing patterns. I implemented FicFinder on Soot [130] and applied
it to 53 large-scale and popular Android apps. FicFinder successfully generated
129 true warnings of previously-unknown FIC issues. I reported all these warnings in
39 issue reports to the apps' developers (I grouped similar warnings into one issue report
so as not to overwhelm developers). So far, I have received acknowledgment from
developers on 22 of the 39 issue reports. 59 warnings mentioned in 13 issue reports
were considered critical and have been quickly fixed. In addition, the interviewees
and survey respondents also showed interest in FicFinder. When communicating


with one of the interviewees, I found and reported one valid FIC issue to his team.
The reported issue was quickly confirmed and fixed. These results demonstrate the
usefulness of the empirical findings and FicFinder.
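
To make the API-context pair model concrete before its formal definition in Chapter 4, the following is a minimal illustrative sketch in Java; the class and field names are my own assumptions, not the exact representation used by FicFinder.

    // Minimal illustrative sketch of an API-context pair (hypothetical names;
    // Chapter 4 gives the actual model used by FicFinder).
    class ApiContextPair {
        final String api;          // signature of the issue-inducing Android API
        final String deviceModel;  // issue-triggering device model, or null if not device-specific
        final int minApiLevel;     // lowest API level on which the issue can be triggered
        final int maxApiLevel;     // highest API level on which the issue can be triggered

        ApiContextPair(String api, String deviceModel, int minApiLevel, int maxApiLevel) {
            this.api = api;
            this.deviceModel = deviceModel;
            this.minApiLevel = minApiLevel;
            this.maxApiLevel = maxApiLevel;
        }
    }

    // Example pair: an API introduced at API level 17 crashes on any device
    // running API level 16 or below (cf. the Save For Offline issue in Chapter 2).
    // new ApiContextPair(
    //     "android.webkit.WebSettings.setMediaPlaybackRequiresUserGesture(boolean)",
    //     null, 1, 16);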

In the preliminary version of FicFinder [163, 164], the API-context pairs were
manually extracted from the empirical study. However, this manual approach is
impractical because the FIC issue search space evolves quickly. It requires tremendous
efforts to manually keep track of FIC issues that continually arise from new device
models. This motivates us to study how to automatically extract the knowledge of
FIC issues, which can then be leveraged to derive API-context pairs and effectively
reduce the search space for FIC issue detection. In the empirical study, I observed
that FIC issues can be categorized into two types: non-device-specific ones and
device-specific ones [163]. FIC issues are non-device-specific if they can be triggered
on any device model running a particular Android version, which is denoted by an
integer (e.g., 27 for Android 8.1) known as an API level. FIC issues are device-
specific if they can only be triggered on certain device models running particular
API levels. Compared with non-device-specific ones, the search space for device-
specific FIC issues is much enlarged with an additional dimension of device models.
Hence, device-specific FIC issues are more difficult to detect. As a result, I focus
on learning knowledge for device-specific FIC issues. These issues are common
but challenging to resolve [163, 164]. While resources provided by Google (e.g., API
Guide [11] and emulators) can help developers locate and patch non-device-specific
FIC issues, no similar resources are provided for device-specific FIC issues. The
root causes of device-specific FIC issues often reside in the proprietary systems
customized by the device vendors. These systems are typically closed-source with
little documentation available to the public.

I propose a technique called Pivot (API-DeVice cOrrelaTor) to automatically
learn the knowledge of device-specific FIC issues in terms of API-device correlations
from existing Android apps [165]. My insight is that FIC issues are commonly
handled by exercising an alternative execution path if the running device matches
an issue-triggering model. This insight is gained from the empirical study on the
real-world FIC issues. In other words, the handling of a FIC issue usually involves a
condition that checks device information at runtime. An example of such conditions
is given in Line 4 of Figure 5.1, where Nexus 4 is the issue-triggering model and Lines
4–6 provide an alternative execution path. Based on this insight, Pivot extracts
API-device correlations such as ⟨Parameters.setRecordingHint(boolean), "Nexus 4"⟩
by identifying the device-checking conditions and the API invocations guarded by


them. Such learned knowledge can then be leveraged to facilitate FIC issue testing
or to infer rules/patterns (such as API-context pairs) for pinpointing FIC issues in
Android apps via static analysis tools such as FicFinder.
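
The sketch below illustrates the shape of such a device-checking workaround. It is a simplified reconstruction in the spirit of Figure 5.1, not the figure's exact code, and the boolean value passed to the API is shown only for illustration.

    // Simplified reconstruction of a device-specific workaround (in the spirit of
    // Figure 5.1; not the figure's exact code).
    import android.hardware.Camera;
    import android.os.Build;

    class CameraConfig {
        static void applyWorkaround(Camera.Parameters parameters) {
            // Device-checking condition: Nexus 4 is the issue-triggering model.
            if ("Nexus 4".equals(Build.MODEL)) {
                // Alternative execution path taken only on the problematic device;
                // the argument value here is illustrative.
                parameters.setRecordingHint(true);
            }
        }
    }
    // From this pattern, Pivot would extract the correlation
    // <Parameters.setRecordingHint(boolean), "Nexus 4">.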

Pivot extracts API-device correlations of FIC issues from a given Android app
corpus by combining program analysis and code repository mining techniques. It also
prioritizes the extracted correlations based on their likelihood of capturing real FIC
issues. An outstanding challenge is to find valid API-device correlations amid massive
noise. In the evaluation, noise accounts for 97% of the sampled API-device
correlations extracted from popular Android app corpora. Existing techniques in API
precondition mining [151, 157, 168, 174] remove noise based on the assumption that
most API usages in code corpora are correct (i.e., mistakes are rare). However, this
assumption does not hold for APIs that can induce FIC issues due to two phenomena.
First, since FIC issues continually evolve, lots of FIC issues, especially the new
ones, are likely left undetected in Android apps. Second, common code clones and
library usages across Android apps can confuse the mining of FIC issues based on
frequency. These cloned code snippets can be duplicated in many apps [104, 135].
These two phenomena can introduce noise that greatly affects the learning accuracy
of FIC issues. To address this challenge, I devise a novel ranking strategy to prioritize
extracted API-device correlations based on the likelihood that the analyzed apps have
handled the corresponding FIC issues and the diversity of the originating occurrences
of API-device correlations.
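
As a rough, purely hypothetical illustration of this idea (Chapter 5 defines the actual ranking used by Pivot, and all names below are assumptions), a correlation can be scored higher when its occurrences originate from many different apps and from structurally distinct, non-cloned code sites:

    // Hypothetical illustration only; Pivot's actual ranking is defined in Chapter 5.
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class CorrelationRanker {
        static class Occurrence {
            String appId;    // app from which the occurrence was mined
            String codeHash; // fingerprint of the guarded snippet (clones share a hash)
        }

        // Favor correlations backed by diverse, independent evidence rather than
        // many copies of the same cloned snippet.
        static double score(List<Occurrence> occurrences) {
            Set<String> apps = new HashSet<>();
            Set<String> snippets = new HashSet<>();
            for (Occurrence o : occurrences) {
                apps.add(o.appId);
                snippets.add(o.codeHash);
            }
            return apps.size() * ((double) snippets.size() / occurrences.size());
        }
    }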

I implemented Pivot on Soot [130] and ran it to learn API-device correlations
from two app corpora, each of which comprises over 2,500 top-ranked apps on Google
Play [38] collected at different times. Pivot achieved a precision of 100% among the
top five ranked API-device correlations and over 90% precision among the top ten for
both of the corpora. It also successfully identified 49 distinct and valid API-device
correlations capturing 17 different FIC issues among the top-ranked API-device
correlations. I built an archive accordingly. It comprises the API-device correlations
learned by Pivot and the corresponding discussions collected from online resources.
The issue archive is publicly available to Android developers [63] and is expected to
grow in the future.

To show the usefulness of the learned API-device correlations and the issue
archive, I conducted a case study and encoded the knowledge of the FIC issues
learned by Pivot into API-context pairs for FicFinder to locate undetected issues
in open-source Android apps. I successfully found 10 apps that suffer from these FIC
issues. I further reproduced and reported the detected issues to the app developers.


So far, seven reported issues have been acknowledged, among which four issues have
been quickly fixed. This demonstrates that the API-device correlations learned by
Pivot can be used to facilitate FIC issue detection for Android apps.

Pivot is effective in identifying API-device correlations of FIC issues that have
already been fixed by some existing apps. However, since Pivot mines API-device
correlations from existing Android apps, it cannot reveal API-device correlations
involving newly released device models whose FIC issues have not yet been detected
and fixed by any app. This is a limitation of Pivot. To address this limitation,
I further propose Pixor, which leverages existing Android apps to actively reveal
inconsistent behaviors across different Android system versions.

Essentially, FIC issues are caused by system behaviors that manifest only
on specific device models or Android versions. For ease of presentation, we refer to
such behaviors specific to particular Android versions as platform-specific behaviors.
Pixor performs differential testing to actively expose platform-specific behaviors
of different Android system versions by leveraging code in existing Android apps
to exercise diverse functionalities. Existing Android apps intensively use Android
system functionalities via API calls. While such prolific Android system functionality
use cases are good test cases to expose platform-specific behaviors, many of them
can be equivalent in exposing platform-specific behaviors. Blindly selecting code in
Android apps to exercise Android systems can induce redundant efforts in repeatedly
exercising the same system functionalities. This is the key challenge of Pixor:
how to effectively select code in existing apps to guide the exploration of system
functionalities so that diversified system behaviors are exercised while redundant
testing efforts are minimized?

Pixor resolves this challenge by breaking down apps into subroutines and esti-
mating the exercisable system functionalities of each subroutine by constructing API
Graphs. Specifically, an API Graph models the Android API use cases of an app
subroutine, capturing the invoked APIs and their invocation orders. Pixor guides the explo-
ration of Android system functionalities by selecting app subroutines that can cover
the most uncovered Android system API use cases.
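
A hedged sketch of this coverage-guided selection follows; it approximates an API use case as an edge of the API Graph (an ordered pair of API calls) and uses hypothetical names, while Algorithm 6.1 in Chapter 6 gives the actual greedy procedure.

    // Greedy, coverage-guided subroutine selection (illustrative sketch).
    import java.util.*;

    class SubroutineSelector {
        // Each subroutine is summarized by the set of API Graph edges it can exercise.
        static List<String> select(Map<String, Set<String>> edgesPerSubroutine) {
            Set<String> uncovered = new HashSet<>();
            edgesPerSubroutine.values().forEach(uncovered::addAll);

            List<String> selected = new ArrayList<>();
            while (!uncovered.isEmpty()) {
                String best = null;
                int bestGain = 0;
                for (Map.Entry<String, Set<String>> e : edgesPerSubroutine.entrySet()) {
                    Set<String> gain = new HashSet<>(e.getValue());
                    gain.retainAll(uncovered);
                    if (gain.size() > bestGain) {
                        bestGain = gain.size();
                        best = e.getKey();
                    }
                }
                if (best == null) break; // remaining subroutines add no new coverage
                selected.add(best);      // greedily pick the subroutine with the largest gain
                uncovered.removeAll(edgesPerSubroutine.get(best));
            }
            return selected;
        }
    }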

Identifying platform-specific behaviors can benefit both Android app and system
developers. On the one hand, platform-specific behaviors are the causes of FIC issues.
Identifying platform-specific behaviors can help Android app developers detect FIC
issues in their own apps. On the other hand, some platform-specific behaviors may
indicate bugs in the Android systems. Detecting such platform-specific behaviors
can help Android system developers improve the reliability of the Android system.


I evaluated Pixor by applying it to detect platform-specific behaviors in the Android
Q Beta version. Pixor outperformed the random selection strategy and effectively
exposed seven distinct platform-specific behaviors in the Android Q Beta version. I
reported the detected platform-specific behaviors to the Android development team.
Three of them were confirmed as bugs of the latest Android version and have been
fixed during the development of the Beta version.

1.2 Contributions

To summarize, this thesis makes the following contributions:

• Understanding and Characterizing FIC Issues: To the best of my knowledge,
I conducted the first empirical study of FIC issues in real-world Android
apps at source code level. The findings can help better understand and charac-
terize FIC issues while shedding light on future studies related to this topic.
The dataset is publicly available to facilitate future research [30]. Besides
studying real FIC issues, I also conducted in-depth developer interviews and
surveys to understand FIC issues from Android practitioners’ perspective.
These follow-up studies not only validated my major findings from the empir-
ical study, but also provided additional insights that cannot be obtained by
studying FIC issues alone.

• Modeling and Detecting FIC Issues: I proposed an API-context pair
model to capture the common patterns of FIC issues in Android apps. With
this model, I can generalize developers’ knowledge and practices in handling
FIC issues and transfer such knowledge to aid various software engineering
tasks such as automated issue detection and repair. Based on this model, I
designed and implemented a technique, FicFinder, to automatically detect
FIC issues in Android apps. The evaluation of FicFinder on 53 real-world
subjects confirms that it can effectively detect FIC issues and provide useful
information to facilitate issue diagnosis and fixing.

• Automatically Learning Knowledge of FIC Issues: I proposed and
implemented the first technique, Pivot, to learn knowledge of FIC issues
as API-device correlations from existing Android apps and showed that such
API-device correlations can be transformed into API-context pairs to facilitate
FIC issue detection for Android apps. In Pivot, I devised a new ranking


strategy that can effectively identify valid API-device correlations capturing
real FIC issues. The evaluation results show that this strategy can significantly
outperform traditional ones. I also archived the FIC issues identified by Pivot.
With the archive and FicFinder, I conducted a case study and successfully
revealed previously-unknown FIC issues in 10 Android apps, many of which
were confirmed or fixed by the app developers.

• Actively Exposing Platform-Specific Behaviors: I proposed Pixor, which
actively exposes platform-specific behaviors through differential testing across
different Android versions using the large number of existing Android apps. Pixor
(1) estimates the exercisable system functionalities of subroutines in Android
apps via API Graphs and (2) leverages API Graphs to guide the selection of
subroutines to exercise Android system functionalities and expose platform-
specific behaviors. In the evaluation, Pixor has detected 44 distinct subroutines
that can trigger platform-specific behaviors. From them, I identified seven
distinct platform-specific behaviors.

• The details of the study as well as the delivered tools of this thesis are publicly
available at the project homepages [30, 63].

1.3 Thesis Outline

This thesis is organized as follows. Chapter 2 introduces the background knowledge
of Android fragmentation and its induced compatibility issues. I also discuss
the related work on Android fragmentation and FIC issues. Chapter 3 presents
the characteristic study of FIC issues. It includes the settings and findings of the
empirical study, the online survey, and the practitioner interviews. Chapter 4 defines
the API-context pair model and presents FicFinder, which leverages API-context
pairs to detect FIC issues.
learn knowledge of FIC issues from large Android app corpora. Chapter 6 introduces
Pixor, my latest effort in actively exposing platform-specific behaviors, which are
causes of FIC issues. Chapter 7 proposes research directions of the future work and
Chapter 8 concludes the thesis.

Chapter 2

Background & Related Work

2.1 Android Fragmentation and FIC Issues

While Android provides manufacturers of mobile devices with an open and
flexible software infrastructure to quickly launch their Android device products, it
complicates the task of developing reliable Android apps over these devices. One of
the known complications is induced by the infamous Android fragmentation problem
that arises from the need to support the proliferation of different Android devices
with diverse software and hardware environments [146]. Two major causes account
for the severity of fragmentation.

• Fast evolving Android platforms. The Android platform is evolving fast. New
Android versions are released every year. Its API specifications and development
guidelines constantly change. As a result, the coexistence of devices running different
OS versions on the market leads to API-level fragmentation in the Android ecosystem.
As shown in Table 2.1, the Android devices on the market run very different
OS versions, from 2.3 to 8.0, with API levels from 10 to 26 [24].¹

• Numerous device models. To meet market demands, manufacturers (e.g., Samsung)
keep releasing new Android device models with diverse hardware (e.g.,
different screen sizes, camera qualities, and sensor compositions) and customize
their own Android OS variants by modifying the original Android software
stack. Device manufacturers make two typical customizations: (1) they implement
the lower-level system (hardware abstraction layer and hardware drivers) to
allow the higher-level system (Figure 2.1) to be agnostic about the lower-level
¹ Data collected on November 6, 2017 [24].


[Figure 2.1 depicts the layers of the Android system architecture, from top to bottom: Android Apps; Application Framework; Inter-Process Communication Proxies; Android System Services and Managers (the higher-level system); Hardware Abstraction Layer; Linux Kernel with Hardware Drivers (the lower-level system).]

Figure 2.1: Android System Architecture [13]

driver implementations that are specific to device models; (2) they modify
the higher-level system to meet device models' special requirements (e.g., UI
style). Such customization induces many variations across device models (see
examples in Section 3.2.1).

Android fragmentation creates a burden for app developers, who need to ensure that the
apps they develop offer compatible behavior on diverse device models, which
support multiple OS versions. This requires tremendous testing and diagnosis efforts.
In reality, it is impractical for developers to exhaustively test their apps to cover
all combinations of device models and OS versions. Hence, FIC issues often escape
testing and are frequently encountered and reported by app users [119, 125, 146].
The API level and device model variations can make Android apps exhibit different
behavior and induce FIC issues. Figure 2.2 shows an example of a FIC issue that my tools
detected in an open-source Android app, Save For Offline [70]. Save For Offline
is developed to save webpages to local storage so that the saved webpages
can be viewed even when the phone is offline. This app invokes an Android API,
setMediaPlaybackRequiresUserGesture, which was introduced in API 17, in its
ViewActivity without checking the API level at runtime. The ViewActivity works
well on devices whose API level is at least 17 but crashes on devices with lower API
levels, where the API is not available. Figure 2.2 (a) shows the
MainActivity of Save For Offline. Clicking any saved page will lead the app to start
ViewActivity and load the saved webpage. Figure 2.2 (b) shows a screenshot of
ViewActivity that correctly displays the saved webpage on an emulator with API
level 19. On the other hand, Figure 2.2 (c) shows a screenshot of the app crashing on
an emulator with API level 16 when ViewActivity is started by clicking the same
saved page. Such inconsistent behavior is a typical example of FIC issues.
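
A minimal sketch of the problematic call and a guarded fix is shown below; it is illustrative (the helper method is hypothetical) rather than the app's actual code.

    // Illustrative sketch, not Save For Offline's actual code.
    import android.os.Build;
    import android.webkit.WebSettings;

    class ViewActivityHelper {
        static void configurePlayback(WebSettings settings) {
            // Buggy version: an unconditional call crashes on devices below API
            // level 17, where the method does not exist.
            // settings.setMediaPlaybackRequiresUserGesture(false);

            // Guarded version: invoke the API only when the platform supports it.
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.JELLY_BEAN_MR1) { // API 17
                settings.setMediaPlaybackRequiresUserGesture(false);
            }
        }
    }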



Figure 2.2: Save For Offline Issue #49: App crashes on API 16 while
working well on API 19. (a) MainActivity of Save For Offline, (b)
ViewActivity works well on an emulator with API level 19, (c) App
crashes when starting ViewActivity on an emulator with API level 16.

2.2 Related Work

2.2.1 Android Fragmentation Issues

Android fragmentation has been a major challenge for Android app develop-
ers [108, 122]. Recent studies have explored the problem largely from three aspects.

Understanding Android fragmentation. Several studies have been performed
to understand the problems of Android fragmentation. Han et al. were among
the first who explored the Android fragmentation problem [119]. They studied the
bug reports related to HTC and Motorola in the Android issue tracking system [1]
and pointed out that Android ecosystem was fragmented. Since then, researchers
have reported various findings related to Android fragmentation. For example, Li et
al. pointed out that app usage patterns are sensitive to device models [133]. Pathak
et al. reported that frequent OS updates represent the largest fraction of user com-
plaints about energy bugs [152]. Liu et al. observed that a notable proportion of
Android performance bugs occur only on specific devices and platforms [143]. Wu
et al. studied the effects of vendor customizations on security by analyzing Android
images from popular vendors [167]. Li et al. [134] found that app usage patterns


Table 2.1: Android OS Version Market Share

Version         Codename             API level   Release Time   Market share
2.3.3 - 2.3.7   Gingerbread          10          2011           0.6%
4.0.3 - 4.0.4   Ice Cream Sandwich   15          2011           0.6%
4.1.x           Jelly Bean           16          2012           2.3%
4.2.x           Jelly Bean           17          2012           3.3%
4.3             Jelly Bean           18          2013           1.0%
4.4             KitKat               19          2013           14.5%
5.0             Lollipop             21          2014           6.7%
5.1             Lollipop             22          2015           21.0%
6.0             Marshmallow          23          2015           32.0%
7.0             Nougat               24          2016           15.8%
7.1             Nougat               25          2016           2.0%
8.0             Oreo                 26          2017           0.2%

are sensitive to device models. Fan et al. [111] performed an empirical study on
framework crashing issues in Android apps and found that the compatibility issues
induced by fragmentation commonly cause framework crashes. Many studies focus on
security problems induced by Android fragmentation. Zhou et al. [175] leveraged
problematic file permissions of driver files to hijack sensitive functionalities of specific
device models. Zhang et al. [172] studied vulnerabilities of the memory management
system on different customized device models. Aafer et al. [99] compared security
configurations of ROMs of different device models and identified vulnerabilities.
These studies documented various functional and non-functional issues caused by
Android fragmentation, but did not aim to understand the root causes of such issues,
which distinguishes them from this thesis.

In addition, the HCI community also found that different resolutions of device
displays have brought unique challenges in Android app design and implementa-
tion [150,171]. Holzinger et al. reported their experience on building a business appli-
cation for different devices considering divergent display sizes and resolutions [121].
In my study dataset, I also found that such issues form part of the FIC issues on Android
platforms and can seriously affect user experience.

App testing for the fragmented ecosystem. Android fragmentation has brought new
challenges to app testing. To address these challenges, Kaasila et al. designed an online
system to test Android apps across devices in parallel [123]. Halpern et al. built a
record-and-replay tool to test Android apps across device models [117]. Fazzini et


al. proposed a testing framework, DiffDroid, that can automatically identify GUI
inconsistencies of Android apps across different device models [112]. DiffDroid
compares GUI models of an app when running on different device models, but such
comparisons do not help reduce the search space of FIC issues. While these papers
proposed general frameworks to test Android apps across devices, they did not
address the problem of huge search space for FIC issue testing (Section 3.2.6).

Several recent studies aimed to reduce the search space in Android app testing by
prioritizing devices. For example, Vilkomir et al. proposed a device selection strategy
based on combinatorial features of different devices [161]. Khalid et al. conducted
a case study on Android games to prioritize device models to test by mining user
reviews on Google Play store [125]. The most recent work by Lu et al. proposed an
automatic approach to prioritizing devices by analyzing user data [146]. It is true
that by prioritizing devices, testing resources can be focused on those devices with
larger user impact. However, these studies alone are not sufficient to significantly
reduce the search space when testing FIC issues. The reason is that the combinations
of different platform versions and issue-inducing APIs are still too many to be fully
explored. In this thesis, I addressed the challenge from a different angle by studying
real-world FIC issues to gain insights at source code level (e.g., root causes and
fixing patterns). By this study, I observed common issue-inducing APIs and the
corresponding software and hardware environments that would trigger FIC issues.
Such findings can further help reduce the search space during app testing and
facilitate automated issue detection and repair.

Kyle et al. [129] proposed DexFuzz, a tool that detects defects in the Android
virtual machine by fuzzing Android apps. This work tests Android virtual machines
to identify their bugs and mitigate the problem of Android fragmentation. The
effectiveness of DexFuzz heavily depends on the seed apps and the mutation
operators. However, while DexFuzz provides the ability to fuzz existing Android
apps, it does not provide any solutions on how to effectively select Android apps
for testing. Pixor, on the contrary, focuses on resolving the problem of test app
selection. In this sense, Pixor is complementary to DexFuzz. In the future, I
plan to combine Pixor with DexFuzz to produce more effective test suites for
platform-specific behavior detection.

Android API evolution. Constantly evolving Android platforms cause Android
fragmentation. Existing work reported that the chosen SDK version and the quality
of the used Android APIs are major factors that affect app user experience [160].
McDonnell et al. investigated how changing Android APIs can affect API adoption


in Android apps [147]. Linares-Vásquez et al. studied the impact of evolving Android
APIs on user ratings and StackOverflow discussions [139, 140]. Bavota et al. further
conducted a survey to investigate developers’ experience with change-prone and fault-
prone APIs [102]. They observed that apps using change-prone and fault-prone APIs
tend to receive lower user ratings and changes of Android API have motivated app
developers to open discussion threads on Stack Overflow. These studies focused on
understanding the influence of using fast-evolving Android APIs on the development
procedure and app qualities, but did not study the concrete characteristics or fixes
of FIC issues induced by such API changes. In comparison, this thesis not only
uncovered the root causes of FIC issues but also studied concrete issues at the
application source code level so that I can provide new insights (e.g., fixing patterns)
to facilitate FIC issue detection and repair. Lastly, a recent work by Li et al. [136]
proposed a technique CiD to automatically learn patterns of compatibility issues
caused by Android framework API evolution. However, their scope is restricted to
compatibility issues induced by API accessibility inconsistencies at different API
levels, while the scope of this thesis is broader (e.g., I also studied device-specific
FIC issues). I provide a detailed comparison between CiD and FicFinder in
Section 4.3.2.

In comparison, Pivot automatically learns API-device correlations of device-specific
FIC issues from a given Android app corpus. The case study shows that the
learned API-device correlations are useful and can help detect previously-unknown
FIC issues in real-world Android apps.

2.2.2 Mining API Usage Patterns

Various techniques have been proposed to mine API usage patterns from code
repositories. The majority of them aim to mine API co-occurrence relationships. PR-
Miner [138] and DynaMine [144] were proposed to mine sets of APIs that frequently
co-occur in code repositories. MAPO [168, 174] was among the first techniques
to mine frequently-used API sequences. Follow-up studies further improved the
pioneering techniques [100, 114, 162]. Most of them rely on mining techniques and
adopt light-weight static analysis without analyzing code control dependencies.

Several techniques were proposed to infer API calling preconditions via static
analysis and mining code repositories. Ramanathan et al. [155] inferred preconditions
that must hold before API invocations from code revision histories. Nguyen et
al. [151] mined API preconditions with regard to the API arguments and receivers.


These techniques target mining general API preconditions that may not be relevant
to FIC issues. In addition, these techniques leverage traditional filtering metrics
such as confidence to identify valid API preconditions. As shown in the evaluation,
such metrics cannot effectively prioritize valid API-device correlations. In contrast,
Pivot focuses on learning API-device correlations and features a new and effective
ranking strategy to prioritize the correlations.

Another line of studies focused on tackling compatibility issues in Android apps,
which are induced by Android fragmentation. For example, Wei et al. [163, 164, 166]
have conducted an empirical study on Android fragmentation-induced compatibility
issues in five open-source Android apps. Based on their empirical findings, they
further proposed a static analyzer to detect potential compatibility issues in Android
apps. Fazzini et al. [112] and Ki et al. [126] proposed test frameworks to compare
the behavior of the same Android app on different device models. These studies
focused on the fragmentation-induced compatibility issues at the Android app level.
In comparison, this thesis addressed the challenge of selecting apps by constructing
API Graphs and leveraging them to effectively select activities in Android app
corpora as Android platform test cases.

2.2.3 Differential Testing and Test Selection

Pixor performs differential testing to identify platform-specific behaviors. Differential
testing [148] has been widely applied to all kinds of systems, including deep
learning systems [153], JVMs [106, 107], and compilers [105, 131, 156, 169]. Traditional
differential testing techniques assume that different versions should produce the same output.
However, this assumption can be too strict in my context as different versions of
Android system, either original or customized releases, commonly exhibit different
behaviors. Assuming exactly the same behavior on different Android versions for
platform-specific behavior detection will induce too many false positives. As a result,
I leveraged a coarse-grained image similarity comparison strategy to detect
inconsistent behaviors across different versions.

How to effectively select subroutines to expose platform-specific behaviors is one
of the key challenges of Pixor. Here, I also discuss related work that focuses on the test
selection problem. Test selection is a critical problem in large continuous integration
systems with a large number of available test cases [149]. This research area has been
widely studied. Various techniques were proposed to select test cases for different
purposes [110, 132, 149, 170, 173]. These techniques are related to the Subroutine


Selector component of Pixor. While the idea of selecting subroutines and selecting
test cases from an available test suite is similar, none of the existing work abstracts
the overall characteristics of the input test cases. In comparison, Pixor abstracts the
API use cases in the input Android apps as an API Graph. With API Graph, Pixor
can effectively select app subroutines that can exercise diverse system functionalities.

Chapter 3

Understanding & Characterizing


FIC Issues

3.1 Study Design and Setup

In order to understand FIC issues and answer our five research questions, we
performed a large-scale study, which consists of two parts. First, we collected real-
world FIC issues that affected popular open-source Android apps and investigated
them down to source code level to answer RQ1–3 and RQ5. Second, we communicated
with Android practitioners of different roles and experiences via interviews and online
questionnaires to understand FIC issues from their perspective. The findings of
this part will answer RQ4 and additionally help cross validate and supplement the
findings of the first part. In the following, we present the design and setup of our
empirical study of FIC issues, practitioner interviews and survey in detail.

3.1.1 Empirical Study of Real-World FIC Issues

Table 3.1: Android Apps Used in Our Empirical Study (1M = 1,000,000)
App Name               Project Start Time   Category         Description            Rating    Downloads    KLOC    # FIC Issues
CSipSimple [22]        04/2010              Communication    Internet call client   4.3/5.0   1M - 5M      59.2    83
AnkiDroid [16]         07/2009              Education        Flashcard app          4.5/5.0   1M - 5M      50.8    44
K-9 Mail [42]          10/2008              Communication    Email client           4.2/5.0   5M - 10M     93.7    37
VLC [88]               11/2010              Media & Video    Media player           4.4/5.0   10M - 50M    151.1   36
AnySoftKeyboard [18]   05/2009              Tools            3rd-party keyboard     4.4/5.0   1M - 5M      38.9    20


Table 3.2: Keywords Used in Issue Identification


device compatible compatibility samsung
lge sony moto lenovo
asus zte google htc
huawei xiaomi android.os.build

Dataset Collection

Step 1: selecting app subjects. To study the research questions, we need to
collect three kinds of data from real-world Android projects: (1) bug reports and
discussions, (2) app source code, and (3) bug fixing patches and related code revisions.
To accomplish that, we searched for suitable subjects on three major open-source
Android app hosting sites: F-Droid [28], GitHub [33], and Google Code [37]. We
targeted subjects that have: (1) over 100,000 downloads (popular), (2) a public
issue tracking system (traceable), (3) over three years of development history with
more than 500 code revisions (well-maintained), and (4) over 10,000 lines of code
(large-scale). We adopted these criteria because the FIC issues in the subjects thus
selected likely affect a large user population using diverse device models.

We manually checked popular open-source apps on the above-mentioned three
platforms and found 27 candidates. Table 3.1 lists some examples (the full list of
apps is available on our project website [30]). The table gives the demographics
of each app, including: (1) project start time, (2) category, (3) brief description of
functionality, (4) user rating, (5) number of downloads on Google Play store, (6)
lines of code, (7) number of code revisions, and (8) number of issues documented
in its issue tracking system. The demographics show that the selected apps are
popular (e.g., with millions of downloads) and highly rated by users. Besides, their
projects all have a long history and are well-maintained, containing thousands of
code revisions.

Step 2: identifying FIC issues. To locate the FIC issues that affected the
27 candidate apps, we searched their source code repositories for two types of code
revisions:

• The revisions whose commit message contains fragmentation-related keywords.
We first included the names of top Android device brands from an Android Frag-
mentation Report [12] as fragmentation-related keywords. We then randomly
inspected some code commits concerning FIC issues, which were identified by
searching the brand names, and observed that words such as “device”,
“compatible”, and “compatibility” commonly occurred in the commit messages.
We added these words into our keyword set and Table 3.2 lists all the keywords
(case-insensitive).

• The revisions whose code diff contains the keyword “android.os.build”. We
selected this keyword because the android.os.Build class encapsulates device
information (e.g., OS version, manufacturer name, device model) and provides
app developers with the interface to query such device information and adapt
their code accordingly to ensure app compatibility.
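To illustrate why this keyword is a strong signal, the following minimal Java sketch (illustrative only; the class and method names are our own and do not come from any studied app) shows the typical way an app queries android.os.Build to adapt its behavior:

import android.os.Build;

public class DeviceInfoProbe {
    /** Decides whether a device-specific workaround should be applied. */
    public static boolean needsWorkaround() {
        // android.os.Build exposes identifiers of the running device and OS.
        String product = Build.PRODUCT;        // e.g., "SPH-M900"
        int apiLevel = Build.VERSION.SDK_INT;  // e.g., 15 on Android 4.0.3

        // Commits that introduce checks like this one typically patch FIC issues,
        // which is why "android.os.build" is a useful search keyword.
        return "SPH-M900".equalsIgnoreCase(product) || apiLevel < 16;
    }
}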

The keyword search returned many results, suggesting that all of our 27 candidate
apps might have suffered from various FIC issues. We selected the five apps with the
largest number of code revisions that contain our search keywords and manually
examined these revisions. Table 3.1 shows the information of the five apps. In total,
2,108 code revisions were found from the selected five apps. We carefully checked
these revisions and understood the code changes. After such checking, we collected
a set of 303 revisions, concerning 220 FIC issues. The number of revisions is larger
than the number of issues because some issues were fixed by multiple revisions.
Note that it is generally impossible to guarantee that all our collected issues have
been completely fixed. There are chances that the issues, although deemed fixed by
developers, could still occur on certain untested device models (e.g., those with very
small user bases). However, we made our best effort to locate all the patches for
each of our collected issues from the corresponding code repository. Table 3.1 (last
column) reports the number of found issues for each app subject. These 220 issues
form the dataset for our empirical study.

Data Analysis

To understand FIC issues and answer our research questions, we performed the
following tasks. First, for each issue, we (1) identified the issue-inducing APIs,
(2) recovered the links between the issue-fixing revisions and the bug reports, and
(3) collected and studied related discussions from online forums such as Stack
Overflow [76]. Second, we followed the process of open coding in the grounded theory
method, a widely-used approach for qualitative research [109], to analyze our findings
and classify each issue by its type, root cause, consequence, and fixing pattern.


Table 3.3: Basic Information of Our Interviewees

Developer  Company    Position                    Work Experience
P1         Kika Tech  Senior Android Engineer     6 years Android app development
P2         Chelaile   Android Development Leader  3 years Android app development
P3         Xiaomi     Android Engineer            2 years Android app development
P4         Alibaba    Senior Android Engineer     2.5 years of Android system development + 3.5 years of app development
P5         Tencent    Test Manager                5 years Android app testing

3.1.2 Interviews with Practitioners

To understand the common challenges faced by practitioners when they encounter
FIC issues, we conducted on-site and audio interviews with five experienced Android
professionals in the industry.

Interviewees

Table 3.3 shows the basic information of our interviewees, including their compa-
nies, positions, and work experience. As we can see from the table, our interviewees
are experienced in Android development: all of them have at least two years of
experience and three of them (i.e., P1, P4, and P5) have over five years of experience.
In particular, the interviewee P4 is experienced in developing Android system frame-
works. The interviewees also have diversified background. They work for companies
of different scales, from small startups to large leading IT companies. For example,
P4 and P5 are from Alibaba (http://www.alibabagroup.com/en/global/home) and Tencent
(https://www.tencent.com/en-us/index.html), two of the largest IT companies in China.
P4 is a senior developer working on the Android version of the online shop Alibaba
T-Mall, which is the largest online shop in China. P5 is a test manager for the
quality assurance of WeChat, which is an app of Tencent and one of the leading
mobile messaging apps all over the globe. On the other hand, P2 is the leader of
the Android development team in a startup company, Chelaile, which provides pub-
lic transportation tracking services in China. Interviewing these practitioners with
different backgrounds allowed us to understand FIC issues from different perspectives.

Interview Design

The interviews were semi-structured. We started the interviews with questions
designed based on RQ1–5. At the same time, we encouraged our interviewees to
share their own experiences beyond these questions. In this way, we can not only
collect useful information to answer our research questions, but also learn about the
real issues, challenges, and solutions that are not covered by our research questions.
At the end of each interview, we also shared with the interviewee our vision of
developing automated tools to help developers combat FIC issues, aiming to seek
feedback to guide the tool’s design and development.

For all interviews, we sought the interviewees’ consent to make audio recording
of the interviews for subsequent analysis. On average, each interview took around 55
minutes. The interview questions are publicly available at our project website [30].

3.1.3 Online Practitioner Survey

Interviews are expensive to conduct. Gathering a sufficient number of interviewees
and conducting face-to-face interviews is a time- and money-consuming process (e.g.,
our interviews with P1 and P2 were on site and costly). Thus, it is difficult to scale
up the number of interviewees. To collect a sufficient amount of feedback from
app developers, we further conducted an online survey.

Questionnaire Design

The questionnaire consists of two parts: background information and FIC issue
experiences. Specifically, the questionnaire collects the following information:

Part 1: Background Information:

• Experience in years: 0–1 / 1–3 / 3–5 / over 5

• Region of working place

• Awareness of Android fragmentation: Yes / No

• Awareness of FIC issues: Yes / No

• Ever encountered FIC issues? Yes / No

We collected this background information to (1) filter respondents (e.g., the survey
proceeds only when the respondents have encountered FIC issues), (2) investigate
the awareness of FIC issues among respondents, and (3) analyze survey results of
different participant groups (e.g., groups of respondents with different experience
levels, etc.).


Part 2: FIC Issue Experiences and Practices:

• Root causes of encountered FIC issues: Problematic driver implementation /
Non-compliant OS customization / Peculiar hardware configuration / Android
platform API evolution / Original Android system bugs / Other (fillable)

• Challenges of FIC issue diagnosis and fix: Huge search space / Difficulties in
resetting issue-triggering device contexts / Difficulties in generating test cases
/ Difficulties in locating issue root causes / Other (fillable)

• Importance of FIC issue testing before app release: Very important / Important
/ Normal / Not important / Totally ignorable

• Platforms used to test FIC issues: Real devices / Online platforms / Emulators
/ Other (fillable)

• Names of used online platforms, pros and cons (short answer)

• Techniques or frameworks used to test FIC issues: Monkey / Android JUnit
Runner / Espresso / UI Automator / Robotium / Appium / Other (fillable)

• Used and expected tools to effectively reveal FIC issues (short answer)

We designed these questions to learn the common practices of the surveyed respon-
dents in solving FIC issues. Specifically, we investigated the common experiences
regarding the major root causes of the encountered FIC issues (RQ1). Note that we
did not design survey questions on RQ2 (issue-triggering context) and RQ3 (issue
consequence), since answers to these research questions are objective and can be
derived from the code commits and issue reports. We also
collected information on practices that cannot be learned from the empirical study
materials (e.g., challenges in FIC issue diagnosis and fixing, practices in FIC issue
testing) and surveyed on their experiences and opinions of tools that can effectively
facilitate FIC issue diagnosis.

Participants

To gather a sufficient number of respondents for our survey, we distributed our
questionnaire through different channels. (1) We posted our questionnaire on popular,
openly accessible Android developer online forums (e.g., xda-developers
(https://www.xda-developers.com), the Android development Google group
(https://groups.google.com/forum/), etc.). (2) We contacted professional developers to
help distribute our survey on the intranet forums of their companies. We invited our
contacts from Tencent, Baidu, Alibaba, Huawei, and many other IT companies to
disseminate our survey. (3) We further sent 3,714 emails to developers who have
contributed to the most starred open-source Android projects on GitHub.

Figure 3.1: Region of Our Survey Respondents’ Working Place (Marked in Dark Red Color)

In total, we received 237 responses from practitioners working in 28 different
countries, which are annotated with the dark red color in the world map in Figure 3.1.
Figure 3.2 shows the experience background of our respondents. We received sufficient
responses from respondents of each experience level. This enabled us to obtain
opinions on FIC issues from developers with different backgrounds. Among the
respondents, 196 (82.7%) are aware of Android fragmentation, 180 (75.9%) are
aware of FIC issues while 163 (68.8%) have encountered FIC issues in their own
development experiences. We further investigated the 74 (31.2%) respondents who
never encountered any FIC issues in their development experiences and found that
most of these respondents have relatively low experience levels. Respondents with
0–1 year or 1–3 years of experience account for 73.0% (54) of these respondents.
Responses from these respondents that have no experience with FIC issues are
excluded from our further analysis of FIC issues. As such, further analysis of the
survey results is performed based on the 163 responses that were completed by
respondents who had encountered FIC issues.
Figure 3.2: Working Experience of Our Survey Respondents (0–1 year: 14.8%; 1–3 years: 24.1%; 3–5 years: 24.9%; over 5 years: 36.3%)

3.2 Study Results

3.2.1 RQ1: Issue Type and Root Cause

Although Android fragmentation is pervasive [119], researchers and practitioners
have different understandings of the problem [55]. So far, there is no standard
taxonomy to categorize various FIC issues. Yet, such a taxonomy is crucial to
understand how FIC issues in each category are induced, which is a key step towards
automated detection of the issues. This motivates us to manually check the 220
issues in our dataset and construct a taxonomy following an approach adopted by
Ham et al. [118]. Specifically, we first categorized the issues into two general types:
device-specific and non-device-specific. The former type of issues can only manifest
on a certain device model, while the latter type of issues can occur widely on device
models that are deployed with a specific API level. Issues are categorized into these
general categories based on the two sources of Android fragmentation. Device-specific
issues mostly come from the numerous device models while non-device-specific issues
are induced due to Android platform evolutions. Then we further categorized the
two types of issues into subtypes according to their respective root causes. Table
3.4 presents our categorization results. In total, 120 (54.5%) of the 220 issues are
device-specific and the remaining 100 (45.5%) are non-device-specific.

We further disclosed and categorized the common root causes of FIC issues. We
identified the root causes of each FIC issue by studying their related discussions or
descriptions in the issue reports, forum posts and API documentations. Specifically,
we identified three common root causes for device-specific issues and two for non-
device-specific issues. In the following, we will discuss the issue types and issue root
causes in detail.


Table 3.4: Issue types and root causes for their occurrences

Type                 Root Cause                         Issue Example                # of FIC Issues
Device-Specific      Problematic driver implementation  CSipSimple bug #353          57
                     Non-compliant OS customization     Android bug #78377           39
                     Peculiar hardware configuration    VLC bug #7943                24
Non-Device-Specific  Android platform API evolution     AnkiDroid pull request #130  88
                     Original Android system bugs       Android issue #62319         12

Device-Specific FIC Issues

Device-specific FIC issues occur when invoking the same Android API exhibits
inconsistent behavior on different device models. These issues can cause serious
consequences such as crashes and lead to poor user ratings [125]. We found 120 such
issues in our dataset and identified three primary reasons for the prevalence of these
issues in Android apps: (1) problematic driver implementation, (2) non-compliant
OS customization, and (3) peculiar hardware configuration.

Problematic driver implementation. Out of the 120 issues, 57 were caused
by the problematic implementations of hardware drivers. Android apps rely on low-
level hardware drivers to operate hardware devices. Different driver implementations
can make an app behave inconsistently across device models. To ensure consistent
app behavior over multiple device models, app developers need to carefully adapt
the code that invokes those Android APIs whose implementation involves direct or
transitive calls to functions defined by hardware drivers. However, there are hundreds
of such APIs on the Android platform. It is impractical for developers to conduct
adequate compatibility tests covering all device models whenever they use such APIs
in their apps. Compatibility issues can be easily left undetected before the release of
their apps.

Typical examples we observed are the issues caused by the proximity sensor
APIs. According to the Android API guide, proximity sensors can sense the distance
between a device and its surrounding objects (e.g., users’ cheek) and the invocation
of the API getMaximumRange() will return the maximum distance the sensor can
detect [64]. In spite of such clear specifications, FIC issues often arise from the use
of proximity sensors in reality. For instance, on devices such as Samsung Galaxy S5
mini or Motorola Defy, the return value of getMaximumRange() may not indicate the


1.   boolean proximitySensorActive = getProximitySensorState();
2. + boolean invertProximitySensor =
3. +     android.os.Build.PRODUCT.equalsIgnoreCase("SPH-M900");
4. + if (invertProximitySensor) {
5. +     proximitySensorActive = !proximitySensorActive;
6. + }

Figure 3.3: CSipSimple Issue 353 (Simplified)

actual maximum distance the proximity sensor can detect. As a result, the proximity
between the device and other objects often gets calculated incorrectly, leading to
unexpected app behaviors [10]. CSipSimple developers filed another example in
their issue tracker (issue 353) [23]. The issue occurred on Samsung SPH-M900
whose proximity sensor API reports a value inversely proportional to the distance.
This causes the app to compute a distance in reverse to the actual distance. The
consequence is that when users hold their phones near their ears to make phone calls
using CSipSimple, the screen remains touch sensitive, so users often accidentally
hang up their calls. More annoyingly, the screen becomes touch insensitive when
users move their phones away from their ears, making it difficult to hang up a
call. CSipSimple developers later tracked down the problem and fixed the issue by
adapting their code to specially deal with the Samsung SPH-M900 devices, as shown
in Figure 3.3.

Non-compliant OS customization. 39 of the 120 device-specific issues occurred
due to non-compliant OS customization made by Android device manufactur-
ers. There are three basic types of customization: (1) functionality modification, (2)
functionality augmentation, and (3) functionality removal. In our study, we observed
that all these three types can result in device-specific issues.

• Functionality modification. Android device manufacturers often modify the
original Android OS implementation for their market needs. However, their
modified versions may contain bugs or do not fully comply with the original
Android platform specification. This can easily cause FIC issues. For example,
several issues in AnkiDroid, AnySoftKeyboard, and VLC were caused by a
bug in the MenuBuilder class of the appcompat-v7 support library on certain
Samsung and Wiko devices running Android 4.2.2, which was documented
in Android issue tracker (issue 78377) [1]. The bug report contains intensive
discussions among many Android app developers, whose apps suffered from
crashes caused by this bug.


1. Intent startSettings =
2. new Intent(android.provider.Settings.ACTION_INPUT_METHOD_SETTINGS);
3. startSettings.setFlags(Intent.FLAG_ACTIVITY_NEW_TASK);
4. + try {
5. mAppContext.startActivity(startSettings);
6. + } catch (ActivityNotFoundException notFoundEx) {
7. + showErrorMessage("Does not support input method settings");
8. + }

Figure 3.4: AnySoftKeyboard Issue 289 (Simplified)

• Functionality augmentation. Android device manufacturers often add new
features to the original Android OS to make their devices more competitive.
Such functionality augmentation brings burden to app developers, who need to
ensure that their apps correctly support the unique features on some devices
and at the same time remain compatible on other devices. Unfortunately, this is
a non-trivial task and can easily cause FIC issues. For example, after Samsung
introduced the multi-window feature on Galaxy Note II in 2012,
AnkiDroid and CSipSimple developers attempted to revise their apps to support
this feature. However, after several revisions, the developers of CSipSimple
finally chose to disable the support in revision 8387675 [22], because this new
feature was not fully supported by some old Samsung devices and they failed
to find a reliable workaround to guarantee compatibility.

• Functionality removal. Some components in the original Android OS can be
pruned by device manufacturers if they are considered useless on certain devices.
An app that invokes APIs relying on the removed system components may crash,
causing bad user experience. For example, issue 289 of AnySoftKeyboard [18]
reported a crash on Nook BNRV350. On this device, the input method
setting functionality is by default unavailable; hence invoking the API to start
input method setting activity crashed the app. Developers fixed this issue in
revision b68951a [18] by displaying an error message upon the occurrence of
the ActivityNotFoundException as shown in Figure 3.4. Another example is:
VLC developers disabled the invocation of navigation bar hiding APIs on HTC
One series running an Android OS prior to 4.2 in revision 706530e [88], because
the navigation bar was eliminated on these devices and invoking related APIs
would lead to the killing of the app by the OS.

Peculiar hardware configuration. The remaining 24 device-specific issues
occurred due to the diverse hardware configuration on different Android device
models. It is common that device manufacturers utilize different chipsets and
hardware components with diverse specifications for their models [122]. Such diversity


of hardware configuration can easily lead to FIC issues, if app developers do not
carefully deal with all hardware variants. For example, SD card has caused much
trouble in real-world apps. Android devices with no SD card, single SD card and
multiple SD cards are all available on market. Even worse, the mount points for
large internal storage and external SD cards may vary across devices. Under such
circumstances, app developers have to spend tremendous effort to ensure correct
storage management on different device models. This is a tedious job and developers
can often make mistakes. For instance, issue 7943 of VLC [89] reported a situation,
where the removable SD card was invisible on ASUS TF700T, because the SD card
was mounted to an unusual point on this device model. Besides ASUS devices, VLC
also suffered from similar issues on some Samsung and Sony devices (e.g., revisions
50c3e09 and 65e9881 [88]), whose mount points of SD cards were uncommon. To
fix these issues, VLC developers hardcoded the directory path in order to correctly
recognize and read data from SD cards on these devices. Besides SD cards, different
screen sizes and resolutions also frequently caused compatibility issues (e.g., revision
b0c9ae0 of K-9 Mail [42], revision 6b60085 of AnkiDroid [16]).
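Later Android releases offer storage APIs that illustrate how apps can avoid hardcoding vendor-specific mount points. The following Java sketch is illustrative only (it is not the VLC fix); it enumerates the app-accessible directories on all external storage volumes via Context.getExternalFilesDirs(), which is available from API level 19:

import android.content.Context;
import java.io.File;

public class StorageVolumes {
    /**
     * Lists app-specific directories on the primary external storage and any
     * mounted SD cards (API level 19+). Entries can be null when a volume is
     * temporarily unavailable.
     */
    public static void logExternalVolumes(Context context) {
        File[] dirs = context.getExternalFilesDirs(null);
        for (File dir : dirs) {
            if (dir != null) {
                System.out.println("Storage volume: " + dir.getAbsolutePath());
            }
        }
    }
}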

Non-Device-Specific FIC Issues

We observed 100 non-device-specific issues in our dataset. They arose mostly due
to the Android platform evolution or bugs in the original Android systems. These
issues are independent from device models and can affect a wide range of devices
running specific Android OS versions.

Android platform API evolution. Android platform and its APIs evolve
fast [137, 147]. Since its first release in 2008, Android has released 8 versions
comprising 27 different API levels [24]. The APIs are updated with such frequent
system evolutions [137]. These updates may introduce new APIs, deprecate old
ones, or modify API behavior. 88 non-device-specific issues in our dataset were
caused by such API evolutions. For instance, Figure 3.5 gives a typical example
extracted from AnkiDroid revision 7a17d3c [16], which fixed an issue caused by
API evolution. The issue occurred because the app mistakenly invoked an API
SQLiteDatabase.disableWriteAheadLogging(), which was not introduced until
API level 16. Invoking this API on devices with an API level lower than 16 could
crash the app. To fix the issue, developers put an API level check as a guarding
condition to skip the API invocation on devices with lower API levels (Line 1).


1. + if (android.os.Build.VERSION.SDK_INT >= 16) {
2.       sqLiteDatabase.disableWriteAheadLogging();
3. + }

Figure 3.5: AnkiDroid Pull Request 130 (Simplified)

Original Android system bugs. The other 12 non-device-specific issues were
caused by bugs introduced in specific Android platform versions. As a complicated
system, Android platform itself contains various bugs during development. These
bugs may get fixed in a new release version, while still existing in older versions and
hence leading to FIC issues. For instance, K-9 Mail encountered a problem caused
by Android issue 62319 [1], which only affects Android 4.1.2. On devices running
Android 4.1.2, calling KeyChain.getPrivateKey() without holding a reference of
the returned key would crash K-9 Mail with a fatal signal 11 after garbage collection.
A member from the Android project explained that the issue was caused by a severe
bug introduced in Android 4.1.2, which was later fixed in Android 4.2, and provided
a valid workaround for app developers.

Root Causes from Practitioners’ Perspective

To further validate our above empirical findings of FIC issues, we designed a


multiple choice question in the survey and asked the respondents to select all the
root causes of the FIC issues they encountered. The options include all the major
root causes that we have previously discussed as well as an additional fillable “Other”
option. The “Other” option allows respondents to answer when they are unsure
which option to choose or wish to elaborate on their answers. Figure 3.7 shows the
distribution of the answers for this question from 163 survey respondents. From the
results, we can make two important observations.

First, overall, our observed root causes are supported by the survey results. 11
respondents selected the “Other” option, while 3 of them only selected this option
without selecting any root causes identified by our empirical study. This shows that
98.2% of our respondents agreed that at least one of our identified root causes
induced their encountered FIC issues (Figure 3.6). Eight of the 11 respondents
who selected “Other” provided elaboration on their answers. We processed their
elaborations and re-categorized them where appropriate. As a result, the root causes
specified by six of these eight respondents can be categorized into one of the existing
options. For example, one respondent filled in “Required API (is) not available on


Figure 3.6: Number of Respondents Selecting Two Different Types of Options:
(1) Root Causes Identified by Our Empirical Study (98.2%), (2) Only the “Other”
Option (1.8%).

Figure 3.7: Root Causes Specified by Survey Respondents (bar chart over: Non-Compliant
OS Customization, Android Platform API Evolution, Original Android System Bugs,
Peculiar Hardware Configuration, Problematic Driver Implementation, Other)

older devices”. This answer is a typical case of “API evolution” where the respondent
intends to use an API that is unavailable on old devices with low API levels. After
such re-categorization, only two respondents specified possible root causes for FIC
issues other than the root causes we have observed (i.e., interactions with other
apps installed on the device and different versions of Android development tools).
The survey results confirm that our five categories of root causes for FIC issues are
commonly encountered by developers in practice.

Second, according to Figure 3.7, “non-compliant OS customization” is recognized
as a root cause of FIC issues by most (71.2%) of our 163 respondents. This indi-
cates that device-specific issues caused by OS customizations are widespread across
Android apps. In the answers for fillable questions and email exchanges, some of
our respondents stated that they often encounter device-specific issues, which are
considered critical by them. For example, a respondent put down “Most issues are
device-specific (e.g. Samsungs love of breaking APIs or UI), which Lint can’t know
about”. Besides the online survey, our interviewees also shared similar opinions.
On commenting the two types of FIC issues, interviewees P1 and P2 found that
non-device-specific issues are easier to address while device-specific issues are common
and challenging. P1 explained “We can use Lint and Android support libraries to
locate and help fix the non-device-specific issues but for device-specific issues, we
don’t have such supports. Most of the time, we can only guess and try (to search for
valid solutions)”. This observation is in line with our empirical study result that both
device-specific and non-device-specific issues are pervasive, while device-specific
issues occur relatively more often than non-device-specific issues in practice and are
more difficult to locate and fix.

Answer to RQ1: We observed five major root causes of FIC issues in Android
apps, which can be categorized into two general types: device-specific and non-
device-specific. The results of our survey and interviews showed that the observed
root causes are common in practice. It is challenging to ensure app compatibility
on various device models with diverse software and hardware environments.

3.2.2 RQ2: Issue-Triggering Context

A distinguishing feature of FIC issues is that they exhibit exceptional behavior
on specific device models or API levels. Such a hardware or software environment is
key to triggering FIC issues: FIC issues can only be triggered under specific runtime
environments. Comprehensively understanding the triggering context for FIC issues
can help better understand the nature of FIC issues and further model the issues to
design detection techniques.

FIC issues’ triggering contexts (e.g., the names of device models or API levels that
can trigger FIC issues) are commonly discussed in the issue reports and leveraged in
the patches for FIC issues. In our study, we investigated the issue reports and issue
patches to identify the key issue-triggering contexts. We observed that the issue-
triggering contexts can be categorized into device contexts and API-level contexts.
This is compliant with the two major sources of FIC issues: diversified device models
and API levels. We now discuss these two types of FIC issue-triggering context.

Device contexts. Device contexts encapsulate the properties of device models.
130 issues in our dataset require device contexts to trigger. From our dataset, we
observed that device contexts can be explicit or implicit. Explicit device contexts are
identifiers of the device models. The majority of explicit device contexts are specific
identifiers of the device models, including DEVICE (i.e., the name of the industrial
design), MODEL (i.e., the end-user-visible name for the device product), or PRODUCT
(i.e., the name of the overall device product). Other explicit device contexts are more
general like the MANUFACTURER (i.e., the manufacturer of the device/hardware) or
BRAND (i.e., the consumer-visible brand with which the device/hardware is associated)
of the devices. These identifiers are encapsulated in the android.os.Build class and
can be retrieved at runtime. Many developers leverage such context information to
patch FIC issues. For example, Figure 3.3 shows an issue where the device context is
explicitly applied in the issue patch. The triggering context of this issue is a typical
explicit device context: device models of which the PRODUCT identifier is “SPH-
M900”. Apart from explicit device contexts, implicit device contexts are indicators
for specific device properties such as special software or hardware configurations.
AnySoftKeyboard issue 289 [18] is an example, where an implicit device context is
applied. The issue describes a crash when the app starts the system input method
setting on a device without this functionality as discussed in Section 3.2. This is a
typical example of implicit device context that can trigger FIC issues.
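To make the two flavors concrete, the following Java sketch (illustrative only; the class and method names are invented) contrasts an explicit device context, matched against an android.os.Build identifier, with an implicit device context, probed as a device capability:

import android.content.Context;
import android.content.pm.PackageManager;
import android.os.Build;

public class DeviceContextChecks {
    /** Explicit device context: match a concrete device identifier. */
    public static boolean isSphM900() {
        return "SPH-M900".equalsIgnoreCase(Build.PRODUCT);
    }

    /** Implicit device context: probe for a capability rather than a model name. */
    public static boolean hasProximitySensor(Context context) {
        return context.getPackageManager()
                .hasSystemFeature(PackageManager.FEATURE_SENSOR_PROXIMITY);
    }
}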

API-level contexts. Some FIC issues can be triggered on platforms with
specific API levels. We refer to the contexts capturing the information of API level
as API-level contexts. 110 issues in our dataset require API-level contexts to trigger.
Figure 3.5 shows an issue that can be triggered on devices running Android with an
API level lower than 16. The API disableWriteAheadLogging() is only supported
since API level 16. So invocation of this API on devices with API levels lower than 16
will cause app crashes. The patch explicitly checks the API level of the current device
and avoids calling the API on old devices with lower API levels. In this example,
“API level lower than 16” is a typical API-level context.

Device and API-level contexts are orthogonal. The triggering contexts of real-
world FIC issues can be combinations of these two types of contexts, e.g., some issues
can only be triggered on specific device models with specific API levels. 20 issues
in our dataset require both types of contexts to trigger. On the other hand, the
manifestation of FIC issues could also depend on specific program states such as the
argument values flowing into API calls. In this aspect, FIC issues do not differ from
other types of software defects and we do not make further discussions here.

In our dataset, device contexts and API-level contexts are widely adopted in FIC
issue patches. The patches check such contexts either explicitly or implicitly to fix
FIC issues accordingly. Our interviewees shared similar experiences. P1, P2, P3,

32
Section 3.2 Lili Wei

4.1% 15.9%
Functional Issues

Performance Issues

User Experience
80.0% Degradation

Figure 3.8: Common Consequences of FIC Issues

and P4 stated that they would check the two types of contexts to fix FIC issues
but sometimes it was not easy to explicitly isolate the exact contexts that could
trigger FIC issues (especially for device contexts). For example, P3 said “Yes, we
used platform-related context checkings to fix FIC issues. Sometimes, it is difficult
to enumerate all the device models that are affected by certain FIC issues so we
may use some implicit method to check the behavioral features to identify the
problematic devices.” With such observations, we further leverage the information of
issue-triggering context to model FIC issues and design detection tools. More details
are discussed in Section 4.1.

Answer to RQ2: Device and API-level contexts are two key types of triggering
contexts of FIC issues. They are orthogonal and the triggering contexts of real-
world issues can be combinations of the two types of contexts. Such context
information is widely used in FIC issue patching and can provide guidance for
FIC issue modeling and detection.

3.2.3 RQ3: Issue Consequence

FIC issues can cause inconsistent app behavior across devices. Such behavioral
inconsistency can be functional or non-functional. We studied the inconsistent
behaviors discussed in the code comments and issue reports to investigate the
consequences of FIC issues. We discuss some common consequences below.

Figure 3.8: Common Consequences of FIC Issues (Functional Issues: 80.0%; Performance
Issues: 4.1%; User Experience Degradation: 15.9%)

Figure 3.8 shows the distribution of the common consequences induced by FIC
issues. 185 (84.1%) of the 220 FIC issues are functional and performance issues.
Their consequences resemble those of conventional functional and performance issues.
The vast majority (176 / 220 = 80.0%) of them are functional issues. Performance
issues only account for a minority (9 / 220 = 4.1%) of the 220 issues found. As
aforementioned in Section 3.2.1, the functional issues can cause an app to crash (e.g.,
the AnkiDroid issue in Figure 3.5) or not function as expected (e.g., CSipSimple
issue 353 in Figure 3.3). The performance issues can slow down an app. For instance,
AnkiDroid introduced a simple interface specifically for low-end devices in revision
35d5275 [16] to deal with an issue that slows down the app. In addition to slowing
down an app, some performance issues can cause an app to consume an excessive
amount of memory resources. For example, K-9 Mail fixed an issue in revision
1afff1e [42] caused by an old API delete() of the SQLiteDatabase class, which does
not delete database journal files, resulting in a significant waste of storage space.

The remaining 35 (15.9%) of the 220 issues do not cause functional or performance
problems, but still seriously affect user experience. These issues are not bugs per se.
Some of them lead to UI variations on devices with different screen sizes. For example,
AnySoftKeyboard added ScrollView to enhance the UI of the app on small-screen
devices in revision 8b911f2 [18]. Other issues require support for device-specific
features. For example, in revision e3a0067, AnkiDroid developers revised the app to
use Amazon market instead of Google Play store to support app store features on
Amazon devices [16]. If Google Play store is used on Amazon devices, the app store
features cannot function properly. These issues present a major challenge to Android
app developers [12,122] and are a primary reason for bad user reviews of an app [125].
App developers often need to fix such issues by enhancing or optimizing their code to
interoperate with the diversified features of different devices to guarantee satisfactory
user experience.

In our interviews, the participants also shared with us their experiences regarding
consequences of FIC issues. All of them have encountered FIC issues arising from UI
display variations. Despite the commonness of such issues, the interviewees considered
them mostly “tolerable” and “subjective”. For example, P1 stated “The determination of UI problems is
subjective. For some devices, it is not possible to implement exactly the same user
interface as the designs. These interface differences are not necessarily FIC issues
if they do not cause serious problems”. P4 held similar opinions: “If UI display
problems do not affect the major functionalities, we will consider it tolerable and may
not fix it immediately”. Functional problems are also common and the interviewees
especially care about crashing issues. P1, P2 and P3 stated that they use crash
reporting systems to collect crash information (including device model information
and API levels) from end users for crash diagnosis. They observed crashes that only
occurred on certain device models. P2 stated “We will fix these crashes if they affect
a considerable number of users. If the root causes cannot be located, we will at
least catch the exception to avoid app crashes”. The interviewees also indicated
that they encountered some performance problems but they are not common in
their encountered FIC issues. These experiences are consistent with our empirical
findings. We observed many functional issues but few performance issues. Note that
the number of UI display problems that we observed is not significant, although
our interviewees claimed that such problems are common. This is reasonable as the
interviewees also explained that display problems are tolerable and may not get fixed
if they do not affect the major functionalities. Since we collected FIC issues from
code change revisions, and unfixed UI display problems are not archived in the change
histories, we cannot identify such “tolerated” display problems in our empirical study.

Answer to RQ3: FIC issues can cause both functional and non-functional
consequences such as app crashes, malfunctions, performance degradation, and user
experience degradation. In practice, Android app developers pay most attention
to FIC issues causing functional problems, especially app crashes.

3.2.4 RQ4: Issue Testing

In this section, we study developers’ practice of FIC issue testing. Specifically, we
study the importance of FIC issue testing, commonly used platforms, and commonly
used tools. Various studies have been conducted to understand Android developers’
practices in testing Android apps [141, 142]. Our study focuses on testing FIC issues.
As explained in Section 3.1.2 and Section 3.1.3, we base our study on the interviews
with five experienced Android practitioners and the 163 responses to our online survey.

Importance. Compatibility issues are a well-recognized challenge in Android
app testing [128]. Figure 3.9 shows the survey results on the importance of FIC issue
testing before app releases. The majority (74.8%) of our survey respondents indicated
that it is “very important” or “important” to test FIC issues before app releases
with non-trivial efforts. Another 22.1% of the respondents considered that FIC issue
testing is of normal importance and should be conducted with some efforts. Only
3.1% of our respondents considered that FIC issue testing before release is “not
important”, indicating that they “do not perform specific FIC issue testing but will
fix FIC issues when found”. No respondents regarded FIC issue testing as “totally
ignorable”. This indicates that developers generally consider that FIC issue testing
before app releases is important.


Figure 3.9: Importance of FIC Issue Testing before App Releases (Very Important
and Important: 33.1% and 41.7%, i.e., 74.8% in total; Normal: 22.1%; Not Important:
3.1%; Totally Ignorable: 0%)

Test platforms. Physical Android devices and emulators are widely used for
Android app testing. Existing commercial online platforms also provide cloud-
based services for app developers to test their apps with remote Android devices.
We surveyed what testing platforms Android developers use to test FIC issues.
Figure 3.10 shows the distribution of commonly used testing platforms. The vast
majority of the respondents (152 out of 163 respondents) use “real devices” to test
FIC issues and a considerable number (78) of the respondents use emulators to test
FIC issues. Although testing on real devices is preferred, acquiring an adequate
coverage of such devices for testing is expensive. Android device emulators, which
support configurable API levels and screen sizes, provide a nice alternative for testing
non-device-specific and UI display issues. A respondent commented that emulators
“cannot reveal problems caused by interactions between hardware and software”. They
cannot emulate the actual Android systems on real devices to test device-specific
issues. 32 (19.6%) of 163 respondents have used online testing platforms. Although
online testing platforms provide a good alternative to access various device models,
they have not yet been widely adopted. We designed a follow-up short answer
question for those respondents who have used online testing platforms to name the
specific online platforms they used and to identify the advantages and disadvantages
of these platforms. In their answers, they commented on the drawbacks of online
platforms mainly from the following three aspects: (1) high testing costs, (2) limited
number of supported device models, and (3) limited number of supported testing tools (e.g.,
Android debug bridge). This information could guide the online testing providers to
further improve their products and services. The remaining six respondents selected
the “other” choice and specified that they also use their self-built test farms or
crowd-based testing platforms to test FIC issues.


Figure 3.10: FIC Issue Test Platforms (Real Devices: 152; Emulators: 78; Online
Testing Platforms: 32; Other: 6)

Testing frameworks and tools. Various testing frameworks and tools have
been designed to facilitate app testing. For example, Monkey is a tool maintained
by Google that randomly generates user event sequences and automatically exercise
Android apps [87]. We are interested in learning if there are (1) tools that the
respondents commonly used for testing (general purpose), (2) tools that they found
effective for FIC issue testing, and (3) other non-testing tools that they used for
finding FIC issues.

Figure 3.11 shows the common frameworks/tools used by our respondents for
app testing. The most commonly used tools are Monkey and AndroidJUnitRunner.
In our survey and interviews, none of our respondents or interviewees named an
effective tool for FIC issue testing and one respondent indicated that he or she
manually tested FIC issues. Existing tools are primarily designed for general testing
purposes of Android apps and cannot resolve unique challenges of FIC issue testing
such as the huge search space discussed in Section 3.2.5. In our interviews, all of our
interviewees claimed that their apps are mainly manually tested on different devices
owned by the company for finding FIC issues. P4 commented that his company
has a dedicated team for improving app compatibility by manually testing the apps
on diversified device models. Among our survey respondents, some respondents
mentioned that Lint can help expose some non-device-specific issues, but it often
generates a large amount of false positives and is not helpful in finding device-specific
issues as discussed in Section 3.2.1. This suggests a lack of effective testing support
for FIC issues that can resolve the unique challenges of FIC issue testing. Future
studies can focus on designing specific testing techniques that facilitate FIC issue
testing by resolving the challenges discussed in the next section.

Answer to RQ4: FIC issue testing is commonly regarded as an important task
before Android app releases. Most developers test FIC issues on physical devices
and no effective testing tools for FIC issues were found.


Figure 3.11: Android App Test Frameworks (Monkey, AndroidJUnitRunner, Espresso,
UI Automator, Appium, Robotium, Other)

3.2.5 RQ5: Issue Diagnosis and Fixing

To answer RQ5, we further performed the following tasks: (1) we studied the
challenges developers encountered when diagnosing and fixing FIC issues, (2) we
interviewed developers on how they resolve FIC issues, (3) we analyzed the patches
applied to the 220 issues in our dataset, and (4) we analyzed the discussions in the
issue reports that we have successfully associated with our collected FIC issues. We
aim to identify (1) the major challenges encountered by developers when diagnosing
and fixing FIC issues, (2) how app developers debug and fix FIC issues, (3) whether
FIC issue patches exhibit common patterns, and (4) how complicated the FIC issue
patches are.

Challenges. Figure 3.12 shows the challenges that our survey respondents
encountered in their experiences when resolving FIC issues. A respondent may put
down more than one challenge during the online survey. Here, we discuss the major
challenges that received votes from over half of our respondents.

The top-ranked challenge is the huge search space: 104 of our respondents
considered the huge search space as a challenge that complicates FIC issue diagnosis
and fixing. The search space for Android FIC issues not only involves numerous
components and used APIs of Android apps but also includes runtime device platforms
as FIC issues can only be triggered on particular device platforms. The search space
for device platforms can be a combination of device models and API levels. Android
developers need to make tremendous efforts to expose, debug, and patch FIC issues
within such a huge search space.


A comparatively large number (101) of respondents considered that it is challenging
to set up a device (including its software or hardware environment) to trigger
FIC issues. One interviewee, P1, shared an experience of developers in his group.
Many end users do not update their apps or Android systems to the latest available
versions. It can be difficult for developers to set up mobile devices to reproduce
the FIC issues that arise in these users’ apps. Furthermore, this challenge can be
more problematic for small companies and open-source Android projects as they
have limited access to a large number of devices for testing. We found that this
challenge has more impact on open-source project developers: 70.8% of the survey
respondents who are open-source project developers on GitHub selected this option,
while only 50% of the survey respondents we invited from IT companies selected this
option. One of our interviewees, P2, who is the Android development team leader
for a startup company shared with us that sometimes his testing group did not have
the device model he needed to reproduce FIC issues. In such cases, he would check
if anyone in his company has a device of the concerned model. If he failed to find a
device of the model, he would patch the issues based on his own experience and seek
feedback on the patch from users.

Another prominent challenge is to locate the root causes of FIC issues, which often
reside in the system or device APIs instead of the application logic. Thus, locating
such root causes likely requires close examination of the software and hardware
environments of the concerned Android devices. However, device firmware and the
implementation details of hardware drivers are mostly inaccessible to developers. In
our empirical study, we also observed cases in which app developers could not find a
valid patch and gave up fixing the issues. For example, issue 1491 of CSipSimple [23]
was caused by the buggy call log app on an HTC device. However, since the
implementation details are not accessible, CSipSimple developers failed to figure out
the root cause of the issue and hence left the issue unfixed. Below is their comment
after multiple users reported similar issues (issue 2436 [23]):

“HTC has closed source this contact/call log app which makes things almost impossible
to debug for me.”

Identifying valid patches. When the root causes occur at the operating system
or device driver levels, developers tend to call their patches “workarounds” rather
than “fixes”. This is because they mostly resolved the issues by finding a way to work
around the problems, making their apps compatible with the problematic devices.


Figure 3.12: Challenges for FIC Issue Diagnosis and Fixing (Huge Search Space,
Environment Setup, Root Cause, Test Case, Other)

Table 3.5: Common Patching Patterns

Fixing Pattern                                  # Issues  Issue Example
Check device context or API level               154       AnkiDroid pull request #130 (Figure 3.5)
Check software/hardware component availability  14        AnySoftKeyboard issue #289 (Figure 3.4)

In the situations where the root causes cannot be located, app developers derived
workarounds by trial and error. In many cases, they leveraged information collected
by crash reporting systems and asked volunteers to help test different workarounds
or sought advice from other developers whose apps suffered from similar issues.
Our interviewees indicated that they usually fix FIC issues based on guesses and
experience. For crashing issues, they leveraged the stack traces collected from crash
reporting systems for diagnosis. Patches of such issues are often validated by checking
if the patch deployment leads to a reduction of crashes. Another information source
for FIC issue diagnosis and fixing is end users’ feedback. All of our interviewees
agreed that user feedback can help them diagnose and fix FIC issues. P3 said “User
feedback is a very important information source for problems of our app. We have a
forum where users can provide all kinds of feedback for products in our company.
Many real problems are actually identified by our users”. P1 also agreed that user
feedback can facilitate FIC issue diagnosis and fixing, but he also indicated that the
quality or usefulness of the feedback strongly depends on the background of the users.
He gave us an example: “One of our users who has reported an issue is a developer
himself so he can provide very important debugging information for us. However,
for normal users, they may just complain on the problems but cannot provide any
useful information”.

Common patching patterns and patch complexity. Although FIC issues
are often caused by specific problems, many patches share common patterns
(Table 3.5):


• The most common pattern (154 out of 220 patches) is to check explicit device
context or API level (e.g., manufacturer, SDK version) before invoking the
APIs that can cause FIC issues on certain devices. One typical way to avoid
the issues is to skip the API invocations (see Figure 3.5 for an example) or
to replace them with an alternative implementation on problematic devices.
This can help fix issues caused by problematic driver implementations, Android
platform API evolutions as well as Android system bugs.

• Another common strategy (14 out of 220 issues) is to check the availability
of certain software or hardware components on devices (i.e., implicit device
context) before invoking related APIs and methods. This is a common patching
strategy for FIC issues caused by OS customizations and peculiar hardware
compositions on some device models.

• The remaining patches are app-specific workarounds and strongly related to
the context of the FIC issues.

The patches are usually simple and small in size (comprising several lines of
code). Many patches only require adding a condition that checks device information
before invoking issue-inducing APIs. Similarly, checking the presence of certain
software/hardware components usually only requires adding a try-catch block or a
null pointer checking statement. Two-thirds of the 220 issues in our dataset were
patched by these two kinds of patches.
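As a hedged illustration of the second pattern (the first is already shown in Figure 3.5), the following Java sketch (not taken from any studied app) guards the use of a hardware component with a null check so that device models lacking the component degrade gracefully:

import android.content.Context;
import android.hardware.Sensor;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;

public class ProximityGuard {
    /** Registers the listener only if the device actually ships a proximity sensor. */
    public static boolean registerIfAvailable(Context context, SensorEventListener listener) {
        SensorManager sm = (SensorManager) context.getSystemService(Context.SENSOR_SERVICE);
        Sensor proximity = sm.getDefaultSensor(Sensor.TYPE_PROXIMITY);
        if (proximity == null) {
            // Peculiar hardware configuration: the component is absent on this model.
            return false;
        }
        return sm.registerListener(listener, proximity, SensorManager.SENSOR_DELAY_NORMAL);
    }
}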

Answer to RQ5: FIC issue diagnosis and fixing are challenged by the huge
search space, complications in setting up specific software/hardware environments
for reproducing issues as well as the difficulties of locating the root causes. The
majority of FIC issue fixes follow a simple pattern: performing careful checking
of device contexts or API levels before invoking issue-inducing APIs.

3.2.6 Implications of Our Findings

In this section, we discuss two possible applications of our study findings and
illustrate how we can help developers combat FIC issues.

Compatibility testing. As discussed in Section 3.2.5, one critical challenge for
FIC issue diagnosis of Android apps is the huge search space caused by fragmentation.
Due to limited time and budget, it is impractical for developers to fully test all
components of their apps on all potential users’ device models running various
customized Android versions. To address this challenge, existing work proposed
approaches to prioritize device models for compatibility testing [125, 146]. However,
prioritizing device models alone cannot effectively reduce the search space. The
other two dimensions, i.e., OS versions by vendors and app components, can still
create a large number of combinations. The findings and dataset of our empirical
study [30] can be leveraged to further reduce the search space by providing extra
clues about the device models, APIs, and OS versions that likely cause FIC issues.
Such information can guide developers to expose FIC issues with less test effort. For
example, we discussed earlier that due to the problematic driver implementation,
the proximity sensor APIs caused many FIC issues in popular apps. App developers
can prioritize test efforts on the components that use these APIs on the devices with
problematic drivers.

Automated detection and repair of FIC issues. By studying the root causes
and patches of the FIC issues, we observed that FIC issues recur (e.g., Android
issue 78377 discussed in Section 3.2.1) in multiple apps and share similar fixes. It
suggests the possibility to generalize the observations made by studying the 220 FIC
issues to automatically detect and repair FIC issues in other apps. As discussed in
Section 3.2.5, some app developers fixed the FIC issues in their apps by referencing
the patches of similar issues in other apps. This indicates the transferability of such
knowledge. Furthermore, we invited our respondents to share their own ideas on
possible tools that can be useful for FIC issues in our survey. Some respondents
proposed similar ideas on FIC issue documentation and detection. For example,
two respondents hoped to have public platforms or databases to archive known
FIC issues from the field and one respondent further suggested that a static code
analysis tool built based on the database would also be helpful. This motivates us
to design an automated FIC issue detection tool to generalize the knowledge obtained
in our empirical study. To demonstrate the feasibility, we manually extracted 20
concrete issue patterns from our dataset and designed a static analysis tool to check
for potential FIC issues in Android apps by pattern matching. In Section 4.1, we will
present how to encode the issue patterns and the algorithm for issue detection. We
will also show by experiments that such a tool can help detect real and unreported
FIC issues in many real-world Android apps in Section 4.2.

Chapter 4

FicFinder: Modeling & Detecting FIC Issues

4.1 Issue Modeling and Detection

With our empirical findings, we propose a static code analysis technique, FicFinder,
to detect FIC issues in Android apps automatically. The detection is based on a
model that captures issue-inducing APIs and the issue-triggering context of each
pattern of FIC issues. FicFinder takes an Android app’s Java bytecode or .apk
file as input, performs static analysis, and outputs its detected compatibility issues.
In the following, we first explain the modeling of compatibility issue patterns and
then describe the detection algorithm of FicFinder.

4.1.1 API-Context Pair Model

From the root causes and fixes of FIC issues, we observed that many issues are
triggered by the improper use of an Android API, which we call issue-inducing API, in
a problematic software/hardware environment, which we call issue-triggering context.
This motivates us to encode compatibility issue patterns into pairs of issue-inducing
API and issue-triggering context, or API-Context pairs in short. We observed that
most issue-triggering contexts can be expressed in the following context-free grammar.

Context → Condition | Context ∧ Condition

Condition → API Level cond | Device cond | API usage

In the grammar, an issue-triggering context comprises three kinds of conditions: API-level, device, and API usage. An issue-triggering context can be either of
these conditions or conjunctions of these conditions. The formulation is designed to
cover the various types of issue-triggering contexts discussed in Section 3.2.2. More
specifically, an API-level condition refers to an API level or SDK version. A device
condition contains the information of a device’s brand, model, hardware composition
and so on. API usage describes how the concerned API is used, comprising the
information of arguments, calling context and so on. As a result, an issue-triggering
context gives the set of conditions under which invoking an issue-inducing API will
cause compatibility issues.

Let us illustrate this using an example extracted from two issues that affected
AnkiDroid (discussed in pull requests 128 and 130 [16]).

API: SQLiteDatabase.disableWriteAheadLogging(),
Context: API level < 16 ∧ Dev model != “Nook HD”

This simple example models the issue that invoking the API SQLiteDatabase.dis-
ableWriteAheadLogging causes the app to crash on devices other than “Nook HD”
running Android systems with an API level lower than 16 [16]. In this example,
API level < 16 is an API-level condition and Dev model != “Nook HD” is a device
condition. Such an API-context pair can be extracted from apps that have fixed the
compatibility issue, and used for detecting its occurrences in other apps.
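
To make the model concrete, the sketch below shows one possible way to encode an API-context pair as a plain Java value object, using the AnkiDroid pair above as an example. The class and field names (ApiContextPair, ApiLevelCond, DeviceCond) are hypothetical illustrations introduced here for exposition; they are not FicFinder's actual data structures.

    import java.util.Arrays;
    import java.util.List;

    // A minimal, hypothetical encoding of an API-context pair: an issue-inducing API
    // paired with a conjunction of conditions that forms the issue-triggering context.
    public class ApiContextPair {
        final String apiSignature;     // signature of the issue-inducing API
        final List<Condition> context; // conjunction of conditions (issue-triggering context)

        ApiContextPair(String apiSignature, List<Condition> context) {
            this.apiSignature = apiSignature;
            this.context = context;
        }

        interface Condition { }

        // API-level condition, e.g. "API level < 16".
        static class ApiLevelCond implements Condition {
            final String comparator; // e.g. "<"
            final int apiLevel;      // e.g. 16
            ApiLevelCond(String comparator, int apiLevel) {
                this.comparator = comparator;
                this.apiLevel = apiLevel;
            }
        }

        // Device condition, e.g. "Dev model != \"Nook HD\"".
        static class DeviceCond implements Condition {
            final String buildField; // e.g. "MODEL" or "MANUFACTURER"
            final boolean negated;   // true encodes "!="
            final String value;      // e.g. "Nook HD"
            DeviceCond(String buildField, boolean negated, String value) {
                this.buildField = buildField;
                this.negated = negated;
                this.value = value;
            }
        }

        // The AnkiDroid example: disableWriteAheadLogging() under
        // "API level < 16 AND Dev model != 'Nook HD'".
        static final ApiContextPair ANKIDROID_EXAMPLE = new ApiContextPair(
                "android.database.sqlite.SQLiteDatabase.disableWriteAheadLogging()",
                Arrays.asList(new ApiLevelCond("<", 16),
                              new DeviceCond("MODEL", true, "Nook HD")));
    }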

4.1.2 Compatibility Issue Detection

API-context pair extraction. In our study, 188 of 220 fixed compatibility issues can be modeled as API-context pairs. In this work, we selected 20 API-context
pairs that satisfy three criteria for tool implementation and experiments: (1) they are
statically checkable in the sense that checking the issue-triggering context does not
require information that can only be acquired at runtime, (2) they affect popularly
used Android versions (see Table 2.1), and (3) they model non-device-specific issues.
The first two criteria allow us to focus on FIC issues that can be statically checked
and are likely of interest to developers. There are issues in our dataset that only
affect apps running on old device models with an API level lower than 11. Such
issues may affect few users and detecting them may not be useful to developers.
Specifically, we only include those API-context pairs that can affect devices with API
level higher than 16. We set the third criterion that excludes device-specific issues in
our current experiments due to two reasons. First, unlike non-device-specific issues whose triggering contexts can be validated by Android documentation or information
from Android issue tracking systems, the issue-triggering contexts of device-specific
issues extracted from our dataset may not be entirely accurate and complete. Thus,
issue detection based on such extracted API-context pairs may generate false alarms
or lead to false negatives. Second, we have limited access to the various device
models to verify device-specific issues if they are detected in our experiments. In our
future work, we plan to leverage the knowledge crowd-sourced from a large number
of mature Android apps to determine issue-triggering contexts for device-specific
issues.

FicFinder algorithm. Algorithm 4.1 describes the issue detection algorithm of FicFinder. It takes two inputs: (1) an Android app (Java bytecode or .apk
file), and (2) a list of API-context pairs. It outputs a set of detected FIC issues
and provides information to help developers debug and fix these issues. Intuitively,
FicFinder analyzes the calling contexts of issue-inducing APIs in an app, and
compares them with the modeled issue-triggering contexts of these APIs. More
specifically, FicFinder collects the preconditions of calling the issue-inducing APIs
and analyzes whether the conditions defined in the issue-triggering contexts are
considered.

The algorithm works as follows. For each API-context pair acPair, it first searches
for the call sites of the issue-inducing API of concern (Line 2). For each call site
callsite, it then performs two checks (Line 4): (1) it checks whether the usage of the API
matches the problematic usage defined in acPair, if there is a condition regarding
API usage, and (2) it checks whether the supported SDK version configuration of the
app matches the problematic configuration defined in acPair, if there is a condition
regarding API level. The checks for API usage are specific to the issue-triggering
context CTX defined in each acPair and need to be specifically implemented. SDK
version configurations are checked against the minimum SDK version (or API level)
that is supported by the app of concern. For instance, for an API requiring a minimum
SDK version, the algorithm will check the minimum SDK version specified in the
app’s AndroidManifest.xml file, a mandatory configuration file of an Android app.
If it is newer than the required SDK version, invocations of this API will not be
regarded as issues. By these two checks, FicFinder can filter out call sites that
do not match the issue-triggering context at early stage. If both checks pass, the
algorithm proceeds to check device or API-level related conditions defined in acPair
in the API calling contexts. These checks require analyzing the statements that the
issue-inducing API invocation statement transitively depends on. For this purpose,


Algorithm 4.1: Detecting FIC Issues in an Android App

Input : An Android app under analysis, a list of API-context pairs acPairs
Output: Detected compatibility issues
1  foreach acPair in acPairs do
2      callsites ← GetCallsites(acPair.API)
3      foreach callsite in callsites do
4          if MatchAPIUsage(callsite, acPair.CTX) and MatchSWConfig(callsite, acPair.CTX) then
5              slice ← BackwardSlicing(callsite)
6              issueHandled ← false
7              foreach stmt in slice do
8                  if stmt checks acPair.CTX then
9                      issueHandled ← true
10             if issueHandled == false then
11                 report a warning and debugging info

the algorithm performs an inter-procedural backward slicing on callsite and obtains a slice of statements of interest (Line 5). More specifically, the backward slicing is
performed based on the program dependence graph [113] and call graph. Initially,
only the API invocation statement is in the slice. The algorithm then traverses the
program dependence graph and call graph, adding statements into the slice. One
statement will be added if there exists any statement in the slice that depends on
it. This process repeats until the size of the slice converges. Next, the algorithm
iterates over all statements in slice (Lines 6–9). If any statement in slice checks
the software and hardware environment related conditions defined in acPair, the
algorithm conservatively considers that the app developers have already handled the
potential compatibility issues (Lines 8–9). Otherwise, the algorithm warns developers
that their app may have a compatibility issue and outputs information to help them
debug and fix the issues (Lines 10–11). Such information is actionable and includes
the call sites of issue-inducing APIs and issue-triggering contexts.
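
As a concrete illustration of what the algorithm checks, the hedged sketch below contrasts an unguarded call site of setStatusBarColor (an API introduced at API level 21, as discussed later in Section 4.2.1) with a guarded one. FicFinder would likely report a warning for the former (assuming the app's minimum SDK version is below 21), while the SDK_INT check in the latter would appear in the backward slice and be treated as handling the issue. The surrounding activity class is invented purely for illustration.

    import android.app.Activity;
    import android.graphics.Color;
    import android.os.Build;
    import android.os.Bundle;

    public class ExampleActivity extends Activity {
        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);

            // Unguarded call site: setStatusBarColor() only exists from API level 21,
            // so this call crashes on older devices and would be reported as a warning.
            getWindow().setStatusBarColor(Color.BLACK);

            // Guarded call site: the SDK_INT check is found in the backward slice of the
            // call, so the algorithm conservatively assumes the issue is handled.
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP) {
                getWindow().setStatusBarColor(Color.BLACK);
            }
        }
    }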

Implementation. We implemented FicFinder on top of the Soot analysis framework [130]. FicFinder leveraged Soot’s program dependence graph and call
graph APIs to obtain the intra- and inter-procedural slices. In general, the soundness
of slicing results depends on the accuracy of the statically-constructed call graphs
and program dependence graphs. A limitation of Soot is that it does not capture
the relationship between event handlers. As a result, the graphs it constructs for
event handler analysis can be imprecise, inducing false positives to the detection of
FIC issues. Fortunately, this limitation does not significantly affect FicFinder’s
performance as it is uncommon to have an invocation of an issue-inducing API and its
context checks implemented in different handlers. Our evaluation results show
that FicFinder reports few false warnings.
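
For readers unfamiliar with Soot, the sketch below shows one typical way to configure Soot to load an .apk and obtain a whole-program call graph. It is only a minimal illustration under common default settings, not FicFinder's actual configuration; in practice, analyses of Android apps usually also require explicit entry-point modeling (e.g., via FlowDroid), because apps have no main method.

    import java.util.Collections;

    import soot.PackManager;
    import soot.Scene;
    import soot.jimple.toolkits.callgraph.CallGraph;
    import soot.options.Options;

    public class SootSetupSketch {
        public static CallGraph buildCallGraph(String apkPath, String androidJarsDir) {
            // Read the app from its .apk file (Dalvik bytecode) using the Android platform jars.
            Options.v().set_src_prec(Options.src_prec_apk);
            Options.v().set_process_dir(Collections.singletonList(apkPath));
            Options.v().set_android_jars(androidJarsDir);
            Options.v().set_allow_phantom_refs(true); // tolerate classes missing from the classpath
            Options.v().set_whole_program(true);      // enable inter-procedural (whole-program) phases
            Options.v().set_output_format(Options.output_format_none);

            Scene.v().loadNecessaryClasses();
            PackManager.v().runPacks();               // runs call-graph construction, among other packs

            return Scene.v().getCallGraph();
        }
    }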

4.2 Evaluation

In this section, we evaluate FicFinder with real-world Android apps. The FicFinder tool and the evaluation data are available at our project website [30]. Our evaluation
aims to answer the following two research questions:

• RQ6: (Issue detection effectiveness): Can FicFinder, which is built with the
20 API-context pairs extracted from our collected FIC issues, help detect unknown
FIC issues in real-world Android apps?

• RQ7: (Usefulness of FicFinder): How do app developers view the FIC issues
detected by FicFinder? Can FicFinder provide useful information for app
developers to facilitate the FIC issue diagnosis and fixing process?

To answer the two research questions, we conducted experiments on 53 actively-maintained open-source Android apps. The apps were selected from the F-Droid
database [28] by the following steps:

1. We first crawled from F-Droid all the 1,258 apps, of which the source code
repositories are hosted on GitHub.

2. Among the 1,258 apps, we then selected all the 750 apps that invoke at least
one of the issue-inducing APIs in our API-context pair list by scanning the
.apk files crawled from F-Droid. We performed this step as FicFinder can
only report FIC issues with regard to our known API-context pairs. Inclusion
of subjects that do not even invoke the APIs of concern will not return any
results and is not meaningful for our FicFinder evaluation.

3. We further selected popular and actively-maintained Android apps from the
750 candidates according to their statistics on GitHub. Specifically, we only
kept those apps that had at least 70 stars and whose last commit was no earlier
than three months before our subject collection.
With such selection criteria, we filtered out 654 (87.2%) of the 750 candidates
and 96 apps remained.


Table 4.1: Experimental Subjects and Checking Results

ID App Name Category Latest Revision KLOC # Stars FicFinder Results (TP FP GP) Lint Results (TP FP) Issue ID(s)
1 AFWall+ [7] Tools 0c0dde3 21.1 925 0 0 0 0 0 –
2 And Bible [9] Books & Reference eca13b2 26.5 144 1 0 0 0 0 100
3 AndStatus [15] Social edef64e 50.0 133 2 1 0 0 0 473a
4 AnkiDroid [16] Education 1c80630 50.8 953 0 0 4 0 2 –
5 AntennaPod [17] Video Players & Editors 657a91b 47.1 1629 2 0 0 0 0 2452c
6 Diode [26] News & Magazines fca6b41 13.4 122 32 0 0 0 0 72a , 73a
7 DuckDuckGo [27] Books & Reference 75e54cf3 15.0 438 1 0 2 0 0 311
8 Forecastie [31] Weather e1c2f20 3.0 275 8 0 4 0 0 246b
9 Gadgetbridge [32] – f03d539 39.4 1545 1 0 1 0 0 849a
10 Glucosio [34] Medical d43eb2a 36.5 269 2 0 0 0 0 369b
11 Good Weather [35] Weather c23a553 4.5 71 0 0 2 0 0 –
12 iFixit [39] Books & Reference ffc0e5d 19.4 129 7 0 0 0 0 279b
13 K-9 Mail [42] Communication dcb8587 93.7 3280 1 0 2 0 1 1237c
14 KISS [43] Personalization c2bf46b 6.4 658 0 0 1 0 0 –
15 LeafPic [44] Photography d36c761 11.3 1,875 0 0 6 0 0 –
16 LibreTorrent [46] Video Players & Editors 112025e 20.3 420 0 0 6 0 0 –
17 Materialistic Hacker News [47] News & Magazines d886b78 25.1 1371 0 2 1 0 1 –
18 MiMangaNu [48] – da6efdf 23.9 111 0 0 5 0 0 –
19 MinTube [49] – 4cb9c66 2.5 203 2 0 0 0 0 30
20 MozStumbler [50] Tools a0321a9 24.1 523 1 0 1 0 0 1825
21 Muzei [51] Personalization 3b15476 21.3 3293 1 0 0 0 0 492c
22 OctoDroid [53] Productivity bdf74bb 42.0 811 0 0 0 0 0 –
23 Omni-Notes [54] – 564b9c4 14.2 1134 3 0 2 0 0 395a , 396c
24 Open Food [56] Health & Fitness 0021ba8 7.1 74 0 0 0 0 0 –
25 OpenLauncher [57] Personalization 0cacbf6 14.0 248 2 0 2 0 0 273a
26 OpenTasks [59] Productivity 6d5db7b 24.0 443 25 0 4 0 0 467b , 468b
27 OsmAnd [60] Travel & Local accc8dd 204 1200 10 0 3 0 0 4636b , 4642a , 4671b
28 ownCloud [61] Productivity c2bbfc0 40.0 2217 1 0 3 1 0 2052b
29 Pedometer [62] Health & Fitness 7a277f5 2.4 865 2 0 0 0 0 111c
30 QKSMS [65] Communication bfbb7bf 56.3 1574 2 2 1 0 2 742b
31 RasPi Check [66] Tools cb87901 7.3 115 2 0 0 0 0 151b
32 Red Moon [67] Tools 8ae707c 4.1 254 0 0 0 0 0 –
33 RedReader [68] News & Magazines 0e6a37d 30.4 751 1 0 0 0 0 539a
34 Riot.im [69] Communication 5bdd9c4 47.0 427 2 0 1 0 0 1697a
35 Save For Offline [70] – b6e53eb 2.2 70 1 0 0 1 0 49b
36 SimpleTask [72] Productivity 89ac740 4.3 261 0 2 0 0 0 –
37 Slide for Reddit [73] News & Magazines 54fa096 72.2 755 8 2 14 0 0 2574c , 2575, 2576c
38 SoundWaves [75] Music & Audio 68b18ee 168.9 93 3 0 2 2 0 156, 158c
39 Stocks Widget [78] Financial 5ecb354 6.2 100 1 0 0 0 0 42b
40 StreetComplete [79] Travel & Local e2e71c4 24.4 695 0 0 0 0 0 –
41 Subsonic [80] Music & Audio 9f89e7e 37.8 231 0 0 1 0 0 –
42 SurvivalManual [81] Books & Reference a513685 1.5 259 0 0 0 0 0 –
43 Timber [83] Music & Audio 19c8e96 19.5 3492 1 3 0 0 0 330
44 Tinfoil Facebook [84] Social 8850f06 1.7 221 1 0 0 0 0 116
45 Transdroid [85] Tools f913c6a 23.5 541 0 1 0 0 0 –
46 Twidere [86] Social f379e6e 35.0 1413 1 0 1 0 1 974b
47 Wallabag [90] Productivity f01a162 12.3 176 – – – 0 0 –
48 Walleth [91] Financial 2af910d 5.7 74 0 0 0 0 0 –
49 Web Opac App [92] Education 0952761 14.5 72 0 0 3 0 0 –
50 WeeChat [93] Communication fcf9439 8.3 308 0 0 2 0 0 –
51 Wikimedia Commons [95] Photography fece67e 10.6 107 0 0 0 0 0 –
52 Wikipedia [96] Books & Reference 6868c89 58.2 542 0 0 4 0 0 –
53 Yaxim [98] Communication 096659f 7.9 83 2 0 0 0 0 203c
Total – – – – 129 13 78 4 7 –
“–” means not applicable. Superscript “a” means the issues were acknowledged and developers agreed to fix in future. “b” means the issues have already been fixed.
“c” means the issues do not seriously affect the app and developers chose not to fix them.


4. As we need to manually verify the experiment results for each app, which
is labor-intensive, we further filtered the apps by their categories. We first
identified the Google Play store page for each app and collected the app’s
category information to classify the 96 apps into different categories. If a
category contains no more than five apps, we selected all the apps in the
category as our experimental subjects. If a category contains more than five
apps, we only selected five of them as experimental subjects based on the
number of stars their projects received on GitHub. To select representative
subjects, we sorted the apps in the category according to the number of stars
and selected the first app (i.e., the one that received the largest number of
stars), the first quartile, the median, the third quartile and the last app in the
sorted list. For apps whose corresponding Google Play store pages we could not
identify, we grouped them into one special category for subject selection.

As a result, we collected 53 open-source Android apps. These 53 Android apps are not a superset of the 27 candidate apps mentioned in Section 3.1.1. This is because these two datasets were collected at different times and we adopted more robust subject selection criteria in the evaluation. For example, some of the 27 candidate apps
selected earlier in the conference version were not included in the 53 Android apps
because they are no longer actively maintained. Two of the 53 apps (i.e., K-9 Mail
and AnkiDroid) overlap with the subjects in our empirical study. However, when we
conducted experiments, we chose the latest version of the two apps. The versions
from which we collected FIC issues for our empirical study are much older. Table 4.1
gives the demographics of these apps, which include: (1) app name, (2) category, (3)
the revision used in our experiments, (4) lines of code, and (5) the number of stars on
GitHub. As we can see from the table, these apps are diverse (covering 10 different
app categories), non-trivial (containing thousands of lines of code), and popular
(received tens to thousands of stars). For the experiments, we applied FicFinder
to the latest version of these apps for the detection of FIC issues. The experiments
were conducted on a Dell PowerEdge R520 Linux server running CentOS 7.4 with 2x
Intel Xeon E5-2450 Octa Core CPU @2.1GHz and 192 GB RAM.

It is also worth mentioning that in this thesis version, we leveraged 20 API-context pairs to conduct experiments while we leveraged 25 API-context pairs in our
previous conference version. The numbers are different because FIC issues evolve
with the evolution of the Android ecosystem. In this version, we removed some
API-context pairs because they capture issues that can only be triggered on old
devices. For example, ten API-context pairs used in our conference version can

49
Section 4.2 Lili Wei

induce issues on device models running API levels lower than 15. According to the
statistics released by Google [24], such devices rarely exist in practice now. As a
result, these API-context pairs were removed and replaced by new ones derived from
newly collected FIC issues.

4.2.1 RQ6: Issue Detection Effectiveness

To evaluate FicFinder’s performance, we applied it to the latest version of the 53 app subjects. We configured FicFinder to report: (1) the call sites of the
issue-inducing APIs that can cause compatibility issues (warnings) and (2) the call
sites of the issue-inducing APIs where developers have already implemented guarding
conditions to avoid potential compatibility issues (good practices). FicFinder
reported a total of 142 warnings after analyzing the 53 apps. For validation, my
collaborator and I manually checked all the warnings and the source code of the
corresponding apps to categorize the warnings into true positives and false positives.
A warning was categorized as a true positive if we both agreed that the issue-inducing
API can be invoked under its corresponding issue-triggering context. During the
independent checking process, we only had disagreements on four warnings, which
we further discussed and resolved by consensus. The results are reported as “TP” (true
positives) and “FP” (false positives) in Table 4.1. For good practices, we also report
them as “GP” in the table.

Precision. As shown in Table 4.1, 129 of the 142 warnings reported by FicFinder are true positives (i.e., the precision is 91.5%). The true positives
were detected by nine different API-context pairs. The number of distinct API-
context pairs is relatively small compared with the number of warnings detected
by them. This indicates that some commonly used issue-inducing APIs were not
well handled by app developers. It should be noted that the chance of detecting FIC
issues in the wild differs by nature across API-context pairs. If the API of an
API-context pair is less commonly used, this API-context pair will be less likely to
detect FIC issues in the wild. Besides, the issues corresponding to some API-context
pairs could be more easily noticed and fixed by developers at an early development
stage while others could be difficult to notice. For example, FIC issues caused
by invoking APIs on devices with API levels where the APIs are not available can
be more easily identified as the app will crash at the API invocation point. On the
other hand, FIC issues caused by subtle API behavior changes may not be quickly
noticed by the developers as it may require specific test oracles. In practice, if FIC
issues can be easily identified, it is more likely that they have already been fixed
by the developers, and thus fewer issues could be detected by the corresponding
API-context pairs.

Causes of imprecision. For the 13 false positives, we manually inspected them and identified two major causes of the imprecision. First, most false positives were
caused by the incomplete call graphs and program dependence graphs constructed
by Soot, which led to missing dependencies in the slices of issue-inducing APIs’ call
sites. For example, recovering data dependencies involving database operations is a
well-recognized challenge in static analysis. Such dependencies cannot be recovered
by Soot. The false positives in QKSMS arose due to this reason. The false positives
were reported at the call sites of the API setStatusBarColor, which was introduced
at API level 21. Invoking this API on devices with API levels lower than 21 will cause
app crashes. The two call sites of this issue-inducing API in QKSMS are dependent
on a flag value that is read from the app’s database. By carefully investigating
the app code, we figured out that the flag value is dependent on the app’s runtime
API level and the API setStatusBarColor in fact cannot be invoked if the app is
running on an API level lower than 21. FicFinder generated false positives in such
circumstances. Second, some implicit calling contexts in the programs were missed
by FicFinder. For example, the issue-inducing APIs in the app Materialistic Hacker
News were transitively invoked by a system event handler, which was introduced
together with the APIs. In such cases, the APIs will never be invoked on a device with
an improper API level. Without any knowledge of such event handlers’ specification,
FicFinder reported false warnings at the call sites of the issue-inducing APIs.
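
To make the first cause concrete, the hedged sketch below mirrors the QKSMS pattern described above: the API-level guard is encoded as a flag persisted in the app's storage (a database in QKSMS; SharedPreferences in this simplified sketch), so the data dependency between the check and the call site passes through persistent storage and is invisible to the static slice. The class and preference names are invented for illustration.

    import android.app.Activity;
    import android.content.SharedPreferences;
    import android.graphics.Color;
    import android.os.Build;

    public class TintedStatusBarHelper {
        private static final String KEY_TINTED_STATUS_BAR = "tinted_status_bar";

        // Written elsewhere in the app: the flag can only become true on API level >= 21.
        public static void initFlag(SharedPreferences prefs) {
            boolean supported = Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP;
            prefs.edit().putBoolean(KEY_TINTED_STATUS_BAR, supported).apply();
        }

        // Read back later: the SDK_INT check is hidden behind the persisted flag, so a static
        // backward slice from setStatusBarColor() sees no API-level guard and a warning is reported.
        public static void applyTint(Activity activity, SharedPreferences prefs) {
            if (prefs.getBoolean(KEY_TINTED_STATUS_BAR, false)) {
                activity.getWindow().setStatusBarColor(Color.BLACK);
            }
        }
    }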

Besides the warnings, FicFinder also found 78 good practices in 27 out of the
53 apps analyzed, covering 10 distinct API-context pairs. This confirms that FIC
issues are common in Android apps and developers often fix such issues to improve
the compatibility of their apps.

One may also observe from Table 4.1 that there are several apps, for which
FicFinder did not report any TP, FP or GP. As discussed earlier, in the experiments,
we considered the APIs in the 20 API-context pairs as issue-inducing and scanned
the .apk files to make sure that the concerned APIs are used in our app subjects.
Therefore, it is expected that FicFinder would report either warnings or good
practices. However, it is not the case for several apps. We looked into this unexpected
phenomenon and found out the reasons. First, the .apk files on F-Droid are release
versions of our app subjects, which sometimes are old snapshots of the apps, but our
experiments were conducted on the latest development versions of the apps available on GitHub. Some issue-inducing APIs are no longer used in our experimented
versions. Second, the .apk files on F-Droid contain different third-party libraries that
invoke issue-inducing APIs in our API-context pair list. However, in our evaluation,
we only investigated FIC issues in the application code and thus, FicFinder reported
nothing for apps with issue-inducing APIs that are only used in third-party libraries.
In total, there are only five (out of 53) subjects, for which FicFinder did not report
results due to the two reasons. This shows that our subject selection process, which
analyzed API usage of candidate subjects by scanning their .apk files, introduced
little noise in the final set of experimental subjects. Finally, there is also one app,
Wallabag [90], for which FicFinder (in fact Soot) crashed when parsing its
bytecode.

Recall. FicFinder adopts conservative strategies to match issue-triggering contexts. As long as an app under analysis considers any conditions in the issue-
triggering contexts of an issue-inducing API, FicFinder will not report warnings
(Lines 8–9 of Algorithm 4.1). Such strategies help ensure high precision, which is
a desirable property for automated bug detection tools [127]. However, they may
filter out problematic call sites of issue-inducing APIs, leading to false negatives. To
evaluate the recall of FicFinder, we conducted another experiment. Because it is
difficult to identify all FIC issues in the latest version of our app subjects (i.e., no
ground truth exists), we collected fixed FIC issues related to the issue-inducing APIs
in our API-context pair list in the 53 apps’ code repository for the experiment. We
excluded the FIC issues in our empirical study dataset to avoid the potential threats
of overfitting problems in the evaluation results (i.e., the threat posed by evaluating
the tool with the issues, based on which FicFinder is designed and API-context
pairs are derived). For issue collection, we used keywords including the method
names of the APIs and SDK INT to search the code diffs of the revisions committed
in the past three years in each app’s code repository. We then manually checked the
search results to identify the revisions that fixed FIC issues. As a result, we found 12
revisions that fixed 21 FIC issues concerning APIs in our API-context pair list (the
data are available on our project website [30]). We then applied FicFinder to the
buggy versions (i.e., the versions that were modified by the issue-fixing revisions) of
the corresponding apps to compute the recall. We found that FicFinder detected
15 of the 21 issues, achieving a recall of 71.4%. Comparing with the high precision
that FicFinder achieved (91.5%), the recall is lower. As expected, all the issues
were missed due to the conservative strategies adopted by FicFinder to match the
issue-triggering context (Line 4 in Algorithm 4.1) and the over-approximation when
computing the dependencies in the slice (Line 5 in Algorithm 4.1). For example, one issue in SoundWaves [75] was missed by FicFinder due to its over-approximation of
the API call site dependencies. The issue was induced by the API setBackground,
which should not be invoked when API level is lower than 16. The slice of the API
call site included a conditional statement checking that the API level should not be
lower than 16. However, the conditional statement was not a guarding condition of
the API call site. It was included in the slice because FicFinder conservatively
included all statements that the API call site may transitively depend on. As a result,
FicFinder considers that the issue has been handled (Lines 8–9 of Algorithm 4.1)
and chooses not to report it.
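
The hedged sketch below illustrates this false-negative pattern: an API-level check is present in the backward slice of the setBackground call (because the call transitively depends on a value computed under the conditional), yet it does not actually guard the call site. The surrounding code is invented for illustration and is not the actual SoundWaves code.

    import android.graphics.Color;
    import android.graphics.drawable.ColorDrawable;
    import android.os.Build;
    import android.view.View;

    public class BackgroundHelper {
        public static void applyBackground(View view) {
            int color = Color.LTGRAY;
            // The API-level check only influences the color value; it does not guard the call below.
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.JELLY_BEAN) {
                color = Color.WHITE;
            }
            // setBackground(Drawable) was introduced at API level 16, so this call can still crash
            // on older devices. Because the check above appears in the backward slice of the call
            // (via the 'color' variable), the conservative matching treats the issue as handled.
            view.setBackground(new ColorDrawable(color));
        }
    }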

Comparison with Lint. Lint is a widely-used static analyzer for Android apps. It is the default code analysis tool in Android Studio and supports detecting
some non-device-specific FIC issues that are caused by Android API evolutions,
including API introduction, API deprecation, and API removal.¹ To evaluate the
effectiveness of FicFinder, we also compared it with Lint.

We applied Lint to the 53 app subjects and manually analyzed the warnings
generated by the checkers that are relevant to the APIs in our 20 API-context
pairs. We categorized the warnings as true positives (TPs) and false positives
(FPs), leveraging the same criteria that we adopted when inspecting those warnings
generated by FicFinder. Column “Lint Results” in Table 4.1 shows the evaluation
results of Lint. In total, Lint reported 11 warnings, four of which are true positives,
i.e., the precision is 36.4%. In comparison, FicFinder reported 129 true positives
and achieved 91.5% precision. All the four true positives reported by Lint were also
reported by FicFinder, but Lint missed 125 true positives reported by FicFinder,
among which, 100 have already been acknowledged or fixed by the app developers.
This shows that FicFinder is more effective at detecting FIC issues than Lint.

We further inspected the true positives missed by Lint and figured out the major
reason. While Lint supports detecting FIC issues caused by the use of deprecated
and removed APIs, it does not support detecting subtle issues that are induced by the
improper use of those APIs whose behaviors have significantly evolved. For example,
Forecastie issue #246 was induced by the API AsyncTask.execute, whose behavior
changed at API level 11 (the tasks scheduled via this API will run sequentially on a
single background thread rather than a pool of threads on devices with API level 11
or higher). Lint could not detect such common issues. We also studied the false
positives reported by Lint and found the main reason for Lint’s low precision. The
¹ To the best of our knowledge, Lint is the only widely-used static analysis tool that supports compatibility checking for Android apps.


1. + if (Build.VERSION.SDK_INT < Build.VERSION_CODES.LOLLIPOP) {
2.       CookieSyncManager.createInstance(this);
3. + }

Figure 4.1: Patch for Twidere #974

false positives are reported at the call sites of the issue-inducing APIs that have
already been properly guarded by conditional statements, which prevents the APIs’
invocations at certain API levels. In all these cases, the guarding condition and
the API call sites are located in different methods. Identifying the dependencies
between the call sites and the context checks requires inter-procedural analysis that
is capable of handling long chains of method calls. Lint, which aims to provide
real-time feedback to developers, could not capture such complex inter-procedural
dependencies and therefore reported false positives. These findings may explain
why some of our survey respondents mentioned that Lint can only help expose
certain non-device-specific issues and often generates a large number of false positives
(see the “testing frameworks and tools” part of Section 4.4).
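
Returning to the Forecastie example above, the hedged sketch below shows the behavior change behind that issue and one commonly used workaround: since API level 11, AsyncTask.execute() schedules tasks on a single serial executor, and apps that rely on parallel execution can request it explicitly via executeOnExecutor(). This is a generic illustration, not the actual Forecastie fix.

    import android.os.AsyncTask;
    import android.os.Build;

    public class TaskLauncher {
        public static void launch(AsyncTask<Void, Void, Void> task) {
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.HONEYCOMB) {
                // From API level 11, execute() runs tasks one at a time on a serial executor,
                // so parallelism must be requested explicitly.
                task.executeOnExecutor(AsyncTask.THREAD_POOL_EXECUTOR);
            } else {
                // Before API level 11, execute() already used a pool of background threads.
                task.execute();
            }
        }
    }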

While FicFinder can detect more FIC issues induced by the APIs in our API-
context pair list and achieved high precision, it could miss valid FIC issues that
can be detected by Lint since its current version only supports a small number
of API-context pairs manually extracted from our empirical study. This can be
improved by automating the API-context pair extraction process and incorporating
Lint’s checking rules, which we plan to explore in our future work.

Answer to RQ6: The API-context pairs extracted from existing FIC issues can
help FicFinder effectively detect unknown FIC issues in real-world Android
apps. FicFinder achieved high precision and acceptable recall, and outperformed
the widely-used tool Lint.

4.2.2 RQ7: Usefulness of FicFinder


We reported the 129 true warnings detected by FicFinder to the corresponding
app developers for confirmation. In total, we submitted 39 issue reports, whose IDs
are provided in Table 4.1. Each issue report includes all the true warnings related to
an issue-inducing API. In the issue reports, we also provided issue-related discussions
from online forums, API guides, and possible fixes to help developers diagnose the
issues. So far, 22 of the 39 reports have been acknowledged by developers. Among
the 22 acknowledged ones, we observed that: (1) the 59 warnings reported in 13
reports were quickly fixed by app developers (marked with “b ” in Table 4.1) and (2)
the warnings reported in the remaining ten reports will be fixed according to the
developers’ feedback or have already been assigned to certain developers (marked with
“a ” in Table 4.1). Other than the 22 acknowledged ones, eight of our 39 submitted
issue reports are still pending. There are also nine issue reports that were closed or
labeled as “won’t fix” by developers because the reported issues do not cause serious
consequences to the apps (marked with “c ” in Table 4.1).

In our previous conference version [163], FicFinder, which was implemented with a different set of API-context pairs at that time, also reported 46 true warnings
after analyzing 27 real-world Android apps. We submitted 14 issue reports to the
corresponding app developers. Eight of our issue reports were acknowledged and the
26 warnings reported in five issue reports have already been fixed. Accumulatively,
our FicFinder has detected 175 (= 129 + 46) true warnings in real-world Android
apps and successfully helped app developers fix 85 (= 59 + 26) warnings.

Besides generating warnings, our tool can also provide information including
issue related discussions, API guides, and possible fixes to help developers diagnose
and fix FIC issues. Such information was all collected from our empirical study.
In the output of FicFinder, it provides links to online discussions or API guides.
FicFinder can also suggest possible fixes because the patches for FIC issues are
small and demonstrate common patterns. As discussed in Section 4.5, the majority
of our studied issues were patched by explicitly checking device context or API
levels before invoking issue-inducing APIs. The patches suggested by FicFinder
also follow this pattern. To study whether such provided information is helpful, we
included them in our submitted issue reports. We learnt developers’ opinions on FIC
issues by communicating with them and made several interesting findings.

First, for the warnings mentioned in 12 issue reports, developers adopted our
suggested fixes. Developers of ownCloud [61] and Stocks Widget [78] asked for pull
requests and have merged our submitted ones after careful code reviews. Other
developers fixed the issues as we suggested. For example, in Twidere #974, we
reported an issue caused by invoking an API defined in the class CookieSyncManager,
which was designed to synchronize the browser cookie store between RAM and
permanent storage. The class has been deprecated since API level 21 and another
class, CookieManager, can provide full functionality to manage the cookies since
then. We suggested Twidere developers that they should avoid invoking this API
from API level 21. Figure 4.1 shows the patch adopted by the app developers for
this issue. The issue was fixed by adding a guarding condition to check the API level
of the running device and only invoking the API when the API level is lower than 21.
Such results show that the issue diagnosis information and fix suggestions provided
by FicFinder are useful to developers. They also indicate that the knowledge and
developers’ practices learned from our studied FIC issues can be transferable across
apps.

Second, while our issue reports help developers identify potential FIC
issues in their apps, whether to fix such issues depends on the apps’ functionality
and purpose. For issues mentioned in nine of our issue reports, developers chose not
to fix them. For instance, we reported to K-9 Mail developers that the behavior of
the AlarmManager.set() API has changed since API level 19: using this API on
devices with a higher API level does not guarantee the scheduled task to be executed
exactly at the specified time. However, the developers considered that such API
behavior change would not significantly affect their app’s main functionality (i.e.,
their scheduled tasks do not necessarily need to be executed precisely at a specific
time) and did not proceed to fix the issues, but linked our issue report to two existing
unresolved bugs. Other issues that are closed without fixes are due to similar reasons.
On the other hand, there are also apps whose functionality could be affected by this
issue. The app Stocks Widget, which is a home widget that displays stock price
quotes, is a typical example. AlarmManager is used to schedule timely updates for
stock price. The developers confirmed our issue report and quickly fixed the reported
issue. From the examples and comparison, we can see that it is difficult to determine
whether a FIC issue can affect the app’s functionality or not by static analysis alone.
To address this problem, we plan to leverage dynamic analysis to further validate
the impact of FIC issues in our future study.
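
As a hedged illustration of the trade-off discussed above, the sketch below shows the kind of change an app like Stocks Widget could make when exact timing matters: since API level 19, AlarmManager.set() is treated as inexact, and setExact() must be used to keep precise scheduling. The code is illustrative only and is not the actual Stocks Widget patch.

    import android.app.AlarmManager;
    import android.app.PendingIntent;
    import android.os.Build;

    public class AlarmScheduler {
        // Schedule a one-shot alarm at an exact wall-clock time.
        public static void scheduleExact(AlarmManager alarmManager, long triggerAtMillis,
                                         PendingIntent operation) {
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.KITKAT) {
                // Since API level 19, set() may be deferred by the system; setExact() preserves
                // the precise trigger time for apps that actually need it.
                alarmManager.setExact(AlarmManager.RTC_WAKEUP, triggerAtMillis, operation);
            } else {
                alarmManager.set(AlarmManager.RTC_WAKEUP, triggerAtMillis, operation);
            }
        }
    }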

Answer to RQ7: FicFinder can provide useful information to help developers diagnose and fix FIC issues. Future research efforts can be made on addressing
the challenges of manifesting FIC issues and determining the impact of the
identified FIC issues on app functionality.

4.3 Discussions

4.3.1 Threats To Validity

Open-source subjects. In our empirical study and evaluation, we only leveraged open-source Android apps as our subjects. This can pose a threat that some
of our findings may not generalize to commercial apps. We mitigated this threat
by conducting surveys and interviews with developers of commercial Android apps.
As discussed in Section 3.2, the survey and interview findings are consistent with
our empirical study findings. In addition, our proposed tool FicFinder can also
be applied to commercial apps. Our interviewees and survey respondents showed
interest in the tool. The tool also helped one of our interviewees find a real issue in
the product of his team during our interview.

Subject selection. The validity of our empirical study results may be subject
to the threat that we only selected five open-source Android apps for the study.
However, these five apps were selected from 27 candidate apps as they contain the most
FIC issues. The apps are also diverse, covering four different categories. More
importantly, we observed that the findings obtained from studying the 220 real FIC
issues in them are useful and helped locate previously-unknown FIC issues in many
other apps.

FIC issue selection. Our FIC issue selection strategy may also pose threats to
the validity of our empirical study results. We understand that using our current
strategy, we could miss some FIC issues in dataset preparation. For example, our
keywords do not contain less popular Android device brands. To reduce the threat,
we made an effort to find the code revisions whose diff contains code that checks the
device information encapsulated in the android.os.Build class. This indeed helped
us find issues that happened to many small brand devices (e.g., Wiko and Nook).
On the other hand, some existing empirical studies also selected issues by keyword
searching in the issue tracking system of open-source apps [119, 143]. However, such
strategies are inapplicable to this work due to two reasons. First, FIC issues usually
do not have specific labels, and thus are mixed with various other issues. Second,
as app developers often suggest users to provide device information when reporting
bugs, simply searching device related keywords can return many results that are
irrelevant to FIC issues.

Limited number of issues to compute recall. In Section 4.2.1, we only used 21 fixed FIC issues to evaluate the recall of FicFinder. Compared with the number of warnings reported by FicFinder, this number is small, for two reasons. First, we only collected those issues that are related to
the issue-inducing APIs in our API-context pair list (FicFinder can only support
detecting certain FIC issues). Second, we also required that the commit message
of the issue-fixing revisions should explicitly indicate that the code changes were
committed to fix real FIC issues (i.e., these issues are worth fixing to the developers).

In the future, we plan to automate the API-context pair extraction process and improve
the detection capability of FicFinder. With more API-context pairs, we can then
conduct larger-scale studies to evaluate FicFinder’s recall.

Restricted scope of studying FIC issues induced by Android APIs. Another potential threat to the generalizability of our findings is that we only
studied FIC issues related to Android APIs. While we indeed observed cases where
compatibility issues are caused by using third-party library APIs and native libraries,
we restricted the scope of our current study to FIC issues caused by Android APIs. We
made this choice because FIC issues induced by Android APIs likely have a broader
impact than those arising from third-party or native libraries.

Evolving Android device market. As the Android device market evolves, existing device models or OS versions will be gradually phased out. FIC issues detected by the
API-context pairs learned from existing issues may gradually become less significant
as devices affected by these issues may eventually get outdated and finally disappear
from the market. This motivates us to further extend our work to automatically
extract API-context pairs as FIC issue patterns in future (see more discussions in
Section 4.3.2). Apart from the API-context pairs, our empirical findings such as the
root causes of FIC issues, on the other hand, may not easily become outdated and
can be used to guide FIC issue diagnosis and fixing.

Errors in manual checking. The last threat to the validity of our findings is
the errors in our manual checking of FIC issues. We understand that such a process
is subject to mistakes. To reduce the threat, we followed the widely-adopted open
coding approach [109] and cross-validated all results for consistency. We also release
our dataset and results for public access [30].

4.3.2 Comparison with CiD & Automated API-Context Pair Extraction

A recent work by Li et al. [136] proposed a technique named CiD to detect compatibility issues induced by Android API evolution across different API levels. CiD can automatically extract lifecycles for Android APIs based on Android
framework code. The lifecycle of an API means the API levels on which the API
is accessible. Comparatively, FicFinder relies on manually extracted API-context
pairs for compatibility analysis. However, despite the lack of automated learning of
FIC issue patterns, our work has a broader scope.

CiD focuses on the compatibility issues induced by accessibility inconsistencies of
Android APIs (see Figure 7 for an example). Such issues only form a subset of the
non-device-specific FIC issues studied in our work (Section 3.2.1). Those non-device-
specific FIC induced by API behavior inconsistencies (see the AsyncTask.execute
example discussed in Section 4.2.1) cannot be detected by CiD. This is because
the Android API lifecycle model in CiD is built by comparing the sets of publicly
accessible APIs at consecutive API levels and cannot capture API behavior changes.
However, we observed that non-device-specific FIC issues induced by Android API
behavior inconsistencies are common in practice. For example, in our evaluation,
the majority of the 129 issues detected by FicFinder are induced by the behavior
changes of Android APIs. 101 of these detected issues have already been acknowledged
or fixed by app developers. To confirm that CiD cannot detect such issues, we ran
it on our 53 evaluation subjects. We found that CiD indeed only detected one of
the 129 true issues detected by FicFinder, which was induced by Android API
accessibility inconsistency. It is worth mentioning that this result does not imply that
CiD cannot effectively detect compatibility issues. It only detected one issue simply
because its scope is different (i.e., the majority of issues detected by FicFinder are
not the targeting issues of CiD).

CiD does not support detecting device-specific FIC issues such as those caused
by problematic driver implementation or non-compliant OS customization. This is
because such issues often arise from closed-source system components customized
by device manufacturers while CiD requires Android framework code as input for
learning FIC issue patterns. However, as we have pointed out in our study, device-
specific FIC issues widely exist in real-world Android apps. To the best of our
knowledge, our work is the first systematic study of such device-specific issues.

Nonetheless, manually extracting API-context pairs is a labor-intensive process. CiD makes the first attempt to automate the process. In our future work, we plan to
study how to automate the process and learn these API-context pairs from popular
and high-quality Android apps, which likely have addressed common FIC issues. We
will focus on extracting API-context pairs for device-specific issues, which are more
common and have induced significant challenges for app developers (Section 3.2.1).
As discussed earlier, the most common pattern of fixing FIC issues is to check device
contexts and API levels before invoking certain APIs. Therefore, it is possible to
extract API-context pairs by correlating Android APIs with the code that performs
such checks. The correlations can be established via static or dynamic code analysis.


4.3.3 Device Models Triggering FIC Issues

In our empirical study, we observed that FIC issues can be triggered on a variety
of Android device models. All kinds of device models can exhibit their specific
behaviors and trigger compatibility issues regardless of their brand or popularity. We
also observed that Android app developers tend to complain more about FIC issues
on popular Android device models. On the one hand, such device models may be
deeply customized. On the other hand, such popular device models are used by more
users. FIC issues that can be triggered on such device models are more likely to be
identified and complained about by app users.

Chapter 5

Pivot: Learning Knowledge of FIC Issues

5.1 Problem Formulation & Motivation

5.1.1 FIC Issue Pattern & API-Device Correlations

The FicFinder work [163] found that FIC issues demonstrate patterns: they
often arise when certain APIs are invoked on certain device models. We observed
that although FicFinder successfully detected previously-unknown FIC issues, its
major limitation is that the patterns required for FIC issue detection were manually
extracted from an empirical study. As the Android ecosystem evolves quickly, the
applicability of these once effective issue patterns gradually diminishes. For example,
among the 25 issue patterns used by FicFinder, 12 are now outdated because the
device models that can trigger the issues have faded out from the market.

This inspires us to develop a sustainable mechanism to keep the knowledge of FIC issues updated with respect to the evolution of the Android ecosystem. We note
that companies of popular commercial apps are likely to notice and patch FIC issues
in their apps early once new issues emerge because of their large user base and
rich maintenance resources. We therefore propose to learn API-device correlations
from these high quality commercial Android apps to maintain a knowledge base
of FIC issues. The learning leverages an observation that FIC issues are mostly
patched by applying workarounds on specific device models [163]. These workarounds
often comprise a conditional statement that checks the runtime device information
(e.g., the device model identifier) against predefined values (mostly constants). The
runtime device information can be obtained by querying the android.os.Build class
provided by the Android SDK. Such conditional statements characterize the existence
of code snippets that handle FIC issues. For illustration, we show a code snippet
that handles a low frame rate issue in camera preview on Nexus 4 in Figure 5.1.
When invoking the camera API on Nexus 4 to take photos, the default frame rate
for photo preview is low (6–10 frames per second), making the preview choppy. The
patch applies a workaround specifically for Nexus 4 by setting the recording mode
hint to true. At Line 4, the code checks the app’s runtime environment. If the device
model is Nexus 4, the workaround is applied (Lines 4–6). From this example, we
observe that the conditional statement (Line 4) encapsulates the information of the
device model (Nexus 4) that can trigger the FIC issue. The API call (Line 5) that
is dependent on the conditional statement demonstrates a strong correlation with
the FIC issue. By identifying such conditional statements and API calls that are
dependent on the conditions, we can recover API-device correlations to characterize
FIC issues.

To learn API-device correlations, our technique Pivot formulates each correlation as a pair of an API and an identifier for a device model (device identifier for short). In
this paper, we focus on learning API-device correlations for Android SDK APIs rather
than APIs of third-party libraries. The device identifier of a correlation describes an
issue-triggering device model such as the name of the model or the manufacturer.
In Figure 5.1, the API setRecordingHint depends on a conditional statement that
checks the device identifier against “Nexus 4”. In such a case, Pivot will derive an
API-device correlation: hParameters.setRecordingHint(boolean), “Nexus 4 ”i. Note
that Pivot does not assume that the APIs in API-device correlations are issue-
inducing. They can also be issue-fixing ones such as the setRecordingHint in our
example. This is the major difference between API-context pairs and API-device
correlations. In the current stage, we cannot distinguish issue-inducing API-device
correlations and issue-fixing API-device correlations. In the future, we plan to study
how to distinguish API-device correlations of these two types and automatically
generate API-context pairs and FIC issue fix patterns.
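
Concretely, an API-device correlation can be thought of as a simple pair. The sketch below gives one hypothetical Java encoding of such a pair; the class name and fields are chosen for illustration and are not Pivot's actual implementation.

    // A hypothetical encoding of an API-device correlation learned by Pivot:
    // an Android SDK API paired with the device identifier checked in the guarding condition.
    public class ApiDeviceCorrelation {
        // Signature of the correlated Android SDK API,
        // e.g. "android.hardware.Camera$Parameters.setRecordingHint(boolean)".
        public final String apiSignature;
        // Device identifier from the guarding condition, e.g. "Nexus 4" or "LGE".
        public final String deviceIdentifier;

        public ApiDeviceCorrelation(String apiSignature, String deviceIdentifier) {
            this.apiSignature = apiSignature;
            this.deviceIdentifier = deviceIdentifier;
        }
    }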

Currently, Pivot does not include API levels in the extracted API-device corre-
lations. This is because we found that it is uncommon (16.7%) to check both device
models and API levels in the code snippets that handle FIC issues according to the
dataset published by Wei et al. [163]. The underlying reason may be that many
devices’ operating systems rarely get major updates after the devices are shipped. As
a result, most FIC issues can be patched without checking API levels. Therefore, we


1. Camera mCamera = Camera.open();
2. Camera.Parameters params = mCamera.getParameters();
3. …
4. if (Build.MODEL.equals("Nexus 4")) {
5.     params.setRecordingHint(true);
6. }
7. …
8. mCamera.setParameters(params);
9. mCamera.startPreview();

Figure 5.1: Patch for Camera Preview Frame Rate Issue on Nexus 4

include only the information of APIs and device models in our model of API-device
correlations.

5.1.2 Application Scenarios of API-Device Correlations

API-device correlations can help reduce the search space of FIC issues and save
testing efforts. For example, the API-device correlation extracted from Figure 5.1
suggests the existence of a FIC issue when using the camera of a Nexus 4. With such
information, developers can focus on testing the app components that use cameras on
Nexus 4. This can help trigger the FIC issue quickly. Otherwise, triggering the issue
would require developers to extensively test their apps on a huge number of different
devices. Let us analyze the reduction of testing efforts. Assume that developers
carefully test their apps using Amazon Device Farm [25], a widely-used online testing
platform providing over 200 device models including Nexus 4. According to an
existing study [145], a popular Android app contains 50 entry methods on average.
Then, app developers may need to test 10,000 (200 × 50) combinations of entry
methods and device models to trigger the issue. Such testing efforts are unaffordable
for most development teams. As a result, FIC issues would likely be left undetected in
the released apps. Comparatively, knowing the correlation between Android camera
APIs and Nexus 4, developers can focus on testing a few entry methods that involve
camera APIs on Nexus 4 to expose potential FIC issue before releasing their apps.

API-device correlations can also be analyzed to derive rules to automate FIC issue
detection and patching. Let us consider the example in Figure 5.1 again. From the
API-device correlation hParameters.setRecordingHint(boolean), “Nexus 4 ”i, we can
infer that Android camera APIs may induce FIC issues on “Nexus 4”. We can also
infer that a possible patch is to invoke setRecordingHint and set the flag to true.
We can further validate the observations by checking relevant resources online. For
example, by a search on Google, we can find concrete evidence (e.g., [21,40]) showing
that camera APIs indeed can cause FIC issues on Nexus 4. Via such analyses, we
can build a knowledge base of FIC issues containing their patterns and high-quality
patches. The issue patterns can be used as inputs to compatibility analysis tools such
as FicFinder [163]. The issue patches can serve as templates to help developers
fix FIC issues and support future research on automatically repairing compatibility
issues. To show the feasibility of this application scenario, in our evaluation, we
derived five FIC issue patterns based on learned API-device correlations. With these
issue patterns, FicFinder detected 39 previously-unknown FIC issues in popular
open-source Android apps. For instance, with the Nexus 4 camera issue pattern,
FicFinder detected three new issues. The developers also quickly fixed these three
issues according to our suggested patch (i.e., invoking setRecordingHint). More
details are discussed in Section 5.3.2.

5.1.3 Motivating Example & Overview

Options menu crash issue on LG devices. Figure 5.2 shows a code snippet
handling an infamous crashing FIC issue triggered by pressing the physical menu
button on some LG devices if the corresponding options menu is customized. The
LG’s solution (Lines 2–7) helps avoid app crashes by explicitly opening the options
menu (Line 5) rather than calling the onKeyUp() method of the super class. With
the example, we now illustrate the major steps and challenges in learning valid
API-device correlations from an Android app corpus.

Step 1: Extracting API-device correlations. API-device correlations can be extracted by identifying the conditional statements that check device information
and the APIs guarded by these statements. We call these statements device-checking
statements. A device-checking statement and its guarded API calls may reside in dif-
ferent methods. As shown in Figure 5.2, the device-checking statement resides in the
method isLGE() (Line 14), while its guarded API call openOptionsMenu() resides
in the method onKeyUp() (Line 5). Therefore, extracting API-device correlations
requires inter-procedural analysis. To capture these correlations, Pivot first builds
an inter-procedural control flow graph for an Android app and then traverses the
graph to identify the device-checking statements as well as the API calls depending
on them.
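
As a rough illustration of what counts as a device-checking statement, the sketch below shows a purely syntactic heuristic: a condition that reads a field of android.os.Build (such as MODEL or MANUFACTURER) and compares it against a string constant. The helper operates on plain strings of source text only for illustration; Pivot itself works on the app's bytecode-level control flow graph rather than on source text.

    import java.util.regex.Pattern;

    public class DeviceCheckHeuristic {
        // Matches reads of android.os.Build fields commonly used to identify a device,
        // e.g. Build.MODEL.equals("Nexus 4") or Build.MANUFACTURER.compareTo("LGE") == 0.
        private static final Pattern DEVICE_FIELD =
                Pattern.compile("\\bBuild\\.(MODEL|MANUFACTURER|BRAND|DEVICE|PRODUCT)\\b");
        private static final Pattern STRING_CONSTANT = Pattern.compile("\"[^\"]+\"");

        // Returns true if the condition text looks like a device-checking statement.
        public static boolean isDeviceCheck(String conditionText) {
            return DEVICE_FIELD.matcher(conditionText).find()
                    && STRING_CONSTANT.matcher(conditionText).find();
        }

        public static void main(String[] args) {
            System.out.println(isDeviceCheck("Build.MANUFACTURER.compareTo(\"LGE\") == 0")); // true
            System.out.println(isDeviceCheck("keyCode == KeyEvent.KEYCODE_MENU"));            // false
        }
    }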

Step 2: Filtering noises. API-device correlations extracted in the first step contain massive noises because APIs that are irrelevant to FIC issues may also be


1. public boolean onKeyUp(int keyCode, KeyEvent event) {
2.     if ((keyCode == KeyEvent.KEYCODE_MENU) &&
3.             isLGE()) {
4.         Log.i(TAG, "Applying LG workaround");
5.         openOptionsMenu();
6.         return true;
7.     }
8.     // App-specific code cloned from VLC
9.     View v = getCurrentFocus();
10.    …
11.    return super.onKeyUp(keyCode, event);
12. }
13. public boolean isLGE() {
14.    return Build.MANUFACTURER.compareTo("LGE") == 0;
15. }
16. protected void onCreate(Bundle savedInstanceState) {
17.    Log.i(TAG, "Activity created");
18.    …
19. }

Figure 5.2: Patch for a Crashing FIC Issue on Some LG Devices

guarded by device-checking statements as part of app-specific logic. For example, the API Log.i() is invoked when the condition “MANUFACTURER equals ‘LGE’ ”
is satisfied. The API is called for profiling purpose but the inter-procedural analysis
in the first step would generate a noisy API-device correlation, hLog.i (), “LGE ”i,
which is unrelated to any FIC issues. According to our experiments (Section 5.3),
97% of our sampled API-device correlations generated by the first step are noises.

Filtering such massive noises is a major challenge in learning valid API-device correlations from Android apps. Existing API usage mining techniques [151, 157]
cannot effectively help filter such noises due to two reasons. We explain the reasons
and present the intuitions of Pivot’s solution below.

Huge and evolving FIC issue search space. The existing API usage mining
techniques assume that the majority of API usages are correct and adopt conventional
statistical metrics (i.e., confidence or support). However, this assumption may not
hold for API-device correlation learning. Since the search space of FIC issues is huge
and evolving, in practice, FIC issues are commonly left unhandled in released apps.
As such, many valid API-device correlations, especially those related to new FIC
issues, are not subject to high confidence or support.

We observe that the confidence of an API-device correlation within an app helps distinguish valid and noisy correlations. The intuition is that within the same app:


[Diagram omitted: APK → Correlation Extractor → Correlation Prioritizer → Ranked API-Device Correlations]

Figure 5.3: Overview of Pivot


(1) The APIs irrelevant to FIC issues can be invoked at various places without
device-checking statements. (2) Callsites of APIs related to the same FIC issues are
often guarded by device-checking statements. Figure 5.2 contains an example: the
irrelevant API Log.i() invoked at Line 4, which is guarded by a device-checking
statement, is also invoked in another method of the app without any device-checking
statements (Line 17). This observation helps distinguish APIs that are relevant to
FIC issues from irrelevant ones. Therefore, we propose a metric, in-app confidence,
that computes how often an API’s invocation is guarded by a specific device-checking
statement in an app. As illustrated, this metric helps identify irrelevant APIs that
are also often invoked without checking device information.

Code clones in Android apps. Simply ranking API-device correlations by in-app confidence and the frequency of a correlation’s occurrences is insufficient because of code clones, which are common in Android apps [104]. The noisy correlations contained in one code snippet can be duplicated in many different apps due to code clones. As a result, noisy API-device correlations in cloned code snippets can recur in different apps. This amplifies the noise and makes valid API-device correlations harder to distinguish. For example, Lines 9–10 in Figure 5.2 are app-specific code, which
are unrelated to FIC issues. Line 9 invokes the API Activity.getCurrentFocus()
to get the currently focused view. This line and the code denoted by “...” in Line 10
are cloned from a popular open-source video player, VLC [88]. In our evaluation, we
observed that instances of noisy API-device correlations extracted from this cloned
code snippet recurred in many apps in our collected app corpora. To mitigate this
problem, our ranking strategy also considers the diversity of the occurrences from which API-device correlations are derived. More details of our approach are presented
in the next section.

5.2 Pivot Approach


As shown in Figure 5.3, Pivot takes a corpus of Android apps as input and
outputs a ranked list of API-device correlations. The process consists of two steps.


[Figure 5.4: Inter-Procedural Control Flow Graph of the Code in Figure 5.2 — blocks (1)–(7) correspond to the statements in Figure 5.2; the path 1–2–3–4–5–6 through the isLGE() check carries the device constraint “MANUFACTURER equals ‘LGE’”, while block 7 returns super.onKeyUp(keyCode, event).]

First, the correlation extractor performs static analysis to extract raw API-device
correlations from the input apps. Then the correlation prioritizer ranks the extracted
API-device correlations.

5.2.1 Extracting API-Device Correlations

The correlation extractor performs inter-procedural static analysis on an input


app in three steps:

Building inter-procedural control flow graphs. For each input app, the
correlation extractor builds an inter-procedural control flow graph by combining
the app’s call graph and each method’s control flow graph. Figure 5.4 shows the inter-procedural control flow graph for the code snippet in Figure 5.2. For ease
of presentation, we label each block in the graph with a unique number. Note
that Android apps are event-driven. When building inter-procedural control flow
graphs, Pivot does not consider the implicit control flow between different event
handlers. As reported in a prior study, it is unlikely that FIC issue patches would
cross event handlers [163]. This means that the pieces of code by which an API-device


correlation is learned likely reside in the same event handler or its callees. As a
result, eliminating control flows between event handlers may not significantly affect
API-device correlation extraction.

Identifying device-checking statements. The correlation extractor then tra-


verses the inter-procedural control flow graph to identify device-checking statements
and evaluates the conditions imposed on each branch. Since device information is
encapsulated by the class android.os.Build, a conditional statement is considered
device-checking if it uses this class. In Figure 5.4, block 3 contains such a conditional statement (the block is in method isLGE()).
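As a concrete illustration, the following Java sketch shows how such device-checking statements could be identified on top of Soot, which Pivot is built upon. The helper below only inspects field references that appear directly in a branch statement; this is a simplification of our own, since Pivot's actual analysis also tracks values that flow from android.os.Build fields through local variables and helper methods such as isLGE().

import soot.Body;
import soot.SootMethod;
import soot.Unit;
import soot.ValueBox;
import soot.jimple.FieldRef;
import soot.jimple.IfStmt;

// Sketch: flag conditional statements that read fields of android.os.Build.
// This is a hypothetical helper; Pivot's real analysis additionally follows
// data flow from Build fields into locals and across helper methods
// (e.g., isLGE() in Figure 5.2), which this direct check does not cover.
public final class DeviceCheckFinder {

    public static boolean isDeviceChecking(Unit unit) {
        if (!(unit instanceof IfStmt)) {
            return false;
        }
        for (ValueBox box : unit.getUseBoxes()) {
            if (box.getValue() instanceof FieldRef) {
                FieldRef ref = (FieldRef) box.getValue();
                String owner = ref.getField().getDeclaringClass().getName();
                if ("android.os.Build".equals(owner)) {
                    return true; // e.g., a branch on Build.MANUFACTURER
                }
            }
        }
        return false;
    }

    // Scan a method body and print its device-checking branches.
    public static void scan(SootMethod method) {
        if (!method.hasActiveBody()) {
            return;
        }
        Body body = method.retrieveActiveBody();
        for (Unit unit : body.getUnits()) {
            if (isDeviceChecking(unit)) {
                System.out.println(method.getSignature() + " checks device info: " + unit);
            }
        }
    }
}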

Computing device constraints for program blocks and deriving API-


device correlations. The correlation extractor then computes the device-related
constraints induced by the conditions that should be satisfied to reach each program
block. For ease of presentation, we refer to these constraints as device constraints.
The device constraint for each block is the disjunction of the device constraints of
all paths by which the block can be reached from the application entry points. As
discussed, the pieces of code by which an API-device correlation is learned likely
reside in the same event handler or its callees. Therefore, Pivot considers each
event handler to be an entry point. Finally, all APIs called in a block are paired with
the device identifiers in the device constraint of this block to produce API-device
correlations. In Figure 5.4, the method onKeyUp() is an event handler defined in the Activity class, and is therefore treated as an entry point for our static analysis. Since there is only one path (1–2–3–4–5–6) reaching block 6 from the entry point, the device constraint is computed as the conjunction of the device constraints associated with the edges on this path (i.e., MANUFACTURER equals “LGE”). By pairing up the APIs called in block 6 and the identifier in the device constraint, two API-device correlations are produced: ⟨Log.i(), “LGE”⟩ and ⟨Activity.openOptionsMenu(), “LGE”⟩.
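In code, this pairing step amounts to a cross product between the APIs called in a block and the device identifiers in the block's device constraint. The sketch below uses a hypothetical correlation type of our own for illustration; it is not Pivot's internal representation.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch: an API-device correlation is a pair <API signature, device identifier>.
// The type below is our own illustration, not Pivot's internal representation.
final class ApiDeviceCorrelation {
    final String api;
    final String deviceId;

    ApiDeviceCorrelation(String api, String deviceId) {
        this.api = api;
        this.deviceId = deviceId;
    }

    // Pair every API called in a block with every identifier in the block's device
    // constraint, e.g., {Log.i(), openOptionsMenu()} x {"LGE"} for block 6 in Figure 5.4.
    static List<ApiDeviceCorrelation> derive(Set<String> apisInBlock, Set<String> deviceIds) {
        List<ApiDeviceCorrelation> correlations = new ArrayList<>();
        for (String api : apisInBlock) {
            for (String id : deviceIds) {
                correlations.add(new ApiDeviceCorrelation(api, id));
            }
        }
        return correlations;
    }
}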

5.2.2 Prioritizing API-Device Correlations

In this step, the API-device correlations extracted in the first step are prioritized
based on their likelihood of capturing real FIC issues. As discussed in Section 5.1.3,
existing techniques cannot effectively filter invalid API-device correlations. To
prioritize API-device correlations, we propose a new approach that leverages two
metrics: in-app confidence and occurrence diversity. The two metrics are inspired by
our observations discussed in Section 5.1.3.


In-App Confidence

New FIC issues continually arise with the release of new device models and the
evolution of Android platforms. These issues may not be commonly and quickly
fixed in real-world Android apps. Therefore, the conventional metric confidence,
which is based on an API-device correlation’s popularity over the whole app corpus,
cannot effectively prioritize API-device correlations. On the other hand, although
FIC issues may not be commonly fixed, we observed that once developers of an app
identify a real FIC issue, they tend to modify all callsites of the issue-inducing API
within the app to fix the issue. The callsites of APIs irrelevant to FIC issues are
mostly not guarded by any device constraints. As such, we propose to use in-app
confidence (IAC ) to prioritize API-device correlations. The IAC of an API-device
correlation c in an app A is defined as:

\[ \mathit{IAC}(A, c) = \frac{\#\ \text{occurrences of}\ c\ \text{in}\ A}{\#\ \text{callsites of}\ c.\mathit{api}\ \text{in}\ A} \tag{5.1} \]

where c.api is the API of c. Intuitively, IAC computes how often the invocation of
c.api is guarded by a device-checking statement that checks the app’s runtime device
model against the device model captured in c.

The total in-app confidence metric wIAC (c) for an API-device correlation c is the
sum of IACs of all apps in the corpus that contain the API-device correlation:
\[ \mathit{wIAC}(c) = \sum_{A} \mathit{IAC}(A, c) \tag{5.2} \]
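The two metrics are straightforward to compute once the per-app counts are available. The following sketch illustrates Equations (5.1) and (5.2); the class and field names are our own and do not reflect Pivot's internal data structures.

import java.util.List;

// Hypothetical per-app counts for one API-device correlation c (names are ours,
// not Pivot's internal data structures).
final class AppCounts {
    final int occurrencesOfC;  // callsites of c.api guarded by c's device check in app A
    final int callsitesOfApi;  // all callsites of c.api in app A

    AppCounts(int occurrencesOfC, int callsitesOfApi) {
        this.occurrencesOfC = occurrencesOfC;
        this.callsitesOfApi = callsitesOfApi;
    }
}

final class InAppConfidence {
    // Equation (5.1): IAC(A, c) for a single app A.
    static double iac(AppCounts countsInApp) {
        return (double) countsInApp.occurrencesOfC / countsInApp.callsitesOfApi;
    }

    // Equation (5.2): wIAC(c) sums IAC over all apps that contain the correlation.
    static double wiac(List<AppCounts> appsContainingC) {
        double sum = 0.0;
        for (AppCounts counts : appsContainingC) {
            sum += iac(counts);
        }
        return sum;
    }
}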

Occurrence Diversity

Intuitively, if an API-device correlation recurs in more apps, it is more likely to


be valid. However, prioritization of API-device correlations based on their number of
occurrences alone can be insufficient. Since code clones are common in Android apps,
the API-device correlations extracted from cloned code can appear in many apps and
get a high rank even if they have nothing to do with FIC issues (see Section 5.1.3 for
an example). To mitigate this problem, we propose a new metric occurrence diversity
to measure the diversity of the apps and methods, within which the instances of an
API-device correlation c are found. We denote it as d(c). Intuitively, d(c) is higher if
the apps and methods that contain the instances of c are more diverse.

We leverage Shannon index [158] to quantify occurrence diversity. Shannon index


was first proposed to measure the diversity of the characters in a string and was later


widely applied in ecology to measure the diversity of species [103,159]. Shannon index
is computed based on the distribution of different groups of entities in a population
and is defined as follows:
\[ H = -\sum_{i=1}^{n} p_i \log p_i \tag{5.3} \]

where $p_i$ is the proportional abundance of the i-th group in a population. Intuitively, the Shannon index is higher if there are more distinct groups in the population and the entities are more evenly distributed across the groups. Using the Shannon index, we measure an API-device correlation c’s occurrence diversity at two levels: app-level and method-level. Specifically, we divide the apps and methods that contain instances of c into groups and calculate the Shannon index based on these groups.
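For concreteness, the Shannon index of such a grouping can be computed directly from the group sizes, as in the sketch below (we use the natural logarithm here; the base of the logarithm is an implementation choice).

import java.util.Collection;

final class ShannonIndex {
    // Equation (5.3): H = -sum_i p_i * log(p_i), where p_i is the proportion of
    // occurrences falling into the i-th group (e.g., occurrences grouped by app name).
    static double of(Collection<Integer> groupSizes) {
        int total = 0;
        for (int size : groupSizes) {
            total += size;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int size : groupSizes) {
            if (size > 0) {
                double p = (double) size / total;
                h -= p * Math.log(p);
            }
        }
        return h;
    }
}

For example, under this sketch nine occurrences split into three equally sized groups yield a higher index (log 3 ≈ 1.10) than the same nine occurrences split into groups of seven, one, and one (≈ 0.68).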

App-level diversity measures the diversity of the apps that contain the instances
of c. We use app names and app company names, which can be extracted from app
markets, as identifiers to distinguish different groups of apps.

Method-level diversity measures the diversity of methods that contain the


API callsites in c’s instances. We use package names of the methods’ enclosing classes
and method control flow structures to distinguish different methods. Package names
can be extracted by static analysis but are subject to code obfuscation. To cope
with obfuscation, we also measure the diversity of control flow structures using an
existing technique [104] that detects code clones in Android apps by projecting the
control flow structure of each method to a three-dimensional centroid and calculating
distances between these centroids. This technique has been shown to be accurate and
scalable. Since it is based on the control flow structures of methods, the technique is
robust to code obfuscation. To calculate the control flow structure diversity, we first
calculate centroids for all methods that contain the API callsites in c’s instances.
Then the methods are clustered based on the centroids according to the single-linkage
clustering algorithm [115]. Specifically, a method is added to an existing cluster if
the distance between its centroid and the centroid of any method in this cluster is
smaller than a threshold θ, which is set to the typical value of 0.05 in our experiments. The output clusters are considered different control flow structure groups.
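The sketch below illustrates this grouping step under simplifying assumptions of our own: centroids are treated as three-dimensional points, Euclidean distance stands in for the distance metric of the centroid-based technique [104], and the clustering is performed as a single incremental pass (a full single-linkage implementation would also merge clusters that become linked by a newly added point).

import java.util.ArrayList;
import java.util.List;

// Sketch: group method centroids with a single-linkage-style incremental pass.
// Centroids are assumed to be 3-D points; [104] defines the exact projection.
final class CentroidClustering {
    static final double THETA = 0.05;  // clustering threshold from Section 5.2.2

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // A method joins the first cluster that already contains a centroid closer
    // than THETA; otherwise it starts a new cluster.
    static List<List<double[]>> cluster(List<double[]> centroids) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] c : centroids) {
            List<double[]> target = null;
            for (List<double[]> candidate : clusters) {
                for (double[] member : candidate) {
                    if (distance(c, member) < THETA) {
                        target = candidate;
                        break;
                    }
                }
                if (target != null) {
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                clusters.add(target);
            }
            target.add(c);
        }
        return clusters;
    }
}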

After grouping API-device correlation occurrences, we apply Shannon index to


calculate the diversity of app name ($H_{app}$), app company name ($H_{company}$), method package name ($H_{package}$), and method centroid ($H_{centroid}$). The overall occurrence
diversity of an API-device correlation c is then calculated as follows:

\[ d(c) = \log\bigl(1 + H_{app}(c) \times H_{company}(c) \times H_{package}(c) \times H_{centroid}(c)\bigr) \tag{5.4} \]


Ranking Score

The ranking score of an API-device correlation c is then calculated by combining


the above metrics:
S (c) = wIAC (c) × d(c) (5.5)

The extracted API-device correlations are ranked based on their ranking scores. If
the ranking score of an API-device correlation is higher, it is more likely to capture
FIC issues.

5.3 Evaluation

We implemented Pivot on top of Soot [130]. To evaluate the effectiveness


of Pivot and the usefulness of its learned API-device correlations, we study the
following research questions:

• RQ8 (Effectiveness of Pivot) Can Pivot effectively identify valid API-device


correlations of real FIC issues from popular Android apps? Can our proposed
ranking score outperform the metrics adopted by existing API precondition mining
techniques?
• RQ9 (Usefulness of API-device correlations) Can the API-device correla-
tions learned from popular apps facilitate FIC issue detection in other apps?

To investigate the RQs, we conducted two experiments:

• Experiment I: To answer RQ8, we ran Pivot on popular apps collected from


Google Play and evaluated the precision of its learned API-device correlations.
We compared the results with a baseline approach that adopts the ranking metric
of a representative API precondition mining technique proposed by Nguyen et
al. [151]. Since FIC issues evolve quickly, we ran Pivot on two different app
corpora collected in 2017 and 2018 and compared the results to evaluate whether
Pivot can effectively identify valid API-device correlations from popular apps at different times.
• Experiment II: To answer RQ9, we conducted a case study. We leveraged the
API-device correlations learned by Pivot in the first experiment to detect FIC
issues in open-source Android apps and reproduced the detected issues. To further
evaluate the usefulness of leveraging the API-device correlations identified by


Pivot from the perspective of Android app developers, we reported all of our
detected FIC issues to their original app developers and sought their feedback.

Sections 5.3.1 and 5.3.2 report the details of the two experiments and answer RQ8 and RQ9.
The experiments were conducted on a Linux server running CentOS 7.4 with two
Intel Xeon E5-2450 Octa Core CPU @2.1GHz and 192 GB RAM.

5.3.1 RQ8: Effectiveness of Pivot


Experiment I Setup

In this experiment, we evaluate the effectiveness of Pivot in prioritizing valid API-device correlations that capture real FIC issues. In the following, we discuss the data collection process, baseline approach, ground truth, and evaluation metrics.

Data collection. To evaluate whether Pivot can successfully identify FIC


issues that are active at different times, we applied it to two app corpora collected
in 2017 and 2018, respectively. To prepare the two corpora, we crawled the APK
files of the top 100 free apps for each category on Google Play in November 2017
and June 2018. The first collection contains 2,694 apps and we were able to run
Pivot on 2,524 of them. The second collection contains 3,054 apps and we were able
to run Pivot on 2,765 of them. Soot crashed when analyzing the remaining apps.
The number of collected apps is not an exact multiple of 100 because some APK files could not be downloaded due to server errors. The analyzable apps formed our two app corpora.
We will call them Corpus 2017 and Corpus 2018 hereinafter. The two app corpora
slightly overlap: 853 apps are in both corpora but the app versions are different.

We analyzed the APK files of the apps in the two corpora and collected the apps’
metadata from Google Play [38]. Table 5.1 shows the statistics. As shown in the table, the two app corpora both have tens of millions of classes and over one hundred million methods. The apps are also popular and of high quality. The average rating
is 4.1 out of 5. On average, each app received over 16 and 14 million downloads in
Corpus 2017 and Corpus 2018, respectively.

Baseline approach. We compared Pivot with a baseline approach adapted


from a state-of-the-art API precondition mining technique by Nguyen et al. [151].
The technique learns API preconditions that involve API receivers and parameters.
It originally does not support extracting device-related preconditions. To adapt
the technique to learn valid API-device correlations, we integrated our Correlation


Table 5.1: Statistics of Our App Corpora

Corpus 2017 Corpus 2018


# Apps 2,524 2,765
Total # classes 14,297,551 20,023,102
Total # methods 112,181,748 108,861,735
Average rating 4.1 4.1
Average # downloads 16,000,000+ 14,000,000+

Extractor (Section 5.2.1) with the technique’s filtering and ranking components.
Specifically, the API-device correlations were first filtered by their support. Those
API-device correlations that only occurred once in a corpus were filtered out. The
API-device correlations were then ranked by the multiplication of method-level
confidence and project-level (i.e., app-level) confidence. App-level confidence of an
API-device correlation c is computed as the ratio of the apps that contain c’s instances
to the apps that contain callsites of c’s API. Similarly, method-level confidence of an
API-device correlation c is the ratio of the methods containing callsites of c’s API
that are guarded by statements checking the device identifier modeled in c to the
methods that contain the callsites of c’s API.

Ground truth: valid API-device correlations. We consider an API-device


correlation c valid if calling the API in c can trigger or help fix FIC issues on the
corresponding device model. To establish the ground truth, we manually inspected
the top 50 API-device correlations of each ranked list produced by Pivot and the baseline approach. We also randomly sampled 50 API-device correlations from the unranked API-device correlations extracted by the Correlation Extractor to show the massiveness of the noise. We only manually validated the top 50 API-device correlations because, without an automated approach, users are unlikely to check a large number of API-device correlations produced by Pivot or the baseline approach. When validating each
API-device correlation, we first inspected the originating location of its instances and
determined whether the API callsites are dependent on some statements checking
the device identifier in the correlation. If yes, we proceeded to use the API signature,
the name of API’s enclosing class, and the device identifier as keywords to search
on Google, GitHub, and Stack Overflow [33, 36, 76]. An API-device correlation c is
then considered valid only if we can successfully find multiple sources (e.g., forum
discussions or issue reports) confirming that (1) the API of c can induce inconsistent
app behavior on the device model specified in c, or (2) the API can be used to
address the FIC issues that can be triggered on the device model. Recall the example


[Figure 5.5: Precision@N of Pivot and Baselines — Precision@N (%) plotted against the top N API-device correlations (N = 0–50) for four series: Pivot 2017, Pivot 2018, Baseline 2017, and Baseline 2018.]

in Figure 5.2, the patch suggested by LG developers was to explicitly invoke an API
to open the options menu (Line 5) [45, 77]. As a result, the API-device correlation
⟨Activity.openOptionsMenu(), “LGE”⟩ is considered valid. We published the valid
API-device correlations and the corresponding online discussions at our project
website [63].

Evaluation metric. To measure the effectiveness of Pivot and the baseline


approach in prioritizing API-device correlations, we adopt the metric Precision@N,
which reports the percentage of valid API-device correlations among the top N
API-device correlations in a ranked list. Higher Precision@N indicates that the
ranking strategy is more effective.
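Computing Precision@N from a ranked list and a validity oracle is straightforward; the sketch below uses a set of manually validated correlations standing in for our ground truth.

import java.util.List;
import java.util.Set;

final class PrecisionAtN {
    // ranked: API-device correlations ordered by ranking score (best first);
    // valid: the manually validated correlations serving as the ground truth.
    static double precisionAtN(List<String> ranked, Set<String> valid, int n) {
        int limit = Math.min(n, ranked.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (valid.contains(ranked.get(i))) {
                hits++;
            }
        }
        return limit == 0 ? 0.0 : (double) hits / limit;
    }
}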

Results of Experiment I

Pivot extracted 32,047 and 24,355 API-device correlations from Corpus 2017 and
Corpus 2018, respectively. To evaluate the effectiveness of Pivot, we (1) randomly
sampled 50 API-device correlations extracted from each app corpus to show the
massiveness of the noise, (2) evaluated Precision@N (N = 1, 2, 3, . . ., 50) for the top 50 API-device correlations in each ranked list, (3) compared the results of Pivot for the two different app corpora, and (4) compared the results of Pivot with those of
the baseline approach.

Massiveness of noises. From the 100 randomly-sampled API-device correla-


tions (50 for each app corpus), we only identified three valid ones that capture real
FIC issues. None of these three valid API-device correlations was ranked to top 50 by
Pivot because they either only occurred once in one app or occurred only in cloned


code snippets. This shows that noise is massive in the API-device correlations extracted from the app corpora (approximately 97%). Identifying valid API-device correlations is therefore a significant challenge.

Precision@N of Pivot. Figure 5.5 plots the results of Experiment I. The two
solid lines show the results of Pivot for the two app corpora. In both corpora,
Pivot achieves 100% precision among the top five API-device correlations and over
90% precision among the top ten. The precision gradually drops as N grows. This
shows that Pivot can effectively prioritize valid API-device correlations. Pivot
identified 29 valid API-device correlations among the top 50 (58% precision@50) for
Corpus 2017 and 28 valid ones among the top 50 for Corpus 2018 (56% precision@50).
Among these 29 and 28 valid correlations, eight overlap. This means that Pivot identified a total of 49 distinct valid API-device correlations.

We further studied why Pivot ranked some invalid API-device correlations


among top 50. We identified two major reasons. First, some APIs are commonly
used in different apps for specific purposes irrelevant to FIC issues. One typical
example is the logging APIs that are widely used for runtime information collection.
API-device correlations involving such APIs may receive a high ranking score because
the occurrence diversity of such API-device correlations could be high. Second, most
other invalid API-device correlations were ranked high due to the limitations of
existing constraint solvers and simplifiers. When deriving device constraints for
program blocks, we leveraged a popular constraint solver, Choco [154], and a rule-
based constraint simplifier, Jbool Expressions [41], to simplify the device constraints.
As Boolean expression minimization is an NP-hard problem [120], our simplified
device constraints may not be minimal. Thus, API calls that are not made under any
device constraints could be mistakenly included and paired with device constraints.

Comparison between the results of the two corpora. As shown in Fig-


ure 5.5, Pivot achieved similar precision@N for the two app corpora collected at
different times. The valid API-device correlations among the top 50 for Corpus 2017 and Corpus 2018 only slightly overlap, as discussed above. This shows that Pivot can learn valid API-device correlations from corpora of popular apps collected at different times.

Comparison with baseline. The two dashed lines in Figure 5.5 show the results
of the baseline approach for the two app corpora. Pivot significantly outperforms
the baseline approach. The baseline approach failed to rank any valid API-device
correlations to top 50 for Corpus 2017. For Corpus 2018, the baseline approach
only identified one valid API-device correlation at the 14th position with a small


Table 5.2: Case Study Subjects and Reported Issues

ID  App Name              Category           Latest Revision  KLOC  # Stars  Rating  # Downloads  Issue ID(s)
1   Barcode Scanner [19]  Shopping           c8da0c1          43.2  16,152   4.1     100M+        959b
2   NewsBlur [52]         News & Magazines   0f0f1f1          17.4  4,759    3.8     50K+         1084
3   Xabber [97]           Communication      925e996          46.2  1,509    4.1     1M+          800
4   OctoDroid [53]        Productivity       e491d76          43.2  876      4.5     100K+        821a
5   Simple Task [72]      Productivity       0572a62          4.1   288      4.7     10K+         862a
6   Face Slim [29]        –                  c62bb57          3.5   206      –       –            319a
7   Simple Camera [71]    Tools              8aebf00          2.2   123      4.2     100K+        120b
8   Walleth [91]          Finance            929216c          9.1   100      4.3     5K+          201b
9   Calendula [20]        Health & Fitness   39e6e96          26.3  76       4.5     1K+          92b
10  Open Manga [58]       –                  ac3cd27          20.7  51       –       –            47

“–” means not applicable. Superscript “a” means the issue was acknowledged and the developers agreed to fix it in the future. “b” means the issue has already been fixed.

confidence value of 0.33. In comparison, this API-device correlation was ranked within the top six by Pivot.

We further studied why the baseline approach performed poorly. We found that
most of its top-ranked API-device correlations involve rarely-used APIs. For example,
13 API-device correlations received a confidence of 1.0 and were ranked at the top by the
baseline approach for Corpus 2018. However, none of them are valid. This shows
that learning API-device correlations differs from mining API preconditions. General
API precondition mining techniques cannot be applied to identify valid API-device
correlations.

5.3.2 RQ9: Usefulness of API-Device Correlations

Experiment II Setup

We built an archive of 17 distinct FIC issues based on the 49 valid API-device


correlations learned by Pivot. There are fewer distinct FIC issues than valid API-
device correlations because multiple API-device correlations concerning the same
device identifier could refer to the same FIC issue. For example, an API that triggers
an FIC issue and another API that addresses this FIC issue can result in two valid
API-device correlations according to our approach. Furthermore, some FIC issues
concern multiple APIs. Our archive provides the following information for each
FIC issue: (1) the API-device correlations of the issue learned by Pivot, (2) the
package IDs of the apps from which Pivot learned these correlations, and (3) the
related issue discussions we collected from online resources to validate the API-device
correlations. This archive is also publicly available [63].

To investigate the usefulness of the learned API-device correlations, we con-


ducted a case study on open-source Android apps. In the study, we leveraged
FicFinder [163], a static analyzer that detects FIC issues in Android apps, to detect


and reproduce undiscovered instances of our archived FIC issues. In the following,
we discuss the details of this study.

Issue selection. From our archive, we selected five FIC issues to carry out the
case study. We provide videos to demonstrate the inconsistent app behaviors caused
by these issues on our project website [63]. We selected these five issues because:
(1) the device models that can trigger the issues are available on Amazon Device
Farm [8] or WeTest [94] and (2) the inconsistent app behaviors caused by the issues
are observable on the online testing platforms. The first criterion enables us to
reproduce the later detected FIC issues on online testing platforms. The second
criterion allows us to have a clear oracle to determine the occurrence of FIC issues.

App selection. To find undiscovered instances of the five FIC issues in real-world
Android apps, we collected the latest version of 44 apps indexed by F-Droid [28].
All these 44 apps (1) have received at least 50 stars on GitHub [33], (2) have over
500 commits, (3) have at least one push during a five-month period before the case
study, and (4) contain at least one callsite of any API related to the five FIC issues.
These criteria ensure that our selected apps (1) are popular and well-maintained and
(2) could be susceptible to the selected FIC issues.

Issue detection and reproduction. For each of the five selected issues, we
encoded it as a rule in FicFinder’s API-Context Pair format. We then ran
FicFinder using these rules as input to analyze the 44 Android app subjects.
FicFinder reported warnings for 35 of these subjects. We manually inspected these
35 apps and excluded those that require special hardware/software environments to
run (e.g., a bluetooth watch or specific server setups) from this study. As a result, we
focused on reproducing 19 detected issues in 19 apps using Amazon Device Farm [8]
and WeTest [94].

Usefulness of API-Device Correlations

Among the 19 detected issues, we successfully reproduced ten in ten different apps.
We failed to reproduce the remaining nine due to three major reasons. First, some
apps did not exhibit inconsistent behavior as they only use the minor functionalities
of the issue-inducing APIs. For example, some apps use the Camera API to turn
on the flash light to use the phone as a torch. In such cases, the camera preview
was not displayed and the issues could not be observed. Second, FicFinder generated false positives because it failed to recognize issue workarounds that already existed in


the apps. Third, we failed to reach the API callsites of several apps because they
crashed prematurely.

Table 5.2 gives the information of the ten apps, whose FIC issues were successfully
reproduced. These apps (1) contain thousands of lines of code (large-scale), (2)
cover different app categories (diversified), (3) receive over 50 stars on GitHub and
thousands of downloads on Google Play (popular), and (4) are rated at least 3.8 on
Google Play (decent quality). We reported the ten reproducible issues to the original
app developers to seek their feedback. The issue IDs of our reports are provided in
the last column of Table 5.2.

Our reported FIC issues received due attention from app developers. Among
the ten reported issues, seven have been acknowledged and four were immediately
fixed. For example, issue 92 of Calendula would crash the app when invoking
the built-in DatePicker API on some popular Samsung devices running Android 5.0
(e.g., Galaxy Note 5). Upon receiving our report, the app developer quickly fixed
the issue with a workaround documented in our issue archive [63].

These results show that API-device correlations can effectively help detect po-
tential FIC issues in Android apps and reproduce the issues on the corresponding
issue-triggering device models. The information provided by our FIC issue archive
(Section 5.3.2) is actionable and can facilitate the detection and diagnosis of FIC is-
sues. Specifically, the device identifiers and APIs specified in API-device correlations
can help reduce the search space of FIC issues. With the information, developers
can design tests to cover the app components that call FIC issue-inducing APIs and
execute these tests on the device models that can trigger the FIC issues to expose
potential issues. Compared with an unguided testing process, such a guided process
saves much testing effort (Section 5.1.2).

5.4 Discussions

5.4.1 Threats to Validity

Quality of input app corpora. Pivot can only identify valid API-device
correlations from apps that contain code snippets handling FIC issues. It is unlikely
that such code snippets would exist in apps that are not well-maintained. As such,
the quality of app corpora can affect the performance of Pivot. It is suggested to


build an app corpus with popular apps on app stores as we did in our experiments.
Pivot can then extract valid and useful API-device correlations.

Device-checking statements identification. Pivot identifies device-checking


statements based on the use of the class android.os.Build. If developers leverage
other mechanisms to check device information, Pivot would miss the statements.
Nevertheless, the use of android.os.Build to check device information was shown
to be common in practice [163]. Pivot can also be extended to support alternative
identification methods for device-checking statements.

Imprecise static analysis. Pivot performs static analysis on inter-procedural


control flow graphs to extract API-device correlations. It is possible that the
graphs generated by static analysis are imprecise or unsound [116]. Because of such
imprecision, Pivot may miss some API-device correlations or generate invalid
API-device correlations. However, as Pivot learns API-device correlations from
large app corpora, such problems can be mitigated in Pivot’s final output.

5.4.2 Comparison with FicFinder


The goals of FicFinder and Pivot are different. While FicFinder aims to
detect FIC issues with a given set of predefined patterns, Pivot aims to learn the
knowledge of FIC issues from popular apps. As we have shown in our evaluation,
the knowledge learnt by Pivot can serve as input to FicFinder and help detect
previously-unknown FIC issues. The original version of FicFinder encoded 20
FIC issue patterns manually extracted from the empirical study dataset. These
patterns will be gradually outdated with the increasing adoption of new Android
device models (Section 5.1.1).

We did not apply Pivot to the apps used by the empirical study in FicFinder [163,
164] to learn API-device correlations because the empirical study involved only five
apps, making it difficult for Pivot to distinguish valid API-device correlations from
noises. Alternatively, we compared the FIC issues in FicFinder’s dataset with
those learned by Pivot. Ten of the 49 distinct valid API-device correlations learned
by Pivot are related to FIC issues in FicFinder’s dataset. Among the other 39
valid API-device correlations that were not included in FicFinder’s dataset, 14 of
them concern FIC issues that emerged in 2017 (FicFinder’s dataset was published
in 2016). This confirms that Pivot can identify new valid API-device correlations.
In Section 5.3.2, we have also shown that such automatically learned knowledge can
help detect undiscovered FIC issues in real-world Android apps.

Chapter 6

Pixor: Proactively Exposing


Inconsistent Android System
Behaviors

A fundamental limitation of Pivot is that it can only identify API-device


correlations for FIC issues that have already been fixed by some existing apps in
the input app corpora. It cannot extract API-device correlations of new FIC issues
that are not fixed by any app in the input app corpora. To address this problem, we propose Pixor, a framework that extracts and selects app subroutines to exercise Android system functionalities and proactively expose inconsistent Android system behaviors across different versions. Essentially, FIC issues are caused by the inconsistent behaviors of different versions of the Android system. Exposing such inconsistent behaviors can be useful to both Android system developers and app developers. For inconsistent behaviors induced by bugs, Pixor can help system developers identify and fix bugs in the Android system. For inconsistent behaviors induced by intended system behavior changes, the identified inconsistencies can be used to derive API-context pairs and detect FIC issues in Android apps. For ease of presentation, we will refer to system behaviors that are only exhibited on specific Android versions as platform-specific behaviors.

6.1 Approach

6.1.1 Overview

We propose Pixor based on the key observation that the large number of existing
Android apps, which contain a great variety of Android platform API use cases, can


[Figure 6.1: Overview of Pixor — input APKs are processed by the Subroutine Extractor and the API Graph Constructor; the Subroutine Selector chooses subroutines, which the Test App Generator turns into test apps; the Test App Runner executes them, and the Platform-Specific Behavior Analyzer compares the results to produce reports.]

be viewed as test cases to exercise diverse system functionalities and expose platform-
specific behaviors. This motivates us to leverage the large number of existing Android
apps to generate test cases and expose platform-specific behaviors by comparing the
behavior of the same test cases on different Android versions. Android apps contain
real and diverse use cases of Android APIs, which may not be easily covered by test
cases prepared by system testers. As a result, test cases generated from apps in
the wild are likely to expose platform-specific behaviors that escaped the Android system testing process.

While existing Android apps contain abundant Android API use cases, many of them can be equivalent in exercising system functionalities and exposing platform-specific behaviors. Blindly testing platform-specific behaviors wastes testing effort on repeatedly exercising the same system behaviors. This poses the key challenge of this work: how can we guide the exploration of system functionalities so that diverse system behaviors are exercised while redundant testing effort is minimized?

To effectively expose platform-specific behaviors, we propose Pixor. Specifically,


Pixor breaks down the input apps into subroutines and selects subroutines that exercise diverse system functionalities by estimating both the system functionalities exercisable by the whole input app corpus and those that can be covered by each subroutine. At the current stage, we only focus on exposing platform-specific behaviors that can induce UI display inconsistencies (we refer to such platform-specific behaviors as UI-related platform-specific behaviors). We focus on UI-related platform-specific behaviors for two reasons: (1) they are very common in practice, and (2) they demonstrate clear symptoms that can serve as test oracles, allowing us to identify the triggered platform-specific behaviors.

In effectively leveraging existing apps to exercise diverse system functionalities,


Pixor faces three challenges:

• Apps can be large, containing a large amount of code. Directly selecting whole apps to conduct system testing can be too coarse-grained. The first challenge faced


by Pixor is: How to break down the input apps into subroutines so that the
subroutines can be easily executed while preserving their ability to trigger
UI-related platform-specific behaviors?

• How to estimate the system functionalities that can be exercised by the app
subroutines to identify equivalent subroutines and avoid exercising the same
system functionalities?

• Given the extracted subroutines and their exercised system functionalities, the core challenge faced by Pixor is: how to leverage them to
effectively select subroutines to exercise diversified system functionalities to
expose UI-related platform-specific behaviors?

Figure 6.1 shows an overview of Pixor. Corresponding to the three research challenges, Pixor has three key components in its test selection procedure (Subroutine Extractor, API Graph Constructor, and Subroutine Selector). Taking a corpus of Android apps as input, the Subroutine Extractor breaks the input apps into subroutines. The API Graph Constructor estimates the system functionalities that can be covered by the extracted subroutines by modeling their API use cases as API Graphs. The API Graph Constructor also abstracts all the available API use cases in the whole input app corpus to estimate all the exercisable system functionalities. The Subroutine Selector leverages the generated API Graphs to select subroutines that are later used to exercise Android system functionalities. Finally, Pixor executes the selected subroutines in its Test App Runner and compares the execution results in the Platform-Specific Behavior Analyzer
to report potential UI-related platform-specific behaviors. In the following sections,
we present the detailed design of each component in Pixor and how they address
the research challenges.

6.1.2 Subroutine Extractor: Breaking Down Apps into Sub-


routines

The Subroutine Extractor first breaks down the input apps into subroutines, which later serve as units for test selection. Desirable subroutines should be easy to
execute and their ability to trigger UI-related platform-specific behaviors should be
preserved.

Since Pixor focuses on UI-related platform-specific behavior identification, its


Subroutine Extractor extracts Activities as subroutines. Activity is a desirable


subroutine for exposing UI-related platform-specific behaviors since (1) it is a key


component in Android that typically defines the UI display of one screen in Android
apps, and (2) the dependencies between different Activities are usually minimal [6].
As a result, it is easy to generate isolated executable subroutines from Activities.
In this step, Subroutine Extractor scans all the classes in the input apps and outputs
all their non-abstract Activities as subroutines. The Subroutine Extractor is built on top of Soot, a code analysis framework for Java and Android programs [130], and ApkTool, a reverse engineering tool for Android apps [4].
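A minimal sketch of this extraction step on top of Soot is shown below. The superclass walk is a simplification of our own: it assumes that the Activity class hierarchy is resolvable in the Soot Scene and does not handle classes whose ancestors Soot fails to load.

import java.util.ArrayList;
import java.util.List;

import soot.Scene;
import soot.SootClass;

// Sketch: collect the concrete Activity subclasses of an analyzed APK as subroutines.
final class SubroutineExtraction {

    // Returns true if clazz (transitively) extends android.app.Activity.
    static boolean isActivity(SootClass clazz) {
        SootClass current = clazz;
        while (current.hasSuperclass()) {
            current = current.getSuperclass();
            if ("android.app.Activity".equals(current.getName())) {
                return true;
            }
        }
        return false;
    }

    static List<SootClass> extractSubroutines() {
        List<SootClass> subroutines = new ArrayList<>();
        for (SootClass clazz : Scene.v().getApplicationClasses()) {
            if (!clazz.isAbstract() && isActivity(clazz)) {
                subroutines.add(clazz);
            }
        }
        return subroutines;
    }
}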

Note that Pixor can also be extended to expose platform-specific behaviors of


other types. Finer-granularity subroutines such as event handlers or methods may be
more suitable for exposing other types of platform-specific behaviors. In the future, we plan to explore the possibility of further extending Pixor to expose other types of platform-specific behaviors and of breaking down apps into subroutines of different granularities.

6.1.3 API Graph Constructor: Estimating Exercisable Sys-


tem Functionalities

Precisely identifying the system functionalities exercised by a subroutine can


be challenging. One possible solution is to execute the subroutine and profile all
the triggered system functionalities. However, this postmortem solution requires executing all the subroutines and is therefore unsuitable for guiding subroutine selection before the subroutines are actually executed. This motivates us to estimate the
exercisable system functionalities of each subroutine.

To address this challenge, we made a key observation that Android apps use
system functionalities via Android API calls. We thereby propose to model API
use cases to estimate their exercisable system functionalities. Pixor models API
use cases as directed graphs, denoted as API Graphs. API Graph Constructor first
constructs an API Graph for each of the extracted subroutines and then merges
all the individual to construct a universal API Graph. The individual API Graphs
and universal API Graph abstracts the API use cases of individual subroutines and
the whole input app corpus respectively. In this section, we will (1) introduce our
definition of API Graph, (2) discuss the reasons why we model API use cases as
API Graphs, (3) introduce how API Graphs can be constructed from the input app
corpus, and (4) briefly discuss the relationship between individual API Graphs and
the universal API Graph.


[(a) API Graph — nodes for findViewById(), ViewPager.setAdapter(), and TabLayout.setupWithViewPager(), connected by weighted edges that record the observed invocation order.]

1. public class MainActivity extends AppCompatActivity {


2. @Override
3. protected void onCreate(Bundle savedInstanceState) {
4. …
5. ViewPager viewPager = (ViewPager) findViewById(R.id.viewpager);
6. if (viewPager != null) {
7. setupViewPager(viewPager);
8. }
9. TabLayout tabLayout = (TabLayout) findViewById(R.id.tabs);
10. tabLayout.setupWithViewPager(viewPager);
11. }
12. private void setupViewPager(ViewPager viewPager) {
13. Log.i("TAG", "Setting View Pager");
14. Adapter adapter = new Adapter(getSupportFragmentManager());
15. …
16. viewPager.setAdapter(adapter);
17. }
18. }
(b) Code Snippet

[(c) Control Flow Graph — block (1): ViewPager viewPager = (ViewPager) findViewById(R.id.viewpager); block (2), guarded by viewPager != null: call setupViewPager(), whose body (block (3)) logs a message, creates an Adapter, and calls viewPager.setAdapter(adapter); block (4): TabLayout tabLayout = (TabLayout) findViewById(R.id.tabs); tabLayout.setupWithViewPager(viewPager).]

Figure 6.2: An example of an API Graph, its corresponding control flow graph, and code snippet. For presentation simplicity, we simplified the example and omitted the parameters of the APIs in the API Graph.

Definition of API Graph. An API Graph is a directed graph that captures


invocation order relationship between Android APIs. Figure 6.2a shows an example


of an API Graph, and Figure 6.2b shows the corresponding code snippet from which the API Graph is built. A node in an API Graph represents an Android API. Each distinct Android API has only one node in the API Graph. An edge in an API Graph represents the invocation order between APIs. Specifically, an edge from API_1 to API_2 indicates that API_1 is invoked before API_2 and there are no other Android API invocations between them. For example, the edge from “findViewById”
to “ViewPager.setAdapter” indicates that there exist paths in the program where
“findViewById” is invoked before “ViewPager.setAdapter” and no other APIs of
interest are invoked between them.

Why API Graph? There can be many alternative ways to model API use
cases. We chose to leverage such a directed graph to model API use cases due to the
flexibility it brings: API use case representations at different granularities can be
further derived from the graph-based model. For example, the simplest representation leverages only the node set in the graph (i.e., the set of invoked APIs) to represent API use cases. At the other extreme, we can also derive complicated paths from the graph to capture interdependencies between the APIs. Since different API use case representations suit different application scenarios, our API Graph provides great flexibility to fit different testing scenarios.

Constructing API Graphs. The API Graph of a subroutine can be constructed


from its control flow graph [101]. Pixor constructs an API Graph of an app by
traversing its control flow graph. Specifically, Pixor depth-first traverses all the
blocks in the control flow graph. When Pixor traverses a block in a CFG, it identifies
all the APIs invoked in the block. If any of the invoked APIs are not in the API
Graph, Pixor will add a node for each of them in the API Graph. Pixor will then
add edges between APIs if there is a path in the CFG between two API invocation
statements and the path does not contain any other API callsites. Figure 6.2 shows
an example of a CFG and its corresponding API Graph. Pixor currently only builds API Graphs with UI-related APIs since we focus on exposing UI-related platform-specific
behaviors. Pixor identifies UI-related APIs by matching the package names of the
APIs. API Graph Constructor finally merges the API Graphs of different subroutines
to construct the universal API Graph of the whole input app corpus. For ease of
presentation, we denote an individual API Graph of a subroutine Si as Gi and the
universal API Graph as Gu hereinafter.
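The sketch below illustrates this construction under simplifications of our own: the graph is stored as an adjacency map, a single method's unit graph stands in for the full traversal of a subroutine, and an API is treated as UI-related if its declaring class belongs to one of a few assumed android.* packages.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import soot.Body;
import soot.Unit;
import soot.jimple.Stmt;
import soot.toolkits.graph.BriefUnitGraph;

// Sketch: build an API Graph (nodes = API signatures, directed weighted edges =
// "invoked next with no other API call in between") from one method's unit graph.
// Pixor traverses whole subroutines; this intra-procedural version keeps the
// example short, and a single visit per unit is enough for illustration.
final class ApiGraphBuilder {
    final Map<String, Map<String, Integer>> edges = new HashMap<>();

    // Assumed UI-related package prefixes; Pixor matches UI-related Android packages.
    static boolean isUiApi(Stmt stmt) {
        if (!stmt.containsInvokeExpr()) {
            return false;
        }
        String pkg = stmt.getInvokeExpr().getMethod().getDeclaringClass().getPackageName();
        return pkg.startsWith("android.view") || pkg.startsWith("android.widget")
                || pkg.startsWith("android.app");
    }

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).merge(to, 1, Integer::sum);
    }

    void build(Body body) {
        BriefUnitGraph cfg = new BriefUnitGraph(body);
        Set<Unit> visited = new HashSet<>();
        for (Unit head : cfg.getHeads()) {
            visit(cfg, head, null, visited);
        }
    }

    // Depth-first walk that remembers the last API call seen on the current path.
    private void visit(BriefUnitGraph cfg, Unit unit, String lastApi, Set<Unit> visited) {
        if (!visited.add(unit)) {
            return;
        }
        String current = lastApi;
        if (unit instanceof Stmt && isUiApi((Stmt) unit)) {
            String api = ((Stmt) unit).getInvokeExpr().getMethod().getSignature();
            if (current != null) {
                addEdge(current, api);
            }
            current = api;
        }
        for (Unit succ : cfg.getSuccsOf(unit)) {
            visit(cfg, succ, current, visited);
        }
    }
}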

To sum up, Gu abstracts the API use cases in all the apps in the input app
corpus, capturing the invoked APIs and their invocation order relationships. If the
app corpus is large enough, Gu can estimate the real-life API use cases in the wild.


Algorithm 6.1: Greedy Algorithm for the Set Cover Problem

Input: U, the universal set of n elements; S1, . . . , Sm, subsets of U
1  C ← ∅
2  while C ≠ U do
3      pick a subset Si such that |Si − C| is maximized
4      C ← C ∪ Si
5  output the picked subsets

In contrast, Gi is a subgraph of Gu and abstracts the API use cases that can be
covered by subroutine Si . This information can be leveraged to guide the selection
of subroutines to expose platform-specific behaviors.

6.1.4 Subroutine Selector: Guiding Exploration of System


Functionalities

Subroutine Selector is the key component of Pixor. Subroutine Selector guides


the exploration of system functionalities by effectively selecting subroutines containing
diverse API use cases. In this section, we will first formalize the problem of selecting
app subroutines given our API Graph model. We will then discuss and compare
several subroutine selection strategies that Pixor supports.

Problem formalization. Subroutine Selector takes the universal API Graph


Gu , extracted subroutines and their associated individual API Graphs as input to
guide the selection of subroutines. The universal API Graph Gu is an abstraction of
all the available API use cases and estimates all the system functionalities that can be
exercised by the input app corpus. The API Graph (Gi ) of an individual subroutine
(Si ) models the API use cases of Si , which estimates the system functionalities
that can be exercised by Si . Pixor leverages Gi and Gu to guide the selection
of subroutines. Since Gu is constructed by merging all the individual API Graphs, each individual API Graph is a subgraph of Gu. Intuitively, the goal of subroutine selection can thereby be formalized as selecting a minimal number of subroutines whose individual API Graphs maximize the coverage of Gu.

To achieve the selection goal, we propose to formalize the subroutine selection


problem as a set cover problem [124]. The set cover problem is a well-known NP-hard problem [124], which can be defined as:


Given a universe U of n elements and a list {S1, . . . , Sm} of subsets of U, find a collection of these subsets whose union is U while minimizing the number of selected subsets.

In our context, Gu can be viewed as the universal set of available API use cases.
API Graphs of individual subroutines can be viewed as its subsets. Our subroutine
selection problem can thereby be formalized as selecting a set of subroutines, whose
corresponding API Graphs can maximize the coverage of the universal graph Gu .

We can thereby apply the greedy algorithm for the set cover problem to design our subroutine selection strategies. Algorithm 6.1 shows the greedy algorithm for typical set cover problems. Intuitively, in each iteration, the greedy algorithm picks the subset Si that covers the largest number of uncovered elements. The algorithm

To adapt this greedy algorithm to select app subroutines, we need to define


the elements in a set. The elements in a set model system functionality use cases.
With our API Graph, system functionality use cases can be defined at different
granularities:

Nodes in API Graphs: invoked APIs. One simple and intuitive represen-
tation is to simply treat nodes in API Graphs (i.e., the invoked APIs) as elements
in sets. The universal set will be defined as the set of all the APIs in the universal
API Graph constructed from the whole input Android app corpus. Each subset,
Si , of a subroutine will be defined as the set of APIs invoked in the subroutine.
This representation only models single API invocations but cannot capture any
dependencies between APIs.

Edges in API Graphs: invocation order between APIs. Similarly, ele-


ments in a set can also be represented by the edges in API Graphs. The universal
set is naturally the set of edges in the universal API Graph. Each subset, Si , of a
subroutine will be defined as the set of edges in its API Graph. As introduced in
Section 6.1.3, an edge in an API Graph represents the invocation order between
two different APIs. Using edges as elements in a set captures additional API invocation order relationships compared with simply using nodes as set elements. At the same
time, using edges as elements may also enlarge the search space for subroutine
selection since the number of edges is much larger than the number of nodes in API
Graphs.

More complicated representations. We can derive more complicated repre-


sentations of the elements from our API Graph such as API invocation sequences (i.e.,


paths in API Graphs). More complicated representations may better characterize


dependencies or interactions between APIs. However, they would also considerably enlarge the search space of subroutine selection and further complicate the problem. In this thesis, we only focus on selecting subroutines guided by the coverage of nodes or edges in API Graphs. In the future, we plan to explore the impact of leveraging more
complicated representations to guide subroutine selection.
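To make the selection step concrete, the sketch below adapts the greedy set cover algorithm (Algorithm 6.1) to subroutine selection with nodes (invoked APIs) as set elements; the map-based representation is our own simplification.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: greedily select subroutines whose API sets cover the universal API set.
// apiSets maps a subroutine identifier to the Android APIs in its API Graph.
final class GreedySubroutineSelection {
    static List<String> select(Map<String, Set<String>> apiSets, Set<String> universe) {
        Set<String> covered = new HashSet<>();
        Set<String> remaining = new HashSet<>(apiSets.keySet());
        List<String> selected = new ArrayList<>();
        while (!covered.containsAll(universe) && !remaining.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (String id : remaining) {
                Set<String> gain = new HashSet<>(apiSets.get(id));
                gain.removeAll(covered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = id;
                }
            }
            if (best == null) {
                break;  // no remaining subroutine adds coverage; stop early
            }
            covered.addAll(apiSets.get(best));
            remaining.remove(best);
            selected.add(best);
        }
        return selected;
    }
}

Using edges as elements only changes the element type (API pairs instead of single APIs); the selection loop itself is unchanged.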

6.1.5 Test App Generator & Runner

The Test App Generator finally transforms the selected subroutines into runnable test apps. In general, the Test App Generator modifies the configuration file of an app so that the selected subroutine is directly executed once the app is launched. Note that when transforming the apps, Pixor does not carve out or change the code of the input apk files.

The generated test apps are fed into the Test App Runner.

The Test App Runner performs differential testing. It executes the generated apps on both the reference Android version and the Android version under test. The Test App Runner executes each test app three times on each Android version and produces a screenshot for each run. The generated screenshots are later analyzed to identify the triggered platform-specific behaviors. We run the test apps three times on each version to mitigate the noise induced by non-deterministic elements (e.g., animations) of the user interfaces. In the next section, we detail our analysis to identify platform-specific behaviors from the generated screenshots.

6.1.6 Identifying Triggered Platform-Specific Behaviors

Pixor identifies the triggered platform-specific behaviors by comparing the


screenshots of the same test app generated on different Android versions. Intuitively,
Pixor reports a platform-specific behavior if it identifies inconsistent UI displays
on different Android versions. To compare the UI displays on different Android
versions, we leveraged the feature-matching image similarity calculation method offered by OpenCV [5]. The feature matching method has been shown to be effective in identifying UI inconsistencies in different versions of Android apps [126].
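The exact feature detector and similarity definition are implementation details; the sketch below shows one plausible instantiation using OpenCV's Java bindings, where ORB features with brute-force Hamming matching and a match-ratio similarity are assumptions of our own rather than the configuration actually used.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfDMatch;
import org.opencv.core.MatOfKeyPoint;
import org.opencv.features2d.BFMatcher;
import org.opencv.features2d.ORB;
import org.opencv.imgcodecs.Imgcodecs;

// Sketch: a feature-matching similarity between two screenshots using OpenCV's
// Java bindings. ORB features with cross-checked brute-force Hamming matching are
// our own choice; any feature matcher supported by OpenCV could be used instead.
// Assumes the native library has been loaded, e.g. via
// System.loadLibrary(Core.NATIVE_LIBRARY_NAME).
final class ScreenshotSimilarity {
    static double similarity(String screenshotA, String screenshotB) {
        Mat imgA = Imgcodecs.imread(screenshotA, Imgcodecs.IMREAD_GRAYSCALE);
        Mat imgB = Imgcodecs.imread(screenshotB, Imgcodecs.IMREAD_GRAYSCALE);

        ORB orb = ORB.create();
        MatOfKeyPoint kpA = new MatOfKeyPoint();
        MatOfKeyPoint kpB = new MatOfKeyPoint();
        Mat descA = new Mat();
        Mat descB = new Mat();
        orb.detectAndCompute(imgA, new Mat(), kpA, descA);
        orb.detectAndCompute(imgB, new Mat(), kpB, descB);
        if (descA.empty() || descB.empty()) {
            return 0.0;
        }

        BFMatcher matcher = BFMatcher.create(Core.NORM_HAMMING, true);
        MatOfDMatch matches = new MatOfDMatch();
        matcher.match(descA, descB, matches);

        // Similarity = fraction of keypoints that found a cross-checked match.
        long minKeypoints = Math.min(kpA.total(), kpB.total());
        return minKeypoints == 0 ? 0.0 : (double) matches.total() / minKeypoints;
    }
}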

In our scenario, directly applying the feature matching method to compare the


screenshots of the same subroutine generated on different versions may induce many


false positives since non-determinism is prevalent in Android apps. For example,


advertisement widgets are widespread in Android apps. These advertisement widgets may display different contents each time they are launched. Such non-deterministic widgets can induce inconsistent app UI displays, yet such inconsistencies are not platform-specific behaviors. To mitigate this problem, we run each subroutine three times on each Android version to filter out the inconsistencies induced by non-determinism. Specifically, we compute a ranking
score for each subroutine based on the six screenshots taken on the reference Android
version and the version under test. We design the ranking score based on the intuition
that non-deterministic widgets will induce inconsistent displays even when they are
running on the same Android version. In comparison, platform-specific behaviors
can induce inconsistent app behaviors across different versions but are unlikely to
induce inconsistent behaviors on the same Android version. Based on this intuition,
we proposed the ranking score as follows:

\[ \mathit{Score}(S) = \left(\frac{\mathit{AvgSimilarity}_{reference} + \mathit{AvgSimilarity}_{test}}{2} - \mathit{AvgSimilarity}_{cross}\right) \times \mathit{AvgSimilarity}_{reference} \times \mathit{AvgSimilarity}_{test} \tag{6.1} \]

where S is a subroutine. $\mathit{AvgSimilarity}_{reference}$ represents the average similarity between each pair of screenshots generated by running S on the reference Android version, and $\mathit{AvgSimilarity}_{test}$ represents the average similarity of the screenshots generated on the Android version under test. $\mathit{AvgSimilarity}_{cross}$ represents the average similarity of the pairs of screenshots where one screenshot is generated on the reference Android version and the other is generated on the Android version under test. A higher ranking score means that the subroutine is more likely to reveal platform-specific behaviors. By rerunning the same subroutines, this approach aims to filter out subroutines that contain non-deterministic UI components. Since such non-deterministic components are very common in practice, this approach can effectively reduce the number of false positives compared to an approach that compares a single pair of screenshots generated on different Android versions. This approach may induce false negatives if a subroutine contains both non-deterministic UI components and platform-specific behaviors. In the future, we may explore techniques to isolate the non-deterministic UI components from the deterministic ones to reduce the number of false negatives.
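A minimal sketch of the score computation is shown below, assuming three screenshots per version and any pairwise similarity measure in [0, 1] (such as the feature-matching similarity above).

import java.util.List;

// Sketch: Equation (6.1) computed from per-pair screenshot similarities.
final class PlatformBehaviorScore {
    static double average(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return values.isEmpty() ? 0.0 : sum / values.size();
    }

    // reference: similarities of the 3 screenshot pairs taken on the reference version;
    // test: similarities of the 3 screenshot pairs taken on the version under test;
    // cross: similarities of the 9 cross-version screenshot pairs.
    static double score(List<Double> reference, List<Double> test, List<Double> cross) {
        double avgRef = average(reference);
        double avgTest = average(test);
        double avgCross = average(cross);
        return ((avgRef + avgTest) / 2.0 - avgCross) * avgRef * avgTest;
    }
}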


6.2 Evaluation

To evaluate the effectiveness of Pixor, we applied it to detect bugs in


the beta version of Android Q. We will first introduce our evaluation setup and then
discuss our evaluation results.

6.2.1 Evaluation Setup

Data Collection

To evaluate the effectiveness of Pixor, we collected 3,243 apk files from Google
Play [38]. These apps are all top free apps of each category on Google Play collected
in April 2019. The number of apk files is not an exact multiple of 100 since we failed to download some of the apps due to server errors. We further filtered out those apk files that cannot be successfully executed on the emulators. We also filtered out those apk files whose target SDK versions are lower than 23 since their behaviors are changed on Android Q [2]. As a result, 2,479 apk files were left as the evaluation subjects and
were input to Pixor.

Exposing Platform-Specific Behaviors in Android Q Beta

From July to August 2019, we applied Pixor to expose platform-specific
behaviors in the Android Q Beta version [2]. In testing Android Q Beta, we regarded
the stable Android P version as the reference version and compared the screenshots
generated on Android Q against those generated on Android P. To evaluate the
effectiveness of Pixor, we ran Pixor and evaluated the top 5,000 subroutines
selected by each subroutine selection strategy introduced in Section 6.1.4. All the
tests were executed on Android emulators provided by Google. The experiments
were conducted on a Linux server running CentOS 7.4 with two Intel Xeon E5-2450
Octa Core CPUs @2.1GHz and 192 GB RAM.

6.2.2 Baseline Method

To evaluate the effectiveness of Pixor, we compared Pixor with a pure random
subroutine selection strategy. The baseline method is the same as Pixor except for
its subroutine selection component: it randomly selects the subroutines used to
exercise Android systems. We ran the baseline method three times, and it selected
5,000 subroutines each time. We report the average results of the three runs as the
final result of this baseline method.

(a) Screenshot on Android P          (b) Screenshot on Android Q

Figure 6.3: Overlapping Contents on Android Q Beta 4

PermissionInfo permissionInfo = packageManager.getPermissionInfo(
        "android.permission.READ_PHONE_STATE", PackageManager.GET_META_DATA);
String group = permissionInfo.group;

Figure 6.4: Code Example That Exhibits Inconsistent Behavior on Android Q

Ground Truth

We built our ground truth by manual inspection. We compared the screenshots
of the same subroutine generated on the reference version and the version under test
to identify display differences. If we found any differences between the screenshots,
we further manually ran the generated test apps and reproduced the issues to confirm
whether the differences are platform-specific behaviors. We considered an instance
of platform-specific behavior as a true positive only if we could successfully reproduce
the different displays on different Android versions. We inspected the screenshots
of all the subroutines whose ranking scores are higher than a specific threshold
θ (θ is empirically set to 0.15 in our evaluation). We also reported all the true
platform-specific behaviors to the Android open issue tracker [3] and sought feedback
from the Android development team.


6.2.3 Evaluation Results

Identified Bugs

In our experiments, Pixor has identified 44 distinct subroutines that can trigger
platform-specific behaviors. We manually inspected all the instances and identified
seven distinct platform-specific behaviors in Android Q Beta version.

We filed bug reports for all of these identified platform-specific behaviors in the
Android Issue Tracker to seek feedback from Android system developers. Out
of the seven bugs, three are confirmed bugs and have been fixed during the
development of the beta versions. One bug was identified as an intended behavior
change. One of our reports was changed to private, so we can no longer follow its
updates. The other two bugs were still pending at the time of the submission
of this thesis.

All of our identified platform-specific behaviors exhibited inconsistent UI displays
on Android P and Android Q Beta versions. Two of them are induced by changes
to the Android permission mechanism, while the other five are induced by UI-related
changes. Figure 6.3 shows an example of our identified platform-specific behaviors.
Specifically, Figure 6.3a shows the screenshot of the subroutine running on Android
P and Figure 6.3b shows the screenshot on the Android Q Beta version. As shown in
the figures, the activity is properly displayed on Android P, but the item contents and
item numbers of the list overlap on Android Q Beta. This bug was identified
by Pixor in Android Q Beta 4 and was fixed in Android Q Beta 5. We also give
another example of platform-specific behavior, which was induced by changes to the
permission mechanism and was identified as an intended behavior change by
the Android development team. Figure 6.4 shows the code snippet that can trigger
this platform-specific behavior. Specifically, the code snippet retrieves the name
of the permission group of the permission READ_PHONE_STATE. On Android P, this code
snippet can successfully retrieve the group name of the permission regardless of whether
the permission is granted. In contrast, on Android Q, the retrieved group
name will be "UNDEFINED" if the permission is not granted. Although intended, this
behavior change can induce FIC issues in Android apps.
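
As an illustration only (not a fix prescribed by the Android documentation), an app
could guard against this behavior change along the following lines; the "UNDEFINED"
check mirrors the group name reported above and is an assumption about the exact
literal returned on Android Q:

import android.content.pm.PackageManager;
import android.content.pm.PermissionInfo;
import android.os.Build;

final class PermissionGroupCompat {
    /**
     * Returns the permission group name, or null when it cannot be relied on.
     * On Android Q, the group of an ungranted permission may be reported as
     * "UNDEFINED" (as observed above), so callers should not assume the
     * concrete group name is always available.
     */
    static String groupOf(PackageManager pm, String permission) {
        try {
            PermissionInfo info =
                    pm.getPermissionInfo(permission, PackageManager.GET_META_DATA);
            String group = info.group;
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q
                    && (group == null || group.contains("UNDEFINED"))) {
                return null; // group is hidden until the permission is granted
            }
            return group;
        } catch (PackageManager.NameNotFoundException e) {
            return null; // the permission is unknown on this device or OS build
        }
    }
}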


Table 6.1: Identified Bugs of Different Subroutine Selection Strategies.


(The results of random are decimals since the average results of three
random runs are reported)

Selection Strategy    # Instances    # Distinct Platform-Specific Behaviors
Random                9.67           4.33
Greedy Node           10             6
Greedy Edge           17             6

Comparison between Different Subroutine Selection Strategies

Table 6.1 shows the number of identified instances of platform-specific behaviors
and the number of distinct platform-specific behaviors detected by the different
subroutine selection strategies. "Random" shows the results of the random baseline
method. The results are decimals because we ran the random baseline method three
times and report the average results here. "Greedy Node" and "Greedy Edge" show
the results of Pixor using nodes and edges in API Graphs to guide the selection
of subroutines, respectively. In general, the Greedy Edge selection strategy identified
the largest number of instances of platform-specific behaviors, and the random
baseline method identified the fewest. In particular, both of the greedy-algorithm-based
subroutine selection strategies of Pixor identified six distinct platform-specific
behaviors, while the random baseline method only identified 4.33 distinct
platform-specific behaviors on average. We took a closer look at the distinct
platform-specific behaviors identified by each subroutine selection strategy and found
that (1) the Greedy Edge and Greedy Node selection strategies of Pixor identified
seven distinct platform-specific behaviors in total, and (2) the three runs of the
baseline method identified five distinct platform-specific behaviors in total, all of
which were also identified by Pixor. Such results show that Pixor is effective in
selecting subroutines to expose platform-specific behaviors.

In addition, while both the Greedy Edge and Greedy Node subroutine selection
strategies identified six distinct platform-specific behaviors, the Greedy Edge strategy
identified a larger number of platform-specific behavior instances. This shows that
the finer-grained subroutine selection strategy may induce more redundancy. This is
reasonable since some platform-specific behaviors do not require a specific API
invocation order to trigger. When edges in API Graphs are leveraged to guide
subroutine selection, considering the invocation order between APIs tends to select
multiple subroutines that trigger the same platform-specific behaviors, resulting in
redundancy.
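
For reference, the following is a minimal sketch of the greedy coverage idea underlying
both strategies, assuming each candidate subroutine has been summarized by the set of
API Graph elements (nodes for Greedy Node, edges for Greedy Edge) it covers; this is
an illustration, not Pixor's exact algorithm from Section 6.1.4:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Greedy max-coverage selection over API Graph elements (illustrative sketch). */
final class GreedySelection {
    static List<String> select(Map<String, Set<String>> coverage, int budget) {
        Set<String> covered = new HashSet<>();
        List<String> selected = new ArrayList<>();
        while (selected.size() < budget) {
            String best = null;
            int bestGain = -1;
            for (Map.Entry<String, Set<String>> entry : coverage.entrySet()) {
                if (selected.contains(entry.getKey())) continue;
                Set<String> gain = new HashSet<>(entry.getValue());
                gain.removeAll(covered); // elements not yet exercised
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = entry.getKey();
                }
            }
            if (best == null || bestGain <= 0) break; // nothing new left to cover
            selected.add(best);
            covered.addAll(coverage.get(best));
        }
        return selected;
    }
}

Under Greedy Edge, the coverage sets would contain ordered API pairs rather than
individual APIs, which is consistent with the redundancy observed above.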

6.3 Discussions

6.3.1 Limitations

Not able to locate the specific APIs that can trigger the identified platform-specific
behaviors. Currently, Pixor can only leverage existing Android apps to trigger
platform-specific behaviors of different Android versions. It cannot locate the specific
APIs that can trigger the platform-specific behaviors. It is also non-trivial to manually
identify the specific APIs that can trigger the identified platform-specific behaviors.
As a result, we cannot derive API-context pairs to detect FIC issues in Android apps
from the triggered platform-specific behaviors of Pixor. In the future, we plan to
study how to locate the specific APIs that can trigger the platform-specific behaviors
and automatically derive API-context pairs from the subroutines. Specifically, we may
try to cluster the subroutines that can trigger the same platform-specific behaviors
and identify their invoked APIs in common.

Limited use cases of newly-introduced APIs. Since Pixor leverages Android


API use cases in existing Android apps to expose platform-specific behaviors, the
effectiveness of Pixor largely depends on the API use cases available in the input
app corpus. As a result, Pixor may not be able to thoroughly exercise those
newly-introduced APIs, which may not have enough use cases in existing apps.

API parameters are not considered in API Graph. Our API Graph does not
contain any information about the values of the parameters, which may also be
triggering conditions of platform-specific behaviors. We do not consider the parameter
values at this stage since (1) it is non-trivial to precisely infer the parameter values
from static analysis and (2) taking parameter values into account may further
complicate the representation and greatly enlarge the search space.

Chapter 7

Future Work

7.1 Future Work of Pivot


With the API-device correlation extractor and prioritizer, Pivot still cannot
learn API-context pairs automatically for two major reasons: (1) noise and
duplications account for a large proportion of the top-ranked correlations, and (2)
issue-inducing and issue-fixing API-device correlations cannot be distinguished. To
address these problems, we plan to improve Pivot in two directions, which are
discussed in the following two sections.

7.1.1 API-Device Correlation Clustering

Pivot extracts and prioritizes API-device correlations individually. When
inspecting the results produced by Pivot, we observed that multiple API-device
correlations can be related to the same FIC issue. This increases the difficulty in
manually validating the API-device correlations and deriving API-context pairs. This
motivates us to further improve Pivot by outputting API-device correlations as
clusters. The individual API-device correlations related to the same FIC issue will
be output to the same cluster. We plan to propose a new clustering technique based
on the intuition that API-device correlations of the same FIC issue tend to co-occur
within nearby blocks in different apps. We will propose new metrics to measure
the distance between API-device correlations, which can be computed based on the
control flow graphs of their instances in different apps, to cluster the correlations.
This clustering step can also help reduce noise in the results produced by Pivot. This
is because while valid API-device correlations can form clusters, invalid API-device


correlations may not be included in any clusters. We will also investigate how to
distinguish issue-inducing and issue-fixing API-device correlations in the clusters
and derive API-context pairs and issue fixing patterns from the clusters.
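
As a rough sketch of this planned step (the distance metric is only outlined above, so
the one assumed here is hypothetical), correlations whose pairwise distance falls below
a threshold could be merged into clusters with a simple union-find procedure:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Threshold-based clustering of API-device correlations via union-find (sketch). */
final class CorrelationClustering {
    static <T> Collection<List<T>> cluster(List<T> correlations,
                                           ToDoubleBiFunction<T, T> distance,
                                           double threshold) {
        int n = correlations.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        // Merge correlations whose instances co-occur closely enough in app CFGs.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (distance.applyAsDouble(correlations.get(i), correlations.get(j))
                        <= threshold) {
                    union(parent, i, j);
                }
            }
        }
        Map<Integer, List<T>> clusters = new HashMap<>();
        for (int i = 0; i < n; i++) {
            clusters.computeIfAbsent(find(parent, i), k -> new ArrayList<>())
                    .add(correlations.get(i));
        }
        return clusters.values();
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    private static void union(int[] parent, int a, int b) {
        parent[find(parent, a)] = find(parent, b);
    }
}

Singleton clusters would then be natural candidates for noise, matching the intuition
that invalid API-device correlations do not form clusters.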

7.1.2 Automated API-Device Correlation Validation

In our experiments, we manually validated API-device correlations. The manual
process is subject to errors. To reduce human effort and provide more reliable
validation results, we plan to study automated validation of API-device correlations
in the future. Particularly, we plan to combine static and dynamic analysis to synthesize
test apps from the apps that contain top-ranked API-device correlations. With such
test apps, we will be able to exercise the APIs on the device models specified by
API-device correlations to validate the correlations.

7.2 Future Work of Pixor


Many possible directions can be explored to improve Pixor. Here we discuss
three of them.

Guide subroutine selection with other information. Pixor now leverages API
invocations and their invocation orders to guide subroutine selection. Other
information may also be used to guide the selection, such as the arguments passed to the
APIs, the user interface layout files, etc. Leveraging such information can help model
Android API use cases at finer-granularities and may improve the effectiveness of
Pixor in selecting subroutines.

Trigger user events after launching apps. Some platform-specific behaviors can
only be exposed after triggering user events. For example, some platform-specific
behaviors may result in broken functionalities of specific widgets, such as a ScrollView
that is not scrollable on some specific Android versions. Pixor currently does not
trigger any user events after launching the selected Activities. In the future, we
plan to further integrate user event triggering components in Pixor to expose such
platform-specific behaviors.

Exposing other types of platform-specific behaviors. In the current version, Pixor


focuses on exposing UI-related platform-specific behaviors because they are common
and exhibit specific symptoms that can be derived as test oracles. However, the idea
of Pixor is not restricted to detecting only UI-related platform-specific behaviors


if we adapt the construction of API Graph and the definition of a subroutine. In


the future, we plan to further extend Pixor to support general platform-specific
behaviors in Android or even general bugs in other systems.

Chapter 8

Conclusion

This thesis conducted a comprehensive study on fragmentation-induced compati-


bility issues in Android apps. We aim to characterize and automatically detect FIC
issues in Android apps. Considering the evolving nature of the Android ecosystem,
we also proposed a technique to learn patterns of FIC issues, which can be used to
facilitate FIC issue detection.

Specifically, we conducted the first large-scale empirical study on fragmentation-


induced compatibility issues based on 220 real-world issues collected from five popular
open-source Android apps. We investigated their root causes, consequences, and
fixing strategies. We communicated with Android practitioners via interviews and
online questionnaires to understand their common practices of testing, diagnosing
and fixing FIC issues. From the studies, we obtained several important findings that
can facilitate automated detection and repair of FIC issues, and guide future studies
on related topics.

Based on the findings of our study, we proposed to use API-context pairs that
capture issue-inducing APIs and issue-triggering contexts to model FIC issues.
With this model, we further designed and implemented a static analysis technique
FicFinder to automatically detect FIC issues in Android apps. Our evaluation
of FicFinder on 53 large-scale subjects showed that FicFinder can detect many
previously-unknown FIC issues in the apps and provide useful information to help
developers diagnose and fix the issues.

We also proposed Pivot, which can automatically learn knowledge of FIC


issues as API-device correlations from existing Android apps to generate inputs for
FicFinder and facilitate FIC issue detection. Pivot performs inter-procedural
static analysis to extract API-device correlations and leverages a novel ranking


strategy to prioritize them. The evaluation results show that our ranking strategy
can effectively identify valid API-device correlations and significantly outperform
an existing technique. Based on the learned API-device correlations, we built an
archive of FIC issues and further conducted a case study to show the usefulness of
API-device correlations.

Our latest effort aimed to actively expose platform-specific behaviors that can
induce FIC issues. We proposed Pixor, which breaks down existing Android apps
into subroutines and leverages them to perform differential testing for different
Android versions. Pixor effectively selects a minimized set of subroutines
that can maximize the exercised Android system functionalities. The evaluation
results show that Pixor can outperform the random baseline methods and effectively
expose platform-specific behaviors.

Our future work will focus on improving both Pivot and Pixor. Currently,
there is still noise in the top-ranked API-device correlations produced by Pivot.
In the future, we plan to further improve Pivot by improving its precision and
reducing duplications in its results by leveraging clustering techniques. We may also
study how to automate the validation of the API-device correlations and reduce
human efforts required by Pivot. As for Pixor, we may explore the possibility to
locate the exact APIs that induced platform-specific behaviors and further derive
API-context pairs from the subroutines.

PhD’s Publications

Thesis-related publications

• Lili Wei, Yepang Liu, Shing-Chi Cheung. Taming Android Fragmentation:


Characterizing and Detecting Compatibility Issues for Android Apps in ASE
2016: The 31st IEEE/ACM International Conference on Automated Software
Engineering, pp 226–237, Singapore, Sep 2016. ACM SIGSOFT Distin-
guished Paper Award.

• Lili Wei, Yepang Liu, S.C. Cheung, Huaxun Huang, Xuan Lu, and Xuanzhe
Liu. Understanding and Detecting Fragmentation-Induced Compatibility Issues
for Android Apps. in TSE 2018: IEEE Transactions on Software Engineering.
24 pages. Accepted, to appear.

• Lili Wei, Yepang Liu, Shing-Chi Cheung. Pivot: Learning API-Device


Correlations to Facilitate Android Compatibility Issue Detection in ICSE
2019: The 41st ACM/IEEE International Conference on Software Engineering,
pp 878–888, Montreal, Canada, May 2019. ACM SIGSOFT Distinguished
Artifact Award.

• Lili Wei, Ming Wen, Yepang Liu, Shing-Chi Cheung, Yuanyuan Zhou. Pixor:
Exposing Platform-Specific Bugs in Android Systems via Differential App
Executions. Under submission.

Other publications

• Cong Li, Chang Xu, Lili Wei, Jue Wang, Jun Ma, and Jian Lu. ELEGANT:
Towards Effective Location of Fragmentation-Induced Compatibility Issues for
Android Apps in APSEC 2018: Proceedings of the 25th Asia-Pacific Software
Engineering Conference, pp. 278–287, Nara, Japan, Dec 2018.

• Huaxun Huang, Lili Wei, Yepang Liu, and S.C. Cheung. Understanding and
Detecting Callback Compatibility Issues for Android Applications in ASE
2018: The 33rd IEEE/ACM International Conference on Automated Software
Engineering, pp 702–713, Montpellier, France, Sep 2018.

• Jiajun Hu, Lili Wei, Yepang Liu, S.C. Cheung, and Huaxun Huang. A
Tale of Two Cities: How WebView Induces Bugs to Android Applications
in ASE 2018: The 33rd IEEE/ACM International Conference on Automated
Software Engineering, pp 702–713, Montpellier, France, Sep 2018.

• Lili Wei, Yepang Liu, and S.C. Cheung. OASIS: Prioritizing Static Analysis
Warnings for Android Apps Based on App User Reviews in ESEC/FSE 2017:
The 11th joint meeting of the European Software Engineering Conference and
the ACM SIGSOFT Symposium on the Foundations of Software Engineering,
pp 672–682, Paderborn, Germany, Sep 2017.

References

[1] Android Open Source Project – Issue Tracker. https://code.google.com/p/


android/issues/.

[2] Android 10. https://developer.android.com/about/versions/10, 2019.

[3] Android Public Issue Tracker. https://issuetracker.google.com/issues?


q=componentid:190923, 2019.

[4] ApkTool. https://ibotpeaches.github.io/Apktool/, 2019.

[5] Feature Matching. https://docs.opencv.org/3.4/dc/dc3/tutorial_py_


matcher.html, 2019.

[6] Introduction to Activities. https://developer.android.com/guide/


components/activities/intro-activities, 2019.

[7] AFWall+. https://github.com/ukanth/afwall, 2019-01-17.

[8] Amazon Device Farm. https://aws.amazon.com/device-farm/, 2019-01-17.

[9] And Bible. https://github.com/mjdenham/and-bible, 2019-01-17.

[10] Android - Proximity Sensor Accuracy - Stack Overflow.


http://stackoverflow.com/questions/16876516/proximity-sensor-
accuracy/29954988#29954988, 2019-01-17.

[11] Android API Guide. https://developer.android.com/guide/index.html,


2019-01-17.

[12] Android Fragmentation Visualized (August 2015). http://opensignal.com/


reports/2015/08/android-fragmentation/, 2019-01-17.

[13] Android Interfaces and Architecture. https://source.android.com/devices/index.html,


2019-01-17.

[14] Android’s Fragmentation Problem. http://www.greyheller.com/Blog/androids-
fragmentation-problem/, 2019-01-17.

[15] AndStatus. https://github.com/andstatus/andstatus, 2019-01-17.

[16] AnkiDroid. https://github.com/ankidroid/Anki-Android, 2019-01-17.

[17] AntennaPod. https://github.com/AntennaPod/AntennaPod, 2019-01-17.

[18] AnySoftKeyboard. https://github.com/AnySoftKeyboard/AnySoftKeyboard/,


2019-01-17.

[19] Barcode Scanner. https://github.com/zxing/zxing, 2019-01-17.

[20] Calendula. https://github.com/citiususc/calendula, 2019-01-17.

[21] Camera.Parameters.setRecordingHint and aspect ratio - Stack Overflow. https:


//stackoverflow.com/a/22561537, 2019-01-17.

[22] CSipSimple. https://github.com/r3gis3r/CSipSimple, 2019-01-17.

[23] CSipSimple – Issues. https://code.google.com/archive/p/csipsimple/


issues/, 2019-01-17.

[24] Dashboards — Android Developers. http://developer.android.com/about/


dashboards/index.html, 2019-01-17.

[25] Device List - Amazon Device Farm. https://aws.amazon.com/device-farm/


device-list/, 2019-01-17.

[26] Diode. https://github.com/zagaberoo/diode, 2019-01-17.

[27] DuckDuckGo. https://github.com/duckduckgo/android-search-and-stories,


2019-01-17.

[28] F-Droid. https://f-droid.org/, 2019-01-17.

[29] Face Slim. https://github.com/indywidualny/FaceSlim, 2019-01-17.

[30] FicFinder Project Website. http://sccpu2.cse.ust.hk/ficfinder/, 2019-


01-17.

[31] Forecastie. https://github.com/martykan/forecastie, 2019-01-17.

[32] Gadgetbridge. https://github.com/Freeyourgadget/Gadgetbridge, 2019-


01-17.

[33] GitHub. https://github.com/, 2019-01-17.

[34] Glucosio. https://github.com/Glucosio/glucosio-android, 2019-01-17.

[35] Good Weather. https://github.com/qqq3/good-weather, 2019-01-17.

[36] Google. https://google.com, 2019-01-17.

[37] Google Code. https://code.google.com/, 2019-01-17.

[38] Google play. https://play.google.com/store/apps, 2019-01-17.

[39] iFixit Android. https://github.com/iFixit/iFixitAndroid, 2019-01-17.

[40] Image preview is very slow on Nexus4 - Stack Over-


flow. https://stackoverflow.com/questions/20252352/
image-preview-is-very-slow-on-nexus4, 2019-01-17.

[41] jbool expressions. https://github.com/bpodgursky/jbool_expressions,


2019-01-17.

[42] K9-Mail. https://github.com/k9mail/k-9, 2019-01-17.

[43] KISS. https://github.com/Neamar/KISS, 2019-01-17.

[44] LeafPic. https://github.com/HoraApps/LeafPic, 2019-01-17.

[45] LG Developer - NullPointerException in PhoneWindow with custom PanelView.


http://developer.lge.com/community/forums/RetrieveForumContent.
dev?detailContsId=FC29190703, 2019-01-17.

[46] LibreTorrent. https://github.com/proninyaroslav/libretorrent, 2019-


01-17.

[47] Materialistic - Hacker News. https://github.com/hidroh/materialistic,


2019-01-17.

[48] MiMangaNu. https://github.com/raulhaag/MiMangaNu, 2019-01-17.

[49] MinTube. https://github.com/imshyam/mintube, 2019-01-17.

[50] MozStumbler. https://github.com/mozilla/MozStumbler, 2019-01-17.

[51] Muzei. https://github.com/romannurik/muzei, 2019-01-17.

[52] NewsBlur. https://github.com/samuelclay/NewsBlur/, 2019-01-17.

[53] OctoDroid. https://github.com/slapperwan/gh4a, 2019-01-17.

[54] Omni-Notes. https://github.com/federicoiosue/Omni-Notes, 2019-01-17.

[55] On Android Compatibility. http://android-developers.blogspot.com/


2010/05/on-android-compatibility.html, 2019-01-17.

[56] Open Food. https://github.com/openfoodfacts/


openfoodfacts-androidapp, 2019-01-17.

[57] Open Launcher. https://github.com/OpenLauncherTeam/openlauncher,


2019-01-17.

[58] Open Manga. https://github.com/nv95/OpenManga, 2019-01-17.

[59] Open Tasks. https://github.com/dmfs/opentasks, 2019-01-17.

[60] OsmAnd. https://github.com/osmandapp/Osmand, 2019-01-17.

[61] ownCloud. https://github.com/hypery2k/owncloud, 2019-01-17.

[62] Pedometer. https://github.com/j4velin/Pedometer, 2019-01-17.

[63] Pivot Homepage. https://ficissuepivot.github.io/Pivot/, 2019-01-17.

[64] Position Sensors - Android Developers. http://developer.android.com/


guide/topics/sensors/sensors_position.html, 2019-01-17.

[65] QKSMS. https://github.com/qklabs/qksms, 2019-01-17.

[66] RasPi Check. https://github.com/eidottermihi/rpicheck, 2019-01-17.

[67] Red Moon. https://github.com/LibreShift/red-moon, 2019-01-17.

[68] Red Reader. https://github.com/QuantumBadger/RedReader, 2019-01-17.

[69] Riot.im. https://github.com/vector-im/riot-android, 2019-01-17.

[70] Save For Offline. https://github.com/JonasCz/save-for-offline, 2019-01-17.

[71] Simple Camera. https://github.com/SimpleMobileTools/Simple-Camera,


2019-01-17.

[72] Simple Task. https://github.com/mpcjanssen/simpletask-android, 2019-


01-17.

[73] Slide for Reddit. https://github.com/ccrama/Slide, 2019-01-17.

[74] Smartphone OS Market Share, 2015 Q2.
http://www.idc.com/prodserv/smartphone-os-market-share.jsp, 2019-01-17.

[75] SoundWaves. https://github.com/bottiger/SoundWaves, 2019-01-17.

[76] Stack overflow. https://stackoverflow.com/, 2019-01-17.

[77] Stack Overflow - NullPointerException. https://stackoverflow.com/


questions/26833242/nullpointerexception-phonewindowonkeyuppanel1002-main/
27024610#27024610, 2019-01-17.

[78] Stocks Widget. https://github.com/premnirmal/StockTicker, 2019-01-17.

[79] StreetComplete. https://github.com/westnordost/StreetComplete, 2019-


01-17.

[80] Subsonic. https://github.com/daneren2005/Subsonic, 2019-01-17.

[81] Survival Manual. https://github.com/ligi/SurvivalManual, 2019-01-17.

[82] There are now more than 24,000 different Android devices. https://qz.com/
472767/there-are-now-more-than-24000-different-android-devices/,
2019-01-17.

[83] Timber. https://github.com/naman14/Timber, 2019-01-17.

[84] Tinfoil for Facebook. https://github.com/velazcod/Tinfoil-Facebook,


2019-01-17.

[85] Transdroid. https://github.com/erickok/transdroid, 2019-01-17.

[86] Twidere for Android. https://github.com/TwidereProject/


Twidere-Android, 2019-01-17.

[87] UI/Application Exerciser Monkey. https://developer.android.com/


studio/test/monkey.html, 2019-01-17.

[88] VLC – Android. https://code.videolan.org/videolan/vlc-android, 2019-01-17.

[89] VLC Bug Tracker. https://trac.videolan.org/vlc/, 2019-01-17.

[90] Wallabag. https://github.com/wallabag/android-app, 2019-01-17.

[91] Walleth. https://github.com/walleth/walleth, 2019-01-17.

[92] Web Opac App. https://github.com/opacapp/opacclient, 2019-01-17.

[93] Weechat Android. https://github.com/ubergeek42/weechat-android,
2019-01-17.

[94] WeTest. http://wetest.qq.com, 2019-01-17.

[95] Wikimedia Commons. https://github.com/commons-app/apps-android-


commons, 2019-01-17.

[96] Wikipedia. https://github.com/wikimedia/apps-android-wikipedia, 2019-01-


17.

[97] Xabber. https://github.com/redsolution/xabber-android, 2019-01-17.

[98] Yaxim. https://github.com/ge0rg/yaxim, 2019-01-17.

[99] Yousra Aafer, Xiao Zhang, and Wenliang Du. Harvesting inconsistent security
configurations in custom android roms via differential analysis. In 25th
USENIX Security Symposium (USENIX Security 16), pages 1153–1168,
2016.

[100] Mithun Acharya, Tao Xie, Jian Pei, and Jun Xu. Mining API Patterns as
Partial Orders from Source Code: From Usage Scenarios to Specifications. In
ESEC/FSE, pages 25–34. ACM, 2007.

[101] Frances E Allen. Control flow analysis. In ACM Sigplan Notices, volume 5,
pages 1–19. ACM, 1970.

[102] Gabriele Bavota, Mario Linares-Vasquez, Carlos Eduardo Bernal-Cardenas,


Massimiliano Di Penta, Rocco Oliveto, and Denys Poshyvanyk. The Impact of
API Change-and Fault-proneness on the User Ratings of Android Apps. TSE,
41(4):384–407, 2015.

[103] Anne Chao and Tsung-Jen Shen. Nonparametric Estimation of Shannon’s


Index of Diversity When There are Unseen Species in Sample. Environmental
and Ecological Statistics, 10(4):429–443, 2003.

[104] Kai Chen, Peng Liu, and Yingjun Zhang. Achieving Accuracy and Scalability
Simultaneously in Detecting Application Clones on Android Markets. In ICSE,
pages 175–186, 2014.

[105] Yang Chen, Alex Groce, Chaoqiang Zhang, Weng-Keen Wong, Xiaoli Fern,
Eric Eide, and John Regehr. Taming compiler fuzzers. In ACM SIGPLAN
Notices, volume 48, pages 197–208. ACM, 2013.

[106] Yuting Chen, Ting Su, and Zhendong Su. Deep differential testing of jvm
implementations. In Proceedings of the 41st International Conference on
Software Engineering, pages 1257–1268. IEEE Press, 2019.

[107] Yuting Chen, Ting Su, Chengnian Sun, Zhendong Su, and Jianjun Zhao.
Coverage-directed differential testing of jvm implementations. In ACM SIG-
PLAN Notices, volume 51, pages 85–99. ACM, 2016.

[108] Shauvik Roy Choudhary, Alessandra Gorla, and Alessandro Orso. Automated
Test Input Generation for Android: Are We There Yet? In ASE, pages 429–440,
2015.

[109] John W. Creswell. Qualitative Inquiry and Research Design: Choosing Among
Five Approaches (3rd Edition). SAGE Publications, Inc., 2013.

[110] Emilio Cruciani, Breno Miranda, Roberto Verdecchia, and Antonia Bertolino.
Scalable approaches for test suite reduction. In Proceedings of the 41st In-
ternational Conference on Software Engineering, pages 419–429. IEEE Press,
2019.

[111] Lingling Fan, Ting Su, Sen Chen, Guozhu Meng, Yang Liu, Lihua Xu, Geguang
Pu, and Zhendong Su. Large-Scale Analysis of Framework-Specific Exceptions
in Android Apps. In ICSE, pages 459–472, 2018.

[112] Mattia Fazzini and Alessandro Orso. Automated Cross-Platform Inconsistency


Detection for Mobile Apps. In ASE, pages 308–318, 2017.

[113] Jeanne Ferrante, Karl J Ottenstein, and Joe D Warren. The program depen-
dence graph and its use in optimization. TOPLAS, 9(3):319–349, 1987.

[114] Jaroslav Fowkes and Charles Sutton. Parameter-Free Probabilistic API Mining
across GitHub. In FSE, pages 254–265, 2016.

[115] John C Gower and Gavin JS Ross. Minimum Spanning Trees and Single
Linkage Cluster Analysis. Applied statistics, pages 54–64, 1969.

[116] David Grove, Greg DeFouw, Jeffrey Dean, and Craig Chambers. Call Graph
Construction in Object-Oriented Languages. In OOPSLA, pages 108–124,
1997.

[117] Matthew Halpern, Yuhao Zhu, Ramesh Peri, and Vijay Janapa Reddi. Mo-
saic: Cross-Platform User-Interaction Record and Replay for the Fragmented
Android Ecosystem. In ISPASS, pages 215–224, 2015.

[118] Hyung Kil Ham and Young Bom Park. Designing Knowledge Base Mobile
Application Compatibility Test System for Android Fragmentation. IJSEIA,
8(1):303–314, 2014.

[119] Dan Han, Chenlei Zhang, Xiaochao Fan, Abram Hindle, Kenny Wong, and
Eleni Stroulia. Understanding Android Fragmentation with Topic Analysis of
Vendor-Specific Bugs. In WCRE, pages 83–92, 2012.

[120] Edith Hemaspaandra and Gerd Wechsung. The Minimization Problem for
Boolean Formulas. In Foundations of Computer Science, 1997. Proceedings.,
38th Annual Symposium on, pages 575–584, 1997.

[121] Andreas Holzinger, Peter Treitler, and Wolfgang Slany. Making Apps Useable
on Multiple Different Mobile Platforms: On Interoperability for Business
Application Development on Smartphones. In CD-ARES, pages 176–189. 2012.

[122] Mona Erfani Joorabchi, Ali Mesbah, and Philippe Kruchten. Real Challenges
in Mobile App Development. In ESEM, pages 15–24, 2013.

[123] Jouko Kaasila, Denzil Ferreira, Vassilis Kostakos, and Timo Ojala. Testdroid:
Automated Remote UI Testing on Android. In MUM, number 28, 2012.

[124] Richard M Karp. Reducibility among combinatorial problems. In Complexity


of computer computations, pages 85–103. Springer, 1972.

[125] Hammad Khalid, Meiyappan Nagappan, Emad Shihab, and Ahmed E Hassan.
Prioritizing the Devices to Test Your App on: A Case Study of Android Game
Apps. In FSE, pages 610–620, 2014.

[126] Taeyeon Ki, Chang Min Park, Karthik Dantu, Steven Y Ko, and Lukasz Ziarek.
Mimic: Ui compatibility testing system for android apps. In Proceedings of the
41st International Conference on Software Engineering, pages 246–256. IEEE
Press, 2019.

[127] Yunho Kim, Youil Kim, Taeksu Kim, Gunwoo Lee, Yoonkyu Jang, and Moonzoo
Kim. Automated Unit Testing of Large Industrial Embedded Software Using
Concolic Testing. In ASE, pages 519–528, 2013.

[128] Pavneet Singh Kochhar, Ferdian Thung, Nachiappan Nagappan, Thomas


Zimmermann, and David Lo. Understanding the test automation culture of
app developers. In Software Testing, Verification and Validation (ICST), 2015
IEEE 8th International Conference on, pages 1–10. IEEE, 2015.

[129] Stephen Kyle, Hugh Leather, Björn Franke, Dave Butcher, and Stuart Monteith.
Application of domain-aware binary fuzzing to aid android virtual machine
testing. In ACM SIGPLAN Notices - VEE ’15, volume 50, pages 121–132.
ACM, 2015.

[130] Patrick Lam, Eric Bodden, Ondrej Lhoták, and Laurie Hendren. The Soot
Framework for Java Program Analysis: A Retrospective. In CETUS, 2011.

[131] Vu Le, Mehrdad Afshari, and Zhendong Su. Compiler validation via equivalence
modulo inputs. In ACM SIGPLAN Notices, volume 49, pages 216–226. ACM,
2014.

[132] Owolabi Legunsen, Farah Hariri, August Shi, Yafeng Lu, Lingming Zhang,
and Darko Marinov. An extensive study of static regression test selection in
modern software evolution. In Proceedings of the 2016 24th ACM SIGSOFT
International Symposium on Foundations of Software Engineering, pages 583–
594. ACM, 2016.

[133] Huoran Li, Xuan Lu, Xuanzhe Liu, Tao Xie, Kaigui Bian, Felix Xiaozhu Lin,
Qiaozhu Mei, and Feng Feng. Characterizing Smartphone Usage Patterns from
Millions of Android Users. In IMC, pages 459–472, 2015.

[134] Huoran Li, Xuan Lu, Xuanzhe Liu, Tao Xie, Kaigui Bian, Felix Xiaozhu Lin,
Qiaozhu Mei, and Feng Feng. Characterizing Smartphone Usage Patterns from
Millions of Android Users. In IMC, pages 459–472, 2015.

[135] Li Li, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. An In-
vestigation into the Use of Common Libraries in Android Apps. In SANER,
volume 1, pages 403–414, 2016.

[136] Li Li, Tegawendé F Bissyandé, Haoyu Wang, and Jacques Klein. CiD: Au-
tomating the Detection of API-Related Compatibility Issues in Android Apps.
In ISSTA, pages 153–163, 2018.

[137] Li Li, Jun Gao, Tegawendé F Bissyandé, Lei Ma, Xin Xia, and Jacques Klein.
Characterising deprecated android apis. In MSR 2018, 2018.

[138] Zhenmin Li and Yuanyuan Zhou. PR-Miner: Automatically Extracting Implicit


Programming Rules and Detecting Violations in Large Software Code. In
ESEC/FSE, pages 306–315, 2005.

[139] Mario Linares-Vásquez, Gabriele Bavota, Carlos Bernal-Cárdenas, Massimil-
iano Di Penta, Rocco Oliveto, and Denys Poshyvanyk. API Change and Fault
Proneness: A Threat to the Success of Android Apps. In FSE, pages 477–487,
2013.

[140] Mario Linares-Vásquez, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto,


and Denys Poshyvanyk. How Do API Changes Trigger Stack Overflow Discus-
sions? A Study on the Android SDK. In ICPC, pages 83–94, 2014.

[141] Mario Linares-Vásquez, Carlos Bernal-Cárdenas, Kevin Moran, and Denys


Poshyvanyk. How do developers test android applications? In ICSME, pages
613–622. IEEE, 2017.

[142] Mario Linares-Vasquez, Christopher Vendome, Qi Luo, and Denys Poshyvanyk.


How developers detect and fix performance bottlenecks in android apps. In
ICSME, pages 352–361. IEEE, 2015.

[143] Yepang Liu, Chang Xu, and Shing-Chi Cheung. Characterizing and Detecting
Performance Bugs for Smartphone Applications. In ICSE, pages 1013–1024,
2014.

[144] Benjamin Livshits and Thomas Zimmermann. DynaMine: Finding Common


Error Patterns by Mining Software Revision Histories. In ESEC/FSE, pages
296–305. ACM, 2005.

[145] Long Lu, Zhichun Li, Zhenyu Wu, Wenke Lee, and Guofei Jiang. Chex:
Statically Vetting Android Apps for Component Hijacking Vulnerabilities. In
CCS, pages 229–240. ACM, 2012.

[146] Xuan Lu, Xuanzhe Liu, Huoran Li, Tao Xie, Qiaozhu Mei, Dan Hao, Gang
Huang, and Feng Feng. PRADA: Prioritizing Android Devices for Apps by
Mining Large-Scale Usage Data. In ICSE, pages 3–13, 2016.

[147] Tyler McDonnell, Bonnie Ray, and Miryung Kim. An Empirical Study of API
Stability and Adoption in the Android Ecosystem. In ICSM, pages 70–79,
2013.

[148] William M McKeeman. Differential testing for software. Digital Technical


Journal, 10(1):100–107, 1998.

[149] Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob
Siemborski, and John Micco. Taming google-scale continuous testing. In

Proceedings of the 39th International Conference on Software Engineering:
Software Engineering in Practice Track, pages 233–242. IEEE Press, 2017.

[150] Fatih Nayebi, Jean-Marc Desharnais, and Alain Abran. The State of the Art
of Mobile Application Usability Evaluation. In CCECE, pages 1–4, 2012.

[151] Hoan Anh Nguyen, Robert Dyer, Tien N Nguyen, and Hridesh Rajan. Mining
Preconditions of APIs in Large-Scale Code Corpus. In FSE, pages 166–177,
2014.

[152] Abhinav Pathak, Y Charlie Hu, and Ming Zhang. Bootstrapping Energy
Debugging on Smartphones: A First Look at Energy Bugs in Mobile Devices.
In HotNets, number 5. ACM, 2011.

[153] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Deepxplore: Auto-
mated whitebox testing of deep learning systems. In Proceedings of the 26th
Symposium on Operating Systems Principles, pages 1–18. ACM, 2017.

[154] Charles Prud’homme, Jean-Guillaume Fages, and Xavier Lorca. Choco Docu-
mentation. TASC - LS2N CNRS UMR 6241, COSLING S.A.S., 2017.

[155] Murali Krishna Ramanathan, Ananth Grama, and Suresh Jagannathan. Static
Specification Inference Using Predicate Mining. In PLDI, pages 123–134, 2007.

[156] John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, and Xuejun
Yang. Test-case reduction for c compiler bugs. In ACM SIGPLAN Notices,
volume 47, pages 335–346. ACM, 2012.

[157] Martin P Robillard, Eric Bodden, David Kawrykow, Mira Mezini, and Tristan
Ratchford. Automated API Property Inference Techniques. TSE, 39(5):613–637,
2013.

[158] Claude E Shannon. Communication Theory of Secrecy Systems. Bell Labs


Technical Journal, 28(4):656–715, 1949.

[159] Ian F Spellerberg and Peter J Fedor. A Tribute to Claude Shannon (1916–2001)
and a Plea for More Rigorous Use of Species Richness, Species Diversity and
the “Shannon–Wiener” Index. Global Ecology and Biogeography, 12(3):177–179,
2003.

[160] Yuan Tian, Meiyappan Nagappan, David Lo, and Ahmed E Hassan. What
are the Characteristics of High-Rated Apps? A Case Study on Free Android
Applications. In ICSME, pages 301–310, 2015.

[161] Sergiy Vilkomir and Brandi Amstutz. Using Combinatorial Approaches for
Testing Mobile Applications. In ICSTW, pages 78–83, 2014.

[162] Jue Wang, Yingnong Dang, Hongyu Zhang, Kai Chen, Tao Xie, and Dongmei
Zhang. Mining Succinct and High-Coverage API Usage Patterns from Source
Code. In MSR, pages 319–328. IEEE Press, 2013.

[163] Lili Wei, Yepang Liu, and Shing-Chi Cheung. Taming Android Fragmentation:
Characterizing and Detecting Compatibility Issues for Android Apps. In ASE,
pages 226–237, 2016.

[164] Lili Wei, Yepang Liu, and Shing-Chi Cheung. Understanding and detecting
fragmentation-induced compatibility issues in android apps. In TSE, 2018.

[165] Lili Wei, Yepang Liu, and Shing-Chi Cheung. Pivot: Learning api-device
correlations to facilitate android compatibility issue detection. In ICSE, 2019.

[166] Lili Wei, Yepang Liu, and Shing-Chi Cheung. PIVOT: Learning API-Device
Correlations to Facilitate Android Compatibility Issue Detection. In Proceedings
of the 41th International Conference on Software Engineering, ICSE 2019,
page 11, 2019.

[167] Lei Wu, Michael Grace, Yajin Zhou, Chiachih Wu, and Xuxian Jiang. The
Impact of Vendor Customizations on Android Security. In CCS, pages 623–634,
2013.

[168] Tao Xie and Jian Pei. MAPO: Mining API Usages from Open Source Reposi-
tories. In MSR, pages 54–57. ACM, 2006.

[169] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. Finding and under-
standing bugs in c compilers. In ACM SIGPLAN Notices, volume 46, pages
283–294. ACM, 2011.

[170] Shin Yoo and Mark Harman. Regression testing minimization, selection and
prioritization: a survey. Software Testing, Verification and Reliability, 22(2):67–
120, 2012.

[171] Dongsong Zhang and Boonlit Adipat. Challenges, Methodologies, and Issues
in the Usability Testing of Mobile Applications. IJHCI, 18(3):293–308, 2005.

[172] Hang Zhang, Dongdong She, and Zhiyun Qian. Android ion hazard: The
curse of customizable memory management system. In Proceedings of the 2016

ACM SIGSAC Conference on Computer and Communications Security, pages
1663–1674. ACM, 2016.

[173] Lingming Zhang. Hybrid regression test selection. In 2018 IEEE/ACM 40th
International Conference on Software Engineering (ICSE), pages 199–209.
IEEE, 2018.

[174] Hao Zhong, Tao Xie, Lu Zhang, Jian Pei, and Hong Mei. MAPO: Mining and
Recommending API Usage Patterns. In ECOOP, pages 318–343. Springer,
2009.

[175] Xiaoyong Zhou, Yeonjoon Lee, Nan Zhang, Muhammad Naveed, and XiaoFeng
Wang. The Peril of Fragmentation: Security Hazards in Android Device Driver
Customizations. In SP, pages 409–423, 2014.
