Professional Documents
Culture Documents
Witals - AP-centric Health Diagnosis of WiFi
Witals - AP-centric Health Diagnosis of WiFi
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
1
Abstract—In recent years, WiFi has grown in capacity as well cafeterias, restaurants, shops, WiFi at point-of-sales and so
as deployment demand. WiFi system administrators (sysads) on. There are some AP-centric diagnosis frameworks ([19],
want a simple answer to the question “Is my WiFi network [20], [23]) as well, but these too stop short of diagnosing
healthy?”, and a possible follow-up “What is wrong with it?”, if
it is reported as “unhealthy”. But we are far from having such the root cause(s) of WiFi performance problems. They do not
an interface today. It is this gap that this work attempts to fill. have a framework where multiple simultaneous causes can
We present Witals, a system for WiFi performance diagnosis. We be quantified or compared. On the other hand, the state-of-
first design a causal diagnosis graph, that extensively identifies the art commercial tools ([3], [5]) provide plots of various
the underlying cause(s) of WiFi performance problems. Next, we metrics (e.g., error rate, bytes downloaded, number of radios
identify a set of metrics corresponding to nodes in this causal
graph. These metrics are measured in real-time by an operational in the vicinity, throughput etc.) and use many thresholds. These
AP, and help in quantifying the effect of each cause. We design a metrics are incomparable to one another, thus making it further
diagnosis algorithm based on the causal graph and the metrics, difficult to diagnose root cause(s) of performance problems.
which ultimately presents a sanitized view of WiFi network health The design of Witals explicitly avoids the requirement of
to the sysad. We have implemented a prototype of Witals on any distributed trace collection, or centralized merging and
an enterprise grade 802.11n AP platform. Using a variety of
controlled as well as real-life measurements, we show that our analysis, instead uses an AP-centric architecture: a live AP is
diagnosis framework follows ground truth accurately. Witals has constantly also diagnosing any performance problems present.
also helped the sysads uncover some unexpected diagnoses. Witals focuses on WiFi performance problems. Further, we
Index Terms: Health of wireless networks, AP-centric, are focused on diagnosis of WiFi performance in the downlink
Causal graph, Metrics direction, which is the common case.
I. I NTRODUCTION For such performance diagnosis, we start with a causal
In the last decade, the raw capacity of WiFi has increased, diagnosis graph which relates various causes to different kinds
from a meager 1-2Mbps with pre-802.11b, to several hundreds of performance problems. The nodes in the graph are possible
of Mbps with MIMO-based 802.11n, and few Gbps with WiFi performance problems. The edges in the graph are causal
802.11ac. The reach and demand of WiFi deployments have relationships between the nodes. The causal diagnosis graph is
grown too, and so has the variety of clients. However, the non-trivial, given the complex interplay of various phenomena
system administration related pain-points of WiFi installations at different layers of the network stack, in a WiFi network.
have not decreased. Empirical evidence suggests that wireless Our causal diagnosis graph identifies an extensive set of
is considered problematic, even when the problem may lie performance problems (Sec. III). This distilled yet extensive
elsewhere. picture of WiFi performance causes is our first contribution,
As the size and geographic reach of cloud-managed WiFi vis-à-vis prior work.
deployments increase, it is critical for system administrators Corresponding to different nodes of the causal diagnosis
(sysads) to have a simple interface which tells the health status graph, we identify metrics (witals: WiFi vital statistics) which
of a WiFi network: Is the WiFi network healthy? If not, what can be measured real-time in an operational AP (Sec. IV).
ails it? While this is an important question, such performance The set of metrics considered in prior work is vast; e.g.,
diagnosis is technically challenging due to the multitude of MAC layer throughput, frame loss rate, bit-rate. Some of
inter-dependent and temporally varying factors. We address these metrics affect many nodes in the causal graph, hence,
this challenge with the design, implementation, and evaluation it is not possible to diagnose the root cause(s) from these
of Witals, a system for health diagnosis of WiFi installations. metrics. On the other hand, some metrics do not affect any
Prior approaches toward such a WiFi performance diagnosis node. Further, some of the metrics are not in the same unit
system have been incomplete (details in Sec. II), given the as other metrics, for example, average bit-rate with airtime
complexity of such diagnosis. Most prior research works either utilization and so on. In contrast, our metrics (a) are directly
use heavy weight sensor infrastructure ([11], [15]) or extra related to specific nodes of the diagnosis graph, thus helping
radio at the AP ([19], [22]) or need client side modifications in diagnosis, (b) quantify the extent to which a cause is
([7], [9]). Most of them also use a central analysis engine. affecting the overall performance, and importantly (c) they
But it is hard to justify such additional trace-collection in- are comparable with one another. The metrics being compa-
frastructure for a large fraction of WiFi installations: small rable allows the identification and quantification of multiple
and medium businesses, home-office, guest WiFi portals at simultaneous causes of WiFi performance problems, which
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
2
50
W
0
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
3
diagnose problems at the client side. These architectures use the AP. This provides a wrong diagnosis when this ratio is high
help from other nearby clients. In WiMed [13], the diagnosis for other reasons (e.g., slow talker) apart from contention. We
is done at the client using a second sensing radio. It classifies have implemented WISE and compared it with Witals as part
non-WiFi interference types to: bluetooth, microwave etc.. of evaluation in Sec. VI.
These architectures, unlike our work Witals, require client Table I compares prior work with Witals. In summary,
side modifications. In a way, these are complementary to none of the prior work addresses WiFi performance diagnosis
Witals, which performs diagnosis at an AP. in an extensive yet simplified manner for a sysad. Nor can
3) AP-centric architecture: These research tools use inline they quantify the effect of a cause or diagnose multiple
measurement architecture i.e., AP as the vantage point. They simultaneous causes. In addition, much of the prior work has
collect measurements at the operating AP itself and avoid used 802.11a/b/g for prototyping and evaluation. The Witals
use of any other infrastructure for trace collection. The work prototype uses 802.11n, which is a significant test of its low
in WTF [23] is interested in distinguishing between wireless overhead in real-time data collection and diagnosis.
faults versus other ISP faults in home networks. But it stops In terms of commercial systems (e.g., [1], [2], [3], [5]),
short of diagnosing the root cause of any wireless problem while AP vendors are very interested in WiFi performance
present. Architecturally, it is similar to Witals in the use of diagnosis, the state of the art is to present a variety of graphs
an AP-centric measurement framework, but unlike Witals, the of different metrics (e.g., error rate, bytes downloaded, number
fault diagnosis algorithm runs at a central server. WTF also of radios in the vicinity, throughput etc.) incomparable to one
requires TCP header inspection, unlike Witals. another. Also, any problem reporting is based on arbitrarily-
Antler [20] is another system for AP-centric wireless perfor- set thresholds for these different metrics. The incomparable
mance diagnosis. It uses a hierarchical approach for collecting metrics make a sysad’s job difficult to diagnose the root cause
metrics. As a baseline operation, Antler collects a minimal of WiFi performance problem. We quantitatively compare
set of metrics to represent the health of a wireless network. Witals with popular commercial tools in Sec. VI.
Once these metrics cross the threshold, the second tier of
metrics are collected to have a more detailed view of the III. C AUSAL D IAGNOSIS G RAPH
network. If second tier of metrics also cross the threshold,
The performance of a WiFi network might be poor for a
then collection of third tier of metrics starts and so on. Like
variety of reasons. We design Witals to examine the health
WTF, Antler too requires data to be sent to a central server for
of the wireless networks. Additionally, it can be integrated
diagnosis. As we shall show in Sec. VI, the causal diagnosis in
with other network management solutions which diagnose
Witals is extensive. On the other hand, Antler cannot diagnose
problems outside the WiFi network. Witals diagnoses the
several wireless conditions such as slow talker, sub-optimal
WiFi network’s health from the perspective of the downlink
rate-adaptation, 802.11n aggregation-related throughput loss
performance of clients of the AP on which Witals runs. While
and so on. It also cannot distinguish between different causes
uplink performance issues are important too, the focus of this
of interference, which Witals can. Further, it does not provide
work is diagnosing downlink performance issues. In Sec. VII,
a framework where the effect of multiple simultaneous causes
we discuss how we can extend this framework to diagnose
can be quantified and compared. We have implemented Antler
the uplink performance issues. Henceforth in the paper, we
and compared it with Witals in Sec. VI.
use the term APW to refer to the access point on which an
WISE [19] is an AP-centric measurement framework for
instance of Witals is running, while talking about performance
observing home wireless experience. The framework uses
diagnosis by that AP.
primary radio of the AP to collect information about its clients,
We first describe the causality between possible perfor-
a second radio for collecting additional measurements and a
mance problems and their effect on the WiFi network. Fig 2
remote measurement controller for control and configuration
shows the Witals causal graph. The nodes are possible WiFi
of the APs. WISE relates various causes of poor wireless
performance problems. The edges are causal relationships
performance and comes up with a new metric Witt i.e., WiFi-
between the nodes: a → − b, means a causes b.
based TCP throughput. It predicts the TCP throughput of a
new flow in the current wireless condition. WISE computes Note that none of the causal relationships are novel or
Witt metric using linkexp model, it is linkexp = (1 − a) ∗ (1 − unknown; our contribution lies in careful identification and
c) ∗ r, 0 < a < 1, o < c < 1. Where a is airtime utilization organization of extensive causes of performance problems.
from external sources, c is local contention and r is effective At a high level, if downlink throughput of APW is low,
rate. WISE uses Witt metric to classify WiFi performance into there are only three possible reasons A. Low offered load:
5 categories, ranging from very poor to very good. APW does not have enough data to transmit, B. Not enough
WISE mainly focuses on offline analysis. On the contrary, airtime: APW has data to transmit but does not get enough
Witals is made to run in real time. Like PIE, WISE too uses opportunities to transmit, C. Inefficient utilization of provided
an additional radio per AP. Further, WISE needs higher layer airtime: APW has data to transmit and it gets enough oppor-
packet header inspection, unlike Witals. Also, WISE can not tunities to transmit as well but it is not able to utilize airtime
diagnose several wireless conditions for e.g., sub-optimal rate efficiently. Note that by design, one or more of the above must
adaptation algorithm, poor efficiency due to low level of frame happen, for the observed downlink throughput to be poor. That
aggregation etc.. WISE computes local contention from the is, if there is enough offered load, sufficient airtime for the AP,
relative amount of other clients’ traffic to the total traffic at and efficient utilization of the same, downlink throughput is
necessarily good.
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
4
Feature Jigsaw PIE MOJO WiFi profiler, WiMed WTF Antler WISE Witals
/ Wit Client-Conduit
Additional resource for trace collection? Yes Yes Yes No Yes No No Yes
Additional resource for analysis? Yes Yes Yes No No Yes Yes Yes
Client side modification? No No No Yes Yes No No No No
Performance issues covered extensively? No No No No No No No No
Diagnosis & quantification of multiple simultaneous causes? No No No No No No No No
802.11n/ac support? No No No No No Yes No Yes Yes
TABLE I: Summary of comparison of Witals with prior work
Fig. 2: Witals Diagnosis: Causal Graph (underlined green nodes are diagnoses of Witals)
A. Low offered load to its nearby client, it is likely that APW detects it as WiFi,
The first possible reason for low throughput, marked as but cannot decode the bits of the frame: the high data rates of
“A” in Fig 2, is low offered load to the wireless network. 802.11n are decodable only at short distances (10-30 feet).
This could in turn be caused due to a variety of non-wireless B.2 Other BSS: Some of the energy which is decodable,
related reasons such as wired network bottleneck, overloaded could belong to other BSS in the vicinity, sharing APW ’s
web server, or simply when the wireless clients are not channel. Now, this might be caused by data frames of other
generating enough traffic. Witals does not distinguish between BSS or control and management frames of other BSS.
the different reasons causing low offered load. B.3 Control, management overheads: Within APW ’s
The next two subsections discuss the remaining two possible BSS, airtime would be taken up by control and management
reasons for low throughput B. & C.. overheads in WiFi (e.g., beacons, probe request/response, etc.).
We also include here, frames received with FCS (Frame Check
B. Not enough airtime for APW Sequence) error.
We now discuss all the possibilities where APW has data B.4 Upload: Within APW ’s BSS, a portion of the airtime
but does not get enough airtime to transmit: these are shown would be used for uplink transmissions, which effectively
under the node marked “B” in Fig 2. When APW does not causes downlink airtime reduction.
get sufficient airtime, this must necessarily be due to one of B.5 MAC unfairness: Now, even when the above factors
the following possibilities. are not affecting APW ’s airtime, the distributed operation of
B.1 Wasted non decodable energy: The channel is occu- the 802.11 DCF MAC can result in short term or even long
pied with energy which is not decodable. Such energy could term unfairness, which can even be severe. Such unfairness
either be non-WiFi energy or WiFi non-frame energy. The can be in terms of (i) transmission opportunities (txop) or (ii)
former could be due to any of a number of non-WiFi devices airtime. (i) APW may not get sufficient txop when it is the
operating in the same frequency range, e.g., microwave oven, middle node in a classical hidden terminal scenario [14]: APW
baby monitor, etc. Or it could also be due to WiFi radios loses the DCF contention with high probability to one of the
far away from APW . The latter is co-channel interference: other two nodes. (ii) Even with fair txop, due to different bit-
energy which can be detected as WiFi transmission but cannot rates of the transmitters, there can be unfairness in airtime.
be decoded into an intelligible frame (e.g., sync loss after When a radio operates at a low data rate (say 6Mbps) and
PLCP detection). We have observed that with the high data occupies most of the time in the channel, then it slows down
rates of 802.11n, such instances are quite common. When a faster radios (say 144Mbps) significantly. This is the classical
neighbouring AP transmits at a high data rate (say 300 Mbps) slow talker problem.
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
5
B.6 Overcrowding: Finally, even when airtime is fair, there but this also has another important effect. Overcrowding also
may simply be too many radios (N ) in the channel. APW may causes a large number of DCF collisions due to high chance
get N1 airtime, which would result in less airtime if N is large. of two nodes picking the same random number for backoff.
This is shown as cross-link 3 in Fig. 2. Further, when two
C. Inefficient utilization of provided airtime frames collide, the energy becomes non decodable and thus
In this subsection we discuss all the possibilities when APW wasted. This is shown as cross-link 4 in Fig 2.
has data to transmit and gets enough airtime but is not able When there is high loss rate on the channel either due to
to utilize airtime efficiently. These are shown as the causes of overcrowding (high collisions and so high frame losses) or
the node marked “C” in Fig 2. client side issues, this interacts poorly with TCP. High loss rate
C.1 Low bit-rate: A straightforward reason for poor uti- could cause variable RTT for TCP, DUP ACKs, and spurious
lization of airtime is that APW ’s transmissions are at a low TCP timeouts (also reported by prior work [17]). Thus TCP
physical layer bit-rate. APW may be using a low bit-rate if the slows down in response to these, causing low offered load to
client has a low signal strength (RSSI) and thus cannot work at WiFi. This interaction is shown by cross-link 5 in Fig 2.
a high bit-rate. Or even if RSSI is good, APW ’s rate adaptation A more immediate effect of high losses is that the rate
algorithm could be picking a sub-optimal bit-rate. A classical adaptation algorithm, unable to distinguish non channel losses
case of this is when the rate adaptation algorithm confuses from channel losses, reduces the bit-rate. This is shown in
non-channel losses (e.g., collisions) for channel losses due to Fig 2 as cross-link 6.
low SNR.
E. Causal diagnosis, limitations, actions
C.2 High loss rate: Even when APW is operating at an
optimal bit-rate, airtime may not be effectively used if there The causal graph as described above is the basis for the
are too many link layer losses leading to retransmissions. design of the diagnosis algorithm (Sec. V). The underlined
There are two possible reasons for high loss rate. First, DCF green nodes in Fig. 2 are the various conditions Witals can
collisions: 802.11 is a contention based access mechanism. diagnose. By design, the leaf nodes (i.e., those without any
When radios pick the same random number for backoff be- incoming edge) of the graph represent the diagnoses of Witals.
fore transmission, collisions happen, the probability of which However, there are two exceptions: (1) some leaf nodes are
increases with the number of radios. Second, any other client not associated with any diagnosis, and (2) some non-leaf nodes
side issues: it might be possible that when APW transmits, it represent diagnoses. We describe them in more detail below.
may be hidden from some source of WiFi or non-WiFi energy (1) As mentioned earlier, a limitation of Witals is that, due to
which interferes at the client. Another client-side reason for its AP-centric architecture, we cannot diagnose specific client-
high loss rate we have observed is that several clients are side issues and wired side issues. Hence, we can not diagnose
unable to handle high levels of 802.11n aggregation, losing the leaf nodes associated with client side issues and wired
frames between the wireless hardware and the driver; i.e., the side issues. As discussed before, TCP slowdown is caused
client has a CPU/system bottleneck. by high frame losses (cross-link 5), that in turn is caused by
C.3 Low aggregation: In case of 802.11n, even when APW overcrowding. Also, high frame losses cause rate adaptation
is operating at an optimal bit-rate and the loss rate is also confusion (cross-link 6). Therefore, in occasion of TCP slow
low, airtime utilization can be inefficient if level of frame down, overcrowding and RA based problems show up.
aggregation is low. Prior work [18] has reported this and we Aside from these, a specific wireless related leaf node we
have observed the same in our experiments too. do not diagnose is that of hidden terminal. The reason for
this is as follows. We attempted to create the MAC unfairness
D. Cross-links in the causal graph starvation scenario as in [14], experimentally, with APW in
So far we have described the causal elements in the causal between two other APs. However, we observed that starvation
graph (Fig. 2) from the perspective of insufficient airtime and rarely happens; instead the hidden terminal problem shows
inefficient usage of provided airtime. But the various nodes up primarily as increase in wasted non-decodable energy,
in the causal graph (causes) we have introduced have other rather than as MAC unfairness. That is, the cross-link is
effects too. We now describe these as cross-links (dashed lines, experimentally observed to be stronger than the primary causal
marked 1-6) in the causal graph. effect expected. Hence this hidden terminal scenario manifests
When other APs (BSS) operate on same or nearby channel itself as co-channel interference, which is what Witals diag-
of APW , airtime reduction for APW is one possible effect. If noses. Another possible hidden terminal scenario is between
these APs are too far from APW or use high data rates, these two clients of APW , with APW as the middle node. Since
transmissions will not be decodable to APW . This is co-chan- Witals does not have client-side view, we cannot diagnose
nel interference, and hence shows up as an increase in wasted this situation. But we expect this to be rare, since the two
non-decodable energy. Fig. 2 shows this as cross-link 1. clients have to be at opposite ends of APW ’s coverage, for
When APW is a middle node in a hidden terminal scenario, this situation to arise.
this results in collisions being seen at APW . This too shows (2) There are three non leaf nodes that represent diagnoses
up as wasted non-decodable energy. This cause-effect relation are: (a) co-channel interference, (b) rate adaptation confusion,
is shown as cross-link 2 in Fig 2. and (c) low offered load. (a) Co-channel interference causes
When there are a large number of transmitters (N ) in the increase in wasted non-decodable energy. But co-channel
channel, not only is the airtime share ( N1 ) of APW low, interference itself is caused by other BSS in the vicinity. Due
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
6
to this dependency, it becomes a non leaf node, but this is do not choose such metrics. On the contrary, some metrics
a diagnosis as well. (b) High number of frame losses causes such as bytes downloaded, number of associations are not
confusion to the rate adaptation algorithm, which lowers the related to any node, we do not choose them as well. Some
bit-rate in response. Due to this dependency, suboptimal rate other metrics such as signal strength of the AP and the clients,
adaptation algorithm becomes a non leaf node, but this also average bit-rate and so on, are not comparable with other
is a diagnosis. (c) As discussed before, we can not diagnose metrics such as airtime utilization. Hence, we modify these
wired or client side issues, and TCP slowdown shows up as metrics in an appropriate way to make them comparable with
overcrowding and RA. Therefore, we diagnose the non-leaf others.
node low offered load to indicate idle channel condition. As described earlier, the underlined green nodes in the
causal graph of Fig. 2 are various diagnoses of Witals.
IV. W ITALS FOR D IAGNOSIS We come up with a set of metrics, that helps us diagnose
What metrics can we possibly use for diagnosing problems? the green nodes of the causal graph. The metrics fall into
The spectrum of performance metrics considered in prior two categories: (a) airtime reduction metrics, (b) metrics
works is vast. For example, Antler [20] considers the following representing inefficient utilization of airtime. These mirror
metrics: minimum throughput, overhead, airtime utilization of the two major sets of causes marked “B” and “C” in the
the AP and the clients, signal strength of the clients, data causal graph. And all metrics are represented in terms of
rate, packet loss rate, retransmission percentage. WISE [19] percentage of downlink throughput reduction they cause. The
considers the following metrics for computation of W itt: air- close connection between the metrics below and the nodes in
time utilization, local contention, effective rate. We categorize the causal graph is captured in Table III. It lists the various
the set of metrics considered in prior works into following possible diagnoses of Witals, each diagnosis corresponds to
categories. (1) Metrics related to TCP, (2) metrics related a node of the causal graph of Fig. 2. An important point
to MAC throughput, (3) metrics related to number of WiFi is that comparable witals metrics and causal relationships in
entities, (4) metrics related to number of bytes, (4) metrics the causal graph allow the detection of multiple simultaneous
related to airtime utilization, (6) metrics related to signal problems, which is a common occurrence, especially due to
strength (7) metrics related to bit-rate, (8) metrics related to the complex inter-dependences as shown in the causal graph.
loss rate, (9) metrics related to overhead, (10) metrics related Table III also shows the possible follow-up actions for each
to contention, and (11) metrics related to non-WiFi. diagnosis. We must note here that the follow-up actions given
Out of all these metrics, we choose a specific subset. In here are merely suggestive and not exhaustive. While some
choosing the metrics (called witals, or WiFi vital statistics), follow-up actions are short term in nature (e.g., switch AP
we have the following guidelines. (1) We should be able to channel), some are longer term (e.g., re-plan WiFi installation).
diagnose multiple simultaneous causes of WiFi performance A. Airtime reduction metrics
problems, as this is a common scenario. (2) It should be easy We use the following metrics to measure downlink through-
to measure and maintain in real-time in a live AP, even when put reduction due to airtime reduction for APW .
operating at the high data rates of 802.11n or 802.11ac. Wasted non-decodable energy (W ): This metric measures
Common unit of measurement: Given the above guide- the percentage of airtime lost (over the measurement window
lines, we choose a minimal set of metrics. Importantly, we TW ) due to non-decodable energy. This metric is a direct
represent all of our metrics in the same comparable unit of measure of the throughput reduction attributable to this cause.
measurement. We choose percentage reduction in downlink To measure W , we have used vendor support in the AP
throughput due to a cause, as this common comparable unit. driver. The API provides means to measure the amount of
This is an important choice, as this allows us to compare mul- airtime for which there was any energy detected on air. It also
tiple possible simultaneous causes of performance problems, provides means to get per-frame information (length, data-rate
and flag the major ones. etc.) for all frames on air, using which we can get the time
Time window of measurement: We measure metrics in duration of decodable energy. Subtracting the latter from the
windows of time, of duration denoted as TW . The choice of former, we get W . We note that the above API is not unique to
TW has to be such that it is large enough to tide over the fine the vendor’s driver, similar ones are available in popular open-
grained frame-level behaviour of 802.11 DCF. It also has to be source drivers such as ath9k as well as in broadcom chipset
large enough to amortize the fixed overheads of implementing based driver.
the diagnosis algorithm. TW also has to be small enough to CtlMgtOverhead (O), OtherBSSOvhd (OBO), OtherB-
detect a problem before it “disappears” (e.g., a value of a SSData (OBD), Upload (U l): Once we have per frame
minute or two is likely to be too high). In practice, we have (length, data-rate) information, the airtime occupied by control
found that TW of a few seconds works well. and management overheads (O) of own BSS, control and
In Table II, we show the set of metrics considered in management overheads (OBO) of other BSS, data frames
prior works. We also indicate whether we choose them in of other BSS (OBD), upload (U l) are all straightforward to
our framework or not, the justification for the same, and the compute. When represented as a percentage of measurement
corresponding witals metrics (if any). Some of the metrics such window TW , like with W , these form direct measures of
as TCP RTT, MAC throughput considered in prior works affect throughput reduction due to each of these causes.
many nodes in our diagnosis graph, hence, it is not possible to Fig. 3 shows the split of airtime among the above metrics.
diagnose the root cause(s) from these metrics. Therefore, we We also note that we can compute the percentage of idle time
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
7
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
8
of the problems grouped under “B. Not enough airtime” in (c) Aggrredn represents the throughput reduction at the ob-
Fig. 2, with the exception of the over-crowding problem. To served rate Ravg , due to operating at the observed aggregation
diagnose over-crowding, we need a measure of the number of level Aggravg as opposed to operating at the ideal Aggrmax .
contending transmitters (denoted N ), in TW . Since TW is typ- Given the above, we can write, for each client i:
f (Rmax (i),Aggravg (i))−f (Rrssi (i),Aggravg (i))
ically a few seconds, which is much larger than typical frame RSSIredn (i) = f (Rmax (i),Aggravg (i))
transmission durations, we first break-up TW into smaller RAredn (i) =
f (Rrssi (i),Aggravg (i))−f (Ravg (i),Aggravg (i))
f (Rmax (i),Aggravg (i))
windows ts . The value of ts must be large enough to observe f (Ravg (i),Aggrmax )−f (Ravg (i),Aggravg (i))
multiple nodes’ transmissions. We have seen that a ts value Aggrredn (i) = f (Ravg (i),Aggrmax )
of 100-500ms works well in practice (we use ts = 200ms We then compute each metric for APW as the average over
in implementation). Within each ts window, APW counts the the corresponding metrics for each active client i. We use
number of observed transmitting radios, and takes this as an an empirical definition of an active client: APW should have
estimate of N . To compute N for a given TW , we simply transmitted a minimum of 10 data packets to that client in TW .
average these estimates over all the ts windows within TW . For the above computations, we need Rmax (i), Rrssi (i), and
Our experiments confirm that the above mechanism to Aggrmax ; the others are measured by APW over each TW
estimate N is robust in diagnosing over-crowding. Finally, we easily.
note that the metric N is the only exception in our set of Here Rmax (i) is the ideal bit-rate for client i at the ideal
witals metrics in that it does not directly represent percentage RSSI; this is nothing but the maximum bit-rate: 54Mbps for
throughput reduction due to a wireless problem. 802.11g, 72 Mbps for 802.11n 1 × 1 etc.
To estimate Rrssi (i) for client i, we use a combination of
B. Metrics for inefficient airtime utilization (a) the data rate determined by the measured average RSSI
We use three different metrics to diagnose the different on the uplink, and (b) observed loss rates at the various data-
causes under “C. Inefficient utilization of provided airtime” in rates in the past five TW windows. This estimation of Rrssi is
Fig. 2. Each of these metrics represents throughput reduction similar to computing SNR threshold for adapting bit-rate [24].
in the given airtime. We have taken Aggrmax = 10, as this represents (an al-
Let us first calculate ideal throughput at the link layer. It is most) ideal level of aggregation; as [18] shows, an aggregation
a function of link layer bit-rate R and the level of aggregation level of 10 achieves close to optimal efficiency (88% for a bit-
Aggr (in case of 802.11n) [18]. It is of the form: rate of 72Mbps).
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
9
Algorithm 1 Diagnosis algo, runs every TW percentage because of that problem. Even with the metrics
1: M = Set of up to 3 major witals metrics > min-thpt-thresh being comparable, Fig. 1(a) is not easy to interpret visually.
2: if M is ∅ output Healthy Fig. 1(b) next shows the output of our diagnosis algorithm
3: if I > idle-thresh then ⊲ Channel more idle than busy plotted over time. The width of the bubble is proportional
4: output LowLoad, Healthy to the throughput reduction it causes. We also plot aggregate
5: else ⊲ Channel more busy than idle throughput of the network. We can see here that Overcrowding,
6: if O ∈ M output CtlMgtOvhd RA, OtherBSSOvhd, OtherBSSData and CoChInterf are the
7: if OBO ∈ M output OtherBSSOvhd major performance problems for this network. The sysad can
8: if OBD ∈ M output OtherBSSData potentially set event triggers based on the algorithm’s output
9: if U l ∈ M output Upload and take appropriate actions. Note that the distilled view in
10: if Sredn ∈ M output SlowTalker Fig. 1(b), is enabled by the design of comparable witals
11: if RSSIredn ∈ M output LowRSSI metrics and the cross-links. Such distillation is not possible
12: if RAredn ∈ M output RA in current commercial systems or state of the art prior work.
13: if Aggrredn ∈ M output LowAggr C. Parameters in the algorithm
14: if W ∈ M then
In our Witals algorithm, we have used a few parame-
15: if N > simul-tx-thresh output Overcrowding
ters. They are: time window of measurement TW = 2sec,
16: if (OBO + OBD) > otherbss-pres-thresh output
simul-tx-thresh of 5 for determining Overcrowding, slot size
CoChInterf
ts = 200ms for determining number of contending transmit-
17: if neither of the above output NonWiFi
ters (N ), min-thpt-thresh of 10%, idle-thresh of 50%, active
18: end if
clients’ number of data packets threshold of 10, picking top
19: end if
3 reasons, as opposed to some other number. The values of
these parameters are not crucial to our diagnosis algorithm. In
Sec. VI, we evaluate the sensitivity of the parameters.
Algo). We note that these “problems” are diagnosed only when
the channel is more busy than idle (i.e., I < 50%). VI. E VALUATION
Now, from the causal graph (Fig. 2), we see that W We evaluate our diagnosis algorithm in two different settings
(wasted non-decodable energy) could be caused due to a va- (1) controlled experiments and (2) real-life measurements. In
riety of reasons: non-WiFi device, or co channel interference, controlled experiments, we artificially create a problem and
or over-crowding. Lines 14-17 of diagnosis algo distinguish see if the Witals diagnosis matches the ground truth. Next, we
between these possible causes. The metric N which is a run our algorithm in real settings where we corroborate ground
measure of simultaneous transmitters on the channel is used truth from system administrators and from users’ feedbacks.
to flag Overcrowding. Note that, we are measuring number of In all cases, Witals matches ground truth accurately. Here, we
transmitters detected at the AP at small intervals, thus the also evaluate Antler [20] and WISE [19] prior work on AP-
actual number of simultaneous transmitters might be more centric diagnosis (described in Sec II). We also compare Witals
than what the the AP sees. The transmitters who have have with popular commercial tools. Prior to the experiments, we
been successful to send their frames, are the ones seen by first describe our prototype implementation.
AP. Hence, we pick an empirical threshold of 5 for simul- A. Prototype Implementation
tx-thresh (simultaneous transmitters threshold) for detecting
We have implemented Witals on an enterprise grade com-
overcrowding. The presence of significant interference from
mercial AP1 platform capable of 2×2 as well as 3×3 802.11n
other BSS (otherbss-pres-thresh) is used to flag CoChInterf.
operation. The hardware is based on the Atheros 9xxx gener-
If neither of the above is identified as a possible cause, we
ation chipset, and the software is a set of modifications to a
conclude that the high W is due to a NonWiFi device.
licensed version of the Atheros driver. The radio is placed in
In several real-life measurements, we found that a diagnosis
promiscuous mode, so that the driver can extract statistics for
every TW might be too fine-grained for a sysad, and it flags
packets intended for other radios in the vicinity too. We have
several transient problems. The sysad might be interested to
implemented the diagnosis logic as part of AP software suite.
know the persistent performance problems. Therefore, Witals
To measure the accounting of airtime at APW , we take help
also provides a further refined view showing the persistent
of hardware supported registers. We also log extensive info
performance problems in the last 15 minutes or so.
B. An Example Run (address, length, data rate etc.) about transmitted and received
Now we take an example to better understand the diagnosis frames. We had to pay attention to detail, in making the imple-
algorithm. We take a sample trace from a real-life measure- mentation work at the high rates of 802.11n. The simple set of
ment where 80 clients were associated with APW , the same as witals metrics helped too, in keeping the overhead minimal:
mentioned in Sec. I. The clients were tablets used by students while some of the witals metrics such as Sredn are subtle
in a classroom, and the traffic consisted of an online quiz. in design, the per-frame computation itself is minimal. We
Fig 1(a) plots the witals metrics over time. We note here studied the impact of Witals on APW ’s throughput, comparing
that the state-of-the-art in current commercial systems stops it with a vanilla AP as well as the expected throughput as per
with such a view of a multitude of graphs. Now, recall that 1 AirTight AP: AirTight C-55 AP, AirTight C-60 AP.
each witals metric shows airtime or throughput reduction http://www.airtightnetworks.com/home/resources/datasheets.html
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
10
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
11
Healthy Healthy
LowLoad LowLoad
NonWiFi NonWiFi
CoChInterf CoChInterf
CtlMgtOvhd CtlMgtOvhd
OtherBssOvhd OtherBssOvhd
OtherBssData OtherBssData
Upload Upload
SlowTalker SlowTalker
Overcrowding Overcrowding
LowRSSI LowRSSI
RA RA
LowAggr LowAggr
50 100
AggrThpt(Mbps) 25 Jitter(ms) 10
0 1
0 10 20 30 40 50 60 70 80 0 50 100 150
Time(s) Time(s)
Fig. 6: Witals diagnosis: SlowTalker Fig. 9: Witals diagnosis: for VoIP application
16 1
Witt Fig. 2). Retransmissions & variable RTT could also cause TCP
Local Contention
14 Local contention 0.8
slowdown (cross-link 5, Fig. 2).
Witt(Mbps)
12
0.6
10 Despite the above simultaneous phenomena, we see in
0.4
8 Fig 8(a) that Witals algorithm presents a clear picture. It
0.2
6 diagnoses Overcrowding correctly. We see that the SlowTalker,
4 0
0 10 20 30 40 50 60 70 80 CtlMgtOvhd, and RA problems are reported as well. A plot of
Time the witals metrics over time (not shown) indicated that W
Fig. 7: WISE diagnosis
varied between 50% and 80% during the experiment. The
was running iperf in the uplink direction. Here we fixed metric O was also high, about 30%, likely due to collision
the uplink client’s transmission bit-rate to be low: MCS 0, induced FCS errors. We observed that the combination of the
6.5Mbps/7.2Mbps. Fig 6 shows the diagnosis output of Witals above factors reduced the downlink throughput to near zero
algorithm against time. We see that SlowTalker is correctly during the experiment.
diagnosed as a major reason for throughput reduction. We Comparison with Antler and WISE: Antler infers con-
also see that Upload is a significant reason. A plot of the gestion from bit-rate. If bit-rate is low but RSSI is high, it
witals metrics (not shown) indicated that the slow client diagnoses as congestion. We evaluated Antler against the trace
occupies almost 60% of the channel (this shows up as Ul) from this experiment. Fig 8(b) shows the diagnosis. Instead
and APW ’s downlink airtime is only around 26%. The slow of reporting congestion, the diagnosis was interference. The
talker reduction metric (Sredn ) is around 60%. incorrect diagnosis is due to difference between the AP’s
Comparison with Antler and WISE: Here, Antler can airtime and the airtime attributable to its clients’ uplink and
not diagnose the slow talker problem. WISE provides wrong downlink transmissions. High degree of wasted non-decodable
diagnosis of high local contention (> 0.5 [19]). WISE com- energy W resulting from collisions causes this difference.
putes Witt metric using linkexp model, which in turn computes Here, WISE too detects interference. Fig 8(c) shows the
the local contention. Fig 7 shows the Witt value & local Witt value & non-WiFi energy. The reason for high non-WiFi
contention with time. The reason for wrong diagnosis is that energy is high degree of wasted non-decodable energy W
WISE diagnoses local contention in an indirect way, from resulting from collisions. WISE diagnoses this as interference.
the relative amount of other clients’ traffic to the total traffic. Our diagnosis graph specifically cross-links 3, 4 of Fig. 2, and
Since the slow client occupied the channel for more time, other their incorporation into our diagnosis algorithm allows us to
clients could not get enough airtime. Hence, WISE diagnoses flag the correct diagnosis.
this as high local contention for the fast clients. 6) Witals on VoIP performance: So far, we have seen
5) Detection of Overcrowding: To test our Witals frame- that Witals is able to detect the “low throughput” issues
work’s ability to detect Overcrowding, we set up an experiment correctly. To understand, whether Witals framework can detect
where 30 clients were associated with APW . Before the exper- the issues with poor VoIP performance, we set up a controlled
iment, we ensured the access point placement was optimal so experiment. Initially, one client connects to APW . We emulate
that all clients can operate at the highest bit-rate individually. VoIP traffic via UDP CBR (constant bit-rate) traffic of 1.2
We also ensured that the web server or the wired network Mbps (Skype HD video [4]). After 80 sec, we connect another
were not loaded. That is, the performance seen by the clients client to APW . This client runs iperf in the uplink direction for
in our experiment was constrained by the wireless network another 100 sec. Fig 9 shows the diagnosis output of Witals
bottleneck. External WiFi interference was minimal. All the algorithm against time. Here, instead of aggregate throughput,
clients were downloading a large file from a local server and we plot jitter values with time. We see that initially the jitter
simultaneously uploading another large file. values are low, but once the upload starts the jitter values
Even though a controlled experiment, this scenario stress increase. Witals is also able to detect the reason for increase
tests our Witals framework as several phenomena kick-in in jitter as Upload.
simultaneously. As all the clients were downloading as well
as uploading, the number of contending stations is high. This D. Real life measurements
causes collisions, which in turn results in airtime wasted in We tested Witals in various real scenarios. Here, we discuss
non-decodable energy (cross-links 3 & 4, Fig. 2). Collisions results from 4 different real scenarios: (1) a research confer-
also cause frame losses, which in turn cause the rate adap- ence with 10 clients connected to the AP under diagnosis, (2)
tation algorithm to select a sub-optimal bit-rate (cross-link 6, a classroom at our Institute with 43 clients connected to APW
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
12
Healthy 1.4 1
LowLoad Interference
NonWiFi 1.2
0.8
CoChInterf 1
Congestion
Witt(Mbps)
CtlMgtOvhd
Utilization
0.8 0.6
OtherBssOvhd
OtherBssData 0.6
Upload Poor Link 0.4
SlowTalker 0.4
Overcrowding 0.2
Packet Traces 0.2
LowRSSI
RA 0 0
LowAggr 0 50 100 150 200 250 300
5 Healthy
AggrThpt(Mbps) 2
Time
0
0 50 100 150 200 250 300 0 50 100 150 200 250 300
Time(s) Time(s) Witt Non-WiFi util
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
13
The WiFi network was used for a classroom activity involving Healthy and LowLoad for quite a few time windows, thus it did
an online quiz using an Android quiz application on the tablets. not affect user experience: this matched with our observation
The students logged in via the app to a server, when the quiz that there were no major complaints by users.
was published by the teacher, the app downloaded the quiz E. Comparison of Witals with Commercial Tools
file. The app enabled students to answer the questions, after
which it submitted the answers to the server. We compare Witals with popular commercial tools. These
The ground truth observed empirically was that the students’ tools plot various metrics incomparable to each other and
experience was really bad. The app took a long time to uses arbitrary thresholds. Further, they do not consider various
download the quiz file. When submitting responses also, many simultaneous cause-effects of wireless problems. We come up
students found their upload was stuck. Out of 200 students with some example scenarios where popular commercial tools
only about 120 students could submit their responses, with such as AirDefense [5] will trigger false performance alarms
the instructor having to fall back on paper-printed quiz forms. but Witals is able to detect the ground truth.
Given that quiz file itself was small, we did not expect such AirDefense will trigger an utilization alarm when con-
poor performance. The APW for which we present results trol/management/data frames or number of associations exceed
served 80 clients. Fig. 1(b) (shown in Sec. I) plots Witals the thresholds. Though high number of associations can po-
diagnosis for a representative 5-minute window. tentially cause high channel utilization, the same will happen
We see here that once again Overcrowding is reported as a if the clients generate enough traffic on the channel. For
major reason. Unlike classroom-43, Upload is not a problem, validation, we conducted a controlled experiment where 90
while OtherBSSData and OtherBSSOvhd are significant prob- clients (tablets) were associated with our Witals-enabled AP
lems. Overcrowding also results in RA, and OtherBSSData and for a classroom quiz. The load on the network was very low,
OtherBSSOvhd also result in CoChInterf; these combined ef- aggregate throughput is merely 100Kbps. Witals diagnoses the
fects reduce throughput significantly, explaining the poor user network as LowLoad and Healthy. But in this case, AirDefense
experience. The value of this refined view can be understood will trigger an utilization alarm. Most commercial tools diag-
by trying to deduce the same visually, from the raw witals nose any non 802.11 energy as interference. But during high
metrics in Fig. 1(a) (in Sec. I). collisions, when two frames collide, the energy becomes non-
Validation: For validation of our diagnosis we spoke to decodable. These tools can not differentiate non-decodable
system administrator of our Institute. They confirmed that energy caused by external sources from the one caused by high
indeed there are many BSS on each of the non overlapping collisions. For validation, we take the overcrowding controlled
channel which was out side their control. The students also experiment described in Sec VI-C. Fig 8(a) (in Sec.VI-C)
complained about the slow network. The authors were also shows that Witals correctly diagnoses Overcrowding as the
present at the venue and they too experienced slow network. major performance problem. Most commercial tools including
4) Hostel: Finally, we present results of testing Witals at a AirDefense will diagnose this as Interference.
student hostel (dormitory) in our Institute campus. The hostel VII. D ISCUSSION AND F UTURE W ORK
access point serves around 100 student-rooms across 4 floors. We now discuss a few aspects related to Witals.
We placed our APW close to the hostel’s AP, to enable Witals What about 802.11ac? While we have built a prototype of
diagnosis. The empirical ground truth here was that the user Witals for 802.11a/b/g/n, fundamentally the algorithm remains
experience was satisfactory. same for 802.11ac as well. We are in process of porting our
While Witals was operational here for a whole day, we look Witals implementation to a 802.11ac driver.
at a 5-min window during a busy period (around 2.50am!), as What about uplink? Conceptually, an instance of Witals
the network is mostly idle otherwise. Even in the relatively can run at the clients and will be able to diagnose uplink
busy period, the network is lightly loaded and healthy for performance issues. But due to huge variability of clients’
several TW windows. Witals diagnoses (not shown here) drivers the implementation might be difficult. On the other
LowRSSI and CtlMgtOvhd as significant reasons. Sub-optimal hand, if we diagnose at the access point we will be los-
rate selection is a problem here too, as indicated by RA, but ing out some important information. For example, when the
unlike the previous classroom scenarios, Witals algorithm does network is lightly loaded, AP can not distinguish whether:
not attribute this to collision, but to other issues: it is likely clients are simply not generating enough traffic, or there
that there are client-side reasons for such sub-optimal rates. is a wireless issue. There are several client side diagnosis
We also see in this measurement, the benefit of our defi- tools such as WiFi-Profiler [9], Client-Conduit [7]. These are
nition of Sredn . There are several slow radios, but they are complementary to our work, we can install these tools at the
not sending sufficient traffic to cause a throughput problem clients and collect client side information, which could be
and trigger a SlowTalker diagnosis in Witals. Had we used a potentially utilized in the Witals framework to diagnose uplink
simple ratio of fast to slow radio, like in [22], several instances performance problems.
of SlowTalker would have been incorrectly diagnosed. What about Multiple APs? Witals can be potentially ex-
Validation: For validation of of our diagnosis we spoke to tended to understand the performance of broader deployments
system administrators of the hostel. They informed that one for enterprise/campus networks by co-ordinating information
access point serves several floors and there is coverage issue in across multiple vantage points. This requires a coarse grained
several rooms. This validates our diagnosis of LowRSSI. The synchronization (in granularity of TW ) between the APs and a
problems reported above were short-term, the network was central server for analysis. Each AP can send Witals diagnosis
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2017.2739152, IEEE
Transactions on Mobile Computing
14
to a central server. Then the server has to merge each AP’s [9] R. Chandra, V. N. Padmanabhan, and M. Zhang. WiFiProfiler: Cooper-
diagnosis to provide a combined view of the network. The ative Diagnosis in Wireless LANs . In MobiSys, 2006.
[10] Y.-C. Cheng, M. Afanasyev, P. Verkaik, P. Benko, J. Chiang, A. C
diagnosis of LowLoad, NonWiFi, Overcrowding, LowRSSI, Snoeren, S. Savage, and G. M. Voelker. Automating Cross-Layer
RA, LowAggr will remain same. The server will note the Diagnosis of Enterprise Wireless Networks. In SIGCOMM, 2007.
frequency of such reportings in all the APs and will flag-off [11] Y.-C. Cheng, J. Bellardo, P. Benko, A. C. Snoeren, G. M. Voelker, and
S. Savage. Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis.
the majorities. Now, OtherBSSOvhd and OtheBSSData will no In SIGCOMM, 2006.
longer be diagnosed from a channel perspective. Instead, these [12] J. Kim, S. Kim, S. Choi, and D. Qiao. CARA: Collision-Aware Rate
will be CtlMgtOvhd and Download. Adaptation for IEEE 802.11 WLANs. In INFOCOM, 2006.
[13] K. Lakshminarayanan, S. Seshan, and P. Steenkiste. Understanding
Before-and-after analysis? We have performed several 802.11 Performance in Heterogeneous Environments. In HomeNets,
controlled experiments to understand how Witals helps to 2011.
diagnose issues and improve performance. For example, in one [14] S. C. Liew, C. H. Kai, H. Ching Leung, and P. Wong. Back-of-the-
Envelope Computation of Throughput Distributions in CSMA Wireless
controlled experiment, 15 laptops simultaneously downloaded Networks. IEEE Transactions on Mobile Computing, 9(9),, 2010.
and uploaded a long file. Here, Witals reported Overcrowding [15] R. Mahajan, M. Rodrig, D. Wetherall, and J. Zahorjan. Analyzing the
and RA. After diagnosing overcrowding as the root cause, MAC-level Behavior of Wireless Networks in the Wild. In SIGCOMM,
2006.
we plugged in a solution WiFiRR [17] to reduce conges- [16] M. Maity. Health Diagnosis and Congestion Mitigation of WiFi
tion. Thereafter, Witals reported Healthy and LowLoad. The Networks. PhD thesis, Indian Institute of Technology, 2016.
throughput also improved from from 10-12 to 20-40 Mbps. [17] M. Maity, B. Raman, and M. Vutukuru. TCP Download Performance
in Dense WiFi Scenarios: Analysis and Solution. IEEE Transactions on
The details are described in [16]. Mobile Computing, 2017.
VIII. C ONCLUSION [18] N. Mishra, A. Chaurasia, A. Kallavi, B. Raman, and P. Kulkarni. Usage
of 802.11n in Practice: A Measurement Study. In COMSNETS, 2015.
In this paper, we have presented the design of Witals, an [19] A. Patro, S. Govindan, and S. Banerjee. Observing Home Wireless
AP-centric system for diagnosing a WiFi network’s health. Experience through WiFi APs. In MobiCom, 2013.
We do not require any distributed sensor infrastructure, extra [20] R. Raghavendra, P. A. K. Acharya, E. M. Belding, and K. C. Almeroth.
Antler: A multi-tiered approach to automated wireless network manage-
radio at the AP, or client-side modifications. We start with ment. In INFOCOM Workshops 2008, IEEE, 2008.
the design of a causality graph. The causal graph identifies an [21] A. Sheth, C. Doerr, D. Grunwald, R. Han, and D. Sicker. MOJO: A
extensive set of performance problems. We then come up with distributed physical layer anomaly detection system for 802.11 WLANs.
In MobiSys, 2006.
a minimal set of metrics, witals, closely associated with the [22] V. Shrivastava, S. Rayanchu, S. Banerjee, and K. Papagiannaki. PIE in
nodes of the causal graph. An important property of the witals the Sky: Online Passive Interference Estimation for Enterprise WLANs.
metrics is that each of them quantifies throughput reduction, In NSDI, 2011.
[23] S. Sundaresan, N. Feamster, R. Teixeira, Y. Grunenberger, D. Papa-
and being in the same units, are comparable. Our diagnosis giannaki, D. Levin, et al. Characterizing Home Network Performance
algorithm then puts together the causal graph and the metrics. Problems.hal-00864852. 2013.
The final output is capable of presenting a clear picture to [24] J. Zhang, K. Tan, J. Zhao, H. Wu, and Y. Zhang. A practical SNR-guided
rate adaptation. In INFOCOM, 2008.
a sysad, even in the presence of complex real-life WiFi net-
Mukulika Maity Mukulika Maity received her B.E
work problems. The Witals framework is capable of detecting in CSE from BESU in 2010. She received her
multiple simultaneous problems too, a common occurrence. M.Tech + Ph.D. dual degree in CSE from IIT-
We have evaluated Witals extensively in controlled as well Bombay in 2016. Since Dec 2016, she is a faculty at
the CSE department at IIIT- Delhi. Her research in-
as real-life uncontrolled settings, based on a prototype. We terests are broadly in the areas of wireless networks
show that our diagnosis matches ground truth closely. Witals and mobile computing.
also helped the sysads unearth some unexpected root-causes Bhaskaran Raman Bhaskaran Raman received his
behind performance degradation in certain deployments. We M.S. and Ph.D. in CS from U.C.Berkeley, in 1999
and 2002 respectively. Since July 2007, he is a
also perform a quantitative comparison of Witals with research faculty at the CSE department at IIT-Bombay. His re-
and commercial tools. search interests and expertise are in communication
Given the importance of WiFi performance diagnosis, es- networks, wireless/mobile networks, and technology
for education. His current focus and specific interests
pecially in cloud-managed remote WiFi deployments, we are mobile crowdsensing, WiFi performance diagno-
believe that the AP-centric Witals framework is a significant sis, and use of technology in large classrooms for
contribution, even as WiFi speeds grow. effective teaching.
R EFERENCES Mythili Vutukuru Mythili Vutukuru received her
Ph.D. and M.S. degrees in Computer Science from
[1] ”7 Signal”, http://7signal.com/solutions/wi-fi-performance-solution/. the Massachusetts Institute of Technology in 2010
[2] ”AirMagnet WiFi Analyzer”, http://www.flukenetworks.com/enterprise- and 2006 respectively. Since July 2013, she is a
network/wireless-network/AirMagnet-WiFi-Analyzer. faculty at the CSE department at IIT-Bombay. Her
[3] ”Cisco Meraki” https://meraki.cisco.com/solutions/high-density-wifi. research interests are in networked systems, wireless
[4] ”How much bandwidth does Skype need?”. communication, and network security.
https://support.skype.com/en/faq/FA1417/how-much-bandwidth-does-
skype-need. Avinash Chaurasia Avinash Chaurasia received his
[5] ”Motorola Airdefense”. www.airdefense.net. B.Tech in CSE from NIT Allahabad in 2008. He
[6] ”WiFiEagle”, http://nutsaboutnets.com/wifieagle-channel-analyzer/. obtained his M.Tech degree in CSE from IIT Kanpur
[7] A. Adya, P. Bahl, R. Chandra, and L. Qiu. Architecture and Techniques in 2011. Currently he is pursuing Ph.D from IIT
for Diagnosing Faults in IEEE 802.11 Infrastructure Networks. In Bombay and working on areas related to networks
MobiCom, 2004. and systems.
[8] K. V. Cardoso and J. F. de Rezende. Increasing throughput in dense
802.11 networks by automatic rate adaptation improvement. Wireless Rachit Srivastava Rachit Srivastava received his
Networks, 2012. B.Tech in ECE from MNNIT in 2009. He obtained
his M.Tech degree in ECE from IISC Bangalore
in 2011. Currently he is a lead Engineer in Mojo
Networks.
1536-1233 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.