Infocom Workshop 2019
Abstract: This paper presents the first study with real-world traffic data from a national K-12 Education Service Provider (ESP). This ESP, which supports a one-to-one computing program in Uruguay, is in charge of the Wi-Fi Internet access at all K-12 schools in the country. Network users include teachers and students, where the latter typically range between 6 and 18 years old. In order to gain knowledge of the user behavior in such a novel scenario, the work is focused on finding the typical network usage patterns and studying their evolution during the year. Based on the selected features and proper distance measures, four main Wi-Fi usage profiles were identified through clustering algorithms. In addition, these classes are stable throughout the year, which makes it possible to implement a Wi-Fi activity monitoring system trained with data collected during the first weeks of the year only. The results suggest that network traffic analysis could benefit the ESP for evidence-based decision making, not only from the network operator perspective, but also by providing vital information for learning analytics purposes.

Index Terms: network traffic, education, clustering, usage profiles.

I. INTRODUCTION

The increasing data availability, jointly with recent advances in machine learning and big data analysis, has laid the groundwork for several studies with datasets from real-world Wi-Fi networks (e.g. [1], [2], [3], [4]). These studies are typically carried out with data from Internet service providers (ISPs) or mobile network operators (MNOs). Although many articles refer to educational settings such as university campuses, studies in K-12¹ scenarios are very rare. In this paper, we analyze traffic data from Plan Ceibal [5], a major K-12 education service provider² (ESP), which leads a nationwide one-to-one computing program in Uruguay. This governmental agency provides technological support to the national K-12 education system, including Wi-Fi connectivity and videoconference infrastructure for all public schools, as well as access to educational platforms.

¹ Term coined to refer to primary and secondary education (from 6 to 18 years old).
² Education service provider (ESP): Organization which helps the education system to implement comprehensive reforms.

To the best of our knowledge, this work stands out from previous ones as the first of its kind in a K-12 scenario, where most of the users are between 6 and 18 years old. The main Wi-Fi usage profiles at schools were analyzed, based on unsupervised classification of the collected traffic data. For this purpose, relevant features were extracted, and appropriate distance measures were selected and validated, ending up in a clustering method to identify the main usage profiles. The methodology developed proved to be useful to compare the profiles found in the two available datasets, corresponding to the beginning and the end of the school year, respectively. The results indicate that the main usage profiles identified were the same in both periods, with a high level of coincidence. This is a remarkable result, as it means that the categories in which the users should be classified according to their Wi-Fi network activity remain stable during the school year. Based on this result, it is possible to build a Wi-Fi activity monitoring system which can be trained with data collected during the first weeks of the year only. Once the system has learned the usage profiles, the categories remain fixed and allow monitoring the individual evolution of users during the rest of the school year (e.g. increase or decrease in their activity). With this information, the ESP could provide more personalized services to teachers and students, according to their profile. Furthermore, the usage evolution could be exploited for learning analytics purposes (e.g. studying the correlation with academic performance improvements).

II. PREVIOUS WORK

Most previous works based on data from real-world Wi-Fi networks present a descriptive analysis, including several empirical distributions from diverse measurements [1]. We highlight some studies particularly focused on pattern analysis, with traffic data from different networks. Cerquitelli et al. [6] analyze users with a similar Internet access performance, while Mirylenka et al. [3] look for similar temporal activity patterns, both with data from ISP residential clients. Concerning the selected features, they are at opposite ends: from a pure traffic histogram in the first case (where time does not matter at all to describe each user), to pure time series in the second case (where users must have similar activity at the same time to be grouped together). An intermediate approach is preferred by Mucelli et al. [4], analyzing a large dataset from a major MNO. A first clustering stage is based on features extracted from traffic volumes, identifying three main profiles: light, medium and heavy users. Then, a second clustering is applied, now with features based on the session frequency, resulting in two subcategories: occasional and frequent users. We follow a similar approach, combining both temporal patterns and traffic volumes to represent users' behavior. However, instead of making a separate sequential clustering, we decided to integrate them both in a
single feature vector. Thus, we do not impose a fixed structure of main clusters based on traffic volumes and sub-clusters based on session frequency, but instead we explore the data structure provided by the combined traffic-frequency features. Furthermore, we compute separate features for different times of the day, attending to the main education schedules, which significantly affect the Wi-Fi usage in this context.

III. DESCRIPTION OF THE DATASETS

The datasets used in this study were collected from urban schools, which cover more than 95% of the K-12 students in Uruguay. All these schools are provided with an optical fiber Internet connection and Wi-Fi infrastructure. Each record in the datasets corresponds to the amount of downlink and uplink traffic per device, aggregated during 1 minute, and sampled every 15 minutes. Data were collected from 8587 APs located in 1110 schools, where 65% are primary schools and 35% are secondary schools.

In order to analyze the evolution throughout the year, two datasets were collected: one at the beginning of the school year, right after Easter holidays (April to May 2017) and the other one at the end of the school year, right after Spring break (September to November 2017), hereafter referred to as WIFI-INI and WIFI-END, respectively. The former corresponds to a 6-week period and includes records from 778,940 different devices, where 28% correspond to the ones delivered by Plan Ceibal, and the rest are due to BYOD³. The latter corresponds to an 8-week period and includes 975,441 unique devices, 21% provided by Plan Ceibal and the remaining 79% BYOD.

³ Bring Your Own Device (BYOD): Acronym which refers to the policy of allowing students or employees to bring personally owned devices (laptops, tablets, and smartphones) to their school or workplace.

Figure 1 shows the daily evolution, over school days, of the total number of simultaneously connected devices (the thick line is the median). It is worth noting that, even with an important increase in the total number of unique devices (25% more in WIFI-END), the number of devices simultaneously connected to the Wi-Fi network is quite similar in both datasets. Figure 2 compares the traffic evolution for each period, where the horizontal lines indicate the median of the maximum daily values. A clear boost is observed from April-May to September-November (24% for downlink and 18% for uplink), which can be explained by an increase in the Internet access bandwidth for a large number of schools.

Fig. 1: Daily evolution of the total connected clients throughout school days for the WIFI-END dataset (Sep-Nov '17).
Fig. 2: Traffic evolution during Sep-Nov '17 (WIFI-END dataset).

An additional dataset was collected, including traffic volumes per application (e.g., Facebook, YouTube and Plan Ceibal's educational platforms), which was not considered for the profiling stage, but only for the characterization of the resulting clusters. This data was gathered at the beginning of the school year (during the same period as WIFI-INI) from a subset of 111 schools (10% of the main dataset) and is hereafter referred to as APP-DATA. In this case the vantage point was not the AP but the server located at each school, where the NTOP tool [7] was used to collect traffic flow data per application. The measurements were gathered every 5 minutes with a time window of 5 minutes. The resulting dataset has records from 264,145 different devices, where 19% were delivered by Plan Ceibal, and the remaining ones correspond to BYOD.

IV. UNVEILING WI-FI NETWORK USAGE PROFILES

In order to identify the main usage profiles in Plan Ceibal's wireless network, a standard unsupervised classification pipeline was followed: preprocessing, feature extraction, and clustering. First, a threshold-based filter was used in order to discard the many devices with scarce activity. For this purpose, we defined minimum values for the accumulated downlink and uplink traffic, and for the number of days with network activity.

a) Features: The selected features combine temporal activity with traffic volumes. For this purpose, we used a multi-histogram feature vector, i.e., a concatenation of histograms of different types: traffic histograms and frequency histograms. All of them have the same traffic bins, which follow a quasi-logarithmic scale [6]: {0, 1kbps, 10kbps, 100kbps, 1Mbps, 10Mbps, 20Mbps, 50Mbps, 150Mbps} for downlink and {0, 0.1kbps, 1kbps, 10kbps, 100kbps, 1Mbps, 10Mbps, 20Mbps,
50Mbps} for uplink, which results in an 8-bin histogram for [...]
For the traffic histograms, each bin counts the amount of [...]

Fig. 3: ROC curves (TPR versus FPR) for the evaluated histogram distances under the random (left) and sequential (right) subset division.

Next, we present the discrimination power analysis of the selected features, also taking into account the study of an appropriate distance metric. Then, we discuss the clustering strategy used in Section IV-B.
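As an illustration of the traffic part of the multi-histogram feature vector, the following minimal sketch bins per-device rate samples with the downlink and uplink bin edges listed in paragraph a) and concatenates the two normalized histograms. It rests on several assumptions that are not taken from the paper: each sample is already expressed as a rate in bits per second, each bin simply counts samples and is then normalized, and samples at or above the last edge are clamped into the top bin. The function names are hypothetical, and the frequency histograms and the per-time-of-day split are not covered.

```python
from bisect import bisect_right

# Bin edges in bits per second, copied from the quasi-logarithmic scale in the text.
DOWNLINK_EDGES_BPS = [0, 1e3, 10e3, 100e3, 1e6, 10e6, 20e6, 50e6, 150e6]
UPLINK_EDGES_BPS = [0, 1e2, 1e3, 10e3, 100e3, 1e6, 10e6, 20e6, 50e6]


def rate_histogram(rates_bps, edges):
    """Normalized histogram of rate samples over len(edges) - 1 bins.

    Assumption: bin i covers [edges[i], edges[i + 1]); out-of-range samples
    are clamped into the first or last bin.
    """
    counts = [0] * (len(edges) - 1)
    for rate in rates_bps:
        idx = bisect_right(edges, rate) - 1
        idx = max(0, min(idx, len(counts) - 1))
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def feature_vector(down_bps, up_bps):
    """Concatenate the downlink and uplink histograms (8 + 8 = 16 values)."""
    return (rate_histogram(down_bps, DOWNLINK_EDGES_BPS)
            + rate_histogram(up_bps, UPLINK_EDGES_BPS))
```

With nine edges per direction, each histogram has eight bins, which matches the 8-bin histogram mentioned in the text.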
A. Features and Distance Measures Validation

In this section we evaluate the discrimination power of the proposed features, comparing at the same time the performance of a series of histogram distances. For this purpose, we assume that each user has a relatively constant activity pattern in the considered time period. Then, we assess the performance of various histogram distances at the verification task. That is, given a device feature vector and its identity (i.e., the unique MAC address of the device), verify whether the provided identity is correct.

a) Methodology: First, the considered dataset is split into two subsets S_1 and S_2, each one comprising a fraction of the samples available for each user. Then, using the samples in each subset, two independent feature representations are computed for each user i, one to be used as ground-truth g_i and the other one for testing f_i. Next, for each distance d(·) to be evaluated, N users are randomly chosen and for each user we compute i) the distance between the feature representations of the same user in both subsets, d(f_i, g_i), i = 1, ..., N, and ii) the distance between the feature representation of each user and that of M randomly chosen different users, d(f_i, g_j), j = 1, ..., M. If the distance between users is below a threshold the users are assumed to be the same; otherwise they are not. Then, the performance of each evaluated distance function d(·) is computed as the percentage of correctly classified users. Varying the verification threshold we obtain the ROC curves shown in Figure 3.

b) Subset division: The division of the dataset into subsets S_1 and S_2 is performed in two ways: i) sequential, where S_1 includes the samples of the first half of the considered time period and S_2 those of the second half, and ii) random, where the temporal order is not taken into consideration and both subsets include samples corresponding to interleaved timestamps. Comparing the results under these two division strategies sheds light on the question about the temporal stationarity of users' online behavior. If users' behavior changes drastically from the time period used to build S_1 to the one used to build S_2, the expected classification performance of all the tested distance functions should be low. Otherwise, it should be close to that of the random division.

c) Histogram distances: A wide variety of histogram distance functions have been proposed in the literature [8], [9]. Most of them fall under one of two categories: bin-to-bin or cross-bin, depending on whether information from different bins is combined or not. Bin-to-bin distances are often faster and simpler to compute, while cross-bin distances tend to be more robust and performant, at the cost of a higher computational complexity. In this study, we evaluate a series of widely used bin-to-bin histogram distances: L1, L2, chi-squared, Canberra and intersection, as well as a cross-bin distance: a quadratic form distance combining each bin with its two adjacent neighbors with weights 1/2 and 1/4 respectively. For the L2 distance, two normalization strategies are also evaluated: standard score and min-max scaling.

d) Results: Figure 3 shows the ROC curves for the evaluated distances for the WIFI-INI dataset restricted to Frequent-Active users only (cf. Section V). All of them perform remarkably well, considering that the users' activity profile described by the proposed features is enough to verify their identity well above the 50%-50% chance level. This shows the capacity of the proposed features for user discrimination. Moreover, the chi-squared distance outperforms the rest, reaching an 87.5% true positive rate with a 12.5% false positive rate under the random subset division. Similar results were obtained under the sequential subset division. Although the classification performance is slightly degraded, a high classification rate
is still achieved, suggesting users' online behavior is fairly stationary.

B. Clustering Approach

Considering that the discriminative capacity of the normalized L2 distance has been validated, we propose to employ a simple, widely used algorithm: k-means. After the feature scaling normalization, the k-means algorithm is applied with different k values. Then, the elbow method is used to find the optimal k, by analyzing the evolution of the variance explained. Finally, the stability of the clusters is studied. For this purpose, 85% of the samples are randomly chosen and the k-means algorithm is applied to partition this subset. The Jaccard index is then computed to assess the similarity between the original and the subset partition. If the original clusters are stable, we expect them to remain after randomly dropping 15% of the samples and therefore have a high Jaccard index with respect to the original clusters. This process is repeated 100 times and clusters with an average Jaccard index above 0.9 are kept.

V. DATA ANALYSIS AND RESULTS

This section presents a summary of the analysis conducted with the ESP data presented in Section III, using the methodology described in Section IV. The first step was the preprocessing, which corresponds to the data filtering based on two thresholds: a minimum number of active days, defined as 15% of the total number of days in each dataset; and a minimum aggregated traffic volume, fixed at 1 MB for both datasets. This way, as depicted in Figure 4, four different classes were defined, according to whether the users were frequent or sporadic, and whether they were active or not. It is worth noting that the percentages in each category were very similar for the data of both times of the year.

Fig. 4: Threshold-filter results for the beginning and the end of the school year. (a) WIFI-INI data (Apr-May '17). (b) WIFI-END data (Sep-Nov '17).

From now on, we focus only on the Frequent-Active subset, which has 288,413 different devices for the WIFI-INI dataset and 359,319 for the WIFI-END dataset. In both cases the subset corresponds to 37% of the total number of unique MACs observed in the original datasets. Moreover, the resulting split between devices delivered by Ceibal and BYOD after the filter is also the same for both datasets: 32% and 68% respectively.

A. Wi-Fi Usage Profiles and Evolution During the Year

In order to study the evolution of the Wi-Fi usage profiles during the year, two different questions were addressed:
1) Are the clusters found in the WIFI-INI and WIFI-END datasets the same?
2) Do users have the same usage profile throughout the year?

To answer the first question we need to assess whether the feature space partition given by the selected clustering method is the same for both datasets. First, we found that the optimal number of stable clusters for both datasets was four. Then, we aligned the clusters of both datasets by computing the distances between the centroids and matching the closest ones. This way, we found that the defined cluster mapping corresponds to a bijective function. Moreover, the minimum distances were small, meaning that the typical usage profiles identified in both time periods are equivalent. Figure 5 shows the four clusters found in the Frequent-Active subsets. The cluster labels (LOW, MED, HIGH and HIGH+WK) are related to the activity level for each usage profile, as detailed in the characterization presented in Table I. The difference between HIGH and HIGH+WK is mainly explained by the activity on weekends. The results also show that the percentage of BYOD increases with usage intensity.

Fig. 5: Clustering results for the beginning and the end of the school year. (a) WIFI-INI data (Apr-May '17). (b) WIFI-END data (Sep-Nov '17).

This result is quite relevant, since it implies that the partition defined from the data at the beginning of the year has a high level of coincidence with the partition generated from the data at the end of the year. Thus, it is possible to model the usage profiles only once at the beginning of the school year. Then, the resulting model could be used to monitor the evolution of the users' behavior for the rest of the year, which makes it possible to detect changes in user activity and give more personalized services according to their usage profile.

The fact that the clusters found in both datasets are equivalent does not imply that the same users fall in the same clusters at both time periods, as each individual's behavior can vary during the year. Thus, attending to the second question, we focused on the common devices between the two datasets to analyze the individual's behavior changes. The intersection set has 142,688 devices, that is 49.7% of those included in the WIFI-INI dataset. This is explained by the devices' replacement carried out by Plan Ceibal during the year, as well
TABLE I: Main Wi-Fi usage profiles characterization.
WIFI-INI dataset (Apr-May '17)
WIFI-END dataset (Sep-Nov '17)
b) User sessions statistics: Three indicators were defined in order to analyze user sessions: total sessions per day, session