Professional Documents
Culture Documents
∗
Amelie Anglade Marco Tiemann, Fabio Vignoli
Centre for Digital Music Philips Research Europe
Queen Mary, University of London High Tech Campus 34
Mile End Road 5656 AE Eindhoven, The Netherlands
London E1 4NS, UK marco.tiemann@philips.com,
amelie.anglade@elec.qmul.ac.uk fabio.vignoli@philips.com
ABSTRACT 1. INTRODUCTION
This article presents an approach to automatically create Music recommender systems and algorithms continue to
virtual communities of users with similar music preferences attract scientific and commercial interest. Generally, music
in a distributed system. Our goal is to create personal- recommender research focuses on either service-based rec-
ized music channels for these communities using the con- ommender systems, that can be used over the Internet, or
tent shared by its members in peer-to-peer networks for each on autonomous recommender systems for stand-alone de-
community. To extract these communities a complex net- vices such as portable music players. In this paper we are
work theoretic approach is chosen. A fully connected graph concerned with providing music recommendations and per-
of users is created using epidemic protocols. We show that sonalized music radio channels directly between connected
the created graph sufficiently converges to a graph created systems and devices using peer-to-peer networks. Such con-
with a centralized algorithm after a small number of proto- cepts of peer-to-peer radio emerged recently, and some sys-
col iterations. To find suitable techniques for creating user tems are already publicly available and prove the concept’s
communities, we analyze graphs created from real-world rec- technical feasibility (e.g. [11]). Regarding personalized rec-
ommender datasets and identify specific properties of these ommendation over peer-to-peer networks, relatively little
datasets. Based on these properties, different graph-based research has been conducted to date. One well-known re-
community-extraction techniques are chosen and evaluated. search project on distributed recommendation is the TRI-
We select a technique that exploits identified properties to BLER project (cf. Wang et al. [14]). TRIBLER provides
create clusters of music listeners. The suitability of this tech- users with personalized recommendations for items shared
nique is validated using a music dataset and two large movie in peer-to-peer networks using a distributed version of the
datasets. On a graph of 6,040 peers, the selected technique widely-used collaborative filtering approach (Resnick et al.
assigns at least 85% of the peers to optimal communities, provide an introduction to collaborative filtering in [13]).
and obtains a mean classification error of less than 0.05%
over the remaining peers that are not assigned to the best In this paper we describe our approach to identifying peers
community. that share similar music preferences and potentially other
criteria and to cluster these peers into communities. Such
communities can be used to provide music recommendations
Categories and Subject Descriptors and shared music radio channels to peers in these groups.
H.4.3 [Systems and Software]: Distributed systems; I.5.3 Music shared by peers in such channels can be transmitted
[Clustering]: Algorithms; G.2.2 [Graph Theory]: Graph directly between peers within the groups or broadcast to the
Algorithms groups of peers depending on the underlying infrastructure.
41
2. PROBLEM DEFINITION AND in terms of music preferences. Each peer is characterized by
SOLUTION APPROACH a vector of preference-song pairs, where the preference val-
ues can be regular (e.g. on a scale from 1 to 5) or binary
The main challenge when automatically creating groups values (like or dislike). In our case, we assume that these
or communities of users in the desired setting is to identify preferences reflect explicit ratings provided by each peer.
clusters of similar peers in a peer-to-peer network so that
every peer in the system can enjoy a selection of music that We use the Pearson correlation coefficient to define the
suits her music preferences. similarity measure s between two peers a and b. This is a
2.1 A dynamical, distributed optimization similarity measure that is frequently used in collaborative
filtering recommender systems (e.g. Resnick et al. [13]):
problem P
Creating communities of peers can be expressed as an op- j (va,j − v̄a ) (vb,j − v̄b )
s(a, b) = qP (1)
timization problem which is composed of several potentially 2P 2
complementary, but also potentially competing objectives j (va,j − v̄a ) j (vb,j − v̄b )
described in the following section. where j is the index of the song for songs present in both
• Each peer must be satisfied with the music selection peer’s preference vectors, va,i is the preference value by peer
proposed to her. Recommended or played music must a on song i, va corresponds to the mean preference value for
match the peer’s personal taste. peer a over her complete song preference vector Pa :
1 X
• Within each community a good average satisfaction v̄a = va,i (2)
|Pa | i∈P
must be obtained. Optimally, each peer should be more a
satisfied with the recommendations provided by her Using this similarity measure we use a four step approach
community than she would be in any other commu- to find a suitable graph-based solution to automatically and
nity. distributedly create communities of peers:
• Adding a new peer to a community must not deterio- • We implement and compare 3 distributed algorithms
rate the average satisfaction. When adding a new peer to build overlay networks of peers.
to a community, the previously stated objective must
still be fulfilled both for her and for all members of the • We create and analyze graphs of peers using real-world
community. reference datasets and identify relevant common prop-
erties of such datasets.
• Extreme community sizes must be avoided. Both cre-
ating very large communities and very small commu- • We comparatively evaluate several candidate cluster-
nities must be avoided. Assigning most or all peers ing techniques. We then select the most appropriate
to a single large community can result in an unsat- technique for extracting communities of peers, based
isfactory listening experience; fragmenting peers into on our objectives formulated in Section 2.
many very small communities limits the exposure to
additional music not included in a peer’s music col- • We validate the performance of the selected technique
lection. The community is a way to expose the users experimentally.
to new interesting music matching their music tastes
We present selected results of our work in the following sec-
(serendipity).
tions. A detailed presentation of the work carried out can
Moreover, various dynamical aspects have to be taken into be found in [2].
account for resolving the optimization problem:
2.3 Datasets
• Peer-to-peer systems are intrinsically dynamic. Peers Three datasets from the music and movie domain collected
may enter or leave the system at any given time. This under real-world conditions have been analyzed. The data-
may require the system to dynamically adapt the com- sets vary in the number of users and in the amount of pref-
munity structure. erence data available for them:
• Music preferences evolve over time. A peer may tem- 1. The EasyAccess (EA) dataset gathered within Philips
porarily or permanently change her music preferences. Research contains 74,631 ratings over 5,234 songs from
Reflecting such changes in a timely manner is impor- 462 users.
tant in order to satisfy listeners.
2. The MovieLens1 (ML1) and MovieLens2 (ML2) data-
Finally, given the distributed nature of peer-to-peer systems, sets [9] contain respectively 1,000,029 ratings over 3,593
all computations for creating and maintaining communities movies provided by 6,040 users and 100,000 ratings of
must be performed by the peers themselves. 1,682 movies from 943 users. These datasets are widely
used as reference datasets for evaluating recommender
2.2 Solution approach algorithms.
We adopt a graph-theoretic approach to solve the given
problems. We start by distributedly constructing a graph All datasets contain regular ratings on a five point nu-
of all peers and then apply a clustering technique on that merical scale from one to five. We report evaluation results
graph in order to form communities of peers. For construct- both for these regular ratings and for binary ratings. Bi-
ing this graph, we link each peer to her N most similar peers nary ratings are constructed as follows: values larger than 3
42
are considered as positive ratings, others are considered as In the “Buddycast algorithm” and “Modified Buddycast
negative ratings. algorithm”, each peer also maintains a list of the K most
The parameter N for the number of most similar peers recently visited random peers which are not in the buddy
used is a variable in the analysis in order to study its influ- list. This “taboo list” is maintained to ensure that recently
ence on the topology of the resulting graphs. In all evalua- visited random peers are not visited again.
tions we report results for values of N = 5, 10, 20, 30, and
46 (which represents 10% of the population of the smallest 3.1.2 Selecting peers to exchange information
dataset). Larger values of N are impractical for distributed The method of selecting peers to exchange information
solutions: it would be prohibitively time-consuming to com- with differs between the three algorithms.
pute large numbers of most similar peers over a peer-to-peer In the “Newscast algorithm”, when a peer p wants to con-
system. tact another peer, she randomly selects one in the joined
list composed of the peers in p’s buddy list and the peers in
3. DISTRIBUTEDLY BUILDING A p’s random cache. At each selection step, each peer in this
CONNECTED GRAPH OF USERS joined list has the same probability to be chosen.
In the “Buddycast algorithm” [14], the buddy list stores
As described in Section 2, our goal is to construct and
the most similar peers as a ranked list. Prior to the selection,
maintain a constantly evolving graph in which each user
the list is appended with λ·N peers from the random cache
of the system is linked to her N most similar peers based
(excluding the K recently visited random peers from the
on music preferences. For this we need to find the top-N
“taboo list”). A roulette-wheel selection proportional to the
most similar peers for each user, which involves computing
rank is applied on this list. This means that higher-ranking
(n(n − 1))/2 similarity measures where n, the number of
peers are more likely selected than lower-ranking peers. The
users in the system, can potentially be quite large (in the
parameter λ is a real non-negative number. The parameter
order of millions). We need to be able to update these lists
value is used to tune the exploitation-to-exploration ratio:
of similar peers quickly to reflect new or changed user prefer-
the higher the value of λ, the more random peers are ap-
ences, which means recomputing the created lists of similar
pended to the list.
peers every time a user’s preferences are modified.
In the “Modified Buddycast algorithm”, before selecting
In a distributed system, peers only have a local view of
a peer to contact, a peer creates an ‘extended buddy list’
the system. Thus it is not possible to directly retrieve pref-
as in the Buddycast algorithm. Unlike in the “Buddycast
erence data of the required n − 1 peers in a distributed sys-
algorithm”, no rank information is used for selecting a peer
tem. Epidemic protocol algorithms can be used to spread
to contact. Instead a peer is selected randomly from the
the required data throughout distributed environments. In
‘extended buddy list’.
this section, we examine several such algorithms to find out
whether or not they are suitable for quickly identifying the 3.1.3 Exchanging information
top-N most similar peers in a peer-to-peer system.
When one peer contacts another one, she obtains informa-
3.1 Distributedly creating a graph tion from her and updates her own information accordingly.
Epidemic protocol algorithms apply mechanisms of epi- In the three epidemic protocol algorithms described here if
demic spreads in nature. Having only a local view of a a peer p1 selects a peer p2 to communicate with, p2 sends p1
system (for example knowing some other peers), each peer her buddy list as well as her own preference list. p1 evaluates
can transmit the pieces of information she has to some other her similarity with p2 ’s buddies and p2 using their preference
peers through direct local interaction with them, in this way lists. From this joined list of up to 2N +1 buddies she retains
changing their states. We implemented and tested three epi- the N most similar to her as her new buddy list.
demic protocol algorithms. In the remainder these are re- If p2 is a random peer, p1 ’s list of the K most recently
ferred to as “Buddycast algorithm”, “Modified Buddycast al- visited peers is updated to include p2 as the most recently
gorithm” (MBuddycast) and “Newscast algorithm” (c.f. [10] visited random peer. p1 ’s random cache (of size c) is up-
for the Newscast algorithm and [14] for the Buddycast al- dated as well, so that it contains c peers that are not in p1 ’s
gorithm). Each of these algorithms performs three basic new buddy list.
tasks for propagating information over the distributed sys-
tem: maintaining local information, selecting peers to ex-
change information, and exchanging information.
3.2 Algorithm performance comparison
3.1.1 Maintaining local information An epidemic protocol is suitable for distributed recom-
In all three algorithms, each peer maintains a top-N list mendation if the list obtained by the peers using the epi-
of peers she considers as the most similar to her among the demic protocol (epidemic information) tends to converge
peers she has encountered in the system, which is referred with the information of a centralized algorithm (exact infor-
to as buddy list. It contains the preference lists, addresses mation) quickly. In simulations, the number of protocol it-
and corresponding similarity of the peers (relative to their erations can be used to measure the time required to achieve
current preference list). The combination of one peer’s pref- this. One iteration of an epidemic protocol denotes updating
erence list of songs (as defined in Section 2) and her buddy the buddy list of every peer in the system exactly once. For
list constitutes her local information that can be updated a performance comparison, each buddy list obtained with
and transmitted to the other peers. In addition, each peer the epidemic algorithms is compared to the exact buddy list
maintains a cache of a number of addresses of random peers of the same peer obtained with a centralized algorithm af-
she can contact. ter each iteration of the epidemic algorithm. The fraction
of overlap f o at this iteration is the number of exact bud-
43
1 1
0.98
fraction of overlap
fraction of overlap
0.8
Buddycast, Binary rat. 0.96
0.6 Buddycast, Binary rat.
Buddycast, Regular rat. 0.94 Buddycast, Regular rat.
MBuddycast, Binary rat. MBuddycast, Binary rat.
0.4 MBuddycast, Regular rat. 0.92
MBuddycast, Regular rat.
Newscast, Binary rat. Newscast, Binary rat.
0.2 0.9
Newscast, Regular rat. Newscast, Regular rat.
0.88
0 10 20 30 40 50 50 100 150 200 250 300 350 400 450 500
iterations iterations
Figure 1: Convergence of the three epidemic pro- Figure 2: Convergence of the three epidemic proto-
tocols to the results obtained with a centralized al- cols. Iterations 50 to 500.
gorithm applied on EA for both kinds of ratings,
N = 10, K = 10, λ = 1. Iterations 0 to 50.
dycast algorithm than with the Newscast algorithm (using
implicit binary ratings) on a different dataset containing 480
dies found by the epidemic algorithm divided by the total peers. Additional simulations using datasets with varying
number of exact buddies. properties need to be performed to determine possible causes
P of these differences in results. However, as already pointed
p∈P |“Epidemic”buddies(p) ∩ “Exact”buddies(p)| out, all three algorithms perform sufficient and quite well
fo = P (3)
p∈P |“Exact”buddies(p)|
on our test dataset. For this dataset, either of the three is
suited to construct a graph of similar users efficiently.
where P is the ensemble of all peers.
The simulations were performed using the EasyAccess da- To verify whether these epidemic protocols would perform
taset (described in Section 2.3). Parameter values for the equally well on peer-to-peer networks of larger sizes and
epidemic algorithms were chosen equal to the values used for a larger number of buddies per peer, a simulation us-
in [14]: N = 10, K = 10 and λ = 1 (note that K and ing the Modified Buddycast algorithm was carried out using
λ are not used in the Newscast algorithm). No dynamics the MovieLens1 dataset (see Section 2.3). For this simula-
were introduced in these simulations: all peers were acces- tion, regular ratings were used, and the parameters N = 30,
sible during the entire simulation, and all peers’ preference K = 30 and λ = 1 were chosen. The evolution of the average
values remained unchanged. fraction of overlap of this simulation is shown in Figure 3.
The results are encouraging: after 50 iterations, the Modifed
The results obtained in these simulations are shown in Buddycast algorithm is able to correctly identify 85% of the
Figure 1 and Figure 2. Each graph represents the average 30 ∗ 6040 top-N peers of the network. The convergence of
fraction of overlap over 30 trials for each epidemic protocol. the epidemic protocol to the exact solution here is almost as
All three epidemic algorithms converge to the results of the rapid as the convergence of the same epidemic protocol on
centralized algorithm very well. At least 90% of the exact a much smaller dataset with three times less peers per user
buddies are found in less than 50 iterations, which is a suf- to examine. Thus, the three implementedepidemic protocols
ficiently small number of iterations. For larger numbers of can be considered to be scalable up to very large graphs of
iterations, all tested epidemic protocols reach very high val- peerswith little degradation in performance.
ues of overlap between 98.22% for the least effective protocol
and up to 99.08% for the best protocol with 500 iterations.
When comparing results for binary versus regular ratings, 4. GRAPH OF PEERS
for small numbers of iterations epidemic algorithms using
binary ratings outperform the ones using regular ratings. 4.1 Graph properties
For larger numbers of iterations however, the epidemic al- Peer-to-peer systems can easily contain thousands or mil-
gorithms obtain similar results with both types of ratings. lions of peers. The recently established scientific field of
The epidemic protocol algorithm experiments with regular complex network theory is concerned with studying and sim-
ratings perform equal or better than in experiments with ulating complex relations and dynamics in large real-world
binary ratings after 45 to 250 iterations varying by the al- networks [15]. Complex network theory is based on graph-
gorithm used. The choice of using regular or binary ratings theoretic foundations and is specifically concerned with real-
thus mainly depends on the epidemic algorithm chosen and world phenomena that exhibit network properties [1]. Thus
on the number of iterations that can be performed before insights and solutions from this field may prove helpful for
giving recommendations. addressing the problems we are interested in.
Finally, the simulations show that the Newscast algorithm
obtains more accurate results than the Modified Buddycast A graph can be constructed as a directed graph, in which
algorithm which in turn outperforms the Buddycast algo- all edges between nodes are directed, or as an undirected
rithm. This is surprising given the results obtained in [14], graph, in which edges are not directed. A component is a
where significantly better results are obtained with the Bud- maximal sub-graph of nodes for which a path exists between
44
1
sc k lr l Cr C
EA N = 5 4 9.5 2.73 2.64 0.0205 0.1041
0.8 EA N = 10 2 17.8 2.13 2.36 0.0384 0.1041
fraction of overlap
45
0 0
10 10 30 90
N=5 80
frequency of k
N=20
frequency of k
−1
10 20 60
N=30
−2 N=46 50
10
40
−2
10 10 30
−3
10 20
10
−3 −4
10 10 0 0
1 2 3 1 2 3 4 20 40 60 80 100 20 40 60 80 100
10 10 10 10 10 10 10
percent. of removed nodes percentage of removed nodes
k k
Figure 4: Degree distribution of the MovieLens2 Figure 5: Robustness of the EA graph to node re-
graph for binary ratings and N = 20 (left) and moval. Average number of components and percent-
MovieLens1 graph for regular ratings and N = 10 age of nodes of the initial graph in the largest com-
(right). On a log-log scale it can be approximated ponent over 50 trials as a function of the percentage
by a straight line (power law distribution) or a bro- of randomly selected removed nodes.
ken line of negative slope.
46
0 0
10 10 ter sizes are very similar to the degree distributions of the
networks of most similar peers. On a log-log scale they can
cumulative frequency
be approximated by a straight line (power law distribution)
or a broken line of negative slope. Here the created cluster
−1
10 −1
structures from the Hub-Based clustering technique seem to
10 closely reflect the intrinsic topology of the network of peers.
In conclusion, both the Max-Flow and the Edge-Betwe-
enness technique only create clusters of extreme sizes. This
is likely caused by the specific properties found in the evalua-
−2
10 −2 tion datasets. The compactness of the graphs can not be di-
10 vided into clusters sufficiently by these two techniques. Hub-
0 2 4 0 2 4
10
10 10 10 10 10 Based clustering creates a range of different cluster sizes,
size of the community which makes it a suitable candidate for further validation.
Figure 6: Cumulative distribution of the sizes of the
communities for binary ratings, N = 20, 150 starting 5.3 Validation measure
hubs ML dataset (left) and for regular ratings, N = Validating the suitability of the Hub-Based clustering tech-
46, 46 starting hubs EA dataset (right) using Hub- nique to create clusters or communities of peers according
Based clustering. to our objectives requires us to assess the quality of the clus-
tering achieved.
47
5.4 Validation results stance to attenuate any possible popularity bias that might
Our validation of the clusters found using the Hub-Based appear with the Pearson correlation coefficient (e.g. by in-
clustering technique results in several noteworthy findings. cluding the inverse user frequency suggested by Breese et
Firstly, the percentage of well-classified users varies between al. [4]). We plan to examine the influence of these different
48% and 92%. It increases with the size of the network. Us- similarity measures on the graph structures created, and to
ing regular ratings leads to better classification results than evaluate their performance and applicability in peer-to-peer
using binary ratings. The neighborhood size chosen also networks.
influences the classification accuracy: the percentage of sat-
isfied users reaches its maximum values for numbers of most 7. ACKNOWLEDGMENTS
similar peers N = 20, and N = 30. It is consistently above The authors acknowledge the support of the UK Engi-
75% for all datasets used for validation for those values of N neering and Physical Research Council (EPSRC) for support
when they are combined with regular ratings and an optimal of the OMRAS2 project (EP/E017614/1) in which Amélie
number of starting hubs chosen so that the average num- Anglade is currently involved. The authors would like to
ber of users per community is between 20 and 30 (cf. [2]). thank the GroupLens Reseach Group for providing the Movie-
For the largest dataset used this percentage is consistently Lens ratings dataset.
above 85%. This indicates that the percentage of correctly
assigned peers increases with the overall community size.
The classification error for the remaining users varies be- 8. REFERENCES
tween 0.03 and 0.16. Again, this error decreases as the size [1] R. Albert and A.-L. Barabási. Statistical mechanics of
of the dataset increases and is consistently smaller than 0.05 complex networks. Reviews of Modern Physics,
for the best parameters described above. This means that 74(1):47–97, 2002.
the music of their community is almost as good for them as [2] A. Anglade. Virtual communities for creating shared
the music from the other communities. music channels. Master’s thesis, Chalmers University
Summarizing our evaluation, we find that the evaluated of Technology, Göteborg, Sweden, 2007.
Hub-Based clustering technique assigns a large majority of [3] A.-L. Barabási and R. Albert. Emergence of scaling in
users to the clusters optimal to them and assigns the non- random networks. Science, 286:509–512, 1999.
optimally assigned users to clusters that are very close to [4] J. Breese, D. Heckerman, and C. Kadie. Empirical
optimal clusters for neighborhood sizes of 20 and 30 most analysis of predictive algorithms for collaborative
similar peers. The results of the validation indicate that the filtering. In Proc. of UAI, 1998.
technique performs with increasing quality the larger the [5] L. da F. Costa. Hub-based community finding.
dataset it is applied on is. arXiv:cond-mat/0405022, 2004.
To confirm these encouraging results we plan to compare [6] P. Erdős and A. Rényi. On the evolution of random
the users’ subjective satisfaction when creating communities graphs. Publ. Math. Inst. Hung. Acad. Sci., 5:17–61,
using the Hub-Based clustering technique with their satisfac- 1960.
tion when applying other well-known clustering techniques [7] G. W. Flake, S. Lawrence, C. L. Giles, and F. M.
such as k-means clustering. Coetzee. Self-organization and identification of web
communities. IEEE Computer, 35(3):66–71, 2002.
[8] M. Girvan and M. E. J. Newman. Community
6. CONCLUSION structure in social and biological networks. In Proc.
In this paper we presented a technique to cluster users Natl. Acad. Sci. USA, volume 99, pages 7821–7826,
according to their explicitly or implicitly stated music pref- 2002.
erences. In addition to dealing with the highly compact na- [9] http://www.grouplens.org/.
ture of the graphs of similar peers this technique also reflects
[10] M. Jelasity and M. van Steen. Large-scale newscast
and uses the intrinsic topology of the created graphs to build
computing on the internet. Technical Report IR-503,
communities. It did so with convincing results, consistently
Vrije Universiteit, Amsterdam, 2002.
assigning at least 85% of all peers to optimal communities,
[11] http://www.mercora.com/.
and assigning remaining users to communities that are al-
most as suitable as their optimal community. The graph of [12] M. E. J. Newman and M. Girvan. Finding and
peers created using the Pearson correlation measure proved evaluating community structure in networks. Physical
to be highly robust to random node removal. Graphs with Review E, 69, 2004.
such properties are well suited to be used in peer-to-peer [13] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and
networks. We showed how such graphs can be efficiently J. Riedl. Grouplens: An open architecture for
and quickly created in peer-to-peer networks with epidemic collaborative filtering of netnews. In Proc. of the
protocols. Conference on Computer Supported Cooperative Work,
Future work includes investigating the performance of the pages 175–186, New York, 1994. ACM.
proposed solution in dynamic environments with rapidly [14] J. Wang, J. A. Pouwelse, J. Fokker, and M. Reinders.
joining and parting network members, and with user pref- Personalization of a peer-to-peer television system. In
erences that change over time. Another interesting chal- Proc. of EuroITV 2006, Athens, 2006.
lenge that remains is to develop highly efficient means to [15] D. Watts. Six Degrees: The Science of a Connected
identify starting hubs for Hub-Based clustering in a peer- Age. Norton, 2003.
to-peer environment, with which a low-overhead distributed [16] D. J. Watts and S. H. Strogatz. Collective dynamics of
implementation of this clustering technique can be provided. ‘small-world’ networks. Nature, 393:440–442, 1998.
We also intend to use different similarity measures for in-
48