BitTorrent Analysis: Its Correlations With The Real World

Sidhant Gupta, Nehil Jain, Kevin Ruffin College of Computing, Georgia Institute of Technology {sidhant, njain31, gtg584a} @ mail.gatech.edu

Abstract
Of the many peer-to-peer file sharing protocols, BitTorrent is one of the most popular and widely used. The majority of the content transferred over BitTorrent consists of audio, video and software. In this project we analyze how the real-world popularity of music albums correlates with their activity and popularity on the BitTorrent network.

1. Introduction
We collected various parameters for the BitTorrent files of music albums, such as the number of seeds, the number of active downloaders, and each downloader's client port, IP address and location. For the albums that we monitored, we also collected their weekly rankings from popular websites [1],[2], along with sales data for a few. After processing all the data collected over a period of more than four weeks, we plotted selected charts depicting the behavior of various parameters. Different inferences can be drawn from each of the charts. Our preliminary analysis shows a positive correlation between the behavior of music albums on BitTorrent and their behavior in the real world.

2. Motivation
We were influenced by the paper [3], in which the authors analyzed various aspects of videos, such as popularity and length, and derived higher-level inferences that give insight into how and why people use YouTube and similar services. Analyzing such online systems can therefore yield higher-level inferences that apply to the real world. This motivated us to do a similar study of the BitTorrent network, to see how the popularity of music on it correlates with real-world popularity and how location affects it.

3. Methodology
Our methodology is broadly divided into three tasks, namely data selection, data collection and data analysis.

3.1 Data Selection
Our first step was to decide which music albums to log data for. We chose albums that were spread across online ranking charts: around 40% of the albums were chosen from the top ten on the charts, another 40% from the top twenty and the rest from the top fifty. Moreover, the charts were not only US based, but also from the UK, France, Japan, Canada and Spain. Choosing albums this way gave us diversity in behavior and let us confirm inferences across different locations. Having different countries provided us with data to localize behaviors, such as popularity by country, and to infer whether a particular genre is popular or not. BitTorrent usage patterns with respect to time could also be inferred more reliably with data from different time zones. In total we selected 30 music albums. This number was chosen so that even if data were lost or corrupted for 5-6 albums, we would still have substantial data for analysis.

3.2 Data Collection
Once the album selection was finalized, we needed to log various parameters periodically for all the albums. In the BitTorrent network, when a peer wishes to download a file it contacts a server that keeps track of all torrents published on it. This server is known as a tracker. When a torrent is published on a tracker, the tracker's URL and port are inserted into the .torrent file. When a user loads this .torrent file into his/her BitTorrent client, the client extracts the tracker URL and also computes a hash. The hash uniquely identifies the torrent and is used by the tracker to look up its statistics. When the file transfer is started, the client needs a set of ‘bootstrap’ peers to connect to and start downloading the data. To get this list of peers it contacts the tracker. An HTTP GET request with the hash, client id and other parameters is sent to the tracker, which then returns a list of 50 peers, each packed in six bytes, with four bytes for the IP and two for the port. The tracker also returns the interval time, i.e. the amount of time the client should wait before sending another query to the tracker.

The tracker also maintains a service called a ‘scrape’. The purpose of the scrape is to provide health information about the torrents the tracker hosts. This information consists of the total number of files, the number of active seeders, the number of leechers and the total downloads so far. The query process is similar to that of the tracker: it is designed such that replacing ‘announce’ in the tracker URL with ‘scrape’ returns the scrape data.

The data returned by a tracker query contains 50 randomly chosen IPs from its complete pool of seeders and leechers. Thus, to obtain the maximum number of IPs from the tracker’s pool, we have to query it multiple times in a short duration [4]. Most trackers check the user-agent property before replying with data, so if a conventional browser is used, a correct response is not guaranteed. We needed a way to mimic the behavior of a BitTorrent client and construct GET requests for the tracker and scrape services. To achieve this we made use of the curl tool. Curl allows complete control over every property, so we could set the user-agent to BitTorrent/3.4.3 and construct a GET query.

Before we could use curl, however, the hash was required. The hash for a torrent is the 20-byte SHA-1 hash of the info section of the torrent’s metainfo. The torrent files for all our music albums were parsed using a PHP script that computed the hash, or more specifically the info_hash, from the files and then constructed a curl command with various GET parameters. These were: peer_id, a randomly chosen 19-byte string that identifies the client program; downloaded, uploaded and left, which notify the tracker of the amount of data transferred by the client; event, which we set to “started” so that the tracker believes we have just started to download and provides us with a list of peers; and finally compact, which we set to “1” so that the data is returned in a compact form of 6 bytes per peer, whereas otherwise it takes up to 21 bytes per peer and wastes bandwidth. Similar GET strings were generated for the scrape service. When executed, the tracker and scrape queries successfully returned binary data.

We executed the GET queries every 20 minutes as a cron job and dumped the responses to files, one for each album. The response does not contain any date and time information, so we appended a timestamp after every query was run. The cron job ran for more than 4 weeks, starting from October 26, 2007, logging more than 4 million IPs and 120 thousand data points for scrape. To make sure that data was being collected properly, we developed a small script that decoded the dump files and counted the number of IPs each day, reporting any anomalies to us. Despite that, we lost data for three albums: two because the trackers they were hosted on went down, and one due to a programming mistake that skipped adding date stamps to the data. With no date information, that data was rendered partially useless.
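The request construction and the six-byte peer format described above can be sketched in Python. This is a minimal illustration, not the authors' PHP and curl setup: the function names, the tracker host, the 20-byte peer_id value and the default port are all hypothetical, and the scrape helper assumes the common convention in which the final path component of the tracker URL begins with "announce". No network request is actually sent.

```python
import ipaddress
import struct
import urllib.request
from urllib.parse import urlencode

USER_AGENT = "BitTorrent/3.4.3"  # user-agent string mentioned in the text


def build_announce_url(tracker_url, info_hash, peer_id, port=6881, left=0):
    """Build the tracker GET URL; urlencode percent-encodes the raw bytes."""
    params = {
        "info_hash": info_hash,  # 20-byte SHA-1 digest of the info section
        "peer_id": peer_id,      # hypothetical client identifier
        "port": port,            # hypothetical listening port
        "uploaded": 0,
        "downloaded": 0,
        "left": left,
        "event": "started",      # pretend the download has just begun
        "compact": 1,            # ask for 6 bytes per peer
    }
    return tracker_url + "?" + urlencode(params)


def scrape_url_for(announce_url):
    """Derive the scrape URL (assumes the announce/scrape naming convention)."""
    head, sep, tail = announce_url.rpartition("/")
    if not tail.startswith("announce"):
        raise ValueError("URL does not follow the announce/scrape convention")
    return head + sep + "scrape" + tail[len("announce"):]


def prepare_request(url):
    """Create (but do not send) a request carrying the BitTorrent user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def decode_compact_peers(blob):
    """Split a compact peer string into (ip, port) tuples, 6 bytes per peer."""
    if len(blob) % 6 != 0:
        raise ValueError("compact peer data must be a multiple of 6 bytes")
    peers = []
    for offset in range(0, len(blob), 6):
        # 4 bytes of IPv4 address, then a big-endian 16-bit port
        ip_bytes, port = struct.unpack_from("!4sH", blob, offset)
        peers.append((str(ipaddress.IPv4Address(ip_bytes)), port))
    return peers
```

For example, `decode_compact_peers(bytes([192, 168, 0, 1, 0x1A, 0xE1]))` yields one peer at 192.168.0.1 on port 6881.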

3.3 Data Analysis
To analyze the data, we wanted a tool that could generate datasets for whatever parameters and constraints we chose, so a database was the natural choice. We converted all the raw file data into a database by writing two parsers, one for tracker data and the other for scrape data. The format in which BitTorrent’s data is encoded is known as bencoding [5], which packs data in a terse format. Bencoding supports lists, dictionaries, byte strings and integers, each with its own encoding and delimiters. Byte strings are encoded as their length, followed by a colon and then the actual data, for example 5:peers followed by 20:<20 bytes of data>. To retrieve IPs, ports and other parameters from this data, we needed a custom parser. We wrote the parser in C, and it successfully extracted all the data. This data was then processed through the GeoIP database [6] to retrieve location information such as country, state, city and region. The result was exported as a CSV file so that it could be easily imported into the MySQL database. A similar process was followed for the scrape data, without the GeoIP processing. The CSV files were imported into the database by a small Java program that parsed the CSV, pulled out the necessary parts, generated insert statements from them and used those statements to populate the database. Once the data was in the database, queries were formed to pull pertinent data for our analysis. Here is a sample query used to obtain the number of distinct IPs per state in the United States:

    select state, count(distinct ip) TotalDistinctIPs
    from AlbumPeers
    where country='US' and album_id = 1
      and time between '2007-10-26 12:00:00' and '2007-12-3 14:00:00'
    group by state
    order by count(distinct ip) desc
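To make the bencoding rules concrete, here is a small recursive decoder sketched in Python. It is an illustration only and not the authors' C parser; the function name `bdecode` and the sample response are hypothetical.

```python
def bdecode(data, pos=0):
    """Decode one bencoded value starting at data[pos].

    Returns a (value, next_position) pair so that callers can keep reading.
    """
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<item><item>...e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {pos}")


# Hypothetical tracker response: an interval and one compact peer entry.
sample = b"d8:intervali1800e5:peers6:" + bytes([10, 0, 0, 1, 0x1A, 0xE1]) + b"e"
decoded, consumed = bdecode(sample)
```

Returning the next position alongside each value keeps the decoder free of shared state, which makes nested lists and dictionaries straightforward to handle.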

4. Analysis

Distribution of total downloaders per hour.

The above graph shows the number of leechers classified according to the hour of the day. Peak activity occurs from 15:00 to 19:00 and the lowest activity occurs during the early hours of the day.

Top 20 countries which use BitTorrent most.

This graph shows the top countries where BitTorrent traffic is prevalent. The United States accounts for 35% of the total IPs that we collected during our trace. The graph is as expected, since the songs selected are more popular in the United States and in other English-speaking and European countries.

Distribution of each song per hour.

This graph shows the hour-wise distribution for each song, and it confirms the previous result that there is a general peak at about 15:00 to 19:00 (local time). This also holds for songs where more than 60% of the IPs come from a particular region outside the US: their peaks correspond to the evening in that region's local time.

The above graph shows the variation of the weekly number of downloads (“Back of my Lac – no”) and the ranks (“Back of my Lac -r”) for the album “Back of my Lac”. We consider the two curves positively correlated if they converge or diverge together. For example, the rank of the album is 35 in week 3 and 51 in week 4, so the rank increases (i.e. the album becomes less popular). Similarly, the number of downloaders decreases from week 3 to week 4. The two curves thus converge together and can be said to be positively correlated for the period considered.

Overall correlation statistics for US

Week     Number of songs showing correlation   Total songs considered   Percentage of songs
Week 1   13                                    18                       72.23
Week 2   12                                    18                       66.66
Week 3   12                                    16                       75.0
Week 4   10                                    14                       71.4

Rankings and Number of Downloaders Correlation

The above graph shows the correlation between the rankings of four songs and the number of downloaders for each song per week. Looking at the graph, one can intuitively say that there exists a positive correlation: with an increase in the rank (becoming less popular) there is a decrease in the number of people downloading, and with a decrease in the rank (becoming more popular) there is an increase in the number of downloaders. We noticed this correlation in most of the songs.

We also observed non-positive correlation for some songs (Foo Fighters, Jennifer Lopez and Nightwish). For these songs the rank was increasing (towards less popular) and yet the number of downloaders increased. Upon analysis, we found that most of these songs have a stronger presence (in terms of percentage of downloaders) in Europe (UK, Spain, France etc.) than in the US. These songs had a low rank (quite popular) in the European charts and hence a high number of downloaders, even though the US ranking increased (less popular). Similarly, the number of downloaders fell when the European rank increased (less popular). We correlated these songs with the European rankings and found that they also display positive correlation. As an example, for the album by Mika the rank in the US increases (becomes less popular), but the number of downloaders increases. This is due to a large percentage of downloaders from the UK and France, where the album is popular.

There was also a strange observation for the album Noel by Josh Groban, which was consistently ranked in the top 10. Even though its ranking improved, the number of downloaders decreased. We observed a large number of seeders while the album was climbing the chart, so we speculate that many people had already downloaded the album due to its immense popularity, which is why a corresponding increase in the number of downloaders was not seen as the ranking improved.

Another observation was made for the songs popular in the UK charts, namely those by Mika, Sugababes, Katie Melua and James Blunt. The number of downloaders in the UK follows the rankings quite well, i.e. the two correlate positively to a great extent. This further strengthens our belief that the number of downloaders in an area can measure the success of a song there.
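The week-over-week comparison described in this section can be written as a simple direction check. The sketch below is one possible reading of that comparison, not the authors' stated procedure. The function names are hypothetical, and apart from the ranks 35 and 51 taken from the text, all numbers are made up for illustration.

```python
def moves_together(rank_prev, rank_next, downloads_prev, downloads_next):
    """True when rank and downloads agree on the direction of popularity.

    A rising rank number means the album became less popular, so agreement
    requires the rank change and the download change to have opposite signs.
    A week with no change in either series counts as no agreement here.
    """
    rank_change = rank_next - rank_prev                  # > 0: less popular
    download_change = downloads_next - downloads_prev    # > 0: more downloaders
    return rank_change * download_change < 0


def percent_agreeing(weekly_pairs):
    """Percentage of albums whose two series moved together in one week."""
    agreeing = sum(1 for pair in weekly_pairs if moves_together(*pair))
    return 100.0 * agreeing / len(weekly_pairs)


# Ranks 35 -> 51 come from the text; the downloader counts are hypothetical.
example = moves_together(35, 51, 1200, 900)
```

A sign-based check like this ignores the size of each change, which matches the qualitative way the correlation is described above but is weaker than a rank correlation coefficient.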

This graph shows the number of IPs using multiple ports. The largest group consists of IPs that use only one port all the time. The maximum number of different ports was 273, used by a couple of IPs. This suggests that most people do not change their port, which might make it easier to classify P2P traffic. It also shows, however, that quite a few people do change their ports. This might be due to certain applications that randomly choose a port upon startup.

As seen in the previous graph, most people fix one port, probably chosen during their BitTorrent client installation, and then do not change it. A listing of the most common ports reveals that no single port dominates, i.e. most people use different ports: only 8% use port 6881, and every other port is used by less than 1%.
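The port statistics above amount to counting distinct ports per IP and then grouping IPs by that count. A minimal Python sketch follows. The function names are hypothetical and the observations use made-up addresses from a documentation range, since the authors' own figures came from database queries.

```python
from collections import Counter, defaultdict


def distinct_ports_per_ip(observations):
    """Map each IP to the number of distinct ports it was seen using."""
    ports = defaultdict(set)
    for ip, port in observations:
        ports[ip].add(port)
    return {ip: len(seen) for ip, seen in ports.items()}


def ips_by_port_count(per_ip_counts):
    """Count how many IPs used exactly 1 port, exactly 2 ports, and so on."""
    return Counter(per_ip_counts.values())


# Hypothetical (ip, port) observations from repeated tracker responses.
observations = [
    ("192.0.2.1", 6881), ("192.0.2.1", 6881),
    ("192.0.2.2", 51413), ("192.0.2.2", 40000),
    ("192.0.2.3", 6881),
]
per_ip = distinct_ports_per_ip(observations)
histogram = ips_by_port_count(per_ip)
```

With these sample observations, two IPs used a single port and one IP used two ports.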

The above two graphs show the popularity of different genres in the US and Greece. This data can be applied to targeted marketing and sales. For example, Hip Hop and Holiday are popular in the US, whereas Jazz is more popular in Greece.

5. Conclusions
We found that the behavior of albums on BitTorrent correlates positively with their behavior in the real world. Thus we can infer the popularity of an album in a location by measuring the number and behavior of leechers/downloaders for that region. This inference can be directly applied to obtain a more localized marketing strategy. The port information also gives an insight into the popularity of BitTorrent client applications, and shows that a large number of users do not change the port.

6. Future Scope
Future work includes collecting a larger data set, both over a longer time and across more albums. As data for more albums is collected, our suggested correlations can be confirmed or refuted. It can also bring about a better understanding of each country's likes and dislikes. More data would also allow more interesting inferences to be drawn.

7. References
[1] Music album rankings. http://www.billboard.com/
[2] Music album rankings. http://www.allcharts.org/
[3] Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, Sue Moon. "I Tube, You Tube, Everybody Tubes: Analyzing the World's Largest User Generated Content Video System." San Diego, CA, October 2007.
[4] J.A. Pouwelse, P. Garbacki, D.H.J. Epema, H.J. Sips. "The Bittorrent P2P File-Sharing System: Measurements and Analysis." 2005.
[5] BitTorrent protocol specification. http://www.bittorrent.org/protocol.html
[6] Maxmind GeoIP database API. http://www.maxmind.com/app/api

Appendix

Graph 1: Total number of downloaders per day

The graph shows that the number of downloaders remains nearly constant during the entire period of 5 weeks, with no particular peaks or lows on particular days of the week. The amount of traffic also stays the same over the Thanksgiving period (22-23 November), which shows that the holiday has no specific effect on the number of downloaders.

Graph 2: Number of IPs by number of distinct ports used.

The graph shows the number of IPs whose count of distinct ports falls within a given range. For example, about 150 IPs used between 100 and 150 distinct ports.

Graph 3: Percentage usage of top 10 US states.

This graph shows the top US states where BitTorrent is prevalent, in terms of the percentage of IPs for each state, normalized by the number of households with internet access. The graph is largely as expected, with states like California and New York sharing the top places. This may be because these states house many business and technology hubs, so people there are more likely to know about and use P2P.

Graph 4: The top 5 countries in each genre.

This graph shows the popularity of genres in different countries. It shows the top 5 countries for 4 genres (Rock, Pop, Hip-Hop and Country). The United States has the highest percentage of IPs for all the genres, followed by Great Britain and Canada. This is as expected, because the songs we chose are more popular in these areas. In Hip-Hop and Country, the US has a very high percentage (above 60%). In Pop and Rock, however, the percentage of IPs is quite evenly distributed, which indicates that these genres are as popular in other countries as in the US. Those countries, especially Sweden, Spain and Australia, are thus potential markets for these genres.

Graph 5: The number of downloaders (leechers) per day for each song.

Graph 6: Percentage of distinct IPs for each genre and country.

[Chart. Vertical axis: percentage of distinct IPs, 0% to 40%. Horizontal axis: genre (Jazz, Holiday, Country, Alternative, Hip-Hop, Pop, Rock). Series: India, Greece, US, UK, Sweden, Netherlands.]

Graph 7: Percentage of IPs that are ours, i.e. the IP address of tampere.cc.gatech.edu.

[Chart. Vertical axis: percentage of received IPs that are ours, 0 to 25. One entry per artist for the 30 albums listed in Table 1.]

The above graph shows our IP (tampere.cc.gatech.edu) as a percentage of the total IPs received for each song. From this we may infer that songs which return our IP more frequently have a small set of peers. This is because the tracker selects 50 random peers to send; if the pool holds fewer than 50 peers, it is highly likely that we receive our own IP. In fact, for the three songs with large values (Jay-Z, Toby Keith and Wolfpack Unleashed), we also collected less data.
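The link between swarm size and how often our own IP comes back can be made explicit with a small calculation. The sketch below assumes that the tracker samples uniformly without replacement from a pool that includes the requesting client. This is a simplification, since real trackers may exclude the requester. The function names are hypothetical.

```python
RESPONSE_SIZE = 50  # number of peers a tracker returns per query (from the text)


def inclusion_probability(swarm_size):
    """Chance that one specific peer appears in a single tracker response."""
    if swarm_size < 1:
        raise ValueError("swarm must contain at least one peer")
    return min(1.0, RESPONSE_SIZE / swarm_size)


def expected_own_share(swarm_size):
    """Expected fraction of the returned IPs that is the requester's own IP."""
    returned = min(RESPONSE_SIZE, swarm_size)
    return inclusion_probability(swarm_size) / returned
```

Under these assumptions the expected share simplifies to 1/N for a swarm of N peers, so a high percentage of our own IP points to a very small swarm.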

Graph 8: Number of distinct IPs from select universities.

[Chart. Axis: number of distinct IPs, 0 to 20. Universities shown: USC, Texas A&M, Wisconsin, Princeton, MIT, CMU, UPenn, Stanford, GT, San Jose, New York University, University of Texas, Oxford.]

Table 1: The list of albums for which data was collected.
Back Of My Lac' (J. Holiday)
Finding Forever (Common)
Magic (Bruce Springsteen)
Graduation (Kanye West)
Still Feels Good (Rascal Flatts)
The Very Best Of Diana Krall (Diana Krall)
Minutes To Midnight (Linkin Park)
Change (Sugababes)
Life In Cartoon Motion (Mika)
Exile On Mainstream (Matchbox Twenty)
Greatest (Bee Gees)
Noel (Josh Groban)
Brave (Jennifer Lopez)
Curtis (50 Cent)
Big Dog Daddy (Toby Keith)
All The Right Reasons (Nickelback)
Anthems Of Resistance (Wolfpack Unleashed)
Matinee (Jack Penate)
Echoes, Silence, Patience & Grace (Foo Fighters)
I'm Not Dead: Tour Edition (Pink)
All The Lost Souls (James Blunt)
Pictures (Katie Melua)
Shock (Timbaland)
The Dutchess (Fergie)
Blue Magic (Jay-Z)
Dylan (Bob Dylan)
Songs About Girls (Will.i.am)
Hits (Faith Hill)
90 Miles (Gloria Estefan)
Dark Passion (Nightwish)

Table 2: Number of albums in each genre

Genre         Number of albums
Jazz          1
Holiday       1
Alternative   1
Country       3
Hip-Hop       6
Pop           9
Rock          9
