
Observation of Network Structure in Amazon.com

Teruhiko Yoneyama
Multidisciplinary Science
Rensselaer Polytechnic Institute
Troy, New York, United States
yoneyt@rpi.edu
Mukkai S. Krishnamoorthy
Computer Science
Rensselaer Polytechnic Institute
Troy, New York, United States
moorthy@cs.rpi.edu


Abstract—Amazon.com is among the largest bookstores on the
internet. It provides the sales rank of each book. Our hypothesis
is as follows: if a book has a low sales rank (i.e., it sells well), its
related books also have low sales ranks. According to the small world
principle, if a network is connected, any two nodes are connected
by a relatively small number of links. If this principle applies,
any book can link to a bestseller within a small
distance by following links between related books. In this paper, we design
an algorithm to test our hypotheses and analyze the
network structure of Amazon.com.
Keywords-Complex Network, Small World Principle, Network
Structure, Graph Theory, Component, Distance
I. INTRODUCTION
Stanley Milgram conducted the “small world experiment”
in 1967 [1]. In this experiment, he sent letters to 160 people
who were chosen at random in Nebraska and asked them to
forward the letter to an acquaintance who might be closer to
a target individual in Boston. By tracking the chains of letters,
he discovered that two people in the US who do not
know each other are connected by 5 or 6 intermediate people.
Based on this experiment, we might conclude that any two
nodes in a connected network are connected through a relatively
small number of intermediate nodes and links. This
was the discovery of the small world phenomenon in the real world.
The following related works investigate “distance” in
real networks. Computing the Erdős number is the first example.
Paul Erdős was one of the greatest mathematicians, and he wrote
more than 1500 papers with many co-authors. The Erdős
number shows how closely a mathematician is related to him by
co-authorship. For example, if a person wrote a paper with him,
his/her Erdős number is 1. If he/she never wrote a paper with
Erdős himself but wrote a paper with a person whose number is
1, then his/her Erdős number is 2. Having a low Erdős
number can be an honor for mathematicians. Interestingly, no
mathematician whose Erdős number is more than 17 has been found
so far, and most mathematicians, more than 100 thousand,
have a number of 5 or 6. Therefore the network among mathematicians is
also a small world [5].
Brett Tjaden and Glenn Wasson devised a game called
“The Oracle of Bacon” when they were graduate students at the
University of Virginia. The game is to find the distance from the
actor Kevin Bacon to another actor/actress through a chain of
co-stars. This game shows that almost all
actors/actresses connect with Kevin Bacon within 6 links. In
fact, no one needs more than 10 links to reach him, and
the average distance of 500 thousand actors/actresses was only
2.896 [5].
Duncan J. Watts and Steven H. Strogatz, well known for the
small-world (WS) model, investigated the power distribution
network. They computed the number of electrical power cables
between power plants, booster stations, and distribution
stations and discovered that the average number of power cables
between facilities is 18.7 [5].
Other research, such as work on the relationships
among words, among research paper citations, and among web
pages and links, also shows small world behavior.
However, it is not easy to conduct the same
experiment in the Amazon network. Instead, we look at the books
in Amazon.com, a bookstore on the internet with a
network structure: if we regard each book as a node and the
relation between books as a link, the bookstore can be modeled as a
large network. Each book’s main page lists the book’s sales
rank and its related books under “customers who bought this item
also bought the following items.” Books are therefore
connected by this relation, and the network is a directed graph.
The related books are usually of a similar genre or content, or by the same
author. Since books are introduced to a customer in this
manner, the customer can easily find related books and
compare multiple books. Here we have two hypotheses: (1) a
book which is bought with a well-sold book, such as a bestseller,
is also a well-sold one, and (2) if we accept the small world
concept, any book can link to a bestseller, such as a Top100
bestseller, within a small distance by tracing
related books. In this paper, in order to test these hypotheses,
we apply an algorithm that traces from a book to its related books,
and we analyze the network structure within
Amazon.com. We use the following definitions. A book is a node.
When Amazon.com shows that a book is bought together with
another book, the two books are connected by a link,
so the two books (nodes) are adjacent to one another.
Each link is directed: the arrow goes from the “target” book,
which is found earlier, to the book which is bought with the
target book.
II. METHODOLOGY AND PRELIMINARY EXPERIMENT
For the preliminary experiment, similar to Milgram’s
experiment, we choose one book in Amazon.com at random
and look at the sales ranks of its 5 related books. Then we select the
book with the lowest sales rank (i.e., the best-sold book) among
these 5 books, and repeat this pattern 10 times. The preliminary
algorithm is as follows.
1. Pick one book at random. Call this book the target.
2. Examine the sales ranks of the 5 related books (which are often bought with the target).
3. Pick the book which has the lowest sales rank among the 5 books. Call this book the (new) target.
4. If the trial is <= 10
5.     Repeat from 2
6. Else
7.     Done. Return the target’s sales rank.
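The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the names `greedy_walk`, `related`, and `sales_rank` are hypothetical, the dictionaries stand in for data read from Amazon.com book pages, and the toy values are made up. The loop bound is read as 10 steps.

```python
def greedy_walk(related, sales_rank, start, steps=10):
    """Repeatedly move to the best-selling related book.

    related    -- hypothetical dict: book id -> list of related book ids
    sales_rank -- hypothetical dict: book id -> sales rank (1 = best sold)
    Returns the list of targets visited, beginning with `start`.
    """
    path = [start]
    target = start
    for _ in range(steps):
        candidates = related.get(target, [])
        if not candidates:
            break  # no related books listed, so the walk cannot continue
        # The lowest sales rank is the best-sold book; revisits are allowed.
        target = min(candidates, key=lambda book: sales_rank[book])
        path.append(target)
    return path


# Toy data, invented for illustration only.
related = {"a": ["b", "c"], "b": ["a", "c"], "c": ["b", "a"]}
sales_rank = {"a": 500, "b": 120, "c": 300}
path = greedy_walk(related, sales_rank, "a", steps=4)
print(path)
```

On this toy input the walk alternates between books "b" and "c", which is the kind of cycle that can arise when revisits are allowed.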

For the preliminary experiment we ran the algorithm 10
times. Figure 1 shows the transition of sales rank in this
experiment. The X-axis shows the distance from the
initial target and the Y-axis shows the sales rank. In
this experiment, 9 of 10 cases failed to reach the
Top 100 bestsellers, and all cases converge to two points. Some
curves show a decrease in sales rank. The convergence means that the same
books appear repeatedly in the algorithm: there are
cycles around the first chosen books, and the transition did not
get out of these cycles. Another observation is that a related
book is of a similar genre or content, or by the same author, as the
target book. When a person wants to buy a book, he/she may
tend to buy a similar book as well. The listing thus introduces
similar books to the customer, who can easily find similar
books and compare them. A cycle may appear since a book
is more likely to connect to similar books, and together they form a
small component.
[Plot: x-axis: Distance (0 to 10); y-axis: Sales Rank (0 to 160000)]
Figure 1 Transition of Sales Rank and Distances
In Figure 2, we plot the relation between the target book’s
sales rank and the average sales rank of its 5 related books. The
X-axis shows the target book’s sales rank and the Y-axis shows the
average sales rank of the 5 related books. Both axes
are logarithmic. In this figure, the regression line seems to imply that
when the target book’s sales rank is higher, the average sales
rank of the related books is also higher, and that these ranks are in
proportion. In the formal experiment, we will use a larger
number of samples to verify this phenomenon.
[Plot: x-axis: Target's Sales Rank (in log, 1 to 1000000); y-axis: Average Sales Rank of 5 Related Books (in log, 1 to 10000000)]

Figure 2 Relation between Sales Rank of the Target Book
and Average Sales Rank of 5 Related Books
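A regression line on log-log axes, as used in this figure, can be obtained by an ordinary least-squares fit to the logarithms of both quantities. The sketch below is an assumption about how such a fit could be computed, not the procedure the authors used; `loglog_fit` is a hypothetical name, and the sample points are synthetic values chosen to lie exactly on y = 3x, not measured sales ranks. A fitted slope close to 1 corresponds to the two ranks being in proportion.

```python
import math


def loglog_fit(xs, ys):
    """Least-squares line through (log10 x, log10 y).

    Returns (slope, intercept) of log10(y) = slope * log10(x) + intercept.
    """
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic points on y = 3x, used only to check the fit.
target_ranks = [10, 100, 1000, 10000]
average_related_ranks = [30, 300, 3000, 30000]
slope, intercept = loglog_fit(target_ranks, average_related_ranks)
print(slope, 10 ** intercept)
```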
It appears that, with this algorithm, an arbitrarily chosen book
cannot link to the Top100 bestsellers. We hypothesize that
each book belongs to a component and that it is difficult to link to
other components.
Our first algorithm allows the same book to be revisited. The
revised algorithm does not allow revisits, so cycles are
not permitted. The revised algorithm for the formal experiment
follows. The revision is on line 3.
1. Pick one book at random. Call this book the target.
2. Examine the sales ranks of the 5 related books which are often bought with the target.
3. Pick the book which has the lowest sales rank among the 5 books and has never been visited before. Call this book the (new) target.
4. If the trial is <= 10
5.     Repeat from 2
6. Else
7.     Done. Return the target’s sales rank.
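A minimal Python sketch of the revised steps is shown below. It is illustrative only: `greedy_walk_no_revisit`, `related`, and `sales_rank` are hypothetical names, the data are invented, and "visited" is assumed to mean books that were previously chosen as a target (the text does not say whether books that were merely examined also count).

```python
def greedy_walk_no_revisit(related, sales_rank, start, steps=10):
    """Move to the best-selling related book that was not a target before.

    related    -- hypothetical dict: book id -> list of related book ids
    sales_rank -- hypothetical dict: book id -> sales rank (1 = best sold)
    Returns the list of targets visited, beginning with `start`.
    """
    visited = {start}
    path = [start]
    target = start
    for _ in range(steps):
        # Line 3 of the revised algorithm: skip books already visited.
        fresh = [b for b in related.get(target, []) if b not in visited]
        if not fresh:
            break  # every related book was already visited
        target = min(fresh, key=lambda book: sales_rank[book])
        visited.add(target)
        path.append(target)
    return path


# Toy data, invented for illustration only.
related = {"a": ["b", "c"], "b": ["a", "c"], "c": ["b", "a", "d"], "d": ["c"]}
sales_rank = {"a": 500, "b": 120, "c": 300, "d": 900}
path = greedy_walk_no_revisit(related, sales_rank, "a")
print(path, [sales_rank[b] for b in path])
```

In this toy run the walk is forced from rank 300 to rank 900 once the better-selling neighbours are exhausted, so the sales rank can increase even though the best unvisited book is always chosen.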

We ran this algorithm 10 times and the results are shown in
Figure 3. Now there are no cycles and no repeated books.
The plots do not show a tendency toward decreasing sales
rank, and some sales ranks increase, although the algorithm always
picks the book with the lowest sales rank among the 5 related books.
In Milgram’s experiment, many
letters could link from a random person to the target person
through 5 or 6 intermediate persons, but in our experiment the
transition does not reach a Top100 bestseller (which plays the role of the
target in this experiment). There are other observations
from this experiment. First, as in the preliminary
experiment, the network shows some small components, and it is not easy
to link to other components. A component usually consists
of books of the same genre, books with similar content, or books by
the same author(s). Second, each component has some core books
with the other books in their neighborhood,
and not all core books have a low sales rank
compared with the neighboring books in the component. Third,
if the sales rank is relatively low, it appears easy to reach a
bestseller whose rank is less than 100. That is, a book
which links to a bestseller also has a low sales rank, which
relates to our first hypothesis: a book which is bought
with a well-sold book, such as a bestseller, is also a well-sold one.
Consequently we expect that the component of a lower rank (well-sold) book
links easily to other books or other components and
becomes a bigger component. The bestseller therefore
tends to belong to a bigger component than books of
average rank. In the following sections, we
analyze the network structure through multiple trials of the algorithm
and verify our hypothesis.
[Plot: x-axis: Distance (0 to 10); y-axis: Sales Rank (0 to 800000)]

Figure 3 10 Transitions of Sales Rank and Distances
III. FORMAL EXPERIMENT WITH DIFFERENT INITIAL TARGETS
In the previous cases, we chose the initial book at random.
From the results of the previous experiment, we expected that the
network structure may differ depending on the sales rank of the
initial book. In this section, we experiment with the
revised algorithm by choosing books with particular sales
ranks. This enables us to observe the different types of
network structures according to the sales rank of the initial
book. We choose the initial target book from Top10 (the sales
rank is ≥ 1 and ≤ 10), Top100 (the sales rank is ≥ 11 and
≤ 100), Top1000 (the sales rank is ≥ 101 and ≤ 1000), and
Top10000 (the sales rank is ≥ 1001 and ≤ 10000). In each case,
we run our algorithm 10 times. Each set of runs yields a network
that starts from a different class of initial target book. Let
us call it the Top10 network when the initial target book is from
Top10, the Top100 network when the initial target book is from
Top100, and so on.
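The four rank ranges are disjoint, so each sales rank falls into at most one class. As a small illustration (the function name `rank_bucket` is hypothetical and not part of the paper), the classification can be written as:

```python
def rank_bucket(rank):
    """Return the class of a sales rank, or None if it is above 10000."""
    if rank < 1:
        raise ValueError("sales rank must be at least 1")
    # Upper bounds are inclusive: Top10 is 1..10, Top100 is 11..100, etc.
    for upper, name in ((10, "Top10"), (100, "Top100"),
                        (1000, "Top1000"), (10000, "Top10000")):
        if rank <= upper:
            return name
    return None


print([rank_bucket(r) for r in (1, 10, 11, 1000, 1001, 20000)])
```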
Figure 4 shows the relation between the target book’s sales
rank and the average sales rank of its 5 related books. It seems
that the target book’s sales rank is in proportion to the sales
rank of the related books. This is consistent with the hypothesis
that a lower rank (well-sold) book tends to be linked to other
lower rank books.
[Plot: x-axis: Target's Sales Rank (in log, 1 to 1000000); y-axis: Average Sales Rank of 5 Related Books (in log, 1 to 1000000)]

Figure 4 Relation between Sales Rank of the Target Book
and Average Sales Rank of 5 Related Books in All Cases
Figure 5 shows each raw network. Each node represents a
book. Each directed link represents a relation between books.
For example, when book A is introduced by book B as “Book
A is frequently bought by customers who bought Book B”, a
directed link connects these nodes (books) as B → A.
Interestingly, the combined graph contains one large
component and a few middle-sized and small components.
Some low sales rank books are referred to many times across all
transitions, and the same books appear in the different networks of
Top10 through Top10000 and the Random case. Consequently, the
large component grows by combining with other
components, since such low rank books can act as a connection.
This structure is very similar to that of the WWW, which has a
Giant Connected Component (GCC), the huge connected
component which plays the role of a percolating cluster, and a
Disconnected Component (DC), the rest of the network, which
consists of separate finite connected components [2].
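One way to identify such components in a set of directed links is to ignore link direction and run a breadth-first search from each unvisited node. The sketch below assumes that components are meant in this weakly connected sense, which the text does not state explicitly; `weak_components` is a hypothetical name and the edge list is invented.

```python
from collections import defaultdict, deque


def weak_components(edges):
    """Group nodes into components, ignoring link direction.

    edges -- iterable of (source, destination) pairs, e.g. ("B", "A") for B -> A
    Returns a list of node sets, largest component first.
    """
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    components = []
    for node in neighbours:
        if node in seen:
            continue
        # Breadth-first search collects everything reachable from `node`.
        seen.add(node)
        queue = deque([node])
        members = set()
        while queue:
            current = queue.popleft()
            members.add(current)
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(members)
    return sorted(components, key=len, reverse=True)


# Toy links, invented for illustration only.
edges = [("B", "A"), ("A", "C"), ("D", "C"), ("X", "Y")]
components = weak_components(edges)
print([sorted(c) for c in components])
```

The first element of the returned list plays the role of the largest component; the remaining elements correspond to the smaller, disconnected ones.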
Figure 6 shows three typical component structures from the
Top10 and Top10000 networks. Figure 6 (a) is from the Top10
network, and some nodes (books) are related to other nodes in
the component. Figure 6 (b) is from the Top10000 network, and
in general each node’s degree is lower than in
the lower sales rank component. But there is an
exception. In Figure 6 (c), which is also from the Top10000
network, there is a very strongly connected component in which
every node is related to every other node. In fact this
component is a complete graph, and there is no way to get out
of it. In this case, the books in this component are about a
very academic and specialized field, and the component is closed.
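The closed, complete component described here can be characterized by two conditions: every book in the set lists every other book in the set, and no book in the set lists a book outside it. A small check under these assumptions (the name `is_closed_clique` and the sample data are hypothetical) might look like this:

```python
def is_closed_clique(members, related):
    """True if the books form a complete graph with no outgoing links.

    members -- iterable of book ids forming the candidate component
    related -- hypothetical dict: book id -> list of related book ids
    """
    members = set(members)
    for book in members:
        links = set(related.get(book, []))
        if not (members - {book}) <= links:
            return False  # some other member is not listed by this book
        if links - members:
            return False  # a link leaves the component
    return True


# Toy data, invented for illustration only.
closed = {"p": ["q", "r"], "q": ["p", "r"], "r": ["p", "q"]}
leaky = {"p": ["q", "r"], "q": ["p", "r"], "r": ["p", "q", "s"]}
print(is_closed_clique("pqr", closed), is_closed_clique("pqr", leaky))
```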
Figure 7 shows the observed total number of books which
were discovered by the algorithm and the expected total number
of books to be discovered. Since each run of the algorithm has 10
iterations and each book links to 5 related books, the expected total
number of discovered books in one run is at most 51
(10 × 5 = 50 related books plus the one initial target
book). Then,
since we ran the algorithm 10 times for each network (Top10 through
Top10000 and Random), the expected total number of
discovered books in each network is at most 51 * 10 = 510.
However, some books are discovered
multiple times within a run of the algorithm and/or across the 10
runs. The observed total number of books is therefore
smaller. The difference between the observed total number of
books and the expected total number of books is larger when the
initial target book’s sales rank is low (well-sold). This means
that when the initial target book is well-sold, the books which
belong to the same component as the initial target book are more
often discovered in multiple runs of the algorithm. In other
words, in a lower sales rank (well-sold books) network, it is
more likely that the books link to one another.
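The expected maximum and the observed count can be computed as in the following sketch. The function names are hypothetical, and the example runs contain invented book ids; only the constants 10, 5, and 10 come from the text.

```python
def expected_max_books(steps=10, related_per_book=5, runs=10):
    """Upper bound on discovered books: per run and over all runs."""
    per_run = 1 + steps * related_per_book  # initial target plus related books
    return per_run, per_run * runs


def observed_books(runs):
    """Number of distinct books over all runs.

    runs -- iterable of iterables of book ids discovered in each run
    """
    distinct = set()
    for run in runs:
        distinct.update(run)  # repeated discoveries are counted once
    return len(distinct)


per_run, total = expected_max_books()
print(per_run, total)

# Two invented runs that share one book.
print(observed_books([["a", "b", "c"], ["c", "d"]]))
```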
Figure 8 shows the observed total number of components
which were discovered by the algorithm and the expected total
number of components to be discovered. Since we ran the
algorithm 10 times for each network, the expected total number
of components in each network is at most 10.
However, some books are discovered in
multiple runs of the algorithm. Such books connect two
otherwise different components in a network. Therefore the observed
total number of components is smaller. In general, in a lower
sales rank (well-sold books) network, the observed total
number of components is small. This means that in a lower sales
rank network, the components tend more to link to one
another and form a bigger component.
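The merging of runs that share a book can be illustrated with a union-find (disjoint-set) structure. This is a sketch under the assumption that all books discovered in one run belong to a single component; `count_components` is a hypothetical name and the book ids are invented.

```python
def count_components(runs):
    """Count components after merging runs that share at least one book.

    runs -- iterable of iterables of book ids; each run is assumed connected
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for run in runs:
        books = list(run)
        for book in books:
            find(book)  # register the book even if the run has one book
        for other in books[1:]:
            union(books[0], other)

    return len({find(book) for book in list(parent)})


# Three invented runs; the second and third share book "d".
print(count_components([["a", "b"], ["c", "d"], ["d", "e"]]))
```

With ten runs that share no books the function returns 10, the maximum case described above; every shared book lowers the count.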

Figure 5 Network Graphs

Figure 6 Some Typical Component Structure from Top10
Network ((a)) and from Top10000 Network ((b) and (c))
[Bar chart: x-axis: Top10, Top100, Top1000, Top10000, Random; y-axis: Number of Books (0 to 600); series: Observed Total Book, Expected Total Book]

Figure 7 Observed Total Number of Books and Expected
Total Number of Books in each Network
[Bar chart: x-axis: Top10, Top100, Top1000, Top10000, Random; y-axis: Number of Components (0 to 12); series: Observed Total Component, Expected Total Components]

Figure 8 Observed Total Number of Components and
Expected Total Number of Components in each Network
IV. ANALYSIS OF NETWORK
In this section, we analyze the properties of each network,
Top10 through Top10000 and Random. Each network has
different characteristics. First, we show the properties of the
combined network in Table 1. The combined network includes
all books (nodes) and relations (links) of Top10 through
Top10000 and Random. Note that the average shortest path length
in the table is calculated only over pairs of nodes that are connected
through other intermediate nodes. When two nodes are in
different components, they are disconnected, and we do not
consider such pairs in the average shortest path length. Figure 9
shows the degree distribution in the combined network. The X-axis
shows the incoming degree and the Y-axis shows the
number of books with that incoming degree. Both axes
are logarithmic. The distribution follows a power law.
Table 1 Properties of Combined Network
Average Degree    Average Clustering Coefficient    Average Shortest Path Length
1.986134          0.355569                          6.098843
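The restriction of the average shortest path length to connected pairs can be sketched as follows. This is an illustration rather than the computation used for Table 1: it assumes that link direction is ignored when measuring distance, which the text does not specify, and `average_shortest_path` and the edge list are hypothetical.

```python
from collections import defaultdict, deque


def average_shortest_path(edges):
    """Mean shortest path length over pairs of nodes joined by some path.

    Pairs in different components are skipped. Link direction is ignored
    (an assumption of this sketch).
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    total_length = 0
    pair_count = 0
    for source in list(adjacency):
        # Breadth-first search gives distances to all reachable nodes.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
        total_length += sum(distance.values())
        pair_count += len(distance) - 1  # exclude the source itself
    return total_length / pair_count if pair_count else 0.0


# Invented example: a chain a-b-c and a separate pair x-y.
result = average_shortest_path([("a", "b"), ("b", "c"), ("x", "y")])
print(result)
```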

[Plot "Degree Distribution": x-axis: Number of Incoming Degree (in log, 1 to 100); y-axis: Number of Books (in log, 1 to 1000)]

Figure 9 Degree Distributions in Combined Network
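The quantity plotted in this figure, the number of books for each incoming degree, can be tabulated from a list of directed links as sketched below. `in_degree_distribution` is a hypothetical name and the links are invented; books with no incoming link are omitted, since a degree of zero cannot be shown on a logarithmic axis.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each incoming degree to the number of books having that degree.

    edges -- iterable of (source, destination) pairs
    """
    in_degree = Counter(dst for _, dst in edges)
    return Counter(in_degree.values())


# Invented links: "b" is referred to twice, "c" once.
distribution = in_degree_distribution([("a", "b"), ("c", "b"), ("a", "c")])
print(sorted(distribution.items()))
```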
Figure 10 shows the degree distributions in the different
networks. The X-axis shows the incoming degree and the
Y-axis shows the number of books with that incoming degree.
In general, the distribution follows a power law. The higher the
network’s sales rank, the more closely the distribution
follows the power law. The three distributions for Top1000,
Top10000, and Random are similar, and their regression curves
overlap. On the other hand, in the lower sales rank networks,
the number of books with a small incoming
degree is small compared with that in the higher sales rank
networks. This means that the books are more likely to link to one
another, so that each degree becomes close to the average.
[Plot "Degree Distribution": x-axis: Number of Incoming Degree (0 to 15); y-axis: Number of Books (0 to 180); series: Top10, Top100, Top1000, Top10000, Random]

Figure 10 Degree Distributions in Different Networks
Figure 11 shows the average degrees in the different networks.
The average degree is larger when the network’s sales rank is
lower. Thus when the network’s sales rank is lower, each book
in the network tends more to link to other books. This
agrees with the conclusions from Figure 10.
Figure 12 shows the average clustering coefficient in the
different networks. The average clustering coefficient is larger
when the network’s sales rank is lower. Thus when the
network’s sales rank is lower, each book in the network tends
more to link to neighboring books.
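For reference, the average clustering coefficient can be computed as in the sketch below. It uses the common undirected definition (the fraction of a node's neighbour pairs that are themselves linked, averaged over all nodes, with nodes of degree below 2 counted as 0); the paper does not state which variant was used, so this is an assumption, and `average_clustering` and the sample edges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient, ignoring link direction."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-links
            adjacency[u].add(v)
            adjacency[v].add(u)
    if not adjacency:
        return 0.0

    total = 0.0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            continue  # coefficient taken as 0 for degree 0 or 1
        linked = sum(1 for a, b in combinations(neighbours, 2)
                     if b in adjacency[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adjacency)


# Invented example: a triangle a-b-c with one extra book d attached to c.
value = average_clustering([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
print(value)
```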
Figure 13 shows the average shortest path length in the
different networks. It is difficult to draw a conclusion from this figure. If
the books were more strongly connected to one another in
lower sales rank networks, the average shortest path would be
smaller in those networks. However, this figure does not support
this conjecture. A possible reason why the average shortest path could
be higher in a lower sales rank network is as follows. In the
process of forming the network, each book tends more to link to
books in a different component in a lower sales rank network.
Thus the component becomes bigger and its diameter becomes
larger. Within the component, not all books are connected to
each other, and some books are still distant. Thus the average
shortest path in the big component could be large. Therefore,
the average shortest path does not depend on the sales rank alone, but
depends on both the strength of the relation between books and the
size of the component.
[Bar chart "Ave Degree": x-axis: Top10, Top100, Top1000, Top10000, Random; y-axis: 0 to 3]

Figure 11 Average Degrees in Different Networks
[Bar chart "Ave Clustering Coefficient": x-axis: Top10, Top100, Top1000, Top10000, Random; y-axis: 0 to 0.5]

Figure 12 Average Clustering Coefficient in Different
Networks
[Bar chart "Ave Shortest Path": x-axis: Top10, Top100, Top1000, Top10000, Random; y-axis: 0 to 4]

Figure 13 Average Shortest Path Length in Different
Networks
V. CONCLUSIONS
In this paper, we used our algorithm to observe the
network structure in Amazon.com. Figure 14 shows all the
books (nodes) and relations (links) that we observed in this
experiment. We find that there are some giant connected
components which consist of lower sales rank books (i.e.,
well-sold books) and many small components which consist of
higher sales rank books.

Figure 14 Observed Network in Amazon.com
We had two hypotheses: 1) a book which is bought with a
well-sold book, such as a bestseller, is also a well-sold one, and 2)
according to the small world principle, any book links to a
bestseller within a small distance by following links between
related books. The first hypothesis seems to be consistent with
the results of our experiment. Figures 4 and 6 show that in a lower
sales rank network, the related books also have lower sales
ranks. We find that books in the lower sales rank networks are
more likely to connect to each other. Therefore, we can say that in
general a well-sold book is bought with another well-sold
book. The second hypothesis, however, seems incorrect. When
a book belongs to a small component, it is difficult to link to
other components. Even if we trace many books,
a book may never reach a bestseller when the
books belong to different components.
REFERENCES
[1] S. Milgram, “The Small World Problem,” Psychology Today 2: 60–67,
1967.
[2] S.N. Dorogovtsev, J.F.F. Mendes, “Evolution of Networks, From
Biological Nets to the Internet and WWW”, Oxford University Press,
2003.
[3] D.S. Callaway, J.E. Hopcroft, J.M. Kleinberg, M.E.J. Newman, S.H.
Strogatz, “Are randomly grown graphs really random?”, Physical Review
E, Volume 64, 041902, 2001.
[4] Y. Jing, S. Baluja, “PageRank for Product Image Search”, WWW
Conference, Beijing, China, ACM 978-1-60558-085-2/08/04, 2008.
[5] M. Buchanan, “Nexus: Small Worlds and the Groundbreaking Science
of Networks”, W. W. Norton & Company, 2003.
[6] A. L. Barabási, “Linked: How Everything is Connected to Everything
Else and What it Means”, Plume, 2003.
[7] D. J. Watts and S. H. Strogatz, “Collective Dynamics of ‘Small-World’
Networks”, Nature, 393, 440-442, 1998.
[8] M. Steyvers and J. B. Tenenbaum, “The Large-Scale Structure of
Semantic Networks: Statistical Analyses and a Model of Semantic
Growth”, Cognitive Science, 29, 41-78, 2005.