Linear Algebra & Pandora: Representing Data Sets in a Vector Space

Steven Roger
August 12, 2011

What is Pandora.com?: Although it has grown slowly since its inception in 2005, Pandora has made a big name for itself. Having gone public in June 2011, Pandora is valued at over $3 billion. Pandora.com offers a new kind of radio, or so its website states. Pandora has more allure than conventional radio because it plays only songs the listener likes. A radio station that plays only songs one enjoys may sound great in theory, but is it too good to be true? Actually, it's quite true, and anyone who has used Pandora can attest to Pandora's accuracy at selecting songs the listener enjoys.

How does it work?: When first entering the site, a user inputs an artist or song they like. Pandora then generates a channel based on this song or artist, and the channel plays music similar to the input. While listening, a user may give a song a thumbs up or a thumbs down, letting Pandora know whether they like or dislike that song. Pandora then uses this feedback to build a more specific picture of what the user enjoys.

Linear Algebra and Pandora: Pandora's ability to accurately play songs the user enjoys is outstanding. Its algorithm does not simply play all songs of the same genre or by the same artist. Instead, to select songs for a given channel, Pandora employs a combination of highly skilled musical experts and linear algebra to rate songs and to estimate a user's likely interest in them.

*While Pandora's actual implementation is proprietary information and a heavily guarded secret, the overarching approach used by Pandora for choosing songs based on a user's input is discussed below and called "Pandora's algorithm" for convenience and simplicity.

Every accessible song on Pandora has been intensely scrutinized by Pandora's team of musical experts. These experts personally listen to each and every song (this takes on average 20 minutes per song) and score it on several hundred musical attributes. These attributes are derived from Pandora's musical genome, a collection of parameters that describe all the important characteristics that make a song what it is. This musical genome can be thought of as the DNA of a song, and it is essential to Pandora's algorithm. Because these attributes, or traits, are at the very heart of the music, songs can be rated more accurately. For example, instead of assuming that all songs on a given album closely correlate because they share an artist and genre, Pandora treats each song independently. It is not surprising when songs from the same album share many of the same traits, but this is not always the case. By basing song selections only on the listener's preferences, and not on the preferences of other users who listen to similar songs, Pandora has set itself apart from competitor sites and provides radio stations with outstanding song selection accuracy.

The second step in Pandora's algorithm is finding similar songs among the categorized music. Pandora has a quantitative way to represent a song: a song is represented by its $n$ attributes or, more precisely, by an $n \times 1$ vector. Each attribute is rated from 1 to 5 in increments of 0.5. Because every song has been analyzed, every song can be placed in this $n$-dimensional vector space and represented by a vector of its ratings. Intuitively, vectors that coincide represent either the same song or a pair of extremely similar songs. Taking this realization a step further, finding similar songs amounts to finding vectors that are close to each other. This is the essence of Pandora's success: converting songs into vectors and relating songs by their relative distance in the vector space.

However, Pandora has over 800,000 songs in its database, and each song has several hundred attributes. This creates an operational problem: the $n$-dimensional space with over 800,000 songs and counting is too large. Accessing this vector space and performing computations on it $p$ times, where $p$ is the total number of stations created by all users, requires too many computing resources. If Pandora were set up this way, users would experience severe lag when building channels and changing songs. Yet no lag is experienced, and Pandora appears to run quite efficiently.

This leads us to the third stage of Pandora's algorithm: optimization. Basic linear algebra has been employed thus far to categorize and relate similar songs, i.e., to make song vectors. To make the process resource efficient, however, Pandora must use a few more linear algebra techniques. As discussed above, a large resource hog is the number of genes (attributes) each song has. The number of genes equals the dimension of the vector space, and more dimensions mean more computing power is required to select a song.
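The idea of treating songs as attribute vectors and comparing them by distance can be illustrated with a small Python sketch. The song names, the four attributes, and the ratings below are invented for illustration (real song vectors would have several hundred entries), and plain Euclidean distance is assumed here only as the simplest notion of "close"; it is not claimed to be what Pandora uses.

```python
import math

# Hypothetical ratings on four made-up attributes, each scored from 1 to 5
# in steps of 0.5. These values are illustrative, not real Pandora data.
songs = {
    "song_a": [4.5, 2.0, 3.5, 1.0],
    "song_b": [4.0, 2.5, 3.5, 1.5],
    "song_c": [1.0, 5.0, 1.5, 4.5],
}


def closest_songs(seed, catalog):
    """Return the other song names ordered by Euclidean distance to the seed."""
    seed_vector = catalog[seed]
    others = [name for name in catalog if name != seed]
    return sorted(others, key=lambda name: math.dist(seed_vector, catalog[name]))


ranking = closest_songs("song_a", songs)
print(ranking)
```

Here `song_b` differs from `song_a` by at most half a point per attribute, so it ranks ahead of `song_c`. Note that every query compares the seed against every song in every dimension, which is the cost the optimization stage sets out to reduce.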
To increase efficiency, Pandora decreases the number of elements in each song vector. One immediately notices that this could decrease song-picking accuracy, since each element of a song vector represents a trait. However, Pandora's reduction technique shrinks the song vector while maintaining a high level of song selection accuracy.

Dimension Reduction: Below the abstraction line, decreasing the number of elements in a vector is a decrease in dimension. Dimension reduction comes in two flavors: selecting a subset of the existing attributes, or mapping the data into a new, reduced set of attributes. To maintain accuracy, Pandora uses the latter option. To do so, it uses Singular Value Decomposition (SVD), a method that follows directly from a theorem of linear algebra:

Any $m \times n$ matrix $A$ can be factored into three matrices $U$, $S$, and $V$ such that $A = U S V^T$, where $U$ and $V$ are orthogonal matrices of size $m \times m$ and $n \times n$ respectively, and $S$ is an $m \times n$ diagonal matrix whose diagonal entries are the nonnegative singular values of $A$.
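The factorization in this theorem can be checked numerically. The sketch below assumes NumPy and a small made-up ratings matrix in which rows are songs and columns are attributes; the data and variable names are illustrative only. Note that `np.linalg.svd` returns $V^T$ directly rather than $V$.

```python
import numpy as np

# Hypothetical ratings matrix: 4 songs (rows) x 3 attributes (columns).
A = np.array([
    [4.5, 2.0, 3.5],
    [4.0, 2.5, 3.5],
    [1.0, 5.0, 1.5],
    [1.5, 4.5, 2.0],
])
m, n = A.shape

# full_matrices=True yields U of size m x m and Vt (that is, V^T) of size n x n.
# s is a 1-D array of singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n matrix S with the singular values on its diagonal.
S = np.zeros((m, n))
S[:len(s), :len(s)] = np.diag(s)

# Check the three claims of the theorem for this example.
reconstruction_ok = np.allclose(A, U @ S @ Vt)
U_is_orthogonal = np.allclose(U.T @ U, np.eye(m))
V_is_orthogonal = np.allclose(Vt @ Vt.T, np.eye(n))

print(reconstruction_ok, U_is_orthogonal, V_is_orthogonal)
```

All three checks should hold up to floating-point tolerance, which is why `np.allclose` is used instead of exact equality.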

This theorem may seem new, but the concepts are all familiar except for singular values. In short, the singular values of a matrix are like its eigenvalues. However, where eigenvalues relate to a matrix acting as a transformation from a vector space to itself, singular values relate to a transformation from one vector space to a different vector space. A singular value of a matrix $A$, with corresponding nonzero singular vectors $v$ and $u$, is a nonnegative scalar $\sigma$ such that

$$Av = \sigma u \quad \text{and} \quad A^T u = \sigma v.$$

Now the dimension reduction process begins. Pandora must choose the dimension $k \le n$ in which songs are to be represented. Matrix $S$ is reduced to the $k \times k$ diagonal matrix $S_k$ of its $k$ largest singular values, and matrix $U$ is reduced from the square $m \times m$ matrix to the $m \times k$ matrix $U_k$ simply by removing its last $m - k$ columns. The same is done for $V$, which becomes the $n \times k$ matrix $V_k$ once its last $n - k$ columns are removed. These reductions lead to a new matrix $A_k$ such that

$$A_k = U_k S_k V_k^T.$$

Pandora does not, however, choose the dimension $k$ at random. The dimension is carefully chosen to minimize the size of $U_k$ while maintaining maximum song placement accuracy within the vector space. In linear algebra terms, $A_k$ minimizes the Frobenius norm $\|A - B\|_F$ over all matrices $B$ of rank $k$, and $k$ is chosen so that this error stays small. The Frobenius norm is similar to the norm, or length, of a vector, but it measures the size of a matrix instead of a single vector. If $A$ has singular values $\sigma_1, \ldots, \sigma_r$, then the square of the Frobenius norm of $A$ is the sum of the squares of the singular values of $A$:

$$\|A\|_F^2 = \sigma_1^2 + \cdots + \sigma_r^2.$$

The Frobenius norm thus guides the sizing of the new matrices $U_k$ and $V_k$, which represent songs and traits, respectively, in the new vector space. To complete the transformation, we multiply $U_k$ and $V_k$ by $S_k^{1/2}$. This results in the following:

$U_k S_k^{1/2}$ is the matrix of song coordinates, where each row is a song vector.
$V_k S_k^{1/2}$ is the matrix of trait coordinates.
$S_k$ is the matrix whose rank $k$ specifies the dimension of the space into which the data is transformed.

Now that Pandora has reduced the dimension of the vector space containing all the songs, it's time to revisit the earlier idea of locating similar songs. As stated before, similar songs lie close to each other. Pandora uses a nearest-neighbor approach to select songs: given an input from the user, Pandora transforms this input into a vector in the same vector space as every song. Pandora's algorithm then uses a technique called cosine similarity to determine which songs are within a specified threshold of proximity to this input vector. Cosine similarity works by taking two vectors $u$ and $v$ and computing $\cos(\theta)$, where $\theta$ is the angle between $u$ and $v$. The function is given by

$$\cos(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

Thus Pandora is able to populate a station with songs that are in proximity to the input.

Sorting data is one of many examples of how linear algebra is used in the real world. In fact, Pandora is a successful company built on linear algebra fundamentals. Some of what is discussed above digresses from the topics covered in Math 54, but each fact and formula is based on what has been taught in Math 54. None of the concepts above is overly complicated, yet the power behind the math is outstanding. It is easy to understand how important effectively sorting and correlating large data sets is when one considers the vast amount of information companies constantly collect to better profile their customers. Netflix, Facebook, and Amazon are just a few other popular companies that use dimension reduction techniques. Ever wonder how Netflix recommends movies, or how Facebook is able to sell user-targeted ad space?
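As a closing illustration, the rank-$k$ reduction and the cosine-similarity selection described in this paper can be sketched in Python with NumPy. Everything specific here is an assumption made for the example: the ratings matrix, the choice $k = 2$, the 0.9 threshold, and the helper names `cosine_similarity` and `similar_songs`. The seed is taken to be a song already in the catalog, and the sketch says nothing about Pandora's actual implementation.

```python
import numpy as np

# Hypothetical ratings matrix: 5 songs (rows) x 4 attributes (columns).
A = np.array([
    [4.5, 2.0, 3.5, 1.0],
    [4.0, 2.5, 3.5, 1.5],
    [1.0, 5.0, 1.5, 4.5],
    [1.5, 4.5, 2.0, 4.0],
    [3.0, 3.0, 3.0, 3.0],
])
k = 2  # illustrative target dimension, k <= n

# Thin SVD, then keep only the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation A_k = U_k S_k V_k^T. Its Frobenius error equals the
# square root of the sum of the squared discarded singular values.
A_k = U_k @ np.diag(s_k) @ Vt_k
error = np.linalg.norm(A - A_k, "fro")
expected_error = np.sqrt(np.sum(s[k:] ** 2))

# Song coordinates U_k S_k^(1/2): one k-dimensional row per song.
song_coords = U_k * np.sqrt(s_k)


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u||_2 ||v||_2)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def similar_songs(seed_index, coords, threshold):
    """Indices of other songs at or above the threshold, most similar first."""
    scores = [
        (cosine_similarity(coords[seed_index], row), i)
        for i, row in enumerate(coords)
        if i != seed_index
    ]
    return [i for score, i in sorted(scores, reverse=True) if score >= threshold]


station = similar_songs(0, song_coords, threshold=0.9)
print(station)
```

The similarity search now runs over $k$ coordinates per song instead of $n$, which is the efficiency gain the reduction is meant to deliver. Cosine similarity is unaffected by the sign ambiguity of the singular vectors, since flipping a column of `U_k` flips the same coordinate of every song.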

Bibliography
Grigorik, Ilya. "SVD Recommendation System in Ruby." BradBlock. 15 Jan. 2007. Web. <http://www.bradblock.com/>.

Moler, Cleve. "Eigenvalues and Singular Values." MathWorks. Web. <http://www.mathworks.com/moler/eigs.pdf>.

Sarwar, Badrul M., George Karypis, Joseph A. Konstan, and John T. Riedl. Application of Dimensionality Reduction in Recommender System -- A Case Study. Tech. Minneapolis: Department of Computer Science and Engineering / Army HPC Research Center. Print.

Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. Item-Based Collaborative Filtering Recommendation Algorithms. Tech. GroupLens Research Group / Army HPC Research Center. Web.