


Published by: alexandru_bratu_6 on Jun 07, 2012
The Google Legacy 
Chapter Three: Google Technology
“Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results.... Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.” – Sergey Brin and Lawrence Page, 1997
1. From “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” www-db.stanford.edu/~backrub/google.html

In the beginning, there was BackRub, the service that became Google. Today, Google is most closely associated with its PageRank algorithm. PageRank is a voting algorithm weighted for importance. The indicator of a Web page’s importance is the number of pages that link to it.

Messrs. Brin and Page soon added another factor that voted for the importance of a Web page: the number of people who click on it. The more clicks on a Web page, the more weight that Web page was given. Over time, still other factors have been added to the PageRank algorithm; for example, the frequency with which content on a page is changed.

Google’s PageRank technology is closely allied with Internet search. Voting algorithms are less effective in enterprise search, for instance. The attention given to Google and its search technology dominates popular thinking about the company. Google search is like a nova. The luminescence makes it difficult for the observer to see other aspects of the phenomenon clearly or easily.

Radiance aside, Google is a technology company.
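The link-voting idea behind PageRank described above can be made concrete with a small sketch. It is an illustration only, not Google’s production method: the damping factor of 0.85 and the repeated-pass calculation follow the published Brin and Page paper rather than anything stated in this chapter, the function name and the four-page toy Web are hypothetical, and the click and freshness factors mentioned above are left out.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages in a dict {page: [pages it links to]} by weighted link votes.

    Assumption: damping=0.85 comes from the published paper, not this chapter.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from an important page carries more weight.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: three pages link to "c", and "c" links back to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = pagerank(toy_web)
```

In this toy Web, page “c” collects the most votes. Page “a” outranks “b” and “d” even though each of the three has at most one inbound link, because the single vote for “a” comes from the most important page.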
Some of that technology, when described in technical papers such as the earliest one, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” is demanding. The later papers, such as “MapReduce: Simplified Data Processing on Large Clusters,” can be a slow read.
Since Google is technology, explaining what Google does in an easily digestible meal is difficult. The diagram below provides an unauthorized snapshot of Google’s computing framework.
2. The annex to this monograph contains a listing of more than 60 Google patents. The list is not all-inclusive; however, it does provide the patent number and a brief description for some of Google’s most important patents. The PageRank patent belongs to the trustees of Stanford University. Google’s patent efforts have focused on systems and methods for relevance, advertising, and other core foci of the company. Google is creating a patent fence to protect its interests.
3. Jeff Dean, former Alta Vista researcher and a Google senior engineer, has been an advocate of MapReduce. His most recent papers are available on his Web page at http://labs.google.com/people/jeff/.
Important Google technologies that underlie this diagram of the Googleplex include: [a] modifications to Linux to permit large file sizes and other functions so as to accelerate the overall system; [b] a distributed architecture that allows applications and scaling to be “plugged in” without the type of hands-on set-up other operating systems require; [c] a technical architecture that is similar at every level of scale; [d] a Web-centric architecture that allows new types of applications to be built without a programming language limitation.
Google’s technology has emerged from a series of continuous improvements or what Japanese management consultants call kaizen. Each Google technical change may be inconsequential to the average user of Google. But when taken as a whole, Google’s “technological advantage” comes from Google’s incremental innovations, clever adaptations of research-computing concepts, and Byzantine tweaks to Linux. Some day, a historian of technology will be able to identify, from the hundreds of improvements that Google has engineered in the last nine years, one or two that stand with PageRank as of major importance. Critics of Google will see that the company has grafted to its core technology processes from many different sources.

To illustrate, the structure of Google’s data centers and the messages passed to and from these data centers are in many ways a variant of grid computing.
Google’s ability to read data from many computers simultaneously is reminiscent of BitTorrent’s technology.
Google’s use of commodity or “white box” hardware in its data centers is an indication of Google’s hacker ethos. The use of memory and discs to store multiple copies of data comes from the frontiers of computing.

Google’s approach to technology, then, is eclectic and in many ways represents a building block approach to large-scale systems. Google benefits from that eclecticism in several ways. First, Google’s computational framework delivers sizzling performance from low-cost hardware. Second, Google worked around the bottlenecks of such operating systems as Solaris, Windows Advanced Server, and off-the-shelf Linux. Third, Google took good programming ideas from other languages, implementing new functions and libraries to eliminate most of the manual coding required to parallelise an application across Google’s servers.
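The third benefit, libraries that remove most of the manual coding needed to parallelise an application, is the kind of facility the MapReduce paper mentioned earlier describes. The sketch below is a toy under stated assumptions, not Google’s library: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, a thread pool on one machine stands in for a cluster of servers, and the three sample documents are invented. It shows the division of labour: the programmer writes two small functions, and the framework function handles the distribution and the grouping.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(document):
    """Programmer-supplied map step: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in document.split()]


def reduce_counts(word, counts):
    """Programmer-supplied reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn, workers=4):
    """Hypothetical mini-framework: the parallelism lives here, not in user code."""
    # Map phase: each input is independent, so the calls can run in parallel.
    # Threads stand in for the separate servers a real cluster would use.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, inputs))

    # Shuffle phase: gather every value emitted under the same key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce phase: one call per key produces the final result for that key.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Invented sample input for illustration.
documents = ["the web is big", "the index is bigger", "the web grows"]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

Neither `map_words` nor `reduce_counts` contains any threading or networking code. Swapping in a different pair of functions would change what is computed without changing how the work is spread out.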
According to Jeff Dean, one of Google’s senior engineers, “Google engineering is sort of chaotic.”
This is neither surprising nor necessarily a negative. The Googleplex is a toy box for engineers and programmers. The tools are sophisticated. The challenges of the problems and peers make Google “the place to be” for the best and brightest technical talent in the world. The nature of creativity combined with Google’s approach to innovation makes it difficult to predict the next big thing from Google.

Before selected parts of Google’s technology are reviewed in somewhat more detail, the diagram “Google’s Computing Framework” provides an overview of the Googleplex and some of its technologies. These will be touched upon in this section.
4. Grid computing is applying resources from many computers in a network to a single problem or application. Google uses grid-like technology in its distributed computing system.
5. BitTorrent is a peer-to-peer file distribution tool written by programmer Bram Cohen in 2001. The reference implementation is written in Python and is released under the MIT License.
6. Google has anywhere from 100,000 to 165,000 or more servers. Servers are organized into clusters. Clusters may reside within one rack or across multiple racks of servers. Some Google functions are distributed across data centers.
7. From Dr Dean’s speech at the University of Washington in October 2003. See http://www.uwtv.org/programs/displayevent.asp?rid=2459.

