Chapter Three: Google Technology
“Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results.... Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.” – Sergey Brin and Lawrence Page, 1997¹
In the beginning, there was BackRub, the service that became Google. Today, Google is most closely associated with its PageRank algorithm. PageRank is a voting algorithm weighted for importance. The indicator of a Web page’s importance is the number of pages that link to it. Messrs. Brin and Page soon added another factor that voted for the importance of a Web page: the number of people who click on it. The more clicks on a Web page, the more weight that Web page was given. Over time, still other factors have been added to the PageRank algorithm; for example, the frequency with which content on a page is changed. Google’s PageRank technology is closely allied with Internet search. Voting algorithms are less effective in enterprise search, for instance. The attention given to Google and its search technology dominates popular thinking about the company. Google search is like a nova. The
1. From “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” www.db.stanford.edu/~backrub/google.html
The Google Legacy
luminescence makes it difficult for the observer to see other aspects of the phenomenon clearly or easily. Radiance aside, Google is a technology company.² Some of that technology, when described in technical papers such as the earliest one, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” is demanding. The later papers, such as “MapReduce: Simplified Data Processing on Large Clusters,” can be a slow read.³ Since Google is technology, explaining what Google does in an easily-digestible meal is difficult. The diagram below provides an unauthorized snapshot of Google’s computing framework.
Important Google technologies that underlie this diagram of the Googleplex include: [a] modifications to Linux to permit large file sizes and other functions so as to accelerate the overall system; [b] a distributed architecture that allows applications and scaling to be “plugged in” without the type of hands-on set-up other operating systems require; [c] a technical architecture that is similar at every level of scale; [d] a Web-centric architecture that allows new types of applications to be built without a programming language limitation.
2. The annex to this monograph contains a listing of more than 60 Google patents. The list is not all-inclusive; however, it does provide the patent number and a brief description for some of Google’s most important patents. The PageRank patent belongs to the trustees of Stanford University. Google’s patent efforts have focused on systems and methods for relevance, advertising, and other core foci of the company. Google is creating a patent fence to protect its interests.
3. Jeff Dean, former Alta Vista researcher and a Google senior engineer, has been an advocate of MapReduce. His most recent papers are available on his Web page at http://labs.google.com/people/jeff/.
Google’s technology has emerged from a series of continuous improvements or what Japanese management consultants call kaizen. Each Google technical change may be inconsequential to the average user of Google. But when taken as a whole, Google’s computational framework delivers sizzling performance from low-cost hardware. Google’s “technological advantage” comes from Google’s incremental innovations, clever adaptations of research-computing concepts, and Byzantine tweaks to Linux. Some day, a historian of technology will be able to identify, from the hundreds of improvements that Google has engineered in the last nine years, one or two that stand with PageRank as of major importance.
Before reviewing selected parts of Google’s technology in somewhat more detail, the diagram “Google’s Computing Framework” provides an overview of the Googleplex and some of its technologies. These will be touched upon in this section. Google uses grid-like technology in its distributed computing system. Google has anywhere from 100,000 to 165,000 or more servers. Servers are organized into clusters. Clusters may reside within one rack or across multiple racks of servers.⁴ Some Google functions are distributed across data centers.
Critics of Google will see that the company has grafted to its core technology processes from many different sources. To illustrate, the structure of Google’s data centers and the messages passed to and from these data centers is in many ways a variant of grid computing. Google’s ability to read data from many computers simultaneously is reminiscent of BitTorrent’s technology.⁵ Google’s use of commodity or “white box” hardware in its data centers is an indication of Google’s hacker ethos. The use of memory and discs to store multiple copies of data comes from the frontiers of computing.
Google’s approach to technology, then, is eclectic and in many ways represents a building block approach to large-scale systems. Google benefits from that eclecticism in several ways. First, Google worked around the bottlenecks of such operating systems as Solaris, Windows Advanced Server, and off-the-shelf Linux. Second, Google took good programming ideas from other languages, implementing new functions and libraries to eliminate most of the manual coding required to parallelise an application across Google’s servers. Third, the Googleplex is a toy box for engineers and programmers. The tools are sophisticated. The challenges of the problems and peers make Google “the place to be” for the best and brightest technical talent in the world.⁶
According to Jeff Dean, one of Google’s senior engineers, “Google engineering is sort of chaotic.”⁷ This is neither surprising nor necessarily a negative. The nature of creativity combined with Google’s approach to innovation makes it difficult to predict the next big thing from Google.
4. Grid computing is applying resources from many computers in a network to a single problem or application.
5. BitTorrent is a peer-to-peer file distribution tool written by programmer Bram Cohen in 2001. The reference implementation is written in Python and is released under the MIT License.
6.
7. From Dr Dean’s speech at the University of Washington in October 2003. See http://www.uwtv.org/programs/displayevent.asp?rid=2459.
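The chapter refers to MapReduce and to libraries that remove most of the manual coding needed to parallelise an application. The following is only a toy sketch of that programming style, not Google’s implementation: the function names (`map_phase`, `reduce_phase`, `run_job`) are hypothetical, and a local thread pool stands in for a cluster of servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def reduce_phase(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(documents, workers=4):
    """Run the map step in parallel, group by key, then reduce.

    Assumption: threads on one machine stand in for many servers.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))

    # Shuffle step: gather every value emitted for the same key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["google search", "Google technology", "search technology search"])
```

The programmer writes only the two small functions; the surrounding job runner decides how the work is split and recombined, which is the kind of manual coding the text says such libraries eliminate.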
Google’s Fusion: Hardware and Software Innovations
The Google phenomenon comes from the fission occurring when PageRank’s software and hardware engineering interact. Google’s technology delivers super computer applications for mass markets. In fact, when discussing Google technology, it is important to keep in mind that PageRank is important only because it can run quickly in the real world, not in a sterile computer lab illuminated with the blue glow of supercomputers.
PageRank requires a lot of computing horsepower to work. When Google got underway in 1996, Messrs. Brin and Page had limited computing horsepower. In order to make PageRank work, they had to figure out how to get the PageRank algorithm to run on garden-variety computers available to them. From the beginning – and this is an important issue with regard to Google’s almost-certain collision course with Microsoft – Google had to solve both software engineering and hardware engineering issues to make Google Search viable.
The figure Google’s Fusion: Hardware and Software Engineering shows that Google’s technology framework has two areas of activity.
[Figure: Google’s Fusion: Hardware and Software Engineering. Surviving label: applications and data processing.]
There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub one-second response times that Google is able to maintain despite its surging growth in usage. The other effort focuses on hardware. Google has refined server racks, cable placement, cooling devices, and data center layout. The payoff is lower operating costs and the ability to scale as demand for computing resources increases. With faster turnaround and the elimination of such troublesome jobs as backing up data, Google’s hardware innovations give it a competitive advantage few of its rivals can equal as of mid-2005.
PageRank with its layering of additional computations added over the years is a software problem of considerable difficulty. Consider the links pointing to a Web page. One link equals one pointer. For a single Web page with one link pointing to it, the problem is trivial. But what happens when a site has 10,000 links pointing to it? The problem becomes many times larger and more computationally demanding. Some of these links are likely to come from sites that have more traffic than others. Some of the links may come from sites that have spoofed Google for fun or profit. The calculations to sort out the “value” of each of these links add to the computational work associated with PageRank. Sizing up different factors against one another for a single page can be hard without a calculator to help. Take the same task and apply it to a couple of billion Web pages, and the computing task becomes one for a supercomputer. Keeping track of these factors is a big job. Google must keep track of them for more than eight billion Web pages. Yet this task is everyday stuff for Google and its PageRank process.
Users do not give much thought to what technology underpins a routine query or the 300 million queries Google handles each day. In a single second, Google’s technology handles roughly 3,500 queries in dozens of languages from users worldwide. The Google system must find Web pages and perform dozens, if not hundreds, of analyses of those Web pages.
Search was the prime mover in the Google universe. Google’s technology cannot be separated from search. Once Messrs. Brin and Page were able to fiddle with a limited number of commodity computers and make their PageRank algorithm work, Google was headed down a road that it still follows.
Hardware and software are inextricably linked at Google. The software requires a suitable hardware and network infrastructure in which to operate. With each new advance in software, Google’s engineers must make correspondingly significant advances in hardware. And when hardware engineers come up with an advance, the software engineers greedily use that advance to up the functionality of their software. The synergy between software and hardware is perhaps one of Google’s major accomplishments. Without Google’s hardware and software, there would be no Google.
The hardware has to be more than clever. The hardware has to work 24x7, under continuous load, and in locations from Switzerland to Beijing. Google’s approach to data centers, the racks in the data centers, and the devices in the racks in the data centers is as clever as the company’s search system.
Some of the tinkerers come at the problem from bits and bytes, writing code, and weaving applications out of the available functions. Others come at the problem from the soldering iron and screwdriver angle. These engineers look for ways to build hardware and physical systems that can perform the calculations needed to make PageRank work. The result is a brilliant product. What Google owns is its own snappy, turbocharged supercomputer, interesting software tools, and several thousand people trying to figure out what else the Googleplex can do.
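The link-weighting computation described in this section can be made concrete with a small example. The sketch below uses the commonly published, normalized form of the PageRank iteration; it is not Google’s production code. The damping value of 0.85, the function name `pagerank`, and the four-page `toy_web` are assumptions chosen for illustration, and none of the additional factors the chapter mentions (clicks, change frequency) are modeled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate link-based importance.

    links maps each page to the list of pages it links to.
    Assumption: the normalized textbook formulation, where ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy web, page c collects three votes and ends up with the highest score, and page a outranks b and d because its single inbound link comes from the important page c. Each round touches every link once, which hints at why the same job over billions of pages calls for the hardware described above.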
How Google Is Different from MSN and Yahoo
Google’s technology is simultaneously just like other online companies’ technology, and very different. A data center is usually a facility owned and operated by a third party where customers place their servers. The staff of the data center manage the power, air conditioning and routine maintenance. The customer specifies the computers and components. When a data center must expand, the staff of the facility may handle virtually all routine chores and may work with the customer’s engineers for certain more specialized tasks.
Before looking at some significant engineering differences between Google and two of its major competitors, review this list of characteristics for a Google data center.
1. Google data centers now number about two dozen, although no one outside Google knows the exact number or their locations. They come online and automatically, under the direction of the Google File System, start getting work from other data centers. These facilities, sometimes filled with 10,000 or more Google computers, find one another and configure themselves with minimal human intervention.
2. The hardware in a Google data center can be bought at a local computer store. Google uses the same types of memory, disc drives, fans and power supplies as those in a standard desktop PC.
3. Each Google server comes in a standard case called a pizza box with one important change: the plugs and ports are at the front of the box to make access faster and easier.
4. Google racks are assembled for Google to hold servers on their front and back sides. This effectively allows a standard rack, normally holding 40 pizza box servers, to hold 80.
5. A Google data center can go from a stack of parts to online operation in as little as 72 hours, unlike more typical data centers that can require a week or even a month to get additional resources online.
6. Each server, rack and data center works in a way that is similar to what is called “plug and play.” Like a mouse plugged into the USB port on a laptop, Google’s network of data centers knows when more resources have been connected. These resources, for the most part, go into operation without human intervention.
Several of these factors are dependent on software. This overlap between the hardware and software competencies at Google, as previously noted, illustrates the symbiotic relationship between these two different engineering approaches. At Google, from its inception, Google software and Google hardware have been tightly coupled. Google is not a software company nor is it a hardware company. Google is, like IBM, a company that owes its existence to both hardware and software. Unlike IBM, Google has a business model that is advertiser supported. Technically, Google is conceptually closer to IBM (at one time a hardware and software company) than it is to Microsoft (primarily a software company) or Yahoo! (an integrator of multiple softwares).
Software and hardware engineering cannot be easily segregated at Google. At MSN and Yahoo hardware and software are more loosely-coupled. Two examples will illustrate these differences.
Microsoft – with some minor excursions into the Xbox game machine and peripherals – develops operating systems and traditional applications. Microsoft has multiple operating systems, and its engineers are hard at work on the company’s next generation of operating systems. In addition, Microsoft’s applications run on Microsoft operating systems, although a version of Microsoft Office and Internet Explorer run on Apple’s Macintosh. Its operating systems are coded, for example, for processors that evolved from the Intel chips for personal computers. Microsoft does not design or make its own hardware. Microsoft buys hardware from various suppliers to run its online systems. Examples include Microsoft’s use of Dell Computers. Most of these suppliers, not surprisingly, are certified by Microsoft. Microsoft’s engineers use these machines in configurations required by the Microsoft operating systems and applications. For example, Microsoft servers often require a load balancing feature. Microsoft implements its load balancing via software.
Recently Microsoft embarked on a new path with its game machine, the Xbox 360. The new Xbox uses a processor from IBM’s family of PowerPC chips also used in the Macintosh computer, the Sony PS/3, and Nintendo next-generation game machines.
Several observations are warranted:
1. Unlike Google, Microsoft does not focus on performance as an end in itself. As a result, Microsoft gets performance the way most computer users do. Microsoft buys or upgrades machines. Microsoft does not fiddle with its operating systems and their subfunctions to get that extra time slice or two out of the hardware. Once a function is released to customers, Microsoft’s engineers focus on stamping out bugs. Re-engineering a software application for higher performance is not typically a priority. When more performance is required, Microsoft upgrades the hardware, adds memory, or shifts to higher-speed hard drive technology instead of recoding the operating system itself to deliver higher performance as Google does.
2. Unlike Google, Microsoft has to support many operating systems and invest time and energy in making certain that important legacy applications such as Microsoft Office or SQLServer can run on these new operating systems. Microsoft has a boat anchor tied to its engineers’ ankles. The boat anchor is the need to ensure that legacy code works in Microsoft’s latest and greatest operating systems.
3. Unlike Google, Microsoft has no significant track record in designing and building hardware for distributed, massively parallelised computing. The mice and keyboards were a success. Microsoft has continued to lose money on the Xbox, and the sudden demise of Microsoft’s entry into the home network hardware market provides more evidence that Microsoft does not have a hardware competency equal to Google’s.
In terms of technology, Yahoo! operates differently from both Google and Microsoft. Yahoo! is in mid-2005 a direct competitor to Google for advertising dollars. Yahoo! has grown through acquisitions. In search, for example, Yahoo acquired 3721.com to handle Chinese language search and retrieval. Yahoo bought Inktomi to provide Web search. Yahoo bought Stata Labs in order to provide users with search and retrieval of their Yahoo! mail. Yahoo! also owns AllTheWeb.com, a Web search site created by FAST Search & Transfer. Yahoo! owns the Overture search technology used by advertisers to locate key words to bid on. Yahoo! owns Alta Vista, the Web search system developed by Digital Equipment Corp. Yahoo! licenses InQuira search for customer support functions. Yahoo has a jumble of search technology. Google has one search technology.
Historically Yahoo has acquired technology companies and allowed each company to operate its technology in a silo. Integration of these different technologies is a time-consuming, expensive activity for Yahoo. Each of these software applications requires servers and systems particular to each technology. The result is that Yahoo has a mosaic of operating systems, hardware and systems. Yahoo! may well have considerable competency in supporting a crazy-quilt of hardware and operating systems; however, Yahoo! does not have a core competency in hardware engineering for performance and consistency. Yahoo! is not a software engineering company. Its engineers make functions from disparate systems available via a portal.
Yahoo!’s problem is different from Microsoft’s legacy boat-anchor problem. Yahoo! faces a Balkan-states problem. There are many voices, many needs, and many opposing interests. Yahoo! must invest in management resources to keep the peace.
Google also acquires technology. Google has the hardware and software engineering expertise to build applications rapidly, perform computationally-intensive applications quickly, and deliver high-reliability services from low-cost, commodity hardware. A good example is Picasa. The photo management software runs on the user’s Windows PC. The program has been integrated with several of Google’s network-centric applications:
1. Gmail. The user’s images can be uploaded and sent via email to friends, colleagues and family. A Picasa user without a Gmail account is able to register and receive a user name and password. The Gmail account can also be used, if the user wishes, for other Google services, including Fusion, which is Google’s personalized portal, and the search history function, which saves a registered user’s Google queries for later reference.
2. Blog Publishing. The user can post pictures to a Google property, Blogger. Posting images on some Web log systems is beyond the expertise of many computer users. The image publishing function is simplified to one or two clicks.
3. Image Printing. The user can send images to online photo processing services.
[Figure: the Picasa interface. Callouts: One-click access to functions performed on the user’s local computer. Recently-viewed images. One-click access to network services available as part of the user’s virtual application.]
Picasa requires a download. The installation process is smooth.⁸ Google has bundled into one free application point-and-click solutions to make management of digital still images intuitive and fluid. With Picasa, Google constructs an application using some code on the user’s PC and other software running on the Googleplex somewhere on the Internet. The “hooks” are painless to the user. Google’s technologists demonstrate a rapid, trouble-free installation and an intuitive interface.
In sharp contrast to Yahoo’s approach, Google integrated the Picasa application into the Googleplex. Yahoo!’s acquisitions, in general, are not woven into a seamless experience with other Yahoo! services. Consider the 3721.com search system. That service remains a separate Chinese language operation available from mostly non-English Yahoo pages.
These three companies, different in structure and technical focus, are on a collision course. Like vessels in America’s Cup, each is going toward the same goal, but subject to forces difficult for their helmsman to control. Even though there is market space between the three, collisions are inevitable. The figure below provides an overview of the mid-2005 technical orientation of Google, Microsoft and Yahoo.
MSN, and by extension Microsoft Corporation, has a core competency in software. The company has grown from its operating system roots to provide a range of products for mobile devices, desktop and notebook computers, and enterprise-class servers. Microsoft has expended great effort to push Windows downward to mobile devices and outward to network-centric computers in an effort to increase revenue. For Microsoft to continue to be the dominant force in software in the future, the company must be able to capture a commanding share of the market for network-centric applications. Looking forward, the company’s Dot Net technology is Microsoft’s framework for virtual applications. In some ways, Dot Net is a less-open version of the AJAX technology that Google uses in the Google Maps and Gmail products.
However, Microsoft has weaknesses that can be attacked by Google and other competitors. Microsoft’s position (whether real or perceived) is its products’ vulnerability to security breaches. Patch after patch, problem after problem, then promise after promise have done little to bolster the firm’s credibility for delivering secure systems and software. The growth of open source alternatives is hard proof that die-hard Microsoft users are willing to shift for security, cost savings and functionality. Looking forward over the next 12 to 18 months, Microsoft’s prospects hinge on security, cost and its developer community.
Yahoo’s situation is typical of many American organizations. Most large US corporations are a hotch-potch of different systems, incompatible architectures and a Tower of Babel of data formats. Yahoo is now spending money to break down the walls of its data silos and integrate its user data. After years of flirting with becoming a New Age America Online, Yahoo is beginning to behave like a traditional media company. For Yahoo to deliver specific markets to its advertisers, Yahoo must integrate information from disparate systems and be able to segment and deliver ads to those users efficiently. If Yahoo cannot deliver narrowly segmented markets, advertisers may abandon Yahoo for services that offer more targeted marketing opportunities.
8. Indexing speed was about five times faster than ACDSee’s image management program, a competitive product.
MSN and Yahoo! are becoming ad-supported versions of general-interest portals like Yahoo, America Online and Tiscali, among others competing in Web search. In contrast, Google is focusing on applications that tie users to its Googleplex. Google’s high-performance, homogeneous Googleplex means that the company does not struggle with some integration, performance and cost issues that bedevil Microsoft and MSN. Google may not be doing everything right from a computer science point of view. Compared to MSN or Yahoo, Google is doing less wrong than these two aggressive competitors. The company’s focus on hardware and software engineering gives it a cost and performance advantage over MSN and Yahoo.
The Technology Precepts
Google’s technology uses concepts and techniques from the leading edge of computer science. Most of these innovations are difficult to explain to engineers steeped in traditional approaches to massively distributed, highly parallelized computing. The eclectic footnotes and references in the earlier BackRub paper have been sharpened in Google’s later technical presentations. Readers without a first-hand understanding of NOW-Sort, River, and BAD-FS are unlikely to craft dinner conversation from Google’s explanations of the influence of these research computing demonstrations.⁹ For the purposes of this monograph and understanding the nature of Google’s technology, five precepts thread through Google’s technical papers and presentations. The following snapshots are extreme simplifications of complex, yet extremely fundamental, aspects of the Googleplex.
Cheap Hardware and Smart Software
Google’s use of commodity hardware for high-demand, 24x7 systems has existed as a core precept since 1996. Most of its competitors’ online systems combine branded hardware from IBM, Hewlett-Packard, Sun Microsystems, and Dell Computers with specialized peripherals. The operating systems in use are a combination of Unix and Microsoft operating systems with some Linux and open source components. Google approaches the problem of reducing the costs of hardware, set up, burn-in and maintenance pragmatically. A large number of cheap devices using off-the-shelf commodity controllers, cables and memory reduces costs. But cheap hardware fails. In order to minimize the “cost” of failure, Google conceived of smart software that would perform whatever tasks were needed when hardware devices fail. A single device or an entire rack of devices could crash, and the overall system would not fail. More important, when such a crash occurs, no full-time systems engineering team has to perform technical triage at 3 a.m.
9. See for example Andrea C. Arpaci-Dusseau, et al., “High Performance Sorting on Network of Workstations”. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997 or John Bent, et al., “Explicit Control in a Batch-Aware Distributed File System”, March 2004. Both contained in Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation.
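The precept that software should absorb hardware failure can be illustrated with a small model. This is a minimal sketch under stated assumptions, not a description of the Google File System: the class `ReplicatedStore`, its method names, and the device names are hypothetical, and the default of three copies is simply the low end of the “three to six” range the chapter cites.

```python
class ReplicatedStore:
    """Toy model: every file is written to several cheap devices."""

    def __init__(self, device_names, copies=3):
        if copies > len(device_names):
            raise ValueError("need at least as many devices as copies")
        self.devices = {name: {} for name in device_names}  # device -> files
        self.failed = set()      # devices that have crashed
        self.locations = {}      # log: file name -> devices holding a copy
        self.copies = copies

    def write(self, filename, data):
        healthy = [d for d in self.devices if d not in self.failed]
        if len(healthy) < self.copies:
            raise RuntimeError("not enough healthy devices")
        # Prefer the devices currently holding the fewest files.
        chosen = sorted(healthy, key=lambda d: len(self.devices[d]))[: self.copies]
        for device in chosen:
            self.devices[device][filename] = data
        self.locations[filename] = chosen

    def fail(self, device):
        """Simulate a crash of one cheap device."""
        self.failed.add(device)

    def read(self, filename):
        # Consult the log and use the first replica that is still alive.
        for device in self.locations[filename]:
            if device not in self.failed:
                return self.devices[device][filename]
        raise OSError("all replicas lost")


store = ReplicatedStore(["box1", "box2", "box3", "box4", "box5"])
store.write("index-0001", b"postings")
holders = list(store.locations["index-0001"])
store.fail(holders[0])
store.fail(holders[1])
recovered = store.read("index-0001")
```

Two of the three devices holding the file crash, yet the read succeeds without any operator stepping in, which is the behavior the 3 a.m. remark describes. A real system would also re-create lost copies on healthy devices; that step is left out here.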
The focus on low-cost, commodity hardware and smart software is part of the Google culture. In one presentation at a December 2004 technical conference, a Google spokesman joked that anyone in the room could buy the same hardware that Google uses at Fry’s Electronics, a retail chain with stores in Palo Alto and other cities in California.
Logical Architecture
Google’s technical papers do not describe the architecture of the Googleplex as self-similar. Google’s technical papers provide tantalizing glimpses of an approach to online systems that makes a single server share features and functions of a cluster of servers, a complete data center, and a group of Google’s data centers.
The collections of servers running Google applications on the Google version of Linux is a supercomputer. The Googleplex can perform mundane computing chores like taking a user’s query and matching it to documents Google has indexed. Furthermore, the Googleplex can perform side calculations needed to embed ads in the results pages shown to users, execute parallelized, high-speed data transfers like computers running state-of-the-art storage devices, and handle necessary housekeeping chores for usage tracking and billing.
The diagram below shows a representation of the Googleplex’s tightly organized, highly regular organization of files, servers, clusters, and more than two dozen data centers in a stable organizational pattern.¹⁰ The Googleplex is a larger instance of the organization of a single pizza box server. A data centre uses the same design and is composed of racks.
[Figure: Sierpinski triangle. Callouts: A single Google cluster embodies the same organizing principle as a single pizza box server. A single Google pizza box server. A single replicated Google file reflects the controlling organizing principle.]
The diagram illustrates that Google’s technical infrastructure is similar at every level in the Googleplex.
10. The illustration is a Sierpinski Triangle, chosen because it conveys how each component in Google’s infrastructure replicates other larger combinations of servers and data centers. This famous fractal connotes how Google scales without altering the micro or macro structure of the Googleplex. The overall structure – in this illustration an equilateral triangle – expresses the stability of the Google approach to its system.
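One way to picture a self-similar architecture is a structure in which every level, from a single server up to a data center, is the same kind of object and answers the same questions. The sketch below is an illustration only: the class `Unit`, the helper `build`, and the fan-out figures (3 clusters per data center, 2 racks per cluster) are assumptions; only the figure of 80 servers per rack comes from the chapter.

```python
class Unit:
    """One level of a self-similar hierarchy: server, rack, cluster or data center."""

    def __init__(self, name, children=(), capacity=0):
        self.name = name
        self.children = list(children)
        self.capacity = capacity  # only meaningful for a leaf (a single server)

    def total_capacity(self):
        # The same method works unchanged at every level of scale.
        if not self.children:
            return self.capacity
        return sum(child.total_capacity() for child in self.children)

    def levels(self):
        if not self.children:
            return 1
        return 1 + max(child.levels() for child in self.children)


def build(name, fan_outs):
    """Build a hierarchy where level i has fan_outs[i] children per unit."""
    if not fan_outs:
        return Unit(name, capacity=1)  # a leaf stands for one pizza box server
    children = [build(f"{name}.{i}", fan_outs[1:]) for i in range(fan_outs[0])]
    return Unit(name, children)


# Hypothetical data center: 3 clusters, each of 2 racks, each of 80 servers.
data_center = build("dc", [3, 2, 80])
server_count = data_center.total_capacity()
```

Adding another rack, cluster or data center means adding one more child of the same type; no code at any other level has to change, which is the scaling property the Sierpinski illustration is meant to convey.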
the Googleplex consults a log for the location of the copies of the needed file. October 2004 The Google Legacy 67 . In commercial data centers. a rack or a data center can fail without data loss or taking the Googleplex down. configure and use the new resource. 11. 12. and to grab additional computing resources in order to complete a job quickly.11 Servers. the Googleplex can recognize. When a copy of that file is not available. according to the fractal architecture. Master servers – Google’s term for the pizza box that is in charge of one or more clusters – instruct other servers to copy data to the new cluster and begin using the clusters to do work. Google’s engineers plug in the needed resources. Google has an almost unlimited flexibility with regard to scaling and accessing the capabilities of the Googleplex. and embedded logic to make the servers working on tasks smarter. According to Jeff Dean. “At Google. Google’s approach delivers a homogeneous computing system.” Cables are attached among the pizza boxes and the rack is then plugged into a network hub. The application then uses that replica of the needed file and continues with the job’s processing.”12 Speed and Then More Speed Google Search is fast with most results coming back to the user in less than one second. Redundancy and other engineering tweaks to Linux gives the Googleplex ways to eliminate or reduce the bottlenecks associated with traditional online computer systems’ operation. Google has infused the Googleplex with logic that allows software to handle data recovery. its storage and its supported applications with an ease and price point rivals cannot easily match. one of Google’s senior engineers. A cluster allows data to be replicated and work shared among pizza boxes with spare capacity. A good example is bringing a new rack of 40 or more pizza box servers online and creating one of the many types of servers Google users. optimized file handling. In fact. 
highperformance hardware from such manufacturers such as Sun Microsystems and using advanced storage devices connected to the servers by exotic fibre optics.Chapter Three: Google Technology What is of interest is that Google does this with low-cost commodity hardware running on Google’s version of Linux. The term pizza boxes has been appropriated by engineers to describe one of the standard form factors for servers housed in rack mounts in data centers. and the other devices become aware of the new rack’s resources. An engineer turns on the power. The Google operating system ensures that each file is written three to six times to different storage devices. Unlike a collection of different building materials. This architecture allows Google to expand its computational capacity. everything is about scale. The Google technical recipe includes distributed computing. When Google needs to add processing capacity or additional storage. the loss of an individual device is irrelevant. consist of two or more clusters of pizza boxes. In Google’s self-similar architecture. Due to self-similarity. speed has traditionally been achieved by buying high-end. A rack is assembled and then Google’s pizza box servers are “plugged in. to streamline messages passed from server to server.Statement made at the University of Washington.Data centers use computer cases that are shaped like the boxes used to hold pizzas.
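The replication behaviour described on this page (each file written three to six times, and a log consulted for the location of another copy when one is unavailable) can be sketched as follows. This is only an illustration under stated assumptions: the class name `ReplicaLog`, its methods, and the device names are hypothetical and do not come from Google's software.

```python
import random


class ReplicaLog:
    """Hypothetical log of where each file's copies live (not Google's code)."""

    def __init__(self, copies=3):
        # The text says each file is written three to six times.
        if not 3 <= copies <= 6:
            raise ValueError("copies must be between 3 and 6")
        self.copies = copies
        self.locations = {}  # file name -> list of device ids holding a copy

    def write(self, name, devices):
        """Record `copies` distinct devices as holding a replica of `name`."""
        chosen = random.sample(sorted(devices), self.copies)
        self.locations[name] = chosen
        return chosen

    def locate(self, name, failed):
        """Return the first recorded device for `name` that has not failed."""
        for device in self.locations[name]:
            if device not in failed:
                return device
        raise LookupError(f"no live replica of {name}")


log = ReplicaLog(copies=3)
placed = log.write("index-part-007", {"disk-a", "disk-b", "disk-c", "disk-d", "disk-e"})
# Simulate the loss of the first device; the job continues from another copy.
survivor = log.locate("index-part-007", failed={placed[0]})
```

In this sketch the loss of a single device only changes which replica the application reads; the job itself does not stop unless every recorded copy is gone.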
Not Google. Google generally uses servers that have two processors similar to those found in a typical home computer. Instead of using exotic servers with eight or more processors, Google uses commodity pizza box servers organized in a cluster. A cluster is a group of computers that are joined together to create a more robust system. For greater speed, Google’s approach has been to focus on making its software engineering produce the turbocharged performance. Through proprietary changes to Linux and other engineering innovations, Google is able to achieve supercomputer performance from components that are cheap and widely available.

The table below provides some data from 2002 about the speed with which Google can read data from hard drives:13

These data show the results of two clusters’ performance. To put these data in a context of 2002 technology, consider that an IBM EXP3 storage device available in 2002 could read data in burst mode at the rate of about 58 MB / second. Google’s read rate in 2002 averaged ten times the read rate of the IBM EXP3. The write rate is comparable. The cost of a single IBM EXP3 in 2002 was about $18,000 for 360 gigabytes of storage, excluding controller and cables. Google’s cost for comparable storage and the higher performance was about $1,000.

Google’s read throughput has gone up since 2002. Google has not updated its read rate data, but engineers familiar with Google believe that read rates may in some clusters approach 2,000 megabytes a second.14 Advances in commodity storage devices translate to even faster performance for Google. When commodity hardware gets better, Google runs faster without paying a premium for that performance gain.

Google squeezes out more productivity by applying its engineering talents to application development. With Google’s advanced programming tools, Google is able to increase the productivity of its engineers. Combined with hardware speed and performance, Google has a strong one-two punch. In the world of ever-increasing demands for speed and storage, Google spends less. This is a one-two-three punch to which Google’s competitors have to respond.

Google engineers for computational speed. Speed is crucial to Google’s PageRank and other analytic processes. If Google’s computational throughput were slow, Google could not perform the work needed to know that for a particular query, a particular set

13. From “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google), ACM SOSP 2003 Conference Proceedings 1-58113-757-5/03/0010, page 12.
14. Based on increases in commodity drive throughput, Google’s read rate may be close to 2,000 megabytes per second, which may be a Google watcher’s enthusiasm boosting already-robust figures.
of indexed Web pages is the best match. Google does not mindlessly match key words in a user’s query to the terms in the Google index, although term matching is an important part of the Google process. Google’s approach is more subtle and computationally involved. Google reviews data, various scores or values from certain algorithms, and continuously updates values as Google users click on links. Google then uses these different values in other algorithms to find search results, identify the best match (Google’s “Feeling Lucky” link), extract matching ads from its advertising server, display maps with the speed of a dedicated desktop application like Encarta, and look at a Web page matching a user’s query and, in some applications, insert additional hyperlinks to related content before displaying the results page to the user. Once these various query and ad matching processes are complete, Google displays the results page to the user, typically in less than one second across a public network. Google is a hot rod computer that can perform the basic mathematics needed to deliver most search results in less than a half second.

Speed has many meanings at Google. Speed means that users can interact with the Google products and services as if the Google application were running on a dedicated PC in front of the user. Without fast response to a query, users would not be willing to run multiple queries and interact fluidly with the Google applications. Speed also means that Google must be able to expand its computational and storage capacity quickly. The Googleplex does experience slow downs. When these occur, the Googleplex allocates additional resources to eliminate the brown out. Speed, like Google’s ability to scale, is a core functionality of the Googleplex.

Google applies its high-speed technology to search and to other types of servers. Among the servers using Google’s go-fast technology are those shown below:

Type: Function
Advertising server: Delivers text and other paid advertisements for AdWords and AdSense.
Chunkserver: Schedules and delivers blocks of data for further processing.
Image servers: Serves images for Google Image, Print and Video services.
Index server: The workhorse of search. Server handles search-and-retrieval.
Mail server: Delivers the Gmail service.
News server: Gathers, analyses and displays news.
Web server: Orders results and makes them available to users.

Speed also means rapid development and deployment of new products. What does the combination of go-fast technology plus multiple types of Google data allow the company to do? Google can engage in fast new product development. One example is Google Maps. Google developed a basic mapping product over the course of 2004. In late 2004, Google purchased Keyhole. By June 30, 2005, Google had:

1 Released a basic mapping product.
2 Integrated information from Google Local in early 2005.
3 Hooked Keyhole satellite imagery into Google Maps in early May 2005.
4 Upgraded the system to integrate two dimensional point-to-point routes on top of satellite imagery.
5 Announced Google Earth in May 2005.
6 Demonstrated a function that accepts a query in another language, translates the results to the user’s language, and displays the data in a three-dimensional mode.

In the span of several days, Google integrated Keyhole technology, launched, upgraded and redefined online mapping services. The image below shows that Google’s Map and Earth service pushes the functions of online map and data integration to another level.15

This is the result of a Japanese language Google Maps-Earth query for the location of Wendy’s restaurants in New York City. The addition of the Japanese language support, the two dimensional map, the three-dimensional view of the section of Manhattan where the user wants directions, the integration of hot links, and information about the restaurants was part of Google’s fast-cycle launch and enhancement program designed to beat Microsoft to the market.

Another key notion of speed at Google concerns writing computer programs to deploy to Google users. Google has developed short cuts to programming. An example is Google’s creating a library of canned functions to make it easy for a programmer to optimize a program to run on the Googleplex computer. At Microsoft or Yahoo, a programmer must write some

15. The source for this image was http://blog.eee-craft.com/archives/23345086.html.
code or fiddle with code to get different pieces of a program to execute simultaneously using multiple processors. Not at Google. A programmer writes a program, uses a function from a Google bundle of canned routines, and lets the Googleplex handle the details. Google’s programmers are freed from much of the tedium associated with writing software for a distributed, parallel computer.

What does increased programmer productivity mean? In terms of money, Google makes each engineering dollar go farther. If a single programmer can reduce by 10 percent the time required to code a program, the savings could be several thousand dollars. If a programmer can slash coding time in half, Google gets twice the potential productivity out of each of its 3,000 plus programmers. Google management faces a challenge in managing its programming talent. Staff burn out or defections could impair Google’s technical resources.16

Eliminate or Reduce Certain System Expenses

Some lucky investors jumped on the Google bandwagon early. Nevertheless, Google was frugal, partly by necessity and partly by design. The focus on frugality influenced many hardware and software engineering decisions at the company. Spending money wisely does not mean cheaply. Examples of how Google eliminates or reduces certain system expenses include:

• Google eliminates the costs associated with backing up and restoring data when a hardware failure occurs. The fractal principle requires that Google replicate data three to six times elsewhere in the Googleplex. When a device fails, the “master server” for a task looks at a file that tells where the other copies of the data or the programs are. The “master server” then uses those data or those processes to complete a task. No tape, no human intervention, and no downtime. Google does not have these costs due to its engineering acumen.

• Google does not have to certify new hardware. When additional storage or computational capacity is required, Google technicians assemble one or more racks of Google “pizza boxes.” Once in the rack, the Googleplex recognizes the new resources in a way that is similar to how a laptop knows when a user plugs in a USB mouse. The expensive certification processes otherwise required for some high-end hardware are eliminated. Google engineers plug in resources and let the Googleplex handle the other tasks.

• Google innovation uses open source code as a starting point. Google did not write a complete operating system for its Googleplex. Google made key changes to Linux, adding necessary services and functions to meet the specific requirements of Google applications. Google does not have to work around known bottlenecks in some commercial operating systems. Many of Google’s most striking technical advances are based on modifying open source software to benefit from insights gained from experimental results in supercomputing. Unlike Microsoft, Google’s approach is pragmatic and less time-consuming than Microsoft’s “death march” to get Longhorn shipped by late 2006. Compared with Yahoo, Google’s approach is more cohesive. Google has used Linux, standards, and open source software for virtually all of its core services and thus spends less time pounding disparate systems and data into a standard type. Yahoo faces integration drudgery as a result of its multiple systems and heterogeneous hardware and data.17

• Google does not spend money for high-performance devices to make its system perform faster. As the performance of commodity hardware goes up, the cost of that hardware goes down. Bulk purchasing chops as much as 50 percent off the cost of some hardware. Google can replicate its data and give away free gigabytes of email storage. The cost to Google can be as low as a few cents a gigabyte.

To illustrate the financial payoff from the use of commodity hardware, Google engineers revealed a back-of-the-envelope calculation. Although dated, it underscores the economies of the Google approach:18

The cost advantages of using inexpensive, PC-based clusters over high-end multiprocessor servers can be quite substantial, at least for a highly parallelisable application like ours. For example, a $278,000 rack contains 176 2-GHz Xeon CPUs, 176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk space; it costs about $758,000. In other words, the multi-processor server is about three times more expensive but has 22 times fewer CPUs, three times less RAM, and slightly more disk space. Much of the cost difference derives from the much higher interconnect bandwidth and reliability of a high-end server, but again, Google’s highly redundant architecture does not rely on either of these attributes. [Emphasis added]

This means that when Microsoft or Yahoo! spends US$3.00 for better performance, Google spends less than US$1.00. If these 2002 data can be accepted, Google spends one-third for more computing horsepower and disc space than companies spend using a traditional server architecture.19 Over time, competitors such as IBM, Microsoft or Yahoo may implement similar features into their network-centric services. Until then, Google has a cost advantage at least with regards to scaling online operations.

Snapshots of Google Technology

Google engineers generate a large volume of technical information. Some of the data are in the form of patents, often written in a style that communicates little of the patent’s substance to a lay reader.20 Exploring

16. Some Google programmers have complained about the peer pressure to perform.
17. Google does not explicitly state that it has embraced a services oriented architecture or SOA. However, many of Google’s practices illustrate an informed use of certain features of SOA.
18. Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Computer Society 0272-1732/03, March April 2003.
19. A review of Google’s cost estimates for this monograph revealed that Google is understating its cost advantage by one or two orders of magnitude.
20. See http://labs.google.com/papers.html#compilers on June 1, 2005. The link for Google’s publications can shift unexpectedly.
biographies of Google executives and Google Web logs can yield some useful technical information. For example, one Google biography linked to more than 36 personal projects, including one by Google’s CEO.21 Surprisingly, Google’s search engine does a hit-and-miss job of indexing Google’s own technical information.

Useful engineering information appears on the Google Web site. The topics covered in various monographs, white papers and technical notes concern a wide range of subjects. For example, in mid-2005, papers were available on such topics as algorithms, compiler optimization, information retrieval, artificial intelligence, file system design, data mining, genetic algorithms, software engineering and design, and operating systems and distributed systems, among others. Google is posting more information about operating systems and applications. For example, Google explains its use of very large files as well as how the Google-modified version of Linux automatically allocates work and avoids the file system bottlenecks that can plague Solaris and Windows Advanced Server 2003. The thrust of Google’s innovation is to build out the search platform and expand the functionality of its backoffice programs such as those used for advertising services, among others.

Google’s technical papers and Google patents provide some insight into areas of interest at Google. The annex to this monograph provides information about more than 60 patents for which Google is believed to be the assignee. To provide a more fine-grained look at Google technology, the table below identifies selected examples of innovations documented by Google engineers or researchers close to the company. Most of these papers appeared prior to Google’s receiving a patent for the technology referenced in these reports:

Technology: Google Suggest
Purpose: Helps users find needed information by analysing queries and suggesting other queries.
To Learn More: Services Computing, 2004 IEEE International Conference on (SCC'04) by Stephen Davies, Serdar Badem, Michael D. Williams, Roger King, September 2004.

Technology: Video Object Search
Purpose: User types an object name and Google finds that object in a video.
To Learn More: Ninth IEEE International Conference on Computer Vision, Volume 2, Josef Sivic, Andrew Zisserman. Publication Date: October 2003.

Technology: MapReduce
Purpose: New functions in Google Linux to speed programming and other processes involving large data sets.
To Learn More: OSDI Proceedings, December 2004.

Technology: Google File System
Purpose: Extension to Google Linux to allow high-speed data reads and writes from commodity drives.
To Learn More: ACM Publication 1-58113-757-5/03/0010.

21. This is the lex project that “helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.”
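The MapReduce entry in the table can be made more concrete with a toy example. The sketch below counts words using a map step and a reduce step, the programming model named in the table. It runs in a single process and is an assumption-laden illustration only: the function names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and nothing here reproduces Google's library or its distribution of work across a cluster.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Toy single-process driver (hypothetical): group map output by key, then reduce."""
    grouped = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


word_counts = run_mapreduce(
    ["google file system", "Google MapReduce"],
    map_words,
    reduce_counts,
)
```

The programmer supplies only the two small functions; in a real cluster the grouping and scheduling done here by `run_mapreduce` is the part the platform would handle.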
state-of-the-art facility reflects what Google engineers have learned about heat and power issues in its other data centers.

PRWeaver’s Web log contained a posting of a photograph allegedly taken inside a Google data center. If true, the physical layout of the racks holding an estimated 2,000 or more servers squeezes a large amount of hardware in a tightly-packed space. Google packs servers on two sides of a rack. This type of dense configuration helps explain the comments about Google’s heat and power concerns. On the plus side, the dense configuration makes set up and maintenance somewhat easier.

Within the last 12 months, Google has shifted from concentrating its servers at about a dozen data centers, each with 10,000 or more servers, to about 60 data centers, each with fewer machines.23 The change is a response to the heat and power issues associated with larger concentrations of Google servers. Most data centers were not designed to handle dense concentrations of thousands of servers. Google has to locate hosting facilities that can meet the company’s heat and power requirements.

Heat contributes to hard drive failures. The most failure prone components are:

• Fans.
• IDE drives which fail at the rate of one per 1,000 drives per day.
• Power supplies which fail at a lower rate.

Repairs are batch operations. Scheduling the fixes is a major job and work is underway to improve the Google-developed scheduling capability.

Other Data Center Issues

Google data centers have access to multiple high-speed lines and normal data center functions such as redundant power, traffic routing and strict rules governing access to the physical boxes. Google data within the data center are replicated on other servers and other clusters running in the racks. A unique property of the data centers is that replicated content can be written from one data center to another.

The Google “plug and play” engineering philosophy appears to be used in and across data centers. If a data center, such as the one shown above, needs additional index server capacity, the technicians in that center can build a Google rack of 40 pizza box servers. These servers are connected to the network. When the rack is powered up, it becomes available to the master servers for that data center. These master servers then mark the rack’s resources as available. Master servers then begin sending work to the new devices. The information about data

23. These data appear at www.mcdar.net/SEOTools.htm
centers indicates that this “plug and play” concept and automatic discovery of new resources applies to new data centers, not just the racks within them. By eliminating such tasks as certifying and configuring Small Computer System Interface RAID storage devices, Google is content to let the auto-discovery functionality alert a “master server” to a new resource, master servers to alert other master servers, masters to notify clients of tasks, and data centers to pass information that racks, clusters or a new data center are available for use. It may be an exaggeration that a Google rack and the data center in which the rack resides works like a USB mouse. The general concept seems to be what Google engineers have tried to achieve.

“Wherever we put a cluster, we have heat, cooling and power issues,” said Jeff Dean, a senior Google engineer. A Google engineer said, “Our cages are custom built and there’s a lot of work done by us and the data center people before we can flip the switch. When we put in a data center, that data center operator faces new challenges. We use each day four megawatts of electric power.” The problems include:

1 Heat. Special racks with fans that cool the core of the rack are used.
2 Power. The power demand at load is greater than data centers typically sustain.
3 Network management tools, monitoring and related issues. Off-the-shelf network management tools are not tailored to Google’s requirements. Network management tools have to provide a broad range of monitoring and support functions for the global network, devices, data flows, work loads and potential problem areas. Therefore, Google is developing needed network management tools specifically for the Googleplex.

What’s Up, Sergey?

The Google data centers are concentrated in North America with other data centers located in Switzerland, the Pacific Rim, and Beijing.24 Because the GOS is self-healing, the operating system and the various “master computers” in a cluster know what device is online and what device is dead. The overall Googleplex works and continues working even if a device, rack or data center goes dark or dies. Google has had to create network management tools to manage its self-healing, automatic failover operating system. Google is developing network management and monitoring tools so that the information in the Google operating system log files can be displayed in a meaningful way to Google network engineers.

24. The Beijing data center was purpose built to conform to the ruling body’s requirements for online access. Google complied in order to do business in China. Yahoo! bought 3721.com in order to accelerate its effort in China.
Unanticipated Faults Could Derail Google’s Juggernaut

Google’s network uses a number of concepts from the fringes of computer innovation as well as its hands-on knowledge gained from the Googleplex itself. The result is a highly resilient network that may breed problems not previously encountered. Although Google has operated for more than five years without downtime from system failure, the possibility – however remote – does exist that something unanticipated could occur. Increasing the potential risk are the changes Google makes to speed up program execution. The advanced technology of Google’s MapReduce tool and its 400 module library could pose as yet unforeseen technical problems. A sufficiently large problem could deal Google a severe blow.

This is a diagram produced by Google engineers. The diagram shows how Google’s approach eliminates the bottleneck in parallelized systems produced by excessive message traffic flowing through a server coordinating work among different computers.

Summary of Google’s Drawbacks

Critics of Google can point to three “problems” with Google’s approach to performance. First, Google is a one-trick pony. Second, Google’s use of commodity hardware and cheap storage is a risky solution. Unknown problems may lurk when cheap components are used in a mission-critical system. The changes to Linux and the other technical modifications are little more than hackers’ attempts to squeeze a small performance gain.
Other operating systems – including those from computer research laboratories and even Microsoft – do the same things and have for years. Finally, Google is taking a strategic risk with commodity hardware and a souped up version of Linux. Each day Google bets that its technologists can keep the system humming.

Google’s approach to massively-parallel distributed computing works. The user experience speaks for itself. Google fused the type of thinking associated with small, cash-strapped companies with techniques from advanced computer systems. Commodity products keep costs down. A modified Linux delivers fast performance at a bargain basement cost, even on dial-up networks.

Another reason why Google’s approach to technology is paying off is that Google employs the same pragmatism and cleverness in application development. Google uses standard engineering practices, proprietary knowledge, and off-the-shelf techniques such as its use of Web services. Google uses the same Web programming techniques that millions of Web developers use. The payoff is that it is easy for Google to hire people who can code for the Googleplex. Google so far has not had to spend money for developer marketing programs or train new hires to work in the Googleplex.

The biggest boost to Google’s technical approach is that its competitors are following different, more expensive approaches. Microsoft uses its own operating systems but relies on other operating systems as well. Microsoft must invest in hardware to squeeze performance out of its platforms. Microsoft seems powerless to enhance the speed of its operating system. Yahoo wrestles with its many different platforms, including Solaris. Yahoo is a fruit cake of hardware, operating systems, and applications coded at different times in different languages by different people. Both are digital ostriches burying their heads in their own marketing material.

Leveraging the Googleplex

Google has demonstrated that search is just one application that can run in the Google environment. There are many other applications that can benefit from Google’s approach to online services.

1 Applications that require a high performance payoff for a low cost such as electronic mail.
2 Applications that require request-level parallelism, a characteristic exploitable by running individual requests on separate servers such as Google Earth.
3 Computationally-intensive, stateless applications.
4 An application that can run in Google’s redundant environment where there is no private-state replication such as found in IBM’s AS/400 operating environment and others.

There is little to be gained by trotting out war-horses to trample Google.
Google’s technology is one major challenge to Microsoft and Yahoo. So to conclude this cursory and vastly simplified look at Google technology, consider these items:

1 Google is fast anywhere in the world.
2 Google learns. When the heat and power problems at dense data centers surfaced, Google introduced cooling and power conservation innovations to its two dozen data centers.
3 Programmers want to work at Google. “Google has cachet,” said one recent University of Washington graduate.
4 Google’s operating and scaling costs are lower than most other firms offering similar businesses.
5 Google squeezes more work out of programmers and engineers by design.
6 Google does not break down, or at least it has not gone offline since 2000.
7 Google’s Googleplex can deliver desktop-server applications now. Google’s applications install and update without burdening the user with gory details and messy crashes.
8 Google’s patents provide basic technology insight pertinent to Google’s core functionality.
9 Google uses standard Web technologies in clever ways.

A young programmer in Osaka or Beijing is very likely to have been influenced by Google. The skilled programmers want to work at Google, develop for the Googleplex, and, if possible, create their own Google killer. The mantra is, “Be like Sergey and Larry”. Although the technical challenges facing Google are formidable, the company has advanced the art of online computing. Google has a next-generation computing platform. That platform is optimised to deliver virtual applications to its users worldwide.