's US Patent Application #20050071741 Information Retrieval Based on Historical Data. My own advice and interpretation is offered throughout this paper - please conduct your own research before acting on the recommendations. Sections in this Report: I. Overview of the 5 Most Critical Concepts from this Paper Google's Concept of "Document Inception" How Changing Content can Affect Rankings Spam Detection & Punishment What Google is Attempting to Measure The Impact of this Patent II. Analysis and Interpretation of 63 Patent Components History Data (1) Inception Date (4) Frequency of Document Changes over Time (6) Amount of Changes over Time (3) Click-Through Rate Data (2) Document Association to Search Terms (1) Queries that Remain the Same but have New Meanings over Time (1) Staleness of Documents (3) Link Behavior (4) Freshness of Links (4) Anchor Text Changes over Time (1) Content Changes in a Document compared to Linking Anchor Text (1) Freshness of Anchor Text (2) Traffic Characteristics of Site/Page (2) User Behavior (2) Domain Related Information (3) Prior Rankings Data (4) User Maintained Data (3) Growth Profiles of Anchor Text (1) Linkage of Independent Peers (1) Document Topics (1) Identifying Relevant Documents (1) Plurality of History Data (1) History Component (1) Ranking of Linked Documents (10) III. Documentation on Description Elements Document Inception Date Content Updates/Changes Query Analysis Link-Based Criteria Anchor Text Traffic User Behavior Domain Related Information Ranking History User Maintained/Generated Data Unique Words, Bigrams, Phrases in Anchor Text Linkage of Independent Peers Document Topics IV. List of Additional Coverage & Resources Overview of the 5 Most Critical Concepts from this Paper These 5 concepts are what I believe to be the most ground-breaking and important for search engine optimization professionals to understand in order to best
conduct their work.

1. Google's Concept of "Document Inception"
The date of "document inception", which can refer to either a website as a whole or a single page, is used in many different areas by Google. This data can come from the registration info, the date Google first found a link to the site/page, or the site/page itself. Google will be using this data to rank documents and establish credibility and relevance.

2. How Changing Content can Affect Rankings
Changing content over time has a huge impact on Google's measures according to this patent. They use changes to determine the "freshness" or "staleness" of websites and pages, and how that data impacts the value of the links on the page as well as its rankings. They'll also measure large, "real" content changes vs. superfluous changes and rank based on that data. Google also says that for some types of queries, particular results are more valuable: stale results may be desirable for information that doesn't need updating, fresh content is good for results that require it, seasonal results may move up or down in the rankings based on the time of month/year, etc.

3. Spam Detection & Punishment
Google is employing many new systems of spam detection and prevention according to the patent. These include:
  Watching for sites that rise in the rankings too quickly
  Watching for registration information, IP addresses, name servers, hosts, etc. that are on their "bad list"
  Growth of off-topic links
  Speed of link gain
  Percentage of similar anchor text
  Topic/Subject shifts or additions

4. What Google is Attempting to Measure
Google wants to measure or is attempting to actively measure each of the following:
Domain information
  Registration date
  Length of renewal (10 years, 5 years, 1 year, etc.)
  Addresses and names of admin & technical contacts
  DNS records
  Address of name servers
  Hosting location & company
  Stability of this data
Information on User Behavior Online
  CTR (Click-Through Rate) of individual results in the SERPs
  Length of time spent on a given site/page
Data contained on your computer
  Favorites/Bookmarks list
  Cache & temp files
  Frequency of visits to particular sites/pages (history)

5. The Impact of this Patent
I believe that this patent will help to verify most of the theories surrounding Google's rankings. There has been speculation over the past 18-24 months on nearly every subject covered in this patent at the major SEO forums, but this will serve as verification. Although it is long, I urge every SEO/Webmaster to read this page completely. I
have attempted to make the information legible and readable, and only pulled out parts that are important to the active practice of SEO (which was almost 2/3 of the document, surprisingly). If you have any questions or corrections on this summary, please send me an email. -------------------------------------------------------------------------------Analysis & Interpretation of the 63 Patent Components History Data 1. Documents may be scored in Google's rankings based on "one or more types of history data". Inception Date 2. The "inception date" read - registration date - may be considered as a scoring factor (I assume that older will be considered better, but this is not spelled out). 3. Google may determine how old each of the pages on a given website is and then determine the average age of pages on the website as a whole. The difference between a specific page's age and the average age of all documents on the site will be used in the ranking score. 4. The score for a website may include the amount of time since "document inception" - i.e. how old the website is. 5. One methodology of discovering site age might include when Google first "discovered" - read spiders the site, when Google first finds a link to the site, and when the site contains a "predetermined number of pages". I interpret this to mean that Google has some kind of threshold for site size (number of pages) that when reached, triggers a scoring effect (probably positive). Frequency of Document Changes over Time 6. Google's scoring will (according to the patent) be based on "determining a frequency at which the content changes over time". 7. The "frequency at which the content changes" will be determined by the average time between changes, the number of changes over a particular time period, and the rate of change of one time period vs. the rate of change for another time period. 
So, if you are updating your website every day, then switch to updating once a week, your scoring in the historical measurements at Google will shift. 8. Scoring will also include how much of the site has changed over a given time period (new pages, changes, etc.). 9. The scoring based on changes (described in #8) will be determined by the number of new pages within a time period, the ratio of new pages vs. old pages and the total "percentage of the content of the document that has changed during a timed period." 10. The scoring of changes (from #8) will be based on the "perceived importance of the portions" that have been changed. The score will also take into account the changes as compared to the weighting(s) of each of the different pages of the site - i.e. if important pages change, it will have a different impact than if unimportant pages changed. My guess is that importance is mostly determined by links (both internal and external) that point to a given page. So if your contact page changes, it's not a big deal, but if your home page changes, that's a bigger deal.
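The three change-frequency measures named in #7 can be sketched in a few lines. This is only an illustration of the arithmetic, not Google's method: the function names, the 28-day window, and the sample update history are all assumptions made up for the example.

```python
from datetime import date, timedelta


def average_days_between_changes(change_dates):
    """Mean gap, in days, between consecutive change dates."""
    ordered = sorted(change_dates)
    if len(ordered) < 2:
        return None  # not enough history to measure a gap
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def changes_in_window(change_dates, start, end):
    """Number of changes with start <= date < end."""
    return sum(1 for d in change_dates if start <= d < end)


def rate_ratio(change_dates, split, window_days):
    """Changes in the window after `split` divided by changes in the window before it."""
    window = timedelta(days=window_days)
    before = changes_in_window(change_dates, split - window, split)
    after = changes_in_window(change_dates, split, split + window)
    if before == 0:
        return None  # no baseline to compare against
    return after / before


# Hypothetical site: updated daily for 28 days, then weekly for 28 days.
first_day = date(2005, 1, 1)
daily = [first_day + timedelta(days=i) for i in range(28)]
weekly = [first_day + timedelta(days=28 + 7 * i) for i in range(4)]
history = daily + weekly
switch_day = first_day + timedelta(days=28)

print(average_days_between_changes(history))
print(changes_in_window(history, first_day, switch_day))
print(rate_ratio(history, switch_day, 28))
```

A ratio well below 1 is the kind of shift described above for a site that drops from daily to weekly updates. How such a shift would actually be scored is not spelled out in the patent.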
11. The scoring for a "plurality of documents" - many pages in a given website - includes determining the last date of change for each page, determining the average date of change, and scoring the documents based on, "at least in part", the difference between a specific page's change and the average document's change. So, if one page had new information added, it would be scored differently than the other pages, while if all the pages changed together (maybe a new date, or new link or copyright in the footer, etc.), they would all be equal (since their date of change compared to the average is the same). Amount of Changes over Time 12. Google's score may also include a measure of the amount of content which changes over time on the given website. 13. The "amount of content changes" from #12 will be determined by the ratio of new pages vs. the total number of pages on the site, and the percentage of content change over a given time period. 14. The "changes over a given time" from #13 will be scored based on "weighting different portions of the content differently based on a perceived importance" - once again, I read this as internal and external links to a page - the more links, the more "perceived importance". Click-Through Rate Data 15. The "history data" from #1 could include information on "how often the document is selected when the document is included in a set of search results". This is literally tracking clickthroughs and rewarding those sites with higher CTR - just like AdSense does. Google will be scoring based on the "extent to which the document is selected over time... when included in a set of search results". We always assumed this to be true, but this is the first hard evidence I've seen directly from the horse's mouth. 16. Google may assign a "higher score" when the document is selected more often. No-brainer. Document Association to Search Terms 17. 
Google might be scoring based on "determining whether a document (that has been showing up in the search results) is associated with the search terms". Queries that Remain the Same but have New Meanings over Time 18. Google (according to the patent) calculates whether the "information relating to queries" remains the same or changes and scores documents based on this. For example, prior to September 11, the phrase 9-11 would not be related with terrorism, afterwards, it would be. Google will score documents based on the changes in the results for a given query to keep up with the times. Staleness of Documents 19. The "staleness of documents" might be calculated as part of Google's scoring. 20. Google may also determine whether "stale documents" are preferable for certain types of queries (those that don't change over time, or for which a specific, single answer is what's necessary). 21. The "favorability" of stale documents may be determined by how often they are clicked on in the search results (over other documents). I relate this to a Wikipedia article on the nature of volcanoes - it doesn't need too much updating and will be a good relevant source for a long time for the query - "nature of volcanoes".
Link Behavior 22. History data scores might also consider the "behavior of links over time". 23. The appearance and disappearance of links figure into the scoring for link behavior (from #22). 24. The appearance/disappearance of links are dated by Google and used in the scoring. 25. The link appearances/disappearances are monitored and Google measures "how many links... appear or disappear during a time period, and whether there is a trend" toward more links or fewer links. The temporal (time-based) nature of groups of links will be scored by Google. Freshness of Links 26. Google may use the "freshness of links" and assign weights to links based on freshness. 27. The "freshness" of a link (from #26) is calculated by the date of appearance of that link, the date of any changes in the link or anchor text, the date which the page and site that the link is from appeared and the date of the links to that linking page. So, if you have a new blog entry that points to a new site, the freshness will be super-fresh, since the page is new, the link to the page is new, the blog page that links to it is new, and the link to your blog entry on your own site is new (that's a lot of new, hence it's super-fresh). 28. The weight of a link also takes into account how "trusted" the site is, how authoritative the page with the link on it is and how "fresh" the page & site containing the link are. 29. The scoring also takes into account the "age distribution associated with the links based on the ages of the links". Google will take into account the age of the links to your page, and the time periods over which you got the links, i.e. lots of new links, a wide distribution over time, most links from a long time ago, etc. Anchor Text Changes over Time 30. Google may also calculate changes in anchor text over time and use this data to score. My guess is that anchor text doesn't change very often, but they're certainly free to measure it. 
Content Changes in a Document compared to Linking Anchor Text 31. Google might also measure if the content of a document changes, but the anchor text remains the same, or vice-versa. They're trying to protect against the anchor text "bait and switch" that makes a document look relevant to the anchor text, then replaces it with something else. Freshness of Anchor Text 32. Freshness of anchor text can be considered. 33. Freshness of anchor text is calculated by "date of appearance", "date of change", and the dates of change and appearance of the page the link is on. Traffic Characteristics of Site/Page 34. Traffic characteristics associated with a page/site may be taken into account in scoring.
35. The traffic pattern will have associated analysis that might feed into Google's score. So Google must be measuring traffic to a site/page and determining if, over time, it increases, decreases, etc. - they're seeking trends on which to base scoring. User Behavior 36. User behavior regarding a particular page/site may figure into the scoring. 37. Google says that user behavior (from #36) is basically just the percentage of the time users click on a site/page when it is listed in the search results pages, along with the amount of time that users spend "accessing the document". I guess we all need to keep up the amount of time people spend on our sites. Domain Related Information 38. The scoring might also include the sites associated with a given site and the "domain-related" information. This is defined in greater detail below. 39. Associated sites (from #38) are measured in terms of "legitimacy", which I interpret to mean non-spam, different owner, etc. Google says, specifically "scoring the document based... on whether the domain associated with the document is legitimate." 40. The "expiration date of the domain", the "domain name server record" and the "name server associated with the domain" are all parts of how Google will establish the legitimacy of an "associated" site. Prior Rankings Data 41. History data scores could also take into account "information relating to a prior ranking". This means Google will be storing information about previous rankings for a site and using them to base scores on. 42. Google may also calculate where in the previous rankings the site was and how it moved around as pieces to figure into the scoring data. 43. In reference to #41, Google is using seasonal, "burstiness" and changes in scores over time as metrics to calculate the prior rankings scoring. 
So if a site is particularly relevant for "gifts for girlfriend" around Valentine's Day, but not as much for the same query at Christmas, Google will record this information and rank accordingly. 44. Google could also, with regard to #41, record "spikes in the rank" of site/pages in the search results. User Maintained Data 45. "User maintained data" may also be recorded and monitored for the rankings scores. 46. "User maintained data" includes; favorites lists, bookmarks, temp files and cache files of monitored users. I'm not sure how they could obtain this data without installing "Google Spyware" - perhaps in the form of desktop search or the Google toolbar. 47. Monitoring the rate at which a site/page "is added to or removed from user generated data" may be used in the scoring. Growth Profiles of Anchor Text 48. Scores might include "growth profiles of anchor text" - Google could monitor the use of anchor text in large groups and where/when they point to different
sites & pages. Linkage of Independent Peers 49. Information "relating to linkage of independent peers" might be added to scoring by "determining the growth in a number of independent peers that include the document". Google will basically be monitoring sites that are not in your subject category and how they link to you (I assumed they meant non-related subject peers, but they actually mean off-topic sites; see - Linkage of Independent Peers, below). Document Topics 50. "Document topics" may be included in the scoring, this includes using "topic extraction". I assume this is determined by Google's text mining and analysis of the actual words on the page. Identifying Relevant Documents 51. Relevance of documents to a given search query may be part of the scoring system. This is just Google's way of saying that documents about "pink dogs" will be part of those analyzed by the ranking algorithm when a user queries "pink dogs". Plurality of History Data 52. Google might also use "means for obtaining a plurality of types of history data associated with the document" to score sites/pages. This just means that they will use a methodology that groups all of the bits of historical information into the rankings together to determine scoring. History Component 53. "History data" can be measured by Google and used in the rankings. I'm not sure to what they're referring here - the entire quote is; "A system for scoring a document, comprising: a history component configured to obtain one or more types of history data associated with a document; and a ranking component configured to: generate a score for the document based, at least in part, on the one or more types of history data." Ranking of Linked Documents 54. Google may be measuring the documents you link to and scoring based "on a decaying function of the age of the linkage data". So, fresher links vs. 
stale links will be taken into account (although whether there is a positive or negative effect associated with this is unknown). 55. For #54, Google says the "linkage data includes at least one link." So, they won't be measuring linkage data for pages with no links. 56. For #54, Google may include the anchor text in the linkage data. 57. For #54, Google says the "linkage data includes a rank based... on links and anchor text provided by one or more linking documents." Google is simply saying that linkage data includes the anchor text and other info about the links coming to a page. 58. Google can use the "longevity of the linkage data" and determine from that an adjustment of the rankings based on the changes, stability & age of the linkage data. They explain below how they score this. 59. Google will be "penalizing the ranking if the longevity indicates a short life for the linkage data and boosting the ranking if the longevity indicates a long life for the linkage data." Google is, in effect, explaining a little of what we
call "sandboxing" - they're saying that the older a link is, the more value it has, while new links have relatively lower value. This doesn't completely explain the effect, as many sites rank well quickly, etc. - but, it is an explanation for the phenomenon. 60. Google can adjust scoring by penalizing for linking documents they consider "stale" over a period of time and boost scoring if the content is frequently updated. So, it's better to be linked to on a page that frequently updates its content. 61. "Link churn" may be measured (explained in #62) and scoring adjusted based on this. 62. "Link churn" is "computed as a function of an extent to which one or more links provided by the document changes over time". Once again, Google is referring to the changes in where links point, their anchor text, etc. on a given page. More changes means more "link churn". 63. "Link churn" might create a penalization if it is above a certain threshold. So, if your links are changing all the time, the link will not be as valuable. This would shut down the methods used by the popular "Traffic Power/1p" spam company. -------------------------------------------------------------------------------Patent Description: Background of the Invention: This is designed for IR (Information Retrieval) Systems and specifically to the methods used to generate search results. Description of Related Art: This information is largely irrelevant, but one important quote is: "There are several factors that may affect the quality of the results generated by a search engine. For example, some web site producers use spamming techniques to artificially inflate their rank. Also, "stale" documents (i.e., those documents that have not been updated for a period of time and, thus, contain stale data) may be ranked higher than "fresher" documents (i.e., those documents that have been more recently updated and, thus, contain more recent data). 
In some particular contexts, the higher ranking stale documents degrade the search results. Thus, there remains a need to improve the quality of results generated by search engines." Summary of the Invention: Google says "history data associated with the documents" may be used to score them in the search results. The invention provides a "method for scoring a document" and it "may include determining the age of linkage data associated with a linked document and ranking the linked document based on a decaying function of the age of the linkage data." Brief Description of the Drawings: The drawings are all exceptionally simple charts showing the process for examination. A PDF with the charts at the bottom is available at http://files.bighosting.net/tr19070.pdf Exemplary History Data: This is the canonical and expository section of the patent description. It contains examples and explanations of many of the most important parts of this
study, including detailed descriptions for many of the 63 components. Document Inception Date Google notes that the "date" label is used broadly and may include many time & date measurements. Google describes several of the techniques used to obtain an "inception date" and mentions that some techniques are "biased" because they can be influenced by a 3rd party. The first technique used is when Google learns of or indexes the document - either by finding a link to the site/page, or following it. A second technique uses the registration date of the URL or the first time it was referenced in a "news article, newsgroup, mailing list" or combination of these types of documents. The patent mentions that Google assumes that a "fairly recent inception date will not have a significant number of links from other documents." However, they say that the document's rankings can be adjusted accordingly based on how well it is doing in terms of links with consideration for its age. Google is also wary of spam, they use the following example (which is already being quoted around the web): "Consider the example of a document with an inception date of yesterday that is referenced by 10 back links. This document may be scored higher by (Google) than a document with an inception date of 10 years ago that is referenced by 100 back links because the rate of link growth for the former is relatively higher than the latter. While a spiky rate of growth in the number of back links may be a factor used by (Google) to score documents, it may also signal an attempt to spam search engine 125. Accordingly, in this situation, (Google) may actually lower the score of a document(s) to reduce the effect of spamming." Google might also use the date of inception as a method for measuring the "rate at which links to the document are created". They say that "this rate can then be used to score the document, for example, giving more weight to documents to which links are generated more often." 
The patent goes so far as to provide a formula for link-based score modification:

H = L / log(F + 2)

H = history-adjusted link score
L = link score given to the document, which can be derived using any known link scoring technique that assigns a score to a document based on links to/from the document
F = elapsed time measured from the inception date associated with the document (or a window within this period)

The result of this formula (taking the logarithm as base 10) would be that on the day of inception, when F = 0, L will be divided by 0.301 - the equivalent of multiplying L by 3.32. After 10 days (or any other unit of time), the formula will divide L by 1.079, making H smaller as time goes on.
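The figures quoted for this formula can be checked with a short sketch. It assumes the logarithm is base 10, which is what the values 0.301 and 1.079 imply; the patent text quoted here does not name a base. The function name is also just an illustrative choice.

```python
import math


def history_adjusted_link_score(link_score, elapsed):
    """H = L / log(F + 2), with the log taken as base 10 (an assumption)."""
    if elapsed < 0:
        raise ValueError("elapsed time cannot be negative")
    return link_score / math.log10(elapsed + 2)


# The same link score L = 1.0 at increasing elapsed time F.
for elapsed in (0, 10, 100, 1000):
    divisor = math.log10(elapsed + 2)
    score = history_adjusted_link_score(1.0, elapsed)
    print(f"F={elapsed:5d}  divisor={divisor:.3f}  H={score:.3f}")
```

Because the divisor grows logarithmically, the boost for a brand-new document fades quickly at first and then levels off: the drop between day 0 and day 10 is much larger than the drop between day 100 and day 1000.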
The patent then suggests that "for some queries, older documents may be more favorable than newer ones" and that, as a result, Google may "adjust the score of a document based on the difference (in age) from the average age of the result set". This would push certain pages up or down in the rankings depending on their age and the age of their competition. Content Updates/Changes
Protocol, rather than a cleansing agent, so pages on those subjects rose in the results. The number of search results for a particular term is measured to check for "hot topics" or "breaking news" to help Google follow or become aware of trends. An example might be the recent Tsunami in East Asia, where thousands of pages popped up overnight on the subject. Google also measures search queries whose answers or relevance changes over time. They use the example of "World Series Champion" which would be different after each Baseball season. "Staleness" can be a deciding factor in the rankings. Google will use user clicks and traffic to decide if "stale" results are relevant for a particular query or not and rank accordingly. Google says it measures "staleness" by: Creation Date Anchor Growth Traffic Content Changes Forward/Back link growth Link Based Criteria Google can measure various linking based factors including: The dates new links appear to a site/page Dates that link or pages linking to a site/page disappeared The time-varying behavior of links to a page and any possible "trends" that are indicated by this, i.e. is the site gaining links overall or losing them? A downward trend might indicate "staleness", while an upward trend would indicate "freshness". Google may check the number of new links to a document over a given time period compared to the new links the document has received since it was first found. They'll also use the "oldest age of the most recent y% of links compared to the age of the first link found." 
Google gives an example in the patent of two websites that were both found 100 days ago: Site #1 - 10% of the links were found less than 10 days ago Site #2 - 0% of the links were found less than 10 days ago This data might be used to " predict if a particular distribution signifies a particular type of site (e.g., a site that is no longer updated, increasing or decreasing in popularity, superceded, etc.)" Freshness weights assigned to a link can also be used to rank sites/pages. Several factors can influence link freshness: Date of appearance Date of change of anchor text Date of change of the page the link is on Date of appearance of page the link is on Google says they theorize that a page that is updated (significantly) while the link remains the same is a good indicator of a "relevant and good" link. Other weights for links include: How trusted the links are (they specifically mention government documents as being assigned higher trust) How authoritative the websites and pages linking to the page are Freshness of the page/site - they mention the Yahoo! homepage as one where links frequently appear and disappear. The "sum of the weight of the links" pointing to a page/site may be used to raise or lower the scoring in the rankings. Google will measure the freshness of the page based on the freshness of the links to it and the freshness of the pages which the links are on. Age distribution over time will also be measured, i.e. a site/page will be compared against all of its links over time and when it received them. Google may use link date appearance to "detect spam", "where owners of documents
or their colleagues create links to their own document for the purpose of boosting the score assigned by a search engine". Google says that legitimate sites/pages "attract back links slowly" and that a "large spike in the quantity of back links" may signal either a "topical phenomenon" or "attempts to spam a search engine." Google gives the example of the CDC website after the outbreak of SARS as an example of a "topical phenomenon". Google gives 3 examples of link spam techniques - "exchanging links", "purchasing links" or "gaining links from documents without editorial discretion on making links". Google also gives examples of "documents that give links without editorial discretion" - including guest books, referrer logs and "free for all pages that let anyone add a link to a document." A decrease over time in the number of links a document has can be used to indicate irrelevance, and Google notes that it will discount the links from these "stale" documents. The "dynamic-ness" of links will also be measured and scored, based on how consistently links are given to a particular page. They use the example of "featured link" of the day and note that they'll use a page score based on the pages that link to the page, "for all versions of the documents within a window of time." Anchor Text Google can use anchor text measurements to determine ranking scores: Anchor text changes over time might be used to indicate "an update or change of focus" on a site/page. Anchor text that is no longer relevant or on-topic with the site/page it links to may be tracked and discounted if necessary. Large document changes will result in Google checking the anchor text to see if the subject matter is still the same as the anchor text. Freshness of anchor text can be calculated. 
It can be determined by: Date of appearance/change of the anchor text Date of appearance/change of the linked to page Date of appearance/change of the page with the link on it Google notes that the date of appearance/change of the page with the link on it makes the link and anchor text more "relevant and good" Traffic Google can measure traffic levels to a page/site as part of their ranking scores. A "large reduction in traffic may indicate that a document may be stale" Google may compare the average traffic for a page/site over the past "j days" (as an example j=30) to the average traffic over the last year to see if the page/site is still as relevant for the query. Google might also use seasonality to help determine if a particular site is more/less relevant for a query during specific times of the year. Google is going to measure "advertising traffic" for websites: "The extent to and rate at which advertisements are presented or updated by a given document over time" The "quality of the advertisers". They note that referrers like Amazon.com will be given more trust and weight than a "pornographic site's" advertisements. The "click-through rate" of the traffic referrals from the pages the ads are on. User Behavior Google may be measuring "aggregate user behavior". This can include: The "number of times that a document is selected from a set of search results" The "amount of time one or more users spend accessing the document" The relative "amount of time" compared to an average that users spend on a particular site/page Google uses an example of a swimming schedule page that users typically spent 30
seconds accessing, but have recently spent "a few seconds" accessing. Google says this can be an indication for them that the page "contains an outdated swimming schedule" and they will push down its rank. Domain-Related Information Information associated with a domain can be used by Google to score sites in the rankings. They mention specific types of " information relating to how a document is hosted within a computer network (e.g., the Internet, an intranet, etc.)" including: Doorway and "throwaway" domains - Google says they will use "information regarding the legitimacy of the domains" Valuable domains, according to Google, "are often paid for several years in advance", while the throwaway domains "rarely are used for more than a year." The DNS records will also be checked to determine legitimacy: Who registered the domain Admin & technical addresses and contacts Address of name servers Stability of data (and host company) vs. high number of changes Google claims they will use "a list of known-bad contact information, name servers, and/or IP addresses" to predict whether a spammer is running the domain. Google will also use information regarding a specific name server in similar ways "A "good" name server may have a mix of different domains from different registrars and have a history of hosting those domains, while a "bad" name server might host mainly pornography or doorway domains, domains with commercial words (a common indicator of spam), or primarily bulk domains from a single registrar, or might be brand new" Ranking History Google can measure the history of where a site ranked over time and data associated with this. 
Some specifics include:
- A site that "jumps in rankings across many queries might be a topical document or it could signal an attempt to spam search engine"
- The "quantity or rate that a document moves in rankings over a period of time might be used to influence future scores"
- Sites can be weighted according to their position in the results, where the top result receives a higher score and the lower sites receive progressively lower scores. Google uses the equation [((N+1)-SLOT)/N], where N = the number of search results measured and SLOT = the ranking position of the measured site. In this equation, the 1st result receives a score of 1.0 and the last result receives a score of 1/N, which approaches 0 as N grows.
- Google could check "commercial queries" specifically, and documents that gained X% in the rankings "may be flagged or the percentage growth in ranking may be used" to determine if the "likelihood of spam is higher".

Google may also monitor:
- "The rate at which (a site/page) is selected as a search result over time"
- Seasonality - fluctuations based on the time of month or year
- Burstiness - sudden gains or losses in clicks
- Other patterns in CTR

The rate of change in scores can be measured over time to see if a search term is getting more/less competitive and additional attention is needed.

Google "may monitor the ranks of documents over time to detect sudden spikes in the ranks". This could indicate, according to the patent, "either a topical phenomenon (e.g., a hot topic) or an attempt to spam search engine".

Google may use preventative measures against spam by:
- "Employing hysteresis to allow a rank to grow at a certain rate" - hysteresis in this instance probably means a pull that holds the growth rate down. The term has dozens of distinct definitions.
- Limiting the "maximum threshold of growth over a predefined window of time" for a given site/page.
- Google will also "consider mentions of the document in news articles, discussion groups, etc. on the theory that spam documents will not be mentioned"
- Certain types of sites/pages (Google specifically mentions "government documents, web directories (e.g., Yahoo), and documents that have shown a relatively steady and high rank over time") may be immune to the "spike" tracking and penalization.

Google may also "consider significant drops in ranks of documents as an indication that these documents are "out of favor" or outdated".

User Maintained/Generated Data
Google wants to measure many different types of aggregate data that users keep on their computers about their web visits and experiences, including:
- Bookmarks & Favorites lists in the browser. They want to obtain this data either:
  - via a "browse assistant" - like the toolbar or desktop search, or
  - directly via the browser itself - I predict they are developing their own Google Browser.
- Google will use this data over time to predict how valuable a particular site or page is.
- Google also wants to document additions and removals from favorites & bookmarks over time to help predict the value of a site/page.
- Google will also measure how often users access the site/page from their browser to see if it is still relevant, or just a leftover ("outdated" or "unpopular").
- The "temp or cache files associated with users could be monitored" by Google to identify their visiting patterns on the web and determine whether there is "an upward or downward trend in interest" in a given site/page.

Unique Word, Bigrams, Phrases in Anchor Text
Google intends to measure the profile of how anchor text appears over time to a particular site/page to watch for spam.
They note that "naturally developed web graphs typically involve independent decisions. Synthetically generated web graphs, which are usually indicative of an intent to spam, are based on coordinated decisions". The difference in patterns can be measured and put to use to block spam. Google notes that the "spikiness" of "anchor words/bigrams/phrases" is a prime measurement. They note that spam typically shows "the addition of a large number of identical anchors from many documents".

Linkage of Independent Peers
Google can also use link data from "independent peers (e.g., unrelated documents)" to check for spam. They say that a "sudden growth in the number of independent peers... with a large number of links... may indicate a potentially synthetic web graph, which is an indicator of an attempt to spam." Google notes that this "indication may be strengthened if the growth corresponds to anchor text that is unusually coherent or discordant" and that they can discount the value of these links either by a "fixed amount" or a "multiplicative factor" - this would give an additional penalty just for having these links.

Document Topics
Topic extraction can be performed by Google through the following methods:
- Categorization
- URL analysis
- Content analysis
- Clustering
- Summarization
- A set of unique low frequency words

The goal is to "monitor the topic(s) of a document over time and use this information for scoring purposes." Google notes that "a spike in the number of topics could indicate spam" or that significant document topic changes may indicate that the website "has changed owners and previous document indicators, such as score, anchor text, etc., are no longer reliable." Google says that "if one or more of these situations are detected, (they) may reduce the relative score of such documents and/or the links, anchor text, or other data" from the website.
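
To make the position-weighting equation quoted under Ranking History, [((N+1)-SLOT)/N], more concrete, here is a minimal Python sketch. The function names position_score and mean_position_score are hypothetical, and averaging the score over several observed ranking positions is my own assumption about how such scores might be combined over time, not something the patent spells out.

```python
from typing import Sequence


def position_score(slot: int, n: int) -> float:
    """Score a result at 1-based ranking position `slot` out of `n` results.

    Implements ((N + 1) - SLOT) / N as quoted in the text.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 1 <= slot <= n:
        raise ValueError("slot must be between 1 and n")
    return ((n + 1) - slot) / n


def mean_position_score(slots: Sequence[int], n: int) -> float:
    """Average the position score over several observed positions.

    Assumption for illustration only: a simple unweighted mean.
    """
    if not slots:
        raise ValueError("slots must not be empty")
    return sum(position_score(s, n) for s in slots) / len(slots)


if __name__ == "__main__":
    n = 10
    for slot in (1, 5, 10):
        print(slot, position_score(slot, n))
```

With N = 10, the top result scores 1.0 and the 10th result scores 0.1, so the lowest score is 1/N rather than exactly 0.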