
This report has been prepared to help SEOs understand the concepts and practical applications contained in Google's US Patent Application #20050071741 - Information Retrieval Based on Historical Data. My own advice and interpretation are offered throughout this paper - please conduct your own research before acting on the recommendations.

Sections in this Report:


I. Overview of the 5 Most Critical Concepts from this Paper
Google's Concept of "Document Inception"
How Changing Content can Affect Rankings
Spam Detection & Punishment
What Google is Attempting to Measure
The Impact of this Patent
II. Analysis and Interpretation of 63 Patent Components
History Data (1)
Inception Date (4)
Frequency of Document Changes over Time (6)
Amount of Changes over Time (3)
Click-Through Rate Data (2)
Document Association to Search Terms (1)
Queries that Remain the Same but have New Meanings over Time (1)
Staleness of Documents (3)
Link Behavior (4)
Freshness of Links (4)
Anchor Text Changes over Time (1)
Content Changes in a Document compared to Linking Anchor Text (1)
Freshness of Anchor Text (2)
Traffic Characteristics of Site/Page (2)
User Behavior (2)
Domain Related Information (3)
Prior Rankings Data (4)
User Maintained Data (3)
Growth Profiles of Anchor Text (1)
Linkage of Independent Peers (1)
Document Topics (1)
Identifying Relevant Documents (1)
Plurality of History Data (1)
History Component (1)
Ranking of Linked Documents (10)
III. Documentation on Description Elements
Document Inception Date
Content Updates/Changes
Query Analysis
Link-Based Criteria
Anchor Text
Traffic
User Behavior
Domain Related Information
Ranking History
User Maintained/Generated Data
Unique Words, Bigrams, Phrases in Anchor Text
Linkage of Independent Peers
Document Topics
IV. List of Additional Coverage & Resources

Overview of the 5 Most Critical Concepts from this Paper


These 5 concepts are what I believe to be the most ground-breaking and important
for search engine optimization professionals to understand in order to best
conduct their work.

1. Google's Concept of "Document Inception"


The date of "document inception", which can refer to either a website as a whole
or a single page is used in many different areas by Google. This data can come
from the registration info, the date Google first found a link to the site/page or
the site/page itself. Google will be using this data to rank documents and
establish credibility and relevance.

2. How Changing Content can Affect Rankings


Changing content over time has a huge impact on Google's measures according to this patent. They use changes to determine the "freshness" or "staleness" of websites and pages, and how that data impacts the value of the links on the page as well as its rankings. They'll also measure large, "real" content changes vs. superfluous changes and rank based on that data.

Google also says that for some types of queries, particular results are more
valuable - stale results may be desirable for information that doesn't need
updating, fresh content is good for results that require it, seasonal results may
pop up or down in the rankings based on the time of month/year, etc.

3. Spam Detection & Punishment


Google is employing many new systems of spam detection and prevention according to
the patent. These include:

Watching for sites that rise in the rankings too quickly
Watching for registration information, IP addresses, name servers, hosts, etc. that are on their "bad list"
Growth of off-topic links
Speed of link gain
Percentage of similar anchor text
Topic/Subject shifts or additions
4. What Google is Attempting to Measure
Google wants to measure or is attempting to actively measure each of the
following:

Domain information:
  Registration date
  Length of renewal (10 years, 5 years, 1 year, etc.)
  Addresses and Names of admin & technical contacts
  DNS Records
  Address of Name Servers
  Hosting Location & Company
  Stability of this data
Information on User Behavior Online:
  CTR (Click-Through Rate) of individual results in the SERPs
  Length of time spent on a given site/page
Data contained on your computer:
  Favorites/Bookmarks List
  Cache & Temp Files
  Frequency of visits to particular sites/pages (history)
5. The Impact of this Patent
I believe that this patent will help to verify most of the theories surrounding Google's rankings. There has been speculation at the major SEO forums over the past 18-24 months on nearly every subject covered in this patent, and this report will serve as verification.

Although it is long, I urge every SEO/Webmaster to read this page completely. I have attempted to make the information legible and readable, and only pulled out parts that are important to the active practice of SEO (which was almost 2/3 of the document, surprisingly). If you have any questions or corrections on this summary, please send me an email.

--------------------------------------------------------------------------------

Analysis & Interpretation of the 63 Patent Components


History Data
1. Documents may be scored in Google's rankings based on "one or more types of
history data".

Inception Date
2. The "inception date" read - registration date - may be considered as a scoring
factor (I assume that older will be considered better, but this is not spelled
out).

3. Google may determine how old each of the pages on a given website is and then
determine the average age of pages on the website as a whole. The difference
between a specific page's age and the average age of all documents on the site
will be used in the ranking score.

4. The score for a website may include the amount of time since "document
inception" - i.e. how old the website is.

5. One methodology of discovering site age might include when Google first "discovered" (read: spidered) the site, when Google first found a link to the site, and when the site contained a "predetermined number of pages". I interpret this to mean that Google has some kind of threshold for site size (number of pages) that, when reached, triggers a scoring effect (probably positive).

Frequency of Document Changes over Time


6. Google's scoring will (according to the patent) be based on "determining a
frequency at which the content changes over time".

7. The "frequency at which the content changes" will be determined by the average
time between changes, the number of changes over a particular time period, and the
rate of change of one time period vs. the rate of change for another time period.
So, if you are updating your website every day, then switch to updating once a
week, your scoring in the historical measurements at Google will shift.
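
As a rough illustration (mine, not the patent's), the change-frequency measures in #7 might be computed like this, from the dates on which Google observed a changed version of a page:

    # Hypothetical sketch of component #7; the measures are named in the patent,
    # but the code and the choice of units (days) are my own assumptions.
    from datetime import date

    def average_days_between_changes(change_dates):
        gaps = [(b - a).days for a, b in zip(change_dates, change_dates[1:])]
        return sum(gaps) / len(gaps)

    def changes_in_period(change_dates, start, end):
        return sum(1 for d in change_dates if start <= d <= end)

    changes = [date(2005, 1, 1), date(2005, 1, 8), date(2005, 1, 15), date(2005, 2, 20)]
    print(average_days_between_changes(changes))  # average time between updates
    # Rate of change in one period vs. another (e.g., weekly updates slowing to monthly):
    print(changes_in_period(changes, date(2005, 1, 1), date(2005, 1, 31)),
          changes_in_period(changes, date(2005, 2, 1), date(2005, 2, 28)))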

8. Scoring will also include how much of the site has changed over a given time
period (new pages, changes, etc.).

9. The scoring based on changes (described in #8) will be determined by the number
of new pages within a time period, the ratio of new pages vs. old pages and the
total "percentage of the content of the document that has changed during a timed
period."

10. The scoring of changes (from #8) will be based on the "perceived importance of
the portions" that have been changed. The score will also take into account the
changes as compared to the weighting(s) of each of the different pages of the site
- i.e. if important pages change, it will have a different impact than if
unimportant pages changed. My guess is that importance is mostly determined by
links (both internal and external) that point to a given page. So if your contact
page changes, it's not a big deal, but if your home page changes, that's a bigger
deal.
11. The scoring for a "plurality of documents" - many pages in a given website -
includes determining the last date of change for each page, determining the
average date of change, and scoring the documents based on, "at least in part",
the difference between a specific page's change and the average document's change.
So, if one page had new information added, it would be scored differently than the
other pages, while if all the pages changed together (maybe a new date, or new
link or copyright in the footer, etc.), they would all be equal (since their date
of change compared to the average is the same).

Amount of Changes over Time


12. Google's score may also include a measure of the amount of content which
changes over time on the given website.

13. The "amount of content changes" from #11 will be determined by the ratio of
new pages vs. the total number of pages on the site, and the percentage of content
change over a given time period.

14. The "changes over a given time" from #12 will be scored based on "weighting
different portions of the content differently based on a perceived importance" -
once again, I read this as internal and external links to a page - the more links,
the more "perceived importance".

Click-Through Rate Data


15. The "history data" from #1 could include information on "how often the
document is selected when the document is included in a set of search results".
This is literally tracking clickthroughs and rewarding those sites with higher CTR
- just like AdSense does. Google will be scoring based on the "extent to which the
document is selected over time... when included in a set of search results". We
always assumed this to be true, but this is the first hard evidence I've seen
directly from the horse's mouth.

16. Google may assign a "higher score" when the document is selected more often.
No-brainer.
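
For what it's worth, the measurement in #15 and #16 boils down to a simple click-through rate; here is a trivial sketch (the variable names and numbers are mine, not Google's):

    # Hypothetical sketch of component #15: selection rate of a result in the SERPs.
    def click_through_rate(impressions, clicks):
        """Fraction of the times a result was shown that it was actually clicked."""
        return clicks / impressions if impressions else 0.0

    print(click_through_rate(impressions=1000, clicks=87))  # 0.087 - a higher rate would score higher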

Document Association to Search Terms


17. Google might be scoring based on "determining whether a document (that has
been showing up in the search results) is associated with the search terms".

Queries that Remain the Same but have New Meanings over Time
18. Google (according to the patent) calculates whether the "information relating
to queries" remains the same or changes and scores documents based on this. For
example, prior to September 11, the phrase 9-11 would not have been related to
terrorism; afterwards, it would be. Google will score documents based on the
changes in the results for a given query to keep up with the times.

Staleness of Documents
19. The "staleness of documents" might be calculated as part of Google's scoring.

20. Google may also determine whether "stale documents" are preferable for certain
types of queries (those that don't change over time, or for which a specific,
single answer is what's necessary).

21. The "favorability" of stale documents may be determined by how often they are
clicked on in the search results (over other documents). I relate this to a
Wikipedia article on the nature of volcanoes - it doesn't need too much updating
and will be a good relevant source for a long time for the query - "nature of
volcanoes".
Link Behavior
22. History data scores might also consider the "behavior of links over time".

23. The appearance and disappearance of links figure into the scoring for link
behavior (from #22).

24. The appearance/disappearance of links are dated by Google and used in the
scoring.

25. The link appearances/disappearances are monitored and Google measures "how
many links... appear or disappear during a time period, and whether there is a
trend" toward more links or fewer links. The temporal (time-based) nature of
groups of links will be scored by Google.

Freshness of Links
26. Google may use the "freshness of links" and assign weights to links based on
freshness.

27. The "freshness" of a link (from #26) is calculated by the date of appearance
of that link, the date of any changes in the link or anchor text, the date which
the page and site that the link is from appeared and the date of the links to that
linking page. So, if you have a new blog entry that points to a new site, the
freshness will be super-fresh, since the page is new, the link to the page is new,
the blog page that links to it is new, and the link to your blog entry on your own
site is new (that's a lot of new, hence it's super-fresh).
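
Here is a small sketch of the link-freshness inputs from #27. The patent lists the dates but not how they are combined; taking the most recent relevant date, as below, is purely my assumption:

    # Hypothetical sketch of component #27; the max() combination is my own guess.
    from datetime import date

    def link_freshness_date(link_appeared, link_changed, linking_page_appeared, links_to_linking_page):
        """Use the most recent relevant date as a crude freshness proxy."""
        return max(link_appeared, link_changed, linking_page_appeared, *links_to_linking_page)

    print(link_freshness_date(date(2005, 3, 1), date(2005, 3, 1),
                              date(2005, 2, 28), [date(2005, 3, 2)]))  # 2005-03-02 - "super-fresh"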

28. The weight of a link also takes into account how "trusted" the site is, how
authoritative the page with the link on it is and how "fresh" the page & site
containing the link are.

29. The scoring also takes into account the "age distribution associated with the
links based on the ages of the links". Google will take into account the age of
the links to your page, and the time periods over which you got the links, i.e.
lots of new links, a wide distribution over time, most links from a long time ago,
etc.

Anchor Text Changes over Time


30. Google may also calculate changes in anchor text over time and use this data
to score. My guess is that anchor text doesn't change very often, but they're
certainly free to measure it.

Content Changes in a Document compared to Linking Anchor Text


31. Google might also measure if the content of a document changes, but the anchor
text remains the same, or vice-versa. They're trying to protect against the anchor
text "bait and switch" that makes a document look relevant to the anchor text,
then replaces it with something else.

Freshness of Anchor Text


32. Freshness of anchor text can be considered.

33. Freshness of anchor text is calculated by "date of appearance", "date of change", and the dates of change and appearance of the page the link is on.

Traffic Characteristics of Site/Page


34. Traffic characteristics associated with a page/site may be taken into account
in scoring.
35. The traffic pattern will have associated analysis that might feed into
Google's score. So Google must be measuring traffic to a site/page and determining
if, over time, it increases, decreases, etc. - they're seeking trends on which to
base scoring.

User Behavior
36. User behavior regarding a particular page/site may figure into the scoring.

37. Google says that user behavior (from #36) is basically just the percentage of
the time users click on a site/page when it is listed in the search results pages,
along with the amount of time that users spend "accessing the document". I guess
we all need to keep up the amount of time people spend on our sites.

Domain Related Information


38. The scoring might also include the sites associated with a given site and the
"domain-related" information. This is defined in greater detail below.

39. Associated sites (from #38) are measured in terms of "legitimacy", which I
interpret to mean non-spam, different owner, etc. Google says, specifically
"scoring the document based... on whether the domain associated with the document
is legitimate."

40. The "expiration date of the domain", the "domain name server record" and the
"name server associated with the domain" are all parts of how Google will
establish the legitimacy of an "associated" site.

Prior Rankings Data


41. History data scores could also take into account "information relating to a
prior ranking". This means Google will be storing information about previous
rankings for a site and using them to base scores on.

42. Google may also calculate where in the previous rankings the site was and how
it moved around as pieces to figure into the scoring data.

43. In reference to #41, Google is using seasonality, "burstiness", and changes in scores over time as metrics to calculate the prior rankings scoring. So if a site is particularly relevant for "gifts for girlfriend" around Valentine's Day, but not as much for the same query at Christmas, Google will record this information and rank accordingly.

44. Google could also, with regard to #41, record "spikes in the rank" of
site/pages in the search results.

User Maintained Data


45. "User maintained data" may also be recorded and monitored for the rankings
scores.

46. "User maintained data" includes; favorites lists, bookmarks, temp files and
cache files of monitored users. I'm not sure how they could obtain this data
without installing "Google Spyware" - perhaps in the form of desktop search or the
Google toolbar.

47. Monitoring the rate at which a site/page "is added to or removed from user
generated data" may be used in the scoring.

Growth Profiles of Anchor Text


48. Scores might include "growth profiles of anchor text" - Google could monitor
the use of anchor text in large groups and where/when they point to different
sites & pages.

Linkage of Independent Peers


49. Information "relating to linkage of independent peers" might be added to
scoring by "determining the growth in a number of independent peers that include
the document". Google will basically be monitoring sites that are not in your
subject category and how they link to you (I assumed they meant non-related
subject peers, but they actually mean off-topic sites; see Linkage of
Independent Peers, below).

Document Topics
50. "Document topics" may be included in the scoring, this includes using "topic
extraction". I assume this is determined by Google's text mining and analysis of
the actual words on the page.

Identifying Relevant Documents


51. Relevance of documents to a given search query may be part of the scoring
system. This is just Google's way of saying that documents about "pink dogs" will
be part of those analyzed by the ranking algorithm when a user queries "pink
dogs".

Plurality of History Data


52. Google might also use "means for obtaining a plurality of types of history
data associated with the document" to score sites/pages. This just means that they
will use a methodology that groups all of the bits of historical information into
the rankings together to determine scoring.

History Component
53. "History data" can be measured by Google and used in the rankings. I'm not
sure to what they're referring here - the entire quote is: "A system for scoring a
document, comprising: a history component configured to obtain one or more types
of history data associated with a document; and a ranking component configured to:
generate a score for the document based, at least in part, on the one or more
types of history data."

Ranking of Linked Documents


54. Google may be measuring the documents you link to and scoring based "on a
decaying function of the age of the linkage data". So, fresher links vs. stale
links will be taken into account (although whether there is a positive or negative
effect associated with this is unknown).

55. For #54, Google says the "linkage data includes at least one link." So, they
won't be measuring linkage data for pages with no links.

56. For #54, Google may include the anchor text in the linkage data.

57. For #54, Google says the "linkage data includes a rank based... on links and
anchor text provided by one or more linking documents." Google is simply saying that
linkage data includes the anchor text and other info about the links coming to a
page.

58. Google can use the "longevity of the linkage data" and determine from that an
adjustment of the rankings based on the changes, stability & age of the linkage
data. They explain below how they score this.

59. Google will be "penalizing the ranking if the longevity indicates a short life
for the linkage data and boosting the ranking if the longevity indicates a long
life for the linkage data." Google is, in effect, explaining a little of what we
call "sandboxing" - they're saying that the older a link is, the more value it
has, while new links have relatively lower value. This doesn't completely explain
the effect, as many sites rank well quickly, etc. - but it is an explanation for
the phenomenon.

60. Google can adjust scoring by penalizing for linking documents they consider
"stale" over a period of time and boost scoring if the content is frequently
updated. So, it's better to be linked to on a page that frequently updates its
content.

61. "Link churn" may be measured (explained in #62) and scoring adjusted based on
this.

62. "Link churn" is "computed as a function of an extent to which one or more


links provided by the document changes over time". Once again, Google is referring
to the changes in where links point, their anchor text, etc. on a given page. More
changes means more "link churn".
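
A minimal sketch of how "link churn" might be computed between two crawls of the same page (the data structure and the 40% threshold below are my own assumptions, not the patent's):

    # Hypothetical sketch of components #61-62: fraction of a page's links that changed.
    def link_churn(old_links, new_links):
        """old_links/new_links map a link's position or anchor to its target URL."""
        changed = sum(1 for key, target in old_links.items() if new_links.get(key) != target)
        return changed / len(old_links) if old_links else 0.0

    before = {"nav-1": "http://example.com/a", "body-1": "http://example.com/b"}
    after = {"nav-1": "http://example.com/a", "body-1": "http://example.com/c"}
    churn = link_churn(before, after)  # 0.5 - half the links changed
    print(churn > 0.4)                 # True: above a hypothetical penalty threshold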

63. "Link churn" might create a penalization if it is above a certain threshold.


So, if your links are changing all the time, the link will not be as valuable.
This would shut down the methods used by the popular "Traffic Power/1p" spam
company.

--------------------------------------------------------------------------------

Patent Description:
Background of the Invention:
This is designed for IR (Information Retrieval) Systems and specifically to the
methods used to generate search results.

Description of Related Art:


This information is largely irrelevant, but one important quote is: "There are
several factors that may affect the quality of the results generated by a search
engine. For example, some web site producers use spamming techniques to
artificially inflate their rank. Also, "stale" documents (i.e., those documents
that have not been updated for a period of time and, thus, contain stale data) may
be ranked higher than "fresher" documents (i.e., those documents that have been
more recently updated and, thus, contain more recent data). In some particular
contexts, the higher ranking stale documents degrade the search results. Thus,
there remains a need to improve the quality of results generated by search
engines."

Summary of the Invention:


Google says "history data associated with the documents" may be used to score them
in the search results. The invention provides a "method for scoring a document"
and it "may include determining the age of linkage data associated with a linked
document and ranking the linked document based on a decaying function of the age
of the linkage data."

Brief Description of the Drawings:


The drawings are all exceptionally simple charts showing the process for
examination. A PDF with the charts at the bottom is available at
http://files.bighosting.net/tr19070.pdf

Exemplary History Data:


This is the canonical and expository section of the patent description. It
contains examples and explanations of many of the most important parts of this
study, including detailed descriptions for many of the 63 components.

Document Inception Date


Google notes that the "date" label is used broadly and may include many time &
date measurements. Google describes several of the techniques used to obtain an
"inception date" and mentions that some techniques are "biased" because they can
be influenced by a 3rd party.

The first technique used is when Google learns of or indexes the document - either
by finding a link to the site/page, or following it. A second technique uses the
registration date of the URL or the first time it was referenced in a "news
article, newsgroup, mailing list" or combination of these types of documents.

The patent mentions that Google assumes that a "fairly recent inception date will
not have a significant number of links from other documents." However, they say
that the document's rankings can be adjusted accordingly based on how well it is
doing in terms of links with consideration for its age.

Google is also wary of spam; they use the following example (which is already
being quoted around the web):

"Consider the example of a document with an inception date of yesterday that is


referenced by 10 back links. This document may be scored higher by (Google) than a
document with an inception date of 10 years ago that is referenced by 100 back
links because the rate of link growth for the former is relatively higher than the
latter. While a spiky rate of growth in the number of back links may be a factor
used by (Google) to score documents, it may also signal an attempt to spam search
engine 125. Accordingly, in this situation, (Google) may actually lower the score
of a document(s) to reduce the effect of spamming."

Google might also use the date of inception as a method for measuring the "rate at
which links to the document are created". They say that "this rate can then be
used to score the document, for example, giving more weight to documents to which
links are generated more often."

The patent goes so far as to provide a formula for link-based score modification:

H = L / log(F + 2)

H = history-adjusted link score
L = link score given to the document, which can be derived using any known link scoring technique that assigns a score to a document based on links to/from the document
F = elapsed time measured from the inception date associated with the document (or a window within this period)

The result of this formula is that on the day of inception, L is divided by 0.301 - the equivalent of multiplying L by roughly 3.32. After 10 days (or 10 of whatever unit of time is used), the formula divides L by 1.079, making H smaller and smaller relative to L as time goes on.

The patent then suggests that "for some queries, older documents may be more
favorable than newer ones" and that, as a result, Google may "adjust the score of
a document based on the difference (in age) from the average age of the result
set". This would push certain pages up or down in the rankings depending on their
age and the age of their competition.

Content Updates/Changes
Google says that a "document's content changes over time may be used to
generate/alter a score associated with that document." They again offer a formula
for calculating this:

U=f(UF, UA)

f = a function, such as a sum or weighted sum
UF = update frequency score that represents how often a document (or page) is updated
UA = update amount score that represents how much the document (or page) has changed over time
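
As a simple illustration (the equal weights and the 0-to-1 scale are my own assumptions - the patent only says f may be "a sum or weighted sum"):

    # Hypothetical sketch of U = f(UF, UA) with f taken as a weighted sum.
    def update_score(update_frequency, update_amount, w_freq=0.5, w_amount=0.5):
        """UF and UA are assumed normalized to a 0..1 range for this illustration."""
        return w_freq * update_frequency + w_amount * update_amount

    print(update_score(update_frequency=0.8, update_amount=0.2))  # frequent but small edits
    print(update_score(update_frequency=0.2, update_amount=0.8))  # rare but substantial edits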

Google notes that UA can also be determined as:

The number of "new" or unique pages associated with a document over a period of
time
The ratio of the number of new or unique pages associated with a document over a
period of time versus the total number of pages associated with that document
The amount that the document is updated over one or more periods of time (e.g., n
% of a document's visible content may change over a period t (e.g., last m
months)), which might be an average value
The amount that the document (or page) has changed in one or more periods of time
(e.g., within the last x days)
UA could also weight different pieces of the content differently, helping to
eliminate changes that are cosmetic or insubstantial. Google mentions:

JavaScript
Comments
Advertisements
Navigational elements
Boilerplate material
Date/time tags
They also identify some important areas where content changes might necessitate
greater weight:

Title
Anchor text of forward links
Google also mentions the use of trend analysis in the changes of a site/page by
comparing an acceleration or deceleration of the rate of change (amount of new
content, etc.). Google notes that maintaining all of this information may be too
intensive for practical data storage and proposes measuring only large changes and
storing "term vectors" only or "a small portion" of a page "determined to be
important".

The patent notes that Google may, on occasion, prefer stale documents for certain types of queries. They may also create an average age of change and adjust the scoring for documents based on their relation to the average (if more stale or more fresh content is desired).

Query Analysis
This technique describes several phenomena that can influence rankings:

Clicks on a site/page in the SERPs can be used to rank it higher or lower - those
clicked more often, move higher in the rankings (so make sure your title &
description are good)
If a particular search term is increasingly associated with particular subjects,
the pages on those subjects would rank higher for that query. For example, the
meaning of the word "soap" was increasingly associated with Simple Object Access
Protocol, rather than a cleansing agent, so pages on those subjects rose in the
results.
The number of search results for a particular term is measured to check for "hot
topics" or "breaking news" to help Google follow or become aware of trends. An
example might be the recent Tsunami in East Asia, where thousands of pages popped
up overnight on the subject.
Google also measures search queries whose answers or relevance changes over time.
They use the example of "World Series Champion" which would be different after
each Baseball season.
"Staleness" can be a deciding factor in the rankings. Google will use user clicks
and traffic to decide if "stale" results are relevant for a particular query or
not and rank accordingly. Google says it measures "staleness" by:
Creation Date
Anchor Growth
Traffic
Content Changes
Forward/Back link growth
Link Based Criteria
Google can measure various linking based factors including:

The dates new links appear to a site/page
Dates that links or pages linking to a site/page disappeared
The time-varying behavior of links to a page and any possible "trends" that are
indicated by this, i.e. is the site gaining links overall or losing them? A
downward trend might indicate "staleness", while an upward trend would indicate
"freshness".
Google may check the number of new links to a document over a given time period
compared to the new links the document has received since it was first found.
They'll also use the "oldest age of the most recent y% of links compared to the
age of the first link found."
Google gives an example in the patent of two websites that were both found 100
days ago:
Site #1 - 10% of the links were found less than 10 days ago
Site #2 - 0% of the links were found less than 10 days ago
This data might be used to " predict if a particular distribution signifies a
particular type of site (e.g., a site that is no longer updated, increasing or
decreasing in popularity, superceded, etc.)"
Freshness weights assigned to a link can also be used to rank sites/pages. Several
factors can influence link freshness:
Date of appearance
Date of change of anchor text
Date of change of the page the link is on
Date of appearance of page the link is on
Google says they theorize that a page that is updated (significantly) while the
link remains the same is a good indicator of a "relevant and good" link.
Other weights for links include:
How trusted the links are (they specifically mention government documents as being
assigned higher trust)
How authoritative the websites and pages linking to the page are
Freshness of the page/site - they mention the Yahoo! homepage as one where links
frequently appear and disappear.
The "sum of the weight of the links" pointing to a page/site may be used to raise
or lower the scoring in the rankings. Google will measure the freshness of the
page based on the freshness of the links to it and the freshness of the pages
which the links are on.
Age distribution over time will also be measured, i.e. a site/page will be
compared against all of its links over time and when it received them.
Google may use link date appearance to "detect spam", "where owners of documents
or their colleagues create links to their own document for the purpose of boosting
the score assigned by a search engine". Google says that legitimate sites/pages
"attract back links slowly" and that a "large spike in the quantity of back links"
may signal either a "topical phenomenon" or "attempts to spam a search engine."
Google gives the example of the CDC website after the outbreak of SARS as an
example of a "topical phenomenon".
Google gives 3 examples of link spam techniques - "exchanging links", "purchasing
links" or "gaining links from documents without editorial discretion on making
links".
Google also gives examples of "documents that give links without editorial
discretion" - including guest books, referrer logs and "free for all pages that
let anyone add a link to a document."
A decrease over time in the number of links a document has can be used to indicate
irrelevance, and Google notes that it will discount the links from these "stale"
documents.
The "dynamic-ness" of links will also be measured and scored, based on how
consistently links are given to a particular page. They use the example of
"featured link" of the day and note that they'll use a page score based on the
pages that link to the page, "for all versions of the documents within a window of
time."
Anchor Text
Google can use anchor text measurements to determine ranking scores:

Anchor text changes over time might be used to indicate "an update or change of
focus" on a site/page.
Anchor text that is no longer relevant or on-topic with the site/page it links to
may be tracked and discounted if necessary. Large document changes will result in
Google checking the anchor text to see if the subject matter is still the same as
the anchor text.
Freshness of anchor text can be calculated. It can be determined by:
Date of appearance/change of the anchor text
Date of appearance/change of the linked to page
Date of appearance/change of the page with the link on it
Google notes that when the page with the link on it is updated while the link and anchor text remain the same, the link and anchor text are considered more "relevant and good"
Traffic
Google can measure traffic levels to a page/site as part of their ranking scores.

A "large reduction in traffic may indicate that a document may be stale"


Google may compare the average traffic for a page/site over the past "j days" (as
an example j=30) to the average traffic over the last year to see if the page/site
is still as relevant for the query.
Google might also use seasonality to help determine if a particular site is
more/less relevant for a query during specific times of the year.
Google is going to measure "advertising traffic" for websites:
"The extent to and rate at which advertisements are presented or updated by a
given document over time"
The "quality of the advertisers". They note that referrers like Amazon.com will be
given more trust and weight than a "pornographic site's" advertisements.
The "click-through rate" of the traffic referrals from the pages the ads are on.
User Behavior
Google may be measuring "aggregate user behavior". This can include:

The "number of times that a document is selected from a set of search results"
The "amount of time one or more users spend accessing the document"
The relative "amount of time" compared to an average that users spend on a
particular site/page
Google uses an example of a swimming schedule page that users typically spent 30
seconds accessing, but have recently spent "a few seconds" accessing.
Google says this can be an indication for them that the page "contains an outdated
swimming schedule" and they will push down its rank.
Domain-Related Information
Information associated with a domain can be used by Google to score sites in the
rankings. They mention specific types of " information relating to how a document
is hosted within a computer network (e.g., the Internet, an intranet, etc.)"
including:

Doorway and "throwaway" domains - Google says they will use "information regarding
the legitimacy of the domains"
Valuable domains, according to Google, "are often paid for several years in
advance", while the throwaway domains "rarely are used for more than a year."
The DNS records will also be checked to determine legitimacy:
Who registered the domain
Admin & technical addresses and contacts
Address of name servers
Stability of data (and host company) vs. high number of changes
Google claims they will use "a list of known-bad contact information, name
servers, and/or IP addresses" to predict whether a spammer is running the domain.
Google will also use information regarding a specific name server in similar ways
-
"A "good" name server may have a mix of different domains from different
registrars and have a history of hosting those domains, while a "bad" name server
might host mainly pornography or doorway domains, domains with commercial words (a
common indicator of spam), or primarily bulk domains from a single registrar, or
might be brand new"
Ranking History
Google can measure the history of where a site ranked over time and data
associated with this. Some specifics include:

A site that "jumps in rankings across many queries might be a topical document or
it could signal an attempt to spam search engine"
The "quantity or rate that a document moves in rankings over a period of time
might be used to influence future scores"
Sites can be weighted according to their position in the results, where the top result receives a higher score and the lower sites receive progressively lower scores. Google uses the equation:
[((N+1) - SLOT) / N]
where N = the number of search results measured and SLOT = the ranking position of the measured site.
In this equation, the 1st result receives a score of 1.0 and the last result receives a score close to 0 (a small worked example of this equation appears at the end of this list).
Google could check "commercial queries" specifically and documents that gained X%
in the rankings " may be flagged or the percentage growth in ranking may be used"
to determine if the "likelihood of spam is higher".
Google may also monitor:
"The rate at which (a site/page) is selected as a search result over time"
Seasonality - fluctuations based on the time of month or year
Burstiness - Sudden gains or losses in clicks
Other patterns in CTR
The rate of change in scores can be measured over time to see if a search term is
getting more/less competitive and additional attention is needed.
Google "may monitor the ranks of documents over time to detect sudden spikes in
the ranks". This could indicate, according to the patent, "either a topical
phenomenon (e.g., a hot topic) or an attempt to spam search engine"
Google may use preventative measures against spam by:
"Employing hysteresis to allow a rank to grow at a certain rate" - hysteresis in
this instance probably means a pull that results in the growth rate falling. The
terms has dozens of unique definitions.
Limiting the "maximum threshold of growth over a predefined window of time" for a
given site/page.
Google will also "consider mentions of the document in news articles, discussion
groups, etc. on the theory that spam documents will not be mentioned"
Certain types of sites/pages (Google specifically mentions "government documents,
web directories (e.g., Yahoo), and documents that have shown a relatively steady
and high rank over time") may be immune to the "spike" tracking and penalization
Google may also "consider significant drops in ranks of documents as an indication
that these documents are "out of favor" or outdated"
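
As promised above, here is a small worked example of the position-weighting equation (N and SLOT as defined earlier; the choice of N = 10 is just for illustration):

    # Worked example of [((N+1) - SLOT) / N] from the Ranking History section.
    def position_weight(slot, n):
        return ((n + 1) - slot) / n

    n = 10
    print([round(position_weight(slot, n), 2) for slot in range(1, n + 1)])
    # [1.0, 0.9, 0.8, ..., 0.2, 0.1] - the top slot gets 1.0, the last slot approaches 0
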
User Maintained/Generated Data
Google wants to measure many different types of aggregate data that users keep
their computers about their web visits and experiences, including:

Bookmarks & Favorites lists in the browser


They want to obtain this data either via a "browse assistant" - like the toolbar
or desktop search, or.
Directly via the browser itself - I predict they are developing their own Google
Browser.
Google will use this data over time to predict how valuable a particular site or
page is
Google also wants to document additions and removals from favorites & bookmarks
over time to help predict the value of a site/page
Google will also measure how often users access the site/page from their browser
to see if it is still relevant, or just a leftover ("outdated" or "unpopular")
The "temp or cache files associated with users could be monitored" by Google to
identify their visiting patterns on the web and determine whether there is "an
upward or downward trend in interest" in a given site/page.
Unique Words, Bigrams, Phrases in Anchor Text
Google intends to measure the profile of how anchor text appears over time to a
particular site/page to watch for spam. They note that "naturally developed web
graphs typically involve independent decisions. Synthetically generated web
graphs, which are usually indicative of an intent to spam, are based on
coordinated decisions". The difference in patterns can be measured and put to use
to block spam.

Google notes that the "spikiness" of "anchor words/bigrams/phrases" is a prime


measurement. They note that spam typical shows "the addition of a large number of
identical anchors from many documents".
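
A rough sketch of what detecting that kind of anchor-text "spike" could look like (the monthly buckets and the 5x threshold are entirely my own assumptions):

    # Hypothetical spike detector for identical anchor text gained per month.
    def anchor_spike(new_anchor_counts, spike_factor=5):
        """Flag months whose new identical-anchor links exceed spike_factor x the prior average."""
        months = sorted(new_anchor_counts)
        flagged = []
        for i, month in enumerate(months[1:], start=1):
            prior_average = sum(new_anchor_counts[m] for m in months[:i]) / i
            if prior_average and new_anchor_counts[month] > spike_factor * prior_average:
                flagged.append(month)
        return flagged

    print(anchor_spike({"2005-01": 3, "2005-02": 4, "2005-03": 120}))  # ['2005-03']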

Linkage of Independent Peers


Google can also use link data from "independent peers (e.g., unrelated documents)"
to check for spam. They say that a " sudden growth in the number of independent
peers... with a large number of links... may indicate a potentially synthetic web
graph, which is an indicator of an attempt to spam." Google notes that this
"indication may be strengthened if the growth corresponds to anchor text that is
unusually coherent or discordant" and that they can discount the value of these
links either by a "fixed amount" or a "multiplicative factor" - this would give an
additional penalty just for having these links.

Document Topics
Topic extraction can be performed by Google through the following methods:

Categorization
URL analysis
Content analysis
Clustering
Summarization
A set of unique low frequency words
The goal is to "monitor the topic(s) of a document over time and use this
information for scoring purposes."

Google notes that "a spike in the number of topics could indicate spam" or that
significant document topic changes may indicate that the website "has changed
owners and previous document indicators, such as score, anchor text, etc., are no
longer reliable." Google says that "if one or more of these situations are
detected, (they) may reduce the relative score of such documents and/or the links,
anchor text, or other data" from the website.