
Contents

I Theory

0.1 Why SEO is important
0.2 Different needs from SEO

What is a Search Engine?
  1.1 History of Search Engines
  1.2 Important Issues
    1.2.1 Performance
    1.2.2 Dynamic Data
    1.2.3 Scalability
    1.2.4 Spam and Manipulation
  1.3 How a Search Engine works
    1.3.1 Text acquisition
    1.3.2 Duplicate Content Detection
    1.3.3 Text transformation
    1.3.4 Index Creation
    1.3.5 User Interaction
    1.3.6 Ranking
    1.3.7 Evaluation

How good can a search engine be?
  2.1 NP Hard Problems
  2.2 AI Hard Problems
  2.3 Competitors

Ranking Factors
  3.1 On Page Factors
  3.2 Off Page Factors
  3.3 Google PageRank Notes
    3.3.1 Short Description
    3.3.2 Mathematical Description
    3.3.3 Interesting Notes on the Original Implementation of PageRank
    3.3.4 Optimal Linking Strategies
    3.3.5 Implementation to make computing PageRank faster
    3.3.6 HITS
    3.3.7 Is linking out a good thing?
    3.3.8 TrustRank / Bad Page Rank
    3.3.9 Improvements to Google's ranking algorithms

Detecting Spam and Manipulation
  4.1 Google Webmaster Guidelines
  4.2 Penalties
  4.3 Detecting Manipulation in Content
  4.4 Detecting Manipulation in Links
  4.5 Other Methods

II Practice

5 An Example Campaign
  5.1 Company Profile
  5.2 Goals
  5.3 Competitor Research
  5.4 Keyword Research
  5.5 Content Creation
  5.6 Website Check
  5.7 Link Building
  5.8 Analysis
Preface
This book aims to provide a general overview of how search engines rank documents in practice, the core of which will remain true even as search engines' algorithms are refined.

Part I

Theory
0.1 Why SEO is important

A higher search engine result will receive exponentially more clicks than a lower one. For example, if a search was repeated 1000 times by different users, this is typically how many clicks each result would get.

Position   Clicks
1          222
2          63
3          45
4          32
5          26
6          21
7          18
8          16
9          15
10         16

Source: Leaked AOL Click Data

Paid adverts have low click through rates, and get expensive quickly.

Search Engine   % Organic Click Through Rate   % Paid Result Click Through Rate
Google          72                             28
Yahoo           61                             39
MSN             71                             29
AOL             50                             50
Average         63                             37

88% of online search dollars are spent on paid results, even though 85% of searchers click on organic results.
Vanessa Fox, "Marketing in the Age of Google", May 3, 2010

0.2 Different needs from SEO

There are many different reasons you may wish to engage in optimising your search results, including:

Money - Sales for e-commerce sites are directly correlated with traffic.

Reputation - Some companies go to the extent of pushing negative articles down in the rankings.

Branding - Coming up top in the results pages is impressive to customers, and is particularly important in industries where reputation is extremely important.
What is a Search Engine?

1.1 History of Search Engines

The first mechanised information retrieval systems were built by the US military to analyse the mass of documents being captured from the Germans. Research was boosted when the UK and US governments funded work to reduce a perceived science gap with the USSR. By the time the internet was becoming commonplace in the early 1990s, information retrieval was at an advanced stage: complicated methods, primarily statistical, had been developed, and archives of thousands of documents could be searched in seconds.

Web search engines are a special case of information retrieval systems, applied to the massive collection of documents available on the internet. A typical search engine in the 1990s was split into two parts: a web spider that traverses the web following links and creating a local index of the pages, and traditional information retrieval methods that search the index for pages relevant to the user's query and order the pages by some ranking function. Many factors influence a person's decision about what is relevant, such as the current task, context and freshness.

In 1998 pages were primarily ranked by their textual content. Since this is entirely controlled by the owner of the page, results were easy to manipulate, and as the Internet became ever more commercialised the noise from spam in SERPs (search engine results pages) made search a frustrating activity. It was also hard to discern websites which more people would want to visit, for example a celebrity's official home page, from less wanted websites with similar content. For these reasons directory sites such as Yahoo were still popular, despite being out of date and making the user work out the relevance themselves.

The innovation of Google's founders Larry Page and Sergey Brin's PageRank (named after Larry Page), and of a similar algorithm also released in 1998 called Hyperlink-Induced Topic Search (HITS) by Jon Kleinberg, was to use the additional meta information in the link structure of the Internet. A more detailed description of PageRank will follow in section 3.3, but for now Google's own description will suffice.

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves important weigh more heavily and help to make other pages important.
Whilst it is impossible to know how Google has evolved its algorithms since the 1998 paper that launched PageRank, and how an efficient real world implementation differs from the theory, as Google themselves say, the PageRank algorithm "remains the heart of Google's software ... and continues to provide the basis for all of [their] web search tools". The search engines continue to evolve at a blistering pace, improving their ranking algorithms (Google says there are now over 200 ranking factors considered for each search; see http://googlewebmastercentral.blogspot.com/2008/10/good-times-with-inbound-links.html) and indexing a growing Internet more rapidly.

1.2 Important Issues

The building of a system as complex as a modern search engine is all about balancing different positive qualities. For example, you could effectively prevent low quality spam by paying humans to review every document on the web, but the cost would be immense. Or you could speed up your search engine by considering only every other document your spider encounters, but the relevance of results would suffer. Some things, such as getting a computer to analyse a document with the same quality as a human, are theoretically impossible today, but Google in particular is pushing boundaries and getting ever closer. Search engines have some particular considerations:

1.2.1 Performance

The response time to a user's query must be lightning fast.

1.2.2 Dynamic Data

Unlike a traditional information retrieval system in a library, the pages on the Internet are constantly changing.

1.2.3 Scalability

Search engines need to work with billions of users searching through trillions of documents, distributed across the Earth.

1.2.4 Spam and Manipulation

Actively engaging against other humans to maintain the relevancy of results is relatively unique to search engines. In a library system you may have an author that creates a long title packed with words their readers may be interested in, but that's about the worst of it. When designing your search engine you are in a constant battle with adversaries who will attempt to reverse engineer your algorithm to find the easiest ways to affect your results. A common term for this relationship is "Adversarial Information Retrieval". The relationship between the owner of a web site trying to rank high on a search engine and the search engine designer is an adversarial relationship in a zero-sum game: assuming the results were better before, every gain for the web site owner is a loss for the search engine designer.

Classifying where your efforts cross the line from helping a search engine be aware of your web site's content and popularity, which should improve a search engine's results, to ranking beyond your means, which decreases the quality of a search engine's results, can be somewhat tricky. The practicalities of what search engines consider to be spam, and as importantly what they can detect and fix, will be discussed later.

According to "Web Spam Taxonomy" (Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University, First International Workshop on Adversarial Information Retrieval on the Web, May 2005), approximately 10-15% of indexed content on the web is spam. What is considered spam and duplicate content varies, which makes this statistic hard to verify. There is a core of about 56 million pages that are highly interlinked at the center of the Internet, and these are less likely to be spam (see On Determining Communities in the Web by K Verbeurg). Documents further away (in link steps) from this core are more likely to be spam.

Deciding the quality of a document well (say whether it is a page written by an expert in the field, or generated by a computer program using natural language processing) is an AI Complete problem, that is, it won't be possible until we have artificial intelligence that can match that of a human. However, search engines hope to get spam under control by lessening the financial incentive of spam. This quote from a Microsoft Research paper (Detecting Spam Web Pages through Content Analysis by A. Ntoulas et al) expresses this nicely:

Effectively detecting web spam is essentially an arms race between search engines and site operators. It is almost certain that we will have to adapt our methods over time, to accommodate for new spam methods that the spammers use. It is our hope that our work will help the users enjoy a better search experience on the web. Victory does not require perfection, just a rate of detection that alters the economic balance for a would-be spammer. It is our hope that continued research on this front can make effective spam more expensive than genuine content.

Google developers for their part describe web spam as the following (see patent 7302645: Methods and systems for identifying manipulated articles), citing the detrimental impact it has upon users:

These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website, or the manipulated document automatically forwards the user on to a website unrelated to the user's query.

1.3 How a Search Engine works

A typical search engine can be split into two parts: indexing, where the Internet is transformed into an internal representation that can be efficiently searched, and the query process, where the index is searched for the user's query and documents are ranked and returned to the user in a list.

1.3.1 Text acquisition

A crawler starts at a seed site such as the DMOZ directory, then repeatedly follows links to find documents across the web, storing the content of the pages and associated meta data (such as the date of indexing and which page linked to the site). In a modern search engine the crawler is constantly running, downloading thousands of pages simultaneously, to continuously update and expand the index. A good crawler will cover a large percentage of the pages on the Internet, and visit popular pages frequently to keep its index fresh.

A crawler will connect to the web server and use an HTTP request to retrieve the document, if it has changed. On average, web page updates follow the Poisson distribution, that is, the crawler can expect the time until the page next updates to follow an exponential distribution. Crawlers are now also indexing near real time data through varying sources such as access to RSS feeds and the Twitter API, and are able to index a range of formats such as PDFs and Flash. These formats are converted into a common intermediate format such as XML. A crawler can also be asked to update its copy of a page via methods such as a ping or an XML sitemap, but the update time will still be up to the crawler.

The document data store stores the text and meta data the crawler retrieves; it must allow for very fast access to a large amount of documents. Text can be compressed relatively easily, and pages are typically indexed by a hash of their URL. Google's original patent used a system called BigTable; Google now keeps documents in sections called shards distributed over a range of data centres (this offers performance, redundancy and security benefits).
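To make the crawling loop concrete, here is a minimal breadth-first crawler sketch using only the Python standard library. It is an illustration of the general idea described above, not how any production crawler is built; the seed URL, page limit and timeout are arbitrary assumptions, and a real crawler would add politeness delays, robots.txt handling and revisit scheduling.

```python
# A minimal, illustrative crawler: fetch pages breadth-first from a seed URL,
# store the raw HTML with some meta data, and extract links to follow next.
import urllib.request
import urllib.parse
from html.parser import HTMLParser
from collections import deque
from datetime import datetime, timezone

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    frontier = deque([seed])     # URLs waiting to be fetched
    seen = {seed}
    store = {}                   # url -> {"html": ..., "fetched_at": ...}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue             # unreachable or unreadable page: skip it
        store[url] = {"html": html, "fetched_at": datetime.now(timezone.utc)}
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store

if __name__ == "__main__":
    pages = crawl("http://www.dmoz.org/")  # hypothetical seed site
    print(len(pages), "pages stored")
```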

1.3.2 Duplicate Content Detection

Detecting exact duplicates is easy: remove the boilerplate content (menus etc.) then compare the core text through checksums. Detecting near duplicates is harder, particularly if you want to build an algorithm that is fast enough to compare a document against every other document in the index. To perform faster duplicate detection, fingerprints of a document are taken. A simple fingerprinting algorithm is outlined here:

1. Parse the document into words, and remove formatting content such as punctuation and HTML tags.
2. The words are grouped into groups of words (called n-grams, a 3-gram being 3 words, a 4-gram 4 words, etc.).
3. Some of these n-grams are selected to represent the document.
4. The selected n-grams are hashed to create a shorter description.
5. The hash values are stored in a quick look-up database.
6. Documents are compared by looking at overlaps of fingerprints (sketched below).
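A minimal sketch of these steps in Python; the 3-word shingles, the "keep every hash divisible by 4" sampling rule and the Jaccard overlap measure are illustrative choices, not the parameters of any particular search engine.

```python
# n-gram fingerprinting for near-duplicate detection, following the steps above.
import hashlib
import re

def fingerprints(text, n=3, modulus=4):
    words = re.findall(r"[a-z0-9]+", text.lower())   # step 1: strip formatting
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]  # step 2
    hashes = {int(hashlib.md5(g.encode()).hexdigest(), 16) for g in ngrams}  # step 4
    return {h for h in hashes if h % modulus == 0}   # steps 3/5: keep a sample

def similarity(doc_a, doc_b):
    fa, fb = fingerprints(doc_a), fingerprints(doc_b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)               # step 6: overlap (Jaccard)

print(similarity("the quick brown fox jumps over the lazy dog",
                 "a quick brown fox jumps over a lazy dog"))
```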


[Figure: Fingerprinting in action]

A paper by four Google employees (N-gram Statistics in English and Chinese: Similarities and Differences) found the following statistics across their index of the web:

Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Most common trigram in English: "all rights reserved"

Detecting unusual patterns of n-grams can also be used to detect low quality/spam documents (see http://www.seobythesea.com/?p=5108).

1.3.3 Text transformation

Tokenization is the process of splitting a series of characters up into separate words. These tokens are then parsed, looking for markup such as <a ></a>, to find which parts of the text are plain text, links and so on.

Identifying Content

Sections of documents that are just content are found, in an attempt to ignore "boilerplate" content such as navigation menus. A simple way is to look for sections where there are few HTML tags; more complicated methods consider the visual layout of the page.

Stopping

Common words such as "the" and "and" are removed to increase the efficiency of the search engine, resulting in a slight loss in accuracy. In general, the more unusual a word, the better it is at determining whether a document is relevant.


Stemming

Stemming reduces words to just their stem, for example "computer" and "computing" become "comput". Typically around a 10% improvement in relevance is seen in English, and up to 50% in Arabic. The classic stemming algorithm is the "Porter Stemmer", which works through a series of rules such as "replace sses with ss, so stresses -> stress".
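A toy illustration of this kind of suffix-rule stemming, far cruder than the real Porter Stemmer; the handful of rules here are invented for the example.

```python
# A toy suffix-stripping stemmer in the spirit of the Porter Stemmer.
# The rules below are illustrative only; the real algorithm has several
# ordered phases with conditions on the shape of the remaining stem.
RULES = [
    ("sses", "ss"),    # stresses  -> stress
    ("ies", "i"),      # ponies    -> poni
    ("ing", ""),       # computing -> comput
    ("er", ""),        # computer  -> comput
]

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["stresses", "computing", "computer", "ponies"]:
    print(w, "->", stem(w))
```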

Information Extraction

Trying to determine the meaning of text is very difficult in general, but certain words can give clues. For example the phrase "x has worked at y" is useful when building an index of employees.

1.3.4 Index Creation

Document statistics such as the count of words are stored for use in ranking algorithms. An inverted index is created to allow for fast full text searches: an inverted index is a data structure storing a mapping from content, such as words or numbers, to the documents that contain it, allowing fast full text searches at the cost of increased processing when a document is added (see http://en.wikipedia.org/wiki/Inverted_index). The index is distributed across multiple data centres across the globe (a good overview of Google's shard approach is at http://highscalability.com/google-architecture).
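A minimal sketch of building and querying an in-memory inverted index; the documents are made up, and the AND-style query semantics is chosen purely for illustration.

```python
# A minimal inverted index: term -> set of document ids. Real indexes also
# store positions and term frequencies, are compressed, and are sharded
# across many machines; this sketch shows only the core idea.
from collections import defaultdict

documents = {                       # hypothetical, already-cleaned documents
    1: "driving school springfield ohio",
    2: "springfield ohio news",
    3: "learn driving fast",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(search("driving springfield"))   # -> {1}
```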

1.3.5 User Interaction

The user is provided with an interface in which to give their query. The query is then transformed using similar techniques to those applied to documents, such as stemming, as well as spell checking and expanding the query to find other queries synonymous with the user's query. After ranking the document set, a top set of results is displayed together with snippets to show how they were matched.

1.3.6 Ranking

A scoring function calculates scores for documents. Some parts of the scoring can be performed at query time, others at document processing time.
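As an illustration of that split, here is a sketch using a textbook TF-IDF weighting; the formula and the toy documents are assumptions for the example, not the scoring any real engine uses.

```python
# Document processing time: precompute term frequencies per document and
# document frequencies per term. Query time: combine them into a score.
import math
from collections import Counter, defaultdict

docs = {
    1: "driving school springfield ohio driving lessons",
    2: "springfield ohio local news",
    3: "driving theory test tips",
}

# ---- document processing time ----
term_freqs = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
doc_freq = defaultdict(int)
for counts in term_freqs.values():
    for term in counts:
        doc_freq[term] += 1
N = len(docs)

# ---- query time ----
def score(query):
    scores = defaultdict(float)
    for term in query.lower().split():
        if term not in doc_freq:
            continue
        idf = math.log(N / doc_freq[term])        # rarer terms weigh more
        for doc_id, counts in term_freqs.items():
            if term in counts:
                scores[doc_id] += counts[term] * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score("driving springfield"))
```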

1.3.7 Evaluation

Users' queries and their actions are logged in detail to improve results. For example, if a user clicks on a result then quickly performs the same search again, it is likely that they clicked a poor result.


How good can a search engine be?


There are some very specific limits in computer science as to what a computer program is capable of doing, and these have direct consequences for how search engines can index and rank your web pages. The two core sets of problems are NP-Complete problems, which for large sets of data take too long to solve perfectly, and AI-Complete problems, which can't be done perfectly until we have computers that are as intelligent as people. That doesn't mean search engines can't make approximations; for example finding the shortest route on a map is an NP-Complete problem, yet Google Maps still manages to plot pretty good routes (see http://www.youtube.com/watch?v=-0ErpE8tQbw).

2.1 NP Hard Problems

Polynomial (P) problems can be solved in polynomial time, that is, relatively quickly. Non-Polynomial (NP) problems cannot be solved in polynomial time, that is, they can't be solved for any reasonably large set of inputs such as a number of web pages. The time taken to solve an NP hard problem grows extremely quickly as the size of the problem grows.

These concepts become complex quickly, but the key thing to pick up is that if a problem is NP Hard there is no way it can ever be solved perfectly for something as large as a search engine's index, and approximations will have to be used. There are some NP Hard problems that are of particular interest to SEO:

The Hamiltonian Path Problem - detecting a greedy network (i.e. if you interlink your web pages to hoard PageRank) in the structure of a Hamiltonian path (see http://en.wikipedia.org/wiki/Hamiltonian_path) is an NP hard problem.

Detecting Page Farms (the set of pages that link to a page) is NP hard (see Sketching Landscapes of Page Farms by Bin Zhou and Jian Pei).

Detecting Phrase Level Duplication in a Search Engine's Index is NP hard (see Detecting phrase-level duplication on the world wide web by Microsoft Research employees).

A brute-force illustration of why such problems blow up is sketched below.
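As a purely illustrative aside, the brute-force check below tries every ordering of a handful of pages to find a Hamiltonian path; its running time grows factorially with the number of pages, which is why exact solutions are hopeless at web scale. The graph and page names are invented.

```python
# Brute-force Hamiltonian path search: tries all n! orderings of the nodes.
# Fine for a handful of pages, hopeless for a web-scale graph.
from itertools import permutations

def has_hamiltonian_path(nodes, edges):
    """edges is a set of (a, b) directed links; return True if some ordering
    of the nodes visits every node exactly once following existing links."""
    for order in permutations(nodes):
        if all((a, b) in edges for a, b in zip(order, order[1:])):
            return True
    return False

pages = ["home", "about", "blog", "contact"]
links = {("home", "about"), ("about", "blog"), ("blog", "contact")}
print(has_hamiltonian_path(pages, links))   # True: home -> about -> blog -> contact
```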

2.2 AI Hard Problems

AI Hard problems require intelligence matching that of a human being to be solved. Examples include the Turing Test (tricking a human into thinking they are talking to a human, not a computer), recognising difficult CAPTCHAs and translating text as well as an expert (who wouldn't be perfect either). During a question-and-answer session after a presentation at his alma mater, Stanford University, in May 2002, Page said that Google would fulfil its mission only when its search engine was AI-complete, and said something similar in interviews with Newsweek and then Playboy.

"I think we're pretty far along compared to 10 years ago," he said. "At the same time, where can you go? Certainly if you had all the world's information directly attached to your brain, or an artificial brain that was smarter than your brain, you'd be better. Between that and today, there's plenty of space to cover."

"What would a perfect search engine look like?" we asked. "It would be the mind of God." (See http://searchenginewatch.com/2156601)

"And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back, in computer science we call that artificial intelligence. That means it would be smart, and we're a long way from having smart computers." (See http://tech.fortune.cnn.com/2011/02/17/is-something-wrong-with-google/)

Of particular interest to SEO is that fully understanding the meaning of human text is an AI complete problem, and even getting close to understanding words in context is very difficult (see http://en.wikipedia.org/wiki/Natural_language_understanding). This means automatically judging the quality of reasonable computer generated text against that of a human expert is tricky. It's not unusual to see websites packed with decent computer generated text (which is an AI complete problem to detect automatically) and single phrases stitched together from a variety of sources (which is an NP complete problem to detect) ranking for Google Trends results. This is particularly hard to stop as for new news items there are fewer fresh sources available to choose from; this results in search engine poisoning (see http://igniteresearch.net/spam-in-poisoned-world-cup-results/). Any site that receives a large amount of traffic from this will eventually be visited manually by a Google employee, and penalised manually (see http://www.google.co.uk/search?q="Google+Spam+Recognition+Guide+for+Quality+Rater").

Google's solution to the very similar machine translation problem is interesting; rather than attempting to build AI, they use their massive resources and data stored from web pages and user queries to build a reliable statistical engine. Their approach isn't necessarily far smarter than their competitors', but their resources make them the best translator out there.

2.3 Competitors

Although not a classic computer science problem, a big limit on how search engines can treat possible spam is that competitors could attempt to make your website look like it was spamming to lower your ranking, increasing theirs. For example, if your website suddenly receives an influx of low quality links from sites known to link to spam, how would Google know if you naively ordered this or a competitor did? This is an unsolvable problem, short of non-stop surveillance of all website owners. This is what Google has to say on the matter (http://www.google.com/support/webmasters/bin/answer.py?answer=34449):

There's almost nothing a competitor can do to harm your ranking or have your site removed from our index. If you're concerned about another site linking to yours, we suggest contacting the webmaster of the site in question. Google aggregates and organizes information published on the web; we don't control the content of these pages.

I can say from experience that Google bowling most certainly does happen, and there are a couple of experiments written up on the web (see http://bit.ly/jEKzMa), though it would be very difficult to Google bowl a popular website. Essentially, if a small percentage of links to a site are most likely spam they are just ignored; if a large percentage are likely spam then the links may result in a penalty rather than just being ignored. It seems likely that poor quality links are increasingly being ignored. The paper Link Spam Alliances from Stanford, the Google founders' alma mater, discusses both detecting and punishing potential link spam. Note that link spam isn't the only way that sites can potentially be Google bowled; if your competitor fills your comment section with duplicate content about organ enlargement and links to known phishing sites it is unlikely to help your rankings. Google now also takes into account users choosing to block sites from results (see http://googlewebmastercentral.blogspot.com/2011/04/high-quality-sites-algorithm-goes.html), presumably with a negative effect.

Ranking Factors

Google engineers update their algorithms daily (see http://www.nytimes.com/2007/06/03/business/yourmoney/03google). They then run many tests to check they have the right balance between all these factors. The following is from an interview with Google's Udi Manber:

Q: How do you determine that a change actually improves a set of results?
A: We ran over 5,000 experiments last year. Probably 10 experiments for every successful launch. We launch on the order of 100 to 120 a quarter. We have dozens of people working just on the measurement part. We have statisticians who know how to analyze data, we have engineers to build the tools. We have at least 5 or 10 tools where I can go and see here are 5 bad things that happened. Like this particular query got bad results because it didn't find something or the pages were slow or we didn't get some spell correction.

I have created a spreadsheet that shows how a search engine may calculate the ranking of a trivial set of documents for a particular query; you can view it and try changing things yourself at http://igniteresearch.net/poodle-a-simple-emulation-of-search-engine-ranking-factors/.

3.1 On Page Factors

Keywords

Repetitions of the words in the query in the document, particularly in key areas such as the title and headers, are positive signals of relevance. The proximity of the words together is important, particularly having the exact query in the document. A very large repetition, particularly in non-grammatical sentences, can be a negative signal of spam. Presence of the query words in the domain and URL are useful signals of relevance. Phrases related to the query are also positive signals of relevance (see Latent Semantic Indexing). The meta keywords HTML tag, <meta name=keywords content=my, keywords>, is largely ignored by modern search engines (see http://googlewebmastercentral.blogspot.com/2009/09/google-does-not-use-keywords-meta-tag.html).

Quality

A number of different authors on a website, good grammar, spelling and long pages written at reasonable time intervals are positive signs of high quality content (see http://www.seobythesea.com/?p=541).

Geographical Locality

Mentions of an address close to the user show the document may be geographically relevant to the user, particularly for geographically sensitive queries such as "plumbers in london".

Freshness

For time dependent queries, such as news events, recent pages are more likely to be helpful to the user. See Google's Quality Deserves Freshness drive, of which Google's faster indexing Caffeine update was a part.

Duplicate Content

Large percentages of content duplicated either from the same site or others are an indicator of poor quality content, and users will only want to see the canonical copy.

Adverts

A very large number of adverts can reduce the user experience, and affiliate links are often associated with heavily SEO manipulated websites.

Outbound Links

Links to spammy or phishing websites, or an unusually large number of outbound links on a number of pages, are common indicators of a page that users will not want to visit (see Improving Web Spam Classifiers Using Link Structure for a very interesting Yahoo patent on detecting spam based on the number of inbound and outbound links).

Spam

An unusual repetition of keywords, particularly outside of sentences, is a sign of spam. Techniques such as hidden text and sneaky javascript redirects are relatively easy to detect and punish.

3.2 Off Page Factors

Site Reliability

Unreliable or slow sites provide a poor user experience, and so will have a penalty applied (see http://www.mattcutts.com/blog/site-speed/). You can be warned if this happens if you sign up for Google Webmaster Tools.

Popularity of the Site

Measured from aggregated ISP data that search engines buy, and from search traffic (see http://trends.google.com/websites?q=bing.com&geo=all&date=all and http://www.compete.com).

Incoming Links / PageRank

The link structure of the internet is a useful pointer to a website's popularity. Anchor text on incoming links related to the query shows a search engine the page is related to the query. Links that remain for a long time, from sites that have many links pointing to themselves, are rated highly. Links that are in boilerplate areas or are sitewide may be ignored. Links that are all identical in anchor text (i.e. blatantly machine generated), from spammy websites ("bad neighbourhoods", see http://www.google.com/support/webmasters/bin/answer.py?answer=35769), or thought to be paid for with the intention of manipulating rankings or spam, can result in penalties. Links from sites that are most likely owned by the same owner, detected either from Whois data or because the sites are hosted within the same Class C IP range, are likely considered less reliable signals of importance. A normal rate of growth of incoming links is expected, as opposed to bursty starts and stops that indicate link building campaigns (see http://www.seobook.com/link-growth-profile and http://www.wolf-howl.com/seo/google-patent-analysis/).

Other indirect signals of a website's popularity

Other data can include mentions in chats, emails and social networks.

Links from trusted websites

Proximity on the web graph to important, trusted sites is a useful signal that a website can be trusted and is important; links from old, high PageRank websites at the centre of the old, heavily interconnected internet count for a lot (see http://www.touchgraph.com/seo and type in http://www.nasa.gov for a visual graph).

Links from other sites that rank for the query

Results may be reordered based on how they link to each other.

Geographical Location

If the geographical location of the server, the website's location according to directories, its top level domain, or the location set in Google Webmaster Tools match that of the user, it is a signal that the page will be more relevant to the user, particularly for location sensitive searches.

User Click Data

If users often search again after clicking on the site's result, that is an indicator that the page is not a good match for the query. The personal history of results clicked, and the pattern of related searches, may help indicate what a user is looking for (see http://www.seobythesea.com/?p=334).

Domain Information

Older domains are likely trusted more. Google is a domain registrar so has extensive Whois information, and validates that address information associated with domains is correct.

Manual Reviews

Google Quality Raters (see http://searchengineland.com/the-google-quality-raters-handbook-13575) manually review websites, tagging them with categories such as essential to query, not relevant to query, or spam.

3.3 Google PageRank Notes

Google's PageRank was the innovation that propelled Google to the top of the search engine pile. Whilst its implementation has changed much since its original description, and many other factors are now taken into account, it is still at the heart of modern search engines, so some extra notes will be made on it here.

3.3.1 Short Description

The key point is that PageRank considers each link a vote, and links from pages which have many links themselves are considered more important. Or as Google puts it:

PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.

3.3.2 Mathematical Description

It's not essential to have a mathematical understanding of how PageRank is calculated, but for those familiar with basic graph theory and algebra it is useful. You may wish to skip this section and read a slightly less mathematical description instead (see the introductions of http://www.sirgroane.net/google-page-rank/ and http://www.webworkshop.net/pagerank.html, or the Wikipedia article). For a more complete treatment of the mathematics see the original PageRank paper (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf), Deeper Inside PageRank by Amy N. Langville and Carl D. Meyer, and the thesis at http://web.engr.oregonstate.edu/~sheldon/papers/thesis.pdf. The following is summarised from Sketching Landscapes of Page Farms by Bin Zhou and Jian Pei (http://www.cs.sfu.ca/~bzhou/personal/paper/sdm07_page_farm.pdf).

The Web can be modeled as a directed Web graph G = (V, E), where V is the set of Web pages and E is the set of hyperlinks. A link from page p to page q is denoted by the edge p -> q, which can also be written as the tuple (p, q).

PageRank measures the importance of a page p by considering how collectively other Web pages point to p, directly or indirectly. Formally, for a Web page p, the PageRank score is defined as:

    PR(p) = (1 - d) / N + d * sum over q in M(p) of [ PR(q) / OutDeg(q) ]

where M(p) = { q | q -> p } is the set of pages having a hyperlink pointing to p, OutDeg(q) is the out-degree of q (i.e., the number of hyperlinks from q pointing to pages other than q), N is the total number of pages |V|, and d is a damping factor (0.85 in the original PageRank implementation) which models the random transitions of the web. If a damping factor of 0.5 is used then at each page there is a 50/50 chance of the surfer clicking a link, or jumping to a random page on the internet. Without the damping factor, the PageRank of any page with no incoming links would be 0. To calculate the PageRank scores for all pages in a graph, one can assign a random PageRank score to each node, then apply the above equation iteratively until the scores in the graph converge.

The Google toolbar displays PageRank on a logarithmic scale out of 10, not the actual internal value. For example:

Domain        Calculated PageRank    PageRank displayed in Toolbar
small.com     47                     2
medium1.com   54093                  5
medium2.com   84063                  5
big.com       1234567                7
big2.com      2364854                7
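A minimal power-iteration sketch of the calculation above, on an invented four-page graph; the damping factor, tolerance and the convention of spreading the score of pages with no outlinks across the whole graph are standard textbook choices, not anything Google-specific.

```python
# Iteratively apply the PageRank equation until the scores converge.
def pagerank(links, d=0.85, tolerance=1e-8):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # any starting assignment works
    while True:
        # "dangling" pages with no outlinks have their score spread evenly
        dangling = sum(pr[p] for p in pages if not links[p])
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * (incoming + dangling / n)
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tolerance:
            return new_pr
        pr = new_pr

# A hypothetical four-page site: every page links to "home", home links out once.
links = {
    "home": {"about"},
    "about": {"home"},
    "blog": {"home"},
    "contact": {"home"},
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page:8s} {score:.3f}")
```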

3.3.3 Interesting Notes on the Original Implementation of PageRank

From PageRank Uncovered (http://www.bbs-consultant.net/IMG/pdf_PageRank.pdf), essential reading for those looking to understand PageRank from an SEO perspective:

PageRank is a multiplier, applied after relevant results are found

Remember, PageRank alone cannot get you high rankings. We've mentioned before that PageRank is a multiplier; so if your score for all other factors is 0 and your PageRank is twenty billion, then you still score 0 (last in the results). This is not to say PageRank is worthless, but there is some confusion over when PageRank is useful and when it is not. This leads to many misinterpretations of its worth. The only way to clear up these misinterpretations is to point out when PageRank is not worthwhile. If you perform any broad search on Google, it will appear as if you've found several thousand results. However, you can only view the first 1000 of them. Understanding why this is so explains why you should always concentrate on on-the-page factors and anchor text first, and PageRank last.

Each page is born with a small amount of PageRank

A page that is in the Google index has a vote, however small. Thus, the more pages you have in the index, the more overall vote you are likely to have. Or, simply put, bigger sites tend to hold a greater total amount of PageRank within their site (as they have more pages to work with).

Note that Google's original algorithm has most likely been amended since, to detect and reduce PageRank hoarding and the generation of PageRank by massive interlinking of auto generated pages (for more on why this shouldn't work see http://www.pagerank.dk/Pagerank/Generate-pagerank.htm). Also, for quicker calculations, an approximation of PageRank which only gives certain seed pages PageRank may be used. Interestingly, however, there are examples of this working; see How to get billions of pages indexed in Google at http://www.threadwatch.org/node/6999. In a related issue, at one point 10% of MSN Search's (now known as Bing) German index was computer generated content on a single domain (see http://research.microsoft.com/pubs/65144/sigir2005.pdf).

3.3.4 Optimal Linking Strategies

Deciding how to interlink pages that you own or have influence over is tricky; interlinking can be a good signal that pages are related and on a certain topic, and it builds PageRank and controls PageRank flow. However, heavy interlinking can be a signal of manipulation and spam, and different linking structures can make different sites in your possession rank higher. The mathematics gets tricky fast; here is a quick overview of the literature today.

Note from Web Spam Taxonomy

Though written about spam farms, the math holds true for good commercial sites too. Essentially this states that maximum PageRank for a target page is achieved by linking only to the target page from forums, blogs etc. and interlinking the network of sites owned (as if there are no outlinks on a page, the random surfer will jump to a random page on the Internet).

1. Inaccessible pages are those that a spammer cannot modify. These are the pages out of reach; the spammer cannot influence their outgoing links. (Note that a spammer can still point to inaccessible pages.)
2. Accessible pages are maintained by others (presumably not affiliated with the spammer), but can still be modified in a limited way by a spammer. For example, a spammer may be able to post a comment to a blog entry, and that comment may contain a link to a spam site.
3. Own pages are maintained by the spammer, who thus has full control over their contents.

We can observe how the presented structure maximizes the total PageRank score of the spam farm, and of page t in particular:

1. All available n own pages are part of the spam farm, maximizing the total PageRank.
2. All m accessible pages point to the spam farm, maximizing the incoming PageRank.
3. Links pointing outside the spam farm are suppressed, making the outgoing PageRank (PRout) zero.
4. All pages within the farm have some outgoing links, rendering a zero PRsink score component.

Within the spam farm, the score of page t is maximal because:

1. All accessible and own pages point directly to the target, maximizing its incoming score PRin(t).
2. The target points to all other own pages. Without such links, t would have lost a significant part of its score (PRsink(t) > 0), and the own pages would have been unreachable from outside the spam farm. Note that it would not be wise to add links from the target to pages outside the farm, as those would decrease the total PageRank of the spam farm.

From Link Spam Alliances

The analysis that we have presented shows how the PageRank of target pages can be maximized in spam farms. Most importantly, we find that there is an entire class of farm structures that yield the largest achievable target PageRank score. All such optimal farm structures share the following properties:

1. All boosting pages point to and only to the target.
2. All hijacked pages point to the target.
3. There are some links from the target to one or more boosting pages.

From Maximizing PageRank via Outlinks

In this paper we provide the general shape of an optimal link structure for a website in order to maximize its PageRank. This structure, with a forward chain and every possible backward link, may not be intuitive. To our knowledge it has never been mentioned, while topologies like a clique, a ring or a star are considered in the literature on collusion and alliance between pages. Moreover, this optimal structure gives new insight into the affirmation of Bianchini et al. that, in order to maximize the PageRank of a website, hyperlinks to the rest of the webgraph should be in pages with a small PageRank and that have many internal hyperlinks. More precisely, we have seen that the leaking pages must be chosen with respect to the mean number of visits before zapping they give to the website, rather than their PageRank.

From The Effect of New Links on PageRank by Xie

Theorem: The optimal linking strategy for a Web page is to have only one outgoing link, pointing to a Web page with the shortest mean first passage time back to the original page.
Conclusions: ... We conclude that having no outgoing link is a bad policy and that the best policy is to link to pages from the same Web community. Surprisingly, a new incoming link might not be good news if a page that points to us gives many other irrelevant links at the same time.

Reading this paper fully, it is only in very particular circumstances that a new incoming link is not good news.

3.3.5 Implementation to make computing PageRank faster

There have been a number of proposed improvements to the original PageRank algorithm, both to improve the speed of calculation (for example, see Computing PageRank using Power Extrapolation and Efficient PageRank Approximation via Graph Aggregation) and to adapt it to be better at determining quality results. No search engine calculates PageRank with the naive algorithm shown in the original paper (Matt Cutts discusses a couple of the implementation details at http://www.mattcutts.com/blog/more-info-on-pagerank/).

3.3.6 HITS

HITS is another ranking algorithm that takes into account the pattern of links found throughout the web; it was released just before PageRank, in 1998. HITS treats some pages on the web as authorities, which are good documents on a topic, and others as hubs, which mostly link to authorities. A page is given a high authority score by being linked to by pages that are recognised as hubs for information. A page is given a high hub score by linking to pages that are considered to be authorities on the subject. Unlike PageRank, which is query independent and so computed at indexing time, HITS hub and authority scores are query dependent and so computed (though likely cached) at query time.
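A small sketch of the iterative hub/authority calculation on an invented query-specific subgraph; the normalisation and fixed iteration count are common textbook conventions, not details of any particular engine.

```python
# One round of HITS: authority score = sum of hub scores of pages linking in,
# hub score = sum of authority scores of pages linked to. Repeat and normalise.
def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # normalise so the scores don't grow without bound
        auth_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth

links = {                      # hypothetical query-specific subgraph
    "directory": {"guide", "review"},
    "blog": {"guide", "review"},
    "guide": set(),
    "review": set(),
}
hub, auth = hits(links)
print("authorities:", auth)    # "guide" and "review" score highly
print("hubs:", hub)            # "directory" and "blog" score highly
```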

3.3.7 Is linking out a good thing?

Whilst TEOMA is the only search engine that uses HITS at its core, its thinking has heavily influenced search engine designers, so it is likely that linking out to high quality authorities can positively influence either a page's ranking (though potentially negatively, if designers want authorities rather than hubs to appear in their results; see http://www.wolf-howl.com/seo/seo-case-study-outbound-links/ and Deeper Inside PageRank, discussed earlier) or the importance of the other links it contains. Many webmasters fear linking out to sites as they would rather keep links internal to prevent PageRank flowing out (many webmasters also nofollow links for similar reasons; note that this form of PageRank sculpting no longer works according to Matt Cutts, Google's head of [anti]web spam). Matt Cutts also said a number of years ago:

Of course, folks never know when we're going to adjust our scoring. It's pretty easy to spot domains that are hoarding PageRank; that can be just another factor in scoring.

Some search engines are even concerned about people linking out too much; whilst crawlers can now index a large number of links on a page, a very large number of outbound links often indicates that a site has been hacked with spam links or is machine generated. From Web Spam Taxonomy:

A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score. At the same time, the most wide-spread method for creating a massive number of outgoing links is directory cloning.

3.3.8 TrustRank / Bad Page Rank

It's likely that after results are generated based on relevance, PageRank is applied to help order them, and then TrustRank to refine the order further. A site may lose trust every time it fails some kind of spam test (for example if a large number of reciprocal links are found, cloaking, duplicate content, or fake whois data) and gain trust for certain properties (domain age, traffic, or being one of a number of important "seed" sites that are manually tagged as trusted). These initial trust scores could then be propagated in a similar way to PageRank, so linking to and from "bad neighborhoods" would negatively affect a site's TrustRank through association (see http://bakara.eng.tau.ac.il/semcomm/GKRT.pdf, http://www.freepatentsonline.com/7603350.html and http://www.cs.toronto.edu/vldb04/protected/eProceedings/contents/pdf/RS15P3.PDF).

From SEO By The Sea:

In 2004, a Yahoo whitepaper was published which described how the search engine might attempt to identify web spam by looking at how different pages linked to each other. That paper was mistakenly attributed to Google by a large number of people, most likely because Google was in the process of trademarking the term TrustRank around the same time, but for different reasons. Surprisingly, Google was granted a patent on something it referred to as Trust Rank in 2009, though the concept behind it was different than Yahoo's description of TrustRank. Instead of looking at the ways that different sites linked to each other, Google's Trust Rank works to have pages ranked according to a measure of the trust associated with entities that have provided labels for the documents.

... If you've ever heard or seen the phrase "TrustRank" before, it's possible that whoever was writing about it, or referring to it, was discussing a paper titled Combating Web Spam with TrustRank (pdf). While the paper was the joint work of researchers from Stanford University and Yahoo!, many writers have attributed it to Google since its publication date in 2004. The confusion over who came up with the idea of TrustRank wasn't helped by Google trademarking the term "TrustRank" in 2005. That trademark was abandoned by Google on February 29, 2008, according to the records at the US PTO TESS database. However, a patent called "Search result ranking based on trust", filed on May 9, 2006, deals with something called trust rank. Google mentions distrust and trust changes as indicators. More than trust analysis, trust variation analysis is on the road. Fake reviews, sponsored blogs and e-commerce trust network influence are pointed out.

The paper A Cautious Surfer for PageRank comments on why TrustRank shouldn't be overused:

"However, the goal of a search engine is to find good quality results; spam-free is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well connected to those seeds."

3.3.9 Improvements to Google's ranking algorithms

There have been a number of notable algorithm changes which made considerable changes appear to results pages, though often the effects were later scaled back slightly.

NoFollow

Matt Cutts and Jason Shellen created the nofollow specification to help limit the effect of, and incentive for, blog spam. If a search engine comes across a link tagged as nofollow, it will not treat the link as a vote, i.e. as a positive signal in rankings. Areas where untrusted users can post content are often tagged nofollow; roughly 80% of content management systems (the software that websites run on) implement nofollow. The HTML code of a nofollow link: <a href="signin.php" rel="nofollow">sign in</a>

Increasing use of anchor text

Even the original PageRank algorithm took into account the anchor text of links, so links were used to give both a number that indicated the site's popularity and information about the content of a document, and so its relevance for user queries.

Google Bombing Prevention, 2nd February 2007

Google Bombing is the process of massively linking to a page with a specific anchor text, to give PageRank but, more importantly, indications that the document is related to the anchor text. For example, in 1999 a number of bloggers grouped together to link to Microsoft.com with the anchor text "more evil than Satan himself". This resulted in Microsoft being placed number one in searches for "more evil than Satan himself" despite not having the phrase anywhere on its page. Detecting a sudden influx of links with identical anchor text is very easy, and in 2007 Google changed their indexing structure so that Google bombs such as "miserable failure" would "typically return commentary, discussions, and articles" about the tactic itself. Matt Cutts said the Google bombs had not "been a very high priority for us. Over time, we've seen more people assume that they are Google's opinion, or that Google has hand-coded the results for these Google-bombed queries. That's not true, and it seemed like it was worth trying to correct that perception." (See http://answers.google.com/answers/main?cmd=threadview&id=179922.) Some Google bombs still work, particularly those targeting unusual phrases, with varied anchor text, over a period of time, within paragraphs of text.

Florida, November 2003

Results for highly commercial queries, likely informed by the cost of AdWords, became heavily filtered so that more trusted academic websites and less commercially optimised websites ranked. Some of these changes resulted in less relevance, for example if a user was searching for "buy bricks" they probably didn't want to mainly see websites about the process of creating bricks, and were rolled back. For more see http://www.searchengineguide.com/barry-lloyd/been-gazumped-by-google-trying-to-make-sense-of-the-florida-update.php and http://www.seoresearchlabs.com/seo-research-labs-google-report.pdf.

Bourbon, June 2005

A penalty was applied to sites with unusually fast or bursty patterns of link growth.

Jagger, October 2005

A penalty was applied to sites with unusually large amounts of reciprocal links, along with new methods for detecting hidden text.

Big Daddy, December 2005

According to Matt Cutts, punished were "sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling." (See http://www.webworkshop.net/googles-big-daddy-update.html.)

Caffeine, October 2010

A faster indexing system that changed results little, but allowed for fresher results and some of the later Panda updates (see http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html).

Panda, April 2011

A penalty applied to content deemed low quality, detected primarily from user data. Websites which contained masses of articles, focusing on quantity over quality, were often hit (see http://blog.searchmetrics.com/us/2011/04/12/googles-panda-update-rolls-out-to-uk/, http://www.seobook.com/questioning-questions and http://googlewebmastercentral.blogspot.com/2011/05/more-guidance-on-building-high-quality.html).

Detecting Spam and Manipulation

You will often hear that your site has to look natural to the search engines. Just what natural means is hard to define, but essentially it means the profile of a site whose popularity was never engineered or promoted, and was instead based on people luckily coming across it and deciding to recommend it to their friends with links. What's more, you also need to make your site look popular: creating no links to your site yourself will look natural, but you will have no chance of competing with people who do, unless you have the cash to buy large amounts of advertising. This section briefly covers what search engines consider to be acceptable, when and how they can detect violations, and what the potential penalties are.

4.1 Google Webmaster Guidelines

Google have created a page called Webmaster Guidelines to inform users of what they consider to be acceptable methods of promoting your website. Whilst general principles such as "Would I do this if search engines didn't exist?" are somewhat vague, they do offer some specific notes of what not to do:

Avoid hidden text or hidden links.
Don't use cloaking or sneaky redirects.
Don't send automated queries to Google.
Don't load pages with irrelevant keywords.
Don't create multiple pages, subdomains, or domains with substantially duplicate content.
Don't create pages with malicious behavior, such as phishing or installing viruses, Trojans, or other badware.
Avoid "doorway" pages created just for search engines, or other "cookie cutter" approaches such as affiliate programs with little or no original content.
If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.

Most of the methods listed above are naive and easy to detect. Google have been fairly successful in aligning successful manipulation with creating genuine content, though without any promotion it is unlikely even the best content will be noticed.

4.2 Penalties

Penalties that Google applies to detected manipulation vary in length of time and effect, from small ranking penalties for certain keywords on a page to site wide bans, depending upon the sophistication of the manipulating methods and the quality of the offending site (for an example see http://www.forbes.com/2007/04/29/sanar-google-skyfacet-tech-cx_ag_0430googhell.html). If you believe you have had one applied, you can submit a Google Reconsideration Request (http://www.google.com/support/webmasters/bin/answer.py?answer=35843) from Google Webmaster Tools, once you have fixed the offending issues.

4.3 Detecting Manipulation in Content

There is a fascinating paper by Microsoft Research, Detecting Spam Web Pages through Content Analysis (http://cs.wellesley.edu/~cs315/Papers/Ntoulas-DetectingSpamThroughContentAnalysis.pdf), which details a number of methods for detecting spam pages in a search engine's index based on their content. A simple way is to use Bayesian filters (one is included with Ignite SEO to test your content as the search engines would), so for example seeing the phrase "buy pills" would be a strong indicator of spam. Most of the research is on detecting blatantly computer generated lists of keywords, which is fairly easy. Detecting the quality of human written content is very difficult, so unless you are endlessly repeating your keywords, if you are writing your own content you can be reasonably happy with its quality in the search engines' eyes.

[Figures from Detecting Spam Web Pages through Content Analysis omitted]
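To illustrate the Bayesian filtering idea, here is a tiny naive Bayes classifier; the training snippets, equal priors and add-one smoothing are assumptions made up for the sketch, not anything a search engine actually trains on.

```python
# A minimal naive Bayes spam/ham text classifier with add-one smoothing.
import math
from collections import Counter

spam_docs = ["buy pills cheap pills online", "cheap pills buy now"]
ham_docs = ["driving school lessons in springfield", "learn to drive with our school"]

spam_counts = Counter(w for d in spam_docs for w in d.split())
ham_counts = Counter(w for d in ham_docs for w in d.split())
vocab = set(spam_counts) | set(ham_counts)

def log_prob(words, counts, prior):
    total = sum(counts.values())
    score = math.log(prior)
    for w in words:
        # add-one smoothing so unseen words don't zero out the probability
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def classify(text):
    words = text.lower().split()
    spam_score = log_prob(words, spam_counts, 0.5)
    ham_score = log_prob(words, ham_counts, 0.5)
    return "spam" if spam_score > ham_score else "ham"

print(classify("buy cheap pills"))              # spam
print(classify("springfield driving lessons"))  # ham
```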

4.4 Detecting Manipulation in Links

Much research has focused on detecting spam pages through their backlinks or outlinks. Yahoo obtained a patent that uses the rate of link growth to detect manipulation. Essentially a constant rate of new backlinks, perhaps with a small growth over time, is expected for a typical site. A saw-tooth pattern of inlinks is a strong indicator of backlink campaigns that start and stop (though it could also be an indicator of, say, a site that releases new software monthly).

In their paper, Fetterly et al analyse the indegree (incoming/backlinks) and outdegree (links on the page) distributions of web pages:

Most web pages have in and outdegrees that follow a powerlaw distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages.

As discussed in the TrustRank section earlier, a large amount of links from sites that have already been detected as linking to spam (so called untrustworthy hubs) is a negative indicator. Links from unrelated websites, reciprocal links, links outside of content, links from sites that are known to host paid links, and many other signals are likely taken into consideration. Zhang et al have identified a method for identifying unusually highly interconnected groups of web pages. More methods of identifying manipulative sites are listed in "Link Spam Alliances" by Gyöngyi and Garcia-Molina.
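As a toy illustration of the burstiness signal described above, the sketch below flags a saw-tooth backlink history; the monthly counts and the threshold on the coefficient of variation are invented for the example.

```python
# Flag link profiles whose month-to-month growth is unusually bursty.
# A steady profile has a low coefficient of variation in new links per month;
# a start-stop campaign produces a high one. The 1.0 threshold is arbitrary.
import statistics

def is_bursty(new_links_per_month, threshold=1.0):
    mean = statistics.mean(new_links_per_month)
    if mean == 0:
        return False
    return statistics.stdev(new_links_per_month) / mean > threshold

steady = [40, 45, 38, 50, 47, 44, 42, 48]       # hypothetical natural growth
sawtooth = [5, 400, 2, 380, 0, 410, 3, 395]     # hypothetical campaign bursts
print(is_bursty(steady))    # False
print(is_bursty(sawtooth))  # True
```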
4.5 Other Methods

If you think a competitor has been using methods that violate the webmaster guidelines, you can report them to Google (https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1). Google will sometimes manually review websites without prompting; Google Quality Raters inspect sites for relevance to results but can also tag web pages as spam. Particular markets are inspected more often than others. It's good practice to ensure that any site you wish to keep for a long time, and expect to get reasonable amounts of traffic from, stays well within the webmaster guidelines.

Part II

Practice

5 An Example Campaign

Now we've covered the theory, it's time for a real world example of putting it into practice.

5.1 Company Profile

John runs a driving school in Springfield, Ohio. He has a website he has owned for a couple of years, which ranks around the second page for most searches related to driving schools in Ohio and receives about 20 visitors a day, a third from search engines and two thirds from links from local websites. A quick search for what he imagines would be his main keyword, "driving school Springfield Ohio", has a company directory site at the top, followed by other directories, companies and people asking on forums for recommendations. This mix of relevant small companies' web sites and small pages on big websites indicates the keyword is of medium difficulty to rank for.

5.2 Goals

John thinks if he can get his site to rank 3rd instead of around the middle of the second page for his core keywords, he will increase his search traffic by around 1000%, his overall traffic by about 300%, and roughly double his sales. He aims to do this over a period of roughly one month.

5.3 Competitor Research

John finds his main competitors by searching, and gets estimates of their traffic sources using sites such as compete.com and serversiders.com. A tool such as Ignite SEO can automatically build SEO reports of competitors, listing their paid and organic keywords, demographics and backlinks. Looking at the HTML source code of some of his competitors displays their targeted keywords in the <meta name=keywords content=keyword1, keyword> tag.

5.4 Keyword Research

John takes his initial guesses at what potential customers might search for, together with those from his competitors and his existing traffic, and expands this list using the Google Keyword Tool (https://adwords.google.co.uk/select/KeywordToolExternal) and Google Insights (http://www.google.com/insights).

5.5 Content Creation

John takes his keywords and creates a small amount of content on his website containing them. He then quickly creates a large amount of content for sites hosted on free hosting services (see http://igniteresearch.net/which-web-2-0-ranks-best-hubpages-vs-squidoo-vs-tumblr-vs-blogspot-etc/), each one targeting a different keyword. The content generator section of Ignite SEO (http://igniteresearch.net) is perfect for this.

5.6 Website Check

Before investing in off site promotion (i.e. link building), it is worth performing a quick check that the site is search engine friendly. Creating an account in Google Webmaster Tools will let you know if Google has any issues indexing your website, and it is worth ensuring navigation isn't over reliant on JavaScript or Flash.

5.7 Link Building

This is the core process that will actually improve John's rankings. By looking at his competitors' backlinks using Yahoo's linkdomain: command, John replicates their links to his website by visiting each site one by one. Using a tool such as Ignite SEO, he can automatically build links to the hosted sites he quickly created in 5.5, without the risk of a link campaign negatively affecting the rankings of his core website. Other signals of quality, such as Facebook and Twitter recommendations, are built here.

5.8 Analysis

The success of the campaign is measured with a good tracking system such as Google Analytics, as well as by tracking the new incoming links with Google Webmaster Tools and Yahoo's link: command. The results are compared with the goals, and the whole process is refined and repeated.

About the Author


Christopher Doman is a partner of Ignite Research, a firm specialising in software and consultancy for search engine marketing. He holds a BA in Computer Science from the University of Cambridge.
