
WEB-BASED SEARCH ENGINES 1

Web-Based Search Engines:

History, Recent Innovation, and Implications for Libraries

Katherine Fleck

University of South Florida



Abstract:

This paper is an assessment of recent innovations in the fields of information technology and

information science as related to web-based search engines. Web-based search engines were first

developed in the 1990s and in the last two decades have changed substantially with the

introduction of advanced language recognition, sponsored search, search personalization, and

relevance rankings. Recently implemented or proposed developments for web-based search

engines include advanced personalization, improvement to crawling technologies, and social

network connectivity. Web-based search engine technologies offer potential for libraries to

improve their reference services and general patron outreach.

Keywords: web-based search engines, libraries, information technology, information retrieval



Web-Based Search Engines: History, Recent Innovation, and Implications for

Information Professionals

Web-based search engines first emerged in the 1990s, and since the beginning of the 21st

century, have played a pivotal role in shaping the web activity and daily lives of users. Search

engines are used to find online retailers, check the weather or traffic, identify and locate local

businesses, answer medical questions, find academic information, keep up with local news, and

seek help for personal issues ranging from fixing a finicky computer to cleaning a wound. Search

engines are also used by students and other academics of all ages to gather information on any

number of subjects from sources that range from Wikipedia to a scientific journal. The capacity

of a web-based search engine seems limitless. Academics in the fields of both Information

Sciences and Information Technology believe the search engine has the potential for much more.

This paper provides a history of the search engine from its roots in the 1960s to its

abilities and appearance in 2016. It then assesses the recently published points of research in the

field, as well as ongoing projects intended to improve or change the function and usage of web-

based search engines. Ranging from addressing the deep web problem to bridging the gap

between search engine and social network usage, current scholarship provides several

meaningful ways in which the search engine can be changed to fit new needs and overcome

historical challenges. Finally, this paper addresses the way in which information science

professionals—librarians in particular—must adapt and can benefit from search engines in both

their current form and from proposed innovations. The desires of users regarding web-based

search engines such as Google speak volumes about what patrons want out of their library

experience, particularly when it comes to online library searches and online reference services.

Web-based search engines were introduced in the 1990s with the rise of the World Wide

Web. However, their predecessor, the computerized search engine, was developed in the 1960s.

The first computerized search engine was developed by Gerard Salton and his team of

developers at Cornell University. The SMART (System for the Mechanical Analysis and

Retrieval of Text) System utilized the basic search structure that is still followed today by web-

based search engines: query using keywords, match the query against a collection of information,

and determine a list of relevant results. Today, all major search engines are still following this

basic design, while implementing their own algorithms and policies to determine individual

relevance rankings (Wright, 2016).
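
The three-step structure described above (keyword query, match against a collection, ranked list of results) can be illustrated with a minimal sketch. It assumes the simplest possible relevance measure, a raw count of keyword occurrences, and is not a reconstruction of the SMART System's actual weighting; the function names and the sample collection are hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def search(query, collection):
    """Rank documents by how many times the query keywords occur in them."""
    keywords = set(tokenize(query))
    scored = []
    for doc_id, text in collection.items():
        counts = Counter(tokenize(text))
        score = sum(counts[k] for k in keywords)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken alphabetically by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical three-document collection used only for illustration.
collection = {
    "doc1": "Libraries offer reference services and library catalogs.",
    "doc2": "Search engines rank web pages for a keyword query.",
    "doc3": "A library search engine helps patrons search the catalog.",
}
results = search("library search", collection)
```

Here `results` lists the third document first because it contains both keywords, which is the kind of relevance ordering every later engine refines with its own algorithm.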

The first internet-based search engine was introduced in 1990 by Alan Emtage, a McGill

University student. This search engine, Archie, was intended for indexing and searching FTP

(File Transfer Protocol) archives. Archie made it possible, in a pre-web world, to find and retrieve files

online. Archie inspired the development of many more advanced internet-based search engines

such as Jughead (now Jugtail) and Veronica that were capable of more complex search modifiers

and fine-tuned advanced searches. These search engines originally relied solely on keyword

relevance to the text of the webpage. However, their capabilities expanded to include metadata in

their search and results relevance algorithms. Internet-based search engine technology was then

adapted with the introduction of the World Wide Web. The W3 Catalog was the first web-based

search engine. Introduced in 1993 by Oscar Nierstrasz of the University of Geneva, the W3

Catalog relied on already compiled lists of websites that could be searched by keyword via

query. Although this search engine is web-based, it is very different from today’s web-based

search engines that rely on crawlers for indexing. W3 Catalog, due to its reliance on lists, was

not long-lasting. The project was abandoned in 1996 after the rise of web-based search engines

that relied on crawlers (Gürkaynak, Yılmaz, & Durlu, 2013).

JumpStation, which was released in 1993 by Jonathon Fletcher of the University of

Stirling, is the first web-based search engine that functions similarly to the web-based search

engines we use today. JumpStation introduced the use of crawlers and a completely robot-built

index rather than the manually-built indexes of engines like W3. In the crawling process, search

engines use software to comb web pages and copy the HTML code to be retained in a cache. The

crawler software then uses this data to compile an index. The index’s cache links represent the

most recent snapshot of the website, and change over time as the website changes its content or

updates its layout. This means that the crawler is never done working. Even after a website has

been indexed, crawlers still must constantly gather new HTML code in order to maintain up-to-date

indexes. Search queries can then be made on the web-based search engine’s hosted website.

When the user submits a query, the search engine then provides a list of relevant links to the

websites that have been cached in their index (Gürkaynak, Yılmaz, & Durlu, 2013).
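
The crawling process described above can be sketched in a few lines. This is a simplified illustration rather than any engine's actual crawler: it assumes a hypothetical in-memory dictionary (`FAKE_WEB`) in place of real network requests, and the names `PageParser` and `crawl` are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects outgoing links and visible words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links, self.words = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href" and value]

    def handle_data(self, data):
        self.words += data.lower().split()

def crawl(fetch, seed):
    """Breadth-first crawl from seed; returns (cache, inverted index)."""
    cache, index = {}, {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in cache:
            continue  # already visited during this crawl
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        cache[url] = html  # snapshot of the page as last seen
        parser = PageParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        queue.extend(parser.links)
    return cache, index

# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "a.example": '<p>pizza recipes</p><a href="b.example">more</a>',
    "b.example": '<p>pizza history</p><a href="a.example">back</a>',
}
cache, index = crawl(FAKE_WEB.get, "a.example")
```

Looking up `index["pizza"]` returns both pages, while `index["recipes"]` returns only the first. Keeping the index current would mean rerunning the crawl whenever the pages change, which is why the crawler is never done working.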

Google, which is today’s most prominent web-based search engine, was released in 1996

under a Stanford University domain, and then under its own domain in 1997. Developed by

Larry Page and Sergey Brin, Google differed from previous web-based search engines in how it

ranked search results. Previously, search engines that provided ranked results defined relevance

by the number of times the queried keywords appeared on a webpage. Google developed a new

system of ranking called PageRank that considered not only the number of times that the

webpage used the keyword, but also the prominence and popularity of the website based on the

number of its recorded backlinks. Over time, this technology has been improved to account for

many elements to determine relevance, including online traffic, metadata relevance, and keyword

and key phrase use. Google has also created a system in which relevance can be artificially

inflated due to paid sponsorships/partnerships with the company (Page, Brin, Motwani, &

Winograd, 1999).
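
The backlink idea behind PageRank can be illustrated with a short sketch. It assumes the commonly published iterative formulation with a damping factor of 0.85 and a hypothetical four-page link graph; it is a teaching simplification, not Google's production ranking, and it leaves out the keyword, traffic, and metadata signals mentioned above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical site graph: three pages link back to "home".
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
```

In this example "home" receives the highest score because three pages link to it, and "blog" receives the lowest because nothing links to it, regardless of how many keywords either page contains.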

The first web-based search engine to adopt these sponsored search practices was GoTo

(later Overture Services) in 1998. The company was purchased by Yahoo! and rebranded as Yahoo!

Search Marketing in 2005, which launched sponsored search as a common practice among

search engines. Sponsored search allows advertisers to buy space on the search results page of

keywords that relate to their company or product. Search engines work with their sponsors to

analyze click data and ensure that advertisements are appearing for appropriate queries.

Sponsored searches became pivotal to the success of many search engines as they are a primary

source of funding (Fain & Pedersen, 2006).

Although Google has faced competition from other web-based search engines, especially

in the early 2000s, today it has few competitors when it comes to large-scale web-based

searching. In October 2015, Yahoo, once one of Google’s main competitors, outsourced its

search functions to Google, and Ask.com has abandoned its search engine function to return to

its original Q&A model. Today, Google’s main competitor is Microsoft’s Bing search engine in

the United States, and in other nations, such as China, home-grown web services’ search engines

like Baidu compete with Google. However, this is not to say Google has no competition. Many

people are turning toward individual specialized search engines (many of which Google or its

competitor Microsoft provide the technology for) and to social networking for information. For

instance, rather than use a large-scale search engine like Google, a user might use a real-estate

specific search engine like Zillow.com’s to find a rental property, or in the case that they need

advice on which phone to purchase, they might post the question to their Facebook newsfeed to

allow friends to post their opinions.

Research developers have focused on understanding the relationship between social

networking and search engines in an attempt to understand how search engines or search

practices can be improved. According to Eric Goldman, Facebook and Twitter have become

partial alternatives to Google (2011). Twitter has become a go-to resource for “real-time

informational needs” via their search engine using keywords or hashtags. Particularly in the case

of breaking news, Twitter has become a staple for quickly gaining information. Twitter and

Facebook do not compete directly as search engines but are popular alternatives to Google

because they compete for “user mindshare.” Both have become staples of internet usage and in

addition to offering the social networking elements they are known for, also provide an

opportunity to gain information that might have been searched for on a traditional engine like

Google. For instance, one might be browsing their newsfeed and see a #bestpizzaever hashtag

along with the name of a local pizza restaurant. When that user is considering ordering pizza,

they might instead turn to that hashtag or particular tweet rather than use a search engine to find

a local restaurant. On Facebook, one might see a deal on amusement park tickets posted to their

newsfeed from a local business or by a friend who knows about the special. Rather than using a

search engine to figure out what to do that weekend or to search for amusement park discounts,

the user was able to use social networking to find desired information.

One of the reasons that users might turn to social networking over traditional search

engines to find information is because a search engine, even while growing increasingly

advanced, does not have the cognitive intelligence of a person. In 2015, Liu, Shi, & Wang

published their study on the intelligence of search engines. What they determined was that while

search engines “reach the edge or even beyond the human’s intelligence levels in common sense,

translation, and calculation,” they vastly lacked in intelligence when it came to “gaining

knowledge and giving feedback, especially in the fields that require relatively high intelligence

such as arranging, associative thinking, creating, speculating, choosing and discovering the

pattern.” Based on their IQ test, they argue that Google, which they ranked as having the highest

knowledge base, has half the intelligence of the average 6-year-old child. Liu, Shi, & Wang,

based on their study, argue that there is much room for development of search engine

intelligence. Search engines do not generally have the ability to understand complex queries that

go beyond a search for keywords. They cannot function with the same level of intelligence and

understanding that one might find when asking a human for information.

A major issue plaguing search engine developers has always been the query because it is

the most user-centric element of the search process. For the last decade, developers have

implemented tools to improve search engine queries via semantic analysis such as correcting

spelling errors, recognizing keyword synonyms, and accounting for quirks in language or

grammar. However, developers have now turned to the query itself. By mining and analyzing

query terms, search engine developers hope to provide more precise search results. This has been

made possible by the emergence of new technologies such as “advances in natural language

processing; the spread of location-aware, voice recognition-equipped mobile devices, and the

rise of structured data that allows search engines to extract specific data elements that might once

have remained locked inside a static Web page” (Wright, 2016). The convergence of these

technologies allows a web search like Google to recognize that when someone located in

downtown Milwaukee types “pizza” into their search engine, they may be looking for locations

that serve pizza in their immediate area, not information on the history of pizza or pizza recipes.

To improve their understanding of the search query process, Google has developed a project

called the Daily Information Needs study. This annual study is comprised of volunteers who

report on a daily basis what they have been searching for and why. This allows Google to

analyze the types of search needs users have, how they are attempting to answer these search

needs, and how the search engine they used was or was not able to meet their needs. Google

has used these findings to change the way in which it answers queries.

Addressing the varying desires of users has led to many changes on the “results” page

across popular web-based search engines. Because developers have run into the problem of

queries that are so general that the search engine is unsure what kind of information the user

needs, they have begun grouping different sets of information about the query topic on the results

page so that the user may select the subsection of the topic they are interested in. Eric Goldman

refers to these as “zones” of search results which include traditional “organic search results, local

results, news results, shopping results, video results, highlighted brands, results from sites in the

searcher’s social network, a map for geographic results,” and more (2014). Within these zones,

links are ranked, and sponsored search results can potentially be found in each zone. Someone

searching for the term “Tampa” will find a blurb about the city from Wikipedia, a list of points

of interest for visitors, links about the city, sponsored links from local businesses, travel

companies, or local government, and a news subsection for the city’s recent events. If you were

only looking for basic information, such as the location, weather, population, or a brief

description, it can be found without clicking a single link. According to D.M. Modesto and S.L.

Ferreira, “search engines are becoming ‘answer engines’, that means that instead of exploring and

seeking information in deep site structures, users demand quick answers from search engines”

(2014). This relatively recent shift benefits many users, especially given how often web-based

search engines are used for quick consumer or general information, and how often they are used

on mobile platforms. However, it makes doing in-depth research on a topic more difficult, as it

shrouds the relevant results on a topic behind its various images, media, sponsored results, and

smaller zones. While changes in search result page usability may make them more useful for

casual searches, they hide another issue facing the web-based search engine—the deep web.

Today’s web-based search engines are being shaped by the sheer volume of information

that they must try to account for. The web continues to expand exponentially; in 2015, Google

indexed about 60 trillion pages, up from one trillion in 2008 (Wright, 2016). As the web grows,

it becomes increasingly difficult for search engines such as Google to include new or changing

information in their searches. In order to keep their sites up to date and promote search engine

crawling, website owners frequently change their sites. “In addition to the new content generated

every day, it was estimated that 52 percent of web content changes every day” (Zineddine, 2016).

In order for the search engine to maintain up-to-date indexes, crawlers have to constantly revisit

websites to retrieve updates. Despite best efforts, crawlers simply cannot retrieve all of the

available information on the web. According to Bowtie Theory, it is likely that around 80% of

the web’s content is inaccessible through search engines. For this reason, many researchers are

looking for ways to improve the search processes of web-based search engines to better allow

them to access the deep web while keeping up with the constant changes to the websites and

databases they already use.

Mhamed Zineddine has proposed “push” or “pull” systems in which web servers hold the

key to solving the problem of the deep web. In the “pull” method, web servers automatically

provide search engines with an updated Sitemap XML file every time a change is made on any

domain they host, rather than crawlers needing to continually check for updates to index. In the

“push” method, the web server automatically creates an XML file with just the updates and

instructions needed for the search engine to implement them. In this situation, the search engine

updates when necessary just as any program or application would, saving a significant amount of

resources, as crawlers will not need to continuously return to the same sites to create new XML

files (2016). For both academics and general users, developing means of accessing the deep web

provides new information and insights.
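
The difference between sending a complete Sitemap file and sending a file with only the updates can be sketched as follows. The example assumes the public Sitemap protocol's `urlset`, `url`, `loc`, and `lastmod` elements; the page list, the cutoff date, and the helper names are hypothetical, and the sketch does not reproduce the specific file format or server logic that Zineddine proposes.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: iterable of (url, last_modified_date_string) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

def changed_since(pages, cutoff):
    """Keep only pages modified on or after cutoff (ISO dates sort as text)."""
    return [(url, lastmod) for url, lastmod in pages if lastmod >= cutoff]

# Hypothetical pages hosted by one web server.
pages = [
    ("https://library.example/hours", "2016-03-01"),
    ("https://library.example/events", "2016-04-15"),
]
full_sitemap = build_sitemap(pages)  # complete file listing every page
update_only = build_sitemap(changed_since(pages, "2016-04-01"))  # changes only
```

The update-only file contains a single entry, so a search engine receiving it would only need to revisit the one page that changed rather than re-crawl the whole domain.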

However, for many, the idea of having even more resources available in a web-based

search is undesirable. Often for a casual query, the sheer volume of results is overwhelming, and

therefore, researchers and developers are looking for new ways to narrow these results using

search engine user personalization. Search engine personalization was introduced by Google in

2008 and is comprised of several factors (Tran & Yerbury, 2015). Ad targeting uses information

such as previous search history, age, gender, and location to determine sponsored results.

Additionally, when signed into an account via Google, some personalized information such as

location can be set. One might argue that personalization can be problematic, because it

eliminates consistency and allows the user more authority over their search results while still

providing them with a false sense of a search engine’s objectivity. However, according to a 2015

study, search engine users preferred personalized search results because they received targeted

information that was more relevant to their needs or interests (Tran & Yerbury, 2015).

Developers are still looking for ways to improve the personalization of search engines.

Nathaneal Ramesh and J. Andrews (2015) argue that social networking activity can be utilized by search

engines to create personalized search profiles that better understand a user’s interests, dislikes,

daily activity, and other preferences. By mining user data from social networking sites such as

Facebook or Twitter, a search engine could more seamlessly incorporate itself into the life of its

user. For instance, by mining a user’s Twitter feed for content and keywords, a search engine can

detect the political issues or news events that interest the user, as well as the media sources they

follow, allowing for the creation of more desirable search results the next time the user searches

for information on a news piece. This information could also be used to determine the types of

stores and restaurants the user prefers, providing them with more targeted advertisements or

product results for queries. Taking search engine personalization even further, Anne Gerard

Schuth has proposed a new search engine evaluation paradigm called “multileaving” in which

“many rankers can be compared at once by combining document lists from these rankers into a

single result list and attributing user interactions with this list to the rankers” (2016).

Multileaving would allow base rankings for search results to be created via user interaction, and

constant means of feedback from users would ensure that search engines are continually able to

improve search results based on a series of rankers. This would allow users to have a

personalized search experience, while also providing the search engine with feedback data to use

to improve general queries.
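
The multileaving idea quoted above, combining several rankers' lists into one and attributing clicks back to the rankers, can be sketched with a simplified team-draft style procedure. This is an illustrative assumption about one way to do it and is not presented as Schuth's exact method; the ranker names, document ids, and function names are hypothetical.

```python
import random

def team_draft_multileave(rankings, length, rng):
    """Merge several rankers' lists; record which ranker supplied each result."""
    combined, credit = [], {}
    rankers = list(rankings)
    while len(combined) < length:
        added = False
        order = rankers[:]
        rng.shuffle(order)  # a random pick order each round keeps the draft fair
        for name in order:
            if len(combined) >= length:
                break
            # Each ranker contributes its best document not yet in the list.
            doc = next((d for d in rankings[name] if d not in credit), None)
            if doc is not None:
                combined.append(doc)
                credit[doc] = name
                added = True
        if not added:
            break  # every ranker is exhausted
    return combined, credit

def score_clicks(clicks, credit):
    """Count clicks per ranker for the documents each one contributed."""
    wins = {}
    for doc in clicks:
        if doc in credit:
            wins[credit[doc]] = wins.get(credit[doc], 0) + 1
    return wins

# Three hypothetical rankers ordering the same small set of documents.
rankings = {
    "A": ["d1", "d2", "d3"],
    "B": ["d2", "d4", "d1"],
    "C": ["d5", "d1", "d2"],
}
combined, credit = team_draft_multileave(rankings, 4, random.Random(0))
wins = score_clicks([combined[0]], credit)  # the user clicks the top result
```

Each click on the combined list counts as a vote for the ranker that supplied the clicked document, so ordinary user activity doubles as feedback for comparing many rankers at once.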

While all of these ideas and recent innovations are important for the future of commercial

web-based search engines like Bing and Google, they also play an important role for information

science professionals. Information retrieval is at the heart of the field of information sciences,

and professionals in this field, such as librarians, have much to gain from understanding changes

and innovations being implemented in the world of web-based search engines. For many years,

librarians saw search engines such as Google as a potential enemy that promised the demise of

reference services and the rise of unchecked and potentially misguided information gathering by

the general public. However, in recent years, many libraries across the nation have begun

embracing web searches as something that they must work with, and in fact, can potentially

improve the library’s connection to its patrons.

In determining which search engine practices are compatible with the library’s services,

usability must be considered. A 2014 study of low-level readers’ usage of web-based search

engines revealed that many of the search engine changes made in the last few years have

improved low-level readers’ search experience. However, some changes are to their detriment.

The web is just as important a tool for low-level readers as

it is for average and high-level readers. For those with disabilities that prohibit or impede

mobility, it can be even more important. Low-level readers have “limitations related to

perception and search strategies and the use of search tools differently from users with high

reading skills” that must therefore be addressed not only by general web-based search engines like

Google or Bing, but also by library search engines and reference tools (Modesto & Ferreira,

2014).

According to their study, low-level readers, when completing a search, struggle to

complete superficial readings of text, and therefore must carefully read each word, even if they

do not necessarily understand them. This means that when they complete a query, low-level

readers have difficulty ascertaining the accuracy or suitability of the results, as well as selecting

which result best answers their question. Many changes adopted by search engines in the last

several years have improved the search experience for low-level reading users. For instance, the

developments in semantic analysis such as spell check and voice recognition in query

development have meant that low-level readers struggle less with typing a query to find

results. Additionally, because search engines now provide a significant amount of information

about the query topic right on the results page rather than solely providing links to potentially

appropriate websites, low-level readers have less difficulty in finding answers to their queries.

However, some elements of the modern web-based search engine’s results page are to the

detriment of low-level reading users. Because images, videos, recent news articles, sponsored

results, and organic search results are all listed on the results page, low-level readers can become

distracted or confused. According to this study, low-level readers did not like the multiple search

subsections because they distracted them from the goal of their search. Often they could not

differentiate between organic search results and sponsored search results, which led them to

click on sponsored links that were unrelated to the goal of their search or that provided

them with inappropriate or inaccurate information (Modesto & Ferreira, 2014).

If libraries are to meet the needs of online users, they may want to consider the

personalized search engine model that is being adopted by commercial web-based search

engines. What if a library could maintain a user profile, based on something as simple as the

patron’s borrowing history, or something as complex as their social networking information, that

could improve their search engine? With the recent innovations in search engine development, it

is possible that a library’s search engine could be developed with personalizations as nuanced as

topical interests, favorite genres, and reading level, in order to provide patrons with a more

customizable search experience. A team of Brent Hecht, Jaime Teevan, Meredith Ringel Morris,

and Dan Liebling at Northwestern University has laid some of the groundwork for this idea

through SearchBuddies, a socially embedded search engine (2012). SearchBuddies actually

functions through the user’s social network and provides information and networking assistance

when a user makes a query. For instance, if a user were to ask “Should I buy the new iPhone?”,

SearchBuddies would respond with a link to a consumer report about the new iPhone and

could also connect relevant users together, in this case, by recognizing that the user’s Facebook

friend had liked the new iPhone’s Facebook page, or had posted about regretting buying the new

iPhone. Programs like SearchBuddies show the potential of how search engines can be

embedded into the lives of users in new ways. This is a particularly exciting opportunity for

libraries. Most libraries have already found that social networking and community outreach are

pivotal to their continuation. A social network embedded search engine offers libraries the

opportunity to assist patrons in ways that they may not expect. It also offers the library a way to

market the resources it has available that the public simply may not know about. If the library

could target social network users in a certain geographic area or those who had liked the library

or local government’s page, they would then be able to assist when someone posts a comment

such as “I wish I had the money for a 3D printer. These things are amazing!” by informing them

that there is a 3D printer at the library and providing a calendar for workshop dates.

Libraries, as long as they continue to work with search engines rather than against them,

have much to gain from their development. From what researchers have seen in recent years,

search engine users are looking for convenience, simplicity, and most importantly, a narrow

personalized pool of search results. Most users are not looking for mass amounts of information

on a topic, but rather, targeted information that best benefits them. These desires speak to the

skills of reference librarians who have abilities that search engines lack—such as creative

thinking and complex interpretation. As search engines continue to strive for more intelligent

design, information professionals play an important role and also have the opportunity to benefit,

particularly when providing reference services, from the increasing amount of deep web content

available via online search. Web-based search engines are one of the fastest evolving

technologies of the modern day. For that reason, libraries should follow the many innovations

search engine developers are proposing and implementing to see how they might be adapted for

their own purposes. Web-based search engines are not an enemy of the library, but rather, an

important ally in providing easy access to information to the public.



References

Bokhari, M. U., & Adhami, M. K. (2016). How well they retrieve fresh news items: News search
engine perspective. Perspectives In Science, doi:10.1016/j.pisc.2016.06.002

Campos, R., Dias, G., Jorge, A., & Nunes, C. (2016). GTE-Rank: A time-aware search engine to
answer time-sensitive queries. Information Processing & Management, 52(2), 273-298.
doi:10.1016/j.ipm.2015.07.006

Gavali, R. (2015). Discovery Service for Engineering and Technology Literature through
Google Custom Search: A Case Study. DESIDOC Journal Of Library & Information
Technology, 35(6), 417-421. doi:10.14429

Goldman, E. (2011). Contemporary Issues in Cyberlaw: Revisiting Search Engine Bias. William
Mitchell Law Review, 38, 96.

Gürkaynak, G., Yılmaz, İ., & Durlu, D. (2013). Understanding search engines: A legal
perspective on liability in the Internet law vista. Computer Law And Security Review: The
International Journal Of Technology And Practice, 29, 40-47.
doi:10.1016/j.clsr.2012.11.009

Hecht, B., Teevan, J., Ringel Morris, M., & Liebling, D. (2012). SearchBuddies: Bringing Search
Engines into the Conversation. Association for the Advancement of Artificial
Intelligence 2012, 1-8. Retrieved from
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/searchbuddies_icwsm2012.pdf

Jalal, S., Sutradhar, B., Sahu, K., Mukhopadhyay, P., & Biswas, S. (2015). Search Engines and
Alternative Data Sources in Webometric Research: An Exploratory Study. DESIDOC
Journal Of Library & Information Technology, 35(6), 427-435. doi:10.14429

Kim, K., & Tse, E. (2014). Search engine competition with a knowledge-sharing service.
Decision Support Systems, 66, 180-195. doi:10.1016/j.dss.2014.07.002

Liu, F., Shi, Y., & Wang, B. (2015). World Search Engine IQ Test Based on the Internet IQ
Evaluation Algorithms. International Journal Of Information Technology & Decision
Making, 14(2), 221-237. doi:10.1142/S0219622015500030

Mohd, M. (2011). Development of Search Engines using Lucene: An Experience. Procedia -
Social And Behavioral Sciences, 18(Kongres Pengajaran dan Pembelajaran UKM, 2010),
282-286. doi:10.1016/j.sbspro.2011.05.040

Modesto, D. M., & Ferreira, S. L. (2014). Guidelines for Search Features Development – A
Comparison between General Users and Users with Low Reading Skills. Procedia
Computer Science, 27(5th International Conference on Software Development and
Technologies for Enhancing Accessibility and Fighting Info-exclusion, DSAI 2013),
334-342. doi:10.1016/j.procs.2014.02.037

Ramesh, N., & Andrews, J. (2015). Personalized Search Engine using Social Networking Activity.
Indian Journal of Science and Technology, 8(4), 301-306. doi:10.17485

Schuth, A.G. (2016). Search Engines that Learn from Their Users (Unpublished doctoral
dissertation). University of Amsterdam, Amsterdam. Retrieved from
http://www.anneschuth.nl/wp-content/uploads/thesis_anne-schuth_search-engines-that-learn-from-their-users.pdf

Terrell, H. (2015). Reference is Dead, Long Live Reference: Electronic Collections in the
Digital Age. Information Technology & Libraries, 34(4), 55-62. doi:10.6017

Tran, T., & Yerbury, H. (2015). New Perspectives on Personalised Search Results: Expertise and
Institutionalisation. Australian Academic & Research Libraries, 46(4), 277-290.
doi:10.1080/00048623.2015.1077302

Wright, A. (2016). Reimagining Search. Communications Of The ACM, 59(6), 17-19.
doi:10.1145/2911971

Yang, L. (2016). Metadata Effectiveness in Internet Discovery: An Analysis of Digital
Collection Metadata Elements and Internet Search Engine Keywords. College &
Research Libraries, 77(1), 7-19.

Zineddine, M. (2016). Search engines crawling process optimization: a webserver approach.
Internet Research, 26(1), 311-331. doi:10.1108/IntR-02-2014-0045