Intranet Search Engine
Although search over World Wide Web pages has recently received much academic and commercial attention, surprisingly little research has been done on how to search the web pages within large, diverse intranets. Intranets contain the information associated with the internal workings of an organization, and they create new challenges for information retrieval. The amount of information on the intranet is growing rapidly, as is the number of new users inexperienced in the art of intranet research. Earlier works that compared intranets and the Internet from the viewpoint of keyword search have pointed to several reasons why the search problem is quite different in these two domains. In this project, we address the problem of providing quality answers to navigational queries over the intranet. As intranets grow, providing access to more and more documents, their value grows; but the larger the collection, the harder it becomes to find that important presentation, contract, or HR form. Enterprise Information Portals provide a starting point to intranets, and a search engine helps locate information, including archives and unstructured data. Search engines need to be tuned and indexed to provide the best answers. Our approach is based on crawler identification of navigational pages, intelligent generation of term variants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. This chapter outlines the aims of the project and the motivation behind its implementation.
Just one example of improved usability from taking advantage of managed diversity: an intranet search engine can take advantage of weighted keywords to increase precision. Weights are impossible on the open Internet, since every site about widgets will claim to have the highest possible relevance weight for the keyword "widget." On an intranet, even a light touch of information management should ensure that authors assign weights reasonably fairly and that they use, say, a controlled vocabulary correctly to classify their pages. An intranet is a network of computers that can be accessed only by an authorized set of users within an organization. Its purpose is typically to share information and computing resources
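The effect of author-assigned keyword weights can be sketched as follows. This is an illustrative example, not any real engine's scoring formula; the page metadata layout and the default weight of 1 are assumptions.

```python
# Sketch: boosting relevance with author-assigned keyword weights.
# The "weights" metadata field and its scale are assumptions.

def score(page, query_terms):
    """Combine raw term frequency with an author-assigned weight per keyword."""
    total = 0.0
    for term in query_terms:
        tf = page["text"].lower().split().count(term)     # term frequency
        boost = page.get("weights", {}).get(term, 1)      # default weight 1
        total += tf * boost
    return total

pages = [
    {"text": "widget catalogue widget specs", "weights": {"widget": 5}},
    {"text": "a page that mentions widget once", "weights": {}},
]
ranked = sorted(pages, key=lambda p: score(p, ["widget"]), reverse=True)
```

Because weights are assigned under editorial control, the catalogue page legitimately outranks an incidental mention, which is exactly what cannot be trusted on the open Internet.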
among employees within an organization. The term search engine is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. Crawler-based search engines, such as Google, create their listings automatically. Human-powered directories, such as the Open Directory, depend on humans for their listings: the search looks for matches only in the descriptions submitted, so changes to the web pages themselves have no effect on the listing. The only exception is that a good site, with good content, is more likely to be reviewed. There are two types of intranet search, namely desktop-based and web-based. Desktop-based search addresses the whole spectrum of electronic information that might be found in an organization, including video, images, databases, and so on.
Most basically, the intranet and the website are two different information spaces. They should look different in order to let employees know when they are on the internal net and when they have ventured out to the public site. Different looks will emphasize the sense of place and thus facilitate navigation. Also, making the two information spaces feel different will facilitate
an understanding of when an employee is seeing information that can be freely shared with the outside and when the information is internal and confidential. An intranet design should be much more task-oriented and less promotional than an Internet design. An organization should only have a single intranet design, so users only have to learn it once. Therefore it is acceptable to use a much larger number of options and features on an intranet since users will not feel intimidated and overwhelmed as they would on the open Internet where people move rapidly between sites. An intranet will need a much stronger navigational system than an Internet site because it has to encompass a larger amount of information. In particular, the intranet will need a navigation system to facilitate movement between servers, whereas a public website only needs to support within-site navigation.
An effective search tool on an intranet can make an enormous difference to its usability. A good search engine ensures that users find what they're looking for, first time, regardless of the format or location of the information. This means that a wide variety of information can be effectively dispersed and made available to staff, without the need for complex navigation systems or filing conventions.
Our project aims to help the user search and access text information. The search will be content-based. As stated earlier, there is a great deal of information available for the user to access on the intranet, but only the specific required information is to be searched, sorted, and presented in a systematic manner, thus increasing the availability of useful information. Access will be given only to data that is shared, thus preventing unauthorized access.
3.1.1 Gathering
The index should be kept current. As soon as new content is published, it should be indexed. Publishing or content management systems can notify the indexer of new data; otherwise, index the frequently changing areas more often. If the search engine cannot respond to queries while updating, use mirrored servers or switch search engines.
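The "index frequently changing areas more often" policy can be sketched as a simple freshness schedule. The area names and intervals below are illustrative assumptions, not part of any real system.

```python
# Sketch of a freshness-driven re-indexing schedule: each content area
# gets its own re-index interval, and the gatherer polls for areas
# whose interval has elapsed. Names and intervals are invented.
import time

AREAS = {
    "news":     {"interval": 3600,   "last_indexed": 0},   # hourly
    "hr-forms": {"interval": 86400,  "last_indexed": 0},   # daily
    "archive":  {"interval": 604800, "last_indexed": 0},   # weekly
}

def due_for_reindex(areas, now=None):
    """Return the names of areas whose re-index interval has elapsed."""
    now = time.time() if now is None else now
    return [name for name, a in areas.items()
            if now - a["last_indexed"] >= a["interval"]]
```

A publishing system that notifies the indexer directly would simply bypass this polling loop for the area it changed.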
3.1.2 Indexing
In addition to HTML, XML, and plain text, intranet search engines deal with binary file formats such as PDF, the MS Office formats (Word, Excel, and PowerPoint), WordPerfect, and others. The index should store the entire content of every file, even very long documents. It should keep every word and its position in the document, for later phrase searching and match highlighting.
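Storing word positions is what makes phrase search possible. A minimal positional inverted index can be sketched as follows (a real engine would add compression, tokenization rules, and on-disk storage):

```python
# Minimal positional inverted index: term -> {doc_id: [positions]}.
# Phrase search checks that the words occur at consecutive positions.
from collections import defaultdict

def build_index(docs):
    """Index every word of every document together with its position."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Find documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    hits = []
    for doc_id, positions in index.get(words[0], {}).items():
        for p in positions:
            if all(p + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words[1:], start=1)):
                hits.append(doc_id)
                break
    return hits

docs = {"a": "Annual report for human resources",
        "b": "resources report annual"}
idx = build_index(docs)
```

Here document "b" contains both words of the phrase "annual report" but not adjacently, so only "a" matches; the same positional data also drives match highlighting.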
Intranets generally include various levels of security and access controls, and the index should store this information, so it can show only the accessible content in the search results. For high-security content, it is a good idea to create a separate index file to avoid co-mingling private and public text.
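Storing access information in the index allows a security trim at query time, which can be sketched as follows. The group names and result layout are invented for illustration.

```python
# Sketch: each indexed document carries the set of groups allowed to
# see it, and results are filtered against the user's groups before
# display. Group and document names are hypothetical.
def filter_results(results, user_groups):
    """Keep only documents that at least one of the user's groups may see."""
    return [doc for doc in results
            if doc["allowed_groups"] & user_groups]

results = [
    {"url": "/hr/salaries.xls",  "allowed_groups": {"hr"}},
    {"url": "/news/today.html",  "allowed_groups": {"everyone"}},
]
visible = filter_results(results, {"everyone", "engineering"})
```

Note that this trims the result list only; as the text says, truly high-security content is better kept in a separate index file so it never mingles with public text at all.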
3.1.3 Crawling
The general algorithm involves starting from the root directory and reaching new web pages by following their links, backtracking as needed. The process continues until the entire website (intranet) is indexed. In addition, our crawler is able to recognize duplicate pages and discard them accordingly.
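The crawl loop with duplicate detection can be sketched as below. `fetch` and `extract_links` stand in for real HTTP and HTML-parsing code, and duplicates are detected by hashing page content; this is one common technique, not necessarily the exact one our crawler uses.

```python
# Sketch of the crawl described above: follow links outward from the
# root, and discard any page whose content hash has been seen before.
import hashlib
from collections import deque

def crawl(root, fetch, extract_links):
    seen_urls, seen_hashes, pages = set(), set(), []
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in seen_urls:
            continue
        seen_urls.add(url)
        body = fetch(url)
        digest = hashlib.sha256(body.encode()).hexdigest()
        if digest in seen_hashes:          # duplicate content: discard
            continue
        seen_hashes.add(digest)
        pages.append((url, body))
        queue.extend(extract_links(body))
    return pages

# A toy three-page site where /b duplicates /a's content.
site = {"/": "link:/a link:/b", "/a": "unique page a", "/b": "unique page a"}
fetch = site.__getitem__
extract_links = lambda body: [w.split(":", 1)[1] for w in body.split()
                              if w.startswith("link:")]
pages = crawl("/", fetch, extract_links)
```

In the toy run, /b is fetched but discarded because its content hash matches /a, so only two pages enter the index.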
3.2.1 Pros:
Search engines provide access to a fairly large portion of the publicly available pages on the internet and intranet, which are themselves growing exponentially. Search engines are the best means yet devised for searching the internet and intranet. Stranded in the middle of this global electronic library of information without either a card catalog or any recognizable structure, how else are you going to find what you're looking for?
3.2.2 Cons:
On the down side, the sheer number of words indexed by search engines increases the likelihood that they will return hundreds of thousands of responses to simple search requests. Remember, they will return lengthy documents in which your keyword appears only once. Additionally, many of these responses will be irrelevant to your search.
Two common causes of search failure are that the user has made a spelling or typing mistake, or that the user is doing a search in which not all the query requirements are met (for example, one word was matched but the other was not).
To avoid common search failures, create a page that explains these errors and helps users understand what is within the scope of the search engine. If a taxonomy or hierarchy exists, display it on the page to allow users to drill down through the categories.

4. Search Log Analysis

Search logs are a great window into the minds of intranet users. If the search log tracks the query and the number of matches, it becomes possible to count the 25 or 100 most popular search terms and to make sure these topics are adequately covered. It is also possible to track the most common terms that find no matches and to address those problems.

5. The Indexer

Full-text indexing literally creates a virtual copy of the entire website. The option is still feasible here, as it only encompasses intranet searches; with it, content can be subjected to further scrutiny, yielding more precise information. The first step is to create an index; this index will contain location information for each and every word in all of your documents. The index is created externally to the files and does not affect them in any way. Indexed documents are typically specified by directory and extension. There can either be one index for all of the files, or several separate indexes, each for a different project. The indexes are automatically updated when new documents are created or existing documents are changed. However, any changes to the table structure, such as configuration data, will require a complete rebuild of the full-text index. Once there is an index, it can be used to locate, view, and retrieve information: the search query uses the indexes to locate the required information in your documents. Results are displayed almost instantly, despite the index's relatively large size, proving the speed and advantages of implementing indexes.
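The log mining described under Search Log Analysis can be sketched in a few lines. The (query, match_count) log format is an assumption about what the engine records.

```python
# Sketch: count the most popular queries and the queries that found
# nothing, from a log of (query, number_of_matches) pairs.
from collections import Counter

log = [("expenses", 40), ("holiday form", 12), ("expenses", 3),
       ("canteen menu", 0), ("expenses", 7), ("holiday form", 0)]

popular = Counter(q for q, _ in log).most_common(2)        # top queries
zero_hits = Counter(q for q, n in log if n == 0).most_common()
```

The popular list tells you which topics must be well covered; the zero-hit list points directly at content gaps or vocabulary mismatches to fix.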
be able to correctly interpret and index the most frequently used or the most important of these formats. If meta-information and XML tags are likely to show up within the documents, the spider must be able to interpret such tags, and it would also be useful if RDF-formatted information could be gathered intelligently. If USENET newsgroups need to be indexed, the spider must be able to crawl through them. The same goes for client-side image maps, CGI scripts, ASP-generated pages, pages using frames, and Lotus Domino servers. Although frames are frequently used within many companies, spiders, which generally work their way round the net by picking up and following hypertext links, may not be able to correctly interpret the different syntax used for framed pages; those links could end up ignored. Spidering Domino servers via HTTP requests requires the search engine to intelligently filter out the many collapsed/expanded versions of the same page, or the index will quickly fill with duplicates. Another, and arguably better, way is to access Domino servers via the provided APIs. Another situation likely to require access via APIs rather than crawling through HTTP is when Content Management (CM) systems are used. In CM tools, the actual content of a page is stored separately from the page layout information. Since pages are rendered dynamically only when requested by a user (via her browser), the spider may not be able to pick up the link information embedded in the page code. Without those links, the spider will not be able to find the information. Even if the information is found and indexed correctly, it might be difficult for the search engine to decide how to display a search result, since the indexed information may belong to several dynamic pages. This is an area not yet fully explored by search engine vendors, and proposed solutions should be investigated carefully.
Intelligent robots are able to detect copies or replicas of already indexed data while crawling and advanced search engines can index active sites, e.g. sites that update frequently, more often than sites that are more passive. If this is not supported, some manual means of determining time-to-live should be provided. There should be some means of restricting the robot from entering certain areas of the net, including any desired domain, sub-net, server, directory, or file level. Also, check if search depth can be set to avoid loops when indexing dynamically generated pages. Support for proxy servers and password handling can be useful, as can the ability to not only follow links but also detect directories and thus find files not linked to from other pages. The spider should be easy to set up and start. Check how the URLs from which to start are specified as well as if the users may add URLs. Finally, the Robot Exclusion Protocol provides a way for the webmaster to tell the robot not to index a certain part of a server. This should be used to avoid indexing temporary files, caches, test or backup copies, as well as classified information such as password files.
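The Robot Exclusion Protocol mentioned above is supported directly by Python's standard library, which a spider can use to honour disallowed areas. The robots.txt content and user-agent name below are made up for the example.

```python
# Sketch: honouring the Robot Exclusion Protocol, so the spider skips
# temporary and classified areas the webmaster has excluded.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /tmp/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed_private = rp.can_fetch("intranet-bot", "/private/passwords.txt")
allowed_public = rp.can_fetch("intranet-bot", "/docs/handbook.html")
```

A well-behaved crawler consults `can_fetch` before every request; the same mechanism covers the temporary files, caches, backup copies, and password files mentioned above.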
4.2.2 Index
Although a good index alone does not make a good search engine, the index is an essential part of a search tool. One of the most important issues is keeping the index up to date, and the best way to do that is to allow real-time updates. There is a big difference between indexing the full text or just a portion. Though partial indexing saves disk space, it may prevent people from finding what they are looking for. The portion of text being indexed also affects the data that is presented as the search result: some tools only show the first few lines, while others may generate an automatic abstract or use meta-information.

If the organization consists of several sub-domains, users might only want to search their specific sub-domain. Allowing the index to be divided into multiple collections might then speed up the search. It may also prove useful to split the index into several collections even though they are kept at one physical location; for example, one may want separate collections for separate topics or business areas.

Some tools support linguistic features such as automatic truncation or stemming of the search terms, where the latter is a more sophisticated form that usually performs better. If the organization is located in non-English-speaking countries, the ability to correctly handle national characters becomes important. Also, note that some products cannot handle numbers; if number searching is required, e.g. for serial numbers, this limitation should be taken into consideration. Should words that occur too frequently be removed from the index? Some engines have automatically generated stop-lists, while others require the administrator to remove such words manually.

Search engines are of little use if an overview of the indexed data is wanted, unless they are able to categorize the data and present it as a table of contents. Automatic categorization may also be used to focus in on the right sub-topic after having received too many documents.
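Stop-lists and stemming can be sketched together as an index-time normalisation step. The suffix-stripping rule below is a deliberately crude illustration; production engines use proper stemmers such as Porter's.

```python
# Sketch of index-time normalisation: drop stop-words, then apply a
# crude suffix-stripping stemmer. The stop-list and suffix rules are
# illustrative only; real engines use better stemmers (e.g. Porter's).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to"}

def stem(word):
    """Strip a few common English suffixes, keeping a minimal stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalise(text):
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

tokens = normalise("The user walks and walked while walking")
```

Applying the same normalisation to queries means "walks", "walked", and "walking" all match the same indexed stem, which is exactly the effect the text describes.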
If information about when a particular URL is due for indexing is available, it is useful to make it accessible to the user.
Dividing the results into specific categories might help the user to interpret the returned result. Finally, ensure the product comes with good and extensive online user documentation.
preferably by the end-user. Allowing end-users to add links is a feature that will off-load the administrator. Functions like email notification to an operator, should any of the main processes die, and good logging and monitoring capabilities, are features to look for. We found that products with a graphical administrator interface were more easily and intuitively handled, though the possibility of operating the engine via command line may sometimes be desired. It should also be possible to administer the product remotely via any standard browser. Documentation should be comprehensive and adequate. Finally, consider the price: is it a fixed fee or is it correlated to the size of the intranet? In addition, what kind of support is offered, and at what cost? Sometimes installation and training are included in the price. How long the product has been available and how often it is updated are important factors that indicate its stability, and it is also important to ask about future plans and directions.
subject directories, or creating their own directories, and returning results gathered from a variety of other guides and services as well.

Selecting a Search Engine

Before taking any action in determining the type of search engine, we need to determine our technical requirements. Once this is complete, research on currently available engines can be pursued, and an effective search engine built that caters to our needs.
Step 1: Entering the Query. When users enter their query, they should have the option to do this using a natural language approach; that is, by simply entering the question as they would ask it, such as "What is the cost of double-deck refrigerators?" There should also be the option to build queries using Boolean operators, so that users who know exactly what they want can be extremely specific with their search; for example, a query might ask for documents where "double-deck" appears within 10 words of "refrigerator" but not "freezer". Building a search engine with a simple user interface, to make sure it is intuitive for basic users, while also providing powerful advanced search functionality for more experienced users, is a definite aim of ours. A good search engine should enable you to group logical chunks of information together so that searches can be conducted on specific areas of interest.
Step 2: Getting the Search Results. If there is specifically defined data, such as legal documents, a high degree of precision may be required to identify and return specific information. In other situations, however, it may be better to return a wider range of documents for a given query. The accuracy we require depends on the role of the search engine and the nature of the data. If we want to make a large volume of data available on the intranet, providing a fast search engine is important; otherwise users find it frustrating to wait for the search engine to bring back the results. With smaller amounts of data this will be less of a concern; it all depends on the volume of data that we intend to make available on the intranet.
Any good search engine should use some form of intelligent relevancy determination. This is where the search engine, based on the query entered, makes a judgment about which results will be the most relevant, and ranks them accordingly.

Step 3: Finding the Right Answer. The search process doesn't stop once the user receives the list of results. They then need to refine and manipulate the results list until they find exactly what they were looking for. There are many features that can assist in this task, some of which include:

Document summary information: The display of useful document attributes such as file type, file size, date last changed, relevancy rating, and the number of hits (key words found) in the document. The display of an extract of the document, say several lines above and below the first hit, is helpful for determining the context in which the document has been returned.

Re-sorting: The ability to re-sort the results list using different criteria, such as title, number of hits, relevancy, date changed, file type, or any other criterion that makes sense for your organization.

Hit-to-hit navigation: The provision of navigation buttons enabling users to go directly to the first hit in the returned document, and thereon to the next or previous hit as required. This means users avoid having to read through pages and pages of a document before finding the relevant section, making the process much more efficient.

Hit highlighting: A familiar concept from searching the web, hit highlighting is when the key words, or hits, in a document are highlighted in a different colour. This feature is often not available in an intranet search engine, but it really should be, as combined with hit-to-hit navigation it enables users to immediately see the relevant sections of the document.

Fast preview: The ability to preview large non-HTML documents in a basic HTML format, without the need to download the whole document. This function enables users to view a few lines above and below each hit, and then to expand up or down to continue reading.

Search within: The ability to search within the current set of results, to narrow them further.

Although just some of the features available in intranet search engines, these are the main features
required to ensure that users have the best overall experience. Others that may be relevant to your organization might include intelligent agents that automatically advise users when relevant content appears in the data repository, or the ability to save or export search results.
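Two of the features above, the contextual extract and hit highlighting, can be sketched together. The `<b>` marker stands in for the colour highlighting a browser UI would apply.

```python
# Sketch: a context extract around the first hit, plus hit
# highlighting with a marker tag standing in for colour.
def extract(text, term, window=3):
    """Return `window` words either side of the first hit of `term`."""
    words = text.split()
    lowered = [w.lower() for w in words]
    if term.lower() not in lowered:
        return ""
    i = lowered.index(term.lower())
    return " ".join(words[max(0, i - window): i + window + 1])

def highlight(snippet, term):
    """Wrap every hit in a marker so the UI can colour it."""
    return " ".join("<b>%s</b>" % w if w.lower() == term.lower() else w
                    for w in snippet.split())

snippet = extract("the annual leave policy covers carry over", "policy",
                  window=2)
marked = highlight(snippet, "policy")
```

The positions stored at indexing time (see the Indexing section) are what let a real engine jump straight to each hit instead of rescanning the document as this sketch does.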
Always "and"
Few users understand the concept of Boolean operators. Instead, they expect that when they type in three words, they will be given only those documents that contain all three. Furthermore, typing in more words should produce fewer hits, not more. The search engine must therefore default to and-ing the words together. In fact, eliminate support for Boolean operators altogether, unless there is a clear case that they will be of value to your users.
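The default-AND behaviour can be sketched in a few lines: a document matches only if it contains every query word, so adding words narrows the result set.

```python
# Sketch of default-AND matching: every query word must be present,
# so longer queries return fewer documents, as users expect.
def matches(text, query):
    words = set(text.lower().split())
    return all(term.lower() in words for term in query.split())

docs = ["annual leave policy", "annual report", "leave request form"]
hits_one = [d for d in docs if matches(d, "annual")]
hits_two = [d for d in docs if matches(d, "annual leave")]
```

With one word the query matches two documents; adding a second word narrows the results to one, the opposite of OR semantics.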
Place the cursor
When the search page is opened, the cursor should already be in the search field (this is known as setting the focus). This allows the user to simply type in their words and hit enter. It's a small point, but it took only days for our users to specifically ask us for it.
Behind the scenes
Effort should be spent behind the scenes to improve the effectiveness of your search engine. Most engines have capabilities that, when implemented carefully, will help users to find the pages they are looking for. These features must operate transparently, so that the user is not even aware of their impact; users should simply find the search engine both easy to use and effective.

Fuzzy searching, stemming, and more
Our selected search engine provided a number of powerful searching capabilities:

Fuzzy searching, or sounds-like: There were three closely related options which were essentially designed to find terms which sounded like those entered by the user. In this way, it becomes possible to handle spelling mistakes and other inconsistencies.

Stemming: This feature takes the terms entered by the user and tries other combinations of endings. For example, searching for "walks" would also find "walk", "walking", and "walked". We found this to be very effective, and it eliminated differences between singular and plural uses of terms in our pages.

There is a wide variety of other tools available in modern search engines, beyond those mentioned above. As per our evaluation and study, we noted that just because a feature exists, it doesn't mean it will help the users.

Weightings and rankings
The order in which results are displayed by a search engine is the product of a number of complex weighting and ranking factors behind the scenes. These vary from engine to engine, and they have a big impact on how effective the search engine is. The main aim is to understand our search engine and configure it (if required) to meet our specific requirements. The key is to have the search engine work in a transparent and understandable way.
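One way to approximate the "sounds-like" behaviour is edit-distance-based matching from Python's standard library, as sketched below. This is not how our selected engine implements it; commercial engines may use phonetic codes such as Soundex instead.

```python
# Sketch of fuzzy query correction using difflib: map a possibly
# misspelled term to the closest term in the index vocabulary.
# The vocabulary and cutoff are illustrative assumptions.
import difflib

VOCABULARY = ["refrigerator", "freezer", "intranet", "search"]

def correct(term, vocabulary=VOCABULARY, cutoff=0.7):
    """Return the closest indexed term, or the term itself if none is close."""
    close = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return close[0] if close else term
```

Applied silently before the search runs, this keeps the feature transparent: the user who types "refridgerator" simply gets refrigerator results, without ever seeing the correction machinery.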
We have discussed the concept of the intranet search engine. Under this project, the mechanism of an intranet search engine was thoroughly examined; developing a search engine for an intranet requires thorough research into the organization's needs. In brief, we learned the following lessons as a result of this project:

Spend a lot of time identifying your needs and researching the right search engine. Choosing the wrong search engine is a costly mistake that is not easy to rectify halfway through a project.

Keep the interface simple. The search page should have a field to type in and a search button. Complex interfaces and advanced searches will confuse users: by default, your search engine should simply do what the users expect.

Take the time to configure the intelligence under the hood. The search engine should quietly assist the user to find the desired page (via synonyms, fuzzy searching, and so forth).

Track the usage of your search engine, and use this to assess how well it is working. You should be gathering enough information to allow you to refine the engine's configuration to better meet user needs.
Bibliography
[1] Cynthia P. Ruppel and Susan J. Harrington. Sharing Knowledge Through Intranets: A Study of Organizational Culture and Intranet Implementation, 2000.
[2] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. An Introduction to Information Retrieval, Online edition, 2009.
[3] Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan, and Alexander Löser. Navigating the Intranet with High Precision, 2007.
[4] Dick Stenmark. A Method for Intranet Search Engine Evaluations, Proceedings of IRIS22, 1999.
[5] Michael Chen, Marti Hearst, and Jason Hong. Cha-Cha: A System for Organizing Intranet Search Results, 2002.