Designing Rich Internet Applications For Search Engine Accessibility

Introduction
Rich Internet Applications create new opportunities. The most fundamental of these is the ability to create Single Page Interfaces (SPIs). An SPI is an interface that consists of a single HTML page. Additional information that is required, when the user clicks on a ‘link’ or when some other event occurs, is not supplied by means of a traditional full page reload, but is instead retrieved via an XML message. The original page remains intact; its contents or state is simply updated with the contents of the XML message. JavaScript is used to facilitate this whole process. Although it is not mandatory to create an SPI when using Backbase’s software, an SPI provides a more intuitive user interface and a smoother user experience.

There are, however, a few questions that need to be answered when you make use of this new paradigm. One of the main questions is that of search engine accessibility and deep linking. The web sites that have been created up until now consist almost entirely of Multi Page Interfaces (MPIs). These web sites and applications consist of multiple unique pages, which may or may not have been dynamically generated. Since each page, and for dynamic pages every page state, has a unique URI, it is very easy to link to any page or state within such a site. Navigation between pages is done by the user clicking on links or submitting forms, both of which contain the location and state information for the new page. It is these unique URIs that make deep linking possible. A deep link does not just point to a particular web site, but links directly to a specific page within that site.

It is this MPI paradigm that informs the robots used by search engines such as Google or Yahoo to index the information in web sites. Search bots are software agents that ‘crawl’ through web sites: they start at the index page and, after categorizing all of the information on the page, they follow the links on this page to other pages on the site. In this way they crawl through the entire web site, visiting any page that has been linked to using a link tag of the type:
<a href="nextPage.html">Next Page</a>

In an SPI, however, the linked page structure that a search bot expects has been extended with BXML commands, which indicate the use of include files, load commands and form submissions that only partially update the page instead of causing a full reload, as is the case with normal forms. Since search bots aren’t proper web browsers, they don’t understand or execute any JavaScript. This means that a Backbase SPI needs to be specifically designed to work with these search bots. This article puts forward a set of guidelines that you can use to design your SPI for maximal search engine accessibility, and shows you techniques that allow for deep linking into your SPI.

Making SPIs Search Engine Accessible
Several approaches are available for making your web site accessible to search engines; they differ in the level of indexing that can be obtained and in how this is achieved. For some sites it is not a requirement that every part of the site can be indexed by search engines. A site providing a web-based e-mail service, for example, does not need every single piece of information it contains to be indexed by a search bot. Other sites, however, do require that every piece of information can easily be found and indexed by search engines; a web site with information about the courses provided by a university is such a case. Backbase has identified the following strategies for getting an SPI indexed by search engines:

Lightweight Indexing: no structural changes are made to your site; existing tags such as meta, title and h1 are leveraged.
Extra Link Strategy: extra links are placed on the site, which search bots can follow and thereby index the whole site.
Secondary Site Strategy: a secondary site is created, which is fully accessible to the search engine.

For each of these strategies the following questions will be answered: To what extent is the content of the page indexed? Can links on the page be followed (e.g. link elements (<a href="xx">) or s:include elements)? When a link is followed by the search bot, what is the status of the URL that is being indexed: can this URL be displayed by browsers, or will some type of redirection be required?

Lightweight Indexing
This strategy should be used if only certain key information needs to be indexed by search engines. In this case it is recommended that you take the following steps when designing your SPI:

Use a title element in the document head, preferably containing one or more keywords that specifically relate to the contents of the site. For example:

<title>BXML WebMail – Sign In</title>

Use a keywords meta element with a content attribute containing some appropriate keywords. For example:

<meta name="keywords" content="WebMail, email, bxml, mail" />

Use a description meta element with a content attribute that contains a relevant description of the web page. The value of this element is often printed as part of a search result by Google. For example:

<meta name="description" content="A Free BXML WebMail application. This unique WebMail application offers the look and feel of a normal Windows application, with the ease and portability of a web-based client." />

Place key content within the main HTML structure and not in an include file or other dynamically loaded content. If possible, place this important content within an h1, h2 or h3 element, since search bots deem these to contain more important information. Remember that these tags can be styled in any way you want using CSS.

These points can also be put to good use in the design of your SPI in conjunction with the extra link strategy or the secondary site strategy. In summary, with the lightweight indexing strategy only the content supplied by the title and meta elements, and the elements located directly on the index page, is indexed. No links of type s:include are followed, so there is no need to deal with redirection. This is not a very complete indexing scheme, but it is extremely simple to apply to your site.

The Extra Link Strategy
There are two main approaches to making a site fully indexable by search engines: the extra link strategy and the secondary site strategy. The extra link strategy is the easier of the two to implement and can make the site entirely indexable, but it does not create a secondary site in plain HTML and is therefore not accessible to older browsers that are incompatible with BXML. The essence of this strategy is to create an extra link on the main SPI index page for each include file whose contents you wish to be indexed. Some experimentation has revealed that the extra links must be of the type:

<a href="include1.html">include 1</a>

The following points must be followed if you want Google to index these pages:

The link must be made by an a element and the include file must be indicated by the href attribute.

The include file must have the .html or .htm file extension. This is a bit of a workaround, since include files aren’t really HTML files but XML files. However, if you use a div element or a similar HTML element as the root tag, all modern browsers will be able to read the file as if it were HTML, and Google will index it. As far as the BPC (Backbase Presentation Client) is concerned, an include file merely has to be well-formed XML; the BPC is not interested in which file extension it uses. NB: the include files should not have an XML declaration or a document type definition, otherwise Internet Explorer will not accept .html or .htm files as include files.

The link tag must have some text content; without it Google will simply ignore the link. No attempt should be made to hide these links using HTML, since Google frowns on this and may not index such pages. You can, however, use BXML to remove or hide these links by way of a construct event handler, as shown in the example below:
<div>
  <s:event b:on="construct">
    <s:setstyle b:display="none" />
  </s:event>
  <a href="leftPanel.html">Left Panel</a>
  <a href="rightPanel.html">Right Panel</a>
</div>

It is not necessary to detect the user agent of the search bots (see the appendix at the end of this article for full details of user-agent detection), since they will simply follow the extra links that are provided for them. However, some detection is necessary when these include files are served up, because the include files can be requested in two different ways. When a user is directed to one of these pages through a search engine, they need to be redirected to the main index page. On the other hand, when the BPC requests these pages as include files, no redirection should occur. Because both search bots and the BPC ignore meta refresh tags, this problem can be solved by placing a meta refresh tag directly inside the body of the include file. Even though these tags are normally placed inside the head element, all BXML-compatible browsers will still execute them anywhere in the body. Below is an example of such a meta refresh tag:

<meta http-equiv="refresh" content="0;url=index.html" />

Once the browser has been redirected to the SPI index page, this page must parse out the referrer and trigger an event handler, which will update the state of the SPI accordingly. This process of detecting deep linking and updating the page state is explained in much more detail in the appendix at the end of this document. In summary, the extra link strategy makes the whole site fully indexable. By adding extra link elements, search bots are able to index all pages of the site. However, since the URLs of the pages that get indexed point to include files, which aren’t fully BXML-capable pages, it is necessary to redirect normal browsers back to the SPI version of the site and then update the state of this SPI accordingly.

The Secondary Site Strategy
The secondary site strategy is the most complete of all of the indexing strategies. It is also the most labor intensive. The secondary site should be made out of plain HTML and contain a linked multi-page structure. Though this may seem laborious, having a secondary site to fall back upon also makes your site available to people using older browsers that aren’t supported by Backbase, to browsers on mobile devices and to disabled users. This gives you a chance to make your site accessible to all users, not just search engines. This strategy has three important components:

1. Generating the secondary site’s pages.
2. User-agent detection of both the search bots and BXML-compatible browsers.
3. Redirection of browsers and the detection of this redirection, which allows the state of the SPI to be updated to reflect the deep linking.
Generating the Search Engine Accessible Pages
The search engine accessible pages can be generated in several ways: you can build the secondary fall-back site manually, or you can automate the process using XSLT.

Manual Site Generation. This is a simple, low-tech solution, but it is also labor-intensive, since you have to build two versions of your web site. There is also a danger that when you update your site with new information, you will forget to update the secondary pages. The two versions of the site will then be out of sync with each other, and the information found on search engines will no longer be up to date.

XSLT-Driven Generation. An alternative strategy, which is especially effective if you use a content management system (CMS), is to store all of the information for your site, or at least its ‘copy’, as plain XML. This can be in a format defined by yourself or by your CMS. This XML is then transformed into BXML by an XSLT stylesheet. A second, much simpler, stylesheet is used to transform the same XML into the secondary, search-engine-accessible site. Although this approach requires a little more effort when you initially develop the site, once both stylesheets are ready, new content can simply be added to the XML data source and both versions of the site are generated automatically.

User-Agent Detection
A vital component of this two-site strategy is browser detection. Techniques that can be used for user-agent detection are discussed in the appendix at the end of this article. Once the user agent has been detected, you must make sure that BXML-compatible browsers are sent to the BXML site and that search bots and non-BXML-compatible browsers are sent to the accessible site.

Deep Linking and Browser Redirection
This section looks at an issue that arises from having a secondary, multi-page version of your site indexed by search engines. The solution to this problem also immediately provides a solution to the issue of deep linking into an SPI. The issue boils down to the fact that a site with multiple pages is being used to represent a site that consists of only a single page. Let’s take an example to illustrate this: a simple SPI whose main index page consists of a tabbed interface with three tabs. The contents of each tab are stored in a separate include file and loaded into the SPI as and when they are required. To make this site indexable by a search engine, an MPI version of the site would be made with one index page (e.g. index.html) and three separate HTML pages representing the include files for each of the tabs (e.g. tab1.html, tab2.html and tab3.html). Now if a user’s search term closely matches something indexed on the third tab, the search engine will point the user to tab3.html. In reality, however, you do not want the user to end up at tab3.html. Instead, you want them to be sent to the index.html page of the SPI version of your site and, when that page is opened, the third tab, which corresponds to tab3.html, should be selected.

The full solution to this problem consists of two parts. Firstly, BXML-compatible browsers need to be redirected to the SPI version of the site. Secondly, the SPI version needs to detect that it has been redirected from one of these deep-linked pages and then update the state of the page accordingly, so that the information relevant to this link is shown.

Browser Redirection. When one of the MPI pages intended for the search engine is requested, the user agent must be detected again. In this case, however, it is the BXML-compatible browser that is redirected when detected, not the search bot. The browser is sent to the index page of the SPI version of the site.

Detecting Deep Linking. The BXML version of index.html needs to ascertain from which page it was referred. This must be done as soon as the page is loaded, so that the transition appears seamless to the user. Full details of how to detect deep linking and how to update the page state can be found in the appendix at the end of this article.

In summary, the secondary site strategy makes the whole site fully indexable. Since the search bot is directed to a normal HTML site, it can follow all of the links. However, since the URLs that get indexed when these links are followed point to non-BXML pages, it is necessary to redirect normal browsers back to the SPI version of the site and then update the state of this SPI accordingly.

Ethics
Google in particular, and presumably other search engines as well, deeply frowns upon any attempt to unfairly manipulate search results. Any site that is caught willfully trying to manipulate Google will be banned from Google’s index. Redirecting to another site with different content, based on the user agent, is technically known as cloaking and is frowned upon. Therefore, you should make sure that the information conveyed by any secondary web site, set up with the intention of making your site indexable by Google and other search engines, is exactly the same as the information contained in your BXML site.

Appendix
User-Agent Detection
A vital component of both the secondary site strategy and the extra link strategy is browser detection. The technical term for a web browser, a search robot or any other piece of software that accesses a web site is ‘user agent’. When a user agent requests a particular page, it supplies details about itself by way of one of the HTTP headers sent along with the request. The Firefox browser, for instance, sends the following request header:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

It is therefore relatively straightforward to write a script that determines what the user agent is and then redirects it to the appropriate version of the site. The most straightforward technique is not to try to identify the search bots or other incompatible browsers, since this group is relatively large and hard to qualify. It is easier to determine whether the user agent is a BXML-compatible browser and to assume that, if it isn’t one of these, it is either a search bot or an incompatible browser. The following browsers are BXML compatible:

• Internet Explorer .0 and newer
• Mozilla 1.5 and newer
• Firefox 1.0 and newer
• Netscape . and newer

User-agent detection can be done on the server using a PHP, ASP or JSP script; there are standard libraries that help take care of this. Alternatively, if you cannot or do not wish to use server-side scripts to determine the user agent, it is possible to do this in JavaScript. If you take this approach, you should be aware that search bots cannot be expected to execute any JavaScript. Therefore, if you are using the secondary site strategy in conjunction with JavaScript-based detection, the default page provided by the initial page request must be the non-BXML site intended for the search bot. When you ascertain that the user agent is a BXML-compatible browser, JavaScript should redirect the browser to the BXML version of your site. The following code fragment shows a simple JavaScript function that tests whether a BXML-compatible Mozilla-based browser is in use and redirects the browser accordingly.
function testUA(){
  var bCompatible = false;
  var sUA = window.navigator.userAgent;
  //Test if the User-Agent string contains
  //the string Gecko
  var iIOGecko = sUA.indexOf("Gecko");
  if (iIOGecko >= 0){
    //extract the string directly after rv:
    //and check value
    var iIOrv = sUA.indexOf("rv:");
    var sRv = sUA.substr(iIOrv + 3, 3);
    if (sRv >= '1.5') bCompatible = true;
  }
  //now if compatible redirect
  if (bCompatible) window.location.href = "bxmlIndex.html";
}

This function is relatively straightforward, but certain parts may need explaining. Firstly, both the Netscape and Firefox browsers use the same Gecko core as Mozilla does, and they have similar User-Agent strings. Therefore, the function above first searches for the ‘Gecko’ sub-string, which all of their User-Agent strings contain. Once this sub-string has been found, the function searches for the ‘rv:’ sub-string. This is short for revision, and it is followed by the version number of the Gecko engine. If this number is 1.5 or higher, the Gecko engine is BXML compatible. This relatively simple function is therefore able to test for all compatible Netscape, Firefox and Mozilla browsers. Obviously, it is also necessary to test for compatible versions of Internet Explorer. This can be done in a similar way, but there is one added complication. All compatible versions of Internet Explorer have a User-Agent string that contains the sub-string ‘MSIE’, which is directly followed by the version number. Below is an example of such a header from an Internet Explorer browser.
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Unfortunately, however, Opera browsers have a very similar User-Agent string:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.00

Therefore you must first check that the User-Agent string does not contain the ‘Opera’ sub-string; once this has been ascertained, simply parse out the version number that follows the ‘MSIE’ sub-string.
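For illustration, here is a minimal sketch of such a check, written in the same style as the testUA function above. The function name testIE and the minimum version number (6.0) are assumptions, not taken from this document, so adjust them to the Internet Explorer versions your BPC actually supports:

function testIE(){
  var bCompatible = false;
  var sUA = window.navigator.userAgent;
  //ignore Opera, which also advertises 'MSIE'
  if (sUA.indexOf("Opera") < 0){
    var iIOMSIE = sUA.indexOf("MSIE");
    if (iIOMSIE >= 0){
      //extract the version number directly after 'MSIE '
      var sVersion = sUA.substr(iIOMSIE + 5, 3);
      //6.0 is an assumed minimum version; change it as required
      if (parseFloat(sVersion) >= 6.0) bCompatible = true;
    }
  }
  //now if compatible redirect
  if (bCompatible) window.location.href = "bxmlIndex.html";
}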

Detecting Deep Linking and Updating the Page’s State
This section looks at how redirection based on deep linking can be detected, and at how the state of a page can then be updated using this information. Deep linking can be detected on the server by reading the Referer HTTP request header using a server-side script. Once the referrer has been read, an appropriate construct event handler must be created, which updates the initial state. Alternatively, if you do not have access to server-side scripting, you can use a JavaScript function to do this. The js action is a special BXML action that is used to call JavaScript functions. The following behavior takes care of calling this function when the page is loaded:
<s:behavior b:name="updateState">
  <s:event b:on="construct">
    <s:task b:action="js" b:value="updateState();" />
  </s:event>
</s:behavior>

The updateState function, which this action calls, then needs to parse out the referrer. Once this value has been found, the JavaScript function triggers an appropriate BXML event, thereby passing control back to the BPC. This is done by calling the execute method of the bpc object with a BXML string. A simple version of such a function looks like this:

function updateState(){
  //first parse out the value of the referrer
  var sReferrer = document.referrer;
  //do a quick test to make sure that the referrer
  //is from the same host
  if (sReferrer.indexOf(window.location.hostname) >= 0){
    var iLastSlash = sReferrer.lastIndexOf('/');
    var sValue = sReferrer.substr(iLastSlash + 1);
    //trigger an event with the same name as the referrer
    var sExecute = '<s:task b:action="trigger" b:event="' + sValue + '" b:target="id(\'main\')" />';
    bpc.execute(sExecute);
  }
}

You should note that this is a very simplistic implementation of such a referrer-parsing function. For a more complicated web site structure, it is important that it is totally unambiguous which page the referrer was, otherwise mistakes can be made. For such cases, more elaborate JavaScript will be required to verify this. Finally, let’s look at an example of the type of event handler that could be triggered by such an updateState function:

<s:behavior b:name="redirect">
  ... Other event handlers go here ...
  <s:event b:on="tab3.html">
    <s:task b:action="select" b:target="id('tab3')" />
  </s:event>
  ... Other event handlers go here ...
</s:behavior>

This behavior contains an event handler for the custom event tab3.html, which is triggered by the JavaScript function when redirection has occurred from the tab3.html page. All it does is perform a select action on a target with an id of ‘tab3’. If this corresponds to the appropriate tab, then simply by selecting it, the tab is loaded and becomes visible.
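As noted above, triggering an event named after whatever file appears in the referrer can become ambiguous for more complicated site structures. A slightly more defensive variant, given purely as an illustrative sketch (the updateStateStrict name and the tab1/tab2/tab3 file and event names are assumptions based on the earlier tab example), could use an explicit map from known referrer pages to BXML events and ignore anything it does not recognize:

function updateStateStrict(){
  //explicit map from referring page to the BXML event to trigger
  //(the file and event names are illustrative)
  var oEvents = {
    "tab1.html": "tab1.html",
    "tab2.html": "tab2.html",
    "tab3.html": "tab3.html"
  };
  var sReferrer = document.referrer;
  //only act on referrals from our own host
  if (sReferrer.indexOf(window.location.hostname) < 0) return;
  var sPage = sReferrer.substr(sReferrer.lastIndexOf('/') + 1);
  //strip any query string or fragment from the file name
  sPage = sPage.split('?')[0].split('#')[0];
  //only trigger events for pages we explicitly know about
  if (oEvents[sPage]){
    var sExecute = '<s:task b:action="trigger" b:event="' + oEvents[sPage] + '" b:target="id(\'main\')" />';
    bpc.execute(sExecute);
  }
}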
