Autonomy Internationalization WP 20031003

Internationalization & Advanced Linguistics
White Paper
www.autonomy.com
Table of Contents
1 Autonomy fundamentals
  1.1 Introduction
  1.2 Autonomy infrastructure technology
    1.2.1 Automated content operations
    1.2.2 Automatic classification
    1.2.3 Automatic personalization
  1.3 Language independence
2 Internationalization: Autonomy in the global context
  2.1 Introduction
  2.2 Key factors
  2.3 Autonomy's approach to language: APCM & intrinsic independence
  2.4 Document set granularity & Autonomy's document level property model
  2.5 Language packs & optimization: Autonomy's use of language dependent features
3 Advanced linguistic functionality: IDOL and its language operations
  3.1 Introduction
  3.2 Automatic language detection
  3.3 Cross-lingual systems
    3.3.1 Single language case
    3.3.2 Multi-language case
4 Language support lists
  4.1 Optimized languages
  4.2 Automatically detectable languages
5 Application examples
  5.1 Other examples
Table Of Figures
Figure 1: Semantic division of information
Figure 2: Legacy system: anti-semantic modeling of repositories
Figure 3: Legacy system: increased complexity through virtual database layers
Figure 4: Autonomy document level property model
Figure 5: Operation on a single language system
Figure 6: Operation on a multi-language system
Figure 7: Example with English/Spanish
Figure 8: BBC online -- Chinese news
Figure 9: Tom.com -- Chinese internet portal
Figure 10: Yatack -- Scandinavian e-commerce
Figure 11: French portal: www.eurosport.fr
Figure 12: Italian shopping site: www.kataweb.it
Figure 13: German news portal: www.tomorrowbusiness.de
Figure 14: Spanish health portal: www.saludalia.com
1 Autonomy fundamentals
1.1 Introduction
Autonomy employs a fundamentally different and unique combination of technologies to enable computers to form an understanding of text, web pages, emails, voice, documents and people. Autonomy's solution is therefore able to power any application dependent upon unstructured information within every market sector, including e-commerce, customer relationship management, knowledge management, enterprise information portals and online publishing applications. This claim is supported by the significant penetration of the technology in a diversity of vertical markets, which has been achieved principally because every market sector needs to manage and leverage the benefits of unstructured information. This combination also provides unique advantages:
- automatic
- accurate
- computational efficiency
- language independence
- format agnosticism
Already, Autonomy has become the standard for managing and processing unstructured information across every business in every industry. Autonomy technology can be rapidly implemented either as a complete out-of-the-box solution or as an integrated component of an existing software application. For many Fortune 500 organizations, as well as OEM partners, Autonomy software is playing a key part in their success.
1.2 Autonomy infrastructure technology

Built upon a unique combination of powerful concept-matching algorithms, Autonomy delivers the Intelligent Data Operating Layer - IDOL, an intelligent infrastructure technology, which makes it possible for organizations to automatically process digital information. Sitting above the sea of unstructured information within the enterprise, IDOL automatically identifies the subject matter of each piece of information by extracting the document's "digital essence." Many critical processes and tasks traditionally performed manually within the enterprise can now be automated by Autonomy's technology.
Whether connecting people to content, content to content or even people to people, Autonomy provides a complete modular range of IDOL functionality that enables organizations to integrate the latest in personalization, collaboration, classification, retrieval and proactive information delivery features that solve real business issues. The strength of Autonomy's technology is that it powers a wide range of operations automatically, reducing costs and adding value in real time. Because of its modular architecture, the enterprise can rapidly tailor the technology's functionality to meet their business objectives. IDOL functionality includes:
1.2.1 Automated content operations:

Concept matching & retrieval
IDOL offers higher degrees of accuracy and sophistication, using a scalable technology that recognizes concepts. This unique Autonomy differentiator provides powerful retrieval features, including natural language, conceptual search, refine by example, cross-language search and query by example. Autonomy also supports legacy retrieval mechanisms, such as keyword, Boolean, Proximity, Exact Phrase, Soundex and many others.
Active matching
Proactively links users with the relevant information they require, accurately, in context and in real time, without the user being needlessly diverted from their work in progress to perform a search or retrieval operation.
Automatic hyperlinking
Completely removing the requirement to manually insert hyperlinks into content, Autonomy IDOL generates hyperlinks in real time to all types of data, ensuring they are immediately up-to-date.
Automatic summarization
Autonomy also intelligently returns an automatic summary of the information containing the most salient concepts of the content. Summaries can be generated that relate to the context of the original inquiry, allowing the most applicable dynamic abstract to be provided for a given operation.
1.2.2 Automatic classification:

Automatic categorization
The flexibility of Autonomy's categorization feature allows you to precisely derive categories using concepts found within unstructured text. This ensures that all data is classified in the correct context with the utmost accuracy.
Taxonomy generation
Autonomy's automatic taxonomy generation eradicates the necessity for human intervention and builds taxonomies based on the meaning of the information itself.
Clustering
Autonomy's automatic clustering capabilities can take large sets of document data or even user-profile information and automatically identify the main set of information clusters (themes) inherent within your information assets.
1.2.3 Automatic personalization:

Automatic profiling
Automatic profiling provides the organization with a real-time tool to accurately understand individuals' interests based on browsing, content consumption or content contribution. Generating a multi-faceted and concurrent conceptual profile of each user based on both explicit profiles (agents) and implicit profiles (click-through and submission), automatic profiling avoids the need for explicit input of any form from the user and delivers options for identifying and managing expertise and collaboration.
Community & collaboration
IDOL automatically stores a concurrent, accurate and multifaceted understanding of every user based on powerful profiling operations. This ability enables IDOL to automatically match users with similar interests and drive collaboration through discovering communities of knowledge.
Expertise identification
By automatically gaining an understanding of every individual in the community, IDOL facilitates the recognition of highly focused experts, enabling users to engage in proactive collaboration ventures.
1.3 Language independence

Autonomy's concept-matching technology does not depend on an intimate knowledge of any particular grammatical structure or linguistic knowledge of a specific language. Words are treated as abstract symbols of meaning, and understanding is derived through analysis of the context of their occurrence, rather than a rigid definition of grammar. Slang and other variations in language will not confuse the software; the technology is independent with regard to language not just in the context of English and French, and Arabic and Japanese, but also the language that appears in the legal as opposed to the marketing idiom and even 'non-traditional' languages, such as voice and video.
Significant industry research has shown that both Europe and Asia have a faster-growing rate of Internet usage than the US:
- According to IDC, more than 50% of Web users currently reside outside of the US and this will increase to 60% by 2003.
- Nearly one-half of worldwide e-commerce will take place outside of the US by 2003.
- The Gartner Group forecasts even sharper growth: $7.3 trillion in total global B2B e-sales transactions by 2004, with 61% coming from outside the US.
As companies realize the value of presenting their sites to global markets, the dominance of the English language is set to decline. Customers expect to be approached and served in their native language, and multi-nationals seeking to maximize the information available to their employees gain extra internal value if they can leverage the knowledge they own in a range of languages. Whether implementing a portal site, an e-commerce service or a corporate knowledge management system, and whether the audience is internal or external, the issue of internationalization will therefore have to be surmounted. This White Paper describes the language-independent feature of Autonomy's technology that allows Autonomy to fully support a variety of worldwide languages, providing benefits such as:
- An international exchange of expertise
- Access to your global information assets
- Growth by reaching new market opportunities
2 Internationalization: Autonomy in the global context
2.1 Introduction
Language is defined as "the use of arbitrary symbols, such as the spoken and written word, in organized combinations and patterns in order to express and communicate thoughts and feelings." As its definition implies, language is used in set patterns to express the abstract notion of knowledge and information. To fully realize the potential of stored knowledge bases worldwide, the knowledge itself needs to be shared amongst its global user base, regardless of the language in which it was originally presented. Currently most enterprises need to manage content that is represented in more than one language. Solutions that are independent of language constructs are therefore of utmost importance (and value). Companies can no longer afford to incur extra costs to utilize or provide new or existing information in another language, a need created by opening branch offices in new market areas and countries. Autonomy's technology and architecture are ideally positioned: designed to be completely modular, they scale for content exploitation both today and in the future, in any language, in no time.
2.2 Key factors

Internationalization is the process of developing software whose features can be delivered in multiple languages. This process is generally achieved by careful design-stage consideration of features, functionality and core technology in order to ensure the software does not make any assumptions based on a single language's characteristics. The key questions to ask when evaluating any software system against such a yardstick are thus:
- Does the core algorithm itself depend on the constructs of a particular language? That is, are there inherent assumptions about grammar, linguistics or structure that mean that adding a new language involves dedicated and lengthy re-engineering or development effort?
- Does the architecture or delivery mechanism of that core algorithm make assumptions about language constructs or about the digital representation or encoding of symbols? That is, is the technical way the algorithm is realized bound by assumptions about the style, type, encoding and size of the data it is to be applied to, restricting its use in more computationally complex non-European languages?
2.3 Autonomy's approach to language: APCM and intrinsic independence

Autonomy's technology, which was designed from inception to be language independent, achieves this agnosticism at a fundamental level. The Intelligent Data Operating Layer that lies at Autonomy's heart uses a unique combination of Bayesian Inference and Shannon's Information Theory to apply a technique known as Adaptive Probabilistic Concept Modelling* to the problem of understanding unstructured information. This technique builds a conceptual understanding of the information it processes by analyzing the patterns of arbitrary signals or symbols in that information. In other words, the technology literally analyzes the patterns of language, whatever symbols that language actually uses, to infer higher order conceptual meaning. In this way Autonomy's technology is intrinsically language independent. Out of the box, IDOL is capable not only of understanding English and French equally well, but also Arabic and Japanese, legal as opposed to financial jargon, and even voice, video and images, where symbols, though now phonic or visual, still appear in patterns to represent conceptual meaning.
The choice of language therefore does not compromise the accuracy of the concepts extracted and used by Autonomy. The underlying algorithm is the same regardless of the language used. As a result, Autonomy is not confused by slang and language variations, and is not dependent on language parsing, dictionaries or translation modules.
* A further, detailed examination of APCM, its theoretical basis and implications can be found in the Autonomy Technology Whitepaper.
2.4 Document set granularity and Autonomy's document level property model
Autonomy's architecture integrates information from many different sources into a single layer of content. The homogenization of this information is a key value of Autonomy's technology, allowing applications to reason upon the patterns contained in information regardless of its source format and location. Even so, the properties individual documents have in the real world will often have an impact on their use with Autonomy-powered applications. A document that has entitlement security associated with it, for example, needs to be delivered only to those who have access to view it. Similarly, when a French user queries a system in French, expecting documents only in French, it is necessary to preserve an understanding of which documents were in French in order to ensure only those are considered.
Figure 1: Semantic division of information
Logically dividing document sets
The majority of information infrastructures support the segmentation of their indices or document stores into logically separated databases, collections or document sets. In simpler systems, underlying semantic divisions in the information drive this division -- information pertinent to the marketing department is stored in one database, and information useful to the sales force is stored in another. Such segmentation can then be used at the application layer to ensure users who fit a given role are only served information of interest to them, guarding against pollution of results by documents that are not relevant to them.
Figure 2: Legacy system: anti-semantic modeling of repositories
Legacy systems and database level property models
As such systems matured and the need to store document properties (such as security and locale/encoding information) grew, these logical divisions were appropriated as an architecture upon which to pin document properties. Properties were thus assigned on a per-database level. Such properties, however, are orthogonal to the semantics that originally drove division of the information, and the resulting architectures either did not truly represent the granularity of the underlying system or grew considerably in complexity through technologies such as virtual database layers. Legacy systems still provide document level properties, such as locale/encoding information, at a database level, severely restricting organizations' ability to accurately model the granular reality of their information.
Figure 3: Legacy system: increased complexity through virtual database layers
The Autonomy Content Infrastructure and its document level property model
In stark contrast to legacy systems, the properties of documents stored in IDOL are held at the document level. Each document, whichever database it resides in, has its own security and locale/encoding information and all of these orthogonal properties can be used, in any combination, to restrict IDOL operations performed on the information. The enterprise is thus empowered to divide its information by semantics, accurately modeling the meaning of different areas of information. In addition to this, each document will have associated with it many arbitrary properties, including its locale/encoding information. With the document level model, IDOL is uniquely able to directly match the granularity of documents in the enterprise, without the need for illogical, arbitrary and inefficiency-introducing mapping.
Figure 4: Autonomy document level property model
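The document level property model can be illustrated with a minimal Python sketch. It is not Autonomy's implementation or API: the Document class, its field names and the restrict function are hypothetical, chosen only to show how per-document properties can be combined freely to restrict an operation, independently of the database a document lives in.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    # All field names are illustrative assumptions, not IDOL field names.
    doc_id: str
    database: str              # semantic division, e.g. "marketing"
    language: str              # per-document locale property
    allowed_groups: frozenset  # per-document entitlement property

def restrict(docs, *, language=None, database=None, user_groups=None):
    """Keep only documents that satisfy every restriction that was supplied."""
    kept = []
    for doc in docs:
        if language is not None and doc.language != language:
            continue
        if database is not None and doc.database != database:
            continue
        if user_groups is not None and not (doc.allowed_groups & set(user_groups)):
            continue
        kept.append(doc)
    return kept

# Hypothetical sample data: properties vary per document, not per database.
DOCS = [
    Document("d1", "marketing", "fr", frozenset({"sales"})),
    Document("d2", "marketing", "en", frozenset({"sales"})),
    Document("d3", "sales", "fr", frozenset({"legal"})),
]

french_for_sales = restrict(DOCS, language="fr", user_groups={"sales"})
print([d.doc_id for d in french_for_sales])
```

Because language and entitlement are stored on each document, the French documents in this sketch can sit in different databases and still be selected together, with no virtual database layer in between.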
2.5 Language packs and optimization: Autonomy's use of language dependent features
While the in-built linguistic independence means Autonomy's IDOL can be deployed in multilingual environments without specialized configuration, a number of optimization techniques are utilized in Autonomy's language packs to further enhance the efficiency of processing specific languages and their idiosyncrasies.
Stemming
In most languages certain variations of a word can be stripped to obtain the main stem of the word. In English for example, the words "run", "runner" and "running" can all be stripped down to their stem "run" without much loss of meaning. Stemming rules can be safely used when processing text in order to obtain a list of unique words.
Autonomy does not require the use of stemming rules, as the statistical analysis would normally determine the importance and relationship of those words automatically. However, an initial configuration of stemming rules allows Autonomy to treat a set of words as one, so that storage resources and processing time can be reduced. Autonomy provides stemming algorithms for most commonly used languages as standard.
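As a rough illustration of what a stemming rule does, the following Python sketch strips a handful of English suffixes. The rules are deliberately simplistic assumptions made for this example; they are not the stemming algorithms shipped in Autonomy's language packs.

```python
def stem(word: str) -> str:
    """Reduce a word to a crude stem using a few illustrative suffix rules."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "s"):
        # Strip the first matching suffix, provided a stem of 3+ letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind, e.g. "runn" -> "run".
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

# The three variants from the text collapse to one stored term.
unique_stems = {stem(w) for w in ["run", "runner", "running"]}
print(unique_stems)
```

Treating the three variants as one term is what reduces storage and processing time: the index holds a single entry instead of three.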
Stop lists
Every language has "empty words" that do not carry any significant meaning. In grammatical terms these would normally be prepositions, conjunctions, auxiliary verbs and so on; English examples include "the", "a", "and" and "to". These words can be safely ignored when processing content. Autonomy provides as standard a set of stop lists and stemming algorithms for most commonly used languages.
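A stop list can be sketched in a few lines of Python. The list below contains only the four English examples named in the text; the name ENGLISH_STOP_WORDS and the function are assumptions made for illustration, and a real stop list would be considerably longer.

```python
# Illustrative stop list: just the four examples given in the text.
ENGLISH_STOP_WORDS = frozenset({"the", "a", "and", "to"})

def remove_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    """Split on whitespace and drop any word found in the stop list."""
    return [word for word in text.lower().split() if word not in stop_words]

print(remove_stop_words("The cat and the dog went to a park"))
```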
Multiple encoding support
IDOL's internal information store is formatted in XML with all data encoded in the variable-byte industry standard, UTF-8. Use of UTF-8 enables Autonomy to encode any human language internally, but conversion is often needed to and from legacy encoding schemes, such as the ASCII and UCS-2 data found in existing enterprise repositories. Autonomy provides conversion tables and logic in order to interface with such information. Used in reverse, the same logic allows Autonomy to re-encode information stored internally in UTF-8 to other, legacy schemes and character sets at query time.
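The two directions of conversion can be sketched with Python's built-in codecs. The function names are hypothetical, and "utf-16-le" is used here as a stand-in for UCS-2, which it matches for characters in the Basic Multilingual Plane; this is an assumption of the example, not a statement about Autonomy's conversion tables.

```python
def to_internal_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode legacy bytes and re-encode them as UTF-8 for internal storage."""
    return raw.decode(source_encoding).encode("utf-8")

def from_internal_utf8(stored: bytes, target_encoding: str) -> bytes:
    """Re-encode stored UTF-8 for a legacy consumer at query time.

    Characters the target cannot represent are replaced with '?'.
    """
    return stored.decode("utf-8").encode(target_encoding, errors="replace")

legacy = "日本語".encode("utf-16-le")          # UCS-2-style repository data
stored = to_internal_utf8(legacy, "utf-16-le")  # now UTF-8
round_trip = from_internal_utf8(stored, "utf-16-le")
print(round_trip == legacy)
```

The errors="replace" choice makes lossy targets such as ASCII degrade visibly rather than raise an exception; a production system might prefer to fail loudly instead.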
Transliteration
Transliteration is the ability to represent letters or words in the corresponding characters of another alphabet. Some languages make use of transliteration schemes so that people can write text without needing a special keyboard that supports the original alphabet. Autonomy supports most transliteration schemes used in languages such as Greek and Russian.
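A transliteration scheme is essentially a character mapping, as the short Python sketch below shows. The table covers only a few Cyrillic letters and is an illustrative assumption; it does not reproduce any particular published scheme or the tables Autonomy uses.

```python
# Tiny, illustrative Cyrillic-to-Latin table (not a complete scheme).
CYRILLIC_TO_LATIN = {
    "\u0430": "a",   # а
    "\u0432": "v",   # в
    "\u0436": "zh",  # ж
    "\u043a": "k",   # к
    "\u043c": "m",   # м
    "\u043e": "o",   # о
    "\u0441": "s",   # с
    "\u0448": "sh",  # ш
}

def transliterate(text, table=CYRILLIC_TO_LATIN):
    """Replace each mapped character; leave unmapped characters unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

# The Russian spelling of Moscow, written with escapes to avoid lookalike letters.
print(transliterate("\u043c\u043e\u0441\u043a\u0432\u0430"))
```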
Canonicalization of characters
Some languages have more than one way of representing a given symbol, letter or word. The Japanese katakana script, for example, can be encoded in full width or half width characters. Regardless of its width the character in itself carries the same meaning. Autonomy's software infrastructure uses canonicalization to ensure that all character forms are treated equally through automatic conversion to an internationally recognized canonical scheme.
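The half width and full width katakana case can be demonstrated with Unicode normalization from Python's standard library. Using NFKC here is an assumption made for the sketch; the text does not say which canonical scheme Autonomy converts to.

```python
import unicodedata

def canonicalize(text: str) -> str:
    """Fold compatibility variants (e.g. half width forms) into one form."""
    return unicodedata.normalize("NFKC", text)

half_width = "\uff76\uff80\uff76\uff85"  # katakana "ka-ta-ka-na" in half width
full_width = "\u30ab\u30bf\u30ab\u30ca"  # the same word in full width

print(canonicalize(half_width) == full_width)
```

After this step both spellings index and match as the same sequence of characters.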
Word boundaries
Information is represented with words that together represent a concept. In most languages each word can be identified easily, as in written text words are normally separated by spaces.
Certain languages, such as Thai, Japanese, Chinese and Korean, are written without the use of spaces to delimit words. A sentence is normally a continuous flow of characters with some punctuation used for readability. The individual words are normally discerned from the context of the text. In order to support languages of this type, Autonomy uses well-known third party APIs to perform sentence segmentation.
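To give a feel for the segmentation problem, the following Python sketch splits unspaced text using forward maximum matching against a word list. This technique and the tiny lexicon are stand-ins chosen for illustration; the text does not name the third party APIs Autonomy uses or how they work.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation; unknown characters become single words."""
    max_len = max(map(len, lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical lexicon for a short Chinese sentence.
LEXICON = {"我", "爱", "北京", "天安门"}
print(segment("我爱北京天安门", LEXICON))
```

Greedy matching can pick the wrong split when words overlap, which is one reason dedicated segmentation libraries rely on context rather than a word list alone.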
3 Advanced linguistic functionality: IDOL and its language operations
3.1 Introduction
Internationalization represents the technical baseline software has to conform to in order to operate in multiple languages and encodings. As has been discussed so far, this requirement applies both to the architecture of the technology (for example, through the use of UTF-8) and to the underlying algorithm itself (in the case of Autonomy, intrinsically language independent). Above and beyond these mechanical requirements, however, IDOL's ability to conceptually understand information rather than be forced to doggedly consume information on a keyword basis provides a platform for two further, advanced linguistic operations. These operations go beyond mere compliance across multiple languages and actually deliver value within the enterprise by automating linguistically oriented processes that traditionally could be performed only by expensive, manual methods.
3.2 Automatic language detection

Autonomy's software infrastructure can automatically detect the language and encoding of documents that it processes. This allows the user to set up processes that are automatically applied to documents or document metadata if they are in a specific language. For example, if a document is identified as Chinese, the appropriate preliminary linguistic tools are automatically applied to it. The language detection functionality, which currently supports 57 languages (listed in section 4.2), can generate significant ROI, for example, when applied to repositories that contain unsorted information in multiple languages. Autonomy's language detection can also work at query time. That is, when a user enters a natural language query as a search request to an IDOL-powered application, Autonomy is able to detect the language the query is written in and automatically respond with information that is in the same language. In this way, users of different languages can use the same application interface without pre-configuration or explicit language choices being made.
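Query-time detection can be sketched in Python with a very simple detector that counts common function words per language. This is a toy stand-in: the text does not describe how Autonomy's detector works, and the profiles, function names and three-language coverage below are all assumptions of the example.

```python
# Hypothetical per-language profiles of frequent function words.
PROFILES = {
    "en": {"the", "is", "on", "and", "of", "to", "in", "a"},
    "es": {"el", "la", "los", "y", "es", "está", "en", "un", "una"},
    "fr": {"le", "la", "les", "et", "est", "dans", "un", "une"},
}

def detect_language(text):
    """Return the language whose profile matches most words, or None."""
    words = text.lower().split()
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in PROFILES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def same_language_results(query, documents):
    """Keep only documents detected as being in the query's language."""
    language = detect_language(query)
    return [doc for doc in documents if detect_language(doc) == language]

print(detect_language("el gato está en la mesa"))
```

With a detector of this kind in front of the search, a Spanish query is answered from Spanish content without the user ever choosing a language.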
3.3 Cross lingual systems

Autonomy's technology can be used to set up cross lingual systems. This will allow, for example, a user looking at a document in English to be offered similar information in both English and Spanish. IDOL can be used to establish a correlation between two or more languages. To achieve this, a training dataset is needed in which each document is expressed in all of the required languages. Ideally each extract will be a direct translation of the other. Electronic dictionaries can also be used successfully for this purpose. The cross lingual content can be indexed into IDOL server, giving the system a general understanding of the concepts involved in both languages. Once this cross lingual mapping is set up, the server will be able to correlate terms across languages, making it possible to retrieve content in more than one language at a time.
3.3.1 Single language case

When IDOL server has aggregated data in one language it has a conceptual understanding of that content in that particular language. When IDOL server looks for query results or suggestions of related documents, it uses the concept of the query or document to look for best matching answers. The results will therefore be in the same language. In the following diagram IDOL server has N databases, all of which are in the same language.
Figure 5: Operation on a single language system
When a user application queries the system in English on an English database, IDOL server simply takes the English concept in the query and matches it to any relevant concepts found in the English content. The results are therefore in English. In this case, a Spanish query would not return any results, because the terms in IDOL server are in English and there are no Spanish concepts to match.
3.3.2 Multi-language case

In order to give IDOL server the capability to automatically understand concepts in more than one language (e.g., English and Spanish), we pre-index a special multi-language database containing general data (e.g., encyclopaedia data and general worldwide news) in both languages. This database is simply for training purposes, and it does not need to contain the documents that you will eventually be querying. Each one of the "training documents" contains plain text in both languages, each being a direct translation of the other. This multi-language database gives IDOL server a general understanding of a wide range of concepts in both languages. IDOL server can then use this special multi-language database internally when dealing with queries and suggestions. In the following diagram IDOL server has N databases, each of which is in one of the languages contained in the multi-language database.
Figure 6: Operation on a multi-language system
IDOL server first looks for concepts that closely match the query in the multi-language database, giving IDOL server the concepts in both languages. It can then use those concepts to carry on with the original query. In this way, a query in one language can automatically yield results in both languages. For example, you can use an English sentence to query a database with data in Spanish, and vice versa. Please note that this method does not use direct translation of keywords; it uses translation of general concepts.
For example, if you are querying a Spanish Database:
Figure 7: Example with English/Spanish
IDOL server takes the English query and looks for matching concepts in the English-Spanish database. IDOL server will match the concepts in English, but because each document in the English-Spanish database is in both languages, the concepts obtained are in both English and Spanish. If you then query the Spanish database, the Spanish concepts obtained above will find relevant documents in the Spanish database.
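The two-step flow described above can be mimicked with a small Python sketch. Term overlap stands in for concept matching, which is a large simplification; the training pairs, the Spanish database, the stop word set and all function names are hypothetical and are not IDOL's data or API.

```python
from collections import Counter

STOP = {"the", "el", "los", "de"}  # illustrative stop words only

def tokens(text):
    return [w for w in text.lower().split() if w not in STOP]

def overlap(a, b):
    """Count shared tokens between two token lists."""
    return sum((Counter(a) & Counter(b)).values())

# Hypothetical training pairs: each English text with its Spanish translation.
TRAINING = [
    ("the bank raised interest rates", "el banco subió los tipos de interés"),
    ("the team won the football match", "el equipo ganó el partido de fútbol"),
]

# Hypothetical Spanish-only database to be queried.
SPANISH_DB = {
    "doc1": "el banco central mantiene los tipos de interés",
    "doc2": "el partido de fútbol terminó sin goles",
}

def cross_lingual_query(english_query):
    query = tokens(english_query)
    # Step 1: find the training pair whose English side best matches the query.
    best = max(TRAINING, key=lambda pair: overlap(query, tokens(pair[0])))
    if overlap(query, tokens(best[0])) == 0:
        return []
    # Step 2: use the Spanish side of that pair to query the Spanish database.
    spanish_terms = tokens(best[1])
    scored = [
        (overlap(spanish_terms, tokens(text)), doc_id)
        for doc_id, text in SPANISH_DB.items()
    ]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

print(cross_lingual_query("interest rates"))
```

Note that no English keyword is translated directly: the query is matched against the English side of the training data, and it is the paired Spanish text that carries the search into the Spanish database.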
4 Language support lists
4.1 Optimized languages

Autonomy currently supports optimizations for the following 66 languages:
Afrikaans, Albanian, Arabic, Azeri, Basque, Belarussian, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Gaelic, Galician, German, Greek, Greenlandic, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kazakh, Korean, Kurdish, Kyrgyz, Lappish, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Maori, Mongolian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Sorbian, Spanish, Swahili, Swedish, Tagalog, Tatar, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh
4.2 Automatically detectable languages

Autonomy's automatic language detection technology can currently detect the following 57 languages:
Afrikaans, Albanian, Arabic, Basque, Belarussian, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Gaelic, German, Greek, Greenlandic, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Kurdish, Lappish, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Maori, Mongolian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh
5 Application examples
Autonomy software has already been widely deployed to solve a variety of business problems. Some examples are:
Customer: BBC Online News Site
Languages: Chinese, Arabic
Description: Innovative, progressive and pioneering, the British Broadcasting Corporation has proved a powerful force in the 20th century, providing entertainment, education and information, and captivating millions of viewers and listeners at home and abroad. Autonomy was selected to power the Chinese and Arabic sections of its news site.
Figure 8: BBC online Chinese news
Customer: TOM.COM
Languages: Chinese
Description: Asian telecommunications giant Hutchison Whampoa has set up the first series of portals designed specifically for the Chinese community. The portals, which will represent Chinese interests throughout Asia Pacific, will automatically personalize content to users' interests and needs. The portals will make extensive use of Autonomy's technology infrastructure to turn the users' interaction with the site into a productive and interest-focused experience.
Figure 9: Tom.com - Chinese internet portal
Customer: Yatack E-Commerce Site
Languages: Scandinavian languages
Description: Yatack is a Scandinavian e-commerce site. Autonomy's technology allows Online Club to deliver the most personalized online shopping experience, by guiding users through the buying process based on an automatically derived understanding of their interests.
Figure 10: Yatack Scandinavian e-commerce
5.1 Other examples
Figure 11: French portal www.eurosport.fr
Figure 12: Italian shopping site www.kataweb.it
Figure 13: German news portal www.tomorrowbusiness.de
Figure 14: Spanish health portal - www.saludalia.com
Headquarters
Autonomy Inc, 301 Howard Street, 22nd Floor, San Francisco, CA 94105. Tel: (415) 243 9955 Fax: (415) 243 9984 Email: info@aungate.com
Autonomy Corporation, Cambridge Business Park, Cowley Rd, Cambridge CB4 0WZ. Tel: +44 (0) 1223 448000 Fax: +44 (0) 1223 448001 Email: info@aungate.com

Regional Offices

North America
Autonomy Federal Office: Autonomy, Inc., 8102 Greensboro Drive, Suite 601, McLean, VA 22102. Phone: 1 703 821 1600 Fax: 1 703 821 1662
Autonomy has additional offices in Boston, MA; Dallas, TX; New York, NY; Chicago, IL; and Washington, DC.

Continental Europe
Autonomy Belgium: Bessenveldstraat 25, 1831 Diegem, Belgium. Tel: +32 (2) 716 40 05 or +32 (2) 716 40 57 Fax: +32 (2) 716 41 92 Email: belgium@autonomy.com
Autonomy France: 112, avenue Kléber, 75116 Paris, France. Tel: +33 (0) 1 47 55 74 02 Fax: +33 (0) 1 47 55 74 21 Email: france@autonomy.com
Autonomy Germany - Hamburg: Valentinskamp 24, D-20354 Hamburg, Germany. Tel: 49 (40) 31 112 - 308 Fax: 49 (40) 31 112 - 641 Email: germany@autonomy.com
Autonomy Germany - Munich: Leopoldstrasse 244, D-80807 Munich, Germany. Tel: +49 (0) 89 244 45 2027 Fax: +49 (0) 89 244 45 5056 Email: germany@autonomy.com
Autonomy Italy: Largo Richini, 6, 20122 Milano, Italy. Tel: +39 02 5821 5510 Fax: +39 02 5821 5400 Email: italy@autonomy.com
Autonomy Italy: Via di Vigna Murata, 40, 00143 Rome, Italy. Tel: +39 06 5483 2028 Fax: +39 06 5483 4000 Email: italy@autonomy.com
Autonomy Netherlands: Teleport Towers, Kingsfordweg 151, 1043 GR Amsterdam; Postbus 57674, 1040 BN Amsterdam, Nederland. Tel: +31 (0) 20 491 96 80 Fax: +31 (0) 20 491 73 66 Email: netherlands@autonomy.com
Autonomy Spain: C/ Maudes 51, 8a Planta, 28003 Madrid, Spain. Tel: +34 91 3956325 Fax: +34 91 3956396 Email: spain@autonomy.com

Scandinavia
Autonomy Nordic AS: Fridjof Nansensplass 4, P.O.Box 35 Sentrum, 0101 Oslo, Norway. Tel: +47 23 100 727 Fax: +47 23 100 701 Email: info@autonomy.no
Autonomy Sweden AB: Stockholm Stureplan, Stureplan 4c, 4th floor, Stockholm 114 35, Sweden. Phone: +46 8 545 273 70 Fax: +46 8 545 273 89 Email: sweden@autonomy.com

Asia-Pacific
Autonomy Asia-Pacific: Level 14, 33 Berry Street, North Sydney NSW 2060, Australia. Tel: 61 (2) 9959 1951 Fax: 61 (2) 9959 1035 Email: asiapac@autonomy.com
Autonomy Systems Singapore: 3 Temasek Ave, Level 34, Centennial Tower, Singapore 039190. Tel: +65 6549 7848 Fax: +65 6549 7584 Email: asiapac@autonomy.com
(Autonomy Inc. and Autonomy Systems Limited are both subsidiaries of Autonomy Corporation plc) Copyright 2003 Autonomy Systems Ltd. All rights reserved. Other trademarks are registered trademarks and the properties of their respective owners. [WP INTER] 10.03
The information contained in this document represents the current view of Autonomy Systems Ltd on the issues discussed as of the date of publication. Because Autonomy must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Autonomy, and Autonomy cannot attest to the accuracy of any information presented after the date of publication. This document is for informational purposes only; Autonomy is not making warranties, express or implied, in this document.
www.autonomy.com
