Oracle Text Queries Draft

Nina Bergli 30.10.03 / 1

Queries in Oracle 9i Text
Draft Nina Bergli Oct. 30, 2003 The context index in Oracle Text supports a wide range of query operators with the CONTAINS operator that is defined through the index. CONTAINS must be used when the index is of type CONTEXT. Contains queries can be used in combinations with all of the operators discussed below.

ABOUT
An ABOUT query is a query on a document theme, This implies that text documents can be queried on their theme and not just the words it contain. The ABOUT operator is applied in order to make Oracle retrieve documents that contain concepts that are related to a specific query word or query phrase. Concepts in the knowledge-base are both concrete, for example, issuing a query for whales might return documents about orcas or other specific instances of the species, and abstract, for example documents about lions and happiness. The document need not contain the term whale to be retrieved. It follows that the term or phrase specified in the query need not exactly match the themes stored in the index. Oracle normalizes the word or phrase before performing lookup in the index. The normalization process of a query is based on the knowledge base supplied by Oracle, and which is case-sensitive. Thus, ABOUT queries yield the best results if queries are formulated in proper case. This must not be interpreted as a definite requirement since Oracle will look for other case combinations if no results are found for the case as in the query. For example, a query for Blue Whale might return documents about blue whales Themes are extracted from documents during indexing and enabled by a hierarchical list of categories and concepts that represents a view of the world. This hierarchical list is called a knowledge-base or more commonly a thesaurus. Oracle supplies a knowledge-base/thesaurus for English and French. This knowledge base is enabled through the theme component of a CONTEXT index. The system is then able to look up concept information in the index. The knowledge-base supplied is basically a hierarchy of, terms within different categories for example science and business. Since the ABOUT operator uses the supplied knowledge base in English and French to interpret the phrase that is entered, the ABOUT query is limited to knowing and interpreting the concepts in the knowledge base. By adding a domain specific knowledge base, the results of an ABOUT query can be improved. The about query increases the number of relevant documents retrieved with respect to a specific query. Oracle scores results for an ABOUT query with the most relevant documents retrieving the highest score. An ABOUT query returns a score, which is a relevance measure. Thus, high scores (scores that are close to or equal 1) represent documents that are likely to be highly relevant, while

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 2

low scores represent documents not that likely to be relevant. A threshold for score values to be returned from queries can be defined in the index.1 .

ACCUMULATE (,)
The ACCUMULATE operator is used to search for documents that contain at least one occurrence of any of the query terms. The ACCUMULATE operator ranks documents according to the total term weight of a document. Term weight refers to the weight that is assigned to a query term. A query where the query terms are not explicitly assigned weights will automatically be assigned the weight 1 for each term. Accumulate scoring guarantees that if a document A matches p terms with a total term weight of m, and document B matches q terms with a total term weight of m+1, document B is guaranteed to have a higher relevance score than document A, regardless of the numbers p and q. If two documents have the same weight M, the higher relevance score goes to the document with the higher weighted average term score.

Other operators
And (&): The score returned is the score of the lowest query term. Equivalence (=): Specifies an acceptable substitution for a word in a query. The following example returns all documents that contain either the phrase “bluewhales are social animals” or “minkewhales are social animals”: ‘bluewhales=minkewhales are social animals’. Minus (-): The operator is used to search for documents that contain one query term and the presence of another term should cause the document to be ranked lower. Near (;): The near operator returns a score based on the proximity of two or more query terms. Oracle returns higher scores for terms closer together and lower scores for terms further apart in a document. The scoring for the NEAR operator combines frequency of the terms with proximity of the terms. A score between 1 and 100 is generated for each document that satisfies the query condition. Near can not be used in ABOUT queries. The operator takes the following arguments: o Word 1-n: The terms in the query can be either single words or phrases and must be separated by commas. o Max_span: Can be used to specify the maximum distance allowed between two query terms. Generally, it is used to specify the size of the biggest clump. Clump is defined as the smallest group of words in which all query terms occur. All clumps begin and end with a query term. o Order: Can be specified to true if the documents retrieved should contain the query terms in the same order as in the query formulation. Max_span must be specified when using the order argument. Not ( ): The Not operator is used to search for documents containing one query term but not another.

-

-

1

The Threshold operator is discussed further in the section”Other operators”.

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 3

-

Or (|): The Or operator is used to search for documents that contains at least one occurrence of one of the query terms. Soundex (!): This operator is used to expand queries with words that have similar sounds, that is, words that sound like each other. This function is only available in English. Stem ($): The stem operator is used to search for terms that have the same linguistic root as the query term. Fuzzy: The operator is used to expand queries to include words that are spelled similarly to the specified query term. This type of expansion is helpful for finding more accurate results when there are frequent misspellings in the document set. The syntax enables ranking of the result set so that documents that contain words with high similarity to the query word are scored higher than documents with lower similarity. The number of words generated by a fuzzy expansion depends on what is in the index. Thus, results can vary significantly. The parameters for fuzzy operator are o Query term o Score: specifies a similarity score between 1 and 80. Default is 60. o Numresults: Specifies the maximum number of terms to use. Valid range is from 1 – 5000. Default is 100. o Weight: Specifies whether or not the results are to be weighted according to their similarity scores. Broader term operators (BTO): The BTO’s are used to expand a query to include terms that has been defined in a thesaurus as the broader or higher-level term for a specific query term. Expansion of a broader term is also possible until the top of the thesaurus hierarchy is reached. A level argument can be optionally specified for a BTO query. The level then specifies how many levels of the thesaurus hierarchy that is to be traversed. The default level value is one. If the name of the thesaurus to be used is specified, the levels to be traversed must also be specified. The default value for the thesaurus argument is DEFAULT. A thesaurus called DEFAULT must therefore exist in the thesaurus tables. o BT: Expands a query to a broader term as defined in the thesaurus. o BTG: Expands a query to include all broader generic terms as defined in the thesaurus. o BTP: Expands a query to include all terms defined as broader partitive terms for the specific term in the thesaurus. o BTI: Expands a query to include all terms defined as broader instance terms for the specific term in the thesaurus. Narrower term operators (NTO): The use of NTOs are equal to the use of BTOs but it works opposite. The NTOs can as the BTOs be further divided into: o NT o NTG o NTP o NTI All these operators equals the operators under BTO so they will not be commented further. Preferred term (PT): Used to replace a query term with the preferred term as defined in the thesaurus for the query term.

-

-

-

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 4

-

-

-

-

-

Related term (RT): Expands a query to include all related terms that have been defined in the thesaurus. Synonym (SYN): Expands a query to include all terms that have been defined as synonyms for the query term in the thesaurus. Translation term (TR): Expands a query to include all defined foreign language terms. The following are arguments to the operator: o Term o Language o Thesaurus Translation Term Synonym (TRSYN): Combination of TR and SYN. Top Term (TT): Replaces a term in the query with the top term that has been defined for the specific term in the standard hierarchy (BT, NT) in the thesaurus. Top terms in the generic (BTG, NTG), partitive (BTP, NTP), and instancr (BTI, NTI) hierarchies are not returned. Threshold (>): The operator can be used at two levels, the expression and the query term level. At the expression level documents that score below the defined threshold are eliminated. In the case of the query term level, the threshold operator selects a document based on how a term scores in the document. Weight (*): This operator is used to mark the more important terms in a query given that the query has at least two terms. Terms that are considered to be negative for the overall result can be given weights below 1 so that documents containing this term obtain a reduced score. Weight multiplies the score by the supplied factor and the maximum sum that can be achieved is 100, thus implying that documents that score above 100 will be set to a score of 100. The weight operator is useful in accumulate, OR and AND queries when the expression has more than one query term. When there is no weighting on individual terms, the score will not tell which of the terms that appears more often in the document. Wildcards (%, _): Can be used to expand a search to a pattern search. The wildcard characters are % and _. % specifies that any characters can appear in multiple positions represented by the wildcard. _ specifies a single position in which any character can occur. Wildcard query performance is improved by adding a substring and a prefix index.

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 5

Example queries in Oracle Text
The text table text_tab defined in Appendix A is used in order to illustrate the query operators in Oracle Text. The definition is as follows: create type keyword_nt as object( keyword varchar2(20)); create or replace type keyword_tab as table of keyword_nt; create or replace type text_type as object ( identifier number, dummy char(1), --dummy column for indexing title varchar2(200), topic varchar2(200), format varchar2(10), published_date date, media_type varchar2(200), description varchar2(4000), source varchar2(200), keyword_list keyword_tab, language varchar2(100), text clob, static procedure addKeyword(k varchar2, i in number)); create table text_tab of text_type ( primary key (identifier) ) nested table keyword_list store as text_ntab( (primary key (nested_table_id, keyword)) organization index compress) return as locator;

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 6

Operator About

Example query ‘about(Whales)’ -- Documents about whales ‘about(whales and reproduction)’ – Documents about whales and reproduction In terms of scoring, documents that contain both themes will score higher than documents that have only one of the themes in the query. The ABOUT operator can be combined with other operators for example AND or NOT. about (minke whales) AND reproduction about(blue whales) not about (reproduction) Select score (1),identifier,title From text_tab Where CONTAINS(text,’about(whales)’,1)>0 Order by score(1) desc; select identifier,title,score(1) from text_tab where contains ( text, 'about(whales)',1 ) > 0; SELECT SCORE(1), title,source FROM text_tab WHERE CONTAINS(text, ' bluewhale,orcas,beluga,killer-whale', 1) > 0 ORDER BY SCORE(1) DESC;

Result ok

Accumulate

And

SELECT SCORE(1), title,description FROM text_tab WHERE CONTAINS(text, ' bluewhale AND killer-whale', 1) >0

SCORE(1) TITLE --------------------28 Minke whales 1 1Communic ation -Music to swim by no rows selected

Equivalence SELECT SCORE(1), title from text_tab WHERE CONTAINS(text, ' bluewhales=minkewhales are social animals ', 1) > 0 ORDER BY SCORE(1) DESC; Minus SELECT SCORE(1), title,description FROM text_tab WHERE CONTAINS(text, ' bluewhale - killer-whale', 1) >0 SELECT SCORE(1), title, published_date from text_tab WHERE CONTAINS(text, 'whales;reproduction;death', 1) > 0

no rows selected

ok

Near

ok

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 7

AND published_date >= ('01-01-2000') ORDER BY SCORE(1) DESC; Not SELECT SCORE(1), title,description,source from text_tab WHERE CONTAINS(text, ' whales~minkewhales', 1) > 0 ORDER BY SCORE(1) DESC; SELECT SCORE(1), title,description,source from text_tab WHERE CONTAINS(text, ' bluewhales OR minkewhales', 1>0 ORDER BY SCORE(1) DESC; ok

Or

ok

Soundex Stem Fuzzy

select title from text_tab where contains ( text, '!whale',1 ) > 0 ok select identifier,title,score(1) from text_tab where contains ( ok text, '$moving',1 ) > 0; ok SELECT identifier, title FROM text_tab WHERE CONTAINS(TEXT, ‘fuzzy(whale, 70, 6, weight)’ , 1)>0; This query expands to the first 6 variations of whale in the index that have a similarity score over 70. Documents in the result set are also weighted so that documents that have the more similar words compared to whale are ranked highest. Using the appropriate number of commas can skip unnecessary parameters. For example: SELECT identifier, title FROM text_tab WHERE CONTAINS(TEXT, ‘fuzzy(whale, , , weight)’ , 1)>0; In this case, documents that score above 60 are retrieved, since 60 is the default value for score. The parameter numresults will have the value 100. select * from fuzzy_tbl where contains (text,'?hales')>0;

Treshold

Weight

Wildcards

select identifier,title,score(1) from text_tab where contains ( text, '(pilot-whale > 4) and communication',1 ) >0 order by score(1) desc; select identifier,title,score(1) from text_tab where contains ( text, 'pilot-whale or communication*3',1 ) > 0 order by score(1) desc; select identifier,title,score(1) from text_tab where contains ( text, 'pilot% or communication*3',1 ) > 0 order by score(1) desc;

ok

ok

ok

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 8

Comments about thesaurus queries in Oracle
Oracle supplies a test thesaurus in English. The following path locates the thesaurus: C:\oracle\ora92\ctx\sample\thes. The thesaurus is by default called ‘drOthsus’. If the above queries were to be defined without specifying the name of the thesaurus, Oracle would not find it unless it is called ‘default’. Before using the thesaurus in queries, it has to be loaded. The CTXLOAD procedure is described in McGregor (2002). The examples are syntactically correct, but some of them do not provide semantic meaning. This is because they are only intended as examples at the present point of time. BT ‘BT(Bluewhale)’ ‘BT(Bluewhale, 2, drOthsus)’ When homographs are under consideration a qualifier can be specified in order to avoid misconception. BT(crane{(waterfowl)}) -crane or bird select text from text_tab where contains ( text, 'BT ( orcas, 1, dr0thsus )' ) > 0; BTG select text from text_tab where contains ( text, 'BTG ( Whale, 1, dr0thsus)' ) > 0; select text from text_tab where contains ( text, 'BTP ( Pilot-Whale, 1, dr0thsus )' ) > 0; select text from text_tab where contains ( text, 'BTI ( Orcas, 1, dr0thsus )' ) > 0; select text from text_tab where contains ( text, 'NT ( Whale, 1, dr0thsus )' ) > 0; select text from text_tab where contains ( text,

BTP

BTI

NT

NTG

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 9

'NTG ( Cats, 1, dr0thsus)' ) > 0; NTP select text from text_tab where contains ( text, 'NTP ( Animal, 1, dr0thsus )' ) > 0; select text from text_tab where contains ( text, 'NTI ( Animal, 1, dr0thsus )' ) > 0; select text from text_tab where contains ( text, 'PT ( Africa, dr0thsus )' ) > 0; select text from text_tab where contains ( text, 'Wolf ( Term, dr0thsus )' ) > 0; select text from text_tab where contains ( text, 'SYN ( African lion, dr0thsus )' ) > 0; select text from text_tab where contains ( text, 'TR ( Whale,baleine , dr0thsus )' ) > 0; -- Term -Baleine_French_For_Whale select text from text_tab where contains ( text, 'TRSYN ( Killer-Whale, all, dr0thsus )' ) > 0;

NTI

PT

RT

SYN

TR

TRSYN

Oracle Text Queries Draft

Nina Bergli 30.10.03 / 10

Combining structural predicates with the CONTAINS operator: SELECT SCORE(1), title, published_date from text_tab WHERE CONTAINS(text, 'lions', 1) > 0 AND published_date >= ('01-01-2000') ORDER BY SCORE(1) DESC;

REFERENCES
McGregor (2002). Oracle Text - Reference, Oracle. 2002. available at: http://www.csis.gvsu.edu/GeneralInfo/Oracle/text.920/

Sign up to vote on this title
UsefulNot useful