You are on page 1of 10

Oracle Text Queries Nina Bergli

Draft 30.10.03 / 1

Queries in Oracle 9i Text


Draft
Nina Bergli
Oct. 30, 2003

The context index in Oracle Text supports a wide range of query operators with the
CONTAINS operator that is defined through the index. CONTAINS must be used when the
index is of type CONTEXT. Contains queries can be used in combinations with all of the
operators discussed below.

ABOUT

An ABOUT query is a query on a document theme, This implies that text documents can be
queried on their theme and not just the words it contain. The ABOUT operator is applied in
order to make Oracle retrieve documents that contain concepts that are related to a specific
query word or query phrase. Concepts in the knowledge-base are both concrete, for example,
issuing a query for whales might return documents about orcas or other specific instances of
the species, and abstract, for example documents about lions and happiness. The document
need not contain the term whale to be retrieved. It follows that the term or phrase specified in
the query need not exactly match the themes stored in the index. Oracle normalizes the word
or phrase before performing lookup in the index. The normalization process of a query is
based on the knowledge base supplied by Oracle, and which is case-sensitive. Thus, ABOUT
queries yield the best results if queries are formulated in proper case. This must not be
interpreted as a definite requirement since Oracle will look for other case combinations if no
results are found for the case as in the query. For example, a query for Blue Whale might
return documents about blue whales

Themes are extracted from documents during indexing and enabled by a hierarchical list of
categories and concepts that represents a view of the world. This hierarchical list is called a
knowledge-base or more commonly a thesaurus. Oracle supplies a knowledge-base/thesaurus
for English and French. This knowledge base is enabled through the theme component of a
CONTEXT index. The system is then able to look up concept information in the index.

The knowledge-base supplied is basically a hierarchy of, terms within different categories for
example science and business. Since the ABOUT operator uses the supplied knowledge base
in English and French to interpret the phrase that is entered, the ABOUT query is limited to
knowing and interpreting the concepts in the knowledge base. By adding a domain specific
knowledge base, the results of an ABOUT query can be improved.

The about query increases the number of relevant documents retrieved with respect to a
specific query. Oracle scores results for an ABOUT query with the most relevant documents
retrieving the highest score.
An ABOUT query returns a score, which is a relevance measure. Thus, high scores (scores
that are close to or equal 1) represent documents that are likely to be highly relevant, while
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 2

low scores represent documents not that likely to be relevant. A threshold for score values to
be returned from queries can be defined in the index.1

ACCUMULATE (,)
The ACCUMULATE operator is used to search for documents that contain at least one
occurrence of any of the query terms. The ACCUMULATE operator ranks documents
according to the total term weight of a document. Term weight refers to the weight that is
assigned to a query term. A query where the query terms are not explicitly assigned weights
will automatically be assigned the weight 1 for each term.
Accumulate scoring guarantees that if a document A matches p terms with a total term weight
of m, and document B matches q terms with a total term weight of m+1, document B is
guaranteed to have a higher relevance score than document A, regardless of the numbers p
and q.
If two documents have the same weight M, the higher relevance score goes to the document
with the higher weighted average term score.

Other operators
- And (&): The score returned is the score of the lowest query term.
- Equivalence (=): Specifies an acceptable substitution for a word in a query. The
following example returns all documents that contain either the phrase “bluewhales
are social animals” or “minkewhales are social animals”: ‘bluewhales=minkewhales
are social animals’.
- Minus (-): The operator is used to search for documents that contain one query term
and the presence of another term should cause the document to be ranked lower.
- Near (;): The near operator returns a score based on the proximity of two or more
query terms. Oracle returns higher scores for terms closer together and lower scores
for terms further apart in a document. The scoring for the NEAR operator combines
frequency of the terms with proximity of the terms. A score between 1 and 100 is
generated for each document that satisfies the query condition. Near can not be used in
ABOUT queries. The operator takes the following arguments:
o Word 1-n: The terms in the query can be either single words or phrases and
must be separated by commas.
o Max_span: Can be used to specify the maximum distance allowed between
two query terms. Generally, it is used to specify the size of the biggest clump.
Clump is defined as the smallest group of words in which all query terms
occur. All clumps begin and end with a query term.
o Order: Can be specified to true if the documents retrieved should contain the
query terms in the same order as in the query formulation. Max_span must be
specified when using the order argument.
- Not ( ): The Not operator is used to search for documents containing one query term
but not another.

1
The Threshold operator is discussed further in the section”Other operators”.
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 3

- Or (|): The Or operator is used to search for documents that contains at least one
occurrence of one of the query terms.
- Soundex (!): This operator is used to expand queries with words that have similar
sounds, that is, words that sound like each other. This function is only available in
English.
- Stem ($): The stem operator is used to search for terms that have the same linguistic
root as the query term.
- Fuzzy: The operator is used to expand queries to include words that are spelled
similarly to the specified query term. This type of expansion is helpful for finding
more accurate results when there are frequent misspellings in the document set. The
syntax enables ranking of the result set so that documents that contain words with high
similarity to the query word are scored higher than documents with lower similarity.
The number of words generated by a fuzzy expansion depends on what is in the index.
Thus, results can vary significantly.
The parameters for fuzzy operator are
o Query term
o Score: specifies a similarity score between 1 and 80. Default is 60.
o Numresults: Specifies the maximum number of terms to use. Valid range is
from 1 – 5000. Default is 100.
o Weight: Specifies whether or not the results are to be weighted according to
their similarity scores.

- Broader term operators (BTO): The BTO’s are used to expand a query to include
terms that has been defined in a thesaurus as the broader or higher-level term for a
specific query term. Expansion of a broader term is also possible until the top of the
thesaurus hierarchy is reached. A level argument can be optionally specified for a
BTO query. The level then specifies how many levels of the thesaurus hierarchy that is
to be traversed. The default level value is one. If the name of the thesaurus to be used
is specified, the levels to be traversed must also be specified. The default value for the
thesaurus argument is DEFAULT. A thesaurus called DEFAULT must therefore exist
in the thesaurus tables.
o BT: Expands a query to a broader term as defined in the thesaurus.
o BTG: Expands a query to include all broader generic terms as defined in the
thesaurus.
o BTP: Expands a query to include all terms defined as broader partitive terms
for the specific term in the thesaurus.
o BTI: Expands a query to include all terms defined as broader instance terms for
the specific term in the thesaurus.

- Narrower term operators (NTO): The use of NTOs are equal to the use of BTOs but
it works opposite. The NTOs can as the BTOs be further divided into:
o NT
o NTG
o NTP
o NTI
All these operators equals the operators under BTO so they will not be commented
further.

- Preferred term (PT): Used to replace a query term with the preferred term as defined
in the thesaurus for the query term.
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 4

- Related term (RT): Expands a query to include all related terms that have been
defined in the thesaurus.
- Synonym (SYN): Expands a query to include all terms that have been defined as
synonyms for the query term in the thesaurus.
- Translation term (TR): Expands a query to include all defined foreign language
terms. The following are arguments to the operator:
o Term
o Language
o Thesaurus
- Translation Term Synonym (TRSYN): Combination of TR and SYN.
- Top Term (TT): Replaces a term in the query with the top term that has been defined
for the specific term in the standard hierarchy (BT, NT) in the thesaurus. Top terms in
the generic (BTG, NTG), partitive (BTP, NTP), and instancr (BTI, NTI) hierarchies
are not returned.
- Threshold (>): The operator can be used at two levels, the expression and the query
term level. At the expression level documents that score below the defined threshold
are eliminated. In the case of the query term level, the threshold operator selects a
document based on how a term scores in the document.
- Weight (*): This operator is used to mark the more important terms in a query given
that the query has at least two terms. Terms that are considered to be negative for the
overall result can be given weights below 1 so that documents containing this term
obtain a reduced score. Weight multiplies the score by the supplied factor and the
maximum sum that can be achieved is 100, thus implying that documents that score
above 100 will be set to a score of 100. The weight operator is useful in accumulate,
OR and AND queries when the expression has more than one query term. When there
is no weighting on individual terms, the score will not tell which of the terms that
appears more often in the document.
- Wildcards (%, _): Can be used to expand a search to a pattern search. The wildcard
characters are % and _. % specifies that any characters can appear in multiple
positions represented by the wildcard. _ specifies a single position in which any
character can occur. Wildcard query performance is improved by adding a substring
and a prefix index.
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 5

Example queries in Oracle Text

The text table text_tab defined in Appendix A is used in order to illustrate the query operators
in Oracle Text. The definition is as follows:

create type keyword_nt as object(


keyword varchar2(20));

create or replace type keyword_tab as table of keyword_nt;

create or replace type text_type as object


(
identifier number,
dummy char(1), --dummy column for indexing
title varchar2(200),
topic varchar2(200),
format varchar2(10),
published_date date,
media_type varchar2(200),
description varchar2(4000),
source varchar2(200),
keyword_list keyword_tab,
language varchar2(100),
text clob,

static procedure addKeyword(k varchar2, i in number));

create table text_tab of text_type


(
primary key (identifier)
)
nested table keyword_list store as text_ntab(
(primary key (nested_table_id, keyword))
organization index compress)
return as locator;
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 6

Operator Example query Result


About ‘about(Whales)’ -- Documents about whales ok
‘about(whales and reproduction)’ – Documents about whales
and reproduction
In terms of scoring, documents that contain both themes will
score higher than documents that have only one of the themes in
the query.

The ABOUT operator can be combined with other operators for


example AND or NOT.
about (minke whales) AND reproduction
about(blue whales) not about (reproduction)

Select score (1),identifier,title


From text_tab
Where CONTAINS(text,’about(whales)’,1)>0
Order by score(1) desc;

select identifier,title,score(1)
from text_tab
where contains ( text, 'about(whales)',1 ) > 0;
Accumulate SELECT SCORE(1), title,source SCORE(1)
FROM text_tab TITLE
WHERE CONTAINS(text, ' ---------------
bluewhale,orcas,beluga,killer-whale', 1) > 0 -------
ORDER BY SCORE(1) DESC; 28 Minke
whales
1
1Communic
ation -Music
to swim by
And SELECT SCORE(1), title,description no rows
FROM text_tab selected
WHERE CONTAINS(text, ' bluewhale AND killer-whale',
1) >0

Equivalence SELECT SCORE(1), title from text_tab no rows


WHERE CONTAINS(text, ' bluewhales=minkewhales selected
are social animals ', 1) > 0
ORDER BY SCORE(1) DESC;

Minus SELECT SCORE(1), title,description ok


FROM text_tab
WHERE CONTAINS(text, ' bluewhale - killer-whale', 1)
>0

Near SELECT SCORE(1), title, published_date from text_tab ok


WHERE CONTAINS(text, 'whales;reproduction;death',
1) > 0
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 7

AND published_date >= ('01-01-2000')


ORDER BY SCORE(1) DESC;

Not SELECT SCORE(1), title,description,source from text_tab ok


WHERE CONTAINS(text, ' whales~minkewhales', 1) >
0
ORDER BY SCORE(1) DESC;

Or SELECT SCORE(1), title,description,source from text_tab ok


WHERE CONTAINS(text, ' bluewhales OR
minkewhales', 1>0
ORDER BY SCORE(1) DESC;

Soundex select title from text_tab where contains ( text, '!whale',1 ) > 0 ok
Stem select identifier,title,score(1) from text_tab where contains ( ok
text, '$moving',1 ) > 0;
Fuzzy SELECT identifier, title ok
FROM text_tab
WHERE CONTAINS(TEXT, ‘fuzzy(whale, 70, 6, weight)’ ,
1)>0;
This query expands to the first 6 variations of whale in the index
that have a similarity score over 70. Documents in the result set
are also weighted so that documents that have the more similar
words compared to whale are ranked highest.

Using the appropriate number of commas can skip unnecessary


parameters. For example:
SELECT identifier, title
FROM text_tab
WHERE CONTAINS(TEXT, ‘fuzzy(whale, , , weight)’ ,
1)>0;
In this case, documents that score above 60 are retrieved, since
60 is the default value for score. The parameter numresults will
have the value 100.

select * from fuzzy_tbl where contains (text,'?hales')>0;

Treshold select identifier,title,score(1) ok


from text_tab
where contains ( text, '(pilot-whale > 4) and communication',1 )
>0
order by score(1) desc;
Weight select identifier,title,score(1) ok
from text_tab
where contains ( text, 'pilot-whale or communication*3',1 ) > 0
order by score(1) desc;
Wildcards select identifier,title,score(1) ok
from text_tab
where contains ( text, 'pilot% or communication*3',1 ) > 0
order by score(1) desc;
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 8

Comments about thesaurus queries in Oracle


Oracle supplies a test thesaurus in English. The following path locates the thesaurus:
C:\oracle\ora92\ctx\sample\thes. The thesaurus is by default called ‘drOthsus’. If the above
queries were to be defined without specifying the name of the thesaurus, Oracle would not
find it unless it is called ‘default’. Before using the thesaurus in queries, it has to be loaded.
The CTXLOAD procedure is described in McGregor (2002).

The examples are syntactically correct, but some of them do not provide semantic meaning.
This is because they are only intended as examples at the present point of time.

BT ‘BT(Bluewhale)’
‘BT(Bluewhale, 2, drOthsus)’
When homographs are under
consideration a qualifier can
be specified in order to avoid
misconception.
BT(crane{(waterfowl)}) --
crane or bird

select text from text_tab


where contains ( text,
'BT ( orcas, 1, dr0thsus )' ) >
0;

BTG select text from text_tab


where contains ( text,
'BTG ( Whale, 1, dr0thsus)' )
> 0;

BTP select text from text_tab


where contains ( text,
'BTP ( Pilot-Whale, 1,
dr0thsus )' ) > 0;

BTI select text from text_tab


where contains ( text,
'BTI ( Orcas, 1, dr0thsus )' )
> 0;

NT select text from text_tab


where contains ( text,
'NT ( Whale, 1, dr0thsus )' )
> 0;

NTG select text from text_tab


where contains ( text,
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 9

'NTG ( Cats, 1, dr0thsus)' ) >


0;

NTP select text from text_tab


where contains ( text,
'NTP ( Animal, 1, dr0thsus )'
) > 0;

NTI select text from text_tab


where contains ( text,
'NTI ( Animal, 1, dr0thsus )'
) > 0;

PT select text from text_tab


where contains ( text,
'PT ( Africa, dr0thsus )' ) >
0;

RT select text from text_tab


where contains ( text,
'Wolf ( Term, dr0thsus )' ) >
0;

SYN select text from text_tab


where contains ( text,
'SYN ( African lion,
dr0thsus )' ) > 0;

TR select text from text_tab


where contains ( text,
'TR ( Whale,baleine ,
dr0thsus )' ) > 0;
-- Term
--
Baleine_French_For_Whale

TRSYN select text from text_tab


where contains ( text,
'TRSYN ( Killer-Whale, all,
dr0thsus )' ) > 0;
Oracle Text Queries Nina Bergli
Draft 30.10.03 / 10

Combining structural predicates with the CONTAINS operator:

SELECT SCORE(1), title, published_date from text_tab


WHERE CONTAINS(text, 'lions', 1) > 0
AND published_date >= ('01-01-2000')
ORDER BY SCORE(1) DESC;

REFERENCES
McGregor (2002). Oracle Text - Reference, Oracle. 2002. available at:
http://www.csis.gvsu.edu/GeneralInfo/Oracle/text.920/

You might also like