
where $\mu_F(d, t)$ specifies how much document $d$ is about term $t$.

But we can also change perspective and define a set that gathers all documents about a given index term $t$:

$$F_t = \{\, \mu_t(d)/d \mid d \in D \ \text{and} \ t \in T \,\} \subset D$$

From here on, the membership function $\mu^{(d_j)}(t_i)$ will be used to identify the fuzzy set $F_I$, that is, the fuzzy set of the degrees to which document $d_j$ is about term $t_i$.

Indexing function
There are different methods of defining the fuzzy indexing function $\mu^{(d_j)}(t_i)$. One common approach is to use the normalized tf-idf:

$$\mu^{(d_j)}(t_i) = \frac{tf_{i,j} \cdot idf_i}{\max_k \left( tf_{k,j} \cdot idf_k \right)} = \mu^{(t_i)}(d_j) \qquad (1)$$

where $tf_{i,j}$ is the term frequency of index term $t_i$ in document $d_j$, and $idf_i$ is the inverse document frequency of term $t_i$, given as $idf_i = \log(N/N_{t_i})$. $N$ is the number of documents in the collection and $N_{t_i}$ is the number of documents in the collection containing term $t_i$. With this normalization we get a value in the unit interval, as required by a membership function.
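As an illustration, here is a minimal Python sketch of this indexing scheme, assuming max-normalization of the tf-idf weights as in (1); the function name `membership_matrix` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def membership_matrix(docs):
    """Sketch of fuzzy indexing via max-normalized tf-idf, eq. (1).

    docs: list of token lists, one per document.
    Returns a list of dicts: mu[j][t] = degree to which document j is about term t.
    """
    N = len(docs)
    # document frequency N_t: number of documents containing term t
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(N / df[t]) for t in df}

    mu = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        # max-normalization keeps every membership degree in [0, 1]
        norm = max(weights.values(), default=0.0) or 1.0
        mu.append({t: w / norm for t, w in weights.items()})
    return mu

docs = [["fuzzy", "retrieval", "fuzzy", "logic"], ["boolean", "retrieval", "model"]]
mu = membership_matrix(docs)
print(mu[0])  # {'fuzzy': 1.0, 'retrieval': 0.0, 'logic': 0.5}
```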

Query representation
A query $q$ is given as a logical combination of terms $t_i$. Thus, $\mu_q$ is the associated fuzzy set, constructed with the fuzzy extensions of the logical operators AND, OR, and NOT used to form the query. For example, if the query is q = "cat AND dog", to obtain $\mu_q(d_j)$ we take the membership function of the term "cat", $\mu_{\text{"cat"}}(d_j)$, and the membership function of the term "dog", $\mu_{\text{"dog"}}(d_j)$, and combine them with the fuzzy extension of the AND operator, for example the minimum t-norm:

$$\mu_q(d_j) = \min\left( \mu_{q_1}(d_j),\; \mu_{q_2}(d_j) \right) = \min\left( \mu_{\text{"cat"}}(d_j),\; \mu_{\text{"dog"}}(d_j) \right)$$
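A small Python sketch of how such a query could be evaluated, assuming the standard Zadeh operators (min for AND, max for OR, $1 - x$ for NOT); the membership values in `mu_dj` are invented for illustration.

```python
# Fuzzy query evaluation with the standard Zadeh operators.

def AND(*degrees):  # minimum t-norm
    return min(degrees)

def OR(*degrees):   # maximum t-conorm
    return max(degrees)

def NOT(degree):    # standard fuzzy complement
    return 1.0 - degree

mu_dj = {"cat": 0.8, "dog": 0.3, "bird": 0.0}  # hypothetical indexing degrees for document d_j

# q = "cat AND dog"
print(AND(mu_dj["cat"], mu_dj["dog"]))                          # 0.3
# q = "cat AND (dog OR NOT bird)"
print(AND(mu_dj["cat"], OR(mu_dj["dog"], NOT(mu_dj["bird"]))))  # min(0.8, max(0.3, 1.0)) = 0.8
```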

Query matching
A document $d$ answers a query $q$ if the implication $D \Rightarrow Q$ holds, where $D$ and $Q$ are some logical manipulations of document $d$ and query $q$. So, the result of a query is a set of documents that partially match the request. To be more specific, it is a fuzzy subset of the document set $D$, whose membership function is the degree of relevance of document $d$ to the query $q$.
Each retrieved document has an associated degree of relevance, so the documents can be ranked. Therefore, the user can limit the number of retrieved documents and see only the most relevant ones. Another way to limit the retrieved set is to consider relevant only those documents whose membership degree satisfies $\mu_q(d_j) \geq \varepsilon$, where $\varepsilon$ is a threshold value. This can be interpreted as an $\alpha$-cut of the fuzzy set $\mu_q$ with parameter $\alpha = \varepsilon$.
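The following sketch shows both ways of limiting the result set, ranking by decreasing membership and applying an $\alpha$-cut; the scores are invented for illustration.

```python
# scores maps document ids to mu_q(d_j); the values are made up.
scores = {"d1": 0.82, "d2": 0.15, "d3": 0.55, "d4": 0.0}

# Rank all documents by decreasing relevance and keep the top k
top_k = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]

# Alternatively, keep only the alpha-cut: documents with mu_q(d_j) >= epsilon
epsilon = 0.5
alpha_cut = {d: s for d, s in scores.items() if s >= epsilon}

print(top_k)      # [('d1', 0.82), ('d3', 0.55), ('d2', 0.15)]
print(alpha_cut)  # {'d1': 0.82, 'd3': 0.55}
```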
In a standard Boolean IR system, a structured dictionary or thesaurus can help users during query formulation. A fuzzy thesaurus can specify general terms as fuzzy sets over more specific terms. A method (Kohout et al., 1983) for creating a fuzzy thesaurus uses the operation of relational product between fuzzy sets. Each fuzzy set representing a document can be seen as a row of an $n_d \times n_t$ document-term matrix $D$, where $n_d$ is the number of documents and $n_t$ is the total number of index terms. The element $D_{i,j}$ is the degree to which document $d_i$ is about term $t_j$, while the element $D^T_{i,j}$ is the degree to which term $t_i$ is relevant in document $d_j$. Using a particular type of relational product (the square product), the new matrix $D^T \boxdot D$ provides a way to define synonyms: in row $i$, column $j$ there is the degree to which term $t_i$ is a synonym of term $t_j$. Using another type of product (the triangular product) we obtain terms that are more general than a given term.
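A hedged sketch of how the square product could be computed, taking it as the minimum over all documents of a fuzzy biconditional between the two terms' memberships; the Kleene-Dienes implicator and the tiny document-term matrix below are illustrative assumptions, not necessarily the choices of Kohout et al.

```python
def I(a, b):
    """Kleene-Dienes implication max(1 - a, b); other implicators could be used."""
    return max(1.0 - a, b)

def equiv(a, b):
    """Fuzzy biconditional: the minimum of the two implications."""
    return min(I(a, b), I(b, a))

def square_product(D):
    """(D^T square-product D)[i][j] = min over documents d of equiv(D[d][i], D[d][j])."""
    n_docs, n_terms = len(D), len(D[0])
    return [[min(equiv(D[d][i], D[d][j]) for d in range(n_docs))
             for j in range(n_terms)]
            for i in range(n_terms)]

# Rows: documents, columns: terms (degrees of aboutness), values made up.
D = [[0.9, 0.8, 0.1],
     [0.7, 0.6, 0.0],
     [0.2, 0.3, 0.9]]
S = square_product(D)
print(S[0][1])  # degree to which term 0 is a synonym of term 1 -> 0.6
```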

Personalized information retrieval system in the framework of fuzzy logic


Until now, general concepts about fuzzy IR systems have been presented. From now on, we follow the proposal of Oussalah, Khan and Nefti in their paper "Personalized information retrieval system in the framework of fuzzy logic".
They proposed a multidimensional similarity measure: a different way to assign the relevance of a document to a given query, for each index term. This new measure takes into account several issues that the normalized tf-idf alone does not address, such as:
● Different term importance with respect to its position (title, document keyword list, body)
● Different term importance when the term is preceded by an increasing or decreasing quantifier (almost, further, very, nearly, bad, poor)
● Other issues, presented as they are encountered during the development of the similarity function.

Similarity function
With the introduction of the similarity function, we wanted to find a way to generalize the implication
𝐷 ⇒ 𝑄. The query will be represented in the following form

$$\mu^{(q)}(t_i) = \tilde{L}_k\left( \mu^{(t_i)}(q_k) \right)$$

where the $q_k$ are the parts of the query joined by the Boolean operators $L$, and $\tilde{L}$ is the fuzzy extension of $L$.

For example, given the query "brown dog AND black cat", $q_1$ is "brown dog", $q_2$ is "black cat", and $\mu^{(t_i)}(q_1)$ and $\mu^{(t_i)}(q_2)$ are built with (1). So, the fuzzy set representing the query is

$$\mu^{(q)}(t_i) = \mu^{(t_i)}(q_1) \;\widetilde{AND}\; \mu^{(t_i)}(q_2) = \min\left( \mu^{(t_i)}(q_1),\; \mu^{(t_i)}(q_2) \right)$$
The final similarity function is built by progressively improving the following base function:

$$Sim(d_j, q)(t_i) = I\left( \mu^{(d_j)}(t_i),\; \mu^{(q)}(t_i) \right) \qquad (2)$$

where $I$ is any fuzzy implication operator. There are at least two families of fuzzy implicators:

● S-implicators, which mimic $\neg a \vee b$, the logical representation of $a \rightarrow b$. Using the fuzzy set operations (a t-conorm for the disjunction and the standard negation) it becomes
$$I_S(a, b) = S(1 - a, b)$$
● Residual implicators, which capture the idea of partial ordering, so that $I(a, b) = 1$ as soon as $a \leq b$. For a given t-norm $T$, the residual implicator is
$$I_R(a, b) = \sup\{\, c \in [0, 1] : T(a, c) \leq b \,\}$$

If we use an S-implicator with the max t-conorm, then $I(a, b) = \max(1 - a, b)$. Consider the similarity function given in (2): for all implication operators $I$, if $\mu^{(d_j)}(t_i) = 0$ then $Sim(d_j, q)(t_i) = 1$. But this means that even if the index term is not present in the document, the similarity shows maximum pertinence. Function (2) is therefore modified to overcome this problem.
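The boundary behaviour just described can be checked with a short sketch; the Kleene-Dienes and Gödel implicators below are one representative from each family, chosen for illustration.

```python
def kleene_dienes(a, b):   # S-implicator built from the max t-conorm: S(1 - a, b)
    return max(1.0 - a, b)

def goedel(a, b):          # residual implicator of the min t-norm: sup{c : min(a, c) <= b}
    return 1.0 if a <= b else b

mu_d = 0.0   # term not present in the document
mu_q = 0.9   # term strongly required by the query

# Both implicators return 1 when the document membership is 0, i.e. an absent
# index term yields maximum pertinence, which motivates the modification in (3).
print(kleene_dienes(mu_d, mu_q), goedel(mu_d, mu_q))  # 1.0 1.0
```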

$$Sim(d_j, q)(t_i) = \min\left( \mu^{(d_j)}(t_i),\; \mu^{(q)}(t_i),\; I\left( \mu^{(d_j)}(t_i), \mu^{(q)}(t_i) \right) \right) \qquad (3)$$

In this way, if the index term is not present either in the document or in the query, the similarity goes to zero, and the maximum value reachable by this function is the smaller of $\mu^{(d_j)}(t_i)$ and $\mu^{(q)}(t_i)$.
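A minimal sketch of the per-term similarity (3), using Kleene-Dienes as one possible implicator; it shows that an absent index term now yields zero instead of one.

```python
def I(a, b):
    return max(1.0 - a, b)  # Kleene-Dienes S-implicator, one possible choice

def sim_term(mu_d, mu_q):
    """Eq. (3): min(mu_d, mu_q, I(mu_d, mu_q))."""
    return min(mu_d, mu_q, I(mu_d, mu_q))

print(sim_term(0.0, 0.9))  # 0.0 -> an absent index term no longer scores 1
print(sim_term(0.7, 0.9))  # 0.7 -> bounded by min(mu_d, mu_q)
```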

To get a similarity over the whole set of index terms, the following modification is made:

$$Sim(d_j, q) = \max_{t_k} \left( \min\left( \mu^{(d_j)}(t_k),\; \mu^{(q)}(t_k),\; I\left( \mu^{(d_j)}(t_k), \mu^{(q)}(t_k) \right) \right) \right) \qquad (4)$$

so the max operator is computed over all index terms $t_k$. With (4), the similarity returns a single value when applied between a document $d_j$ and a query $q$.

But consider an example with two documents $d_1$ and $d_2$, where $d_1$ has two terms in common with a query $q_1$ and $d_2$ has only one term in common with the same query $q_1$. In both cases, the similarity defined in (4) will give a value of 1. Equation (4) should be modified to take into account the number of co-occurrences of index terms between the document and the query. The algebraic sum then replaces the max operator:

$$Sim(d_j, q) = \sum_{t_k} \min\left( \min\left( \mu^{(d_j)}(t_k),\; \mu^{(q)}(t_k) \right),\; I\left( \mu^{(d_j)}(t_k), \mu^{(q)}(t_k) \right) \right) \qquad (5)$$
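A sketch of (5) on the two-document example above, with Kleene-Dienes as an illustrative implicator and made-up membership values; summing over terms makes $d_1$ score higher than $d_2$, which the max aggregation of (4) could not distinguish.

```python
def I(a, b):
    return max(1.0 - a, b)  # Kleene-Dienes, one possible implicator

def sim(mu_d, mu_q):
    """Eq. (5): sum over terms of min(min(mu_d, mu_q), I(mu_d, mu_q))."""
    terms = set(mu_d) | set(mu_q)
    return sum(min(min(mu_d.get(t, 0.0), mu_q.get(t, 0.0)),
                   I(mu_d.get(t, 0.0), mu_q.get(t, 0.0)))
               for t in terms)

mu_q  = {"cat": 1.0, "dog": 1.0}    # query memberships (illustrative)
mu_d1 = {"cat": 1.0, "dog": 1.0}    # shares two terms with the query
mu_d2 = {"cat": 1.0, "bird": 0.8}   # shares only one term with the query

print(sim(mu_d1, mu_q), sim(mu_d2, mu_q))  # 2.0 vs 1.0
```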

It has been proved that, using certain types of implication operators (S-implicators or residual implicators), (3) is equivalent to

$$Sim(d_j, q)(t_i) = \min\left( \mu^{(d_j)}(t_i),\; \mu^{(q)}(t_i) \right) \qquad (6)$$
Under this restriction, the similarity between the document and the query, for a given term, becomes the intersection of the two fuzzy sets representing, respectively, the document and the query. Additionally, this turns an operation that is asymmetric, because of the implication operator, into a symmetric one.
Under the same hypotheses as (6), it follows that

$$Sim(d_j, q) = \sum_{t_k} \min\left( \mu^{(d_j)}(t_k),\; \mu^{(q)}(t_k) \right) \qquad (7)$$
In this way, the total similarity between the document and the query is the cardinality of the intersection of
the two fuzzy sets representing the document and the query.
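A quick numerical check (a sketch, not a proof) that, for one S-implicator and one residual implicator, the per-term similarity (3) indeed collapses to the intersection (6), and hence (5) to (7):

```python
import random

def kleene_dienes(a, b):   # S-implicator built from the max t-conorm
    return max(1.0 - a, b)

def goedel(a, b):          # residual implicator of the min t-norm
    return 1.0 if a <= b else b

def sim3(a, b, I):         # per-term similarity of eq. (3)
    return min(min(a, b), I(a, b))

random.seed(0)
for _ in range(10_000):
    a, b = random.random(), random.random()
    for I in (kleene_dienes, goedel):
        assert abs(sim3(a, b, I) - min(a, b)) < 1e-12  # eq. (6)
print("eq. (3) matched min(a, b) on all sampled pairs")
```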
