
Excluding stuff from Cogito rules

If a rule works well overall but matches something it should not, there are four basic
techniques for excluding the unwanted match. These are:

 AND NOT
 MINUS
 Positional sequences with negation
 Discard rules

1. AND NOT  

Imagine you notice a precision problem with the term “Boston beans”, because this rule fires on
some articles that are not about Boston:

SCOPE SENTENCE
{
//1. Boston SYNCON/Geography
DOMAIN(BOSTON:NORMAL)
{
ANCESTOR(14911586:99:syncon/geography)//# 14911586: Boston, Beantown, Hub of the Universe
}
}

You can exclude “Boston beans” with the boolean operator AND NOT:

SCOPE SENTENCE
{
//1. Boston SYNCON/Geography
DOMAIN(BOSTON:NORMAL)
{
ANCESTOR(14911586:99:syncon/geography)//# 14911586: Boston, Beantown, Hub of the Universe
AND NOT
KEYWORD("boston beans")
}
}

This leaves out of consideration (the rule ignores) any sentence that contains “Boston beans”.
Because the keyword is entered in lowercase, it matches any case variant. It is the equivalent
of a NOT in RBC, with the difference that in RBC the typical scope is the entire article, while
this is generally not the case in Cogito. So, using the above example, if an article has two
sentences, one with “Boston” and one with “Boston beans”, the sentence with just “Boston” will
still be matched by your rule.
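
The sentence-level behaviour described above can be pictured with a small Python toy model. This is only a sketch of the idea, not Cogito itself: the function names, the naive sentence splitter and the plain substring matching are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def and_not_hits(text, wanted, excluded):
    """Return the sentences that contain `wanted` but not `excluded`.

    Matching is case-insensitive, mirroring the lowercase keyword above.
    """
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if wanted.lower() in lowered and excluded.lower() not in lowered:
            hits.append(sentence)
    return hits

article = "The mayor of Boston spoke today. I had Boston beans for lunch."
print(and_not_hits(article, "boston", "boston beans"))
# ['The mayor of Boston spoke today.']
```

Only the second sentence is dropped; the first one still produces a hit, which is the point of working at sentence scope rather than article scope.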

2. Difference (MINUS)

This is used to exclude something that is part of an attribute (an ANCESTOR, SYNCON, LEMMA,
etc.). Example: assume you find problems with SYNCON #12391407 (Western Area), which is a
district in Sierra Leone, because the disambiguator thinks any “Western Area” appearing in an
article relates to Sierra Leone. You could use AND NOT as in 1 above, but since AND NOT excludes
entire sentences, it would also leave out a sentence like “Prime minister of Sierra Leone visits
Western Area”. What we want instead is to exclude just the match on “Western Area”, not every
sentence containing “Western Area”, as some of them might actually be about Sierra Leone. We can
do that with the MINUS symbol (-).

SCOPE SENTENCE
{
//1. Sierra Leone: Ancestor syncon/geography
DOMAIN(SILEN:NORMAL)
{
ANCESTOR(12392184:99:syncon/geography)//# 12392184: Sierra Leone, Republic of Sierra Leone, Serra Leona
-KEYWORD("western area")
}
}

If we do that, in a sentence like the one above (“Prime minister of Sierra Leone visits Western
Area”), Sierra Leone is still matched. With AND NOT it is not, because any sentence containing
“Western Area” is simply left out of the scope of the rule.
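
To make the contrast between MINUS and AND NOT concrete, here is a small Python toy model. It is a sketch under assumptions, not Cogito code: `ANCESTOR_TERMS` is a hypothetical stand-in for whatever the ANCESTOR attribute would recognize, and matching is plain case-insensitive substring search.

```python
# Hypothetical stand-in for the terms the ANCESTOR attribute would recognize.
ANCESTOR_TERMS = ["Sierra Leone", "Western Area"]

def attribute_matches(sentence, terms=ANCESTOR_TERMS):
    """Return the terms found in the sentence (case-insensitive), in list order."""
    lowered = sentence.lower()
    return [t for t in terms if t.lower() in lowered]

def with_minus(sentence, excluded):
    """MINUS-like: drop only the excluded matches, keep the rest."""
    blocked = {e.lower() for e in excluded}
    return [m for m in attribute_matches(sentence) if m.lower() not in blocked]

def with_and_not(sentence, excluded):
    """AND NOT-like: if any excluded keyword occurs, the sentence yields nothing."""
    lowered = sentence.lower()
    if any(e.lower() in lowered for e in excluded):
        return []
    return attribute_matches(sentence)

sentence = "Prime minister of Sierra Leone visits Western Area"
print(with_minus(sentence, ["western area"]))    # ['Sierra Leone']
print(with_and_not(sentence, ["western area"]))  # []
```

With the MINUS-like version the match on Sierra Leone survives; with the AND NOT-like version the whole sentence is lost.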

In fact, MINUS is just one way to combine attributes; you can also combine them with the plus
symbol (+). More details about this are on the Expert System wiki page. There are some
restrictions when combining attributes; for instance, you cannot combine two attributes that
have more than one value each. The below would be invalid:

SYNCON(value1, value2) - KEYWORD("value3", "value4")

(we’d need to create separate rules for each syncon for this to work, but don’t worry about this
for now)

3. Negation in sequences (the same as ordered distance; see the wiki)


This just uses the positional operators (>, >>, <>, etc.) combined with an exclamation mark (!).
The exclamation mark, which is only used in sequences, excludes a specific sequence of
attributes. For instance, going back to the Boston beans example, let’s assume you have a
sentence like:

You should go to Boston and try the Boston beans.


You don’t want to capture “Boston beans” but you do want to capture the standalone “Boston”. If
you use AND NOT as in #1, you exclude the entire sentence: nothing in it is matched, because it
contains “Boston beans” and so falls out of the scope of the rule. Here you don’t want to
exclude sentences with “Boston beans”; you just want to stop Cogito matching “Boston” in a
specific sequence, namely when it is followed by the term “beans”. You can do this:

ANCESTOR(14911586:99:syncon/geography)
>>
!KEYWORD("bean", "beans")

The above means: match the ancestor for Boston, but not if it is immediately followed (>>) by
the word “bean” or “beans”. And voilà, your rule will match only the standalone Boston (the
first one) in:

You should go to Boston and try the Boston beans
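
The effect of the negated sequence can be illustrated with a short Python toy model. Again, this is a sketch and not Cogito: the tokenizer, the function name and the use of a literal word in place of the ANCESTOR attribute are assumptions for illustration.

```python
import re

def match_not_followed_by(sentence, target, blocked_next):
    """Return token positions of `target` that are NOT immediately followed
    by one of the `blocked_next` words (case-insensitive)."""
    tokens = re.findall(r"\w+", sentence)
    blocked = {w.lower() for w in blocked_next}
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() != target.lower():
            continue
        # The token right after the target, or None at the end of the sentence.
        following = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        if following not in blocked:
            hits.append(i)
    return hits

sentence = "You should go to Boston and try the Boston beans."
print(match_not_followed_by(sentence, "Boston", ["bean", "beans"]))
# [4]
```

Position 4 is the first “Boston”; the second one (position 8) is skipped because “beans” comes right after it, and the sentence as a whole is not discarded.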

4. DISCARD rules

DISCARD rules are also used to exclude things. To be precise, they remove codes. But don’t worry
much about discard rules for now, as we need to be very sure about them: too many discard rules
can impact Cogito performance, and they can also hurt recall. Use them only when none of the
above three techniques works. I think some of us use too many discards, and this can be
problematic.

DISCARD is one of the standard score options and is used in the DOMAIN line. It makes Cogito
remove the code (actually, it adds an infinite negative score) if the rule matches. There are
hundreds of bad examples of DISCARD rules in our projects and I don’t want to use them here,
so I will use one that I use quite often to avoid mismatching regions with similar names.

For instance, let’s assume that there’s a guy called Al Franken and Cogito has trouble
recognizing “Al” as a person name, so it doesn’t know what “Al” is and thinks that Franken is
SYNCON(12789661)//# 12789661: Franken, Franconia, which is matched by the rule for GFR
number 2 (omninomen/parsnomen). What I did here was create the rule below. It means that if
the conditions are matched in the scope of the entire body (SCOPE SECTION(BODY)), the code
for Germany is removed from the article. The conditions are: 1) “Al Franken” is there; 2) the
syncon for Franken is there (if Franken is not recognized as that syncon, that’s not a problem,
because the rule for Germany won’t fire anyway, so there is no need to remove the code); 3) the
article does not have anything below the Germany syncon/geography ancestor, or the syncons
for German. So if we had an article about Al Franken going to Germany, the code would not be
removed.

SCOPE SECTION(BODY)
{
//8. discard for Al Franken
DOMAIN(GFR:DISCARD)
{
KEYWORD("Al Franken")
AND
SYNCON(12789661)//# 12789661: Franken, Franconia
AND NOT
ANCESTOR(12907079:99:syncon/geography,43498,97966)// # 12907079: Germany, Federal Republic of Germany, Deutschland, Bundesrepublik Deutschland, Duitschland, FRG, Ger., F.R.G.; 43498: German, Ger.; 97966: German
}
}
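
The “infinite negative score” idea behind DISCARD can be pictured with the Python toy model below. It is a sketch only: the numeric scores, the summing of scores per code and the “keep codes with a positive total” threshold are assumptions for illustration, not a description of how Cogito actually computes scores.

```python
import math

# Assumption: DISCARD behaves like adding negative infinity to a code's score.
DISCARD = -math.inf

def final_codes(rule_hits):
    """Sum the scores per code and keep only codes with a positive total.

    rule_hits is a list of (code, score) pairs, one per rule that fired.
    """
    totals = {}
    for code, score in rule_hits:
        totals[code] = totals.get(code, 0.0) + score
    return {code: total for code, total in totals.items() if total > 0}

# Hypothetical scores: a normal rule fires for GFR and USA, then the discard
# rule for "Al Franken" fires for GFR.
hits = [("GFR", 10.0), ("USA", 10.0), ("GFR", DISCARD)]
print(final_codes(hits))
# {'USA': 10.0}
```

However many normal rules fire for GFR, a single discard hit pulls the total to negative infinity, which is why discard rules need to be written so carefully.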

There are quite a few ways to use discard. We should be careful with it, but it can be very
helpful. For instance, here’s an example of a rule made to discard the code for Switzerland on
articles about the company Zurich Insurance.

SCOPE SECTION(BODY) IF DOMAIN(insurance:0.5%)
{
//7. DISCARD Zurich the company
DOMAIN(SWITZ:DISCARD)
{
SYNCON(12645121)//# 12645121: Zurich, Zürich, Zuerich
+KEYWORD("Zurich")
AND NOT
ANCESTOR(12645653:99:syncon/geography)//# 12645653: Switzerland, Schweiz, Suisse, Swiss Confederation, Svizzera, Helvetia, Confederatio Helvetica, Confédération Suisse, Schweizerische Eidgenossenschaft, Schwiz, Switz.
-KEYWORD("Zurich")
AND NOT
ANCESTOR(12645653:99:syncon/geography:geography/structures,43411,97987)// # 12645653: Switzerland, Schweiz, Suisse, Swiss Confederation, Svizzera, Helvetia, Confederatio Helvetica, Confédération Suisse, Schweizerische Eidgenossenschaft, Schwiz, Switz.; 43411: Swiss, Swiss people, Switzer; 97987: Swiss
AND NOT
KEYWORD("in Zurich", "ZURICH")
}
}
