
Matching

Matching happens in the following cases:

1. Upload
2. Import
3. Search via bar
4. Search via filter
5. dry-run

Upload, Import, Search via filter, and dry-run match type-specifically; Search via bar has to
match against all types.

Normalization

Every matching requires all participants to be normalized first:

1. Replace umlauts and special characters: Ä->ae, Ö->oe, Ü->ue, ä->ae, ö->oe, ü->ue, ß->ss
2. Replace capital letters with lower case
3. Replace every character other than [0..9][a..z] with a blank
4. Separate into a list of words, where a word is any sequence of non-blank characters longer than 2 characters
5. Remove stop-words from the list of words
6. Remove duplicates from the list of words
7. Store the list of words, sorted alphanumerically and separated by blanks, as one text with
the AU entry
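
The normalization steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: the stop-word list is a hypothetical placeholder (the document does not specify one), the function names are invented for this example, and "> 2 letters" is read as "more than 2 characters".

```python
import re

# Hypothetical stop-word list; the document does not define the real one.
STOP_WORDS = {"and", "the", "der", "die", "das", "und"}

# Step 1: replacement table taken from the list above.
UMLAUTS = {"Ä": "ae", "Ö": "oe", "Ü": "ue",
           "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def normalize_words(phrase, stop_words=STOP_WORDS):
    """Steps 1-6: return the sorted, duplicate-free list of words."""
    for src, dst in UMLAUTS.items():                    # step 1
        phrase = phrase.replace(src, dst)
    phrase = phrase.lower()                             # step 2
    phrase = re.sub(r"[^0-9a-z]", " ", phrase)          # step 3
    words = [w for w in phrase.split() if len(w) > 2]   # step 4
    words = [w for w in words if w not in stop_words]   # step 5
    return sorted(set(words))                           # step 6, sorted for step 7


def normalize(phrase, stop_words=STOP_WORDS):
    """Step 7: the blank-separated text that would be stored with the AU entry."""
    return " ".join(normalize_words(phrase, stop_words))


print(normalize("Die Größe der Häuser und Häuser!"))  # groesse haeuser
```

The same `normalize_words` function could serve the phrase to be matched, which only needs the list of words and not the stored text.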

For an AU entry, normalization is performed whenever the entry is created or changed, and the result is saved with the AU.

When creating combined entries, duplicates must be removed.

An AU entry consists of the following data:

AU data              Description
value_<lang>         the defining phrase, consisting of one or more words
variants_<lang>      one or more phrases for value
groups_<lang>        one or more generic terms for value
links_<lang>         one or more URLs describing the value
n_value_<lang>       normalized value phrase, unique between different types
n_variants_<lang>    normalized variant phrases
n_groups_<lang>      normalized group phrases
n_value_all          combined languages of normalized values
n_variants_all       combined languages of normalized variants and values
n_groups_all         combined languages of normalized groups and variants and values
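
As a rough sketch of how the combined fields could be derived from the per-language fields listed above, with duplicates removed: the sample entry, the two languages, and the `combine` helper are assumptions made for illustration, and the sketch treats each normalized field as a flat blank-separated bag of words, which the document does not state explicitly.

```python
def combine(*normalized_texts):
    """Merge blank-separated normalized texts into one sorted, duplicate-free text."""
    words = set()
    for text in normalized_texts:
        words.update(text.split())
    return " ".join(sorted(words))


# Hypothetical AU entry with two languages (de, en).
entry = {
    "n_value_de": "haus",
    "n_value_en": "house",
    "n_variants_de": "gebaeude haus",
    "n_variants_en": "building",
    "n_groups_de": "bauwerk",
    "n_groups_en": "building structure",
}

# Each combined field includes the previous one, as described in the field list.
entry["n_value_all"] = combine(entry["n_value_de"], entry["n_value_en"])
entry["n_variants_all"] = combine(
    entry["n_value_all"], entry["n_variants_de"], entry["n_variants_en"])
entry["n_groups_all"] = combine(
    entry["n_variants_all"], entry["n_groups_de"], entry["n_groups_en"])

print(entry["n_groups_all"])  # bauwerk building gebaeude haus house structure
```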
Matchings

The phrase to be matched is normalized on the fly, which yields a list of words.

We define the following matchings between the resulting list of words and the AU:

1. All listed words match all n_value_<lang> (exact match)
2. All listed words match any n_value_<lang> (exact match)
3. All listed words match all n_value_<lang> (fuzzy match)
4. All listed words match any n_value_<lang> (fuzzy match)
5. One listed word matches any n_value_<lang> (exact match)
6. One listed word matches any n_value_<lang> (fuzzy match)
7. All listed words match all n_value_all (exact match)
8. All listed words match any n_value_all (exact match)
9. All listed words match all n_value_all (fuzzy match)
10. All listed words match any n_value_all (fuzzy match)
11. One listed word matches any n_value_all (exact match)
12. One listed word matches any n_value_all (fuzzy match)
13. All listed words match all n_variants_all (exact match)
14. All listed words match any n_variants_all (exact match)
15. All listed words match all n_variants_all (fuzzy match)
16. All listed words match any n_variants_all (fuzzy match)
17. One listed word matches any n_variants_all (exact match)
18. One listed word matches any n_variants_all (fuzzy match)
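
The three match shapes (all/all, all/any, one/any) in exact and fuzzy form might be sketched as below. This is only one possible reading: the document defines neither "fuzzy" nor the precise meaning of "all" versus "any". Here "all/all" is taken as the word sets covering each other, "all/any" as every listed word occurring in the field, and "one/any" as at least one listed word occurring in the field. Fuzzy comparison uses `difflib.SequenceMatcher` with an assumed threshold of 0.8. The function and mode names are hypothetical.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.8  # assumed cut-off; not specified in the document


def word_matches(a, b, fuzzy=False):
    """Exact comparison, or similarity-ratio comparison when fuzzy is set."""
    if a == b:
        return True
    return fuzzy and SequenceMatcher(None, a, b).ratio() >= FUZZY_THRESHOLD


def match(query_words, field_text, mode, fuzzy=False):
    """Compare the listed words with one normalized, blank-separated AU field."""
    field_words = field_text.split()
    if not query_words or not field_words:
        return False

    def hit(word, pool):
        return any(word_matches(word, p, fuzzy) for p in pool)

    if mode == "all_all":   # every word on each side has a partner on the other
        return (all(hit(q, field_words) for q in query_words)
                and all(hit(f, query_words) for f in field_words))
    if mode == "all_any":   # every listed word occurs in the field
        return all(hit(q, field_words) for q in query_words)
    if mode == "one_any":   # at least one listed word occurs in the field
        return any(hit(q, field_words) for q in query_words)
    raise ValueError(f"unknown mode: {mode}")


print(match(["haeuser"], "groesse haeuser", "all_any"))             # True
print(match(["hauser"], "groesse haeuser", "one_any"))              # False
print(match(["hauser"], "groesse haeuser", "one_any", fuzzy=True))  # True
```

Under this reading, the 18 matchings are the combinations of three fields (n_value_<lang>, n_value_all, n_variants_all), three modes, and the exact/fuzzy switch.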

Algorithms

Every matching returns 0, 1, or more results. The general procedure is as follows:

 if count(results) = 1 then return this AUID
 else if count(results) > 1 then create new AU entry
 else if next matching defined then goto that
 else create new AU entry
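
The general procedure can be expressed as a loop over an ordered list of matchings. The sketch below is illustrative only: `resolve`, the representation of each matching as a callable returning a list of AUIDs, and the `create_entry` callback are assumptions, not names from the document.

```python
def resolve(matchings, create_entry):
    """Run the matchings in order and return an AUID.

    matchings:    ordered list of zero-argument callables, each returning
                  a list of matching AUIDs (hypothetical representation)
    create_entry: callable that creates a new AU entry and returns its AUID
    """
    for run_matching in matchings:
        results = run_matching()
        if len(results) == 1:
            return results[0]        # unique hit: use this AUID
        if len(results) > 1:
            return create_entry()    # ambiguous: create a new AU entry
        # no result: fall through to the next matching, if one is defined
    return create_entry()            # no matching left: create a new AU entry


# Usage with stand-in matchings: the first finds nothing, the second finds AUID 7.
print(resolve([lambda: [], lambda: [7]], create_entry=lambda: -1))  # 7
```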

Basic algorithm for Upload, Import, Search via filter, and dry-run:

 1->2->3->4->5->6->7->8->9->10->11->12->13->14->15->16->17->18
