
HC1

Precision = the fraction of the retrieved docs that is relevant to the information need (selectivity).
Recall = the fraction of the relevant docs in the collection that is retrieved (sensitivity).
Boolean retrieval -> basic model, AND OR NOT () (library catalogues).
Sparse matrix approach -> store docIDs; Hash Table (no range queries)/B-Tree for the dictionary, array/linked list for the postings. Q = t1 AND t2 -> locate postings lists p1 and p2 for t1 and t2, calc. the intersection of p1 and p2 by list merging -> √n skip pointers, with n = length of the postings list.
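A minimal sketch of the t1 AND t2 intersection with √n skip pointers (the docIDs below are made up for illustration):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists of docIDs, skipping ahead
    by ~sqrt(len(list)) positions when the current docIDs differ."""
    skip1 = int(math.sqrt(len(p1))) or 1
    skip2 = int(math.sqrt(len(p2))) or 1
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip pointer only if it does not overshoot p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

# e.g. postings for t1 AND t2
print(intersect_with_skips([1, 3, 5, 7, 9, 11], [3, 7, 11, 15]))  # [3, 7, 11]
```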
Phrase Q -> positional index. Wildcard: permuterm index.

HC2
MapReduce: Map and Reduce workers process <k, v> pairs in parallel (impl. Hadoop). A map worker scans its own input once and does one uniform calc. on each k-v pair; its output is a set of >= 0 k-v pairs, which is first grouped on key and then given to a Reduce worker. The combined output of all Reduce workers = the result of the calc. Map always works on one <k, v> tuple; Reduce always works on one <k, [v1, v2, ..., vn]> tuple.
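A tiny in-memory word-count sketch of this <k, v> flow (not Hadoop itself; map_fn/reduce_fn are illustrative names):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map works on one <k, v> tuple and emits >= 0 <k, v> pairs.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce works on one <k, [v1, ..., vn]> tuple.
    return (word, sum(counts))

def map_reduce(inputs):
    grouped = defaultdict(list)
    for k, v in inputs:
        for key, value in map_fn(k, v):   # map phase
            grouped[key].append(value)    # group on key
    return [reduce_fn(k, vs) for k, vs in grouped.items()]  # reduce phase

print(map_reduce([(1, "to be or not to be"), (2, "to do")]))
```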

HC3
Scoring function: s: <q, d> -> v, with q(uery), d(ocument), v ∈ [0, 1] or v ∈ R+. It expresses the quality of the match between q and d and enables us to calc. the top-k.
Term frequency: higher score if t occurs more often in d; should contribute to the score function.
Inverse document frequency: measure of rareness. Define df_t = number of docs containing t, N = total number of docs; idf_t = log(N/df_t). weight(t, d) = tf_{t,d} × idf_t. idf has no effect for single-term queries, and summing the weights favours long docs
-> solution = vector space model.

We look for the doc vector with the smallest angle to the query vector (cosine similarity); this is a generalization of tf-idf scoring. x · y = 0 means the two vectors mismatch on all terms.
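A rough sketch of tf-idf weighting plus cosine ranking over a toy collection (documents and query are made up):

```python
import math
from collections import Counter

docs = {1: "information retrieval system",
        2: "retrieval of web information",
        3: "cooking recipes"}
N = len(docs)
df = Counter(t for text in docs.values() for t in set(text.split()))
idf = {t: math.log(N / df[t]) for t in df}

def tfidf_vector(text):
    tf = Counter(text.split())
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0   # 0 => mismatch on all terms

q = tfidf_vector("web information")
ranking = sorted(docs, key=lambda d: cosine(q, tfidf_vector(docs[d])), reverse=True)
print(ranking)   # doc 2 matches both query terms, so it ranks first
```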

Conjunctive equality query: SELECT * FROM R WHERE C1 AND … AND Ck


Here condition Ci is of the form Aj = qj or Aj IN (qj1, ..., qjl).
Similarity coefficient: S(u, v) = idf_k(u) if u (the value of A_k in q) = v (the value of A_k in the tuple), else 0.
QF similarity: uses the workload and is a measure of the popularity of the search term. QF_k(v) = RQF_k(v) / RQFMAX_k
-> RQF_k(v) is the "raw query frequency" of value (term) v under attribute A_k in the workload.
-> RQFMAX_k is the frequency of the most frequent term.
-> S(u, v) = QF(u) if u = v, else 0. And for tuple T and query Q: the overall score is the sum of the per-attribute similarities.
Attribute value similarity: the Jaccard coefficient measures the similarity between the sets W(t) and W(q): J(A, B) = |A ∩ B| / |A ∪ B|.
Attribute similarity: similarity between a query term q and a term t, S(t, q) = J(W(t), W(q)) * QF(q).
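A hedged sketch of S(t, q) = J(W(t), W(q)) * QF(q); the workload counts RQF and the sets W(·) below are invented for illustration:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# hypothetical workload counts: raw query frequencies RQF_k(v) for one attribute
rqf = {"Toyota": 40, "Honda": 25, "Lada": 2}
rqf_max = max(rqf.values())

def qf(v):
    # QF_k(v) = RQF_k(v) / RQFMAX_k
    return rqf.get(v, 0) / rqf_max

# W(t): terms that co-occur with t in IN-lists of workload queries (made up here)
W = {"Toyota": {"Toyota", "Honda"}, "Honda": {"Toyota", "Honda"}, "Lada": {"Lada"}}

def attribute_similarity(t, q):
    # S(t, q) = J(W(t), W(q)) * QF(q)
    return jaccard(W.get(t, {t}), W.get(q, {q})) * qf(q)

print(attribute_similarity("Honda", "Toyota"))  # 1.0: related makes, popular query term
print(attribute_similarity("Lada", "Toyota"))   # 0.0: disjoint W-sets
```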
HC4
Top-k query processing: f is monotone (non-decreasing) -> if you increase the value of one of the parameters of f and keep the others constant, f will not decrease.
Threshold Algorithm (TA): see the sketch below.
No Random Access (NRA) algorithm: same idea, but using sorted access only and maintaining lower/upper score bounds per object.
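A sketch of the Threshold Algorithm under the usual assumptions (per-attribute lists sorted by descending score, a random-access structure per list, monotone aggregation f):

```python
import heapq

def threshold_algorithm(sorted_lists, random_access, f, k):
    """Sketch of Fagin's Threshold Algorithm (TA).
    sorted_lists: per attribute, a list of (object, score) sorted by descending score.
    random_access: per attribute, a dict object -> score for random access.
    f: monotone aggregation function over a list of per-attribute scores."""
    seen = {}                              # object -> aggregated score f(...)
    last = [None] * len(sorted_lists)      # last score seen under sorted access
    for depth in range(max(len(lst) for lst in sorted_lists)):
        for i, lst in enumerate(sorted_lists):
            if depth >= len(lst):
                continue
            obj, score = lst[depth]
            last[i] = score
            # random access to the other lists to complete the score vector
            seen[obj] = f([ra.get(obj, 0.0) for ra in random_access])
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # stop once k objects score at least the threshold tau = f(last seen scores)
        if None not in last and len(top) == k and top[-1][1] >= f(last):
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# toy example with two ranked lists and f = sum
l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
l2 = [("b", 0.9), ("a", 0.5), ("c", 0.4)]
print(threshold_algorithm([l1, l2], [dict(l1), dict(l2)], sum, k=2))
```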

Frequent item set mining: for an association rule X → Y, define
• the support s(XY)
• the confidence s(XY)/s(X).
An item set is frequent if its support is bigger than a user-specified minimum support threshold. The Apriori property: a set is a candidate frequent set only if all its subsets are frequent, because:
- If X is frequent, then all its subsets are also frequent.
- If X has a subset that is not frequent, then X itself cannot be frequent.
Naïve complexity = O(2^m); Apriori's worst-case complexity is still exponential in m, but the subset pruning makes it much cheaper in practice.
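A compact Apriori sketch (the toy transactions are invented for illustration):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent item set mining: a k-set only becomes a candidate if all of
    its (k-1)-subsets are frequent (the Apriori property)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = {s for s in {frozenset([i]) for t in transactions for i in t}
                if support(s) >= min_support}
    result, size = set(frequent), 1
    while frequent:
        size += 1
        # join step: combine frequent (size-1)-sets, then prune by the subset check
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
    return result

transactions = [frozenset(t) for t in ({"beer", "chips"}, {"beer", "chips", "salsa"},
                                       {"chips", "salsa"}, {"beer"})]
print(apriori(transactions, min_support=0.5))
```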

HC5
PageRank = using the link structure to define the importance of a web site:
- When many sites refer to you, you are important.
- When important sites refer to you, you are important.
- When a site referring to you has many outgoing links, this decreases the weight of the reference.
HP = P; solving this directly has complexity O(n^3) & is all-or-nothing.
Alternative -> fixpoint iteration:
- Start with a vector P^(0) = (1/n, 1/n, ..., 1/n)^T
- Calculate P^(k) = H P^(k−1), for a certain k
- To solve dangling node (no outgoing edges) use teleportation
G = αS + (1 – α)T
- α = 1: we cannot guarantee convergence & convergence is slower;
- α = 0: we get results that completely ignore the structure of the web: all pages are equal.
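A small sketch of the fixpoint iteration on G (the column-stochastic convention, toy graph and α are assumptions of this example):

```python
import numpy as np

def pagerank(H, alpha=0.85, tol=1e-8):
    """Fixpoint (power) iteration on the Google matrix G = alpha*S + (1-alpha)*T.
    Convention assumed here: H[j, i] = probability of following a link from page i
    to page j, so columns of H sum to 1 (or to 0 for dangling nodes)."""
    n = H.shape[0]
    S = H.astype(float).copy()
    S[:, S.sum(axis=0) == 0] = 1.0 / n      # dangling nodes: teleport uniformly
    T = np.full((n, n), 1.0 / n)            # teleportation matrix
    G = alpha * S + (1 - alpha) * T
    P = np.full(n, 1.0 / n)                 # P^(0) = (1/n, ..., 1/n)^T
    while True:
        P_next = G @ P                      # P^(k) = G P^(k-1)
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next

# toy web: page 0 links to 1 and 2, page 1 links to 2, page 2 is dangling
H = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(H))
```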
HC6
BLOSUM -> represents log-odds ratios.
Dynamic programming: O(mn)
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
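A minimal Needleman-Wunsch DP sketch; the simple match/mismatch/gap scores below just stand in for a real substitution matrix such as BLOSUM:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming, O(mn) time and space."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap                    # align prefix of a against gaps
    for j in range(1, n + 1):
        D[0][j] = j * gap                    # align prefix of b against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s,   # (mis)match
                          D[i - 1][j] + gap,     # gap in b
                          D[i][j - 1] + gap)     # gap in a
    return D[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```

For local alignment (Smith-Waterman), clamp each cell at 0 and return the maximum cell instead of D[m][n].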

BLAST: if the size of k
• increases, then precision increases, recall decreases
• decreases, then precision decreases, recall increases
