
HC1

Precision = the fraction of the retrieved docs that is relevant to the information need (selectivity).
Recall = the fraction of the relevant docs in the collection that is retrieved (sensitivity).
Boolean retrieval -> basic model, AND OR NOT () (library catalogues).
Sparse matrix approach -> store docIDs; Hash Table (no range queries)/B-Tree for the dictionary, array/linked list for the postings. Q = t1 AND t2 -> locate postings lists p1 and p2 for t1 and t2, calc. the intersection of p1 and p2 by list merging -> √n skip pointers, with n = length of the postings list.
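A minimal sketch of the t1 AND t2 intersection with √n skip pointers (the docIDs below are made up for illustration):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists of docIDs, skipping ahead
    by ~sqrt(len(list)) positions when the current docIDs differ."""
    skip1 = int(math.sqrt(len(p1))) or 1
    skip2 = int(math.sqrt(len(p2))) or 1
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip pointer only if it does not overshoot p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

# e.g. postings for t1 AND t2
print(intersect_with_skips([1, 3, 5, 7, 9, 11], [3, 7, 11, 15]))  # [3, 7, 11]
```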
Phrase Q -> positional index. Wildcard: permuterm index.

HC2
MapReduce: Map and Reduce workers process <k, v> pairs in parallel (impl. Hadoop). A map worker scans its own input once and does one uniform calc. on each k-v pair; its output is a set of >= 0 k-v pairs, which is first grouped on key and then given to a Reduce worker. The combined output of all Reduce workers = the result of the calc. Map always works on one <k, v> tuple; Reduce always works on one <k, [v1, v2, ..., vn]> tuple.
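A tiny in-memory word-count sketch of this <k, v> flow (not Hadoop itself; map_fn/reduce_fn are illustrative names):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map works on one <k, v> tuple and emits >= 0 <k, v> pairs.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce works on one <k, [v1, ..., vn]> tuple.
    return (word, sum(counts))

def map_reduce(inputs):
    grouped = defaultdict(list)
    for k, v in inputs:
        for key, value in map_fn(k, v):   # map phase
            grouped[key].append(value)    # group on key
    return [reduce_fn(k, vs) for k, vs in grouped.items()]  # reduce phase

print(map_reduce([(1, "to be or not to be"), (2, "to do")]))
```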

HC3
Scoring function: s: <q, d> -> v, with q(uery), d(ocument), v ∈ [0, 1] or v ∈ R+. It expresses the quality of the match between q and d and enables us to calc. the top-k.
Term frequency: higher score if t occurs more often in d; should contribute to the score function.
Inverse document frequency: measure of rareness. Define df_t = number of docs containing t, N = total number of docs; idf_t = log(N/df_t). weight(t, d) = tf_{t,d} × idf_t. idf has no effect for single-term queries, and summing the weights favours long docs
-> solution = vector space model.

We look for the doc vector with the smallest angle to the query vector (cosine similarity); this is a generalization of tf-idf scoring. x · y = 0 means the two vectors mismatch on all terms.
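A rough sketch of tf-idf weighting plus cosine ranking over a toy collection (documents and query are made up):

```python
import math
from collections import Counter

docs = {1: "information retrieval system",
        2: "retrieval of web information",
        3: "cooking recipes"}
N = len(docs)
df = Counter(t for text in docs.values() for t in set(text.split()))
idf = {t: math.log(N / df[t]) for t in df}

def tfidf_vector(text):
    tf = Counter(text.split())
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0   # 0 => mismatch on all terms

q = tfidf_vector("web information")
ranking = sorted(docs, key=lambda d: cosine(q, tfidf_vector(docs[d])), reverse=True)
print(ranking)   # doc 2 matches both query terms, so it ranks first
```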

Conjunctive equality query: SELECT * FROM R WHERE C1 AND … AND Ck


Here condition Ci is of the form Aj = qj or Aj IN (qj1, ..., qjl).
Similarity coefficient: S(u, v) = idf_k(u) if u (the value of A_k in q) = v (the value of A_k in the tuple), else 0.
QF similarity: uses the workload and is a measure of the popularity of the search term. QF_k(v) = RQF_k(v) / RQFMAX_k
-> RQF_k(v) is the "raw query frequency" of value (term) v under attribute A_k in the workload.
-> RQFMAX_k is the frequency of the most frequent term.
-> S(u, v) = QF(u) if u = v, else 0. And for tuple T and query Q: the overall score is the sum of the per-attribute similarities.
Attribute value similarity: the Jaccard coefficient measures the similarity between the sets W(t) and W(q): J(A, B) = |A ∩ B| / |A ∪ B|.
Attribute similarity: similarity between a query term q and a term t, S(t, q) = J(W(t), W(q)) * QF(q).
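A hedged sketch of S(t, q) = J(W(t), W(q)) * QF(q); the workload counts RQF and the sets W(·) below are invented for illustration:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# hypothetical workload counts: raw query frequencies RQF_k(v) for one attribute
rqf = {"Toyota": 40, "Honda": 25, "Lada": 2}
rqf_max = max(rqf.values())

def qf(v):
    # QF_k(v) = RQF_k(v) / RQFMAX_k
    return rqf.get(v, 0) / rqf_max

# W(t): terms that co-occur with t in IN-lists of workload queries (made up here)
W = {"Toyota": {"Toyota", "Honda"}, "Honda": {"Toyota", "Honda"}, "Lada": {"Lada"}}

def attribute_similarity(t, q):
    # S(t, q) = J(W(t), W(q)) * QF(q)
    return jaccard(W.get(t, {t}), W.get(q, {q})) * qf(q)

print(attribute_similarity("Honda", "Toyota"))  # 1.0: related makes, popular query term
print(attribute_similarity("Lada", "Toyota"))   # 0.0: disjoint W-sets
```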
HC4
Top-k query processing: f is monotone (non-decreasing) -> if you increase the value of one of the parameters of f and keep the others constant, f will not decrease.
Threshold Algorithm (TA): see the sketch below.
No Random Access (NRA) algorithm: same idea, but using sorted access only and maintaining lower/upper score bounds per object.
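A sketch of the Threshold Algorithm under the usual assumptions (per-attribute lists sorted by descending score, a random-access structure per list, monotone aggregation f):

```python
import heapq

def threshold_algorithm(sorted_lists, random_access, f, k):
    """Sketch of Fagin's Threshold Algorithm (TA).
    sorted_lists: per attribute, a list of (object, score) sorted by descending score.
    random_access: per attribute, a dict object -> score for random access.
    f: monotone aggregation function over a list of per-attribute scores."""
    seen = {}                              # object -> aggregated score f(...)
    last = [None] * len(sorted_lists)      # last score seen under sorted access
    for depth in range(max(len(lst) for lst in sorted_lists)):
        for i, lst in enumerate(sorted_lists):
            if depth >= len(lst):
                continue
            obj, score = lst[depth]
            last[i] = score
            # random access to the other lists to complete the score vector
            seen[obj] = f([ra.get(obj, 0.0) for ra in random_access])
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # stop once k objects score at least the threshold tau = f(last seen scores)
        if None not in last and len(top) == k and top[-1][1] >= f(last):
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# toy example with two ranked lists and f = sum
l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
l2 = [("b", 0.9), ("a", 0.5), ("c", 0.4)]
print(threshold_algorithm([l1, l2], [dict(l1), dict(l2)], sum, k=2))
```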

Frequent item set mining: for an association rule X → Y, define
• the support s(XY)
• the confidence s(XY)/s(X).
An item set is frequent if its support is bigger than a user-specified minimum support threshold. The Apriori property: a set is a candidate frequent set only if all its subsets are frequent, because:
- If X is frequent, then all its subsets are also frequent.
- If X has a subset that is not frequent, then X itself cannot be frequent.
Naïve complexity = O(2^m); Apriori's worst-case complexity is still exponential in m, but the subset pruning makes it much cheaper in practice.
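A compact Apriori sketch (the toy transactions are invented for illustration):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent item set mining: a k-set only becomes a candidate if all of
    its (k-1)-subsets are frequent (the Apriori property)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = {s for s in {frozenset([i]) for t in transactions for i in t}
                if support(s) >= min_support}
    result, size = set(frequent), 1
    while frequent:
        size += 1
        # join step: combine frequent (size-1)-sets, then prune by the subset check
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
    return result

transactions = [frozenset(t) for t in ({"beer", "chips"}, {"beer", "chips", "salsa"},
                                       {"chips", "salsa"}, {"beer"})]
print(apriori(transactions, min_support=0.5))
```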

HC5
PageRank = using the link structure to define the importance of a web site:
- When many sites refer to you, you are important.
- When important sites refer to you, you are important.
- When a site referring to you has many outgoing links, this decreases the weight of the reference.
HP = P; solving this directly has complexity O(n^3) & is all-or-nothing.
Alternative -> fixpoint iteration:
- Start with a vector P^(0) = (1/n, 1/n, ..., 1/n)^T
- Calculate P^(k) = H P^(k−1), for a certain k
- To solve dangling node (no outgoing edges) use teleportation
G = αS + (1 – α)T
- α = 1: we cannot guarantee convergence & convergence is slower;
- α = 0: we get results that completely ignore the structure of the web: all pages are equal.
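A small sketch of the fixpoint iteration on G (the column-stochastic convention, toy graph and α are assumptions of this example):

```python
import numpy as np

def pagerank(H, alpha=0.85, tol=1e-8):
    """Fixpoint (power) iteration on the Google matrix G = alpha*S + (1-alpha)*T.
    Convention assumed here: H[j, i] = probability of following a link from page i
    to page j, so columns of H sum to 1 (or to 0 for dangling nodes)."""
    n = H.shape[0]
    S = H.astype(float).copy()
    S[:, S.sum(axis=0) == 0] = 1.0 / n      # dangling nodes: teleport uniformly
    T = np.full((n, n), 1.0 / n)            # teleportation matrix
    G = alpha * S + (1 - alpha) * T
    P = np.full(n, 1.0 / n)                 # P^(0) = (1/n, ..., 1/n)^T
    while True:
        P_next = G @ P                      # P^(k) = G P^(k-1)
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next

# toy web: page 0 links to 1 and 2, page 1 links to 2, page 2 is dangling
H = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(H))
```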
HC6
BLOSUM -> represents log-odds ratios.
Dynamic programming: O(mn)
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
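A minimal Needleman-Wunsch DP sketch; the simple match/mismatch/gap scores below just stand in for a real substitution matrix such as BLOSUM:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming, O(mn) time and space."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap                    # align prefix of a against gaps
    for j in range(1, n + 1):
        D[0][j] = j * gap                    # align prefix of b against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s,   # (mis)match
                          D[i - 1][j] + gap,     # gap in b
                          D[i][j - 1] + gap)     # gap in a
    return D[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```

For local alignment (Smith-Waterman), clamp each cell at 0 and return the maximum cell instead of D[m][n].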

BLAST: if the size of k
• increases, then precision increases, recall decreases
• decreases, then precision decreases, recall increases
