CT075-3-2
• Explain the information retrieval concept and how we can use this
concept in text mining.
Key Terms you must be able to use
• Typical IR systems
– Online library catalogs
– Online document management systems
• Information retrieval vs. database systems
– Some DB problems are not present in IR, e.g., update,
transaction management, complex objects
– Some IR problems are not addressed well in DBMS,
e.g., unstructured documents, approximate search
using keywords and relevance
IR Basic definition
The core IR vocabulary includes the following four terms:
• Example: the words disease, diseases, and diseased share the common
stem disease, and can be treated as different occurrences
of that word.
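The stemming example above can be sketched in a few lines of Python. This is only a toy illustration: the function name `naive_stem` and its single rule (drop a trailing "s" or "d" when what remains ends in "e") are assumptions made for this example, not the algorithm used by any real IR system, which would rely on a much larger rule set.

```python
from collections import Counter


def naive_stem(word: str) -> str:
    """Toy stemmer (hypothetical rule): drop a final 's' or 'd' if the rest ends in 'e'."""
    word = word.lower()
    for suffix in ("s", "d"):
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and candidate.endswith("e"):
            return candidate
    return word


words = ["disease", "diseases", "diseased"]

# Map every surface form to its stem, then count occurrences per stem.
stem_counts = Counter(naive_stem(w) for w in words)
print(stem_counts)
```

After stemming, all three surface forms are counted under the single index term disease.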
Basic Measures for Text Retrieval
• Precision evaluates the ability of the IR system to retrieve top-ranked documents that
are mostly relevant, and is defined as the percentage of the retrieved documents that
are truly relevant to the user’s query.
• Recall evaluates the ability of the IR system to find all the relevant items in the
database, and is defined as the percentage of the documents relevant to the
user’s query that were actually retrieved.
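Both measures can be computed directly from two sets: the documents the system retrieved and the documents that are truly relevant. The sketch below assumes documents are identified by simple string IDs. The function name `precision_recall` and the sample IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, each as a fraction in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for a single query.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d2", "d3", "d5", "d6", "d7"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(p, r)  # 0.75 0.5
```

Here three of the four retrieved documents are relevant (precision 75%), but only three of the six relevant documents were found (recall 50%).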
The IR Quality
• Very high precision and low recall. In this case, the system returns only a few
documents and almost all of them are relevant, but at the same time a significant
number of other relevant documents are missing.
• Very high recall and relatively low precision. In this case, the system returns a
large number of documents that include almost all relevant documents but also
include a significant number of unwanted documents.
Vector-Space Model
• Vocabulary is the set of all distinct terms that remain after linguistic
preprocessing of the documents in the database, and contains t index
terms.
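• Cosine measure: the similarity between a document vector and a query vector
is the cosine of the angle between them, i.e. their dot product divided by
the product of their lengths.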
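A minimal sketch of the cosine measure over term-weight vectors follows. The vectors are made-up weights over a four-term vocabulary (T1 to T4), chosen only to illustrate the calculation. The function name `cosine_similarity` is likewise an assumption, not part of any particular library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same vocabulary")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_a * norm_b)


# Hypothetical weights for terms T1..T4.
query = [0, 0, 1, 0]
doc_a = [2, 0, 3, 1]  # contains T3, so it overlaps with the query
doc_b = [1, 1, 0, 0]  # no T3, so it has no overlap with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

A document that shares no terms with the query scores 0, and the score rises toward 1 as the document's weight concentrates on the query's terms, so ranking documents by this value puts the closest matches first.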
Example Text Similarity
What is the text similarity for the query below?
Given query: 0T1, 0T2, 1T3, 0T4 (that is, weight 1 on term T3 and weight 0 on T1, T2, and T4)