LSI AND LDA BUG 2 by Dr. Devesh
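Two of the problems (among several) that Latent Semantic Indexing (LSI) sets out to solve are
synonymy and polysemy.
Synonymy refers to different words that share the same meaning. The
words car and automobile, for example, name the same thing.
Polysemy refers to words and phrases that have more than one
meaning. The word jaguar can mean an animal, an automobile, or an
American football team.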
LDA
Topic modeling is a type of statistical modeling for discovering the
abstract “topics” that occur in a collection of documents. Latent
Dirichlet Allocation (LDA) is an example of a topic model and is
used to classify the text in a document into particular topics.
In LDA, observations (e.g., words) are collected into documents, and each word's
presence is attributable to one of the document's topics. Each document is assumed to contain
only a small number of topics.
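The idea that each word is attributable to one of the document's topics can be sketched as a small simulation. This is only a minimal illustration under stated assumptions: the topic names, the word probabilities, the alpha values, and the helper names (sample_dirichlet, generate_document) are all invented here, and the topic-word distributions are fixed by hand rather than learned or drawn from a prior as a full LDA model would do.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate (topic, word) pairs: pick a topic per word, then a word from it."""
    topic_names = list(topics)
    # Per-document topic mixture; small alpha values favor a few dominant topics.
    theta = sample_dirichlet(alpha, rng)
    doc = []
    for _ in range(n_words):
        # Each word is attributed to exactly one topic drawn from the mixture.
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        words = list(topics[topic])
        weights = list(topics[topic].values())
        word = rng.choices(words, weights=weights, k=1)[0]
        doc.append((topic, word))
    return theta, doc

# Hypothetical topic-word distributions, invented for illustration.
TOPICS = {
    "DOG": {"dog": 0.4, "puppy": 0.3, "bark": 0.3},
    "CAT": {"cat": 0.4, "kitten": 0.3, "purr": 0.3},
}

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("topic mixture:", theta)
print("generated words:", doc)
```

Fitting LDA to real documents runs this story in reverse: only the words are observed, and the topic mixtures and topic-word distributions are inferred from them.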
Machine learning
One application of LDA in machine learning - specifically, topic discovery, a subproblem
in natural language processing - is to discover topics in a collection of documents, and then
automatically classify any individual document within the collection in terms of how "relevant" it is
to each of the discovered topics. A topic is considered to be a set of terms (i.e., individual words
or phrases) that, taken together, suggest a shared theme.
For example, in a document collection related to pet animals, the
terms dog, spaniel, beagle, golden retriever, puppy, bark, and woof would suggest
a DOG_related theme, while the terms cat, Siamese, Maine coon, tabby, Manx, meow, purr,
and kitten would suggest a CAT_related theme. There may be many more topics in the
collection (e.g., related to diet, grooming, healthcare, or behavior) that we do not discuss for
simplicity's sake. (Very common, so-called stop words in a language, e.g., "the", "an", "that",
"are", and "is", would not discriminate between topics and are usually filtered out by
pre-processing before LDA is performed. Pre-processing also converts terms to their "root" lexical
forms, e.g., "barks", "barking", and "barked" would be converted to "bark".)
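The pre-processing described above can be sketched in a few lines. This is a rough illustration, not a production pipeline: the stop-word list is a tiny hand-picked sample, and crude_stem is a hypothetical suffix stripper written for this example, far simpler than a real stemmer or lemmatizer.

```python
import re

# A tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "an", "a", "that", "are", "is", "and"}

# Crude suffix stripping for illustration only; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, and stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The dog barks, the puppy is barking, and the beagle barked.")
print(tokens)
# → ['dog', 'bark', 'puppy', 'bark', 'beagle', 'bark']
```

After this step the three surface forms of "bark" count as the same term, which is what allows co-occurrence statistics to accumulate on a single root form.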
If the document collection is sufficiently large, LDA will discover such sets of terms (i.e., topics)
based upon the co-occurrence of individual terms, though the task of assigning a meaningful
label to an individual topic (i.e., that all the terms are DOG_related) is up to the user, and often
requires specialized knowledge (e.g., for a collection of technical documents). The LDA approach
assumes that: