
LSA

Latent Semantic Indexing (also referred to as Latent Semantic Analysis) is a method of analyzing a set of documents in order to discover which words statistically co-occur; these co-occurrences give insight into the topics of those words and documents.

Two of the problems (among several) that LSI sets out to solve are
the issues of synonymy and polysemy.

Synonymy refers to the fact that many different words can describe the same thing.

A person searching for “flapjack recipes” is effectively making the same search as one for “pancake recipes” (outside of the UK), because flapjacks and pancakes are synonymous.

Polysemy refers to words and phrases that have more than one meaning. The word jaguar can mean an animal, an automobile, or an American football team.

LSI is able to predict which meaning of a word is intended by statistically analyzing the words that co-occur with it in a document.

If the word “jaguar” is accompanied in a document by the word “Jacksonville,” it is statistically probable that the word “jaguar” is a reference to an American football team.

LDA
Topic modeling is a type of statistical modeling for discovering the
abstract “topics” that occur in a collection of documents. Latent
Dirichlet Allocation (LDA) is an example of a topic model and is used to classify the text in a document to a particular topic.

In this model, observations (e.g., words) are collected into documents, and each word's
presence is attributable to one of the document's topics. Each document will contain
a small number of topics.
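A minimal sketch of that idea, assuming scikit-learn (the toy corpus, the two topics, and the random seed are all illustrative), fits LDA to word counts and prints each document's mixture over topics:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the puppy loves to bark and fetch the ball",
    "my cat likes to purr and chase the mouse",
    "the dog and the cat nap together after grooming",
]

# LDA is defined over raw word counts (a bag-of-words model), not tf-idf weights.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is one document's distribution over the two topics; rows sum to 1,
# and most of the probability mass typically falls on only a few topics.
print(doc_topics.round(2))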

Machine learning
One application of LDA in machine learning - specifically, topic discovery, a subproblem
in natural language processing - is to discover topics in a collection of documents, and then
automatically classify any individual document within the collection in terms of how "relevant" it is
to each of the discovered topics. A topic is considered to be a set of terms (i.e., individual words
or phrases) that, taken together, suggest a shared theme.
For example, in a document collection related to pet animals, the
terms dog, spaniel, beagle, golden retriever, puppy, bark, and woof would suggest
a DOG_related theme, while the terms cat, siamese, Maine coon, tabby, manx, meow, purr,
and kitten would suggest a CAT_related theme. There may be many more topics in the
collection - e.g., those related to diet, grooming, healthcare, behavior, etc. - that we do not
discuss for simplicity's sake. (Very common, so-called stop words in a language - e.g., "the",
"an", "that", "are", "is", etc. - would not discriminate between topics and are usually filtered
out by pre-processing before LDA is performed. Pre-processing also converts terms to their
"root" lexical forms - e.g., "barks", "barking", and "barked" would be converted to "bark".)
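The pre-processing just described might look roughly like the sketch below, which assumes scikit-learn's built-in English stop-word list and NLTK's Porter stemmer; a full lemmatizer (e.g., spaCy, or NLTK's WordNet lemmatizer) would produce cleaner root forms but needs extra language data.

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase, drop stop words, and reduce each remaining term to a root form.
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return [stemmer.stem(t) for t in kept]

print(preprocess("The dog barks and is barking at that barked command"))
# expected output: ['dog', 'bark', 'bark', 'bark', 'command']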
If the document collection is sufficiently large, LDA will discover such sets of terms (i.e., topics)
based upon the co-occurrence of individual terms, though the task of assigning a meaningful
label to an individual topic (i.e., that all the terms are DOG_related) is up to the user, and often
requires specialized knowledge (e.g., for a collection of technical documents). The LDA approach
assumes that:

1. The semantic content of a document is composed by combining one or more
terms from one or more topics.
2. Certain terms are ambiguous, belonging to more than one topic, with different
probability. (For example, the term training can apply to both dogs and cats, but
is more likely to refer to dogs, which are used as work animals or participate in
obedience or skill competitions.) However, in a document, the accompanying
presence of specific neighboring terms (which belong to only one topic) will
disambiguate their usage.
3. Most documents will contain only a relatively small number of topics. Across the
collection, individual topics will occur with differing frequencies; that is, the topics
have a probability distribution, so that a given document is more likely to contain
some topics than others.
4. Within a topic, certain terms will be used much more frequently than others. In
other words, the terms within a topic will also have their own probability
distribution, as the sketch after this list illustrates.
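The last two assumptions can be seen directly in a fitted model. The sketch below (again assuming scikit-learn; the pet corpus and the choice of two topics are illustrative) prints the highest-weighted terms in each discovered topic and each document's topic mixture; attaching a human-readable label such as DOG_related to a topic is still left to the user.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "puppy bark woof golden retriever spaniel dog",
    "kitten meow purr tabby siamese cat",
    "dog puppy bark fetch beagle",
    "cat kitten purr meow manx",
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vec.get_feature_names_out()

# Within each topic, some terms carry far more weight than others (assumption 4);
# normalizing a row of components_ turns it into a probability distribution over terms.
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])

# Each document's (usually sparse) distribution over topics (assumption 3).
print(lda.transform(counts).round(2))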
