
corpus - collection of written text

sentence segmentation - splits the text into sentences (e.g. after a full stop)


tokenization - splits sentences into tokens (words/punctuation marks), usually on the basis of spaces
stopwords - grammatical filler words such as a/an/are/as/for... etc.
*we remove numbers, special characters and stop words if not needed*

HELLO = HeLlO = heLlO = HElLO => hello (common case / lowercasing) (abbreviations as well, e.g. NLP --> nlp)
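
A minimal sketch of these pre-processing steps, assuming NLTK as the toolkit (the notes do not name one); the sample text is made up for illustration.

```python
# You may need one-time downloads:
#   import nltk; nltk.download('punkt'); nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

corpus = "HELLO there. NLP splits text into sentences and words, as shown here."

sentences = sent_tokenize(corpus)               # sentence segmentation
tokens = [word_tokenize(s) for s in sentences]  # tokenization (words + punctuation)

stop_words = set(stopwords.words("english"))
cleaned = [
    [t.lower() for t in sent                            # common case: HELLO -> hello
     if t.isalpha() and t.lower() not in stop_words]    # drop numbers/punctuation/stopwords
    for sent in tokens
]
print(cleaned)
```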

healed -ed heal


healing -ing heal
healer -er heal
studies -es studi (wrong - not a dictionary word)
*stemming* (removes affixes to give a stem, which may or may not be a valid word)
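
A small stemming sketch, assuming NLTK's Porter stemmer (the exact stem can differ between stemmers).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healed", "healing", "studies"]:
    print(word, "->", stemmer.stem(word))
# Typical Porter output: healed -> heal, healing -> heal, studies -> studi
# ("studi" is not a dictionary word - that's the stemming trade-off)
```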

healed -ed heal


healing -ing heal
healer -er heal
studies -es study (right - a dictionary word)
*lemmatization* (removes affixes to give a lemma, i.e. a dictionary word)
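
A matching lemmatization sketch, assuming NLTK's WordNet lemmatizer.

```python
# May need a one-time download: import nltk; nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # -> study (noun is the default POS)
print(lemmatizer.lemmatize("healing", pos="v"))  # -> heal  (passing the POS improves accuracy)
print(lemmatizer.lemmatize("healed", pos="v"))   # -> heal
```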

Caring ----> Care (Lemmatization)
Caring ----> Car  (Stemming)
*difference* (either can be used as per convenience)
stemming is quicker than lemmatization, but lemmatization gives proper dictionary words - see the timing sketch below
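
A rough timing sketch (the word list and repeat counts are assumptions) to back the speed point: stemming is rule-based, while lemmatization does dictionary lookups and is usually slower.

```python
import timeit
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["healed", "healing", "healer", "studies", "caring"] * 1000
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

print("stemming   :", timeit.timeit(lambda: [stemmer.stem(w) for w in words], number=10))
print("lemmatizing:", timeit.timeit(lambda: [lemmatizer.lemmatize(w) for w in words], number=10))
```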

*ALL THESE PROCESSES FALL UNDER TEXT NORMALIZATION*

'Bag of Words' is like a frequency table of the words(tokens) in a corpus.


1. collect data and do text normalization
2. create a dictionary (vocabulary) of the words present
3. create a document vector of word frequencies (one per document)
In essence this converts Text -> Tokens -> Numbers
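
A minimal Bag-of-Words sketch in plain Python, following the three steps above (the toy documents are made up for illustration).

```python
docs = ["the dog chased the cat", "the cat slept"]

# 1. text normalization (here just lowercasing + whitespace tokenization)
tokenized = [d.lower().split() for d in docs]

# 2. dictionary (vocabulary) of all words present in the corpus
vocab = sorted({w for doc in tokenized for w in doc})

# 3. one document vector of word frequencies per document
vectors = [[doc.count(w) for w in vocab] for doc in tokenized]

print(vocab)    # ['cat', 'chased', 'dog', 'slept', 'the']
print(vectors)  # [[1, 1, 1, 0, 2], [1, 0, 0, 1, 1]]
```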

TF-IDF - Term Frequency Inverse Document Frequency

stop words - greater occurrence, lower value

rare words - lower occurrence, greater value
frequent words - occurrence and value roughly in balance (e.g. names, topic words)

TF = frequency of a word in a document = the Bag of Words count


IDF(w) = log(Total number of documents / Number of documents containing the word 'w')
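
A small TF-IDF sketch using the formula above, with TF taken as the raw Bag of Words count; the toy documents are assumptions for illustration.

```python
import math

docs = [["the", "dog", "chased", "the", "cat"], ["the", "cat", "slept"]]
vocab = sorted({w for d in docs for w in d})
N = len(docs)

def idf(word):
    df = sum(1 for d in docs if word in d)  # number of documents containing the word
    return math.log(N / df)

tfidf = [[d.count(w) * idf(w) for w in vocab] for d in docs]
print(tfidf)
# 'the' and 'cat' appear in every document, so their IDF is log(2/2) = 0,
# while rarer words like 'dog' or 'slept' get a higher weight.
```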
