tokenization - splitting text into words/punctuation marks (split on the basis of spaces) stopwords - a/an/are/as/for... etc. (grammatical words) *we remove numbers, special characters and stop words if they are not needed*
HELLO = HeLlO = heLlO = HElLO => hello (common case) (applies to abbreviations as well, e.g. NLP --> nlp)
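A minimal pure-Python sketch of tokenization, common-casing and stopword removal; the sample sentence and the tiny stopword list are made up for illustration (libraries such as NLTK ship fuller stopword lists):

```python
import string

# A tiny stopword list for illustration; real projects use a fuller one,
# e.g. nltk.corpus.stopwords.words('english').
STOP_WORDS = {"a", "an", "are", "as", "for", "and", "the", "to"}

text = "NLP removes 3 stopwords, special characters AND numbers!"

# Common case: lowercase everything so HELLO == HeLlO == hello.
text = text.lower()

# Tokenization: split on spaces, then strip punctuation from each token.
tokens = [t.strip(string.punctuation) for t in text.split()]

# Drop stopwords, numbers and anything that is not purely alphabetic.
clean = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

print(clean)  # ['nlp', 'removes', 'stopwords', 'special', 'characters', 'numbers']
```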
*stemming* - removes affixes to give a stem word (which may or may not be a valid dictionary word):
healed  -ed  -> heal
healing -ing -> heal
healer  -er  -> heal
studies -es  -> studi (wrong)
*lemmatization* - removes affixes to give a lemma word (a dictionary word):
healed  -ed  -> heal
healing -ing -> heal
healer  -er  -> heal
studies -es  -> study (right)
Caring ----> Care (Lemmatization); Caring ----> Car (Stemming) *difference* (either can be used as per convenience). Stemming is quicker than lemmatization, but lemmatization gives more accurate (dictionary) words; see the sketch below.
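A small sketch of the difference using NLTK's PorterStemmer and WordNetLemmatizer. Note that exact outputs depend on the stemmer chosen: NLTK's Porter stemmer, for instance, maps 'caring' to 'care', while more aggressive stemmers such as Lancaster are the ones usually quoted as giving 'car'.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # first run only: dictionary data for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops affixes by rule; the result may not be a dictionary word.
print(stemmer.stem('studies'))   # studi (wrong)
print(stemmer.stem('healing'))   # heal

# Lemmatization looks the word up and returns a real dictionary word.
print(lemmatizer.lemmatize('studies'))          # study (right)
print(lemmatizer.lemmatize('caring', pos='v'))  # care (pos='v' marks it as a verb)
```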
*ALL THESE PROCESSES FALL UNDER TEXT NORMALIZATION*
'Bag of Words' is like a frequency table of the words (tokens) in a corpus.
1. Collect data and do text normalization.
2. Create a dictionary of the unique words present.
3. Create a document vector (word frequencies), one per document.
In essence, this pipeline turns Text -> Tokens -> Numbers (see the sketch below).
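A from-scratch Python sketch of those three steps on a toy two-document corpus (the documents are invented for illustration, and normalization is reduced to lowercase-and-split for brevity):

```python
# 1. Collect data and normalize (here the text is already lowercase,
#    so normalization is just splitting on spaces).
docs = ["aman and anil are stressed",
        "aman went to a therapist"]
tokenized = [d.split() for d in docs]

# 2. Create a dictionary (vocabulary) of the unique words present.
vocab = sorted({w for doc in tokenized for w in doc})

# 3. Create a document vector: per document, the frequency of each vocab word.
vectors = [[doc.count(w) for w in vocab] for doc in tokenized]

print(vocab)    # the dictionary
print(vectors)  # Text -> Tokens -> Numbers
```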
TF-IDF - Term Frequency Inverse Document Frequency
stop words - greater occurrence, lower value
rare words - lower occurrence, greater value
frequent words (names etc.) - occurrence and value are almost the same
TF(w) = frequency of the word 'w' in a document = the Bag of Words count
IDF(w) = log( #Documents / #Documents containing the word 'w' )
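A minimal sketch of these two formulas in plain Python (the three-document toy corpus is made up for illustration):

```python
import math

docs = [["aman", "and", "anil", "are", "stressed"],
        ["aman", "went", "to", "a", "therapist"],
        ["anil", "went", "to", "download", "a", "health", "chatbot"]]

def tf(word, doc):
    # TF = frequency of the word in the document (the Bag of Words count).
    return doc.count(word)

def idf(word, docs):
    # IDF(w) = log( #documents / #documents containing the word w ).
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Common words score low, rare words score high.
print(tf_idf("a", docs[1], docs))        # 1 * log(3/2) ~ 0.405
print(tf_idf("chatbot", docs[2], docs))  # 1 * log(3/1) ~ 1.099
```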