
Mayank Sagavekar.

DAV TE -C9E Roll No : 4g


FINOLEX ACADEMY OF MANAGEMENT AND TECHNOLOGY, RATNAGIRI

Assignment No.
Q.1 What are the 7 practices of text analytics?

i) Search and information retrieval:
Storage and retrieval of text documents, including search engines and keyword search.

ii) Document clustering:
Grouping and categorizing terms, snippets, paragraphs or documents using data mining clustering methods.

iii) Document classification:
Grouping and categorizing snippets, paragraphs or documents using data mining classification methods, based on models trained on labeled examples.

iv) Web mining:
Data and text mining on the Internet, with a specific focus on the scale and interconnectedness of the web.

v) Information extraction:
Identification and extraction of relevant facts and relationships from unstructured text; the process of making structured data from unstructured and semi-structured text.

vi) Natural language processing (NLP):
Low-level language processing and understanding tasks; often used synonymously with computational linguistics.

vii) Concept extraction:
Grouping of words and phrases into semantically similar groups.

Q.2 List various methods of sentiment analysis.

1. Lexicon-based approach
2. Machine-learning techniques
   a) Supervised learning
   b) Unsupervised learning
3. Hybrid approach
Q.3 Give an example of stemming and lemmatization in sentiment analysis.
Sentence: "The cats are playing in the garden"
Stemming: "The cat are play in the garden"
Lemmatization: "The cat be play in the garden"
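The example above can be reproduced with a toy sketch. The suffix rules and the small lemma dictionary below are assumptions made purely for illustration; a real system would use a proper stemming algorithm and a full dictionary.

```python
import re

# Hypothetical, deliberately tiny rule set: strip "ing" first, then a plural "s".
def toy_stem(word):
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary covering only the words of the example sentence.
LEMMAS = {"cats": "cat", "are": "be", "playing": "play"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

sentence = "The cats are playing in the garden"
tokens = re.findall(r"[a-z]+", sentence.lower())

stemmed = " ".join(toy_stem(t) for t in tokens)
lemmatized = " ".join(toy_lemmatize(t) for t in tokens)
print(stemmed)      # the cat are play in the garden
print(lemmatized)   # the cat be play in the garden
```

The stemmer only chops suffixes, so it leaves "are" untouched, while the dictionary lookup maps it to its lemma "be". The output is lowercase because the sketch normalizes case before processing.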
Q.4 Difference between stemming and lemmatization.

Stemming | Lemmatization
i) Reduces words to their root or base form. | Reduces words to their canonical form based on their meaning.
ii) Less accurate, as it may result in non-dictionary words. | More accurate, as it considers the dictionary form of words.
iii) May result in non-dictionary words. | Always results in dictionary words.
iv) Simple and fast. | More complex and slower.



Q.5 Difference between BoW and TF-IDF.

BoW | TF-IDF
i) Represents text data as a collection of word occurrences, ignoring grammar and word order. | Measures the importance of a word in a document relative to a collection of documents.
ii) Counts the frequency of each word in the document. | Considers both the frequency of a word in the document and its inverse frequency across all documents.
iii) All words are equally weighted. | Words are weighted based on their importance in the document and across the entire corpus.
iv) Less effective in distinguishing between common and rare words. | More effective in highlighting important words while downweighting common words.
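The contrast in the table can be sketched on a small corpus. The two documents, the whitespace tokenization and the unsmoothed base-10 IDF used here are assumptions chosen for illustration; libraries often apply smoothed variants.

```python
import math
from collections import Counter

# Hypothetical two-document corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
tokenized = [d.split() for d in docs]

# Bag of Words: raw counts per document, word order ignored.
bow = [Counter(tokens) for tokens in tokenized]

def tf_idf(term, tokens, corpus):
    """TF-IDF = term frequency in the document * inverse document frequency."""
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    idf = math.log10(len(corpus) / df) if df else 0.0
    return tf * idf

the_weight = tf_idf("the", tokenized[0], tokenized)
cat_weight = tf_idf("cat", tokenized[0], tokenized)
print(bow[0]["the"], bow[0]["cat"])
print(the_weight, cat_weight)
```

BoW gives "the" the highest count in the first document, but its TF-IDF weight is zero because it occurs in every document, whereas the rarer "cat" receives a positive weight.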

Q.6 What is text mining? Enlist and explain 7 practice areas of text analytics.

Text mining, also known as text analytics or text data mining, is the process of extracting useful insights and information from unstructured textual data. It involves techniques from NLP, ML and computational linguistics to analyze and interpret large volumes of text.

7 practice areas of text analysis:

a) Search and Information Retrieval:
This area involves storing and retrieving text documents efficiently, often using search engines and keyword-based search queries. It focuses on matching user queries with relevant documents from large collections such as web pages, databases or document repositories.
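Keyword-based retrieval is commonly built on an inverted index. The following is a minimal sketch under assumed data: the three documents and the helper names `build_index` and `search` are hypothetical, and real search engines add ranking, stemming and much more.

```python
from collections import defaultdict

# Hypothetical mini collection standing in for web pages or a document repository.
documents = {
    1: "text mining extracts insights from text",
    2: "search engines retrieve relevant documents",
    3: "keyword search matches queries with documents",
}

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(documents)
hits = search(index, "search documents")
print(hits)
```

The query "search documents" matches documents 2 and 3, the only ones that contain both words.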

b) Document Clustering:
Document clustering involves grouping and categorizing text documents or snippets based on their similarity. It employs data mining clustering methods such as k-means clustering and hierarchical clustering to organize documents into meaningful clusters, which can aid in exploratory data analysis, document organization, and topic discovery.
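Clustering methods rest on a similarity measure between documents. The sketch below is not k-means or hierarchical clustering; it is a simplified greedy grouping over cosine similarity of word counts, with an assumed threshold of 0.5 and made-up example texts, meant only to show the idea of grouping by similarity.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_by_similarity(texts, threshold=0.5):
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    vectors = [Counter(t.lower().split()) for t in texts]
    clusters = []  # each cluster is a list of indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

texts = [
    "cheap flights to paris",
    "cheap flights to rome",
    "python list sorting",
    "sorting a python list",
]
clusters = group_by_similarity(texts)
print(clusters)
```

The two travel queries end up in one group and the two programming queries in another, even though no labels were supplied.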

c) Document Classification:
Document classification is the process of automatically assigning predefined categories or labels to text documents, snippets or paragraphs. It utilizes data mining classification methods, often based on ML algorithms trained on labeled examples, to classify documents into relevant categories. It is commonly used in spam detection, news categorization and sentiment analysis.

d) Web Mining:
Web mining focuses on extracting knowledge and insights from data available on the WWW. It encompasses text and data mining techniques tailored for the unique characteristics of web data, such as the scale, interconnectedness and heterogeneity of web resources. Web mining includes tasks like web content mining, web structure mining and web usage mining.

e) Information Extraction:
Information extraction involves identifying and extracting structured information from unstructured or semi-structured text sources. It aims to automatically recognize and extract relevant facts, entities, relationships and events from text data, transforming it into a structured format that can be easily processed and analyzed by machines.
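A very small instance of turning unstructured text into a structured record can be sketched with regular expressions. The input text and the two patterns are assumptions for illustration; production systems typically rely on trained entity recognizers rather than hand-written patterns.

```python
import re

# Hypothetical unstructured input.
text = (
    "Contact Asha at asha@example.com before 2024-03-15. "
    "Ravi (ravi@example.org) joined on 2023-11-01."
)

# Simplified patterns: an email address and an ISO-style date.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# The structured result: named fields instead of free text.
record = {
    "emails": EMAIL.findall(text),
    "dates": DATE.findall(text),
}
print(record)
```

The resulting dictionary can be stored in a table or passed to further analysis, which is the point of extraction.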

f) Natural Language Processing (NLP):
NLP focuses on the interaction between computers and human language. It encompasses a wide range of tasks, including low-level language processing tasks like tokenization, part-of-speech tagging and syntactic parsing, as well as higher-level tasks like named entity recognition, sentiment analysis, machine translation and text generation.

g) Concept Extraction:
Concept extraction involves grouping words and phrases into semantically similar groups or concepts. It aims to identify and categorize concepts or topics present in text data, regardless of the specific words used. Concept extraction techniques often rely on semantic analysis, topic modeling or ontology-based approaches to organize text data into meaningful conceptual groups.

Q.7 Explain with suitable example Term Frequency (TF), Document Frequency (DF) and Inverse Document Frequency (IDF).

i) Term frequency (TF):
In document d, the frequency represents the number of instances of a given word t. Therefore we can see that the more often a word appears in the text, the more relevant it becomes, which is rational. Since the ordering of terms is not significant, we can use a vector to describe the text in the bag-of-terms model. For each specific term in the document there is an entry, with the value being the term frequency.
Document 1: "The cat is black"
Document 2: "The dog is brown"
Document 3: "The cat and the dog are friends"

TF = (No. of times term appears in a document) / (Total no. of terms in the document)

TF("cat"):
TF("cat", Document 1) = 1/4 = 0.25
TF("cat", Document 2) = 0
TF("cat", Document 3) = 1/7 ≈ 0.143
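The TF values above can be checked with a short sketch. Lowercasing and splitting on whitespace are assumptions about how the terms are counted; the helper name `term_frequency` is hypothetical.

```python
documents = {
    "Document 1": "The cat is black",
    "Document 2": "The dog is brown",
    "Document 3": "The cat and the dog are friends",
}

def term_frequency(term, text):
    """TF = occurrences of the term / total number of terms in the document."""
    tokens = text.lower().split()
    return tokens.count(term.lower()) / len(tokens)

tf_cat = {name: round(term_frequency("cat", text), 3)
          for name, text in documents.items()}
print(tf_cat)  # {'Document 1': 0.25, 'Document 2': 0.0, 'Document 3': 0.143}
```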

ii) Document frequency (DF):

This tests the meaning of the text, which is very similar to TF, in the whole corpus collection. The only difference is that in document d, TF is the frequency counter for a term t, while DF is the number of occurrences of the term t in the document set N. In other words, DF is the number of documents in which the term is present.
Example: DF for the term "cat":
DF("cat") = 2 (the term appears in Document 1 and Document 3).

iii) Inverse Document Frequency (IDF):
Mainly, this tests how relevant the word is. The key aim of the search is to locate the appropriate records that fit the demand. Since TF considers all terms equally significant, term frequencies alone cannot be used to measure the weight of a term in the document.

IDF = log(Total no. of documents / No. of documents containing the term)

Calculate IDF for the term "cat":
IDF("cat") = log(3/2) = 0.176 (logarithm to base 10)

Q.8 List and explain the methods of sentiment analysis.

a) Lexicon-based Approach:
Assign sentiment scores to words based on a dictionary or lexicon. The overall sentiment of the text is the net sentiment of its constituent words.
Example lexicon: VADER.
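The lexicon-based approach can be sketched in a few lines. The six-word lexicon and its scores are hypothetical; real lexicons such as VADER are far larger and also account for negation, intensifiers and punctuation.

```python
import re

# Hypothetical mini lexicon mapping words to sentiment scores.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def lexicon_score(text):
    """Net sentiment: sum the scores of the words found in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

def label(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = "Great acting, but the ending was bad."
score = lexicon_score(review)
print(score, label(score))  # 1 positive
```

Words missing from the lexicon contribute nothing, so coverage of the lexicon directly limits the quality of the result.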

b) Machine Learning (ML) Techniques:

i) Supervised Learning
- Train a model on labeled data (e.g. movie reviews with sentiment labels).
- Use algorithms like Support Vector Machines (SVM), Naive Bayes or logistic regression.
ii) Unsupervised Learning
- Discover patterns and sentiments in text data without labeled examples.
- Techniques include clustering, topic modeling and self-organizing maps.
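As an illustration of the supervised route, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The four training sentences and their labels are invented for the example, and words never seen in training are simply ignored at prediction time, which is one of several possible conventions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data (stand-in for movie reviews).
train = [
    ("a great and moving film", "pos"),
    ("great cast and a lovely story", "pos"),
    ("a dull and boring film", "neg"),
    ("boring plot and terrible acting", "neg"),
]

class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # class prior
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

prediction = predict("a great story")
print(prediction)  # pos
```

With so little data the model is only a demonstration; SVM or logistic regression would be trained on the same kind of labeled examples.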

c) Hybrid Approaches
Combine multiple techniques to improve accuracy, for example, combining lexicon-based methods with machine learning algorithms.

Q.9 List and explain steps in text analysis.

Stage 1: Data gathering
In this stage, you gather text data from internal or external sources.

Internal data:
Internal data is text content that is internal to your business and is readily available, e.g. emails, chats, invoices.

External data:
You can find external data in sources such as social media posts, online reviews and news articles. It is harder to acquire external data because it is beyond your control. You might need to use web scraping tools or integrate with third-party solutions to extract external data.

Stage 2: Data preparation
Data preparation is an essential part of text analysis. It involves structuring raw text data in an acceptable format for analysis.

Tokenization:
Tokenization is segregating the raw text into multiple parts that make semantic sense. For e.g., the phrase "text analytics benefits businesses" tokenizes to the words "text", "analytics", "benefits" and "businesses".
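A minimal word-level tokenizer for the example phrase might look as follows. The regular expression is an assumption; it keeps letters, digits and apostrophes and drops other punctuation.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Text analytics benefits businesses")
print(tokens)  # ['text', 'analytics', 'benefits', 'businesses']
```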

Part-of-speech tagging:
Part-of-speech tagging assigns grammatical tags to the tokenized text. For e.g., applying this step to the previously mentioned tokens results in text: Noun; analytics: Noun; benefits: Verb; businesses: Noun.

Parsing:
Parsing establishes meaningful connections between the tokenized words with English grammar. It helps the text analysis software visualize the relationship between words.

Lemmatization:
Lemmatization is a linguistic process that simplifies words into their dictionary form, or lemma. E.g., the dictionary form of "visualizing" is "visualize".

Stop word removal:
Stop words are words that offer little or no semantic context to a sentence, such as "and", "or", "for". Depending on the use case, the software might remove them from the structured text.
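Stop word removal amounts to filtering tokens against a list. The list below is a hypothetical short sample; real lists contain a few hundred entries and depend on the language and use case.

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "or", "for", "the", "of", "to", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "tools for the analysis of reviews and chats".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['tools', 'analysis', 'reviews', 'chats']
```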

Stage 3: Text analysis
Text analysis is the core part of the process, in which text analysis software processes the text by using different methods.

Text classification:
Classification is the process of assigning tags to the text data that are based on rules or ML-based systems.

Text extraction:
Extraction involves identifying the presence of specific keywords in the text and associating them with tags.
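Keyword-based extraction can be sketched as a lookup from keywords to tags. The mapping and the feedback sentence are hypothetical examples, and matching is done on whole lowercase words only.

```python
import re

# Hypothetical keyword-to-tag mapping for customer feedback.
KEYWORD_TAGS = {
    "refund": "billing",
    "invoice": "billing",
    "crash": "bug",
    "slow": "performance",
}

def extract_tags(text):
    """Return the sorted tags whose keywords occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted({tag for kw, tag in KEYWORD_TAGS.items() if kw in words})

tags = extract_tags("The app is slow and I want a refund for this invoice.")
print(tags)  # ['billing', 'performance']
```

Two keywords map to the same tag here, so "billing" appears only once in the result.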

Stage 4: Visualization
Visualization is about turning text analysis results into an easily understandable format. You will find text analytics results in graphs, charts and tables. The visualized results help you identify patterns and trends and build action plans.
