Professional Documents
Culture Documents
Sma 8
Sma 8
Task 2
1. of manual approach
2. of TF-IDF vectorizer and comment on the result.
Program:
import pandas as pd
import sklearn as sk
import math
idfDict = dict.fromkeys(docList[0].keys(), 0)
for word, val in idfDict.items():
idfDict[word] = math.log10(N / (float(val) + 1))
return(idfDict)
#inputing our sentences in the log file
idfs = computeIDF([wordDictA, wordDictB])
# TF-IDF
def computeTFIDF(tfBow, idfs):
tfidf = {}
for word, val in tfBow.items():
tfidf[word] = val*idfs[word]
return(tfidf)
#running our two sentences through the IDF:
idfFirst = computeTFIDF(tfFirst, idfs)
idfSecond = computeTFIDF(tfSecond, idfs)
#putting it in a dataframe
idf= pd.DataFrame([idfFirst, idfSecond])
idf.head(30)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
#for the sentence, make sure all words are lowercase or you will run
#into error. for simplicity, I just made the same sentence all
#lowercase
data1 = """The Phone 1 has a refreshingly simple lens lineup on the back,
with only two 50-megapixel cameras – one normal and one ultrawide –
forgoing additional rubbish macro or monochrome cameras common on mid-
range phones for marketing purposes.
Both cameras are good for the money. The main camera produces the best
images that have generally good colour balance and detail. Photos can lack
a little sharpness and fine detail when viewed at full size, and it can be
a little difficult to get a sharp shot in low light. The ultrawide
produces images with cooler tones and softer detail but is still decent.
The camera can occasionally oversaturate parts of an image, such as red
flowers losing all definition and almost glowing.
"""
data2 = "Both cameras are good for the money. The main camera produces the
best images that have generally good colour balance and detail. Photos can
lack a little sharpness and fine detail when viewed at full size, and it
can be a little difficult to get a sharp shot in low light. The ultrawide
produces images with cooler tones and softer detail but is still decent.
The camera can occasionally oversaturate parts of an image, such as red
flowers losing all definition and almost glowing."
tfidf_vectorizer = TfidfVectorizer()
doc_vec = tfidf_vectorizer.fit_transform(df1.iloc[0])
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
index=tfidf_vectorizer.get_feature_names_out())