9 Feature Engineering Text Data
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
import warnings
warnings.filterwarnings('ignore')
Bag of words
In [2]: text = ['Hi, how are you you?',
'I am Fine. You?']
print("Suppose. This is our text data : ",text)
Suppose. This is our text data : ['Hi, how are you you?', 'I am Fine. You?']
[[0 1 0 1 1 2]
[1 0 1 0 0 1]]
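The count matrix above can be reproduced with a minimal sketch along these lines. It assumes `CountVectorizer` with its default settings; the names `vectorizer`, `bow` and `vocabulary` are illustrative and do not come from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hi, how are you you?',
        'I am Fine. You?']

# Default settings: lowercase the text and keep tokens of 2+ word characters,
# so punctuation and the one-letter token "I" are dropped.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(text).toarray()

# Columns follow the alphabetically sorted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(bow)
```

Each row is one document and each column counts one vocabulary word, which is why "you" gets a 2 in the first row.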
In [7]: text
                      am  are  fine  hi  how  you
Hi, how are you you?   0    1     0   1    1    2
I am Fine. You?        1    0     1   0    0    1
Bag of N-grams
In [6]: text = ['Hi, how are you?',
'Fine. You?']
print("Suppose. This is our text data : ",text)
Suppose. This is our text data : ['Hi, how are you?', 'Fine. You?']
In [9]: # help(CountVectorizer)
Words of Bag-of-words : ['am', 'am fine', 'are', 'are you', 'fine', 'fine you', 'hi', 'hi how', 'how', 'how are', 'you', 'you you']
Out[11]:
                 am  am fine  are  are you  fine  fine you  hi  hi how  how  how are  you  you you
I am Fine. You?   1        1    0        0     1         1   0       0    0        0    1        0
In [12]: print(f'{words}\nunigram count: {len(words)} \n{bigrams} \nbigram count: {len(bigrams)}')
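The names `words` and `bigrams` printed above are not defined in the surviving cells. A minimal sketch of how they might be built follows. It assumes `ngram_range=(1, 2)` and the two-sentence text from the Bag of words section, since the listed vocabulary contains 'am' and 'you you', which only that text produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hi, how are you you?',
        'I am Fine. You?']

# ngram_range=(1, 2) extracts single words and pairs of adjacent words.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(text).toarray()

# Feature names in column order; a space marks a two-word feature.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
words = [f for f in features if ' ' not in f]    # unigrams
bigrams = [f for f in features if ' ' in f]      # bigrams

print(f'{words}\nunigram count: {len(words)} \n{bigrams} \nbigram count: {len(bigrams)}')
```

Adding bigrams doubles the feature count here (6 unigrams plus 6 bigrams), which hints at how quickly n-gram vocabularies grow on real corpora.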
https://stackoverflow.com/questions/42440621/how-term-frequency-is-calculated-in-tfidfvectorizer
In [16]: from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
text = ['it is a puppy',
'it is a cat',
'it is a kitten',
'that is a dog and this is a pen']
tfidf = TfidfVectorizer()
tfidf.fit(text)
tf = tfidf.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
tf
Out[16]: ['and', 'cat', 'dog', 'is', 'it', 'kitten', 'pen', 'puppy', 'that', 'this']
                                      and       cat       dog        is        it    kitten       pen     puppy      that      this
it is a puppy                    0.000000  0.000000  0.000000  0.402642  0.492489  0.000000  0.000000  0.771579  0.000000  0.000000
it is a cat                      0.000000  0.771579  0.000000  0.402642  0.492489  0.000000  0.000000  0.000000  0.000000  0.000000
it is a kitten                   0.000000  0.000000  0.000000  0.402642  0.492489  0.771579  0.000000  0.000000  0.000000  0.000000
that is a dog and this is a pen  0.405245  0.000000  0.405245  0.422947  0.000000  0.000000  0.405245  0.000000  0.405245  0.405245
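TF-IDF value for the term 'puppy': 0.34657359027997264. scikit-learn's implementation gives a different value (0.771579 in the 'it is a puppy' row above), because TfidfVectorizer smooths the IDF and L2-normalizes each document vector by default.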
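The two numbers for 'puppy' can be compared with a short sketch. It assumes the value 0.3466 was computed with the textbook formula tf × ln(N / df), taking tf as 1/4 (one occurrence among four words); that is an inference that happens to match the printed value, not something the notebook states. The scikit-learn side uses the documented defaults `smooth_idf=True` and `norm='l2'`.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['it is a puppy',
        'it is a cat',
        'it is a kitten',
        'that is a dog and this is a pen']

# Textbook form (assumed): tf = count / document length, idf = ln(N / df).
n_docs = len(text)
tf = 1 / len(text[0].split())      # "puppy" occurs once among 4 words
idf = math.log(n_docs / 1)         # "puppy" appears in 1 of 4 documents
textbook = tf * idf
print(textbook)

# scikit-learn defaults: raw counts, idf = ln((1 + N) / (1 + df)) + 1,
# then each row is scaled to unit L2 norm. The one-letter token "a" is dropped.
def smooth_idf(df):
    return math.log((1 + n_docs) / (1 + df)) + 1

weights = {'it': smooth_idf(3), 'is': smooth_idf(4), 'puppy': smooth_idf(1)}
norm = math.sqrt(sum(w ** 2 for w in weights.values()))
by_hand = weights['puppy'] / norm

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(text).toarray()
from_sklearn = matrix[0, tfidf.vocabulary_['puppy']]
print(by_hand, from_sklearn)
```

The hand calculation and the library agree, so the difference from the textbook value comes from the formula variant and the normalization rather than from the data.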
In [21]: # data.rar
# Load Yelp business data
import json
import pandas as pd

# The file holds one JSON object per line.
with open('data/yelp_academic_dataset_business.json') as biz_f:
    biz_df = pd.DataFrame([json.loads(line) for line in biz_f])
biz_df.head()
Out[21]:
  | business_id | full_address | open | categories | city | review_count | name | neighborhoods | longitude | state | stars | latitude
0 | rncjoVoEFUJGCUoC1JgnUA | 8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345 | True | [Accountants, Professional Services, Tax Servi... | Peoria | 3 | Peoria Income Tax Service | [] | -112.241596 | AZ | 5.0 | 33.581867 | bu
1 | 0FNFSzCFP_rGUoJx8W7tJg | 2149 W Wood Dr\nPhoenix, AZ 85029 | True | [Sporting Goods, Bikes, Shopping] | Phoenix | 5 | Bike Doctor | [] | -112.105933 | AZ | 5.0 | 33.604054 | bu
2 | 3f_lyB6vFK48ukH6ScvLHg | 1134 N Central Ave\nPhoenix, AZ 85004 | True | [] | Phoenix | 4 | Valley Permaculture Alliance | [] | -112.073933 | AZ | 5.0 | 33.460526 | bu
3 | usAsSV36QmUej8--yvN-dg | 845 W Southern Ave\nPhoenix, AZ 85041 | True | [Food, Grocery] | Phoenix | 5 | Food City | [] | -112.085377 | AZ | 3.5 | 33.392210 | bu
4 | PzOqRohWw7F7YEPBz6AubA | 6520 W Happy Valley Rd\nSte 101\nGlendale Az, ... | True | [Food, Bagels, Delis, Restaurants] | Glendale Az | 14 | Hot Bagels & Deli | [] | -112.200264 | AZ | 3.5 | 33.712797 | bu
In [20]: review_df['text']
0 JxVGJ9Nly2FFIs_WpJvkug They built a Sauce in Minneapolis & it only la... [Pizza, Restaurants] False
1 JxVGJ9Nly2FFIs_WpJvkug I was pleasantly surprised by Sauce. We went h... [Pizza, Restaurants] False
2 JxVGJ9Nly2FFIs_WpJvkug I was very disappointed my last experience at ... [Pizza, Restaurants] False
3 JxVGJ9Nly2FFIs_WpJvkug Fun, Fast, Easy, Yummy. Nice Flatbread style ... [Pizza, Restaurants] False
4 Jj7bcQ6NDfKoz4TXwvYfMg Pros... Quick, good, cooked right, self serve ... [Burgers, Restaurants] False
7207 QzXFdjIbFRGhzL83goPPLA ordered the steak sandwich (medium rare). Cam... [Asian Fusion, Restaurants] False
7208 QzXFdjIbFRGhzL83goPPLA Good food when it's slow. Not so good when sup... [Asian Fusion, Restaurants] False
7209 QzXFdjIbFRGhzL83goPPLA Good service. The waitress was friendly. but t... [Asian Fusion, Restaurants] False
7210 GZ8KctCJxGzYZ7aAdapprg Not recommended if you're not white. For me, i... [Active Life, Amusement Parks, Nightlife, Bowl... True
7211 F3tqTcfKnljJcSyyqN0bbw Great food, clean a bit old but nice [Mexican, Restaurants] False
In [25]: print(two_biz_reviews.target.value_counts())
# two_biz_reviews.head()
False 5899
True 1313
Name: target, dtype: int64
nightlife : (1313, 4)
restaurants : (6911, 4)
In [28]: (1313*.9)/6911
Out[28]: 0.17098827955433368
(1182, 4) (1175, 4)
Out[29]: (2357, 4)
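The fraction computed in In [28] suggests the larger class was downsampled to roughly the size of 90% of the smaller one. The sketch below shows one way to do that with `DataFrame.sample`. The two frames are synthetic stand-ins, the variable names are hypothetical, and the resulting row counts depend on the actual data and filtering, so they need not match the shapes printed above.

```python
import pandas as pd

# Synthetic stand-ins for the notebook's review frames (hypothetical data):
# 1313 nightlife reviews (target=True) and 6911 restaurant reviews (target=False).
nightlife = pd.DataFrame({'text': ['nightlife review'] * 1313, 'target': True})
restaurants = pd.DataFrame({'text': ['restaurant review'] * 6911, 'target': False})

# Keep 90% of the smaller class, then sample the larger class at the
# fraction that yields about the same number of rows.
frac = (len(nightlife) * 0.9) / len(restaurants)
nightlife_sample = nightlife.sample(frac=0.9, random_state=0)
restaurants_sample = restaurants.sample(frac=frac, random_state=0)

# Stack the two samples into one roughly balanced frame.
balanced = pd.concat([nightlife_sample, restaurants_sample], ignore_index=True)
print(nightlife_sample.shape, restaurants_sample.shape, balanced.shape)
```

Balancing the classes this way keeps a classifier from scoring well simply by always predicting the majority label.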
Data Split
Data Representation
11478
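The cells for these two steps did not survive, so the following is only a sketch of a common pattern: split first, learn the vocabulary and IDF weights on the training split, then reuse them on the test split. The toy corpus, the split size and all variable names are assumptions. The number printed above is plausibly the vocabulary size, which corresponds to `len(bow.vocabulary_)` here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Tiny hypothetical corpus standing in for the review texts.
texts = ['great pizza and fast service', 'loud bar with cheap drinks',
         'the burger was cooked right', 'fun bowling night with friends',
         'friendly waitress, good food', 'live music and a late crowd']
labels = [False, True, False, True, False, True]   # True = nightlife

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=42, stratify=labels)

# Learn the vocabulary from the training split only, then reuse it on the test split.
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

# Reweight the same counts as tf-idf, with IDF also learned from training data.
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_bow)
X_test_tfidf = tfidf.transform(X_test_bow)

print(len(bow.vocabulary_))   # vocabulary size = number of feature columns
print(X_train_bow.shape, X_test_tfidf.shape)
```

Fitting on the training split only keeps words and document frequencies from the test reviews out of the features the model is trained on.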
Fine-Tuning
using BoW:
Fitting 5 folds for each of 5 candidates, totalling 25 fits
using tf-idf:
Fitting 5 folds for each of 5 candidates, totalling 25 fits
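The log lines above are what `GridSearchCV` prints with `verbose=1` when 5 parameter candidates are evaluated with 5-fold cross-validation. The classifier and grid used in the notebook are not shown, so the sketch below assumes logistic regression with five values of `C`, run on a hypothetical toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 10 restaurant-like and 10 nightlife-like reviews.
texts = ['good food and friendly service'] * 10 + ['loud bar with late drinks'] * 10
labels = [False] * 10 + [True] * 10

# Five candidate regularization strengths (an assumed grid).
param_grid = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0, 100.0]}

results = {}
for name, vectorizer in [('BoW', CountVectorizer()), ('tf-idf', TfidfVectorizer())]:
    # Vectorizer and classifier share one pipeline, so each fold refits both.
    pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    search = GridSearchCV(pipeline, param_grid, cv=5, verbose=1)
    print(f'using {name}:')
    search.fit(texts, labels)
    results[name] = (search.best_params_, search.best_score_)

print(results)
```

Putting the vectorizer inside the pipeline means each cross-validation fold learns its own vocabulary, so the scores for the two representations are compared on equal terms.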