
Chapter 2:

# Import pandas and the estimators used below
# (X_train, y_train, df, NUMERIC_COLUMNS, LABELS and score_submission
# are assumed to be preloaded by the exercise environment)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit it to the training data
clf.fit(X_train, y_train)

# Load the holdout data: holdout
holdout = pd.read_csv('HoldoutData.csv', index_col=0)
holdout = holdout[NUMERIC_COLUMNS].fillna(-1000)

# Generate predictions: predictions
# (holdout was already restricted to NUMERIC_COLUMNS and NaN-filled above)
predictions = clf.predict_proba(holdout)

# Format predictions in DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)
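
The column labels come from binarizing the training labels, so they line up with the probability columns returned by predict_proba. A quick sketch of how pd.get_dummies names those columns (the toy column and values here are hypothetical):

import pandas as pd

# Toy label column (hypothetical values)
toy = pd.DataFrame({'Label_Type': ['Salaries', 'Supplies', 'Salaries']})

# pd.get_dummies creates one indicator column per category,
# named '<original column>_<category value>'
print(pd.get_dummies(toy).columns.tolist())
# ['Label_Type_Salaries', 'Label_Type_Supplies']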

# Save prediction_df to csv
prediction_df.to_csv('predictions.csv')

# Submit the predictions for scoring: score
score = score_submission(pred_path='predictions.csv')

# Print score
print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))
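
Note that score_submission is a helper provided by the exercise environment, not part of scikit-learn. Going by the printed message, it computes a log loss against held-out truth labels; below is a minimal sketch of such a scorer, assuming a hypothetical truth file with the same index and columns as the submission:

import numpy as np
import pandas as pd

def score_submission_sketch(pred_path, truth_path='TruthData.csv', eps=1e-15):
    """Hypothetical stand-in for the course's score_submission helper:
    mean log loss of predicted probabilities against 0/1 truth columns."""
    preds = pd.read_csv(pred_path, index_col=0)
    truth = pd.read_csv(truth_path, index_col=0)[preds.columns]
    p = np.clip(preds.values, eps, 1 - eps)  # avoid log(0)
    y = truth.values
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))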

Tokenizing text
b) 6

N-grams: a five-token string has 5 1-grams, 4 2-grams, and 3 3-grams, so 1-grams + 2-grams + 3-grams is 5 + 4 + 3 = 12
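
That count is easy to verify with CountVectorizer's ngram_range (the five-token sentence below is illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

# Five distinct tokens: 5 1-grams + 4 2-grams + 3 3-grams = 12 features
vec = CountVectorizer(ngram_range=(1, 3))
vec.fit(['one two three four five'])
print(len(vec.get_feature_names()))  # 12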

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Fill missing values in df.Position_Extra
df.Position_Extra.fillna('', inplace=True)

# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit to the data
vec_alphanumeric.fit_transform(df.Position_Extra)

# Print the number of tokens and first 15 tokens
msg = "There are {} tokens in Position_Extra if we split on non-alphanumeric"
print(msg.format(len(vec_alphanumeric.get_feature_names())))
print(vec_alphanumeric.get_feature_names()[:15])
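
One subtlety of this pattern: the (?=\\s+) lookahead only matches tokens that are followed by whitespace, so the last token of a string is silently dropped. A quick check with re on a made-up value:

import re

# 'PETRO' is split off at the hyphen and never followed by whitespace,
# and 'FLUIDS' ends the string, so neither is matched
print(re.findall(TOKENS_ALPHANUMERIC, 'PETRO-VEND FUEL AND FLUIDS'))
# ['VEND', 'FUEL', 'AND']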

# Define combine_text_columns()
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Converts all text in each row of data_frame to a single string """

    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # Replace nans with blanks
    text_data.fillna("", inplace=True)

    # Join all text items in a row, separated by single spaces
    return text_data.apply(lambda x: " ".join(x), axis=1)
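
A quick sanity check on a toy frame (column names and values are hypothetical; in the exercise, NUMERIC_COLUMNS and LABELS are preloaded):

import pandas as pd

toy = pd.DataFrame({'FTE': [1.0, 0.5],
                    'Position_Extra': ['KINDERGARTEN', None],
                    'Function_Description': ['TEACHER', 'CUSTODIAN']})

# Treat FTE as the only non-text column for this toy example
print(combine_text_columns(toy, to_drop=['FTE']).tolist())
# ['KINDERGARTEN TEACHER', ' CUSTODIAN']  -- note the blank-filled NaN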

# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the basic token pattern
TOKENS_BASIC = '\\S+(?=\\s+)'

# Create the alphanumeric token pattern
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate basic CountVectorizer: vec_basic
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

# Instantiate alphanumeric CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = combine_text_columns(df)

# Fit and transform vec_basic
vec_basic.fit_transform(text_vector)

# Print number of tokens of vec_basic
print("There are {} tokens in the dataset".format(len(vec_basic.get_feature_names())))

# Fit and transform vec_alphanumeric
vec_alphanumeric.fit_transform(text_vector)

# Print number of tokens of vec_alphanumeric
print("There are {} alpha-numeric tokens in the dataset".format(len(vec_alphanumeric.get_feature_names())))
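
The two vocabularies differ wherever punctuation is involved, since the alphanumeric pattern splits on it while the whitespace pattern keeps it. One way to inspect tokens that only the whitespace tokenizer keeps (a sketch, assuming both vectorizers were fit as above):

# Tokens kept by the whitespace tokenizer but not the alphanumeric one,
# typically strings containing punctuation
basic_only = set(vec_basic.get_feature_names()) - set(vec_alphanumeric.get_feature_names())
print(sorted(basic_only)[:15])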
