Feature selection

PREPROCESSING FOR MACHINE LEARNING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
What is feature selection?
Selecting features to be used for modeling
Doesn't create new features

Improves the model's performance
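As a quick illustration, feature selection in pandas is just keeping a subset of existing columns. A minimal sketch, using the city data shown on the next slide:

import pandas as pd

# A small frame with redundant location features (values from the course table)
df = pd.DataFrame({
    'city': ['hico', 'mackinaw city', 'winchester'],
    'state': ['tx', 'mi', 'ky'],
    'lat': [31.982778, 45.783889, 37.990000],
    'long': [-98.033333, -84.727778, -84.179722],
})

# Selecting features keeps existing columns; nothing new is derived
df_selected = df[['lat', 'long']]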

When to select features
city           state  lat        long
hico           tx     31.982778  -98.033333
mackinaw city  mi     45.783889  -84.727778
winchester     ky     37.990000  -84.179722

Reducing noise

Removing features that are strongly statistically correlated

Reducing overall variance

Let's practice!
Removing redundant features

Redundant features
Remove noisy features
Remove correlated features

Remove duplicated features (one pandas approach sketched below)
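For exactly duplicated columns, one common pandas idiom is to transpose, drop duplicate rows, and transpose back. A sketch (the toy frame is illustrative, not the course's data):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [1, 2, 3],   # exact duplicate of 'a'
                   'c': [4, 5, 6]})

# Columns become rows, duplicate rows are dropped, then transpose back
df_deduped = df.T.drop_duplicates().T   # keeps 'a' and 'c'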

Scenarios for manual removal
city           state  lat        long
hico           tx     31.982778  -98.033333
mackinaw city  mi     45.783889  -84.727778
winchester     ky     37.990000  -84.179722
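Here lat/long encode the same location information as city/state, so one representation can be dropped by hand. A sketch, assuming df holds the table above:

# Keep one representation of location and drop the other
df_reduced = df.drop(['lat', 'long'], axis=1)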

Correlated features
Statistically correlated: features move together directionally
Linear models assume feature independence

Measured with Pearson's correlation coefficient (formula below)
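For reference, Pearson's correlation coefficient between two features x and y (the standard definition; not printed on the slide):

r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}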

Correlated features
print(df)

      A     B     C
0  3.06  3.92  1.04
1  2.76  3.40  1.05
2  3.24  3.17  1.03
...

print(df.corr())

          A         B         C
A  1.000000  0.787194  0.543479
B  0.787194  1.000000  0.565468
C  0.543479  0.565468  1.000000
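A and B correlate at roughly 0.79, so a linear model would likely want only one of them. A minimal follow-up sketch (the 0.75 cutoff is an arbitrary choice, not from the slide):

# Flag feature pairs whose absolute correlation exceeds the cutoff
print(df.corr().abs() > 0.75)   # the diagonal is trivially True

# Drop one feature from the correlated A/B pair
df_uncorrelated = df.drop('B', axis=1)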

Let's practice!
Selecting features using text vectors

Looking at word weights
print(tfidf_vec.vocabulary_)

{'200': 0,
 '204th': 1,
 '33rd': 2,
 'ahead': 3,
 'alley': 4,
 ...

print(text_tfidf[3].data)

[0.19392702 0.20261085 ...]

print(text_tfidf[3].indices)

[ 31 102  20  70   5 ...]
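The slide assumes tfidf_vec and text_tfidf already exist. A minimal sketch of how they could have been built (the documents list is a stand-in, not the course data):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['200 blocks ahead on 33rd', 'the alley behind 204th']  # illustrative text
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)   # sparse rows of TF-IDF weights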

Looking at word weights
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}
print(vocab)

{0: '200',
 1: '204th',
 2: '33rd',
 3: 'ahead',
 4: 'alley',
 ...

zipped_row = dict(zip(text_tfidf[3].indices,
                      text_tfidf[3].data))
print(zipped_row)

{5: 0.1597882543332701,
 7: 0.26576432098763175,
 8: 0.18599931331925676,
 9: 0.26576432098763175,
 10: 0.13077355258450366,
 ...

Looking at word weights
def return_weights(vocab, vector, vector_index):
    # Map each nonzero column index in this row to its TF-IDF weight
    zipped = dict(zip(vector[vector_index].indices,
                      vector[vector_index].data))

    # Translate column indices back to words via the reversed vocabulary
    return {vocab[i]: zipped[i] for i in vector[vector_index].indices}

print(return_weights(vocab, text_tfidf, 3))

{'and': 0.1597882543332701,
'are': 0.26576432098763175,
'at': 0.18599931331925676,
...
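Building on return_weights, the weights can drive feature selection, for example by keeping only each row's highest-weighted words. The helper below is a sketch of that idea, not the course's implementation; top_n is an arbitrary parameter:

def words_to_filter(vocab, vector, vector_index, top_n):
    # Rank this row's words by TF-IDF weight and keep the top_n strongest
    weights = return_weights(vocab, vector, vector_index)
    return sorted(weights, key=weights.get, reverse=True)[:top_n]

print(words_to_filter(vocab, text_tfidf, 3, top_n=5))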

Let's practice!
Dimensionality reduction

Dimensionality reduction and PCA
Dimensionality reduction:
Unsupervised learning method
Combines/decomposes a feature space
Feature extraction - here we'll use it to reduce our feature space

Principal component analysis (PCA):
Linear transformation to an uncorrelated space
Captures as much variance as possible in each component

PCA in scikit-learn
from sklearn.decomposition import PCA
pca = PCA()
df_pca = pca.fit_transform(df)

print(df_pca)

[[  88.4583,  18.7764, -2.2379, ...,  0.0954,  0.0361, -0.0034],
 [  93.4564,  18.6709, -1.7887, ..., -0.0509,  0.1331,  0.0119],
 [-186.9433,  -0.2133, -5.6307, ...,  0.0332,  0.0271,  0.0055]]

print(pca.explained_variance_ratio_)

[0.9981, 0.0017, 0.0001, 0.0001, ...]
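With roughly 99.8% of the variance in the first component, the space can be cut down aggressively. A sketch of restricting the number of components (both settings below are illustrative choices, not from the slide):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)        # keep a fixed number of components
# pca = PCA(n_components=0.95)   # or keep enough components for 95% of the variance
df_reduced = pca.fit_transform(df)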

PCA caveats
Components are difficult to interpret
Apply PCA at the end of your preprocessing journey
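Because PCA learns its transformation from the data it sees, standard scikit-learn practice (not spelled out on the slide) is to fit it on training data only; X_train and X_test here are assumed, pre-split feature arrays:

from sklearn.decomposition import PCA

pca = PCA()
X_train_pca = pca.fit_transform(X_train)  # learn components on the training set
X_test_pca = pca.transform(X_test)        # apply the same components to the test set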

Let's practice!
