
Python Cheat Sheet
Pandas | NumPy | Sklearn
Matplotlib | Seaborn
BS4 | Selenium | Scrapy
by Frank Andrade
Pandas Cheat Sheet

Pandas provides data analysis tools for Python. All of the
following code examples refer to the dataframe below.

         col1  col2
    A      1     4
    B      2     5
    C      3     6
(axis 0 = rows, axis 1 = columns)

Getting Started

Import pandas:
import pandas as pd

Create a series:
s = pd.Series([1, 2, 3],
              index=['A', 'B', 'C'],
              name='col1')

Create a dataframe:
data = [[1, 4], [2, 5], [3, 6]]
index = ['A', 'B', 'C']
df = pd.DataFrame(data, index=index,
                  columns=['col1', 'col2'])

Load a dataframe:
df = pd.read_csv('filename.csv', sep=',',
                 names=['col1', 'col2'],
                 index_col=0,
                 encoding='utf-8',
                 nrows=3)

Selecting rows and columns

Select single column:
df['col1']

Select multiple columns:
df[['col1', 'col2']]

Show first n rows:
df.head(2)

Show last n rows:
df.tail(2)

Select rows by index values:
df.loc['A']
df.loc[['A', 'B']]

Select rows by position:
df.iloc[1]
df.iloc[1:]

Data wrangling

Filter by value:
df[df['col1'] > 1]

Sort by columns:
df.sort_values(['col1', 'col2'],
               ascending=[False, True])

Identify duplicate rows:
df.duplicated()

Identify unique values in a column:
df['col1'].unique()

Swap rows and columns:
df = df.transpose()
df = df.T

Drop a column:
df = df.drop('col1', axis=1)

Clone a data frame:
clone = df.copy()

Connect multiple data frames vertically:
df2 = df + 5  # new dataframe
pd.concat([df, df2])

Merge multiple data frames horizontally:
df3 = pd.DataFrame([[1, 7], [8, 9]],
                   index=['B', 'D'],
                   columns=['col1', 'col3'])
# df3: new dataframe

Only merge complete rows (INNER JOIN):
df.merge(df3)

Left column stays complete (LEFT OUTER JOIN):
df.merge(df3, how='left')

Right column stays complete (RIGHT OUTER JOIN):
df.merge(df3, how='right')

Preserve all values (OUTER JOIN):
df.merge(df3, how='outer')

Merge rows by index:
df.merge(df3, left_index=True,
         right_index=True)

Fill NaN values:
df.fillna(0)

Apply your own function:
def func(x):
    return 2**x
df.apply(func)

Arithmetic and statistics

Add to all values:
df + 10

Sum over columns:
df.sum()

Cumulative sum over columns:
df.cumsum()

Mean over columns:
df.mean()

Standard deviation over columns:
df.std()

Count unique values:
df['col1'].value_counts()

Summarize descriptive statistics:
df.describe()
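The merge variants above differ only in which key values survive. A minimal sketch of the difference, using the example df and df3 from this page (results abbreviated in comments):

import pandas as pd

df = pd.DataFrame([[1, 4], [2, 5], [3, 6]],
                  index=['A', 'B', 'C'], columns=['col1', 'col2'])
df3 = pd.DataFrame([[1, 7], [8, 9]],
                   index=['B', 'D'], columns=['col1', 'col3'])

# INNER JOIN: keeps only col1 values found in both frames (here: 1)
print(df.merge(df3))
# OUTER JOIN: keeps col1 values from either frame; missing cells become NaN
print(df.merge(df3, how='outer'))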
Hierarchical indexing

Create hierarchical index:
df.stack()

Dissolve hierarchical index:
df.unstack()

Aggregation

Create group object:
g = df.groupby('col1')

Iterate over groups:
for i, group in g:
    print(i, group)

Aggregate groups:
g.sum()
g.prod()
g.mean()
g.std()
g.describe()

Select columns from groups:
g['col2'].sum()
g[['col2', 'col3']].sum()

Transform values:
import math
g.transform(math.log)

Apply a list function on each group:
def strsum(group):
    return ''.join([str(x) for x in group.values])
g['col2'].apply(strsum)

Data export

Data as NumPy array:
df.values

Save data as CSV file:
df.to_csv('output.csv', sep=',')

Format a dataframe as tabular string:
df.to_string()

Convert a dataframe to a dictionary:
df.to_dict()

Save a dataframe as an Excel table:
df.to_excel('output.xlsx')

Visualization

Import matplotlib:
import matplotlib.pyplot as plt

Start a new diagram:
plt.figure()

Scatter plot:
df.plot.scatter('col1', 'col2',
                style='ro')

Bar plot:
df.plot.bar(x='col1', y='col2',
            width=0.7)

Area plot:
df.plot.area(stacked=True,
             alpha=1.0)

Box-and-whisker plot:
df.plot.box()

Histogram over one column:
df['col1'].plot.hist(bins=3)

Histogram over all columns:
df.plot.hist(bins=3, alpha=0.5)

Set tick marks:
labels = ['A', 'B', 'C', 'D']
positions = [1, 2, 3, 4]
plt.xticks(positions, labels)
plt.yticks(positions, labels)

Select area to plot:
plt.axis([0, 2.5, 0, 10])  # [from x, to x, from y, to y]

Label diagram and axes:
plt.title('Correlation')
plt.xlabel('Nunstück')
plt.ylabel('Slotermeyer')

Save most recent diagram:
plt.savefig('plot.png')
plt.savefig('plot.png', dpi=300)
plt.savefig('plot.svg')
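To make the grouping workflow concrete, a minimal sketch on a small hypothetical frame (my example, not part of the original sheet; results in comments):

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'],
                   'col2': [1, 2, 3]})
g = df.groupby('col1')

print(g.sum())  # col2 per group -> a: 3, b: 3
# the strsum pattern above concatenates each group's values as strings
print(g['col2'].apply(lambda s: ''.join(str(x) for x in s.values)))  # a: '12', b: '3'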

Find practical examples in these
guides I made:
- Pandas Guide for Excel Users (link)
- Data Wrangling Guide (link)
- Regular Expression Guide (link)

Made by Frank Andrade frank-andrade.medium.com
NumPy Cheat Sheet

NumPy provides tools for working with arrays. All of the
following code examples refer to the arrays below.

NumPy Arrays

1D Array          2D Array
[1 2 3]           [[1.5  2   3 ]     (axis 0 = rows,
                   [4    5   6 ]]     axis 1 = columns)

Getting Started

Import numpy:
import numpy as np

Create arrays:
a = np.array([1,2,3])
b = np.array([(1.5,2,3), (4,5,6)], dtype=float)
c = np.array([[(1.5,2,3), (4,5,6)],
              [(3,2,1), (4,5,6)]],
             dtype=float)

Initial placeholders:
np.zeros((3,4))  # Create an array of zeros
np.ones((2,3,4), dtype=np.int16)
d = np.arange(10,25,5)
np.linspace(0,2,9)
e = np.full((2,2), 7)
f = np.eye(2)
np.random.random((2,2))
np.empty((3,2))

Saving & Loading On Disk:
np.save('my_array', a)
np.savez('array.npz', a, b)
np.load('my_array.npy')

Saving & Loading Text Files:
np.loadtxt('my_file.txt')
np.genfromtxt('my_file.csv', delimiter=',')
np.savetxt('myarray.txt', a, delimiter=' ')

Inspecting Your Array:
a.shape
len(a)
b.ndim
e.size
b.dtype        # data type
b.dtype.name
b.astype(int)  # change data type

Data Types:
np.int64
np.float32
np.complex128
np.bool_
np.object_
np.bytes_
np.str_
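A quick sketch of inspecting and converting dtypes on the example array b (outputs shown as comments):

import numpy as np

b = np.array([(1.5, 2, 3), (4, 5, 6)], dtype=float)
print(b.dtype)        # float64
print(b.astype(int))  # [[1 2 3]
                      #  [4 5 6]] -- astype truncates and returns a new array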
Array Mathematics

Arithmetic Operations:
>>> g = a - b
array([[-0.5,  0. ,  0. ],
       [-3. , -3. , -3. ]])
>>> np.subtract(a, b)

>>> b + a
array([[2.5, 4. , 6. ],
       [5. , 7. , 9. ]])
>>> np.add(b, a)

>>> a / b
array([[0.66666667, 1.  , 1. ],
       [0.25      , 0.4 , 0.5]])
>>> np.divide(a, b)

>>> a * b
array([[ 1.5,  4. ,  9. ],
       [ 4. , 10. , 18. ]])
>>> np.multiply(a, b)

>>> np.exp(b)
>>> np.sqrt(b)
>>> np.sin(a)
>>> np.log(a)
>>> e.dot(f)

Aggregate functions:
a.sum()
a.min()
b.max(axis=0)
b.cumsum(axis=1)  # Cumulative sum
a.mean()
np.median(b)
np.corrcoef(a)    # Correlation coefficient
np.std(b)         # Standard deviation

Copying arrays:
h = a.view()  # Create a view
np.copy(a)
h = a.copy()  # Create a deep copy

Sorting arrays:
a.sort()        # Sort an array
c.sort(axis=0)

Array Manipulation

Transposing Array:
i = np.transpose(b)
i.T

Changing Array Shape:
b.ravel()
g.reshape(3,-2)

Adding/removing elements:
h.resize((2,6))
np.append(h, g)
np.insert(a, 1, 5)
np.delete(a, [1])

Combining arrays:
np.concatenate((a,d), axis=0)
np.vstack((a,b))  # stack vertically
np.hstack((e,f))  # stack horizontally

Splitting arrays:
np.hsplit(a,3)  # Split horizontally
np.vsplit(c,2)  # Split vertically

Subsetting:
b[1,2]  # 6.0 -- element at row 1, column 2

Slicing:
a[0:2]  # array([1, 2])

Boolean Indexing:
a[a<2]  # array([1])
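The arithmetic examples above work because of broadcasting: the 1D array a is stretched across each row of the 2D array b. A minimal sketch, assuming the arrays defined earlier:

import numpy as np

a = np.array([1, 2, 3])                 # shape (3,)
b = np.array([(1.5, 2, 3), (4, 5, 6)])  # shape (2, 3)

# a is broadcast along axis 0, so the subtraction happens row by row
print(a - b)  # [[-0.5  0.   0. ]
              #  [-3.  -3.  -3. ]]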
Scikit-Learn Cheat Sheet

Sklearn is a free machine learning library for Python. It features various
classification, regression and clustering algorithms.

Getting Started

The code below demonstrates the basic steps of using sklearn to create and run a model
on a set of data. The steps in the code include loading the data, splitting it into train
and test sets, scaling the sets, creating the model, fitting the model on the data, using
the trained model to make predictions on the test set, and finally evaluating the
performance of the model.

from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

Loading the Data

The data needs to be numeric and stored as NumPy arrays or a SciPy sparse matrix
(other numeric arrays, such as Pandas DataFrames, are also OK).

>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['A','B','A'])

Training and Test Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
    random_state=0)  # Splits data into training and test set

Preprocessing The Data

Standardization
Standardizes the features by removing the mean and scaling to unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

Normalization
Each sample (row of the data matrix) with at least one non-zero component is
rescaled independently of other samples so that its norm equals one.
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)

Binarization
Binarize data (set feature values to 0 or 1) according to a threshold.
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X_test)

Encoding Categorical Features
Encodes target labels with values between 0 and n_classes - 1.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit_transform(X_train)

Imputing Missing Values
Imputation transformer for completing missing values.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=0, strategy='mean')
imp.fit_transform(X_train)

Generating Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)
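The fit/transform steps above chain naturally into a single estimator. A hedged sketch using sklearn's standard Pipeline class (this exact combination is my illustration, not part of the original sheet; X_train etc. come from the split above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# fit() runs both steps; the scaler is fit only on the training data,
# which avoids leaking test-set statistics into preprocessing
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))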
Find practical examples in these
guides I made:
- Scikit-Learn Guide (link)
- Tokenize text with Python (link)
- Predicting Football Games (link)
Create Your Model

Supervised Learning Models

Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()  # the old normalize= option was removed in scikit-learn 1.2; scale features explicitly

Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC(kernel='linear')

Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

KNN
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Learning Models

Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)

K-Means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

Model Fitting

Fitting supervised and unsupervised learning models onto data.

Supervised Learning
lr.fit(X, y)  # Fit the model to the data
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

Unsupervised Learning
k_means.fit(X_train)  # Fit the model to the data
pca_model = pca.fit_transform(X_train)  # Fit to data, then transform

Prediction

Predict Labels
y_pred = lr.predict(X_test)       # Supervised Estimators
y_pred = k_means.predict(X_test)  # Unsupervised Estimators

Estimate probability of a label
y_pred = knn.predict_proba(X_test)

Evaluate Your Model's Performance

Classification Metrics

Accuracy Score
knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

Regression Metrics

Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

R² Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

Clustering Metrics

Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test, y_pred)

Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_test, y_pred)

V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_test, y_pred)

Tune Your Model

Grid Search
from sklearn.model_selection import GridSearchCV
params = {'n_neighbors': np.arange(1,3),
          'metric': ['euclidean','cityblock']}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
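Grid search scores each parameter combination by cross-validation. The same idea standalone, as a minimal sketch (cross_val_score is standard sklearn; the 5-fold choice is mine):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_train, y_train, cv=5)  # one accuracy per fold
print(scores.mean())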
Data Viz Cheat Sheet

Matplotlib is a Python 2D plotting library that produces
figures in a variety of formats.

[Figure anatomy: a plot lives inside a Figure, with values
read off the X-axis and Y-axis.]

Matplotlib

Workflow
The basic steps to creating plots with matplotlib are:
Prepare Data, Plot, Customize Plot, Save Plot and Show Plot.

import matplotlib.pyplot as plt

Example with lineplot

Prepare data:
x = [2017, 2018, 2019, 2020, 2021]
y = [43, 45, 47, 48, 50]

Plot & Customize Plot:
plt.plot(x, y, marker='o', linestyle='--',
         color='g', label='USA')
plt.xlabel('Years')
plt.ylabel('Population (M)')
plt.title('Years vs Population')
plt.legend(loc='lower right')
plt.yticks([41, 45, 48, 51])

Save Plot:
plt.savefig('example.png')

Show Plot:
plt.show()

Markers: '.', 'o', 'v', '<', '>'
Line Styles: '-', '--', '-.', ':'
Colors: 'b', 'g', 'r', 'y'  # blue, green, red, yellow
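The marker, line style and color codes can also be combined into a single format string; a small sketch (my example, using the data above):

import matplotlib.pyplot as plt

x = [2017, 2018, 2019, 2020, 2021]
y = [43, 45, 47, 48, 50]

# 'ro--' = red ('r') circles ('o') joined by a dashed line ('--')
plt.plot(x, y, 'ro--')
plt.show()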
Barplot:
x = ['USA', 'UK', 'Australia']
y = [40, 50, 33]
plt.bar(x, y)
plt.show()

Piechart:
plt.pie(y, labels=x, autopct='%.0f %%')
plt.show()

Histogram:
ages = [15, 16, 17, 30, 31, 32, 35]
bins = [15, 20, 25, 30, 35]
plt.hist(ages, bins, edgecolor='black')
plt.show()

Boxplots:
ages = [15, 16, 17, 30, 31, 32, 35]
plt.boxplot(ages)
plt.show()

Scatterplot:
a = [1, 2, 3, 4, 5, 4, 3, 2, 5, 6, 7]
b = [7, 2, 3, 5, 5, 7, 3, 2, 6, 3, 2]
plt.scatter(a, b)
plt.show()

Subplots
Add the code below to make multiple plots with 'n'
number of rows and columns.
fig, ax = plt.subplots(nrows=1,
                       ncols=2,
                       sharey=True,
                       figsize=(12, 4))

Plot & Customize Each Graph:
ax[0].plot(x, y, color='g')
ax[0].legend()
ax[1].plot(a, b, color='r')
ax[1].legend()
plt.show()

Seaborn

Workflow:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

Lineplot:
plt.figure(figsize=(10, 5))
flights = sns.load_dataset("flights")
may_flights = flights.query("month == 'May'")
ax = sns.lineplot(data=may_flights,
                  x="year",
                  y="passengers")
ax.set(xlabel='x', ylabel='y',
       title='my_title', xticks=[1, 2, 3])
ax.legend(title='my_legend',
          title_fontsize=13)
plt.show()

Barplot:
tips = sns.load_dataset("tips")
ax = sns.barplot(x="day",
                 y="total_bill",
                 data=tips)

Histogram:
penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins,
             x="flipper_length_mm")

Boxplot:
tips = sns.load_dataset("tips")
ax = sns.boxplot(x=tips["total_bill"])

Scatterplot:
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips,
                x="total_bill",
                y="tip")

Figure aesthetics:
sns.set_style('darkgrid')   # styles
sns.set_palette('husl', 3)  # palettes
sns.color_palette('husl')   # colors

Fontsize of the axes title, x and y labels, tick labels
and legend:
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=13)
plt.rc('ytick', labelsize=13)
plt.rc('legend', fontsize=13)
plt.rc('font', size=13)
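Seaborn's aesthetic calls only affect figures created after them, so set the theme first. A minimal sketch (assuming the tips dataset used above):

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')   # choose style and palette before plotting
sns.set_palette('husl', 3)

tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()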
Find practical examples in these
guides I made:
- Matplotlib & Seaborn Guide (link)
- Wordclouds Guide (link)
- Comparing Data Viz libraries (link)
Web Scraping Cheat Sheet

Web Scraping is the process of extracting data from a
website. Before studying Beautiful Soup and Selenium, it's
good to review some HTML basics first.

HTML for Web Scraping
Let's take a look at the HTML element syntax.

<h1 class="title"> Titanic (1997) </h1>
(tag name: h1 | attribute name: class | attribute value: "title" |
affected content: Titanic (1997) | end tag: </h1>)

This is a single HTML element, but the HTML code behind a
website has hundreds of them.

HTML code example:

<article class="main-article">
  <h1> Titanic (1997) </h1>
  <p class="plot"> 84 years later ... </p>
  <div class="full-script"> 13 meters. You ... </div>
</article>

The HTML code is structured with "nodes" (element, attribute and text
nodes). In the example, the root element <article> is the parent node;
the <h1>, <p> and <div> elements are its children (and siblings of each
other); each element carries its attribute node (class="main-article",
class="plot", class="full-script") and its text node ("Titanic (1997)",
"84 years later ...", "13 meters. You ...").

"Siblings" are nodes with the same parent.
A node's children and its children's children are
called its "descendants". Similarly, a node's parent
and its parent's parent are called its "ancestors".

Beautiful Soup

Workflow
Importing the libraries:
from bs4 import BeautifulSoup
import requests

Fetch the pages:
result = requests.get("https://www.google.com")
result.status_code  # get status code
result.headers      # get the headers

Page content:
content = result.text

Create soup:
soup = BeautifulSoup(content, "lxml")

HTML in a readable format:
print(soup.prettify())

Find an element:
soup.find(id="specific_id")

Find elements:
soup.find_all("a")
soup.find_all("a", "css_class")
soup.find_all("a", class_="my_class")
soup.find_all("a", attrs={"class": "my_class"})

Get inner text:
sample = element.get_text()
sample = element.get_text(strip=True, separator=' ')

Get specific attributes:
sample = element.get('href')
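Putting the workflow together on the sample article HTML from this page; a minimal sketch that parses the string directly, so no network request is needed:

from bs4 import BeautifulSoup

html = '''<article class="main-article">
<h1> Titanic (1997) </h1>
<p class="plot"> 84 years later ... </p>
<div class="full-script"> 13 meters. You ... </div>
</article>'''

soup = BeautifulSoup(html, "lxml")
print(soup.find("h1").get_text(strip=True))                         # Titanic (1997)
print(soup.find("div", class_="full-script").get_text(strip=True))  # 13 meters. You ...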
XPath

We need to learn XPath to scrape with Selenium or Scrapy.
When locating elements, it's recommended to search in this order:
a. ID
b. Class name
c. Tag name
d. XPath

XPath Syntax
An XPath usually contains a tag name, attribute
name, and attribute value.

//tagName[@AttributeName="Value"]

Let's check some examples to locate the article,
title, and transcript elements of the HTML code we
used before.

//article[@class="main-article"]
//h1
//div[@class="full-script"]

XPath Functions and Operators

XPath functions:
//tag[contains(@AttributeName, "Value")]

XPath Operators: and, or
//tag[(expression 1) and (expression 2)]

XPath Special Characters

/    Selects the children from the node set on the
     left side of this character
//   Specifies that the matching node set should
     be located at any level within the document
.    Specifies the current context should be used
     (refers to present node)
..   Refers to a parent node
*    A wildcard character that selects all
     elements or attributes regardless of names
@    Select an attribute
()   Grouping an XPath expression
[n]  Indicates that a node with index "n" should
     be selected
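The XPath expressions above can be tried outside a scraper. A sketch using lxml (my library choice for the demo, not part of the original sheet) on the same sample HTML:

from lxml import html

doc = html.fromstring('''<article class="main-article">
<h1> Titanic (1997) </h1>
<p class="plot"> 84 years later ... </p>
<div class="full-script"> 13 meters. You ... </div>
</article>''')

print(doc.xpath('//h1/text()'))                         # [' Titanic (1997) ']
print(doc.xpath('//div[@class="full-script"]/text()'))  # [' 13 meters. You ... ']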
Selenium

Workflow
from selenium import webdriver
from selenium.webdriver.common.by import By

web = 'https://www.google.com'
driver = webdriver.Chrome()  # Selenium 4.6+ resolves the chromedriver path automatically
driver.get(web)

Find an element:
driver.find_element(By.ID, 'name')

Find elements:
driver.find_elements(By.CLASS_NAME, 'class_name')
driver.find_elements(By.CSS_SELECTOR, 'selector')
driver.find_elements(By.XPATH, '//tag')
driver.find_elements(By.TAG_NAME, 'tag')
driver.find_elements(By.NAME, 'name')

Quit driver:
driver.quit()

Getting the text:
data = element.text

Implicit Waits:
driver.implicitly_wait(2)  # poll up to 2 seconds when locating elements
import time
time.sleep(2)              # note: a fixed pause, not a true implicit wait

Explicit Waits:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.ID, 'id_name')))
# Wait up to 5 seconds until the element is clickable

Options: headless mode, change window size:
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
options.add_argument('window-size=1920x1080')
driver = webdriver.Chrome(options=options)
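A compact end-to-end sketch combining the steps above (Selenium 4 style; the URL and locator are placeholders of mine):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')          # run without opening a window

driver = webdriver.Chrome(options=options)  # Selenium Manager resolves the driver (4.6+)
driver.get('https://example.com')
print(driver.find_element(By.TAG_NAME, 'h1').text)
driver.quit()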
Scrapy

Scrapy is the most powerful web scraping framework in Python, but it's a bit
complicated to set up, so check my guide or its documentation to set it up.

Creating a Project and Spider
To create a new project, run the following command in the terminal.
scrapy startproject my_first_spider
To create a new spider, first change the directory.
cd my_first_spider
Create a spider:
scrapy genspider example example.com

The Basic Template
When you create a spider, you obtain a template with the following content.

import scrapy

class ExampleSpider(scrapy.Spider):  # Class
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):       # Parse method
        pass

The class is built with the data we introduced in the previous command, but the
parse method needs to be built by us. To build it, use the functions below.

Finding elements
To find elements in Scrapy, use the response argument from the parse method.
response.xpath('//tag[@AttributeName="Value"]')

Getting the text
To obtain the text element we use text() and either .get() or .getall(). For example:
response.xpath('//h1/text()').get()
response.xpath('//tag[@Attribute="Value"]/text()').getall()

Return data extracted
To see the data extracted we have to use the yield keyword.

def parse(self, response):
    title = response.xpath('//h1/text()').get()
    # Return data extracted
    yield {'titles': title}

Run the spider and export data to CSV or JSON:
scrapy crawl example
scrapy crawl example -o name_of_file.csv
scrapy crawl example -o name_of_file.json
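Spiders can also be run from a plain Python script instead of the CLI; a hedged sketch using Scrapy's CrawlerProcess (the import path assumes the my_first_spider project generated above):

from scrapy.crawler import CrawlerProcess
from my_first_spider.spiders.example import ExampleSpider

# FEEDS plays the role of the -o flag: write results to output.json
process = CrawlerProcess(settings={'FEEDS': {'output.json': {'format': 'json'}}})
process.crawl(ExampleSpider)
process.start()  # blocks until the crawl finishes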
Find practical examples in these guides I made:
- Web Scraping Complete Guide (link)
- Web Scraping with Selenium (link)
- Web Scraping with Beautiful Soup (link)
