
Cheat Sheets for AI

Neural Networks, Machine Learning, Deep Learning & Big Data

The Most Complete List of Best AI Cheat Sheets

BecomingHuman.AI
Table of Contents

Part 1: Neural Networks
    Neural Networks Basics
    Neural Network Graphs

Part 2: Machine Learning
    Machine Learning Basics
    Scikit-Learn with Python
    Scikit-Learn Algorithm
    Choosing an ML Algorithm

Part 3: Data Science with Python
    TensorFlow
    Python Basics
    PySpark Basics
    NumPy Basics
    Bokeh
    Keras
    Pandas
    Data Wrangling with Pandas
    Data Wrangling with dplyr & tidyr
    SciPy
    MatPlotLib
    Data Visualization with ggplot
    Big-O
Part 1

Neural Networks

Neural Networks Cheat Sheet
BecomingHuman.AI

Network architectures
Perceptron (P), Feed Forward (FF), Radial Basis Network (RBF), Deep Feed Forward (DFF), Recurrent Neural Network (RNN), Long / Short Term Memory (LSTM), Gated Recurrent Unit (GRU), Auto Encoder (AE), Variational AE (VAE), Sparse AE (SAE), Denoising AE (DAE), Markov Chain (MC), Hopfield Network (HN), Boltzmann Machine (BM), Restricted BM (RBM), Deep Belief Network (DBN), Deep Convolutional Network (DCN), Deconvolutional Network (DN), Deep Convolutional Inverse Graphics Network (DCIGN), Generative Adversarial Network (GAN), Liquid State Machine (LSM), Extreme Learning Machine (ELM), Echo State Network (ESN), Kohonen Network (KN), Deep Residual Network (DRN), Support Vector Machine (SVM), Neural Turing Machine (NTM)

Index of cell types
Backfed Input Cell, Input Cell, Noisy Input Cell, Hidden Cell, Probabilistic Hidden Cell, Spiking Hidden Cell, Output Cell, Match Input Output Cell, Recurrent Cell, Memory Cell, Different Memory Cell, Kernel, Convolutional or Pool

www.asimovinstitute.org/neural-network-zoo/
Neural Networks Graphs Cheat Sheet
BecomingHuman.AI

[Figure: node-level computation graphs built from input, bias, sum, multiply, invert and activation (sigmoid, tanh, relu) cells, illustrating a Deep Feed Forward example, a Deep Recurrent example, a Deep LSTM example and a Deep GRU example (the recurrent variants are shown with their previous-iteration connections).]

http://www.asimovinstitute.org/neural-network-zoo-prequel-cells-layers/


Part 2

Machine Learning

Machine Learning Overview
BecomingHuman.AI

MACHINE LEARNING IN EMOJI
SUPERVISED       human builds model based on input / output
UNSUPERVISED     human input, machine output, human utilizes if satisfactory
REINFORCEMENT    human input, machine output, human reward/punish, cycle continues

BASIC REGRESSION
LINEAR      linear_model.LinearRegression()      Lots of numerical data
LOGISTIC    linear_model.LogisticRegression()    Target variable is categorical

CLUSTER ANALYSIS
K-MEANS              cluster.KMeans()                Similar datum into groups based on centroids
ANOMALY DETECTION    covariance.EllipticEnvelope()   Finding outliers through grouping

CLASSIFICATION
NEURAL NET       neural_network.MLPClassifier()      Complex relationships. Prone to overfitting. Basically magic.
K-NN             neighbors.KNeighborsClassifier()    Group membership based on proximity
DECISION TREE    tree.DecisionTreeClassifier()       If/then/else. Non-contiguous data. Can also be regression.
RANDOM FOREST    ensemble.RandomForestClassifier()   Find best split randomly. Can also be regression.
SVM              svm.SVC()  svm.LinearSVC()          Maximum margin classifier. Fundamental Data Science algorithm.
NAIVE BAYES      GaussianNB()  MultinomialNB()  BernoulliNB()    Updating knowledge step by step with new info

FEATURE REDUCTION
T-DISTRIBUTED STOCHASTIC NEIGHBOR EMBEDDING    manifold.TSNE()       Visualize high-dimensional data. Converts similarity to joint probabilities.
PRINCIPAL COMPONENT ANALYSIS                   decomposition.PCA()   Distill feature space into components that describe greatest variance
CANONICAL CORRELATION ANALYSIS                 decomposition.CCA()   Making sense of cross-correlation matrices
LINEAR DISCRIMINANT ANALYSIS                   lda.LDA()             Linear combination of features that separates classes

OTHER IMPORTANT CONCEPTS
BIAS VARIANCE TRADEOFF
UNDERFITTING / OVERFITTING
INERTIA
ACCURACY FUNCTION       (TP+TN) / (P+N)
PRECISION FUNCTION      TP / (TP+FP)
SPECIFICITY FUNCTION    TN / (FP+TN)
SENSITIVITY FUNCTION    TP / (TP+FN)
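The four metric formulas listed under Other Important Concepts can be read straight off a binary confusion matrix. A minimal sketch (the TP/TN/FP/FN counts below are made-up illustrative values):

def classification_metrics(tp, tn, fp, fn):
    # Accuracy: (TP+TN)/(P+N); Precision: TP/(TP+FP)
    # Specificity: TN/(FP+TN); Sensitivity: TP/(TP+FN)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (fp + tn),
        "sensitivity": tp / (tp + fn),
    }

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))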
Cheat-Sheet: Scikit-learn
Python For Data Science
BecomingHuman.AI

Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.

A Basic Example
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.cross_validation import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)

Loading the Data
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as Pandas DataFrames, are also acceptable.
>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F','F'])
>>> X[X < 0.7] = 0

Training And Test Data
>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Preprocessing The Data

Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)

Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)

Imputing Missing Values
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit_transform(X_train)

Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)

Create Your Model

Supervised Learning Estimators
Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')
Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
KNN
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Learning Estimators
Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)
K Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)

Model Fitting
Supervised learning
>>> lr.fit(X, y)                                Fit the model to the data
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)
Unsupervised learning
>>> k_means.fit(X_train)                        Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)      Fit to data, then transform it

Prediction
Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))    Predict labels
>>> y_pred = lr.predict(X_test)                      Predict labels
>>> y_pred = knn.predict_proba(X_test)               Estimate probability of a label
Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)                 Predict labels in clustering algos

Evaluate Your Model's Performance

Classification Metrics
Accuracy Score
>>> knn.score(X_test, y_test)                        Estimator score method
>>> from sklearn.metrics import accuracy_score       Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Classification Report
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))     Precision, recall, f1-score and support
Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))

Regression Metrics
Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)
Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)
R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)

Clustering Metrics
Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)

Cross-Validation
>>> from sklearn.cross_validation import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))

Tune Your Model

Grid Search
>>> from sklearn.grid_search import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
              "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization
>>> from sklearn.grid_search import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
              "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
                                 param_distributions=params,
                                 cv=4,
                                 n_iter=8,
                                 random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)

https://www.datacamp.com/community/blog/scikit-learn-cheat-sheet
Scikit-learn Algorithm Cheat Sheet
BecomingHuman.AI

[Figure: scikit-learn's "choosing the right estimator" flowchart. Starting from >50 samples (otherwise: get more data), it branches on whether you are predicting a category (classification), predicting a quantity (regression), have labeled data or are just looking (clustering), or need dimensionality reduction. Classification: for <100K samples try Linear SVC, then Naive Bayes for text data or KNeighbors Classifier otherwise, falling back to SVC / Ensemble Classifiers; for >100K samples use SGD Classifier with kernel approximation. Regression: for >100K samples use SGD Regressor; otherwise Lasso / ElasticNet if few features should be important, else Ridge Regression / SVR(kernel='linear'), falling back to SVR(kernel='rbf') / Ensemble Regressors. Clustering: if the number of categories is known, KMeans for <10K samples (fallback Spectral Clustering / GMM) or MiniBatch KMeans otherwise; if unknown, MeanShift / VBGMM for <10K samples, else tough luck. Dimensionality reduction: Randomized PCA, then Isomap / Spectral Embedding / LLE for <10K samples, else kernel approximation.]

Created by Scikit-Learn.org, BSD License. See the original here.
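Read as code, the flowchart boils down to a handful of sample-size and task checks. The helper below is only an illustrative paraphrase of the chart (the function name, arguments and thresholds are ours, not part of scikit-learn), not a substitute for empirical model selection:

def suggest_estimator(n_samples, task, text_data=False,
                      n_categories_known=True, few_important_features=False):
    # Rough translation of the flowchart branches above.
    if n_samples < 50:
        return "get more data"
    if task == "classification":
        if n_samples < 100000:
            return "Naive Bayes" if text_data else "LinearSVC, then KNeighborsClassifier / SVC / ensembles"
        return "SGDClassifier, then kernel approximation"
    if task == "regression":
        if n_samples >= 100000:
            return "SGDRegressor"
        if few_important_features:
            return "Lasso / ElasticNet"
        return "Ridge / SVR(kernel='linear'), then SVR(kernel='rbf') / ensembles"
    if task == "clustering":
        if n_categories_known:
            return "KMeans" if n_samples < 10000 else "MiniBatchKMeans"
        return "MeanShift / VBGMM" if n_samples < 10000 else "tough luck"
    if task == "dimensionality reduction":
        return "Randomized PCA, then Isomap / SpectralEmbedding / LLE"
    return "unknown task"

print(suggest_estimator(5000, "classification", text_data=True))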


Algorithm Cheat Sheet
BecomingHuman.AI
This cheat sheet helps you choose the best Azure Machine Learning Studio algorithm for your predictive
analytics solution. Your decision is driven by both the nature of your data and the question you're trying to
answer.

CLUSTERING (discovering structure)
    Discovering structure                  K-means

ANOMALY DETECTION (predicting unusual data points)
    >100 features, aggressive boundary     One-class SVM
    Fast training                          PCA-based anomaly detection

MULTICLASS CLASSIFICATION (three or more categories)
    Fast training, linear model            Multiclass logistic regression
    Accuracy, long training times          Multiclass neural network
    Accuracy, fast training                Multiclass decision forest
    Accuracy, small memory footprint       Multiclass decision jungle
    Depends on the two-class classifier, see notes below    One-v-all multiclass

TWO-CLASS CLASSIFICATION (two categories)
    >100 features, linear model            Two-class SVM
    Fast training, linear model            Two-class averaged perceptron
    Fast training, linear model            Two-class logistic regression
    Fast training, linear model            Two-class Bayes point machine
    Accuracy, fast training                Two-class decision forest
    Accuracy, fast training                Two-class boosted decision tree
    Accuracy, small memory footprint       Two-class decision jungle
    >100 features                          Two-class locally deep SVM
    Accuracy, long training times          Two-class neural network

REGRESSION (predicting values)
    Data in rank ordered categories        Ordinal regression
    Predicting event counts                Poisson regression
    Predicting a distribution              Fast forest quantile regression
    Fast training, linear model            Linear regression
    Linear model, small data sets          Bayesian linear regression
    Accuracy, long training time           Neural network regression
    Accuracy, fast training                Decision forest regression
    Accuracy, fast training                Boosted decision tree regression
Part 3

Data Science with Python
TensorFlow Cheat Sheet
BecomingHuman.AI

In May 2017 Google announced the second generation of the TPU, as well as the availability of the TPUs in Google Compute Engine. The second-generation TPUs deliver up to 180 teraflops of performance, and when organized into clusters of 64 TPUs provide up to 11.5 petaflops.

Info
TensorFlow™ is an open source software library created by Google for numerical computation and large scale computation. TensorFlow bundles together machine learning and deep learning models and frameworks and makes them useful by way of a common metaphor.

Keras is an open sourced neural networks library, written in Python and built for fast experimentation via deep neural networks and modular design. It is capable of running on top of TensorFlow, Theano, Microsoft Cognitive Toolkit, or PlaidML.

Skflow (Scikit Flow) is a high level interface based on TensorFlow which can be used like sklearn. You can build your own model on your own data quickly without rewriting extra code. It provides a set of high level model classes that you can use to easily integrate with your existing Scikit-learn pipeline code.

Installation
How to install a new package in Python: pip install <package-name>  (example: pip install requests)
How to install TensorFlow:
    device = cpu/gpu
    python_version = cp27/cp34
    sudo pip install https://storage.googleapis.com/tensorflow/linux/$device/tensorflow-0.8.0-$python_version-none-linux_x86_64.whl
How to install Skflow: pip install sklearn
How to install Keras: pip install keras, then update ~/.keras/keras.json – replace "theano" by "tensorflow"

Python helpers
type(object)     Get object type
help(object)     Get help for object (list of available methods, attributes, signatures and so on)
dir(object)      Get list of object attributes (fields, functions)
str(object)      Transform an object to string
object?          Shows documentation about the object
globals()        Return the dictionary containing the current scope's global variables
locals()         Update and return a dictionary containing the current scope's local variables
id(object)       Return the identity of an object; guaranteed to be unique among simultaneously existing objects
import __builtin__ ; dir(__builtin__)    Other built-in functions

TensorFlow
Main classes: tf.Graph(), tf.Operation(), tf.Tensor(), tf.Session()
Some useful functions: tf.get_default_session(), tf.get_default_graph(), tf.reset_default_graph(), ops.reset_default_graph(), tf.device("/cpu:0"), tf.name_scope(value), tf.convert_to_tensor(value)
Optimizers: GradientDescentOptimizer, AdadeltaOptimizer, AdagradOptimizer, MomentumOptimizer, AdamOptimizer, FtrlOptimizer, RMSPropOptimizer
Reduction: reduce_sum, reduce_prod, reduce_min, reduce_max, reduce_mean, reduce_all, reduce_any, accumulate_n
Activation functions (tf.nn): relu, relu6, elu, softplus, softsign, dropout, bias_add, sigmoid, tanh, sigmoid_cross_entropy_with_logits, softmax, log_softmax, softmax_cross_entropy_with_logits, sparse_softmax_cross_entropy_with_logits, weighted_cross_entropy_with_logits, etc.

Skflow
Main classes: TensorFlowClassifier, TensorFlowRegressor, TensorFlowDNNClassifier, TensorFlowDNNRegressor, TensorFlowLinearClassifier, TensorFlowLinearRegressor, TensorFlowRNNClassifier, TensorFlowRNNRegressor, TensorFlowEstimator

Each classifier and regressor has the following fields: n_classes=0 (Regressor) or n_classes expected as input (Classifier); batch_size=32; steps=200 (except TensorFlowRNNClassifier, where it is 50); optimizer='Adagrad'; learning_rate=0.1.

Each class has a method fit(X, y, monitor=None, logdir=None)
    X: matrix or tensor of shape [n_samples, n_features…]. Can be an iterator that returns arrays of features. The training input samples for fitting the model.
    y: vector or matrix [n_samples] or [n_samples, n_outputs]. Can be an iterator that returns an array of targets. The training target values (class labels in classification, real numbers in regression).
    monitor: Monitor object to print training progress and invoke early stopping.
    logdir: the directory to save the log file that can be used for optional visualization.

predict(X, axis=1, batch_size=None)
    X: array-like matrix, [n_samples, n_features…] or iterator.
    axis: which axis to argmax for classification. By default axis 1 (next after batch) is used. Use 2 for sequence predictions.
    batch_size: if the test set is too big, use batch size to split it into mini batches. By default the batch_size member variable is used.
    Returns y: array of shape [n_samples]. The predicted classes or predicted value.

Originally created by Altoros. www.altoros.com/blog/tensorflow-cheat-sheet/
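A minimal sketch of the graph-then-session workflow that the classes above (tf.Graph, tf.Session, the optimizers) belong to, assuming a TensorFlow 1.x install; the toy regression data and learning rate are arbitrary:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.constant([[1.0, 2.0, 3.0]])          # input tensor
    w = tf.Variable(tf.random_normal([3, 1]))   # trainable weights
    y_true = tf.constant([[10.0]])
    y_pred = tf.matmul(x, w)
    loss = tf.reduce_mean(tf.square(y_pred - y_true))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        _, current_loss = sess.run([train_op, loss])
    print(current_loss)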


Python For Data Science Cheat-Sheet: Python Basics

Strings    Also see NumPy Arrays


>>> my_string = 'thisStringIsAwesome'
>>> my_string
'thisStringIsAwesome'

String Operations
>>> my_string * 2

BecomingHuman.AI
'thisStringIsAwesomethisStringIsAwesome'
>>> my_string + 'Innit'
'thisStringIsAwesomeInnit'
>>> 'm' in my_string
True

String Operations Index starts at 0

>>> my_string[3]
>>> my_string[4:9]

Variables and Data Types Lists Also see NumPy Arrays Numpy Arrays Also see Lists
String Methods
>>> a = 'is' >>> my_list = [1, 2, 3, 4] >>> my_string.upper() String to uppercase
Variable Assignment >>> b = 'nice' >>> my_array = np.array(my_list)
>>> my_string.lower() String to lowercase
>>> my_list = ['my', 'list', a, b] >>> my_2darray =
>>> x=5 np.array([[1,2,3],[4,5,6]])
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] >>> my_string.count('w') Count String elements
>>> x
5 >>> my_string.replace('e', 'i') Replace String elements
Selecting List Elements Index starts at 0 Selecting Numpy
>>> my_string.strip() Strip whitespaces
Subset
Array Elements Index starts at 0
Calculations With Variables >>> my_list[1] Select item at index 1
Subset
>>> my_list[-3] Select 3rd last item
>>> x+2 Sum of two variables Slice >>> my_array[1] Select item at index 1
7 2
>>> my_list[1:3] Select items at index 1 and 2
>>> x-2 Subtraction of two variables >>> my_list[1:] Select items after index 0 Slice
>>> my_array[0:2] Select items at index 0 and 1
Libraries
3 >>> my_list[:3] Select items before index 3
>>> x*2 Multiplication of two variables array([1, 2])
>>> my_list[:] Copy my_list Import libraries
10 Subset Lists of Lists Subset 2D Numpy arrays
>>> import numpy
>>> x**2 Exponentiation of a variable >>> my_list2[1][0] my_list[list][itemOfList] >>> my_2darray[:,0] my_2darray[rows, columns]
array([1, 4]) >>> import numpy as np
25 >>> my_list2[1][:2] Selective import
>>> x%2 Remainder of a variable >>> from math import pi
1
>>> x/float(2) Division of a variable List Operations Numpy Array Operations
2.5 >>> my_list + my_list >>> my_array > 3
['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] array([False, False, False, True], dtype=bool)
>>> my_array * 2
>>> my_list * 2 array([2, 4, 6, 8])
['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] >>> my_array + np.array([5, 6, 7, 8])
Types and Type Conversion >>> my_list2 > 4
array([6, 8, 10, 12])
Install Python
True
str() '5', '3.45', 'True' Variables to strings

int() 5, 3, 1 Variables to integers List Methods Numpy Array Operations


Leading open data science platform
float() 5.0, 1.0 Variables to floats >>> my_array.shape Get the dimensions of the array powered by Python
>>> my_list.count(a) Count an item >>> np.append(other_array) Append items to an array
bool() True, True, True Variables to booleans
>>> my_list.append('!') Append an item at a time >>> np.insert(my_array, 1, 5) Insert items in an array
>>> my_list.remove('!') Remove an item Free IDE that is included
>>> np.delete(my_array,[1]) Delete items in an array with Anaconda
>>> del(my_list[0:1]) Remove an item
>>> my_list.reverse() Reverse the list >>> np.mean(my_array) Mean of the array
>>> my_list.extend('!') Append an item >>> np.median(my_array) Median of the array
Create and share
Asking For Help >>> my_list.pop(-1)
>>> my_list.insert(0,'!')
Remove an item
Insert an item
>>> my_array.corrcoef() Correlation coefficient documents with live code,
visualizations, text, ...
>>> np.std(my_array) Standard deviation
>>> help(str) >>> my_list.sort() Sort the list

https://www.datacamp.com/community/tutorials/python-data-science-cheat-sheet-basics
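One detail the Lists and NumPy Arrays panels above only hint at: the same operator can mean different things on each type. A quick illustrative sketch:

import numpy as np

my_list = [1, 2, 3, 4]
my_array = np.array(my_list)

print(my_list * 2)    # [1, 2, 3, 4, 1, 2, 3, 4]  (list repetition)
print(my_array * 2)   # [2 4 6 8]                 (element-wise arithmetic)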


Python For Data Science Cheat Sheet: PySpark - RDD Basics

Retrieving RDD Information / Basic Information
Reshaping Data
Reducing
>>> rdd.reduceByKey(lambda x,y : x+y)
.collect() each key
Merge the rdd values for
[('a',9),('b',2)]
>>> rdd.getNumPartitions() List the number of partitions
>>> rdd.reduce(lambda a, b: a + b) Merge the rdd values
BecomingHuman.AI >>> rdd.count()
3
>>> rdd.countByKey()
Count RDD instances

Count RDD instances by key


('a',7,'a',2,'b',2)

defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() Count RDD instances
Grouping by
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1}) by value >>> rdd3.groupBy(lambda x: x % 2) Return RDD of grouped values
>>> rdd.collectAsMap() Return (key,value) pairs as a .mapValues(list)
{'a': 2,'b': 2} dictionary .collect()
Loading Data >>> rdd3.sum() Sum of RDD elements
4950
Sum of RDD elements >>> rdd.groupByKey()
.mapValues(list)
Group rdd by key
.collect()
>>> sc.parallelize([]).isEmpty() Check whether RDD is empty [('a',[7,2]),('b',[2])]
true
Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)]) Summary Aggregating
>>> rdd3 = sc.parallelize(range(100)) >>> seqOp = (lambda x,y: (x[0]+y,x[1]+1)) Aggregate RDD elements of each
>>> rdd4 = sc.parallelize([("a",["x","y","z"]), >>> rdd3.max() Maximum value of RDD elements partition and then the results
99 >>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
("b",["p", "r"])])
>>> rdd3.min() Minimum value of RDD elements >>> rdd3.aggregate((0,0),seqOp,combOp) Aggregate values of each RDD key
PySpark is the Spark Python API External Data
0 (4950,100)
Aggregate the elements of each
>>> rdd3.mean() Mean value of RDD elements >>> rdd.aggregateByKey((0,0),seqop,combop)
that exposes the Spark programming Read either one text file from HDFS, a local file system or or any 49.5 .collect()
4950 partition, and then the results
Merge the values for each key
>>> rdd3.stdev() Standard deviation of RDD elements [('a',(9,2)), ('b',(2,1))]
model to Python. Hadoop-supported file system URI with textFile(), or read in a directory
of text files with wholeTextFiles(). 28.866070047722118 >>> rdd3.fold(0,add)
>>> rdd3.variance() Compute variance of RDD elements 4950
>>> textFile = sc.textFile("/my/directory/*.txt") 833.25 >>> rdd.foldByKey(0, add)
>>> textFile2 = sc.wholeTextFiles("/my/directory/") >>> rdd3.histogram(3) Compute histogram by bins .collect()
([0,33,66,99],[33,33,34]) [('a',9),('b',2)]
Initializing Spark >>> rdd3.stats() Summary statistics (count, mean,
stdev, max & min)
>>> rdd3.keyBy(lambda x: x+x)
.collect()
Create tuples of RDD elements by
applying a function

SparkContext Selecting Data


>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]') Getting Applying Functions
>>> rdd.collect() Return a list with all RDD elements Reshaping Data
[('a', 7), ('a', 2), ('b', 2)] >>> rdd.map(lambda x: x+(x[1],x[0]))
.collect()
Apply a function to each
RDD element
>>> rdd.take(2) Take first 2 RDD elements [('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')] >>> rdd.repartition(4) New RDD with 4 partitions
>>> sc.version Retrieve SparkContext version [('a', 7), ('a', 2)]
>>> rdd5 = rdd.flatMap(lambda x: Apply a function to each RDD >>> rdd.coalesce(1) Decrease the number of partitions in the
>>> sc.pythonVer Retrieve Python version >>> rdd.first() Take first RDD element x+(x[1],x[0])) element and flatten the result RDD to 1
('a', 7)
>>> sc.master Master URL to connect to >>> rdd5.collect()
>>> rdd.top(2) Take top 2 RDD elements ['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> str(sc.sparkHome) Path where Spark is installed on [('b', 2), ('a', 7)]
worker nodes >>> rdd4.flatMapValues(lambda x: x) Apply a flatMap function to each
.collect() (key,value)pair of rdd4 without
>>> str(sc.sparkUser()) Retrieve name of the Spark User [('a','x'),('a','y'),('a','z'),('b','p'),('b','r')] changing the keys
running SparkContext Sampling
>>> sc.appName
>>> sc.applicationId
Return application name
Retrieve application ID
>>> rdd3.sample(False, 0.15, 81).collect()
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]
Return sampled subset of rdd3 Saving
>>> sc.defaultParallelism Return default level of parallelism >>> rdd.saveAsTextFile("rdd.txt")
>>> sc.defaultMinPartitions Default minimum number of
partitions for RDDs
Filtering Mathematical Operations >>> rdd.saveAsHadoopFile ("hdfs://namenodehost/parent/child",
'org.apache.hadoop.mapred.TextOutputFormat')
>>> rdd.filter(lambda x: "a" in x) Filter the RDD >>> rdd.subtract(rdd2) Return each rdd value not contained
Configuration .collect() Return distinct RDD values .collect() in rdd2
[('a',7),('a',2)] Return (key,value) RDD's keys [('b',2),('a',7)]
>>> from pyspark import SparkConf, SparkContext
>>> rdd5.distinct().collect() >>> rdd2.subtractByKey(rdd) Return each (key,value) pair of rdd2
>>> conf = (SparkConf() .collect() with no matching key in rdd
['a',2,'b',7]
.setMaster("local") [('d', 1)]
>>> rdd.keys().collect()
.setAppName("My app")
.set("spark.executor.memory", "1g"))
['a', 'a', 'b']
>>> rdd.cartesian(rdd2).collect() Return the Cartesian product
of rdd and rdd2
Stopping SparkContext
>>> sc = SparkContext(conf = conf)
>>> sc.stop()

Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext Iterating Sort
is already created in the variable called sc.
$ ./bin/spark-shell --master local[2] Getting >>> rdd2.sortBy(lambda x: x[1]) Sort RDD by given function
$ ./bin/pyspark --master local[4] --py-files code.py .collect()
>>> def g(x): print(x) [('d',1),('b',1),('a',2)]
Set which master the context connects to with the --master
argument, and add Python .zip, .egg or .py files to the runtime
>>> rdd.foreach(g)
('a', 7)
>>> rdd2.sortByKey() Sort (key, value)
.collect()
RDD by key Execution
path by passing a comma-separated list to --py-files. ('b', 2) [('a',2),('b',1),('d',1)]
('a', 2) $ ./bin/spark-submit examples/src/main/python/pi.py

https://www.datacamp.com/community/blog/pyspark-cheat-sheet-python
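The seqOp/combOp pair used with aggregate() above is easy to misread; here is a small sketch of the same (sum, count) aggregation, assuming a SparkContext is already available as sc (as it is in the PySpark shell):

rdd = sc.parallelize(range(100))

seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # fold one element into a partition's (sum, count)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge the partial (sum, count) pairs

total, count = rdd.aggregate((0, 0), seq_op, comb_op)
print(total, count, total / count)                  # 4950 100 49.5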


Content Copyright by DataCamp.com. Design Copyright by BecomingHuman.Ai. See Original here.
NumPy Basics Cheat Sheet
BecomingHuman.AI

Copying Arrays
>>> h = a.view()      Create a view of the array with the same data
>>> np.copy(a)        Create a copy of the array
>>> h = a.copy()      Create a deep copy of the array

Sorting Arrays
>>> a.sort()          Sort an array
>>> c.sort(axis=0)    Sort the elements of an array's axis

Data Types Subsetting, Slicing, Indexing


>>> np.int64 Signed 64-bit integer types
The NumPy library is the core library for
scientific computing in Python. It provides a
>>> np.float32 Standard double-precision floating point Subsetting
>>> np.complex Complex numbers represented by 128 floats
high-performance multidimensional array >>> a[2] 1 2 3 Select the element at the 2nd index
>>> np.bool Boolean type storing TRUE and FALSE 3
object, and tools for working with these arrays.
>>> np.object Python object type values >>> b[1,2] 1.5 2 3 Select the element at row 1 column 2
6.0 4 5 6 (equivalent to b[1][2])
>>> np.string_ Fixed-length string type
1D array 2D array 3D array >>> np.unicode_ Fixed-length unicode type
axis 1 axis 2 Slicing
1 2 3 axis 1 >>> a[0:2] Select items at index 0 and 1
1.5 2 3 1 2 3
array([1, 2])
axis 0 axis 0
4 5 6
Asking For Help >>> b[0:2,1]
array([ 2., 5.])
1.5 2
4 5
3
6
Select items at rows 0 and 1 in column 1

Creating Arrays >>> np.info(np.ndarray.dtype) >>> b[:1]


array([[1.5, 2., 3.]])
1.5 2
4 5
3
6
Select all items at row 0
(equivalent to b[0:1, :])
>>> c[1,...] Same as [1,:,:]
>>> a = np.array([1,2,3]) array([[[ 3., 2., 1.],
>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) [ 4., 5., 6.]]])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]],dtype = float) Array Mathematics >>> a[ : :-1]
array([3, 2, 1])
Reversed array a

Initial Placeholders Arithmetic Operations


>>> np.zeros((3,4)) Create an array of zeros >>> g = a - b Subtraction Boolean Indexing
array([[-0.5, 0. , 0. ], >>> a[a<2] Select elements from a less than 2
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones [-3. , -3. , -3. ]]) 1 2 3
array([1])
>>> d = np.arange(10,25,5) Create an array of evenly spaced >>> np.subtract(a,b) Subtraction
values (step value)
>>> b + a Addition
>>> np.linspace(0,2,9) Create an array of evenly array([[ 2.5, 4. , 6. ],
spaced values (number of samples) [ 5. , 7. , 9. ]]) Fancy Indexing
>>> e = np.full((2,2),7) Create a constant array >>> np.add(b,a) Addition
>>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
>>> f = np.eye(2) Create a 2X2 identity matrix >>> a / b Division array([ 4. , 2. , 6. , 1.5])
>>> np.random.random((2,2)) Create an array with random values array([[ 0.66666667, 1. , 1. ],
[ 0.25 , 0.4 , 0.5 ]]) >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
>>> np.empty((3,2)) Create an empty array array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
[ 4. , 5. , 6. , 4. ],
>>> a * b Multiplication [ 1.5, 2. , 3. , 1.5]])
array([[ 1.5, 4. , 9. ],
[ 4. , 10. , 18. ]])
>>> np.multiply(a,b) Multiplication
I/O >>> np.exp(b) Exponentiation
>>> np.sqrt(b) Square root
Saving & Loading On Disk >>> np.sin(a) Print sines of an array Array Manipulation
>>> np.save('my_array', a) >>> np.cos(b) Element-wise cosine
>>> np.savez('array.npz', a, b) >>> np.log(a) Element-wise natural logarithm Transposing Array Changing Array Shape
>>> np.load('my_array.npy') >>> e.dot(f) Dot product >>> i = np.transpose(b) Permute array dimensions
array([[ 7., 7.], >>> b.ravel() Flatten the array
[ 7., 7.]]) >>> i.T Permute array dimensions >>> g.reshape(3,-2) Reshape, but don’t change data
Saving & Loading Text Files
>>> np.loadtxt("myfile.txt") Comparison
>>> np.genfromtxt("my_file.csv", delimiter=',')
>>> np.savetxt("myarray.txt", a, delimiter=" ")
>>> a == b
array([[False, True, True],
Element-wise comparison Adding/Removing Elements Combining Arrays
[False, False, False]], dtype=bool) >>> h.resize((2,6)) Return a new array with shape (2,6) >>> np.concatenate((a,d),axis=0) Concatenate arrays
>>> a < 2 Element-wise comparison array([ 1, 2, 3, 10, 15, 20])
>>> np.append(h,g) Append items to an array
array([True, False, False], dtype=bool) >>> np.vstack((a,b)) Stack arrays vertically (row-wise)
>>> np.insert(a, 1, 5) Insert items in an array array([[ 1. , 2. , 3. ],
>>> np.array_equal(a, b) Array-wise comparison
Inspecting Your Array >>> np.delete(a,[1]) Delete items from an array [ 1.5, 2. , 3. ],
[ 4. , 5. , 6. ]])

>>> a.shape Array dimensions


Aggregate Functions >>> np.r_[e,f] Stack arrays vertically (row-wise)

>>> len(a) Length of array >>> a.sum() Array-wise sum


Splitting Arrays >>> np.hstack((e,f))
array([[ 7., 7., 1., 0.],
Stack arrays horizontally
(column-wise)
>>> np.hsplit(a,3) Split the array [ 7., 7., 0., 1.]])
>>> b.ndim Number of array dimensions >>> a.min() Array-wise minimum value
[array([1]),array([2]),array([3])] index horizontally >>> np.column_stack((a,d)) Create stacked
>>> e.size Number of array elements >>> b.max(axis=0) Maximum value of an array row at the 3rd array([[ 1, 10], column-wise arrays
>>> b.dtype Data type of array elements >>> b.cumsum(axis=1) Cumulative sum of the elements [ 2, 15],
>>> np.vsplit(c,2) Split the array vertically at the 2nd index [ 3, 20]])
>>> b.dtype.name Name of data type >>> a.mean() Mean [array([[[ 1.5, 2. , 1. ],
[ 4. , 5. , 6. ]]]), >>> np.c_[a,d] Create stacked
>>> b.astype(int) Convert an array to a different type >>> np.median(b) Median column-wise arrays

https://www.datacamp.com/community/blog/python-numpy-cheat-sheet
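The aggregate functions above all take an axis argument; a short sketch of the convention (axis 0 runs down the rows, axis 1 across the columns), using the same 2-D array b from the examples:

import numpy as np

b = np.array([[1.5, 2, 3], [4, 5, 6]])
print(b.sum(axis=0))   # [5.5 7.  9. ]  column sums
print(b.sum(axis=1))   # [ 6.5 15. ]    row sums
print(b.max(axis=0))   # [4. 5. 6.]     as in the aggregate-functions panel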


Renderers & Visual Customizations
Glyphs Customized Glyphs Also see data

Scatter Markers Selection and Non-Selection Glyphs


>>> p1.circle(np.array([1,2,3]), np.array([3,2,1]), >>> p = figure(tools='box_select')
fill_color='white') >>> p.circle('mpg', 'cyl', source=cds_df,
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3], selection_color='red',
color='blue', size=1) nonselection_alpha=0.1)
Line Glyphs Hover Glyphs
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> hover = HoverTool(tooltips=None, mode='vline')
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]),
>>> p3.add_tools(hover)
pd.DataFrame([[3,4,5],[3,2,1]]),
color="blue") US
Asia
Colormapping

Bokeh Cheat Sheet


Europe

>>> color_mapper = CategoricalColorMapper(


factors=['US', 'Asia', 'Europe'],
Rows & Columns Layout palette=['blue', 'red', 'green'])
>>> p3.circle('mpg', 'cyl', source=cds_df,
Rows color=dict(field='origin',
transform=color_mapper),
BecomingHuman.AI
>>> from bokeh.layouts import row
>>> layout = row(p1,p2,p3) legend='Origin'))
Columns Linked Plots Also see data
>>> from bokeh.layouts import column
>>> layout = column(p1,p2,p3)
Linked Axes
>>> p2.x_range = p1.x_range
Nesting Rows & Columns >>> p2.y_range = p1.y_range
>>>layout = row(column(p1,p2), p3)
Linked Brushing
>>> p4 = figure(plot_width = 100, tools='box_select,lasso_select')
>>> p4.circle('mpg', 'cyl', source=cds_df)
Grid Layout >>> p5 = figure(plot_width = 200, tools='box_select,lasso_select')
>>> from bokeh.layouts import gridplot
>>> row1 = [p1,p2] Tabbed Layout
>>> row2 = [p3]
>>> layout = gridplot([[p1,p2],[p3]]) >>> from bokeh.models.widgets import Panel, Tabs
>>> tab1 = Panel(child=p1, title="tab1")
>>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])
Legends
Legend Location Legend Orientation
Data Types Data Also see Lists, NumPy & Pandas Inside Plot Area >>> p.legend.orientation = "horizontal"
The Python interactive visualization library Bokeh enables >>> p.legend.location = 'bottom_left' >>> p.legend.orientation = "vertical"
high-performance visual presentation of large datasets in modern Under the hood, your data is converted to Column Data Outside Plot Area
web browsers.
Sources. You can also do this manually: >>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]) Legend Background & Border
>>> r2 = p2.line([1,2,3,4], [3,4,5,6])
Bokeh’s mid-level general purpose bokeh.plotting >>> import numpy as np >>> p.legend.border_line_color = "navy"
>>> legend = Legend(items=[("One" , [p1, r1]),("Two" , [r2])], location=(0, -30))
interface is centered around two main components: data and glyphs. >>> import pandas as pd >>> p.legend.background_fill_color = "white"
>>> p.add_layout(legend, 'right')
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'],
[32.4,4,66, 'Asia'],
[21.4,4,109, 'Europe']]),
columns=['mpg','cyl', 'hp', 'origin'],
index=['Toyota', 'Fiat', 'Volvo'])
Statistical Charts
>>> from bokeh.models import ColumnDataSource
>>> cds_df = ColumnDataSource(df) Output With Bokeh Also see Data
data glyphs plot
Bokeh’s high-level bokeh.charts interface is ideal for
The basic steps to creating plots with the bokeh.plotting Output to HTML File quickly creating statistical charts
interface are:
1. Prepare some data:
Plotting >>> from bokeh.io import output_file, show
>>> output_file('my_bar_chart.html', mode='cdn')
Bar Chart
>>> from bokeh.charts import Bar
Python lists, NumPy arrays, Pandas DataFrames and other sequences of values Embedding >>> p = Bar(df, stacked=True, palette=['red','blue'])
2. Create a new plot >>> from bokeh.plotting import figure Notebook Output
>>> p1 = figure(plot_width=300, tools='pan,box_zoom') >>> from bokeh.io import output_notebook, show
3. Add renderers for your data, with visual customizations Box Plot
>>> p2 = figure(plot_width=300, plot_height=300, >>> output_notebook()
4. Specify where to generate the output x_range=(0, 8), y_range=(0, 8)) >>> from bokeh.charts import BoxPlot
5. Show or save the results >>> p3 = figure() >>> p = BoxPlot(df, values='vals', label='cyl',
legend='bottom_right')
>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show Histogram
>>> x = [1, 2, 3, 4, 5] step 1 Standalone HTML >>> from bokeh.charts import Histogram
>>> y = [6, 7, 2, 4, 5] >>> from bokeh.embed import file_html >>> p = Histogram(df, title='Histogram')
>>> p = figure(title="simple line example", step 2 >>> html = file_html(p, CDN, "my_plot")
x_axis_label='x', Scatter Plot
y_axis_label='y') Components
>>> p.line(x, y, legend="Temp.", line_width=2) step 3 Show or Save Your Plots >>> from bokeh.embed import components
>>> from bokeh.charts import Scatter
>>> p = Scatter(df, x='mpg', y ='hp',
>>> output_file("lines.html") step 4 >>> script, div = components(p) marker='square',
>>> show(p) step 5 >>> show(p1) >>> save(p1)
>>> show(layout) >>> save(layout) xlabel='Miles Per Gallon',

https://www.datacamp.com/community/blog/bokeh-cheat-sheet-python
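A compact, runnable sketch that stitches the ColumnDataSource and CategoricalColorMapper snippets above into one plot; it uses the same older bokeh.plotting API the cheat sheet targets, and the toy DataFrame mirrors the mpg/cyl/origin example:

import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, CategoricalColorMapper

df = pd.DataFrame({'mpg': [33.9, 32.4, 21.4],
                   'cyl': [4, 4, 4],
                   'origin': ['US', 'Asia', 'Europe']})
cds_df = ColumnDataSource(df)

color_mapper = CategoricalColorMapper(factors=['US', 'Asia', 'Europe'],
                                      palette=['blue', 'red', 'green'])
p = figure(plot_width=300, plot_height=300)
p.circle('mpg', 'cyl', source=cds_df, size=10,
         color=dict(field='origin', transform=color_mapper))
show(p)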


Keras Cheat Sheet
BecomingHuman.AI

Inspect Model
>>> model.output_shape      Model output shape
>>> model.summary()         Model summary representation
>>> model.get_config()      Model configuration
>>> model.get_weights()     List all weight tensors in the model

Prediction
>>> model3.predict(x_test4, batch_size=32)
>>> model3.predict_classes(x_test4,batch_size=32)

Keras is a powerful and easy-to-use


deep learning library for Theano and Model Architecture
TensorFlow that provides a high-level neural Model Fine-tuning
Sequential Model
networks API to develop and evaluate deep
>>> from keras.models import Sequential Optimization Parameters Model Training
learning models. >>> model = Sequential() >>> from keras.optimizers import RMSprop >>> model3.fit(x_train4,
>>> model2 = Sequential() >>> opt = RMSprop(lr=0.0001, decay=1e-6) y_train4,
A Basic Example >>> model3 = Sequential() >>> model2.compile(loss='categorical_crossentropy',
optimizer=opt,
batch_size=32,
epochs=15,
verbose=1,
metrics=['accuracy'])
>>> import numpy as np
Multilayer Perceptron (MLP) validation_data=(x_test4,y_test4))

>>> from keras.models import Sequential Binary Classification Early Stopping


>>> from keras.layers import Dense >>> from keras.layers import Dense >>> from keras.callbacks import EarlyStopping
>>> model.add(Dense(12,
>>> data = np.random.random((1000,100)) input_dim=8, >>> early_stopping_monitor = EarlyStopping(patience=2)
>>> labels = np.random.randint(2,size=(1000,1)) kernel_initializer='uniform',
activation='relu'))
>>> model3.fit(x_train4, Evaluate Your
y_train4,
>>> model = Sequential()
>>> model.add(Dense(32,
>>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid'))
batch_size=32,
epochs=15, Model's Performance
activation='relu', validation_data=(x_test4,y_test4),
Multi-Class Classification callbacks=[early_stopping_monitor])
input_dim=100)) >>> score = model3.evaluate(x_test,
>>> from keras.layers import Dropout y_test,
>>> model.add(Dense(1, activation='sigmoid')) batch_size=32)
>>> model.add(Dense(512,activation='relu',input_shape=(784,)))
>>> model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
>>> model.add(Dropout(0.2))
>>> model.add(Dense(512,activation='relu'))
Compile Model
metrics=['accuracy']) >>> model.add(Dropout(0.2))
>>> model.add(Dense(10,activation='softmax')) MLP: Binary Classification Preprocessing
Regression >>> model.compile(optimizer='adam',
Data Also see NumPy, Pandas & Scikit-Learn
>>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1]))
>>> model.add(Dense(1))
loss='binary_crossentropy',
metrics=['accuracy']) Sequence Padding
Convolutional Neural Network (CNN) MLP: Multi-Class Classification >>> from keras.preprocessing import sequence
Your data needs to be stored as NumPy arrays or as a list of NumPy >>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80)
arrays. Ideally, you split the data in training and test sets, for which you >>> model.compile(optimizer='rmsprop', >>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80)
>>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten loss='categorical_crossentropy',
can also resort to the train_test_split module of sklearn.cross_validation. metrics=['accuracy'])
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
>>> model2.add(Activation('relu'))
MLP: Regression One-Hot Encoding
Keras Data Sets >>> model2.add(Conv2D(32,(3,3)))
>>> from keras.utils import to_categorical
>>> model2.add(Activation('relu')) >>> model.compile(optimizer='rmsprop',
>>> from keras.datasets import boston_housing, loss='mse', >>> Y_train = to_categorical(y_train, num_classes)
>>> model2.add(MaxPooling2D(pool_size=(2,2))) metrics=['mae'])
mnist, >>> Y_test = to_categorical(y_test, num_classes)
cifar10, >>> model2.add(Dropout(0.25))
>>> Y_train3 = to_categorical(y_train3, num_classes)
imdb >>> model2.add(Conv2D(64,(3,3), padding='same'))
>>> Y_test3 = to_categorical(y_test3, num_classes)
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data() >>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64,(3, 3)))
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>> model2.add(Activation('relu')) Recurrent Neural Network Train and Test Sets
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data() >>> model2.add(MaxPooling2D(pool_size=(2,2))) >>> model3.compile(loss='binary_crossentropy', >>> from sklearn.model_selection import train_test_split
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000) >>> model2.add(Dropout(0.25)) optimizer='adam',
metrics=['accuracy']) >>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X,
>>> num_classes = 10 >>> model2.add(Flatten()) y,
>>> model2.add(Dense(512)) test_size=0.33,
>>> model.fit(data,labels,epochs=10,batch_size=32) random_state=42)
>>> model2.add(Activation('relu'))
>>> predictions = model.predict(data) >>> model2.add(Dropout(0.5))
>>> model2.add(Dense(num_classes)) Standardization/Normalization
Other >>> model2.add(Activation('softmax')) >>> from sklearn.preprocessing import StandardScaler
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/
Recurrent Neural Network (RNN) Save/ Reload Models >>> scaler = StandardScaler().fit(x_train2)
>>> standardized_X = scaler.transform(x_train2)
ml/machine-learning-databases/pima-indians-diabetes/ >>> from keras.layers import Embedding,LSTM >>> standardized_X_test = scaler.transform(x_test2)
pima-indians-diabetes.data"),delimiter=",") >>> model3.add(Embedding(20000,128)) >>> from keras.models import load_model
>>> X = data[:,0:8] >>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2)) >>> model3.save('model_file.h5')
>>> y = data [:,8] >>> model3.add(Dense(1,activation='sigmoid')) >>> my_model = load_model('my_model.h5')

https://www.datacamp.com/community/blog/keras-cheat-sheet
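Pulling the scattered MLP fragments above into one place, here is a minimal end-to-end sketch (random data, illustrative shapes) of the build / compile / fit / evaluate cycle:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels, epochs=10, batch_size=32)
score = model.evaluate(data, labels, batch_size=32)
print(score)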


Pandas Basics Cheat Sheet
BecomingHuman.AI

Asking For Help
>>> help(pd.Series.loc)

Selection    Also see NumPy Arrays

Getting
>>> s['b'] Get one element
-5
>>> df[1:] Get subset of a DataFrame

Use the following import convention: >>> import pandas as pd Country Capital Population
1 India New Delhi 1303171035
Applying Functions
2 Brazil Brasília 207847528 >>> f = lambda x: x*2
>>> df.apply(f) Apply function
Selecting, Boolean Indexing & Setting
The Pandas library is By Position
>>> df.applymap(f) Apply function element-wise

Select single value by row &


built on NumPy and Dropping
>>> df.iloc[[0],[0]]
'Belgium'
>>> df.iat[0,0]
column

'Belgium' Data Alignment


provides easy-to-use >>> s.drop(['a', 'c']) Drop values from rows (axis=0) By Label
>>> df.loc[[0], ['Country']]
Select single value by row &
column labels
>>> df.drop('Country', axis=1) Drop values from columns(axis=1) Internal Data Alignment
data structures and 'Belgium'
>>> df.at[0, 'Country'] 'Belgium' NA values are introduced in the indices that don’t overlap:
By Label/Position >>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
data analysis tools for Sort & Rank >>> df.ix[2]
Country Brazil
Select single row of
subset of rows
>>> s + s3
a 10.0
Capital Brasília b NaN

the Python >>> df.sort_index()


>>> df.sort_values(by='Country')
Sort by labels along an axis
Sort by the values along an axis
Population 207847528
>>> df.ix[:,'Capital']
0 Brussels
Select a single column of
subset of columns
c 5.0
d 7.0

1 New Delhi
programming >>> df.rank() Assign ranks to entries 2 Brasília
>>> df.ix[1,'Capital'] Select rows and columns
Arithmetic Operations with Fill Methods
You can also do the internal data alignment yourself with
'New Delhi'

language. Boolean Indexing


>>> s[~(s > 1)] Series s where value is not >1
the help of the fill methods:
>>> s.add(s3, fill_value=0)
>>> s[(s < -1) | (s > 2)] s where value is <-1 or >2 a 10.0
Retrieving Series/ >>> df[df['Population']>1200000000] Use filter to adjust DataFrame
b -5.0
c 5.0
Pandas Data Structures DataFrame Information Setting
Set index a of Series s to 6
d 7.0
>>> s.sub(s3, fill_value=2)
>>> s['a'] = 6
>>> s.div(s3, fill_value=4)
Series
A one-dimensional >>> df.shape (rows,columns)
labeled array a
>>> df.index Describe index
capable of holding any
data type >>> df.columns
>>> df.info()
Describe DataFrame columns
Info on DataFrame
I/O
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
>>> df.count() Number of non-NA values
Read and Write to CSV Read and Write to SQL Query or Database Table
Data Frame >>> pd.read_csv('file.csv', header=None, nrows=5) >>> from sqlalchemy import create_engine
A two-dimensional column
Country Capital Population Summary >>> df.to_csv('myDataFrame.csv') >>> engine = create_engine('sqlite:///:memory:')
labeled data structure 0 Belgium Brussels 11190846 >>> pd.read_sql("SELECT * FROM my_table;", engine)
with columns of 1 India New Delhi 1303171035
potentially different index 2 Brazil Brasilia 207847528
>>> df.sum() Sum of values Read and Write to Excel >>> pd.read_sql_table('my_table', engine)
types >>> df.cumsum() Cummulative sum of values >>> pd.read_sql_query("SELECT * FROM my_table;", engine)
>>> pd.read_excel('file.xlsx')
>>> df.min()/df.max() Minimum/maximum values
>>> data = {'Country': ['Belgium', 'India', 'Brazil'], a 3
b -5 >>> df.idxmin()/df.idxmax() Minimum/Maximum index value
>>> df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1') read_sql() is a convenience wrapper around read_sql_table()
'Capital': ['Brussels', 'New Delhi', 'Brasília'], c 7 Read multiple sheets from the same file and read_sql_query()
'Population': [11190846, 1303171035, 207847528]}
d 4
>>> df.describe() Summary statistics
>>> df = pd.DataFrame(data, >>> df.mean() Mean of values >>> xlsx = pd.ExcelFile('file.xls') >>> df.to_sql('myDf', engine)
columns=['Country', 'Capital', 'Population']) >>> df.median() Median of values >>> df = pd.read_excel(xlsx, 'Sheet1')

https://www.datacamp.com/community/blog/pandas-cheat-sheet-python
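A short runnable sketch of the DataFrame construction and the selection idioms above (selection by position, by label, and boolean indexing), using the same Country/Capital/Population example:

import pandas as pd

data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])

print(df.iloc[0, 0])                         # 'Belgium' (by position)
print(df.loc[0, 'Country'])                  # 'Belgium' (by label)
print(df[df['Population'] > 1200000000])     # boolean filter: the India row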


Pandas Cheat Sheet
BecomingHuman.AI

Advanced Indexing / Combining Data    Also see NumPy Arrays

Selecting data1 data2


>>> df3.loc[:,(df3>1).any()] Select cols with any vals >1
>>> df3.loc[:,(df3>1).all()] Select cols with vals > 1 X1 X2 X1 X3
>>> df3.loc[:,df3.isnull().any()] Select cols with NaN 11.432

a a 20.784
>>> df3.loc[:,df3.notnull().all()] Select cols without NaN
b 1.303 b NaN
Indexing With isin
c 99.906 d 20.784
>>> df[(df.Country.isin(df2.Type))] Find same elements
>>> df3.filter(items=["a","b"]) Filter on values

>>> df.select(lambda x: not x%5)


Where
Select specific elements Pivot
>>> pd.merge(data1,
>>> s.where(s > 0) Subset the data data2, X1 X2 X3
Query how='left',
on='X1') a 11.432 20.784
>>> df6.query('second > first') Query DataFrame
b 1.303 NaN

Setting/Resetting Index c 99.906 NaN

>>> df.set_index('Country') Set the index >>> pd.merge(data1, X1 X2 X3


>>> df4 = df.reset_index() Reset the index data2,
how='right', a 11.432 20.784
Pandas Data Structures >>> df = df.rename(index=str,
columns={"Country":"cntry",
"Capital":"cptl",
Rename DataFrame
on='X1')
b 1.303 NaN
"Population":"ppltn"}) d NaN 20.784
Pivot Reindexing
>>> df3= df2.pivot(index='Date', Spread rows into columns >>> pd.merge(data1, X1 X2 X3
>>> s2 = s.reindex(['a','c','d','e','b']) data2,
columns='Type',
how='inner', a 11.432 20.784
values='Value') Forward Filling Forward Filling on='X1')
>>> df.reindex(range(4), >>> s3 = s.reindex(range(5), b 1.303 NaN
Date Type Value
method='ffill') method='bfill')
0 2016-03-01 a 11.432 Type a b c >>> pd.merge(data1, X1 X2 X3
Country Capital Population 0 3
data2,
1 2016-03-02 b 13.031 Date 0 Belgium Brussels 11190846 1 3 how='outer', a 11.432 20.784
2 2016-03-01 c 20.784 1 India New Delhi 1303171035 2 3 on='X1')
2016-03-01 11.432 NaN 20.784 b 1.303 NaN
3 2016-03-03 a 99.906 2 Brazil Brasília 207847528 3 3
2016-03-02 1.303 13.031 NaN
3 Brazil Brasília 207847528 4 3 c 99.906 NaN
4 2016-03-02 a 1.303 2016-03-03 99.906 NaN 20.784
d NaN 20.784
5 2016-03-03 c 20.784
MultiIndexing Join
Pivot Table >>> arrays = [np.array([1,2,3]),
>>> data1.join(data2, how='right')
np.array([5,4,3])]
>>> df4 = pd.pivot_table(df2, Spread rows into columns >>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
values='Value', >>> tuples = list(zip(*arrays)) Concatenate
index='Date',
>>> index = pd.MultiIndex.from_tuples(tuples, Vertical
columns='Type']) names=['first', 'second']) >>> s.append(s2)
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
Horizontal/Vertical
0 1 1 5 0 0.233482 >>> df2.set_index(["Date", "Type"])
>>> pd.concat([s,s2],axis=1, keys=['One','Two'])
1 5 0.233482 0.390959 1 0.390959 >>> pd.concat([data1, data2], axis=1, join='inner')

2 4 0.184713 0.237102 2 4 0 0.184713


Duplicate Data
3 3 0.433522 0.429401 1 0.237102

Unstacked 3 3 0 0.433522
>>> s3.unique()
>>> df2.duplicated('Type')
Return unique values
Check duplicates
Dates
1 0.429401 >>> df2.drop_duplicates('Type', keep='last') Drop duplicates
>>> df2['Date']= pd.to_datetime(df2['Date'])
Stacked >>> df.index.duplicated() Drop duplicates
>>> df2['Date']= pd.date_range('2000-1-1', periods=6,
Melt freq='M')
>>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
>>> pd.melt(df2, Gather columns into rows >>> index = pd.DatetimeIndex(dates)
id_vars=["Date"], Grouping Data >>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')
value_vars=["Type", "Value"],
value_name="Observations") Aggregation
Date Variable Observations >>> df2.groupby(by=['Date','Type']).mean()
Date Type Value
0 2016-03-01 Type a >>> df4.groupby(level=0).sum()
0

1
2016-03-01

2016-03-02
a

b
11.432

13.031
1
2
2016-03-02
2016-03-01
Type
Type
b
c
>>> df4.groupby(level=0).agg({'a':lambda x:sum(x)/len(x), 'b': np.sum}) Visualization
Transformation
2016-03-01 c 20.784 3 2016-03-03 Type a
2 >>> customSum = lambda x: (x+x%2) >>> import matplotlib.pyplot as plt
4 2016-03-02 Type a
3 2016-03-03 a 99.906 >>> df4.groupby(level=0).transform(customSum)
5 2016-03-03 Type c >>> s.plot() >>> df2.plot()
4 2016-03-02 a 1.303
6 2016-03-01 Value 11.432 >>> plt.show() >>> plt.show()
5 2016-03-03 c 20.784 7 2016-03-02 Value 13.031
8 2016-03-01 Value 20.784 Missing Data
9 2016-03-03 Value 99.906
10 2016-03-02 Value 1.303 >>> df.dropna() Drop NaN value
11 2016-03-03 Value 20.784 >>> df3.fillna(df3.mean()) Fill NaN values with a predetermined value
>>> df2.replace("a", "f") Replace values with others

https://www.datacamp.com/community/blog/pandas-cheat-sheet-python
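A runnable sketch of the pivot / melt round trip shown above, built on the same long-format Date/Type/Value example data:

import pandas as pd

df2 = pd.DataFrame({'Date': ['2016-03-01', '2016-03-02', '2016-03-01',
                             '2016-03-03', '2016-03-02', '2016-03-03'],
                    'Type': ['a', 'b', 'c', 'a', 'a', 'c'],
                    'Value': [11.432, 13.031, 20.784, 99.906, 1.303, 20.784]})

# Spread rows into columns (long -> wide)
df3 = df2.pivot(index='Date', columns='Type', values='Value')
print(df3)

# Gather columns back into rows (wide -> long)
print(pd.melt(df3.reset_index(), id_vars=['Date'], value_name='Value'))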


Data Wrangling with pandas Cheat Sheet
BecomingHuman.AI
Syntax Creating DataFrames Tidy Data A foundation for wrangling in pandas Summarise Data Handling Missing Data
df['w'].value_counts() df.dropna()

&
a b c F M A F M A Tidy data complements pandas’s
In a tidy M A F Count number of rows with each unique value Drop rows with any column having NA/null data.
vectorized operations. pandas will
1 4 7 10 data set: of variable
automatically preserve observations df.fillna(value)
2 5 8 11 as you manipulate variables. No len(df)
Each variable Each observation other format works as intuitively with # of rows in DataFrame.
3 6 9 12 is saved in its is saved in its pandas M A
own column own row df['w'].nunique()
df = pd.DataFrame( # of distinct values in a column.
{"a" : [4 ,5, 6],
"b" : [7, 8, 9], df.describe()
Make New Columns
"c" : [10, 11, 12]}, Basic descriptive statistics for each column
index = [1, 2, 3])
Specify values for each column.
df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
Specify values for each row.

df = pd.DataFrame(
    {"a" : [4, 5, 6],
     "b" : [7, 8, 9],
     "c" : [10, 11, 12]},
    index = pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))
Create DataFrame with a MultiIndex.

Reshaping Data: Change the layout of a data set

df.sort_values('mpg')                    Order rows by values of a column (low to high).
df.sort_values('mpg', ascending=False)   Order rows by values of a column (high to low).
df.rename(columns = {'y':'year'})        Rename the columns of a DataFrame.
df.sort_index()                          Sort the index of a DataFrame.
df.reset_index()                         Reset index of DataFrame to row numbers, moving index to columns.
df.drop(columns=['Length','Height'])     Drop columns from DataFrame.
pd.melt(df)                              Gather columns into rows.
df.pivot(columns='var', values='val')    Spread rows into columns.
pd.concat([df1,df2])                     Append rows of DataFrames.
pd.concat([df1,df2], axis=1)             Append columns of DataFrames.

Make New Columns

df.assign(Area=lambda df: df.Length*df.Height)   Compute and append one or more new columns.
df['Volume'] = df.Length*df.Height*df.Depth      Add single column.
pd.qcut(df.col, n, labels=False)                 Bin column into n buckets.

Summary Functions (also usable with GroupBy)
pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling; see below) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

sum()                     Sum values of each object.
count()                   Count non-NA/null values of each object.
median()                  Median value of each object.
quantile([0.25,0.75])     Quantiles of each object.
apply(function)           Apply function to each object.
min()                     Minimum value in each object.
max()                     Maximum value in each object.
mean()                    Mean value of each object.
var()                     Variance of each object.
std()                     Standard deviation of each object.

Vector Functions
pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

max(axis=1)                  Element-wise max.
min(axis=1)                  Element-wise min.
clip(lower=-10, upper=10)    Trim values at input thresholds.
abs()                        Absolute value.

A worked example of DataFrame creation and these summary calls follows below.
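To make the listing above concrete, here is a minimal, runnable sketch; the column names (Length, Height, Depth) are invented for illustration:

import pandas as pd

df = pd.DataFrame(
    {"Length": [4, 5, 6],
     "Height": [7, 8, 9],
     "Depth":  [10, 11, 12]},
    index=[1, 2, 3])

df = df.assign(Area=lambda d: d.Length * d.Height)   # new column from existing ones
df['Volume'] = df.Length * df.Height * df.Depth      # add a single column directly

print(df.describe())                 # basic descriptive statistics per column
print(df.mean())                     # one summary value per column (a Series)
print(df.clip(lower=5, upper=60))    # element-wise vector function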

Subset Observations (Rows)

df[df.Length > 7]           Extract rows that meet logical criteria.
df.drop_duplicates()        Remove duplicate rows (only considers columns).
df.sample(frac=0.5)         Randomly select fraction of rows.
df.sample(n=10)             Randomly select n rows.
df.iloc[10:20]              Select rows by position.
df.head(n)                  Select first n rows.
df.tail(n)                  Select last n rows.
df.nlargest(n, 'value')     Select and order top n entries.
df.nsmallest(n, 'value')    Select and order bottom n entries.

Subset Variables (Columns)

df[['width','length','species']]   Select multiple columns with specific names.
df['width'] or df.width            Select single column with specific name.
df.filter(regex='regex')           Select columns whose name matches regular expression regex.
df.loc[:,'x2':'x4']                Select all columns between x2 and x4 (inclusive).

Regex examples for df.filter:
'\.'                Matches strings containing a period '.'
'Length$'           Matches strings ending with word 'Length'
'^Sepal'            Matches strings beginning with the word 'Sepal'
'^x[1-5]$'          Matches strings beginning with 'x' and ending with 1, 2, 3, 4 or 5
'^(?!Species$).*'   Matches strings except the string 'Species'

Method Chaining
Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
        .rename(columns={
            'variable' : 'var',
            'value' : 'val'})
        .query('val >= 200')
     )
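A minimal runnable illustration of the chained pattern above; the values and the 200 threshold are arbitrary placeholders:

import pandas as pd

df = pd.DataFrame({"a": [150, 250], "b": [300, 50]})

tidy = (pd.melt(df)                        # gather columns into variable/value pairs
          .rename(columns={"variable": "var",
                           "value": "val"})
          .query("val >= 200"))            # keep only the large values
print(tidy)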
Logic in Python (and pandas)

<     Less than                  !=                       Not equal to
>     Greater than               df.column.isin(values)   Group membership
==    Equal to                   pd.isnull(obj)           Is NaN
<=    Less than or equal to      pd.notnull(obj)          Is not NaN
>=    Greater than or equal to   &, |, ~, ^, df.any(), df.all()   Logical and, or, not, xor, any, all

df.iloc[:,[1,2,5]]                 Select columns in positions 1, 2 and 5 (first column is 0).
df.loc[df['a'] > 10, ['a','c']]    Select rows meeting logical condition, and only the specific columns.

Windows

df.expanding()     Return an Expanding object allowing summary functions to be applied cumulatively.
df.rolling(n)      Return a Rolling object allowing summary functions to be applied to windows of length n.

Combine Data Sets

Standard Joins
pd.merge(adf, bdf, how='left', on='x1')     Join matching rows from bdf to adf.
pd.merge(adf, bdf, how='right', on='x1')    Join matching rows from adf to bdf.
pd.merge(adf, bdf, how='inner', on='x1')    Join data. Retain only rows in both sets.
pd.merge(adf, bdf, how='outer', on='x1')    Join data. Retain all values, all rows.

Set Operations
pd.merge(ydf, zdf)                          Rows that appear in both ydf and zdf (intersection).
pd.merge(ydf, zdf, how='outer')             Rows that appear in either or both ydf and zdf (union).
(pd.merge(ydf, zdf, how='outer', indicator=True)
   .query('_merge == "left_only"')
   .drop(columns=['_merge']))               Rows that appear in ydf but not zdf (set difference).

A worked join example follows below.
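A short sketch of the join calls above using two toy frames; adf, bdf and the x1/x2/x3 columns mirror the sheet's naming but the values are made up:

import pandas as pd

adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": [True, False, True]})

print(pd.merge(adf, bdf, how="left",  on="x1"))   # keep all rows of adf
print(pd.merge(adf, bdf, how="inner", on="x1"))   # only keys present in both
print(pd.merge(adf, bdf, how="outer", on="x1"))   # all keys from both frames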
GroupBy

df.groupby(by="col")        Return a GroupBy object, grouped by values in column named "col".
df.groupby(level="ind")     Return a GroupBy object, grouped by values in index level named "ind".

All of the summary functions listed above can be applied to a group. Additional GroupBy functions:
size()            Size of each group.
agg(function)     Aggregate group using function.

The window functions below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.
shift(1)                Copy with values shifted by 1.
shift(-1)               Copy with values lagged by 1.
rank(method='first')    Ranks. Ties go to first value.
rank(method='dense')    Ranks with no gaps.
rank(method='min')      Ranks. Ties get min rank.
rank(pct=True)          Ranks rescaled to interval [0, 1].
cummin()                Cumulative min.
cummax()                Cumulative max.
cumsum()                Cumulative sum.
cumprod()               Cumulative product.

Filtering Joins

adf[adf.x1.isin(bdf.x1)]     All rows in adf that have a match in bdf.
adf[~adf.x1.isin(bdf.x1)]    All rows in adf that do not have a match in bdf.

Plotting

df.plot.hist()                   Histogram for each column.
df.plot.scatter(x='w', y='h')    Scatter chart using pairs of points.
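A compact example of the GroupBy and window calls above; the group labels and values are made up:

import pandas as pd

df = pd.DataFrame({"col": ["a", "a", "b", "b"],
                   "val": [1, 2, 3, 4]})

print(df.groupby(by="col")["val"].agg("sum"))   # one summary value per group
print(df.groupby(by="col").size())              # size of each group
print(df["val"].rolling(2).mean())              # rolling window of length 2
print(df.groupby("col")["val"].cumsum())        # per-group cumulative sum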

https://github.com/rstudio/cheatshe…r/LICENSE
Data Wrangling with dplyr and tidyr Cheat Sheet
BecomingHuman.AI

Syntax: Helpful conventions for wrangling

dplyr::tbl_df(iris)
Converts data to tbl class. tbl's are easier to examine than data frames. R displays only the data that fits onscreen.

Summarise Data

dplyr::summarise(iris, avg = mean(Sepal.Length))
Summarise data into single row of values.
dplyr::summarise_each(iris, funs(mean))
Apply summary function to each column.
dplyr::count(iris, Species, wt = Sepal.Length)
Count number of rows with each unique value of variable (with or without weights).

Make New Variables

dplyr::mutate(iris, sepal = Sepal.Length + Sepal.Width)
Compute and append one or more new columns.
dplyr::mutate_each(iris, funs(min_rank))
Apply window function to each column.
dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width)
Compute one or more new columns. Drop original columns.

Reshaping Data: Change the layout of a data set

dplyr::data_frame(a = 1:3, b = 4:6)
Combine vectors into data frame (optimized).
dplyr::glimpse(iris)
Information dense summary of tbl data.
utils::View(iris)
View data set in spreadsheet-like display (note capital V).
dplyr::arrange(mtcars, mpg)
Order rows by values of a column (low to high).
dplyr::arrange(mtcars, desc(mpg))
Order rows by values of a column (high to low).
dplyr::rename(tb, y = year)
Rename the columns of a data frame.
tidyr::gather(cases, "year", "n", 2:4)
Gather columns into rows.
tidyr::spread(pollution, size, amount)
Spread rows into columns.
tidyr::separate(storms, date, c("y", "m", "d"))
Separate one column into several.
tidyr::unite(data, col, ..., sep)
Unite several columns into one.

Summarise uses summary functions, functions that take a vector of values and return a single value, such as:

dplyr::first         First value of a vector.
dplyr::last          Last value of a vector.
dplyr::nth           Nth value of a vector.
dplyr::n             # of values in a vector.
dplyr::n_distinct    # of distinct values in a vector.
min                  Minimum value in a vector.
max                  Maximum value in a vector.
mean                 Mean value of a vector.
median               Median value of a vector.
var                  Variance of a vector.
sd                   Standard deviation of a vector.
IQR                  IQR of a vector.

Mutate uses window functions, functions that take a vector of values and return another vector of values, such as:

dplyr::lead           Copy with values shifted by 1.
dplyr::lag            Copy with values lagged by 1.
dplyr::dense_rank     Ranks with no gaps.
dplyr::min_rank       Ranks. Ties get min rank.
dplyr::percent_rank   Ranks rescaled to [0, 1].
dplyr::row_number     Ranks. Ties go to first value.
dplyr::ntile          Bin vector into n buckets.
dplyr::between        Are values between a and b?
dplyr::cume_dist      Cumulative distribution.
dplyr::cumall         Cumulative all.
dplyr::cumany         Cumulative any.
dplyr::cummean        Cumulative mean.
cumsum                Cumulative sum.
cummax                Cumulative max.
cummin                Cumulative min.
cumprod               Cumulative product.
pmax                  Element-wise max.
pmin                  Element-wise min.

dplyr::%>%
Passes object on left hand side as first argument (or . argument) of function on right hand side.
x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z)
Subset Observations (Rows)

dplyr::filter(iris, Sepal.Length > 7)
Extract rows that meet logical criteria.
dplyr::distinct(iris)
Remove duplicate rows.
dplyr::sample_frac(iris, 0.5, replace = TRUE)
Randomly select fraction of rows.
dplyr::sample_n(iris, 10, replace = TRUE)
Randomly select n rows.
dplyr::slice(iris, 10:15)
Select rows by position.
dplyr::top_n(storms, 2, date)
Select and order top n entries (by group if grouped data).

Subset Variables (Columns)

dplyr::select(iris, Sepal.Width, Petal.Length, Species)
Select columns by name or helper function.

Helper functions for select - ?select
select(iris, contains("."))                   Select columns whose name contains a character string.
select(iris, ends_with("Length"))             Select columns whose name ends with a character string.
select(iris, starts_with("Sepal"))            Select columns whose name starts with a character string.
select(iris, everything())                    Select every column.
select(iris, matches(".t."))                  Select columns whose name matches a regular expression.
select(iris, num_range("x", 1:5))             Select columns named x1, x2, x3, x4, x5.
select(iris, one_of(c("Species", "Genus")))   Select columns whose names are in a group of names.
select(iris, Sepal.Length:Petal.Width)        Select all columns between Sepal.Length and Petal.Width (inclusive).
select(iris, -Species)                        Select all columns except Species.

"Piping" with %>% makes code more readable, e.g.

iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width)) %>%
  arrange(avg)

(A pandas analogue of this pipeline is sketched below.)
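For readers coming from the Python sheets earlier in this document, here is a rough pandas analogue of the piped iris example above; it is not part of the original R cheat sheet, and the small data frame stands in for the iris data set:

import pandas as pd

iris = pd.DataFrame({"Species":     ["setosa", "setosa", "virginica", "virginica"],
                     "Sepal.Width": [3.5, 3.0, 3.2, 2.8]})

result = (iris.groupby("Species")["Sepal.Width"]   # group_by(Species)
              .mean()                              # summarise(avg = mean(...))
              .rename("avg")
              .reset_index()
              .sort_values("avg"))                 # arrange(avg)
print(result)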
Logic in R - ?Comparison, ?base::Logic

<     Less than                  !=                       Not equal to
>     Greater than               %in%                     Group membership
==    Equal to                   is.na                    Is NA
<=    Less than or equal to      !is.na                   Is not NA
>=    Greater than or equal to   &, |, !, xor, any, all   Boolean operators

Tidy Data: A foundation for wrangling in R
In a tidy data set, each variable is saved in its own column and each observation is saved in its own row. Tidy data complements R's vectorized operations: R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.

Combine Data Sets

Mutating Joins
dplyr::left_join(a, b, by = "x1")    Join matching rows from b to a.
dplyr::right_join(a, b, by = "x1")   Join matching rows from a to b.
dplyr::inner_join(a, b, by = "x1")   Join data. Retain only rows in both sets.
dplyr::full_join(a, b, by = "x1")    Join data. Retain all values, all rows.

Filtering Joins
dplyr::semi_join(a, b, by = "x1")    All rows in a that have a match in b.
dplyr::anti_join(a, b, by = "x1")    All rows in a that do not have a match in b.

Set Operations
dplyr::intersect(y, z)    Rows that appear in both y and z.
dplyr::union(y, z)        Rows that appear in either or both y and z.
dplyr::setdiff(y, z)      Rows that appear in y but not z.

Binding
dplyr::bind_rows(y, z)    Append z to y as new rows.
dplyr::bind_cols(y, z)    Append z to y as new columns. Caution: matches rows by position.

Group Data

dplyr::group_by(iris, Species)
Group data into rows with the same value of Species.
dplyr::ungroup(iris)
Remove grouping information from data frame.
iris %>% group_by(Species) %>% summarise(…)
Compute separate summary row for each group.
iris %>% group_by(Species) %>% mutate(…)
Compute new variables by group.

(A pandas analogue of the filtering joins is sketched below.)
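The filtering joins above have no single pandas verb; a common idiom, shown here as a hedged sketch rather than anything from the original sheet, uses isin:

import pandas as pd

a = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
b = pd.DataFrame({"x1": ["A", "B", "D"], "x3": [True, False, True]})

semi = a[a.x1.isin(b.x1)]      # like dplyr::semi_join: rows of a with a match in b
anti = a[~a.x1.isin(b.x1)]     # like dplyr::anti_join: rows of a without a match in b
print(semi)
print(anti)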
SciPy Linear Algebra Cheat Sheet
BecomingHuman.AI

The SciPy library is one of the core packages for scientific computing that provides mathematical algorithms and convenience functions built on the NumPy extension of Python.
Interacting With NumPy (also see the NumPy sheet)

>>> import numpy as np
>>> a = np.array([1,2,3])
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]])

Linear Algebra (also see the NumPy sheet)
You'll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
>>> from scipy import linalg, sparse
Creating Matrices Matrix Functions Sparse Matrix Routines
Index Tricks >>> A = np.matrix(np.random.random((2,2))) Addition Inverse
>>> np.mgrid[0:5,0:5] Create a dense meshgrid >>> B = np.asmatrix(b) >>> np.add(A,D) Addition >>> sparse.linalg.inv(I) Inverse
>>> np.ogrid[0:2,0:2] Create an open meshgrid >>> C = np.mat(np.random.random((10,5))) Norm
Subtraction
>>> np.r_[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise) >>> D = np.mat([[3,4], [5,6]]) >>> sparse.linalg.norm(I) Norm
>>> np.subtract(A,D) Subtraction
>>> np.c_[b,c] Create stacked column-wise arrays Solving linear problems
Basic Matrix Routines Division
>>> sparse.linalg.spsolve(H,I) Solver for sparse matrices
Shape Manipulation Inverse
>>> np.divide(A,D) Division

>>> np.transpose(b) Permute array dimensions >>> A.I Inverse Multiplication


>>> b.flatten() Flatten the array >>> linalg.inv(A) Inverse >>> A @ D Multiplication operator (Python 3) Sparse Matrix Functions
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise) >>> np.multiply(D,A) Multiplication >>> sparse.linalg.expm(I) Sparse matrix exponential
Transposition
>>> np.vstack((a,b)) Stack arrays vertically (row-wise) >>> np.dot(A,D) Dot product
>>> A.T Transpose matrix
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index >>> np.vdot(A,D) Vector dot product
>>> A.H Conjugate transposition
>>> np.vsplit(d,2) Split the array vertically at the 2nd index >>> np.inner(A,D) Inner product Decompositions
Trace >>> np.outer(A,D) Outer product
Eigenvalues and Eigenvectors
Polynomials >>> np.trace(A) Trace >>> np.tensordot(A,D) Tensor dot product
>>> la, v = linalg.eig(A) Solve ordinary or generalized
Norm >>> np.kron(A,D) Kronecker product
>>> from numpy import poly1d >>> l1, l2 = la eigenvalue problem for
>>> p = poly1d([3,4,5]) Create a polynomial object >>> linalg.norm(A) Frobenius norm square matrix
Exponential Functions
>>> linalg.norm(A,1) L1 norm (max column sum) >>> v[:,0] First eigenvector
>>> linalg.expm(A) Matrix exponential
Vectorizing Functions >>> linalg.norm(A,np.inf) L inf norm (max row sum) >>> linalg.expm2(A) Matrix exponential (Taylor Series)
>>> v[:,1] Second eigenvector
>>> linalg.eigvals(A) Unpack eigenvalues
>>> def myfunc(a): Rank >>> linalg.expm3(D) Matrix exponential (eigenvalue
if a < 0: decomposition)
return a*2 >>> np.linalg.matrix_rank(C) Matrix rank Singular Value Decomposition
else: Logarithm Function >>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
return a/2 Determinant >>> linalg.logm(A) Matrix logarithm >>> M,N = B.shape
>>> np.vectorize(myfunc) Vectorize functions >>> linalg.det(A) Determinant >>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD
Solving linear problems Trigonometric Functions
Type Handling >>> linalg.sinm(D) Matrix sine
LU Decomposition
>>> linalg.solve(A,b) Solver for dense matrices >>> P,L,U = linalg.lu(C) LU Decomposition
>>> np.real(b) Return the real part of the array elements >>> linalg.cosm(D) Matrix cosine
>>> E = np.mat(a).T Create the right-hand side as a column matrix
>>> np.imag(b) Return the imaginary part of the array elements >>> linalg.tanm(A) Matrix tangent
>>> linalg.lstsq(F,E) Least-squares solution to linear matrix equation
np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0
Generalized inverse Hyperbolic Trigonometric Functions
Sparse Matrix Decompositions
>>> np.cast['f'](np.pi) Cast object to a data type
>>> linalg.pinv(C) Compute the pseudo-inverse of a matrix >>> linalg.sinhm(D) Hypberbolic matrix sine >>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
Other Useful Functions (least-squares solver) >>> linalg.coshm(D) Hyperbolic matrix cosine >>> sparse.linalg.svds(H, 2) SVD
>>> linalg.pinv2(C) Compute the pseudo-inverse of >>> linalg.tanhm(A) Hyperbolic matrix tangent
>>> np.angle(b,deg=True) Return the angle of the complex argument
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values
(number of samples)
Creating Matrices Matrix Sign Function
>>> g[3:] += np.pi
>>> F = np.eye(3, k=1) Create a 3x3 matrix with ones on the first superdiagonal >>> linalg.signm(A) Matrix sign function
>>> np.unwrap(g) Unwrap
>>> np.logspace(0,10,3) Create an array of evenly spaced values (log scale)
>>> G = np.mat(np.identity(2))
>>> C[C > 0.5] = 0
Create a 2x2 identity matrix
Matrix Square Root
Asking For Help
>>> np.select([c<4],[c*2]) Return values from a list of arrays
depending on conditions >>> H = sparse.csr_matrix(C) Compressed Sparse Row matrix >>> linalg.sqrtm(A) Matrix square root
>>> misc.factorial(a) Factorial >>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix >>> help(scipy.linalg.diagsvd)
>>> misc.comb(10,3,exact=True) Number of combinations of N things taken k at a time >>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix Arbitrary Functions >>> np.info(np.matrix)
>>> misc.central_diff_weights(3) Weights for Np-point central derivative >>> E.todense() Sparse matrix to full matrix >>> linalg.funm(A, lambda x: x*x) Evaluate matrix function
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point >>> sparse.isspmatrix_csc(A) Identify sparse matrix
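A brief sketch pulling together a few of the linalg calls listed above; the matrices are random placeholders:

import numpy as np
from scipy import linalg

A = np.random.random((3, 3))
b = np.random.random(3)

x = linalg.solve(A, b)           # solve Ax = b for a dense matrix
print(np.allclose(A @ x, b))     # check the solution

print(linalg.det(A))                               # determinant
eigvals, eigvecs = linalg.eig(A)                   # eigenvalue problem
U, s, Vh = linalg.svd(A)                           # singular value decomposition
print(linalg.norm(A, 1), linalg.norm(A, np.inf))   # L1 and L-inf norms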

https://www.datacamp.com/community/blog/python-scipy-cheat-sheet
Matplotlib Cheat Sheet
BecomingHuman.AI

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

Anatomy & Workflow Prepare The Data Also see Lists & NumPy Customize Plot
Plot Anatomy Index Tricks Colors, Color Bars & Color Maps Limits, Legends & Layouts
>>> import numpy as np >>> plt.plot(x, x, x, x**2, x, x**3)
Limits & Autoscaling
>>> x = np.linspace(0, 10, 100) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> y = np.cos(x) >>> ax.plot(x, y, c='k')
Axes/Subplot >>> ax.axis('equal') Set the aspect ratio
>>> z = np.sin(x) >>> fig.colorbar(im, orientation='horizontal') of the plot to 1
>>> im = ax.imshow(img, >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
2D Data or Images cmap='seismic') >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> data = 2 * np.random.random((10, 10)) Markers
>>> data2 = 3 * np.random.random((10, 10)) Legends
>>> fig, ax = plt.subplots() >>> ax.set(title='An Example Axes', Set a title and x-and
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] ylabel='Y-Axis', y-axis labels
>>> ax.scatter(x,y,marker=".")
>>> U = -1 - X**2 + Y xlabel='X-Axis')
>>> ax.plot(x,y,marker="o")
Figure >>> V = 1 + X - Y**2 >>> ax.legend(loc='best') No overlapping
Y-axis plot elements
>>> from matplotlib.cbook import get_sample_data Linestyles
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) Ticks
>>> plt.plot(x,y,linewidth=4.0) >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks
>>> plt.plot(x,y,ls='solid') ticklabels=[3,100,-12,"foo"])
direction='inout',
Make y-ticks longer and
Create Plot >>> plt.plot(x,y,ls='--')
>>> plt.plot(x,y,'--',x**2,y**2,'-.')
length=10)
go in and out

>>> import matplotlib.pyplot as plt >>> plt.setp(lines,color='r',linewidth=4.0)


Subplot Spacing
X-axis
Figure >>> fig3.subplots_adjust(wspace=0.5,
hspace=0.3,
>>> fig = plt.figure() left=0.125,
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0))
Text & Annotations right=0.9,
top=0.9,
>>> ax.text(1, bottom=0.1)
Axes -2.1, 'Example Graph',
style='italic')
>>> fig.tight_layout()

All plotting is done with respect to an Axes. In most >>> ax.annotate("Sine", xy=(8, 0),
Workflow cases, a subplot will fit your needs. A subplot is an xycoords='data',
xytext=(10.5, 0), Axis Spines
axes on a grid system. textcoords='data', >>> ax1.spines['top'=].set_visible(False) Make the top axis
arrowprops=dict(arrowstyle="->", line for a plot invisible
>>> fig.add_axes() connectionstyle="arc3"),)
>>> ax1 = fig.add_subplot(221) # row-col-num >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom
axis line outward
01 Prepare data
04 Customize plot >>> ax3 = fig.add_subplot(212)
>>> fig3, axes = plt.subplots(nrows=2,ncols=2)
Mathtext
>>> plt.title(r'$\sigma_i=15$', fontsize=20)
>>> fig4, axes2 = plt.subplots(ncols=3)

02 Create plot 05 Save plot


Plotting Routines Save Plot
03 Plot 06 Show plot 1D Data Vector Fields Save figures
>>> plt.savefig('foo.png')
>>> lines = ax.plot(x,y) Draw points with lines or markers connecting them >>> axes[0,1].arrow(0,0,0.5,0.5) Add an arrow to the axes

>>> import matplotlib.pyplot as plt


>>> ax.scatter(x,y) Draw unconnected points, scaled or colored >>> axes[1,1].quiver(y,z) Plot a 2D field of arrows Save transparent figures
>>> axes[0,0].bar([1,2,3],[3,4,5]) Plot vertical rectangles (constant width) >>> axes[0,1].streamplot(X,Y,U,V) Plot 2D vector fields >>> plt.savefig('foo.png', transparent=True)
step 1 >>> x = [1,2,3,4]
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) Plot horizontal rectangles (constant height)
>>> y = [10,20,25,30]
>>> axes[1,1].axhline(0.45) Draw a horizontal line across axes Data Distributions
step 2
step 3
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> axes[0,1].axvline(0.65) Draw a vertical line across axes >>> ax1.hist(y) Plot a histogram Show Plot
>>> ax.fill(x,y,color='blue') Draw filled polygons >>> ax3.boxplot(y) Make a box and whisker plot
step 3,4 >>> ax.plot(x, y, color='lightblue', linewidth=3) >>> plt.show()
>>> ax.fill_between(x,y,color='yellow') Fill between y-values and 0 >>> ax3.violinplot(z) Make a violin plot
>>> ax.scatter([2,4,6],
[5,15,25], 2D Data
color='darkgreen',
marker='^') >>> fig, ax = plt.subplots() Colormapped or RGB >>> axes2[0].pcolor(data2) Pseudocolor plot of 2D array
>>> ax.set_xlim(1, 6.5) >>> im = ax.imshow(img,
arrays cmap='gist_earth',
>>> axes2[0].pcolormesh(data) Pseudocolor plot of 2D array Close & Clear
>>> plt.savefig('foo.png') interpolation='nearest', >>> CS = plt.contour(Y,X,U) Plot contours
vmin=-2, >>> plt.cla()
step 5 >>> plt.show() >>> axes2[2].contourf(data1) Plot filled contours
vmax=2) >>> plt.clf()
>>> ax.clabel(CS) Label a contour plot
>>> plt.close()
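The six-step workflow above condenses into a short script; this is a minimal sketch in which the data and file name are placeholders:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)          # Step 1: prepare data
y = np.cos(x)

fig, ax = plt.subplots()             # Step 2: create plot (figure + axes)
ax.plot(x, y, color='lightblue', linewidth=3, label='cos(x)')        # Step 3: plot
ax.set(title='An Example Axes', xlabel='X-Axis', ylabel='Y-Axis')    # Step 4: customize
ax.legend(loc='best')

plt.savefig('foo.png')               # Step 5: save plot
plt.show()                           # Step 6: show plot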

https://www.datacamp.com/community/blog/python-matplotlib-cheat-sheet
Data Visualisation Geoms Use a geom to represent data points, use the geom’s aesthetic properties to represent variables. Each function returns a layer Stats An alternative way to build a layer Scales
with ggplot2 One Variable Two Variables Some plots visualize a transformation of the original data set.
Use a stat to choose a common transformation to visualize,
Scales control how a plot maps data values to the visual values of
an aesthetic. To change the mapping, add a custom scale.
Continuous Continuous X, Continuous Y Continuous Bivariate Distribution e.g. a + geom_bar(stat = "bin")

Cheat Sheet a <- ggplot(mpg, aes(hwy))


a + geom_area(stat = "bin")
x, y, alpha, color, fill, linetype, size
f <- ggplot(mpg, aes(cty, hwy))
f + geom_blank()
i <- ggplot(movies, aes(year, rating))
i + geom_bin2d(binwidth = c(5, 0.5))
xmax, xmin, ymax, ymin, alpha, color, fill,
fl cty ctl F M A

+
4

2
=
4

2
n <- b + geom_bar(aes(fill = fl))
n
aesthetic prepackaged
b + geom_area(aes(y = ..density..), stat = "bin") linetype, size, weight 1 1 scale_ to adjust scale to use
i + geom_density2d() stat geom
a + geom_density(kernel = "gaussian") f + geom_jitter() x=x 0 1 2 3 4 0 1 2 3 4
n + scale_fill_manual(
Basics x, y, alpha, color, fill, shape, size x, y, alpha, colour, linetype, size data y = ...count...
x, y, alpha, color, fill, linetype, size, weight coordinate plot
b + geom_density(aes(y = ..count..)) system values = c("skyblue", "royalblue", "blue", "navy"),
f + geom_point() i + geom_hex() Each stat creates additional variables to map aesthetics to. These limits = c("d", "e", "p", "r"), breaks =c("d", "e", "p", "r"),
a + geom_dotplot() x, y, alpha, color, fill, shape, size x, y, alpha, colour, fill size
variables use a common ..name.. syntax. stat functions and geom range of
ggplot2 is based on the grammar of graphics, the idea that you x, y, alpha, color, fill name = "fuel", labels = c("D", "E", "P", "R"))
functions both combine a stat with a geom to make a layer, i.e. values to breaks to
can build every graph from the same few components: a data set, f + geom_quantile() Continuous Function include in title to use in labels to use in use in
a + geom_freqpoly() x, y, alpha, color, linetype, size, weight stat_bin(geom="bar") does the same as geom_bar(stat="bin")
a set of geoms—visual marks that represent data points, and a x, y, alpha, color, linetype, size j <- ggplot(economics, aes(date, unemploy)) mapping legend/axis legend/axis legend/axis
stat layer specific variable created
coordinate system. b + geom_freqpoly(aes(y = ..density..))
f + geom_rug(sides = "bl") function mappings by transformation
x, y, alpha, color, linetype, size, weight j + geom_area()
F M A a + geom_histogram(binwidth = 5) x, y, alpha, color, fill, linetype, size General Purpose scales
4 4
x, y, alpha, color, fill, linetype, size, weight i + stat_density2d(aes(fill = ..level..),
Use with any aesthetic:
+ = b + geom_histogram(aes(y = ..density..)) f + geom_smooth(model = lm) geom = "polygon", n = 100)
3 3

2 2 x, y, alpha, color, fill, linetype, size, weight


j + geom_line() alpha, color, fill, linetype, shape, size
x, y, alpha, color, linetype, size
1 1 Discrete geom for layer parameters for stat scale_*_continuous() - map cont’ values to visual values
0 1 2 3 4 0 1 2 3 4
b <- ggplot(mpg, aes(fl)) f + geom_text(aes(label = cty))
alpha, color, fill, linetype, size, weight j + geom_step(direction = "hv") scale_*_discrete() - map discrete values to visual values
data geom coordinate plot AB x, y, alpha, color, linetype, size
a + stat_bin(binwidth = 1, origin = 10) scale_*_identity() - use data values as visual values
x=F
y=A
system b + geom_bar()
x, alpha, color, fill, linetype, size, weight x, y | ..count.., ..ncount.., ..density.., ..ndensity..
scale_*_manual(values = c()) - map discrete values to
To display data values, map variables in the data set to aesthetic Discrete X, Continuous Y Visualizing error a + stat_bindot(binwidth = 1, binaxis = "x") manually chosen visual values
x, y, | ..count.., ..ncount..
properties of the geom like size, color, and x and y locations df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
Graphical Primitives g <- ggplot(mpg, aes(class, hwy)) a + stat_density(adjust = 1, kernel = "gaussian") X and Y location scales
F M A k <- ggplot(df, aes(grp, fit, ymin = fit-se, ymax = fit+se)) x, y, | ..count.., ..density.., ..scaled..
4 4
g + geom_bar(stat = "identity") Use with x or y aesthetics (x shown here)
k + geom_crossbar(fatten = 2)
+ = c <- ggplot(map, aes(long, lat)) x, y, alpha, color, fill, linetype, size, weight
3 3

2 2 x, y, ymax, ymin, alpha, color, fill, linetype, size f + stat_bin2d(bins = 30, drop = TRUE) scale_x_date(labels = date_format("%m/%d"),
1 1
c + geom_polygon(aes(group = group)) x, y, fill | ..count.., ..density..
breaks = date_breaks("2 weeks"))
x, y, alpha, color, fill, linetype, size g + geom_boxplot() k + geom_errorbar() f + stat_binhex(bins = 30) - treat x values as dates. See ?strptime for label formats.
lower, middle, upper, x, ymax, ymin, alpha, x, y, fill | ..count.., ..density..
0 1 2 3 4 0 1 2 3 4
color, fill, linetype, shape, size, weight x, ymax, ymin, alpha, color, linetype, size, scale_x_datetime() - treat x values as date times. Use
data geom coordinate plot d <- ggplot(economics, aes(date, unemploy)) width (also geom_errorbarh()) same arguments as scale_x_date().
x=F system g + geom_dotplot(binaxis = "y", f + stat_density2d(contour = TRUE, n = 100)
y=A d + geom_path(lineend="butt", k + geom_linerange() x, y, color, size | ..level.. scale_x_log10() - Plot x on log10 scale
color = F
linejoin="round’, linemitre=1) stackdir = "center") x, ymin, ymax, alpha, color, linetype, size
size = A
x, y, alpha, color, fill m + stat_contour(aes(z = z)) scale_x_reverse() - Reverse direction of x axis
x, y, alpha, color, linetype, size x, y, z, order | ..level..
Build a graph with qplot() or ggplot() g + geom_violin(scale = "area") k + geom_pointrange() scale_x_sqrt() - Plot x on square root scale
d + geom_ribbon(aes(ymin=unemploy - 900, x, y, alpha, color, fill, linetype, size, weight x, y, ymin, ymax, alpha, color, fill, linetype, shape, size m+ stat_spoke(aes(radius= z, angle = z))
aesthetic mappings data geom ymax=unemploy + 900)) angle, radius, x, xend, y, yend | ..x.., ..xend.., ..y.., ..yend.. Color and fill scales
x, ymax, ymin, alpha, color, fill, linetype, size
qplot(x = cty, y = hwy, color = cyl, data = mpg, geom = "point") m + stat_summary_hex(aes(z = z), bins = 30, fun = mean) Use with x or y aesthetics (x shown here)
d <- ggplot(economics, aes(date, unemploy))
Maps x, y, z, fill | ..value..
Creates a complete plot with given data, geom, and
mappings. Supplies many useful defaults. e + geom_segment(aes( Discrete X, Discrete Y data <- data.frame(murder = USArrests$Murder, g + stat_boxplot(coef = 1.5) n <- b + geom_bar( o <- a + geom_dotplot(
state = tolower(rownames(USArrests))) x, y | ..lower.., ..middle.., ..upper.., ..outliers..
xend = long + delta_long, h <- ggplot(diamonds, aes(cut, color)) aes(fill = fl)) aes(fill = ..x..))
ggplot(data = mpg, aes(x = cty, y = hwy)) yend = lat + delta_lat)) map <- map_data("state") g + stat_ydensity(adjust = 1, kernel = "gaussian", scale = "area")
Begins a plot that you finish by adding layers to. No x, xend, y, yend, alpha, color, linetype, size h + geom_jitter() l <- ggplot(data, aes(fill = murder)) x, y | ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width.. o + scale_fill_gradient(
x, y, alpha, color, fill, linetype, size, n + scale_fill_brewer( low = "red",
defaults, but provides more control than qplot(). e + geom_rect(aes(xmin = long, ymin = lat, l + geom_map(aes(map_id = state), map = map) + palette = "Blues")
expand_limits(x = map$long, y = map$lat) f + stat_ecdf(n = 40) high = "yellow")
xmax= long + delta_long, x, y, alpha, color, fill, linetype, size, x, y | ..x.., ..y.. For palette choices:
data ymax = lat + delta_lat)) library(RcolorBrewer) o + scale_fill_gradient2(
add layers, elements with + xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size f + stat_quantile(quantiles = c(0.25, 0.5, 0.75), formula = y ~ log(x), display.brewer.all()
ggplot(mpg, aes(hwy, cty)) + method = "rq") low = "red", high = "blue",
x, y | ..quantile.., ..x.., ..y.. n + scale_fill_grey( mid = "white", midpoint = 25)
geom_point(aes(color = cyl)) + layer = geom + default stat +
geom_smooth(method ="lm") + layer specific mappings Position Adjustments f + stat_smooth(method = "auto", formula = y ~ x, se = TRUE, n = 80,
fullrange = FALSE, level = 0.95)
start = 0.2, end = 0.8,
na.value = "red")
o + scale_fill_gradientn(
colours = terrain.colors(6))
coord_cartesian() + x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..
scale_color_gradient() +
Three Variables f + stat_ecdf(n = 40)
Also: rainbow(), heat.colors(),
additional elements Position adjustments determine how to arrange geoms that would otherwise occupy the topo.colors(), cm.colors(),
x, y | ..x.., ..y.. RColorBrewer::brewer.pal()
theme_bw() seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2)) same space
m <- ggplot(seals, aes(long, lat)) f + stat_quantile(quantiles = c(0.25, 0.5, 0.75), formula = y ~ log(x), Shape scales
s <- ggplot(mpg, aes(fl, fill = drv)) method = "rq")
x, y | ..quantile.., ..x.., ..y.. p <- f + geom_point(
Add a new layer to a plot with a geom_*() or stat_*() function. Each m + geom_contour(aes(z = z)) s + geom_bar(position = "dodge")
provides a geom, a set of aesthetic mappings, and a default stat and x, y, z, alpha, colour, linetype, size, weight Arrange elements side by side f + stat_smooth(method = "auto", formula = y ~ x, se = TRUE, n = 80, aes(shape = fl)) 0 6 12 18 24
fullrange = FALSE, level = 0.95)
position adjustment. x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax.. 1 7 13 19 25
m + geom_raster(aes(fill = z), hjust=0.5, s + geom_bar(position = "fill") p + scale_shape(
last_plot() vjust=0.5, interpolate=FALSE) Stack elements on top of one another, normalize height solid = FALSE) 2 8 14 20
ggplot() + stat_function(aes(x = -3:3), *
Returns the last plot x, y, alpha, fill fun = dnorm, n = 101, args = list(sd=0.5)) .
3 9 15 21
x | ..y..
ggsave("plot.png", width = 5, height = 5) m + geom_tile(aes(fill = z)) s + geom_bar(position = "stack") p + scale_shape_manual( 4 10 16 22 0
Saves last plot as 5’ x 5’ file named "plot.png" in x, y, alpha, color, fill, linetype, size Stack elements on top of one another f + stat_identity() values = c(3:7))
Shape values shown in 5 11 17 23 O
working directory. Matches file type to file extension. ggplot() + stat_qq(aes(sample=1:100), distribution = qt, chart on right
f + geom_point(position = "jitter") dparams = list(df=5))
Add random noise to X and Y position of each element to avoid overplotting sample, x, y | ..x.., ..y.. Size scales
f + stat_sum()
Coordinate Systems Faceting Each position adjustment can be recast as a function with manual width and height
arguments
x, y, size | ..size..
f + stat_summary(fun.data = "mean_cl_boot")
q <- f + geom_point(
aes(size = cyl))
q + scale_size_area(max = 6)
Value mapped to area of circle
s + geom_bar(position = position_dodge(width = 1)) (not radius)
Facets divide a plot into subplots based on the values f + stat_unique()
r <- b + geom_bar()
of one or more discrete variables.
r + coord_cartesian(xlim = c(0, 5)) t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
xlim, ylim
The default cartesian coordinate system
t + facet_grid(. ~ fl)
facet into columns based on fl
Labels Themes Zooming
r + coord_fixed(ratio = 1/2) t + facet_grid(year ~ .) t + ggtitle("New Plot Title ")
Add a main title above the plot
ratio, xlim, ylim
Cartesian coordinates with fixed aspect
facet into rows based on year Without clipping (preferred)
ratio between x and y units t + xlab("New X label") Use scale functions
t + facet_grid(year ~ fl) Change the label on the X axis to update legend
r + coord_flip() facet into both rows and columns
xlim, ylim t + ylab("New Y label") labels t + coord_cartesian(
Flipped Cartesian coordinates t + facet_wrap(~ fl) Change the label on the Y axis xlim = c(0, 100), ylim = c(10, 20))
wrap facets into a rectangular layoutof one r + theme_bw() r + theme_classic()
or more discrete variables. t + labs(title =" New title", x = "New x", y = "New y") White background White background
r + coord_polar(theta = "x", direction=1 ) Set scales to let axis limits vary across facets
All of the above with grid lines no gridlines
theta, start, direction
Polar coordinates t + facet_grid(y ~ x, scales = "free")
x and y axis limits adjust to individual facets
r + coord_trans(ytrans = "sqrt")
xtrans, ytrans, limx, limy
Transformed cartesian coordinates. Set
• "free_x" - x axis limits adjust
• "free_y" - y axis limits adjust
Legends With clipping (removes unseen data points)
xtrans and ytrans to the name
of a window function. Set labeller to adjust facet labels t + theme(legend.position = "bottom") t + xlim(0, 100) + ylim(10, 20)
t + facet_grid(. ~ fl, labeller = label_both) Place legend at "bottom", "top", "left", or "right"
z + coord_map(projection = "ortho", fl: c fl: d fl: e fl: p fl: r r + theme_grey() r + theme_minimal() t + scale_x_continuous(limits = c(0, 100)) +
orientation=c(41, -74, 0)) t + guides(color = "none") Grey background Minimal theme
t + facet_grid(. ~ fl, labeller = label_both) Set legend type for each aesthetic: colorbar, legend, or none (no legend) (default theme) scale_y_continuous(limits = c(0, 100))
projection, orientation, xlim, ylim
Map projections from the mapproj package t + scale_fill_discrete(name = "Title", labels = c("A", "B", "C"))
(mercator (default), azequalarea, lagrange, etc.) t + facet_grid(. ~ fl, labeller = label_both) Set legend title and labels with a scale function.
c d e p r ggthemes - Package with additional ggplot2 themes
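For readers following the Python sheets, here is a rough matplotlib analogue of the qplot scatter near the top of this sheet (points colored by a third variable). It is a hedged sketch with made-up data, not part of the original ggplot2 sheet:

import matplotlib.pyplot as plt

cty = [18, 21, 20, 16, 26]         # stand-ins for the mpg columns used above
hwy = [29, 31, 28, 25, 36]
cyl = [4, 4, 6, 8, 4]

fig, ax = plt.subplots()
sc = ax.scatter(cty, hwy, c=cyl)   # color maps to cyl, like aes(color = cyl)
fig.colorbar(sc, label='cyl')
ax.set(xlabel='cty', ylabel='hwy')
plt.show()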

Creative Commons: Data Visualisation with ggplot2 by RStudio is licensed under CC BY-SA 4.0
Big-O Cheat Sheet
BecomingHuman.AI

Big-O Complexity Chart (operations vs. elements), from worst to best:
Horrible: O(n!), O(2^n), O(n^2)   Bad: O(n log n)   Fair: O(n)   Good to Excellent: O(log n), O(1)
Data Structure Operations

                     Time Complexity (Average)                    Time Complexity (Worst)                     Space (Worst)
                     Access    Search    Insertion  Deletion      Access    Search    Insertion  Deletion
Array                Θ(1)      Θ(n)      Θ(n)       Θ(n)          Θ(1)      Θ(n)      Θ(n)       Θ(n)         Θ(n)
Stack                Θ(n)      Θ(n)      Θ(1)       Θ(1)          Θ(n)      Θ(n)      Θ(1)       Θ(1)         Θ(n)
Queue                Θ(n)      Θ(n)      Θ(1)       Θ(1)          Θ(n)      Θ(n)      Θ(1)       Θ(1)         Θ(n)
Singly-Linked List   Θ(n)      Θ(n)      Θ(1)       Θ(1)          Θ(n)      Θ(n)      Θ(1)       Θ(1)         Θ(n)
Doubly-Linked List   Θ(n)      Θ(n)      Θ(1)       Θ(1)          Θ(n)      Θ(n)      Θ(1)       Θ(1)         Θ(n)
Skip List            Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))     Θ(n)      Θ(n)      Θ(n)       Θ(n)         O(n log(n))
Hash Table           N/A       Θ(1)      Θ(1)       Θ(1)          N/A       Θ(n)      Θ(n)       Θ(n)         Θ(n)
Binary Search Tree   Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))     Θ(n)      Θ(n)      Θ(n)       Θ(n)         Θ(n)
Cartesian Tree       Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))     N/A       Θ(n)      Θ(n)       Θ(n)         Θ(n)
B-Tree               N/A       Θ(log(n)) Θ(log(n))  Θ(log(n))     Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))    Θ(n)
Red-Black Tree       Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))     Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))    Θ(n)
Splay Tree           N/A       Θ(log(n)) Θ(log(n))  Θ(log(n))     N/A       Θ(log(n)) Θ(log(n))  Θ(log(n))    Θ(n)
AVL Tree             Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))     Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))    Θ(n)
KD Tree              Θ(log(n)) Θ(log(n)) Θ(log(n))  Θ(log(n))     Θ(n)      Θ(n)      Θ(n)       Θ(n)         Θ(n)

Array Sorting Algorithms

                 Time Complexity (Best)   (Average)        (Worst)           Space Complexity (Worst)
Quicksort        Ω(n log(n))              Θ(n log(n))      O(n^2)            O(n log(n))
Mergesort        Ω(n log(n))              Θ(n log(n))      O(n log(n))       O(n log(n))
Timsort          Ω(n)                     Θ(n log(n))      O(n log(n))       Θ(n)
Heapsort         Ω(n log(n))              Θ(n log(n))      O(n log(n))       O(n log(n))
Bubble Sort      Ω(n)                     Θ(n^2)           O(n^2)            Θ(n)
Insertion Sort   Ω(n)                     Θ(n^2)           O(n^2)            Θ(n)
Selection Sort   Ω(n^2)                   Θ(n^2)           O(n^2)            Ω(n^2)
Tree Sort        Ω(n log(n))              Θ(n log(n))      O(n^2)            O(n log(n))
Shell Sort       Ω(n log(n))              Θ(n(log(n))^2)   O(n(log(n))^2)    O(n log(n))
Bucket Sort      Ω(n+k)                   Θ(n+k)           O(n^2)            Ω(n+k)
Radix Sort       Ω(n+k)                   Θ(n+k)           Ω(n+k)            Ω(n+k)
Counting Sort    Ω(n+k)                   Θ(n+k)           Ω(n+k)            Ω(n+k)
Cubesort         Ω(n)                     Θ(n log(n))      O(n log(n))       O(n log(n))
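To make the difference between O(n) and O(log n) lookups concrete, here is a small illustrative Python sketch; it is not from the original sheet:

from bisect import bisect_left

def linear_search(items, target):          # O(n): scan every element
    for i, x in enumerate(items):
        if x == target:
            return i
    return -1

def binary_search(sorted_items, target):   # O(log n): halve the search range each step
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = list(range(1_000_000))
print(linear_search(data, 987_654))   # visits roughly a million elements
print(binary_search(data, 987_654))   # visits about 20 elements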

Originally created by bigocheatsheet.com http://bigocheatsheet.com/
